My personal and professional life

2018-11-15

MySQL Community Data Masking and De-Identification

Oracle recently introduced a Data Masking functionality in MySQL, but it's available only in the Enterprise edition (see MySQL Enterprise Data Masking and De-Identification in the manual). I'm not a customer, but I'm acquainted with the concept as well as stored routines, so I decided to try emulating what MySQL has implemented for their Enterprise edition of the database.

I started by reading the manual, then reviewed formats of payment card and social security numbers to finally came up with a simple implementation this evening. I have implemented only the general and special purpose masking functions and the functions for generation of random data (e.g. e-mail, primary account number, social security number (SSN) and US phone number). I have an additional function gen_rnd_string for generation of a random string with given length (only alphabetic characters by default). I've made some test and the results look promising, but I have to experiment more and perhaps generate better data that could eventually pass verifications, because an application using randomly generated data could easily fail if it cannot validate the data. It remains me to implement the function for generation of random data using dictionaries for which I think I could use LOAD DATA INFILE and CREATE TEMPORARY TABLE for loading dictionary.

It's just my simple idea of how one could implement data masking and de-identification even in MySQL Community edition and as stored routines, because MySQL's implementation if based on user-defined functions (UDF). I wonder why they chose this approach and here is my take on the possible answers:
  • Visibility: A user-defined function has global visibility, so it could be called without proceeding schema name just like any other built-in function. My implementation has to be loaded into a schema and thus each function would have to be called with schema name in front.
  • Default arguments: MySQL still doesn't support default arguments for stored routines, but this is possible for user-defined functions as these could be written only "in C or C++ (or another language that can use C calling conventions)". The general purpose masking functions mask_inner and mask_outer accept an optional argument for the masking character. In a stored procedures implementation this could be only a user-defined variable. I'll implement this later.
  • Hiding implementation details: User-defined functions confine MySQL's implementation into the shared library binary. The source code of the stored routines could be easily reviewed by a user with enough permissions (e.g. an administrator). To my knowledge MySQL doesn't offer obfuscation utility like Oracle (see Obfuscating PL/SQL Source Code). A side point to this is that the manual also doesn't provide enough details on the possible values by the data generation functions.
  • Security: This could be related to how the server is handling the memory for UDF and stored routines calls, but that's too much internals.
  • Performance: User-defined functions by presumption should have better performance compared to stored routines, because the former are function calls in natively compiled shared library and the later are interpreted by the server on each invocation.
I'm eager to discuss this tomorrow on BGOUG Autumn Conference 2018 where there would be a presentation exactly on this topic. I'll update this post or write another one afterwards.

Update 2018-11-16 19:01 EET: The BGOUG conference confirmed my presumptions about the reasons to implement data masking as UDF functions. One of the reasons cited was exactly "performance".

Update 2018-11-18 19:23 EET: After playing a bit more and trying to implement the dictionary functions today, I hit two restrictions I haven't though about before. The first restriction is the impossibility to use prepared statements into a function (i.e. from chapter Prepared SQL Statement Syntax in the manual "SQL syntax for prepared statements can be used within stored procedures, but not in stored functions or triggers"). The second restriction is the impossibility to use LOAD DATA INFILE as prepared statement (see section SQL Syntax Allowed in Prepared Statements in the manual). The first restrictions means that dictionary functions could be implemented only as procedures. The second restrictions means that it's impossible to load dictionary from file, so implementation of function gen_dictionary_load cannot be completed - it could create the temporary table for the dictionary, but it cannot load it. I still created the procedures and was able go generate random US cities. I also benchmarked the performance with gen_rnd_pan function as it does both random number generation and Luhn number calculation. On my MySQL 5.7 server running on Intel Pentium G3420 @ 3.2 GHz with 8 GB DDR3 RAM I ran SELECT BENCHMARK(count, gen_rnd_pan()) and the results for different execution counts were the following:
  • 1000 times - 0.28 seconds;
  • 10000 times - 3.13 seconds;
  • 100000 times - 31.41 seconds;
  • 1000000 times - 278.19 seconds.
Unless MySQL removes the restrictions for me the topic is concluded. I could try to implement the data masking functions as UDF like MySQL if I have the time and if I have the real need.

Post a Comment