My personal and professional life

2018-11-28

Execution plan change in MySQL 8.0

About 6 weeks ago at work I had to write a new report query, because an application page was loading slowly (about 25 seconds). It turned out there is a problem in the application, because it was first executing a relatively slow query (taking about 6 seconds) and then a relatively fast query for each returned row from the first query. That second query was taking between 0.3 and 0.5 seconds per execution, but it was executed about 60 times, so the delay accumulated and explained the slow page load.
I decided to provide a single query that produces the whole result set, so I had to join 6 tables. One table (msg_progs) had just 13 rows, another two had just 64 rows (bldenv and prjprocstat) and another one just about 100 rows (p) and these four had no tendency of growing much more. There was one table with several tens of thousands of rows (m) and another with several hundreds thousands of rows (mtp), which for sure were going to grow on a daily basis. To improve the selectivity in the second largest table (m) I created a multi-column index on three columns involved in the query, but the problem was that sill much of its data was selected. Anyway, I ended up with a query that was executing for 6-7 seconds, which was much more acceptable from the initial 25 seconds. I'm not able to provide all the details, but I have captured the execution plan and I consider it safe to publish.
Query explain plan in MySQL 5.7.20
Remark: MySQL Workbench nicely visualizes the execution plans graphically (notice the lines connecting the different operations - the higher the number of rows the thicker the line). Textual execution plans are thing of the past, right?
That was until 3 weeks ago when we upgraded to MySQL 8.0.13 (from 5.7.20). The same query and in the same database started running for 24 seconds. When I checked the execution plan it has changed (see below).
Original explain plan in MySQL 8.0
I think you could easily notice that the join order has changed. Previously the second largest table was considered in the first nested loop, then the largest table and finally the smaller tables. Now it's the opposite as the optimizer first considers the smaller tables and only then the largest ones. I could have used a hint to fix the join order (e.g. JOIN_ORDER), but I noticed that if I re-create the multi-column index on just two of the columns (that are more selective in fact) the optimizer chooses a different plan, but the query execution time is still about 14 seconds. I finally decided to remove one of the tables in the joins (the smallest msg_progs with just 13 rows), which made the optimizer choose a different execution plan (here below) and the query now runs for about 4 seconds.
Finaly explain plan in MySQL 8.0
I'm not sure where this difference come from apart from the server, but I cannot blame without a proof (e.g. a reference in the manual that explains the difference), which unfortunately I haven't had the time to search. If you know the reason for the change of execution plans in MySQL 8.0 please, let me know.
The situation reminded me of Oracle database where after each major release we had queries with modified execution plans and had to use query hints or other tricks to make them fast again. It appears to be no different with MySQL nowadays.
Update 2018-12-17: I made a premature conclusion above, because the change of the plan happened (or was at least noticed) after we upgraded the server. However, as it turned out the real reason was the number of rows into tables m and mtp for which the server was choosing a different execution plan. We found it out experimenting with different in volume data sets.

2018-11-27

Game of authorities

In the management theory and practice there are cases where the use of authority is necessary (e.g. for enforcing deadlines, for resolving conflicts, etc.). However, while I understand the need, I really do not understand the managers that have to use authority when left with no other argument. I had situations in the past when authority was necessary, but I was always reluctant and I have never used it without a strong argument (which has been kindly acknowledged). What I mean is that I cannot simply enforce something onto someone if I do not have a good reason for this and if I have not justified it in front of the involved parties.
The point is that sometimes the arguments of the parties in a discussion may seem equally important and valid, so arbitration and use of authority may be inevitable. One way to solve this is to base on previous facts like previous written discussions, agreements, procedures and practices about the same or similar matters. What if the other party simply doesn't accept the additional arguments? Well, I really do not know, but I think it speaks enough of the other party. In enterprise communities without good documentation management and constant change of authorities knowledge is easily lost over time. New rules and procedures are created every day, because nobody is aware or willing to accept what was previously defined. That is how legends are created.

2018-11-15

MySQL Community Data Masking and De-Identification

Oracle recently introduced a Data Masking functionality in MySQL, but it's available only in the Enterprise edition (see MySQL Enterprise Data Masking and De-Identification in the manual). I'm not a customer, but I'm acquainted with the concept as well as stored routines, so I decided to try emulating what MySQL has implemented for their Enterprise edition of the database.

I started by reading the manual, then reviewed formats of payment card and social security numbers to finally came up with a simple implementation this evening. I have implemented only the general and special purpose masking functions and the functions for generation of random data (e.g. e-mail, primary account number, social security number (SSN) and US phone number). I have an additional function gen_rnd_string for generation of a random string with given length (only alphabetic characters by default). I've made some test and the results look promising, but I have to experiment more and perhaps generate better data that could eventually pass verifications, because an application using randomly generated data could easily fail if it cannot validate the data. It remains me to implement the function for generation of random data using dictionaries for which I think I could use LOAD DATA INFILE and CREATE TEMPORARY TABLE for loading dictionary.

It's just my simple idea of how one could implement data masking and de-identification even in MySQL Community edition and as stored routines, because MySQL's implementation if based on user-defined functions (UDF). I wonder why they chose this approach and here is my take on the possible answers:
  • Visibility: A user-defined function has global visibility, so it could be called without proceeding schema name just like any other built-in function. My implementation has to be loaded into a schema and thus each function would have to be called with schema name in front.
  • Default arguments: MySQL still doesn't support default arguments for stored routines, but this is possible for user-defined functions as these could be written only "in C or C++ (or another language that can use C calling conventions)". The general purpose masking functions mask_inner and mask_outer accept an optional argument for the masking character. In a stored procedures implementation this could be only a user-defined variable. I'll implement this later.
  • Hiding implementation details: User-defined functions confine MySQL's implementation into the shared library binary. The source code of the stored routines could be easily reviewed by a user with enough permissions (e.g. an administrator). To my knowledge MySQL doesn't offer obfuscation utility like Oracle (see Obfuscating PL/SQL Source Code). A side point to this is that the manual also doesn't provide enough details on the possible values by the data generation functions.
  • Security: This could be related to how the server is handling the memory for UDF and stored routines calls, but that's too much internals.
  • Performance: User-defined functions by presumption should have better performance compared to stored routines, because the former are function calls in natively compiled shared library and the later are interpreted by the server on each invocation.
I'm eager to discuss this tomorrow on BGOUG Autumn Conference 2018 where there would be a presentation exactly on this topic. I'll update this post or write another one afterwards.

Update 2018-11-16 19:01 EET: The BGOUG conference confirmed my presumptions about the reasons to implement data masking as UDF functions. One of the reasons cited was exactly "performance".

Update 2018-11-18 19:23 EET: After playing a bit more and trying to implement the dictionary functions today, I hit two restrictions I haven't though about before. The first restriction is the impossibility to use prepared statements into a function (i.e. from chapter Prepared SQL Statement Syntax in the manual "SQL syntax for prepared statements can be used within stored procedures, but not in stored functions or triggers"). The second restriction is the impossibility to use LOAD DATA INFILE as prepared statement (see section SQL Syntax Allowed in Prepared Statements in the manual). The first restrictions means that dictionary functions could be implemented only as procedures. The second restrictions means that it's impossible to load dictionary from file, so implementation of function gen_dictionary_load cannot be completed - it could create the temporary table for the dictionary, but it cannot load it. I still created the procedures and was able go generate random US cities. I also benchmarked the performance with gen_rnd_pan function as it does both random number generation and Luhn number calculation. On my MySQL 5.7 server running on Intel Pentium G3420 @ 3.2 GHz with 8 GB DDR3 RAM I ran SELECT BENCHMARK(count, gen_rnd_pan()) and the results for different execution counts were the following:
  • 1000 times - 0.28 seconds;
  • 10000 times - 3.13 seconds;
  • 100000 times - 31.41 seconds;
  • 1000000 times - 278.19 seconds.
Unless MySQL removes the restrictions for me the topic is concluded. I could try to implement the data masking functions as UDF like MySQL if I have the time and if I have the real need.