My personal and professional life

2019-10-18

New features in MySQL 8.0.18

On Monday, Oracle released MySQL 8.0.18 (see also release notes) and unlike with previous releases there was lot of information about what would be included into this one. I personally attended several presentations and read some posts in social media. Like I already wrote the major new features in this release are hash joins and EXPLAIN ANALYZE, which aim to respectively improve the performance of some queries and give better insight about how optimizer plan compares to actual execution.

Hash join optimization

Before 8.0.18 the only type of join done by MySQL was nested loop (a loop within a loop, an inner loop within an outer one). And there were two algorithms implementing it:
  • Nested-Loop Join - it's the simplest where each row from the outer loop is passed to the inner loop for processing. The obvious drawback of this one is that the inner table needs to be read many times.
  • Block Nested Loop (BNL) - uses buffering of rows from the outer table to reduce the number of times the inner table is read. In this case instead of passing just one row to the inner loop multiple rows from the buffer could be passed at once. This greatly reduces the number of time the inner table needs to be read. Of course, this requires more memory (see join_buffer_size). This algorithm is used for range, index and ALL join types.
There is also Batched Key Access (BKA) algorithm, which uses buffering like BNL, but puts keys for the rows in the buffer and passes them in a batch to the database engine for index lookups. The keys are then used to fetch the rows from the inner table. This algorithm can be used when there is index access to the inner table.

Here come hash joins. The idea is to build a hash table for the values from the outer table, which normally is the smallest. Then, read the inner table and find matches in the hash table. Like this both tables could be read at best just once. Hash joins are best for large results sets where indexes cannot be used. Due to their nature they are useful only for equality joins. Since 8.0.18 MySQL would choose hash join before BNL for any query that uses equality join condition (e.g. ref or eq_ref join types) and uses no indexes. Let's try it out.

For the example, I added a new table called job_sal with salary ranges per job to the dept_emp schema and I also generated 1 million more employees. Now let's say you want to find all the employees whose salary is outside the range. I would come up with a query like this:

SELECT E.ename, E.sal, JS.sal_min, JS.sal_max
  FROM emp     E,

       job_sal JS
 WHERE E.job = JS.job

   AND E.sal NOT BETWEEN JS.sal_min AND JS.sal_max;


Which given the absence of indexes would be executed using BNL algorithm in MySQL prior to 8.0.18, but in the new release the same query would benefit from the hash join optimization as visualized by the following execution plan in tree format:

+----------------------------------------------------------------------------------------+
| EXPLAIN                                                                                |
+----------------------------------------------------------------------------------------+
| -> Filter: (E.sal not between JS.sal_min and JS.sal_max)  (cost=499211.71 rows=442901)
    -> Inner hash join (E.job = JS.job)  (cost=499211.71 rows=442901)
        -> Table scan on E  (cost=1962.39 rows=996514)
        -> Hash
            -> Table scan on JS  (cost=0.75 rows=5)
 |
+----------------------------------------------------------------------------------------+
1 row in set (0.0023 sec)


Visual Explain Plan
The plan reveals that the smaller table JS would be hashed and the join would be done using the hash table. Unfortunately, the hash join optimization is not visible in traditional explain plan, JSON and thus into MySQL Workbench. The visual explain plan (look right) would be thus misleading showing BNL as the join algorithm. This is understandable, because the hash join optimization is possible only with the new iterator executor, which actually generates the explain plan in TREE format. This executor is not able to explain some queries, so you may see only the message "not executable by iterator executor". However, I really hope this is improved in future releases, because the explain plan should be consistent between formats. I reported a feature request as bug 97280.

I tested the performance of the query for 1M employees and its execution time was 0.89 sec. Disabling the hash join optimization with NO_HASH_JOIN hint increased the execution time to 1.26 sec. Hash join for sure would be much more beneficial when outer table has more rows.

EXPLAIN ANALYZE

This new feature also comes on top of TREE explain plan and it represents kind of a profiling tool, because apart from information about how optimizer plans to execute the query (see above) there's also information from its actual execution. This information includes:
  • the time for returning the first row (in ms);
  • the time for returning all rows (in ms);
  • the number of rows read;
  • the number of loops.
It makes it possible to compare how optimizer estimations compare to the real execution of the query. Let's try it with the previous query. I have intentionally shortened the output.

+-------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                               |
+-------------------------------------------------------------------------------------------------------+
| -> Filter: (E.sal not between JS.sal_min and JS.sal_max)
       (cost=502072.82 rows=442901) (actual time=0.372..747.742 rows=915768 loops=1)
    -> Inner hash join (E.job = JS.job)
         (cost=502072.82 rows=442901) (actual time=0.355..575.011 rows=1000014 loops=1)
        -> Table scan on E  (cost=2534.62 rows=996514) (actual time=0.185..353.877 rows=1000014 loops=1)
        -> Hash
            -> Table scan on JS  (cost=0.75 rows=5) (actual time=0.133..0.144 rows=5 loops=1)
 |
+-------------------------------------------------------------------------------------------------------+
1 row in set (0.7754 sec)


If it looks familiar to you check PostgreSQL's EXPLAIN ANALYZE. Please, note that times are milliseconds, not seconds! I think it's easy to note that number of rows estimated by the optimizer for the scan of table E differs, because table's statistics are not accurate. I have added 1M rows to table emp, so it's necessary to increase the number of sampling pages (see innodb_stats_persistent_sample_pages) and run ANALYZE TABLE again. After doing it the optimizer estimation and actual number of rows match. However, the optimizer is also wrong about the rows involved into the upper level operations - the hash join and filter, but this cannot be fixed with index statistics. According to Norvald H. Ryeng (see his article MySQL EXPLAIN ANALYZE) both "estimated and actual numbers" of rows "are averages over all loop iterations", but I have just one loop for all the operations in the plan.

In any case EXPLAIN ANALYZE is a nice addition to the instrumentation of the optimizer. Unfortunately, neither TREE format nor EXPLAIN ANALYZE are available currently even in the latest MySQL Workbench, so another feature request from me as bug 97282.

MySQL is OpenSSL only

With this version the support for YaSSL and WolfSSL libraries are removed, so MySQL could be compiled only with OpenSSL. I have personally always built MySQL with OpenSSL (i.e. using option -DWITH_SSL=system), because this is what comes with Slackware, but of course the more important is that "the MySQL/OpenSSL combination is very well tested and production proven" as Georgi Kodinov explained in his post MySQL is OpenSSL-only now !.

Spatial

Function ST_Distance now accepts SRS arguments of all geometry types and not only arguments of types Point and Point, or Point and MultiPoint as before.
Happy using MySQL!

Update 2019-11-09: That timings are in milliseconds wasn't specified in the manual, so I filed bug 97492 as I also think that it would be good to print the unit next to timings.

No comments: