Summary: The SQL optimizer is essentially an implementation of a highly abstract data interface, designed so that customers can use a more general and easy-to-understand SQL language to manipulate and process data without having to focus on and abstract their own data interfaces, which greatly liberates their applications.

The authors | | passenger source ali technology to the public

Background and architecture

We all know that the use of writing programs to dynamically implement the logic we need to apply, so that the program execution to get the results we need. A database is an application that quickly retrieves data by entering SQL strings. Of course, assuming that there is no database this kind of system application, how to achieve with the program? We may find that regardless of how data is stored and whether it is accessed concurrently, we still need to constantly modify programs to handle different requests for data from different applications. In big data, for example, we usually get data through non-relational database apis. However, although this approach is simple to get started, it is extremely difficult to maintain, and the universality is not strong. Even if the software architecture design or abstract reconstruction is carried out constantly, it still needs to constantly change the application, which is why the non-relational database back to embrace the database SQL optimizer.

The SQL optimizer is essentially an implementation of a highly abstract data interface. By this design, customers can use a more general and easy-to-understand SQL language to manipulate and process data without paying attention to and abstracting their own data interfaces, which greatly liberates their applications.

In this article, we will show you how the MySQL 8.0 SQL optimizer can turn a simple string (SQL) into a sequence of execution that the database executor can understand and ultimately return data to the customer. A powerful optimizer does not require the customer to focus on how the SQL is written to get the data they need faster, so the optimizer must make some equivalent changes to the original SQL. In MySQL 8.0, we introduce the Server layer parser, optimizer, and executor, including some code structures and changes. The simple_joins function is used to show how the MySQL optimizer can simplify the optimization of nested joins in logical transformations. In this article we take you step by step into the magic of the optimizer and see how each step of the optimizer’s optimization section changes the final execution of an SQL query.

This article is based on the latest version of MySQL8.0.25. Due to the length of the optimizer transformation section, we split the article into two parts. The first part introduces the basic structure Setup and Resolve resolution transformation process, and the second part introduces the more complex subquery, partition table, and join transformation process.

Setup and Resolve

  • setup_tables : Set up table leaves in the query block based on list of tables.
  • resolve_placeholder_tables/merge_derived/setup_table_function/setup_materialized_derived : Resolve derived table, view or table function references in query block.
  • setup_natural_join_row_types : Compute and store the row types of the top-most NATURAL/USING joins.
  • setup_wild : Expand all ‘*’ in list of expressions with the matching column references.
  • setup_base_ref_items : Set query_block’s base_ref_items.
  • setup_fields : Check that all given fields exists and fill struct with current data.
  • setup_conds : Resolve WHERE condition and join conditions.
  • setup_group : Resolve and set up the GROUP BY list.
  • m_having_cond->fix_fields : Setup the HAVING clause.
  • resolve_rollup : Resolve items in SELECT list and ORDER BY list for rollup processing.
  • resolve_rollup_item : Resolve an item (and its tree) for rollup processing by replacing items matching grouped expressions with Item_rollup_group_items and updating properties (m_nullable, PROP_ROLLUP_FIELD). Also check any GROUPING function for incorrect column.
  • setup_order : Set up the ORDER BY clause.
  • resolve_limits : Resolve OFFSET and LIMIT clauses.
  • Window::setup_windows1: Set up windows after setup_order() and before setup_order_final().
  • setup_order_final: Do final setup of ORDER BY clause, after the query block is fully resolved.
  • setup_ftfuncs : Setup full-text functions after resolving HAVING.
  • resolve_rollup_wfs : Replace group by field references inside window functions with references in the presence of ROLLUP.

Detailed conversion process

The entire framework of transformation is a top-down or bottom-up process from Query_expression to Query_block by calling prepare (SQL/SQL_resolver.cc), depending on the requirements of the transformation rules.

The picture

Pass NULL to join inner table list (propagate_NULlability)

The table may contain all NULlable rows. Null rows can be propagated according to the JOIN relationship (TOP_JOIN_LIST). If a table can be determined as nullable, some optimizations will degenerate, such as access method cannot be changed to EQ_REF, outer join cannot be changed to inner join, etc.

Leave_tables (setup_tables)

SELECT
  t1.c1
FROM t1,
     (SELECT
       t2.c1
     FROM t2,
          (SELECT
            t3.c1
          FROM t3
          UNION
          SELECT
            t4.c1
          FROM t4) AS t3a) AS t2a;
Copy the code

Before setup_TABLE is called, the leaf_tables for each Query_block is 0.

What this function does is build leaf_tables, including the base tables and Derived tables lists, for subsequent optimizations. Setup_tables is not called recursively, but only resolves tables for this layer and counts the number of Derived tables for this layer. But then we call resolve_placeholder_tables()->resolve_derived()->derived(Query_expression)::prepare->Query_block::prepare specifically to process d recursively Query_expression for Erived table.

Next we proceed to the resolve_placeholder_tables function for derived table processing in the order in which prepare was called.

Query block mydb – Derived Table View Table resolve_placeholder_tables

Derived table this function is used to manipulate derived tables, views, and table functions if the table is merged, or if it is called using transform_grouped_to_derived(), If the Materialized table mode has been decided, it is directly ignored.

The merge_derived() function, which changes Query_expression/Query_block, is derived from resolve_derived(). Merge the Derived Table or View into the Query Block.

Merge_derived processing and merging Derived table

1) Prerequisites for Merge_Derived Transformation

  • Whether the outer query block allows merge (allow_MERGE_Derived) whose outer Query block is nullptr to subquery the outer Query expression is nullptr, The derived table is the outer query block of the first subquery layer, allow_MERGE_DERIVED =true, or SELECT/SET without the outer query block
  • Merge (lex->can_use_merged()+lex->can_no_use_merged())
  • Whether the Derived table has been marked as needing to materialize, Derived_table -> ALGORITHM= = VIEW_ALGORITHM_TEMPTABLE
  • The entire dervived table is in a query expression unit that cannot be (Query_expression::is_mergeable()) : A Union query contains aggregate, HAVING, DISTINCT, WINDOWS, or LIMIT without any table list
  • HINT or Optimizer_switch does not disable deriveD_merge
  • Heuristic recommends merging (derived_query_expressionmerge_heuristic()) if the Derived table contains a subquery SELECT list that depends on its own columns, Not supported A Dependant SubQuery is not supported if multiple executions are required
  • STRAIGHT_JOIN Is not supported in derived Table if the query block contains SEMI/ anti-join and specifies STRAIGHT_JOIN
  • Not supported if the merged Derived Table and the leaf table count of the existing Query Block are about MAX_TABLES

2) Merge_Derived transformation

  • The deriveD_table -> nesteD_JOIN structure is used to assist with OUTER JOIN.
  • Merge the derived table into NESTED_JOIN (derived_table->merge_underlying_tables())
  • Join all the tables in the Derived Table into the table_list list of the parent query, and remove the Derived table from the parent query.
  • Recalculate all the relevant data structures with parent queries (LEAF_TABLE_count, DERIveD_TABLE_COUNT, TABLE_func_COUNT, materializED_DERIveD_TABLE_COUNT, has_SJ_INFO, Has_aj_info, partitioned_table_count, cond_count, between_count, select_n_having_items)
  • Propagate parent OPTION_SCHEMA_TABLE (add_base_options()) and nullable (propagate_NULlability ()) if it is an inner table of an external query JOIN
  • Merge where conditions from Derived table into external query (merge_WHERE ())
  • Create a reference to the columns derived table needs (create_field_translation())
  • Remove the structure of Derived Table from the parent query (exclude_level())
  • Incorporate columns or table renaming from Derived Table into the parent query (fix_tables_after_pullout()/ repoint_contexts_of_JOIN_INFO ())
  • Because you have merged the tables contained in the Derived Table into the parent query, you need to relocate the tables in the TABLE_LIST (remap_tables())
  • After the derived Table is merged into the parent query, all references to all columns in the Derived Table in the original Derived Table need to be modified (fix_tables_after_pullout())
  • If the Derived Table contains an ORDER By statement, the Derived Table will retain ORDER By and merge it into the parent query if the following conditions are met, otherwise ORDER By will be ignored: You can have a WHERE condition if the parent query allows sorting and just happens to be that only the Derived table is not a UNION, but not a group by or the aggregate function itself is not ordered

The process is simplified as:

Merge_derived diagram process

It looks like the official Derived Merge isn’t perfect enough to include opt trace in a bottom-up recursive merge:

trace_derived.add_utf8_table(derived_table)
       .add("select#", derived_query_block->select_number)
       .add("merged", true);

trace_derived.add_alnum("transformations_to_derived_table", "removed_ordering");
Copy the code

This optimization can be controlled by setting Optimizer_switch =”derived_merge= ON /off”.

Setup_materialized_derived sets the materialized Derived Table

For the remaining Derived tables that cannot use the Merge algorithm, they will be processed in the materialize way. The actual materialization takes place in the Executor phase.

  • Setup_materialized_derived_tmp_table (): Sets a temporary Table containing all rows of the materialized Derived Table.

  • Check_materialized_derived_query_blocks (): Sets the query block structure that belongs to the current Derived Table.

    trace_derived.add_utf8_table(this) .add(“select#”, derived->first_query_block()->select_number) .add(“materialized”, true);

Setup_table_function handles table functions

If the Query block has a table function, the whole process is processed twice. The first pass will skip the table of the table function, and the second pass will execute the above logic for the table of the table function. The consideration here should be to resolve the external context (as opposed to the table function) first, because it is possible that function arguments will have externally-dependent Tables.

trace_derived.add_utf8_table(this)
       .add_utf8("function_name", func_name, func_name_len)
       .add("materialized", true);
Copy the code

4 Expand SELECT * wildcard into specific fields (setup_wild)

Create Query_block level base_ref_items (setup_base_ref_items)

Base_ref_items records the location of all items so that other items of the query block can be referenced or referenced directly through Item_ref and its Item_ref subclasses. For example, subquery references (Item_view_ref), aggregate function references (Item_aggregate_ref), external query column references (Item_outer_ref), and subquery subqueries generate NULL Value reference helper (Item_ref_null_helper).

Example of a complex Item_outer_ref:

6 Check with select_fields fix_fields() and column permission (setup_fields)

The figure below is a fixed field process for a more complex tape query. Some fields are associated with tables, while others add the corresponding Item_xxx_ref reference.

7 Parse and fixed_fields WHERE conditions and Join conditions (setup_conds)

Setup_join_cond If there is a nested_join, setup_join_cond is recursively called to resolve and set it. The simplify_const_condition function, by the way, replaces the entire condition with Item_func_true/Item_func_false if it finds a const Item that can be deleted, as shown in the figure below.

8 Parse and set the ROLLUP statement (resolve_rollup)

In the database query statement, the GROUP BY expression is followed BY the WITH ROLLUP statement, so that the data can be analyzed and counted at different levels BY a single query statement.

SELECT YEAR, country, product, SUM(profit) AS profit FROM sales GROUP BY YEAR, country, product WITH ROLLUP; +------+---------+------------+--------+ | year | country | product | profit | +------+---------+------------+--------+ | 2000 | Finland | Computer | 1500 | | 2000 | Finland | Phone | 100 | | 2000 | Finland | NULL | 1600 | | 2000 | India | Calculator | 150 | | 2000 | India | Computer | 1200 | | 2000 | India | NULL | 1350 | | 2000 | USA | Calculator | 75 | | 2000 | USA | Computer | 1500 | | 2000 | USA | NULL | 1575 | | 2000 | NULL | NULL | 4525 | | 2001 | Finland | Phone | 10 | | 2001 | Finland | NULL | 10 | | 2001 | USA | Calculator | 50 | | 2001 | USA | Computer | 2700 | | 2001 | USA | TV | 250 | | 2001 | USA | NULL | 3000 | | 2001 | NULL | NULL | 3010 | | NULL | NULL | NULL | 7535 | + -- -- -- -- -- - + + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- + equivalent to do the following query:  SELECT * FROM (SELECT YEAR, country, product, SUM(profit) AS profit FROM sales GROUP BY YEAR, country, product UNION ALL SELECT YEAR, country, NULL, SUM(profit) AS profit FROM sales GROUP BY YEAR, country UNION ALL SELECT YEAR, NULL, NULL, SUM(profit) AS profit FROM sales GROUP BY YEAR UNION ALL SELECT NULL, NULL, NULL, SUM(profit) AS profit FROM sales) AS sum_table ORDER BY YEAR, country, product; +------+---------+------------+--------+ | YEAR | country | product | profit | +------+---------+------------+--------+ | NULL | NULL | NULL | 7535 | | 2000 | NULL | NULL | 4525 | | 2000 | Finland | NULL | 1600 | | 2000 | Finland | Computer  | 1500 | | 2000 | Finland | Phone | 100 | | 2000 | India | NULL | 1350 | | 2000 | India | Calculator | 150 | | 2000 | India | Computer | 1200 | | 2000 | USA | NULL | 1575 | | 2000 | USA | Calculator | 75 | | 2000 | USA | Computer | 1500 |  | 2001 | NULL | NULL | 3010 | | 2001 | Finland | NULL | 10 | | 2001 | Finland | Phone | 10 | | 2001 | USA | NULL | 3000  | | 2001 | USA | Calculator | 50 | | 2001 | USA | Computer | 2700 | | 2001 | USA | TV | 250 | +------+---------+------------+--------+Copy the code

Sorting is very difficult because of the NULL problem, and the group column changes and SQL complexity changes back and forth, while ROLLUP is very simple to implement. Let’s see what transformations ROLLUP does in the parsing process to achieve unexpected results.

Set GROUP BY/ORDER BY statement (setup_group/setup_order)

One of the functions find_order_in_list(): try to find columns in the select fields that can be mapped, otherwise you have to be the first in the all fields of the final projection, and do fix_fields as well.

  • M_having_cond ->fix_fields: Fixed_fields for having conditions.
  • Resolve_limits: handles OFFSET and LIMIT clauses (offset_limit and select_limit Items).
  • Setup_ftfuncs: Fix_fields the related Item if there is a full-text function.

remove_redundant_subquery_clause : For Table Subquery expressions, usually IN/ANY/ALL/EXISTS/etc, if there are no aggregate functions and Having clauses, it is usually advisable to remove unnecessary ORDER/DISTINCT/GROUP BY. The function supports three REMOVE_ORDER | REMOVE_DISTINCT | REMOVE_GROUP, if is SINGLEROW_SUBS subquery, consider only delete REMOVE_ORDER.

select c1 from t1 where t1.c2 in (select distinct c1 from t2 group by c1, c2 order by c1); => select c1 from T1 where t1. C2 in (select c1 from t2);Copy the code
  • Handles whether unnecessary DISTINCT statements can be removed if the GROUP BY columns are in the SELECT list and there are no ROLLUP and Window functions.

    is_grouped() && hidden_group_field_count == 0 && olap == UNSPECIFIED_OLAP_TYPE

For example:

SELECT DISTINCT c1, max(c2) from t1 group by c1;
Copy the code

Windows ::setup_windows1

SELECT id, release_year, rating, avg(rating) over(PARTITION BY release_year) AS year_avg FROM tw; +------+--------------+--------+-------------------+ | id | release_year | rating | year_avg | + + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- + + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- + | | 2015 | 1 8 8.5 | | | 3 | 2015 | 9 8.5 | | | | 2015 | | 8.5 2 8.3 8.2 8.5 | | | 2016 | | | | | 2016 | | 8.4 8.3 | | 6 7 | | 2017 | | 7 +------+--------------+--------+-------------------+Copy the code

The process and results are similar to the following:

Let’s take a look at what it does to start Query_block::prepare:

If m_windows is not empty, call Window::setup_windows1

  • Walk through the list of window functions and call resolve_WINDOW_ordering to resolve m_partition_BY and m_ORDER_BY

  • Handles inter-window references (e.g. Window w1 AS (w2), w2 AS (), w3 AS (w1)), but must be a directed acyclic graph (DAG)

  • Re-iterate to check if the unique name is check_unique_name and create reference items for window partition by and Window Order by

  • Check Window function characteristics (Window::check_window_functions1(THD)

    thd, _block

    Select * from window (select * from window); A static window determines whether the frame definition has upper and lower boundaries. M_static_aggregates is true, meaning it is a static window, and each partition can be evaluated once. If ma_static_aggregates is false, it further determines whether its sliding window is scope-based or row-based. WFS -> check_wF_semantics1 (THD, select, This method is used to determine whether row_buffer is needed for evaluation. If we only look at the row of the current partition and cannot calculate correctly, we need to use row_buffer instead of looking at the row after or before.

Three reviews

This article focuses on one part of the rule-based optimization of the optimizer, with more emphasis on the basic SQL operators, such as table, column, function, aggregation, grouping, sorting elements parsing and setting, as well as some obvious structural changes. Stay tuned for the next article, where we will continue the transformation of subqueries, partitioned tables, and JOIN operations.

Iv Reference Materials

  • MySQL 8.0 Server Layer Architecture
  • Derived_MySQL — derived_MySQL — derived_MySQL
  • ROLLUP Performance Enhancement
  • WL#9236, WL#9603 and WL#9727 – Add SQL window functions to MySQL

V. About Us

PolarDB is a cloud native distributed relational database independently developed by Alibaba. It entered the Leader quadrant of Gartner global database in 2020 and won the first prize of Science and Technology Progress awarded by China Electronics Society in 2020. PolarDB, based on the cloud native distributed database architecture, provides large-scale online transaction processing capability, and has the parallel processing capability for complex queries. As a whole, PolarDB has reached the international leading level in the field of cloud native distributed database, and has been widely recognized by the market. In the best practices within Alibaba Group, PolarDB also fully supported the 2020 Tmall Double 11, and refreshed the peak database processing record, as high as 140 million TPS. We are looking forward to working with you to build a world-class next-generation cloud native distributed relational database.

The original link

This article is the original content of Aliyun and shall not be reproduced without permission.