Abstract:

This is part of MaxCompute’s series on SQL optimizer principles. We will continue to publish other articles on SQL optimizer optimization rules and frameworks. Add pin group “Relational Algebra Optimization Techniques” (group no. 11719083) to get the latest articles published.

This article is mainly an overview of database query optimizer, including:

  • Query optimizer definition, classification
  • Query optimizer execution process
  • Introduction to CBO framework Calcite

1. What is the query optimizer

The database is mainly composed of three parts, namely the parser, the optimizer and the execution engine, as shown in the following figure:

Optimizer is the core component used to transform relational expression into execution plan in database, which determines the performance of a system to a large extent.

2. Query optimizer categories

There are two types of query optimizers: rule-based Optimizer (RBO) and cost-based Optimizer (CBO) :

  • Rule-based Optimizer (RBO)

The relational expression is transformed according to the optimization rules. The transformation here means that a relational expression will be transformed into another relational expression after the optimization rules. At the same time, the original expression will be cut out and the final execution plan will be generated after a series of transformations.

RBO contains a set of optimization rules with strict order. The same SQL will generate the same execution plan no matter what data is in the table read. At the same time, differences in the way SQL is written in RBO can affect the final execution plan, which can affect script performance.

  • Cost-based Optimizer (CBO)

Relational expressions are transformed according to optimization rules. The transformation here means that a relational expression will generate another relational expression after optimization rules, while the original expression will also be retained. Multiple execution plans will be generated after a series of transformations. Then CBO will calculate the Cost of each execution plan based on statistical information and Cost Model, and select the execution plan with the lowest Cost. It can be seen from the above that there are two dependencies in CBO: statistical information and cost model. Whether the statistical information is accurate or not and whether the cost model is reasonable or not will affect CBO to choose the optimal plan.

As can be seen from the above description, CBO is superior to RBO because RBO is a rigid optimizer that only recognizes rules and is insensitive to data. However, in the actual process, data often changes, and the execution plan generated through RBO is likely to be not optimal.

In fact, at present, all major databases and big data computing engines tend to use CBO. For example, since Oracle 10G, Oracle has completely abandoned RBO and switched to CBO. Hive also introduced CBO in version 0.14.

3. Query optimizer execution process

Both RBO and CBO contain a series of optimization rules, which can perform equivalent transformation of relational expressions. Common optimization rules include:

  • Predicate push-down
  • Column cutting
  • Constant folding
  • other

On the basis of these optimization rules, the corresponding equivalent transformation of the relational expression can be done to generate the execution plan. The following describes the execution of both the RBO and CBO optimizers.

  • RBO The RBO execution process is simple and consists of the following steps:

A relational expression is traversed by a schema that can be transformed as long as it satisfies certain optimization rules. After Step1, a logical execution Plan is generated, but this is only logically feasible. It is also necessary to Build the logical execution Plan into a Physical execution Plan, that is, to determine the specific implementation of each Operator. For example, the specific implementation of Join operator should choose BroadcastHashJoin or SortMergeJoin.

  • CBO CBO query optimization consists of three steps:

1) Exploration performs equivalent transformation according to optimization rules to generate equivalent relation expressions, and the original relation expressions will be retained. 2) Build the Physical Plan to determine the specific implementation of each Operator. 3) Find Best Plan Calculate the Cost of each execution Plan according to the statistics, and select the execution Plan with the lowest Cost.

CBO implementation has two models, namely Volcano model [1] and Cascades model [2]. Calcite uses Volcano model, while Orca[3] uses Cascades model. The Cascades model does not start with Explore and then Build, but instead builds as it goes, further tailoring execution plans. I’m not going to do it here, but if you’re interested in it, you can look at the papers.

4. Introduction to CBO framework Calcite

Apache Calcite is a storage and execution independent SQL optimization engine widely used in open source big data computing engines such as Flink, Drill, Hive, Kylin, etc. In addition, MaxCompute also uses Calcite as the optimizer framework. The architecture of Calcite is shown below:

Where Operator Expressions refer to relational Expressions, a relational expression is represented in Calcite as RelNode, often with the root node representing the entire query tree. There are two ways to generate RelNode in Calcite:

  • Directly generated by Parser

As can be seen from the above architecture diagram, Calcite also provides Parser for SQL parsing, and RelNode Tree can be obtained by using Parser directly.

  • Generated by Expressions Builder conversion

Parser syntax varies from system to system, so Parser might be different. In this case, Calcite provides Expressions Builder to transform an abstract syntax Tree (or other data structure) into a RelNode Tree. For example, Hive(a Data Processing System) uses this method.

Query Optimizer performs a series of equivalent conversions to the Operator Expressions according to Pluggable Rules, generating different execution plans, and finally selecting the least expensive execution plan. The cost calculation uses statistics provided by the Metadata Providers.

In fact, Calcite offers two optimizations, RBO and CBO, corresponding to HepPlanner and VolcanoPlanner respectively. This article will not expand on this either, but there will be time to go into the detailed implementation of Calcite later.

5. To summarize

This paper is an overview of query optimizer, including its classification, execution process, and Calcite.

Read the original

This article is the original content of the cloud habitat community, shall not be reproduced without permission.