Let your database fly! The Greenplum query optimizes parsing

On June 19, Guo Feng, developer of the original Greenplum kernel, shared the third issue of “Greenplum Kernel” series of live broadcast “Query optimization”. Don’t despair if you didn’t take part in the event. We’ve uploaded the video to Greenplum’s Chinese community B channel and you can watch it by clicking here. This article summarizes the essence of the article content, welcome message exchange.

Live Video (Tencent Channel)

V.qq.com/x/page/i310…

This article will mainly includes two parts, the first part is the introduction of the query optimizer, the second part is the specific process of query optimization, the whole process will be divided into four stages: the pretreatment of the query tree, scan/connection optimization, optimization scanning/connection, plan post-processing of the tree, we will be launched in this paper.

First, let’s take a look at what the query optimizer does. For a given query statement, the query optimizer finds the query plan with the lowest “cost.” The “cost” here can have different dimensions or definitions in different application scenarios, such as finding a query plan with the shortest execution time, or finding a query plan with the least amount of execution resources.

The same query statement can be executed in many ways, so the query optimizer tries to go through every possible execution to find the least “expensive” one and convert it into an executable plan tree.

A JOIN B on a.i = b.I; A JOIN B on a.i = b.I; In the first execution, we can perform an index scan on B, then a sequential scan on A, and then perform a Nested Loop. In the second execution, we can do a sequential scan of A, create a Hash table for the result of the scan, then do a sequential scan of B, and finally use Hash Join to get a result. The third way is to perform a sequential scan on a, perform a sort sort on the key values of a.i, perform an index scan on b, and Merge Join to obtain the final result.

As can be seen from these three execution modes, the following “cost” information is different. According to the cost information, the optimizer will select a method with the lowest “cost” to form the final execution scheme.

Having introduced the query optimizer, let’s take a look at the query plan. A query plan is a tree of plan nodes, each representing a particular type of processing operation, that contains all the information the executor needs to execute. At execution time, the schedule node generates output tuples. Generally, scanning nodes get input tuples from the data table, and most other nodes get input tuples from their child schedule nodes and handle generating output tuples.

Plan nodes can be divided into three types,

Scanning node

Including sequence scan, index scan, bitmap scan and so on;

Connect the nodes

Nestloop, Hash, merge, etc.

The SPJ node

Including Sort, aggregate, set operations (UNION, etc.)

Let’s look at another example. For Query in the example, the Query plan is shown in the figure below, where Seq Scan on A and Seq Scan on B are both Scan nodes. The Scan result is Hash Join, which is a Join node. On top of Hash Join, we make a HashAggregate, which is an aggregation node, and then sort the clustered nodes, which is a sort node.

After talking about query optimization and query planning, let’s take a look at the specific process of Greenplum query optimization. Greenplum query optimization consists of four phases, as shown in the figure below.

Let’s take a look at what each of the four stages does.

1. Simplify constant expressions

In expressions, there are some constant expressions that can be simplified when planning. For example, for function expressions, the following two types of expressions can be simplified:

The function itself is strict

If the input of an expression is NULL and the output is either NULL or FALSE, we can say that the function is “strict”. We can check whether a function is strict by checking the properties in the CATELOG table. If the function itself is strict and the input parameter contains a NULL value, we can use NULL to replace the function expression during planning.

The function itself is IMMUTABLE

A function is IMMUTABLE, meaning that if the input is fixed, the output is also IMMUTABLE. We can also check whether this function is IMMUTABLE by looking at the properties in the CATELOG table. If the function itself is IMMUTABLE, and the input arguments are all constants, then the function expression can be evaluated and replaced by constants.

Boolean expressions can also be simplified in some cases, such as the two cases below.

CASE expressions can also be simplified if a branch condition is a constant. For example, if 2+2=4 in the example below, the entire CASE expression can be simplified to x+1. It’s equal to 1/0 in the else, but because we’re simplifying it, we don’t have to divide by 0 like that.

Why should we simplify constant expressions at the planning stage? Simplifying constant expressions provides the following benefits:

Instead of doing one calculation for each row tuple, do one calculation
Both view expansion and function inlining may present new opportunities for constant expression simplification
Simplified constant expressions also reduce computation for statistics class functions

2. Inline simple SQL functions

The second thing that query tree preprocessing did early on was to inline simple SQL functions, such as in the example below, SELECT incr4(a) FROM foo can be inlined to SELECT a+4 FROM foo. And this is not just an inline, but a simplification of the constant expression.

By inlining simple SQL functions, you can

Avoid the cost of SQL function calls
Provides new opportunities to simplify constant expressions

3. Promote sublinks

A sublink indicates a subquery in an expression, usually in a WHERE or JOIN/ON clause.

For example, in the following example, the subquery appears in a WHERE clause and is therefore classified as a sublink. In this case, we can change an EXISTS sublink to a SEMI JOIN.

SELECT * FROM foo WHERE EXISTS (SELECT 1 FROM bar WHERE foo.a = bar.c);
Copy the code

The figure below shows the query tree before the transformation, and you can see that EXISTS_SUBLINK appears in the WHERE constraint before the transformation.

After the transformation, the sublink becomes a SEMI JOIN, which is the promotion process for a sublink.

4. Promote subqueries

Subqueries typically exist in the form of a range table, usually in the FROM clause. In this example, the subquery appears after the FROM. After the promotion, the query is an inner join for three tables. We can do the inner join for these three tables in any order. To get a better query plan. Before we can ascend, we have to connect bar and Baz.

SELECT * FROM foo JOIN (SELECT bar.c FROM bar JOIN baz ON TRUE) AS sub ON foo.a = sub.c;
Copy the code

The following diagram shows the internal structure of the query tree before the promotion. We can see that SUBSELECT is an item in a scope table belonging to the parent query.

After upgrading, we can see that the three tables become three inner join relations, so that we can swap the three tables in any order to find the lowest cost query plan.

By promoting the child query to the parent query, you can make the child query participate in the overall plan search space to find a better execution plan. Otherwise, we have to plan for the child query separately, and then treat the child query as a “black box” when planning for the parent query.

5. Remove external connections

Eliminating external connections is also an important thing to do during preprocessing. First, let’s take a look at the two tables below: the student table and the course selection table. We have two students in the student table. In the course selection table, we have a student whose student number is 1 who chooses the course numbered 100 and gets 60 points. If we want to find out which students took which classes and how many grades they got, we join the two tables. If we use INNER JOIN to do the JOIN, we get the result of the first table, which only has information about one student who has selected the course. If we look at LEFT JOIN, we find not only student no. 1 who has already selected a course, but student No. 2 who has not selected a course also appears in the result, but his course selection is filled with NULL. If you add a WHERE condition to the left JOIN, the result is the same as the INNER JOIN. In this case, we can turn the LEFT JOIN into an INNER JOIN.

To summarize: an outer connection can be converted to an inner connection if there is a “strict” constraint on the upper level of the nullable side that defines a variable from the Nullable side as a non-NULL value.

SELECT ... FROM foo LEFT JOIN bar ON (...) WHERE bar.d = 42;
Copy the code

Now let’s look at the second case of eliminating outer joins. First of all, we made a LEFT JOIN, and then two results came out, where the course information of students who did not choose courses was NULL. Weekly SID is null; least-join (); least-join (); least-join (); In this case, the LEFT JOIN does exactly what the anti-Join does.

When we look at the statement plan, we see that Greenplum has turned it into an anti-Join. So this is the second case of eliminating outer joins.

An outer Join can be converted into an anti-join if the outer Join itself has a “strict” connection condition that references a variable from the Nullable Side and is restricted to NULL by the upper constraint.

SELECT * FROM foo LEFT JOIN bar ON foo.a = bar.c WHERE bar.c IS NULL;
Copy the code

Stage 1: Query tree preprocessing (afterperiod)

Early on, we do some transformations to the query tree itself, and later on in the query tree preprocessing, we do the following:

Distribute WHERE and JOIN/ON constraints
Collect information about connection order restrictions
Eliminate useless connections
etc.

1. Distribution constraints

Let’s look at the distribution constraints, and in general, we want to push down the constraints as much as possible. If there are only inner joins, we can push a constraint to its “natural semantic” position. If there are outer joins, the pushdown of the constraint may be prevented from being pushed down to its “natural semantic” location. For a constraint that is blocked by an outer join, we prevent it from being pushed down below the outer join by making its “required_relids” include all the base tables required by the outer join.

First, let’s look at which constraints are blocked by outer joins. Let’s look at the first case. In the example on the left, the constraint only refers to the student table. Can we push it down to the scan of the student table? In Greenplum’s plan, the student. Sage = 18 constraint is not pushed down to the student scan, but is retained at the layer where the student and the weekly outer links JOIN, which is an example of being blocked by outer links.

In the Query on the right, we make some changes to the form of a subquery. By looking at the plan, we successfully push sage=18 down to the student table. Executing with the plan, we find that the result generated is not the same as the result on the left.

If the join condition of the outer join itself references the table of the non-Nullable side, then the join condition cannot be pushed down below the outer join, or we may lose some nullable extended tuples.

Let’s look at the second case where we can get blocked by external connections. First of all, if we look at the left hand side, we can see that the constraint only refers to weekly. Can we push this constraint to the scan of periodicals? When we look at the query plan, we see that it does not do this, but rather remains at the join layer. After execution, we find that it results in a single line of results.

Once we transform our SQL statement, as shown on the right side of the picture, we find that in the plan, we successfully push the constraint to the scan of the club table, but the result is different from the left side, there is one more row of results.

If a constraint on the outer join refers to a nullable side variable, then the constraint cannot be pushed below the outer join; otherwise, nullable side tuples may be added.

2. Limit the connection sequence

When there is an external join, the join order can not be arbitrarily exchanged, because the external join will limit the exchange of the join order to some extent. A non-full-Join can freely JOIN the left end (LHS) of an outer JOIN, whereas a non-full-Join cannot JOIN the right end (RHS) of an outer JOIN. Take a look at the two examples below. In the example on the left, you can see from the query plan that a and C do an inner join first, and then do a left join with B. In the example on the right, you can see that in the query plan, although a and B do not have connection conditions, they still do left join on A and B, and then do an INNER join with C.

3. Delete unnecessary connections

Another thing you do late in the preprocessing process is eliminate unwanted connections. Useless connections can be eliminated if the three conditions in the figure below are met. In the example below, the LEFT JOIN can be eliminated.

Phase 2: Scan/connection optimization

In this phase, the FROM and WHERE parts of the query are processed, and the ORDER BY information is also taken into account. This part is all driven by “cost”.

Scan/join optimization works like this:

First, the scan path is determined for the base table, and the cost and size of the scan path are estimated
Dynamic programming algorithm is used to search the whole connection sequence space and generate the connection path
When searching the join order space, you need to take into account the join order constraints imposed by outer joins

In the process of dynamic programming,

Start by generating scan paths for each base table
Generate join paths for all possible joins of two tables
Generate connection paths for all possible joins of three tables
Generate join paths for all possible joins of four tables
.
Until all the base tables are joined together

As you can see, this dynamic programming is done level by level. SELECT * FROM A JOIN (B JOIN C ON B.j =C.j) ON A.i = B.i

In fact, this process, “cost” is very high, N table join, theoretically n! It is not feasible to iterate over all possible join sequences. We use some heuristics to reduce the search space. For the two tables that do not have connection conditions, do not join as far as possible, and decompose a big problem into many sub-problems. For example, in the example below, we have reduced the join of 10 tables into multiple small problems. The process of dynamic programming for each subproblem reduces the complexity.

Stage 3: Optimization beyond scan/join

At this stage, we do the following:

Handle GROUP BY, aggregation, window functions, and DISTINCT
Handle set operations, UNION/INTERSECT/EXCEPT
Add the last SORT node if ORDER BY is required
Add LockRows, Limit, ModifyTable nodes

After doing this, we almost have a complete plan tree. But we still need to do some post-processing.

The fourth stage plans the post-processing of the tree

At this stage, we need to convert the least expensive path into a plan tree and adjust some details in the plan tree:

Flatten the scope table for subqueries
Change the variable in the upper plan node to OUTER_VAR or INNER_VAR to point to the output of the child plan
Delete unnecessary SubqueryScan, Append and other nodes

After doing this, we have the complete plan tree and can hand it off to the executor to execute.

That’s all for query optimization, and stay tuned for the rest of the Greenplum Kernel series.

Let your database fly! The Greenplum query optimizes parsing

Related Posts

Pinpoint use as a distributed link tracking system

Write an operating system from scratch – 1.5 real mode and protected mode

F5 BIGip Load balancing IP algorithm decryption tool