Abstract: Path generation is the main stage of determining table association mode. This paper introduces several factors affecting path generation: cost_param, SCAN mode, JOIN mode, STREAM mode, and analyzes how to intervene path generation in principle.
1. Cost model selection
As the name implies, cost_param is a parameter that controls cost. Before we get to cost_param, let’s review the concept of selection rate. In the GaussDB optimizer, selection rate is a real number between 0 and 1 as a percentage of the total number of rows that can be selected by a filter or association condition. Selection rate is an important concept in the optimizer and is mainly used to estimate the number of rows and distinct values, which are essential elements in plan generation.
First, let’s look at how to estimate the number of rows in the base table with filter conditions. If a table has only one filter condition, the number of filtered rows can be obtained by multiplying the selection rate by the number of rows in the table. If there are multiple filter conditions, then we need to calculate a comprehensive selection rate, how to calculate? There are two methods: one is to calculate directly through multi-column statistics, the other is to combine the selection rate of single column. So the composition is determined by the cost_param parameter, and in particular,
P_brand = ‘Brand#45’ and p_container = ‘WRAP CASE’ p_container = ‘WRAP CASE’
(1) cost_param = 0
(2) cost_param = 2
From the comparison of the estimated (e-rows) and the actual (A-rows), we can see that the cost_param=0 model is suitable for p_brand and P_container columns of the part table.
Secondly, how to estimate the number of rows in a Join? The principle is similar to the row number estimation of filtering conditions. If there is no multi-column statistics available, the selection rate of each condition needs to be calculated separately, and then the comprehensive selection rate and the number of trips can be calculated. For example, TPC-H 1x lineitem is associated with Orders under the following association conditions: L_orderKey = O_ORDERKey and O_CUSTKey = L_suppkey. The execution of different cost_PARam is as follows:
(1) cost_param = 0
(2) cost_param = 2
In this case, the columns of a Join also fit perfectly with the correlation model, which is consistent with the distribution of L_orderKey and L_suppkey.
Because tPC-H model is close to completely unrelated model, cost_param=0 model can better describe the scenario. In practical application, users can adjust the model according to specific business scenarios. Accuracy of line estimation is an important guarantee for plan generation, and the most direct place to check estimation in tuning. In future versions, more models will be added to meet service requirements.
Second,ScanChoice of mode
The Scan modes in GaussDB are classified into sequential Scan and index Scan. Each Scan mode corresponds to several Scan operators. The sequential Scan operators are Seq Scan and CStore Scan respectively. Most of these Scan operators can be adjusted by enabling Seq Scan. For example, if enable_seqSCAN =off is set, Seq Scan is not selected preferentially. The choice of scanning mode largely determines the path to obtain base table data. Let’s use the following example to illustrate:
select l_orderkey, o_custkey from lineitem, orders where l_orderkey =
o_orderkey;
Copy the code
The LINEItem distribution key is l_ORDERKey with index on l_ORDERKey and the Orders distribution key is o_ORDERKey. By default, the Scan mode is as follows:
The two tables are sequentially scanned paths, and the association mode is Hash Join. If Seq Scan is turned off (enable_seqSCAN =off), the plan is as follows:
The Scan for lineItem becomes Index Only Scan (because l_ORDERKey is of type int), and Seq Scan is still selected on orders (because there is no other path), and the association is changed to Nest Loop. Because Hash Join requires a full table Scan (Seq Scan for LineItem has been turned off). The way the optimizer is selected can also be seen in the e-costs column. Turn off Index Only Scan again and see how the plan changes:
All Scan paths are changed to Seq Scan, and the cost of Seq Scan is very high. Turn off Nest Loop and see what Hash Join plans cost:
It can be seen from the cost that the Total cost of Hash Join is smaller than that of Nest Loop, but the optimizer does not select Hash Join. This is because when the optimizer compares the path cost, it will compare Startup and Total cost, namely Startup cost and Total cost. Considering comprehensively, the Total cost is displayed in the e-Costs column. Set explain_perf_mode to normal and check the original Nest Loop startup cost:
The two costs in the red box are startup cost and total cost respectively. When looking at the cost of Hash Join, it is obvious that the startup cost of Hash Join is much higher than that of Nest Loop (startup cost represents the cost of outputting the first data). The optimizer integrates these two costs when comparing paths. Finally, the path of Nest Loop is recommended.
Can be seen from the above example, the regulation of scanning path can change the path generated, the premise of reasonable collocation is to generate the optimal plan, by default, GaussDB optimizer can choose according to the existing path (as the lineitem have two scanning path, the orders only a scanning path), and finally determine the optimal one. When comparing the cost of two paths, the total cost is not the only factor, but the smaller the total cost is, the easier it is to be selected generally.
3. Selection of association mode
The following table associations are used in the GaussDB optimizer: Nest Loop, Hash Join, and Merge Join can be controlled by enable_nestloop, enabLE_hashJoin, and enabLE_mergeJoin, respectively. This control is not absolute, and can be construed as whether to select the Nest Loop, Hash Join, and Merge Join. In most scenarios, the cost relation of the three paths is as follows: Hash Join < Merge Join < Nest Loop. As a simple correlation example, store_RETURNS and store_SALES are two tables in TPC-DS 1x with the following SQL:
select count(*) from store_returns, store_sales where sr_customer_sk =
ss_customer_sk;
Copy the code
By default, the optimizer recommends the Hash Join path, which is planned as follows:
If Hash Join is turned off, the optimizer selects the Merge Join path:
If you turn off the Merge Join path again, you might choose the Nest Loop path. Associative control switches are generally used to tune or avoid problems, but whether they work depends on the specific statement, and there is no other way to do it than the current associative way. In actual scenarios, there are many operators associated with one statement. It is difficult to control the Join mode of two tables with enable_hashJOIN, enable_nestloop or enable_mergeJoin. A more detailed statement-level tuning tool for GaussDB is the Plan Hint, which is available in the product manual.
Iv. Selection of Stream mode
The Stream operator is one of the key operators for the distributed execution of GaussDB. It mainly plays the role of network transmission. For a brief introduction, see: GaussDB(DWS) Performance Tuning Series 1: General Tuning Strategy of 18 Martial Arts. The Stream operator is controlled by the enable_STREAM_operator argument. If you turn off the Stream operator, it may result in plans that will not be pushed, for example:
Because the key l_partkey associated with the LINEItem table is not the distribution key of the lineItem table, we need to add a Stream operator, but the Stream function is disabled, so we can only generate no push plan.
The main Stream operators commonly used in GaussDB schemes include Redistribute, Broadcast, and Gather. Gather is generally a distributed plan in which CN is used to collect DN data for final processing. Unless the number of rows finally collected is very large, this operator usually involves less performance problems. Redistribute and Broadcast are complementary operators. The former is used for redistribution, and the latter for Broadcast. When generating a plan, the optimizer selects one based on the cost. If the Join Key does not contain the distribution Key of the table, the Redistribute path is generally added. You can select the Redistribute path or theoretically select the Broadcast path. The final selection depends on the cost estimated by the optimizer. The two operators can be controlled by using the enable_redistribute and enable_broadcast parameters.
When SMP is enabled and doP is greater than 1, Local Redistribute, Split Redistribute, Local Broadcast, and Split Broadcast are also available. When tilt optimization is enabled, PART REDISTRIBUTE PART ROUNDROBIN, PART_REDISTRIBUTE_PART_BROADCAST, PART_REDISTERIBUTE_PART_LOCAL, and so on. These are also Stream operators, which are mainly extended forms of redistribution, broadcast, and RoundRobin. We will not introduce them here, but you can refer to the GaussDB DWS product manual.
Let’s consider a simple association of two tables, store_sales and SR_TBL, whose distribution keys are SS_ITEM_SK and SR_expectation D_date_sk, respectively, The Join condition is store_sales.ss_customer_sk = sr_Tbl. sr_customer_sk. The result is as follows:
Because the distribution keys of the two tables are not Join keys, to use the Hash Join path, one table needs to be Broadcast or both tables need to be Redistribute, but the store_SALES table is large (e-rows displays 2.87 billion rows). The estimated number of rows in the SR_TBL table is small (e-rows displays 100 rows), and the optimizer considers it suitable for Broadcast. So the final choice of one side Broadcast plan.
For this plan, because the sr_TBL table statistics are inaccurate (if they are intermediate result sets, they are not estimated correctly), one way to tune the plan is to re-collect the SR_TBL table statistics more accurately (if sr_TBL is intermediate result sets, they cannot be collected), Another approach is to make SR_TBL go to the Redistribute path, which we can implement in two ways: Plan Hint, which tells the optimizer to go to the Redistribute path when generating the Plan, and turn off Broadcast. After Broadcast is disabled, the plan is as follows:
In this column, SMP adaptive is enabled. That is, the optimizer determines the degree of parallelism (DOP) based on system resources and the current number of Active SQL. If Redistribute and Broadcast are improperly selected, this parameter may cause
(1) Broadcast plan will appear
(2) The parallelism of the two plans is different, and the final execution time may be quite different.
For Stream control, the usual tuning options are Plan hints, GUC parameters, improve statistics, or estimate information.
V. Concluding remarks
Cost_param in this document is a lower-level cost parameter. Dbas who are familiar with data features and application scenarios are advised to use it with caution. The basic basis for Scan, Join, and Stream regulation is also cost, which is generally reflected in the execution time. During tuning, Performance bottlenecks can be identified from Performance to analyze whether the selected operator matches the cost. In addition to the session-level control parameters described in this article, you can also control the number of rows in the base table and intermediate results. You can also use the Plan Hint to control the statement level. For details, see the GaussDB DWS product documentation.
This article is shared by huawei cloud community “GaussDB(DWS) Performance Tuning Series Five: The Path of eighteen Martial Arts Intervention”, original author: – Dao zhi Jian -.
Click to follow, the first time to learn about Huawei cloud fresh technology ~