Abstract: This paper briefly introduces what statistics are, what they record, and why, how, and when to collect them.
1 WHY: Why do I need statistics?
1.1 Query Execution process
The following figure describes the key steps the GaussDB SQL engine goes through from receiving a SQL statement from the client to executing it, along with the factors that may affect execution.
- Lexical & grammatical analysis
According to the SQL grammar rules, the input SQL statement is converted from a string into a structured parse tree (Stmt). If the statement contains syntax errors, they are reported at this stage.
- Semantic parsing
Semantic parsing works like a translator: it maps externally visible objects in the input to objects the database recognizes internally (for example, a table name recorded as a string in the Stmt is resolved to the table's OID inside the database). If the statement contains a semantic error (for example, the queried table does not exist), the database reports the error at this stage.
- Query rewriting
Query rewriting transforms the output of semantic parsing into an equivalent but more execution-friendly structure according to rewrite rules; for example, views referenced in the query statement are expanded layer by layer until only base tables remain.
- Query optimization
The process in which the database determines how the SQL statement will be executed and generates the execution plan.
- Query execution
The process of executing the SQL statement and returning results according to the execution plan.
Throughout this process, the optimizer determines how the query statement is actually executed and plays a key role in SQL performance. Database query optimizers fall into two categories: rule-based optimizers (RBO) and cost-based optimizers (CBO). An RBO applies a fixed execution mode to a fixed scenario. This model is insensitive to the data itself: how the SQL is written largely determines the execution plan, and developers who do not know the RBO's rules in detail cannot control SQL performance, so RBO has been gradually abandoned. Optimizers of current database vendors, including GaussDB, use the CBO model: it generates a set of candidate execution plans for a SQL statement, estimates the cost of each plan, and finally selects the execution mode with the lowest cost.
1.2 The CBO model
When a database executes an SQL statement, it breaks the execution into several steps. Take the following SQL statement as an example:
select *
from t1 join t2 on t1.a=t2.b
where t1.b = 2 and t2.a = 3;
Execution is split into two main query actions: table scan and table join. Table scans can be executed as SeqScan, IndexScan, IndexOnlyScan, BitmapScan, and so on, while table joins have three execution modes: NestLoop, HashJoin, and MergeJoin. The core job of the optimizer is to determine which execution mode of each query action is cheapest in the specific business scenario.
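The optimizer's choice can be observed with EXPLAIN. The plan below is only an illustrative sketch (it assumes t1 and t2 exist with suitable data); the real choice depends on table sizes, indexes, and statistics:
EXPLAIN
SELECT *
FROM t1 JOIN t2 ON t1.a=t2.b
WHERE t1.b = 2 AND t2.a = 3;
-- One possible plan shape: HashJoin chosen over NestLoop/MergeJoin,
-- and SeqScan chosen over index-based scans, purely on estimated cost:
--   Hash Join
--     Hash Cond: (t1.a = t2.b)
--     ->  Seq Scan on t1   Filter: (b = 2)
--     ->  Hash
--           ->  Seq Scan on t2   Filter: (a = 3)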
The CBO's main working principle is to estimate the cost of each execution mode using the cost model and statistics, and then select the execution mode with the best cost. The cost model is the core algorithmic logic; statistics are the data source of the cost calculation. If statistics are missing, the cost model falls back to default values, the computed cost deviates widely from the real cost, and a non-optimal execution plan is very likely to be chosen. Statistics are therefore the data input of cost calculation in the CBO model and one of its most central technologies.
2 WHAT: What are statistics?
Statistics describe the characteristics of table or index data in a database. Common statistics include table-scale information such as the number of records and pages, as well as data-distribution information such as MCVs (the most common non-NULL values), histograms, and correlation.
This article uses the following test case to show how statistics represent the data characteristics of a table:
DROP TABLE public.test;
CREATE TABLE public.test(a int, b int, c int[]);
INSERT INTO public.test VALUES (generate_series(1, 20), generate_series(1, 1200));
INSERT INTO public.test VALUES (generate_series(1, 1200), generate_series(1, 1200));
UPDATE public.test SET c = ('{' || a || ',' || a || '}')::int[] WHERE b <= 1000;
UPDATE public.test SET c = ('{' || a || ',' || b || '}')::int[] WHERE b > 1000;
ANALYZE public.test;
3 WHERE: Where are the statistics stored?
3.1 Table scale information
The reltuples and relpages fields of the system catalog pg_class reflect table-scale information. relpages records how many pages the table data occupies and is mainly used to calculate the cost of scanning data through the storage interface. reltuples records the number of table records and is mainly used to estimate the row count of scan result sets.
Query the estimated table size in pg_class; the estimate is 2400 rows.
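For example (a minimal sketch; these pg_class column names in GaussDB match PostgreSQL):
SELECT relname, relpages, reltuples FROM pg_class WHERE relname = 'test';
-- relname | relpages | reltuples
-- test    |      ... |      2400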
Query the full data of the table and view its estimated size through EXPLAIN: the estimated number of scanned rows is 2400.
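For example (the plan text is illustrative; exact cost figures depend on the environment, but the rows estimate comes from reltuples):
EXPLAIN SELECT * FROM public.test;
-- Seq Scan on test  (cost=0.00..xx  rows=2400  width=xx)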
3.2 Single-column statistics
Single-column statistics describe the data characteristics of a single column of a table and are stored in the system catalog pg_statistic. Because pg_statistic keeps key sample values to describe data characteristics, its contents are sensitive, and only superusers can access it. Users are generally advised to query the system view pg_stats instead, which shows the statistics of tables the current user has permission to query and is more readable. Its key fields include null_frac (fraction of NULLs), n_distinct (number of distinct values), most_common_vals and most_common_freqs (the MCVs and their frequencies), histogram_bounds, and correlation.
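To inspect column a of the test table (a sketch; these pg_stats fields follow PostgreSQL conventions, which GaussDB mirrors):
SELECT null_frac, n_distinct, most_common_vals, most_common_freqs, histogram_bounds
FROM pg_stats
WHERE schemaname = 'public' AND tablename = 'test' AND attname = 'a';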
According to the new statistics, the proportion of NULL values in column a of public.test is 0 and there are 1200 distinct values. The values 1 through 20 are MCVs, each occurring with frequency 0.0254167, and the values 21 through 1200 appear in the histogram statistics.
The query statement
SELECT count(1) FROM public.test WHERE a < 44;
is used as an example to illustrate the role of statistics in row-count estimation during optimization:
A) All MCVs satisfy a < 44; their total fraction is 0.0254167 * 20 = 0.508334.
B) 44 is the third boundary in the histogram, so the fraction of histogram values satisfying a < 44 is (1 - 0.508334) / 100 * (3 - 1) = 0.0098333.
The estimated number of tuples satisfying a < 44 is therefore (0.508334 + 0.0098333) * 2400 = 1243.6 ≈ 1244. The execution plan printed through EXPLAIN is as follows:
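(The plan shape and costs below are illustrative; the key point is the rows=1244 estimate on the scan node:)
EXPLAIN SELECT count(1) FROM public.test WHERE a < 44;
-- Aggregate
--   ->  Seq Scan on test  (rows=1244)
--         Filter: (a < 44)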
3.3 Extended Statistics
Extended statistics are stored in the system catalog pg_statistic_ext. Currently, only multi-column extended statistics are supported. Like pg_statistic, pg_statistic_ext keeps key sample values to describe data characteristics, so its contents are sensitive and only superusers can access it. Users are generally advised to query the system view pg_ext_stats instead, which shows the extended statistics the current user has permission to query.
Consider collecting multi-column statistics when multiple columns of a table are correlated and queries filter, join, or group on those columns. Extended statistics must be collected manually (the collection method is described in the next section). The following shows the statistics for columns (a, b) of table test:
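To inspect them (a sketch; the filter columns of pg_ext_stats are assumed to be analogous to those of pg_stats):
SELECT * FROM pg_ext_stats
WHERE schemaname = 'public' AND tablename = 'test';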
4 HOW: How do I generate statistics?
4.1 Explicitly Collecting Statistics
4.1.1 Single-column statistics
Run the following command to collect single-column statistics:
{ ANALYZE | ANALYSE } [ VERBOSE ] [ table_name [ ( column_name [, …] ) ] ];
As the syntax shows, statistics can be collected for specified columns, but in practice it is difficult to know which columns of a table the business SQL actually relies on for cost estimation. Therefore, collecting statistics on the whole table is generally recommended.
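For example, on the test table from above:
-- Recommended: collect statistics for the whole table.
ANALYZE public.test;
-- Also legal, if you are certain only column a matters to your workload:
ANALYZE public.test (a);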
4.1.2 Extended Statistics
Run the following command to collect multi-column statistics:
{ANALYZE | ANALYSE} [ VERBOSE ] table_name (( column_1_name, column_2_name [, …] ));
Note that extended statistics can currently be generated only in percentage sampling mode. Therefore, before collecting extended statistics, make sure the GUC parameter default_statistics_target is set to a negative value.
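A minimal sketch combining both steps (-2 is an arbitrary example value, meaning a 2% sample):
SET default_statistics_target = -2;  -- negative value: percentage sampling
ANALYZE public.test ((a, b));        -- multi-column statistics on (a, b)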
4.2 Improving the quality of statistics
ANALYZE draws samples from the table using a random sampling algorithm and computes the table's data characteristics from those samples. The sample size is controlled by the GUC parameter default_statistics_target, whose value ranges from -100 to 10000 with a default of 100.
1) When default_statistics_target > 0, the number of samples is 300 * default_statistics_target. The larger the value, the larger the sample, the more memory the sample occupies, and the longer the computation takes.
2) When default_statistics_target < 0, the number of samples is |default_statistics_target| / 100 * the total number of rows in the table. The smaller (more negative) the value, the larger the sample. In this mode the sampled data is spilled to disk, so sample memory usage is not a concern, but an overly large sample still makes the computation slow.
Because the actual number of samples when default_statistics_target < 0 is |default_statistics_target| / 100 * the total rows of the table, this mode is also called percentage sampling.
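Worked out for the 2400-row test table (the values are examples):
SET default_statistics_target = 100;  -- sample = 300 * 100 = 30000 rows: more than
                                      -- the table holds, so every row is sampled
SET default_statistics_target = -5;   -- sample = 5 / 100 * 2400 = 120 rows,
                                      -- spilled to disk rather than held in memory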
4.3 Collecting Statistics Automatically
When the autoanalyze GUC parameter is enabled and the optimizer finds that a table referenced by the query has no statistics, it collects statistics automatically to meet its own needs. Take the test case in this article as an example.
Note: Automatic statistics collection is triggered only by SQL statements that perform statistics-sensitive, complex query actions (such as multi-table joins). Simple queries (such as single-point lookups or single-table aggregations) do not trigger it.
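A sketch of triggering it (assuming the autoanalyze GUC described above; exact behavior varies by version):
SET autoanalyze = on;
-- The join below is a statistics-sensitive complex query action; if public.test
-- had no statistics, the optimizer would collect them automatically before planning.
SELECT count(1)
FROM public.test t1 JOIN public.test t2 ON t1.a = t2.b;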
5 WHEN: When are statistics collected?
5.1 Large-scale data changes
Large-scale data import, UPDATE, or DELETE operations change the number of rows in a table, so the table's statistics should be collected again afterwards.
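For example (a sketch on the test table):
-- A bulk delete changes the table scale, so refresh the statistics:
DELETE FROM public.test WHERE a > 600;
ANALYZE public.test;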
5.2 Querying New Data
The most common, and most easily overlooked, scenario of stale statistics has the following characteristics:
- A business table grows continuously over time
- The table receives a new day's data every day
- The new data is queried for processing and analysis as soon as it is loaded
For example, with the filter condition passtime > '2020-01-19 00:00:00' AND passtime < '2020-01-20 00:00:00', if the newly loaded day's data has not been analyzed, the optimizer sees that the passtime values in the filter exceed the maximum passtime recorded in the statistics and estimates the condition to match only 1 tuple, distorting the row-count estimate.
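The remedy is to analyze right after each daily load, before the new data is queried (a sketch; traffic_flow and the staging table are hypothetical names for the scenario above):
INSERT INTO traffic_flow SELECT * FROM traffic_flow_staging;  -- hypothetical daily load
ANALYZE traffic_flow;  -- refresh statistics so the recorded maximum of passtime is current
SELECT count(1) FROM traffic_flow
WHERE passtime > '2020-01-19 00:00:00' AND passtime < '2020-01-20 00:00:00';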
6 WHO: Who collects statistics?
In AP scenarios, business tables are large, a single import brings in a large amount of data, and data is often queried immediately after being imported. You are therefore advised to run ANALYZE on the affected tables as needed, based on data changes and query characteristics.
This document is shared in the Huawei Cloud community article "GaussDB(DWS) Performance Tuning Basics 1: The Beginning of All Things: ANALYZE Statistics".