
Recently, while computing rankings from a log table, I ran into an unusually slow query. The problem was eventually solved, and along the way I summarized some lessons about indexes and MySQL's execution process, but 5 puzzles remain unsolved.

Thanks to Ding Qi and his course "MySQL Practice: 45 Lectures".

It mainly covers the following topics:

  • How the slow log counts the number of scanned rows
  • Deriving an optimization scheme from the group by implementation principle
  • Implementation details of sorting
  • GDB source-level debugging

Background

The top 10 most-viewed articles need to be counted separately for the current month and the current week. The log table is as follows:

CREATE TABLE `article_rank` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `aid` int(11) unsigned NOT NULL,
  `pv` int(11) unsigned NOT NULL DEFAULT '1',
  `day` int(11) NOT NULL COMMENT 'date e.g. 20171016',
  PRIMARY KEY (`id`),
  KEY `idx_day_aid_pv` (`day`,`aid`,`pv`),
  KEY `idx_aid_day_pv` (`aid`,`day`,`pv`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

Preparation

To verify some of my guesses clearly, I installed a debug build of MySQL in a virtual machine and enabled the slow query log to record the number of scanned rows.

Installation

  • Download the source code
  • Compile and install
  • Create the mysql user
  • Initialize the database
  • Initialize the mysql configuration file
  • Change the password

If you are interested, you can follow my blog post at mengkang.net/1335.html step by step.

Enabling Slow Logs

Edit the configuration file and add the following under the [mysqld] block:

slow_query_log=1
slow_query_log_file=xxx
long_query_time=0
log_queries_not_using_indexes=1

Performance analysis

Found the problem

Suppose I need to query the top 10 articles by page views over the 5 days from December 20 to December 24, 2018. Let's first look at the analysis with EXPLAIN:

mysql> explain select aid,sum(pv) as num from article_rank where day>=20181220 and day<=20181224 group by aid order by num desc limit 10;
+----+-------------+--------------+------------+-------+-------------------------------+----------------+---------+------+--------+----------+-----------------------------------------------------------+
| id | select_type | table        | partitions | type  | possible_keys                 | key            | key_len | ref  | rows   | filtered | Extra                                                     |
+----+-------------+--------------+------------+-------+-------------------------------+----------------+---------+------+--------+----------+-----------------------------------------------------------+
|  1 | SIMPLE      | article_rank | NULL       | range | idx_day_aid_pv,idx_aid_day_pv | idx_day_aid_pv | 4       | NULL | 404607 |   100.00 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+--------------+------------+-------+-------------------------------+----------------+---------+------+--------+----------+-----------------------------------------------------------+

By default the optimizer chooses the index idx_day_aid_pv. From the Extra information we can see that with idx_day_aid_pv the query is covered by the index (Using index), but it uses a temporary table (Using temporary) and involves sorting (Using filesort).

Let’s take a look at the records in the slow log

# User@Host: root[root] @ localhost []  Id: 6
# Query_time: 56.959484  Lock_time: 0.000195  Rows_sent: 10  Rows_examined: 1337315
SET timestamp=1552791747;
select aid,sum(pv) as num from article_rank where day>=20181220 and day<=20181224 group by aid order by num desc limit 10;

Why is the number of scanned rows 1337315?

Let's query two numbers: the number of rows matching the condition, and the number of rows left after group by.

mysql> select count(*) from article_rank where day>=20181220 and day<=20181224;
+----------+
| count(*) |
+----------+
|   785102 |
+----------+

mysql> select count(distinct aid) from article_rank where day>=20181220 and day<=20181224;
+---------------------+
| count(distinct aid) |
+---------------------+
|              552203 |
+---------------------+

Rows matching the condition (785102) + rows after group by (552203) + the limit value (10) = the Rows_examined reported in the slow log.
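That identity can be checked with a couple of lines of arithmetic (plain Python, simply re-doing the sum from the slow-log numbers):

```python
matched_rows = 785102   # rows matching the where condition
group_rows   = 552203   # distinct aid values after group by
limit_value  = 10

rows_examined = matched_rows + group_rows + limit_value
print(rows_examined)  # 1337315, exactly the Rows_examined in the slow log
```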

To answer this question, you need to understand exactly how the above SQL works.

Execution process analysis

The index sample

To make it easier to understand, I first simulate a small portion of the idx_day_aid_pv index, following its ordering rules.

day aid pv id
20181220 1 23 1234
20181220 3 2 1231
20181220 4 1 1212
20181220 7 2 1221
20181221 1 5 1257
20181221 10 1 1251
20181221 11 8 1258

Because the leftmost column of the index idx_day_aid_pv is day, when we need the pv sum of articles between 20181220 and 20181224, we have to traverse the index entries from 20181220 through 20181224.
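As a toy model (not MySQL internals), the range scan over the leftmost column can be sketched with a sorted list and binary search; the tuples below are the simulated index entries from the table above:

```python
from bisect import bisect_left

# simulated idx_day_aid_pv entries: (day, aid, pv, id), kept in index order
index = [
    (20181220, 1, 23, 1234),
    (20181220, 3, 2, 1231),
    (20181220, 4, 1, 1212),
    (20181220, 7, 2, 1221),
    (20181221, 1, 5, 1257),
    (20181221, 10, 1, 1251),
    (20181221, 11, 8, 1258),
]

# a range scan on the leftmost column `day`: locate the first entry with
# day >= 20181220, then walk forward while day <= 20181224
start = bisect_left(index, (20181220,))
scanned = []
for entry in index[start:]:
    if entry[0] > 20181224:
        break
    scanned.append(entry)
print(len(scanned))  # all 7 sample entries fall inside the range
```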

View the optimizer trace information

# enable the optimizer trace
set optimizer_trace='enabled=on';
# execute the SQL
select aid,sum(pv) as num from article_rank where day>=20181220 and day<=20181224 group by aid order by num desc limit 10;
# view the trace
select trace from `information_schema`.`optimizer_trace`\G

The last part of the execution result is extracted as follows:

{
  "join_execution": {
    "select#": 1,
    "steps": [
      {
        "creating_tmp_table": {
          "tmp_table_info": {
            "table": "intermediate_tmp_table",
            "row_length": 20,
            "key_length": 4,
            "unique_constraint": false,
            "location": "memory (heap)",
            "row_limit_estimate": 838860
          }
        }
      },
      {
        "converting_tmp_table_to_ondisk": {
          "cause": "memory_table_size_exceeded",
          "tmp_table_info": {
            "table": "intermediate_tmp_table",
            "row_length": 20,
            "key_length": 4,
            "unique_constraint": false,
            "location": "disk (InnoDB)",
            "record_format": "fixed"
          }
        }
      },
      {
        "filesort_information": [
          {
            "direction": "desc",
            "table": "intermediate_tmp_table",
            "field": "num"
          }
        ],
        "filesort_priority_queue_optimization": {
          "limit": 10,
          "rows_estimate": 1057,
          "row_size": 36,
          "memory_available": 262144,
          "chosen": true
        },
        "filesort_execution": [
        ],
        "filesort_summary": {
          "rows": 11,
          "examined_rows": 552203,
          "number_of_tmp_files": 0,
          "sort_buffer_size": 488,
          "sort_mode": "<sort_key, additional_fields>"
        }
      }
    ]
  }
}

Analyze temporary table fields

For more details on debugging MySQL with GDB, see mengkang.net/1336.html

Using GDB, we can verify that the temporary table's fields are aid and num:

Breakpoint 1, trace_tmp_table (trace=0x7eff94003088, table=0x7eff94937200) at /root/newdb/mysql-server/sql/sql_tmp_table.cc:2306
warning: Source file is more recent than executable.
2306      trace_tmp.add("row_length",table->s->reclength).
(gdb) p table->s->reclength
$1 = 20
(gdb) p table->s->fields
$2 = 2
(gdb) p (*(table->field+0))->field_name
$3 = 0x7eff94010b0c "aid"
(gdb) p (*(table->field+1))->field_name
$4 = 0x7eff94007518 "num"
(gdb) p (*(table->field+0))->row_pack_length()
$5 = 4
(gdb) p (*(table->field+1))->row_pack_length()
$6 = 15
(gdb) p (*(table->field+0))->type()
$7 = MYSQL_TYPE_LONG
(gdb) p (*(table->field+1))->type()
$8 = MYSQL_TYPE_NEWDECIMAL
(gdb)

The output above confirms the field types: aid is MYSQL_TYPE_LONG, which is 4 bytes, and num is MYSQL_TYPE_NEWDECIMAL, which is 15 bytes.

The SUM() and AVG() functions return a DECIMAL value for exact-value arguments (integer or DECIMAL), and a DOUBLE value for approximate-value arguments (FLOAT or DOUBLE). (Before MySQL 5.0.3, SUM() and AVG() return DOUBLE for all numeric arguments.)

But as the output above shows, the lengths of the two fields add up to 19, while tmp_table_info.row_length in the optimizer trace is 20. Other experiments show that table->s->reclength equals the sum of the lengths of all fields in the table->field array plus one byte.
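That accounting can be double-checked with trivial arithmetic (the one extra byte is an observation from the experiments above, not something traced in the source):

```python
aid_len = 4    # MYSQL_TYPE_LONG, row_pack_length() printed above
num_len = 15   # MYSQL_TYPE_NEWDECIMAL, row_pack_length() printed above
extra   = 1    # the additional byte observed in table->s->reclength

print(aid_len + num_len + extra)  # 20, matching row_length in the trace
```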

Summary of the execution process

  1. Try to store the group by data in an in-memory (heap) temporary table, and find that the memory is insufficient;
  2. Create a temporary table with two fields, aid and num (sum(pv) as num);
  3. Read one row from the index idx_day_aid_pv and insert it into the temporary table: if the aid does not exist yet, insert a new row directly; if it exists, add the pv value onto num;
  4. Loop over the entries of idx_day_aid_pv with day between 20181220~20181224, repeating step 3;
  5. Sort the temporary table by num using a priority queue;
  6. Take the 10 rows left in the (priority-queue) heap and return them directly as the result set, without going back to the table.
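Steps 2 through 4 can be sketched as a toy model in Python (a dictionary stands in for the temporary table; this illustrates the insert-or-accumulate rule, not MySQL's actual implementation):

```python
def tmp_table_group_sum(index_rows, lo=20181220, hi=20181224):
    """Model of the default plan over idx_day_aid_pv: rows arrive ordered by
    day, so every row needs a lookup in the temporary table -- insert a new
    row if the aid is absent, otherwise add pv onto the existing num."""
    tmp_table = {}  # aid -> num
    for day, aid, pv in index_rows:
        if not (lo <= day <= hi):
            continue
        if aid in tmp_table:
            tmp_table[aid] += pv   # aid exists: accumulate pv onto num
        else:
            tmp_table[aid] = pv    # aid absent: insert a new row
    return tmp_table

# the simulated idx_day_aid_pv sample from above, as (day, aid, pv)
sample = [
    (20181220, 1, 23), (20181220, 3, 2), (20181220, 4, 1),
    (20181220, 7, 2), (20181221, 1, 5), (20181221, 10, 1), (20181221, 11, 8),
]
print(tmp_table_group_sum(sample))  # {1: 28, 3: 2, 4: 1, 7: 2, 10: 1, 11: 8}
```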

A supplementary note on the execution steps of the priority-queue sort:

  1. Take the first 10 rows from the temporary table (unsorted) and build a min-heap of 10 elements from their num and aid values, so that the smallest num sits at the top of the heap.
  2. Take the next row and compare its num with the value at the top of the heap; if it is greater, replace the top element and re-heapify.
  3. Repeat step 2 until all 552203 rows have been compared.
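The three steps above can be sketched with Python's heapq (a simplified model of the priority-queue optimization; tie-breaking and the sort key layout are not meant to match MySQL exactly):

```python
import heapq

def top_n_by_num(rows, n=10):
    """Priority-queue (min-heap) top-N over (aid, num) pairs from the
    temporary table; input order is arbitrary."""
    heap = []  # min-heap of (num, aid); the smallest num sits at heap[0]
    for aid, num in rows:
        if len(heap) < n:
            heapq.heappush(heap, (num, aid))
        elif num > heap[0][0]:
            # larger than the current heap top: replace it and re-heapify
            heapq.heapreplace(heap, (num, aid))
    # the heap now holds the n largest nums; sort descending for the result
    return [(aid, num) for num, aid in sorted(heap, reverse=True)]

rows = [(1, 23), (3, 2), (4, 1), (7, 2), (10, 1), (11, 8)]
print(top_n_by_num(rows, n=3))  # [(1, 23), (11, 8), (7, 2)]
```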

Optimization

Scheme 1: use the idx_aid_day_pv index

# Query_time: 4.406927  Lock_time: 0.000200  Rows_sent: 10  Rows_examined: 1337315
SET timestamp=1552791804;
select aid,sum(pv) as num from article_rank force index(idx_aid_day_pv) where day>=20181220 and day<=20181224 group by aid order by num desc limit 10;

The number of scanned rows is still 1337315, so why is the execution roughly 12 times faster?

The index sample

To make it easier to understand, I also simulate a small portion of the idx_aid_day_pv index, following its ordering rules.

aid day pv id
1 20181220 23 1234
1 20181221 5 1257
3 20181220 2 1231
3 20181222 22 1331
3 20181224 13 1431
4 20181220 1 1212
7 20181220 2 1221
10 20181221 1 1251
11 20181221 8 1258

When group by does not need a temporary table

In the idx_aid_day_pv index, aid is guaranteed to be ordered, so no temporary table is created while executing group by; a temporary table is only needed for the sort. If that is true, we can verify it through the following execution plans.

With the idx_day_aid_pv index:

mysql> explain select aid,sum(pv) as num from article_rank force index(idx_day_aid_pv) where day>=20181220 and day<=20181224 group by aid order by null limit 10;
+----+-------------+--------------+------------+-------+-------------------------------+----------------+---------+------+--------+----------+-------------------------------------------+
| id | select_type | table        | partitions | type  | possible_keys                 | key            | key_len | ref  | rows   | filtered | Extra                                     |
+----+-------------+--------------+------------+-------+-------------------------------+----------------+---------+------+--------+----------+-------------------------------------------+
|  1 | SIMPLE      | article_rank | NULL       | range | idx_day_aid_pv,idx_aid_day_pv | idx_day_aid_pv | 4       | NULL | 404607 |   100.00 | Using where; Using index; Using temporary |
+----+-------------+--------------+------------+-------+-------------------------------+----------------+---------+------+--------+----------+-------------------------------------------+

Note that I used order by null above to prevent the group by result from being sorted; without order by null, the SQL above would show Using filesort.

With the idx_aid_day_pv index:

mysql> explain select aid,sum(pv) as num from article_rank force index(idx_aid_day_pv) where day>=20181220 and day<=20181224 group by aid order by null limit 10;
+----+-------------+--------------+------------+-------+-------------------------------+----------------+---------+------+------+----------+--------------------------+
| id | select_type | table        | partitions | type  | possible_keys                 | key            | key_len | ref  | rows | filtered | Extra                    |
+----+-------------+--------------+------------+-------+-------------------------------+----------------+---------+------+------+----------+--------------------------+
|  1 | SIMPLE      | article_rank | NULL       | index | idx_day_aid_pv,idx_aid_day_pv | idx_aid_day_pv | 12      | NULL |   10 |    11.11 | Using where; Using index |
+----+-------------+--------------+------------+-------+-------------------------------+----------------+---------+------+------+----------+--------------------------+

View the optimizer trace information

# enable the optimizer trace
set optimizer_trace='enabled=on';
# execute the SQL
select aid,sum(pv) as num from article_rank force index(idx_aid_day_pv) where day>=20181220 and day<=20181224 group by aid order by num desc limit 10;
# view the trace
select trace from `information_schema`.`optimizer_trace`\G

The last part of the execution result is extracted as follows:

{
  "join_execution": {
    "select#": 1,
    "steps": [
      {
        "creating_tmp_table": {
          "tmp_table_info": {
            "table": "intermediate_tmp_table",
            "row_length": 20,
            "key_length": 0,
            "unique_constraint": false,
            "location": "memory (heap)",
            "row_limit_estimate": 838860
          }
        }
      },
      {
        "filesort_information": [
          {
            "direction": "desc",
            "table": "intermediate_tmp_table",
            "field": "num"
          }
        ],
        "filesort_priority_queue_optimization": {
          "limit": 10,
          "rows_estimate": 552213,
          "row_size": 24,
          "memory_available": 262144,
          "chosen": true
        },
        "filesort_execution": [
        ],
        "filesort_summary": {
          "rows": 11,
          "examined_rows": 552203,
          "number_of_tmp_files": 0,
          "sort_buffer_size": 352,
          "sort_mode": "<sort_key, rowid>"
        }
      }
    ]
  }
}

The execution process is as follows

  1. Create a temporary table with two fields, aid and num (sum(pv) as num);
  2. Read one row from the index idx_aid_day_pv and check it against the condition: if the day field is outside the range (between 20181220~20181224), read the next row; if day is within the range, accumulate the pv value (in a running sum, not in the temporary table);
  3. Read the next row of idx_aid_day_pv: if the aid is the same as in the previous step and the condition holds, keep accumulating the pv value (still not touching the temporary table); if the aid differs, write the previous accumulated result into the temporary table;
  4. Repeat steps 2 and 3 until the whole idx_aid_day_pv index has been scanned;
  5. Sort the temporary table by num using a priority queue;
  6. Use the rowid of the top 10 entries to look up the temporary table and return the result set.
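The accumulate-then-flush behavior in steps 2 and 3 can be sketched as follows (again a toy model, not MySQL internals; because the index is ordered by aid, no per-row lookup in the temporary table is needed):

```python
def stream_group_sum(index_rows, lo=20181220, hi=20181224):
    """Model of the scan over idx_aid_day_pv: rows arrive ordered by aid, so
    each group is summed in a running accumulator and flushed to the
    'temporary table' only when the aid changes."""
    tmp_table = []             # list of (aid, num), one flush per group
    cur_aid, cur_sum = None, 0
    for aid, day, pv in index_rows:
        if not (lo <= day <= hi):
            continue           # out-of-range rows are skipped (but still scanned)
        if aid != cur_aid:
            if cur_aid is not None:
                tmp_table.append((cur_aid, cur_sum))  # flush the previous group
            cur_aid, cur_sum = aid, 0
        cur_sum += pv
    if cur_aid is not None:
        tmp_table.append((cur_aid, cur_sum))          # flush the last group
    return tmp_table

# the simulated idx_aid_day_pv sample from above, as (aid, day, pv)
sample = [
    (1, 20181220, 23), (1, 20181221, 5),
    (3, 20181220, 2), (3, 20181222, 22), (3, 20181224, 13),
    (4, 20181220, 1), (7, 20181220, 2),
    (10, 20181221, 1), (11, 20181221, 8),
]
print(stream_group_sum(sample))  # [(1, 28), (3, 37), (4, 1), (7, 2), (10, 1), (11, 8)]
```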

A supplementary note on the execution steps of the priority-queue sort:

  1. Take the first 10 rows from the temporary table (unsorted) and build a min-heap of 10 elements from their num and rowid values, so that the smallest num sits at the top of the heap.
  2. Take the next row and compare its num with the value at the top of the heap; if it is greater, replace the top element and re-heapify.
  3. Repeat step 2 until all 552203 rows have been compared.

Feasibility of the scheme

In an experiment I added a row with day 20181219; although it does not match our condition, the index scan still reads it. Because the table in this experiment only holds data for the 5 days from 20181220 to 20181224, the number of scanned rows happens to equal the number of rows in the whole table.

What if the table stores not 5 but 10 days of data, or 365 days? Is this scheme still viable? Let's first simulate 10 days of data: on top of the existing rows, add another 5 days with the same number of rows as now, 785102.

drop procedure if exists idata;
delimiter ;;
create procedure idata()
begin
  declare i int;
  declare aid int;
  declare pv int;
  declare post_day int;
  set i=1;
  while(i<=785102)do
    set aid = round(rand()*500000);
    set pv = round(rand()*100);
    set post_day = 20181225 + i%5;
    insert into article_rank (`aid`,`pv`,`day`) values(aid, pv, post_day);
    set i=i+1;
  end while;
end;;
delimiter ;
call idata();

# Query_time: 9.151270  Lock_time: 0.000508  Rows_sent: 10  Rows_examined: 2122417
SET timestamp=1552889936;
select aid,sum(pv) as num from article_rank force index(idx_aid_day_pv) where day>=20181220 and day<=20181224 group by aid order by num desc limit 10;

The number of scanned rows is now 2122417 because the whole index has to be scanned, and the whole index has as many entries as the whole table, since I just inserted another 785102 rows.

After doubling the data volume, the query time roughly doubles as well, so this optimization is not stable.

Scheme 2: raise the in-memory temporary table size limit

The default in-memory temporary table size limit is 16MB:

mysql> show global variables like '%table_size';
+---------------------+----------+
| Variable_name       | Value    |
+---------------------+----------+
| max_heap_table_size | 16777216 |
| tmp_table_size      | 16777216 |
+---------------------+----------+

See the MySQL manual for the two variables: dev.mysql.com/doc/refman/… and dev.mysql.com/doc/refman/…

max_heap_table_size

This variable sets the maximum size to which user-created MEMORY tables are permitted to grow. The value of the variable is used to calculate MEMORY table MAX_ROWS values. Setting this variable has no effect on any existing MEMORY table, unless the table is re-created with a statement such as CREATE TABLE or altered with ALTER TABLE or TRUNCATE TABLE. A server restart also sets the maximum size of existing MEMORY tables to the global max_heap_table_size value.

tmp_table_size

The maximum size of internal in-memory temporary tables. This variable does not apply to user-created MEMORY tables.

The actual limit is determined from whichever of the values of tmp_table_size and max_heap_table_size is smaller. If an in-memory temporary table exceeds the limit, MySQL automatically converts it to an on-disk temporary table. The internal_tmp_disk_storage_engine option defines the storage engine used for on-disk temporary tables.

In other words, the effective limit is the smaller of tmp_table_size and max_heap_table_size.

So here we raise both to 32MB and execute the original SQL:

set tmp_table_size=33554432;
set max_heap_table_size=33554432;

# Query_time: 5.910553  Lock_time: 0.000210  Rows_sent: 10  Rows_examined: 1337315
SET timestamp=1552803869;
select aid,sum(pv) as num from article_rank where day>=20181220 and day<=20181224 group by aid order by num desc limit 10;

Scheme 3: use the SQL_BIG_RESULT hint

This tells the optimizer that the query produces many rows, so the temporary table is stored directly on disk.

# Query_time: 6.144315  Lock_time: 0.000183  Rows_sent: 10  Rows_examined: 2122417
SET timestamp=1552802804;
select SQL_BIG_RESULT aid,sum(pv) as num from article_rank where day>=20181220 and day<=20181224 group by aid order by num desc limit 10;

The number of scanned rows is 2 × the rows matching the condition (785102) + the rows after group by (552203) + the limit value (10) = 2122417, matching the Rows_examined in the slow log.

Incidentally, when I doubled the data volume, the query time with this method stayed almost the same, because the number of scanned rows is unchanged. An actual test took 6.197484 seconds.

Conclusion

The optimization effect of scheme 1 is unstable: it performs best when the queried range covers nearly the whole table and the data fits within the in-memory temporary table limit, but the benefit fades as the queried range becomes a smaller share of the table. Scheme 2 requires tuning the temporary table memory size; it works, but if the data grows past 32MB the limit has to be raised again. Scheme 3 declares up front that the temporary table should go to disk; although its scanned rows include an extra pass over the matching rows, its overall response time is only about 0.1 seconds slower than scheme 2, which, given the data volume involved, I consider acceptable.

Therefore, scheme 3 is the most suitable in the end.

Questions and puzzles

# SQL1
select aid,sum(pv) as num from article_rank where day>=20181220 and day<=20181224 group by aid order by num desc limit 10;
# SQL2
select aid,sum(pv) as num from article_rank force index(idx_aid_day_pv) where day>=20181220 and day<=20181224 group by aid order by num desc limit 10;
  1. Mysql > select * from SQL1; mysql > select * from SQL1;
  2. SQL1 and SQL2 both produce 552203 rows after group by; why does SQL1 run out of memory, and what other details differ between them?
  3. In both traces, creating_tmp_table.tmp_table_info.row_limit_estimate is 838860, calculated from the 16MB in-memory temporary table limit divided by the 20-byte row size: floor(16777216/20) = 838860 rows. But the actual number of rows we need to put into the temporary table is 785102. Why is that?
  4. After SQL1 is optimized with SQL_BIG_RESULT, the number of rows scanned in the original table doubles. What is the logic behind this? Why does simply skipping the attempt to build the temporary table in memory make a 10-fold difference in performance?
  5. From the source code we can see that many of the row counts in the trace output are estimates rather than actual scan counts, for example filesort_priority_queue_optimization.rows_estimate. Since the statement is actually executed, why doesn't the trace output the real scanned-row counts and capacities? For SQL1 I traced the calculation rule of rows_estimate with GDB, as shown in Figure 1 in the appendix.
  6. Is there a tool that can count I/O operations during SQL execution?

Appendix