Recently, while computing rankings from a log table, I ran into a particularly slow query. The problem was eventually solved, and along the way I summarized some lessons about indexes and the MySQL execution process, though 5 puzzles remain unsolved.
Thanks to teacher Ding Qi and his course "MySQL Practice: 45 Lectures".
It mainly covers the following points:
- How the number of scanned rows reported in the slow log is counted
- Deriving an optimization scheme from the implementation principle of group by
- Implementation details of sorting
- Source debugging with GDB
Background
We need to compute the TOP 10 most-visited articles for the current month and the current week separately. The log table is as follows:
CREATE TABLE `article_rank` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `aid` int(11) unsigned NOT NULL,
  `pv` int(11) unsigned NOT NULL DEFAULT '1',
  `day` int(11) NOT NULL COMMENT 'date e.g. 20171016',
  PRIMARY KEY (`id`),
  KEY `idx_day_aid_pv` (`day`,`aid`,`pv`),
  KEY `idx_aid_day_pv` (`aid`,`day`,`pv`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
Preparatory work
To verify some of my guesses clearly, I installed a debug build of MySQL in a virtual machine and enabled slow-log collection to count the number of scanned rows.
Installation
- Download the source code
- Compile the installation
- Create mysql user
- Initializing the database
- Initialize the mysql configuration file
- Change the password
If you are interested, you can refer to my blog post at mengkang.net/1335.html and follow along step by step.
Enabling Slow Logs
Edit the configuration file and add it under the [mysqld] block
slow_query_log=1
slow_query_log_file=xxx
long_query_time=0
log_queries_not_using_indexes=1
Performance analysis
Found the problem
Suppose I need to find the top 10 articles by total page views over the 5 days from December 20, 2018 to December 24, 2018. Let's first look at the EXPLAIN output for the query:
mysql> explain select aid,sum(pv) as num from article_rank where day>=20181220 and day<=20181224 group by aid order by num desc limit 10;
+----+-------------+--------------+------------+-------+-------------------------------+----------------+---------+------+--------+----------+-----------------------------------------------------------+
| id | select_type | table        | partitions | type  | possible_keys                 | key            | key_len | ref  | rows   | filtered | Extra                                                     |
+----+-------------+--------------+------------+-------+-------------------------------+----------------+---------+------+--------+----------+-----------------------------------------------------------+
|  1 | SIMPLE      | article_rank | NULL       | range | idx_day_aid_pv,idx_aid_day_pv | idx_day_aid_pv | 4       | NULL | 404607 |   100.00 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+--------------+------------+-------+-------------------------------+----------------+---------+------+--------+----------+-----------------------------------------------------------+
By default the optimizer picks the idx_day_aid_pv index. From the Extra column we can see that with this index the query is covered by the index (Using index), but a temporary table is used (Using temporary) and sorting happens (Using filesort).
Let’s take a look at the records in the slow log
# User@Host: root[root] @ localhost []  Id: 6
# Query_time: 56.959484  Lock_time: 0.000195  Rows_sent: 10  Rows_examined: 1337315
SET timestamp=1552791747;
select aid,sum(pv) as num from article_rank where day>=20181220 and day<=20181224 group by aid order by num desc limit 10;
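As an aside, these slow-log header fields are easy to pull out programmatically. A minimal sketch in Python (the regex is my own, not an official parser; the sample entry is the one captured above):

```python
import re

# A sample slow-log entry, as captured above.
entry = """# Query_time: 56.959484  Lock_time: 0.000195  Rows_sent: 10  Rows_examined: 1337315
SET timestamp=1552791747;
select aid,sum(pv) as num from article_rank where day>=20181220 and day<=20181224 group by aid order by num desc limit 10;"""

# Pull the four metrics out of the "# Query_time: ..." header line.
m = re.search(
    r"# Query_time: (?P<query_time>[\d.]+)\s+Lock_time: (?P<lock_time>[\d.]+)"
    r"\s+Rows_sent: (?P<rows_sent>\d+)\s+Rows_examined: (?P<rows_examined>\d+)",
    entry,
)
stats = {k: (float(v) if "." in v else int(v)) for k, v in m.groupdict().items()}
print(stats["rows_examined"])  # 1337315
```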
Why is the number of scanned rows 1337315?
Let's query two numbers: the number of rows that satisfy the condition, and the number of rows left after group by.
mysql> select count(*) from article_rank where day>=20181220 and day<=20181224;
+----------+
| count(*) |
+----------+
|   785102 |
+----------+

mysql> select count(distinct aid) from article_rank where day>=20181220 and day<=20181224;
+---------------------+
| count(distinct aid) |
+---------------------+
|              552203 |
+---------------------+
Rows satisfying the condition (785102) + rows after group by (552203) + the LIMIT value (10) = the Rows_examined reported in the slow log (1337315).
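This identity can be checked directly:

```python
matching_rows = 785102   # rows satisfying the WHERE condition
group_by_rows = 552203   # distinct aid values after GROUP BY
limit_value = 10

rows_examined = matching_rows + group_by_rows + limit_value
print(rows_examined)  # 1337315, exactly the Rows_examined in the slow log
```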
To answer this question, you need to understand exactly how the above SQL works.
Execution process analysis
Index sample
To make this easier to follow, I first simulate a small portion of the idx_day_aid_pv index, following the index's ordering rules:
day | aid | pv | id |
---|---|---|---|
20181220 | 1 | 23 | 1234 |
20181220 | 3 | 2 | 1231 |
20181220 | 4 | 1 | 1212 |
20181220 | 7 | 2 | 1221 |
20181221 | 1 | 5 | 1257 |
20181221 | 10 | 1 | 1251 |
20181221 | 11 | 8 | 1258 |
Because the leftmost column of index idx_day_aid_pv is day, when we need to compute the pv sum per article between 20181220 and 20181224, we have to traverse all index entries from 20181220 to 20181224.
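Conceptually, the range scan works because the (day, aid, pv) tuples are stored in sorted order, so the day range maps to one contiguous slice of the index. A toy Python illustration over the sample rows above (this models B+-tree ordering only, not InnoDB internals):

```python
from bisect import bisect_left, bisect_right

# (day, aid, pv, id) tuples, sorted exactly like the idx_day_aid_pv index.
index = [
    (20181220, 1, 23, 1234),
    (20181220, 3, 2, 1231),
    (20181220, 4, 1, 1212),
    (20181220, 7, 2, 1221),
    (20181221, 1, 5, 1257),
    (20181221, 10, 1, 1251),
    (20181221, 11, 8, 1258),
]

# Because day is the leftmost column, the condition day BETWEEN lo AND hi
# selects one contiguous slice of the sorted index.
lo, hi = 20181220, 20181224
start = bisect_left(index, (lo,))
end = bisect_right(index, (hi + 1,))
for day, aid, pv, row_id in index[start:end]:
    print(day, aid, pv, row_id)
```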
View optimizer Trace information
# Enable optimizer trace
set optimizer_trace='enabled=on';
# Execute the SQL
select aid,sum(pv) as num from article_rank where day>=20181220 and day<=20181224 group by aid order by num desc limit 10;
# View the trace
select trace from `information_schema`.`optimizer_trace`\G
Extract the last execution result as follows
{
"join_execution": {
"select#": 1,
"steps": [
{
"creating_tmp_table": {
"tmp_table_info": {
"table": "intermediate_tmp_table",
"row_length": 20,
"key_length": 4,
"unique_constraint": false,
"location": "memory (heap)",
"row_limit_estimate": 838860
}
}
},
{
"converting_tmp_table_to_ondisk": {
"cause": "memory_table_size_exceeded",
"tmp_table_info": {
"table": "intermediate_tmp_table",
"row_length": 20,
"key_length": 4,
"unique_constraint": false,
"location": "disk (InnoDB)",
"record_format": "fixed"
}
}
},
{
"filesort_information": [
{
"direction": "desc",
"table": "intermediate_tmp_table",
"field": "num"
}
],
"filesort_priority_queue_optimization": {
"limit": 10,
"rows_estimate": 1057,
"row_size": 36,
"memory_available": 262144,
"chosen": true
},
"filesort_execution": [
],
"filesort_summary": {
"rows": 11,
"examined_rows": 552203,
"number_of_tmp_files": 0,
"sort_buffer_size": 488,
"sort_mode": "<sort_key, additional_fields>"
}
}
]
}
}
Analyze temporary table fields
For more details on debugging MySQL with GDB, see mengkang.net/1336.html.
Using GDB, let's verify that the fields of the temporary table are aid and num:
Breakpoint 1, trace_tmp_table (trace=0x7eff94003088, table=0x7eff94937200) at /root/newdb/mysql-server/sql/sql_tmp_table.cc:2306
warning: Source file is more recent than executable.
2306 trace_tmp.add("row_length",table->s->reclength).
(gdb) p table->s->reclength
$1 = 20
(gdb) p table->s->fields
$2 = 2
(gdb) p (*(table->field+0))->field_name
$3 = 0x7eff94010b0c "aid"
(gdb) p (*(table->field+1))->field_name
$4 = 0x7eff94007518 "num"
(gdb) p (*(table->field+0))->row_pack_length()
$5 = 4
(gdb) p (*(table->field+1))->row_pack_length()
$6 = 15
(gdb) p (*(table->field+0))->type()
$7 = MYSQL_TYPE_LONG
(gdb) p (*(table->field+1))->type()
$8 = MYSQL_TYPE_NEWDECIMAL
(gdb)
The prints above confirm the field types: aid is MYSQL_TYPE_LONG, which packs into 4 bytes, and num is MYSQL_TYPE_NEWDECIMAL, which packs into 15 bytes.
The SUM() and AVG() functions return a DECIMAL value for exact-value arguments (integer or DECIMAL), and a DOUBLE value for approximate-value arguments (FLOAT or DOUBLE). (Before MySQL 5.0.3, SUM() and AVG() return DOUBLE for all numeric arguments.)
But as the prints above show, the two field lengths add up to only 19, while tmp_table_info.row_length in the optimizer trace is 20. Other experiments confirm that table->s->reclength equals the sum of the lengths of all fields in the table->field array plus one.
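So the row_length of 20 can be reproduced from the pack lengths printed above plus one extra byte (my guess is that this byte is a per-row flag such as a null bitmap, but that is an assumption; the experiments only report the +1 as an observation):

```python
# Pack lengths printed by GDB above.
field_lengths = {
    "aid": 4,   # MYSQL_TYPE_LONG
    "num": 15,  # MYSQL_TYPE_NEWDECIMAL, as returned by SUM() over an integer
}

# Observation from the experiments: reclength = sum of field lengths + 1.
reclength = sum(field_lengths.values()) + 1
print(reclength)  # 20, matching row_length in the optimizer trace
```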
Summary of the execution process
1. Try to store the group by data in an in-memory (heap) temporary table; memory turns out to be insufficient.
2. Create a temporary table with two fields, aid and num (sum(pv) as num).
3. Read one row from index idx_day_aid_pv and insert it into the temporary table. Insertion rule: if the aid does not exist yet, insert it directly; if it exists, add the pv value onto num.
4. Loop over the rows of index idx_day_aid_pv that fall between 20181220 and 20181224, performing step 3 for each row.
5. Sort by the num values in the temporary table using a priority queue.
6. Take the 10 rows left in the heap (the priority queue) and return them directly as the result set, with no need to go back to the table.
Supplementary remarks: analysis of the priority queue sort execution steps:
1. Take the first 10 rows from the (unsorted) temporary table and use their (num, aid) values as 10 elements to build a min-heap, i.e. the smallest num sits at the top of the heap.
2. Take the next row and compare its num value with the heap top; if it is greater than the heap top, replace the top element, then re-heapify.
3. Repeat step 2 until all 552203 rows have been compared.
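These steps are the classic top-N pattern over a min-heap. A Python sketch using heapq, with sample (num, aid) aggregates and N=3 so the output stays small (MySQL's priority queue works the same way at N=10):

```python
import heapq

# (num, aid) pairs from the temporary table, in no particular order.
rows = [(28, 1), (2, 3), (1, 4), (2, 7), (1, 10), (8, 11)]
n = 3  # stands in for LIMIT 10

heap = rows[:n]           # take the first n rows
heapq.heapify(heap)       # build a min-heap: smallest num on top
for row in rows[n:]:
    if row > heap[0]:     # larger than the heap top? replace and re-heapify
        heapq.heapreplace(heap, row)

top = sorted(heap, reverse=True)  # final descending order
print(top)  # [(28, 1), (8, 11), (2, 7)]
```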
Optimization
Scheme 1: force the idx_aid_day_pv index
# Query_time: 4.406927  Lock_time: 0.000200  Rows_sent: 10  Rows_examined: 1337315
SET timestamp=1552791804;
select aid,sum(pv) as num from article_rank force index(idx_aid_day_pv) where day>=20181220 and day<=20181224 group by aid order by num desc limit 10;
The number of scanned rows is still 1337315, so why is the execution about 12 times faster?
Index sample
To make this easier to follow, I likewise simulate a small portion of the idx_aid_day_pv index, following the index's ordering rules:
aid | day | pv | id |
---|---|---|---|
1 | 20181220 | 23 | 1234 |
1 | 20181221 | 5 | 1257 |
3 | 20181220 | 2 | 1231 |
3 | 20181222 | 22 | 1331 |
3 | 20181224 | 13 | 1431 |
4 | 20181220 | 1 | 1212 |
7 | 20181220 | 2 | 1221 |
10 | 20181221 | 1 | 1251 |
11 | 20181221 | 8 | 1258 |
When group by does not need a temporary table
In index idx_aid_day_pv, aid is guaranteed to be ordered, so no temporary table is created while executing group by; a temporary table is needed only for the sort. If that is true, we should be able to see it in the following execution plans.
Effect of the idx_day_aid_pv index:
mysql> explain select aid,sum(pv) as num from article_rank force index(idx_day_aid_pv) where day>=20181220 and day<=20181224 group by aid order by null limit 10;
+----+-------------+--------------+------------+-------+-------------------------------+----------------+---------+------+--------+----------+-------------------------------------------+
| id | select_type | table        | partitions | type  | possible_keys                 | key            | key_len | ref  | rows   | filtered | Extra                                     |
+----+-------------+--------------+------------+-------+-------------------------------+----------------+---------+------+--------+----------+-------------------------------------------+
|  1 | SIMPLE      | article_rank | NULL       | range | idx_day_aid_pv,idx_aid_day_pv | idx_day_aid_pv | 4       | NULL | 404607 |   100.00 | Using where; Using index; Using temporary |
+----+-------------+--------------+------------+-------+-------------------------------+----------------+---------+------+--------+----------+-------------------------------------------+
Note that I added order by null above to force the group by result not to be sorted; without it, the SQL above would also show Using filesort.
Effect of the idx_aid_day_pv index:
mysql> explain select aid,sum(pv) as num from article_rank force index(idx_aid_day_pv) where day>=20181220 and day<=20181224 group by aid order by null limit 10;
+----+-------------+--------------+------------+-------+-------------------------------+----------------+---------+------+------+----------+--------------------------+
| id | select_type | table        | partitions | type  | possible_keys                 | key            | key_len | ref  | rows | filtered | Extra                    |
+----+-------------+--------------+------------+-------+-------------------------------+----------------+---------+------+------+----------+--------------------------+
|  1 | SIMPLE      | article_rank | NULL       | index | idx_day_aid_pv,idx_aid_day_pv | idx_aid_day_pv | 12      | NULL |   10 |    11.11 | Using where; Using index |
+----+-------------+--------------+------------+-------+-------------------------------+----------------+---------+------+------+----------+--------------------------+
View optimizer Trace information
# Enable optimizer trace
set optimizer_trace='enabled=on';
# Execute the SQL
select aid,sum(pv) as num from article_rank force index(idx_aid_day_pv) where day>=20181220 and day<=20181224 group by aid order by num desc limit 10;
# View the trace
select trace from `information_schema`.`optimizer_trace`\G
Extract the last execution result as follows
{
"join_execution": {
"select#": 1,
"steps": [
{
"creating_tmp_table": {
"tmp_table_info": {
"table": "intermediate_tmp_table",
"row_length": 20,
"key_length": 0,
"unique_constraint": false,
"location": "memory (heap)",
"row_limit_estimate": 838860
}
}
},
{
"filesort_information": [
{
"direction": "desc",
"table": "intermediate_tmp_table",
"field": "num"
}
],
"filesort_priority_queue_optimization": {
"limit": 10,
"rows_estimate": 552213,
"row_size": 24,
"memory_available": 262144,
"chosen": true
},
"filesort_execution": [
],
"filesort_summary": {
"rows": 11,
"examined_rows": 552203,
"number_of_tmp_files": 0,
"sort_buffer_size": 352,
"sort_mode": "<sort_key, rowid>"
}
}
]
}
}
The execution process is as follows:
1. Create a temporary table with two fields, aid and num (sum(pv) as num).
2. Read a row from index idx_aid_day_pv and check whether it satisfies the condition. If the day field is not within the range (between 20181220 and 20181224), read the next row; if it is, accumulate the pv value (in a running variable, not in the temporary table).
3. Read the next row of index idx_aid_day_pv. If its aid matches the one in step 2 and the condition holds, keep accumulating the pv value (still not touching the temporary table); if the aid differs, write the previous accumulated result into the temporary table.
4. Repeat steps 2 and 3 until index idx_aid_day_pv has been fully scanned.
5. Sort by the num values in the temporary table using a priority queue.
6. Using the rowids of the top 10, go back to the (temporary) table and return the result set.
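The key difference from the first index: because idx_aid_day_pv is ordered by aid, each group is accumulated in a running variable and written to the temporary table only once, when the aid changes. A Python sketch of this streaming aggregation over the sample rows above (illustrative only, not InnoDB internals):

```python
# Rows read from idx_aid_day_pv in index order: (aid, day, pv).
index_rows = [
    (1, 20181220, 23),
    (1, 20181221, 5),
    (3, 20181220, 2),
    (3, 20181222, 22),
    (3, 20181224, 13),
    (4, 20181220, 1),
    (7, 20181220, 2),
    (10, 20181221, 1),
    (11, 20181221, 8),
]

tmp_table = []           # (aid, num) rows, each written exactly once
current_aid, num = None, 0
for aid, day, pv in index_rows:
    if aid != current_aid:
        if current_aid is not None:
            tmp_table.append((current_aid, num))  # flush the finished group
        current_aid, num = aid, 0
    if 20181220 <= day <= 20181224:
        num += pv        # accumulate in a running variable, not the temp table
if current_aid is not None:
    tmp_table.append((current_aid, num))          # flush the last group

print(tmp_table)  # [(1, 28), (3, 37), (4, 1), (7, 2), (10, 1), (11, 8)]
```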
Supplementary remarks: analysis of the priority queue sort execution steps:
1. Take the first 10 rows from the (unsorted) temporary table and use their (num, rowid) values as 10 elements to build a min-heap, i.e. the smallest num sits at the top of the heap.
2. Take the next row and compare its num value with the heap top; if it is greater than the heap top, replace the top element, then re-heapify.
3. Repeat step 2 until all 552203 rows have been compared.
Feasibility of the scheme
In an experiment I added one row with day 20181219. Although that row does not satisfy our condition, the index scan still reads it. In this experiment the table only holds data from 20181220 to 20181224 (5 days), so the number of scanned rows happens to equal the number of rows in the whole table.
What if instead of 5 days the table stored 10 days, or 365 days, of data? Would this scheme still be viable? Let's simulate 10 days of data by adding another 785102 rows, the same as the current row count, spread over the following 5 days:
drop procedure if exists idata;
delimiter ;;
create procedure idata()
begin
declare i int;
declare aid int;
declare pv int;
declare post_day int;
set i=1;
while(i<=785102)do
set aid = round(rand()*500000);
set pv = round(rand()*100);
set post_day = 20181225 + i%5;
insert into article_rank (`aid`,`pv`,`day`) values(aid, pv, post_day);
set i=i+1;
end while;
end;;
delimiter ;
call idata();
# Query_time: 9.151270  Lock_time: 0.000508  Rows_sent: 10  Rows_examined: 2122417
SET timestamp=1552889936;
select aid,sum(pv) as num from article_rank force index(idx_aid_day_pv) where day>=20181220 and day<=20181224 group by aid order by num desc limit 10;
The number of scanned rows is 2122417 because the whole index must be scanned, and the index now covers the whole table: 1570204 rows after inserting another 785102. Adding the 552203 rows written after group by and the LIMIT value gives 1570204 + 552203 + 10 = 2122417.
When I doubled the data volume, the query time also roughly doubled, so this optimization is not stable.
Scheme 2: raise the in-memory temporary table limit
The default in-memory temporary table size limit is 16MB:
mysql> show global variables like '%table_size';
+---------------------+----------+
| Variable_name | Value |
+---------------------+----------+
| max_heap_table_size | 16777216 |
| tmp_table_size | 16777216 |
+---------------------+----------+
See the MySQL manual: dev.mysql.com/doc/refman/… dev.mysql.com/doc/refman/…
max_heap_table_size
This variable sets the maximum size to which user-created MEMORY tables are permitted to grow. The value of the variable is used to calculate MEMORY table MAX_ROWS values. Setting this variable has no effect on any existing MEMORY table, unless the table is re-created with a statement such as CREATE TABLE or altered with ALTER TABLE or TRUNCATE TABLE. A server restart also sets the maximum size of existing MEMORY tables to the global max_heap_table_size value.

tmp_table_size
The maximum size of internal in-memory temporary tables. This variable does not apply to user-created MEMORY tables.
The actual limit is determined from whichever of the values of tmp_table_size and max_heap_table_size is smaller. If an in-memory temporary table exceeds the limit, MySQL automatically converts it to an on-disk temporary table. The internal_tmp_disk_storage_engine option defines the storage engine used for on-disk temporary tables.
That is, the effective limit for an internal in-memory temporary table is the smaller of tmp_table_size and max_heap_table_size.
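The effective limit, and the row_limit_estimate of 838860 seen in the optimizer trace, can be reproduced from these two variables together with the 20-byte row length from the GDB session above:

```python
tmp_table_size = 16 * 1024 * 1024       # 16777216, the default
max_heap_table_size = 16 * 1024 * 1024  # 16777216, the default

# The actual in-memory limit is the smaller of the two variables.
effective_limit = min(tmp_table_size, max_heap_table_size)

row_length = 20  # row_length reported in the optimizer trace
row_limit_estimate = effective_limit // row_length
print(row_limit_estimate)  # 838860, matching the trace
```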
So let's raise both to 32MB and run the original SQL:
set tmp_table_size=33554432;
set max_heap_table_size=33554432;
# Query_time: 5.910553  Lock_time: 0.000210  Rows_sent: 10  Rows_examined: 1337315
SET timestamp=1552803869;
select aid,sum(pv) as num from article_rank where day>=20181220 and day<=20181224 group by aid order by num desc limit 10;
Scheme 3: use SQL_BIG_RESULT
SQL_BIG_RESULT tells the optimizer that the result set will have many rows, so the temporary table should be written to disk directly.
# Query_time: 6.144315  Lock_time: 0.000183  Rows_sent: 10  Rows_examined: 2122417
SET timestamp=1552802804;
select SQL_BIG_RESULT aid,sum(pv) as num from article_rank where day>=20181220 and day<=20181224 group by aid order by num desc limit 10;
The number of scanned rows = 2 × the rows satisfying the condition (785102) + the rows after group by (552203) + the LIMIT value (10) = 2122417.
Incidentally, after I doubled the data volume, the query time stayed almost the same with this method, because the number of scanned rows barely changes. The actual test took 6.197484 seconds.
Conclusion
The optimization effect of scheme 1 is unstable: it performs best when the query range covers roughly the whole table and the data fits within the in-memory temporary table limit; the larger the share of the table that falls outside the query range, the less pronounced the benefit.
Scheme 2 requires raising the in-memory temporary table limit. It works, but if the data keeps growing, the temporary table will eventually exceed 32MB as well, and the limit would have to be raised again.
Scheme 3 declares outright that the temporary table should go to disk. Although the rows satisfying the condition are scanned one extra time, the overall response time is only about 0.1 seconds slower than scheme 2, which for this data volume I consider acceptable.
So in the end, scheme 3 is the most suitable.
Questions and puzzles
# SQL1
select aid,sum(pv) as num from article_rank where day>=20181220 and day<=20181224 group by aid order by num desc limit 10;
# SQL2
select aid,sum(pv) as num from article_rank force index(idx_aid_day_pv) where day>=20181220 and day<=20181224 group by aid order by num desc limit 10;
1. Both SQL1 and SQL2 produce 552203 rows after group by. Why does SQL1's in-memory temporary table run out of memory while SQL2's does not, and what other details differ between them?
2. In both traces, creating_tmp_table.tmp_table_info.row_limit_estimate is 838860, calculated from the 16MB in-memory temporary table limit and the 20 bytes needed per row: at most floor(16777216/20) = 838860 rows fit. Yet the actual number of rows we need to put into the temporary table is 785102, below that estimate. Why, then, is memory exceeded?
3. After SQL1 is optimized with SQL_BIG_RESULT, the number of rows scanned from the original table doubles. What is the logic behind that? And why does merely skipping the attempt to write an in-memory temporary table make a roughly 10-fold difference in performance?
4. From the source code we can see that some of the row counts in the trace are estimates, not the actual numbers of scanned rows. Since the statement really was executed, why doesn't the trace output the real scanned-row counts and capacities, e.g. filesort_priority_queue_optimization.rows_estimate? (For SQL1 I traced the calculation rule with GDB; see Figure 1 in the appendix.)
5. Is there a tool that can count I/O operations during SQL execution?