Preface
Hello, everyone. I am a little boy picking up field snails.
When we implement paging requirements in daily work, we usually use limit, but when the offset is very large, the query becomes inefficient. This article discusses how to optimize deep paging over millions of rows of MySQL data in four scenarios, along with a recent case of optimizing a slow SQL query in production.
- WeChat official account: a little boy picking up field snails
Why is limit deep paging slow?
First look at the table structure:
```sql
CREATE TABLE account (
  id int(11) NOT NULL AUTO_INCREMENT COMMENT 'primary key id',
  name varchar(255) DEFAULT NULL COMMENT 'account name',
  balance int(11) DEFAULT NULL COMMENT 'balance',
  create_time datetime NOT NULL COMMENT 'create time',
  update_time datetime NOT NULL ON UPDATE CURRENT_TIMESTAMP COMMENT 'update time',
  PRIMARY KEY (id),
  KEY idx_name (name),
  KEY idx_update_time (update_time)
) ENGINE=InnoDB AUTO_INCREMENT=1570068 DEFAULT CHARSET=utf8 ROW_FORMAT=REDUNDANT COMMENT='account table';
```
Suppose the SQL for deep paging is as follows:
```sql
select id, name, balance
from account
where update_time > '2020-09-19'
limit 100000, 10;
```
The execution time of this SQL is as follows:
It takes 0.742 seconds to complete. Why does deep paging slow it down? If you switch to limit 0,10, it takes only 0.006 seconds.
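To see the gap on your own data, here is a minimal comparison sketch (the timings quoted above are from the author's environment; yours will differ):

```sql
-- Deep page: around 0.742s in the author's test
select id, name, balance from account where update_time > '2020-09-19' limit 100000, 10;

-- First page: around 0.006s in the author's test
select id, name, balance from account where update_time > '2020-09-19' limit 0, 10;
```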
Let’s take a look at the execution flow of this SQL:
- Use the secondary index tree idx_update_time to filter on the update_time condition and find the record ids that satisfy it.
- Using those ids, go back to the primary key index tree, locate the matching rows, and fetch the columns to be displayed (the back-to-table lookup).
- Scan the 100010 rows that meet the criteria, then discard the first 100000 rows and return the last 10.
SQL execution flow
The execution plan is as follows:
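If you want to reproduce the plan yourself, a minimal sketch (the exact output depends on your MySQL version and data distribution):

```sql
-- Ask MySQL how it plans to execute the deep-paging query
EXPLAIN select id, name, balance
from account
where update_time > '2020-09-19'
limit 100000, 10;
```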
SQL is slow for two reasons:
- The limit statement first scans offset + n rows, then discards the first offset rows and returns the last n rows. That is, limit 100000, 10 scans 100010 rows, whereas limit 0, 10 scans only 10 rows.
- Because limit 100000, 10 scans more rows, it also means more back-to-table lookups.
Optimization through subqueries
In the SQL above we only need 10 rows of data, so only 10 back-to-table lookups are really necessary. Therefore, we can optimize by reducing the number of back-to-table lookups.
Review the B+ tree structure
So, how do we reduce the number of back-to-table lookups? Let's first review the B+ tree index structure.
In InnoDB, indexes are divided into primary key indexes (clustered indexes) and secondary indexes (see the sketch after this list):
- Primary key index: the leaf nodes hold the entire row of data.
- Secondary index: the leaf nodes hold the primary key value.
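To make the back-to-table cost concrete, here is a small sketch using the account table above (the plans described in the comments are typical, but your optimizer may choose differently):

```sql
-- Covered by idx_update_time: the secondary index already contains the primary key id,
-- so Extra typically shows "Using index" and no back-to-table lookup is needed.
EXPLAIN select id from account where update_time > '2020-09-19';

-- name and balance live only in the clustered index, so every matching row needs
-- a back-to-table lookup (or the optimizer may fall back to a full table scan).
EXPLAIN select id, name, balance from account where update_time > '2020-09-19';
```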
Move the condition to the primary key index tree
If we move the query condition onto the primary key index tree, we can reduce the number of back-to-table lookups. But querying on the primary key index tree means the condition has to become the primary key id, so what about the original update_time condition? Pull it into a subquery.
How does the subquery work? Since the secondary index leaf nodes contain the primary key id, we can look up the primary key id directly by update_time, and we also move the limit 100000 condition into the subquery. The full SQL is as follows:
```sql
select id, name, balance
from account
where id >= (select a.id
             from account a
             where a.update_time >= '2020-09-19'
             limit 100000, 1)
limit 10;
```
The query returns the same result, but the execution time is only 0.038 seconds!
Let’s look at the execution plan
According to the execution plan, the subquery (table a) uses the idx_update_time index. It first obtains the clustered-index primary key id from the secondary index, which avoids back-to-table lookups, and then the outer query simply fetches 10 rows by id based on the result of the subquery.
Therefore, this scheme is ok
INNER JOIN deferred association
The optimization idea of deferred association is the same as that of the subquery: move the condition onto the primary key index tree and reduce back-to-table lookups. The difference is that deferred association uses an INNER JOIN instead of a subquery.
The optimized SQL is as follows:
```sql
SELECT acct1.id, acct1.name, acct1.balance
FROM account acct1
INNER JOIN (
  SELECT a.id
  FROM account a
  WHERE a.update_time >= '2020-09-19'
  ORDER BY a.update_time
  LIMIT 100000, 10
) AS acct2 ON acct1.id = acct2.id;
```
The query performance is also excellent, taking only 0.034 seconds.
The execution plan is as follows:
The idea is the same: first look up the primary key ids that satisfy the condition on the idx_update_time secondary index tree, then join back to the original table on those ids. Since the join goes straight through the primary key index, back-to-table lookups are reduced.
Label recording
The root cause of the limit deep paging problem is that the larger the offset, the more rows MySQL scans and then discards, which degrades query performance.
In fact, we can use the label recording method: record where the last query ended, and start scanning from that position the next time. It is like reading a book: when you stop, you fold the page or put in a bookmark, and the next time you open it you are right back where you left off.
Assuming the id of the last record was 100000, the SQL can be changed to:
```sql
select id, name, balance FROM account where id > 100000 order by id limit 10;
```
In this case, no matter how many pages you turn, performance stays good because the query hits the id index. However, this approach has a limitation: it requires a field similar to a continuously increasing id.
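To make the bookmark idea concrete, here is a minimal sketch of how consecutive pages chain together (the boundary value 10010 is purely illustrative):

```sql
-- Page 1: no boundary yet
select id, name, balance from account order by id limit 10;

-- Page 2: suppose the largest id returned by page 1 was 10010;
-- it becomes the lower boundary of the next page
select id, name, balance from account where id > 10010 order by id limit 10;
```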
The between… and…
In many cases, you can convert a limit query into a query at a known position, so that MySQL can get the corresponding result by range-scanning with between ... and ...
Assuming you know the boundary values are 100000 and 100010, you can optimize like this:
```sql
select id, name, balance FROM account where id between 100000 and 100010 order by id desc;
```
A hands-on practical case
Let's take a look at a real case. Suppose there is now a table with the following structure and 2 million rows of data.
```sql
CREATE TABLE `account` (
  `id` varchar(32) COLLATE utf8_bin NOT NULL COMMENT 'primary key',
  `account_no` varchar(64) COLLATE utf8_bin NOT NULL DEFAULT '' COMMENT 'account number',
  `amount` decimal(20,2) DEFAULT NULL COMMENT 'amount',
  `type` varchar(10) COLLATE utf8_bin DEFAULT NULL COMMENT 'type A, B',
  `create_time` datetime DEFAULT NULL COMMENT 'create time',
  `update_time` datetime DEFAULT NULL COMMENT 'update time',
  PRIMARY KEY (`id`),
  KEY `idx_account_no` (`account_no`),
  KEY `idx_create_time` (`create_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin COMMENT='account table';
```
The business requirement is as follows: fetch the 2021 type A account data and report it to the big data platform.
The typical implementation approach
Many developers, upon receiving such a requirement, implement it directly like this:
```java
// Page size
int pageSize = 100;

// Query the total count
Integer total = accountDAO.countAccount();

// SQL for the total count
<select id="countAccount" resultType="java.lang.Integer">
  select count(1)
  from account
  where create_time >= '2021-01-01 00:00:00'
  and type = 'A'
</select>

// Calculate the number of pages
int pageNo = total % pageSize == 0 ? total / pageSize : (total / pageSize + 1);

// Query page by page and report
for (int i = 0; i < pageNo; i++) {
    int startRow = i * pageSize;
    List<AccountPO> list = accountDAO.listAccountByPage(startRow, pageSize);
    // Report to the big data platform
    postBigData(list);
}

// Paged query SQL (a limit deep paging problem may exist here, because the account table has millions of rows)
<select id="listAccountByPage">
  select *
  from account
  where create_time >= '2021-01-01 00:00:00'
  and type = 'A'
  limit #{startRow}, #{pageSize}
</select>
```
The practical optimization scheme
In the above implementation, there is a limit deep paging problem because the account table contains millions of rows. So how do you optimize it?
In fact, you can use the label recording method. Some readers may be confused: the id primary key is not continuous, can label recording really be used?
Sure. Although the id is not continuous, we can make the results ordered by using order by. The optimization scheme is as follows:
```java
// Query the minimum id as the starting boundary
String lastId = accountDAO.queryMinId();

// SQL for querying the minimum id
<select id="queryMinId" resultType="java.lang.String">
  select MIN(id)
  from account
  where create_time >= '2021-01-01 00:00:00'
  and type = 'A'
</select>

// Page size
Integer pageSize = 100;

List<AccountPO> list;
do {
    list = listAccountByPage(lastId, pageSize);
    if (CollectionUtils.isEmpty(list)) {
        break;
    }
    // Label recording: remember the last id of this batch as the boundary of the next query.
    // Note: since the query below uses id > #{lastId}, the row holding MIN(id) itself is skipped;
    // use >= for the first batch if that row is needed.
    lastId = list.get(list.size() - 1).getId();
    // Report to the big data platform
    postBigData(list);
} while (CollectionUtils.isNotEmpty(list));

// Paged query SQL
<select id="listAccountByPage">
  select *
  from account
  where create_time >= '2021-01-01 00:00:00'
  and id > #{lastId}
  and type = 'A'
  order by id asc
  limit #{pageSize}
</select>
```