This article is shared from Huawei Cloud Community, "Paging Query Performance Optimization for Large Data Volumes", by JavaEdge.
Scrolling a feed page by page needs paging queries, and so does searching for products. When the data runs to tens or even hundreds of millions of rows, how do you pull it quickly? For example:
- A large merchant pulls tens of millions of orders per month to its independent ISVs for financial statistics
- A big V with millions of followers pushes a post to all of them
Case
Common mistakes
SELECT *
FROM table
WHERE kid = 1342
  AND type = 1
ORDER BY id ASC
LIMIT 149420, 20;
Typical sort + paging query:
ORDER BY col LIMIT N, M
MySQL scans past the first N rows and then returns M rows. The larger N is, the more data MySQL must read to locate the target rows, which costs a great deal of I/O and time.
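For reference, the two spellings of the paging clause are equivalent; both make MySQL read and discard the first N rows before returning M (sketched on a generic table t):

SELECT * FROM t ORDER BY col LIMIT 149420, 20;       -- LIMIT offset, row_count
SELECT * FROM t ORDER BY col LIMIT 20 OFFSET 149420; -- LIMIT row_count OFFSET offset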
Why does the SQL above scan data so slowly?
t is an index-organized (InnoDB) table with the secondary index idx_kid_type(kid, type). Many rows satisfy kid=3 and type=1; suppose we want rows 9 and 10:
select * from t where kid=3 and type=1 order by id desc limit 8,2;
For InnoDB, each row is fetched via the primary key stored in the idx_kid_type secondary-index entry. At the tens-of-millions scale, the index can approach the data in size, and only part of it fits in the buffer pool. Secondary-index leaf pages and data pages are not stored in the same physical blocks, and the mapping from secondary-index order to primary-key order is essentially random, so each lookup can trigger a random I/O request. The larger N is, the more index pages and data pages must be traversed, and the longer the query takes.
Given how long such a deep-page query takes, must we really traverse all that "useless" data? For the query
limit 8, 2
wouldn't it be faster to skip the first 8 irrelevant rows and their data pages, and use the index to jump straight to rows 9 and 10?
This is the core idea of deferred join (delayed association): use a covering-index query to return only the required primary keys, then join back to the table on those keys to fetch the full rows, instead of walking from the secondary index to a data page for every scanned row.
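A minimal sketch of a deferred join on the table t above (assuming id is its primary key): the inner query is satisfied entirely from idx_kid_type, which implicitly carries the primary key, so only the two final rows are ever fetched from the table.

SELECT t.*
FROM t
JOIN (SELECT id
      FROM t
      WHERE kid = 3 AND type = 1
      ORDER BY id DESC
      LIMIT 8, 2) AS b ON t.id = b.id;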
The analysis above shows both why deep paging is slow when done the conventional way, and what, concretely, can speed it up.
General paging query
The simplest form uses the LIMIT clause, whose syntax is:
SELECT * FROM table LIMIT [offset,] rows | rows OFFSET offset
The limit clause is used to specify the number of records returned by the SELECT statement.
- offset: specifies the offset of the first returned row. It defaults to 0; the initial row's offset is 0, not 1.
- rows: specifies the maximum number of rows to return. To retrieve all rows from an offset to the end of the result set, use some very large number as the row count (MySQL does not accept -1 here).
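For example, fetching everything from offset 95 to the end (sketched on a generic table t):

SELECT * FROM t LIMIT 95, 18446744073709551615;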
If only one argument is given, it indicates the maximum number of rows to return.
Example:
select * from orders_history where type=8 limit 1000,10;
This returns 10 rows from orders_history starting at offset 1000, i.e. rows 1001 through 1010 (with contiguous ids, 1001 <= id <= 1010).
The records in the table are sorted by primary key (id) by default, so the result above is equivalent to:
select * from orders_history where type=8 order by id limit 1000,10;
The three query times are as follows:
3040 ms
3063 ms
3018 ms
For this type of query, the following tests how the number of returned rows affects query time:
select * from orders_history where type=8 limit 10000,1;
select * from orders_history where type=8 limit 10000,10;
select * from orders_history where type=8 limit 10000,100;
select * from orders_history where type=8 limit 10000,1000;
select * from orders_history where type=8 limit 10000,10000;
Three query times:
Query 1 record:      3072ms 3092ms 3002ms
Query 10 records:    3081ms 3077ms 3032ms
Query 100 records:   3118ms 3200ms 3128ms
Query 1000 records:  3412ms 3468ms 3394ms
Query 10000 records: 3749ms 3802ms 3696ms
Below about 100 returned rows there is little difference in query time; as the row count grows beyond that, so does the time.
Tests against query offsets:
select * from orders_history where type=8 limit 100,100;
select * from orders_history where type=8 limit 1000,100;
select * from orders_history where type=8 limit 10000,100;
select * from orders_history where type=8 limit 100000,100;
select * from orders_history where type=8 limit 1000000,100;
The three query times are as follows:
Offset 100:     25ms    24ms    24ms
Offset 1000:    78ms    76ms    77ms
Offset 10000:   3092ms  3212ms  3128ms
Offset 100000:  3878ms  3812ms  3798ms
Offset 1000000: 14608ms 14062ms 14700ms
As the offset grows, and especially once it exceeds 100,000, query time rises sharply. This kind of paging always scans from the first matching record, so the deeper the page the slower the query, and the more rows returned the slower it gets overall.
How to optimize
- Add caching and search in front of the database to cut the queries that actually hit it; for example, put a massive product catalog into a search engine and present it as an infinite-scroll (waterfall) feed.
- Optimize how the SQL reaches the data so it locates the target rows directly. The recommended "deferred join" uses a covering-index query to return only the desired primary keys, then joins back to the table on those keys to fetch the rows.
- Use a bookmark: record the last (largest or smallest) id returned by the previous query, and continue from it for the next M rows.
Deferred join
Before optimization
explain SELECT id, cu_id, name, info, biz_type, gmt_create, gmt_modified,start_time, end_time, market_type, back_leaf_category,item_status,picuture_url FROM relation where biz_type ='0' AND end_time >='2014-05-29' ORDER BY id asc LIMIT 149420 ,20;
+----+-------------+----------+-------+---------------+-------------+---------+------+--------+-----------------------------+
| id | select_type | table    | type  | possible_keys | key         | key_len | ref  | rows   | Extra                       |
+----+-------------+----------+-------+---------------+-------------+---------+------+--------+-----------------------------+
|  1 | SIMPLE      | relation | range | ind_endtime   | ind_endtime | 9       | NULL | 349622 | Using where; Using filesort |
+----+-------------+----------+-------+---------------+-------------+---------+------+--------+-----------------------------+
Execution time:
20 rows in set (1.05 sec)
After optimization:
explain SELECT a.* FROM relation a, (select id from relation where biz_type ='0' AND end_time >='2014-05-29' ORDER BY id asc LIMIT 149420 ,20 ) b where a.id=b.id;
+----+-------------+------------+--------+---------------+---------+---------+------+--------+-------+
| id | select_type | table      | type   | possible_keys | key     | key_len | ref  | rows   | Extra |
+----+-------------+------------+--------+---------------+---------+---------+------+--------+-------+
|  1 | PRIMARY     | <derived2> | ALL    | NULL          | NULL    | NULL    | NULL |     20 |       |
|  1 | PRIMARY     | a          | eq_ref | PRIMARY       | PRIMARY | 8       | b.id |      1 |       |
|  2 | DERIVED     | relation   | index  | ind_endtime   | PRIMARY | 8       | NULL | 733552 |       |
+----+-------------+------------+--------+---------------+---------+---------+------+--------+-------+
Execution time:
20 rows in set (0.36 sec)
After optimization, the execution time is 1/3 of the original.
Bookmark
First fetch the maximum and minimum ids of the qualifying records (id is the primary key by default):
select max(id) as maxid, min(id) as minid
from t where kid=2333 and type=1;
Then traverse forward from the minimum id, or backward from the maximum id:
select xx,xx from t where kid=2333 and type=1
and id >=min_id order by id asc limit 100;
select xx,xx from t where kid=2333 and type=1
and id <=max_id order by id desc limit 100;
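For subsequent pages, carry the bookmark forward: remember the last id the previous page returned and continue strictly past it. A minimal sketch, using a user variable to stand in for the remembered value (the value itself is hypothetical):

set @last_seen_id = 2333999;  -- hypothetical: last id from the previous page
select xx,xx from t where kid=2333 and type=1
and id > @last_seen_id order by id asc limit 100;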
Case
When deferred join cannot meet the speed requirement:
SELECT a.id as id, client_id, admin_id, kdt_id, type, token, created_time, update_time, is_valid, version
FROM t1 a,
     (SELECT id FROM t1 WHERE client_id = 'xxx' AND is_valid = '1'
      ORDER BY kdt_id ASC LIMIT 267100, 100) b
WHERE a.id = b.id;
100 rows in set (0.51 sec)
The deferred join takes 510ms; the bookmark-based query below brings it under 10ms, a qualitative leap:
SELECT * FROM t1 where client_id='xxxxx' and is_valid=1 and id<47399727 order by id desc LIMIT 100;
100 rows in set (0.00 sec)
Summary
Locating the starting point directly by primary key and then filtering for the required rows is comparatively faster than deferred join, since it scans far fewer secondary-index entries. But no optimization is a silver bullet. For example: ORDER BY id DESC and ORDER BY id ASC differed by 70ms in testing, and in the production case the same LIMIT 100 differed by 1.3s. Why?
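One plausible factor, offered here as an assumption rather than the article's own analysis: InnoDB walks an ascending index backward (for ORDER BY ... DESC) less efficiently than it scans forward. Since MySQL 8.0, a descending index lets such a query scan forward instead; a sketch against the t1 case above (the index name is hypothetical):

ALTER TABLE t1 ADD INDEX idx_client_valid_id_desc (client_id, is_valid, id DESC);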
Other optimizations can also speed up large paging queries, for example index condition pushdown (ICP) when a scan uses a composite index but not all of its columns.
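ICP is enabled by default in modern MySQL; it shows up in EXPLAIN's Extra column as "Using index condition", and it can be toggled per session to measure its effect:

-- disable ICP, re-run the paging query to compare, then restore it
SET optimizer_switch = 'index_condition_pushdown=off';
SET optimizer_switch = 'index_condition_pushdown=on';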
Subquery optimization
First locate the id at the offset position, then query onward from it; this suits monotonically increasing ids:
select * from orders_history where type=8 limit 100000,1;

select id from orders_history where type=8 limit 100000,1;

select * from orders_history where type=8 and id>=(select id from orders_history where type=8 limit 100000,1) limit 100;

select * from orders_history where type=8 limit 100000,100;
The query time of the four statements is as follows:
First statement:  3674ms
Second statement: 1315ms
Third statement:  1327ms
Fourth statement: 3710ms
- 1 vs 2: selecting only id instead of * is about 3x faster
- 2 vs 3: little difference in speed
- 3 vs 4: thanks to the select-id subquery, statement 3 is about 3x faster than statement 4
This method will be several times faster than the original general query method.
Optimization using an id range
If the table's ids are strictly contiguous and increasing, the id range for a page can be computed from the page number and page size (for example, at 100 rows per page, page N covers ids (N-1)*100+1 through N*100), then queried with id BETWEEN ... AND ...:
select *
from order_history
where c = 2
and id between 1000000 and 1000100
limit 100;
Query time:
15ms
12ms
9ms
This greatly speeds up the query, completing in tens of milliseconds. Its limitation is that it only applies when the id range is known explicitly.
Another way of writing:
select *
from order_history
where id >= 1000001
limit 100;
IN can also be used, typically when several tables are involved, querying with a set of ids from another table (note that MySQL does not allow LIMIT inside an IN subquery; the LIMIT here applies to the outer query):
select *
from order_history
where id in
(select order_id from trade_2 where goods = 'pen')
limit 100;
A temporary table
This is no longer strictly query optimization, but it is worth mentioning in passing. Id-range optimization requires contiguously increasing ids, which doesn't hold in some scenarios, such as history tables or tables with gaps left by deleted rows. There, consider a temporary table that records the ids used for paging and then paging with queries against it, as sketched below. This can greatly improve traditional paging speed, especially when the data volume is in the tens of millions.
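A minimal sketch, assuming a temporary table that pre-numbers the qualifying ids (all names hypothetical):

create temporary table tmp_page_ids (
  rownum int unsigned not null auto_increment primary key,
  id     bigint unsigned not null
);
insert into tmp_page_ids (id)
select id from orders_history where type=8 order by id;

-- page 1001 at 100 rows per page, i.e. rows 100001..100100:
select h.*
from orders_history h
join tmp_page_ids p on h.id = p.id
where p.rownum between 100001 and 100100;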
The data table's id
When creating tables, it is generally good practice to give every table an auto-increment id field for convenient querying. For very large datasets such as an order database, tables are usually split (sharded); then the database's auto-increment id should not serve as the business-wide unique id. Instead, use a distributed, high-concurrency unique-id generator and store its value in a dedicated field, as sketched below.
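A sketch of such a layout (names hypothetical): keep the auto-increment id as the physical primary key and store the generator's value in its own uniquely indexed column.

create table orders (
  id       bigint unsigned not null auto_increment,
  order_no bigint unsigned not null,  -- assigned by the distributed id generator
  primary key (id),
  unique key uk_order_no (order_no)
) engine = innodb;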
Using a range query on the index to locate the ids first, and then fetching the rows by id, can speed queries up several times over: first select id, then select *.
Reference
Segmentfault.com/a/119000003…