While optimizing project code, we ran into a deep-pagination problem on a table with roughly ten million rows. The background is as follows.
The database contains a consumables table, MCS_PROD. The system assembles multi-dimensional external data into individual consumable products in this table, and finally synchronizes them to the ES search engine.
The MySQL-to-ES synchronization process is as follows:
- Synchronization is triggered by a scheduled task, e.g. at intervals of half a day or a day
- Synchronization is incremental, based on an update-time watermark; the first run uses the condition >= 1970-01-01 00:00:00.0
- The maximum update time seen is recorded, and the next synchronization run starts from that value
- Data is retrieved page by page, incrementing the page number by one until the last page is reached
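The loop above can be sketched as follows. This is a minimal illustration only; `fetch_page` and `sync_to_es` are hypothetical stand-ins for the real DAO and ES client calls:

```python
def incremental_sync(fetch_page, sync_to_es,
                     last_time="1970-01-01 00:00:00.0", page_size=10):
    """Pull rows updated since last_time, page by page, and push them to ES.

    fetch_page(since, page_no, page_size) -> list of row dicts with an
    'UPDT_TIME' key; returns [] past the last page (hypothetical API).
    """
    max_time = last_time
    page_no = 0
    while True:
        rows = fetch_page(last_time, page_no, page_size)
        if not rows:
            break
        sync_to_es(rows)
        # track the largest update time seen so far
        max_time = max(max_time, max(r["UPDT_TIME"] for r in rows))
        page_no += 1      # deeper pages mean a larger OFFSET, hence slower queries
    return max_time       # recorded as the watermark for the next run
```

Note that each later page translates into a deeper OFFSET in the underlying SQL, which is exactly the problem this article examines.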
The deeper the MySQL OFFSET goes, the worse query performance gets. The online MCS_PROD table is estimated to hold about 10 million records.
At 10 entries per page, the OFFSET value drags down query performance, creating a "performance abyss".
The synchronization code can be optimized for this problem in two ways:
- Use cursor- or stream-based fetching
- Optimize deep-pagination performance, which is the focus of this article
I. Software and hardware description
MySQL VERSION
mysql> SELECT version();
+-----------+
| version() |
+-----------+
| 5.7.30    |
+-----------+
1 row in set (0.01 sec)
Table structure Description
Field names and lengths have been sanitized relative to the real corporate table structure.
mysql> DESC MCS_PROD;
+-------------+--------------+------+-----+---------+----------------+
| Field       | Type         | Null | Key | Default | Extra          |
+-------------+--------------+------+-----+---------+----------------+
| MCS_PROD_ID | int(11)      | NO   | PRI | NULL    | auto_increment |
| MCS_CODE    | varchar(100) | YES  |     |         |                |
| MCS_NAME    | varchar(500) | YES  |     |         |                |
| UPDT_TIME   | datetime     | NO   | MUL | NULL    |                |
+-------------+--------------+------+-----+---------+----------------+
4 rows in set (0.01 sec)
For testing, colleagues helped generate about 5 million rows of data:
mysql> SELECT COUNT(*) FROM MCS_PROD;
+----------+
| COUNT(*) |
+----------+
|  5100000 |
+----------+
1 row in set (1.43 sec)
The SQL statement is as follows
Because the feature must support incremental pulls, the query filters on the data's update time and sorts by it (there is a pitfall here, covered later).
SELECT
MCS_PROD_ID,
MCS_CODE,
MCS_NAME,
UPDT_TIME
FROM
MCS_PROD
WHERE
UPDT_TIME >= '1970-01-01 00:00:00.0' ORDER BY UPDT_TIME
LIMIT xx, xx
II. MySQL pagination
The LIMIT clause can be used to force a SELECT statement to return a specified number of records. LIMIT accepts one or two numeric arguments, which must be integer constants.
If two arguments are given, the first specifies the offset of the first row to return, and the second specifies the maximum number of rows to return.
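For illustration, the two forms can be tried on a toy table. SQLite is used here only as a reproducible stand-in; it accepts the same MySQL-style `LIMIT offset, count` syntax:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO t (id) VALUES (?)", [(i,) for i in range(1, 11)])

# One argument: the maximum number of rows to return
first_three = [r[0] for r in conn.execute("SELECT id FROM t ORDER BY id LIMIT 3")]

# Two arguments (MySQL-style `LIMIT offset, count`): skip 5 rows, return 3
page = [r[0] for r in conn.execute("SELECT id FROM t ORDER BY id LIMIT 5, 3")]

print(first_three, page)  # [1, 2, 3] [6, 7, 8]
```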
Let's analyze a simple example of the SQL query process to understand why deep pagination performs poorly.
mysql> SELECT MCS_PROD_ID,MCS_CODE,MCS_NAME,UPDT_TIME FROM MCS_PROD WHERE (UPDT_TIME >= '1970-01-01 00:00:00.0') ORDER BY UPDT_TIME LIMIT 100000, 1;
+-------------+-------------------------+---------------------------------+---------------------+
| MCS_PROD_ID | MCS_CODE                | MCS_NAME                        | UPDT_TIME           |
+-------------+-------------------------+---------------------------------+---------------------+
|      181789 | XA601709733186213015031 | radial LC-DCP bone plate (foot) | 2020-10-19 16:22:19 |
+-------------+-------------------------+---------------------------------+---------------------+
1 row in set (3.66 sec)

mysql> EXPLAIN SELECT MCS_PROD_ID,MCS_CODE,MCS_NAME,UPDT_TIME FROM MCS_PROD WHERE (UPDT_TIME >= '1970-01-01 00:00:00.0') ORDER BY UPDT_TIME LIMIT 100000, 1;
+----+-------------+----------+------------+-------+---------------+------------+---------+------+---------+----------+-----------------------+
| id | select_type | table    | partitions | type  | possible_keys | key        | key_len | ref  | rows    | filtered | Extra                 |
+----+-------------+----------+------------+-------+---------------+------------+---------+------+---------+----------+-----------------------+
|  1 | SIMPLE      | MCS_PROD | NULL       | range | MCS_PROD_1    | MCS_PROD_1 | 5       | NULL | 2296653 |   100.00 | Using index condition |
+----+-------------+----------+------------+-------+---------------+------------+---------+------+---------+----------+-----------------------+
1 row in set, 1 warning (0.01 sec)
The SQL above executes roughly as follows:
- First, scan table MCS_PROD filtering on the UPDT_TIME condition and fetch the display columns (which requires going back to the table), then sort and apply LIMIT
- LIMIT 100000, 1 means scanning 100001 rows that satisfy the condition and discarding the first 100000
- MySQL spends a great deal of random I/O on back-to-table lookups against the clustered index, and the data read by those 100000 random I/Os never appears in the result set
- If system concurrency is at all high, with every query scanning 100000+ rows, performance is bound to suffer
The data in Figure 1 are for reference only
III. Deep pagination optimization
There are three common strategies for optimizing deep pagination in MySQL:
- Subquery optimization
- Deferred join
- Bookmark records
All three can greatly improve query efficiency. The core idea is the same: make MySQL scan as few pages as possible to locate the records that need to be accessed, then use the associated column to fetch the columns the original query requires.
3.1 Subquery optimization
The subquery deep-pagination optimization statement is as follows:
mysql> SELECT MCS_PROD_ID,MCS_CODE,MCS_NAME FROM MCS_PROD WHERE MCS_PROD_ID >= ( SELECT m1.MCS_PROD_ID FROM MCS_PROD m1 WHERE m1.UPDT_TIME >= '1970-01-01 00:00:00.0' ORDER BY m1.UPDT_TIME LIMIT 3000000, 1) LIMIT 1;
+-------------+-------------------------+---------------------------------------+
| MCS_PROD_ID | MCS_CODE                | MCS_NAME                              |
+-------------+-------------------------+---------------------------------------+
|     3021401 | XA892010009391491861476 | anatomical metal bone plate, T type A |
+-------------+-------------------------+---------------------------------------+
1 row in set (0.76 sec)

mysql> EXPLAIN SELECT MCS_PROD_ID,MCS_CODE,MCS_NAME FROM MCS_PROD WHERE MCS_PROD_ID >= ( SELECT m1.MCS_PROD_ID FROM MCS_PROD m1 WHERE m1.UPDT_TIME >= '1970-01-01 00:00:00.0' ORDER BY m1.UPDT_TIME LIMIT 3000000, 1) LIMIT 1;
+----+-------------+----------+------------+-------+---------------+------------+---------+------+---------+----------+--------------------------+
| id | select_type | table    | partitions | type  | possible_keys | key        | key_len | ref  | rows    | filtered | Extra                    |
+----+-------------+----------+------------+-------+---------------+------------+---------+------+---------+----------+--------------------------+
|  1 | PRIMARY     | MCS_PROD | NULL       | range | PRIMARY       | PRIMARY    | 4       | NULL | 2296653 |   100.00 | Using where              |
|  2 | SUBQUERY    | m1       | NULL       | range | MCS_PROD_1    | MCS_PROD_1 | 5       | NULL | 2296653 |   100.00 | Using where; Using index |
+----+-------------+----------+------------+-------+---------------+------------+---------+------+---------+----------+--------------------------+
2 rows in set, 1 warning (0.77 sec)
From the execution plan, the subquery on table m1 uses the index (Using index: the UPDT_TIME secondary index covers the query and yields only the primary key). The outer query then seeks directly on the primary key, which greatly improves query efficiency.
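The equivalence of the two forms can be checked on a toy dataset (SQLite in-memory, simplified column names; the point is that the inner query walks the secondary index to find only the boundary primary key, and the outer query then seeks straight to it):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mcs_prod (id INTEGER PRIMARY KEY, updt_time TEXT)")
conn.execute("CREATE INDEX idx_updt ON mcs_prod (updt_time)")
conn.executemany("INSERT INTO mcs_prod VALUES (?, ?)",
                 [(i, "2020-01-01 00:00:%02d" % i) for i in range(1, 60)])

# Plain deep-offset form: scans and discards 50 full rows
plain = conn.execute(
    "SELECT id FROM mcs_prod WHERE updt_time >= '1970-01-01'"
    " ORDER BY updt_time LIMIT 50, 1").fetchone()

# Subquery form: the inner query produces only the boundary primary key
optimized = conn.execute(
    "SELECT id FROM mcs_prod WHERE id >= ("
    " SELECT m1.id FROM mcs_prod m1 WHERE m1.updt_time >= '1970-01-01'"
    " ORDER BY m1.updt_time LIMIT 50, 1) LIMIT 1").fetchone()

print(plain, optimized)  # (51,) (51,)
```

Both statements return the same row; the win in MySQL comes from the inner scan being a covering index scan instead of 3,000,001 back-to-table lookups.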
The data in Figure 2 are for reference only
3.2 Deferred join
The deferred-join deep-pagination optimization statement is as follows:
mysql> SELECT MCS_PROD_ID,MCS_CODE,MCS_NAME FROM MCS_PROD INNER JOIN (SELECT m1.MCS_PROD_ID FROM MCS_PROD m1 WHERE m1.UPDT_TIME >= '1970-01-01 00:00:00.0' ORDER BY m1.UPDT_TIME LIMIT 3000000, 1) AS MCS_PROD2 USING(MCS_PROD_ID);
+-------------+-------------------------+---------------------------------------+
| MCS_PROD_ID | MCS_CODE                | MCS_NAME                              |
+-------------+-------------------------+---------------------------------------+
|     3021401 | XA892010009391491861476 | anatomical metal bone plate, T type A |
+-------------+-------------------------+---------------------------------------+
1 row in set (0.75 sec)

mysql> EXPLAIN SELECT MCS_PROD_ID,MCS_CODE,MCS_NAME FROM MCS_PROD INNER JOIN (SELECT m1.MCS_PROD_ID FROM MCS_PROD m1 WHERE m1.UPDT_TIME >= '1970-01-01 00:00:00.0' ORDER BY m1.UPDT_TIME LIMIT 3000000, 1) AS MCS_PROD2 USING(MCS_PROD_ID);
+----+-------------+------------+------------+--------+---------------+------------+---------+-----------------------+---------+----------+--------------------------+
| id | select_type | table      | partitions | type   | possible_keys | key        | key_len | ref                   | rows    | filtered | Extra                    |
+----+-------------+------------+------------+--------+---------------+------------+---------+-----------------------+---------+----------+--------------------------+
|  1 | PRIMARY     | <derived2> | NULL       | ALL    | NULL          | NULL       | NULL    | NULL                  | 2296653 |   100.00 | NULL                     |
|  1 | PRIMARY     | MCS_PROD   | NULL       | eq_ref | PRIMARY       | PRIMARY    | 4       | MCS_PROD2.MCS_PROD_ID |       1 |   100.00 | NULL                     |
|  2 | DERIVED     | m1         | NULL       | range  | MCS_PROD_1    | MCS_PROD_1 | 5       | NULL                  | 2296653 |   100.00 | Using where; Using index |
+----+-------------+------------+------------+--------+---------------+------------+---------+-----------------------+---------+----------+--------------------------+
3 rows in set, 1 warning (0.00 sec)
The idea and performance are the same as the subquery optimization; it simply expresses the primary-key lookup as a JOIN.
3.3 Bookmarks
The core of the LIMIT deep paging problem is the OFFSET value, which causes MySQL to scan a large number of unneeded rows and then discard them
Instead of using OFFSET, we can use a bookmark to record the position where the last fetch stopped, and start the next scan from that position.
For example, using primary key 3000000 as the recorded bookmark, the query can be written like this:
mysql> SELECT MCS_PROD_ID,MCS_CODE,MCS_NAME FROM MCS_PROD WHERE MCS_PROD_ID < 3000000 ORDER BY UPDT_TIME LIMIT 1;
+-------------+-------------------------+--------------------------------------------------------------+
| MCS_PROD_ID | MCS_CODE                | MCS_NAME                                                     |
+-------------+-------------------------+--------------------------------------------------------------+
|         127 | XA683240878449276581799 | proximal femur locking plate, threaded holes (pure titanium, type 1) YJBL01 |
+-------------+-------------------------+--------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> EXPLAIN SELECT MCS_PROD_ID,MCS_CODE,MCS_NAME FROM MCS_PROD WHERE MCS_PROD_ID < 3000000 ORDER BY UPDT_TIME LIMIT 1;
+----+-------------+----------+------------+-------+---------------+------------+---------+------+------+----------+-------------+
| id | select_type | table    | partitions | type  | possible_keys | key        | key_len | ref  | rows | filtered | Extra       |
+----+-------------+----------+------------+-------+---------------+------------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | MCS_PROD | NULL       | index | PRIMARY       | MCS_PROD_1 | 5       | NULL |    2 |    50.00 | Using where |
+----+-------------+----------+------------+-------+---------------+------------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
The benefit is obvious: queries are extremely fast, and performance stays stable at the millisecond level, crushing the other approaches.
However, this approach has limitations: it needs something like a continuously increasing field, and the business must be able to tolerate pagination by that continuity; whether it fits depends on the situation.
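A full bookmark-style scan can be sketched like this (again with SQLite as a stand-in; `last_id` is the bookmark, so every page becomes an index seek instead of an offset scan):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mcs_prod (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO mcs_prod VALUES (?, ?)",
                 [(i, "prod-%d" % i) for i in range(1, 26)])

def scan_all(conn, page_size=10):
    last_id = 0                      # bookmark: last primary key fetched
    while True:
        rows = conn.execute(
            "SELECT id, name FROM mcs_prod WHERE id > ?"
            " ORDER BY id LIMIT ?", (last_id, page_size)).fetchall()
        if not rows:
            return
        yield from rows
        last_id = rows[-1][0]        # advance the bookmark to the last row seen

ids = [r[0] for r in scan_all(conn)]
print(len(ids))  # 25
```

This is the cursor-style fetching mentioned at the start of the article: the cost per page stays constant no matter how deep the scan goes, which is exactly what the synchronization job needs.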
The picture above shows the file list of an Alibaba Cloud OSS bucket; I would venture to guess that it is implemented using this bookmark approach.
IV. The ORDER BY pitfall, tread carefully
The following statement may shatter all your fantasies about Order by
If the LIMIT OFFSET goes too deep, the optimizer will abandon the ORDER BY index.
mysql> EXPLAIN SELECT MCS_PROD_ID,MCS_CODE,MCS_NAME,UPDT_TIME FROM MCS_PROD WHERE UPDT_TIME >= '1970-01-01 00:00:00.0' ORDER BY UPDT_TIME LIMIT 100000, 1;
+----+-------------+----------+------------+-------+---------------+------------+---------+------+---------+----------+-----------------------+
| id | select_type | table    | partitions | type  | possible_keys | key        | key_len | ref  | rows    | filtered | Extra                 |
+----+-------------+----------+------------+-------+---------------+------------+---------+------+---------+----------+-----------------------+
|  1 | SIMPLE      | MCS_PROD | NULL       | range | MCS_PROD_1    | MCS_PROD_1 | 5       | NULL | 2296653 |   100.00 | Using index condition |
+----+-------------+----------+------------+-------+---------------+------------+---------+------+---------+----------+-----------------------+
1 row in set, 1 warning (0.00 sec)
The ORDER BY above executes as follows:
- Initialize SORT_BUFFER with the four fields MCS_PROD_ID, MCS_CODE, MCS_NAME and UPDT_TIME
- Find a primary key ID that meets the condition from the UPDT_TIME index, read the four field values from the table, and store them in SORT_BUFFER
- Fetch the next record that meets the UPDT_TIME condition from the index and repeat the previous step
- Sort the data in SORT_BUFFER by UPDT_TIME
- After sorting, take the records that satisfy the LIMIT condition and return them to the client
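These steps can be mimicked in Python. This is a conceptual sketch only; the real server works against the sort buffer and possibly external merge files:

```python
def order_by_limit(rows, predicate, sort_key, offset, count):
    # Steps 1-3: put every row matching the WHERE condition into the buffer
    sort_buffer = [row for row in rows if predicate(row)]
    # Step 4: sort the buffer by the ORDER BY key (UPDT_TIME in the article)
    sort_buffer.sort(key=sort_key)
    # Step 5: apply LIMIT offset, count and return the page to the client
    return sort_buffer[offset:offset + count]

rows = [{"id": i, "updt_time": "2020-01-%02d" % i} for i in (3, 1, 2, 5, 4)]
page = order_by_limit(rows,
                      lambda r: r["updt_time"] >= "2020-01-02",
                      lambda r: r["updt_time"], 1, 2)
print([r["id"] for r in page])  # [3, 4]
```

The sketch makes the cost visible: every matching row is buffered and sorted before the LIMIT is applied, regardless of how few rows are actually returned.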
Sorting by UPDT_TIME may happen in memory or may require an external sort, depending on how much memory the sort needs and on the parameter sort_buffer_size.
sort_buffer_size is the amount of memory MySQL allocates for sorting. If the data to be sorted fits within sort_buffer_size, the sort is done in memory; if it is too large to fit, temporary files on disk are used for sorting.
There is not much useful material online about the sort_buffer_size parameter; if you run into problems while testing, experiment carefully in your own environment.
4.1 Example of ORDER BY Index Failure
At OFFSET 100000, the key and Extra columns show that sorting with temporary files on disk is not yet needed, so change the OFFSET to 500000.
And there it is: Using filesort.
mysql> EXPLAIN SELECT MCS_PROD_ID,MCS_CODE,MCS_NAME,UPDT_TIME FROM MCS_PROD WHERE UPDT_TIME >= '1970-01-01 00:00:00.0' ORDER BY UPDT_TIME LIMIT 500000, 1;
+----+-------------+----------+------------+------+---------------+------+---------+------+---------+----------+-----------------------------+
| id | select_type | table    | partitions | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra                       |
+----+-------------+----------+------------+------+---------------+------+---------+------+---------+----------+-----------------------------+
|  1 | SIMPLE      | MCS_PROD | NULL       | ALL  | MCS_PROD_1    | NULL | NULL    | NULL | 4593306 |    50.00 | Using where; Using filesort |
+----+-------------+----------+------------+------+---------------+------+---------+------+---------+----------+-----------------------------+
1 row in set, 1 warning (0.00 sec)
Using filesort means that an external sort is needed on top of the scan, and performance is severely affected.
Therefore, we should work with the business logic to avoid plain LIMIT OFFSET and rework the affected queries using the deep-pagination optimizations above.
V. Final words
One final point to note: MySQL itself is not well suited to single tables at very large scale.
In enterprise-level projects, queries against tables are rarely simple; there may be complex joined queries, or frequent inserts and updates on large volumes of data, and maintaining indexes and the ACID properties of the data inevitably costs performance.
If the data growth of a table can be anticipated at design time, plan reasonable refactoring and optimization approaches in advance, such as ES-assisted queries, database and table sharding, or solutions like TiDB.