During the process of optimizing the project code, we found a problem with deep paging of ten million data. The reason is as follows

There is a consumable MCS_PROD table in the database, which is assembled into a single consumable product within the system through the synchronization of multi-dimensional data in the external data, and finally synchronized to the ES search engine

MySQL ES synchronization process is as follows:

  1. Synchronization is triggered in the form of timed tasks, such as time intervals of half a day or a day
  2. The type of synchronization is incremental synchronization based on the update time mechanism. For example, the first synchronization is >= 1970-01-01 00:00:00.0
  3. The maximum update time is recorded for storage, and the next update synchronization is based on this condition
  4. The data is retrieved as a pagination, with the current number of pages incremented by one, looped to the last page

The more in-depth the MySQL query page OFFSET is, the worse the performance is. It is estimated that the online MCS_PROD table records about 1000W

At 10 entries per page, the OFFSET value will drag down query performance, creating a “performance abyss”

Synchronous class code can be optimized for this problem in two ways:

  1. Vernier and streaming schemes are used for optimization
  2. Optimizing deep paging performance, this article focuses on this topic

I. Software and hardware description

MySQL VERSION

mysql> select version(); + -- -- -- -- -- -- -- -- -- -- -- + | version () | + -- -- -- -- -- -- -- -- -- -- -- + | 5.7.30 | + -- -- -- -- -- -- -- -- -- -- - + 1 row in the set (0.01 SEC)Copy the code

Table structure Description

Fields, lengths, and names have been removed from the corporate table structure

mysql> DESC MCS_PROD; +-----------------------+--------------+------+-----+---------+----------------+ | Field | Type | Null | Key | Default |  Extra | +-----------------------+--------------+------+-----+---------+----------------+ | MCS_PROD_ID | int(11) | NO |  PRI | NULL | auto_increment | | MCS_CODE | varchar(100) | YES | | | | | MCS_NAME | varchar(500) | YES | | | | | UPDT_TIME | datetime | NO | MUL | NULL | | + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- - + + -- -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - + 4 rows in the set (0.01 SEC)Copy the code

Through the test, students helped to create about 500W of data

mysql> SELECT COUNT(*) FROM MCS_PROD; + -- -- -- -- -- -- -- -- -- -- + | count (*) | + -- -- -- -- -- -- -- -- -- -- + | 5100000 | + -- -- -- -- -- -- -- -- -- -- + 1 row in the set (1.43 SEC)Copy the code

The SQL statement is as follows

Because the function needs to satisfy the mode of incremental pull, there will be conditional query of data update time, and related query sorting (there is pit here).

SELECT
 MCS_PROD_ID,
 MCS_CODE,
 MCS_NAME,
 UPDT_TIME
FROM
 MCS_PROD
WHERE
 UPDT_TIME >= '1970-01-01 00:00:00.0' ORDER BY UPDT_TIME
LIMIT xx, xx
Copy the code

I sorted out some information, and friends in need can click to get it directly.

25 Java Interview Topics

Learning routes and materials from 0 to 1Java

Java core knowledge set

MySQL is the king of the road to promotion

MySQL pagination

The LIMIT clause can be used to force a SELECT statement to return a specified number of records. LIMIT accepts one or two numeric arguments, which must be an integer constant

If you are given two arguments, the first argument specifies the offset of the first row to return, and the second parameter specifies the maximum number of rows to return

Take a simple example, analyze the SQL query process, understand why deep paging performance is poor

Mysql > SELECT MCS_PROD_ID,MCS_CODE,MCS_NAME FROM MCS_PROD WHERE (UPDT_TIME >= '1970-01-01 00:00:00.0') ORDER BY UPDT_TIME LIMIT 100000, 1; +-------------+-------------------------+------------------+---------------------+ | MCS_PROD_ID | MCS_CODE | MCS_NAME |  UPDT_TIME | +-------------+-------------------------+------------------+---------------------+ | 181789 | XA601709733186213015031 radial LC - DCP bone plate | | feet, the 2020-10-19 16:22:19 | + -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - + 1 row in the set (3.66 SEC) mysql > EXPLAIN SELECT MCS_PROD_ID,MCS_CODE,MCS_NAME FROM MCS_PROD WHERE (UPDT_TIME >= '1970-01-01 00:00:00.0') ORDER BY UPDT_TIME LIMIT 100000, 1; +----+-------------+----------+------------+-------+---------------+------------+---------+------+---------+----------+- ----------------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+----------+------------+-------+---------------+------------+---------+------+---------+----------+- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- + | | 1 SIMPLE | MCS_PROD | NULL | range | MCS_PROD_1 | MCS_PROD_1 | | NULL | 2296653 | | 100.00  Using index condition | +----+-------------+----------+------------+-------+---------------+------------+---------+------+---------+----------+- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- + 1 row in the set, 1 warning (0.01 SEC)Copy the code

SQL > execute SQL

  1. First query table MCS_PROD, filter UPDT_TIME condition, query display columns (involving table operations) for sorting and LIMIT
  2. LIMIT 100000, 1 means scan the 100001 rows that satisfy the condition and discard the first 100000 rows

MySQL spends a lot of random I/O querying the clustered index data back to the table, and these 100000 random I/O queries do not show up in the result set

If the number of concurrent queries is slightly higher than 100000 rows per query, the performance will be poor.

The data in Figure 1 are for reference only

Deep page optimization

There are three common strategies for deep page optimization in MySQL:

  1. Subquery optimization
  2. Delays associated
  3. Bookmarks record

The above three points can greatly improve query efficiency. The core idea is to let MySQL scan as few pages as possible, obtain the records that need to be accessed, and then return the columns required by the original table query according to the associated columns

3.1 Sub-query optimization

Subquery deep page optimization statement is as follows:

mysql> SELECT MCS_PROD_ID,MCS_CODE,MCS_NAME FROM MCS_PROD WHERE MCS_PROD_ID >= ( SELECT m1.MCS_PROD_ID FROM MCS_PROD m1 WHERE m1.UPDT_TIME >= '1970-01-01 00:00:00.0' ORDER BY m1.UPDT_TIME LIMIT 3000000, 1) LIMIT 1; +-------------+-------------------------+------------------------+ | MCS_PROD_ID | MCS_CODE | MCS_NAME | + -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- + | 3021401 | XA892010009391491861476 | anatomy of the metal bone plate T bone plate type A | + -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- + 1 row in the set (0.76 SEC) mysql > EXPLAIN the SELECT MCS_PROD_ID,MCS_CODE,MCS_NAME FROM MCS_PROD WHERE MCS_PROD_ID >= ( SELECT m1.MCS_PROD_ID FROM MCS_PROD m1 WHERE M1. UPDT_TIME >= '1970-01-01 00:00:00.0' ORDER BY m1.UPDT_TIME LIMIT 3000000, 1) LIMIT 1; +----+-------------+----------+------------+-------+---------------+------------+---------+------+---------+----------+- -------------------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows |  filtered | Extra | +----+-------------+----------+------------+-------+---------------+------------+---------+------+---------+----------+- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- + | 1 | PRIMARY | MCS_PROD | NULL | range | PRIMARY | PRIMARY | NULL 100.00 | | 2296653 | | 4 Using the where | | 2 | SUBQUERY | | NULL m1 | range | MCS_PROD_1 | MCS_PROD_1 | | NULL | 2296653 | | 100.00 Using the where; Using index | +----+-------------+----------+------------+-------+---------------+------------+---------+------+---------+----------+- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- + 2 rows in the set, 1 warning (0.77 SEC)Copy the code

According to the execution plan, the sub-query table M1 uses the index. Select * from primary key where ID = 1; select * from primary key where ID = 1

The data in Figure 2 are for reference only

3.2 Delay Association

The “deferred association” deep page optimization statement is as follows:

mysql> SELECT MCS_PROD_ID,MCS_CODE,MCS_NAME FROM MCS_PROD INNER JOIN (SELECT m1.MCS_PROD_ID FROM MCS_PROD m1 WHERE M1. UPDT_TIME >= '1970-01-01 00:00:00.0' ORDER BY m1.UPDT_TIME LIMIT 3000000, 1) AS MCS_PROD2 USING(MCS_PROD_ID); +-------------+-------------------------+------------------------+ | MCS_PROD_ID | MCS_CODE | MCS_NAME | + -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- + | 3021401 | XA892010009391491861476 | anatomy of the metal bone plate T bone plate type A | + -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- + 1 row in the set (0.75 SEC) mysql > EXPLAIN the SELECT MCS_PROD_ID,MCS_CODE,MCS_NAME FROM MCS_PROD INNER JOIN (SELECT m1.MCS_PROD_ID FROM MCS_PROD m1 WHERE m1.UPDT_TIME >= '1970-01-01 00:00:00.0' ORDER BY m1.UPDT_TIME LIMIT 3000000, 1) AS MCS_PROD2 USING(MCS_PROD_ID); +----+-------------+------------+------------+--------+---------------+------------+---------+-----------------------+-- -------+----------+--------------------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+------------+------------+--------+---------------+------------+---------+-----------------------+-- -------+----------+--------------------------+ | 1 | PRIMARY | <derived2> | NULL | ALL | NULL | NULL | NULL | NULL | 2296653 | | NULL 100.00 | | 1 | PRIMARY | MCS_PROD | NULL | eq_ref | PRIMARY | PRIMARY 4 | | MCS_PROD2. MCS_PROD_ID | | 1 100.00 | NULL | | 2 | DERIVED | | NULL m1 | range | MCS_PROD_1 | MCS_PROD_1 | | NULL | 2296653 | | 100.00 Using where; Using index | +----+-------------+------------+------------+--------+---------------+------------+---------+-----------------------+-- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - + 3 rows in the set, 1 warning (0.00 SEC)Copy the code

The idea and performance are the same as the subquery optimization, but the implementation is implemented in the form of JOIN

3.3 Bookmarks

The core of the LIMIT deep paging problem is the OFFSET value, which causes MySQL to scan a large number of unneeded rows and then discard them

Instead of using OFFEST, we can first use bookmarks to retrieve the last fetched location and then start scanning from that location next time

Suppose you want to query the first record after 300,000 rows of data, you can write the query like this

mysql> SELECT MCS_PROD_ID,MCS_CODE,MCS_NAME FROM MCS_PROD WHERE MCS_PROD_ID < 3000000 ORDER BY UPDT_TIME LIMIT 1; +-------------+-------------------------+---------------------------------+ | MCS_PROD_ID | MCS_CODE | MCS_NAME | +-------------+-------------------------+---------------------------------+ | 127 | XA683240878449276581799 | Locking plate threaded holes in the proximal femur - 1 pure titanium) YJBL01 | + -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- + 1 row in the set (0.00) sec) mysql> EXPLAIN SELECT MCS_PROD_ID,MCS_CODE,MCS_NAME FROM MCS_PROD WHERE MCS_PROD_ID < 3000000 ORDER BY UPDT_TIME LIMIT 1; +----+-------------+----------+------------+-------+---------------+------------+---------+------+------+----------+---- ---------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+----------+------------+-------+---------------+------------+---------+------+------+----------+---- -- -- -- -- -- -- -- -- -- + | | 1 SIMPLE | MCS_PROD | NULL | index | PRIMARY | MCS_PROD_1 5 | | NULL | 2 | | 50.00 Using the where | +----+-------------+----------+------------+-------+---------------+------------+---------+------+------+----------+---- ---------+ 1 row in set, 1 Warning (0.00 SEC)Copy the code

The benefit is very obvious, the query speed is super fast, the performance will be stable in milliseconds, from the performance of other ways to consider crushing

However, this approach is somewhat limited, requiring something like continuous increment fields and the notion of continuity that the business can accommodate, depending on the situation

The above picture is the list of files in ali Cloud OSS Bucket. I dare to guess whether it can be completed in the form of bookmarks

ORDER BY the pit, tread carefully

The following statement may shatter all your fantasies about Order by

If LIMIT OFFSET is too deep, the ORDER BY index will be invalidated.

Mysql > EXPLAIN SELECT MCS_PROD_ID,MCS_CODE,MCS_NAME,UPDT_TIME FROM MCS_PROD WHERE UPDT_TIME >= '1970-01-01 00:00:0.0 '  ORDER BY UPDT_TIME LIMIT 100000, 1; +----+-------------+----------+------------+-------+---------------+------------+---------+------+---------+----------+- ----------------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+----------+------------+-------+---------------+------------+---------+------+---------+----------+- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- + | | 1 SIMPLE | MCS_PROD | NULL | range | MCS_PROD_1 | MCS_PROD_1 | | NULL | 2296653 | | 100.00  Using index condition | +----+-------------+----------+------------+-------+---------------+------------+---------+------+---------+----------+- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- + 1 row in the set, 1 warning (0.00 SEC)Copy the code

ORDER BY:

  1. Initialize SORT_BUFFER and add MCS_PROD_ID,MCS_CODE,MCS_NAME, and UPDT_TIME
  2. Find the primary key ID that meets the condition from the index UPDT_TIME, query the four field values in the table and store them in SORT_BUFFER
  3. Query the records that meet the UPDT_TIME condition from the index, and go to Step 2
  4. Sort data in SORT_BUFFER by UPDT_TIME
  5. After the sorting succeeds, the records that meet the LIMIT conditions are retrieved and returned to the client

Sorting by UPDT_TIME may be done in memory, or external sorting may be required, depending on the memory required for sorting and the parameter SORT_BUFFER_SIZE

SORT_BUFFER_SIZE is the memory MySQL allocates for sorting. If the amount of sort data is less than SORT_BUFFER_SIZE, the sort is done in memory. If the amount of data is too large to store, the system uses temporary files on disk to sort data

For the parameter SORT_BUFFER_SIZE, there is little useful information found online. If you have any problems during the test, you can add wechat to communicate with us

4.1 Example of ORDER BY Index Failure

At OFFSET 100000, key Extra indicates that temporary file sorting on disk is not used, so change the OFFSET to 500000

Cool cool night for you missing into the river, into the spring mud caress you… Using Filesort

Mysql > EXPLAIN SELECT MCS_PROD_ID,MCS_CODE,MCS_NAME,UPDT_TIME FROM MCS_PROD WHERE UPDT_TIME >= '1970-01-01 00:00:0.0 '  ORDER BY UPDT_TIME LIMIT 500000, 1; +----+-------------+----------+------------+------+---------------+------+---------+------+---------+----------+-------- ---------------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+----------+------------+------+---------------+------+---------+------+---------+----------+-------- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- + | | 1 SIMPLE | MCS_PROD | NULL | | ALL MCS_PROD_1 | NULL | NULL | NULL | 4593306 | | 50.00 Using where; Using filesort | +----+-------------+----------+------------+------+---------------+------+---------+------+---------+----------+-------- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - + 1 row in the set, 1 warning (0.00 SEC)Copy the code

Using filesort means that external sorting actions are required in addition to indexes, and performance will be severely affected

Therefore, we should combine the corresponding business logic to avoid the normal LIMIT OFFSET and use the # deep page optimization section to modify the corresponding business

“Said

One final point to note is that MySQL itself is not suitable for large single-table services

Because MySQL is used in enterprise-level projects, it is not easy to query against library tables, there may be more complex federated queries, or there may be frequent addition or update operations when there is a large amount of data, there must be performance sacrifice to maintain index or data ACID features

If the data growth of the database table can be expected at the beginning of the design, reasonable reconstruction and optimization methods should be conceived, such as ES coordination query, sub-database sub-table, TiDB and other solutions