Interview questions & Real experience

Interview question: How do you achieve deep paging when you have a large amount of data?

Everyone in the interview, or prepare for the interview may encounter the above questions, most of the answer is basically a sub-database sub-table index, which is a very standard correct answer, but the reality is always very thin, so the interviewer will generally ask you, now the project is insufficient, insufficient personnel, how to achieve deep paging?

At this time, the students who have no practical experience are basically numb, So, please listen to me.

Bitter lesson

One thing must be clear: deep pagination can be done, but deep random page hopping absolutely needs to be disabled.

Previous image:

Do you think, if I click on page 142360, the service will explode?

MySQL, MongoDB is fine, it’s a professional database, it’s not good, it’s slow at best, but when it comes to ES, it’s not the same, we have to use the SearchAfter Api to loop the data, and that’s a memory footprint issue, and if the code wasn’t written gracefully, This can directly lead to a memory overflow.

Why can’t you allow random depth page skipping

From a technical point of view, let’s talk briefly about why random deep page skipping is not allowed, or why deep paging is not recommended

MySQL

Basic principles of paging:

SELECT * FROM test ORDER BY id DESC LIMIT 10000, 20;
Copy the code

LIMIT 10000, 20 means to scan 10020 rows that satisfy the condition, discard the first 10000 rows, and return the last 20 rows. If the LIMIT is 1000000, 100, 1000100 rows will be scanned. In a high concurrency application, more than 100W rows will be scanned per query.

MongoDB

Basic principles of paging:

db.t_data.find().limit(5).skip(5);
Copy the code

Similarly, with the increase of the page number, the items skipped by SKIP will also become larger. This operation is implemented by iterator of CURSOR, and the consumption of CPU will be very obvious. When the page number is very large and frequent, it will inevitably explode.

ElasticSearch

From a business point of view ElasticSearch is not a typical database. It is a search engine. If you filter and don’t find the data you want, you will not find the data you want if you continue deep paging. When paging, you must run into the max_result_window limit. See? The official limit is 10,000.

Query process:

  1. For example, if page 501 is queried with 10 entries on each page, the client sends a request to a node
  2. This node broadcasts data to each shard, each of which queries the first 5010 pieces of data
  3. The query results are returned to this node, and then the data is consolidated to fetch the first 5010 data
  4. Return to the client

You can see why you want to limit the offset, and if you use the Search After scrolling API for deep page hopping queries, you’ll need to scroll thousands of pieces at a time, and maybe millions or tens of millions of pieces of data in total, just for the last 20 pieces of data.

Line up with the product again

As the saying goes, technology can not solve the problem, by the business to solve!

In the internship, I believed in the evil of the product and had to realize deep paging + page skipping. Now I must put things right and make the following changes in the business:

  • Add default filtering conditions, such as time cycle, as much as possible to reduce the amount of data displayed
  • Modify the display of the page hop to scroll display, or small range of page hop

Scroll to display the reference image:

Small scale page hopping reference figure:

Universal solution

The solutions in a short time are as follows:

  • Prerequisite: Indexes must be set for sorting fields
  • Core: Reduce offsets with known data for small page numbers, or known data for scrolling loads
  • Extra: If the situation is not easy to handle, you can also obtain the redundant data, some interception, performance impact is not big

MySQL

Page SQL:

# the first page
SELECT * FROM `year_score` where `year` = 2017 ORDER BY id limit 0, 20;

Page # N
SELECT * FROM `year_score` where `year` = 2017 ORDER BY id limit (N - 1) * 20, 20; 
Copy the code

By the context, it is written as:

SELECT * FROM 'year_score' WHERE 'year' = 2017 and 'id' > XXXX ORDER BY id limit 20;Copy the code

There’s no mole. Get some shit! As mentioned in SQL Optimization and Diagnostics, the LIMIT stops the query when the condition is met, so the total number of scans in this scheme is drastically reduced and the efficiency is improved by Max!

ES

The solution is the same as MySQL, where we can use the From-to Api whenever we want, without worrying about maximum limitations.

MongoDB

The scheme is similar, and the basic code is as follows:

Related performance tests:

If you want to deep random page jump

What if you haven’t screwed the product manager? It doesn’t matter. There’s still a chance.

In SQL Optimization article also mentioned MySQL deep page processing techniques, the code is as follows:

# counter example (time 129.570s)
select * from task_result LIMIT 20000000, 10;

# Positive example (time: 5.114s)
SELECT a.* FROM task_result a, (select id from task_result LIMIT 20000000, 10) b where a.id = b.id;
 # description The # task_result table is a table in the production environment where the total data amount is 34 million, id is the primary key, and the offset is 20 million Copy the code

The core logic of this scheme is to quickly get the primary key ID of the specified offset data without passing back table based on cluster index, and then use cluster index to query back table. At this time, the total number of entries is only 10, which is very efficient.

Therefore, we can use the same method when dealing with MySQL, ES, and MongoDB:

  1. The primary key ID is obtained only through deep paging
  2. The required data is queried by the primary key ID

Defects: Long time when the offset is very large, such as 5s in the article

The last

Reference: MongoDB Chinese community

Thank you @cheng big designer for my love design of the TWO-DIMENSIONAL code 😜

Don’t forget to like it if you find it useful