The profile

Starting from the introduction of search paging, this paper simply expounds the differences between paging data search and the original centralized data search thinking mode, analyzes the phenomenon of deep paging problem based on paging, and finally introduces the case of top N paging system.

Search paging syntax

Elasticsearch has two parameters: from and size.

  • Size: Displays the number of results that should be returned. The default is 10.
  • From: Displays the offset of the query data, the initial number of results that should be skipped. The default is 0.

The meanings of the from and size parameters are the same as the meanings of the limit keyword pagination parameters in MySql.

To give a few examples, query the request on pages 1-3:

GET /music/children/_search? size=10 GET /music/children/_search? size=10&from=10 GET /music/children/_search? size=10&from=20Copy the code

The difference between distributed and centralized data

Centralized data storage, application mode, from the earliest monomer to early SOA service mode, then stored most are using centralized data storage, data to the ground to mysql relational data, such as a separation support, speaking, reading and writing, and for realizing the Lord – more than one database instance is from structural nature of centralized storage.

In a single database or master-slave database, the execution of paging queries, statistical sorting and other ideas are relatively clear, after all, the data are completely put together, directly pick an instance to do it, may be the capacity of the upper limit, the results are slower.

The classical scheme of relational database using distributed data storage is to divide the database into tables. The data of the same table is divided into different database instances with certain routing logic. At this time, data statistics can not only focus on one instance.

For example, the index data of Elasticsearch is split and stored in each shard, and each shard may be scattered on each node of the ES cluster. In this case, you can perform query, statistical analysis and other operations. Although ES has encapsulated the technical details, we still need to understand that this is a distributed storage query solution.

In my opinion, although there is a mature framework to package the difference between distributed data and centralized data in terms of relational database or ES, users still need to understand the changes brought about by distribution in their thinking so as to get the correct results.

Deep the paging problem

Deep paging, simply called deep paging, searches extremely deep and displays hundreds of pages of data. Why is deep Paging problematic?

Let’s say I have 20,000 pieces of data in the index, stored in 5 shards, and I send a query with a conditional sorting field, and if I want to get the data on page 1, then I take 10 pieces of data from each shard, and I summarize them into a Coordinate Node, 50 pieces in total, Coordinate Node sort the 50 data, filter out the next 40 data, only take the first 10 and return them to the client.

What if it’s page 1000? Do you get 10,000-10,0010 per shard, put them ina Coordinate Node, or 50, and then send them back to the client?

That’s not how you do distributed data, so on page 1000, instead of taking 10,000-1,0010 for each shard, you take 1,0010 for each of the five shards and you take 5,050 for each Coordinate Node, Coordinate Node Collect data. After sorting, take item 10000-10010 and return 10 items of data to the client.

Cost so much effort to collect 50050 pieces of data, the actual client only 10 pieces, 50040 pieces of data lost, good memory.

What about page 10,000? The result is hard to look at

The more shards an index has, the more data it needs to aggregate multiplies. You can see that the cost of sorting and distributing results in distributed systems increases exponentially with depth. The two most important influencing dimensions are paging depth and number of shards. Search engines should not return more than 1000 results for any query.

Extended top N problem model

Deep paging can be improved to some extent by optimizing search keywords and controlling paging depth. How to solve the TOP N problem?

In aggregate queries, it is common to encounter the analysis requirement of querying the most XX 10 records, which is the top N problem model.

Perfect solution to the scene

Let’s start with a familiar example: the top 10 most-played English children’s songs.

Document Data structure:

{
  "_index": "music",
  "_type": "children",
  "_id": "2",
  "_version": 6,
  "found": true,
  "_source": {
    "name": "wake me, shark me",
    "content": "don't let me sleep too late, gonna get up brightly early in the morning",
    "language": "english",
    "length": "55",
    "likes": 0
  }
}Copy the code

This requirement is handled easily by ES for several reasons:

  • A document will only exist in one shard
  • Each document data has statistics on the number of plays

With the guarantee of the above points, ES can be assured and bold to take the top 10 data with the highest number of plays in each shard, and the data collected by Coordinate Node is only 50, at this time, the performance is very high.

This scenario is not suitable for direct query

The previous section relies on the pre-design of Document and the feature of Shard storing data to avoid full index scanning and achieve extremely high performance. Assume that there is a play log record for every play click in the system every day, recording the song ID, click person, click time, listening duration and length percentage (the percentage with the complete song, If you hear half of it and quit, this value is 50%), the document data example:

{ "_index": "playlog-20191121", "_type": "music", "_id": "1", "_version": 1, "found": true, "_source": { "music_id": 1, "listener": "Tony Wang ", "listen_date": "2019-11-21 15:35:00", "music_length": 52, "isten_percentage": 0.95}}Copy the code

SQL > select * from playlog-YYYYMmdd; SQL > select * from playlog-YYYYMmDD; SQL > select * from playlog-YYYYMmDD;

If the direct statistics, it can only be carried hard, the basic process is as follows:

  1. Each shard is grouped according to music_id. Theoretically, the maximum number of shards can be 2 million.
  2. Coordinate Node collects the data of 10 shards and merges them. The maximum data processing capacity is 20 million pieces and finally merges them into 2 million pieces.
  3. Take the first 10 items from the 2 million items and return them to the client.

This process is absolutely heavyweight, if every time real-time statistics, ES cluster pressure can be imagined.

The improved scheme
  1. Add data update logic to play function Add index data structure of statistics by date in advance. Every time a user clicks play, an additional update message is sent to update the data. When querying, results are directly obtained from the statistical index, avoiding every query.
  2. Scheduled task statistics Data statistics can be calculated with scheduled tasks to store the calculation results and reduce the real-time performance to avoid the pressure of full index scan calculation.

A simple comparison:

  • Similarities: Both use space for time to avoid full index scan.
  • Differences: The former changes the business implementation logic and increases the data cascading update, which has certain business coupling; The latter changes real-time computation into timed task, which has high flexibility and low coupling with business, but poor real-time performance.
add

Good data structure design can greatly reduce the ES query pressure and improve the performance of real-time query, but there is a point to accept: even the most comprehensive design is difficult to adapt to the ever-changing needs; Requirements change is inevitable and there is no silver bullet.

summary

This paper, from the perspective of the paging query, this paper expounds the deep problems cause of the paging and passing to the distributed system with the centralized system processing differences of thinking to do a simple description, and finally lead to the question of the top N scenario, the scheme mentioned above only for simple scenarios, face the situation of practical production must be more complicated, For example, storm is a distributed computing component to solve the problem of top N.

Focus on Java high concurrency, distributed architecture, more technical dry goods to share and experience, please pay attention to the public account: Java architecture community