A previous interview question:
From a year and a half of interview experience sharing (covering Alibaba, Meituan, Toutiao, JD, and Didi).
Preface
Elasticsearch is a real-time, distributed search and analysis engine, and paging through (or traversing) result sets is one of its typical usage scenarios.
With relational databases we are warned against, or even explicitly forbidden from, deep paging; in Elasticsearch, too, deep paging should be avoided as much as possible.
This article is about pagination in Elasticsearch!
The From/Size parameters
In ES, a search returns the top 10 matching hits by default. To page through results, use the from and size parameters:

- The from parameter defines the number of hits to skip. The default is 0.
- The size parameter defines the maximum number of hits to return.
A basic ES query statement looks like this:
```
POST /my_index/my_type/_search
{
  "query": { "match_all": {} },
  "from": 100,
  "size": 10
}
```
The query above takes 10 hits from the result set, skipping the first 100 (i.e. it returns results 101 through 110).
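For reference, a 1-based page number maps to from with a simple formula; a small worked example matching the query above:

```
# Mapping a 1-based page number to from/size:
#   from = (page - 1) * page_size
# Page 11 with page_size = 10 gives from = 100, size = 10 (hits 101-110)
```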
So how does this query execute inside the ES cluster?
In ES, a search request generally runs in two phases, query and fetch. Put simply, the query phase determines which docs to fetch, and the fetch phase retrieves the actual documents.
The Query phase
The figure above describes the query phase of a search request:

- The client sends a search request; node1 receives it and creates a priority queue of size from + size. Node1 is called the coordinating node.
- The coordinating node broadcasts the request to the shards involved; each shard runs the search locally and stores its results in its own priority queue, also of size from + size. The priority queue can be understood as a list containing the top N results.
- Each shard returns the data held in its priority queue to the coordinating node, which merges the results from all shards into a global priority queue of its own.
In the example above, the coordinating node receives (from + size) * 6 entries, merges and sorts them, and keeps the top from + size entries in its priority queue for use in the fetch phase.
Moreover, since the data each shard returns is used only to select the top from + size documents, each shard sends back just the document's unique identifier _id and the _score used for sorting, keeping the amount of data transferred small.
Once the coordinating node has built its priority queue, the query phase ends and the fetch phase begins.
The Fetch phase
The query phase determines which data to fetch but does not fetch the actual documents; that is the fetch phase's job.
The image above shows the fetch process:

- The coordinating node sends GET requests to the shards involved.
- Each shard retrieves the document details by doc _id and returns them to the coordinating node.
- The coordinating node returns the data to the client.
The coordinating node's priority queue holds from + size _ids, but not all of them need to be fetched. In the example above, the first 100 are skipped; only items 101 to 110 in the priority queue are fetched.
The documents to retrieve may live on different shards or on the same shard; the coordinating node uses multi-get to avoid hitting the same shard multiple times, which improves performance.
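Internally, the coordinating node batches these lookups; conceptually, this resembles the public multi-get (_mget) API. A purely illustrative sketch (the index name and ids are hypothetical; ES performs this step for you during fetch):

```
GET /my_index/my_type/_mget
{
  "ids": ["doc_101", "doc_102", "doc_103"]
}
```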
Requesting deep paging in this way is problematic:
Suppose we are searching an index with five primary shards. When we request the first page of results (results 1 through 10), each shard produces its top 10 results and returns them to the coordinating node, which sorts all 50 results to get the overall top 10.
Now suppose we ask for page 1000, i.e. results 10001 through 10010. Everything works the same way, except that each shard now has to produce its top 10010 results, and the coordinating node has to sort all 50050 of them, only to discard 50040.
The cost of sorting results grows rapidly with page depth.
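To make those numbers concrete, here is what the page-1000 request looks like, with the per-phase costs as comments (a sketch assuming the 5-shard example; my_index is a placeholder):

```
# from = 10000, size = 10, 5 primary shards:
#   each shard sorts and returns from + size = 10010 (_id, _score) pairs
#   the coordinating node merges 5 * 10010 = 50050 entries,
#   keeps 10 of them and discards 50040
GET /my_index/_search
{
  "from": 10000,
  "size": 10,
  "query": { "match_all": {} }
}
```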
Note 1:
The value of from + size cannot exceed the index.max_result_window setting, which defaults to 10000.
If a search needs to go beyond 10000 results, index.max_result_window must be raised:
```
PUT _settings
{
  "index": {
    "max_result_window": "10000000"
  }
}
```
Note 2:
_doc will be removed in future releases, see:
- www.elastic.co/cn/blog/mov…
- elasticsearch.cn/article/158
Deep paging problem
Elasticsearch's from/size paging works, but it has limitations.
For example, take an index with 1 billion documents spread across 10 shards, and a search request with from=1000000 and size=100. This puts serious pressure on CPU, memory, I/O, and network bandwidth.
In the query phase, each shard must return 1000100 entries to the coordinating node, which therefore receives 10 * 1000100 entries. Even though each entry carries only the _id and _score, that is still a lot of data.
On the other hand, deep paging requests like this rarely make sense: users seldom page that far, and many business scenarios simply cap paging at, say, the first 100 pages.
However, suppose a WeChat influencer with 10 million followers wants to message all of them, or all followers in a certain province; then every qualifying follower must be retrieved. The obvious approach is from + size, but that is not realistic. In such cases, the traversal should be done with the other mechanisms Elasticsearch provides.
Deep paging problems can be broadly divided into two categories:

- Random deep paging: jumping to an arbitrary page
- Scrolling deep paging: paging downward only, one page at a time
The following are the deep paging methods officially provided.
Scroll
Using scroll to traverse data
You can think of scroll as the cursor of a relational database. Accordingly, scroll is not suitable for real-time search, but for background batch tasks such as bulk messaging.
Its purpose is not to query data in real time, but to fetch a large amount of data (possibly all of it) at once.
Scroll essentially keeps a snapshot of the current index segments, taken when the scroll query is issued; any data indexed after that point will not appear in this snapshot.
It does not, however, fetch everything up front and discard what you skip; it records a read position so that the next read can resume quickly.
When sorting is not needed, it can be combined with search_type=scan.
Scroll has two parts: initialization and traversal. During initialization, all search results matching the query are cached (note that only the doc_ids are cached, not the full documents; the actual data is fetched in the fetch stage). You can think of this as a snapshot.
During traversal, data is read from this snapshot; inserts, deletes, and updates to the index after initialization do not affect the traversal results.
Basic usage
```
POST /twitter/tweet/_search?scroll=1m
{
  "size": 100,
  "query": {
    "match": {
      "title": "elasticsearch"
    }
  }
}
```
Initialization looks like an ordinary search request: specify the index and type, and add the scroll parameter, which says how long the search results should be kept alive.
The response includes a _scroll_id, which is used for the next fetch.
Traversal
```
POST /_search?scroll=1m
{
  "scroll_id": "XXXXXXXXXXXXXXXXXXXXXXX I am a scroll id XXXXXXXXXXXXXXX"
}
```
Here scroll_id is the _scroll_id returned by the previous traversal request (or by the initialization), and the scroll parameter is again required.
Repeat this step until the returned hits are empty, which means the traversal is complete.
Note that the scroll parameter is passed on every request to refresh the lifetime of the cached search results, and that index and type no longer need to be specified.
The scroll value must be long enough for the next scroll request to arrive before the context expires, but not longer than necessary, since the cached context consumes limited resources.
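Putting the pieces together, a minimal sketch of the full scroll loop, including releasing the context early via the clear scroll API once traversal is done (the scroll_id values are placeholders):

```
# 1. Initialize; keep the search context alive for 1 minute:
POST /twitter/tweet/_search?scroll=1m
{
  "size": 100,
  "query": { "match_all": {} }
}

# 2. Repeat with the _scroll_id from each response until hits is empty:
POST /_search?scroll=1m
{
  "scroll_id": "<_scroll_id from the previous response>"
}

# 3. Release the context as soon as you are done,
#    instead of waiting for the timeout:
DELETE /_search/scroll
{
  "scroll_id": "<_scroll_id>"
}
```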
Advantages and disadvantages
Disadvantages:

- The scroll_id can be a huge resource hog (especially for sorted requests).
- Likewise, scroll takes a timeout; issuing scroll requests frequently can cause a series of problems.
- It is a historical snapshot: data changes are not reflected in it.

Advantages:

Suitable for processing large amounts of data in a non-real-time fashion, such as data migration or index changes.
Scroll Scan
ES provides scroll scan to further improve traversal performance. However, scroll scan does not support sorting, so it suits scenarios where sorting is not required.
Basic usage
Scroll scan traversal works the same as a normal scroll; only the initialization differs slightly.
```
POST /my_index/my_type/_search?search_type=scan&scroll=1m&size=50
{
  "query": { "match_all": {} }
}
```
Parameters to specify:

- search_type: set to scan to traverse the results with scroll scan, without sorting.
- scroll: same as above, the keep-alive time.
- size: unlike the normal size, this is the number of hits returned per shard, so each batch returns at most number_of_shards * size hits.
Difference between Scroll Scan and Scroll
- Scroll scan results are returned in index order, without sorting, which improves retrieval performance.
- The initialization request returns only a _scroll_id, with no hits.
- size controls the amount of data returned per shard, not per request (see the worked example below).
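A quick worked example of those size semantics, assuming an index with 5 primary shards:

```
# size = 50 is applied per shard, so each scan batch returns
# at most number_of_shards * size = 5 * 50 = 250 hits
```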
Sliced Scroll
If you have a lot of data, scrolling through it serially can be unacceptably slow; fortunately, the scroll interface supports concurrency.
Each scroll request can be split into multiple slice requests. Each slice runs independently and in parallel, which can be many times faster than a plain scroll traversal.
```
POST /index/type/_search?scroll=1m
{
  "query": { "match_all": {} },
  "slice": {
    "id": 0,
    "max": 5
  }
}

POST /index/type/_search?scroll=1m
{
  "query": { "match_all": {} },
  "slice": {
    "id": 1,
    "max": 5
  }
}
```
The example above issues two slice requests independently; running all five slices and combining their results is equivalent to a single scroll scan over the whole data set.
Here max is the total number of slices, and id is the slice's number (from 0 to max - 1).
The official documentation recommends that max not exceed the number of shards; otherwise, memory consumption may explode.
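Per the official docs, slicing is computed on the _id field by default, which is what makes it memory-hungry; slicing on a numeric field with doc values (such as a timestamp) instead reduces that cost. A sketch, with the index and field names as placeholders:

```
POST /index/type/_search?scroll=1m
{
  "query": { "match_all": {} },
  "slice": {
    "field": "date",
    "id": 0,
    "max": 5
  }
}
```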
Search After
search_after is a paging mechanism introduced in ES 5.x. Like scroll, it works as a cursor, so its usage is quite similar.
Basic usage:

Step 1:
```
POST twitter/_search
{
  "size": 10,
  "query": {
    "match": {
      "title": "es"
    }
  },
  "sort": [
    {"date": "asc"},
    {"_id": "desc"}
  ]
}
```
The returned result looks like:
```
{
  "took" : 29,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [
      {
        ...
        "sort" : [...]
      },
      {
        ...
        "sort" : [
          124648691,
          "624812"
        ]
      }
    ]
  }
}
```
Each document in the response comes with an array of its sort values.
These sort values can be passed to the search_after parameter to fetch the data for the next page.
For example, we can take the sort values of the last document and pass them to search_after:
```
GET twitter/_search
{
  "size": 10,
  "query": {
    "match": {
      "title": "es"
    }
  },
  "search_after": [124648691, "624812"],
  "sort": [
    {"date": "asc"},
    {"_id": "desc"}
  ]
}
```
To read the page that follows the previous results, the second query repeats the first query's statement and adds search_after to specify where to continue reading.
Basic principle
search_after behaves like a live cursor: the last record of the previous page serves as the cursor for the next query. The query is stateless, with no context kept on the server, so each query sees the latest data.
Because a record is used as the cursor, search_after requires at least one globally unique field in the sort (a field whose value is unique per document should be used as the tiebreaker).
Advantages and disadvantages
Advantages:

- Stateless queries: avoids the problem of data changes not being reflected in the query in time.
- No scroll_id to maintain and no snapshot to keep, so heavy resource consumption is avoided.
Disadvantages:

- Because queries are stateless, data changes during querying can make values inconsistent across pages.
- The sort order may change during execution, depending on index updates and deletions.
- At least one unique, non-repeating field must be included in the sort.
- It is not suitable for random page jumps or full exports: jumping to page N amounts to repeating the search_after query N times, and a full export means a large number of repeated queries in a short time.

search_after is not a solution for jumping freely to an arbitrary page, but rather for scrolling through many queries in parallel.
Conclusion
Paging method | Performance | Advantages | Disadvantages | Scenario |
---|---|---|---|---|
from + size | Low | Flexible and simple to implement | Deep paging problem | Small data sets, or when the deep paging problem can be tolerated |
scroll | Medium | Solves deep paging | Cannot reflect real-time data (snapshot); high maintenance cost, a scroll_id must be maintained | Exporting massive data; querying massive result sets |
search_after | High | Best performance; no deep paging problem; reflects real-time data changes | Complex to implement: a globally unique sort field is required, and continuous paging is involved because each query depends on the previous one's results; page jumps are not possible | Paging through massive data |
Changes in ES 7
Reference: www.elastic.co/guide/en/el…
In 7.x versions, ES officially recommends against using the scroll API for deep paging; instead, use search_after with a point in time (PIT).
From 7.x on, the search_after parameter retrieves the next page of hits using a set of sort values from the previous page.
Using search_after requires multiple search requests with the same query and sort values.
If a refresh occurs between these requests, the order of the results may change, causing inconsistencies across pages.
To prevent this, you can create a point in time (PIT) to preserve the current index state across your searches.
```
POST /my-index-000001/_pit?keep_alive=1m
```

This returns a PIT ID:

```
{
  "id": "46ToAwMDaWR5BXV1aWQyKwZub2RlXzMAAAAAAAAAACoBYwADaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQADaWR5BXV1aWQyKgZub2RlXzIAAAAAAAAAAAwBYgACBXV1aWQyAAAFdXVpZDEAAQltYXRjaF9hbGw_gAAAAA=="
}
```
Specify PIT in the search request:
```
GET /_search
{
  "size": 10000,
  "query": {
    "match": {
      "user.id": "elkbee"
    }
  },
  "pit": {
    "id": "46ToAwMDaWR5BXV1aWQyKwZub2RlXzMAAAAAAAAAACoBYwADaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQADaWR5BXV1aWQyKgZub2RlXzIAAAAAAAAAAAwBYgACBXV1aWQyAAAFdXVpZDEAAQltYXRjaF9hbGw_gAAAAA==",
    "keep_alive": "1m"
  },
  "sort": [
    {"@timestamp": {"order": "asc", "format": "strict_date_optional_time_nanos", "numeric_type": "date_nanos"}}
  ]
}
```
Performance comparison
Fetching 10 records at each of the ranges 1-10, 49000-49010, and 99000-99010 with each paging method (against an index of 100,000 documents), the rough performance results are as follows:
A note on paging backwards
ES has no API for scrolling back to the previous page, but according to an official reply (github.com/elastic/ela…), it can be emulated:

- With an ascending sort, using search_after on the id of the last document of the current page yields the next page; with the sort reversed, using search_after on the id of the first document of the current page yields the previous page (see the sketch below).
- On Chinese forums, some people use caching to solve the previous-page problem: elasticsearch.cn/question/77…
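A sketch of that idea, reusing the twitter example from earlier (the sort values are placeholders): flip every sort direction, anchor on the first document of the current page, then reverse the returned hits on the client.

```
# Next page (normal direction): anchor on the LAST doc of the page.
GET /twitter/_search
{
  "size": 10,
  "sort": [ {"date": "asc"}, {"_id": "desc"} ],
  "search_after": ["<sort values of the last doc on this page>"]
}

# Previous page: reverse each sort, anchor on the FIRST doc of the
# page, then reverse the returned hits client-side.
GET /twitter/_search
{
  "size": 10,
  "sort": [ {"date": "desc"}, {"_id": "asc"} ],
  "search_after": ["<sort values of the first doc on this page>"]
}
```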
Conclusion
- If the data volume is small (from + size < 10000), or you only care about the top-N results, use from/size paging.
- For large data volumes, deep traversal, and background batch tasks (such as data migration), use scroll.
- For large data volumes, deep paging, and real-time, highly concurrent user queries, use search_after.
Personal thoughts
Scroll and search_after work in much the same way: both handle deep paging with a cursor.
This solves the deep paging problem to some extent, but neither is the ultimate answer: deep paging must still be avoided!
With scroll, maintaining scroll_ids and historical snapshots is unavoidable, and the lifetime of each scroll_id must be guaranteed, which puts a heavy load on the server.
With search_after, letting users jump across many pages triggers frequent searches in a short time, which is very inefficient and also loads the server; meanwhile, inserts, deletes, and updates to the index during the query can make the results inconsistent or reorder them, producing inaccurate results.
search_after is itself a business compromise: it does not let you jump to a specified page, only to the next one.
Scroll, by default, will eventually retrieve all matching data, so the initial search collects and sorts only the matching doc_ids and keeps them on the coordinating node; rather than fetching all the documents at once, each scroll request reads size documents and returns, along with them, the context state recording which document on which shard to read from next.
That is why scroll is officially recommended not for real-time paged queries but for pulling large volumes of data: it was never designed for real-time reads.