preface

When a result set is too large to return to the front end in a single response, the results usually have to be paginated. Elasticsearch (ES) offers three ways to paginate: from + size, search_after, and scroll. Each has its advantages and disadvantages.

from+size

from and size are parameters of the ES Search API. from sets the offset of the first result to return and defaults to 0. size sets the number of results to return. Together, from and size delimit one page of the result set.

GET /_search
{
  "from": 5,
  "size": 20,
  "query": {
    "match": {
      "user.id": "kimchy"
    }
  }
}

The advantage of from + size is that it is simple to use and allows jumping to an arbitrary page. It is suitable when the results being paged through do not exceed 10,000 hits. It is not suitable for deep paging, where it runs into performance problems.
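The mapping from a page number to from and size can be sketched in Python as follows. This is only a minimal sketch: page_request and MAX_RESULT_WINDOW are illustrative names rather than part of any ES client, and the 10,000 limit is hard-coded on the assumption that the index uses the default.

```python
MAX_RESULT_WINDOW = 10_000  # assumed default limit on from + size


def page_request(page, page_size, query):
    """Build a from + size search body for a 1-based page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    offset = (page - 1) * page_size  # "from" is a 0-based offset
    if offset + page_size > MAX_RESULT_WINDOW:
        raise ValueError("from + size exceeds the result window")
    return {"from": offset, "size": page_size, "query": query}
```

Rejecting the request on the client side, as done here, simply avoids a round trip that would be refused anyway under the assumed default.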

An ES search request usually spans multiple shards. Each shard produces its own sorted results, and these must then be sorted centrally to make sure the overall order is correct.

Suppose a search runs against an index with five primary shards. When we request the first page (results 1 through 10), each shard produces its own top 10 results and returns them to the coordinating node, which sorts all 50 results to select the overall top 10.
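The gather-and-merge step described above can be imitated with plain Python lists. This is a toy model, not how ES is implemented: each "shard" is just a list of sort values, and shard_top and coordinate are hypothetical names chosen for illustration.

```python
import heapq


def shard_top(shard_values, offset, size):
    """Each shard returns its own top offset + size values, in sort order."""
    return sorted(shard_values)[: offset + size]


def coordinate(shards, offset, size):
    """Merge the per-shard candidates and cut out the requested page."""
    candidates = [shard_top(values, offset, size) for values in shards]
    merged = list(heapq.merge(*candidates))  # global ordering of all candidates
    return merged[offset : offset + size]
```

Each shard has to return offset + size candidates rather than just size, because in the worst case the whole requested page could come from a single shard.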

Now suppose we request page 1001 (results 10,001 to 10,010). Each shard has to produce its top 10,010 results, and the coordinating node sorts all 50,050 results, discards 50,040 of them, and keeps the final 10.
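The arithmetic in these two examples can be checked with a few lines of Python. The function name deep_paging_cost is made up for this sketch, and it only counts hits; it does not measure real memory or CPU usage.

```python
def deep_paging_cost(shards, offset, size):
    """Return (hits per shard, hits sorted by the coordinating node, hits discarded)."""
    per_shard = offset + size      # every shard ranks its top from + size hits
    merged = shards * per_shard    # the coordinating node sorts all of them
    discarded = merged - size      # only `size` hits are returned to the client
    return per_shard, merged, discarded


print(deep_paging_cost(5, 0, 10))       # (10, 50, 40)
print(deep_paging_cost(5, 10_000, 10))  # (10010, 50050, 50040)
```

The work grows linearly with the offset, so each deeper page costs more than the one before it.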

As you can see, deep paging significantly increases memory and CPU usage and slows the search down. For this reason, by default from + size cannot page past 10,000 hits (the limit is controlled by the index.max_result_window index setting). To page beyond 10,000 hits, use search_after.

search_after

search_after fetches the next page by taking the sort values of the last hit on the previous page.

To get the first page of results, submit a request that includes a sort.

GET /_search
{
  "size": 10000,
  "query": {
    "match": {
      "user.id": "elkbee"
    }
  },
  "sort": [
    {"timestamp": {"order": "asc"}}
  ]
}

The response contains an array of hits, each with its sort values.

{
  "took" : 17,
  "timed_out" : false,
  "_shards" : ...,
  "hits" : {
    "total" : ...,
    "max_score" : null,
    "hits" : [
      ...
      {
        "_index" : "my-index-000001",
        "_id" : "FaslK3QBySSL_rrj9zM5",
        "_score" : null,
        "_source" : ...,
        "sort" : [
          1623337212000
        ]
      }
    ]
  }
}

To get the next page, pass the sort values of the last hit in the result set above as the search_after parameter. You can keep fetching further pages, if there are any, by repeating the search_after request. The query and the sort must stay the same across all of these requests.

GET /_search
{
  "size": 10000,
  "query": {
    "match": {
      "user.id": "elkbee"
    }
  },
  "sort": [
    {"timestamp": {"order": "asc"}}
  ],
  "search_after": [
    1623337212000
  ]
}
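The repeat-until-empty loop described above might look like this in Python. It is a sketch under stated assumptions: search is a hypothetical callable that takes a request body as a dict and returns the parsed response as a dict, standing in for whatever HTTP client is actually used, and iter_search_after is an illustrative name.

```python
def iter_search_after(search, query, sort, size):
    """Yield every hit, page by page, using search_after.

    `search` is a hypothetical callable: request body (dict) in,
    parsed response (dict) out. It stands in for the real HTTP call.
    """
    body = {"size": size, "query": query, "sort": sort}
    while True:
        hits = search(body)["hits"]["hits"]
        if not hits:
            break  # an empty page means nothing is left
        yield from hits
        # Same query and sort as before, plus the last hit's sort values.
        body = {**body, "search_after": hits[-1]["sort"]}
```

Note that if several documents share the same sort value, a page boundary falling between them could cause hits to be missed, so adding a unique tiebreaker field to the sort is generally advisable.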

scroll api

The official ES documentation no longer recommends the scroll API for deep paging; search_after is recommended for that case. Scroll is suited to batch jobs that export a complete result set.

Scroll keeps the query results cached for a period of time. For example, with scroll=1m the results are kept for 1 minute while waiting for the next request, and the response contains a _scroll_id. The next request passes the scroll ID returned by the previous request to find the cached results.

POST /my-index-000001/_search?scroll=3m
{
  "size": 100,
  "query": {
    "match": {
      "message": "foo"
    }
  }
}

POST /_search/scroll
{
  "scroll" : "3m",
  "scroll_id" : "DnF1ZXJ5VGhlbkZldGNoAwAAAAAABE74FlZmS2MzSzRjVGFlWmhJNVdEd3N0REEAAAAAAAHT6hZJT205MzczWlFxdUxINXprd05MenFnAAAAAAAB0-kWSU9tOTM3M1pRcXVMSDV6a3dOTHpxZw=="
}

Each scroll request that includes the scroll parameter sets a new expiration time. If a scroll request omits the scroll parameter, the cached results are released as part of that request.
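A full export with scroll could be driven by a loop like the one below. This is a rough sketch, not a client implementation: start and fetch_next are hypothetical callables standing in for the two HTTP requests shown above (the initial search with ?scroll=... and the follow-up POST /_search/scroll), and iter_scroll is an illustrative name.

```python
def iter_scroll(start, fetch_next, body, keep_alive="3m"):
    """Yield every hit of a scroll export.

    `start(body, keep_alive)` and `fetch_next(keep_alive, scroll_id)` are
    hypothetical callables for the two HTTP requests; both return the
    parsed response as a dict.
    """
    response = start(body, keep_alive)
    while response["hits"]["hits"]:
        yield from response["hits"]["hits"]
        # Sending keep_alive again sets a new expiration time.
        response = fetch_next(keep_alive, response["_scroll_id"])
```

The scroll ID is read from each response rather than reused from the first one, on the assumption that it may change between requests.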