
from+size

By default, from + size must not exceed 10,000 (10k). For example, with from = 5000 and size = 10, ES has to match, sort, and fetch the top 5000 + 10 = 5010 documents on each shard, merge them on the coordinating node, and then return only the last 10 of the result set. The default ceiling is 10,000 documents; you can raise it by changing the max_result_window index setting.
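If you genuinely need to read past the 10,000-document limit with from + size, the window can be raised per index. A minimal sketch (book_will is the index used throughout this post; 50000 is an arbitrary illustrative value):

PUT book_will/_settings
{
    "index.max_result_window": 50000
}

Raising the window only defers the problem: deep pages still cost memory and CPU on every shard, which is exactly what the scroll and search_after approaches below avoid.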

GET book_will/_search
{
    "from": 0,
    "size": 200,
    "query": { "match_all": {}},
    "sort" : ["_doc"]
}

scroll

A scroll query lets us run large-volume document queries against Elasticsearch efficiently, without the cost of deep paging. We initialize the search once and then pull results back in batches, much like a cursor in a traditional database.

The first query generates a cursor, the scroll_id; each subsequent request simply fetches the next batch of data for that cursor, until the hits field in a result comes back empty, meaning everything has been pulled. Generating a scroll_id can be understood as taking a temporary historical snapshot: later writes (index, update, delete) do not affect the snapshot's results. The official recommendation is not to use scroll for real-time requests, because each scroll_id not only pins a lot of resources (especially for sorted requests) but is also a historical snapshot, so data changes are not reflected in it. Scroll is therefore typically used where large amounts of data are processed without real-time requirements, such as data migration or reindexing.

GET book_will/_search?scroll=1m
{
    "query": { "match_all": {}},
    "sort" : ["_doc"],
    "size": 1000
}

tips:

scroll=1m keeps the cursor's search context open for one minute.

The "sort": ["_doc"] clause: _doc is the most efficient sort order when you don't care about document order.

The response to this query includes a _scroll_id field, a long base64-encoded string.

Now we can pass the _scroll_id to the _search/scroll endpoint to fetch the next batch of results:

GET /_search/scroll
{
    "scroll": "1m",
    "scroll_id" : "cXVlcnlUaGVuRmV0Y2g7NTsxMDk5NDpkUmpiR2FjOFNhNnlCM1ZDMWpWYnRROzEwOTk1OmRSamJHYWM4U2E2eUIzVkMxalZidFE7MTA5OTM6ZFJqYkdhYzhTYTZ5QjNWQzFqVmJ0UTsxMTE5MDpBVUtwN2lxc1FLZV8yRGVjWlI2QUVBOzEwOTk2OmRSamJHYWM4U2E2eUIzVkMxalZidFE7MDs="
}

Note that this sets the cursor's expiration to one minute again: every scroll request renews the window.

This cursor query returns the next batch of results. Although we set size to 1000, we may get back more documents than that: size is applied per shard, so each batch can return up to size * number_of_primary_shards documents.
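Putting the pieces together, here is a minimal sketch of a full scroll loop with the Java High Level REST Client, in the same style as the clearScroll snippet at the end of this post. The client variable, the book_will index, and the exact package of TimeValue (it moved between 7.x releases) are assumptions:

import org.elasticsearch.action.search.ClearScrollRequest;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchScrollRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;

// Open the scroll context with the same parameters as the REST example above
SearchRequest searchRequest = new SearchRequest("book_will");
searchRequest.scroll(TimeValue.timeValueMinutes(1));
searchRequest.source(new SearchSourceBuilder()
        .query(QueryBuilders.matchAllQuery())
        .sort("_doc")      // most efficient sort order for scrolling
        .size(1000));

SearchResponse response = client.search(searchRequest, RequestOptions.DEFAULT);
String scrollId = response.getScrollId();
SearchHit[] hits = response.getHits().getHits();

while (hits != null && hits.length > 0) {
    // process this batch of hits here ...

    // Fetch the next batch, renewing the one-minute window each time
    SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId);
    scrollRequest.scroll(TimeValue.timeValueMinutes(1));
    response = client.scroll(scrollRequest, RequestOptions.DEFAULT);
    scrollId = response.getScrollId();
    hits = response.getHits().getHits();
}

// Release the scroll context as soon as we are done (see the note at the end)
ClearScrollRequest clearScrollRequest = new ClearScrollRequest();
clearScrollRequest.addScrollId(scrollId);
client.clearScroll(clearScrollRequest, RequestOptions.DEFAULT);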

search_after

The idea of search_after is to record where the previous query ended and start the next query from that position. The second query uses the same statement as the first, but adds a search_after clause specifying which document to start reading after: the values given to search_after act like a live cursor over the ordered data, telling ES where to continue.

GET book_will/book/_search
{
    "size": 10,
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    },
    "search_after": [1463538857, "tweet#654323"],
    "sort": [
        {"es_timestamp": "asc"},
        {"_uid": "desc"}
    ]
}
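Every hit in the response carries a sort array holding its es_timestamp and _uid values. To get the next page, take the sort values of the last hit of the current page and pass them as search_after in an otherwise identical request. A sketch, with hypothetical sort values standing in for the last hit's:

GET book_will/book/_search
{
    "size": 10,
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    },
    "search_after": [1463538867, "tweet#654399"],
    "sort": [
        {"es_timestamp": "asc"},
        {"_uid": "desc"}
    ]
}

This is also why the sort needs a globally unique tiebreaker (here _uid): without one, documents sharing the same timestamp could be skipped or repeated across pages.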

Comparing from + size paging, scroll search, and search_after cursor-style paging

1. from + size: paging performance is poor once pages get deep; its advantages are flexibility (you can jump to any page) and a simple implementation; its drawback is the deep-paging problem. It suits scenarios with small data volumes, or where the cost of deep paging can be tolerated.

2. scroll: performance is moderate; its advantage is that it solves the deep-paging problem; its drawbacks are that results are not real-time (they come from a snapshot) and that a scroll_id must be kept alive, which is costly. It suits non-real-time traversal of massive result sets.

3. search_after: performance is the best of the three; it has no deep-paging problem and reflects data changes in real time; its drawbacks are a more complex implementation, the need for a globally unique sort field, and the fact that each page depends on the results of the previous one (no random page jumps). It suits paging through massive amounts of data.

Note:

Delete all scroll cursors: DELETE /_search/scroll/_all
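To release a single cursor instead of all of them, the scroll_id can be passed in the request body; a sketch reusing the id from the example above:

DELETE /_search/scroll
{
    "scroll_id" : "cXVlcnlUaGVuRmV0Y2g7NTsxMDk5NDpkUmpiR2FjOFNhNnlCM1ZDMWpWYnRROzEwOTk1OmRSamJHYWM4U2E2eUIzVkMxalZidFE7MTA5OTM6ZFJqYkdhYzhTYTZ5QjNWQzFqVmJ0UTsxMTE5MDpBVUtwN2lxc1FLZV8yRGVjWlI2QUVBOzEwOTk2OmRSamJHYWM4U2E2eUIzVkMxalZidFE7MDs="
}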

And the equivalent with the Java High Level REST Client:

ClearScrollRequest clearScrollRequest = new ClearScrollRequest();
clearScrollRequest.addScrollId(scrollId);
this.esSafeRestClient.clearScroll(clearScrollRequest, RequestOptions.DEFAULT);