Elasticsearch supports paged queries. If you write a query like this:

```json
{
    "from": 10,
    "size": 10,
    "query": { "match_all": {} }
}
```

Elasticsearch will retrieve the first 20 hits, discard the first 10, and return only the last 10 (hits 11-20).

The side effect is obvious: the deeper the page, the more data Elasticsearch has to fetch and discard, so queries get slower and slower.

Therefore, scroll should be used for queries over large data sets. Scroll keeps a cursor that marks the current read position, so each subsequent query can pick up quickly where the last one left off.
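At the REST level, scrolling is a two-step protocol (the index name and query below are placeholders): the initial search asks Elasticsearch to keep a context alive, and each follow-up call passes back the `_scroll_id` from the previous response:

```json
POST /my-index/_search?scroll=1m
{
    "size": 100,
    "query": { "match_all": {} }
}

POST /_search/scroll
{
    "scroll": "1m",
    "scroll_id": "<_scroll_id from the previous response>"
}
```

Each follow-up call returns the next batch of hits; an empty batch means the result set is exhausted.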

However, both approaches have a pitfall worth noting, explained in detail below.

From + size

Possible problems:

Result window is too large, from + size must be less than or equal to: [10000] but was [10010]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.

With this paging method you can retrieve at most 10,000 hits; going beyond that triggers the error above.

The solution is also simple: either switch to scroll for large result sets, or increase `index.max_result_window` to cover the range you need to query.
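For the second option, `index.max_result_window` is a per-index setting; a sketch of raising it (the index name and the value 50000 are just examples) looks like:

```json
PUT /my-index/_settings
{
    "index.max_result_window": 50000
}
```

Keep in mind that every deep page still has to be fetched and sorted in memory, so raising this limit trades memory and CPU for convenience.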

Scroll mode is recommended.

Scroll

Possible problems:

Trying to create too many scroll contexts. Must be less than or equal to: [500]. This limit can be set by changing the [search.max_open_scroll_context] setting.

This error is caused by:

Each scroll search opens a scroll context on the cluster, and by default at most 500 contexts can be open at once. When a large number of scroll requests hit Elasticsearch and the limit is reached, new requests cannot obtain a scroll_id and fail with the error above.

This is especially common in high-concurrency scenarios.

The quick workaround is to increase the `search.max_open_scroll_context` value.
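`search.max_open_scroll_context` is a dynamic cluster-level setting, so it can be raised without a restart; a sketch (the value 1000 is just an example):

```json
PUT /_cluster/settings
{
    "persistent": {
        "search.max_open_scroll_context": 1000
    }
}
```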

However, this only postpones the problem. The better fix is to clear the scroll_id as soon as the query is finished.

```python
from elasticsearch import Elasticsearch

client = Elasticsearch(host, http_auth=(username, password), timeout=3600)
es_data = client.search(index=es_index, body=query_body, scroll='1m', size=100)
scroll_id = es_data['_scroll_id']
client.clear_scroll(scroll_id=scroll_id)  # release the scroll context when done
```
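Putting it together, a minimal sketch of a full scroll loop that drains all hits and then releases the context (the client, index name, and query body are assumed to be set up as above):

```python
def scroll_all(client, index, body, scroll="1m", size=100):
    """Fetch every hit matching `body`, then release the scroll context."""
    page = client.search(index=index, body=body, scroll=scroll, size=size)
    scroll_id = page["_scroll_id"]
    hits = page["hits"]["hits"]
    results = list(hits)
    try:
        while hits:  # an empty page means the scroll is exhausted
            page = client.scroll(scroll_id=scroll_id, scroll=scroll)
            scroll_id = page["_scroll_id"]
            hits = page["hits"]["hits"]
            results.extend(hits)
    finally:
        # Release the context right away instead of waiting for the timeout
        client.clear_scroll(scroll_id=scroll_id)
    return results
```

The `try/finally` ensures the context is cleared even if a later page fails, so contexts are never leaked toward the 500 limit.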

In fact, even if we never clean up manually, the scroll context is released automatically once it expires; how long that takes depends on the scroll parameter.

For example, scroll='1m' means the context is kept alive for one minute after each request and then released.

But as with any other resource, you should release it as soon as you are done with it; good habits like this make your system more robust.
