General advice

  1. Do not return result sets that are too large. As a search engine, Elasticsearch is very good at returning the top N documents that match a query, but not at returning all matching documents. If all documents must be returned, use the scroll API (a sketch follows this list).
  2. Avoid storing overly large documents. By default Elasticsearch limits a single document to 100MB (http.max_content_length), and the limit can be raised to at most about 2GB. Even setting this hard limit aside, you should not store huge documents: they put more pressure on the network, memory, and disk, and they make searching, highlighting, and so on more expensive. Think carefully about data modeling instead; in a book-search scenario, for example, there is no need to store the whole book as one document, only its chapters or paragraphs.
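The scroll API mentioned above might be used roughly like this (a minimal sketch; the index name, page size, and scroll timeout are placeholders):

# my_index, the page size, and the 1m scroll timeout are placeholders
GET /my_index/_search?scroll=1m
{
  "size": 1000,
  "query": {
    "match_all": {}
  }
}

# pass the scroll_id returned by the previous response to fetch the next page
GET /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "DXF1ZXJ5QW5kRmV0Y2gB..."
}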

Improve indexing efficiency

  1. Send data to Elasticsearch from multiple threads or processes. To make better use of cluster resources, issue bulk requests from multiple threads or processes, and gradually increase the number of threads until I/O or CPU on the cluster machines is saturated.

Use the Nodes Stats API to check the CPU and load of each node: os.cpu.percent, os.cpu.load_average.1m, os.cpu.load_average.5m, and os.cpu.load_average.15m. For usage of the Nodes Stats API and an explanation of its parameters, see: www.elastic.co/guide/en/el…
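For example, a minimal Nodes Stats call that returns just the OS-level figures:

# per-node OS statistics, including os.cpu.percent and os.cpu.load_average
GET /_nodes/stats/os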

  2. Increase the refresh_interval. By default, every shard is refreshed once per second. If near-real-time search is not strictly required, you can lower the refresh frequency for the index:
PUT /my_logs
{
  "settings": {
    "refresh_interval": "30s"
  }
}
  3. Disable refresh and replicas when initializing an index. If a large amount of data needs to be imported into an index at once, set refresh_interval to -1 and number_of_replicas to 0. Once the import has finished, set refresh_interval and number_of_replicas back to their original values, as sketched below.
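A sketch of that toggle, assuming an existing index named my_index and original values of 1s and 1:

# before the bulk import: disable refresh and replicas (my_index is a placeholder)
PUT /my_index/_settings
{
  "index": {
    "refresh_interval": "-1",
    "number_of_replicas": 0
  }
}

# after the import: restore the original values
PUT /my_index/_settings
{
  "index": {
    "refresh_interval": "1s",
    "number_of_replicas": 1
  }
}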

Choose an appropriate number of shards and replicas

By default (before 7.0), an index has 5 primary shards and 1 replica.

The number of shards strongly affects retrieval speed: both too many and too few shards make searches slow. With too many shards, a search opens a large number of files and requires more communication between nodes; with too few shards, each shard's index becomes too large, which also slows retrieval.

Set the number of shards based on the number of machines, the disks, and the index size. A single shard should preferably not exceed 30GB; the shard size is simply the total data volume divided by the number of shards.
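As a hypothetical example, about 600GB of data at the 30GB-per-shard ceiling works out to roughly 20 primary shards:

# 600GB / 30GB per shard ≈ 20 primary shards; the numbers and index name are illustrative
PUT /my_big_index
{
  "settings": {
    "number_of_shards": 20,
    "number_of_replicas": 1
  }
}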

Improve query efficiency

  1. Avoid placing documents of different types in the same index; mapping types are deprecated as of version 7.0.
  2. In time-range scenarios, create indexes by year, month, or day:
    1. As the data grows, choose the shard and replica counts based on the current data volume, instead of setting a very large shard count up front just to allow for future expansion.
    2. The alias mechanism allows flexible switching between indexes (see the sketch after this list).
    3. For indexes that are no longer updated, such as last week's or last month's, merge the many small segments in the index into one large segment to improve query efficiency.
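For the alias switching in point 2.2, a sketch (the index and alias names are hypothetical):

# atomically repoint the alias from last month's index to the current one
POST /_aliases
{
  "actions": [
    { "remove": { "index": "logs-2019-06", "alias": "logs-current" } },
    { "add": { "index": "logs-2019-07", "alias": "logs-current" } }
  ]
}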
  3. Aggregations on analyzed (text) fields are rarely needed. Elasticsearch requires fielddata to aggregate on text fields, and fielddata is disabled by default. The recommended approach is a multi-field mapping: index the string as a text field for full-text search plus a keyword sub-field for aggregations (see the mapping sketch after the query example below).
  4. Use filters to improve query efficiency. Filters are very fast to execute, do not compute relevance (the scoring phase is skipped entirely), and are easily cached. For example, a constant_score query can run a term query in non-scoring mode, giving every match a uniform score of 1:
GET /my_store/products/_search
{
    "query" : {
        "constant_score" : { 
            "filter" : {
                "term" : { 
                    "city" : "London"
                }
            }
        }
    }
}
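For the multi-field mapping in point 3 above, a sketch (the index and field names are hypothetical; the _doc type matches the mapping style used later in this article):

# title is searched as text, while title.raw is a keyword sub-field for aggregations
PUT /my_articles
{
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "fields": {
            "raw": { "type": "keyword" }
          }
        }
      }
    }
  }
}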
  5. To avoid deep pagination, use a scroll query when a large amount of data needs to be returned.

  6. Give the file system cache plenty of memory. Elasticsearch relies heavily on the file system cache for query performance. In general, make sure at least half of the available memory goes to the file system cache so that Elasticsearch can keep the hot regions of the index in memory.

  7. Use better hardware.

  8. Model documents appropriately. The right model reduces search time. Avoid joins as much as possible: nested queries can make searches several times slower, and parent-child relations can make them hundreds of times slower.

  9. Index data ahead of time. For example, if documents have a price field that is frequently used in range aggregations over fixed ranges, you can add a keyword field such as price_range at index time to represent the bucket, and replace the range aggregation with a terms aggregation, which is more efficient (see the sketch below).
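A sketch of that idea, assuming price_range is mapped as keyword and the buckets are chosen at index time (names and bucket boundaries are illustrative):

# store the precomputed bucket alongside the raw price
PUT /my_products/_doc/1
{
  "price": 139,
  "price_range": "100-200"
}

# aggregate on the keyword bucket instead of running a range aggregation on price
GET /my_products/_search
{
  "size": 0,
  "aggs": {
    "price_ranges": {
      "terms": { "field": "price_range" }
    }
  }
}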

  10. Consider mapping identifiers as keyword. Range queries and aggregations are better served by numeric fields, while terms queries and aggregations are better served by keyword fields. When a field holds numeric values but range queries on it are rarely needed (for example, numeric identifiers such as ISBNs), consider mapping it as keyword instead of integer or long.

  11. Avoid scripts. If you really need them, use the Painless or expressions engine.

  12. Try not to use now in date-range searches, because such queries cannot be cached. Use rounded dates instead, which make much better use of the query cache (see the sketch below). Concrete example: www.elastic.co/guide/en/el…
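For example, rounding now to the minute so the filter can be cached (the index and field names are illustrative):

# now-1h/m and now/m are rounded to the minute, which lets the query cache help
GET /my_logs/_search
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "timestamp": {
            "gte": "now-1h/m",
            "lte": "now/m"
          }
        }
      }
    }
  }
}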

  13. Merge the segments of read-only indexes (a sketch follows). More about the merge operation: www.elastic.co/guide/en/el…
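A sketch of merging a read-only index down to a single segment (the index name is illustrative):

# only do this on indexes that no longer receive writes
POST /logs-2019-06/_forcemerge?max_num_segments=1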

  14. Warm up global ordinals. Global ordinals are a data structure used for terms aggregations on keyword fields. By default they are loaded into memory lazily; you can instead have them built eagerly at refresh time via the mapping:

PUT index
{
  "mappings": {
    "_doc": {
      "properties": {
        "foo": {
          "type": "keyword",
          "eager_global_ordinals": true
        }
      }
    }
  }
}
  15. Warm up the file system cache. www.elastic.co/guide/en/el…
  16. Use index sorting to speed up Top N retrieval. Documents within a segment can be stored sorted by a specified field. This adds a small cost at write time (no sorting is applied by default), but when retrieving the top N documents the search can terminate early, improving query efficiency (see the sketch below). Details: www.elastic.co/guide/en/el…
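A sketch of index sorting, assuming documents are usually wanted newest-first (the index and field names are illustrative):

# store documents in each segment sorted by timestamp, newest first
PUT /events
{
  "settings": {
    "index": {
      "sort.field": "timestamp",
      "sort.order": "desc"
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "timestamp": { "type": "date" }
      }
    }
  }
}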
  17. Use preference to increase cache utilization. Elasticsearch relies on several kinds of caches to speed up queries, such as the file system cache and the query cache, but these caches live on individual nodes. Because the default routing policy is round-robin, the same request issued repeatedly may be served by different nodes and fail to reuse those caches. In scenarios where successive requests from the same user are similar, the preference parameter can route them consistently and improve cache hit rates (see the sketch below). www.elastic.co/guide/en/el…
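A sketch, using a stable per-user preference string (the value itself is arbitrary; the index and field names are illustrative):

# the same preference value always routes to the same shard copies, improving cache reuse
GET /my_index/_search?preference=user_12345
{
  "query": {
    "match": { "title": "elasticsearch" }
  }
}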
  18. Turn on adaptive replica selection. Instead of round-robin routing, the coordinating node then selects replica shards based on a number of criteria, such as response time, previous service time, and queue length:
PUT /_cluster/settings
{
    "transient": {
        "cluster.routing.use_adaptive_replica_selection": true
    }
}
  19. Tune the number of replica shards. Replicas can increase throughput, but more is not always better: a node often holds several shards (primaries of some shards and replicas of others), and they compete for resources such as the file system cache, which hurts performance; yet replicas must be configured to guarantee availability. To balance availability and throughput, with num_nodes nodes, num_primaries primary shards, and at most max_failures nodes allowed to fail simultaneously, the recommended number of replicas is max(max_failures, ceil(num_nodes / num_primaries) - 1). For example, 10 nodes, 5 primaries, and tolerance for 1 simultaneous failure gives max(1, ceil(10 / 5) - 1) = 1 replica.
  20. Use custom routing so that all documents in an index that share a characteristic (for example, the same userID) are stored on the same shard; subsequent queries that supply the same routing value go directly to that shard instead of being broadcast to all shards (see the sketch below).
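A sketch of custom routing by userID (the index, field, and routing values are illustrative):

# index the document with the user's ID as the routing value
PUT /forum/_doc/1?routing=user123
{
  "userID": "user123",
  "title": "hello world"
}

# a query that passes the same routing value only hits the shard holding that user's documents
GET /forum/_search?routing=user123
{
  "query": {
    "match": { "userID": "user123" }
  }
}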

Reference documentation

General advice

Suggestions for using Elasticsearch

Tune for search speed