What is ES

  • Search engine
  • Near-real-time search
  • RESTful API
  • Distributed and highly available
  • Document-oriented storage in JSON format
  • Based on Apache Lucene

Core concepts

  • Cluster
  • Node: a single node in a cluster
  • Index
  • Shard
  • Replica: a copy of a shard
  • Segment
  • Document
  • Field
  • Inverted index
  • Text/Keyword

The whole usage process

Schema (mapping)

ES does not require the schema to be defined in advance; it can determine the schema when a doc is indexed.

ES exchanges data as JSON, so docs can be used out of the box. If no mapping has been predefined when a doc is written, the type of each field in the doc is determined from the JSON data that was sent. The default dynamic field mapping rules are as follows:

JSON type | ES type
null      | no field is added
boolean   | boolean
string    | date (if date detection matches), double/long (if numeric detection matches), otherwise text (with a keyword sub-field)
number    | float/long
object    | object
array     | depends on the type of the first non-null element in the array
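The default rules above can be approximated in a few lines of Python. This is only a rough sketch: `infer_es_type` is a hypothetical helper, its date check is a simplified regular expression rather than ES's real date detection, and numeric detection for strings is left out.

```python
import re
from typing import Any, Optional

# Simplified stand-in for date detection: ISO-like values such as
# 2015-01-01 or 2015-01-01T12:10:30Z (an assumption, not the full ES format list).
_DATE_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}(:\d{2})?(Z|[+-]\d{2}:\d{2})?)?$"
)


def infer_es_type(value: Any, date_detection: bool = True) -> Optional[str]:
    """Return the ES type a JSON value would roughly map to, or None if no field is added."""
    if value is None:
        return None                      # null: no field is added
    if isinstance(value, bool):          # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        if date_detection and _DATE_RE.match(value):
            return "date"
        return "text"                    # text with a keyword sub-field
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        for item in value:               # first non-null element decides the type
            if item is not None:
                return infer_es_type(item, date_detection)
        return None
    raise TypeError(f"not a JSON value: {value!r}")
```

For example, `infer_es_type([None, 2])` follows the array rule and looks only at the first non-null element.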

dynamic_templates are also supported to extend and modify the default rules. The following example changes the default mapping for strings:

{
  "mappings": {
    "dynamic_templates": [
      {
        "strings_as_keywords": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "text"
          }
        }
      }
    ]
  }
}

However, if dynamic fields are not required, using ES dynamic mapping is not recommended. Used carelessly, dynamic mapping pollutes the mapping with unwanted fields, so you can set dynamic to false to disable it.

You can also use the PUT Mapping API to define the mapping of an index in advance, including each field's type, the analyzer to use (for text fields), whether the field is indexed, and so on.

ES officially recommends indexing the same field in different ways when needed. For example, a string value can be indexed as text for full-text search and also as keyword for sorting and aggregation.
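As a rough illustration of the three points above, the sketch below builds the request body for creating an index with an explicit mapping. It assumes the typeless mapping format of ES 7 or later, and the field names `title` and `internal_note` are made up for the example.

```python
import json


def build_index_body() -> dict:
    """Body for PUT /<index>: explicit mapping with dynamic mapping disabled."""
    return {
        "mappings": {
            "dynamic": False,  # unmapped fields are ignored instead of extending the mapping
            "properties": {
                "title": {
                    "type": "text",          # analyzed, for full-text search
                    "analyzer": "standard",  # analyzer applied to this text field
                    # the same value indexed a second way, for sorting and aggregation
                    "fields": {"raw": {"type": "keyword"}},
                },
                # kept in _source but not searchable
                "internal_note": {"type": "keyword", "index": False},
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_index_body(), indent=2))
```

With this mapping, full-text queries would target `title`, while sorting and aggregations would target `title.raw`.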

Aliases are recommended. ES allows a mapping to be extended but not modified: you can add a field to a mapping, but you cannot delete or change an existing field. So use an alias that points to the real index. Then, when a field has to be changed, you can rebuild the index with the Reindex API and repoint the alias with the Alias API, which makes the switch seamless.
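The rebuild-and-switch flow described above could look roughly like this. The sketch only builds the two request bodies (for `POST _reindex` and `POST _aliases`); the index and alias names are placeholders, and sending the requests is left to whatever HTTP client is in use.

```python
def reindex_body(old_index: str, new_index: str) -> dict:
    """Body for POST _reindex: copy every doc from the old index into the new one."""
    return {"source": {"index": old_index}, "dest": {"index": new_index}}


def alias_swap_body(alias: str, old_index: str, new_index: str) -> dict:
    """Body for POST _aliases: move the alias from the old index to the new one."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    # Hypothetical names: clients always query "articles", never the versioned index.
    print(reindex_body("articles_v1", "articles_v2"))
    print(alias_swap_body("articles", "articles_v1", "articles_v2"))
```

Because both actions travel in one `_aliases` request, clients that query through the alias should never see a moment where it points at no index.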

Writing data

Distributed write process

As the process shows, the total latency of an ES write equals the time to write the primary shard plus max(time to write each replica shard), because the replica shards are written in parallel.
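The latency relationship can be written down as a tiny calculation. This is a sketch of the formula only, with made-up millisecond values; `write_latency_ms` is a hypothetical helper, not an ES API.

```python
from typing import Sequence


def write_latency_ms(primary_ms: float, replica_ms: Sequence[float]) -> float:
    """Total write latency: primary write time plus the slowest replica write time."""
    # Replicas are written in parallel, so only the slowest one adds to the total.
    return primary_ms + max(replica_ms, default=0.0)


if __name__ == "__main__":
    print(write_latency_ms(5.0, [3.0, 8.0, 4.0]))
```

One consequence is that a single slow replica determines the latency of every write to that shard.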

Shard write process

Three important concepts

  • refresh
  • flush
  • fsync (TODO: whether data can be lost)

Write optimization

  • Use the Bulk API for batch operations
  • Tune refresh_interval. On every refresh ES creates a new Lucene segment and later tries to merge segments, which is expensive. If search does not need to be close to real time, refresh_interval can be increased
  • For fields that do not need to be searchable, set the index attribute to false (not_analyzed, from versions before 5.0, only disables analysis; the field is still indexed)
  • Use SSDs (the classic approach: when performance falls short, make up for it with hardware)
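The first two tips can be sketched as follows. The Bulk API expects newline-delimited JSON (one action line followed by one source line per doc, with a trailing newline), and refresh_interval is an index setting. The index name and the "30s" value are illustrative assumptions, not recommendations.

```python
import json
from typing import Iterable


def bulk_body(index: str, docs: Iterable[dict]) -> str:
    """Build the NDJSON payload for POST _bulk that indexes every doc into `index`."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc))                           # source line
    return "\n".join(lines) + "\n"  # the payload must end with a newline


# Body for PUT /<index>/_settings: refresh less often than the default
# to reduce segment creation and merging during heavy writes.
REFRESH_SETTINGS = {"index": {"refresh_interval": "30s"}}


if __name__ == "__main__":
    print(bulk_body("logs", [{"msg": "a"}, {"msg": "b"}]), end="")
```

The payload is sent with the `application/x-ndjson` content type.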

Reading data

Search

The Search API makes data retrieval very convenient. ES provides many query types, such as the match query and the term query, which are easy to assemble with the Query DSL (Domain Specific Language). Developers do not need to worry about the order of the clauses in the DSL: it does not affect execution efficiency, because the actual execution order is rearranged by the cost-based optimizer (CBO).

  • Term index uses an FST (Finite State Transducer) -> locates the posting list (inverted list)

  • SkipList -> merges posting lists
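To give a feel for why skipping helps when merging posting lists, here is a small sketch that intersects two sorted doc-id lists. Binary search (`bisect`) stands in for skip-list pointers; this is an illustration of the idea, not Lucene's actual implementation.

```python
from bisect import bisect_left
from typing import List


def intersect(a: List[int], b: List[int]) -> List[int]:
    """Intersect two sorted doc-id lists, jumping ahead instead of scanning one by one."""
    if len(a) > len(b):
        a, b = b, a                      # drive the loop from the shorter list
    result: List[int] = []
    lo = 0
    for doc_id in a:
        lo = bisect_left(b, doc_id, lo)  # skip forward to the first id >= doc_id
        if lo == len(b):
            break                        # the longer list is exhausted
        if b[lo] == doc_id:
            result.append(doc_id)
    return result


if __name__ == "__main__":
    print(intersect([1, 3, 5, 7, 9], [3, 4, 5, 9, 10, 11]))
```

The shorter list drives the loop, so the cost depends mostly on the rarer term.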

Aggregation

  • Metrics: performs count, max, and similar operations on the data set hit by the query. The result is a single value
  • Bucket: divides the data set hit by the query into smaller data sets based on some criteria, then runs metrics on each smaller set, similar to GROUP BY in SQL
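A bucket aggregation with a nested metric might be expressed as the request body below. This is a sketch: the field names `category` and `price` are hypothetical, and `category` is assumed to be a keyword field.

```python
def max_price_by_category(query: dict) -> dict:
    """Search body: bucket the hits by category, then compute max(price) per bucket."""
    return {
        "size": 0,                                   # only aggregation results, no hits
        "query": query,                              # the data set the aggregation runs on
        "aggs": {
            "by_category": {
                "terms": {"field": "category"},      # bucket: like GROUP BY category
                "aggs": {
                    "max_price": {"max": {"field": "price"}}  # metric: one value per bucket
                },
            }
        },
    }


if __name__ == "__main__":
    print(max_price_by_category({"match_all": {}}))
```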

Sort

Do not use text fields as sort fields. Text fields are usually tokenized by analyzers, so sorting on them often does not give the expected result.

By default, the docs returned by ES are sorted by _score in descending order; if sort fields are specified, they are sorted by those fields instead. Scripts are also supported for building more complex sorting rules.

_score: how the score is calculated depends on the query type. For example, a fuzzy query scores by how closely the spelling matches the search term, a term query scores by how much of the content the keyword accounts for, and so on. The similarity algorithm of Elasticsearch is based on TF/IDF, i.e. term frequency/inverse document frequency (since version 5.0 the default is BM25, which builds on the same ideas), and includes the following factors:

  • Term frequency: how often the search term appears in the field. The higher the frequency, the higher the relevance. A field containing the term five times is more relevant than one containing it only once.
  • Inverse document frequency: how often each search term appears in the index. The higher the frequency, the lower the relevance. A term that appears in a large number of documents carries less weight than one that appears in only a few, which measures how distinctive the term is.
  • Field-length norm: how long the field is. The longer the field, the lower the relevance. A search term appearing in a short title field is more relevant than the same term appearing in a long content field.
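The three factors can be made concrete with a small calculation. The formulas below are modeled on Lucene's classic TF/IDF similarity and are an assumption about one particular scoring variant; the combined `weight` is a simplification that leaves out query normalization, boosts, and the coordination factor.

```python
import math


def tf(freq: int) -> float:
    """Term frequency: more occurrences in the field means higher relevance."""
    return math.sqrt(freq)


def idf(num_docs: int, doc_freq: int) -> float:
    """Inverse document frequency: terms found in many docs carry less weight."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_norm(num_terms: int) -> float:
    """Field-length norm: a match in a short field counts more than in a long one."""
    return 1.0 / math.sqrt(num_terms)


def weight(freq: int, num_docs: int, doc_freq: int, num_terms: int) -> float:
    """Simplified weight of one term in one field."""
    return tf(freq) * idf(num_docs, doc_freq) * field_norm(num_terms)


if __name__ == "__main__":
    # Same term, same index: a 4-term title versus a 400-term content field.
    print(weight(1, 1000, 10, 4), weight(1, 1000, 10, 400))
```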

Pagination

1. Offset-based paging

  • from + size: from specifies the offset, size specifies the number of docs to fetch
  • Implementation principle:

  1. The client sends a request, and the node that receives it becomes the coordinating node, responsible for merging the data of the subsequent requests
  2. The query is executed locally to obtain a result set of size from+size, and the coordinating node builds a local priority queue of size from+size. The coordinating node then forwards the request to the other shards
  3. The other shards execute the same query, fetch a result set of size from+size, and return it to the coordinating node
  4. The coordinating node merges the result sets into a final priority queue of size from+size and returns the last size entries to the client
  • Features: supports random page access; not suitable for infinite-scroll feeds, because data may be duplicated at page boundaries; the overhead grows as from increases, so it is not suitable for deep paging
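The merge steps above can be simulated in plain Python, which also shows why deep paging is expensive: every shard has to return from+size hits even though only size of them reach the client. The shard data here is invented, and `coordinate` is a hypothetical function rather than ES code.

```python
import heapq
from typing import List, Tuple

Hit = Tuple[float, str]  # (score, doc id)


def shard_top(hits: List[Hit], from_: int, size: int) -> List[Hit]:
    """Each shard returns its own top from+size hits, best score first."""
    return heapq.nlargest(from_ + size, hits)


def coordinate(shards: List[List[Hit]], from_: int, size: int) -> List[Hit]:
    """Merge the per-shard results and keep only the requested page."""
    candidates = [hit for shard in shards for hit in shard_top(shard, from_, size)]
    merged = heapq.nlargest(from_ + size, candidates)  # global top from+size
    return merged[from_:from_ + size]                  # the last `size` entries


if __name__ == "__main__":
    shards = [
        [(0.9, "a"), (0.5, "b"), (0.1, "c")],
        [(0.8, "d"), (0.7, "e"), (0.2, "f")],
    ]
    print(coordinate(shards, from_=2, size=2))
```

With N shards the coordinating node handles up to N × (from+size) candidates per request, so the cost grows with the page offset.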

2. Cursor-based paging (Scroll)

  • The first request collects all results that match the query and caches them as a snapshot (search context); subsequent requests read directly from that cache
  • Because of the cache, later changes to the documents are not reflected in it, so Scroll is not suitable for real-time requests
  • Also, since the cached data covers every doc hit by the query, the heap memory overhead of Scroll is very high
  • It is suitable for non-real-time, low-frequency scenarios such as index rebuilding and data migration

3. search_after

  • Uses the results of the previous page to retrieve the next page. The sort must include a field that uniquely identifies each doc to keep search_after consistent
  • Each shard only has to return a data set of size size per request
  • Random page access is not supported
  • Works on live data. If a doc changes in a way that affects its sort values, duplicate data may appear
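A search_after request for the next page could be assembled like this. It is a sketch under assumptions: `publish_time` and `doc_id` are hypothetical field names, with `doc_id` assumed to be a keyword field holding a unique value that acts as the tiebreaker.

```python
from typing import Optional


def next_page_body(query: dict, size: int, last_hit: Optional[dict] = None) -> dict:
    """Search body for one page; pass the last hit of the previous page to continue."""
    body = {
        "size": size,
        "query": query,
        # the unique doc_id breaks ties so that the ordering is stable
        "sort": [{"publish_time": "desc"}, {"doc_id": "asc"}],
    }
    if last_hit is not None:
        # each hit of a sorted search carries its sort values in "sort"
        body["search_after"] = last_hit["sort"]
    return body


if __name__ == "__main__":
    first = next_page_body({"match_all": {}}, 10)
    last_hit = {"_id": "42", "sort": [1700000000000, "42"]}  # made-up hit
    print(next_page_body({"match_all": {}}, 10, last_hit))
```

Note that the body never sets from: the position in the result set is carried entirely by the sort values of the last hit.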

Read optimization

  • Define numeric fields that never need range queries as keyword
  • Avoid wildcard queries. Prefer match queries on the analyzed terms. If a wildcard query is required, avoid escaped characters
  • Limit the length of search terms
  • Use search_after for feed scenarios
  • For fields that do not need a score, use filter context instead of query context
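The last tip can be illustrated with a bool query that keeps the scored clause in query context and moves the exact-match condition into filter context. The field names `title` and `status` are hypothetical.

```python
def filtered_search(keyword: str, status: str) -> dict:
    """Search body: score on the title match, filter on status without scoring."""
    return {
        "query": {
            "bool": {
                # query context: contributes to _score
                "must": [{"match": {"title": keyword}}],
                # filter context: yes/no match only, no score is calculated
                "filter": [{"term": {"status": status}}],
            }
        }
    }


if __name__ == "__main__":
    print(filtered_search("elasticsearch", "published"))
```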