Let’s start with something interesting :)

Many years ago, a newly married, unemployed developer named Shay Banon followed his wife to London, where she was studying to be a chef. While looking for a job, he started using an early version of Lucene to build a recipe search engine for her.

Using Lucene directly is difficult, so Shay set out to create an abstraction layer that Java developers could use to easily add search capabilities to their applications. He released it as his first open source project, Compass.

Shay then landed a job in a high-performance, distributed environment built on in-memory data grids. The need for a high-performance, real-time, distributed search engine was clear, so he decided to rewrite Compass as a standalone service and call it Elasticsearch.

The first public release came in February 2010, and since then Elasticsearch has become one of the most active projects on GitHub. A company has been founded to offer commercial services around Elasticsearch and to develop new features, but Elasticsearch will always remain open source and available to everyone.

Shay’s wife is reportedly still waiting for her recipe search engine…

What is ES

  • Search engine
  • Near real-time search
  • RESTful API
  • Distributed and highly available
  • Document-oriented storage, JSON format
  • Based on Apache Lucene

Core concepts

  • Cluster
  • Node: a single node; multiple nodes form a cluster
  • Index
  • Shard
  • Replica: a copy of a shard
  • Segment
  • Document
  • Field
  • Inverted index
  • Text / Keyword field types

The full usage workflow

Schema (mapping)

ES does not require a pre-defined schema; it infers the schema dynamically when a document is indexed.

The default rules for dynamic field mapping (JSON type → ES type) are as follows:

  • null → no field is added
  • boolean → boolean
  • string → date (if date detection passes), double/long (if numeric detection passes), otherwise text with a keyword sub-field
  • number → long (integers) / float (floating point)
  • object → object
  • array → the item type is taken from the first non-null element

The default dynamic mapping can be overridden with a dynamic_templates definition. For example, to change how strings are mapped by default:

{
  "mappings": {
    "dynamic_templates": [
      {
        "strings_as_keywords": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "keyword"
          }
        }
      }
    ]
  }
}

However, if dynamic fields are not required, it is better not to rely on dynamic mapping at all; careless use will pollute the mapping. You can disable it by setting dynamic to false.

Of course, you can use the Put Mapping API to define the mapping structure of the index in advance, including each field's type, the analyzer to use (for text fields), whether the field is indexed, and so on.

The ES documentation also strongly recommends indexing the same field in different ways where useful (multi-fields). For example, a string value can be indexed as text for full-text retrieval and as keyword for sorting and aggregation, as in the sketch below.
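
A minimal sketch of such an explicit mapping (the index name my_index and the field names are hypothetical; syntax for recent versions without mapping types): dynamic mapping is disabled, and title is indexed both as text for full-text search and as a keyword sub-field for sorting and aggregation.

PUT /my_index
{
  "mappings": {
    "dynamic": false,
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "raw": { "type": "keyword" }
        }
      },
      "created_at": { "type": "date" }
    }
  }
}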

The use of aliases is recommended. ES mappings are open for extension but closed for modification: you can add a field to a mapping, but you cannot delete or change an existing one. So point an alias at the real index; when a field does need to change, use the Reindex API to rebuild the index and the Aliases API to repoint the alias, achieving a seamless switch (see the sketch below).
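
A rough sketch of that switch (index and alias names are hypothetical): rebuild into a new index with the Reindex API, then atomically repoint the alias.

POST /_reindex
{
  "source": { "index": "my_index_v1" },
  "dest": { "index": "my_index_v2" }
}

POST /_aliases
{
  "actions": [
    { "remove": { "index": "my_index_v1", "alias": "my_index" } },
    { "add": { "index": "my_index_v2", "alias": "my_index" } }
  ]
}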

Writing data

Distributed write process

The total latency of an ES write is the time to write to the primary shard plus max(time to write to each replica).

Shard write process

Three more important concepts

  • refresh
  • flush
  • fsync (todo: whether data can be lost)

Write optimization

  • Use the Bulk API to perform operations in batches (see the example after this list)
  • Tune refresh_interval. Each refresh creates new Lucene segments and may trigger segment merges, which is costly. If near-real-time search is not required, increase refresh_interval appropriately
  • For fields that do not need full-text search, set the index attribute to not_analyzed
  • Use SSDs (the classic advice: when tuning no longer helps, hardware will)
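
A minimal sketch of the first two points (index name and documents are hypothetical): a bulk request that indexes several docs in one round trip, and a settings update that relaxes refresh_interval.

POST /_bulk
{ "index": { "_index": "my_index", "_id": "1" } }
{ "title": "doc one" }
{ "index": { "_index": "my_index", "_id": "2" } }
{ "title": "doc two" }

PUT /my_index/_settings
{
  "index": { "refresh_interval": "30s" }
}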

Reading data

Search

The Search API makes data retrieval very convenient. ES provides many query types, such as match query and term query, which can easily be assembled into the Query DSL (Domain Specific Language) without worrying about the order of the clauses: the order of queries in the DSL does not affect execution efficiency, because the actual execution order is rearranged by a cost-based optimizer (CBO).
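
A minimal Query DSL sketch (index and field names are hypothetical) combining a match query with a term query; the clause order here carries no performance meaning.

GET /my_index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "recipe pasta" } }
      ],
      "filter": [
        { "term": { "status": "published" } }
      ]
    }
  }
}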

  • Term index: an FST (finite state transducer) is used to locate the postings list (inverted chain)

  • SkipList: used when merging/intersecting postings lists

Aggregation

  • Metrics: compute a single value, such as a count or a max, over the data set hit by the query
  • Buckets: split the data set hit by the query into smaller sets according to some condition, then run metrics on those smaller sets, analogous to GROUP BY in SQL (see the example after this list)
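
A minimal sketch combining the two (field names are hypothetical): a terms bucket per status, with a max metric inside each bucket.

GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "by_status": {
      "terms": { "field": "status" },
      "aggs": {
        "max_price": { "max": { "field": "price" } }
      }
    }
  }
}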

Sort

Do not use a text field as a sort field. Text fields are usually tokenized by an analyzer, so sorting on them will not produce the expected result.

By default, the docs returned by ES are sorted in descending order by _score (document relevance), i.e. the computed score; if another sort field is specified, they are sorted by that field instead. Scripts are also supported for building more complex sort rules. A basic sort request is sketched below.
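
A minimal sort sketch (field names are hypothetical): results are ordered by created_at first and fall back to _score.

GET /my_index/_search
{
  "query": { "match": { "title": "recipe" } },
  "sort": [
    { "created_at": { "order": "desc" } },
    "_score"
  ]
}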

For example, a fuzzy query scores by how closely the indexed term matches the spelling of the search term, a term query scores by what proportion of the keywords appear in the content, and so on. Elasticsearch's similarity algorithm uses TF/IDF (term frequency / inverse document frequency), which takes the following into account (a simplified formula sketch follows the list):

  • Term frequency: how often the search term appears in the field. The more often, the more relevant: five occurrences in a field are more relevant than a single occurrence.
  • Inverse document frequency: how often the search term appears across the index. The more often, the less relevant: terms that appear in most documents carry less weight than terms that appear in only a few. In other words, it measures how discriminative a term is.
  • Field-length norm: how long is the field? The longer the field, the lower the relevance. A search term appearing in a short title field is more relevant than the same term appearing in a long content field.
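
A simplified sketch of how these three factors combine in Lucene's classic TF/IDF similarity (boosts and normalization constants omitted; the exact formula depends on the version):

tf(t, d)    = sqrt(frequency of t in d)
idf(t)      = 1 + log( numDocs / (docFreq(t) + 1) )
norm(d)     = 1 / sqrt(number of terms in the field)
score(q, d) ≈ sum over query terms t of: tf(t, d) * idf(t)^2 * norm(d)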

Pagination

  1. Offset-based paging
  • from + size: from specifies the offset, size specifies the number of hits to return (see the example after this list)
  • How it works:

  1. The client sends a request; the node that receives it becomes the coordinating node and merges the results of the subsequent requests
  2. The coordinating node builds a local priority queue of size from + size and forwards the request to the other shards
  3. Each shard also executes the query, collects its own top from + size results, and returns them to the coordinating node
  4. The coordinating node merges the result sets into a priority queue of size from + size and returns the last size documents to the client
  • Characteristics:

    • Random page access is supported
    • Not suitable for infinite-scroll feed scenarios, because data can be duplicated at page boundaries
    • As from grows, the overhead grows, so it is not suitable for deep paging
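
A minimal from/size request (index name is hypothetical) fetching page 3 with 10 hits per page:

GET /my_index/_search
{
  "from": 20,
  "size": 10,
  "query": { "match_all": {} }
}
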
  2. Cursor-based paging

    1. scroll
  • On the first request, the results matching the query are snapshotted into a search context, and subsequent pages are read from that snapshot rather than from the live index
  • Because of the snapshot, later changes to documents are not reflected, so it is not suitable for real-time requests
  • Because the snapshot covers every doc hit by the query, the heap memory overhead of scroll can be very large
  • Suitable for non-real-time, low-frequency scenarios such as index rebuilding or data migration (see the sketch below)
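
A rough scroll sketch (index name is hypothetical): the first request opens a 1-minute scroll context, and follow-up requests page through it with the returned scroll_id.

GET /my_index/_search?scroll=1m
{
  "size": 1000,
  "query": { "match_all": {} }
}

POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "<scroll_id returned by the previous response>"
}
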
    2. search_after
  • Uses the sort values from the last hit of the previous page to fetch the next page; include a unique identifier of the doc in the sort fields to keep search_after consistent (see the sketch below)
  • Each shard only needs to return size documents per request
  • Random page access is not supported
  • It works on live data, so if a doc changes in a way that affects the sort values, duplicate data may appear
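
A rough search_after sketch (field names are hypothetical): the sort includes a unique id field as a tiebreaker, and search_after carries the sort values of the last hit from the previous page.

GET /my_index/_search
{
  "size": 10,
  "query": { "match": { "title": "recipe" } },
  "sort": [
    { "created_at": "desc" },
    { "id": "asc" }
  ],
  "search_after": [1696118400000, "doc-42"]
}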

Read optimization

  • For numeric fields that never need range queries, define the type as keyword instead
  • Be careful with wildcard queries; prefer match queries on analyzed text. If a wildcard query is really needed, watch out for character escaping
  • Limit the length of search terms
  • Use search_after for feed-flow scenarios
  • For fields that do not need scoring, move the clauses from query context to filter context (see the sketch after this list)
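
A minimal filter-context sketch (field names are hypothetical): both clauses run in filter context, so no score is computed and the results are cacheable.

GET /my_index/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "status": "published" } },
        { "range": { "created_at": { "gte": "now-7d/d" } } }
      ]
    }
  }
}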

Reference documentation

  1. ES GitHub: github.com/elastic/ela…
  2. ES official documentation: www.elastic.co/guide/en/el…
  3. ES write process: blog.csdn.net/R_P_J/artic…