Let’s start with something interesting :)

Many years ago, a newly married, unemployed developer named Shay Banon followed his wife to London, where she was studying to be a chef. While looking for a job, he started using an early version of Lucene to build a recipe search engine for her.

Using Lucene directly is difficult, so Shay set out to create an abstraction layer that Java developers could use to easily add search capabilities to their applications. He released it as his first open source project, Compass.

Shay then landed a job in a high-performance, distributed environment built on in-memory data grids. The need for a high-performance, real-time, distributed search engine was clear, so he decided to rewrite Compass as a standalone service and call it Elasticsearch.

The first public release came in February 2010, and since then Elasticsearch has become one of the most active projects on GitHub. A company has been founded to offer commercial services around Elasticsearch and to develop new features, but Elasticsearch will always remain open source and available to everyone.

Shay’s wife is reportedly still waiting for her recipe search engine…

What is ES

  • Search engine
  • Near real-time search
  • RESTful API
  • Distributed and highly available
  • Document-oriented storage, JSON format
  • Based on Apache Lucene

Core concepts

  • Cluster
  • Node: a single node; multiple nodes form a cluster
  • Index
  • Shard
  • Replica: a copy of a shard
  • Segment
  • Document
  • Field
  • Inverted index
  • Text / Keyword field types

The full usage workflow

Schema (mapping)

ES does not require a pre-defined schema; it infers the schema dynamically when a document is indexed.

The default rules for dynamic field mapping (JSON type → ES type) are as follows:

  • null → no field is added
  • boolean → boolean
  • string → date (if date detection passes), double/long (if numeric detection passes), otherwise text with a keyword sub-field
  • number → long (integers) / float (floating point)
  • object → object
  • array → the item type is taken from the first non-null element

The default dynamic mapping can be overridden with a dynamic_templates definition. For example, to change how strings are mapped by default:

{
  "mappings": {
    "dynamic_templates": [
      {
        "strings_as_keywords": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "keyword"
          }
        }
      }
    ]
  }
}

However, if dynamic fields are not required, it is better not to rely on dynamic mapping at all; careless use will pollute the mapping. You can disable it by setting dynamic to false.

Of course, you can use the Put Mapping API to define the mapping structure of the index in advance, including each field's type, the analyzer to use (for text fields), whether the field is indexed, and so on.

The ES documentation also strongly recommends indexing the same field in different ways where useful (multi-fields). For example, a string value can be indexed as text for full-text retrieval and as keyword for sorting and aggregation, as in the sketch below.
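
A minimal sketch of such an explicit mapping (the index name my_index and the field names are hypothetical; syntax for recent versions without mapping types): dynamic mapping is disabled, and title is indexed both as text for full-text search and as a keyword sub-field for sorting and aggregation.

PUT /my_index
{
  "mappings": {
    "dynamic": false,
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "raw": { "type": "keyword" }
        }
      },
      "created_at": { "type": "date" }
    }
  }
}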

The use of aliases is recommended. ES mappings are open for extension but closed for modification: you can add a field to a mapping, but you cannot delete or change an existing one. So point an alias at the real index; when a field does need to change, use the Reindex API to rebuild the index and the Aliases API to repoint the alias, achieving a seamless switch (see the sketch below).
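
A rough sketch of that switch (index and alias names are hypothetical): rebuild into a new index with the Reindex API, then atomically repoint the alias.

POST /_reindex
{
  "source": { "index": "my_index_v1" },
  "dest": { "index": "my_index_v2" }
}

POST /_aliases
{
  "actions": [
    { "remove": { "index": "my_index_v1", "alias": "my_index" } },
    { "add": { "index": "my_index_v2", "alias": "my_index" } }
  ]
}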

Writing data

Distributed write process

The total latency of an ES write is the time to write to the primary shard plus max(time to write to each replica).

Shard write process

Three more important concepts

  • refresh
  • flush
  • fsync (todo: whether data can be lost)

Write optimization

  • Use the Bulk API to perform operations in batches (see the example after this list)
  • Tune refresh_interval. Each refresh creates new Lucene segments and may trigger segment merges, which is costly. If near-real-time search is not required, increase refresh_interval appropriately
  • For fields that do not need full-text search, set the index attribute to not_analyzed
  • Use SSDs (the classic advice: when tuning no longer helps, hardware will)
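
A minimal sketch of the first two points (index name and documents are hypothetical): a bulk request that indexes several docs in one round trip, and a settings update that relaxes refresh_interval.

POST /_bulk
{ "index": { "_index": "my_index", "_id": "1" } }
{ "title": "doc one" }
{ "index": { "_index": "my_index", "_id": "2" } }
{ "title": "doc two" }

PUT /my_index/_settings
{
  "index": { "refresh_interval": "30s" }
}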

Reading data

Search

The Search API makes data retrieval very convenient. ES provides many query types, such as match query and term query, which can easily be assembled into the Query DSL (Domain Specific Language) without worrying about the order of the clauses: the order of queries in the DSL does not affect execution efficiency, because the actual execution order is rearranged by a cost-based optimizer (CBO).
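
A minimal Query DSL sketch (index and field names are hypothetical) combining a match query with a term query; the clause order here carries no performance meaning.

GET /my_index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "recipe pasta" } }
      ],
      "filter": [
        { "term": { "status": "published" } }
      ]
    }
  }
}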

  • Term index: an FST (finite state transducer) is used to locate the postings list (inverted chain)

  • SkipList: used when merging/intersecting postings lists

Aggregation

  • Metrics: compute a single value, such as a count or a max, over the data set hit by the query
  • Buckets: split the data set hit by the query into smaller sets according to some condition, then run metrics on those smaller sets, analogous to GROUP BY in SQL (see the example after this list)
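
A minimal sketch combining the two (field names are hypothetical): a terms bucket per status, with a max metric inside each bucket.

GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "by_status": {
      "terms": { "field": "status" },
      "aggs": {
        "max_price": { "max": { "field": "price" } }
      }
    }
  }
}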

Sort

Do not use a text field as a sort field. Text fields are usually tokenized by an analyzer, so sorting on them will not produce the expected result.

By default, the docs returned by ES are sorted in descending order by _score (document relevance), i.e. the computed score; if another sort field is specified, they are sorted by that field instead. Scripts are also supported for building more complex sort rules. A basic sort request is sketched below.
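
A minimal sort sketch (field names are hypothetical): results are ordered by created_at first and fall back to _score.

GET /my_index/_search
{
  "query": { "match": { "title": "recipe" } },
  "sort": [
    { "created_at": { "order": "desc" } },
    "_score"
  ]
}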

For example, a fuzzy query scores by how closely the indexed term matches the spelling of the search term, a term query scores by what proportion of the keywords appear in the content, and so on. Elasticsearch's similarity algorithm uses TF/IDF (term frequency / inverse document frequency), which takes the following into account (a simplified formula sketch follows the list):

  • Term frequency: how often the search term appears in the field. The more often, the more relevant: five occurrences in a field are more relevant than a single occurrence.
  • Inverse document frequency: how often the search term appears across the index. The more often, the less relevant: terms that appear in most documents carry less weight than terms that appear in only a few. In other words, it measures how discriminative a term is.
  • Field-length norm: how long is the field? The longer the field, the lower the relevance. A search term appearing in a short title field is more relevant than the same term appearing in a long content field.
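
A simplified sketch of how these three factors combine in Lucene's classic TF/IDF similarity (boosts and normalization constants omitted; the exact formula depends on the version):

tf(t, d)    = sqrt(frequency of t in d)
idf(t)      = 1 + log( numDocs / (docFreq(t) + 1) )
norm(d)     = 1 / sqrt(number of terms in the field)
score(q, d) ≈ sum over query terms t of: tf(t, d) * idf(t)^2 * norm(d)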

Pagination

  1. Offset-based paging
  • from + size: from specifies the offset, size specifies the number of hits to return (see the example after this list)
  • How it works:

  1. The client sends a request; the node that receives it becomes the coordinating node and merges the results of the subsequent requests
  2. The coordinating node builds a local priority queue of size from + size and forwards the request to the other shards
  3. Each shard also executes the query, collects its own top from + size results, and returns them to the coordinating node
  4. The coordinating node merges the result sets into a priority queue of size from + size and returns the last size documents to the client
  • Characteristics:

    • Random page access is supported
    • Not suitable for infinite-scroll feed scenarios, because data can be duplicated at page boundaries
    • As from grows, the overhead grows, so it is not suitable for deep paging
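
A minimal from/size request (index name is hypothetical) fetching page 3 with 10 hits per page:

GET /my_index/_search
{
  "from": 20,
  "size": 10,
  "query": { "match_all": {} }
}
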
  2. Cursor-based paging

    1. scroll
  • On the first request, the results matching the query are snapshotted into a search context, and subsequent pages are read from that snapshot rather than from the live index
  • Because of the snapshot, later changes to documents are not reflected, so it is not suitable for real-time requests
  • Because the snapshot covers every doc hit by the query, the heap memory overhead of scroll can be very large
  • Suitable for non-real-time, low-frequency scenarios such as index rebuilding or data migration (see the sketch below)
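
A rough scroll sketch (index name is hypothetical): the first request opens a 1-minute scroll context, and follow-up requests page through it with the returned scroll_id.

GET /my_index/_search?scroll=1m
{
  "size": 1000,
  "query": { "match_all": {} }
}

POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "<scroll_id returned by the previous response>"
}
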
    2. search_after
  • Uses the sort values from the last hit of the previous page to fetch the next page; include a unique identifier of the doc in the sort fields to keep search_after consistent (see the sketch below)
  • Each shard only needs to return size documents per request
  • Random page access is not supported
  • It works on live data, so if a doc changes in a way that affects the sort values, duplicate data may appear
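
A rough search_after sketch (field names are hypothetical): the sort includes a unique id field as a tiebreaker, and search_after carries the sort values of the last hit from the previous page.

GET /my_index/_search
{
  "size": 10,
  "query": { "match": { "title": "recipe" } },
  "sort": [
    { "created_at": "desc" },
    { "id": "asc" }
  ],
  "search_after": [1696118400000, "doc-42"]
}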

Read optimization

  • For numeric fields that never need range queries, define the type as keyword instead
  • Be careful with wildcard queries; prefer match queries on analyzed text. If a wildcard query is really needed, watch out for character escaping
  • Limit the length of search terms
  • Use search_after for feed-flow scenarios
  • For fields that do not need scoring, move the clauses from query context to filter context (see the sketch after this list)
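
A minimal filter-context sketch (field names are hypothetical): both clauses run in filter context, so no score is computed and the results are cacheable.

GET /my_index/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "status": "published" } },
        { "range": { "created_at": { "gte": "now-7d/d" } } }
      ]
    }
  }
}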

Reference documentation

  1. ES GitHub: github.com/elastic/ela…
  2. ES official documentation: www.elastic.co/guide/en/el…
  3. ES write process: blog.csdn.net/R_P_J/artic…