• Elasticsearch in depth.

    Data in Life

    A search engine searches data, so let’s start with the data in our lives. There are two kinds of data in our lives:

    • Structured data
    • Unstructured data

    **Structured data:** also called row data, this is data that can be logically expressed and implemented with a two-dimensional table structure. It strictly follows data format and length specifications and is mainly stored and managed in relational databases. It refers to data with a fixed format or limited length, such as database records and metadata.

    **Unstructured data:** also known as full-text data, this is data of variable length or with no fixed format that is not well suited to two-dimensional database tables, including office documents of all formats, XML, HTML, Word documents, emails, all kinds of reports, pictures, audio, video, and so on.

    XML and HTML can be classified as semi-structured data if you want to make a more detailed distinction. Because they also have their own specific tag formats, they can be processed as structured data as needed, or plain text can be extracted as unstructured data.

    Corresponding to the two kinds of data, search is also divided into two kinds:

    • Structured data search
    • Unstructured data search

    **For structured data:** because it has a specific structure, we can generally store and search it in the two-dimensional tables of a relational database (MySQL, Oracle, etc.) and build indexes on it.

    For unstructured data, that is, full-text data, there are two main search methods:

    • Sequential scan
    • The full text retrieval

    **Sequential scan:** as the name suggests, this means searching for specific keywords by going through the text from start to finish.

    For example, if I give you a newspaper and ask you to find everywhere the word “peace” appears in it, you will have to scan the newspaper from cover to cover and mark every place the keyword occurs.

    This method is undoubtedly the most time-consuming and inefficient: if the print is small and there are many sections, or even several newspapers, your eyes will be just about worn out by the time you finish scanning.

    **Full-text retrieval:** sequential scanning of unstructured data is slow. Can we optimize it? Can we try to give our unstructured data some structure?

    We can extract part of the information from the unstructured data and reorganize it so that it has a certain structure, and then search this structured data to achieve relatively fast search.

    This is the basic idea of full-text retrieval. The information extracted from unstructured data and then reorganized is called an index.

    The main work in this approach is building the index up front, but once it is built, searches are fast and efficient.

    Let’s Start with Lucene

    After a brief look at the types of data in our lives, we know that SQL retrieval in relational databases cannot handle such unstructured data.

    Such unstructured data processing relies on full-text search, and the best open source full-text search engine toolkit on the market today is Apache’s Lucene.

    But Lucene is only a toolkit; it is not a complete full-text search engine. Lucene’s goal is to provide software developers with an easy-to-use toolkit for implementing full-text search in a target system, or for building a complete full-text search engine on top of it.

    Currently, Solr and Elasticsearch are the main open source full-text search engines based on Lucene.

    Solr and Elasticsearch are both mature full-text search engines and are broadly comparable in functionality and performance.

    However, ES is distributed out of the box and easy to install and use, whereas Solr’s distributed mode has to be provided by third-party components, for example by using ZooKeeper for distributed coordination and management.

    Both Solr and Elasticsearch rely on Lucene underneath, and Lucene can implement full-text search mainly because it implements the inverted index query structure.

    So what is an inverted index? Suppose there are three documents with the following contents:

    • Java is the best programming language.
    • PHP is the best programming language.
    • Javascript is the best programming language.

    To create an inverted index, we use a tokenizer to split the content field of each document into separate words (which we call tokens or terms), create a sorted list of all the unique terms, and then list which documents each term appears in.

    The result is as follows:
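
    Assuming a simple analyzer that just splits on whitespace and strips punctuation (the exact terms depend on the tokenizer used), the term-to-document mapping looks roughly like this:

    Term          Doc 1   Doc 2   Doc 3
    ------------  ------  ------  ------
    Java            X
    PHP                     X
    Javascript                      X
    is              X       X       X
    the             X       X       X
    best            X       X       X
    programming     X       X       X
    language        X       X       X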

    This structure consists of a list of all non-repeating words in a document, with a document list associated with each word.

    This structure, which determines the location of records by attribute values, is called an inverted index. A file with an inverted index is called an inverted file.

    We translate the above into a graph to illustrate the structure of the inverted index, as shown below:

    There are mainly the following core terms to understand:

    • **Term:** the smallest unit of indexing and querying. For English it is a word; for Chinese it usually refers to a word produced by word segmentation.

    • **Term Dictionary:** also called the dictionary, it is the collection of terms. The usual unit of indexing for a search engine is the word, and the dictionary is the set of strings made up of all the words that have appeared in the document collection. Each index entry in the dictionary records some information about the word itself and a pointer to its inverted list.

    • **Posting List:** a document is usually made up of many words, and a posting (inverted) list records which documents a word appears in and where it appears.

      Each such record is called a posting. Posting lists record not only document IDs but also information such as term frequency.

    • **Inverted File:** the inverted lists of all the words are usually stored sequentially in a file on disk called the inverted file; it is the physical file that stores the inverted index.

    From the figure above, we can know that the inverted index mainly consists of two parts:

    • The dictionary
    • Inverted file

    The dictionary and the inverted lists are two important data structures in Lucene and are the cornerstone of fast retrieval. They are stored in two parts: the dictionary in memory and the inverted file on disk.

    ES Core Concepts

    With these basics covered, let’s get into today’s topic: Elasticsearch.

    ES is an open source search engine written in Java. It uses Lucene internally for indexing and searching. By encapsulating Lucene, it hides Lucene’s complexity and instead provides a simple and consistent RESTful API.

    However, Elasticsearch is more than Lucene, and more than just a full-text search engine.

    It can be accurately described as follows:

    • A distributed real-time document store where each field can be indexed and searched.
    • A distributed real-time analysis search engine.
    • Capable of scaling to hundreds of service nodes and supporting petabytes of structured or unstructured data.

    Let’s take a look at some of the core concepts of how Elasticsearch can be distributed, scalable and near real time.

    Elasticsearch is a distributed, scalable, near-real-time search and analytics engine.

    Cluster

    Building an ES cluster is very simple: it does not rely on third-party coordination or management components, and cluster management is implemented internally.

    An ES cluster consists of one or more Elasticsearch nodes. You can add each node to the cluster by setting the same cluster.name.

    Ensure that different cluster names are used in different environments, otherwise you will end up with nodes joining the wrong cluster.

    A running Elasticsearch instance is a node. The node name is set via node.name; if this parameter is not set, a random universally unique identifier is assigned as the node name at startup.
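
    For example, a minimal sketch of these two settings in config/elasticsearch.yml (the names used here are just illustrative):

    cluster.name: my-es-cluster    # all nodes that share this cluster.name join the same cluster
    node.name: node-1              # a human-readable name for this node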

    ① Discovery mechanism

    This raises a question: how does ES join the different nodes that share the same cluster.name setting into one cluster? The answer is Zen Discovery.

    Zen Discovery is a built-in default Discovery module for Elasticsearch (the Discovery module is responsible for discovering nodes in the cluster and electing Master nodes).

    It provides unicast and file-based discovery, and can be extended to support cloud environments and other forms of discovery through plug-ins.

    Zen Discovery is integrated with other modules; for example, all communication between nodes is done using the Transport module. The node uses the discovery mechanism to Ping other nodes.

    Elasticsearch is configured to use unicast discovery by default to prevent nodes from unintentionally joining a cluster. Only nodes running on the same machine automatically form a cluster.

    If the cluster’s nodes are running on different machines, using unicast, you can provide Elasticsearch with a list of nodes it should try to connect to.

    When a node contacts a member of the unicast list, it gets the status of all nodes in the entire cluster, and then it contacts the Master node and joins the cluster.

    This means that the unicast list does not need to contain all the nodes in the cluster; it just needs enough nodes that a new node can contact one of them and talk to it.

    If you use the master-eligible (candidate master) nodes as the unicast list, you only need to list those three. This is configured in the elasticsearch.yml file:

    discovery.zen.ping.unicast.hosts: ["host1", "host2:port"]

    When a node starts, it begins to ping. If discovery.zen.ping.unicast.hosts is set, it pings the hosts in that list; otherwise it tries to ping several ports on localhost.

    This is because Elasticsearch can start multiple nodes on the same host. The ping response contains the node’s basic information and the node that it currently considers to be the master node.

    The election starts by choosing from the nodes that each node considers to be the master. The rule is simple: sort them and take the first in dictionary order. If no node is considered master, the choice is made from all master-eligible nodes, with the same rule.

    There is a restriction: discovery.zen.minimum_master_nodes. If the number of participating nodes does not reach this minimum, the above process is repeated until there are enough nodes to start an election.

    The result of the election is that a master is definitely chosen; if there is only the local node, it elects itself.

    If the current node is the master, it waits until the number of nodes reaches discovery.zen.minimum_master_nodes before providing service.

    If the current node is not the master, it tries to join the master. Elasticsearch calls this process of service discovery plus master election Zen Discovery.

    Since it supports clusters of any size (1 to N), it cannot, like ZooKeeper, require the number of nodes to be odd, so it does not use a majority-voting mechanism to elect the master, but a rule instead.

    As long as all nodes follow the same rule and the information they see is consistent, the master they select will be the same.

    But the problem in distributed systems is exactly that information is not always consistent, which can easily lead to the split-brain problem.

    Most solutions simply set a quorum value and require the number of available nodes to be larger than the quorum (typically more than half of the nodes) before the cluster can provide service.

    In Elasticsearch, this quorum is configured via discovery.zen.minimum_master_nodes.

    ② Roles of nodes

    Each node can be both a candidate master node and a data node. These roles are set in config/elasticsearch.yml and both default to true:

    node.master: true  # whether the node is a candidate master node
    node.data: true    # whether the node is a data node

    Data nodes store data and perform data-related operations such as adding, deleting, modifying, querying, and aggregating. They therefore have high requirements on machine configuration and consume large amounts of CPU, memory, and I/O.

    Often as a cluster grows, more data nodes need to be added to improve performance and availability.

    A candidate master node can be elected as the master node. Only candidate master nodes have the right to vote and be elected in the cluster; other nodes do not participate in the election.

    The master node is responsible for creating indexes, dropping indexes, tracking which nodes are part of the cluster, deciding which shards are assigned to related nodes, and tracking the status of nodes in the cluster. A stable master node is very important for the health of the cluster.

    A node can be both a candidate master node and a data node, but data nodes consume a lot of CPU, memory, and I/O.

    So if a node is both a data node and a master node, it can have an impact on the master node and therefore on the state of the entire cluster.

    To improve the health of an Elasticsearch cluster, we should therefore separate and isolate the roles of its nodes. A group of several low-configuration machines can be used as the candidate master node group.

    The master node and the other nodes check on each other by pinging. The master node pings all other nodes to check whether any node has gone down, and the other nodes also ping the master node to check whether it is still available.

    Although the nodes have different roles, a user’s request can be sent to any node; that node is then responsible for distributing the request and collecting the results, so the master node does not have to forward requests.

    Such a node can be called a coordinating node. Coordinating nodes do not need to be specified or configured; any node in the cluster can act as a coordinating node.
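
    As an illustration only, a node can even be turned into a dedicated coordinating node by disabling both roles in its elasticsearch.yml (a sketch, not a required configuration):

    node.master: false   # not master-eligible
    node.data: false     # holds no data; only routes requests and merges results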

    ③ Split brain phenomenon

    If, due to network issues or other reasons, multiple master nodes are elected in the cluster at the same time, data updates become inconsistent. This phenomenon is called split brain: different nodes in the cluster choose different masters, and multiple competing masters appear.

    There are several possible causes of the “split brain” problem:

    • **Network problems:** network latency between parts of the cluster causes some nodes to lose access to the master. They assume the master has died and elect a new master, mark the shards and replicas on the old master red, and allocate new primary shards.
    • **Node load:** the master node acts as both master and data node. When traffic is heavy, ES may stop responding (a feigned-death state), causing long delays. Other nodes then get no response from the master, conclude that it has died, and re-elect a master.
    • **Memory reclamation:** the master node acts as both master and data node. When the ES process on a data node occupies a large amount of memory, it triggers large-scale JVM garbage collection, which makes the ES process unresponsive.

    To avoid split brain, we can address these causes with the following optimizations:

    • **Appropriately increase the response timeout to reduce misjudgments.** The discovery.zen.ping_timeout parameter sets the node status response timeout; the default is 3s and it can be increased appropriately.

      If the master does not respond within this time, the node is considered dead. Increasing the parameter (for example to 6s: discovery.zen.ping_timeout: 6s) reduces misjudgments.

    • **Election trigger.** We need to set discovery.zen.minimum_master_nodes in the configuration file of the candidate master nodes in the cluster.

      The default value is 1, and the officially recommended value is (master_eligible_nodes / 2) + 1, where master_eligible_nodes is the number of candidate master nodes.

      This prevents split brain while keeping the cluster as highly available as possible, because the election can proceed normally as long as no fewer than discovery.zen.minimum_master_nodes candidate nodes are alive.

      When fewer candidates than this value are alive, an election cannot be triggered and the cluster cannot be used, but this does not cause shard confusion. A combined configuration sketch follows this list.

    • **Role separation.** This is the separation of candidate master nodes and data nodes mentioned above; it reduces the burden on the master node, prevents feigned death of the master, and reduces misjudgments that the master has “died”.
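
    Putting the first two measures together, a sketch of the relevant elasticsearch.yml settings for a cluster with three candidate master nodes might look like this (values are illustrative):

    discovery.zen.ping_timeout: 6s           # raise the ping timeout from the default 3s
    discovery.zen.minimum_master_nodes: 2    # (3 / 2) + 1 = 2 candidate master nodes required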

    Shards

    ES supports PB-level full-text search. When the amount of data in an index is too large, ES splits the data in the index horizontally into different blocks; each such block is called a shard.

    This is similar to splitting databases and tables (sharding) in MySQL, except that MySQL needs third-party components to do it while ES implements the feature internally.

    When data is written into a multi-shard index, routing is used to determine which shard it is written to, so the number of shards must be specified when the index is created. Once the number of shards is determined, it cannot be changed.

    The number of shards and the number of replicas described below can be configured via settings when the index is created. By default, ES creates 5 primary shards for an index and one replica for each primary shard.

    PUT /myIndex
    {
       "settings" : {
          "number_of_shards" : 5,
          "number_of_replicas" : 1
       }
    }

    Through sharding, ES improves an index in both scale and performance. Each shard is a Lucene index file, and every shard has one primary shard and zero or more replicas.

    Replicas

    A replica is a copy of a shard. Each primary shard has one or more replica shards, and when the primary shard fails, a replica can serve data queries.

    A primary shard and its replica shards are never placed on the same node, so the maximum number of replica shards is N - 1 (where N is the number of nodes).

    Create, index, and delete requests for documents are write operations; they must be completed on the primary shard before they can be copied to the relevant replica shards.

    To improve write throughput, ES performs this copying concurrently, and to handle the data conflicts that concurrent writes can cause, it uses optimistic locking: every document has a _version number that is incremented whenever the document is modified.
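
    As an illustration of this optimistic locking (the index name and document are made up), in ES 6.x a write can be made conditional on the document’s current _version; if the version no longer matches, ES rejects the write with a 409 version-conflict error:

    PUT /my_index/_doc/1?version=2
    {
      "title": "updated title"
    }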

    Success is reported to the coordinator node once all replica shards report success, and the coordinator node reports success to the client.

    In order to achieve high availability, the Master node avoids placing the Master shard and the replica shard on the same node.

    If Node1 service is down or the network is unavailable, the primary shard S0 on the primary node is also unavailable.

    Fortunately, there are two other nodes that work, and ES elects a new master node, and these two nodes have all the data we need for S0.

    A replica of S0 will be promoted to primary shard, and this promotion happens instantly. The cluster status, however, will be Yellow.

    Why is our cluster status Yellow rather than Green? Although we still have both primary shards, we configured two replica shards for each primary, and at the moment only one replica exists for each. Therefore the cluster cannot be in the Green state.

    If we also turn off Node2, our program can still run without losing any data, because Node3 holds a copy for each shard.

    If we restart Node1, the cluster can re-allocate the missing replica shards, and the cluster will return to its previous state.

    If Node1 still has the previous shards, it will try to reuse them, but the shard on Node1 is no longer the master shard but the replica shard. If the data has changed in the meantime, it will simply copy the modified data file from the master shard.

    Summary:

    • Data is sharded to increase the amount of data that can be handled and to make horizontal scaling easier; shards are replicated to improve cluster stability and concurrency.

    • Replicas are multiplication: the more you have, the safer but also the more costly. Shards are division: the more shards, the less data each shard holds and the more scattered it is.

    • The more copies, the more available the cluster is, but since each shard is equivalent to a Lucene index file, it consumes a certain amount of file handles, memory, and CPU.

      In addition, data synchronization between shards occupies a certain amount of network bandwidth, so more index shards and replicas are not always better.

    Mapping

    Mapping defines, for the fields in an index, the storage type, how they are analyzed (segmented into words), and whether they are stored. Like a schema in a database, it describes the fields or attributes a document may have and the data type of each field.

    However, while a relational database must specify field types when a table is created, ES can either leave field types unspecified and guess them dynamically, or have them specified explicitly when the index is created.

    A mapping that automatically identifies field types from the data format is called a dynamic mapping, while a mapping in which we explicitly define the field types when creating the index is called a static or explicit mapping.

    Before explaining how to use dynamic and static mappings, let’s first look at what field data types exist in ES. Afterwards we will explain why indexes should sometimes be created with static rather than dynamic mappings.

    The field data types in ES (V6.8) are as follows:

    **Text:** a field type used to index full-text values, such as the body of an email or a product description. These fields are analyzed: they are passed through a tokenizer to convert the string into a list of individual terms before being indexed.

    The analysis process allows Elasticsearch to search for individual words within each full-text field. Text fields are not used for sorting and are rarely used for aggregations.
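
    You can see this analysis step directly with the _analyze API; for example (the text is just an illustration), the request below returns the individual terms produced by the standard analyzer:

    POST /_analyze
    {
      "analyzer": "standard",
      "text": "Java is the best programming language."
    }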

    **Keyword:** a field type used to index structured content, such as an email address, host name, status code, zip code, or tag. Keyword fields are commonly used for filtering, sorting, and aggregations, and they can only be searched by their exact value.

    From this understanding of field types, we can see that some fields need to be defined explicitly. For example, whether a field is Text or Keyword makes a big difference, a time field may need a specified date format, some fields need a specific analyzer, and so on.

    This often cannot be achieved precisely with dynamic mapping: automatic type detection frequently differs from what is expected.

    So the complete way to create an index is to specify the number of shards and replicas together with the mapping definition, as follows:

    PUT my_index
    {
      "settings" : {
        "number_of_shards" : 5,
        "number_of_replicas" : 1
      },
      "mappings": {
        "_doc": {
          "properties": {
            "title":    { "type": "text" },
            "name":     { "type": "text" },
            "age":      { "type": "integer" },
            "created":  {
              "type":   "date",
              "format": "strict_date_optional_time||epoch_millis"
            }
          }
        }
      }
    }

    Basic use of ES

    Elasticsearch (excluding 0.x and 1.x) currently has the following stable major versions: 2.x, 5.x, 6.x, 7.x (current).

    You might notice that there are no 3.x or 4.x releases: ES jumped straight from 2.4.6 to 5.0.0. This was done to unify the version numbers of the ELK (Elasticsearch, Logstash, Kibana) stack so that users would not get confused.

    At the time, Elasticsearch was on 2.x (the last 2.x release, 2.4.6, came out on July 25, 2017) while Kibana was on 4.x (Kibana 4.6.5 was also released on July 25, 2017).

    The next major version of Kibana will definitely be 5.x, so Elasticsearch will release its own version as 5.0.0.

    After this unification, you simply choose a version of Elasticsearch and then the same version of Kibana, without worrying about version incompatibility.

    Elasticsearch is built in Java, so in addition to paying attention to the version of ELK technology, we need to pay attention to the JDK version when selecting Elasticsearch.

    Each major release may rely on a different JDK version; for example, version 7.2 currently supports JDK 11.

    Install and use

    1. Download and unpack Elasticsearch. No installation is needed; it can be used directly after unpacking. The directory contains:

    • **bin:** binary scripts directory, including the startup command and the plugin installation command.
    • **config:** configuration file directory.
    • **data:** data storage directory.
    • **lib:** dependency package directory.
    • **logs:** log file directory.
    • **modules:** module library, such as the x-pack module.
    • **plugins:** plugin directory.

    2. Run bin/elasticsearch from the installation directory to start ES.

    3. By default, ES listens on port 9200. Run curl http://localhost:9200/ or open http://localhost:9200 in a browser to get a JSON object containing information about the current node, cluster, and version.

    { "name" : "U7fp3O9", "cluster_name" : "elasticsearch", "cluster_uuid" : "-Rj8jGQvRIelGd9ckicUOA", "version" : {" number ":" 6.8.1 ", "build_flavor" : "default", "build_type" : "zip", "build_hash" : "1 fad4e1", "build_date" : "2019-06-18t13:16:517138z ", "build_snapshot" : false, "lucene_version" : "Minimum_index_compatibility_version" : "5.6.0", "minimum_index_compatibility_version" : "5.0.0"}, "tagline" : "You Know, for Search" }Copy the code

    Cluster Health Status

    To check the cluster health, we can run GET /_cluster/health in the Kibana console and get information like the following:

    { "cluster_name" : "wujiajian", "status" : "yellow", "timed_out" : false, "number_of_nodes" : 1, "number_of_data_nodes" : 1, "active_primary_shards" : 9, "active_shards" : 9, "relocating_shards" : 0, "initializing_shards" : 0, "unassigned_shards" : 5, "delayed_unassigned_shards" : 0, "number_of_pending_tasks" : 0, "number_of_in_flight_fetch" : 0, "task_max_waiting_in_queue_millis" : 0, "active_shards_percent_as_number" : 64.28571428571429}Copy the code

    Cluster status is indicated by green, yellow, and red:

    • **Green:** the cluster is healthy; everything is fully functional and all shards and replicas are working properly.
    • **Yellow:** warning status; all primary shards are functional, but at least one replica is not working properly. The cluster can work, but high availability is affected to some extent.
    • **Red:** the cluster is unusable; one or more shards and their replicas are abnormally unavailable. Queries can still be executed, but the results returned will be inaccurate, and write requests assigned to the missing shards will report errors, resulting in data loss.

    When the cluster status is red, it will continue to serve search requests from available shards, but you need to fix the unallocated shards as soon as possible.

    ES Mechanisms and Principles

    After the introduction of the basic concepts and basic operations of ES, we may still have a lot of doubts:

    • How do they work inside?
    • How are master and replica shards synchronized?
    • What is the process for creating an index?
    • How does ES allocate index data to different shards? And how is this index data stored?
    • Why is ES a near-real-time search engine while the CRUD (create, read, update, delete) operations of documents are real-time?
    • How does Elasticsearch ensure that updates are persisted without losing data in the event of a power failure?
    • And why does deleting a document not immediately free up space?

    With these questions in mind we move on to the next section.

    Principle of index writing

    The following figure describes a 3-node cluster with 12 shards, including 4 Master shards (S0, S1, S2, S3) and 8 replica shards (R0, R1, R2, R3). Each Master shard corresponds to two replica shards. Node 1 is the Master node responsible for the state of the whole cluster.

    Index writes can only go to the primary shard and are then synchronized to the replica shards. But there are four primary shards: by what rule does ES decide which shard a particular document is written to?

    Why is this index data written to S0 but not S1 or S2? Why is that data being written to S3 and not S0 again?

    First of all, it certainly can’t be random, otherwise we won’t know where to look when we need to retrieve documents in the future.

    In fact, the process is determined by the following formula:

    shard = hash(routing) % number_of_primary_shards

    Routing is a variable value, which defaults to the _id of the document and can be set to a custom value.

    The routing value is passed through a hash function to generate a number, which is then divided by number_of_primary_shards (the number of primary shards) to obtain a remainder.

    This remainder, which is always between 0 and number_of_primary_shards - 1, is the shard where the document is located.

    This explains why the number of primary shards is fixed at index creation time and can never change: if it changed, all previous routing values would become invalid and documents could never be found again.
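
    As mentioned above, the routing value defaults to the document’s _id but can also be supplied explicitly per request; for example (index, id, and routing value are illustrative), the same routing value must then be used both when indexing and when fetching the document:

    PUT /my_index/_doc/1?routing=user123
    {
      "title": "a document routed by user123"
    }

    GET /my_index/_doc/1?routing=user123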

    Because each node in ES cluster knows the location of documents in the cluster by the above calculation formula, each node has the ability to handle read and write requests.

    After a write request is sent to a node, the node is the coordinator node mentioned above. The coordinator node calculates which shard to write to according to the routing formula, and then forwards the request to the master shard node of the shard.

    Suppose the routing formula for a piece of data yields shard = hash(routing) % 4 = 0.

    The specific process is as follows:

    • The client sends a write request to ES1 node (coordination node), and the value is 0 based on the route calculation formula, so the current data should be written to the master shard S0.
    • ES1 forwards the request to ES3, where the main shard of S0 is located. ES3 receives the request and writes it to disk.
    • Data is concurrently copied to two replica shards R0, where data conflicts are controlled through optimistic concurrency. Once all replica shards report success, the node ES3 reports success to the coordinating node, and the coordinating node reports success to the client.

    Storage principle

    The process above takes place in ES memory. After data is allocated to specific shards and replicas, it ultimately has to be stored on disk so that it will not be lost in the event of a power failure.

    The storage path can be set in the configuration file ../config/elasticsearch.yml; by default, data is stored in the data folder under the installation directory.

    It is not recommended to use the default value, because if ES is upgraded, all data may be lost:

    path.data: /path/to/data  # index data
    path.logs: /path/to/logs  # log files

    ① Segmented storage

    Indexed documents are stored on disk as segments. What are segments? If an index file is split into multiple subfiles, each subfile is called a segment, and each segment is an inverted index. The segment is immutable and cannot be modified once the data in the index is written to the hard disk.

    At the bottom layer, this segmented storage model almost completely avoids locking during reads and writes, which greatly improves read and write performance.

    After a segment is written to disk, a commit point is generated. A commit point is a file that records all the segments available after the commit.

    Once a segment has a commit point, it can be read but no longer written. Conversely, while a segment exists only in memory it can be written but not read, meaning it cannot yet be searched.

    The concept of segments arose mainly because, in early full-text retrieval, a single large inverted index was built for the entire document collection and written to disk.

    When the index needed updating, a complete new index had to be built to replace the old one. This is inefficient for large volumes of data, and because building an index is expensive, the data could not be updated too frequently, so freshness could not be guaranteed.

    Since index files are segmented and immutable, how are additions, updates, and deletions handled?

    • **New:** additions are the easy case. Since the data is new, a new segment simply needs to be created for the new documents.

    • **Delete:** because segments cannot be modified, a delete does not remove the document from its old segment; instead, a .del file is added that lists which documents in which segments have been deleted.

      A document marked as deleted can still be matched by a query, but it is removed from the result set before the final results are returned.

    • **Update:** old segments cannot be modified to reflect a document update, so an update is effectively a delete plus an add: the old version of the document is marked as deleted in the .del file, and the new version is indexed into a new segment.

      Both versions of the document may match a query, but the deleted older version is removed before the result set is returned.

    Making segments immutable has both advantages and disadvantages. The main advantages are:

    • No locks required. If you never update indexes, you don’t need to worry about multiple processes modifying data at the same time.
    • Once the index is read into the kernel’s file system cache, it stays there because of its immutability. As long as there is enough space in the file system cache, most read requests go directly to memory and do not hit disk. This provides a significant performance boost.
    • Other caches, like the Filter cache, remain in effect for the lifetime of the index. They do not need to be rebuilt every time the data changes, because the data does not change.
    • Writing a single large inverted index allows data to be compressed, reducing disk I/O and the amount of indexes that need to be cached into memory.

    The disadvantages of segment immutability are as follows:

    • When old data is deleted, it is not removed immediately; it is only marked as deleted in a .del file, and the space is reclaimed only when segments are merged, so a lot of space is wasted.
    • If data is updated frequently, each update adds a new document and marks the old one as deleted, which also wastes a lot of space.
    • Each time data is added, a new segment is needed to store it. When the number of segments grows too large, the consumption of server resources such as file handles becomes very high.
    • Queries have to collect results from all segments and then exclude the documents marked as deleted, which adds to the query burden.

    ② Write delay policy

    Now that we have covered the storage format, how is the index actually written to disk? Is fsync called directly so that the data is physically written to disk right away?

    The answer is clearly no: writing straight through to disk would make disk I/O a serious drag on performance.

    Heavy writes would then cause ES to stall and queries would be slow to respond; if that were the case, ES could hardly be called a near-real-time full-text search engine.

    To improve write performance, ES does not add a new segment to disk for every new piece of data; instead, it uses a delayed write strategy.

    Whenever new data is added, it is written to memory first, and between memory and disk is the file system cache.

    When the default interval (1 second) elapses, or a certain amount of data has accumulated in memory, a refresh is triggered: the data in memory is turned into a new segment and cached in the file system cache, and only later is it flushed to disk and a commit point generated.

    The memory used here is the ES JVM memory, while the file caching system uses the operating system memory.

    New data will continue to be written to memory, but the data in memory is not stored in segments and therefore cannot be retrieved.

    When data is flushed from memory into the file system cache, it becomes a new segment that is opened for search, without waiting for it to be flushed to disk.

    In Elasticsearch, the lightweight process of writing and opening a new segment is called Refresh.

    By default, each shard is automatically refreshed once per second. This is why Elasticsearch is called near real time because changes to a document are not immediately visible to the search, but become visible within a second.

    We can also trigger a refresh manually: POST /_refresh refreshes all indexes, and POST /nba/_refresh refreshes the specified index.

    Although a refresh is a much lighter operation than a commit, it still has a performance cost. Manual refreshes are useful when writing tests, but do not refresh manually every time you index a document in a production environment. And not every use case needs to refresh every second.

    Perhaps you are using Elasticsearch to index a large number of log files, and you would rather optimize indexing speed than near-real-time search.

    In that case, you can reduce the refresh frequency of each index by raising the value of refresh_interval, for example to "30s", in the settings when the index is created. Note that a time unit must be included when setting the value; otherwise it is interpreted as milliseconds. Setting refresh_interval to -1 disables automatic refresh for the index.
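
    For example (the index name is illustrative), the refresh interval can also be changed on a live index through the settings API and set back later:

    PUT /my_index/_settings
    {
      "index": { "refresh_interval": "30s" }
    }

    Setting the value to "-1" disables automatic refresh, and setting it back to "1s" restores the default behavior.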

    Although the delayed write strategy reduces the number of times data is written to disk and improves overall write capacity, the file system cache is still memory that belongs to the operating system, so as long as data sits only in memory there is still a risk of losing it in a power failure or other abnormal situation.

    To avoid data loss, Elasticsearch adds a transaction log (translog), which records all data that has not yet been persisted to disk.

    After adding transaction logs, the entire index writing process is shown in the figure above:

    • After a new document is indexed, it is first written to memory, but to prevent data loss, a copy of data is appended to the transaction log.

      New documents are constantly being written to memory, as well as to the transaction log. At this point the new data cannot be retrieved and queried.

    • When the default refresh interval is reached or a certain amount of data has accumulated in memory, a refresh is triggered that writes the in-memory data as a new segment into the file system cache and clears the memory. Although the new segment has not yet been committed to disk, it can already serve document retrieval, and it cannot be modified.

    • As new document indexes are continuously written, a Flush is triggered when the log data size exceeds 512 MB or the log time exceeds 30 minutes.

      On a flush, the data in memory is written as a new segment into the file system cache, the data in the file system cache is flushed to disk via fsync, a commit point is generated, the old log file is deleted, and a new empty log is created.

    In this way, after a power failure or a restart, ES not only loads the persisted segments according to the commit point, but also replays the records in the translog to re-persist the data that had not yet been persisted to disk, thus avoiding the possibility of data loss.
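
    A flush can also be triggered manually via the flush API, which is mainly useful in tests or before a planned restart (the index name is illustrative); POST /_flush with no index name flushes all indexes:

    POST /my_index/_flush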

    ③ Segment merging

    Because the automatic refresh process creates a new segment every second, this can cause the number of segments to explode in a short period of time. Too many segments can cause major problems.

    Each segment consumes file handles, memory, and CPU cycles. More importantly, each search request must take turns checking each segment and then merging the query results, so the more segments, the slower the search.

    Elasticsearch solves this problem by periodically merging segments in the background. Smaller segments are merged into larger segments, which are then merged into larger segments.

    Segment merging purges old deleted documents from the file system. Deleted documents are not copied to the new large section. The merge process does not break indexing and searching.

    Segment merge occurs automatically when indexing and searching, and the merge process selects a small number of similarly-sized segments and merges them behind the scenes into larger segments that can be either uncommitted or committed.

    After the merge is complete, the old segment is deleted, the new segment is flushed to disk, and a new commit point containing the new segment is written, excluding the old and smaller segments, and the new segment is opened for search.

    Segment merges are computationally expensive and eat a lot of disk I/O, which can drag down write rates and, if left unchecked, affect search performance.

    Elasticsearch puts resource limits on the merge process by default, so the search still has enough resources to perform well.
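
    If needed, a merge can also be requested explicitly with the force merge API, which is typically only worthwhile on indexes that are no longer being written to (index name and target segment count are illustrative):

    POST /my_index/_forcemerge?max_num_segments=1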

    Performance optimization

    The storage device

    Disks are often the bottleneck on modern servers. Elasticsearch uses disks heavily; the more throughput your disks can handle, the more stable your nodes will be.

    Here are some tips for optimizing disk I/O:

    • **Use SSDs.** As mentioned elsewhere, they are far better than mechanical disks.
    • **Use RAID 0.** Striped RAID increases disk I/O, at the obvious cost of failing entirely when one disk fails. Do not use mirrored or parity RAID, because replicas already provide that redundancy.
    • **Alternatively, use multiple hard drives** and let Elasticsearch stripe data across them via multiple path.data directory entries.
    • **Do not use remotely mounted storage such as NFS or SMB/CIFS.** The latency it introduces is completely counterproductive to performance.
    • **If you are using EC2, beware of EBS.** Even SSD-backed EBS is usually slower than local instance storage.

    Internal index optimization

    To find a given term quickly, ES sorts all the terms and then locates a term with binary search, giving log(N) lookup efficiency, just like looking a word up in a dictionary; this sorted collection of terms is the Term Dictionary.

    So far this looks similar to how a traditional database searches via a B-Tree. But if there are too many terms, the Term Dictionary becomes too large to fit in memory, hence the Term Index.

    Like the index pages of a dictionary, which show which terms start with “A” and on which page each can be found, the Term Index can be understood as a tree over the dictionary.

    This tree does not contain all the terms; it contains only some prefixes of terms. The Term Index lets you quickly locate an offset into the Term Dictionary, from which you then search onward.

    In memory, the Term Index is compressed as an FST (finite state transducer), which stores terms in byte form. This compression effectively reduces storage space so that the Term Index fits in memory, but it also costs more CPU during lookups.

    For inversion lists stored on disk, compression techniques are also used to reduce storage space.

    Adjusting Configuration Parameters

    The recommended parameters are as follows:

    • Give each document an ordered, compression-friendly sequential ID pattern, and avoid random IDs such as UUID-4, which compress poorly and can significantly slow Lucene down.

    • Disable doc values for indexed fields that do not need aggregation or sorting. Doc values are an ordered, column-oriented mapping of document => field value.

    • Use the Keyword type instead of the Text type for fields that do not need full-text (fuzzy) search, to avoid analyzing the text before indexing.

    • If your search results don’t require near-real-time accuracy, consider changing the index.refresh_interval to 30s for each index.

      If you are doing a bulk import, you can turn off the refresh by setting this value to -1 during the import, and you can turn off the replicas by setting index.number_of_replicas: 0. Don’t forget to turn it back on when you’re done.

    • Avoid deep paging queries; it is recommended to use Scroll for paging instead (see the sketch after this list). In a normal paging query, a priority queue of size from+size is created, each shard returns from+size results, and by default only the document IDs and scores are sent to the coordinating node.

      If there are N shards, then the coordinating node performs a second sort of (from+size) × N data and selects the document to be retrieved. When from is large, the sorting process becomes heavy and CPU intensive.

    • Reduce the number of mapped fields and include only the fields that need to be searched, aggregated, or sorted. Other fields can be kept in another store such as HBase; after getting the results from ES, look those fields up in HBase.

    • Specify a routing value both when indexing and when querying, so that queries can go precisely to the specific shard, which improves query efficiency. When choosing a routing key, make sure the data is distributed evenly.
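
    A minimal sketch of the scroll-based paging mentioned in the list above (the index name and keep-alive time are illustrative):

    POST /my_index/_search?scroll=1m
    {
      "size": 1000,
      "query": { "match_all": {} }
    }

    POST /_search/scroll
    {
      "scroll": "1m",
      "scroll_id": "<scroll_id returned by the previous request>"
    }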

    The JVM tuning

    JVM tuning recommendations are as follows:

    • Ensure that the minimum heap memory (Xms) is the same size as the maximum heap memory (Xmx) to prevent the program from changing the heap size at run time.

      Elasticsearch’s default heap size is 1 GB. It is configured in the ../config/jvm.options file; the heap should not exceed 50% of physical memory and should stay below 32 GB. A minimal jvm.options sketch follows this list.

    • The default GC is CMS, which is concurrent but still has stop-the-world pauses; consider using the G1 collector instead.

    • ES relies heavily on Filesystem Cache for fast searching. In general, you should ensure that at least half of the available memory is physically allocated to the file system cache.
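
    A minimal sketch of the heap settings in ../config/jvm.options, assuming a machine with 16 GB of RAM (the size is illustrative):

    -Xms8g
    -Xmx8g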

    Source: www.cnblogs.com/jajian/p/11…
