Preface

More and more enterprises use Elasticsearch (ES) to store unstructured data in business scenarios such as product search on e-commerce sites, data analytics, and log analysis. As a complement to traditional relational databases, ES provides capabilities that relational databases lack.

ES first came to the public's attention because of its full-text search capability: it is built on Lucene, which uses an inverted index data structure internally.

In this article, the author introduces the distributed architecture of ES and its index and storage mechanisms. The article does not cover the ES API in detail; it analyzes ES at the overall architecture level, and the author will cover the use of ES in future articles.

What is an inverted index

To understand what an inverted index is, let us first look at an ordinary (forward) index. Take a book as an example: its table of contents lists chapter and section names together with their page numbers. If we want to read a particular section, we look it up in the table of contents, find its page number, and jump to that page to read the content. Going from a chapter name to a page number and then to the content is a forward index lookup. So what is an inverted index?

Suppose we are reading the book "Thinking in Java". We open the table of contents, which records chapter names and their page numbers. By looking up the chapter named "Inheritance" we can locate the page where that chapter starts and read its content, and we may find that the text contains many occurrences of the word "object".

So what if we wanted to find every passage in the book that contains the word "object"?

With the existing index this would undoubtedly be like looking for a needle in a haystack. But suppose we had a mapping from "object" to the passages that contain it; wouldn't that solve the problem? Such a reverse mapping is called an inverted index.

As shown in figure 1, the articles are first segmented into terms (keywords), and the inverted index is built on those keywords. The keywords are organized into a dictionary, and each keyword in the dictionary points to a postings list (inverted list) that stores information such as the IDs of the documents containing the keyword and the term frequency. Each element of a postings list is called a posting. The whole inverted index works much like a Xinhua dictionary, and the postings lists of all keywords are usually stored sequentially in a file on disk, which is called the inverted file.

(figure 1)

Dictionaries and inverted files are the two basic data structures of Lucene, but they are stored differently: the dictionary is kept in memory, while inverted files are stored on disk. This article does not cover the segmentation, TF-IDF, BM25, vector-space similarity and other techniques used to build and query inverted indexes; readers only need a basic understanding of the inverted index.
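To make the idea concrete, here is a minimal sketch in Python (an illustration only, not how Lucene actually stores things) that builds a tiny inverted index mapping each term to a postings list of (document ID, term frequency) pairs; the three sample documents and the naive whitespace tokenizer are made up for the example.

```python
from collections import Counter, defaultdict

docs = {
    1: "the object of inheritance is code reuse",
    2: "every java object inherits from the Object class",
    3: "a chapter about collections",
}

# term -> postings list of (doc_id, term_frequency)
inverted_index = defaultdict(list)

for doc_id, text in docs.items():
    term_freqs = Counter(text.lower().split())   # naive tokenization
    for term, freq in term_freqs.items():
        inverted_index[term].append((doc_id, freq))

# Query: which documents contain the word "object"?
print(inverted_index["object"])   # -> [(1, 1), (2, 2)]
```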

Cluster architecture of ES

1. Cluster nodes

An ES cluster consists of one or more nodes, where a node is a single ES service instance. A node joins a cluster by being configured with the cluster name, cluster.name. So how do nodes that share the same cluster name actually find each other and form a cluster? To understand this, we first need to understand the roles of nodes in an ES cluster.

The role of an ES node is set with the following configuration in conf/elasticsearch.yml:

node.master: true/false
node.data: true/false

A single node in a cluster can be both a candidate master node and a data node. The two settings above combine to form four categories:

(1) a candidate master node only; (2) both a candidate master node and a data node; (3) a data node only; (4) neither a candidate master node nor a data node.

**Candidate master nodes:** only candidate master nodes can vote in a master election, and only they can be elected as the master node.

**Master node:** responsible for creating and deleting indexes, tracking which nodes are part of the cluster, deciding how shards are allocated to nodes, collecting the status of every node in the cluster, and so on. A stable master node is very important to the health of the cluster.

**Data nodes:** responsible for adding, deleting, updating, searching, and aggregating data. Both data storage and queries are handled by data nodes, which therefore place relatively high demands on CPU, I/O, and memory; machines with higher specifications are generally chosen as data nodes.

In addition, there is a node role called the coordinating node, which is not assigned through settings. A user request can be sent to any node; that node is then responsible for distributing the request, collecting the results, and so on, with no need for the master node to forward anything. A node playing this part is called a coordinating node, and any node in the cluster can act as one. All nodes keep in contact with each other.

(figure 2)

2. Discovery mechanisms

A node can join a cluster simply by being given the cluster name. How does ES make this work?

The answer is ES's discovery mechanism, ZenDiscovery.

ZenDiscovery is a built-in discovery mechanism of ES. It provides unicast and multicast discovery modes and is responsible for discovering nodes in a cluster and electing Master nodes.

Multicast means that a node sends a request to a group of machines at once. This mode is not recommended for ES in production; in a large cluster, multicast generates a great deal of unnecessary traffic.

With unicast, a node that joins an existing cluster, or forms a new one, sends requests to a fixed list of machines. When the node contacts a member of the unicast list, it obtains the status of all nodes in the cluster; it then contacts the master node and joins the cluster.

Only nodes running on the same machine form a cluster automatically; ES uses unicast discovery by default. The unicast list does not need to contain every node in the cluster; it just needs enough nodes that a new node can reach one of them and communicate. If the candidate master nodes are used as the unicast list, listing three of them is usually enough.

This is configured in elasticsearch.yml:

discovery.zen.ping.unicast.hosts: ["host1", "host2:port"]

The Gossip protocol is used in the cluster-information gathering phase, and the configuration above effectively acts as a list of seed nodes; this is not described further here.

ES recommends that unicast.hosts be configured with all candidate master nodes. ZenDiscovery pings them every ping_interval (a configuration item) with a timeout of discovery.zen.ping_timeout (a configuration item). If a node fails the ping three times (the ping_retries configuration item), it is considered down, and a failover is triggered, including shard reallocation and replication.

If the node that went down is not the master, the master updates the cluster metadata: the master sends the latest cluster metadata to the other nodes, which respond with an Ack. Once the master has received responses from discovery.zen.minimum_master_nodes - 1 candidate master nodes, it sends an Apply message to the other nodes and the cluster state is updated. If the node that went down is the master, the remaining candidate master nodes start a new master election.

2.1 Master election

The master election process must ensure that only one master is elected. Through quorum, a threshold representing a majority, ES ensures that the elected master is recognized by at least a quorum of candidate master nodes.

An election is initiated by a candidate master node when it finds that it is not the master itself, it cannot reach the master node via ping, and at least minimum_master_nodes nodes, including itself, cannot reach the master either. Only then does it initiate the election.

Master election flow diagram:

(figure 3)

The master is selected by ordering the cluster nodes by the pair <stateVersion, ID>: nodes are sorted by stateVersion from newest to oldest, so that the node with the most recent cluster metadata is chosen as master, and by ID from smallest to largest, to avoid split votes that would make the election fail when stateVersions are equal.

The node that sorts first becomes the master. When a candidate master node initiates an election, it votes for the node it believes should be master according to the ordering above.
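As an illustration of this ordering, here is a simplified Python sketch (the node IDs and stateVersion values are hypothetical, and this is not ES source code):

```python
# Each candidate master node: (node_id, state_version)
candidates = [
    ("node-7", 42),
    ("node-3", 45),
    ("node-5", 45),
]

# Newest cluster state first (state_version descending);
# smallest node ID wins ties so every voter picks the same node.
winner = sorted(candidates, key=lambda n: (-n[1], n[0]))[0]
print(winner)  # -> ('node-3', 45)
```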

2.2 Split brain

When discussing leader election in distributed systems, the split-brain problem inevitably comes up. What is split brain? If multiple master nodes are elected in one cluster, so that data updates become inconsistent, that is split brain.

In short, different nodes in the cluster disagree on the choice of master, resulting in multiple competing masters.

Generally speaking, split-brain problems may be caused by the following reasons:

  • **Network issues:** network delays within the cluster cause some nodes to lose contact with the master. They conclude that the master is down even though it is not, elect a new master, and allocate new primary shards for the shards and replicas that were on the old master.

  • **Node load:** when a node acts as both master node and data node and the volume of requests is large, ES may stop responding (a state of suspended animation), causing large delays. Other nodes then get no response from the master, assume it has died, and re-elect a master.

  • **Memory reclamation:** when a node acts as both master node and data node and the ES process on the data node uses a large amount of memory, large-scale JVM memory reclamation can make the ES process unresponsive.

How to avoid split brain? We can address the causes above:

  • Increase the response timeout appropriately to reduce misjudgments. discovery.zen.ping_timeout sets the ping timeout of a node; the default is 3s.

  • Tune the election trigger threshold: set the parameter discovery.zen.minimum_master_nodes in the configuration file of the candidate master nodes. The default value is 1; the officially recommended value is (master_eligible_nodes / 2) + 1, where master_eligible_nodes is the number of candidate master nodes. This both prevents split brain and keeps the cluster as available as possible: as long as at least minimum_master_nodes candidate nodes are alive, an election can proceed normally. When fewer than that number are alive, no election can be triggered and the cluster cannot be used, but shard chaos is avoided (a short calculation follows this list).

  • Role separation: separate the candidate master role from the data role as described above. This lightens the load on the master node, prevents it from appearing dead, and reduces false judgments that the master is down.
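For reference, the recommended quorum value mentioned above can be computed with a plain calculation (not an ES API):

```python
def minimum_master_nodes(master_eligible_nodes: int) -> int:
    # Officially recommended quorum: (master_eligible_nodes / 2) + 1, using integer division.
    return master_eligible_nodes // 2 + 1

for n in (1, 2, 3, 5):
    print(n, "candidate master nodes -> minimum_master_nodes =", minimum_master_nodes(n))
# 1 -> 1, 2 -> 2, 3 -> 2, 5 -> 3
# With 3 candidate master nodes the value is 2, so an isolated single node
# can never elect itself master, which is what prevents split brain.
```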

How an index is written

1. Index writing principle

1.1 Shards

ES supports PB-level full-text search. When the amount of data is large, query performance gradually slows down, so one remedy is to spread the data across different places for storage. The blocks into which the data store is split are called shards, much like MySQL sharding.

Different primary shards are distributed on different nodes. So, for an index with multiple shards, where should a piece of data be written? It certainly cannot be written at random, otherwise the corresponding data could not be located quickly at query time. A routing policy therefore decides which shard to write to; how routing works is described later. The number of shards must be specified when the index is created and cannot be changed once set.

1.2 Replicas

A replica is a copy of a shard. Each primary shard has one or more replica shards, and when the primary shard fails, a replica can serve queries. A primary shard and its replica shards are never placed on the same node, so that when a node goes down data is not lost and can still be queried from a replica. The maximum number of replica shards is N - 1, where N is the number of nodes.

Create, index, and delete requests on documents are write operations. A write must be performed on the primary shard before it can be copied to the corresponding replicas. To improve write throughput, ES performs this replication concurrently, and to resolve the data conflicts that concurrent writes can cause, ES controls the process with optimistic locking: every document has a _version number that is incremented whenever the document is modified.

Once all replica shards report success, success is reported to the coordinating node, and the coordinating node reports success to the client.
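The following is a minimal Python sketch of the optimistic-locking idea behind _version (a conceptual illustration, not ES's actual implementation): a write is applied only if the version the writer last saw is still the current one.

```python
class VersionConflict(Exception):
    pass

# A toy "shard": doc id -> (version, body)
store = {"1": (1, {"title": "old"})}

def update(doc_id, new_body, expected_version):
    """Apply the write only if no concurrent writer bumped the version first."""
    current_version, _ = store[doc_id]
    if current_version != expected_version:
        raise VersionConflict(f"expected {expected_version}, found {current_version}")
    store[doc_id] = (current_version + 1, new_body)

update("1", {"title": "new"}, expected_version=1)       # succeeds, version becomes 2
try:
    update("1", {"title": "stale"}, expected_version=1)  # fails: version is now 2
except VersionConflict as e:
    print("conflict:", e)
```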

(figure 4)

1.3 Elasticsearch Index Writing Process

As mentioned above, writes go to the primary shard and are then synchronized to the replica shards. As shown in Figure 4, there are four primary shards S0, S1, S2, and S3. By what strategy is a piece of data written to a particular shard? Why is this piece of index data written to S0 rather than S1 or S2? The decision is made by the following formula.

shard = hash(routing) % number_of_primary_shards

The value of this formula is a remainder between 0 and number_of_primary_shards - 1, which is the shard where the data ends up. routing is passed through a hash function to produce a number, which is then divided by number_of_primary_shards (the number of primary shards) to obtain the remainder. routing is a variable value; it defaults to the document's _id and can also be set to a custom value.
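Here is a minimal Python sketch of that routing rule (conceptual only: ES actually applies a murmur3 hash to the routing value rather than a generic hash, and the document IDs below are made up):

```python
import hashlib

def select_shard(routing: str, number_of_primary_shards: int) -> int:
    """shard = hash(routing) % number_of_primary_shards (stable hash used for illustration)."""
    h = int(hashlib.md5(routing.encode("utf-8")).hexdigest(), 16)
    return h % number_of_primary_shards

# routing defaults to the document _id
for doc_id in ("doc-1", "doc-2", "doc-3"):
    print(doc_id, "->", select_shard(doc_id, 4))
```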

After a write request arrives at a node, that node acts as the coordinating node described earlier: it uses the routing formula to work out which shard the document should be written to. Because the current node knows which shards sit on every other node, if it finds that the target primary shard lives on another node, it forwards the request to that node.

In an ES cluster, every node knows where data is stored in the cluster thanks to the formula above, so every node is able to accept read and write requests.

So why must the number of primary shards be fixed at index creation and never modified? Because if the number changed, all previously computed routing values would become invalid and the data could never be found again.

(figure 5)

As shown in Figure 5, suppose the shard for a piece of data has been determined by the routing formula shard = hash(routing) % 4. The specific flow is then as follows:

(1) The write request is sent to node1. The routing calculation yields 1, so the data belongs on primary shard S1.
(2) node1 forwards the request to node2, where the primary shard S1 is located; node2 accepts the request and writes it to disk.
(3) The data is replicated concurrently to the three replica shards R1, with optimistic concurrency control resolving write conflicts. Once all replica shards report success, node2 reports success to node1, and node1 reports success to the client.

In this mode, as long as there are replicas, the minimum write latency is the sum of two single-shard writes (primary plus replica), so write efficiency is lower. The benefit, however, is obvious: it avoids losing data when a single machine's hardware fails after a write. Between data integrity and performance, data integrity is generally preferred, unless a particular scenario can tolerate losing data.

In ES, to reduce disk I/O and preserve read/write performance, data is generally persisted to disk only every 30 minutes or so. But what about data that has been written to memory and not yet flushed to disk? If the machine goes down or loses power, that data would be lost. How is this guaranteed not to happen?

ES borrows a page from databases and adds a commit-log module, called the translog in ES, which is described in the storage principles section below.

2. Storage principles

The index writing flow above describes how data is written to shards and replicas, at which point it sits in memory. To guarantee that the data survives a power failure, it also needs to be persisted to disk.

As we know, ES is built on Lucene: creating and writing indexes, searching and querying are all done through Lucene internally. Lucene works as follows: when a new document is added, Lucene performs tokenization and other preprocessing and then writes the document's index into memory. The translog is similar to MySQL's binlog; it is used to recover the in-memory data after a crash and stores the operation log of data that has not yet been persisted.

By default, Lucene refreshes the data in memory to the file system cache every 1s (the refresh_interval configuration item), forming a segment. A segment can be searched only once it has been flushed into the file system cache, not before.

refresh_interval therefore determines how quickly ES data becomes visible, which is why ES is called a near-real-time system. Segments cannot be modified on disk, so random disk writes are avoided and all random writes happen in memory. By default, every 30 minutes, or when the segments in the cache grow beyond 512 MB, Lucene persists them to disk; this is called a commit point, and the corresponding translog is then deleted.

When testing write operations, you can use a manual refresh so that data can be searched immediately, but in a production environment do not manually refresh every time a document is indexed: the refresh operation has a performance cost, and not every business scenario needs a refresh every second.

You can reduce the refresh frequency of an index by increasing refresh_interval in its settings, for example refresh_interval: "30s". Note that the value must carry a time unit; otherwise it is interpreted as milliseconds. Setting refresh_interval to -1 disables automatic refresh for the index.
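As an example, here is a hedged sketch of changing refresh_interval through the index settings REST endpoint (the index name my_index and the local, unsecured address are illustrative; in practice you would typically go through an ES client library):

```python
import requests

ES = "http://localhost:9200"   # illustrative, unsecured local node

# Lower the refresh frequency; note the explicit time unit ("30s", not 30).
requests.put(f"{ES}/my_index/_settings",
             json={"index": {"refresh_interval": "30s"}})

# Disable automatic refresh (e.g. during a bulk load), then restore the default.
requests.put(f"{ES}/my_index/_settings",
             json={"index": {"refresh_interval": "-1"}})
requests.put(f"{ES}/my_index/_settings",
             json={"index": {"refresh_interval": "1s"}})
```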

(figure 6)

Index files are split into segments and segments cannot be modified, so how are additions, updates, and deletions handled?

  • Additions are easy to handle: since the data is new, a new segment is simply added for the current document.

  • Deletions: because segments are immutable, deleting a document does not remove it from the old segment. Instead, a .del file records the list of deleted documents. A document marked as deleted can still be matched by a query, but it is removed from the result set before the final results are returned.

  • Updates: the old segment cannot be modified to update a document, so an update is effectively a delete followed by an add. The old version of the document is marked as deleted in the .del file, and the new version is indexed into a new segment. A query may match both versions of the document, but the deleted older version is removed before the result set is returned (a conceptual sketch follows this list).
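Here is the conceptual sketch referred to above: immutable segments plus a deletion list, with deleted documents filtered out at query time (an analogy in Python, not Lucene code).

```python
segments = []     # each segment is an immutable list of (doc_id, version, text)
deleted = set()   # stands in for the .del markers: (segment_no, doc_id)

def add(doc_id, version, text):
    segments.append([(doc_id, version, text)])   # new docs always go into a new segment

def update(doc_id, old_segment, new_version, text):
    deleted.add((old_segment, doc_id))           # mark the old version as deleted
    add(doc_id, new_version, text)               # index the new version in a new segment

def search(word):
    hits = []
    for seg_no, seg in enumerate(segments):      # every segment is consulted...
        for doc_id, version, text in seg:
            if word in text and (seg_no, doc_id) not in deleted:
                hits.append((doc_id, version))   # ...and deleted docs are filtered out
    return hits

add("1", 1, "immutable segments avoid random disk writes")
update("1", old_segment=0, new_version=2, text="immutable segments are merged later")
print(search("immutable"))   # -> [('1', 2)]  only the new version is returned
```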

Making segments immutable has both advantages and disadvantages.

Advantages:

  • No locks required. If you never update indexes, you don’t need to worry about multiple processes modifying data at the same time.

  • Once the index is read into the kernel’s file system cache, it stays there because of its immutability. As long as there is enough space in the file system cache, most read requests go directly to memory and do not hit disk. This provides a significant performance boost.

  • Other caches, like the Filter cache, remain in effect for the lifetime of the index. They do not need to be rebuilt every time the data changes, because the data does not change.

  • Writing a single large inverted index allows data to be compressed, reducing disk I/O and the amount of index data that needs to be cached in memory.

Disadvantages:

  • When old data is deleted, it is not removed immediately; it is only marked as deleted in the .del file. The old data is physically removed only when segments are merged, so a lot of space is wasted in the meantime.

  • If data is updated frequently, every update writes a new version and marks the old one as deleted, which also wastes a lot of space.

  • Each time data is added, a new segment is required to store the data. When the number of segments is too high, the consumption of server resources such as file handles can be very high.

  • Query results span all segments and must exclude the old data marked as deleted, which adds to the query burden.

2.1 Segment merging

Because a new segment is created every time a refresh occurs, the number of segments can explode in a short period of time, and too many segments can cause problems. A large number of segments can affect data read performance. Each segment consumes file handles, memory, and CPU cycles.

More importantly, each search request must take turns checking each segment and then merging the query results, so the more segments, the slower the search.

Lucene therefore merges segments according to a policy that purges old deleted documents from the file system. Deleted documents are not copied into new large segments.

The merge process does not interrupt indexing or searching; the inverted-index data structure makes merging files relatively easy.

Segment merge occurs automatically when indexing and searching, and the merge process selects a small number of similarly-sized segments and merges them behind the scenes into larger segments that can be either uncommitted or committed.

After the merge is complete, the old segment is deleted, the new segment is flushed to disk, a new commit point is written that contains the new segment and excludes the old and smaller segments, and the new segment is opened and can be searched. Segment merges are computationally expensive and eat a lot of disk I/O, and they drag down write rates and, if left unchecked, affect search performance.

ES puts resource limits on merge processes by default, so search performance is guaranteed.

(figure 7)

In closing

This article has introduced the architectural principles of ES and its index storage and write mechanisms. The overall architecture of ES is quite ingenious, and we can borrow from its design ideas when designing our own systems. This article covered only the overall architecture of ES; more content will be shared in subsequent articles.

Author: Development team of Vivo official website mall