
1. Core Concepts

1.1 Index

An index is identified by a name (which must be all lowercase), and that name is used when indexing, searching, updating, and deleting the documents in it. An index is a collection of similar documents, comparable to a database. All searches go through an index, which greatly speeds up queries, much like the table of contents in a dictionary.
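As a rough illustration of the naming rule, the sketch below checks a candidate index name before it is sent to the cluster. The helper is_valid_index_name is a hypothetical name, and it covers only a subset of the naming restrictions Elasticsearch documents (lowercase only, a few forbidden characters and prefixes, a 255-byte limit), so treat it as a pre-check rather than the server's real validation.

```python
# Characters that Elasticsearch rejects in index names (subset of the full rules).
_FORBIDDEN = set('\\/*?"<>| ,#')


def is_valid_index_name(name: str) -> bool:
    """Hypothetical helper: checks a few of the documented naming rules."""
    if not name or name in (".", ".."):
        return False
    if name != name.lower():            # must be all lowercase
        return False
    if name[0] in "-_+":                # must not start with -, _ or +
        return False
    if any(ch in _FORBIDDEN for ch in name):
        return False
    return len(name.encode("utf-8")) <= 255   # at most 255 bytes


for candidate in ("blog_posts", "BlogPosts", "_hidden", "a b"):
    print(candidate, is_valid_index_name(candidate))
```

Of the four sample names, only "blog_posts" passes: the others fail on uppercase letters, a leading underscore, and a space respectively.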

1.2 Type

Within an index, one or more types can be defined. A type is a logical partition of the index, typically defined for documents that share the same fields. Type support differs across Elasticsearch versions:

Version   Type support
5.x       Multiple types per index are supported
6.x       Only one type per index is allowed
7.x       Custom types are deprecated; the default type is _doc

1.3 Document

A document is the basic unit that can be indexed, equivalent to a row in a database table. Any number of documents can exist within an index or type.

1.4 Field

A field is equivalent to a column in a database table; each field has its own data type.

1.5 Mapping

Mapping defines how data is processed and under what rules, for example a field's type, default value, analyzer, and whether it is indexed. Processing data according to well-chosen mapping rules can significantly improve performance.
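To make these mapping options concrete, here is a minimal sketch of a mapping body written as a Python dict, in the typeless 7.x style. The index name "articles" and every field name are made up for illustration. The parameters shown (type, analyzer, null_value, index) are standard mapping parameters, but which ones suit a given field depends on your data.

```python
import json

# Hypothetical mapping for an index called "articles" (7.x typeless style).
# Field names are illustrative only.
mapping_body = {
    "mappings": {
        "properties": {
            # Full-text field, tokenized by the built-in standard analyzer.
            "title": {"type": "text", "analyzer": "standard"},
            # Exact-value field; null_value is used when the source value is null.
            "status": {"type": "keyword", "null_value": "draft"},
            # Numeric field.
            "views": {"type": "integer"},
            # Kept in _source but not searchable, because index is false.
            "raw_body": {"type": "text", "index": False},
            # Date field.
            "published_at": {"type": "date"},
        }
    }
}

# This JSON would be the request body of an index-creation call.
print(json.dumps(mapping_body, indent=2))
```

The printed JSON is what would be sent as the body when creating the index, for example with PUT /articles.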

1.6 Shards

Shards exist to solve two problems caused by a large number of documents in a single index: storage limits and slow search responses. An index is split into multiple pieces, each called a shard. Each shard is itself a fully functional “index” that can be placed on any node in the cluster.

There are two important reasons for sharding: 1) It allows horizontal splitting to scale capacity. 2) It allows distributed, parallel operations across shards, which improves throughput.

Each shard is a Lucene index, so an Elasticsearch index is a collection of Lucene indexes. When a query runs, the request is sent to every shard belonging to the Elasticsearch index, and the per-shard results are merged before being returned.
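The query fan-out described above can be sketched in a few lines. This is a toy model, not Elasticsearch code: each shard is a plain list of (score, doc_id) pairs with made-up data, and search_shard and search_index are hypothetical names.

```python
import heapq

# Each shard holds (score, doc_id) pairs; the data is invented for the example.
shards = {
    0: [(0.9, "a"), (0.4, "d")],
    1: [(0.7, "b")],
    2: [(0.8, "c"), (0.1, "e")],
}


def search_shard(hits, size):
    """Each shard returns only its own top `size` hits."""
    return heapq.nlargest(size, hits)


def search_index(shards, size):
    """Send the query to every shard, then merge into a global top `size`."""
    partial = [hit for hits in shards.values() for hit in search_shard(hits, size)]
    return heapq.nlargest(size, partial)


top = search_index(shards, 2)
print(top)  # [(0.9, 'a'), (0.8, 'c')]
```

Because each shard only returns its local top hits, the merging step handles at most size × number_of_shards candidates instead of every matching document.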

1.7 Replicas

In a network/cloud environment where failure can happen at any time and where a shard/node somehow goes offline or disappears for whatever reason, having a failover mechanism is very useful and highly recommended. For this purpose, Elasticsearch allows you to create one or more copies of a shard. These copies are called replicas.

Replicas exist for two important reasons: 1) Improved availability: a replica is never placed on the same node as its primary shard, so losing one node does not lose every copy. 2) Improved throughput: search operations can run in parallel on all replicas.

Shard allocation is the process of assigning shards, whether primaries or replicas, to nodes. For a replica, it also includes copying the data from the primary shard. This process is handled by the master node.
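The placement constraint can be illustrated with a small round-robin sketch. This is not Elasticsearch's actual allocator, which also weighs factors such as disk usage and balance. The function allocate and the node names are hypothetical; the sketch only shows that a primary and its replica end up on different nodes.

```python
def allocate(num_primaries, num_replicas, nodes):
    """Round-robin placement; copies of the same shard never share a node."""
    if num_replicas + 1 > len(nodes):
        # Simplification: this sketch just refuses when copies cannot be separated.
        raise ValueError("not enough nodes to separate all copies")
    layout = {node: [] for node in nodes}
    for shard in range(num_primaries):
        for copy in range(num_replicas + 1):
            # Consecutive copies of one shard land on consecutive nodes.
            node = nodes[(shard + copy) % len(nodes)]
            label = f"P{shard}" if copy == 0 else f"R{shard}"
            layout[node].append(label)
    return layout


layout = allocate(3, 1, ["node1", "node2", "node3"])
for node, held in layout.items():
    print(node, held)
```

With 3 primaries and 1 replica on three nodes, every node holds one primary and one replica of a different shard, so any single node can fail without losing data.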

2. System Architecture

The following figure shows a three-node cluster with 3 primary shards and 1 replica of each. P denotes a primary shard (for example, P0), and R denotes a replica (R0 is the replica of P0).

A running instance of Elasticsearch is called a node, and a cluster is made up of one or more nodes with the same cluster.name configuration that share data and load. When a node is added to the cluster or removed from the cluster, the cluster redistributes all data evenly.

The data synchronization process for a write request is shown below:

Process:

1) A write request is sent to node1 (it could be any node).

2) Node1 determines which shard the document belongs to. Here we assume it is shard P1 on node2, so node1 forwards the request there.

3) Shard P1 writes the data and synchronizes it to its replica R1.

4) Once all operations are complete, each node reports its result back, and node1 returns the response to the client.

Note: The number of copies that must be synchronized before success is returned is configurable.
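The write flow above can be simulated with a short sketch. It assumes the simplified routing rule shard = hash(routing) % number_of_primary_shards, where the routing value defaults to the document ID. Elasticsearch uses a Murmur3 hash; zlib.crc32 is used here only as a stand-in. The names route, write, primaries, and replicas are hypothetical, and plain dicts play the role of shards.

```python
import zlib

NUM_PRIMARY_SHARDS = 3

# One primary store and one replica store per shard number (dicts as stand-ins).
primaries = [dict() for _ in range(NUM_PRIMARY_SHARDS)]
replicas = [dict() for _ in range(NUM_PRIMARY_SHARDS)]


def route(doc_id):
    """Pick the shard for a document: hash(routing) % number of primaries.

    crc32 is a stand-in hash for this sketch; Elasticsearch uses Murmur3.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_PRIMARY_SHARDS


def write(doc_id, doc):
    shard = route(doc_id)              # step 2: compute the target shard
    primaries[shard][doc_id] = doc     # step 3a: write to the primary
    replicas[shard][doc_id] = doc      # step 3b: synchronize to the replica
    # step 4: report back which shard was used and how many copies hold the doc
    return {"shard": shard, "acknowledged_copies": 2}


result = write("doc-1", {"title": "hello"})
print(result)
```

Because the shard is derived from the number of primary shards, that number cannot simply be changed after the index is created: existing documents would then route to the wrong shard.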