1. Concepts
- Document: the minimum unit of data.
- An Index contains a set of documents with a similar structure; it represents a class of similar documents.
- A Type represents a logical category of data within an Index. Each Index can have multiple Types, and each Type can hold multiple Documents.
In 5.x, an index could contain multiple Types; in 6.x, an index can contain only one Type; and as of 7.0, the concept of Type has been removed (7.x still accepts it for backward compatibility). The main reason is that when documents in the same index share only a few fields, or none at all, the data becomes sparse, which hurts Lucene's ability to compress it effectively.
- A Document has multiple Fields; each Field is one data field of the document.
- A single machine cannot hold an unlimited amount of data. In Elasticsearch, the data in an Index can be split into multiple Shards and distributed across different server nodes to achieve horizontal scaling and improve throughput and performance. This is a decentralized, clustered way of distributing data.
Each Shard is a Lucene index, which can be considered a Lucene instance.
- Data in a single Shard could be lost, so each Shard has multiple copies (Replica Shards). A Replica serves as a backup in case the Shard fails, ensuring data is not lost; multiple replicas also improve search throughput and performance.
The number of Primary Shards in an index must be set when the index is created and cannot be changed afterwards. The number of Replica Shards, however, can be changed at any time.
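As a sketch (the index name test_index is just an example), the shard layout is declared when the index is created; the primary shard count is fixed from this point on:

PUT /test_index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}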
Elasticsearch | Relational database
---|---
Index | Database instance
Type | Table
Document | Row
Field | Column
2. Cluster status
- Green: every Primary Shard and Replica Shard of every index is active.
- Yellow: every Primary Shard of every index is active, but some Replica Shards are not active (unavailable).
- Red: some Primary Shards are not active, and data of some indexes has been lost.
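The status can be checked with the cluster health API; a minimal sketch:

GET /_cluster/health

The response includes a status field whose value is green, yellow, or red.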
3. Load balancing
Suppose the index test_index has 3 primary shards, and each primary shard has 2 replica shards, so there are 9 shards in total:
- P1 / P2 / P3: the test_index data is evenly distributed across the three primary shards.
- R1-1, R1-2, R2-1, R2-2, R3-1, R3-2: each primary shard has two replica shards. A Replica Shard is a backup of the data, similar to the slave node in master/slave mode. The six Replica Shards are also distributed across the ES process nodes.
A Primary Shard and its Replica Shards are never allocated to the same node; if they were, a single node failure would lose both the data and its copies.
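You can inspect how the shards of an index are allocated across nodes with the cat shards API (the index name is just an example):

GET /_cat/shards/test_index?v

Each row shows a shard, whether it is a primary (p) or replica (r), its state, and the node it lives on.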
4. High availability
Elasticsearch keeps the cluster highly available through the primary/replica model combined with an election mechanism.
Suppose node 1 goes down. Primary shard P1 becomes inactive and the cluster status turns red. However, the complete data set still exists on nodes 2 and 3: the two copies of P1, R1-1 and R1-2, start a round of election, and the winner becomes the new primary shard. Assume R1-1 wins.
When node 1 recovers, its shard rejoins the cluster. The old P1 finds that a new primary shard already exists on node 2, so it turns itself into a Replica Shard and synchronizes data from the new primary.
5. Horizontal scaling
When new nodes are added to the original cluster, the existing shards are rebalanced onto them, spreading data and load across more machines.
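After scaling out, the replica count can also be raised so the extra nodes serve more read traffic; a sketch, reusing the example index test_index:

PUT /test_index/_settings
{
  "index": { "number_of_replicas": 2 }
}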
6. Concurrency control
6.1. Version numbers
Elasticsearch uses a version-number mechanism to control concurrent modification of documents. The version number is essentially an optimistic lock. When Elasticsearch writes document data, it automatically generates a version number for the document: when the document is first created, _version is 1, and every time the document is modified or deleted, _version is automatically incremented by 1.
Execution flow
- Suppose two threads, A and B, read the same product document at the same time, and both get _version = 1.
- Thread A decrements the inventory by 1 locally and sends a PUT update request carrying the version number:
PUT /product/storage/1?version=1
- ES receives the request, updates the data, and increments the version number to 2.
- Thread B also decrements the inventory by 1 locally and then sends a PUT update request, likewise carrying version number 1:
PUT /product/storage/1?version=1
- After receiving the request, ES finds that the document with ID 1 already has version 2, while the version carried in thread B's request is still 1. Because they differ, ES refuses the update and returns an error.
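A minimal sketch of the conflict, assuming a hypothetical stock field in the document body:

PUT /product/storage/1?version=1
{ "stock": 99 }

The first such request (thread A) succeeds and _version becomes 2; when thread B sends the same request with version=1, Elasticsearch rejects it with a version-conflict error (HTTP 409), and the application must re-read the document and retry.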
6.2. Custom version numbers
Concurrency control can also be based on a version number maintained by the application itself, rather than the one generated by Elasticsearch.
- Internal version number syntax:
?version=1
- Custom (external) version number syntax:
?version=1&version_type=external
With version_type=external, the write succeeds only when the version provided in the request is greater than the _version currently stored in Elasticsearch.
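A sketch of an externally versioned write, assuming the application tracks the version (here 5) in its own database and a hypothetical stock field:

PUT /product/storage/1?version=5&version_type=external
{ "stock": 98 }

The write is accepted because 5 is greater than the current _version, which then becomes 5; a request with a smaller or equal version would be rejected.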
7. Routing principles
Elasticsearch routes a document to a shard by hashing its routing key, which by default is the document ID. The routing algorithm is shard = hash(routing_key) % number_of_primary_shards. You can also specify a routing_key for a document manually; documents with the same routing_key are routed to the same shard. This formula is also why the number of primary shards cannot be changed after the index is created: changing it would route existing documents to the wrong shards.
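A sketch of manual routing; user123 is just an example key, and the same routing value must be supplied when reading the document back:

PUT /product/storage/1?routing=user123
{ "stock": 100 }

GET /product/storage/1?routing=user123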
7.1. Write
For example, suppose there are three primary shards, each with one replica. A Client (using the Elasticsearch client SDK) sends a document write request; the request may hit any ES node, and the node that receives it is called the coordinating node.
In fact, every ES node knows the state of the other nodes in the cluster, including how many primary/replica shards the cluster has and which node each primary/replica shard is allocated to.
- Suppose the ES process 2 node (the coordinating node) receives the request. It hashes the document ID and finds that the result maps to primary shard 3 (P3), so it forwards the request to P3 on the ES process 3 node.
- After processing the request, primary shard 3 synchronizes the data to its replica shard (R3-1) and responds to ES process 2. After receiving the response, ES process 2 (the coordinating node) returns the result to the ES Client.
In Elasticsearch, write requests are always handled by the Primary Shard.
7.2. Read
The basic flow of a document read is similar to that of a write, except that a read request can be handled by either the primary shard or a replica shard, which improves the system's throughput and performance. After receiving a read request, the coordinating node uses a round-robin algorithm to pick one shard from the corresponding primary shard and all of its replicas to forward the request to, balancing the read load across them.
For example, suppose the client sends a request to read a document and it hits ES process 2. From the document ID, ES process 2 calculates that primary shard 3 owns the document. Primary shard 3 has one replica, so the coordinating node uses round-robin to choose one of them to forward the request to; say R3-1 is chosen. R3-1 returns the query result, and ES process 2 finally returns the result to the client.
8. Data consistency
Elasticsearch provides three write-consistency levels: one, all, and quorum.
PUT /index/type/id?consistency=quorum
- one: a document write operation (create, delete, or update) can proceed as long as one primary shard is active.
- all: a document write operation can proceed only when all primary shards and replica shards are active.
- quorum: a document write operation can proceed only when a majority of the shard copies are available. If the quorum condition is not met, the request waits (1 minute by default) and then returns a timeout error. A timeout parameter can be added to the write operation.
PUT /index/type/id?timeout=30
Elasticsearch calculates the "majority" of shards with the following formula:
quorum = int((primary + number_of_replicas) / 2) + 1
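A worked example: for an index with 1 primary shard per shard group and number_of_replicas = 2, quorum = int((1 + 2) / 2) + 1 = 2, so at least 2 of the 3 copies of each shard must be active for a quorum write to proceed.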
9. Data persistence
Segment File
Each Shard is a Lucene index.
A segment file can be thought of as the underlying file that stores the document data; ES reads data from segment files during retrieval. A commit point file can be thought of as a list that records segments: it identifies the segment files that existed before the commit point.
For example, when Elasticsearch creates a commit point file, some segment files already exist, and the commit point records information about those old segment files.
Writing process
A piece of document data reaches disk through the following stages: write, refresh, flush, and merge.
Write
The document is first written to the in-memory buffer, which is ordinary application memory. At the same time, the document data is appended to the translog log file.
Refresh
Every second, Elasticsearch performs a refresh, which writes the data in the buffer into a new segment file in the filesystem cache. The filesystem cache is equivalent to the OS cache. Note: at this point the segment file exists only in the OS cache, not on disk.
After the refresh completes, the buffer is cleared. A refresh is also triggered when the buffer fills up. Once data reaches the filesystem cache, it can be searched.
This is why Elasticsearch is called near real-time (NRT): it refreshes every second by default, so data written to Elasticsearch becomes searchable about one second later. The refresh interval is controlled by the index setting index.refresh_interval.
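A sketch of tuning the refresh interval and forcing an immediate refresh (the index name is just an example):

PUT /test_index/_settings
{
  "index": { "refresh_interval": "5s" }
}

POST /test_index/_refresh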
Flush
Segment files cannot stay in the OS cache forever; there must be a mechanism that writes the cached data to disk files, otherwise an outage would lose them. As refreshes happen, more and more segment files accumulate in the OS cache (each refresh creates a new one). When the translog reaches a size threshold, a flush is triggered.
- In the first step of the flush, any data still in the buffer is refreshed into the OS cache as a new segment.
- A commit point file is written to disk, identifying all the segment files that exist in the OS cache up to that commit point.
- All data in the OS cache is forcibly fsync'ed to disk files.
- Finally, the translog is cleared, because the data it recorded has now been fsync'ed to disk.
By default, a flush is executed automatically every 30 minutes, or earlier if the translog grows too large.
Until a flush is performed, the data lives only in the buffer or in the OS cache, so it would be lost if the machine went down. This is what the translog is for: if the machine crashes and restarts, Elasticsearch automatically replays the translog file and restores the data into the buffer and OS cache. However, the translog data is itself written to the OS cache first and only fsync'ed to disk every 5 seconds by default. So even with the translog, up to 5 seconds of data can be lost: if the machine dies while data exists only in the buffer, the OS cache, or the not-yet-fsync'ed part of the translog, those 5 seconds of data are gone.
To guarantee that no data is lost, set the index.translog.durability parameter so that every write is fsync'ed to the translog on disk at the same time it is written to the buffer; this, however, severely reduces write performance and throughput.
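A sketch of the relevant setting, assuming the index is named test_index; "request" fsyncs the translog on every write, while "async" fsyncs it on an interval (index.translog.sync_interval, 5s by default):

PUT /test_index
{
  "settings": {
    "index.translog.durability": "request"
  }
}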
Merge
Each refresh creates a new segment file in the OS cache, and as the number of segments grows, search performance degrades. Elasticsearch therefore periodically merges segments of similar size into larger ones and, in the process, physically removes documents that have been marked as deleted. After the merged segment files are written to disk, a new commit point file is created that identifies all the new segment files.
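Merging runs automatically in the background; it can also be triggered manually with the force-merge API, for example to compact an index that will no longer be written to (the index name is just an example):

POST /test_index/_forcemerge?max_num_segments=1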