The Document routing

Routing algorithm: shard = hash(routing) % number_of_primary_shards

For example, an index has three primary shards, P0, P1, and P2

  • Each time a document is added, deleted, or checked, a routing number is brought in. By default, the document’s _id (either manually specified or automatically generated) is routing = _id, assuming that _id=1
  • It will pass this routing value into a hash function, produce a hash of the routing value, hash(Routing) = 21, and then take the remainder of the hash value of the number of primary shards in the index, 21%3 = 0, I’m going to put this document on P0.
  • The most important value to determine which shard a document is on is the routing value, which is _id by default, or you can specify it manually, and the same routing value, every time it comes in, from the hash function, the hash value must be the same no matter what the hash value is, no matter what the number is, The remainder of number_of_primary_shards must be in the range from 0 to number_OF_primary_shards -1. 0.

_id or custom routing value

The default routing is _id; You can also manually specify a routing value when sending the request, such as put /index/type/ ID? routing=user_id

It is useful to specify routing values manually to ensure that a certain type of document must be routed to a shard, which can be useful for application level load balancing and bulk read performance.

This is why the number of primary shards is immutable.

ES write data and brush disk principle

The data writing process of ES is as follows:

  • The client selects a node to send the request to, and that node is a coordinating node.
  • Coordinating node, routing document, forwarding request to corresponding node (with primary shard)
  • The primary shard on the actual node processes the request and synchronizes the data to the Replica node
  • Coordinating node, if it is found that the primary node and all replica nodes are completed, the response result is returned to the client

Write flow – brush disk principle

(2) Every one second, the data in the buffer is written to a new segment file and stored in the OS cache. When the segment is opened for search, (3) the buffer is emptied. (4) Repeat 1~3, new segments are added continuously, the buffer is emptied, and data in the translog keeps accumulating. (5) When the length of the translog reaches a certain point, Commit operation occurs (5-1) all data in buffer is written to a new segment, and written to OS cache, opened for use (5-2) Buffer is emptied (5-3) A commit ponit is written to disk, All index segment files in filesystem cache are forcibly flushed to disk by fsync (5-5). The existing translog is cleared and a new translog is created

Fsync + Flush the translog, known as flush, which is flushed every 30 minutes by default or when the translog is too large.

POST /my_index/_flush, generally do not manually flush, let it perform automatically.

Translog is fsynced to disk every 5 seconds. After an add, delete, or modify operation, the operation succeeds only when fsync succeeds in both the Primary shard and replica Shard.

However, this forcible fsync translog may cause some operations to be time-consuming. You can also allow some data to be lost and set up asynchronous fsync translog.

The execution process of each merge operation

  • Select some segments of similar size and merge them into one large segment
  • Flush the new segment to disk
  • Write a new Commit point that includes the new segments and excludes the old ones
  • Open the new segment for search
  • Delete the old segment

ES Query data principle

  • The client sends a request to any node, known as a coordinate Node
  • Coordinate Node routes the Document and forwards the request to the corresponding node. In this case, the round-robin random polling algorithm is used to randomly select one from the Primary shard and all replicas to balance the read request load
  • The node that receives the request returns the document to the coordinate Node
  • Coordinate Node returns the document to the client