The election

ElasticSearch is a natively distributed storage system, and consensus is a fundamental challenge in any distributed system. Consensus means that all nodes or processes in the system must agree on the same data or state. It can be achieved with algorithms such as Raft (used by etcd) or Paxos-family protocols (ZooKeeper's ZAB is Paxos-inspired).

How does an ElasticSearch cluster reach consensus between nodes? The creators of ElasticSearch call this mechanism Zen Discovery. Zen Discovery has two modules:

  • Ping: the process a node uses to discover other nodes
  • Unicast: a module that takes a list of host names controlling which nodes to ping

The network of an ElasticSearch cluster is essentially a peer-to-peer system: any node can serve external requests and can communicate with any other node in the system. The cluster uses this communication to elect the master node. A stable master node is very important for an ElasticSearch cluster, because the master is responsible for lightweight cluster-wide operations such as creating or deleting indices, tracking which nodes are part of the cluster, and deciding which nodes to allocate shards to.
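Because every node knows who the current master is, you can ask any node which node has been elected. A minimal sketch using Python and the requests library against the standard _cat endpoints; the host and port are placeholders for any node in your cluster:

```python
import requests

ES = "http://localhost:9200"  # any node in the cluster; adjust host/port as needed

# _cat/master reports the node the cluster currently recognises as master.
master = requests.get(f"{ES}/_cat/master?format=json").json()
print("elected master:", master[0]["node"])

# _cat/nodes lists every node the master is tracking, along with its roles.
for node in requests.get(f"{ES}/_cat/nodes?format=json").json():
    print(node["name"], node.get("node.role"))
```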

The election behaviour of an ElasticSearch node is controlled by the following settings:

  • ping_interval: how often nodes ping each other. The default is 1 second. If the value is too large, failure detection is delayed; if it is too small, a healthy but momentarily busy node may be mistakenly kicked out of the cluster and treated as failed.
  • ping_timeout: the timeout of a single ping. The default is 3 seconds. If this value is set too high, faulty nodes are removed too slowly, which can leave inconsistent data in the cluster; if it is set too low, healthy nodes may be wrongly removed as well.
  • join_timeout: the timeout for a node that has pinged successfully to join the cluster. The default is 20 times ping_timeout. When the master node fails, the nodes in the cluster ping again and start a new election. The ping process also lets a node that has unexpectedly lost its master rediscover the current master through the other nodes.
  • minimum_master_nodes: the default is 1. It specifies the minimum number of master-eligible nodes this node must be able to see before an election can proceed, which prevents a partitioned minority from electing its own master. The recommended value is (number of master-eligible nodes / 2) + 1. In production, three dedicated master-eligible nodes are recommended; they store no data, accept no client requests, and only perform master duties. (A runtime-settings sketch follows this list.)
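Most of these are static settings configured in elasticsearch.yml (under the discovery.zen.* namespace in versions before 7.0), but minimum_master_nodes can also be changed at runtime through the cluster settings API. A minimal sketch with Python and requests, assuming a cluster with three master-eligible nodes reachable at a placeholder address:

```python
import requests

ES = "http://localhost:9200"  # any node; adjust as needed

# Persistently raise the election quorum to 2, i.e. (3 master-eligible nodes / 2) + 1,
# so a partitioned minority cannot elect its own master.
resp = requests.put(
    f"{ES}/_cluster/settings",
    json={"persistent": {"discovery.zen.minimum_master_nodes": 2}},
)
print(resp.json())  # expect {"acknowledged": true, ...}
```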

Optimistic locking

ElasticSearch handles requests concurrently. When multiple requests arrive at a node, they may reach the shards out of order, so ElasticSearch uses optimistic locking to prevent old data from overwriting new data. Every document in ElasticSearch carries a version number that is incremented on each operation. A write carrying a higher version is allowed to overwrite a lower one, but an attempt to overwrite a higher version fails, indicating that another request has already changed the document; the conflict can then be handled at the application level.
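A sketch of that behaviour using external versioning through the REST API with Python and requests. The index and document here are made up for the example, and the `_doc` path assumes a 6.x/7.x cluster; the second write carries a lower version and is rejected with a 409 conflict instead of silently overwriting the newer data:

```python
import requests

ES = "http://localhost:9200"   # adjust as needed
DOC = f"{ES}/products/_doc/1"  # hypothetical index/document

# Write version 2 of the document.
requests.put(DOC, params={"version": 2, "version_type": "external"},
             json={"price": 10})

# A stale writer tries to apply version 1: the higher stored version wins.
stale = requests.put(DOC, params={"version": 1, "version_type": "external"},
                     json={"price": 9})
print(stale.status_code)  # 409 version_conflict_engine_exception
```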

Read and write consistency

For read consistency, ElasticSearch provides a parameter that controls when written data becomes visible to queries. With the default (sync), a write request does not return until the operation has been applied to both the primary and the replica shards, so subsequent queries see the latest data on every copy; keeping primary and replica shards consistent in this way is not the most efficient option. If the setting is changed so that the request returns once only the primary shard has been written, consistency between primary and replica shards is sacrificed for throughput: the primary shard is guaranteed to hold the latest data, but a query served by a replica shard may return stale data.
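When stale replica reads are unacceptable, older ElasticSearch versions let a query be forced onto primary shards via the search preference parameter (the `_primary` value was removed in 7.0, so this applies only to the pre-7.0 versions discussed here). A sketch with Python and requests, using a hypothetical index:

```python
import requests

ES = "http://localhost:9200"       # adjust as needed
SEARCH = f"{ES}/products/_search"  # hypothetical index

# Default: the query may be served by a replica shard, which can lag the primary.
default_hits = requests.get(SEARCH, json={"query": {"match_all": {}}}).json()

# preference=_primary (pre-7.0) routes the query to primary shards only,
# trading load distribution for reading the latest data.
primary_hits = requests.get(
    SEARCH,
    params={"preference": "_primary"},
    json={"query": {"match_all": {}}},
).json()
```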

For write consistency, ElasticSearch by default requires a certain set of shard copies to be available before it accepts a write, and the write is only accepted by the cluster if all copies in that set are writable. In other words, a write operation cannot begin until most copies of the shard are available for writing. (If a replica shard fails to be written at this point, the replica is rebuilt on another node.)
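The number of shard copies that must be available can be adjusted per request. Older releases exposed this as the consistency parameter (one, quorum, all); from 5.0 onward it was replaced by wait_for_active_shards, whose default is 1 (the primary only). A hedged sketch with Python and requests, again with a made-up index and a placeholder endpoint:

```python
import requests

ES = "http://localhost:9200"   # adjust as needed
DOC = f"{ES}/products/_doc/2"  # hypothetical index/document

# Refuse to even start the write unless at least 2 copies of the shard
# (primary + 1 replica) are active; otherwise the request times out.
resp = requests.put(
    DOC,
    params={"wait_for_active_shards": 2},
    json={"price": 20},
)
print(resp.status_code, resp.json().get("_shards"))
```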