1. Election mechanism

1. Role division

Follower: completely passive; it does not send requests on its own and only receives and responds to messages from the Leader and Candidates. Every node starts as a Follower after startup.

Leader: handles all client requests and replicates logs to all Followers.

Candidate: runs for election as the new Leader (a Follower becomes a Candidate when its election timeout expires).

2. Leader election process

A. PreVote

In standard Raft, a single round of voting elects the cluster Leader, but this has a defect: after a network partition, the nodes on the minority side can never gather enough votes to become Leader, yet their term keeps increasing with every failed election. When the partition heals, that inflated term forces the legitimate majority-side Leader to step down; such a node is often called a disruptor. With the PreVote algorithm, a Candidate must first confirm that it can win votes from a majority of the cluster before it increments its term and starts the real vote. A voting node grants the pre-vote only if both of the following conditions hold:

  • It has not received a heartbeat from a valid Leader for at least one election timeout
  • The Candidate's log is at least as up-to-date as its own (a higher last log term, or the same term with an equal or greater log index)

PreVote solves the problem of a partitioned node disrupting the cluster when it rejoins. Under PreVote, a node cut off by a network partition cannot increment its term, because it cannot obtain permission from a majority of nodes. When it rejoins the cluster, it still cannot increment its term, because the other servers keep receiving periodic heartbeat messages from the Leader and therefore reject its pre-vote requests. Once the rejoined node itself receives a heartbeat from the Leader, it returns to the Follower state with the same term as the Leader.
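
Below is a minimal, illustrative sketch of the pre-vote check described above; the class and field names are assumptions for illustration, not JRaft's actual implementation.

// Illustrative pre-vote check; names are hypothetical, not JRaft's real code.
final class PreVoteSketch {
    long lastLogTerm;
    long lastLogIndex;
    long lastLeaderTimestampMs;   // last time a heartbeat from a valid Leader arrived
    long electionTimeoutMs;

    // A node grants a pre-vote only if both conditions from the list above hold.
    boolean grantPreVote(long candidateLastLogTerm, long candidateLastLogIndex) {
        boolean leaderTimedOut =
            System.currentTimeMillis() - lastLeaderTimestampMs >= electionTimeoutMs;
        boolean candidateLogUpToDate =
            candidateLastLogTerm > lastLogTerm
                || (candidateLastLogTerm == lastLogTerm && candidateLastLogIndex >= lastLogIndex);
        return leaderTimedOut && candidateLogUpToDate;
    }
}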

B. Election triggering and role transitions

Elections are timeout driven: the heartbeat interval and the communication timeout between the Leader and its Followers determine when an election is triggered, and a randomized election timeout reduces the probability of split votes caused by simultaneous candidacies (a sketch follows the list below). The role transitions are:

  • Election timeout expires: Follower -> Candidate

  • Wins the election: Candidate -> Leader

  • Another node wins the election: Candidate -> Follower

  • No node wins within the election timeout: Candidate -> Candidate (a new election is started)
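
The following sketch shows the timeout-driven transition referenced above, including the randomized election timeout; all names are illustrative and the timeout value is an assumption, not JRaft's default.

// Illustrative only: a randomized election timeout driving the Follower -> Candidate transition.
enum Role { FOLLOWER, CANDIDATE, LEADER }

final class ElectionTimerSketch {
    private final java.util.Random random = new java.util.Random();
    private final long electionTimeoutMs = 1000;   // assumed base timeout; configurable in practice
    Role role = Role.FOLLOWER;

    // A random timeout in [electionTimeoutMs, 2 * electionTimeoutMs) reduces split votes.
    long nextRandomTimeout() {
        return electionTimeoutMs + random.nextInt((int) electionTimeoutMs);
    }

    // Called when no heartbeat arrived within the randomized timeout.
    void onElectionTimeout() {
        if (role == Role.FOLLOWER || role == Role.CANDIDATE) {
            role = Role.CANDIDATE;   // start (or restart) an election with an incremented term
        }
    }
}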

C. Implementation sequence

2. Log replication

JRaft logs fall into two main categories: logs generated internally by the system itself, and logs generated when users actively submit instructions to the JRaft cluster.

1. The business data is submitted to the Leader

A. NodeImpl

The apply method is the interface through which a service submits operation instructions to the JRaft cluster. Each instruction circulates through the cluster as a Task: it is recorded as a log entry on the Leader node, replicated to all Follower nodes, and finally passed through to the state machine of every node that successfully replicated the log.

NodeImpl wraps the user instruction carried by the Task into a LogEntry object and publishes it as an event to a Disruptor queue for asynchronous processing; LogEntryAndClosureHandler implements the EventHandler interface and consumes the events from that queue.
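
A minimal usage sketch of submitting a command through this interface, based on the public SOFAJRaft Task/Node API; the Node instance and the serialized command bytes are assumed to exist.

import java.nio.ByteBuffer;
import com.alipay.sofa.jraft.Node;
import com.alipay.sofa.jraft.entity.Task;

public class ApplyTaskSketch {
    // Submit a user command to the Leader node; "node" and "commandBytes" are assumed inputs.
    public static void submit(Node node, byte[] commandBytes) {
        Task task = new Task();
        task.setData(ByteBuffer.wrap(commandBytes));   // business payload carried by the LogEntry
        task.setDone(status -> {                       // callback invoked once the entry is applied
            if (!status.isOk()) {
                System.err.println("apply failed: " + status);
            }
        });
        node.apply(task);                              // published to the Disruptor queue asynchronously
    }
}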

B. LogManagerImpl

LogManagerImpl writes the log data into memory and submits it to its own Disruptor queue as an event of type OTHER for asynchronous flushing to disk.
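
The produce/consume pattern described above follows the general Disruptor usage sketched below; the event type and handler here are generic placeholders, not JRaft's LogEntryAndClosureHandler or its OTHER event type.

import java.util.concurrent.Executors;
import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;

public class DisruptorSketch {
    static class LogEvent { byte[] data; }

    public static void main(String[] args) {
        Disruptor<LogEvent> disruptor =
            new Disruptor<>(LogEvent::new, 1024, Executors.defaultThreadFactory());
        // Consumer side: an EventHandler drains the queue asynchronously.
        disruptor.handleEventsWith((EventHandler<LogEvent>) (event, sequence, endOfBatch) -> {
            // e.g. append to the in-memory log and schedule a flush to disk
        });
        disruptor.start();

        // Producer side: publish an event into the ring buffer.
        RingBuffer<LogEvent> ringBuffer = disruptor.getRingBuffer();
        ringBuffer.publishEvent((event, seq) -> event.data = new byte[] { 1, 2, 3 });

        disruptor.shutdown();
    }
}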

C. BallotBox

BallotBox records the commit status of log entries; it is updated each time a node successfully stores a log entry, so the Leader knows when a quorum has acknowledged it.
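
Conceptually, a log index counts as committed once a quorum of nodes has stored it; the sketch below illustrates that bookkeeping and is not BallotBox's actual implementation.

import java.util.HashMap;
import java.util.Map;

final class QuorumBallotSketch {
    private final int peerCount;
    private final Map<Long, Integer> acks = new HashMap<>();
    private long lastCommittedIndex = 0;

    QuorumBallotSketch(int peerCount) { this.peerCount = peerCount; }

    // Called when one node (Leader or Follower) has durably stored the entry at logIndex.
    synchronized void onAppendAck(long logIndex) {
        int count = acks.merge(logIndex, 1, Integer::sum);
        if (count > peerCount / 2 && logIndex > lastCommittedIndex) {
            lastCommittedIndex = logIndex;   // quorum reached: the entry is committed
            acks.remove(logIndex);
        }
    }

    synchronized long lastCommittedIndex() { return lastCommittedIndex; }
}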

D. FSMCallerImpl: passes the committed entries through to the state machine

E. StateMachine: the business state machine service
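
A minimal business state machine sketch based on the public StateMachineAdapter API; the counter logic and the payload encoding (one long per entry) are assumptions for illustration.

import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicLong;
import com.alipay.sofa.jraft.Iterator;
import com.alipay.sofa.jraft.Status;
import com.alipay.sofa.jraft.core.StateMachineAdapter;

public class CounterStateMachineSketch extends StateMachineAdapter {
    private final AtomicLong value = new AtomicLong();

    @Override
    public void onApply(Iterator iter) {
        // FSMCaller hands every committed log entry to this method.
        while (iter.hasNext()) {
            ByteBuffer data = iter.getData();   // the payload written via Task#setData
            value.addAndGet(data.getLong());    // assumed encoding: a single long per entry
            if (iter.done() != null) {
                iter.done().run(Status.OK());   // ack the Leader-side closure, if any
            }
            iter.next();
        }
    }
}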

2. The Leader replicates data to Followers

3. Snapshot mechanism

The snapshot mechanism periodically generates snapshot files of the local data state and deletes old log files to reduce disk usage. When a new node joins the cluster, it only needs to download and install the latest snapshot file from the Leader instead of copying all of the Leader's historical log files.

1. Leader persists snapshots

NodeImpl: the Leader node

SnapshotExecutorImpl: snapshot executor

FSMCallerImpl: schedules the state machine

StateMachine: the business side defines how the snapshot file is generated and uses the SnapshotWriter to record the snapshot file name and metadata

The JRaft Node starts a snapshotTimer during initialization (in the Node's init method), which is used to generate snapshots periodically (the default interval is 1 hour).
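
A sketch of the snapshot-save hook based on the public StateMachineAdapter/SnapshotWriter API; the "data" file name and the currentStateBytes() serializer are assumptions.

import java.nio.file.Files;
import java.nio.file.Paths;
import com.alipay.sofa.jraft.Closure;
import com.alipay.sofa.jraft.Status;
import com.alipay.sofa.jraft.core.StateMachineAdapter;
import com.alipay.sofa.jraft.error.RaftError;
import com.alipay.sofa.jraft.storage.snapshot.SnapshotWriter;

public class SnapshotSaveSketch extends StateMachineAdapter {
    @Override
    public void onSnapshotSave(SnapshotWriter writer, Closure done) {
        try {
            byte[] state = currentStateBytes();                       // hypothetical serializer
            Files.write(Paths.get(writer.getPath(), "data"), state);  // write under the snapshot directory
            if (writer.addFile("data")) {                             // register the file in snapshot metadata
                done.run(Status.OK());
            } else {
                done.run(new Status(RaftError.EIO, "Fail to add snapshot file"));
            }
        } catch (Exception e) {
            done.run(new Status(RaftError.EIO, "Fail to save snapshot: %s", e.getMessage()));
        }
    }

    private byte[] currentStateBytes() { return new byte[0]; }   // placeholder for real serialization
}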

2. Follower downloads and installs snapshots

Replicator: The Leader node sends a request to install a snapshot file through the Replicator

NodeImpl: The Follower node receives installation requests

SnapshotExecutorImpl: Registers a new snapshot file download task, starts to download the snapshot file from the Leader node, and blocks waiting for the download process to complete

FSMCallerImpl: passes the snapshot data through to the service, which decides how to restore the state contained in the snapshot locally

StateMachine: obtains metadata information of a snapshot file, loads snapshot data, and updates data values
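
The corresponding load hook on the Follower side, again sketched against the public API; the "data" file name and restoreState() are assumptions matching the save sketch above.

import java.nio.file.Files;
import java.nio.file.Paths;
import com.alipay.sofa.jraft.core.StateMachineAdapter;
import com.alipay.sofa.jraft.storage.snapshot.SnapshotReader;

public class SnapshotLoadSketch extends StateMachineAdapter {
    @Override
    public boolean onSnapshotLoad(SnapshotReader reader) {
        try {
            byte[] state = Files.readAllBytes(Paths.get(reader.getPath(), "data"));
            restoreState(state);   // hypothetical deserializer that rebuilds local data structures
            return true;
        } catch (Exception e) {
            return false;          // returning false tells JRaft the snapshot load failed
        }
    }

    private void restoreState(byte[] state) { /* apply snapshot contents to local state */ }
}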

4. Storage structure

1. Log data storage

A. SegmentFile

Log data is written and persisted via MMAP (memory-mapped files).

File header: magic bytes [0x20 0x20] | first log index [8 bytes] | reserved [8 bytes]

Followed by the log records: [record, record, ...]

Each record: magic bytes [0x57, 0x8A] | data length [4 bytes] | data [bytes]
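
A sketch of reading one record from this layout out of a memory-mapped buffer; the magic values follow the layout above, while everything else (method shape, error handling) is an assumption.

import java.nio.MappedByteBuffer;

final class SegmentRecordSketch {
    private static final byte RECORD_MAGIC_0 = 0x57;
    private static final byte RECORD_MAGIC_1 = (byte) 0x8A;

    // Reads the record starting at the buffer's current position; returns null on a bad magic header.
    static byte[] readRecord(MappedByteBuffer buf) {
        if (buf.get() != RECORD_MAGIC_0 || buf.get() != RECORD_MAGIC_1) {
            return null;                   // not a valid record header
        }
        int dataLength = buf.getInt();     // 4-byte data length
        byte[] data = new byte[dataLength];
        buf.get(data);                     // the payload bytes
        return data;
    }
}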

B. CheckpointFile

A checkpoint file records the SegmentFile name and its commitPos at a given moment; it is persisted via ProtoBufFile.

commitPos (4 bytes) + path (4-byte length + string bytes)

C. AbortFile

An abort file is created during initialization; it is deleted on a normal exit, so the presence of an abort file at startup indicates that the previous run exited abnormally.
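
A conceptual sketch of that abort-file check using plain java.io; JRaft's own AbortFile class has its own API, so treat the names below as illustrative.

import java.io.File;
import java.io.IOException;

final class AbortFileSketch {
    private final File abortFile;

    AbortFileSketch(String path) { this.abortFile = new File(path); }

    // On startup: an existing abort file means the previous run exited abnormally.
    boolean crashedLastTime() { return abortFile.exists(); }

    void createOnStart() throws IOException { abortFile.createNewFile(); }

    void deleteOnCleanShutdown() { abortFile.delete(); }
}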

2. Snapshot file read and write (__raft_snapshot_meta)

A. Storage structure

message LocalFileMeta {
    optional bytes user_meta   = 1;
    optional FileSource source = 2;
    optional string checksum   = 3;
}

message SnapshotMeta {
    required int64 last_included_index = 1;
    required int64 last_included_term  = 2;
    repeated string peers              = 3;
    repeated string old_peers          = 4;
    repeated string learners           = 5;
    repeated string old_learners       = 6;
}

message LocalSnapshotPbMeta {
    message File {
        required string name        = 1;
        optional LocalFileMeta meta = 2;
    };
    optional SnapshotMeta meta = 1;
    repeated File files        = 2;
}

B. Class diagram

LocalSnapshotStorage: the local snapshot storage implementation

LocalSnapshotWriter: writes the snapshot metadata file

SnapshotFileReader: reads the snapshot files and their metadata

3. Meta-information file (raft_meta)

A. Storage structure

message StablePBMeta {
    required int64 term      = 1;
    required string votedfor = 2;
};

B. Core class

LocalRaftMetaStorage is the Raft metadata store. During initialization, the StorageFactory creates a LocalRaftMetaStorage by default, based on the Raft metadata storage path, the internal Raft configuration, and the Node's monitoring (metrics).

JRaft's ProtobufMsgFactory uses protobuf's dynamic parsing mechanism to process messages: instead of deserializing messages with classes compiled from .proto files, the consumer constructs a dynamic message class from the .proto file descriptor and then parses the message. ProtoBuf's dynamic parsing requires, in addition to the binary content, the corresponding Descriptor object; the parsed result is then obtained through the member methods of the DynamicMessage class.
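
A sketch of that dynamic parsing with protobuf's DynamicMessage: given a message Descriptor (assumed to have been resolved from the registered .proto file descriptor elsewhere), the binary payload is parsed without any generated message class.

import com.google.protobuf.Descriptors;
import com.google.protobuf.DynamicMessage;
import com.google.protobuf.InvalidProtocolBufferException;

final class DynamicParseSketch {
    // Parse a StablePBMeta payload dynamically and read its "term" field.
    static long readTerm(Descriptors.Descriptor stablePbMetaDescriptor, byte[] bytes)
            throws InvalidProtocolBufferException {
        DynamicMessage msg = DynamicMessage.parseFrom(stablePbMetaDescriptor, bytes);
        Descriptors.FieldDescriptor termField = stablePbMetaDescriptor.findFieldByName("term");
        return (Long) msg.getField(termField);   // field values come back as plain Java objects
    }
}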

5. Extending KV storage based on JRaft

RheaKV is a lightweight, distributed, embedded KV storage system that provides basic APIs: get, put, delete, cross-partition scan, batch put, and distributed lock.
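
A minimal usage sketch of those basic APIs via the public RheaKVStore interface; building the RheaKVStoreOptions (PD settings, store engine, region routes) is assumed to happen in the hypothetical buildOptions() helper, along the lines of the deployment roles below.

import com.alipay.sofa.jraft.rhea.client.DefaultRheaKVStore;
import com.alipay.sofa.jraft.rhea.client.RheaKVStore;
import com.alipay.sofa.jraft.rhea.options.RheaKVStoreOptions;
import com.alipay.sofa.jraft.util.BytesUtil;

public class RheaKvSketch {
    public static void main(String[] args) {
        RheaKVStoreOptions options = buildOptions();   // hypothetical helper, see the roles below
        RheaKVStore store = new DefaultRheaKVStore();
        if (!store.init(options)) {
            throw new IllegalStateException("Fail to init RheaKVStore");
        }
        store.bPut(BytesUtil.writeUtf8("key"), BytesUtil.writeUtf8("value"));   // blocking put
        byte[] value = store.bGet(BytesUtil.writeUtf8("key"));                  // blocking get
        System.out.println(BytesUtil.readUtf8(value));
        store.shutdown();
    }

    private static RheaKVStoreOptions buildOptions() {
        return new RheaKVStoreOptions();   // placeholder: a real setup configures PD/store options
    }
}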

1. Deployment roles

PD: the global central controller node, responsible for scheduling the entire cluster. One PD server can manage multiple clusters, which are isolated by clusterId. The PD server must be deployed separately; of course, many scenarios do not need much cluster self-management, and RheaKV also supports running with PD disabled. A PD server must be configured with a StoreEngineOptions instance (server side) and a PlacementDriverServerOptions instance.

Store Client: the KV storage client. It needs to be configured with RheaKVStoreOptions; a regionRouteTableOptionsList (client side) and a PlacementDriverOptions instance must be provided.

StoreServer: the KV storage server. It needs to be configured with RheaKVStoreOptions, StoreEngineOptions (mandatory on the server side), and a PlacementDriverOptions instance.

2. Storage design

Currently, MemoryDB and RocksDB are supported:

  • MemoryDB is based on the ConcurrentSkipListMap implementation and has better performance, but the storage capacity of a single machine is limited by memory

  • The storage capacity of RocksDB is limited only by disks and is suitable for scenarios with larger data volumes

3. Flow from Client to StoreServer

Read sequence

Write sequence

4. Flow from the PD Client (RemotePlacementDriverClient) to the PD Server