Distributed consensus is a fundamental problem in distributed systems: how can a distributed system reach strong agreement on a value (a resolution), and thereby achieve high availability? Paxos is the most important distributed consensus algorithm, and many people regard it as a synonym for “distributed consensus protocol”.

Although the theory of Paxos was put forward many years ago, and products using Paxos and its variants have emerged in an endless stream, industrial-grade independent third-party libraries are still rare, and open source ones even rarer. Well-known open source products related to Paxos include ZooKeeper and etcd. However, their protocols do not support high-throughput state machine replication, and neither provides an independent third-party library that other systems can integrate quickly.

To this end, we designed and implemented the DCF feature to support the strongly consistent distributed scenarios involved in openGauss.

1. What is DCF?

DCF stands for Distributed Consensus Framework. Typical algorithms for solving distributed consensus problems are Paxos, Raft, and so on; DCF implements the Paxos algorithm. DCF provides log replication and high cluster availability. It offers multiple node role types based on Paxos, and these roles can be adjusted. Log replication supports dynamic traffic adjustment, and the framework also supports forced startup with a minority of nodes and automatic leader election.

DCF is a high-performance, mature, reliable, extensible, and easy-to-use independent base library. By docking with DCF through simple interfaces, other systems can easily gain the strong consistency, high availability, automatic disaster recovery, and other capabilities that the Paxos algorithm provides.
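As a sketch of what such docking can look like, the C fragment below starts a consensus node and proposes one value. The function names and signatures used here (dcf_set_param, dcf_start, dcf_write, dcf_stop) are illustrative assumptions for a consensus library of this kind, not DCF's documented API; consult the actual DCF headers for the real interface.

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical interface of an independent consensus library
     * (illustrative only; see the real DCF headers for the actual API). */
    int dcf_set_param(const char *name, const char *value);
    int dcf_start(unsigned int node_id, const char *cfg_json);
    int dcf_write(unsigned int stream_id, const char *buf, unsigned int len,
                  unsigned long long key, unsigned long long *index);
    int dcf_stop(void);

    int main(void)
    {
        /* The same three-node configuration used later in this article. */
        const char *cfg =
            "[{\"stream_id\":1,\"node_id\":1,\"ip\":\"192.168.0.11\",\"port\":17783,\"role\":\"LEADER\"},"
            "{\"stream_id\":1,\"node_id\":2,\"ip\":\"192.168.0.12\",\"port\":17783,\"role\":\"FOLLOWER\"},"
            "{\"stream_id\":1,\"node_id\":3,\"ip\":\"192.168.0.13\",\"port\":17783,\"role\":\"FOLLOWER\"}]";

        dcf_set_param("DATA_PATH", "/opt/dcf/data");  /* where replicated logs live */
        if (dcf_start(1, cfg) != 0) {                 /* join the cluster as node 1 */
            fprintf(stderr, "failed to start consensus node\n");
            return 1;
        }

        /* Propose a value; index is filled in once a majority accepts it. */
        unsigned long long index = 0;
        const char *payload = "hello, consensus";
        dcf_write(1, payload, (unsigned int)strlen(payload), 0, &index);

        dcf_stop();
        return 0;
    }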

The DCF functional architecture, shown in the figure above, mainly includes the algorithm module, the storage module, the communication module, and the service layer.

Algorithm module:

The algorithm module is implemented based on the Multi-Paxos protocol. Combining its own business scenarios with the requirements of high performance and ecosystem building, DCF has made many functional extensions and performance optimizations, making it functionally richer than basic Multi-Paxos and significantly faster in various deployment scenarios. It mainly includes the leader election module, the log replication module, the metadata module, and the cluster management module.

Storage module:

Considering specific business scenarios and extreme performance requirements, DCF extracts log storage into a set of common interfaces and provides a high-performance storage module as the default implementation. Users with particular scenarios or extreme demands on performance and cost can dock their existing storage systems to DCF's log storage interface to meet their specific requirements. This is one of the advantages of DCF as an independent third-party library.
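To illustrate, a pluggable log storage interface could look roughly like the following C sketch. The structure and callback names here are invented for this article, not DCF's actual interface.

    /* A minimal, hypothetical pluggable log storage interface. */
    typedef struct st_log_storage {
        /* Append a log entry at the given index; returns 0 on success. */
        int (*write)(unsigned long long index, const void *buf, unsigned int len);
        /* Read the entry at the given index into buf; returns the entry length. */
        int (*read)(unsigned long long index, void *buf, unsigned int max_len);
        /* Discard all entries up to and including the given index. */
        int (*truncate)(unsigned long long index);
        /* Flush buffered entries to durable storage. */
        int (*flush)(void);
    } log_storage_t;

    /* A user docks an existing storage system by registering an implementation. */
    int register_log_storage(const log_storage_t *storage);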

Communication module:

The communication module is implemented mainly on top of MEC (Message Exchange Component). It provides communication between DCF instances as well as an asynchronous event processing framework. Its main features include extensible communication protocols; unicast, broadcast, and loopback sending interfaces; an asynchronous message processing framework; a multi-channel mechanism with multi-priority queues; and support for compression and batch sending.

Service layer:

The service layer is the basis that drives the whole DCF and provides the basic services required at runtime, such as locks, asynchronous task scheduling, thread pool services, and timers.

2. What can DCF do?

2.1 Nodes can be added or deleted online, and the leader role can be transferred online

Based on standard Multi-Paxos, DCF supports adding and deleting nodes online and transferring the leader role to another node online, making it suitable for a wide range of business scenarios and for building a development ecosystem.

2.2 Supporting priority-based election and policy-based majority

Policy-based majority: In classic Paxos theory, data can be committed once a majority reaches consensus. However, that majority is not a fixed set of nodes, so there is no guarantee that any particular node holds the complete data. In practice, geographically close nodes tend to have consistent data, while a node at a remote location may never be in a consistent state, and when a city-level disaster occurs it cannot be activated as the new primary. The policy-based majority capability lets users dynamically designate one or more nodes that must hold strongly consistent data, so that when a disaster recovery (DR) demand arises, such a node can be activated immediately.

Priority-based election: Users can specify a priority for each node, and DCF selects the primary node strictly according to the specified priorities. A node with a lower priority is elected only when all nodes with higher priorities are unavailable.

2.3 Supporting node role diversity

In addition to the classic Leader, Follower, and Candidate roles, DCF also provides customizable roles, such as the Passive role (holds logs and data, cannot be elected, and does not participate in majority voting) and the Logger role (holds logs only, cannot be elected, but does participate in majority voting). With these node roles, DCF can support multiple cluster deployment modes, such as deployments that mix synchronous and asynchronous nodes.
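For example, a five-node deployment mixing these roles could be described in the same dcf_config format as the installation example later in this article (the node IDs, IP addresses, and port are illustrative):

    [{"stream_id":1,"node_id":1,"ip":"192.168.0.11","port":17783,"role":"LEADER"},
     {"stream_id":1,"node_id":2,"ip":"192.168.0.12","port":17783,"role":"FOLLOWER"},
     {"stream_id":1,"node_id":3,"ip":"192.168.0.13","port":17783,"role":"FOLLOWER"},
     {"stream_id":1,"node_id":4,"ip":"192.168.0.14","port":17783,"role":"LOGGER"},
     {"stream_id":1,"node_id":5,"ip":"192.168.0.15","port":17783,"role":"PASSIVE"}]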

2.4 Batch & Pipeline

Batch: DCF supports multi-level batch operations, including: 1) combining multiple logs into a single message for sending; 2) merging multiple logs into a single disk write; 3) merging multiple logs for replication. Batching effectively reduces the per-message overhead and improves throughput.

Pipeline: the mechanism of sending the next message to a node concurrently, before the result of the previous message has returned. By increasing the number of concurrent in-flight messages (the pipeline depth), it effectively reduces the latency of individual requests and improves performance. DCF uses a purely asynchronous mode for log persistence, network transmission, log replication, and the other stages, which maximizes pipeline performance.
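The following conceptual C sketch shows how the two mechanisms interact; it is a simplification for illustration, not DCF code. Entries are drained from a pending queue into one message until the batch is full, and up to a fixed number of batched messages are kept in flight at once.

    #include <stddef.h>

    #define MAX_BATCH_BYTES (64 * 1024)  /* merge entries until the batch is full */
    #define PIPELINE_DEPTH  8            /* batched messages allowed in flight    */

    typedef struct { const void *data; size_t len; } log_entry_t;

    /* Provided elsewhere: look at / consume the next pending log entry. */
    extern const log_entry_t *peek_pending_entry(void);
    extern void pop_pending_entry(void);
    /* Provided elsewhere: asynchronously send one batched message. */
    extern void send_async(const log_entry_t **batch, size_t count);

    static size_t in_flight;  /* decremented by the (omitted) ack callback */

    void replicate_once(void)
    {
        const log_entry_t *batch[256];
        size_t count = 0, bytes = 0;
        const log_entry_t *e;

        /* Pipeline: stop only when PIPELINE_DEPTH messages are already
         * in flight; otherwise send without waiting for earlier acks. */
        if (in_flight >= PIPELINE_DEPTH)
            return;

        /* Batch: merge multiple logs into a single message. */
        while (count < 256 && (e = peek_pending_entry()) != NULL) {
            if (bytes + e->len > MAX_BATCH_BYTES)
                break;
            batch[count++] = e;
            bytes += e->len;
            pop_pending_entry();
        }

        if (count > 0) {
            send_async(batch, count);
            in_flight++;
        }
    }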

2.5 Efficient flow control algorithm

Batching and pipelining can improve the overall throughput and performance of the system, but an overly large batch easily increases the latency of individual requests, and an overly high number of concurrent requests in turn hurts both throughput and latency. DCF therefore designs and implements an efficient adaptive flow control algorithm: it automatically probes parameters such as network bandwidth, network transmission latency, and request concurrency, adjusts the batch and pipeline parameters in a timely manner, and controls the rate at which service traffic is injected.

The core flow of the flow control algorithm is as follows (a sketch of the control loop appears after the list):

  1. The DCF primary node periodically samples and computes consensus information, mainly including the end-to-end consensus latency, the end-to-end consensus log bandwidth, and the overall log replay bandwidth of the system.
  2. Compute the control value: the primary node calculates the performance trend from the sampling results and historical results, adjusts the control direction and step size according to the historical control values and that trend, and computes a new control value in the direction of better performance.
  3. Update the control value when the control period arrives.
  4. The control value is continuously applied to service traffic, controlling the rate at which service traffic is injected.
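A much-simplified C sketch of this loop is shown below. The sampling and application functions are placeholders, and the real DCF algorithm uses more signals and tuning than this simple hill-climbing step.

    /* Provided elsewhere: sampled consensus performance of the last period
     * (e.g. a score combining consensus bandwidth and latency; higher is better). */
    extern double sample_consensus_perf(void);
    /* Provided elsewhere: apply the control value to service traffic injection. */
    extern void apply_traffic_control(double control);

    /* Called once per control period. */
    void flow_control_tick(void)
    {
        static double control = 1.0;    /* current control value           */
        static double step = 0.1;       /* control direction and step size */
        static double last_perf;        /* performance of previous period  */

        /* Step 1: periodically sample consensus information. */
        double perf = sample_consensus_perf();

        /* Step 2: if performance got worse, reverse the control direction
         * and shrink the step; then move toward better performance. */
        if (perf < last_perf)
            step = -step / 2.0;
        control += step;
        last_perf = perf;

        /* Steps 3 and 4: the new control value takes effect this period
         * and is continuously applied to the injected service traffic. */
        apply_traffic_control(control);
    }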

DCF will continue to evolve in scenarios such as data communication, multiple log streams, and parallel large-capacity replication, providing users with efficient, reliable, and easy-to-manage multi-copy log replication and backup capabilities to meet their requirements for database disaster recovery and high availability.

3. How to use DCF

Assume a cluster of three nodes whose IP addresses are 192.168.0.11, 192.168.0.12, and 192.168.0.13, whose node IDs are 1, 2, and 3, and whose roles are LEADER, FOLLOWER, and FOLLOWER, respectively.

Enable the DCF function during OM installation and deployment by setting enable_dcf to on (it is off by default). The following is an example.

The XML configuration template is script/gspylib/etc/conf/centralized/cluster_config_template_HA.xml.

The values shown are examples and can be replaced with your own. Each configuration item is annotated with a comment.

<?xml version="1.0" encoding="utf-8"?>
<ROOT>
    <!-- Overall information -->
    <CLUSTER>
        <!-- Cluster name -->
        <PARAM name="clusterName" value="Sample1" />
        <!-- Node names -->
        <PARAM name="nodeNames" value="node1,node2,node3" />
        <!-- Node IP addresses, in one-to-one correspondence with nodeNames -->
        <PARAM name="backIp1s" value="192.168.0.11,192.168.0.12,192.168.0.13" />
        <!-- Database installation directory -->
        <PARAM name="gaussdbAppPath" value="/opt/huawei/newsql/app" />
        <!-- Log directory -->
        <PARAM name="gaussdbLogPath" value="/opt/huawei/logs/gaussdb" />
        <!-- Temporary directory -->
        <PARAM name="tmpMppdbPath" value="/opt/huawei/logs/temp" />
        <!-- Tool directory -->
        <PARAM name="gaussdbToolPath" value="/opt/huawei/tools" />
        <!-- Cluster type -->
        <PARAM name="clusterType" value="single-inst" />
        <!-- Whether to enable DCF mode, enable: on, disable: off -->
        <PARAM name="enable_dcf" value="on/off" />
        <!-- DCF config information -->
        <PARAM name="dcf_config" value="[{&quot;stream_id&quot;:1,&quot;node_id&quot;:1,&quot;ip&quot;:&quot;192.168.0.11&quot;,&quot;port&quot;:17783,&quot;role&quot;:&quot;LEADER&quot;},{&quot;stream_id&quot;:1,&quot;node_id&quot;:2,&quot;ip&quot;:&quot;192.168.0.12&quot;,&quot;port&quot;:17783,&quot;role&quot;:&quot;FOLLOWER&quot;},{&quot;stream_id&quot;:1,&quot;node_id&quot;:3,&quot;ip&quot;:&quot;192.168.0.13&quot;,&quot;port&quot;:17783,&quot;role&quot;:&quot;FOLLOWER&quot;}]"/>
    </CLUSTER>
</ROOT>



3.1 Querying the cluster status after installation

Run the gs_ctl command to query the cluster status:

# gs_ctl query -D <data_dir>
# gs_ctl query -D /nvme0/gaussdb/cluster/nvme0/dn1

In the output, dcf_replication_info shows the DCF information of the current node. Its fields are explained below, followed by an illustrative example of the output.

Role: the role of the current node, one of LEADER, FOLLOWER, LOGGER, PASSIVE, PRE_CANDIDATE, CANDIDATE, and UNKNOWN. The figure above shows that the current node is the LEADER.

Term: the election term.

Run_mode: the DCF run mode. 0 indicates automatic election mode, and 2 indicates that automatic election is disabled.

Work_mode: the DCF working mode.

Hb_interval: the heartbeat interval between DCF nodes, in ms.

Elc_timeout: the DCF election timeout, in ms.

Applied_index: the log position that has been applied to the state machine.

Commit_index: the log position that has been saved by a majority of DCF nodes; logs up to commit_index have been persisted.

First_index: the first log position saved by the DCF node. This position moves forward when the DN calls dcf_truncate, and earlier logs are cleared.

Last_index: the last log position of the DCF node. This position includes logs stored in memory but not yet persisted, so last_index >= commit_index.

Cluster_min_apply_idx: the smallest applied log position in the cluster.

Leader_id: the ID of the leader node.

Leader_ip: the IP address of the leader node.

Leader_port: the port of the leader node, used internally by DCF.

Nodes: information about the other nodes in the cluster.
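For the three-node sample cluster, the dcf_replication_info section of the output may look roughly like the following. All values here are made up for illustration, and the exact format depends on the openGauss version.

    dcf_replication_info : {"role":"LEADER","term":3,"run_mode":0,"work_mode":0,
    "hb_interval":1000,"elc_timeout":3000,"applied_index":178,"commit_index":178,
    "first_index":1,"last_index":180,"cluster_min_apply_idx":175,
    "leader_id":1,"leader_ip":"192.168.0.11","leader_port":17783,
    "nodes":[{"node_id":2,"ip":"192.168.0.12","port":17783,"role":"FOLLOWER"},
             {"node_id":3,"ip":"192.168.0.13","port":17783,"role":"FOLLOWER"}]}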



3.2 Online Cluster Size Adjustment

To add a copy online, run the following command:

# gs_ctl member --operation=add --nodeid=<node_id> --ip=<ip> --port=<port> -D <data_dir>

To remove a copy online, run the following command:

# gs_ctl member --operation=remove --nodeid=<node_id> -D <data_dir>

If the cluster status is normal, the task of deleting a single copy completes within 5 minutes.
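For example, on the sample cluster, adding a hypothetical fourth follower at 192.168.0.14 and later removing it might look like this (the node ID, IP address, port, and data directory are illustrative):

    # gs_ctl member --operation=add --nodeid=4 --ip=192.168.0.14 --port=17783 -D /nvme0/gaussdb/cluster/nvme0/dn1
    # gs_ctl member --operation=remove --nodeid=4 -D /nvme0/gaussdb/cluster/nvme0/dn1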

3.3 Forced startup with a minority of nodes

In a majority-failure scenario, the Paxos protocol cannot reach agreement, so the system cannot continue to provide services. To provide emergency service capability, the service needs to be activated urgently with only a minority of nodes.

Use the following command

# cm_ctl setrunmode -n <node_id> -D <data_dir> --xmode=minority --votenum=<num>

For example, in a three-copy cluster where two copies have failed, only one copy is required to reach agreement (votenum=1).
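Assuming node 1 is the only surviving copy of the sample cluster, the command might look like this (the data directory is illustrative; once the failed copies are restored, switch the cluster back to normal majority mode as described in the cm_ctl documentation):

    # cm_ctl setrunmode -n 1 -D /nvme0/gaussdb/cluster/nvme0/dn1 --xmode=minority --votenum=1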

3.4 Active switchover

In active/standby deployment mode, the active and standby database instances can be switched across AZs. Before switching, ensure that the database instances are running properly, perform the switchover only after all services are complete, and query the pgxc_get_senders_catchup_time() view to confirm that no automatic switchover is in progress.

For example, run the following command to promote a standby node to active:

# cm_ctl switchover -n <node_id> -D <data_dir>
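On the sample cluster, promoting follower node 2 might look like this (the data directory is illustrative):

    # cm_ctl switchover -n 2 -D /nvme0/gaussdb/cluster/nvme0/dn1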

3.5 Rebuilding the standby server

Full build is supported in active/standby mode. When the primary DN receives a full build request, it blocks the recycling of DCF logs on the primary DN; the standby DN then copies the Xlogs and data files from the primary DN, and the DCF log replication point is set after the standby DN restarts the kernel.

The following is an example command:

# gs_ctl build -b full -Z datanode -D <new_node_data_dir>
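For example, rebuilding the standby in the sample cluster might look like this (the data directory is illustrative):

    # gs_ctl build -b full -Z datanode -D /nvme0/gaussdb/cluster/nvme0/dn1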

The open source DCF feature is another exploration by openGauss in the distributed field, and another substantial contribution to open source technology.

openGauss remains committed to promoting deep innovation in database technology, increasing investment in basic database research and theoretical innovation, fully opening up its top technical capabilities, and working with developers at home and abroad to advance innovation and development across the database industry, academia, and research.