DolphinDB provides high availability solutions for data, metadata, and clients. This ensures that the database still works properly when a database node fails.

Unlike other time-series databases, DolphinDB's high availability also guarantees strong consistency.

1. Overview of DolphinDB high availability

DolphinDB uses a multi-replica mechanism: copies of the same data chunk are stored on different data nodes. Even if one or more data nodes in a cluster go down, the database can continue to serve requests as long as at least one replica of each chunk is still available.

Metadata is stored on controller nodes. DolphinDB uses the Raft protocol to ensure high availability of metadata: the controller nodes of a cluster form a Raft group, and the metadata service remains available as long as fewer than half of the controller nodes are down.

The DolphinDB APIs provide automatic reconnection and failover. If a connected data node fails, the API first attempts to reconnect; if reconnection fails, it automatically switches to another available data node to continue executing tasks. The switchover is transparent, so users are not aware that the connected node has changed.

To use high availability, first deploy a DolphinDB cluster. High availability is supported only in clusters, not in single-node instances. For details about cluster deployment, see the Multi-Server Cluster Deployment Tutorial.

2. High availability of data

DolphinDB supports storing multiple replicas of the same data on different servers to keep data safe and highly available. Even if the data on one machine is corrupted, the data service continues by accessing the replicas on other machines.

The number of replicas is set with the dfsReplicationFactor parameter in controller.cfg. For example, to set the number of replicas to 2:

dfsReplicationFactor=2

By default, DolphinDB allows replicas of the same data chunk to reside on the same machine. To ensure high availability, replicas of the same chunk must be distributed on different machines. Add the following configuration item to controller.cfg:

dfsReplicaReliabilityLevel=1

The following is an intuitive example of data high availability in DolphinDB. First, execute the following script on a data node of the cluster to create a database:

n = 1000000
date = rand(2018.08.01..2018.08.03, n)
sym = rand(`AAPL`MS`C`YHOO, n)
qty = rand(1..100, n)
price = rand(100.0, n)
t = table(date, sym, qty, price)
if(existsDatabase("dfs://db1")){
    dropDatabase("dfs://db1")
}
db = database("dfs://db1", VALUE, 2018.08.01..2018.08.03)
trades = db.createPartitionedTable(t, `trades, `date).append!(t)

The distributed table trades is divided into three partitions, one for each date. DolphinDB's web cluster management interface provides a DFS Explorer for viewing how data is distributed. The distribution of each partition of the trades table is shown in the figure below:

Take partition 20180801 as an example. The Sites column shows that data with date=2018.08.01 is distributed on 18104Datanode and 18103Datanode. Even if 18104Datanode is down, as long as 18103Datanode is normal, users can still read and write data with date=2018.08.01.
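
For example, the following query touches only the 2018.08.01 partition. It is a minimal check (assuming the database dfs://db1 created by the script above) that still succeeds as long as at least one replica of that partition is online:

trades = loadTable("dfs://db1", "trades")
// count the rows of the 2018.08.01 partition; any available replica can serve this query
select count(*) from trades where date = 2018.08.01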

3. High availability of metadata

Metadata is generated when data is stored, for example which data nodes each data chunk resides on. If the metadata becomes unavailable, the system cannot access the data even when the data chunks themselves are intact.

Metadata is stored on the controller nodes. Multiple controller nodes can be deployed in a cluster so that the metadata service stays uninterrupted through metadata redundancy. All controller nodes in a cluster form a Raft group with exactly one Leader; the other controller nodes are Followers. Metadata on the Leader and the Followers is kept strongly consistent, and data nodes interact only with the Leader. If the current Leader becomes unavailable, the system immediately elects a new Leader to provide the metadata service. A Raft group can tolerate the failure of fewer than half of its controller nodes: a cluster with three controller nodes can tolerate one controller failure, and a cluster with five controller nodes can tolerate two. To enable metadata high availability, the cluster must have at least three controller nodes, and the number of data replicas (dfsReplicationFactor) must be greater than 1.

The following example shows how to enable metadata high availability for an existing cluster. Assume the existing cluster has a single controller node on machine P1, and two additional controller nodes are to be deployed on machines P2 and P3. Their intranet addresses are as follows:

P1: 10.1.1.1
P2: 10.1.1.3
P3: 10.1.1.5

(1) Modify the configuration file of the existing controller node

Add the following parameters to controller.cfg on P1: dfsReplicationFactor=2, dfsReplicaReliabilityLevel=1, dfsHAMode=Raft. The modified controller.cfg file is as follows:

localSite=10.1.1.1:8900:controller1
dfsReplicationFactor=2
dfsReplicaReliabilityLevel=1
dfsHAMode=Raft

(2) Deploy two new controller nodes

Download the DolphinDB server package on P2 and P3 and unzip it, for example, to the /DolphinDB directory.

Create the config directory in the /DolphinDB/server directory. Create the controller.cfg file in the config directory and fill in the following contents:

P2

localSite=10.1.1.3:8900:controller2
dfsReplicationFactor=2
dfsReplicaReliabilityLevel=1
dfsHAMode=Raft

P3

localSite=10.1.1.5:8900:controller3
dfsReplicationFactor=2
dfsReplicaReliabilityLevel=1
dfsHAMode=Raft

(3) Modify the configuration file of the existing agent node

Add the sites parameter to the existing agent.cfg file. It specifies the intranet information of the agent node and all controller nodes, and the agent node's entry must precede those of the controller nodes. For example, agent.cfg on P1 is modified as follows:

localSite=10.1.1.1:8901:agent1
controllerSite=10.1.1.1:8900:controller1
sites=10.1.1.1:8901:agent1:agent,10.1.1.1:8900:controller1:controller,10.1.1.3:8900:controller2:controller,10.1.1.5:8900:controller3:controller

If there are multiple agent nodes, the configuration file of each agent node must be modified in the same way.

(4) Modify the cluster member configuration file of the existing controller node

Add the intranet information of the new controller nodes to cluster.nodes on P1. For example, the modified cluster.nodes on P1 is as follows:

localSite,mode
10.1.1.1:8900:controller1,controller
10.1.1.3:8900:controller2,controller
10.1.1.5:8900:controller3,controller
10.1.1.1:8901:agent1,agent
10.1.1.1:8911:datanode1,datanode
10.1.1.1:8912:datanode2,datanode

(5) Add cluster member configuration files and node configuration files for the new controller node

Both cluster.nodes and cluster.cfg are required to start a controller node. Copy cluster.nodes and cluster.cfg from P1 to the config directories of P2 and P3.

(6) Start the HA cluster

  • Start the controller nodes

Execute the following command on each machine where a controller node is located:

nohup ./dolphindb -console 0 -mode controller -home data -config config/controller.cfg -clusterConfig config/cluster.cfg -logFile log/controller.log -nodesFile config/cluster.nodes &

  • Start the agent nodes

Execute the following command on each machine where an agent node is deployed:

nohup ./dolphindb -console 0 -mode agent -home data -config config/agent.cfg -logFile log/agent.log &

Starting, stopping, and modifying data nodes can only be performed on the cluster management page of the Leader.

  • How to determine which controller node is the Leader

Enter the IP address and port number of any controller node (for example, 10.1.1.1:8900) in the address bar of the browser to open the cluster management web page. Click controller1 in the Node column to open the DolphinDB Notebook of that node.

Execute the getActiveMaster() function, which returns the alias of the Leader.
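
For example, executed in the Notebook of any controller node:

// returns the alias of the controller node that is currently the Raft Leader
getActiveMaster()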

In the address box of the browser, enter the IP address and port number of the Leader to open the cluster management page of the Leader.

4. High availability of API clients

When an API interacts with a DolphinDB server data node, if the connected data node fails, the API attempts to reconnect; if reconnection fails, it automatically switches to another available data node. This switchover is transparent to the user. Currently only the Java and Python APIs support high availability.

The API’s connect method is as follows:

connect(host,port,username,password,startup,highAvailability)

When connecting to a data node with the connect method, simply set the highAvailability parameter to true.

The following example enables high availability in the Java API:

import com.xxdb.DBConnection;
DBConnection conn = new DBConnection();
boolean success = conn.connect("10.1.1.1", 8911, "admin", "123456", "", true);

If the data node 10.1.1.1:8911 goes down, the API will automatically connect to other available data nodes.

5. Dynamically add data nodes

You can run the addNode function to add a data node online, without restarting the cluster.

The following example describes how to add datanode3 (port 8911) to server P4 (Intranet IP address 10.1.1.7).

To add a data node on a new physical server, an agent node must first be deployed on that server to start and manage the data node. The agent node on P4 uses port 8901 and the alias agent2.

Download the DolphinDB package on P4 and unzip it to a specified directory, such as /DolphinDB.

Go to the /DolphinDB/server directory and create the config directory.

Create the agent.cfg file in the config directory and fill in the following information:

localSite=10.1.1.7:8901:agent2
controllerSite=10.1.1.1:8900:controller1
sites=10.1.1.7:8901:agent2:agent,10.1.1.1:8900:controller1:controller,10.1.1.3:8900:controller2:controller,10.1.1.5:8900:controller3:controller

Create the cluster.nodes file in the config directory and fill in the following information:

localSite,mode
10.1.1.1:8900:controller1,controller
10.1.1.3:8900:controller2,controller
10.1.1.5:8900:controller3,controller
10.1.1.1:8901:agent1,agent
10.1.1.7:8901:agent2,agent
10.1.1.1:8911:datanode1,datanode
10.1.1.1:8912:datanode2,datanode

Change cluster.nodes on P1, P2, and P3 to be the same as cluster.nodes on P4.

Run the following Linux command to start the proxy node on P4:

nohup ./dolphindb -console 0 -mode agent -home data -config config/agent.cfg -logFile log/agent.log &

Execute the following command on any of the data nodes:

Invoke the (" 10.1.1.7 ", 8911, "datanode3")Copy the code

After the script is executed, refresh the web cluster management page. The newly added data node now appears but is in the stopped state; it must be started manually.
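
Alternatively, the new node can be started from a script. The following is a minimal sketch, assuming your DolphinDB version provides the startDataNode function and that it is executed on the Leader controller:

// hedged sketch: start the newly added data node by its alias (run on the Leader controller)
startDataNode("datanode3")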

6. Summary

By providing uninterrupted services for data, metadata, and API connections, DolphinDB meets the need for around-the-clock operation in areas such as the Internet of Things and finance.
