Abstract: Failures in large-scale distributed systems are unavoidable. How are cluster status and services recovered when a single point of failure occurs?
This article is shared from the Huawei Cloud community post “Warehouse Cluster Management: RTO Mechanism Analysis of Single Node Failure” by CloudGanker.
1. Foreword
GaussDB(DWS) uses a distributed architecture. Cluster management (high availability) requires a balance between stability and agility.
When a single-node fault (such as a node crash, network disconnection, or power-off) occurs in a cluster, the end-to-end Recovery Time Objective (RTO) covers two parts: restoring the cluster status (CM Server active/standby switchover, DN/GTM active/standby switchover) and restoring services (the CN can run services normally).
This article focuses on cluster status recovery; service recovery will be analyzed separately.
Reference links:
GaussDB(DWS) Cluster Management Series: CM Component (Architecture and Deployment Mode)
GaussDB(DWS) Cluster Management Series: CM Component (Core Functions)
2. Assumptions and key configuration parameters
CN removal is generally a slow process (it takes 10 minutes by default before a faulty CN is automatically removed). Therefore, this article covers neither CN removal and instance repair nor the DDL interruption caused by a CN fault.
The assumptions are as follows:
1. Except in the case of an explicit fault (such as a node being down), connections can be established within the timeout period (that is, connection establishment time is counted as the timeout value)
2. Message delivery takes no time
3. DN/GTM failover completes within $T_{\rm failover}$ (usually less than 5 seconds)
Key parameters are as follows:
[CM configuration parameter] Instance heartbeat timeout instance_heartbeat_timeout (default: 30 seconds), denoted $T_{\rm hb}$.
Note: Because integer multiplication and division in C/C++ are not associative (integer division truncates), the operations in this article are integer operations.
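For illustration, here is a minimal C sketch, using the default instance_heartbeat_timeout of 30 seconds, of how the evaluation order changes the result under integer arithmetic:

```c
#include <stdio.h>

int main(void) {
    int t_hb = 30;                 /* instance_heartbeat_timeout, default 30 s */

    /* Integer division truncates, so multiplication and division
     * are not associative: the result depends on evaluation order. */
    int a = (3 * (t_hb / 2)) / 2;  /* 3 * 15 = 45, 45 / 2 = 22 */
    int b = 3 * ((t_hb / 2) / 2);  /* 15 / 2 = 7,  3 * 7  = 21 */

    printf("(3 * (t_hb / 2)) / 2 = %d\n", a);  /* 22 */
    printf("3 * ((t_hb / 2) / 2) = %d\n", b);  /* 21 */
    return 0;
}
```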
3. Example cluster topology
Ignoring CN deployment, take the following three-node cluster as an example:
- Two CM Server instances are deployed on nodes 1 and 2 in active/standby mode
- Two GTM instances are deployed on nodes 1 and 2 in active/standby mode
- One group of DN instances is deployed on nodes 1, 2, and 3
- The CM Agent component is deployed on every node
4. Overall process analysis
If node 1 becomes faulty, the cluster is unavailable for a short period of time, after which it automatically recovers to the degraded state and can run services on the CN. The RTO process can therefore be divided into four stages:
1. A single-node fault occurs, the cluster is unavailable, and the affected CM Server/GTM/DN instances are inactive
2. The standby CM Server becomes active, and the GTM/DN instances wait for arbitration
3. The standby GTM/DN instances become active (in parallel), and the cluster recovers to the degraded state
4. The CN connects to the GTM and DNs, and services run normally
Taking the moment the fault occurs as time 0, each stage is analyzed in turn and the corresponding time is calculated below.
5. The standby CM Server becomes active
After a single-node failure occurs, the cluster management component does not react to the fault immediately, for the sake of stability. The two CM Server instances determine each other's liveness through heartbeats. If the heartbeat between them times out, the following self-arbitration process is performed (a peer connection is the connection to the other CM Server).
The sleep time between two self-arbitration polls is fixed at 1 second. The heartbeat timeout threshold is $T_{\rm CMS\_hb} = \frac{T_{\rm hb}}{2}$, and the connection establishment timeout (both between the CM Servers and between CM Agents and CM Servers) is $T_{\rm CMS\_conn} = \frac{T_{\rm CMS\_hb}}{2} - 1$. The reason for this threshold is that, ignoring code execution time, once a heartbeat timeout is confirmed there is still time for at least two connection attempts, both between the two CM Servers and from each CM Agent to both CM Servers.
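As a rough worked example (assuming the default $T_{\rm hb} = 30$ seconds; the variable names below are illustrative, not taken from the CM source):

```c
#include <stdio.h>

int main(void) {
    int t_hb       = 30;               /* instance_heartbeat_timeout (default) */
    int t_cms_hb   = t_hb / 2;         /* CM Server heartbeat timeout: 15 s    */
    int t_cms_conn = t_cms_hb / 2 - 1; /* connection timeout: 15 / 2 - 1 = 6 s */

    printf("T_cms_hb = %d s, T_cms_conn = %d s\n", t_cms_hb, t_cms_conn);

    /* Illustrative check of the rationale above: two back-to-back connection
     * attempts (2 * 6 = 12 s) still fit within one 15-second heartbeat window. */
    printf("2 * T_cms_conn = %d <= %d\n", 2 * t_cms_conn, t_cms_hb);
    return 0;
}
```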
The maximum time required to promote the standby CM Server is as follows:
1. Determine the heartbeat timeout. Even assuming the communication layer senses the peer fault immediately, it may still wait for one more connection timeout plus one sleep. This takes $T_{\rm CMS\_hb} + T_{\rm CMS\_conn} + 1 = \frac{3T_{\rm CMS\_hb}}{2}$.
2. In the pre-promotion stage of self-arbitration, the CM Agent has (by assumption) already initiated the pre-connection, so the CM Server can pre-promote and reset the heartbeat directly. This takes 0 seconds.
3. Determine the heartbeat timeout a second time. This again takes $T_{\rm CMS\_hb} + T_{\rm CMS\_conn} + 1 = \frac{3T_{\rm CMS\_hb}}{2}$.
4. In the formal promotion stage of self-arbitration, the CM Agent has already initiated the formal connection, so the formal promotion takes 0 seconds.
In conclusion, the maximum total time is $\frac{3T_{\rm CMS\_hb}}{2} \times 2 = \frac{3T_{\rm hb}}{4} \times 2$. (Note that these are integer operations.)
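Under the same assumptions (default $T_{\rm hb} = 30$ seconds, illustrative names), a small C sketch that evaluates the two heartbeat-timeout phases:

```c
#include <stdio.h>

int main(void) {
    int t_hb       = 30;               /* instance_heartbeat_timeout (default) */
    int t_cms_hb   = t_hb / 2;         /* 15 s                                 */
    int t_cms_conn = t_cms_hb / 2 - 1; /* 6 s                                  */

    /* One phase = heartbeat timeout + one extra connection timeout + 1 s sleep,
     * which simplifies to 3 * t_cms_hb / 2 under integer arithmetic.          */
    int phase = t_cms_hb + t_cms_conn + 1;  /* 15 + 6 + 1 = 22 s               */
    int total = 2 * phase;                  /* two such phases: 44 s           */

    printf("phase = %d s (3 * T_cms_hb / 2 = %d s), total = %d s\n",
           phase, 3 * t_cms_hb / 2, total);
    return 0;
}
```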
6. The standby DN/GTM becomes active
Cluster management arbitration is passively triggered. Each CM Agent detects the status of the instances on its node and reports it to the primary CM Server periodically (at a fixed 1-second interval). The primary CM Server combines the states of all instances, arbitrates, and sends the necessary arbitration results to the relevant CM Agents. Each CM Agent then receives its arbitration result and runs the corresponding command.
Taking a primary DN fault as an example, a typical arbitration process includes:
① CM Agent 1 probes the primary DN instance and detects the fault
② CM Agent 1 continuously reports the instance fault information to the CM Server
③ The CM Server runs the arbitration process and selects the standby DN for promotion to primary
④ The CM Server sends the promotion command to CM Agent 2
⑤ CM Agent 2 performs the promotion on the instance
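The following toy C simulation only walks through steps ① to ⑤ in order; all names are made up for illustration and it is not the actual CM implementation:

```c
#include <stdio.h>

/* Hypothetical walk-through of steps (1)-(5) above; NOT the real CM code. */
typedef enum { INST_PRIMARY, INST_STANDBY, INST_DOWN } inst_state_t;

int main(void) {
    inst_state_t dn1 = INST_DOWN;     /* (1) CM Agent 1 detects the primary DN fault    */
    inst_state_t dn2 = INST_STANDBY;  /*     CM Agent 2 keeps reporting a healthy standby */

    printf("agent1 -> cm_server: DN1 state %d\n", dn1);    /* (2) fault reported       */
    printf("agent2 -> cm_server: DN2 state %d\n", dn2);

    /* (3) the CM Server arbitrates: the primary is down and the standby is healthy */
    if (dn1 == INST_DOWN && dn2 == INST_STANDBY) {
        printf("cm_server -> agent2: failover command\n"); /* (4) send promotion command */
        dn2 = INST_PRIMARY;                                /* (5) agent2 promotes DN2    */
    }

    printf("DN2 is now state %d (primary)\n", dn2);
    return 0;
}
```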
For single-node faults, the arbitration of DN and GTM instances can be carried out at the same time, and the steps are as follows:
1. After CM Server 2 becomes active, it can no longer receive the messages reported by CM Agent 1. When the timeout $T_{\rm hb}$ is reached, DN1 and GTM1 are forcibly marked Down. This takes $T_{\rm hb}$.
2. CM Server 2 continuously receives messages from CM Agent 2 showing that DN2 and GTM2 are standbys, so it arbitrates and issues the failover command to promote them. This takes 1 second.
3. CM Agent 2 receives the failover command. This takes 0.2 seconds (counted as 1 second).
4. CM Agent 2 performs the failover. This takes $T_{\rm failover}$.
5. CM Agent 2 detects that DN2 and GTM2 have become active. This takes 1 second.
6. CM Agent 2 reports that DN2 and GTM2 are primary (one polling interval). This takes 0.2 seconds (counted as 1 second).
7. CM Server 2 receives the message that DN2 and GTM2 are active and updates the cluster status to degraded. This takes 0 seconds.
In summary, the total time is $T_{\rm hb} + T_{\rm failover} + 4$.
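As a worked example with the defaults ($T_{\rm hb} = 30$ seconds, and $T_{\rm failover}$ taken as 5 seconds per the assumption above):

```c
#include <stdio.h>

int main(void) {
    int t_hb       = 30;  /* instance_heartbeat_timeout (default)         */
    int t_failover = 5;   /* assumed DN/GTM failover time (usually < 5 s) */

    /* Steps 2, 3, 5 and 6 above are each bounded by about 1 second, hence the +4. */
    int total = t_hb + t_failover + 4;

    printf("standby DN/GTM promotion takes at most about %d s\n", total);  /* 39 */
    return 0;
}
```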
7. Summary
Adding the CM Server self-arbitration time and the DN/GTM arbitration time gives the cluster status recovery time (in seconds):
$T_{\rm cluster} = \frac{3T_{\rm hb}}{4} \times 2 + T_{\rm hb} + T_{\rm failover} + 4 \approx \frac{5T_{\rm hb}}{2} + T_{\rm failover} + 4$ (approximating in real numbers)
With the default values, $T_{\rm cluster} = 84$ seconds.
Because the analysis uses maximum times throughout, the theoretical value is larger than the actual time. Typically, the CM Server does not need to wait for another connection timeout to confirm a heartbeat timeout, and the CM Agent detects instance status almost instantly. Therefore, when a fault occurs, the CM Server self-arbitration time (with default settings) is about $T_{\rm hb} = 30$ seconds, the time for the standby DN/GTM to become active is about $T_{\rm hb} + T_{\rm failover} = 35$ seconds, and the cluster recovery time is about 65 seconds.
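As a quick cross-check, a small C sketch (assuming the default $T_{\rm hb} = 30$ seconds and $T_{\rm failover} = 5$ seconds) that evaluates both the worst-case approximation and the typical estimate:

```c
#include <stdio.h>

int main(void) {
    int t_hb       = 30;  /* instance_heartbeat_timeout (default) */
    int t_failover = 5;   /* assumed failover time                */

    /* Worst case, using the real-number approximation from the summary formula */
    int worst   = 5 * t_hb / 2 + t_failover + 4;  /* 75 + 5 + 4 = 84 s */

    /* Typical case: ~T_hb for CM Server self-arbitration plus
     * ~T_hb + T_failover for the standby DN/GTM to become active. */
    int typical = t_hb + (t_hb + t_failover);     /* 30 + 35 = 65 s    */

    printf("worst case:   ~%d s\n", worst);
    printf("typical case: ~%d s\n", typical);
    return 0;
}
```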
You can adjust the instance_heartbeat_timeout parameter to achieve an appropriate RTO.