• GreatSQL community original content is not allowed to use without authorization, please contact xiaobian and indicate the source.

1. Traditional master-slave replication technology architecture

The traditional master-slave replication method is to perform data update transactions on the master node, record these transactions to the binlog, and then send the binlog to the slave node to dump the binlog to the slave node as a relay log. On the slave node, separate threads read these relay logs and then re-execute or apply these transactions. It is shared-nothing, and each node has a complete copy of the data, and its technical flow chart is as follows:

  • Architecture diagram of traditional master-slave replication technology

MySQL also provides semi-synchronous replication, which adds a synchronization step to the traditional master/slave replication. Before committing a transaction, the master node must wait for the slave node to confirm that the transaction has been received. The technical flow chart is as follows:

  • Architecture diagram of semi-synchronous replication technology

2. Technical architecture of MGR group replication

MGR is also shared-nothing, each node has a complete copy of data, and nodes interact with each other through Group Communication System (GCS). The GCS layer provides assurance of global messages and their orderliness between nodes.

The MGR can execute read/write transactions (excluding read-only transactions) on any node at any time, but the read/write transactions cannot be committed until they are confirmed by the entire replication group. If it is a read-only transaction, there is no restriction and any node can initiate and commit.

When a read/write transaction is ready to commit, it issues an atomic broadcast to the replication group containing the data modified by the transaction and its corresponding writesET. All nodes in the replication group either receive the transaction or do not. If all nodes in the group receive the transaction message, they receive the broadcast message in the same order as they sent the transaction before. Therefore, all group members receive the write set of transactions in the same order and establish the global order for the transactions.

In this case, it is necessary to compare and judge the writesET of two parallel transactions to confirm the conflict. This process is called transaction authentication, also known as conflict detection. Transaction conflict detection is row-level, that is, when two parallel transactions update the same row, a conflict occurs. In this case, the transaction that is first in the global order can succeed and all nodes commit the transaction. A later transaction in the global order will fail to roll back, and each node will delete the transaction. This is essentially a distributed first-commit first-win transaction rule. Suggestion: If there are frequent transaction conflicts between nodes, it is better to have these transactions executed on the same node, so that they may all commit successfully under the coordination of local transaction concurrency control without a transaction being rolled back all the time due to collision detection by MGR.

For transactions that are being applied or externalized, MGR allows them not to be executed in the original order, as long as the consistency and validity of the transactions are not compromised. The default requirement for MGR is final consistency, which means that data on all nodes is consistent after all transactions have been applied. When traffic is high, transactions can be externalized and slightly out of order. For example, in multi-master mode, a local transaction is externalized immediately after being authenticated, even though there may be a remote transaction with this earlier global order that has not been applied, as long as the authentication thread in MGR considers that this transaction does not cause a conflict. In single-master mode, a local concurrent transaction on the Primary node may commit and externalize in a slightly different order than the global transaction order of that transaction, without causing a collision. On the Secondary node, since no transactions are written, their transaction order is the same as the global transaction order.

The following figure describes the group replication protocol for MGR, and you can see some differences from traditional master-slave (and semi-synchronous) replication. For the sake of simplicity, the consensus algorithm and PAXOS-related information are missing from the figure:

  • MGR Technical architecture diagram

3. Single-master and multi-master modes of MGR

MGR supports single master or multiple master modes.

During startup, you can set group_REPLICATION_singLE_primary_mode to determine which mode to use. The value must be the same on all nodes. If this parameter is set to ON, the system uses the single main mode. If this parameter is set to OFF, the system uses the multiple main mode.

You cannot change the value of group_REPLICATION_singLE_primary_mode online. However, starting from MySQL 8.0.13, This can be done by calling group_replication_switch_to_single_primary_mode() and group_REPLICATION_switch_to_multi_primary_mode (). Both UDFs change the running mode online or through the MySQL Shell.

In single-master mode, only one (Primary) node can write data, and all (Secondary) nodes can only read data. In multi-master mode, data can be read and written simultaneously on any node.

The MGR can support a maximum of nine nodes, in either single-master or multi-master mode.

4. Node management

MGR consists of a set of nodes, each of which has a unique name in the format of a UUID. Nodes can join or leave the MGR dynamically (or possibly passively expel it).

The group membership service on MGR is used to maintain the information that defines the active nodes, also known as group views. The group view for each node is consistent, which indicates which members are active in the group at a given time.

In addition to being consistent when a transaction commits, the MGR nodes also agree when the group view changes. A new group view change is triggered when a new node joins, or when an existing node leaves.

When a node voluntarily leaves the cluster, it triggers automatic cluster reconfiguration, and the remaining nodes agree on a new group view. However, if a node leaves the cluster unexpectedly due to a network exception or breakdown, automatic reconfiguration cannot be triggered. In this case, the cluster fault detection mechanism detects the status of the node after the node leaves the cluster for a period of time and sends a reconfiguration proposal. The reconfiguration group view requires the consent of the majority members. If the consensus cannot be reached, automatic reconfiguration cannot be implemented and manual intervention is required. One possible reason for not reaching a consensus is that the number of remaining nodes does not reach more than half of the summary points, meaning that a majority cannot be formed.

Allow a node to go offline for a short time before it is confirmed to be faulty, or before reconfiguring the group to remove the failed node, and then try to rejoin the cluster. In this case, the node may lose its previous state (transaction data), which can cause data inconsistencies and other problems if other nodes send it messages containing pre-crash messages.

To solve this problem, starting from MySQL 5.7.22, MGR checks if a node with the same address + port joins the cluster again with a new identity to see if its old identity still exists. The new identity cannot be added until the old identity can be deleted from the cluster. Note: Group_replication_member_expel_timeout sets a waiting period so that a node has more time to try to join the cluster again before being expelled. That is, a node in the suspected state can try to join the cluster again before it times out and become an active node again. When a node exceeds the group_REPLICATION_MEMBER_expel_TIMEOUT threshold and is expelled from the cluster, the node exits the cluster by running the STOP GROUP_REPLICATION command, or the node breaks down, the node must be added to the cluster again with a new identity.

5. Fault detection

MGR has its own fault detection mechanism, which can find and report which node is in the silent state. When certain conditions are reached, the node will be considered dead. It is a distributed fault detection service that provides information about which node is in a (suspected) dead state.

Suspicion may be triggered when a node goes silent (does not actively send messages or reply to messages from other nodes). When node A has not received A message from node B within A given period of time, A message timeout occurs and suspicion is raised. After that, the other members of the cluster decide that the node is faulty if they all agree (the majority agree) that their suspicions about the node are positive.

If a node is disconnected from other nodes due to a network failure, it may suspect that other nodes are faulty as well. But since it does not form a majority decision, the suspicion is invalid, and the node cannot perform any read-write transactions at this point, only read-only transactions at most.

When the network is unstable, frequent disconnections and reconnections between random two nodes can theoretically result in all nodes being marked for expulsion and the cluster exiting and needing to be rebuilt. To avoid this, starting with MySQL 8.0.20, GCS keeps track of nodes marked for expulsion and decides if a suspect node is still in the majority, making it possible to have at least one node in the cluster without exiting. When the expelled node is officially removed from the cluster, GCS deletes the record marked as expelled so that it can be added back later.

6. Fault tolerance mechanism

MGR is implemented based on distributed Paxos algorithm, so a majority of nodes are required to survive to guarantee voting. This determines the number of nodes that can be allowed to fail without affecting the overall availability of the system. Assuming that the total number of nodes is N and the number of nodes that can tolerate failure is F, their relationship is as follows: n = 2*f + 1. In short, the number of failed nodes should not be more than half of the total.

The following table shows the corresponding relationship of different node numbers:

The total number of nodes Number of majority nodes Maximum number of nodes that can fail
1 1 0
2 2 0
3 2 1
4 3 1
5 3 2
6 4 2
7 4 3
8 5 3
9 5 4

References, documents

  • MySQL 8.0 Reference Manual
  • Database kernel development – Wenzheng Lake
  • Group Replication principle – Song Libing

disclaimer

Due to my limited skills, errors and omissions are unavoidable in this column. Do not directly copy the commands and methods in the document and apply them to the online production environment. Please be sure to fully understand and verify in the test environment before formal implementation, to avoid damage or damage to the production environment.

Enjoy GreatSQL 🙂

Article recommendation:

GreatSQL Quarterly (2021.12.26) mp.weixin.qq.com/s/FZ_zSBHfl…

Technology sharing | sysbench mp.weixin.qq.com/s/m16LwXWy9 up.the pressure measuring tool usage…

Failure analysis | Linux disk IO utilization rate is high, the analysis of the correct posture mp.weixin.qq.com/s/7cu_36jfs…

Technology sharing | flashbacks in MySQL implementation and improvement of mp.weixin.qq.com/s/6jepwEE0D…

How a # 20, an index pushdown data filtering mp.weixin.qq.com/s/pt6mr3Ge1…

About GreatSQL

GreatSQL is a MySQL branch maintained by Wanli Database, which focuses on improving the reliability and performance of MGR. It supports InnoDB parallel query feature and is a MySQL branch version suitable for financial applications.

Gitee: gitee.com/GreatSQL/Gr…

Making: github.com/GreatSQL/Gr…

Bilibili: space.bilibili.com/1363850082/…

Wechat &QQ Group: you can search to add GreatSQL Community Assistant wechat friends, send verification information “add group” to join GreatSQL/MGR communication wechat group

QQ group: 533341697 wechat assistant: WanliDBC

This article is published by OpenWrite!