Introduction: On March 2, at the Alibaba Cloud open-source PolarDB enterprise edition architecture launch event, Alibaba Cloud database technology expert Meng Burong delivered a talk on PolarDB three-node high availability. The three-node high availability feature provides PolarDB with financial-grade strong consistency and highly reliable cross-data-center replication: it synchronizes the database's physical logs through a distributed consensus algorithm, fails over automatically, and loses no data after any single node failure. This talk introduces the features and key technologies of PolarDB three-node high availability.

Live video review: developer.aliyun.com/topic/Polar… PDF download: developer.aliyun.com/topic/downl…

The following is organized from the video of the launch-event talk:

The PolarDB for PostgreSQL three-node high availability feature combines physical replication with a consensus protocol to give PolarDB financial-grade strong consistency and highly reliable cross-data-center replication.

Native PG streaming replication supports asynchronous, synchronous, and quorum-based synchronization.
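To make the three modes concrete, below is a minimal Python sketch (illustrative only, not PolarDB or PostgreSQL source) of when the primary may acknowledge a commit under each mode. The function name and parameters are hypothetical; the `ANY 1 (s1, s2)` syntax in the comment is PG's real quorum form of `synchronous_standby_names`.

```python
# A minimal sketch (not PolarDB/PostgreSQL source) contrasting when a commit
# is acknowledged under the three PG stream-replication modes.

def commit_acknowledged(mode: str, standby_acks: list[bool], quorum: int = 1) -> bool:
    """Return True if the primary may acknowledge the commit to the client.

    standby_acks[i] is True if standby i has persisted the commit's WAL.
    """
    if mode == "async":
        # Asynchronous: never waits for standbys; commit returns immediately.
        return True
    if mode == "sync":
        # Synchronous: waits until every listed synchronous standby has
        # persisted the WAL (FIRST N semantics simplified to "all listed").
        return all(standby_acks)
    if mode == "quorum":
        # Quorum (ANY n): waits until at least `quorum` standbys have
        # persisted the WAL, e.g. synchronous_standby_names = 'ANY 1 (s1, s2)'.
        return sum(standby_acks) >= quorum
    raise ValueError(mode)

# One standby has acked, the other has not:
print(commit_acknowledged("async",  [False, False]))           # True
print(commit_acknowledged("sync",   [True, False]))            # False
print(commit_acknowledged("quorum", [True, False], quorum=1))  # True
```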

The primary goal of synchronous replication is to ensure that data is not lost, but it presents three problems:

① It cannot meet availability requirements. When the standby fails or the network link jitters, the availability of the primary is affected, which is unacceptable in production. Moreover, it cannot fully guarantee zero data loss. Synchronous replication prevents data loss by blocking transaction commit on the primary (RW node) until the WAL has been fully persisted on the standby. In extreme cases this breaks down: for example, if the WAL has been written on the primary and the primary restarts while still waiting for the standby to persist it, crash recovery replays the log without waiting for the standby. After replay, the transaction is visible on the primary even though the standby may never have persisted the log.

② It lacks automatic failover. Capabilities such as automatic switchover and availability detection depend on external components.

③ After recovering from a failure, the old primary may not be able to rejoin the cluster directly. For example, the WAL of a transaction may have been persisted on the primary while the standby never received or persisted it. If the primary then fails and the standby is promoted, the old primary cannot simply start replicating from the new primary, because it holds extra WAL from before the restart. Other tools must first reconcile the diverged logs before it can rejoin the cluster, as sketched below.
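The following hedged sketch illustrates problem ③: an old primary holding WAL beyond the new primary's divergence point cannot rejoin directly. The function and LSN values are hypothetical; in a real PostgreSQL deployment a tool such as pg_rewind performs the reconciliation.

```python
# A sketch of problem ③: deciding whether a recovered old primary can
# rejoin as a standby. Illustrative only; real deployments use a tool
# such as pg_rewind to reconcile the diverged WAL.

def can_rejoin_directly(old_primary_last_lsn: int, divergence_lsn: int) -> bool:
    """The old primary may hold WAL (committed locally, never replicated)
    beyond the point where the new primary's history diverged. If so, it
    cannot simply start streaming from the new primary: its extra WAL
    must first be truncated/rewound back to the divergence point."""
    return old_primary_last_lsn <= divergence_lsn

# Old primary persisted WAL to LSN 500, but the new primary took over at 400:
print(can_rejoin_directly(500, 400))  # False -> needs a rewind before rejoining
```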

Asynchronous replication offers higher performance and availability than synchronous replication, because a standby failure or network jitter does not affect the primary. Its biggest problem, however, is data loss: data that was visible on the primary may not exist on the standby after a switchover. It also lacks automatic failure detection and failover, and after a switchover the old primary cannot automatically rejoin the cluster.

With a majority scheme, quorum replication can in principle avoid data loss at the log level, but it does not address how to elect a new primary when the primary fails, how to reconcile logs when nodes' logs diverge, or how to keep the cluster state consistent during membership changes. Quorum replication provides no complete answer to these problems, so PG quorum replication is not, in essence, a complete high-availability solution with zero data loss.

Our solution introduces Alibaba's in-house consensus protocol, X-Paxos, to coordinate physical replication. X-Paxos has long run stably inside Alibaba and across multiple Alibaba Cloud products, so its stability is proven. Its consensus algorithm is similar to other mainstream consensus protocols.

The whole high-availability scheme is a single-writer, multi-reader cluster. The Leader node is the single write point: it serves reads and writes and generates WAL, which is then synchronized to the other nodes. Follower nodes receive WAL from the Leader and replay it, providing read-only service.
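Below is a minimal sketch of this single-writer, multi-reader topology, with assumed class and method names (not PolarDB source): only the Leader accepts writes and produces WAL, which Followers copy and replay to serve reads.

```python
# A minimal sketch of the single-writer, multi-reader topology.

class Node:
    def __init__(self, role: str):
        self.role = role          # "leader" or "follower"
        self.wal: list[str] = []  # simplified WAL stream
        self.applied = 0          # replay position

    def write(self, record: str) -> None:
        assert self.role == "leader", "writes are routed to the Leader only"
        self.wal.append(record)

    def replicate_to(self, follower: "Node") -> None:
        follower.wal = list(self.wal)   # stream WAL to the follower

    def replay(self) -> None:
        self.applied = len(self.wal)    # followers replay WAL to serve reads

leader, follower = Node("leader"), Node("follower")
leader.write("INSERT ...")
leader.replicate_to(follower)
follower.replay()
print(follower.applied)  # 1 -> the read-only node sees the replayed data
```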

Its main capabilities cover the following three aspects:

First, it guarantees data consistency in the cluster, that is, RPO = 0. A WAL log is considered committed at the cluster level only after it has been successfully written on a majority of nodes. After a failure, the other Follower nodes automatically align their logs with the Leader (a sketch of the majority-commit rule follows this list).

Second, automatic failover. As long as more than half of the nodes in the HA cluster survive, the cluster can serve normally, so the failure of a minority of nodes or a network fault does not affect the cluster's ability to serve.

When the Leader fails or is disconnected from the majority, the cluster automatically elects a new Leader, which then provides read and write service. Follower nodes automatically synchronize WAL from the new Leader and align their logs with it; if a Follower holds more log than the new Leader, its WAL is automatically aligned to the new Leader's.

Third, online cluster changes. The cluster supports adding and removing nodes online, manual switchover, and role changes, for example from Leader to Follower. All nodes can also be given election weights, and nodes with higher weight are preferred in Leader election. Cluster-change operations do not affect normal service, which is guaranteed by the consensus protocol: configuration stays consistent across the cluster, and exceptions during a configuration change cannot leave the cluster in an inconsistent state.
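Since X-Paxos internals are not covered in the talk, here is a hedged, Raft-style sketch of the majority-commit rule behind RPO = 0: a log index counts as committed once it has been persisted on a majority of nodes. The function and data layout are assumptions for illustration.

```python
# A Raft-style sketch of the majority-commit rule (not X-Paxos source).

def cluster_commit_index(persisted_index_per_node: list[int]) -> int:
    """A log index is committed once a majority of nodes has persisted it.
    Sorting the per-node persisted indexes in descending order, the value
    at the majority position is the highest index present on a majority."""
    n = len(persisted_index_per_node)
    majority = n // 2 + 1
    return sorted(persisted_index_per_node, reverse=True)[majority - 1]

# Leader persisted up to index 3, one follower up to 3, the other up to 1:
print(cluster_commit_index([3, 3, 1]))  # 3 -> index 3 is on a majority
print(cluster_commit_index([3, 2, 1]))  # 2 -> only index 2 is on a majority
```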

The three-node high availability feature adds a new role: the Learner node. It has no vote in majority decisions but can provide read-only service.

The Learner's log synchronization state neither depends on nor affects the Leader. It serves two main purposes:

① As an intermediate state when adding a node. For example, a newly added node may lag far behind the Leader; if it joined the majority directly, it would hold back majority commits. It therefore joins the cluster as a Learner to synchronize data first, and is promoted to Follower once it has largely caught up with the Leader (see the sketch after this list).

② As a remote disaster recovery (DR) node. It does not affect the availability of the primary, and after a Leader switch it automatically synchronizes logs from the new Leader without external intervention.
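A minimal sketch of the Learner's intermediate state, with an assumed lag threshold: the node receives logs without voting and is promoted to Follower only after it has caught up.

```python
# A sketch of Learner promotion; the threshold is an assumption.

CATCHUP_LAG = 10  # assumed threshold, in log entries

def maybe_promote(role: str, leader_index: int, node_index: int) -> str:
    lag = leader_index - node_index
    if role == "learner" and lag <= CATCHUP_LAG:
        return "follower"   # now counts toward the majority
    return role             # keep syncing; does not affect majority commit

print(maybe_promote("learner", leader_index=1000, node_index=500))  # learner
print(maybe_promote("learner", leader_index=1000, node_index=995))  # follower
```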

In terms of cluster deployment, the three-node cluster supports cross-data-center and cross-region deployment, including three replicas in a single data center, three data centers in the same city, three data centers across two regions, and three data centers across three regions. In addition, the Learner node can serve cross-region DR without affecting the availability of the Leader node.

In addition, it remains compatible with PG-native streaming replication and logical replication, ensuring that downstream consumers are unaffected and never receive uncommitted data.
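How this guarantee is enforced is not spelled out in the talk; one plausible, hypothetical reading is that the WAL sender caps what it ships downstream at the consensus-level Commit LSN, as this sketch shows.

```python
# A hypothetical sketch of why downstream consumers never see uncommitted
# data: the sender ships WAL only up to the consensus-level Commit LSN,
# even if more WAL is already persisted locally.

def sendable_upto(local_flushed_lsn: int, consensus_commit_lsn: int) -> int:
    # Downstream (stream or logical replication) is capped at the commit LSN.
    return min(local_flushed_lsn, consensus_commit_lsn)

print(sendable_upto(local_flushed_lsn=450, consensus_commit_lsn=300))  # 300
```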

As the previous introduction shows, PolarDB's high availability scheme stores at least three copies of the data, which raises storage costs. We offer two solutions to this problem:

First, improve resource utilization. Follower nodes serve as read-only nodes, increasing the cluster's read-scaling capacity. Cross-node parallel query is also supported, making full use of the resources of every node.

Second, introduce a log node to reduce resource usage. The log node stores no table data; it only persists real-time WAL, acting as one of the majority nodes for log persistence. The log node has full log replication capability and is compatible with native streaming replication and logical replication, so it can serve as the source for downstream log consumers and relieve log-shipping pressure on the Leader node. Network and other resources of the log node can be sized to match downstream log-consumption needs.
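A hedged sketch of this cost trade-off: in an assumed "two data replicas plus one log replica" layout, the log node still counts toward the WAL-persistence majority while only two copies of the table data exist.

```python
# A sketch of the log-node optimization: the log node persists WAL and
# votes, but stores no table data, so commits still reach a majority.

nodes = [
    {"name": "leader",   "stores_data": True,  "wal_persisted": True},
    {"name": "follower", "stores_data": True,  "wal_persisted": False},
    {"name": "lognode",  "stores_data": False, "wal_persisted": True},
]

acks = sum(n["wal_persisted"] for n in nodes)       # log node counts fully
committed = acks >= len(nodes) // 2 + 1
data_copies = sum(n["stores_data"] for n in nodes)  # but only 2 data copies
print(committed, data_copies)  # True 2
```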

The basic principles of consensus-protocol-coordinated replication are as follows:

① WAL is shipped and synchronized through native asynchronous streaming replication.

② The cluster's commit point is advanced by the consensus protocol.

③ To provide automatic failover, database-level state changes are driven by state changes at the consensus-protocol level; for example, a heartbeat timeout automatically demotes the Leader (see the sketch below).
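A hedged sketch of principle ③, with hypothetical state names: the database-level role follows the consensus-level state, so a Leader whose heartbeat to the majority times out demotes itself automatically.

```python
# A sketch of driving database state from consensus state; the state
# names are illustrative, not PolarDB's.

def db_state_for(consensus_role: str, heartbeat_ok: bool) -> str:
    if consensus_role == "leader" and heartbeat_ok:
        return "read-write"       # serve writes while leadership is valid
    if consensus_role == "leader" and not heartbeat_ok:
        return "demoting"         # heartbeat to majority timed out: step down
    return "read-only (replay)"   # followers/learners replay WAL

print(db_state_for("leader", heartbeat_ok=False))  # demoting
```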

Concretely, the Consensus Log serves as the carrier for advancing the commit point. A Consensus Log Entry is generated for each WAL log, recording that WAL log's end LSN. A persistence dependency is then introduced: a Log Entry may be persisted on a node only after the WAL up to its recorded position has been persisted there.

With these two mechanisms in place, if a Consensus Log entry is considered committed at the consensus level, it has been persisted on a majority of nodes, and the WAL up to the corresponding position must also have been persisted on those nodes.

In the figure above, three WAL logs have been persisted on the Leader. On Follower 1, the WAL for the latest log entry has been persisted, but the corresponding Consensus Log entry has not, so at the consensus level that entry does not count as persisted there. On Follower 2, neither the latest Log Entry nor its Consensus Log has been persisted, and only one WAL segment has been persisted. Therefore, by the consensus protocol, the log at LogIndex 2 has been successfully written on a majority of nodes: the Consensus Log CommitIndex is 2 and the Commit LSN is 300.
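The same computation can be replayed in a short sketch. The end LSNs for entries 1 and 3 are assumptions; entry 2's end LSN of 300 and the resulting CommitIndex of 2 match the figure.

```python
# A sketch of the Consensus-Log mechanism above, on illustrative numbers.

# Each consensus entry records the end LSN of one WAL log.
entries = {1: 150, 2: 300, 3: 450}   # log index -> end LSN (150/450 assumed)

# Per node: highest consensus entry persisted, and WAL flush position.
# The persistence dependency means an entry only counts on a node if the
# node's WAL is flushed past that entry's end LSN.
nodes = [
    {"entry": 3, "wal": 450},   # Leader: all three entries + WAL persisted
    {"entry": 2, "wal": 450},   # Follower 1: WAL flushed, entry 3 not yet
    {"entry": 1, "wal": 150},   # Follower 2: only the first segment
]

def effective_index(node):
    # Highest entry whose WAL dependency is satisfied on this node.
    return max((i for i, lsn in entries.items()
                if i <= node["entry"] and node["wal"] >= lsn), default=0)

majority = len(nodes) // 2 + 1
commit_index = sorted((effective_index(n) for n in nodes),
                      reverse=True)[majority - 1]
print(commit_index, entries[commit_index])  # 2 300 -> CommitIndex 2, LSN 300
```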

The figure above shows the RTO during a tpmC test. When tpmC reached about 300,000, the primary was killed. Within 30 seconds, the new primary recovered write capability and returned to the pre-switchover level.

This article is the original content of Aliyun and shall not be reproduced without permission.