In a distributed system we usually care about high availability. For stateless services, high availability is relatively simple to achieve: they are easy to scale both vertically and horizontally. For data-intensive applications such as databases, however, it is much harder. When we talk about database high availability we mainly mean two things. First, when the system crashes or is interrupted unexpectedly, the database should stay available as far as possible so the business is not affected. Second, the read-only replica nodes must stay consistent with the primary node in real time, and after a switchover the data must remain consistent, so that no data is lost or left inconsistent in a way that affects the business. Many distributed databases solve this problem and offer flexible options to match business requirements: synchronous and semi-synchronous replication, a configurable number of data copies, primary/secondary switchover, failover, and so on (all discussed below). However, the official community edition of MySQL 5.7 and earlier (leaving aside branches such as PhxSQL, Percona XtraDB Cluster, and MariaDB Galera Cluster) does not handle distribution and system availability well. The rest of this article analyzes how to solve these problems.

While working on this, I found and reported a bug in which an MHA online master switchover caused data inconsistency between master and slave:

Github.com/yoshinorim/…


What is a failover

Before getting to MHA, a quick primer on failover. When an active service or application terminates unexpectedly, a redundant or standby server, system, hardware component, or network is quickly brought up to take over. Failover is essentially the same as switchover, except that failover is usually automatic and happens without warning, whereas switchover must be performed manually. System designers typically build failover capability into servers, systems, or networks that require high availability and stability. Simply put, when one service becomes unavailable, other modules in the system automatically continue to provide the service. Many well-designed open source systems have failover built in. Load balancers such as Nginx and HAProxy support backend health checks and backups: when a backend endpoint is detected as unhealthy, traffic is switched to a healthy backup seamlessly and automatically. Data-intensive distributed applications also include failover, for example MongoDB replica sets and etcd/ZooKeeper node elections: if some data nodes fail, a new master/primary/leader is elected from the remaining nodes. Even message queues such as RabbitMQ mirrored queues and Kafka replicas have failover.

Links: ZooKeeper leader election, etcd Raft, etcd Raft visualization, MongoDB replica sets, RabbitMQ mirrored queues, Kafka ISR and sync, Elasticsearch replicas

Data replication

Failover was mentioned above, and for a system to support failover it must have a backup or replica available to take over as the new "master" (here I use "master" for leader/master/primary). As mentioned earlier, much open source software builds data synchronization into its design: all inserts, deletes, and updates are executed on the master, and the data is then synchronized to the slaves. In short, the data on master and slaves must be kept consistent. One approach is statement-based replication: every update(), delete(), insert(), and so on is recorded in a log, and these statements are shipped to each slave; each slave parses and executes the SQL as if it had received it from a client. Another approach is to ship the write-ahead log (WAL). The log is written to disk first (as in SSTable/LSM-tree storage engines), so it is an append-only byte sequence containing every write to the database; the exact same log can be used to build a replica on another node, so while writing the log to disk the primary also sends it over the network to the secondary nodes, as etcd does for state machine synchronization. A third way is to have a process inside the cluster send the data to be synchronized directly, as the RabbitMQ mirrored queue does.
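To make the log-shipping idea concrete, here is a tiny, purely illustrative Go sketch (not taken from any of the systems above): a leader appends each write to an ordered log and applies it locally, and a follower applies the same entries in the same order, so replaying the log reproduces the leader's state on the replica.

```go
package main

import "fmt"

// LogEntry is one replicated write; applying entries in order rebuilds the state.
type LogEntry struct {
	Index int
	Key   string
	Value string
}

// Node holds an append-only log plus the key/value state built from it.
type Node struct {
	log   []LogEntry
	state map[string]string
}

func NewNode() *Node { return &Node{state: map[string]string{}} }

// Append records a write on the leader and applies it locally.
func (n *Node) Append(key, value string) LogEntry {
	e := LogEntry{Index: len(n.log) + 1, Key: key, Value: value}
	n.log = append(n.log, e)
	n.state[e.Key] = e.Value
	return e
}

// Replicate applies an entry shipped from the leader onto a follower.
func (n *Node) Replicate(e LogEntry) {
	n.log = append(n.log, e)
	n.state[e.Key] = e.Value
}

func main() {
	leader, follower := NewNode(), NewNode()
	// Every write goes through the leader; the same entry is then shipped to the follower.
	for _, kv := range [][2]string{{"user:1", "alice"}, {"user:2", "bob"}} {
		entry := leader.Append(kv[0], kv[1])
		follower.Replicate(entry)
	}
	fmt.Println(follower.state["user:1"]) // "alice": the follower's state matches the leader's
}
```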

A typical data replication scenario: one user writes to the master while another user reads. A read served by the master returns the latest data, but a read served by a slave may not yet be the latest. Replication systems offer several replication modes for this scenario.

Sample data graph

An important design detail of a replication system is whether replication is synchronous or asynchronous.

About replication mode:

Synchronous replication:

In synchronous replication, when a data update request reaches the master node, the request returns to the client only after the update has also been successfully executed on the slave node. The slave is therefore guaranteed to hold a copy of the data as up-to-date as the master's: if the master suddenly fails, we can be sure the data can still be found on the slave. The drawback is that if the slave does not respond (because it has crashed, the network has failed, or for any other reason), the master cannot complete writes; it must block all writes and wait until the synchronous replica is available again, and during that time the database cannot accept updates or inserts.

Asynchronous replication:

In asynchronous replication, when a data update request reaches the master node, the master returns to the client as soon as its own operation completes; the slave catches up from the master in the background without confirmation. The disadvantage is that if the master fails and cannot be recovered, any writes that have not yet been replicated to the slave are lost, meaning a write may not be durable even though it was confirmed to the client. The advantage is that even if all slaves fall behind, the master can keep processing writes and the service keeps running.

Semi-synchronous replication:

Semi-synchronous replication is an intermediate strategy. When a data update request reaches the master node, the operation must also be successfully executed on one slave before the request returns to the client. If that synchronous slave becomes slow or unavailable, an asynchronous slave is promoted to be the synchronous one. This preserves availability while keeping a degree of data consistency (replication lag and temporary inconsistencies are still possible).

MySQL semi-synchronous replication

The master ships data to the slave immediately after an update. The slave only has to receive it and write it to its relay log (it does not have to execute it yet) before acknowledging. The master returns success to the client only after receiving that acknowledgement from the slave. Only when the slave is unavailable or the replication network is abnormal does the master pause its response to the client (by default about 10 seconds in MySQL) and then degrade replication to asynchronous; when replication recovers, the mode returns to semi-synchronous.
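As a small illustration (assuming the semi-sync plugin is installed; the DSN is a placeholder), the Go sketch below checks on the master whether semi-synchronous replication is currently active and what its degradation timeout is, via the standard rpl_semi_sync_master_* status and system variables.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // MySQL driver
)

func main() {
	// Placeholder DSN; point it at your master.
	db, err := sql.Open("mysql", "monitor:secret@tcp(127.0.0.1:3306)/")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Is the semi-sync master plugin currently active? (ON/OFF)
	// Returns no row if the plugin is not installed.
	var name, status string
	if err := db.QueryRow("SHOW STATUS LIKE 'Rpl_semi_sync_master_status'").Scan(&name, &status); err != nil {
		log.Fatal(err)
	}

	// How long the master waits for a slave ACK before degrading to async
	// (milliseconds, 10000 by default, i.e. the ~10 seconds mentioned above).
	var timeoutMs string
	if err := db.QueryRow("SHOW VARIABLES LIKE 'rpl_semi_sync_master_timeout'").Scan(&name, &timeoutMs); err != nil {
		log.Fatal(err)
	}

	fmt.Printf("semi-sync active: %s, timeout: %s ms\n", status, timeoutMs)
}
```

When no slave acknowledges within that timeout, the master degrades to asynchronous replication, exactly as described in the paragraph above.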

MySQL data synchronization and failover

MySQL supports fairly strict ACID semantics and is a relational database with very good performance and stability, but its support for distribution is not very friendly. It does have NDB Cluster, but that is not widely used; in practice (at least in China) the basic master-slave replication mode is far more common. MySQL supports all of the replication modes mentioned above, so you only need to pick the mode that fits each scenario. Where availability matters most and data-consistency requirements are low, asynchronous replication is fine. Financial scenarios that require strong data consistency can use fully synchronous replication. Internet scenarios usually have moderate requirements on both availability and consistency, so semi-synchronous replication is a good fit. The following is a brief description of MySQL master-slave synchronization.

MySQL master/slave synchronization logic

First, enable the binary log (binlog) on the master. The slave's I/O thread reads the binlog from the master and writes it into the slave's relay log; the slave's SQL thread then reads the relay log and applies the events to the slave database, achieving master/slave data synchronization.

MySQL master/slave synchronization logic
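To see that pipeline from the slave's side, here is a hedged Go sketch (not MHA code; the DSN and credentials are placeholders) that reads SHOW SLAVE STATUS through database/sql and prints whether the I/O and SQL threads are running and how far the slave lags behind the master.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Placeholder DSN; point it at a slave.
	db, err := sql.Open("mysql", "monitor:secret@tcp(10.0.0.12:3306)/")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	rows, err := db.Query("SHOW SLAVE STATUS")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	// SHOW SLAVE STATUS returns one row with many columns; collect it into a map.
	cols, _ := rows.Columns()
	vals := make([]sql.RawBytes, len(cols))
	scan := make([]interface{}, len(cols))
	for i := range vals {
		scan[i] = &vals[i]
	}
	status := map[string]string{}
	if rows.Next() {
		if err := rows.Scan(scan...); err != nil {
			log.Fatal(err)
		}
		for i, c := range cols {
			status[c] = string(vals[i])
		}
	}

	fmt.Println("IO thread: ", status["Slave_IO_Running"])  // should be "Yes"
	fmt.Println("SQL thread:", status["Slave_SQL_Running"]) // should be "Yes"
	fmt.Println("Lag (s):   ", status["Seconds_Behind_Master"])
}
```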

However, MySQL itself does not implement failover, so when the master fails you need a strategy to perform failover and handle the database switchover. The failover logic is: when the master fails, a slave is automatically promoted to be the new master, and the other slaves are told about the new master so they can keep synchronizing from it. This is where MHA, a high-availability management tool for MySQL, comes in. MHA performs automatic database failover within 0 to 30 seconds and preserves data consistency as far as possible during the failover, achieving high availability. I won't go into all the details of MHA here; they are in the official wiki (the documentation is genuinely detailed, the author has considered many common scenarios, and a great many parameters are configurable).

Links: MHA installation, MHA advantages, MHA architecture, MHA configuration, and much more…

Here I only analyze the architecture and how failover is implemented. The structure is as follows (the picture from the official site is slightly blurry).

MHA architecture diagram

The MHA consists of two parts:

MHA Manager (management node): it can be deployed on an independent machine to manage multiple master-slave clusters, or on one of the slave nodes. It manages the MySQL servers: master detection, master election, connection checks, and the failover work itself.

MHA Node (data node): runs on every MySQL server. It saves binary logs from the master, generates the differential relay log from the slave that has the latest data, applies that differential log, and finally purges relay logs without stopping the SQL thread.

Principle:

(1) Save binary log events from the crashed master;

(2) Identify the latest slave;

(3) Apply differential relay logs to other slaves;

(4) Apply the binary log events saved from the master;

(5) Promote one of the slaves to be the new master;

(6) Make the other slaves connect to the new master and replicate from it.

Problems MHA needs to solve:

How to determine the new master:

MySQL has no cluster of distributed decision-making nodes like Elasticsearch or etcd, so master election is performed by the MHA Manager node: 1. A slave explicitly configured with **candidate_master=1** is preferred as the new master; 2. Otherwise, the slave with the latest binary/relay log data (the most up-to-date slave) is promoted to master.

Select a new master
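Conceptually the election boils down to comparing replication positions. The toy Go sketch below only illustrates that idea (MHA itself is written in Perl and compares real binlog/relay-log coordinates): it prefers a slave flagged as a candidate, and otherwise picks the slave with the most advanced position.

```go
package main

import "fmt"

// Slave describes what the manager needs to know for the election.
type Slave struct {
	Host            string
	CandidateMaster bool  // analogous to candidate_master=1 in the MHA config
	LogFileSeq      int   // simplified: sequence number of the last received binlog file
	LogPos          int64 // simplified: position inside that file
}

// newer reports whether a has more recent replication data than b.
func newer(a, b Slave) bool {
	if a.LogFileSeq != b.LogFileSeq {
		return a.LogFileSeq > b.LogFileSeq
	}
	return a.LogPos > b.LogPos
}

// electMaster prefers a flagged candidate, otherwise the most up-to-date slave.
func electMaster(slaves []Slave) Slave {
	best := slaves[0]
	for _, s := range slaves[1:] {
		switch {
		case s.CandidateMaster && !best.CandidateMaster:
			best = s
		case s.CandidateMaster == best.CandidateMaster && newer(s, best):
			best = s
		}
	}
	return best
}

func main() {
	slaves := []Slave{
		{Host: "10.0.0.12", LogFileSeq: 120, LogPos: 4521},
		{Host: "10.0.0.13", LogFileSeq: 121, LogPos: 107},
		{Host: "10.0.0.14", CandidateMaster: true, LogFileSeq: 120, LogPos: 980},
	}
	fmt.Println("new master:", electMaster(slaves).Host) // 10.0.0.14 (flagged candidate)
}
```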

How to ensure data consistency:

MHA does its best not to lose data. If the MySQL master process has failed but the host itself is still reachable, MHA compares the binary log on the master with the relay log of the slave that is about to become the new master, copies the missing binlog events over, and applies that difference so the new master's data is complete.

diff master

If the MySQL master is down because the host itself is down, the binlog files stored on that machine are no longer accessible and cannot be copied, so this copy step is skipped and the new master is elected directly from the slave candidates. If you use semi-synchronous replication (available since MySQL 5.5), the risk of data loss in this case is greatly reduced.

Copy the binlog

How to copy data between nodes:

MySQL itself does not copy binlog files between nodes for this purpose, so MHA has to implement the copy itself. It relies on SSH: files are transferred with scp, which is why all hosts in an MHA setup must be able to reach each other over SSH.

Copy the binlog
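As a rough sketch of that transfer step (the host, path, and file name are hypothetical; MHA's actual implementation is its own Perl tooling), copying a saved binlog file off the failed master's host over SSH can look like this:

```go
package main

import (
	"log"
	"os/exec"
)

func main() {
	// Hypothetical example: pull one binlog file from the (still reachable) failed master's host.
	// Requires passwordless SSH between the hosts, just as MHA does.
	cmd := exec.Command("scp",
		"root@10.0.0.11:/var/lib/mysql/mysql-bin.000120",
		"/var/tmp/saved_binlog/")
	if out, err := cmd.CombinedOutput(); err != nil {
		log.Fatalf("scp failed: %v\n%s", err, out)
	}
	log.Println("binlog copied")
}
```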

How do other slave nodes know about the new master:

After the candidate is promoted to master, the MHA Manager issues CHANGE MASTER TO on every remaining slave in the cluster to switch their replication source to the new master.

Update the master slave
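The re-pointing itself is ordinary replication SQL. The hedged Go sketch below shows the kind of statements involved; the hosts, credentials, and binlog coordinates are placeholders, and in a real failover MHA computes the correct coordinates for you:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

// repoint switches one slave's replication source to the newly promoted master.
func repoint(slaveDSN, newMasterHost, logFile string, logPos int) error {
	db, err := sql.Open("mysql", slaveDSN)
	if err != nil {
		return err
	}
	defer db.Close()

	stmts := []string{
		"STOP SLAVE",
		fmt.Sprintf(
			"CHANGE MASTER TO MASTER_HOST='%s', MASTER_PORT=3306, "+
				"MASTER_USER='repl', MASTER_PASSWORD='repl_pass', "+
				"MASTER_LOG_FILE='%s', MASTER_LOG_POS=%d",
			newMasterHost, logFile, logPos),
		"START SLAVE",
	}
	for _, s := range stmts {
		if _, err := db.Exec(s); err != nil {
			return fmt.Errorf("%s: %w", s, err)
		}
	}
	return nil
}

func main() {
	// Placeholder values: re-point one remaining slave at the new master.
	err := repoint("admin:secret@tcp(10.0.0.13:3306)/", "10.0.0.12", "mysql-bin.000045", 154)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("slave now replicating from the new master")
}
```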

How does the management node solve the network partition problem?

From the structure above we can guess a potential big problem: network partition. A network partition means the cluster is split by the network into two groups that cannot reach, and therefore do not trust, each other. For stateless systems such as Nginx this has little impact and requests are processed normally. When a data system is partitioned, however, repairing the damage can be very complicated if the system is not properly designed or configured, so distributed data systems such as Elasticsearch, etcd, and MongoDB have mechanisms built in to avoid data inconsistency caused by partitions. The usual solution is to require that a majority of the nodes in the cluster can still communicate: for example, if your cluster has 5 nodes and a partition splits it into a 2-node group and a 3-node group, the 2-node group is considered abnormal and stops serving. Consensus algorithms such as Raft solve exactly this problem.

For example, in the figure below there are 3 nodes, so the minimum size of a trusted partition is 2; node C is therefore marked as abnormal and stops serving.
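The majority rule itself is just arithmetic; the tiny Go sketch below states it explicitly and is not any particular system's implementation:

```go
package main

import "fmt"

// hasQuorum reports whether a partition with `reachable` members out of a
// `total`-node cluster still holds a strict majority and may keep serving.
func hasQuorum(reachable, total int) bool {
	return reachable > total/2
}

func main() {
	fmt.Println(hasQuorum(3, 5)) // true: the 3-node side keeps serving
	fmt.Println(hasQuorum(2, 5)) // false: the 2-node side must stop
	fmt.Println(hasQuorum(2, 3)) // true: matches the 3-node example above
}
```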

MHA's network partition problem is a little different from the one described above. Because a cluster has only one MHA Manager (note: deploy only one Manager per cluster, otherwise things go wrong), the Manager itself cannot split-brain. The problem is a partition between the MHA Manager node and the MySQL master node, as shown below:

A network partition between the MHA Manager and the MySQL master

When the MHA Manager and the MySQL master end up in two different partitions, MHA decides the master is down even though the master and its slaves are actually healthy and still serving. MHA will still switch masters, and the application may then see two masters (for example if there is a load balancer in front), leading to data inconsistency. For this, MHA provides secondary checks, i.e. detection over multiple routes: 1. The MHA Manager checks the MySQL master directly over one route; 2. Over the other routes, the Manager logs in to the other slaves via SSH and checks from there whether the MySQL master is reachable. This resolves partitions between the MHA Manager and the MySQL master and prevents a wrong switchover.
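Here is a minimal sketch of that "check from more than one vantage point" idea. It is an illustration under stated assumptions (placeholder addresses, passwordless SSH, nc installed on the slave), not MHA's actual masterha_secondary_check script: it first dials the master's MySQL port directly, and if that fails it asks a slave host over ssh to try the same TCP check.

```go
package main

import (
	"fmt"
	"net"
	"os/exec"
	"time"
)

// direct checks whether this host can open a TCP connection to the master's MySQL port.
func direct(master string) bool {
	conn, err := net.DialTimeout("tcp", master+":3306", 2*time.Second)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

// viaSlave asks another host (over ssh) to try the same TCP check with netcat.
// Hypothetical helper: assumes passwordless ssh and nc installed on the slave.
func viaSlave(slave, master string) bool {
	return exec.Command("ssh", slave, "nc", "-z", "-w", "2", master, "3306").Run() == nil
}

func main() {
	master, slave := "10.0.0.11", "10.0.0.12" // placeholder addresses

	if direct(master) {
		fmt.Println("master reachable directly: no failover")
		return
	}
	if viaSlave(slave, master) {
		fmt.Println("master unreachable from the manager but reachable from a slave:")
		fmt.Println("likely a partition around the manager, do NOT fail over")
		return
	}
	fmt.Println("master unreachable from every route: start failover")
}
```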

How the client application recovers automatically

Generally, distributed systems with built-in failover let clients recover by themselves. With Elasticsearch or etcd, both the client and the cluster are aware of membership changes: the client connects to a set of cluster addresses (Endpoints []string is an array), so if one node's IP is unreachable, the machine is down, or it is overloaded, the client can still use the others. An etcd connection, for example, looks like this:

```go
cfg := client.Config{
	Endpoints: []string{"http://127.0.0.1:2379"},
	Transport: client.DefaultTransport,
	// set timeout per request to fail fast when the target endpoint is unavailable
	HeaderTimeoutPerRequest: time.Second,
}
```

MySQL's default connection model, however, is different: applications (Tomcat or any other client) connect through the MySQL driver, which cannot take an array of addresses. Our solution is therefore to keep the client unaware and avoid logic changes: the client keeps connecting to a single IP as before, but that IP is a proxy IP. There are then several ways to put a proxy between the application and the database (I am only talking about connectivity here, not sharding or other advanced routing; the proxy's own high availability can be handled with keepalived):

  • Layer 7 MySQL proxies:
    • mycat
    • kingshard
    • Atlas
    • vitess
    • phxsql
    • MaxScale

In principle, the proxy parses the MySQL protocol and then routes each request (read/write splitting) to the appropriate MySQL server on the backend according to the database, table, and request type. Because a layer-7 proxy has to parse the protocol, there is a performance cost; see the published comparisons for details. What we need to do is configure our server group behind the proxy, enable read/write splitting, and let the proxy switch when the master fails.

  • Layer 4 proxies:
    • lvs
    • haproxy

A layer-4 proxy cannot parse the MySQL layer-7 protocol; it only works at the transport layer, so it is enough to make sure the MySQL backend port is reachable: when it detects that the backend master is unavailable, it switches to the backup master. Because it only sees layer 4, automatic read/write splitting cannot be configured; you can only expose the master port and the slave port separately (with keepalived you can add custom switchover scripts, including scripts that take master/slave replication lag into account).

  • Direct use of VIP
    • Configuring VIP using Scripts
    • Keepalived configuration of VIP

The logic of this last approach is:

Manually configured VIP: MHA Manager is configured with the master_ip_online_change_script script. When the MySQL master fails over, the script brings the VIP up on the newly promoted slave, so the client sees no change at all. The VIP itself is fairly easy to set up; see the configuration walkthrough (www.cnblogs.com/gomysql/p/3…), which I won't repeat here.
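The VIP move inside such a script usually comes down to a couple of system commands. The Go sketch below is a hypothetical stand-in for that script, shown only to make the mechanism concrete; the interface name, addresses, and the use of ip/arping are assumptions about the environment:

```go
package main

import (
	"log"
	"os/exec"
)

// run executes a system command and aborts with its output on failure; the
// commands themselves are ordinary VIP plumbing, not anything specific to MHA.
func run(name string, args ...string) {
	out, err := exec.Command(name, args...).CombinedOutput()
	if err != nil {
		log.Fatalf("%s %v failed: %v\n%s", name, args, err, out)
	}
}

func main() {
	vip, dev := "192.168.0.88/24", "eth0" // placeholder VIP and interface

	// Bring the VIP up on this host (the newly promoted master).
	run("ip", "addr", "add", vip, "dev", dev)

	// Send gratuitous ARP so switches and clients learn the VIP's new location quickly.
	run("arping", "-c", "3", "-A", "-I", dev, "192.168.0.88")

	log.Println("VIP is now served from this host")
}
```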

Our production environment actually uses MaxScale for read/write splitting; its documentation is particularly comprehensive. We chose MaxScale because it is stable and efficient and works seamlessly with MHA: MHA does not need to be configured with any IP-switching logic, because MaxScale automatically detects the role of each server in the system and notices master switches, so the application is not affected at all, as shown below:

Automatically identify roles

Conclusion:

This article has been about solving high availability for the plain official community edition of MySQL using MHA + MaxScale; the scheme changes the existing system at minimal cost while improving its availability and stability. MySQL's clustering support was relatively weak in older versions (before 5.7), but MySQL keeps evolving and the community has produced many solutions, such as PhxSQL, Percona XtraDB Cluster, and MariaDB Galera Cluster. MySQL itself has also released MySQL Group Replication (now GA), which uses a distributed protocol to solve the data consistency problem. We look forward to more and more solutions that handle MySQL high availability even better.