MySQL is highly available, so perfect??

Small knowledge, big challenge! This article is participating in the creation activity of “Essential Tips for Programmers”.

MySQL is favored by developers for its ease of learning and high availability. Almost all of its high availability architectures rely directly on Binlog. Binlog is responsible for making MySQL the most popular open source database today. How does MySQL achieve high availability? Is this high availability perfect enough?

Active/standby synchronization process

process

The synchronization process is shown in the following figure, which can also illustrate the action of an update statement in master:

Standby database B maintains A long connection with primary database A. Primary library A has an internal thread dedicated to servicing this long connection for standby library B. The complete process of a transaction log synchronization looks like this:

Run the change master command on standby library B to set the IP, port, username, password of primary library A, and the location from which the binlog request should start. This location contains the file name and log offset.
Run the start slave command on slave library B. In this case, the slave library starts two threads, namely io_thread and SQL_thread, as shown in the figure. Io_thread is responsible for establishing connections with the primary library.
After verifying the user name and password, primary database A reads the binlog from the local database to the location sent by secondary database B and sends the log to secondary database B.
Standby database B obtains the binlog and writes it to a local file, which is called the relay log.
Sql_thread reads transfer logs, parses commands in logs, and executes them.

Synchronization position

After an active/standby switchover, the secondary database needs to synchronize data from the new primary database. In the first step of the above process, you need to specify where to start the binlog request. There are two main schemes:

Based on the site

Before MySQL5.6, use the change master command to replace the primary library.

CHANGE MASTER TO 

MASTER_HOST=$host_name 
MASTER_PORT=$port 
MASTER_USER=$user_name 
MASTER_PASSWORD=$password 
MASTER_LOG_FILE=$master_log_name 
MASTER_LOG_POS=$master_log_pos

Copy the code

The operation process is as follows:

Wait for the new master database A ‘to complete the synchronization of all the relay logs;
Run the show master status command on A ‘to obtain the latest File and Position on A’.
Take the fault time T of the original master library A;
Use mysqlbinlog to parse A ‘File and get the loci at time T.

Based on the GTID

Site-based schemes are too cumbersome, and MySQL 5.6 introduced GTID, eliminating the need to manually calculate loci.

The Global Transaction Identifier (GTID) is the unique Identifier that is generated when a Transaction is committed. Each MySQL instance maintains a collection of GtiDs for “all transactions performed by this instance”.

CHANGE MASTER TO 
MASTER_HOST=$host_name 
MASTER_PORT=$port 
MASTER_USER=$user_name 
MASTER_PASSWORD=$password 
master_auto_position=1

Copy the code

Master_auto_position =1 indicates that the master/slave relationship uses the GTID protocol.

Run the start slave command on instance B and obtain the binlog logic as follows, where set_A and set_B are the GTID sets of executed transactions:

Instance B specifies the primary library A ‘and establishes the connection based on the active/standby protocol.
Instance B sends set_B to primary library A ‘.
Instance A ‘calculates the difference set between set_A and set_B, that is, the set of all gtiDs that exist in set_A but not in set_B, and determines whether A’ local contains all binlog transactions required by the difference set.

A. If not, a ‘has deleted the binlog required by instance B, and an error is returned.

B. If it is confirmed that all transactions are included, A ‘finds the first transaction in its binlog file that is not in SET_B and sends it to B;
Then we start from this transaction, read the files backwards, and send the binlog to B in order.

The operation based on GTID can be considered that the system calculates the corresponding loci by itself.

Circulation problems

If log_slave_updates is set to ON, the standby database generates a binlog after executing relay log. In the master-master + master-slave replication case, it is sometimes found that the master and slave are not synchronized, most likely because some master library does not have log_slave_updates set to ON.

Since the consumption of relay logs generates new binlogs, why does the master case not generate cyclic replication between nodes?

This is mainly because MySQL records in binlog the server ID of the instance where the command was executed for the first time.

Specify that the server ids of two libraries must be different. If they are the same, the relationship between them cannot be set as master/slave.
A standby database receives a binlog and, during playback, generates a new binlog with the same server ID as the original one.
After receiving the log from the primary library, each library determines the server ID. If the log is the same as its own, it indicates that the log is generated by itself and directly discards the log.

High availability (HA)

Now everyone uses MySQL, mainly because of its high availability. There are two reasons for high availability: active/standby consistency and active/standby switchover. Both are indispensable.

Under normal circumstances, as long as all binlogs generated by the master database can be passed to the standby database and executed correctly, the standby database will reach the same state as the master database, which is the final consistency.
If there is a problem with the primary database, the standby database can be used as the primary database and continue to provide services.

The MySQL ha system is based on the master/slave switchover logic, but the master/slave switchover depends on the master/slave delay.

The reason is easy to understand. If the synchronization of the standby database is not complete, changing the standby database to the primary database may cause data loss and data inconsistency.

Synchronization delay

According to the master/slave synchronization process mentioned above, we can see that the time points related to data synchronization mainly include the following three points:

Master A performs A transaction, writes to binlog, and we’ll call that T1;
And then we send it to standby B, and we call the time that standby B receives this binlog T2;
Standby B executes and completes the transaction, which we will mark as T3.

The so-called master/slave delay is the difference between the completion time of the standby database and the completion time of the master database for the same transaction, namely t3-T1. So there’s bound to be a delay between the master and the standby.

To view the delay time, run the show slave status command on the slave database. Seconds_behind_master is displayed in the command output, which indicates the delay time of the slave database. The time displayed on the DB monitor is seconds_behind_master.

The reason for the delay

The master/slave delay is inevitable, but it should not be too long. The reasons for the long active/standby delay are as follows:

The performance of the machine where the standby library resides is worse than that of the machine where the primary library resides

The standby library’s machine configuration is inherently worse than the primary library’s
The active database is deployed on multiple machines, and the standby database is deployed on a single machine

The pressure of standby storage is large

Queries on the standby database consume a lot of CPU resources

The master library performs large transaction or large table DDL

The slave library is delayed for as long as the statement is executed in the master library

Parallel replication capability of the standby

Does the standby library use single-thread replication or multi-thread replication
Starting with MySQL5.7.22, you can select which parallel replication strategy to use through COMMIT_ORDER, WRITESET, and WRITE_SESSION of the binlog-Transaction dependency-tracking parameter

The main/backup

The following is the process of the active/standby switchover between two Master nodes. Generally speaking, double M means that AB is set to be in active/standby mode, but only one node is being updated at any time.

There are two types of active/standby switchover policies: reliability first and availability first.

Reliability first

The details of the transition from state 1 to state 2 look like this:

Determine the current seconds_behind_master of secondary database B. If the seconds_behind_master is less than a certain value (for example, 5 seconds), proceed to the next step. Otherwise, retry continuously.
Change primary library A to read-only state, that is, set readonly to true;
Check the seconds_behind_master value of secondary database B until the value becomes 0.
Alter database B to be read/write, i.e. set readOnly to false;
The business request is cut to standby database B.

The reliability first advantage is that the switchover does not cause system problems after the data on the active and standby nodes is consistent. The downside is that the system is unavailable for some time.

Availability first

The details of the transition from state 1 to state 2 look like this:

Alter database B to be read/write, i.e. set readOnly to false;
Service requests are switched to standby database B.
Change primary library A to read-only state, that is, set readonly to true;

The advantage of usability first is that there is little time for the system to be unavailable, and the disadvantage is the possibility of data inconsistencies in the system.

Data inconsistency occurs because standby database B continues to consume uncompleted binlog logs while receiving service requests. New requests may conflict with old requests. Setting binlog to Row can help detect such problems in a timely manner and reduce their severity.

Abnormal situation

Assume that the active/standby delay between active library A and standby library B is 30 minutes. In this case, active library A is powered off, and the HA system switches library B to be the active library. There are problems with cutting and not cutting.

Che: Some data cannot be found in the standby database, and data inconsistency may occur
Not cut: The database is unavailable

This is why the availability of MySQL high availability systems is dependent on master/standby latency. The smaller the delay, the shorter the service recovery time and the higher the availability in the event of a primary library failure. So synchronization delays need to be a major concern for both DBAs and developers.

The problem

Although the master/slave synchronization brings high availability to MySQL, the delay is inevitable. As a result, after the master database is updated, the slave database is immediately checked, and the updated data is not available in the slave database. This problem cannot be avoided, but we can find ways to optimize it. We need to make a good balance between efforts and benefits.

Forcibly remove the master library scheme

The search operation does not look up the secondary library, but the primary library

Sleep solutions

After the client is updated successfully, the search operation will be performed after a while

Determine the no-delay scheme in active/standby mode

Check whether Seconds_behind_master is already 0 each time before executing a query request from the library. If not, the query request cannot be executed until this parameter is 0.
Judge site: Compare the latest read site of the master database with the latest read site of the standby database
Check whether the GTID sets of all logs received by the standby database are consistent with those that have been executed in the standby database

Work with the semi-sync solution

With one master and one standby, semi-synchronous replication ensures that both master and standby receive updates

Wait for the main repository point scheme

Select master_pos_wait(file, pos[, timeout]) from master_pos_wait(file, pos[, timeout])

Such as GTID scheme

Select wait_for_executed_gtid_set(gtid_set, 1) from the master library, where gtid_set is the GTID of the transaction directly retrieved from the return package after the master transaction has been updated

conclusion

In fact, MySQL master-slave synchronization is of little relevance to developers, but understanding its imperfections can be helpful in tracking down problems when exceptions occur. Moreover, it can be extended to many new ways of playing, such as business consumption binlog, to achieve many functions.

data

Log_slave_updates important parameters for primary and secondary synchronization Settings
MySQL45 speak
MySQL DDL– Ghost tools to learn

The last

If you like my article, you can follow my public account (Programmer Malatang)

My personal blog is shidawuhen.github. IO /

Review of previous articles:

Design patterns
recruitment
thinking
storage
The algorithm series
Reading notes
Small tools
architecture
network
The Go