MySQL MGR node management and failover

Article source: Tao Teacher operation notes – WeChat official account

  • Dev.mysql.com/doc/refman/…
  • Dev.mysql.com/doc/refman/…

The MySQL Group Replication (MGR) framework enables automatic master/slave switchover and failover for MySQL.

This article describes node management and failover:

  • 1) Theoretical background on MGR node management and failover.
  • 2) Tests of cluster recovery after a master failure in MGR single-master mode, and methods for adding/removing nodes in an MGR cluster.

1. Test environment

1.1 Environment planning

Role    IP               Port    server-id
DB-1    192.110.103.41   3106    103413106
DB-2    192.110.103.42   3106    103423106
DB-3    192.110.103.43   3106    103433106

Note: Port 3306 is already occupied on these machines, so port 3106 is used instead.

1.2 Installing MGR

See the MGR installation steps in the previous article: MySQL 5.7 MGR installation and introduction.

2. Introduction to MGR

2.1 MGR overview

MySQL Group Replication (MGR) is a new feature in MySQL 5.7 that provides a highly available, highly scalable, highly reliable (highly consistent) MySQL cluster service. An MGR cluster consists of multiple instance nodes; a transaction can commit only after it has been approved by more than half of the nodes in the group. Each node in the cluster maintains a database state machine to keep transactions consistent across nodes.

2.2 Single-master/Multi-master Mode

MySQL group replication can be configured to work in single-master mode or multi-master mode. A brief introduction to the two modes:

  • Single-master model: one master node is automatically elected from the MySQL nodes in the replication group. Only the master can accept writes; the other nodes are automatically set to read-only. When the master fails, a new master is automatically elected; after the election it is made writable and the other slaves point to the new master.
  • Multi-master model: every node in the replication group can accept writes, so there is no master/slave concept; the group remains available as long as a majority of the nodes stay online.

MySQL group replication uses the Paxos distributed algorithm to coordinate the nodes. Because of this, a majority of the nodes in the group must be online to reach quorum and agree on every decision.

A majority means N/2+1 (integer division), where N is the current number of nodes in the group. For example, with 5 nodes in the group, 3 of them are needed to form a majority. The number of node failures that can be tolerated is therefore as follows (a quick way to check the current member count follows the table):

Group size   Majority   Failure tolerance
1            1          0
2            2          0
3            2          1
4            3          1
5            3          2
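As a quick sanity check against the table above, the current group size and the state of each member can be read from performance_schema (a minimal sketch, runnable on any member):

-- Members of the current view, grouped by state (only healthy members count towards the majority)
SELECT MEMBER_STATE, COUNT(*) AS members
FROM performance_schema.replication_group_members
GROUP BY MEMBER_STATE;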

2.3 Configuration Description

[mysqld]
datadir=/data
socket=/data/mysql.sock
server-id=100                               # Mandatory
gtid_mode=on                                # Mandatory
enforce_gtid_consistency=on                 # Mandatory
log-bin=/data/master-bin                    # Mandatory
binlog_format=row                           # Mandatory
binlog_checksum=none                        # Mandatory
master_info_repository=TABLE                # Mandatory
relay_log_info_repository=TABLE             # Mandatory
relay_log=/data/relay-log                   # Mandatory; the default value is used if not given
log_slave_updates=ON                        # Mandatory
sync-binlog=1                               # Recommended
log-error=/data/error.log
pid-file=/data/mysqld.pid
transaction_write_set_extraction=XXHASH64   # Mandatory
loose-group_replication_group_name="aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"   # Mandatory
loose-group_replication_start_on_boot=off   # Recommended: off
loose-group_replication_member_weight=40    # Optional
loose-group_replication_local_address="192.110.103.41:31061"                # Mandatory, as is the next line
loose-group_replication_group_seeds="192.110.103.41:31061,192.110.103.42:31061,192.110.103.43:31061"
loose-group_replication_bootstrap_group=OFF
loose-group_replication_single_primary_mode=FALSE                  # FALSE = multi-primary
loose-group_replication_enforce_update_everywhere_checks=ON        # ON = multi-primary

Analyze the configuration options above:

  • 1). Because group replication is based on GTID, gtid_mode and enforce_gtid_consistency must be enabled.
  • 2). log_bin and binlog_format=row must be set: group replication relies on row-based binary logs to propagate changes and to ensure data consistency.
  • 3). binlog_checksum must be set to none, because group replication in MySQL 5.7 does not support binlog checksums.
  • 4). Group replication writes master and relay log metadata to the mysql.slave_master_info and mysql.slave_relay_log_info tables, so master_info_repository and relay_log_info_repository must be set to TABLE.
  • 5). Each node in the group keeps a complete copy of the data (share-nothing). log_slave_updates must therefore be enabled on all nodes, so that a newly joining node can pick any member as the donor for asynchronous recovery.
  • 6). sync_binlog=1 ensures the binlog is flushed to disk on every commit, so no log is lost even after a crash.
  • 7). The lines that begin with loose- configure the group replication plugin. The loose- prefix means MySQL will still start even if the group replication plugin has not been loaded yet; the prefix is optional.
  • 8). transaction_write_set_extraction=XXHASH64 means the write set is hashed with the XXHASH64 algorithm. A write set uniquely identifies the rows modified by a transaction and is used to detect whether concurrent transactions modified the same rows (conflict detection). It is generated from primary keys, so every table used with group replication must have a primary key.
  • 9). group_replication_group_name is the name of the replication group and must be a valid UUID. On Linux you can generate one with the uuidgen tool (see the example below).
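Besides uuidgen, a valid group name can also be generated from within MySQL itself (a small illustration; the UUID shown elsewhere in this article is just an example value):

-- Generate a UUID to use as group_replication_group_name
SELECT UUID();
-- Put the returned value into my.cnf on every node of the group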

3. Group replication management

  • Reference: NetEase Cloud MGR implementation analysis: https://www.cnblogs.com/163yun/p/9816569.html

3.1 View and View Switching

MGR manages members based on the Group View (View for short). A view is the membership of the group during a period in which no member changes occur, that is, no member joins or leaves; once a member joins or leaves, the view changes. MGR uses a View ID to track view changes and to order views.

select * from performance_schema.replication_group_member_stats\G ;
*************************** 1. row ***************************
                      CHANNEL_NAME: group_replication_applier
                           VIEW_ID: 15718289704352993:39
                         MEMBER_ID: 509810ee-f3d7-11e9-a7d5-a0369fac2de4
       COUNT_TRANSACTIONS_IN_QUEUE: 0
        COUNT_TRANSACTIONS_CHECKED: 10
          COUNT_CONFLICTS_DETECTED: 0
COUNT_TRANSACTIONS_ROWS_VALIDATING: 0
TRANSACTIONS_COMMITTED_ALL_MEMBERS: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-60:1000003-1000006:2000003-2000006
    LAST_CONFLICT_FREE_TRANSACTION: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:58

The Group View ID consists of a timestamp prefix and a sequence number.

  • Prefix part: generated when the group is initialized; it is the timestamp at that moment and does not change during the lifetime of the group.
  • Sequence part: when the group is initialized the first view's sequence number is 1, and it increases by one every time a member joins or leaves.
#viewid: 15718289704352993:39
select from_unixtime(FLOOR(15718289704352993/10000000),'%Y-%m-%d %h:%m:%s');
+----------------------------------------------------------------------+
| from_unixtime(FLOOR(15718289704352993/10000000),'%Y-%m-%d %h:%m:%s') |
+----------------------------------------------------------------------+
| 2019-10-23 07:10:30                                                  |
+----------------------------------------------------------------------+

3.2 Node Addition Process

The process below follows NetEase Cloud's MGR implementation analysis (which illustrates the message flow with a diagram). When a node requests to join a group, it first establishes a TCP connection (Gcs_xcom_control::do_join()) to the group's seed members listed in its group_replication_group_seeds parameter. The seed member checks its group_replication_ip_whitelist (IP whitelist) to decide whether the new node is allowed to join; by default MGR does not restrict the IP addresses of new nodes. After the connection is established, the new node sends a request to join the group. On receiving the request, the seed member broadcasts a view-change message to all nodes in the group, including the node applying to join. After receiving this message, each node starts the view switch: every node broadcasts a status exchange message containing its current state and information. Once the exchange messages have been sent, each node receives the messages broadcast by the other nodes and updates its local member list accordingly.
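The whitelist check mentioned above can be inspected, and tightened if needed, on the seed members; a minimal sketch, assuming the 192.110.103.0/24 network used in this test environment (change it while group replication is stopped on that member):

-- Default is AUTOMATIC, which allows private networks
SHOW GLOBAL VARIABLES LIKE 'group_replication_ip_whitelist';
STOP GROUP_REPLICATION;
SET GLOBAL group_replication_ip_whitelist = '192.110.103.0/24,127.0.0.1/8';
START GROUP_REPLICATION;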

Completing the view switch is only the first step of joining the group. It means the member can now receive messages agreed through the Paxos protocol, but it does not mean the member can be set ONLINE to serve traffic. The new member still needs to synchronize data and build the correct data version (recovery_module->start_recovery); only after that can it apply Paxos protocol messages and serve normal user traffic online.
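During this phase the joining member is already visible to the rest of the group, just not yet serving traffic; its progress can be watched with the usual membership query:

-- A joining member shows up as RECOVERING and switches to ONLINE only once it has caught up
SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE
FROM performance_schema.replication_group_members;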

3.3 Group Replication Management Statements

Commonly used statements for operating group replication:

-- View group members
SELECT * FROM performance_schema.replication_group_members;
-- Find the current master (single-master mode)
SELECT ta.*, tb.MEMBER_HOST, tb.MEMBER_PORT, tb.MEMBER_STATE
FROM performance_schema.global_status ta, performance_schema.replication_group_members tb
WHERE ta.VARIABLE_NAME='group_replication_primary_member' and ta.VARIABLE_VALUE=tb.MEMBER_ID;
SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME='group_replication_primary_member';
SHOW STATUS LIKE 'group_replication_primary_member';
-- Start/stop group replication
start group_replication;
stop group_replication;

When you want to stop group replication in a member of a group, execute the stop group_replication statement on that node.

It is important to note, however, that before executing this statement you must ensure the node is no longer serving MySQL traffic; otherwise new data may be written to it (for example, if it is the primary when stopped) or stale data may be read from it.
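One simple precaution, sketched below, is to force the node read-only before stopping group replication on it (super_read_only is used the same way elsewhere in this article):

-- Make sure no new writes can land on this node, then leave the group
SET GLOBAL super_read_only = ON;
STOP GROUP_REPLICATION;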

Therefore, the safest way to restart the entire group is: first stop the MySQL instances on the non-primary nodes (the whole instance, not just group replication), then stop the MySQL instance on the primary node. To start again, reverse the order: restart the primary first, bootstrap the group on it and enable its group replication, and finally start group replication on the remaining nodes so they join the group. A minimal sketch of this startup sequence follows.
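A minimal sketch of the startup sequence just described, assuming the old primary is started first:

-- On the first node only: bootstrap the group
SET GLOBAL group_replication_bootstrap_group = ON;
START GROUP_REPLICATION;
SET GLOBAL group_replication_bootstrap_group = OFF;   -- turn bootstrap off again immediately

-- On every other node, once the first node is ONLINE:
START GROUP_REPLICATION;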

In group replication, there are two ways a node can leave the group: voluntarily and involuntarily.

Voluntary withdrawal:

A node leaves voluntarily when the stop group_replication; statement is executed on it.

  • 1). Executing this statement means the node leaves the group voluntarily. This triggers an automatic view reconfiguration, and the view change is replicated to all nodes in the group; the node leaves only after a majority of nodes agree on the new view.
  • 2). When a node leaves voluntarily, the quorum is not lost: the group size shrinks accordingly, so no matter how many nodes leave voluntarily, the group is never blocked by a "majority not met" condition.
  • 3). For example, if one node A voluntarily leaves a 5-node group, the group size becomes 4 and the group behaves as if node A had never existed.

Involuntary withdrawal:

All departures other than the voluntary case above are involuntary, for example node crashes or network outages.

1). When a node leaves involuntarily, the failure detection mechanism detects the problem and reports it to the group. This triggers an automatic reconfiguration of the group view, which requires a majority of nodes to agree on the new view.

2). Unlike voluntary leaving, the group size does not change when a node leaves involuntarily. For example, a 5-node group remains size 5 after a node leaves involuntarily; the missing node is simply marked as not online.

3). When nodes leave involuntarily, quorum votes are lost. If too many nodes leave involuntarily, the remaining nodes can no longer form a majority and the group becomes blocked.

4). For example, in a 5-node group, after one node A leaves involuntarily the group size is still 5, but node A is marked as unreachable (or a similar state) in the new view. If two more nodes then leave involuntarily, only two ONLINE nodes remain; the majority requirement can no longer be met and the group is blocked (see the sketch below for forcing a blocked group back into operation).
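When a group is blocked this way, MySQL provides group_replication_force_members to force a new membership from the surviving nodes. A minimal sketch, assuming only 192.110.103.41 and 192.110.103.43 are still reachable; use it with care and only after confirming the missing members are really down:

-- On one of the surviving members, list the group-communication addresses of the live nodes
SET GLOBAL group_replication_force_members = '192.110.103.41:31061,192.110.103.43:31061';
-- Once the group is unblocked, clear the setting again
SET GLOBAL group_replication_force_members = '';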

4. MGR failure recovery

First, a brief introduction to the data replication and channels involved in MGR. When MGR is running normally, i.e. all nodes in the group are ONLINE, data is propagated between nodes through the Paxos protocol, and remote transactions are written to the group_replication_applier relay log after certification; the replication threads of the group_replication_applier channel are responsible for replaying them.

When a node joins a group, a second replication channel, group_replication_recovery, is used; this is a traditional asynchronous master-slave replication channel.

The process of adding a node can be divided into two phases: recovering the data of previous views, and executing the transactions cached during the current view. The first phase can be further divided into local recovery and global recovery.

4.1 Restoring View Data

1) Local recovery phase

Step 1: Initialize failure recovery, including initializing the recovery thread and the group membership information.

Step 2: Start the group_replication_applier replication channel and replay any relay log already present locally on this channel (a brand-new node has nothing to replay at this point).

2) Global recovery

Step 3: After the local relay log has been replayed, the third step of recovery, state transfer, begins. The group_replication_recovery replication channel is used to pull the missing data from one of the ONLINE nodes in the group. Unlike a traditional master-slave setup, for MGR you only need to configure the account and password, as follows: CHANGE MASTER TO MASTER_USER='rpl_user', MASTER_PASSWORD='rpl_pass' FOR CHANNEL 'group_replication_recovery'. The channel clearly works in GTID-based MASTER_AUTO_POSITION mode. The recovery thread picks a random ONLINE node as the donor and calls the request_dump function in rpl_slave.cc to establish the replication relationship with it, sending along the joining node's gtid_executed set. The donor side iterates over its binlog files in reverse order, checking the Previous-GTIDs of each file to find the first transaction not already executed, and starts sending data from that transaction.
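The donor connection described above is retried a limited number of times; the limits are controlled by two variables (defaults: 10 attempts, 60 seconds between rounds), which is where the "Attempt x/10" messages in the recovery logs later in this article come from:

SHOW GLOBAL VARIABLES LIKE 'group_replication_recovery_retry_count';
SHOW GLOBAL VARIABLES LIKE 'group_replication_recovery_reconnect_interval';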

4.2 Cache transaction execution and exit recovery

3) Cache transaction execution

Step 4: Wake up the certification queue processing thread that was blocked in step 2, let this node gradually catch up with the other nodes in the group, and then set it to ONLINE.

Step 5: Once the condition specified by group_replication_recovery_complete_at is met, a Recovery_message::RECOVERY_END_MESSAGE message is sent to notify every node in the group to set this node to ONLINE (the controlling variable is shown after step 7).

Step 6: If at any point during recovery the process cannot continue or an error occurs, the node is first set to the ERROR state; once its replication-related threads have exited, the node sends a leave message and quits the group rather than remaining in the group in the ERROR state.

Step 7: Whether or not an error occurred, reset the recovery parameters and destroy the recovery thread.
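Regarding step 5: the condition is controlled by group_replication_recovery_complete_at, which in MySQL 5.7 defaults to TRANSACTIONS_APPLIED (the member is declared ONLINE only after the cached transactions have been applied); the other possible value is TRANSACTIONS_CERTIFIED.

SHOW GLOBAL VARIABLES LIKE 'group_replication_recovery_complete_at';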

5. MGR Node Management (Adding/Removing Nodes)

The preceding sections are all MGR theory; if you are short on time, you can skip them and follow the practice in this section directly.

5.1 Deleting a Node from a Group

To delete a node from a group, run the stop group_replication statement.

select * from performance_schema.replication_group_member_stats\G ;
select * from performance_schema.replication_group_members;
-- Remove this node from the group
stop group_replication;
select * from performance_schema.replication_group_members;

5.2 Adding a Node to the MGR Cluster

5.2.1 Resuming synchronization after a former MGR node is briefly interrupted

If a former MGR node was shut down and the server has since been restored, it is usually enough to run START GROUP_REPLICATION again, as follows:

show global variables like '%seed%';
show global variables like '%group_repli%';
-- Single-master mode
set global group_replication_enforce_update_everywhere_checks=OFF;
set global group_replication_single_primary_mode=ON;
START GROUP_REPLICATION;
select * from performance_schema.replication_group_members;

5.2.2 Adding a New Node

1) Install MySQL

Install the MySQL software. For details, see the official reference.

  • Dev.mysql.com/doc/refman/…
  • Dev.mysql.com/doc/refman/…
mkdir -p /data1/mysql_3106/{data,etc,logs,tmp}
vim /data1/mysql_3106/etc/my.cnf
# Group replication related parameters
plugin_load_add='group_replication.so'
transaction_write_set_extraction=XXHASH64
loose-group_replication_group_name = "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"
loose-group_replication_start_on_boot = OFF
loose-group_replication_local_address = "192.110.103.42:31061"
loose-group_replication_group_seeds = "192.110.103.41:31061,192.110.103.42:31061,192.110.103.43:31061"
loose-group_replication_bootstrap_group = OFF
loose-group_replication_single_primary_mode = TRUE   # FALSE : multi-primary
loose-group_replication_enforce_update_everywhere_checks = FALSE   # ON : multi-primary

2) Back up master data

# authorization
#set GLOBAL super_read_only=0;
#grant all on *.* to backup@'127.0.0.1' identified by 'backup';
# backup
backup_path=/data1/mysqlbackup/`date +%Y%m%d`/mysql_3106
mkdir -p $backup_path
innobackupex --defaults-file=/data1/mysql_3106/etc/my.cnf -h 127.0.0.1 -P 3106 --user=backup --password=backup --parallel=4 --throttle=4 --no-timestamp --slave-info $backup_path/ >> $backup_path/mybackup.log 2>&1 &
# application
innobackupex --defaults-file=/data1/mysql_3106/etc/my.cnf --apply-log $backup_path
cp /data1/mysql_3106/etc/my.cnf $backup_path/

3) Transfer backup data

scp -pr $backup_path 192.110.103.42:/data1/mysql_3106_backup
# Or transfer with nc
# On the slave (receiver):
nc -l 1234 | tar xzvf - > mysql_3106
# On the sender:
cd $backup_path/
tar czvf - mysql_3106 | nc 192.110.103.42 1234

4) Restore the backup data to the slave

#mkdir -p /data1/mysql_3106/{data,etc,logs,tmp}

Modifying a configuration file:

cp /data1/mysql_3106_backup/my.cnf /data1/mysql_3106/etc/my.cnf
# Adjust the node-specific parameters
vim /data1/mysql_3106/etc/my.cnf
server-id=103423106
loose-group_replication_start_on_boot = OFF
loose-group_replication_local_address = "192.110.103.42:31061"
loose-group_replication_group_seeds = "192.110.103.41:31061,192.110.103.42:31061,192.110.103.43:31061"
loose-group_replication_bootstrap_group = OFF
loose-group_replication_single_primary_mode = TRUE   # FALSE : multi-primary
loose-group_replication_enforce_update_everywhere_checks = FALSE   # ON : multi-primary
report_host = 192.110.103.42
report_port = 3106

Restore backup data:

#
innobackupex [--defaults-file=MY.CNF] --copy-back  [--defaults-group=GROUP-NAME] BACKUP-DIR
innobackupex --defaults-file=/data1/mysql_3106/etc/my.cnf --copy-back  /data1/mysql_3106_backup/
#
chown -R mysql:mysql /data1/mysql_3106

5) Start MySQL

# Start MySQL
su - mysql -c "/usr/local/mysql/bin/mysqld_safe --defaults-file=/data1/mysql_3106/etc/my.cnf &"

6) Install the MGR plug-in

If MGR is not installed, install the MGR plug-in. Otherwise, skip this step.

INSTALL PLUGIN group_replication SONAME 'group_replication.so';
SHOW PLUGINS;

7) The new node starts MGR and joins the cluster

The core steps for a new node to join the MGR cluster are summarized below:

# cat xtrabackup_info -- get the GTID value
binlog_pos = filename 'mysql-bin.xx', position 'xx', GTID of the last change 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-65:1000003-1000006:2000003-2000006'
# On the new node
RESET SLAVE ALL;
RESET MASTER;
set global gtid_purged='aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-65:1000003-1000006:2000003-2000006';
START GROUP_REPLICATION;
select * from performance_schema.replication_group_members;

The following are detailed steps for adding a new node to the MGR cluster.

1. Set a replication account for the new secondary library (optional)

If the new slave was built from an Xtrabackup backup of the MGR cluster, this replication account already exists and this step can be skipped.

SET SQL_LOG_BIN=0;
CREATE USER repl@'192.%.%.%' IDENTIFIED WITH 'mysql_native_password' BY 'repl';
GRANT SUPER,REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO repl@'192.%.%.%';
FLUSH PRIVILEGES;
SET SQL_LOG_BIN=1;

2. View the last GTID executed by the master during backup

Xtrabackup backs up the data set but not the binlogs. You must therefore first obtain the GTIDs of the last transactions included in the backup and explicitly tell the slave to skip them; otherwise the slave would fetch those binlogs from the master and execute them again, applying the same data twice.

They can be obtained from the xtrabackup_info file in the backup directory. If xtrabackup was not used, you can instead take the value of show global variables like "gtid_executed"; on the master at backup time (note that the value you want is gtid_executed, not gtid_purged).

Check the xtrabackup_info:

cat xtrabackup_info
uuid = b30fde6b-223e-11ea-bbbe-a0369fac2de4
tool_command = --defaults-file=/data1/mysql_3106/etc/my.cnf -h 127.0.0.1 -P 3106 --user=backup --password=... --parallel=4 --throttle=4 --no-timestamp --slave-info /data1/mysqlbackup/20191219/mysql_3106/
ibbackup_version = 2.4.8
server_version = 5.7.23-log
binlog_pos = filename 'mysql-bin.000008', position '1313', GTID of the last change 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-54:1000003-1000006:2000003-2000006'

3. Change master on the new slave library

Reset the slave and master metadata on the new node, then change master for the recovery channel.

-- Check the current GTID state
show global variables like '%gtid%';
show master status;
-- Clear the local binlog/GTID information
RESET SLAVE ALL;
RESET MASTER;
-- gtid_executed and gtid_purged are now empty
show global variables like '%gtid%';
+---------------------------------------------------+---------+
| Variable_name                                      | Value   |
+---------------------------------------------------+---------+
| binlog_gtid_simple_recovery                        | ON      |
| enforce_gtid_consistency                           | ON      |
| group_replication_allow_local_disjoint_gtids_join  | OFF     |
| group_replication_gtid_assignment_block_size       | 1000000 |
| gtid_executed                                      |         |
| gtid_executed_compression_period                   | 1000    |
| gtid_mode                                          | ON      |
| gtid_owned                                         |         |
| gtid_purged                                        |         |
| session_track_gtids                                | OFF     |
+---------------------------------------------------+---------+
10 rows in set (0.01 sec)

Set gtid_purged to the GTID value recorded in xtrabackup_info (this step is critical; without it the recovery will fail), then change master for the recovery channel.

-- Set gtid_purged to the value from xtrabackup_info; this step is critical
set global gtid_purged='aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-54:1000003-1000006:2000003-2000006';
show global variables like '%gtid%';
-- Change master for the recovery channel
CHANGE MASTER TO MASTER_USER='repl', MASTER_PASSWORD='repl' FOR CHANNEL 'group_replication_recovery';

4. Start group replication (single-master mode):

-- Check the current state
select * from performance_schema.replication_group_members;
select * from performance_schema.replication_group_member_stats\G ;
show global variables like 'group_replication%';
show global variables like 'group_replication_single%';
-- As described in section 1, when group_replication_bootstrap_group=ON is enabled, starting group replication
-- on that node immediately creates (bootstraps) the group. It is ON only on the first node; all other nodes keep it OFF.
SET GLOBAL group_replication_bootstrap_group=OFF;
-- Single-master mode
set global group_replication_enforce_update_everywhere_checks=OFF;
set global group_replication_single_primary_mode=ON;
set global group_replication_allow_local_disjoint_gtids_join=ON;
-- Check
show global variables like 'group_replication_bootstrap_group';
show global variables like 'group_replication_enforce_update_everywhere_checks';
show global variables like 'group_replication_single%';
show global variables like 'group_replication_allow_local_disjoint_gtids_join';
-- Start group replication
START GROUP_REPLICATION;
-- Check the result
select * from performance_schema.replication_group_members;

Note: it is best not to modify any data on the node before it joins the cluster; otherwise the following error will occur.

  • Segmentfault.com/a/119000001…
START GROUP_REPLICATION;
[ERROR] Plugin group_replication reported: 'This member has more executed transactions than those present in the group.
Local transactions: c3c274ff-c63e-11e7-b339-00163e0c0288:1-4
Group transactions: 2e6bfa69-0439-41c9-add7-795a9acfd499:1-10,
c5898a84-c63e-11e7-bc8b-00163e0af475:1-4'
# Workaround: allow the node to join even though its local GTIDs differ from the group's
set global group_replication_allow_local_disjoint_gtids_join=ON;

5.2.3 Adding a node back after repairing inconsistent data

If a new node, or a node being restored, has been down for so long that the binlogs it needs have already been purged on the master, starting GROUP_REPLICATION alone cannot recover it; you must first restore a backup of the MGR cluster data manually. MGR has no SST/IST concept; "catching up on data" is implemented entirely through GTIDs and binlogs.

Note: for a node whose data is inconsistent or whose replication is broken, the repair procedure can be exactly the same as adding a new node in the previous section.

1. [Phenomenon]

Node 192.110.103.42:3106 was down for too long and the binlogs it needs have already been purged on the MGR master, so it stays in RECOVERING or ERROR and cannot rejoin MGR.

Note: a node that was down asks the surviving members of the cluster whether they can still supply the missing binlogs. If they can, the data is transferred normally and the node recovers; if the logs the failed node needs no longer exist, the node cannot rejoin the cluster.
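Whether the surviving members can still supply the missing binlogs can be checked manually with GTID set arithmetic; a minimal sketch (the value pasted into GTID_SUBTRACT is the gtid_executed taken from the failed node, shown here as a hypothetical placeholder). If the result is non-empty, the failed node must be re-seeded from a backup, as done below.

-- On the failed node: what it has already executed
SELECT @@global.gtid_executed;

-- On the surviving master/donor: anything returned here was purged
-- and can no longer be fetched through the recovery channel
SELECT GTID_SUBTRACT(@@global.gtid_purged,
                     '<gtid_executed of the failed node>') AS missing_purged_gtids;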

mysql> show variables like 'group_replication_group_seeds';
+-------------------------------+-----------------------------------------------------------------+
| Variable_name                 | Value                                                           |
+-------------------------------+-----------------------------------------------------------------+
| group_replication_group_seeds | 192.110.103.41:31061,192.110.103.42:31061,192.110.103.43:31061 |
+-------------------------------+-----------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> start group_replication;
Query OK, 0 rows affected (3.35 sec)

mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+----------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST    | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+----------------+-------------+--------------+
| group_replication_applier | 509810ee-f3d7-11e9-a7d5-a0369fac2de4 | 192.110.103.41 |        3106 | ONLINE       |
| group_replication_applier | 53e462be-f3d7-11e9-9125-a0369fa6cce4 | 192.110.103.42 |        3106 | RECOVERING   |
| group_replication_applier | ee4a9cec-f3d5-11e9-9ded-a0369fa6cd30 | 192.110.103.43 |        3106 | ONLINE       |
+---------------------------+--------------------------------------+----------------+-------------+--------------+
3 rows in set (0.00 sec)

mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+----------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST    | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+----------------+-------------+--------------+
| group_replication_applier | 53e462be-f3d7-11e9-9125-a0369fa6cce4 | 192.110.103.42 |        3106 | ERROR        |
+---------------------------+--------------------------------------+----------------+-------------+--------------+
1 row in set (0.00 sec)

Error log:

Master has purged binary logs containing GTIDs that the slave requires.

2019-10-24T08:15:11.015032Z 198 [ERROR] Error reading packet from server for channel 'group_replication_recovery': The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires. (server_errno=1236)
2019-10-24T08:15:11.015060Z 198 [ERROR] Slave I/O for channel 'group_replication_recovery': Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.', Error_code: 1236
2019-10-24T08:15:11.015076Z 198 [Note] Slave I/O thread exiting for channel 'group_replication_recovery', read up to log 'FIRST', position 4
2019-10-24T08:15:11.015131Z 196 [Note] Plugin group_replication reported: 'Terminating existing group replication donor connection and purging the corresponding logs.'
2019-10-24T08:15:11.015164Z 199 [Note] Error reading relay log event for channel 'group_replication_recovery': slave SQL thread was killed
2019-10-24T08:15:11.015530Z 199 [Note] Slave SQL thread for channel 'group_replication_recovery' exiting, replication stopped in log 'FIRST' at position 0
2019-10-24T08:15:11.017239Z 196 [Note] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_recovery' executed'. Previous state master_host='192.110.103.41', master_port= 3106, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
2019-10-24T08:15:11.019165Z 196 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 2/10'

2. [Processing]

In this case you can rebuild the slave with Xtrabackup, but this section uses mysqldump instead:

1) Install a new DB instance

If the secondary database can still be started normally, you can skip this step. You only need to clear the local GTID information and import the mysqldump data.

A) Install a new MySQL server

B) Alternatively, keep the existing instance and delete its current data.

2) Create secondary libraries with mysqldump

Use mysqldump to make a full backup of the data.

The mysqldump backup:

/usr/local/mysql/bin/mysqldump -h 127.0.0.1 -P3106 --all-databases --default-character-set=utf8 -R -q --triggers --master-data=2 --single-transaction > mysql3106_online.sql

Transferring data to slave library:

scp -pr mysql3106_online.sql 192.110.103.42:/data1/
# Or with nc
# On the slave (receiver):
nc -l 1234 | tar xzvf - > mysql3106_online.sql
# On the sender:
tar czvf - mysql3106_online.sql | nc 192.110.103.42 1234

Clearing the local GTID information: stop the group replication threads, turn off read-only, and reset the local binlog/GTID state.

-- Run on the slave
show variables like 'group_replication_group_seeds';
STOP GROUP_REPLICATION;
RESET SLAVE ALL;
set global super_read_only=0;
show master logs;
reset master;
show master logs;
show global variables like '%gtid%';

Importing backup data:

mysql -h 127.0.0.1 -P3106 -uroot -pxxx --default-character-set=utf8 < mysql3106_online.sql

3) Install the MGR plug-in

INSTALL PLUGIN group_replication SONAME 'group_replication.so';
SHOW PLUGINS;

4) Start MGR recovery:

show global variables like '%gtid%';
set global group_replication_enforce_update_everywhere_checks=OFF;
set global group_replication_single_primary_mode=ON;
START GROUP_REPLICATION;
# Check
show global variables like 'group_replication_bootstrap_group';
show global variables like 'group_replication_enforce_update_everywhere_checks';
show global variables like 'group_replication_single%';
show global variables like 'group_replication_allow_local_disjoint_gtids_join';

5) Result:

Because this is a freshly rebuilt slave, it can catch up with the group; START GROUP_REPLICATION succeeds and the node is added back to MGR.

START GROUP_REPLICATION;
select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+----------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST    | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+----------------+-------------+--------------+
| group_replication_applier | 509810ee-f3d7-11e9-a7d5-a0369fac2de4 | 192.110.103.41 |        3106 | ONLINE       |
| group_replication_applier | 53e462be-f3d7-11e9-9125-a0369fa6cce4 | 192.110.103.42 |        3106 | ONLINE       |
| group_replication_applier | ee4a9cec-f3d5-11e9-9ded-a0369fa6cd30 | 192.110.103.43 |        3106 | ONLINE       |
+---------------------------+--------------------------------------+----------------+-------------+--------------+

6. Failover test

6.1 Current Status

The cluster is currently running in single-master mode, as follows:

select * from performance_schema.global_variables where VARIABLE_NAME in ('group_replication_single_primary_mode','group_replication_enforce_update_everywhere_checks');
+-----------------------------------------------------+----------------+
| VARIABLE_NAME                                       | VARIABLE_VALUE |
+-----------------------------------------------------+----------------+
| group_replication_enforce_update_everywhere_checks  | OFF            |
| group_replication_single_primary_mode               | ON             |
+-----------------------------------------------------+----------------+
2 rows in set (0.00 sec)

select * from performance_schema.global_variables where VARIABLE_NAME like '%read_only';
+-----------------------+----------------+
| VARIABLE_NAME         | VARIABLE_VALUE |
+-----------------------+----------------+
| innodb_read_only      | OFF            |
| read_only             | OFF            |
| super_read_only       | OFF            |
+-----------------------+----------------+

> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+----------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST    | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+----------------+-------------+--------------+
| group_replication_applier | 509810ee-f3d7-11e9-a7d5-a0369fac2de4 | 192.110.103.41 |        3106 | ONLINE       |
| group_replication_applier | 53e462be-f3d7-11e9-9125-a0369fa6cce4 | 192.110.103.42 |        3106 | ONLINE       |
| group_replication_applier | ee4a9cec-f3d5-11e9-9ded-a0369fa6cd30 | 192.110.103.43 |        3106 | ONLINE       |
+---------------------------+--------------------------------------+----------------+-------------+--------------+

Current master:

Only group replication in the single-master model needs to look up which node is the master; the multi-master model has no master/slave concept, so no lookup is needed.

SELECT * FROM performance_schema.global_status WHERE VARIABLE_NAME='group_replication_primary_member';
+----------------------------------+--------------------------------------+
| VARIABLE_NAME                    | VARIABLE_VALUE                       |
+----------------------------------+--------------------------------------+
| group_replication_primary_member | 509810ee-f3d7-11e9-a7d5-a0369fac2de4 |
+----------------------------------+--------------------------------------+

-- Show the master with host information
SELECT ta.*, tb.MEMBER_HOST, tb.MEMBER_PORT, tb.MEMBER_STATE
FROM performance_schema.global_status ta, performance_schema.replication_group_members tb
WHERE ta.VARIABLE_NAME='group_replication_primary_member' and ta.VARIABLE_VALUE=tb.MEMBER_ID;
+----------------------------------+--------------------------------------+----------------+-------------+--------------+
| VARIABLE_NAME                    | VARIABLE_VALUE                       | MEMBER_HOST    | MEMBER_PORT | MEMBER_STATE |
+----------------------------------+--------------------------------------+----------------+-------------+--------------+
| group_replication_primary_member | 509810ee-f3d7-11e9-a7d5-a0369fac2de4 | 192.110.103.41 |        3106 | ONLINE       |
+----------------------------------+--------------------------------------+----------------+-------------+--------------+
1 row in set (0.00 sec)

6.2 Primary node failure

1) Graceful shutdown of the master

Run the kill command to stop the master.

# Master machine 192.110.103.41: kill mysql
$ ps aux | grep mysql
mysql  122456  0.0  0.0  106252  1444 ?  S  Oct23  0:00  /bin/sh /usr/local/mysql-5.7.23/bin/mysqld_safe --defaults-file=/data1/mysql_3106/etc/my.cnf
mysql  123471  /usr/local/mysql-5.7.23/bin/mysqld --defaults-file=/data1/mysql_3106/etc/my.cnf --basedir=/usr/local/mysql-5.7.23 --datadir=/data1/mysql_3106/data --plugin-dir=/usr/local/mysql-5.7.23/lib/plugin --log-error=/data1/mysql_3106/logs/mysqld.err --open-files-limit=8192 --pid-file=/data1/mysql_3106/tmp/mysql.pid --socket=/data1/mysql_3106/tmp/mysql.sock --port=3106

$ kill 122456 123471 ; tail -f /data1/mysql_3106/logs/mysqld.err

Log:

Original DB Master logs

$ tail -f /data1/mysql_3106/logs/mysqld.err
2019-10-24T03:10:32.746843Z 0 [Warning] /usr/local/mysql-5.7.23/bin/mysqld: Forcing close of thread 31  user: 'root'
2019-10-24T03:10:32.746873Z 0 [Note] Plugin group_replication reported: '...'
2019-10-24T03:10:35.797258Z 0 [Note] Plugin group_replication reported: 'Group membership changed: ...'
2019-10-24T03:10:40.799923Z 0 [Note] Plugin group_replication reported: 'auto_increment_increment is reset to 1'
2019-10-24T03:10:40.799954Z 0 [Note] Plugin group_replication reported: 'auto_increment_offset is reset to 1'
2019-10-24T03:10:40.800110Z 7 [Note] Error reading relay log event for channel 'group_replication_applier': slave SQL thread was killed
2019-10-24T03:10:40.800431Z 7 [Note] Slave SQL thread for channel 'group_replication_applier' exiting, replication stopped in log 'FIRST' at position 65
2019-10-24T03:10:40.800652Z 4 [Note] Plugin group_replication reported: 'The group replication applier thread was killed'
2019-10-24T03:10:40.800787Z 0 [Note] Plugin group_replication reported: 'Plugin 'group_replication' has been stopped.'
2019-10-24T03:10:40.802272Z 0 [Note] Shutting down plugin 'group_replication'; 'All Group Replication server observers have been successfully unregistered'
...
[Note] /usr/local/mysql-5.7.23/bin/mysqld: Shutdown complete
mysqld_safe mysqld from pid file /data1/mysql_3106/tmp/mysql.pid ended

New DB Master log:

$ tail -n 30 /data1/mysql_3106/logs/mysqld.err
2019-10-23T11:11:00.671705Z 0 [Note] Plugin group_replication reported: 'XCom protocol version: 3'
2019-10-23T11:11:00.671736Z 0 [Note] Plugin group_replication reported: 'XCom initialized and ready to accept incoming connections on port 31061'
2019-10-23T11:11:05.400823Z 2 [Note] Plugin group_replication reported: 'This server is working as secondary member with primary member address 192.110.103.41:3106.'
2019-10-23T11:11:05.401138Z 20 [Note] Plugin group_replication reported: 'Establishing group recovery connection with a possible donor. Attempt 1/10'
2019-10-23T11:11:05.401143Z 0 [Note] Plugin group_replication reported: 'Group membership changed to 192.110.103.41:3106, 192.110.103.42:3106 on view 15718289704352993:2.'
2019-10-23T11:11:05.402757Z 20 [Note] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_recovery' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='192.110.103.41', master_port= 3106, master_log_file='', master_log_pos= 4, master_bind=''.
2019-10-23T11:11:05.404717Z 20 [Note] Plugin group_replication reported: 'Establishing connection to a group replication recovery donor 509810ee-f3d7-11e9-a7d5-a0369fac2de4 at 192.110.103.41 port: 3106.'
2019-10-23T11:11:05.404998Z 22 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2019-10-23T11:11:05.406423Z 22 [Note] Slave I/O thread for channel 'group_replication_recovery': connected to master 'repl@192.110.103.41:3106', replication started in log 'FIRST' at position 4
2019-10-23T11:11:05.442349Z 23 [Note] Slave SQL thread for channel 'group_replication_recovery' initialized, starting replication in log 'FIRST' at position 0, relay log './mysql-relay-bin-group_replication_recovery.000001' position: 4
2019-10-23T11:11:05.461483Z 20 [Note] Plugin group_replication reported: 'Terminating existing group replication donor connection and purging the corresponding logs.'
2019-10-23T11:11:05.461910Z 23 [Note] Slave SQL thread for channel 'group_replication_recovery' exiting, replication stopped in log 'mysql-bin.000002' at position 934
2019-10-23T11:11:05.462119Z 22 [Note] Slave I/O thread killed while reading event for channel 'group_replication_recovery'
2019-10-23T11:11:05.462143Z 22 [Note] Slave I/O thread exiting for channel 'group_replication_recovery', read up to log 'mysql-bin.000002', position 934
2019-10-23T11:11:05.523357Z 20 [Note] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_recovery' executed'. Previous state master_host='192.110.103.41', master_port= 3106, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
2019-10-23T11:11:05.526137Z 0 [Note] Plugin group_replication reported: 'This server was declared online within the replication group'
2019-10-23T11:15:33.426684Z 0 [Note] Plugin group_replication reported: 'Members joined the group: 192.110.103.43:3106'
2019-10-23T11:15:33.426832Z 0 [Note] Plugin group_replication reported: 'Group membership changed to 192.110.103.41:3106, 192.110.103.42:3106, 192.110.103.43:3106 on view 15718289704352993:3.'
2019-10-23T11:15:34.094942Z 0 [Note] Plugin group_replication reported: 'The member with address 192.110.103.43:3106 was declared online within the replication group'
2019-10-24T03:10:32.839967Z 0 [Warning] Plugin group_replication reported: 'Members removed from the group: 192.110.103.41:3106'
[Note] Plugin group_replication reported: 'Primary server with address 192.110.103.41:3106 left the group. Electing new Primary.'
2019-10-24T03:10:32.840052Z 0 [Note] Plugin group_replication reported: 'A new primary with address 192.110.103.42:3106 was elected, enabling conflict detection until the new primary applies all relay logs.'
2019-10-24T03:10:32.840086Z 41 [Note] Plugin group_replication reported: 'This server is working as primary member.'
2019-10-24T03:10:32.840107Z 0 [Note] Plugin group_replication reported: 'Group membership changed to 192.110.103.42:3106, 192.110.103.43:3106 on view 15718289704352993:4.'
2019-10-24T03:12:01.677869Z 4 [Note] Plugin group_replication reported: 'Primary had applied all relay logs, disabled conflict detection'

DB automatic switching:

You can see that the DB has successfully completed the automatic switchover.

mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+----------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST    | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+----------------+-------------+--------------+
| group_replication_applier | 53e462be-f3d7-11e9-9125-a0369fa6cce4 | 192.110.103.42 |        3106 | ONLINE       |
| group_replication_applier | ee4a9cec-f3d5-11e9-9ded-a0369fa6cd30 | 192.110.103.43 |        3106 | ONLINE       |
+---------------------------+--------------------------------------+----------------+-------------+--------------+
2 rows in set (0.00 sec)

mysql> SELECT ta.*, tb.MEMBER_HOST, tb.MEMBER_PORT, tb.MEMBER_STATE
       FROM performance_schema.global_status ta, performance_schema.replication_group_members tb
       WHERE ta.VARIABLE_NAME='group_replication_primary_member' and ta.VARIABLE_VALUE=tb.MEMBER_ID;
+----------------------------------+--------------------------------------+----------------+-------------+--------------+
| VARIABLE_NAME                    | VARIABLE_VALUE                       | MEMBER_HOST    | MEMBER_PORT | MEMBER_STATE |
+----------------------------------+--------------------------------------+----------------+-------------+--------------+
| group_replication_primary_member | 53e462be-f3d7-11e9-9125-a0369fa6cce4 | 192.110.103.42 |        3106 | ONLINE       |
+----------------------------------+--------------------------------------+----------------+-------------+--------------+
1 row in set (0.00 sec)

Restore the original master and rejoin the cluster:

After the original master is back up, simply run START GROUP_REPLICATION to rejoin it to the group.

> select @@group_replication_bootstrap_group;
+-------------------------------------+
| @@group_replication_bootstrap_group |
+-------------------------------------+
|                                   0 |
+-------------------------------------+
> START GROUP_REPLICATION;
> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+----------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST    | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+----------------+-------------+--------------+
| group_replication_applier | 509810ee-f3d7-11e9-a7d5-a0369fac2de4 | 192.110.103.41 |        3106 | ONLINE       |
| group_replication_applier | 53e462be-f3d7-11e9-9125-a0369fa6cce4 | 192.110.103.42 |        3106 | ONLINE       |
| group_replication_applier | ee4a9cec-f3d5-11e9-9ded-a0369fa6cd30 | 192.110.103.43 |        3106 | ONLINE       |
+---------------------------+--------------------------------------+----------------+-------------+--------------+

2) Stop the master abnormally

Kill -9 the mysqld process on the current master, 192.110.103.42. The logs show that a new primary is automatically elected and the master changes to 192.110.103.41.

# kill -9 the master
$ kill -9 mysql_PID
# Log on 192.110.103.41
tail -f /data1/mysql_3106/logs/mysqld.err
2019-10-24T06:17:31.473849Z 0 [Warning] Plugin group_replication reported: 'Member with address 192.110.103.42:3106 has become unreachable.'
2019-10-24T06:17:32.479299Z 0 [Warning] Plugin group_replication reported: 'Members removed from the group: 192.110.103.42:3106'
2019-10-24T06:17:32.479323Z 0 [Note] Plugin group_replication reported: 'Primary server with address 192.110.103.42:3106 left the group. Electing new Primary.'
2019-10-24T06:17:32.479395Z 0 [Note] Plugin group_replication reported: 'A new primary with address 192.110.103.41:3106 was elected, enabling conflict detection until the new primary applies all relay logs.'
2019-10-24T06:17:32.479439Z 37 [Note] Plugin group_replication reported: 'This server is working as primary member.'
2019-10-24T06:17:32.479465Z 0 [Note] Plugin group_replication reported: 'Group membership changed to 192.110.103.41:3106, 192.110.103.43:3106 on view 15718289704352993:6.'
mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+----------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST    | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+----------------+-------------+--------------+
| group_replication_applier | 509810ee-f3d7-11e9-a7d5-a0369fac2de4 | 192.110.103.41 |        3106 | ONLINE       |
| group_replication_applier | ee4a9cec-f3d5-11e9-9ded-a0369fa6cd30 | 192.110.103.43 |        3106 | ONLINE       |
+---------------------------+--------------------------------------+----------------+-------------+--------------+
2 rows in set (0.00 sec)

mysql> SELECT ta.*, tb.MEMBER_HOST, tb.MEMBER_PORT, tb.MEMBER_STATE
       FROM performance_schema.global_status ta, performance_schema.replication_group_members tb
       WHERE ta.VARIABLE_NAME='group_replication_primary_member' and ta.VARIABLE_VALUE=tb.MEMBER_ID;
+----------------------------------+--------------------------------------+----------------+-------------+--------------+
| VARIABLE_NAME                    | VARIABLE_VALUE                       | MEMBER_HOST    | MEMBER_PORT | MEMBER_STATE |
+----------------------------------+--------------------------------------+----------------+-------------+--------------+
| group_replication_primary_member | 509810ee-f3d7-11e9-a7d5-a0369fac2de4 | 192.110.103.41 |        3106 | ONLINE       |
+----------------------------------+--------------------------------------+----------------+-------------+--------------+
1 row in set (0.00 sec)

Reference:

  • Dev.mysql.com/doc/refman/…
  • Dev.mysql.com/doc/refman/…
  • Tao Teacher operation notes: MySQL GTID introduction and building a GTID-based slave
  • Mysql Cluster
  • Segmentfault.com/a/119000001…
  • www.cnblogs.com/163yun/p/98…
  • www.cnblogs.com/kevingrace/…
  • MGR stuck in RECOVERING