A failover test of the Consul cluster shows how its high-availability mechanism works.

Consul Node   IP Address        Role              Consul Version   Log Path
consul01      192.168.101.11    server            0.9.3            /var/log/consul
consul02      192.168.101.12    server            0.9.3            /var/log/consul
consul03      192.168.101.13    server (leader)   0.9.3            /var/log/consul

Note: we are verifying the Consul cluster's high availability, not the geo failover capability that ships with Consul.

Initial cluster status (client nodes are omitted)

[root@consul01 consul]# ./consul operator raft list-peers
Node      ID                   Address              State     Voter  RaftProtocol
consul01  192.168.101.11:8300  192.168.101.11:8300  follower  true   2
consul02  192.168.101.12:8300  192.168.101.12:8300  follower  true   2
consul03  192.168.101.13:8300  192.168.101.13:8300  leader    true   2

[root@consul01 consul]# ./consul members
Node      Address              Status  Type    Build  Protocol  DC  Segment
consul01  192.168.101.11:8301  alive   server  0.9.3  2         dc
consul02  192.168.101.12:8301  alive   server  0.9.3  2         dc
consul03  192.168.101.13:8301  alive   server  0.9.3  2         dc

The Consul cluster in a TSOP domain consists of three server nodes, so in theory it can tolerate the failure of at most one server node. We therefore test whether a single server node failure affects the cluster.
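For reference, this tolerance follows from Raft's quorum requirement of floor(N/2) + 1 servers (standard Raft arithmetic, not specific to this environment):

Servers   Quorum   Failures tolerated
1         1        0
3         2        1
5         3        2

With three servers the quorum is 2, so the cluster survives the loss of one server but can no longer elect a leader or commit writes if two are lost.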

Consul Cluster simulated fault test

§1. Stop a Follower Server node

Take the consul01 node as an example

[root@consul01 consul]# systemctl stop consul 

Logs of other nodes

[root@consul02 ~]# tail -100 /var/log/consul
2019/02/12 02:30:38 [INFO] serf: EventMemberFailed: consul01 192.168.101.11
2019/02/12 02:30:38 [INFO] consul: Handled member-failed event for server "consul01.dc" in area "wan"
2019/02/12 02:30:39 [INFO] serf: EventMemberLeave: consul01 192.168.101.11

View cluster information on other nodes

[root@consul03 consul]# ./consul operator raft list-peers
Node      ID                   Address              State     Voter  RaftProtocol
consul02  192.168.101.12:8300  192.168.101.12:8300  follower  true   2
consul03  192.168.101.13:8300  192.168.101.13:8300  leader    true   2

[root@consul03 consul]# ./consul members
Node      Address              Status  Type    Build  Protocol  DC  Segment
consul01  192.168.101.11:8301  left    server  0.9.3  2         dc
consul02  192.168.101.12:8301  alive   server  0.9.3  2         dc
consul03  192.168.101.13:8301  alive   server  0.9.3  2         dc

Check whether the cluster still works properly: query the registered services; if no service exists, register one manually through the Consul API (an example call is sketched after the output below).

[root@consul03 consul]# ./consul catalog services
consul
test-csdemo-v0-snapshot
test-zuul-v0-snapshot
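If no service has been registered yet, a throwaway one can be created through the agent's HTTP API. A minimal sketch, assuming the local agent listens on the default HTTP port 8500; the service name test-svc and port 8080 are made up for illustration:

# Register a hypothetical test service against the local agent
curl -s -X PUT http://127.0.0.1:8500/v1/agent/service/register \
     -d '{"Name": "test-svc", "Port": 8080}'

# Confirm it appears in the catalog
curl -s http://127.0.0.1:8500/v1/catalog/services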

As you can see, the stopped server node is in the left state, but the cluster is still available.

Conclusion: stopping a single Follower Server node does not affect the Consul cluster service.

Restore this server node (start the consul service on consul01 again) and check the cluster:

[root@consul03 consul]# ./consul operator raft list-peers
Node      ID                   Address              State     Voter  RaftProtocol
consul02  192.168.101.12:8300  192.168.101.12:8300  follower  true   2
consul01  192.168.101.11:8300  192.168.101.11:8300  follower  true   2
consul03  192.168.101.13:8300  192.168.101.13:8300  leader    true   2

[root@consul03 consul]# ./consul members
Node      Address              Status  Type    Build  Protocol  DC  Segment
consul01  192.168.101.11:8301  alive   server  0.9.3  2         dc
consul02  192.168.101.12:8301  alive   server  0.9.3  2         dc
consul03  192.168.101.13:8301  alive   server  0.9.3  2         dc

The other nodes have detected this node and added it back:

[root@consul02 ~]# tail -100 /var/log/consul
2019/02/12 02:43:51 [INFO] serf: EventMemberJoin: consul01.dc 192.168.101.11
2019/02/12 02:43:51 [INFO] consul: Handled member-join event for server "consul01.dc" in area "wan"
2019/02/12 02:43:51 [INFO] serf: EventMemberJoin: consul01 192.168.101.11
2019/02/12 02:43:51 [INFO] consul: Adding LAN server consul01 (Addr: tcp/192.168.101.11:8300) (DC: dc)

§2. Stop the Leader Server node

Take the consul03 node as an example

[root@consul03 consul]# systemctl stop consul

The other Follower Server nodes detect that the Leader node is offline and re-elect a Leader.

[root@consul02 ~]# tail -100 /var/log/consul
2019/02/12 02:48:27 [INFO] serf: EventMemberLeave: consul03.dc 192.168.101.13
2019/02/12 02:48:27 [INFO] consul: Handled member-leave event for server "consul03.dc" in area "wan"
2019/02/12 02:48:28 [INFO] serf: EventMemberLeave: consul03 192.168.101.13
2019/02/12 02:48:28 [INFO] consul: Removing LAN server consul03 (Addr: tcp/192.168.101.13:8300) (DC: dc)
2019/02/12 02:48:37 [WARN] raft: Rejecting vote request from 192.168.101.11:8300 since we have a leader: 192.168.101.13:8300
2019/02/12 02:48:39 [ERR] agent: Coordinate update error: No cluster leader
2019/02/12 02:48:39 [WARN] raft: Heartbeat timeout from "192.168.101.13:8300" reached, starting election
2019/02/12 02:48:39 [INFO] raft: Node at 192.168.101.12:8300 [Candidate] entering Candidate state in term 5
2019/02/12 02:48:43 [ERR] http: Request GET /v1/catalog/services, error: No cluster leader from=127.0.0.1:44370
2019/02/12 02:48:43 [ERR] http: Request GET /v1/catalog/nodes, error: No cluster leader from=127.0.0.1:36548
2019/02/12 02:48:44 [INFO] raft: Node at 192.168.101.12:8300 [Follower] entering Follower state (Leader: "")
2019/02/12 02:48:44 [INFO] consul: New leader elected: consul01

From the log, you can see that consul01 has been elected as the new leader. Check the cluster information:

[root@consul02 consul]# ./consul operator raft list-peers
Node      ID                   Address              State     Voter  RaftProtocol
consul02  192.168.101.12:8300  192.168.101.12:8300  follower  true   2
consul01  192.168.101.11:8300  192.168.101.11:8300  leader    true   2

[root@consul02 consul]# ./consul members
Node      Address              Status  Type    Build  Protocol  DC  Segment
consul01  192.168.101.11:8301  alive   server  0.9.3  2         dc
consul02  192.168.101.12:8301  alive   server  0.9.3  2         dc
consul03  192.168.101.13:8301  left    server  0.9.3  2         dc

You can see that the stopped Leader Server node is in the left state, but the cluster is still available. Query the services to verify:

[root@consul02 consul]# ./consul catalog services
consul
test-csdemo-v0-snapshot
test-zuul-v0-snapshot
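Besides consul operator raft list-peers, the status HTTP API is another quick way to confirm which server currently holds leadership; a minimal sketch, assuming the local agent's default HTTP port 8500:

# Address of the current Raft leader, e.g. "192.168.101.11:8300"
curl -s http://127.0.0.1:8500/v1/status/leader

# All Raft peers known to the cluster
curl -s http://127.0.0.1:8500/v1/status/peers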

Conclusion: stopping the Leader Server node does not affect the Consul cluster service; the remaining servers elect a new Leader automatically.

Then restore the node:

[root@consul03 consul]# systemctl start consul

You can see from its logs that it is now a Follower server

[root@consul03 ~]# tail -f /var/log/consul
2019/02/12 03:01:33 [INFO] raft: Node at 192.168.101.13:8300 [Follower] entering Follower state (Leader: "")
2019/02/12 03:01:33 [INFO] serf: Ignoring previous leave in snapshot
2019/02/12 03:01:33 [INFO] agent: Retry join LAN is supported for: aws azure gce softlayer
2019/02/12 03:01:33 [INFO] agent: Joining LAN cluster...
2019/02/12 03:01:33 [INFO] agent: (LAN) joining: [consul01 consul02 consul03]
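The "(LAN) joining: [consul01 consul02 consul03]" line comes from the agent's join settings. The article does not show how the agents are started, so the following is only an illustrative sketch of server flags that would produce this behavior (data directory and bind address are assumptions):

# Illustrative only: a server agent with retry-join pointed at all three nodes
consul agent -server -bootstrap-expect=3 -datacenter=dc \
    -data-dir=/var/lib/consul -bind=192.168.101.13 \
    -retry-join=consul01 -retry-join=consul02 -retry-join=consul03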

Node information is also updated on the new leader and follower

[root@consul01 ~]# tail -f /var/log/consul
2019/02/12 03:01:33 [INFO] serf: EventMemberJoin: consul03.dc 192.168.101.13
2019/02/12 03:01:33 [INFO] consul: Handled member-join event for server "consul03.dc" in area "wan"
2019/02/12 03:01:33 [INFO] serf: EventMemberJoin: consul03 192.168.101.13
2019/02/12 03:01:33 [INFO] consul: Adding LAN server consul03 (Addr: tcp/192.168.101.13:8300) (DC: dc)
2019/02/12 03:01:33 [INFO] raft: Updating configuration with AddStaging (192.168.101.13:8300, 192.168.101.13:8300) to [{Suffrage: Voter ID: 192.168.101.12:8300 Address: 192.168.101.12:8300} {Suffrage: Voter ID: 192.168.101.11:8300 Address: 192.168.101.11:8300} {Suffrage: Voter ID: 192.168.101.13:8300 Address: 192.168.101.13:8300}]
2019/02/12 03:01:33 [INFO] raft: Added peer 192.168.101.13:8300, starting replication
2019/02/12 03:01:33 [WARN] raft: AppendEntries to {Voter 192.168.101.13:8300 192.168.101.13:8300} rejected, sending older logs (next: 394016)
2019/02/12 03:01:33 [INFO] consul: member 'consul03' joined, marking health alive
2019/02/12 03:01:33 [INFO] raft: pipelining replication to peer {Voter 192.168.101.13:8300 192.168.101.13:8300}

Check the cluster information again

[root@consul01 consul]# ./consul operator raft list-peers
Node      ID                   Address              State     Voter  RaftProtocol
consul02  192.168.101.12:8300  192.168.101.12:8300  follower  true   2
consul01  192.168.101.11:8300  192.168.101.11:8300  leader    true   2
consul03  192.168.101.13:8300  192.168.101.13:8300  follower  true   2

[root@consul01 consul]# ./consul members
Node      Address              Status  Type    Build  Protocol  DC  Segment
consul01  192.168.101.11:8301  alive   server  0.9.3  2         dc
consul02  192.168.101.12:8301  alive   server  0.9.3  2         dc
consul03  192.168.101.13:8301  alive   server  0.9.3  2         dc

Throughout this process, only the leader role changed; the services the cluster provides externally were not affected.
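To watch availability during a failover like this, a simple polling loop against the catalog can be left running on any node while a server is stopped; a minimal sketch, assuming the local agent's HTTP API on port 8500:

# Poll the service catalog once per second; -f makes curl fail on HTTP errors
# such as the "No cluster leader" responses seen in the election logs above
while true; do
    if curl -sf http://127.0.0.1:8500/v1/catalog/services > /dev/null; then
        echo "$(date '+%T') catalog query OK"
    else
        echo "$(date '+%T') catalog query FAILED"
    fi
    sleep 1
done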