Translated from: https://www.oschina.net/translate/mysql-high-availability-at-github

Original article: https://githubengineering.com/mysql-high-availability-at-github/

GitHub uses MySQL as the primary store for all non-Git data, and its availability is critical to GitHub’s operation. The GitHub site itself, the GitHub API, authentication, and more all require database access. We run multiple MySQL clusters to support different services and tasks. Our clusters use a classic master-replica setup, in which a single node per cluster (the master) accepts writes. The remaining nodes (the replicas) asynchronously replay changes from the master and serve read traffic.

The availability of the master is particularly important. Without a master, a cluster cannot accept writes: any data that needs to be persisted cannot be stored, and any incoming changes (such as commits, issues, user creation, reviews, new repositories, and so on) will fail.

To support writes, we obviously need an available writer node: the cluster’s master. But just as important, we must be able to identify, or discover, that node.

On a failure, say a master crash, we must ensure that a new master exists and that its identity can be advertised quickly. The time it takes to detect the failure, run the failover, and advertise the new master’s identity makes up the total outage time.

This article introduces GitHub’s MySQL high availability and master service discovery solution, which enables us to reliably run cross-data center operations, tolerate data center isolation, and reduce downtime when failures occur.

High availability goals

The solution described in this article iterates on, and improves, the high availability (HA) solution previously implemented at GitHub. As GitHub grows, its MySQL HA strategy must adapt to change. We also want to apply similar HA strategies to MySQL and to other services within GitHub.

When considering high availability and service discovery, a few questions can lead you to the right solution, including but not limited to:

  • How long can you tolerate interruptions?

  • How reliable is crash detection? Can you tolerate false positives (premature failover)?

  • How reliable is the failover procedure? In what ways can it fail?

  • How well does the solution work across data centers? On low-latency and high-latency networks?

  • Does the solution allow for a complete data center failure or network isolation?

  • Is there a mechanism to prevent or mitigate split-brain (where two servers claim to be the master of a cluster, unaware of each other’s existence, and both accept writes)?

  • Can you allow data to be lost? To what extent?

To illustrate some of the above, let us first look at our previous high availability solution and why we changed it.

Moving away from VIP- and DNS-based service discovery

In our previous iteration, we used:

  • orchestrator for detection and failover, and

  • VIP and DNS for master discovery.

In that iteration, clients discovered the writer node by using a name such as mysql-writer-1.github.net. The name resolved to a virtual IP (VIP) that pointed to the master node.

Thus, in the normal case, clients would simply resolve the name, connect to the resolved IP, and find the master listening on the other side (that is, the client connects to the master).
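
As an illustration of that flow, here is a minimal sketch, not GitHub’s client code, of how a client discovered the writer under this scheme; the name is the one mentioned above, everything else is an assumption.

```python
# Minimal sketch of the old VIP/DNS based discovery; not GitHub's client code.
import socket

WRITER_NAME = "mysql-writer-1.github.net"  # the per-cluster write name

# The DNS answer is the virtual IP currently claimed by the cluster master.
vip = socket.gethostbyname(WRITER_NAME)
print(f"{WRITER_NAME} resolves to VIP {vip}")

# A MySQL connection would then be opened against that VIP; whichever server
# currently owns the VIP (hopefully the master) answers on the other side.
```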

Consider this replication topology across three different data centers:

When the master fails, a new master must be chosen from among the replicas and promoted.

orchestrator detects the failure, promotes a new master, and then reassigns the name and the VIP.

Clients do not actually know the identity of the master: all they know is the name, which must now resolve to the newly promoted master. However, consider:

  • VIPs are cooperative: they must be claimed and owned by the database servers themselves.

  • To acquire or release a VIP, a server must send an ARP request.

  • The server currently owning the VIP must release it before the newly promoted master can acquire it.

This has some additional implications:

  • An orderly failover first contacts the failed master and asks it to release the VIP, then contacts the newly promoted master and asks it to acquire the VIP. But what if the old master cannot be reached, or refuses to release the VIP? Given that there is a failure scenario on that server in the first place, it is not unlikely that it would fail to respond in a timely manner, or fail to respond at all.

    • We could end up with a split brain: two hosts claiming to own the same VIP. Depending on the shortest network path, different clients may connect to different servers.

    • The real issue is the dependency on cooperation between two independent servers, and that setup is unreliable.

  • Even if the old master does cooperate, the workflow wastes valuable time: the switch to the new master has to wait while we contact the old master.

  • Even when the VIP does change, existing client connections are not guaranteed to disconnect from the old server, and we may still experience a split brain.

  • VIPs are bound by physical location. They belong to a switch or a router. Thus, we can only reassign a VIP to a co-located server. In particular, when the newly promoted master is in a different data center, we cannot assign the VIP at all and must make a DNS change instead.

  • DNS changes take longer to propagate. Clients cache DNS names for a preconfigured time. A cross-DC failover therefore implies more outage time: it takes longer for all clients to become aware of the new master’s identity.

These limitations alone would be enough to push us toward a new solution, but there was more to consider:

  • The master injected heartbeats into itself via the pt-heartbeat service, for lag measurement and throttling control. This service had to be started on the newly promoted master. If possible, it was shut down on the old master.

  • Similarly, Pseudo-GTID injection was managed by the master itself. It had to be started on the new master, and preferably stopped on the old master.

  • The new master was made writable. If possible, the old master was set to read-only.

These extra steps contributed to the total outage time and introduced their own failures and friction.

The solution worked, and GitHub has performed successful MySQL failovers with it, but we wanted our HA solution to improve in the following ways:

  • Be data-center agnostic

  • Tolerate data center failures

  • Remove unreliable cooperative workflows

  • Reduce total outage time

  • Achieve lossless failovers as much as possible

GitHub’s high availability solution: orchestrator, Consul, GLB

Our new strategy, along with some incidental improvements, solves or mitigates most of the problems above. In today’s HA setup, we have:

  • orchestrator for detection and failover. We use a cross-data-center orchestrator/raft setup, as depicted below.

  • Hashicorp’s Consul for service discovery.

  • GLB/HAProxy as a proxy layer between clients and writer nodes.

  • anycast for network routing.

The new setup removes VIP and DNS changes altogether. While we introduce more components, we are able to decouple them and simplify the task, and to rely on proven, stable solutions. Let’s break it down one by one.

The normal process

Normally, applications connect to write nodes via GLB/HAProxy.

Applications never learn the identity of the master. As before, they use a name. For example, the master of cluster1 is referred to by mysql-writer-1.github.net. In our current setup, however, the name resolves to an anycast IP.

With anycast, the name resolves to the same IP everywhere, but traffic is routed differently based on the client’s location. In particular, in each of our data centers GLB, our highly available load balancer, is deployed on multiple hosts. Traffic to mysql-writer-1.github.net is always routed to the local data center’s GLB cluster. Thus, all clients are served by local proxies.

We run GLB on top of HAProxy. Our HAProxy maintains writer pools: one pool per MySQL cluster, where each pool has exactly one backend server: the cluster’s master. All GLB/HAProxy hosts in a DC have the same pools, and they all point to the same backend servers. Thus, if an application wants to write to mysql-writer-1.github.net, it does not matter which GLB server it connects to; it will always be routed to the actual cluster1 master.

As far as the application is concerned, discovery ends at GLB and there is never a need for rediscovery. It is up to GLB to route the traffic to the correct destination.
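
To make the contrast with the old flow concrete, here is a minimal sketch of what an application does under the new scheme; the pymysql client, credentials, and schema are illustrative assumptions, not GitHub’s application code.

```python
# Minimal sketch: the application only ever uses the cluster's write name.
# Anycast routes the connection to the local GLB/HAProxy, which forwards it
# to the current master; the application never learns the master's identity.
import pymysql

WRITER_NAME = "mysql-writer-1.github.net"  # resolves to the same anycast IP everywhere

def get_writer_connection():
    # Credentials and schema are placeholders for illustration.
    return pymysql.connect(host=WRITER_NAME, port=3306,
                           user="app", password="secret", database="app")

conn = get_writer_connection()
with conn.cursor() as cur:
    cur.execute("SELECT 1")   # served by whichever backend GLB currently points at
conn.close()
```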

How does GLB know which servers to list as backends, and how do we propagate changes to GLB?

Service discovery via Consul

Consul is well known as a service discovery solution, and it also offers DNS services. In our solution, however, we use it as a highly available key-value (KV) store.

In Consul’s KV store, we write the identities of the cluster masters. For each cluster, there is a set of KV entries identifying the cluster master’s FQDN, port, IPv4 address, and IPv6 address.
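
As an illustration, here is a minimal sketch of how such a record could be written and read through Consul’s HTTP KV API; the key path, JSON layout, hostnames, and Consul address are assumptions, not GitHub’s actual schema.

```python
# Minimal sketch of publishing/reading a cluster master record in Consul KV.
# Key path, JSON fields and the Consul address are illustrative assumptions.
import base64
import json
import requests

CONSUL = "http://127.0.0.1:8500"
KEY = "mysql/master/cluster1"  # hypothetical key path

def publish_master(fqdn, port, ipv4, ipv6):
    # Written by orchestrator nodes on a master change (one write per node).
    record = {"fqdn": fqdn, "port": port, "ipv4": ipv4, "ipv6": ipv6}
    resp = requests.put(f"{CONSUL}/v1/kv/{KEY}", data=json.dumps(record))
    resp.raise_for_status()

def read_master():
    # Read back by consul-template (or any watcher) on the GLB/HAProxy hosts.
    resp = requests.get(f"{CONSUL}/v1/kv/{KEY}")
    resp.raise_for_status()
    value = resp.json()[0]["Value"]            # Consul base64-encodes values
    return json.loads(base64.b64decode(value))

publish_master("mysql-cluster1-node2.example.net", 3306, "10.0.0.12", "2001:db8::12")
print(read_master())
```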

Each GLB/HAProxy node runs consul-template: a service that listens for changes to Consul data (in our case, changes to the cluster master data). consul-template produces a valid configuration file and reloads HAProxy whenever the configuration changes.

Thus, a change to a master’s identity in Consul is observed by each GLB/HAProxy node, which immediately reconfigures itself, sets the new master as the single entry in the cluster’s backend pool, and reloads to reflect the change.
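
The sketch below is a simplified stand-in for consul-template, just to show the watch/regenerate/reload loop on a GLB/HAProxy node; the key path, config file path, backend template, and reload command are all assumptions for illustration, not the real setup.

```python
# Simplified stand-in for consul-template: watch a Consul key, regenerate the
# HAProxy backend, and reload. Paths, template and reload command are assumed.
import base64
import json
import subprocess
import requests

CONSUL = "http://127.0.0.1:8500"
KEY = "mysql/master/cluster1"                       # hypothetical key path
HAPROXY_CFG = "/etc/haproxy/mysql-writer-1.cfg"     # hypothetical config path

def render_backend(master):
    # One writer pool per cluster, with a single backend: the current master.
    return (
        "backend mysql_cluster1_writer\n"
        f"    server master {master['ipv4']}:{master['port']} check\n"
    )

def watch_and_reload():
    index = None
    while True:
        # Consul blocking query: returns when the key changes, or after 30s.
        params = {"wait": "30s"}
        if index is not None:
            params["index"] = index
        resp = requests.get(f"{CONSUL}/v1/kv/{KEY}", params=params)
        resp.raise_for_status()
        index = resp.headers.get("X-Consul-Index")
        master = json.loads(base64.b64decode(resp.json()[0]["Value"]))
        with open(HAPROXY_CFG, "w") as f:
            f.write(render_backend(master))
        # The reload command is illustrative; any graceful HAProxy reload works.
        subprocess.run(["systemctl", "reload", "haproxy"], check=True)
```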

At GitHub, we have a Consul setup in each data center, and each setup is highly available. However, these setups are independent of one another; they do not replicate or share data with each other.

So how does Consul learn about changes, and how is the information distributed across data centers?

orchestrator/raft

We run an orchestrator/raft setup: orchestrator nodes communicate with each other via the raft consensus algorithm. Each data center has one or two orchestrator nodes.

orchestrator is responsible for failure detection, for MySQL failover, and for communicating the change of master to Consul. Failover is operated by the single orchestrator/raft leader node, but the news that a cluster now has a new master is propagated to all orchestrator nodes through the raft mechanism.

As orchestrator nodes receive the news of a master change, they each communicate with their local Consul setup: they each perform a KV write. Data centers with more than one orchestrator node will see multiple identical writes to Consul.

The whole process

In the scenario where the master fails:

  • orchestrator nodes detect the failure.

  • The orchestrator/raft leader node starts a recovery. A new master is promoted.

  • orchestrator/raft advertises the master change to all raft cluster nodes.

  • Each orchestrator/raft member receives the notification of the master change and writes the new master’s identity to the local Consul KV store.

  • Each GLB/HAProxy runs consul-template, which observes the change in the Consul KV store, reconfigures, and reloads HAProxy.

  • Client traffic is redirected to the new master.

Each component has clear responsibilities, and the overall design is both decoupled and simple. orchestrator does not know about the load balancers. Consul does not need to know where the information came from. The proxies only care about Consul, and the clients only care about the proxies.

And:

  • No DNS changes need to be propagated.

  • There is no TTL.

  • The entire flow requires no cooperation from the failed master, which is largely ignored.

Other details

To further secure the flow, we also have the following:

  • HAProxy is configured with a very short hard-stop-after. When it reloads with a new backend server in a writer pool, it automatically terminates any existing connections to the old master.

    • With hard-stop-after we do not even require cooperation from the clients, and this mitigates a split-brain scenario. It is worth noting that this is not airtight, and some time passes before old connections are killed. But there is a point in time after which we are comfortable and expect no nasty surprises.

  • We do not strictly require Consul to be available at all times. In fact, we only need it to be available at failover time. If Consul happens to be unavailable at that moment, GLB continues to operate on the last known values and takes no drastic action.

  • GLB validates the identity of the newly promoted master. Similar to our context-aware MySQL pools, a check is made against the backend server to confirm that it is indeed a writable node (a sketch of such a check follows this list). If we happen to delete the master’s identity in Consul, no problem; the empty entry is ignored. If we mistakenly write the name of a non-master server to Consul, no problem; GLB refuses to update it and keeps running with the last known state.
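
A minimal sketch of the kind of writability check described in the last point, assuming a hypothetical check user and connection details; the actual GLB check logic differs, but the idea is the same: never treat a read-only server as the writer.

```python
# Minimal sketch of a writer validation: only accept a backend that is
# actually writable. Connection details are illustrative assumptions.
import pymysql

def is_writable(host, port=3306, user="check", password="secret"):
    conn = pymysql.connect(host=host, port=port, user=user, password=password)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT @@global.read_only")
            (read_only,) = cur.fetchone()
            return read_only == 0
    finally:
        conn.close()

# A stale or wrong Consul entry that points at a read-only replica is simply
# rejected, and the pool keeps its last known good backend.
```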

In the following sections we address further concerns and pursue our high availability goals.

orchestrator/raft failure detection

orchestrator uses a holistic approach to failure detection, and as such it is very reliable. We do not observe false positives: there are no premature failovers, and thus no unnecessary outage time.

orchestrator/raft further handles the case of complete DC network isolation (also known as DC fencing). A DC network isolation is confusing: servers within that DC can still talk to each other. Is it they who are isolated from the other DCs' networks, or is it the other DCs that are isolated?

In an orchestrator/raft setup, the raft leader node is the one that runs failovers. A leader is a node that has the support of a majority of the group (a quorum). Our orchestrator node deployment is such that no single data center holds a majority, while any n-1 data centers do.

In the event of a complete DC network isolation, the orchestrator nodes in that DC are disconnected from their peers in the other DCs. As a result, the orchestrator nodes in the isolated DC cannot be the leader of the raft cluster. If any such node does happen to be the leader, it steps down. A new leader will be elected from any of the other DCs; that leader has the support of all the other DCs, which can still communicate with one another.

Thus, the orchestrator node that calls the shots will always be outside the network-isolated data center. If the isolated DC contained a master, orchestrator initiates a failover to replace it with a server in one of the available DCs. We mitigate DC isolation by delegating the decision making to the quorum in the non-isolated DCs.
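
A toy sketch of the quorum arithmetic behind DC fencing; the node counts are assumptions, not GitHub’s actual deployment, but they show why the nodes of a single isolated DC can never form a raft majority on their own.

```python
# Toy quorum arithmetic for orchestrator/raft DC fencing; node counts assumed.
DEPLOYMENT = {"dc1": 1, "dc2": 1, "dc3": 1}   # orchestrator/raft nodes per DC

def quorum(total_nodes):
    # raft needs a strict majority of all nodes.
    return total_nodes // 2 + 1

total = sum(DEPLOYMENT.values())
print("quorum size:", quorum(total))                       # 2 out of 3

# No single DC holds a majority, so an isolated DC can never elect a leader;
# any n-1 DCs together do hold a majority and can keep (or elect) one.
for dc, nodes in DEPLOYMENT.items():
    print(dc, "can lead alone:", nodes >= quorum(total))   # always False here
```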

Faster announcements

Total outage time can be further reduced by advertising the master change sooner. How can that be achieved?

When orchestrator begins a failover, it observes the fleet of servers available for promotion. Understanding the replication rules and abiding by hints and constraints, it is able to make an educated decision on the best course of action.

It may recognize that a server available for promotion is also an ideal candidate, in that:

  • There is nothing to prevent the promotion of that server (and potentially the user has hinted that it is a preferred candidate for promotion), and

  • The server is expected to be able to take all of its siblings as replicas.

In such a case, orchestrator first sets the server as writable and immediately advertises the promotion (in our case, by writing to Consul KV), even while it asynchronously begins to repair the replication tree, an operation that typically takes a few more seconds.

It is likely that by the time our GLB servers have fully reloaded, the replication tree is already intact, but this is not strictly required. The server is ready to receive writes!

Semi-synchronous replication

In MySQL’s semi-synchronous replication, the master does not acknowledge a transaction commit until the change is known to have been shipped to one or more replicas. This provides a way to achieve lossless failover: any change applied on the master has either been applied to, or is waiting to be applied to, one of the replicas.

Consistency comes at a cost: a risk to availability. If no replica acknowledges receipt of a change, the master blocks and writes stall. Fortunately, there is a timeout configuration after which the master can revert to asynchronous replication, making writes available again. We have set our timeout to a reasonably low value: 500ms. This is more than enough to ship changes from the master to local DC replicas, and typically also to remote DCs. With this timeout we observe perfect semi-synchronous behavior (no fallback to asynchronous replication), and we are comfortable with a very short blocking period upon acknowledgement failure.
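
A minimal sketch of the semi-sync settings described above, using the classic rpl_semi_sync_* variable names; it assumes the semi-sync plugins are already installed, and the hostname and credentials are placeholders.

```python
# Minimal sketch: enable semi-sync on a master with a 500ms fallback timeout.
# Assumes the semisync plugins are installed; credentials are placeholders.
import pymysql

def enable_semi_sync_master(host, timeout_ms=500):
    conn = pymysql.connect(host=host, user="admin", password="secret")
    try:
        with conn.cursor() as cur:
            # Wait for at least one replica ack before acknowledging a commit...
            cur.execute("SET GLOBAL rpl_semi_sync_master_enabled = 1")
            # ...but fall back to asynchronous replication after timeout_ms,
            # trading strict losslessness for availability.
            cur.execute("SET GLOBAL rpl_semi_sync_master_timeout = %s", (timeout_ms,))
    finally:
        conn.close()

enable_semi_sync_master("mysql-cluster1-node1.example.net")
```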

We enable semi-sync on local DC replicas, and in the event of a master’s death we expect (though do not strictly enforce) a lossless failover. Lossless failover on a complete DC failure is costly, and we do not expect it.

While experimenting with the semi-sync timeout, we also observed a behavior that plays to our advantage: we are able to influence the identity of the ideal candidate in the event of a master failure. By enabling semi-sync on designated servers, and marking them as candidates, we can reduce total outage time by affecting the outcome of a failure. In our experiments, we observe that we typically end up with the ideal candidates, and hence with fast advertisements.

Heartbeat injection

Instead of managing the startup and shutdown of the pt-heartbeat service on promoted or demoted masters, we opted to run it everywhere, at all times. This required some patching so that pt-heartbeat can cope with servers changing their read_only state back and forth, or crashing completely.

In our current setup, pt-heartbeat services run on masters and on replicas. On masters, they generate the heartbeat events. On replicas, they identify the server as read-only and routinely recheck its status. As soon as a server is promoted to master, the pt-heartbeat on that server identifies the server as writable and begins injecting heartbeat events.
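
A simplified illustration of that behavior follows; the real pt-heartbeat is a Perl tool, and the heartbeat table, schema, and credentials below are assumptions, not the actual implementation.

```python
# Simplified pt-heartbeat-like loop: run everywhere, inject heartbeats only
# while the server is writable. Table name and credentials are assumptions.
import time
import pymysql

def heartbeat_loop(host, interval=1.0):
    conn = pymysql.connect(host=host, user="heartbeat", password="secret",
                           database="meta", autocommit=True)
    while True:
        with conn.cursor() as cur:
            cur.execute("SELECT @@global.read_only")
            (read_only,) = cur.fetchone()
            if read_only == 0:
                # Writable: this server is (now) the master, so inject a beat.
                cur.execute("REPLACE INTO heartbeat (id, ts) VALUES (1, NOW(6))")
            # Read-only: do nothing; just keep rechecking the status.
        time.sleep(interval)
```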

orchestrator ownership delegation

We further delegate to the orchestrator:

  • Pseudo-GTID injection

  • Setting the promoted master as writable and clearing its replication state

  • Setting the old master as read-only, if possible

All of this reduces friction on the new master. A master that is just being promoted is clearly expected to be alive and accessible, otherwise we would not promote it. It therefore makes sense to let orchestrator apply changes directly to the promoted master.

Limitations and Disadvantages

The proxy layer makes the identity of the master unknown to the applications, but it also masks the identity of the applications from the master. All connections seen by the master come from the proxy layer, and we lose information about the actual origin of a connection.

As distributed systems go, we are still left with unhandled scenarios.

Notably, in a data center isolation scenario, and assuming the master is in the isolated DC, applications in that DC can still write to the master. This may result in inconsistent state once the network is restored.

We are working to mitigate this split brain by implementing a reliable STONITH from within the isolated DC itself. As before, some time will pass before the master is taken down, so there can still be a short period of split brain. The operational cost of avoiding split brain altogether is very high.

More scenarios exist: a Consul outage at the time of failover; partial DC isolation; and others. We understand that with distributed systems of this nature it is impossible to close all the loopholes, so we focus on the most important cases.

The results

Our orchestrator/GLB/Consul setup provides us with:

  • Reliable failure detection

  • Data-center-agnostic failover

  • Typically lossless failover

  • Support for data center network isolation

  • Split-brain mitigation (still a work in progress)

  • No cooperation dependencies

  • Roughly 10 to 13 seconds of total outage time in most cases (we observe up to 20 seconds in some cases, and up to 25 seconds in extreme cases)

Conclusion

The orchestration/proxy/service-discovery paradigm uses well-known, trusted components in a decoupled architecture, which makes it easier to deploy, operate, and observe, and allows each component to scale independently. We will keep questioning our setup as we continue to look for improvements.