Last week, GitHub experienced an incident that degraded service for 24 hours and 11 minutes. While parts of the platform were unaffected, multiple internal systems were impacted, showing users outdated and inconsistent content. No user data was lost, but manual reconciliation of a few seconds of database writes is still in progress. During the incident, webhook delivery and GitHub Pages builds were also unavailable.

We are deeply sorry to every customer who was affected. We know the trust our users place in GitHub and take pride in building resilient systems that keep the platform highly available. With this incident we let our users down, and we are deeply sorry. While we cannot undo the issues that made the GitHub platform unusable for such an extended period, we can explain what led to the incident, the lessons we learned from it, and the steps we are taking to ensure it does not happen again.

Background

Most user-facing GitHub services run in our own data centers. Our data center topology is designed to provide a robust and scalable edge network in front of several regional data centers that handle compute and storage workloads. Despite the layers of redundancy built into our physical and logical components, it is still possible for our sites to lose the ability to communicate with each other for some period of time.

At 22:52 UTC on October 21, the connection between our East Coast network hub and our East Coast data center was severed during the replacement of faulty 100G optical equipment. Connectivity was restored 43 seconds later, but the brief interruption triggered a chain of events that resulted in 24 hours and 11 minutes of degraded service.

We have written before about how we use MySQL to store GitHub's metadata and how we make MySQL highly available.

GitHub operates multiple MySQL clusters, ranging in size from a few hundred gigabytes to five terabytes, each with up to a few dozen read replicas storing non-Git metadata. This is what allows our application to provide pull requests, issue management, authentication, background job coordination, and more beyond raw Git object storage. Data from different parts of the application is stored in separate clusters through functional sharding.

To improve performance at scale, applications write directly to each cluster's primary database but delegate read requests to replica servers in the vast majority of cases. We use Orchestrator to manage our MySQL cluster topologies and handle automated failover. Orchestrator relies on the Raft consensus algorithm and can implement topologies that applications are unable to support, so care must be taken to keep Orchestrator's configuration aligned with application-level expectations.
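
For illustration, the following is a minimal Python sketch of the kind of read/write split described above. The class, hostnames, and routing policy are hypothetical and are not GitHub's actual application code.

```python
# Illustrative sketch of a primary/replica read-write split; names are hypothetical.
import random

class ClusterRouter:
    """Routes writes to the cluster primary and reads to one of its replicas."""

    def __init__(self, primary_dsn, replica_dsns):
        self.primary_dsn = primary_dsn
        self.replica_dsns = replica_dsns

    def connection_for(self, is_write):
        # Writes must always land on the primary; reads are spread across replicas.
        if is_write or not self.replica_dsns:
            return self.primary_dsn
        return random.choice(self.replica_dsns)

# Example: a functionally sharded "issues" cluster with one primary and three replicas.
issues_cluster = ClusterRouter(
    primary_dsn="mysql://issues-primary.east:3306",
    replica_dsns=[f"mysql://issues-replica-{i}.east:3306" for i in range(1, 4)],
)
print(issues_cluster.connection_for(is_write=True))   # always the primary
print(issues_cluster.connection_for(is_write=False))  # one of the replicas
```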

Incident timeline

22:52 UTC, October 21, 2018

When the network partition occurred, Orchestrator in the primary data center began a process of leadership deselection according to the Raft consensus protocol. The Orchestrator nodes in the West Coast data center and the East Coast public cloud obtained a quorum and began failing over the clusters so that writes were directed to the West Coast data center. When connectivity was restored, the application tier immediately began directing write traffic to the new primaries at the West Coast site.

The databases in the East Coast data center contained a brief period of writes that had not yet been replicated to the West Coast. Because the clusters in both data centers now contained writes that did not exist in the other, we could not safely fail the primaries back over to the East Coast data center.
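
The report does not describe the replication bookkeeping involved, but assuming GTID-based replication, the divergence between the two primaries could be observed roughly as in the sketch below. The hostnames, credentials, and connector choice are placeholders, not details from the incident.

```python
# Hypothetical sketch, assuming GTID-based replication (not stated in the report):
# writes present in one primary's gtid_executed set but absent from the other's
# are exactly the writes that block a safe failback.
import mysql.connector  # pip install mysql-connector-python

def executed_gtids(host):
    conn = mysql.connector.connect(host=host, user="monitor", password="placeholder")
    cur = conn.cursor()
    cur.execute("SELECT @@GLOBAL.gtid_executed")
    (gtid_set,) = cur.fetchone()
    conn.close()
    return gtid_set

def unreplicated(source_host, target_host):
    """Return the GTID set executed on source_host but missing on target_host."""
    source, target = executed_gtids(source_host), executed_gtids(target_host)
    conn = mysql.connector.connect(host=source_host, user="monitor", password="placeholder")
    cur = conn.cursor()
    # GTID_SUBTRACT is MySQL's built-in set difference over GTID sets.
    cur.execute("SELECT GTID_SUBTRACT(%s, %s)", (source, target))
    (diff,) = cur.fetchone()
    conn.close()
    return diff

# A few seconds of writes existed only on the East Coast primary, and ~40 minutes
# of writes existed only on the new West Coast primary: both differences non-empty.
print(unreplicated("east-primary.example", "west-primary.example"))
print(unreplicated("west-primary.example", "east-primary.example"))
```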

22:54 UTC, October 21, 2018

Our internal monitoring systems began sending alerts notifying us of a large number of failures. By this time several engineers were triaging the incoming notifications. At 23:02 UTC, engineers on the first response team determined that many database clusters were in an unexpected topology state: the replication topology reported by the Orchestrator API showed only servers in the West Coast data center.

23:07 UTC, October 21, 2018

At this point the response team decided to manually lock our internal deployment tools to prevent any additional changes from being introduced. At 23:09 UTC, the response team set the site status to yellow. This automatically escalated the incident and paged an incident coordinator. At 23:11 UTC, the incident coordinator joined, and two minutes later the decision was made to change the status to red.

23:13 UTC, October 21, 2018

The problem affected multiple database clusters, so additional engineers from GitHub's database engineering team were paged. They began investigating what steps would be needed to re-establish the East Coast databases as the primaries for each cluster and rebuild the replication topology. This was challenging because by this point the West Coast clusters had ingested nearly 40 minutes of writes from the application tier. In addition, the few seconds of writes in the East Coast clusters had never been replicated to the West Coast, which prevented new writes from being replicated back to the East Coast.

Protecting the confidentiality and integrity of user data is GitHub's top priority. The more than 30 minutes of writes already taken in the West Coast data center forced us to consider failing forward in order to keep that data safe. However, applications running on the East Coast that depend on writing to West Coast MySQL clusters cannot currently cope with the additional latency of a cross-country round trip. This decision would leave many users unable to use our service as they normally would. In the end, we decided that extending the period of degraded service was the best way to ensure the consistency of user data.

23:19 UTC, October 21, 2018

After querying the state of the database clusters, we determined that we needed to stop the jobs that write metadata. We paused webhook delivery and GitHub Pages builds to avoid compromising data we had already received from users. In other words, our strategy was to prioritize data integrity over site usability and time to recovery.

00:05 UTC, October 22, 2018

Engineers on the incident response team began developing a plan to resolve the data inconsistencies and carry out the MySQL failover. Our plan was to restore from backups, synchronize the replicas at both sites, fall back to a stable serving topology, and then resume processing the queued jobs. We updated the status page to tell users that we were going to perform a controlled failover of an internal data storage system.

MySQL backups are taken every four hours and retained for many years, but they are stored remotely in a blob storage service in a public cloud. Restoring multiple terabytes of backup data takes hours, and most of that time is spent transferring the data from the remote backup service. Decompressing, checksumming, preparing, and loading the large backup files onto newly provisioned MySQL servers also took considerable time. This procedure is tested at least daily, but prior to this incident we had never needed to rebuild an entire cluster from backup, relying instead on other strategies such as delayed replicas.
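
As a rough, hypothetical back-of-envelope calculation of why such a restore takes hours: the throughput figures below are assumptions for illustration only, not numbers from this incident.

```python
# Back-of-envelope restore estimate; throughput figures are illustrative assumptions.
TB = 1024**4

backup_size_bytes = 5 * TB   # roughly the largest cluster size mentioned above
transfer_mb_per_s = 300      # assumed download throughput from remote blob storage
prepare_mb_per_s = 800       # assumed decompress/validate/load throughput

transfer_hours = backup_size_bytes / (transfer_mb_per_s * 1024**2) / 3600
prepare_hours = backup_size_bytes / (prepare_mb_per_s * 1024**2) / 3600

print(f"transfer: {transfer_hours:.1f} h, prepare/load: {prepare_hours:.1f} h, "
      f"total: {transfer_hours + prepare_hours:.1f} h")
```

Under these assumed numbers, the cross-cloud transfer dominates the restore time, which is why recovering directly within our own facilities (described below) was attractive.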

00:41 UTC, October 22, 2018

At this point we had initiated the restore process for all affected MySQL clusters, and engineers were monitoring its progress. In the meantime, engineers from multiple teams were investigating ways to speed up the transfer and reduce recovery time without further degrading site availability or risking data corruption.

06:51 UTC, October 22, 2018

Several clusters in the US East Coast data center had been restored from backups and begun replicating new data from the West Coast. This resulted in slower load times for pages that had to perform writes over the cross-country link, but pages reading from these clusters would return up-to-date results if the read request landed on a newly restored replica. Other, larger database clusters were still restoring.

Our teams had identified a way to restore directly from the West Coast, bypassing the throughput constraints of downloading from off-site storage, and were increasingly confident in the recovery process. The time needed to establish the replication topology depended on how long it would take replication to catch up. We estimated this by linearly interpolating our existing replication telemetry, and we updated the status page to set an estimated time to recovery of two hours.
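
A simplified illustration of that linear estimate is sketched below, using invented lag samples rather than GitHub's actual telemetry.

```python
# Simplified illustration of a linear catch-up estimate; the lag samples are
# invented for the example and are not GitHub's actual replication telemetry.
samples = [
    (0, 3 * 3600),   # (minutes elapsed, replication lag in seconds)
    (30, 9000),
    (60, 7200),
]

(t0, lag0), (t1, lag1) = samples[0], samples[-1]
catch_up_rate = (lag0 - lag1) / (t1 - t0)   # seconds of lag recovered per minute
eta_minutes = lag1 / catch_up_rate if catch_up_rate > 0 else float("inf")
print(f"estimated catch-up in {eta_minutes / 60:.1f} hours")
```

As the later entries describe, the catch-up rate itself fell as daytime write load increased, which is why a linear estimate proved optimistic.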

07:46 UTC, October 22, 2018

GitHub published a blog post with more background information. We use GitHub Pages internally, and because all Pages builds had been paused hours earlier, publishing it required extra work. We apologize for the delay; we wanted to get this information out sooner and will make sure we can publish updates within these constraints in the future.

11:12 UTC, October 22, 2018

All primary databases were back in the US East Coast data center. Site responsiveness improved, as writes were now directed to database servers in the same physical data center as the application tier. While this greatly improved performance, dozens of database read replicas still lagged hours behind their primaries. These delayed replicas caused users to see inconsistent data when interacting with our services: we spread read load over a large pool of read replicas, so any given request could land on a replica that was hours behind.

In practice, the time needed for replication to catch up with a primary follows a power decay function rather than a linear trajectory. As users in Europe and the United States woke up and started working, write load on the database clusters increased, and the recovery took longer than originally estimated.

13:15 UTC, October 22, 2018

By this point we were approaching peak traffic load for GitHub.com. The incident response team discussed how to proceed: it was clear that replication lag was increasing rather than decreasing. We had been provisioning additional MySQL read replicas in the East Coast public cloud, and once these instances came online it became easier to spread read requests across more servers. Lowering the aggregate utilization across replicas allowed replication to catch up more quickly.

16:24 UTC, October 22, 2018

Once the replicas were in sync, we performed a failover back to the original topology, resolving the immediate latency and availability concerns. To prioritize data integrity, we kept the service status red as we began processing the backlog of data.

16:45 UTC, October 22, 2018

At this stage we had to balance the additional load represented by the backlogged data, which risked overwhelming other parts of the ecosystem with notifications, against restoring service to 100% as quickly as possible. More than 5 million webhook events and 80,000 Pages build requests were queued at this point.

As we resumed processing this data, we processed roughly 200,000 webhook payloads that had outlived an internal TTL and were dropped. When we discovered this, we paused processing and temporarily raised that TTL.
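
As a minimal sketch of the kind of delivery-time TTL check involved: the TTL value and field names below are assumptions, not GitHub's implementation.

```python
# Minimal sketch of a delivery-time TTL check; the TTL value and field names
# are assumptions and do not reflect GitHub's actual webhook pipeline.
import time

WEBHOOK_TTL_SECONDS = 8 * 3600  # assumed; temporarily raised while draining the backlog

def deliverable(event, now=None, ttl=WEBHOOK_TTL_SECONDS):
    """Return True if the queued webhook event is still young enough to deliver."""
    now = now or time.time()
    return (now - event["enqueued_at"]) <= ttl

backlog = [
    {"id": 1, "enqueued_at": time.time() - 20 * 3600},  # queued during the incident
    {"id": 2, "enqueued_at": time.time() - 1 * 3600},
]

to_deliver = [e for e in backlog if deliverable(e)]
dropped = [e for e in backlog if not deliverable(e)]
print(len(to_deliver), "deliverable,", len(dropped), "past TTL")
```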

To avoid further undermining the reliability of our status updates, we remained in degraded status until we had processed the entire backlog and confirmed that our services had clearly returned to normal performance levels.

23:03 UTC, October 22, 2018

All pending webhooks and Pages builds had been processed, and the integrity and normal operation of all systems had been confirmed. The site status was updated to green.

Next steps

During the recovery we captured the MySQL binary logs containing the writes taken at our primary site that were never replicated to the West Coast. The number of unreplicated writes is relatively small; for example, one of our busiest clusters had 954 writes in the affected window. We are currently analyzing these logs to determine which writes can be reconciled automatically and which require outreach to users. Multiple teams are involved in this effort, and our analysis has already identified a class of writes that users have since repeated and that were successfully persisted. Our primary goal throughout is to preserve the integrity and accuracy of user data.
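
For illustration, captured binary logs of this kind can be inspected with the standard mysqlbinlog tool roughly as follows. The file name and time window are placeholders, and this is not necessarily the tooling our teams used.

```python
# Illustrative sketch: decode captured binary logs for the unreplicated window
# with the standard mysqlbinlog tool. File names and timestamps are placeholders.
import subprocess

BINLOGS = ["mysql-bin.001234"]  # placeholder file name

cmd = [
    "mysqlbinlog",
    "--base64-output=DECODE-ROWS",  # decode row events into readable pseudo-SQL
    "--verbose",
    "--start-datetime=2018-10-21 22:52:00",
    "--stop-datetime=2018-10-21 22:53:00",
    *BINLOGS,
]

# Each decoded statement can then be reviewed to decide whether it reconciles
# automatically or needs to be confirmed with the affected user.
decoded = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
print(decoded.count("### INSERT"), "row inserts in the captured window")
```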

We wanted to convey meaningful information to our users during the incident, and we made several public estimates of time to recovery based on how quickly we expected to process the backlog of data. In retrospect, those estimates did not account for all of the variables. We regret the confusion this caused and will strive to provide more accurate information in the future.

We have identified a number of technical initiatives during this analysis, and as we continue an extensive internal post-incident review, we expect to find even more work to do.

  1. Adjust the Orchestrator configuration to prevent the promotion of primary databases across regional boundaries. Orchestrator will do exactly what it is configured to do, regardless of whether the application tier can support the resulting topology change. Leader election within a single region is generally safe, but the sudden introduction of cross-country latency was a major contributing factor in this incident. This was emergent behavior of the system, as we had never before encountered an internal network partition of this magnitude. A minimal configuration sketch follows this list.

  2. We have put in place a mechanism to report status more quickly and to communicate more clearly about the progress of an incident. Although many parts of GitHub remained available during this incident, we were only able to set the overall status to green, yellow, or red. We recognize that this does not give an accurate picture of what is working and what is not, and in the future we will also surface the different components of the platform so that users can see the status of each service.

  3. In the weeks before this incident, we had launched a company-wide engineering initiative to support serving GitHub traffic from multiple data centers in an active/active design. The goal of this project is to provide N+1 redundancy at the facility level and to tolerate the failure of a single data center without user impact. This is a significant undertaking and will take time, but we believe that well-connected, geographically diverse sites offer a good set of trade-offs. This incident has added urgency to the initiative.

  4. We will take a more proactive stance in testing our assumptions. GitHub is a fast-growing company that has accumulated a fair amount of complexity over the past decade. As the company continues to grow, it becomes increasingly difficult to capture and transfer the historical context behind trade-offs and decisions.
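
As an illustration of the first item above, Orchestrator exposes configuration that constrains where a failover may promote a new primary. The key shown here appears in Orchestrator's public configuration documentation; the file path and surrounding details are placeholders, and this sketch should not be read as the exact production change described in this report.

```python
# Illustration for item 1: constrain Orchestrator so it will not promote a new
# primary in a different data center. The key name comes from Orchestrator's
# public documentation; the path and values here are assumptions.
import json

CONFIG_PATH = "/etc/orchestrator.conf.json"  # assumed location

with open(CONFIG_PATH) as f:
    conf = json.load(f)

# With this set, a failover that would cross data center boundaries is refused,
# even if a quorum of Orchestrator/Raft nodes would otherwise allow it.
conf["PreventCrossDataCenterMasterFailover"] = True

with open(CONFIG_PATH, "w") as f:
    json.dump(conf, f, indent=2)
```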

This incident has shifted our view of site reliability. We have learned that tighter operational controls and faster response times alone are not sufficient for site reliability in a system as complex as ours. We will also begin a systematic practice of validating failure scenarios before they have the potential to affect users, investing in failure injection and chaos engineering.

Conclusion

We know how much users depend on GitHub for their projects and businesses to succeed. No one cares more about the availability of our services and the correctness of your data than we do. We will continue to analyze this incident so that we can serve our users better and earn their continued trust.

Original English post: https://blog.github.com/2018-10-30-oct21-post-incident-analysis/