Fail at Scale is a Facebook article published in ACM Queue in 2015. It describes the most common kinds of online failures and how to cope with them, and its content is grounded in real production experience.

“What Would You Do If You Weren’t Afraid?” and “Fortune Favors the Bold” are Facebook credos, the kind posted on the walls.

To keep Facebook's systems stable while they change rapidly, engineers have summarized and abstracted the kinds of faults those systems encounter, because these must be understood in order to build a reliable system. To understand failures, engineers have built tools to diagnose problems and a culture of reviewing incidents so the same failures are not repeated.

Failures fall into three main categories.

Why do failures happen?

Individual machine failures

Typically, isolated failures encountered by a single machine do not affect the rest of the infrastructure. A stand-alone failure here means that a machine’s hard disk fails or a service on a machine encounters a bug in its code, such as memory corruption or deadlocks.

The key to avoiding individual machine failures is automation: known failure patterns are catalogued together with their remediations, and unknown symptoms are explored. When a machine shows symptoms of an unknown fault (such as slow responses), it is removed from rotation and analyzed offline, and the result is added to the catalog of known faults.
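As a rough illustration of this remediation loop (the failure names, remediations, and helper functions below are hypothetical, not Facebook's actual tooling):

# Hypothetical sketch of the automated remediation loop described above.
KNOWN_REMEDIATIONS = {
    "disk_failure": "replace_disk",
    "deadlocked_service": "restart_service",
}

def schedule_repair(machine: str, action: str) -> None:
    print(f"{machine}: scheduled automated repair '{action}'")

def remove_from_rotation(machine: str) -> None:
    print(f"{machine}: removed from rotation for offline analysis")

def handle_symptom(machine: str, symptom: str) -> None:
    """Apply a known remediation, or quarantine the machine for analysis."""
    if symptom in KNOWN_REMEDIATIONS:
        schedule_repair(machine, KNOWN_REMEDIATIONS[symptom])
    else:
        # Unknown symptom (e.g. slow responses): pull the machine out,
        # analyze it offline, and later add the pattern to the catalog.
        remove_from_rotation(machine)

handle_symptom("web123", "disk_failure")
handle_symptom("web456", "slow_responses")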

Workload variation

Real-world events can put a strain on Facebook’s infrastructure, such as:

  • Obama was elected president, and his Facebook page experienced record levels of activity.
  • The climax of a major sporting event, such as the Super Bowl or The World Cup, can lead to sky-high posts.
  • Load tests, as well as the dark-launch phase of a new feature, which is not shown to users but still generates traffic.

The statistics collected during these events offer a unique perspective on system design. Major events change user behavior, and that data informs later decisions about the system.

Human error

Figure 1A shows how incidents drop dramatically on Saturdays and Sundays even though traffic to the site remains consistent throughout the week. Figure 1B shows that in a six-month period there were only two weeks without incidents: the week of Christmas and the week when employees write peer reviews for each other.

These figures suggest that most failures are triggered by human activity, because the incident statistics track the rhythm of human work rather than the rhythm of site traffic.

Three common causes of failure

Failures have many causes, but three are the most common. Each cause listed here comes with preventive measures.

Rapidly deployed configuration changes

Configuration systems are often designed to replicate changes quickly across the globe. Rapid configuration change is a powerful tool, but it also means a problematic configuration can cause an incident just as quickly. Here are some ways to prevent configuration-change failures.

Everyone uses the same configuration system. A common configuration system ensures that its programs and tools apply to every type of configuration.

Static check for configuration changes. Many configuration systems allow loosely typed configurations, such as JSON structures. These types of configurations make it easy for engineers to mistype field names, use strings where integers are needed, or make other simple mistakes. Such simple errors are best caught with static validation. Structured formats (Facebook uses Thrift) provide the most basic validation. However, it is desirable to write programs to validate the configuration at a more detailed business level.
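As a rough illustration of this kind of static check (the field names and rules below are hypothetical, and Facebook's real validation is built on Thrift rather than JSON dataclasses):

# Minimal sketch of static configuration validation with a hypothetical schema.
import json
from dataclasses import dataclass

@dataclass
class CacheConfig:
    ttl_seconds: int
    max_entries: int
    region: str

def load_config(raw: str) -> CacheConfig:
    data = json.loads(raw)
    cfg = CacheConfig(**data)            # a mistyped or missing field name fails here
    if not isinstance(cfg.ttl_seconds, int) or not isinstance(cfg.max_entries, int):
        raise TypeError("ttl_seconds and max_entries must be integers")
    if cfg.ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")   # business-level rule
    return cfg

# A typo such as {"tll_seconds": 300, ...}, or "300" given as a string,
# is rejected before the configuration ever reaches production.
print(load_config('{"ttl_seconds": 300, "max_entries": 10000, "region": "us-east"}'))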

Canary deployment. Deploy the configuration at small scale first so that a bad change does not become a catastrophe. Canaries can take many forms. The simplest is A/B testing, such as enabling the new configuration for only 1% of users. Multiple A/B tests can run simultaneously, and metrics can be tracked over time.

A/B testing alone, however, does not satisfy all reliability requirements:

  • A change deployed to a small number of users, if it causes the relevant server to crash or run out of memory, can obviously have an impact beyond the limited number of users in the test.
  • A/B testing is also time consuming. Engineers often want to push small changes without using A/B testing.

To avoid obvious configuration problems, Facebook’s infrastructure automatically tests the new version of the configuration on a small number of servers.

For example, if we wanted to deploy a new A/B test to 1% of users, we would first ensure that those users' requests landed on a small number of servers, and then monitor those servers for a short period to make sure the configuration update caused no obvious crashes.
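A minimal sketch of this routing idea follows, with hypothetical server names and a deterministic bucketing scheme that is not necessarily what Facebook uses:

# Route the 1% test population onto a small canary pool before wider rollout.
import hashlib

CANARY_SERVERS = ["srv-canary-1", "srv-canary-2"]   # hypothetical small pool
TEST_PERCENT = 1                                    # 1% of users get the new config

def in_test_population(user_id: str) -> bool:
    """Deterministically bucket users so the same 1% always sees the test."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return bucket < TEST_PERCENT

def pick_server(user_id: str, production_servers: list) -> str:
    if in_test_population(user_id):
        # Test users land on the small canary pool, so a crash or memory
        # blow-up caused by the new configuration stays confined to it.
        return CANARY_SERVERS[hash(user_id) % len(CANARY_SERVERS)]
    return production_servers[hash(user_id) % len(production_servers)]

print(pick_server("user42", ["srv-1", "srv-2", "srv-3"]))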

Stick to configurations that work. The configuration system is designed to retain the previous configuration when an update fails. Developers tend to let a program crash when its configuration is broken, but Facebook's infrastructure developers decided it is better to keep a module running with the old configuration than to return errors to users. (Note: whether this is the right trade-off really depends on the scenario.)
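A minimal sketch of this fallback behavior, reusing the hypothetical JSON config idea from the earlier sketch; it illustrates the principle, not Facebook's configuration system:

# Keep serving with the last known good configuration when an update is bad.
import json

def validate(cfg: dict) -> None:
    """Hypothetical validation; a real system would check the full schema."""
    if not isinstance(cfg.get("ttl_seconds"), int) or cfg["ttl_seconds"] <= 0:
        raise ValueError("invalid ttl_seconds")

class ConfigHolder:
    def __init__(self, initial: dict):
        self._current = initial              # last known good configuration

    def apply_update(self, raw: str) -> bool:
        try:
            candidate = json.loads(raw)
            validate(candidate)
        except Exception:
            # Bad update: keep the old configuration instead of crashing
            # or returning errors to users.
            return False
        self._current = candidate
        return True

    def get(self) -> dict:
        return self._current

holder = ConfigHolder({"ttl_seconds": 300})
holder.apply_update('{"ttl_seconds": "oops"}')   # rejected, old config retained
print(holder.get())                              # {'ttl_seconds': 300}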

Roll back quickly when a configuration goes wrong. Sometimes, despite best efforts, a problematic configuration still makes it to production. Rapid rollback is the key to resolving these incidents, so configuration content is kept in a version control system to make rollback possible.

Hard dependencies on core services

Developers tend to assume that core services, such as configuration management, service discovery, or storage systems, never fail. Under that assumption, even a brief outage of a core service turns into a massive failure.

Cache data from core services. Some data can be cached locally in each service, which reduces the dependency on the core service being available.

Provide dedicated SDKs for core services. It is best to provide a dedicated SDK for each core service so that everyone follows the same best practices when using it. Cache management and failure handling can be built into the SDK, solving these problems for every user at once.

Run drills. Without drills there is no way to know whether you will survive the failure of a dependency, so fault-injection drills are a must.
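As a rough sketch of the caching and SDK ideas combined (the service-discovery interface here is hypothetical, not a real Facebook SDK):

# SDK-style wrapper that falls back to a local cache when the core service fails.
import time

class DiscoveryClient:
    def __init__(self, backend, cache_ttl: float = 300.0):
        self._backend = backend        # any object with a lookup(name) method
        self._cache = {}               # name -> (timestamp, endpoints)
        self._cache_ttl = cache_ttl

    def lookup(self, name: str):
        try:
            endpoints = self._backend.lookup(name)
            self._cache[name] = (time.time(), endpoints)
            return endpoints
        except Exception:
            # Core service is down: serve the cached answer, even if slightly
            # stale, rather than turning its outage into our own.
            cached = self._cache.get(name)
            if cached and time.time() - cached[0] < self._cache_ttl:
                return cached[1]
            raise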

Increased latency and resource exhaustion

Some failures cause latency to increase. The increase can be small (for example, a slight rise in CPU usage) or very large (the threads serving responses are deadlocked).

A small amount of extra latency is easily absorbed by Facebook's infrastructure, but a large amount can cause cascading failures. Almost every service has a limit on the number of outstanding requests, either because the number of threads is limited in thread-per-request services or because memory is limited in event-based services. If a service experiences a large amount of extra latency, the services that call it exhaust their own resources, and the failure spreads layer by layer into a major outage.

Resource exhaustion is a particularly destructive failure mode: it lets a failure in a service used by only a subset of requests bring down all requests.

A service invokes a new experimental service that is rolled out to 1% of users. Normally a request to this experimental service takes 1 millisecond, but because the new service has failed, requests now take 1 second. Requests from the 1% of users using the new service may consume so many threads that requests from the other 99% cannot be served.

The following can be used to avoid requests piling up:

  • Controlled delay. When analyzing past incidents involving latency, engineers found that large numbers of requests were piling up in queues waiting to be processed. Services typically have a limit on thread count or memory usage; when the service processes requests more slowly than they arrive, the queue grows until that limit is reached. To limit queue size without hurting reliability, Facebook engineers studied the research on the bufferbloat problem, which faces a similar need: queue for reliability without causing excessive delay during congestion. They implemented a variant of the CoDel (controlled delay) algorithm:

Note: although M and N appear as parameters in the pseudocode below, in practice they are fixed values: N = 100 ms and M = 5 ms.

onNewRequest(req, queue):
  if queue.lastEmptyTime() < (now - N seconds) {
     timeout = M ms
  } else {
     timeout = N seconds
  }
  queue.enqueue(req, timeout)

In this algorithm, if the queue has not been emptied in the past 100 ms, the time spent in the queue is limited to 5 ms. If the service was able to empty the queue within the last 100 ms, the time spent in the queue is limited to 100 ms. This reduces queuing under sustained load (because lastEmptyTime is in the distant past, forcing the 5 ms queuing timeout) while still allowing short bursts of queuing for reliability. Although such a short timeout may seem counterintuitive, it lets requests be discarded quickly rather than pile up when the system cannot keep pace with incoming requests, and it ensures the server always accepts slightly more work than it can actually handle, so it never sits idle.

As mentioned earlier, M and N generally do not need tuning per scenario. Other approaches to the queuing problem, such as capping the number of items in the queue or setting a timeout for the whole queue, do need per-scenario tuning; M = 5 ms and N = 100 ms work well in most cases. Facebook's open-source Wangle library and Thrift use this algorithm; a runnable sketch follows.
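For illustration, here is a minimal single-threaded Python sketch of the same controlled-delay idea; it is only a sketch, and the production implementations in Wangle and Thrift differ in detail:

# Minimal sketch of the controlled-delay variant shown above, N=100 ms, M=5 ms.
import time
from collections import deque

N = 0.100   # seconds; queue-drain window
M = 0.005   # seconds; aggressive timeout under sustained backlog

class CoDelQueue:
    def __init__(self):
        self._queue = deque()                  # entries of (request, deadline)
        self._last_empty = time.monotonic()    # last time the queue was empty

    def on_new_request(self, req):
        now = time.monotonic()
        if not self._queue:
            self._last_empty = now             # the queue is empty right now
        if self._last_empty < now - N:
            timeout = M                        # has not drained in the last 100 ms
        else:
            timeout = N                        # drained recently: allow up to 100 ms
        self._queue.append((req, now + timeout))

    def next_request(self):
        """Return the next request whose queueing deadline has not expired."""
        now = time.monotonic()
        while self._queue:
            req, deadline = self._queue.popleft()
            if not self._queue:
                self._last_empty = now
            if now <= deadline:
                return req
            # Deadline exceeded: drop the request instead of serving it late.
        return None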

  • Adaptive LIFO (last in, first out). Most services process their queues in FIFO order. During heavy queuing, however, the first-in request has often waited so long that the user may have already given up on the action that triggered it, so processing it would spend resources on a request unlikely to benefit anyone. Facebook's services therefore process requests with an adaptive LIFO policy: under normal conditions requests are handled FIFO, but the queue switches to LIFO when tasks start piling up. As Figure 2 shows, adaptive LIFO and CoDel work well together: CoDel sets short timeouts to prevent long queues from forming, while adaptive LIFO puts new requests at the front of the queue, maximizing their chance of meeting CoDel's deadline. HHVM implements this adaptive LIFO algorithm; a rough sketch follows below.
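The sketch below shows the FIFO-to-LIFO switch in Python with an illustrative depth threshold; HHVM's actual implementation differs in detail:

# Serve newest-first once the queue is clearly backed up.
from collections import deque

LIFO_THRESHOLD = 128        # hypothetical queue depth that signals a backlog

class AdaptiveLifoQueue:
    def __init__(self):
        self._queue = deque()

    def put(self, req):
        self._queue.append(req)

    def get(self):
        if not self._queue:
            return None
        if len(self._queue) > LIFO_THRESHOLD:
            # Backed up: serve the newest request first; it is the most likely
            # to still matter to a waiting user and to meet CoDel's deadline.
            return self._queue.pop()
        # Normal operation: first in, first out.
        return self._queue.popleft()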

  • Concurrency control. Both CoDel and adaptive LIFO run on the server side. The server side is the best place to reduce latency — the server serves a large number of clients and has more information than the client side.

However, some failures are severe enough that these server-side controls cannot kick in, so Facebook also implements a client-side policy: each client tracks the number of outstanding outbound requests on a per-service basis. When a new request would push the count of pending requests to a service past a configurable limit, the request is immediately marked as an error. (Note: this behaves much like a circuit breaker.) This mechanism prevents a single service from monopolizing all of its callers' resources.
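A minimal sketch of such a per-service limiter follows; the limit and service name are illustrative, and Facebook's real client-side mechanism may differ:

# Client-side concurrency control: fail fast when too many requests to one
# downstream service are already in flight.
from collections import defaultdict

class OutstandingLimiter:
    def __init__(self, max_outstanding: int = 100):
        self._max = max_outstanding
        self._outstanding = defaultdict(int)    # service name -> in-flight count

    def try_acquire(self, service: str) -> bool:
        if self._outstanding[service] >= self._max:
            return False        # caller should mark the request as an error now
        self._outstanding[service] += 1
        return True

    def release(self, service: str) -> None:
        self._outstanding[service] -= 1

# Typical usage around each outbound call:
#   if limiter.try_acquire("experimental-service"):
#       try:
#           send(request)
#       finally:
#           limiter.release("experimental-service")
#   else:
#       fail_fast(request)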

Tools to help diagnose faults

Despite the best precautions, breakdowns will happen. During a failure, using the right tools can quickly find the root cause and minimize the duration of the failure.

High-density dashboards with Cubism. When handling an incident, quick access to information matters. A good dashboard lets engineers quickly assess which metrics look abnormal and then use that information to infer the root cause. Dashboards, however, keep growing until they become hard to scan quickly, and the charts they display carry too many lines to read at a glance, as shown in Figure 3.

To solve this problem, we built our top-level dashboards using Cubism, a framework for creating horizon charts: charts that use color to encode information more densely, allowing easy comparison of multiple similar data series. For example, we use Cubism to compare metrics across data centers. Our tooling around Cubism supports simple keyboard navigation so engineers can view many metrics quickly. Figure 4 shows the same data set at different heights as an area chart and as a horizon chart: the 30-pixel-tall area chart is hard to read, whereas peaks are easy to spot in the horizon chart at the same height.

What has changed recently?

Since one of the primary causes of failure is human error, one of the most effective ways to debug a failure is to look for recent changes made by people. We collect information about recent changes, from configuration changes to deployments of new releases, in a tool called OpsStream. However, this data source has become very noisy over time: with thousands of engineers making changes, there are often too many to evaluate during a single incident.

To solve this problem, our tools try to correlate failures with the changes related to them. For example, when an exception is thrown, in addition to printing the stack trace we also print any configuration value read by the request that changed recently. Often the cause of the error lies in one of those configuration values, which lets us respond quickly, for example by asking the engineer who pushed the configuration to roll it back immediately.
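As a hedged illustration of this idea (the class and data layout below are hypothetical, not OpsStream's actual design):

# Remember which config keys a request read, and print the most recently
# changed ones next to the stack trace when an exception escapes.
import traceback

class ConfigReadTracker:
    def __init__(self, config_store: dict):
        self._store = config_store     # key -> (value, last_changed_timestamp)
        self.reads = []                # keys read while serving this request

    def get(self, key: str):
        self.reads.append(key)
        return self._store[key][0]

    def report_exception(self, exc: BaseException) -> None:
        print("".join(traceback.format_exception(type(exc), exc, exc.__traceback__)))
        recent = sorted(set(self.reads), key=lambda k: self._store[k][1], reverse=True)
        for key in recent[:5]:
            value, changed_at = self._store[key]
            print(f"recently changed config: {key}={value!r} (changed at {changed_at})")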

Learn from your failures

When failures occur, our incident review process helps us learn from those incidents.

The purpose of the incident review process is not to point fingers. No one has been fired because of an incident he or she caused. The purpose of the review is to understand what happened, remedy the circumstances that allowed the incident to occur, and establish safety mechanisms to reduce the impact of future events.

Facebook has developed a methodology called DERP (Detection, Escalation, Remediation, and Prevention) to help make incident reviews productive.

  • Detection. How was the problem detected: alarms, dashboards, user reports?
  • Escalation. Did the right people get involved quickly? Could they have been pulled in by alerts rather than manually?
  • Remediation. What steps were taken to fix the problem? Can those steps be automated?
  • Prevention. What improvements can keep this type of failure from happening again? How could the system have failed gracefully, or failed faster, to reduce the impact?

With the help of this pattern, even if the same type of incident cannot always be prevented, the team can at least recover faster the next time it happens.

Move fast with fewer failures

A “move fast” mentality doesn’t have to conflict with reliability. To make these ideas compatible, Facebook’s infrastructure provides safety valves:

  • The configuration system prevents the rapid deployment of bad configurations;
  • Core services provide clients with hardened SDKs to prevent failures;
  • A core stability library prevents resource exhaustion when latency rises.

To deal with those that slip through the cracks, easy-to-use dashboards and tools have also been built to help find the most recent changes associated with the failure.

Most importantly, the lessons learned through post-incident reviews make the infrastructure ever more reliable.

References

  1. CoDel (controlled delay) algorithm; queue.acm.org/detail.cfm?… .
  2. Cubism; square.github.io/cubism/.
  3. HipHop Virtual Machine (HHVM); github.com/facebook/hh… .
  4. Thrift framework; github.com/facebook/fb… .
  5. Wangle library; github.com/facebook/wa… .