The Tail at Scale is a paper Google published in 2013 about the long-tail latency problem of large-scale online services.

To understand how to tackle the long-tail problem, it helps to first understand what long-tail latency is. When building online services, we all know to watch the p99/p999 latency of the service so that the vast majority of users get a response within the expected time frame.

Here is a chart of the number of requests at each response time:

Most systems follow a distribution like this. Internet-scale systems are now fairly large, and a single service may depend on dozens or even hundreds of other services. The long-tail latency of an individual module gets amplified at the service level when the fan-out is large, as shown in The Tail at Scale paper.

Consider a system where most service calls respond within 10ms, but the 99th percentile is one second. If a user request is handled by only one such service, just one in 100 user requests will be slow (one second). The diagram outlines how service-level latency is affected by a very small probability of large latency values in this hypothetical scenario.

If a user request has to collect responses from 100 such services in parallel, 63% of user requests will take longer than a second (marked “X” in the figure). Even for a service that has only a one-in-10,000 chance of a response taking more than one second on a single server, if the service runs on 2,000 instances, almost one in five user requests will take more than one second (marked “O” in the figure).

Xargin note:

  1. Since the 100 requests are issued in parallel, the overall latency stays under 1s only if every one of them falls within the 99th percentile, so the probability of staying under 1s is 0.99^100 ≈ 0.3660. That leaves roughly a 63% chance of exceeding 1s.
  2. Likewise, the probability that all 2,000 calls stay under 1s is 0.9999^2000 ≈ 0.8187, which means nearly 20% of user requests will take more than 1s (checked in the sketch below).
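
The arithmetic can be double-checked with a few lines of Go (a throwaway sketch, not from the paper or the original post):

```go
// Check the fan-out math: if a single call is "fast" with probability p,
// a request fanned out to n such calls is fast only when all n are fast,
// i.e. with probability p^n.
package main

import (
	"fmt"
	"math"
)

func main() {
	// 100-way fan-out, each call under 1s with probability 0.99.
	allFast100 := math.Pow(0.99, 100)
	fmt.Printf("P(all 100 calls < 1s) = %.4f, so P(request > 1s) ≈ %.0f%%\n",
		allFast100, (1-allFast100)*100) // ≈ 0.3660, so ≈ 63%

	// 2,000 servers, each slow (>1s) with probability 1/10,000.
	allFast2000 := math.Pow(0.9999, 2000)
	fmt.Printf("P(all 2000 calls < 1s) = %.4f, so P(request > 1s) ≈ %.0f%%\n",
		allFast2000, (1-allFast2000)*100) // ≈ 0.8187, so ≈ 18%
}
```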

The table above shows measurements from a real Google service that is logically similar to the simplified scenario above: the root service distributes a request to a large number of leaf services through intermediate services. The table shows the effect of a large fan-out on the latency distribution.

The 99th-percentile latency for a single random request to complete, measured at the root server, was 10ms. However, the 99th-percentile latency for all requests to complete is 140ms, and the 99th-percentile latency for 95% of the requests to complete is 70ms, which means that waiting for the slowest 5% of requests accounts for half of the total 99th-percentile latency. Tuning these slow-request cases can therefore have a significant impact on the overall latency of the service.

Why does long-tail latency occur?

High long-tail latency in a single service component can arise for a number of reasons, including:

Shared resources (Xargin: increasingly common nowadays). A machine may be shared by different applications competing for shared resources such as CPU cores, processor caches, memory bandwidth, and network bandwidth; within a single application, different requests may also compete for resources.

Daemons. Background daemons may use only limited resources on average, but can cause latency jitter of a few milliseconds when they run.

Global resource sharing. Applications running on different machines may compete for global resources (such as network switches and shared file systems).

Maintenance activities. Background activities (such as data reconstruction in distributed file systems, periodic log compaction in storage systems like BigTable, and periodic garbage collection in garbage-collected languages) cause periodic latency spikes.

Queueing. Multiple layers of queues in intermediate servers and network switches amplify this variability.

There are also causes rooted in current hardware trends:

Power limits. Modern CPUs are designed to run temporarily above their average power envelope (e.g., Intel Turbo Boost) and then throttle to limit heat if the elevated activity is sustained for too long.

Garbage collection. Solid-state storage devices provide very fast random reads, but the need to periodically garbage-collect large blocks of data can increase read latency by a factor of 100, even under modest write activity.

Energy management. Power-saving modes on many types of devices save considerable energy, but add extra latency when moving from inactive to active mode.

The solution

Minimizing long-tail latency within individual modules

Differentiated service classes and higher-level queuing. Differentiated service classes can be used to prioritize requests a user is actively waiting for over non-interactive requests. Keep low-level queues short so that higher-level queuing policies take effect sooner.
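
As a rough illustration (my own sketch, not Google's implementation), the worker below always drains the interactive queue before touching the background queue, and both queues are kept deliberately short:

```go
// Two service classes sharing one worker: interactive requests are always
// preferred over background requests, and the small channel buffers keep the
// low-level queues short so this policy takes effect quickly.
package main

import (
	"fmt"
	"time"
)

type request struct{ id string }

func handle(r request) {
	fmt.Println("handling", r.id)
	time.Sleep(10 * time.Millisecond) // simulated work
}

func worker(interactive, background <-chan request) {
	for {
		// Take interactive work first whenever any is queued.
		select {
		case r := <-interactive:
			handle(r)
			continue
		default:
		}
		// Otherwise wait for whatever arrives, still preferring interactive.
		select {
		case r := <-interactive:
			handle(r)
		case r := <-background:
			handle(r)
		}
	}
}

func main() {
	interactive := make(chan request, 8) // short low-level queues
	background := make(chan request, 8)
	go worker(interactive, background)

	background <- request{id: "batch-1"}
	interactive <- request{id: "user-1"}
	time.Sleep(100 * time.Millisecond)
}
```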

Reducing head-of-line blocking. Break long-running requests into a series of smaller requests that can be interleaved with other short-running tasks; Google’s web search system, for example, uses this kind of time slicing to prevent a few expensive queries from adding latency to a large number of cheaper ones. (Head-of-line blocking also exists at the protocol and connection layer, which is addressed by moving to newer protocols, i.e. HTTP/1 -> HTTP/2 -> HTTP/3.)
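
A toy sketch of the time-slicing idea (hypothetical, not Google's code): the large query re-enqueues itself one slice at a time on a shared task queue, so a cheap query that arrives in the meantime waits for at most one slice rather than the whole thing:

```go
// A long-running query split into slices that share a FIFO task queue with
// cheap queries; after each slice the big query goes back to the end of the
// queue, so short tasks are not blocked behind it.
package main

import (
	"fmt"
	"time"
)

func main() {
	tasks := make(chan func(), 64)

	// One worker draining the shared task queue.
	go func() {
		for t := range tasks {
			t()
		}
	}()

	// The big query processes one slice, then re-enqueues its next slice.
	var slice func(i int)
	slice = func(i int) {
		time.Sleep(5 * time.Millisecond) // one slice of the big query
		fmt.Println("big query, slice", i)
		if i+1 < 10 {
			tasks <- func() { slice(i + 1) }
		}
	}
	tasks <- func() { slice(0) }

	// A cheap query arriving shortly afterwards runs after at most one slice.
	time.Sleep(2 * time.Millisecond)
	tasks <- func() { fmt.Println("cheap query done") }

	time.Sleep(200 * time.Millisecond)
}
```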

Managing background activities and synchronized disruption. Background tasks can generate significant CPU, disk, or network load; examples are log compaction in a log-structured storage system and garbage collector activity in a garbage-collected language. Combined with throttling, heavy operations can be broken into smaller, lower-cost operations and triggered when overall load is low, such as in the middle of the night, to reduce the impact of background activity on interactive request latency.
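
A minimal sketch of pacing background work (compactChunk is a made-up stand-in for a real maintenance operation): each chunk waits on a ticker, so the background load is spread out instead of hitting the interactive path all at once. In a real system the rate would be raised when overall load is known to be low:

```go
// Throttled background compaction: the work is broken into small chunks and
// at most one chunk runs per tick.
package main

import (
	"fmt"
	"time"
)

func compactChunk(i int) {
	time.Sleep(2 * time.Millisecond) // stand-in for real compaction work
	fmt.Println("compacted chunk", i)
}

func main() {
	throttle := time.NewTicker(50 * time.Millisecond) // tune per current load
	defer throttle.Stop()

	for i := 0; i < 5; i++ {
		<-throttle.C
		compactChunk(i)
	}
}
```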

Adaptive techniques within a single request

Hedged requests. A simple way to curb latency variability is to issue the same request to multiple replicas (in Go, for example, by fanning out goroutines that report back over a channel) and use whichever result returns first. Once the first result is received, the client cancels the remaining outstanding requests. Doing this naively, however, multiplies the load, so some optimization is needed.

One approach is to defer sending the second request until the first request has been outstanding longer than the 95th-percentile latency and still has not returned. This limits the extra load to roughly 5% while greatly shortening the tail.
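
Here is a minimal hedged-request sketch in Go; queryReplica and the hard-coded 10ms p95 are assumptions for illustration, and a real client would derive the 95th-percentile delay from live latency statistics:

```go
// Hedged requests: send to replica 0, and only if it has not answered within
// p95 send a second copy to replica 1; the first response wins and the other
// in-flight request is cancelled via the shared context.
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"
)

func queryReplica(ctx context.Context, replica int) (string, error) {
	delay := time.Duration(rand.Intn(20)) * time.Millisecond // simulated latency
	select {
	case <-time.After(delay):
		return fmt.Sprintf("result from replica %d", replica), nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

func hedgedGet(ctx context.Context, p95 time.Duration) (string, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancels whichever request is still in flight

	results := make(chan string, 2)
	go func() {
		if r, err := queryReplica(ctx, 0); err == nil {
			results <- r
		}
	}()

	select {
	case r := <-results:
		return r, nil // primary answered within p95, no hedge needed
	case <-time.After(p95):
		// Primary is slow: hedge to a second replica.
		go func() {
			if r, err := queryReplica(ctx, 1); err == nil {
				results <- r
			}
		}()
	}

	select {
	case r := <-results:
		return r, nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

func main() {
	fmt.Println(hedgedGet(context.Background(), 10*time.Millisecond))
}
```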

Tied requests. Instead of waiting before sending as with hedging, the request is sent to multiple replicas at the same time, and each replica is told that the others have also been given the request; when one replica begins executing the request, it asks the other replicas to cancel their copies. This requires additional cross-replica synchronization over the network.
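
A simplified in-process sketch of the idea (my own construction, not Google's): both copies of a request share a claim flag, and whichever replica dequeues it first claims it, which effectively cancels the twin. In a real system the claim would be a cancellation message sent between replicas:

```go
// Tied requests: the same logical request is enqueued on two replicas, and
// exactly one of them wins the atomic claim and does the work.
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

type tiedRequest struct {
	id      string
	claimed int32 // shared by both copies: 0 = unclaimed, 1 = claimed
}

// claim returns true for exactly one of the enqueued copies.
func (r *tiedRequest) claim() bool {
	return atomic.CompareAndSwapInt32(&r.claimed, 0, 1)
}

func replica(name string, queue <-chan *tiedRequest, wg *sync.WaitGroup) {
	defer wg.Done()
	for r := range queue {
		if !r.claim() {
			// The twin copy already started executing; drop this one.
			fmt.Printf("%s: %s cancelled by twin\n", name, r.id)
			continue
		}
		time.Sleep(5 * time.Millisecond) // simulated work
		fmt.Printf("%s: %s executed\n", name, r.id)
	}
}

func main() {
	qA := make(chan *tiedRequest, 16)
	qB := make(chan *tiedRequest, 16)
	var wg sync.WaitGroup
	wg.Add(2)
	go replica("replica-A", qA, &wg)
	go replica("replica-B", qB, &wg)

	// Enqueue the same logical request on both replicas.
	r := &tiedRequest{id: "req-1"}
	qA <- r
	qB <- r

	close(qA)
	close(qB)
	wg.Wait()
}
```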

Adaptive techniques across requests

Micro-partitioning. Dividing work into partitions much smaller than a server, such as 20 partitions per large machine, speeds up traffic rebalancing and recovery. Fine-grained load adjustment minimizes the latency impact of load imbalance.

Selective replication. Create additional replicas of partitions that you detect or predict will be hot; load balancers can then spread the load across them. Google’s main web search system takes this approach, making additional copies of popular and important documents in multiple micro-partitions. (That is, provision more instances for hotspot partitions, which presumably requires some prediction based on the specific business.)

Put slow machines on probation. When a slow machine is detected, temporarily exclude it from serving traffic. Since slowness is often temporary, keep monitoring for when the affected machine can be brought back online: continue sending shadow requests to the excluded server to collect its latency statistics, and reintegrate it into the service once the problem subsides. (This is essentially a simple circuit breaker.)
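
A very small probation sketch (assumed structure, far from a production circuit breaker): a backend observed to be slow is marked unavailable for a while; normal traffic skips it, while shadow requests can still be sent to it to see whether it has recovered:

```go
// Probation for slow backends: markSlow excludes a backend for a fixed
// period, and available reports whether real traffic should be routed to it.
package main

import (
	"fmt"
	"sync"
	"time"
)

type backend struct {
	mu             sync.Mutex
	probationUntil time.Time
}

// markSlow puts the backend on probation for the given duration.
func (b *backend) markSlow(d time.Duration) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.probationUntil = time.Now().Add(d)
}

// available reports whether real traffic should go to this backend;
// shadow requests ignore it and keep collecting latency statistics.
func (b *backend) available() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return time.Now().After(b.probationUntil)
}

func main() {
	b := &backend{}
	fmt.Println("available:", b.available()) // true

	b.markSlow(50 * time.Millisecond)        // slow responses observed
	fmt.Println("available:", b.available()) // false: probation

	time.Sleep(60 * time.Millisecond)
	fmt.Println("available:", b.available()) // true again after probation
}
```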

Some other tradeoffs

Consider ‘good enough’ responses. Once a sufficient fraction of the servers have responded, the user is often best served with slightly incomplete results in exchange for better end-to-end latency. If the incomplete results are useful enough for the user, it is fine to leave out the less important ones.
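
A sketch of the “good enough” pattern with hypothetical leaf calls: fan out to every leaf, but return once either all of them have answered or a latency budget expires, accepting slightly incomplete results:

```go
// Good-enough responses: serve whatever subset of leaf results arrived
// within the end-to-end latency budget.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

func queryLeaf(i int, out chan<- string) {
	time.Sleep(time.Duration(rand.Intn(30)) * time.Millisecond) // simulated latency
	out <- fmt.Sprintf("leaf-%d", i)
}

func main() {
	const leaves = 10
	budget := time.After(15 * time.Millisecond) // end-to-end latency budget

	out := make(chan string, leaves)
	for i := 0; i < leaves; i++ {
		go queryLeaf(i, out)
	}

	var results []string
	for len(results) < leaves {
		select {
		case r := <-out:
			results = append(results, r)
		case <-budget:
			// Budget exhausted: return the (possibly incomplete) results.
			fmt.Printf("returning %d/%d results\n", len(results), leaves)
			return
		}
	}
	fmt.Println("all leaves answered:", len(results))
}
```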

Use canary requests. Another problem that can occur in systems with very high fan-out is that a particular request triggers an untested code path, causing crashes or extremely long delays on thousands of servers simultaneously. To prevent such correlated crashes, some of Google’s IR systems use a technique called “canary requests”: rather than initially sending a request to thousands of leaf servers, the root server first sends it to one or two leaf servers. The remaining servers are only queried if the root gets a successful response from the canary within a reasonable amount of time.
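
Finally, a sketch of canary requests with hypothetical helpers: the root sends the query to one or two leaves first under a short deadline, and only fans out to the remaining leaves once a canary has succeeded:

```go
// Canary requests: try the query on a couple of leaf servers before
// fanning it out to everyone, to avoid correlated crashes.
package main

import (
	"context"
	"fmt"
	"time"
)

func queryLeaf(ctx context.Context, leaf int, query string) error {
	time.Sleep(2 * time.Millisecond) // simulated leaf work; a crash or
	return nil                       // timeout would surface as an error here
}

func fanOut(ctx context.Context, leaves int, query string) error {
	// Step 1: canary the query on the first two leaves with a short deadline.
	canaryCtx, cancel := context.WithTimeout(ctx, 50*time.Millisecond)
	defer cancel()
	for leaf := 0; leaf < 2 && leaf < leaves; leaf++ {
		if err := queryLeaf(canaryCtx, leaf, query); err != nil {
			return fmt.Errorf("canary failed, refusing to fan out: %w", err)
		}
	}

	// Step 2: the canaries survived, so send the query to the rest.
	for leaf := 2; leaf < leaves; leaf++ {
		go queryLeaf(ctx, leaf, query)
	}
	fmt.Println("query fanned out to", leaves, "leaves")
	return nil
}

func main() {
	_ = fanOut(context.Background(), 1000, "some query")
	time.Sleep(10 * time.Millisecond)
}
```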