This article is translated from Chapter 8 of Designing Data-Intensive Applications.

The last few chapters have repeatedly looked at how systems handle things going wrong. For example, we discussed replica failover, replication lag, and concurrency control for transactions. The better we understand the various edge cases that can occur in real systems, the better we can deal with them.

Those chapters, however, even though they talked a lot about faults, were still too optimistic. In this chapter we will turn our pessimism up to the maximum and assume that anything that can go wrong eventually will go wrong.

Programming for distributed systems is fundamentally different from writing software for a single machine; the main difference is that distributed systems have many novel ways of failing. In this chapter we will look at what tends to go wrong in practice, and learn what we can and cannot rely on.

In the end, our task as engineers is to build systems that do their job (that is, meet the guarantees that users expect) even when everything goes wrong. In Chapter 9, we will look at some examples of algorithms that can provide such guarantees in a distributed system. But first, in this chapter, we must understand the challenges we are up against.

This chapter is a thoroughly pessimistic and gloomy overview of the things that may go wrong in a distributed system. We will look into problems with networks (“Unreliable Networks”, p. 277) and with clocks and timing issues (“Unreliable Clocks”, p. 287), and we will discuss to what degree they can be avoided. The consequences of all these issues can be disorienting, so we will also explore how to think about the state of a distributed system and how to reason about what has happened (“Knowledge, Truth, and Lies”, p. 300).

Faults and partial failures

When you write a program on a single machine, it usually behaves in a predictable way: it either works or it doesn’t. Buggy software can give the impression that something is wrong with your computer (a reboot usually fixes the problem), but it’s mostly the result of poorly written software.

There is no fundamental reason why software on a single machine should be flaky: when the hardware is working correctly, the same operation always produces the same result (it is deterministic). If there is a hardware problem (for example, memory corruption or a loose connector), the consequence is usually a total system failure (for example, a kernel panic, a “blue screen of death”, or a failure to boot). An individual computer with good software is usually either fully functional or entirely broken, not something in between.

This is a deliberate choice in the design of computers: if an internal fault occurs, we prefer the computer to crash completely rather than return a wrong result, because wrong results are difficult and confusing to deal with. Thus, computers hide the fuzzy physical reality on which they are implemented and present an idealized system model that operates with mathematical perfection. A CPU instruction always does the same thing; if you write some data to memory or disk, that data remains intact and doesn’t get randomly corrupted. This design goal of always computing correctly goes all the way back to the very first digital computers.

When you write software that runs on multiple computers and is connected over a network, the situation is completely different. In distributed systems, we are no longer in an ideal system model – we have no choice but to face the messy reality of the physical world. And in the real world, as this anecdote shows, all sorts of things can go wrong:

In my limited experience I’ve dealt with long-lived network partitions in a single data center (DC), PDU (power distribution unit) failures, switch failures, accidental power cycles of whole racks, whole-DC backbone failures, whole-DC power failures, and a hypoglycemic driver smashing his Ford pickup truck into a data center’s HVAC system. And I’m not even an ops guy. – Coda Hale

In a distributed system, there may well be some parts of the system that are broken in some unpredictable way, even though other parts of the system are working fine. This is known as a partial failure. The difficulty is that partial failures are nondeterministic: if you try to do anything involving multiple nodes and the network, it may sometimes work and sometimes unpredictably fail. As we shall see, you may not even know whether something succeeded or not, as the time it takes a message to travel across the network is also nondeterministic!

This uncertainty and the possibility of partial failure is what makes distributed systems difficult to deal with.

Cloud computing and supercomputing

There is a spectrum of philosophies on how to build large-scale computing systems:

  • At one end of the scale is the field of high-performance computing (HPC). Supercomputers with thousands of CPUs are typically used for computationally intensive scientific computing tasks, such as weather forecasting or molecular dynamics (modeling the movement of atoms and molecules).

  • At the other end of the spectrum is cloud computing, which is not very well defined but is commonly associated with multi-tenant data centers, commodity computers connected to IP networks (usually Ethernet), elastic/on-demand resource allocation, and billing by the hour.

With these philosophies come very different approaches to handling faults. In a supercomputer, a job typically checkpoints the state of its computation to durable storage from time to time. If one node fails, a common solution is to simply stop the entire cluster workload. After the faulty node is repaired, the computation is restarted from the last checkpoint. Thus, a supercomputer is more like a single-node computer than a distributed system: it deals with partial failure by letting it escalate into total failure; if any part of the system fails, just let everything crash (like a kernel panic on a single machine).

In this book, we focus on systems that implement Internet services, which often look very different from supercomputers:

  • Many Internet-related applications are online in the sense that they need to be able to provide low-latency services to users at any time. Service unavailability (for example, stopping the cluster for repair) is unacceptable. In contrast, offline (batch) jobs like weather simulations can be stopped and restarted with minimal impact.

  • Supercomputers are typically built from specialized hardware, where each node is quite reliable and nodes communicate through shared memory and remote direct memory access (RDMA). Nodes in cloud services, on the other hand, are built from commodity machines, which can provide equivalent performance at lower cost but also have higher failure rates.

  • Large data center networks are often based on IP and Ethernet, arranged in Clos topologies to provide high bisection bandwidth. Supercomputers often use specialized network topologies, such as multi-dimensional meshes and toruses, which yield better performance for HPC workloads with known communication patterns.

  • The bigger a system gets, the more likely it is that one of its components is broken. Over time, broken things get fixed and new things break, but in a system with thousands of nodes, it is reasonable to assume that something is always broken. If the error-handling strategy is simply to give up, a large system can end up spending a lot of its time recovering from faults rather than doing useful work.

  • If the system can tolerate failed nodes and still continue to work as a whole, this is a very useful feature for operations and maintenance: for example, rolling upgrades (see Chapter 4) can be performed, restarting nodes one at a time, and the system continues to serve users without interruption. In a cloud environment, if a virtual machine is not performing well, you can kill it and request a new (hopefully faster) virtual machine.

  • In a geographically distributed deployment (keeping data geographically close to users to reduce access latency), communication is likely to take place over the Internet, which is slow and unreliable compared to local networks. Supercomputers usually assume that all their nodes are close together.

If we want distributed systems to work, we must accept the possibility of partial failures and build fault tolerance into our software. In other words, we need to build reliable systems from unreliable components. (As discussed in “Reliability” on page 6, there is no perfect reliability, so we need to understand the limits of what we can realistically promise.)

Even in smaller systems consisting of only a few nodes, it’s important to think about partial failures. In a small system, it’s quite likely that most of the components are working correctly most of the time. However, sooner or later, some part of the system will become faulty, and the software will have to somehow handle it. Fault handling must be part of the software design, and the operator of the software needs to know what behavior to expect when a fault occurs.

It would be unwise to assume that faults are rare and simply hope for the best. It is important to consider a wide range of possible faults, even fairly unlikely ones, and to artificially create such situations in your testing environment to see what happens. In distributed systems, suspicion, pessimism, and paranoia pay off.

Building reliable systems from unreliable components

You may wonder whether this makes any sense; intuitively it may seem like a system can only be as reliable as its least reliable component, its weakest link. This is not the case: in fact, it is an old idea in computing to build a more reliable system from a less reliable underlying base. For example:

  • Error-correcting codes allow digital data to be transmitted accurately across a communication channel that occasionally gets some bits wrong, for example due to radio interference on a wireless network.

  • IP (the Internet Protocol) is unreliable: it may drop, delay, duplicate, or reorder packets. TCP (the Transmission Control Protocol) provides a more reliable transport layer on top of IP: it ensures that missing packets are retransmitted, duplicates are eliminated, and packets are reassembled into the order in which they were sent.

Although the system can be more reliable than its underlying parts, there is always a limit to how much more reliable it can be. For example, error-correcting codes can deal with a small number of single-bit errors, but if the signal is swamped by interference, there is a fundamental limit to how much data can get through the communication channel. TCP can hide packet loss, duplication, and reordering from you, but it cannot magically remove delays in the network.
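
As a toy illustration of this principle (not a code used in real networks, just a sketch to show the idea and its limits), consider a triple-repetition code in Python: every bit is sent three times and the receiver takes a majority vote, so any single flipped bit per group is corrected, but two flips in the same group still get through.

```python
def encode(bits):
    """Triple-repetition code: transmit every bit three times."""
    return [b for bit in bits for b in (bit, bit, bit)]

def decode(received):
    """Majority vote over each group of three; corrects one flipped bit per group."""
    return [1 if sum(received[i:i + 3]) >= 2 else 0 for i in range(0, len(received), 3)]

sent = encode([1, 0, 1])        # [1, 1, 1, 0, 0, 0, 1, 1, 1]
corrupted = sent.copy()
corrupted[4] = 1                # one bit flipped by "interference" is corrected...
assert decode(corrupted) == [1, 0, 1]

corrupted[3] = 1                # ...but two flips in the same group decode wrongly
assert decode(corrupted) == [1, 1, 1]
```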

Although the more reliable higher-level system is not perfect, it is still useful, because it takes care of some of the tricky low-level faults, and the remaining faults are usually easier to reason about and deal with.

Unreliable networks

As discussed in the introduction to Part 2, the distributed system we focus on in this book is a shared-nothing system: a collection of machines connected over a network. The network is the only way these machines can communicate. We assume that each machine has its own memory and disk, and that one machine cannot access the memory or disk of the other (except to make requests to services over the network).

Shared-nothing isn’t the only way of building systems, but it has become the dominant approach for building Internet services, for several reasons: it’s comparatively cheap because it requires no special hardware, it can make use of commoditized cloud computing services, and it can achieve high reliability through redundancy across multiple geographically distributed data centers.

The Internet and most internal networks in data centers (usually Ethernet) are asynchronous packet networks. In this kind of network, one node can send a message (a packet) to another node, but the network gives no guarantees as to when it will arrive, or whether it will arrive at all. If you send a request and expect a response, many things could go wrong (some of which are illustrated in Figure 8-1):

  1. Your request may have been lost (perhaps someone unplugged a network cable).
  2. Your request may be waiting in a queue and will be delivered later (perhaps the network or the recipient is overloaded).
  3. The remote node may have failed (perhaps it crashed or it was powered down).
  4. The remote node may have temporarily stopped responding (perhaps it is experiencing a long garbage collection pause; see “Process pauses” on page 295), but it will start responding again later.
  5. The remote node may have processed your request, but the response has been lost on the network (perhaps a network switch has been misconfigured).
  6. The remote node may have processed your request, but the response has been delayed and will be delivered later (perhaps the network or your own machine is overloaded).

The sender can’t even tell whether the packet was delivered: the only option is for the recipient to send a response message, which may in turn be lost or delayed. These issues are indistinguishable in an asynchronous network: the only information you have is that you haven’t received a response yet. If you send a request to another node and don’t receive a response, it is impossible to tell why.

The usual way of handling this issue is a timeout: after some time you give up waiting and assume that the response is not going to arrive. However, when a timeout occurs, you still don’t know whether the remote node got your request or not (and if the request is still queued somewhere, it may still be delivered to the recipient, even if the sender has given up on it).
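
As a minimal sketch of the timeout approach (the host, port, and request below are made up for illustration, and real systems would use a proper RPC or HTTP client), note that the exception branch cannot distinguish between a lost request, a crashed node, and a response that is merely late:

```python
import socket

def call_with_timeout(host, port, request, timeout_s=1.0):
    """Send a request and wait for a response, giving up after timeout_s seconds.

    Returns the response bytes, or None if we gave up. A None result does NOT
    tell us whether the remote node received or processed the request."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            sock.settimeout(timeout_s)
            sock.sendall(request)
            return sock.recv(4096)
    except (socket.timeout, OSError):
        # Request lost? Node crashed? Response delayed beyond the deadline?
        # From here, all of these failure modes look exactly the same.
        return None

# Hypothetical usage:
# response = call_with_timeout("node2.example.com", 9000, b"GET /health\r\n")
```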

Network faults in practice

We’ve been building computer networks for decades – one might hope that by now we’d know how to make them reliable. But it seems we haven’t succeeded yet.

There are systematic studies, and plenty of anecdotal evidence, showing that network problems can be surprisingly common, even in controlled environments like a data center operated by one company. One study in a medium-sized data center found about 12 network faults per month, of which half disconnected a single machine and half disconnected an entire rack. Another study measured the failure rates of components like top-of-rack switches, aggregation switches, and load balancers, and found that adding redundant networking gear doesn’t reduce faults as much as you might hope, since it doesn’t guard against human error (for example, misconfigured switches), which is a major cause of network outages.

Public cloud services such as EC2 are notorious for having frequent transient network glitches, and well-managed private data center networks can be more stable environments. Nevertheless, nobody is immune to network problems: for example, a problem during a switch software upgrade can trigger a network topology reconfiguration, during which network packets can be delayed for more than a minute. Sharks might bite undersea cables and damage them. Other surprising faults include a network interface that sometimes drops all inbound packets but sends outbound packets successfully: just because a network link works in one direction doesn’t guarantee it’s also working in the opposite direction.

Network partitions

When one part of the network is cut off from the rest due to a network fault, that is sometimes called a network partition or netsplit. In this book we use the more general term network fault, to avoid confusion with partitions (shards) of a storage system, as discussed in Chapter 6.

Even if network failures are rare in your environment, the fact that they can happen means that your software needs to be able to handle them. There is always a chance that communication over the network will fail. There is no way around it.

If the error handling of network faults is not defined and tested, arbitrarily bad things could happen: for example, the cluster could become deadlocked and permanently unable to serve requests, even once the network recovers, or it could even delete all of your data. If software is put in an unanticipated situation, it may do arbitrary unexpected things.

Handling network faults doesn’t necessarily mean tolerating them: if your network is normally fairly reliable, a valid approach may be to simply show an error message to users while your network is experiencing problems. However, you do need to know how your software reacts to network problems and ensure that the system can recover from them. It may make sense to deliberately trigger network problems and test the system’s response (this is the idea behind Chaos Monkey; see “Reliability” on page 6).

Detecting faults

Many systems need to automatically detect faulty nodes. For example:

  • The load balancer needs to stop sending requests to dead nodes.
  • In a distributed database with single-leader replication, if the leader fails, a follower needs to be promoted to become the new leader (see “Handling node failures” on page 152).

Unfortunately, the uncertainty of the network makes it difficult to tell if a node is working properly. In certain cases, you may receive feedback that clearly tells you that a component is not working properly:

  • If you can reach the machine on which the node should be running, but no process is listening on the destination port (for example, because the process crashed), the operating system will helpfully close or refuse TCP connections by sending a RST or FIN packet in reply. However, if the node crashed while it was handling your request, you have no way of knowing how much data was actually processed by the remote node.

  • If a node process crashed (or was killed by an administrator) but the node’s operating system is still running, a script can notify other nodes about the crash so that another node can take over quickly, without having to wait for a timeout to expire.

  • If you have access to the administrative interfaces of data center network switches, you can query them to detect hardware level link failures (for example, whether remote machines are powered down). This option cannot be used if you are connected over the Internet, if you are in a shared data center but have no permission to access the switch, or if you cannot access the admin interface due to a network problem.

  • If a router is sure that the IP address you’re trying to connect to is unreachable, it may reply with an ICMP Destination Unreachable packet. But the router doesn’t have a magic failure detection capability either; it is subject to the same limitations as the rest of the network.

  • Rapid feedback about a remote node being down is useful, but you can’t count on it. Even if TCP acknowledges that a packet was delivered, the application may have crashed before handling it. If you want to be sure that a request was successful, you need a positive response from the application itself.

  • Conversely, if something has gone wrong, you may get an error response at some level of the stack, but in general you have to assume that you will get no response at all. You can retry a few times (TCP retries transparently, but you may also retry at the application level), wait for a timeout to elapse, and eventually declare the node dead if you don’t hear back within the timeout; a minimal sketch of this retry-then-give-up pattern follows this list.
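
A minimal sketch of that retry-then-give-up pattern (the helper is deliberately generic; send_request could be something like the hypothetical call_with_timeout shown earlier, returning None when it gives up):

```python
class NodeSuspectedDead(Exception):
    """No response after several attempts. Note that this still does not tell us
    whether the node actually received or processed any of the requests."""

def request_with_retries(send_request, request, retries=3):
    """send_request(request) should return a response, or None on timeout."""
    for attempt in range(retries):
        response = send_request(request)
        if response is not None:
            return response  # a positive response from the application itself
    raise NodeSuspectedDead(f"no response after {retries} attempts")
```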

Timeouts and unbounded delays

If a timeout is the only sure way of detecting a fault, then how long should the timeout be? Unfortunately, there is no simple answer.

A long timeout means a long wait until a node is declared dead (and during this time, users may have to wait or see error messages). A short timeout detects faults faster, but carries a higher risk of incorrectly declaring a node dead when in fact it has only suffered a temporary slowdown (for example, due to a load spike on the node or the network).

Prematurely declaring a node dead is problematic: if the node is actually alive and in the middle of performing some action (for example, sending an email), and another node takes over, the action may end up being performed twice. We will discuss this issue in more detail in “Knowledge, Truth, and Lies” on page 300, and in Chapters 9 and 11.

When a node is declared dead, its responsibilities need to be transferred to other nodes, placing additional burden on the other nodes and the network. If the system is already under high load, declaring a node dead prematurely can make the problem worse. In particular, it may be that the node is not actually dead, but is simply slow to respond due to high load. Shifting its load to other nodes can lead to cascading failure (in extreme cases, all nodes declare each other dead and then everything stops working).

Imagine a fictitious system with a network that guarantees a maximum delay for packets: every packet is either delivered within some time d, or it is lost, but delivery never takes longer than d. Furthermore, assume that a non-failed node is guaranteed to always handle a request within some time r. In this case, every successful request receives a response within 2d + r, and if you receive no response in that time, you know that either the network or the remote node is not working. If this were true, 2d + r would be a reasonable timeout to use.
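
To make that bound concrete (the numbers here are invented purely for illustration): if packets were guaranteed to arrive within d = 100 ms or not at all, and a working node always responded within r = 50 ms, then the timeout would be 2d + r = 2 × 100 ms + 50 ms = 250 ms, and any request still unanswered after 250 ms could safely be blamed on a network or node failure.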

Unfortunately, most systems we work with have neither of those guarantees: asynchronous networks have unbounded delays (that is, they try to deliver packets as quickly as possible, but there is no upper limit on the time a packet may take to arrive), and most server implementations cannot guarantee that they will handle requests within some maximum time (see “Response time guarantees” on page 298). For failure detection, it’s not sufficient for the system to be fast most of the time: if your timeout is short, it only takes a transient spike in round-trip times to throw the system off-balance.

Network congestion and queues

When driving a car, travel times on road networks often vary most due to traffic congestion. Similarly, the variability of packet delays on computer networks is most often due to queueing:

  • If several different nodes simultaneously try to send packets to the same destination, the network switch must queue them up and feed them into the destination network link one by one (as illustrated in Figure 8-2). On a busy network link, a packet may have to wait a while until it can get a slot (this is called network congestion). If there is so much incoming data that the switch queue fills up, the packet is dropped, so it needs to be resent, even though the network is functioning fine.

  • When a packet reaches the destination machine, if all CPU cores are currently busy, the incoming request from the network is queued by the operating system until the application is ready to handle it. Depending on the load on the machine, this may take an arbitrary length of time.

  • In virtualized environments, a running operating system is often paused for tens of milliseconds while another virtual machine uses a CPU core. During this time, the VM cannot consume any data from the network, so the incoming data is queued (buffered) by the virtual machine monitor, further increasing the variability of network delays.

  • TCP performs flow control (also known as congestion avoidance or backpressure), in which a node limits its own rate of sending in order to avoid overloading a network link or the receiving node. This means additional queueing at the sender before the data even enters the network.

Moreover, TCP considers a packet to be lost if it is not acknowledged within some timeout (which is calculated from observed round-trip times), and lost packets are automatically retransmitted. Although the application does not see the packet loss and retransmission, it does see the resulting delay (waiting for the timeout to expire, and then waiting for the retransmitted packet to be acknowledged).
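
For a sense of how such a retransmission timeout can be derived from observed round-trip times, here is a sketch loosely following the smoothed-RTT scheme of RFC 6298 (real TCP stacks add further details such as clock granularity, Karn’s algorithm, and exponential backoff; the constants below are the commonly quoted ones):

```python
class RetransmissionTimer:
    """Estimate a retransmission timeout (RTO) from measured round-trip times,
    loosely following the SRTT/RTTVAR scheme of RFC 6298."""

    ALPHA = 1 / 8   # weight of a new sample in the smoothed RTT
    BETA = 1 / 4    # weight of a new sample in the RTT variation

    def __init__(self):
        self.srtt = None     # smoothed round-trip time (seconds)
        self.rttvar = None   # round-trip time variation (seconds)

    def observe(self, rtt):
        if self.srtt is None:
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt

    def rto(self):
        if self.srtt is None:
            return 1.0  # conservative default before any measurement
        return max(1.0, self.srtt + 4 * self.rttvar)  # RFC 6298 recommends a 1 s floor

# A transient spike in round-trip times pushes the timeout up automatically:
timer = RetransmissionTimer()
for sample in [0.05, 0.06, 0.05, 0.30, 0.07]:
    timer.observe(sample)
print(timer.rto())
```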

TCP and UDP

Some latency-sensitive applications, such as videoconferencing and Voice over IP (VoIP), use UDP rather than TCP. It’s a trade-off between reliability and variability of delays: as UDP does not perform flow control and does not retransmit lost packets, it avoids some of the reasons for variable network delays (although it is still susceptible to switch queues and scheduling delays).

UDP is a good choice in situations where delayed data is worthless. For example, in a VoIP phone call, there probably isn’t enough time to retransmit a lost packet before its data is due to be played over the loudspeakers. In this case, there’s no point in retransmitting the packet; the application must instead fill the missing packet’s time slot with silence (causing a brief interruption in the sound) and then move on in the stream. The retry happens at the human layer instead. (“Could you repeat that please? The sound just cut out for a moment.”)
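
A rough sketch of that “fill the gap with silence” behavior on the receiving side (the port, frame size, and two-byte sequence-number header are invented for this example; real VoIP stacks use RTP with a jitter buffer):

```python
import socket

FRAME_BYTES = 320           # e.g. 20 ms of 16-bit mono audio at 8 kHz (assumed format)
SILENCE = bytes(FRAME_BYTES)

def play(frame):
    pass                    # placeholder for handing the frame to an audio device

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 5004))

expected_seq = None
while True:
    datagram, _addr = sock.recvfrom(2 + FRAME_BYTES)
    seq = int.from_bytes(datagram[:2], "big")   # hypothetical 2-byte sequence number
    audio = datagram[2:]

    if expected_seq is not None and seq > expected_seq:
        # Packets were lost, and there is no time to wait for a retransmission:
        # play silence in their place (a brief audible glitch) and move on.
        for _ in range(seq - expected_seq):
            play(SILENCE)

    play(audio)
    expected_seq = seq + 1
```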

All of these factors contribute to the variability of network delays. Queueing delays have an especially wide range when a system is close to its maximum capacity: a system with plenty of spare capacity can easily drain queues, whereas in a highly utilized system, long queues can build up very quickly.

In public clouds and multi-tenant data centers, resources are shared among many customers: the network links and switches, and even each machine’s network interface and CPUs (when running on virtual machines), are shared. Batch workloads such as MapReduce (see Chapter 10) can easily saturate network links. As you have no control over or insight into other customers’ usage of the shared resources, network delays can be highly variable if someone near you is using a lot of resources.

In such environments, you can only choose timeouts experimentally: measure the distribution of network round-trip times over an extended period, and over many machines, to determine the expected variability of delays. Then, taking into account your application’s characteristics, you can determine an appropriate trade-off between failure detection delay and the risk of premature timeouts.

Even better, rather than using configured constant timeouts, systems can continually measure response times and their variability (jitter), and automatically adjust timeouts according to the observed response time distribution. This can be done with a Phi Accrual failure detector, which is used, for example, in Akka and Cassandra. TCP retransmission timeouts also work in a similar way.
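
A simplified sketch of the accrual idea (a toy version that naively assumes heartbeat inter-arrival times are roughly normally distributed; the actual Phi Accrual detector used by Akka and Cassandra is more sophisticated): instead of a binary alive/dead verdict, the detector outputs a suspicion level phi that grows the longer a heartbeat is overdue relative to the observed distribution.

```python
import math
import statistics
import time

class AccrualFailureDetector:
    """Toy accrual failure detector: phi grows as the silence since the last
    heartbeat becomes increasingly unlikely given past inter-arrival times."""

    def __init__(self, window=100):
        self.window = window
        self.intervals = []          # recent heartbeat inter-arrival times (seconds)
        self.last_heartbeat = None

    def heartbeat(self, now=None):
        now = time.monotonic() if now is None else now
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
            self.intervals = self.intervals[-self.window:]
        self.last_heartbeat = now

    def phi(self, now=None):
        now = time.monotonic() if now is None else now
        if self.last_heartbeat is None or len(self.intervals) < 2:
            return 0.0
        mean = statistics.mean(self.intervals)
        stdev = statistics.stdev(self.intervals) or 1e-6
        elapsed = now - self.last_heartbeat
        # Probability that a heartbeat arrives even later than `elapsed`,
        # under a (naive) normal model of inter-arrival times.
        p_later = 1.0 - statistics.NormalDist(mean, stdev).cdf(elapsed)
        return -math.log10(max(p_later, 1e-12))

# Usage: declare the node suspect once phi exceeds some threshold (say 8); because
# phi is based on the measured distribution, the effective timeout adapts
# automatically if the network becomes slower or more jittery.
```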

Synchronous and asynchronous networks

Distributed systems would be a lot simpler if we could rely on the network to deliver packets with some fixed maximum delay, and not to drop packets. Why can’t we solve this at the hardware level and make the network reliable so that the software doesn’t need to worry about it?

To answer this question, it’s interesting to compare data center networks to the traditional fixed-line telephone network (non-cellular, non-VoIP), which is extremely reliable: delayed audio frames and dropped calls are very rare. A phone call requires a constantly low end-to-end latency and enough bandwidth to transfer the audio samples of your voice. Wouldn’t it be nice to have similar reliability and predictability in computer networks?

When you make a call over the telephone network, it establishes a circuit: a fixed, guaranteed amount of bandwidth is allocated for the call, along the entire route between the two callers. This circuit remains in place until the call ends. For example, an ISDN network runs at a fixed rate of 4,000 frames per second. When a call is established, it is allocated 16 bits of space within each frame (in each direction). Thus, for the duration of the call, each side is guaranteed to be able to send exactly 16 bits of audio data every 250 microseconds.
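
Just to check the arithmetic behind those figures: 1 second ÷ 4,000 frames = 250 microseconds per frame, and 4,000 frames per second × 16 bits = 64,000 bits per second, i.e. the familiar 64 kbit/s of a single digital voice channel, in each direction.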

This kind of network is synchronous: even as data passes through several routers, it does not suffer from queueing, because the 16 bits of space for the call have already been reserved at the next hop of the network. And because there is no queueing, the maximum end-to-end latency of the network is fixed. We call this a bounded delay.

Can’t we simply make network latency predictable?

Note that a circuit in a telephone network is very different from a TCP connection: a circuit is a fixed amount of reserved bandwidth which nobody else can use while the circuit is established, whereas the packets of a TCP connection opportunistically use whatever network bandwidth is available. You can give TCP a variable-sized block of data (for example, an email or a web page) and it will try to transfer it in the shortest time possible. While a TCP connection is idle, it doesn’t use any bandwidth. If data center networks and the Internet were circuit-switched networks, it would be possible to establish a guaranteed maximum round-trip time when a circuit was set up. However, they are not: Ethernet and IP are packet-switched protocols, which suffer from queueing and thus unbounded delays in the network. These protocols do not have the concept of a circuit.

Why do data center networks and the Internet use packet switching? The answer is that they are optimized for bursty traffic. A circuit is good for an audio or video call, which needs to transfer a fairly constant number of bits per second for the duration of the call. On the other hand, requesting a web page, sending an email, or transferring a file doesn’t have any particular bandwidth requirement; we just want it to complete as quickly as possible.

If you wanted to transfer a file over a circuit, you would have to guess a bandwidth allocation. If you guess too low, the transfer is unnecessarily slow, leaving network capacity unused. If you guess too high, the circuit cannot be set up (because the network cannot allow a circuit to be created if its bandwidth allocation cannot be guaranteed). Thus, using circuits for bursty data transfers wastes network capacity and makes transfers unnecessarily slow. By contrast, TCP dynamically adapts the rate of data transfer to the available network capacity.

Some attempts have been made to build hybrid networks that support both circuit switching and packet switching, such as ATM. InfiniBand has some similarities: it implements end-to-end flow control at the link layer, which reduces the need for queueing in the network, although it can still suffer from delays due to link congestion. With careful use of quality of service (QoS, prioritization and scheduling of packets) and admission control (rate-limiting senders), it is possible to emulate circuit switching on packet networks, or to provide statistically bounded delays.

Latency and resource usage

More generally, you can think of variable latency as a result of dynamic resource partitioning.

Consider a wire between two telephone switches that can carry up to 10,000 simultaneous calls. Each circuit switched over this wire occupies one of those call slots. Thus, you can think of the wire as a resource that can be shared by up to 10,000 concurrent users. The resource is divided up in a static way: even if you’re the only call on the wire right now, and all the other 9,999 slots are unused, your circuit is still allocated the same fixed amount of bandwidth as when the wire is fully utilized.

By contrast, the Internet shares network bandwidth dynamically. Senders compete to get their packets over the wire as quickly as possible, and the network switches decide which packet to send (that is, the bandwidth allocation) from one moment to the next. This approach has the downside of queueing, but the advantage is that it maximizes utilization of the wire. The cost of the wire is fixed, so if you utilize it better, each byte sent over the wire is cheaper.

A similar situation arises with CPUs: if you share each CPU core dynamically between several threads, one thread sometimes has to wait in the operating system’s run queue while another thread is running, so a thread can be paused for varying lengths of time. However, this utilizes the hardware better than if you allocated a static number of CPU cycles to each thread (see “Response time guarantees” on page 298). Better hardware utilization is also a significant motivation for using virtual machines.

Latency guarantees are achievable in certain environments if resources are statically partitioned (for example, dedicated hardware and exclusive bandwidth allocations). However, this comes at the cost of reduced utilization; in other words, it is more expensive. On the other hand, multi-tenancy with dynamic resource partitioning provides better utilization, so it is cheaper, but it has the downside of variable delays.

Variable latency in networks is not a natural law, but merely the result of a cost/benefit trade-off.

However, such quality-of-service guarantees are not currently enabled in multi-tenant data centers and public clouds, or when communicating over the Internet. Currently deployed technology does not allow us to make any guarantees about delays or reliability of the network: we have to assume that network congestion, queueing, and unbounded delays will happen. Consequently, there is no “correct” value for timeouts; they need to be determined experimentally.

To be continued…