When we talk about the difference between a distributed system and a stand-alone program, the first thing that comes to mind is the network between multiple machines. Connecting machines over a network brings many advantages, such as the ability to link machines in different places without being limited by the performance of a single machine. But it also brings many unexpected problems. This article explains in detail why we say that the network is unreliable.

Possible problems with simple requests

Let’s first look at the problems that can occur when one node sends a request to another node (as shown in the following figure):

  1. Your request may simply be lost (e.g., the network goes down while it is being sent).
  2. Your request may sit in a queue for some time before it is sent (e.g., the network is heavily loaded or the receiver is overloaded).
  3. The receiving node may have failed (crashed, lost power, etc.).
  4. The receiving node may be temporarily unable to respond and only reply after a while (e.g., it is in the middle of a garbage-collection pause).
  5. The receiver processed your request, but the response was lost on the way back.
  6. The receiver processed your request quickly, but the response is delayed in the network.

From the situations above, we can see that when a request runs into a problem, the sender may not know what happened at all: the request may never have been sent, it may have been sent but not processed by the receiver, or it may have been processed but the response could not be returned in time. The only thing the sender knows for certain is that it has not received a response for some period of time.

Because of this, the usual way to handle the situation is to add a timeout: if no response arrives within a certain period, the request is considered to have failed. Keep in mind, however, that the receiver may still have processed the request (for example, only the response failed to make it back).
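As a rough sketch of what this looks like in practice (the host name node-b.internal and the 2-second budget are made up for illustration), here is a Go client that gives up after a timeout. Note that when the timeout fires, the client cannot tell which of the six failures above actually happened:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Give up if no response arrives within 2 seconds (an arbitrary budget).
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet,
		"http://node-b.internal/health", nil)
	if err != nil {
		panic(err)
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// All we know is that no response arrived in time. The request may
		// have been lost, queued, or processed with the response lost.
		fmt.Println("no response within timeout:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("got response:", resp.Status)
}
```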

Fault detection

Since networks are so unreliable, many systems need a mechanism to detect whether a node has a problem, for example in the following scenarios:

  1. A load balancing system needs to stop sending requests to problematic nodes.
  2. For a single-leader database, if the leader fails, a new leader needs to be elected.

As we have said, it is difficult to determine whether a node is working, but in some specific scenarios you can get explicit feedback that some part of it is not working:

  1. You can reach the remote machine, but no process is listening on the destination port. The operating system there will close or refuse the TCP connection with an RST or FIN packet, so you get an explicit error instead of silence (see the sketch after this list).
  2. A process on a node crashes (or is killed), but the node’s operating system is still running. A script can proactively notify the other nodes about the crash, so they do not have to wait for a timeout.
  3. If you have access to the management interface of the data center’s network switches, you can query it to detect link failures at the hardware level. That’s assuming you have access, of course.
  4. If the destination IP address is unreachable, a router may reply with an ICMP Destination Unreachable packet.
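The first case in the list is worth seeing concretely. In the Go sketch below (the address and port are hypothetical, and the behavior shown is for Linux), a connection-refused error comes back immediately because the remote OS answered with an RST, whereas a timeout tells us nothing definite:

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
	"time"
)

func main() {
	// Hypothetical node address; substitute a real one.
	conn, err := net.DialTimeout("tcp", "10.0.0.7:5432", time.Second)
	if err != nil {
		var netErr net.Error
		switch {
		case errors.Is(err, syscall.ECONNREFUSED):
			// The machine is up but nothing listens on the port:
			// its OS replied immediately with an RST packet.
			fmt.Println("explicitly refused:", err)
		case errors.As(err, &netErr) && netErr.Timeout():
			// No reply at all: we learned nothing definite.
			fmt.Println("timed out:", err)
		default:
			fmt.Println("other error:", err)
		}
		return
	}
	defer conn.Close()
	fmt.Println("connected")
}
```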

Of course, in the general case we still rely on a heartbeat mechanism to detect errors: if no heartbeat has been received from a node for a certain period of time, that node is considered to have a problem. Choosing that timeout, however, is a topic worth investigating in its own right.
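Here is a minimal sketch of the receiving side of such a heartbeat mechanism in Go (the 3-second timeout and 1-second interval are arbitrary values chosen for illustration):

```go
package main

import (
	"fmt"
	"time"
)

// monitor declares the peer suspect if no heartbeat arrives within the timeout.
func monitor(heartbeats <-chan struct{}, timeout time.Duration) {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	for {
		select {
		case <-heartbeats:
			// Heartbeat received: reset the deadline.
			if !timer.Stop() {
				<-timer.C
			}
			timer.Reset(timeout)
		case <-timer.C:
			fmt.Println("no heartbeat within", timeout, "- node suspected down")
			return
		}
	}
}

func main() {
	hb := make(chan struct{})
	go monitor(hb, 3*time.Second)

	// Simulate a peer that sends heartbeats every second, then dies.
	for i := 0; i < 3; i++ {
		time.Sleep(time.Second)
		hb <- struct{}{}
	}
	time.Sleep(5 * time.Second) // the peer "crashes" and the monitor fires
}
```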

Timeouts and uncontrolled delays

If we set the timeout too long, requests will keep being sent to the faulty node during the window between when the problem actually occurs and when it is detected, and callers will see long waits or errors during that window. If we set the timeout too short, a small network fluctuation or a brief load spike could make us mistakenly decide that a node has a problem; the load that originally belonged to that node is then transferred to other nodes, which is itself a problem (imagine the extreme case: many nodes are declared faulty, so only a few nodes are left to process all the requests, and they may become overloaded in turn).

Suppose the network could make a firm guarantee: every packet is either delivered within some time d or lost, and every node always finishes processing a request within time r. Then every successful request would receive its response within time 2d + r (d to deliver the request, r to process it, and d to deliver the response), and we could safely set the timeout to exactly that value.

Unfortunately, there is no such promise in reality, and here’s why:

Congestion and queues on the Internet

In fact, just like driving to and from work, network packets often run into congestion and have to queue:

  1. If multiple sources send network packets to the same destination at the same time, the network switch has to queue them and feed them to the destination one by one, as shown in the figure below. If the queue fills up, packets are dropped.
  2. When packets arrive at the destination, if all CPU cores are busy, the operating system queues the incoming packets until a core is free to process them.
  3. In a virtualized environment, the running VM may be paused for tens of milliseconds while the CPU is used by another VM. During that time it cannot process any network packets, so incoming packets have to wait until it resumes.
  4. TCP flow control, where a node limits its own sending rate to avoid overloading the link or the receiver; that is, packets are queued at the sender before they even enter the network.
  5. TCP retransmission: if a packet is not acknowledged within a timeout, TCP automatically resends it, and the extra round trip shows up as delay to the application.

All of the factors above can add latency to network packets. In reality, especially in a multi-tenant data center, many resources are shared (the network, CPUs, and so on), so your traffic can easily be affected by other tenants’ transmissions.

With so many factors at play, how do we choose a timeout? In practice we can record actual response times and compute their distribution, then set the timeout based on the observed values, adjusting it as new measurements come in. You can even go a step further and replace the binary timeout with a continuously updated suspicion score; this is the Phi Accrual Failure Detector, which is used in Akka and Cassandra.
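Below is a minimal sketch of the idea behind the Phi Accrual Failure Detector, not Akka’s or Cassandra’s actual implementation: model recent heartbeat inter-arrival times as a normal distribution and convert "time since the last heartbeat" into a suspicion score phi (the window size and numeric floors are arbitrary choices here):

```go
package main

import (
	"fmt"
	"math"
)

// phiDetector keeps a sliding window of heartbeat inter-arrival times
// and scores how suspicious the current silence is.
type phiDetector struct {
	intervals []float64 // recent inter-arrival times (ms)
	lastBeat  float64   // timestamp of the last heartbeat (ms)
}

func (d *phiDetector) heartbeat(now float64) {
	if d.lastBeat != 0 {
		d.intervals = append(d.intervals, now-d.lastBeat)
		if len(d.intervals) > 1000 { // bounded window
			d.intervals = d.intervals[1:]
		}
	}
	d.lastBeat = now
}

// phi is -log10 of the probability that a heartbeat arrives even later
// than the time already elapsed: phi = 1 means roughly a 10% chance the
// node is still alive, phi = 3 roughly 0.1%, and so on.
func (d *phiDetector) phi(now float64) float64 {
	mean, sd := stats(d.intervals)
	elapsed := now - d.lastBeat
	// P(interval > elapsed) under a normal distribution, floored so
	// that phi stays finite.
	pLater := 1 - 0.5*(1+math.Erf((elapsed-mean)/(sd*math.Sqrt2)))
	return -math.Log10(math.Max(pLater, 1e-15))
}

func stats(xs []float64) (mean, sd float64) {
	for _, x := range xs {
		mean += x
	}
	mean /= float64(len(xs))
	for _, x := range xs {
		sd += (x - mean) * (x - mean)
	}
	sd = math.Sqrt(sd / float64(len(xs)))
	return mean, math.Max(sd, 1) // floor sd to avoid dividing by zero
}

func main() {
	d := &phiDetector{}
	for t := 0.0; t <= 10000; t += 1000 { // heartbeats every ~1 s
		d.heartbeat(t)
	}
	for _, t := range []float64{11000, 11500, 13000} {
		fmt.Printf("t=%.0f ms  phi=%.2f\n", t, d.phi(t))
	}
}
```

A node is then declared suspect once phi crosses a configurable threshold, which in effect adapts the timeout to the network conditions actually observed.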

Conclusion

This article has looked in detail at why network transmission is unreliable and at the fault-detection methods we generally use.