TCP (2): Flow control and congestion control

The previous article covered the TCP state machine and retransmission mechanism. This article introduces flow control and congestion control, the mechanisms by which TCP ensures the quality of service (QoS) of the network.
Most of this article is based on "The Whole TCP Thing (Part 2)" (see References); some views differ from the original, and explanations have been added at important points.
TCP flow control
RTT algorithm
From the earlier introduction to the TCP timeout retransmission mechanism, we know that the timeout setting is critical for retransmission:

- Set too long, retransmission is slow: a lost packet takes a long time to be resent, hurting efficiency and performance.
- Set too short, packets that were not actually lost get retransmitted. Faster retransmission increases network congestion, which leads to more timeouts, and more timeouts lead to still more retransmissions.

Moreover, the appropriate timeout varies across network environments and must be set dynamically. To this end, TCP introduced RTT (Round Trip Time): the time from sending a packet until its acknowledgment returns. With RTT, the sender knows roughly how long normal transmission takes and can calculate the RTO (Retransmission TimeOut) accordingly. It sounds simple: record t0 when the sender sends a packet and t1 when it receives the Ack, so RTT = t1 - t0. However, this is only one sample and does not represent the general state of the network.
Classic algorithms
RFC 793 defines a classic algorithm:
- First, sample the RTT, recording the last several RTT values.
- Then, smooth them with a weighted moving average to calculate SRTT (Smoothed RTT):

SRTT = α * SRTT + (1 - α) * RTT (usually 0.8 ≤ α ≤ 0.9; the larger α is, the smoother SRTT is and the more slowly it tracks change)

- Finally, calculate the RTO:

RTO = min[UBOUND, max[LBOUND, β * SRTT]] (usually 1.3 ≤ β ≤ 2.0)
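As a concrete illustration, the classic algorithm translates into a few lines of Python. This is a minimal sketch: the parameter values and the UBOUND/LBOUND constants are illustrative choices, not taken from any real TCP stack.

```python
# Classic RFC 793 RTO estimation (sketch; parameter values are illustrative).
ALPHA = 0.85    # smoothing factor, typically 0.8-0.9
BETA = 1.5      # delay variance factor, typically 1.3-2.0
UBOUND = 60.0   # upper bound on RTO, in seconds (illustrative)
LBOUND = 1.0    # lower bound on RTO, in seconds (illustrative)

def classic_rto(rtt_samples):
    """Feed successive RTT samples; return the final (SRTT, RTO)."""
    srtt = rtt_samples[0]
    for rtt in rtt_samples[1:]:
        srtt = ALPHA * srtt + (1 - ALPHA) * rtt   # weighted moving average
    rto = min(UBOUND, max(LBOUND, BETA * srtt))
    return srtt, rto
```

Note how a single outlier sample barely moves SRTT: with α = 0.85, the newest RTT contributes only 15% of the new estimate.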
Karn/Partridge algorithm
The classic algorithm captures the basic idea of RTO calculation, but an important question remains: should RTT be sampled as "the time from first sending a Seq to receiving its Ack" or "the time from retransmitting the Seq to receiving the Ack"?
As shown in the figure:
- In case (a), RTT is sampled as "time from first sending the Seq to receiving the Ack". If the first Seq packet is lost entirely, the sample wrongly includes the interval from the first send to the retransmission, so it is far too large.
- In case (b), RTT is sampled as "time from retransmitting the Seq to receiving the Ack". If the Ack travels slowly and arrives just after the retransmission, the sample misses the interval from the first send to the retransmission, so it is far too small.
The essence of the problem is that the sender cannot tell whether a received Ack acknowledges the original Seq or the retransmitted Seq (the two are indistinguishable on the wire). To sidestep this, the Karn/Partridge algorithm simply avoids the ambiguity: samples involving retransmission are ignored, and RTT is sampled only from segments that were not retransmitted.
Simply ignoring retransmission samples has its own problem: suppose the current RTO is small, the network suddenly jitters, delay spikes, and all packets end up being retransmitted. Because retransmission samples are ignored, the RTO never gets updated, so retransmissions make the network even more congested; congestion causes more retransmissions, a vicious cycle until the network collapses. Karn/Partridge therefore adds a trick: whenever a retransmission occurs, the current RTO is doubled (an exponential backoff strategy); after the network recovers, the RTO is gradually smoothed back down as in the classic algorithm.
The algorithm works, but network jitter still hurts performance considerably.
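The two Karn/Partridge rules can be sketched as a small stateful timer. This is a toy model, not a real TCP stack; the class name and the alpha/beta defaults (reused from the classic algorithm) are illustrative.

```python
# Karn/Partridge sketch: ignore RTT samples from retransmitted segments,
# and double the RTO on every retransmission (exponential backoff).
class KarnPartridgeTimer:
    def __init__(self, rto=1.0, alpha=0.85, beta=1.5):
        self.srtt = None      # no valid sample yet
        self.rto = rto
        self.alpha = alpha
        self.beta = beta

    def on_ack(self, rtt_sample, was_retransmitted):
        if was_retransmitted:
            return  # ambiguous sample: the Ack may belong to either send
        if self.srtt is None:
            self.srtt = rtt_sample
        else:
            self.srtt = self.alpha * self.srtt + (1 - self.alpha) * rtt_sample
        self.rto = self.beta * self.srtt  # smooth back down as usual

    def on_retransmit(self):
        self.rto *= 2  # exponential backoff on every retransmission
```

Once clean (non-retransmitted) samples arrive again, `on_ack` pulls the backed-off RTO back toward β * SRTT.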
Jacobson/Karels algorithm
Both previous algorithms smooth with a weighted moving average. The biggest problem with this method is that large fluctuations in RTT are hard to detect, because they are smoothed away (1 - α is small, i.e., the latest RTT carries little weight). To solve this, the Jacobson/Karels algorithm introduces the gap between the latest sampled RTT and the smoothed SRTT as a factor, namely DevRTT (Deviation RTT), taking into account both the inertia from SRTT and the fluctuation from DevRTT:
SRTT = SRTT + α * (RTT - SRTT)                  -- compute SRTT
DevRTT = (1 - β) * DevRTT + β * |RTT - SRTT|    -- weighted moving average of the gap between the newest RTT and SRTT
RTO = μ * SRTT + ∂ * DevRTT                     -- considers SRTT and DevRTT simultaneously
Linux 2.6 uses this algorithm to compute the RTO, with defaults α = 0.125, β = 0.25, μ = 1, ∂ = 4 (empirically tuned "magic" parameters).
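The three formulas translate almost directly into code. A sketch, following the order of the formulas above (RFC 6298 updates the deviation term before SRTT, a subtlety ignored here); initializing DevRTT to half the first sample follows RFC 6298's suggestion.

```python
# Jacobson/Karels RTO estimation with the defaults quoted above:
# alpha = 0.125, beta = 0.25, mu = 1, and a DevRTT multiplier of 4.
def jacobson_karels(rtt_samples, alpha=0.125, beta=0.25, mu=1.0, k=4.0):
    srtt = rtt_samples[0]
    devrtt = rtt_samples[0] / 2   # common initialization (RFC 6298)
    for rtt in rtt_samples[1:]:
        srtt = srtt + alpha * (rtt - srtt)                     # compute SRTT
        devrtt = (1 - beta) * devrtt + beta * abs(rtt - srtt)  # gap average
    return mu * srtt + k * devrtt                              # RTO
```

Unlike the classic algorithm, a burst of jittery samples inflates DevRTT and immediately widens the RTO, so large fluctuations are no longer smoothed into invisibility.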
TCP sliding window
TCP uses the Sliding Window for flow control and out-of-order reassembly. Out-of-order reassembly was introduced with the TCP retransmission mechanism; the following describes flow control.
The TCP header has a field called Window (or Advertised Window), which the receiver uses to inform the sender how much buffer space it has available for receiving data. The sender then sends no more than the receiver can process: this is flow control. For now, think of the Advertised Window as the sliding window; this makes it easier to see how flow control is achieved with a sliding window. The difference will be explained later when discussing congestion control.
Observe the TCP send buffer and receive buffer:
Assuming sequence numbers increase from left to right (the common read/write buffer design):
- Sender: LastByteAcked points to the largest position that has been consecutively Acked; LastByteSent points to the last byte sent; LastByteWritten points to the last byte written by the upper-layer application.
- Receiver: LastByteRead points to the last byte read by the upper-layer application; NextByteExpected points to the largest position that has been consecutively received; LastByteRcvd points to the last byte received. Some Seqs between NextByteExpected and LastByteRcvd have not yet arrived, corresponding to the blank areas.
AdvertisedWindow is calculated at the receiver, and EffectiveWindow is calculated at the sender:

- The receiver records

AdvertisedWindow = MaxRcvBuffer - (LastByteRcvd - LastByteRead)

in the Ack and replies to the sender.
- Based on the AdvertisedWindow value in the Ack, the sender guarantees

LastByteSent - LastByteAcked ≤ AdvertisedWindow

and the remaining amount of data it may still send within the window is

EffectiveWindow = AdvertisedWindow - (LastByteSent - LastByteAcked)

ensuring that the receiver can process everything it sends.
AdvertisedWindow and EffectiveWindow
AdvertisedWindow measures how much more data the receiver can accept. Based on AdvertisedWindow, the sender determines the upper limit of the data it may send next, namely EffectiveWindow (possibly 0).
The calculation of AdvertisedWindow
Because of out-of-order arrival, LastByteRcvd may point to a Seq beyond NextByteExpected, while Seq(NextByteExpected) to Seq(LastByteRcvd - 1) are still on their way to the receiver. In the best case there is no packet loss (a lost packet is retransmitted), so those in-flight Seqs will arrive and fill the gap; the space after LastByteRcvd and before the receive buffer boundary is therefore the upper limit on the data the sender can send next (a retransmission does not count as a "next send"). Hence, AdvertisedWindow = MaxRcvBuffer - (LastByteRcvd - LastByteRead).
The calculation of EffectiveWindow
LastByteRcvd may also coincide with the byte before NextByteExpected (no out-of-order packet has arrived); obviously AdvertisedWindow still follows the same formula. Seq(LastByteAcked + 1) to Seq(LastByteSent) are still in flight and will eventually arrive and enter the receive buffer, so the in-flight data must not exceed the receive buffer's remaining space, AdvertisedWindow (here equal to MaxRcvBuffer minus unread data). The space after LastByteSent and before that limit is the upper bound on what the sender may still send within the window. Therefore, EffectiveWindow = AdvertisedWindow - (LastByteSent - LastByteAcked).
EffectiveWindow is, of course, at least 0.
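The two formulas can be captured in a small sketch; the function and parameter names simply mirror the byte positions defined in the text, and the clamping to 0 reflects the note above.

```python
# Window arithmetic from the text (sketch; plain byte positions, not a stack).
def advertised_window(max_rcv_buffer, last_byte_rcvd, last_byte_read):
    # Free space in the receive buffer, advertised back to the sender.
    return max_rcv_buffer - (last_byte_rcvd - last_byte_read)

def effective_window(advertised, last_byte_sent, last_byte_acked):
    # What the sender may still send: the advertised window minus the
    # in-flight (sent but unacked) bytes, never below 0.
    return max(0, advertised - (last_byte_sent - last_byte_acked))
```

For example, a 64 KB buffer holding 800 unread bytes advertises 64736 bytes; if 600 of those are already in flight, the sender may emit at most 64136 more.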
Example 1
Here is a sliding window for sending buffers:
The figure above is divided into four parts:
- #1 is the data that has been sent and confirmed, i.e., the region before LastByteAcked.
- #2 is the data that has been sent but not confirmed, the region between LastByteAcked and LastByteSent, no larger than AdvertisedWindow.
- #3 is the unsent data inside the window, the region between LastByteSent and the right edge of the window, equal to EffectiveWindow (possibly 0).
- #4 is the unsent data outside the window, the region between the window's right edge and LastByteWritten.
Together, #2 + #3 form the sliding window, whose total size is no larger than AdvertisedWindow. The ratio between the two is affected by the receiver's processing speed and the network condition (if packet loss is serious or processing is slower than sending, #2 : #3 grows larger and larger).
Example 2
The following shows an AdvertisedWindow adjustment process; EffectiveWindow changes accordingly:
Zero Window
(Understanding the problem is enough; mastery is not required.)
Above, we can see how a slow server gradually reduces the client's send window to 0. For the receiver, the receive buffer really is full at that point, so it makes sense to temporarily stop the sender by shrinking its send window to 0. But once the receiver's buffer frees up space, how is the sender notified of the new window size?
To solve this problem, TCP designed ZWP (Zero Window Probe): after the window becomes 0, the sender periodically sends ZWP packets to the receiver, prompting the receiver to Ack with its current window size. ZWP retransmission also follows an exponential backoff policy, with 3 retries by default. If the window is still 0 after 3 tries, the receiver is considered abnormal and an RST is sent to reset the connection (or should it keep retrying until the window returns to normal?).
Note: wherever there is waiting, DDoS attacks are possible, and Zero Window is no exception. Some attackers establish a connection to the server, send a GET request, and then set their window to 0, so the server can only keep sending ZWPs. The attacker then opens a large number of such connections concurrently, exhausting server-side resources. (How does the waiting consume the server?)
TCP congestion control
Congestion in communication means:

Too many packets arrive at some part of the communication subnet for the network to process, degrading the performance of that part or even of the entire network; in severe cases, network services may stall or deadlock.
Why congestion control? Suppose congestion has occurred in the network. If it is not dealt with, delay increases, more packets are lost, the sender retransmits, and the vicious cycle continues until the network breaks down. Congestion control and flow control have different purposes and apply to different scenarios.
Before congestion occurs, excessive traffic growth can be restrained so it does not overwhelm the network; once congestion occurs, the only option is to reduce traffic. Congestion control mainly uses four algorithms:
- Slow start
- Congestion avoidance
- Congestion occurs
- Fast recovery
Algorithms 1 and 2 apply before congestion occurs, algorithm 3 applies when congestion occurs, and algorithm 4 applies after congestion is relieved (i.e., before it occurs again).
RWND and CWND
Before introducing the algorithms, let's add two concepts: RWND (Receiver Window) and CWND (Congestion Window):
- RWND is the window size used for flow control, namely the AdvertisedWindow above. It depends on the receiver's processing speed and is passively adjusted by the receiver, which informs the sender (see above for the details).
- CWND is the window size used for congestion handling. It depends on the network condition and is actively adjusted by the sender as it probes the network.
When introducing flow control, we ignored CWND and treated the sender's maximum sliding window as RWND. In practice, considering both flow control and congestion handling, the sender's window size must not exceed min{RWND, CWND}. The four congestion control algorithms below only involve adjusting CWND; as in the flow control introduction, RWND is ignored for the moment and the maximum sliding window is assumed to be CWND. Readers should nevertheless keep the relationship between RWND, CWND, and the sender window size clear.
Four congestion control algorithms
Slow start algorithm
The Slow Start algorithm works before congestion sets in: new connections speed up bit by bit rather than going full speed in one step. It proceeds as follows:
- As soon as the connection is established, initialize CWND = 1 (in practice it is usually initialized to a larger value, as 1 is too small), meaning one MSS of data may be sent.
- For each Ack received, cwnd++ (an increment per Ack).
- After each RTT, CWND = CWND * 2, i.e., exponential growth (the main source of growth).
- There is also an SSTHRESH (Slow Start Threshold); when CWND >= SSTHRESH, the congestion avoidance algorithm takes over (see below).
Therefore, if the network is fast, Acks return quickly and RTT is short, and this "slow" start is not slow at all. The following figure illustrates the process:
Congestion avoidance algorithm
As mentioned earlier, when CWND >= SSTHRESH (commonly SSTHRESH = 65535), the Congestion Avoidance algorithm takes over: grow slowly and probe carefully for the optimal value. It proceeds as follows:
- For each Ack received, CWND = CWND + 1/ CWND.
- After each RTT, cwnd++, linear growth (main source of growth).
The slow start algorithm grows exponentially: coarse and fast ("slow" only relative to jumping to full speed in one step). The congestion avoidance algorithm grows linearly: fine and slow, but more likely to find the optimal CWND for the network environment without causing congestion.
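The two growth regimes above can be sketched as a single per-Ack update (cwnd in units of MSS; an illustrative model, not an actual implementation):

```python
# Per-Ack cwnd growth (sketch). Below ssthresh, slow start adds 1 MSS per
# Ack, which doubles cwnd every RTT; at or above ssthresh, congestion
# avoidance adds 1/cwnd per Ack, roughly 1 MSS per RTT.
def cwnd_on_ack(cwnd, ssthresh):
    if cwnd < ssthresh:
        return cwnd + 1          # slow start: exponential growth per RTT
    return cwnd + 1.0 / cwnd     # congestion avoidance: linear growth per RTT
```

Since a window of cwnd segments produces about cwnd Acks per RTT, the `+1` branch doubles cwnd each RTT, while the `+1/cwnd` branch adds about one MSS per RTT.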
The algorithm when congestion occurs
Before congestion occurs, the strategies above increase CWND; once congestion has occurred, a strategy is needed to decrease CWND. So how does TCP decide that the network is congested? Quite simply: if the sender detects that a Seq failed to be delivered (manifested as "packet loss"), it considers the network congested.
After packet loss, there are two retransmission modes corresponding to different network conditions, and likewise two control algorithms when congestion occurs:
- Timeout retransmission. TCP regards this as the worse situation and makes a big adjustment:
  - ssthresh = cwnd / 2
  - cwnd = 1, restarting the slow start process (bad network; adjust slowly)
- Fast retransmission. TCP considers this less severe than an RTO timeout, and mainstream TCP Reno implementations adjust more gently (TCP Tahoe reacts just as strongly as on an RTO timeout):
  - ssthresh = cwnd / 2
  - cwnd = cwnd / 2, then enter the fast recovery algorithm (the network is not that bad and can recover quickly; see below)
As you can see, regardless of the retransmission mode, ssthresh drops to half of cwnd, which is still an exponential backoff. After congestion clears, cwnd gradually grows back toward a new optimal value, oscillating around the (dynamic) optimum in the long run.
After the backoff, different recovery algorithms apply to different network conditions. Slow start has already been introduced; the fast recovery algorithm follows.
Fast recovery algorithm
If fast retransmission is triggered, i.e., the sender receives the same Ack at least 3 times, TCP decides the network situation is not so bad that alarm is needed, and recovery can be appropriately bold. For this purpose, the Fast Recovery algorithm was designed; the TCP Reno implementation is described below.
To recap, cwnd and ssthresh were already updated before entering fast recovery:
- ssthresh = cwnd /2
- cwnd = cwnd /2
Then, enter the fast recovery algorithm:
- cwnd = ssthresh + 3 * MSS
- Retransmit the Seq corresponding to the duplicate Acks
- If another duplicate Ack is received, cwnd++ (linear growth, gentle adjustment)
- If a new Ack is received, set cwnd = ssthresh and enter the congestion avoidance algorithm
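TCP Reno's reactions to the two loss signals can be summarized in a sketch (cwnd and ssthresh in units of MSS; the function names are illustrative, and a real stack tracks far more state):

```python
# Reno-style reactions to the two loss signals described above (sketch).
def on_rto_timeout(cwnd, ssthresh):
    ssthresh = cwnd // 2   # halve the threshold
    cwnd = 1               # big adjustment: back to slow start
    return cwnd, ssthresh

def on_triple_dup_ack(cwnd, ssthresh):
    ssthresh = cwnd // 2   # halve the threshold
    cwnd = ssthresh + 3    # gentler: enter fast recovery at ssthresh + 3 MSS
    return cwnd, ssthresh
```

Both paths halve ssthresh, but only the timeout path collapses cwnd to 1; the duplicate-Ack path keeps the pipe roughly half full while recovering.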
(The following can be skipped on a first read.)
This implementation still has a problem: it relies on 3 duplicate Acks. Recall the discussion of "retransmit one or retransmit all": 3 duplicate Acks do not mean only one packet was lost; most likely many packets were lost. The fast recovery algorithm chooses to "retransmit one", so the remaining lost packets must wait for the RTO to expire. Each timeout halves the window, so multiple timeouts make TCP's transmission speed drop exponentially; meanwhile, timeout retransmission does not trigger fast recovery, and slow start easily aggravates the timeout situation, entering a vicious cycle.
Example
Here is a simple illustration of how CWND changes during congestion control:
References:

- The Whole TCP Thing (Part 2)
- Network Basics 3: Transport Layer