The APNIC Blog article "BBR, the new kid on the TCP block" is quite good; much of the content below is based on it.
Congestion avoidance prevents packet loss on the link due to congestion caused by the sender transmitting data too fast.
After a TCP connection is established, it first goes through the Slow Start stage: CWND grows for every ACK received, so the sending rate increases exponentially, roughly doubling every RTT. When packet loss occurs, or ssthresh is reached, or the receiver's RWND limit is reached, the connection enters the Congestion Avoidance phase. The following diagram describes these processes well; I can't find its original source.
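As a rough illustration of the transition described above, here is a minimal sketch of per-ACK window growth, assuming MSS-based units and an arbitrary ssthresh; it is not any particular kernel's implementation and omits loss handling:

```python
# Minimal sketch of per-ACK CWND growth: slow start vs. congestion avoidance.
# Units are MSS; the ssthresh value is an arbitrary example, not a real default.
cwnd = 1.0
ssthresh = 64.0

def on_ack():
    global cwnd
    if cwnd < ssthresh:
        cwnd += 1.0           # slow start: +1 MSS per ACK, so CWND doubles per RTT
    else:
        cwnd += 1.0 / cwnd    # congestion avoidance: roughly +1 MSS per RTT
```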
Some basic things to look at: TCP congestion control – Wikipedia
BDP
BDP is the Bandwidth-Delay Product: the product of bandwidth (bits per second) and delay (seconds), measured in bits. It is the maximum amount of data that should be in flight between source and destination. In-flight data, also called Inflights, is data that has been sent but whose ACK has not yet been received.
The closer the product of the actual transmission rate and the delay is to the BDP, the more efficiently the algorithm uses the link.
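For instance, a quick back-of-the-envelope BDP calculation (a minimal sketch; the 100 Mbps / 40 ms figures are just example numbers, not from the article):

```python
# BDP = bandwidth (bits/s) * delay (s); example numbers chosen for illustration.
bandwidth_bps = 100e6       # 100 Mbps
rtt_s = 0.040               # 40 ms round-trip delay
bdp_bits = bandwidth_bps * rtt_s
print(bdp_bits / 8 / 1500)  # ~333 full-sized 1500-byte segments can be in flight
```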
Reno
Reno uses what is called ACK-Pacing: it infers the state of the network from the ACK stream. As long as ACKs keep arriving, the network is assumed to be able to handle the current sending rate.
Reno maintains an estimate of the time to send a packet and receive the corresponding ACK (the "round trip time", or RTT), and while the ACK stream shows that no packets are being lost in transit, Reno increases the sending rate by one additional segment each RTT interval.
Under Reno, CWND grows a little with every ACK received (one full segment per RTT in congestion avoidance), and the sender halves the sending rate when packet loss occurs. Ideally, Reno produces the following curve:
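A minimal sketch of this additive-increase / multiplicative-decrease behaviour, expressed per RTT for simplicity (the halving factor and the one-segment step follow the description above; everything else is illustrative):

```python
# Reno-style AIMD during congestion avoidance, one update per RTT.
def reno_next_cwnd(cwnd, loss_detected):
    if loss_detected:
        return cwnd / 2.0   # multiplicative decrease: halve on packet loss
    return cwnd + 1.0       # additive increase: one segment per RTT
```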
Reno makes the following assumptions:
- Packet loss is always caused by network congestion. In reality the network itself may be lossy and have an inherent packet loss probability, so this assumption is not rigorous;
- Network congestion must be caused by buffer overflow on the network.
- Network RTT and bandwidth are stable and not easy to change;
- Halving the sending rate must be enough for the buffers on the network to go from overflowing to draining. In other words, Reno also makes an implicit assumption about the size of the buffers on the network.
As these assumptions show, Reno is strongly affected by the buffer size on the link. When the buffer is small, packet loss may occur before the amount of data actually in flight reaches the bandwidth-delay product (BDP); Reno's transmission rate then immediately halves and the available bandwidth cannot be used efficiently. If the buffer is much larger than the BDP, the link may enter the "Bufferbloat" state, i.e. extremely high delay, because even after Reno halves its rate the link buffer may never empty, or may be non-empty most of the time. In addition, every time Reno slows down because of packet loss it has to retransmit data, while the data sent before the loss may still be queued on the link; the retransmission therefore wastes link resources, and the queuing delay persists across the whole path.
Reno's problems:
- As mentioned above, it is strongly affected by the size of the link buffer;
- For high-bandwidth networks, such as a 10 Gbps link, even with a very low latency of only 30 ms, adding one 1500-byte segment to CWND per RTT means on the order of 25,000 RTTs (more than ten minutes, and far longer for larger RTTs) before the bandwidth is actually used, with no packet loss allowed during that whole time, otherwise Reno slows down again (see the sketch after this list);
- Reno grows its CWND on every ACK it receives, which makes it unfair to connections with larger RTTs sharing the same link. For example, a connection with a very small RTT grows its CWND faster than the other connections on the link and occupies more than its fair share of the bandwidth.
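A minimal sketch of the arithmetic behind the second point above (the 10 Gbps / 30 ms / 1500-byte figures come from the text; the rest is just an illustration):

```python
# Time for Reno to grow from an empty window to a full 10 Gbps / 30 ms pipe,
# gaining one 1500-byte segment per RTT (figures taken from the text above).
bandwidth_bps = 10e9
rtt_s = 0.030
mss_bytes = 1500

bdp_segments = bandwidth_bps * rtt_s / 8 / mss_bytes   # segments needed in flight
time_to_fill_s = bdp_segments * rtt_s                  # one segment gained per RTT
print(bdp_segments, time_to_fill_s / 60)               # ~25000 segments, ~12.5 minutes
```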
BIC
BIC stands for Binary Increase Congestion Control and is an improvement on Reno. Its window handling works roughly in three steps: after packet loss, CWND is reduced to β × Wmax; CWND then grows by a binary search between the current window and Wmax (the "Binary Increase"), gradually approaching Wmax; once Wmax is reached, the curve is mirrored into a max-probing phase that searches for the next limit, slowly at first and then increasingly fast. See BIC TCP – Wikipedia for details.
Its advantages are that it recovers quickly after packet loss, stays near the stable point for as long as possible, and can also probe for higher bandwidth once the old limit is passed.
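A minimal sketch of the binary-search growth step described above (β and the midpoint rule follow the general BIC idea; the smoothing, Smax clamping, and other details of the real algorithm are omitted, and the constant here is illustrative rather than the exact kernel value):

```python
# Simplified BIC growth: binary search toward w_max, then mirrored probing.
BETA = 0.8  # illustrative reduction factor

def bic_on_loss(cwnd):
    w_max = cwnd
    return BETA * w_max, w_max               # new cwnd, remembered w_max

def bic_on_rtt(cwnd, w_max):
    if cwnd < w_max:
        return cwnd + (w_max - cwnd) / 2.0   # binary increase toward w_max
    return cwnd + (cwnd - w_max) + 1.0       # past w_max: probe increasingly fast
```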
Cubic
Cubic builds on BIC. First, it smooths the CWND growth curve with a cubic function, so the window can stay near the previous CWND maximum for a longer time. Second, it makes CWND growth independent of the RTT: instead of increasing CWND on every ACK, the cubic growth function depends on the elapsed time, so no matter how large the RTT is, CWND reaches a given value after the same amount of time. This makes the network fairer: connections with small RTTs no longer crowd out the resources of connections with large RTTs.
After smoothing, the curve of CWND over time looks as follows. The cubic curve keeps BIC's advantage: right after a packet loss, CWND grows quickly, reducing the impact of the loss; as it approaches the previous CWND maximum, growth becomes very slow and it takes a long time to pass Wmax, so the window stays as stable as possible around Wmax; finally, after passing Wmax, growth is very slow at first and then speeds up sharply to probe for the next Wmax.
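The shape described above comes from the cubic window function; here is a minimal sketch (the constants C and β follow the commonly cited RFC 8312 values, which may differ from a given kernel's tuning):

```python
# CUBIC window as a function of time since the last loss (RFC 8312 form):
#   W_cubic(t) = C * (t - K)^3 + W_max,   K = cbrt(W_max * (1 - beta) / C)
C = 0.4        # aggressiveness constant (RFC 8312 default)
BETA = 0.7     # multiplicative decrease factor (RFC 8312 default)

def cubic_window(t, w_max):
    k = (w_max * (1 - BETA) / C) ** (1 / 3)   # time needed to climb back to w_max
    return C * (t - k) ** 3 + w_max

# Grows fast right after the loss, flattens near w_max (around t == k),
# then accelerates again to probe for the next maximum.
for t in range(0, 11):
    print(t, round(cubic_window(t, w_max=100), 1))
```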
Cubic's final curve is shown below. The yellow line is CWND and the blue line is the number of packets queued on the network; the blue arrows mark the queue filling and the green arrows the queue draining. There are two light-green horizontal lines: the higher one is where packet loss starts, the lower one is where queuing starts. Because the bandwidth and the buffer on the network are fixed, Cubic never gets past the last Wmax into the curve that probes for the next Wmax; it is forced to reduce CWND by packet loss before it can get there. The first yellow spike is higher than the later ones because at that point the network buffer was still empty; afterwards, CWND rises again before the buffer has fully drained after a loss, so data starts queuing earlier and the later spikes stay lower.
Cubic’s advantages:
- It is fairer, because its growth is not tied to the RTT;
- It is better suited to networks with a large BDP. Reno is not, because Reno's growth depends on the RTT: when the BDP is large and the RTT is high, it takes Reno a very long time to reach a useful transmission rate.
Cubic's drawbacks:
- When the available bandwidth changes, Cubic takes a long time to leave its stable point and start probing for the next Wmax;
- It is more likely to cause Bufferbloat. Under Reno, if the link buffer is large and congested, the RTT also grows, ACKs take longer to come back, and CWND therefore grows more slowly. Cubic's CWND growth, however, has nothing to do with the RTT; it grows with time regardless, which makes it easier to overload the link. Bufferbloat is covered in the next section.
In general, Cubic is suitable for high-performance networks with a large BDP, where "performance" means bandwidth or sending rate: when the sending rate is large enough, Cubic can drain the buffer quickly and keep the delay low.
A good PPT about Cubic: Cubic; and the paper: CUBIC for Linux 2.6.
Bufferbloat
Bufferbloat – Wikipedia explains Bufferbloat well. In short, as memory gets cheaper, some devices on the link tend to have excessively large buffers, while the popular TCP congestion avoidance algorithms are loss-based. When data starts queuing, the link is already congested; this should be fed back to the sender immediately so that it can slow down. But because the buffer is large, the data simply queues up, the sender has no idea that its data has started queuing, and it keeps sending at the same rate (Reno) or even faster (Cubic, which raises CWND as time passes). When packets are eventually dropped, the sender still does not know about the loss and keeps sending quickly; it only slows down once the receiver notices the missing packet and signals it through its ACKs. The retransmitted packet then has to wait on the link behind everything that was queued before it. If the receiver's buffer is not large enough, much of the data that does arrive is discarded, because it cannot be delivered up to the application until the retransmitted packet arrives (the Head-of-Line blocking problem), which greatly increases the delay.
On the one hand, Bufferbloat causes extremely long latency on the network; on the other hand, it makes network traffic erratic, with the latency sometimes very small and sometimes very large.
Bufferbloat: Dark Buffers in the Internet – ACM Queue
PRR
Proportional Rate Reduction (PRR) is an optimization used together with CUBIC that lets the algorithm recover to its normal CWND quickly after packet loss, without first dropping the window too low.
Refer to draft-mathis-tcpm-proportional-rate-reduction-01 – Proportional Rate Reduction for TCP. I did not record any further details. It was introduced in Linux 3.x and works together with CUBIC.
Queue model on links
To optimize further than Cubic, we can first look at the queue model on the link:
- When the amount of data on the link is small, all the data is in transit and nothing is queued; latency is lowest in this state;
- As more data is transferred, data starts to queue and latency increases;
- When the queue is full, the link enters the third state: packets are dropped.
The optimal state is at the boundary between State 1 and State 2: there is no queuing-induced delay, yet the data fully occupies the link bandwidth, giving high efficiency and low delay. Loss-based congestion avoidance strategies instead keep the link in State 2; Cubic keeps CWND as close as possible to the previous Wmax, i.e. at the end of State 2, just about to switch into State 3. Cubic therefore grabs as much of the link as it can, trying to fill the downstream buffers with data at the cost of delay.
It follows that, to stay between State 1 and State 2, we can monitor the RTT of every packet: tentatively increase CWND to raise the sending rate, and if the RTT does not increase after the rate goes up, keep increasing it; once the higher rate starts to affect the RTT, back off, because that is the point where the link switches from State 1 to State 2.
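A minimal sketch of that probing idea (purely illustrative; the threshold, step size, and rate variable are assumptions, not any real algorithm's parameters):

```python
# Increase the rate while the RTT stays near its baseline; back off once the
# RTT rises, i.e. once the link moves from State 1 into State 2.
def adjust_rate(rate, rtt_sample, rtt_min, step=1.05, threshold=1.1):
    if rtt_sample <= rtt_min * threshold:
        return rate * step      # no queuing observed: keep probing upward
    return rate / step          # RTT rising: a queue is building, slow down
```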
We want CWND to stay as close as possible to the point labelled the Optimum Operating Point in the figure below, where the RTT is small and the link bandwidth is fully used. In the figure, RTprop is the round-trip propagation time and BtlBw the bottleneck bandwidth. The horizontal axis is the amount of data in flight; the vertical axes are RTT and delivery rate. A very light line in the middle divides the figure into two halves: the upper half shows the relationship between inflight data and round-trip time, and the lower half the relationship between inflight data and delivery rate.
Four quantities need to be kept apart here: CWND, the sending rate, the inflight data, and the sender's bandwidth. Inflight data is data the sender has put on the link but for which it has not yet received an ACK. The sending rate is the rate at which the sender emits data, and the sender's bandwidth is the maximum sending rate it could achieve. For example, even with enough bandwidth and enough data to send, an insufficient CWND forces the sending rate down. These quantities easily get tangled up, and many articles about congestion algorithms mix some of them up for convenience, so keep this in mind when reading.
Back to the figure: the blue line is the limit imposed by RTprop, and the green line the limit imposed by BtlBw. The grey areas are regions that cannot occur, because they are ruled out by at least one of these two limits. The white area does not normally appear either, but it is theoretically possible, for example when other senders on the link are transmitting and interfering with each other; the grey areas, however, can never occur.
At first, with few inflights, the RTT is dominated by the link's inherent RTprop: even if the link buffer is empty, the RTT is at least RTprop. As the inflight data grows beyond the BDP, the RTT gradually increases with a slope of 1/BtlBw: the blue dashed part to the right of the BDP corresponds to the buffer, and the vertical-axis value of each point is the time needed to drain that much buffered data. Draining time divided by buffer size is 1/BtlBw, because draining time multiplied by BtlBw equals the buffer size. The main point is that the grey region cannot be reached, because the delivery rate can never exceed BtlBw. When the inflight data completely fills the buffer, the red zone is reached and packets start being dropped.
In the lower half, when the inflight amount is small, the delivery rate grows with the inflight data: the bandwidth limit has not been reached yet, so the more data is sent, the higher the arrival rate. Once the inflight data exceeds the BDP, the delivery rate is capped by BtlBw and stops growing; the arrival rate has reached its ceiling and the buffer starts to fill. When the buffer hits its limit, packets are dropped (the red area).
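The two halves of the figure can be written down as a small model; a minimal sketch of the idealised relationships described above, ignoring the noise a real link would show:

```python
# Idealised link model: RTT and delivery rate as functions of inflight data.
# inflight, bdp and buffer_bits are in bits; btl_bw in bits/s; rt_prop in seconds.
def link_model(inflight, btl_bw, rt_prop, buffer_bits):
    bdp = btl_bw * rt_prop
    if inflight <= bdp:
        rtt = rt_prop                               # no queue: RTT equals RTprop
        delivery_rate = inflight / rt_prop          # rate grows with inflight data
    elif inflight <= bdp + buffer_bits:
        rtt = rt_prop + (inflight - bdp) / btl_bw   # queue adds delay, slope 1/BtlBw
        delivery_rate = btl_bw                      # bandwidth-limited
    else:
        raise ValueError("buffer overflow: packets are dropped")
    return rtt, delivery_rate
```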
Loss-based congestion algorithms effectively work together with the link buffer to control how far the inflight data sits beyond the BDP line. Memory used to be expensive, so buffers were only slightly larger than the BDP, and loss-based algorithms did not cause very high latency, perhaps only slightly above the optimal RTT. But as memory becomes cheaper, the buffers on links tend to be very large, and loss-based congestion algorithms then tend to keep those buffers heavily occupied, which is exactly what causes Bufferbloat.
To reach the optimum, the transmission rate must reach BtlBw, so that the full link bandwidth is used as efficiently as possible; at the same time, the amount of inflight data must be kept from exceeding the link's BDP = BtlBw × RTprop, so that the delay stays optimal. The sending rate may temporarily exceed BtlBw and queues may form for a moment, but as long as the total inflight data does not exceed the BDP, the average delay stays near the optimum.
TCP Vegas
Vegas works just like the ideal algorithm described above: it monitors the RTT, and when it tries to raise the sending rate and observes packet loss or an RTT increase, it lowers the rate again, assuming the network is congested (see the sketch after this list). But it has these problems:
- Its CWND growth is linear and, like Reno, it takes a long time before it makes good use of the available transmission rate;
- Most fatally, it cannot coexist well with loss-based congestion avoidance algorithms. Those algorithms keep the queue between State 2 and State 3, i.e. they queue as much as possible, effectively pushing the delay up, while Vegas backs off as soon as it sees queuing. When sharing a link with loss-based algorithms, Vegas is therefore gradually crowded out.
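To make the delay-based decision concrete, here is a minimal sketch in the spirit of Vegas' expected-versus-actual throughput comparison (the alpha/beta thresholds and units are illustrative assumptions, not the exact published constants):

```python
# Vegas-style per-RTT adjustment: compare expected and actual throughput.
ALPHA, BETA = 2, 4   # thresholds in packets; illustrative values

def vegas_next_cwnd(cwnd, base_rtt, current_rtt):
    expected = cwnd / base_rtt              # throughput if nothing were queued
    actual = cwnd / current_rtt             # measured throughput this RTT
    diff = (expected - actual) * base_rtt   # estimated packets queued on the link
    if diff < ALPHA:
        return cwnd + 1   # little queuing: linear increase
    if diff > BETA:
        return cwnd - 1   # too much queuing: back off before loss occurs
    return cwnd
```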
References are provided in the next article.