3.1 the introduction
The lower network layer (IP) may lose, duplicate, or disorder packets. TCP is required to provide reliable data transmission services. To ensure the correctness of data transmission,TCP retransmits packets it considers lost. TCP determines whether packets are lost based on a series of confirmation messages sent from the receiving end to the sending end. When a data segment or confirmation fails,TCP retransmits unconfirmed data. Two sets of mechanisms for retransmission :1. Based on time and 2. Based on the composition of confirmation information (more efficient)
1.RTO(Retransmission timeout): TCP sets a timer when sending data. If no confirmation message is received after the timeout, the corresponding timeout or the retransmission operation based on the timer is triggered.
2. Fast transmission: If TCP cumulative acknowledgement fails to return a new ACK, or if the ACK contains selective acknowledge (SACK) indicating that an out-of-order packet segment occurs, fast retransmission will infer that packet loss occurs.
3.2 Simple Timeout and retransmission examples
Binary exponential backoff: Time between retransmissions
Logically,TCP has two thresholds that determine how to retransmit a message segment. R1 and R2
R1: the number of times (or wait time) TCP is willing to try retransmission before sending “negative suggestions “(such as reevaluating the CURRENT IP path) to the IP layer; R1 in the data transmission segment (SYN packet segment) is greater than =3
R2:(R2>R1) indicates when TCP should drop the current connection. The R2 of the data transmission segment is greater than =100s, and the R2 of the TCP connection segment is greater than =180s
R1 and R2 Settings in Linux, either through application or system configuration. Net.ipv4. tcp_retries1: indicates the number of retransmissions. The default value is 3. Net.ipv4. tcp_retries2: indicates the default value is 15. For SYN packet segments net.ipv4.tcp_syn_retries and net.ipv4.tcp_synack_retries, the retransmission times are set to 5(about 180s) by default.
3.3 Setting retransmission Timeout
How does TCP set the RTO based on the RTT of a given connection? If TCP retransmits data before RTT, unnecessary duplicate data may be introduced into the network. Otherwise, if data is sent at an interval longer than RTT, the overall network utilization decreases.
TCP returns an acknowledgement message after receiving data, so it can carry one byte of data in the message to measure the time it takes to transmit the acknowledgement message. Each measurement result is called RTT sample.TCP first needs to establish an estimate based on the sample value in a period of time, and the second part is how to set the RTO based on the estimate.
3.3.1 Classical methods
The original TCP specification [RFC0793] calculates a smooth RTT estimate (called SRTT) using the following formula:
SRTT <– α(SRTT) + (1 – α)RTTs
- SRTT is a measurement method called the Weighted Moving Average (EWMA) or Low Pass filter, based on updated results from existing values and from new sample RTTs.
- α: smoothing factor, recommended value is 0.8-0.9,80% ~ 90% from existing values and 10% ~ 20% from new measurements
Considering that the estimated value obtained by the SSRTT estimator varies with the RTT value,[RFC0793] recommends setting the RTO according to the following formula:
RTO = min (ubound, Max (lbound (SRTT) beta))
- β: indicates the delay dispersion factor. The recommended value is 1.3 to 2.0
- Ubound :RTO upper bound (can be set to 1 minute recommended)
- Lbound :RTO lower bound (suggested value,1 second)
We call it the classical method, which sets the RTO to 1 second, or about twice the SRTT
3.3.2 Standard Method (Jacobson)
Setting timers in the classical way described above will not accommodate large RTT variations (especially if the actual RTT is much larger than the estimate, resulting in unnecessary retransmissions). An enlarged RTT sample indicates that the network is overloaded, at which point unnecessary retransmission will further burden the network.
In order to solve this problem, the original method can be improved to adapt to the situation where RTT changes greatly. Changes in RTT measurements and average values were recorded to obtain more accurate estimates. Setting the RTO based on changes in the mean and estimated values will be more adaptable to large changes in RTT than calculating the RTO using a constant multiple of the mean.
The following algorithm adopts the classical method ([RFC0793]), and also considers the method of RTT sample change value to calculate the comparison of RTO. We use TCP’s sample RTT measurements as a statistical process, while measuring the mean and variance (or standard deviation) is a better estimate of future values, and helps TCP set an RTO that is suitable for most cases.
Mean deviation is a good approximation of standard deviation, but calculation is easier and faster. This is because calculating the standard deviation requires the square root of the difference, which is expensive for a fast TCP implementation. So we use a combination of mean and mean deviation to estimate. The following formula is used for each RTT measurement M(previous RTTs) :
srtt <— (1 – g)(srtt) + (g)M
rttvar <— (1 – h)(rttvar) + (h)(|M – srtt|)
RTO = srtt + 4(rttvar)
Here, SRTT replaces the previous SRTT, and RTTVAR is the EWMA of the mean deviation, rather than the previous β setting for RTO. The above equation can also be written in another form, which is more convenient for computer implementation:
Err = M – srtt
srtt <— srtt + g(Err)
rttvar <— rttvar + h(|Err| – rttvar)
RTO = srtt + 4(rttvar)
- SRTT: EWMA of the mean, involved in calculating the RTO, and varies over time
- The absolute error | rttvar: Err | EWMA, participate in computing RTO, and change over time
- Err: The deviation between the measured value M and the current RTT estimate BEFORE SRTT
- G: increment g, the weight of the new RTT sample M to the SRTT estimate, with the value 1/8, and the value 2 to the (negative) power
- H: Incremental H is the weight of the new mean deviation sample (the absolute error before the new sample M and the current mean SRTT) to the estimated deviation value Rttvar. The value is 1/4, and the value is 2 to the (negative) power
When RTT changes, the larger the increment of deviation is, the faster the RTO grows.
The difference between the classical method and Jacobson’s calculation method:
- The average RTT is calculated similarly (α = 1-g) with different increments.
- Jacobson calculates RTO based on smooth RTT and smooth deviation at the same time, and the classical method simply adopts the multiple of smooth RTT, which is the method of many TCP RTO implementations so far.
3.3.2.1 Boundary between clock Granularity and RTO
During the measurement of RTT, the TCP clock clock is in running state. For initial sequence numbers, the actual TCP connection clock does not start from zero, nor does it have absolute precision. The TCP clock is usually a variable that is updated with the system clock, but not one to one synchronously. The time of a TCP clock “tick” is often referred to as granularity. Typically, this value is relatively large (about 500ms), but the new clock is more granular (Linux uses 1ms)
Granularity affects RTT measurements and RTO Settings. Granularity is used to optimize the update situation of the RTO and sets a lower bound on the RTO. The computing companies are as follows:
RTO = max(srtt + max(G,4(rttvar)),1000)
- G: Granularity of the timer. 1000ms is the lower limit of the entire RTO. Therefore, the RTO must be at least 1s
3.3.2.2 initial value
Prior to the first SYN exchange, TCP cannot set the initial RTO value unless the system provides it. According to [RFC6298], the initial VALUE of RTO is 1s, and the timeout period of the initial SYN packet segment is 3s. When the first RTT measurement M is received, the estimator is initialized as follows:
srtt <— M
rttvar <— M/2
3.3.2.3 Retransmission ambiguity and Karn algorithm
Retransmission ambiguity: In the process of measuring THE RTT sample, if the transmission of a packet times out, the data will be retransmitted and an ACK message will be received. Then, the ACK will confirm whether the first or second transmission exists this ambiguity.
Karn algorithm
-
[KP87] indicates that a timeout has occurred and the RTT estimate cannot be updated when confirmation of retransmitted data is received. This is the first part of Karn’s algorithm. The ambiguity problem of RTT estimation is solved by eliminating the ambiguity data.
-
TCP uses a backoff factor to calculate the RTO. When the retransmission timer times out, the backoff factor doubles, and the process continues until the data is not retransmitted. At this point, the retreat coefficient is 1, and the retransmission timer returns to normal. Doubling the retransmission coefficient is the second part of Karn’s algorithm. If TCP times out, congestion control mechanism is also triggered, which changes the sending rate.
When receiving the acknowledgement message of the data that has been re-transmitted (that is, retransmitted at least once), the RTT measurement of the data packet is not performed to avoid the retransmission ambiguity problem. In addition, a fallback strategy is applied to packets following this data. Only if unretransmitted data is received. This SRTT is used to calculate the RTO.
3.3.2.4 RTT measurement with timestamp option
TCP timestamp option (TSOPT) is the basis of PAWS algorithm, and RTT measurement (RTTM) can also be used. Allows the sender to carry a 32 bit number in the corresponding acknowledgement message returned.
The timestamp value (TSV) carries the initial SYN in TSOPT and is returned in the TSER part of TSOPT for SYN+ACK to set the initial values of SRTT, RTTVAR, and RTO.
When a large amount of data is transmitted,TCP usually returns an ACK for every two packet segments. When data is lost, disordered, or retransmitted successfully,TCP does not have a one-to-one relationship between the packet segments of the cumulative acknowledgement mechanism and its ACKS. To address these issues, TCP using the timestamp option uses the following algorithm to measure RTT sample values:
- 1. The TCP sender carries a 32-bit timestamp value in the TSV part of the TSOPT of each packet segment it sends. This value contains the TCP clock value at the time the data is sent.
- 2. The receiver records the received TSV value (the variable of TsRecent) and returns it in the corresponding ACK and records the ACK number (the variable of LastAck) above it. The ACK number represents the next ordered sequence number that the receiver (ACK sender) expects to receive.
- 3. When a new packet segment arrives and its sequence number matches the value of LastACK (that is, the next packet segment to be received), its TSV value is stored in TsRecent.
- 4. Any ACK sent by the receiver contains TSOPT, and the timestamp contained in TsRecent is written to its TSER section.
- 5. After receiving the ACK, the sender subtracts the TSER value from the current TCP clock. The difference obtained is the new RTT sample estimate.
3.3.3 Methods used in Linux
The RTO is usually set to SRTT +4(RTTVAR), and any large change in RTTVAR will result in an increase in THE RTO, regardless of whether the value of the largest RTT sample is greater than or less than SRTT. Linux solves this problem by drastically reducing the impact of RTTVAR by reducing the RTT sample value. How to set RTO in Linux:
Like the standard method,Linux also logs SRTT and rttvar values, but it also logs two new variables, mdev and mdev_max.
- Mdev: An estimate of the instantaneous mean deviation of the standard method, rTTVAR above
- Mdev_max: Record the maximum MDEV during the measurement of RTT samples. The minimum value is 50ms. Rttvar must be updated periodically to ensure that it is not less than Mdev_max. So the RTO should not be less than 200ms.
Linux updates rttvar.RTO based on the value of mdev_max, always equal to the sum of SRTT and 4(rttvar) to ensure that RTO does not exceed TCP_RTO_MAX(default: 120s). The diagram below.
-
First RTT sample measurement
-
srtt = 16ms
-
mdev = (16/2)ms = 8s
-
Rttvar = mdev_max = Max (mdev,TCP_RTO_MIN) = Max (8,50)
-
RTO = srtt + 4(rttvar) = 16 + 4(50) = 216ms
After the initial SYN exchange, the sender returns an ACK for the recipient’s SYN, and the recipient performs a window update. These packets contain no data data (SYN or FIN fields, but they count as data), no time is recorded, and no RTT updates are made when window updates are received by the sender. TCP does not provide reliable transmission for packets that do not contain data. This means that packets are not retransmitted in case of packet loss. Therefore, you do not need to set a retransmission timer.
Note: The TCP option itself does not do retransmission or reliability transfer. Retransmission is lost as a side effect only when the data segment (including SYN and FIN packet segments) is specified.
When the application performs the write operation for the first time, the sender TCP sends two packet segments, each containing a TSV of 127. The two values are equal because the interval between sending packets is less than 1ms(TCP sending granularity of the sender). When the sender sends multiple segments in this manner, no progress can be seen.
The receiver variable LastACK records the sequence number of the LastACK sent. In this example, the last ACK packet sent is a SYN + ACK packet in the connection establishment phase. Therefore, AC starts from 1. If the first full-size packet segment arrives and its serial number matches LastACK, TsRecent is updated to TSV of the receiving group, that is, 127. The arrival of the second message segment does not update TsRecent because its sequence number field does not match LastACK. When the receiver returns the ACK of the corresponding group, it should include TsRecent in its TSER part and update the ACK number of the LastACk variable to 2801.
- The second RTT sample was measured
Sample M = 223-127 = 96
-
mdev = mdev(3/4) + |m – srtt|(1/4) = 8(3/4) + |80|(1/4) = 26ms
-
Mdev_max = Max (mdev_max,mdev) = Max (50,26) = 50ms
-
srtt = srtt(7/8) + m(1/8) = 16(7/8) + 96(1/8) = 14 + 12 = 26ms
-
rttvar = mdev_max = 50ms
-
RTO = srtt + 4(rttvar) = 26 + 4(50) = 226ms
Rttvar in the standard method has a large weight (coefficient 4), so when RTT decreases, RTO will also increase. At larger clock granularity (500ms), there is not much of an impact because there are few RTO values available. However, a fine clock, such as 1ms on Linux, may cause problems. In the case of reduced RTT, if the new sample value is less than the lower bound of the RTT estimation range (SRTT-mdev), the weight of the sample will be reduced. The pseudocode is as follows:
if(m < (srtt - mdev))
mdev = (31/32) * mdev + (1/32)*|srtt -m|
else
mdev = (3/4)* mdev + (1/4) * |srtt - m|
Copy the code
3.5 Fast Retransmission
The fast retransmission mechanism triggers retransmission based on feedback from the receiver, not the timeout of the retransmission timer. Compared with timeout retransmission, fast retransmission can repair packet loss more timely and effectively.
Fast retransmission algorithm: the TCP sender retransmits the packet of data that may have been lost after observing at least dupthRESH (threshold for repeated acks), without waiting until the retransmission timer times out. You can also send new data at the same time. Packet loss inferred from repeated ACK is usually associated with network congestion, so the accompanying fast retransmission should trigger congestion control mechanisms. If SACK is not used, only one packet segment can be retransmitted at most before receiving valid ACK. With SACK, the ACK can contain additional information, allowing the sender to fill multiple gaps in each RTT time.
3.6 Retransmission with selection confirmation
As the selection and confirmation selection becomes standardized, the TCP receiver can provide the SACK function to describe the received data through ACk fields accumulated in the TCP header. The gap between the ACK number and the rest of the data in the receiver’s buffer is called a void. Data with a higher serial number than the gap is called out-of-order data because it is not contiguous with the previously accepted sequence number.
The TCP sender’s task is to make up for the gap in the receiver’s buffer by retransmitting the lost data, but at the same time to ensure that the correctly received data is not retransmitted. The reasonable use of SACK information can realize vacancy filling faster and reduce unnecessary retransmission, because it can know a vacancy within an RTT. When the SACK option is used, an ACK can contain three or four SACK messages informing of out-of-order data. Each SACK message contains a 32-bit sequence number representing the start to the last sequence number of the out-of-order data stored at the receiver.
The SACK option specifies that n blocks are 8n+2 bytes long, so 40 bytes can contain four blocks. SACK is typically used with TSOPT, and therefore requires an additional 10 bytes (plus 2 bytes of padding data), which means that SACK can contain only three blocks per ACK.
3.6.1 Behavior of the SACK receiver
SACK is generated when the receiver receives the SACK license option during TCP connection establishment. In general, the receiver can generate SACK whenever there is out-of-order data in the cache.
The first SACk block contains the sequence number of the most recently received packet segment. Because the SACK option has limited space, you should ensure that the TCP sender is provided with up-to-date information whenever possible. The rest of the SACK blocks also contain the contents in order of receipt. That is, the latest block contains the received sequence number and repeats the previous SACK block.
The purpose of including multiple SACK blocks in a SACK option and repeating the block information in multiple sacks is to provide some backup in case of SACK loss.
3.6.2 Behavior of the SACK sender
SACK function should also be provided at the sender, and the received SACK block should be properly used for packet loss retransmission, which is called selective retransmission or selective retransmission SACk sender records cumulative ACK information received.SACk information also needs to be recorded and used to avoid retransmitting correctly received data.
When the SACk sender performs a retransmission, usually as a result of receiving SACk or repeated ACKS, it can choose to send new or old data. SACK information provides the serial number range of data at the receiving end, so that the sender can infer the vacant data to be retransmitted. The simplest way to do this is to have the sender fill the gap at the receiver and then continue sending new data. This is a common method.
3.7 Pseudo Timeout and retransmission
In many cases, retransmission can occur even if there is no data loss. This type of unnecessary retransmission is called spurious retransmission. The primary cause is spurious retransmission. Other causes include out-of-order, packet duplication, or ACK loss. Pseudo timeouts may occur when the actual RTT increases significantly beyond the current RTO. This happens more often in environments where the underlying protocol performance varies greatly (wireless environment).
Pseudo timeout solutions usually include detection algorithm and response algorithm.
3.7.2 Eifel detection algorithm
Experimental Eifel detection algorithm uses TSOPT of TCP to detect false retransmission. After timeout retransmission occurs,Eifel algorithm waits to receive the next ACK. If it confirms the first transmission, it determines that the retransmission is pseudo retransmission.
The mechanism of Eifel detection algorithm is simple. It needs to use TCP’s TSOPT. When sending a retransmission, save the TSV value. When an ACK of the corresponding grouping is received, the TSER part of the ACK is detected. If the TSER part is smaller than the TSV value of the previous store, it can be judged that the ACK corresponds to the original transmission group, that is, the group is pseudo retransmission.
3.7.3 RTO Forward Recovery (F-RTO)
** Forward RTO Recovery (F-RTO)** is the standard algorithm for detecting false retransmission. The algorithm detects only pseudo retransmission caused by the timeout of the retransmission timer. False retransmission caused by other reasons cannot be judged.
F-rto modifies TCP’s behavior so that on the first ACK received after timeout retransmission,TCP sends new (non-retransmitted) data and then responds to an incoming ACK. If there is a duplicate ACK, the retransmission is considered ok. If neither of these are duplicate ACKS, the retransmission is a pseudo-retransmission. If the transfer of new data receives a corresponding ACK, the receiver window moves forward. If the sending of new data results in duplicate ACKS then the receiver has at least one or more vacancies. In both cases, receiving new data does not affect overall data transmission performance.
3.7.4 Corresponding Eifel algorithm
Once pseudo-retransmission is judged, a set of standard operations, namely Eifel response algorithm, will be triggered. After the retransmission timer times out, it checks the values of SRTT and rttvar and records the new variables srTT_prev and rTTVAR_prev as follows:
-
srtt_prev = srtt + 2(G)
-
rttvar_prev = rtvar
- G:TCP clock granularity
- Srtt_prev: Set this parameter to SRTT plus twice the clock granularity. Because SRTT is too small, pseudo timeout may occur. If the SRTT is slightly larger, a timeout may occur.
After srTT_prev and RTTVAR_prev are stored, some detection algorithm is triggered. Running the detection algorithm yields a special value called SpuriousRecovery. If a false timeout is detected, set false recovery to SPUR_TO. If a late timeout is detected, set it to LATE_SPUR_TO. Otherwise,TCP continues to respond to normal timeout.
If SPUR_TO is SPUR_TO,TCP can operate before the recovery phase is complete. Change the sequence number of the next segment to be sent to the latest segment that has not been sent. This avoids the “fallback N” behavior.
Snd. NXT does not change if a late pseudo-timeout is detected and an ACK for the first retransmission has been generated. In both cases, the congestion control state needs to be reset. SRTT, RTTVAR, and RTO are updated as follows upon receipt of ACk segments sent after the retransmission timer has expired
-
srtt <— max(srtt_prev,m)
-
rttvar <—max(rttvar_prev,m/2)
-
RTO = srtt + max(G,4(rttvar))
- M :RTT sample value, calculated by the ACK of the first sent data received after timeout.