Preface

Improving TCP performance depends not only on theoretical knowledge of TCP, but also on understanding and applying the kernel parameters provided by the operating system.

TCP is implemented in the operating system kernel, so the operating system provides many parameters for tuning TCP.

Linux TCP parameters

Using these parameters correctly and effectively to improve TCP performance is not simple: we need to target the problems that arise at each stage of a TCP connection, rather than tuning parameters blindly.

Next, strategies for improving TCP performance will be explained from three angles:

  • TCP three-way handshake performance improvement;
  • TCP four-way close performance improvement;
  • TCP data transmission performance improvement;

Section outline


Main text

01 Improving TCP three-way handshake performance

TCP is a connection-oriented, reliable, bidirectional transport layer communication protocol. Therefore, a three-way handshake is required to establish a connection before data transmission.

Three-way handshake and data transfer

The three-way handshake can account for more than 10% of the time of an average HTTP request. In scenarios such as poor network conditions, high concurrency, or SYN attacks, failing to tune the handshake parameters properly can significantly hurt performance.

The key to using these parameters correctly and effectively is to understand the state transitions of the three-way handshake. Then, when a problem occurs, you can use the netstat command to determine which phase of the handshake is failing and apply the appropriate remedy.

TCP three-way handshake state transitions

Both the client and the server can optimize the three-way handshake. Optimizing the client, which actively initiates the connection, is relatively simple. The server must listen on a port and is the passive side of the connection; it maintains many intermediate states, so its optimization methods are more complex.

Therefore, the client (active initiator) and the server (passive side) are optimized in different ways. Let's look at each in turn.

Client optimization

The primary purpose of the three-way handshake is to “synchronize sequence numbers.”

Reliable transmission is possible only after sequence numbers are synchronized: many TCP features, such as flow control and retransmission on packet loss, depend on sequence numbers. This is why the first packet of the three-way handshake is called SYN, short for Synchronize Sequence Numbers.

The TCP header

Optimization of the SYN_SENT state

The client, as the active initiator of the connection, first sends a SYN packet, so the client's connection enters the SYN_SENT state.

Normally, the server returns a SYN+ACK packet within milliseconds. If the client does not receive one for a long time, it retransmits the SYN packet; the number of retries is controlled by the tcp_syn_retries parameter.
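For reference, a minimal sketch of inspecting and tuning this parameter (the value 2 is purely illustrative; on many kernels the default is 6 rather than 5, which adds one more retransmission to the timeline below):

```bash
# View the current SYN retry count
cat /proc/sys/net/ipv4/tcp_syn_retries

# Example: reduce retries so errors surface quickly (e.g. on an intranet)
sysctl -w net.ipv4.tcp_syn_retries=2
```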

Typically, the first timeout retransmission occurs after 1 second, the second after 2 seconds, the third after 4 seconds, the fourth after 8 seconds, and the fifth after 16 seconds. That’s right: each timeout is twice as long as the previous one.

After the fifth timeout retransmission, the client waits another 32 seconds; if the server still has not responded, the client terminates the three-way handshake.

So, the total time is 1+2+4+8+16+32=63 seconds, which is about 1 minute.

SYN retransmission timeouts

You can adjust the client's SYN retransmission count, and thus the maximum handshake time, based on network stability and how busy the target server is. For example, for intranet communication, reduce the retry count to expose errors to the application as early as possible.

Server optimization

After receiving a SYN packet, the server replies with a SYN+ACK packet, confirming receipt of the client’s sequence number and sending its own sequence number to the peer.

At this point, a new connection appears on the server in the SYN_RCV state. In this state, the Linux kernel maintains a “half-connection queue” to hold the handshakes that are not yet complete. When this queue overflows, the server can no longer establish new connections.

Half-connection queue and full-connection queue

A SYN attack targets exactly this half-connection queue: by flooding the server with SYN packets, an attacker fills the queue so that legitimate handshakes are dropped.

How do I check if a connection was dropped because the SYN half-connection queue was full?

The number of connections dropped because the half-connection queue was full can be obtained from the netstat -s command:
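For example (the counter wording is standard netstat -s output on Linux; the count shown is illustrative):

```bash
netstat -s | grep -i "SYNs to LISTEN"
# Sample output:
#     1192450 SYNs to LISTEN sockets dropped
```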

The output above is a cumulative value: the total number of TCP connections dropped because the half-connection queue overflowed. If you run the command every few seconds and the number keeps rising, the half-connection queue is currently overflowing.

How do I resize the SYN half-connection queue?

To enlarge the half-connection queue, you cannot just raise tcp_max_syn_backlog; you must also raise somaxconn and the listen() backlog (which govern the accept queue). Otherwise, increasing tcp_max_syn_backlog alone has no effect.

tcp_max_syn_backlog and somaxconn can be enlarged by modifying the Linux kernel parameters:
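A hedged example of setting both (the value 1024 is illustrative; size it to your workload):

```bash
# Apply immediately:
sysctl -w net.ipv4.tcp_max_syn_backlog=1024
sysctl -w net.core.somaxconn=1024

# Or persist in /etc/sysctl.conf:
#   net.ipv4.tcp_max_syn_backlog = 1024
#   net.core.somaxconn = 1024
```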

How the backlog is enlarged varies from one Web server to another. For example, Nginx increases the backlog as follows:
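A sketch of the relevant Nginx directive (the port and value are illustrative; backlog is an argument of Nginx's listen directive):

```nginx
server {
    # backlog sets the accept-queue hint passed to listen()
    listen 8088 backlog=1024;
}
```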

Finally, after changing these parameters, the Nginx service must be restarted, because the SYN half-connection queue and the accept queue are initialized in listen().

If the SYN half-connection queue is full, can the server only drop connections?

Not necessarily: enabling syncookies allows connections to be established successfully without using the SYN half-connection queue at all.

How syncookies work: the server computes a value from the current state and sends it to the client in the SYN+ACK packet. When the client returns its ACK, the server extracts the value and validates it; if it is legitimate, the connection is established successfully, as shown in the figure below.

Enable syncookies

The tcp_syncookies parameter has three values:

  • 0: the feature is disabled;
  • 1: the feature is enabled only when the SYN half-connection queue is full;
  • 2: the feature is enabled unconditionally.

To guard against SYN attacks, set the value to 1:
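For example:

```bash
sysctl -w net.ipv4.tcp_syncookies=1
# or persist it in /etc/sysctl.conf:
#   net.ipv4.tcp_syncookies = 1
```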

Optimization of the SYN_RCV state

After the client receives the SYN+ACK packet from the server, it replies with an ACK. The client's connection state then changes from SYN_SENT to ESTABLISHED, meaning that, on the client side, the connection is established successfully.

The server establishes its side of the connection slightly later: only after receiving the client's ACK does the server's connection state change to ESTABLISHED.

If the server does not receive an ACK, it resends a SYN+ACK packet and remains in the SYN_RCV state.

When the network is busy or unstable, packet loss becomes more severe, and you should increase the number of retransmissions; conversely, on a stable network you can decrease it. To change the retry count, modify the tcp_synack_retries parameter:
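A minimal example (the value 2 is illustrative, suited to a stable intranet):

```bash
cat /proc/sys/net/ipv4/tcp_synack_retries   # default: 5
sysctl -w net.ipv4.tcp_synack_retries=2
```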

The default value of tcp_synack_retries is 5. As with the client's SYN retries, the retransmissions occur after 1, 2, 4, 8, and 16 seconds; after the last retransmission, the server waits another 32 seconds before giving up.

When the server receives the ACK, the kernel removes the connection from the half-connection queue, creates a full connection, and adds it to the accept queue, where it waits for the process to call the accept function to take it.

If the process fails to call accept in time, the accept queue (also called the full-connection queue) overflows, and established TCP connections are discarded.

Accept queue overflow

If the accept queue is full, can the server only drop connections?

Dropping the connection is Linux's default behavior, but we can instead have the server send an RST reset packet to the client, telling it that the connection has failed. This is enabled by setting the tcp_abort_on_overflow parameter to 1.
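For example:

```bash
sysctl -w net.ipv4.tcp_abort_on_overflow=1   # reply RST when the accept queue is full
```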

tcp_abort_on_overflow has two values, 0 and 1, which mean:

  • 0: if the accept queue is full, the server discards the ACK sent by the client;
  • 1: if the accept queue is full, the server sends an RST packet to the client, aborting the handshake and discarding the connection.

In practice, tcp_abort_on_overflow helps with troubleshooting: if clients are failing to connect, temporarily set it to 1. If clients then report errors such as “connection reset by peer,” that proves the server's accept queue is overflowing.

In general, you should set tcp_abort_on_overflow to 0, because this copes better with bursts of traffic.

For example, when the accept queue is full and the server drops the client's ACK, the client considers the connection ESTABLISHED and starts sending requests on it. Because the server never ACKs those requests, the client retransmits them several times. If the server process was only briefly busy and the accept queue drains, the next retransmitted packet still carries an ACK, so the server can complete the connection.

tcp_abort_on_overflow set to 0 handles burst traffic

Therefore, setting tcp_abort_on_overflow to 0 improves the success rate of establishing connections. Only if you are certain the accept queue will remain overflowed for a long time should you set it to 1, to notify clients as quickly as possible.

How do I adjust the length of the Accept queue?

The length of the accept queue is the minimum of somaxconn and backlog, i.e. min(somaxconn, backlog), where:

  • somaxconn is a Linux kernel parameter; the default is 128, and it can be set via net.core.somaxconn;
  • backlog is the second argument of listen(int sockfd, int backlog).

Common Web servers such as Tomcat, Nginx, and Apache use a backlog value of 511.

How do I check the length of the server process Accept queue?

You can run the ss -lnt command to view it:
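Sample output (illustrative; the listener on port 8088 matches the description below):

```bash
ss -lnt
# State   Recv-Q  Send-Q  Local Address:Port   Peer Address:Port
# LISTEN  0       128     *:8088               *:*
```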

  • Recv-Q: the current size of the accept queue, i.e. TCP connections that have completed the three-way handshake and are waiting for the server to accept() them;
  • Send-Q: the maximum length of the accept queue. The output above shows a TCP service listening on port 8088 whose accept queue has a maximum length of 128.

How do I view connections that were dropped because the Accept connection queue was full?

When the accept queue overflows, the server discards subsequent TCP connections, and the number of discarded connections is counted. We can check it with the netstat -s command:
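For example (the counter wording is standard netstat -s output on Linux):

```bash
netstat -s | grep overflowed
# Sample output:
#     41150 times the listen queue of a socket overflowed
```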

Here, 41150 is the number of accept queue overflows. Note that this is a cumulative value: run the command every few seconds, and if the number keeps increasing, the accept queue is intermittently full.

If connections keep being discarded because of accept queue overflow, the backlog and somaxconn parameters should be enlarged.

How can the three-way handshake be bypassed?

So far we have only optimized the three-way handshake itself. Now let's see how to bypass it and send data right away.

A consequence of establishing a connection with a three-way handshake is that the HTTP request can only be sent after one RTT (one round trip between client and server).

Regular HTTP requests

Since Linux 3.7, the kernel has provided the TCP Fast Open feature to reduce the latency of establishing TCP connections.

Here’s how TCP Fast Open works.

TCP Fast Open

The process when the client first establishes a connection:

  1. The client sends a SYN packet containing a Fast Open option whose Cookie is empty, indicating that the client is requesting a Fast Open Cookie.
  2. A server that supports TCP Fast Open generates a Cookie and sends it back to the client in the Fast Open option of the SYN-ACK packet.
  3. On receiving the SYN-ACK, the client caches the Cookie from the Fast Open option locally.

Therefore, the normal three-way handshake is still required for the first HTTP GET request.

After that, if the client re-establishes a connection to the server:

  1. The client sends a SYN packet that carries data (unlike a non-TFO handshake, where the SYN carries no data) along with the previously cached Cookie.
  2. A server that supports TCP Fast Open verifies the received Cookie. If the Cookie is valid, the server acknowledges the SYN and “data” in a SYN-ACK packet and sends the “data” to the corresponding application program. If the Cookie is invalid, the server discards the “data” contained in the SYN packet, and subsequent SYN-ACK packets confirm only the sequence number of the SYN.
  3. If the server accepts the “data” in the SYN packet, the server can send the “data” before the handshake is complete, thus reducing the RTT cost of the handshake.
  4. The client sends an ACK acknowledging the server's SYN and data; if any data in the client's initial SYN packet was not acknowledged, the client retransmits it.
  5. The subsequent TCP connection data transfer process is consistent with the normal non-TFO situation.

As a result, subsequent HTTP GET requests can bypass the three-way handshake, saving one RTT of handshake overhead.

Note: once a client has requested and stored a Fast Open Cookie, it can use TCP Fast Open repeatedly, until the server considers the Cookie invalid (usually because it has expired).

How to enable TCP Fast Open in Linux?

In Linux, you can enable Fast Open through the tcp_fastopen kernel parameter:
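For example (3 enables it for both roles; see the value table below):

```bash
sysctl -w net.ipv4.tcp_fastopen=3
```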

The values of tcp_fastopen mean:

  • 0: disabled;
  • 1: enable Fast Open as a client;
  • 2: enable Fast Open as a server;
  • 3: enable Fast Open as both client and server.

The TCP Fast Open function takes effect only when both the client and server support it.

Summary

This section introduced several TCP parameters for optimizing the TCP three-way handshake.

Three-way handshake optimization strategy

Client optimization

When a client initiates a SYN packet, the retransmission count is controlled by tcp_syn_retries.

Server optimization

If the SYN half-connection queue overflows, subsequent connections are discarded. You can run the netstat -s command to check for overflow. If it overflows severely, the size of the SYN half-connection queue can be adjusted with the tcp_max_syn_backlog, somaxconn, and backlog parameters.

The number of times the server replies with SYN+ACK is controlled by the tcp_synack_retries parameter. If SYN attacks occur, set tcp_syncookies to 1; syncookies are then used only when the SYN queue is full, so that normal connections can still be established.

When the server receives the client's ACK, the connection moves into the accept queue, where it waits for the accept() function to take it.

You can run ss -lnt to check the accept queue length of a server process. If the accept queue overflows, the system discards ACKs by default; if tcp_abort_on_overflow is set to 1, an RST notifies the client that connection establishment failed.

If the accept queue overflows badly, its size can be increased via the backlog parameter of the listen function and the somaxconn system parameter; the accept queue length is min(backlog, somaxconn).

Bypassing the three-way handshake

The TCP Fast Open feature can bypass the three-way handshake, reducing the time of an HTTP request by one RTT. In Linux, it is enabled with the tcp_fastopen parameter; note that both server and client must support it.


02 Improving TCP four-way close performance

Next, let's look at how to optimize the performance of TCP's four-way close when shutting down connections.

Before we start, we need to understand the state transitions of the four-way close.

Both the client and the server can close the connection first. The party that closes first is called the active closer; the party that closes afterwards is called the passive closer.

Client-initiated close

As the figure shows, the four-way close involves only two types of packets, FIN and ACK:

  • FIN means finish: the party sending a FIN will transmit no more data, closing its sending channel in that direction;
  • ACK acknowledges that the peer's sending channel has been closed.

The four-way close proceeds as follows:

  • When the active closer closes the connection, it sends a FIN packet, and its TCP connection changes from ESTABLISHED to FIN_WAIT1.
  • When the passive closer receives the FIN, its kernel automatically replies with an ACK, and its connection changes from ESTABLISHED to CLOSE_WAIT, meaning it is waiting for the process to call the close function.
  • When the active closer receives this ACK, its connection changes from FIN_WAIT1 to FIN_WAIT2: the active closer's sending channel is now closed.
  • While in CLOSE_WAIT, the passive closer continues processing data. When its read function returns 0, the application calls close, which triggers the kernel to send a FIN packet, and the passive closer's connection changes to LAST_ACK.
  • When the active closer receives this FIN, its kernel replies with an ACK to the passive closer, and its connection changes from FIN_WAIT2 to TIME_WAIT. On Linux, a connection in TIME_WAIT is closed after about 1 minute.
  • When the passive closer receives that final ACK, its connection is closed.

As you can see, each direction requires one FIN and one ACK, which is why this is commonly called the four-way close.

Note that only the party that actively closes the connection enters the TIME_WAIT state.

The active and passive closers are optimized in different ways, so let's discuss each in turn.

Optimization of the active closer

A connection can be closed in two ways: with an RST packet or with a FIN packet.

If a process exits abnormally, the kernel sends an RST packet to close the connection. This is a brutal way to close, skipping the four-way close entirely.

To close a connection safely, the process calls the close or shutdown function, which sends a FIN packet (shutdown sends a FIN only when its second parameter is SHUT_WR or SHUT_RDWR).

What is the difference between calling close and calling shutdown?

Calling close means the connection is closed completely, in both directions: the socket can neither send nor receive data. If you check with netstat -p, you will see that the connection's process name is empty.

Closing a connection with close is inelegant. Hence the shutdown function, which closes the connection gracefully by controlling only one direction of it:

Its second parameter determines how the connection is shut down, with three options:

  • SHUT_RD(0): closes the “read” direction of the connection. Any received data in the receive buffer is discarded; if new data arrives, it is ACKed and then quietly dropped. The peer, which still receives ACKs, never knows its data was discarded.
  • SHUT_WR(1): closes the “write” direction of the connection, the so-called “half-closed” connection. Any unsent data in the send buffer is sent immediately, followed by a FIN packet to the peer.
  • SHUT_RDWR(2): closes both the read and write directions of the socket, equivalent to performing SHUT_RD and SHUT_WR together.

Both close and shutdown can close a connection, but they differ not only in behavior but also in the Linux parameters that govern the connections they close.

Optimization of the FIN_WAIT1 state

After the active party sends a FIN packet, the connection is in FIN_WAIT1 state. Normally, if the active party receives an ACK from the passive party, the connection changes to FIN_WAIT2 state immediately.

However, if the ACK never arrives, the connection stays in FIN_WAIT1 and the kernel periodically retransmits the FIN packet. The number of retransmissions is controlled by the tcp_orphan_retries parameter (despite the name, it applies to all connections in FIN_WAIT1, not just orphan connections). Its default value is 0:
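For example:

```bash
cat /proc/sys/net/ipv4/tcp_orphan_retries   # default: 0
```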

You might wonder how many retries 0 represents. In fact, 0 does not mean zero: the kernel source treats a value of 0 as 8 retries.

If many connections sit in the FIN_WAIT1 state, consider lowering the tcp_orphan_retries value. Once the number of retransmissions exceeds tcp_orphan_retries, the connection is closed directly.

Under normal conditions, lowering tcp_orphan_retries is enough. Under a malicious attack, however, the FIN packet may never be sent at all, due to two TCP features:

  • First, TCP packets, including FIN packets, must be sent in order. If data remains in the send buffer ahead of the FIN, the FIN cannot be sent early;
  • Second, TCP implements flow control: when the peer's receive window is 0, the sender must stop sending. An attacker can therefore start downloading a large file and advertise a receive window of 0, preventing the FIN from ever being sent and pinning the connection in the FIN_WAIT1 state.

This scenario is handled by adjusting the tcp_max_orphans parameter, which defines the maximum number of “orphan connections”:
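A hedged example (the default is sized from system memory at boot; 16384 is illustrative):

```bash
cat /proc/sys/net/ipv4/tcp_max_orphans
sysctl -w net.ipv4.tcp_max_orphans=16384
```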

When a process calls close, the connection can no longer send or receive data, so it becomes an orphan connection. To prevent orphan connections from occupying system resources for long periods, Linux provides tcp_max_orphans: beyond this number, new orphan connections skip the four-way close and are forcibly closed with an RST reset packet.

Optimization of the FIN_WAIT2 state

After receiving the ACK, the active closer is in the FIN_WAIT2 state: its own sending channel is closed, and it is waiting for the peer's FIN packet, which will close the peer's sending channel.

If the connection was closed with shutdown, it may stay in FIN_WAIT2 indefinitely, since it may still legitimately send or receive data. For orphan connections closed with close, however, the tcp_fin_timeout parameter controls how long the connection may remain in this state; the default is 60 seconds:
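For example:

```bash
cat /proc/sys/net/ipv4/tcp_fin_timeout   # default: 60 (seconds)
```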

This means that if an orphan connection (one closed by calling close) receives no FIN packet within 60 seconds, the connection is closed directly.

This 60 seconds is not arbitrary: it matches the duration of the TIME_WAIT state, which we will explain shortly.

Optimization of the TIME_WAIT state

TIME_WAIT is the final state of the active closer in the four-way close, and the one you will run into most often.

After receiving the passive closer's FIN packet, the active closer replies with an ACK, acknowledging that the peer's sending channel is closed, and enters the TIME_WAIT state. On Linux, TIME_WAIT lasts 60 seconds before the connection is fully removed.

From the active closer's point of view, a connection in TIME_WAIT is effectively closed. The passive closer, however, remains in LAST_ACK until it receives the ACK; if the ACK never arrives, it retransmits the FIN packet, with the retry count again controlled by the tcp_orphan_retries parameter described earlier.

The TIME_WAIT state matters for two reasons:

  • it prevents “old” packets from a previous connection with the same four-tuple from being accepted by a new one;
  • it ensures the passive closer can close correctly, i.e. that the final ACK is received by the passive closer, helping it close properly.

Reason one: preventing packets from an old connection

TIME_WAIT exists to prevent historical data from corrupting a new connection.

Suppose TIME_WAIT did not exist, or were too short. What happens when a delayed packet finally arrives?

Anomaly: stale data received by a new connection

  • As the yellow box in the figure above shows, a packet with SEQ = 301 sent by the server before the connection closed is delayed in the network;
  • a new TCP connection then reuses the same ports, and the delayed SEQ = 301 packet arrives at the client, which may accept the expired packet as valid, causing serious problems such as data corruption.

Therefore, TCP designed this mechanism: waiting 2MSL is long enough for packets in both directions to be discarded, so packets from the original connection vanish from the network, and any packet that appears afterwards must belong to the new connection.

Reason two: ensuring the connection is closed correctly

The other purpose of TIME_WAIT is to wait long enough to ensure that the final ACK is received by the passive closer, helping it close properly.

Again, suppose TIME_WAIT did not exist, or were too short. What goes wrong at disconnection time?

Anomaly: disconnection is not guaranteed to complete

  • As the red box in the figure above shows, if the client's final ACK of the four-way close is lost in the network, and the client's TIME_WAIT is too short or absent, the client goes straight to CLOSE while the server stays stuck in the LAST_ACK state;
  • when the client later sends a SYN to establish a new connection, the server, still in LAST_ACK, replies with an RST packet, and connection establishment is terminated.

Now let's return to why TIME_WAIT is held for 60 seconds. This is the same reasoning as the 60-second default for orphan connections in FIN_WAIT2: both states need to last 2MSL. MSL, the Maximum Segment Lifetime, is the longest a segment may live in the network; it works together with the TTL field in the IP header, which is decremented by 1 each time a router forwards the packet.

Why 2MSL? This allows for at least one packet loss: for example, if the ACK is lost within the first MSL, the passive closer's retransmitted FIN arrives within the second MSL, and a connection in TIME_WAIT can still handle it.

Why not 4 or 8 MSL? Consider a poor network with a packet loss rate of 1 in 100: the probability of losing packets twice in a row is 1 in 10,000. That probability is small enough that ignoring it is more cost-effective than handling it.

Thus both TIME_WAIT and FIN_WAIT2 last at most 2MSL, and since MSL is fixed at 30 seconds on Linux systems, both durations are 60 seconds.

While the TIME_WAIT state is necessary, it consumes system resources. Too many TIME_WAIT connections affect clients and servers differently:

  • the client is limited by port resources: with only 65,536 ports available, an excess of TIME_WAIT connections can exhaust them, making it impossible to create new connections;
  • the server is limited by system resources: since each four-tuple identifies one TCP connection, a server listening on a single port can in theory keep accepting connections and handing them to worker threads. But the thread pool cannot process an unbounded number of connections, so a flood of TIME_WAIT states on the server exhausts system resources and leaves new connections unhandled.

To cap this, Linux provides the tcp_max_tw_buckets parameter: once the number of TIME_WAIT connections exceeds it, newly closed connections skip TIME_WAIT and are closed directly:
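A hedged example (the default is sized from system memory at boot; 18000 is illustrative):

```bash
cat /proc/sys/net/ipv4/tcp_max_tw_buckets
sysctl -w net.ipv4.tcp_max_tw_buckets=18000
```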

When server concurrency rises, the number of TIME_WAIT connections rises with it; in that case, consider increasing tcp_max_tw_buckets to reduce the chance of data from an old connection corrupting a new one.

That said, bigger is not always better for tcp_max_tw_buckets: memory and ports are, after all, limited.

One way to reuse connections in TIME_WAIT when creating new ones is to enable the tcp_tw_reuse parameter. Note that it only takes effect on the client (the connection initiator), because it is applied when connect() is called; it does nothing for the server (the passive side).

tcp_tw_reuse is safe and controllable from the protocol's perspective, allowing a port still in TIME_WAIT to be reused for a new connection.

What does “safe and controllable from the protocol's perspective” mean? Two things:

  • it applies only to the connection initiator, i.e. the client in the client/server model;
  • a connection in TIME_WAIT can be reused only after it has been in that state for more than 1 second.

To use this option, TCP timestamp support must also be enabled (it is on by default):
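For example:

```bash
sysctl -w net.ipv4.tcp_tw_reuse=1      # takes effect only on connect(), i.e. the client side
sysctl -w net.ipv4.tcp_timestamps=1    # required by tcp_tw_reuse; on by default
```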

The introduction of timestamps brings further benefits:

  • the 2MSL problem discussed earlier goes away, because duplicate packets are naturally discarded once their timestamps expire;
  • it also protects against sequence number wraparound, for the same reason: stale duplicates are discarded once their timestamps expire.

Older Linux versions also offered the tcp_tw_recycle parameter, but enabling it had two pitfalls:

  • it accelerated the recycling of TIME_WAIT connections on both client and server, shortening TIME_WAIT below 60 seconds and making stale-data corruption much more likely;
  • in addition, Linux would drop any packet from a remote host whose timestamp was lower than the last timestamp recorded for that host, which requires a host's timestamps to be monotonically increasing. But these timestamps are relative, not absolute, and monotonicity cannot be guaranteed behind NAT, LVS, and similar setups.

For these reasons, setting it to 1 is not recommended; it should be turned off:
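For example (on kernels that still have the parameter):

```bash
sysctl -w net.ipv4.tcp_tw_recycle=0
```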

As of Linux 4.12, the kernel has removed this parameter entirely.

In addition, a program can change the behavior of close by setting the SO_LINGER socket option.

If l_onoff is nonzero and l_linger is 0, calling close immediately sends an RST packet to the peer, and the TCP connection skips the four-way close and the TIME_WAIT state entirely, closing directly.

This offers a way to skip TIME_WAIT, but it is dangerous behavior and not worth advocating.

Optimization of the passive closer

When the passive closer receives a FIN packet, its kernel automatically replies with an ACK and the connection enters the CLOSE_WAIT state which, as the name implies, means it is waiting for the application process to call the close function.

The kernel has no right to close the connection on the process's behalf: the active closer may have used shutdown to close only one direction and may still expect to send or receive data over the half-closed connection. For this reason, Linux places no time limit on the CLOSE_WAIT state.

Of course, most applications do not close connections with shutdown. So if netstat shows many connections in the CLOSE_WAIT state, you should inspect your application: a common bug is failing to call close after read returns 0.

When the process does call close, the kernel sends a FIN packet to close the sending channel, and the connection enters the LAST_ACK state, waiting for the active closer's ACK to confirm that the connection is closed.

If the ACK does not arrive in time, the kernel retransmits the FIN packet; the retransmission count is controlled by the tcp_orphan_retries parameter, the same policy used for the active closer's FIN retransmissions.

Note also that if the passive closer calls close promptly, its ACK and FIN may be sent in a single packet, making the four-way close look like a three-way close. This is a normal special case; don't worry about it.

What happens if both parties close the connection at the same time?

Because TCP is a full-duplex protocol, both parties may close the connection at the same moment, i.e. send FIN packets simultaneously.

In this case, the optimization strategies above still apply: the tcp_orphan_retries parameter still controls the number of FIN retransmissions for both parties.

Simultaneous close

Then, while each party is waiting for an ACK, it receives the other's FIN instead. This new situation puts the connection into a state called CLOSING, which takes the place of FIN_WAIT2. Each kernel then replies with an ACK to confirm the closing of the peer's sending channel and enters the TIME_WAIT state; after 2MSL, the connections close automatically.

Summary

To optimize TCP's four-way close, we adjust the kernel's TCP parameters according to the state transitions of the active and passive closers.

Four-way close optimization strategies

Optimization of the active closer

After the active closer sends a FIN packet, if it receives no ACK it retransmits the FIN; the number of retransmissions is determined by the tcp_orphan_retries parameter.

Once the active closer receives the ACK, the connection enters the FIN_WAIT2 state, and the optimization depends on how the connection was closed:

  • a connection closed with the close function is an orphan connection: if it receives no FIN within tcp_fin_timeout seconds, it is closed directly. Meanwhile, tcp_max_orphans caps the number of orphan connections; beyond it, connections are released immediately, protecting against orphans hogging resources;
  • a connection closed with the shutdown function is not subject to these limits.

After the active closer receives the FIN and replies with an ACK, it enters the TIME_WAIT state. To keep TIME_WAIT from consuming too many resources, tcp_max_tw_buckets defines the maximum number of such connections; beyond it, connections are released directly.

If TIME_WAIT connections pile up, you can set tcp_tw_reuse and tcp_timestamps to 1 so that ports in TIME_WAIT are reused for new connections; note that tcp_tw_reuse applies only to clients.

Optimization of the passive closer

A passively closed connection simply sits in CLOSE_WAIT after its ACK, waiting for the process to call close. Therefore, a large number of connections in the CLOSE_WAIT state means the problem lies in the application.

After the passive closer sends its FIN packet, it enters the LAST_ACK state; until the ACK arrives, FIN retransmissions are again governed by the tcp_orphan_retries parameter.


03 Improving TCP data transmission performance

The previous sections covered the three-way handshake and the four-way close; this section looks at optimizing TCP data transmission.

TCP connections are maintained by the kernel, which creates a memory buffer for each connection:

  • if a connection's memory is configured too small, the network bandwidth cannot be fully used and TCP transmission efficiency drops;
  • if it is configured too large, server memory is easily exhausted and new connections cannot be established.

Therefore, we must understand how Linux uses TCP memory in order to configure buffer sizes properly.

How does the sliding window affect transmission speed?

TCP guarantees that every packet reaches the peer by means of acknowledgement: after sending a packet, the sender must receive an ACK for it; if no ACK arrives, TCP retransmits the packet until the ACK is received.

Consequently, TCP packets are not removed from memory as soon as they are sent: they must be kept in case retransmission is needed.

Since TCP is maintained by the kernel, these packets live in kernel buffers. With many connections, you can watch the buff/cache figure grow using the free command.

If TCP sent one packet and then waited for its acknowledgement before sending the next, it would work a bit like a face-to-face conversation, one sentence at a time. The drawback of this mode is low efficiency.

Per-packet acknowledgement

This way of transmitting has an obvious weakness: the longer the packets' round-trip time, the lower the communication efficiency.

The natural solution is to send packets in batches and acknowledge them in batches.

Parallel processing

However, this raises another question: can the sender transmit at will? Of course not; we must account for the receiver's processing capacity.

If the receiver's hardware is weaker than the sender's, or its system is busy or short on resources, it cannot process all the packets in time and has to drop them, lowering network efficiency.

To solve this, TCP provides a mechanism that lets the sender limit the amount of data it sends according to the receiver's actual receiving capacity. This is the origin of the sliding window.

Based on its buffer space, the receiver calculates how many more bytes it can accept; this number is called the receive window. As the kernel buffers incoming packets, the remaining buffer space shrinks and the receive window shrinks with it. When the process calls read, data moves to user space and kernel buffer space is freed, so the host can receive more packets and the receive window grows again.

So the receive window is not constant: the receiver advertises its current capacity in the Window field of the TCP header.

Is the sender's window equal to the receiver's window? Ignoring congestion control, the sender's window is approximately equal to the receiver's window: window advertisements take time to cross the network, so the two are only approximately in sync.

The TCP header

As the figure shows, the Window field is only 2 bytes, so it can express a window of at most 65,535 bytes, i.e. 64KB.

That maximum is clearly inadequate on today's high-speed networks. Hence the window scaling option: a scaling factor defined in the TCP options field expands the advertised window from 16 effective bits to 30, allowing the window to reach up to 1GB.

To enable this in Linux, set the tcp_window_scaling parameter to 1 (it is on by default):
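For example:

```bash
cat /proc/sys/net/ipv4/tcp_window_scaling   # 1 = enabled (the default)
```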

To use window scaling, both communicating parties must send the option in their SYN packets:

  • the party actively establishing the connection sends the option in its SYN packet;
  • the passive party may send the option only if the SYN it received contained the window scaling option.

So, as long as the process calls read promptly and the receive buffer is configured large enough, the receive window can keep growing. Does that mean the sender can speed up without limit?

Of course not: the network's carrying capacity is finite. Whenever the sender's window lets it transmit more than the network can handle, routers simply drop the excess packets. So a bigger buffer is not always better.

How is the maximum transmission speed determined?

We saw above that TCP's transmission speed is bounded by the send and receive windows, and by the network's transmission capacity. The windows are bounded in turn by the kernel buffer sizes; if the buffers match the network's capacity, buffer utilization is maximized.

The question is: how do we calculate the network's transmission capacity?

Networks, as everyone knows, are constrained by “bandwidth,” which describes the network's carrying capacity; its unit differs from that of the kernel buffers:

  • bandwidth is the amount of traffic per unit of time, expressed as a rate, e.g. the common 100 MB/s;
  • buffers are measured in bytes; multiplying network speed by time yields bytes.

Here we need the concept of the bandwidth-delay product, which determines how many bytes can be in flight in the network. It is calculated as: bandwidth-delay product = bandwidth × RTT.

For example, with a maximum bandwidth of 100MB/s and a network round-trip time (RTT) of 10ms, the bytes that can be in transit between client and server amount to 100MB/s × 0.01s = 1MB.

That 1MB is the product of bandwidth and delay, hence the name Bandwidth-Delay Product (BDP). It represents how much TCP data can be “in flight” across network devices such as links and routers; if the in-flight data exceeds 1MB, the network is overloaded and prone to packet loss.

The size of the send buffer caps the send window, which in turn caps the amount of “sent but unacknowledged” in-flight data. Therefore, the send buffer should not exceed the bandwidth-delay product.

The relationship between the send buffer and the bandwidth-delay product:

  • if the send buffer exceeds the bandwidth-delay product, the excess data cannot be carried by the network, overloading it and inviting packet loss;
  • if the send buffer is smaller than the bandwidth-delay product, the network's transmission capacity cannot be fully exploited.

Therefore, the send buffer size should approximate the bandwidth-delay product.

How do I resize the buffer?

In Linux, both the send and receive buffer sizes are set via kernel parameters; once set, Linux adjusts the buffers dynamically within those bounds.

Adjust the send buffer range

Let's start with the send buffer, whose range is configured by the tcp_wmem parameter:
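Typical values (they vary by system; min, default, and max in bytes):

```bash
cat /proc/sys/net/ipv4/tcp_wmem
# 4096    16384   4194304
```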

All three values are in bytes, and they represent:

  • the first value is the minimum of the dynamic range, 4096 bytes = 4K;
  • the second value is the initial default, typically 16384 bytes = 16K;
  • the third value is the maximum of the dynamic range, 4194304 bytes = 4096K (4M).

The send buffer is self-tuning: once the sent data has been acknowledged and no new data is pending, the send buffer's memory is released.

Adjust the receive buffer range

The receive buffer's range is configured similarly, with the tcp_rmem parameter:
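Typical values (they vary by system; min, default, and max in bytes):

```bash
cat /proc/sys/net/ipv4/tcp_rmem
# 4096    87380   6291456
```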

The above three numeric units are all bytes and represent:

  • The first value is the minimum dynamic range, representing the minimum receive buffer size that can be guaranteed even under memory pressure, 4096 byte = 4K;
  • The second value is the initial default, 87380 byte ≈ 86K;
  • The third value is the maximum dynamic range, 6291456 byte = 6144K (6M);

The receive buffer can adjust the receive window according to the size of the system free memory:

  • when the system has ample free memory, the receive buffer can grow automatically, enlarging the window advertised to the peer and thus increasing the amount of data the sender transmits;
  • conversely, when system memory is tight, the buffer shrinks; transmission efficiency drops, but more concurrent connections can keep working.

The send buffer's autotuning is always on; for the receive buffer, autotuning requires setting tcp_moderate_rcvbuf to 1.
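For example:

```bash
sysctl -w net.ipv4.tcp_moderate_rcvbuf=1   # on by default on most systems
```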

Adjust the TCP memory range

When the receive buffer is being tuned, how does the kernel know whether memory is tight or ample? It judges by the tcp_mem configuration:
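Sample values (computed from RAM at boot, so they differ per machine; the maximum below matches the calculation that follows):

```bash
cat /proc/sys/net/ipv4/tcp_mem
# 88560   118080  177120
```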

These three values are not in bytes but in pages (1 page = 4KB); they represent:

  • when TCP memory usage is below the first value, no adjustment is needed;
  • between the first and second values, the kernel begins to adjust the size of the receive buffer;
  • above the third value, the kernel stops allocating new memory to TCP, and new connections cannot be established.

These values are normally computed at boot from the amount of system memory. With the maximum above, 177120 pages, once TCP uses (177120 × 4) / 1024 ≈ 692MB of memory, the system can no longer allocate memory for new TCP connections, and new connections are rejected.

Tuning policies for actual scenarios

On high-concurrency servers, to balance transfer speed against large numbers of concurrent connections, the maximum of the buffer's dynamic range should reach the bandwidth-delay product, while the minimum stays at the 4K default. For memory-constrained services, lowering the default (middle) value is an effective way to raise concurrency.

Also, for a network-I/O-intensive server, raising the tcp_mem limits lets TCP connections use more system memory, which helps concurrency. Note that tcp_wmem and tcp_rmem are in bytes while tcp_mem is in pages. Finally, do not set SO_SNDBUF or SO_RCVBUF directly on sockets: doing so disables dynamic buffer adjustment.

Summary

This section described how to optimize TCP data transmission.

Optimization strategy for data transmission

TCP achieves reliability through ACK acknowledgement packets and relies on the sliding window to raise sending speed while respecting the receiver's processing capacity.

However, the default maximum sliding window is only 64KB, which falls short of today's high-speed networks. To raise the sending speed, the maximum window must be enlarged: on Linux this is done by setting tcp_window_scaling to 1 (the default), which raises the maximum window to 1GB.

The sliding window defines the maximum number of bytes in flight in the network. When it exceeds the bandwidth-delay product, the network overloads and loses packets; when it falls short of the bandwidth-delay product, the bandwidth is underused. Hence, window sizing must refer to the bandwidth-delay product.

The kernel buffers cap the sliding window; they split into the send buffer tcp_wmem and the receive buffer tcp_rmem.

Linux adjusts buffers dynamically, so the upper bound of each buffer should be set to the bandwidth-delay product. Send-buffer autotuning is always on; receive-buffer autotuning requires tcp_moderate_rcvbuf set to 1, and its adjustment is driven by the TCP memory thresholds in tcp_mem.

If a program sets SO_SNDBUF or SO_RCVBUF on a socket, the kernel's dynamic buffer adjustment is switched off for it. It is therefore better not to set these in the program and to let the kernel adjust automatically.

With these parameters configured well, connection speed is maximized when resources permit, and concurrency is maximized when resources are tight.