WeChat public account: Operation and Development Story. Author: Teacher Xia

Introduction to TCP

TCP is a connection-oriented unicast protocol. Before sending data, the communicating parties must establish a connection with each other. The so-called “connection” is really just some information about the other side stored in the memory of the client and the server, such as the IP address and port number. TCP can be thought of as a byte-stream service that overcomes the packet loss, duplication, and errors that occur at the IP layer and below. During connection establishment, the two parties need to exchange some connection parameters, which are carried in the TCP header. In short, TCP provides a reliable, connection-oriented, byte-stream, transport-layer service that uses a three-way handshake to establish a connection and a four-way wave to close it.

TCP Three-way handshake

The client and the server need to establish a connection before communicating. The purpose of the “three-way handshake” is to let both parties confirm that their own and each other’s sending and receiving capabilities are normal.

  • First handshake: The client sends a network packet and the server receives it. The server can then conclude that the sending capability of the client and the receiving capability of the server are normal.
  • Second handshake: the server sends a packet and the client receives it. The client can then conclude that both the server’s and its own sending and receiving capabilities are normal. From the client’s perspective: I received the response packet sent by the server, which means the server received the packet I sent in the first handshake and successfully sent a response, so the server’s receiving and sending capabilities are normal. At the same time, receiving the server’s response shows that my first packet reached the server, so my own sending and receiving capabilities are normal as well.
  • Third handshake: the client sends a packet and the server receives it. The server can then conclude that the client’s receiving capability and its own sending capability are normal. After the first two handshakes, the server still does not know whether the client can receive or whether its own sending works. In the third handshake, the server receives the client’s response to the second handshake. From the server’s perspective: the response I sent in the second handshake was received by the client, so my sending capability and the client’s receiving capability are normal.

After the preceding three handshakes, both the client and the server have confirmed that their own and each other’s receiving and sending capabilities are normal, and they can then communicate normally.
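For completeness, here is a minimal sketch (an illustrative example, not from the original article) of what the handshake looks like from an application’s point of view: the client’s connect() call returns only after the kernel has finished the three-way handshake. The address 172.16.0.20:80 is the server used in the experiments later in this article.

/* Minimal TCP client sketch: connect() blocks until the kernel has
 * finished the three-way handshake (SYN -> SYN+ACK -> ACK). */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    struct sockaddr_in srv;
    memset(&srv, 0, sizeof(srv));
    srv.sin_family = AF_INET;
    srv.sin_port = htons(80);                          /* server port */
    inet_pton(AF_INET, "172.16.0.20", &srv.sin_addr);  /* server address */

    /* The kernel sends the SYN here; connect() returns 0 only after the
     * SYN+ACK has been received and the final ACK has been sent. */
    if (connect(fd, (struct sockaddr *)&srv, sizeof(srv)) < 0) {
        perror("connect");
        close(fd);
        return 1;
    }

    printf("handshake complete, connection established\n");
    close(fd);
    return 0;
}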

Status of the TCP three-way handshake

During the TCP three-way handshake, the Linux kernel maintains two queues:

  • Half-connection queue, also known as SYN queue;
  • Full connection queue, also known as the accept queue;

When the server receives a SYN from a client, the kernel stores the connection in the half-connection queue and responds with a SYN+ACK. The client then returns an ACK. After the server receives this third-handshake ACK, the kernel removes the connection from the half-connection queue, creates a full connection, adds it to the accept queue, and waits for the process to call accept() to take the connection out.

Both the half-connection queue and the full connection queue have a maximum length. When the limit is exceeded, the kernel either discards the connection or returns an RST packet.
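As an illustration (not from the original article), the following sketch shows where the two queues come into play on the server side: the backlog passed to listen() caps the accept queue, the kernel completes the handshakes on its own, and accept() simply pulls already-established connections out of the accept queue.

/* Minimal TCP server sketch showing the role of the two queues. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(80);

    int on = 1;
    setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return 1;
    }

    /* The backlog argument (511 here, like Nginx) bounds the accept queue;
     * the kernel clamps it to net.core.somaxconn. Incoming SYNs first sit
     * in the half-connection queue until the handshake finishes. */
    if (listen(lfd, 511) < 0) {
        perror("listen");
        return 1;
    }

    for (;;) {
        /* accept() takes no part in the handshake; it only removes a fully
         * established connection from the accept queue. */
        int cfd = accept(lfd, NULL, NULL);
        if (cfd < 0) {
            perror("accept");
            continue;
        }
        close(cfd);   /* a real server would read the request here */
    }
}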

Half-connection queue

Unfortunately, there is no command that directly shows the length of the half-connection queue. However, we can use a characteristic of TCP half-connections: on the server, TCP connections in the SYN_RECV state are exactly the entries in the half-connection queue. Run the following command to calculate the current length of the TCP half-connection queue:

$ netstat |grep SYN_RECV |wc -l
1723

In the SYN_RECV state, the server maintains the SYN half-connection queue to hold the information of incomplete handshakes. When this queue overflows, the server can no longer establish new connections.

How do I simulate a TCP half-connection queue overflow scenario?

It is not difficult to simulate a TCP half-connection queue overflow. Simply send TCP SYN packets to the server without ever sending the third-handshake ACK, and the server will accumulate a large number of TCP connections in the SYN_RECV state. This is known as SYN flooding, a SYN attack, or a form of DDoS attack.

Experimental environment:

  • The client and server are CentOS Linux Release 7.9.2009 (Core), Linux kernel version 3.10.0-1160.15.2.el7.x86_64
  • The server IP address is 172.16.0.20, and the client IP address is 172.16.0.157
  • The server runs an Nginx service on port 80

In this experiment, hping3 is used to simulate a SYN attack:

$ hping3 -S -p 80 --flood 172.16.0.20
HPING 172.16.0.20 (eth0 172.16.0.20): S set, 40 headers + 0 data bytes
hping in flood mode, no replies will be shown

There are many reasons why a new connection may fail to be established. How do you find the number of failures caused by a full queue? They can be obtained from the statistics of the netstat -s command:

$ netstat -s|grep "SYNs to LISTEN"
    1541918 SYNs to LISTEN sockets dropped

The value given here is the number of SYNs discarded because of queue overflow. Note that it is a cumulative value; if it keeps increasing, the SYN half-connection queue should be enlarged. To change the queue size, set the tcp_max_syn_backlog parameter on Linux:

sysctl -w net.ipv4.tcp_max_syn_backlog=1024

Can connections only be discarded when the SYN half-connection queue is full? No. Enabling syncookies makes it possible to establish a connection successfully without using the SYN half-connection queue at all. Syncookies work like this: the server computes a value from the current connection information and sends it in its SYN+ACK packet; when the client returns its ACK, the server takes the value out and validates it. If the value is valid, the connection is considered established.

How do you enable syncookies in Linux? The net.ipv4.tcp_syncookies parameter takes three values: 0 means syncookies are disabled; 2 means they are enabled unconditionally; 1 means they are enabled only when the SYN half-connection queue can no longer hold new entries. Syncookies are intended only as a response to SYN flood attacks (an attacker constructs a large number of SYN packets and sends them to the server, so the SYN half-connection queue overflows and normal clients cannot establish connections), and many TCP features are unavailable on connections established this way. Therefore, set tcp_syncookies to 1 so that they are enabled only when the queue is full.

sysctl -w net.ipv4.tcp_syncookies=1
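To make the idea concrete, here is a heavily simplified sketch of the SYN cookie concept (illustrative only; the real Linux implementation additionally encodes a timestamp and the MSS and uses a cryptographic hash). The mix() function and the secret below are stand-ins invented for this example.

/* Conceptual SYN cookie sketch: the server derives its initial sequence
 * number from the connection 4-tuple, the client's ISN and a secret,
 * instead of storing state in the half-connection queue. */
#include <stdint.h>
#include <stdio.h>

static const uint32_t secret = 0x5eedc0de;   /* would be random in practice */

/* stand-in for a real cryptographic hash such as SipHash */
static uint32_t mix(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    uint32_t h = a ^ (b * 2654435761u) ^ (c * 40503u) ^ (d * 2246822519u);
    h ^= h >> 15;
    h *= 2654435761u;
    h ^= h >> 13;
    return h;
}

/* Cookie used as the ISN of the server's SYN+ACK. */
static uint32_t syn_cookie(uint32_t caddr, uint16_t cport,
                           uint32_t saddr, uint16_t sport,
                           uint32_t client_isn)
{
    return mix(caddr, saddr, ((uint32_t)cport << 16) | sport, secret)
           + client_isn;
}

/* On the final ACK, recompute the cookie and compare it with ack_seq - 1
 * (the client acknowledges our ISN + 1). A match means the earlier SYN+ACK
 * really came from us, so the full connection can be created directly. */
static int syn_cookie_valid(uint32_t caddr, uint16_t cport,
                            uint32_t saddr, uint16_t sport,
                            uint32_t client_isn, uint32_t ack_seq)
{
    return syn_cookie(caddr, cport, saddr, sport, client_isn) == ack_seq - 1;
}

int main(void)
{
    /* pretend values: client 172.16.0.157:40000 -> server 172.16.0.20:80 */
    uint32_t caddr = 0xAC10009D, saddr = 0xAC100014, cisn = 12345;
    uint32_t cookie = syn_cookie(caddr, 40000, saddr, 80, cisn);
    printf("cookie valid: %d\n",
           syn_cookie_valid(caddr, 40000, saddr, 80, cisn, cookie + 1));
    return 0;
}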

Full connection queue

You can run the ss command on the server to check the status of the TCP full connection queue. Note that the meanings of Recv-Q/Send-Q reported by ss differ between the LISTEN state and the non-LISTEN states, as can be seen from the following kernel code:

static void tcp_diag_get_info(struct sock *sk, struct inet_diag_msg *r,
			      void *_info)
{
	const struct tcp_sock *tp = tcp_sk(sk);
	struct tcp_info *info = _info;

	if (sk->sk_state == TCP_LISTEN) {
		r->idiag_rqueue = sk->sk_ack_backlog;
		r->idiag_wqueue = sk->sk_max_ack_backlog;
	} else {
		r->idiag_rqueue = tp->rcv_nxt - tp->copied_seq;
		r->idiag_wqueue = tp->write_seq - tp->snd_una;
	}
	if (info != NULL)
		tcp_get_info(sk, info);
}

In the LISTEN state, the meanings of Recv-Q/Send-Q are as follows:

$ ss -ltnp
State      Recv-Q Send-Q Local Address:Port         Peer Address:Port
LISTEN     0      1024         *:8081                     *:*                   users:(("java",pid=5686,fd=310))
  • Recv-q: Indicates the size of the current full connection queue, that is, the current three-way handshake is complete and waiting for the serveraccept()TCP connection;
  • Send -q: indicates the maximum length of the current full connection queue. The preceding output indicates that the TCP service on port 8088 is monitored and the maximum length of the full connection is 1024.

In the non-LISTEN states, the meanings of Recv-Q/Send-Q are as follows:

$ ss -tnp
ESTAB  0  0  172.16.0.20:57672  172.16.0.20:2181  users:(("java",pid=5686,fd=292))
  • Recv-Q: the number of bytes that have been received but not yet read by the application process.
  • Send-Q: the number of bytes that have been sent but not yet acknowledged by the peer (see the sketch after this list).

How do I simulate a TCP full connection queue overflow scenario?

Experimental environment:

  • The client and server are CentOS Linux Release 7.9.2009 (Core), Linux kernel version 3.10.0-1160.15.2.el7.x86_64
  • The server IP address is 172.16.0.20, and the client IP address is 172.16.0.157
  • The server runs an Nginx service on port 80

ab is short for Apache Bench, a stress-testing tool shipped with Apache. It is very practical: it can stress-test not only an Apache website but other types of servers as well, such as Nginx, Tomcat, and IIS. The principle of ab is that it creates many concurrent access threads to simulate multiple visitors hitting one URL at the same time. Because its test target is URL-based, it can be used to test the load on Apache as well as on other web servers such as Nginx, lighttpd, Tomcat, and IIS. The ab command places very little load on the machine issuing the requests; it does not use much CPU or memory. However, it can create an enormous load on the target server, similar to a CC attack, so be careful when testing your own services: too much load at once may exhaust the target server’s resources or even crash it.

The maximum length of the TCP full connection queue is the minimum of somaxconn and the backlog, that is, min(somaxconn, backlog), as can be seen from the following Linux kernel code:

int __sys_listen(int fd, int backlog)
{
	struct socket *sock;
	int err, fput_needed;
	int somaxconn;

	sock = sockfd_lookup_light(fd, &err, &fput_needed);
	if (sock) {
		somaxconn = sock_net(sock->sk)->core.sysctl_somaxconn;
		if ((unsigned int)backlog > somaxconn)
			backlog = somaxconn;

		err = security_socket_listen(sock, backlog);
		if (!err)
			err = sock->ops->listen(sock, backlog);

		fput_light(sock->file, fput_needed);
	}
	return err;
}
  • somaxconn is a Linux kernel parameter with a default value of 128; it can be set through /proc/sys/net/core/somaxconn. Here we set it to 40,000.
  • backlog is the backlog argument of listen(int sockfd, int backlog). Nginx defaults to 511, and the value can be changed in its configuration file.

Therefore, the maximum length of the TCP full connection queue in this test environment is min(40000, 511), which is 511. You can confirm this with the ss command:

ss -tulnp|grep 80
tcp    LISTEN     0      511       *:80                    *:*                   users:(("nginx",pid=22913,fd=6),("nginx",pid=22912,fd=6),("nginx",pid=22911,fd=6))        
tcp    LISTEN     0      511    [::]:80                 [::]:*                   users:(("nginx",pid=22913,fd=7),("nginx",pid=22912,fd=7),("nginx",pid=22911,fd=7))

The client executes the ab command to stress-test the server, sending 100,000 requests with a concurrency of 10,000:

# -c: number of concurrent requests (10000); -n: total number of requests (100000)
$ ab -c 10000 -n 100000 http://172.16.0.20:80/
This is ApacheBench, Version 2.3 <$Revision: 1430300 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 172.16.0.20 (be patient)
Completed 10000 requests
Completed 20000 requests
Completed 30000 requests
Completed 40000 requests
Completed 50000 requests
Completed 60000 requests
Completed 70000 requests
Completed 80000 requests
Completed 90000 requests
Completed 100000 requests
Finished 100000 requests

Server Software:        nginx/1.20.1
Server Hostname:        172.16.0.20
Server Port:            80

Document Path:          /
Document Length:        4833 bytes

Concurrency Level:      10000
Time taken for tests:   2.698 seconds
Complete requests:      100000
Failed requests:        167336
   (Connect: 0, Receive: 0, Length: 84384, Exceptions: 82952)
Write errors:           0
Total transferred:      863996 bytes
HTML transferred:       82392984 bytes
Requests per second:    37069.19 [#/sec] (mean)
Time per request:       269.766 [ms] (mean)
Time per request:       0.027 [ms] (mean, across all concurrent requests)
Transfer rate:          31276.86 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0  129  151.5    106    1144
Processing:    39  121   37.7    114     239
Waiting:        0   23   51.8      0     159
Total:        142  250  152.4    224    1346

Percentage of the requests served within a certain time (ms)
  50%    224
  66%    227
  75%    232
  80%    236
  90%    283
  95%    299
  98%   1216
  99%   1228
 100%   1346 (longest request)

The ss command is executed twice. The output shows that the current TCP full connection queue size reached 512, exceeding its maximum length of 511.

 ss -tulnp|grep 80
tcp    LISTEN     411    511       *:80                    *:*                   users:(("nginx",pid=22913,fd=6),("nginx",pid=22912,fd=6),("nginx",pid=22911,fd=6))
ss -tulnp|grep 80
tcp    LISTEN     512    511       *:80                    *:*                   users:(("nginx",pid=22913,fd=6),("nginx",pid=22912,fd=6),("nginx",pid=22911,fd=6))

When the TCP full connection queue overflows, the server discards subsequent incoming TCP connections. The number of dropped connections is counted and can be checked with the netstat -s command:

 netstat -s|grep overflowed
    1233972 times the listen queue of a socket overflowed

The 1233972 seen above is the number of times the full connection queue has overflowed. Note that it is a cumulative value; you can check it every few seconds, and if the number keeps increasing, the full connection queue is overflowing.

netstat -s|grep overflowed
    1292022 times the listen queue of a socket overflowed

From the simulation above, we can see that when a server handles a large number of concurrent requests, a TCP full connection queue that is too small easily overflows. When the full connection queue overflows, subsequent requests are discarded, so the number of requests the server actually serves stops growing.

Linux has a parameter that specifies which policy is used to respond to clients when the TCP full connection queue is full.

In fact, discarding connections is the default behavior of Linux, and we can also choose to send an RST reset message to the client telling it that the connection has failed.

$ cat /proc/sys/net/ipv4/tcp_abort_on_overflow
0

tcp_abort_on_overflow has two possible values, 0 and 1:

  • 0: if the full connection queue is full, the server discards the ACK sent by the client;
  • 1: if the full connection queue is full, the server sends a reset (RST) packet to the client, aborting the handshake and the connection.

If clients fail to connect to the server and you suspect the TCP full connection queue is full, you can temporarily set tcp_abort_on_overflow to 1; if the clients then see resets, that proves the server’s TCP full connection queue is overflowing. In general, however, tcp_abort_on_overflow should stay at 0, because dropping the ACK instead of resetting handles burst traffic better: the client will retransmit, and if the queue has freed up by then, the connection can still be established, which increases the connection success rate. Only set it to 1 if you are quite sure the full connection queue will stay full for a long time and you want to notify clients as soon as possible.

sysctl -w net.ipv4.tcp_abort_on_overflow=0
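As an illustration of what a client might observe (a hedged sketch, not from the original article; the server address is the one used in this experiment): with tcp_abort_on_overflow=1 the client’s read or write typically fails quickly with ECONNRESET, while with the default 0 the client believes the connection is established and may simply block until a retransmission eventually succeeds or times out.

/* Illustrative client for observing server-side accept-queue overflow. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in srv;
    memset(&srv, 0, sizeof(srv));
    srv.sin_family = AF_INET;
    srv.sin_port = htons(80);
    inet_pton(AF_INET, "172.16.0.20", &srv.sin_addr);  /* server from the experiment */

    /* connect() succeeds as soon as the client receives SYN+ACK, even if
     * the server later drops the final ACK because its accept queue is full. */
    if (connect(fd, (struct sockaddr *)&srv, sizeof(srv)) < 0) {
        perror("connect");
        return 1;
    }

    const char req[] = "GET / HTTP/1.0\r\nHost: 172.16.0.20\r\n\r\n";
    if (write(fd, req, sizeof(req) - 1) < 0)
        perror("write");   /* tcp_abort_on_overflow=1: often ECONNRESET */

    char buf[4096];
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n < 0)
        perror("read");    /* =1: ECONNRESET; =0: may block until a timeout */
    else
        printf("received %zd bytes\n", n);

    close(fd);
    return 0;
}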

How to increase the TCP full connection queue?

As mentioned above, the maximum length of the TCP full connection queue is the minimum of somaxconn and the backlog, that is, min(somaxconn, backlog). We now adjust the somaxconn value:

$ sysctl -w net.core.somaxconn=65535

Adjust the nginx configuration:

server {
    listen 80 backlog=65535;
    ...
}

Finally, restart the Nginx service, because the TCP full connection queue is reinitialized only when listen() is called again. Then run the ss command on the server to check the TCP full connection queue size:

$ ss -tulntp|grep 80
tcp    LISTEN     0       65535    *:80                    *:*                   users:(("nginx",pid=24212,fd=6),("nginx",pid=24211,fd=6),("nginx",pid=24210,fd=6))

According to the output, the maximum length of the TCP full connection queue is now 65535.

After increasing the TCP full connection queue, continue the load test:

$ ab -c 10000 -n 100000 http://172.16.0.20:80/
This is ApacheBench, Version 2.3 <$Revision: 1430300 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 172.16.0.20 (be patient)
Completed 10000 requests
Completed 20000 requests
Completed 30000 requests
Completed 40000 requests
Completed 50000 requests
Completed 60000 requests
Completed 70000 requests
Completed 80000 requests
Completed 90000 requests
Completed 100000 requests
Finished 100000 requests

Server Software:        nginx/1.20.1
Server Hostname:        172.16.0.20
Server Port:            80

Document Path:          /
Document Length:        4833 bytes

Concurrency Level:      10000
Time taken for tests:   2.844 seconds
Complete requests:      100000
Failed requests:        178364
   (Connect: 0, Receive: 0, Length: 89728, Exceptions: 88636)
Write errors:           0
Total transferred:      57592752 bytes
HTML transferred:       54922212 bytes
Requests per second:    35159.35 [#/sec] (mean)
Time per request:       284.419 [ms] (mean)
Time per request:       0.028 [ms] (mean, across all concurrent requests)
Transfer rate:          19774.64 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0  130   18.3    130     172
Processing:    45  142   40.1    138
Waiting:        0   19   52.4      0     185
Total:        159  272   31.2    272     390

Percentage of the requests served within a certain time (ms)
  50%    272
  66%    274
  75%    275
  80%    276
  90%    280
  95%    358
  98%    370
  99%    375
 100%    390 (longest request)

The server runs the ss command to view the usage of the TCP full connection queue.

$ ss -tulnp|grep 80
tcp    LISTEN     8      65535     *:80                    *:*                   users:(("nginx",pid=24212,fd=6),("nginx",pid=24211,fd=6),("nginx",pid=24210,fd=6))
$ ss -tulnp|grep 80
tcp    LISTEN     352    65535     *:80                    *:*                   users:(("nginx",pid=24212,fd=6),("nginx",pid=24211,fd=6),("nginx",pid=24210,fd=6))
$ ss -tulnp|grep 80
tcp    LISTEN     0      65535     *:80                    *:*                   users:(("nginx",pid=24212,fd=6),("nginx",pid=24211,fd=6),("nginx",pid=24210,fd=6))

Run netstat -s a few times to confirm that the TCP full connection queue overflow counter is no longer increasing:

$ netstat -s|grep overflowed
    1540879 times the listen queue of a socket overflowed
$ netstat -s|grep overflowed
    1540879 times the listen queue of a socket overflowed
$ netstat -s|grep overflowed
    1540879 times the listen queue of a socket overflowed
$ netstat -s|grep overflowed
    1540879 times the listen queue of a socket overflowed

After the maximum length of the TCP full connection queue is increased from 511 to 65535, the server can withstand 100,000 requests at a concurrency of 10,000 without the full connection queue overflowing. If connections keep being dropped because the TCP full connection queue overflows, you should increase the backlog and the somaxconn parameter.

TCP four-way wave

  • First wave: the active closing party sends a FIN to close the data transfer from the active side to the passive side. In other words, the active closing party tells the passive closing party: I will not send you any more data (of course, if previously sent data has not been acknowledged, the active closing party will still retransmit it). At this point the active closing party can still receive data; see the half-close sketch after this list.
  • Second wave: after receiving the FIN packet, the passive closing party sends an ACK to the active closing party, with an acknowledgment number equal to the received sequence number + 1 (like a SYN, a FIN occupies one sequence number).
  • Third wave: the passive closing party sends a FIN to close the data transfer from the passive side to the active side, that is, it tells the active closing party that all of its data has been sent and it will not send any more.
  • Fourth wave: after the active closing party receives the FIN, it sends an ACK to the passive closing party, with an acknowledgment number equal to the received sequence number + 1. At this point the four waves are complete.
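A minimal sketch (illustrative, not from the original article) of the active close from the application’s point of view, using shutdown() to send the FIN while still being able to receive, exactly as described in the first wave above:

/* Sketch of an orderly close from the active side, assuming an already
 * connected socket fd. shutdown(SHUT_WR) sends the FIN (first wave) but
 * still allows receiving. */
#include <sys/socket.h>
#include <unistd.h>

void graceful_close(int fd)
{
    char buf[4096];
    ssize_t n;

    shutdown(fd, SHUT_WR);                /* first wave: send FIN, stop sending */

    /* We can still receive data the peer has in flight. read() returning 0
     * means the peer has sent its own FIN (third wave). */
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        ;                                 /* consume remaining data */

    close(fd);                            /* release the descriptor; the kernel
                                             finishes the final ACK and TIME_WAIT */
}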

On the Internet, the server is often the party that closes the connection. This is because HTTP message exchange is one-way: after receiving a request, the server generates a response, and once the response has been sent, the server can close the TCP connection immediately.

The states of the four-way wave

Let’s look at the state changes during disconnection. The four waves involve only two types of packets: FIN and ACK. A FIN packet means that no more data will be sent, so the transmission channel in that direction is closed. An ACK packet acknowledges that the other side’s sending channel has been closed.

  • When the active side closes the connection, it sends a FIN packet, and its connection state changes from ESTABLISHED to FIN_WAIT1.
  • When the passive side receives the FIN packet, the kernel automatically replies with an ACK and its connection state changes from ESTABLISHED to CLOSE_WAIT. As the name implies, it is waiting for the process to call close() to close the connection.
  • When the active side receives the ACK packet, its connection state changes from FIN_WAIT1 to FIN_WAIT2, and its sending channel is closed.
  • When the passive side is in the CLOSE_WAIT state, the process’s read() returns 0, prompting it to call close(), which triggers the kernel to send a FIN; the passive side’s connection state then changes to LAST_ACK. When the active side receives this FIN, the kernel automatically replies with an ACK, and the connection state changes from FIN_WAIT2 to TIME_WAIT. In Linux, the wait time is 2MSL, where MSL (Maximum Segment Lifetime) is the longest time any packet can exist on the network; packets older than this are discarded. Only after this wait does a connection in the TIME_WAIT state close completely.
  • After the passive side receives the ACK packet, its connection is closed.

Optimizing TIME_WAIT on the active closing side

A large number of connections in the TIME_WAIT state consume a lot of memory and port resources. At this point, we can optimize the kernel options related to the TIME_WAIT state, for example by taking the following steps.

  • Increase the number of connections allowed in the TIME_WAIT state, net.ipv4.tcp_max_tw_buckets, and increase the size of the connection tracking table, net.netfilter.nf_conntrack_max:
sysctl -w net.ipv4.tcp_max_tw_buckets=1048576
sysctl -w net.netfilter.nf_conntrack_max=1048576
  • Reduce net.ipv4.tcp_fin_timeout (the FIN_WAIT2 timeout) and net.netfilter.nf_conntrack_tcp_timeout_time_wait (the TIME_WAIT timeout) so that the system releases the resources these states occupy as quickly as possible:
sysctl -w net.ipv4.tcp_fin_timeout=15
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30
  • Enable port reuse, net.ipv4.tcp_tw_reuse, so that ports occupied by connections in the TIME_WAIT state can be reused for new connections:
sysctl -w net.ipv4.tcp_tw_reuse=1
  • Increase the range of local ports, net.ipv4.ip_local_port_range. This allows more connections and improves overall concurrency:
sysctl -w net.ipv4.ip_local_port_range="1024 65535"
  • Increase the maximum number of file descriptors. You can use fs.nr_open and fs.file-max to raise the maximum number of file descriptors for a process and for the system respectively, or configure LimitNOFILE in the application’s systemd unit file to set the application’s maximum number of file descriptors (see the sketch after this list):
sysctl -w fs.nr_open=1048576
sysctl -w fs.file-max=1048576
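Besides fs.nr_open, fs.file-max, and LimitNOFILE, a process can also raise its own descriptor limit at startup. A small sketch (an illustrative assumption, not from the original article):

/* Sketch: raising the per-process file descriptor limit from inside the
 * application, as an alternative to LimitNOFILE in the systemd unit. */
#include <stdio.h>
#include <sys/resource.h>

int raise_nofile_limit(rlim_t want)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) < 0) {
        perror("getrlimit");
        return -1;
    }
    if (rl.rlim_cur < want) {
        rl.rlim_cur = want;                 /* soft limit */
        if (rl.rlim_max < want)
            rl.rlim_max = want;             /* hard limit (needs privilege) */
        if (setrlimit(RLIMIT_NOFILE, &rl) < 0) {
            perror("setrlimit");
            return -1;
        }
    }
    return 0;
}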

Shoulders of giants

[1] System Performance Tuning Essentials. Tao Hui. Geek Time.
[2] TCP/IP Illustrated, Volume 2: The Implementation.