A Look at TCP Socket listen and the Connection Queues from the Linux Source

Preface

I've always found it fascinating to understand every bit of code, from applications to frameworks to the operating system. The backlog parameter of listen is tied to both the half-connection hash table and the full-connection queue. This post walks through how.

The Server Socket needs to Listen

As we all know, setting up a server socket takes four steps: socket, bind, listen, and accept. Today I'm going to focus on the listen step.

The code is as follows:

void start_server(){
    // server fd
    int sockfd_server;
    // accept fd
    int sockfd;
    int call_err;
    struct sockaddr_in sock_addr;
    ...
    call_err=bind(sockfd_server,(struct sockaddr*)(&sock_addr),sizeof(sock_addr));
    if(call_err == -1){
        fprintf(stdout,"bind error! \n");
        exit(1);
    }
    // listen
    call_err=listen(sockfd_server,MAX_BACK_LOG);
    if(call_err == -1){
        fprintf(stdout,"listen error! \n");
        exit(1);
    }
}

First we create a socket through the socket system call, specifying SOCK_STREAM and passing 0 as the last parameter, which creates an ordinary TCP socket. Here we go straight to the ops of the TCP socket, that is, its table of operation functions.

If you want to know where the structure above came from, check out my previous blog post:

https://my.oschina.net/alchemystar/blog/1791017

Listen system call

Okay, now let's go straight to the listen system call.

#include <sys/socket.h>
// Returns 0 on success, -1 on error; the error code is set in errno
int listen(int sockfd, int backlog);

Note that the listen call is wrapped by glibc's INLINE_SYSCALL, which normalizes the return value to just 0 or -1 and stores the absolute value of the kernel's error code in errno. The backlog is a very important parameter here and can be a hidden pitfall if it is not set properly.
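To make the return-value convention concrete, here is a minimal sketch. The helper name listen_errno is made up for illustration; the only real calls are listen and errno. It reports the positive error code that glibc leaves in errno when listen returns -1.

```c
#include <errno.h>
#include <sys/socket.h>

/* Hypothetical helper (not from the original post): returns 0 if listen()
 * succeeded, otherwise the positive error code glibc stored in errno. */
int listen_errno(int fd, int backlog)
{
    errno = 0;
    if (listen(fd, backlog) == -1)
        return errno;   /* e.g. EBADF when fd is not an open descriptor */
    return 0;
}
```

Calling listen_errno(-1, 16) is one way to see the EBADF path, which the kernel call stack in this post reaches through sockfd_lookup_light.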

Java developers mostly rely on off-the-shelf frameworks, and Java's own default backlog is only 50. This can lead to some subtle behavior, which this article explains.

Next, let's walk down the call stack into the Linux kernel source:

listen
|->INLINE_SYSCALL(listen......)
    |->SYSCALL_DEFINE2(listen, int, fd, int, backlog)
        /* Check whether the descriptor fd exists; if not, return -EBADF */
        |->sockfd_lookup_light
        /* Cap the backlog so it does not exceed /proc/sys/net/core/somaxconn */
        |->if ((unsigned int) backlog > somaxconn) backlog = somaxconn
        |->sock->ops->listen(sock, backlog) <=> inet_listen

Note that the kernel adjusts the backlog value we pass in so that it never exceeds the somaxconn kernel parameter.
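The adjustment above can be sketched as a small pure function. clamp_backlog is a hypothetical name and somaxconn is passed in as an argument here, whereas the kernel reads it from its own per-namespace setting. The point is the unsigned cast, which means a negative backlog is also capped.

```c
/* Hypothetical stand-alone version of the check in the listen syscall.
 * Casting to unsigned makes a negative backlog look huge, so it is
 * capped at somaxconn as well. */
int clamp_backlog(int backlog, int somaxconn)
{
    if ((unsigned int)backlog > (unsigned int)somaxconn)
        backlog = somaxconn;
    return backlog;
}
```

With somaxconn at 128, a backlog of 50 passes through unchanged, while 4096 and -1 both come out as 128.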

inet_listen

Next comes the core function, inet_listen.

int inet_listen(struct socket *sock, int backlog)
{
    /* Really, if the socket is already in listen state
     * we can only allow the backlog to be adjusted.
     */
    if ((sysctl_tcp_fastopen & TFO_SERVER_ENABLE) != 0 &&
        inet_csk(sk)->icsk_accept_queue.fastopenq == NULL) {
        // fastopen logic
        if ((sysctl_tcp_fastopen & TFO_SERVER_WO_SOCKOPT1) != 0)
            err = fastopen_init_queue(sk, backlog);
        else if ((sysctl_tcp_fastopen & TFO_SERVER_WO_SOCKOPT2) != 0)
            err = fastopen_init_queue(sk, ((uint)sysctl_tcp_fastopen) >> 16);
        else
            err = 0;
        if (err)
            goto out;
    }
    if (old_state != TCP_LISTEN) {
        err = inet_csk_listen_start(sk, backlog);
    }
    sk->sk_max_ack_backlog = backlog;
    ......
}

The first interesting thing about this code is that the listen system call can be called repeatedly! A second call only changes the backlog queue length (though that hardly seems necessary).

First, let's look at the logic other than fastopen (fastopen will get its own chapter later). That is the final inet_csk_listen_start call.
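A user-space way to observe this is sketched below. listen_twice is a hypothetical helper; it binds to INADDR_ANY on port 0 so the kernel picks a free port, which is an assumption made only to keep the example self-contained.

```c
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper: returns 0 if both listen() calls succeed, -1 otherwise. */
int listen_twice(int first_backlog, int second_backlog)
{
    struct sockaddr_in addr;
    int rc = -1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd == -1)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(0);               /* let the kernel pick a free port */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0 &&
        listen(fd, first_backlog) == 0 &&   /* socket enters the LISTEN state */
        listen(fd, second_backlog) == 0)    /* already listening: only the backlog is updated */
        rc = 0;

    close(fd);
    return rc;
}
```

For example, listen_twice(16, 64) should return 0 on a Linux system, since the second call takes the "already in listen state" path shown in inet_listen.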

int inet_csk_listen_start(struct sock *sk, const int nr_table_entries)
{
    ......
    // nr_table_entries is the adjusted backlog.
    // Inside reqsk_queue_alloc it is further capped:
    // nr_table_entries = min(backlog, sysctl_max_syn_backlog)
    int rc = reqsk_queue_alloc(&icsk->icsk_accept_queue, nr_table_entries);
    ......
    inet_csk_delack_init(sk);
    // Set the sock to the listen state
    sk->sk_state = TCP_LISTEN;
    // Check the port number
    if (!sk->sk_prot->get_port(sk, inet->inet_num)) {
        sk_dst_reset(sk);
        // Link the current sock into listening_hash
        sk->sk_prot->hash(sk);
        return 0;
    }
    sk->sk_state = TCP_CLOSE;
    __reqsk_queue_destroy(&icsk->icsk_accept_queue);
    // The port is already in use, so return -EADDRINUSE
    return -EADDRINUSE;
}

The most important call here is sk->sk_prot->hash(sk), that is, inet_hash, which links the current sock into the global listen hash table so that the matching sock can be found when a SYN packet arrives. As shown below:

As shown in the figure, if SO_REUSEPORT is enabled, different sockets can listen on the same port, which lets the kernel load-balance incoming connections across them. With this enabled in Nginx 1.9.1, load-test performance improved by up to 3x!

Half-connection hash table and full-connection queue

Everything I read at first said that TCP has two connection queues, sync_queue and accept_queue. But after reading the source code carefully, I found that this is not quite the case. sync_queue is actually a hash table (syn_table). The other queue is icsk_accept_queue.

So in this article, it will be called reqsk_queue (short for request_socket_queue). Here I first show when each of the two queues comes into play during the three-way handshake. As shown below:

Of course, in addition to the qlen and sk_ack_backlog counters mentioned above, there is also a qlen_young counter, which means the following:

qlen_young: the number of socks whose SYN_ACK has not yet been retransmitted by the SYN_ACK timer and that have not completed the three-way handshake

As shown below:

The kernel code for the SYN_ACK retransmission timer is as follows:

static void tcp_synack_timer(struct sock *sk)
{
	inet_csk_reqsk_queue_prune(sk, TCP_SYNQ_INTERVAL,
				   TCP_TIMEOUT_INIT, TCP_RTO_MAX);
}

This timer runs every 200ms (TCP_SYNQ_INTERVAL) while the half-connection queue is not empty. For reasons of space, I won't go into the details here.
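The bookkeeping behind qlen and qlen_young can be pictured with a toy model. This is a simplified sketch and not the kernel's data structures; every name starting with toy_ is invented for illustration. It assumes a request stops being "young" the first time its SYN_ACK is retransmitted.

```c
/* Toy model of the listen socket's counters; not the kernel's structures. */
struct toy_synq {
    int qlen;        /* requests that have not finished the handshake */
    int qlen_young;  /* of those, requests whose SYN_ACK was never retransmitted */
};

/* A SYN arrives and a request is added to the hash table. */
void toy_syn_received(struct toy_synq *q)
{
    q->qlen++;
    q->qlen_young++;
}

/* The timer retransmits a SYN_ACK; retrans_before is how many times
 * this request's SYN_ACK had already been retransmitted. */
void toy_synack_retransmitted(struct toy_synq *q, int retrans_before)
{
    if (retrans_before == 0)
        q->qlen_young--;   /* first retransmit: the request is no longer young */
}

/* The request leaves the table (handshake completed or request dropped). */
void toy_request_removed(struct toy_synq *q, int retrans_before)
{
    if (retrans_before == 0)
        q->qlen_young--;   /* it was still counted as young */
    q->qlen--;
}
```

In this model qlen_young never exceeds qlen, and it only counts requests that are still waiting on their very first SYN_ACK.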

Why do half-connection queues exist

Given how TCP works, half-connection attacks are possible: an attacker keeps sending SYN packets and never answers the SYN_ACK replies. If every SYN packet caused the kernel to set up an expensive sock, memory could easily run out. Therefore, until the three-way handshake succeeds, the kernel allocates only a request_sock, which occupies very little memory, and combines this with the syn_cookie mechanism to resist such half-connection attacks as far as possible.

Limits on the half-connection hash table and the full-connection queue

The kernel imposes a maximum length on the full-connection queue because it stores ordinary socks, which consume a lot of memory. The limit is:

The minimum of the following three values:
1. the backlog passed to the listen system call
2. /proc/sys/net/ipv4/tcp_max_syn_backlog
3. /proc/sys/net/core/somaxconn
That is, min(backlog, tcp_max_syn_backlog, somaxconn)

Connections beyond this limit are dropped by the kernel, as shown below:

Dropping connections in this situation can cause a strange phenomenon. If tcp_abort_on_overflow is not set, the client cannot detect the drop and only learns that the peer has discarded the connection when it makes its first call on it.

So how do we make the client aware in this case? We can set tcp_abort_on_overflow:

echo '1' > tcp_abort_on_overflow

After setting, as shown below:

Of course, the most straightforward thing is to increase the backlog!

listen(fd,2048)
echo '2048' > /proc/sys/net/ipv4/tcp_max_syn_backlog
echo '2048' > /proc/sys/net/core/somaxconn
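Before raising these values it can help to check what the system currently has. The sketch below is one way to do that from C; read_proc_int is a hypothetical helper, and it assumes the file holds a single decimal integer, as these sysctl files do.

```c
#include <stdio.h>

/* Hypothetical helper: read one integer from a /proc/sys file.
 * Returns 0 and stores the value in *out on success, -1 on failure. */
int read_proc_int(const char *path, long *out)
{
    FILE *f = fopen(path, "r");
    int ok;
    if (f == NULL)
        return -1;
    ok = (fscanf(f, "%ld", out) == 1);   /* expect a single decimal number */
    fclose(f);
    return ok ? 0 : -1;
}
```

A server could call it on /proc/sys/net/core/somaxconn at startup and log a warning when the backlog it passes to listen is larger than the value it reads.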

Backlog impact on the half-connection queue

The backlog also affects the half-connection queue, as the following code shows:

/* TW buckets are converted to open requests without
 * limitations, they conserve resources and peer is
 * evidently real one.
 */
// If the half-connection queue length exceeds the backlog, send SYN cookies.
// Otherwise, drop the request.
if (inet_csk_reqsk_queue_is_full(sk) && !isn) {
    want_cookie = tcp_syn_flood_action(sk, skb, "TCP");
    if (!want_cookie)
        goto drop;
}

/* Accept backlog is full. If we have already queued enough
 * of warm entries in syn queue, drop request. It is better than
 * clogging syn queue with openreqs with exponentially increasing
 * timeout.
 */
// If the full-connection queue is full and there are young requests
// in the half-connection queue, drop the request directly.
if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
    NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
    goto drop;
}

This is the source of a message we often see in dmesg:

Possible SYN flooding on port 8080 

It means the half-connection queue is full and the kernel has fallen back to SYN cookie verification.
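The two checks in the kernel snippet above can be condensed into a small decision function. This is a simplified model rather than kernel code: the toy_ names are invented, and syncookies_enabled stands in for the result of tcp_syn_flood_action, which is an assumption made to keep the sketch pure.

```c
enum toy_verdict { TOY_QUEUE_REQUEST, TOY_SEND_COOKIE, TOY_DROP };

/* Simplified model of how an incoming SYN is treated.
 * synq_full:          half-connection queue has reached its limit
 * isn:                non-zero when the SYN reuses a TIME_WAIT bucket
 * syncookies_enabled: stand-in for the result of tcp_syn_flood_action()
 * acceptq_full:       full-connection queue has reached the backlog
 * young:              number of young requests in the half-connection queue */
enum toy_verdict toy_classify_syn(int synq_full, int isn,
                                  int syncookies_enabled,
                                  int acceptq_full, int young)
{
    int want_cookie = 0;

    if (synq_full && !isn) {
        want_cookie = syncookies_enabled;
        if (!want_cookie)
            return TOY_DROP;        /* queue full and no cookies: drop */
    }
    if (acceptq_full && young > 1)
        return TOY_DROP;            /* accept queue full, enough fresh requests waiting */

    return want_cookie ? TOY_SEND_COOKIE : TOY_QUEUE_REQUEST;
}
```

The model makes one thing easy to see: a full accept queue alone does not drop a SYN; the drop only happens when more than one young request is already waiting.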

Conclusion

TCP is an old and widely used protocol, and after decades of evolution its design has become quite complex, which makes problems hard to analyze when they occur. At that point, you have to read the fucking source code! While writing this post and reading the source in detail, I had a flash of inspiration and found the root cause of a strange problem I ran into recently. I will write up the analysis of that problem and share it in the near future.