Background

Recently, our Nginx + uWSGI + Python service frequently hit request errors under high load during peak periods. With a single machine handling only about 2000-3000 QPS, kernel CPU usage climbed above 20% and the kernel was switching context more than 2 million times per second. Analysis showed that nginx + uWSGI was triggering the thundering herd effect, which caused a sharp drop in performance; after we resolved the thundering herd with locking, the service recovered. Based on that troubleshooting process, and since an earlier epoll analysis I wrote also ran into the thundering herd but was never finished, this time I want to cover the topic properly: I'll analyze the cause of the thundering herd in detail, then look at how nginx and uWSGI each deal with this kind of problem. The source analysis in this article is based on the Linux 2.6 and 4.5 kernels, nginx 1.8 and 1.16, and uWSGI 2.20.

1. How can multiple processes share a listening port without epoll/select?

Without multiplexing, a process that wants to accept a TCP connection must call accept and block until a connection arrives; until then it can do nothing else. In other words, a single process can handle only one connection at a time: it finishes the business logic, calls close to close the connection, then goes back to accept, and so on. This obviously cannot achieve high concurrency, so multiple processes are usually used to handle more connections at the same time. There are generally two multi-process patterns.

In the first pattern, a main process listens and calls accept; after accepting a connection it forks a child process and hands the connection to the child for the business logic, while the main process goes back to listening. This is the simplest mode: since only one process calls accept on the listening socket, there is no multi-process contention, only the listener is woken when a TCP connection event arrives, and naturally there is no thundering herd.

In the second pattern, the main process forks a group of child processes that inherit the parent's listening socket, share it, and listen on it together. This raises the question of how the kernel behaves when multiple processes block waiting for events on the same port, and that is the scenario we'll focus on.
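To make the second pattern concrete, here is a minimal user-space sketch (not taken from nginx or uWSGI; the port, worker count and trivial handling are arbitrary choices for illustration): the parent creates the listening socket, forks a few workers, and each worker blocks in accept on the inherited fd. The kernel code that follows shows what happens inside accept in exactly this scenario.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    int on = 1;
    setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);           /* arbitrary port for the demo */
    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, 128);

    for (int i = 0; i < 4; i++) {          /* fork 4 workers sharing lfd */
        if (fork() == 0) {
            for (;;) {
                /* every worker blocks here; the kernel wakes only one
                   of them per incoming connection (exclusive wait) */
                int cfd = accept(lfd, NULL, NULL);
                if (cfd < 0)
                    continue;
                printf("worker %d accepted a connection\n", getpid());
                close(cfd);
            }
        }
    }
    for (;;)
        wait(NULL);                        /* parent just reaps children */
}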


// The process calls accept and enters inet_csk_accept, the core of accept
struct sock *inet_csk_accept(struct sock *sk, int flags, int *err)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	struct sock *newsk;
	int error;

	lock_sock(sk);

	/* We need to make sure that this socket is listening, * and that it has something pending. */
	error = -EINVAL;
    // Verify that the socket is listening
	if (sk->sk_state != TCP_LISTEN)
		goto out_err;

	/* Find already established connection */
    /* The next step is to find an established connection */
	if (reqsk_queue_empty(&icsk->icsk_accept_queue)) { // If the sock connection queue is empty
		long timeo = sock_rcvtimeo(sk, flags & O_NONBLOCK);

		/* If this is a non blocking socket don't sleep */
		error = -EAGAIN;
		if (!timeo) // In non-blocking mode, return immediately with err set to -EAGAIN
			goto out_err;
        // In blocking mode, enter inet_csk_wait_for_connect; the process sleeps until a new connection wakes it up
		error = inet_csk_wait_for_connect(sk, timeo);
		if (error)
			goto out_err;
	}
    // At this point, the connection queue will have at least one available connection to return
	newsk = reqsk_queue_get_child(&icsk->icsk_accept_queue, sk);
	WARN_ON(newsk->sk_state == TCP_SYN_RECV);
out:
	release_sock(sk);
	return newsk;
out_err:
	newsk = NULL;
	*err = error;
	goto out;
}
EXPORT_SYMBOL(inet_csk_accept);

// inet_csk_wait_for_connect suspends the process until a new connection wakes it up
static int inet_csk_wait_for_connect(struct sock *sk, long timeo)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	DEFINE_WAIT(wait); // Define a wait node to hang in the socket listening queue
	int err;

	for (;;) {
        // prepare_to_wait_exclusive marks this as an exclusive wait: when an event arrives the kernel wakes only one process on the wait queue
		prepare_to_wait_exclusive(sk_sleep(sk), &wait,
					  TASK_INTERRUPTIBLE);
		release_sock(sk);
		if (reqsk_queue_empty(&icsk->icsk_accept_queue))
			// If the queue is empty, suspend the current process here
			timeo = schedule_timeout(timeo);
		lock_sock(sk);
		err = 0;
		if (!reqsk_queue_empty(&icsk->icsk_accept_queue))
			break;
		err = -EINVAL;
		if (sk->sk_state != TCP_LISTEN)
			break;
		err = sock_intr_errno(timeo);
		if (signal_pending(current))
			break;
		err = -EAGAIN;
		if (!timeo)
			break;
	}
	finish_wait(sk_sleep(sk), &wait);
	return err;
}

The code above is the accept path taken from the Linux kernel. The accept/accept4 system calls ultimately reach inet_csk_accept, which calls inet_csk_wait_for_connect when no connection is available; there the kernel suspends the current process with prepare_to_wait_exclusive, so no multi-process wake-up happens here. PS: if you are not familiar with kernel internals, prepare_to_wait_exclusive probably means nothing to you. In short, the Linux kernel offers two ways of parking a process on a wait queue: prepare_to_wait and prepare_to_wait_exclusive. prepare_to_wait_exclusive marks the wait as exclusive, so when an event arrives the kernel wakes only one process on that wait queue; prepare_to_wait does not set the exclusive flag, so a wake-up wakes every process suspended on the queue. Therefore, in the ordinary case of multiple processes sharing a listening port and blocking in accept, the kernel wakes only one of them when a new connection arrives.
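To illustrate the difference, here is a schematic kernel-style sketch (not code from the kernel tree; the wait-queue head and function names are invented for illustration) of the two ways a process can park itself on a wait queue, and what a single wake_up() then does:

#include <linux/sched.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(wq);

/* Non-exclusive wait: every process parked this way is woken by wake_up() */
static void wait_non_exclusive(void)
{
	DEFINE_WAIT(wait);
	prepare_to_wait(&wq, &wait, TASK_INTERRUPTIBLE);
	schedule();                       /* sleep until woken */
	finish_wait(&wq, &wait);
}

/* Exclusive wait: the entry carries WQ_FLAG_EXCLUSIVE and is queued at the
 * tail, so a plain wake_up() stops after waking one such waiter */
static void wait_exclusive(void)
{
	DEFINE_WAIT(wait);
	prepare_to_wait_exclusive(&wq, &wait, TASK_INTERRUPTIBLE);
	schedule();
	finish_wait(&wq, &wait);
}

/* Producer side: wake_up() wakes all non-exclusive waiters but at most one
 * exclusive waiter; the accept path relies on the exclusive variant */
static void notify(void)
{
	wake_up(&wq);
}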

The socket fd1 created by the parent process is shared by the two child processes; the two fds point to the same file in the kernel, recorded in the open files table. Next, in steps 4 and 5, the two children call accept and block; both processes are suspended, and the kernel registers both of them on the socket's wait queue so it can wake them later. In step 8, when a connection event arrives, the kernel walks that socket's wait queue; for a TCP connection event it wakes only one process per connection: it takes the first node of the wait queue and wakes the corresponding process. Before Linux 2.6, accept woke every process on the wait queue, causing a thundering herd; 2.6 added the exclusive flag to fix this.

2. Behavior when sharing a listening port under epoll

The most classic example is Nginx, where multiple workers listen on the same port together. Before version 1.11, nginx used the approach described in section 1: the master process creates the listening socket, the worker processes inherit it after fork and put it into epoll to listen on. Now let's focus on how the kernel behaves in this scenario (epoll + accept). epoll starts with epoll_create, which creates an epoll file in the kernel, as shown in the figure below.

epoll_create creates an anonymous inode that points to the main epoll structure. This structure has two core fields. One is a red-black tree on which the files to be monitored from user space are hung, giving logarithmic search, insert and update complexity. The other is rdlist, the ready list of files with pending events: when an event occurs, epoll inserts the corresponding epitem (the red-black tree node) into this list, so that on return to user space it only needs to walk the ready list instead of iterating over all files the way select does. This article, however, focuses on epoll's blocking and wake-up path, so the main structure is only mentioned briefly here; in reality it is quite complex.
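For orientation, here is an abridged sketch of the two structures involved (field names as they appear in fs/eventpoll.c in kernels of roughly this era; most fields are omitted and names vary slightly between versions):

// Abridged: the epoll main structure, one per epoll instance
struct eventpoll {
	wait_queue_head_t wq;        /* processes blocked in epoll_wait */
	struct list_head rdllist;    /* ready list: epitems with pending events */
	struct rb_root rbr;          /* red-black tree of all monitored files */
	/* locks, overflow list, owning file, ... omitted */
};

// Abridged: the per-monitored-fd node linked into both structures above
struct epitem {
	struct rb_node rbn;          /* node in eventpoll.rbr */
	struct list_head rdllink;    /* node in eventpoll.rdllist when ready */
	struct epoll_filefd ffd;     /* the watched file and fd */
	struct eventpoll *ep;        /* back-pointer to the owning epoll instance */
	struct epoll_event event;    /* the events the user registered for */
};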

The process then calls epoll_ctl to pass the fd to the kernel. The kernel actually does two things:

  1. Hang the fd on the red-black tree
  2. Call the poll callback provided by the fd's device driver (this part is important)

epoll/select and the other multiplexing models rely mainly on one thing: parking the process on the wait queues of the relevant fds, so that when an event occurs on one of them the device driver wakes the processes on that queue. Without epoll, a process obviously cannot hang itself on several fds' wait queues at the same time; epoll does this on its behalf, and the key step in doing so is calling the poll method provided by each fd's device driver.

Linux standardizes its device model: devices are divided into character devices, block devices, network devices and so on, and anyone implementing a driver must follow the Linux conventions. For interaction with the upper layers, the kernel requires the developer to implement a structure called file_operations, which defines callback pointers for a series of operations, such as the read and write familiar to users; in the end the kernel calls back into the device's file_operations.read and file_operations.write, whose logic is implemented by the driver developer. For example, the accept call discussed in this article ends up in the accept implementation registered by the socket layer through its own operations table. For a device to support epoll/select, it must implement file_operations.poll; that is the method that eventually gets called, and Linux imposes a set of rules on it, requiring the developer to implement the following logic:

  1. The poll method must return a mask of the events the caller is interested in, for example whether the current fd is currently readable or writable
  2. If poll is passed a poll-specific wait structure (a poll_table, which carries a callback function), the poll method must invoke that callback. The callback is set by epoll, and what epoll does inside it is hang the current process on the fd's wait queue (see the driver-side sketch after this list)
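To make that contract concrete, here is the canonical shape of a device driver's poll method (a generic sketch, not the TCP socket's actual implementation; mydev and its fields are invented for illustration):

#include <linux/fs.h>
#include <linux/poll.h>
#include <linux/wait.h>

/* hypothetical device state, for illustration only */
struct mydev {
	wait_queue_head_t wq;   /* this fd's wait queue */
	int data_ready;
};

static unsigned int mydev_poll(struct file *file, poll_table *pt)
{
	struct mydev *dev = file->private_data;
	unsigned int mask = 0;

	/* poll_wait() invokes the callback stored in the poll_table; when the
	 * caller is epoll, that callback hangs the current process on dev->wq */
	poll_wait(file, &dev->wq, pt);

	/* report which events are currently pending on this fd */
	if (dev->data_ready)
		mask |= POLLIN | POLLRDNORM;

	return mask;
}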

In simple terms: when a process calls accept directly, it is the protocol stack that hangs the process on the socket's wait queue (with the exclusive flag, as shown in section 1); when the process uses epoll, the poll method is called back and it is epoll that hangs the process on the wait queue. This difference is the root cause of the accept thundering herd. Let's look at how epoll interacts with the file_operations->poll method.

As shown in the figure above, when the user calls epoll_ctl to add an event, in step 6 epoll hangs the current process on the fd's wait queue. By default, however, this entry is added without the exclusive flag, which means that when the device has an event and wakes the queue, every process parked on it is woken if there are several. As you can imagine, in the subsequent epoll_wait call, if multiple processes have added the same fd to their epoll instances, they will all wake together when the event arrives. Waking up does not necessarily mean returning to user space, though, because epoll then walks its ready list once more to make sure at least one event really occurred before returning. We can now picture how epoll produces the accept thundering herd (a small user-space demo that reproduces this follows the list):

  1. Multiple processes share the same listening port and all use epoll for multiplexing; epoll hangs each of them on the same wait queue
  2. When an event occurs, the socket's device driver tries to wake that queue; because epoll added the entries without the exclusive flag (unlike accept's own way of queueing, described in section 1), every process on the queue is woken
  3. The socket's poll method then checks whether the current socket's TCP full-connection queue has any connections available, and returns an event flag if it does
  4. If all the processes are woken before any of them has actually called accept, every one of these checks sees the accept event as available, so all of them return to user space
  5. Seeing that an accept event is available, each process then actually calls accept to try to take the connection
  6. Only one process actually gets the connection; all the others get an EAGAIN error, which you can observe with strace -p PID
  7. Not every process necessarily returns to user space: if some process has already taken the connection by the time another woken process runs its event check, that check finds nothing and the process does not return to user space
  8. Even when a process does not return to user space, the wake-up still costs a kernel context switch, which is also part of the thundering herd cost
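The following self-contained sketch reproduces the scenario in the list above (it is not nginx code; port and worker count are arbitrary): each forked worker creates its own epoll instance, adds the inherited listening fd without any exclusive flag, and logs whether accept succeeded or returned EAGAIN after a wake-up.

#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    int on = 1;
    setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);               /* arbitrary demo port */
    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, 128);
    fcntl(lfd, F_SETFL, O_NONBLOCK);           /* so the losers see EAGAIN */

    for (int i = 0; i < 4; i++) {
        if (fork() == 0) {
            /* each worker has its own epoll, but they all watch the same
               underlying file, so they all land on the same wait queue */
            int ep = epoll_create1(0);
            struct epoll_event ev = { .events = EPOLLIN, .data.fd = lfd };
            epoll_ctl(ep, EPOLL_CTL_ADD, lfd, &ev);

            for (;;) {
                struct epoll_event out;
                int n = epoll_wait(ep, &out, 1, -1);
                if (n <= 0)
                    continue;
                int cfd = accept(lfd, NULL, NULL);
                if (cfd >= 0) {
                    printf("worker %d: accepted\n", getpid());
                    close(cfd);
                } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
                    /* woken together with the others, but someone
                       else already took the connection */
                    printf("worker %d: woken for nothing (EAGAIN)\n", getpid());
                }
            }
        }
    }
    for (;;)
        wait(NULL);
}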

3. Has the kernel solved the thundering herd?

The root cause is that, by default, epoll does not set the exclusive flag when multiple processes wait on the same file, so all of them get woken. Later kernel versions provide two main solutions:

  1. Since Linux 4.5, epoll has the EPOLLEXCLUSIVE flag. If it is set, epoll adds the process to the wait queue with the exclusive flag, reproducing the behavior of the kernel's native accept: only one process in the queue is woken
  2. Linux 3.9 added the SO_REUSEPORT socket option. This approach is more thorough: it lets different processes bind their own sockets to the same port instead of having child processes share one listening socket. Each process's listening socket then points to a different node in the open files table, so each process sleeps on its own socket's wait queue; nothing shares an fd, and therefore nothing can be woken together. When a TCP connection arrives, the kernel hashes the source IP and source port and hands the connection to one of the sockets in the group, which amounts to load balancing at the kernel level (a user-space sketch of both fixes follows)
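For reference, this is roughly how the two fixes look from user space (a sketch with error handling omitted, port arbitrary, assuming a glibc recent enough to define EPOLLEXCLUSIVE and SO_REUSEPORT): a worker either adds a shared listening fd with EPOLLEXCLUSIVE, or creates its own SO_REUSEPORT socket.

#include <netinet/in.h>
#include <sys/epoll.h>
#include <sys/socket.h>

/* Fix 1 (Linux 4.5+): shared listening fd, but ask epoll for exclusive wakeup */
static void watch_exclusive(int ep, int shared_lfd)
{
    struct epoll_event ev = { .events = EPOLLIN | EPOLLEXCLUSIVE,
                              .data.fd = shared_lfd };
    epoll_ctl(ep, EPOLL_CTL_ADD, shared_lfd, &ev);
}

/* Fix 2 (Linux 3.9+): every worker binds its own socket to the same port */
static int listen_reuseport(unsigned short port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int on = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on));

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(fd, 128);
    return fd;            /* a distinct file per worker: no shared wait queue */
}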

With these two mechanisms there is essentially no thundering herd left in the epoll ecosystem today, unless you misuse epoll: for example, multiple processes sharing the same epfd (the parent creates one epoll instance and the forked children all call epoll_wait on it). In that case the blame lies with the usage rather than with epoll, because multiple processes are attached to the same epoll instance, and the problem goes beyond a thundering herd: if process A adds a socket1 connection event to the epoll and process B calls epoll_wait, then since they share the epfd, process B may be the one woken when socket1's event fires, which makes things very confusing. Summary: never share one epfd between multiple threads/processes.

4. How does nginx solve the thundering herd?

Since version 1.11, nginx fixes this problem with the SO_REUSEPORT option, and the application layer needs no special handling. Before that, nginx dealt with it by locking (a file-based lock shared between workers): only the process holding the lock listens on the port, and when epoll_wait returns it calls accept to take the connections and then releases the lock so another process can listen. This is a compromise rather than a perfect solution. Contention for the lock costs performance (even with non-blocking lock attempts), and there can be short windows in which no process holds the lock: for example, if process A holds the lock, the other processes only retry after a short interval, and if traffic is heavy during that window and A releases the lock after accepting only a few requests, some connection events are left hanging. In short, upgrade nginx and stop relying on this mode.
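On a recent nginx the relevant knobs look roughly like this (an illustrative configuration fragment, not a complete nginx.conf): reuseport on the listen directive gives each worker its own SO_REUSEPORT socket, while accept_mutex is the older lock-based approach described above.

events {
    # the older lock-based approach; off when relying on reuseport
    accept_mutex off;
}

http {
    server {
        # each worker binds its own listening socket via SO_REUSEPORT
        listen 80 reuseport;
    }
}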

5. How does uWSGI address the thundering herd?

Reference: uwsgi-docs.readthedocs.io/en/latest/a… (Serializing accept(), AKA Thundering Herd, AKA the Zeeg Problem, uWSGI 2.0 documentation)

uWSGI applications generally don't chase extreme concurrency, so the thundering herd is rarely a pressing concern there, but uWSGI still provides a thunder-lock option, which implements a lock that the processes contend on for accept. Newer versions also support SO_REUSEPORT and turn it on by default. In practice, however, we found that if the lock is not enabled, the thundering herd shows up on uWSGI as CPU skew: the process that returns from the kernel first and wins accept gets the socket, then typically adds it to its own epoll to wait for incoming data; the more sockets a process has in its epoll, the more likely it is to be woken, and after being woken it checks accept again, so the more likely it is to win the next accept too. Over time you see a handful of workers handling almost all the requests while the other workers starve. I haven't analyzed this situation in full detail yet; if I get the chance later I'll write a dedicated follow-up on uWSGI.
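If you want the lock enabled explicitly, the option is thunder-lock; a minimal ini sketch (the other values are placeholders for illustration):

[uwsgi]
; serialize accept() between workers to avoid the thundering herd
thunder-lock = true
master = true
processes = 8
socket = 127.0.0.1:3031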

My original post is on medium.com/heshaobo20…