Epoll is fundamentally broken

1. The introduction

This is part of a translated series of articles on I/O multiplexing. The original articles are:

  • The history of the Select(2) syscall
  • Select(2) is fundamentally broken
  • Epoll is fundamentally broken 1/2
  • Epoll is fundamentally broken 2/2

This is the second post of the series.

Parts 3 and 4 of the series address two different problems with epoll(2):

  1. Part 3 focuses on the multithreaded scalability of epoll
  2. Part 4 focuses on the different life cycles of the file descriptors (fds) registered with epoll and the file descriptions managed by the kernel

We will discuss them in that order.

3. Multithreaded scalability of epoll

epoll's multithreaded scalability problem shows up mainly as load balancing across multiple cores. There are two typical scenarios:

  1. A TCP server calling accept(2) on the same listen fd from multiple CPUs
  2. A large number of TCP connections calling read(2) from multiple CPUs

3.1 Problems with accept(2) on a single TCP listen fd

A typical scenario is an HTTP 1.0 server that has to handle a large number of short-lived connections. Because it must accept() a huge number of TCP connection requests, it wants to distribute those accepts across CPUs to exploit multiple cores.

This happens in real production environments. Tom Herbert reported an application handling 40,000 new connection requests per second; with that many requests, it obviously makes sense to spread the work across CPUs.

In reality, however, things are not that simple: up until the Linux 4.5 kernel, epoll(2) could not scale these requests horizontally across CPUs. Let’s look at how epoll’s two modes, LT (level-triggered) and ET (edge-triggered), behave in this situation.

3.1.1 Problems with level-triggered mode: unnecessary wakeups

A naive approach is to have multiple threads call epoll_wait() on the same epoll fd. This does not work; neither does registering the listen fd used for accept into a separate epoll fd in each thread.

This is because epoll’s level-triggered mode suffers from the same “thundering herd” problem as select(2): in level-triggered mode without special flags, all worker threads are woken up when a new connection request comes in. Here is an example of this case:

1. Kernel: receives a new connection request
2. Kernel: because of the "thundering herd", wakes up both thread A and thread B, which are blocked in epoll_wait()
3. Thread A: epoll_wait() returns
4. Thread B: epoll_wait() returns
5. Thread A: calls accept(), which succeeds
6. Thread B: calls accept(), which fails with EAGAIN

Waking up thread B is unnecessary and just wastes precious CPU resources; epoll in level-triggered mode scales poorly.

3.1.2 Problems with edge-triggered mode: unnecessary wakeups and starvation

Since level-triggered mode doesn’t work, is edge-triggered mode any better? Not really. Consider this example:

1. Kernel: receives the first connection request. Threads A and B are both waiting in epoll_wait(). Because of edge-triggered mode, only one thread is notified; assume thread A gets the notification
2. Thread A: epoll_wait() returns
3. Thread A: calls accept(), which succeeds
4. Kernel: receives a second connection request
5. Kernel: thread A is still busy in accept(), and only thread B is now waiting in epoll_wait(), so the kernel wakes thread B
6. Thread A: continues to call accept(), which succeeds and picks up the second connection
7. Thread B: calls accept(), which fails with EAGAIN. Thread B may be a little confused ("I was notified of an event, yet I got EAGAIN")
8. Thread A: calls accept() again, and this time finally gets EAGAIN

As the example shows, waking up thread B was completely unnecessary. Beyond that, edge-triggered mode also has a starvation problem. Consider the following example:

1. Kernel: receives two connection requests. Threads A and B are both waiting in epoll_wait(). Because of edge-triggered mode, only one thread is woken; assume thread A is woken first
2. Thread A: epoll_wait() returns
3. Thread A: calls accept(), which succeeds
4. Kernel: receives a third connection request. Because thread A has not yet drained the accept queue (has not yet seen EAGAIN), the socket is still readable, and in edge-triggered mode no new event is generated
5. Thread A: continues to call accept(), hoping to get EAGAIN and go back to epoll_wait(), but accept() succeeds again and it handles another new connection
6. Thread A: calls accept() yet again, which also succeeds

In this example, the socket goes from not readable to readable only once, and since it is in edge-triggered mode, the kernel wakes epoll_wait() only once. As a result, all the connection requests land on thread A and load balancing never happens: thread A is swamped while thread B has no work to do.

3.1.3 What is the right thing to do?

Since neither level-triggered nor edge-triggered mode works, what is the right thing to do? There are two workarounds:

  1. The best (and the only truly scalable) way is to use the EPOLLEXCLUSIVE flag, added for level-triggered mode in Linux 4.5+. It guarantees that only one epoll_wait() is woken per event, which avoids the "thundering herd" and scales well horizontally across CPUs.
  2. When the kernel does not support EPOLLEXCLUSIVE, you can simulate LT + EPOLLEXCLUSIVE with edge-triggered mode plus EPOLLONESHOT, at the cost of one extra epoll_ctl(EPOLL_CTL_MOD) call after each event to re-arm the fd. This spreads the load evenly across CPUs, but at most one worker can be in accept(2) at a time, which obviously limits accept(2) throughput. An example of this approach (a code sketch follows the list):
1. Kernel: receives two connection requests. Threads A and B are both waiting in epoll_wait(). Because of edge-triggered mode, only one thread is woken; assume thread A is woken first
2. Thread A: epoll_wait() returns
3. Thread A: calls accept(), which succeeds
4. Thread A: calls epoll_ctl(EPOLL_CTL_MOD) to reset the EPOLLONESHOT state and re-arm the socket fd
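
Below is a minimal sketch of workaround 2 in C, under the following assumptions: the listen fd is non-blocking and was registered with EPOLLIN | EPOLLET | EPOLLONESHOT, and the function and variable names are illustrative, not from the original article.

#include <errno.h>
#include <sys/epoll.h>
#include <sys/socket.h>

/* Each worker thread runs this loop on the shared epoll fd "epfd";
 * "lfd" is the non-blocking listen fd registered with
 * EPOLLIN | EPOLLET | EPOLLONESHOT. */
void accept_worker(int epfd, int lfd) {
    struct epoll_event ev;
    for (;;) {
        if (epoll_wait(epfd, &ev, 1, -1) <= 0)
            continue;
        /* EPOLLONESHOT disarmed lfd when the event fired, so only this
         * thread is accepting right now: drain the queue until EAGAIN. */
        for (;;) {
            int cfd = accept(lfd, NULL, NULL);
            if (cfd < 0) {
                /* EAGAIN/EWOULDBLOCK: queue drained; any other errno is
                 * a real error that production code should handle */
                break;
            }
            /* ... hand cfd off for processing ... */
        }
        /* Re-arm the fd so the next connection wakes exactly one waiter. */
        ev.events = EPOLLIN | EPOLLET | EPOLLONESHOT;
        ev.data.fd = lfd;
        epoll_ctl(epfd, EPOLL_CTL_MOD, lfd, &ev);
    }
}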

3.1.4 Other approaches

Of course, if you don’t insist on epoll(), there are alternatives. One is to use the socket option SO_REUSEPORT to create multiple listen sockets sharing the same port number. This alternative has its own problem, though: when one of the listen socket fds is closed, the connection requests already queued on that socket’s accept queue are dropped. For details, see engineeringblog.yelp.com/2015/04/tru… and the comments on LWN.
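
A minimal sketch of this alternative, assuming IPv4 and omitting error handling (the function name is illustrative): each worker creates its own listen socket bound to the same port.

#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* One listen socket per worker, all bound to the same port; the kernel
 * spreads incoming connections across the sockets. */
int make_reuseport_listener(uint16_t port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(fd, SOMAXCONN);
    return fd;  /* each worker epoll_wait()s or accept()s on its own fd */
}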

Starting with Linux 4.5, SO_ATTACH_REUSEPORT_CBPF and SO_ATTACH_REUSEPORT_EBPF were introduced. With careful design, they should make it possible to avoid dropping connection requests.

3.2 Problems with read(2) for a large number of TCP connections

In addition to the accept(2) problem described in section 3.1, plain read(2) also has scalability problems on multi-core systems. Imagine the following scenario: an HTTP server needs to communicate with a large number of HTTP clients and wants to handle each client request as quickly as possible. The processing time of each client connection may differ, some fast, some slow, and unpredictably so, so naively sharding the connections across CPUs can increase the average response time. A better queuing strategy is to manage all the connections with one epoll fd that has EPOLLEXCLUSIVE set, and have multiple worker threads call epoll_wait() to fetch ready connections and process them. There is a YouTube video describing this model, called the "combined queue".
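
As a sketch, the registration this combined-queue model relies on looks roughly like this, assuming Linux 4.5+ for EPOLLEXCLUSIVE (the function name is illustrative):

#include <sys/epoll.h>

/* Add a connection fd to the shared epoll fd. EPOLLEXCLUSIVE is only
 * accepted with EPOLL_CTL_ADD and, per the discussion above, makes an
 * event wake only one of the threads blocked in epoll_wait(). */
void add_connection(int epfd, int connfd) {
    struct epoll_event ev;
    ev.events = EPOLLIN | EPOLLEXCLUSIVE;
    ev.data.fd = connfd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, connfd, &ev);
}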

Here’s how epoll handles this model:

3.2.1 Problems with level-triggered mode: data arrives out of order

Because of level-triggered mode’s "thundering herd", we don’t want to use it for this model. Moreover, even with the EPOLLEXCLUSIVE flag there is still a data race. Consider this example:

1. Kernel: receives 2047 bytes of data
2. Kernel: threads A and B are both waiting in epoll_wait(); because of EPOLLEXCLUSIVE the kernel wakes only one thread, say thread A
3. Thread A: epoll_wait() returns
4. Kernel: receives another 2 bytes of data
5. Kernel: thread A is still working, and only thread B is now waiting in epoll_wait(), so the kernel wakes thread B
6. Thread A: calls read(2048) and reads 2048 bytes
7. Thread B: calls read(2048) and reads the remaining 1 byte

In this scenario, the data is split across two different threads, and without locking it can easily be processed out of order.

3.2.2 Problems with edge-triggered mode: data arrives out of order

Since the level-triggered model doesn’t work, what about edge-triggered? The same race exists there. Consider this example:

1. Kernel: receives 2048 bytes of data
2. Kernel: threads A and B are both waiting in epoll_wait(); in edge-triggered mode the kernel wakes only one thread, say thread A
3. Thread A: epoll_wait() returns
4. Thread A: calls read(2048) and reads 2048 bytes
5. Kernel: the buffer is now fully drained, so the fd is re-armed in the epoll queue
6. Kernel: receives 1 more byte of data
7. Kernel: thread A is still working, and only thread B is now waiting in epoll_wait(), so the kernel wakes thread B
8. Thread B: epoll_wait() returns
9. Thread B: calls read(2048) and reads only the 1 byte
10. Thread A: calls read(2048) again and gets EAGAIN, because the kernel buffer is empty

3.2.3 What is the right thing to do?

In fact, the only way to guarantee that data from the same connection always lands on the same thread with the epoll model is to register the fd with the EPOLLONESHOT flag in epoll_ctl, and then re-arm the socket fd in epoll after each event has been fully processed.
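
A minimal sketch of that pattern, assuming a non-blocking connection fd registered with EPOLLIN | EPOLLONESHOT (names are illustrative):

#include <sys/epoll.h>
#include <unistd.h>

/* While EPOLLONESHOT keeps the fd disarmed, no other worker can be
 * woken for this connection, so the reads stay ordered. */
void read_worker(int epfd) {
    struct epoll_event ev;
    char buf[2048];
    for (;;) {
        if (epoll_wait(epfd, &ev, 1, -1) <= 0)
            continue;
        int fd = ev.data.fd;
        for (;;) {
            ssize_t n = read(fd, buf, sizeof(buf));
            if (n <= 0)
                break;  /* 0: peer closed; -1 with EAGAIN: drained */
            /* ... process these n bytes in order ... */
        }
        /* Done processing: re-arm the fd for the next event. */
        ev.events = EPOLLIN | EPOLLONESHOT;
        ev.data.fd = fd;
        epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
    }
}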

3.3 Summary: epoll and load balancing

It is not easy to use epoll(2) correctly. To achieve load balancing and avoid data races with epoll, you must master the EPOLLONESHOT and EPOLLEXCLUSIVE flags. And EPOLLEXCLUSIVE is a latecomer among epoll’s flags, so it is fair to say that epoll was not originally designed with multithreaded load balancing in mind.

4. epoll: file descriptor vs. file description

In this chapter we focus on another big problem with epoll(): the inconsistency between the life cycles of file descriptors and file descriptions.

4.1 Inconsistent life cycles of file descriptor and file description

Foom said on LWN:

Obviously epoll has a huge design flaw, one that should be apparent to anyone who knows anything about file descriptors. In fact, if you look back at the history of epoll, the people who implemented it evidently didn't understand the difference between a file descriptor and a file description. :(

In fact, the main problem with epoll() is that it confuses the user-space file descriptor (the number we usually call an fd) with the kernel file description that the implementation actually operates on. The problem shows itself when a process calls close(2) on such an fd.

epoll_ctl(EPOLL_CTL_ADD) does not actually register the file descriptor (fd); it registers a tuple of the fd and a pointer to the kernel file description. The root cause of the problem is that the life cycle epoll ties a registration to is not that of the fd itself, but that of the corresponding kernel file description.

When close(2) closes an fd, if that fd holds the only reference to the kernel file description, the file description is removed as well, and that case is fine. But when the file description has other references, close() does not remove it. The result: if the fd is closed before it is removed from epoll, epoll() keeps reporting events for that now-dead fd.

Here is an example using the dup(2) system call to illustrate the problem:

rfd, wfd = pipe()
write(wfd, "a")             # Make the "rfd" readable

epfd = epoll_create()
epoll_ctl(epfd, EPOLL_CTL_ADD, rfd, (EPOLLIN, rfd))

rfd2 = dup(rfd)
close(rfd)

r = epoll_wait(epfd, -1ms)  # What will happen?

Since close(rfd) closed rfd, you might expect this epoll_wait() to block forever, but it doesn’t. Because dup() was called, the kernel file description still has a nonzero reference count and is not removed, so its events are still reported to epoll. Hence epoll_wait() reports an event on a no-longer-existing fd. Worse still, once you close() that fd, there is no way left to remove the dead registration from epoll; neither of these works:

epoll_ctl(epfd, EPOLL_CTL_DEL, rfd)
epoll_ctl(epfd, EPOLL_CTL_DEL, rfd2)
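
Here is a runnable C sketch of the same experiment (our own reconstruction, not code from the original article). On Linux it reports an event for the already-closed rfd, and both EPOLL_CTL_DEL calls fail:

#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

int main(void) {
    int p[2];
    pipe(p);                      /* p[0] = rfd, p[1] = wfd */
    write(p[1], "a", 1);          /* make rfd readable */

    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = p[0] };
    epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev);

    int rfd2 = dup(p[0]);
    close(p[0]);                  /* the file description lives on via rfd2 */

    struct epoll_event out;
    int n = epoll_wait(epfd, &out, 1, 0);  /* returns 1, not 0 */
    printf("epoll_wait returned %d, event fd = %d\n", n, n > 0 ? out.data.fd : -1);

    /* Neither call can remove the stale registration: the first fd is
     * already closed (EBADF), the second was never registered (ENOENT). */
    printf("EPOLL_CTL_DEL(rfd)  -> %d\n", epoll_ctl(epfd, EPOLL_CTL_DEL, p[0], NULL));
    printf("EPOLL_CTL_DEL(rfd2) -> %d\n", epoll_ctl(epfd, EPOLL_CTL_DEL, rfd2, NULL));
    return 0;
}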

Marc Lehmann also mentioned this issue:

Therefore, it is possible for close() to drop an fd while you still receive epoll events for that fd. And once that happens, nothing you do can fix it.

Therefore, you cannot rely on close() to clean up an fd’s epoll registration. Once close() has been called while the kernel still holds other references to the file description, the epoll fd cannot be repaired. The only remedy is to throw away the epoll fd, create a new one, and re-register all the surviving fds with it. So remember this advice:

Always remember to explicitly call epoll_ctl(EPOLL_CTL_DEL) before calling close()
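
That advice can be wrapped in a tiny helper, sketched here (the name epoll_close is ours, not a real API):

#include <sys/epoll.h>
#include <unistd.h>

/* Deregister the fd from epoll before closing it, so that no stale
 * registration can outlive the close. */
int epoll_close(int epfd, int fd) {
    epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);  /* must happen first */
    return close(fd);
}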

4.2 Summary

Explicitly removing the fd from epoll before calling close() works fine as long as you control all the code involved. But that is not always the case: when you write a library that wraps epoll, for example, you sometimes cannot forbid your users from calling close(2) themselves. So writing a lightweight abstraction layer on top of epoll is not an easy task.

In addition, Illumos also implements an epoll() mechanism; its manual explicitly calls out the strange epoll()/close() semantics of Linux and refuses to support them.

Hopefully, the problems described in this article will be of some help to anyone who has to work with Linux’s flawed epoll() design.

Note 1: the author thinks that in this scenario one could simply use a master thread to distribute work and multiple worker threads to process it, or give each worker thread its own independent epoll fd.

