Preface

epoll is not covered in UNIX Network Programming, so the following is summarized from the Linux manual pages.

Introduction to the API

epoll is the mechanism Linux provides for I/O multiplexing. Like poll, it can monitor multiple descriptors at the same time. epoll adds the concepts of edge triggering and level triggering, and it has the advantage when dealing with large numbers of descriptors.

The core concept of the epoll API is the epoll instance, a data structure in the kernel. From the user's point of view, it can be regarded simply as a container for two lists:

  • Interest list (or epoll set): the collection of descriptors that the user has registered interest in
  • Ready list: the collection of ready descriptors. The kernel automatically adds a descriptor to the ready list when I/O on it becomes ready

The epoll API contains three system calls:

epoll_create

int epoll_create(int size);
int epoll_create1(int flags);

epoll_create creates an epoll instance and returns a descriptor that refers to it. When the epoll instance is no longer needed, call close on that descriptor. The size parameter resembles the capacity parameter of a map: it was originally a hint for the number of descriptors the instance would maintain. Since Linux 2.6.8 it is ignored, but it must still be greater than zero.

epoll_create1 is similar to epoll_create, but the size argument is replaced by flags. There is one option for flags: EPOLL_CLOEXEC, which sets the FD_CLOEXEC flag on the created descriptor.
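
As a minimal sketch of the two creation calls, the following checks that EPOLL_CLOEXEC really sets FD_CLOEXEC and that epoll_create rejects a non-positive size. The helper names cloexec_is_set and create_with_size are made up for this illustration and are not part of the epoll API.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a descriptor created with
 * EPOLL_CLOEXEC carries FD_CLOEXEC, 0 if not, -1 on error. */
int cloexec_is_set(void)
{
    int epfd = epoll_create1(EPOLL_CLOEXEC);
    if (epfd == -1)
        return -1;

    int fdflags = fcntl(epfd, F_GETFD);
    close(epfd); /* release the epoll instance once it is no longer needed */
    if (fdflags == -1)
        return -1;
    return (fdflags & FD_CLOEXEC) ? 1 : 0;
}

/* Hypothetical helper: returns 0 if epoll_create(size) succeeds,
 * otherwise the errno value it reported. */
int create_with_size(int size)
{
    int epfd = epoll_create(size);
    if (epfd == -1)
        return errno;
    close(epfd);
    return 0;
}
```

On a current Linux kernel, create_with_size(0) should report EINVAL while any positive size succeeds.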

epoll_ctl

int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);

/* Valid opcodes ( "op" parameter ) to issue to epoll_ctl(). */
#define EPOLL_CTL_ADD 1 /* Add a file descriptor to the interface. */
#define EPOLL_CTL_DEL 2 /* Remove a file descriptor from the interface. */
#define EPOLL_CTL_MOD 3 /* Change file descriptor epoll_event structure. */

typedef union epoll_data {
    void *ptr;
    int fd;
    uint32_t u32;
    uint64_t u64;
} epoll_data_t;

struct epoll_event
{
    uint32_t events;   /* Epoll events */
    epoll_data_t data; /* User data variable */
};

epoll_ctl manages the interest list of an epoll instance. Depending on op, it registers a descriptor together with the events of interest (EPOLL_CTL_ADD), changes the events of a descriptor that is already registered (EPOLL_CTL_MOD), or removes a descriptor (EPOLL_CTL_DEL). The function returns 0 on success; on failure it returns -1 and sets errno.
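
The three opcodes can be tried on the read end of a pipe. This is only a sketch: ctl_roundtrip is a hypothetical helper, and its step numbers exist purely to show which call misbehaved.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: walks a pipe's read end through ADD, MOD and DEL.
 * Returns 0 if every call behaved as expected, else the failing step. */
int ctl_roundtrip(void)
{
    int pipefd[2];
    int step = 0;

    if (pipe(pipefd) == -1)
        return 1;
    int epfd = epoll_create1(0);
    if (epfd == -1) {
        close(pipefd[0]);
        close(pipefd[1]);
        return 2;
    }

    struct epoll_event ev = {0};
    ev.events = EPOLLIN;
    ev.data.fd = pipefd[0];

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == -1) {
        step = 3;
    } else if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) != -1 ||
               errno != EEXIST) {
        step = 4; /* adding the same descriptor twice must fail with EEXIST */
    } else {
        ev.events = EPOLLIN | EPOLLET; /* switch the registration to ET */
        if (epoll_ctl(epfd, EPOLL_CTL_MOD, pipefd[0], &ev) == -1)
            step = 5;
        else if (epoll_ctl(epfd, EPOLL_CTL_DEL, pipefd[0], &ev) == -1)
            step = 6;
        else if (epoll_ctl(epfd, EPOLL_CTL_MOD, pipefd[0], &ev) != -1 ||
                 errno != ENOENT)
            step = 7; /* modifying a removed descriptor must fail with ENOENT */
    }

    close(epfd);
    close(pipefd[0]);
    close(pipefd[1]);
    return step;
}
```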

epoll_wait

int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);

epoll_wait blocks until I/O events arrive, which can be understood as fetching descriptors from the ready list. The function returns the number of ready descriptors and stores their events in the events array, which must have room for maxevents entries. The timeout parameter is given in milliseconds: -1 means wait indefinitely, and 0 means return immediately even if nothing is ready.
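
To see the return value and the timeout in action, here is a small sketch that waits on a pipe. The helper ready_count and its write_first parameter are assumptions made for the example.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: registers a pipe's read end for EPOLLIN, optionally
 * writes one byte first, and returns what epoll_wait reports
 * (the number of ready descriptors, or -1 on error). */
int ready_count(int write_first, int timeout_ms)
{
    int pipefd[2];
    struct epoll_event ev = {0}, out[4];
    int result = -1;

    if (pipe(pipefd) == -1)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd != -1) {
        ev.events = EPOLLIN;
        ev.data.fd = pipefd[0];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == 0 &&
            (!write_first || write(pipefd[1], "x", 1) == 1))
            result = epoll_wait(epfd, out, 4, timeout_ms);
        close(epfd);
    }
    close(pipefd[0]);
    close(pipefd[1]);
    return result;
}
```

With an empty pipe, ready_count(0, 0) returns 0 at once and ready_count(0, 20) returns 0 after roughly 20 ms. With data already written, ready_count(1, -1) returns 1 without blocking.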

Edge triggering and level triggering

There is a lot to say about edge triggering and level triggering. Here is a translation from the man page.

Epoll provides two trigger mechanisms: edge-triggered (ET) and level-triggered (LT). The difference between them can be illustrated by the following example:

  1. Suppose we have a descriptor rfd that is the read end of a pipe. We register it with the epoll instance, and the event of interest is readability
  2. The writing end of the pipe writes 2 KB of data to the pipe
  3. The process calls epoll_wait. At this point rfd is placed on the ready list and the call returns successfully
  4. The reading end of the pipe reads 1 KB of data from the pipe
  5. The process calls epoll_wait again

If rfd was registered with the epoll instance using the EPOLLET option, the epoll_wait call in step 5 may block even though there is still readable data in the read buffer. Meanwhile, the other end of the pipe may be waiting for a response, so both sides can end up waiting for each other forever. The reason is that ET reports an event only when the state of the descriptor changes. In the example above, step 2 produces an event, which step 3 consumes. Because step 4 did not read all the data, step 5 could block indefinitely.
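
The five steps can be replayed in a few lines. This is a sketch under two assumptions: second_wait_result is a hypothetical helper, and step 5 uses a timeout of 0 so that the edge-triggered case returns instead of blocking.

```c
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: replays steps 1 to 5 on a pipe. Pass 0 for level
 * triggering or EPOLLET for edge triggering. Returns the result of the
 * second epoll_wait (step 5), or -1 if any earlier step failed. */
int second_wait_result(uint32_t extra_flags)
{
    int pipefd[2];
    char buf[2048];
    struct epoll_event ev = {0}, out;
    int result = -1;

    if (pipe(pipefd) == -1)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd == -1)
        goto close_pipe;

    ev.events = EPOLLIN | extra_flags;
    ev.data.fd = pipefd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == -1) /* step 1 */
        goto close_all;

    memset(buf, 'x', sizeof(buf));
    if (write(pipefd[1], buf, 2048) != 2048)                  /* step 2 */
        goto close_all;
    if (epoll_wait(epfd, &out, 1, -1) != 1)                   /* step 3 */
        goto close_all;
    if (read(pipefd[0], buf, 1024) != 1024)                   /* step 4 */
        goto close_all;
    result = epoll_wait(epfd, &out, 1, 0);                    /* step 5 */

close_all:
    close(epfd);
close_pipe:
    close(pipefd[0]);
    close(pipefd[1]);
    return result;
}
```

Calling second_wait_result(0) reports the descriptor again because 1 KB is still unread, while second_wait_result(EPOLLET) reports nothing, which is exactly the situation in which a blocking call would hang.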

The Linux manual recommends using edge triggering as follows:

  1. Use it with non-blocking descriptors
  2. Wait for the next event only after read or write has returned EAGAIN
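
A read loop that follows both recommendations might look like the sketch below. The names drain_fd and drain_demo are hypothetical, and the data is simply discarded where a real program would process it.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: reads from a non-blocking descriptor until read()
 * fails with EAGAIN or hits end of file. Returns the number of bytes
 * consumed, or -1 on any other error. */
ssize_t drain_fd(int fd)
{
    char buf[512];
    ssize_t total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            total += n;       /* a real program would process buf here */
            continue;
        }
        if (n == 0)
            return total;     /* end of file: the writer closed its end */
        if (errno == EINTR)
            continue;         /* interrupted by a signal, retry */
        if (errno == EAGAIN)
            return total;     /* nothing left: safe to wait for the next event */
        return -1;
    }
}

/* Demo: put 2 KB into a non-blocking pipe and drain all of it. */
ssize_t drain_demo(void)
{
    int pipefd[2];
    char data[2048];
    ssize_t drained = -1;

    if (pipe(pipefd) == -1)
        return -1;
    memset(data, 'x', sizeof(data));
    if (fcntl(pipefd[0], F_SETFL, O_NONBLOCK) == 0 &&
        write(pipefd[1], data, sizeof(data)) == (ssize_t)sizeof(data))
        drained = drain_fd(pipefd[0]);
    close(pipefd[0]);
    close(pipefd[1]);
    return drained;
}
```

Without O_NONBLOCK the final read would block instead of returning EAGAIN, which is why the first recommendation goes hand in hand with the second.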

By contrast, when used in level-triggered mode (the default when EPOLLET is not specified), epoll is simply a faster poll and can be used wherever poll is used, because the two share the same semantics.

In general, the difference between ET and LT lies in the conditions that trigger an event. LT matches the usual way of thinking about programs (it triggers whenever the condition holds), while ET is stricter (it triggers only when a change occurs). ET therefore demands more of the user and is, in theory, more efficient. It is worth mentioning that the Java NIO selector is implemented differently depending on the operating system. On Linux 2.6 and later it uses epoll in level-triggered mode. The additional EpollEventLoop provided in Netty uses edge triggering.

While a descriptor is being monitored, several events may occur on it in succession. The user can set the EPOLLONESHOT option to tell epoll to disable the descriptor after the first event has been delivered. With EPOLLONESHOT set, the user has to re-arm the descriptor with EPOLL_CTL_MOD once the event has been handled. This option is most useful in concurrent environments.
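
The effect of EPOLLONESHOT can be observed on a pipe. In this sketch oneshot_demo is a hypothetical helper that records the result of three consecutive epoll_wait calls, each with a timeout of 0.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: fills results[] with the return values of three
 * epoll_wait calls on a one-shot registration. Returns 0 on success. */
int oneshot_demo(int results[3])
{
    int pipefd[2];
    struct epoll_event ev = {0}, out;
    int rc = -1;

    if (pipe(pipefd) == -1)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd == -1)
        goto close_pipe;

    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = pipefd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == -1)
        goto close_all;
    if (write(pipefd[1], "x", 1) != 1)
        goto close_all;

    /* The event is delivered once ... */
    results[0] = epoll_wait(epfd, &out, 1, 0);
    /* ... then the descriptor is disabled, although the byte is still unread. */
    results[1] = epoll_wait(epfd, &out, 1, 0);

    /* Re-arm the descriptor with EPOLL_CTL_MOD. */
    if (epoll_ctl(epfd, EPOLL_CTL_MOD, pipefd[0], &ev) == -1)
        goto close_all;
    results[2] = epoll_wait(epfd, &out, 1, 0);
    rc = 0;

close_all:
    close(epfd);
close_pipe:
    close(pipefd[0]);
    close(pipefd[1]);
    return rc;
}
```

The three results should be 1, 0 and 1: delivered once, silent while disabled, and delivered again after re-arming.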

When multiple processes or threads are blocked in epoll_wait on the same epoll instance, registering a descriptor with the EPOLLET option ensures that only one of them is woken for each event, which avoids the "thundering herd" problem.

Limits on epoll watches

The /proc/sys/fs/epoll/max_user_watches setting limits the total number of descriptors that a single user can register across all epoll instances.
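
The current limit can be read like any other file under /proc. A minimal sketch, assuming /proc is mounted; read_max_user_watches is a hypothetical helper name.

```c
#include <stdio.h>

/* Hypothetical helper: returns the per-user watch limit, or -1 if the
 * file cannot be opened or parsed (for example when /proc is not mounted). */
long long read_max_user_watches(void)
{
    FILE *f = fopen("/proc/sys/fs/epoll/max_user_watches", "r");
    long long value = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%lld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```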

Example of using edge triggering

Because level triggering is used in much the same way as poll, only an edge-triggered example is given here:

    #define MAX_EVENTS 10
    struct epoll_event ev, events[MAX_EVENTS];
    struct sockaddr_storage addr;
    socklen_t addrlen;
    int listen_sock, conn_sock, nfds, epollfd, n;

    /* The calls to socket, bind, and listen that set up listen_sock are omitted */

    // Create the epoll instance; the program should close epollfd when it is done
    epollfd = epoll_create1(0);
    if (epollfd == -1)
    {
        perror("epoll_create1");
        exit(EXIT_FAILURE);
    }

    ev.events = EPOLLIN; // The event of interest is readability
    ev.data.fd = listen_sock; // Store the listening socket as the user data

    // Register the event
    if (epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &ev) == -1)
    {
        perror("epoll_ctl: listen_sock");
        exit(EXIT_FAILURE);
    }

    for (;;)
    {
        // Wait for descriptors to become ready; -1 means no timeout
        nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);
        if (nfds == -1)
        {
            perror("epoll_wait");
            exit(EXIT_FAILURE);
        }

        for (n = 0; n < nfds; ++n)
        {
            if (events[n].data.fd == listen_sock)
            {
                // The listening socket is ready: accept the new connection
                addrlen = sizeof(addr);
                conn_sock = accept(listen_sock,
                                  (struct sockaddr *)&addr, &addrlen);
                if (conn_sock == -1)
                {
                    perror("accept");
                    exit(EXIT_FAILURE);
                }
                // Set the new connection to non-blocking mode
                // (setnonblocking is a user-defined helper)
                setnonblocking(conn_sock);
                // The event of interest is readability, edge-triggered
                ev.events = EPOLLIN | EPOLLET;
                // Store the newly established connection as the user data
                ev.data.fd = conn_sock;
                // Register the event
                if (epoll_ctl(epollfd, EPOLL_CTL_ADD, conn_sock,
                              &ev) == -1)
                {
                    perror("epoll_ctl: conn_sock");
                    exit(EXIT_FAILURE);
                }
            }
            else
            {
                // An established connection is ready.
                // do_use_fd (user-defined) should read or write the fd until
                // EAGAIN, then record its progress and resume from there the
                // next time the fd is ready
                do_use_fd(events[n].data.fd);
            }
        }
    }
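
The example leaves setnonblocking undefined. One possible implementation, shown only as a sketch, sets O_NONBLOCK through fcntl. Returning an int status is an assumption (the example above ignores the result), and setnonblocking_demo is a hypothetical check added for illustration.

```c
#include <fcntl.h>
#include <unistd.h>

/* One possible setnonblocking: adds O_NONBLOCK to the descriptor's
 * file status flags. Returns 0 on success, -1 on failure. */
int setnonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Demo: returns 1 if O_NONBLOCK is set on a fresh pipe after the call,
 * 0 if it is not, -1 on error. */
int setnonblocking_demo(void)
{
    int pipefd[2];
    int result = -1;

    if (pipe(pipefd) == -1)
        return -1;
    if (setnonblocking(pipefd[0]) == 0) {
        int flags = fcntl(pipefd[0], F_GETFL, 0);
        if (flags != -1)
            result = (flags & O_NONBLOCK) ? 1 : 0;
    }
    close(pipefd[0]);
    close(pipefd[1]);
    return result;
}
```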

In edge-triggered mode, an event is not always handled the moment it arrives; sometimes the read or write has to wait until some other condition is met. In that case you can register EPOLLIN | EPOLLOUT together once to improve performance, instead of repeatedly calling epoll_ctl with EPOLL_CTL_MOD to switch back and forth between EPOLLIN and EPOLLOUT. You cannot do this in level-triggered mode, because the events of interest keep firing for as long as the descriptor is ready, which causes unnecessary overhead.
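
A small experiment on a UNIX socket pair illustrates the combined registration. It is a sketch only: inout_demo is a hypothetical helper, and it records the event masks of three waits instead of doing real I/O.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: registers one end of a socket pair once with
 * EPOLLIN | EPOLLOUT | EPOLLET and stores the reported event masks:
 *   masks[0] first wait, masks[1] second wait with no change in between,
 *   masks[2] wait after the peer has written one byte.
 * A mask of 0 means that nothing was reported. Returns 0 on success. */
int inout_demo(uint32_t masks[3])
{
    int sv[2];
    struct epoll_event ev = {0}, out;
    int rc = -1;

    masks[0] = masks[1] = masks[2] = 0;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd == -1)
        goto close_socks;

    ev.events = EPOLLIN | EPOLLOUT | EPOLLET;
    ev.data.fd = sv[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev) == -1)
        goto close_all;

    if (epoll_wait(epfd, &out, 1, 0) == 1)
        masks[0] = out.events;   /* the socket starts out writable */
    if (epoll_wait(epfd, &out, 1, 0) == 1)
        masks[1] = out.events;   /* no state change, so no new event */

    if (write(sv[1], "x", 1) != 1)
        goto close_all;
    if (epoll_wait(epfd, &out, 1, 1000) == 1)
        masks[2] = out.events;   /* the incoming data raises a new event */
    rc = 0;

close_all:
    close(epfd);
close_socks:
    close(sv[0]);
    close(sv[1]);
    return rc;
}
```

Because the registration is edge-triggered, the permanently writable socket is reported only once, so keeping EPOLLOUT registered costs nothing until the state changes again.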

Why is epoll faster than poll?

According to other blog posts on the Internet, epoll is faster than poll for the following reasons:

  1. The descriptor set is not passed to the kernel on every wait. Descriptors are registered once with the epoll instance, which maintains the whole set internally
  2. The epoll instance uses a red-black tree and kernel caches to maintain the descriptor set, which makes registering and removing descriptors efficient
  3. epoll maintains the ready list internally through a callback mechanism. When a descriptor becomes ready, it is placed on the ready list. A call to epoll_wait only checks whether the ready list is empty: if not, it copies the ready list to user space and empties it; otherwise the caller goes to sleep
  4. There is no need to iterate over all the descriptors to find the ready ones. epoll returns the set of ready descriptors directly

By the way, epoll_wait checks the trigger type of each ready descriptor before returning it. If the descriptor is level-triggered and still has unprocessed data, it is added back to the ready list that was just emptied, so the ready list still contains the descriptor the next time epoll_wait is called. This is the actual reason for the difference in behavior between LT and ET.