1. Overview
This article analyzes the implementation of epoll in Linux, mainly to deepen my understanding of network system calls. Many frameworks in industry build on epoll: the JDK's NIO implementation on Linux, as well as Netty, Redis, and other systems that handle long-lived network connections, all use epoll under the hood. At the end of this article, we will take a brief look at the Redis source code to see how it uses epoll for I/O multiplexing to achieve high concurrency.
2. Implementation
Refer to the official documentation:
The central concept of the epoll API is the epoll instance, an in-kernel data structure which, from a user-space perspective, can be considered as a container for two lists
So essentially epoll is an in-kernel data structure which, from a user-space perspective, can be viewed as maintaining two lists: the interest list of descriptors being monitored, and the ready list of descriptors with pending events. epoll provides three system calls:
- epoll_create initializes the in-kernel data structure and returns a file descriptor. As you know, everything in Unix is a file, so creation returns a file fd. Each time we pass that fd in, the kernel can locate the corresponding epoll data structure.
- epoll_ctl operates on the interest list, which stores the I/O events the user cares about. Registration also does some additional setup; more on this later.
- epoll_wait returns ready events (events of interest) for the application layer to handle.
int epoll_create(int size);
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
struct epoll_event {
__uint32_t events;
epoll_data_t data;
};
The epoll_create method
SYSCALL_DEFINE1(epoll_create1, int, flags)
{
	int error, fd;
	struct eventpoll *ep = NULL;
	struct file *file;

	// Create the internal data structure eventpoll
	error = ep_alloc(&ep);
	// Grab an unused fd
	fd = get_unused_fd_flags(O_RDWR | (flags & O_CLOEXEC));
	// Create an anonymous file whose private_data points at the ep
	file = anon_inode_getfile("[eventpoll]", &eventpoll_fops, ep,
				 O_RDWR | (flags & O_CLOEXEC));
	ep->file = file;
	fd_install(fd, file); // bind the fd to the file
	return fd;

	// (the error checks between the steps above are omitted; on failure:)
out_free_fd:
	put_unused_fd(fd);
out_free_ep:
	ep_free(ep);
	return error;
}
Just a quick word about this method. It returns a file descriptor, which we later use to find the corresponding structure: a region of kernel memory where all of epoll's data lives. That memory region is represented by an eventpoll structure. The logic of the method is as follows:
1. Create the eventpoll structure and initialize its fields.
2. Find an unused fd and create an anonymous file for epoll, pointing the file's ->private_data at the ep. (The file-creation process itself is out of scope here.)
3. Point ep->file at the file; this is just the reverse binding.
4. Associate the fd with the file. This way, a later syscall can map the fd to the file and from there find the ep structure (the memory region). Note that file->private_data is also very important in device drivers: it can point to a custom data structure, which is how a single driver can serve multiple devices with different properties. epoll likewise uses private_data to point to its own structure.
The eventpoll structure looks like this; we will cover the fields in detail as we encounter them.
struct eventpoll {
	spinlock_t lock;
	struct mutex mtx;
	wait_queue_head_t wq; // wait queue used by sys_epoll_wait()
	wait_queue_head_t poll_wait; // wait queue used by file->poll()
	struct list_head rdllist; // list of all ready file descriptors
	struct rb_root rbr; // root of the red-black tree of monitored fds
	struct epitem *ovflist; // while rdllist is locked/being scanned, newly ready items are chained here
	struct wakeup_source *ws; // wakeup_source used while ep_scan_ready_list is running
	struct user_struct *user; // the user that created the eventpoll descriptor
	struct file *file;
	int visited; // used to optimize the loop-detection check
	struct list_head visited_list_link;
};
The epoll_ctl method
This method adds, deletes, and modifies monitored events.
SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
		struct epoll_event __user *, event)
{
	int error;
	int full_check = 0;
	struct fd f, tf;
	struct eventpoll *ep;
	struct epitem *epi;
	struct epoll_event epds;
	struct eventpoll *tep = NULL;

	error = -EFAULT;
	// Copy the user-space epoll_event into the kernel unless this is a delete
	if (ep_op_has_event(op) &&
	    copy_from_user(&epds, event, sizeof(struct epoll_event)))
		goto error_return;

	f = fdget(epfd); // the file behind epfd
	tf = fdget(fd); // the file behind the target fd
	ep = f.file->private_data; // recover the ep created during epoll_create
	epi = ep_find(ep, tf.file, fd); // look the fd up in the ep's red-black tree
	switch (op) {
	case EPOLL_CTL_ADD:
		if (!epi) {
			epds.events |= POLLERR | POLLHUP;
			error = ep_insert(ep, &epds, tf.file, fd, full_check);
		}
		if (full_check)
			clear_tfile_check_list();
		break;
	case EPOLL_CTL_DEL:
		if (epi)
			error = ep_remove(ep, epi);
		break;
	case EPOLL_CTL_MOD:
		if (epi) {
			epds.events |= POLLERR | POLLHUP;
			error = ep_modify(ep, epi, &epds);
		}
		break;
	}
	mutex_unlock(&ep->mtx);
	fdput(tf);
	fdput(f);
	...
	return error;
}
Some of the validation code is omitted above. The core is simply dispatching to a different function for each operation type.
epoll calls ep_find to look up the corresponding epitem (epi) in the red-black tree. For an add, the entry must not already exist; for a remove or modify, it must. The whole process holds a lock. Because the structure is a red-black tree, lookup and insertion are both O(log n), so even in high-concurrency scenarios, registering a watch stays fast. Let's look at the logic of each of these operations.
The ep_insert operation
As the name suggests, this adds a listening event.
static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
		     struct file *tfile, int fd, int full_check)
{
	int error, revents, pwake = 0;
	unsigned long flags;
	long user_watches;
	struct epitem *epi;
	struct ep_pqueue epq;

	user_watches = atomic_long_read(&ep->user->epoll_watches);
	if (unlikely(user_watches >= max_user_watches))
		return -ENOSPC;
	if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
		return -ENOMEM;

	// Construct and fill the epi structure
	INIT_LIST_HEAD(&epi->rdllink);
	INIT_LIST_HEAD(&epi->fllink);
	INIT_LIST_HEAD(&epi->pwqlist);
	epi->ep = ep;
	ep_set_ffd(&epi->ffd, tfile, fd); // stash both tfile and fd in epi->ffd
	epi->event = *event;
	epi->nwait = 0;
	epi->next = EP_UNACTIVE_PTR;
	if (epi->event.events & EPOLLWAKEUP) {
		error = ep_create_wakeup_source(epi);
	} else {
		RCU_INIT_POINTER(epi->ws, NULL);
	}

	epq.epi = epi;
	// Register the poll callback function
	init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);
	// Invoke the file's poll method
	revents = ep_item_poll(epi, &epq.pt);

	spin_lock(&tfile->f_lock);
	list_add_tail_rcu(&epi->fllink, &tfile->f_ep_links);
	spin_unlock(&tfile->f_lock);

	ep_rbtree_insert(ep, epi); // add the new epi to the red-black tree

	spin_lock_irqsave(&ep->lock, flags);
	// The event is already ready, and the epi is not yet on the ready list
	if ((revents & event->events) && !ep_is_linked(&epi->rdllink)) {
		list_add_tail(&epi->rdllink, &ep->rdllist);
		ep_pm_stay_awake(epi);
		// Wake processes waiting for the file to be ready, i.e. callers of epoll_wait
		if (waitqueue_active(&ep->wq))
			wake_up_locked(&ep->wq);
		if (waitqueue_active(&ep->poll_wait))
			pwake++;
	}
	spin_unlock_irqrestore(&ep->lock, flags);

	atomic_long_inc(&ep->user->epoll_watches);
	if (pwake)
		ep_poll_safewake(&ep->poll_wait); // wake processes waiting on the eventpoll file itself

	return 0;
	...
}
1. An epitem (epi) is initialized first; the monitoring of each target file is maintained through an epi. One monitored file corresponds to one epitem, and it is stored in the ep's red-black tree.
struct epitem {
	union {
		struct rb_node rbn; // rb-tree node linking this structure into the eventpoll's red-black tree
		struct rcu_head rcu; // used to free the epitem structure via RCU
	};
	struct list_head rdllink; // ready-list link; chains this epitem onto the eventpoll's rdllist
	struct epitem *next; // used together with eventpoll's ovflist to build the singly linked overflow list
	struct epoll_filefd ffd; // the monitored file descriptor's information; each monitored fd corresponds to one epitem
	int nwait; // number of active wait queues attached during poll
	struct list_head pwqlist; // list of wait-queue hooks for the monitored file
	struct eventpoll *ep; // the eventpoll this epitem belongs to
	struct list_head fllink; // lets one file be watched by multiple epoll instances; links this epitem onto the file's f_ep_links
	struct wakeup_source __rcu *ws; // wakeup_source used when EPOLLWAKEUP is set
	struct epoll_event event; // the monitored event mask and file descriptor
};
2. After initialization, the fd and the file pointer are stored in the epi's ffd field; its main job is to bind the fd (and its file) to the epi.
struct epoll_filefd {
struct file *file;
int fd;
} __packed;
3. Register ep_ptable_queue_proc on epq's pt field (which is a poll_table).
struct ep_pqueue {
poll_table pt;
struct epitem *epi;
};
typedef struct poll_table_struct {
poll_queue_proc _qproc;
unsigned long _key;
} poll_table;
Here epq is a structure that binds an epi to a poll_table. The poll_table mainly carries the ep_ptable_queue_proc function pointer, and _key records the events of interest. Because the epq holds both the epi and the poll_table, when the callback later runs we can recover the enclosing epq from the poll_table's address, and from it the epi. That is the entire purpose of defining this structure.
4. Call the ep_item_poll method. Briefly, it invokes the file's poll method through the virtual file system.
static inline unsigned int ep_item_poll(struct epitem *epi, poll_table *pt)
{
	pt->_key = epi->event.events;
	return epi->ffd.file->f_op->poll(epi->ffd.file, pt) & epi->event.events;
}
Different drivers implement their own poll method; in the case of a TCP socket it is tcp_poll. When ep_item_poll runs here, tcp_poll calls sock_poll_wait(), which in turn invokes the ep_ptable_queue_proc function we registered in the poll_table. Through this mechanism, epoll hooks its own callback into the file's wait queue; later, when data arrives, the protocol stack wakes that wait queue and epoll's callback fires. That is the purpose of ep_ptable_queue_proc.
5. ep_item_poll returns revents, the events currently raised on the fd. If any of them are events we are interested in, the epi is inserted into the ep's rdllist, and if a process is waiting for the file to become ready (i.e. it is sleeping in epoll_wait), it is woken up.
if ((revents & event->events) && !ep_is_linked(&epi->rdllink)) {
	list_add_tail(&epi->rdllink, &ep->rdllist);
	ep_pm_stay_awake(epi);
	// Wake processes waiting for the file to be ready, i.e. callers of epoll_wait
	if (waitqueue_active(&ep->wq))
		wake_up_locked(&ep->wq);
	if (waitqueue_active(&ep->poll_wait))
		pwake++;
}
The ep_ptable_queue_proc method
This function is invoked from inside the file's poll method, via the poll_table we registered. Its job is to hang epoll's callback on the file's wait queue, so that the callback runs when an event later arrives on the file behind the descriptor.
PS: file is the structure representing an open file. Multiple fds can point to the same file structure, and multiple file structures can point to the same inode. In Linux, the contents of a file are described by the inode; a file structure is simply created each time we open the file. That is a concept worth understanding.
static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
				 poll_table *pt)
{
	struct epitem *epi = ep_item_from_epqueue(pt);
	struct eppoll_entry *pwq;

	if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
		// Set ep_poll_callback as the entry's wake function
		init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
		pwq->whead = whead;
		pwq->base = epi;
		// Add the entry (and thus ep_poll_callback) to the wait queue whead
		add_wait_queue(whead, &pwq->wait);
		// Link the entry onto the tail of epi->pwqlist
		list_add_tail(&pwq->llink, &epi->pwqlist);
		epi->nwait++;
	} else {
		epi->nwait = -1; // flag that an error occurred
	}
}

static inline void
init_waitqueue_func_entry(wait_queue_t *q, wait_queue_func_t func)
{
	q->flags = 0;
	q->private = NULL;
	q->func = func;
}
ep_ptable_queue_proc takes three arguments: file is a pointer to the monitored file, whead is the device wait queue for this fd, and pt is the poll_table we passed in when calling the file's poll.
Inside it, an eppoll_entry (the pwq) is introduced. The pwq's main job is to tie the epi to the callback function that will run when an event occurs on that epi's file.
As you can see from the code above, the function first recovers the epi from the pt, then links all three together through the pwq.
Finally, add_wait_queue attaches the eppoll_entry to the fd's device wait queue; that is, it registers epoll's callback.
So the primary goal of this method is to attach the eppoll_entry to the fd's device wait queue. When hardware data arrives from the device, the hardware interrupt handler wakes the waiters on that queue, and the wake function ep_poll_callback is invoked.
The ep_poll_callback method
This function adds the file's epi to the ready queue when the file's event is ready. When the application later calls epoll_wait(), the kernel copies the events on the ready queue into user space and reports them to the application.
static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
	int pwake = 0;
	unsigned long flags;
	struct epitem *epi = ep_item_from_wait(wait);
	struct eventpoll *ep = epi->ep;

	spin_lock_irqsave(&ep->lock, flags);
	// rdllist is currently being scanned, so park the epi on ovflist instead
	if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
		if (epi->next == EP_UNACTIVE_PTR) {
			epi->next = ep->ovflist;
			ep->ovflist = epi;
			if (epi->ws)
				__pm_stay_awake(ep->ws);
		}
		goto out_unlock;
	}
	// If the epi is already on the ready list, don't add it again
	if (!ep_is_linked(&epi->rdllink)) {
		// Insert the epi into the ep's ready queue
		list_add_tail(&epi->rdllink, &ep->rdllist);
		ep_pm_stay_awake_rcu(epi);
	}
	// If anyone is waiting, wake the eventpoll wait queue and the ->poll() wait queue
	if (waitqueue_active(&ep->wq))
		wake_up_locked(&ep->wq); // wake a waiting process if the queue is non-empty
	if (waitqueue_active(&ep->poll_wait))
		pwake++;
out_unlock:
	spin_unlock_irqrestore(&ep->lock, flags);
	if (pwake)
		ep_poll_safewake(&ep->poll_wait);
	if ((unsigned long)key & POLLFREE) {
		list_del_init(&wait->task_list); // remove this wait entry
		smp_store_release(&ep_pwq_from_wait(wait)->whead, NULL);
	}
	return 1;
}
// Check whether the wait queue is non-empty
static inline int waitqueue_active(wait_queue_head_t *q)
{
	return !list_empty(&q->task_list);
}
3. Epoll implementation summary
In fact, the soul of epoll lies in ep_item_poll and ep_poll_callback.
epoll relies on the virtual file system's poll interface: via ep_item_poll it registers ep_poll_callback on the wait queue of the monitored file. When data arrives on that file, the registered function is invoked, and epoll's callback adds the file's epi to the ready queue.
When the user calls epoll_wait(), epoll locks the ready queue and copies its entries out to user space; events that become ready during the copy are parked on ovflist in the meantime.
4. How Redis uses epoll
The epoll-specific implementation lives in ae_epoll.c:
typedef struct aeApiState {
    // epoll instance descriptor
    int epfd;
    // event slots
    struct epoll_event *events;
} aeApiState;
The aeApiCreate method
Redis calls aeCreateEventLoop when initializing the server, and aeCreateEventLoop calls aeApiCreate to create an epoll instance.
static int aeApiCreate(aeEventLoop *eventLoop) {
    aeApiState *state = zmalloc(sizeof(aeApiState));

    if (!state) return -1;
    state->events = zmalloc(sizeof(struct epoll_event)*eventLoop->setsize);
    if (!state->events) {
        zfree(state);
        return -1;
    }
    state->epfd = epoll_create(1024); /* 1024 is just a hint for the kernel */
    if (state->epfd == -1) {
        zfree(state->events);
        zfree(state);
        return -1;
    }
    eventLoop->apidata = state;
    return 0;
}
The aeApiAddEvent method
This method associates an event with epoll, so it calls epoll's ctl syscall.
static int aeApiAddEvent(aeEventLoop *eventLoop, int fd, int mask) {
    aeApiState *state = eventLoop->apidata;
    struct epoll_event ee;
    /* If the fd was already monitored for some event, we need a MOD
     * operation. Otherwise we need an ADD operation. */
    int op = eventLoop->events[fd].mask == AE_NONE ?
            EPOLL_CTL_ADD : EPOLL_CTL_MOD;

    // Register the event with epoll
    ee.events = 0;
    mask |= eventLoop->events[fd].mask; /* Merge old events */
    if (mask & AE_READABLE) ee.events |= EPOLLIN;
    if (mask & AE_WRITABLE) ee.events |= EPOLLOUT;
    ee.data.u64 = 0; /* avoid valgrind warning */
    ee.data.fd = fd;
    if (epoll_ctl(state->epfd,op,fd,&ee) == -1) return -1;
    return 0;
}
This method is called when the Redis server accepts a new client, to register the client's read event.
The prepareClientToWrite method is called when Redis needs to write data to a client; it mainly registers a write event for the corresponding fd. If that registration fails, Redis does not write the data into the buffer. Once the socket becomes writable, the Redis event loop flushes the buffered data to it.
The aeMain method
The main loop of the Redis event handler.
void aeMain(aeEventLoop *eventLoop) {
    eventLoop->stop = 0;
    while (!eventLoop->stop) {
        // If there is a function to run before event processing, run it first
        if (eventLoop->beforesleep != NULL)
            eventLoop->beforesleep(eventLoop);
        // Start processing events
        aeProcessEvents(eventLoop, AE_ALL_EVENTS);
    }
}
This method eventually calls epoll_wait() (via aeApiPoll) to fetch ready events and dispatch their handlers.