As mentioned in the previous article, Redis uses the Epoll model to improve link processing power. In this article, we briefly understand how Redis uses epoll and the implementation principle of epoll from the perspective of source code. Shallow in shallow out ~ find me to communicate
Here are three things you can learn from this article to achieve its purpose:
1. Epoll is a system implementation provided by Linux, with only three core methods
2. Epoll is highly efficient because it is based on red-black tree, bidirectional linked list and event callback mechanism
3. The IO multiplexing of Redis is implemented on Linux with Epoll
Epoll is a multiplexer provided by the Linux kernel. Ask Linux men as usual:
EPOLL(7) Linux Programmer's Manual EPOLL(7) NAME epoll - I/O event notification facility SYNOPSIS #include
DESCRIPTION The epoll API performs a similar task to poll(2): monitoring multiple file descriptors to see if I/O is possible on any of them. The epoll API can be used either as an edge-triggered or a level-triggered interface and scales well to large numbers of watched file descriptors. The following system calls are provided to create and manage an epoll instance: * epoll_create(2) creates an epoll instance and returns a file descriptor referring to that instance. (The more recent epoll_create1(2) extends the functionality of epoll_create(2).) * Interest in particular file descriptors is then registered via epoll_ctl(2). The set of file descriptors currently registered on an epoll instance is sometimes called an epoll set. * epoll_wait(2) waits for I/O events, blocking the calling thread if no events are currently available.
Copy the code
Core method
Man tells us the definition of epoll in sys/epoll. H, check the core function has three: (online code elixir.bootlin.com/linux/v4.19…).
epoll_create
epoll_create(int size)
Core functions:
-
Create an epoll file descriptor
-
Create eventPoll, which contains a red-black tree cache and a two-way linked list
The size parameter does not limit the maximum number of file descriptors epoll can listen on, but is merely a suggestion for the initial allocation of internal data structures by the kernel. After Linux 2.6.8, the size parameter is ignored, but a number greater than 0 must be passed.
When epoll_create is called, it takes up one fd value. On Linux, you can view /proc/? /fd/ file descriptor. You need to call close to close it.
Eventpoll code snippet:
struct eventpoll {
/*
* This mutex is used to ensure that files are not removed
* while epoll is using them. This is held during the event
* collection loop, the file cleanup path, the epoll file exit* code and the ctl operations. */ struct mutex mtx; /* Wait queue used by sys_epoll_wait() */ wait_queue_head_t wq; /* Wait queue used by file->poll() */ wait_queue_head_t poll_wait; /* List of ready file descriptors */ struct list_head rdllist; /* RB tree root used to store monitored fd structs */ struct rb_root_cached RBR; /* * This is a single linked list that chains all the"struct epitem" that
* happened while transferring ready events to userspace w/out
* holding ->wq.lock.
*/
struct epitem *ovflist;
/* wakeup_source used when ep_scan_ready_list is running */
struct wakeup_source *ws;
/* The user that created the eventpoll descriptor */
struct user_struct *user;
struct file *file;
/* used to optimize loop detection check */
int visited;
struct list_head visited_list_link;
#ifdef CONFIG_NET_RX_BUSY_POLL
/* used to track busy poll napi_id */
unsigned int napi_id;
#endif
};
Copy the code
epollctl
int epollctl(int epfd, int op, int fd, struct epollThe event * event);
Core functions:
-
Performs the op binding on the specified descriptor fd
-
Write the fd to the red-black tree while registering the callback function in the kernel
Op operation type, using three macros EPOLL_CTL_ADD, EPOLL_CTL_DEL, EPOLL_CTL_MOD, respectively to represent the increase, deletion and change of fd monitoring.
epollwait
int epollwait(int epfd, struct epollevent *events, int maxevents, int timeout);
Core functions:
-
Gets the I/O event on an EPFD
The events parameter is the ready event used to get the desired set of events. Maxevents specifies the size of events. Maxevents must be greater than 0 and timeout is the timeout time. Epollwait blocks until a file descriptor fires an event, is interrupted by a signal handler, or timeout times out. The return value is the number of FDS to process.
Working mechanism
-
Build cache (red-black tree) and read list (bidirectional linked list)
-
To monitor the FD (everything is FD, refer to NIO to see also (1) – Linux/IO foundation), the event binding. The event occurs and is placed into the list to be read via callback
-
Blocks to get the list to be read
Execute the process
advantages
-
Epoll creates a red-black tree that stores all FDS, has no size limit, and the complexity of adding, deleting and searching O(logN)
-
Based on callback, the system kernel is used to trigger the event of interest
-
Ready list is a two-line linked list time complexity O(1)
-
The FDS obtained by the application are the FDS that actually generate I/O. Compared with select and poll, which require constant polling to determine whether they are available, it can avoid useless memory copy
Combine with Redis code
There is so much source code that we will only look at the modules relevant to this article
Event processing module AE.c/AE_epoll.c
Network link libraries anet.c and networking.c
C on the server
Create an event manager
The L2702 initServer() of server.c is the redis server startup entry,
First create aeEventLoop object, call aeCreateEventLoop() on L2743, initialize the unready file event table, ready file event table. The events pointer points to the unready file event table, and the FIRED pointer points to the ready file event table.
aeEventLoop *aeCreateEventLoop(int setsize) {
aeEventLoop *eventLoop;
int i;
if ((eventLoop = zmalloc(sizeof(*eventLoop))) == NULL) goto err;
eventLoop->events = zmalloc(sizeof(aeFileEvent)*setsize);
eventLoop->fired = zmalloc(sizeof(aeFiredEvent)*setsize);
if (eventLoop->events == NULL || eventLoop->fired == NULL) goto err;
eventLoop->setsize = setsize;
eventLoop->lastTime = time(NULL);
eventLoop->timeEventHead = NULL;
eventLoop->timeEventNextId = 0;
eventLoop->stop = 0;
eventLoop->maxfd = -1;
eventLoop->beforesleep = NULL;
eventLoop->aftersleep = NULL;
eventLoop->flags = 0;
if (aeApiCreate(eventLoop) == -1) goto err;
/* Events with mask == AE_NONE are not set. So let's initialize the * vector with it. */ for (i = 0; i < setsize; i++) eventLoop->events[i].mask = AE_NONE; return eventLoop; err: if (eventLoop) { zfree(eventLoop->events); zfree(eventLoop->fired); zfree(eventLoop); } return NULL; }Copy the code
Call the aeApiCreate function at ae_epoll. cl39, first create the aeApiState object, initialize the epoll ready event table; Epoll_create is then called to create an epoll instance, and the aeApiState is assigned to the apidata attribute
static int aeApiCreate(aeEventLoop *eventLoop) {
aeApiState *state = zmalloc(sizeof(aeApiState));
if(! state)return- 1; state->events = zmalloc(sizeof(struct epoll_event)*eventLoop->setsize);if(! state->events) { zfree(state);return- 1; } state->epfd = epoll_create(1024); /* 1024 is just a hintfor the kernel */
if (state->epfd == -1) {
zfree(state->events);
zfree(state);
return- 1; } eventLoop->apidata = state;return 0;
}
Copy the code
The binding event
AeFileEvent is a file event structure. For each specific event, there are read and write handlers. Redis calls the aeCreateFileEvent function to register file events for different socket read/write events.
/* File event structure */ typedef struct aeFileEvent { int mask; /* one of AE_(READABLE|WRITABLE|BARRIER) */ aeFileProc *rfileProc; / / read aeFileProc * wfileProc; / / write void * clientData; } aeFileEvent;Copy the code
Server. c L2848 aeCreateFileEvent Creates a file. The ae_epoll.c L73 aeApiAddEvent command is executed
static int aeApiAddEvent(aeEventLoop *eventLoop, int fd, int mask) {
aeApiState *state = eventLoop->apidata;
struct epoll_event ee = {0}; /* avoid valgrind warning */
/* If the fd was already monitored for some event, we need a MOD
* operation. Otherwise we need an ADD operation. */
int op = eventLoop->events[fd].mask == AE_NONE ?
EPOLL_CTL_ADD : EPOLL_CTL_MOD;
ee.events = 0;
mask |= eventLoop->events[fd].mask; /* Merge old events */
if (mask & AE_READABLE) ee.events |= EPOLLIN;
if (mask & AE_WRITABLE) ee.events |= EPOLLOUT;
ee.data.fd = fd;
if (epoll_ctl(state->epfd,op,fd,&ee) == -1) return- 1;return 0;
}Copy the code
AeApiAddEvent calls the system epoll_ctl to register the event
Handle events
The third from last line of server.c calls the aeMain method
void aeMain(aeEventLoop *eventLoop) {
eventLoop->stop = 0;
while(! eventLoop->stop) {if (eventLoop->beforesleep != NULL)
eventLoop->beforesleep(eventLoop);
aeProcessEvents(eventLoop, AE_ALL_EVENTS|AE_CALL_AFTER_SLEEP);
}
}Copy the code
AeProcessEvents () for event and file event processing, aeApiPoll is called in Ae. c L433.
static int aeApiPoll(aeEventLoop *eventLoop, struct timeval *tvp) {
aeApiState *state = eventLoop->apidata;
int retval, numevents = 0;
retval = epoll_wait(state->epfd,state->events,eventLoop->setsize,
tvp ? (tvp->tv_sec*1000 + tvp->tv_usec/1000) : -1);
if (retval > 0) {
int j;
numevents = retval;
for (j = 0; j < numevents; j++) {
int mask = 0;
struct epoll_event *e = state->events+j;
if (e->events & EPOLLIN) mask |= AE_READABLE;
if (e->events & EPOLLOUT) mask |= AE_WRITABLE;
if (e->events & EPOLLERR) mask |= AE_WRITABLE|AE_READABLE;
if(e->events & EPOLLHUP) mask |= AE_WRITABLE|AE_READABLE; eventLoop->fired[j].fd = e->data.fd; eventLoop->fired[j].mask = mask; }}return numevents;
}Copy the code
Epoll_wait blocks to wait for the epoll event to be ready. The timeout time is calculated according to the event that reaches the fastest time. The ready epoll event is then converted to the fired ready event. AeApiPoll is the I/O multiplexing program mentioned above.
conclusion
-
Epoll_create Create ready list
-
Epoll_ctl binds the event, fd to the ready list when the event occurs
-
Epoll_wait reads the ready list
Reference:
Redis6.0.1 source
https://elixir.bootlin.com/linux/v4.19.76/source/fs/eventpoll.c
Man manual
NIO: Linux/IO fundamentals
If you read on wechat, please click the link to follow me. If you read on PC, please scan the code to follow me. Welcome to communicate with me and point out mistakes at any time