1. Foreword

It is Saturday again, but this week was a bit busy and no new article got written. To avoid missing the schedule, I dug up an early article, re-edited and re-formatted it, and am publishing it again; it had very few readers back then, so it can more or less pass as new (a slightly sophistic argument, I admit)…
Today let's take a look at an important foundation of high-concurrency implementations:
I/O multiplexing and the principles behind epoll.
In this article you will learn the following:
  • The concept of I/O multiplexing
  • I/O multiplexing tools before epoll
  • The epoll "three-stage rocket"
  • The low-level implementation of epoll
  • LT mode & ET mode
  • A Tencent interview question
  • The epoll thundering-herd problem


A friendly reminder: this article goes into a lot of detail and can be a bit dense; it may take a few readings to fully digest.


2. Getting familiar with multiplexing and I/O multiplexing

Before we look at epoll, let's clarify the concept of multiplexing: what does I/O multiplexing really mean?

2.1 The Concept of Multiplexing and Resource Characteristics

2.1.1 The concept of multiplexing

Multiplexing is not a new technology but a design idea. In communications and hardware design there are frequency-division, time-division, wavelength-division, and code-division multiplexing, and everyday life is full of multiplexing scenarios as well, so don't be intimidated by the terminology.

In essence, multiplexing resolves the imbalance between limited resources and many users, maximizing utilization so that more work gets done.

2.1.2 Releasable vs. Non-releasable Resources

A non-releasable scenario: a ventilator in an ICU ward is a limited resource. Once a patient occupies it, they cannot give it up until they are out of danger, so several patients in the same condition cannot take turns using it.

A releasable scenario: other resources, such as medical staff, can monitor multiple patients at the same time. In theory there is no situation where one patient occupies the medical staff permanently and never releases them.

So the question to ask is: can the resource shared by multiple I/O streams, namely the processing thread, be released?

2.1.3 Understanding I/O Multiplexing

I/O: in computing, I/O generally includes disk I/O and network I/O. The I/O in "I/O multiplexing" mainly refers to network I/O. In Linux everything is a file, so network I/O is usually described in terms of file descriptors (fd).

What is actually multiplexed: it is not the file descriptors themselves that are reused but the task-processing thread, so a simple way to put it is that multiple I/O streams (fds) share one processing thread.

Why I/O multiplexing is feasible: the basic operations of an I/O request are read and write. Because of the nature of network interaction, there is inevitably waiting; over the lifetime of a connection an fd alternates between being readable or writable and being idle.

In short, I/O multiplexing coordinates multiple fds, each of which can release the shared resource while idle, so that they take turns using the task-processing thread to complete their communication, achieving the many-fds-to-one-thread scenario.

In real life, I/O multiplexing is like one shepherd managing hundreds of sheep:

Many sheep sharing a single shepherd

2.1.4 The Background of I/O Multiplexing

Business needs are the driving force behind technological evolution.


In the early days, when network concurrency was very low, even a process-per-request model could meet the requirements. As concurrency grew, that model became a bottleneck, which spurred the implementation and adoption of I/O multiplexing mechanisms.
An efficient I/O multiplexing mechanism should satisfy the following: the coordinator consumes as few system resources as possible, fds wait as little as possible, as many fds as possible are supported, the task-processing thread is idle as little as possible, and the work gets done efficiently.

Aside: the paragraph above may read a bit abstractly; the simple idea is to let the task-processing thread handle more network connections with less resource consumption. I/O multiplexing tools have kept evolving, and a comparison shows this principle runs through all of them.

With the basic concept of I/O multiplexing in hand, let's look at the I/O multiplexing tools that appeared in Linux over time and their characteristics, to deepen our understanding.


3. Overview of Linux I/O multiplexing tools

On Linux, select, poll, and epoll appeared in succession. FreeBSD's kqueue is also an excellent I/O multiplexing tool, and its principle is very similar to epoll's. I won't go into the implementation details of select and poll here.

3.1 The Pioneer: select

select appeared very early; its external interface is defined as follows:

/* According to POSIX.1-2001 */
#include <sys/select.h>

/* According to earlier standards */
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

int select(int nfds, fd_set *readfds, fd_set *writefds,
           fd_set *exceptfds, struct timeval *timeout);
void FD_CLR(int fd, fd_set *set);
int  FD_ISSET(int fd, fd_set *set);
void FD_SET(int fd, fd_set *set);
void FD_ZERO(fd_set *set);

3.1.1 Official Tips

As the first I/O multiplexing system call, select stores the monitored fds in a bitmap (fd_set) manipulated through a set of macros. The default size is 1024 bits, so problems can arise when an fd value exceeds 1024.

Macro: int FD_SETSIZE


The value of this macro is the maximum number of file descriptors that a fd_set object can hold information about. On systems with a fixed maximum number, FD_SETSIZE is at least that number. On some systems, including GNU, there is no absolute limit on the number of descriptors open, but this macro still has a constant value which controls the number of bits in an fd_set; if you get a file descriptor with a value as high as FD_SETSIZE, you cannot put that descriptor into an fd_set.


A brief explanation of what this passage means: when an fd's value is 1024 or greater it cannot be put into an fd_set, because the underlying storage is a fixed-size bitmap, so the official advice is not to go beyond 1024. The trouble is that an application cannot control the numeric values of the fds it is given. I looked into this some time ago; the conclusion was that the system has its own strategy for handing out fd values (it picks the lowest unused number), so whether you cross 1024 depends on how many descriptors the process has open. I never fully got to the bottom of it, so I only mention it here as a pitfall to watch out for.
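For readers who have not used select before, here is a minimal sketch (not from the original article) of how the fd_set macros fit together; listen_fd is an assumed, already-created socket and error handling is kept to the bare minimum:

#include <stdio.h>
#include <sys/select.h>

/* Minimal select sketch: watch one fd for readability with a 5-second timeout. */
int wait_for_readable(int listen_fd) {
    fd_set readfds;
    struct timeval tv = { .tv_sec = 5, .tv_usec = 0 };

    FD_ZERO(&readfds);             /* clear the bitmap              */
    FD_SET(listen_fd, &readfds);   /* mark the fd we care about     */

    /* nfds must be the highest-numbered fd plus one; the whole bitmap is
     * copied into the kernel on every call and scanned linearly there. */
    int ready = select(listen_fd + 1, &readfds, NULL, NULL, &tv);
    if (ready > 0 && FD_ISSET(listen_fd, &readfds))
        printf("listen_fd is readable\n");
    return ready;                  /* <0 error, 0 timeout, >0 number ready */
}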

3.1.2 Problems and a Fair Evaluation

Due to the limitations of the underlying implementation, select has some problems, including:
  • Neither the number of fds nor their values can exceed 1024, so high concurrency is impossible
  • Finding which fds are readable requires an O(n) scan over the whole fd set, which is inefficient
  • It involves a lot of copying between user space and kernel space
  • The fd set must be re-passed to the kernel on every call, so the event registration work is repeated redundantly
select implements I/O multiplexing in a fairly naive way, pushing concurrency into the thousands, but the cost and flexibility of the approach leave room for improvement.
In any case, as the pioneer of I/O multiplexing, select was a huge step forward and pointed the way for further optimization, so don't dismiss it out of hand.

3.2 The Successor: epoll

Before epoll came along, poll was an improvement on select, but in essence it did not change much, so let's skip poll and go straight to epoll.
Epoll has been around for well over a decade: it first appeared in kernel 2.5.44, was polished in the 2.6.x series to make the code more compact, and later gained new settings to address corner cases raised by users.
The third edition of Unix Network Programming (2003) does not cover epoll, only select and poll, since epoll was brand new and not yet in a stable kernel at the time. Epoll went on to solve the problems of select one by one:
  • There is no limit on the number of fds (poll also fixed this)
  • The bitmap array is replaced by new structures that can store multiple event types
  • There is no need to re-copy the fd set on every call; fds are added once and deleted when no longer needed
  • Event-driven notification avoids scanning all fds for read/write events
Where epoll stands today:


The advent of epoll dramatically raised the level of concurrency a single machine can handle, enough to cope with the C10K problem, and even the later arrival of true asynchronous I/O has not (yet) shaken epoll's position.
Because epoll comfortably handles tens of thousands of concurrent connections, it already covers most of today's scenarios; asynchronous I/O is excellent, but it is harder to program against than epoll.

4. Meet epoll

Epoll inherits select's style and exposes a very simple interface to the user, one that is simple to call but not simplistic. Let's take a look.

4.1 Basic APIs and Data Structures of epoll

Epoll consists of the epoll_data union, the epoll_event struct, and three APIs, as follows:

// Data structures
typedef union epoll_data {
    void     *ptr;
    int       fd;
    uint32_t  u32;
    uint64_t  u64;
} epoll_data_t;

struct epoll_event {
    uint32_t     events;   /* Epoll events */
    epoll_data_t data;     /* User data variable */
};

// APIs
int epoll_create(int size);
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);

4.2 A Technical Look at the epoll Three-Stage Rocket

  • epoll_create creates a set of epoll-related structures in the kernel and returns a handle (an fd) to user space; all subsequent operations are based on this fd. The size parameter originally hinted to the kernel how many fds would be monitored (a poor hint could mean extra copying and resizing), but the kernel has long ignored it (since 2.6.8 it only needs to be greater than zero);

  • epoll_ctl adds, modifies, or deletes an fd on the epfd returned by epoll_create. epoll_event is the structure exchanged between user space and the kernel: it defines which event types user space cares about and carries the epoll_data payload returned when an event fires;

  • epoll_wait blocks waiting for the kernel to return read/write events. epfd is again the value returned by epoll_create; events points to an array of epoll_event structures that receives the results; maxevents tells the kernel the maximum number of events to return and must not exceed the size of the array events points to;

One data structure runs through all three APIs: epoll_event, which acts as the user-space proxy for the fd being monitored; all subsequent operations on that fd are expressed through this structure.
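As a small illustrative sketch (not part of the original article) of how the epoll_event structure and the epoll_data union are typically filled in, the code below registers an assumed connection fd and stores a pointer to a hypothetical per-connection struct in data.ptr, which epoll_wait later hands back untouched:

#include <sys/epoll.h>

/* Hypothetical per-connection state; the name conn_t is illustrative. */
struct conn_t {
    int fd;
    /* ... buffers, parser state, etc. ... */
};

/* Register a connection with the epoll instance and attach our state to it. */
static int watch_conn(int epfd, struct conn_t *conn) {
    struct epoll_event ev;
    ev.events   = EPOLLIN;   /* interested in readability          */
    ev.data.ptr = conn;      /* returned verbatim by epoll_wait()  */
    return epoll_ctl(epfd, EPOLL_CTL_ADD, conn->fd, &ev);
}

/* After epoll_wait() fills events[]:
 *     struct conn_t *c = events[n].data.ptr;   // no extra lookup table needed
 */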

4.3 An Everyday Analogy for the epoll Three-Stage Rocket

Maybe the above description is a little abstract, but here’s a real world example to help you understand:
  • The epoll_create scenario


    In the first week of university, as class monitor you have to handle registration chores for the whole class. You go to the student affairs office and say, "I am the monitor of class xx, major xx, college xx." The staff member verifies your identity and gives you a certificate that you will need for everything that follows (this corresponds to epoll_create asking the kernel to set up the epfd structure; the kernel returns the epfd handle for you to use).
  • The epoll_ctl scenario


    With the certificate in hand you go to the service hall, where the staff tell you to write down, for every student who needs something done, what exactly needs doing. So the monitor writes an entry for each student: Li Ming needs lab access enabled, Sun Daxiong needs a swimming card… and hands the list to the staff (that is, telling the kernel which fd cares about which events).
  • The epoll_wait scenario


    You then wait outside the office with your certificate. From time to time the loudspeaker announces: "Monitor of class xx, Sun Daxiong's swimming card is ready for pickup, Li Ming's lab access card is ready…" and the monitor goes and handles each item (that is, calling epoll_wait to wait for the kernel to report read/write events and then processing them).

4.4 The Official epoll Demo

The official demo can be found in man epoll; it is only about 50 lines, but packed with substance:

#define MAX_EVENTS 10
struct epoll_event ev, events[MAX_EVENTS];
int listen_sock, conn_sock, nfds, epollfd, n;

/* Set up listening socket, 'listen_sock' (socket(), bind(), listen()) */

epollfd = epoll_create(10);
if (epollfd == -1) {
    perror("epoll_create");
    exit(EXIT_FAILURE);
}

ev.events = EPOLLIN;
ev.data.fd = listen_sock;
if (epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &ev) == -1) {
    perror("epoll_ctl: listen_sock");
    exit(EXIT_FAILURE);
}

for (;;) {
    nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);
    if (nfds == -1) {
        perror("epoll_wait");
        exit(EXIT_FAILURE);
    }

    for (n = 0; n < nfds; ++n) {
        if (events[n].data.fd == listen_sock) {
            /* The listening socket has a new connection */
            conn_sock = accept(listen_sock,
                               (struct sockaddr *) &local, &addrlen);
            if (conn_sock == -1) {
                perror("accept");
                exit(EXIT_FAILURE);
            }
            setnonblocking(conn_sock);
            ev.events = EPOLLIN | EPOLLET;
            ev.data.fd = conn_sock;
            if (epoll_ctl(epollfd, EPOLL_CTL_ADD, conn_sock,
                          &ev) == -1) {
                perror("epoll_ctl: conn_sock");
                exit(EXIT_FAILURE);
            }
        } else {
            /* Read/write on an already-established connection */
            do_use_fd(events[n].data.fd);
        }
    }
}

Note: in the epoll_wait loop you must distinguish between a new-connection event on the listening socket and a read/write event on an already-established connection, and handle them separately.

5. The low-level details of Epoll

Epoll's implementation revolves around two key data structures: epitem and eventpoll.
Roughly speaking, one epitem corresponds to each fd being monitored from user space, while eventpoll is the structure created to manage all the fds being monitored. Let's look at the epoll data structures from the part to the whole, from the inside out.

5.1 Underlying data structures

Red-black tree node definition:

#ifndef _LINUX_RBTREE_H
#define _LINUX_RBTREE_H
#include <linux/kernel.h>
#include <linux/stddef.h>
#include <linux/rcupdate.h>

struct rb_node {
  unsigned long  __rb_parent_color;
  struct rb_node *rb_right;
  struct rb_node *rb_left;
} __attribute__((aligned(sizeof(long))));
/* The alignment might seem pointless, but allegedly CRIS needs it */
struct rb_root {
  struct rb_node *rb_node;
};

Epitem definition:

struct epitem {
  struct rb_node  rbn;
  struct list_head  rdllink;
  struct epitem  *next;
  struct epoll_filefd  ffd;
  int  nwait;
  struct list_head  pwqlist;
  struct eventpoll  *ep;
  struct list_head  fllink;
  struct epoll_event  event;
};

Eventpoll definition:

struct eventpoll {
    spinlock_t lock;
    struct mutex mtx;
    wait_queue_head_t wq;          /* wait queue used by epoll_wait() */
    wait_queue_head_t poll_wait;   /* wait queue used when the epoll fd itself is polled */
    struct list_head rdllist;      /* doubly linked list of ready fds */
    struct rb_root rbr;            /* root of the red-black tree of monitored fds */
    struct epitem *ovflist;        /* overflow list used while rdllist is being transferred */
};

5.2 Underlying invocation procedure

epoll_create creates an object of type struct eventpoll and returns a corresponding file descriptor, which the application then uses for all subsequent epoll calls from user space. Inside the kernel, that file descriptor is used to locate the eventpoll object and perform the requested operation, bridging user space and kernel space.
The epoll_ctl add path (ep_insert in the kernel) does the following:
  • Creates and initializes an object of type struct epitem and associates it with both the monitored file and the eventpoll object;

  • Adds the struct epitem object to the red-black tree inside eventpoll so it can be managed there;

  • Adds the struct epitem to the wait queue of the file being monitored and registers the callback to invoke when the event becomes ready; in epoll this callback is ep_poll_callback();

  • When ep_poll_callback() runs and eventpoll's ovflist member is not equal to EP_UNACTIVE_PTR, it means rdllist is currently being scanned; the epitem for the ready event is therefore parked on ovflist temporarily, and once the rdllist scan finishes the elements on ovflist are moved onto rdllist.

The figure below shows the relationship among the red-black tree, the doubly linked list, and the epitems:

5.3 A Commonly Misunderstood Data Copy

A widely held misconception:

When epoll_wait returns ready events, epoll uses shared memory: both user space and the kernel point at the same ready list, so the cost of a memory copy is avoided.

Shared memory? It doesn't exist!
The claim that epoll_wait uses shared memory to speed up the user-kernel data exchange and avoid a memory copy is not borne out by the 2.6 kernel source; the copy is in fact performed as follows:

revents = ep_item_poll(epi, &pt);   // Get the ready events for this epitem
if (revents) {
    // __put_user copies the event back into the user-space buffer
    if (__put_user(revents, &uevent->events) ||
        __put_user(epi->event.data, &uevent->data)) {
        list_add(&epi->rdllink, head);   // If the copy fails, put the item back
        ep_pm_stay_awake(epi);
        return eventcnt ? eventcnt : -EFAULT;
    }
    eventcnt++;
    uevent++;
    if (epi->event.events & EPOLLONESHOT)
        epi->event.events &= EP_PRIVATE_BITS;   // EPOLLONESHOT handling
    else if (!(epi->event.events & EPOLLET)) {
        // LT mode: re-add the item to rdllist so it is reported again next time
        list_add_tail(&epi->rdllink, &ep->rdllist);
        ep_pm_stay_awake(epi);
    }
}

6. LT and ET modes

Epoll's two modes are left for users to choose between, and the choice is a key issue.

6.1 Simple understanding of LT/ET

LT (level-triggered) mode supports both blocking and non-blocking sockets; ET (edge-triggered) mode supports only non-blocking sockets and is more efficient than LT, but harder to use correctly.
In both LT and ET modes, events are retrieved with epoll_wait. In LT mode, after an event is copied to the user program, if it is not processed, or not fully processed, it will be reported again on the next call.
In ET mode, an event that is not processed, or not fully processed, will not be reported again, which avoids repeated reminders but puts stricter requirements on how the user program reads and writes.

6.2 Deep understanding of LT/ET

The explanation above can be found in any article online, yet it still doesn't tell you how to actually use LT and ET well.

6.2.1 Reads and Writes in LT Mode

When the socket is idle its send buffer is not full, so if the fd's write event stays registered, epoll will keep notifying you that it is writable.
The usual practice is: once you are sure there is no more data to send, remove the fd's write-event registration from the epoll set, and add it back when you need to write again, and so on.
There is no such thing as a free lunch, and you cannot be reminded endlessly for free: register interest in writability only when you actually have data to write, otherwise you will be notified of writable events continuously.
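A minimal sketch of that pattern, assuming epfd and fd already exist and the fd was originally registered for EPOLLIN (the helper name is mine, not from the article):

#include <sys/epoll.h>

/* Sketch only: switch write-event notification on or off for an LT-mode fd. */
static int set_want_write(int epfd, int fd, int want_write) {
    struct epoll_event ev;
    ev.events  = EPOLLIN | (want_write ? EPOLLOUT : 0);
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);   /* re-register with the new interest set */
}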

6.2.2 Reads and Writes in ET Mode

For reads, if you do not drain all the data from the socket, epoll will not notify you of a read event again, so in ET mode you must keep reading until the data is exhausted; in practice this is not hard to do.
For writes, the developer must keep writing until the send buffer is full; epoll will then notify a write event the next time the send buffer changes from full to not full.
In other words, in ET mode events fire only on state changes: a read event is reported when the receive buffer goes from empty to having data, and a write event is reported when the send buffer goes from full to having free space.
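A hedged sketch of the ET-mode read loop implied above, assuming fd is non-blocking and handle_bytes() is a hypothetical application callback:

#include <errno.h>
#include <unistd.h>

/* Hypothetical callback that consumes the bytes just read. */
void handle_bytes(int fd, const char *buf, size_t n);

/* In ET mode, keep reading until EAGAIN, otherwise leftover data will
 * never be reported again. */
static void drain_read(int fd) {
    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            handle_bytes(fd, buf, (size_t)n);
        } else if (n == 0) {
            break;    /* peer closed the connection */
        } else if (errno == EINTR) {
            continue; /* interrupted: retry */
        } else {
            break;    /* EAGAIN/EWOULDBLOCK: drained; anything else: real error */
        }
    }
}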

6.2.3 A Tencent interview question

Speaking of which, here is an interview question:

When a socket is writable, the writable event keeps being triggered over and over. How should this be handled?

A very good question indeed! Let's unpack what it is really asking.
It probes a deep understanding of LT and ET, and it confirms the LT-mode write problem mentioned above.
Common practice:


When there is data to write to a socket, add the socket to epoll and wait for a writable event; on receiving the writable event, call write or send to push the data; once all the data has been written, remove the socket descriptor from the epoll set. The drawback is that the descriptor has to be added and removed repeatedly.
Improved practices:


Write (or send) directly first; only when the call returns EAGAIN is the socket added to epoll, and the remaining data is sent after the writable event arrives; once everything is sent, the socket is removed from epoll again. The improved approach amounts to assuming the socket is writable most of the time: don't register with epoll unless writing actually fails.
Both methods are workarounds for the flood of write-event notifications in LT mode; ET mode handles this directly without requiring any user-level patching.
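A hedged sketch of the improved practice, assuming fd is non-blocking, epfd is the epoll instance, and the fd was registered with EPOLLIN in LT mode; names and error handling are simplified:

#include <errno.h>
#include <sys/epoll.h>
#include <sys/socket.h>

/* Try to send immediately; fall back to epoll only when the kernel buffer
 * is full (EAGAIN). Returns the number of bytes actually sent. */
static ssize_t send_or_arm(int epfd, int fd, const char *buf, size_t len) {
    size_t off = 0;
    while (off < len) {
        ssize_t n = send(fd, buf + off, len - off, 0);
        if (n > 0) {
            off += (size_t)n;
        } else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            struct epoll_event ev;
            ev.events  = EPOLLIN | EPOLLOUT;       /* now we do want writable events */
            ev.data.fd = fd;
            epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
            break;   /* caller queues buf+off..len and resumes on the writable event */
        } else if (n < 0 && errno != EINTR) {
            return -1;   /* real error */
        }
    }
    return (ssize_t)off;
}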

6.2.4 Thread Starvation in ET mode

If one socket receives a large amount of data and we try to read it all in one go, other sockets may not get processed for a long time, leading to starvation.
The solution:


Maintain a queue of ready descriptors, so the program knows which descriptors are ready but not yet fully read; then read from each of them in turn, a bounded amount at a time, removing a descriptor from the queue only once it has been drained, until the queue is empty. This ensures that every fd gets read and no data is lost.
The process is shown as follows:
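Alongside the figure, here is a hedged code sketch of the same idea, where ready_fds[], nready, READ_BUDGET, and read_up_to() are all illustrative names rather than anything from the article:

#include <stddef.h>

#define READ_BUDGET (16 * 1024)   /* bytes read from one fd per round (illustrative) */

/* Hypothetical helper: reads at most 'budget' bytes from fd in ET style,
 * returning 1 once the fd hits EAGAIN (fully drained), 0 otherwise. */
int read_up_to(int fd, size_t budget);

/* Round-robin over the ready-but-undrained fds so no single connection
 * can starve the others. */
static void service_ready_queue(int ready_fds[], int *nready) {
    for (int i = 0; i < *nready; ) {
        if (read_up_to(ready_fds[i], READ_BUDGET)) {
            ready_fds[i] = ready_fds[--(*nready)];   /* drained: drop from queue */
        } else {
            ++i;                                     /* still has data: next round */
        }
    }
}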

6.2.5 The EPOLLONESHOT Setting

Suppose thread A reads data from a socket and starts processing it; meanwhile more data arrives on the same socket, so thread B is woken up to read it. Now two threads are operating on the same socket at the same time. This is exactly what EPOLLONESHOT prevents: an fd registered with EPOLLONESHOT reports at most one event and is then disabled until it is explicitly re-armed with EPOLL_CTL_MOD.
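A hedged sketch of the re-arm step that EPOLLONESHOT requires after a worker finishes handling an event (epfd and fd are assumed to be valid):

#include <sys/epoll.h>

/* After one thread has finished handling an event on fd, it must re-arm the
 * fd, or epoll will never report events for it again. */
static int rearm_oneshot(int epfd, int fd) {
    struct epoll_event ev;
    ev.events  = EPOLLIN | EPOLLET | EPOLLONESHOT;   /* same interest set as at registration */
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}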

6.2.6 Choosing Between LT and ET

From the comparison above, LT mode is safer and leads to clearer code, while ET mode is the high-performance option and, used properly, is the better fit for high-concurrency scenarios. The choice should be based on your actual needs and your team's engineering ability.
There is a lively discussion on Zhihu comparing ET and LT with many different opinions; it is worth a look if you are interested.

7. The epoll thundering-herd problem

If you don't know what the thundering-herd effect is, imagine this:

You feed pigeons in the square but throw only one piece of food; a whole flock scrambles for it, yet in the end only one pigeon gets it and the rest have wasted their effort.

The same scene exists in network programming.
The thundering herd on accept() was resolved around kernel 2.6.18, but epoll still has its own thundering-herd issue: when multiple processes/threads block in epoll_wait on the same fd and the kernel fires a read/write event, all of them are woken up, yet only one of them actually ends up handling the event.
Before epoll itself addressed the problem, Nginx, as a well-known user, worked around it with a global accept mutex so that only one worker listens on the fd at a time. Later, the Linux 3.9 kernel added the SO_REUSEPORT option, which load-balances connections at the kernel level, and Nginx 1.9.1 added the reuseport feature to take advantage of it and avoid the thundering herd.
EPOLLEXCLUSIVE is an epoll flag added in the Linux 4.5 kernel (2016); Nginx added NGX_EXCLUSIVE_EVENT support from 1.11.3. The EPOLLEXCLUSIVE flag ensures that only one of the waiting threads is woken up for an event, avoiding redundant wakeups.
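For illustration only, a hedged sketch of registering a shared listening socket with EPOLLEXCLUSIVE (each worker would do this against its own epoll instance; listen_fd is assumed to be created elsewhere, and the flag requires Linux 4.5+):

#include <sys/epoll.h>

/* Each worker adds the shared listening socket with EPOLLEXCLUSIVE so the
 * kernel wakes only one waiter per incoming connection. */
static int add_listener_exclusive(int epfd, int listen_fd) {
    struct epoll_event ev;
    ev.events  = EPOLLIN | EPOLLEXCLUSIVE;
    ev.data.fd = listen_fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);
}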

8. On the shoulders of giants

  • http://harlon.org/2018/04/11/networksocket5/
  • https://devarea.com/linux-io-multiplexing-select-vs-poll-vs-epoll/#.Xa0sDqqFOUk
  • https://jvns.ca/blog/2017/06/03/async-io-on-linux–select–poll–and-epoll/
  • https://zhuanlan.zhihu.com/p/78510741
  • http://www.cnhalo.net/2016/07/13/linux-epoll/
  • https://www.ichenfu.com/2017/05/03/proxy-epoll-thundering-herd/
  • https://github.com/torvalds/linux/commit/df0108c5da561c66c333bb46bfe3c1fc65905898
  • https://simpleyyt.com/2017/06/25/how-ngnix-solve-thundering-herd/