1. Foreword
This article walks through I/O multiplexing technology and the principles behind epoll:
- The concept of I/O multiplexing
- I/O multiplexing tools before epoll
- The epoll three-stage rocket
- The low-level implementation of epoll
- LT mode and ET mode
- A Tencent interview question
- The epoll thundering herd problem
2. Getting familiar with multiplexing and I/O multiplexing
Before we dive into epoll, let's clarify the concept of multiplexing: what does I/O multiplexing really mean?
2.1 Multiplexing Concepts and Resource Characteristics
2.1.1 The concept of multiplexing
Multiplexing is not a new technology but a design idea. Communication and hardware design have long used frequency-division, time-division, wavelength-division, and code-division multiplexing, and multiplexing scenarios abound in daily life, so don't be intimidated by the terminology.
In essence, multiplexing resolves the imbalance between limited resources and too many users, maximizing utilization so that more work gets handled.
2.1.2 Releasability of Resources
Non-releasable scenario: a ventilator in an ICU ward is a limited resource; once a patient occupies it, they cannot give it up until they are out of danger, so several patients in the same condition cannot take turns using it.
Releasable scenario: other resources, such as medical staff, can monitor multiple patients at the same time; in theory there is no scenario in which one patient occupies the medical staff and never releases them.
So we can ask: can the resource shared by multiple I/Os, the processing thread, be released?
2.1.3 Understanding I/O Multiplexing
I/O: in computing, I/O usually means disk I/O and network I/O; the I/O multiplexing discussed here refers mainly to network I/O. In Linux everything is a file, so network I/O is often described through its file descriptor (fd).
What is multiplexed: is it the file descriptor fd? In the network scenario, it is really the task-processing thread that is multiplexed, so the simple understanding is that multiple fds share one processing thread.
Why I/O multiplexing is feasible: the basic I/O operations are read and write, and the nature of network interaction means there is always waiting. Over the lifetime of a connection, an fd alternates between being readable or writable and sitting idle.
In conclusion, I/O multiplexing coordinates multiple fds, each of which releases the shared resource while idle, to take turns on a task-processing thread, realizing the pattern in which many fds map to one processing thread.
In real life, I/O multiplexing is like a single shepherd managing hundreds of sheep.
2.1.4 Background of I/O Multiplexing
Voice-over: the paragraphs above may read a bit abstract, but the simple idea is to let a task-processing thread coordinate more network connections at a lower resource cost. I/O multiplexing tools have kept evolving, and comparing them shows this principle running throughout.
3. Overview of Linux I/O multiplexing tools
3.1 The pioneer: select
/* According to POSIX.1-2001 */
#include <sys/select.h>

/* According to earlier standards */
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

int select(int nfds, fd_set *readfds, fd_set *writefds,
           fd_set *exceptfds, struct timeval *timeout);

void FD_CLR(int fd, fd_set *set);
int  FD_ISSET(int fd, fd_set *set);
void FD_SET(int fd, fd_set *set);
void FD_ZERO(fd_set *set);
3.1.1 Official Tips
Macro: int FD_SETSIZE
The value of this macro is the maximum number of file descriptors that a fd_set object can hold information about. On systems with a fixed maximum number, FD_SETSIZE is at least that number. On some systems, including GNU, there is no absolute limit on the number of descriptors open, but this macro still has a constant value which controls the number of bits in an fd_set; if you get a file descriptor with a value as high as FD_SETSIZE, you cannot put that descriptor into an fd_set.
In other words, because of the underlying array implementation, an fd whose value exceeds 1024 simply cannot be put into an fd_set, so the official advice is to stay under 1024; yet we cannot control the numeric value an fd gets. I did some research on this earlier; the conclusion was that the kernel has its own allocation strategy for fd values and will with high probability keep them within 1024, but I never fully got to the bottom of it, so I just flag the pitfall here.
3.1.2 Existing problems and an objective evaluation
- Neither the number of fds nor the value of any fd may exceed 1024, so high concurrency is impossible
- Walking the fd array with O(n) complexity to check which fds are readable or writable is inefficient
- It involves a large number of copies between kernel mode and user mode
- After every round of monitoring, the fd set must be passed in again; re-registering the events is redundant
As a pioneer of I/O multiplexing, select gave the field a huge push and pointed the way for further optimization, so don't ignorantly criticize select.
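To make those pain points concrete, here is a minimal select server loop sketched under stated assumptions: listen_fd is an already-listening socket, while client_fds/nclients, accept_new_client() and handle_client() are hypothetical helpers introduced only for illustration.

#include <sys/select.h>

for (;;) {
    fd_set rfds;
    FD_ZERO(&rfds);                       /* rebuild the set every round */
    FD_SET(listen_fd, &rfds);
    int maxfd = listen_fd;
    for (int i = 0; i < nclients; i++) {  /* re-register every fd, every time */
        FD_SET(client_fds[i], &rfds);
        if (client_fds[i] > maxfd)
            maxfd = client_fds[i];
    }

    if (select(maxfd + 1, &rfds, NULL, NULL, NULL) == -1) {
        perror("select");                 /* the whole set was copied to the kernel */
        break;
    }

    for (int fd = 0; fd <= maxfd; fd++) { /* the O(n) readiness scan */
        if (!FD_ISSET(fd, &rfds))
            continue;
        if (fd == listen_fd)
            accept_new_client(listen_fd); /* hypothetical helper */
        else
            handle_client(fd);            /* hypothetical helper */
    }
}

Note how both pain points show up in one loop: the fd_set is rebuilt and re-copied to the kernel on every iteration, and every descriptor up to maxfd must be scanned even if only one is ready.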
3.2 The successor: epoll
- There is no limit on the number of fds (poll also fixed this)
- The bitmap array is abandoned in favor of new structures that store multiple event types
- There is no need to pass the whole fd set in repeatedly; an fd is added once and afterwards only modified or deleted
- Event-driven readiness notification avoids scanning all fds for read/write events
The advent of epoll significantly raised the achievable concurrency, making the C10K problem tractable, and even the later emergence of true asynchronous I/O has not (yet) shaken epoll's status.
4. Meet epoll
4.1 Basic APIs and data structures of epoll
// Data structures
typedef union epoll_data {
    void     *ptr;
    int       fd;
    uint32_t  u32;
    uint64_t  u64;
} epoll_data_t;

struct epoll_event {
    uint32_t     events;  /* Epoll events */
    epoll_data_t data;    /* User data variable */
};

// API
int epoll_create(int size);
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
4.2 A technical understanding of the epoll three-stage rocket
- epoll_create: this call creates a set of epoll-related structures in the kernel and returns a handle fd to user space; all subsequent operations are based on this fd. The size parameter tells the kernel how many elements to provision; an ill-chosen size used to mean copying and expansion, but since Linux 2.6.8 the size argument is ignored (it only has to be greater than zero);
- epoll_ctl: this call adds, modifies, or removes an fd on the epfd returned by epoll_create. epoll_event is the structure exchanged between user mode and kernel mode; it defines both the event types the user cares about and the data payload epoll_data carried back when an event fires;
- epoll_wait: this call blocks waiting for the kernel to return read/write events. epfd is again the return value of epoll_create; events is a pointer to an array of struct epoll_event where ready events are stored; maxevents tells the kernel the maximum number of events to return and must not exceed the capacity of the array that events points to;
4.3 A popular interpretation of the epoll three-stage rocket
- The epoll_create scenario
In the first week of university, as class monitor you have to run errands for the whole class. You tell the staff in the student affairs office: I am the monitor of class xx, major xx, college xx. The staff member verifies your identity and hands you a credential that you will need for everything that follows (that is, epoll_create asks the kernel for an epfd structure, and the kernel returns the epfd handle for you to use).
- The epoll_ctl scenario
You take the credential to the service hall. The staff tell you: monitor, we need every student's handbook together with the matters to be handled. So you write down, student by student, what each one needs: Li Ming needs laboratory access, Sun Daxiong needs a swimming card... and hand the list to the staff (that is, telling the kernel which fds need which events watched).
- The epoll_wait scenario
You wait in front of the office holding the credential. The loudspeaker announces: monitor of class xx, Sun Daxiong's swimming card is ready for pickup, Li Ming's laboratory access card is ready... and you go handle them one by one (that is, calling epoll_wait to block until the kernel reports read/write events and then processing them).
4.4 The official epoll demo
It is full of useful details:
#define MAX_EVENTS 10
struct epoll_event ev, events[MAX_EVENTS];
struct sockaddr_storage local;
socklen_t addrlen = sizeof(local);
int listen_sock, conn_sock, nfds, epollfd, n;

/* Set up listening socket, 'listen_sock' (socket(), bind(), listen()) */

epollfd = epoll_create(10);
if (epollfd == -1) {
    perror("epoll_create");
    exit(EXIT_FAILURE);
}

ev.events = EPOLLIN;
ev.data.fd = listen_sock;
if (epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &ev) == -1) {
    perror("epoll_ctl: listen_sock");
    exit(EXIT_FAILURE);
}

for (;;) {
    nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);
    if (nfds == -1) {
        perror("epoll_wait");
        exit(EXIT_FAILURE);
    }

    for (n = 0; n < nfds; ++n) {
        if (events[n].data.fd == listen_sock) {
            /* The main listening socket has a new connection */
            conn_sock = accept(listen_sock,
                               (struct sockaddr *) &local, &addrlen);
            if (conn_sock == -1) {
                perror("accept");
                exit(EXIT_FAILURE);
            }
            setnonblocking(conn_sock);
            ev.events = EPOLLIN | EPOLLET;
            ev.data.fd = conn_sock;
            if (epoll_ctl(epollfd, EPOLL_CTL_ADD, conn_sock,
                          &ev) == -1) {
                perror("epoll_ctl: conn_sock");
                exit(EXIT_FAILURE);
            }
        } else {
            /* A read/write event on an already-established connection */
            do_use_fd(events[n].data.fd);
        }
    }
}
Note: in the epoll_wait loop you must distinguish the new-connection events on the main listening socket from read/write events on already-established connections, and handle them separately.
5. The low-level details of epoll
5.1 Underlying data structures
Red-black tree node definition:
#ifndef _LINUX_RBTREE_H
#define _LINUX_RBTREE_H
#include <linux/kernel.h>
#include <linux/stddef.h>
#include <linux/rcupdate.h>
struct rb_node {
unsigned long __rb_parent_color;
struct rb_node *rb_right;
struct rb_node *rb_left;
} __attribute__((aligned(sizeof(long))));
/* The alignment might seem pointless, but allegedly CRIS needs it */
struct rb_root {
struct rb_node *rb_node;
};
Epitem definition:
struct epitem {
    struct rb_node rbn;        /* RB tree node linking this item into eventpoll's rbr */
    struct list_head rdllink;  /* Links this item into eventpoll's ready list rdllist */
    struct epitem *next;       /* Works with eventpoll->ovflist to keep a single-linked chain */
    struct epoll_filefd ffd;   /* The file descriptor information this item refers to */
    int nwait;                 /* Number of active wait queues attached to poll operations */
    struct list_head pwqlist;  /* List containing the poll wait queues */
    struct eventpoll *ep;      /* The container of this item */
    struct list_head fllink;   /* Links this item into the struct file's item list */
    struct epoll_event event;  /* The registered events and the user data payload */
};
Eventpoll definition:
struct eventpoll {
    spinlock_t lock;              /* Protects access to this structure */
    struct mutex mtx;             /* Serializes epoll_ctl operations */
    wait_queue_head_t wq;         /* Wait queue used by sys_epoll_wait() */
    wait_queue_head_t poll_wait;  /* Wait queue used by file->poll() */
    struct list_head rdllist;     /* List of ready file descriptors */
    struct rb_root rbr;           /* Red-black tree root storing all monitored fds */
    struct epitem *ovflist;       /* Chains ready events arriving while rdllist is being scanned */
};
5.2 Underlying call flow
- Create and initialize a struct epitem object, associating it with the monitored fd and with the epoll object eventpoll;
- Manage the struct epitem object by adding it to the red-black tree of eventpoll;
- Add the struct epitem object to the wait queue of the file behind the monitored fd, registering the callback to be invoked when an event becomes ready; in epoll this callback is ep_poll_callback();
- When ep_poll_callback() runs and the ovflist member of eventpoll is not equal to EP_UNACTIVE_PTR, rdllist is currently being scanned; the epitem for the ready event is then temporarily parked on the ovflist chain, and once the rdllist scan finishes the elements on ovflist are moved onto rdllist (see the sketch after this list).
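The following is a greatly simplified, non-authoritative sketch of that callback flow, for intuition only; the real ep_poll_callback() in fs/eventpoll.c additionally handles locking, event-key matching, and lockless ovflist insertion:

/* Simplified sketch of the ready-event callback, not the real kernel code. */
static int ep_poll_callback_sketch(struct epitem *epi)
{
    struct eventpoll *ep = epi->ep;

    /* rdllist is currently being scanned: park the item on ovflist instead. */
    if (ep->ovflist != EP_UNACTIVE_PTR) {
        epi->next = ep->ovflist;
        ep->ovflist = epi;
        return 1;
    }

    /* Normal path: link the item into the ready list... */
    if (list_empty(&epi->rdllink))
        list_add_tail(&epi->rdllink, &ep->rdllist);

    /* ...and wake up whoever is blocked in epoll_wait(). */
    if (waitqueue_active(&ep->wq))
        wake_up(&ep->wq);

    return 1;
}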
5.3 The easily confused data copy
A widely held misconception:

When epoll_wait returns, epoll hands ready events to the application through shared memory, with user mode and kernel mode both pointing at the ready list, so the cost of a memory copy is avoided.

The kernel source says otherwise. In the event-delivery path (ep_send_events_proc()), each ready event is copied into the user-supplied buffer with __put_user, an ordinary kernel-to-user copy:
revents = ep_item_poll(epi, &pt);  /* Fetch the ready events */
if (revents) {
    if (__put_user(revents, &uevent->events) ||
        __put_user(epi->event.data, &uevent->data)) {
        /* Copying to user space failed: put the item back on the ready list */
        list_add(&epi->rdllink, head);
        ep_pm_stay_awake(epi);
        return eventcnt ? eventcnt : -EFAULT;
    }
    eventcnt++;
    uevent++;
    if (epi->event.events & EPOLLONESHOT)
        epi->event.events &= EP_PRIVATE_BITS;  /* EPOLLONESHOT handling: disarm */
    else if (!(epi->event.events & EPOLLET)) {
        /* LT mode: re-add the item so the next epoll_wait() reports it again */
        list_add_tail(&epi->rdllink, &ep->rdllist);
        ep_pm_stay_awake(epi);
    }
}
6. LT and ET modes
6.1 A simple understanding of LT/ET
LT (level-triggered) is the default mode: as long as an fd remains readable or writable, every epoll_wait call reports it again. ET (edge-triggered) reports only the transition from not-ready to ready, so each state change is notified exactly once.
6.2 A deeper understanding of LT/ET
6.2.1 Read and write operations under LT
Under LT, a readable event keeps firing while unread data remains in the receive buffer, and a writable event keeps firing while the send buffer has free space; this is why EPOLLOUT should only be registered when there is actually data to write (see the interview question below).
6.2.2 Read and write operations under ET
Under ET, the notification fires only on the state transition, so the fd must be nonblocking and the application must loop reading or writing until EAGAIN; otherwise the data left in the buffer produces no further events. A minimal read loop is sketched below.
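A minimal sketch of that drain loop, assuming fd has already been made nonblocking and consume() is a hypothetical application handler:

for (;;) {
    char buf[4096];
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n > 0) {
        consume(fd, buf, n);          /* hypothetical handler */
    } else if (n == 0) {
        close(fd);                    /* peer closed the connection */
        break;
    } else if (errno == EINTR) {
        continue;                     /* interrupted: just retry */
    } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
        break;                        /* fully drained: safe to wait for the next edge */
    } else {
        perror("read");               /* real error */
        close(fd);
        break;
    }
}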
6.2.3 A Tencent interview question
While a socket stays writable, the writable event keeps being triggered. How do you deal with this?

Naive approach: when there is data to write, add the socket to epoll and wait for the writable event; once the event arrives, call write or send; when all data has been written, remove the socket from epoll. The drawback is the constant adding and removing.
Improved approach: write or send directly first; only when send returns the error code EAGAIN (the kernel send buffer is full) register the socket for the writable event, finish sending the remaining data once the event fires, then deregister it. The improvement rests on the observation that sockets are writable most of the time, so epoll should not monitor writability by default.
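A minimal sketch of the improved approach, assuming epfd exists and sockfd is a nonblocking socket already registered for EPOLLIN (all names are illustrative):

ssize_t n = send(sockfd, buf, len, 0);
if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
    /* Kernel send buffer is full: only now ask epoll to watch writability.
     * A short write (0 <= n < len) is handled the same way: queue the
     * unsent remainder and arm EPOLLOUT. */
    struct epoll_event ev;
    ev.events = EPOLLIN | EPOLLOUT;
    ev.data.fd = sockfd;
    epoll_ctl(epfd, EPOLL_CTL_MOD, sockfd, &ev);
}
/* When EPOLLOUT later fires and the remaining data has been sent,
 * switch back to EPOLLIN only with another EPOLL_CTL_MOD. */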
6.2.4 Thread starvation in ET mode
In ET mode, if one connection keeps delivering a large amount of data, draining it until EAGAIN can starve the other ready descriptors on the same thread. A common remedy: maintain a queue of descriptors that are known to be ready but not yet fully read; each round, the program reads a bounded amount from each of them, removing a descriptor only once it is drained, until the queue is empty. This ensures every fd gets read and no data is lost.
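A sketch of that bounded-round idea; READ_QUOTA, the queue helpers, and consume() are all hypothetical names for illustration:

#define READ_QUOTA 16                   /* max reads per fd per round */

while (!queue_empty(&ready)) {          /* 'ready' holds fds known to have data */
    int fd = queue_pop(&ready);
    int drained = 0;
    for (int i = 0; i < READ_QUOTA; i++) {
        char buf[4096];
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            consume(fd, buf, n);        /* hypothetical handler */
            continue;
        }
        if (n == -1 && errno == EINTR)
            continue;                   /* interrupted: just retry */
        if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            drained = 1;                /* buffer empty for now */
        } else {
            close(fd);                  /* EOF or real error: done with this fd */
            drained = 1;
        }
        break;
    }
    if (!drained)
        queue_push(&ready, fd);         /* still has data: revisit next round */
}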
6.2.5 The EPOLLONESHOT setting
With EPOLLONESHOT set, an fd reports at most one event and is then disarmed until it is re-armed via epoll_ctl with EPOLL_CTL_MOD; this prevents, for example, two threads from processing the same fd at the same time.
6.2.6 Choosing between LT and ET
LT is simpler and harder to get wrong, which is why it is the default; ET can reduce the number of wakeups and epoll_ctl calls under load, but shifts the burden of complete, nonblocking reads and writes onto the application (nginx, for example, uses ET). Choose according to how much complexity you are willing to own.
7. The epoll thundering herd problem
Imagine feeding pigeons in a square: you throw a single piece of food, a whole flock scrambles for it, yet only one pigeon gets it, and for all the others the dash was wasted effort. The same happens when many processes or threads wait on the same listening fd: one connection arrives, all of them are woken up, but only one gets to accept it. The kernel added the EPOLLEXCLUSIVE flag (Linux 4.5) to mitigate exactly this.
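A hedged sketch of the EPOLLEXCLUSIVE mitigation: each worker registers the shared listening socket on its own epoll instance with the flag set, and the kernel then wakes only a subset of the waiters per connection instead of all of them. listen_sock is assumed to be set up already.

/* In each worker process (e.g. after fork()): */
int epfd = epoll_create1(0);
if (epfd == -1) {
    perror("epoll_create1");
    exit(EXIT_FAILURE);
}

struct epoll_event ev;
ev.events = EPOLLIN | EPOLLEXCLUSIVE;  /* valid only with EPOLL_CTL_ADD */
ev.data.fd = listen_sock;
if (epoll_ctl(epfd, EPOLL_CTL_ADD, listen_sock, &ev) == -1)
    perror("epoll_ctl: EPOLLEXCLUSIVE");

Nginx's accept_mutex takes a different, user-space route to the same goal: workers take turns holding a lock that decides who may accept.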