Welcome to follow github.com/hsfxuebao. I hope this article helps you; if you find it useful, please give the repo a Star.

1. Key concepts

1.1 IO multiplexing

I/O multiplexing is also called event-driven I/O in some places; on Linux it is provided by select, poll, and epoll. It is a mechanism by which a single process can monitor multiple file descriptors and be told as soon as any of them becomes ready (usually readable or writable), so that the program can perform the corresponding read or write. Instead of using multiple threads (one thread per file descriptor, a new thread each time), the process waits on many descriptors at once through a single blocking object, which greatly saves system resources. In other words, the defining characteristic of I/O multiplexing is that a process can wait on many file descriptors (socket descriptors) at the same time, and as soon as any one of them becomes read-ready, select() returns. (See part 2 of this IO series for an in-depth look at the five I/O models.) "Multiplexing" means that a single thread manages multiple I/O streams by recording and tracking the state of each socket (I/O stream). The mechanism was invented to raise server throughput as much as possible.

Does that sound convoluted? The picture below should make it clearer.

Everyone has used Nginx. Nginx uses epoll to receive requests: many connections come in, epoll monitors all of them, and then, like a rotary switch, whichever connection has data gets dispatched to the corresponding handler code. Redis works in a similar way.

1.2 File descriptors

A file descriptor is a computer-science term: an abstract handle that refers to an open file. Formally it is a non-negative integer; in practice it is an index into the table of open files that the kernel maintains for each process. When a program opens an existing file or creates a new one, the kernel returns a file descriptor to the process. Much low-level programming revolves around file descriptors, although the concept really only applies to UNIX-like operating systems such as Linux. Java exposes it as follows:

/*
 * @author Pavani Diwanji
 * @see java.io.FileInputStream
 * @see java.io.FileOutputStream
 * @since JDK1.0
 */
public final class FileDescriptor {

    private int fd;

    private Closeable parent;
    private List<Closeable> otherParents;
    private boolean closed;
}
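To make the concept concrete, here is a minimal C sketch (the path /tmp/demo.txt and the message are arbitrary illustration values): open() returns the lowest unused non-negative integer, with 0, 1, and 2 already taken by stdin, stdout, and stderr.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* open() returns a small non-negative integer: an index into the
       per-process table of open files that the kernel maintains. */
    int fd = open("/tmp/demo.txt", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd == -1) {
        perror("open");
        return 1;
    }
    printf("got file descriptor %d (0/1/2 are stdin/stdout/stderr)\n", fd);
    write(fd, "hello\n", 6);
    close(fd);   /* release the descriptor so the kernel can reuse the slot */
    return 0;
}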

2. The Linux functions that implement IO multiplexing

The I/O multiplexing mechanism means that once the kernel finds that one or more of the I/O conditions a process has specified are ready (usually ready to read or write), it notifies the process. In other words, it is a mechanism through which multiple descriptors can be monitored and the program told to perform the corresponding reads and writes as soon as any descriptor becomes ready. On Linux this mechanism is used through select, poll, and epoll.

  • Multiple connections share a blocking object, and the application waits on only one blocking object instead of all connections. 
  • When a connection has new data to process, the operating system notifies the application, and the thread returns from the blocked state to begin processing the business.

2.1 The select function

select was the first implementation of IO multiplexing (it appeared in BSD around 1983). It is the lowest-level of the three: its event polling mechanism is based on a bit array, and every query iterates over the entire event list.

2.1.1 Data structure and Parameter definition

To understand select you must first understand fd_set, the data structure that every select call operates on. fd_set is defined as follows:

typedef long int __fd_mask;

/* fd_set for select and pselect. */
typedef struct
  {
#ifdef __USE_XOPEN
    __fd_mask fds_bits[__FD_SETSIZE / __NFDBITS];
# define __FDS_BITS(set) ((set)->fds_bits)
#else
    __fd_mask __fds_bits[__FD_SETSIZE / __NFDBITS];
# define __FDS_BITS(set) ((set)->__fds_bits)
#endif
  } fd_set;

fd_set is simply a bit array of length 1024. Each bit represents one file descriptor: if the bit is 1, that fd has an I/O event to be handled; otherwise it does not. To simplify the bit manipulation, Linux defines a set of macros for working with this bit array.

void FD_CLR(int fd, fd_set *set);     // Clear the bit for fd in fd_set; select will no longer process this fd
int  FD_ISSET(int fd, fd_set *set);   // Check whether the bit for fd is set, i.e. whether this fd has an event
void FD_SET(int fd, fd_set *set);     // Set the bit for fd in fd_set to 1 (watch this fd)
void FD_ZERO(fd_set *set);            // Empty the entire fd_set, usually used for initialization

The maximum number of fds that select can handle is 1024, which is determined by the size of fd_set (__FD_SETSIZE).
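A tiny sketch of these macros in isolation (the descriptor value 5 is just an illustrative number; any open fd below FD_SETSIZE works):

#include <stdio.h>
#include <sys/select.h>

int main(void) {
    fd_set readfds;
    int fd = 5;                                      /* illustrative fd value */

    FD_ZERO(&readfds);                               /* clear all 1024 bits */
    FD_SET(fd, &readfds);                            /* mark fd as one we care about */
    printf("bit for fd 5 set? %d\n", FD_ISSET(fd, &readfds) != 0);   /* prints 1 */
    FD_CLR(fd, &readfds);                            /* stop watching fd */
    printf("bit for fd 5 set? %d\n", FD_ISSET(fd, &readfds) != 0);   /* prints 0 */
    return 0;
}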

The select function has the following signature:

int select(int nfds, fd_set *readfds, fd_set *writefds,
           fd_set *exceptfds, struct timeval *timeout);
  • nfds: the highest-numbered file descriptor in any of the three sets, plus 1; the kernel checks only descriptors 0 .. nfds-1
  • readfds: the set of fds to monitor for read events; pass NULL if none
  • writefds: the set of fds to monitor for write events; pass NULL if none
  • exceptfds: the set of fds to monitor for exceptional conditions; not used here…
  • timeout: how long to block; a zeroed timeval makes the call non-blocking, and NULL blocks forever until data arrives
    • timeval is defined as follows:
      struct timeval {
         long    tv_sec;         /* seconds */
         long    tv_usec;        /* microseconds */
      };
    • timeval *timeout has three cases:
      1. NULL: wait forever
      2. A timeval with a set duration: wait at most that long
      3. A timeval set to 0: check the descriptors and return immediately (polling)

The select function returns three types of values:

  • A positive value: the total number of ready events across readfds and writefds
  • 0: the call timed out
  • -1: an error occurred

A call to select proceeds as follows:

  • select monitors three kinds of file descriptors (readfds, writefds, and exceptfds) and copies the sets passed in by the user into kernel space
  • The call blocks until some descriptor becomes ready (readable, writable, or in an exceptional state) or until the timeout expires (timeout specifies how long to wait; a zeroed timeout makes it return immediately), and then select returns.
  • When select returns, the ready descriptors are found by iterating over the fd_set.

2.1.2 Execution Process



Process:

  1. The user thread calls select, and fd_set is copied from user space into kernel space
  2. The kernel iterates over the fd_set in kernel space to see whether any socket descriptor is ready; if none is, it sleeps until one becomes ready
  3. The kernel returns the result of select to the user thread, i.e. the number of ready file descriptors
  4. Knowing only the count, the user thread iterates over fd_set again to find the ready file descriptors
  5. The user thread reads from and writes to the ready file descriptors

2.1.3 C language code cases
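The source leaves this section empty; below is a minimal sketch of a select-based TCP echo server illustrating the flow described above (port 8080, the 1024-byte buffer, and the echo behaviour are assumptions for illustration, and most error handling is omitted):

#include <arpa/inet.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);                       /* port is arbitrary */
    bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(listen_fd, 16);

    fd_set allset;
    FD_ZERO(&allset);
    FD_SET(listen_fd, &allset);
    int maxfd = listen_fd;

    for (;;) {
        fd_set rset = allset;                          /* the kernel overwrites the set, so copy it each round */
        int nready = select(maxfd + 1, &rset, NULL, NULL, NULL);   /* block until something is ready */
        if (nready <= 0)
            continue;

        if (FD_ISSET(listen_fd, &rset)) {              /* new connection */
            int conn = accept(listen_fd, NULL, NULL);
            if (conn >= 0 && conn < FD_SETSIZE) {
                FD_SET(conn, &allset);
                if (conn > maxfd) maxfd = conn;
            }
        }
        for (int fd = 0; fd <= maxfd; fd++) {          /* O(n): scan every descriptor to find the ready ones */
            if (fd == listen_fd || !FD_ISSET(fd, &rset))
                continue;
            char buf[1024];
            ssize_t n = read(fd, buf, sizeof(buf));
            if (n <= 0) {                              /* peer closed or error */
                close(fd);
                FD_CLR(fd, &allset);
            } else {
                write(fd, buf, n);                     /* echo back */
            }
        }
    }
}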

2.1.4 Pros and cons

Advantages:

  1. Compared with plain non-blocking I/O, select copies the fd array that user mode would otherwise traverse into the kernel and lets the kernel do the traversal; since checking whether a socket has data requires the kernel anyway, this avoids switching between user mode and kernel mode once per descriptor
  2. The select system call returns the updated &rset, so a simple bit test tells the user which sockets have data to read, which improves efficiency
  3. It is supported on practically all platforms, so it is very portable

Disadvantages:

  1. A single process can monitor at most 1024 file descriptors (clients)
  2. The &rset is not reusable: the kernel overwrites it, setting the bit of every socket that has data, so the set must be rebuilt before each call
  3. The fd set is still copied into kernel space on every call. This avoids per-descriptor system-call overhead, but the copy itself costs something, and in high-concurrency scenarios it becomes a significant resource drain. (It could be optimized away by not copying at all, e.g. with asynchronous event notification inside the kernel.)
  4. select does not tell the user which sockets have data; it only returns how many descriptors are ready, so the user still has to do an O(n) traversal to find them. (This could be optimized by returning only the ready file descriptors, sparing the user the useless traversal.)

Summary: select lets one thread handle multiple client connections (file descriptors) while reducing system-call overhead: many file descriptors cost only one select system call plus N read system calls for the N ready descriptors.

2.2 The poll function

The poll function was implemented in 1993. poll can be seen as an enhanced version of select: select's bit-array representation limits it to 1024 read/write events at a time, and poll reworks that part of the model.

2.2.1 Data Structure and Parameter definition

poll operates on an array of struct pollfd, one entry per monitored descriptor:

struct pollfd {
	int fd;          // The file descriptor to monitor
	short events;    // Events that need to be monitored by the kernel, such as read and write events
	short revents;   // The events that actually occurred; filled in by the kernel when poll returns
};

With this structure you no longer manipulate bits directly; you work with the events themselves, and you can combine event types as you wish. See the manual page for details.

As before, the structure should first be zeroed (with bzero or memset). You then register the events you care about in events and, after poll returns, inspect revents. Events are registered with the | operator and queried with the & operator. For example, to register POLLIN (data has arrived), use pfd.events |= POLLIN; the |= operation can be repeated to register several events. To cancel an event, clear its bit with pfd.events &= ~POLLIN (not "~="). To query an event: pfd.revents & POLLIN. A small sketch of these operations follows.
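Here are those operations wrapped in a helper (the function name wait_readable and the timeout handling are made up for illustration):

#include <poll.h>

/* Wait up to timeout_ms for data on an already-open descriptor. */
int wait_readable(int sockfd, int timeout_ms) {
    struct pollfd pfd;
    pfd.fd = sockfd;
    pfd.events = 0;
    pfd.events |= POLLIN;                  /* register interest with |=      */
    pfd.events |= POLLOUT;                 /* further events can be OR-ed in */
    pfd.events &= ~POLLOUT;                /* cancel an event with &= ~      */

    int n = poll(&pfd, 1, timeout_ms);
    if (n > 0 && (pfd.revents & POLLIN))   /* query the result with &        */
        return 1;                          /* readable                       */
    return 0;                              /* timed out, error, or not ready */
}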

The poll function itself is called as follows:

int poll(struct pollfd* fds, nfds_t nfds, int timeout);

Parameter Description:

  • fds: a pointer to the first element of the pollfd array; before the call we place the file descriptors to monitor, and the events we care about, into this array
  • nfds: the number of entries in the array
  • timeout: the wait time in milliseconds; a positive value blocks for at most that long, 0 means non-blocking, and -1 means block forever.

    For reference, the higher-resolution time structure (used by the ppoll variant) is:
    struct timespec {
    long    tv_sec;         /* seconds */
    long    tv_nsec;        /* nanoseconds */
    };

2.2.2 Execution process

The execution process is similar to that of select:

  1. The user thread invokes the poll system call, and the pollfd array is copied into kernel space
  2. The kernel iterates over the file descriptors; if none is ready, it sleeps until one becomes ready
  3. The kernel returns the number of ready file descriptors to the user thread
  4. The user thread iterates over the pollfd array again to find the ready file descriptors (revents can be cleared so the array can be reused)
  5. The user thread reads and writes to the ready file descriptor

2.2.3 C language code cases
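The source leaves this section empty as well; here is a minimal sketch of the same echo server rewritten with poll (port 8080, the MAX_FDS cap, and the echo behaviour are assumptions for illustration, and most error handling is omitted):

#include <arpa/inet.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_FDS 4096   /* our own cap; poll itself imposes no 1024 limit */

int main(void) {
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);                       /* port is arbitrary */
    bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(listen_fd, 16);

    struct pollfd fds[MAX_FDS];
    int nfds = 1;
    fds[0].fd = listen_fd;
    fds[0].events = POLLIN;

    for (;;) {
        int nready = poll(fds, nfds, -1);              /* -1: block until an event arrives */
        if (nready <= 0)
            continue;

        if (fds[0].revents & POLLIN) {                 /* new connection on the listening socket */
            int conn = accept(listen_fd, NULL, NULL);
            if (conn >= 0 && nfds < MAX_FDS) {
                fds[nfds].fd = conn;
                fds[nfds].events = POLLIN;             /* the kernel fills revents on every poll call */
                nfds++;
            }
        }
        for (int i = 1; i < nfds; i++) {               /* still an O(n) scan over the array */
            if (!(fds[i].revents & POLLIN))
                continue;
            char buf[1024];
            ssize_t n = read(fds[i].fd, buf, sizeof(buf));
            if (n <= 0) {                              /* peer closed: drop the entry by swapping in the last one */
                close(fds[i].fd);
                fds[i] = fds[nfds - 1];
                nfds--;
                i--;
            } else {
                write(fds[i].fd, buf, n);              /* echo back */
            }
        }
    }
}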

2.2.4 Pros and cons

Advantages:

  1. poll uses a pollfd array instead of select's bitmap, so the 1024 limit disappears and many more clients can be managed at once. Removing select's limit of 1024 monitored file descriptors is the main difference from select.
  2. When an event occurs, the kernel fills in the corresponding revents field; clearing it during traversal allows the pollfd array to be reused on the next call

Disadvantages: poll fixes the first two drawbacks of select, but its underlying principle is still the same, so the remaining problems of select persist:

  1. The pollfd array is still copied into kernel space on every call, which has a cost
  2. poll still does not tell the user which sockets have data, so an O(n) traversal is still required

2.3 The epoll function

epoll was created by Davide Libenzi in 2002 and is the more advanced mechanism. With select or poll, all candidate descriptors have to be polled one by one to see whether an event occurred, and the event set is exposed directly to the caller (select's writefds, poll's fds), which is complex and error-prone. epoll introduces a new model: it allocates a dedicated epoll fd and manages everything through it, which already shows a somewhat object-oriented way of thinking.

2.3.1 Data Structure and Parameter definition

Let’s start with the underlying data structure:

#include <sys/epoll.h>
// Data structure
// Each epoll object has a separate EventPoll structure
// The red-black tree is used to store events added to the epoll object by the epoll_ctl method
// epoll_wait to check whether an event has occurred simply checks for epItem elements in the RDList double-linked list in the eventPoll object
struct eventpoll {
    ...
    /* The root node of the red-black tree that stores all the events added to epoll to be monitored */
    struct rb_root  rbr;
    /* Double-linked list that stores all ready file descriptors */
    struct list_head rdlist;
    ...
};

typedef union epoll_data {
	void *ptr;
	int fd;
	uint32_t u32;
	uint64_t u64;
} epoll_data_t;

struct epoll_event {
	uint32_t events;
	epoll_data_t data;
};
// API 
int epoll_create(int size); // Create an eventpoll object inside the kernel; all sockets to be monitored are placed into this object
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event); // epoll_ctl adds sockets to, and removes them from, the kernel's red-black tree
int epoll_wait(int epfd, struct epoll_event * events, int maxevents, int timeout);// epoll_wait checks if there are ready file descriptors in the double-linked list and returns if there are

Notice that epoll_data is a union. fd is easy to understand: it is the file descriptor. Since a file descriptor is essentially an index to a resource the kernel tracks, the same slot can just as well hold a ptr pointer, or a plain 32/64-bit integer. epoll_event then pairs this data member with events, the set of registered events. A short sketch of both styles follows.
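Two common ways to fill the union, using either the raw fd or a pointer to a per-connection context (struct conn, add_by_fd, and add_with_ctx are made-up names for illustration):

#include <stdlib.h>
#include <sys/epoll.h>

struct conn {                /* hypothetical per-connection context */
    int fd;
    char buf[1024];
};

/* Variant 1: identify the connection by its fd only. */
void add_by_fd(int epfd, int fd) {
    struct epoll_event ev;
    ev.events = EPOLLIN;
    ev.data.fd = fd;                              /* use the int member of the union */
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Variant 2: hang a whole context object off the event. */
struct conn *add_with_ctx(int epfd, int fd) {
    struct conn *c = calloc(1, sizeof(*c));
    c->fd = fd;
    struct epoll_event ev;
    ev.events = EPOLLIN;
    ev.data.ptr = c;                              /* epoll_wait() hands this pointer back */
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
    return c;
}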

Epoll is conducted through the following three functions.

  1. Create the epoll fd: create an epoll handle

    int epoll_create(int size);

    size originally specified the size of the structure the kernel should maintain, but since kernel 2.6.8 the parameter is ignored (it only has to be positive) because the kernel sizes it dynamically. epoll_create returns an epoll fd through which all subsequent event operations are performed.

    There is another function created:

    int epoll_create1(int flag);

    With flag == 0 it behaves exactly like epoll_create. The only other option is EPOLL_CLOEXEC, which sets the close-on-exec flag on the new fd so that it is automatically closed when the process exec()s another program; this is useful in multi-process programming.

  2. Manage events: add, modify, or delete the file descriptors the kernel should monitor

    int epoll_ctl(int epfd, int op, int fd, struct epoll_event* event);
    • epfd is the epoll fd returned by epoll_create
    • op indicates the type of operation:
      • EPOLL_CTL_ADD: register an event
      • EPOLL_CTL_MOD: modify an event
      • EPOLL_CTL_DEL: delete an event
    • fd is the file descriptor concerned
    • event describes the event to register for this fd
  3. Wait for events: similar to issuing a select() call

    int epoll_wait(int epfd, struct epoll_event* evlist, int maxevents, int timeout);
    • epfd is the epoll file descriptor
    • evlist is the array that receives the events that occurred
    • maxevents is the capacity of that array
    • timeout is the time limit in milliseconds: a positive value waits at most that long, 0 is non-blocking, and -1 blocks until an event occurs. The call returns the number of ready descriptors, 0 if none, and -1 on error.

The picture below shows how the epoll interface functions relate to each other. The event callback notification mechanism works as follows:

  • When data arrives at the network card it is first written, via DMA, into a memory buffer the NIC can access directly
  • The NIC then raises an interrupt so that the CPU handles the NIC first
  • The interrupt number is bound to a callback, and the callback puts the socket that now has data onto the ready list

2.3.2 Execution process

  1. epoll_create creates an eventpoll object (a red-black tree plus a double-linked list)
  2. The red-black tree stores all monitored file descriptors; epoll_ctl adds descriptors to and removes them from this tree
  3. The double-linked list stores the ready file descriptors. When epoll_wait is called it simply checks whether this list contains anything and returns its contents if so. When a descriptor has data it is "set", but epoll has no revents flag bit the way poll does; instead the descriptor with data is placed on the ready queue.
  4. Every event added to the eventpoll object has a callback registered with the device driver; epoll_wait returns the number of descriptors with data, and the caller reads those first N entries accordingly

2.3.3 C language code cases
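The source leaves this section empty too; here is a minimal sketch of the echo server with epoll in the default LT mode (port 8080, MAX_EVENTS, and the echo behaviour are assumptions for illustration, and most error handling is omitted):

#include <arpa/inet.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EVENTS 64

int main(void) {
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);                       /* port is arbitrary */
    bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(listen_fd, 16);

    int epfd = epoll_create1(0);                       /* create the eventpoll object */
    struct epoll_event ev, events[MAX_EVENTS];
    ev.events = EPOLLIN;                               /* LT mode is the default */
    ev.data.fd = listen_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);    /* register once, not on every wait */

    for (;;) {
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);  /* only ready descriptors come back */
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == listen_fd) {                     /* new connection */
                int conn = accept(listen_fd, NULL, NULL);
                ev.events = EPOLLIN;
                ev.data.fd = conn;
                epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &ev);
            } else {                                   /* data from a client */
                char buf[1024];
                ssize_t r = read(fd, buf, sizeof(buf));
                if (r <= 0) {                          /* closed: remove from the red-black tree */
                    epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
                    close(fd);
                } else {
                    write(fd, buf, r);                 /* echo back */
                }
            }
        }
    }
}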

2.3.4 Epoll working modes (LT and ET triggering)

2.3.4.1 Level-triggered mode (LT mode)

LT (level-triggered) mode is the default epoll trigger mode (select and poll support only this mode).

  • Readable event: triggered whenever the amount of data in the receive buffer is above the low-water mark
  • Writable event: triggered whenever the free space in the send buffer is above the low-water mark
  • Low-water mark: a reference value, 1 byte by default

Put simply, level-triggered mode keeps firing events as long as data remains in the buffer. When epoll reports that a socket is ready, the application does not have to consume the event immediately, or may consume only part of the data; the next call to epoll_wait will still return at once and report the socket as read-ready. Only when all data in the buffer has been consumed does epoll_wait stop returning immediately. Both blocking and non-blocking reads and writes are supported.

2.3.4.2 Edge-triggered mode (ET mode)

ET is edge-triggered mode. If the socket is registered with the EPOLLET flag in epoll_event.events, epoll operates in ET mode for that descriptor. The trigger conditions are:

  • Readable event: fired only when new data arrives (it does not matter how much unread data already sits in the receive buffer).
  • Writable event: fired when the send buffer goes from having no free space to having some

Put simply, in ET mode an event fires only when new data arrives, so we must read out all the data at once; otherwise no further event fires until the next batch of new data arrives. Since we do not know in advance how much data there is, we have to loop until the buffer is empty, but recv is normally a blocking call and would block once the data runs out. The solution is to set the descriptor to non-blocking:

#include <fcntl.h>

void SetNoBlock(int fd)
{
    int flag = fcntl(fd, F_GETFL);   // read the current file-status flags

    flag |= O_NONBLOCK;              // add the non-blocking flag
    fcntl(fd, F_SETFL, flag);        // write the flags back
}
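Building on SetNoBlock above, here is a sketch of the ET-style read loop: keep reading until read() reports EAGAIN, so no data is left behind (the function name drain_fd and the 1024-byte buffer are made up for illustration):

#include <errno.h>
#include <unistd.h>

/* Read everything currently available on a non-blocking fd (for ET mode).
   Returns 0 when the buffer has been drained, -1 when the peer closed or an error occurred. */
int drain_fd(int fd) {
    char buf[1024];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            /* ... process n bytes in buf ... */
            continue;                           /* keep reading: there may be more */
        }
        if (n == 0)
            return -1;                          /* peer closed the connection */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return 0;                           /* drained; wait for the next EPOLLET event */
        if (errno == EINTR)
            continue;                           /* interrupted by a signal: retry */
        return -1;                              /* real error */
    }
}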

In ET mode, when epoll reports that a socket is ready the event must be processed right away and in full: even if 1K of data were left unread in the buffer, epoll_wait would not report the socket again. In other words, in ET mode each ready event on a file descriptor is delivered only once.

ET performs better than LT (epoll_wait returns far fewer times), but only non-blocking reads and writes are practical with it.

2.3.4.3 Differences

  • LT mode is simple, stable, and hard to get wrong; traditional select and poll behave this way. Its drawback is that it triggers far more events, which lowers efficiency.

  • ET's biggest advantage is that it greatly reduces how often epoll fires, but at a real cost: all data must be consumed in one go. Efficiency improves, yet code complexity grows considerably. Nginx uses ET mode by default.

  • There is another scenario that suits ET mode. Suppose we need to receive a message but, for whatever reason, it arrives in several fragments, so only part of it sits in the buffer at any moment. Reading it out piecemeal raises the cost of reassembling the data; the better approach is to wait until the remaining fragments arrive and read the whole message once. With LT mode, however, events would keep firing because the buffer is never empty, so ET is the better choice here. (This is the packet splitting/sticking problem.)

2.3.5 Pros and cons

Advantages:

  1. The time complexity is O(1). When an event is ready, epoll_wait only needs to check whether there is data in the ready list and return it directly if there is
  2. File descriptors are registered with the kernel only once (via epoll_ctl), so the descriptor set does not have to be copied from user space to kernel space on every wait call
  3. The user thread is notified in the form of a callback when a ready event occurs

Disadvantages:

  1. It only works under Linux

2.4 Summary

  • The reason multiplexing is fast is that the operating system provides these system calls: what used to be a while loop issuing many system calls becomes a while loop issuing one system call, with the kernel traversing the file descriptors.
  • epoll is the most capable I/O multiplexer available today; Redis, Nginx, and Java NIO all use it on Linux.
  • Here "multiple channels" refers to multiple network connections, and "reuse" refers to reusing the same thread.
    • Each socket is copied from user space into kernel space only once in its lifetime
    • With the event-notification mechanism, whenever a socket has data the kernel is actively notified and the socket is added to the ready list, so there is no need to traverse all sockets
  • In the multiplexing I/O model the kernel keeps watching the state of multiple sockets, and actual I/O reads and writes are issued only when real read/write events occur. Because a single thread manages many sockets, the system does not need to create extra processes or threads, nor maintain them, and I/O resources are used only when read/write events actually happen, which greatly reduces resource usage.
  • The I/O multiplexing model uses the ability of select, poll, and epoll to monitor the I/O events of many streams at once. When there is nothing to do the thread blocks; when one or more streams have I/O events it wakes up from the blocked state. The program then scans the streams (epoll scans only the streams that actually raised events) and handles just the ready ones in turn, which avoids a lot of useless work. I/O multiplexing thus lets a single thread efficiently serve many connection requests, minimizing the time spent on network I/O.

2.5 Comparison of select, poll, and epoll

  • How ready fds are obtained: select — traversal; poll — traversal; epoll — callback
  • Underlying data structure: select — a bitmap of file descriptors; poll — an array of struct pollfd; epoll — a red-black tree holding the monitored file descriptors plus a double-linked list holding the ready ones
  • Time complexity: select — O(n), the fd set must be traversed to find ready descriptors; poll — O(n), the pollfd array must be traversed; epoll — O(1), when an event becomes ready a registered callback places the fd on the ready list
  • Maximum supported file descriptor: select — usually a hard upper limit; poll — 65535; epoll — 65535
  • Maximum number of connections: select — 1024 (x86) or 2048 (x64); poll — unlimited; epoll — unlimited
  • Copying of fd data: select — the fd set is copied from user space to kernel space on every call; poll — the pollfd array is copied from user space to kernel space on every call; epoll — descriptors are registered with the kernel once via epoll_ctl, so there is no repeated copying from user space to kernel space

3. I/O multiplexing application scenarios

3.1 I/O multiplexing models supported by Nginx

  • select
  • poll
  • epoll: IO multiplexing, efficient concurrency model, available on Linux 2.6+ kernels
  • Kqueue: IO multiplexing, efficient concurrency model, available on FreeBSD 4.1+, OpenBSD 2.9+, NetBSD 2.0, and Mac OS X platforms
  • /dev/poll: efficient concurrency model, available on Solaris 7 11/99+, HP/UX 11.22+ (eventport), IRIX 6.5.15+, and Tru64 UNIX 5.1A+ platforms
  • eventport: efficient concurrency model for the Solaris 10 platform; note that due to some known issues /dev/poll is recommended instead.

3.2 I/O multiplexing models supported by Redis

Redis runs its command processing in a single thread, and all operations are performed sequentially and linearly. But reads and writes block while waiting for input or output, and under normal circumstances an I/O operation cannot return immediately; blocking on the I/O of one connection would leave the whole process unable to serve other clients. I/O multiplexing exists to solve exactly this problem.

The I/O multiplexing mechanism is a mechanism by which multiple descriptors can be monitored and, once a descriptor is ready (usually read or write), the program can be told to read or write accordingly. The use of this mechanism requires select, poll, and epoll. Multiple connections share a blocking object, and the application waits on only one blocking object instead of all connections. When a connection has new data to process, the operating system notifies the application, and the thread returns from the blocked state to begin processing the business.

The Redis server uses the Reactor pattern to implement its file event handler (every network connection corresponds to a file descriptor). This Reactor-based network event handler is called the file event handler, and it consists of four parts:

  • Multiple sockets
  • The I/O multiplexing program
  • The file event dispatcher (Redis is called a single-threaded model because the dispatcher's queue is consumed by a single thread)
  • The event handlers.

Redis’s choice of IO multiplexing functions is shown below:

Redis picks, at build time, the best multiplexing implementation available on the platform: evport on Solaris, epoll on Linux, kqueue on BSD and macOS, with select as the portable fallback.
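The choice is made at compile time in Redis's ae.c, roughly as follows (abridged; the backends are ordered from the most to the least performant):

/* Include the best multiplexing layer supported by this system. */
#ifdef HAVE_EVPORT
#include "ae_evport.c"           /* Solaris event ports */
#else
    #ifdef HAVE_EPOLL
    #include "ae_epoll.c"        /* Linux epoll */
    #else
        #ifdef HAVE_KQUEUE
        #include "ae_kqueue.c"   /* BSD / macOS kqueue */
        #else
        #include "ae_select.c"   /* portable fallback */
        #endif
    #endif
#endif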