BIO

int main(void)
{
    int sk = socket(AF_INET, SOCK_STREAM, 0);  /* create a TCP socket */
    connect(sk, ...);                          /* connect to the server */
    recv(sk, ...);                             /* block until data arrives */
}

In high-concurrency server development, the performance of this kind of network IO is extremely poor, because:

1. The recv call will most likely block, causing a process switch (the data has not arrived yet).

2. When data for the connection is ready, the process is woken up: another process switch (it is woken after the data has been copied into the socket receive buffer).

3. A process can only wait on one connection at a time, so handling many concurrent connections requires many processes, each blocked in its socket's wait queue (a sketch of this pattern follows below).
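To make point 3 concrete, here is a minimal sketch of the process-per-connection pattern that blocking IO forces. It is illustrative only: error handling is omitted and the port number 8080 is an arbitrary choice for the example.

/* Sketch: blocking IO forces one process (or thread) per connection. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);            /* port chosen for illustration */

    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, 128);

    for (;;) {
        int cfd = accept(lfd, NULL, NULL);  /* blocks until a client connects */
        if (fork() == 0) {                  /* one child process per connection */
            char buf[4096];
            /* the child now blocks in recv(); with N connections we pay for
             * N sleeping processes plus the switches needed to wake them */
            while (recv(cfd, buf, sizeof(buf), 0) > 0)
                ;  /* handle the received data here */
            close(cfd);
            _exit(0);
        }
        close(cfd);  /* the parent keeps only the listening socket */
    }
}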

In the demo above there were only two or three lines of code, but the user process and the kernel did a lot of work together. First, the user process issues the socket call and switches into kernel mode to initialize the kernel objects. Next, Linux handles the arriving packets in hard-interrupt context and in the ksoftirqd kernel thread. When ksoftirqd has finished processing, it notifies the relevant user process.

Create a socket

After the socket call executes, the kernel internally creates a series of socket-related kernel objects (yes, more than one).

When a packet is received in the soft interrupt, processes waiting on the sock are woken up by calling the sk_data_ready function pointer (which is actually set to sock_def_readable()). We'll come back to this when we discuss soft interrupts; just keep it in mind for now.

At this point, a TCP object, or more precisely a SOCK_STREAM object of the AF_INET protocol family, has been created. This is the cost of one socket system call.

Wait to receive messages

  1. The recv function is implemented on top of the recvfrom system call (see the snippet after this list).

  2. After the system call, the user process enters kernel mode and checks whether the corresponding socket has data. If there is none, the process blocks itself on the socket's wait queue, gives up the CPU, and waits to be woken up.
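POSIX specifies that recv() is equivalent to recvfrom() with the address arguments set to NULL, which is why tracing a recv() call on x86-64 Linux typically shows a recvfrom system call. A minimal illustration (sk is assumed to be an already-connected socket):

#include <sys/socket.h>
#include <sys/types.h>

/* sk is assumed to be a connected TCP socket. Both calls below block until
 * data arrives in the socket receive queue (or the peer closes the connection). */
ssize_t read_some(int sk, char *buf, size_t len)
{
    ssize_t n = recv(sk, buf, len, 0);
    /* equivalent: ssize_t n = recvfrom(sk, buf, len, 0, NULL, NULL); */
    return n;
}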

Soft interrupt module

After the soft interrupt module puts the data into the socket's receive queue, it wakes the process on the wait queue through the callback function; the process enters the ready state and waits to be scheduled onto a CPU.

Four: Summary

The first part is the process our code runs in. The socket() call enters kernel mode to create the necessary kernel objects. The recv() call enters kernel mode, checks the receive queue, and, if there is no data to process yet, blocks the current process to free the CPU.

The second part is the hard-interrupt and soft-interrupt context (the kernel thread ksoftirqd). Here packets are processed and placed into the socket's receive queue. Then, through the socket kernel object, the process blocked in its wait queue is found and woken up.

Every time a process has to wait for data on a socket, it must be taken off the CPU so another process can run; when the data is ready, the sleeping process is woken up again. That is two process context switches in total, and each switch costs roughly 3-5 µs according to earlier measurements. For a network-IO-intensive application, the CPU ends up doing endless process switching that accomplishes nothing useful.

Select and epoll

Comparison of epoll with select and poll

1. Passing file descriptors from user mode into the kernel

Select: builds three file descriptor sets (for read, write, and exception events) and copies them into the kernel to be monitored. There is a limit to the number of fds a single process can monitor with select; the default is 1024.

Poll: copies the array of struct pollfd structures passed in into the kernel for monitoring.

Epoll: epoll_create creates a red-black tree and a ready list (which stores ready file descriptors) in the kernel. epoll_ctl then adds file descriptors to the red-black tree, as sketched below.
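A side-by-side sketch of what "passing descriptors to the kernel" looks like from user space (fd1 and fd2 are assumed to be already-connected sockets; the select and poll calls are shown commented out):

#include <poll.h>
#include <sys/epoll.h>
#include <sys/select.h>

void register_examples(int fd1, int fd2)
{
    /* select: the fd_set is rebuilt in user space and copied into the kernel
     * on every select() call; monitored fds must be below FD_SETSIZE (1024). */
    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(fd1, &rfds);
    FD_SET(fd2, &rfds);
    /* select(max_fd + 1, &rfds, NULL, NULL, NULL); */

    /* poll: an array of struct pollfd is copied into the kernel on each call. */
    struct pollfd pfds[2] = {
        { .fd = fd1, .events = POLLIN },
        { .fd = fd2, .events = POLLIN },
    };
    /* poll(pfds, 2, -1); */

    /* epoll: each descriptor is added to the kernel's red-black tree once;
     * later epoll_wait() calls do not pass the descriptors in again. */
    int efd = epoll_create1(0);
    struct epoll_event ev;
    ev.events = EPOLLIN;
    ev.data.fd = fd1;
    epoll_ctl(efd, EPOLL_CTL_ADD, fd1, &ev);
    ev.data.fd = fd2;
    epoll_ctl(efd, EPOLL_CTL_ADD, fd2, &ev);
}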

2. How the kernel detects whether file descriptors are readable or writable

Select: polls through all fds; each fd's poll method returns a mask describing whether the descriptor is ready for reading or writing, and the fd_set is filled in according to that mask.

Poll: also uses polling to query each fd's status; if an fd is ready, an entry is recorded and the traversal continues.

Epoll: uses a callback mechanism. When epoll_ctl's add operation executes, it not only puts the file descriptor on the red-black tree but also registers a callback function; when the kernel detects that the descriptor has become readable/writable, it calls this function, which places the descriptor in the ready list.

3. Finding the ready file descriptors and passing them to user mode

Select: copies the fd_set back to user space and returns the total number of ready file descriptors. User space does not know which descriptors are ready and has to scan the sets (e.g. with FD_ISSET) to find out; see the scan sketch after this list.

Poll: copies the pollfd array back to user space and returns the total number of ready file descriptors. User space does not know which descriptors are ready and has to traverse the array to find out.

Epoll: epoll_wait only has to look at the ready list; the kernel copies the ready file descriptors into the array passed in and returns how many there are, so the caller just iterates over that array.

Because only the ready events are copied into the user-supplied array, the amount of data moved between kernel and user space on each call stays small, avoiding unnecessary copying of the full descriptor set.
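For contrast, this is what the user-space scan looks like after select() returns: every candidate descriptor has to be tested with FD_ISSET, even though only a few may be ready. A sketch, assuming rfds and maxfd were built as in the earlier registration example:

#include <sys/select.h>
#include <unistd.h>

/* After select() returns, the caller must test every candidate fd;
 * the kernel only reports how many are ready, not which ones. */
void scan_after_select(fd_set *rfds, int maxfd)
{
    char buf[4096];
    for (int fd = 0; fd <= maxfd; fd++) {
        if (FD_ISSET(fd, rfds))
            read(fd, buf, sizeof(buf));  /* handle the ready descriptor */
    }
}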

4. Repeated monitoring

Select: Copy the new set of listener file descriptors into the kernel, continuing with the previous steps.

Poll: copy the new struct pollfd array into the kernel and continue with the steps above.

Epoll: no need to rebuild anything; the existing red-black tree is reused and the loop simply calls epoll_wait again (see the event-loop sketch below).
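To illustrate point 4: once descriptors are in epoll's red-black tree, the monitoring loop just keeps calling epoll_wait, and only the ready events come back, so there is nothing to rebuild between iterations. A minimal sketch, assuming efd was set up as in the registration example above:

#include <sys/epoll.h>
#include <unistd.h>

/* efd is an epoll instance whose sockets were already registered with
 * EPOLL_CTL_ADD; nothing has to be re-registered between iterations. */
void event_loop(int efd)
{
    struct epoll_event events[64];
    char buf[4096];

    for (;;) {
        /* blocks until at least one registered fd becomes ready */
        int n = epoll_wait(efd, events, 64, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;   /* only ready fds are returned */
            ssize_t r = read(fd, buf, sizeof(buf));
            if (r <= 0)
                close(fd);                /* peer closed or error */
            /* otherwise handle r bytes from fd */
        }
    }
}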

The calling code

int main(void)
{
    listen(lfd, ...);
    cfd1 = accept(...);
    cfd2 = accept(...);
    efd = epoll_create(...);
    epoll_ctl(efd, EPOLL_CTL_ADD, cfd1, ...);
    epoll_ctl(efd, EPOLL_CTL_ADD, cfd2, ...);
    epoll_wait(efd, ...);
}

Eventpoll kernel object

If you look at this figure together with the previous one, you can see that after the two socket objects (5000 and 5001) were created, an eventpoll object (5003) was created as well.

The meanings of several members of the eventpoll structure are as follows:

wq: the wait queue. When soft-interrupt data is ready, wq is used to find the user process blocked on this epoll object.

rbr: a red-black tree. To support efficient lookup, insertion, and deletion across a massive number of connections, eventpoll uses a red-black tree internally. It manages all the socket connections the user process has added.

rdllist: the linked list of ready descriptors. When a connection becomes ready, the kernel puts it on rdllist, so the application only has to check this list to find ready connections instead of walking the entire tree. A simplified sketch of the structure follows below.
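A heavily simplified sketch of these three members, loosely following the struct eventpoll definition in fs/eventpoll.c (kernel context; field names and types vary between kernel versions and the real structure has many more fields):

/* Simplified sketch only; see fs/eventpoll.c in the kernel source for the
 * real definition. */
#include <linux/list.h>
#include <linux/rbtree.h>
#include <linux/wait.h>

struct eventpoll_sketch {
    wait_queue_head_t wq;       /* wq: processes blocked in epoll_wait() */
    struct list_head  rdllist;  /* rdllist: epitems whose sockets are ready */
    struct rb_root    rbr;      /* rbr: red-black tree of registered sockets */
};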

Two: epoll_ctl adds a socket

When each socket is registered with epoll_ctl, the kernel does three things:

1. Allocate a red-black tree node object, an epitem.

2. Add a wait item to the socket's wait queue, with ep_poll_callback as its callback function.

3. Insert the epitem into the epoll object's red-black tree.

After adding two sockets with epoll_ctl, the resulting kernel data structure in the process looks like this:

1. Allocate and initialize an epitem

For each socket, an epitem is allocated when epoll_ctl is called.

Each socket corresponds to one epitem, which is stored in the eventpoll red-black tree.

2. Set up the socket wait queue

When a socket is registered with an eventpoll kernel object, a callback function is added to the socket's wait queue. When data arrives, the soft interrupt handler invokes this callback (ep_poll_callback), which in turn wakes the thread blocked on the eventpoll wait queue.

3. Insert into the red-black tree

After allocating the epitem, it is immediately inserted into the red-black tree. An epoll red-black tree with a few socket descriptors inserted looks like this:

The main reason for choosing a red-black tree is that it offers a good balance between search, delete, and insert efficiency.

Three: epoll_wait waits for data

epoll_wait does nothing complicated: when called, it looks for data in the eventpoll->rdllist list. If there is data it returns; if not, it creates a wait queue item, adds it to the eventpoll wait queue, and then blocks itself.

Four: the data arrives

After the soft interrupt handler puts the data into the socket's receive queue, it calls the callback function on the socket's wait queue, which first adds the epitem corresponding to that socket to epoll's ready queue and then wakes up the user thread blocked on the epoll wait queue. The user thread only has to traverse the epitems on the ready queue to find the socket objects that contain data, avoiding a scan of all sockets.

Five: summary

To sum up, the kernel operating environment of epoll-related functions is divided into two parts:

The user process in kernel mode. A function such as epoll_wait traps the process into kernel mode. This part of the code checks whether anything is ready and blocks the current process to free the CPU if not.

Hard and soft interrupt context. Here packets are received from the network adapter, processed, and placed on the socket's receive queue. For epoll, the epitem associated with the socket is then found and added to the epoll object's ready list, and if a process is blocked on the epoll wait queue, it is woken up.

To cover every detail, this article has walked through a lot of the flow, including the blocking path.

But in practice, epoll_wait doesn't have to block at all if there is enough work: the user process keeps working until epoll_wait finds nothing left to do, and only then voluntarily relinquishes the CPU. This is where epoll gets its efficiency!

Six: the principle of select

Select essentially works by setting and checking a data structure that holds fd flag bits. This brings a disadvantage: the maximum number of descriptors a single select call can monitor is defined by the macro FD_SETSIZE, which is 32 integers wide (32×32 = 1024 bits on a 32-bit machine, 32×64 = 2048 bits on a 64-bit machine). You can change it and recompile the kernel, but performance may suffer and further testing would be required. The number of files a process can open ultimately depends on system memory; the system-wide limit can be checked with cat /proc/sys/fs/file-max. By default the fd_set size is 1024 on 32-bit machines and 2048 on 64-bit machines.

(1) Copy fd_set from user space into kernel space using copy_from_user.

(2) Register the callback function __pollwait.

(3) Traverse all fds and call each one's poll method (for a socket this is sock_poll, which dispatches to tcp_poll, udp_poll, or datagram_poll as appropriate).

(4) Taking tcp_poll as an example, its core work is done through __pollwait, the callback registered above.

(5) The main job of __pollwait is to attach current (the process calling select) to the device's wait queue. Different devices have different wait queues; for tcp_poll the wait queue is sk->sk_sleep. When the device receives data (a network device) or fills in file data (a disk device), it wakes up the processes sleeping on its wait queue, and current is woken.

(6) When the poll method returns, it returns a mask describing whether read and write operations are ready, and fd_set is filled in according to this mask.

(7) If the traversal of all fds finds nothing readable or writable, schedule_timeout() is called to put current (the process that called select) to sleep. When a device driver becomes readable or writable, it wakes up the processes sleeping on its wait queue. If nothing wakes the process within the given timeout (specified by schedule_timeout), it is woken anyway, gets the CPU again, and re-traverses the fds to check whether any are ready.

(8) Copy fd_set from kernel space back to user space.

Seven: edge triggering and level triggering

EPOLLLT and EPOLLET:

LT, the default mode (level-triggered): as long as an fd has readable data, every call to epoll_wait will report it in the returned events, reminding the user program to handle it. ET, the "high-speed" mode (edge-triggered): you are notified only once when data arrives, and you will not be notified again until new data comes in, regardless of whether the fd still has unread data. Therefore, in ET mode, when reading from an fd you must drain its buffer completely, i.e. keep reading until read returns less than the requested amount or fails with EAGAIN, as in the sketch below.
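The "read until EAGAIN" rule for ET mode translates into a loop like this. A sketch, assuming the fd has been made non-blocking (O_NONBLOCK), which is effectively mandatory with EPOLLET:

#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Edge-triggered handling: drain the socket completely, because epoll_wait
 * will not report this fd again until new data arrives. */
void drain_fd(int fd)
{
    char buf[4096];
    for (;;) {
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n > 0) {
            /* process n bytes here */
            continue;
        }
        if (n == 0) {            /* peer closed the connection */
            close(fd);
            return;
        }
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return;              /* buffer drained: safe to return to epoll_wait */
        if (errno == EINTR)
            continue;            /* interrupted by a signal: retry */
        close(fd);               /* real error */
        return;
    }
}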

Epoll uses an event-based readiness notification mode: a fd is registered once with epoll_ctl, and once the fd becomes ready, the kernel uses a callback-like mechanism to activate it and epoll_wait is notified.

Meaning of EPOLLET trigger mode

With EPOLLLT, once the system accumulates a large number of ready file descriptors that you do not actually need to read or write, they are returned on every epoll_wait call, which greatly reduces the efficiency of retrieving the ready descriptors you do care about. With EPOLLET, epoll_wait notifies the handler when a read or write event occurs on a monitored descriptor. If you do not consume all the data that time (for example, because the read/write buffer is too small), epoll_wait will not notify you again on the next call; it notifies you only once, until a new read/write event occurs on that descriptor. This is more efficient than level triggering, and the system is not flooded with ready descriptors you do not care about.

Eight: advantages

The biggest advantage of epoll is that it only concerns itself with "active" connections; its cost does not grow with the total number of connections, so in real network environments epoll is far more efficient than select and poll. It also avoids repeatedly copying the whole descriptor set between user and kernel space: descriptors are registered once, and only the ready events are copied back to the caller.
