[toc]

Original content is hard to produce. For more practical material, follow: unique cloud storage

There is a core weapon in Linux for this: the epoll pool, which shows up in almost every high-concurrency, high-throughput IO system.

IO multiplexing

At the heart of Go is the goroutine, the so-called coroutine. One of the best things about this coroutine implementation is that asynchronous code reads just like synchronous code. For example, in Go, network IO reads and writes look like synchronous calls, but underneath they are asynchronous. The general flow is:

```
write(/* IO parameters */)
    the request is queued; the caller waits for completion
    a background loop program:
        sends the actual network request
        wakes up the calling party when it is done
```

Go coroutines thus give synchronous-looking code on top of asynchronous network IO. The core is an epoll pool that manages the network fds.

In terms of implementation, a single background program manages many fd handles and serves the IO requests of all the business parties. This one-to-many mode of IO is called IO multiplexing.

What does "multiple" mean? Concurrent IO from multiple business parties (handles).

What is "reused"? This single background loop program.

From the IO system designer's point of view, we cannot make demands on the business side; the business side is god and all we can do is obey. If they want to create many fds, then we have to handle all of those fds, preferably concurrently.

If we cannot make demands on the business side, then we can only make demands on the background loop program!

What demands? Be fast! Fast! Fast! That is the core requirement: processing must be quick, and every fd channel should be given the fastest possible experience, as if the loop were running errands for it alone.

Then someone asks: what if each IO request (say, a write) gets its own thread, wouldn't that also make all the IO concurrent? Yes, but there is a bottleneck: once the number of threads grows too large, performance degrades.

We won't compare multithreading with IO multiplexing as approaches to high concurrency here; if you are curious, look into the high-concurrency secrets of Nginx and Redis.

The most naive way to do it?

Can I multiplex I/O without any other system calls?

You can. Just write a for loop: on each pass, try the IO on every handle, process whatever is readable or writable, and sleep when nothing is. And there you have it: one-to-many IO multiplexing.


```
while True:
    forEach fd in handle_array:
        read/write(fd, /* ... */)
    sleep(1s)
```

Wait, there’s a problem. The above program might get stuck on line three, causing the entire system to stop running. Why?

By default, handles are created without special flags and are therefore blocking. When we read and the data is not ready, the call waits; when we write and the fd is not ready to take data, the call also waits by default. So line three of the pseudocode above is likely to block, and the whole thread stops running.

For example, say we have handles 11, 12, and 13, and handle 11 is not ready to read or write. Then read/write(11, /* argument */) gets stuck, even though 12 and 13 are ready: the loop is blocked at 11 and never reaches 12 and 13. That is not what we want. The IO-multiplexing loop thread is a shared service and must not be hung by a single fd.

How to solve this problem?

Just set the fds to non-blocking mode. Then if the data is not ready when we read or write, the call returns EAGAIN instead of blocking, the thread is never stuck, and the system keeps running. If handle 11 is not ready, read/write(11, /* argument */) does not hang; it just reports an EAGAIN error, which we handle specially, and the loop thread can move on to read and write 12 and 13.
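As a concrete illustration, here is a minimal C sketch (my own, not from the original text; the helper names are made up): fcntl switches a handle into non-blocking mode, and the probe helper shows how the loop treats EAGAIN as "nothing to do here, move on".

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Switch a handle to non-blocking mode so read/write never hang the loop. */
int make_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* One probe inside the polling loop: a not-ready fd yields EAGAIN, not a hang. */
void poll_one(int fd)
{
    char buf[4096];
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        return;  /* handle 11 not ready? fine, move on to 12 and 13 */
    }
    /* otherwise: n > 0 is data, n == 0 is EOF, n == -1 is a real error */
}
```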

That is the bare-bones implementation of IO multiplexing. But you never see IO multiplexing implemented this way in a production environment. Why is that?

Because it is not good enough. The for loop sleeps for 1 s on every pass, which makes throughput terrible: all the fds might become ready with IO data right after we go to sleep, and then they still have to wait up to 1 s.

Then drop the sleep from the for loop, so everything is handled promptly?

Promptly, yes, but the CPU goes through the roof. Without the sleep, the loop spins at 100% CPU even when there is no fd to process. That is also unacceptable.

Sleep and the handling is late; don't sleep and you waste CPU.

At this point there is little user space can do on its own; we can only ask the kernel for a mechanism to help, letting the kernel handle the notification and wake us up at exactly the right time.

Let's restate the requirement and the principle of IO multiplexing. IO multiplexing is a mode in which one thread handles multiple fds. The requirement: this "1" must be as fast as possible, avoid every kind of inefficiency, spend all of its time handling IO on the handles, and have no idle spinning and no sleeping.

Is there some facility where we can drop a pile of fds, and the moment any one of them has data to read or write, the background loop thread wakes up and runs at full speed, while handing the CPU back the rest of the time?

Can it be done? Yes, but only a kernel mechanism can satisfy this requirement.

So the Linux kernel has to step in here, right?

Yes. If you want to get rid of the sleep hack, the Linux kernel has to get involved; after all, IO is processed inside the kernel, and the kernel knows best whether the data is ready.

The kernel provides three tools: select, poll, and epoll.

Why are there three?

It is history improving step by step: clumsy -> less clumsy -> wow, an evolution toward efficiency.

Is there any other way Linux can do I/O multiplexing?

I don’t think so!

What do these three do?

All three manage read and write events on fds: they block the thread and give up the CPU when none of the fds is readable or writable, and the thread must be woken up as soon as something happens on an fd.

Of the three, the epoll pool is the most efficient. Why?

In essence, epoll does the least useless work. select and poll need extra copying and have to blindly probe the fds, so their efficiency is naturally lower.

For example, compare select and epoll with a pool managing 1024 handles. When the loop thread wakes up, select does not tell it which of the 1024 handles are IO-ready. What can it do? Only walk all 1024 fds and test them one by one. If only one handle is actually ready, that is over a thousand times as much useless work.

epoll_wait, on the other hand, wakes up with an array of exactly the fds that are ready to be processed. No probing required.
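To make the contrast concrete, here is a minimal C sketch (my own illustration, not from the article; handle_io is a hypothetical stub): after a select wakeup every registered fd must be probed with FD_ISSET, while epoll_wait returns only the entries that are actually ready.

```c
#include <sys/epoll.h>
#include <sys/select.h>

static void handle_io(int fd) { (void)fd; }  /* hypothetical handler stub */

/* select style: wake up, then test all nfds handles one by one. */
static void drain_with_select(const int *fds, int nfds, fd_set *readset)
{
    for (int i = 0; i < nfds; i++) {
        if (FD_ISSET(fds[i], readset))   /* probe 1024 fds even if only 1 is ready */
            handle_io(fds[i]);
    }
}

/* epoll style: epoll_wait hands back exactly the ready entries. */
static void drain_with_epoll(int epollfd)
{
    struct epoll_event events[64];
    int n = epoll_wait(epollfd, events, 64, -1);  /* n <= 0 simply skips the loop */
    for (int i = 0; i < n; i++)                   /* every one of these n entries is ready */
        handle_io(events[i].data.fd);
}
```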

Epoll pool principle

Let's look at how the epoll pool is used and how it works internally.

The system calls epoll involves

The use of epoll is very simple, with just the following three system calls.


```
epoll_create
epoll_ctl
epoll_wait
```

Is this it? Yes, it’s that simple.

  • epoll_create creates a pool that monitors and manages handle fds;

  • epoll_ctl manages adding, deleting, and modifying fds in the pool;

  • epoll_wait is where the thread dozes off and lets the CPU be scheduled elsewhere, but it wakes up as soon as "something happens".

How epoll is efficient

On Linux, epoll is hyped as the secret weapon behind high-concurrency IO. The principle is actually very simple: the epoll implementation does almost no useless work. Let's analyze it step by step from the user's perspective.

The first step of using epoll is to create a pool, with epoll_create:

Prototype:


```c
int epoll_create(int size);
```

Example:


```c
epollfd = epoll_create(1024);
if (epollfd == -1) {
    perror("epoll_create");
    exit(EXIT_FAILURE);
}
```

This pool is a black box to us, and the black box is used to hold fds, so we won't dig into its internals here. What we get back is an epollfd, and this epollfd uniquely represents this epoll pool.

Next, we want to put an fd into this epoll pool, and that is what epoll_ctl is for:

Prototype:


```c
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
```

Example:


```c
if (epoll_ctl(epollfd, EPOLL_CTL_ADD, 11, &ev) == -1) {
    perror("epoll_ctl: listen_sock");
    exit(EXIT_FAILURE);
}
```

The op argument (here EPOLL_CTL_ADD) says whether the operation is an add, a modify, or a delete. The event structure describes which events to listen for, typically readable or writable.
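The example above passes &ev without showing how it is filled, and the third call, epoll_wait, has not appeared in code yet. Here is a minimal sketch (my own illustration, continuing the article's "fd 11" example, error handling trimmed) of filling the event structure and then waiting on the pool:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/epoll.h>

#define MAX_EVENTS 64

void add_and_wait(int epollfd)
{
    struct epoll_event ev;
    ev.events  = EPOLLIN;   /* interested in "readable" */
    ev.data.fd = 11;        /* handed back to us untouched by epoll_wait */

    if (epoll_ctl(epollfd, EPOLL_CTL_ADD, 11, &ev) == -1) {
        perror("epoll_ctl: add");
        exit(EXIT_FAILURE);
    }

    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        /* doze off here; wake up only when some fd in the pool has an event */
        int n = epoll_wait(epollfd, events, MAX_EVENTS, -1);
        if (n == -1) {
            perror("epoll_wait");
            exit(EXIT_FAILURE);
        }
        for (int i = 0; i < n; i++)
            printf("fd %d is ready (events=0x%x)\n",
                   events[i].data.fd, (unsigned)events[i].events);
    }
}
```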

The first efficiency question: fds are added to the pool, and they also get modified and removed. How do we keep those operations fast?

This comes down to the data structure used to manage the fds.

The most obvious idea is a linked list. Would that work? Functionally yes, but performance would suffer. Managing the elements with a list is O(n): every operation traverses the list to find its place, and the bigger the pool, the slower it gets.

Is there a simple and efficient data structure?

Yes: the red-black tree. The Linux kernel's internal implementation of the epoll pool uses a red-black tree to manage the fd handles registered by processes. A red-black tree is a balanced binary search tree with O(log n) operations, so lookups stay fast and stable even while fds are constantly added and removed; with a million fds, a lookup touches only around 20 nodes.

Now for the second efficiency secret: how does epoll know, in time, that the data is ready?

This also involves epoll_ctl. The secret: setting up a callback. Inside epoll_ctl, besides inserting the handle structure into the red-black tree, the other core step is setting up the poll callback.

Two questions follow: what is this poll callback, and how is it set up?

What is file_operations->poll?

As mentioned when we discussed fds, Linux is designed around "everything is a file", and that is not just a slogan; it shows up everywhere. When you implement a file system, you implement this set of file callbacks, described by struct file_operations. The structure contains many functions; a simplified version looks like this:


```c
struct file_operations {
    ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
    ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
    __poll_t (*poll) (struct file *, struct poll_table_struct *);
    int (*open) (struct inode *, struct file *);
    int (*fsync) (struct file *, loff_t, loff_t, int datasync);
    // ...
};
```

So you see read, write, open, fsync, poll, and so on. These are the customizable operations on a file, and every file operation is implemented within this framework. For ext2, the custom read/write are ext2_read and ext2_write; for ext4 they are ext4_read and ext4_write. When you open a concrete "file", the kernel assigns the matching file_operations to the file structure.

The one we care about here is the poll method: file_operations->poll.

This is the hook for a custom readiness-notification mechanism. Through the poll mechanism, the upper layer can tell the lower layer: once this fd becomes readable or writable, put the fd-related structure onto a designated queue in your callback and wake up the waiter.

Take the network card as an example: it sends and receives packets asynchronously. The operating system places data at an agreed location and the card fetches it from there; completions are delivered back through interrupt callbacks. Interrupts are typically split into two halves: the hard interrupt and the soft interrupt. The poll callback simply adds a bit of extra work to the soft-interrupt half, so that the moment a read/write event fires, the upper layer is notified. This form of event notification eliminates the wasted time window entirely.

Highlight: this poll event-callback mechanism is the core of the efficient epoll pool.

The epoll pool can only manage fds whose "file" implements file_operations->poll. In other words, if a "file" lives on a file system that does not implement the poll interface, it cannot use the epoll mechanism.

Second question: how is the poll callback set up?

Inside epoll_ctl, one of the steps is a call to vfs_poll, which contains a check: if the file_operations of the file system behind the fd implements poll, it is called directly; if not, there is no callback to hook, and epoll reports an error for that fd.


```c
static inline __poll_t vfs_poll(struct file *file, struct poll_table_struct *pt)
{
    if (unlikely(!file->f_op->poll))
        return DEFAULT_POLLMASK;
    return file->f_op->poll(file, pt);
}
```

You must be wondering: what actually happens inside that poll call?

In one sentence: it hangs a hook and sets up a wake-up callback path. Epoll wires in the callback function ep_poll_callback, which is simple and does two things:

  1. Put the structure corresponding to the event-ready fd on a specific queue (the ready queue, or ready list);

  2. Wake up the epoll pool: time to get to work!

When an fd becomes readable or writable, a chain of callbacks runs, ep_poll_callback is invoked at the end, the fd's structure lands on the ready queue, and epoll is woken up from epoll_wait.

What is this corresponding structure?

The structure is called epitem; there is one for every fd registered into the epoll pool.

Is the ready queue anything fancy?

Not at all. The ready queue is simple precisely because nothing needs to be searched: any epitem sitting on the ready queue has a ready event and must be processed. So the ready queue is just a plain doubly linked list.

To recap, the key things that make epoll so efficient:

  1. Fds are managed internally with an efficient red-black tree, which keeps add, delete, and modify operations fast and balanced;

  2. Adding an fd to the epoll pool calls file_operations->poll to arrange the callback path; once the fd is ready, notification arrives as an event, which is the most efficient way to operate;

  3. The epoll pool is built around two data structures: the red-black tree and the ready list. The red-black tree serves the user's add, delete, and modify requests; the ready list is the designated place where fds land once their events are ready. epoll only needs to walk the ready list to return the array of all ready fds to the user.

Which fds can be managed with epoll?

Consider another question: not every fd implements the poll interface, so naturally not every fd can be put into an epoll pool. Which file systems do implement poll in their file_operations?

First of all, regular file systems such as ext2, ext4, and XFS do not. In other words, the file systems you use most often, for real files, cannot use epoll.

So who does implement it?

The most common one is the network socket. The network is also where epoll pools are used most. On Linux, everything is a file, and sockets implement their own file_operations logic, socket_file_ops (net/socket.c):


```c
static const struct file_operations socket_file_ops = {
    .read_iter  = sock_read_iter,
    .write_iter = sock_write_iter,
    .poll       = sock_poll,
    // ...
};
```

We can see that sockets implement the poll call, so it is natural for sockets to be managed by an epoll pool.

Anything else?

Yes. There are actually two other typical fds on Linux that are often placed in an epoll pool (a minimal timerfd sketch follows the list below):

  • eventfd: the implementation of eventfd is very simple. As the name suggests, it is meant for event notification: a file fd created with the eventfd system call that transmits no data, only events, and is often used to implement producer/consumer style event signaling;

  • timerfd: a timer fd, created with timerfd_create, that fires a readable event when a point in time is reached.
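As an illustration of the timerfd case, here is a minimal sketch of my own (not from the original; error handling mostly trimmed): it arms a 2-second periodic timer, drops its fd into an epoll pool, and lets epoll_wait wake the loop up on every expiration.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/timerfd.h>

int main(void)
{
    int epollfd = epoll_create(1024);
    int tfd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK);
    if (epollfd == -1 || tfd == -1) {
        perror("create");
        exit(EXIT_FAILURE);
    }

    /* fire 2 seconds from now, then every 2 seconds */
    struct itimerspec its = {
        .it_value    = { .tv_sec = 2 },
        .it_interval = { .tv_sec = 2 },
    };
    timerfd_settime(tfd, 0, &its, NULL);

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = tfd };
    epoll_ctl(epollfd, EPOLL_CTL_ADD, tfd, &ev);

    struct epoll_event events[8];
    for (;;) {
        int n = epoll_wait(epollfd, events, 8, -1);
        for (int i = 0; i < n; i++) {
            uint64_t expirations;
            /* each expiration makes the timerfd readable; the read drains it */
            if (read(events[i].data.fd, &expirations, sizeof(expirations)) == sizeof(expirations))
                printf("timer fired, %llu expiration(s)\n",
                       (unsigned long long)expirations);
        }
    }
}
```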

To summarize:

  1. Fds of true file systems such as ext2, ext4, and XFS cannot be managed with epoll;

  2. Socket fds, eventfds, and timerfds implement the poll call and can be managed in the epoll pool.

In fact, in the Linux module layout, eventfd, timerfd, and the epoll pool are each implemented as small file-system modules.

Food for thought

We have already covered a lot; here are a few simple but interesting questions left for the reader.

Question: Can a single-core CPU achieve parallelism?

No, it can't.

Question: Can a single thread achieve high concurrency?

Yes, it can.

Question: What is the difference between concurrency and parallelism?

Concurrency is about how execution is interleaved over a period of time; parallelism is about what executes at the same instant.

Question: How does a single thread achieve high concurrency?

With IO multiplexing, which is exactly what today's topic, the epoll pool, is about.

Question: Are there open-source examples of single-threaded high concurrency?

Redis and Nginx are great examples. And of course, the Golang runtime is itself designed with high concurrency in mind.


Conclusion

  1. The naive implementation of IO multiplexing is very simple: a one-to-many service pattern, with one loop handling multiple fds;

  2. For IO multiplexing to be truly efficient, the kernel must provide the mechanism, because IO processing and completion happen in the kernel; without the kernel's help, a user-mode program cannot know the precise moment to act;

  3. Remember to make the fds non-blocking;

  4. With an efficient internal management structure plus the poll event-registration mechanism provided by the operating system, the epoll pool achieves efficient fd event management, which is the precondition for high-concurrency IO;

  5. epoll, in full eventpoll, is implemented as a file-system module inside the Linux kernel, which is why people often say that epoll itself is a file system;

  6. The three kinds of "file" fd (socket fd, eventfd, timerfd) all implement the poll interface, so network fds, event fds, and timer fds can all be registered into the pool with epoll_ctl. The most common use is multiplexing network fds;

  7. True file systems such as ext2, ext4, and XFS do not provide a poll interface, so their handles cannot be managed with an epoll pool. Does that mean regular files can never benefit from epoll? Not quite: a library called libaio makes it possible, indirectly, to have file IO events delivered through epoll.

Afterword

The epoll pool is simple to use, but not simple to implement. Once again, the Linux kernel does the heavy lifting for you.

We did not walk through the source code today; instead we took a very small angle and talked through some of the thinking behind epoll. Later, if there is a chance, I will share what sparks asynchronous IO (AIO) and epoll can strike together, and how Golang uses the epoll pool. Stay tuned.
