The basic principle of io_uring

io_uring is a major feature added in Linux 5.1 (May 2019): a new asynchronous I/O interface that is expected to fully address the long-standing limitations of Linux AIO.

io_uring implements asynchronous I/O in a producer-consumer model:

  1. The user process produces I/O requests and puts them into a Submission Queue (SQ).
  2. The kernel consumes the I/O requests in the SQ and, upon completion, puts the results into a Completion Queue (CQ).
  3. The user process harvests I/O results from the CQ.

The SQ and CQ are created when the kernel initializes an io_uring instance. To reduce system calls and data copying between the user process and the kernel, io_uring uses mmap to share the SQ and CQ memory between them.

In addition, the I/O request submitted first is not necessarily completed first. The SQ itself stores only array indices (data type uint32); the SQEs (submission queue entries) are stored in a separate array (the SQ Array). So to submit an I/O request, first find a free SQE in the SQ Array, set it up, and then put its array index into the SQ.

The basic relationships among user processes, kernel, SQ, CQ, and SQ Array are as follows:

(Image from: Getting Hands on with io_uring using Go)

Initialize io_uring

 int io_uring_setup(int entries, struct io_uring_params *params);

The kernel provides the io_uring_setup system call to initialize an io_uring instance.

The return value of io_uring_setup is a file descriptor, referred to here as ring_fd, which is used for the subsequent mmap memory mapping and as a parameter to the other related system calls.

io_uring creates the SQ, CQ, and SQ Array. The entries parameter indicates the size of the SQ and SQ Array. The CQ size is 2 * entries by default.
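
As a rough illustration of the sizing rule above, the sketch below models how the queue sizes and index masks could be derived from entries. It assumes the kernel rounds entries up to a power of two and that the CQ size is left at its default. The names round_up_pow2, ring_sizes, and model_ring_sizes are hypothetical. The authoritative values are always the sq_entries and cq_entries fields that the kernel writes back into params.

```c
#include <stdint.h>

/* Hypothetical helper: round n up to the next power of two.
 * Assumes 1 <= n <= 2^31 so the shift cannot overflow. */
static uint32_t round_up_pow2(uint32_t n) {
    uint32_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* Hypothetical model of the sizes reported back after setup. */
struct ring_sizes {
    uint32_t sq_entries; /* size of the SQ and the SQ Array */
    uint32_t cq_entries; /* size of the CQ (default: twice the SQ) */
    uint32_t sq_mask;    /* slot = position & sq_mask */
    uint32_t cq_mask;    /* slot = position & cq_mask */
};

static struct ring_sizes model_ring_sizes(uint32_t entries) {
    struct ring_sizes s;
    s.sq_entries = round_up_pow2(entries); /* assumed rounding rule */
    s.cq_entries = 2 * s.sq_entries;       /* default CQ size */
    s.sq_mask = s.sq_entries - 1;          /* valid because sizes are powers of two */
    s.cq_mask = s.cq_entries - 1;
    return s;
}
```

Because both sizes are powers of two, a free-running head or tail counter can be turned into a slot number with a single AND against the mask, which is what the ring_mask offsets in params are for.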

The params parameter is both an input parameter and an output parameter and is defined as follows:

struct io_uring_params {
    __u32 sq_entries;
    __u32 cq_entries;
    __u32 flags;
    __u32 sq_thread_cpu;
    __u32 sq_thread_idle;
    __u32 features;
    __u32 resv[4];
    struct io_sqring_offsets sq_off;
    struct io_cqring_offsets cq_off;
};

struct io_sqring_offsets {
    __u32 head;
    __u32 tail;
    __u32 ring_mask;
    __u32 ring_entries;
    __u32 flags;
    __u32 dropped;
    __u32 array;
    __u32 resv[3];
};

struct io_cqring_offsets {
    __u32 head;
    __u32 tail;
    __u32 ring_mask;
    __u32 ring_entries;
    __u32 overflow;
    __u32 cqes;
    __u32 flags;
    __u32 resv[3];
};
  • flags, sq_thread_cpu, and sq_thread_idle are input parameters that set some of the properties of io_uring.
  • resv[4] is a reserved field, so we’ll ignore it for now.
  • The other parameters are output parameters set by the kernel and may be used by the user process for initialization and feature checks:
    • sq_entries is the size of the submission queue.
    • cq_entries is the size of the completion queue.
    • features describes the io_uring features supported by the current kernel version. Among them, IORING_FEAT_SINGLE_MMAP is an important one: it means the kernel supports mapping the SQ and CQ with a single mmap, as shown in io_uring_mmap in liburing.
    • sq_off holds the offsets of the SQ’s fields.
    • cq_off holds the offsets of the CQ’s fields.

The user process needs to mmap the SQ, CQ, and SQ Array before it can share them with the kernel. Here is the sample code:

#include <errno.h>
#include <sys/mman.h>
#include <linux/io_uring.h>

int io_uring_mmap(int ring_fd, struct io_uring_params *p) {
  unsigned sq_ring_sz = p->sq_off.array + p->sq_entries * sizeof(unsigned);
  unsigned cq_ring_sz =
      p->cq_off.cqes + p->cq_entries * sizeof(struct io_uring_cqe);
  unsigned sq_array_size = p->sq_entries * sizeof(struct io_uring_sqe);

  // With IORING_FEAT_SINGLE_MMAP, SQ and CQ share one mapping,
  // so size it for the larger of the two rings.
  if (p->features & IORING_FEAT_SINGLE_MMAP) {
    if (cq_ring_sz > sq_ring_sz) {
      sq_ring_sz = cq_ring_sz;
    }
    cq_ring_sz = sq_ring_sz;
  }

  // Create a memory map for SQ and CQ
  void *sq_ring_ptr =
      mmap(NULL, sq_ring_sz, PROT_READ | PROT_WRITE,
           MAP_SHARED | MAP_POPULATE, ring_fd, IORING_OFF_SQ_RING);
  if (sq_ring_ptr == MAP_FAILED) {
    return -errno;
  }

  void *cq_ring_ptr = NULL;
  if (p->features & IORING_FEAT_SINGLE_MMAP) {
    cq_ring_ptr = sq_ring_ptr;
  } else {
    cq_ring_ptr = mmap(NULL, cq_ring_sz, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_POPULATE, ring_fd, IORING_OFF_CQ_RING);
    if (cq_ring_ptr == MAP_FAILED) {
      int err = errno;  // save errno before munmap can overwrite it
      munmap(sq_ring_ptr, sq_ring_sz);
      return -err;
    }
  }

  // Create a memory mapping for the SQ Array
  void *sq_array_ptr =
      mmap(NULL, sq_array_size, PROT_READ | PROT_WRITE,
           MAP_SHARED | MAP_POPULATE, ring_fd, IORING_OFF_SQES);
  if (sq_array_ptr == MAP_FAILED) {
    int err = errno;
    if (cq_ring_ptr != sq_ring_ptr) {
      munmap(cq_ring_ptr, cq_ring_sz);
    }
    munmap(sq_ring_ptr, sq_ring_sz);
    return -err;
  }

  // ...
}

How to submit an I/O request

After initialization is complete, we need to submit I/O requests to io_uring. By default, submitting an I/O request with io_uring requires these steps:

  1. Find a free SQE in the SQ Array.
  2. Set up this SQE for the specific I/O request.
  3. Put the array index of the SQE into the SQ.
  4. Call the io_uring_enter system call to submit the I/O requests in the SQ.
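
Steps 1 to 3 above can be sketched with a small user-space model. This is not the real kernel interface: mock_sqe and mock_sq are hypothetical stand-ins for struct io_uring_sqe and the mmap'ed ring, the ring size is fixed at 4 for brevity, and each SQE is assumed to sit at the same slot as its SQ entry. Step 4 (io_uring_enter) is left out because the model has no kernel side.

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING_ENTRIES 4u               /* must be a power of two */
#define RING_MASK (RING_ENTRIES - 1)

/* Simplified, hypothetical stand-in for struct io_uring_sqe. */
struct mock_sqe {
    uint8_t  opcode;
    int32_t  fd;
    uint64_t user_data;
};

/* User-space model of the SQ: an index ring plus a separate SQE array. */
struct mock_sq {
    _Atomic uint32_t head;              /* advanced by the consumer (kernel) */
    _Atomic uint32_t tail;              /* advanced by the producer (user)   */
    uint32_t array[RING_ENTRIES];       /* SQ: holds indices into sqes[]     */
    struct mock_sqe sqes[RING_ENTRIES]; /* SQ Array: the SQEs themselves     */
};

/* Claim a free SQE, fill it in, and publish its index in the SQ.
 * Returns 0 on success, -1 if the queue is full. */
static int mock_submit(struct mock_sq *sq, uint8_t opcode, int32_t fd,
                       uint64_t user_data) {
    uint32_t head = atomic_load_explicit(&sq->head, memory_order_acquire);
    uint32_t tail = atomic_load_explicit(&sq->tail, memory_order_relaxed);
    if (tail - head == RING_ENTRIES)
        return -1;                      /* no free slot */

    uint32_t idx = tail & RING_MASK;    /* step 1: pick a free SQE */
    sq->sqes[idx].opcode = opcode;      /* step 2: describe the request */
    sq->sqes[idx].fd = fd;
    sq->sqes[idx].user_data = user_data;
    sq->array[idx] = idx;               /* step 3: put the SQE index in the SQ */

    /* Release store: the consumer must see the SQE before the new tail. */
    atomic_store_explicit(&sq->tail, tail + 1, memory_order_release);
    return 0;
}
```

The tail is published last and with release ordering so that a consumer reading the new tail is guaranteed to see a fully written SQE behind it.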

To further improve performance, io_uring can have the kernel poll for submitted I/O requests (set the IORING_SETUP_SQPOLL bit of params.flags): the kernel creates a thread (the SQ thread) that polls the SQ and automatically submits any pending I/O requests:

  • If I/O requests keep coming in, the SQ thread keeps polling and submitting them to the kernel, without a system call.
  • If the SQ thread stays idle for longer than sq_thread_idle milliseconds, it goes to sleep and sets the IORING_SQ_NEED_WAKEUP bit in the SQ ring's flags field (at sq_ring_ptr + p.sq_off.flags).
  • The user process needs to check whether IORING_SQ_NEED_WAKEUP is set in flags to determine whether it must call io_uring_enter to wake up the SQ thread:
/* fills in new sqe entries */
add_more_io();
/*
 * need to call io_uring_enter() to make the kernel notice the new IO
 * if polled and the thread is now sleeping.
 */
if ((*sqring->flags) & IORING_SQ_NEED_WAKEUP)
    io_uring_enter(ring_fd, to_submit, min_complete, IORING_ENTER_SQ_WAKEUP, NULL);

How to harvest I/O results

By default, io_uring_enter is called to harvest I/O:

io_uring_enter(ring_fd, to_submit, min_complete, IORING_ENTER_GETEVENTS, NULL);

If the number of completed I/Os is less than min_complete, the call blocks.

If the IORING_SETUP_IOPOLL bit of params.flags was set when io_uring_setup was called, io_uring_enter behaves differently when harvesting I/Os: if the number of completed I/Os is not 0, it does not block and returns immediately.

A user process can also harvest I/O directly by iterating over the CQ’s [head, tail) range, retrieving and processing each completed CQE, and then advancing the head pointer to tail.
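
The harvesting loop described above can be modeled in user space as follows. This is a sketch under stated assumptions, not the real interface: mock_cqe and mock_cq are hypothetical stand-ins for struct io_uring_cqe and the mmap'ed CQ, mock_post plays the role of the kernel posting a completion, and sum_results is just an example handler.

```c
#include <stdatomic.h>
#include <stdint.h>

#define CQ_ENTRIES 8u               /* must be a power of two */
#define CQ_MASK (CQ_ENTRIES - 1)

/* Simplified, hypothetical stand-in for struct io_uring_cqe. */
struct mock_cqe {
    uint64_t user_data; /* copied from the matching SQE */
    int32_t  res;       /* result of the I/O: bytes or -errno */
};

struct mock_cq {
    _Atomic uint32_t head; /* advanced by the user process */
    _Atomic uint32_t tail; /* advanced by the kernel       */
    struct mock_cqe cqes[CQ_ENTRIES];
};

/* Model only: stands in for the kernel posting one completion.
 * There is no overflow check here. */
static void mock_post(struct mock_cq *cq, uint64_t user_data, int32_t res) {
    uint32_t tail = atomic_load_explicit(&cq->tail, memory_order_relaxed);
    cq->cqes[tail & CQ_MASK] = (struct mock_cqe){ user_data, res };
    atomic_store_explicit(&cq->tail, tail + 1, memory_order_release);
}

/* Example handler: accumulate the res fields into an int64_t. */
static void sum_results(const struct mock_cqe *cqe, void *ctx) {
    *(int64_t *)ctx += cqe->res;
}

/* Process every CQE in [head, tail), then advance head to tail.
 * Returns the number of completions handled. */
static unsigned mock_reap(struct mock_cq *cq,
                          void (*handle)(const struct mock_cqe *, void *),
                          void *ctx) {
    uint32_t head = atomic_load_explicit(&cq->head, memory_order_relaxed);
    /* Acquire load: CQE contents written before tail are now visible. */
    uint32_t tail = atomic_load_explicit(&cq->tail, memory_order_acquire);
    unsigned n = 0;
    while (head != tail) {
        handle(&cq->cqes[head & CQ_MASK], ctx);
        head++;
        n++;
    }
    /* Release store: hand the consumed slots back to the producer. */
    atomic_store_explicit(&cq->head, head, memory_order_release);
    return n;
}
```

Since only the head pointer is written by the user process and only the tail by the kernel, no lock is needed. The acquire and release operations are enough to order the CQE contents against the pointer updates.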

Thus, with kernel polling of the SQ enabled, io_uring can both submit and harvest I/O without a system call.

io_uring_register

int io_uring_register(unsigned int fd, unsigned int opcode,
                      void *arg, unsigned int nr_args);

io_uring_register registers resources such as an array of file descriptors or an array of iovec buffers with an io_uring instance to improve the performance of I/O operations.

  • IORING_REGISTER_FILES 

Each time an I/O request is submitted, the kernel increments the reference count of the file behind the fd, and after each I/O completes it decrements the count again. These reference count changes are atomic operations, which can seriously affect performance in high-IOPS scenarios.

A user process can register an array of file descriptors with io_uring. The kernel increments the reference count of these files once at registration time and decrements it only when the registration is cancelled or the io_uring instance is destroyed.

Then, when submitting an I/O request, set IOSQE_FIXED_FILE in the SQE's flags and set the SQE's fd field to the index into the fd array registered with io_uring. This avoids the overhead of atomically incrementing and decrementing the reference count on every I/O.
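
As a hedged sketch of the last step, the helper below fills in an SQE that refers to a registered file by index. It assumes the fd array has already been registered with IORING_REGISTER_FILES and that the kernel headers provide linux/io_uring.h. The name prep_readv_fixed_file is hypothetical, and only the fields needed for a vectored read are set.

```c
#include <linux/io_uring.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical helper: prepare a readv on slot `file_index` of the fd
 * array previously registered with IORING_REGISTER_FILES. */
static void prep_readv_fixed_file(struct io_uring_sqe *sqe,
                                  unsigned file_index,
                                  const struct iovec *iov, unsigned nr_iov,
                                  uint64_t offset) {
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = IORING_OP_READV;
    sqe->flags = IOSQE_FIXED_FILE;         /* fd below is an index, not a real fd */
    sqe->fd = (int)file_index;             /* index into the registered array */
    sqe->off = offset;                     /* file offset to read from */
    sqe->addr = (uint64_t)(uintptr_t)iov;  /* pointer to the iovec array */
    sqe->len = nr_iov;                     /* number of iovecs */
}
```

Without IOSQE_FIXED_FILE the same fd field would be interpreted as an ordinary file descriptor, so forgetting the flag silently targets a different file.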

  • IORING_REGISTER_BUFFERS

When using O_DIRECT, the kernel needs to map the user process’s memory into the kernel when an I/O operation is submitted, and to unmap it again after the I/O operation completes.

At high IOPS, frequently creating and tearing down these memory mappings is expensive. A user process can pre-register an iovec array with an io_uring instance to establish the mappings up front; they are torn down only when the registration is cancelled or the io_uring instance is destroyed.

Subsequent I/O requests can then use IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED to perform I/O with these registered (fixed) buffers.
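
A minimal sketch of such a request is shown below, assuming the buffers were registered beforehand with IORING_REGISTER_BUFFERS and that linux/io_uring.h is available. The helper name prep_read_fixed is hypothetical. It sets the same fields that liburing's io_uring_prep_read_fixed is understood to set, but it is not taken from liburing.

```c
#include <linux/io_uring.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: prepare an IORING_OP_READ_FIXED request.
 * `buf` must point inside the buffer registered at position `buf_index`. */
static void prep_read_fixed(struct io_uring_sqe *sqe, int fd, void *buf,
                            unsigned len, uint64_t offset,
                            uint16_t buf_index) {
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = IORING_OP_READ_FIXED;
    sqe->fd = fd;
    sqe->off = offset;                     /* file offset to read from */
    sqe->addr = (uint64_t)(uintptr_t)buf;  /* destination inside the fixed buffer */
    sqe->len = len;                        /* number of bytes to read */
    sqe->buf_index = buf_index;            /* which registered buffer to use */
}
```

A write would look the same with IORING_OP_WRITE_FIXED as the opcode. In both cases buf_index tells the kernel which pre-mapped buffer the address belongs to, so no per-request mapping is needed.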

Summary

io_uring addresses some of the shortcomings of Linux AIO by sharing two ring queues between the user process and the kernel, which reduces system calls and memory copies.

io_uring is simple in principle. However, its interfaces are very detailed, and many of the parameters are not explained here. To learn more about io_uring, we recommend taking a closer look at the resources below.

Resources

  1. io_uring_setup
  2. io_uring_enter
  3. io_uring_register
  4. Efficient IO with io_uring
  5. io_uring performance data
  6. Getting Hands on with io_uring using Go
  7. The rapid growth of io_uring
  8. Ringing in a new asynchronous I/O API
  9. liburing
  10. io_uring-by-example
  11. io_uring tools