io_uring is a new asynchronous I/O mechanism introduced in the Linux 5.x era, and it is widely regarded as the future of asynchronous I/O on Linux.

This article explores a number of design problems in safely encapsulating io_uring in Rust and suggests some possible solutions.

How io_uring works

io_uring consists of two queues: the Submission Queue (SQ) and the Completion Queue (CQ). The submission queue holds asynchronous tasks waiting to be executed, and the completion queue holds completion events.

The io_uring structures are allocated by the kernel, and user space maps their memory with mmap. Kernel and user space thus share memory and can pass data in both directions without extra system calls.

The conceptual workflow has three phases:

  1. Preparation: the application obtains some Submission Queue Entries (SQEs), assigns an asynchronous task to each SQE, and initializes it with an opcode and parameters.
  2. Submission: the application pushes the SQEs into the SQ and notifies the kernel of the new tasks through a system call, or lets the kernel poll continuously to pick up tasks.
  3. Completion: the application retrieves Completion Queue Events (CQEs) from the CQ, uses user_data to identify and wake up the corresponding threads/coroutines in the application, and passes along the return values (see the sketch after this list).
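
To make the three phases concrete, here is a minimal blocking sketch. It uses the community io-uring crate (not one of the libraries listed in the ecosystem section below); the file name and user_data value are arbitrary, and the exact method signatures vary between crate versions.

use io_uring::{opcode, types, IoUring};
use std::os::unix::io::AsRawFd;
use std::{fs, io};

fn main() -> io::Result<()> {
    let mut ring = IoUring::new(8)?; // SQ/CQ memory shared with the kernel
    let fd = fs::File::open("example.txt")?;
    let mut buf = vec![0u8; 1024];

    // Phase 1: preparation - fill an SQE with an opcode and parameters.
    let read_e = opcode::Read::new(types::Fd(fd.as_raw_fd()), buf.as_mut_ptr(), buf.len() as _)
        .build()
        .user_data(0x42);

    // Phase 2: submission - push the SQE and tell the kernel about it.
    // Safety: fd and buf must stay valid until the operation completes.
    unsafe {
        ring.submission().push(&read_e).expect("submission queue is full");
    }
    ring.submit_and_wait(1)?;

    // Phase 3: completion - reap the CQE and match it by user_data.
    let cqe = ring.completion().next().expect("completion queue is empty");
    assert_eq!(cqe.user_data(), 0x42);
    println!("read returned {}", cqe.result());
    Ok(())
}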

Epoll is an implementation of the Reactor model, while io_uring is an implementation of the Proactor model.

This means that epoll-based programs are difficult to migrate to io_uring directly.

Problem 1: Changing the asynchronous model is not easy, unless the differences are papered over at the cost of some performance.

Problem 2: io_uring requires a fairly new kernel. For now, applications have to consider how to fall back when the newer io_uring features are unavailable.

io_uring constraints

In the blocking synchronous model and in non-blocking synchronous models such as epoll, a user-mode IO operation is a one-shot affair, with no need to worry about resource lifetimes.

io_uring, however, is a Proactor: a non-blocking asynchronous model that imposes constraints on the lifetimes of resources.

Take Read as an example: it has two resource parameters, fd and buf. When preparing the IO operation, we fill the SQE with fd, the buf pointer, and count, and we must ensure that both fd and buf remain valid until the kernel completes or cancels the task.

Unexpected fd replacement

fd = 6, buf = 0x5678;
prepare SQE;
close fd = 6;
open -> fd = 6;
submit SQE;
kernel performs IO;

The application “accidentally” closes and reopens a file before submitting the SQE, so the IO operation ends up running on a completely unrelated file.

Stack memory UAF

char stack_buf[1024];
fd = 6, buf = &stack_buf;
prepare SQE;
submit SQE;
function returns;
kernel performs IO;

IO executed by the kernel manipulates memory on the stack that has been freed, creating a use-after-free vulnerability.

Heap memory UAF

char* heap_buf = malloc(1024);
fd = 6, buf = heap_buf;
prepare SQE;
submit SQE;
an error occurs while running other code;
free(heap_buf);
the function returns an error code;
kernel performs IO;

IO executed by the kernel will use memory on the heap that has been freed, another UAF vulnerability.

Use after moving

struct Buf<T>(T);
let mut buf1: Buf<[u8;1024]> = Buf([0;1024]);
fd = 6, buf = buf1.0.as_mut_ptr();
unsafe { prepare SQE; }
submit SQE;
let buf2 = Box::new(buf1);
kernel performs IO;

By the time the kernel performs the IO, buf1 has already been moved and the pointer is dangling. This is a “use after move”, which this article calls a UAM vulnerability.

Use after cancellation

async fn foo() -> io::Result<()> {
    let mut buf1: [u8; 1024] = [0; 1024];
    fd = 6, buf = buf1.as_mut_ptr();
    unsafe { prepare SQE; }
    submit SQE;
    bar().await
}

Rust’s async functions generate stackless coroutines, whose stack variables are stored in a structure. If that structure is destructed, the underlying leaf Future is destructed too and the asynchronous operation is cancelled.

Destructors are synchronous, however, and the kernel may still be using the buffer for IO while the coroutine is being destructed. If this is not handled, a UAF vulnerability appears.
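
One possible mitigation, anticipating the heap-allocation and escape ideas discussed later, is sketched below with hypothetical names (ReadFuture, OpState, driver_defer_release are made up for illustration): the resources live in a heap allocation, and if the future is dropped while the operation is still in flight, that allocation is handed to the driver instead of being freed.

use std::os::unix::io::RawFd;

struct OpState {
    fd: RawFd,
    buf: Vec<u8>, // heap storage; its address stays stable while the kernel uses it
}

struct ReadFuture {
    state: Option<Box<OpState>>,
    in_flight: bool,
}

// Hypothetical runtime hook; a real driver would key the state by user_data
// and free it only after the corresponding CQE has been reaped.
fn driver_defer_release(_state: Box<OpState>) {
    // registered with the reactor; freed after the CQE arrives
}

impl Drop for ReadFuture {
    fn drop(&mut self) {
        if self.in_flight {
            if let Some(state) = self.state.take() {
                // The kernel may still be writing into `state.buf`:
                // hand the allocation to the driver instead of freeing it.
                driver_defer_release(state);
            }
        }
        // If the operation never started or has already completed,
        // `state` is simply dropped here.
    }
}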

Use after closing

prepare SQE;
submit SQE;
io_uring_queue_exit(&ring); ???

Does the kernel immediately cancel the in-flight IO after io_uring_queue_exit?

// TODO: Find the answer

If it cancels immediately, the user program cannot get a cancellation event, wake up the task, or release resources.

If it does not cancel immediately, the kernel keeps occupying resources beyond the lifetime of the io_uring instance, which causes further problems.

This seems to suggest that the io_uring instance must have a 'static lifetime, living as long as the thread itself, or that exit must be delayed by some form of reference counting.
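
For the reference-counting route, here is a minimal sketch with hypothetical types (Ring, Driver, InFlightOp are made up): every in-flight operation holds a strong reference to the ring, so teardown can only happen after the last operation has been reaped.

use std::rc::Rc;

struct Ring {
    // the mmap'd SQ/CQ memory, the ring fd, and so on
}

struct Driver {
    ring: Rc<Ring>,
}

struct InFlightOp {
    ring: Rc<Ring>, // keeps the ring alive until this operation's CQE is reaped
    user_data: u64,
}

impl Drop for Ring {
    fn drop(&mut self) {
        // Reached only when neither the driver nor any in-flight operation
        // holds a reference any more; only now would io_uring_queue_exit be
        // safe to call.
    }
}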

io_uring meets Rust

Rust’s bottom line is memory safety: memory-safety vulnerabilities and data races are not allowed. Rust’s ownership rules provide a strong guarantee here.

Transfer ownership

“Transferring ownership” is a concept coined for this article: to perform an operation, you must give up ownership of a parameter and “move” it somewhere else.

When using io_uring, the kernel holds ownership of the resources. User space must give up control of a resource unless concurrent access to it is safe. When an IO operation completes or is cancelled, all resources held by the kernel are returned to user space.

Of course the kernel cannot literally hold ownership; in practice it is the asynchronous runtime that stores these resources and simulates the “transfer of ownership” model.
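
As an illustration of the shape such an API could take (this signature is an assumption for this article, not an existing library's API): the caller gives up fd and buf, and gets them back together with the result once the operation has completed or been cancelled.

use std::io;
use std::os::unix::io::AsRawFd;

// Sketch only: the runtime would move (fd, buf) into storage it owns,
// point the SQE at that storage, and return everything on completion.
pub async fn read<F, B>(fd: F, buf: B) -> (io::Result<usize>, F, B)
where
    F: AsRawFd + 'static,
    B: AsMut<[u8]> + 'static,
{
    // 1. move fd and buf into a heap slot owned by the runtime
    // 2. prepare and submit an SQE pointing into that slot
    // 3. await the CQE, take fd and buf back out, and return them
    let _ = (&fd, &buf);
    unimplemented!()
}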

The BufRead trait represents a readable type that contains an internal buffer; BufReader<R> is the typical example.

BufReader<R> can be matched to the io_uring workflow:

prepare fd, buf
prepare SQE
submit SQE
wait for wakeup
take the return value
reclaim fd, buf
expose a shared reference to buf

Problem 3: When the Future is cancelled, buf is still held by the kernel and the BufReader<R> is left in an invalid state; on the next IO it can only choose to abort.

Imagine an underlying Future like this:

pub struct Read<F, B>
where
    F: AsRawFd + 'static,
    B: AsMut<[u8]> + 'static,
{
    fd: F,
    buf: B,
    ...
}

buf can be [u8; N], which satisfies AsMut<[u8]> + 'static, yet we cannot take a pointer into it and pass that pointer to io_uring.

The buf dies when the Future is destructed, which violates the io_uring constraints.

There are two fixes: move both fd and buf to the heap before preparing the SQE, or restrict buf to buffer types that can escape safely.

Heap allocation

Heap allocation is the only option if we want to make sure that fd and buf are already in a stable location before the SQE is prepared.

This guarantees that fd and buf can be neither moved nor destructed before the IO operation completes or is cancelled.

pub struct Read<F, B>
where
    F: AsRawFd + 'static,
    B: AsMut<[u8]> + 'static,
{
    state: ManuallyDrop<Box<State<F, B>>>
}

Most of the time, however, buf is itself a smart pointer to a dynamically sized buffer on the heap. Heap-allocating that pointer yet again is not worthwhile, and a custom allocator would have to be introduced in some form to make this efficient.

Escape

The usual “escape analysis” is to analyze the dynamic scope of an object and assign it to the heap if it has the potential to leave function scope.

The “escape” proposed in this article means moving structure members to a stable place at destruction time, so that they survive the destruction.

A safely escapable buffer type does not change the memory address of the buffer as it moves.

[u8; N] completely changes the buffer's address range when it is moved, while Box<[u8]> and Vec<u8> do not.

SmallVec<[u8; N]> stores its data on the stack when it is no larger than N bytes, and on the heap once it grows beyond that.

Box<[u8]> and Vec<u8> can escape safely as buffers; [u8; N] and SmallVec<[u8; N]> cannot.

If buf is restricted to buffer types that can escape safely, then in the best case an IO operation needs no system call and no extra heap allocation, and the buffer remains under the caller's control. Almost perfect.

Problem 4: How do we express such a constraint without letting unsafe spread to users?

Defining an unsafe trait is easy enough, but it cannot cover every eligible buffer type out of the box. It may also run into the orphan rule, forcing users to write newtypes or unsafe code.
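
For illustration only, such an unsafe trait could look like the sketch below. The later snippet in this article uses an EscapedBufMut bound, but this particular definition and these impls are an assumption, not an established API.

/// Safety contract (sketch): the pointer returned by `stable_mut_ptr` must
/// remain valid, and must not change when the value itself is moved, until
/// the value is dropped.
pub unsafe trait EscapedBufMut: 'static {
    fn stable_mut_ptr(&mut self) -> *mut u8;
    fn buf_len(&self) -> usize;
}

// Box<[u8]> and Vec<u8> own heap storage, so moving the handle does not move
// the bytes it points to.
unsafe impl EscapedBufMut for Box<[u8]> {
    fn stable_mut_ptr(&mut self) -> *mut u8 {
        self.as_mut_ptr()
    }
    fn buf_len(&self) -> usize {
        self.len()
    }
}

unsafe impl EscapedBufMut for Vec<u8> {
    fn stable_mut_ptr(&mut self) -> *mut u8 {
        self.as_mut_ptr()
    }
    fn buf_len(&self) -> usize {
        // The Vec must not be grown (reallocated) while the kernel holds the pointer.
        self.len()
    }
}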

As you can see, there is a connection between “safe escape” and the concept of Pin. Is there a way to link it?

Send

Reaping io_uring completions can be done by the submitting thread itself or by a dedicated driver thread.

At present the SQ does not support multithreaded submission, so sharing it globally requires locking. io_uring is a better fit for a one-ring-per-thread design.

Consider a Future whose resources escape to the heap when it is destructed.

pub struct Read<F, B>
where
    F: AsRawFd + 'static,
    B: EscapedBufMut + 'static,
{
    fd: F,
    buf: B,
    ...
}

If the final destruction is performed by a global driver thread, the resources are transferred from the current thread to the driver thread, so they must satisfy Send.

If the current thread performs the final destruction itself, the resources are never transferred and Send is not required.
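
A sketch of how the two strategies show up in the generic bounds (the function names are invented for illustration):

use std::os::unix::io::AsRawFd;

// With a global driver thread, the escaped resources cross a thread
// boundary when the future is dropped, so they must be Send.
fn submit_with_global_driver<F, B>(fd: F, buf: B)
where
    F: AsRawFd + Send + 'static,
    B: AsMut<[u8]> + Send + 'static,
{
    let _ = (fd, buf); // sketch only
}

// With a per-thread ring and driver, the resources never leave the
// current thread, so Send is not required.
fn submit_with_local_driver<F, B>(fd: F, buf: B)
where
    F: AsRawFd + 'static,
    B: AsMut<[u8]> + 'static,
{
    let _ = (fd, buf); // sketch only
}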

Problem 5: The reaping and destruction strategies also affect the generic bounds of the API. How do we design an appropriate API?

Copying

The buffer must remain valid after the Future is destructed, which means we cannot pass a temporary &mut [u8] or &[u8] into io_uring, nor can we read or write in place.

Epoll, on the other hand, lets us wait until an fd becomes readable or writable and then read or write in place.

In any case, the step of putting the buffer on the heap is unavoidable, and the difference is whether the buffer is controlled by the asynchronous type itself or by the caller.

Letting the caller control the buffer avoids the extra copy, but makes safety checks harder: the incoming buffer types must be restricted to well-behaved ones.

Building the buffer into the asynchronous type adds an extra copy, but safety is guaranteed by the library author, which reduces the room for vulnerabilities.
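
Roughly, the two options correspond to the following API shapes (a sketch with invented names and deliberately simplified bodies):

use std::io;
use std::os::unix::io::RawFd;

// Caller-owned buffer: no extra copy, but the buffer type has to be
// restricted to well-behaved (safely escapable) buffers.
pub async fn read_into_owned(fd: RawFd, buf: Vec<u8>) -> (io::Result<usize>, Vec<u8>) {
    // submit an SQE pointing into `buf`, await the CQE, hand `buf` back
    let _ = fd;
    (Ok(0), buf)
}

// Library-owned buffer: the kernel reads into an internal buffer and the
// bytes are then copied out into the caller's slice - one extra copy, but
// all pointer handling stays inside the library.
pub struct UringFile {
    fd: RawFd,
    internal: Vec<u8>,
}

impl UringFile {
    pub async fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        // the kernel fills `self.internal`; copy what the caller asked for
        let n = out.len().min(self.internal.len());
        out[..n].copy_from_slice(&self.internal[..n]);
        let _ = self.fd;
        Ok(n)
    }
}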

Problem 6: io_uring makes it hard to achieve zero copy in user space.

Ecosystem

uring-sys: bindings to liburing.

iou: a Rust-style low-level io_uring interface.

ringbahn: an experimental high-level io_uring wrapper.

maglev: an experimental io_uring asynchronous driver/runtime.

Conclusion

Key points

Problem 1: Epoll is an implementation of the Reactor model, while io_uring is an implementation of the Proactor model. Changing the asynchronous model is not easy, unless the differences are papered over at the cost of some performance.

Problem 2: io_uring requires a fairly new kernel. For now, applications have to consider how to fall back when the newer io_uring features are unavailable.

Problem 3: When the Future is cancelled, buf is still held by the kernel and the asynchronous type may be left in an invalid state; on the next IO it can only choose to abort.

Problem 4: If buf is restricted to buffer types that can escape safely, how can this constraint be expressed without letting unsafe spread to users?

Problem 5: The reaping and destruction strategies also affect the generic bounds of the API. How do we design an appropriate API?

Problem 6: io_uring makes it hard to achieve zero copy in user space.

If we do not insist on maximum performance, we have various options for encapsulating a usable io_uring library.

If genericity is not required, we can carefully fix the concrete types used with io_uring inside our own programs.

Rust’s simultaneous pursuit of safety, performance, and generality makes encapsulating io_uring quite difficult.

Ringbahn’s design is one possible direction; the community still needs to explore what a perfect design would look like.

Further reading

Efficient IO with io_uring

AIO’s new home: IO_uring

Go with asynchronous IO-IO_uring thinking

Notes on io-uring

Ringbahn: a safe, ergonomic API for io-uring in Rust

Ringbahn II: the central state machine

Ringbahn III: A deeper dive into drivers

feature requests: submit requests from any thread


This article was first published in Rust Daily on Zhihu.

About the author:

Wang Xu Yang, a junior undergraduate, has been learning and using the Rust language since 2018.

GitHub ID: Nugine