Direct I/O

With both buffered I/O and mmap, files are accessed through the kernel's page cache. This is not ideal for applications that maintain their own cache, such as databases:

  1. The user-level cache and the kernel page cache hold the same data twice, which wastes memory.
  2. Data travels disk -> page cache -> user buffer, so each transfer involves two copies instead of one.

To address this, Linux provides a way to read and write files while bypassing the page cache: direct I/O.

To use direct I/O, add the O_DIRECT flag when opening the file. For example:

int fd = open(fname, O_RDWR | O_DIRECT);

Using direct I/O has one major limitation: the memory address of the buffer, the size of each read/write, and the offset of the file must all be aligned with the logical block size of the underlying device (typically 512 bytes).

The open(2) manual page puts it this way: since Linux 2.6.0, alignment to the logical block size of the underlying storage (typically 512 bytes) suffices. The logical block size can be determined using the ioctl(2) BLKSSZGET operation or from the shell using the command blockdev --getss:

$ blockdev --getss /dev/vdb1 
512
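As a minimal sketch of these two checks, the block size can be queried with BLKSSZGET and the three alignment conditions tested before a request is issued. The helper names logical_block_size and is_dio_aligned are hypothetical and not part of any library:

```cpp
#include <assert.h>
#include <fcntl.h>
#include <linux/fs.h>  // BLKSSZGET
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

// Returns the logical block size of a block device, or -1 on failure
// (for example when the path is not a block device).
int logical_block_size(const char* dev_path) {
  int fd = open(dev_path, O_RDONLY);
  if (fd < 0) return -1;
  int size = 0;
  int rc = ioctl(fd, BLKSSZGET, &size);
  close(fd);
  return rc == 0 ? size : -1;
}

// True when the buffer address, the transfer length, and the file offset
// are all multiples of the block size.
bool is_dio_aligned(const void* buf, size_t len, off_t offset, size_t block) {
  assert(block != 0);
  return reinterpret_cast<uintptr_t>(buf) % block == 0 &&
         len % block == 0 &&
         static_cast<uint64_t>(offset) % block == 0;
}
```

Opening a block device usually requires elevated privileges, so in practice the block size is often read once at startup and cached.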

If the alignment requirements are not met, the call fails with EINVAL (Invalid argument). For example:

#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main() {
  // O_CREAT requires a mode argument. Some file systems do not support
  // O_DIRECT, in which case open fails with EINVAL.
  int fd = open("/tmp/direct_io_test", O_CREAT | O_RDWR | O_DIRECT, 0666);
  if (fd < 0) {
    perror("open");
    return -1;
  }

  constexpr int kBlockSize = 512;

  // The buffer address is not aligned
  {
    char* p = nullptr;
    // Allocate a block-aligned region and offset into it by half a block, so
    // the address passed to write is never a multiple of kBlockSize.
    int ret = posix_memalign((void**)&p, kBlockSize, 2 * kBlockSize);
    assert(ret == 0);
    int n = write(fd, p + kBlockSize / 2, kBlockSize);
    assert(n < 0);
    perror("write");
    free(p);
  }

  // The buffer size is not aligned
  {
    char* p = nullptr;
    int ret = posix_memalign((void**)&p, kBlockSize, kBlockSize / 2);
    assert(ret == 0);
    int n = write(fd, p, kBlockSize / 2);
    assert(n < 0);
    perror("write");
    free(p);
  }

  // File offset is not aligned
  {
    char* p = nullptr;
    int ret = posix_memalign((void**)&p, kBlockSize, kBlockSize);
    assert(ret == 0);
    off_t offset = lseek(fd, kBlockSize / 2, SEEK_SET);
    assert(offset == kBlockSize / 2);
    int n = write(fd, p, kBlockSize);
    assert(n < 0);
    perror("write");
    free(p);
  }

  // Align the three
  {
    char* p = nullptr;
    int ret = posix_memalign((void**)&p, kBlockSize, kBlockSize);
    assert(ret == 0);
    off_t offset = lseek(fd, 0, SEEK_SET);
    assert(offset == 0);
    int n = write(fd, p, kBlockSize);
    assert(n == kBlockSize);
    printf("write ok\n");
    free(p);
  }

  return 0;
}
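Because every direct I/O buffer needs the same treatment, the allocation is often wrapped in a small helper. The following is one possible sketch; round_up, AlignedBuffer, and alloc_dio_buffer are hypothetical names, and the block size is assumed to be a power of two:

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#include <memory>

// Rounds n up to the next multiple of block (block must be a power of two).
constexpr size_t round_up(size_t n, size_t block) {
  return (n + block - 1) & ~(block - 1);
}

// Memory from posix_memalign must be released with free, not delete.
struct FreeDeleter {
  void operator()(void* p) const { free(p); }
};
using AlignedBuffer = std::unique_ptr<char, FreeDeleter>;

// Allocates at least min_len bytes aligned to block. *len receives the
// padded length, which is the size to pass to read or write.
AlignedBuffer alloc_dio_buffer(size_t min_len, size_t block, size_t* len) {
  assert(min_len > 0);
  assert(block >= sizeof(void*) && (block & (block - 1)) == 0);
  void* p = nullptr;
  *len = round_up(min_len, block);
  if (posix_memalign(&p, block, *len) != 0) return nullptr;
  return AlignedBuffer(static_cast<char*>(p));
}
```

Padding the length up to a whole block satisfies the size requirement as well as the address requirement; the caller still has to keep the file offset aligned.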

Direct I/O and data persistence

When data is written with buffered I/O, it only reaches the page cache, which is still memory. It becomes persistent once a kernel thread periodically flushes the dirty pages to storage, or once the application calls fsync to flush them explicitly.

In theory, direct I/O bypasses the page cache, so the data should be on persistent storage by the time write returns. However, besides the file data itself, some file metadata, such as the file size, also matters for data integrity. Direct I/O applies only to the file data; metadata reads and writes are still cached by the kernel. So even when a file is read and written with direct I/O, fsync is still needed to flush its metadata.

Not all file metadata affects data integrity, though; the modification time is one example. For this reason, MySQL provides the flush method O_DIRECT_NO_FSYNC: use O_DIRECT to read and write data, and call fsync only when necessary.

As of MySQL 8.0.14, fsync() is called after creating a new file, after increasing file size, and after closing a file, to ensure that file system metadata changes are synchronized. The fsync() system call is still skipped after each write operation.
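The idea of syncing only when size metadata changes can be sketched as follows. This is an illustration of the principle, not MySQL's implementation; write_sync_if_grown and demo_sync_count are hypothetical names, and the demo opens its file without O_DIRECT (an assumption made so that it runs on any file system):

```cpp
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// Writes len bytes at offset and calls fsync only when the write grew the file.
// Returns 1 if fsync was called, 0 if it was skipped, -1 on error.
// With a real O_DIRECT descriptor, buf, len, and offset must also be aligned.
int write_sync_if_grown(int fd, const void* buf, size_t len, off_t offset) {
  struct stat st;
  if (fstat(fd, &st) != 0) return -1;
  ssize_t n = pwrite(fd, buf, len, offset);
  if (n != static_cast<ssize_t>(len)) return -1;
  if (offset + static_cast<off_t>(len) <= st.st_size) return 0;  // size unchanged
  return fsync(fd) == 0 ? 1 : -1;
}

// Writes the same block twice to a temporary file and returns how many
// of the two writes triggered an fsync.
int demo_sync_count() {
  char path[] = "/tmp/dio_sync_XXXXXX";
  int fd = mkstemp(path);
  if (fd < 0) return -1;
  unlink(path);
  char block[512];
  memset(block, 'A', sizeof(block));
  int first = write_sync_if_grown(fd, block, sizeof(block), 0);   // extends the file
  int second = write_sync_if_grown(fd, block, sizeof(block), 0);  // overwrites in place
  close(fd);
  if (first < 0 || second < 0) return -1;
  return first + second;
}
```

Only the first write extends the file, so only it pays for an fsync; the in-place overwrite skips it.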

Linux AIO

As described earlier, reading or writing a file with read/write (whether buffered or direct I/O) or through an mmap mapping is synchronous file I/O.

The advantages of synchronous I/O interfaces are simplicity and clear logic, but that is about all they offer. If the page cache is not hit, the thread blocks. Sustaining high concurrency then requires more threads, which brings the base overhead of many threads, a large number of context switches, and extra load on the kernel scheduler, so system performance suffers. Asynchronous file I/O is needed to get good performance under highly concurrent reads and writes.

Linux AIO is a set of kernel interfaces for asynchronous file I/O (only files opened with O_DIRECT are supported):

int io_setup(unsigned nr_events, aio_context_t *ctx_idp);
int io_submit(aio_context_t ctx_id, long nr, struct iocb **iocbpp);
int io_getevents(aio_context_t ctx_id, long min_nr, long nr,
                 struct io_event *events, struct timespec *timeout);
int io_cancel(aio_context_t ctx_id, struct iocb *iocb,
                     struct io_event *result);
int io_destroy(aio_context_t ctx_id);

  1. io_setup creates an asynchronous I/O context that supports up to nr_events operations.
  2. io_submit submits asynchronous I/O requests.
  3. io_getevents retrieves the results of completed asynchronous I/O.
  4. io_cancel cancels a previously submitted asynchronous I/O request.
  5. io_destroy cancels all outstanding asynchronous I/O requests and destroys the asynchronous I/O context.

A typical Linux AIO workflow looks like this:

  1. Call io_setup to create an I/O context for submitting and harvesting I/O requests.
  2. Create 1 to n I/O requests and call io_submit to submit them.
  3. The I/O requests complete, and the data is transferred directly to the user buffer via DMA.
  4. Call io_getevents to harvest the completed I/O.
  5. Go back to step 2, or, once no further AIO is needed, call io_destroy to destroy the I/O context.
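Step 2 comes down to filling in a struct iocb per request. A small helper along the following lines (prep_iocb is a hypothetical name, loosely in the spirit of libaio's prep functions) covers both reads and writes, and uses the aio_data field to tag each request so that the matching io_event can be identified through its data field at harvest time:

```cpp
#include <assert.h>
#include <linux/aio_abi.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Fills cb for a positional read or write on fd.
// tag is an arbitrary caller value; the kernel echoes it back in io_event.data.
void prep_iocb(struct iocb* cb, int fd, bool is_write, void* buf, size_t len,
               long long offset, uint64_t tag) {
  assert(cb != nullptr);
  memset(cb, 0, sizeof(*cb));  // unused and reserved fields must be zero
  cb->aio_fildes = fd;
  cb->aio_lio_opcode = is_write ? IOCB_CMD_PWRITE : IOCB_CMD_PREAD;
  cb->aio_buf = reinterpret_cast<uintptr_t>(buf);
  cb->aio_nbytes = len;
  cb->aio_offset = offset;
  cb->aio_data = tag;
}
```

A common choice for the tag is a pointer to, or an index of, the application's own per-request state.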

The simple example below shows the basic use of the Linux AIO interface. For details on each call, refer to the Linux manual pages. A few things to note:

  1. glibc does not provide wrappers for the Linux AIO system calls, so they have to be wrapped with syscall.
  2. libaio is a wrapper library for Linux AIO whose interface is similar but not identical; do not confuse the two.

#include <assert.h>
#include <fcntl.h>
#include <inttypes.h>
#include <linux/aio_abi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int io_setup(unsigned nr, aio_context_t *ctxp) {
  return syscall(__NR_io_setup, nr, ctxp);
}

int io_destroy(aio_context_t ctx) { return syscall(__NR_io_destroy, ctx); }

int io_submit(aio_context_t ctx, long nr, struct iocb **iocbpp) {
  return syscall(__NR_io_submit, ctx, nr, iocbpp);
}

int io_getevents(aio_context_t ctx, long min_nr, long max_nr,
                 struct io_event *events, struct timespec *timeout) {
  return syscall(__NR_io_getevents, ctx, min_nr, max_nr, events, timeout);
}

int main(int argc, char *argv[]) {
  int fd = open("/tmp/linux_aio_test", O_RDWR | O_CREAT | O_DIRECT, 0666);
  if (fd < 0) {
    perror("open");
    return -1;
  }

  aio_context_t ctx = 0;
  int ret = io_setup(128, &ctx);
  if (ret < 0) {
    perror("io_setup");
    return -1;
  }

  // Describe one 4096-byte write at offset 0
  struct iocb cb;
  memset(&cb, 0, sizeof(cb));
  cb.aio_fildes = fd;
  cb.aio_lio_opcode = IOCB_CMD_PWRITE;
  char *data = nullptr;
  ret = posix_memalign((void **)&data, 512, 4096);
  assert(ret == 0);
  memset(data, 'A', 4096);
  cb.aio_buf = (uint64_t)data;
  cb.aio_offset = 0;
  cb.aio_nbytes = 4096;

  struct iocb *cbs[1];
  cbs[0] = &cb;
  ret = io_submit(ctx, 1, cbs);
  if (ret != 1) {
    perror("io_submit");  // report before free() can change errno
    free(data);
    return -1;
  }

  // Wait for exactly one completion; res is the byte count or a negative errno
  struct io_event events[1];
  ret = io_getevents(ctx, 1, 1, events, nullptr);
  assert(ret == 1);
  printf("%" PRId64 " %" PRId64 "\n", (int64_t)events[0].res,
         (int64_t)events[0].res2);
  free(data);
  ret = io_destroy(ctx);
  if (ret < 0) {
    perror("io_destroy");
    return -1;
  }
  return 0;
}

Summary

Linux AIO is Linux's native attempt at solving asynchronous file I/O, but it is neither complete nor perfect.

  • Linux AIO supports only direct I/O. All reads and writes through Linux AIO therefore inherit the limits of direct I/O: 1) the buffer address, buffer size, and file offset must be aligned; 2) the page cache cannot be used.
  • It is not fully asynchronous and can still block. For example, on an ext4 file system, the call may block if file metadata has to be read.
  • High parameter copy overhead. Each I/O submission copies a 64-byte struct iocb object, and each I/O completion copies a 32-byte struct io_event object, for a total of 96 bytes per I/O request. Whether this overhead is acceptable depends on the size of a single I/O: for large I/Os it is negligible by comparison, but in workloads with many small I/Os it has a significant impact.
  • Submitting or harvesting I/O requests from multiple threads causes heavy lock contention on the aio_context_t.
  • Each I/O takes two system calls (io_submit and io_getevents), which is too expensive for large numbers of small I/Os. However, both io_submit and io_getevents support batching, so the number of system calls can be reduced by submitting and harvesting in batches.
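The batching mentioned in the last point can be sketched as follows: four writes are submitted with one io_submit and harvested with one io_getevents, so the whole batch costs two system calls. The function name batch_write_demo is hypothetical, and the file is opened without O_DIRECT (an assumption made so that the sketch runs on any file system; without O_DIRECT the kernel may complete the requests synchronously inside io_submit). The static_assert lines check the structure sizes quoted in the list:

```cpp
#include <linux/aio_abi.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static_assert(sizeof(struct iocb) == 64, "iocb is copied in on submit");
static_assert(sizeof(struct io_event) == 32, "io_event is copied out on completion");

constexpr int kBatch = 4;
constexpr size_t kBlock = 4096;

// Returns the number of successful completions (kBatch when all is well),
// -1 if the AIO context cannot be created, -2 on any other error.
int batch_write_demo() {
  char path[] = "/tmp/aio_batch_XXXXXX";
  int fd = mkstemp(path);
  if (fd < 0) return -2;
  unlink(path);

  aio_context_t ctx = 0;
  if (syscall(__NR_io_setup, kBatch, &ctx) < 0) { close(fd); return -1; }

  void* data = nullptr;
  if (posix_memalign(&data, kBlock, kBatch * kBlock) != 0) {
    syscall(__NR_io_destroy, ctx);
    close(fd);
    return -2;
  }
  memset(data, 'A', kBatch * kBlock);

  // One iocb per block, all handed to the kernel in a single call.
  struct iocb cbs[kBatch];
  struct iocb* ptrs[kBatch];
  for (int i = 0; i < kBatch; ++i) {
    memset(&cbs[i], 0, sizeof(cbs[i]));
    cbs[i].aio_fildes = fd;
    cbs[i].aio_lio_opcode = IOCB_CMD_PWRITE;
    cbs[i].aio_buf = reinterpret_cast<uintptr_t>(static_cast<char*>(data) + i * kBlock);
    cbs[i].aio_nbytes = kBlock;
    cbs[i].aio_offset = i * kBlock;
    ptrs[i] = &cbs[i];
  }

  int ok = -2;
  struct io_event events[kBatch];
  long n = kBatch;
  if (syscall(__NR_io_submit, ctx, n, ptrs) == n &&
      syscall(__NR_io_getevents, ctx, n, n, events, nullptr) == n) {
    ok = 0;
    for (int i = 0; i < kBatch; ++i)
      if (events[i].res == static_cast<long long>(kBlock)) ++ok;
  }
  syscall(__NR_io_destroy, ctx);  // waits for anything still in flight
  free(data);
  close(fd);
  return ok;
}
```

Larger batches amortize the system call cost further, at the price of higher latency for the first request in each batch.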

Linux AIO has kept evolving from its introduction (kernel 2.5) to the present (2021), but mostly through incremental patching, and it has never become a complete solution. In particular, it supports only direct I/O, and its nominally asynchronous interfaces can still block.

To address the design flaws of Linux AIO, Linux 5.1 introduced a new asynchronous I/O interface: io_uring. The hope is that io_uring will finally solve asynchronous I/O on Linux.

References

  1. Linux Programmer's Manual: open(2)
  2. Page-based direct I/O
  3. InnoDB innodb_flush_method configuration
  4. Why does O_DIRECT require I/O to be 512-byte aligned?
  5. linux-aio