Author: kabbalah tree www.jianshu.com/p/fad3339e3…


This article discusses several major zero-copy technologies in Linux and the scenarios where zero-copy technologies are applicable. To quickly establish the concept of zero copy, let’s introduce a common scenario:

1 Introduction

When writing a server-side application (a web server or file server), file download is a basic function. The server's task is simply to send the files on its local disk out through the connected socket, without modification. We usually accomplish this with code like the following:

    while ((n = read(diskfd, buf, BUF_SIZE)) > 0)
        write(sockfd, buf, n);

The basic operation is to read the file contents from disk into a buffer and then send the contents of the buffer out through the socket. However, Linux I/O is buffered I/O by default. The two system calls used here, read and write, hide what the operating system does underneath; in fact, the I/O above causes the data to be copied multiple times.
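For completeness, here is a self-contained sketch of the loop above. The function name and the BUF_SIZE value are illustrative, and the handling of partial writes and EINTR is an addition for robustness, not something the original snippet did:

    #include <errno.h>
    #include <unistd.h>

    #define BUF_SIZE 4096                      /* illustrative buffer size */

    /* Traditional path: disk -> page cache -> user buffer -> socket buffer -> NIC.
       Every byte crosses the user/kernel boundary twice. */
    static int copy_file_to_socket(int diskfd, int sockfd)
    {
        char buf[BUF_SIZE];
        ssize_t n;

        while ((n = read(diskfd, buf, BUF_SIZE)) > 0) {
            char *p = buf;
            while (n > 0) {                    /* write() may send less than n */
                ssize_t w = write(sockfd, p, (size_t)n);
                if (w < 0) {
                    if (errno == EINTR)
                        continue;
                    return -1;
                }
                p += w;
                n -= w;
            }
        }
        return n < 0 ? -1 : 0;
    }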

When an application accesses a block of data, the operating system first checks whether the file was accessed recently and whether its contents are already cached in the kernel buffer. If so, the operating system copies the contents of the kernel buffer directly into the user-space buffer given by the buf argument of the read system call. If not, the operating system first copies the data from disk into the kernel buffer, a transfer that nowadays relies on DMA, and then copies the contents of the kernel buffer into the user buffer.

The write system call then copies the contents of the user buffer into the kernel buffer associated with the network stack, and finally the socket sends the contents of that kernel buffer to the network adapter. Rather than describing this further, it is easier to look at the figure:

[Figure: data copies]

As the figure shows, there are four data copies in total. Even though DMA handles the communication with the hardware, the CPU still has to perform two of the copies. On top of that, there are several context switches between user mode and kernel mode, which further increases the CPU's burden.

Since the contents of the file are never modified, copying the data back and forth between kernel space and user space is wasted work. Zero copy was designed to eliminate exactly this inefficiency.

2 What is zero-copy technology?

The main task of zero copy is to avoid having the CPU copy data from one storage area to another. The various zero-copy techniques all aim to eliminate unnecessary copies, or to hand simple data-transfer work over to other components (such as the DMA engine), so that the CPU is free to focus on other tasks. This allows system resources to be used more efficiently.

Going back to the example in the introduction: how can we reduce the number of data copies? One obvious target is to stop copying data back and forth between kernel space and user space, which brings us to the first kind of zero copy: data transfers that do not need to pass through user space.

3 Using mmap

One way we can reduce the number of copies is to call mmap() instead of read:

    buf = mmap(NULL, len, PROT_READ, MAP_SHARED, diskfd, 0);
    write(sockfd, buf, len);

The application calls mmap(); the data on disk is copied into a kernel buffer by DMA, and the operating system then shares this buffer with the application, so there is no need to copy the kernel buffer's contents into user space. The application then calls write(), and the operating system copies the contents of the kernel buffer directly into the socket buffer; all of this happens in kernel mode. Finally, the socket buffer hands the data to the network card. Again, a figure makes this clearer:

[Figure: mmap]
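To make this concrete, here is a fuller sketch of the mmap-based send path. The fstat() call used to obtain the file length and the error handling are additions for illustration, not part of the original snippet:

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Send a whole file through a connected socket using mmap + write. */
    static int mmap_send(int diskfd, int sockfd)
    {
        struct stat st;
        if (fstat(diskfd, &st) == -1)
            return -1;

        /* The mapping is backed by the kernel page cache, so no copy into a
           separate user-space buffer is needed. */
        void *buf = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED,
                         diskfd, 0);
        if (buf == MAP_FAILED)
            return -1;

        /* page cache -> socket buffer, entirely in kernel mode */
        ssize_t n = write(sockfd, buf, (size_t)st.st_size);

        munmap(buf, (size_t)st.st_size);
        return n == st.st_size ? 0 : -1;
    }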

Using mmap instead of read clearly reduces the number of copies and improves efficiency when large amounts of data are transferred. But mmap is not free, and it hides some traps. For example, after your program maps a file, if the file is truncated by another process, your write system call will be terminated by the SIGBUS signal because it accesses an address that is no longer valid. By default, SIGBUS kills the process and produces a core dump, and having a server die this way is a real loss.

We usually avoid this problem with the following solutions:

1. Install a signal handler for SIGBUS

When the SIGBUS signal arrives, the handler simply returns; the write system call then returns the number of bytes written before it was interrupted, and errno is set to success. But this is bad practice, because it papers over the symptom without addressing the heart of the problem.
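As a rough, hedged illustration of approach 1 (the names handle_sigbus and send_mapped are made up here), the sketch below installs a SIGBUS handler and escapes with sigsetjmp/siglongjmp, so that a fault caused by a concurrent truncation turns into an error return instead of killing the process:

    #include <setjmp.h>
    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    static sigjmp_buf truncate_jmp;

    static void handle_sigbus(int sig)
    {
        (void)sig;
        /* The mapped page vanished (file truncated); jump back out. */
        siglongjmp(truncate_jmp, 1);
    }

    /* Send an mmap'ed buffer, surviving a concurrent truncate. */
    static ssize_t send_mapped(int sockfd, const char *buf, size_t len)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = handle_sigbus;
        sigaction(SIGBUS, &sa, NULL);

        if (sigsetjmp(truncate_jmp, 1) == 0)
            return write(sockfd, buf, len);

        /* Got here via SIGBUS: report failure instead of crashing. */
        return -1;
    }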

2. Use a file lease lock

This is the approach we usually take: put a lease on the file descriptor. We ask the kernel for a lease on the file; when another process tries to truncate the file, the kernel sends us a real-time signal, RT_SIGNAL_LEASE, telling us that it is about to break the lease we hold on the file. Our write system call is then interrupted before the program can access invalid memory and be killed by SIGBUS; write returns the number of bytes already written, and errno is set to success.

We should acquire the lease before mmapping the file and release it after we are done with the mapping:

    if (fcntl(diskfd, F_SETSIG, RT_SIGNAL_LEASE) == -1) {
        perror("kernel lease set signal");
        return -1;
    }

    /* l_type can be F_RDLCK or F_WRLCK to take the lease,
       or F_UNLCK to release it */
    if (fcntl(diskfd, F_SETLEASE, l_type) == -1) {
        perror("kernel lease set type");
        return -1;
    }

4 Using sendfile

Starting with the 2.1 kernel, Linux introduced sendfile to simplify operations:

    #include <sys/sendfile.h>

    ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);

The sendfile() system call transfers file contents (bytes) between the descriptor in_fd, which represents the input file, and the descriptor out_fd, which represents the output file. The descriptor out_fd must refer to a socket, and the file that in_fd refers to must be mmap-able. These restrictions mean sendfile can only transfer data from a file to a socket, not the other way around.

Using sendfile not only reduces the number of data copies, it also reduces context switches; the data transfer always takes place entirely inside kernel space.

[Figure: sendfile system call procedure]
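A minimal usage sketch follows; the fstat() call to obtain the file size and the loop around short transfers are additions for illustration:

    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Send an entire file to a connected socket with sendfile(). */
    static int sendfile_all(int out_sockfd, int in_filefd)
    {
        struct stat st;
        if (fstat(in_filefd, &st) == -1)
            return -1;

        off_t offset = 0;
        while (offset < st.st_size) {
            /* The kernel advances 'offset' by the number of bytes sent;
               no data ever enters user space. */
            ssize_t sent = sendfile(out_sockfd, in_filefd, &offset,
                                    (size_t)(st.st_size - offset));
            if (sent <= 0)
                return -1;
        }
        return 0;
    }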

What happens if another process truncates the file while we are calling sendfile? Assuming we have not installed any signal handlers, the sendfile call simply returns the number of bytes it transferred before being interrupted, and errno is set to SUCCESS. If we take a lease on the file before calling sendfile, sendfile behaves the same way, and we also receive the RT_SIGNAL_LEASE signal.

So far, we have reduced the number of data copies, but one copy remains: the copy from the page cache into the socket buffer. Can we omit this one as well?

With help from the hardware, we can. Previously we copied the data from the page cache into the socket buffer; in fact, we only need to append a buffer descriptor (the data's address and length) to the socket buffer, and the DMA controller can then gather the data directly from the page cache and send it to the network card.

To summarize: the sendfile system call uses the DMA engine to copy the file contents into the kernel buffer (page cache), then appends a buffer descriptor containing the data's location and length to the socket buffer; no data is copied from the kernel buffer into the socket buffer. The DMA engine then copies the data from the kernel buffer straight to the protocol engine, avoiding the last CPU copy.

[Figure: sendfile with DMA]

However, this gather-and-copy capability requires support from the hardware and the driver.

5 Using splice

sendfile only works for copying data from a file to a socket, which limits its use. Linux introduced the splice system call in kernel 2.6.17 to move data between two file descriptors:

    #define _GNU_SOURCE         /* See feature_test_macros(7) */
    #include <fcntl.h>

    ssize_t splice(int fd_in, loff_t *off_in, int fd_out, loff_t *off_out,
                   size_t len, unsigned int flags);

A splice call moves data between two file descriptors without copying it back and forth between kernel space and user space. It moves up to len bytes from fd_in to fd_out, but one of the two descriptors must refer to a pipe; this is currently splice's main limitation. The flags parameter can take the following values:

  • SPLICE_F_MOVE: Attempt to move pages instead of copying them. This is only a hint to the kernel: if the kernel cannot move the pages out of the pipe, or if the pipe buffers are not full pages, the data is still copied. The initial Linux implementation had problems, so this flag has been a no-op since 2.6.21; it may be implemented properly in a later kernel version.

  • SPLICE_F_NONBLOCK: The splice operation should not block. However, if the file descriptors themselves have not been set up for non-blocking I/O, the splice call may still block.

  • SPLICE_F_MORE: A hint that more data will come in subsequent splice calls.

splice takes advantage of the pipe buffer mechanism in Linux, which is why at least one of the two descriptors must refer to a pipe.
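Because one side of each splice call must be a pipe, sending a file to a socket typically goes file -> pipe -> socket, with both moves staying in kernel space. A minimal sketch (the function name and flag choices here are illustrative assumptions):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Move 'len' bytes (the file's size) from a file to a socket via a pipe. */
    static int splice_file_to_socket(int filefd, int sockfd, size_t len)
    {
        int pipefd[2];
        int ret = -1;

        if (pipe(pipefd) == -1)
            return -1;

        while (len > 0) {
            /* file -> pipe buffer (data stays in kernel space) */
            ssize_t in = splice(filefd, NULL, pipefd[1], NULL, len,
                                SPLICE_F_MOVE | SPLICE_F_MORE);
            if (in <= 0)
                goto out;

            /* pipe buffer -> socket: drain exactly what was just spliced in */
            while (in > 0) {
                ssize_t n = splice(pipefd[0], NULL, sockfd, NULL, (size_t)in,
                                   SPLICE_F_MOVE | SPLICE_F_MORE);
                if (n <= 0)
                    goto out;
                in  -= n;
                len -= (size_t)n;
            }
        }
        ret = 0;
    out:
        close(pipefd[0]);
        close(pipefd[1]);
        return ret;
    }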

All of the zero-copy techniques above work by reducing the copying of data between user space and kernel space, but sometimes data really does have to be copied between the two. In that case, what we can optimize is when the copy happens. Linux often uses copy-on-write (COW) to reduce this overhead.

For reasons of space, this article does not cover copy-on-write in detail. The general idea is this: if multiple programs access the same data at the same time, each of them holds a pointer to that data and, from its own point of view, owns the data independently. Only when a program needs to modify the content is the data actually copied into that program's own address space, at which point it becomes that program's private copy. If a program never modifies the data, the data never needs to be copied into its address space at all, which reduces copying. Copy-on-write could fill another article on its own...
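As one small, concrete illustration of the idea, mmap's MAP_PRIVATE mode is a copy-on-write mapping: all readers share the same page-cache pages until a process actually writes, at which point the kernel gives that process a private copy and the file on disk stays untouched. A minimal sketch (the function name is made up here):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* MAP_PRIVATE gives a copy-on-write view of a file: reads are served from
       the shared page cache; the first write to a page triggers a private copy
       for this process only. */
    static char *cow_view(const char *path, size_t *lenp)
    {
        int fd = open(path, O_RDONLY);
        if (fd == -1)
            return NULL;

        struct stat st;
        if (fstat(fd, &st) == -1) {
            close(fd);
            return NULL;
        }

        char *p = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE, fd, 0);
        close(fd);                 /* the mapping stays valid after close */
        if (p == MAP_FAILED)
            return NULL;

        *lenp = (size_t)st.st_size;
        return p;                  /* writes to p modify private copies only */
    }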

In addition, there are other zero-copy techniques: for example, passing the O_DIRECT flag to traditional Linux I/O performs direct I/O and bypasses the automatic page cache, and there is also fbufs, which is not yet mature. This article does not cover every zero-copy technique, only some of the common ones; if you are interested, explore the rest on your own. Mature server projects also modify the kernel's I/O subsystem to improve their data transfer rates.
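For reference, direct I/O with O_DIRECT requires the user buffer (and usually the file offset and transfer size) to be suitably aligned, which is why posix_memalign is typically used. A minimal sketch, where the 4096-byte alignment and length are assumptions chosen to match common block sizes:

    #define _GNU_SOURCE            /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Read one 4 KiB block from 'path', bypassing the page cache. */
    static ssize_t direct_read_block(const char *path, void **bufp)
    {
        int fd = open(path, O_RDONLY | O_DIRECT);
        if (fd == -1)
            return -1;

        /* O_DIRECT needs an aligned buffer; 4096 covers most block devices. */
        if (posix_memalign(bufp, 4096, 4096) != 0) {
            close(fd);
            return -1;
        }

        ssize_t n = read(fd, *bufp, 4096);
        close(fd);
        return n;
    }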

Read more on my blog:

1.Java JVM, Collections, Multithreading, new features series tutorials

2.Spring MVC, Spring Boot, Spring Cloud series tutorials

3.Maven, Git, Eclipse, Intellij IDEA series tools tutorial

4.Java, backend, architecture, Alibaba and other big factory latest interview questions

Life is good. See you tomorrow