Let me start with a story from about seven years ago. I was interviewing at a startup, and the company arranged for a classmate of mine, who had joined more than a year earlier, to interview me. When we got to talking about a configuration file read in a project I had written, he jumped in: "You read this file on every user request, so every request costs a disk IO. Performance is bound to be poor." What this story revealed to me is that although most of us are computer science majors and can write decent code, the vast majority of us don't truly understand, or don't understand deeply enough, some seemingly commonplace issues.

No matter what language you use, C, PHP, Go, or Java, you have read files at one time or another. Let's think about two questions. If we read one byte from a file:

  • Will disk I/O occur?
  • When it does, how many bytes does Linux actually read from disk?

To make the problem concrete, here is the corresponding C code:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    char c;
    int  in;

    in = open("in.txt", O_RDONLY);
    read(in, &c, 1);        /* read a single byte */
    close(in);
    return 0;
}

If you are not a C/C++ developer, this is not an easy question to answer in depth, because today's mainstream languages, PHP, Java, Go and so on, are wrapped at a relatively high level and hide most kernel details almost completely. To make sense of the two questions above, you need to dig into Linux and understand its IO stack.

Introduction to the Linux I/O stack

Without further ado, let's draw a simplified version of the Linux IO stack (see Linux.IO.stack_v1.0.pdf for the official IO stack diagram).

In previous articles we have already covered the hardware layer at the bottom of this stack, as well as the file system module. Looking at the full IO stack, though, it is clear that our understanding of Linux file IO is still far from complete; there are several more kernel components involved: the IO engine, the VFS, the Page Cache, the generic block layer, the IO scheduling layer, and so on. Don't worry, let's go through them one by one.

IO engine

If you want to read and write files, the lib layer offers many functions, such as read, write, mmap, and so on. Choosing among them is essentially choosing an IO engine that Linux provides. The read and write functions we use most belong to the sync engine. Besides sync there are also mmap, psync, vsync, libaio, posixaio, and others. sync and psync are synchronous; libaio and posixaio are asynchronous IO.

Of course, the IO engine also needs the VFS, the generic block layer, and other lower-level support. The read function of the sync engine enters the read system call provided by the VFS.
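To make the "engine" idea concrete, here is a minimal sketch of the same one-byte read done through mmap instead of the sync read engine. It assumes the same in.txt file from the opening example and that the file is not empty; it is an illustration, not the kernel's own code.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("in.txt", O_RDONLY);
    if (fd < 0)
        return 1;

    /* Map one page of the file instead of calling read(). */
    char *p = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    /* The first access triggers a page fault; the kernel fills the
     * page from the Page Cache (reading from disk only if needed). */
    printf("first byte: 0x%02x\n", (unsigned char)p[0]);

    munmap(p, 4096);
    close(fd);
    return 0;
}

With mmap there is no explicit read call at all; the byte arrives through a page fault instead.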

VFS (Virtual File System)

At the kernel level, the first thing you run into is the VFS. The VFS was born from the idea of abstracting a common file system model, so that developers and users get one common set of interfaces without having to care about the implementation of any specific file system. The VFS provides four core data structures, defined in the kernel sources include/linux/fs.h and include/linux/dcache.h:

  • Superblock: records information about a specific mounted file system
  • Inode: every file in Linux has an inode; you can think of it as the file's ID card
  • File: a file object in memory that stores the mapping between a process's open file and the file on disk
  • Dentry: a directory entry, one component of a path; dentry objects are strung together to form the directory tree in Linux

The VFS also defines a set of operations around each of these four core data structures. For example, for inodes it defines inode_operations (include/linux/fs.h), which contains the familiar mkdir and rename:

struct inode_operations {
        ......
        int (*link) (struct dentry *, struct inode *, struct dentry *);
        int (*unlink) (struct inode *, struct dentry *);
        int (*mkdir) (struct inode *, struct dentry *, umode_t);
        int (*rmdir) (struct inode *, struct dentry *);
        int (*rename) (struct inode *, struct dentry *,
                       struct inode *, struct dentry *, unsigned int);
        ......
};

The corresponding file_operations for the file object defines the read and write methods we use all the time:

struct file_operations {
        ......
        ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
        ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
        ......
        int (*mmap) (struct file *, struct vm_area_struct *);
        int (*open) (struct inode *, struct file *);
        int (*flush) (struct file *, fl_owner_t id);
        ......
};

Page Cache

Looking below the VFS layer, we come to the Page Cache. It is the main disk cache used by the Linux kernel: a pure in-memory component whose job is to speed up access to the relatively slow disk. If the file block you want happens to be in the Page Cache, no actual disk IO occurs at all. If it is not, a new page is allocated, a page fault is raised, the page is filled with the block read from disk, and it stays around for the next access. The Linux kernel uses radix trees to manage these large numbers of pages efficiently.
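If you want to see the Page Cache at work from user space, one way is the mincore system call, which reports whether the pages of a mapping are currently resident in memory. Below is a minimal sketch, again assuming the hypothetical in.txt from earlier.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("in.txt", O_RDONLY);
    if (fd < 0)
        return 1;

    long pagesize = sysconf(_SC_PAGESIZE);

    /* Map the first page of the file without touching its contents. */
    void *p = mmap(NULL, pagesize, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    /* mincore() reports, per page, whether it is resident in memory,
     * i.e. already sitting in the Page Cache. */
    unsigned char vec[1];
    if (mincore(p, pagesize, vec) == 0)
        printf("first page %s in the Page Cache\n",
               (vec[0] & 1) ? "is already" : "is NOT");

    munmap(p, pagesize);
    close(fd);
    return 0;
}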

If you have a special need to bypass the Page Cache, you can open the file with O_DIRECT (a minimal sketch follows the list below). There are two typical reasons to do this:

  • To benchmark raw disk IO performance, without the cache getting in the way
  • To save the memory the Page Cache would occupy and the cost of copying data from kernel memory into the user process's memory
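Here is the promised sketch of bypassing the Page Cache with O_DIRECT. Note that direct IO imposes alignment requirements on the buffer, the file offset and the length; the 4KB alignment used here is an assumption, the real requirement depends on the device and file system.

#define _GNU_SOURCE            /* O_DIRECT needs _GNU_SOURCE on glibc */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* O_DIRECT bypasses the Page Cache entirely. */
    int fd = open("in.txt", O_RDONLY | O_DIRECT);
    if (fd < 0)
        return 1;

    /* Direct IO requires an aligned buffer; align to 4KB here. */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0)
        return 1;

    ssize_t n = read(fd, buf, 4096);   /* read one aligned block */
    printf("read %zd bytes without touching the Page Cache\n", n);

    free(buf);
    close(fd);
    return 0;
}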

The file system

My earlier articles "How much disk space does creating an empty file take?" and "Understand formatting principles" were all about concrete file systems. The two most important concepts in a file system are the inode and the block, both of which we have met before. The block size is decided when the partition is formatted; the default is 4KB.
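You can also see this block granularity from user space without dumpe2fs: stat reports the file's logical size, the number of 512-byte units actually allocated, and the preferred IO block size. A minimal sketch, using the hypothetical in.txt:

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

int main(void)
{
    struct stat st;
    if (stat("in.txt", &st) != 0)
        return 1;

    /* st_size is the logical file size; st_blocks counts 512-byte
     * units actually allocated; st_blksize is the preferred IO size. */
    printf("size        : %lld bytes\n", (long long)st.st_size);
    printf("allocated   : %lld x 512-byte blocks\n", (long long)st.st_blocks);
    printf("IO blocksize: %ld bytes\n", (long)st.st_blksize);
    return 0;
}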

In addition to inodes and blocks, each file system also defines its own actual operation functions. For example, ext4 defines ext4_file_operations and ext4_file_inode_operations as follows:

const struct file_operations ext4_file_operations = {
        .read_iter      = ext4_file_read_iter,
        .write_iter     = ext4_file_write_iter,
        .mmap           = ext4_file_mmap,
        .open           = ext4_file_open,
        ......
};

const struct inode_operations ext4_file_inode_operations = {
        .setattr        = ext4_setattr,
        .getattr        = ext4_file_getattr,
        ......
};

Generic block layer

The generic block layer is the kernel module that handles IO requests from all block devices in the system. It defines a data structure called bio to represent one block IO request (include/linux/bio.h).

Is the unit of IO size in a bio a page, or a sector? Neither, it's a segment! Each bio may contain multiple segments. A segment is a complete page, or a part of a page; for details see www.ilinuxkernel.com/files/Linux… .

Why introduce something as confusing as a segment? Because data that is contiguous on disk is not necessarily contiguous in the Page Cache, and that is perfectly normal; we cannot demand that data which sits contiguously on disk must be cached in contiguous memory. Segments exist so that a single disk IO can be DMA'd into multiple "segments" of memory that are not contiguous.
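As an illustration only (the real definitions are struct bio and struct bio_vec in the kernel sources), a segment can be pictured as a (page, offset, length) triple, and a bio as a list of such segments describing one contiguous range of disk sectors:

/* A simplified, illustrative model of how a bio describes segments.
 * This is NOT the kernel definition; it only shows the idea that
 * each segment is a piece of exactly one page. */
struct segment {
    void         *page;      /* the page frame holding the data          */
    unsigned int  offset;    /* where the segment starts within the page */
    unsigned int  len;       /* segment length, at most one page         */
};

struct io_request {
    unsigned long long sector;    /* data is contiguous on disk ...      */
    struct segment     segs[4];   /* ... but scattered across pages      */
    int                nr_segs;
};

This is why one request that is contiguous on disk can land in several scattered pages of the Page Cache.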

The common size relationship is: a sector is 512 bytes, a segment is at most one page, and a page is typically 4KB.

IO scheduling layer

When the generic block layer actually issues an IO request, it is not necessarily executed immediately. The scheduling layer tries to maximize overall disk IO performance; the basic idea is to make the head work like an elevator, sweeping in one direction and then coming back, which keeps the disk more efficient. Concrete algorithms include NOOP, Deadline and CFQ.

On your own machine you can run dmesg | grep -i scheduler to see which algorithms your Linux supports, and pick one of them when testing.
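You can also read the currently active scheduler from sysfs; the active one is shown in square brackets. A small sketch, assuming your disk shows up as sda:

#include <stdio.h>

int main(void)
{
    /* Typical output looks like "noop deadline [cfq]".
     * Adjust the path if your block device is not sda. */
    FILE *f = fopen("/sys/block/sda/queue/scheduler", "r");
    if (!f)
        return 1;

    char line[256];
    if (fgets(line, sizeof(line), f))
        printf("available/active schedulers: %s", line);

    fclose(f);
    return 0;
}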

File reading procedure

We have now covered the various kernel components in the Linux IO stack. Let's walk through the whole process of reading a file once more:

  • The read function in lib enters the sys_read system call
  • In sys_read, the VFS functions vfs_read, generic_file_read and so on take over
  • generic_file_read in the VFS checks whether the Page Cache already holds the data; if it does, the data is returned immediately
  • If not, the kernel allocates a new page frame in the Page Cache and raises a page fault
  • The kernel issues a block IO request to the generic block layer, which hides the differences between disks, USB drives and other block devices
  • The generic block layer puts the IO request, represented by a bio, into the IO request queue
  • The IO scheduling layer schedules the queued requests with an elevator algorithm
  • The driver sends a read command to the disk controller, which fills the new page frame in the Page Cache directly via DMA
  • The controller raises an interrupt to report completion
  • The kernel copies the 1 byte the user asked for into the user's buffer
  • Your process is then woken up

As you can see, if the Page Cache is hit, no disk IO is generated at all. So a few extra reads and writes in your code do not automatically mean poor performance; the operating system has already done a lot of optimization for you. Memory access latency is on the order of nanoseconds, several orders of magnitude faster than a mechanical disk IO. If your memory is large enough, or your files are accessed frequently enough, read operations rarely cause real disk IO.
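A simple way to feel this difference is to time two back-to-back one-byte reads of the same offset: whatever happens on the first read, the second one is almost certainly served from the Page Cache. A rough sketch, again using the hypothetical in.txt (timings will vary by machine):

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static long long elapsed_ns(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1000000000LL + (b.tv_nsec - a.tv_nsec);
}

int main(void)
{
    char c;
    struct timespec t0, t1, t2;

    int fd = open("in.txt", O_RDONLY);
    if (fd < 0)
        return 1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    pread(fd, &c, 1, 0);               /* may or may not hit the cache */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pread(fd, &c, 1, 0);               /* almost certainly a cache hit */
    clock_gettime(CLOCK_MONOTONIC, &t2);

    printf("1st read: %lld ns\n", elapsed_ns(t0, t1));
    printf("2nd read: %lld ns\n", elapsed_ns(t1, t2));

    close(fd);
    return 0;
}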

Now the second question: if the Page Cache misses, how many bytes of disk IO does Linux actually do? Several kernel components take part in the whole IO process, and each of them manages disk data in chunks of a different size:

  • The Page Cache caches data in pages; a Linux page is typically 4KB
  • The file system manages data in blocks; you can check with dumpe2fs, and the default block size is 4KB
  • The generic block layer handles disk IO requests in segments; a segment is a page or part of a page
  • The IO scheduler transfers data to memory by DMA in units of sectors, typically 512 bytes each
  • The hard disk itself also manages and transfers data in sectors

So from the user's point of view we really did read only 1 byte (in the opening code we passed only a 1-byte buffer to read). But the smallest unit the whole kernel pipeline works in is the disk sector, and at 512 bytes that is already far larger than one byte. On top of that, higher-level components such as blocks and the Page Cache work in even larger units, so in practice many bytes are read at once. Assuming the segment is a full memory page, one disk IO reads 4KB, that is, eight 512-byte sectors, together.

And we haven't even mentioned that the Linux kernel also has a sophisticated readahead (prefetch) strategy, so in practice more than 8 sectors may well be transferred into memory together.
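If you want to influence that readahead yourself, posix_fadvise lets you hint at your access pattern. A small sketch, using the hypothetical in.txt; POSIX_FADV_RANDOM tells the kernel to shrink or disable readahead, while POSIX_FADV_SEQUENTIAL does the opposite:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    char c;
    int  fd = open("in.txt", O_RDONLY);
    if (fd < 0)
        return 1;

    /* Hint that access will be random, so a cache miss drags in
     * fewer extra sectors than the default readahead would. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);

    read(fd, &c, 1);
    close(fd);
    return 0;
}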

Finally

The operating system is meant to be easy to rely on; you can treat much of it as a black box. You ask for one byte, it gives you one byte, and it silently does a great deal of work in between. Most of us do not write low-level code day to day, but if you care about your application's performance, you should understand when and how the operating system quietly improves it for you, so that some day, when your online server is about to fall over, you can pinpoint the problem quickly.

One more extension: if the Page Cache misses, is there necessarily a disk IO that reaches the mechanical spindle?

Well, not necessarily. Why? Because disks nowadays have their own built-in cache. In addition, servers are usually built around disk arrays, and the RAID card at the heart of the array also integrates RAM as a cache. Only when all of these caches miss do the mechanical spindle and head actually have to move.



The hard disk series of "Developing Internal Skills and Practicing":

  • 1. Opening up a disk: taking the hard shell off a mechanical hard drive!
  • 2. Disk partitioning is also a technical trick
  • 3. How can we work around mechanical hard disks being slow and fragile?
  • 4. Disassembling the SSD structure
  • 5. How much disk space does a new empty file occupy?
  • 6. How much disk space does a 1-byte file occupy?
  • 7. Why does the ls command hang when there are too many files?
  • 8. Understand formatting principles
  • 9. How much disk IO actually occurs when reading one byte of a file?
  • 10. After writing one byte of a file, when is the disk IO actually issued?
  • 11. Mechanical hard drive random IO is slower than you might think
  • 12. How much faster is an SSD server than a mechanical one?

My WeChat public account is "Developing Internal Skills and Practicing". There I do not simply introduce technical theory, nor only share practical experience, but combine the two: using practice to deepen the understanding of theory, and using theory to improve practical skill. Welcome to follow my account, and please share it with your friends.