Virtual file systems are the magical abstraction that makes the “everything is a file” philosophy possible in Linux.

What is a file system? According to early Linux contributor and author Robert Love, “A filesystem is a hierarchical storage of data adhering to a specific structure.” However, this description applies equally well to VFAT (Virtual File Allocation Table), Git, and Cassandra (a NoSQL database). So what distinguishes a file system?

Basic concepts of file systems

The Linux kernel requires that an entity implement the open(), read(), and write() methods on persistent objects that have names associated with them. From the point of view of object-oriented programming, the kernel treats the generic file system as an abstract interface, and these big-three functions are “virtual,” with no default definition. Accordingly, the kernel’s default file system implementation is called a virtual file system (VFS).
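These methods live in the kernel’s struct file_operations. A heavily abridged sketch of the structure from the kernel’s include/linux/fs.h (the real structure has many more members):

```c
/* Heavily abridged sketch of struct file_operations from the kernel's
 * include/linux/fs.h; the real structure declares many more methods. */
struct file_operations {
        ssize_t (*read)(struct file *, char __user *, size_t, loff_t *);
        ssize_t (*write)(struct file *, const char __user *, size_t, loff_t *);
        int (*open)(struct inode *, struct file *);
        /* ... llseek, mmap, unlocked_ioctl, and many others elided ... */
};
```

A file system that leaves one of these pointers NULL simply does not support that operation: the C equivalent of a virtual function with no default implementation.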

Console

If something can be open()ed, read(), and write()n, it is a file, as this console session shows.

VFS is the basis of the famous “everything is a file” characterization of Unix-like systems. Consider how weird it is that the tiny demo above, featuring the character device /dev/console, actually works. The image shows an interactive Bash session on a virtual teletype (tty): sending a string to the virtual console device makes it appear on the virtual screen. VFS has other, even stranger properties. For example, it is possible to seek in these files.
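The same demo can be written in a few lines of C, since the console device answers to the ordinary file-handling system calls. A minimal sketch (typically needs root; error handling abridged):

```c
/* Treat the console character device like any other file:
 * open() it, write() to it, and the string appears on the screen. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const char *msg = "hello from the VFS\n";
        int fd = open("/dev/console", O_WRONLY);
        if (fd < 0)
                return 1;
        write(fd, msg, strlen(msg));
        close(fd);
        return 0;
}
```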

Familiar file systems like ext4, NFS, and /proc all provide definitions of the three big functions in a C data structure named file_operations. In addition, particular file systems extend and override the VFS functions in the familiar object-oriented way. As Robert Love points out, the abstraction of VFS enables Linux users to blithely copy files to and from foreign operating systems or abstract entities like pipes without worrying about their internal data formats. On behalf of user space, via a system call, a process can copy from a file into the kernel’s data structures with one file system’s read() method, then use another kind of file system’s write() method to output the data.
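That read-then-write pattern is the heart of every cp-like tool. A minimal sketch of the user-space side (copy_fd is a hypothetical helper; short writes are treated as errors for brevity):

```c
/* Copy bytes from one file descriptor to another.  Thanks to the VFS,
 * in_fd might be an ext4 file and out_fd a pipe or an NFS file; the
 * same read()/write() calls work on both. */
#include <unistd.h>

int copy_fd(int in_fd, int out_fd)
{
        char buf[4096];
        ssize_t n;

        while ((n = read(in_fd, buf, sizeof(buf))) > 0)
                if (write(out_fd, buf, (size_t)n) != n)
                        return -1;      /* short write treated as error */
        return n < 0 ? -1 : 0;          /* 0 on EOF, -1 on read error */
}
```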

The function definitions for the basic VFS types can be found in the fs/*.c files of the kernel source, while the subdirectories of fs/ contain the specific file systems. The kernel also contains file-system-like entities such as cgroups, /dev, and tmpfs, which are needed early in the boot process and are therefore defined in the kernel’s init/ subdirectory. Note that cgroups, /dev, and tmpfs do not call the three big file_operations functions, but read from and write to memory directly instead.

The diagram below outlines how user space accesses the various types of file systems commonly mounted on Linux systems. Structures like pipes, dmesg, and POSIX clocks, which are not shown in the figure, also implement struct file_operations and are accessed through the VFS layer.

How userspace accesses various types of filesystems

VFS is a “shim layer” that sits between system calls and the implementors of specific file_operations, such as ext4 and procfs. The file_operations functions can then communicate with device-specific drivers or with memory accessors. tmpfs, devtmpfs, and cgroups do not use file_operations but access memory directly.

The presence of VFS promotes code reuse because the basic methods associated with file systems do not need to be reimplemented for each file system type. Code reuse is a widely accepted software engineering best practice! Alas, if the reused code introduces serious bugs, all implementations that inherit common methods will suffer.

/tmp: a simple tip

An easy way to find out which VFSes are present on a system is to type mount | grep -v sd | grep -v :/, which on most computers will list all mounted file systems that are not resident on a disk and are not NFS. One of the VFS mounts listed will assuredly be /tmp, right?

Man with shocked expression

Everyone knows that keeping /tmp on a physical storage device is crazy! (Image: tinyurl.com/ybomxyfo)

Why is it inadvisable to keep /tmp on a storage device? Because the files in /tmp are temporary (!), and storage devices are slower than memory, which is why file systems like tmpfs were created. Further, physical devices are more subject to wear from frequent writing than memory is. Finally, files in /tmp may contain sensitive information, so having them disappear at every reboot is a feature.

Unfortunately, the installation scripts of some Linux distributions still create /tmp on a storage device by default. Do not despair should this be the case with your system. Simply follow the straightforward instructions on the always excellent Arch Wiki to fix the problem, keeping in mind that memory allocated to tmpfs is not available for any other purpose. In other words, a system with a gigantic tmpfs full of large files can run out of memory and crash.
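Administrators normally express a tmpfs /tmp in /etc/fstab or a systemd tmp.mount unit, but under the hood it comes down to a single mount(2) call. A minimal sketch, with an illustrative (not recommended) 2G size cap; requires CAP_SYS_ADMIN:

```c
/* What mounting a size-capped tmpfs on /tmp boils down to at the
 * system-call level.  The size=2G cap is an illustrative choice. */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
        if (mount("tmpfs", "/tmp", "tmpfs", MS_NOATIME, "size=2G") != 0) {
                perror("mount");
                return 1;
        }
        return 0;
}
```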

Another tip: When editing the /etc/fstab file, make sure to end with a newline character; otherwise, the system will fail to start. (Guess how I know.)

/proc and /sys

Besides /tmp, the VFSes with which most Linux users are most familiar are /proc and /sys. (/dev relies on shared memory and has no file_operations structure.) Why two flavors? Let’s look into the details.

Procfs offers user space a snapshot of the instantaneous state of the kernel and the processes it controls. In /proc, the kernel publishes information about the facilities it provides, such as interrupts, virtual memory, and the scheduler. In addition, /proc/sys is where the settings that are configurable via the sysctl command are accessible to user space. Status and statistics for individual processes are reported in /proc/<PID> directories.

Console

/proc/meminfo is an empty file, but it still contains valuable information.

The behavior of /proc files illustrates how unlike on-disk file systems VFS can be. On the one hand, /proc/meminfo contains the information presented by the free command; on the other hand, it’s empty! How can this be? The situation is reminiscent of a famous 1985 article called “Is the Moon There When Nobody Looks? Reality and the Quantum Theory” by Cornell University physicist N. David Mermin. The truth is that the kernel gathers statistics about memory when a process requests them from /proc, and there actually is nothing in the files in /proc when no one is looking. As Mermin said, “It is a fundamental quantum doctrine that a measurement does not, in general, reveal a preexisting value of the measured property.” (The answer to the question about the moon is left as an exercise.)
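You can verify this yourself: stat() reports a size of zero for /proc/meminfo, yet read() returns freshly generated text. A minimal sketch:

```c
/* /proc/meminfo has st_size 0, yet reading it yields data, because
 * the kernel generates the contents on demand. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
        struct stat st;
        char buf[256];

        if (stat("/proc/meminfo", &st) == 0)
                printf("st_size = %lld\n", (long long)st.st_size); /* prints 0 */

        int fd = open("/proc/meminfo", O_RDONLY);
        ssize_t n = read(fd, buf, sizeof(buf) - 1);  /* yet data appears */
        if (n > 0) {
                buf[n] = '\0';
                printf("%s", buf);
        }
        close(fd);
        return 0;
}
```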

Full moon

The files in /proc are empty when no process is accessing them. (source)

The empty files of procfs make sense, since the information available there is dynamic. The situation with sysfs is different. Let’s compare the number of files of at least one byte in size in /proc versus /sys.

Procfs has precisely one non-empty file, namely the exported kernel configuration, which is an exception since it needs to be generated only once per boot. On the other hand, /sys has many larger files, most of which comprise one page of memory. Typically, sysfs files contain exactly one number or string, in contrast to the tables of information produced by reading files like /proc/meminfo.

The purpose of sysfs is to expose the readable and writable properties of what the kernel calls “kobjects” to user space. The only purpose of kobjects is reference counting: when the last reference to a kobject is deleted, the system reclaims the resources associated with it. Yet /sys constitutes most of the kernel’s famous “stable ABI to user space,” which no one may ever, under any circumstances, “break.” That doesn’t mean the files in sysfs are static, however, which would be antithetical to reference counting of volatile objects.

The kernel’s stable ABI instead constrains what can appear in /sys, not what is actually present at any given instant. Listing the permissions on files in sysfs gives an idea of how the configurable, tunable parameters of devices, modules, file systems, and so on can be set or read. Logic compels the conclusion that procfs is also part of the kernel’s stable ABI, although the kernel’s documentation doesn’t state it explicitly.

Console

Files in sysfs describe exactly one property each of an entity and may be readable, writable, or both. The “0” in the file reveals that the SSD is not rotational.
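Reading one of these single-value attributes from C is as plain as reading any other file. A minimal sketch, assuming a block device named sda (substitute your own):

```c
/* Read a single sysfs attribute.  The sda path is an assumption. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        char c;
        int fd = open("/sys/block/sda/queue/rotational", O_RDONLY);
        if (fd < 0)
                return 1;
        if (read(fd, &c, 1) == 1)            /* exactly one value */
                printf("rotational: %c (0 means SSD)\n", c);
        close(fd);
        return 0;
}
```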

Take a look inside VFS with eBPF and BCC tools

The easiest way to learn how the kernel manages sysfs files is to watch it in action, and the simplest way to watch it on ARM64 or x86_64 is to use eBPF. eBPF (extended Berkeley Packet Filter) consists of a virtual machine running inside the kernel that privileged users can query from the command line. The kernel source tells the reader what the kernel can do; running eBPF tools on a booted system shows what the kernel actually does.

Happily, getting started with eBPF is easy via the BCC tools, which are available as packages in major Linux distributions and have been amply documented by Brendan Gregg. The BCC tools are Python scripts with small embedded snippets of C, meaning anyone comfortable with either language can readily modify them. At present, there are 80 Python scripts in bcc/tools, making it highly likely that a system administrator or developer will find an existing one relevant to her or his needs.

To get an idea of what work VFSes are performing on a running system, try the simple vfscount or vfsstat scripts, which show that dozens of calls to vfs_open() and its friends occur every second.

Console – vfsstat.py

vfsstat.py is a Python script with an embedded C snippet that simply counts VFS function calls.

For a less trivial example, let’s watch what happens in sysfs when a USB memory stick is inserted on a running system.

Console when USB is inserted

Using eBPF to watch what happens in /sys when a USB memory stick is inserted, with simple and complex examples.

In the first simple example above, the trace.py BCC tool script prints a message whenever the sysfs_create_files() function runs. We see that sysfs_create_files() is started by a kworker thread in response to the USB stick insertion event, but what file does it create? The second example illustrates the full power of eBPF. Here, trace.py prints the kernel backtrace (-K option) plus the name of the file created by sysfs_create_files(). The snippet inside the single quotes is C source code, including an easily recognizable format string, and the supplied Python script invokes the LLVM just-in-time compiler (JIT) to compile and execute it inside the in-kernel virtual machine. The full sysfs_create_files() function signature must be reproduced in the second command so that the format string can refer to one of its parameters. Making an error in this C snippet results in a recognizable C compiler error. For example, if the -I parameter is omitted, the result is “Failed to compile BPF text.” Developers who are conversant with C or Python will find the BCC tools easy to extend and modify.

The kernel backtrace printed upon insertion of the USB stick shows that PID 7711 is a kworker thread that created a file called events in sysfs. A corresponding invocation with sysfs_remove_files() shows that removal of the USB stick results in removal of the events file, in keeping with the idea of reference counting. Watching sysfs_create_link() in eBPF during USB stick insertion (not shown) reveals that no fewer than 48 symbolic links are created.

What is the purpose of the events file anyway? Using cscope to find the function __device_add_disk() reveals that it calls disk_add_events(), and that either “media_change” or “eject_request” may be written to the file. Here, the kernel’s block layer is informing user space about the appearance and disappearance of “disks.” Consider how quickly this method of investigating how USB stick insertion works yields answers, compared to trying to figure out the process purely from the source.

Read-only root file systems make embedded devices possible

Assuredly, no one shuts down a server or desktop system by pulling the power plug. Why not? Because file systems mounted on physical storage devices may have pending (incomplete) writes, and the data structures that record their state may be out of sync with what has been written to storage. When that happens, system owners have to wait at the next boot for the fsck file system recovery tool to run to completion and, in the worst case, may actually lose data.

Yet aficionados will have heard that many IoT and embedded devices like routers, thermostats, and automobiles now run Linux. Many of these devices almost entirely lack a user interface, and there is no way to cleanly “unboot” them. Consider jump-starting a car with a dead battery, where the power to the Linux-running head unit goes up and down repeatedly. How does the system boot without a lengthy fsck when the engine finally starts running? The answer is that embedded devices rely on a read-only root file system (ro-rootfs for short).

Photograph of a console

ro-rootfs is the reason embedded systems do not frequently require fsck. (Source: tinyurl.com/yxoauoub)

A ro-rootfs offers many advantages that are less obvious than incorruptibility. One is that malware cannot write to /usr or /lib if no Linux process can write there. Another is that a largely immutable file system is critical for field support of remote devices, since support staff possess local systems that are nominally identical to those in the field. Perhaps the most important (but also most subtle) advantage is that ro-rootfs forces developers to decide during a project’s design phase which system objects will be immutable. Dealing with a ro-rootfs may often be inconvenient or even painful, as constant variables in programming languages often are, but the benefits easily repay the extra overhead.

Creating a read-only root file system does require some additional effort from embedded developers, and that’s where VFS comes in. Linux needs files in /var to be writable, and in addition, many popular applications that run on embedded systems will try to create configuration dot-files in $HOME. One solution for configuration files in the home directory is typically to pregenerate them and build them into the rootfs. For /var, one approach is to mount it on a separate writable partition while / itself is mounted read-only. Using bind or overlay mounts is another popular alternative.

Bind and overlay mounts and their use in containers

Running man mount is the best place to learn about bind and overlay mounts, which give embedded developers and system administrators the power to create a file system in one path location and then provide it to applications at a second one. For embedded systems, the consequence is that it is possible to store the files in /var on an unwritable flash device, but to overlay- or bind-mount a path in a tmpfs onto the /var path at boot, so that applications can scrawl there to their heart’s delight. At the next power-on, the changes in /var will be gone. An overlay mount provides a union between the tmpfs and the underlying file system and allows apparent modification of existing files in a ro-rootfs, while a bind mount can make new, empty tmpfs directories show up as writable at ro-rootfs paths. While overlayfs is a proper file system type, bind mounts are implemented by the VFS namespace facility.
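At the system-call level, both techniques are single mount(2) invocations. A minimal sketch with illustrative paths (the helper name and directory layout are assumptions; both calls require CAP_SYS_ADMIN):

```c
/* Two ways to make pieces of a ro-rootfs appear writable. */
#include <sys/mount.h>

int make_var_writable(void)
{
        /* Bind mount: a tmpfs-backed directory shows up at /var/log. */
        if (mount("/run/volatile/log", "/var/log", NULL, MS_BIND, NULL))
                return -1;

        /* Overlay mount: union a writable tmpfs upper layer over the
         * read-only lower layer. */
        if (mount("overlay", "/var/lib", "overlay", 0,
                  "lowerdir=/var/lib,upperdir=/run/volatile/upper,"
                  "workdir=/run/volatile/work"))
                return -1;
        return 0;
}
```

Note that overlayfs requires upperdir and workdir to live on the same writable file system, here the tmpfs under /run.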

After all that discussion of bind and overlay mounts, no one should be surprised by their heavy use in Linux containers. Let’s monitor what happens when a container is started with systemd-nspawn by running BCC’s mountsnoop tool:

Console – systemd-nspawn invocation

The systemd-nspawn invocation starts the container while mountsnoop.py runs.

Let’s see what happened:

Console – Running mountsnoop

Running mountsnoop during container startup shows that the container runtime relies heavily on bind mounts. (Only the beginning of the lengthy output is displayed.)

Here, systemd-nspawn provides selected files from the host’s procfs and sysfs to the container at paths in its rootfs. Besides the MS_BIND flag that requests bind mounting, some of the other flags passed to the mount system call determine the relationship between changes in the host namespace and in the container. For example, the bind mount can either propagate changes in /proc and /sys to the container or hide them, depending on the invocation.
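A minimal sketch of what such an invocation can look like at the system-call level (paths and the helper name are illustrative; a real container runtime makes many such calls):

```c
/* Bind the host's /proc into a container and control propagation. */
#include <sys/mount.h>

int bind_proc_into_container(const char *container_proc)
{
        /* Make the host's /proc visible inside the container's rootfs. */
        if (mount("/proc", container_proc, NULL, MS_BIND | MS_REC, NULL))
                return -1;

        /* Mark it MS_SLAVE: mount events propagate from host to
         * container, but not in the other direction. */
        if (mount(NULL, container_proc, NULL, MS_SLAVE | MS_REC, NULL))
                return -1;
        return 0;
}
```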

Conclusion

Understanding Linux internals can seem an impossible task, as the kernel itself contains a gigantic amount of code, leaving aside Linux user-space applications and the system-call interface in C libraries like glibc. One way to make progress is to read the source code of one kernel subsystem, with an emphasis on understanding the user-space-facing system calls and headers plus the major kernel-internal interfaces, exemplified here by the file_operations table. file_operations is what makes “everything is a file” actually work, so getting a handle on it is particularly satisfying. The kernel C source files in the top-level fs/ directory constitute the implementation of the virtual file system, the shim layer that enables broad and relatively straightforward interoperability between popular file systems and storage devices. Bind and overlay mounts via Linux namespaces are the VFS magic that makes containers and read-only root file systems possible. In combination with a study of the source code, the eBPF kernel facility and its BCC interface make probing the kernel easier than ever before.

Many thanks to Akkana Peck and Michael Eager for their comments and corrections.

Alison Chaiken also spoke on this topic at the 17th Annual Southern California Linux Expo (SCaLE 17X) held March 7-10 in Pasadena, California.


Via: opensource.com/article/19/…

Author: Alison Chaiken

This article was originally translated by LCTT and published by Linux China.