This article on the Linux file system was first published at: blog.ihypo.net/16246922377…

I have recently been reading a book on programming in the Linux environment, and I picked up some storage knowledge at my previous job, so I wanted to sort out how Linux supports file systems.

I had planned to write about file systems in 2020, but so many small things came up last year that this article took six months to get started. I recently changed positions and have been busy with performance reviews, so I am only now wrapping it up.

File reading and writing

We'll start with reading and writing files. What happens when we write a string to a file, as in the following lines of Python code?

f = open("file.txt", "w")
f.write("hello world")
f.close()

Using the strace command, you can easily see which system calls are made behind this code:

$ strace python justwrite.py -e trace=file
...
open("file.txt", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
...
write(3, "hello world", 11) = 11
close(3) = 0
...

The most critical of these is write. As you can see, the string "hello world" is written to file descriptor 3 through the write system call.

Instead of describing what happens behind these system calls, let’s look at what happens when we read the file. For example:

f = open("file.txt", "r")
_ = f.readlines()
f.close()

Also, let’s look at which system calls are used to read files:

$ strace python justread.py -e trace=file
...
open("file.txt", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=11, ...}) = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=11, ...}) = 0
...
read(3, "hello world", 8192) = 11
read(3, "", 4096) = 0
close(3)
...

As you can see, the most important of these is the read system call, which reads the string "hello world" from file descriptor 3.

System calls are the API that the kernel provides. The operating system owns all resources, so in general a program can only get the kernel to do what it needs through system calls (more precisely, the program traps into the kernel to complete the operation).
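To make the link between library calls and system calls more concrete, here is a minimal sketch that bypasses Python's buffered file object and uses the thin wrappers in the os module, which correspond closely to the open, write, read and close calls seen in the strace output. The helper name write_then_read and the temporary path are illustrative assumptions, not part of the original scripts.

```python
import os
import tempfile


def write_then_read(path: str, payload: bytes) -> bytes:
    """Write payload to path and read it back using thin syscall wrappers."""
    # os.open wraps open(2) and returns a plain integer file descriptor.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o666)
    try:
        os.write(fd, payload)  # wraps write(2); returns the byte count written
    finally:
        os.close(fd)  # wraps close(2)

    fd = os.open(path, os.O_RDONLY)
    try:
        return os.read(fd, 8192)  # wraps read(2); reads at most 8192 bytes
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(write_then_read(os.path.join(tmp, "file.txt"), b"hello world"))
```

Running this under strace should show a very similar sequence of calls to the one above, without the extra buffering behaviour of Python's file object.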

VFS and file systems

To understand what’s going on behind the system calls, start with the VFS.

The Virtual File System (VFS) is an important interface and infrastructure for I/O operations in Linux. A simplified Linux I/O stack shows the VFS location:

VFS can be thought of as the interface contract that Linux defines for file systems. Linux implements it in an object-oriented style, in both its design and its code structure.

VFS abstracts four main object types:

  • Superblock: represents a mounted file system;
  • Inode: represents a specific file;
  • Dentry: represents a directory entry, which is one component of a file path;
  • File object (file): represents a file opened by a process.

superblock

A superblock is a data structure that stores information about a particular file system. It is usually kept in a specific sector of the disk. If inodes are the metadata of files, superblocks are the metadata of file systems.

A superblock represents the metadata and control information of a file system as a whole. When the file system is mounted, the contents of the superblock are read and the superblock structure is built in memory.

The superblock structure is defined in linux/fs.h:

struct super_block {
	struct list_head	s_list;		/* Keep this first */
	dev_t			s_dev;		/* search index; _not_ kdev_t */
	unsigned char		s_blocksize_bits;
	unsigned long		s_blocksize;
	loff_t			s_maxbytes;	/* Max file size */
	struct file_system_type	*s_type;
	const struct super_operations	*s_op;
	const struct dquot_operations	*dq_op;
	const struct quotactl_ops	*s_qcop;
	const struct export_operations *s_export_op;
	unsigned long		s_flags;
	unsigned long		s_magic;
	struct dentry		*s_root;
	struct rw_semaphore	s_umount;
	int			s_count;
	atomic_t		s_active;
#ifdef CONFIG_SECURITY
	void                    *s_security;
#endif
	const struct xattr_handler **s_xattr;

	struct list_head	s_inodes;	/* all inodes */
	struct hlist_bl_head	s_anon;		/* anonymous dentries for (nfs) exporting */
	struct list_head	s_mounts;	/* list of mounts; _not_ for fs use */
	struct block_device	*s_bdev;
	struct backing_dev_info *s_bdi;
	struct mtd_info		*s_mtd;
	struct hlist_node	s_instances;
	struct quota_info	s_dquot;	/* Diskquota specific options */

	struct sb_writers	s_writers;

	char s_id[32];				/* Informational name */
	u8 s_uuid[16];				/* UUID */

	void 			*s_fs_info;	/* Filesystem private info */
	unsigned int		s_max_links;
	fmode_t			s_mode;

	/* Granularity of c/m/atime in ns. Cannot be worse than a second */
	u32		   s_time_gran;

	/*
	 * The next field is for VFS *only*. No filesystems have any business
	 * even looking at it. You had been warned.
	 */
	struct mutex s_vfs_rename_mutex;	/* Kludge */

	/*
	 * Filesystem subtype. If non-empty the filesystem type field
	 * in /proc/mounts will be "type.subtype"
	 */
	char *s_subtype;

	/*
	 * Saved mount options for lazy filesystems using
	 * generic_show_options()
	 */
	char __rcu *s_options;
	const struct dentry_operations *s_d_op; /* default d_op for dentries */

	/* Saved pool identifier for cleancache (-1 means none) */
	int cleancache_poolid;

	struct shrinker s_shrink;	/* per-sb shrinker handle */

	/* Number of inodes with nlink == 0 but still referenced */
	atomic_long_t s_remove_count;

	/* Being remounted read-only */
	int s_readonly_remount;

	/* AIO completions deferred from interrupt context */
	struct workqueue_struct *s_dio_done_wq;

	/*
	 * Keep the lru lists last in the structure so they always sit on their
	 * own individual cachelines.
	 */
	struct list_lru		s_dentry_lru ____cacheline_aligned_in_smp;
	struct list_lru		s_inode_lru ____cacheline_aligned_in_smp;
	struct rcu_head		rcu;
};

Although there are many fields, I think they can be grouped into a few broad categories:

  1. Metadata and control bits for the device
  2. Metadata and control bits for the file system
  3. Superblock operation functions

One of the most interesting things about the VFS code is how it handles operation functions. I was surprised that object-oriented programming could be done this way in C.

The operation function of the superblock is in a separate structure:

const struct super_operations	*s_op;

The super_operations structure contains the operations on superblocks. I think of it as an interface for superblocks, because the structure itself carries no implementation. Instead, it is a table of function pointers that each file system fills in with its own implementations:

struct super_operations {
   	struct inode* (*alloc_inode) (struct super_block *sb);
	void (*destroy_inode)(struct inode *);

   	void (*dirty_inode) (struct inode *, int flags);
	int (*write_inode) (struct inode *, struct writeback_control *wbc);
	int (*drop_inode) (struct inode *);
	void (*evict_inode) (struct inode *);
	void (*put_super) (struct super_block *);
	int (*sync_fs)(struct super_block *sb, int wait);
	int (*freeze_fs) (struct super_block *);
	int (*unfreeze_fs) (struct super_block *);
	int (*statfs) (struct dentry *, struct kstatfs *);
	int (*remount_fs) (struct super_block *, int *, char *);
	void (*umount_begin) (struct super_block *);

	int (*show_options)(struct seq_file *, struct dentry *);
	int (*show_devname)(struct seq_file *, struct dentry *);
	int (*show_path)(struct seq_file *, struct dentry *);
	int (*show_stats)(struct seq_file *, struct dentry *);
#ifdef CONFIG_QUOTA
	ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
	ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
#endif
	int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
	long (*nr_cached_objects)(struct super_block *, int);
	long (*free_cached_objects)(struct super_block *, long, int);
};

The function names are fairly self-explanatory, so they don't need much explanation. They cover low-level operations on the file system and on inodes.
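The "structure of function pointers as an interface" idea can be sketched in a few lines of Python. This is only a toy model: SuperOps, make_toyfs_ops and vfs_statfs are hypothetical names invented for illustration, and only two of the operations are modelled.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class SuperOps:
    """Toy stand-in for super_operations: a table of callables."""
    statfs: Callable[[], Dict[str, int]]
    sync_fs: Callable[[bool], int]


def make_toyfs_ops(block_size: int) -> SuperOps:
    """A hypothetical file system fills in the table with its own functions."""
    def statfs() -> Dict[str, int]:
        return {"f_bsize": block_size}

    def sync_fs(wait: bool) -> int:
        return 0  # pretend everything was flushed successfully

    return SuperOps(statfs=statfs, sync_fs=sync_fs)


def vfs_statfs(ops: SuperOps) -> Dict[str, int]:
    """Generic caller: it only knows the interface, not the implementation."""
    return ops.statfs()


if __name__ == "__main__":
    print(vfs_statfs(make_toyfs_ops(4096)))
```

The generic caller never needs to know which file system it is talking to, which is the same role s_op plays for the VFS.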

inode

The inode is a relatively common concept, and contains all the information the kernel needs to operate on a file or directory.

The inode structure is also defined in linux/fs.h:

/*
 * Keep mostly read-only and often accessed (especially for
 * the RCU path lookup and 'stat' data) fields at the beginning
 * of the 'struct inode'
 */
struct inode {
	umode_t			i_mode;
	unsigned short		i_opflags;
	kuid_t			i_uid;
	kgid_t			i_gid;
	unsigned int		i_flags;

#ifdef CONFIG_FS_POSIX_ACL
	struct posix_acl	*i_acl;
	struct posix_acl	*i_default_acl;
#endif

	const struct inode_operations	*i_op;
	struct super_block	*i_sb;
	struct address_space	*i_mapping;

#ifdef CONFIG_SECURITY
	void			*i_security;
#endif

	/* Stat data, not accessed from path walking */
	unsigned long		i_ino;
	/*
	 * Filesystems may only read i_nlink directly. They shall use the
	 * following functions for modification:
	 *
	 *    (set|clear|inc|drop)_nlink
	 *    inode_(inc|dec)_link_count
	 */
	union {
		const unsigned int i_nlink;
		unsigned int __i_nlink;
	};
	dev_t			i_rdev;
	loff_t			i_size;
	struct timespec		i_atime;
	struct timespec		i_mtime;
	struct timespec		i_ctime;
	spinlock_t		i_lock;	/* i_blocks, i_bytes, maybe i_size */
	unsigned short          i_bytes;
	unsigned int		i_blkbits;
	blkcnt_t		i_blocks;

#ifdef __NEED_I_SIZE_ORDERED
	seqcount_t		i_size_seqcount;
#endif

	/* Misc */
	unsigned long		i_state;
	struct mutex		i_mutex;

	unsigned long		dirtied_when;	/* jiffies of first dirtying */

	struct hlist_node	i_hash;
	struct list_head	i_wb_list;	/* backing dev IO list */
	struct list_head	i_lru;		/* inode LRU list */
	struct list_head	i_sb_list;
	union {
		struct hlist_head	i_dentry;
		struct rcu_head		i_rcu;
	};
	u64			i_version;
	atomic_t		i_count;
	atomic_t		i_dio_count;
	atomic_t		i_writecount;
	const struct file_operations	*i_fop;	/* former ->i_op->default_file_ops */
	struct file_lock	*i_flock;
	struct address_space	i_data;
#ifdef CONFIG_QUOTA
	struct dquot		*i_dquot[MAXQUOTAS];
#endif
	struct list_head	i_devices;
	union {
		struct pipe_inode_info	*i_pipe;
		struct block_device	*i_bdev;
		struct cdev		*i_cdev;
	};

	__u32			i_generation;

#ifdef CONFIG_FSNOTIFY
	__u32			i_fsnotify_mask; /* all events this inode cares about */
	struct hlist_head	i_fsnotify_marks;
#endif

#ifdef CONFIG_IMA
	atomic_t		i_readcount; /* struct files open RO */
#endif
	void			*i_private; /* fs or device private pointer */
};

Like superblocks, inodes have inode_operations that contain operation methods:

struct inode_operations {
	struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
	void * (*follow_link) (struct dentry *, struct nameidata *);
	int (*permission) (struct inode *, int);
	struct posix_acl * (*get_acl)(struct inode *, int);

	int (*readlink) (struct dentry *, char __user *,int);
	void (*put_link) (struct dentry *, struct nameidata *, void *);

	int (*create) (struct inode *,struct dentry *, umode_t, bool);
	int (*link) (struct dentry *,struct inode *,struct dentry *);
	int (*unlink) (struct inode *,struct dentry *);
	int (*symlink) (struct inode *,struct dentry *,const char *);
	int (*mkdir) (struct inode *,struct dentry *,umode_t);
	int (*rmdir) (struct inode *,struct dentry *);
	int (*mknod) (struct inode *,struct dentry *,umode_t,dev_t);
	int (*rename) (struct inode *, struct dentry *,
			struct inode *, struct dentry *);
	int (*setattr) (struct dentry *, struct iattr *);
	int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
	int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
	ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
	ssize_t (*listxattr) (struct dentry *, char *, size_t);
	int (*removexattr) (struct dentry *, const char *);
	int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
		      u64 len);
	int (*update_time)(struct inode *, struct timespec *, int);
	int (*atomic_open)(struct inode *, struct dentry *,
			   struct file *, unsigned open_flag,
			   umode_t create_mode, int *opened);
	int (*tmpfile) (struct inode *, struct dentry *, umode_t);
} ____cacheline_aligned;

The inode_operations structure contains the file-related operation interfaces, including:

  1. Creating, deleting, and renaming files and directories
  2. Symbolic and hard link management
  3. Permission management
  4. Extended attribute management

Like super_operations, the inode_operations structure has no concrete implementation. It exists only as a table of function pointers that serves as an interface, and each specific file system implements the functions in the structure.
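Some of the inode fields are visible from user space through stat. The following minimal sketch creates a file and a hard link to it, then checks that both names resolve to the same inode number and that the link count is 2. The function name link_info and the file names are assumptions for illustration, and the sketch assumes a file system that supports hard links.

```python
import os
import tempfile


def link_info(directory: str) -> dict:
    """Create a file plus a hard link and report what stat() says about them."""
    original = os.path.join(directory, "file1")
    alias = os.path.join(directory, "file1-hardlink")

    with open(original, "w") as f:
        f.write("hello world")
    os.link(original, alias)  # a second name for the same inode

    st_a = os.stat(original)
    st_b = os.stat(alias)
    return {
        "same_inode": st_a.st_ino == st_b.st_ino,  # compare inode numbers
        "nlink": st_a.st_nlink,                    # hard link count
        "size": st_a.st_size,                      # file size in bytes
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(link_info(tmp))
```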

Directory entry

VFS treats directories as a special kind of file, so in a file path such as /dir1/file1, the components /, dir1, and file1 are all directory entries.

Each part of the path, be it a directory or a file, is represented by a directory entry structure, which makes it easier to perform directory operations in the VFS, such as pathname lookups.
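As a rough illustration of the components a pathname lookup walks through, the sketch below splits a path using Python's pathlib. This only mimics the splitting step; it is not how the kernel parses paths, and path_components is a name made up for this example.

```python
from pathlib import PurePosixPath


def path_components(path: str) -> tuple:
    """Split an absolute POSIX path into the names a lookup would walk."""
    return PurePosixPath(path).parts


if __name__ == "__main__":
    print(path_components("/dir1/file1"))
```

For /dir1/file1 this yields the three components /, dir1 and file1, each of which would correspond to one directory entry.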

The directory entry structure is defined in linux/dcache.h:

/*
 * Try to keep struct dentry aligned on 64 byte cachelines (this will
 * give reasonable cacheline footprint with larger lines without the
 * large memory footprint increase).
 */
#ifdef CONFIG_64BIT
# define DNAME_INLINE_LEN 32 /* 192 bytes */
#else
# ifdef CONFIG_SMP
#  define DNAME_INLINE_LEN 36 /* 128 bytes */
# else
#  define DNAME_INLINE_LEN 40 /* 128 bytes */
# endif
#endif

#define d_lock	d_lockref.lock

struct dentry {
	/* RCU lookup touched fields */
	unsigned int d_flags;		/* protected by d_lock */
	seqcount_t d_seq;		/* per dentry seqlock */
	struct hlist_bl_node d_hash;	/* lookup hash list */
	struct dentry *d_parent;	/* parent directory */
	struct qstr d_name;
	struct inode *d_inode;		/* Where the name belongs to - NULL is negative */
	unsigned char d_iname[DNAME_INLINE_LEN];	/* small names */

	/* Ref lookup also touches following */
	struct lockref d_lockref;	/* per-dentry lock and refcount */
	const struct dentry_operations *d_op;
	struct super_block *d_sb;	/* The root of the dentry tree */
	unsigned long d_time;		/* used by d_revalidate */
	void *d_fsdata;			/* fs-specific data */

	struct list_head d_lru;		/* LRU list */
	/* d_child and d_rcu can share memory */
	union {
		struct list_head d_child;	/* child of parent list */
	 	struct rcu_head d_rcu;
	} d_u;
	struct list_head d_subdirs;	/* our children */
	struct hlist_node d_alias;	/* inode alias list */
};

Unlike the previous structures, the fields of a directory entry are simple and have no disk-specific attributes. This is because directory entries are created on demand: VFS builds them while parsing the path string. The dentry structure is therefore not data stored on disk but an in-memory structure that acts as a cache.

The cache of directory entries can be viewed with slabinfo:

$ slabinfo | awk 'NR==1 || $1=="dentry" {print}'
Name      Objects Objsize  Space Slabs/Part/Cpu O/S O %Fr %Ef Flg
dentry     608813     192 139.5m 33924/16606/142  21 0  48  83 A

Since directory entries are an in-memory cache for the file system, they are managed much like any other cache: for example, checking whether an entry is still valid and releasing cached structures. These operations are contained in the operations structure of the directory entry:

struct dentry_operations {
	int (*d_revalidate)(struct dentry *, unsigned int);
	int (*d_weak_revalidate)(struct dentry *, unsigned int);
	int (*d_hash)(const struct dentry *, struct qstr *);
	int (*d_compare)(const struct dentry *, const struct dentry *,
			unsigned int, const char *, const struct qstr *);
	int (*d_delete)(const struct dentry *);
	void (*d_release)(struct dentry *);
	void (*d_prune)(struct dentry *);
	void (*d_iput)(struct dentry *, struct inode *);
	char *(*d_dname)(struct dentry *, char *, int);
	struct vfsmount* (*d_automount) (struct path *);
	int (*d_manage)(struct dentry *, bool);
} ____cacheline_aligned;

file

The file structure represents a file opened by a process. It is the in-memory representation of that open file. The structure is created by the open system call and released by the close system call, and all operations on files are centered around it.

struct file {
	union {
		struct llist_node	fu_llist;
		struct rcu_head 	fu_rcuhead;
	} f_u;
	struct path		f_path;
#define f_dentry	f_path.dentry
	struct inode		*f_inode;	/* cached value */
	const struct file_operations	*f_op;

	/*
	 * Protects f_ep_links, f_flags, f_pos vs i_size in lseek SEEK_CUR.
	 * Must not be taken from IRQ context.
	 */
	spinlock_t		f_lock;
	atomic_long_t		f_count;
	unsigned int 		f_flags;
	fmode_t			f_mode;
	loff_t			f_pos;
	struct fown_struct	f_owner;
	const struct cred	*f_cred;
	struct file_ra_state	f_ra;

	u64			f_version;
#ifdef CONFIG_SECURITY
	void			*f_security;
#endif
	/* needed for tty driver, and maybe others */
	void			*private_data;

#ifdef CONFIG_EPOLL
	/* Used by fs/eventpoll.c to link all the hooks to this file */
	struct list_head	f_ep_links;
	struct list_head	f_tfile_llink;
#endif /* #ifdef CONFIG_EPOLL */
	struct address_space	*f_mapping;
#ifdef CONFIG_DEBUG_WRITECOUNT
	unsigned long f_mnt_write_state;
#endif
};

The file structure represents an open file, but note that when a program opens a file it gets a file descriptor, and a file descriptor is not the same thing as the file structure; the difference is discussed later. As you can see, the file structure has a reference count field called f_count. When the reference count drops to zero, the release method in the file operations structure is called. What that method does is determined by the file system's implementation.
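The reference-counting behaviour can be pictured with a toy model. This is not kernel code: ToyFile, get and put are hypothetical names, and the release callback simply stands in for the release method of the file operations structure.

```python
class ToyFile:
    """Toy model of the f_count idea; not how the kernel implements it."""

    def __init__(self, release):
        self.f_count = 1          # the opener holds the first reference
        self._release = release   # stands in for file_operations.release

    def get(self) -> None:
        """Take an extra reference, e.g. when another user shares the file."""
        self.f_count += 1

    def put(self) -> None:
        """Drop a reference; call release when the last one goes away."""
        self.f_count -= 1
        if self.f_count == 0:
            self._release(self)


if __name__ == "__main__":
    f = ToyFile(lambda obj: print("released"))
    f.get()
    f.put()  # count is back to 1, nothing happens
    f.put()  # count reaches 0, the release callback runs
```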

For the file operation structure, namely file_operations, the operation function names and system call/library function names are basically the same.

struct file_operations {
	struct module *owner;
	loff_t (*llseek) (struct file *, loff_t, int);
	ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
	ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
	ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
	int (*iterate) (struct file *, struct dir_context *);
	unsigned int (*poll) (struct file *, struct poll_table_struct *);
	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
	int (*mmap) (struct file *, struct vm_area_struct *);
	int (*open) (struct inode *, struct file *);
	int (*flush) (struct file *, fl_owner_t id);
	int (*release) (struct inode *, struct file *);
	int (*fsync) (struct file *, loff_t, loff_t, int datasync);
	int (*aio_fsync) (struct kiocb *, int datasync);
	int (*fasync) (int, struct file *, int);
	int (*lock) (struct file *, int, struct file_lock *);
	ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
	unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
	int (*check_flags)(int);
	int (*flock) (struct file *, int, struct file_lock *);
	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
	ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
	int (*setlease)(struct file *, long, struct file_lock **);
	long (*fallocate)(struct file *file, int mode, loff_t offset,
			  loff_t len);
	int (*show_fdinfo)(struct seq_file *m, struct file *f);
};

Run-time related data structures

The four VFS structures and their associated operations have been described above, but they are static concepts that provide a set of interface definitions, or a standard for connecting file systems.

But for a user to actually use a file system, it has to be mounted into the current directory tree, and files in it have to be opened. These operations require some additional data structures.

The kernel also uses data structures to manage file systems and related data, such as file_system_type to describe specific file system types:

struct file_system_type {
	const char *name;
	int fs_flags;
#define FS_REQUIRES_DEV		1 
#define FS_BINARY_MOUNTDATA	2
#define FS_HAS_SUBTYPE		4
#define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
#define FS_USERNS_DEV_MOUNT	16 /* A userns mount does not imply MNT_NODEV */
#define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
	struct dentry* (*mount) (struct file_system_type *, int, const char *, void *);
	void (*kill_sb) (struct super_block *);
	struct module *owner;
	struct file_system_type * next;
	struct hlist_head fs_supers;

	struct lock_class_key s_lock_key;
	struct lock_class_key s_umount_key;
	struct lock_class_key s_vfs_rename_key;
	struct lock_class_key s_writers_key[SB_FREEZE_LEVELS];

	struct lock_class_key i_lock_key;
	struct lock_class_key i_mutex_key;
	struct lock_class_key i_mutex_dir_key;
};

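Each file system type registered with the system has exactly one file_system_type object, no matter how many instances of it are mounted. The object contains the mount and kill_sb methods, which set up and tear down superblocks when the file system is mounted and unmounted.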
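On a running Linux system, the registered file system types are listed in /proc/filesystems, one per line, with the prefix nodev for types that do not need a block device. The sketch below parses that format; parse_filesystems is a hypothetical helper name, and reading the real file is guarded because /proc may not be available everywhere.

```python
import os


def parse_filesystems(text: str) -> list:
    """Parse /proc/filesystems content into (name, requires_device) pairs."""
    result = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Each line is "<flag>\t<name>"; the flag is "nodev" or empty.
        flag, _, name = line.partition("\t")
        result.append((name.strip(), flag.strip() != "nodev"))
    return result


if __name__ == "__main__":
    proc_path = "/proc/filesystems"
    if os.path.exists(proc_path):
        with open(proc_path) as f:
            for name, needs_dev in parse_filesystems(f.read()):
                print(name, "(block device)" if needs_dev else "(nodev)")
```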

The mount operation not only mounts the file system but also creates a vfsmount structure to represent the mount point. vfsmount is defined in linux/mount.h:

struct vfsmount {
	struct dentry *mnt_root;	/* root of the mounted tree */
	struct super_block *mnt_sb;	/* pointer to superblock */
	int mnt_flags;
};

File descriptor

Each process in the system has its own set of open files, and each open file in each process has its own file descriptor and read/write offset. Several further data structures handle this, and they are closely tied to the VFS structures described above.

The file structure described above represents open files across the whole system, while files_struct maintains all the files opened by the current process. A file descriptor is an index into that per-process table.

The files_struct structure is defined in linux/fdtable.h:

/* Open file table structure */
struct files_struct {
	/* read mostly part */
	atomic_t count;
	struct fdtable __rcu *fdt;
	struct fdtable fdtab;
	/* written part on a separate cache line in SMP */
	spinlock_t file_lock ____cacheline_aligned_in_smp;
	int next_fd;
	unsigned long close_on_exec_init[1];
	unsigned long open_fds_init[1];
	struct file __rcu * fd_array[NR_OPEN_DEFAULT];
};

The fd_array array holds pointers to the file structures opened by the process, indexed by file descriptor. Through it, the process finds the file for a given descriptor and, from there, the inode.
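The difference between a file descriptor and the open file it refers to can be observed from user space. In the minimal sketch below, a descriptor duplicated with os.dup shares the read/write offset of the original, because both refer to the same open file, while a second os.open of the same path gets its own independent offset. The function name offsets_after_read is an assumption made for this example.

```python
import os
import tempfile


def offsets_after_read(path: str) -> tuple:
    """Read 5 bytes via one descriptor and report the offsets of two others."""
    fd = os.open(path, os.O_RDONLY)
    dup_fd = os.dup(fd)                    # new descriptor, same open file
    other_fd = os.open(path, os.O_RDONLY)  # separate open file, same inode
    try:
        os.read(fd, 5)  # advances the offset of the shared open file
        shared = os.lseek(dup_fd, 0, os.SEEK_CUR)
        independent = os.lseek(other_fd, 0, os.SEEK_CUR)
        return shared, independent
    finally:
        for d in (fd, dup_fd, other_fd):
            os.close(d)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        p = os.path.join(tmp, "file.txt")
        with open(p, "w") as f:
            f.write("hello world")
        print(offsets_after_read(p))
```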

The file system

Having gone through these kernel data structures, we now have a basic picture of the chain from a process opening a file down to the VFS structures that handle it. Because these structures are designed in an object-oriented style, with the data carrying its own "methods", we can also trace roughly how an I/O request is executed. The diagram below shows this flow:

Finally, I’ll show you how file systems work with VFS by looking at two very typical file systems.

Traditional file system: ext2

Ext2 was an excellent file system and once the most widely used file system on Linux. It is the successor to ext, the original Linux file system. Although ext2 is rarely used today, its simplicity makes it a good introduction to how file systems work.

An ext2 file system consists of the following parts.

Boot block: always the first block of the file system. The boot block is not used by the file system, but contains information used to boot the operating system. Operating systems need only one boot block, but all file systems have one, and the vast majority are unused.

Superblock: A separate block that follows the boot block and contains file system-related parameter information, including:

  • Inode table capacity;
  • The size of logical blocks in the file system;
  • The size of a file system in logical blocks;

Inode table: Each file or directory in a file system has a unique record in the inode table. This record holds various pieces of information about the file, such as:

  • File types (for example, regular files, directories, symbolic links, character devices, and so on)
  • File owner (also known as user ID or UID)
  • File group owner (also known as group ID or GID)
  • Access permissions of the owner, owner group, and other users
  • Three timestamps: last access time, last modified time, last change of file state time
  • Number of hard links to a file
  • The size of the file in bytes
  • The number of blocks actually allocated to the file
  • Pointers to the file's data blocks

Data blocks: Most of the space of a file system is used to store data to form files and directories that reside on the file system.

In the ext2 file system, a file's data blocks are not necessarily contiguous or even in order. To locate the data blocks of a file, the kernel maintains a set of pointers within the inode.

An inode contains 15 pointers (0-14). The first 12 pointers (0-11) point directly to data blocks, which is enough to reference all the data of a small file. The remaining 3 pointers point to indirect pointer blocks (single, double, and triple indirect), which in turn point to further data blocks.

Each pointer is 4 bytes, so a 4096-byte block holds 1024 pointers, and the triple-indirect pointer alone can address 1024×1024×1024 blocks. The theoretical maximum size of a single file is therefore slightly more than 1024×1024×1024×4096 bytes, or about 4TB (4096 GB).
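The arithmetic can be checked with a few lines of Python. This sketch assumes the classic ext2 layout of 12 direct pointers plus one single-, one double- and one triple-indirect pointer, with 4-byte block pointers; it ignores any other limits a real implementation may impose, and the function name is made up for illustration.

```python
def ext2_max_file_bytes(block_size: int = 4096, pointer_size: int = 4,
                        direct_pointers: int = 12) -> int:
    """Theoretical maximum file size reachable through the inode pointers."""
    ptrs = block_size // pointer_size  # pointers that fit in one block
    blocks = (
        direct_pointers  # direct blocks
        + ptrs           # single indirect
        + ptrs ** 2      # double indirect
        + ptrs ** 3      # triple indirect
    )
    return blocks * block_size


if __name__ == "__main__":
    total = ext2_max_file_bytes()
    print(total, "bytes, about", round(total / 2 ** 40, 3), "TiB")
```

The triple-indirect term dominates, which is why the result is usually rounded to 4TB.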

Journaling file system: XFS

As a second example, we can take a look at XFS, a common modern file system developed by Silicon Graphics for their IRIX operating system, which has good performance in large file processing and transfer.

The point of this article is not to explain how XFS works, but to look at XFS from a Linux perspective. Unlike ext, XFS was not designed for Linux, so its file handling does not follow the VFS design.

In VFS, file operations are split into two layers: file (reading, writing, and so on) and inode (file creation, deletion, and so on), whereas XFS has only a single vnode layer that provides all operations. So when XFS was ported to Linux, a translation layer called linvfs was introduced to fit into VFS, mapping operations on files and inodes to vnodes.
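The idea of such a translation layer is essentially the adapter pattern. The sketch below is a loose illustration only: the class names and the operation labels vop_read and vop_create are invented for this example and are not taken from the XFS source.

```python
class VnodeOps:
    """Hypothetical single-layer interface: one object handles everything."""

    def __init__(self):
        self.log = []  # records which vnode operations were invoked

    def vop(self, name: str, *args) -> str:
        self.log.append((name, args))
        return name


class FileOpsAdapter:
    """Plays the role of the VFS file layer, forwarding to the vnode layer."""

    def __init__(self, vnode: VnodeOps):
        self._vnode = vnode

    def read(self, size: int) -> str:
        return self._vnode.vop("vop_read", size)


class InodeOpsAdapter:
    """Plays the role of the VFS inode layer, forwarding to the vnode layer."""

    def __init__(self, vnode: VnodeOps):
        self._vnode = vnode

    def create(self, name: str) -> str:
        return self._vnode.vop("vop_create", name)


if __name__ == "__main__":
    vnode = VnodeOps()
    FileOpsAdapter(vnode).read(4096)
    InodeOpsAdapter(vnode).create("file1")
    print(vnode.log)
```

Both adapters end up calling into the same vnode object, which mirrors how the two VFS layers are funnelled into one.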

Conclusion

This article has walked through the core VFS data structures and how they relate to each other. But what is the use of understanding VFS? I think there are two benefits.

The first is to understand how Linux file systems work, which is helpful in understanding what happens to IO.

The second is to help you understand IO caching in Linux. VFS is associated with many caches, including page caches, directory caches, and inode caches:

In addition to the inode and directory entry caches, which cache in-memory structures, the page cache holds recently read and written blocks of file data. The address_space structure, reachable through file.f_mapping, is used to manage the page cache.

References

  • Linux/UNIX System Programming Manual
  • Linux Kernel Design and Implementation
  • Porting the SGI XFS File System to Linux