In my article “How Much Memory per TCP Connection in Linux?” This article plots the data structures of the file Descriptor tables now used by the Linux kernel (4.x). This article does some code archaeology to reconstruct the context of five versions of this data structure. Table – Shuo Chen’s Notes, File Descriptor
Linux kernel version history
Linux has been around for a quarter of a century since Linux 0.01 was released in September 1991. Linux kernel “Historical” Git Repository with Full History Prior to Linux 2.6, the kernel used the “odd development, even stable” version naming rule. Therefore, 1.0, 1.2, 2.0, 2.2, 2.4, 2.6 are stable releases, while 1.1.x, 1.3.x, 2.1.x, 2.3.x, 2.5.x are development releases. After 2.6, instead of setting “development” separately, development will scroll directly on the current stable version. 2. How the Development Process Works and bootlin.com/doc/trainin… Section “Linux Versioning Scheme and Development Process” from page 401.
Since its inception, Linux has used struct task_struct to represent processes/threads, struct file to represent open files, and struct inode to represent files themselves. The difference between struct file and struct inode is that if you open the same file twice (e.g., web server write access log, you use less to view the assess log file), There will be two struct file objects pointing to the same struct inode object. It is easy to imagine that the offset of the open file should be placed in the struct file (the offset of the Access log opened by the Web server is at the end of the file, and the offset of the access log you opened with less is at the beginning of the file). The length of the file itself should be in struct inode (the Web server continues to write to the access log and you don’t need to reopen the file to see it). In other words, the information you see with ls -l (except for the filename itself) is stored in the struct inode; (f)stat(2) returns a struct inode. The code that updates offsets in struct files has had a multithreaded security bug for a long time, until Linux 3.14 was released at the end of March 2014. This means that the write(2) system call in Ubuntu 14.04 is not thread-safe. Bugs – Shuo Chen’s Notes
Struct task_struct; struct file;
Version A: from 0.01 to 1.1.10
The earliest Linux kernels simply put fixed-length arrays of elements struct file* inside struct task_struct.
// include/ Linux /sched.h of linux-1.1.10 struct task_struct {// include/ linux-1.1.10 struct task_struct {//... struct file * filp[NR_OPEN]; fd_set close_on_exec; / /... }Copy the code
Struct file* fp from int fd
struct file* fp = current->filp[fd];
Copy the code
Struct files and struct inodes are also in their own fixed-length arrays.
// fs/file_table.c of linux-0.99 struct file file_table[NR_FILE]; Static struct inode inode_table[NR_INODE]; static struct inode inode_table[NR_INODE];Copy the code
The values of the macros NR_OPEN, NR_FILE, and NR_INODE determine the size of the array, and their values gradually increase. Changing NR_OPEN affects sizeof (struct task_struct) and also directly affects how much physical memory each process occupies, since task_struct objects do not swap to disk.
| Version | NR_OPEN | NR_FILE | NR_INODE | | | -- -- -- -- -- -- -- -- -- -- -- - | -- -- -- -- -- - | -- -- -- -- -- -- -- - | | | 0.01 20 32 | | 64 | | | 0.12 20 | 64 | 64 | | 0.95 20 | | 64 | 128 | | 3 | a - 0.96 32 64 | | 128 | | | | 32 0.96 c - 1 128 | 128 | | 0.96 pre | | 64 | 32 0.97 32 | | 128 | | 128 | 128 | | 0.98.4 | 256 | 128 | 128 | | 0.99.10 | 256 | | 1024 | 2048Copy the code
In 0.99.10, struct files and struct inodes were dynamically allocated, so the number of files that the entire system can open at the same time is greatly increased, but the number of files that can be opened per process is still NR_OPEN.
// fs/file_table.c of linux-0.99.10
-struct file file_table[NR_FILE];
+struct file * first_file;
Copy the code
Version B: 1.1.11 to 1.3.21
1.1.11 Fs_struct, files_struct, mm_struct are separated from task_struct.
// include/ Linux /sched.h of linux-1.3.21 /* Open file table structure */ struct files_struct {int count; fd_set close_on_exec; struct file * fd[NR_OPEN]; }; struct task_struct { // ... /* filesystem information */ struct fs_struct fs[1]; /* open file information */ struct files_struct files[1]; /* memory management info */ struct mm_struct mm[1]; / /... };Copy the code
This does not change the functionality of the program, but better organizes the data structure so that closely related data members are in the same structure, reflecting the idea of encapsulation. Modifying NR_OPEN also directly affects sizeof (struct task_struct).
Struct file* fp from int fd
struct file* fp = current->files->fd[fd];
Copy the code
I guess I don’t have to change the client code when I change it to a pointer in the future.
File Descriptor flag and File Status flag
As mentioned in man 2 FCNTL, the flags of files are divided into file Descriptor flag and file status flag. Use F_GETFD/F_SETFD and F_GETFL/F_SETFL respectively (see muduo/net/SocketsOps. Cc for example). The file Descriptor flag has only one: close-on-exec; File status flags include O_NONBLOCK, O_APPEND, O_DIRECT, and so on. So files_struct should have fd_set close_on_exec member, which is used to store the File Descriptor flag, while file status flag is placed in f_flags member of the struct file. The difference between these two types of flags is reflected in the DUP (2) system call, which will be covered later.
Version C: 1.3.22 to 2.1.89
Sizeof a task_struct (struct task_struct) is slimming by changing the sizeof a task_struct (files, fs, mm) to a pointer. This is done to support multithreading.
// include/linux/sched.h of linux-2.0.2
struct task_struct {
// ...
/* filesystem information */
- struct fs_struct fs[1];
+ struct fs_struct *fs;
/* open file information */
- struct files_struct files[1];
+ struct files_struct *files;
/* memory management info */
- struct mm_struct mm[1];
+ struct mm_struct *mm;
// ...
};
Copy the code
Struct file* fp = current->files->fd[fd];
Linux 2.0 began to support multithreading. (The original LinuxThreads implementation was changed in 2.6 to the NPTL implementation with more POSIX semantics.) Remove the files_struct member from the task_struct, so that multiple threads in the same process can share a files_struct object, so that the files opened by thread 1 can be seen by thread 2.
The same goes for fs_struct and mm_struct.
Version D: 2.1.90 to 2.6.13
2.1.90 Change the fd member of files_struct from fixed-length array to dynamic array, so that each process can open many files at the same time, which clears a big obstacle for writing high concurrency network services.
// include/ Linux /sched.h of linux-2.2.0 /* * Open file table structure */ struct files_struct {atomic_t count; + int max_fds; + struct file ** fd; /* current fd array */ fd_set close_on_exec; // Changed to fd_set* in 2.2.12 fd_set open_fds; - struct file * fd[NR_OPEN]; };Copy the code
Data structure diagram:
Struct file* fp = current->files->fd[fd];
At this point, the functionality of the file descriptor table is complete, and the next version is a performance improvement.
Version E: 2.6.14 to present
2.6.14 Introduces struct fdtable as an indirect member of files_struct, and moves fd, max_fds, close_on_exec and other members into FDtable. This is done to facilitate RCU adoption and make fdTable replaceable as a whole. Read Copy Update (RCU) is the work of Paul E. McKenney. It is a more scalable Read/write synchronization mechanism widely adopted by the kernel. He also wrote the famous Is Parallel Programming Hard, And If So, What Can You Do About It? A book.
// include/ Linux /fdtable.h of linux-2.6.37 struct fdtable {unsigned int max_fds; struct file __rcu **fd; /* current fd array */ fd_set *close_on_exec; fd_set *open_fds; struct rcu_head rcu; struct fdtable *next; }; /* * Open file table structure */ struct files_struct { /* * read mostly part */ atomic_t count; struct fdtable __rcu *fdt; struct fdtable fdtab; /* * written part on a separate cache line in SMP */ spinlock_t file_lock ____cacheline_aligned_in_smp; int next_fd; struct embedded_fd_set close_on_exec_init; struct embedded_fd_set open_fds_init; struct file __rcu * fd_array[NR_OPEN_DEFAULT]; };Copy the code
The data structure is shown as follows:
Struct file* fp from int fd
current->files->fdt->fd[fd];
Copy the code
The actual code is more complex than this, because the files-> FDT step is done with rcu_dereference (red line above).
SO_REUSEPORT and dup (2)
Does adding multiple Epolls to the same listening socket reduce response time? Someone mentioned that you can use dUP (2) on the listening socket to achieve the same effect, which is not possible. The reason is that dUP (2) does not copy the struct file itself, but merely copies the pointer to the struct file and increments the reference count in the file (f_count member). Reading and writing to both FDS is still done through the same file object and performance is not improved (see figure below). If you add 3 and 4 fd’s to two epolls, you are actually adding the same file to two epolls (the f_EP_links member of file will chain the two epolls together), which is essentially different from SO_REUSEPORT. For further information, see improvements to SO_REUSEPORT in Linux 4.5/4.6.
BSD and FreeBSD
Finally, let’s look at how it works in the BSD and FreeBSD kernels. Here is a brief history of BSD releases.
The history of the file descriptor table in the BSD kernel is much simpler.
4.3 BSD-REno and previous versions (up to Unix V4) take Linux version A and place A fixed length array of struct file* directly in the process descriptor.
struct user {
// ...
struct file *u_ofile[NOFILE]; /* file structures for open files */
// ...
};
Copy the code
Unix V7 is written exactly like this minnie.tuhs.org/cgi-bin/utr… .
BSD net/2 to FreeBSD 9.3 uses Linux version D, where proc == task_struct, filedesc == files_struct, file == file
// sys/proc.h
/*
* Process structure.
*/
struct proc {
// ...
struct filedesc *p_fd; /* (b) Open files. */
// ...
};
// sys/filedesc.h
struct filedesc {
struct file **fd_ofiles; /* file structures for open files */
char *fd_ofileflags; /* per-process open file flags */
struct vnode *fd_cdir; /* current directory */
struct vnode *fd_rdir; /* root directory */
struct vnode *fd_jdir; /* jail root directory */
int fd_nfiles; /* number of open files allocated */
NDSLOTTYPE *fd_map; /* bitmap of free fds */
int fd_lastfile; /* high-water mark of fd_ofiles */
int fd_freefile; /* approx. next free file */
u_short fd_cmask; /* mask for file creation */
u_short fd_refcnt; /* thread reference count */
u_short fd_holdcnt; /* hold count on structure + mutex */
struct sx fd_sx; /* protects members of this struct */
struct kqlist fd_kqlist; /* list of kqueues on this filedesc */
int fd_holdleaderscount; /* block fdfree() for shared close() */
int fd_holdleaderswakeup; /* fdfree() needs wakeup */
};
// sys/file.h
struct file {
void *f_data; /* file descriptor specific data */
struct fileops *f_ops; /* File operations */
struct ucred *f_cred; /* associated credentials. */
struct vnode *f_vnode; /* NULL or applicable vnode */
short f_type; /* descriptor type */
short f_vnread_flags; /* (f) Sleep lock for f_offset */
volatile u_int f_flag; /* see fcntl.h */
volatile u_int f_count; /* reference count */
// ...
off_t f_offset;
// ...
};
Copy the code
As you can see, the BSD kernel source code is much cleaner and more detailed than Linux.