From RM to Linux Virtual File System

Let's take a small, self-contained question: what actually happens after you run rm?

Open the rm source:

[qianzichen@dev03v /src/app/coreutils/coreutils-8.21]$ vi src/rm.c

Start with the main function:

int
main (int argc, char **argv)
{
  ...
  while ((c = getopt_long (argc, argv, "dfirvIR", long_opts, NULL)) != -1)
    {
      switch (c)
        {
        case 'f':
          x.interactive = RMI_NEVER;
          break;
        ...
        }
    }
  ...
  enum RM_status status = rm (file, &x);
}

First the command line arguments are parsed and then rm is called:

enum RM_status status = rm (file, &x);

The rm function itself is not implemented in rm.c; it lives in remove.c:

/* Remove FILEs, honoring options specified via X. Return RM_OK if successful. */
enum RM_status
rm (char *const *file, struct rm_options const *x) 
{
  enum RM_status rm_status = RM_OK;

  if (*file)
    {
      FTS *fts = xfts_open (file, bit_flags, NULL);
      while (1)
        {
          ...
        }
    }
  ...
}

The file argument is a read-only array of pointers holding the names of the files to delete. The x argument points to the structure below, which stores the rm options parsed from the command line.

struct rm_options
{
  /* If true, ignore nonexistent files. */
  bool ignore_missing_files;

  /* If true, query the user about whether to remove each file. */
  enum rm_interactive interactive;
  ...
  /* If true, recursively remove directories. */
  bool recursive;
  bool require_restore_cwd;
};

When the file list exists, rm calls xfts_open:

FTS *
xfts_open (char * const *argv, int options,
           int (*compar) (const FTSENT **, const FTSENT **))
{
  FTS *fts = fts_open (argv, options | FTS_CWDFD, compar);
  if (fts == NULL)
    {
...
  return fts;
}

xfts_open is a thin wrapper: it calls fts_open, handles the NULL (failure) case, and otherwise returns the FTS handle. fts_open is implemented as follows:

FTS *
fts_open (char * const *argv,
          register int options,
          int (*compar) (FTSENT const **, FTSENT const **))
{
        register FTS *sp;

        /* Options check. */
        /* Allocate/initialize the stream */
        /* Initialize fts_cwd_fd.  */
        sp->fts_cwd_fd = AT_FDCWD;
        if ( ISSET(FTS_CWDFD) && ! HAVE_OPENAT_SUPPORT)
          {
            int fd = open (".",
                           O_SEARCH | (ISSET (FTS_NOATIME) ? O_NOATIME : 0));
        /*
         * Start out with 1K of file name space, and enough, in any case,
         * to hold the user's file names.
         */
        /* Allocate/initialize root's parent. */
        if (*argv != NULL) {
                if ((parent = fts_alloc(sp, "", 0)) == NULL)
                        goto mem2;
                parent->fts_level = FTS_ROOTPARENTLEVEL;
          }

        /* Allocate/initialize root(s). */
        for (root = NULL, nitems = 0; *argv != NULL; ++argv, ++nitems) {
                /*
                 * If comparison routine supplied, traverse in sorted
                 * order; otherwise traverse in the order specified.
                 */
                if (compar) {
                        p->fts_link = root;
                        root = p;
                } else {
                        p->fts_link = NULL;
                        if (root == NULL)
                                tmp = root = p;
                        else {
                                tmp->fts_link = p;
                                tmp = p;
                        }
                }
        }
        if (compar && nitems > 1)
                root = fts_sort(sp, root, nitems);
...  
        if (!ISSET(FTS_NOCHDIR) && !ISSET(FTS_CWDFD)
            && (sp->fts_rfd = diropen (sp, ".")) < 0)
                SET(FTS_NOCHDIR);

        i_ring_init (&sp->fts_fd_ring, -1);
        return (sp);

mem3:   fts_lfree(root);
...
        return (NULL);
}

Some error handling has been omitted from this excerpt. As you can see, fts_open mainly sets up the state needed to traverse the file system and stores it in the FTS structure, which is defined as follows:

typedef struct {
        struct _ftsent *fts_cur;        /* current node */
        int (*fts_compar) (struct _ftsent const **, struct _ftsent const **);
                                        /* compare fn */
        ...
        int fts_options;                /* fts_open options, global flags */
        struct hash_table *fts_leaf_optimization_works_ht;
        union {
                ...
                struct cycle_check_state *state;
        } fts_cycle;

        I_ring fts_fd_ring;
} FTS;

Back in the rm function, the loop calls fts_read to fetch the next entry of the traversal and keeps it in ent:

rm (char *const *file, struct rm_options const *x) 
{
  enum RM_status rm_status = RM_OK;

  if (*file)
    {
      FTS *fts = xfts_open (file, bit_flags, NULL);
      while (1)
        {
           ent = fts_read (fts);
           enum RM_status s = rm_fts (fts, ent, x);
           ...
        }
    }
  ...
}

ent is an FTSENT, a fairly large structure that I won't expand here.

rm_fts then acts on that entry. Here we are removing a regular file, so control goes to the FTS_F branch, which ends by calling excise.

static enum RM_status
rm_fts (FTS *fts, FTSENT *ent, struct rm_options const *x)
{
  switch (ent->fts_info)
    {
    case FTS_D:			/* preorder directory */
        if (s == RM_OK && is_empty_directory == T_YES)
          {
            /* When we know (from prompt when in interactive mode)
               that this is an empty directory, don't prompt twice.  */
            s = excise (fts, ent, x, true);
            fts_skip_tree (fts, ent);
          }
        ...
      }

    case FTS_F:			/* regular file */
      {
        bool is_dir = ent->fts_info == FTS_DP || ent->fts_info == FTS_DNR;
        enum RM_status s = prompt (fts, ent, is_dir, x, PA_REMOVE_DIR, NULL);
        if (s != RM_OK)
          return s;
        return excise (fts, ent, x, is_dir);
      }
    ...
    }
}

Again ignoring the error handling and optimizations, excise ends up calling unlinkat:

static enum RM_status
excise (FTS *fts, FTSENT *ent, struct rm_options const *x, bool is_dir)
{
  int flag = is_dir ? AT_REMOVEDIR : 0;
  if (unlinkat (fts->fts_cwd_fd, ent->fts_accpath, flag) == 0)
    {
      if (x->verbose)
        {
          printf ((is_dir ? _("removed directory: %s\n") ...
        }
      return RM_OK;
    }
  ...
}

As shown above, rm ultimately calls the core function unlinkat. For example, deleting a.txt comes down to:

unlinkat(AT_FDCWD, "a.txt", 0)

rm, a user-space program, calls unlinkat from the C library. Searching for it shows that it is declared in <unistd.h>:

#ifdef __USE_ATFILE
/* Remove the link NAME relative to FD.  */
extern int unlinkat (int __fd, const char *__name, int __flag)
     __THROW __nonnull ((2));
#endif

/* Remove the directory PATH.  */
extern int rmdir (const char *__path) __THROW __nonnull ((1));

glibc's generic io/ directory only provides stubs for these functions. The one for the closely related unlink, for example, is defined in io/unlink.c:

/* Remove the link named NAME.  */
int
__unlink (name)
     const char *name;
{
  if (name == NULL)
    {
      __set_errno (EINVAL);
      return -1;
    }

  __set_errno (ENOSYS);
  return -1;
}
stub_warning (unlink)

weak_alias (__unlink, unlink)

This is only a stub, exported through a weak alias, that fails with ENOSYS. The real Linux implementation of unlinkat is in ./sysdeps/unix/sysv/linux/unlinkat.c:

...
/* Remove the link named NAME.  */
int
unlinkat (fd, file, flag)
     int fd;
     const char *file;
     int flag;
{
  int result;

#ifdef __NR_unlinkat
# ifndef __ASSUME_ATFCTS
  if (__have_atfcts >= 0)
# endif
    {
      result = INLINE_SYSCALL (unlinkat, 3, fd, file, flag);
# ifndef __ASSUME_ATFCTS
      if (result == -1 && errno == ENOSYS)
        __have_atfcts = -1;
      else
# endif
        return result;
    }
  char *buf = NULL;
  ...
  INTERNAL_SYSCALL_DECL (err);

  if (flag & AT_REMOVEDIR)
    result = INTERNAL_SYSCALL (rmdir, err, 1, file);
  else
    result = INTERNAL_SYSCALL (unlink, err, 1, file);
  ...
}

Inside the macro, the system call number is spelled __NR_##name, so token pasting turns unlinkat into __NR_unlinkat. That constant is defined in /usr/include/asm/unistd_64.h:

#ifndef _ASM_X86_UNISTD_64_H
#define _ASM_X86_UNISTD_64_H 1

#define __NR_read 0
#define __NR_write 1
...
#define __NR_newfstatat 262
#define __NR_unlinkat 263
...
#define __NR_kexec_file_load 320
#define __NR_userfaultfd 323

#endif /* _ASM_X86_UNISTD_64_H */

So __NR_unlinkat is defined, and the #ifdef __NR_unlinkat branch is compiled in.

/* The *at syscalls were introduced just after 2.6.16-rc1.  Due to the way the
   kernel versions are advertised we can only rely on 2.6.17 to have
   the code.  On PPC they were introduced in 2.6.17-rc1,
   on SH in 2.6.19-rc1.  */
#if __LINUX_KERNEL_VERSION >= 0x020611 \
    && (!defined __sh__ || __LINUX_KERNEL_VERSION >= 0x020613)
# define __ASSUME_ATFCTS        1
#endif

So when glibc targets kernel 2.6.17 (0x020611) or later, __ASSUME_ATFCTS is defined, and INLINE_SYSCALL (unlinkat, 3, fd, file, flag) is executed directly, without the __have_atfcts >= 0 check.

Looking directly at the underlying implementation (./sysdeps/unix/sysv/linux/x86_64/sysdep.h), it is inline assembly:

# undef INLINE_SYSCALL_TYPES
# define INLINE_SYSCALL_TYPES(name, nr, args...) \
  ({                                                                          \
    unsigned long int resultvar = INTERNAL_SYSCALL_TYPES (name, , nr, args);  \
    if (__builtin_expect (INTERNAL_SYSCALL_ERROR_P (resultvar, ), 0))        \
      {                                                                       \
        __set_errno (INTERNAL_SYSCALL_ERRNO (resultvar, ));                   \
        resultvar = (unsigned long int) -1;                                   \
      }                                                                       \
    (long int) resultvar; })

# undef INTERNAL_SYSCALL_DECL
# define INTERNAL_SYSCALL_DECL(err) do { } while (0)

# define INTERNAL_SYSCALL_NCS(name, err, nr, args...) \
  ({                                                                          \
    unsigned long int resultvar;                                              \
    LOAD_ARGS_##nr (args)                                                     \
    LOAD_REGS_##nr                                                            \
    asm volatile (                                                            \
    "syscall\n\t"                                                             \
    : "=a" (resultvar)                                                        \
    : "0" (name) ASM_ARGS_##nr : "memory", "cc", "r11", "cx");                \
    (long int) resultvar; })
# undef INTERNAL_SYSCALL
# define INTERNAL_SYSCALL(name, err, nr, args...) \
  INTERNAL_SYSCALL_NCS (__NR_##name, err, nr, ##args)

# define INTERNAL_SYSCALL_NCS_TYPES(name, err, nr, args...) \

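The arguments are loaded into registers before the syscall instruction executes. The return value comes back in RAX: for unlinkat, 0 means success, and a value between -4095 and -1 is a negated errno, which the wrapper turns into errno plus a return value of -1.

So the chain so far is: the rm utility -> glibc -> the syscall instruction -> the kernel.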

However, the C library installed on this machine is not necessarily built from the unmodified upstream source.

So let's look at the final machine code by disassembling the installed library directly:

[qianzichen@dev03v /usr/lib64]$ objdump -D -S libc.so.6 > /tmp/libc.txt
[qianzichen@dev03v /usr/lib64]$ cd /tmp
[qianzichen@dev03v /tmp]$ grep -A12 'unlinkat' libc.txt 
00000000000e9c00 <unlinkat>:
   e9c00:       48 63 d2                movslq %edx,%rdx
   e9c03:       48 63 ff                movslq %edi,%rdi
   e9c06:       b8 07 01 00 00          mov    $0x107,%eax
   e9c0b:       0f 05                   syscall 
   e9c0d:       48 3d 00 f0 ff ff       cmp    $0xfffffffffffff000,%rax
   e9c13:       77 02                   ja     e9c17 <unlinkat+0x17>
   e9c15:       f3 c3                   repz retq 
   e9c17:       48 8b 15 4a 12 2d 00    mov    0x2d124a(%rip),%rdx        # 3bae68 <_DYNAMIC+0x2e8>
   e9c1e:       f7 d8                   neg    %eax
   e9c20:       64 89 02                mov    %eax,%fs:(%rdx)
   e9c23:       48 83 c8 ff             or     $0xffffffffffffffff,%rax
   e9c27:       c3                      retq   
   e9c28:       0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
   e9c2f:       00 

00000000000e9c30 <rmdir>:
   e9c30:       b8 54 00 00 00          mov    $0x54,%eax
   e9c35:       0f 05                   syscall 
[qianzichen@dev03v /tmp]$

Here you can see what glibc 2.17 compiles down to, shown in AT&T assembly syntax.

The movslq instruction, which exists only on x86-64, sign-extends its 32-bit source operand to 64 bits and stores the result in the destination, so the sign of the int arguments (fd in %edi, flag in %edx) is preserved.

Next, the value 0x107 is loaded into the EAX register.

The syscall instruction is then executed.

Open Intel's architecture manual and search for SYSCALL to find the relevant description.

It says: use CPUID to check if SYSCALL and SYSRET are available (CPUID.80000001H.EDX[bit 11] = 1). In other words, bit 11 of the EDX value returned by CPUID leaf 80000001H reports whether syscall/sysret are supported on a 64-bit platform. This is a capability flag read from CPUID's output; it is not the %edx/%rdx value loaded before the syscall instruction above, which simply carries the flag argument.

So unlinkat is a system call: the rm utility hands the job of deleting the file to the operating system, and at this point the program traps into kernel mode.

Now let's move to the kernel source and search for unlinkat directly:

[qianzichen@dev03v /src/linux/linux]$ grep unlinkat ./ -rn
./arch/parisc/include/uapi/asm/unistd.h:297:#define __NR_unlinkat (__NR_Linux + 281)
./arch/parisc/kernel/syscall_table.S:379:       ENTRY_SAME(unlinkat)
./arch/m32r/include/uapi/asm/unistd.h:309:#define __NR_unlinkat 301
./arch/m32r/kernel/syscall_table.S:303: .long sys_unlinkat
./arch/sparc/include/uapi/asm/unistd.h:358:#define __NR_unlinkat 290
./arch/sparc/kernel/systbls_32.S:78:/*290*/     .long sys_unlinkat, 
./arch/ia64/include/uapi/asm/unistd.h:279:#define __NR_unlinkat 1287
./arch/ia64/kernel/entry.S:1695:        data8 sys_unlinkat
./arch/ia64/kernel/fsys.S:815:  data8 0                         // unlinkat
./arch/alpha/include/uapi/asm/unistd.h:420:#define __NR_unlinkat 456
./arch/alpha/kernel/systbls.S:477:      .quad sys_unlinkat
...
./arch/x86/entry/syscalls/syscall_32.tbl:310:301        i386    unlinkat                sys_unlinkat
./arch/x86/entry/syscalls/syscall_64.tbl:272:263        common  unlinkat                sys_unlinkat
...
[qianzichen@dev03v /src/linux/linux]$

Let's go straight to the x86 version of the source:

[qianzichen@dev03v /src/linux/linux]$ vi arch/x86/entry/syscalls/syscall_64.tbl

This file is a plain table:

#
# 64-bit system call numbers and entry vectors
#
# The format is:
# <number> <abi> <name> <entry point>
#
# The abi is "common", "64" or "x32" for this file.
#
0	common	read			sys_read
...
261	common	futimesat		sys_futimesat
262	common	newfstatat		sys_newfstatat
263	common	unlinkat		sys_unlinkat
264	common	renameat		sys_renameat
265	common	linkat			sys_linkat
...

#
# x32-specific system call numbers start at 512 to avoid cache impact
# for native 64-bit operation.
#
512	x32	rt_sigaction		compat_sys_rt_sigaction
...

So the number for unlinkat is 263. Remember that the value written into the EAX register was 0x107, and 0x107 = 1 * 16^2 + 0 * 16^1 + 7 * 16^0 = 263.

common means the entry is shared by both ABIs this table serves, native 64-bit and x32; the same number maps to the same kernel entry point in either case.

In reality the kernel's handling of these numbers is not quite this simple, but I won't expand on it here.

So unlinkat in user space ultimately arrives at the kernel entry point sys_unlinkat.

Let’s go straight to assembly code:

[qianzichen@dev03v /src/linux/linux]$ vi arch/x86/entry/entry_64.S

...
ENTRY(entry_SYSCALL_64)
        /*
         * Interrupts are off on entry.
         * We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON,
         * it is too small to ever cause noticeable irq latency.
         */
        SWAPGS_UNSAFE_STACK
        movq    %rsp, PER_CPU_VAR(rsp_scratch)
        movq    PER_CPU_VAR(cpu_current_top_of_stack), %rsp

        TRACE_IRQS_OFF

        /* Construct struct pt_regs on stack */
        pushq   $__USER_DS
...
        ja      1f                              /* return -ENOSYS (already in pt_regs->ax) */
        movq    %r10, %rcx

        /*
         * This call instruction is handled specially in stub_ptregs_64.
         * It might end up jumping to the slow path.  If it jumps, RAX
         * and all argument registers are clobbered.
         */
        call    *sys_call_table(, %rax, 8)
...
END(entry_SYSCALL_64)

The system call number, __NR_unlinkat, is held in rax.

ENTRY(entry_SYSCALL_64) is the assembly entry point for 64-bit system calls. After a series of registers has been set up, call *sys_call_table(, %rax, 8) calls through the pointer stored at byte offset rax * 8 in the system call table, that is, the function whose index in the sys_call_table array is the system call number.

sys_call_table is defined in a separate file. It relies on a compiler extension and some neat preprocessor tricks, which I won't fully expand here.

/* System call table for x86-64. */
...
#define __SYSCALL_64_QUAL_(sym) sym
#define __SYSCALL_64_QUAL_ptregs(sym) ptregs_##sym

#define __SYSCALL_64(nr, sym, qual) extern asmlinkage long __SYSCALL_64_QUAL_##qual(sym)(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long);
#include <asm/syscalls_64.h>
#undef __SYSCALL_64

#define __SYSCALL_64(nr, sym, qual) [nr] = __SYSCALL_64_QUAL_##qual(sym),

extern long sys_ni_syscall(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long);

asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
        /*
         * Smells like a compiler bug -- it doesn't work
         * when the & below is removed.
         */
        [0 ... __NR_syscall_max] = &sys_ni_syscall,
#include <asm/syscalls_64.h>
};

So when is the slot for the unlinkat number set to sys_unlinkat? Look at <asm/syscalls_64.h>. This header is an intermediate file generated at build time; the mapping it contains comes from the aforementioned ./arch/x86/entry/syscalls/syscall_64.tbl.

The generated syscalls_64.h looks like this:

__SYSCALL_COMMON(49, sys_bind, sys_bind)
__SYSCALL_COMMON(50, sys_listen, sys_listen)
...
__SYSCALL_COMMON(263, sys_unlinkat, sys_unlinkat)

__SYSCALL_COMMON maps to __SYSCALL_64. As the sys_call_table definition above shows, __SYSCALL_64 is first defined so that including syscalls_64.h expands into function declarations; it is then redefined so that including syscalls_64.h a second time expands into array element initializers.

So the kernel ends up with a read-only sys_call_table array, indexed by system call number, whose elements are function pointers of type sys_call_ptr_t. System call numbers start at 0, so sys_unlinkat is found directly at index 263.

Now that the kernel has decided to call sys_unlinkat, where is this function defined? In my attempts, searching the 4.9 tree for sys_unlinkat directly did not turn up a definition, because the name is assembled by the preprocessor through token pasting.

The macro I eventually found was defined like this:

...
#define SYSCALL_DEFINE1(name, ...) SYSCALL_DEFINEx(1, _##name, __VA_ARGS__)
#define SYSCALL_DEFINE2(name, ...) SYSCALL_DEFINEx(2, _##name, __VA_ARGS__)
#define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)
#define SYSCALL_DEFINE4(name, ...) SYSCALL_DEFINEx(4, _##name, __VA_ARGS__)
#define SYSCALL_DEFINE5(name, ...) SYSCALL_DEFINEx(5, _##name, __VA_ARGS__)
#define SYSCALL_DEFINE6(name, ...) SYSCALL_DEFINEx(6, _##name, __VA_ARGS__)

#define SYSCALL_DEFINEx(x, sname, ...)                          \
        SYSCALL_METADATA(sname, x, __VA_ARGS__)                 \
        __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)

#define __PROTECT(...) asmlinkage_protect(__VA_ARGS__)
#define __SYSCALL_DEFINEx(x, name, ...)                                 \
        asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__))       \
                __attribute__((alias(__stringify(SyS##name))));         \
        static inline long SYSC##name(__MAP(x,__SC_DECL,__VA_ARGS__));  \
        asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__));      \
        asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__))       \
        {                                                               \
                long ret = SYSC##name(__MAP(x,__SC_CAST,__VA_ARGS__));  \
                __MAP(x,__SC_TEST,__VA_ARGS__);                         \
                __PROTECT(x, ret,__MAP(x,__SC_ARGS,__VA_ARGS__));       \
                return ret;                                             \
        }                                                               \
        static inline long SYSC##name(__MAP(x,__SC_DECL,__VA_ARGS__))

asmlinkage long sys32_quotactl(unsigned int cmd, const char __user *special,
...

sys_unlinkat is defined in fs/namei.c:

SYSCALL_DEFINE3(unlinkat, int, dfd, const char __user *, pathname, int, flag)
{
        if ((flag & ~AT_REMOVEDIR) != 0)
                return -EINVAL;

        if (flag & AT_REMOVEDIR)
                return do_rmdir(dfd, pathname);

        return do_unlinkat(dfd, pathname);
}

For a regular file, it then calls do_unlinkat:

3999 /*
4000  * Make sure that the actual truncation of the file will occur outside its
4001  * directory's i_mutex.  Truncate can take a long time if there is a lot of
4002  * writeout happening, and we don't want to prevent access to the directory
4003  * while waiting on the I/O.
4004  */
4005 static long do_unlinkat(int dfd, const char __user *pathname)
4006 {
4007         int error;
4008         struct filename *name;
4009         struct dentry *dentry;
4010         struct path path;
4011         struct qstr last;
4012         int type;
4013         struct inode *inode = NULL;
4014         struct inode *delegated_inode = NULL;
4015         unsigned int lookup_flags = 0;
4016 retry:
4017         name = filename_parentat(dfd, getname(pathname), lookup_flags,
4018                                 &path, &last, &type);
4019         if (IS_ERR(name))
4020                 return PTR_ERR(name);
4021 
4022         error = -EISDIR;
4023         if (type != LAST_NORM)
4024                 goto exit1;
4025 
4026         error = mnt_want_write(path.mnt);
4027         if (error)
4028                 goto exit1;
4029 retry_deleg:
4030         inode_lock_nested(path.dentry->d_inode, I_MUTEX_PARENT);
4031         dentry = __lookup_hash(&last, path.dentry, lookup_flags);
4032         error = PTR_ERR(dentry);
4033         if (!IS_ERR(dentry)) {
4034                 /* Why not before? Because we want correct error value */
4035                 if (last.name[last.len])
4036                         goto slashes;
4037                 inode = dentry->d_inode;
4038                 if (d_is_negative(dentry))
4039                         goto slashes;
4040                 ihold(inode);
4041                 error = security_path_unlink(&path, dentry);
4042                 if (error)
4043                         goto exit2;
4044                 error = vfs_unlink(path.dentry->d_inode, dentry, &delegated_inode);
4045 exit2:
4046                 dput(dentry);
4047         }
4048         inode_unlock(path.dentry->d_inode);
4049         if (inode)
4050                 iput(inode);    /* truncate the inode here */
4051         inode = NULL;
4052         if (delegated_inode) {
4053                 error = break_deleg_wait(&delegated_inode);
4054                 if (!error)
4055                         goto retry_deleg;
4056         }
4057         mnt_drop_write(path.mnt);
4058 exit1:
4059         path_put(&path);
4060         putname(name);
4061         if (retry_estale(error, lookup_flags)) {
4062                 lookup_flags |= LOOKUP_REVAL;
4063                 inode = NULL;
4064                 goto retry;
4065         }
4066         return error;
4067 
4068 slashes:
4069         if (d_is_negative(dentry))
4070                 error = -ENOENT;
4071         else if (d_is_dir(dentry))
4072                 error = -EISDIR;
4073         else
4074                 error = -ENOTDIR;
4075         goto exit2;
4076 }

Having come this far, the reader has now seen one of the more aesthetically pleasing parts of software engineering: line 4044, the call to vfs_unlink. On the way from user space through the system call, sys_unlinkat hands the unlinkat job to the operating system's virtual file system.

Let’s take a look at the vfs_unlink implementation:

3941 /**
3942  * vfs_unlink - unlink a filesystem object
3943  * @dir:        parent directory
3944  * @dentry:     victim
3945  * @delegated_inode: returns victim inode, if the inode is delegated.
3946  *
3947  * The caller must hold dir->i_mutex.
3948  *
3949  * If vfs_unlink discovers a delegation, it will return -EWOULDBLOCK and
3950  * return a reference to the inode in delegated_inode.  The caller
3951  * should then break the delegation on that inode and retry.  Because
3952  * breaking a delegation may take a long time, the caller should drop
3953  * dir->i_mutex before doing so.
3954  *
3955  * Alternatively, a caller may pass NULL for delegated_inode.  This may
3956  * be appropriate for callers that expect the underlying filesystem not
3957  * to be NFS exported.
3958  */
3959 int vfs_unlink(struct inode *dir, struct dentry *dentry, struct inode **delegated_inode)
3960 {
3961         struct inode *target = dentry->d_inode;
3962         int error = may_delete(dir, dentry, 0);
3963 
3964         if (error)
3965                 return error;
3966 
3967         if (!dir->i_op->unlink)
3968                 return -EPERM;
3969 
3970         inode_lock(target);
3971         if (is_local_mountpoint(dentry))
3972                 error = -EBUSY;
3973         else {
3974                 error = security_inode_unlink(dir, dentry);
3975                 if (!error) {
3976                         error = try_break_deleg(target, delegated_inode);
3977                         if (error)
3978                                 goto out;
3979                         error = dir->i_op->unlink(dir, dentry);
3980                         if (!error) {
3981                                 dont_mount(dentry);
3982                                 detach_mounts(dentry);
3983                         }
3984                 }
3985         }
3986 out:
3987         inode_unlock(target);
3988 
3989         /* We don't d_delete() NFS sillyrenamed files--they still exist. */
3990         if (!error && !(dentry->d_flags & DCACHE_NFSFS_RENAMED)) {
3991                 fsnotify_link_count(target);
3992                 d_delete(dentry);
3993         }
3994 
3995         return error;
3996 }
3997 EXPORT_SYMBOL(vfs_unlink);

As you can see, line 3979 calls through the unlink pointer in the i_op member of the directory's inode, which points to the concrete file system's implementation.

Now look at the definition of the inode structure:

/*
 * Keep mostly read-only and often accessed (especially for
 * the RCU path lookup and 'stat' data) fields at the beginning
 * of the 'struct inode'
 */
struct inode {
        umode_t                 i_mode;
        ...
        const struct inode_operations   *i_op;
        struct super_block      *i_sb;

        /* Stat data, not accessed from path walking */
        unsigned long           i_ino;
        ...
#ifdef CONFIG_FSNOTIFY
        __u32                   i_fsnotify_mask; /* all events this inode cares about */
        struct fsnotify_mark_connector __rcu    *i_fsnotify_marks;
#endif

#if IS_ENABLED(CONFIG_FS_ENCRYPTION)
        struct fscrypt_info     *i_crypt_info;
#endif

        void                    *i_private; /* fs or device private pointer */
};

You can see that the i_op member in the inode instance above is an inode_operations structure pointer.

Now look at the definition of inode_operations:

struct inode_operations {
        struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
        ...
        int (*create) (struct inode *,struct dentry *, umode_t, bool);
        int (*link) (struct dentry *,struct inode *,struct dentry *);
        int (*unlink) (struct inode *,struct dentry *);
        int (*symlink) (struct inode *,struct dentry *,const char *);
        ...
} ____cacheline_aligned;

A file system that sits below the VFS and supports deletion implements unlink and plugs it into the kernel VFS through its inode_operations table; if the pointer is missing, vfs_unlink returns -EPERM, as seen above.

I'll skip hardware initialization after the bootloader, and the registration machinery that runs once the kernel has taken over the machine's resources, and go straight to what ends up registered with the VFS on this machine.

Taking ext4 as the file system, its unlink is registered like this:

...
/*
 * directories can handle most operations...
 */
const struct inode_operations ext4_dir_inode_operations = {
        ...
        .link           = ext4_link,
        .unlink         = ext4_unlink,
        .symlink        = ext4_symlink,
...
};

The function pointer is assigned in the ext4_dir_inode_operations instance.

Here is the ext4_unlink implementation:

static int ext4_unlink(struct inode *dir, struct dentry *dentry)
{
        int retval;
        struct inode *inode;
        struct buffer_head *bh; 
        struct ext4_dir_entry_2 *de; 
        handle_t *handle = NULL;

        if (unlikely(ext4_forced_shutdown(EXT4_SB(dir->i_sb))))
                return -EIO;

        trace_ext4_unlink_enter(dir, dentry);
        /* Initialize quotas before so that eventual writes go
         * in separate transaction */
        retval = dquot_initialize(dir);
        if (retval)
                return retval;
        retval = dquot_initialize(d_inode(dentry));
        if (retval)
                return retval;

        retval = -ENOENT;
        bh = ext4_find_entry(dir, &dentry->d_name, &de, NULL);
        if (IS_ERR(bh))
                return PTR_ERR(bh);
        if (!bh)
                goto end_unlink;

        inode = d_inode(dentry);

        retval = -EFSCORRUPTED;
        if (le32_to_cpu(de->inode) != inode->i_ino)
                goto end_unlink;

        handle = ext4_journal_start(dir, EXT4_HT_DIR,
                                    EXT4_DATA_TRANS_BLOCKS(dir->i_sb));
        if (IS_ERR(handle)) {
                retval = PTR_ERR(handle);
                handle = NULL;
                goto end_unlink;
        }    

        if (IS_DIRSYNC(dir))
                ext4_handle_sync(handle);

        if (inode->i_nlink == 0) { 
                ext4_warning_inode(inode, "Deleting file '%.*s' with no links",
                                   dentry->d_name.len, dentry->d_name.name);
                set_nlink(inode, 1);
        }
        retval = ext4_delete_entry(handle, dir, de, bh);
        if (retval)
                goto end_unlink;
        dir->i_ctime = dir->i_mtime = current_time(dir);
        ext4_update_dx_flag(dir);
        ext4_mark_inode_dirty(handle, dir);
        drop_nlink(inode);
        if (!inode->i_nlink)
                ext4_orphan_add(handle, inode);
        inode->i_ctime = current_time(inode);
        ext4_mark_inode_dirty(handle, inode);

end_unlink:
        brelse(bh);
        if (handle)
                ext4_journal_stop(handle);
        trace_ext4_unlink_exit(dentry, retval);
        return retval;
}

Look at the implementation of d_inode:

static inline struct inode *d_inode(const struct dentry *dentry)
{
	return dentry->d_inode;
}

d_inode(dentry) extracts the inode pointer from the dentry structure, which is defined as follows:

struct dentry {
	/* RCU lookup touched fields */
	...
	struct qstr d_name;
	struct inode *d_inode;		/* Where the name belongs to - NULL is ... */
	...
	union {
		struct hlist_node d_alias;	/* inode alias list */
		struct hlist_bl_node d_in_lookup_hash;	/* only for in-lookup ones */
	 	struct rcu_head d_rcu;
	} d_u;
};

At the dentry level, the entry is not simply erased from the disk on the spot. For performance, ext4 caches directory data: the change is made in memory, the inode is marked dirty, and it is written back to storage later according to the sync policy.

I won't go into detail about the mechanisms underneath the VFS, because I don't know them well enough myself.
