This series of blog posts is essentially a set of reading notes for openEuler Operating System and Modern Operating System Principles and Implementation.

Most of the referenced code was dug out of openEuler by the author, following the pointers in the books. To keep the listings short, the referenced code has been trimmed as much as possible.

Other references:

What exactly does execve() do? (CHENGonghao's blog, CSDN)

ELF file loading process, Linux process management and scheduling (13) (oskernellab's blog, CSDN)

Process control primitives

Process control refers to how the OS creates and destroys processes and moves them from one state to another.

The process control primitives include create, destroy, block, and wake up.

The OS controls processes through control primitives. Each control primitive is a piece of code that stays resident in memory, runs in kernel mode, and is exposed to user programs through system calls.

It is called a primitive because the execution of this code is atomic: it cannot be interrupted while it runs (which can be achieved by disabling interrupts). The OS typically does not allow primitives to run concurrently, which avoids the PCB data corruption that interleaved instructions could cause.

Starting a new program in Linux takes two steps. The first step is to fork a new process from the current process, and the second step is to use exec to load the new program and start it running in the new process.

Create a new process: fork()

Usage:

#include <sys/types.h>
#include <unistd.h>

pid_t fork(void);

The new process created by fork is a copy of the original process. The two processes are identical except for a few attributes such as the PID, and each has its own virtual address space.

When forking, the operating system needs to do the following:

  1. Create a new PCB and initialize it
  2. Copy the CPU context of the parent PCB to the PCB of the child, and the parent and child have the same execution environment
  3. Allocate physical memory for the new process

Create and copy the PCB

The code that allocates the PCB is as follows:

static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
{
	struct task_struct *tsk;
	unsigned long *stack;
	struct vm_struct *stack_vm_area;
	int err;
    
	tsk = alloc_task_struct_node(node);	// Allocate the PCB (task_struct)

	// Create the kernel stack
	stack = alloc_thread_stack_node(tsk, node);
	...
	stack_vm_area = task_stack_vm_area(tsk);
	// Copy the PCB
	err = arch_dup_task_struct(tsk, orig);
	tsk->stack = stack;
	...
	return tsk;
}
  • There are two definitions of alloc_task_struct_node. One is defined in kernel/fork.c as follows

    static inline struct task_struct *alloc_task_struct_node(int node)
    {
    	return kmem_cache_alloc_node(task_struct_cachep, GFP_KERNEL, node);
    }

    The other is in arch/ia64/include/asm/thread_info.h, defined as a function-like macro; it is omitted here.

  • The PCB copy is defined in arch/arm64/kernel/process.c:

    int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
    {
    	if (current->mm)
    		fpsimd_preserve_current_state();
    	*dst = *src;	// Just copy it
    	BUILD_BUG_ON(!IS_ENABLED(CONFIG_THREAD_INFO_IN_TASK));
    	dst->thread.sve_state = NULL;
    	clear_tsk_thread_flag(dst, TIF_SVE);
    	return 0;
    }

Copy the CPU Context

int copy_thread(unsigned long clone_flags, unsigned long stack_start,
		unsigned long stk_sz, struct task_struct *p)
{
    // pt_regs holds the user register state saved on entry to kernel mode from user space
    // The p argument is the task_struct of the new process
	struct pt_regs *childregs = task_pt_regs(p);

    // Clear the kernel-mode register context of the new process
	memset(&p->thread.cpu_context, 0, sizeof(struct cpu_context));

	fpsimd_flush_task_state(p);

	if (likely(!(p->flags & PF_KTHREAD))) {
		*childregs = *current_pt_regs();	// Copy the current registers to the new process
		childregs->regs[0] = 0;		// arm64 returns values in x0, so the child sees fork() return 0

		*task_user_tls(p) = read_sysreg(tpidr_el0);

		if (stack_start) {			// If the user stack start address is set
			if (is_a32_compat_thread(task_thread_info(p)))
				childregs->compat_sp = stack_start;
			else
				childregs->sp = stack_start;
		}
		...
	} else {
		...
	}
	p->thread.cpu_context.pc = (unsigned long)ret_from_fork;
	p->thread.cpu_context.sp = (unsigned long)childregs;

	ptrace_hw_copy_thread(p);

	return 0;
}

pt_regs: arch/arm64/include/asm/ptrace.h

pt_regs defines how register state is stored on the stack during an exception.

struct pt_regs {
	union {
		struct user_pt_regs user_regs;
		struct {
			u64 regs[31];
			u64 sp;
			u64 pc;
			u64 pstate;
		};
	};
	u64 orig_x0;
	u64 orig_addr_limit;
	...
};

Copy address space

fork() gives the new process an address space with the same contents as the parent's, using copy-on-write: the new process copies the parent's page tables directly, so both processes point to the same physical memory, and the shared physical pages are marked read-only (by modifying the page table entries, PTEs). When either side tries to write to that memory, a page fault is triggered; the OS makes a copy of the page and updates the page table to point to the new page. When the fault handler returns, the program can write to the new page.

Arm64 supports four levels of page tables: Page Global Directory (PGD), Page Upper Directory (PUD), Page Middle Directory (PMD), and Page Table Entry (PTE). Each PTE corresponds to one page. Copying the page tables is a nested loop over these four levels, handled by copy_page_range, copy_pud_range, copy_pmd_range, and copy_pte_range respectively, with copy_one_pte copying a single entry. The generic code also has a P4D level between PGD and PUD, handled by copy_p4d_range.

The copy functions are defined in mm/memory.c:

int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm, struct vm_area_struct *vma)
{
	pgd_t *src_pgd, *dst_pgd;
	unsigned long next;
	unsigned long addr = vma->vm_start;
	unsigned long end = vma->vm_end;
	bool is_cow;
	...
	// Check whether this is a copy-on-write mapping; if so, it gets extra handling below
	is_cow = is_cow_mapping(vma->vm_flags);
	if (is_cow)
		...
	do {
		...
		if (unlikely(copy_p4d_range(dst_mm, src_mm, dst_pgd, src_pgd,
					    vma, addr, next))) {
			ret = -ENOMEM;
			break;
		}
	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
	...
	return ret;
}
...
static inline int copy_p4d_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
		pgd_t *dst_pgd, pgd_t *src_pgd, struct vm_area_struct *vma,
		unsigned long addr, unsigned long end)
{
	...
	do {
		...
		if (copy_pud_range(dst_mm, src_mm, dst_p4d, src_p4d,
						vma, addr, next))
			return -ENOMEM;
	} while (dst_p4d++, src_p4d++, addr = next, addr != end);
	return 0;
}
...
// Copy one task's vm_area to another task
static inline unsigned long
copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
		pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
		unsigned long addr, int *rss)
{
	...
	// Copy-on-write: write-protect the page in both parent and child
	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
		ptep_set_wrprotect(src_mm, addr, src_pte);
		pte = pte_wrprotect(pte);
	}
	// For shared memory
	if (vm_flags & VM_SHARED)
		pte = pte_mkclean(pte);
	...
	return 0;
}

Determining and handling the cause of a page fault

A page fault can have many causes, and the functions that handle them are defined in mm/memory.c.

static vm_fault_t handle_pte_fault(struct vm_fault *vmf) handles faults at the PTE level.

static vm_fault_t wp_page_copy(struct vm_fault *vmf) copies the page.

Run the new program: the exec function family

exec refers to a group of functions that load and run a program.

The exec family loads the binary file into the process's address space, replacing the original program, allocates a new user stack, makes minor changes to the PCB, and then starts executing the new program's instructions.

The exec family consists of execl, execlp, execle, execv, execvp, and execve. Only the last of these is a system call; the rest are library functions that end up calling execve after a layer of wrapping. The system call is also the most full-featured of them.

execve

int execve(const char *path, char *const argv[], char *const envp[]);

Parameters:

  • path: the path of the executable file
  • argv[]: the arguments of the process, passed to the new program's main(int argc, char *argv[])
  • envp[]: the environment variables (the program's main function can be extended to main(int argc, char *argv[], char *envp[]), and envp[] is passed in there)

Find the program file and load it

Find files

To run a file, the kernel first has to find it. It resolves the path name to locate the file's inode object, which tells it where the file is stored. The kernel then creates a file object and fills it with the file path, the inode object, and the open mode. From then on, the kernel accesses the open file directly through the file object.

Path lookup and parsing are implemented in fs/namei.c by the path_init and link_path_walk functions, respectively. Opening the file and filling in the file structure is done by alloc_file in fs/file_table.c.

Loader

Once execve enters the kernel, the function actually called is do_execve, which triggers the following chain of calls:

/* fs/exec.c */
// This is the system call entry point; a chain of calls follows
SYSCALL_DEFINE3(execve,
		const char __user *, filename,
		const char __user *const __user *, argv,
		const char __user *const __user *, envp)
{
	return do_execve(getname(filename), argv, envp);
}

int do_execve(struct filename *filename,
	const char __user *const __user *__argv,
	const char __user *const __user *__envp)
{
	struct user_arg_ptr argv = { .ptr.native = __argv };
	struct user_arg_ptr envp = { .ptr.native = __envp };
	return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
}

static int do_execveat_common(int fd, struct filename *filename,
			      struct user_arg_ptr argv,
			      struct user_arg_ptr envp,
			      int flags)
{
	return __do_execve_file(fd, filename, argv, envp, flags, NULL);
}

static int __do_execve_file(int fd, struct filename *filename,
			    struct user_arg_ptr argv,
			    struct user_arg_ptr envp,
			    int flags, struct file *file)
{
    ...
    // The omitted code above opens the ELF file and loads its information into bprm
    retval = exec_binprm(bprm);
    ...
}

static int exec_binprm(struct linux_binprm *bprm)
{
    ...
    ret = search_binary_handler(bprm);	// Find a handler that can parse the ELF file
    ...
    return ret;
}

/* With thanks to CHENG Jian */

// The list of executable formats supported by Linux is scanned here, so that each format's handler gets a chance to claim and process the file
// If the format matches, the handler pointed to by the load_binary function pointer processes the target image
int search_binary_handler(struct linux_binprm *bprm)
{
	...
	list_for_each_entry(fmt, &formats, lh) {
		retval = fmt->load_binary(bprm);
	}
	...
}

// load_binary is called according to the binary format
// Its job is to build a new execution environment for the current process from the information stored in the executable file
// For ELF, this is of course load_elf_binary
/* fs/binfmt_elf.c */
static struct linux_binfmt elf_format = {
	.module		= THIS_MODULE,
    // Called to load an ELF binary
	.load_binary	= load_elf_binary,
    // Used to dynamically bind a shared library to an already running process, which is activated by the uselib() system call
	.load_shlib	= load_elf_library,
    // In a file named core, store the execution context of the current process. This file is typically created when the process receives a signal whose default operation is "dump", depending on the executable type of the program being executed
	.core_dump	= elf_core_dump,
	.min_coredump	= ELF_EXEC_PAGESIZE,
};

static int load_elf_binary(struct linux_binprm *bprm)
{
    // Variables whose purpose is mostly clear from their names
	unsigned long elf_entry;
	unsigned long interp_load_addr = 0;
	unsigned long start_code, end_code, start_data, end_data;
	unsigned long reloc_func_desc __maybe_unused = 0;
	int executable_stack = EXSTACK_DEFAULT;
    // Program headers
    struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
    
	struct pt_regs *regs = current_pt_regs();
	struct {
		struct elfhdr elf_ex;	// Save ELF Header
		struct elfhdr interp_elf_ex;
	} *loc;
	struct arch_elf_state arch_state = INIT_ARCH_ELF_STATE;
	loff_t pos;

    // The kernel can only request 128K of memory at one time
	loc = kmalloc(sizeof(*loc), GFP_KERNEL);
	if (!loc) {
		retval = -ENOMEM;
		goto out_ret;
	}
	
	/* Convert to ELF Header format */
	loc->elf_ex = *((struct elfhdr *)bprm->buf);
	retval = -ENOEXEC;

    /* Check the image type: ET_EXEC is an executable, ET_DYN is a shared object */
	if (loc->elf_ex.e_type != ET_EXEC && loc->elf_ex.e_type != ET_DYN)
		goto out;
    /* Check the architecture. Some other checks are omitted here */
	if (!elf_check_arch(&loc->elf_ex))
		goto out;
    
    /* Read the program headers from the ELF file; load_elf_phdrs allocates them with kmalloc and returns a pointer */
	elf_phdata = load_elf_phdrs(&loc->elf_ex, bprm->file);

    // This loop finds and processes the interpreter segment of the target image
	for (i = 0; i < loc->elf_ex.e_phnum; i++) {
		if (elf_ppnt->p_type == PT_INTERP) { // PT_INTERP is the interpreter segment type
            // Allocate space for the interpreter path
			elf_interpreter = kmalloc(elf_ppnt->p_filesz, GFP_KERNEL);
            // Read the interpreter path from the PT_INTERP segment into elf_interpreter
			retval = kernel_read(bprm->file, elf_interpreter,
					     elf_ppnt->p_filesz, &pos);
            // Open the interpreter file
			interpreter = open_exec(elf_interpreter);
			/* Get the exec headers */
			pos = 0;
			retval = kernel_read(interpreter, &loc->interp_elf_ex,
					     sizeof(loc->interp_elf_ex), &pos);
			break;
		}
		elf_ppnt++;
	}

	elf_ppnt = elf_phdata;

	/* Check the interpreter header */
	if (elf_interpreter) {
		/* Verify the interpreter has a valid arch */
		if (!elf_check_arch(&loc->interp_elf_ex) ||
		    elf_check_fdpic(&loc->interp_elf_ex))
			goto out_free_dentry;

		/* Load the interpreter header */
		interp_elf_phdata = load_elf_phdrs(&loc->interp_elf_ex,
						   interpreter);
		if (!interp_elf_phdata)
			goto out_free_dentry;

		/* Pass PT_LOPROC.. PT_HIPROC headers to arch code */
		elf_ppnt = interp_elf_phdata;
		for (i = 0; i < loc->interp_elf_ex.e_phnum; i++, elf_ppnt++)
			switch (elf_ppnt->p_type) {
			case PT_LOPROC ... PT_HIPROC:
				retval = arch_elf_pt_proc(&loc->interp_elf_ex,
							  elf_ppnt, interpreter,
							  true, &arch_state);
				if (retval)
					goto out_free_dentry;
				break;
			}
	}

	/* Flush the code inherited from the parent */
	retval = flush_old_exec(bprm);

	setup_new_exec(bprm);
	install_exec_creds(bprm);

	/* Do this so that we can load the interpreter, if need be. We will change some of these later */
	retval = setup_arg_pages(bprm, randomize_stack_top(STACK_TOP),
				 executable_stack);
    
	current->mm->start_stack = bprm->p;

	/* Map ELF segments into memory */
	for(i = 0, elf_ppnt = elf_phdata;
	    i < loc->elf_ex.e_phnum; i++, elf_ppnt++) {
		int elf_prot = 0, elf_flags, elf_fixed = MAP_FIXED_NOREPLACE;
		unsigned long k, vaddr;
		unsigned long total_size = 0;

        // Only PT_LOAD segments need to be loaded
		if (elf_ppnt->p_type != PT_LOAD)
			continue;

		if (unlikely (elf_brk > elf_bss)) {
            // Check the address and page information
			unsigned long nbyte;
	        // Set BRK to generate BSS
			retval = set_brk(elf_bss + load_bias,
					 elf_brk + load_bias,
					 bss_prot);
			nbyte = ELF_PAGEOFFSET(elf_bss);
			if (nbyte) {
                ...
			}
			elf_fixed = MAP_FIXED;
		}
		if (loc->elf_ex.e_type == ET_EXEC || load_addr_set) {
			elf_flags |= elf_fixed;
		} else if (loc->elf_ex.e_type == ET_DYN) {	// For ET_DYN images
			...
			if (elf_interpreter) {
				load_bias = ELF_ET_DYN_BASE;
				if (current->flags & PF_RANDOMIZE)
					load_bias += arch_mmap_rnd();
				elf_flags |= elf_fixed;
			} else
				load_bias = 0;

			load_bias = ELF_PAGESTART(load_bias - vaddr);

			total_size = total_mapping_size(elf_phdata,
							loc->elf_ex.e_phnum);
			if (!total_size) {
				retval = -EINVAL;
				goto out_free_dentry;
			}
		}
		// With the load address determined, map the ELF segment to build the user-space virtual address space
		error = elf_map(bprm->file, load_bias + vaddr, elf_ppnt,
				elf_prot, elf_flags, total_size);
		if (BAD_ADDR(error)) {
			retval = IS_ERR((void *)error) ?
				PTR_ERR((void*)error) : -EINVAL;
			goto out_free_dentry;
		}
		...
	}
	// Adjust each address by the load bias
	loc->elf_ex.e_entry += load_bias;	// Program entry address
	elf_bss += load_bias;		// BSS start address
	elf_brk += load_bias;
	start_code += load_bias;	// code start address
	end_code += load_bias;
	start_data += load_bias;	// data Start address
	end_data += load_bias;

    
	retval = set_brk(elf_bss, elf_brk, bss_prot);
	if (retval)
		goto out_free_dentry;
	if (likely(elf_bss != elf_brk) && unlikely(padzero(elf_bss))) {
		retval = -EFAULT;	/* Nobody gets to see this, but.. */
		goto out_free_dentry;
	}

	if (elf_interpreter) {
		unsigned long interp_map_addr = 0;

		elf_entry = load_elf_interp(&loc->interp_elf_ex,
					    interpreter,
					    &interp_map_addr,
					    load_bias, interp_elf_phdata);
		if (!IS_ERR((void *)elf_entry)) {
			/*
			 * load_elf_interp() returns relocation
			 * adjustment
			 */
			interp_load_addr = elf_entry;
			elf_entry += loc->interp_elf_ex.e_entry;
		}
		reloc_func_desc = interp_load_addr;

		allow_write_access(interpreter);
		fput(interpreter);
		kfree(elf_interpreter);
	} else {
		elf_entry = loc->elf_ex.e_entry;
	}

	kfree(interp_elf_phdata);
	kfree(elf_phdata);

	set_binfmt(&elf_format);

    // Set the stack further, such as auxiliary vectors, environment variables, program parameters, etc
	retval = create_elf_tables(bprm, &loc->elf_ex,
			  load_addr, interp_load_addr);
    
	/* N.B. passed_fileno might not be initialized? */
	current->mm->end_code = end_code;
	current->mm->start_code = start_code;
	current->mm->start_data = start_data;
	current->mm->end_data = end_data;
	current->mm->start_stack = bprm->p;

	if ((current->flags & PF_RANDOMIZE) && (randomize_va_space > 1)) {
		// For architectures with ELF randomization
		...
	}
	...
	finalize_exec(bprm);
	start_thread(regs, elf_entry, bprm->p);	// Write the entry address into the PC
	retval = 0;
    
out:
	kfree(loc);
out_ret:
	return retval;
	/* omit a bunch of error cleanup */
}

// Fill in the parameters of the target file, environment variables and other necessary information
static int
create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec,
		unsigned long load_addr, unsigned long interp_load_addr)
{
    /* Create the ELF interpreter info */
	elf_info = (elf_addr_t *)current->mm->saved_auxv;
	sp = (elf_addr_t __user *)bprm->p;  // sp points to the top of the user stack

	/* Now, let's put argc (and argv, envp if appropriate) on the stack */
	if (__put_user(argc, sp++)) // Push the number of parameters to the user stack
		return -EFAULT;

	/* Populate list of argv pointers back to argv strings. */
	p = current->mm->arg_end = current->mm->arg_start;
	while (argc-- > 0) {	// Parameters are pushed
		size_t len;
		if (__put_user((elf_addr_t)p, sp++))
			return -EFAULT;
		len = strnlen_user((void __user *)p, MAX_ARG_STRLEN);
		if (!len || len > MAX_ARG_STRLEN)
			return -EINVAL;
		p += len;
	}
	if (__put_user(0, sp++))
		return -EFAULT;
	current->mm->arg_end = p;

	/* Populate list of envp pointers back to envp strings. */
    // Environment variables are pushed
	current->mm->env_end = current->mm->env_start = p;
	while (envc-- > 0) {	// Environment variables are pushed one by one
		size_t len;
		if (__put_user((elf_addr_t)p, sp++))
			return -EFAULT;
		len = strnlen_user((void __user *)p, MAX_ARG_STRLEN);
		if (!len || len > MAX_ARG_STRLEN)
			return -EINVAL;
		p += len;
	}
	if (__put_user(0, sp++))
		return -EFAULT;
	current->mm->env_end = p;

	/* Put the elf_info on the stack in the right place. */
    // Copy the auxiliary vector onto the stack
	if (copy_to_user(sp, elf_info, ei_index * sizeof(elf_addr_t)))
		return -EFAULT;
	return 0;
}

Finally, in create_elf_tables, the kernel pushes the arguments, the environment variables, and the auxiliary vector onto the user stack. The auxiliary vector is a mechanism for passing information from the kernel to user space.

start_thread is defined in the architecture-specific processor.h, for example arch/arm64/include/asm/processor.h:

static inline void start_thread(struct pt_regs *regs, unsigned long pc,
				unsigned long sp)
{
	start_thread_common(regs, pc);	// Write the entry address into the PC
	regs->pstate = PSR_MODE_EL0t;	// Set the pstate register

	if (arm64_get_ssbd_state() != ARM64_SSBD_FORCE_ENABLE)
		set_ssbs_bit(regs);
	regs->sp = sp;			// Set the sp register
}

static inline void start_thread_common(struct pt_regs *regs, unsigned long pc)
{
	memset(regs, 0, sizeof(*regs));
	forget_syscall(regs);
	regs->pc = pc;					// Write the PC
	...
}

Once the program is loaded and the process returns to user mode, the values in the pt_regs structure are restored into the CPU registers and the CPU starts executing the code at the PC.