Zhao Jiong, Linux Kernel Complete Annotation (kernel 0.11), revised edition V3.0

Operating system layers: hardware, operating system kernel, operating system services, user applications.

  • Operating system services: X window system, shell command interpretation system, kernel programming interface…
  • Operating system kernel: Abstraction and access scheduling of hardware resources.

5.1 Kernel Mode

Operating system kernel mode:

  • Monolithic kernel mode (used by Linux): compact code structure and fast execution, but weak modularity
  • Microkernel mode

In monolithic kernel mode, an operating system service flows as follows: application program — [pass parameters, invoke int 0x80] — [CPU: switch from user mode to kernel mode] — system call service routine — low-level support functions — [perform the specific function] — [CPU: switch from kernel mode back to user mode] — application program.

5.2 Linux Kernel Architecture

Main modules of the Linux kernel:

  1. Process scheduling: Controls the CPU resources used by processes.
  • Memory management: controls processes' use of memory resources. Virtual memory lets processes use more memory than physically exists.
  • File system: controls processes' use of external storage and peripheral resources. The virtual file system provides a common file interface over all external storage, hiding hardware device details.
  4. Interprocess communication
  5. The network interface


5.3 Memory Management and Usage by the Linux Kernel

  1. Physical memory distribution
  2. Segmentation, paging
  3. CPU multi-task operation and protection mode
  4. Virtual address, linear address, physical address

5.3.1 Physical Memory

Intel 80386 and later CPUs provide two types of memory management (address translation) mechanisms: memory segmentation and paging.

5.3.2 Memory Address Space

Three types of addresses:

  1. Virtual and logical addresses of programs (processes);
  2. Linear address of CPU;
  3. Actual physical memory address;

An Intel 80X86 CPU can index 16384 selectors. If the maximum length of each segment is 4GB, the maximum virtual address space is 64TB (16384 × 4GB = 64TB). In the Linux 0.11 kernel, each program (process) is allotted a virtual address space with a total capacity of 64MB.

5.3.3 Memory Segmentation Mechanism

Jiaming.blog.csdn.net/article/det…

Logical address – [segmentation mechanism] – linear address.

  • GDT(GDTR)
  • IDT (IDTR): stored in the kernel code segment.
  • LDT (LDTR): each task's local descriptor table LDT is itself a memory segment defined by a descriptor in the GDT; it stores the code segment and data segment descriptors of the corresponding task. The LDT segment is therefore very short: a length of 24 bytes (three 8-byte descriptors) is usually sufficient. The task state segment TSS of each task is also a memory segment defined by a descriptor in the GDT; its length need only be enough to hold one TSS data structure.

Task switching: the CPU saves the registers and other state of the outgoing task into that task's TSS segment, and at the same time loads the register values stored in the incoming task's TSS segment to restore its execution environment. The fourth descriptor in the GDT (the syscall descriptor entry) is not used in the Linux 0.11 kernel.
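
As an illustration of this GDT layout, here is a minimal sketch of how the kernel computes the GDT selector of task n's TSS and LDT and loads them into TR and LDTR (the macro names follow include/linux/sched.h in Linux 0.11; treat the exact bodies as a sketch):

// GDT entries 0-3: null, kernel cs, kernel ds, unused (syscall); then TSS0, LDT0, TSS1, LDT1, ...
#define FIRST_TSS_ENTRY 4
#define FIRST_LDT_ENTRY (FIRST_TSS_ENTRY+1)
// Each descriptor is 8 bytes and each task occupies two entries (16 bytes),
// so task n's selectors sit at a fixed offset plus n*16.
#define _TSS(n) ((((unsigned long) n)<<4)+(FIRST_TSS_ENTRY<<3))
#define _LDT(n) ((((unsigned long) n)<<4)+(FIRST_LDT_ENTRY<<3))
#define ltr(n)  __asm__("ltr %%ax"::"a" (_TSS(n)))   // load task register with task n's TSS selector
#define lldt(n) __asm__("lldt %%ax"::"a" (_LDT(n)))  // load LDT register with task n's LDT selector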

5.3.4 Memory paging Management

Jiaming.blog.csdn.net/article/det…

Linear address – [paging mechanism] – physical memory

  • For Linux 0.11, the kernel sets the maximum number of segment descriptors in the global descriptor table GDT to 256: two are idle, two are used by the system, and each task uses two. The system can therefore support at most (256-4)/2 = 126 tasks, giving a virtual address range of (256-4)/2 × 64MB = 8GB; the kernel, however, artificially limits the number of tasks to 64 (NR_TASKS). The logical address range of each task is 64MB, and each task starts in the linear address space at (task number) × 64MB (see the sketch after this list).
  • In fact, the instruction space (I) and data space (D) for all tasks in Linux0.11 share a single memory segment, meaning that all code, data, and stack parts of a process are in the same memory segment.
  • The code and data segments of tasks 0 and 1 lie within the 640KB range starting at linear address 0.
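
A minimal sketch of the per-task linear layout just described (the constant matches the 0.11 kernel; the helper function is purely illustrative):

#define TASK_SIZE 0x4000000UL             /* 64MB of linear space per task */

unsigned long task_linear_base(int nr)    /* illustrative helper */
{
	return nr * TASK_SIZE;            /* task 1 -> 64MB, task 2 -> 128MB, ... */
}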

The concept of code and data segments in the logical address space of a process is not the same as the concept of code and data segments in the CPU segmentation mechanism.

  • The code and data segments in the process's logical address space are the code area, the initialized and uninitialized data areas, and the stack area, laid out in order in the process's logical space by the compiler and by the operating system when it loads the program.
  • The segment concept of the CPU segmentation mechanism determines a segment's purpose in the linear address space and the constraints and limits on executing or accessing it. Each segment can be set anywhere in the 4GB linear address space; segments can be independent of one another or overlap.
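
To make the paging step concrete, the following sketch shows how the 80386 splits a 32-bit linear address across its two-level page tables (the layout is architecture-defined; the function name is illustrative):

/* linear address = 10-bit directory index | 10-bit page table index | 12-bit offset */
void split_linear(unsigned long linear,
		  unsigned long *dir, unsigned long *tbl, unsigned long *off)
{
	*dir = linear >> 22;            /* selects the page directory entry */
	*tbl = (linear >> 12) & 0x3ff;  /* selects the entry in that page table */
	*off = linear & 0xfff;          /* byte offset within the 4KB page */
}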

5.3.5 CPU Multitasking and Protection Modes

  • Kernel code and data are shared by all tasks.
  • Each task has its own code and data areas and cannot access those of other tasks.
  • When a process executes a system call and traps into kernel code (kernel run time), the process's kernel-mode stack is used.

5.3.6 Relationships between Virtual Addresses, Linear Addresses, and Physical Addresses

Addresses of kernel code and data

  • The initialization in the head.s program sets both the kernel code segment and the kernel data segment to a length of 16MB, overlapping each other;
  • They contain all kernel code, the kernel segment tables (GDT, IDT, TSS), the page directory table, the secondary page tables, local data, and the kernel temporary stack (later used as the user-mode stack of task 0);

  • By default, the Linux 0.11 kernel manages up to 16MB of physical memory, i.e. 4096 page frames of 4KB each. A machine with 4MB (or even 2MB) of physical memory can run a Linux 0.11 system perfectly well; the 4-16MB address range is then mapped to nonexistent physical memory but is simply never used. If the machine has more than 16MB of memory, the memory above 16MB is left unused, a limit imposed during the initialization phase.
  1. Kernel code and data segment addresses are the same in the linear address space as in the physical address space, which helps simplify kernel initialization.
  2. All tasks except task 0 require page-mapping operations in the main memory area.

Address mapping of task 0

  • Task 0 is the first task, started manually by the system;
  • The length of its code segment and data segment is 640KB;
  • Its code and data lie within the kernel code and data area (0-640KB), so no memory pages need to be allocated anew;
  • The task state segment TSS0 is likewise fixed, located inside the task 0 data structure;
  • Its kernel-mode stack and user-mode stack spaces are also in the kernel code area;

Address mapping for task 1

  • fork() allocates a page of memory in the main memory area for task 1's secondary page table;
  • The page directory entries and the secondary page table entries of the parent process (task 0) are copied;
  • Linear address space: 64MB-128MB;
  • Its linear addresses are mapped to physical addresses 0-640KB;
  • A page of memory is allocated in the main memory area to hold task 1's task data structure and kernel stack space. The task data structure (PCB) includes the TSS segment structure of task 1.

  • The user-mode stack space of task 1 shares the user-mode stack space of task 0.
  • When task 1 was created at the beginning, the user-mode stack of task 0 was shared with task 1. However, when task 1 started to run, the page table entries mapped from task 1 to user-mode stack were set as read-only, so that task 1 would cause a page write exception when performing stack operations. The kernel allocates main memory pages as stack space.

Address mapping for other tasks

  • For processes created from task 2, their parent is the init(task 1) process;
  • 64 processes can exist simultaneously in Linux0.11;
  • Starting from task 2, each task starts at nr*64MB;
  • The maximum length of the task code segment and data segment is 64MB;
  • Execute the execve() function in Task 2 to execute the shell program;
  • When the code for task 2 calls the execve() system call to start executing the shell program, the system call frees the page directory and page table entries copied from task 1, along with the corresponding memory pages, and then resets the associated page directory and page table entries for the newly executed shell program.
  • Although task 2 is allocated 64MB in the linear address space, the kernel does not immediately allocate and map physical memory pages for it.
  • Physical pages are loaded on demand: only when a page fault (page-missing exception) occurs does the memory manager allocate a page of physical memory and load the needed content, as sketched below.
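
A compile-only sketch of the demand-loading decision, modeled loosely on the logic of do_no_page() in mm/memory.c (the stub declarations are hypothetical simplifications, not the real kernel interfaces):

/* Hypothetical stubs standing in for the real kernel structures and helpers. */
struct task { unsigned long start_code, end_data; int executable; };
extern struct task *current;
extern void get_empty_page(unsigned long address);           /* map a free page frame */
extern void load_page_from_executable(unsigned long address);

void demand_page(unsigned long address)   /* called on a "page not present" fault */
{
	unsigned long off;

	address &= 0xfffff000;            /* page-align the faulting address */
	off = address - current->start_code;
	if (!current->executable || off >= current->end_data)
		get_empty_page(address);  /* anonymous page: just map a free frame */
	else
		load_page_from_executable(address);  /* read the missing code/data page from disk */
}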

5.3.7 User Dynamic Memory Allocation

  • A user program requests memory with malloc(); the dynamically requested memory capacity is managed by the C library function malloc().
  • Because the kernel already gives each process a 64MB linear address space, the kernel handles page faults caused by malloc'ed addresses just like any others.
  • Calling malloc() does not itself map a physical memory page for the newly requested area; the kernel maps a physical page only when the program accesses an address for which no physical page yet exists.
  • free(): the C library marks the memory block as free so the application can allocate it again. During this process the kernel does not release the physical pages it has allocated to the process; only when the process finally exits does the kernel reclaim all physical memory pages allocated and mapped into the process's address space.
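
A small user-space example of these semantics (the allocation size is arbitrary):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	char *buf = malloc(1 << 20);  /* reserve 1MB: no physical pages are mapped yet */
	if (!buf)
		return 1;
	buf[0] = 'x';                 /* first write faults; the kernel maps one physical page */
	free(buf);                    /* the C library marks the block free for reuse; the kernel
	                                 keeps the mapped physical pages until the process exits */
	return 0;
}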

5.4 Interrupt Mechanism in Linux

5.4.1 Interrupt Operation Principle

  1. Polling.
  2. Interrupt: device raises an interrupt request (IRQ) — interrupt service routine handles it

The Programmable Interrupt Controller (PIC) is the manager of device interrupt requests. The hardware interrupt handling flow:

  • The PIC collects interrupt service requests, compares their priorities, selects the highest-priority request for processing, and also decides whether to preempt the request currently being serviced;
  • The PIC sends an interrupt signal to the processor's INT pin; the processor immediately stops what it is doing and asks the PIC which interrupt service routine it should execute;
  • The PIC tells the processor which interrupt service routine to run by placing the interrupt number corresponding to the request on the data bus;
  • Using the interrupt number just read, the processor looks up the interrupt vector table to obtain the interrupt vector of the device concerned and starts executing the interrupt service routine;
  • When the interrupt service routine finishes, the processor resumes the program that was interrupted by the interrupt signal.

Software interrupt: using the int instruction with an operand giving the interrupt number, a program can make the processor perform the corresponding interrupt handling;

5.4.2 Interrupt subsystem of 80X86 microcomputer

Jiaming.blog.csdn.net/article/det…

  • Each 8259A chip can manage 8 interrupt sources;
  • Through multi-chip cascading, 8259As can form a system managing up to 64 interrupt vectors.

For example:

  • The INT pin of the slave chip is connected to the IR2 pin of the master chip; that is, the interrupt signal sent by the slave 8259A appears as the IRQ2 signal of the master 8259A;

5.4.3 Interrupt Vector Table (Interrupt Descriptor Table)

  • The 80X86 microcomputer supports 256 interrupts, and each interrupt needs a corresponding interrupt service routine;
  • In 80X86 real mode, each interrupt vector consists of 4 bytes, giving the segment value and the offset within the segment;
  • Location of interrupt vector N in memory: 0x0000:N*4; that is, the entry address of the corresponding interrupt service routine is stored at physical address N*4;
  • In a Linux system, apart from using the display and disk-read interrupt functions provided by the BIOS at the very start of kernel loading, the 8259A chip is re-initialized in the setup.s program, and a new interrupt vector (descriptor) table is set up in the head.s program before the kernel runs normally; the interrupt service functions provided by the BIOS are abandoned completely;
  • When an Intel CPU runs in 32-bit protected mode, interrupts and exceptions are managed through the interrupt descriptor table IDT, which is the replacement for the interrupt vector table;
  • The Linux operating system works in 80X86 protected mode, so it uses the interrupt descriptor table to set and store the gate information of each interrupt.
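
For reference, a sketch of the 8-byte 80386 gate descriptor used by IDT entries (the field layout is architecture-defined; the struct name is illustrative):

struct idt_gate {
	unsigned short offset_low;    /* handler offset, bits 0-15 */
	unsigned short selector;      /* code segment selector of the handler */
	unsigned char  reserved;      /* unused */
	unsigned char  type_attr;     /* present bit, DPL, gate type (0xE interrupt, 0xF trap) */
	unsigned short offset_high;   /* handler offset, bits 16-31 */
};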

5.4.4 Linux kernel interrupt Processing

Interrupt signal:

  1. Hardware interrupt
    1. Maskable interrupt: CPU pin INTR
    2. Non-maskable interrupt: CPU pin NMI
  2. Software interrupt
    1. Fault: the CPU re-executes the instruction that caused the error
    2. Trap: the CPU continues with the instruction that follows
    3. Abort: the program that caused the error should be terminated
  • Each interrupt is identified by a number between 0 and 255;
  • int 0 - int 31 are reserved by Intel; each serves a fixed exception (software interrupt) function;
  • int 32 - int 255 can be set by the user;

In Linux, int 32 - int 47 correspond to the hardware interrupt request signals IRQ0 - IRQ15 issued by the 8259A interrupt controller chips, and the system-call interrupt issued by programs is set to int 0x80 — the sole interface through which user programs use operating system resources.

  • During system initialization, the kernel first points all 256 entries of the interrupt descriptor table at a default handler set up in the head.s program; otherwise a general protection fault could occur;
  • The Linux kernel uses interrupt gate and trap gate descriptors when setting up the interrupt descriptor table IDT. The two descriptor types differ in their effect on the interrupt-enable flag IF in the flags register EFLAGS:
    • An interrupt executed through an interrupt gate descriptor resets the IF flag (IF=0: the handler itself cannot be interrupted). The interrupt-return instruction iret restores the IF flag from the stack.
    • Interrupts executed through trap gates do not affect the IF flag.
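
The kernel's gate-setting macros make this distinction visible; a sketch following the forms used in include/asm/system.h of Linux 0.11:

// Gate type 14 = interrupt gate (clears IF on entry), 15 = trap gate (leaves IF alone).
// DPL 3 on the "system" gate lets user-mode code invoke it; Linux 0.11 uses such a
// gate for the int 0x80 system-call entry.
#define set_intr_gate(n,addr)   _set_gate(&idt[n],14,0,addr)
#define set_trap_gate(n,addr)   _set_gate(&idt[n],15,0,addr)
#define set_system_gate(n,addr) _set_gate(&idt[n],15,3,addr)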

5.4.5 The Interrupt Flag of the Flags Register

  • To keep race conditions and interrupts from interfering with critical code regions, the cli and sti instructions are used in many places in the Linux 0.11 kernel code;
  • cli: IF=0, so the CPU does not respond to external interrupts;
  • sti: IF=1, allowing the CPU to recognize and respond to interrupts sent by external devices.
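
A minimal sketch of such a critical region (the macro bodies match include/asm/system.h; the protected variable is illustrative):

#define cli() __asm__ ("cli"::)   /* IF=0: ignore maskable external interrupts */
#define sti() __asm__ ("sti"::)   /* IF=1: respond to external interrupts again */

extern int shared_count;          /* illustrative shared kernel data */

void critical_update(void)
{
	cli();                    /* enter the critical region */
	shared_count++;           /* cannot be torn apart by a hardware interrupt */
	sti();                    /* leave the critical region */
}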

5.5 Linux system calls

5.5.1 System call Interface

  • System calls (syscalls) are the sole interface between the Linux kernel and upper-level application programs;
  • A user program can use kernel resources by invoking int 0x80 directly or indirectly (via a library function) and placing the system call number in the EAX register;
  • Applications typically use kernel system calls indirectly, through C library functions with standard interface definitions;

  • A system call is called in the form of a function, which can take one or more arguments.

  • The result of a system call is given by its return value: a negative value indicates an error and 0 indicates success.

  • The error type code is stored in the global variable errno. By calling the library function perror(), we can print out the error string information corresponding to the error code.

  • In the Linux kernel, each system call has a unique system call number, defined in include/unistd.h. These numbers are in fact the index values of entries in sys_call_table[], the array of system-call handler pointers defined in include/linux/sys.h;

  • System call handler names generally begin with the prefix sys_.

5.5.2 System Call Processing Procedure

  • A system call is executed when the application issues an interrupt call int 0x80 to the kernel via a library function;
  • The system call number is stored in register EAX, and the parameters carried are stored in registers EBX, ECX and EDX successively.
  • The procedure handling the system-call interrupt int 0x80 is system_call in the kernel/system_call.s program.
  • The kernel source code defines the macro function _syscalln() in include/unistd.h, where n represents the number of arguments carried, which can be 0 to 3 respectively. If you need to pass a chunk of data to the kernel, you can pass a pointer to that chunk of data.
// For the read() system call, the definitions are (include/unistd.h):
#define __NR_read	3
int read(int fd, char *buf, int n);

// To execute the system call directly in a user program, instantiate the
// system call macro like this:
#define __LIBRARY__
#include <unistd.h>

_syscall3(int, read, int, fd, char *, buf, int, n)

// We can now call read() directly in the user program, executing the
// system call without going through the C library.

Each system call macro given in include/unistd.h takes 2+2n arguments: the first corresponds to the type of the system call's return value, the second is the name of the system call, and these are followed by the type and name of each parameter carried by the system call. The macro expands into a C function containing embedded assembly statements;

// unistd.h
#define _syscall3(type,name,atype,a,btype,b,ctype,c) \
type name(atype a,btype b,ctype c) \
{ \
long __res; \
__asm__ volatile ("int $0x80" \
	: "=a" (__res) \
	: "0" (__NR_##name),"b" ((long)(a)),"c" ((long)(b)),"d" ((long)(c))); \
if (__res>=0) \
	return (type) __res; \
errno=-__res; \
return -1; \
}
int read(int fd, char *buf, int n)
{
	long __res;
	__asm__ volatile (
		"int $0x80"
		:"=a" (__res)	// eax holds the return value: the actual number of bytes read
		:"0" (__NR_read), "b" ((long)(fd)), "c" ((long)(buf)), "d" ((long)(n)));	// system call number; arguments in ebx, ecx, edx
	if (__res >= 0)
		return (int) __res;
	errno = -__res;	// on a read error, put the error number into the global variable errno
	return -1;	// and return -1 to the caller
}

As you can see, the macro expands into a concrete implementation of the read() system call.

On entering the system call handler in kernel/system_call.s, the system_call code first checks whether the system call number in EAX is within the valid range, and then calls the corresponding system call handler through the sys_call_table[] function pointer table:

call _sys_call_table(,%eax,4)

This assembly operand indirectly calls the function whose address is _sys_call_table + %eax * 4. Since each pointer in sys_call_table[] occupies 4 bytes, the system call number must be multiplied by 4; the resulting value is used to fetch the address of the handler function to call from the table.

5.5.3 System Call Parameter Transfer Mode

  1. Passing via the general registers;
  2. Via the system call gate, which automatically copies the passed parameters between the process's user stack and kernel stack;

5.6 System time and Timing

In order for the operating system to provide accurate current time and date information automatically, PC/AT microcomputer systems include a battery-backed real-time clock (RTC) circuit. This circuitry is usually integrated on one chip together with a small amount of CMOS RAM holding system configuration information, so it is called the RT/CMOS RAM circuit.

At initialization, the Linux 0.11 kernel reads the current time and date from this chip through the time_init() function in init/main.c; the kernel_mktime() function in the kernel/mktime.c program converts it into the number of seconds elapsed since 0:00 on January 1, 1970 — the system startup time, startup_time. A user program can obtain the time derived from startup_time by calling time().

By combining the system tick count jiffies, counted from system startup, with startup_time, a program can uniquely determine the current time at any moment while running.
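
A sketch of that relationship (HZ and the variable names follow the 0.11 kernel, where the macro CURRENT_TIME is defined as startup_time + jiffies/HZ):

#define HZ 100                    /* one tick every 10 ms */
extern long startup_time;         /* seconds since 1970-01-01, read from the RTC at boot */
extern unsigned long jiffies;     /* ticks since system startup */

long current_time(void)           /* current calendar time, in seconds since 1970 */
{
	return startup_time + jiffies / HZ;
}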

5.6.2 System Timing

The programmable interval timer chip Intel 8253 (8254) is set to issue a clock interrupt request (IRQ0) signal every 10 milliseconds. This time beat is the pulse of the operating system, known as the system tick. After every clock tick, the system calls the clock interrupt handler.

The clock interrupt handler timer_interrupt accumulates, mainly through the jiffies variable, the number of clock ticks that have passed since the system started. The jiffies value increases by 1 on each clock interrupt, and the C function do_timer() is then called for further processing.

The do_timer() function adds the tick to the current process's running time according to the privilege level: if CPL=0, the process was interrupted while running in kernel mode and the kernel-mode running time stime is incremented; otherwise the user-mode running time utime is incremented.

The time slice is the amount of CPU time, in ticks, that a process may continue to run before being switched out. If the process's time slice value is still greater than 0 after decrementing, the slice is not used up, so do_timer() simply exits and the current process keeps running. If the time slice has reached 0, the process has used up its CPU time slice, and the handler decides what to do based on the privilege level of the interrupted code. If the interrupted process was working in user mode, do_timer() calls the scheduler schedule() to switch to another program; if it was working in kernel mode, i.e. it was interrupted while running a kernel program, do_timer() exits immediately. This approach ensures that a process running in kernel mode is never switched out by the scheduler: a process is not preemptible while running kernel code, but can be preempted while running in user mode (see the sketch below).
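
A simplified sketch of this logic, following do_timer() in kernel/sched.c (timer-list handling and other details elided; the trimmed task_struct is only for illustration):

struct task_struct { long counter, utime, stime; /* ...fields trimmed... */ };
extern struct task_struct *current;
extern void schedule(void);

void do_timer(long cpl)           /* cpl: privilege level of the interrupted code */
{
	if (cpl)
		current->utime++;     /* interrupted in user mode */
	else
		current->stime++;     /* interrupted in kernel mode */
	if ((--current->counter) > 0)
		return;               /* time slice not used up: keep running */
	current->counter = 0;
	if (!cpl)
		return;               /* kernel mode is never preempted */
	schedule();                   /* user mode: pick another process to run */
}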

In the Linux 0.11 kernel there can be at most 64 timers at a time (see kernel/sched.c).

5.7 Linux Process Control

A program is an executable file, while a process is an instance of a program in execution. Using time-sharing technology, multiple processes can run concurrently on a Linux operating system. In the Linux 0.11 kernel the system can hold at most 64 processes at the same time. Except for the first process, which is created manually, all others are created by existing processes using the system call fork(). A process consists of executable instruction code, data, and a stack area. The code and data parts of a process correspond to the code and data sections of an executable file. Each process can only execute its own code and access its own data and stack areas; communication between processes must go through system calls.

A process can execute in user mode or kernel mode, and use separate kernel-mode stack and user-mode stack respectively. The user-mode stack is used by the process to temporarily store the parameters of the called function, local variables and other data in user mode. The kernel stack contains information about function calls made by the kernel program.

5.7.1 Task Data Structure

The kernel manages processes through a process table, in which each process has one entry. In Linux the entry is a task_struct task structure pointer. The task data structure is defined in the header file include/linux/sched.h.

// P72
struct task_struct {
/* these are hardcoded - don't touch */
	long state;														/* -1 unrunnable, 0 runnable (ready), >0 stopped */
	long counter;													// Task run time count, decrement
	long priority;													// The higher the priority, the longer the operation
	long signal;													// Bitmap, where each bit represents a signal
	struct sigaction sigaction[32];									// Signal execution attributes: the action and flag information for each signal
	long blocked;	/* bitmap of masked signals */					// Process signal masking code
/* various fields */
	int exit_code;													// Exit code used by the parent process after the task is stopped
	unsigned long start_code,end_code,end_data,brk,start_stack;		// Code segment address, code segment length, code segment + data length, total length, stack segment address
	long pid,father,pgrp,session,leader;							// Process ID, parent process ID, process group ID, session ID, and session leader
	unsigned short uid,euid,suid;									// User ID, valid user ID, and saved user ID
	unsigned short gid,egid,sgid;									// Group ID, valid group ID, and saved group ID
	long alarm;														// Alarm timer (tick number)
	long utime,stime,cutime,cstime,start_time;						// User state running time, system state running time, child user state running time, child system state running time, child start time
	unsigned short used_math;										// flag whether a coprocessor is used
/* file system info */
	int tty;		/* -1 if no tty, so it must be signed */		// The process uses the subdevice number of the TTY terminal
	unsigned short umask;											// File create attribute mask bit
	struct m_inode * pwd;											// Current working directory I node structure pointer
	struct m_inode * root;											// root directory I node structure pointer
	struct m_inode * executable;									// execute file I node structure pointer
	unsigned long close_on_exec;									// Close file handle bitmap flags when executing
	struct file * filp[NR_OPEN];									// File structure pointer table
/* ldt for this task 0 - zero 1 - cs 2 - ds&ss */
	struct desc_struct ldt[3];										// Local descriptor table
/* tss for this task */				
	struct tss_struct tss;											// The task status segment information structure of the process
};

5.7.2 Process Running Status

The process state is stored in the state field of the process task structure. In Linux, states such as sleep are divided into interruptible and non-interruptible wait states.

  • Interruptible sleep state: when the system generates an interrupt, releases a resource the process is waiting for, or delivers a signal, the process can be woken up into the ready state.
  • Uninterruptible sleep state: the process can move to the ready state only when explicitly woken with wake_up(). This state is used when the process must wait without being disturbed or when the awaited event will occur quickly.
  • Stopped state: a process enters the stopped state when it receives a SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU signal, and returns to the running state on receiving SIGCONT. In Linux 0.11 transitions to this state are not implemented, and a process in this state is treated as terminated.
  • Zombie state: a process is in the zombie state when it has stopped running but its parent has not yet called wait() to inquire about its status.
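
The corresponding state constants, as defined in include/linux/sched.h of Linux 0.11:

#define TASK_RUNNING          0   /* running, or ready to run */
#define TASK_INTERRUPTIBLE    1   /* interruptible sleep */
#define TASK_UNINTERRUPTIBLE  2   /* uninterruptible sleep */
#define TASK_ZOMBIE           3   /* terminated, waiting for the parent's wait() */
#define TASK_STOPPED          4   /* stopped (not fully implemented in 0.11) */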

When a process's time slice runs out, the system uses the scheduler to force a switch to another process. Process switching itself is performed only in kernel mode, but a process running in kernel mode cannot be preempted by another process, and one process cannot directly change the state of another. To avoid corrupting kernel data when switching processes, the kernel disables all interrupts while executing critical-region code.

5.7.3 Initializing processes

After all the initialization in init/main.c finishes, every part of the system is in a runnable state. The program then manually moves itself to task (process) 0, and uses a fork() call to create process 1 for the first time. In process 1 it continues to initialize the application environment and executes the shell login program, while the original process 0 is scheduled to run whenever the system is idle; task 0 only ever executes the pause() system call, which in turn calls the scheduling function.

Moving to task 0 is performed by the macro move_to_user_mode() (include/asm/system.h). It moves the main.c execution flow from kernel mode into task 0 in user mode, where execution continues. Before the move, the system first sets up task 0's running environment in the scheduler initialization function sched_init(). This includes manually presetting the values of each field of the task 0 data structure, adding the task state segment descriptor and local descriptor table segment descriptor of task 0 to the global descriptor table, and loading them into the task register TR and the local descriptor table register LDTR respectively.

Kernel initialization is a special process. The code segment and data segment of task 0 are contained within the kernel code segment and data segment respectively, and the kernel initializer main.c is also code within task 0; before the move to task 0, however, the system runs main.c at kernel privilege level 0. The function of the macro move_to_user_mode() is to change the privilege level from level 0 in kernel mode to level 3 in user mode while continuing to execute the original code instruction flow.

To move to task 0, the macro move_to_user_mode() uses an interrupt-return instruction to cause the privilege-level change. Transferring control this way is dictated by the CPU protection mechanism: the CPU allows low-privilege code (privilege level 3) to call or transfer into higher-privilege code through call gates, interrupt gates, or trap gates, but not the reverse. So the kernel simulates an iret return to low-privilege (user-mode) code. The main idea is to construct on the stack the contents the interrupt-return instruction expects, setting the segment selector of the return address to the task 0 code segment selector, whose privilege level is 3. The iret instruction executed afterwards then causes the CPU to jump from privilege level 0 to privilege level 3.

When the iret instruction executes, the CPU loads the return address into CS:EIP and pops the flag register contents from the stack. Because the CPU determines that the privilege level of the destination code segment is 3, as opposed to the current kernel level 0, it also pops the stack segment selector and stack pointer from the stack into SS:ESP. The values of the segment registers DS, ES, FS, and GS become invalid due to the privilege-level change, so the CPU clears them to zero; they therefore need to be reloaded after the iret instruction executes. From then on the system runs task 0's code at privilege level 3. The user stack used is the same stack used before the move, while the kernel-mode stack is set to start at the top of the page holding task 0's data structure (PAGE_SIZE+(long)&init_task). Because the task data structure of task 0, including its user stack pointer, will later be copied when a new process is created, the user stack of task 0 must stay clean until task 1 has been created.
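
The macro itself is short; a sketch along the lines of include/asm/system.h (the stack frame is built by hand, then iret "returns" to the very next instruction at privilege level 3):

#define move_to_user_mode() \
__asm__ ("movl %%esp,%%eax\n\t" \
	"pushl $0x17\n\t"          /* original SS: task 0 stack segment selector */ \
	"pushl %%eax\n\t"          /* original ESP: keep using the same stack */ \
	"pushfl\n\t"               /* original EFLAGS */ \
	"pushl $0x0f\n\t"          /* original CS: task 0 code segment selector */ \
	"pushl $1f\n\t"            /* original EIP: the label just below */ \
	"iret\n"                   /* "return" to privilege level 3 */ \
	"1:\tmovl $0x17,%%eax\n\t" \
	"movw %%ax,%%ds\n\t"       /* reload the data segment registers, which */ \
	"movw %%ax,%%es\n\t"       /* the privilege-level change invalidated */ \
	"movw %%ax,%%fs\n\t" \
	"movw %%ax,%%gs" \
	:::"ax")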

5.7.4 Creating a Process

The fork() system call is used to create new processes in Linux. All processes are created by copying process 0 and are descendants of process 0.

During the creation of a new process, the system first finds an empty item in the task array not yet used by any process; if 64 processes are already running, the fork() system call returns an error because no empty slot is available in the task array table. The system then requests a page of memory in the main memory area for the new process's task data structure, and copies the entire task data structure of the current process as a template for the new one. To prevent the half-initialized new process from being run by the scheduler, its state is immediately set to the uninterruptible wait state.

The copied task data structure is then modified: the current process is set as the new process's parent, the signal bitmap is cleared, the statistics are reset, and the initial time slice value is set to 15 system ticks (150ms). tss.esp0 is set to the top of the memory page holding the new process's task data structure, and the kernel stack segment tss.ss0 is set to the kernel data segment selector; tss.ldt is set to the index of the task's local descriptor table descriptor in the GDT. If the current process uses a coprocessor, the coprocessor's full state is also saved in the new process's tss.i387 structure. The system then sets the code and data segment base addresses and limits for the new task and copies the page tables managed for the current process by the paging mechanism. Note that no actual physical memory pages are assigned to the new process at this point; it shares the memory pages of its parent. Only when either the parent or the new process performs a memory write does the system allocate a separate memory page for the writing process (copy-on-write). If the parent process has files open, each file's open count is incremented by 1. The TSS and LDT descriptor entries for the new task are then set in the GDT, with the base addresses pointing to the TSS and LDT in the new process's task structure. Finally, the new task is set to the runnable state and the new process number is returned.

Creating a new child process and loading and running an executable file are two different concepts. When a child process is created, it is an exact copy of the parent's code and data areas and executes the child's part of that code. When a program on a block device is to be executed, the exec() system call is typically invoked in the child process; on entering exec(), the child's original code and data areas are cleared (freed). When the child then starts running the new program, the CPU immediately raises a page-fault exception because the kernel has not yet loaded the code from the block device; the memory manager then loads the corresponding page of code from the block device, and the CPU re-executes the instruction that caused the exception. Only at that point does the new program's code actually begin to execute.
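
A minimal user-space illustration of this fork-then-exec pattern (paths and arguments are arbitrary):

#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
	pid_t pid = fork();        /* child starts as a copy-on-write image of the parent */
	if (pid == 0) {
		char *argv[] = { "/bin/sh", (char *) 0 };
		char *envp[] = { (char *) 0 };
		execve("/bin/sh", argv, envp);  /* discard the copied image, load a new program */
		_exit(1);          /* reached only if execve() failed */
	}
	wait((int *) 0);           /* parent waits for the child to terminate */
	return 0;
}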

5.7.5 Process Scheduling

Selecting the next process to run: this selection mechanism is the basis of a multitasking operating system. The scheduler is the management code that allocates CPU running time among all runnable processes. Linux processes are preemptible: a process can be preempted while it is running in user mode, but not while it is running in kernel mode.

Scheduler (based on priority queuing scheduling policy)

The schedule() function scans the task array and compares the remaining time slice (counter) of each ready task to determine which process has run the least so far (i.e. has the largest counter value); the task-switching macro function is then used to switch to that process.

If all processes in the ready state have used up their time slices, the system recalculates the time slice value counter of every process in the system (including sleeping ones) according to each process's priority value:


counter = counter/2 + priority

Sleeping processes therefore have relatively large counter values when they wake up. The schedule() function then rescans all ready processes in the task array, repeating the procedure until one process is selected, and finally calls switch_to() to perform the actual process switch. If no other process is runnable, process 0 is selected to run. In Linux 0.11, process 0 calls pause() to put itself into the interruptible sleep state and calls schedule() again; whenever the system is idle, process 0 is scheduled to run. (A sketch of this loop follows.)
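
A simplified sketch of this selection loop, following schedule() in kernel/sched.c (bookkeeping trimmed; the task_struct here is only a stand-in):

#define NR_TASKS 64
#define TASK_RUNNING 0
struct task_struct { long state, counter, priority; /* ...fields trimmed... */ };
extern struct task_struct *task[NR_TASKS];   /* task[0] is the idle task */
extern void switch_to(int n);

void schedule(void)
{
	int i, next, c;

	while (1) {
		c = -1; next = 0;                    /* default to task 0 (idle) */
		for (i = NR_TASKS - 1; i > 0; --i)
			if (task[i] && task[i]->state == TASK_RUNNING
			    && task[i]->counter > c)
				c = task[i]->counter, next = i;
		if (c)
			break;                       /* found a task with time left, or none ready */
		for (i = NR_TASKS - 1; i > 0; --i)   /* all slices used up: recompute */
			if (task[i])
				task[i]->counter = (task[i]->counter >> 1)
						 + task[i]->priority;
	}
	switch_to(next);                             /* perform the actual task switch */
}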

Process switching

switch_to() is defined in include/asm/system.h. This macro replaces the CPU's current process state with the state of the new process. Before switching, switch_to() checks whether the process being switched to is already the current process; if so, it does nothing and exits. Otherwise it sets the current variable to the new task's pointer and then performs a long jump to the address of the new task's TSS, causing the CPU to carry out a task switch. At that point the CPU saves the state of all its registers into the TSS structure of the current task's data structure, pointed to by the TSS segment selector in the task register TR, and then restores into the CPU the register information held in the TSS structure of the new task's data structure. The system then officially begins running the newly switched-in task.

5.7.6 Stopping a Process

When a process finishes running or terminates in mid-execution, the kernel needs to release system resources that the process occupies. This includes files opened while the process is running, memory requested, and so on.

When a user program calls the exit() system call, the kernel function do_exit() executes. This function first frees the memory pages occupied by the process's code and data segments and closes all files the process has open, and releases the i-nodes of the process's current working directory, root directory, and running executable. If the process has children, it makes the init process the parent of all of them. If the process is a session leader and has a controlling terminal, it releases the controlling terminal and sends the hangup signal SIGHUP to all processes in the session, which will usually terminate all processes of that session. It then sets the process state to zombie and sends its parent a SIGCHLD signal, notifying the parent that a child process has terminated. Finally, do_exit() calls the scheduling function to run another process; the terminated process's task data structure remains in place for its parent.

While a child process executes, the parent usually waits for it to terminate using wait() or waitpid(). When the awaited child terminates and enters the zombie state, the parent adds the time used by the child to its own accounting, then frees the memory page occupied by the terminated child's task data structure and clears the pointer slot the child occupied in the task array. (An example follows.)
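
A short user-space example of reaping a terminated (zombie) child (the exit code is arbitrary):

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	int status;
	pid_t pid = fork();
	if (pid == 0)
		_exit(42);                /* child terminates and becomes a zombie */
	waitpid(pid, &status, 0);         /* parent reaps it: the kernel can now free the
	                                     child's task structure and task-array slot */
	if (WIFEXITED(status))
		printf("child exited with %d\n", WEXITSTATUS(status));
	return 0;
}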

5.8 How to Use the Stack in Linux

There are four types of stack used in Linux 0.11:

  1. A temporary stack used during system boot initialization;
  2. A stack used by kernel programs for initialization after entering protected mode, located at a fixed position in the kernel code address space; this stack later also serves as the user-mode stack of task 0;
  3. Kernel-state stack for tasks: the stack used by each task to execute kernel programs through system calls;
  4. The user-mode stack of the task is located near the end of the logical address space of the task.

Why are there so many stack types?

  1. Moving from real mode to protected mode changes how the CPU addresses memory, so the stack area needs to be set up anew;
  2. To solve the protection problem of sharing stacks across CPU privilege levels: when a task enters kernel mode (e.g. via a system call), it uses the privilege-level-0 stack pointer given in its TSS segment, i.e. its kernel-mode stack, and the original user stack pointer is saved on that kernel stack; when it returns from kernel mode to user mode, it switches back to using the user-mode stack.

5.8.1 Initialization Phase

Boot initialization (bootsect.s, setup.s)

When the bootsect.s code is loaded by the ROM BIOS bootstrap into physical memory at 0x7c00 and begins to run, no stack has been set up. Only after bootsect.s moves itself to 0x9000:0 does it set the stack segment register SS to 0x9000 and the stack pointer SP register to 0xFF00.

When entering protected mode (head.s)

The following address locations can be found in the system.map file generated when the kernel is compiled.

From the head.s program onward, the system officially runs in protected mode. At this point the stack segment is set to the kernel data segment (selector 0x10), and the stack pointer ESP is set to point to the top of the user_stack array, reserving one page of memory (4KB) for use as the stack. The user_stack array is defined in sched.c.

Initialization phase (main.c)

In init/main.c, this stack is used until the move_to_user_mode() code executes and transfers control to task 0. After move_to_user_mode(), the main.c code runs as task 0. By executing the fork() system call, init() in main.c runs in task 1 and uses task 1's stack, while main() itself, now task 0, still uses the kernel program's own stack as task 0's user-mode stack.

5.8.2 Task stack

Each task has two stacks, used respectively for user-mode and kernel-mode program execution. Apart from residing at different CPU privilege levels, the main difference between the two is size: the task's kernel-mode stack is small, holding at most (4096 minus the size of the task data structure) bytes of data, whereas the task's user-mode stack can grow within the user's 64MB space.

When running in user mode

Each task (except tasks 0 and 1) has its own 64MB address space, and when a task is first created its user-mode stack pointer is set near the end of that address space. The very end actually holds the executed program's arguments and environment variables, with the user stack space just below them. This stack is always used while the application runs in user mode, and the actual physical memory backing it is determined by the CPU paging mechanism. Since Linux implements copy-on-write, after a process is created, as long as neither it nor its parent writes to the stack, the two share the physical memory page backing the same stack; only when one of them performs a stack write does the kernel memory manager allocate a new memory page for the writing process. The user stacks of processes 0 and 1 are special (see below).

When running in kernel mode

Each task has its own kernel-mode stack, used while the task executes kernel code. Its position in the linear address space is specified by the SS0 and ESP0 fields in the task's TSS segment: SS0 is the segment selector of the task's kernel stack and ESP0 is its initial (bottom) stack pointer. Consequently, whenever a task enters kernel code from user code, the task's kernel-mode stack always starts out empty. The task's kernel stack is located at the end of the page holding its task data structure, i.e. on the same page as the task data structure itself. This is set by fork() (kernel/fork.c) in the kernel-level stack fields of the new task's TSS when the task is created:

p->tss.esp0 = PAGE_SIZE+(long)p;
p->tss.ss0 = 0x10;

Here p is the task data structure pointer of the new task and tss is the task state segment structure within it. The kernel allocates one page of memory for the new task to hold its task_struct data; the tss segment structure is one of the fields of task_struct. The task's kernel stack segment value tss.ss0 is set to 0x10 (the kernel data segment selector), and tss.esp0 is set to point to the end of the page holding the task_struct, i.e. one byte past the page. This is fine because, when an Intel CPU performs a stack push, it first decrements the stack pointer ESP and then stores the value.

Why can a page of memory requested from the main memory area to hold the task data structure also be addressed as data in the kernel data segment?

This is because the kernel-mode stack still belongs to the kernel data space. At the end of the head.s program, the descriptors of the kernel code segment and data segment are set with a segment length of 16MB, the maximum physical memory size supported by the Linux 0.11 kernel. Kernel code can therefore address anywhere in the entire physical memory range, including the main memory area. Whenever a task needs to use its kernel stack to execute a kernel program, the CPU uses the TSS structure to set up the stack from tss.ss0 and tss.esp0; the old task's kernel stack pointer ESP0 is not saved when tasks are switched, because to the CPU these two values are read-only. So every time a task enters kernel mode, its kernel-mode stack is always empty.

Stack for task 0 and task 1

The stacks for task 0 (idle process) and task 1 (init process) are special.

Task 0 and Task 1 have the same code and data segments, and both have a limit of 640KB, but they are mapped to different linear address ranges. Task 0’s segment base address starts at linear address 0, and Task 1’s segment base address starts at 64MB, but they are all mapped to physical addresses in the 0-640KB range. This address range is where the kernel code and basic data are stored.

After move_to_user_mode() executes, the kernel-mode stacks of task 0 and task 1 lie at the end of the page holding their respective task data structures, while task 0's user-mode stack is the stack used earlier after entering protected mode, i.e. the location of the user_stack[] array in sched.c.

Since task 1 copied task 0's user stack when it was created, task 0 and task 1 initially share the same user stack space. When task 1 starts to run, however, the page table entry mapping user_stack[] is set read-only, so a stack write by task 1 raises a write-page fault; the kernel then uses the copy-on-write mechanism to allocate another main-memory page for task 1 to use as its stack space. Only from then on does task 1 have its own separate user stack memory page. Task 0's stack must therefore stay "clean" until task 1 actually starts using its own page; that is, task 0 must not use the stack in the meantime, to ensure the copied stack page contains no data from task 0.

Task 0's kernel-mode stack is specified in its manually pre-set task data structure, while its user-mode stack is set up by the stack constructed before the simulated iret in move_to_user_mode(). We know that when a control transfer involves a privilege-level change, the target code uses the stack of the new privilege level, and the old privilege level's stack pointer is kept on the new stack. Therefore task 0's user stack pointer is first pushed onto the stack currently in use at privilege level 0, the code pointer is pushed after it, and then the iret instruction is executed to transfer control from the privilege-level-0 code to the task 0 code at privilege level 3.

In this manually built stack frame, the original ESP value is set to the same position within user_stack, and the original SS segment selector is set to 0x17, the user-mode stack segment selector in the local table LDT. The task 0 code segment selector 0x0F is then pushed onto the stack as the original CS, and the pointer to the next instruction is pushed as the original EIP. Executing the iret instruction thus "returns" into the code of task 0, which continues execution at privilege level 3.

5.8.3 Switching between task kernel mode stack and user mode stack

In the Linux 0.11 system, all interrupt service routines are kernel code. If an interrupt arrives while a task is executing user code, the interrupt causes the CPU privilege level to change from level 3 to level 0, and the CPU switches from the user-mode stack to the task's kernel-mode stack. The CPU obtains the new stack's segment selector and offset from the current task's task state segment TSS: since the interrupt service routine is privilege-level-0 kernel code, the kernel-mode stack pointer is taken from the SS0 and ESP0 fields of the TSS. Having located the new stack, the CPU pushes the old user-mode stack pointers SS and ESP onto the kernel-mode stack, then pushes the contents of the flag register EFLAGS and the return location CS and EIP. (The user stack address is stored on the kernel-mode stack so the eventual return can be made correctly.)

A kernel system call is a software interrupt, so when a task invokes a system call it enters the kernel and executes the corresponding interrupt service code there; that kernel code then operates using the task's kernel-mode stack. Similarly, since entering a kernel program changes the privilege level, the user-mode stack segment and stack pointer, together with EFLAGS, are saved on the task's kernel-mode stack; when iret exits the kernel program and returns to the user program, the user stack and EFLAGS are restored. If a task is already running in kernel mode when the CPU responds to an interrupt, no stack switching is needed: the kernel code the task is running is already using the kernel-mode stack and no privilege change is involved, so the CPU simply pushes EFLAGS and the interrupt return pointers CS and EIP onto the current kernel-mode stack and then executes the interrupt service routine.

5.9 File Systems used in Linux 0.11

A device that stores a file system is a file system device. Files stored according to certain rules on a disk constitute a file system. The file system supported by the Linux 0.11 kernel is the MINIX 1.0 file system; today the most widely used file systems on Linux systems are ext2 and ext3.

A Linux 0.11 system can run from as little as two floppy disks: a bootimage disk and a rootimage disk. Bootimage is the boot image file, mainly containing the disk boot sector code, the operating system loader, and the kernel executable code. Rootimage is the root file system that provides the kernel with basic support. Together, these two disks are the equivalent of a bootable DOS operating system disk.

When the Linux boot disk loads the root file system, it loads it from the device specified by the root file system device number stored in bytes 509 and 510 of the boot sector on the boot disk.

5.10 Directory structure of Linux kernel source code

5.10.1 Kernel Home Directory Linux

It includes 14 subdirectories and the Makefile, the parameter configuration file for the compilation helper tool make. The main purpose of the make tool is to determine automatically, in a system with many source files, which files need to be recompiled, by identifying which files have been modified.

The Makefile also makes nested calls to makefiles contained in all subdirectories.

5.10.2 Booting the Boot program Directory Boot

It contains three assembly language files, the first programs compiled in the kernel source. Their main functions are to boot-load the kernel when the computer starts, load the kernel code into memory, and do some system initialization before entering 32-bit protected mode.

  • bootsect.s: compiled with as86. The disk boot block (boot sector) program.
  • setup.s: compiled with as86. Mainly reads the machine's hardware configuration parameters.
  • head.s: compiled with GNU as. Linked at the front of the system module; performs hardware detection and sets up memory-management paging.

5.10.3 fs

Version 1.0 of the MINIX file system is used; Linux was originally cross-compiled and developed under MINIX. MINIX uses single-threaded file system processing, while Linux uses multi-threaded processing. Linux programs must therefore deal with the contention and deadlocks that multiple threads can cause, so the Linux file system code is much more complex than MINIX's.

5.10.4 include

5.10.5 Kernel initializer directory init

Contains one file, main.c, which performs all kernel initialization, then moves to user mode, creates a new process, and runs the shell program on the console device. The program first allocates buffer memory according to how much memory the machine has; if a virtual (RAM) disk is also configured, space is left for it behind the buffer. After all hardware initialization, including manually creating the first task (task 0) and setting the interrupt-enable flag, execution moves from kernel mode to user mode. The system then calls the process creation function fork() for the first time to create a process for running init(); in that child process the system sets up the console environment and spawns another child process to run the shell program.

5.10.6 Home Directory of the Kernel Program Kernel

It contains 12 code files, a Makefile, and three subdirectories. Code for all tasks lives in the kernel/ directory, including functions such as fork() and exit(), the scheduler, and some system calls.

5.10.7 Kernel Function Directory Lib

Unlike ordinary programs, kernel code cannot use the standard C library and other libraries, mainly because the full C library is large. Therefore the kernel source has a dedicated lib/ directory providing the functions the kernel needs. The kernel function library also provides call support for the processes of the kernel initializer init/main.c running in user mode (tasks 0 and 1 only); it is implemented in the same way as an ordinary static library.

It contains 12 .c files.

5.10.8 Memory Manager Directory MM

It mainly manages the programs' use of the main memory area: it implements the mapping from a process's logical addresses to linear addresses and from linear addresses to physical addresses in main memory, and, through the memory paging mechanism, establishes the correspondence between a process's virtual memory pages and the physical memory pages of the main memory area.

The Linux kernel handles memory with both segmentation and paging. First, the 386's 4GB virtual address space is divided into 64 segments of 64MB each; all kernel programs occupy the first segment, whose physical addresses coincide with the segment's linear addresses. Each task is then assigned one segment to use. The paging mechanism maps specified physical memory pages into the segments, detects duplicate pages created by fork(), and implements the copy-on-write mechanism.

5.10.9 Compiling kernel Tools Directory Tools

The build.c program is used to merge and link the object code generated by separately compiling the Linux directories into a runnable kernel image file, Image.

5.11 Relation between kernel system and application program

  • System call interface int 0x80
  • Development environment library functions or kernel library functions (the kernel library serves only tasks 0 and 1, and ultimately also invokes system calls)

The kernel really provides only one unified interface to all user program processes: system calls. System calls are meant mainly for system software programming and for implementing library functions, while programs developed by ordinary users access kernel resources by calling functions in libraries such as libc; such library functions make up what is usually called the application programming interface (API).

System calls are the highest layer of the kernel's interface to the outside world. In the kernel, each system call has a sequence number and is implemented in the form of a macro function. Applications should not use system calls directly, as that reduces their portability.

Library functions generally include user-level functions that provide higher-level services C itself does not, such as input/output and string manipulation functions. Some library functions are simply enhanced versions of system calls: the standard I/O library functions fopen and fclose provide functionality similar to the system calls open and close, but at a higher level. System calls generally offer slightly better performance than library functions, but library functions provide more functionality and more error detection.
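
A small illustration of the two levels (the file name is arbitrary):

#include <fcntl.h>    /* open(): a thin wrapper over the system call */
#include <stdio.h>    /* fopen(): buffered, library-level I/O built on open() */
#include <unistd.h>

int main(void)
{
	int fd = open("data.txt", O_RDONLY);  /* system-call level: raw file descriptor */
	if (fd >= 0)
		close(fd);

	FILE *fp = fopen("data.txt", "r");    /* library level: adds buffering and richer errors */
	if (fp)
		fclose(fp);
	return 0;
}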

P192

#
# if you want the ram-disk device, define this to be the
# size in blocks.
#
RAMDISK = #-DRAMDISK=512

AS86	=as86 -0 -a
LD86	=ld86 -0

AS	=gas
LD	=gld
LDFLAGS	=-s -x -M
CC	=gcc $(RAMDISK)
CFLAGS	=-Wall -O -fstrength-reduce -fomit-frame-pointer \
-fcombine-regs -mstring-insns
CPP	=cpp -nostdinc -Iinclude

#
# ROOT_DEV specifies the default root-device when making the image.
# This can be either FLOPPY, /dev/xxxx or empty, in which case the
# default of /dev/hd6 is used by 'build'.
#
ROOT_DEV=/dev/hd6

ARCHIVES=kernel/kernel.o mm/mm.o fs/fs.o
DRIVERS =kernel/blk_drv/blk_drv.a kernel/chr_drv/chr_drv.a
MATH	=kernel/math/math.a
LIBS	=lib/lib.a

.c.s:
	$(CC) $(CFLAGS) \
	-nostdinc -Iinclude -S -o $*.s $<
.s.o:
	$(AS) -c -o $*.o $<
.c.o:
	$(CC) $(CFLAGS) \
	-nostdinc -Iinclude -c -o $*.o $<

all:	Image

Image: boot/bootsect boot/setup tools/system tools/build
	tools/build boot/bootsect boot/setup tools/system $(ROOT_DEV) > Image
	sync

disk: Image
	dd bs=8192 if=Image of=/dev/PS0

tools/build: tools/build.c
	$(CC) $(CFLAGS) \
	-o tools/build tools/build.c

boot/head.o: boot/head.s

tools/system:	boot/head.o init/main.o \
		$(ARCHIVES) $(DRIVERS) $(MATH) $(LIBS)
	$(LD) $(LDFLAGS) boot/head.o init/main.o \
	$(ARCHIVES) \
	$(DRIVERS) \
	$(MATH) \
	$(LIBS) \
	-o tools/system > System.map

kernel/math/math.a:
	(cd kernel/math; make)

kernel/blk_drv/blk_drv.a:
	(cd kernel/blk_drv; make)

kernel/chr_drv/chr_drv.a:
	(cd kernel/chr_drv; make)

kernel/kernel.o:
	(cd kernel; make)

mm/mm.o:
	(cd mm; make)

fs/fs.o:
	(cd fs; make)

lib/lib.a:
	(cd lib; make)

boot/setup: boot/setup.s
	$(AS86) -o boot/setup.o boot/setup.s
	$(LD86) -s -o boot/setup boot/setup.o

boot/bootsect:	boot/bootsect.s
	$(AS86) -o boot/bootsect.o boot/bootsect.s
	$(LD86) -s -o boot/bootsect boot/bootsect.o

tmp.s:	boot/bootsect.s tools/system
	(echo -n "SYSSIZE = ("; ls -l tools/system | grep system \ | cut -c25- 31 | tr '\ 012' ' '; echo "+ 15) / 16") > tmp.s
	cat boot/bootsect.s >> tmp.s

clean:
	rm -f Image System.map tmp_make core boot/bootsect boot/setup
	rm -f init/*.o tools/system tools/build boot/*.o
	(cd mm;make clean)
	(cd fs;make clean)
	(cd kernel;make clean)
	(cd lib;make clean)

backup: clean
	(cd .. ; tar cf - linux | compress - > backup.Z)
	sync

dep:
	sed '/\#\#\# Dependencies/q' < Makefile > tmp_make
	(for i in init/*.c;do echo -n "init/";$(CPP) -M $$i;done) >> tmp_make
	cp tmp_make Makefile
	(cd fs; make dep)
	(cd kernel; make dep)
	(cd mm; make dep)

### Dependencies:
init/main.o : init/main.c include/unistd.h include/sys/stat.h \
  include/sys/types.h include/sys/times.h include/sys/utsname.h \
  include/utime.h include/time.h include/linux/tty.h include/termios.h \
  include/linux/sched.h include/linux/head.h include/linux/fs.h \
  include/linux/mm.h include/signal.h include/asm/system.h include/asm/io.h \
  include/stddef.h include/stdarg.h include/fcntl.h