Linux 0.11 kernel source code analysis 3: the main function initializes the interrupt descriptor table

Basic knowledge

If you know DPL, RPL, CPL, skip this paragraph and skip to the next section 😂

To understand these concepts, you need to know what segment descriptors, selectors and other data structures are.

DPL is the DPL field in a segment descriptor.

RPL is the RPL field in a selector.

CPL is the RPL of the selector in the segment register CS, which is used to indicate which state the CPU is currently running in.
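As a minimal sketch of where RPL (and, for CS, CPL) lives, the C snippet below splits a 16-bit selector into its three fields and packs them back. The names selector_fields, decode_selector and make_selector are made up for illustration; only the bit layout (RPL in bits 0-1, TI in bit 2, index in bits 3-15) comes from the x86 architecture.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: the three fields packed into a selector. */
struct selector_fields {
    unsigned index; /* bits 3-15: descriptor index in the GDT or LDT */
    unsigned ti;    /* bit 2: 0 = GDT, 1 = LDT */
    unsigned rpl;   /* bits 0-1: requested privilege level */
};

/* Split a selector into index, table indicator and RPL. */
static struct selector_fields decode_selector(uint16_t sel)
{
    struct selector_fields f;
    f.index = sel >> 3;
    f.ti = (sel >> 2) & 1u;
    f.rpl = sel & 3u;
    return f;
}

/* Pack the three fields back into a 16-bit selector. */
static uint16_t make_selector(unsigned index, unsigned ti, unsigned rpl)
{
    assert(index < 8192 && ti <= 1 && rpl <= 3);
    return (uint16_t)((index << 3) | (ti << 2) | rpl);
}
```

With this layout, 0x08 decodes to index 1, TI 0, RPL 0, and 0x17 decodes to index 2, TI 1, RPL 3.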

The Linux operating system uses only two privilege levels: user mode (privilege level 3) and kernel mode (privilege level 0). User mode and kernel mode describe the state of the CPU, that is, whether the CPU is currently running at privilege level 3 or at privilege level 0. A user process enters kernel mode when it is temporarily suspended by an internal or external interrupt: the kernel's interrupt routine saves its context, and a piece of kernel code then starts executing.

In protected mode, a segment register does not hold a segment base address but a selector. The selector is essentially an index into the descriptor table, and the segment descriptor is found through it. To prevent illegal references to memory segments, the processor performs the following checks:

First, the selector is checked against the limit of the descriptor table to make sure the descriptor it refers to is not out of bounds.
Next, the segment type is checked. The segment descriptor has a type field that indicates the kind of segment, since different segments serve different purposes. This check mainly verifies that the segment type matches the purpose of the segment register being loaded. The broad rules are as follows.
- Only segments with the executable attribute (code segments) can be loaded into the CS segment register.
- Execute-only segments (code segments that are not readable) cannot be loaded into any segment register other than CS.
- Only segments with the writable attribute (data segments) can be loaded into the SS stack segment register.
- Segments that are at least readable can be loaded into the DS, ES, FS, and GS segment registers.
Finally, the processor checks whether the segment is present. The CPU uses the P bit in the segment descriptor to confirm that the memory segment exists. If P is 1, the segment is present: the selector can be loaded into the segment register, the segment descriptor cache register is updated with the contents of the selected descriptor, and the processor then sets the A bit in the descriptor to 1 to mark the segment as accessed. If P is 0, the memory segment is not present. A possible cause is that the operating system swapped the segment out to the hard disk because memory was short. The processor raises an exception and transfers control to the corresponding exception handler, which loads the segment from the hard disk into memory, sets P to 1, and returns. The CPU then retries the operation and checks the P bit again.
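The three checks above (bounds, type, presence) can be sketched in C. This is a simplified model under stated assumptions, not how any kernel or CPU implements them: seg_desc keeps only attribute flags as independent booleans, the names are hypothetical, and the null selector and privilege checks are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum seg_reg { REG_CS, REG_SS, REG_DATA /* DS, ES, FS, GS */ };
enum load_result { LOAD_OK, LOAD_OUT_OF_BOUNDS, LOAD_BAD_TYPE, LOAD_NOT_PRESENT };

/* Simplified descriptor (assumption): attributes only, no base or limit. */
struct seg_desc {
    bool present;    /* P bit */
    bool executable; /* code segment */
    bool readable;
    bool writable;   /* only data segments are writable */
    bool accessed;   /* A bit, set by a successful load */
};

static enum load_result load_segment(enum seg_reg reg, uint16_t selector,
                                     struct seg_desc *table, uint32_t table_limit)
{
    uint32_t index = selector >> 3;
    struct seg_desc *d;

    assert(table != NULL);

    /* 1. Bounds: the whole 8-byte descriptor must lie inside the table. */
    if (index * 8 + 7 > table_limit)
        return LOAD_OUT_OF_BOUNDS;
    d = &table[index];

    /* 2. Type: the segment must fit the purpose of the register. */
    if (reg == REG_CS && !d->executable)
        return LOAD_BAD_TYPE;
    if (reg == REG_SS && !d->writable)
        return LOAD_BAD_TYPE;
    if (reg == REG_DATA && !d->readable)
        return LOAD_BAD_TYPE;

    /* 3. Presence: P = 0 means the segment is not in memory. */
    if (!d->present)
        return LOAD_NOT_PRESENT;

    d->accessed = true;
    return LOAD_OK;
}
```

An execute-only code segment has neither readable nor writable set in this model, so it is rejected for every register except CS, which matches the second rule.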

There is a user-mode stack and a kernel-mode stack. When switching between the two privilege levels, how does the CPU select the right stack? The answer lies in the TSS (Task State Segment) that belongs to each task: inside it, ss0 and esp0 describe the privilege level 0 stack, ss1 and esp1 the privilege level 1 stack, and ss2 and esp2 the privilege level 2 stack. There is no ss3, because privilege level 3 is the lowest level, so no transfer from a lower level ever needs to switch to its stack.

Privilege level conversion is divided into low to high conversion and high to low conversion

Privilege level conversion from low to high

A switch from a low privilege level to a high privilege level can be made through an interrupt gate or a call gate. Because the processor cannot know in advance where the stack of the target privilege level is, the address of the target stack is recorded somewhere ahead of time and then loaded into SS and ESP when the processor moves to the higher privilege level. That place is the TSS: the processor automatically fetches the stack address of the higher privilege level from it.
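A rough C model of that lookup follows. The struct keeps only the stack fields of a TSS, and tss_stacks, stack_ptr and stack_for_level are illustrative names, not kernel identifiers; the real TSS has many more fields and the processor does this in hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Only the stack fields of a TSS, for illustration. */
struct tss_stacks {
    uint32_t esp0; uint16_t ss0; /* privilege level 0 stack */
    uint32_t esp1; uint16_t ss1; /* privilege level 1 stack */
    uint32_t esp2; uint16_t ss2; /* privilege level 2 stack */
};

struct stack_ptr {
    uint16_t ss;
    uint32_t esp;
};

/* Pick the stack the processor would load when entering level new_cpl. */
static struct stack_ptr stack_for_level(const struct tss_stacks *tss, unsigned new_cpl)
{
    struct stack_ptr s = {0, 0};

    assert(tss && new_cpl <= 2); /* the TSS has no ss3/esp3 */
    switch (new_cpl) {
    case 0:  s.ss = tss->ss0; s.esp = tss->esp0; break;
    case 1:  s.ss = tss->ss1; s.esp = tss->esp1; break;
    default: s.ss = tss->ss2; s.esp = tss->esp2; break;
    }
    return s;
}
```

Since Linux only uses levels 0 and 3, only the ss0/esp0 pair matters in practice when a user process enters the kernel.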

Privilege level conversion from high to low

First of all, the higher the privilege level, the more the code can do. A CPU that is already running in kernel mode has no reason of its own to drop to a lower privilege level. But user programs run in user mode, and after user mode has switched into kernel mode there must be a way back from kernel mode to user mode. There is only one: a return instruction. Returning is the only way to make the processor lower its privilege level.

When returning, the processor does not need to look for the low-privilege target stack in the TSS. One reason, which you may have guessed, is that the TSS only records the stacks of privilege levels 2, 1, and 0; when returning from privilege level 2 to privilege level 3, where would the level 3 stack be found? The other reason is that the address of the lower-privilege stack is already known, as a consequence of how the processor enters higher-privilege code (through int, call, and so on). When the processor moves from a lower privilege level to a higher one, it automatically pushes the lower-privilege stack address (SS and ESP) onto the higher-privilege stack. Therefore, when a return instruction such as RETF or IRET goes back from the higher privilege level to the lower one, the processor can pop the low-privilege stack segment selector and offset from the high-privilege stack currently in use. Returning from a higher privilege level to a lower one is called a return to an outer privilege level.

104 bytes is the minimum size of the TSS. If the task needs per-port I/O permissions, an I/O permission bitmap can be appended to the TSS, in which case the TSS is larger than 104 bytes.

The CPU executes code. We call the code segment being executed the visitor, and the resource that the visitor accesses the target. A target may be a code segment or a data segment.

For a data segment (the type field in the segment descriptor does not have the X executable attribute):

The visitor's privilege must be equal to or higher than the minimum privilege required by the target's DPL; otherwise the access is refused. Numerically, CPL <= DPL.

For a code segment (the type field in the segment descriptor has the X executable attribute):

The transfer can proceed only if the visitor's privilege is exactly equal to the privilege indicated by the DPL, that is, only same-level transfers are allowed. A visitor with more or with less privilege is rejected by the CPU. There is one exception, and it is the only case in which the processor moves from high privilege to low privilege: a return instruction, for example when the processor returns to user mode from an interrupt handler.

One way to execute instructions in a higher-privilege code segment without raising the privilege level is to use a conforming code segment. In a segment descriptor, if the segment is a non-system segment (the S field of the descriptor is 1), the C bit in the type field indicates whether a code segment is conforming: C = 1 means conforming, C = 0 means non-conforming. The code segments discussed above are non-conforming and only allow same-level transfers.

Conforming code segments let lower-privilege code transfer into higher-privilege code. If a conforming segment is the target of a transfer, its privilege (DPL) must be equal to or higher than the CPL before the transfer, that is, numerically CPL >= DPL. In other words, the DPL of a conforming code segment is the upper bound of privilege, and code at that privilege level or any lower one can transfer into the segment. When the processor enters a conforming target segment, it does not replace the CPL with the target's DPL. Since the CPL is unchanged after the transfer, the transfer itself does not raise the privilege level: the code merely executes instructions located in a higher-privilege segment, so there is no danger of privilege escalation.
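The access rules for data segments, non-conforming code segments and conforming code segments can be written as two small predicates. This is a sketch with hypothetical function names. One addition beyond the text is labeled as such: for data accesses the processor also takes the selector's RPL into account, so the sketch compares the numerically larger of CPL and RPL against DPL.

```c
#include <assert.h>
#include <stdbool.h>

/* Privilege levels run from 0 (highest) to 3 (lowest). */

/* Data segment access: allowed when max(CPL, RPL) <= DPL.
 * Including RPL is an addition to the CPL <= DPL rule stated above. */
static bool data_access_ok(unsigned cpl, unsigned rpl, unsigned dpl)
{
    assert(cpl <= 3 && rpl <= 3 && dpl <= 3);
    unsigned effective = cpl > rpl ? cpl : rpl; /* the weaker of the two */
    return effective <= dpl;
}

/* Direct transfer (JMP/CALL without a gate) into a code segment. */
static bool code_transfer_ok(unsigned cpl, unsigned dpl, bool conforming)
{
    assert(cpl <= 3 && dpl <= 3);
    if (conforming)
        return dpl <= cpl; /* target has the same or higher privilege */
    return cpl == dpl;     /* non-conforming: same level only */
}
```

For example, user code at CPL 3 cannot jump directly into a non-conforming level 0 code segment, but it can enter a conforming one, and its CPL stays 3 afterwards.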

Gate descriptors

A transfer from a low privilege level to a high one can only go through a gate. The gates are the interrupt gate, the trap gate, the call gate, and the task gate; the task gate is rarely used, so it is not covered here. Call gates can be located in the GDT or an LDT; interrupt gates and trap gates can only be located in the IDT.

Call gates and task gates can be invoked directly with the CALL and JMP instructions, because both kinds of gate descriptor can be placed in a descriptor table (GDT or LDT) and, like ordinary segment descriptors, are reached through a selector. They are invoked simply by giving the selector of the task gate or call gate as the operand of CALL or JMP. Trap gates and interrupt gates exist only in the IDT, so they cannot be reached with CALL or JMP; they are entered only when an interrupt or exception occurs.

trap_init()

The figure above shows the memory distribution before the execution of main, and we see where the IDT is.

With the background above, the following code is easier to understand. trap_init() in kernel/traps.c initializes the IDT entries through the macro _set_gate.

void trap_init(void)
{
	int i;

	set_trap_gate(0,&divide_error);
	set_trap_gate(1,&debug);
	set_trap_gate(2,&nmi);
	set_system_gate(3,&int3);	/* int3-5 can be called from all */
	set_system_gate(4,&overflow);
	set_system_gate(5,&bounds);
	set_trap_gate(6,&invalid_op);
	set_trap_gate(7,&device_not_available);
	set_trap_gate(8,&double_fault);
	set_trap_gate(9,&coprocessor_segment_overrun);
	set_trap_gate(10,&invalid_TSS);
	set_trap_gate(11,&segment_not_present);
	set_trap_gate(12,&stack_segment);
	set_trap_gate(13,&general_protection);
	set_trap_gate(14,&page_fault);
	set_trap_gate(15,&reserved);
	set_trap_gate(16,&coprocessor_error);
	// Gates 17-47 are first set to reserved. Each piece of hardware
	// resets its own gate later, when it is initialized.
	for (i=17; i<48; i++)
		set_trap_gate(i,&reserved);
	// Set the trap gate for the coprocessor interrupt 0x2D (45)
	// and allow it to generate interrupt requests.
	set_trap_gate(45,&irq13);
	outb_p(inb_p(0x21)&0xfb,0x21);	// Unmask IRQ2 on the master 8259A.
	outb(inb_p(0xA1)&0xdf,0xA1);	// Unmask IRQ13 on the slave 8259A.
	set_trap_gate(39,&parallel_interrupt);	// Trap gate for interrupt 0x27 (parallel port 1).
}
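The two port writes near the end of trap_init() each clear one bit of an 8259A interrupt mask register, where a 1 bit disables a line and a 0 bit enables it. The helper below is a hypothetical illustration of that bit arithmetic only; it does no port I/O.

```c
#include <assert.h>
#include <stdint.h>

/* Return the mask with the given line (0-7) enabled, i.e. its bit cleared. */
static uint8_t unmask_irq_line(uint8_t mask, unsigned line)
{
    assert(line < 8);
    return (uint8_t)(mask & ~(1u << line));
}
```

Clearing bit 2 is the same as ANDing with 0xfb (line 2 on the master chip, the cascade input), and clearing bit 5 is the same as ANDing with 0xdf (line 5 on the slave chip, which is IRQ13, the coprocessor).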

include/asm/system.h

#define _set_gate(gate_addr,type,dpl,addr) \
__asm__ ("movw %%dx,%%ax\n\t" \
	"movw %0,%%dx\n\t" \
	"movl %%eax,%1\n\t" \
	"movl %%edx,%2" \
	: \
	: "i" ((short) (0x8000+(dpl<<13)+(type<<8))), \
	"o" (*((char *) (gate_addr))), \
	"o" (*(4+(char *) (gate_addr))), \
	"d" ((char *) (addr)),"a" (0x00080000))	/* edx = addr, eax = 0x00080000 */

#define set_trap_gate(n,addr) \
	_set_gate(&idt[n],15,0,addr)

#define set_system_gate(n,addr) \
	_set_gate(&idt[n],15,3,addr)

#define set_intr_gate(n,addr) \
	_set_gate(&idt[n],14,0,addr)

Take trap gates as an example. The type is 1111 (15). A trap gate is similar to an interrupt gate; the only difference is that when control enters the handler through a trap gate, the IF flag is left unchanged, that is, interrupts are not disabled.

"d" ((char *) (addr)) loads the offset address addr into the EDX register.
"a" (0x00080000) loads the value 0x00080000 into EAX, 32 bits in total. The high 16 bits, 0x0008, are the segment selector; the low 16 bits, currently 0, will be replaced by the low 16 bits of the handler offset held in EDX.
"i" ((short) (0x8000+(dpl<<13)+(type<<8))) is an immediate value. 0x8000 (0b1000_0000_0000_0000) sets P to 1, which means the descriptor is present in memory; DPL and type are set as well.
movw %%dx,%%ax copies the low 16 bits of EDX (DX) into the low 16 bits of EAX (AX). EAX now holds 0x0008 in its high half and the low 16 bits of addr in its low half. 0x0008 is the segment selector, 0b1000 in binary: RPL is 0, TI is 0 (look in the GDT), and the index is 1 (index 0 is the null descriptor).
movw %0,%%dx stores the immediate ((short) (0x8000+(dpl<<13)+(type<<8))) into DX.
movl %%eax,%1 stores the 32-bit register EAX into (*((char *) (gate_addr))), the first 4 bytes of the gate.
movl %%edx,%2 stores the 32-bit register EDX into (*(4+(char *) (gate_addr))), the last 4 bytes of the gate.
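To make the register shuffling concrete, here is a plain C sketch that builds the same two 32-bit words that _set_gate writes into an IDT entry. gate_desc and make_gate are hypothetical names, and the selector is fixed at 0x0008 as in the macro; this models only the encoding and does not touch a real IDT.

```c
#include <assert.h>
#include <stdint.h>

struct gate_desc {
    uint32_t low;  /* bytes 0-3: selector in the high half, offset 15..0 in the low half */
    uint32_t high; /* bytes 4-7: offset 31..16 in the high half, P/DPL/type in the low half */
};

static struct gate_desc make_gate(uint32_t handler, unsigned type, unsigned dpl)
{
    struct gate_desc g;

    assert(type <= 15 && dpl <= 3);
    /* eax after "movw %dx,%ax": 0x0008 : low 16 bits of the handler */
    g.low = (0x0008u << 16) | (handler & 0xffffu);
    /* edx after "movw %0,%dx": high 16 bits of the handler : attribute word */
    g.high = (handler & 0xffff0000u) | 0x8000u | (dpl << 13) | (type << 8);
    return g;
}
```

A trap gate (type 15, DPL 0) therefore gets the attribute word 0x8F00, a system gate (type 15, DPL 3) gets 0xEF00, and an interrupt gate (type 14, DPL 0) gets 0x8E00.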

divide_error

On entry, the code first pushes the address of do_divide_error onto the stack (_do_divide_error is the compiled symbol name of the C function do_divide_error) and then pushes ebx, ecx, edx, and so on. lea 44(%esp),%edx loads the address esp + 44 into EDX. Because the stack grows toward lower addresses (stack bottom at the high address, stack top at the low address), esp + 44 points back at the return address that was on the stack when the exception occurred. call *%eax calls do_divide_error. That function ends up terminating the process: it releases the process's resources, hands its child processes over to init, and signals the parent process to reap it.

kernel/asm.s

divide_error:
	pushl $do_divide_error	# first push the address of the C function to call
no_error_code:
	xchgl %eax,(%esp)	# handler address -> eax; the old eax is saved on the stack
	pushl %ebx
	pushl %ecx
	pushl %edx
	pushl %edi
	pushl %esi
	pushl %ebp
	push %ds		# a 16-bit segment register still takes 4 bytes on the stack
	push %es
	push %fs
	pushl $0		# "error code" of 0
	lea 44(%esp),%edx	# address of the interrupted program's return address
	pushl %edx		# push it as the esp argument
	movl $0x10,%edx		# kernel data segment selector
	mov %dx,%ds
	mov %dx,%es
	mov %dx,%fs
	call *%eax		# indirect call to the C handler, e.g. do_divide_error
	addl $8,%esp
	pop %fs
	pop %es
	pop %ds
	popl %ebp
	popl %esi
	popl %edi
	popl %edx
	popl %ecx
	popl %ebx
	popl %eax		# restore the original eax
	iret
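Where the 44 comes from can be checked with a struct that mirrors the stack at the moment of the lea instruction. The layout below assumes the push order of no_error_code in the Linux 0.11 kernel/asm.s (ebx, ecx, edx, edi, esi, ebp, ds, es, fs, then a zero error code); the struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stack contents from the top of the stack (lowest address) upward. */
struct no_error_code_frame {
    uint32_t error_code;               /* pushl $0 */
    uint32_t fs, es, ds;               /* segment registers, 4 bytes each */
    uint32_t ebp, esi, edi, edx, ecx, ebx;
    uint32_t eax;                      /* slot that first held the handler address */
    uint32_t eip, cs, eflags;          /* pushed by the CPU */
    uint32_t old_esp, old_ss;          /* pushed by the CPU only on a privilege change */
};

/* Eleven 4-byte slots sit below the saved EIP: 11 * 4 = 44. */
static_assert(offsetof(struct no_error_code_frame, eip) == 44,
              "lea 44(%esp) must point at the saved EIP");

/* The value passed to the C handler as its esp argument. */
static const uint32_t *esp_argument(const struct no_error_code_frame *f)
{
    return &f->eip;
}
```

Indexing that pointer as an array of 32-bit words gives EIP at [0], CS at [1], EFLAGS at [2], and, after a privilege change, the old ESP at [3] and the old SS at [4].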

kernel/traps.c

The arguments are passed on the stack. esp is the address computed before the call to do_divide_error: it points at the saved return address of the interrupted program, and die() reads the saved EIP, CS, EFLAGS, ESP, and SS through it.

void do_divide_error(long esp, long error_code)
{
	die("divide error",esp,error_code);
}

// This routine prints the name of the exception, the error number, the
// interrupted program's EIP, EFLAGS and ESP, the fs segment register, the
// segment base and limit, the process PID, the task number, and 10 bytes
// of instruction code. If the stack is in the user data segment, it also
// prints 16 bytes of stack contents.
static void die(char * str,long esp_ptr,long nr)
{
	long * esp = (long *) esp_ptr;
	int i;

	/* ... print some data ... */
	if (esp[4] == 0x17) {
		printk("Stack: ");
		for (i=0; i<4; i++)
			printk("%p ",get_seg_long(0x17,i+(long *)esp[3]));
		printk("\n");
	}
	str(i);
	printk("Pid: %d, process nr: %d\n\r",current->pid,0xffff & i);
	for(i=0; i<10; i++)
		printk("%02x ",0xff & get_seg_byte(esp[1],(i+(char *)esp[0])));
	printk("\n\r");
	do_exit(11);		/* play segment exception */
}

kernel/exit.c: the process exit handler.

If the process has children, their father is changed to 1 (init). If a child is already in the ZOMBIE state, SIGCHLD is sent to process 1 so that init can reap it. The process's open files are closed. The process then sets its own state to TASK_ZOMBIE, notifies its parent to reap it, and calls schedule() to reschedule.

int do_exit(long code)
{
	int i;

	free_page_tables(get_base(current->ldt[1]),get_limit(0x0f));
	free_page_tables(get_base(current->ldt[2]),get_limit(0x17));
	// If the current process has children, set their father to 1 (init).
	// If a child is already a ZOMBIE, send SIGCHLD to process 1.
	for (i=0 ; i<NR_TASKS ; i++)
		if (task[i] && task[i]->father == current->pid) {
			task[i]->father = 1;
			if (task[i]->state == TASK_ZOMBIE)
				/* assumption task[1] is always init */
				(void) send_sig(SIGCHLD, task[1], 1);
		}
	// Close all files opened by the current process.
	for (i=0 ; i<NR_OPEN ; i++)
		if (current->filp[i])
			sys_close(i);
	// Release the i-nodes of the working directory (pwd), the root
	// directory (root) and the executable file, and clear the fields.
	iput(current->pwd);
	current->pwd=NULL;
	iput(current->root);
	current->root=NULL;
	iput(current->executable);
	current->executable=NULL;
	// If the current process is a session leader with a controlling
	// terminal, release the terminal.
	if (current->leader && current->tty >= 0)
		tty_table[current->tty].pgrp = 0;
	// Clear last_task_used_math if this process last used the coprocessor.
	if (last_task_used_math == current)
		last_task_used_math = NULL;
	// If the current process is a session leader, terminate all
	// processes of the session.
	if (current->leader)
		kill_session();
	// Mark the process as a zombie (its resources have been released)
	// and save the exit code for the parent to read.
	current->state = TASK_ZOMBIE;
	current->exit_code = code;
	// Notify the parent by sending it SIGCHLD.
	tell_father(current->father);
	schedule();	// reschedule so that the parent can do the remaining cleanup
	// This function never returns. The return statement below only
	// silences a compiler warning; declaring the function volatile would
	// tell GCC that it does not return and make the statement unnecessary.
	return (-1);	/* just to suppress warnings */
}

debug

Interrupt raised when the TF flag in EFLAGS is set

debug:
	pushl $do_int3		# push the C function pointer (marked _do_debug in the source comment)
	jmp no_error_code

kernel/traps.c

void do_int3(long * esp, long error_code,
		long fs,long es,long ds,
		long ebp,long esi,long edi,
		long edx,long ecx,long ebx,long eax)
{
	int tr;

	__asm__("str %%ax":"=a" (tr):"0" (0));	// store the task register selector into tr
	printk("eax\t\tebx\t\tecx\t\tedx\n\r%8x\t%8x\t%8x\t%8x\n\r",
		eax,ebx,ecx,edx);
	printk("esi\t\tedi\t\tebp\t\tesp\n\r%8x\t%8x\t%8x\t%8x\n\r",
		esi,edi,ebp,(long) esp);
	printk("\n\rds\tes\tfs\ttr\n\r%4x\t%4x\t%4x\t%4x\n\r",
		ds,es,fs,tr);
	printk("EIP: %8x CS: %4x EFLAGS: %8x\n\r",esp[0],esp[1],esp[2]);
}

nmi

Interrupts can be classified by event source: interrupts that come from outside the CPU are called external interrupts, and interrupts that come from inside the CPU are called internal interrupts. External interrupts are divided into maskable and non-maskable interrupts according to whether they can be masked, while internal interrupts are divided into software interrupts and exceptions according to whether they are raised deliberately.

nmi:
	pushl $do_nmi
	jmp no_error_code

void do_nmi(long esp, long error_code)
{
	die("nmi",esp,error_code);
}

int3

An interrupt raised by an int 3 instruction, independent of a hardware interrupt

int3:
	pushl $do_int3
	jmp no_error_code

overflow

This interrupt occurs when the CPU executes the INTO instruction while the OF (overflow) flag in EFLAGS is set. It is usually used by compilers to trap arithmetic overflow.

overflow:
	pushl $do_overflow
	jmp no_error_code

void do_overflow(long esp, long error_code)
{
	die("overflow",esp,error_code);
}

bounds

The interrupt thrown when the operand is outside the valid range. This interrupt occurs when a BOUND instruction test fails

bounds:
	pushl $do_bounds
	jmp no_error_code

void do_bounds(long esp, long error_code)
{
	die("bounds",esp,error_code);
}

invalid_op

An interrupt raised when the CPU's execution unit detects an invalid opcode.

invalid_op:
	pushl $do_invalid_op
	jmp no_error_code

void do_invalid_op(long esp, long error_code)
{
	die("invalid operand",esp,error_code);
}

device_not_available

If the EM (emulation) flag in control register CR0 is set, this interrupt is raised when the CPU executes a coprocessor instruction, which gives the interrupt handler a chance to emulate the instruction. The task-switched flag TS in CR0 is set when the CPU performs a task switch. TS can be used to tell when the contents of the coprocessor no longer match the task the CPU is executing: this interrupt is raised when the CPU finds TS set while executing a coprocessor escape instruction. At that point the handler can save the coprocessor state of the previous task and restore the coprocessor state of the new task. Finally, control is transferred to the label ret_from_sys_call (which detects and processes signals).

kernel/system_call.s

device_not_available:
	push %ds
	push %es
	push %fs
	pushl %edx
	pushl %ecx
	pushl %ebx
	pushl %eax
	movl $0x10,%eax		# ds and es point to the kernel data segment
	mov %ax,%ds
	mov %ax,%es
	movl $0x17,%eax		# fs points to the local (user) data segment
	mov %ax,%fs
	pushl $ret_from_sys_call	# return address for the jump or call below
	clts				# clear TS so that we can use math
	movl %cr0,%eax
	testl $0x4,%eax			# EM (math emulation bit)
# If EM is not set, the interrupt was not caused by emulation: restore the
# task's coprocessor state with the C function math_state_restore(), which
# returns to ret_from_sys_call.
	je math_state_restore
# If EM is set, execute math_emulate().
	pushl %ebp
	pushl %esi
	pushl %edi
	call math_emulate
	popl %edi
	popl %esi
	popl %ebp
	ret
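The decision made by this handler can be summarized in a few lines of C. The sketch models CR0 as an ordinary variable, and the enum and function names are invented for illustration; the bit positions (EM is bit 2, TS is bit 3) are the architectural ones.

```c
#include <assert.h>
#include <stdint.h>

#define CR0_EM 0x4u /* bit 2: emulate coprocessor instructions */
#define CR0_TS 0x8u /* bit 3: task switched */

enum fpu_action { FPU_RESTORE_STATE, FPU_EMULATE };

/* Model of the handler: clear TS (clts), then branch on EM. */
static enum fpu_action device_not_available_action(uint32_t *cr0)
{
    assert(cr0);
    *cr0 &= ~CR0_TS;
    return (*cr0 & CR0_EM) ? FPU_EMULATE : FPU_RESTORE_STATE;
}
```

FPU_RESTORE_STATE corresponds to jumping to math_state_restore, and FPU_EMULATE to calling math_emulate.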

Task 0 does no real work and does not process signals; whenever it runs, it simply reschedules.

signal	= 12		# offset of the signal bitmap in the task structure; each bit is one signal, signal number = bit position + 1
blocked = (33*16)	# offset of the blocked-signal bitmap

ret_from_sys_call:
	movl current,%eax		# task[0] cannot have signals
	cmpl task,%eax
	je 3f
# Check the saved code selector to see whether the caller was a user task.
# If not, exit directly, because a task cannot be preempted while it runs
# in kernel mode. 0x000f is the user code segment selector (RPL=3, local
# table, first segment). If the saved stack selector is not 0x17, the
# caller was not a user task either.
	cmpw $0x0f,CS(%esp)		# was old code segment supervisor ?
	jne 3f
	cmpw $0x17,OLDSS(%esp)		# was stack segment = 0x17 ?
	jne 3f
# Process the signals of the current task: take the signal bitmap (32 bits,
# one per signal), mask out the blocked signals, pick the lowest pending
# signal, clear its bit in the bitmap, and call do_signal() with the signal
# number as one of the arguments.
	movl signal(%eax),%ebx		# signal bitmap -> ebx
	movl blocked(%eax),%ecx		# blocked bitmap -> ecx
	notl %ecx			# invert
	andl %ebx,%ecx			# permitted signal bitmap
	bsfl %ecx,%ecx			# scan from bit 0 for a set bit
	je 3f				# none set: exit
	btrl %ecx,%ebx			# clear that signal's bit
	movl %ebx,signal(%eax)		# store the bitmap back into current->signal
	incl %ecx			# signal numbers start at 1 (1-32)
	pushl %ecx			# signal number is one of the arguments
	call do_signal			# C function in kernel/signal.c
	popl %eax			# discard the signal number
3:	popl %eax
	popl %ebx
	popl %ecx
	popl %edx
	pop %fs
	pop %es
	pop %ds
	iret
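The signal selection step (mask out blocked signals, take the lowest pending bit, clear it, convert to a 1-based number) is compact in assembly but easy to follow in C. The function below is a hypothetical stand-in for that step, not kernel code; it returns 0 when no unblocked signal is pending.

```c
#include <assert.h>
#include <stdint.h>

/* Pick the lowest unblocked pending signal, clear its bit in *signal,
 * and return its number (1-32). Returns 0 if nothing is deliverable. */
static int next_signal(uint32_t *signal, uint32_t blocked)
{
    uint32_t pending;
    int bit = 0;

    assert(signal);
    pending = *signal & ~blocked;      /* notl + andl */
    if (pending == 0)
        return 0;                      /* "je 3f" after bsfl */
    while (!(pending & (1u << bit)))   /* bsfl: lowest set bit */
        bit++;
    *signal &= ~(1u << bit);           /* btrl + store back */
    return bit + 1;                    /* incl: signal numbers start at 1 */
}
```

With bits 2 and 4 pending and bit 2 blocked, the function returns 5 and leaves only the blocked bit 2 set in the bitmap.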

double_fault

Typically, when the CPU detects a new exception while calling the handler for the previous exception, the two exceptions will be handled serially, but in rare cases, the CPU will not be able to do this serially and the interrupt will be raised.

double_fault:
	pushl $do_double_fault	# push the address of the C function
error_code:
	xchgl %eax,4(%esp)	# error code <-> %eax; the original eax is saved on the stack
	xchgl %ebx,(%esp)	# &function <-> %ebx; the original ebx is saved on the stack
	pushl %ecx
	pushl %edx
	pushl %edi
	pushl %esi
	pushl %ebp
	push %ds
	push %es
	push %fs
	pushl %eax		# error code
	lea 44(%esp),%eax	# offset: address of the return address on the stack
	pushl %eax
	movl $0x10,%eax		# kernel data segment selector
	mov %ax,%ds
	mov %ax,%es
	mov %ax,%fs
	call *%ebx		# indirect call to the C handler
	addl $8,%esp		# discard the two arguments pushed for the C function
	pop %fs
	pop %es
	pop %ds
	popl %ebp
	popl %esi
	popl %edi
	popl %edx
	popl %ecx
	popl %ebx
	popl %eax
	iret

void do_double_fault(long esp, long error_code)
{
	die("double fault",esp,error_code);
}

coprocessor_segment_overrun

This exception is essentially the coprocessor's equivalent of a protection fault: when the operand of a floating-point instruction is too large, a floating-point load or store may run past the end of the data segment.

coprocessor_segment_overrun:
	pushl $do_coprocessor_segment_overrun
	jmp no_error_code

void do_coprocessor_segment_overrun(long esp, long error_code)
{
	die("coprocessor segment overrun",esp,error_code);
}

invalid_TSS

The CPU attempted to switch to a process whose TSS is invalid. Where the exception is raised depends on which part of the TSS caused it: if the TSS limit is too small (the TSS is shorter than the 104-byte minimum), the exception is raised in the current task and the switch is aborted. Other problems cause the exception to be raised in the new task, after the switch.

invalid_TSS:
	pushl $do_invalid_TSS
	jmp error_code

void do_invalid_TSS(long esp,long error_code)
{
	die("invalid TSS",esp,error_code);
}

segment_not_present

The referenced segment is not in memory: the P bit in its segment descriptor is 0.

segment_not_present:
	pushl $do_segment_not_present
	jmp error_code

void do_segment_not_present(long esp,long error_code)
{
	die("segment not present",esp,error_code);
}

stack_segment

An instruction operation attempts to go beyond the range of the stack segment, or the stack segment is no longer in memory

stack_segment:
	pushl $do_stack_segment
	jmp error_code

void do_stack_segment(long esp,long error_code)
{
	die("stack segment",esp,error_code);
}

general_protection

Indicates an error that does not belong to any other class. If an exception occurs that has no corresponding handling vector of its own (0-16), it usually falls into this class.

general_protection:
	pushl $do_general_protection
	jmp error_code

void do_general_protection(long esp, long error_code)
{
	die("general protection",esp,error_code);
}

page_fault

The page fault handler (int 14) deals with two main cases. The first is a fault caused by a missing page (the CPU finds that the present (P) flag of the page directory entry or page table entry is 0); it is handled by calling do_no_page(error_code, address). The second is a fault caused by page write protection (the current process is not permitted to write to the page); it is handled by calling do_wp_page(error_code, address). The error code (error_code) is generated automatically by the CPU and pushed onto the stack, and the linear address whose access caused the fault is read from control register CR2, which exists specifically to hold that address.

For a page fault, the CPU provides two pieces of information that help diagnose the fault and recover from it.

The error code placed on the stack. It indicates whether the fault was caused by a page that is not present or by an access rights violation:
- Bit 2 (U/S): 0 means the access was made in supervisor mode, 1 means it was made in user mode.
- Bit 1 (W/R): 0 means a read operation, 1 means a write operation.
- Bit 0 (P): 0 means the page is not present, 1 means a page-level protection violation.

CR2 (control register 2). The CPU stores in CR2 the linear address whose access caused the fault. The handler can use this address to locate the corresponding page directory and page table entries. If another page fault can occur while the page fault handler is running, the handler should push CR2 onto the stack.
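The three error code bits can be decoded with simple masks. The sketch below uses hypothetical names (pf_info, decode_pf_error, choose_handler) and mirrors the choice between the missing-page handler and the write-protection handler, which depends on bit 0 alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PF_P   0x1u /* bit 0: 0 = page not present, 1 = protection violation */
#define PF_WR  0x2u /* bit 1: 0 = read, 1 = write */
#define PF_US  0x4u /* bit 2: 0 = supervisor mode, 1 = user mode */

static_assert(PF_P == 1, "the handler choice tests bit 0 of the error code");

struct pf_info {
    bool protection; /* true: protection violation, false: page not present */
    bool write;      /* true: write access, false: read access */
    bool user;       /* true: fault happened in user mode */
};

static struct pf_info decode_pf_error(uint32_t error_code)
{
    struct pf_info info;
    info.protection = (error_code & PF_P) != 0;
    info.write = (error_code & PF_WR) != 0;
    info.user = (error_code & PF_US) != 0;
    return info;
}

enum pf_handler { PF_DO_NO_PAGE, PF_DO_WP_PAGE };

/* Bit 0 clear: missing page; bit 0 set: write protection. */
static enum pf_handler choose_handler(uint32_t error_code)
{
    return (error_code & PF_P) ? PF_DO_WP_PAGE : PF_DO_NO_PAGE;
}
```

An error code of 7, for instance, describes a user-mode write to a present but protected page, which is the typical copy-on-write case.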

mm/page.s

page_fault:
	xchgl %eax,(%esp)	# error code -> eax
	pushl %ecx
	pushl %edx
	push %ds
	push %es
	push %fs
	movl $0x10,%edx		# kernel data segment selector (RPL=0, TI=0, index=2)
	mov %dx,%ds
	mov %dx,%es
	mov %dx,%fs
	movl %cr2,%edx		# linear address that caused the fault
	pushl %edx		# push the address and the error code as arguments
	pushl %eax		# for the function to be called
	testl $1,%eax		# test the P bit (bit 0) of the error code
	jne 1f			# set: not a missing page
	call do_no_page		# call the missing page handler
	jmp 2f
1:	call do_wp_page		# call the write protection handler
2:	addl $8,%esp		# discard the two arguments pushed onto the stack
	pop %fs
	pop %es
	pop %ds
	popl %edx
	popl %ecx
	popl %eax
	iret

mm/memory.c

do_no_page handles page faults caused by a missing page. It is called during page fault handling. Its parameters error_code and address are produced by the CPU when a process accesses a page that is not present. The function first checks whether the fault simply means that the process is dynamically requesting a new page of memory, in which case it maps a page of physical memory; otherwise it tries to share the page with another process that has already loaded the same file. If sharing does not succeed, the missing page can only be read from the corresponding file and mapped at the specified linear address.

void do_no_page(unsigned long error_code,unsigned long address)
{
	int nr[4];
	unsigned long tmp;
	unsigned long page;
	int block,i;

	// Take the page address of the given linear address, then compute
	// tmp, its offset from the process base address, i.e. the
	// corresponding logical address.
	address &= 0xfffff000;
	tmp = address - current->start_code;
	// executable is the i-node of the file the process is running. The
	// code of task 0 and task 1 is in the kernel, so for them, and for
	// any task derived from task 1 that has not called execve(),
	// executable is 0. If it is 0, or if the address lies beyond the
	// code plus data length (end_data), the process is requesting a new
	// page for heap or stack data: get a page of physical memory and map
	// it at the linear address. start_code is the address of the code
	// segment in the linear address space. In Linux 0.11 the code and
	// data segments have the same base address.
	if (!current->executable || tmp >= current->end_data) {
		get_empty_page(address);
		return;
	}
	if (share_page(tmp))
		return;
	if (!(page = get_free_page()))
		oom();
/* remember that 1 block is used for header */
	// The first block of the file holds the header, so the block number
	// of the missing page must be computed first. A block is
	// BLOCK_SIZE = 1 KB, so one page holds four blocks. tmp divided by
	// the block size, plus 1, is the first block of the missing page in
	// the executable image. bmap() maps each file block to a device
	// logical block number (stored in nr[]), and bread_page() reads the
	// four logical blocks into the physical page.
	block = 1 + tmp/BLOCK_SIZE;
	for (i=0 ; i<4 ; block++,i++)
		nr[i] = bmap(current->executable,block);
	bread_page(page,current->executable->i_dev,nr);
	// The page read may be less than one page away from the end of the
	// file, so some useless data may have been read. Clear the part of
	// the page that lies beyond end_data.
	i = tmp + 4096 - current->end_data;
	tmp = page + 4096;
	while (i-- > 0) {
		tmp--;
		*(char *)tmp = 0;
	}
	// Finally map the physical page at the faulting linear address and
	// return on success. Otherwise free the page and report that memory
	// is exhausted.
	if (put_page(page,address))
		return;
	free_page(page);
	oom();
}
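The two bits of arithmetic in do_no_page, the starting block number and the number of trailing bytes to clear, can be tried out in isolation. The helpers below are illustrative only, with made-up names; they assume BLOCK_SIZE is 1024 and the page size is 4096, as in the code above.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE 1024u
#define PAGE_SIZE  4096u

/* First file block backing the page at page-aligned offset tmp.
 * Block 0 of the executable holds the header, hence the "1 +". */
static uint32_t first_block_of_page(uint32_t tmp)
{
    assert(tmp % PAGE_SIZE == 0);
    return 1 + tmp / BLOCK_SIZE;
}

/* Number of bytes at the end of the page that lie beyond end_data
 * and must be zeroed after the page has been read from disk. */
static uint32_t bytes_to_clear(uint32_t tmp, uint32_t end_data)
{
    uint32_t page_end;

    assert(tmp < end_data); /* otherwise get_empty_page() handles the fault */
    page_end = tmp + PAGE_SIZE;
    return page_end > end_data ? page_end - end_data : 0;
}
```

For the page at offset 4096 of an image whose code plus data is 5000 bytes long, the read starts at block 5 and the last 3192 bytes of the page are cleared.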

Not finished… Continue tomorrow ~

Reference: Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A