page

Real mode addresses memory as segment base address + offset. Because real mode cannot stop an application from accessing any memory address, protected mode was introduced. Protected mode also extends the address lines to 32 bits, so up to 4 GB of memory can be addressed. To manage memory, the kernel divides it into pages, usually 4 KB each; pages below 1 MB are not managed this way. Because memory is handled page by page, a process's data can be scattered across non-contiguous physical pages, and a process can run without being loaded into memory in its entirety. If the CPU accesses memory belonging to a process and finds that the page containing it is not present, it raises a page fault exception. The process is unaware that the fault occurred: after the kernel handles it, the process continues as if nothing had happened.

page_fault

The page fault exception is handled in one of two ways:

  • The first case is a page fault caused by a missing page: the CPU finds that the present (P) flag of the corresponding page directory entry or page table entry is 0. It is handled by calling do_no_page(error_code, address).

  • The second case is a page fault caused by write protection: the process tries to write to a page that is mapped read-only (for example, a page shared between processes). In this case the write-protection handler do_wp_page(error_code, address) is called. The error code (error_code) is generated by the CPU and pushed onto the stack automatically, and the linear address whose access caused the exception is read from control register CR2, which the CPU uses to record the faulting linear address when a page fault occurs.

For page fault exceptions, the CPU provides two pieces of information that help diagnose the fault and recover from it:

  • Error code placed on the stack.

The error code indicates whether the exception was caused by a non-present page or by an access-rights violation: bit 2 (U/S) is 0 if the processor was executing in supervisor mode and 1 if it was in user mode; bit 1 (W/R) is 0 for a read access and 1 for a write access; bit 0 (P) is 0 if the page was not present and 1 for a page-level protection violation.

  • CR2 (control register 2). The CPU stores the linear address of the access that caused the exception in CR2. The exception handler can use this address to locate the corresponding page directory and page table entries. If another page fault can occur while the page fault handler is running, the handler should push CR2 onto the stack to preserve it.
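To make the three error code bits concrete, here is a small decoding sketch in C. The macro names PF_P, PF_WR and PF_US and the struct pf_info are illustrative assumptions, not names taken from Linux 0.11; only the bit meanings come from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical masks for the low three bits of the page fault error code. */
#define PF_P  0x1u  /* 0: page not present, 1: protection violation */
#define PF_WR 0x2u  /* 0: read access,       1: write access        */
#define PF_US 0x4u  /* 0: supervisor mode,   1: user mode           */

struct pf_info {
    bool protection;  /* true if caused by a page-level protection violation */
    bool write;       /* true if the faulting access was a write */
    bool user;        /* true if the CPU was executing in user mode */
};

/* Split a page fault error code into its three flags. */
static struct pf_info decode_pf_error(uint32_t error_code)
{
    struct pf_info info;
    info.protection = (error_code & PF_P) != 0;
    info.write      = (error_code & PF_WR) != 0;
    info.user       = (error_code & PF_US) != 0;
    return info;
}
```

For example, an error code of 6 (binary 110) describes a user-mode write to a page that is not present, while 7 (binary 111) describes a user-mode write to a present page that is write-protected, which is the do_wp_page case.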

mm/page.s

```asm
page_fault:
	xchgl %eax,(%esp)	# swap the error code on the stack into eax (old eax is saved in its place)
	pushl %ecx
	pushl %edx
	push %ds
	push %es
	push %fs
	movl $0x10,%edx		# kernel data segment selector: index = 2, TI = 0, RPL = 0
	mov %dx,%ds
	mov %dx,%es
	mov %dx,%fs
	movl %cr2,%edx		# faulting linear address
	pushl %edx		# push the address and the error code as the
	pushl %eax		# two arguments of the handler to be called
	testl $1,%eax		# test bit 0 (P) of the error code
	jne 1f			# P = 1: protection violation
	call do_no_page		# P = 0: page not present, call do_no_page
	jmp 2f
1:	call do_wp_page		# call the write-protection handler
2:	addl $8,%esp		# discard the two arguments pushed onto the stack
	pop %fs			# restore the registers and return from the interrupt
	pop %es
	pop %ds
	popl %edx
	popl %ecx
	popl %eax
	iret
```
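The dispatch at the heart of page_fault can be modeled in C as a rough sketch. The stub handlers, the enum and the cr2 parameter below are hypothetical stand-ins so that the choice of handler can be observed; the real handlers are the kernel's do_no_page and do_wp_page. The sketch also shows why the arguments arrive as (error_code, address): the address is pushed first and the error code last, and C arguments are pushed right to left.

```c
#include <assert.h>

/* Which handler the model dispatched to last (illustrative only). */
enum handler { NONE, NO_PAGE, WP_PAGE };

static enum handler last_handler = NONE;
static unsigned long last_error_code, last_address;

/* Hypothetical stand-ins for do_no_page and do_wp_page that only record the call. */
static void no_page_stub(unsigned long error_code, unsigned long address)
{
    last_handler = NO_PAGE;
    last_error_code = error_code;
    last_address = address;
}

static void wp_page_stub(unsigned long error_code, unsigned long address)
{
    last_handler = WP_PAGE;
    last_error_code = error_code;
    last_address = address;
}

/* C model of page_fault's dispatch; cr2 stands for the value the assembly
 * reads with "movl %cr2,%edx". */
static void page_fault_model(unsigned long error_code, unsigned long cr2)
{
    if (error_code & 1)          /* testl $1,%eax ; jne 1f */
        wp_page_stub(error_code, cr2);
    else
        no_page_stub(error_code, cr2);
}
```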

do_no_page

mm/memory.c

This function handles the case where the page being accessed has not yet been loaded into memory, so the access raises a page-not-present fault.

```c
void do_no_page(unsigned long error_code,unsigned long address)
{
	int nr[4];
	unsigned long tmp;
	unsigned long page;
	int block,i;

	address &= 0xfffff000;			// start address of the faulting page
	tmp = address - current->start_code;	// offset of the page within the process's address space
	// executable is the i-node of the process's executable file. If it is NULL,
	// or tmp >= end_data, the missing page lies in the heap or stack space,
	// so just map an empty physical page there.
	if (!current->executable || tmp >= current->end_data) {
		get_empty_page(address);
		return;
	}
	// Otherwise try to share the page with another process running the same executable.
	if (share_page(tmp))
		return;				// sharing succeeded
	if (!(page = get_free_page()))		// get a free physical page
		oom();
/* remember that 1 block is used for header */
	// The missing page must be read from the executable file, so first compute its
	// block numbers. Each data block is BLOCK_SIZE = 1 KB, so one page holds four
	// blocks. tmp / BLOCK_SIZE + 1 is the starting block of the missing page in the
	// executable image (block 0 holds the header). bmap() maps each file block, via
	// the executable's i-node, to a logical block number on the device (stored in
	// nr[]), and bread_page() reads the four blocks into the physical page.
	block = 1 + tmp/BLOCK_SIZE;
	for (i=0 ; i<4 ; block++,i++)
		nr[i] = bmap(current->executable,block);
	bread_page(page,current->executable->i_dev,nr);
	// The page just read may extend past end_data, in which case its tail holds file
	// contents that are not part of the code/data. Clear the bytes after end_data.
	i = tmp + 4096 - current->end_data;
	tmp = page + 4096;
	while (i-- > 0) {
		tmp--;
		*(char *)tmp = 0;
	}
	// Finally map the physical page at the faulting linear address. If that succeeds,
	// return; otherwise free the page and report that memory is exhausted.
	if (put_page(page,address))
		return;
	free_page(page);
	oom();
}
```

do_no_page is the handler for missing pages, called from the page fault interrupt handler. Its parameters are supplied by the CPU when the fault occurs: error_code is the error code pushed onto the stack, and address is the faulting linear address read from CR2. If the address lies beyond the process's code and data (in the heap or stack space), the function simply maps an empty physical page there. Otherwise it first tries to share the page with another process that has already loaded the same executable file; if sharing fails, it reads the missing page from the executable file and maps it at the specified linear address.
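The block arithmetic in do_no_page can be checked in isolation with the following sketch. The function name file_blocks_for_page is hypothetical; it assumes 1 KB blocks and 4 KB pages as stated above, and it stops at the file block numbers instead of calling bmap() to translate them into device blocks.

```c
#include <assert.h>

#define BLOCK_SIZE 1024  /* 1 KB file-system blocks, as in the text */
#define PAGE_SIZE  4096

/* For a page-aligned offset tmp within the executable, fill nr_file[] with the
 * four file block numbers that back the page and return how many bytes at the
 * end of the page lie past end_data and must be zeroed (0 if none). */
static int file_blocks_for_page(unsigned long tmp, unsigned long end_data,
                                int nr_file[4])
{
    int block = 1 + (int)(tmp / BLOCK_SIZE);  /* block 0 holds the header */
    for (int i = 0; i < 4; block++, i++)
        nr_file[i] = block;

    long tail = (long)(tmp + PAGE_SIZE) - (long)end_data;
    return tail > 0 ? (int)tail : 0;
}
```

For a page at offset 0x2000 with end_data = 0x2A00, the page is backed by file blocks 9 to 12, and the last 0x600 (1536) bytes of the page lie past end_data and are cleared.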

get_empty_page

```c
void get_empty_page(unsigned long address)
{
	unsigned long tmp;

	// If no free page can be obtained, or the page cannot be mapped at the
	// specified address, report that memory is exhausted.
	if (!(tmp=get_free_page()) || !put_page(tmp,address)) {
		free_page(tmp);		/* 0 is ok - ignored */
		oom();
	}
}
```

get_free_page

get_free_page searches the mem_map array for an entry whose value is 0. Each entry of the array represents one page of main memory, so multiplying the entry's index by 4 KB gives the page's offset from the start of main memory, and adding the low memory address LOW_MEM gives its physical address. The page is then cleared to zero and its physical address is returned to the caller. If no free page is found, 0 is returned.

```c
unsigned long get_free_page(void)
{
register unsigned long __res asm("ax");

__asm__("std ; repne ; scasb\n\t"	// set DF; compare al (0) with each mem_map byte at (edi), scanning backward
	"jne 1f\n\t"			// no byte equal to 0: jump to label 1 and return 0
	"movb $1,1(%%edi)\n\t"		// 1 => [edi+1]: mark the page as used in mem_map
	"sall $12,%%ecx\n\t"		// page index * 4K = start address relative to LOW_MEM
	"addl %2,%%ecx\n\t"		// add LOW_MEM to get the page's physical start address
	"movl %%ecx,%%edx\n\t"		// physical start address -> edx
	"movl $1024,%%ecx\n\t"		// repeat count: 1024 dwords = 4096 bytes
	"leal 4092(%%edx),%%edi\n\t"	// edi = address of the last dword of the page
	"rep ; stosl\n\t"		// store eax (0) backward through the page, clearing it
	"movl %%edx,%%eax\n"		// page start address -> eax (return value)
	"1:"
	:"=a" (__res)
	:"0" (0),"i" (LOW_MEM),"c" (PAGING_PAGES),
	"D" (mem_map+PAGING_PAGES-1)
	);
return __res;	// physical address of the free page, or 0 if there is none
}
```

A line-by-line analysis of the code follows; it is optional reading 😵

Line by line analysis:

Register state before assembly code execution:

  • register unsigned long __res; defines a register variable. To pin it to a specific register, write register unsigned long __res asm("ax");.
  • "=a" (__res) means the output is returned in __res, i.e. in the eax register.
  • "0" (0): the first 0 means this input uses the same register as operand 0, namely eax; the second 0 is the value loaded into eax.
  • "i" (LOW_MEM) passes LOW_MEM as an immediate operand, referred to as %2 in the code.
  • "c" (PAGING_PAGES) loads PAGING_PAGES into ecx.
  • "D" (mem_map+PAGING_PAGES-1) loads mem_map+PAGING_PAGES-1, the address of the last mem_map entry, into edi.

Executing the assembly code:

  • The STD instruction sets the DF (direction) flag to 1, and the CLD instruction clears it. DF = 0 selects forward operation and DF = 1 selects reverse operation. Forward operation means string instructions move from the low-address end of the memory region toward the high-address end; reverse operation is the opposite.

  • The SCASB, SCASW and SCASD instructions compare the value in AL/AX/EAX with the byte/word/doubleword addressed by EDI, respectively. They can be used to search for a value in a string or array. With the REPE (or REPZ) prefix, scanning continues while ECX > 0 and the value in AL/AX/EAX equals each successive value in memory. With the REPNE (or REPNZ) prefix, scanning continues until a memory value equal to AL/AX/EAX is found or ECX = 0.

repne ; scasb: scasb is executed repeatedly, comparing AL (here 0) with the byte at ES:EDI. EDI initially holds the address of the last mem_map entry. Because STD has set DF, EDI is decremented by 1 after each scasb (the b stands for byte), and ECX is decremented by 1 on each repetition. Each comparison also updates flags in EFLAGS, including ZF.

repne stops when either ECX reaches 0 or ZF becomes 1 (a zero byte was found).

  • jne 1f: if no zero byte was found (ZF = 0), jump forward to label 1. None of the remaining code runs, and eax, which still holds 0, is returned.
  • movb $1,1(%%edi): sets the byte at EDI + 1 to 1. Because EDI was already decremented after the matching scasb, EDI + 1 is the address of the free mem_map entry, and setting it to 1 marks the page as used.
  • sall $12,%%ecx: ECX started at PAGING_PAGES, the total number of main memory pages, and was decremented on every comparison, including the matching one, so it now holds the index of the free entry. Shifting left by 12 (multiplying by 4 KB) gives the page's start address relative to LOW_MEM. (Linux 0.11 supports at most 16 MB of memory, laid out as kernel, buffer, RAM disk, main memory and high memory.)
  • addl %2,%%ecx: adds LOW_MEM (operand %2) to obtain the page's physical start address.
  • movl %%ecx,%%edx: physical start address -> edx.
  • movl $1024,%%ecx: 1024 -> ecx, the number of doublewords in a 4 KB page.
  • leal 4092(%%edx),%%edi: edi = edx + 4092, the address of the last doubleword in the page (4096 - 4 = 4092).
  • rep ; stosl: stosl stores the value in EAX at the address ES:EDI. If the direction flag in EFLAGS is set (by STD), EDI then decreases by 4; otherwise (after CLD) it increases by 4. Starting at edx + 4092 and moving backward in steps of 4, the instruction repeats 1024 times, filling all 1024 doublewords of the physical page with the value of EAX (0).
  • movl %%edx,%%eax: puts the page's physical start address into EAX as the return value.
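The steps above can be restated in plain C as a simulation. This is a sketch under stated assumptions: the names sim_mem_map, sim_memory and sim_get_free_page are hypothetical, the number of pages is scaled down to 4, "physical memory" is an ordinary array, and 0x100000 stands in for LOW_MEM. Each statement is annotated with the instruction it replaces.

```c
#include <assert.h>
#include <string.h>

#define SIM_PAGE_SIZE 4096
#define SIM_PAGES     4              /* stands in for PAGING_PAGES */
#define SIM_LOW_MEM   0x100000UL     /* stands in for LOW_MEM (1 MB) */

static unsigned char sim_mem_map[SIM_PAGES];            /* 0 = free, >0 = in use */
static unsigned char sim_memory[SIM_PAGES][SIM_PAGE_SIZE];

static unsigned long sim_get_free_page(void)
{
    /* std ; repne ; scasb: scan mem_map from the last entry downward */
    for (int i = SIM_PAGES - 1; i >= 0; i--) {
        if (sim_mem_map[i] == 0) {
            sim_mem_map[i] = 1;                            /* movb $1,1(%edi) */
            memset(sim_memory[i], 0, SIM_PAGE_SIZE);        /* rep ; stosl */
            return SIM_LOW_MEM + ((unsigned long)i << 12);  /* sall $12 ; addl LOW_MEM */
        }
    }
    return 0;                                               /* jne 1f: no free page */
}
```

Because the scan runs backward, the first call returns the highest free page (here 0x100000 + 3 * 4096), the next call the one below it, and once every entry is nonzero the function returns 0.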

put_page

Maps the physical page page to the linear address address.

```c
unsigned long put_page(unsigned long page,unsigned long address)
{
	unsigned long tmp, *page_table;

	// First check that the physical page is valid. If it is below LOW_MEM (1 MB)
	// or at or above the top of physical memory HIGH_MEMORY, print a warning.
	// LOW_MEM is the lowest address of the main memory area.
	if (page < LOW_MEM || page >= HIGH_MEMORY)
		printk("Trying to put page %p at %p\n",page,address);
	// The page should have just been allocated, so its mem_map count must be 1.
	if (mem_map[(page-LOW_MEM)>>12] != 1)
		printk("mem_map disagrees with %p at %p\n",page,address);
	page_table = (unsigned long *) ((address>>20) & 0xffc);	// address of the PDE
	if ((*page_table)&1)		// PDE present: take the page table's address from it
		page_table = (unsigned long *) (0xfffff000 & *page_table);
	else {				// otherwise allocate a new page table
		if (!(tmp=get_free_page()))
			return 0;
		*page_table = tmp|7;
		page_table = (unsigned long *) tmp;
	}
	page_table[(address>>12) & 0x3ff] = page | 7;	// set the PTE
	return page;
}
```

Notes on the two-level page table walk in put_page:

page_table = (unsigned long *) ((address>>20) & 0xffc);: address>>22 is the index of the entry in the page directory, and each PDE or PTE occupies 4 bytes. Because the page directory is located at physical address 0, the PDE's address follows directly from the index: (address>>22) * 4. Shifting right by 20 instead of 22 performs the multiplication by 4 but leaves two bits of the page table index in bits 0 and 1, and masking with 0xffc clears them. The directory has 2^10 = 1024 entries of 4 bytes each, so PDE addresses range from 0x000 to 0xffc (0 to 4092), and the mask also keeps the result within that range.
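A quick way to convince yourself that the two forms agree is to compute both. The function names below are illustrative; the first uses put_page's expression and the second the direct index-times-entry-size form.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset (and, with the directory at address 0, the address) of the PDE
 * for a 32-bit linear address, computed the way put_page does it. */
static uint32_t pde_offset_put_page(uint32_t address)
{
    return (address >> 20) & 0xffc;
}

/* The same value computed as directory index times the 4-byte entry size. */
static uint32_t pde_offset_obvious(uint32_t address)
{
    return (address >> 22) * 4;
}
```

For the linear address 0x08049123, the directory index is 0x20, so both functions return 0x80; for 0xFFFFFFFF both return 0xffc, the last entry.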

*page_table reads the PDE, and (*page_table)&1 checks whether its P (present) bit is 1.

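page_table = (unsigned long *) (0xfffff000 & *page_table);: masks off the flag bits of the PDE, leaving the physical address of the page table, so page_table now points to the start of the page table (the array of PTEs). If the P bit is 0, put_page instead allocates a new page, writes its address with flags 7 into the PDE, and uses that page as the page table.

page_table[(address>>12) & 0x3ff] = page | 7;: sets the PTE. Bits 12 to 21 of address select one of the 1024 entries, and the value is the physical page address with flags 7 (P = 1, R/W = 1, U/S = 1).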
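The whole walk can be exercised on a toy model. In this sketch, which is a simulation and not kernel code, "physical memory" is an array of 8 pages addressed by byte address with the page directory at address 0, as in Linux 0.11. toy_get_free_page is a hypothetical bump allocator that hands out zeroed pages for new page tables, toy_put_page follows put_page's structure, and toy_translate walks the two levels in software to check the result.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGES 8
static uint32_t toy_mem[TOY_PAGES * 1024];  /* 1024 32-bit entries per 4 KB page */
static int toy_next_free = 1;               /* page 0 is the page directory */

/* Pointer to the 32-bit entry at a byte address in toy memory. */
static uint32_t *toy_entry(uint32_t byte_addr) { return &toy_mem[byte_addr / 4]; }

static uint32_t toy_get_free_page(void)
{
    if (toy_next_free >= TOY_PAGES)
        return 0;
    uint32_t page = (uint32_t)toy_next_free++ << 12;
    memset(toy_entry(page), 0, 4096);       /* hand out zeroed pages */
    return page;
}

/* Map physical page `page` at linear address `address`, like put_page. */
static uint32_t toy_put_page(uint32_t page, uint32_t address)
{
    uint32_t *pde = toy_entry((address >> 20) & 0xffc);
    uint32_t table;
    if (*pde & 1) {                         /* P bit set: page table exists */
        table = *pde & 0xfffff000;
    } else {                                /* allocate a new page table */
        table = toy_get_free_page();
        if (!table)
            return 0;
        *pde = table | 7;                   /* present, read/write, user */
    }
    *toy_entry(table + ((address >> 12) & 0x3ff) * 4) = page | 7;
    return page;
}

/* Software walk of both levels; returns the physical address, or 0 if unmapped. */
static uint32_t toy_translate(uint32_t address)
{
    uint32_t pde = *toy_entry((address >> 20) & 0xffc);
    if (!(pde & 1))
        return 0;
    uint32_t pte = *toy_entry((pde & 0xfffff000) + ((address >> 12) & 0x3ff) * 4);
    if (!(pte & 1))
        return 0;
    return (pte & 0xfffff000) | (address & 0xfff);
}
```

Mapping physical page 0x00400000 at linear address 0x08049000 allocates one page table; a second mapping at 0x0804A000 falls under the same PDE and reuses that table, mirroring the (*page_table)&1 branch.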

free_page

Frees the page of physical memory that starts at physical address addr.

```c
void free_page(unsigned long addr)
{
	// If addr is below LOW_MEM (1 MB), it belongs to the kernel or the buffer
	// cache and is not handled here. If addr is at or above the top of physical
	// memory, print an error message and halt the kernel.
	if (addr < LOW_MEM) return;
	if (addr >= HIGH_MEMORY)
		panic("trying to free nonexistent page");
	// addr is valid: convert it to a page number counted from LOW_MEM,
	// page number = (addr - LOW_MEM) / 4096.
	addr -= LOW_MEM;
	addr >>= 12;
	// If the page's mem_map count is nonzero, decrement it and return.
	if (mem_map[addr]--) return;
	// The count was already 0: the page is already free, which indicates a kernel
	// bug. Restore the count to 0, print an error message and halt.
	mem_map[addr]=0;
	panic("trying to free free page");
}
```
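The reference-counting behavior of free_page can be traced with a scaled-down model. This sketch assumes a hypothetical 16-page main memory starting at 1 MB, and it returns -1 wherever the kernel would call panic(), so that the outcomes can be checked.

```c
#include <assert.h>

#define TOY_LOW_MEM   0x100000UL
#define TOY_NR_PAGES  16
#define TOY_HIGH_MEM  (TOY_LOW_MEM + TOY_NR_PAGES * 4096UL)

static unsigned char toy_map[TOY_NR_PAGES];   /* plays the role of mem_map */

/* Returns 0 on success, -1 where free_page would panic. */
static int toy_free_page(unsigned long addr)
{
    if (addr < TOY_LOW_MEM)
        return 0;                  /* kernel/buffer memory: ignored */
    if (addr >= TOY_HIGH_MEM)
        return -1;                 /* "trying to free nonexistent page" */
    addr -= TOY_LOW_MEM;
    addr >>= 12;                   /* page number relative to LOW_MEM */
    if (toy_map[addr]--)
        return 0;                  /* count was nonzero: one reference dropped */
    toy_map[addr] = 0;             /* undo the wrap-around from 0 to 255 */
    return -1;                     /* "trying to free free page" */
}
```

A page shared by two processes (count 2) needs two calls before it is free; a third call hits the "free free page" case.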

share_page

The code analysis for share_page is covered separately.

do_wp_page

```c
void do_wp_page(unsigned long error_code,unsigned long address)
{
	un_wp_page((unsigned long *)
		(((address>>10) & 0xffc) + (0xfffff000 &
		*((unsigned long *) ((address>>20) &0xffc)))));

}
```

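((address>>10) & 0xffc): the byte offset of the PTE within the page table, equal to ((address>>12) & 0x3ff) * 4. Adding it to the page table's base address gives the address of the PTE that is passed to un_wp_page.

(address>>20) & 0xffc: the offset of the PDE for the linear address in the page directory. Because the directory is at physical address 0, this is also the PDE's address, and 0xfffff000 & *(...) extracts the page table's base address from that PDE.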
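The PTE address computed inside do_wp_page can be isolated as a small function. In this sketch, pte_address is a hypothetical name, and the PDE value is passed in as a parameter instead of being read from memory at (address>>20) & 0xffc as the kernel does.

```c
#include <assert.h>
#include <stdint.h>

/* Address of the PTE mapping `address`, given the contents of its PDE:
 * page table base from the PDE plus the PTE's byte offset within the table. */
static uint32_t pte_address(uint32_t pde_value, uint32_t address)
{
    return ((address >> 10) & 0xffc) + (0xfffff000 & pde_value);
}
```

With a page table at physical address 0x3000 (PDE value 0x3007), the linear address 0x08049123 has page table index 73, so its PTE is at 0x3000 + 73 * 4 = 0x3124.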

un_wp_page

Cancels the sharing of a page. If the page is referenced only once, simply set the R/W flag of the page table entry to 1 and flush the TLB. If the page is shared, allocate a free page, replace the address in the page table entry with the new page's address, and copy the original page's data to the new page.

```c
void un_wp_page(unsigned long * table_entry)
{
	unsigned long old_page,new_page;

	old_page = 0xfffff000 & *table_entry;	// physical address of the page the entry maps
	if (old_page >= LOW_MEM && mem_map[MAP_NR(old_page)]==1) {	// page in main memory and not shared
		*table_entry |= 2;	// R/W = 1
		invalidate();		// flush the TLB
		return;
	}
	// Otherwise allocate a free page in main memory for the writing process to use on
	// its own, ending the sharing. If the original page is above LOW_MEM (so it is
	// tracked in mem_map and, having reached this point, is shared), decrement its
	// reference count. Then point the entry at the new page with U/S, R/W and P set,
	// flush the TLB, and finally copy the original page's contents to the new page.
	if (!(new_page=get_free_page()))
		oom();
	if (old_page >= LOW_MEM)
		mem_map[MAP_NR(old_page)]--;	// one fewer process shares the original page
	*table_entry = new_page | 7;		// new page: U/S = 1, R/W = 1, P = 1
	invalidate();				// flush the TLB
	copy_page(old_page,new_page);		// copy one page of data
}
```
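The copy-on-write decision in un_wp_page can be replayed on a toy model. In this sketch all names are hypothetical: page frames are identified by index, a PTE is the frame index shifted left by 12 plus flag bits, and refcnt[] plays the role of mem_map. Every frame is treated as lying above LOW_MEM, and there is no TLB to flush.

```c
#include <assert.h>
#include <string.h>

#define NPAGES 8
static unsigned char frames[NPAGES][4096];
static unsigned char refcnt[NPAGES];          /* plays the role of mem_map */

static int alloc_frame(void)
{
    for (int i = NPAGES - 1; i >= 0; i--)
        if (refcnt[i] == 0) { refcnt[i] = 1; return i; }
    return -1;
}

/* Make the page mapped by *pte writable for the faulting process. */
static int toy_un_wp_page(unsigned long *pte)
{
    int old = (int)(*pte >> 12);
    if (refcnt[old] == 1) {                   /* not shared: just allow writes */
        *pte |= 2;                            /* set R/W */
        return 0;
    }
    int new_frame = alloc_frame();
    if (new_frame < 0)
        return -1;                            /* the kernel calls oom() here */
    refcnt[old]--;                            /* one fewer sharer of the old page */
    *pte = ((unsigned long)new_frame << 12) | 7;
    memcpy(frames[new_frame], frames[old], 4096);  /* copy_page(old, new) */
    return 0;
}
```

If frame 0 is shared by two processes and mapped read-only in both, the first writer receives a private copy and the count on frame 0 drops to 1; when the second process then writes, it finds itself the only user and just has its R/W bit set, with no copy.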

copy_page

Copies one page (4 KB) of memory from from to to.

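```c
// Copy one page (1024 dwords) from from to to: cld selects the forward direction,
// and rep movsl copies ecx dwords from esi to edi.
#define copy_page(from,to) \
__asm__("cld ; rep ; movsl"::"S" (from),"D" (to),"c" (1024))

// Flush the page translation cache (TLB) by reloading CR3 with 0,
// the address of the page directory.
#define invalidate() \
__asm__("movl %%eax,%%cr3"::"a" (0))
```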

Reference:

Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A

x86 Instruction Set Reference: SCAS/SCASB/SCASW/SCASD

x86 Instruction Set Reference: REP/REPE/REPZ/REPNE/REPNZ