
The core of Linux kernel memory management is paging: the page directory and page tables are used to service memory allocation and release requests from the rest of the kernel. Memory is managed in units of pages (4 KB of contiguous physical memory). Page directory entries and page table entries address specific pages and track their usage.

| File | Location | Function |
| --- | --- | --- |
| head.s | linux/boot/head.s | Initializes the page directory and the page tables. |
| main.c | linux/init/main.c | Plans and initializes physical memory. |
| memory.c | linux/mm/memory.c | Core of memory page management: memory initialization, page directory and page table management. |
| page.s | linux/mm/page.s | Page fault exception handling (INT 14): missing-page and write-protection faults. |
| swap.c (Linux 0.12) | linux/mm/swap.c | Virtual memory swapping. |

graph LR
o((start)) -- 0xffff0 --> A
A[ROM BIOS] -- 0x7c00 --> B[bootsect.s]
B -- 1. 0x90000 --> B
B -- 2. 0x90200 --> C[setup.s]
B -- 3. 0x10000 --> D[system-head.s]
C -- 0x00000 --> D

13.1 Data structures and global variables related to memory management

13.1.1 Global variable mem_map[]

linux/mm/memory.c

mem_map is a byte array with one element per page of physical memory above the first 1 MB (PAGING_PAGES elements). Each element holds the number of times the corresponding page is in use. When a physical page is allocated, its count is incremented by 1; a count of 0 means the page is free. The elements that belong to the main memory area are initialized to 0.

static unsigned char mem_map [ PAGING_PAGES ] = {0};	/* PAGING_PAGES = 15 MB / 4 KB = 3840 */

13.1.2 Swap bitmap swap_bitmap

linux/mm/swap.c (Linux 0.12)

Each bit represents one swap page. A bit value of 0 means the swap page is occupied; a value of 1 means it is available.

static char * swap_bitmap=NULL;
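A minimal sketch of how a bitmap with this convention (1 = free, 0 = occupied) can be queried and updated. The helper names bit_test, bit_set, bit_claim and claim_first_free are hypothetical and written for illustration; they are not the kernel's own bit-operation macros.

```c
#include <limits.h>

/* Hypothetical helpers for a swap-style bitmap:
 * bit = 1 means the swap page is free, bit = 0 means it is occupied. */

/* Return the current value (0 or 1) of bit nr. */
int bit_test(const unsigned char *map, unsigned nr)
{
	return (map[nr / CHAR_BIT] >> (nr % CHAR_BIT)) & 1;
}

/* Mark swap page nr as free. */
void bit_set(unsigned char *map, unsigned nr)
{
	map[nr / CHAR_BIT] |= (unsigned char)(1u << (nr % CHAR_BIT));
}

/* Claim swap page nr: clear its bit and return the previous value,
 * so a return value of 1 means the page was free and is now taken. */
int bit_claim(unsigned char *map, unsigned nr)
{
	int was_free = bit_test(map, nr);

	map[nr / CHAR_BIT] &= (unsigned char)~(1u << (nr % CHAR_BIT));
	return was_free;
}

/* Claim the first free swap page among nbits pages.
 * Returns its number, or -1 if every page is occupied. */
int claim_first_free(unsigned char *map, unsigned nbits)
{
	unsigned nr;

	for (nr = 0; nr < nbits; nr++)
		if (bit_claim(map, nr))
			return (int)nr;
	return -1;
}
```

Returning the previous bit value from bit_claim lets the caller test and take a page in one step, which is why the search loop needs no separate check.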

13.2 Initializing Memory Management

13.2.1 Initializing the paging mechanism

linux/boot/head.s

The page tables are set up in head.s; this is the initialization of the paging mechanism.

Main tasks of head.s:

  1. Initialize the GDT and IDT.
  2. Check that the A20 address line is enabled (in real mode the kernel can only address less than 1 MB of physical memory; to address 16 MB, the A20 line must be enabled).
  3. Initialize the page directory and the page tables in preparation for paged memory management. This is the subject of this section.

The head.s code sits at the very beginning of the kernel module, which is placed at the start of physical memory. The paging setup puts the page directory at physical address 0, followed immediately by four page tables. The page directory and each page table occupy one 4 KB page and hold 1024 entries of 4 bytes each (1024 x 4 B). In total five pages ((1+4) x 4 KB) are used: the page directory indexes the four page tables, each page table maps 4 MB, and together they cover 16 MB of physical memory, as shown in the figure below.

Note: because the code of head.s is itself located at the start of physical memory, the page directory overwrites the code that precedes .org 0x1000 (line 116), and the rest of the code and data in head.s is placed from 0x5000 onward. At the beginning of head.s, the label pg_dir (line 16) marks the start of the page directory at offset 0x0000 (shared by all processes). Starting at .org 0x1000 (line 116), the four kernel page tables pg0, pg1, pg2 and pg3 are defined.

/*
 *  linux/boot/head.s
 *
 *  (C) 1991  Linus Torvalds
 */

/*
 *  head.s contains the 32-bit startup code.
 *
 * NOTE!!! Startup happens at absolute address 0x00000000, which is also where
 * the page directory will exist. The startup code will be overwritten by
 * the page directory.
 */
.globl idt,gdt,pg_dir,tmp_floppy_area
pg_dir:				# page directory location
.globl startup_32
startup_32:
	movl $0x10,%eax
	mov %ax,%ds
	mov %ax,%es
	mov %ax,%fs
	mov %ax,%gs
	lss stack_start,%esp
	call setup_idt
	call setup_gdt
	movl $0x10,%eax		# reload all the segment registers
	mov %ax,%ds		# after changing gdt. CS was already
	mov %ax,%es		# reloaded in 'setup_gdt'
	mov %ax,%fs
	mov %ax,%gs
	lss stack_start,%esp
	xorl %eax,%eax
1:	incl %eax		# check that A20 really IS enabled
	movl %eax,0x000000	# loop forever if it isn't
	cmpl %eax,0x100000
	je 1b
/*
 * NOTE! 486 should set bit 16, to check for write-protect in supervisor
 * mode. Then it would be unnecessary with the "verify_area()"-calls.
 * 486 users probably want to set the NE (#5) bit also, so as to use
 * int 16 for math errors.
 */
	movl %cr0,%eax		# check math chip
	andl $0x80000011,%eax	# Save PG,PE,ET
/* "orl $0x10020,%eax" here for 486 might be good */
	orl $2,%eax		# set MP
	movl %eax,%cr0
	call check_x87
	jmp after_page_tables

/*
 * We depend on ET to be correct. This checks for 287/387.
 */
check_x87:
	fninit
	fstsw %ax
	cmpb $0,%al
	je 1f			/* no coprocessor: have to set bits */
	movl %cr0,%eax
	xorl $6,%eax		/* reset MP, set EM */
	movl %eax,%cr0
	ret
.align 2
1:	.byte 0xDB,0xE4		/* fsetpm for 287, ignored by 387 */
	ret

/*
 *  setup_idt
 *
 *  sets up a idt with 256 entries pointing to
 *  ignore_int, interrupt gates. It then loads
 *  idt. Everything that wants to install itself
 *  in the idt-table may do so themselves. Interrupts
 *  are enabled elsewhere, when we can be relatively
 *  sure everything is ok. This routine will be over-
 *  written by the page tables.
 */
setup_idt:
	lea ignore_int,%edx
	movl $0x00080000,%eax
	movw %dx,%ax		/* selector = 0x0008 = cs */
	movw $0x8E00,%dx	/* interrupt gate - dpl=0, present */
	lea idt,%edi
	mov $256,%ecx
rp_sidt:
	movl %eax,(%edi)
	movl %edx,4(%edi)
	addl $8,%edi
	dec %ecx
	jne rp_sidt
	lidt idt_descr
	ret

/*
 *  setup_gdt
 *
 *  This routines sets up a new gdt and loads it.
 *  Only two entries are currently built, the same
 *  ones that were built in init.s. The routine
 *  is VERY complicated at two whole lines, so this
 *  rather long comment is certainly needed :-).
 *  This routine will be overwritten by the page tables.
 */
setup_gdt:
	lgdt gdt_descr
	ret

/*
 * I put the kernel page tables right after the page directory,
 * using 4 of them to span 16 Mb of physical memory. People with
 * more than 16MB will have to expand this.
 */
.org 0x1000
pg0:

.org 0x2000
pg1:

.org 0x3000
pg2:

.org 0x4000
pg3:

.org 0x5000
/*
 * tmp_floppy_area is used by the floppy-driver when DMA cannot
 * reach to a buffer-block. It needs to be aligned, so that it isn't
 * on a 64kB border.
 */
tmp_floppy_area:
	.fill 1024,1,0

after_page_tables:
	pushl $0		# These are the parameters to main :-)
	pushl $0
	pushl $0
	pushl $L6		# return address for main, if it decides to.
	pushl $main
	jmp setup_paging
L6:
	jmp L6			# main should never return here, but
				# just in case, we know what happens.

/* This is the default interrupt "handler" :-) */
int_msg:
	.asciz "Unknown interrupt\n\r"
.align 2
ignore_int:
	pushl %eax
	pushl %ecx
	pushl %edx
	push %ds
	push %es
	push %fs
	movl $0x10,%eax
	mov %ax,%ds
	mov %ax,%es
	mov %ax,%fs
	pushl $int_msg
	call printk
	popl %eax
	pop %fs
	pop %es
	pop %ds
	popl %edx
	popl %ecx
	popl %eax
	iret


/*
 * Setup_paging
 *
 * This routine sets up paging by setting the page bit
 * in cr0. The page tables are set up, identity-mapping
 * the first 16MB. The pager assumes that no illegal
 * addresses are produced (ie >4Mb on a 4Mb machine).
 *
 * NOTE! Although all physical memory should be identity
 * mapped by this routine, only the kernel page functions
 * use the >1Mb addresses directly. All "normal" functions
 * use just the lower 1Mb, or the local data space, which
 * will be mapped to some other place - mm keeps track of
 * that.
 *
 * For those with more memory than 16 Mb - tough luck. I've
 * not got it, why should you :-) The source is here. Change
 * it. (Seriously - it shouldn't be too difficult. Mostly
 * change some constants etc. I left it at 16Mb, as my machine
 * even cannot be extended past that (ok, but it was cheap :-)
 * I've tried to show which constants to change by having
 * some kind of marker at them (search for "16Mb"), but I
 * won't guarantee that's all :-( )
 */
.align 2
setup_paging:
	movl $1024*5,%ecx		/* 5 pages - pg_dir+4 page tables */
	xorl %eax,%eax
	xorl %edi,%edi			/* pg_dir is at 0x000 */
	cld;rep;stosl			/* store eax at es:edi; edi advances by 4 each time */
	movl $pg0+7,pg_dir		/* set present bit/user r/w */
	movl $pg1+7,pg_dir+4		/*  --------- " " --------- */
	movl $pg2+7,pg_dir+8		/*  --------- " " --------- */
	movl $pg3+7,pg_dir+12		/*  --------- " " --------- */
	movl $pg3+4092,%edi
	movl $0xfff007,%eax		/*  16Mb - 4096 + 7 (r/w user,p) */
					/* each entry: mapped physical address + page flags */
	std				/* direction flag set: edi decreases by 4 per stosl */
1:	stosl			/* fill pages backwards - more efficient :-) */
	subl $0x1000,%eax
	jge 1b
	cld
	xorl %eax,%eax		/* pg_dir is at 0x0000 */
	movl %eax,%cr3		/* cr3 - page directory start */
	movl %cr0,%eax
	orl $0x80000000,%eax
	movl %eax,%cr0		/* set paging (PG) bit */
	ret			/* this also flushes prefetch-queue */

.align 2
.word 0
idt_descr:
	.word 256*8-1		# idt contains 256 entries
	.long idt
.align 2
.word 0
gdt_descr:
	.word 256*8-1		# so does gdt (not that that's any
	.long gdt		# magic number, but it works for me :^)

	.align 8
idt:	.fill 256,8,0		# idt is uninitialized

gdt:	.quad 0x0000000000000000	/* NULL descriptor */
	.quad 0x00c09a0000000fff	/* 16Mb */
	.quad 0x00c0920000000fff	/* 16Mb */
	.quad 0x0000000000000000	/* TEMPORARY - don't use */
	.fill 252,8,0			/* space for LDT's and TSS's etc */


The core routine in head.s for initializing the page directory and the page tables is setup_paging (line 200).

Page directory and page table initialization steps:

  1. Reserve space in the kernel image (starting at line 116) for the page directory and four page tables, enough to map 16 MB of memory.

  2. Clear the page directory and the four page tables, as shown in lines 201 through 204.

  3. Set the page directory entries. The kernel has four page tables, so four entries in the page directory are set to point to them, as shown in lines 205 to 208. The first page table is at address 0x1000 and its attribute bits are 0x7, meaning that the page is present, user-accessible, and readable and writable. Because each entry is 4 bytes in size, pg_dir+4 addresses the next entry.

  4. Set the entries of each page table. Each page table is 4 x 1024 bytes in size; together the four tables hold 4096 entries, for physical page numbers 0 to 0xFFF, as shown in lines 209 to 214. Filling starts at pg3+4092, the last entry of the last page table, and proceeds backward. Each entry holds the physical address of the page it maps plus the attribute bits 0x7. After each entry, 4 KB (0x1000) is subtracted from eax, and the loop continues until the result becomes negative, at which point all 4096 entries, that is, all 16 MB of pages, have been filled in.

The figure shows how a linear address is resolved through the page directory and a page table to a location in physical memory. The 32-bit linear address is divided into three parts: the page directory index (10 bits), the page table index (10 bits), and the offset within the page (12 bits). Splitting the linear address this way implements the mapping from linear addresses to physical addresses, as shown in the figure.

  5. Set the start address of the page directory. The start address of the page directory (0) is loaded into register CR3, as shown in lines 216 through 217.
  6. Enable paging. As shown in lines 218 through 221, the highest bit (PG) of the CR0 register is set to 1 to turn on paging.

The page directory occupies one page of memory (4 KB). Each of its entries takes 4 bytes, so it can address 1024 page tables. Each page table also occupies one page and can address 1024 pages, so one page directory can address 4 GB of memory. In this kernel, all processes share a single page directory. The kernel code and data segments are both 16 MB long and use four page tables that map 16 MB of physical memory. For the kernel segment, the linear address is the physical address.
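The identity mapping built by setup_paging can be reproduced in user space as a sketch. The arrays sim_pg_dir and sim_pg_tab and the function names below are hypothetical stand-ins for the real structures at physical addresses 0x0000 to 0x4fff; the table addresses 0x1000 to 0x4000 and the flag value 7 are taken from the listing above.

```c
#include <assert.h>
#include <stdint.h>

#define ENTRIES    1024   /* entries per page directory or page table */
#define NR_TABLES  4      /* four page tables cover 16 MB */
#define PAGE_FLAGS 7u     /* present | read/write | user */

/* Simulated page directory and page tables (ordinary arrays here). */
uint32_t sim_pg_dir[ENTRIES];
uint32_t sim_pg_tab[NR_TABLES][ENTRIES];

/* Build the same contents that setup_paging writes. */
void sim_setup_paging(void)
{
	int t, e;

	for (e = 0; e < ENTRIES; e++)
		sim_pg_dir[e] = 0;
	for (t = 0; t < NR_TABLES; t++) {
		/* directory entry = address of page table t + flags;
		 * pg0..pg3 are at 0x1000, 0x2000, 0x3000, 0x4000 */
		sim_pg_dir[t] = (0x1000u * (uint32_t)(t + 1)) | PAGE_FLAGS;
		for (e = 0; e < ENTRIES; e++) {
			uint32_t page = (uint32_t)(t * ENTRIES + e);

			/* identity mapping: frame address = page number * 4 KB */
			sim_pg_tab[t][e] = (page << 12) | PAGE_FLAGS;
		}
	}
}

/* Translate a linear address below 16 MB through the simulated tables. */
uint32_t sim_translate(uint32_t linear)
{
	uint32_t dir_idx = linear >> 22;            /* top 10 bits */
	uint32_t tab_idx = (linear >> 12) & 0x3ff;  /* middle 10 bits */
	uint32_t offset  = linear & 0xfff;          /* low 12 bits */

	assert(dir_idx < NR_TABLES);                /* only 16 MB is mapped */
	return (sim_pg_tab[dir_idx][tab_idx] & 0xfffff000u) | offset;
}
```

After sim_setup_paging(), the last entry of the last table holds 0xfff007, the same starting value that the assembly loads into eax before filling the tables backward, and sim_translate() returns its argument unchanged for any address below 16 MB.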

13.2.2 Planning Physical Memory Space

linux/init/main.c

Physical memory space planning in main().

// Part of main.c
#define EXT_MEM_K (*(unsigned short *)0x90002)	// Extended memory size beyond 1 MB (KB)
static long memory_end = 0;	// Amount of physical memory in the machine (bytes)
static long buffer_memory_end = 0;	// End address of the buffer cache
static long main_memory_start = 0;	// Start address of the main memory area
void main(void)		/* This really IS void, no error here. */
{			/* The startup routine assumes (well, ...) this */
/*
 * Interrupts are still disabled. Do necessary setups, then
 * enable them
 */
 	ROOT_DEV = ORIG_ROOT_DEV;
 	drive_info = DRIVE_INFO;
	memory_end = (1<<20) + (EXT_MEM_K<<10);	// Memory size = 1 MB + extended memory (KB) x 1024 bytes
	memory_end &= 0xfffff000;	// Ignore the trailing part smaller than 4 KB
	if (memory_end > 16*1024*1024)	// If the memory exceeds 16MB, use 16MB
		memory_end = 16*1024*1024;
	if (memory_end > 12*1024*1024) 
		buffer_memory_end = 4*1024*1024;	// Set the buffer end address
	else if (memory_end > 6*1024*1024)
		buffer_memory_end = 2*1024*1024;
	else
		buffer_memory_end = 1*1024*1024;
	main_memory_start = buffer_memory_end;	// Start of main memory = end of buffer
#ifdef RAMDISK_SIZE
	main_memory_start += rd_init(main_memory_start, RAMDISK_SIZE*1024);	// Reserve RAM disk space at the start of main memory
#endif
	mem_init(main_memory_start,memory_end);
}

main.c mainly performs kernel initialization, including block devices, character devices and so on, as well as the manual setup of the first task. The code above shows the memory planning part of main(): it normalizes the memory size, determines where main memory starts, reserves the RAM disk space, and calls the main memory initialization function in memory.c.

Physical memory space planning steps:

  1. Normalize the memory size. Line 14 computes the memory size as the first 1 MB plus the extended memory. EXT_MEM_K (defined on line 2) reads the 16-bit extended memory size, in KB, that the boot code stored at address 0x90002, and the result is kept in memory_end (defined on line 3). For example, with 15 MB of extended memory EXT_MEM_K is 15360 and memory_end is 0x100000 + 0xF00000 = 0x1000000 (16 MB). Line 15 discards any trailing part smaller than 4 KB by clearing the low 12 bits. Lines 16~17 cap the memory size at 16 MB if it exceeds that.

  2. Determine where main memory starts. Lines 18-23 set the buffer end address (buffer_memory_end, defined on line 4) according to the memory size: 4 MB if memory exceeds 12 MB, 2 MB if it exceeds 6 MB, and 1 MB otherwise. See the image below for details of the branches. Line 24 assigns the buffer end address to the start address of main memory (main_memory_start, defined on line 5), which marks where the main memory area begins.

  3. Reserve the RAM disk space. If RAMDISK_SIZE is defined, lines 25 to 27 call rd_init() in kernel/blk_drv/ramdisk.c and advance main_memory_start by the size of the RAM disk.

  4. Call the main memory initialization function in memory.c. Line 28 calls mem_init(main_memory_start,memory_end) from mm/memory.c to initialize the main memory area.
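The planning steps above can be sketched as a pure function that is easy to try with different memory sizes. The struct mem_plan and the function plan_memory are hypothetical names introduced for illustration; the sketch also assumes that the RAM disk simply occupies its configured size right after the buffer area.

```c
/* Hypothetical container for the three values that main() computes. */
struct mem_plan {
	long memory_end;         /* top of usable physical memory (bytes) */
	long buffer_memory_end;  /* end of the buffer cache area */
	long main_memory_start;  /* start of the main memory area */
};

/* ext_mem_k: extended memory above 1 MB, in KB (what EXT_MEM_K reads).
 * ramdisk_k: RAM disk size in KB, or 0 when no RAM disk is configured. */
struct mem_plan plan_memory(unsigned short ext_mem_k, long ramdisk_k)
{
	struct mem_plan p;

	p.memory_end = (1L << 20) + ((long)ext_mem_k << 10);
	p.memory_end &= ~0xfffL;                 /* drop any partial page */
	if (p.memory_end > 16L * 1024 * 1024)    /* cap at 16 MB */
		p.memory_end = 16L * 1024 * 1024;

	if (p.memory_end > 12L * 1024 * 1024)
		p.buffer_memory_end = 4L * 1024 * 1024;
	else if (p.memory_end > 6L * 1024 * 1024)
		p.buffer_memory_end = 2L * 1024 * 1024;
	else
		p.buffer_memory_end = 1L * 1024 * 1024;

	/* Assumption: the RAM disk takes ramdisk_k KB directly after
	 * the buffer area, and main memory starts behind it. */
	p.main_memory_start = p.buffer_memory_end + ramdisk_k * 1024;
	return p;
}
```

For a machine with 15 MB of extended memory and a 512 KB RAM disk, plan_memory(15360, 512) yields a memory end of 16 MB, a buffer end of 4 MB and a main memory start of 4.5 MB.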

13.2.3 Initialization of global mem_map[]

Initialization of the global mem_map[] variable in mem_init().

The mem_init() function sets up allocation management for the main memory area. The mem_map[] array records the status of physical memory pages: each byte holds the use count of one page, and 0 means that the page is free. When a physical page is allocated, the corresponding byte is incremented, so a newly allocated page has a count of 1.

// memory.c
#define PAGING_MEMORY (15*1024*1024)
#define PAGING_PAGES (PAGING_MEMORY>>12)
#define MAP_NR(addr) (((addr)-LOW_MEM)>>12)
#define USED 100
static unsigned char mem_map [ PAGING_PAGES ] = {0};
void mem_init(long start_mem, long end_mem)
{
	int i;
	HIGH_MEMORY = end_mem;	// Record the top of physical memory (at most 16 MB)
	for (i=0 ; i<PAGING_PAGES ; i++)
		mem_map[i] = USED;
	i = MAP_NR(start_mem);
	end_mem -= start_mem;
	end_mem >>= 12;
	while (end_mem-->0)
		mem_map[i++]=0;
}

mem_map[] is initialized as follows:

  1. Compute the number of pages above the 1 MB kernel area (PAGING_PAGES, line 3).
  2. Mark every entry as USED (100); this covers the buffer cache area and the RAM disk area, if any (lines 11-12).
  3. Clear the entries that belong to the main memory area (lines 16 to 17).

Take a machine with 16 MB of memory as an example. With 1 MB of kernel space, mem_map manages the remaining 15 MB: (16 MB - 1 MB) / 4 KB = 3840 entries, i.e. PAGING_PAGES is 3840. If main memory starts at 4.5 MB (the space below it includes the buffer cache and the RAM disk), the main memory area has (16 MB - 4.5 MB) / 4 KB = 2944 entries. The first 896 entries of mem_map are therefore set to 100 and the remaining 2944 entries are set to 0, ready to be handed out by the page allocator. The following figure shows the mem_map initialization result.
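The 896/2944 split in this example can be checked with a user-space copy of the initialization loop. The names sim_mem_map, sim_mem_init and count_entries are hypothetical; the macros mirror those in the listing above, with LOW_MEM assumed to be 1 MB as the text describes.

```c
#define LOW_MEM       0x100000L               /* assumed: 1 MB kernel area */
#define PAGING_MEMORY (15L * 1024 * 1024)
#define PAGING_PAGES  (PAGING_MEMORY >> 12)   /* 3840 pages */
#define MAP_NR(addr)  (((addr) - LOW_MEM) >> 12)
#define USED          100

unsigned char sim_mem_map[PAGING_PAGES];

/* Same logic as mem_init(): mark everything USED, then clear the
 * entries between start_mem and end_mem (the main memory area). */
void sim_mem_init(long start_mem, long end_mem)
{
	long i;

	for (i = 0; i < PAGING_PAGES; i++)
		sim_mem_map[i] = USED;
	i = MAP_NR(start_mem);
	end_mem -= start_mem;
	end_mem >>= 12;
	while (end_mem-- > 0)
		sim_mem_map[i++] = 0;
}

/* Count how many entries currently hold the given value. */
long count_entries(unsigned char value)
{
	long i, n = 0;

	for (i = 0; i < PAGING_PAGES; i++)
		if (sim_mem_map[i] == value)
			n++;
	return n;
}
```

Calling sim_mem_init(0x480000, 0x1000000), i.e. main memory from 4.5 MB to 16 MB, leaves 896 entries at USED and 2944 entries at 0, matching the figures in the paragraph above.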

13.3 Physical Memory

13.3.1 Allocation

linux/mm/swap.c

Function involved: get_free_page() (swap.c in Linux 0.12, memory.c in Linux 0.11)

The get_free_page() function uses GCC inline assembly to find a free physical page in the main memory area. If a free page exists, it returns the start address of that page; otherwise it returns 0. Code that is called this frequently was written in assembly to improve performance.

unsigned long get_free_page(void)
{
register unsigned long __res asm("ax");
repeat:
	__asm__("std ; repne ; scasb\n\t"
		"jne 1f\n\t"
		"movb $1,1(%%edi)\n\t"
		"sall $12,%%ecx\n\t"
		"addl %2,%%ecx\n\t"
		"movl %%ecx,%%edx\n\t"
		"movl $1024,%%ecx\n\t"
		"leal 4092(%%edx),%%edi\n\t"
		"rep ; stosl\n\t"
		"movl %%edx,%%eax\n"
		"1:"
		:"=a" (__res)
		:"0" (0),"i" (LOW_MEM),"c" (PAGING_PAGES),
		"D" (mem_map+PAGING_PAGES-1)
		:"di","cx","dx");
	if (__res >= HIGH_MEMORY)
		goto repeat;
	if (!__res && swap_out())	// If no free page is found, try swapping out and search again
		goto repeat;
	return __res;	// Returns the address of the free physical page
}

Input of the function (lines 17 to 18)

Format: "operand constraint" (input expression), where the constraint can be a register constraint, a memory constraint, an immediate constraint, or a general constraint.

  • "0" (0): uses the same register as operand 0, i.e. ax, and initializes it to 0.
  • "i" (LOW_MEM): an integer immediate; the low memory boundary is encoded directly in the instruction without using a register.
  • "c" (PAGING_PAGES): uses %ecx/%cx/%cl, i.e. ecx = PAGING_PAGES (0xf00).
  • "D" (mem_map+PAGING_PAGES-1): uses %edi/%di, i.e. edi = mem_map+PAGING_PAGES-1, pointing to the last byte of mem_map.

Common constraints (all usable for inputs and outputs):

  • r: a general-purpose register, chosen by GCC.
  • q: one of %eax, %ebx, %ecx, %edx, chosen by GCC.
  • g: a general-purpose register or a memory address.
  • m: a memory address.
  • a: %eax/%ax/%al.
  • b: %ebx/%bx/%bl.
  • c: %ecx/%cx/%cl.
  • d: %edx/%dx/%dl.
  • D: %edi/%di.
  • S: %esi/%si.

The output of the function (line 16)

  • "=a" (__res): __res is bound to ax. The start address of the physical page, which is the return value of the function, is left in eax.

Function execution flow (excluding lines 20~23; the line markers refer to the disassembly)

  1. The disassembly of the function is shown in the figure below. Up to line marker 51, the input operands are loaded into their registers and the scan for a zero byte is prepared; the rest of the function performs the main work and returns.
  2. The scas instruction compares the value in al (or ax) with the byte (or word) of the destination string at es:edi, and is usually combined with repnz (repeat while not equal) or repz (repeat while equal) (line marker 52). With the direction flag set, each step decrements ecx (the remaining page count) and moves edi backward through mem_map from its last byte, until a matching byte (0) is found.
  3. If a free page is found, set the mem_map entry of that page to 1 (line marker 56).
  4. Multiply the page number by 4 KB to get the page offset, then add the base address LOW_MEM to get the physical start address (line markers 5a, 5d).
  5. Clear the page by storing 1024 zero long words, starting at its last long word (the rep ; stosl that follows leal 4092(%%edx),%%edi).
  6. Save the physical page start address to eax (line 72).
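The inline assembly is dense, so a plain C equivalent of the same search may help. This is a user-space sketch, not the kernel routine: sim_map, sim_ram and sim_get_free_page are hypothetical names, the toy system has only 8 pages, and the swap_out() retry and the HIGH_MEMORY check are left out.

```c
#include <string.h>

#define SIM_PAGES     8            /* toy system: 8 pages above 1 MB */
#define SIM_PAGE_SIZE 4096
#define SIM_LOW_MEM   0x100000UL

unsigned char sim_map[SIM_PAGES];                  /* use counts */
unsigned char sim_ram[SIM_PAGES][SIM_PAGE_SIZE];   /* stands in for RAM */

/* Scan the map backward for a zero byte (what "std ; repne ; scasb"
 * does), mark the page as used, zero it, and return its address.
 * Returns 0 when no page is free. */
unsigned long sim_get_free_page(void)
{
	int nr;

	for (nr = SIM_PAGES - 1; nr >= 0; nr--) {
		if (sim_map[nr] == 0) {
			sim_map[nr] = 1;                        /* movb $1,1(%edi) */
			memset(sim_ram[nr], 0, SIM_PAGE_SIZE);  /* rep ; stosl */
			/* page number * 4 KB + LOW_MEM (sall $12 / addl) */
			return SIM_LOW_MEM + ((unsigned long)nr << 12);
		}
	}
	return 0;
}
```

Because the scan starts at the end of the map, pages at higher addresses are handed out first, which is the same order the assembly version produces.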

13.3.2 Release

linux/mm/memory.c

Functions involved: free_page()

The free_page(addr) function frees the page of physical memory that starts at physical address addr.

/*
 * Free a page of memory at physical address 'addr'. Used by
 * 'free_page_tables()'
 */
void free_page(unsigned long addr)
{
	if (addr < LOW_MEM) return;	// Below 1 MB (kernel and buffer): ignore
	if (addr >= HIGH_MEMORY)
		panic("trying to free nonexistent page");
	addr -= LOW_MEM;
	addr >>= 12;	// page number
	if (mem_map[addr]--) return;
	mem_map[addr]=0;
	panic("trying to free free page");
}

Function input: the physical address of the page.

Function execution flow

  1. Check the argument. Addresses in the kernel and buffer area, and addresses at or above the top of physical memory, are invalid. If addr is below 1 MB, the request targets a page of the kernel or buffer and is ignored (line 7). If addr is at or beyond the top of the system's physical memory, an error message is printed and the kernel halts (lines 8~9), because this indicates a bug in the caller.
  2. Convert the physical address addr to a page number in units of 4 KB (lines 10-11).
  3. If the mem_map entry indexed by the page number is not 0, decrement the reference count and return (line 12).
  4. If mem_map[addr] is already 0, the post-decrement wraps the unsigned byte around to 255, so the code resets mem_map[addr] to 0 and then panics (lines 13 to 14), which again indicates a bug in the caller.
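The post-decrement test in free_page() is easy to misread, so here is a user-space sketch of the same control flow. The names f_map, sim_free_page and the enum values are hypothetical, the toy system has 8 pages above 1 MB, and the panic() calls are replaced by a return code so the behaviour can be observed.

```c
#define F_LOW_MEM  0x100000UL
#define F_HIGH_MEM 0x108000UL   /* toy system: 8 pages above 1 MB */

unsigned char f_map[8];         /* use counts, one per page */

enum free_result { FREE_OK, FREE_IGNORED, FREE_PANIC };

enum free_result sim_free_page(unsigned long addr)
{
	if (addr < F_LOW_MEM)
		return FREE_IGNORED;   /* kernel and buffer pages are never freed */
	if (addr >= F_HIGH_MEM)
		return FREE_PANIC;     /* "trying to free nonexistent page" */
	addr -= F_LOW_MEM;
	addr >>= 12;               /* page number */
	if (f_map[addr]--)         /* tests the old value, then decrements */
		return FREE_OK;
	/* The old value was 0, so the unsigned byte wrapped to 255.
	 * Restore it before reporting the error. */
	f_map[addr] = 0;
	return FREE_PANIC;         /* "trying to free free page" */
}
```

A page with a count of 2 can be freed twice successfully; the third attempt finds a count of 0 and takes the error path, leaving the count at 0 rather than 255.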

13.3.3 Statistics

linux/mm/memory.c

Function involved: show_mem() (Linux 0.12)

The show_mem() function displays memory usage information.

void show_mem(void)
{
	int i,j,k,free=0,total=0;
	int shared=0;
	unsigned long * pg_tbl;
	printk("Mem-info:\n\r");
	for(i=0 ; i<PAGING_PAGES ; i++) {
		if (mem_map[i] == USED)	// Skip pages that cannot be allocated
			continue;
		total++;	// Total pages in the main memory area, free and shared alike
		if (!mem_map[i])
			free++;	// Free pages in the main memory area
		else
			shared += mem_map[i]-1;	// The number of pages shared
	}
	printk("%d free pages of %d\n\r",free,total);
	printk("%d pages shared\n\r",shared);
	k = 0;	// A process occupies page statistics
	for(i=4 ; i<1024 ;) {	// The first four entries of the page directory table are used by kernel code
		if (1&pg_dir[i]) {
			if (pg_dir[i]>HIGH_MEMORY) {
				printk("page directory[%d]: %08X\n\r",
					i,pg_dir[i]);
				continue;
			}
			if (pg_dir[i]>LOW_MEM)
				free++,k++;
			pg_tbl=(unsigned long *) (0xfffff000 & pg_dir[i]);
			for(j=0 ; j<1024 ; j++)
				if ((pg_tbl[j]&1) && pg_tbl[j]>LOW_MEM)
					if (pg_tbl[j]>HIGH_MEMORY)
						printk("page_dir[%d][%d]: %08X\n\r",
							i,j, pg_tbl[j]);
					else
						k++,free++;
		}
		// The linear space of each task is 64 MB, so a task occupies 16 page directory entries.
		// After every 16 entries, report the pages used by that process, plus one page for its task structure.
		i++;
		if (!(i&15) && k) {	// k != 0 means the corresponding process exists
			k++,free++;	/* one page/process for task_struct */
			printk("Process %d: %d pages\n\r",(i>>4)-1,k);	// Print the process number and its page count k, then clear k for the next process
			k = 0;
		}
	}
	printk("Memory found: %d (%d)\n\r",free-shared,total);
}

13.4 Page Tables

13.4.1 Release

linux/mm/memory.c

Function involved: free_page_tables() (Linux 0.12)

This function frees a contiguous block of page tables together with the pages they map. It works in units of 4 MB, the space mapped by one page table. Its arguments are the starting linear address of the region and the size in bytes to free; it releases the mapped pages and clears the corresponding table entries.

/*
 * This function frees a continuos block of page tables, as needed
 * by 'exit()'. As does copy_page_tables(), this handles only 4Mb blocks.
 */
int free_page_tables(unsigned long from,unsigned long size)	// Starting linear address and number of bytes to free
{
	unsigned long *pg_table;
	unsigned long * dir, nr;

	if (from & 0x3fffff)	// from must be aligned on a 4 MB boundary
		panic("free_page_tables called with wrong alignment");
	if (!from)
		panic("Trying to free up swapper memory space");
	size = (size + 0x3fffff) >> 22;	// Number of page tables (page directory entries) to free
	dir = (unsigned long *) ((from>>20) & 0xffc); /* _pg_dir = 0 */ // Address of the first page directory entry
	for ( ; size-->0 ; dir++) {
		if (!(1 & *dir))
			continue;
		pg_table = (unsigned long *) (0xfffff000 & *dir);	// Get the page table address
		for (nr=0 ; nr<1024 ; nr++) {
			if (*pg_table) {	// If the page table entry is not 0
				if (1 & *pg_table)	// Present: free the mapped page
					free_page(0xfffff000 & *pg_table);
				else	// Not present: free the page on the swap device
					swap_free(*pg_table >> 1);
				*pg_table = 0;	// Clear the page table entry
			}
			pg_table++;	// point to the next item in the page table
		}
		free_page(0xfffff000 & *dir);	// Release the memory page occupied by the table
		*dir = 0;
	}
	invalidate();
	return 0;
}

Function execution flow:

  1. First validate the arguments (lines 10-13). from must lie on a 4 MB boundary (4 MB, 8 MB, 12 MB, 16 MB, ...) and must not be 0. If it is 0, the kernel panics, because the caller is trying to free the space occupied by the kernel and buffer.
  2. Then convert the arguments (lines 14-15). size is converted from bytes to the number of page directory entries, rounding up to a multiple of 4 MB, and from is converted to a pointer to the first directory entry. A 32-bit linear address consists of the page directory index in the top 10 bits, the page table index in the middle 10 bits, and the page offset in the low 12 bits. Adding 0x3fffff (4 MB - 1) to size and shifting right by 22 bits therefore gives the number of 4 MB blocks that size spans, which is the number of page directory entries. Each page directory entry is 4 bytes and there are 1024 of them, so the directory occupies 4 KB and the byte offset of an entry is its index times 4, a value that fits in 12 bits. Line 15 obtains that offset by shifting the linear address right by 20 bits and masking with 0xffc, which clears the low two bits. Because the page directory starts at address 0, the offset is also the pointer to the entry. Since dir is an unsigned long pointer, incrementing it by 1 advances it by 4 bytes, to the next entry.
  3. Walk the page directory entries and release the pages mapped by each page table in turn. Lines 17 to 18 skip directory entries that are not present. For a present entry, line 19 extracts the page table address by masking off the low 12 attribute bits, and line 20 starts a loop over the 1024 page table entries, each of which maps 4 KB of memory. If an entry is nonzero and present, free_page() is called on the page; if it is nonzero but not present, the corresponding page on the swap device is freed. The entry is then cleared (line 26) and the loop moves to the next entry (line 28).
  4. Free the page occupied by the page table itself (line 30).
  5. Clear the directory entry for that page table (line 31).

dir = ((address>>22)<<2) & 0xffc: address of the directory entry for a linear address (address>>22 is the page directory index)

pg_table = *dir & 0xfffff000: start address of the page table stored in this directory entry

*pg_table: entry 0 of the page table, which holds the start address of the first page frame plus its attribute bits

*(pg_table+1): entry 1 of the page table, which holds the start address of the second page frame plus its attribute bits
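The address arithmetic above can be tried out in isolation. The helper names below (dir_offset, dir_index, table_index, page_offset, dir_entries_for) are hypothetical; each one wraps a single expression from the text or the listing.

```c
#include <stdint.h>

/* Byte offset of the page directory entry for a linear address.
 * With the page directory at address 0, this is also the address of
 * the entry, which is how free_page_tables() uses it. */
uint32_t dir_offset(uint32_t linear)
{
	return (linear >> 20) & 0xffc;
}

/* Page directory index: top 10 bits. */
uint32_t dir_index(uint32_t linear)
{
	return linear >> 22;
}

/* Page table index: middle 10 bits. */
uint32_t table_index(uint32_t linear)
{
	return (linear >> 12) & 0x3ff;
}

/* Offset within the page: low 12 bits. */
uint32_t page_offset(uint32_t linear)
{
	return linear & 0xfff;
}

/* Number of directory entries (4 MB blocks) needed to cover size
 * bytes, rounded up. Assumes size + 0x3fffff does not overflow. */
uint32_t dir_entries_for(uint32_t size)
{
	return (size + 0x3fffff) >> 22;
}
```

For the linear address 64 MB (0x4000000), dir_index() gives 16 and dir_offset() gives 64, i.e. 16 entries of 4 bytes each; a size of 4 MB plus one byte needs two directory entries.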

13.4.2 Copy

linux/mm/memory.c

Functions involved: copy_page_tables()

Copies a range of linear addresses by copying page table entries rather than the pages themselves, so that source and destination share the same physical pages. The arguments are the source linear address, the destination linear address, and the number of bytes to copy. The function fills in the entries of the destination page tables from the page tables that correspond to the source linear address.

/*
 *  Well, here is one of the most complicated functions in mm. It
 * copies a range of linerar addresses by copying only the pages.
 * Let's hope this is bug-free, 'cause this one I don't want to debug :-)
 *
 * Note! We don't copy just any chunks of memory - addresses have to
 * be divisible by 4Mb (one page-directory entry), as this makes the
 * function easier. It's used only by fork anyway.
 *
 * NOTE 2!! When from==0 we are copying kernel space for the first
 * fork(). Then we DONT want to copy a full page-directory entry, as
 * that would lead to some serious memory waste - we just copy the
 * first 160 pages - 640kB. Even that is more than we need, but it
 * doesn't take any more memory - we don't copy-on-write in the low
 * 1 Mb-range, so the pages can be shared with the kernel. Thus the
 * special case for nr=xxxx.
 */
int copy_page_tables(unsigned long from,unsigned long to,long size)
{
	unsigned long * from_page_table;
	unsigned long * to_page_table;
	unsigned long this_page;
	unsigned long * from_dir, * to_dir;
	unsigned long nr;

	if ((from&0x3fffff) || (to&0x3fffff))
		panic("copy_page_tables called with wrong alignment");
	from_dir = (unsigned long *) ((from>>20) & 0xffc); /* _pg_dir = 0 */
	to_dir = (unsigned long *) ((to>>20) & 0xffc);
	size = ((unsigned) (size+0x3fffff)) >> 22;
	for( ; size-->0 ; from_dir++,to_dir++) {
		if (1 & *to_dir)
			panic("copy_page_tables: already exist");
		if (!(1 & *from_dir))
			continue;
		from_page_table = (unsigned long *) (0xfffff000 & *from_dir);
		if (!(to_page_table = (unsigned long *) get_free_page()))
			return -1;	/* Out of memory, see freeing */
		*to_dir = ((unsigned long) to_page_table) | 7;
		nr = (from==0)?0xA0:1024;
		for ( ; nr-- > 0 ; from_page_table++,to_page_table++) {
			this_page = *from_page_table;
			if (!(1 & this_page))
				continue;
			this_page &= ~2;	// Set it to read-only
			*to_page_table = this_page;	// Copy the source page entry to the destination page table
			if (this_page > LOW_MEM) {
				*from_page_table = this_page;
				this_page -= LOW_MEM;
				this_page >>= 12;
				mem_map[this_page]++;
			}
		}
	}
	invalidate();
	return 0;
}

Function execution flow:

  1. Check the parameters (lines 26 to 27). The source and destination linear addresses must both be aligned to a 4MB boundary (0, 4MB, 8MB, 12MB, ...); otherwise the kernel panics. The alignment guarantees that copying starts from the first entry of a page table.
  2. Compute the starting directory entry pointers for the source and destination addresses (from_dir and to_dir) and the number of page tables to copy (size), as shown in lines 28 to 30.
  3. For each directory entry, check the destination and the source. If the destination directory entry is already present, the kernel panics; if the source directory entry is not present, it is skipped. As shown in lines 32 to 35.
  4. Take the source page table address from the source directory entry and call get_free_page() to allocate a page for the destination page table. If get_free_page() returns 0, memory is insufficient and the function returns -1.
  5. Set the destination directory entry to the new page table address with attribute 7 (present, read/write, user level), as shown in line 39.
  6. Set the number of page table entries to copy for the current directory entry. If the source is kernel space (from == 0), only 160 entries (0xA0, covering 640KB of memory) are copied (the logical partition of physical memory is shown in the figure); otherwise all 1024 entries of the page table, mapping 4MB, are copied, as shown in line 40.
  7. Loop over the nr page table entries. If a source entry is not in use, it is skipped; otherwise its R/W bit is cleared so that the page becomes read-only, and the entry is copied into the destination page table, as shown in lines 41 to 46.
  8. If the physical page the entry points to lies above 1MB (LOW_MEM), the source entry is made read-only as well and the reference count of that page in mem_map[] is incremented, as shown in lines 47 to 51.
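The address arithmetic in the steps above can be tried in isolation. The following is a minimal user-space sketch, not kernel code; the helper names `dir_entry_offset`, `dir_entries_needed`, and `is_4mb_aligned` are hypothetical and only mirror the expressions used in copy_page_tables(), assuming the page directory sits at physical address 0.

```c
#include <stdint.h>

/* Byte offset of the page directory entry that covers a linear address.
 * Each entry is 4 bytes and covers 4 MB, so this equals (linear >> 22) * 4. */
static uint32_t dir_entry_offset(uint32_t linear)
{
	return (linear >> 20) & 0xffc;
}

/* Number of page directory entries (4 MB each) needed to cover size bytes,
 * rounding up to the next 4 MB boundary. */
static uint32_t dir_entries_needed(uint32_t size)
{
	return (size + 0x3fffff) >> 22;
}

/* Nonzero when the address lies on a 4 MB boundary. */
static int is_4mb_aligned(uint32_t linear)
{
	return (linear & 0x3fffff) == 0;
}
```

For a task whose space starts at 64MB (0x04000000), `dir_entry_offset` gives 64, that is, directory entry number 16.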

13.4.3 Mapping Linear Pages to physical Pages

linux/mm/memory.c

Functions involved: put_page(), and the function that calls it.

Map a physical page to a specified linear address by setting the corresponding entry in the page table. The function returns the physical address of the page, or 0 if memory is insufficient.

/*
 * This function puts a page in memory at the wanted address.
 * It returns the physical address of the page gotten, 0 if
 * out of memory (either when trying to access page-table or
 * page.)
 */
unsigned long put_page(unsigned long page,unsigned long address)
{
	unsigned long tmp, *page_table;

/* NOTE !!! This uses the fact that _pg_dir=0 */

	if (page < LOW_MEM || page >= HIGH_MEMORY)
		printk("Trying to put page %p at %p\n",page,address);
	if (mem_map[(page-LOW_MEM)>>12] != 1)	// page must be a newly allocated page (count of 1)
		printk("mem_map disagrees with %p at %p\n",page,address);
	page_table = (unsigned long *) ((address>>20) & 0xffc);
	if ((*page_table)&1)
		page_table = (unsigned long *) (0xfffff000 & *page_table);
	else {
		if (!(tmp=get_free_page()))
			return 0;
		*page_table = tmp|7;
		page_table = (unsigned long *) tmp;
	}
	page_table[(address>>12) & 0x3ff] = page | 7;
/* no need for invalidate */
	return page;
}

Function execution flow:

  1. Check the validity of the parameters. If the physical page address page is below 1MB (LOW_MEM) or not below the highest physical memory address (HIGH_MEMORY), a warning is printed. Lines 13~14.
  2. Check whether the page is a newly allocated page, that is, whether its mem_map[] count is 1; otherwise print a warning. Lines 15~16.
  3. From the linear address address, compute the pointer to the corresponding page directory entry (line 17 of code).
  4. Check whether the directory entry is valid. If it is (P bit set), mask off its low 12 attribute bits to obtain the page table address and save it in page_table. If it is not (the page table is not in memory), call get_free_page() to allocate a page for the page table, write its address with flags 7 into the directory entry, and store the page table address in page_table. The page directory base address is assumed to be 0. Lines 18 to 25.
  5. Set the page table entry on line 26. The low 12 bits of address (the in-page offset) are shifted out and the next 10 bits are used as the index into the page table; the entry is filled with the physical page address (of the form XXX000) combined with attribute 7.
  6. Line 27 notes that the page translation cache does not need to be flushed. The reason: this function is called by do_no_page(), and for a missing-page exception the entry was not present before, so there is no need to call invalidate().
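The way put_page() splits a linear address can be written out explicitly. This is an illustrative sketch only; `struct linear_parts`, `split_linear`, and `make_pte` are hypothetical names, and unlike the kernel code `make_pte` masks the low 12 bits of the page address before adding the flags.

```c
#include <stdint.h>

/* The three fields of a 32-bit linear address under two-level paging. */
struct linear_parts {
	uint32_t dir_index;    /* bits 31..22: index into the page directory */
	uint32_t table_index;  /* bits 21..12: index into the page table */
	uint32_t offset;       /* bits 11..0:  offset inside the 4 KB page */
};

static struct linear_parts split_linear(uint32_t address)
{
	struct linear_parts p;
	p.dir_index   = address >> 22;
	p.table_index = (address >> 12) & 0x3ff;
	p.offset      = address & 0xfff;
	return p;
}

/* Page table entry for a physical page with attribute 7
 * (present, read/write, user level). */
static uint32_t make_pte(uint32_t page_phys)
{
	return (page_phys & 0xfffff000) | 7;
}
```

The kernel expression `page_table[(address>>12) & 0x3ff] = page | 7` corresponds to storing `make_pte(page)` at `table_index`.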

13.4.4 Physical Memory Sharing

linux/mm/memory.c

Functions involved: try_to_share(), share_page()

try_to_share() tries to share the page at address in process p with the current process. share_page() tries to find a process that can share a page with the current process. The parameter address is the address of the wanted page relative to the current process's address space.

/*
 * try_to_share() checks the page at address "address" in the task "p",
 * to see if it exists, and if it is clean. If so, share it with the current
 * task.
 *
 * NOTE! This assumes we have checked that p != current, and that they
 * share the same executable.
 */
static int try_to_share(unsigned long address, struct task_struct * p)
{
	unsigned long from;
	unsigned long to;
	unsigned long from_page;
	unsigned long to_page;
	unsigned long phys_addr;

	from_page = to_page = ((address>>20) & 0xffc);
	from_page += ((p->start_code>>20) & 0xffc);	// p process page directory entry
	to_page += ((current->start_code>>20) & 0xffc);	// Current process directory entry
/* is there a page-directory at from? */
	from = *(unsigned long *) from_page;	// Content of p's page directory entry
	if (!(from & 1))
		return 0;
	from &= 0xfffff000;	// Page table address
	from_page = from + ((address>>10) & 0xffc);	// Page entry pointer
	phys_addr = *(unsigned long *) from_page;	// Page entry content
/* is the page clean and present? */
	if ((phys_addr & 0x41) != 0x01)
		return 0;
	phys_addr &= 0xfffff000;
	if (phys_addr >= HIGH_MEMORY || phys_addr < LOW_MEM)
		return 0;
	to = *(unsigned long *) to_page;	// current Process page directory entry content
	if (!(to & 1)) {	// Check whether the page table exists
		if ((to = get_free_page()))	// Request a free page to hold the page table
			*(unsigned long *) to_page = to | 7;
		else
			oom();
	}
	to &= 0xfffff000;
	to_page = to + ((address>>10) & 0xffc);
	if (1 & *(unsigned long *) to_page)
		panic("try_to_share: to_page already exists");
/* share them: write-protect */
	*(unsigned long *) from_page &= ~2;	// Make p's entry read-only
	*(unsigned long *) to_page = *(unsigned long *) from_page;	// Copy p's page table entry to the current process
	invalidate();
	phys_addr -= LOW_MEM;
	phys_addr >>= 12;
	mem_map[phys_addr]++;	// the number of pages referenced by the process is increased by 1
	return 1;
}

/*
 * share_page() tries to find a process that could share a page with
 * the current one. Address is the address of the wanted page relative
 * to the current data space.
 *
 * We first check if it is at all feasible by checking executable->i_count.
 * It should be >1 if there are other tasks sharing this inode.
 */
static int share_page(unsigned long address)
{
	struct task_struct ** p;

	if (!current->executable)	// This process has no executable file
		return 0;
	if (current->executable->i_count < 2)	// The reference count is 1
		return 0;	// Sharing is not possible
	for (p = &LAST_TASK ; p > &FIRST_TASK ; --p) {
		if (!*p)
			continue;
		if (current == *p)
			continue;
		if ((*p)->executable != current->executable)
			continue;
		if (try_to_share(address,*p))
			return 1;
	}
	return 0;
}

try_to_share(address, p)

Input parameters: address is the logical address within the process; p is the process to share with.

Function execution flow:

  1. Find the page directory entries from_page and to_page that correspond to the logical address address in process p and in the current process. First compute the directory entry offset of address within the process space (64MB), as shown in line 17. Adding the directory entry offset of each process's start address in the 4GB linear space gives from_page and to_page in the 4GB linear space, as shown in lines 18 to 19.
  2. Read the page table entry of process p. First check whether p's directory entry is valid by testing whether its lowest bit (P) is 1; if it is, the page table exists. Then fetch the content of the corresponding page table entry (the physical page address plus flags), as shown in lines 24 to 26.
  3. Check that the physical page is present and has not been modified, using the mask 0x41 to test the D (dirty) bit and the P (present) bit. Then extract the physical page address and check that it is valid: not below 1MB and not at or above the top of memory (lines 28 to 32). At this point the physical page behind the logical address address in process p has been found.
  4. Determine the page table entry for the logical address address in the current process, as shown in lines 33 to 38. If the current process's directory entry to_page is not valid, that is, the page table does not exist, allocate a free page to hold the page table; if the allocation fails, oom() is called.
  5. Add the page table address from the directory entry (line 40) to the offset of the entry within the page table (line 41) to get the address of the page table entry. If that entry is already present, the kernel panics, because the page is already mapped and must not be mapped again.
  6. Copy p's page table entry to the current process, so that the page at address in the current process maps to the same physical page as in p. Within lines 45 to 50, lines 45 to 46 write-protect the entry and copy it, line 47 flushes the page translation cache, and then the reference count of the page is incremented by 1.
  7. Return 1 to indicate that sharing succeeded.
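The test in step 3 and the write protection in step 6 are plain bit operations on a page table entry. A minimal sketch follows; the macro and function names are hypothetical, while the bit positions (P = bit 0, R/W = bit 1, D = bit 6) are those of the i386 page table entry format that the 0x41 and ~2 masks in the listing rely on.

```c
#include <stdint.h>

#define PTE_PRESENT 0x01u  /* P bit: page is in memory */
#define PTE_RW      0x02u  /* R/W bit: page is writable */
#define PTE_DIRTY   0x40u  /* D bit: page has been written to */

/* A page can be shared only if it is present and clean,
 * which is the (entry & 0x41) == 0x01 test in try_to_share(). */
static int pte_is_shareable(uint32_t pte)
{
	return (pte & (PTE_DIRTY | PTE_PRESENT)) == PTE_PRESENT;
}

/* Clear the R/W bit so that the next write raises a protection fault. */
static uint32_t pte_write_protect(uint32_t pte)
{
	return pte & ~PTE_RW;
}
```

Because both processes end up with a read-only entry, the first write by either one goes through the write-protection path described later in this chapter.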

share_page(address)

Function execution flow:

  1. Check whether the current process can share at all. The executable field of the task data structure tells whether the process has a corresponding executable file. If it does, check the reference count (i_count) of the file's inode. A value of 1 means only this process is running the executable, so sharing is not possible, as shown in lines 66 to 69.
  2. If sharing is possible, scan the task array for another process running the same executable. This is shown in lines 70 to 78. The for loop at line 70 walks the task array from back to front; line 71 skips empty task slots; line 73 skips the current task itself; line 75 skips processes whose executable differs from that of the current process; and line 77 calls try_to_share() to attempt the sharing.

13.5 Process Address Space

The kernel code and data segments are defined in the GDT in head.s.

gdt_descr:
	.word 256*8-1	# so does gdt (not that that's any
	.long gdt		# magic number, but it works for me :^)
gdt:	.quad 0x0000000000000000	/* NULL descriptor */
	.quad 0x00c09a0000000fff	/* 16Mb */
	.quad 0x00c0920000000fff	/* 16Mb */
	.quad 0x0000000000000000	/* TEMPORARY - don't use */
	.fill 252,8,0			/* space for LDT's and TSS's etc */
  • Line 1: Label of the operand loaded into the global descriptor table register GDTR.

  • Line 2: The first 2 bytes of the operand, giving the GDT table length.

  • Line 3: The linear base address of the GDT table, which consists of 256 descriptor entries of 8 bytes each.

  • Line 4: Start of the global table; the first entry is the null descriptor.

  • Line 5: Code segment descriptor.

    • Selector 0x08
  • Line 6: Data segment descriptor.

    • Selector 0x10
  • Line 7: Temporary descriptor, not used.

  • Line 8: Reserved space for the LDT and TSS descriptors of the tasks.
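To see why both kernel descriptors are commented as 16Mb, the 8-byte value can be decoded by hand. The sketch below follows the standard x86 segment descriptor layout; `struct seg_info`, `decode_descriptor`, and `segment_size_bytes` are hypothetical helper names, not part of the kernel.

```c
#include <stdint.h>

/* Selected fields of an 8-byte x86 segment descriptor. */
struct seg_info {
	uint32_t base;          /* linear base address of the segment */
	uint32_t limit;         /* raw 20-bit limit field */
	uint8_t  access;        /* access byte (present, DPL, type) */
	int      page_granular; /* G bit: limit is counted in 4 KB units */
};

static struct seg_info decode_descriptor(uint64_t d)
{
	struct seg_info s;
	/* limit: bits 0..15 and 48..51 of the descriptor */
	s.limit = (uint32_t)(d & 0xffff) | (uint32_t)((d >> 32) & 0xf0000);
	/* base: bits 16..39 and 56..63 of the descriptor */
	s.base = (uint32_t)((d >> 16) & 0xffffff) | (uint32_t)((d >> 32) & 0xff000000);
	s.access = (uint8_t)((d >> 40) & 0xff);
	s.page_granular = (int)((d >> 55) & 1);
	return s;
}

/* Segment size in bytes: (limit + 1) units of 1 byte or 4 KB. */
static uint64_t segment_size_bytes(struct seg_info s)
{
	uint64_t units = (uint64_t)s.limit + 1;
	return s.page_granular ? units * 4096 : units;
}
```

For 0x00c09a0000000fff the limit field is 0xfff with 4KB granularity, so the segment spans 0x1000 pages of 4KB, which is 16MB starting at base 0.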

In 0.11, the maximum virtual address space available to each process is 64MB, and the global descriptor table has 256 entries. Two entries are unused and two are used by the system; each process uses two entries. The table can therefore hold at most (256-4)/2=126 tasks, covering a virtual address range of 126x64MB, about 8GB. However, the maximum number of tasks is defined as 64 in 0.11, so the total linear address space used is 64x64MB=4GB.
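The arithmetic in this paragraph can be written down as a small check. The constant and function names below are hypothetical and only restate the numbers given in the text.

```c
#define GDT_ENTRIES      256u  /* descriptors in the GDT */
#define GDT_NON_TASK       4u  /* two unused plus two system entries */
#define ENTRIES_PER_TASK   2u  /* each task takes two GDT entries */
#define TASK_SPACE_MB     64u  /* virtual address space per task */

/* Upper bound on the number of tasks imposed by the GDT size. */
static unsigned max_tasks_by_gdt(void)
{
	return (GDT_ENTRIES - GDT_NON_TASK) / ENTRIES_PER_TASK;
}

/* Linear address range, in MB, covered by a given number of tasks. */
static unsigned long linear_span_mb(unsigned tasks)
{
	return (unsigned long)tasks * TASK_SPACE_MB;
}
```

With 126 tasks the span is 8064MB, slightly under 8GB; with the 64 tasks defined in 0.11 it is exactly 4096MB.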

The heap is the area used for memory that a process requests dynamically during execution. BSS is the uninitialized data area of a process, used to store static uninitialized data. Each task has two stacks, used for executing in user mode and in kernel mode respectively, called the user mode stack and the kernel mode stack. The linear address of the kernel mode stack is given by the ss0 and esp0 fields of the task's TSS, and the stack lies at the end of the page that holds the task data structure. The kernel mode stack is small (about 3K bytes), while the user mode stack can grow within the task's 64MB space.

13.6 Page Fault Handling

Functions involved: write_verify(), un_wp_page(), and page_fault() in page.s.

write_verify(): write page verification. If the page exists and is not writable, it removes the write protection; the argument is the linear address of the page. un_wp_page(): removes page write protection; it handles the write-protection exception during page fault processing. The argument is a pointer to the page table entry.

graph LR
A[physical page shared] -- Yes --> B[allocate a new page and copy the page content]
A -- No --> C[set page writable]
void un_wp_page(unsigned long * table_entry)
{
	unsigned long old_page,new_page;

	old_page = 0xfffff000 & *table_entry;	// Retrieves the physical page address in the specified page entry
	if (old_page >= LOW_MEM && mem_map[MAP_NR(old_page)]==1) {
		*table_entry |= 2;	// Not shared: just set the R/W bit, no new page needed
		invalidate();
		return;
	}
	if (!(new_page=get_free_page()))
		oom();
	if (old_page >= LOW_MEM)
		mem_map[MAP_NR(old_page)]--;	// Drop one reference to the shared page
	*table_entry = new_page | 7;
	invalidate();
	copy_page(old_page,new_page);	// The process now uses its own copy of the page
}	
void write_verify(unsigned long address)
{
	unsigned long page;

	if (!((page = *((unsigned long *) ((address>>20) & 0xffc))) & 1))
		return;
	page &= 0xfffff000;
	page += ((address>>10) & 0xffc);
	if ((3 & *(unsigned long *) page) == 1)  /* non-writeable, present */
		un_wp_page((unsigned long *) page);
	return;
}

write_verify()

Write page validation function. The write-protection removal function is called only if the page exists and is not writable (R/W=0).

Function execution flow:

  1. Check from the linear address whether the page table exists, and return if it does not, as shown in lines 23 to 24, because sharing and copy-on-write do not apply to pages that do not exist.
  2. As shown in line 25, take the page table base address from the directory entry; line 26 adds the offset of the entry within the page table to get the address of the page table entry.
  3. As shown in line 27, examine the content of the page table entry. If the page exists and is not writable, call un_wp_page(); otherwise return.

un_wp_page()

When the kernel creates a process, the new process and its parent share code and data memory pages, and all of these pages are set read-only. When either the new process or the parent writes to such a page, a page write-protection exception is raised, and the handler checks whether the page is shared. If it is, a new page is allocated and the page content is copied. If it is not, only the write flag of the page needs to be set.

Function execution flow:

  1. Retrieve the physical page address from the specified page table entry, as shown in line 5 of the code.
  2. Check whether the page is in the main memory area and whether its reference count is 1 (a count of 1 means the page is not shared). If so, only set the R/W flag in the page table entry to make the page writable (line 7) and flush the page translation cache (line 8).
  3. Otherwise, allocate a free page in the main memory area for the writing process to use on its own (as shown in lines 11 to 12).
  4. If the original page is in the main memory area (old_page >= LOW_MEM), decrement its entry in the page map byte array mem_map[], as shown in lines 13 to 14 of code.
  5. Update the specified page table entry to the new page address with the read/write flags set (attribute 7), as shown in line 15.
  6. Flush the page translation cache (line 16), and then call copy_page() to copy the contents of the original page to the new page, as shown on line 17.
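The decision at the heart of un_wp_page() depends on only two inputs: where the page lies and how many references it has. A minimal sketch, assuming LOW_MEM is 1MB as described in this chapter; `enum wp_action` and `wp_decide` are hypothetical names used for illustration.

```c
#define LOW_MEM 0x100000UL  /* assumed: main memory area starts at 1 MB */

enum wp_action {
	WP_MAKE_WRITABLE,     /* page is private: just set the R/W bit */
	WP_COPY_TO_NEW_PAGE   /* page is shared (or below LOW_MEM): copy it */
};

/* Mirror of the first test in un_wp_page(): a page in the main memory
 * area with exactly one reference can simply be made writable. */
static enum wp_action wp_decide(unsigned long old_page, unsigned char refcount)
{
	if (old_page >= LOW_MEM && refcount == 1)
		return WP_MAKE_WRITABLE;
	return WP_COPY_TO_NEW_PAGE;
}
```

In the copy case the kernel additionally decrements the old page's count, so the last remaining user of a formerly shared page takes the cheap path on its next write fault.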

Page fault interrupt handler page_fault() in page.s

graph LR
A[interrupt int14] --> B
B --> C --> D --> do_wp_page --> E --> do_no_page

The page.s program contains the low-level page exception handling code; the actual work is done in memory.c. The interrupt handler page_fault (interrupt 14) in page.s distinguishes two cases. One is a page exception caused by a missing page, which is handled by calling do_no_page(error_code, address). The other is a page exception caused by page write protection (the current process does not have permission to write the specified page), which is handled by calling do_wp_page(error_code, address). error_code is generated automatically by the CPU and pushed onto the stack.

Control registers: CR0~CR3.

  • CR0: system control flags that control the processor's operating mode and state.
  • CR1: Reserved.
  • CR2: contains the linear address that caused the page fault.
  • CR3: contains the physical base address of the page directory table (page directory base register, PDBR).
/*
 *  linux/mm/page.s
 *
 *  (C) 1991  Linus Torvalds
 */

/*
 * page.s contains the low-level page-exception code.
 * the real work is done in mm.c
 */

.globl page_fault

page_fault:
	xchgl %eax,(%esp)	# fetch the error code into eax
	pushl %ecx
	pushl %edx
	push %ds
	push %es
	push %fs
	movl $0x10,%edx	# set the kernel data segment selector
	mov %dx,%ds
	mov %dx,%es
	mov %dx,%fs
	movl %cr2,%edx	# get the linear address that caused the page exception
	pushl %edx	# push the linear address and error code as arguments of the called function
	pushl %eax
	testl $1,%eax	# test the present flag P; jump if the exception was not caused by a missing page
	jne 1f
	call do_no_page	# call the missing page handler
	jmp 2f
1:	call do_wp_page	# call the write protection handler
2:	addl $8,%esp
	pop %fs
	pop %es
	pop %ds
	popl %edx
	popl %ecx
	popl %eax
	iret


The page exception descriptor will be set in traps.c.

Function execution flow:

  1. Line 15: Fetch the error code and save it in eax.
  2. Line 21: Set the kernel data segment selector.
  3. Line 25: Push the linear address and error code as arguments of the function to be called.
  4. Line 28: Test the present flag P and jump if the exception was not caused by a missing page.
  5. Line 30: Call the missing page handler do_no_page().
  6. Line 32: Call the write protection handler do_wp_page().
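The dispatch performed by `testl $1,%eax` can be expressed in C. This is only a sketch of the decision; `enum fault_kind` and `classify_fault` are hypothetical names. Bit 0 of the page fault error code is 0 when the page was not present and 1 when the fault was a protection violation on a present page.

```c
enum fault_kind {
	FAULT_NOT_PRESENT,  /* handled by do_no_page() */
	FAULT_PROTECTION    /* handled by do_wp_page() */
};

/* Classify a page fault from bit 0 (P) of the CPU error code. */
static enum fault_kind classify_fault(unsigned long error_code)
{
	return (error_code & 1) ? FAULT_PROTECTION : FAULT_NOT_PRESENT;
}
```

The other error code bits (such as the write and user flags) are passed along to the handlers but do not affect this choice.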

The page fault handler calls one of two functions:

1. do_wp_page()

Perform write-protected page processing.

/*
 * This routine handles present pages, when users try to write
 * to a shared page. It is done by copying the page to a new address
 * and decrementing the shared-page counter for the old page.
 *
 * If it's in code space we exit with a segment error.
 */
void do_wp_page(unsigned long error_code,unsigned long address)
{
#if 0
/* we cannot do this yet: the estdio library writes to code space */
/* stupid, stupid. I really want the libc.a from GNU */
	if (CODE_SPACE(address))
		do_exit(SIGSEGV);
#endif
	un_wp_page((unsigned long *)
		(((address>>10) & 0xffc) + (0xfffff000 & *((unsigned long *) ((address>>20) & 0xffc)))));

}

When a user writes to a shared page, a page exception is triggered and this function is called to handle it: the page is copied to a new address and the share count of the original page is decremented, which is implemented by un_wp_page(). I think this function is just an interface for calling un_wp_page(): it only contains a (currently disabled) code-space check and constructs the pointer to the page table entry. Its parameters are error_code, generated automatically by the CPU, and address, the linear address of the page.

Lines 10 to 15 check whether address is in code space and, if so, terminate the program; this check is currently disabled with #if 0.

To call un_wp_page(), the argument must be constructed. ((address>>10) & 0xffc) computes the offset of the page table entry within the page table from the specified linear address. (0xfffff000 & *((unsigned long *) ((address>>20) & 0xffc))) fetches the page table address from the directory entry. Adding the offset of the entry within the page table to the physical address of the page table gives the pointer to the page table entry.
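The pointer construction described above reduces to one addition once the directory entry has been read. The helper below is a hypothetical, user-space illustration that takes the directory entry content as a plain value instead of reading it from physical address space.

```c
#include <stdint.h>

/* Address of the page table entry for a linear address, given the content
 * of the page directory entry that covers it.
 *   dir_entry & 0xfffff000   -> physical address of the page table
 *   (address >> 10) & 0xffc  -> byte offset of the entry inside that table */
static uint32_t pte_pointer(uint32_t dir_entry, uint32_t address)
{
	return (dir_entry & 0xfffff000) + ((address >> 10) & 0xffc);
}
```

For a directory entry of 0x00005007 and the linear address 0x00403025 the table index is 3, so the entry lies 12 bytes into the page table at 0x5000.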

2. do_no_page()

graph LR
A[missing page processing] --> B[1. Process dynamically requests a physical page mapping]
A --> C[2. Attempt to share the page with a process that loaded the same file]
A --> D[3.
void do_no_page(unsigned long error_code,unsigned long address)
{
	int nr[4];
	unsigned long tmp;
	unsigned long page;
	int block,i;

	address &= 0xfffff000;	// Page address of the missing page
	tmp = address - current->start_code;	// Offset of the page within the process address space
	if (!current->executable || tmp >= current->end_data) {
		get_empty_page(address);	// Allocate a physical page and map it at the linear address
		return;
	}
	if (share_page(tmp))	// Try page sharing at logical address tmp
		return;
	if (!(page = get_free_page()))
		oom();
/* remember that 1 block is used for header */
	block = 1 + tmp/BLOCK_SIZE;	// Start block number in the executable file
	for (i=0 ; i<4 ; block++,i++)
		nr[i] = bmap(current->executable,block);	// The corresponding logical block number on the device
	bread_page(page,current->executable->i_dev,nr);	// Read four logical blocks from the device
	i = tmp + 4096 - current->end_data;	// Number of bytes beyond end_data to clear
	tmp = page + 4096;
	while (i-- > 0) {
		tmp--;
		*(char *)tmp = 0;
	}
	if (put_page(page,address))	// Map the physical page at the linear address that faulted
		return;
	free_page(page);
	oom();
}

The do_no_page() function is called from page.s, and there are two cases. In the first, the process needs a clean page for heap or stack data, so a physical page is simply allocated and mapped at the specified linear address. In the second, the missing page lies within the process's executable image, so the function first tries to share the page. If sharing fails, it allocates a page of physical memory, reads the corresponding page of the executable file from the device, and maps it at the faulting linear address.

Function execution flow:

  1. Compute tmp, the offset (logical address) of the linear address address relative to the process base address, lines 8 to 9. tmp is the logical address of the missing page.
  2. If the current process's executable pointer is null or the address lies beyond the length of code plus data, allocate a page of physical memory and map it at the specified linear address (the process is requesting a new page for heap or stack data), as shown in lines 10 to 13.
  3. Otherwise the missing page belongs to the executable image, so try to share the page and return if that succeeds; if not, allocate a new physical page to receive the corresponding page of the executable file, as shown in lines 14 to 17.
  4. Lines 18 to 28 read the corresponding page of the executable file from the device.
  5. As shown in lines 29 through 33, map the physical page at the linear address address that caused the missing page exception (line 29). Return if the operation succeeded; otherwise free the page (line 31) and report that there is not enough memory (line 32).
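The block arithmetic in step 4 can be checked on its own. The sketch below assumes 1KB blocks (BLOCK_SIZE 1024) and 4KB pages; `page_blocks` and `tail_bytes_to_clear` are hypothetical helpers that return file-relative block numbers, whereas the kernel goes on to translate them to device blocks with bmap().

```c
#define BLOCK_SIZE 1024UL  /* assumed block size in bytes */
#define PAGE_BYTES 4096UL  /* size of one memory page */

/* File block numbers that hold the page at logical offset tmp.
 * Block 0 of the executable is the header, hence the "1 +". */
static void page_blocks(unsigned long tmp, unsigned long blocks[4])
{
	unsigned long block = 1 + tmp / BLOCK_SIZE;
	int i;

	for (i = 0; i < 4; i++)
		blocks[i] = block + i;
}

/* Number of bytes at the end of the page that lie beyond end_data
 * and must be cleared after reading; 0 if the page is fully inside. */
static unsigned long tail_bytes_to_clear(unsigned long tmp, unsigned long end_data)
{
	if (tmp + PAGE_BYTES <= end_data)
		return 0;
	return tmp + PAGE_BYTES - end_data;
}
```

For the page at offset 0x2000 the file blocks are 9 to 12; if end_data is 0x2800, the last 2048 bytes of the page are zeroed.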

13.7 Swapping Mechanism

Implemented in the 0.12 kernel

Since version 0.12, Linux has had virtual memory swapping in the kernel: memory pages that are temporarily unused are saved to disk and brought back into memory when needed. Swap management uses the same bitmap technique as main memory area management. The bitmap determines where swapped-out memory pages are saved and mapped. If the swap device number SWAP_DEV is defined when compiling the kernel, the compiled kernel supports swapping. In Linux 0.12, swapping uses a separate swap partition on the hard disk (a virtual memory partition whose size is related to the actual physical memory) with no file system on it.

A page holds 32768 bits: SWAP_BITS = 4096 x 8. A swap partition can therefore manage at most 32768 pages.

The swap partition on the block device is initialized with swap_size swap pages. Page 0 serves as the swap management page (it stores the swap bitmap), and the actual number of pages is no more than SWAP_BITS.

Data structure: swap bitmap swap_bitmap

Bit in the bitmap:  0, 1, 1, 1, 1, ..., 1,           0,         ..., 0
Swap page number:   0, 1, 2, 3, 4, ..., swap_size-1, swap_size, ..., SWAP_BITS-1

13.7.1 Initializing Swapping

If the system defines the swap device number SWAP_DEV, the kernel runs the swap initialization function init_swapping().

Swap initialization function:

  1. Check whether a swap device exists and whether it is valid.
  2. Allocate a memory page to hold the swap bitmap.
  3. Set up the management page flags in the bitmap: 1 means the page is usable, 0 means it is occupied.
  4. Check the bit values in the swap bitmap.

int SWAP_DEV = 0;	// Swap device number, set during kernel initialization
static char * swap_bitmap = NULL;	// Swap bitmap
void init_swapping(void)
{	// Check whether a swap device exists
	extern int *blk_size[];	// Indexed by major device number; each item points to an array with the total number of data blocks of every sub-device (partition), see blk_drv/ll_rw_blk.c
	int swap_size,i,j;

	if (!SWAP_DEV)
		return;
	if (!blk_size[MAJOR(SWAP_DEV)]) {	// #define MAJOR(a) (((unsigned)(a))>>8)
		printk("Unable to get size of swap device\n\r");
		return;
	}
	swap_size = blk_size[MAJOR(SWAP_DEV)][MINOR(SWAP_DEV)];	// #define MINOR(a) ((a)&0xff)
	if (!swap_size)
		return;
	if (swap_size < 100) {
		printk("Swap device too small (%d blocks)\n\r",swap_size);
		return;
	}
	swap_size >>= 2;	// Convert the total number of data blocks to the number of swap pages
	if (swap_size > SWAP_BITS)
		swap_size = SWAP_BITS;
	swap_bitmap = (char *) get_free_page();	// Allocate a physical page to hold the swap bitmap
	if (!swap_bitmap) {
		printk("Unable to start swapping: out of memory :-)\n\r");
		return;
	}
	read_swap_page(0,swap_bitmap);	// Read page 0 (the swap management page) of the swap partition into swap_bitmap
	if (strncmp("SWAP-SPACE",swap_bitmap+4086,10)) {	// The signature string starting at byte 4086 tells whether the swap device is valid
		printk("Unable to find swap-space signature\n\r");
		free_page((long) swap_bitmap);	// Release the page that held the bitmap
		swap_bitmap = NULL;
		return;
	}
	memset(swap_bitmap+4086,0,10);	// Clear the signature string to 0
	for (i = 0 ; i < SWAP_BITS ; i++) {	// Check bitmap bits
		if (i == 1)
			i = swap_size;
		if (bit(swap_bitmap,i)) {
			printk("Bad swap-space bit-map\n\r");
			free_page((long) swap_bitmap);
			swap_bitmap = NULL;
			return;
		}
	}
	j = 0;
	for (i = 1 ; i < swap_size ; i++)
		if (bit(swap_bitmap,i))
			j++;
	if (!j) {
		free_page((long) swap_bitmap);
		swap_bitmap = NULL;
		return;
	}
	printk("Swap device ok: %d pages (%d bytes) swap-space\n\r",j,j*4096);
}

SWAP_DEV: During kernel initialization, if the system defines the swap device number SWAP_DEV, the kernel runs the swap initialization function init_swapping().

Function execution flow:

  1. Check whether the system has a swap device, using the block-count array of the device, as shown in lines 5 to 13. Lines 8~9 return if no swap device number is defined. Lines 10 to 13 check whether the block-count array for the swap device's major number is set and, if not, print a warning and return. blk_size[MAJOR][MINOR] holds the total number of blocks of every block device. blk_size[MAJOR] points to the block counts for one major device number; if it is null, there is no need to look at the sub-devices. blk_size[MAJOR][MINOR] gives the total number of data blocks of the swap partition on the device.
  2. Check whether the swap device is valid based on its total number of data blocks. As shown in lines 14 to 20, if the total is less than 100 blocks, the message "Swap device too small" is printed and the function returns.
  3. Allocate a physical page to hold the swap bitmap, as shown in lines 21 to 28. Line 21 converts the total number of data blocks to the corresponding number of swap pages. The value must not exceed 32768 (SWAP_BITS, 4K*8, the number of bits in a page); if it does, it is limited to 32768. Then a physical page is allocated for the bitmap on line 24: swap_bitmap = (char *) get_free_page();
  4. Then read page 0 of the swap partition into the swap_bitmap page with read_swap_page(), a macro defined in linux/mm.h. The signature string "SWAP-SPACE" must start at byte 4086 of this page. If it is not found, the swap device is invalid: an error message is printed and all settings are reset, as shown in lines 30 to 35. Otherwise, the signature bytes are cleared to zero, as shown in line 36.
  5. Check the swap bitmap swap_bitmap. This step only checks the bits; it does not assign them. A bit value of 0 means the swap page is in use or unavailable; otherwise the swap page is free. Swap page 0 is the management page that stores the bitmap, so bit 0 is 0. Swap pages [1, swap_size-1] are usable (the device offers at most swap_size-1 swap pages), and the bit of each free page is 1. Bits in [swap_size, SWAP_BITS-1] must be 0 (unavailable) because there are no corresponding swap pages. The check is shown in lines 37 to 50. Lines 51 to 55 count the free swap pages; if there are none, swapping cannot work, so the page occupied by the bitmap is released and the function returns. Line 57 prints the number of swap pages and the total number of bytes of swap space when the swap device works properly.
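The bitmap rules in step 5 can be captured in a portable check. This is a user-space sketch under the layout described above, not the kernel routine: `test_bit` and `check_swap_bitmap` are hypothetical names, and the bit test is written in plain C instead of the bt instruction.

```c
#define SWAP_BITS (4096 * 8)  /* bits in one 4 KB bitmap page */

/* Value (0 or 1) of bit nr in a byte-array bitmap, least significant bit first. */
static int test_bit(const unsigned char *map, unsigned nr)
{
	return (map[nr >> 3] >> (nr & 7)) & 1;
}

/* Validate a swap bitmap for a device with swap_size pages.
 * Returns the number of free swap pages, or -1 if the bitmap is malformed:
 * bit 0 (management page) and all bits from swap_size upward must be 0. */
static int check_swap_bitmap(const unsigned char *map, unsigned swap_size)
{
	unsigned i;
	int free_pages = 0;

	if (test_bit(map, 0))
		return -1;
	for (i = swap_size; i < SWAP_BITS; i++)
		if (test_bit(map, i))
			return -1;
	for (i = 1; i < swap_size; i++)
		if (test_bit(map, i))
			free_pages++;
	return free_pages;
}
```

A result of 0 corresponds to the case where init_swapping() releases the bitmap page because no swap page is free.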

13.7.2 Bit manipulation macros and embedded functions

Three functions are defined to test, set, or clear a specified bit, selected by different op strings.

  • The parameter addr specifies a linear address; nr is the bit offset from that address.
  • According to line 5 of the code, a different instruction is formed depending on the op string:
    • When op="", the instruction bt is formed: it tests the bit and copies its value into the carry flag.
    • When op="s", the instruction bts is formed: it sets the bit and stores the original value in the carry flag.
    • When op="r", the instruction btr is formed: it resets the bit and stores the original value in the carry flag.
  • Operands: %0 is the return value; %1 is the bit offset nr; %2 is the base address addr; %3 is the initial value 0 of the result register.
  • The inline assembly code saves the value of the bit specified by the base address (%2) and bit offset (%1) into the carry flag CF.
  • The adcl instruction is add with carry; it adds CF to the result register.
  • If CF=1, the returned register value is 1; otherwise it is 0.
#define bitop(name,op) \
static inline int name(char * addr,unsigned int nr) \
{ \
int __res; \
__asm__ __volatile__("bt" op " %1,%2; adcl $0,%0" \
:"=g" (__res) \
:"r" (nr),"m" (*(addr)),"0" (0)); \
return __res; \
}

// Define 3 inline functions according to the op string
bitop(bit,"")		// bit(char * addr, unsigned int nr): test the bit and return its value
bitop(setbit,"s")	// setbit(char * addr, unsigned int nr): set the bit and return its original value
bitop(clrbit,"r")	// clrbit(char * addr, unsigned int nr): clear the bit and return its original value
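For readers without an x86 toolchain, the behaviour of the three generated functions can be approximated in portable C. The names `bit_test`, `bit_set`, and `bit_clear` are hypothetical stand-ins for bit(), setbit(), and clrbit(); like the originals, each returns the value the bit had before the operation.

```c
/* Return the current value (0 or 1) of bit nr, counted from addr. */
static int bit_test(const unsigned char *addr, unsigned nr)
{
	return (addr[nr >> 3] >> (nr & 7)) & 1;
}

/* Set bit nr and return its previous value. */
static int bit_set(unsigned char *addr, unsigned nr)
{
	int old = bit_test(addr, nr);
	addr[nr >> 3] |= (unsigned char)(1u << (nr & 7));
	return old;
}

/* Clear bit nr and return its previous value. */
static int bit_clear(unsigned char *addr, unsigned nr)
{
	int old = bit_test(addr, nr);
	addr[nr >> 3] &= (unsigned char)~(1u << (nr & 7));
	return old;
}
```

Returning the old value is what lets a caller combine the search and the claim in one call: a return of 1 from the clearing function means the bit was set and has now been taken.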

13.7.3 Allocating and Releasing a Swap Page

linux/mm/swap.c

Allocate a swap page based on the swap bitmap, and release a specified page on the swap device.

static int get_swap_page(void)
{
	int nr;

	if (!swap_bitmap)
		return 0;
	for (nr = 1; nr < 32768 ; nr++)	// Scan the entire swap bitmap
		if (clrbit(swap_bitmap,nr))	// Clear the first bit found with value 1
			return nr;	// Return the number of the free swap page just claimed
	return 0;
}

void swap_free(int swap_nr)	// Release the specified swap page on the swap device
{
	if (!swap_nr)
		return;
	if (swap_bitmap && swap_nr < SWAP_BITS)
		if (!setbit(swap_bitmap,swap_nr))
			return;
	printk("Swap-space bad (swap_free())\n\r");
	return;
}

Get_swap_free () — request a swap page number to scan the entire swap map bitmap, reset the first bit of the swap map to 1, and return the value position value (the idle swap page number), otherwise return 0, as shown in line 11 of code 1. In line 7, 32768 represents the total number of pages that can be mapped. At line 8, the function clrbit() is called to reset the first bit with a value of 1. Swap_free () — Release the specified switching page in the switching bitmap set the corresponding bit of the specified page number (set to free), if the original bit is equal to 1, it indicates that an error, error message is displayed. The function passes in an argument bit to specify the swap page number. Code 17, line 19, determine whether the page mapping array and the specified swap page number are valid, if so, call the function setbit() to set the position to 1, indicating that the page is released.

13.7.4 Swapping pages in and out

linux/mm/swap.c

Swap a specified page from the swap device into memory, and write memory pages out to the swap device.

void swap_in(unsigned long *table_ptr)	// Swap in the page named by the page table entry that table_ptr points to
{
	int swap_nr;
	unsigned long page;

	if (!swap_bitmap) {
		printk("Trying to swap in without swap bit-map");
		return;
	}
	if (1 & *table_ptr) {
		printk("trying to swap in present page\n\r");
		return;
	}
	swap_nr = *table_ptr >> 1;	// The entry holds swap_nr<<1, so shift right by 1 to recover the swap page number
	if (!swap_nr) {
		printk("No swap page in swap_in\n\r");
		return;
	}
	if (!(page = get_free_page()))	// Allocate a new physical page
		oom();
	read_swap_page(swap_nr, (char *) page);	// Read swap page swap_nr from the swap device
	if (setbit(swap_bitmap,swap_nr))	// Set the bit to mark the swap page free; it should have been 0
		printk("swapping in multiply from same page\n\r");
	*table_ptr = page | (PAGE_DIRTY | 7);	// Point the page table entry at the physical page
}

Swap_in () — Swaps the specified page into memory reads the memory page corresponding to the specified page entry from the switch device into the newly allocated memory page. At the same time, modify the corresponding bit in the swap bitmap, modify the page table content, let it point to the memory page, and set the corresponding flag. The passed parameter table_ptr is the page table entry pointer.

Function execution flow:

  1. Lines 6 to 10 check the validity of swapped bitmaps and parameters. Checks whether the swap bitmap exists and whether the page corresponding to the specified page entry exists in memory. Displays a warning message if an error occurs.
  2. Line 14~18 of the code, judge whether there is a suitable switching page in the switching area, only there is a page with bit 0 and page entry content is not 0 will be in the switching device.
  3. Lines 19 to 24 apply for a page of physical memory and read the page with the page number swAP_NR from the switching device. After swapping the page with read_swap_page(), swap the corresponding bitmap bits. If it is already set, the same page is being read from the switch again. Finally, point the page table entry to the physical page (as shown in line 24 of code) and set the page table properties — change bits, page privileges, user read-write, and presence flags.
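
The two forms a page table entry takes here (swapped out versus present) can be illustrated with a few helper functions. This is a sketch only: the pte_ helper names are hypothetical, PAGE_PRESENT and PAGE_DIRTY use the values quoted in the code comments (0x01 and 0x40), and the constant 7 is the present, read/write, and user bits of an x86 page table entry.

```c
#include <stdint.h>

#define PAGE_PRESENT 0x01u
#define PAGE_DIRTY   0x40u

/* Entry for a swapped-out page: swap_nr shifted left so that P (bit 0) is 0. */
static uint32_t pte_from_swap_nr(uint32_t swap_nr)
{
    return swap_nr << 1;
}

/* Nonzero if the entry is not present but names a page on the swap device. */
static int pte_is_swapped(uint32_t pte)
{
    return !(pte & PAGE_PRESENT) && pte != 0;
}

/* Recover the swap page number from a swapped-out entry. */
static uint32_t pte_swap_nr(uint32_t pte)
{
    return pte >> 1;
}

/* Entry built on swap-in: frame address plus dirty, user, rw, present. */
static uint32_t pte_from_frame(uint32_t phys)
{
    return (phys & 0xfffff000u) | PAGE_DIRTY | 7u;
}
```

For example, swap page 5 is stored as the entry value 10, and a page swapped in at physical address 0x00200000 yields the entry 0x00200047.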

Why is it not necessary to call invalidate() to flush the TLB when swapping in?

invalidate() was already called when the page was swapped out, so the TLB holds no stale translation for it. On swap-in the entry changes from not present to present, and a not-present entry is never cached in the TLB, so there is nothing to flush.

int try_to_swap_out(unsigned long * table_ptr)
{
	unsigned long page;
	unsigned long swap_nr;

	page = *table_ptr;
	if (!(PAGE_PRESENT & page))	// PAGE_PRESENT is 0x01 (linux/mm.h): the page is not in memory
		return 0;
	if (page - LOW_MEM > PAGING_MEMORY)	// The physical page address is beyond the top of paged memory (15MB)
		return 0;
	if (PAGE_DIRTY & page) {	// PAGE_DIRTY is 0x40 (linux/mm.h): the page has been modified
		page &= 0xfffff000;	// Get the physical page address
		if (mem_map[MAP_NR(page)] != 1)	// Shared pages are not swapped out
			return 0;
		if (!(swap_nr = get_swap_page()))	// Request a swap page number
			return 0;
		*table_ptr = swap_nr<<1;	// Store swap_nr<<1 so the present bit stays 0; only entries with P=0 and nonzero content are on the swap device
		invalidate();	// Flush the CPU page translation cache (TLB)
		write_swap_page(swap_nr, (char *) page);	// Write the memory page to the swap device
		free_page(page);
		return 1;
	}
	*table_ptr = 0;
	invalidate();
	free_page(page);
	return 1;
}

/*
 * Ok, this has a rather intricate logic - the idea is to make good
 * and fast machine code. If we didn't worry about that, things would
 * be easier.
 */
int swap_out(void)	// Search the 4GB linear address space and try to swap a physical memory page out to the swap device; returns 1 on success
{
	static int dir_entry = FIRST_VM_PAGE>>10;	// FIRST_VM_PAGE is (TASK_SIZE>>12), i.e. 64MB/4KB
	static int page_entry = -1;
	int counter = VM_PAGES;	// VM_PAGES is (LAST_VM_PAGE - FIRST_VM_PAGE) = 1032192, with LAST_VM_PAGE = 1024*1024 (4GB/4KB)
	int pg_table;

	while (counter>0) {	// Loop through the page directory table to find a valid page directory entry
		pg_table = pg_dir[dir_entry];	// The contents of the page directory entry
		if (pg_table & 1)
			break;
		counter -= 1024;	// One page table covers 1024 pages
		dir_entry++;		// Next directory entry
		if (dir_entry >= 1024)
			dir_entry = FIRST_VM_PAGE>>10;	// Wrap around to the first swappable directory entry
	}
	pg_table &= 0xfffff000;	// Page table pointer
	while (counter-- > 0) {	// Try each entry of the page table in turn; after entry 1023, move on to the next page table
		page_entry++;	// Page entry index
		if (page_entry >= 1024) {
			page_entry = 0;
		repeat:
			dir_entry++;	// Process the next page table
			if (dir_entry >= 1024)
				dir_entry = FIRST_VM_PAGE>>10; // This judgment can be deleted (see Linux1.2 for details).
			pg_table = pg_dir[dir_entry];	// The contents of the page directory entry
			if (!(pg_table&1))
				if ((counter -= 1024) > 0)
					goto repeat;
				else
					break;
			pg_table &= 0xfffff000;	// Page table pointer
		}
		if (try_to_swap_out(page_entry + (unsigned long *) pg_table))
			return 1;
	}
	printk("Out of swap-memory\n\r");
	return 0;
}

Try_to_swap_out () — Try to write the page to the switch. If the page has not been modified, there is no need to write the page to the switch. If the page has been modified, apply for a swap page number and swap out the page. The swap page number is stored in the corresponding page entry, and the page entry bit P=0 is kept. The passed argument is the page table entry pointer. Function execution flow:

  1. Check the validity of parameters. As shown in lines 6 through 10 of the code. Then determine whether the physical page address specified by the page entry is larger than the high PAGING_MEMORY (15MB) for paging management.
  2. If the page has been modified. As shown in lines 11 through 22. Then check whether the page is shared. If the page is shared, it will not be released. If the page is not shared, apply for a swap page id and save it in the page entry. In line 13 to 14 of the code, according to the reference times of the corresponding physical page saved in the MEM_map memory management array, determine whether the page is shared. If it is not equal to 1, it indicates that the page is shared. Therefore, directly return 0 to exit. Otherwise, apply for a swap page number (SWAP_NR) and prepare to swap out the page. The corresponding page entry stores the result of SWAP_N r multiplied by 2, as shown in line 17 of code. The reason for multiplying by 2 is to empty the existence bit (P) of the original page entry. Only pages with bit 0 and non-0 page entry content exist on the switching device. Write the corresponding memory page to the switch, as shown in line 19, and then free the memory page.
  3. If the page has not been modified, simply release it, as shown in codes 23 to 26.
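
The decision made in the three steps above can be condensed into a pure function for experimentation. This is a sketch, not the kernel routine: classify_pte() and the enum are hypothetical, refcount stands in for mem_map[MAP_NR(page)], LOW_MEM is assumed to be 1MB (0x100000), and the failure case where no swap page is available is left out.

```c
#include <stdint.h>

#define PAGE_PRESENT  0x01u
#define PAGE_DIRTY    0x40u
#define LOW_MEM       0x100000u                  /* assumed: 1 MB */
#define PAGING_MEMORY (15u * 1024u * 1024u)      /* 15 MB of paged memory */

enum swap_decision {
    SKIP,            /* leave the entry alone, try another page */
    WRITE_TO_SWAP,   /* dirty, unshared: write out, store swap_nr<<1 */
    DROP_CLEAN       /* clean: clear the entry and free the page */
};

/* Classify a page table entry the way try_to_swap_out() branches on it. */
static enum swap_decision classify_pte(uint32_t pte, unsigned int refcount)
{
    if (!(pte & PAGE_PRESENT))
        return SKIP;
    /* Unsigned arithmetic: an address below LOW_MEM wraps to a huge value. */
    if (pte - LOW_MEM > PAGING_MEMORY)
        return SKIP;
    if (pte & PAGE_DIRTY)
        return refcount == 1 ? WRITE_TO_SWAP : SKIP;
    return DROP_CLEAN;
}
```

The single unsigned comparison rejects pages both below LOW_MEM and above the paged area, which is why the kernel code needs only one range test.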

Note: the pages of task 0 (task[0], the kernel) are never swapped. The first page considered for swapping is the virtual memory page at 64MB, where the address space of task 0 ends.

LOW_MEM and PAGING_MEMORY on line 9 bound the paged main memory area. The macros used from line 36 onward are defined as follows:

#define FIRST_VM_PAGE (TASK_SIZE>>12) // TASK_SIZE is 0x04000000 (64MB); 64MB/4KB = 16384
#define LAST_VM_PAGE (1024*1024) // 4GB/4KB = 1048576
#define VM_PAGES (LAST_VM_PAGE - FIRST_VM_PAGE) // 1032192
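
The numbers in these comments can be checked with a few lines of C. The macro values are taken from the definitions above; dir_index_of_page() and page_to_addr() are hypothetical helpers added only to show how a page number maps to a page directory slot and a linear address.

```c
#define TASK_SIZE      0x04000000UL              /* 64 MB per task */
#define FIRST_VM_PAGE  (TASK_SIZE >> 12)         /* 16384 */
#define LAST_VM_PAGE   (1024UL * 1024UL)         /* 1048576 */
#define VM_PAGES       (LAST_VM_PAGE - FIRST_VM_PAGE)

/* Page directory index for a linear page number (1024 pages per table). */
static unsigned long dir_index_of_page(unsigned long page_nr)
{
    return page_nr >> 10;
}

/* Linear address at which a page number starts (4 KB pages). */
static unsigned long page_to_addr(unsigned long page_nr)
{
    return page_nr << 12;
}
```

With these values, FIRST_VM_PAGE >> 10 is 16, so the scan begins at page directory entry 16, the first entry above the 64MB occupied by task 0.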

Swap_out () — puts the memory page on the switch

  1. Search the entire 4GB linear space, starting with the page directory entry (FIRST_VM_PAGE >> 10) corresponding to the linear address 64MB. During an attempt to swap the corresponding physical memory page to the switching device, return 1 on a successful swap, otherwise return 0. Two static variables in the function are used to temporarily store the current search point for the starting position of the next search.
  2. Find the pg_table page directory entry that contains valid page table contents, exit if found, otherwise reduce counter variable set, and continue to detect the next page directory entry. As shown in lines 41 through 49, loop through the contents of the page table entry and determine if the page table contents are valid.
  3. After obtaining the page table pointer in the current directory entry, check the 1024 pages corresponding to the page table and call the switching function try_to_SWAP_out () one by one. If the page table is successfully switched to the switching device, 1 is returned. If all the page tables corresponding to all directory entries have failed, an error message is displayed and 0 is returned. Such as code 50Line 71 shows this. The page_table variable is the pointer to the page table on line 50, and page_entry is the index of the page table entry on line 52, as shown in line 56dir_entry++Represents processing the next page table, trying to call the swap function each time a valid page is found, as in code 67As shown on line 68, exit and return 1 if the call succeeds, otherwise continue looking for the next page.
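
The resumable scan described in these steps can be modelled on a toy two-level table. This is a simplified sketch, not the kernel's control flow: the demo_ names and the 4x4 table size are hypothetical, the cursor is passed in explicitly instead of living in static variables, and an absent table costs one step per entry instead of being skipped 1024 entries at a time.

```c
#include <stddef.h>

#define DEMO_DIRS     4   /* hypothetical; the kernel has 1024 directory entries */
#define DEMO_ENTRIES  4   /* hypothetical; a real page table has 1024 entries */
#define DEMO_FIRST    1   /* first directory slot that may be scanned */

struct demo_scan {
    int dir;     /* directory slot of the last examined entry */
    int entry;   /* index inside that table; -1 before the first call */
};

/*
 * tables[d] is NULL when slot d has no page table; otherwise it points to
 * DEMO_ENTRIES flags, nonzero meaning "this page can be swapped out".
 * Returns dir * DEMO_ENTRIES + entry for the chosen page, or -1 after one
 * full lap without success.
 */
static int demo_next_victim(struct demo_scan *s,
                            unsigned char *tables[DEMO_DIRS])
{
    int budget = (DEMO_DIRS - DEMO_FIRST) * DEMO_ENTRIES;  /* one lap */

    while (budget-- > 0) {
        if (++s->entry >= DEMO_ENTRIES) {     /* table exhausted */
            s->entry = 0;
            if (++s->dir >= DEMO_DIRS)
                s->dir = DEMO_FIRST;          /* wrap around */
        }
        if (tables[s->dir] != NULL && tables[s->dir][s->entry]) {
            tables[s->dir][s->entry] = 0;     /* page is gone after swap-out */
            return s->dir * DEMO_ENTRIES + s->entry;
        }
    }
    return -1;
}
```

Starting from the cursor {DEMO_FIRST, -1}, each call continues just past the previous victim, so successive calls spread swap-outs across the address space instead of always hitting the first table.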

13.7.5 Two low-level block read/write functions

#define read_swap_page(nr,buffer) ll_rw_page(READ,SWAP_DEV,(nr),(buffer));
#define write_swap_page(nr,buffer) ll_rw_page(WRITE,SWAP_DEV,(nr),(buffer));

The two macros above are defined in linux/mm.h. The ll_rw_page() function is implemented in kernel/blk_drv/ll_rw_blk.c.

ll_rw_page() is a low-level page read/write function for a block device identified by its device number. It accesses the block device in units of one page (4KB), that is, eight 512-byte sectors are read or written at a time.

ll_rw_blk.c is the program through which all block devices (hard disks, floppy disks, and RAM disks) interface with the rest of the system. Other parts of the system read and write block device data asynchronously by calling its low-level block read/write function ll_rw_block(). The actual read and write operations are performed by the request handler request_fn() of the device (do_hd_request() for hard disks, do_fd_request() for floppy disks, do_rd_request() for RAM disks).

void ll_rw_page(int rw, int dev, int page, char * buffer)
{
	struct request * req;
	unsigned int major = MAJOR(dev);
	// Check the parameters:
	// the major device number must be valid and the device must have a request function
	if (major >= NR_BLK_DEV || !(blk_dev[major].request_fn)) {
		printk("Trying to read nonexistent block-device\n\r");
		return;
	}
	// The command must be READ or WRITE
	if (rw!=READ && rw!=WRITE)
		panic("Bad block dev command, must be R/W");
	// Find a free request item
repeat:
	req = request+NR_REQUEST;
	while (--req >= request)
		if (req->dev<0)
			break;
	if (req < request) {
		sleep_on(&wait_for_request);
		goto repeat;
	}
	// Fill in the free request item and add it to the queue
/* fill up the request-info, and add it to the queue */
	req->dev = dev;	// Device number
	req->cmd = rw;	// Command (READ/WRITE)
	req->errors = 0;	// Count of read/write errors
	req->sector = page<<3;	// Start read sector
	req->nr_sectors = 8;	// Number of read-write sectors 1 page
	req->buffer = buffer;	// Data buffer
	req->waiting = current;	// The current task waits for this request to complete
	req->bh = NULL;	// No buffer head (the page does not go through the buffer cache)
	req->next = NULL;	// Next request item pointer
	current->state = TASK_UNINTERRUPTIBLE;	// Set it to the uninterruptible state
	add_request(major+blk_dev,req);	// Queue the request item
	schedule();
}
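
The sector arithmetic in ll_rw_page() (req->sector = page<<3, req->nr_sectors = 8) follows directly from the page and sector sizes. The helpers below are a hypothetical sketch of that calculation, assuming 512-byte sectors and 4KB pages as stated in the text.

```c
#define SECTOR_SIZE       512u
#define PAGE_SIZE_BYTES   4096u
#define SECTORS_PER_PAGE  (PAGE_SIZE_BYTES / SECTOR_SIZE)   /* 8 */

/* First sector of swap page `page`, matching req->sector = page << 3. */
static unsigned long swap_page_first_sector(unsigned int page)
{
    return (unsigned long)page << 3;
}

/* Byte offset of that swap page from the start of the swap device. */
static unsigned long swap_page_byte_offset(unsigned int page)
{
    return swap_page_first_sector(page) * SECTOR_SIZE;
}
```

Swap page 5, for instance, starts at sector 40, which is byte offset 20480 on the device.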

13.7.6 Allocating a Free Physical Page

get_free_page() allocates one free physical page in the main memory area, marks it as used, and returns its physical address, or 0 if no page can be obtained.

  • Input: %1 (ax = 0); %2 (LOW_MEM), the start of the memory managed by the byte map; %3 (cx = PAGING_PAGES); %4 (edi = mem_map + PAGING_PAGES - 1)
  • Output: %0 (ax = start address of the physical page)
/*
 * Get physical address of first (actually last :-) free page, and mark it
 * used. If no free pages left, return 0.
 */
unsigned long get_free_page(void)
{
register unsigned long __res asm("ax");

repeat:
	__asm__("std ; repne ; scasb\n\t"
		"jne 1f\n\t"
		"movb $1,1(%%edi)\n\t"
		"sall $12,%%ecx\n\t"
		"addl %2,%%ecx\n\t"
		"movl %%ecx,%%edx\n\t"
		"movl $1024,%%ecx\n\t"
		"leal 4092(%%edx),%%edi\n\t"
		"rep ; stosl\n\t"
		"movl %%edx,%%eax\n"
		"1:"
		:"=a" (__res)
		:"0" (0),"i" (LOW_MEM),"c" (PAGING_PAGES),
		"D" (mem_map+PAGING_PAGES-1)
		:"di","cx","dx");
	if (__res >= HIGH_MEMORY)	// The page address is beyond the actual physical memory: search again
		goto repeat;
	if (!__res && swap_out())	// No free page: swap a page out and search again
		goto repeat;
	return __res;	// Return the physical address of the free page
}

The scan starts at the last byte of mem_map[] (where edi initially points) and moves backward through the page flags. The code looks for a byte with value 0 in the memory map, marks it as used, and clears the corresponding physical page to zero. If the address of the page obtained is beyond the actual physical memory capacity, it searches again. If no free page is found, it calls swap_out() and, if that succeeds, searches again, as shown in lines 25 to 28 of the code.
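
The backward scan performed by the assembly can be restated in plain C. This is a sketch under stated assumptions: the demo_ names, the 8-page pool, and the backing array are hypothetical, LOW_MEM is assumed to be 1MB, and the HIGH_MEMORY check and the swap_out() retry are omitted.

```c
#include <string.h>

#define DEMO_LOW_MEM    0x100000UL   /* assumed start of the paged area (1 MB) */
#define DEMO_PAGES      8            /* hypothetical; the kernel uses PAGING_PAGES */
#define DEMO_PAGE_SIZE  4096UL

static unsigned char demo_mem_map[DEMO_PAGES];                 /* use counts */
static unsigned char demo_memory[DEMO_PAGES * DEMO_PAGE_SIZE]; /* fake RAM */

/*
 * Scan demo_mem_map from the last entry toward the first, claim the first
 * entry that is 0, zero the page, and return its "physical" address.
 * Returns 0 if every page is in use.
 */
static unsigned long demo_get_free_page(void)
{
    int i;

    for (i = DEMO_PAGES - 1; i >= 0; i--) {
        if (demo_mem_map[i] == 0) {
            demo_mem_map[i] = 1;                      /* mark used */
            memset(demo_memory + (unsigned long)i * DEMO_PAGE_SIZE,
                   0, DEMO_PAGE_SIZE);                /* clear the page */
            return DEMO_LOW_MEM + (unsigned long)i * DEMO_PAGE_SIZE;
        }
    }
    return 0;
}
```

Scanning from the end means the highest free page is handed out first, which is what the "first (actually last :-)" remark in the kernel comment refers to.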

Appendix


  1. www.cnblogs.com/taek/archiv…