Author: Luo Jia/Post editor: Zhang Handong

The evolution from segments to pages is a major advance in operating system memory abstraction. Page storage management is implemented by software and hardware. Software provides mapping and hardware speeds up address translation. In order to better design hardware, software pages are often required to meet certain data requirements.

This note attempts to tease out the page implementation process at the software level. It will cover allocation algorithms for large pages, alignment of physical addresses, compatible designs for multiple paging modes, and how to implement them using generics, modularity, and other modern language techniques.

It may not be clear, please bear with me.

Large page allocation algorithm

Page storage system has the concept of “big page”. In some architectures, large pages require address alignment. That is, the physical address of a large page must be aligned to a specific level, and subsequent low-level page numbers and indexes are collectively used as longer indexes. This puts forward certain requirement to allocation algorithm.

Summarize the problems that need to be solved. Our problem is: given a page table system M, the maximum rank is N; For virtual page number v = (vn-1,… V1, v0) and the number of consecutive pages to be mapped n, find a set of page mapping U = {(v_ start, p_ start, length)}, so that the page number v can be mapped to the physical page number P. Consider using a large page to optimize the process, make the | | U as little as possible, but need to pay attention to alignment requirements, that is, large page to rank for k, the starting address for.vn – 1,… vk, 0, … If the alignment requirement of class K is A [k], v = 0 (mod a[k]).

In M = Sv39 systems, the maximum grade n = 3, alignment requirements A [0] = 1, a[1] = 512, a[2] = 26’2144[note 1]. That is, each level contains 512 page table entries.

To simplify the problem understanding, we define a page table system with M = simple, and the maximum rank is also n = 3, but a[0] = 1, A [1] = 4, a[2] = 16. Each level contains three page table entries. Given the virtual address v = (2, 2, 2) to v + n = (5, 2, 1), map to the physical address p = (2, 2, 2).

Now we are at the highest level 3 and will assign r(v) = (2, 2, 2).. =(5, 2, 1) to p =(2, 2, 2). If you use large pages directly, you must ensure that both levels 2 and 1 are numbered 0. That is, only r(v) = (3, 0, 0).. 5, 0, 0, p is equal to 3, 0, 0.

Why not assign (2, 0, 0).. How about 3, 0, 0 to 2, 0, 0? Because this would allocate more than (2, 2, 2).. Phi is equal to the range of 5, 2, 1, which is different from the problem we’re trying to solve. Similarly, it cannot assign (5, 0, 0).. (6, 0, 0) to (5, 0, 0). The scope of this section needs to be accomplished with the help of lower level large pages.

The level 3 allocation is complete, then we use the level 2 page table to allocate nearby fragmentary (2, 3, 0).. (3, 0, 0) and (5, 0, 0).. (5, 2, 0). Assign (2, 3, 0).. (3, 0, 0) to (2, 3, 0); (5, 0, 0).. 5, 2, 0, to 5, 0, 0.

Finally, with the aid of a level 1 page table, assign (2, 2, 2).. (2, 3, 0) to (2, 2, 2); (5, 2, 0).. 5, 2, 2, to 5, 2, 0. At this point, all allocations are complete.

Because this method requires the allocation of physical large pages on the virtual large page, the base address of the two large pages must have the same alignment, therefore, the difference between the virtual page number and physical page number must be aligned to the corresponding page, that is, v = p (mod a[k]), then this allocation method can be used. Otherwise a lower level of allocation should be used. (Think about why.) Otherwise, an exception for alignment requirements will be generated during allocation.

For example, v = (2, 2, 2) to P = (2, 2, 2), we can use level 3; If v = (2, 2, 2) to P = (2, 1, 2), you can only use level 2.

Generalizing this result, we can get a rule. First, using the greedy method of rounding in the upper and lower limits of the address, the largest large page is allocated to the highest address range. Then, for the fragmentary ranges on both sides, use the page table one level lower and continue the allocation. Continue lowering the level until all pages are assigned.

The starting level of the algorithm is progressively reduced based on the alignment rules and the number of pages required. For a page table system of rank n = 3, we write pseudo-code for order rules in the form of branches.

// input: v: VirtPageNum, p: PhysPageNum, n: usize, a: PageMode; if (v - p) % (a[2].frame_align()) == 0 && n >= a[2].frame_align() { let l2n = (vs2 - ve2) / a[2].frame_align(); map(2, ve2, vs2, ve2-v+p); let l1n = (ve2 - ve1 + vs1 - vs2) / a[1].frame_align(); map(1, ve1, ve2, ve1-v+p); map(1, vs2, vs1, vs2-v+p); let l0n = (n + ve1 - vs1) / a[0].frame_align(); map(0, v, ve1, p); map(0, vs1, v+n, vs1-v+p); } else if (v - p) % (a[1].frame_align()) == 0 && n >= a[1].frame_align() { let l1n = (vs1 - ve1) / a[1].frame_align(); map(1, ve1, vs1, ve1-v+p); let l0n = (n + ve1 - vs1) / a[0].frame_align(); map(0, v, ve1, p); map(0, vs1, v+n, vs1-v+p); } else if (v - p) % (a[0].frame_align()) == 0 && n >= a[0].frame_align() { let l0n = n / a[0].frame_align(); map(0, v, v+n, p); } else { panic! ("Can't map v to p under this page mode") }Copy the code

What we find is that the intermediate variable of the low-ranking algorithm also appears in the high-ranking algorithm. So this algorithm can be changed to a loop.

// input: v: VirtPageNum, p: PhysPageNum, n: usize, M: PageMode; For I in M::visit_levels_until(PageLevel::leaf_level()) {let align = M::a(I); If (v - p) % align! = 0 | | n < align {/ / alignment requirements are not up to grade, or quantity is not enough, use low-level algorithms continue; } let (mut ve_prev, mut vs_prev) = (None, None); For j in M: : visit_levels_from (I) {/ / traversal sequence: [j, j - 1, 0]... let a = M: : a (j); Let ve_cur = a * roundup(v/a) let vs_cur = a * rounddown((v + n)/a) if let (Some(ve_prev), Some(vs_prev)) = (ve_prev, vs_prev) { map(j, ve_cur.. ve_prev); // Execute the mapping function map(j, vs_prev.. vs_cur); } else { map(j, ve_cur.. vs_cur); } (ve_prev, vs_prev) = (Some(ve_cur), Some(vs_cur)); } break; }Copy the code

This algorithm can be applied to any level of page table system, so the problem is solved. A better algorithm implementation can be achieved by wrapping the algorithm with generators or iterators in Rust.

The traditional allocation algorithm maps all the addresses in the address segment to the smallest page frame. At this point, you allocate as many pages as you need to manage page frames. The large page allocation algorithm accomplishes the same task by allocating fewer pages that meet alignment requirements. How do we compare the number of pages assigned by the large page allocation algorithm with the traditional one?

Let’s take the M = Sv39 system. Where, the maximum grade n = 3, alignment requirements A [0] = 1, A [1] = 512, A [2] = 26’2144.

Also assign 505’5550 page frames, assuming page number alignment meets the maximum 26’2144, using a different virtual page number.

Virtual page number The number of Number of page tables required save Level 0 Level 1 Level 2
0 505 ‘5550 227 0.00% 62 146 19
10 505 ‘5550 1249 0.02% 574 657 18
20 505 ‘5550 1249 0.02% 574 657 18
512 505 ‘5550 738 0.01% 62 658 18
1024 505 ‘5550 738 0.01% 62 658 18
1025 505 ‘5550 1249 0.02% 574 657 18
26 ‘2144 505 ‘5550 227 0.00% 62 146 19
100 ‘0000 505 ‘5550 738 0.01% 574 145 19
Virtual page number The number of Number of page tables required save Level 0 Level 1 Level 2
30 ‘0000 1 1 100.00% 1 N/A N/A
30 ‘0000 10 10 100.00% 10 N/A N/A
30 ‘0000 100 100 100.00% 100 N/A N/A
30 ‘0000 1000 489 48.90% 488 1 N/A
30 ‘0000 1 ‘0000 291 2.91% 272 19 N/A
30 ‘0000 10 ‘0000 355 0.36% 160 195 N/A
30 ‘0000 100 ‘0000 995 0.10% 64 929 2
30 ‘0000 1000 ‘0000 752 0.01% 128 587 37

It can be found that when Sv39 requires high alignment and a large number of page frames, a large page only needs less than one thousand page tables to manage one million page frame Spaces, which greatly reduces the number of page tables. When the number of page frames is small, the amount saved is not obvious because the alignment requirement is not high; When alignment requirements are low, the savings are not significant.

In practice, try to give the maximum alignment requirements, so that when allocating a large number of page frames, save more page frame space. The results also have implications for the layout of peripherals. If the advanced embedded chip has many peripherals, try to place the physical address of the peripherals on higher alignment requirements, so that the operating system management can free up more memory space for use.

Academician Cheng Huiming: Inheriting the wisdom of our ancestors, he advocated the “4-digit separation method” for writing Arabic numerals in Chinese. Proceedings of the Chinese Academy of Sciences

Abstract software design

Taking Rust language as an example, the abstract method of common structure of page system is given.

Page number

Start by defining physical and virtual page numbers.

#[derive(Copy, Clone, PartialEq, Eq, Debug)]
pub struct PhysPageNum(usize);

#[derive(Copy, Clone, PartialEq, Eq, Debug)]
pub struct VirtPageNum(usize);
Copy the code

The “physical page length + offset length” may be greater than the “schema address width”, which allows us to access addresses larger than the schema width. For example, in RISC-V RV32, Sv32 systems can access 34 bits of physical addresses, even if the architecture is only 32 bits.

The physical page number and virtual page number can be translated by the corresponding address.

impl PhysPageNum {
    pub fn addr_begin<M: PageMode>(&self) -> PhysAddr {
        PhysAddr(self.0 << M::FRAME_SIZE_BITS)
    }
}
Copy the code

This transformation relationship requires the schema of the page table to be entered. The offset of the address may be different for different architectures.

Page frame splitter

Then we need a page frame allocator. Imitating the design of the Rust language alloc package, the structure can be given as follows.

pub trait FrameAllocator {
    fn allocate_frame(&self) -> Result<PhysPageNum, FrameAllocError>;
    fn deallocate_frame(&self, ppn: PhysPageNum);
}
Copy the code

When constructing a page frame allocator, a range of physical pages should be given.

Then, each time an allocation is requested, the algorithm in it returns either the result of the allocation or an error if no page exists.

impl StackFrameAllocator {
    pub fn new(start: PhysPageNum, end: PhysPageNum) -> Self {
        StackFrameAllocator { current: start, end, recycled: Vec::new() }
    }
}
Copy the code

The page frame allocator only allocates numbers and does not store or read data into allocated memory, so its design is simple with the Alloc library.

This is designed to facilitate testing of the correctness and performance of the page frame splitter.

Page frames for boxing

Or FrameBox, borrowing the title Box name from Rust, denotes a page frame with ownership.

#[derive(Debug)] pub struct FrameBox<A: FrameAllocator = DefaultFrameAllocator> { ppn: PhysPageNum, // equivalent to * muT type pointer frame_alloc: A,}Copy the code

Each time a new page frame is created, a new page frame is retrieved from the page frame allocator frame_alloc and then wrapped properly using ownership semantics. When its life cycle ends, the page frame allocator is called to release the occupied page frames.

Impl <A: FrameAllocator> FrameBox<A> {// Allocate page frames and create FrameBox pub fn try_new_in(frame_alloc: A) -> Result<FrameBox<A>, FrameAllocError> { let ppn = frame_alloc.allocate_frame()? ; Ok(FrameBox { ppn, frame_alloc }) } } impl<A: FrameAllocator> Drop for FrameBox<A> {fn Drop (&mut self) {// Drop self.frame_alloc.deallocate_frame(self.ppn); }}Copy the code

The boxed page frame actually holds ownership of the page frame’s memory, allowing data to be written in and read from it.

Page address space

A structure representing a paging system implementation that holds all contained boxes of page frames and releases all of them on release.

This structure has a type parameter for paging mode, which is used to compute the page frame insertion algorithm.

// The address space that represents a paging system implementation // // should not be used if it is a direct mapping or linear offset mapping. #[derive(Debug)] pub struct PagedAddrSpace<M: PageMode, A: FrameAllocator = DefaultFrameAllocator> { root_frame: FrameBox<A>, frames: Vec<FrameBox<A>>, frame_alloc: A, page_mode: M, }Copy the code

When a page address space is created, a root page table is immediately allocated.

Impl <M: PageMode, A: FrameAllocator + Clone> PagedAddrSpace<M, A> {// Create an empty paging address space. Pub fn try_new_in(page_mode: M, frame_alloc: A) -> Result<Self, FrameAllocError> {let mut root_frame = FrameBox::try_new_in(frame_alloc.clone())? ; Unsafe {fill_frame_with_initialized_page_table::<A, M>(&mut root_frame)}; Ok(Self { root_frame, frames: Vec::new(), frame_alloc, page_mode }) } }Copy the code

After creating the structure, when inserting a new mapping, use the insert algorithm provided in the previous section to get the range to be inserted, and then read and write the page frame box to complete the insert.

Impl <M: PageMode, A: FrameAllocator + Clone> PagedAddrSpace<M, A> {// Set A page entry. Unsafe fn alloc_get_table(&mut self, entry_level: PageLevel, vpn_start:); unsafe fn alloc_get_table(&mut self, entry_level: PageLevel, vpn_start: VirtPageNum) -> Result<&mut M::PageTable, FrameAllocError> { let mut ppn = self.root_frame.phys_page_num(); for &level in M::visit_levels_before(entry_level) { let page_table = unref_ppn_mut::<M>(ppn); let vidx = M::vpn_index(vpn_start, level); match M::slot_try_get_entry(&mut page_table[vidx]) { Ok(entry) => ppn = M::entry_get_ppn(entry), Err(mut slot) => {let frame_box = FrameBox::try_new_in(self.frame_alloc. M::slot_set_child(&mut slot, frame_box.phys_page_num()); ppn = frame_box.phys_page_num(); self.frames.push(frame_box); } } } // println! ("[kernel-alloc-map-test] in alloc_get_table PPN: {:x? }", ppn); let page_table = unref_ppn_mut::<M>(ppn); // PPN is the current page table that needs to be modified // an unconstrained life cycle is created. Ok(&mut *(page_table as *mut _))} pub fn allocate_map(&mut Self, VPN: VirtPageNum, ppn: PhysPageNum, n: usize, flags: M::Flags) -> Result<(), FrameAllocError> { for (page_level, vpn_range) in MapPairs::solve(vpn, ppn, n, self.page_mode) { // println! ("[kernel-alloc-map-test] PAGE LEVEL: {:? }, VPN RANGE: {:x? }", page_level, vpn_range); let table = unsafe { self.alloc_get_table(page_level, vpn_range.start) }? ; let idx_range = M::vpn_index_range(vpn_range.clone(), page_level); // println! ("[kernel-alloc-map-test] IDX RANGE: {:? }", idx_range); for vidx in idx_range { let this_ppn = PhysPageNum(ppn.0 - vpn.0 + M::vpn_level_index(vpn_range.start, page_level, vidx).0); // println! ("[kernel-alloc-map-test] Table: {:p} Vidx {} -> Ppn {:x? }", table, vidx, this_ppn); match M::slot_try_get_entry(&mut table[vidx]) { Ok(_entry) => panic! ("already allocated"), Err(slot) => M::slot_set_mapping(slot, this_ppn, flags.clone()) } } } Ok(()) } }Copy the code

Wrapper page frame allocation algorithm

The definition and implementation are as follows.

#[derive(Debug)] pub struct MapPairs<M> { ans_iter: alloc::vec::IntoIter<(PageLevel, Range<VirtPageNum>)>, mode: M, } impl<M: PageMode> MapPairs<M> { pub fn solve(vpn: VirtPageNum, ppn: PhysPageNum, n: usize, mode: M) -> Self { let mut ans = Vec::new(); /* omit the solution */ Self {ans_iter: ans.into_iter(), mode } } } impl<M> Iterator for MapPairs<M> { type Item = (PageLevel, Range<VirtPageNum>); fn next(&mut self) -> Option<Self::Item> { self.ans_iter.next() } }Copy the code

Each iteration of it returns a page frame that should be allocated. Based on this result, the mapping should be set up.

The activation function

The implementation of this function depends on the architecture. Here risC-V Sv39 is used as an example.

// Switch the address space and provide 1. Detailed Settings of the address space 2. Pub unsafe FN Activate_paged_RIScv_SV39 (root_ppn: PhysPageNum, asID: AddressSpaceId) { use riscv::register::satp::{self, Mode}; satp::set(Mode::Sv39, asid.0 as usize, root_ppn.0); asm! ("sfence.vma {}", in(reg) asid.0 as usize); }Copy the code

After executing, you are in the new address space. Note that the current PC address is still unchanged. If the code segment corresponding to the instruction has disappeared after entering the new space, an exception will be generated. Therefore, it is common to use “springboard pages” that are mapped in each virtual space to accomplish all this process.

Advantages and disadvantages of this design

The advantage of this design is that you will find that you can automatically fill in the rest of the algorithm simply by passing in the generic parameter M, which represents the page table schema.

For example, the RISC-V Sv39 schema can be implemented as a page table schema, passing in a generic parameter M; It is defined as follows.

// Sv39 paging mode; #[derive(Copy, Clone, PartialEq, Eq, Debug)] pub struct Sv39; impl PageMode for Sv39 { const FRAME_SIZE_BITS: usize = 12; const PPN_BITS: usize = 44; type PageTable = Sv39PageTable; type Entry = Sv39PageEntry; type Slot = Sv39PageSlot; type Flags = Sv39Flags; /* omit a large number of utility functions */}Copy the code

Simply by implementing these parameters in pattern M, the page table space system, including the solution algorithm, can be used seamlessly.

The complete code implementation is here.

The downside of this approach is that it requires a programming language that supports generics. For example, if the operating system kernel is written in Rust, you can use this programming approach.

Now common hardware page table system

Risc-v provides Sv39 and Sv48; These are level 3 and level 4 page table systems, and the higher the level, the more virtual space you can manage.

We describe the basic parameters of these page table systems using the descriptive methods of the previous section.

Page table system M A virtual address Physical address Level of n Alignment requirement A
RISC-V Sv39 39 55 3 1, 512, 26, 2144
RISC-V Sv48 48 55 4 1, 512, 26 ‘2144,’ 3421 ‘, 7728
RISC-V Sv32 32 34 * 2 1, 1024,
The godson LA64 48 60 4 1, 4096/2048*, …
The godson LA32 32 36 2 1, 1024,
arm64 48 39 4 1, 512, 26 ‘2144,’ 3421 ‘, 7728
arm32 32 32 2 1, 256,
X86-64 (old) 48 47 4 1, 512, 26 ‘2144,’ 3421 ‘, 7728
X86-64 (new) 57 52 5 1, 512, 26’2144, 1’3421’7728, 687’1947’6736
x86-32 32 32 2 1, 1024,

* The physical address of Sv32 does exceed 32 bits

* In Loongson LA64 architecture, the alignment requirements of two-page storage structures are different

Sv39 can be activated by the kernel when the system boots, and the page table has fewer levels, lower overhead, and faster startup. Later, on demand, you can move to a page table system with more space to accommodate more applications.

Page memory management notes

At the end of the article, we spent some time trying to organize notes on the concept of pager memory management.

Page memory management

To manage mounted peripherals and chunks of memory, we define physical addresses, which are numbers for resource units in real hardware. All physical addresses form an address space; An address space is a collection of specific resources and hardware that can be indexed by an address.

Our application can run directly on a physical address. However, to make it easy for programs to monopolize the address space and to plan memory continuously, we construct virtual Spaces for them. The program can then run on a virtual address, which is the number of memory cells in the virtual space.

We used to manage memory in segments. To reduce internal fragmentation and memory waste, we introduced a paging management system.

Paging systems divide the address space into contiguous blocks of equally large memory called page frames. The size of the page is determined by the hardware implementation, and the software must specify the size of the page according to the hardware.

The data structures that manage pages are called page tables. In memory, page tables are typically the size of a page frame for management by the paging system on hardware. The location in memory where the page table is stored is called the page number.

Pages larger than one page frame are also called large pages. In a page table that manages large pages, each item represents a subpage. This item is called a page table entry for a page table, and is typically made up of permission bits, control bits, and a physical page number.

Consider the address space as the largest page, and the root page table is the page table that manages the largest page. The page number of the root page table is saved in a dedicated location to be used as the starting address for hardware queries.

The translation process is aligned with the address

In modern page table systems, regardless of level, the data structure of the management page is called the page table. The page table translation process is roughly equivalent to this step:

  1. Retrieves the page number of the root page table
  2. Read a specific area of the virtual address as the index of the page table at this level
  3. Extract the page table entry that manages smaller pages, based on the index
  4. If the page entry points to a sub-page table, read the page number and return to Step 2

A very large number of exceptions can occur in this simplified step. Note that if the current page entry points to a physical page number, the physical page number has alignment requirements.

If we use (v0, v1, v2):offset represents a virtual address.

If rt[v0][v1] points to the subpage table, the search continues for RT [v0][v1][v2]. Get the physical page numbers (P0, P1, P2). Combined with off, the physical address (P0, P1, p2) is offset.

If rt[v0][v1] points to a physical page, this is a page entry for a large page. In the page entry, the physical page number (P0, P1, 0) is obtained, and the physical address corresponding to the large page is (P0, P1):v2,offset. Equivalent to lengthening the number of bits of offset.

In the latter form, the hardware usually requires that the physical page number be valid only for levels higher than the large page and invalid for levels lower than it. That is, if the p2 of the page entry (P0, P1, p2) is not equal to zero, a page exception is returned. This is the alignment requirement for a large page table system.

About the author:

Los better

Undergraduate student, School of Cyberspace Security, Huazhong University of Science and Technology, love operating system, embedded development. 3 years experience in Rust language development, active community contributor. At present, we are committed to promoting Rust language to industry, teaching and other fields.