Preface

As a component of the computer system, memory is closely tied to developers' daily work: segmentation faults, OutOfMemory errors, memory leaks, and GC are all related to it. The memory discussed in this article refers to the main memory of the computer system; in the storage pyramid it sits between the CPU cache and the disk, and it is indispensable to running programs.

In computer systems, main memory is usually managed by the operating system (OS), and the details of memory management are invisible to developers. The average developer only needs to know how to call the programming language's interfaces for allocating and freeing memory to write a usable application. If the language has garbage collection, such as Java or Go, you don't even have to free memory explicitly. But if you want to write efficient applications, familiarity with OS memory management becomes essential.

Below, we will start from the simplest memory management scheme and walk through the OS memory management mechanisms step by step, so that you become familiar with how memory is managed under the hood and can write more efficient applications.

Exclusive memory management

In early single-tasking systems, only one application at a time could use all the memory (apart from what the OS occupied), so memory management could be as simple as dividing memory into two regions:

In a multi-tasking system, the computer runs multiple tasks concurrently. If memory were still managed exclusively, every task switch would involve copying data back and forth between memory and disk, which is extremely inefficient:

The most intuitive solution is to keep all application data resident in memory (assuming memory is large enough) to avoid data copying:

But this brings up another problem: the memory address spaces of programs are not isolated from one another, meaning that program A can modify program B's data in memory. Such a serious security problem is unacceptable to users. To solve it, we need the power of virtualization.

Virtual address space

To achieve memory isolation between programs, the OS virtualizes physical memory: it gives each program a virtual address space that is mapped onto a region of physical memory. The program itself sees only its own virtual address space, which gives it the illusion of monopolizing the entire memory.

In the figure above, the virtual address space is divided into three areas. The Code area stores the program's machine instructions. The Heap area stores memory that is dynamically requested during execution. The Stack area stores function arguments, local variables, return values, and so on. The Heap and Stack grow as the program runs; they sit at opposite ends of the virtual address space and grow in opposite directions.

Mapping a virtual address space to physical memory requires a translation step, which is performed by the Memory Management Unit (MMU) located inside the CPU chip.

To translate a virtual address into a physical address, the MMU needs two registers, base and bound. The base register stores the program's starting address in physical memory; in Figure 5, for example, the base address of program A is 192KB. The bound register (sometimes called the limit register) holds the size of the virtual address space and is used to prevent out-of-bounds access; in Figure 5, the bound value of program A is 64KB. The address translation formula under this approach is:

Physical address = virtual address + base address

Going back to Figure 5, when program A attempts to access an address beyond its bound, the address translation fails:
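Below is a minimal C sketch of this base-and-bound scheme. The register values (192KB base, 64KB bound) come from Figure 5; the rest of the program is made up for illustration:

#include <stdint.h>
#include <stdio.h>

#define KB 1024u

/* MMU registers for program A as given in Figure 5: base = 192KB, bound = 64KB. */
static const uint32_t base  = 192 * KB;
static const uint32_t bound = 64 * KB;

/* Returns 0 when the access is out of bounds. */
static int translate(uint32_t vaddr, uint32_t *paddr) {
    if (vaddr >= bound)
        return 0;                /* out of bounds: the hardware raises a fault */
    *paddr = base + vaddr;       /* physical = virtual + base */
    return 1;
}

int main(void) {
    uint32_t paddr;
    if (translate(10 * KB, &paddr))
        printf("virtual 10KB -> physical %uKB\n", paddr / KB);  /* 202KB */
    if (!translate(70 * KB, &paddr))
        printf("virtual 70KB -> fault (beyond the 64KB bound)\n");
    return 0;
}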

Now, let's take a closer look at the physical memory layout of program A, shown in Figure 7. A large chunk of memory in the middle is in an "allocated but unused" idle state. Even though this part is free, it cannot be reallocated to other programs, which is a huge waste of space! To solve this problem of low memory utilization, the familiar segment memory management emerged.

Segment memory management

In the previous section, we saw that managing memory on a per-program basis leads to underutilized memory. To solve this problem, segment memory management was introduced. A Segment is a contiguous logical chunk of memory with a certain size; Code, Heap, and Stack are three such segments.

Segment memory management allocates memory on a per-segment basis, which allows the OS to flexibly place each segment in a free region of physical memory, thus avoiding "allocated but unused" memory areas:

Address translation

As shown in Figure 8, to translate virtual addresses into physical addresses, the MMU needs a pair of base-bound registers for each segment. For example:

Given a virtual address, how does the MMU know which segment the address belongs to, so that it can translate it to a physical address using that segment's base-bound registers?

Assume the virtual address is 16 bits. Since there are only three segments, we can use the upper two bits of the virtual address as the segment identifier, for example 00 for the Code segment, 01 for the Heap segment, and 11 for the Stack segment. The MMU can then locate the segment's base-bound registers from the virtual address:

However, this alone is not enough to translate the virtual address correctly; we have ignored an important point: the Heap and Stack grow in opposite directions, which means their address translation differs. Therefore, we must also use one more bit of the virtual address to indicate the segment's growth direction, for example 0 for growing up (toward lower addresses) and 1 for growing down (toward higher addresses):

Let’s look at a set of examples of segment memory management address translation:

Then, the address translation algorithm of segment memory management is summarized as follows:

Segment        = (VirtualAddress & SEG_MASK) >> SEG_SHIFT   // which segment
Offset         = VirtualAddress & OFFSET_MASK                // offset within the segment
GrowsDirection = VirtualAddress & GROWS_DIRECTION_MASK       // growth direction bit

if (Offset >= Bounds[Segment])
    RaiseException(PROTECTION_FAULT)
else if (GrowsDirection == 0)
    PhysAddr = Base[Segment] + Offset
else
    PhysAddr = Base[Segment] - Offset

Memory sharing and protection

Segment memory management also makes it convenient to share memory and thereby save it. For example, multiple programs running the same executable can share the Code segment. To support this, we can add a protection bit to each segment: when the protection bit marks a segment read-only, that segment can be shared. Moreover, if a program tries to modify a read-only segment, the address translation fails, so memory protection is achieved as well.
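As a rough sketch (the Protect array, the READ_ONLY flag, and IsWriteAccess are hypothetical names, not part of the algorithm shown earlier), the translation step only gains one extra check:

// Protect[Segment] holds the protection bits of the segment being accessed
if ((Protect[Segment] & READ_ONLY) && IsWriteAccess)
    RaiseException(PROTECTION_FAULT)   // writing to a shared, read-only segment
else
    // fall through to the usual base-bound translation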

Memory fragmentation

The most obvious disadvantage of segment memory management is fragmentation: the segments of the programs running on a system vary in size, and their placement is not contiguous. When there is a lot of memory fragmentation, memory utilization drops dramatically; the system may be unable to run a program even though it appears to have plenty of free memory.

One way to deal with memory fragmentation is to defragment periodically:

However, defragmentation is costly. On the one hand, it requires many memory copies; on the other hand, the running programs must be stopped while data is being copied, which greatly hurts the user experience of interactive applications.

Another solution is page memory management, which I’ll cover next.

Page memory management

The idea of Page memory management is to divide both virtual and physical memory into fixed-size areas called pages. Pages are the minimum unit of memory allocated, and an application’s virtual pages can be stored in any free physical page.

Pages in physical memory are often referred to as Page frames.

Because the page size is fixed and is the smallest unit of allocation, the fragmentation problem in segment memory management can be solved.

However, fragmentation can still occur inside a page (internal fragmentation).

Address translation

A Page Table is used to translate virtual addresses into physical addresses. The key steps of address translation are as follows:

1) Find the corresponding physical page frame according to the virtual page

Each virtual page has a Virtual Page Number (VPN); correspondingly, each physical page frame has a Physical Frame Number (PFN). The page table stores the mapping between VPNs and PFNs.

2) Find the address Offset within the physical page frame

The offset of the address within the physical page frame is the same as the offset within the virtual page.

We can divide the virtual address into two parts, storing the VPN and the Offset respectively; then we can look up the PFN through the VPN and obtain the actual physical address from PFN + Offset.

For example, if the virtual address space is 64 bytes (a 6-bit address) and the page size is 16 bytes, then the virtual address space has 4 pages in total, so we can use the upper 2 bits to store the VPN and the lower 4 bits to store the Offset:

The VPN (01) is looked up in the page table to find the corresponding PFN (111). The in-page offset of the virtual and physical address is the same (0100), so the virtual address 010100 corresponds to the physical address 1110100.
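Here is a small C sketch of this 6-bit example. Only the entry VPN 01 -> PFN 111 comes from the example above; the other page-table entries are made-up placeholders:

#include <stdint.h>
#include <stdio.h>

#define VPN_SHIFT   4      /* 16-byte pages: low 4 bits are the offset */
#define OFFSET_MASK 0x0Fu

/* VPN -> PFN; only the entry for VPN 1 (-> PFN 7, i.e. 111) is from the text. */
static const uint8_t page_table[4] = { 3, 7, 5, 2 };

static uint8_t translate(uint8_t vaddr) {
    uint8_t vpn    = vaddr >> VPN_SHIFT;    /* upper 2 bits */
    uint8_t offset = vaddr & OFFSET_MASK;   /* lower 4 bits */
    return (uint8_t)((page_table[vpn] << VPN_SHIFT) | offset);
}

int main(void) {
    /* virtual 010100 (0x14): VPN = 01, offset = 0100 -> physical 1110100 (0x74) */
    printf("0x%02X -> 0x%02X\n", 0x14, translate(0x14));
    return 0;
}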

Page table and page table entries

The OS assigns a page table to each program and stores it in memory. A page table consists of multiple Page Table Entries (PTEs); we can think of a page table as an array whose elements are PTEs:

Take the PTE layout on x86 as an example. A PTE occupies 32 bits in total. Besides the PFN, it carries some important flag bits:

  • P (Present): whether the page is currently in memory (or swapped out to disk).

  • R/W (Read/Write): whether the page is writable (or read-only).

  • U/S (User/Supervisor): whether the page may be accessed from user mode.

  • A (Accessed): whether the page has been accessed, which is especially useful for identifying hot pages and making page-swapping decisions.

  • D (Dirty): whether the page has been modified.
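A sketch of these flags as C bit masks, following the conventional x86 32-bit PTE layout (P = bit 0, R/W = bit 1, U/S = bit 2, A = bit 5, D = bit 6, PFN in bits 12-31); this is for illustration and omits the other x86 bits:

#include <stdint.h>

#define PTE_P   (1u << 0)   /* Present: page is in physical memory      */
#define PTE_RW  (1u << 1)   /* Read/Write: writable when set            */
#define PTE_US  (1u << 2)   /* User/Supervisor: user access when set    */
#define PTE_A   (1u << 5)   /* Accessed: set by hardware on any access  */
#define PTE_D   (1u << 6)   /* Dirty: set by hardware on a write        */

static inline uint32_t pte_pfn(uint32_t pte)      { return pte >> 12; }
static inline int      pte_present(uint32_t pte)  { return (pte & PTE_P)  != 0; }
static inline int      pte_writable(uint32_t pte) { return (pte & PTE_RW) != 0; }
static inline int      pte_user(uint32_t pte)     { return (pte & PTE_US) != 0; }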

Disadvantages of page memory management

Address translation is inefficient

According to the above introduction, we can summarize the algorithm of address translation under the page memory management mechanism as follows:

VPN     = (VirtualAddress & VPN_MASK) >> SHIFT
PTEAddr = PTBR + (VPN * sizeof(PTE))

// Access main memory to fetch the PTE
PTE = AccessMemory(PTEAddr)

// Validation
if (PTE.Valid == False)
    RaiseException(INVALID_ACCESS)
else
    // Compute the in-page offset and the physical address
    Offset   = VirtualAddress & OFFSET_MASK
    PhysAddr = (PTE.PFN << PFN_SHIFT) | Offset

We find that every address translation requires one extra access to main memory to read the page table, which makes it much less efficient than segment memory management (which needs no extra main-memory access).

Page tables take up too much space

If the address space is 32-bit and the page size is fixed at 4KB, then the entire address space has 2^{32}/4KB = 2^{20} pages, that is, the page table has 2^{20} PTEs. Assuming each PTE is 4 bytes, each page table takes up 4MB of memory. If 100 programs are running on the system, their page tables alone take up 400MB of memory, which is also unacceptable to users.

Next, we’ll look at how to optimize these two significant disadvantages of page memory management.

Faster address translation for page management

TLB: Translation Lookaside Buffer

As mentioned earlier, page memory management makes address translation inefficient because of the extra main-memory access. If we can avoid or reduce accesses to main memory, address translation becomes faster.

Many people will think of improving efficiency by adding a cache. For example, to avoid frequent disk queries, we usually add a layer of cache in memory to speed up data access. Likewise, to speed up access to the page table in main memory, it is natural to add a layer of cache closer to the CPU. That place closer to the CPU is the MMU inside the CPU chip mentioned earlier, and the cache is the TLB (Translation Lookaside Buffer), which caches the VPN-to-PFN mapping, something like this:

After the TLB is added, the address translation algorithm looks like this:

VPN = (VirtualAddress & VPN_MASK) >> SHIFT
(Success, TlbEntry) = TLB_Lookup(VPN)
if (Success == True)    // TLB Hit
    if (CanAccess(TlbEntry.ProtectBits) == True)
        Offset   = VirtualAddress & OFFSET_MASK
        PhysAddr = (TlbEntry.PFN << SHIFT) | Offset
    else
        RaiseException(PROTECTION_FAULT)
else                    // TLB Miss
    PTEAddr = PTBR + (VPN * sizeof(PTE))
    PTE = AccessMemory(PTEAddr)
    if (PTE.Valid == False)
        RaiseException(SEGMENTATION_FAULT)
    else if (CanAccess(PTE.ProtectBits) == False)
        RaiseException(PROTECTION_FAULT)
    else
        TLB_Insert(VPN, PTE.PFN, PTE.ProtectBits)
        RetryInstruction()

From the above algorithm, we can see that on a TLB Hit the extra access to main memory is avoided, which improves the efficiency of address translation. However, on a TLB Miss, main memory still has to be accessed once, and the PFN fetched from main memory must then be inserted into the TLB, which is even slower. Therefore, we should avoid TLB Misses as much as possible.

Make better use of TLB

Below, let’s use an array traversal example to show you how to make better use of TLB.

Suppose we want to do the following array traversal:

int sum = 0;
for (i = 0; i < 10; i++) {
    sum += a[i];
}

The memory distribution of the array is as follows:

a[0]~a[2] are located on Page 5, a[3]~a[6] on Page 6, and a[7]~a[9] on Page 7. When we first access a[0], a TLB Miss occurs because Page 5 is not yet in the TLB cache; the subsequent accesses to a[1] and a[2] are then TLB Hits. Similarly, the accesses to a[3] and a[7] are TLB Misses, while a[4]~a[6] and a[8]~a[9] are TLB Hits. The access sequence is therefore: Miss, Hit, Hit, Miss, Hit, Hit, Hit, Miss, Hit, Hit, a TLB cache hit ratio of 70%. We can see that accesses to data on the same page are more likely to hit the TLB; this is spatial locality at work.

Next, suppose we iterate over the array a second time. Since the translations for Page 5 through Page 7 are already stored in the TLB cache, the second pass looks like this: Hit, Hit, Hit, Hit, Hit, Hit, Hit, Hit, Hit, Hit, a TLB cache hit ratio of 100%! This is temporal locality at work: accessing the same memory data within a short period of time also improves the TLB cache hit ratio.

TLB and context switches

Because the TLB caches address translations for the currently running program, its contents must be updated when the OS switches to another program; otherwise address translation will go wrong. There are two main solutions:

  • Method 1: Flush the TLB cache with each program switch, allowing the program to rebuild the cache as it runs.

  • Method 2: Allow TLB to cache the context information of multiple programs and distinguish them by the Address Space Identifier (ASID).

Method 1 is simple to implement, but the cache has to be re-warmed after every program switch, which is inefficient. The mainstream approach is method 2.

Note that the TLB is embedded in the CPU chip; on a multi-core system, if a program migrates between CPUs, the TLB cache has to be rebuilt on the new core. Therefore, pinning an application to a fixed core can help improve performance.
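For instance, on Linux a process can pin itself to a fixed core with sched_setaffinity. This is only a hedged illustration of the idea; whether pinning actually helps depends on the workload:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);   /* allow this process to run only on core 0 */

    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to core 0\n");
    return 0;
}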

Making the page table smaller

Large page memory

The easiest way to reduce the page table size is to make pages bigger. In the previous example, with a 32-bit address space, 4KB pages, and 4-byte PTEs, each page table requires 4MB of memory. Now, if we increase the page size to 16KB and keep everything else the same, each page table only needs 2^{32}/16KB = 2^{18} PTEs, or 1MB of memory, reducing the footprint by a factor of 4.

Large pages also make better use of the TLB, because the number of translations the TLB can cache is fixed. With fewer, larger pages, cached entries are evicted less frequently, the TLB hit ratio rises, and address translation becomes more efficient.
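On Linux, for example, an application can explicitly request memory backed by large pages through mmap with the MAP_HUGETLB flag. This is just a sketch; it assumes the administrator has reserved huge pages beforehand (e.g. via /proc/sys/vm/nr_hugepages):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

int main(void) {
    size_t len = 16 * 1024 * 1024;   /* 16MB, i.e. eight 2MB huge pages on x86-64 */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        /* fails when no huge pages are reserved; a real allocator would fall back */
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }
    /* ... use the memory as usual; the kernel maps it with large pages ... */
    munmap(p, len);
    return 0;
}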

Segment-page memory management

As mentioned above, the space between the stack and the heap in a program's address space is often unused, so a large number of PTEs in the page table are invalid. However, because the page table must cover the entire address space, these invalid PTEs can only be left sitting in the page table, resulting in a large waste of space.

Earlier, we solved the problem of wasted memory between the heap and the stack with segment memory management. We can apply the same idea to the page table: introduce segmentation into page memory management, that is, segment-page memory management, to solve the problem of wasted page table space.

The specific approach is to divide the program's address space into multiple segments, such as Code, Heap, and Stack, and then do paging separately within each segment, that is, give each segment its own page table:

As the figure above shows, once the page table is split by segment, it no longer needs to hold PTEs for the unused pages in between, which saves memory.

Multi-level page tables

Another common way to reduce page table size is the multi-level page table. It also cuts down the number of PTEs for unused pages, but in a different way from segment-page memory management: if segment-page management can be thought of as segmenting the page table, the multi-level page table can be thought of as paging the page table.

Specifically, the page table is divided into several parts of a fixed size, each tracked by a Page Directory (PD) entry. If all the pages covered by a page directory entry are unused, there is no need to allocate PTEs for those pages at all.

Taking a two-level page table as an example, the figure below compares the structure of a regular page table with that of a multi-level one:

Next, let's compare the space consumption of a regular page table and a two-level page table. Assume a 32-bit address space, a 4KB page size, and 4-byte PTEs, so there are 2^{20} pages and the regular page table requires 4MB of memory. Now we split the 2^{20} pages into 2^{10} groups, that is, 2^{10} Page Directory Entries (PDEs), each managing 2^{10} pages. Assuming a PDE also occupies 4 bytes and, following the 80/20 rule, that 80% of the pages are unused, the two-level page table only needs 2^{10} * 4B + 2^{20} * (1 - 80%) * 4B, which is about 0.804MB!
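The address translation itself simply walks two levels instead of one. Here is a hedged C sketch with a simplified PDE/PTE layout (not the real x86 format): a 10-bit directory index, a 10-bit table index, and a 12-bit offset:

#include <stdint.h>

/* Simplified two-level page-table structures for a 32-bit address space
 * with 4KB pages. Illustrative layouts only. */
typedef struct { uint32_t valid : 1; uint32_t pfn : 20; } PTE;
typedef struct { uint32_t valid; PTE *table; } PDE;

/* Walks both levels; returns 0 (a page fault) and leaves *paddr untouched on failure. */
static int walk(const PDE *page_dir, uint32_t vaddr, uint32_t *paddr) {
    uint32_t pd_index = (vaddr >> 22) & 0x3FF;  /* top 10 bits    */
    uint32_t pt_index = (vaddr >> 12) & 0x3FF;  /* middle 10 bits */
    uint32_t offset   =  vaddr        & 0xFFF;  /* low 12 bits    */

    const PDE *pde = &page_dir[pd_index];
    if (!pde->valid)          /* whole region unmapped: no PTEs were ever allocated */
        return 0;

    const PTE *pte = &pde->table[pt_index];
    if (!pte->valid)
        return 0;

    *paddr = ((uint32_t)pte->pfn << 12) | offset;
    return 1;
}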

Therefore, multi-level page tables can effectively reduce the memory consumed by page tables, and they are common in practice; Linux, for example, uses a 4-level page table structure.

Swap Space: the disk swap area

So far, we have assumed that physical memory is large enough to hold the virtual address spaces of all programs, but this is often unrealistic. Take a 32-bit address space as an example: one program has 4GB of virtual memory, so 100 programs would need 400GB of physical memory (ignoring sharing)! Moreover, not all pages are needed all the time; often only a fraction of them are in use.

Therefore, we can save a lot of physical memory by keeping temporarily unused pages on disk and loading them into memory only when they are needed. The area of the disk where these pages are stored is called the Swap Space.

Under this mechanism, a Page Fault is triggered when a program accesses an address on a page that is not in memory. Just as a TLB miss incurs extra overhead, a page fault makes memory access less efficient: to handle it, the OS must load the data from the swap space back into memory, and when free memory is insufficient, the OS must also swap certain pages out of memory into the swap space. This back-and-forth disk I/O is what makes memory access so much slower when page faults occur.

Therefore, when free memory is insufficient, the page-replacement (swap-out) policy becomes extremely important: if a page that is about to be accessed gets swapped out, we pay a cost that could have been avoided. There are many page-replacement policies, including FIFO (first in, first out), Random, LRU (least recently used), and LFU (least frequently used). Under common workloads, FIFO and Random perform poorly and are seldom used in practice. LRU and LFU are based on historical memory-access statistics, so they perform better and see wider practical use; the page-replacement algorithms of many mainstream operating systems are optimizations built on LRU or LFU.
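To make the LRU idea concrete, here is a toy C sketch (not how any real OS implements page replacement): each resident frame remembers a logical timestamp of its last access, and the victim is the frame with the oldest timestamp:

#include <stdint.h>

#define NUM_FRAMES 4

typedef struct {
    int      vpn;        /* virtual page held by this frame, -1 if free */
    uint64_t last_used;  /* logical timestamp of the most recent access */
} Frame;

static Frame    frames[NUM_FRAMES] = { {-1, 0}, {-1, 0}, {-1, 0}, {-1, 0} };
static uint64_t now;

/* Returns the frame holding vpn; on a page fault, evicts the least
 * recently used frame and (conceptually) swaps the page in from disk. */
static int access_page(int vpn) {
    int victim = 0;
    now++;
    for (int i = 0; i < NUM_FRAMES; i++) {
        if (frames[i].vpn == vpn) {            /* hit: just refresh the timestamp */
            frames[i].last_used = now;
            return i;
        }
        if (frames[i].last_used < frames[victim].last_used)
            victim = i;                        /* candidate: least recently used */
    }
    /* page fault: swap in from the swap space, evicting the LRU victim */
    frames[victim].vpn       = vpn;
    frames[victim].last_used = now;
    return victim;
}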

Conclusion

This article introduced some basic principles of OS memory management, from exclusive memory management to page memory management, a progression that has gone through many optimizations. All of these optimizations revolve around three goals:

1. Transparency. The details of memory management are invisible to the program; in other words, the program can think of itself as monopolizing the entire memory space.

2. Efficiency. Address translation and memory access must be fast so that programs do not run too slowly; space utilization must also be high, without wasting too much memory.

3. Protection. The OS itself must not be affected by applications, and applications must not be able to affect one another.

Of course, the memory management mechanisms of today's mainstream operating systems (such as Linux and macOS) are far more complex than the principles introduced here, but in essence they are still built on the basic ideas described in this article.

References

1. Operating Systems: Three Easy Pieces, by Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau

2. Why Can HugePages Improve Database Performance, faith-oriented programming

3. Exploring the Principles of CPU Scheduling, Yuan Runzi's invitation

For more articles, follow the WeChat official account: Yuan Runzi's invitation.