What is memory management

Computer programs are stored on the hard disk after they are written. When a program is executed, its instructions and data are loaded into memory, and the CPU then reads them from memory to do its work. This raises the question of why programs must be loaded into memory for execution rather than being run directly from the hard disk. The answer is that the overall performance of a computer depends on how well the CPU and RAM work together: hard disks are far slower to read and write than RAM, so running from disk would leave the CPU idle most of the time. On the other hand, more RAM makes the machine more expensive to build.

In general, the combined size of the programs a machine must run is larger than its RAM capacity, often much larger. How and what program information should be loaded into memory at run time therefore becomes an unavoidable question. Memory management refers to the allocation and use of a computer's memory resources while software is running.

Memory abstraction

In an operating system, the part responsible for managing the memory hierarchy is called the memory manager. In the beginning there was no memory abstraction: each program accessed physical memory directly, that is, the set of addresses from 0 up to some limit.

With a single process this is fine, but with multiple processes it is not. When programs access physical memory directly, reads and writes are unconstrained, and a memory address used by one program can easily be overwritten by another, which can be fatal.

Exposing physical addresses to programs has the following serious problems:

  • Programs can easily (intentionally or unintentionally) corrupt the operating system, even if only one program is running
  • It is difficult to run multiple programs simultaneously, since their reads and writes cannot be protected from one another

Therefore, physical addresses must be hidden from programs. Memory abstraction provides this, allowing the contents of running programs to be protected and relocated.

Address space

An address space is the set of addresses that a process can use to address memory, independent of the address spaces of other processes. Given an address, the hardware must know the range the process may access; dynamic relocation can then simply map each process's address space onto a different part of physical memory.

The classic approach is to equip the CPU with two special hardware registers, the base register and the limit register.

The limit register stores the length of the program, and the base register stores its starting address in physical memory. For example, if a program is loaded at base address 16380 and wants to jump to address 28, the instruction

JMP 28

will be translated into

JMP 16408 (16408 = 16380 + 28)

This method makes address spaces easy to implement, but it has a drawback: every memory access requires an addition and a comparison. The comparison can be done quickly, but without special circuitry the addition is slow because of carry-propagation time.
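
As a sketch of what such hardware does on every memory reference, here is the check and relocation in C; the register values are taken from the example above and are purely illustrative:

    #include <stdint.h>
    #include <stdio.h>

    uint32_t base  = 16380;  /* base register: where the program starts */
    uint32_t limit = 16384;  /* limit register: illustrative program length */

    /* Translate a program address to a physical address, as the hardware
     * would: one comparison (protection) plus one addition (relocation). */
    int translate(uint32_t vaddr, uint32_t *paddr) {
        if (vaddr >= limit)
            return -1;            /* out of bounds: protection fault */
        *paddr = base + vaddr;    /* relocate relative to the base */
        return 0;
    }

    int main(void) {
        uint32_t paddr;
        if (translate(28, &paddr) == 0)
            printf("JMP 28 -> JMP %u\n", paddr);  /* prints JMP 28 -> JMP 16408 */
        return 0;
    }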

Whatever its speed, this approach carries an implicit assumption: that physical memory is large enough to hold all processes at once, which is impractical.

It is far more common for the running applications to need more memory in total than physically exists. Keeping all processes in memory all the time is impossible, so there must be a way to decide which programs should be in memory at any given moment. Two general approaches are swapping and virtual memory.

Swapping

Swapping loads a process completely into memory, runs it for a while, and then writes it back to disk.

Over time, as processes are swapped in and out, multiple small free areas (holes) appear in memory. Too many small holes may leave no area large enough for a given allocation. Memory compaction can move the occupied regions downward to merge the small holes into one large free area, but this consumes a lot of CPU time and gets slower as memory grows.

If a process's data segment can grow (many languages allow memory to be allocated dynamically from the heap), problems arise when the process tries to expand. If the adjacent area is a hole, the hole can be allocated to it. If the process abuts another process, it must be moved to a larger hole, or one or more processes must be swapped out to create one. Worse, if the process cannot grow in memory and the swap area on disk is full, it can only be suspended until space frees up, or killed.

The problem becomes acute when most programs allocate memory dynamically. Allocating some extra memory to a process in advance is easy, but when a process is swapped to disk, the system should swap only the memory it is actually using. Dynamic allocation requires memory to be managed: the system must keep track of which memory is in use and which is free. Two usable methods are memory management with bitmaps and memory management with linked lists.

Memory management with bitmaps

To track memory usage with a bitmap, memory is split into allocation units, which may be anywhere from a few words to a few kilobytes. Each unit corresponds to one bit in the bitmap: 0 means the unit is free, 1 means it is occupied. The size of the allocation unit matters: the smaller the unit, the larger the bitmap; the larger the unit, the more memory is wasted (the last unit allocated to a process is usually not used up). Another disadvantage is that to load a process needing k units, the bitmap must be searched for a run of k consecutive 0 bits, which is a slow operation.
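
The slow search is easy to see in code. A minimal sketch in C, assuming 1024 allocation units and an illustrative bitmap layout:

    #include <stddef.h>

    #define UNITS 1024
    static unsigned char bitmap[UNITS / 8];   /* 1 bit per unit: 0 free, 1 used */

    static int unit_used(size_t i) {
        return (bitmap[i / 8] >> (i % 8)) & 1;
    }

    /* Find k consecutive free units; returns the index of the first one,
     * or -1 if none. This linear scan over the whole bitmap is the
     * time-consuming operation mentioned above. */
    long find_free_run(size_t k) {
        size_t run = 0;
        for (size_t i = 0; i < UNITS; i++) {
            run = unit_used(i) ? 0 : run + 1;
            if (run == k)
                return (long)(i - k + 1);
        }
        return -1;
    }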

Memory management with linked lists

Alternatively, a linked list can record allocated and free memory. Each node records a flag (P for an area occupied by a process, H for a hole), the start address, the length in allocation units, and a pointer to the next node. A linked list is easy to update as processes come and go, but deciding whether neighboring nodes can be merged takes time, and if holes are never merged, the list will eventually have no node large enough to hold a new process.
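
Such a node might look as follows in C; the field names are illustrative:

    /* One node per region of memory, kept in order of start address. */
    struct region {
        char           flag;    /* 'P' = process (occupied), 'H' = hole (free) */
        unsigned       start;   /* first allocation unit of the region */
        unsigned       length;  /* size in allocation units */
        struct region *next;    /* following region, or NULL at the end */
    };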

Several algorithms exist for searching the list to find a free area for a process being loaded into memory.

The simplest is the first fit algorithm: scan the list, take the first hole that is big enough, and split it into two parts, a new P node for the process and a new H node for the remainder. This algorithm is fast because it searches as little as possible.
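
A sketch of first fit over the region list above, with the split logic spelled out (malloc failure handling omitted for brevity):

    #include <stdlib.h>

    /* Scan from the head; take the first hole big enough for `need`
     * units, splitting off the unused remainder as a new hole. */
    struct region *first_fit(struct region *head, unsigned need) {
        for (struct region *r = head; r != NULL; r = r->next) {
            if (r->flag != 'H' || r->length < need)
                continue;
            if (r->length > need) {              /* split: P node + new H node */
                struct region *rest = malloc(sizeof *rest);
                rest->flag   = 'H';
                rest->start  = r->start + need;
                rest->length = r->length - need;
                rest->next   = r->next;
                r->next      = rest;
            }
            r->flag   = 'P';                     /* front part is now occupied */
            r->length = need;
            return r;
        }
        return NULL;                             /* no hole is large enough */
    }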

Next fit differs from first fit in that it remembers where it last found a suitable hole. The next search starts from that recorded position rather than from the beginning of the list.

Best fit searches the entire list to find the smallest hole that is adequate, hoping to avoid breaking up large holes that might be needed later. Counterintuitively, in practice it produces many tiny, useless holes, wasting memory.

Quick fit maintains separate lists for free areas of common sizes, such as 4KB, 8KB, 12KB, and so on, so finding a hole of a requested size is very fast.

Whichever algorithm is used, the common weak point is that merging holes is time-consuming. How rarely an algorithm forces merging is therefore an important criterion when judging it.

Virtual memory

Base and limit registers provide an abstraction of the address space, but they cannot handle a total process size larger than memory. Swapping can handle that, but as programs grow, swapping an entire program in and out takes longer and longer, and eventually becomes unacceptable.

In fact, only part of a program executes at any given moment, so loading the whole program into memory is unnecessary. Instead, the program can be divided into pieces, and only the pieces currently involved are loaded at run time. This is where virtual memory comes in.

With virtual memory, each program has its own address space, which is divided into chunks called pages. Each page is a contiguous range of addresses. As the program executes, the pages it touches are mapped into physical memory and execution continues. Virtual memory is generally implemented with paging or segmentation.

Paging

Each program believes it has a real address space of its own, but these addresses are virtual and do not directly correspond to memory locations, hence the name virtual address space. When code at a virtual address is to be executed, the address must be mapped into memory. The MMU (Memory Management Unit) takes care of this, mapping virtual addresses onto physical memory.

The virtual address space is divided into fixed-size units called pages, and the corresponding units of physical memory are called page frames. The CPU sends a virtual address to the MMU, which translates it into the physical address of the page frame actually holding the data and puts that address on the memory bus.

For example, with 4KB pages, if virtual page 2 (addresses 8192 to 12287) is mapped to page frame 4, the instruction

MOV REG, 8192 will be converted to MOV REG, 16384.

With proper MMU management, many more pages can exist than page frames. Pages must then be swapped in and out, and a bit in each page table entry indicates whether the page is currently in memory. When a page that is not in memory is accessed, the CPU traps into the kernel; this is called a page fault. The operating system then picks a suitable page frame, writes its old page back to disk if necessary, loads the new page, updates the mapping, and re-executes the instruction that caused the fault.

Given pages and frames, there must be a way to find the mapping between them: the page table. A page table is a data structure whose exact form varies from machine to machine, but the information in a page table entry is roughly the same everywhere. Each entry records, for one page (a sketch follows the list):

  • Cache disabled bit: whether caching is disabled for the page; sometimes data must be read directly from an I/O device rather than from the cache
  • Protection bits: the allowed access types; commonly three bits for read, write, and execute
  • Modified bit: whether the page has been written; a modified (dirty) page must be written back to disk before its frame is reused
  • Present/absent bit: whether the page is currently in memory
  • Frame number: the number of the page frame holding the page
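
A sketch of a page table entry and lookup in C, assuming 4KB pages and a single-level table; the bit layout is illustrative, since real layouts vary from machine to machine:

    #include <stdint.h>

    #define PAGE_SHIFT   12                      /* 4KB pages */
    #define PTE_PRESENT  (1u << 0)               /* present/absent bit */
    #define PTE_MODIFIED (1u << 1)               /* modified (dirty) bit */
    #define PTE_FRAME(e) ((e) >> PAGE_SHIFT)     /* frame number in high bits */

    /* Translate a virtual address, or return -1 to signal a page fault. */
    int64_t translate(const uint32_t *page_table, uint32_t vaddr) {
        uint32_t page   = vaddr >> PAGE_SHIFT;
        uint32_t offset = vaddr & ((1u << PAGE_SHIFT) - 1);
        uint32_t pte    = page_table[page];
        if (!(pte & PTE_PRESENT))
            return -1;                           /* page not in memory */
        return ((int64_t)PTE_FRAME(pte) << PAGE_SHIFT) | offset;
    }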

Speeding up paging

Now that you know about paging, two questions arise:

  1. The mapping of virtual addresses to physical addresses must be very fast
  2. If the virtual address space is very large, the page table is also very large

For the first problem: every memory reference requires a mapping, and most instructions reference operands in memory, so each instruction may make one, two, or more page table references. The page table lookup must therefore take only a small fraction of the instruction's execution time, say under one fifth, or mapping becomes the dominant bottleneck.

For the second problem: each process needs its own page table. A 32-bit address space with 4KB pages means about a million pages, and a 64-bit address space means an astronomical number of them. If full page tables are kept in memory, they become expensive to hold and to switch between on every process context switch once programs are large enough.

Solving or at least mitigating these two problems is imperative, or paging is impractical. Hence the TLB and multilevel page tables.

TLB (Translation Lookaside Buffer)

High-value optimizations tend to share one idea: it pays to give longer residence to the items most likely to be accessed, which is much of the point of caching. Observation shows that most programs make a great many references to a small number of pages within any period of time, so only a few page table entries are read over and over.

So instead of going to the page table each time, a small hardware device inside the computer does the mapping. This device is the TLB (Translation Lookaside Buffer), also called an associative memory. It is usually placed in the MMU and holds only a few entries (rarely more than 256).

The important information of each TLB entry is as follows:

  • Valid bit: whether the entry is in use and may be consulted
  • Virtual page number: the page this entry translates
  • Modified bit: the dirty bit; a modified page must be written back to disk before its frame is reused
  • Protection bits: the allowed access modes, read, write, and execute
  • Frame number: the page frame holding the page

Now, when a virtual page is referenced, the MMU first checks the TLB. If the page is there (a hit), the frame number is taken directly, with no page table access; the TLB only updates a few status bits. If the page is not there (a miss), the MMU does a normal page table lookup, evicts a suitable TLB entry, and installs the new mapping in its place.
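
A sketch of that lookup in C, with a small fully associative TLB whose entries follow the list above (real hardware checks all entries in parallel; the loop here stands in for that):

    #include <stdint.h>

    #define TLB_SIZE 64

    struct tlb_entry {
        int      valid;     /* valid bit */
        uint32_t vpage;     /* virtual page number */
        uint32_t frame;     /* page frame number */
        int      modified;  /* dirty bit */
    };

    static struct tlb_entry tlb[TLB_SIZE];

    /* Returns 1 on a hit and fills in the frame; 0 on a miss, after
     * which the page table must be walked and one entry replaced. */
    int tlb_lookup(uint32_t vpage, uint32_t *frame) {
        for (int i = 0; i < TLB_SIZE; i++) {
            if (tlb[i].valid && tlb[i].vpage == vpage) {
                *frame = tlb[i].frame;
                return 1;
            }
        }
        return 0;
    }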

TLB management can also be done in software. A TLB miss for a page that is actually in memory is called a soft miss: the TLB is simply updated, and no disk I/O occurs. A miss for a page that is not in memory is a hard miss, which requires a page table walk and disk I/O; a hard miss can take millions of times longer to handle than a soft miss.

Multilevel page tables

The TLB speeds up translation for frequently accessed pages; the other problem is how to represent very large virtual address spaces at reasonable cost. Multilevel page tables are one solution.

A multilevel page table avoids keeping all page tables in memory all the time: only the tables currently needed are resident.

Take a 32-bit virtual address split into a 10-bit PT1 field, a 10-bit PT2 field, and a 12-bit Offset. On receiving a virtual address, the MMU first uses PT1 as an index into the top-level page table, then uses PT2 to index the second-level page table that the top-level entry points to, and finally applies Offset within the page frame named by the page table entry.
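
A sketch of that two-level walk in C, assuming the classic 10-bit PT1 / 10-bit PT2 / 12-bit Offset split:

    #include <stddef.h>
    #include <stdint.h>

    #define PT1(v)    (((v) >> 22) & 0x3FF)   /* top 10 bits */
    #define PT2(v)    (((v) >> 12) & 0x3FF)   /* middle 10 bits */
    #define OFFSET(v) ((v) & 0xFFF)           /* low 12 bits */
    #define PRESENT   (1u << 0)

    /* top[] holds 1024 pointers to second-level tables; NULL means the
     * whole second-level table is absent. Returns the physical address
     * or -1 to signal a page fault. */
    int64_t walk(uint32_t *const top[1024], uint32_t vaddr) {
        uint32_t *second = top[PT1(vaddr)];
        if (second == NULL)
            return -1;                        /* second-level table absent */
        uint32_t pte = second[PT2(vaddr)];
        if (!(pte & PRESENT))
            return -1;                        /* page fault */
        return ((int64_t)(pte >> 12) << 12) | OFFSET(vaddr);
    }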

Compared to single-page tables, multi-level page tables have the following benefits:

  1. A single flat page table must be stored in memory in its entirety, and its size depends entirely on the size of the address space. A multilevel page table divides it into many small tables of the same size, and only the tables actually in use need to be brought into memory
  2. With a flat table, the page table itself must occupy one large contiguous region, whose size depends on the program. A multilevel table splits that region into many small pieces, so the required contiguous spaces are much smaller and easier to place
  3. Multilevel page tables do require more memory references per translation, lengthening access time, but sacrificing some access time to relieve memory pressure is worth it in most scenarios
  4. The scheme extends naturally to more than two levels

Page replacement algorithms

Now that we know how paging implements virtual memory, the next question is which page to evict from its page frame when a page fault occurs and a frame must be freed. That choice is the job of the page replacement algorithm.

NRU (Not Recently Used page replacement algorithm)

Most computers with virtual memory keep two status bits for each page: the R bit, set to 1 whenever the page is referenced (read or written), and the M bit, set to 1 whenever it is modified. A page in a page frame is therefore in one of four states:

  1. Not referenced, not modified
  2. Not referenced, modified
  3. Referenced, not modified
  4. Referenced, modified

The NRU algorithm evicts a page at random from the lowest-numbered nonempty state, preferring the not-recently-referenced pages of states 1 and 2. It does not perform best, but it is good enough and easy to implement.
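
A sketch of the choice in C; the page array and the random pick are simplified:

    #include <stdlib.h>

    struct page { int r, m; };   /* referenced and modified bits */

    /* States 1-4 above map to classes 0-3 here: class = 2*R + M. */
    static int nru_class(const struct page *p) { return p->r * 2 + p->m; }

    /* Evict a random page from the lowest nonempty class. */
    int nru_victim(const struct page *pages, int n) {
        for (int c = 0; c < 4; c++) {
            int count = 0;
            for (int i = 0; i < n; i++)
                if (nru_class(&pages[i]) == c)
                    count++;
            if (count == 0)
                continue;
            int pick = rand() % count;       /* random page within the class */
            for (int i = 0; i < n; i++)
                if (nru_class(&pages[i]) == c && pick-- == 0)
                    return i;
        }
        return -1;                           /* no pages at all */
    }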

FIFO (First-In, First-Out page replacement algorithm)

As the name suggests, the pages are kept on a linked list in the order they were loaded, and the algorithm evicts the oldest page in memory, the one at the head of the list. The drawback is just as obvious: the evicted page may well be a heavily used one.

Second chance page replacement algorithm

An upgraded version of FIFO. Since FIFO may evict heavily used pages, this algorithm first checks the R bit of the oldest page. If R is 0, the page is old and unused and is replaced immediately. If R is 1, the bit is cleared and the page is moved to the end of the list, as if it had just arrived. If every page has been referenced, the algorithm degenerates into pure FIFO.

On each clock tick, all R bits are cleared.

Clock page replacement algorithm

The second chance algorithm keeps moving pages to the end of the list, which is unnecessary and inefficient. The clock algorithm instead keeps the pages on a circular list with a hand pointing at one of them. When a page fault occurs, if the page under the hand has R = 0, that page is evicted and the hand advances one position; if R = 1, R is cleared and the hand advances to examine the next page. Because the hand clears R bits as it sweeps past referenced pages, it is always guaranteed to find a page to replace.
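
A sketch of one eviction in C; frames sit in a circular array and hand is the clock pointer:

    struct frame { int r; int page; };   /* R bit and resident page number */

    /* Advance the hand until a page with R == 0 turns up, clearing R
     * bits along the way; reuse that frame for new_page and return the
     * number of the evicted page. */
    int clock_evict(struct frame *frames, int nframes, int *hand, int new_page) {
        for (;;) {
            struct frame *f = &frames[*hand];
            *hand = (*hand + 1) % nframes;
            if (f->r == 0) {                 /* not referenced: evict it */
                int victim = f->page;
                f->page = new_page;
                f->r = 1;
                return victim;
            }
            f->r = 0;                        /* referenced: clear and move on */
        }
    }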

LRU (Least Recently Used page replacement algorithm)

Evict the page that has gone unused for the longest time. The algorithm is theoretically realizable, but the cost is very high: if the pages are kept on a linked list ordered by recency, the accessed page must be moved to the end of the list on every memory reference, and such frequent deletions and insertions are unacceptable.

Alternatively, the hardware can provide a 64-bit counter that is incremented after each instruction; each page table entry stores the counter value at the page's most recent access. On a page fault, the entries are examined and the page with the smallest stored value is the least recently used.

Working set page replacement algorithm

The working set model is built on the assumption that a program accesses a particular set of pages heavily over any period of time, and real programs do exhibit exactly this locality.

Define the working set w(k, t) as the set of pages referenced during the k most recent memory accesses, measured at time t. The working set page replacement algorithm tries to predict which page in this set is least likely to be accessed in the near future and evicts that page when a page fault occurs. In an implementation:

  • The exact value of k matters little, because w(k, t) grows very slowly with k
  • Memory accesses are counted per process, so each process effectively has its own virtual time

The working set algorithm works as follows. A periodic clock interrupt clears the R bits; whenever a page is referenced, its R bit is set to 1 and its time of last use is recorded. A threshold T serves as the criterion: a page whose age (current virtual time minus time of last use) exceeds T is considered to have left the working set. When a page fault occurs, the pages are scanned:

  1. If R is 1, set the page's time of last use to the current virtual time
  2. If R is 0 and the page's age is greater than T, evict the page
  3. If R is 0 and the page's age is at most T, remember the page with the greatest age

So, if a page in case 2 is found, it is evicted. If there are case-3 pages but no case-2 page, the remembered page with the greatest age is evicted. If neither exists (every page has R = 1), a page is chosen at random.
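
A sketch of that scan in C; virtual time and T are in arbitrary ticks, and the random fallback is left to the caller when -1 is returned:

    struct wpage { int r; long last_use; };

    /* Scan all pages on a page fault and pick a victim. `now` is the
     * process's current virtual time, T the working set window. */
    int ws_victim(struct wpage *pages, int n, long now, long T) {
        int oldest = -1;
        for (int i = 0; i < n; i++) {
            if (pages[i].r) {                        /* case 1: in use */
                pages[i].last_use = now;
            } else if (now - pages[i].last_use > T) {
                return i;                            /* case 2: left the set */
            } else if (oldest < 0 ||
                       pages[i].last_use < pages[oldest].last_use) {
                oldest = i;                          /* case 3: track oldest */
            }
        }
        return oldest;   /* -1 if every page had R = 1: pick at random */
    }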

The purpose of the working set algorithm is to predict which pages are likely to be used in the near future and keep them resident, thereby reducing the page fault rate.

Working set clock replacement algorithm

The working set algorithm has a problem: on every page fault, all pages must be scanned to find one to evict, which is time-consuming. The working set clock (WSClock) algorithm improves on it.

The working set clock page replacement algorithm has a similar working mechanism to the clock algorithm:

  • When a page fault occurs, if the page under the hand has R = 0, an age greater than T, and is clean, it is reclaimed and the hand advances
  • If R = 1, R is cleared and the hand advances
  • If R = 0 and the page is old but M = 1, the page must be written back to disk in any case; the write is scheduled and the hand advances, since the page cannot be reclaimed until the write completes

Under this mechanism, the number of write-backs scheduled per sweep can be capped at some n to limit disk traffic. If the hand goes all the way around and returns to its start, two cases arise:

  1. At least one write has been scheduled. The hand simply keeps moving; some write will eventually complete, its page will become clean, and that page can be evicted
  2. No write has been scheduled, meaning every page is in the working set. In that case any clean page may be evicted; a clean page noticed during the sweep is remembered for exactly this purpose

Problems with paging systems

A paging system solves the problem of programs needing more memory than physically exists, but paging raises design questions of its own.

Page size

Page size is a parameter that can be chosen. Whatever the size, on average the last page of a loaded program wastes half of its space; this waste is called internal fragmentation. Larger pages mean more such waste, while smaller pages mean more pages and larger page tables. Transfers between memory and disk are done a page at a time, and most of the cost is seek and rotational delay, so transferring a small page takes almost as long as a large one. Small pages also use up the TLB's limited entries faster, since each entry covers less memory. All of these factors go into choosing a page size.

Looking at it mathematically, suppose the average process size is s bytes, the page size is p bytes, and each page table entry takes e bytes. A process then needs about s/p pages, occupying se/p bytes of page table space, and wastes p/2 bytes on average in internal fragmentation. Therefore:

overhead = se/p + p/2

The first term, the page table cost, is large when pages are small; the second term, internal fragmentation, is large when pages are big. The optimum lies somewhere in between. Taking the derivative with respect to p and setting it to zero:

-se/p^2 + 1/2 = 0

which gives

p = sqrt(2se)

For s = 1MB and e = 8 bytes, the optimal page size is 4KB. As a reference only, of course: the formula considers nothing but fragmentation waste and the memory the page table needs.

Shared pages

In large multiprogramming systems it is common for several users to run the same program at once, and it is clearly more efficient for those processes to share pages. For read-only pages this causes few problems; for writable data, a more elaborate mechanism is needed.

For example, the space a program occupies can be separated into instruction space and data space, also known as I-space and D-space, each with its own page table. Every process running the program then shares the same I-space pages while accessing its own D-space.

Sharing data is more delicate and uses copy-on-write: the shared data pages are mapped read-only, so the first write by any process triggers a protection trap into the operating system. A copy of the page is then made, and each process gets its own private copy; subsequent writes to the copy no longer trap. The payoff is that for programs that rarely write shared data, copies mostly never need to be made.

Shared libraries

To do their work, programs reference all kinds of libraries. For widely reused libraries, having every program keep its own copy on disk and load it into memory wastes a great deal of both disk and memory. If programs know such libraries can be shared, most of this waste is avoided.

The point of a shared library is that if a library has already been loaded by another program, the current program need not load it again. Nor is a shared library loaded in its entirety: it is loaded page by page, so functions that are actually called are brought into memory, while functions that are defined but unused, and other libraries they pull in, are not.

The benefits of shared libraries are:

  • Executable files are smaller, saving disk and memory space
  • When a bug in a shared library is fixed, the programs that call it need not be recompiled; they pick up the fixed library the next time they start

One issue needs attention when shared libraries meet paging: a shared library belongs to no particular program, and it may sit at a different place in each program's virtual address space, so its pages cannot be mapped under the assumption of any fixed location. When compiling a shared library, therefore, the compiler must not emit instructions that use absolute addresses; it must use relative addresses (position-independent code), so the library's code works correctly wherever it happens to be mapped.

Memory-mapped files

A shared library is really a special case of a more general facility, the memory-mapped file. The idea is that a process can issue a system call to map a file onto part of its virtual address space. In most implementations no file contents are read at mapping time; pages are read in one at a time as they are touched. When the process exits or unmaps the file, the modified pages are written back to disk.

One advantage of memory-mapped files is that two processes mapping the same file can communicate through shared memory: when one process writes into the mapped region, the other can immediately read the new data through its own mapping of the file.
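
On POSIX systems the system call is mmap. A minimal sketch, assuming a pre-existing file shared.dat at least one page long and trimming most error handling:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        size_t len = 4096;                   /* map a single page */
        int fd = open("shared.dat", O_RDWR); /* file other processes also map */

        /* MAP_SHARED makes writes visible to every process mapping the
         * file, and modified pages are written back to the file itself. */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        strcpy(p, "hello via the shared mapping");  /* a plain memory write */

        munmap(p, len);                      /* changed pages written back */
        close(fd);
        return 0;
    }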

Cleaning policy

Consider the case where every page frame is full, every page has been modified, and a page fault occurs. To bring in the new page there is no clean page to evict; the system must wait for modified pages to be written back to disk before a clean frame turns up. If more page faults arrive in the meantime, the waiting time becomes considerable.

To avoid this, a background process called the paging daemon periodically inspects memory; if too few page frames are free, it evicts pages using the chosen page replacement algorithm, writing modified ones back to disk so that clean frames are available in advance.

One possible cleaning policy is a two-handed clock: the front hand is driven by the paging daemon and writes dirty pages back to disk as it passes over them, while the back hand behaves like the standard clock algorithm, now more likely to land on clean pages.

Instruction backup

A page fault may strike at any point in an instruction's execution, either while fetching the instruction's opcode or while accessing an operand, and the operating system often cannot determine exactly where the instruction began. To guarantee that the instruction can be re-executed correctly once the fault has been handled, the hardware must back it up. For example, a hidden internal register can receive a copy of the program counter before each instruction executes; with this information, the effects of the instruction that caused the fault can be undone and the instruction restarted.

Segmentation

In paging, the virtual address space is one-dimensional, running from 0 to some maximum. For some problems it is better to have several independent address spaces than just one.

Consider a compiler, which builds many tables during compilation: the symbol table, the parse tree, the constant table, the source text being processed, and so on. These tables grow and shrink independently, and in a one-dimensional space the region planned for one table can run out while its neighbors have room to spare. Segmentation frees the programmer from managing the expansion and contraction of these tables by hand.

A segment is an independent address space provided by the machine, a linear sequence of addresses from 0 up to some maximum. Because each segment's addresses are independent, a segment can grow or shrink without affecting the other segments of the same program. A segment is also a logical entity, and it is usually large enough for what it holds. Sharing is easy with segments: a shared library, for instance, can be placed in a segment of its own and accessed by several processes, instead of keeping a copy in each address space.

Segmentation and paging are implemented in fundamentally different ways: pages are fixed length, segments are not. The problem with segments is that as they are created, swapped, and freed, gaps appear between them in physical memory, known as external fragmentation. External fragmentation wastes memory, and like the holes of swapping it can be dealt with by compaction.

Conclusion

Memory management is, at bottom, the problem of keeping running processes supplied with memory.

  • At first, programs were small compared with memory, so physical addresses could simply be exposed to the programmer, and memory was used directly.
  • Exposing physical addresses turned out to be unwise and unsafe, so the base and limit registers were introduced to bound the range each program can access.
  • As the total size of programs grew, base and limit registers could no longer fit every program into memory. Swapping arose: load the programs about to run, move the temporarily idle ones out to disk, and track memory usage with a bitmap or a linked list.
  • Once a single program became large enough, swapping it in and out whole was too slow, and a new approach was needed. Since most programs show locality, frequently accessing only certain parts at any given time, each program was given a virtual address space, and with paging the MMU loads just the required pages into memory at run time.
  • Finally, to exploit that locality better, the various page replacement algorithms, together with the TLB, let the pages with higher access frequency, or higher likelihood of future access, stay in memory longer, reducing the performance loss caused by excessive page faults and the resulting page traffic.

Most of this article comes from Modern Operating Systems, Fourth Edition. Please point out any mistakes.

References

Memory management

Don’t say that you don’t know Linux memory management, 10 graphs have it all figured out

Why multistage page tables?

Modern Operating Systems – Fourth Edition