Von Neumann told us that the CPU, consisting of the arithmetic logic unit and the control unit, is responsible for computation and for controlling program flow. The instructions and data needed for its operations are supplied by memory.

So, if you were the top designer of an operating system, what mechanism would you provide for the CPU to retrieve instructions and data from memory?

Write a Hello World in C and inspect the assembly with objdump. Here is a line picked at random:

mov    0x200aed(%rip),%rax        # 200fe8 <__gmon_start__>

This line of assembly code contains a memory address. Is this memory address a real address in physical memory?

Suppose it were a real physical address. A programmer cannot know the memory layout of the machine the program will eventually run on, so the addresses in the program would have to be translated into real physical addresses at compile time, separately for each target. That might just work under uniprogramming, but what about multiprogramming? How would different programs decide where to sit in memory?

The move from uniprogramming to multiprogramming was a big step forward in computing. The CPU rotates through time slices so that multiple programs appear to run simultaneously. Obviously, using real physical addresses in programs breaks this illusion: different programs have no idea which pieces of physical memory the others are using, and their memory is completely unprotected.

Therefore, the address in the program cannot be a real physical address, but it must have a certain mapping relationship with the real physical address. We call the address in the program a virtual address, which should have at least the following characteristics:

  • It can be mapped to real physical addresses through certain mechanisms
  • Ensure that the real physical addresses mapped by different programs (processes) are independent of each other
  • It should work automatically and be transparent to the application developer

With these three features, let’s take a look at how virtual memory works.

One program, one slot: segmentation

The most intuitive solution is to assign each program a separate block of memory, as shown in the figure below.

For each program, its virtual memory space starts at 0, and the base address register stores its starting address in the physical memory space. So, there is a relationship between physical and virtual addresses like this:

Physical address = virtual address + base address

Such address translation is done by hardware called a Memory Management Unit (MMU).

The bound register stores the size of the memory the program occupies (equivalently, the physical address of its upper boundary), and it provides basic memory protection: if the physical address produced by the MMU exceeds the bound, an exception is raised. Each CPU has one pair of base and bound registers, reloaded on every process switch, which keeps the memory of different processes independent.

At first glance this scheme seems to deliver all three features, yet practically no operating system uses it, because it must allocate one contiguous block in virtual memory and another in physical memory and then map between them. The drawbacks are obvious.

First, it easily causes memory fragmentation. Suppose that after some time two separate free blocks of 128 MB remain, but the user wants to run a program that needs 129 MB; under this mechanism the allocation fails. Memory could be swapped back and forth with disk to splice the free blocks together, but disk I/O is far too slow for chunks of data that large to be acceptable.

Second, it wastes a lot of memory. Applying the 80/20 rule to programs, the code a program runs most of the time may be less than 20 percent of the whole. The scheme above allocates all of the program's memory up front, leaving a large stretch between the heap and the stack sitting idle.

For now, the scheme above can be regarded as a crude form of "segmentation". Let's try breaking it into smaller pieces.

A typical Linux process's user-space memory contains the stack, shared libraries, the heap, data, code, and so on. For demonstration purposes we can group these into three segments: stack, heap, and code.

The program is divided logically into segments, each placed into its corresponding region of memory. This avoids the wasted space between heap and stack in the previous scheme, since memory is requested only when it is actually needed. It also enables sharing along the way. Under the previous scheme, shareable system libraries still had to be copied into every process's private space; with segmentation only one copy is required, and the virtual addresses of different processes all map to that single physical copy.

However, since segments vary in size, the memory fragmentation problem may not be much better than under the previous scheme.

In addition, none of the schemes above considers the size of the program itself. If a program is larger than physical memory, segmentation cannot solve that.

Split the segments further: paging

To solve the memory fragmentation caused by segmentation, we split both virtual memory and physical memory into fixed-size pages. On today's mainstream operating systems this size is usually 4 KB; some systems also support 8 KB, 16 KB, or 64 KB.

Matching virtual pages one-to-one with physical pages makes converting a virtual address into a physical one straightforward.

After paging, every page in both virtual and physical memory gets a page number, and data is located by an offset within the page. Since virtual and physical pages have the same size, the in-page offset needs no conversion; only the virtual page number must be translated into a physical page number.

The virtual address consists of the virtual page number and the in-page offset.

The operating system stores the mapping from virtual page numbers to physical page numbers in a page table: an array of page table entries (PTEs), each containing a valid bit, a physical page number, and other data. The page table uses the virtual page number directly as an index to find the corresponding PTE.

The third virtual page in the figure above is mapped to the second physical page. In fact, virtual pages can be mapped to any physical page, and contiguous virtual pages do not need to correspond to contiguous physical pages, which gives the operating system great freedom. It not only reduces memory fragmentation, but also facilitates data sharing between processes by mapping virtual pages of different processes to the same physical page.

To look up PTEs directly by virtual page number, all PTEs in the page table must be contiguous and created in advance. So the question is: how big is the page table?

For a 32-bit operating system, the addressing space is 2^32 = 4 GB and the page size is 4 KB, so 1 M PTEs are required. At 4 bytes each, one page table is 1 M × 4 B = 4 MB. For process isolation, every process must also have its own separate page table. Look at your own machine: hundreds of processes run at once, so page tables alone would consume hundreds of megabytes of memory, which is clearly unacceptable.

In fact, for most programs, you don’t need to take up all 4 GB of virtual memory, so there’s no need to allocate the entire page table at the beginning. Using multi-level page tables solves this problem.

Trading time for space: multi-level page tables

Using a 32-bit operating system as an example, let’s look at a simple two-level page table.

The first level is called the page directory and has 1 K entries. Each page directory entry points to a second-level page table of 1 K PTEs, for a total of 1 K × 1 K = 1 M PTEs, which covers the 4 GB addressing space.

For a 32-bit virtual address, the top 10 bits index the 1 K page directory entries, the next 10 bits index the 1 K PTEs inside the selected page table, and the remaining 12 bits are exactly the in-page offset of the 4 KB page.

Now calculate the size of a two-level page table: 1 K first-level page directory entries take 4 KB, and 1 M second-level PTEs take 4 MB, 4.004 MB in total.

So a two-level page table takes up more memory than a flat one? Not really.

First, be clear that no matter how many levels a page table has, it must be able to cover the entire virtual address space. A flat single-level table needs all of its PTEs initialized up front to do so. A two-level table needs only the 1 K first-level page directory entries up front; second-level page tables are created only when needed. That saves a considerable amount of memory.

In addition, second-level page tables do not have to stay in memory; they can be swapped out to disk. This is not something designed specifically for multi-level page tables; it is a general property of paged virtual memory, and it is also why a program can be larger than actual physical memory.

Page hits and page faults

Recall the addressing process described earlier: the memory management unit (MMU) takes the virtual address, finds the corresponding page table entry (PTE), converts it into a physical address, and locates the data in physical memory. This ideal case is called a page hit: the data is fetched from memory directly based on the virtual address.

However, the data cannot always be fetched directly through the PTE in memory. In the most typical case the program is larger than physical memory, so inevitably some of its data is not resident. And since multi-level page tables are not fully created up front, the required PTE itself may not be in memory either.

At any given moment, a virtual memory page can be divided into three states:

  • Unallocated: Pages that have not been allocated (or created). There is no data associated with it and no disk space is occupied
  • Cached: Allocated pages currently cached in physical memory
  • Uncached: Allocated pages not cached in physical memory

Only cached virtual pages can produce a page hit. The PTE contains a valid bit indicating whether the entry is usable; say 1 means valid and 0 means invalid.

If the valid bit is 1, the PTE is usable and the data can be read directly. If it is 0 then, setting aside illegal memory addresses, the page is either unallocated or uncached, and the data cannot be read from memory directly. This situation is called a page fault.

When a page fault occurs, the system's page fault handler takes over. It picks a victim page from memory according to a specific algorithm; if the victim's data has been modified, it is written back to disk. The handler then loads the required page into the victim's frame and updates the PTE. When the handler returns, the instruction that caused the fault, i.e. the addressing operation, is re-executed, and this time the page hits.

As you can see, a page fault is a very expensive operation, so the operating system must minimize how often one occurs, and choosing the right victim page is therefore a major problem. If you evict a page that is about to be accessed, you only have to bring it straight back; such thrashing is unacceptable. For detailed replacement algorithms, see Operating Systems: Three Easy Pieces, Chapter 22, "Beyond Physical Memory: Policies".

Add a layer of cache to the page table: the TLB

Back to the page table. With a flat, single-level page table, translating a virtual address to a physical one costs only one memory access, but the table itself takes up a lot of memory. Most operating systems therefore use multi-level page tables; current 64-bit Linux, for example, uses four levels, with a much smaller footprint but at the cost of four memory accesses per translation. It trades time for space.

Moreover, the virtual addresses of a sequence of instructions tend to be contiguous during execution; several consecutive virtual addresses usually fall within the same virtual page, and so they all correspond to the same physical page. Yet no matter how the page table is designed, every access to those adjacent virtual addresses still walks the page table. This is a point worth optimizing.

Any problem in computer science can be solved by adding another level of indirection.

To keep the small memory footprint of multi-level page tables while avoiding the redundant memory accesses, add another layer of cache.

The TLB (Translation Lookaside Buffer) goes by several translated names, so don't get hung up on the terminology. The TLB is a cache built into the CPU: it is a cache of the page table, storing mappings from virtual addresses to PTEs.

During address translation, the TLB is checked first for a PTE matching the virtual address. On a TLB hit, the page table is not accessed at all; the data is fetched from the CPU cache or main memory using the physical address cached in the TLB. On a TLB miss, the process resembles a page fault: an exception is raised and the TLB miss handler takes over. It walks the page table, finds the corresponding PTE, and refills the TLB. When the handler finishes, the original instruction is executed again, and this time the TLB hits. As you can imagine, the TLB hit ratio directly affects system performance.

Conclusion

Let me start with why I wrote this article.

While reading about Binder recently, I found that without a deep understanding of mmap it is hard to grasp the underlying logic of Binder's single-copy design, and without a good grasp of the virtual memory mechanism it is hard to understand how mmap works. One thing led to another, and I was forced to learn it all.

In fact, many problems in programming ultimately come down to solutions and compromises made around the characteristics of computers and operating systems. A solid grounding in operating systems is a great help for learning anything programming-related. On the other hand, operating-system knowledge is vast and messy, and I cannot guarantee this article is free of mistakes. If you disagree with anything above, feel free to discuss it with me in the comments.

Finally, a diagram summarizes how virtual memory works.

Follow my WeChat public account Luyao TM and reply "virtual memory" to get the original high-resolution diagram.

This article will continue to be updated on my blog.

References:

  • Computer Systems: A Programmer's Perspective, Chapter 9: Virtual Memory
  • Operating Systems: Three Easy Pieces
  • Principles of Computer Organization
  • Interesting Talk about the Linux Operating System
  • Illustrated: Virtual-to-Physical Address Translation and Data Access