Before introducing the TLB, let's review a basic operating system concept: virtual memory.

Virtual memory

From the user's perspective, each process has its own independent address space. The 4GB of process A and the 4GB of process B are completely independent and unrelated; what each process sees is the virtual address space provided by the operating system. However, a virtual address must ultimately be turned into a physical address in real memory before it can be used. The operating system uses the page table mechanism to translate a process's virtual addresses into physical addresses, and every page has a fixed size. I don't want to go into too much detail here, so if you're not familiar with this concept, go back to your operating system textbook.

Two key points of page table management are the page size and the number of page table levels.

  • 1. Page size

On Linux, you can run the following command to view the page size of the current operating system:

# getconf PAGE_SIZE
4096  

You can see that the page size on my current Linux machine is 4KB.
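If you prefer to query the page size from code rather than from the shell, the same value is exposed through sysconf(). A minimal sketch:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Query the page size of the running system; this is the same value
     * that `getconf PAGE_SIZE` prints. */
    long page_size = sysconf(_SC_PAGESIZE);
    printf("page size: %ld bytes\n", page_size);
    return 0;
}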

  • 2. Page table levels
    • The fewer the page table levels, the faster the mapping from virtual addresses to physical addresses, but there will be many page table entries to manage, and the address space that can be supported is limited.
    • Conversely, the more page table levels, the less page table data needs to be stored up front and the larger the address space that can be supported, but the mapping from virtual addresses to physical addresses becomes slower.

Virtual memory implementation for 32-bit systems: two-level page tables

To help you remember this, let me give you an example. Suppose you want to support a 4GB process virtual address space on a 32-bit operating system with a page size of 4KB; that means 2^20 pages. If the fastest, single-level page table were used, 2^20 page table entries would be required. At 4 bytes per entry, each process would need (1048576 × 4 =) 4MB of memory just to store its page table. If a two-level page table is used instead, as shown in Figure 1, creating the process only requires one page directory, occupying (1024 × 4 =) 4KB of memory. The second-level page table entries are only allocated when they are actually needed.
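To make the split concrete, here is a small illustrative sketch (my own example, not kernel code) of how a 32-bit virtual address is typically divided into a 10-bit page directory index, a 10-bit page table index, and a 12-bit in-page offset:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t vaddr = 0x12345678;                   /* an arbitrary example virtual address */

    uint32_t dir_index   = (vaddr >> 22) & 0x3FF;  /* top 10 bits: page directory index */
    uint32_t table_index = (vaddr >> 12) & 0x3FF;  /* next 10 bits: page table index */
    uint32_t offset      =  vaddr        & 0xFFF;  /* low 12 bits: offset within the 4KB page */

    printf("dir=%u table=%u offset=0x%x\n", dir_index, table_index, offset);
    return 0;
}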

Virtual memory implementation for 64-bit systems: four-level page tables

Today's operating systems need to support a 48-bit address space (64-bit in theory, but only 48 bits are used in practice) and hundreds of processes. 64-bit Linux currently uses only 48 bits of an address: the last 12 bits are the in-page offset, leaving the first 36 bits to index pages. Without hierarchical page tables, a single-level page table would need 2^36 entries, which at 4 bytes each is 2^36 × 4 bytes = 256GB per process. So, as with 32-bit systems, the number of page table levels has to be increased further.

Since v2.6.11, Linux has used a four-level page table, consisting of:

  • PGD: Page Global Directory (bits 47-39)
  • PUD: Page Upper Directory (bits 38-30)
  • PMD: Page Middle Directory (bits 29-21)
  • PTE: Page Table Entry (bits 20-12)

Thus, when a 64-bit virtual address space is created, only a page global directory with 2^9 entries needs to be maintained at first; page table entries are now 8 bytes wide, so this directory takes just (2^9 × 8 =) 4KB. The middle page directories and page table entries are allocated only at the point where they are actually used. This is how Linux supports a (2^48 =) 256TB process address space.
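As an illustrative sketch (again my own example, not the kernel's actual macros), the four indices can be carved out of a 48-bit virtual address using the bit ranges listed above: 9 bits per level plus a 12-bit in-page offset:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t vaddr = 0x00007f1234567890ULL;        /* an arbitrary example user-space address */

    uint64_t pgd = (vaddr >> 39) & 0x1FF;          /* bits 47-39: page global directory index */
    uint64_t pud = (vaddr >> 30) & 0x1FF;          /* bits 38-30: page upper directory index */
    uint64_t pmd = (vaddr >> 21) & 0x1FF;          /* bits 29-21: page middle directory index */
    uint64_t pte = (vaddr >> 12) & 0x1FF;          /* bits 20-12: page table entry index */
    uint64_t off =  vaddr        & 0xFFF;          /* bits 11-0:  offset within the 4KB page */

    printf("pgd=%lu pud=%lu pmd=%lu pte=%lu offset=0x%lx\n",
           (unsigned long)pgd, (unsigned long)pud,
           (unsigned long)pmd, (unsigned long)pte, (unsigned long)off);
    return 0;
}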

Problems with page tables

Now that we've skimmed through the implementation of Linux virtual memory, I can finally get to the point I want to make. Although creating a process that supports a 256TB address space initially requires only a 4KB page global directory, this introduces a new problem: the page table is stored in memory. Translating a virtual address into a physical address takes 4 page table lookups in memory, plus the actual data access itself. In the worst case, fetching a single piece of data from memory costs 5 memory I/Os.

TLB was born

The idea is the same as the CPU's L1, L2, and L3 caches: since address translation requires many memory I/Os and is time-consuming, why not simply cache the page table entries inside the CPU as much as possible? That is exactly what the TLB (Translation Lookaside Buffer) is: a cache dedicated to speeding up the translation of virtual addresses into physical addresses. Its access speed is very fast, comparable to a register access and faster than an L1 cache access.

I wanted to actually look at the TLB information, but I scoured the Linux commands and couldn't find a way to see the TLB size as easily as viewing the L1, L2, and L3 sizes through sysfs. I can only provide the picture below for your reference! (If anyone finds a command to check the TLB, don't forget to share it with Fei Brother, thanks!)

With the TLB, the CPU accesses a virtual memory address as follows:

  • 1. The CPU generates a virtual address
  • 2. The MMU looks up the page table entry in the TLB and translates the virtual address into a physical address
  • 3. The MMU sends the physical address to the L1/L2/L3 cache or to memory
  • 4. The L1/L2/L3 cache or memory returns the data at that address to the CPU

Since step 2 runs at register-like speed, if the TLB hits, the time overhead of translating a virtual address into a physical address is almost negligible. For more details on how the TLB works, see Chapter 9 (Virtual Memory) of "Computer Systems: A Programmer's Perspective".
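To tie the four steps together, here is a toy, purely illustrative software model of a TLB lookup in C. It is a teaching sketch under my own simplifying assumptions (a tiny direct-mapped table and a fake page table walk), not how real hardware or the kernel actually implements it:

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 16
#define PAGE_SHIFT  12

struct tlb_entry {
    bool     valid;
    uint64_t vpn;   /* virtual page number */
    uint64_t pfn;   /* physical frame number */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Stand-in for the 4-level page table walk; in reality this is where the
 * extra memory accesses happen. Here we just fabricate a mapping. */
static uint64_t page_table_walk(uint64_t vpn)
{
    return vpn + 0x1000;
}

static uint64_t translate(uint64_t vaddr, bool *hit)
{
    uint64_t vpn = vaddr >> PAGE_SHIFT;
    struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];

    if (e->valid && e->vpn == vpn) {
        *hit = true;                    /* TLB hit: no page table walk needed */
    } else {
        *hit = false;                   /* TLB miss: walk the page table, then cache the result */
        e->valid = true;
        e->vpn   = vpn;
        e->pfn   = page_table_walk(vpn);
    }
    return (e->pfn << PAGE_SHIFT) | (vaddr & ((1ULL << PAGE_SHIFT) - 1));
}

int main(void)
{
    bool hit;
    uint64_t va = 0x7f1234567890ULL;

    translate(va, &hit);
    printf("first access:  %s\n", hit ? "TLB hit" : "TLB miss");
    translate(va, &hit);
    printf("second access: %s\n", hit ? "TLB hit" : "TLB miss");
    return 0;
}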

Tools

Since TLB cache hits matter so much, is there a tool you can use to look at the hit ratio on your system? There is:

# perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses -p $PID

 Performance counter stats for process id '21047':

                    dTLB-loads
                    dTLB-load-misses     #    1.36% of all dTLB cache hits
        2,001,294   iTLB-loads
            3,826   iTLB-load-misses     #    0.19% of all iTLB cache hits

Extension

Because the TLB is not very big, only 4K, and with Hyper-Threading it may also be shared by two processes running on the same physical core, cache misses can occur. Moreover, a TLB miss is more costly than a physical-address cache miss: in the worst case it may take up to five memory I/Os. I recommend that you use the perf tool above to check your program's TLB miss rate. If the miss ratio really is high, Linux allows you to use huge memory pages. This greatly reduces the number of page table entries, which naturally lowers the TLB miss rate. The price is a certain amount of wasted memory. In Linux, huge memory pages are disabled by default.
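As a hedged sketch of how a program can opt into explicit huge pages on Linux (this assumes the administrator has already reserved huge pages, for example via the vm.nr_hugepages sysctl; otherwise the mmap call below simply fails):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 2 * 1024 * 1024;       /* one 2MB huge page */

    /* MAP_HUGETLB asks the kernel to back this mapping with huge pages.
     * It only works if huge pages have been reserved beforehand
     * (e.g. sysctl vm.nr_hugepages=128); otherwise mmap returns MAP_FAILED. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }

    memset(p, 0, len);                  /* touch the memory */
    printf("huge-page-backed mapping at %p\n", p);
    munmap(p, len);
    return 0;
}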



"Developing Internal Skills": the CPU series

  • 1. Do you think all the cores in your multi-core CPU are real cores? The multi-core "illusion"
  • 2. I heard you only know about memory and not the cache? The CPU is very sad to hear that!
  • 3. What is the TLB cache? How do you check TLB misses?
  • 4. What is the overhead of a process/thread switch?
  • 5. What makes coroutines better than threads?
  • 6. How much CPU do soft interrupts eat up?
  • 7. How much does a system call cost?
  • 8. What is the overhead of a simple PHP request to Redis?
  • 9. Do too many function calls cause performance problems?

My public account is "Developing Internal Skills and Practicing". There I don't simply introduce technical theory, nor do I only share practical experience. Instead, I combine theory with practice: using practice to deepen the understanding of theory, and using theory to improve your hands-on technical ability. You're welcome to follow my account, and please share it with your friends~~~