How to write a technical article

  1. Identify your target audience
  2. Start from a problem, and introduce the content to be explained step by step by working through that problem
  3. Derive the content from the problem by logical deduction
    1. Stay logical
  4. The Fibonacci principle
    1. Build each step progressively on the previous ones

PS: from a Zhihu Q&A


Main text

The topic of this share is memory management. First, why share this topic? I have read a lot of material in the past, but because I never used it at work, I forgot it soon after reading. Take Redis as an example: many people have read Redis Design and Implementation, and I remember reading it alongside the source code, yet I can hardly recall any of it now. So I wanted a method that would help me truly master material even when it is temporarily not used at work. A good method is testing: regularly ask yourself questions and check your understanding of the content. So in this share I first pose a question, then answer it step by step, and that process yields the content of the share. Of course, memory management is a very large topic, and my own knowledge is limited, so it is impossible to cover everything; after all, we do not know what we do not know. This material can only be supplemented and improved over time.

Memory allocation

To start with memory allocation, let’s look at the following snippet of code:

package main

import ("fmt"; "html"; "net/http"; "os")

func main() {
	http.HandleFunc("/bar", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "Hello, %q", html.EscapeString(r.URL.Path))
	})
	fmt.Printf("pid: %d\n", os.Getpid())
	http.ListenAndServe(":8080", nil)
}

The program is very simple: it just starts an HTTP service on port 8080. Compile and run it:

go build main.go
./main
> pid: 3240424

Run the ps command to view the process details (on macOS, ps works the same way), for example: ps -o pid,vsz,rss -p 3240424. The two columns to look at are VSZ (virtual memory size) and RSS (resident set size), and VSZ is far larger than RSS.

Why is there so much more virtual memory than physical memory? To answer this question, let’s first introduce virtual memory.
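As a cross-check from inside the process, here is a minimal Linux-only sketch (it reads procfs, so it will not run on macOS) that prints the process's own VSZ and RSS:

package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// /proc/self/status exposes VmSize (VSZ) and VmRSS (RSS) in kB on Linux.
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		panic(err)
	}
	for _, line := range strings.Split(string(data), "\n") {
		if strings.HasPrefix(line, "VmSize") || strings.HasPrefix(line, "VmRSS") {
			fmt.Println(line)
		}
	}
}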


Virtual memory

To talk about virtual memory, we start from the von Neumann architecture, which divides a computer into three main parts: CPU, memory, and I/O. First, let's answer a question: how does an executable program get executed? In daily development we run go build main.go; ./main. Why can the program be executed when we run ./main?

A program written in a high-level language is preprocessed, compiled, assembled, and linked to produce an executable file; only a linked file can be executed.

file output/bin/xx.api
output/bin/xx.api: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, not stripped

1) ELF, dynamically linked, and ld-linux-x86-64.so.2. First of all, ELF is the executable file format under Linux; we can read its documentation with man elf.

Reading through the manual, you can see that an ELF file starts with the ELF header, followed by the program header table and the section header table, which describe the rest of the file.

To view the ELF header: readelf -h output/bin/xx.api

hexdump -C output/bin/xx.api | head -n 10

More information about ELF can be found in the blog post Introduction to the ELF Format.

Having introduced the ELF Header, here are two very important concepts:

  • Program header
  • Section header

We know that a program must go through four steps, preprocessing, compilation, assembly, and linking, before it can be executed. Assembly converts assembly files into machine-executable instructions, and another very important job there is to store the file split into sections by semantics; common sections hold code, data, and debug information. So why split by function? Take the following code as an example:

int printf(const char *format, ...);

void func1(int i) {
    printf("%d\n", i);
}

int main() {
    static int static_var = 85;
    static int static_var2;

    int a = 1;
    int b;

    func1(static_var + static_var2 + a + b);

    return a;
}

The code above declares printf but does not define it, so the printf symbol must be relocated at the link stage. To make it convenient to look up printf and other symbols in other files, each file naturally needs a symbol table (readable with readelf -s) that other files can search for symbols. So, to better separate the functions of a program, the file is stored in sections. The section headers can be read with readelf -S output/bin/xx.api:

Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), l (large)
  I (info), L (link order), G (group), T (TLS), E (exclude), x (unknown)
  O (extra OS processing required) o (OS specific), p (processor specific)

From the readelf output we can see that although the program is divided into many sections, many of them have the same permission flags; let's keep that in mind.

Finally, let's talk about the program header. We now have the executable file, and the first 64 bytes of the file (the ELF header) let us verify that it really is in ELF format. Once verified, in order to execute the file we naturally need to load it into memory; a program must be loaded before it can run. The question then is: how is the program loaded?

readelf -l output/bin/xx.api

  • Offset: the segment's offset within the file
  • VirtAddr: the virtual address, i.e. the address of the segment in the process's address space after the program is loaded
  • FileSize: the size the segment occupies in the file
  • MemSize: the size of the virtual address space the segment occupies once loaded into memory

So here, let's summarize program headers and section headers:

  • Segment (program header): the execution view, i.e. how the program is laid out in the address space when loaded into memory
  • Section (section header): the linking view. An ELF file is stored section by section; when it is loaded, sections with the same permissions are merged into a segment to reduce memory fragmentation (see the sketch below).
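To make the two views concrete, here is a minimal sketch using Go's standard debug/elf package (the binary path is the example path assumed above) that prints the program headers (segments) and the section headers:

package main

import (
	"debug/elf"
	"fmt"
	"log"
)

func main() {
	// Path assumed from the examples above; substitute your own binary.
	f, err := elf.Open("output/bin/xx.api")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Execution view: segments, described by the program headers.
	for _, p := range f.Progs {
		fmt.Printf("segment type=%-12v flags=%v vaddr=%#x filesz=%#x memsz=%#x\n",
			p.Type, p.Flags, p.Vaddr, p.Filesz, p.Memsz)
	}
	// Linking view: sections, described by the section headers.
	for _, s := range f.Sections {
		fmt.Printf("section %-20s addr=%#x size=%#x\n", s.Name, s.Addr, s.Size)
	}
}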

Interfacing with ELF files

Executable and Linkable Format 101

TLB

Now let’s rearrange our thoughts:

To answer the question, we need to know what virtual memory is, so we introduced the static view of an executable program on disk: the executable is divided into sections according to function. When the program is loaded into memory, sections with the same permissions are merged into segments and assigned to the virtual address space of the process.

Having mentioned virtual address spaces: each application has its own virtual address space. Why does each application need a separate address space?

First, program execution is really just the CPU executing instructions one by one. The CPU needs an address to read from memory before it can execute; originally this address was a physical memory address, which meant that every program had to be linked against different physical addresses, or the programs would overwrite each other once loaded into memory. As applications multiplied, this strict requirement obviously became impossible to satisfy. Hence the virtual address space: each program has its own virtual address space, and what the CPU sees are virtual addresses; but physical memory must of course be accessed by physical address, so there is an intermediate layer (the MMU) that translates virtual addresses into physical addresses.


Virtual Memory, Paging, and Swapping

Let’s answer a few questions about the process described above:

  1. What is a page table and why is it needed?
  2. What is cached in the TLB?

Let's start by answering why we need page tables. The goal is to establish a mapping between virtual addresses and physical addresses. Memory can be viewed as one large array in which each element is 1 byte. Mapping 4 GB of virtual space byte by byte means one entry per byte, and each entry (a 4-byte physical address) costs 4 bytes. The simplest form:

// The index is the virtual address and the value is the physical address.
// 4G entries * 4 bytes each = 16 GB just to store the mapping.
var maps [4 * 1024 * 1024 * 1024]int32

So mapping 4 GB of memory would require 16 GB just to store the mapping, which is obviously unworkable. We therefore divide physical memory at a coarser granularity: in the 32-bit era, physical memory was typically divided into pages of 4 KB each. Assuming the virtual address space is divided into 4 KB pages as well, 4 GB yields 1M pages, so storing the mapping takes 4 MB per process (1M entries of 4 bytes), and 4 MB is itself about 1000 pages. Even with 4 GB of physical memory we could hold only about 1000 such page tables at once, i.e. about 1000 processes. So we use a multi-level page table scheme, shown in the following figure with a 2-level example:

TLB and Pagewalk Coherence in x86 Processors
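As a concrete sketch of the two-level split on 32-bit x86 (10-bit directory index, 10-bit table index, 12-bit page offset; the address value below is made up for illustration):

package main

import "fmt"

func main() {
	va := uint32(0x12345678) // a made-up virtual address

	dir := va >> 22             // top 10 bits: index into the page directory
	table := (va >> 12) & 0x3FF // middle 10 bits: index into the second-level page table
	offset := va & 0xFFF        // low 12 bits: byte offset within the 4 KB page

	fmt.Printf("dir=%d table=%d offset=%#x\n", dir, table, offset)
}

The saving comes from laziness: the second-level table for a 4 MB region the process never touches is simply never allocated, which is why multi-level page tables occupy far less space in practice.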


Now let's answer the second question: what is cached in the TLB?

CPU Cache Flushing Fallacy
cpuid

cpuid -1
L1 TLB/cache information: 2M/4M pages & L1 TLB (0x80000005/eax):
      instruction # entries = 0xff (255)
      instruction associativity = 0x1 (1)
      data # entries = 0xff (255)
      data associativity        = 0x1 (1)
L1 TLB/cache information: 4K pages & L1 TLB (0x80000005/ebx):
  instruction # entries = 0xff (255)
  instruction associativity = 0x1 (1)
  data # entries = 0xff (255)
  data associativity        = 0x1 (1)
L2 TLB/cache information: 2M/4M pages & L2 TLB (0x80000006/eax):
      instruction # entries = 0x0 (0)
      instruction associativity = L2 off (0)
      data # entries = 0x0 (0)
      data associativity        = L2 off (0)
L2 TLB/cache information: 4K pages & L2 TLB (0x80000006/ebx):
  instruction # entries = 0x200 (512)
  instruction associativity = 4-way (4)
  data # entries = 0x200 (512)
  data associativity        = 4-way (4)

Like the data caches, the TLB is split into L1 and L2. Here the L1 TLB has 255 entries for large (2M/4M) pages and 255 entries for 4K pages, while the L2 TLB holds only 512 entries for 4K pages (its large-page part is off). And like the cache, the TLB is split into an instruction TLB (ITLB) and a data TLB (DTLB).

So now we know: because of the huge speed gap between the CPU and memory, performance would certainly suffer if every address translation had to access the page table in memory. The TLB therefore caches virtual-to-physical address mappings, as shown in the following diagram:

Now we know that the page table stores the virtual-to-physical mapping, and that the TLB acts as a cache of that mapping to speed up translation. But consider: if, as in the example below, a 4 GB space is divided into 4 MB pages, there are 1K page-table entries, while the cpuid output showed that even the L2 TLB has only 512 entries, so 1024 possible entries must somehow map onto 512 slots. And one more question: in the TLB information we saw "instruction associativity = 4-way (4)". What does that mean?

Suppose our 4 GB virtual address space is divided into 1K pages of 4 MB each. In a 32-bit address, the first 10 bits are then the page number and the last 22 bits are the offset within the page. Now suppose the TLB has only 256 entries. The simplest scheme is to compare the virtual page number against all 256 entries to see whether it is present, but that requires comparing 256 entries simultaneously, and it gets harder and harder as the count grows. So there is another way.

We use the low 8 bits of the 10-bit virtual page number to select one of the 256 entries, and compare the remaining 2 bits to check whether the selected entry really belongs to the current page.


By extension, associativity = 4-way means the entries are organized into sets of four: the index bits select a set, the four entries in that set are compared in parallel, and the remaining bits serve as the tag (see the sketch below). For more details, see chapter 6 of Computer Systems: A Programmer's Perspective.
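Here is a minimal sketch of such a set-associative lookup in Go (a 4-way split of the 256-entry TLB from the example above; the entry layout and values are made up for illustration):

package main

import "fmt"

// tlbEntry is a hypothetical entry: a tag plus the cached physical frame number.
type tlbEntry struct {
	valid bool
	tag   uint32
	frame uint32
}

const (
	sets = 64 // 256 entries / 4 ways = 64 sets
	ways = 4
)

var tlb [sets][ways]tlbEntry

// lookup splits the virtual page number into a set index and a tag,
// then compares only the 4 entries within that set.
func lookup(vpn uint32) (uint32, bool) {
	set := vpn % sets
	tag := vpn / sets
	for w := 0; w < ways; w++ {
		e := tlb[set][w]
		if e.valid && e.tag == tag {
			return e.frame, true // TLB hit
		}
	}
	return 0, false // TLB miss: walk the page table instead
}

func main() {
	tlb[5][0] = tlbEntry{valid: true, tag: 3, frame: 0x42}
	vpn := uint32(3*sets + 5) // tag=3, set=5
	frame, hit := lookup(vpn)
	fmt.Printf("frame=%#x hit=%v\n", frame, hit)
}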


summary

To sum up, what we have done so far:

  • First, the theme of this share is memory management, organized into two parts: memory allocation and garbage collection
  • For the memory allocation part, we started from the phenomenon that after running a trivially simple program, its virtual memory is far larger than its physical memory
  • To answer why virtual memory is so much larger than physical memory, we need to know what virtual memory is
  • To explain virtual memory, we introduced two views of the executable:
    • Static view: on disk the format is ELF, stored in sections according to program function
    • Dynamic view: when loaded into memory, sections with the same permissions are merged into segments and assigned to the process's virtual address space
  • While discussing the dynamic view, we answered why every program has its own virtual address space:
    • Isolation: each program can choose its own execution addresses at link time
    • Security: each program can only access the contents of its own address space
  • Since each program has its own virtual address space, while the instructions and data the machine actually executes must live in physical memory, virtual-to-physical address translation is required
  • For translation, page tables store the correspondence between virtual and physical addresses, and multi-level page tables reduce the space the page tables themselves occupy
  • Because of the huge speed gap between CPU and memory, we cannot afford to read the page table from memory on every access, so there is the TLB. The TLB is essentially a cache, and since its capacity is smaller than the full set of mappings, we need to solve:
    • Quickly checking whether a given mapping is in the TLB
    • Evicting entries when the TLB is full
    • Keeping TLB contents consistent with memory
    • …
    • These problems are not discussed in detail this time; they are marked here to be supplemented in a later update

Now that we have this knowledge, let's describe the whole process. In daily development we run go build main.go; ./main. What happens, and why does the program get executed?

  1. Since we run this command inside bash, bash first forks to create a new process
  2. The new process calls the execve system call to execute the specified ELF file, i.e. the main binary
  3. execve enters kernel mode through sys_execve
  4. The kernel call chain is: sys_execve -> do_execve -> search_binary_handler -> load_elf_binary
  5. The kernel first reads the first 128 bytes of the file header to determine the file format, then selects the appropriate loader to load the program into memory
    1. At this point only the program headers are read: the virtual address space is allocated and the mapping between the virtual address space and the executable file is established
    2. This mapping between the virtual space and the executable is what allows the contents to be loaded correctly later, when a page fault occurs
    3. Only when a page fault occurs are physical pages actually allocated and file contents read in from disk; that is when physical memory is really consumed
  6. Finally, the CPU instruction register is set to the entry address of the executable, and the program begins to execute…

When a program is loaded, it is only assigned virtual address space; only when those addresses are actually accessed does a page fault trigger the allocation of physical memory. PS: cat /proc/21121/maps and cat /proc/21121/smaps show the detailed address space of a process.
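To see this from inside a program, here is a minimal Linux-only sketch that dumps the first few lines of the process's own address-space map (the same data as cat /proc/<pid>/maps):

package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// /proc/self/maps lists every mapped region of this process:
	// address range, permissions, offset, and backing file (if any).
	data, err := os.ReadFile("/proc/self/maps")
	if err != nil {
		panic(err)
	}
	lines := strings.Split(string(data), "\n")
	for i := 0; i < 10 && i < len(lines); i++ {
		fmt.Println(lines[i])
	}
}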

Application-layer memory management

We introduced the concept of virtual memory above and saw that a program must go through layer upon layer of steps to be loaded into memory before it can execute. If we layer memory management, the content so far belongs to the kernel layer and the hardware layer (TLB, caches). So why manage memory at the application layer as well? Mainly:

  1. To reduce system calls (brk, mmap)
  2. The operating system does not know how to reuse our memory: once a process claims memory it is occupied by that process, and as long as it is not freed it cannot be allocated to anyone else
  3. By doing its own memory management, the application layer can reuse memory better and cooperate with the garbage collector to manage memory better (see the sync.Pool sketch below)
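A concrete everyday instance of application-layer reuse in Go is sync.Pool, which caches objects between uses and cooperates with the garbage collector (pooled objects may be dropped during a GC). A minimal sketch:

package main

import (
	"fmt"
	"sync"
)

var bufPool = sync.Pool{
	// New is called when the pool has no cached object to hand out.
	New: func() interface{} { return make([]byte, 4096) },
}

func main() {
	buf := bufPool.Get().([]byte) // reuse a cached buffer if available
	copy(buf, "hello")
	fmt.Println(string(buf[:5]))
	bufPool.Put(buf) // return the buffer for reuse instead of discarding it
}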

Let’s look at memory management implementations for common languages:

  • C: malloc, free
  • C++: new, delete
  • Go: escape analysis and garbage collection

Hands-on project: implement a minimal malloc yourself (see the toy sketch below).
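To give a flavor of such a project, here is a toy sketch of my own (not the linked project's code): a fixed-size-block allocator that carves memory out of a static buffer and reuses freed blocks through a simple free list. Real allocators additionally handle variable sizes, splitting, and coalescing.

package main

import "fmt"

const blockSize = 64

type allocator struct {
	heap [64 * 1024]byte // the "heap" we manage ourselves
	next int             // bump pointer into the unused tail of the heap
	free []int           // offsets of freed blocks, reused first
}

func (a *allocator) alloc() []byte {
	if n := len(a.free); n > 0 { // reuse a freed block if possible
		off := a.free[n-1]
		a.free = a.free[:n-1]
		return a.heap[off : off+blockSize]
	}
	if a.next+blockSize > len(a.heap) {
		return nil // out of memory
	}
	off := a.next
	a.next += blockSize // bump
	return a.heap[off : off+blockSize]
}

func (a *allocator) freeBlock(off int) {
	a.free = append(a.free, off)
}

func main() {
	var a allocator
	b1 := a.alloc()
	fmt.Println(len(b1), a.next) // 64 64
	a.freeBlock(0)
	b2 := a.alloc()              // reuses the block at offset 0
	fmt.Println(len(b2), a.next) // 64 64: no new memory consumed
}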

Note: since my own understanding is limited, the following content is not guaranteed to be completely correct; please do not hesitate to point out any mistakes.


When designing our own memory allocator, the important metrics to measure are:

  • Throughput
  • Memory utilization

Throughput is the number of requests (both allocation and reclamation) that the memory allocator can handle per second, while memory utilization means minimizing internal and external fragmentation. PS:

  • Internal fragmentation: more memory is allocated than the request actually needs, and the excess inside the block is wasted (see the small illustration after this list).
  • External fragmentation: free chunks that have not been allocated but are individually too small to satisfy a new allocation request.
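A tiny numeric illustration of internal fragmentation (the power-of-two size classes here are an assumption for illustration): a 100-byte request served from a 128-byte block wastes 28 bytes.

package main

import (
	"fmt"
	"math/bits"
)

// roundUpPow2 returns the smallest power of two >= n (hypothetical size classes).
func roundUpPow2(n uint) uint {
	if n <= 1 {
		return 1
	}
	return 1 << bits.Len(n-1)
}

func main() {
	req := uint(100)
	got := roundUpPow2(req)
	fmt.Printf("requested=%d allocated=%d internal fragmentation=%d\n", req, got, got-req)
}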

Let's take a look at the Go memory allocator design, split into two parts:

  • tcmalloc
  • Go language allocator implementation

tcmalloc

tcmalloc is famous, and there is plenty of analysis of it online, so I will not repeat it here; Google it. One recommended illustrated guide to tcmalloc is worth reading: a picture is worth a thousand words. Here I will just mention why tcmalloc performs so well:

  • Lock contention is spread out via thread-local caches (from one big lock to many smaller ones)
  • What is borrowed is also returned: when a run of contiguous memory is no longer used, it is returned to the kernel

PS: how memory is returned to the operating system on 64-bit machines is very interesting. When memory is freed, the VA (virtual address range) is not actually released; it stays on the heap. The allocator merely advises the operating system that a given VA range will not be used for a while, so the MMU mapping between that VA and its PA (physical pages) may be torn down. This leads to two possible OS behaviors: if physical memory is not under pressure, the OS may leave the mapping alone; if it considers physical memory tight, it removes the mapping and reclaims the pages, while knowing the PA backing is gone. Since the VA was never released from the heap, it may be handed out again on the next allocation, which again produces two behaviors: if the PA mapping was never removed, the memory is simply used directly; if it was removed, touching the VA raises a page fault and the OS transparently wires up physical memory again. None of this is visible to user space: user space only ever sees VAs, and whether a VA currently has physical memory behind it is the operating system's business, managed through the MMU mappings. Because the 64-bit VA space is so large, an application can simply keep VA ranges around and reuse them next time instead of releasing them, which makes memory management on 64-bit much simpler: VAs are advised away but never released. (On Windows, by contrast, this advisory release is reportedly not used; memory is either fully released or kept.)
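In Go specifically, this behavior can be observed through the runtime's statistics. A minimal sketch (HeapReleased counts bytes the runtime has advised back to the OS; exact numbers will vary):

package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func main() {
	// Allocate a large slice and touch it so physical pages are committed.
	b := make([]byte, 100<<20)
	for i := range b {
		b[i] = 1
	}
	b = nil

	// Force a GC and ask the runtime to return freed memory to the OS.
	debug.FreeOSMemory()

	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("HeapSys=%d HeapAlloc=%d HeapReleased=%d\n",
		m.HeapSys, m.HeapAlloc, m.HeapReleased)
}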

go allocator

Go's memory allocator was developed from tcmalloc, and the basic idea is the same. A brief look:
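One easy way to poke at the allocator yourself: the Go compiler reports its escape-analysis decisions, which determine whether a value stays on the goroutine stack or goes to the heap allocator, via the -gcflags=-m build flag. A minimal sketch:

// escape.go: run `go build -gcflags=-m escape.go` to see the compiler's decisions.
package main

import "fmt"

func stackAlloc() int {
	x := 42 // does not escape: lives on the stack
	return x
}

func heapAlloc() *int {
	y := 42 // escapes to the heap: its address outlives the function
	return &y
}

func main() {
	fmt.Println(stackAlloc(), *heapAlloc())
}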

Go Garbage collection

Golang's Realtime GC in Theory and Practice
Getting to Go: The Journey of Go's Garbage Collector

tuning

With that said, let's look at how to optimize our own programs through real-world examples. This section will be covered separately in a follow-up.

reference

Go Memory Management
CppCon 2017: Bob Steagall "How to Write a Custom Allocator"
Go's Memory Allocator - Overview
Golang's Realtime GC in Theory and Practice
GopherCon 2018 - Allocator Wrestling
Getting to Go: The Journey of Go's Garbage Collector