Preface

Before my internship, after hearing a talk about OOM, I became very interested in Linux kernel memory management. But this area is vast, and without some accumulated knowledge I didn't dare write about it, worried that I might spread misunderstandings. After a period of study I now understand kernel memory better, so today I'm writing this blog to record and share what I've learned.


This article first analyzes the memory layout and allocation of a single process's address space, and then looks at the kernel's memory management from a global perspective.

Linux memory management is introduced below from the following aspects:

  • Process memory request and allocation;

  • OOM after memory runs out;

  • Where the requested memory actually lives;

  • How the system reclaims memory.

1. Process memory request and allocation

In a previous article, I explained how a Hello World program is loaded into memory and how its memory is allocated. Here, again, I start from the process address space, which I think every developer should keep in mind, along with the access latencies of disk, memory, and the CPU cache.

When we start a program from the terminal, the terminal process calls exec to load the executable into memory. At this point the code segment, data segment, BSS segment, and stack segment are mapped into the process address space via mmap; the heap is mapped only if memory is actually requested on the heap.

After exec returns, the process does not actually start executing its own code right away: control is first handed to the dynamic linker/loader, which loads the shared libraries the process needs into memory. Only then does process execution begin. This can be observed by tracing the system calls the process makes with the strace command.

This is the pipe program from my last blog post; from its strace output, you can see that it is consistent with what I described above.

When malloc is called to request memory for the first time, it enters the kernel via the brk system call. The kernel first checks whether a VMA for the heap already exists; if not, it anonymously maps a block of memory for the heap, creates the VMA structure, and hangs it on the red-black tree and linked list of the mm_struct descriptor.

Execution then returns to user space, where a memory allocator (ptmalloc, tcmalloc, jemalloc) manages the allocated block and hands back to the user exactly the amount of memory requested.

If a large amount of memory is requested in user space, the allocator calls mmap directly to allocate it. What is returned to user space is still virtual memory; physical memory is not allocated until the returned memory is accessed for the first time.

What brk returns is also virtual memory, but after the memory allocator has cut it up and handed it out (cutting requires accessing the memory), it has all been backed by physical memory.
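As a concrete illustration (my own sketch, not from the original post), the program below exercises both paths: a small request that the allocator typically serves from the brk-grown heap, and a large request that glibc's ptmalloc hands straight to mmap (its default threshold is around 128 KB; other allocators differ). Running it under strace -e trace=brk,mmap,munmap makes the system calls visible.

/* Sketch: small vs. large malloc. The behaviors noted in the comments are
 * glibc defaults and may differ with other allocators or tunables. */
#include <stdlib.h>
#include <string.h>

int main(void) {
    char *small = malloc(4 * 1024);          /* usually carved out of the brk-grown heap */
    char *large = malloc(4 * 1024 * 1024);   /* usually a direct mmap */
    if (!small || !large)
        return 1;

    /* Both are only virtual memory so far; physical pages are allocated
     * lazily, when each page is first written and faults in. */
    memset(small, 1, 4 * 1024);
    memset(large, 1, 4 * 1024 * 1024);

    free(small);   /* goes back to the allocator, not necessarily to the kernel */
    free(large);   /* the mmap-backed chunk is munmap'ed back to the kernel */
    return 0;
}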

When a process frees memory in user space by calling free, the memory is returned directly to the system via munmap if it was allocated by mmap.

Otherwise, the memory is first returned to the allocator, and the allocator returns it to the system later; this is why accessing memory that has already been freed may not immediately produce an error.

Of course, when the entire process exits, the memory occupied by the process is returned to the system.

2. OOM after the memory runs out

During my internship, a MySQL instance on a test machine was often killed by OOM. OOM (out of memory) is the system's self-rescue measure when it runs out of memory: it selects a process and kills it to release memory. But is it really that simple?

When I got to work this morning, I happened to run into an OOM again, and suddenly found that after one OOM, the world went quiet, haha: the Redis instance on the test machine had been killed.

The file oom_kill.c describes how the kernel selects the process to kill when memory is insufficient. Many factors go into the choice: the memory occupied by the process, its running time, its priority, whether it runs as root, the number and memory usage of its child processes, and the user-controlled parameter oom_adj.

When an OOM occurs, the select_bad_process function iterates over all processes; each receives an oom_score based on the factors mentioned above, and the process with the highest score is selected to be killed.

We can influence which process the system chooses to kill by setting the value of /proc/<pid>/oom_adj.

This is how the kernel defines the oom_adj adjustment: the maximum value is 15 and the minimum is -16. If it is set to -17, the process will never be killed by the OOM killer, as if it had bought a VIP membership. You can set the oom_adj of your own important service to -17.
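For example, a critical service can exempt itself by writing -17 to its own oom_adj file (a sketch I added, using the older oom_adj interface described above; newer kernels expose /proc/<pid>/oom_score_adj with a range of -1000 to 1000 instead, and either write requires sufficient privileges):

/* Sketch: exempt the current process from the OOM killer by writing -17
 * to /proc/self/oom_adj (older interface; see the note above about
 * oom_score_adj on newer kernels). */
#include <stdio.h>

int main(void) {
    FILE *f = fopen("/proc/self/oom_adj", "w");
    if (!f) { perror("fopen"); return 1; }
    fprintf(f, "-17\n");
    fclose(f);
    return 0;
}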

Next, let's look at /proc/sys/vm/overcommit_memory (documented in man proc).

If overcommit_memory is 0, the system uses heuristic overcommit: if the requested virtual memory is not excessively larger than physical memory, the kernel allows the request; if the requested virtual memory is excessively larger than physical memory, the request fails with an out-of-memory error.

For example, suppose there is only 8 GB of physical memory, and Redis occupies 24 GB of virtual memory and 3 GB of physical memory. When bgsave executes, the forked child process shares physical memory with the parent but has its own virtual memory, i.e. the child will request 24 GB of virtual memory. That is far larger than physical memory, so under heuristic overcommit the request fails with an out-of-memory error.

When overcommit_memory is 1, overcommitting is always allowed, no matter how large the requested virtual memory is; an OOM is only generated later, when the system actually runs out of physical memory. In the Redis example above, with overcommit_memory=1 no error is produced, because physical memory is still sufficient.

When overcommit_memory is 2, the total committed memory must never exceed swap + RAM × overcommit_ratio (/proc/sys/vm/overcommit_ratio, default 50%, adjustable). Once that limit is reached, any further attempt to allocate memory returns an error, which usually means no new program can be started.
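To make the three modes concrete, here is a small sketch (mine, not the author's) that requests a single block of virtual memory far larger than typical physical RAM without touching it, so the reaction of each overcommit_memory setting can be observed directly:

/* Sketch: ask for 64 GB of virtual memory (assumes a 64-bit build) without
 * touching it. With overcommit_memory=0 (heuristic) a request that greatly
 * exceeds what the machine can back is refused up front; with
 * overcommit_memory=1 it succeeds, and the OOM killer only steps in later
 * if the pages are actually written; with overcommit_memory=2 the strict
 * swap + RAM * overcommit_ratio limit applies. */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    size_t size = 64UL << 30;      /* 64 GB of virtual address space */
    char *p = malloc(size);
    if (!p) {
        printf("request refused: out of memory\n");
        return 1;
    }
    printf("got %zu GB of virtual memory; RSS is still tiny\n", size >> 30);
    /* memset(p, 1, size); */      /* uncomment to fault the pages in; on a small
                                      machine this is what finally triggers OOM */
    free(p);
    return 0;
}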

That covers OOM. Once you understand the principle, you can configure overcommit and the OOM killer according to your own application's needs.

3. Where does the requested memory actually live?

Now that we know a process's address space, do we ever wonder where the corresponding physical memory actually is? Many people may think: isn't it just physical memory? Physical memory can be divided into cache and ordinary memory, which you can view with the free command; it can also be divided into the DMA, NORMAL, and HIGH zones. This section mainly analyzes cache versus ordinary memory.

From the first part we learned that a process's address space is almost always set up via the mmap function, using file mappings and anonymous mappings.

3.1 Shared file mapping

Let's first look at the code segment and the dynamic link library mapping segment. Both are shared file mappings: two processes started from the same executable share these two segments, and both map to the same physical memory. I wrote a program to test this:
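The test program itself is not shown here, so below is my reconstruction of what such a program looks like: it maps the 1 GB file fileblock (created with dd further down) with MAP_SHARED, touches every page so the file is pulled into the page cache, and then sleeps so that free can be run from another terminal.

/* Sketch of the shared-file-mapping test (my reconstruction, not the
 * original listing): map "fileblock" read-only with MAP_SHARED, touch
 * every page, then wait. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("fileblock", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    volatile char sum = 0;
    for (off_t i = 0; i < st.st_size; i += 4096)  /* fault each page into the page cache */
        sum += p[i];

    pause();                                      /* keep the mapping alive for inspection */
    return 0;
}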

Let’s take a look at the memory usage of the current system:

When I create a new 1GB file locally:

dd if=/dev/zero of=fileblock bs=1M count=1024

Then call the above program to map shared files, and the memory usage is as follows:

We can see that buff/cache increased by about 1 GB, so we can conclude that code segments and dynamic link library segments are mapped into the kernel page cache: when a shared file mapping is performed, the file is first read into the cache and then mapped into the user process's address space.

3.2 Private file mapping

The data segment in the process space must be a private file mapping: if it were a shared file mapping, then a modification of the data segment by either of two processes started from the same executable would affect the other. I rewrote the above test program as a private file mapping:
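Again, my reconstruction of the private-file-mapping version of the test program: the only substantive change is MAP_PRIVATE plus writing to the pages, which triggers copy-on-write.

/* Sketch of the private-file-mapping test (my reconstruction): writes never
 * reach the file; each touched page is copied on write into anonymous
 * memory, so both "used" and "buff/cache" grow. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("fileblock", O_RDONLY);        /* read access is enough for MAP_PRIVATE */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    for (off_t i = 0; i < st.st_size; i += 4096)  /* write => read into cache, then COW */
        p[i] = 1;

    pause();
    return 0;
}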

Before executing the program, you need to drop the existing cache; otherwise it will affect the result:

echo 1 >> /proc/sys/vm/drop_caches

Then execute the program to see the memory usage:

Both used and buff/cache increased by 1 GB. This shows that when a file is privately mapped, it is first read into the page cache; then, when a process modifies the mapped data, a new block of memory is allocated, the file data is copied into it, and the modification happens in that newly allocated memory. This is copy-on-write.

This also makes sense: if multiple instances of the same executable are running, the kernel first maps the executable's segments into the page cache, and then allocates a separate block of memory for each instance that modifies the data segment, since the data segment is private to each process.

Based on the above analysis, we can conclude that for file mappings, the file is always mapped into the page cache first, and then handled differently depending on whether the mapping is shared or private.

3.3 Private anonymous mapping

The BSS segment, heap, and stack are anonymous mappings, because they have no corresponding sections in the executable file, and they must be private: otherwise, if the current process forked a child, parent and child would share these segments and every change would affect the other, which would be unreasonable.

OK, now I change the above test program to a private anonymous mapping:
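My reconstruction of the private-anonymous-mapping version: MAP_PRIVATE | MAP_ANONYMOUS with fd set to -1, touching 1 GB of pages.

/* Sketch of the private-anonymous-mapping test (my reconstruction): only
 * "used" should grow; buff/cache stays flat. */
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    size_t size = 1UL << 30;                      /* 1 GB */
    char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    for (size_t i = 0; i < size; i += 4096)       /* fault each page in */
        p[i] = 1;

    pause();
    return 0;
}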

Now let’s look at memory usage

As you can see, only used increased by 1 GB, while buff/cache did not grow at all. This means the page cache is not used for private anonymous mappings, which makes sense, because only the current process uses this memory.

3.4 Shared anonymous mapping

When we need to share memory between parent and child processes, we can use mmap to create a shared anonymous mapping. So where does the memory of a shared anonymous mapping live? I again rewrote the above test program, this time for a shared anonymous mapping:
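And my reconstruction of the shared-anonymous-mapping version, identical except for MAP_SHARED | MAP_ANONYMOUS.

/* Sketch of the shared-anonymous-mapping test (my reconstruction): this time
 * buff/cache should grow, because the pages are backed by tmpfs. */
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    size_t size = 1UL << 30;                      /* 1 GB */
    char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    for (size_t i = 0; i < size; i += 4096)
        p[i] = 1;

    pause();
    return 0;
}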

Take a look at the memory usage:

This time only buff/cache increased, by 1 GB, which means that when a shared anonymous mapping is created, the memory is allocated from the cache. This makes sense: the pages of the shared anonymous mapping live in the page cache, and each process maps them into its own virtual memory, so they can all operate on the same block of physical memory.

4. How the system reclaims memory

When the system memory is insufficient, there are two ways to release memory. One is manual, and the other is triggered by the system.

4.1 Manually reclaiming memory

Manual reclamation has already been demonstrated above, that is:

echo 1 >> /proc/sys/vm/drop_caches

We can find a description of this in man proc.

If 1 is written to drop_caches, the pagecache is released (some of it cannot be released); if 2 is written, dentry and inode caches are released; writing 3 releases both.

If the pagecache contains dirty data, it cannot be dropped; to release it, you must first run sync to flush the dirty data to disk.

So, apart from dirty pages, what other pagecache cannot be released with drop_caches?

4.2 TMPFS

TMPFS, like procfs, sysfs, and ramfs, is a memory-based file system. The difference between tmpfs and ramfs is that ramfs files are backed purely by memory and ramfs can exhaust all of it, while tmpfs can limit its memory usage. You can run df -T -h to see the file systems on the machine; several of them are tmpfs, the best known being the /dev/shm directory.

Files created on tmpfs have the same structures as files on disk-based file systems: inode, super_block, dentry, file, and so on. The main difference lies in reading and writing, because that is where it matters whether the backing store is memory or disk.

For reading a tmpfs file, the shmem_file_read function first locates the address_space through the inode structure (this is effectively the file's pagecache), then uses the read offset to locate the cache page and the offset within that page.

At this point the data can be copied from the pagecache page directly to user space with __copy_to_user. If the data we want to read is not in the pagecache, the kernel checks whether it is in swap; if so, the page is swapped in first and then read.

For writing, the shmem_file_write function first determines whether the page to be written is already in memory. If so, the user-space data is copied directly over the old data in the kernel pagecache via __copy_from_user, and the page is marked dirty.

If the data to be written is not in memory, the kernel checks whether it is in swap; if so, it is read in first, then overwritten with the new data and marked dirty. If it is neither in memory nor on disk, a new pagecache page is created to hold the user data.

Let's verify this by creating a 1 GB file under /dev/shm, the tmpfs mount, and checking memory usage:

As you can see, the cache grows by 1 gigabyte, which verifies that TMPFS actually uses the cache.

In fact, the anonymous mapping path of mmap also uses tmpfs: in the do_mmap_pgoff function in mm/mmap.c, there is a check that if the file structure is empty and the mapping is SHARED, shmem_zero_setup(vma) is called to create a new file on tmpfs.

This explains why shared anonymous mapping memory is initialized to zero. But we know that memory allocated by mmap is always initialized to zero, which means private anonymous mapping memory is also zero-initialized. So where does that happen?

This is not handled inside the do_mmap_pgoff function, but in the page-fault exception handler, which maps a special page initialized to zero.

Can the pages held by TMPFS be reclaimed?

This shows that the pagecache occupied by tmpfs files cannot be reclaimed, which is obvious: these pages are still referenced by the tmpfs files, so they cannot be dropped.

4.3 Shared Memory

POSIX shared memory is similar to mmap shared mapping: both processes create a new file on the tmpfs file system and then map it into user space, so in the end the two processes operate on the same physical memory.

We can trace the following function

This function creates a new shared memory segment, and inside it the function

shmem_kernel_file_setup

creates a file on the tmpfs file system; the processes then communicate through this in-memory file. I won't write a test program for this. This memory also cannot be reclaimed by drop_caches, because the life cycle of IPC shared memory follows the kernel: after you create the shared memory, if you do not explicitly delete it, it still exists even after the process exits.

Both POSIX and System V IPC mechanisms (message queues, semaphores, and shared memory) are said to use the tmpfs file system, which means the memory ultimately comes from the pagecache. But in the source code I can only see that the two shared memory implementations are based on tmpfs; the semaphore and message queue implementations are not visible there (more on that below).

The POSIX message queue implementation is similar to the pipe implementation: it has its own mqueue file system, and the message queue attribute mqueue_inode_info is attached to i_private on the inode. In kernel 2.6 the messages in this attribute are stored in an array, while in 4.6 they are stored in a red-black tree (I only downloaded these two versions, so I don't know exactly in which version the red-black tree was introduced).

The two processes then communicate by operating on the message array or red-black tree in mqueue_inode_info. Similar to mqueue_inode_info are the tmpfs file system attribute shmem_inode_info and, for epoll's eventpoll file system, the special struct eventpoll, which hangs off the private_data field of the file structure.

To summarize: the code segment, data segment, dynamic link libraries (shared file mappings), and mmap shared anonymous mappings all live in the cache, but these memory pages are referenced by processes, so they cannot be released; and the life cycle of tmpfs-based IPC follows the kernel, so it cannot be released by drop_caches either.

Although this cache cannot be freed, it can be swapped out when memory runs low.

So drop_caches can only release unreferenced pagecache pages: pages cached from reading files on disk, and the cached pages of mapped files after the process that mapped them has exited.

4.4 Automatic Memory Release Mode

When system memory is insufficient, the operating system has a mechanism that cleans up and frees as much memory as possible; if this mechanism cannot free enough memory, the only option left is OOM.

Redis was killed because of OOM, as follows:

The second half of that log line,

total-vm:186660kB, anon-rss:9388kB, file-rss:4kB

describes the memory usage of the process with three attributes: total virtual memory, resident anonymous mapping pages, and resident file mapping pages.

In fact, from the above analysis we also know that a process's memory consists of file mappings and anonymous mappings:

  • File mappings: code segment, data segment, dynamic link library shared storage segments, and user program file mapping segments;

  • Anonymous mappings: BSS segment, heap, memory that malloc allocates via mmap, and mmap shared anonymous memory segments.

Kernel memory reclamation is in fact organized around file mappings and anonymous mappings, which are represented by the LRU lists defined in mmzone.h:
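For reference, the LRU lists look roughly like this in include/linux/mmzone.h (the exact definition varies slightly between kernel versions):

enum lru_list {
    LRU_INACTIVE_ANON = LRU_BASE,
    LRU_ACTIVE_ANON   = LRU_BASE + LRU_ACTIVE,
    LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
    LRU_ACTIVE_FILE   = LRU_BASE + LRU_FILE + LRU_ACTIVE,
    LRU_UNEVICTABLE,
    NR_LRU_LISTS
};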

LRU_UNEVICTABLE is the list of unevictable pages, i.e. pages that cannot be swapped out, such as those locked with mlock.

The kernel has a kswapd thread that periodically checks memory usage. If free memory falls to pages_low, kswapd scans the first four LRU lists, looks for inactive pages in the active lists, and moves them to the inactive lists.

It then iterates over the inactive lists, reclaiming 32 pages at a time until the number of free pages reaches pages_high. Different kinds of pages are reclaimed in different ways.

Of course, when free memory falls below a certain threshold, direct memory reclamation is triggered; the process is the same as kswapd's, but this round is more aggressive and needs to reclaim more memory.

File pages:

If the page is dirty, it is written back to disk and the memory is reclaimed.

If the page is not dirty, it is released directly. If it was an I/O read cache page, releasing it is fine: the next read takes a page fault and reads it back from disk. If it was a file mapping page, releasing it is also fine, but the next access produces two page faults: one to read the file contents back from disk, and another to associate the page with the process's virtual memory.

Anonymous pages: anonymous pages have no backing file to write to, so if they were simply released, the data could not be found again. Reclaiming an anonymous page therefore means swapping it out to disk and leaving a mark in the page table entry; the next page fault on it swaps it back in from disk.

Swapping involves I/O, so if the demand for memory grows rapidly, the CPU can end up tied up in I/O and the system may block and fail to serve requests. The kernel therefore provides a parameter to balance reclaiming the cache against swapping out anonymous pages when reclaiming memory: /proc/sys/vm/swappiness.

Its maximum value is 100: the higher the value, the more readily anonymous pages are swapped out; if it is set to 0, cache reclamation is used as much as possible to free memory.

5. Summary

This article is about Linux memory management:

First, the process address space is reviewed.

Second, when processes consume large amounts of memory and free memory runs short, there are two ways to reclaim it: one is to drop the cache manually; the other is for the kernel background thread kswapd to perform memory reclamation.

Finally, when the requested memory is still larger than what the system can free, an OOM is generated, killing a process and releasing its memory. From this whole process, you can see how hard the system works to free up enough memory.

Author: Luo Daowen (Luo Daowen's private kitchen). Original link: http://luodw.cc/2016/08/13/linux-cache/