From the discussion of the Linux kernel versus memory fragmentation above, we can see that grouping pages by migration type only slows down memory fragmentation rather than solving it. Over time, there can be too much fragmentation to satisfy requests for contiguous physical memory, and performance problems follow. Because grouping alone was not enough, the kernel introduced additional features such as memory compaction.
Memory compaction
Before memory compaction was introduced, the kernel used lumpy reclaim for anti-fragmentation, but it has already been removed from the 3.10 kernel, the version most commonly used, so we will not cover it here. If you are interested in it, please see the reference list organized at the beginning of the article. Let's look at memory compaction.
The idea behind memory compaction is described in detail in the article "Memory compaction". Put simply, it selects a zone, scans from the bottom of the zone for allocated pages with the MIGRATE_MOVABLE property, scans from the top of the zone for free pages, and then migrates the movable pages from the bottom into the free pages at the top, forming contiguous free pages at the bottom of the zone.
(Figures: selecting a fragmented zone; scanning for movable pages; scanning for free pages; the zone after compaction.)
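To make the two-scanner idea concrete, here is a minimal Python toy model, purely an illustration and not kernel code: the zone is an array of page slots, a migrate scanner walks up from the bottom collecting movable pages, a free scanner walks down from the top finding free slots, and movable pages are moved into those free slots until the scanners meet.

```python
# Toy model of the two compaction scanners (illustration only, not kernel code).
# 'M' = movable allocated page, 'F' = free page, 'U' = unmovable allocated page.

def compact(zone):
    zone = list(zone)
    migrate = 0              # migrate scanner: walks up from the bottom of the zone
    free = len(zone) - 1     # free scanner: walks down from the top of the zone
    while migrate < free:
        if zone[migrate] != 'M':      # skip slots at the bottom that are free or unmovable
            migrate += 1
        elif zone[free] != 'F':       # skip slots at the top that are not free
            free -= 1
        else:
            # Migrate the movable page into the free slot near the top,
            # leaving a free page behind at the bottom.
            zone[free], zone[migrate] = zone[migrate], 'F'
            migrate += 1
            free -= 1
    return ''.join(zone)

# Free pages gather at the bottom, movable pages at the top
# (except where unmovable pages break the run): 'MFUMFFMU' -> 'FFUFMMMU'
print(compact('MFUMFFMU'))
```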
The kernel also provides an interface for triggering compaction manually: /proc/sys/vm/compact_memory. But as mentioned in the introduction, at least for the 3.10 kernel we use most often, memory compaction is not very useful whether triggered manually or automatically: it is too expensive and can itself become a performance bottleneck (see "Memory compaction issues"). Instead of abandoning the feature, the community kept improving it, for example by introducing kcompactd in v4.6 and making direct compaction more deterministic in v4.8.
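For reference, a minimal sketch of triggering compaction by hand, equivalent to echo 1 > /proc/sys/vm/compact_memory (requires root and a kernel built with CONFIG_COMPACTION):

```python
# Trigger manual memory compaction of all zones by writing to the procfs knob.
# Requires root and a kernel built with CONFIG_COMPACTION.
COMPACT_MEMORY = "/proc/sys/vm/compact_memory"

def compact_all_memory():
    with open(COMPACT_MEMORY, "w") as f:
        f.write("1")   # writing 1 asks the kernel to compact all zones

if __name__ == "__main__":
    compact_all_memory()
```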
For the 3.10 kernel, memory compaction is triggered at the following times:
- The kswapd thread balances a zone after a high-order memory allocation fails;
- Direct memory reclamation is performed to satisfy a high-order allocation, including the THP page-fault handling path;
- The khugepaged kernel thread attempts to collapse a huge page;
- Compaction is triggered manually via the /proc interface.
The THP-related paths will not be analyzed here, since in the previous article I explained why THP should be disabled and advised you to turn it off; we will focus on the memory allocation path:
Basic process: when an application allocates pages, if the request cannot be satisfied from the buddy system's free lists, it enters the slow allocation path. The allocator first tries to allocate at the low watermark. If that fails, memory is slightly short, so the allocator wakes the kswapd thread to reclaim pages asynchronously and then retries at the min watermark. If that allocation also fails, free memory is seriously short and the allocator performs asynchronous memory compaction. If pages still cannot be allocated after asynchronous compaction, it performs direct memory reclamation; if the number of pages reclaimed still does not meet the demand, it performs direct memory compaction; and if direct reclamation fails to reclaim even a single page, it invokes the OOM killer to free memory.
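To visualize the order of fallbacks described above, here is a hedged sketch in Python of the 3.10-era slow path. Every helper below is a placeholder stub for illustration, not a kernel symbol, and the real slow path has many more branches:

```python
# Simplified sketch of the slow allocation path described above (3.10-era behaviour).
# All helpers are illustrative stubs, not kernel functions.

def try_alloc(order, watermark):   # stub: pretend the buddy allocator always fails
    return False

def wake_kswapd(order):            # stub: would kick off asynchronous reclaim
    pass

def async_compact(order):          # stub: asynchronous memory compaction
    return False

def direct_reclaim(order):         # stub: synchronous reclaim, returns pages reclaimed
    return 0

def direct_compact(order):         # stub: synchronous memory compaction
    return False

def oom_kill():                    # stub: last resort
    print("oom killer invoked")

def alloc_pages_slowpath(order):
    if try_alloc(order, "low"):                        # only slightly short of memory?
        return True
    wake_kswapd(order)                                 # reclaim asynchronously ...
    if try_alloc(order, "min"):                        # ... and retry at the min watermark
        return True
    if async_compact(order) and try_alloc(order, "min"):
        return True                                    # asynchronous compaction helped
    reclaimed = direct_reclaim(order)                  # synchronous: the caller pays the latency
    if reclaimed and try_alloc(order, "min"):
        return True
    if direct_compact(order) and try_alloc(order, "min"):
        return True
    if reclaimed == 0:                                 # reclaim got nothing at all
        oom_kill()
    return False

alloc_pages_slowpath(order=4)
```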
The above is only a simplified description; the actual workflow is much more complex. Depending on the allocation order and allocation flags, the process varies slightly; to keep you out of the details, we will not expand on it in this article. To help you quantify the latency that direct memory reclamation and memory compaction add to each participating thread, I contributed two tools to the BCC project: drsnoop and compactsnoop. Both are well documented, but note that, to reduce the overhead introduced by BPF, one traced event may correspond to many memory allocation requests (a many-to-one relationship). For older kernels such as 3.10, the number of retries in the slow allocation path is unbounded, so the OOM killer may appear either too early or too late, leaving most tasks on the server hung for a long time. The 4.12 patch "mm: fix 100% CPU kswapd busyloop on unreclaimable nodes" caps the number of direct reclamation retries at 16. With that cap of 16, assume the average latency of one direct reclamation pass is 10 ms (on today's servers with hundreds of gigabytes of memory, shrinking the active/inactive LRU lists is time-consuming, and waiting for dirty pages to be written back adds further delay). Then, if a thread requesting pages from the page allocator needs only one direct reclamation pass to reclaim enough memory, the allocation latency increases by 10 ms; if it takes 16 retries to reclaim enough pages, the added latency is 160 ms instead of 10 ms.
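As a quick back-of-the-envelope check of those numbers (the 10 ms per pass is an assumed average, and 16 is the retry cap from the patch mentioned above):

```python
# Worst-case latency added to a single allocation by direct reclaim retries.
AVG_DIRECT_RECLAIM_MS = 10     # assumed average latency of one direct reclaim pass
MAX_RECLAIM_RETRIES = 16       # retry cap introduced by the 4.12 patch mentioned above

best_case = 1 * AVG_DIRECT_RECLAIM_MS
worst_case = MAX_RECLAIM_RETRIES * AVG_DIRECT_RECLAIM_MS
print(f"added latency: best {best_case} ms, worst {worst_case} ms")   # 10 ms vs 160 ms
```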
Let’s go back to memory compaction. Its core logic can be divided into four steps:
- Determine whether the memory zone is suitable for compaction;
- Set the start page frame numbers for the scanners;
- Isolate pages with the MIGRATE_MOVABLE property;
- Migrate the MIGRATE_MOVABLE pages to the top of the zone.
After one round of migration, if the zone still needs compacting, steps 3 and 4 are repeated until compaction of the zone is complete. This is what consumes large amounts of sys CPU.
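Putting the four steps together, the control flow looks roughly like the following sketch. The helper names are illustrative placeholders, not the kernel's compact_zone() internals, and the stride of the real scanners is a pageblock rather than a single page frame:

```python
# Rough control flow of one compaction run over a zone (illustrative stubs only).

def zone_suitable(zone, order):                 # step 1: fragmentation index vs. extfrag_threshold
    return True                                 # stub

def isolate_movable_pages(zone, migrate_pfn):   # step 3: pull MIGRATE_MOVABLE pages off the LRU
    return ["page"]                             # stub

def migrate_pages(pages, free_pfn):             # step 4: move them into free pages near the top
    return 0                                    # stub: number of pages that failed to migrate

def compact_zone(zone, order):
    if not zone_suitable(zone, order):          # step 1
        return "skipped"
    migrate_pfn = zone["start_pfn"]             # step 2: migrate scanner starts at the bottom
    free_pfn = zone["end_pfn"]                  #         free scanner starts at the top
    while migrate_pfn < free_pfn:               # repeat steps 3 and 4 until the scanners meet
        pages = isolate_movable_pages(zone, migrate_pfn)
        migrate_pages(pages, free_pfn)
        migrate_pfn += 1                        # real scanners advance in pageblock strides
        free_pfn -= 1
    return "compacted"

print(compact_zone({"start_pfn": 0, "end_pfn": 1024}, order=4))
```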
Page migration is itself a big topic, and it is used in scenarios other than memory compaction, so we will not cover it here. Let's look at how the kernel determines whether a zone is suitable for compaction:
- When compaction is forced through the /proc interface, there is nothing to judge: the kernel simply obeys;
- Otherwise, the requested order is used to judge whether the zone has enough free memory to compact (the details of the algorithm vary between kernel versions), and a fragmentation index is computed. An index close to 0 means an allocation would fail because of insufficient free memory, so this is the time for memory reclamation rather than compaction; an index close to 1000 means an allocation would fail because of excessive external fragmentation, so this is the time for compaction rather than reclamation. The dividing line between compaction and reclamation is the external fragmentation threshold /proc/sys/vm/extfrag_threshold (a sketch of the index computation follows this list).
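The sketch below is modelled on the kernel's __fragmentation_index() in mm/vmstat.c; treat it as an approximation of the kernel's logic rather than a faithful copy:

```python
# Sketch of the external fragmentation index, modelled on mm/vmstat.c.
# index < 0      : a suitable free block exists, the allocation would succeed
# index -> 0     : failure would be due to lack of free memory  -> reclaim
# index -> 1000  : failure would be due to fragmentation        -> compact

def fragmentation_index(order, free_pages, free_blocks_total, free_blocks_suitable):
    requested = 1 << order
    if free_blocks_total == 0:
        return 0
    if free_blocks_suitable:
        return -1000                       # the allocation would not fail at all
    return 1000 - (1000 + free_pages * 1000 // requested) // free_blocks_total

# Example: plenty of free pages, but all of them sitting in order-0 blocks.
print(fragmentation_index(order=4, free_pages=4096,
                          free_blocks_total=4096, free_blocks_suitable=False))  # ~938
```

Compaction is skipped when the index falls at or below /proc/sys/vm/extfrag_threshold (default 500), on the assumption that reclamation is the better response.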
The kernel developers also provide an interface for observing the fragmentation index:
Running cat /sys/kernel/debug/extfrag/extfrag_index shows the index for each order in each zone (the values contain a decimal point because the kernel's 0 to 1000 index is divided by 1000 for display).
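If you want to consume these values from a script, a small sketch follows. It requires root and a kernel with debugfs mounted; the column layout assumed in the comment is based on typical output, so the parsing is deliberately defensive:

```python
# Read the per-order fragmentation index for every zone from debugfs.
# Requires root and a mounted debugfs; the line format assumed below is
# "Node 0, zone   Normal -1.000 -1.000 ... 0.945", so parse defensively.
EXTFRAG_INDEX = "/sys/kernel/debug/extfrag/extfrag_index"

def read_extfrag_index(path=EXTFRAG_INDEX):
    result = {}
    with open(path) as f:
        for line in f:
            head, _, values = line.partition("zone")
            if not values:
                continue                      # skip lines that do not describe a zone
            parts = values.split()
            node = head.strip().rstrip(",")   # e.g. "Node 0"
            zone = parts[0]                   # e.g. "Normal"
            result[(node, zone)] = [float(v) for v in parts[1:]]
    return result

if __name__ == "__main__":
    for (node, zone), per_order in read_extfrag_index().items():
        print(node, zone, per_order)
```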
Conclusion
This article briefly describes why external memory fragmentation causes performance problems and reviews the community's anti-fragmentation efforts over the years, focusing on the principles of anti-fragmentation in the 3.10 kernel and on how to observe it both qualitatively and quantitatively. I hope it is helpful to you.
The reason direct memory reclamation is discussed alongside memory compaction is that direct reclamation is triggered not only when memory is seriously short; in real scenarios it can also be triggered by memory fragmentation, and over time the two tend to occur mixed together.
This article also introduces monitoring interfaces based on the /proc file system and tools based on kernel events; the two complement each other. The /proc-based interfaces are simple to use but suffer from problems such as hard-to-quantify analysis and long sampling periods, which the kernel-event-based tools solve. The latter, however, require some understanding of how the related kernel subsystems work and place requirements on the customer's kernel version.
As for how to reduce the frequency of direct memory reclamation and how to alleviate fragmentation, my view is this: for workloads that perform a large amount of I/O, the kernel's design favors slow back-end devices, for example the second-chance refinement of the LRU algorithm and the refault distance mechanism, but it does not provide a way to limit the size of the page cache. (Some companies have customized this feature in their kernels and tried to submit it to the upstream kernel community, but upstream has not accepted it; I suspect it may cause working-set refaults, among other issues.) So for machines with more than 100 GB of memory, a good approach is to increase the vm.min_free_kbytes sysctl (up to 5% of total memory) to indirectly limit page cache usage. Increasing vm.min_free_kbytes does waste some memory, but for a 256 GB server we set it to 4 GB, which is only about 1.5%. The community has apparently noticed this as well and, in the 4.6 kernel, merged the optimization "mm: scale kswapd watermarks in proportion to memory". Another approach is to drop caches at an appropriate time, but this may cause serious service jitter.
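For reference, a small sketch of the tuning described above: compute a min_free_kbytes target capped at 5% of total memory and write it through the sysctl file. It requires root, and the 4 GiB default here simply mirrors the 256 GB example from the text:

```python
# Sketch: raise vm.min_free_kbytes, capped at 5% of total memory (requires root).
import os

def set_min_free_kbytes(target_gib=4, cap_ratio=0.05):
    # Total RAM in KiB, derived from page size and page count.
    total_kib = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") // 1024
    target_kib = min(target_gib * 1024 * 1024, int(total_kib * cap_ratio))
    with open("/proc/sys/vm/min_free_kbytes", "w") as f:
        f.write(str(target_kib))
    return target_kib, target_kib / total_kib

if __name__ == "__main__":
    kib, ratio = set_min_free_kbytes()          # e.g. 4 GiB on a 256 GiB server is about 1.5%
    print(f"min_free_kbytes={kib} ({ratio:.1%} of RAM)")
```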
I look forward to your comments and feedback!