Memory cgroup leaks are a common problem in Kubernetes (K8s) clusters. At best they cause memory shortages on nodes; at worst nodes stop responding and can only be recovered by restarting the server. Most developers work around the issue by dropping caches periodically or disabling kernel kmem accounting. Based on a real case handled by the NetEase Shufan kernel team, this article analyzes the root cause of memory cgroup leaks and provides a fix at the kernel level.

Background

Operations monitoring found abnormal load spikes on some cloud host compute nodes and Kubernetes (K8s) nodes. Specifically, the systems were running very slowly, the load stayed at 40+, and some kworker threads showed high CPU usage or were stuck in the D state, which was already affecting business. The exact cause needed to be analyzed.

Locating the problem

Analyzing the phenomenon

perf is an essential tool for investigating CPU usage anomalies. perf top showed that the kernel function cache_reap had intermittent spikes in CPU usage.

Looking at the kernel code, the cache_reap function is implemented as follows:
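The listing below is an abridged paraphrase from mm/slab.c of a 3.10/4.x-era kernel; the per-cache drain details are elided and the exact code differs between kernel versions.

```c
/*
 * Abridged paraphrase of cache_reap() from mm/slab.c (3.10/4.x era);
 * the per-cache drain logic is elided and details differ across versions.
 */
static void cache_reap(struct work_struct *w)
{
    struct kmem_cache *searchp;
    struct delayed_work *work = to_delayed_work(w);

    if (!mutex_trylock(&slab_mutex))
        goto out;

    /* One pass over every slab cache registered on the system. */
    list_for_each_entry(searchp, &slab_caches, list) {
        /*
         * ... drain this cache's per-CPU and per-node free lists,
         * returning idle slab pages to the page allocator ...
         */
        cond_resched();
    }

    mutex_unlock(&slab_mutex);
out:
    /* Re-arm: run again a few seconds later on this CPU. */
    schedule_delayed_work_on(smp_processor_id(), work,
                             round_jiffies_relative(REAPTIMEOUT_AC));
}
```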

As you can see, this function mainly iterates through the global slab_caches list, which records all the slab caches (kmem_cache objects) on the system.

Analyzing the code paths that use the slab_caches variable shows that each memory cgroup has its own memory.kmem.slabinfo file.

This file records the slab caches used by the processes in that memory cgroup, and these per-cgroup slab caches are all added to the global slab_caches list. Could it be that slab_caches has grown so large that a full pass over it takes a long time, driving up CPU usage?

If slab_caches has too many entries, there must be a very large number of memory cgroups on the system, so the natural next step was to count how many memory cgroups exist, i.e. how many directories there are under /sys/fs/cgroup/memory. Since each memory cgroup's memory.kmem.slabinfo file contains only a few dozen records, the directory count suggested that slab_caches should not exceed roughly 10,000 entries, which should not be a problem.
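A minimal sketch of such a count, assuming the cgroup v1 memory controller is mounted at /sys/fs/cgroup/memory (an ls or find one-liner works just as well):

```c
/*
 * Count the memory cgroup directories visible under /sys/fs/cgroup/memory.
 * Assumes the cgroup v1 memory controller is mounted at that path; the count
 * includes the mount root itself.
 */
#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdio.h>
#include <sys/stat.h>

static long count;

static int visit(const char *path, const struct stat *sb,
                 int typeflag, struct FTW *ftwbuf)
{
    (void)path; (void)sb; (void)ftwbuf;
    if (typeflag == FTW_D)      /* each directory is one memory cgroup */
        count++;
    return 0;
}

int main(void)
{
    if (nftw("/sys/fs/cgroup/memory", visit, 64, FTW_PHYS) != 0) {
        perror("nftw");
        return 1;
    }
    printf("memory cgroup directories: %ld\n", count);
    return 0;
}
```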

So we went back to the cache_reap function itself: since it was the one consuming CPU, we could trace it directly to see where the code was spending its time.

Confirming the root cause

Tracing the cache_reap function with a series of tools showed that the number of entries in slab_caches was a staggering several million, far more than we had estimated.

We then ran cat /proc/cgroups to check the current cgroup statistics and found that the number of memory cgroups had reached 200,000+. That many cgroups on a cloud host compute node is clearly abnormal; even on a K8s (Kubernetes) node, container workloads cannot legitimately generate cgroups of this order of magnitude.
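For reference, a minimal sketch that parses /proc/cgroups and prints the per-controller cgroup counts (the same numbers that cat shows):

```c
/*
 * Print the per-controller cgroup counts reported by /proc/cgroups.
 * Each data line has the form: subsys_name  hierarchy  num_cgroups  enabled
 */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/cgroups", "r");
    char line[256], name[64];
    int hierarchy, num_cgroups, enabled;

    if (!f) {
        perror("/proc/cgroups");
        return 1;
    }
    while (fgets(line, sizeof(line), f)) {
        if (line[0] == '#')     /* skip the header line */
            continue;
        if (sscanf(line, "%63s %d %d %d",
                   name, &hierarchy, &num_cgroups, &enabled) == 4)
            printf("%-12s num_cgroups=%d\n", name, num_cgroups);
    }
    fclose(f);
    return 0;
}
```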

So why is the number of directories under /sys/fs/cgroup/memory so different from the number of memory cgroups reported in /proc/cgroups? Because of memory cgroup leaks!

The detailed explanation is as follows:

Many operations on the system (creating and destroying containers or cloud hosts, logging in to hosts, cron jobs, and so on) trigger the creation of temporary memory cgroups. The processes in these memory cgroups may generate cache memory while they run (for example by accessing or creating files), and that cache memory is charged to the memory cgroup. After the processes in the memory cgroup exit, the cgroup's directory under /sys/fs/cgroup/memory is deleted, but the cache the memory cgroup generated is not actively reclaimed. Because some cache pages still reference the memory cgroup object, the memory cgroup object itself cannot be freed.

During the investigation we also found that the cumulative number of memory cgroups was still slowly increasing every day, so we traced the creation and deletion of memory cgroup directories on the nodes and found two triggers that cause memory cgroup leaks:

  1. Certain cron scheduled tasks are executed

  2. Users frequently log in and out of nodes

Both triggers leak memory cgroups because they involve the systemd-logind login service. systemd-logind creates a temporary memory cgroup when a cron task runs or a user logs in to the host, and deletes it after the cron task finishes or the user logs out. If file operations took place in that cgroup, the memory cgroup leaks.

Reproducing the problem

Analyzing the triggering scenarios for memory cgroup leaks makes it much easier to reproduce the problem:

The core reproduction logic is: create a temporary memory cgroup, perform file operations inside it to generate cache memory, and then delete the temporary memory cgroup directory. With this method, the test environment can quickly reproduce a scene with 400,000 residual memory cgroups, as in the sketch below.
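A minimal reproduction sketch, assuming cgroup v1 with the memory controller mounted at /sys/fs/cgroup/memory and root privileges; the leak_test names are made up for illustration:

```c
/*
 * Reproduction sketch: charge page cache to a temporary memory cgroup and then
 * delete the cgroup directory. Assumes cgroup v1 with the memory controller
 * mounted at /sys/fs/cgroup/memory; run as root. The leak_test_* names are
 * made up for illustration.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static void write_str(const char *path, const char *val)
{
    int fd = open(path, O_WRONLY);

    if (fd < 0 || write(fd, val, strlen(val)) < 0) {
        perror(path);
        exit(1);
    }
    close(fd);
}

int main(void)
{
    char pid[32], dir[128], procs[160], datafile[64], buf[4096];

    snprintf(pid, sizeof(pid), "%d", getpid());
    memset(buf, 'x', sizeof(buf));

    for (int i = 0; i < 1000; i++) {
        snprintf(dir, sizeof(dir), "/sys/fs/cgroup/memory/leak_test_%d", i);
        snprintf(procs, sizeof(procs), "%s/cgroup.procs", dir);
        snprintf(datafile, sizeof(datafile), "/tmp/leak_test_%d", i);

        if (mkdir(dir, 0755) && errno != EEXIST) {
            perror("mkdir");
            return 1;
        }
        write_str(procs, pid);              /* move this task into the cgroup */

        int fd = open(datafile, O_CREAT | O_WRONLY, 0644);
        if (fd < 0) {
            perror(datafile);
            return 1;
        }
        for (int j = 0; j < 256; j++)       /* ~1 MB of page cache charged here */
            if (write(fd, buf, sizeof(buf)) < 0)
                break;
        close(fd);

        write_str("/sys/fs/cgroup/memory/cgroup.procs", pid); /* leave the cgroup */
        rmdir(dir); /* the directory disappears, but the kernel memcg may linger */
    }
    return 0;
}
```

On affected kernels, running this in a loop makes the memory num_cgroups value in /proc/cgroups keep growing even though the corresponding directories are gone.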

The solution

Having analyzed the memory cgroup leak, we now basically understand its root cause and triggering scenarios. So how do we fix the leak?

Solution 1: Drop cache

Since the cgroup leak is caused by cache memory that is never reclaimed, the most direct solution is to clear the system cache with echo 3 > /proc/sys/vm/drop_caches.

But clearing the cache is only a mitigation; cgroup leaks will still occur afterwards. On the one hand, it requires a scheduled task that drops caches every day; on the other hand, the drop-cache operation itself consumes a lot of CPU and affects business. On nodes that have already accumulated a large number of leaked cgroups, the drop-cache operation may even get stuck while clearing the cache, creating new problems.

Solution 2: nokmem

The kernel provides the cgroup.memory=nokmem boot parameter, which disables the kmem accounting feature. With this parameter configured, memory cgroups no longer have their own slabinfo files, so even if memory cgroups leak, the kworker threads' CPU usage will not surge.

However, this solution only takes effect after a reboot, which has some impact on business. Moreover, it does not fundamentally solve the memory cgroup leak; it only alleviates the problem to a certain extent.

Solution 3: Eliminate the trigger sources

The analysis above shows that both known trigger sources of the cgroup leak can be eliminated.

For the first trigger, after communicating with the relevant business team, it was confirmed that the cron task could be disabled.

For the second trigger, you can run loginctl enable-linger username to make the corresponding user a lingering (resident) user.

After that, when the user logs in, systemd-logind creates a permanent memory cgroup for that user. This cgroup is reused on every login and is not deleted when the user logs out, so there is no leak.

At this point the memory cgroup leak seems to be solved, but in fact the measures above only cover the two known trigger scenarios; they do not solve the underlying problem that cgroup resources cannot be completely cleaned up and reclaimed. New trigger scenarios for memory cgroup leaks may appear later.

Solutions in the kernel

Conventional approaches

While locating the problem, we found through Google that cgroup leaks cause many problems in container scenarios, with cases reported on the CentOS 7 series and 4.x kernels. The main reason is that the kernel's support for cgroup kernel-memory accounting is incomplete, and when K8s (Kubernetes)/runc uses this feature, memory cgroup leaks occur.

The main solutions are as follows:

  1. Periodically run the drop cache command

  2. Disable the kmem accounting feature with the kernel nokmem parameter

  3. Disable KernelMemoryAccounting in K8s (Kubernetes)

  4. Disable KernelMemoryAccounting in Docker/runc

We wondered whether there is a better way to solve the cgroup leak “once and for all” at the kernel level.

Kernel reclaim thread

Analysis of the memory cgroup leak shows that the core problem is this: the temporary cgroup directories created by systemd-logind are automatically destroyed, but the cache memory generated by reading and writing files, along with the associated slab memory, is not reclaimed immediately. Because these pages still hold references, the reference count of the cgroup's management structure cannot drop to zero, so although the cgroup's directory in the mounted filesystem has been deleted, the associated kernel data structures remain in the kernel.

Based on our study of community fixes for related problems and on ideas from Alibaba Cloud Linux, we implemented a simple and direct solution:

Run a kernel thread that reclaims these residual memory cgroups separately and releases the memory pages they hold back to the system, so that the residual memory cgroups can be freed normally.

This kernel thread has the following features:

  1. It reclaims only residual memory cgroups

  2. It runs at the lowest priority

  3. It calls cond_resched() after each round of memory cgroup reclaim, so it never occupies a CPU for long

The core flow of the reclaim thread is as follows:
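A minimal sketch of such a thread, for illustration only: memcg_find_next_residual() and memcg_reclaim_one_residual() are hypothetical placeholders for the real per-cgroup scanning and reclaim logic, which depends on internal memcg interfaces and the kernel version.

```c
/*
 * Illustrative sketch of a low-priority kernel thread that periodically scans
 * for residual (offline but not yet freed) memory cgroups and reclaims the
 * pages they still hold. memcg_find_next_residual() and
 * memcg_reclaim_one_residual() are hypothetical placeholders for the real,
 * kernel-version-specific memcg internals.
 */
#include <linux/delay.h>
#include <linux/err.h>
#include <linux/init.h>
#include <linux/kthread.h>
#include <linux/memcontrol.h>
#include <linux/sched.h>

static int memcg_reclaim_thread(void *unused)
{
    set_user_nice(current, 19);                 /* run at the lowest priority */

    while (!kthread_should_stop()) {
        struct mem_cgroup *memcg;

        /* Walk only residual memcgs: their directory is gone, but the
         * kernel object is still pinned by cached pages. */
        while ((memcg = memcg_find_next_residual()) != NULL) {
            memcg_reclaim_one_residual(memcg);  /* push its pages back to the system */
            cond_resched();                     /* never hog the CPU */
        }

        msleep_interruptible(60 * 1000);        /* rescan once a minute */
    }
    return 0;
}

static int __init memcg_reclaimd_init(void)
{
    struct task_struct *t;

    t = kthread_run(memcg_reclaim_thread, NULL, "memcg_reclaimd");
    return IS_ERR(t) ? PTR_ERR(t) : 0;
}
```

In a real implementation the scan also has to take and release references on each memory cgroup it touches; the placeholders above hide that detail.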

Functional verification

Functional and performance tests were run on a kernel with the reclaim thread added; the results are as follows:

  • With the reclaim thread enabled in the test environment, residual memory cgroups are cleaned up in a timely manner.

  • When cleaning up a simulated backlog of 400,000 leaked memory cgroups, the reclaim thread's CPU usage stays below 5%; the resource cost is acceptable.

  • We also tested very large residual memory cgroups: when reclaiming a single memory cgroup holding 20 GB of memory, the execution time of the core reclamation function stayed essentially below 64 µs, with no impact on other services.

With the kernel reclaim thread enabled, the kernel passed the LTP stability test suite, so the change does not increase kernel stability risk.

Adding a kernel thread to reclaim residual memory cgroups therefore solves the cgroup leak effectively at a small resource cost. This solution has been widely deployed in NetEase's private cloud and has effectively improved the stability of NetEase's container business.

Conclusion

The above is our analysis and troubleshooting of the memory cgroup leak problem, together with the common mitigations and a way to solve it at the kernel level.

In long-term business practice we have come to appreciate that K8s (Kubernetes) and container scenarios exercise and depend on the Linux kernel in every respect. On the one hand, the whole container stack is built on capabilities provided by the kernel, so locating and fixing bugs in the related modules is essential for kernel stability. On the other hand, kernel optimizations and new features for container scenarios keep emerging. We continue to follow the development of the relevant technologies, such as using eBPF to optimize container networking and enhance kernel monitoring, and using cgroup v2/PSI to improve container resource isolation and monitoring. These are also the main directions NetEase will push forward in the future.

About the author: Zhang Yabin, kernel expert at NetEase Shufan