This article is a troubleshooting best practice shared by the Security Container Cloud technology team.

The problem background

We found that the kubelet process on every worker node in the customer's Kubernetes cluster had abnormally high CPU usage over a long period, with pidstat showing it reach 100%. This article records the process of troubleshooting this problem.

Cluster environment

The troubleshooting process

Use the strace tool to trace the kubelet process

1. Since the kubelet process's CPU usage is abnormal, we can use the strace tool to dynamically trace its system calls. First, use the strace -cp command to collect statistics, over a period of time, on the time spent in, the number of, and the errors from each system call made by the kubelet process.
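A minimal sketch of the invocation (obtaining the kubelet PID via pidof here is illustrative; use the actual PID on your node, and stop strace with Ctrl-C to print the summary table):

strace -cp $(pidof kubelet)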

As you can see from the summary output, futex returned more than 5,000 errors during the sampling window, which is not a normal number, and it accounted for 99% of the time spent in system calls, so we need to take a closer look at the kubelet process's related calls.

2. The strace -cp command only shows an aggregate view of the process's calls, so we can print a timestamp for each individual system call with the strace -tt -p command, as follows:
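Roughly, the command looks like this (the PID lookup is illustrative; -tt adds a microsecond timestamp to every call):

strace -tt -p $(pidof kubelet)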

According to the strace output, the futex-related system calls time out in large numbers, returning -1 with ETIMEDOUT (Connection timed out), which is why we saw so many errors in strace -cp.

futex is a synchronization mechanism that combines user mode and kernel mode. When the futex variable indicates that there is contention, the process executes the system call to perform the corresponding handling, such as wait or wake. According to the official documentation, futex takes the following parameters:

futex(uint32_t *uaddr, int futex_op, uint32_t val,
                 const struct timespec *timeout,   /* or: uint32_t val2 */
                 uint32_t *uaddr2, uint32_t val3);

Official documentation explains ETIMEDOUT:

ETIMEDOUT
       The operation in futex_op employed the timeout specified in
       timeout, and the timeout expired before the operation
       completed.

futex_op corresponds to the FUTEX_WAIT_PRIVATE and FUTEX_WAKE_PRIVATE operations in the output above, and you can see that most of the timeouts occur in FUTEX_WAIT_PRIVATE.

At the system call level we can determine that futex cannot enter the sleep state smoothly, but it is still unclear which operations are driving these futex calls, so we cannot yet determine the cause of the kubelet CPU surge. We need to look further at where execution gets stuck in kubelet's function calls.

FUTEX_PRIVATE_FLAG: this flag tells the kernel that the futex is private to the process and not shared with other processes; FUTEX_WAIT_PRIVATE and FUTEX_WAKE_PRIVATE are two operations that carry this flag.

Futex reference 1: man7.org/linux/man-p… Futex reference 2: man7.org/linux/man-p…

Analyze kubelet function calls with the Go pprof tool

In earlier versions of Kubernetes, you could retrieve debug data directly from the kubelet's debug/pprof interface, but this interface has since been disabled for security reasons; see CVE-2019-11248 (github.com/kubernetes/…) for details. Therefore, we go through a kubectl proxy to obtain the relevant profiling data:

1. Start the API server proxy with the kubectl proxy command

kubectl proxy --address='0.0.0.0' --accept-hosts='^*$'

Note that if you are using a kubeconfig file copied from the Rancher UI, you need to use the context that specifies the master IP. If the kubeconfig was generated by RKE or another tool, you can ignore this.
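As a quick sanity check (illustrative, assuming the proxy listens on the default port 8001), you can confirm the proxy reaches the API server before pulling profiles; a healthy API server returns "ok":

curl -s http://127.0.0.1:8001/healthz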

2. Set up a Golang environment. go tool pprof needs to run in a Go environment; if Go is not installed locally, you can quickly spin up a Go environment with Docker:

docker run -itd --name golang-env --net host golang bash

3. Collect a 60-second CPU profile of kubelet through the proxy. Replace 127.0.0.1 with the IP address of the apiserver node (the default port is 8001); if the docker run environment above is on the same node as the apiserver, you can keep 127.0.0.1. Also replace NODENAME with the name of the node you want to profile.

docker exec -it golang-env bash
go tool pprof -seconds=60 -raw -output=kubelet.pprof http://127.0.0.1:8001/api/v1/nodes/${NODENAME}/proxy/debug/pprof/profile

4. The raw pprof output is not easy to read directly, so it needs to be converted into a flame graph; FlameGraph is recommended for generating the SVG:

git clone https://github.com/brendangregg/FlameGraph.git
cd FlameGraph/
./stackcollapse-go.pl kubelet.pprof > kubelet.out
./flamegraph.pl kubelet.out > kubelet.svg

Once converted into a flame graph, you can open the SVG in a browser and visually inspect the function call relationships and the proportion of time spent in each call.

5. Analyze the flame graph

From the kubelet flame graph you can see that the function with the longest cumulative time is k8s.io/kubernetes/vendor/github.com/google/cadvisor/manager.(*containerData).housekeeping. cAdvisor is a metrics collection tool built into kubelet; it is responsible for real-time monitoring and performance data collection for the resources and containers on the node, including CPU usage, memory usage, network throughput and file system usage.

Drilling further into the call stack, we find that k8s.io/kubernetes/vendor/github.com/opencontainers/runc/libcontainer/cgroups/fs.(*Manager).GetStats accounts for most of the time spent in k8s.io/kubernetes/vendor/github.com/google/cadvisor/manager.(*containerData).housekeeping; in other words, obtaining the container's cgroup-related status is what takes so long.

6. Since this function takes a long time, let’s analyze what this function does.

View the source code: github.com/kubernetes/…

func (s *MemoryGroup) GetStats(path string, stats *cgroups.Stats) error {
    // Set stats from memory.stat.
    statsFile, err := os.Open(filepath.Join(path, "memory.stat"))
    if err != nil {
        if os.IsNotExist(err) {
            return nil
        }
        return err
    }
    defer statsFile.Close()

    sc := bufio.NewScanner(statsFile)
    for sc.Scan() {
        t, v, err := fscommon.GetCgroupParamKeyValue(sc.Text())
        if err != nil {
            return fmt.Errorf("failed to parse memory.stat (%q) - %v", sc.Text(), err)
        }
        stats.MemoryStats.Stats[t] = v
    }
    stats.MemoryStats.Cache = stats.MemoryStats.Stats["cache"]

    memoryUsage, err := getMemoryData(path, "")
    if err != nil {
        return err
    }
    stats.MemoryStats.Usage = memoryUsage
    swapUsage, err := getMemoryData(path, "memsw")
    if err != nil {
        return err
    }
    stats.MemoryStats.SwapUsage = swapUsage
    kernelUsage, err := getMemoryData(path, "kmem")
    if err != nil {
        return err
    }
    stats.MemoryStats.KernelUsage = kernelUsage
    kernelTCPUsage, err := getMemoryData(path, "kmem.tcp")
    if err != nil {
        return err
    }
    stats.MemoryStats.KernelTCPUsage = kernelTCPUsage

    useHierarchy := strings.Join([]string{"memory", "use_hierarchy"}, ".")
    value, err := fscommon.GetCgroupParamUint(path, useHierarchy)
    if err != nil {
        return err
    }
    if value == 1 {
        stats.MemoryStats.UseHierarchy = true
    }

    pagesByNUMA, err := getPageUsageByNUMA(path)
    if err != nil {
        return err
    }
    stats.MemoryStats.PageUsageByNUMA = pagesByNUMA

    return nil
}

As you can see from the code, the function reads memory.stat, the file that holds the cgroup's memory usage statistics; in other words, reading this file is where the time goes. So what happens if we read the file manually?

# time cat /sys/fs/cgroup/memory/memory.stat > /dev/null
real	0m9.065s
user	0m0.000s
sys	0m9.064s

Here’s a clue: it took 9s to read this file, which is obviously not normal.

Based on these results, we found a related issue in the cAdvisor GitHub repository (github.com/google/cadv…). The issue shows that the problem is related to the slab memory cache: memory on the affected machines is gradually consumed, and /proc/meminfo shows that much of it is slab memory, i.e. kernel cache pages, most of which are dentry cache. From this we can infer that when a cgroup's life cycle ends, its kernel objects are still held in slab memory by the cache and cannot be released, leaving behind what are effectively zombie cgroups.

This cache holds the cgroup-related dentry and inode objects, i.e. the directory and file index cache, which exists to speed up reads. However, when all the processes in a cgroup have exited, this cache in kernel memory space is not cleaned up.

The kernel allocates memory with the buddy algorithm: each time a process requests memory it is given at least one page, i.e. at least 4 KB, and memory is likewise released one page at a time. To handle such small allocations efficiently, Linux introduces slab memory allocation on top of the buddy allocator. The consequence is that when all the processes in a cgroup exit, this slab memory is not easily reclaimed, and the cached data it holds is still read when collecting stats, which hurts read performance.
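To observe this on a node, you can check slab usage and the number of memory cgroup directories (a rough sketch, assuming cgroup v1 paths):

# slab usage reported by the kernel
grep -E 'Slab|SReclaimable|SUnreclaim' /proc/meminfo
# top slab caches; in this scenario dentry is expected to dominate
slabtop -o | head -n 15
# number of memory cgroup directories currently present
find /sys/fs/cgroup/memory -type d | wc -l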

The solution

1. Temporarily clear the node's memory cache. This is only a stopgap: it reduces kubelet's CPU usage by dropping the node's memory cache, but CPU usage will climb again as the cache builds back up.

echo 2 > /proc/sys/vm/drop_caches
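After dropping the caches, you can repeat the earlier timing check to confirm that reading memory.stat is fast again (illustrative):

time cat /sys/fs/cgroup/memory/memory.stat > /dev/null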

2. Upgrade the kernel version

This is fundamentally a kernel problem. The GitHub commit (github.com/torvalds/li…) shows that cgroup stats query performance has been optimized in kernel 5.2 and later, so for a proper fix you are advised to upgrade the kernel according to your own operating system and environment. In addition, Red Hat provides the fix in kernel-4.18.0-176 (bugzilla.redhat.com/show_bug.cg…), and CentOS 8 / RHEL 8 ships with kernel 4.18 by default, so if you are currently running RHEL 7 / CentOS 7 you can gradually migrate to the newer operating system with kernel 4.18.0-176 or later; after all, a newer kernel provides a much better experience for containers.
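To check whether a node's kernel is already new enough (a trivial check; compare against 5.2+, or 4.18.0-176+ on RHEL/CentOS 8):

uname -r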

Kernel commit: github.com/torvalds/li… Red Hat kernel bug fix: bugzilla.redhat.com/show_bug.cg…