< Follow the public account, reply “Investigation” to get the download link >
The book Kubernetes is now available for download
The author of this book, Luo Jianlong (alias Shengdong), is an Alibaba Cloud technical expert with many years of experience in operating system and graphics card driver debugging and development. He currently focuses on the cloud native field, container clusters, and service mesh. The book is divided into theory and practice, with 12 technical articles in total. It analyzes in depth topics such as cluster control, cluster scaling principles, and image pulling, taking you from basic concepts to a precise understanding and skilled, effortless use of Kubernetes!
The book has four major highlights:
- Real-world cases distilled from large-scale online operations
- A close combination of theory and practice
- Theory explained in plain language
- Technical details traced to their roots
It helps you understand 6 core principles in one go, master the basic theory, and learn skilled troubleshooting through 6 typical problems!
How to download it for free?
Follow “Alibaba Cloud Native”, reply “**Investigation**”, and you can download this book for free.
Preface
The following is an excerpt from the book Kubernetes.
Alibaba Cloud has its own Kubernetes container cluster product. As shipments of Kubernetes clusters grew rapidly, we sporadically found online that cluster nodes would, with a very low probability, fall into the NotReady state.
According to our observations, one or two customers ran into this problem almost every month. After a node becomes NotReady, the cluster Master loses control over it: for example, it can no longer deliver new Pods to the node or retrieve real-time information about the Pods already running on it.
In that problem, our troubleshooting path went from the K8s cluster down to the container runtime, and on to sd-bus and systemd, which was anything but trivial. The issue has since been fixed in systemd, so the chances of running into it are getting lower and lower.
However, node readiness problems have not gone away; they just occur for different reasons.
Today's article focuses on another case of a cluster node becoming NotReady. This problem is completely different from the one above.
Problem symptoms
In this case, cluster nodes become NotReady. The problem can be resolved temporarily by restarting the node, but will recur after about 20 days.
After the problem occurs, if we restart kubelet on the node, the node becomes Ready, but only stays Ready for about three minutes. This is a peculiar detail.
The logic
Before we dive in, let's take a look at the big picture behind the ready state of cluster nodes. In a K8s cluster, four components are involved in the node readiness state: etcd, the cluster's core database; the API Server, the cluster's entry point; the node controller; and kubelet, which runs on the cluster node and manages the node directly.
On the one hand, Kubelet plays the role of a cluster controller: it regularly obtains Pods and other related resources from the API Server and controls the execution of the Pods running on the node accordingly. On the other hand, Kubelet acts as a monitor of node health: it collects node information and, as a cluster client, synchronizes that health status to the API Server.
In this case, Kubelet plays the second role.
In this scenario, Kubelet uses the NodeStatus mechanism shown above: it periodically checks the status of the cluster node and synchronizes that status to the API Server. The main criterion NodeStatus uses to judge node readiness is PLEG.
PLEG stands for Pod Lifecycle Events Generator. Its execution logic is basically to check the Pods running on the node periodically and, whenever it finds a change of interest, wrap that change as an Event and send it to Kubelet's main synchronization mechanism, syncLoop, for processing. When PLEG's periodic Pod check fails to complete on schedule, however, the NodeStatus mechanism treats the node as being in an abnormal state and synchronizes this state to the API Server.
It is the node controller that ultimately turns the node status reported by Kubelet into the node's final state. Here I deliberately distinguish between the node status reported by Kubelet and the node's final state: the former is the Condition we see when describing the node, while the latter is the NotReady status in the actual node list.
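For completeness, the sketch below (my own illustration, not from the book) shows how that reported Condition can be read with client-go; the kubeconfig path and node name are placeholders.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from a kubeconfig; the path is illustrative.
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Node name is a placeholder; newer client-go versions take a context here.
	node, err := client.CoreV1().Nodes().Get(context.TODO(), "cn-hangzhou.i-xxxxxx", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// The Ready condition is what Kubelet reports through the NodeStatus
	// mechanism; the node controller turns it into the Ready/NotReady we
	// see in the node list.
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady {
			fmt.Printf("Ready=%s reason=%s message=%q\n", cond.Status, cond.Reason, cond.Message)
		}
	}
}
```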
Ready for only three minutes
After the problem occurs, we restart Kubelet; the node then becomes Ready, but turns NotReady again three minutes later. This phenomenon is a key entry point for the investigation.
Before I explain it, take a look at the official PLEG schematic. This picture mainly shows two processes.
- On the one hand, Kubelet acts as a cluster controller, fetching pod Spec changes from the API Server, and then creating or terminating pods by creating worker threads.
- PLEG, on the other hand, periodically checks the state of the container and reports the state back to Kubelet in the form of events.
Here, PLEG has two key time parameters: the check interval and the check timeout. By default, the interval between PLEG checks is one second; in other words, after each check finishes, PLEG waits one second before running the next one. The timeout for each check is three minutes. If a PLEG check (relist) cannot complete within three minutes, this situation is synchronized to the API Server by the NodeStatus mechanism mentioned in the previous section as evidence that the node is NotReady.
The reason the node stayed Ready for only three minutes after kubelet was restarted is that, after the restart, the first PLEG check never completed successfully; the node's unready state was only synchronized to the cluster once the three-minute timeout had expired.
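To make this mechanism concrete, here is a simplified Go sketch of the health check, loosely modeled on kubelet's generic PLEG; the types and details are abbreviated, so treat it as an illustration rather than the actual source.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const (
	relistPeriod    = 1 * time.Second // pause between relist runs
	relistThreshold = 3 * time.Minute // relist is considered stuck after this
)

type pleg struct {
	mu         sync.Mutex
	relistTime time.Time // when relist was last seen active
}

// Healthy is what the NodeStatus mechanism consults: if relist has not been
// active within the threshold, the node is reported as not ready.
func (p *pleg) Healthy() (bool, error) {
	p.mu.Lock()
	elapsed := time.Since(p.relistTime)
	p.mu.Unlock()
	if elapsed > relistThreshold {
		return false, fmt.Errorf("pleg was last seen active %v ago; threshold is %v", elapsed, relistThreshold)
	}
	return true, nil
}

// relist records that it is active, then inspects the pods on the node. If it
// blocks (for example in a CRI call into the runtime), it never returns, the
// next run never starts, relistTime stops advancing, and Healthy begins to
// fail three minutes later.
func (p *pleg) relist() {
	p.mu.Lock()
	p.relistTime = time.Now()
	p.mu.Unlock()
	// ... query the container runtime and emit pod lifecycle events ...
}

func main() {
	p := &pleg{relistTime: time.Now()}
	go func() {
		for {
			p.relist()
			time.Sleep(relistPeriod)
		}
	}()
	if ok, err := p.Healthy(); !ok {
		fmt.Println("PLEG is not healthy:", err)
	} else {
		fmt.Println("PLEG is healthy")
	}
}
```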
In the figure below, the top line represents the normal PLEG execution and the bottom line represents the problem. Relist is the main function of the check.
PLEG is stuck
With that in mind, let's look at PLEG's log entries. The log can basically be divided into two parts. One part, "skipping pod synchronization", is output by Kubelet's main synchronization function syncLoop and indicates that a pod synchronization was skipped; the other part, "PLEG is not healthy: pleg was last seen active 3m0.000091019s ago; threshold is 3m0s", clearly shows the three-minute relist timeout discussed in the previous section.
```
299597 kubelet skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m0.000091019s ago; threshold is 3m0s]
17:08:22.399758 kubelet skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m0.100259802s ago; threshold is 3m0s]
17:08:22.599931 kubelet skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m0.300436887s ago; threshold is 3m0s]
17:08:23.000087 kubelet skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m0.700575691s ago; threshold is 3m0s]
17:08:23.800258 kubelet skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m1.500754856s ago; threshold is 3m0s]
400439 kubelet skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m3.100936232s ago; threshold is 3m0s]
17:08:28.600599 kubelet skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m6.301098811s ago; threshold is 3m0s]
17:08:33.600812 kubelet skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m11.30128783s ago; threshold is 3m0s]
17:08:38.600983 kubelet skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m16.301473637s ago; threshold is 3m0s]
17:08:41.601157 kubelet skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m21.301651575s ago; threshold is 3m0s]
17:08:48.601331 kubelet skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m26.301826001s ago; threshold is 3m0s]
```
Kubelet's call stacks give a direct view of how the relist function is executing. We simply send SIGABRT to the Kubelet process, and the Golang runtime prints out all of the process's call stacks. Note that this operation kills the Kubelet process; but because restarting Kubelet in this case does not disturb the environment in which the problem reproduces, it does not matter much.
The following call stack is PLEG's relist function. Reading from the bottom up, we can see that relist is waiting to get PodSandboxStatus over gRPC.
```
kubelet: k8s.io/kubernetes/vendor/google.golang.org/grpc/transport.(*Stream).Header()
kubelet: k8s.io/kubernetes/vendor/google.golang.org/grpc.recvResponse()
kubelet: k8s.io/kubernetes/vendor/google.golang.org/grpc.invoke()
kubelet: k8s.io/kubernetes/vendor/google.golang.org/grpc.Invoke()
kubelet: k8s.io/kubernetes/pkg/kubelet/apis/cri/runtime/v1alpha2.(*runtimeServiceClient).PodSandboxStatus()
kubelet: k8s.io/kubernetes/pkg/kubelet/remote.(*RemoteRuntimeService).PodSandboxStatus()
kubelet: k8s.io/kubernetes/pkg/kubelet/kuberuntime.instrumentedRuntimeService.PodSandboxStatus()
kubelet: k8s.io/kubernetes/pkg/kubelet/kuberuntime.(*kubeGenericRuntimeManager).GetPodStatus()
kubelet: k8s.io/kubernetes/pkg/kubelet/pleg.(*GenericPLEG).updateCache()
kubelet: k8s.io/kubernetes/pkg/kubelet/pleg.(*GenericPLEG).relist()
kubelet: k8s.io/kubernetes/pkg/kubelet/pleg.(*GenericPLEG).(k8s.io/kubernetes/pkg/kubelet/pleg.relist)-fm()
kubelet: k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc420309260)
kubelet: k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil()
kubelet: k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.Until()
```
Searching Kubelet's call stacks for PodSandboxStatus, it is easy to find the thread below, which is the one actually querying the sandbox status. Reading from the bottom up, we can see that this thread is trying to acquire a Mutex in the Plugin Manager.
```
kubelet: sync.runtime_SemacquireMutex()
kubelet: sync.(*Mutex).Lock()
kubelet: k8s.io/kubernetes/pkg/kubelet/dockershim/network.(*PluginManager).GetPodNetworkStatus()
kubelet: k8s.io/kubernetes/pkg/kubelet/dockershim.(*dockerService).getIPFromPlugin()
kubelet: k8s.io/kubernetes/pkg/kubelet/dockershim.(*dockerService).getIP()
kubelet: k8s.io/kubernetes/pkg/kubelet/dockershim.(*dockerService).PodSandboxStatus()
kubelet: k8s.io/kubernetes/pkg/kubelet/apis/cri/runtime/v1alpha2._RuntimeService_PodSandboxStatus_Handler()
kubelet: k8s.io/kubernetes/vendor/google.golang.org/grpc.(*Server).processUnaryRPC()
kubelet: k8s.io/kubernetes/vendor/google.golang.org/grpc.(*Server).handleStream()
kubelet: k8s.io/kubernetes/vendor/google.golang.org/grpc.(*Server).serveStreams.func1.1()
kubelet: created by k8s.io/kubernetes/vendor/google.golang.org/grpc.(*Server).serveStreams.func1
```
This Mutex is only used inside the Plugin Manager, so let's look at all the call stacks related to the Plugin Manager. Some of the threads are waiting for the Mutex, while the rest are waiting for the Terway CNI plugin.
```
kubelet: syscall.Syscall6()
kubelet: os.(*Process).blockUntilWaitable()
kubelet: os.(*Process).wait()
kubelet: os.(*Process).Wait()
kubelet: os/exec.(*Cmd).Wait()
kubelet: os/exec.(*Cmd).Run()
kubelet: k8s.io/kubernetes/vendor/github.com/containernetworking/cni/pkg/invoke.(*RawExec).ExecPlugin()
kubelet: k8s.io/kubernetes/vendor/github.com/containernetworking/cni/pkg/invoke.(*PluginExec).WithResult()
kubelet: k8s.io/kubernetes/vendor/github.com/containernetworking/cni/pkg/invoke.ExecPluginWithResult()
kubelet: k8s.io/kubernetes/vendor/github.com/containernetworking/cni/libcni.(*CNIConfig).AddNetworkList()
kubelet: k8s.io/kubernetes/pkg/kubelet/dockershim/network/cni.(*cniNetworkPlugin).addToNetwork()
kubelet: k8s.io/kubernetes/pkg/kubelet/dockershim/network/cni.(*cniNetworkPlugin).SetUpPod()
kubelet: k8s.io/kubernetes/pkg/kubelet/dockershim/network.(*PluginManager).SetUpPod()
kubelet: k8s.io/kubernetes/pkg/kubelet/dockershim.(*dockerService).RunPodSandbox()
kubelet: k8s.io/kubernetes/pkg/kubelet/apis/cri/runtime/v1alpha2._RuntimeService_RunPodSandbox_Handler()
kubelet: k8s.io/kubernetes/vendor/google.golang.org/grpc.(*Server).processUnaryRPC()
kubelet: k8s.io/kubernetes/vendor/google.golang.org/grpc.(*Server).handleStream()
kubelet: k8s.io/kubernetes/vendor/google.golang.org/grpc.(*Server).serveStreams.func1.1()
```
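The relationship between the two stacks above boils down to a familiar pattern: a goroutine that holds a lock while blocked in a slow external call stalls every other goroutine that needs the same lock. The sketch below is my own miniature illustration of that pattern; kubelet's real plugin manager uses per-pod locks and different names, so this is not kubelet code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// pluginManager stands in for kubelet's network plugin manager: pod setup and
// pod status queries are serialized behind the same lock in this sketch.
type pluginManager struct {
	mu sync.Mutex
}

// setUpPod mimics SetUpPod: it takes the lock and then calls out to an
// external CNI plugin. If that call never returns, the lock is never released.
func (m *pluginManager) setUpPod() {
	m.mu.Lock()
	defer m.mu.Unlock()
	execCNIPlugin() // blocked forever in this scenario
}

// getPodNetworkStatus mimics GetPodNetworkStatus, which PodSandboxStatus
// (and therefore PLEG's relist) ends up waiting on.
func (m *pluginManager) getPodNetworkStatus() {
	m.mu.Lock()
	defer m.mu.Unlock()
	fmt.Println("got pod network status")
}

// execCNIPlugin stands in for exec'ing the CNI binary, which in turn waits on
// its daemon and never comes back.
func execCNIPlugin() {
	select {} // block forever
}

func main() {
	m := &pluginManager{}
	go m.setUpPod() // holds the lock while stuck in the CNI plugin

	time.Sleep(100 * time.Millisecond)
	done := make(chan struct{})
	go func() {
		m.getPodNetworkStatus() // blocks on the same lock
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("status returned")
	case <-time.After(time.Second):
		fmt.Println("status query still blocked after 1s, like relist in the stacks above")
	}
}
```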
Unresponsive Terwayd
Before going further, we need to distinguish between Terway and Terwayd. Essentially, Terway and Terwayd have a client-server relationship, similar to the relationship between flannel and flanneld. Terway is the plug-in that implements the CNI interface invoked by Kubelet.
At the end of the previous section we saw the problem: when Kubelet calls the Terway CNI plug-in to configure a Pod network, Terway does not respond for a long time. Normally this operation finishes within seconds. When the problem occurred, Terway was not doing its job in time, so we saw large numbers of Terway processes piling up on the cluster node.
Similarly, we can send SIGABRT to these Terway plug-in processes to print out their call stacks. Below is the call stack of one of the Terway processes. This thread is executing the cmdDel function, whose job is to delete a Pod network configuration.
```
kubelet: net/rpc.(*Client).Call()
kubelet: main.rpcCall()
kubelet: main.cmdDel()
kubelet: github.com/AliyunContainerService/terway/vendor/github.com/containernetworking/cni/pkg/skel.(*dispatcher).checkVersionAndCall()
kubelet: github.com/AliyunContainerService/terway/vendor/github.com/containernetworking/cni/pkg/skel.(*dispatcher).pluginMain()
kubelet: github.com/AliyunContainerService/terway/vendor/github.com/containernetworking/cni/pkg/skel.PluginMainWithError()
kubelet: github.com/AliyunContainerService/terway/vendor/github.com/containernetworking/cni/pkg/skel.PluginMain()
```
The thread above deletes the Pod network by calling Terwayd over RPC, so we need to dig into Terwayd's call stacks to locate the problem further. As Terway's server, Terwayd receives remote calls from Terway and performs cmdAdd or cmdDel on its behalf to create or delete Pod network configurations.
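To make this division of labor concrete, here is a hypothetical sketch of such a plugin-to-daemon call over a unix socket using Go's net/rpc; the socket path, service method name, and argument types are illustrative and are not Terway's actual API, and the daemon side is omitted.

```go
package main

import (
	"fmt"
	"net/rpc"
)

// DeleteArgs is an illustrative request type; the real RPC schema differs.
type DeleteArgs struct {
	ContainerID string
	Netns       string
}

// DeleteReply is an illustrative response type.
type DeleteReply struct {
	Success bool
}

func main() {
	// The short-lived CNI binary connects to the long-running daemon over a
	// unix socket (path is illustrative) and delegates the real work to it.
	client, err := rpc.Dial("unix", "/var/run/eni/eni.socket")
	if err != nil {
		panic(err)
	}
	defer client.Close()

	args := DeleteArgs{ContainerID: "abc123", Netns: "/proc/4242/ns/net"}
	var reply DeleteReply

	// This is the kind of call the Terway stack above is parked in: if the
	// daemon never answers, Call never returns, and kubelet's CNI exec hangs.
	if err := client.Call("Daemon.Delete", args, &reply); err != nil {
		panic(err)
	}
	fmt.Println("network deleted:", reply.Success)
}
```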
As you can see in the screenshot above, there are thousands of Terway processes on cluster nodes waiting for Terwayd, so there are actually thousands of threads in Terwayd processing Terway requests.
Using the following command, you can print out the call stack without restarting Terwayd.
```
curl --unix-socket /var/run/eni/eni.socket 'http:/debug/pprof/goroutine?debug=2'
```
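This works because the daemon presumably serves Go's net/http/pprof handlers on that unix socket. Under that assumption, and with an illustrative socket path, a minimal sketch of how a daemon can expose this looks like the following; it is not Terwayd's actual code.

```go
package main

import (
	"log"
	"net"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on the default mux
	"os"
)

func main() {
	const sock = "/tmp/example-debug.socket" // illustrative path
	_ = os.Remove(sock)                      // clean up a stale socket file

	l, err := net.Listen("unix", sock)
	if err != nil {
		log.Fatal(err)
	}
	// Serving the default mux on the unix socket lets a curl command like the
	// one above (pointed at this socket) dump every goroutine's stack without
	// restarting the process.
	log.Fatal(http.Serve(l, nil))
}
```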
Because Terwayd's call stacks are very complex, and almost every thread is waiting for a lock, directly analyzing which thread holds which lock is complicated. Instead we can use the "time method": assume that the thread that entered the waiting state earliest is most likely the one holding the lock.
After analyzing the call stacks and the code, we found that the thread below, which had been waiting the longest (1595 minutes), was the one holding the lock. This lock blocks all the threads that create or delete Pod network configurations.
```
goroutine 67570 [syscall, 1595 minutes, locked to thread]:
syscall.Syscall6()
github.com/AliyunContainerService/terway/vendor/golang.org/x/sys/unix.recvfrom()
github.com/AliyunContainerService/terway/vendor/golang.org/x/sys/unix.Recvfrom()
github.com/AliyunContainerService/terway/vendor/github.com/vishvananda/netlink/nl.(*NetlinkSocket).Receive()
github.com/AliyunContainerService/terway/vendor/github.com/vishvananda/netlink/nl.(*NetlinkRequest).Execute()
github.com/AliyunContainerService/terway/vendor/github.com/vishvananda/netlink.(*Handle).LinkSetNsFd()
github.com/AliyunContainerService/terway/vendor/github.com/vishvananda/netlink.LinkSetNsFd()
github.com/AliyunContainerService/terway/daemon.SetupVethPair()
github.com/AliyunContainerService/terway/daemon.setupContainerVeth.func1()
github.com/AliyunContainerService/terway/vendor/github.com/containernetworking/plugins/pkg/ns.(*netNS).Do.func1()
github.com/AliyunContainerService/terway/vendor/github.com/containernetworking/plugins/pkg/ns.(*netNS).Do.func2()
```
Reasons
Digging into the call stack of this thread, we can determine three things:
- First, Terwayd uses the netlink library to manage virtual network interfaces, IP addresses, and routes on the node; netlink implements functionality similar to iproute2 (see the short sketch after this list).
- Second, netlink uses sockets to communicate directly with the kernel.
- Third, the thread above is waiting on the recvfrom system call.
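As a rough illustration of the first two points, the kind of call Terwayd makes looks like the sketch below, using the same vishvananda/netlink library (the device name is a placeholder and the code is not Terwayd's): each such call sends a netlink request to the kernel over a socket and then blocks in recvfrom waiting for the reply.

```go
package main

import (
	"fmt"
	"log"

	"github.com/vishvananda/netlink"
)

func main() {
	// Look up a network interface by name (placeholder device name). Under the
	// hood this sends a netlink request and waits for the kernel's answer.
	link, err := netlink.LinkByName("eth0")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("found link:", link.Attrs().Name)

	// Moving a link into another network namespace, as in the stack above, is
	// the same pattern: build a netlink request, send it over a netlink
	// socket, then block in recvfrom until the kernel replies.
	// (fd would be an open handle to the target netns; omitted here.)
	// err = netlink.LinkSetNsFd(link, fd)
}
```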
In this case, we need to look at the kernel call stack of this thread to further confirm why it is waiting. Since it is not easy to map the goroutine number to a system thread ID, we took a core dump of the whole system and looked up the kernel call stack of the thread above in it.
Searching the kernel call stacks for recvfrom, we located the thread below. From this call stack, we can basically only confirm that the thread is indeed waiting inside the recvfrom system call.
```
PID: 19246 TASK: ffff880951f70fd0 CPU: 16 COMMAND: "terwayd"
#0 [ffff880826267a40] __schedule at ffffffff816a8f65
#1 [ffff880826267aa8] schedule at ffffffff816a94e9
#2 [ffff880826267ab8] schedule_timeout at ffffffff816a6ff9
#3 [ffff880826267b68] __skb_wait_for_more_packets at ffffffff81578f80
#4 [ffff880826267bd0] __skb_recv_datagram at ffffffff8157935f
#5 [ffff880826267c38] skb_recv_datagram at ffffffff81579403
#6 [ffff880826267c58] netlink_recvmsg at ffffffff815bb312
#7 [ffff880826267ce8] sock_recvmsg at ffffffff8156a88f
#8 [ffff880826267e58] SYSC_recvfrom at ffffffff8156aa08
#9 [ffff880826267f70] sys_recvfrom at ffffffff8156b2fe
#10 [ffff880826267f80] tracesys at ffffffff816b5212 (via system_call)
```
At this point the problem becomes difficult to investigate further: it is clearly a kernel problem, or at least a kernel-related one. We went through the entire core dump and checked all the thread call stacks, but could not find any other thread that might be related to this one.
The fix
The fix for this issue is based on the assumption that netlink is not 100% reliable: netlink calls can respond slowly, or not respond at all. So we add timeouts to netlink operations to make sure that Terwayd does not block forever even when an individual netlink call never returns.
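The actual patch lives in Terway, but the shape of the fix can be sketched as a generic timeout wrapper: run the netlink call in its own goroutine and stop waiting after a deadline, so a stuck recvfrom no longer drags the calling thread (and everything queued behind its lock) down with it. The following is a minimal sketch under that assumption, not Terway's real code.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNetlinkTimeout is returned when a netlink operation takes too long.
var errNetlinkTimeout = errors.New("netlink operation timed out")

// withTimeout runs op in a separate goroutine and waits at most d for it to
// finish. If the kernel never replies and op blocks in recvfrom, the caller
// gets an error instead of hanging forever. The goroutine running op still
// leaks until the kernel replies; the point is to release the caller (and any
// lock it would otherwise keep holding) promptly.
func withTimeout(d time.Duration, op func() error) error {
	done := make(chan error, 1)
	go func() { done <- op() }()
	select {
	case err := <-done:
		return err
	case <-time.After(d):
		return errNetlinkTimeout
	}
}

func main() {
	// Placeholder for a real call such as netlink.LinkSetNsFd(link, fd).
	stuckNetlinkCall := func() error {
		select {} // simulate a kernel reply that never arrives
	}

	if err := withTimeout(2*time.Second, stuckNetlinkCall); err != nil {
		// Surface the error so pod network setup fails fast instead of
		// blocking every other pod's network operation behind the same lock.
		fmt.Println("setup failed:", err)
	}
}
```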
Conclusion
In the node-readiness scenario, Kubelet actually implements a node heartbeat mechanism: it periodically synchronizes various node-related states to the cluster control plane, including memory, PID, and disk pressure and, of course, the readiness state that is the focus of this article. While monitoring and managing the cluster node, Kubelet relies on plug-ins to operate on node resources directly. These include the network and disk plug-ins and even the container runtime, and their state directly affects Kubelet's behavior and, in turn, the state of the node.
< Follow the public account and reply “Investigation” to download the book >
“Alibaba Cloud Native” follows technical fields such as microservices, Serverless, containers, and Service Mesh, focuses on cloud native trends and the large-scale adoption of cloud native in production, and aims to be the technical community that best understands cloud native developers.