Symptoms of the problem

Monitoring fired several alerts: the kubelet on one machine was using too much CPU, the QPS against kube-apiserver was too high, and kube-controller-manager was also alerting that its request QPS to the apiserver was too high. Digging in, I finally found that a pod had been evicted and then kept failing to be recreated, over and over. This was quite surprising. I will skip parts of the troubleshooting process and focus on the following questions; you can think about them as well.

  • How is the problem triggered?
  • Why does it happen?
  • How can it be avoided?

Without further ado, let's first reproduce the problem and then analyze the questions above.

Reproducing the problem

First, here is the Deployment YAML file, adapted from the problematic pod.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo
  labels:
    app: demo
spec:
  ## number of pod replicas
  replicas: 1
  strategy:
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 0
    type: RollingUpdate
  # label selector used to match the pod template
  selector:
    matchLabels:
      app: demo
  ## pod template
  template:
    metadata:
      labels:
        app: demo
    spec:
      # share the host's network namespace
      hostNetwork: true
      # pin the pod to a specific node by name
      nodeName: "k8s-node-02"
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80

Based on this file, let's try to create the service:

kubectl apply -f demo-test.yaml

After creating it, the pod ends up scheduled on k8s-node-02. Now I want to update the service, so I run the following command:

kubectl set image deployments/demo nginx=nginx:1.16.1 --record

A large number of demo pods were generated (the reported failure reason was NodePorts, which may differ depending on your Kubernetes version), so the problem was reproduced.

Problem analysis

How it is triggered

A pod already exists on the node; after the new pod is rejected, another one is created immediately. There are a few key points here:

  1. After the new pod is created, it is scheduled onto the same node.
  2. Creating the pod causes a port conflict on that node.
  3. The problem only appears after an update is rolled out.

Based on the above three points, we can map them to three settings in the Deployment (pulled together in the snippet after this list):

  1. nodeName: "k8s-node-02" specifies the node the pod is placed on.
  2. hostNetwork: true makes the pod share the host's network namespace.
  3. type: RollingUpdate means a rolling update: the new pod is created first, and only after it is up is the old pod deleted.
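
For quick reference, here are those three fields as they appear together in the Deployment manifest above:

spec:
  strategy:
    type: RollingUpdate        # the new pod is created before the old one is removed
  template:
    spec:
      hostNetwork: true        # the pod shares the host's network namespace (and its ports)
      nodeName: "k8s-node-02"  # the pod is pinned to this node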

Why it happens

Now that we know how the problem is triggered, let's sketch the whole picture (this first pass is not entirely accurate, as we will see). After we update the service, the rolling update creates a new pod first, and because of nodeName it lands directly on the target node. But the old pod still exists there and shares the host's network namespace, so the ports conflict. This triggers kubelet's eviction (rejection) logic and the new pod is rejected. After that, kubelet still has to keep its pod state consistent with the information in etcd, so the pod keeps getting created, and the loop never ends.

At first glance this analysis seems fine, but there are still a couple of small problems:

  1. What is kube-scheduler's scheduling mechanism? Why is the pod placed on that node without going through the predicate (pre-selection) step?
  2. Nothing here explains the high client QPS from kube-controller-manager.

Let's take it step by step: first look at Kubernetes' scheduling mechanism, then work out the answers.

Scheduling mechanism of nodeName

Let's look at how kube-scheduler works. Scheduling consists of three steps:

  1. Get the list of unscheduled pods (the podList).
  2. Run the predicate (pre-selection) and priority algorithms to pick a suitable node for the pod.
  3. Finally, submit the chosen node to the apiserver.

We can hypothesize that if a pod already has nodeName set, it never appears in the unscheduled podList, and also that what the scheduler writes back is precisely the nodeName field. To test this hypothesis, let's look at the source code.

The code for getting the unscheduled podList:

// Initialization logic for the podInformer
func NewPodInformer(client clientset.Interface, resyncPeriod time.Duration) coreinformers.PodInformer {
    // Select pods that are neither succeeded nor failed
    selector := fields.ParseSelectorOrDie(
            "status.phase!=" + string(v1.PodSucceeded) +
                    ",status.phase!=" + string(v1.PodFailed))
    lw := cache.NewListWatchFromClient(client.CoreV1().RESTClient(), string(v1.ResourcePods), metav1.NamespaceAll, selector)
    return &podInformer{
            informer: cache.NewSharedIndexInformer(lw, &v1.Pod{}, resyncPeriod, cache.Indexers{cache.NamespaceIndex: cache.MetaNamespaceIndexFunc}),
    }
}

Let's continue with the logic that registers pod event handlers on the podInformer:

	// Cache pods that have already been scheduled
	args.PodInformer.Informer().AddEventHandler(
            cache.FilteringResourceEventHandler{
                FilterFunc: func(obj interface{}) bool {
                        switch t := obj.(type) {
                        case *v1.Pod:
                            // This is the key to determining whether to schedule
                            return assignedPod(t)
                        case cache.DeletedFinalStateUnknown:
                            if pod, ok := t.Obj.(*v1.Pod); ok {
                                    return assignedPod(pod)
                            }
                            runtime.HandleError(fmt.Errorf("unable to convert object %T to *v1.Pod in %T", obj, c))
                            return false
                        default:
                            runtime.HandleError(fmt.Errorf("unable to handle object in %T: %T", c, obj))
                            return false
                        }
                },
                Handler: cache.ResourceEventHandlerFuncs{
                        AddFunc:    c.addPodToCache,
                        UpdateFunc: c.updatePodInCache,
                        DeleteFunc: c.deletePodFromCache,
                },
        },
	)
	// Put unscheduled pods into the scheduling queue
	args.PodInformer.Informer().AddEventHandler(
		cache.FilteringResourceEventHandler{
			FilterFunc: func(obj interface{}) bool {
				switch t := obj.(type) {
				case *v1.Pod:
                                      // This is the key to determining whether to schedule
					return !assignedPod(t) && responsibleForPod(t, args.SchedulerName)
				case cache.DeletedFinalStateUnknown:
					if pod, ok := t.Obj.(*v1.Pod); ok {
						return !assignedPod(pod) && responsibleForPod(pod, args.SchedulerName)
					}
					runtime.HandleError(fmt.Errorf("unable to convert object %T to *v1.Pod in %T", obj, c))
					return false
				default:
					runtime.HandleError(fmt.Errorf("unable to handle object in %T: %T", c, obj))
					return false
				}
			},
			Handler: cache.ResourceEventHandlerFuncs{
				AddFunc:    c.addPodToSchedulingQueue,
				UpdateFunc: c.updatePodInSchedulingQueue,
				DeleteFunc: c.deletePodFromSchedulingQueue,
			},
		},
	)


Let's take a closer look at the key function that determines whether a pod has already been scheduled:

// Determine whether a node has been assigned based on nodeName
func assignedPod(pod *v1.Pod) bool {
	return len(pod.Spec.NodeName) != 0
}

This is consistent with my guess, which feels good! Read on to see whether, in the normal scheduling path, the chosen node is written back into the pod via nodeName.

// Here is the code that binds the POD to the host selected by the schedule
func (sched *Scheduler) assume(assumed *v1.Pod, host string) error {
  // Set the NodeName of the pod information to the name of the corresponding host
	assumed.Spec.NodeName = host

	if err := sched.config.SchedulerCache.AssumePod(assumed); err != nil {
		klog.Errorf("scheduler cache AssumePod failed: %v", err)
		sched.recordSchedulingFailure(assumed, err, SchedulerError,
			fmt.Sprintf("AssumePod failed: %v", err))
		return err
	}
	if sched.config.SchedulingQueue != nil {
		sched.config.SchedulingQueue.DeleteNominatedPodIfExists(assumed)
	}
	return nil
}

Perfect. When scheduling completes, the chosen node is written into the pod via the nodeName field.
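
For a normally scheduled pod (one that does not set nodeName in its manifest), you can see the result directly on the pod object after scheduling, for example with kubectl get pod -o yaml. Trimmed to the relevant part, it looks roughly like this (the pod name is a placeholder):

# Trimmed pod object after scheduling (illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: demo-xxxxxxxxxx-xxxxx   # generated name, placeholder
spec:
  nodeName: k8s-node-02         # populated by the scheduler's bind step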

The process of pod creation and rejection (eviction)

Similarly, we can guess at the process. kubelet first validates the pod, finds that it does not satisfy the admission requirements, refuses to create it, and syncs that state back to the apiserver. At that point kube-controller-manager sees that the number of active replicas of the corresponding workload is 0 and a new pod is needed, so it creates another pod. kubelet picks up that event and tries to create a pod again.

There are still a couple of puzzles here:

  • How does kubelet handle the pod in this situation?
  • Will kube-controller-manager keep creating pods forever?

Let's first look at kubelet's pod admission logic during creation, which contains a key piece of code:

      // During the creation process, check whether the pod is allowed to be created
      if ok, reason, message := kl.canAdmitPod(activePods, pod); !ok {
        kl.rejectPod(pod, reason, message)
        continue
      }
    // The check function: the pod is validated against each admit handler via lifecycle.PodAdmitAttributes
    func (kl *Kubelet) canAdmitPod(pods []*v1.Pod, pod *v1.Pod) (bool, string, string) {
      // the kubelet will invoke each pod admit handler in sequence
      // if any handler rejects, the pod is rejected.
      // TODO: move out of disk check into a pod admitter
      // TODO: out of resource eviction should have a pod admitter call-out
      attrs := &lifecycle.PodAdmitAttributes{Pod: pod, OtherPods: pods}
      for _, podAdmitHandler := range kl.admitHandlers {
        if result := podAdmitHandler.Admit(attrs); !result.Admit {
          return false, result.Reason, result.Message
        }
      }

      return true, "", ""
    }
    // setup eviction manager
    // Add the eviction admit handler
    klet.admitHandlers.AddPodAdmitHandler(evictionAdmitHandler)
    // Add the runtime-related handler
    klet.admitHandlers.AddPodAdmitHandler(runtimeSupport)
    // Add the sysctls whitelist handler
    klet.admitHandlers.AddPodAdmitHandler(sysctlsWhitelist)
    // Add the predicate admit handler
    klet.admitHandlers.AddPodAdmitHandler(lifecycle.NewPredicateAdmitHandler(...))
    // Add the AppArmor admit handler
    klet.softAdmitHandlers.AddPodAdmitHandler(lifecycle.NewAppArmorAdmitHandler(...))
    // Add the NoNewPrivs admit handler
    klet.softAdmitHandlers.AddPodAdmitHandler(lifecycle.NewNoNewPrivsAdmitHandler(...))

I went through these admit handlers one by one and focused on the one created by NewPredicateAdmitHandler. This handler essentially re-runs the scheduler's predicate checks against the node. Let's look at it:

    // Check the Admit method of this Handler
    func (w *predicateAdmitHandler) Admit(attrs *PodAdmitAttributes) PodAdmitResult {
      ...
      fit, reasons, err := predicates.GeneralPredicates(podWithoutMissingExtendedResources, nil, nodeInfo)
      ...
    
  // GeneralPredicates runs two groups of predicate checks
  func GeneralPredicates(pod *v1.Pod, meta PredicateMetadata, nodeInfo *schedulernodeinfo.NodeInfo) (bool, []PredicateFailureReason, error) {
    var predicateFails []PredicateFailureReason
    // non-critical predicates
    fit, reasons, err := noncriticalPredicates(pod, meta, nodeInfo)
    if err != nil {
      return false, predicateFails, err
    }
    if !fit {
      predicateFails = append(predicateFails, reasons...)
    }
    // essential predicates
    fit, reasons, err = EssentialPredicates(pod, meta, nodeInfo)
    
  // essential predicates
  func EssentialPredicates(pod *v1.Pod, meta PredicateMetadata, nodeInfo *schedulernodeinfo.NodeInfo) (bool, []PredicateFailureReason, error) {
    ...
    // The key check: whether the pod's requested host ports are available on the node
    fit, reasons, err = PodFitsHostPorts(pod, meta, nodeInfo)
    ...
  }

Following the chain all the way down, the pod is refused admission because of the host port conflict.

Now let's look at kube-controller-manager's handling; you can refer to this article: www.bookstack.cn/read/source…

The key point is that the ReplicaSet controller counts the active pods, compares that number with the desired replica count, and reconciles when they do not match by issuing requests to create new pods. On the specified host, however, the pod can never reach the active state, so the controller keeps creating pods.

Reason summary

The whole flow can be summarized in the following steps:

  1. kubectl sends the update request.
  2. kube-controller-manager watches the change and creates a new pod.
  3. kube-scheduler sees the pod, but because nodeName is already set, none of the scheduling algorithm logic runs.
  4. kubelet sees that a new pod is assigned to its node and tries to create it; because hostNetwork is true, the host port conflicts and the creation fails.
  5. kube-controller-manager then observes that the pod has not reached the desired state and goes back to step 2, over and over.

How can this be prevented?

Given the trigger conditions, preventing any one of them solves the problem:

  1. Change the deployment strategy type to Recreate instead of RollingUpdate.
  2. Set hostNetwork to false and expose the service through a NodePort Service instead.
  3. Use nodeSelector or nodeAffinity instead of nodeName to place the pod (see the sketch after this list).

In practice, however, for this kind of business scenario I recommend using nodeSelector or nodeAffinity.
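
Here is a minimal sketch of options 2 and 3 combined, assuming the target node keeps its default kubernetes.io/hostname label with the value k8s-node-02; only the changed parts of the pod template from the Deployment above are shown:

spec:
  template:
    spec:
      # instead of hostNetwork: true, expose port 80 through a NodePort Service
      hostNetwork: false
      # instead of nodeName, let the scheduler pick the node via a label match,
      # so the predicate checks (including the host-port check) still run
      nodeSelector:
        kubernetes.io/hostname: k8s-node-02

An equivalent nodeAffinity rule on the same label achieves the same placement and is more flexible if the pod is allowed to run on a set of nodes.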

Conclusion

This article is bound to have places that are not rigorous, so please bear with me; take the essence (if there is any) and discard the rest. If you are interested, you can follow my WeChat public account: Gungunxi. My WeChat ID is lCOMedy2021.