
Recap of the previous article

In the previous article, we described in detail the first half of pod creation, from the local kubectl command to the pod being persisted into etcd. That part of the process is already quite involved, and pods are created not only through kubectl run but also through other commands such as create, as well as through higher-level objects such as the various controllers. Underneath, however, they all share the same pod creation path, so understanding this process is very helpful for understanding how Kubernetes works as a whole.

Let’s move on to the second half of pod creation: how the pod is scheduled and actually created on a node.

scheduler

The core job of the scheduler is to assign a runnable node to each pod waiting to be scheduled and to complete the binding of the pod to that node. Scheduling is a central link in the chain. Next we will look at the overall scheduling flow, and then go a little deeper into the implementation of the default scheduler; a complete introduction needs a separate article, so treat this as a preview.

Scheduling Process Overview

The scheduler is a stand-alone module in the control plane, but it operates exactly like any other controller: it listens for events and then tries to reconcile state. Specifically, the scheduler picks out all pods whose PodSpec has an empty NodeName field and tries to find a node for each of them to run on.

To find a suitable node, the scheduler uses a dedicated scheduling algorithm that works in two steps:

  1. When the scheduler starts, a set of default predicates is registered. Predicates are efficient functions that are executed to determine whether a node is able to host a pod.
  2. Once the suitable nodes have been filtered out, a series of priority functions is run against them to score the candidates and rank their suitability. For example, to spread the workload as evenly as possible across the cluster, the scheduler prefers nodes with fewer resources currently allocated. As these functions run, each node is assigned a score, and the scheduler finally selects the node with the highest score (a toy sketch of the two phases follows after this list).
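
To make the two phases concrete, here is a toy sketch, not the real scheduler code: a predicate that filters nodes by free memory and a priority function that favors less-loaded nodes. The Node type and its field names are invented for illustration.

package main

import "fmt"

// Node is a simplified stand-in for a cluster node (illustrative only).
type Node struct {
	Name      string
	FreeMemMi int64
}

// Predicate phase: can this node host a pod requesting reqMemMi of memory?
func fitsMemory(n Node, reqMemMi int64) bool {
	return n.FreeMemMi >= reqMemMi
}

// Priority phase: score 0-100, preferring nodes with more free memory
// so that the workload spreads out across the cluster.
func leastAllocatedScore(n Node, totalMemMi int64) int64 {
	return n.FreeMemMi * 100 / totalMemMi
}

func main() {
	nodes := []Node{{"node-a", 512}, {"node-b", 4096}}
	var reqMemMi int64 = 1024

	bestScore, best := int64(-1), ""
	for _, n := range nodes {
		if !fitsMemory(n, reqMemMi) { // predicate: filter out unsuitable nodes
			continue
		}
		if s := leastAllocatedScore(n, 8192); s > bestScore { // priority: rank the rest
			bestScore, best = s, n.Name
		}
	}
	fmt.Println("selected node:", best) // node-b
}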

Once the scheduler has dispatched a pod to a node, the kubelet on that node takes over the actual creation work.

Introduction to scheduler implementation

The core process

The scheduler's entire initialization process is omitted here; we focus on the core execution flow of the generic scheduler.

// pkg/scheduler/scheduler.go#L311
// Run begins watching and scheduling. It starts scheduling and blocked until the context is done.
func (sched *Scheduler) Run(ctx context.Context) {
	sched.SchedulingQueue.Run()
	wait.UntilWithContext(ctx, sched.scheduleOne, 0)
	sched.SchedulingQueue.Close()
}

When the scheduler starts running, it first runs the scheduling queue in a goroutine. Every pod that needs to be scheduled must first be put into this queue, which by default is implemented as a priority queue.
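
To illustrate the idea of a priority queue over pending pods (this is not the actual SchedulingQueue implementation), a minimal version can be built on Go's container/heap; the PodEntry type here is invented.

package main

import (
	"container/heap"
	"fmt"
)

// PodEntry is an invented stand-in for a queued pod: higher Priority pops first.
type PodEntry struct {
	Name     string
	Priority int32
}

type podQueue []PodEntry

func (q podQueue) Len() int            { return len(q) }
func (q podQueue) Less(i, j int) bool  { return q[i].Priority > q[j].Priority }
func (q podQueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *podQueue) Push(x interface{}) { *q = append(*q, x.(PodEntry)) }
func (q *podQueue) Pop() interface{} {
	old := *q
	n := len(old)
	item := old[n-1]
	*q = old[:n-1]
	return item
}

func main() {
	q := &podQueue{}
	heap.Init(q)
	heap.Push(q, PodEntry{Name: "low-priority-pod", Priority: 0})
	heap.Push(q, PodEntry{Name: "high-priority-pod", Priority: 1000})
	next := heap.Pop(q).(PodEntry)
	fmt.Println("next pod to schedule:", next.Name) // high-priority-pod
}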

The core scheduling entry point is scheduleOne, which schedules a single pod. UntilWithContext can be understood as an infinite loop: scheduling keeps going until the context tells it to exit. As the function comment notes, pods are scheduled one at a time, sequentially.

// pkg/scheduler/scheduler.go#L429
// scheduleOne does the entire scheduling workflow for a single pod. It is serialized on the scheduling algorithm's host fitting.
func (sched *Scheduler) scheduleOne(ctx context.Context) {
	// Retrieve the next pod to be scheduled
	podInfo := sched.NextPod()
	// A series of checks; if any fails, return immediately

	// Schedule according to the algorithm
	scheduleResult, err := sched.Algorithm.Schedule(schedulingCycleCtx, fwk, state, pod)
	if err != nil {
		// Handle the error, distinguishing "unschedulable" from a scheduling failure
	}

	// Set the pod's NodeName in the scheduler cache so scheduling can continue
	// without waiting for the binding to succeed.
	// Tell the cache to assume that a pod now is running on a given node, even though it hasn't been bound yet.
	// This allows us to keep scheduling without waiting on binding to occur.
	assumedPodInfo := podInfo.DeepCopy()
	assumedPod := assumedPodInfo.Pod
	// assume modifies `assumedPod` by setting NodeName=scheduleResult.SuggestedHost
	err = sched.assume(assumedPod, scheduleResult.SuggestedHost)

	// Run the "reserve" plugins (omitted here), then the "permit" plugins
	runPermitStatus := fwk.RunPermitPlugins(schedulingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
	// Perform node binding asynchronously
	go func() {
		// Wait until the permit status is successful
		waitOnPermitStatus := fwk.WaitOnPermit(bindingCycleCtx, assumedPod)

		// Run the "prebind" plugins.
		preBindStatus := fwk.RunPreBindPlugins(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)

		// Perform the binding
		err := sched.bind(bindingCycleCtx, fwk, assumedPod, scheduleResult.SuggestedHost, state)
		// Run the "postbind" plugins
		fwk.RunPostBindPlugins(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
	}()
}

Scheduling algorithm

Every scheduling algorithm must implement this interface, and Kubernetes provides a default general-purpose implementation (genericScheduler).

// pkg/scheduler/core/generic_scheduler.go#L95
// ScheduleAlgorithm is an interface implemented by things that know how to schedule pods
// onto machines.
// TODO: Rename this type.
type ScheduleAlgorithm interface {
	Schedule(context.Context, framework.Framework, *framework.CycleState, *v1.Pod) (scheduleResult ScheduleResult, err error)
	// Extenders returns a slice of extender config. This is exposed for
	// testing.
	Extenders() []framework.Extender
}

The general scheduling algorithm is implemented as follows:

// pkg/scheduler/core/generic_scheduler.go#L131
// Schedule tries to schedule the given pod to one of the nodes in the node list.
// If it succeeds, it will return the name of the node.
// If it fails, it will return a FitError error with reasons.
func (g *genericScheduler) Schedule(ctx context.Context, fwk framework.Framework, state *framework.CycleState, pod *v1.Pod) (result ScheduleResult, err error) {
  // Check whether there are currently any schedulable nodes
  if g.nodeInfoSnapshot.NumNodes() == 0 {
    // return ErrNoNodesAvailable
  }
  
  // Find all the appropriate nodes
  feasibleNodes, filteredNodesStatuses, err := g.findNodesThatFitPod(ctx, fwk, state, pod)
  
  // Rank the candidate nodes
  priorityList, err := g.prioritizeNodes(ctx, fwk, state, pod, feasibleNodes)
  // Select an appropriate node
  host, err := g.selectHost(priorityList)
  // returns the scheduling result
  return ScheduleResult{
		SuggestedHost:  host,
		EvaluatedNodes: len(feasibleNodes) + len(filteredNodesStatuses),
		FeasibleNodes:  len(feasibleNodes),
  }, err
}
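
As a rough illustration of what selectHost does (a simplified sketch, not the actual code), it walks the scored list, keeps the highest-scoring node, and breaks ties randomly so that one node is not always favored; the NodeScore struct here is invented to mirror the idea of a scored candidate.

package main

import (
	"fmt"
	"math/rand"
)

// NodeScore is an invented stand-in for a scored candidate node.
type NodeScore struct {
	Name  string
	Score int64
}

// selectHost keeps the highest-scoring node and breaks ties randomly
// so that a single node is not always preferred.
func selectHost(list []NodeScore) (string, error) {
	if len(list) == 0 {
		return "", fmt.Errorf("empty priority list")
	}
	best := list[0]
	cntOfMaxScore := 1
	for _, ns := range list[1:] {
		if ns.Score > best.Score {
			best = ns
			cntOfMaxScore = 1
		} else if ns.Score == best.Score {
			cntOfMaxScore++
			// reservoir sampling: every node tied for the max score
			// ends up selected with equal probability
			if rand.Intn(cntOfMaxScore) == 0 {
				best = ns
			}
		}
	}
	return best.Name, nil
}

func main() {
	host, _ := selectHost([]NodeScore{{"node-a", 80}, {"node-b", 95}, {"node-c", 95}})
	fmt.Println("suggested host:", host) // node-b or node-c
}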

In addition, the predicate and priority functions are extensible and can be customized with the --policy-config-file flag, which adds a degree of flexibility. Administrators can also run a custom scheduler (essentially a controller with special logic) as a separate Deployment. If schedulerName is set in the PodSpec, Kubernetes hands the pod to whichever scheduler matches that name.
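
For example, to have a pod handled by a custom scheduler called "my-scheduler" (the name is illustrative), you set spec.schedulerName; expressed with the Go client types it looks roughly like this:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "nginx", Namespace: "default"},
		Spec: corev1.PodSpec{
			// Pods with this field set are ignored by the default scheduler
			// and picked up by whichever scheduler registered with this name.
			SchedulerName: "my-scheduler", // illustrative custom scheduler name
			Containers: []corev1.Container{
				{Name: "nginx", Image: "nginx:1.21"},
			},
		},
	}
	fmt.Println("pod will be scheduled by:", pod.Spec.SchedulerName)
}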

Node Binding Process

When a suitable node is found, the scheduler creates a Binding object whose Name and UID match the pod's, and whose ObjectReference field holds the name of the selected node. This object is sent to the apiserver via a POST request.

When the apiserver receives the Binding object, it deserializes it and updates the following fields on the corresponding pod object: it sets NodeName to the node named in the ObjectReference, adds the relevant annotations, and sets the PodScheduled status condition to True.
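
For reference, this is roughly how a scheduler posts the binding with a recent client-go (a minimal sketch under that assumption; client construction and error handling are omitted):

package scheduler

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// bindPodToNode posts a Binding object for the pod; the apiserver then sets
// pod.Spec.NodeName and flips the PodScheduled condition to True.
func bindPodToNode(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod, nodeName string) error {
	binding := &corev1.Binding{
		ObjectMeta: metav1.ObjectMeta{
			Name:      pod.Name,
			Namespace: pod.Namespace,
			UID:       pod.UID,
		},
		Target: corev1.ObjectReference{
			Kind: "Node",
			Name: nodeName,
		},
	}
	return client.CoreV1().Pods(pod.Namespace).Bind(ctx, binding, metav1.CreateOptions{})
}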

kubernetes.io/docs/concep…

kubelet

Kubelet is an agent that runs on every node of the Kubernetes cluster. Each node starts a kubelet process, which handles the tasks the control plane sends down to that node and manages the pod lifecycle, among other things. Kubelet implements the translation between the abstract Kubernetes concept of a Pod and its concrete building blocks, the containers. It also takes care of everything related to mounting volumes, container logs, garbage collection, and many other important things.

Refer to the official documentation for a more complete description:

kubernetes.io/docs/refere…

How it works

In general, kubelet starts a SyncLoop, and all of its work revolves around this infinite loop. Here we will focus on the pod creation flow inside kubelet.

The creation process

Overview

When a pod has been bound to a node, the kubelet's handler on that node is triggered. Kubelet listens for pod changes and handles them according to the event type: ADD, UPDATE, REMOVE, DELETE, and so on.

All new pods are sorted by creation time to ensure that the pod created first is processed first. The pods are then added to the podManager one by one. The podManager submodule is responsible for managing the pod information on this node, the mapping between pods and mirror pods, and so on. Every pod being managed must appear in the podManager; if it does not, the pod is considered deleted.

If the pod is a mirror pod, it is handled as a mirror pod and the subsequent steps are skipped.

Kubelet then verifies that the pod can run on this node. If it cannot, the pod is rejected outright; it will then stay in a non-ready state permanently and will not recover by itself, requiring manual intervention.

The work of creating the pod is handed to the podWorkers submodule for asynchronous processing via dispatchWork.

Finally, the pod is added to the probeManager; if readiness and liveness probes are defined for the pod, goroutines are started to run the periodic checks.
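
A highly simplified, purely illustrative sketch of this flow follows; the Pod type and the podManager, podWorkers, and probeManager structs here are toy stand-ins, not kubelet's real ones.

package main

import (
	"fmt"
	"sort"
	"time"
)

// Toy stand-ins for kubelet's submodules, for illustration only.
type Pod struct {
	Name      string
	CreatedAt time.Time
}

type podManager struct{ pods map[string]Pod }

func (m *podManager) AddPod(p Pod) { m.pods[p.Name] = p }

type podWorkers struct{}

func (w *podWorkers) DispatchWork(p Pod) { fmt.Println("dispatching async creation of", p.Name) }

type probeManager struct{}

func (pm *probeManager) AddPod(p Pod) { fmt.Println("starting probes for", p.Name) }

// handlePodAdditions mirrors the flow described above: sort by creation time,
// register with the pod manager, dispatch creation work asynchronously,
// and register liveness/readiness probes.
func handlePodAdditions(pods []Pod, pm *podManager, workers *podWorkers, probes *probeManager) {
	sort.Slice(pods, func(i, j int) bool { return pods[i].CreatedAt.Before(pods[j].CreatedAt) })
	for _, p := range pods {
		pm.AddPod(p)            // track the pod on this node
		workers.DispatchWork(p) // hand creation to podWorkers
		probes.AddPod(p)        // periodic health checks, if probes are defined
	}
}

func main() {
	pm := &podManager{pods: map[string]Pod{}}
	pods := []Pod{
		{Name: "pod-b", CreatedAt: time.Now()},
		{Name: "pod-a", CreatedAt: time.Now().Add(-time.Minute)},
	}
	handlePodAdditions(pods, pm, &podWorkers{}, &probeManager{})
}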

Preparatory work

Before podWorkers actually creates the containers there is still a lot of work to do in syncPod (note the lowercase s, as opposed to the container runtime's SyncPod below), so let's take a look.

In this method, the following things are done:

  • If the pod is being removed, do so immediately and return
  • Synchronize the podStatus to kubelet.statusManager
  • Check whether the pod can run on this node, mainly permission checks (whether it may use the host network mode, whether it may run in privileged mode, and so on). If it lacks permission, delete the old local pod and return an error
  • Create a containerManager object, create the pod-level cgroup, and update the QoS-level cgroup
  • If it is a static pod, create or update the corresponding mirrorPod
  • Create the pod data directories that store volume and plugin information. If a PV is defined, wait for all volume mounts to complete (the volumeManager does this in the background), and fetch the corresponding secrets from the apiserver
  • Then call the kubelet.volumeManager component and wait for it to prepare all the volumes the pod requires
  • Call the container runtime's SyncPod method to perform the actual container creation logic

Everything up to this point is independent of any specific container; you can see that this method is the preparation that has to happen before the pod's entities (the containers) are created.

// pkg/kubelet/kubelet.go#L1455
// syncPod is the transaction script for the sync of a single pod.
//
// This operation writes all events that are dispatched in order to provide
// the most accurate information possible about an error situation to aid debugging.
// Callers should not throw an event if this operation returns an error.
func (kl *Kubelet) syncPod(o syncPodOptions) error {
  // Call the container runtime's SyncPod callback
  result := kl.containerRuntime.SyncPod(pod, podStatus, pullSecrets, kl.backOff)
}

Create a container

The SyncPod function of the containerRuntime submodule does the real work of creating the containers inside the pod.

SyncPod performs the following operations:

  • 1. Compute whether the sandbox and containers have changed
  • 2. Create the sandbox if necessary
  • 3. Create the init containers
  • 4. Create the normal (business) containers

This piece of code deserves praise for its thorough comments!

// pkg/kubelet/kuberuntime/kuberuntime_manager.go#L675
// SyncPod syncs the running pod into the desired pod by executing following steps:
//
// 1. Compute sandbox and container changes.
// 2. Kill pod sandbox if necessary.
// 3. Kill any containers that should not be running.
// 4. Create sandbox if necessary.
// 5. Create ephemeral containers.
// 6. Create init containers.
// 7. Create normal containers.
func (m *kubeGenericRuntimeManager) SyncPod(pod *v1.Pod, podStatus *kubecontainer.PodStatus, pullSecrets []v1.Secret, backOff *flowcontrol.Backoff) (result kubecontainer.PodSyncResult) {
  
  start := func(typeName string, spec *startSpec) error {
    if msg, err := m.startContainer(podSandboxID, podSandboxConfig, spec, pod, podStatus, pullSecrets, podIP, podIPs); err != nil {
      // record msg as the failure reason and return the error
    }
    return nil
  }
  // Step 6: start the init container.
  if err := start("init container", containerStartSpec(container)); err != nil {
    // wait for the next sync and retry
  }
  // Step 7: start containers in podContainerChanges.ContainersToStart.
  for _, idx := range podContainerChanges.ContainersToStart {
    start("container", containerStartSpec(&pod.Spec.Containers[idx]))
  }
}

Start the container

startContainer finally starts the container.

The main steps are as follows:

  • 1. Pull the image
  • 2. Generate the container configuration
  • 3. Call the runtime service API to create the container. Note that dockershim was deprecated in v1.20 and will be removed entirely in v1.23; before then we need to switch to a supported container runtime
  • 4. Start the container
  • 5. Execute the post-start hook

// pkg/kubelet/kuberuntime/kuberuntime_container.go#L134
// startContainer starts a container and returns a message indicates why it is failed on error.
// It starts the container through the following steps:
// * pull the image
// * create the container
// * start the container
// * run the post start lifecycle hooks (if applicable)
func (m *kubeGenericRuntimeManager) startContainer(podSandboxID string, podSandboxConfig *runtimeapi.PodSandboxConfig, spec *startSpec, pod *v1.Pod, podStatus *kubecontainer.PodStatus, pullSecrets []v1.Secret, podIP string, podIPs []string) (string, error) {}
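
A purely illustrative outline of these five steps follows; the helper functions are invented stand-ins, not the real kubelet or CRI calls.

package main

import "fmt"

// Invented stand-ins for the real kubelet/CRI calls, for illustration only.
func pullImage(ref string) error                      { fmt.Println("pulling", ref); return nil }
func generateContainerConfig(name, ref string) string { return name + "@" + ref }
func createContainer(cfg string) (string, error)      { return "container-id-123", nil }
func startContainer(id string) error                  { fmt.Println("starting", id); return nil }
func runPostStartHook(id string) error                { fmt.Println("postStart hook for", id); return nil }

// startContainerSketch walks the five steps described above in order.
func startContainerSketch(imageRef, containerName string) error {
	if err := pullImage(imageRef); err != nil { // 1. pull the image
		return err
	}
	cfg := generateContainerConfig(containerName, imageRef) // 2. generate the container config
	id, err := createContainer(cfg)                         // 3. create the container via the runtime (CRI)
	if err != nil {
		return err
	}
	if err := startContainer(id); err != nil { // 4. start the container
		return err
	}
	return runPostStartHook(id) // 5. run the post-start hook
}

func main() {
	_ = startContainerSketch("nginx:1.21", "web")
}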

summary

From the pod creation process we can see that kubelet undertakes an enormous amount of basic management and operational work. Digging deeper into kubelet, we will also find that its overall architecture reflects this complexity; pod creation is only a small part of its job. Finally, attached is a diagram of kubelet's overall module architecture.

Reference documentation

books

Kubernetes Source Code Analysis, by Zheng Dongxu

article

The background of Kubernetes deprecating Docker (dockershim)

Kubelet's pod creation workflow

The kubectl run pod creation flow (v1.14)