The “Kubelet From Start to Quit” series takes an in-depth look at Kubelet components, from the basics down to the source code. In the last article zouyee walked through CPU management; in this article we take a closer look at topology management. Topology management was promoted to Beta in Kubernetes 1.18. The TopologyManager feature enables NUMA alignment of CPUs, memory, and peripheral devices (such as SR-IOV VFs and GPUs), so that clusters can meet low-latency requirements.
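For context before diving into the source, topology alignment only takes effect when the relevant managers are enabled on the kubelet. A typical KubeletConfiguration might look like the following sketch (field names as used in KubeletConfiguration; the scope field only exists in newer releases, and the reservation values are purely illustrative):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  TopologyManager: true            # Beta and on by default since 1.18
topologyManagerPolicy: best-effort # none | best-effort | restricted | single-numa-node
topologyManagerScope: container    # container | pod (field available in newer releases)
cpuManagerPolicy: static           # the static CPU manager is needed for exclusive CPU alignment
kubeReserved:                      # illustrative reservation values
  cpu: 500m
  memory: 500Mi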

3. Source code analysis

For the topology manager code analysis, we approach it from two aspects:

1) The topology-management-related operations performed when Kubelet is initialized

2) The topology-management-related operations performed at Kubelet runtime, with an in-depth analysis of the topology manager's structure and logic

3.1 Kubelet initialization

For Kubelet initialization, we use the startup diagram of the CPU Manager together with the Topology Manager (currently the CPU Manager, Memory Manager, and Device Manager constitute the resource allocation managers, which form a subsystem of the Container Manager module).

For the above figure, Zouyee summarized the process as follows:

  1. In the command-line startup path, Kubelet calls NewContainerManager to build the ContainerManager.
  2. The NewContainerManager function calls topologymanager.NewManager to build the topology manager; if topology management is not enabled, a fake topology manager is built instead.
  3. NewContainerManager then calls the NewManager constructors provided by the CPU, memory, and device managers respectively to build those managers.
  4. If topology management is enabled, the topology manager registers the CPU, memory, and device managers through its AddHintProvider method; all three resource allocators must implement the HintProvider interface.
  5. When the Kubelet struct is built in NewMainKubelet, the CPU and memory managers (not the device manager) are wrapped together with the topology manager into the InternalContainerLifecycle interface, which implements resource management over the Pod lifecycle; its PreStartContainer and PostStopContainer methods handle resource allocation and reclamation (see the implementation for details).
  6. While building the Kubelet struct, AddPodAdmitHandler is called to add the handler returned by GetAllocateResourcesPodAdmitHandler to the Pod admission plugins, so that resource allocation is checked when a Pod is created. Depending on whether topology management is enabled, GetAllocateResourcesPodAdmitHandler returns either the topology manager's Admit interface or the Admit interfaces of the CPU, memory, and device resource allocators.
  7. The Kubelet startup flow calls ContainerManager's Start method, which in turn calls the Start methods of the CPU, memory, and device managers; these do some processing work and spawn a goroutine that executes reconcileState().

Note: for an explanation of the code in the above startup process, refer back to the CPU management article.

3.2 Kubelet runtime

At Kubelet runtime, topology management and resource allocation are involved in the Pod handling flow, which Zouyee summarizes as follows:

  1. PodConfig receives Pods from the apiserver, files, and HTTP; calling its Updates() method returns a channel carrying the Pod list and the update type.
  2. In the Run method, Kubelet calls syncLoop, and inside syncLoop it calls syncLoopIteration.
  3. In syncLoopIteration, when the Pod update received on configCh (the channel returned by PodConfig's Updates()) is of type ADD, handler.HandlePodAdditions is executed. In HandlePodAdditions, Kubelet iterates over admitHandlers and calls their Admit methods for every Pod that is not in a terminated state.

Note: besides configCh, syncLoopIteration also handles other channels (plegCh, syncCh, housekeepingCh and livenessManager). The plegCh, syncCh and livenessManager channels lead to HandlePodAdditions, HandlePodReconcile, HandlePodSyncs and HandlePodUpdates, all of which involve the dispatchWork method. Recall that during Kubelet startup, the CPU and memory managers are wrapped together with the topology manager into the InternalContainerLifecycle interface, which implements resource management over the Pod lifecycle; for CPU and memory the relevant method is PreStartContainer, which calls AddContainer (covered later).

  4. As described in the Kubelet startup section, AddPodAdmitHandler adds GetAllocateResourcesPodAdmitHandler to admitHandlers, so when the Admit methods are called, topology management is involved through GetAllocateResourcesPodAdmitHandler.
  5. The processing logic of Kubelet's GetAllocateResourcesPodAdmitHandler is as follows: if the topology feature is enabled, the topology manager takes over resource allocation; if it is not enabled, the CPU manager, memory manager, and device manager each manage resource allocation on their own. This article only describes the case where the topology manager is enabled.
  6. With the topology manager enabled, the Admit interface returned by Kubelet's GetAllocateResourcesPodAdmitHandler is the topology manager's implementation, which is introduced below.

The above is the general Pod handling flow. Topology manager initialization, AddContainer, and Admit are introduced below.

1) Topology manager initialization

The topology manager initialization function is NewManager, at pkg/kubelet/cm/topologymanager/topology_manager.go:119:

// NewManager creates a new TopologyManager based on provided policy and scope
func NewManager(topology []cadvisorapi.Node, topologyPolicyName string, topologyScopeName string) (Manager, error) {
	// a. Build the NUMA node list from the cAdvisor data
	var numaNodes []int
	for _, node := range topology {
		numaNodes = append(numaNodes, node.Id)
	}

	// b. When the policy is not "none", return an error if the number of NUMA nodes exceeds 8
	if topologyPolicyName != PolicyNone && len(numaNodes) > maxAllowableNUMANodes {
		return nil, fmt.Errorf("unsupported on machines with more than %v NUMA Nodes", maxAllowableNUMANodes)
	}

	// c. Initialize the policy according to the policy name passed in
	var policy Policy
	switch topologyPolicyName {
	case PolicyNone:
		policy = NewNonePolicy()
	case PolicyBestEffort:
		policy = NewBestEffortPolicy(numaNodes)
	case PolicyRestricted:
		policy = NewRestrictedPolicy(numaNodes)
	case PolicySingleNumaNode:
		policy = NewSingleNumaNodePolicy(numaNodes)
	default:
		return nil, fmt.Errorf("unknown policy: \"%s\"", topologyPolicyName)
	}

	// d. Initialize the scope according to the scope name passed in
	var scope Scope
	switch topologyScopeName {
	case containerTopologyScope:
		scope = NewContainerScope(policy)
	case podTopologyScope:
		scope = NewPodScope(policy)
	default:
		return nil, fmt.Errorf("unknown scope: \"%s\"", topologyScopeName)
	}

	// e. Build the manager with the scope
	manager := &manager{
		scope: scope,
	}

	return manager, nil
}

To summarize NewManager:

a. Initialize the NUMA node information from the cAdvisor data
b. When the policy is not none, check whether the number of NUMA nodes exceeds 8; if so, return an error
c. Initialize the policy according to the policy name passed in
d. Initialize the scope according to the scope name passed in
e. Build the manager structure holding the scope

2) AddContainer

AddContainer actually calls the scope's method, at pkg/kubelet/cm/topologymanager/scope.go:97:

func (s *scope) AddContainer(pod *v1.Pod, containerID string) error {
	s.mutex.Lock()
	defer s.mutex.Unlock()
	s.podMap[containerID] = string(pod.UID)
	return nil
}

This is a simple map insertion.

3) Admit

The Admit function is called at pkg/kubelet/cm/topologymanager/topology_manager.go:186 and dispatches to the corresponding scope implementation:

a. container scope

pkg/kubelet/cm/topologymanager/scope_container.go:45

func (s *containerScope) Admit(pod *v1.Pod) lifecycle.PodAdmitResult {
	// 1. If the policy is none, skip topology-aware admission
	if s.policy.Name() == PolicyNone {
		return s.admitPolicyNone(pod)
	}

	// 2. Handle init containers and regular containers
	for _, container := range append(pod.Spec.InitContainers, pod.Spec.Containers...) {
		// 2.1 Calculate the affinity of the container
		bestHint, admit := s.calculateAffinity(pod, &container)
		if !admit {
			return topologyAffinityError()
		}

		// 2.2 Record the allocation result
		s.setTopologyHints(string(pod.UID), container.Name, bestHint)

		// 2.3 Call the providers to allocate resources
		err := s.allocateAlignedResources(pod, &container)
		if err != nil {
			return unexpectedAdmissionError(err)
		}
	}
	return admitPod()
}

b. pod scope

pkg/kubelet/cm/topologymanager/scope_pod.go:45

func (s *podScope) Admit(pod *v1.Pod) lifecycle.PodAdmitResult {
	// 1. If the policy is none, skip topology-aware admission
	if s.policy.Name() == PolicyNone {
		return s.admitPolicyNone(pod)
	}

	// 2. Calculate the affinity of the whole pod
	bestHint, admit := s.calculateAffinity(pod)
	if !admit {
		return topologyAffinityError()
	}

	// 3. Handle init containers and regular containers
	for _, container := range append(pod.Spec.InitContainers, pod.Spec.Containers...) {
		// 3.1 Record the allocation result
		s.setTopologyHints(string(pod.UID), container.Name, bestHint)

		// 3.2 Call the providers to allocate resources
		err := s.allocateAlignedResources(pod, &container)
		if err != nil {
			return unexpectedAdmissionError(err)
		}
	}
	return admitPod()
}

For details, see the code comments. Note that the container and pod scopes differ in how they calculate affinity and in the granularity at which admission is decided, which is exactly what the scope concept reflects; the calculateAffinity method is introduced later.

Zouyee summarizes the Admit logic of the topology manager as follows. The topology manager defines the HintProvider interface through which components send and receive topology information; the CPU, memory, and device managers implement this interface and are registered with AddHintProvider. The topology information is expressed as a bitmask of the NUMA nodes available for an allocation, together with a flag marking whether that allocation is preferred. The topology manager policy performs a set of operations on the provided hints and derives the optimal hint according to the policy; if the stored hint does not match expectations, its Preferred field is set to false. The selected hint is used to decide whether the node accepts or rejects the Pod, and the hint result is then stored in the topology manager for HintProviders to consult when they make resource allocation decisions.

The general calculateAffinity flow for the two scopes (container and pod), ignoring the differences in how affinity is calculated, is summarized as follows:

For the above figure, Zouyee summarized the process as follows:

  1. Kubelet handles admission at the granularity of the configured scope: per container for the container scope, or for the whole Pod for the pod scope.
  2. For each container, get TopologyHints from a set of HintProviders for each topology-aware resource type requested by the container (for example, gpu-vendor.com/gpu, nic-vendor.com/nic, CPU, etc.).
  3. Using the selected policy, merge the collected TopologyHints to find the best hint that aligns resource allocations across all resource types.
  4. Loop back over the set of HintProviders, instructing them to allocate the resources they manage using the merged hint.
  5. If any of the above steps fails, or the alignment requirement cannot be met according to the selected policy, Kubelet will refuse to admit the Pod.

Below, Zouyee describes the structures involved in the topology manager according to the following figure.

a. TopologyHints

A TopologyHint encodes a set of constraints recording how a given resource request can be satisfied. Currently, the only constraint considered is NUMA alignment. It is defined as follows:

type TopologyHint struct {
    NUMANodeAffinity bitmask.BitMask
    Preferred bool
}

The NUMANodeAffinity field is a bitmask of the NUMA nodes that can satisfy the resource request. For example, on a system with 2 NUMA nodes, the possible masks include:

{00}, {01}, {10}, {11}
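These masks are built and manipulated with the bitmask helper package used by the topology manager code quoted later in this article. A minimal, self-contained sketch (assuming the import path below, which follows the Kubernetes source tree referenced throughout this article):

package main

import (
	"fmt"

	"k8s.io/kubernetes/pkg/kubelet/cm/topologymanager/bitmask"
)

func main() {
	// Masks on a 2-NUMA-node machine: {01} = node 0, {10} = node 1, {11} = both nodes.
	node0, _ := bitmask.NewBitMask(0)
	node1, _ := bitmask.NewBitMask(1)
	both, _ := bitmask.NewBitMask(0, 1)

	// Count reports how many NUMA nodes a mask selects;
	// fewer bits set means a "narrower" (more tightly aligned) hint.
	fmt.Println(node0.Count(), node1.Count(), both.Count()) // 1 1 2

	// And merges masks bitwise, which is how the policies merge hints.
	merged := bitmask.And(node0, both)
	fmt.Println(merged.IsEqual(node0)) // true
}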

Preferred is a boolean that governs whether the NUMANodeAffinity is considered preferred: if true, the affinity is a preferred one; if false, it is not. With the best-effort policy, preferred hints take precedence over non-preferred hints when the best hint is generated. With the restricted and single-numa-node policies, non-preferred hints are rejected.

HintProvider generates a TopologyHint for every NUMA node mask that can satisfy the resource request; masks that cannot satisfy it are omitted. For example, when asked to allocate two resources, a HintProvider might provide the following hints on a system with two NUMA nodes. These hints encode that both resources could come from a single NUMA node (either 0 or 1), or that each could come from a different NUMA node.

{01: True}, {10: True}, {11: False}

A HintProvider sets the Preferred field to True only when the mask in NUMANodeAffinity represents a minimal set of NUMA nodes that can satisfy the resource request. For example, on a system with 4 NUMA nodes where the request can be satisfied by a minimum of 2 nodes:

{0011: True}, {0111: False}, {1011: False}, {1111: False}

If the actual preferred allocation cannot be satisfied until some other container releases its resources, the HintProvider returns a list of hints with all Preferred fields set to False. Consider the following scenario:

  1. Currently, all but 2 CPUs have already been allocated to containers
  2. The remaining two cpus are on different NUMA nodes
  3. A new container requests 2 cpus

In this case, the only hint generated would be {11: False} rather than {11: True}. Because two CPUs could in principle be allocated from the same NUMA node on this system (just not in the current allocation state), it is better to let the Pod fail and retry deployment when the minimal alignment can be satisfied than to schedule it with sub-optimal alignment.

b. HintProviders

Currently, the only HintProviders in Kubernetes are the CPUManager, MemoryManager, and DeviceManager. The topology manager both collects TopologyHints from the HintProviders and later invokes their resource allocation using the merged best hint. HintProviders implement the following interface:

type HintProvider interface {
    GetTopologyHints(*v1.Pod, *v1.Container) map[string][]TopologyHint
    Allocate(*v1.Pod, *v1.Container) error
}

GetTopologyHints returns a map[string][]TopologyHint. This enables a single HintProvider to provide hints for multiple resource types; for example, the DeviceManager can return hints for multiple resource types registered by device plugins.

When a HintProvider generates hints, it only considers the alignment of resources currently available on the system; resources that have already been allocated to other containers are not considered.
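To make the interface concrete, here is a minimal, hypothetical HintProvider for a made-up resource type example.com/widget, with a per-NUMA-node inventory on a 2-node machine. The type, resource name, and bookkeeping are invented for illustration; the interface is the one listed above (newer Kubernetes versions extend it with a GetPodTopologyHints method for the pod scope), and real providers such as the CPUManager derive their hints from actual allocatable state.

package widgetexample

import (
	v1 "k8s.io/api/core/v1"

	"k8s.io/kubernetes/pkg/kubelet/cm/topologymanager"
	"k8s.io/kubernetes/pkg/kubelet/cm/topologymanager/bitmask"
)

// widgetProvider is a hypothetical HintProvider for "example.com/widget".
type widgetProvider struct {
	// freeWidgetsPerNUMANode tracks how many widgets are still free on each NUMA node,
	// e.g. {0: 2, 1: 2}.
	freeWidgetsPerNUMANode map[int]int
}

func (w *widgetProvider) GetTopologyHints(pod *v1.Pod, container *v1.Container) map[string][]topologymanager.TopologyHint {
	requested := container.Resources.Limits["example.com/widget"]
	need := int(requested.Value())
	if need == 0 {
		// No opinion on NUMA placement for this container.
		return nil
	}

	var hints []topologymanager.TopologyHint
	// Single-node masks are preferred when one node alone can satisfy the request.
	for node, free := range w.freeWidgetsPerNUMANode {
		if free >= need {
			mask, _ := bitmask.NewBitMask(node)
			hints = append(hints, topologymanager.TopologyHint{NUMANodeAffinity: mask, Preferred: true})
		}
	}
	// The "both nodes" mask can also satisfy the request, but is not preferred.
	both, _ := bitmask.NewBitMask(0, 1)
	hints = append(hints, topologymanager.TopologyHint{NUMANodeAffinity: both, Preferred: false})

	return map[string][]topologymanager.TopologyHint{"example.com/widget": hints}
}

func (w *widgetProvider) Allocate(pod *v1.Pod, container *v1.Container) error {
	// A real provider would consume the merged hint here; see the Store interface below.
	return nil
}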

For example, consider the system in Figure 1, where the following two containers request resources:

# Container0
spec:
    containers:
    - name: numa-aligned-container0
      image: alpine
      resources:
          limits:
              cpu: 2
              memory: 200Mi
              gpu-vendor.com/gpu: 1
              nic-vendor.com/nic: 1

# Container1
spec:
    containers:
    - name: numa-aligned-container1
      image: alpine
      resources:
          limits:
              cpu: 2
              memory: 200Mi
              gpu-vendor.com/gpu: 1
              nic-vendor.com/nic: 1

If Container0 is the first container allocated on the system, the three topology-aware resource types generate the following hint sets:

cpu: {{01: True}, {10: True}, {11: False}}
gpu-vendor.com/gpu: {{01: True}, {10: True}}
nic-vendor.com/nic: {{01: True}, {10: True}}

The resulting aligned resource allocation for Container0 is:

{cpu: {0, 1}, gpu: 0, nic: 0}

When Container1 is considered, those resources are assumed to be unavailable, so the following hint sets are generated:

cpu: {{01: True}, {10: True}, {11: False}}
gpu-vendor.com/gpu: {{10: True}}
nic-vendor.com/nic: {{10: True}}

The resulting aligned resource allocation for Container1 is:

{cpu: {4, 5}, gpu: 1, nic: 1}

Note that when HintProviders are asked to Allocate, they do not receive the merged best hint as an argument. Instead, they use the Store interface implemented by the TopologyManager to look up the hint generated for their container:

type Store interface {
    GetAffinity(podUID string, containerName string) TopologyHint
}
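Continuing the hypothetical widgetProvider sketch from above, an Allocate implementation would typically hold a reference to this Store and look up the merged hint for the container being allocated for. Again, the provider, its fields, and its bookkeeping are invented for illustration; only the Store.GetAffinity call reflects the interface shown above.

// The provider keeps a reference to the topology manager's Store
// (the topology manager itself satisfies this interface).
type widgetProviderWithStore struct {
	widgetProvider
	store topologymanager.Store
}

func (w *widgetProviderWithStore) Allocate(pod *v1.Pod, container *v1.Container) error {
	// Look up the merged best hint computed during Admit for this container.
	hint := w.store.GetAffinity(string(pod.UID), container.Name)

	// A nil affinity means "no NUMA preference".
	if hint.NUMANodeAffinity == nil {
		return nil
	}

	// Restrict the allocation to NUMA nodes selected by the merged mask
	// (illustrative bookkeeping only; a real provider updates its own state here).
	for node := range w.freeWidgetsPerNUMANode {
		mask, _ := bitmask.NewBitMask(node)
		if bitmask.And(hint.NUMANodeAffinity, mask).Count() == 0 {
			continue // this NUMA node is not part of the merged hint
		}
		// ... consume widgets on this node ...
	}
	return nil
}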

c. Policy.Merge

Each policy implements a Merge method that defines how to combine the TopologyHints generated by all HintProviders into a single TopologyHint, which then provides the aligned resource allocation information.

// 1. bestEffort
func (p *bestEffortPolicy) Merge(providersHints []map[string][]TopologyHint) (TopologyHint, bool) {
	filteredProvidersHints := filterProvidersHints(providersHints)
	bestHint := mergeFilteredHints(p.numaNodes, filteredProvidersHints)
	admit := p.canAdmitPodResult(&bestHint)
	return bestHint, admit
}
// 2. restricted
func (p *restrictedPolicy) Merge(providersHints []map[string][]TopologyHint) (TopologyHint, bool) {
	filteredHints := filterProvidersHints(providersHints)
	hint := mergeFilteredHints(p.numaNodes, filteredHints)
	admit := p.canAdmitPodResult(&hint)
	return hint, admit
}
// 3. single-numa-node
func (p *singleNumaNodePolicy) Merge(providersHints []map[string][]TopologyHint) (TopologyHint, bool) {
   filteredHints := filterProvidersHints(providersHints)
   // Filter to only include don't cares and hints with a single NUMA node.
   singleNumaHints := filterSingleNumaHints(filteredHints)
   bestHint := mergeFilteredHints(p.numaNodes, singleNumaHints)

   defaultAffinity, _ := bitmask.NewBitMask(p.numaNodes...)
   if bestHint.NUMANodeAffinity.IsEqual(defaultAffinity) {
      bestHint = TopologyHint{nil, bestHint.Preferred}
   }

   admit := p.canAdmitPodResult(&bestHint)
   return bestHint, admit
}
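As a quick illustration of how these Merge implementations behave, the following self-contained sketch feeds the best-effort policy the Container0 hints used later in this article. It assumes the exported constructor and types exactly as they appear in the NewManager listing above; signatures can differ across Kubernetes versions.

package main

import (
	"fmt"

	"k8s.io/kubernetes/pkg/kubelet/cm/topologymanager"
	"k8s.io/kubernetes/pkg/kubelet/cm/topologymanager/bitmask"
)

func main() {
	numaNodes := []int{0, 1}
	policy := topologymanager.NewBestEffortPolicy(numaNodes)

	node0, _ := bitmask.NewBitMask(0)
	node1, _ := bitmask.NewBitMask(1)
	both, _ := bitmask.NewBitMask(0, 1)

	// One map per HintProvider, keyed by resource type, as returned by GetTopologyHints.
	providersHints := []map[string][]topologymanager.TopologyHint{
		{"cpu": {
			{NUMANodeAffinity: node0, Preferred: true},
			{NUMANodeAffinity: node1, Preferred: true},
			{NUMANodeAffinity: both, Preferred: false},
		}},
		{"gpu-vendor.com/gpu": {
			{NUMANodeAffinity: node0, Preferred: true},
			{NUMANodeAffinity: node1, Preferred: true},
		}},
		{"nic-vendor.com/nic": {
			{NUMANodeAffinity: node0, Preferred: true},
			{NUMANodeAffinity: node1, Preferred: true},
		}},
	}

	bestHint, admit := policy.Merge(providersHints)
	fmt.Printf("bestHint=%v admit=%v\n", bestHint, admit) // expect a single-node, preferred hint; admit=true
}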

From the three allocation policies above, you can see that the Merge method follows a similar flow:

1. filterProvidersHints
2. mergeFilteredHints
3. canAdmitPodResult

The filterProvidersHints function is located at pkg/kubelet/cm/topologymanager/policy.go:62:

func filterProvidersHints(providersHints []map[string][]TopologyHint) [][]TopologyHint {
   // Loop through all hint providers and save an accumulated list of the
   // hints returned by each hint provider. If no hints are provided, assume
   // that provider has no preference for topology-aware allocation.
   var allProviderHints [][]TopologyHint
   for _, hints := range providersHints {
      // If hints is nil, insert a single, preferred any-numa hint into allProviderHints.
      if len(hints) == 0 {
         klog.Infof("[topologymanager] Hint Provider has no preference for NUMA affinity with any resource")
         allProviderHints = append(allProviderHints, []TopologyHint{{nil, true}})
         continue
      }

      // Otherwise, accumulate the hints for each resource type into allProviderHints.
      for resource := range hints {
         if hints[resource] == nil {
            klog.Infof("[topologymanager] Hint Provider has no preference for NUMA affinity with resource '%s'", resource)
            allProviderHints = append(allProviderHints, []TopologyHint{{nil, true}})
            continue
         }

         if len(hints[resource]) == 0 {
            klog.Infof("[topologymanager] Hint Provider has no possible NUMA affinities for resource '%s'", resource)
            allProviderHints = append(allProviderHints, []TopologyHint{{nil, false}})
            continue
         }

         allProviderHints = append(allProviderHints, hints[resource])
      }
   }
   return allProviderHints
}

The function iterates over all HintProviders, collecting and storing their hints. If a HintProvider does not provide any hints, it is assumed to have no preference for topology-aware allocation. Finally, allProviderHints is returned.

The mergeFilteredHints function is located at pkg/kubelet/cm/topologymanager/policy.go:95:

// Merge a TopologyHints permutation to a single hint by performing a bitwise-AND
// of their affinity masks. The hint shall be preferred if all hints in the permutation
// are preferred.
func mergePermutation(numaNodes []int, permutation []TopologyHint) TopologyHint {
	// Get the NUMANodeAffinity from each hint in the permutation and see if any
	// of them encode unpreferred allocations.
	preferred := true
	defaultAffinity, _ := bitmask.NewBitMask(numaNodes...)
	var numaAffinities []bitmask.BitMask
	for _, hint := range permutation {
		// Only consider hints that have an actual NUMANodeAffinity set.
		if hint.NUMANodeAffinity == nil {
			numaAffinities = append(numaAffinities, defaultAffinity)
		} else {
			numaAffinities = append(numaAffinities, hint.NUMANodeAffinity)
		}

		if !hint.Preferred {
			preferred = false
		}
	}

	// Merge the affinities using a bitwise-and operation.
	mergedAffinity := bitmask.And(defaultAffinity, numaAffinities...)
	// Build a mergedHint from the merged affinity mask, indicating if a
	// preferred allocation was used to generate the affinity mask or not.
	return TopologyHint{mergedAffinity, preferred}
}

func mergeFilteredHints(numaNodes []int, filteredHints [][]TopologyHint) TopologyHint {
	// Set the default affinity as an any-numa affinity containing the list
	// of NUMA Nodes available on this machine.
	defaultAffinity, _ := bitmask.NewBitMask(numaNodes...)

	// Set the bestHint to return from this function as {nil false}.
	// This will only be returned if no better hint can be found when
	// merging hints from each hint provider.
	bestHint := TopologyHint{defaultAffinity, false}
	iterateAllProviderTopologyHints(filteredHints, func(permutation []TopologyHint) {
		// Get the NUMANodeAffinity from each hint in the permutation and see if any
		// of them encode unpreferred allocations.
		mergedHint := mergePermutation(numaNodes, permutation)

		// Only consider mergedHints that result in a NUMANodeAffinity > 0 to
		// replace the current bestHint.
		if mergedHint.NUMANodeAffinity.Count() == 0 {
			return
		}

		// If the current bestHint is non-preferred and the new mergedHint is
		// preferred, always choose the preferred hint over the non-preferred one.
		if mergedHint.Preferred && !bestHint.Preferred {
			bestHint = mergedHint
			return
		}

		// If the current bestHint is preferred and the new mergedHint is
		// non-preferred, never update bestHint, regardless of mergedHint's
		// narrowness.
		if !mergedHint.Preferred && bestHint.Preferred {
			return
		}

		// If mergedHint and bestHint have the same preference, only consider
		// mergedHints that have a narrower NUMANodeAffinity than the
		// NUMANodeAffinity in the current bestHint.
		if !mergedHint.NUMANodeAffinity.IsNarrowerThan(bestHint.NUMANodeAffinity) {
			return
		}

		// In all other cases, update bestHint to the current mergedHint
		bestHint = mergedHint
	})

	return bestHint
}

The mergeFilteredHints function proceeds as follows:

  1. Generate the default bitmask from the NUMA node IDs obtained via cAdvisor.
  2. Set bestHint := TopologyHint{defaultAffinity, false} as the fallback result.
  3. Take the cross product of the TopologyHints generated for each resource type.
  4. For each entry in the cross product, perform a bitwise AND over the NUMA affinities of its TopologyHints; the result becomes the NUMA affinity of the merged hint.
  5. If all hints in the entry have Preferred set to True, Preferred in the merged hint is set to True.
  6. If any hint in the entry has Preferred set to False, or the merged NUMA affinity has no bits set, Preferred in the merged hint is set to False.

Recall that the hints for Container0 are:

cpu: {{01: True}, {10: True}, {11: False}}
gpu-vendor.com/gpu: {{01: True}, {10: True}}
nic-vendor.com/nic: {{01: True}, {10: True}}

The above algorithm produces the following cross-product entries and merged hints:

cross-product entry {cpu, gpu-vendor.com/gpu, nic-vendor.com/nic}   ->   "merged" hint

{{01: True}, {01: True}, {01: True}}    ->   {01: True}
{{01: True}, {01: True}, {10: True}}    ->   {00: False}
{{01: True}, {10: True}, {01: True}}    ->   {00: False}
{{01: True}, {10: True}, {10: True}}    ->   {00: False}
{{10: True}, {01: True}, {01: True}}    ->   {00: False}
{{10: True}, {01: True}, {10: True}}    ->   {00: False}
{{10: True}, {10: True}, {01: True}}    ->   {00: False}
{{10: True}, {10: True}, {10: True}}    ->   {10: True}
{{11: False}, {01: True}, {01: True}}   ->   {01: False}
{{11: False}, {01: True}, {10: True}}   ->   {00: False}
{{11: False}, {10: True}, {01: True}}   ->   {00: False}
{{11: False}, {10: True}, {10: True}}   ->   {10: False}

Once the list of merged Hints is generated, the topology manager allocation policy configured by Kubelet determines which is the best hint.

The general process is as follows:

  1. Sort the merged hints by "narrowness". Narrowness is the number of bits set in the hint's NUMA affinity mask: the fewer bits set, the narrower the hint. For hints with the same number of bits set, the hint whose mask has the lowest numeric value is considered the narrower one.
  2. Sort the merged hints by their Preferred field. A hint with Preferred set to True is considered better than a hint with Preferred set to False.
  3. Select the narrowest hint with the best possible Preferred setting.

In the example above, all currently supported policies would use hint {01: True} to admit the Pod.
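Whether the Pod is then actually admitted depends on the canAdmitPodResult step of each policy. The function below is only an illustration of the behavior described in this article (best-effort always admits and falls back to the best hint it found; restricted and single-numa-node admit only preferred hints); it is not the actual unexported methods in the Kubernetes source, and it assumes the TopologyHint type from the topologymanager package shown earlier.

// canAdmit mirrors the per-policy admission behavior described above.
func canAdmit(policyName string, bestHint topologymanager.TopologyHint) bool {
	switch policyName {
	case "best-effort":
		// best-effort admits the pod even if only a non-preferred hint was found.
		return true
	case "restricted", "single-numa-node":
		// these policies reject the pod unless the merged hint is preferred.
		return bestHint.Preferred
	default: // "none"
		return true
	}
}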


4. Follow-up development

4.1 Known Problems

  1. The maximum number of NUMA nodes the topology manager can handle is 8. With more than 8 NUMA nodes, enumerating the possible NUMA affinities to generate hints would lead to a combinatorial explosion.
  2. The scheduler is not topology-aware, so a Pod may be scheduled to a node and then fail admission there because of the topology manager.

4.2 Functions and Features

A. NUMA-aware allocation of hugepages

As mentioned earlier, the only three HintProviders currently available to the TopologyManager are the CPUManager, MemoryManager, and DeviceManager. However, work is under way to add hugepage support, after which the TopologyManager will be able to align memory, hugepages, CPUs, and PCI devices on the same NUMA node.

B. scheduling

Currently, the TopologyManager does not participate in the Pod scheduling decision; it only acts as a Pod admission controller. After the scheduler assigns a Pod to a node, the TopologyManager decides whether to accept or reject it. A Pod may therefore be rejected because of the NUMA-aligned resources actually available on the node, which contradicts the scheduler's decision.

So how can this problem be solved? The Kubernetes scheduling framework provides extension points for scheduling algorithm plugins, which makes it possible to implement, for example, a NUMA-alignment-aware scheduling plugin.
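As a rough sketch of that idea, a Filter-style plugin could reject nodes whose free NUMA-aligned resources cannot satisfy the Pod. The skeleton below assumes the scheduler framework's FilterPlugin interface (the import path and exact signatures vary across Kubernetes versions), and nodeHasAlignedCapacity is a made-up helper standing in for whatever per-NUMA-node topology data source such a plugin would use.

package numaalign

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework" // path differs in older releases (framework/v1alpha1)
)

type NUMAAlignmentPlugin struct{}

func (pl *NUMAAlignmentPlugin) Name() string { return "NUMAAlignment" }

func (pl *NUMAAlignmentPlugin) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	// Reject nodes that cannot provide NUMA-aligned resources for this pod.
	if !nodeHasAlignedCapacity(pod, nodeInfo.Node()) {
		return framework.NewStatus(framework.Unschedulable, "cannot satisfy NUMA alignment for pod")
	}
	return nil
}

// nodeHasAlignedCapacity is a hypothetical stand-in: it would consult per-NUMA-node
// resource data exported by the node (for example via a CRD or extended metrics).
func nodeHasAlignedCapacity(pod *v1.Pod, node *v1.Node) bool { return true }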

C. Pod alignment policy

As mentioned earlier, a single alignment policy is applied to all Pods on a node via the Kubelet command line, rather than being configurable per Pod.

The biggest obstacle to implementing this feature at present is that it requires API changes to express the desired alignment policy in the Pod spec or in its associated RuntimeClass.

For follow-up information, please check the public account: DCOS


5. Reference materials

1. kubernetes-1-18-feature-topoloy-manager-beta (Kubernetes blog)

2. Topology Manager documentation

3. CPU Manager Policy documentation

4. Topology Manager design documents