Today, Zouyee takes you through some interesting CNCF news from last week:

  1. The Kubernetes community GB representative election ended with the election of Paris Pittman

  2. CNCF incubation project OPA has entered the graduation process

  3. Last week the Helm project released feature version v3.5.0

  4. The CoreDNS project published guidance on dealing with Docker image registry pull limits

This article follows “Kubernetes Scheduling System from Shallow to Deep: A Preliminary Study”. Today, Zouyee brings you “Kubernetes Scheduling from Shallow to Deep: Framework”. This series is based on Kubernetes 1.20.


1. Review of the Previous Article

In “Kubernetes Scheduling System from Shallow to Deep: A Preliminary Study”, an overall interaction diagram was given to build an intuitive picture of Pod scheduling. Here we expand that interaction diagram, as shown below.

Note: this interaction diagram is not formal UML; please bear with it.

Taking creating a Pod as an example, the following describes the scheduling process:

  1. The user creates a Pod from the command line (a Pod is created directly, rather than through another workload, so that kube-controller-manager can be left out of the picture)

  2. After object validation, admission, quota, and other checks, kube-apiserver writes the object to etcd

  3. kube-apiserver returns the result to the user

  4. Meanwhile, kube-scheduler continuously watches Node and Pod events, etc. (Flow 1)

  5. kube-scheduler adds Pods with an empty spec.nodeName to the scheduling queue, then picks a Pod to enter the scheduling cycle (described in this article) (Flows 2-3)

  6. kube-scheduler binds the Pod to the node with the highest score (Flow 4)

  7. kube-apiserver writes the binding information to etcd

  8. Kubelet watches for Pods assigned to it and calls the CRI to create the Pod

  9. After the Pod is created, kubelet updates the Pod status and other information and reports it to kube-apiserver

  10. kube-apiserver writes the updated data to etcd
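The “binding” in steps 6–7 is itself an API object: the scheduler posts a v1 Binding that names the chosen node, and that is what kube-apiserver persists. A sketch of such an object (the Pod and node names are illustrative):

```yaml
apiVersion: v1
kind: Binding
metadata:
  name: my-pod        # the Pod being bound
target:
  apiVersion: v1
  kind: Node
  name: node-1        # the highest-scoring node chosen by the scheduler
```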

2. Framework Background

As functionality grows, the scheduler's code and logic become increasingly complex. The increase in code size and complexity inevitably raises maintenance costs and makes bugs harder to locate and fix. Older versions of the Kubernetes scheduler (pre-1.16) offered webhooks (extenders) for extension. However, these have the following defects:

  • The number of extension points is limited and their positions are fixed, so flexible extension and deployment are not possible. For example, an extender's Filter can only be invoked after the default Filter policies have executed.

  • Extender calls go over HTTP, are affected by the network, and perform far worse than local function calls. In addition, each call must serialize and deserialize Pod and Node information, which degrades performance further.

  • Up-to-date Pod-related information (held in the scheduling cache) cannot be passed to the extender in time.

To solve these problems, simplify the scheduler code, and improve extensibility, the community introduced a new scheduling framework, the Scheduling Framework, starting from Kubernetes 1.16.
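For reference, the webhook mechanism being replaced is configured through the extenders section of the scheduler configuration; a hedged sketch (the endpoint and verbs are illustrative, not a working extender):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
extenders:
  - urlPrefix: "http://127.0.0.1:8888/scheduler"  # illustrative webhook endpoint
    filterVerb: "filter"          # each call POSTs serialized Pod/Node info over HTTP
    prioritizeVerb: "prioritize"
    weight: 1
    enableHTTPS: false
```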

The Scheduling Framework defines a rich set of extension point interfaces on top of the original scheduling process. Developers implement plug-ins against the interfaces defined by these extension points and register the plug-ins with them. When the framework reaches the corresponding extension point during scheduling, it executes the registered plug-ins and produces the result of the current phase. In this way, the user's scheduling logic is integrated into the framework. The Scheduling Framework identifies the following goals:

  • Extensibility: make scheduling more extensible
  • Maintainability: move some of the scheduler's features into plug-ins
  • Functionality:
    • The framework provides extension points
    • Provide a mechanism to receive plug-in results and continue or terminate based on them
    • Provide a mechanism to handle and communicate plug-in errors
3. Framework Principles

The Framework’s scheduling process is divided into two phases:

  • The scheduling stage is executed synchronously; only one scheduling cycle runs at a time, so it is thread-safe
  • The binding stage is executed asynchronously in goroutines; multiple binding cycles may run concurrently in the same period, so it is not thread-safe
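This split can be sketched with a toy loop (all names here are illustrative stand-ins, not the real scheduler API): the scheduling cycle runs serially on a single goroutine, while each binding cycle is spawned asynchronously.

```go
package main

import (
	"fmt"
	"sync"
)

// schedule is a placeholder for the synchronous scheduling cycle:
// exactly one invocation runs at a time, so it is thread-safe.
func schedule(pod string) string { return "node-1" } // trivial node choice

// bind is a placeholder for the asynchronous binding cycle.
func bind(pod, node string) { fmt.Printf("bound %s to %s\n", pod, node) }

func main() {
	pods := []string{"pod-a", "pod-b", "pod-c"}
	var wg sync.WaitGroup
	for _, pod := range pods {
		node := schedule(pod) // scheduling cycle: serial
		wg.Add(1)
		go func(pod, node string) { // binding cycle: concurrent goroutine
			defer wg.Done()
			bind(pod, node)
		}(pod, node)
	}
	wg.Wait()
}
```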

Before introducing the Framework's scheduling process, let's look at the scheduleOne logic shown above:

A. Scheduling phase

1. FindNodesThatFitPod runs the PreFilter plugins, the Filter plugins, and the Filter extenders; if no node passes, a FitError is returned.
2. PrioritizeNodes runs the PreScore plugins, the Score plugins, and the Prioritize extenders to score the feasible nodes.
3. Node selection is done by the select function: feasible nodes are ranked by score and one is chosen (sampling among the best).
4. Assume pre-allocates the Pod onto the node in the scheduler cache (this can be rolled back).
5. RunReservePlugins reserves the relevant scheduling data on the node; if any later stage fails, Unreserve must be called to roll back (transaction-like).
6. RunPermitPlugins performs the admission (permit) step.
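The filter → score → select sequence can be sketched with self-contained stand-ins (the real framework types live in pkg/scheduler/framework; the simplified signatures and the CPU-based policy below are assumptions for illustration only):

```go
package main

import "fmt"

// node is a simplified stand-in for a cluster node.
type node struct {
	name string
	free int // free CPU millicores (illustrative resource model)
}

// filterNodes plays the role of the Filter phase (the old Predicates):
// drop nodes that cannot fit the request.
func filterNodes(nodes []node, request int) []node {
	feasible := []node{}
	for _, n := range nodes {
		if n.free >= request {
			feasible = append(feasible, n)
		}
	}
	return feasible
}

// scoreNode plays the role of the Score phase (the old Priorities):
// more remaining capacity scores higher.
func scoreNode(n node, request int) int { return n.free - request }

// selectHost filters, scores, and returns the best node, or a
// FitError-like error when no node is feasible.
func selectHost(nodes []node, request int) (string, error) {
	feasible := filterNodes(nodes, request)
	if len(feasible) == 0 {
		return "", fmt.Errorf("no feasible node (FitError)")
	}
	best := feasible[0]
	for _, n := range feasible[1:] {
		if scoreNode(n, request) > scoreNode(best, request) {
			best = n
		}
	}
	return best.name, nil
}

func main() {
	nodes := []node{{"node-a", 500}, {"node-b", 2000}, {"node-c", 100}}
	host, err := selectHost(nodes, 800)
	fmt.Println(host, err)
}
```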

B. Binding phase

1. WaitOnPermit is executed; on failure, the Unreserve rollback of the Reserve plugins runs.
2. RunPreBindPlugins is executed; on failure, Unreserve runs.
3. Binding is executed: first the extender binding (extendersBinding), then the Bind plugins; on failure, Unreserve runs.
4. RunPostBindPlugins closes out the binding.
Introduction to Extension Points

In the figure, the Unreserve plugins (in purple) are the rollback counterparts of the Reserve plugins.

pkg/scheduler/framework/interface.go

The extension points and their uses:

  • QueueSort: supports custom Pod ordering in the scheduling queue. If a QueueSort plugin is specified, the queue is sorted by its algorithm; only one sorting algorithm can be enabled at a time.
  • PreFilter: pre-processes Pod information, e.g. populating a Pod-related cache.
  • Filter: the old Predicates; filters out nodes that do not meet the Pod's requirements.
  • PostFilter: handles actions such as preemption when the Pod fails the Filter phase.
  • PreScore: generates information needed before scoring; logging or monitoring data can also be produced here.
  • Score: the old Priorities; selects the best node according to the scoring (and normalization) policies defined at this extension point.
  • Reserve: the last plug-in of the scheduling stage; prevents resource races after a successful scheduling decision and keeps the cluster's resource accounting accurate.
  • Permit: intercepts the Pod before binding, and can allow, reject, or make the Pod wait depending on conditions.
  • PreBind: performs work needed before the node is actually bound.
  • Bind: a Pod is handled by only one Bind plugin, which creates the Binding object.
  • PostBind: logic executed after a successful bind.
  • Unreserve: rolls the data back to its initial state whenever an error occurs in the Permit-to-Bind phases, similar to a transaction rollback.
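As an illustration of one extension point, here is a minimal Filter plugin written against a locally defined stand-in interface. The real interface (with *v1.Pod and framework.NodeInfo parameters) is in pkg/scheduler/framework/interface.go; everything below is a simplified assumption for illustration.

```go
package main

import "fmt"

// Simplified stand-ins for the framework's Pod and Node types.
type Pod struct{ RequiredLabel string }
type Node struct{ Labels map[string]string }

// FilterPlugin mirrors the shape of the framework's Filter extension point:
// a nil error means the node is feasible for the Pod.
type FilterPlugin interface {
	Name() string
	Filter(pod Pod, node Node) error
}

// nodeLabelFilter rejects nodes that are missing a label the Pod requires.
type nodeLabelFilter struct{}

func (nodeLabelFilter) Name() string { return "NodeLabelFilter" }

func (nodeLabelFilter) Filter(pod Pod, node Node) error {
	if _, ok := node.Labels[pod.RequiredLabel]; !ok {
		return fmt.Errorf("node missing label %q", pod.RequiredLabel)
	}
	return nil
}

func main() {
	var plugin FilterPlugin = nodeLabelFilter{}
	pod := Pod{RequiredLabel: "disktype"}
	fmt.Println(plugin.Filter(pod, Node{Labels: map[string]string{"disktype": "ssd"}})) // feasible: <nil>
	fmt.Println(plugin.Filter(pod, Node{Labels: map[string]string{}}))                  // infeasible
}
```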
4. Use Scenarios

The following are some examples of how scheduling frameworks can be used to solve common scheduling scenarios.

  1. Gang scheduling

    Similar to kube-batch, this allows a task with a set number of Pods to be scheduled as a unit. Multiple workers of a training job can be scheduled as a whole: containers are started on nodes only when the resource requests of all workers of the task can be satisfied.

  2. Dynamic binding of cluster resources

    Volume topology-aware scheduling can be implemented using the Filter and PreBind extension points.

  3. Schedule development

    The framework allows custom plug-ins to be compiled into a scheduler binary by wrapping the scheduler in your own main function and registering the plug-ins there.
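For scenario 3, once a custom plugin has been compiled into a scheduler binary, it is enabled per profile through the scheduler configuration. A hedged sketch with an assumed plugin name (NodeLabelFilter is hypothetical):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: custom-scheduler
    plugins:
      filter:
        enabled:
          - name: NodeLabelFilter   # hypothetical custom Filter plugin
```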
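The gang-scheduling idea in scenario 1 maps naturally onto the Permit extension point: each Pod of a group "waits" at Permit until the whole group has been reserved. A self-contained sketch (this is not the kube-batch or coscheduling plugin API; all names are illustrative):

```go
package main

import "fmt"

// gangPermit mimics a Permit plugin for gang scheduling: Pods of a group
// wait at Permit until minMember Pods have arrived, then all are allowed.
type gangPermit struct {
	minMember int
	waiting   map[string][]string // group name -> pods waiting at Permit
}

// Permit returns "allow" once the group is complete, otherwise "wait".
func (g *gangPermit) Permit(pod, group string) string {
	g.waiting[group] = append(g.waiting[group], pod)
	if len(g.waiting[group]) >= g.minMember {
		return "allow" // every waiting Pod of the group would be released here
	}
	return "wait"
}

func main() {
	g := &gangPermit{minMember: 3, waiting: map[string][]string{}}
	fmt.Println(g.Permit("worker-0", "job-a")) // wait
	fmt.Println(g.Permit("worker-1", "job-a")) // wait
	fmt.Println(g.Permit("worker-2", "job-a")) // allow
}
```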

That concludes the introduction to the framework. The next article will move on to the source code, followed by scheduling configuration and third-party scheduler integration. Stay tuned.

For follow-up information, please check the public account: DCOS

5. Reference materials
1. https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler/
2. https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/
3. https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/624-scheduling-framework/README.md