AI will be a disruptive force in the organizational transformation of 2019, according to Gartner's survey of CIOs around the world. For artificial intelligence, computing power is king and cost is capability. Docker and Kubernetes, as representatives of cloud-native technology, offer AI a new way of working: GPU machines are pooled into a unified resource pool for scheduling and management, which avoids low GPU utilization and the cost of manual administration. Accordingly, the major container cluster service providers around the world offer NVIDIA GPU scheduling on Kubernetes, but they usually allocate one whole GPU card to a container. While this achieves better isolation, ensuring that an application using the GPU is not affected by other applications, and it suits deep learning model training well, it is wasteful for model development and model inference scenarios. This is the background against which we gathered the requirements for shared GPU cluster scheduling.

Kubernetes shared GPU cluster scheduling

Shared GPU cluster scheduling allows more model development and model inference workloads to share the same GPU card, thereby improving the utilization of NVIDIA GPUs in the cluster. This requires partitioning GPU resources; the dimensions of partitioning here are GPU video memory and CUDA kernel threads. Cluster-level support for shared GPUs usually involves two things: scheduling and isolation. This article focuses on scheduling. As for isolation, users currently need to apply limits at the application level (for example, using TensorFlow's per_process_gpu_memory_fraction); an optional scheme based on NVIDIA MPS is planned for the future, and GPU-level schemes will also be considered. For fine-grained scheduling of GPU cards, however, the Kubernetes community currently has no good solution, because the Kubernetes definition of GPU as an Extended Resource only supports addition and subtraction at integer granularity and cannot express the allocation of complex resources. For example, if a user wants Pod A to occupy half of a GPU card, there is no way to record or schedule such an allocation in the current Kubernetes architecture. The challenge is that GPU sharing across multiple cards is really a vector resource problem, while Extended Resource can only describe scalar resources. To solve this problem, we designed an out-of-tree shared GPU scheduling scheme that relies on the following existing working mechanisms of Kubernetes:

  • Extended Resource definition
  • The Scheduler Extender mechanism
  • Device Plugin mechanisms
  • Kubectl extension mechanism

The advantage of this GPU sharing scheduling extension is that it is implemented using Kubernetes' extension and plug-in mechanisms and is non-invasive to core components such as the API Server, Scheduler, Controller Manager and Kubelet. This makes it easy for users to apply the solution on different Kubernetes versions without rebasing code or rebuilding the Kubernetes binaries.

User stories

  • Cluster administrator: "I want to improve GPU utilization in the cluster; during development, multiple users share the same model development environment."
  • Application developer: “I want to be able to run multiple inference tasks simultaneously on a Volta GPU.”

Goals

  • Allow users to describe a request for a shareable resource through the API, and enable scheduling of that resource

Non goals

  • Isolation of this shared resource is not supported
  • Overselling is not supported

Design principles

  1. Clarify the problem and simplify the design. In the first step we are only responsible for scheduling and deployment; runtime control of video memory will be implemented in a later step.

    There are many customers whose clear requirement is simply that multiple AI applications can be scheduled onto the same GPU. They can accept controlling the amount of video memory from the application level, using something like gpu_options.per_process_gpu_memory_fraction to limit the application's video memory usage. The problem we need to solve is therefore first simplified to: use video memory as the scheduling scale, and pass the amount of video memory to the container in the form of a parameter (a sketch of this follows the list).
  2. This design does not change the Extended Resource design of the Kubernetes core, the Scheduler implementation, the Device Plugin mechanism, or the related Kubelet design. It reuses Extended Resource to describe the request API for the shared resource. The benefit is a portable solution that users can use on native Kubernetes.
  3. Scheduling by video memory and scheduling by card count can coexist in a cluster, but they are mutually exclusive on the same node: a node allocates either by number of cards or by video memory, never both.
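To make principle 1 concrete: since this first step does not enforce video memory limits at runtime, the granted amount of video memory is handed to the container as a parameter and the application is expected to limit itself (for example via TensorFlow's per_process_gpu_memory_fraction). Below is a minimal, hypothetical Go sketch of an application reading such a budget; the environment variable name GPU_MEM_LIMIT_MIB is illustrative and not the variable the project actually injects.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// The scheduler/device plugin passes the granted video memory to the container
// as a parameter; the application is expected to stay within it.
func main() {
	raw := os.Getenv("GPU_MEM_LIMIT_MIB") // illustrative name, not the real variable
	limitMiB, err := strconv.ParseInt(raw, 10, 64)
	if err != nil || limitMiB <= 0 {
		fmt.Println("no GPU memory budget set; falling back to a conservative default")
		limitMiB = 1024
	}
	// An ML framework would translate this into something like
	// per_process_gpu_memory_fraction = limitMiB / totalCardMiB.
	fmt.Printf("limiting GPU memory usage to %d MiB\n", limitMiB)
}
```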

Detailed design

Assumptions:

  1. The Kubernetes Extended Resource definition is still used, but the minimum unit of measurement is changed from 1 GPU card to a MiB of GPU memory. If the node's GPU is a single card with 16 GiB of video memory, its corresponding resource is 16276 MiB.
  2. The user's demand for shared GPUs lies in model development and model inference scenarios. In these scenarios, the GPU resources requested by a user never exceed one card; that is, the resource boundary of a single request is a single card.


First of all, we defined two new Extended Resources: the first is gpu-mem, which corresponds to GPU video memory; the second is gpu-count, which corresponds to the number of GPU cards. The vector resource is described through these two scalar resources, and combining them provides the working mechanism needed to support shared GPUs. Here is the basic architecture diagram:

The core functional modules:

  • GPU Share Scheduler Extender: uses the Kubernetes scheduler extension mechanism. It is responsible for determining, when the global scheduler runs Filter and Bind, whether a single GPU card on a node can provide enough GPU memory, and at Bind time it records the GPU allocation result in the Pod spec via annotations for later Filter calls to check.
  • GPU Share Device Plugin: uses the Device Plugin mechanism. It is called by the Kubelet on the node and performs the actual allocation of the GPU card, relying on the allocation result recorded by the Scheduler Extender (the resource names and annotation keys the two components exchange are collected in the sketch after this list).
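For reference, the Extended Resource names and Pod annotation keys these two components exchange can be collected in one place. A sketch in Go: aliyun.com/gpu-mem appears in the example later in this article, while aliyun.com/gpu-count is assumed here as the corresponding count resource, so treat the exact strings as illustrative.

```go
package gpushare

// Extended Resource names: two scalar resources that together describe the
// vector resource "per-card GPU memory". Exact strings are illustrative.
const (
	ResourceGPUMem   = "aliyun.com/gpu-mem"   // GPU video memory, in MiB
	ResourceGPUCount = "aliyun.com/gpu-count" // number of physical GPU cards
)

// Pod annotation keys written by the GPU Share Scheduler Extender at Bind time
// and consumed by the GPU Share Device Plugin at Allocate time.
const (
	AnnGPUMemIdx      = "ALIYUN_COM_GPU_MEM_IDX"         // index of the chosen GPU card
	AnnGPUMemPod      = "ALIYUN_COM_GPU_MEM_POD"         // gpu-mem requested by the Pod (MiB)
	AnnGPUMemAssume   = "ALIYUN_COM_GPU_MEM_ASSUME_TIME" // time the assignment was assumed
	AnnGPUMemAssigned = "ALIYUN_COM_GPU_MEM_ASSIGNED"    // "false" until the Pod is actually created
)
```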

The specific flow:

  1. Resource reporting

    The GPU Share Device Plugin uses the NVML library to query the number of GPU cards and the video memory of each card, and reports the node's total GPU memory (count x per-card memory) to the Kubelet as an additional Extended Resource via ListAndWatch(). The Kubelet in turn reports it to the Kubernetes API Server. For example, if a node contains two GPU cards and each card has 16276 MiB, the node's GPU memory resource is 16276 * 2 = 32552; the number of GPU cards on the node, 2, is also reported as another Extended Resource.
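A sketch of this reporting step in Go, assuming a hypothetical helper queryGPUs() standing in for the NVML queries the plugin performs; the two totals it prints are what ListAndWatch() would advertise to the Kubelet.

```go
package main

import "fmt"

// gpu represents one physical card as discovered via NVML (queries not shown).
type gpu struct {
	Index     int
	MemoryMiB int64
}

// queryGPUs is a hypothetical stand-in for the NVML calls the device plugin makes.
func queryGPUs() []gpu {
	return []gpu{{Index: 0, MemoryMiB: 16276}, {Index: 1, MemoryMiB: 16276}}
}

func main() {
	gpus := queryGPUs()

	var totalMemMiB int64
	for _, g := range gpus {
		totalMemMiB += g.MemoryMiB
	}

	// These two scalars are what the plugin ultimately reports to the Kubelet,
	// which forwards them to the API Server as node capacity.
	fmt.Printf("gpu-mem:   %d\n", totalMemMiB) // 16276 * 2 = 32552
	fmt.Printf("gpu-count: %d\n", len(gpus))   // 2
}
```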





  2. Extended scheduling

    The GPU Share Scheduler Extender can allocate gpu-mem to a Pod while reserving the allocation information in the Pod spec in the form of annotations. Based on this information, it determines at filter time whether each individual card has enough available gpu-mem for the allocation.


2.1 After completing all of its default filter actions, the Kubernetes default scheduler calls the Filter method of the GPU Share Scheduler Extender over HTTP. This is because, when the default scheduler calculates Extended Resources, it can only determine whether the total amount of free resources meets the demand; it cannot determine whether a single card does. It is therefore up to the GPU Share Scheduler Extender to check whether a single card has enough available resources.
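Since the Filter (and Bind) calls arrive over plain HTTP, the extender is essentially a small web service registered in the default scheduler's extender policy. Below is a minimal sketch of such a service in Go; the request/response types, URL paths and port are simplified stand-ins, not the real scheduler extender API or the project's actual routes.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// Simplified stand-ins for the scheduler's ExtenderArgs / ExtenderFilterResult.
type filterArgs struct {
	Pod   string   `json:"pod"`
	Nodes []string `json:"nodes"`
}
type filterResult struct {
	Nodes       []string          `json:"nodes"`
	FailedNodes map[string]string `json:"failedNodes"`
}

func filterHandler(w http.ResponseWriter, r *http.Request) {
	var args filterArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	// The real extender checks, per node, whether any single GPU card still has
	// enough free gpu-mem for the Pod (see the single-card check sketched later).
	result := filterResult{Nodes: args.Nodes, FailedNodes: map[string]string{}}
	json.NewEncoder(w).Encode(result)
}

func bindHandler(w http.ResponseWriter, r *http.Request) {
	// The real extender picks a GPU card (binpack), writes the ALIYUN_COM_GPU_MEM_*
	// annotations, and calls the Kubernetes API to bind the Pod to the node.
	w.WriteHeader(http.StatusOK)
}

func main() {
	// Illustrative paths and port; the actual extender's configuration may differ.
	http.HandleFunc("/gpushare-scheduler/filter", filterHandler)
	http.HandleFunc("/gpushare-scheduler/bind", bindHandler)
	log.Fatal(http.ListenAndServe(":32766", nil))
}
```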





For example, in a Kubernetes cluster composed of three nodes that each have two GPU cards, when a user requests gpu-mem=8138, the default scheduler scans all nodes and finds that the remaining resources of N1 (16276 * 2 - 16276 - 12207 = 4069) do not meet the demand, so node N1 is filtered out.







The remaining resources of nodes N2 and N3 are both 8138 MiB, which satisfies the default scheduler from the perspective of overall scheduling. At this point, the default scheduler delegates to the GPU Share Scheduler Extender for secondary filtering, in which the extender must judge whether a single card meets the scheduling requirement. Although node N2 has 8138 MiB available in total, GPU0 and GPU1 each have only 4069 MiB available, which cannot satisfy the single-card requirement of 8138 MiB. Node N3 also has 8138 MiB available in total, but all of it belongs to GPU0, so the single-card schedulability requirement is met. Precise conditional selection is thus achieved through the screening of the GPU Share Scheduler Extender.
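The single-card check performed during this secondary filtering can be sketched as follows; the per-card free memory reproduces the N1/N2/N3 example above, and the function name is illustrative.

```go
package main

import "fmt"

// nodeGPUs maps a node name to the free gpu-mem (MiB) of each of its cards.
var nodeGPUs = map[string][]int64{
	"N1": {0, 4069},    // 16276*2 - 16276 - 12207 = 4069 left in total
	"N2": {4069, 4069}, // 8138 free in total, but split across two cards
	"N3": {8138, 0},    // 8138 free, all on GPU0
}

// singleCardFits reports whether any single card on the node can hold the request.
func singleCardFits(freePerCard []int64, requestMiB int64) bool {
	for _, free := range freePerCard {
		if free >= requestMiB {
			return true
		}
	}
	return false
}

func main() {
	const request = 8138
	for _, node := range []string{"N1", "N2", "N3"} {
		fmt.Printf("%s schedulable: %v\n", node, singleCardFits(nodeGPUs[node], request))
	}
	// N1: false (already filtered out by total), N2: false (no single card fits),
	// N3: true (GPU0 alone can hold 8138 MiB).
}
```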

2.2 When the scheduler finds a node that meets the conditions, it delegates to the Bind method of the GPU Share Scheduler Extender to bind the node and the Pod. Here the extender needs to do two things:

  • Find the optimal GPU card ID on the node according to the binpack rule. "Optimal" here means that, among the GPU cards on the same node with enough free resources, the one with the least remaining resources is chosen. The selected card ID is saved in the Pod annotation ALIYUN_COM_GPU_MEM_IDX; the GPU memory requested by the Pod is saved in the annotation ALIYUN_COM_GPU_MEM_POD, and the time of the tentative assignment in ALIYUN_COM_GPU_MEM_ASSUME_TIME. The Pod is then bound to the selected node.

Note: the Pod annotation ALIYUN_COM_GPU_MEM_ASSIGNED is also saved at this point and is initialized to "false". It means that the Pod has been assigned to a GPU card during scheduling, but the Pod has not yet actually been created on the node. ALIYUN_COM_GPU_MEM_ASSUME_TIME records the time of this assignment.

If it turns out that no GPU on the assigned node has enough resources at this point, no binding is performed, and the default scheduler will reschedule the Pod after the assume timeout.

  • Call the Kubernetes API to bind the node and the Pod


In the figure below, when the GPU Share Scheduler Extender binds a Pod requesting gpu-mem=8138 to the selected node N1, it first compares the available resources of the different GPUs, which are GPU0 (12207), GPU1 (8138), GPU2 (4069) and GPU3 (16276). GPU2's remaining resources do not meet the demand, so it is discarded. Among the other three GPUs that meet the requirement, GPU1 is the card that has enough free resources but the least remaining, so GPU1 is selected.
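The binpack selection in this example can be sketched as follows (illustrative code reproducing the four cards of the figure).

```go
package main

import "fmt"

// bestFitGPU returns the index of the card with the least free memory that can
// still hold the request (binpack), or -1 if no single card fits.
func bestFitGPU(freePerCard []int64, requestMiB int64) int {
	best := -1
	for idx, free := range freePerCard {
		if free < requestMiB {
			continue // e.g. GPU2 (4069) is discarded
		}
		if best == -1 || free < freePerCard[best] {
			best = idx
		}
	}
	return best
}

func main() {
	free := []int64{12207, 8138, 4069, 16276} // available gpu-mem (MiB) of GPU0..GPU3
	idx := bestFitGPU(free, 8138)
	fmt.Println("selected card:", idx) // 1: GPU1 fits exactly and has the least remaining
	// The extender would then record this index in the ALIYUN_COM_GPU_MEM_IDX annotation.
}
```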

When the Kubelet receives the Pod-and-node binding event, it creates the actual Pod entity on the node. In this process, the Kubelet calls the Allocate method of the GPU Share Device Plugin, and the parameter of the Allocate call is the gpu-mem requested by the Pod. In the Allocate method, the plugin runs the corresponding Pod according to the scheduling decision of the GPU Share Scheduler Extender, as follows (a simplified sketch comes after this list):

  • List all GPU Share Pods on the node that are in the Pending state and whose ALIYUN_COM_GPU_MEM_ASSIGNED annotation is false.
  • Among them, select the Pod whose ALIYUN_COM_GPU_MEM_POD annotation matches the amount in the Allocate request. If multiple Pods satisfy this condition, the one with the earliest ALIYUN_COM_GPU_MEM_ASSUME_TIME is selected.
  • Set that Pod's ALIYUN_COM_GPU_MEM_ASSIGNED annotation to true, convert the GPU information in the Pod annotations into environment variables, and return them to the Kubelet to actually create the Pod.
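A simplified sketch of this matching logic inside Allocate; the pendingPod type and its fields stand in for the Pod objects and annotations the real plugin reads through the Kubernetes client.

```go
package main

import (
	"fmt"
	"time"
)

// pendingPod is a simplified view of a GPU Share Pod waiting on this node.
type pendingPod struct {
	Name       string
	GPUMemMiB  int64     // value of ALIYUN_COM_GPU_MEM_POD
	AssumeTime time.Time // value of ALIYUN_COM_GPU_MEM_ASSUME_TIME
	Assigned   bool      // value of ALIYUN_COM_GPU_MEM_ASSIGNED
}

// pickPodForAllocate finds the pending, unassigned Pod whose requested gpu-mem
// matches the Allocate request, preferring the earliest assume time.
func pickPodForAllocate(pods []pendingPod, requestMiB int64) *pendingPod {
	var chosen *pendingPod
	for i := range pods {
		p := &pods[i]
		if p.Assigned || p.GPUMemMiB != requestMiB {
			continue
		}
		if chosen == nil || p.AssumeTime.Before(chosen.AssumeTime) {
			chosen = p
		}
	}
	return chosen
}

func main() {
	now := time.Now()
	pods := []pendingPod{
		{Name: "binpack-1", GPUMemMiB: 8138, AssumeTime: now.Add(-30 * time.Second)},
		{Name: "binpack-2", GPUMemMiB: 8138, AssumeTime: now},
	}
	if p := pickPodForAllocate(pods, 8138); p != nil {
		p.Assigned = true // the real plugin patches ALIYUN_COM_GPU_MEM_ASSIGNED to "true"
		fmt.Println("allocating GPU to", p.Name)
		// The GPU information from the Pod annotations is then returned to the Kubelet
		// as environment variables for the container.
	}
}
```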

Related projects

At present, the project has been open-sourced on github.com as gpushare-scheduler-extender and gpushare-device-plugin.

Deployment

Refer to the deployment documentation

Test sample

  1. First create an application that uses aliyun.com/gpu-mem:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: binpack-1
  labels:
    app: binpack-1
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: binpack-1
  template: # define the pods specifications
    metadata:
      labels:
        app: binpack-1
    spec:
      containers:
      - name: binpack-1
        image: cheyang/gpu-player:v2
        resources:
          limits: # MiB
            aliyun.com/gpu-mem: 1024
```

Usage

Please refer to the usage documentation

Build

See How to build

Video Demo

Demo 1: Deploy multiple GPU Share Pods and observe that they are placed on the same GPU card in binpack fashion

Demo 2: Avoid mistakenly scheduling Pods whose resource requests exceed the resources available on a single GPU card

Roadmap


  • Optional support for NVIDIA MPS in the Device Plugin;
  • Automatic deployment of this solution in Kubernetes clusters initialized by kubeadm;
  • Improve the high availability of the Scheduler Extender;
  • Provide a general solution for GPUs, RDMA and elastic network cards.