With the development of artificial intelligence technology, more and more services within iQiyi are driven by deep learning models, providing our users with a smarter and more convenient online video experience. For online services, a single container instance usually needs an exclusive GPU to complete inference requests on video, image, speech, and text within millisecond-to-second latency targets. To keep response latency low, requests are usually processed one at a time rather than batched for higher computing efficiency, and request arrival intervals are random, so the GPU utilization of these services is low (as shown in Figure 1). Moreover, online traffic has peaks and troughs within a day or over longer periods, which reduces GPU utilization further. Given the high price of GPUs, low utilization wastes a large amount of computing resources and drives up the cost of AI services.

Figure 1: Online inference service GPU utilization statistics

In this context, the most direct solution is to deploy multiple services on the same GPU card, improving GPU utilization through GPU sharing while guaranteeing service quality. At present, Nvidia officially offers two main GPU sharing technologies:

(1) vGPU; (2) MPS.

Let’s briefly compare the two options.

Nvidia vGPU solution

Nvidia’s vGPU solution virtualizes GPU devices based on SR-IOV, provides time-sliced execution logic in the driver layer, and offers a degree of video memory isolation, so that a graphics card can be partitioned as required when it is initially configured. The time-slicing schedule can be divided per instance or by a custom ratio, and the card’s video memory must be divided according to a preset ratio. The implementation of Nvidia’s vGPU solution has the following two limitations:

(1) After a vGPU partition is set up, changing the predefined partition requires restarting the graphics card to take effect; the configuration cannot be changed on the fly.

(2) The solution is based on virtual machines: the GPU physical machine must be virtualized first, and containers are then deployed inside the VMs, so containerized scheduling cannot be done directly on physical machines.

Nvidia MPS solution

Nvidia’s MPS solution is a software-level virtualization solution for computing power partitioning. Compared with the vGPU solution, its configuration is very flexible, and it works well with Docker. MPS is based on a client/server architecture: all processes running on a GPU configured in MPS mode dynamically send the kernels they launch to the MPS server, which uses CUDA Streams to launch and execute multiple kernels concurrently. In addition, MPS can configure the share of GPU compute each process may use.

One problem with this solution is that every service process depends on MPS: once the MPS process fails, all processes on that GPU are affected and can only be recovered by resetting the GPU with nvidia-smi.

01

iQiyi vGPU solution

After investigating the above solutions, and to better fit iQiyi’s internal AI containerization scenarios, we developed our own GPU virtual sharing solution for container environments. It isolates and allocates video memory and computing power through CUDA API interception, and builds on the open source Aliyun-Gpushare Scheduler [1] to schedule and allocate virtual GPUs on K8S, so that multiple application containers can be deployed on one GPU card.

The main feature of our solution is flexible configuration: it integrates with K8S to allocate the vGPU instances users need in real time and on demand, while keeping physical GPUs shared as fully as possible to maximize resource utilization.

After designing the solution, we evaluated its overall effect and tested the impact of isolation and sharing on application performance. For a single process, we need to ensure three things: first, it must not use more than its assigned share of computing power; second, the isolation itself should not cost too much GPU performance; third, when multiple processes share the GPU at the same time, their performance should not deviate much from when they run alone, i.e. sharing should effectively avoid interference between processes.

Based on the above criteria, we conducted performance tests on the GPU virtual sharing scheme, and the results are shown in Figure 2.

The first test evaluates the performance of a single process under computing power isolation. A single process runs on a physical GPU three times, configured with 100%, 50% and 10% of the computing power respectively, and its performance is compared with the same program running exclusively. The vertical axis is the percentage of non-virtualized performance, and the horizontal axis enumerates the unit test cases; areas of the same color indicate test cases that run the same CUDA kernel with different parameters. The green, blue and red dots in the figure are the performance of more than 500 test cases under their respective computing power allocations, relative to exclusive GPU execution without partitioning. The curves are the same data smoothed across all test cases for easier visual comparison.

In the second and third tests, two GPU processes with different computing power ratios were run together to measure mutual interference. In the second test, the two processes are each configured with 50% computing power; the green dots are the average performance of the two processes, and the red curve is the smoothed version of those dots. This curve is almost identical to the 50% curve in the first test, indicating that interference between processes running simultaneously at 50% computing power is negligible in our scheme. In the third test, one process is configured with 70% computing power and the other with 30%, and the results can likewise be compared with the 70% and 30% curves from the first test.

The test results show that the scheme keeps mutual interference between GPU workloads within a reasonable range. Internal statistics show that, on average, more than 100 deep learning container services can share 35 physical GPUs without affecting each other. For a single physical GPU card, the average number of services carried increased from 1 to 3, and average GPU utilization increased more than twofold.

Figure 2: Isolation test results

02

Basic principles of iQiyi GPU virtual sharing

First, let’s look at the underlying principles of GPU virtual sharing. As a powerful computing peripheral, a GPU provides two main resources: video memory and computing power. To share a single GPU, we need to isolate both resources and verify the efficiency and performance of the isolation scheme.

2.1 Video memory isolation

For deep learning applications, the need for video memory comes from three aspects.

1) The first is the CUDA context for the model’s kernels, which is similar to the TEXT segment of a CPU program and provides the environment in which CUDA kernels execute. This is a hard requirement: without sufficient video memory, kernels cannot be launched. However, it is the smallest part of the model’s overall video memory requirement.

2) The second part comes from parameters learned during model training, such as the weights and biases of convolution layers.

3) The third part comes from the temporary storage of the model in the inference process, which is used to store the intermediate calculation results.

For typical models, there is basically no need to occupy the entire GPU video memory. There is one exception, however: by default, the Tensorflow framework allocates the video memory of all GPUs for its own memory management. Tensorflow does provide options to disable this behavior, but it is not practical for the platform to require every user to modify their TF configuration.

A clever way to deal with this is to make a deployed Tensorflow application allocate only as much video memory as it actually needs, without involving the application developer. This technique is API dynamic interception. Tensorflow learns the total and currently available GPU memory through the CUDA driver API calls cuDeviceTotalMem and cuMemGetInfo. With an LD_PRELOAD hook .so inserted between Tensorflow and the CUDA library, these two calls are intercepted, so that the video memory reported to a particular Tensorflow application is limited to its requested quota.

During implementation, cuMemAlloc/cuMemFree are also intercepted, so that multiple GPU processes in the same container can be managed together: when the total video memory allocated by these processes exceeds their quota, cuMemAlloc returns an out-of-memory error. The per-container video memory accounting is done through shared memory. Figure 3 shows the entire flow of video memory isolation and allocation.

Figure 3: Isolation and allocation flow in video memory segmentation
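To make the interception concrete, below is a minimal sketch (not iQiyi’s actual code) of such an LD_PRELOAD hook in C. The quota source (a hypothetical GPU_MEM_QUOTA_BYTES environment variable) and the build command are assumptions for illustration; a production hook must also cover the versioned driver symbols (e.g. cuDeviceTotalMem_v2) and entry points resolved through cuGetProcAddress.

```c
/* mem_hook.c — minimal sketch of an LD_PRELOAD hook that clamps the video
 * memory reported to the application via cuDeviceTotalMem / cuMemGetInfo.
 * Build (assumption): gcc -shared -fPIC mem_hook.c -o libmemhook.so -ldl
 * Use:                LD_PRELOAD=/path/to/libmemhook.so python infer.py
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdlib.h>
#include <stddef.h>

typedef int CUresult;          /* minimal stand-ins so the sketch builds     */
typedef int CUdevice;          /* without depending on cuda.h                */
#define CUDA_SUCCESS 0

/* Hypothetical quota source: an env var holding the per-container limit in bytes. */
static size_t mem_quota(void) {
    const char *s = getenv("GPU_MEM_QUOTA_BYTES");        /* assumed name */
    return s ? (size_t)strtoull(s, NULL, 10) : (size_t)-1;
}

CUresult cuDeviceTotalMem(size_t *bytes, CUdevice dev) {
    /* Call the real driver function, then clamp the result to the quota. */
    CUresult (*real)(size_t *, CUdevice) =
        (CUresult (*)(size_t *, CUdevice))dlsym(RTLD_NEXT, "cuDeviceTotalMem");
    CUresult rc = real(bytes, dev);
    size_t quota = mem_quota();
    if (rc == CUDA_SUCCESS && *bytes > quota)
        *bytes = quota;                    /* report at most the quota        */
    return rc;
}

CUresult cuMemGetInfo(size_t *free_b, size_t *total_b) {
    CUresult (*real)(size_t *, size_t *) =
        (CUresult (*)(size_t *, size_t *))dlsym(RTLD_NEXT, "cuMemGetInfo");
    CUresult rc = real(free_b, total_b);
    size_t quota = mem_quota();
    if (rc == CUDA_SUCCESS) {
        if (*total_b > quota) *total_b = quota;          /* clamp total       */
        if (*free_b > *total_b) *free_b = *total_b;      /* and free memory   */
    }
    return rc;
}
```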

2.2 Computing power isolation

In addition to video memory, the other important GPU resource is computing power. For Nvidia’s Volta architecture, computing power comes from three kinds of units: floating point units, integer units, and Tensor Core acceleration units. The floating point and integer units are part of the streaming processors (SPs), and each streaming multiprocessor (SM) contains multiple SPs. The V100 has 80 SMs with 64 SPs each, i.e. 5120 streaming processors; Tensor Cores also sit inside the SM and share registers, shared memory and L1 cache with the SPs. Figure 4 shows the block diagram of the Nvidia GPU hardware architecture.

Figure 4: Organization diagram of Nvidia GPU hardware architecture

CUDA, Nvidia’s GPU programming language, is designed to correspond logically to the hardware hierarchy above. CUDA has three logical layers: grid, block, and thread. A grid can be seen as the logical abstraction of the whole card, a block as the logical abstraction of an SM, and a thread as the logical abstraction of an SP. To achieve maximum concurrency, there is no interaction between SMs. This is not absolute: programs with special logic can be written so that blocks depend on each other, but it comes at a huge cost in performance.

With the GPU’s underlying architecture and CUDA’s design principles in mind, we can sketch a preliminary idea for computing power virtualization: since some models cannot fully use the GPU’s computing power, why not reduce the number of SMs they occupy, so that the idle SMs can be used by other GPU programs?

The idea is good, but there are constraints that prevent this optimization from being applied directly. GPU programs execute as kernels. After a kernel is launched on the CPU side, the kernel and its arguments are handed to the GPU’s hardware scheduler to run at some point in the future. By default, a kernel is dispatched to all SMs on the GPU and cannot be interrupted during execution. As shown in Figure 5, after the software stack issues a launch command, the command and its parameters are transferred to the GPU over PCIe and inserted into the GPU’s queue; logic fixed in the GPU hardware decides when to start it.

Figure 5: Interaction diagram of GPU software and hardware scheduling system

But just because it cannot be done by default does not mean it cannot be done at all. Let’s look at CUDA’s design again. CUDA is a language for driving the GPU to perform efficient parallel computation, and its code is written from the perspective of a thread. All SPs in an SM run copies of the kernel code, to some extent even in lockstep. CUDA uses the built-in variables blockIdx/threadIdx to identify a thread and determine the data offset its code should process. These two variables are read-only at the machine-code level and are assigned when the hardware scheduler dispatches a thread; it is the hardware scheduler that binds the abstract blockIdx/threadIdx to a concrete SM/SP. Figure 6 outlines this mapping.

Figure 6: Mapping between CUDA logical blocks and hardware computing units
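As a concrete illustration of this mapping, here is a toy CUDA kernel (not from the article) in which each thread uses blockIdx/threadIdx only to compute the offset of the data element it handles; which physical SM/SP actually executes each block and thread is decided entirely by the hardware scheduler.

```cuda
#include <cuda_runtime.h>

// Toy kernel: each thread derives its data offset purely from blockIdx/threadIdx;
// the hardware scheduler decides which SM/SP runs each block and thread.
__global__ void scale(float *data, float factor, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // logical coordinates -> data offset
    if (idx < n)
        data[idx] *= factor;
}

int main(void) {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);       // enough 256-thread blocks to cover n
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```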

To control computing power precisely, we can no longer rely on the hardware scheduler alone to control kernel launches. The trick is to make the kernel stay resident (“stuck”) on a fixed number of SMs after launch; the ratio of this number to the total number of SMs on the GPU is the computing power ratio allocated internally.

To illustrate the idea visually, we abstract the GPU as having 10 SMs, and consider a kernel launched with parameters <<<15,1>>>, i.e. a CUDA grid of 15 blocks with 1 thread each. When it launches normally, the hardware scheduler assigns one block of the kernel to each SM. This consumes 10 of the blocks first; as each SM finishes its block, the hardware scheduler dispatches the remaining 5 blocks, completing the kernel’s execution.

With computing power partitioning, we dynamically modify the kernel’s launch parameters at launch time, reducing its grid size from 15 blocks to 5. The hardware scheduler then dispatches copies of the kernel to only half of the SMs on the GPU, and the idle half can be used by other kernels, as shown in Figure 7.

Figure 7: Dynamically modifying launch parameters for computing power partitioning
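The host-side half of this step can be pictured as an interception of the kernel launch call itself. The sketch below (a simplification, not iQiyi’s implementation) hooks cuLaunchKernel via LD_PRELOAD and caps the grid size according to a computing power ratio; GPU_COMPUTE_RATIO and GPU_SM_COUNT are hypothetical configuration variables, and only the simple 1D-grid case is handled. On its own this would leave part of the work undone, which is exactly why the “trapping” step described next is needed.

```c
/* launch_hook.c — simplified sketch of capping a kernel's grid at launch time.
 * Not iQiyi's implementation: GPU_COMPUTE_RATIO and GPU_SM_COUNT are
 * hypothetical configuration variables, and only 1D grids are handled.
 * Build (assumption): gcc -shared -fPIC launch_hook.c -o liblaunchhook.so -ldl
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdlib.h>

typedef int   CUresult;              /* minimal stand-ins for cuda.h types */
typedef void *CUfunction;
typedef void *CUstream;

typedef CUresult (*launch_fn)(CUfunction, unsigned, unsigned, unsigned,
                              unsigned, unsigned, unsigned,
                              unsigned, CUstream, void **, void **);

CUresult cuLaunchKernel(CUfunction f,
                        unsigned gx, unsigned gy, unsigned gz,
                        unsigned bx, unsigned by, unsigned bz,
                        unsigned shmem, CUstream stream,
                        void **params, void **extra)
{
    launch_fn real = (launch_fn)dlsym(RTLD_NEXT, "cuLaunchKernel");

    const char *r = getenv("GPU_COMPUTE_RATIO");   /* e.g. "0.5" for 50% of SMs          */
    const char *n = getenv("GPU_SM_COUNT");        /* e.g. "10" in the article's example */
    if (r && n && gy == 1 && gz == 1) {
        unsigned cap = (unsigned)(atof(r) * atoi(n));
        if (cap == 0) cap = 1;                     /* keep at least one block            */
        if (gx > cap)
            gx = cap;   /* a <<<15,1>>> launch with a 50% quota on 10 SMs becomes <<<5,1>>> */
    }
    return real(f, gx, gy, gz, bx, by, bz, shmem, stream, params, extra);
}
```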

Although dynamically modifying the launch parameters prevents the kernel from occupying all SM resources, the “trapping” step is not yet done: the kernel would still exit after completing its predetermined logic, so it would never cover the data that the original 15-block launch was supposed to process. To trap it, we replace the EXIT instruction in the kernel’s assembly with a branch. After the kernel finishes its logic, it jumps to a preset routine which increments a virtual blockIdx/threadIdx and then jumps back to the start of the kernel to run another round of computation with the updated blockIdx/threadIdx.

Note that blockIdx/threadIdx are read-only registers, so their values cannot be changed directly. As a workaround, every use of blockIdx/threadIdx in the kernel is replaced with a writable register, whose value can then be updated in the preset jump logic, as shown in Figure 8.

Figure 8: Assembly-level changes to the kernel’s execution logic
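Expressed at the CUDA source level, the effect of this assembly rewrite is roughly equivalent to the “resident block” loop below: the kernel is launched with fewer blocks than the application intended, and each resident block iterates over the remaining virtual block indices. The actual scheme achieves this by patching the compiled kernel, so user code is never modified; this source-level version only illustrates the resulting control flow.

```cuda
#include <cuda_runtime.h>

// Source-level equivalent of the assembly rewrite: the kernel is launched with
// only `resident_blocks` blocks, and each resident block loops over the virtual
// block indices that the original, larger launch would have covered.
__global__ void scale_resident(float *data, float factor, int n, int virtual_blocks) {
    for (int vb = blockIdx.x; vb < virtual_blocks; vb += gridDim.x) {
        int idx = vb * blockDim.x + threadIdx.x;   // offset from the *virtual* block index
        if (idx < n)
            data[idx] *= factor;
    }
}

int main(void) {
    const int n = 15 * 256;          // data sized for the intended 15 blocks of 256 threads
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    int virtual_blocks = 15;         // grid size the application intended
    int resident_blocks = 5;         // blocks actually dispatched (50% of a 10-SM GPU)
    scale_resident<<<resident_blocks, 256>>>(d, 2.0f, n, virtual_blocks);

    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```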

03

Scheduling design of iQiyi GPU virtual sharing

After GPU resource isolation is in place, we also need to allocate and schedule the isolated GPU resources on the K8S platform, so that deep learning services can be deployed quickly on shared GPUs.

Normally, GPUs are used in K8S containers through the official Nvidia Device Plugin, which can allocate one or more cards to a Pod with a whole card as the minimum allocation unit; it cannot schedule the finer-grained isolated GPU resources described above. After investigation, we chose aliyun-gpushare, open-sourced by Alibaba Cloud Container Service, as the scheduling scheme for isolated GPU resources.

Taking video memory as an example, with aliyun-gpushare a Pod is allocated part of the video memory of one card, so that the resources of a single card can be logically split further. Suppose you have a V100 32GB card: 4GB of video memory can be assigned to Pod1 and 8GB to Pod2 at the same time, until the 32GB is fully allocated. The whole scheduling process is shown in Figure 9.

Figure 9: Overall scheduling scheme of Aliyun’s open source components

Among them, the Share GPU Device Plugin and Share GPU Schd Extender are the main newly added components, abbreviated below as SGDP and SGSE; the other components are standard K8S components.

The main flow in the figure is as follows:

  1. When a Share GPU Pod is created, it must request the K8S extended resource aliyun.com/gpu-mem, specifying how much video memory it needs.

  2. SGSE assigns a node to the Pod according to the requested Share GPU memory and the overall resource situation of the cluster, and records which card to use by patching the Pod’s annotations.

  3. Kubelet calls the Allocate method of SGDP to allocate the GPU card to the Pod, and sets the environment variables ALIYUN_COM_GPU_MEM_CONTAINER (the video memory available to the container) and LD_PRELOAD (the path of the dynamic library that limits video memory).

  4. Because LD_PRELOAD is set in the Pod, all GPU memory requests from AI frameworks are hijacked by the dynamic library hook. When the total memory in use would exceed the value of ALIYUN_COM_GPU_MEM_CONTAINER, the request is rejected, which limits the video memory the user can consume (a simplified sketch of this hook follows the list).
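The sketch below illustrates step 4 at the level of the CUDA driver API: cuMemAlloc requests are counted against the quota read from ALIYUN_COM_GPU_MEM_CONTAINER and rejected once the quota would be exceeded. It is deliberately simplified: accounting is per-process rather than shared across the container via shared memory (Figure 3), the unit of the environment variable is assumed to be GiB, and cuMemFree bookkeeping, thread safety, and the versioned driver symbols are all omitted.

```c
/* alloc_hook.c — simplified sketch of the cuMemAlloc quota check (step 4).
 * Not the production hook: real accounting is shared across all processes in
 * the container, and versioned driver symbols must also be covered.
 * Build (assumption): gcc -shared -fPIC alloc_hook.c -o liballochook.so -ldl
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdlib.h>
#include <stddef.h>

typedef int CUresult;
typedef unsigned long long CUdeviceptr;
#define CUDA_SUCCESS             0
#define CUDA_ERROR_OUT_OF_MEMORY 2

static size_t used_bytes = 0;        /* per-process, not thread-safe: illustration only */

static size_t quota_bytes(void) {
    const char *s = getenv("ALIYUN_COM_GPU_MEM_CONTAINER");
    /* Unit assumed to be GiB here; convert to bytes. */
    return s ? (size_t)strtoull(s, NULL, 10) << 30 : (size_t)-1;
}

CUresult cuMemAlloc(CUdeviceptr *dptr, size_t bytesize) {
    if (used_bytes + bytesize > quota_bytes())
        return CUDA_ERROR_OUT_OF_MEMORY;          /* reject over-quota requests */

    CUresult (*real)(CUdeviceptr *, size_t) =
        (CUresult (*)(CUdeviceptr *, size_t))dlsym(RTLD_NEXT, "cuMemAlloc");
    CUresult rc = real(dptr, bytesize);
    if (rc == CUDA_SUCCESS)
        used_bytes += bytesize;                   /* count only successful allocations */
    /* cuMemFree would be hooked similarly to decrement used_bytes (omitted). */
    return rc;
}
```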

The scheduling strategy of computing power resources is similar to the above video memory scheduling.

In practice, for a physical GPU card, we partition video memory and computing power at granularities between 1/4 and 1/2 of a card; services choose the appropriate ratio based on their actual requirements. A single GPU can thus host up to four different applications, effectively isolated from each other.

04

Conclusion and outlook

Through LD_PRELOAD dynamic hijacking, we achieve lightweight isolation of GPU memory and computing power inside containers, so that multiple containerized applications can be deployed on the same GPU. The scheme partitions a single GPU’s resources dynamically and improves GPU hardware utilization for online inference service scenarios.

In follow-up work, we also plan to develop a cross-host remote GPU call scheme, to solve the problem that, after GPU virtual sharing on single-machine multi-card hosts, some virtual GPU resources cannot be allocated because the imbalanced CPU/GPU ratio leaves no CPU to pair with them.

References:

1. Aliyun-gpushare: github.com/AliyunConta…

2. Nvidia vGPU: docs.nvidia.com/grid/latest…

3. Nvidia MPS: docs.nvidia.com/deploy/mps/…
