We see Kubernetes moving to the next level as companies around the world continue to adopt it. On one hand, Kubernetes is being adopted for edge workloads and is delivering value beyond the data center. On the other hand, Kubernetes is driving machine learning (ML) and high-quality, high-speed data analytics.

The story of applying Kubernetes to machine learning as we know it today stems from a feature introduced in Kubernetes 1.10, when graphics processing units (GPUs) became a schedulable resource; this feature is now in beta. Individually, edge workloads and GPU scheduling are both exciting developments in Kubernetes. Even more exciting, you can use Kubernetes with GPUs both in the data center and at the edge. In the data center, GPUs are used to train ML models. Those trained models are then pushed to edge Kubernetes clusters for ML inference, providing data analysis as close as possible to where the data is collected.
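To make "schedulable resource" concrete: once a device plugin advertises GPUs on a node, a container requests one through its resource limits. Below is a minimal sketch; the image name is only a placeholder, and the nvidia.com/gpu resource only becomes available after the driver and device plugin are installed (which the GPU Operator handles later in this article).

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-example
    spec:
      containers:
      - name: cuda
        image: nvidia/cuda:10.2-base    # placeholder; any CUDA-capable image works
        resources:
          limits:
            nvidia.com/gpu: 1           # GPUs are requested via limits only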

In its early days, Kubernetes provided a pool of CPU and RAM resources for distributed applications. If we can pool CPU and RAM, why not pool GPUs as well? That works fine, but not every server has a GPU. So how do we make our GPU-equipped servers available to Kubernetes?

In this article, I’ll explain a simple way to use GPUs in a Kubernetes cluster. In a future article, we’ll push GPUs to the edge and show you how to do that as well. To keep the steps simple, I’m going to use the Rancher UI to handle the GPU-enablement process. The Rancher UI is just a client of the Rancher RESTful APIs; you can use other API clients, such as Golang, Python, and Terraform, in GitOps, DevOps, and other automation solutions. However, we won’t go into those in depth in this article.
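As a small taste of that automation, here is a rough sketch of querying the Rancher 2.x REST API directly with curl; the server address and token are placeholders, and you should confirm the endpoints against your Rancher version’s built-in API browser:

    # List the clusters Rancher manages (v3 API, bearer-token authentication)
    curl -s -H "Authorization: Bearer $RANCHER_API_TOKEN" \
      https://rancher.example.com/v3/clusters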

In essence, the steps are simple:

  • Build the infrastructure for the Kubernetes cluster
  • Install Kubernetes
  • Install the GPU Operator with Helm (a CLI sketch follows this list)
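In this article the Helm step is performed through the Rancher catalog UI, but for reference, a rough Helm CLI equivalent is sketched below; the release and namespace names are assumptions, so check the NVIDIA GPU Operator documentation for the current chart details.

    # Add the GPU Operator chart repository and install it with Helm 3
    helm repo add nvidia https://nvidia.github.io/gpu-operator
    helm repo update
    helm install gpu-operator nvidia/gpu-operator \
      --namespace gpu-operator --create-namespace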

Get up and running with Rancher and available GPU resources

Rancher is a multi-cluster management solution and is the glue that holds the above steps together. You can find a pure NVIDIA solution that simplifies GPU management on the NVIDIA blog, along with some important information about how the GPU Operator differs from building a GPU driver stack without an operator.

(https://developer.nvidia.com/blog/nvidia-gpu-operator-simplifying-gpu-management-in-kubernetes/)

Preparation

Here is the bill of materials (BOM) required to get GPUs up and running in Rancher:

  1. Rancher
  2. GPU Operator (https://nvidia.github.io/gpu-operator)
  3. Infrastructure – We will use GPU nodes on AWS

The official documentation has a section on installing Rancher with high availability, so we’ll assume you’ve already installed Rancher:

https://docs.rancher.cn/docs/rancher2/installation/k8s-install/_index/
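For orientation, that high-availability install boils down to deploying Rancher from its Helm chart onto an existing Kubernetes cluster. A heavily simplified sketch follows; the hostname is a placeholder and prerequisites such as cert-manager are omitted, so follow the linked documentation for the full procedure.

    # Install Rancher from its Helm chart into the cattle-system namespace
    helm repo add rancher-latest https://releases.rancher.com/server-charts/latest
    kubectl create namespace cattle-system
    helm install rancher rancher-latest/rancher \
      --namespace cattle-system \
      --set hostname=rancher.example.com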

Process steps

Install a Kubernetes cluster with GPUs

After Rancher is installed, the first step is to build and configure a Kubernetes cluster (you can use any cluster with NVIDIA GPUs).

Using the Global context, we select Add Cluster

And in the “Hosts from cloud service providers” section, select Amazon EC2.

We do this through node drivers — a set of preconfigured infrastructure templates, some of which have GPU resources.

Notice that there are three node pools: one for the master, one for standard worker nodes, and one for workers with GPUs. The GPU template is based on the p3.2xlarge instance type, using the Ubuntu 18.04 Amazon Machine Image, or AMI (ami-0ac80df6eff0e70b5). Of course, these choices will vary with each infrastructure provider and with enterprise needs. We also leave the Kubernetes options on the “Add Cluster” form at their default values.

Set up the GPU Operator

Now we will use the GPU Operator repository (https://nvidia.github.io/gpu-operator) to set up a catalog in Rancher. (There are other solutions for exposing GPUs, including the Linux for Tegra [L4T] Linux distribution or device plugins.) At the time of this writing, the GPU Operator has been tested and validated with the NVIDIA Tesla driver 440.

Using the Rancher Global context menu, we select the cluster we want to install to:

Then use the Tools menu to view the catalog list.

Click the Add Catalog button, give it a name, and then add the URL: https://nvidia.github.io/gpu-operator

We choose Helm V3 and cluster scope, then click Create to add the catalog to Rancher. When using automation, this step can be part of the cluster build. Depending on enterprise policy, we can add this catalog to every cluster, even those that don’t yet have GPU nodes or node pools. This step gives us access to the GPU Operator chart, which we will install next.
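When scripting this step instead of clicking through the UI, the catalog can be created through the Rancher API. The sketch below assumes the cluster-scoped catalog endpoint and field names, so verify them in your Rancher version’s API browser before relying on them.

    # Create a cluster-scoped Helm v3 catalog pointing at the GPU Operator repo
    curl -s -X POST \
      -H "Authorization: Bearer $RANCHER_API_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"name":"gpu-operator","url":"https://nvidia.github.io/gpu-operator","helmVersion":"helm_v3","clusterId":"<cluster-id>"}' \
      https://rancher.example.com/v3/clustercatalogs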

Now we use the Rancher context menu in the upper-left corner to enter the cluster’s System project, where we will add the GPU Operator functionality.

In the System project, select Apps:

Then click the Launch button in the upper right.

We can search for “NVIDIA” or scroll down to the catalog we just created.

Click the GPU Operator app, and then click Launch at the bottom of the page.

In this case, all the defaults should be fine. Again, this step can be automated with the Rancher APIs.

Using the GPU

Now that GPUs are accessible, we can deploy GPU-capable workloads. We can also verify that the installation succeeded by viewing the Cluster -> Nodes page in Rancher. There we see that the GPU Operator has installed Node Feature Discovery (NFD) and labeled our nodes for GPU use.
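From a terminal with kubectl access to the cluster, we can check the same things on the command line. The exact label keys and namespace vary with the NFD and GPU Operator versions, so treat these commands as illustrative rather than authoritative.

    # List GPU-related labels that NFD and the GPU Operator applied to the nodes
    kubectl get nodes --show-labels | grep -i nvidia

    # Confirm that a GPU node now advertises nvidia.com/gpu in its capacity
    kubectl describe node <gpu-node-name> | grep -i -A 6 capacity

    # The operator's components run in their own namespace
    # (often gpu-operator-resources, depending on the chart version)
    kubectl get pods --all-namespaces | grep -i gpu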

Conclusion

Kubernetes can work with GPUs this simply because of three important pieces:

  1. NVIDIA’s GPU Operator
  2. Node Feature Discovery (NFD) from the Kubernetes SIG of the same name
  3. Rancher’s cluster deployment and catalog app integration

You’re welcome to try it out with this tutorial, and stay tuned as we bring GPUs to the edge in future tutorials.