Zuo Yupeng

I graduated from the School of Computer Science of Beijing University of Technology in 2014 and previously worked in database operation and maintenance at a large state-owned enterprise. In March 2018, I joined the System Management Center team of the Information Technology Department of CMBC, where I now focus on the research, operation, and maintenance of the container platform and Ceph distributed storage based on Kubernetes and Docker.

1. Background

In recent years, with the rise of artificial intelligence, machine learning, and deep learning, GPUs have also developed rapidly. GPUs can greatly accelerate deep learning workloads, and the emergence and adoption of frameworks such as TensorFlow are inseparable from GPU resources. For a deep learning project, Minsheng Bank deployed a GPU-enabled Kubernetes cluster and began its Kubernetes + GPU + TensorFlow deep learning journey.

Kubernetes (K8s) has officially supported scheduling of NVIDIA GPUs since version 1.6 and AMD GPUs since version 1.9. Taking NVIDIA GPUs as an example, this article introduces how GPU nodes are deployed and used in a K8s cluster, for reference.

2. Environment introduction

OS version: SUSE 12 SP3
K8s version: 1.9
Docker version: 17.06
NVIDIA GPU model: GeForce GTX 1080 Ti
The K8s cluster has been deployed in advance, and the GPU node has already been added to the cluster.

3. Overview of the Device Plugin

For K8s 1.8 and 1.9, GPU scheduling requires enabling the DevicePlugins feature gate by setting --feature-gates=DevicePlugins=true; starting from 1.10, this parameter is no longer required. The Device Plugin is essentially a gRPC interface: device vendors only need to implement a device plugin against this interface, without modifying the K8s core code. The NVIDIA GPU device plugin requires GPU nodes in the K8s cluster to meet the following requirements:

nvidia-docker 2.0 is installed
the Docker default runtime is configured as nvidia-container-runtime instead of runc
the NVIDIA driver version is above 361.93
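Once the driver and runtime setup in the following sections is complete, these requirements can be spot-checked on a GPU node, for example (a minimal sketch; output formats vary by driver and Docker version):

# nvidia-smi --query-gpu=driver_version --format=csv,noheader
# docker info | grep -i 'default runtime'

The first command prints the installed driver version, and the second should report nvidia as the default runtime once section 5 has been completed.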

4. GPU node driver installation

1. Confirm that the host has an NVIDIA graphics card:

# lspci | grep -i nvidia
04:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1)
04:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)

2. Install gcc and make:

# zypper install -y gcc make

3. Install the kernel devel packages:

# zypper install -y kernel-<variant>-devel=<version>

The variant and version must match those of the currently running kernel. In this environment the variant is default and the version is 4.4.92-6.18, so install the kernel-default-devel package:

# zypper install -y kernel-default-devel=4.4.92-6.18

and the kernel-devel package:

# zypper install -y kernel-devel=4.4.92-6.18
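To confirm the variant and version of the running kernel before installing, a quick check is:

# uname -r

On SUSE the output typically ends with the kernel variant (for example a trailing -default), and the part before it is the version string to pass to zypper above.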

4. Download the driver from NVIDIA's official website. Log in to https://www.nvidia.com/download/index.aspx, set the search conditions, and download the corresponding driver.

The driver downloaded here is NVIDIA-Linux-x86_64-390.87.run. Install it:

# sh NVIDIA-Linux-x86_64-390.87.run -a -s -q

Run nvidia-smi to verify the installation; if the GPU and driver information is displayed, the installation is successful.

5. Nvidia-docker installation and configuration

To make NVIDIA GPUs easier to use in Docker, NVIDIA provides its own nvidia-docker tool as a wrapper around native Docker. The layered relationship through which nvidia-docker lets Docker containers use GPU resources is shown in the figure below:

nvidia-docker makes it easier for Docker to use GPU resources. To date, it has gone through two major iterations, nvidia-docker and nvidia-docker2, with nvidia-docker2 adding further usability and architecture improvements. nvidia-docker2 currently supports Ubuntu 14.04/16.04/18.04, Debian Jessie/Stretch, CentOS 7, RHEL 7.4/7.5, and Amazon Linux 1/2; see https://nvidia.github.io/nvidia-docker for installation instructions and more information.

Since nvidia-docker2 does not currently support the SUSE operating system, in order to let Docker on SUSE use GPU resources, we recompiled the relevant nvidia-container components to be compatible with the Docker version in our environment, achieving the same effect as nvidia-docker2.

1. Install nvidia-container-runtime:

# rpm -ivh libnvidia-container1-1.0.0-0.1.alpha.3.x86_64.rpm
# rpm -ivh libnvidia-container-tools-1.0.0-0.1.alpha.3.x86_64.rpm
# rpm -ivh nvidia-container-runtime-1.1.1-99.docker17.06.2.x86_64.rpm
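As an optional sanity check (a sketch, assuming the libnvidia-container-tools package installs the nvidia-container-cli binary), the newly installed components can be asked to report the driver and GPUs they see:

# nvidia-container-cli info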

2. Edit the /etc/docker/daemon.json file and add the following content:

{ "default-runtime": "nvidia", "runtimes": { "nvidia": { "path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": []}}}Copy the code

6. Enable DevicePlugins in K8s

1. Add the feature-gate launch parameter to kubelet on every master node and GPU node. In the /etc/systemd/system/kubelet.service.d/10-kubeadm.conf file, add the following configuration:

Environment="KUBELET_FEATURE_GATES_ARGS=--feature-gates=DevicePlugins=true"

Restart kubelet for the parameters to take effect:

# systemctl daemon-reload
# systemctl restart kubelet

Check whether the new parameters take effect

# ps -ef|grep kubelet

2. Add the launch parameter to kube-apiserver, kube-controller-manager, and kube-scheduler on every master node. On each node, modify the three files kube-apiserver.yaml, kube-controller-manager.yaml, and kube-scheduler.yaml in the /etc/kubernetes/manifests directory, adding the following content to the command list of each:

- --feature-gates=DevicePlugins=true

Check whether the new parameters take effect

# ps -ef|grep apiserver
 # ps -ef|grep controller
# ps -ef|grep scheduler

7. Deploy the nvidia-device-plugin

Create the nvidia-device-plugin.yaml file and add the following content:

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      containers:
      - image: nvidia/k8s-device-plugin:1.9
        name: nvidia-device-plugin-ctr
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins

Run the following command on the master node to deploy the nvidia-device-plugin

# kubectl apply -f nvidia-device-plugin.yaml

Viewing the Deployment

# kubectl get pod -n kube-system|grep nvidia
nvidia-device-plugin-mlp3com 1/1 Running 29 7d

On the master node, check whether K8s can identify the GPU resources:

# kubectl describe node

If the node's resources list nvidia.com/gpu with a non-zero count, the configuration is successful; if no GPU resource is displayed or its count is 0, the configuration has failed. At this point, the K8s cluster is able to schedule GPU resources onto the node.
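For a quicker check than reading the full describe output, something like the following can be used (a sketch; replace the placeholder with the actual GPU node name):

# kubectl describe node <gpu-node-name> | grep -i 'nvidia.com/gpu'

Both the Capacity and Allocatable sections should list nvidia.com/gpu with the number of GPUs on the node.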

8. TensorFlow application deployment test

To further verify that K8s can successfully schedule GPUs, we deployed a GPU-based TensorFlow application as a simple test. The following is the application deployment configuration file gpu-test.yaml:

apiVersion: v1
kind: Service
metadata:
  name: cmbc-serving
  labels:
    app: tensorflow-serving
spec:
  type: NodePort
  ports:
  - name: http-serving
    port: 5000
    targetPort: 5000
  selector:
    app: tensorflow-serving
---
# Source: tensorflow-serving/templates/deployment.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: cmbc-serving
  labels:
    app: tensorflow-serving
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: tensorflow-serving
    spec:
      hostNetwork: true
      containers:
      - name: serving
        image: "fast-style-transfer-serving:la_muse"
        imagePullPolicy: "IfNotPresent"
        command: ["python", "app.py"]
        ports:
        - containerPort: 5000
          name: http-serving
        resources:
          limits:
            nvidia.com/gpu: 1

Run the following command to create an application:

# kubectl apply -f gpu-test.yaml

Now run the nvidia-smi command on the GPU node to check GPU usage. As shown in the following figure, K8s automatically selects a GPU node on which to deploy the application, and the application successfully uses GPU resources to process external access requests.
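To confirm that the Service is reachable from outside the cluster, one simple probe (a sketch; the actual URL path and request format depend on the API exposed by app.py, which is not shown here) is to look up the assigned NodePort and curl it:

# kubectl get svc cmbc-serving
# curl http://<gpu-node-ip>:<nodeport>/

The first command shows the NodePort that K8s assigned to the http-serving port; substitute it, together with any node's IP, into the curl command.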

9. Summary

So far, this article has introduced the main process of using K8s to schedule GPUs and deploy a TensorFlow application. There are still many areas to explore further, such as TensorFlow model training on K8s and how K8s can support GPU affinity. All in all, this is only our preliminary exploration, and we will share more in the future as our work progresses.

Source: Minsheng Operations and Maintenance