Building a Deep Learning Furnace with Docker

When is this needed?

In the "alchemy" of deep learning, the frameworks commonly used in industry, TensorFlow and PyTorch, usually rely on NVIDIA GPUs to accelerate model training. The key dependency for this parallel acceleration is NVIDIA's CUDA Toolkit package.

The TensorFlow or PyTorch version that a paper's code depends on, together with its other dependencies, is often complicated. Anaconda virtual environments can isolate different TensorFlow and PyTorch versions, but they cannot easily isolate different CUDA Toolkit versions. When reproducing several papers whose implementations require conflicting CUDA Toolkit versions, you often have to reinstall the whole system, which is time-consuming and laborious.

This article shows how building the deep learning furnace with Docker on Ubuntu or another Linux distribution solves these problems, so that researchers can invest their time in what matters more: algorithm and model optimization.

How it works

As long as the graphics driver is installed on the Linux host, there is no need to install the CUDA Toolkit on the host itself. The CUDA Toolkit, TensorFlow, and PyTorch all live inside the Docker container.
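
A quick sanity check, assuming the NVIDIA driver is already installed on the host:

nvidia-smi       # provided by the driver; should list your GPUs
nvcc --version   # may fail on the host: the CUDA Toolkit lives in the container, not here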

[Figure: NVIDIA Container Toolkit architecture]

[Figure: schematic of the Docker furnace]

System requirements

The GPU version of the Docker furnace supports only a handful of operating systems; in practice, that means Linux.

Docker installation

For a more detailed walkthrough, refer to the official Docker installation documentation.

Script installation method

curl -fsSL get.docker.com -o get-docker.sh
sudo sh get-docker.sh --mirror Aliyun

Start Docker

sudo systemctl enable docker
sudo systemctl start docker

Create a Docker user group

By default, the docker command talks to the Docker engine through a Unix socket, and only root and members of the docker group can access that socket. For security reasons, you should not work directly as root on a Linux system, so it is better to add any user who needs Docker to the docker group.

Create docker group:

sudo groupadd docker

Add the current user to the docker group:

sudo usermod -aG docker ${USER}

Then restart Docker and re-enter the user session so the new group membership takes effect:

sudo systemctl restart docker
su root
su ${USER}

Test that Docker is installed correctly

docker run --rm hello-world
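
If the installation and group setup succeeded, this command runs without sudo and prints a greeting from Docker. To double-check that the group change took effect in the current session:

id -nG    # the output should include "docker"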

Registry mirror acceleration

  • Aliyun accelerator: open the management console, log in with your account (Taobao account), go to Mirror Center -> Mirror Accelerator, and copy the address; a sample configuration follows below.
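
A minimal /etc/docker/daemon.json entry, with a placeholder address to replace with the one copied from your console (merge the key into the file if daemon.json already exists):

{
    "registry-mirrors": ["https://<your-id>.mirror.aliyuncs.com"]
}

Restart Docker afterwards for the mirror to take effect:

sudo systemctl restart docker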

NVIDIA Container Toolkit

The NVIDIA container architecture; it is not supported on Windows.

Ubuntu installation

Set up the stable repository and the GPG key:

distribution=$(. /etc/os-release; echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

Install the nvidia-docker2 package (and dependencies) after updating the package listing:

sudo apt-get update
sudo apt-get install -y nvidia-docker2

Set the default runtime to nvidia by adding the following to /etc/docker/daemon.json:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

Restart the Docker daemon to complete the installation after setting the default runtime:

sudo systemctl restart docker

At this point, a working setup can be tested by running a base CUDA container:

sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
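
If everything is wired up correctly, the container prints the familiar nvidia-smi table showing the host's GPUs and driver version.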

tensorflow-docker

TensorFlow installation

Docker Hub hosts many TensorFlow image versions. When reproducing a paper's code, simply docker pull the version it requires, for example as shown below.
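
A sketch of pulling and smoke-testing a specific GPU tag (the tag here is illustrative; pick whichever version the paper needs):

docker pull tensorflow/tensorflow:1.4.0-gpu-py3
docker run --gpus all --rm tensorflow/tensorflow:1.4.0-gpu-py3 \
    python -c "import tensorflow as tf; print(tf.__version__)"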

Installing other dependencies

Create a new Dockerfile and declare the other dependencies, such as OpenCV, in it. Once the image is built with docker build, it is ready to use:

FROM tensorflow/tensorflow:1.4.0-gpu-py3

RUN pip install Keras==2.1.2 \
    && pip install numpy==1.13.3 \
    && pip install opencv-python==3.3.0.10 \
    && pip install h5py==2.7.1

RUN apt-get update \
    && apt-get install -y libsm6 \
    && apt-get install -y libxrender1 \
    && apt-get install -y libxext-dev

If the Dockerfile installs dependencies from foreign sources via apt or similar commands, the build can be slow or even hang. Possible workarounds are routing through a proxy or building with Aliyun's image build service on an overseas machine (both to be covered in future articles).

docker build -t dockerImageName:version .
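
To train inside the resulting image while keeping code and outputs on the host, a typical invocation looks like this (the project path and script name are placeholders):

docker run --gpus all --rm -it \
    -v /path/to/project:/workspace \
    dockerImageName:version \
    python /workspace/train.py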

pytorch-docker

The PyTorch setup is similar to TensorFlow's; official images are on Docker Hub:

hub.docker.com/r/pytorch/p…
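
The flow mirrors the TensorFlow one. A sketch with an illustrative tag (check the Docker Hub page above for the exact version you need):

docker pull pytorch/pytorch:1.4-cuda10.1-cudnn7-runtime
docker run --gpus all --rm pytorch/pytorch:1.4-cuda10.1-cudnn7-runtime \
    python -c "import torch; print(torch.cuda.is_available())"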

Debugging and running Docker from PyCharm

Set the Python interpreter to the Docker image

Set the Run/Debug configuration

--entrypoint -v /home/tml/vansin/paper/pix2code:/opt/project --rm

The configuration above mounts the local folder into the container, so the trained artifacts are saved locally rather than inside the container. The equivalent plain docker run command is sketched below.
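
For reference, outside PyCharm the same mount looks roughly like this (the image name and script are hypothetical):

docker run --rm \
    -v /home/tml/vansin/paper/pix2code:/opt/project \
    <image> python /opt/project/main.py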

Set a breakpoint and you can debug as usual.

References

docs.nvidia.com/datacenter/…

www.tensorflow.org/install/doc…