Building a Deep Learning "Alchemy Furnace" with Docker
When is this needed?
In the "alchemy" of deep learning, the frameworks most commonly used in industry, TensorFlow and PyTorch, usually rely on NVIDIA GPUs to accelerate model training. The key dependency for this parallel acceleration is NVIDIA's CUDA Toolkit.
The TensorFlow or PyTorch version that a paper's code depends on, together with its transitive dependencies, is often complicated. An Anaconda virtual environment can isolate different TensorFlow and PyTorch versions, but it cannot easily isolate different CUDA Toolkit versions. When reproducing several papers whose implementations depend on conflicting CUDA Toolkit versions, the system often has to be reinstalled, which is time-consuming and laborious.
This article shows how to build a deep learning furnace with Docker on Ubuntu or another Linux distribution, which solves the problems above and lets researchers invest their time in the more important work of algorithm and model optimization.
How it works
As long as the NVIDIA graphics driver is installed on the Linux host, users do not need to install the CUDA Toolkit itself: the CUDA Toolkit, TensorFlow, and PyTorch all live inside the Docker container, with GPU access bridged by the NVIDIA Container Toolkit.
(Figure: schematic diagram of the Docker furnace setup)
System requirements
The GPU version of the Docker furnace supports the following operating systems; in practice, that means Linux only.
Docker installation
For a more detailed walkthrough, refer to the official Docker installation documentation.
Script Installation Method
curl -fsSL get.docker.com -o get-docker.sh
sudo sh get-docker.sh --mirror Aliyun
Start Docker and enable it on boot:
sudo systemctl enable docker
sudo systemctl start docker
Create a Docker user group
By default, the docker command communicates with the Docker engine over a Unix socket, which only the root user and members of the docker group can access. For security reasons, the root user should not be used directly on Linux systems, so it is better to add any user who needs Docker to the docker group.
Create docker group:
sudo groupadd docker
Add the current user to the docker group:
sudo usermod -aG docker ${USER}
Restart Docker, then log out and back in (shown here by switching to root and back) so the new group membership takes effect:
sudo systemctl restart docker
su root
su ${USER}
Test that Docker is installed correctly:
docker run --rm hello-world
Registry mirror acceleration
- Alibaba Cloud accelerator: log in to the management console with a Taobao account, open the container mirror center, click "Mirror Accelerator", and copy the address
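As a sketch, the copied accelerator address is typically added to the registry-mirrors field of /etc/docker/daemon.json; the URL below is a placeholder, so use the address copied from your own console:

```json
{
  "registry-mirrors": ["https://<your-id>.mirror.aliyuncs.com"]
}
```

After editing the file, run sudo systemctl restart docker for the change to take effect.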
NVIDIA Container Toolkit
The NVIDIA container architecture; it is not available to Windows users
Ubuntu installation
Set up the stable repository and the GPG key:
distribution=$(. /etc/os-release; echo $ID$VERSION_ID) \
    && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
    && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
Install the nvidia-docker2 package (and dependencies) after updating the package listing:
sudo apt-get update
sudo apt-get install -y nvidia-docker2
/etc/docker/daemon.json requires the following to set the default runtime to nvidia:
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
Restart the Docker daemon to complete the installation after setting the default runtime:
sudo systemctl restart docker
At this point, a working setup can be tested by running a base CUDA container:
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
tensorflow-docker
TensorFlow installation
DockerHub hosts TensorFlow images for many versions. When reproducing code, simply docker pull the tag matching the required version.
Other dependent installations
Create a new Dockerfile and add the other dependencies, such as OpenCV, to it. After building the image with docker build, it is ready to use:
FROM tensorflow/tensorflow:1.4.0-gpu-py3

RUN pip install Keras==2.1.2 \
    && pip install numpy==1.13.3 \
    && pip install opencv-python==3.3.0.10 \
    && pip install h5py==2.7.1

RUN apt-get update \
    && apt-get install -y libsm6 \
    && apt-get install -y libxrender1 \
    && apt-get install -y libxext-dev
If the Dockerfile uses apt or similar commands to install dependencies from overseas sources, the build can be slow or even hang. Possible solutions are to route the build through a proxy (to be covered in a later article) or to build with Alibaba Cloud's image service on an overseas machine (also to be covered later).
docker build -t dockerImageName:version .
pytorch-docker
PyTorch works much like TensorFlow:
hub.docker.com/r/pytorch/p…
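As a sketch, a PyTorch image can be extended with the same Dockerfile pattern as the TensorFlow example above; the base image tag and package versions here are illustrative, so substitute whatever the paper's code actually pins:

```dockerfile
# Base image with PyTorch, CUDA, and cuDNN preinstalled (tag is an example)
FROM pytorch/pytorch:1.4-cuda10.1-cudnn7-runtime

# Extra Python dependencies (versions are examples, not from the original article)
RUN pip install numpy==1.18.1 \
    && pip install opencv-python==4.2.0.32
```

Build it the same way with docker build -t dockerImageName:version . and use the resulting image in place of the TensorFlow one.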
PyCharm: running and debugging inside Docker
Set the Python interpreter to use the Docker image
Set the run/debug configuration
--entrypoint -v /home/tml/vansin/paper/pix2code:/opt/project --rm
The configuration above mounts the local folder into a directory inside the container, so trained artifacts are saved locally rather than lost with the container.
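For reference, the PyCharm configuration above corresponds roughly to the following plain docker run invocation; the entry point train.py is a hypothetical placeholder for whatever script the project actually runs:

```shell
# Mount the local project into the container so outputs persist on the host
docker run --rm --gpus all \
    -v /home/tml/vansin/paper/pix2code:/opt/project \
    -w /opt/project \
    dockerImageName:version \
    python train.py
```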
After setting a breakpoint, you can debug as usual.
References
Docs.nvidia.com/datacenter/…
www.tensorflow.org/install/doc…