As a personal note, I will keep updating this post to record the problems and solutions I have encountered while using PyTorch. If I am lucky enough to have an audience, please leave a comment and let's make progress together.

Installation problems

Conda Tsinghua Mirror

# Tsinghua mirror
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
# add the next line to use conda-forge packages
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/
# add the next line if you need PyTorch
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/
conda config --set show_channel_urls yes
# show the config
conda config --show
# delete a channel
conda config --remove channels https://pypi.doubanio.com/simple/
# restore defaults (not tried)
conda config --remove-key channels

To install PyTorch, first look for commands from the PyTorch website.

conda install pytorch torchvision cudatoolkit=10.1 -c pytorch

Since -c pytorch tells conda to download from PyTorch's own channel, remove -c pytorch so that the mirror is used instead.

username is not in the sudoers file

This can occur when running commands other than conda and pip under a non-administrator account. Of the solutions I found, the following was the most reliable and convenient.

# switch to root
su root
# replace username with your own account name
adduser username sudo

no space left on device

The most common case on a long-running server is that /tmp has filled up.

# temporary fix: only affects the current shell
export TMPDIR=/home/username/tmp
# "permanent" fix: add the export line to ~/.bashrc, then reload it
# (not strictly necessary, since /tmp is cleaned up after a reboot anyway)
source ~/.bashrc
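Python's tempfile module honors TMPDIR, so the change can be verified from Python. A minimal sketch (the home directory stands in for any writable directory with free space):

```python
import os
import tempfile

# point TMPDIR at an existing writable directory (path is illustrative)
home_tmp = os.path.expanduser("~")
os.environ["TMPDIR"] = home_tmp

# clear tempfile's cached choice so TMPDIR is re-read
tempfile.tempdir = None
print(tempfile.gettempdir())
```

Note that tempfile caches its choice on first use, so a TMPDIR set after the first temporary file was created is ignored unless the cache is reset.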

conda install opencv-python failed

When installing opencv-python with conda, there are often environment conflicts or packaging problems, e.g. No local packages or working download links found for opencv-python.

# if this keeps failing after several attempts, install with pip directly;
# to be safe, first check that pip belongs to the active conda env
which pip
pip install opencv-python

Possible cause? Not sure; I'll look into it when I have time.
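The same check can be done from inside Python (stdlib only). This sketch shows which interpreter is running and which pip a plain `pip install` would invoke; they should live in the same environment:

```python
import shutil
import sys

# the interpreter that is actually running
print(sys.executable)
# the pip that "pip install" would invoke (None if pip is not on PATH)
print(shutil.which("pip"))
# in a conda env or venv, sys.prefix points at that environment
print(sys.prefix)
```

If `which("pip")` resolves outside `sys.prefix`, the package would be installed into a different environment than the one you are running.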

Version problems

CUDNN_STATUS_EXECUTION_FAILED

Description: RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED when running a YOLO implementation from GitHub.

Disable the GPU and run the code on the CPU to check whether the code itself is at fault. Once the code runs correctly on the CPU, the problem can be narrowed down to the versions of CUDA, cuDNN, VS, Python, and PyTorch.

nvcc -V shows the CUDA version; use it to confirm that your PyTorch build matches the installed CUDA. A mirror can be used to speed up reinstallation.
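A sketch of checking the relevant versions from Python, guarded so it also runs on machines where nvcc or torch happens to be absent:

```python
import importlib.util
import shutil
import subprocess

# CUDA toolkit version, if nvcc is on PATH
if shutil.which("nvcc"):
    result = subprocess.run(["nvcc", "-V"], capture_output=True, text=True)
    print(result.stdout)

# PyTorch's own view of its CUDA/cuDNN versions, if torch is installed
if importlib.util.find_spec("torch"):
    import torch
    print(torch.__version__, torch.version.cuda, torch.backends.cudnn.version())
    print("CUDA available:", torch.cuda.is_available())
```

If `torch.version.cuda` disagrees with what nvcc reports, the mismatch is a likely source of CUDNN_STATUS_EXECUTION_FAILED.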

Code Snippet

Initialization of BatchNorm

def weights_init_normal(m):
    classname = m.__class__.__name__
    if classname.find("Conv") != -1:
        torch.nn.init.normal_(m.weight.data, 0.0, 0.02)
    elif classname.find("BatchNorm2d") != -1:
        torch.nn.init.normal_(m.weight.data, 1.0, 0.02)
        torch.nn.init.constant_(m.bias.data, 0.0)

References: PyTorch Forum, PyTorch Tutorial, Github Issue

For comparison, PyTorch's default BatchNorm weight initialization at the time was init.uniform_(self.weight).
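The initializer above is meant to be applied recursively with Module.apply. A self-contained sketch (the two-layer model is illustrative; assumes a working torch install):

```python
import torch
import torch.nn as nn

def weights_init_normal(m):
    # same initializer as above: N(0, 0.02) for Conv weights,
    # N(1, 0.02) / constant 0 for BatchNorm2d weights / biases
    classname = m.__class__.__name__
    if classname.find("Conv") != -1:
        torch.nn.init.normal_(m.weight.data, 0.0, 0.02)
    elif classname.find("BatchNorm2d") != -1:
        torch.nn.init.normal_(m.weight.data, 1.0, 0.02)
        torch.nn.init.constant_(m.bias.data, 0.0)

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16))
model.apply(weights_init_normal)  # visits every submodule
print(model[1].bias.abs().sum().item())  # 0.0 after constant_ init
```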

Loading a GPU-trained checkpoint on the CPU

The model was trained and checkpointed on the GPU. If you call torch.load directly on a machine with only a CPU, you may get an error:

raise AssertionError("Torch not compiled with CUDA enabled")

Use:

net.load_state_dict(torch.load('model.pt', map_location=lambda storage, loc: storage))

The lambda passed as map_location receives each CPU storage and its serialized location, and returns the storage to use in its place; returning it unchanged keeps everything on the CPU.
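A minimal round-trip sketch (assumes torch; uses an in-memory buffer instead of a checkpoint file):

```python
import io
import torch

# save a tiny "checkpoint" and load it back onto the CPU
buf = io.BytesIO()
torch.save({"w": torch.randn(2, 2)}, buf)
buf.seek(0)

# equivalent to map_location=lambda storage, loc: storage
state = torch.load(buf, map_location="cpu")
print(state["w"].device)  # cpu
```

Passing the string "cpu" is the common shorthand for the lambda form shown in the snippet above.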

Advanced (rarely used) features

non_blocking

doc

As an exception, several functions such as to() and copy_() admit an explicit non_blocking argument, which lets the caller bypass synchronization when it is unnecessary.

pytorch discuss

You can use non_blocking to overlap the CPU-to-GPU data transfer with GPU kernel execution, under two conditions: 1. the DataLoader uses pin_memory; 2. the training loop does not copy results back to the CPU, i.e. there is no other sync point.
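A sketch of the transfer pattern (assumes torch; falls back to a plain copy on CPU-only machines, since pinning host memory requires CUDA):

```python
import torch

def to_device_async(batch, device):
    # non_blocking only helps when the source tensor is in pinned memory;
    # in a real loop you would set pin_memory=True on the DataLoader instead
    if device.type == "cuda":
        return batch.pin_memory().to(device, non_blocking=True)
    return batch.to(device)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = to_device_async(torch.randn(4, 3), device)
print(x.device)
```

With non_blocking=True the `.to()` call returns immediately and the copy proceeds in the background; any later operation that reads the result on the CPU becomes a synchronization point.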