An AI training program that ran normally on the local host machine failed when run inside Docker. The problem was eventually traced to the container's shared memory (shm) being too small.

Finding the problem

Start the container and enter it:

docker run -it  -v /mnt/mfs/traincodes/test-20200908/V0000001/PytorchSSD/:/app -v /mnt/mfs/data/:/dataset 0f3bd9e6a0c3 bash
  • In the training code, num_workers is set to 4

Run the code; the error is as follows:

RuntimeError: DataLoader worker (pid 180) exited unexpectedly with exit code 1. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace.

Setting num_workers to 0 does work around the multi-process problem, but it also greatly slows down training, so it is not a real fix. With num_workers > 0, the DataLoader worker processes pass batches to the main process through shared memory, and further investigation showed that the shared memory (shm) available to the container is simply too small.

Docker's default shm size is 64 MB, as shown by the /dev/shm mount:

root@ac4af7598549:/app# df -h
Filesystem              Size  Used Avail Use% Mounted on
overlay                  17T  662G   15T   5% /
tmpfs                    64M     0   64M   0% /dev
tmpfs                    63G     0   63G   0% /sys/fs/cgroup
mfs#192.168.4.221:9421   87T  1.1T   86T   2% /app
/dev/sdb3                17T  662G   15T   5% /etc/hosts
shm                      64M     0   64M   0% /dev/shm
tmpfs                    63G   12K   63G   1% /proc/driver/nvidia
/dev/sda1               271G  162G   96G  63% /usr/bin/nvidia-smi
udev                     63G     0   63G   0% /dev/nvidia0
tmpfs                    63G     0   63G   0% /proc/acpi
tmpfs                    63G     0   63G   0% /proc/scsi
tmpfs                    63G     0   63G   0% /sys/firmware

Fault location: the shared memory allocation is too small. With num_workers > 0, the memory shared by the worker processes exceeds 64 MB, so the training program fails.

Once located, the problem is relatively easy to solve.
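To confirm what a running container was started with, the shm size can also be read back from Docker's metadata. A quick check, using the container ID from the session above:

# Print the shm size (in bytes) the container was created with;
# 67108864 bytes = 64 MB, Docker's default when --shm-size is not passed
docker inspect --format '{{.HostConfig.ShmSize}}' ac4af7598549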

Problem Solving (Docker version)

Increase the shm size with --shm-size, start the container, and enter it:

docker run -it --shm-size 1024M -v /mnt/mfs/traincodes/test-20200908/V0000001/PytorchSSD/:/app -v /mnt/mfs/data/:/dataset 0f3bd9e6a0c3 bash
  • In the training code, num_workers is still set to 4

Run the training again, and this time it runs successfully!

Check the shm usage:

root@b43b495d728f:/# watch -n 1 df -h
Filesystem              Size  Used Avail Use% Mounted on
overlay                  17T  662G   15T   5% /
tmpfs                    64M     0   64M   0% /dev
tmpfs                    63G     0   63G   0% /sys/fs/cgroup
mfs#192.168.4.221:9421   87T  1.1T   86T   2% /app
/dev/sdb3                17T  662G   15T   5% /etc/hosts
shm                     1.0G  109M  916M  11% /dev/shm
tmpfs                    63G   12K   63G   1% /proc/driver/nvidia
/dev/sda1               271G  162G   96G  63% /usr/bin/nvidia-smi
udev                     63G     0   63G   0% /dev/nvidia0
tmpfs                    63G     0   63G   0% /proc/acpi
tmpfs                    63G     0   63G   0% /proc/scsi
tmpfs                    63G     0   63G   0% /sys/firmware
  • shm usage is clearly above 64 MB, peaking at more than 200 MB
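As an alternative worth noting (not what was done here): if IPC-namespace isolation is not a requirement, --ipc=host lets the container use the host's /dev/shm directly, so no fixed --shm-size has to be chosen:

# Share the host's IPC namespace (and therefore the host's /dev/shm) with the container
docker run -it --ipc=host -v /mnt/mfs/traincodes/test-20200908/V0000001/PytorchSSD/:/app -v /mnt/mfs/data/:/dataset 0f3bd9e6a0c3 bash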

Problem Solving (Kubernetes version)

So the question is: how do we set the shm size in a Pod?

Test 1: Do not set the shm size

[root@t34 volume]# vim pod-shm.yaml 

apiVersion: v1
kind: Pod
metadata:
  name: test-pd-shm
spec:
  containers:
  - image: centos
    name: centos
    command: [ "sleep", "1000000" ]
    imagePullPolicy: "IfNotPresent"
    volumeMounts:
      - mountPath: /dev/shm
        name: cache-volume
  volumes:
  - emptyDir:
      medium: Memory
    name: cache-volume
  • /dev/shm is mounted as an emptyDir volume with medium: Memory
  • Enter the Pod and check the shm size; it is the same as on the host where the Pod runs
[root@t34 volume]# kubectl exec -it test-pd-shm bash 
[root@test-pd-shm /]# df -h | grep shm
tmpfs                    126G     0  126G   0% /dev/shm
[root@test-pd-shm /]# 

Test 2: Set the shm size

[root@t34 volume]# vim pod-shm.yaml 

apiVersion: v1
kind: Pod
metadata:
  name: test-pd-shm
spec:
  containers:
  - image: centos
    name: centos
    command: [ "sleep", "1000000" ]
    imagePullPolicy: "IfNotPresent"
    volumeMounts:
      - mountPath: /dev/shm
        name: cache-volume
  volumes:
  - emptyDir:
      medium: Memory
      sizeLimit: 128Mi
    name: cache-volume
  • Enter the Pod and check the shm size; it is the same as in Test 1, i.e. the shm size of the host where the Pod runs.
[root@t34 volume]# kubectl exec -it test-pd-shm bash 
[root@test-pd-shm /]# df -h | grep shm
tmpfs                    126G     0  126G   0% /dev/shm
[root@test-pd-shm /]# 

Doesn't the sizeLimit setting take effect?

The limit does not show up in the df output, but that does not mean it has no effect. To verify it, write data into /dev/shm and watch what happens:
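A quick sanity check (assuming the Pod defined above): the limit is recorded in the Pod spec even though df still reports the node's default tmpfs size.

# Read the sizeLimit back from the Pod spec (cache-volume is the only volume defined)
kubectl get pod test-pd-shm -o jsonpath='{.spec.volumes[0].emptyDir.sizeLimit}'
# expected output: 128Mi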

[root@t34 volume]# kubectl exec -it test-pd-shm bash
[root@test-pd-shm /]# df -h | grep shm
tmpfs                    126G     0  126G   0% /dev/shm

## Write 100M (< 128Mi) to /dev/shm
[root@test-pd-shm /]# dd if=/dev/zero of=/dev/shm/test bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0.0859482 s, 1.2 GB/s
[root@test-pd-shm /]# df -h | grep shm
tmpfs                    126G  100M  126G   1% /dev/shm

## Write 200M (> 128Mi) to /dev/shm
[root@test-pd-shm /]# dd if=/dev/zero of=/dev/shm/test bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 0.146763 s, 1.4 GB/s
[root@test-pd-shm /]# df -h | grep shm
tmpfs                    126G  200M  126G   1% /dev/shm
[root@test-pd-shm /]# command terminated with exit code 137
  • Write 100M to /dev/shm, and the container works fine
  • Write 200M to /dev/shm, and after a few seconds the container exits with exit code 137 (137 means the container was killed, usually because of insufficient resources)
  • Checking the Pod status shows that, because shm usage exceeded the 128Mi limit, the Pod was evicted and rescheduled.
[root@t34 volume]# kubectl describe pod test-pd-shm

...

Events:
  Type     Reason     Age    From               Message
  ----     ------     ----   ----               -------
  Normal   Pulled     10m    kubelet, t32       Container image "centos" already present on machine
  Normal   Created    10m    kubelet, t32       Created container centos
  Normal   Started    10m    kubelet, t32       Started container centos
  Warning  Evicted    9m3s   kubelet, t32       Usage of EmptyDir volume "cache-volume" exceeds the limit "128Mi".
  Normal   Killing    9m3s   kubelet, t32       Stopping container centos
  Normal   Scheduled  6m39s  default-scheduler  Successfully assigned default/test-pd-shm to t32

Conclusion

In machine learning training and other workloads that need to run efficiently, the shm size should be adjusted to the actual situation. Set it too small and it cannot meet the performance requirements; set it too large and it eats into the host's memory (by default, shm is half of the host's memory), which in severe cases can lead to a cluster avalanche.
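As a quick reference point when sizing (a simple check, run on the host itself):

# On most Linux distributions, the tmpfs mounted at /dev/shm defaults to 50% of physical RAM
df -h /dev/shm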

Therefore, in production, these settings deserve extra thought during the early cluster design phase; good design up front means fewer pitfalls and less pain later.