An AI training program that runs fine directly on the host failed when run inside Docker. The problem was eventually traced to the container's shared memory (SHM) being too small.
Finding the problem
Start the container and enter it:
docker run -it -v /mnt/mfs/traincodes/test-20200908/V0000001/PytorchSSD/:/app -v /mnt/mfs/data/:/dataset 0f3bd9e6a0c3 bash
- In the training code, num_workers is set to 4.
Run the code; the following error is reported:
RuntimeError: DataLoader worker (pid 180) exited unexpectedly with exit code 1. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace.
Setting num_workers to 0 does avoid the multiprocessing failure, but it also slows training down considerably. Digging further, the real cause turns out to be that the shared memory (SHM) of the container is too small.
Docker's default SHM size is 64 MB, as can be seen from the /dev/shm mount:
root@ac4af7598549:/app# df -h
Filesystem Size Used Avail Use% Mounted on
overlay 17T 662G 15T 5% /
tmpfs 64M 0 64M 0% /dev
tmpfs 63G 0 63G 0% /sys/fs/cgroup
mfs#192.168.4.221:9421 87T 1.1T 86T 2% /app
/dev/sdb3 17T 662G 15T 5% /etc/hosts
shm 64M 0 64M 0% /dev/shm
tmpfs 63G 12K 63G 1% /proc/driver/nvidia
/dev/sda1 271G 162G 96G 63% /usr/bin/nvidia-smi
udev 63G 0 63G 0% /dev/nvidia0
tmpfs 63G 0 63G 0% /proc/acpi
tmpfs 63G 0 63G 0% /proc/scsi
tmpfs 63G 0 63G 0% /sys/firmware
Fault location: the shared memory is too small. With num_workers > 0, the worker processes exchange data through shared memory, and their usage exceeds the 64 MB limit, so the training program fails.
Once located, the problem is relatively easy to solve.
Solving the problem (Docker version)
Increase the SHM size with --shm-size when starting the container, and enter it:
docker run -it --shm-size 1024M -v /mnt/mfs/traincodes/test-20200908/V0000001/PytorchSSD/:/app -v /mnt/mfs/data/:/dataset 0f3bd9e6a0c3 bash
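If a hard per-container limit is not required, another way around the 64 MB default is to share the host's IPC namespace with --ipc=host, so the container uses the host's /dev/shm directly. The sketch below reuses the same image and mounts as above; note that this removes the isolation a dedicated SHM size gives you:

docker run -it --ipc=host \
    -v /mnt/mfs/traincodes/test-20200908/V0000001/PytorchSSD/:/app \
    -v /mnt/mfs/data/:/dataset \
    0f3bd9e6a0c3 bash
# --ipc=host: the container sees the host's full /dev/shm instead of a private 64 MB tmpfs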
- In the training code, num_workers is still set to 4.
Run the training again: this time the program runs successfully!
View the usage of SHM:
root@b43b495d728f:/# watch -n 1 df -h
Filesystem              Size  Used Avail Use% Mounted on
overlay                  17T  662G   15T   5% /
tmpfs                    64M     0   64M   0% /dev
tmpfs                    63G     0   63G   0% /sys/fs/cgroup
mfs#192.168.4.221:9421   87T  1.1T   86T   2% /app
/dev/sdb3                17T  662G   15T   5% /etc/hosts
shm                     1.0G  109M  916M  11% /dev/shm
tmpfs                    63G   12K   63G   1% /proc/driver/nvidia
/dev/sda1               271G  162G   96G  63% /usr/bin/nvidia-smi
udev                     63G     0   63G   0% /dev/nvidia0
tmpfs                    63G     0   63G   0% /proc/acpi
tmpfs                    63G     0   63G   0% /proc/scsi
tmpfs                    63G     0   63G   0% /sys/firmware
SHM usage is now clearly above 64 MB, peaking at more than 200 MB.
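The configured limit can also be confirmed from the host side. The container ID below is the one from the prompt above; HostConfig.ShmSize is reported in bytes:

docker inspect -f '{{.HostConfig.ShmSize}}' b43b495d728f
# should print 1073741824 (= 1024 MB); a container started with the default would show 67108864 (= 64 MB)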
Solving the problem (Kubernetes version)
The question now is: how do we set the SHM size for a Pod?
Test 1: do not set the SHM size
[root@t34 volume]# vim pod-shm.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pd-shm
spec:
  containers:
  - image: centos
    name: centos
    command: [ "sleep", "1000000" ]
    imagePullPolicy: "IfNotPresent"
    volumeMounts:
    - mountPath: /dev/shm
      name: cache-volume
  volumes:
  - emptyDir:
      medium: Memory
    name: cache-volume
- The pod mounts an emptyDir volume with medium: Memory at /dev/shm.
- Enter the pod and check the SHM size: it is the same as the SHM size of the host the pod runs on.
[root@t34 volume]# kubectl exec -it test-pd-shm bash
[root@test-pd-shm /]# df -h | grep shm
tmpfs 126G 0 126G 0% /dev/shm
[root@test-pd-shm /]#
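As a cross-check, the same numbers should be visible on the node where the pod was scheduled (host shell access assumed):

df -h /dev/shm   # on the node: should report the same size seen inside the pod
free -h          # node memory, for comparison (tmpfs defaults to half of RAM)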
Test 2: set the SHM size with sizeLimit
[root@t34 volume]# vim pod-shm.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pd-shm
spec:
  containers:
  - image: centos
    name: centos
    command: [ "sleep", "1000000" ]
    imagePullPolicy: "IfNotPresent"
    volumeMounts:
    - mountPath: /dev/shm
      name: cache-volume
  volumes:
  - emptyDir:
      medium: Memory
      sizeLimit: 128Mi
    name: cache-volume
- Enter the pod and check the SHM size: it is the same as in test 1, i.e. still the SHM size of the host the pod runs on.
[root@t34 volume]# kubectl exec -it test-pd-shm bash
[root@test-pd-shm /]# df -h | grep shm
tmpfs 126G 0 126G 0% /dev/shm
[root@test-pd-shm /]#
Doesn't the sizeLimit setting take effect? In fact, sizeLimit is not applied as a mount option on the tmpfs, which is why df still shows the host-sized mount; instead, the kubelet monitors the volume's usage and evicts the pod once the limit is exceeded. Let's verify this by writing data into /dev/shm:
[root@t34 volume]# kubectl exec -it test-pd-shm bash
[root@test-pd-shm /]# df -h | grep shm
tmpfs 126G 0 126G 0% /dev/shm
## write 100M < 128M to /dev/shm
[root@test-pd-shm /]# dd if=/dev/zero of=/dev/shm/test bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0.0859482 s, 1.2 GB/s
[root@test-pd-shm /]# df -h | grep shm
tmpfs 126G 100M 126G 1% /dev/shm
## write 200M > 128M to /dev/shm
[root@test-pd-shm /]# dd if=/dev/zero of=/dev/shm/test bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 0.146763 s, 1.4 GB/s
[root@test-pd-shm /]# df -h | grep shm
tmpfs 126G 200M 126G 1% /dev/shm
[root@test-pd-shm /]# command terminated with exit code 137
- Writing 100 MB to /dev/shm: the container keeps running normally.
- Writing 200 MB to /dev/shm: after a few seconds the container exits with code 137 (the container was killed, usually because of insufficient resources).
- Checking the pod status shows that, because SHM usage exceeded the configured 128Mi, the pod was evicted and rescheduled.
[root@t34 volume]# kubectl describe pod test-pd-shm
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 10m kubelet, t32 Container image "centos" already present on machine
Normal Created 10m kubelet, t32 Created container centos
Normal Started 10m kubelet, t32 Started container centos
Warning Evicted 9m3s kubelet, t32 Usage of EmptyDir volume "cache-volume" exceeds the limit "128Mi".
Normal Killing 9m3s kubelet, t32 Stopping container centos
Normal Scheduled 6m39s default-scheduler Successfully assigned default/test-pd-shm to t32
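Since the volume uses medium: Memory, whatever the training job writes under /dev/shm is RAM on the node and counts against the container's memory limit, so in practice it is worth sizing sizeLimit for the real workload and pairing it with an explicit memory limit. A minimal sketch; the pod name and the resource values below are illustrative, not taken from the tests above:

apiVersion: v1
kind: Pod
metadata:
  name: test-pd-shm-sized        # hypothetical name
spec:
  containers:
  - image: centos
    name: centos
    command: [ "sleep", "1000000" ]
    imagePullPolicy: "IfNotPresent"
    resources:
      limits:
        memory: 2Gi              # illustrative: SHM pages are memory, so budget for them here
    volumeMounts:
    - mountPath: /dev/shm
      name: cache-volume
  volumes:
  - emptyDir:
      medium: Memory
      sizeLimit: 1Gi             # illustrative: set to what the training job actually needs
    name: cache-volume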
Conclusion
In machine learning training and other scenarios that need to run efficiently, the SHM size should be adjusted to the actual workload. Set too small, it cannot meet the demands of efficient (or even correct) operation; set too large, it can consume too much host memory (by default, SHM is half of the host memory) and, in serious cases, lead to a cluster avalanche.
Therefore, in a production environment, these details deserve careful thought during the early cluster design phase: good design up front means fewer pits to step in and fewer losses later.
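On hosts where every training container needs more shared memory, the per-container flag can also be replaced with a daemon-wide default; a sketch, assuming you can edit /etc/docker/daemon.json (merging with any existing settings) and restart Docker:

# /etc/docker/daemon.json
{
  "default-shm-size": "1G"
}

systemctl restart docker   # restart the daemon so newly created containers pick up the default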