This article records problems encountered when using Docker and their solutions.

1. Migrating the Docker storage directory

By default, Docker stores its containers and images under /var/lib/docker.

Cause of the problem: Today the monitoring system flagged the disk usage of one of our servers, so I went to check it and found that the /var/lib/docker directory was very large. /var/lib/docker contains container-specific storage, so you can't simply delete it.

The plan is to migrate the Docker storage directory, or to expand the /var device for the same effect. For details about the parameters of dockerd, see the official documentation.

However, be careful with soft links: some Docker container orchestration systems do not support them, such as the well-known K8S.

# error message when the directory fills up
ERROR: Cannot create temporary directory!
# check which subdirectories take the most space
$ du -h --max-depth=1

Solution 1: Add a soft link

# 1. Stop the docker service
$ sudo systemctl stop docker
# 2. Move the storage directory
$ sudo mv /var/lib/docker /data/
# 3. Add a soft link
$ sudo ln -s /data/docker /var/lib/docker
# 4. Start the docker service
$ sudo systemctl start docker

Solution 2: Modify the Docker configuration file

# 1. Stop the docker service
$ sudo systemctl stop docker
# 2. Change the docker startup configuration file
$ sudo vim /lib/systemd/system/docker.service
ExecStart=/usr/bin/dockerd --graph=/data/docker/
# 3. Or change the daemon configuration file instead
$ sudo vim /etc/docker/daemon.json
{
    "live-restore": true,
    "graph": "/data/docker/"
}
# 4. Reload systemd and start the docker service
$ sudo systemctl daemon-reload && sudo systemctl start docker
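After either solution, it is worth verifying that dockerd actually picked up the new directory. A minimal check; the "Docker Root Dir" field name is taken from recent docker versions:

# confirm the storage location dockerd is using
$ docker info | grep -i "root dir"
 Docker Root Dir: /data/docker
# confirm existing containers and images survived the move
$ docker ps -a && docker images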

Operation precautions: Pay attention to the commands used when migrating the docker directory. Either move it directly with mv, or copy it with cp, but in the latter case make sure to preserve file permissions and attributes, otherwise permission problems may appear at runtime. If everything in the container also runs as root, this problem does not arise, but the directory still needs to be migrated correctly.

# copy while preserving permissions and attributes
$ sudo cp -arv /data/docker /data2/docker

Consider, for example, a container that runs its process as an ordinary user and needs the /tmp directory. When we import the container image, the permissions and attributes of every directory the container needs at startup are restored with it. If we simply copy the files with a plain cp, the attributes will be inconsistent, and there will be certain security problems.
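A minimal sketch of the difference, using a hypothetical file owned by an ordinary user (uid 1006); paths and timestamps are illustrative:

# plain cp -r resets ownership to the invoking user; cp -a preserves it
$ sudo mkdir -p /data2/plain /data2/preserved
$ ls -l test.txt
-rw-r--r-- 1 1006 1006 10 Jul 29 14:59 test.txt
$ sudo cp -r test.txt /data2/plain/ && ls -l /data2/plain/test.txt
-rw-r--r-- 1 root root 10 Nov 12 15:25 test.txt    # owner lost -> permission errors in the container
$ sudo cp -a test.txt /data2/preserved/ && ls -l /data2/preserved/test.txt
-rw-r--r-- 1 1006 1006 10 Jul 29 14:59 test.txt    # owner, mode and timestamps preserved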

2. Docker insufficient device space

Increase the Docker container size from the default 10 GB on RHEL7.

Cause 1: When a container is being imported or started, a message about insufficient disk space is probably caused by a physical disk space problem. As shown below, the / partition is indeed full.

$ df -h
Filesystem  Size  Used  Avail  Use%  Mounted on
/dev/vda1   40G   40G   0G     100%  /
tmpfs       7.8G  0     7.8G   0%    /dev/shm
/dev/vdb1   493G  289G  179G   62%   /mnt

If the physical disk space really is full, you need to look at what is taking up so much space that the container cannot start. Docker's built-in commands are a good tool to help find the problem.

# The storage driver is devicemapper and the storage pool is docker-252:1-787932-pool;
# only 16.78 MB of data space is still available for us to use
$ docker info
Containers: 1
Images: 28
Storage Driver: devicemapper
 Pool Name: docker-252:1-787932-pool
 Pool Blocksize: 65.54 kB
 Backing Filesystem: extfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 1.225 GB
 Data Space Total: 107.4 GB
 Data Space Available: 16.78 MB
 Metadata Space Used: 2.073 MB
 Metadata Space Total: 2.147 GB

Solution: After checking the information, we know that docker does not have enough disk space to load the boot image. The fix is simple: either clean up invalid data files to free disk space (clear logs), or move the docker data directory to a larger partition.

# find which container's log directory is the largest
$ du -d1 -h /var/lib/docker/containers | sort -h
# truncate the log file of the chosen container
$ cat /dev/null > /var/lib/docker/containers/container_id/container_log_name

Cause 2: What I encountered, however, was not the previous situation. Shortly after the container started, it showed an unhealthy state. According to the log below, copying a configuration file at startup failed because of insufficient disk space.

The default base size of a Docker container on CentOS7 is 10 GB. The container we used exceeded this limit, so it ran out of space and failed to start.

Spawned: 'app-demo' with pid 835
2019-08-16 11:11:15,268 INFO exited: app (exit status 1; not expected)
2019-08-16 11:11:17,270 INFO gave up: app entered FATAL state, too many start retries too quickly
cp: cannot create regular file '/etc/supervisor/conf.d/grpc-app-demo.conf': No space left on device
cp: cannot create regular file '/etc/supervisor/conf.d/grpc-app-demo.conf': No space left on device
cp: cannot create regular file '/etc/supervisor/conf.d/grpc-app-demo.conf': No space left on device
cp: cannot create regular file '/etc/supervisor/conf.d/grpc-app-demo.conf': No space left on device

Solution 1: Modify the Docker startup configuration file

# /etc/docker/daemon.json
{
    "live-restore": true,
    "storage-opt": [ "dm.basesize=20G" ]
}


Solution 2: Modify the systemd docker service file

# 1. stop the docker service
$ sudo systemctl stop docker
# 2. remove the existing containers
$ sudo rm -rf /var/lib/docker
# 3. edit your docker service file
$ sudo vim /usr/lib/systemd/system/docker.service
# 4. find the execution line
ExecStart=/usr/bin/dockerd
# and change it to:
ExecStart=/usr/bin/dockerd --storage-opt dm.basesize=20G
# 5. reload the systemd daemon
$ sudo systemctl daemon-reload
# 6. start the docker service again
$ sudo systemctl start docker
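To confirm the new base size took effect, a quick check, assuming the devicemapper storage driver is in use; the sizes shown are illustrative:

# devicemapper reports the base device size in docker info
$ docker info | grep -i "base device size"
 Base Device Size: 21.47 GB
# or check the root filesystem size from inside a fresh container
$ docker run --rm centos:7 df -h /
Filesystem  Size  Used  Avail  Use%  Mounted on
/dev/dm-5   20G   254M  20G    2%    /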

Cause 3: In yet another case, the container could not start and reported insufficient disk space, but checking showed that the physical disk was not full at all. Instead, the partition had run out of inodes.

No space left on device

Ext3 file systems use inode tables to store inode information, while XFS file systems use B+ trees. By default, the B+ tree uses only the first 1 TB of space for performance reasons. Once that 1 TB is used up, inode information can no longer be written, and an error about insufficient disk space is reported. We can specify inode64 at mount time to extend the space used by the B+ tree to the entire file system.

# check inode usage on the system
$ sudo df -i
# try remounting with inode64
$ sudo mount -o remount -o noatime,nodiratime,inode64,nobarrier /dev/vda1

A file is stored on a hard disk in units called sectors, each holding 512 bytes (0.5 KB). When the operating system reads a disk, it does not read sector by sector, which would be inefficient; instead it reads several consecutive sectors at once, that is, one block at a time. This block, composed of multiple sectors, is the smallest unit of file access; the most common size is 4 KB, i.e., eight consecutive sectors form one block. File data is stored in blocks, so obviously we also need somewhere to store a file's meta-information, such as who created it, the date it was created, the size of the file, and so on. This area of file meta-information is called an inode. Every file has an inode, which contains all information about the file except its name.

Inodes also consume disk space, so when a disk is formatted, the operating system automatically divides it into two areas: one is the data area, which stores file data; the other is the inode table, which stores inode information. Each inode is typically 128 or 256 bytes. The total number of inodes is fixed at format time, typically one inode per 1 KB or 2 KB of disk.
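The effect is easy to reproduce on a small test filesystem: bytes can remain free while the inode table fills up. A sketch; the image size, mount point, and file counts are chosen arbitrarily for illustration:

# build a tiny ext3 filesystem and exhaust its inodes with empty files
$ dd if=/dev/zero of=/tmp/inode-test.img bs=1M count=10
$ mkfs.ext3 -F /tmp/inode-test.img
$ sudo mkdir -p /mnt/inode-test && sudo mount -o loop /tmp/inode-test.img /mnt/inode-test
$ cd /mnt/inode-test && sudo touch file-{1..3000}
touch: cannot touch 'file-2513': No space left on device   # inodes exhausted, bytes still free
$ df -h /mnt/inode-test; df -i /mnt/inode-test              # Use% is low, IUse% is 100%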

$ stat check_port_live.sh
  File: check_port_live.sh
  Size: 225        Blocks: 8          IO Block: 4096   regular file
Device: 822h/2082d Inode: 99621663    Links: 1
Access: (0755/-rwxr-xr-x)  Uid: ( 1006/  escape)   Gid: ( 1006/  escape)
Access: 2019-07-29 14:59:59.498076903 +0800
Modify: 2019-07-29 14:59:59.498076903 +0800
Change: 2019-07-29 23:20:27.834866649 +0800
 Birth: -

$ df -i
Filesystem   Inodes     IUsed    IFree      IUse%  Mounted on
udev         16478355   801      16477554   1%     /dev
tmpfs        16487639   2521     16485118   1%     /run
/dev/sdc2    244162560  4788436  239374124  2%     /
tmpfs        16487639   5        16487634   1%     /dev/shm

3. Docker missing a shared library

The docker command requires access to the /tmp directory.

Cause: After installing docker-compose on the system, checking the compose version reported a missing shared library, libz.so.1. My first reaction was that some system package was missing, so I searched and installed all the dependency packages, but the same problem persisted.

$ docker-compose --version
error while loading shared libraries: libz.so.1: failed to map segment from shared object: Operation not permitted

Solution: Later, it turned out that docker on this system did not have execute access to the /tmp directory, so remounting it solves the problem.

$ sudo mount /tmp -o remount,exec
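You can confirm whether /tmp was the culprit by checking its mount options before and after; the output below is illustrative:

# "noexec" in the options prevents binaries unpacked under /tmp from being mapped and executed
$ mount | grep /tmp
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,noexec,relatime)
$ sudo mount /tmp -o remount,exec
$ mount | grep /tmp
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,relatime)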

4. Docker container files damaged

The configuration of dockerd may affect system stability.

Cause: Corrupted container files often make a container impossible to operate: normal docker commands can no longer control it, and it cannot be stopped, restarted, or deleted. As it happens, I ran into this problem the day before yesterday, mainly because the storage underlying docker's default container data had been reallocated.

b'devicemapper: Error running deviceCreate (CreateSnapDeviceRaw) dm_task_run failed'

Solution: You can delete and rebuild the container files by doing the following.

# 1. Stop the docker service
$ sudo systemctl stop docker
# 2. Delete the container files
$ sudo rm -rf /var/lib/docker/containers
# 3. Repair the container metadata
$ sudo thin_check /var/lib/docker/devicemapper/devicemapper/metadata
$ sudo thin_check --clear-needs-check-flag /var/lib/docker/devicemapper/devicemapper/metadata
# 4. Start the docker service
$ sudo systemctl start docker

5. Restarting the Docker daemon gracefully

How nice it would be to restart the dockerd service without stopping the containers running on the server!

Cause: By default, when the Docker daemon terminates, it shuts down running containers. Starting with Docker CE 1.12, you can add the live-restore parameter to the configuration file so that containers remain running when the daemon becomes unavailable. Note that the Windows platform does not support this parameter.

# Keep containers alive during daemon downtime
$ sudo vim /etc/docker/daemon.json
{
    "live-restore": true
}
# or pass the flag on the command line
$ sudo dockerd --live-restore
# reload the configuration only (sends a SIGHUP signal to the dockerd daemon)
$ sudo systemctl reload docker
# or restart the service
$ sudo systemctl restart docker

Solution: Enable live-restore in the dockerd configuration as follows (the inline comments are for illustration only; JSON itself does not support comments).

# /etc/docker/daemon.json
{
    "registry-mirrors": ["https://vec0xydj.mirror.aliyuncs.com"],  # mirror registry address
    "experimental": true,                # enable experimental features
    "default-runtime": "nvidia",         # default OCI runtime for containers (default is runc)
    "live-restore": true,                # restart dockerd without terminating containers
    "runtimes": {                        # configure container runtimes
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-address-pools": [           # subnet address pools for containers
        {
            "scope": "local",
            "base": "172.17.0.0/12",
            "size": 24
        }
    ]
}
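With live-restore enabled, containers should survive a daemon restart; a simple before/after check, with a hypothetical container name:

# note the container uptime, restart dockerd, then confirm the uptime was not reset
$ docker ps --format '{{.Names}}: {{.Status}}'
web: Up 2 hours
$ sudo systemctl restart docker
$ docker ps --format '{{.Names}}: {{.Status}}'
web: Up 2 hours        # the container was not recreated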

6. Docker container cannot be deleted

Not finding the corresponding container process is the scariest part

Cause: Today a docker container could not be stopped/terminated/deleted. I thought the container might be hosted by the dockerd daemon, but I could not find the corresponding running process with ps -ef either. Alas, checking the supervisor and the processes from the Dockerfile turned up nothing as well. The likely reason is that after the container started, the host was restarted for whatever reason and the container was not terminated gracefully. The leftover files now prevent regenerating a new container with the old name, because the system believes the old container still exists.

$ sudo docker rm -f f8e8c3..
Error response from daemon: Conflict, cannot remove the default name of the container

Solution: find the corresponding container folder under /var/lib/docker/containers/, delete it, and then restart dockerd. The container that could not be deleted before is gone.

# delete the container files
$ sudo rm -rf /var/lib/docker/containers/f8e8c3...
# restart the docker service
$ sudo systemctl restart docker.service

7. Chinese characters abnormal in a Docker container

If there is a problem with the container, remember to check the official website first

Cause of the problem: Today I logged in to the previously deployed MySQL database and found that Chinese fields could not be queried with SQL statements; even Chinese typed directly could not be displayed.

root@b18f56aa1e15:# locale -a
C
C.UTF-8
POSIX

Solution: The MySQL container uses the POSIX character set by default. POSIX does not support Chinese, while C.UTF-8 does. Simply change the system environment variable LANG to "C.UTF-8". Similarly, not being able to type Chinese into a K8S pod can be solved the same way.

$ docker run --name some-mysql -e MYSQL_ROOT_PASSWORD=my-secret-pw \
    -d mysql:tag --character-set-server=utf8mb4 --collation-server=utf8mb4_unicode_ci
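Since the fix is the LANG environment variable, it can also be injected when starting the container or when entering a running one; a sketch, assuming the image ships the C.UTF-8 locale as the locale -a output above shows:

# pass the locale when starting the container
$ docker run --name some-mysql -e LANG=C.UTF-8 -e MYSQL_ROOT_PASSWORD=my-secret-pw -d mysql:tag
# or only for an interactive session in a running container
$ docker exec -it -e LANG=C.UTF-8 some-mysql bash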

8. Docker container network connectivity

Understand the four network models of Docker

Cause of the problem: An Nginx container was deployed on the server to proxy a Python backend service started on the same host, but it was configured as follows.

$ docker run -d -p 80:80 -v $PWD:/etc/nginx nginx

# nginx.conf
server {
    ...
    location /api {
        proxy_pass http://localhost:8080;
    }
    ...
}

Nginx runs in a container, so localhost there is the container's localhost, not the host's localhost.

Changing localhost in nginx.conf to the host's IP address resolves the 502 error.

$ ip addr show docker0
docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:d5:4c:f2:1e brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:d5ff:fe4c:f21e/64 scope link
       valid_lft forever preferred_lft forever

# nginx.conf
server {
    ...
    location /api {
        proxy_pass http://172.17.0.1:8080;
    }
    ...
}

When the container uses the host network, it shares the network with the host, so the host network can be reached from inside the container; the container's localhost is then the host's localhost.

# with host networking, port mapping is unnecessary
$ docker run -d --network=host -v $PWD:/etc/nginx nginx
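A third option, sketched below with hypothetical names, is a user-defined bridge network: if the backend also runs in a container, Docker's embedded DNS lets containers on the same network reach each other by name, so neither host networking nor the docker0 address is needed:

# containers on a user-defined bridge resolve each other by container name
$ docker network create app-net
$ docker run -d --network=app-net --name backend my-python-app        # hypothetical backend image
$ docker run -d --network=app-net -p 80:80 -v $PWD:/etc/nginx nginx
# in nginx.conf: proxy_pass http://backend:8080;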

9. Docker container bus error

Bus errors are scary to see

Cause: A bus error is displayed when running a program in the Docker container.

$ inv app.user_op --name=zhangsan
Bus error (core dumped)

Solution: The shm partition was set too small when the container was run, resulting in insufficient shared memory. If the --shm-size parameter is not set, Docker gives the container a default shm size of 64 MB, which was not enough for the program to start.

$ docker run -it --rm --shm-size=200m pytorch/pytorch:latest
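To confirm the setting took effect, check /dev/shm from inside the container:

$ docker run -it --rm --shm-size=200m pytorch/pytorch:latest df -h /dev/shm
Filesystem  Size  Used  Avail  Use%  Mounted on
shm         200M  0     200M   0%    /dev/shm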

A bus error is also reported when the disk space inside the container is insufficient, so clearing unnecessary files or directories likewise solves the problem.

$ df -Th
Filesystem  Type     Size  Used  Avail  Use%  Mounted on
overlay     overlay  1T    1T    0G     100%  /
shm         tmpfs    64M   24K   64M    1%    /dev/shm

10. A Docker NFS mounting error occurs

File locking over NFS behaves differently!

Cause: We deployed the service to an OpenShift cluster. When the service started and accessed a resource file, the error below was reported. Python3's read_file() locks the file before reading its contents. Strangely, local debugging showed the service working normally, with no locking problem. It turned out that the OpenShift cluster mounts its shared disks over NFS.

Traceback (most recent call last):
......
  File "xxx/utils/storage.py", line 34, in xxx.utils.storage.LocalStorage.read_file
OSError: [Errno 9] Bad file descriptor

# the file-locking code
...
with open(self.mount(path), 'rb') as fileobj:
    fcntl.flock(fileobj, fcntl.LOCK_EX)
    data = fileobj.read()
return data
...

Workaround: From the information below, using flock() over NFS on Linux requires a kernel of 2.6.12 or later. It was later revealed that this was actually caused by a bug in the RedHat kernel, which was fixed in kernel-3.10.0-693.18.1.el7. For NFSv3 and NFSv4 services, you need to upgrade the Linux kernel version to solve this problem.

# https://t.codebug.vip/questions-930901.htm
In Linux kernels up to 2.6.11, flock() does not lock files over NFS (i.e., the scope of
locks was limited to the local system). [...] Since Linux 2.6.12, NFS clients support
flock() locks by emulating them as byte-range locks on the entire file.
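Before and after upgrading, the kernel version and the NFS mount can be checked; the output below is illustrative:

# the kernel must be new enough for flock() emulation over NFS (>= 2.6.12; RedHat fix in 3.10.0-693.18.1.el7)
$ uname -r
3.10.0-693.18.1.el7.x86_64
# confirm how the share is mounted (NFSv3 or NFSv4)
$ mount | grep nfs
nfs-server:/share on /mnt/share type nfs4 (rw,relatime,vers=4.1,...)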

11. Docker default network segments

The networks of started containers cannot reach each other, which is strange!

Cause of the problem: When starting services with Docker, we found that services which could previously reach each other could no longer do so after some of them were restarted. Checking showed that the internal private address segments in use were inconsistent: some services ran on the 172.17-172.31 segments, others on the 192.168.0-192.168.224 segments, so the services could not reach each other after starting.

Solution: The problem above is handled by manually specifying the address segments the Docker service starts with.

$ cat /etc/docker/daemon.json
{
    "registry-mirrors": ["https://vec0xydj.mirror.aliyuncs.com"],
    "default-address-pools": [
        {
            "base": "172.17.0.0/12",
            "size": 24
        }
    ],
    "experimental": true,
    "default-runtime": "nvidia",
    "live-restore": true,
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
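After restarting dockerd, networks created from then on should be carved out of the configured pool; a quick check with a throwaway network (the exact subnet assigned may differ):

$ docker network create test-net
$ docker network inspect test-net --format '{{ (index .IPAM.Config 0).Subnet }}'
172.17.1.0/24
$ docker network rm test-net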

12. Docker Compose services crosstalk

Starting two groups of services with the docker-compose command, we found the services interfering with each other!

Cause: In two directories whose leaf directory names were identical, we used docker-compose to start services. After group A's services were started, restarting group B's services caused some of group A's services to be restarted as well, which was very strange! Group A and group B could not run at the same time. At first I thought it was a bug in the tool; only after asking a senior colleague did I understand the reason.

A: /data1/app/docker-compose.yml
B: /data2/app/docker-compose.yml

Solution: docker-compose attaches labels to the containers it starts, and later uses these labels to identify and determine which containers it started and manages. The label to pay attention to here is com.docker.compose.project, whose value is the name of the lowest-level directory containing the startup configuration file, in this case app. Since that value is app for both group A and group B, compose considers them the same project at startup, which leads to the problem above. Look at the source code if you need more insight. The fix is to make the leaf directory names differ, as shown below.

# fix: make the leaf directory names differ
A: /data/app1/docker-compose.yml
B: /data/app2/docker-compose.yml
# or
A: /data1/app-old/docker-compose.yml
B: /data2/app-new/docker-compose.yml

Alternatively, use the -p parameter of the docker-compose command to set an explicit project name and avoid the problem.

$ docker-compose -f ./docker-compose.yml -p app1 up -d
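To see which project a running container was grouped under, read the compose label directly; a minimal check:

# the project label is what docker-compose uses to group and manage containers
$ docker inspect --format '{{ index .Config.Labels "com.docker.compose.project" }}' <container_name>
app1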

13. The Docker command is invoked incorrectly

Docker-related commands are often executed when writing scripts, but pay attention to the details!

Cause: The CI environment-update job executed a script, but an error occurred during execution: the output said the device being executed on is not a TTY.

Then, I checked the script and found that the error came from an exec of a docker command, as shown below. Strangely, running the script manually or calling it directly worked fine; it only failed when CI invoked it. Take a closer look at the command below and note the -it parameters.

$ docker exec -it <container_name> psql -Upostgres ......

# The two options of the exec command make the cause clear:
-i/--interactive    # Keep STDIN open even if not attached
-t/--tty            # Allocate a pseudo-TTY: a bridge between the user's terminal
                    # and the container's stdin and stdout

The -t parameter of docker exec means "Allocate a pseudo-TTY". CI jobs are not executed on a TTY terminal, so the command reports an error; removing -t fixes it.
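The fix for CI is simply to request no pseudo-TTY; a sketch (the query and dump file are hypothetical):

# works in CI pipelines where stdin is not a terminal
$ docker exec <container_name> psql -Upostgres -c 'SELECT 1;'
# keep -i only when the command must read from stdin, e.g. piping a dump in
$ cat dump.sql | docker exec -i <container_name> psql -Upostgres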

14. The scheduled Docker task is abnormal

Docker command execution exceptions occur in scheduled Crontab tasks.

Cause: A problem was found today. MySQL database backups were run inside a docker container and triggered by a crontab scheduled task, but the resulting backups were empty, while executing the same command manually worked fine.

$ docker exec -it <container_name> sh -c 'exec mysqldump --all-databases -uroot -ppassword ......'

Solution: After checking, the docker command included the -it options. Since crontab does not execute commands interactively, they need to be removed. In summary, you need the -t option if you need echo, and the -i option if you need an interactive session.

-i/--interactive    # Keep STDIN open even if not attached
-t/--tty            # Allocate a pseudo-TTY: a bridge between the user's terminal
                    # and the container's stdin and stdout
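A corrected crontab entry therefore drops the interactive options entirely; a sketch with a hypothetical schedule and backup path:

# /etc/crontab: no -i/-t, because cron provides neither stdin nor a TTY
0 3 * * * root docker exec <container_name> sh -c 'exec mysqldump --all-databases -uroot -ppassword' > /data/backup/all.sql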

15. Quotation marks around Docker environment variables

Do not quote environment variables in Compose!

Cause: Anyone who has used compose has probably wondered, when writing the startup configuration file, whether environment variables should be added with single quotes, double quotes, or no quotes at all. Over time, we may have treated the three as interchangeable, but the pitfalls kept piling up and getting more obscure.

In any case, I have seen many service startup problems caused by added quotes, and concluded that quotes simply do not apply here; going without them is refreshingly reliable! Only after seeing the corresponding GitHub issue was the case finally solved.

# in a Compose configuration or env_file, quotes are passed through literally
TESTVAR="test"    # the container receives the value "test", quotes included
# with docker run, the shell strips the quotes, so docker handles them correctly
$ docker run -it --rm -e TESTVAR="test" test:latest

Solution: The bottom line is that when Compose parses the YAML configuration file, the quotes are kept as part of the value: TESTVAR="test" is parsed as the value '"test"', so the expected value cannot be found when referencing it. The rule is therefore: whether we add environment variables directly in the configuration file or through env_file, do not use quotes.
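A minimal illustration of the rule in a compose file, using a hypothetical test image:

# docker-compose.yml: do NOT quote the values here
services:
  app:
    image: test:latest
    environment:
      - TESTVAR=test      # correct: the container sees test
      # - TESTVAR="test"  # wrong: the container sees "test", quotes included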

16. Docker fails to delete an image

Unable to delete the image; after all, something still refers to it!

Cause: While clearing server disk space, deleting an image reported the message below, indicating that forced deletion is required, but forced deletion failed as well.

# delete the image
$ docker rmi 3ccxxxx2e862
Error response from daemon: conflict: unable to delete 3ccxxxx2e862 (cannot be forced) - image has dependent child images
# force delete
$ docker rmi -f 3ccxxxx2e862
Error response from daemon: conflict: unable to delete 3ccxxxx2e862 (cannot be forced) - image has dependent child images

Solution: It turns out the main reason is the TAGs: other images reference this image as a parent. You can run the following commands to check the dependency relationships of the image files, and then delete images according to the corresponding TAG.

# check the dependency relationships of the image files
$ docker image inspect --format='{{.RepoTags}} {{.Id}} {{.Parent}}' $(docker image ls -q --filter since=<image_id>)
# delete the images that depend on the given image
$ docker rmi $(docker images --filter since=<image_id> -q)
# delete dangling images
$ docker rmi $(docker images --filter "dangling=true" -q --no-trunc)

17. Switching Docker to an ordinary user

When switching the user that starts services inside Docker, pay attention to permissions!

Cause of the problem: We all know that using the root user inside a Docker container is unsafe and easily leads to privilege problems, so normally we start and manage services as an ordinary user instead of root. After switching, however, the Nginx service could not start, because the configuration file still pointed at /var-related directories that the ordinary user cannot write.

# Nginx startup error log
nginx: [alert] could not open error log file: open() "/var/log/nginx/error.log" failed (13: Permission denied)
2020/11/12 15:25:47 [emerg] 23#23: mkdir() "/var/cache/nginx/client_temp" failed (13: Permission denied)

Solution: To start the nginx service as an ordinary user, point the paths in the nginx configuration file at directories the user has permission to write, as below.

# nginx.conf
user  www-data;
worker_processes  1;
error_log  /data/logs/master_error.log warn;
pid        /dev/shm/nginx.pid;
events {
    worker_connections  1024;
}
http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;
    gzip               on;
    sendfile           on;
    tcp_nopush         on;
    keepalive_timeout  65;
    client_body_temp_path  /tmp/client_body;
    fastcgi_temp_path      /tmp/fastcgi_temp;
    proxy_temp_path        /tmp/proxy_temp;
    scgi_temp_path         /tmp/scgi_temp;
    uwsgi_temp_path        /tmp/uwsgi_temp;
    include /etc/nginx/conf.d/*.conf;
}

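Relatedly, the user a container runs as can also be overridden at startup; a sketch, assuming the mounted directories are writable by that uid and the configuration listens on an unprivileged port (e.g. 8080):

# run the container process as an ordinary user (uid:gid of the invoking account)
$ docker run -d --user $(id -u):$(id -g) -p 8080:8080 -v $PWD/nginx.conf:/etc/nginx/nginx.conf nginx

Note that when nginx is started by a non-root user, the user directive in the configuration file is ignored.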


Author: Escape. Link: escapelife.github.io/posts/43a2b…