What exactly is a container? How does a container work? How does it isolate resources? Why does it start so fast? If you are the curious type, questions like these have probably crossed your mind while using containers. This article answers them by explaining the three core technologies behind containers: Linux namespaces, control groups (cgroups), and UnionFS (union file system).

Linux Namespace

Linux namespaces are a resource isolation mechanism provided by the Linux kernel: resources in different namespaces are independent of each other. Linux currently provides seven kinds of namespaces.

Reference: man7.org/linux/man-p…

| Namespace | Flag | Isolates |
| --------- | ---- | -------- |
| Cgroup | CLONE_NEWCGROUP | Cgroup root directory |
| IPC | CLONE_NEWIPC | Inter-process communication |
| Network | CLONE_NEWNET | Network resources |
| Mount | CLONE_NEWNS | Mount points |
| PID | CLONE_NEWPID | Process IDs |
| User | CLONE_NEWUSER | User and group IDs |
| UTS | CLONE_NEWUTS | Hostname and domain name |

You can create namespaces for a newly created process by passing the flags from the table above to the clone() system call. You can also use the setns() system call to move a process into an existing namespace. Containers implement resource isolation through namespace technology.
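Each namespace a process belongs to appears as a symlink under /proc/<pid>/ns; two processes in the same namespace share the same inode number there. For example, for the current shell (inode numbers will differ on your machine):

$ ls -l /proc/$$/ns
lrwxrwxrwx 1 root root 0 ... cgroup -> cgroup:[4026531835]
lrwxrwxrwx 1 root root 0 ... ipc -> ipc:[4026531839]
lrwxrwxrwx 1 root root 0 ... mnt -> mnt:[4026531840]
lrwxrwxrwx 1 root root 0 ... net -> net:[4026531992]
lrwxrwxrwx 1 root root 0 ... pid -> pid:[4026531836]
lrwxrwxrwx 1 root root 0 ... user -> user:[4026531837]
lrwxrwxrwx 1 root root 0 ... uts -> uts:[4026531838]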

Namespaces limit which resources a container can see.

Example: Create a container from a shell in Linux

Talk is cheap, show me the code.

Let’s demonstrate the effect of namespace isolation with a hands-on example. From the command line, the unshare command starts a new process and creates new namespaces for it. In this example we will create all the namespaces for our container except cgroup and user, which is also the default set that docker run creates for a container. The example relies on a Docker environment to provide some configuration convenience. The complete session recording is kept in the repository below and can be replayed with scriptreplay:

git clone https://github.com/DrmagicE/build-container-in-shell
cd ./build-container-in-shell
scriptreplay build_container.time build_container.his

Step1: Prepare a rootfs

First, we prepare a rootfs for our container; it provides the isolated file system the container process will run in. Here we directly export the alpine image as our rootfs, using the /root/container directory as the rootfs directory:

[root@drmagic container]# pwd 
/root/container
[root@drmagic container]# # Change mount propagation to private so that later mount/umount events do not propagate between namespaces
[root@drmagic container]# mount --make-rprivate /
[root@drmagic container]# CID=$(docker run -d alpine true)
[root@drmagic container]# docker export $CID | tar -xf -
[root@drmagic container]# ls # rootfs is ready
bin  dev  etc  home  lib  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var

Step2: Namespace isolation

[root@drmagic container]# # Create new namespaces for a new shell with unshare
[root@drmagic container]# unshare --mount --uts --ipc --net --pid --fork /bin/bash
[root@drmagic container]# echo $$ # look at the PID of the new process
1
[root@drmagic container]# hostname unshare-bash # change the hostname
[root@drmagic container]# exec bash # replace bash so the prompt picks up the new hostname
[root@unshare-bash container]# # the hostname has changed

The session above shows the isolation effect of the UTS and PID namespaces.

If you run ps at this point, the result may disappoint you: you will still see every process on the system, as if the isolation had not taken effect. This is expected, however, because ps reads its information from /proc, and /proc here is still the host’s /proc, so ps still sees all processes.

Step3: Isolate the mount information

[root@unshare-bash container]# mount # You can still see the host's mounts
/dev/vda2 on / type xfs (rw,relatime,attr2,inode64,noquota)
devtmpfs on /dev type devtmpfs (rw,nosuid,size=1929332k,nr_inodes=482333,mode=755)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
mqueue on /dev/mqueue type mqueue (rw,relatime)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime)
.....

So mount still shows global mount information. Does that mean the mount namespace is not in effect? No. When a new mount namespace is created, it copies the parent process’s mount points, but subsequent changes to mount points inside the namespace no longer affect other namespaces.

Reference: man7.org/linux/man-p…

For this to hold, the mount propagation type must be set to MS_PRIVATE, which is why we ran mount --make-rprivate / at the very beginning.

The mount information we see is therefore just a copy inherited from the parent process. Let’s remount /proc so that ps works:

[root@unshare-bash ~]# # Remount /proc
[root@unshare-bash ~]# mount -t proc none /proc
[root@unshare-bash ~]# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 21:29 pts/0    00:00:00 bash
root        77     1  0 21:47 pts/0    00:00:00 ps -ef
[root@unshare-bash ~]# # Ahh, now our ps works!

After remounting /proc, we still need to clean up the old mount points inherited from the parent and unmount them; pivot_root(new_root, put_old) does this. pivot_root switches the root mount point of all processes (threads) in the current mount namespace to new_root and moves the old root mount to the put_old directory. The main purpose of using pivot_root here is to let us unmount the mount points copied from the parent process.

Reference: man7.org/linux/man-p…

pivot_root places a few requirements on its arguments, so an additional bind mount is needed first:

[root@unshare-bash container]# mount --bind /root/container/ /root/container/
[root@unshare-bash container]# cd /root/container/
[root@unshare-bash container]# mkdir oldroot/
[root@unshare-bash container]# pivot_root . oldroot/
[root@unshare-bash container]# cd /
[root@unshare-bash /]# PATH=$PATH:/bin:/sbin
[root@unshare-bash /]# mount -t proc none /proc
[root@unshare-bash /]# ps -ef
PID   USER     TIME  COMMAND
    1 root      0:00 bash
   70 root      0:00 ps -ef
[root@unshare-bash /]# mount
rootfs on / type rootfs (rw)
/dev/vda2 on /oldroot type xfs (rw,relatime,attr2,inode64,noquota)
devtmpfs on /oldroot/dev type devtmpfs (rw,nosuid,size=1929332k,nr_inodes=482333,mode=755)
tmpfs on /oldroot/dev/shm type tmpfs (rw,nosuid,nodev)
....
[root@unshare-bash /]# umount -a # unmount everything
umount: can't unmount /: Resource busy
umount: can't unmount /oldroot: Resource busy
umount: can't unmount /: Resource busy
[root@unshare-bash /]# mount -t proc none /proc # re-mount /proc
[root@unshare-bash /]# mount
rootfs on / type rootfs (rw)
/dev/vda2 on /oldroot type xfs (rw,relatime,attr2,inode64,noquota)   <-- oldroot is still mounted
/dev/vda2 on / type xfs (rw,relatime,attr2,inode64,noquota)
none on /proc type proc (rw,relatime)

We can see that the old root, oldroot, still carries mount information; let’s unmount it:

[root@unshare-bash /]# umount -l oldroot/ # lazy umount
[root@unshare-bash /]# mount
rootfs on / type rootfs (rw)
/dev/vda2 on / type xfs (rw,relatime,attr2,inode64,noquota)
none on /proc type proc (rw,relatime)

At this point the container can only see its own mount information: mount isolation is complete.

Step4: Add a network to our container

Next, we set up the container’s network. Using a veth pair and Docker’s docker0 bridge, we connect the container’s network to the host’s.

[root@unshare-bash /]# ping 8.8.8.8 # no network yet
PING 8.8.8.8 (8.8.8.8): 56 data bytes
ping: sendto: Network unreachable
[root@unshare-bash /]# ifconfig -a
lo        Link encap:Local Loopback
          LOOPBACK  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

Go back to the host shell and add veth pair:

[root@drmagic ~]# pidof unshare # find the container PID
11363
[root@drmagic ~]# CPID=11363
[root@drmagic ~]# # Add a veth pair
[root@drmagic ~]# ip link add name h$CPID type veth peer name c$CPID
[root@drmagic ~]# # Move one end of the pair into the container
[root@drmagic ~]# ip link set c$CPID netns $CPID
[root@drmagic ~]# # Attach the host end to docker0 and bring it up
[root@drmagic ~]# ip link set h$CPID master docker0 up

After setting up the veth pair, go back to the container:

[root@unshare-bash /]# ifconfig -a # back in the container after the setup
c11363    Link encap:Ethernet  HWaddr 1A:47:BF:B8:FB:88
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

lo        Link encap:Local Loopback
          LOOPBACK  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

[root@unshare-bash /]# ip link set lo up
[root@unshare-bash /]# ip link set c11363 name eth0 up
[root@unshare-bash /]# # Give eth0 an unused IP address from docker's subnet
[root@unshare-bash /]# ip addr add 172.17.42.3/16 dev eth0
[root@unshare-bash /]# # Configure the default route via docker's default gateway
[root@unshare-bash /]# ip route add default via 172.17.0.1
[root@unshare-bash /]# ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: seq=0 ttl=43 time=17.220 ms
64 bytes from 8.8.8.8: seq=1 ttl=43 time=16.996 ms
64 bytes from 8.8.8.8: seq=2 ttl=43 time=17.099 ms
64 bytes from 8.8.8.8: seq=3 ttl=43 time=17.117 ms
^C
--- 8.8.8.8 ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max = 16.996/17.108/17.220 ms

The network is now configured, and the container’s resources are fully isolated from the host. Docker goes through similar steps when it creates a container.

Cgroups

Cgroups, short for control groups, is a Linux kernel feature that limits, controls, and accounts for the resources (such as CPU, memory, and disk I/O) of a group of processes.

Cgroups limit how many resources a container can use.

The cgroups API is implemented as a pseudo-filesystem, so users can manipulate cgroups through ordinary file operations. On most systems, cgroups are automatically mounted under the /sys/fs/cgroup directory.

Cgroups are made up of different subsystems, each of which is effectively a controller for one resource. Let’s look at /sys/fs/cgroup:

$ ll /sys/fs/cgroup/
drwxr-xr-x 7 root root  0 Nov 11 22:49 blkio
lrwxrwxrwx 1 root root 11 Nov 11 22:49 cpu -> cpu,cpuacct
lrwxrwxrwx 1 root root 11 Nov 11 22:49 cpuacct -> cpu,cpuacct
drwxr-xr-x 6 root root  0 Nov 11 22:49 cpu,cpuacct
drwxr-xr-x 4 root root  0 Nov 11 22:49 cpuset
drwxr-xr-x 6 root root  0 Nov 11 23:40 devices
drwxr-xr-x 4 root root  0 Nov 11 22:49 freezer
drwxr-xr-x 4 root root  0 Nov 11 22:49 hugetlb
drwxr-xr-x 6 root root  0 Nov 11 22:49 memory
lrwxrwxrwx 1 root root 16 Nov 11 22:49 net_cls -> net_cls,net_prio
drwxr-xr-x 4 root root  0 Nov 11 22:49 net_cls,net_prio
lrwxrwxrwx 1 root root 16 Nov 11 22:49 net_prio -> net_cls,net_prio
drwxr-xr-x 4 root root  0 Nov 11 22:49 perf_event
drwxr-xr-x 6 root root  0 Nov 11 22:49 pids
drwxr-xr-x 6 root root  0 Nov 11 22:49 systemd

With the exception of systemd, each of these directories represents a subsystem: CPU-related (cpu, cpuacct, cpuset), memory-related (memory), block-device-I/O-related (blkio), network-related (net_cls, net_prio), and so on.

Cgroups manage subsystems in a tree hierarchy, and each subsystem has its own tree. A node in a tree is a set of processes (or threads), and the hierarchies of different subsystems are independent of one another. For example, the cpu subsystem and the memory subsystem can have different hierarchies:

cpu
/
├── batch
│   ├── bitcoins
│   │   └── 52        <-- process ID
│   └── hadoop
│       ├── 109
│       └── 88
└── docker
    ├── container1
    │   ├── 1
    │   ├── 2
    │   └── 3
    └── container2
        └── 4

memory
/
├── 109
├── 52
├── 88
└── docker
    ├── container1
    │   ├── 1
    │   ├── 2
    │   └── 3
    └── container2
        └── 4

To add a process to a group, simply write its PID to the tasks file in the corresponding group directory: echo $pid > tasks.

If you start a container with Docker, Docker creates a docker/$container_id directory for that container under each subsystem hierarchy; this is how cgroups manage and limit the container’s resources.
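To see the mechanism without Docker, here is a minimal sketch that creates a group by hand in the memory hierarchy (the group name demo is arbitrary):

$ cd /sys/fs/cgroup/memory
$ mkdir demo             # creating a directory creates a new group
$ ls demo/               # the kernel populates the control files automatically
cgroup.procs  memory.limit_in_bytes  memory.usage_in_bytes  tasks  ...
$ echo $$ > demo/tasks   # move the current shell into the group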

memory cgroup

A memory cgroup is a cgroup that manages memory. It has two main functions:

  • Accounts for the memory usage of the current group.
  • Limits the memory usage of the current group.

Statistics

The memory cgroup tracks each group’s memory usage at the granularity of memory pages. Using Docker as an example, start an nginx container and read its memory usage:

$ container_id=$(docker run -d nginx)
$ cat /sys/fs/cgroup/memory/docker/$container_id/memory.usage_in_bytes
2666496

Because the statistics are in pages, the results can only be multiples of the page size (typically 4096).
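You can verify the page size and convert the figure above into pages (a quick sanity check, assuming the usual 4 KiB pages):

$ getconf PAGE_SIZE
4096
$ echo $((2666496 / 4096))   # usage of the nginx container above, in pages
651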

Limits

The memory cgroup can limit the memory usage of the entire group (there is no limit by default). There are two kinds of limiting abilities:

  • Hard limit.
  • Soft limit.

If the group’s memory usage exceeds the hard limit, the OOM killer is triggered for the current group and kills a process.

If you don’t want processes to be killed, you can disable the OOM killer for the current group: echo 1 > memory.oom_control

In contrast to the hard limit, the soft limit never forces processes to be killed and only takes effect when the system is short of memory: under memory pressure, the kernel tries its best to push each group’s usage back below its soft limit to keep the system as a whole usable.
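Continuing the hand-made demo group from earlier, setting both limits is just two writes (values in bytes; a sketch, not production tuning):

$ echo 104857600 > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes       # hard limit: 100 MiB
$ echo 52428800 > /sys/fs/cgroup/memory/demo/memory.soft_limit_in_bytes   # soft limit: 50 MiB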

Using docker as an example, we use the following command to set the hard limit and soft limit of the nginx container to 100M and 50M respectively, and can see the corresponding cgroup file changes:

$ container_id=$(docker run -d -m 100m --memory-reservation 50m nginx)
$ cat /sys/fs/cgroup/memory/docker/$container_id/memory.limit_in_bytes
104857600  <-- 100m
$ cat /sys/fs/cgroup/memory/docker/$container_id/memory.soft_limit_in_bytes
52428800 <-- 50m

cpu,cpuacct cgroup

cpu and cpuacct are two separate cgroup subsystems, but they are generally mounted in the same directory:

...
lrwxrwxrwx 1 root root 11 Nov 11 22:49 cpu -> cpu,cpuacct
lrwxrwxrwx 1 root root 11 Nov 11 22:49 cpuacct -> cpu,cpuacct
drwxr-xr-x 6 root root  0 Nov 11 22:49 cpu,cpuacct
...

Together, cpu and cpuacct (CPU accounting) provide the following functions:

  • Accounts for the CPU usage of the current group.
  • Limits the group’s CPU usage (by influencing scheduling policy).

Statistics

The statistics functionality is mainly provided by cpuacct, for example reading the total CPU time consumed by the current group:

$ cat cpuacct.usage
1196687732756025  <-- in nanoseconds

Limits

By influencing the scheduler’s behavior, you can limit the CPU usage of the current group; this is how containers limit the number of CPU cores they may use. The cpu cgroup can control the scheduling behavior of the following two schedulers:

  • Completely Fair Scheduler (CFS) Scheduler based on the Completely Fair algorithm.
  • Real-time scheduler (RT) A scheduler based on real-time scheduling algorithms.

In most cases, we use the default CFS scheduler, so we will only discuss the control behavior of the CFS scheduler here. In the CPU cgroup directory, we can see the following two files:

$ cat cpu.cfs_period_us
100000
$ cat cpu.cfs_quota_us
-1

cpu.cfs_period_us

The length of a scheduling period, in microseconds (μs), i.e., how often CPU time is reallocated. Default: 100 ms (100000 μs).

The longer the scheduling period, the higher the CPU throughput but also the higher the latency of tasks. Conversely, a shorter period lowers latency but also lowers throughput (more time is wasted on context switches).

cpu.cfs_quota_us

The total CPU time, in microseconds (μs), that all processes in the current group may run on a single CPU within one scheduling period (the time set by cpu.cfs_period_us). The default is -1, meaning no limit.

If the current group should be able to fully use two CPU cores, set:

  • cpu.cfs_quota_us = 200000
  • cpu.cfs_period_us= 100000

Similarly, to allow the current group only half a core:

  • cpu.cfs_quota_us = 50000
  • cpu.cfs_period_us= 100000

cpu.cfs_quota_us / cpu.cfs_period_us = the number of CPU cores available to the current group

When we specify a number of cores with Docker, it is actually the cpu.cfs_quota_us file that gets adjusted, as the sketch below shows.
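For example, the --cpus flag maps directly onto these files (a quick check, assuming the default 100 ms period):

$ container_id=$(docker run -d --cpus 0.5 nginx)
$ cat /sys/fs/cgroup/cpu/docker/$container_id/cpu.cfs_quota_us
50000
$ cat /sys/fs/cgroup/cpu/docker/$container_id/cpu.cfs_period_us
100000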

cpuset cgroup

The cpuset cgroup is rarely used; when chasing extreme performance, it can pin a group to specific CPU cores and NUMA memory nodes:

  • cpuset.cpus: which CPUs the current group may use.
  • cpuset.mems: which NUMA memory nodes the current group may use.

Under the Non-Uniform Memory Access (NUMA) architecture, CPUs are divided into multiple NUMA nodes, each with its own local memory and I/O slots. A CPU accesses the memory of its local NUMA node much faster than remote memory, which works somewhat like an extra cache layer on top of memory.

NUMA is beyond the scope of this article (and, frankly, of the author); if you are interested, please refer to the relevant documentation.

For example, checking the CPU and NUMA node information of the local machine:

$ lscpu
...
CPU(s):              2
On-line CPU(s) list: 0,1       <-- dual-core
...
NUMA node(s):        1         <-- a single NUMA node
...
NUMA node0 CPU(s):   0,1       <-- CPUs on NUMA node0
...

Let’s look at the corresponding file in the cpuset directory:

$ cat cpuset.cpus
0-1  
$ cat cpuset.mems
0 

That is, by default a group may use all CPUs and all NUMA memory nodes. Docker can pin cores and NUMA nodes through command-line flags, as the sketch below shows.
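For example, with the --cpuset-cpus and --cpuset-mems flags (a sketch; on the single-node machine above only node 0 exists):

$ container_id=$(docker run -d --cpuset-cpus 0 --cpuset-mems 0 nginx)
$ cat /sys/fs/cgroup/cpuset/docker/$container_id/cpuset.cpus
0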

Reference: docs.docker.com/engine/refe…

blkio cgroup

The blkio cgroup manages block-device I/O. Its two main functions are:

  • Collects statistics on the usage of each block device by the current group
  • Limits the use of block devices by the current group

Statistics

blkio tracks the current group’s usage of every block device along four dimensions: read, write, sync, and async. For example, blkio.io_service_bytes and blkio.throttle.io_service_bytes:

$ cat blkio.io_service_bytes
Total 0
$ cat blkio.throttle.io_service_bytes
253:0 Read 0
253:0 Write 8192
253:0 Sync 8192
253:0 Async 0
253:0 Total 8192
Total 8192

blkio.io_service_bytes only accounts for devices using the CFQ scheduler, so in most cases its value is 0; for general statistics, look at the files starting with blkio.throttle.

Limits

Blkio can restrict the use of block devices by groups. It provides two restriction policies:

  • Weight scheduling policy: takes effect only when the block device uses the Completely Fair Queuing (CFQ) scheduler. You set a weight to limit each group’s share of the block device.
  • I/O throttling policy: sets an upper limit on the I/O rate of a block device to cap the group’s use of it.

Weight scheduling only works under the CFQ scheduler, while the I/O throttling policy is implemented at the generic block layer, is unaffected by the I/O scheduler in use, and is therefore more widely applicable.

You can check which scheduler a block device uses with the following command (replace vda with the device you want to inspect):

$ cat /sys/block/vda/queue/scheduler
[mq-deadline] kyber none

If you see cfq selected, the weight scheduling policy can take effect.

Taking the more widely applicable I/O throttling policy as an example, the following four files limit the bytes read/written per second and the number of I/O operations per second:

  • blkio.throttle.read_bps_device: bytes read per second
  • blkio.throttle.read_iops_device: read operations per second
  • blkio.throttle.write_bps_device: bytes written per second
  • blkio.throttle.write_iops_device: write operations per second

Writing "major:minor limit" to one of these files sets the corresponding maximum read/write bytes or operations per second for that device.

major and minor are the device numbers of a block device. You can run ls -lt /dev/ to view the device numbers of the block devices on the host.

For example, use blkio.throttle.write_bps_device to limit the write speed of device 253:0 (/dev/vda) to 10 MB/s:

$ echo "253:0 10485760" > blkio.throttle.write_bps_device

Note that blkio limits block-device I/O. An ordinary write goes through the page cache and is flushed to disk asynchronously, and the write into the page cache itself is not throttled by blkio. To observe the throttling effect, use direct I/O, for example:

$ dd if=/dev/zero of=test bs=10M count=5 oflag=direct
5+0 records in
5+0 records out
52428800 bytes (52 MB, 50 MiB) copied, 4.94449 s, 10.6 MB/s

As you can see, the measured write rate after throttling is about 10.6 MB/s, close to the configured limit.

net_cls,net_prio cgroup

net_cls and net_prio are two network-related cgroups:

  • net_cls: tags network packets with a classid so that a traffic control program (tc, the traffic controller) can identify packets generated by a specific cgroup.
  • net_prio: sets the priority of each network interface for the group.

net_cls by itself cannot rate-limit traffic the way blkio limits disk I/O. To throttle traffic we pair it with tc: net_cls marks the packets, and tc matches those marks to shape the traffic. A minimal sketch follows.
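This sketch is adapted from the kernel’s net_cls documentation; the interface name eth0 and the 1mbit rate are placeholders:

# tag packets from this group with classid 10:1 (encoded as 0x00100001)
$ mkdir /sys/fs/cgroup/net_cls/limited
$ echo 0x00100001 > /sys/fs/cgroup/net_cls/limited/net_cls.classid
$ echo $$ > /sys/fs/cgroup/net_cls/limited/tasks

# have tc shape packets carrying that classid down to 1 mbit/s on eth0
$ tc qdisc add dev eth0 root handle 10: htb
$ tc class add dev eth0 parent 10: classid 10:1 htb rate 1mbit
$ tc filter add dev eth0 parent 10: protocol ip prio 10 handle 1: cgroup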

devices cgroup

The devices cgroup controls the group’s access permissions to devices: read, write, and mknod.

Check the devices.list file under cgroup to obtain the device permissions for the current group:

$ cat devices.list
c 1:5 rwm
b *:* m

Each line has the format: type major:minor access.

There are three types of devices:

  • a: all devices, both character and block devices.
  • b: block devices.
  • c: character devices.

major:minor was introduced in the blkio section; here, * can be used as a wildcard for any number, e.g. *:* means all device numbers.

The access field is a string of one or more letters representing different permissions (a sketch of changing them follows the list):

  • r: read permission.
  • w: write permission.
  • m: permission to create device files (mknod).
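Permissions are changed by writing entries in the same format to the devices.deny and devices.allow files. A minimal sketch, run inside a group’s directory (1:3 is /dev/null):

$ echo "c 1:3 rwm" > devices.deny   # revoke all access to /dev/null for this group
$ echo "c 1:3 rwm" > devices.allow  # grant it back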

By default, Docker forbids containers from accessing any device on the host, except for a few special virtual devices. The --device flag adds device permissions for a container, and the --privileged flag enables privileged mode; a container started with --privileged gets full permissions on all host devices:

Reference: docs.docker.com/engine/refe…

$ container_id=$(docker run -d --privileged nginx)
$ cat /sys/fs/cgroup/devices/docker/$container_id/devices.list
a *:* rwm    <-- all permissions on all devices

The virtual devices opened by default: github.com/containerd/…

freezer cgroup

The freezer cgroup can suspend and resume the processes in a group. The freezer.state file records the group’s current state:

  • THAWED: unfrozen (running normally).
  • FREEZING: in the process of freezing.
  • FROZEN: frozen (suspended).

Writing to freezer.state changes the current group’s state; only FROZEN and THAWED may be written (FREEZING is a transient, read-only state). A minimal sketch follows.
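Assuming a hand-made group directory under /sys/fs/cgroup/freezer:

$ echo FROZEN > freezer.state   # suspend every task in the group
$ cat freezer.state
FROZEN
$ echo THAWED > freezer.state   # resume them
$ cat freezer.state
THAWED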

docker pause uses the freezer to pause and resume a container:

$ container_id=$(docker run -d nginx)
$ docker pause $container_id
$ cat /sys/fs/cgroup/freezer/docker/$container_id/freezer.state
FROZEN

pids cgroup

The pids cgroup limits the number of tasks (threads or processes) in a group. To enable the limit, write the maximum number of allowed tasks to the pids.max file; writing the string "max" means no limit (the default). Reading the pids.current file gives the current number of tasks in the group.

Docker can limit the number of processes in a container with the –pids-limit argument:

$ container_id=$(docker run -d --pids-limit 3 nginx)
$ cat /sys/fs/cgroup/pids/docker/$container_id/pids.max
3

UnionFS

UnionFS is a file system that combines multiple directories into a single logical directory containing the contents of all of them, presenting a unified view to the outside. For example, suppose we need to update the contents of a CD-ROM, but the CD-ROM is not writable: we can union-mount the CD-ROM together with a writable directory. When we update a file, the new content is written to the writable directory, and it looks as if the CD-ROM’s contents had been updated.

A container image provides a “static view” of the container: it contains the files the container needs to run. We can modify those files in a running container without affecting the image itself, because the container’s directory and the image’s directories are union-mounted. From the container’s point of view, the image is like the CD-ROM (not writable): the container’s changes are written only to its own directory and never touch the image’s contents.

An image is made up of a number of read-only layers. When you create a container from an image, a writable layer is stacked on top of the image’s read-only layers, and all changes to the container’s files are stored in that writable layer.

Copy-on-write

Containers start up quickly (even when the image is large) thanks to copy-on-write (COW) technology. When we start a container, we do not copy the whole image; the container references the image’s files directly. Any read is served straight from the image; only when a file is written is it first copied from the image into the container’s writable layer, where the write then takes place.

The Docker documentation introduces COW in detail, with examples: docs.docker.com/storage/sto…

OverlayFS

There are many implementations of UnionFS, and Docker can be configured with several types of storage drivers; the better-known ones are overlay2, aufs, and devicemapper.

Reference: docs.docker.com/storage/sto…

Since OverlayFS was merged into the mainline Linux kernel, overlay2 has become more and more common and is now the storage driver recommended by Docker. This article uses OverlayFS and overlay2 as examples to show how containers benefit from UnionFS and copy-on-write.

Mount OverlayFS:

$ mount -t overlay overlay -o lowerdir=lower1:lower2:lower3...,upperdir=upper,workdir=work merged

Reference: man7.org/linux/man-p…

The command above mounts an OverlayFS at the merged directory. lowerdir is the read-only (image) layer, and multiple lower layers may be specified; upperdir is the writable (container) layer. When we write into the merged directory, the file is written to upperdir; when we read a file from merged, if the file does not exist in upperdir, the lookup falls through, layer by layer, to the lowerdirs.

workdir is used by the system for preparatory work during the mount (staging files as they move between layers); it must be an empty directory on the same file system as upperdir.

Here is an example to visually illustrate the read and write behavior of OverlayFS:

$ mkdir lower upper work merged
$ echo "lowerdir" > lower/test
$ echo "upper" > upper/test # upper and lower both contain a file named test
$ echo "lowerdir" > lower/lower # a file that exists only in lower
$ mount -t overlay overlay -o lowerdir=lower,upperdir=upper,workdir=work merged
$ ls merged/ # after mounting: a unified view of lower and upper
lower  test
$ cat merged/test
upper            <-- both layers have test; the upper layer wins
$ cat merged/lower # upper does not have this file, so it is read from lower
lowerdir
$ echo "write something" >> merged/test
$ cat upper/test # writing to merged only affects the upper layer
upper
write something
$ cat lower/test
lowerdir

After creating a container using docker run, docker will mount an OverlayFS for the container:

$ docker run -itd alpine /bin/sh
$ mount | grep overlay2
overlay on /var/lib/docker/overlay2/a2a37f61c515f641dbaee62cf948817696ae838834fd62cf9395483ef19f2f55/merged type overlay
(rw,relatime,
lowerdir=/var/lib/docker/overlay2/l/RALFTJC6S7NV4INMLE5G2DUYVM:
         /var/lib/docker/overlay2/l/WQJ3RXIAJMUHQWBH7DMCM56PNK,
upperdir=/var/lib/docker/overlay2/a2a37f61c515f641dbaee62cf948817696ae838834fd62cf9395483ef19f2f55/diff,
workdir=/var/lib/docker/overlay2/a2a37f61c515f641dbaee62cf948817696ae838834fd62cf9395483ef19f2f55/work)

Docker adds each layer in the image to lowerdir in order, setting upperdir to the container’s writable layer.
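You can list these directories for a running container with docker inspect; a sketch of what that looks like under the overlay2 driver (paths abbreviated):

$ docker inspect --format '{{ json .GraphDriver.Data }}' $container_id
{"LowerDir":"/var/lib/docker/overlay2/l/...:...","MergedDir":".../merged","UpperDir":".../diff","WorkDir":".../work"}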

When we docker pull an image, Docker already creates the directory for each of its read-only layers; docker run then basically only has to create the container’s writable layer and mount everything together as an OverlayFS. That is why containers start very quickly even when the image is large.

When pulling an image with docker pull, you must have seen the Already exists flag:

docker pull xxxx
...
68ced04f60ab: Already exists <---
e6edbc456071: Pull complete
...

During docker pull, a layer whose content already exists locally is not pulled again: different images share identical layers, and /var/lib/docker/overlay2 keeps only one directory per layer, reducing disk overhead.

The Docker documentation describes how overlay2 works in detail: docs.docker.com/storage/sto…

Other references

Documentation:
man7.org/linux/man-p…
www.kernel.org/doc/Documen…
access.redhat.com/documentati…
android.googlesource.com/kernel/comm…

Blog posts:
www.sel.zju.edu.cn/?p=573
jvns.ca/blog/2019/1…
www.infoq.cn/article/bui…

Cgroups, namespaces, and beyond: what are containers made from?: www.youtube.com/watch?v=sK5… (a gem of a talk, highly recommended)