- Namespaces
- CGroups
- UnionFS
- Conclusion
- References
When it comes to virtualization technology, Docker is probably the first thing that comes to mind. After four years of rapid development, Docker is now used at scale in the production environments of many companies and is no longer a toy fit only for the development stage. As a product widely used in production, Docker has a very mature community and a huge user base, and its codebase has grown correspondingly large.
Moreover, the evolution of the project, the splitting of features into separate components, and the odd renamings introduced by various PRs have made it even harder to understand Docker's overall architecture.
Although Docker today has many components and a very complex implementation, this article does not aim to cover Docker's implementation details. Instead, it discusses the core technologies that make Docker's virtualization possible.
First of all, Docker emerged because backend development and operations genuinely needed a virtualization technology to keep development and production environments consistent. With Docker we can also put a program's runtime environment under version control, eliminating differences in behavior caused by differences in environment. But while these requirements drove the demand for virtualization, we still would not have a usable product without the right underlying technologies. The rest of this article introduces several core technologies that Docker relies on; once we understand how they are used and how they work, we can understand how Docker is implemented.
Namespaces
Namespaces are a Linux mechanism for separating resources such as process trees, network interfaces, mount points, and inter-process communication. In everyday use of Linux or macOS we do not need to run multiple completely separate servers, but if we start several services on the same server, those services actually affect each other: each service can see the other services' processes and can access arbitrary files on the host. Most of the time this is not what we want; we would prefer that different services running on the same machine be completely isolated, as if they were running on different machines.
In such a setup, once one service on the server is breached, the intruder can access all services and files on the machine, which is exactly what we want to avoid. Docker uses Linux namespaces to isolate containers from each other.
The Linux namespace mechanism provides seven different namespaces: CLONE_NEWCGROUP, CLONE_NEWIPC, CLONE_NEWNET, CLONE_NEWNS, CLONE_NEWPID, CLONE_NEWUSER, and CLONE_NEWUTS. With these seven options we can choose which resources a newly created process should be isolated from on the host machine.
Process
The process is a very important concept in Linux and other modern operating systems: it represents a running program and is the unit of work in a modern time-sharing system. On every *nix operating system we can use the ps command to print the processes currently running, for example on Ubuntu:
$ ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 Apr08 ? 00:00:09 /sbin/init
root 2 0 0 Apr08 ? 00:00:00 [kthreadd]
root 3 2 0 Apr08 ? 00:00:05 [ksoftirqd/0]
root 5 2 0 Apr08 ? 00:00:00 [kworker/0:0H]
root 7 2 0 Apr08 ? 00:07:10 [rcu_sched]
root 39 2 0 Apr08 ? 00:00:00 [migration/0]
root 40 2 0 Apr08 ? 00:01:54 [watchdog/0]
...
There are many processes running on this machine, two of which are special. One is /sbin/init with PID 1, and the other is kthreadd with PID 2; both are created by the kernel's idle process. The former performs part of the kernel initialization and system configuration, and also spawns login processes such as getty, while the latter manages and schedules other kernel threads.
If we run a new Docker container on this Linux machine, enter its shell through docker exec, and print all of its processes, we get the following result:
root@iZ255w13cy6Z:~# docker run -it -d ubuntu
b809a2eb3630e64c581561b08ac46154878ff1c61c6519848b4a29d412215e79
root@iZ255w13cy6Z:~# docker exec -it b809a2eb3630 /bin/bash
root@b809a2eb3630:/# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 15:42 pts/0    00:00:00 /bin/bash
root         9     0  0 15:42 pts/1    00:00:00 /bin/bash
root        17     9  0 15:43 pts/1    00:00:00 ps -ef
The ps command inside the new container prints a very clean process list, containing only three processes including the ps -ef itself; the dozens of processes on the host machine are nowhere to be seen.
The current Docker container successfully isolates the processes inside the container from the processes on the host machine. If we instead print all processes on the host, we see the following three Docker-related entries:
UID PID PPID C STIME TTY TIME CMD
root 29407 1 0 Nov16 ? 00:08:38 /usr/bin/dockerd --raw-logs
root 1554 29407 0 Nov19 ? 00:03:28 docker-containerd -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --metrics-interval=0 --start-timeout 2m --state-dir /var/run/docker/libcontainerd/containerd --shim docker-containerd-shim --runtime docker-runc
root 5006 1554 0 08:38 ? 00:00:00 docker-containerd-shim b809a2eb3630e64c581561b08ac46154878ff1c61c6519848b4a29d412215e79 /var/run/docker/libcontainerd/b809a2eb3630e64c581561b08ac46154878ff1c61c6519848b4a29d412215e79 docker-runc
On the current host machine, there may be a process tree consisting of the different processes described above:
Docker achieves this by passing CLONE_NEWPID when it creates new processes with clone(2), that is, by using a Linux PID namespace to isolate processes, so that any process inside a Docker container knows nothing about the processes on the host machine.
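Outside of Docker, we can reproduce this kind of isolation by hand with the unshare(1) utility from util-linux, which passes the same CLONE_NEW* flags to the kernel. A minimal sketch (the prompt, PIDs, and timestamps below are illustrative):

$ sudo unshare --pid --fork --mount-proc /bin/bash
root@host:/# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 10:00 pts/0    00:00:00 /bin/bash
root         8     1  0 10:00 pts/0    00:00:00 ps -ef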
containerRouter.postContainersStart
└── daemon.ContainerStart
    └── daemon.createSpec
        └── setNamespaces
            └── setNamespace
Every time we run docker run or docker start, Docker creates a Spec that is used to set up the isolation between processes:
func (daemon *Daemon) createSpec(c *container.Container) (*specs.Spec, error) {
    s := oci.DefaultSpec()

    // ...

    if err := setNamespaces(daemon, &s, c); err != nil {
        return nil, fmt.Errorf("linux spec namespaces: %v", err)
    }

    return &s, nil
}
The setNamespaces method sets up not only the PID namespace but also the user, network, IPC, and UTS namespaces:
func setNamespaces(daemon *Daemon, s *specs.Spec, c *container.Container) error {
    // user
    // network
    // ipc
    // uts

    // pid
    if c.HostConfig.PidMode.IsContainer() {
        ns := specs.LinuxNamespace{Type: "pid"}
        pc, err := daemon.getPidContainer(c)
        if err != nil {
            return err
        }
        ns.Path = fmt.Sprintf("/proc/%d/ns/pid", pc.State.GetPID())
        setNamespace(s, ns)
    } else if c.HostConfig.PidMode.IsHost() {
        oci.RemoveNamespace(s, specs.LinuxNamespaceType("pid"))
    } else {
        ns := specs.LinuxNamespace{Type: "pid"}
        setNamespace(s, ns)
    }

    return nil
}
All namespace-specific specs are eventually set as arguments to Create when a new container is created:
daemon.containerd.Create(context.Background(), container.ID, spec, createOptions)
All namespace-related settings are made in the two functions above; in this way Docker isolates the container's processes and network from the host through namespaces.
Network
If a Docker container only achieved network isolation from the host through Linux namespaces, but had no way to reach the Internet through the host's network, it would be severely limited. So although Docker can create an isolated network environment with a namespace, the services running inside Docker still need a way to connect to the outside world to be useful.
Each container started with docker run has its own network namespace. Docker provides four different network modes: Host, Container, None, and Bridge.
In this section we introduce Docker's default network mode: bridge mode. In this mode, besides assigning each container an isolated network namespace, Docker also assigns IP addresses to all containers. When the Docker daemon starts on the host, it creates a new virtual bridge called docker0, and all containers started on that host afterwards are, by default, connected to this bridge.
By default, each container is created with a pair of virtual network interfaces (a veth pair) that forms a data channel: one end is placed inside the container, and the other end is attached to the bridge named docker0. We can inspect the current bridge interfaces with the following command:
$ brctl show
bridge name     bridge id               STP enabled     interfaces
docker0         8000.0242a6654980       no              veth3e84d4f
                                                        veth9953b75
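To get a feel for what the bridge driver does for each container, here is a rough hand-rolled equivalent that uses a plain network namespace instead of a real container. All names and addresses below (demo-ns, veth-host, veth-cont, 172.17.0.10) are made up, and the sketch assumes docker0 sits on the default 172.17.0.0/16 subnet with 172.17.0.1 as its address:

$ sudo ip netns add demo-ns
$ sudo ip link add veth-host type veth peer name veth-cont   # create the veth pair
$ sudo ip link set veth-cont netns demo-ns                   # move one end into the "container"
$ sudo ip link set veth-host master docker0                  # attach the other end to the bridge
$ sudo ip link set veth-host up
$ sudo ip netns exec demo-ns ip addr add 172.17.0.10/16 dev veth-cont
$ sudo ip netns exec demo-ns ip link set veth-cont up
$ sudo ip netns exec demo-ns ip route add default via 172.17.0.1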
docker0 assigns a new IP address to each container and sets docker0's own IP address as the container's default gateway. The bridge docker0 is connected to the network interface on the host through rules configured in iptables: all matching requests are forwarded to docker0 by iptables and then distributed to the corresponding container by the bridge.
$ iptables -t nat -L
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
DOCKER     all  --  anywhere             anywhere             ADDRTYPE match dst-type LOCAL

Chain DOCKER (2 references)
target     prot opt source               destination
RETURN     all  --  anywhere             anywhere
When we run docker run -d -p 6379:6379 redis to start a new Redis container on the current machine and then look at iptables' NAT configuration again, we can see a new rule in the DOCKER chain:
DNAT       tcp  --  anywhere             anywhere             tcp dpt:6379 to:192.168.0.4:6379
The above rule forwards TCP packets sent from any source to port 6379 on the current machine to the address 192.168.0.4:6379.
This address is actually the IP address assigned by Docker to the Redis service. If we ping this IP address on the current machine, we will find that it is accessible:
$ ping 192.168.0.4
PING 192.168.0.4 (192.168.0.4) 56(84) bytes of data.
64 bytes from 192.168.0.4: icmp_seq=1 ttl=64 time=0.069 ms
64 bytes from 192.168.0.4: icmp_seq=2 ttl=64 time=0.043 ms
^C
--- 192.168.0.4 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.043/0.056/0.069/0.013 ms
From this series of observations we can infer how Docker exposes a container's internal port and forwards packets: when a Docker container needs to expose a service to the host machine, Docker assigns an IP address to the container and appends a new rule to iptables.
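The rule Docker appends is roughly equivalent to adding a DNAT entry by hand. The sketch below is simplified: the real rules also add masquerading so the container can reach its own published port, but the core of the forwarding looks like this:

$ sudo iptables -t nat -A DOCKER ! -i docker0 -p tcp --dport 6379 -j DNAT --to-destination 192.168.0.4:6379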
When we access 127.0.0.1:6379 with redis-cli from the host's command line, the NAT PREROUTING rule in iptables redirects the packets to 192.168.0.4. The redirected packets then pass through the FILTER configuration in iptables, and finally, in the NAT POSTROUTING phase, the source address is masqueraded as 127.0.0.1. So although it looks as if we are requesting 127.0.0.1:6379 from the outside, what is actually being requested is the port exposed by the Docker container.
$ redis-cli -h 127.0.0.1 -p 6379 ping
PONG
Docker implements network isolation through Linux namespaces and forwards packets with iptables, so Docker containers can gracefully serve the host machine or other containers.
libnetwork
The networking functionality described above is implemented by libnetwork, which has been split out of Docker into a separate project. It provides an implementation for connecting different containers, as well as a Container Network Model that gives applications a consistent programming interface and a network-layer abstraction.
The goal of libnetwork is to deliver a robust Container Network Model that provides a consistent programming interface and the required network abstractions for applications.
The most important concept in libnetwork, the Container Network Model, consists of three major components: Sandbox, Endpoint, and Network.
In the Container Network Model, each container contains a Sandbox that stores the container's network stack configuration, including its interfaces, routing table, and DNS settings; on Linux, the Sandbox is implemented with a network namespace. Each Sandbox can have one or more Endpoints, which on Linux are virtual network interfaces (veth); the Sandbox joins the corresponding Network through an Endpoint, and that Network can be the Linux bridge mentioned above or a VLAN.
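These objects can also be observed from the command line: creating a user-defined bridge network and attaching a container to it results in a Network, an Endpoint for the container, and a Sandbox holding the container's network stack. The names demo-net and demo-redis below are made up:

$ docker network create --driver bridge demo-net
$ docker run -d --name demo-redis --network demo-net redis
$ docker network inspect demo-net      # the container appears with its Endpoint and IP address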
For more information about libnetwork or the Container Network Model, read Design · libnetwork, or go through the source code to see how the drivers for different operating systems implement the Container Network Model.
Mount points
Although we have solved process and network isolation through Linux namespaces, so that a Docker process cannot access other processes on the host and its network access is restricted, processes inside a Docker container could still access or modify other directories on the host machine, which is not what we want.
To create an isolated mount-point namespace for a new process, we need to pass CLONE_NEWNS to the clone function, so that the new process gets a copy of the parent process's mount points. Without CLONE_NEWNS, all of the child's reads and writes to the file system are synchronized with the parent and with the entire host file system.
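Again, unshare(1) gives a rough way to see this behavior without Docker: with the default private propagation, a mount made inside a new mount namespace never shows up on the host (the prompt and paths below are illustrative):

$ sudo unshare --mount /bin/bash
root@host:/# mount -t tmpfs tmpfs /mnt     # only visible inside the new mount namespace
root@host:/# exit
$ ls /mnt                                  # still empty on the host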
To start a container, we must provide it with a root file system (rootfs). The container uses this root file system when creating new processes, and all binaries are executed inside it.
To start a container properly, we need to mount those specific directories in rootfs; besides these directories, we also need to set up some symbolic links to make sure system I/O works correctly.
Finally, to ensure that the container process cannot access other directories on the host machine, we also need to change the root of the directory tree that the process can see, using the pivot_root or chroot functions provided by libcontainer:
// pivot_root
put_old = mkdir(...);
pivot_root(rootfs, put_old);
chdir("/");
unmount(put_old, MS_DETACH);
rmdir(put_old);

// chroot
mount(rootfs, "/", NULL, MS_MOVE, NULL);
chroot(".");
chdir("/");
At this point, we have mounted the directories we need into the container while preventing the container process from accessing other directories on the host machine, ensuring the isolation of different file systems.
Why Docker does not simply use chroot to ensure that the current process cannot access the host's directories is, frankly, a question the author cannot answer with certainty. The Docker codebase is so large that it is hard to know where to start, and searching on Google only turned up unanswered questions and answers that conflict with the description in the spec. If readers have a clear answer, please leave a comment below the blog; thank you very much.
chroot
In Linux, the directory tree starts at /, the root directory. chroot changes what a process sees as the root directory, and by changing it we can restrict a user's access: inside the new root, the structure of the old root directory is unreachable, which creates a directory tree that is completely isolated from the original system.
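As a minimal sketch (it assumes a statically linked busybox binary is available on the host; the directory name new-root is made up), we can build such an isolated directory tree by hand:

$ mkdir -p new-root/bin
$ cp /bin/busybox new-root/bin/
$ for cmd in sh ls pwd; do ln -s busybox new-root/bin/$cmd; done
$ sudo chroot new-root /bin/sh
# ls /
bin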
The part about chroot comes from the understanding chroot article, which you can read for more details.
CGroups
Linux namespaces let us isolate a newly created process's file system, network, and process tree from the host machine, but namespaces cannot provide isolation of physical resources such as CPU or memory. If several "containers" that know nothing about each other or about the host run on the same machine, together they consume the host's physical resources.
If one of these containers is running a CPU-intensive task, the performance and efficiency of the other containers suffer, so the containers end up affecting each other and fighting over resources. Control Groups (CGroups for short) are what isolate the host's physical resources, such as CPU, memory, disk I/O, and network bandwidth.
Each CGroup is a group of processes constrained by the same criteria and parameters. Different CGroups form a hierarchy, which means a child CGroup can inherit from its parent some of the criteria and parameters that restrict resource usage.
Linux CGroups can allocate resources, such as CPU time, memory, and network bandwidth, to a group of processes. Beyond allocation, CGroups also provide functions such as resource limiting, prioritization, accounting, and control (for example, suspending and resuming processes).
In a CGroup, every task is a process of the system, and a CGroup is a group of processes divided according to some criterion. All resource control in CGroups is implemented with the CGroup as the unit, and a process can join a CGroup, or move from one CGroup to another, at any time.
– CGroup introduction, application examples, and working principles
Linux implements CGroups through the file system. We can use the following command to see which subsystems are present on the current system:
$ lssubsys -m
cpuset /sys/fs/cgroup/cpuset
cpu /sys/fs/cgroup/cpu
cpuacct /sys/fs/cgroup/cpuacct
memory /sys/fs/cgroup/memory
devices /sys/fs/cgroup/devices
freezer /sys/fs/cgroup/freezer
blkio /sys/fs/cgroup/blkio
perf_event /sys/fs/cgroup/perf_event
hugetlb /sys/fs/cgroup/hugetlb
Most Linux distributions have very similar subsystems. cpuset, cpu, and the rest above are called subsystems because they can allocate resources to the corresponding control groups and limit their usage.
If we want to create a new cgroup, we only need to create a new folder under the subsystem in which we want to allocate or limit resources; many files then automatically appear in that folder. If Docker is installed on the Linux machine, you will notice that every subsystem's directory contains a folder named docker:
$ ls cpu
cgroup.clone_children
...
cpu.stat
docker
notify_on_release
release_agent
tasks
$ ls cpu/docker/
9c3057f1291b53fd54a3d12023d2644efe6a7db6ddf330436ae73ac92d401cf1
cgroup.clone_children
...
cpu.stat
notify_on_release
release_agent
tasks
The 9c3057xxx folder is actually a Docker container we are running. When this container starts, Docker creates a CGroup for it with the same identifier as the container. On the current host, this CGroup has the following hierarchy:
Each CGroup has a tasks file, which stores the PIDs of all processes in the current control group. In the subsystem responsible for CPU, the cpu.cfs_quota_us file limits CPU usage: with a value of 50000 (against the default period of 100000 microseconds), the CPU usage of all processes in the control group cannot exceed 50%.
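As a rough sketch of doing this by hand with the cgroup v1 layout used in this article (the group name demo is made up):

$ sudo mkdir /sys/fs/cgroup/cpu/demo
$ echo 50000 | sudo tee /sys/fs/cgroup/cpu/demo/cpu.cfs_quota_us    # 50% of one CPU with the default 100000us period
$ echo $$ | sudo tee /sys/fs/cgroup/cpu/demo/tasks                  # move the current shell into the control group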
If a system administrator wants to control the resource usage of a particular Docker container, they can find the corresponding child control group under docker's parent control group and change the contents of its files. We can also simply pass parameters when running the container and let the Docker process change the contents of the corresponding files for us:
$ docker run -it -d --cpu-quota=50000 busybox
53861305258ecdd7f5d2a3240af694aec9adb91cd4c7e210b757f71153cdd274
$ cd 53861305258ecdd7f5d2a3240af694aec9adb91cd4c7e210b757f71153cdd274/
$ ls
cgroup.clone_children cgroup.event_control cgroup.procs cpu.cfs_period_us cpu.cfs_quota_us cpu.shares cpu.stat notify_on_release tasks
$ cat cpu.cfs_quota_us
50000
When we use Docker to stop a running container, the folder for its child control group is also removed by the Docker process. In the end, Docker only performs some file operations when using CGroups, creating folders and changing file contents, yet CGroups do solve the problem of limiting the resources used by child containers: a system administrator can reasonably allocate resources to multiple containers without the containers grabbing resources from each other.
UnionFS
Linux namespaces and control groups solve different isolation problems: the former isolates the process tree, the network, and the file system, while the latter isolates CPU, memory, and other physical resources. But there is another very important problem Docker needs to solve, namely images.
What exactly an image is, and how it is composed and organized, confused the author for quite some time after first using Docker. With docker run we can very easily download a Docker image from a remote registry and run it locally.
A Docker image is essentially a compressed archive. We can use the following command to export the files in a Docker image:
$ docker export $(docker create busybox) | tar -C rootfs -xvf -
$ ls
bin dev etc home proc root sys tmp usr var
You can see that the directory structure in the busybox image is not very different from the contents of the root directory of a Linux system, so a Docker image really is just an archive of files.
Storage Drivers
Docker uses a series of different storage drivers to manage the file systems inside images and to run containers. These storage drivers are a bit different from Docker volumes: the storage engine manages storage that can be shared between containers.
To understand which storage drivers Docker uses, we first need to understand how Docker builds and stores images, and how Docker images are used by containers. Every image in Docker consists of a series of read-only layers, and each command in a Dockerfile creates a new layer on top of the existing read-only layers:
FROM ubuntu:15.04
COPY . /app
RUN make /app
CMD python /app/app.py
Each layer makes only a very small change to the current image; the Dockerfile above builds an image with four layers:
When a container is created from the image by the docker run command, a writable layer, namely the container layer, is added on top of the image's layers. All changes made to the running container are actually made to this read-write layer.
The difference between a container and an image is that all images are read-only, and each container is actually an image plus a read-write layer, meaning that the same image can correspond to multiple containers.
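Both facts can be observed from the command line: docker history lists the read-only layers an image is made of, and docker diff shows that a running container's changes live only in its read-write layer. The container name demo below is made up:

$ docker history ubuntu:15.04           # the read-only layers of the image
$ docker run -d --name demo ubuntu:15.04 sleep 1000
$ docker exec demo touch /tmp/hello
$ docker diff demo                      # only the container layer has changed
C /tmp
A /tmp/hello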
AUFS
UnionFS is a file system service designed for Linux that unions multiple file systems at the same mount point. AUFS, short for Advanced UnionFS, is an improved version of UnionFS that provides better performance and efficiency.
As a union file system, AUFS can union layers from different directories into the same directory; these directories are called branches in AUFS, and the whole process of combining them is called a union mount:
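We can reproduce a tiny union mount by hand, assuming the aufs kernel module is available; the directory names lower, upper, and merged are made up:

$ mkdir lower upper merged
$ echo from-lower > lower/a.txt
$ sudo mount -t aufs -o br=./upper=rw:./lower=ro none ./merged
$ cat merged/a.txt
from-lower
$ echo changed > merged/a.txt     # copy-up: the modified file is written to the writable upper branch
$ ls upper
a.txt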
Each image layer or container layer is a subdirectory of /var/lib/docker/; in AUFS's case, the contents of all image layers and container layers are stored in the /var/lib/docker/aufs/diff/ directory:
$ ls /var/lib/docker/aufs/diff/00adcccc1a55a36a610a6ebb3e07cc35577f2f5a3b671be3dbc0e74db9ca691c 93604f232a831b22aeb372d5b11af8c8779feb96590a6dc36a80140e38e764d8
00adcccc1a55a36a610a6ebb3e07cc35577f2f5a3b671be3dbc0e74db9ca691c-init 93604f232a831b22aeb372d5b11af8c8779feb96590a6dc36a80140e38e764d8-init
019a8283e2ff6fca8d0a07884c78b41662979f848190f0658813bb6a9a464a90 93b06191602b7934fafc984fbacae02911b579769d0debd89cf2a032e7f35cfa
...
/var/lib/docker/aufs/layers/ stores the metadata of the image layers, one file per layer, and /var/lib/docker/aufs/mnt/ contains the mount points of the images or containers, which Docker finally assembles with a union mount.
The picture above illustrates the assembly process well: each image layer is built on top of another image layer, and all image layers are read-only; only the topmost layer of each container can be written directly by the user. This way of assembling a container from namespaces, control groups, rootfs, and so on provides a great deal of flexibility, and the read-only image layers can also be shared to reduce disk usage.
Other Storage Drivers
AUFS is only one of the storage drivers Docker can use. Besides AUFS, Docker also supports several others, including devicemapper, overlay2, zfs, and vfs. In recent versions of Docker, overlay2 has replaced aufs as the recommended storage driver, but on machines without an overlay2 driver, aufs is still used as Docker's default.
Different storage drivers store images and container files in completely different ways. Interested readers can find the details in the official Docker documentation, Select a Storage Driver.
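If you want to pick the driver explicitly, it can usually be set in the daemon's configuration file; a sketch (overlay2 requires appropriate kernel and backing file system support):

$ cat /etc/docker/daemon.json
{
  "storage-driver": "overlay2"
}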
If you want to check which storage driver Docker is using on the current system, just run the following command:
$ docker info | grep Storage
Storage Driver: aufs
Since the author's Ubuntu machine does not have the overlay2 storage driver available, aufs is used as Docker's default storage driver.
Conclusion
Docker has become a very mainstream technology and is used in the production environments of many mature companies, but its core technologies have existed for many years. Linux namespaces, control groups, and UnionFS are the three technologies that underpin Docker's current implementation, and they are the most important reasons Docker could come into being.
While studying how Docker is implemented, the author consulted a lot of material and learned a great deal about the Linux operating system. However, because today's Docker codebase is so large, it is very hard to fully understand Docker's implementation details from the source alone. If you are really interested in the details, you can start from the source code of Docker CE to understand Docker's internals.
References
- Chapter 4. Docker Fundamentals · Using Docker by Adrian Mouat
- Techniques Behind Docker
- Docker overview
- Unifying filesystems with union mounts
- Docker basic technology: AUFS
- Resource Management Guide
- Kernel Korner – Unionfs: Bringing Filesystems Together
- Union file systems: Implementations, part I
- Improving Docker with Unikernels: Introducing HyperKit, VPNKit and DataKit
- Separation Anxiety: A Tutorial for Isolating Your System with Linux Namespaces
- Understanding chroot
- Linux Init Process / PC Boot Procedure
- Docker network details, with pipework source code reading and practice
- Understanding container communication
- Docker Bridge Network Driver Architecture
- Linux Firewall Tutorial: IPTables Tables, Chains, Rules Fundamentals
- Traversing of tables and chains
- Analysis of the execution flow of Docker's networking part
- libnetwork Design
- Profiling Docker file systems: Aufs versus Devicemapper
- Linux – understanding the mount namespace & clone CLONE_NEWNS flag
- Docker: namespace resource isolation
- Infrastructure for container projects
- The Spec, libcontainer
- Docker basic technology: Linux Namespace
- Docker basic technology: Linux CGroup
- Linux UnionFS: Write Docker yourself
- Introduction to Docker
- Understand images, containers, and storage drivers
- Use the AUFS storage driver