Original address: huweicai.com/run-linux-c…

The container

The emergence of container technology has completely changed how applications are delivered and deployed. What gets delivered is no longer just code, but a whole bundle of infrastructure that can run anywhere. As the slogan of Docker, the best-known container product, puts it, this is a new era:

Accelerate how you build, share and run modern applications.

A container is essentially a package that contains a business service and its dependencies. For example, I have a Java service that relies on JDK 14.0.1 and several external JAR packages, and it also depends on some operating-system features that require the Debian Buster distribution. We can put all of this into an image, essentially a static container: anything that affects the runtime goes into the image, which makes it very portable. When the container runs, it is completely isolated from the underlying machine and from other containers, which avoids many potential conflicts. Portability and isolation are the core features driving services to become increasingly containerized.

The container world is booming. Docker, containerd, Kata, and other container runtimes keep popping up, but the core technologies supporting them are the same.

In this article, starting from an ordinary Linux system (the examples use CentOS 7.9; adjust the package-management commands for other distributions) and typing every command by hand, we will take the alpine:3.11 image as an example and, step by step, run a small but fully functional container, connecting all of the core technologies underlying containers along the way.

Download the image

An image is a static container, essentially a bunch of binaries packed together, and there are several image formats for how they are organized. Currently, there are three main image formats:

  • Docker V2 Schema 1: kept mainly for compatibility with older versions
  • Docker V2 Schema 2: the format used by current versions of Docker
  • OCI: the industry-standard image format published by the Open Container Initiative, based on Docker V2 Schema 2

Here, the most common Docker V2 Schema 2 image format is taken as an example. An image will have a manifest.json image description file to describe information such as file layers, container entry commands, version and so on.

We can construct the manifest URL for the alpine:3.11 image directly. Note that the request must carry a header declaring support for the new image format, otherwise the registry will fall back to delivering the V1 manifest for compatibility. Also, the NetEase mirror is used here, because pulling directly from Docker Hub requires obtaining a JWT token first, which is slightly more troublesome:

$ curl -H "Accept: application/vnd.docker.distribution.manifest.v2+json" https://hub-mirror.c.163.com/v2/library/alpine/manifests/3.11

```
{
   "schemaVersion": 2,
   "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
   "config": {
      "mediaType": "application/vnd.docker.container.image.v1+json",
      "size": 1470,
      "digest": "sha256:44dc5a8658dc159bb1c78f9285feead8e0d375d13300f10e647152abf5f3c329"
   },
   "layers": [
      {
         "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
         "size": 2815346,
         "digest": "sha256:9b794450f7b6db7c944ba1f4161edb68cb535052fe7db8ac06e613516c4a658d"
      }
   ]
}
```
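For reference, the same manifest can be pulled straight from Docker Hub once you have a bearer token. A minimal sketch, assuming the public auth endpoint is reachable and jq is installed:

```
# Request an anonymous pull token for library/alpine
TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:library/alpine:pull" | jq -r .token)

# Fetch the manifest with the token
curl -s \
  -H "Authorization: Bearer $TOKEN" \
  -H "Accept: application/vnd.docker.distribution.manifest.v2+json" \
  https://registry-1.docker.io/v2/library/alpine/manifests/3.11
```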

The layers field in the manifest lists the file object for each layer of the container, each corresponding to a tar package. When the container runs, these layers are combined through a union file system. Our alpine:3.11 is a base image with only one layer.

We can simply construct the URL, download the layer, and extract it into an alpine directory:

wget -O alpine.tar \
https://hub-mirror.c.163.com/v2/library/alpine/blobs/\
sha256:9b794450f7b6db7c944ba1f4161edb68cb535052fe7db8ac06e613516c4a658d

mkdir -p alpine && tar xf alpine.tar -C alpine

Note: If the NetEase mirror fails, you can download the rootfs directly from Alpine’s website or elsewhere:

Official website address:

Dl-cdn.alpinelinux.org/alpine/v3.1…

Tsinghua Mirror:

Mirrors.tuna.tsinghua.edu.cn/alpine/v3.1…

Simple operation

Looking at the directory we just extracted, this is Alpine’s rootfs:
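A quick listing (the exact contents may vary slightly between minirootfs versions):

```
$ ls alpine/
bin    etc    lib    mnt    proc   run    srv    tmp    var
dev    home   media  opt    root   sbin   sys    usr
```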

We can use chroot to change the root directory and isolate the file view, then start a shell from the Alpine image:

chroot . /bin/sh --login

Does it suddenly feel like we’re entering a container through a shell?

There are many other types of resources that we need to deal with further:

  • File system: df -h should show only the container’s own mount points
  • Network: ifconfig should show only the container’s own network interfaces
  • Process information: ps aux should show only the container’s own processes
  • CPU and memory: the container should be given a fixed quota so it cannot affect other containers

Let’s continue the implementation step by step.

Create a mount point

Although we could perform our business operations directly on the image, every container started from it would then need its own full copy. Container products therefore use a union file system, which stacks multiple file layers and presents a single merged view to the user. All of the user’s operations, such as modifications, deletions, and additions, only affect the top layer; the lower layers stay read-only, which lets multiple containers share the same image and reduces the cost of transferring and storing files.

Just like regular file systems, there are many kinds of union file systems, such as AUFS, devicemapper, and ZFS. Here we take OverlayFS, the one recommended by Docker, as the example. We create a new directory as the upper layer, take the directory holding our Alpine image as the lower layer, and merge them into one file system:

#Make sure the overlay module is loaded
modprobe overlay
#OverlayFS also needs an upper directory and a temporary work directory
mkdir ./alpine-workdir ./tmp
#Mount
mount -t overlay \
  -o lowerdir=./alpine,upperdir=./alpine-workdir,workdir=./tmp \
  overlay ./alpine

Here we prepare the mount point for the container, which we will isolate later.

Note: If the system does not support overlay file systems, you can use another union file system instead, or simply bind-mount the directory onto itself to create a mount point.
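As a quick sanity check of the copy-on-write behavior (assuming the overlay mount above succeeded; the file name is just an example): writes made through the merged view land in the upper layer, while the lower image layer stays untouched.

```
# Create a file through the merged mount point
echo hello > ./alpine/cow-test
# It is stored in the upper (writable) layer, not in the lower image layer
ls ./alpine-workdir/cow-test
cat ./alpine-workdir/cow-test
```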

Isolation

Process & host name

To support containerization, the Linux community implemented a resource-view partitioning mechanism: namespaces. It partitions operating-system-managed resources by process group, so that a process can only see the resources associated with its own namespace. Six kinds of resources can be isolated: mount points (mnt), hostname and domain name (UTS), System V IPC, network (net), process IDs (pid), and users (user).

At the code level, the kernel exposes namespace functionality through the clone, unshare, and setns system calls. But we are not going to write code for this: util-linux ships an unshare command-line tool that lets us use the same functionality by hand, and almost every distribution includes the package by default.

Unshare can be used as follows:

$ unshare --help

Usage:
 unshare [options] <program> [<arguments>...]

Run a program in new namespaces, unshared from the current ones.

 -m, --mount               unshare the mount namespace
 -u, --uts                 unshare the UTS namespace (hostname)
 -i, --ipc                 unshare the System V IPC namespace
 -n, --net                 unshare the network namespace
 -p, --pid                 unshare the PID namespace
 -U, --user                unshare the user namespace
 -f, --fork                fork a child process to run the program instead of running it directly
     --mount-proc[=<dir>]  after unsharing, mount a fresh (isolated) proc filesystem before running the program
 -r, --map-root-user       map the current user to root
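As a tiny standalone check of what unshare does (run as root; the hostname value is just an example), isolating only the UTS namespace lets us change the hostname without affecting the host:

```
# Change the hostname inside a new UTS namespace
unshare -u /bin/sh -c 'hostname container-demo && hostname'
# -> container-demo

# The host's hostname is unchanged
hostname
```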

Then we can isolate these resources with unshare and isolate the file view with chroot:

unshare -muinp -f chroot . /bin/sh --login

Note that we need the -f argument so that unshare forks a child to run our shell: by design, the process calling unshare stays in the original PID namespace, and only its children are placed into the new one.

The namespaces are now isolated, but we still need to mount a fresh /proc inside the container so that tools like ps and df see the container’s own processes, CPU, memory, and other runtime information rather than the host’s:

/bin/mount -t proc proc /proc
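If everything worked, the process view is now isolated; the output should look roughly like this (exact PIDs may differ):

```
/ # ps aux
PID   USER     TIME  COMMAND
    1 root      0:00 /bin/sh --login
    2 root      0:00 ps aux
```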

We have now isolated the new shell process, but there are two minor issues that need to be addressed further:

  1. Although the mount table is isolated, the new namespace starts with a verbatim copy of the old namespace’s mount table. All the previous mount points are therefore still visible inside the container, even though later changes in either namespace no longer affect the other
  2. Although the network namespace is isolated, the new namespace contains no network devices at all, so further network configuration is required

File mount

Linux provides the pivot_root system call, which swings the whole mount tree around a specified directory: the given subdirectory becomes the new root, and the original root is remounted under a subdirectory. After unmounting that old root, we are left with a “clean” mount table containing only the mount points under our own directory, and we no longer need chroot for file-view isolation.

cd alpine/
#Start the host machine bash under the new namespace
unshare -umpn -f bash
#Create a directory to mount the old root
mkdir -p old-root
#Modify the root mount point to mount the original root to old-root
pivot_root . old-root/
#Replace the current bash process start with sh under alpine
exec /bin/sh --login
#Mount the Proc system
/bin/mount -t proc proc /proc
#Unmount the old root
umount -l old-root/

We can look again with the df command and see that the only mount points left in the container are those for our Alpine directory, as expected.
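Something like the following, with sizes depending on the host disk:

```
/ # df -h
Filesystem                Size      Used Available Use% Mounted on
overlay                  40.0G      2.1G     35.8G   6% /
```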

The network configuration

In the container network model there is a host network mode, where the container shares the host’s network namespace with no isolation at all, so strictly speaking we could build a container without any of the following steps. But if we want a normal container in bridge network mode, with its own independent IP, then we have this last piece of work to do.

With Linux namespace, we can isolate a network space with independent network devices, IP protocol stack, routing table, and firewall rules, but as we mentioned above, this new network space is empty and needs to be configured.

The physical layer

We first need to create a network namespace and then give it a network interface. In theory we could simply move the host’s physical NIC from the default namespace into the new one, which would give the container access to the outside world, but it would also cut the host off the network (unless the host happens to have more than one physical NIC).

Linux provides veth (virtual Ethernet device) pairs, which act as a tunnel between two network namespaces so that traffic can cross from one to the other. We can create a veth pair, leave one end in the default namespace, put the other end into our new namespace, and thereby connect the two.

#Create the network namespace
ip netns add ns0
#Create a veth pair
ip link add veth0 type veth peer name veth1
#Move veth1 into the new network namespace
ip link set veth1 netns ns0

The link layer

At this point the two network namespaces are only connected at the “physical layer”. Physical NICs are joined at the link layer by a switch; our virtual NICs likewise need a virtual switch, which is what a Linux bridge is. In Docker’s container network, every container is attached to the docker0 bridge. This way virtual devices can reach each other directly, as if plugged into the same switch, without going through a router.

#Install the bridge management tool
yum install -y bridge-utils
#Create the bridge
brctl addbr bridge0
#Attach veth0 to the bridge
brctl addif bridge0 veth0
#Bring the host-side veth up so traffic can flow
ip link set veth0 up

The network layer

Next, we need to connect the network layer. We have to give the container an IP address, and the bridge needs one too; any non-conflicting subnet will do. We also need to configure the container’s routing table so that all of its outbound traffic is forwarded to the bridge.

#Configure the bridge IP address, using the 172.18.0.0/16 subnet
ifconfig bridge0 172.18.0.1
#Configure the container IP address
ip netns exec ns0 ifconfig veth1 172.18.0.2
#Configure the container's default route to point at the bridge
ip netns exec ns0 ip route add default via 172.18.0.1

If there are multiple containers, i.e. multiple namespaces, the network between them is now connected: each namespace has a veth pair linking it with the default namespace, where the physical NIC lives, and the default-namespace end of every pair is taken over by the bridge0 bridge, which stitches the containers’ networks together. The bridge and the physical NIC sit in the same namespace and can see each other at the link layer, while at the network layer they are connected by a direct route, the routing-table entry generated automatically when we assigned the bridge its IP address.
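We can check that directly connected route on the host; the host’s own routes (the eth0 lines below are placeholders) will differ, but the bridge0 entry should be there:

```
$ ip route
default via 10.0.0.1 dev eth0
10.0.0.0/24 dev eth0 proto kernel scope link src 10.0.0.10
172.18.0.0/16 dev bridge0 proto kernel scope link src 172.18.0.1
```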

Now the container can, in theory, ping the host: ping 172.18.0.1. But it still cannot reach the outside world, because 172.18.0.2 is only a private LAN address; once a packet leaves, external machines have no route to send a reply back to us. So we need a tool such as iptables to perform NAT, rewriting the source address of outgoing IP packets to the host’s address before they are sent out:

#The IP packet forwarding function of the host is enabled
sysctl -w net.ipv4.ip_forward=1
#Set up NAT; MASQUERADE is a form of dynamic source NAT
iptables -t nat -A POSTROUTING -s 172.18.0.0/16 ! -o bridge0 -j MASQUERADE

The application layer

At this point we can ping external machines, but there is one small problem left: DNS is not configured. We need to copy the host’s DNS configuration into the container before we isolate the file view:

#Copy the host DNS configuration to the container
cp /etc/resolv.conf alpine/

At this point, our network configuration is complete.
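Before wiring everything together, we can sanity-check connectivity from the new namespace on the host side (assuming the steps above succeeded and the host firewall allows forwarding; 223.5.5.5 is just an example public address):

```
# Ping the bridge from inside the network namespace
ip netns exec ns0 ping -c 2 172.18.0.1
# Ping an external address through NAT
ip netns exec ns0 ping -c 2 223.5.5.5
```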

Resource constraints

For a fully-fledged container, resource limiting is something that cannot be sidestepped. Otherwise, a single container with a bug, say an endless loop that saturates the CPU, could break every other container and even threaten the host itself; nobody would dare to use a container runtime with no resource-limiting mechanism.

Linux provides the cgroups (control groups) mechanism to limit, account for, and isolate the resources of a group of processes. Its implementation is not hard to guess: resources are already handed out with the process as the smallest unit, and the kernel can account for their usage, so when allocating CPU time, memory, and so on, the kernel only needs the extra step of checking whether the process is constrained.

The user-facing interface of cgroups is implemented as a file system mounted under /sys/fs/cgroup. Each subdirectory below it corresponds to a resource-control subsystem:

$ ls /sys/fs/cgroup
blkio  cpu  cpuacct  cpu,cpuacct  cpuset  devices  freezer hugetlb
memory  net_cls  net_cls,net_prio  net_prio  perf_event  pids  systemd


To limit the resources of a group of processes, create a subdirectory under the relevant resource-control subsystem and write the process IDs into its tasks file. Doing this by hand for every process is tedious, so we can use the cgroups toolset instead:

#Limiting CPU usage
cd /sys/fs/cgroup/cpu

#To create a directory, Cgroups will automatically generate a lot of information for us
mkdir -p my-container
#Limit resource utilization to 20%, default value 100000
echo 20000 > my-container/cpu.cfs_quota_us 

#Limiting memory usage
cd /sys/fs/cgroup/memory/
mkdir -p my-container
#Limit memory to 200M
echo "200M" > my-container/memory.limit_in_bytes

#Install the CGroups toolset
yum install -y libcgroup libcgroup-tools

#Work with the Cgroups tool to eliminate the process of manually writing PID
cgexec -g cpu,memory:my-container bash
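A quick way to see the CPU limit in action (a rough sketch; top output and exact percentages will vary):

```
# Inside the shell started by cgexec above, burn one CPU core
while :; do :; done &

# From another terminal on the host, the busy loop should hover around 20% CPU
top -b -n 1 | grep bash
```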

Conclusion

We have now implemented a container with its own network, processes, and file space, isolated from the host and resource-limited. The underlying technologies that containers rely on are basically these:

  • The namespace mechanism isolates processes, IPC, hostname, network devices, and the network protocol stack
  • pivot_root isolates the file mount points & file view
  • A veth pair connects the network namespaces & a bridge wires up the network & iptables performs NAT for external access
  • cgroups implements resource limits
  • A union file system keeps the underlying image read-only and shareable

Summary of all commands:

##########################
# Download and extract the image
wget -O alpine.tar \
https://hub-mirror.c.163.com/v2/library/alpine/blobs/\
sha256:9b794450f7b6db7c944ba1f4161edb68cb535052fe7db8ac06e613516c4a658d
# Or:
# wget -O alpine.tar \
# https://mirrors.tuna.tsinghua.edu.cn\
# /alpine/v3.11/releases/x86_64/alpine-minirootfs-3.11.0-x86_64.tar.gz
mkdir -p alpine && tar xf alpine.tar -C alpine

##########################
# Network configuration
# Create the network namespace
ip netns add ns0
# Create a veth pair
ip link add veth0 type veth peer name veth1
# Move veth1 into the new network namespace
ip link set veth1 netns ns0
# Install the bridge management tool
yum install -y bridge-utils
# Create the bridge and attach veth0 to it
brctl addbr bridge0
brctl addif bridge0 veth0
ip link set veth0 up
# Configure the bridge IP address, using the 172.18.0.0/16 subnet
ifconfig bridge0 172.18.0.1
# Configure the container IP address
ip netns exec ns0 ifconfig veth1 172.18.0.2
# Configure the container's default route
ip netns exec ns0 ip route add default via 172.18.0.1
# Enable IP packet forwarding on the host
sysctl -w net.ipv4.ip_forward=1
# NAT; MASQUERADE is a form of dynamic source NAT
iptables -t nat -A POSTROUTING -s 172.18.0.0/16 ! -o bridge0 -j MASQUERADE
# Copy the host DNS configuration into the container
cp /etc/resolv.conf alpine/

##########################
# Union file system mount
# Make sure the overlay module is loaded
modprobe overlay
# OverlayFS also needs an upper directory and a work directory
mkdir ./alpine-workdir ./tmp
# Mount
mount -t overlay \
  -o lowerdir=./alpine,upperdir=./alpine-workdir,workdir=./tmp \
  overlay ./alpine

##########################
# Resource limits
# Limit CPU usage
cd /sys/fs/cgroup/cpu
# Create a directory; cgroups generates the control files automatically
mkdir -p my-container
# Limit CPU utilization to 20% (the period defaults to 100000)
echo 20000 > my-container/cpu.cfs_quota_us
# Limit memory usage
cd /sys/fs/cgroup/memory/
mkdir -p my-container
# Limit memory to 200M
echo "200M" > my-container/memory.limit_in_bytes
# Install the cgroups toolset
yum install -y libcgroup libcgroup-tools

##########################
# File mount point isolation && startup
cd alpine/
# Start a bash shell with the resource limits, the network namespace, and the other namespaces applied
cgexec -g cpu,memory:my-container ip netns exec ns0 unshare -muip -f /bin/bash
# Create a directory to hold the old root
mkdir -p old-root
# Change the root mount point, mounting the original root at old-root
pivot_root . old-root/
# Replace the current bash process with Alpine's sh
exec /bin/sh --login
# Mount the proc filesystem
/bin/mount -t proc proc /proc
# Unmount the old root
umount -l old-root/

##########################
# Verify
pwd
ls
df -Th
netstat -a
ifconfig
ping -c 2 172.18.0.1
ping -c 2 baidu.com

References

  1. Bocker Github.com/p8952/bocke…
  2. Linux Namespace Lwn.net/Articles/53…
  3. unshare Man7.org/linux/man-p…
  4. MyDocker Github.com/xianlubird/…
  5. Docker core technology and implementation principle draveness.me/docker/
  6. UnionFS En.wikipedia.org/wiki/UnionF…
  7. containerd Github.com/containerd/…