Implementing a simple container that emulates Docker in fewer than 200 lines of code (including blank lines, comments, and exception handling) is no idle boast. Container technology is largely built into the Linux kernel, and a few system calls get you most of the way there. Of course, once you account for all kinds of business and political factors, the code can grow to Docker's scale.
The difference between a small company and a large company is that a small company finds a pig and simply kills it. A large company first builds a cage to catch the pig, then defines a process for sharpening the knife, then invents a knife technique (engineers often debate knife techniques for a long time) before killing the pig. The pig-catching cage can also catch fleas; the knife-sharpening tool can also sharpen machetes, even nail clippers; the pig-killing process can also kill chickens. When everything is in place, you just type a kill-pig command. You don't know where the pig is, because that is someone else's responsibility and the code lives in a directory you have never seen; you don't know where the knife is either, because the directory is not visible and the format is not readable; you don't even know what a knife is. The system is so powerful in theory that a group of people took great pains to build it, yet it has never done anything but kill pigs with machetes, has never been tested on chickens, and the code is incomplete. But everyone in the company believes that this is the way to kill a pig. So everyone is busy every day, and the pig stays happy year after year.
So in this series of articles I will focus on how to find the pig, how to hold the knife without cutting yourself, and how to strike more fiercely, and then finish with a live demonstration of actually killing the pig.
The technology involved
Writing a container requires calling only two technologies, Namespace and CGroup, both provided by the Linux kernel. The diagram below is shamelessly borrowed from Brendan Gregg.
The diagram captures a detail that is often overlooked: containers share the kernel. A container is just a group of processes running on the same kernel at the same time, separated by namespaces and limited by cgroups in the resources they can use. VMs, by contrast, share the "hardware", and each VM runs its own independent operating system. That is why a virtual machine is a bootable and reasonably safe isolation technology, while a container is a much more fragile and less secure one.
Namespace is an isolation technology provided by the Linux kernel. It offers six kinds of isolated spaces: Mount (file system mount points), UTS (hostname and domain name), IPC (inter-process communication), PID (process IDs), Network (network devices, protocol stacks, and ports), and User (user and group IDs).
Looking a bit confused? That's okay, here is the explanation in a nutshell.
Anyone who has studied operating system principles knows (you haven't? how dare you work in this business?) that within one kernel all processes share the resources defined by the operating system: host name, domain name, ARP table, routing table, NAT table, file system, users and groups, and process numbers. The host name, for example, is kept by the operating system in a block of memory, so process A can see it and process B can see it too (and can even change it, given permission). Namespace is an isolation technique that lets each process define its own host name: the kernel hands the process a copy of the current host name, and the process may change that copy, but the change affects only itself and is not perceived by other processes. The host name is no longer "global".
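To make the shared, global nature of these resources concrete, here is a tiny sketch of my own (not code from this article): one process changes the host name through libc's sethostname, and another ordinary process immediately sees the new value, because without namespaces there is only one kernel-wide copy.

```python
import ctypes
import os
import socket

# Without namespaces, the host name is a single kernel-wide value.
# Changing it in one process (root required) is visible to every other process.
libc = ctypes.CDLL("libc.so.6", use_errno=True)

new_name = b"shared-everywhere"
if libc.sethostname(new_name, len(new_name)) != 0:
    raise OSError(ctypes.get_errno(), "sethostname failed (are you root?)")

pid = os.fork()
if pid == 0:
    # The child sees the change immediately: the name is global, not per-process.
    print("child sees:", socket.gethostname())
    os._exit(0)
os.waitpid(pid, 0)
print("parent sees:", socket.gethostname())
```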
People often ask whether every application can be containerized. Once you understand Namespace, the answer is easy. Container technology essentially shares the kernel, so any application that needs to modify the kernel cannot be containerized. Applications such as LVS and Open vSwitch, which need to load kernel modules, cannot be turned into containers.
Hello world
Using Namespace is very simple and requires only one API (yes, one, only one): clone.
It creates a new thread (the kernel does not distinguish between threads and processes). The first parameter specifies the thread's code entry point, the second specifies the thread's stack, the third specifies the flag bits, and the fourth specifies the argument passed to the entry function.
The Namespace flags listed above are passed in through the third parameter, the flag bits.
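For reference, clone's C prototype and the namespace flag values look like this. The constants are the standard ones from <linux/sched.h>, written here as Python variables since Python is the language used in this series:

```python
# int clone(int (*fn)(void *), void *child_stack, int flags, void *arg, ...);
#
# Namespace selection happens through the flags argument. Adding SIGCHLD
# lets the parent wait for the child with a plain waitpid.
CLONE_NEWNS   = 0x00020000  # Mount: file system mount points
CLONE_NEWUTS  = 0x04000000  # UTS: hostname and domain name
CLONE_NEWIPC  = 0x08000000  # IPC: System V IPC, POSIX message queues
CLONE_NEWUSER = 0x10000000  # User: user and group IDs
CLONE_NEWPID  = 0x20000000  # PID: process IDs
CLONE_NEWNET  = 0x40000000  # Network: devices, protocol stacks, ports
SIGCHLD       = 17
```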
Let's first test whether UTS (hostname) isolation works. Since the child process makes no recursive calls, a stack size of 1024 bytes should suffice. The os.waitpid(pid, 0) call in the main method is required; without it the parent terminates first and the child exits prematurely.
child_func is the entry point of the child process. In this code we call sethostname to change the host name and then run hostname to verify that the change took effect.
libc is my thin wrapper around the system calls; it is very simple.
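The original listing is not reproduced here, so below is a minimal, self-contained sketch of the same experiment using ctypes against glibc. The variable names, the 1 MiB stack (more generous than the article's, because a Python callback needs room), and the use of os.system to run hostname are my own choices. Run it as root, since creating a UTS namespace needs CAP_SYS_ADMIN.

```python
import ctypes
import os

libc = ctypes.CDLL("libc.so.6", use_errno=True)

CLONE_NEWUTS = 0x04000000  # see the flag table above
SIGCHLD = 17

# clone() wants a pointer to the *top* of the child's stack (stacks grow down).
STACK_SIZE = 1024 * 1024
stack = ctypes.create_string_buffer(STACK_SIZE)
stack_top = ctypes.c_void_p(ctypes.cast(stack, ctypes.c_void_p).value + STACK_SIZE)

CHILD_FUNC = ctypes.CFUNCTYPE(ctypes.c_int)


@CHILD_FUNC
def child_func():
    print("child pid:", os.getpid(), "parent pid:", os.getppid(), flush=True)
    name = b"container"
    if libc.sethostname(name, len(name)) != 0:
        print("sethostname failed:", os.strerror(ctypes.get_errno()), flush=True)
        return 1
    os.system("hostname")  # verify the change from inside the new UTS namespace
    return 0


libc.clone.restype = ctypes.c_int
libc.clone.argtypes = [CHILD_FUNC, ctypes.c_void_p, ctypes.c_int, ctypes.c_void_p]

pid = libc.clone(child_func, stack_top, CLONE_NEWUTS | SIGCHLD, None)
if pid == -1:
    raise OSError(ctypes.get_errno(), "clone failed")

print("parent pid:", os.getpid(), "child pid:", pid)
os.waitpid(pid, 0)  # without this the parent exits and takes the child with it
```

After the script finishes, running hostname on the host should still show the original name, which is exactly the isolation described below.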
Give it a try:
The parent process first prints its own process number and the child's process number; the child then prints its own process number and its parent's process number. Inside the child we call sethostname to change the host name and check the result with hostname. The change does not leak out of the namespace, which we confirm at the end by running hostname in a shell on the host.
There should be a shell
All the above does is modify the hostname once, which is a rather small trick and not very satisfying. What we really want is a shell inside a separate Namespace.
Only two lines of code need to change. The parent process adds the NEW_PID and NEW_IPC flags (the PID and IPC namespaces), and the child process uses execle to run bash; the last argument sets the environment variable PS1, which controls the shell prompt.
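As a sketch of those two changes applied to the hello-world code above (the exact flag combination and the prompt string are my assumptions):

```python
# In child_func: finish by starting a shell instead of just running hostname.
# execle's final argument is the environment; PS1 sets the prompt so the
# container shell is easy to tell apart from the host shell.
os.execle("/bin/bash", "bash", {"PS1": "[container]# "})

# In the main code: ask clone() for PID and IPC namespaces as well.
pid = libc.clone(child_func, stack_top,
                 CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWIPC | SIGCHLD, None)
```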
Run it again and the shell prompt has changed. Running hostname confirms that we are "inside the container". Type exit to leave the container.
Hard to contain your excitement? Don't worry, it gets better. Let's move on to step three: separating the file system.
Complete separation
If you run a few commands such as top, ps or ls in the shell above, you will find that the environment looks almost identical to the host. That is because we have not yet done the most important part: separating the file system.
Docker provides images for Ubuntu and CentOS, but these are not really operating system images; they are better described as root file systems (rootfs).
Containers share the host kernel, so an Ubuntu container and a CentOS container both run on the host's kernel. If you run uname inside Docker, you will find that the kernel version reported by any image is exactly the same as the host's. The different "operating system" images Docker offers are really just different root file systems.
Many demos use a BusyBox rootfs, but that is not nearly flashy enough, so I use CentOS 7 instead.
The real reason is that once we switch the root directory inside the container, all code executed afterwards runs against the new root file system, and our code depends on a Python runtime. So we need a rootfs that ships with Python, and CentOS 7 does. If we wrote this in C or Golang we would not have that limitation.
You can find rootfs downloads in the Dockerfiles that CentOS provides, for example: https://github.com/CentOS/sig-… ocker
Unpack the downloaded files into the /tmp directory.
Separating the file system is a three-step process. First, we mount the /proc file system inside the container; many Linux commands read it (for example, the process list shown by top). Second, we map the current user to a user inside the container, otherwise we get "insufficient permission" errors. Finally, we "switch" the root file system with the pivot_root function.
Don't forget to modify the main method as well, adding three more flags to the flag bits so that the user mapping works.
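The listing itself is not shown here, so the following is a rough, self-contained sketch of the three steps plus the extra clone flags, again via ctypes. The rootfs path, the .old_root directory name, the helper function, and the exact flag combination are my assumptions, and the pivot_root syscall number assumes x86_64; the structure follows the example in the pivot_root(2) man page.

```python
import ctypes
import os

libc = ctypes.CDLL("libc.so.6", use_errno=True)

# Namespace flags (<linux/sched.h>) and mount flags (<sys/mount.h>).
CLONE_NEWNS, CLONE_NEWUTS, CLONE_NEWIPC = 0x00020000, 0x04000000, 0x08000000
CLONE_NEWUSER, CLONE_NEWPID = 0x10000000, 0x20000000
SIGCHLD = 17
MS_BIND, MS_REC, MS_PRIVATE, MNT_DETACH = 4096, 16384, 1 << 18, 2
SYS_PIVOT_ROOT = 155  # x86_64 syscall number; differs on other architectures

ROOTFS = b"/tmp/rootfs"  # where the CentOS 7 rootfs was unpacked
UID, GID = os.getuid(), os.getgid()

STACK_SIZE = 1024 * 1024
stack = ctypes.create_string_buffer(STACK_SIZE)
stack_top = ctypes.c_void_p(ctypes.cast(stack, ctypes.c_void_p).value + STACK_SIZE)
CHILD_FUNC = ctypes.CFUNCTYPE(ctypes.c_int)


def mount(src, target, fstype, flags):
    if libc.mount(src, target, fstype, flags, None) != 0:
        raise OSError(ctypes.get_errno(), "mount %r -> %r failed" % (src, target))


@CHILD_FUNC
def child_func():
    try:
        # Step 2 (done first): map the current user to root inside the new
        # user namespace, otherwise the mounts below fail with EPERM.
        with open("/proc/self/uid_map", "w") as f:
            f.write("0 %d 1" % UID)
        with open("/proc/self/setgroups", "w") as f:
            f.write("deny")
        with open("/proc/self/gid_map", "w") as f:
            f.write("0 %d 1" % GID)

        # Keep our mount changes private to this mount namespace.
        mount(None, b"/", None, MS_REC | MS_PRIVATE)

        # Step 3: pivot_root needs the new root to be a mount point, so
        # bind-mount the rootfs onto itself, then switch the root.
        mount(ROOTFS, ROOTFS, None, MS_BIND | MS_REC)
        old_root = os.path.join(ROOTFS, b".old_root")
        os.makedirs(old_root, exist_ok=True)
        if libc.syscall(SYS_PIVOT_ROOT, ROOTFS, old_root) != 0:
            raise OSError(ctypes.get_errno(), "pivot_root failed")
        os.chdir("/")

        # Step 1: mount a fresh /proc so ps, top and friends only see this
        # PID namespace. Do it while the host's proc (under the old root) is
        # still attached; in a user namespace the kernel may otherwise refuse.
        mount(b"proc", b"/proc", b"proc", 0)
        libc.umount2(b"/.old_root", MNT_DETACH)

        os.execle("/bin/bash", "bash", {"PS1": "[container]# "})
    except Exception as exc:  # a sketch: fail loudly and stop the child
        print("container setup failed:", exc, flush=True)
        os._exit(1)
    return 0


libc.clone.restype = ctypes.c_int
libc.clone.argtypes = [CHILD_FUNC, ctypes.c_void_p, ctypes.c_int, ctypes.c_void_p]

flags = (CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWPID |
         CLONE_NEWNS | CLONE_NEWUSER | SIGCHLD)
pid = libc.clone(child_func, stack_top, flags, None)
if pid == -1:
    raise OSError(ctypes.get_errno(), "clone failed")
os.waitpid(pid, 0)
```

With CLONE_NEWUSER in the mix, this sketch can be started as an ordinary user on kernels that allow unprivileged user namespaces; "root" inside the container is only root within the new user namespace.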
Execute again.
It works just like CentOS 7; you can even run yum, which will tell you it cannot reach the network, because we have not implemented networking yet.
Try adding and deleting a few files. Whatever you do, the resulting data stays firmly inside /tmp/rootfs; in other words, there is no way to touch the host's files from inside the container.
The complete code: https://github.com/fireflyc/mini-docker.