As we all know, Docker got its start at dotCloud in 2013, just seven years ago. If you were around the circle in the early 2013-2015 years, you will naturally know that the original Docker = LXC + AUFS: the former is the so-called Linux container, and the latter is the image storage I want to talk about today.

Millennium: Amazing Live CD

Aside from differentiated interface themes, what a Linux distro is really about comes down to two things:

  • How to install more easily;
  • How to upgrade more easily;

There is, however, one kind of distro that sidesteps both of these things: the Live CD, which ships on a disc or a USB stick and needs neither installation nor modification. Tong Ge, the head of operations at our company, once said:

The first time I saw a Live CD I was shocked.

I certainly agreed; I was one of the shocked students at the time. Knoppix had been around since 2000, and it was based on the famous Debian, which did not officially ship a graphical installer, Debian-Installer (D-I for short), in a stable release until Sarge (3.1) in June 2005; earlier installers used text menus. In that era, a disc you could simply boot straight into a graphical desktop was, for players like us, understandably shocking. The Live CD of that time was the Docker of 13 years later, fully deserving the word "amazing".

You know, it is not easy to fit an entire operating system on a roughly 700MB disc (although it wasn't too hard at first, and my favorite, DSL (Damn Small Linux), eventually managed it in 50MB). Knoppix had a great idea: compress the installed operating system, put it on a CD, and decompress it on the fly. This way, a 700MB CD can hold about 2GB of root file system, enough to run a KDE desktop without any problems. DistroWatch.com shows the distro's influence: it lists a wide range of Knoppix variants.

Evolution: Read-write layer vs. UnionFS

Knoppix's early insistence on never laying a finger on local storage, together with the fact that the ISO 9660 file system used on CD-ROMs is read-only, fits nicely with today's trend of immutable infrastructure. But even today, living without a writable file system is very hard for many Linux applications, since almost any program needs to write configuration files, state, locks, logs, and so on. Knoppix was not writable at birth, so to make it practical you had to manually mount a local hard drive partition, or attach NAS storage to /home or some other mount point, which was a hassle if you wanted it to be more than an emergency boot disc.

If we look back from today, it is easy to point out that overlayfs plus tmpfs makes a writable layer. However, the overlayfs patchset was not submitted for the first time until 2010, and it was not merged into the mainline kernel until 3.18 in 2014 (along the way the Taobao kernel team contributed many patches and found many defects). Of course, there were similar union file systems before overlayfs, and the first one Docker used was AUFS, which appeared in 2006. The A in AUFS now stands for Advanced, but it originally meant Another, as in "Another UnionFS", whose predecessor was, well, UnionFS.

Fifteen years ago, in May 2005, Knoppix creatively introduced UnionFS; a year and a half later, in version 5.1, it switched to the more stable AUFS. The familiar Ubuntu Live CD and Gentoo Live CD also use AUFS. You could say that Live CDs worked out the storage story of Docker and Docker images eight years in advance.

For those unfamiliar with it, a union file system is a file system composed of multiple file systems combined (stacked), which differs from the usual single-tree FHS structure. As shown in the following figure, in the standard directory tree on the left, any one directory, say everything under /usr/local, sits entirely on a single file system. In a UnionFS, the files you write stay on the top layer: your change to /etc/passwd, for example, lands on the top writable layer, while other files are looked up in the layers below. In other words, different files in the same directory may live in different layers. The Live CD's operating system runs just like a natively installed one and can read and write all paths.
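To make the lookup semantics concrete, here is a minimal toy model in Python (not real kernel code; the layer contents are invented for illustration): reads search the layers from top to bottom, and writes always land in the top writable layer, leaving the read-only layers untouched.

```python
class UnionFS:
    """Toy union mount: a writable upper layer over read-only lower layers."""

    def __init__(self, *lower_layers):
        self.upper = {}                    # writable top layer
        self.lowers = list(lower_layers)   # read-only layers, topmost first

    def read(self, path):
        # Lookups go top-down: the first layer containing the path wins.
        for layer in [self.upper, *self.lowers]:
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        # Writes never touch the lower layers.
        self.upper[path] = data

base = {"/etc/passwd": "root:x:0:0", "/bin/sh": "<binary>"}
fs = UnionFS(base)
fs.write("/etc/passwd", "root:x:0:0\nknoppix:x:1000:1000")
print(fs.read("/etc/passwd"))  # served from the writable layer
print(fs.read("/bin/sh"))      # falls through to the read-only layer
```

Note that the read-only base is never modified, which is exactly why the same lower layer can be shared safely.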

Block or file: Cloop vs. SquashFS

Now let's look at the read-only layer that is the foundation of the Live CD. The read-only rootfs existed before Live CDs used a union file system for layering. In the case of Knoppix, this layer could not simply hold a full, uncompressed operating system: in the early 2000s, when everyone used 24x to 40x CD drives, one of the biggest problems for a Live CD was the conflict between a 700MB disc and a sizable desktop operating system.

As we mentioned at the outset, Knoppix's idea was to "compress the distro, serve it on a disc, and decompress it as you go", so that a carefully selected 2GB distro could be squeezed onto one disc. But compression is not free: a file system accesses a block device by offset, and after compression those offsets are no longer where the file system expects them, while decompressing everything into memory just to locate an offset is not practical either.

Back in 2000, in the days of the 2.2 kernel, Knoppix author Klaus Knopper introduced a compressed loop device called cloop, a special format that carries an index so that decompression is transparent to the user. Knoppix's cloop device looks like a 2GB block device; when an application reads data at some offset, the driver simply looks up the index, decompresses the corresponding block, and returns the data to the user.
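The idea can be sketched in a few lines of Python (a toy model, not the actual cloop driver; the block size and zlib compression here are assumptions for illustration): blocks are compressed independently, and an index of offsets lets a read decompress only the one block it needs.

```python
import zlib

BLOCK = 64 * 1024  # hypothetical block size; real images pick their own

def build_image(data):
    """Compress data block by block and record an offset index."""
    index, blob, pos = [], b"", 0
    for i in range(0, len(data), BLOCK):
        comp = zlib.compress(data[i:i + BLOCK])
        index.append((pos, len(comp)))   # where this block starts, and its size
        blob += comp
        pos += len(comp)
    return index, blob

def read_block(index, blob, n):
    """Decompress only block n, as a cloop-style driver would on a read."""
    off, size = index[n]
    return zlib.decompress(blob[off:off + size])

rootfs = bytes(range(256)) * 1024        # a stand-in for the real rootfs
index, blob = build_image(rootfs)
assert read_block(index, blob, 1) == rootfs[BLOCK:2 * BLOCK]
```

Because blocks are compressed independently, a read at any offset costs one index lookup plus one block decompression, never a full-image decompression.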

Although Knoppix brought distros aboard the Live CD ship, many of its successors (Arch, Debian, Fedora, Gentoo, Ubuntu, and the OpenWrt familiar from routers) chose SquashFS rather than cloop files. SquashFS is a file-system-level solution, closer to the application's semantics: it compresses files, inodes, and directories, supports compression block sizes from 4K to 1M, and likewise decompresses on demand via an index. Unlike cloop, when a user accesses a file it decompresses the corresponding blocks directly from the index, without going through another layer of local file system to address the compressed blocks, which is simpler and more direct. In fact there had been calls to switch Knoppix to SquashFS since as early as 2004, and some test data seemed to show that SquashFS performed better, especially on metadata operations.

Cloop’s shortcomings are as follows:

The design of the cloop driver requires that compressed blocks be read whole from disk. This makes cloop access inherently slower when there are many scattered reads, which can happen if the system is low on memory or when a large program with many shared libraries is starting. A big issue is the seek time for CD-ROM drives (~80 ms), which exceeds that of hard disks (~10 ms) by a large factor. On the other hand, because files are packed together, reading a compressed block may thus bring in more than one file into the cache. The effects of tail packing are known to improve seek times (cf. reiserfs, btrfs), especially for small files. Some performance tests related to cloop have been conducted.

Of course, cloop survives on Knoppix to this day despite these arguments, but the debate was effectively settled when SquashFS was merged into the 2.6.29 mainline kernel in 2009. Being available in the kernel out of the box brought overwhelming adoption and better support. SquashFS's advantage lies not only in the distro audience mentioned above, but also in its support for a variety of compression algorithms suited to different scenarios.

Docker: Make Unionfs Great Again

As time went by, Live CDs stopped feeling new and lost their popularity. But technology moves in cycles: the union file systems made famous by Live CDs were made famous all over again by Docker, enjoying a second spring. Strictly speaking, although AUFS supports multiple read-only layers, a normal Live CD needs only one read-only image plus one writable layer for the user. It was the dotCloud folks, led by Solomon Hykes, who did #MUGA (Make Unionfs Great Again): with brilliant imagination they turned an entire file system into the basic unit of a "package".

If you recall, from Slackware in 1993 to RHEL today, distros have done little more than the two things I mentioned at the beginning: installation and upgrading. From RPM to APT/deb to Snappy, the essence of the work once the system is initialized is how to install and upgrade smoothly, keep dependencies satisfied, and not take up too much extra space. The mainstream solution is packages like RPM/deb plus their dependency metadata, but problems like "A depends on B and C, but B conflicts with C" have kept people busy for two decades.

But Docker goes beyond software packages; the Docker folks see it this way:

  • A complete operating system is a package, and it is necessarily self-contained. If the same complete, unchanging operating system environment is kept throughout development, testing, and deployment, dependency problems largely disappear;
  • This whole operating system is immutable, just like a Live CD; we call it an image. A union file system such as AUFS can put a writable layer on top, and at run time applications write into that layer, including any dynamically generated configuration;
  • If several application images share the same parts of the base system, those common parts can sit under the union mount as read-only layers used by different applications. And if two applications depend on different things, they simply use different base layers, with no need to accommodate each other, so no dependency conflicts arise above them;
  • An image can contain multiple layers, which lets applications share data and saves storage and transfer costs.

The rough sketch looks like this:

This way, if all three applications (containers) run on the same machine, none of the shared read-only layers needs to be downloaded more than once.
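A back-of-the-envelope illustration of the saving (image names, layer names, and sizes below are all made up): each image references its layers, and because layers are content-addressed, shared layers are stored and pulled only once.

```python
# Three hypothetical images, each listing its layers from bottom to top.
images = {
    "app-a": ["debian-base", "python-runtime", "app-a-code"],
    "app-b": ["debian-base", "python-runtime", "app-b-code"],
    "app-c": ["debian-base", "nginx", "app-c-static"],
}
# Invented layer sizes, in MB.
layer_size_mb = {
    "debian-base": 120, "python-runtime": 60, "nginx": 25,
    "app-a-code": 5, "app-b-code": 7, "app-c-static": 2,
}

# Without sharing: every image carries full copies of its layers.
naive = sum(layer_size_mb[l] for layers in images.values() for l in layers)
# With sharing: each distinct layer is stored exactly once.
shared = sum(layer_size_mb[l]
             for l in {l for layers in images.values() for l in layers})
print(f"without sharing: {naive} MB, with sharing: {shared} MB")
```

Even in this tiny example, deduplicating the common base layers cuts storage and transfer by more than half.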

Furthermore, another advantage of Docker's layered structure is that it is very developer-friendly. Below is a sample Dockerfile: FROM declares the bottom base layer, and then each RUN or ADD applies an operation to the rootfs, with the result stored as a new intermediate layer; together these finally form an image. In this way, the organization of software dependencies is clearly displayed in the image's layer hierarchy. For example, the following Dockerfile builds a packaging image: it installs the dependency packages and language environment, initializes the user environment for the packaging job, pulls the source code, and finally puts the packaging script into the image. The organization runs from generic to task-specific; the image author tries to make the lower layers as generic as possible, so they can be reused by other images, while the top layer holds the content most directly tied to this image's job. Other developers reading the Dockerfile already know roughly what is in the image, what it does, and whether they can borrow from it. This image design is one of the cleverest parts of Docker, and a big reason everyone is willing to credit Solomon's work with great DX (Developer Experience).

FROM       debian:jessie
MAINTAINER Hyper Developers <[email protected]>

RUN apt-get update && \
    apt-get install -y autoconf automake pkg-config dh-make cpio git \
        libdevmapper-dev libsqlite3-dev libvirt-dev python-pip && \
    pip install awscli && \
    apt-get clean && rm -fr /var/lib/apt/lists/* /tmp/* /var/tmp/*

RUN curl -sL https://storage.googleapis.com/golang/go1.8.linux-amd64.tar.gz | tar -C /usr/local -zxf -

RUN useradd makedeb && mkdir -p ~makedeb/.aws && chown -R makedeb.makedeb ~makedeb && chmod og-rw ~makedeb/.aws

RUN mkdir -p /hypersrc/hyperd /hypersrc/hyperstart && \
    cd /hypersrc/hyperd && git init && git remote add origin https://github.com/hyperhq/hyperd.git && \
    cd /hypersrc/hyperstart && git init && git remote add origin https://github.com/hyperhq/hyperstart.git && \
    chown makedeb.makedeb -R /hypersrc

ENV PATH /usr/local/go/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
ENV USER makedeb
ADD entrypoint.sh /

USER makedeb
WORKDIR /hypersrc
ENTRYPOINT ["/entrypoint.sh"]

A standard Docker Image, or the OCI Image derived from it, is really a set of metadata plus a number of layers, each layer being a packaged file-system content. In a sense, the OS of a typical Live CD is basically the read-only layer plus writable layer of a Docker container's rootfs. In Docker, union file systems are Great Again.

The future: modern image systems

However, although the design of the Docker Image (or OCI Image) embodies the excellent idea that "a complete operating system is a package" and uses union file systems to realize the elegant "layering" design, achieving both a pleasant developer experience and savings in time and space, some problems have surfaced over time. At the beginning of last year (2019), the OCI community began to discuss a next-generation image format. The heated discussion focused on the problems of the OCIv1 (really, Docker's) image format, and Aleksa Sarai wrote a blog post on the topic. Specifically, aside from the standardization of the tar format itself, the main complaints about the current image format focus on:

  • Content redundancy: the same data appearing in different layers is transferred and stored redundantly, and this redundancy cannot be detected without reading the content;
  • Lack of parallelism: a layer is a monolith, so within a single layer neither transfer nor extraction can be parallelized;
  • No early verification: the integrity of a layer's data can be verified only after the whole layer has been downloaded;
  • Other issues: for example, cross-layer file deletion is hard to handle cleanly.

To sum these problems up in one sentence: "the layer is the basic unit of the image", yet the actual utilization of image data is very low. As the paper from CERN mentioned here found, in general only about 6% of an image's content is ever actually used. This provides strong motivation for upgrading the image data structure so that the layer is no longer the basic unit.

As you can see, one trend in next-generation images is to break up the layer structure and further optimize these read-only files. Quick readers may recall the SquashFS commonly used on Live CDs: can we extend its idea so that image content is pulled back from the remote end on demand, going from lazy decompression to lazy loading? It is only one step away, and a natural one.

Yes, this is exactly the architecture Ant uses in its image-acceleration practice. In the past, large images not only slowed down the pull process; since the pull itself can fail, they also contributed to more than half of Pod startup failures. Now that we have introduced a lazy-loading rootfs, these failures have been almost completely eliminated. At the 10th China Open Source Hackathon at the end of last year, we also demonstrated connecting this system to Kata Containers via virtio-fs.

As shown in the figure, similar to SquashFS, in this new image format compressed data chunks are the basic unit, and a file maps to zero or more chunks. Besides the chunks, some additional metadata is introduced to map the directory tree onto the chunks. This way, the complete file-system structure can be presented to the application even before the full image data is downloaded, and when specific data is read, the corresponding chunks are fetched according to the index and served to the application. Such an image system provides the following benefits:

  • Load on demand: the image does not need to be fully downloaded at startup, and as part of a full-link chain of trust, the integrity of the partially loaded image can still be verified;
  • For runC containers, FUSE provides a user-space solution that is independent of host kernel changes;
  • For virtualization-based containers like Kata, image data can be sent directly into the Pod sandbox and used there, without being loaded on the host;
  • The user experience is not noticeably different from the traditional Docker Image, and the developer experience does not get worse.
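The chunk-plus-metadata design described above can be sketched as a toy model (the chunk size, hash choice, and in-memory "registry" are all hypothetical; a real implementation would fetch chunks over the network, for example with HTTP range requests): metadata maps each file to its chunk digests, and a read fetches and verifies only the chunks that file needs.

```python
import hashlib
import zlib

CHUNK = 4096  # hypothetical chunk size

def build(files):
    """Split each file into compressed chunks, addressed by digest."""
    store, manifest = {}, {}
    for path, data in files.items():
        digests = []
        for i in range(0, len(data), CHUNK):
            comp = zlib.compress(data[i:i + CHUNK])
            d = hashlib.sha256(comp).hexdigest()
            store[d] = comp            # content-addressed chunk store
            digests.append(d)
        manifest[path] = digests       # metadata: directory tree -> chunks
    return manifest, store

def lazy_read(manifest, fetch, path):
    """Fetch, verify, and decompress only the chunks this file needs."""
    out = b""
    for d in manifest[path]:
        comp = fetch(d)                                   # remote fetch in reality
        assert hashlib.sha256(comp).hexdigest() == d      # per-chunk verification
        out += zlib.decompress(comp)
    return out

manifest, store = build({
    "/etc/os-release": b"NAME=toy\n",
    "/bin/app": b"\x7fELF" + b"\0" * 9000,
})
data = lazy_read(manifest, store.__getitem__, "/etc/os-release")
```

Note that integrity is checked chunk by chunk as data arrives, rather than only after a whole layer has been downloaded, which addresses the verification complaint above.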

Moreover, while designing this system we found another benefit: since we can observe the application's file access patterns, and since the format is file-based, those access patterns tend to change little even across image upgrades. We can therefore exploit the application's file access patterns for targeted optimizations such as file prefetching (readahead).

As you can see, system storage has evolved in a spiral over the past 20 years, and the same evolution that happened with Live CDs has happened again with containers. We are currently taking an active part in driving the OCI Image v2 standard, and we are combining our reference implementation with Dragonfly's P2P distribution as part of the CNCF open source project Dragonfly. We hope to interact further with the OCI community so that our needs and strengths become part of the community's standard, staying consistent and aligned with the community, so that the future can be unified under an OCI Image v2 format.

About the author

Xu Wang is a senior technical expert at Ant Financial and a founding member of the Architecture Committee of the open source project Kata Containers; over the past few years he has been active in China's open source development community and in standardization work. Before joining Ant Financial, he was co-founder and CTO of HyperHQ, which open-sourced its virtualization-based container engine runV in 2015. In December 2017, together with Intel, they announced the merger of the runV and Clear Containers projects into the Kata Containers project, which was approved by the board in April 2019 as the OpenStack Foundation's first new top-level open infrastructure project since 2012. Before founding HyperHQ, Wang Xu worked in the cloud computing teams of Shanda Cloud and China Mobile Research Institute. He hosted the cloud computing track at QCon Hangzhou in 2011 and has long been an active technical writer, translator, and blogger.