Author: Ray | Source: Alibaba Cloud Native public account

Alibaba recently open-sourced its cloud-native container image acceleration technology. Compared with the traditional layered tar file format, its overlaybd image format enables on-demand reading over the network, allowing containers to start quickly.

The technology was originally part of Alibaba Cloud's internal DADI project, short for Data Accelerator for Disaggregated Infrastructure, which aims to provide data access acceleration for compute-storage-disaggregated architectures. Image acceleration is a breakthrough application of the DADI architecture in the container and cloud-native field. Since the technology went into production in 2019, it has been deployed on a large number of machines online, with cumulative container starts exceeding one billion, supporting multiple business lines of Alibaba Group and Alibaba Cloud and greatly improving the efficiency of application release and scale-out. In 2020 the team published the paper "DADI: Block-Level Image Service for Agile and Elastic Application Deployment" at USENIX ATC '20 [1], and subsequently started an open source project, planning to contribute the technology to the community and, by establishing standards and building an ecosystem, attract more developers to the container and cloud-native performance optimization space.

Background introduction

With the explosion of Kubernetes and cloud native, containers are being used on an ever larger scale within enterprises. One of the core advantages of containers is fast deployment and startup: when the image is already local, instantiation time is very short ("hot start"). For a "cold start", however, where there is no local image, the image must first be downloaded from the Registry before the container can be created. After long-term maintenance and updates, the number of layers and the overall size of a service image can grow considerably, to hundreds of MB or several GB. In a production environment, therefore, a container cold start often takes several minutes, and as scale grows, network congestion within the cluster prevents the Registry from delivering images quickly.

For example, during one year's Singles' Day event, an Alibaba application triggered emergency scale-out due to insufficient capacity, but the high concurrency made the overall scale-out slow, affecting the experience of some users. By 2019, with DADI deployed, the total image pull plus container start time for containers using the new image format was 5x faster than for ordinary containers, and the P99 long-tail time was 17x faster.

How to handle image data stored remotely is the key to solving slow container cold starts. Historical attempts include using block storage or NAS to hold container images for on-demand reading, and using network-based distribution techniques such as P2P to download images from multiple sources, or preheating them onto hosts, to avoid single points of network bottleneck. In recent years, new image formats have also come under discussion. According to Harter et al., pulling the image accounts for 76% of container startup time, while only 6.4% of that time is spent reading data. As a result, image formats that support on-demand reading have become the clear trend. Google's proposed Stargz format, a seekable tar.gz, as the name suggests allows specific files to be located and extracted from an archive without scanning or unpacking the entire image. Stargz is designed to improve image pull performance; its lazy-pull technique does not fetch the entire image file, enabling on-demand reading. To further improve runtime efficiency, Stargz released a containerd snapshotter plug-in that optimizes I/O at the storage level.

During a container's lifetime, the image must be mounted once it is ready. The core technique for mounting layered images is overlayfs, which merges multiple layer files in a stacked fashion and exposes a unified read-only file system upward. Similar to the block storage and NAS approaches mentioned above, CRFS can be stacked in the form of snapshots; CRFS combined with Stargz can be regarded as another implementation of the overlayfs idea.

New image format

DADI did not use overlayfs directly; rather, it borrowed ideas from overlayfs and the earlier union file systems, and proposed a new block-device-based layered stacking technique called overlaybd, which provides a series of block-based merged-data views for container images. The implementation of overlaybd is very simple, which makes many previously impossible things possible. By contrast, implementing a fully POSIX-compliant file system interface is challenging and prone to bugs, as the history of the major file systems shows.

Besides simplicity, overlaybd has other advantages over overlayfs:

  • It avoids the performance degradation caused by multi-layer images. For example, under overlayfs, updating a large file triggers a cross-layer copy-up: the system must first copy the whole file to the writable layer. Creating hard links is likewise slow.
  • Block-level I/O patterns can easily be captured for record and replay, prefetching data to further speed up startup.
  • Users can flexibly choose their file system and host OS, for example Windows NTFS.
  • Online decompression can be performed with an efficient codec.
  • It can sink down to a distributed storage system in the cloud (such as EBS), so that image system disks and data disks can use the same storage scheme.
  • Overlaybd has natural support for a writable layer (RW); read-only mounts may even become a thing of the past.

Overlaybd principle

To understand how overlaybd works, you first need to understand the layering mechanism of container images. A container image consists of multiple incremental layer files that are superimposed at use time, so that only the layer files need to be distributed when the image is distributed. Each layer is essentially a compressed archive of the differences from the previous layer (file additions, modifications, and deletions). The container engine uses its storage driver to stack the differences according to a convention and mount them read-only at a designated directory, called lower_dir; for the writable layer mounted in read/write mode, the mount directory is usually called upper_dir.

Note that overlaybd itself has no concept of files; it simply abstracts the image as a virtual block device on which a regular file system is mounted. When an application reads data, the request is first handled by the regular file system, which translates it into one or more reads of the virtual block device. These reads are forwarded to a user-space process, the runtime carrier of overlaybd, and finally converted into random reads of one or more layers.

Like traditional images, overlaybd retains a layered structure internally, but each layer's content is a series of data-block changes corresponding to file system modifications. Overlaybd provides a merged view upward, and the layer superposition rule is very simple: for any data block, the most recent change always wins, and blocks never changed in any layer are treated as all-zero blocks. It also supports exporting a series of data blocks as a layer file, which is compact, non-sparse, and indexable. Consequently, a read of a contiguous LBA range on the block device may cover small pieces of data belonging to multiple layers; we call these pieces segments. A segment's attributes identify its layer number, which maps the read to the layer file at that level. Like traditional container images, overlaybd images can store their layer files in a Registry or object storage.
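As a toy illustration of the superposition rule described above, the merged view can be sketched as follows. The names and in-memory layout here are illustrative only, not the real overlaybd data structures:

```python
# Toy model of the block-level merge rule: for any block, the
# topmost (most recent) layer that wrote it wins; blocks no layer
# ever wrote read back as all zeros.

BLOCK_SIZE = 4  # tiny blocks for demonstration; real images use KB-scale blocks

def merged_read(layers, block_index):
    """Return the winning data for one block, scanning the top layer first.

    `layers` is ordered bottom-to-top; each layer is a dict mapping
    block index -> bytes written by that layer.
    """
    for layer in reversed(layers):          # most recent change wins
        if block_index in layer:
            return layer[block_index]
    return b"\x00" * BLOCK_SIZE             # untouched blocks are all zero

layers = [
    {0: b"AAAA", 1: b"BBBB"},   # base layer
    {1: b"bbbb"},               # upper layer overwrites block 1
]
assert merged_read(layers, 0) == b"AAAA"          # only the base wrote block 0
assert merged_read(layers, 1) == b"bbbb"          # upper layer wins on block 1
assert merged_read(layers, 2) == b"\x00" * 4      # never written: all zeros
```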

For better compatibility, overlaybd wraps each layer file with a tar header and trailer, masquerading as a tar file. Since the tar contains only a single file, on-demand reading is unaffected. Today Docker, containerd, and BuildKit all untar and tar by default when downloading and uploading images, which cannot be bypassed without intrusive code changes, so the tar disguise helps compatibility and keeps the workflow uniform. For example, image conversion, building, and full downloads require no code modifications; only a plug-in is needed.
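The single-file tar wrapping can be illustrated with Python's standard `tarfile` module. The member name `layer.data` and the payload are stand-ins, not the real overlaybd layout; the point is that the payload sits at a fixed offset behind one 512-byte tar header, so seeking into the wrapped layer is just an offset shift:

```python
import io
import tarfile

# Hedged illustration of the "tar masquerade": the layer blob is
# stored as the single member of a tar archive, so ordinary tooling
# sees a valid tar while readers can still seek into the payload.

payload = b"overlaybd layer blob..."  # stand-in for the real layer data

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
    info = tarfile.TarInfo(name="layer.data")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

raw = buf.getvalue()
# The payload starts right after the 512-byte tar header, so a range
# read of the "file inside the tar" is just an offset shift.
assert raw[512:512 + len(payload)] == payload

# Standard tooling still sees a well-formed archive with one member.
with tarfile.open(fileobj=io.BytesIO(raw)) as tar:
    assert tar.getnames() == ["layer.data"]
```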

The overall architecture

The overall architecture of DADI is shown as follows:

1. containerd snapshotter

containerd has offered preliminary support for launching remote images since version 1.4, and Kubernetes has explicitly deprecated Docker as a runtime. DADI therefore chose to support the containerd ecosystem first, with Docker to follow.

The core function of a snapshotter is to implement an abstract service interface for mounting and unmounting a container's rootfs. It was designed to replace the GraphDriver module in earlier versions of Docker, making storage drivers simpler and compatible with both block device snapshots and overlayfs.

The overlaybd-snapshotter provided by DADI enables the container engine to support the new overlaybd image format, mounting virtual block devices at the corresponding directories, while remaining compatible with the traditional OCI tar image format, so users can continue to run ordinary containers on overlayfs.
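The dual-format behavior described above can be sketched as a toy dispatcher. Everything here is hypothetical, invented for illustration (the real overlaybd-snapshotter is written in Go against the containerd snapshot API, and the field names and device path below are made up):

```python
# Toy dispatcher mirroring the snapshotter's dual-format support:
# overlaybd images are exposed as a mounted block device, while
# ordinary OCI tar layers fall back to overlayfs stacking.

def choose_mount(image):
    """Return a mount description for an image dict (illustrative schema)."""
    if image.get("format") == "overlaybd":
        # Attach the virtual block device and mount its file system.
        # "/dev/overlaybd0" is a made-up device name for illustration.
        return {"type": "block", "source": "/dev/overlaybd0"}
    # Traditional tar layers: stack them with overlayfs lowerdirs.
    lowers = ":".join(image["layers"])
    return {"type": "overlay", "options": f"lowerdir={lowers}"}

assert choose_mount({"format": "overlaybd"})["type"] == "block"
assert choose_mount({"layers": ["/l1", "/l2"]})["options"] == "lowerdir=/l1:/l2"
```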

2. iSCSI target

iSCSI is a widely supported remote block device protocol that is stable, mature, and recoverable after failures. The overlaybd module serves as the backing store of the iSCSI protocol; even if the process crashes unexpectedly, it can recover once restarted. File-system-based image acceleration schemes, such as Stargz, cannot recover this way.

The iSCSI target is the runtime carrier of overlaybd. In this project we implemented two kinds of target modules: one based on the open source TGT project, whose backing-store code can be compiled into a dynamically linked library and loaded at runtime; the other based on the Linux kernel's LIO SCSI target (also known as TCMU), which runs entirely in kernel mode and can conveniently expose virtual block devices.

3. ZFile

ZFile is a data compression format that supports online decompression. It divides the source file into blocks of fixed size, compresses each block separately, and maintains a jump table recording the physical offset of each compressed block within the ZFile. To read data from a ZFile, simply look up the index to find the corresponding location and decompress only the relevant blocks.

ZFile supports several efficient compression algorithms, including LZ4 and ZSTD. Decompression is fast with low overhead, effectively saving both storage space and data transfer. Experimental data show that decompressing remote ZFile data on demand outperforms loading uncompressed data, because the transfer time saved exceeds the extra cost of decompression.
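The jump-table mechanism can be sketched in a few lines. This is a minimal model of the idea, not the actual on-disk format (the names are illustrative, and `zlib` stands in for LZ4/ZSTD): the source is split into fixed-size blocks, each compressed independently, and the table records where each compressed block starts, so a single block can be located and inflated without touching the rest:

```python
import zlib

BLOCK_SIZE = 1024  # fixed block size; the real format's size may differ

def zfile_compress(data):
    """Compress data block by block, returning (blob, jump_table)."""
    blocks, jump_table, offset = [], [], 0
    for i in range(0, len(data), BLOCK_SIZE):
        comp = zlib.compress(data[i:i + BLOCK_SIZE])
        jump_table.append(offset)        # physical offset of compressed block i
        blocks.append(comp)
        offset += len(comp)
    jump_table.append(offset)            # sentinel: end of the last block
    return b"".join(blocks), jump_table

def zfile_read_block(blob, jump_table, index):
    """Random-access read: locate one block via the table, inflate only it."""
    start, end = jump_table[index], jump_table[index + 1]
    return zlib.decompress(blob[start:end])

data = bytes(range(256)) * 16            # 4 KiB of sample data
blob, table = zfile_compress(data)
# Only block 2 is decompressed to serve this read.
assert zfile_read_block(blob, table, 2) == data[2 * BLOCK_SIZE:3 * BLOCK_SIZE]
```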

Overlaybd supports exporting Layer files to ZFile format.

4. cache

As mentioned above, layer files are stored in the Registry, and the container's read I/O against the block device is mapped to requests to the Registry (taking advantage of the Registry's support for HTTP Partial Content). Thanks to the cache mechanism, however, this is not always the case: some time after the container starts, the cache automatically downloads the layer file and persists it to the local file system. On a cache hit, read I/O is served locally instead of being sent to the Registry.
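The read path just described can be modeled as a simple hit/miss check. This is a hedged sketch, not the real cache implementation: `fetch_range` is a hypothetical stand-in for the registry client that issues the ranged HTTP GET, and the in-memory dict stands in for the persisted local file:

```python
# Sketch of the cache read path: a block read first checks the local
# cache; on a miss it falls back to a range request against the
# Registry, then keeps the result for subsequent reads.

class LayerCache:
    def __init__(self, fetch_range):
        self.fetch_range = fetch_range   # (offset, length) -> bytes, e.g. ranged GET
        self.local = {}                  # offset -> locally persisted bytes

    def read(self, offset, length):
        if offset in self.local:                      # cache hit: no Registry traffic
            return self.local[offset][:length]
        data = self.fetch_range(offset, length)       # cache miss: go remote once
        self.local[offset] = data
        return data

remote = b"0123456789abcdef"
calls = []
def fetch_range(off, ln):
    calls.append((off, ln))              # count how often the Registry is hit
    return remote[off:off + ln]

cache = LayerCache(fetch_range)
assert cache.read(4, 4) == b"4567"       # first read goes remote
assert cache.read(4, 4) == b"4567"       # second read served locally
assert calls == [(4, 4)]                 # Registry contacted exactly once
```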

Industry leadership

On March 25, 2021, Forrester released its FaaS (Function-as-a-Service) platforms evaluation report for the first quarter of 2021. Alibaba Cloud stood out with strong product capabilities, receiving the highest score in eight evaluation dimensions and becoming a global FaaS leader alongside Amazon AWS. It was also the first time a Chinese technology company entered the FaaS leaders quadrant.

As is well known, containers are the foundation of a FaaS platform, and container startup speed determines the performance and response latency of the whole platform. DADI helps Alibaba Cloud Function Compute reduce container startup time by 50% to 80%, bringing a new serverless experience.

Conclusion and outlook

Alibaba's open source DADI container acceleration project and its overlaybd image format help meet the need for fast container startup in the new era. Going forward, the project team will work with the community to integrate with mainstream toolchains and actively participate in the development of new image format standards, with the goal of making overlaybd one of the OCI remote image format standards.

Welcome to participate in open source projects and contribute together!

Follow-up work

1. Artifacts Manifest

The v1 manifest format of the OCI Image spec has limited descriptive power and cannot meet the requirements of remote images. At present, the v2 discussion has made no substantive progress, and overturning v1 is unrealistic. However, the OCI Artifacts Manifest can describe raw data through additional descriptors, which preserves compatibility and makes adoption easier for users. Artifacts is also a project being promoted by OCI/CNCF, and DADI plans to embrace Artifacts and implement a PoC.

2. Open support for multiple file systems

DADI itself allows users to select an appropriate file system to build an image. However, the corresponding interface is not yet open, and the ext4 file system is used by default. In the future we will improve the relevant interface and open up this capability, so that users can decide which file system to use according to their own needs.

3. Buildkit toolchain

The snapshotter can already work with BuildKit via a plug-in, but it will be improved in the future to form a complete toolchain.

4. Data collection

By recording the I/O pattern after a container starts and replaying it when the same image is started later, data can be prefetched, avoiding on-the-fly requests to the Registry and cutting container cold start time by more than half again. In theory, any stateless or idempotent container can be recorded and replayed.
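The record-and-replay idea can be sketched as follows. This is a toy model with invented names, not DADI's actual trace format: the first run records the sequence of block reads, and later runs replay that trace to prefetch the same blocks before the container asks for them:

```python
# Toy model of I/O record and replay for prefetching.

class TraceRecorder:
    """Records the (offset, length) of every block read during first start."""
    def __init__(self):
        self.trace = []

    def on_read(self, offset, length):
        self.trace.append((offset, length))

def replay_prefetch(trace, prefetch):
    """Replay a recorded trace, issuing each distinct read ahead of time."""
    seen = set()
    for offset, length in trace:
        if (offset, length) not in seen:     # deduplicate, keep first-seen order
            seen.add((offset, length))
            prefetch(offset, length)

# First start: record the reads the container actually performs.
rec = TraceRecorder()
rec.on_read(0, 512)
rec.on_read(4096, 512)
rec.on_read(0, 512)          # repeated read is only prefetched once on replay

# Later start of the same image: warm the cache from the trace.
prefetched = []
replay_prefetch(rec.trace, lambda off, ln: prefetched.append((off, ln)))
assert prefetched == [(0, 512), (4096, 512)]
```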