* Author: Xu Li
The source of cloud innovation
Under the trend of cloud native, the proportion of containerized applications is rising rapidly, and Kubernetes has become the new infrastructure of the cloud native era. Forrester predicts that by 2022 the share of global organizations and enterprises running containerized applications in production will grow from less than 30% today to more than 75%; the containerization of enterprise applications is unstoppable. Two general phenomena can be observed. First, managed Kubernetes in the cloud has become enterprises' preferred way to use the cloud and run containers. Second, the way users use containers is changing, from stateless applications, to core enterprise applications, to data-intelligence applications: more and more enterprises use containers to deploy production-grade, complex, high-performance stateful applications such as web services, content repositories, databases, and even DevOps and AI/big data applications.
With the gradual evolution of infrastructure from physical machines to virtual machines, to container environments represented by Kubernetes, and even to Serverless, today’s computing and applications are facing tremendous changes. This change leads to finer granularity of resources, shorter life cycles, and on-demand computing.
From the user's perspective, the most obvious change brought by cloud native storage is that the storage interface moves upward: storage services not directly related to the application are shifted from the application layer down to the cloud platform, and users can focus more on the application itself.
For example, traditional users had to care about all of the hardware and software; with virtualization they only needed to care about the virtual machine, the operating system, and the application software stack; and today, with Serverless, they care only about the application's business logic and code. System resources move from the physical and virtual resource layers up to the application development layer, and users no longer need to worry about the underlying infrastructure.
Under such a technical system, the evolution of storage capability is mainly reflected in the following three aspects:
1. High density: In the VM era, each VM has a complete storage space that can hold all the data its applications need. In a Serverless function-compute environment, applications are split into functions, and each function's resources need their own storage management, so storage density becomes much higher.
2. Elasticity: As applications are split into ever finer granularity, storage density gradually increases. Large-scale Serverless function-compute instances must start with high concurrency, so storage needs extreme elasticity.
3. Short life cycles: From the perspective of Serverless function compute, a function is only one part of the overall process, so its life cycle is naturally shorter. This produces a large number of short-lived container instances, and as life cycles shrink, storage must be mounted, unmounted, and accessed quickly.
As the service interface moves upward, the storage management interface is reshaped, and the boundary between built-in storage and external storage becomes clearer. In a Serverless environment, the interface visible to users is external storage (including file storage and object storage), while built-in storage (including image storage and temporary storage) is invisible to users. Built-in storage is managed by Alibaba Cloud, which creates opportunities for innovation.
Technological innovation in image acceleration
Challenges of large-scale container deployment at Alibaba
The large-scale deployment of containers at Alibaba faces challenges in the following aspects:
1. Large business scale: cluster sizes reach the order of 100,000; all applications are containerized, and application images are large, usually tens of GB.
2. Faster deployment: As business continues to grow, the cloud platform must be able to deploy applications rapidly to keep up, especially during emergency scale-out for the Double 11 shopping festival, when it is difficult to accurately estimate each service's capacity in advance.
3. Slow image distribution: Creating or updating container clusters at scale is still slow, mainly because downloading and decompressing container images is slow. The main technical challenges are as follows:
• High time cost: time cost ≈ image size × number of nodes; a thousand nodes have to download and store a thousand copies of the image;
• High CPU cost: Gzip decompression is slow and can only run serially;
• High I/O pressure: downloading and decompressing involve two rounds of disk writes, and when many nodes write to disk at the same time, the cloud disks “resonate”;
• Memory interference: severe disturbance to the host’s page cache;
• Yet the proportion of useful data is small: on average only 6.4% of the image data is actually needed at startup.
To address these technical challenges, the key requirements for large-scale container deployment can be summarized in three aspects:
1. On demand: downloading and decompression must be fast enough, with data accessed and transmitted on demand.
2. Incremental layering: data is decoupled and divided into layers following the OCI Artifacts standard and OverlayFS, so that incremental data and time are used more efficiently.
3. Remote Image: adopting Remote Image technology changes the image format and reduces the consumption of local resources.
Comparison of Remote Image technical schemes
There are two ways to implement Remote Image: one based on a file system and the other based on a block device. The two technical approaches are compared in the following figure:
The main feature of file-system-based Remote Image is that it directly provides a file system interface, a natural extension of the container image. However, its complexity is high, and stability, optimization, and advanced features are difficult to implement. In terms of generality, it is bound to the operating system and offers fixed capabilities that may not suit every application, and the attack surface is larger. Industry representatives include Google CRFS, Microsoft Azure Project Teleport, and AWS SparseFS.
The main feature of block-device-based Remote Image is that it can be combined with a conventional file system such as ext4, and it can be used directly by ordinary containers, secure containers, and VMs. Complexity, stability, optimization, and advanced functionality are easier to handle. In terms of generality, it is decoupled from the operating system and the file system, so the application is free to choose the most suitable file system, such as NTFS, and package it with the image as a dependency. The attack surface is also smaller.
Alibaba chose the block-device-based approach, DADI (Data Accelerator for Disaggregated Infrastructure), and has applied it at scale.
Alibaba developed its own container image acceleration technology DADI
DADI is Alibaba's home-grown solution. The DADI image service is a layered, block-level image service that enables agile and elastic application deployment. DADI completely eliminates the traditional waterfall-style container startup (download, unpack, start) and instead loads remote images on demand at fine granularity: the image does not need to be downloaded before the container starts, and the container can start immediately after it is created.
DADI's data path is shown below, with kernel space below the dotted line and user space above it. DADI abstracts an image as a virtual block device, on which the container application mounts a regular file system such as ext4. When the application reads data, the request is first handled by the regular file system, which translates it into one or more reads of the virtual block device. These block-device reads are forwarded to the user-space DADI module, which finally converts them into random reads against one or more layers.
The DADI image uses block storage plus layering: each layer records only the data blocks modified incrementally, and supports compression and on-the-fly, on-demand decompression. It supports on-demand transmission, downloading only the data blocks that are actually used. DADI can also use a P2P transmission architecture to spread network traffic across many nodes in a large cluster.
Interpretation of DADI's key technologies
DADI implements incremental images with a block-based, layered technique in which each layer records changes at the LBA level. DADI's key technologies include fine-grained on-demand transmission of remote images, efficient online decompression, trace-based prefetching, and P2P transmission for handling burst workloads. DADI is very effective at improving the agility and elasticity of application deployment.
1. Overlay Block Device
Each layer records the LBA ranges of the variable-length data blocks that were modified incrementally. It involves no notion of files or file systems, and its minimum granularity is 512 bytes. The index is fast: it uses variable-length records to save memory, the LBA ranges of the records do not overlap, and it supports efficient interval queries.
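To make this concrete, here is a minimal Go sketch (an illustration only, not DADI's actual on-disk format) of a per-layer index of non-overlapping, variable-length extents with binary-search interval lookup, and of how a read is resolved top-down across layers:

```go
package main

import (
	"fmt"
	"sort"
)

// Extent is one variable-length record in a layer's index: a modified LBA range
// and where its data lives inside the layer blob. Conceptual sketch only.
type Extent struct {
	LBA    uint64 // start offset on the virtual block device (512-byte aligned)
	Length uint64 // length in bytes
	Offset uint64 // offset of the data inside this layer's blob
}

// Layer holds extents sorted by LBA; ranges within one layer never overlap,
// which allows binary search for efficient interval queries.
type Layer struct {
	Extents []Extent
}

// Lookup returns the extent covering lba, or false if this layer does not contain it.
func (l *Layer) Lookup(lba uint64) (Extent, bool) {
	i := sort.Search(len(l.Extents), func(i int) bool {
		return l.Extents[i].LBA+l.Extents[i].Length > lba
	})
	if i < len(l.Extents) && l.Extents[i].LBA <= lba {
		return l.Extents[i], true
	}
	return Extent{}, false
}

// Resolve walks layers from top (newest) to bottom (base) and returns the first
// layer containing the requested LBA -- the overlay semantics of the merged device.
func Resolve(layers []Layer, lba uint64) (layerIdx int, ext Extent, ok bool) {
	for i := len(layers) - 1; i >= 0; i-- {
		if e, found := layers[i].Lookup(lba); found {
			return i, e, true
		}
	}
	return -1, Extent{}, false
}

func main() {
	base := Layer{Extents: []Extent{{LBA: 0, Length: 4096, Offset: 0}}}
	delta := Layer{Extents: []Extent{{LBA: 1024, Length: 512, Offset: 0}}}
	if idx, e, ok := Resolve([]Layer{base, delta}, 1200); ok {
		fmt.Printf("LBA 1200 served by layer %d, extent %+v\n", idx, e)
	}
}
```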
2. Native support for a writable layer
DADI images can be built in two modes: an append-only file or a randomly writable sparse file. The read-only layers can use records of different sizes, and interval queries within each layer are very fast. The writable layer consists of a raw-data part and an index part, both organized append-only.
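The following Go sketch illustrates the append-only organization of a writable layer: each write appends raw data plus an index record, and reads scan the index from newest to oldest so later writes shadow earlier ones. The names and layout here are illustrative assumptions, not DADI's real format.

```go
package main

import "fmt"

// IndexRecord points at data appended to the raw-data blob for a written LBA range.
type IndexRecord struct {
	LBA    uint64 // start of the written range on the virtual block device
	Length uint64 // number of bytes written
	Offset uint64 // where the data was appended inside the raw-data blob
}

// WritableLayer stands in for the pair of append-only files (raw data + index).
type WritableLayer struct {
	raw   []byte
	index []IndexRecord
}

// Write appends the data and a matching index record; nothing is ever overwritten in place.
func (w *WritableLayer) Write(lba uint64, data []byte) {
	rec := IndexRecord{LBA: lba, Length: uint64(len(data)), Offset: uint64(len(w.raw))}
	w.raw = append(w.raw, data...)
	w.index = append(w.index, rec)
}

// Read returns the latest data covering [lba, lba+length), scanning newest to oldest.
func (w *WritableLayer) Read(lba, length uint64) ([]byte, bool) {
	for i := len(w.index) - 1; i >= 0; i-- {
		r := w.index[i]
		if lba >= r.LBA && lba+length <= r.LBA+r.Length {
			start := r.Offset + (lba - r.LBA)
			return w.raw[start : start+length], true
		}
	}
	return nil, false
}

func main() {
	var layer WritableLayer
	layer.Write(4096, []byte("hello"))
	layer.Write(4096, []byte("HELLO")) // "overwrite" by appending a newer record
	if data, ok := layer.Read(4096, 5); ok {
		fmt.Printf("read back: %s\n", data)
	}
}
```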
3. ZFile compression format
Standard compressed file formats such as GZ, BZ2, and XZ cannot support efficient random reads: no matter which part of the compressed file is read, decompression must start from the beginning. To support compressed layer blobs and on-demand reading of remote images, DADI introduces the ZFile compression format, shown in the following figure. ZFile compresses data blocks of a fixed size and decompresses only the blocks being read. It supports a variety of effective compression algorithms, including LZ4, ZSTD, and Gzip.
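The idea can be sketched as follows: compress fixed-size blocks independently and keep them individually addressable, so a random read only decompresses the blocks it touches. This Go sketch uses gzip from the standard library purely for illustration; the real ZFile layout differs in detail and also supports LZ4 and ZSTD.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

const blockSize = 4096

// zfile holds one independently compressed blob per fixed-size block.
type zfile struct {
	blobs   [][]byte
	rawSize int
}

// compress splits data into fixed-size blocks and gzips each block on its own.
func compress(data []byte) *zfile {
	z := &zfile{rawSize: len(data)}
	for off := 0; off < len(data); off += blockSize {
		end := off + blockSize
		if end > len(data) {
			end = len(data)
		}
		var buf bytes.Buffer
		w := gzip.NewWriter(&buf)
		w.Write(data[off:end])
		w.Close()
		z.blobs = append(z.blobs, buf.Bytes())
	}
	return z
}

// readAt decompresses only the blocks overlapping [off, off+length).
func (z *zfile) readAt(off, length int) ([]byte, error) {
	out := make([]byte, 0, length)
	for b := off / blockSize; b*blockSize < off+length && b < len(z.blobs); b++ {
		r, err := gzip.NewReader(bytes.NewReader(z.blobs[b]))
		if err != nil {
			return nil, err
		}
		block, err := io.ReadAll(r)
		if err != nil {
			return nil, err
		}
		lo, hi := 0, len(block)
		if b*blockSize < off {
			lo = off - b*blockSize
		}
		if (b+1)*blockSize > off+length {
			hi = off + length - b*blockSize
		}
		out = append(out, block[lo:hi]...)
	}
	return out, nil
}

func main() {
	data := bytes.Repeat([]byte("abcdefgh"), 2048) // 16 KiB of sample data
	z := compress(data)
	part, _ := z.readAt(5000, 100) // touches only block 1, not the whole file
	fmt.Println(string(part[:8]), len(part))
}
```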
4. Prefetch based on Trace
The trace remembers only the locations of the data read at startup, not the data itself. On a cold start of an application, if a trace exists, DADI prefetches the recorded data to the local node in advance using highly concurrent reads, which is more efficient. The trace is stored in the image as a special acceleration layer that is invisible to the user and can also hold other acceleration files in the future. The green part of the figure below is the acceleration layer, which holds the trace file and other files.
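A minimal sketch of the prefetch step, assuming a trace is simply a list of (offset, length) records captured on an earlier startup; the fetch callback stands in for DADI's on-demand remote read path, and requests are fanned out across goroutines for high concurrency:

```go
package main

import (
	"fmt"
	"sync"
)

// TraceRecord is one traced read: where it started and how much was read.
type TraceRecord struct {
	Offset uint64
	Length uint64
}

// prefetch pulls every traced range through fetch, using at most `workers`
// concurrent requests to warm the local cache.
func prefetch(trace []TraceRecord, workers int, fetch func(TraceRecord)) {
	jobs := make(chan TraceRecord)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for rec := range jobs {
				fetch(rec)
			}
		}()
	}
	for _, rec := range trace {
		jobs <- rec
	}
	close(jobs)
	wg.Wait()
}

func main() {
	trace := []TraceRecord{{0, 4096}, {65536, 8192}, {131072, 4096}}
	prefetch(trace, 2, func(r TraceRecord) {
		fmt.Printf("prefetching %d bytes at offset %d\n", r.Length, r.Offset)
	})
}
```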
5. On-demand P2P transmission
In our production environment, several critical applications are deployed on thousands of servers and contain layers of up to several GB; deploying them puts significant strain on the Registry and the network infrastructure. To handle such large applications better, DADI caches recently used data blocks on each host's local disk and transfers data between hosts in P2P fashion.
1. Use tree topology to distribute data
• Each node caches recently used data blocks
• Cross-node requests have a high probability of hitting the parent node’s cache
• Missed requests are passed recursively upward, eventually to the Registry
2. The topology is dynamically maintained by the root node
• Each layer has a separate transport topology
3. Deploy a root group in each data center (machine room)
• Multi-node high availability architecture
• Work is divided among root nodes using consistent hashing (see the sketch below)
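As a rough illustration of how a root group might divide work with consistent hashing, the Go sketch below maps each layer digest onto a hash ring of root nodes; the node names, hash function, and number of virtual points are assumptions for illustration only:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ring is a simple consistent-hash ring: layers are owned by the first root node
// clockwise from the layer digest's hash point.
type ring struct {
	points []uint32          // sorted hash points
	owner  map[uint32]string // hash point -> root node
}

func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// newRing places several virtual points per node to smooth the distribution.
func newRing(nodes []string, vnodes int) *ring {
	r := &ring{owner: make(map[uint32]string)}
	for _, n := range nodes {
		for v := 0; v < vnodes; v++ {
			p := hashKey(fmt.Sprintf("%s#%d", n, v))
			r.points = append(r.points, p)
			r.owner[p] = n
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// rootFor returns the root node responsible for a given layer digest.
func (r *ring) rootFor(layerDigest string) string {
	h := hashKey(layerDigest)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	roots := newRing([]string{"root-a", "root-b", "root-c"}, 64)
	for _, digest := range []string{"sha256:1111", "sha256:2222", "sha256:3333"} {
		fmt.Printf("layer %s -> %s\n", digest, roots.rootFor(digest))
	}
}
```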
Large-scale startup time test
We compared DADI's container startup latency with .tgz images, Slacker, CRFS, LVM, and P2P image download, using the WordPress image on DockerHub.com and measuring the cold-start latency of a single instance. All servers and hosts were located in the same data center. As the figure on the left shows, DADI significantly reduces the container's cold-start time.
We then created 1,000 VMs on the public cloud and used them as container hosts, starting 10 containers on each host for a total of 10,000 containers. The test used Agility, a small program written in Python, to access the HTTP server and record the timestamps on the server side. As the figure on the right shows, DADI completes the cold start in less than 3 seconds.
Large-scale operation of DADI inside Alibaba
DADI has been deployed at scale in Alibaba Group's production environment. The data shows that DADI can start 10,000 containers on 10,000 hosts in 3-4 seconds. DADI has smoothly handled the traffic peaks of Double 11; it is now deployed on nearly 100,000 servers inside Alibaba Group, supporting more than 20,000 online and offline applications across Sigma, Search, UC, and other business units, and it has greatly improved the efficiency and experience of application releases and scale-out. Our experience with DADI in the production environment of one of the world's largest e-commerce platforms shows that it is very effective at improving the agility and elasticity of application deployment.
Embrace open source and unlock the benefits of cloud native technology
DADI is now contributing to the community to better unlock the benefits of cloud native technology, and hopes to work with more enterprises and developers to create standards for container images.
Currently DADI has open-sourced support for containerd (Docker does not yet support remote images), support for nodes pulling directly from the Registry with a local cache, and support for building and converting images.
P2P on-demand transmission will be open-sourced in the future: the P2P subsystem is being redesigned as an extension of the Registry that supports shared storage such as NFS, HDFS, Ceph, and GlusterFS. A global Registry, per-room shared storage, node-local caches, and P2P data transmission together form the in-room cache.
You can learn more by checking out the Github link below:
- Control plane for Containerd:
Github.com/alibaba/acc…
- Data plane (OverlayBD)
Github.com/alibaba/ove…
The technological evolution of container persistence
Challenges for storage access technologies
Above we discussed the new paradigm of Serverless application architectures; we are now seeing a shift from virtual machines, to ordinary containers, to secure containers deployed on X-Dragon bare-metal servers. From the storage perspective, the obvious challenge is higher density and multi-tenancy.
The trend in container access technology: the computing-layer architecture is evolving from ECS plus ordinary containers to X-Dragon bare metal plus secure containers. Single-node density reaches 2,000 instances, with a minimum instance granularity of 128 MB of memory and 1/12 of a CPU core. This trend brings the challenge of scaling up I/O resources.
Alibaba Cloud storage has its own approach to node-side access. Storage is divided into built-in storage (image and temporary storage) and external storage (file systems/shared file systems, big-data file systems, database file systems, and so on).
How can the storage system connect better with the underlying layer? Storage access is offloaded from the container to the X-Dragon MOC card through virtio; the X-Dragon MOC card plus virtio path integrates better with the underlying storage services.
Persistent storage – flexibly provisioned ESSD cloud disks for modern applications
ESSD cloud disks provide users with highly available, reliable, and high-performance block-level random access services, as well as rich enterprise features such as native snapshot data protection and cross-domain disaster recovery (DR).
Flexibly provisioned ESSD cloud disks for modern applications have two key features:
- The mount density of cloud disks increases fourfold; a single instance supports up to 64 cloud disks
- Performance and capacity are fully decoupled, so performance no longer has to be provisioned in advance
For example, many users cannot accurately predict their traffic peaks and therefore struggle to plan performance configuration: if the reserved performance is too high, a lot of idle resources are wasted; if it is too low, sudden traffic peaks hurt the service. To address this, we launched the ESSD Auto PL cloud disk, which supports both a specified baseline performance and automatic scaling based on service load. A single disk can automatically raise its performance to up to 1 million IOPS, providing safe and convenient automatic performance provisioning for unexpected bursts of access.
Persistent storage – Container network file system CNFS
In view of the advantages and challenges of using file storage in containers, the Alibaba Cloud Storage team and the Container Service team jointly launched the Container Network File System (CNFS), built into ACK, Alibaba Cloud's managed Kubernetes service. CNFS abstracts Alibaba Cloud file storage as an independently managed Kubernetes object (a CRD) supporting operations such as creation, deletion, description, mounting, monitoring, and expansion. Users keep the convenience of using file storage in containers while gaining better file-storage performance and data security, with consistent, declarative management throughout. A hedged usage sketch follows the advantage list below.
CNFS is deeply optimized for container storage in terms of accessibility, elastic expansion, performance, observability, data protection, and declarative management, which gives it the following clear advantages over open-source solutions:
- In terms of storage types, CNFS supports file storage, currently Alibaba Cloud NAS
- Supports Kubernetes-compatible declarative life-cycle management, managing containers and storage in one place
- Supports online, automatic PV expansion, optimized for elastic container scale-out
- Supports stronger data protection combined with Kubernetes, including PV snapshots, a recycle bin, deletion protection, data encryption, and data disaster recovery
- Supports application-consistent snapshots at the application level, automatically analyzing application configurations and storage dependencies, with one-click backup and one-click restore
- Supports PV-level monitoring
- Supports better access control and enhanced permission security for shared file systems, including directory-level quotas and ACLs
- Provides optimized read/write performance for small files on file storage
- Cost optimization: provides low-frequency storage media and conversion policies to reduce storage costs
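As a rough illustration of the declarative usage, the hedged Go sketch below creates a PersistentVolumeClaim against an assumed CNFS-managed NAS StorageClass using client-go. The StorageClass name "alibabacloud-cnfs-nas" is an assumption (the real name depends on how CNFS is configured in the ACK cluster), and exact field types may vary slightly across client-go versions.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the kubeconfig from the default location (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	sc := "alibabacloud-cnfs-nas" // assumed StorageClass backed by CNFS/NAS
	pvc := &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: "cnfs-nas-pvc", Namespace: "default"},
		Spec: corev1.PersistentVolumeClaimSpec{
			// NAS volumes can be mounted read-write by many pods at once.
			AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteMany},
			StorageClassName: &sc,
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{
					// Capacity can later be expanded online by updating this request.
					corev1.ResourceStorage: resource.MustParse("200Gi"),
				},
			},
		},
	}

	created, err := client.CoreV1().
		PersistentVolumeClaims("default").
		Create(context.Background(), pvc, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("created PVC:", created.Name)
}
```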
Best practices
Best practices for database containerization with high-density ESSD cloud disk mounting
Database containerization imposes the following requirements on high-density ESSD cloud disk mounting: the database deployment model shifts from VMs to containers, improving flexibility and portability and simplifying deployment; container deployment density grows linearly with the number of CPU cores, so persistent storage must support higher mount density; and, as I/O-intensive services, databases place higher demands on single-node storage performance.
Our solution is to run databases on G6SE storage-enhanced instances, which provide a maximum mount density of 64 cloud disks per instance together with high IOPS and 4 GB/s of storage throughput, meeting the performance requirements of high-density single-node deployment.
The advantages of using ESSD cloud disks for database containerization are as follows:
- High-density mounting: cloud disk mount density increases by 400% over the previous generation, raising the single-node deployment density of database instances.
- High performance: up to 1 million IOPS on a single disk, with I/O isolation between cloud disks, providing stable and predictable read/write performance.
- High elasticity: ESSD cloud disks support instant-access snapshots; a snapshot becomes usable within seconds and can be used immediately to create read-only instances.
- High reliability: cloud disks are designed for nine nines of data durability and support data protection such as snapshots and asynchronous replication, preventing data loss caused by hardware or software faults.
Best practices for the Prometheus monitoring service using file storage
Prometheus consists of the Prometheus Server, which scrapes and stores data; client libraries, used to instrument applications and query the server; the Push Gateway, which aggregates and reports monitoring data from batch and short-lived jobs; and various exporters that collect data in different scenarios, for example exporting MongoDB metrics.
Prometheus's core storage is TSDB, a storage engine similar to an LSM tree. Storage engines are trending toward multi-node data synchronization, which requires consensus protocols such as Paxos, and it is very difficult for small and medium-sized customers to operate both the engine and the consensus protocol. Separating compute from storage makes the compute layer stateless: the TSDB storage engine is offloaded onto a distributed file system, which naturally calls for a NAS shared file system.
The advantages of the Prometheus monitoring service using file storage are:
- Shared high availability: multiple pods share persistent NAS storage, and compute-node failover provides high availability for container applications
- Zero modification: a distributed POSIX file system interface, so no application changes are required
- High performance: supports concurrent access, instant index queries, synchronous data loading, and low-latency index queries and writes
- High elasticity: storage space needs no pre-provisioning, is billed by usage, and matches the elasticity of containers
Conclusion
The innovative development of storage for containers and Serverless computing has driven a new shift in perspective. The whole storage interface has moved upward: developers focus more on the application itself, while the operation and maintenance of the infrastructure is delegated to the platform as much as possible. Storage, in turn, delivers higher density, greater elasticity, and extreme speed.
The above shared Alibaba Cloud's innovations in container storage technology: DADI image acceleration lays a solid foundation for starting containers at scale, ESSD cloud disks provide extreme performance, and the CNFS container network file system provides an excellent user experience.
As cloud native innovation kicks off, cloud native storage innovation has only taken its first step. We believe in reinventing storage and creating new opportunities for storage innovation together with industry experts.