Background
Meituan's container cluster management platform is called HULK. Marvel's Hulk grows stronger when he gets angry, which is much like a container's "elastic scaling", so we named the platform after him. If another company's container platform happens to share the name, that is purely coincidental.
Meituan began using containers in 2016. By then the company already had considerable scale, and many systems predated containers, including CMDB, service governance, monitoring and alarms, the publishing platform, and so on. While exploring container technology we could not simply abandon these assets, so the first step of containerization was to integrate the container life cycle with these platforms, covering operations such as apply/create, delete/release, publish, and migrate. We then verified that containers were feasible as the runtime environment for core online services.
In 2018, after two years of operation and hands-on exploration, we upgraded the platform to HULK 2.0.
Meituan's current container usage is as follows: more than 3,000 online services and more than 30,000 container instances, with many core-link services that require high concurrency and low latency running stably on HULK. This article mainly introduces some of our practices in container technology, most of which are optimization and polishing of the underlying system.
The basic architecture of the Meituan container platform
First, let me introduce the infrastructure of the Meituan container platform; most container platform architectures are basically the same.
First of all, the container platform connects to service governance, the publishing platform, CMDB, alarm monitoring, and other systems. By communicating with these systems, containers provide the same experience as virtual machines: developers can use containers the way they use VMs, without changing their habits.
In addition, containers provide elastic scaling: the number of container instances of a service can be increased or decreased dynamically according to elastic policies, thereby adjusting the service's processing capacity. There is also a special module, "Service Portrait", whose main job is to collect and aggregate the runtime metrics of a service's container instances so that container scheduling and resource allocation can be done better. For example, based on the CPU, memory, and I/O usage of a service's container instances, we can determine whether the service is computation-intensive or I/O-intensive, and the scheduler then tries to place complementary containers together.
For example, if we know that each container instance of a service runs about 500 processes, we create the container with a reasonable limit on the number of processes (say a maximum of 1,000) to prevent the container from consuming too many system resources when something goes wrong. If a container of that service suddenly tries to create 20,000 processes at runtime, we have good reason to believe it has hit a bug; the pre-set limit constrains the container, and an alarm notifies the service owner to handle it in time.
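Below is a minimal sketch, not HULK's actual code, of how such a process cap can be applied with the cgroup v1 pids controller; the cgroup path, container ID, and limit value are illustrative.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// setPidLimit caps the number of processes/threads a container may create by
// writing to its pids cgroup, e.g. 1000 for a service whose instances
// normally run about 500 processes.
func setPidLimit(containerID string, maxPids int) error {
	// Hypothetical cgroup v1 layout: /sys/fs/cgroup/pids/docker/<id>/pids.max
	path := filepath.Join("/sys/fs/cgroup/pids/docker", containerID, "pids.max")
	return os.WriteFile(path, []byte(fmt.Sprintf("%d\n", maxPids)), 0644)
}

func main() {
	if err := setPidLimit("abc123", 1000); err != nil {
		fmt.Fprintln(os.Stderr, "set pid limit:", err)
	}
}
```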
The next layer is container orchestration and image management. Container orchestration addresses the dynamic side of container instances: when a container is created, where it is created, when it is deleted, and so on. Image management addresses the static side: how container images are built, how they are distributed, where they are distributed, and so on.
The lowest layer is the container runtime. Meituan uses the mainstream Linux + Docker container solution, and HULK Agent is our management agent on the server.
Expanding on the container runtime just mentioned, you can see the architecture diagram above, from bottom to top:
Meituan mainly uses open source components from the CentOS family, because we believe Red Hat has strong open source capabilities and we hoped that Red Hat's open source releases would solve most system problems, rather than using upstream community versions directly. We found, however, that even with CentOS components deployed we still ran into problems that neither the community nor Red Hat had addressed. To some extent this shows that large domestic Internet companies have reached a world-leading level in the scale and complexity of their application scenarios, which is why they hit these problems before the community and Red Hat's other customers.
Some problems encountered by containers
In container technology itself, we encountered four major classes of issues: isolation, stability, performance, and promotion (rollout).
Container implementation
In essence, a container is a group of related processes in the system that serve the same business goal. Processes in the same namespace can communicate with each other but cannot see processes in other namespaces. Each namespace can have its own independent host name, process ID space, IPC, network, file system, users, and other resources. To some extent this achieves a simple form of virtualization: a single host can run multiple systems that are unaware of each other.
In addition, to limit a namespace's use of physical resources, the CPU, memory, and other resources available to its processes must be restricted. This is cgroup technology; cgroup stands for "control group". For example, the "4C4G" container we often talk about is one whose processes, within the container's namespace, may use at most 4 CPU cores and 4 GB of memory.
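As a rough illustration (not the platform's real provisioning code), here is how a "4C4G" limit maps onto cgroup v1 control files; the Docker-style cgroup path is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// write sets one cgroup v1 control file under /sys/fs/cgroup/<subsystem>/.
func write(subsystem, file, value string) error {
	return os.WriteFile(filepath.Join("/sys/fs/cgroup", subsystem, file), []byte(value), 0644)
}

func main() {
	id := "docker/abc123" // hypothetical container cgroup path

	// 4 CPU cores via CFS bandwidth control: quota = 4 * period.
	_ = write("cpu", filepath.Join(id, "cpu.cfs_period_us"), "100000")
	_ = write("cpu", filepath.Join(id, "cpu.cfs_quota_us"), "400000")

	// 4 GB of memory.
	_ = write("memory", filepath.Join(id, "memory.limit_in_bytes"), fmt.Sprintf("%d", int64(4)<<30))
}
```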
In short, the Linux kernel provides namespaces for isolation and cgroups for resource limits. Namespace + cgroup form the underlying technology of containers (rootfs is the container's file-system layer technology).
Meituan's solutions, improvements and optimizations
Isolation
I had worked with virtual machines before, and it was not until I used containers that I discovered that the CPU and memory information seen inside a container is the host's information, not the container's own configuration. This is still the case in the community version today. For example, inside a 4C4G container you can see 40 CPUs and 196 GB of memory; these are actually the resources of the host the container runs on. It gives the container a kind of inflated ego, believing itself far more capable than it really is, and that can cause problems.
The figure above is an example of memory information isolation. When system memory information is requested, the community Linux kernel always returns the host's memory information, whether the request comes from the host or from inside a container. If an application in the container configures itself according to the host memory it discovers, the actual resources fall far short, and the system soon hits an OOM exception.
Our isolation works like this: when memory information is requested from inside the container, the kernel returns the container's own memory information based on the container's cgroup data (similar to what LXCFS does).
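The following is a simplified sketch of the idea, in the spirit of LXCFS: derive the container's memory view from its memory cgroup instead of reporting the host's /proc/meminfo. The cgroup path is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readUint reads a single numeric value from a cgroup control file.
func readUint(path string) uint64 {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0
	}
	v, _ := strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
	return v
}

func main() {
	base := "/sys/fs/cgroup/memory/docker/abc123" // hypothetical container cgroup
	limit := readUint(base + "/memory.limit_in_bytes")
	usage := readUint(base + "/memory.usage_in_bytes")
	if usage > limit {
		usage = limit
	}

	// /proc/meminfo reports values in kB; expose the container's view instead of the host's.
	fmt.Printf("MemTotal: %8d kB\n", limit/1024)
	fmt.Printf("MemFree:  %8d kB\n", (limit-usage)/1024)
}
```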
CPU information isolation is implemented similarly to memory; here is an example of how the number of CPUs affects application performance.
As you all know, JVM GC (garbage collection) affects the execution performance of Java programs. By default the JVM uses the formula ParallelGCThreads = (ncpus <= 8) ? ncpus : 3 + (ncpus * 5) / 8 to calculate the number of parallel GC threads, where ncpus is the number of CPUs the JVM discovers. Once the JVM inside a container discovers the host's CPU count (usually far higher than the container's actual CPU limit), it starts too many GC threads, which directly degrades GC performance: Java services see higher latency, more spikes in their TP monitoring curves, and lower throughput. There are several solutions to this problem (a worked example of the formula follows the list below):
● On the new platform, we improved the kernel so that the container obtains its correct CPU resource count, making the fix transparent to the business, the image, and the programming language (similar issues also affect the performance of OpenMP, Node.js, and other runtimes).
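As a quick sanity check of the formula above, the snippet below evaluates it for a host with 40 visible CPUs versus a container limited to 4 cores: leaking the host's CPU count into the container produces 28 parallel GC threads instead of 4.

```go
package main

import "fmt"

// parallelGCThreads mirrors the JVM's default sizing formula.
func parallelGCThreads(ncpus int) int {
	if ncpus <= 8 {
		return ncpus
	}
	return 3 + ncpus*5/8
}

func main() {
	fmt.Println(parallelGCThreads(40)) // 28 threads if the host's 40 CPUs leak into the container
	fmt.Println(parallelGCThreads(4))  // 4 threads when the container's real quota is visible
}
```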
For a while our containers ran with root privileges, implemented by adding the Privileged=true parameter to docker run. This coarse-grained approach let a container see the disks of all containers on its host, leading to both security and performance problems. The security issue is easy to understand, but why performance? Imagine every container scanning disk state at the same time. The excessive privileges also showed up elsewhere: mounts could be performed arbitrarily, the NTP time could be changed at will, and so on.
In the new version we removed root privileges from the container and found a few side effects, such as some system calls failing. By default we now grant the container only the additional sys_ptrace and sys_admin capabilities, so that GDB can run and the host name can be changed. If a special container needs more privileges, they can be configured per service on our platform.
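A hedged sketch of the narrower permission model, expressed with the Docker Engine API's HostConfig type: grant only SYS_PTRACE (for GDB) and SYS_ADMIN (for changing the host name) instead of full Privileged mode. This is illustrative, not the exact way HULK configures its containers.

```go
package main

import (
	"encoding/json"
	"fmt"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/api/types/strslice"
)

func main() {
	// Drop Privileged and grant only the two capabilities needed by default.
	hostConfig := container.HostConfig{
		Privileged: false,
		CapAdd:     strslice.StrSlice{"SYS_PTRACE", "SYS_ADMIN"},
	}

	out, _ := json.MarshalIndent(hostConfig, "", "  ")
	fmt.Println(string(out))
}
```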
Linux has two types of I/O: Direct I/O and Buffered I/O. Direct I/O writes straight to disk; Buffered I/O writes to the cache first and then to disk. Most scenarios use Buffered I/O.
We use a 3.x Linux kernel. In the community version, the Buffered I/O of all containers shares a single kernel cache; the cache is not isolated and there is no rate limiting, so a container with heavy I/O can easily affect the other containers on the same host. Buffered I/O cache isolation and throttling are significantly improved by cgroup v2 in Linux 4.x, and we implemented equivalent functionality in our 3.10 kernel: each container gets a proportional share of the I/O cache based on its own memory, and the rate at which cached data is written back to disk is limited by the container's cgroup I/O configuration.
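For illustration only, here is what such throttling looks like when expressed with the upstream cgroup v2 io.max interface (our 3.10 kernel uses a backported, functionally similar mechanism); the cgroup path, device number, and rates are assumptions.

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// Hypothetical unified-hierarchy path for one container's cgroup.
	path := "/sys/fs/cgroup/hulk/abc123/io.max"

	// Cap writeback to device 8:0 (sda) at 100 MB/s and 1000 write IOPS.
	rule := fmt.Sprintf("8:0 wbps=%d wiops=%d\n", 100*1024*1024, 1000)
	if err := os.WriteFile(path, []byte(rule), 0644); err != nil {
		fmt.Fprintln(os.Stderr, "set io.max:", err)
	}
}
```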
Docker itself supports many cgroup resource limits on containers, but K8s passes relatively few of these parameters through to Docker. To reduce interference between containers, we set different resource limits on different services' containers based on the resource allocation in the service portrait. Besides the usual CPU and memory limits, there are I/O limits, ulimit limits, PID limits, and so on, so we extended K8s to support them.
It is common for services running in containers to generate core dump files. For example, if a C/C++ program crashes due to a memory error, or if the system OOM-kills a process that uses too much memory, a core dump file is generated by default.
In the community container setup, core dump files are generated on the host by default. Some core dumps are large (a JVM core dump is usually several gigabytes), and a buggy program that dumps frequently can easily fill the host's storage and cause heavy disk I/O, which also affects other containers. Another problem is that the owner of the business container has no access to the host, and therefore cannot retrieve the dump file for further analysis.
To address this, we changed the core dump flow so that the dump file is written into the container's own file system and counted against the container's own cgroup I/O throughput limit.
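Below is a simplified helper in the spirit of that change, assuming kernel.core_pattern is set to pipe core dumps to it (for example "|/usr/local/bin/coredump-helper %p %e"); how the crashing PID is mapped to its container's filesystem is deliberately simplified here, and the path is hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
)

func main() {
	// Invoked by the kernel with the specifiers from core_pattern, e.g. %p %e.
	if len(os.Args) < 3 {
		os.Exit(1)
	}
	pid, comm := os.Args[1], os.Args[2]

	// Hypothetical: a real implementation would ask the container runtime
	// which container this PID belongs to and where its rootfs lives.
	containerRoot := filepath.Join("/var/lib/hulk/rootfs-by-pid", pid)

	dest := filepath.Join(containerRoot, "coredump", fmt.Sprintf("core.%s.%s", comm, pid))
	if err := os.MkdirAll(filepath.Dir(dest), 0755); err != nil {
		os.Exit(1)
	}
	f, err := os.Create(dest)
	if err != nil {
		os.Exit(1)
	}
	defer f.Close()

	// The kernel streams the core dump to this process on stdin; writing it
	// inside the container's filesystem keeps it under the container's
	// cgroup I/O limits and visible to the service owner.
	io.Copy(f, os.Stdin)
}
```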
Stability
We found in practice that the Linux kernel and Docker are the main factors affecting system stability. Although both are reliable pieces of system software in their own right, they still have bugs in large-scale, high-intensity scenarios. This again shows that Chinese Internet companies are at the leading edge of the world in application scale and complexity.
On the kernel side, Meituan found an implementation problem in the kernel 4.x Buffered I/O throttling, which the community confirmed and fixed. We also tracked and applied a series of Ext4 patches to CentOS, which solved a problem of processes frequently freezing during a certain period.
We hit two key stability issues with Red Hat Docker:
When it comes to system software from open source communities such as the kernel, Docker, and K8s, one point of view holds that we do not need to analyze problems ourselves but only need to pick up the latest updates from the community. We disagree. We believe the capability of the technical team matters a great deal, mainly for the following reasons:
Performance
Container platform performance mainly includes two aspects:
● Performance of container operations (create, delete, and so on).
The figure above is an example of our CPU allocation. Our mainstream server is a dual-socket, 24-core machine consisting of two nodes with 12 cores each, for a total of 48 logical CPUs with hyper-threading. It is a typical NUMA architecture: each node has its own memory, and a CPU in one node accesses its local memory much faster than the other node's memory.
In the past we ran into network interrupts concentrated on CPU0, which under heavy traffic led to increased network latency and even packet loss. To guarantee network processing capacity, we reserve eight logical CPUs on Node 0 to handle network interrupts and host system tasks, such as CPU-heavy work like image decompression. These eight logical CPUs do not run any container workload.
In container scheduling, we try not to allocate a container's CPUs across NUMA nodes. Practice has shown that cross-node memory access hurts application performance significantly; in some computation-intensive scenarios, keeping a container inside a single node increases throughput by more than 30%. The single-node allocation scheme also has a drawback: it increases CPU fragmentation. To use CPU resources more efficiently, the production system allocates the containers of some CPU-insensitive services across nodes, based on information from the service portrait.
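An illustrative sketch of single-node placement using the cpuset cgroup: bind a container's CPUs and its memory allocation to one NUMA node so it never pays cross-node memory access costs. The cgroup path, CPU list, and node number are example values for a 2x12-core server, not HULK's actual scheduler code.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// pinToNode confines a container's CPUs and memory allocation to one NUMA node.
func pinToNode(cgroupDir, cpus, memNode string) error {
	if err := os.WriteFile(filepath.Join(cgroupDir, "cpuset.cpus"), []byte(cpus), 0644); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(cgroupDir, "cpuset.mems"), []byte(memNode), 0644)
}

func main() {
	// Example: CPUs 12-15 plus their hyper-thread siblings 36-39 all live on
	// NUMA node 1 of a 2x12-core server.
	if err := pinToNode("/sys/fs/cgroup/cpuset/docker/abc123", "12-15,36-39", "1"); err != nil {
		fmt.Fprintln(os.Stderr, "pin container:", err)
	}
}
```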
The figure above compares the TP latency lines of a real service before and after the CPU allocation optimization. The TP999 line drops by an order of magnitude, and all the indicators become more stable.
Performance optimization: File system
The first step in file system performance optimization is selection. Based on the applications' read/write characteristics we chose the Ext4 file system (more than 85% of file reads and writes are on files smaller than 1 MB).
The Ext4 file system has three journaling modes:
We chose writeback mode (the default is ordered), the fastest of the journaling modes, at the cost of being unable to recover data after a failure. Most of our containers are stateless; when one fails, we simply start another on a different machine, so between performance and durability we chose performance. The container also offers an optional memory-based file system (tmpfs) that applications can use to speed up services with heavy temporary-file reads and writes.
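A minimal sketch of that optional in-memory file system, assuming the agent has permission to mount inside the container's rootfs; the mount point and size cap are illustrative.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	target := "/data/tmpfs" // hypothetical mount point inside the container rootfs
	if err := os.MkdirAll(target, 0755); err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}

	// Requires CAP_SYS_ADMIN; "size=2g" caps how much memory the tmpfs may use.
	if err := syscall.Mount("tmpfs", target, "tmpfs", 0, "size=2g"); err != nil {
		fmt.Fprintln(os.Stderr, "mount tmpfs:", err)
	}
}
```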
As the figure above shows, creating a virtual machine inside Meituan takes at least three steps and more than 300 seconds on average, while creating a container from an image takes 23 seconds on average. The flexibility and speed of containers are obvious.
The 23-second average scale-up time is the result of optimizations in many areas, such as the scale-up pipeline, image distribution, initialization, and service start-up. The rest of this article focuses on the image distribution and decompression optimizations we made.
The figure above shows the overall architecture of Meituan container image management, which has the following features:
Image distribution is an important link that affects the capacity expansion time of containers.
As can be seen from the figure above, the original distribution time grows rapidly as the number of distribution targets increases, while the P2P image distribution time stays basically flat.
Docker image pulling is a process of parallel download and serial decompression. To speed up decompression, we did some optimization work at Meituan as well.
For decompressing a single layer, we replaced Docker's default serial decompression with a parallel decompression algorithm, implemented by replacing gzip with pgzip.
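The single-layer change can be sketched as swapping the standard gzip reader for the klauspost/pgzip Go library, which uses extra goroutines to speed up decompression; the file names below are illustrative, and this is not Docker's actual code path.

```go
package main

import (
	"io"
	"os"

	"github.com/klauspost/pgzip"
)

func main() {
	// An image layer blob; the file name is illustrative.
	in, err := os.Open("layer.tar.gz")
	if err != nil {
		panic(err)
	}
	defer in.Close()

	// pgzip is a drop-in replacement for the standard library's gzip reader
	// that uses additional goroutines to decompress ahead of the consumer.
	zr, err := pgzip.NewReader(in)
	if err != nil {
		panic(err)
	}
	defer zr.Close()

	out, err := os.Create("layer.tar")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, zr); err != nil {
		panic(err)
	}
}
```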
Docker images have a layered structure, and combining the layers is a serial operation: decompress one layer, merge it, then decompress the next layer and merge it. In fact only the merge has to be serial; decompression can run in parallel. So we decompressed multiple layers in parallel, stored the decompressed data in temporary storage, and then merged the layers sequentially according to the dependencies between them. However, this change (unpacking all the layers into temporary space in parallel) nearly doubled the disk I/O, so the decompression process was still not fast enough.
We therefore used a memory-based ramdisk to hold the decompressed temporary files, eliminating the overhead of the extra disk writes. After this work we also found that the number of layers in an image affects download and decompression time. The figure above shows the result of a simple test of ours: parallel image decompression shortens the decompression time significantly regardless of the number of layers, and the gain is especially large for images with many layers.
Promotion
The first step in promoting containers is being able to articulate their advantages. We believe containers have the following advantages:
The combination of these three features brings greater flexibility and lower computing costs to the business.
Because the container platform is itself a technology product whose customers are the RD teams of the various business lines, we needed to consider the following factors:
● Product advantages
● Connect with the existing system
● Native application development platform and tools
● Smooth migration of VMs to containers
● Closely cooperate with application RD
● Resource tilt
Conclusion
Docker containers orchestrated by Kubernetes are one of the mainstream container cloud practices, and HULK, Meituan's container cluster management platform, adopts this scheme as well. This article has shared some of Meituan's exploration and practice in container technology, mainly covering optimization work on the Linux kernel, Docker, and Kubernetes in the Meituan container cloud, as well as some thoughts on driving the containerization process forward. We welcome you to exchange and discuss with us.