This article is based on Zheng Kun's presentation at the 14th Meituan-Dianping Technology Salon, "The Meituan Cloud You Don't Know." The talk has been edited and was published in the January 2017 issue of Programmer magazine.
Introduction to the Meituan-Dianping container platform
This article introduces Meituan-Dianping's Docker container cluster management platform (hereinafter "the container platform"). Work on the platform began in 2015, and it was developed on top of the infrastructure and components of Meituan Cloud. The platform currently provides container computing services for more than a dozen business divisions, including food delivery, hotel, in-store, and Maoyan, carrying hundreds of online services and handling an average of more than 4.5 billion online requests per day. The workload types include web services, databases, caches, message queues, and more.
Why develop a container management platform
As a large O2O Internet company in China, Meituan-Dianping has seen rapid business growth, with a large number of searches, promotions, and transactions happening online every day. Before the container platform, all of Meituan-Dianping's business ran on virtual machines provided by the Meituan private cloud. As the business expanded, the private cloud had to be not only highly stable for online services but also elastic: able to quickly create large numbers of VMs during peak hours and reclaim those resources for other services during off-peak hours. Most of Meituan-Dianping's online businesses face consumers and merchants; their workloads are diverse and their traffic varies in timing and frequency, which places high demands on elasticity. Virtual machines cannot meet these demands, mainly for the following two reasons.
First, virtual machines are weakly elastic. Deploying a service on VMs requires applying for VMs, creating and deploying them, configuring the service environment, and starting service instances. The first steps are handled by the private cloud platform and the rest by service engineers, so a single capacity expansion requires multiple departments to cooperate and takes hours, making the process hard to automate. Automated, one-click rapid expansion would greatly improve service elasticity, free up manpower, and eliminate the risk of accidents caused by manual operations.
Second, IT costs are high. Because VMs are poorly elastic, business departments reserve large numbers of VMs and service instances to cope with traffic peaks and bursts; the resources required during peak hours are typically twice those required off-peak. This reservation approach drives up IT costs, and during off-peak hours the reserved machines sit idle, which is a huge waste.
For these reasons, Meituan-Dianping introduced Docker in 2015 and built a container cluster management platform to provide high-performance elastic scaling for the business. Many companies in the industry adopt open-source components of the Docker ecosystem, such as Kubernetes and Docker Swarm. Guided by our own business needs, we instead built a container management platform on the existing architecture and components of Meituan Cloud, for the following reasons.
Quickly meet the various business needs of Meituan-Dianping
Meituan-Dianping has a wide range of business types, covering almost every kind of workload an Internet company runs, and each business has its own needs and pain points. For example, some stateless services (such as web services) have strict requirements on elastic-scaling latency; the master node of a database service requires extremely high availability, plus the ability to adjust CPU, memory, and disk configurations online; and many services need SSH access into containers for tuning or rapid fault diagnosis, which requires the platform to provide convenient debugging functions. Meeting such diverse needs demands a great deal of iterative development. Building on platforms and tools we already know well lets us reach those development goals faster, better, and more cheaply.
For container platform stability, we need deep control over the platform and Docker's underlying technology
The container platform carries a large number of Meituan-Dianping's online services, which have very high SLA requirements, generally 99.99% availability. The stability and reliability of the platform are therefore its most important attributes. Directly introducing external open-source components would present three difficulties. First, we would need to become familiar with those components, master their interfaces, and evaluate their performance, ideally to the point of source-level understanding. Second, building the platform would mean stitching the components together while continuously optimizing system-level performance bottlenecks and eliminating hidden single points of failure. Third, the platform would have to integrate with Meituan-Dianping's existing infrastructure for monitoring and service governance. All of this requires substantial effort, and more importantly, a platform assembled this way would be hard to make stable and highly available in a short time.
Avoid redundant private cloud construction
The Meituan private cloud, which carries all of Meituan-Dianping's online businesses, is one of the largest private cloud platforms in China. After several years of operation, its reliability has been proven by the company's massive business workloads. We could not simply set aside this mature, stable private cloud and start a new container platform from scratch just to support containers. Considering both stability and cost, building the container management platform on the existing private cloud is the most economical solution.
Architecture design of Meituan-Dianping container management platform
We view the container management platform as a cloud computing model; cloud computing architecture applies to containers as well. As mentioned earlier, the container platform's architecture builds on the existing architecture of the Meituan private cloud, where most private cloud components can be reused directly or extended with a small amount of development. The container platform architecture is shown below.
As shown, the overall architecture of the container platform is divided, from top to bottom, into the business layer, PaaS layer, IaaS control layer, and host resource layer, which is basically consistent with the Meituan Cloud architecture.
- Business layer: the Meituan-Dianping business lines that use containers; they are the end users of the container platform.
- PaaS layer: uses the container platform's HTTP API to implement container orchestration, deployment, elastic scaling, monitoring, and service governance, and exposes these capabilities to the business layer through HTTP APIs or the web.
- IaaS control layer: provides the container platform's API processing, scheduling, networking, user authentication, and image repository management, and exposes an HTTP API to the PaaS layer.
- Host resource layer: the Docker host cluster, spanning several data centers and hundreds of nodes. Each node runs HostServer, Docker, a monitoring data collection module, a Volume management module, an OVS network management module, and a Cgroup management module.

Most components of the container platform are built on existing Meituan private cloud components, such as the API, image repository, platform controller, HostServer, and network management module; they are introduced separately below.
API
The PaaS layer creates and deploys cloud hosts through the API, which is how the container platform provides services externally. We view containers and virtual machines as two different virtualized computing models that can be managed through a unified API: a container is treated as a cloud host, where the set (introduced later) plays the role of the virtual machine and the container Volume plays the role of its disk. This approach has two advantages: 1. service users do not need to change how they use cloud hosts, so the original VM-based service management processes also apply to containers and services can migrate seamlessly from VMs to containers; 2. the container platform API does not need to be developed from scratch, because the API processing flow of the Meituan private cloud can be reused. Creating a VM involves multiple stages, such as scheduling, disk preparation, deployment, configuration, and startup; containers are much simpler, requiring only scheduling and deployment/startup. We therefore simplified the container API, merging disk preparation, deployment, configuration, and startup into a single step. After this simplification, container creation and startup latency is under 3 seconds, essentially matching Docker's native startup performance.
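As a rough illustration of the one-step create model, the sketch below shows what such a call could look like from a PaaS-layer client. The endpoint, fields, and values are hypothetical, not the platform's actual API:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// CreateRequest sketches a one-step container create call: scheduling,
// disk preparation, configuration, and startup are merged server-side.
// All field and endpoint names here are hypothetical.
type CreateRequest struct {
	Image   string `json:"image"`   // Glance image UUID (placeholder)
	CPU     int    `json:"cpu"`     // cores
	MemMB   int    `json:"mem_mb"`  // memory in MB
	Network string `json:"network"` // network to attach
}

func main() {
	body, _ := json.Marshal(CreateRequest{
		Image: "img-1234", CPU: 4, MemMB: 8192, Network: "prod",
	})
	resp, err := http.Post("http://api.example/v1/containers",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("create status:", resp.Status)
}
```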
Host-srv
Host-srv is the container process manager on each host. It handles container image pulling, container disk space management, container creation and destruction, and other runtime management tasks.
Image pull: after receiving a creation request from the controller, host-srv downloads and caches the image from the image repository, then loads it into Docker through the docker load interface.
Container runtime management: host-srv communicates with the Docker daemon through a local Unix socket to control the container life cycle, and supports functions such as container logs and exec.
Container disk space management: host-srv manages the disk space of rootfs and Volumes and reports disk usage to the controller; the scheduler factors disk usage into its container scheduling decisions.
Because host-srv talks to the Docker daemon through a Unix socket and container processes are hosted by docker-containerd, host-srv can be upgraded without affecting running local containers.
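Communicating with the Docker daemon over its local Unix socket is a standard pattern. Below is a minimal Go sketch that pins an HTTP client to /var/run/docker.sock and calls the Docker Engine API's /_ping endpoint; host-srv's actual management calls are internal, so this only illustrates the transport:

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
)

func main() {
	// Route all HTTP traffic over the Docker daemon's Unix socket.
	client := &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", "/var/run/docker.sock")
			},
		},
	}
	// _ping is a real Docker Engine API endpoint; the host part of the
	// URL is ignored once the dialer is pinned to the socket.
	resp, err := client.Get("http://docker/_ping")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("daemon ping: %s %s\n", resp.Status, body)
}
```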
Image repository
The container platform has two image repositories:
- Docker Registry: provides a mirror of Docker Hub to speed up image downloads and help business teams quickly build service images;
- Glance: a Docker image repository built by extending the OpenStack Glance component, used to host Docker images built by business departments.
Image repositories are a necessary component not only of container platforms but of private clouds in general. The Meituan private cloud uses Glance as its image repository. Before the container platform was built, Glance hosted only VM images; each image has a UUID, and images can be uploaded and downloaded through the Glance API using that UUID. A Docker image is actually composed of a chain of sub-images, each with an independent ID and a parent ID attribute pointing to its parent image. We made a small modification to Glance, adding a parent ID attribute to each Glance image and adjusting the image upload and download logic (a sketch of this registration flow appears after the list below). After this simple extension, Glance can host Docker images. Extending Glance to support Docker images has the following advantages:
- The same repository can host both Docker and VM images, reducing operations and maintenance costs;
- Glance is mature and stable; using it reduces the pitfalls of image management;
- Glance connects the Docker image repository "seamlessly" to the Meituan private cloud: the same set of image APIs supports uploading and downloading both VM and Docker images, along with features such as distributed storage backends and multi-tenant isolation;
- Glance UUIDs correspond one-to-one with Docker image IDs. Using this property, we guarantee the uniqueness of each Docker image in the repository and avoid redundant storage.
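For illustration, the sketch below registers a chain of Docker sub-images in Glance v2, carrying each layer's parent pointer as a custom image property. The property name, IDs, and endpoint address are assumptions; the actual attribute Meituan added to Glance is not public:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// layer mirrors the relationship the Glance extension records: each
// Docker sub-image keeps a pointer to its parent. IDs are placeholders.
type layer struct {
	ID     string
	Parent string // empty for the base layer
}

// registerLayer creates a Glance v2 image record carrying the Docker
// parent ID as a custom property (the property name is an assumption).
// The layer data itself would then be uploaded with PUT /v2/images/{id}/file.
func registerLayer(glanceURL, token string, l layer) error {
	body, _ := json.Marshal(map[string]string{
		"name":             "docker-layer-" + l.ID,
		"docker_parent_id": l.Parent, // custom property (assumed name)
	})
	req, _ := http.NewRequest("POST", glanceURL+"/v2/images", bytes.NewReader(body))
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("X-Auth-Token", token) // Keystone auth token
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	fmt.Println("registered layer", l.ID, resp.Status)
	return nil
}

func main() {
	layers := []layer{{ID: "base-0001"}, {ID: "app-0002", Parent: "base-0001"}}
	for _, l := range layers {
		if err := registerLayer("http://glance.example:9292", "token", l); err != nil {
			panic(err)
		}
	}
}
```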
Some might wonder whether using Glance as an image repository is "reinventing the wheel." In fact, our modification to Glance amounted to only about 200 lines of code. Glance is simple and reliable, and we completed development and launch of the image repository in a very short time. It currently hosts more than 16,000 Docker images for Meituan-Dianping's business teams, with average image upload and download latency measured in seconds.
High-performance, flexible container network
Networking is a very important and technically challenging area. A good network architecture requires high transmission performance, high elasticity, multi-tenant isolation, and support for software-defined configuration. Early Docker's networking was relatively simple, with only four modes (None, Bridge, Container, and Host) and no interface for user extension. In 2015, Docker 1.9 integrated Libnetwork as its networking solution, allowing users to develop network drivers for their own needs and customize network behavior, greatly enhancing Docker's network extensibility.
From the perspective of a container cluster, single-host network access is far from enough; the network must also span hosts, racks, and data centers. In this respect Docker and virtual machines are alike, with no significant differences, so in theory one network architecture can satisfy both. Based on this idea, the container platform reuses Meituan Cloud's network infrastructure and components for its networking.
Data plane: we use 10-gigabit NICs together with an OVS-DPDK solution, further optimizing single-flow forwarding performance. A few CPU cores are bound to OVS-DPDK for forwarding, so only a small amount of compute is needed to deliver 10-gigabit forwarding capability. OVS-DPDK is completely isolated from the CPUs used by containers and therefore does not affect users' compute resources.
Control plane: we use an OVS solution in which a self-developed software controller is deployed on each host. It dynamically receives network rules delivered by the network service and pushes them into the OVS flow table, which decides whether each network flow is allowed through.
MosBridge
Before MosBridge, we configured container networking using None mode. None mode is essentially user-defined networking; configuring a network this way requires the following steps:
- Create the container with --net=None;
- After the container starts, create a veth pair;
- Connect one end of the veth pair to the OVS bridge;
- Use the namespace tool nsenter to move the other end of the veth pair into the container's network namespace, rename it, and configure its IP address and routes (a sketch of these steps appears after this list).
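For concreteness, here is a minimal sketch of those four steps as host-side commands driven from Go. The bridge name, interface names, PID, and addresses are placeholders:

```go
package main

import (
	"fmt"
	"os/exec"
)

// run executes one host-side network configuration command.
func run(name string, args ...string) {
	out, err := exec.Command(name, args...).CombinedOutput()
	if err != nil {
		panic(fmt.Sprintf("%s %v: %v: %s", name, args, err, out))
	}
}

func main() {
	pid := "12345" // PID of the container's init process (placeholder)

	// 1. Create a veth pair on the host.
	run("ip", "link", "add", "veth-host", "type", "veth", "peer", "name", "veth-cont")
	// 2. Attach the host end to the OVS bridge and bring it up.
	run("ovs-vsctl", "add-port", "br-int", "veth-host")
	run("ip", "link", "set", "veth-host", "up")
	// 3. Move the other end into the container's network namespace,
	//    rename it, then configure its address and default route.
	run("ip", "link", "set", "veth-cont", "netns", pid)
	run("nsenter", "-t", pid, "-n", "ip", "link", "set", "veth-cont", "name", "eth0")
	run("nsenter", "-t", pid, "-n", "ip", "addr", "add", "192.168.0.10/24", "dev", "eth0")
	run("nsenter", "-t", pid, "-n", "ip", "link", "set", "eth0", "up")
	run("nsenter", "-t", pid, "-n", "ip", "route", "add", "default", "via", "192.168.0.1")
}
```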
In practice, however, we found that None mode had several shortcomings:
- The network is configured only after the container starts, so services that check the network at startup fail to start;
- The network configuration is invisible to Docker, so it is lost when the container restarts;
- The network configuration is entirely controlled by host-srv, with the configuration logic for every NIC implemented inside host-srv. Future network upgrades and extensions, such as adding NICs to containers or supporting VPC, would make host-srv increasingly hard to maintain.
To solve these problems, we turned to Docker Libnetwork. Libnetwork gives users the ability to develop their own Docker network drivers and customize network configuration behavior; in other words, a user-written driver can have Docker configure the container's IP, gateway, and routes from specified parameters. Based on Libnetwork, we developed MosBridge, a Docker network driver for the Meituan cloud network architecture. When creating a container, you specify --net=mosbridge and pass the IP address, gateway, OVS bridge, and other parameters to Docker; MosBridge then completes the network configuration. With MosBridge, the network is ready as soon as the container starts. The container's network configuration is also persisted by MosBridge and survives container restarts. Most importantly, MosBridge fully decouples host-srv from Docker, making future network upgrades much easier.
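MosBridge is compiled into MosDocker rather than shipped as a plugin, and its source is not public. Libnetwork also supports out-of-process "remote" drivers that speak a small JSON-over-HTTP protocol exposing the same hook points; the skeleton below uses that public protocol to illustrate where a driver like MosBridge receives endpoint-creation and join requests:

```go
package main

import (
	"encoding/json"
	"net/http"
)

// handle wires one libnetwork remote-driver endpoint to a handler.
func handle(path string, fn func(map[string]interface{}) interface{}) {
	http.HandleFunc(path, func(w http.ResponseWriter, r *http.Request) {
		var req map[string]interface{}
		json.NewDecoder(r.Body).Decode(&req)
		json.NewEncoder(w).Encode(fn(req))
	})
}

func main() {
	handle("/Plugin.Activate", func(map[string]interface{}) interface{} {
		return map[string]interface{}{"Implements": []string{"NetworkDriver"}}
	})
	handle("/NetworkDriver.GetCapabilities", func(map[string]interface{}) interface{} {
		return map[string]string{"Scope": "local"}
	})
	handle("/NetworkDriver.CreateNetwork", func(map[string]interface{}) interface{} {
		return map[string]interface{}{} // record network options here
	})
	handle("/NetworkDriver.CreateEndpoint", func(req map[string]interface{}) interface{} {
		// A real driver would create a veth pair here and attach one end
		// to the OVS bridge, as MosBridge does for the Meituan network.
		return map[string]interface{}{}
	})
	handle("/NetworkDriver.Join", func(req map[string]interface{}) interface{} {
		// Tell Docker which interface to move into the container's sandbox
		// and which gateway to configure; values here are placeholders.
		return map[string]interface{}{
			"InterfaceName": map[string]string{"SrcName": "veth-cont", "DstPrefix": "eth"},
			"Gateway":       "192.168.0.1",
		}
	})
	// Docker discovers remote plugins via spec files under /run/docker/plugins.
	http.ListenAndServe("127.0.0.1:9090", nil)
}
```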
Solving Docker storage isolation
Many companies in the industry run into storage isolation issues with Docker. Docker's native data storage scheme is the Volume: a directory on the local disk is bind-mounted into the container and used as its "data disk." The problem is that the capacity of such a Volume cannot be limited; any container can write to its Volume without restriction until the disk is full.
To address this, we developed an LVM Volume solution: an LVM volume group (VG) is created on the host as the Volume storage backend, and when a container is created, a logical volume (LV) is carved out of the VG and mounted into the container as its data disk. This way, Volume capacity is strictly enforced by LVM. LVM's powerful management capabilities also let us manage Volumes with finer granularity and efficiency: LVM commands can show a Volume's usage, implement a recycle-bin function by deleting Volumes through relabeling, and even grow a Volume's capacity online. It is worth noting that LVM is built on the Linux kernel's device mapper, which has a long history in the kernel (integrated as early as 2.6), so its reliability and I/O performance can be fully trusted.
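Below is a minimal sketch of the LVM flow, assuming a pre-created volume group and placeholder names; a real implementation would also track LV-to-container mappings and clean up on container destruction:

```go
package main

import (
	"fmt"
	"os/exec"
)

func run(name string, args ...string) {
	out, err := exec.Command(name, args...).CombinedOutput()
	if err != nil {
		panic(fmt.Sprintf("%s %v: %v: %s", name, args, err, out))
	}
}

func main() {
	vg, lv := "docker-vg", "vol-demo" // VG and LV names are placeholders

	// Carve a capacity-limited LV out of the host VG and format it;
	// LVM enforces the 10G ceiling that a plain bind-mounted directory cannot.
	run("lvcreate", "-L", "10G", "-n", lv, vg)
	run("mkfs.ext4", "/dev/"+vg+"/"+lv)

	// Mount it where the container's Volume will be bind-mounted from.
	run("mkdir", "-p", "/data/volumes/"+lv)
	run("mount", "/dev/"+vg+"/"+lv, "/data/volumes/"+lv)

	// Online growth later: extend the LV, then resize the filesystem.
	// run("lvextend", "-L", "+5G", "/dev/"+vg+"/"+lv)
	// run("resize2fs", "/dev/"+vg+"/"+lv)
}
```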
A container status collection module serving multiple monitoring services
Container monitoring is a very important part of the container management platform. Monitoring must capture not only the real-time running status of containers but also the dynamic changes in the resources they occupy. When container monitoring was designed, Meituan-Dianping already operated several monitoring services, such as Zabbix, Falcon, and CAT. We therefore did not need to design and implement a complete monitoring service from scratch, only to collect container runtime information efficiently and report it to the appropriate monitoring service according to each environment's configuration. In short, we needed an efficient agent that collects all the monitoring data of the containers on a host. Two points deserve attention:
- There are many monitoring metrics and a large volume of data, so the data collection module must be efficient;
- Monitoring overhead must be low: a single host may run dozens or even hundreds of containers, so the collection, processing, and reporting pipeline must be cheap.
Monitoring data collection scheme
Based on libcontainer, we developed the mos-docker-agent monitoring module to meet the monitoring requirements of both services and operations. The module collects container data from the host's proc, cgroup, and other interfaces, and after processing and conversion reports it through drivers for the different monitoring systems. The module is written in Go, which is efficient and can use libcontainer directly. In addition, collection and reporting bypass the Docker daemon entirely, so monitoring adds no load to the daemon.
On the configuration side, the reporting module is plug-in based: the target monitoring services and the set of reported metrics are highly customizable, so the agent adapts flexibly to different monitoring scenarios. The sketch below illustrates the pattern.
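The following sketch shows the shape of such an agent: read per-container metrics straight from the cgroup filesystem (bypassing the Docker daemon) and fan them out through a pluggable reporter interface. The interface and paths are illustrative; mos-docker-agent's real API is internal, and it uses libcontainer rather than raw file reads:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// Reporter is the plug-in point: one implementation per monitoring
// backend (Zabbix, Falcon, CAT, ...). This interface is illustrative,
// not mos-docker-agent's actual API.
type Reporter interface {
	Report(containerID, metric string, value uint64) error
}

// stdoutReporter stands in for a real backend driver.
type stdoutReporter struct{}

func (stdoutReporter) Report(id, metric string, v uint64) error {
	fmt.Printf("%s %s=%d\n", id, metric, v)
	return nil
}

// readCgroupUint reads a single-value cgroup v1 file, e.g.
// /sys/fs/cgroup/memory/docker/<id>/memory.usage_in_bytes.
func readCgroupUint(path string) (uint64, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
}

func main() {
	reporters := []Reporter{stdoutReporter{}} // loaded from config in practice

	// Each directory under the docker memory cgroup is one container.
	dirs, _ := filepath.Glob("/sys/fs/cgroup/memory/docker/*")
	for _, d := range dirs {
		usage, err := readCgroupUint(filepath.Join(d, "memory.usage_in_bytes"))
		if err != nil {
			continue
		}
		for _, r := range reporters {
			r.Report(filepath.Base(d), "memory.usage_in_bytes", usage)
		}
	}
}
```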
Supporting microservice architecture
In recent years, microservice architecture has gained traction in the Internet technology field. Microservices use lightweight components to decompose a large service into multiple independently packaged and deployed microservice instances; the complex logic of the large service is realized through interaction between these services.
Many of Meituan-Dianping's online businesses are built as microservices. For example, the company's service governance framework attaches a service monitoring agent to each online service, responsible for collecting and reporting that service's status information; there are many similar companion microservices. For this kind of microservice architecture, Docker offers two packaging options.
- Package all microservice processes into one container. But then service updates and deployment become inflexible: any change to a single microservice requires rebuilding the whole container image, which amounts to using the Docker container as a virtual machine and forfeits Docker's advantages.
- Package each microservice into its own container. Docker's light weight and environment isolation make it well suited to this, but it can introduce new problems. First, containerizing a large service multiplies the number of compute instances, putting heavy pressure on scheduling and deployment in a distributed system. Second, performance can degrade: for example, two tightly coupled services with heavy mutual traffic may end up deployed in different data centers, incurring considerable network overhead.
Kubernetes' answer to this problem is the Pod. Each Pod consists of multiple containers and is the smallest unit of service deployment, orchestration, management, and scheduling. Containers in a Pod share resources, including network, Volumes, and IPC, so containers in the same Pod can communicate with each other efficiently.
We borrowed the Pod idea and developed a microservice-oriented container grouping on our platform, internally called a set. The logical structure of a set is shown below.
A set is the basic unit of scheduling and elastic scaling on the container platform. Each set consists of a BusyBox container plus several service containers. The BusyBox container runs no business logic; it manages the set's network, Volume, and IPC configuration.
All containers in a set share network, Volumes, and IPC. A set is described by a JSON configuration (as shown in Figure 6; a reconstructed example appears after the field list below). Each set instance contains a Container List, and the fields of each Container entry describe its runtime configuration:
- Index: the container's number, which also determines its startup order;
- Image: the name or ID of the Docker image in Glance;
- Options: the container's startup parameters. CPU and mem are percentages, expressing the container's share of the whole set's allocation (for example, in a 4-core set, a container with CPU:80 may use at most 3.2 physical cores).
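Since Figure 6 is not reproduced here, the sketch below reconstructs a plausible set configuration from the field descriptions above; the exact schema and field names are assumptions:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// These types are reconstructed from the field descriptions above;
// the platform's actual schema (Figure 6) may differ.
type Container struct {
	Index   int               `json:"Index"`   // startup order
	Image   string            `json:"Image"`   // Glance image name or ID
	Options map[string]string `json:"Options"` // e.g. CPU/mem percentages
}

type Set struct {
	Name          string      `json:"Name"`
	CPU           int         `json:"CPU"` // total cores for the set
	MemMB         int         `json:"MemMB"`
	ContainerList []Container `json:"ContainerList"`
}

func main() {
	s := Set{
		Name: "demo-set", CPU: 4, MemMB: 8192,
		ContainerList: []Container{
			{Index: 0, Image: "busybox-mgr", Options: map[string]string{}},
			// CPU:80 of a 4-core set = at most 3.2 physical cores.
			{Index: 1, Image: "web-service:v2", Options: map[string]string{"CPU": "80", "mem": "60"}},
		},
	}
	out, _ := json.MarshalIndent(s, "", "  ")
	fmt.Println(string(out))
}
```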
Through sets we standardized all of Meituan-Dianping's container services: every online service is described by a set, and the container platform deals only in sets. The set is the unit of scheduling, deployment, start, and stop. The set implementation also includes some special handling:
- The BusyBox container has Privileged rights and can tune certain sysctl kernel parameters to improve container performance;
- For stability, users may not SSH into the BusyBox container, only into the other service containers;
- To simplify Volume management, each set has exactly one Volume, mounted into the BusyBox container and shared by all containers in the set.
In many cases, the containers in a set come from different teams whose images are updated at different cadences, so we designed a set-based grayscale update feature. It lets a business update only some of the container images in a set, upgrading them online through a grayscale update API (a sketch follows). The biggest benefit of grayscale updates is that part of a set can be updated online while the service stays uninterrupted.
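As a sketch of how such an API could be driven, the snippet below replaces one container's image in a set. The endpoint, field names, and semantics are hypothetical; the platform's real grayscale update API is internal:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// GrayscaleUpdate sketches a call that swaps one container's image
// while the rest of the set keeps serving. All names are hypothetical.
type GrayscaleUpdate struct {
	ContainerIndex int    `json:"container_index"` // which container in the set
	NewImage       string `json:"new_image"`       // replacement image
}

func main() {
	body, _ := json.Marshal(GrayscaleUpdate{ContainerIndex: 1, NewImage: "web-service:v3"})
	resp, err := http.Post("http://api.example/v1/sets/demo-set/grayscale-update",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("update status:", resp.Status)
}
```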
Our solution for Docker stability and features: MosDocker
As is well known, the Docker community is very active and releases are frequent: a major version lands roughly every 2 to 4 months, each accompanied by heavy code refactoring. Docker has no long-term support (LTS) version, and every update inevitably introduces new bugs. A given bug fix typically waits for the next release: bugs introduced in 1.11 are not fixed until 1.12, and adopting 1.12 introduces new bugs that require 1.13. Under these conditions, stock Docker struggles to meet the stability requirements of production. It is therefore necessary to maintain a relatively stable version and, when bugs are found, fix them on that base ourselves or by backporting the community's bug fixes.
Beyond stability, we also need to develop features to meet Meituan-Dianping's own needs. Some of these requirements come from our production environment rather than being industry-wide, and the open-source community usually does not address them. Many companies face the same situation: the internal infrastructure team must meet such demands through its own development.
Based on these considerations, we forked from Docker 1.11 and have developed and maintained our own branch, which we call MosDocker. We chose version 1.11 as the starting point because it brought several major improvements to the Docker daemon:
- The daemon was refactored into separate binaries (the Docker daemon, containerd, and runC), mitigating the daemon's single point of failure;
- OCI standard support: containers are defined by a uniform rootfs and spec;
- The Libnetwork framework was introduced, allowing users to customize container networking through its driver interfaces;
- The image storage backend was refactored, changing image IDs from random strings to content-based hashes, making Docker images more secure.
The main features we have developed in MosDocker so far are:
- MosBridge, the network driver for the Meituan cloud network architecture, on which features such as multiple IPs per container and VPC are built;
- Cgroup persistence: the docker update interface is extended so that more cgroup settings are persisted with the container, ensuring the cgroup configuration is not lost when the container restarts;
- docker save support for sub-images, which greatly speeds up Docker image upload and download.
In short, maintaining MosDocker puts Docker's stability gradually under our own control and lets us develop customizations driven by the company's business needs.
Promotion and application in actual business
In more than a year of operation, the container platform has been adopted by many of Meituan-Dianping's large business departments, across diverse business types. Introducing Docker has brought many benefits to these departments, including the following typical ones.
- Rapid deployment to handle sudden traffic. With Docker, machine application, deployment, and service release are completed in a single step, cutting business capacity expansion from hours to seconds and greatly improving business elasticity.
- Savings in IT hardware and operations costs. Docker uses compute more efficiently, and its high elasticity means business departments no longer need to reserve large amounts of resources, saving substantial hardware investment. For example, one service reserved 32 VMs of 8 cores and 8 GB each to cope with traffic fluctuations and bursts. Using the container elasticity solution, three containers plus elastic scaling replaced those 32 fixed VMs: average QPS per instance rose by 85% and average resource usage fell by 44-56% (as shown in Figures 7 and 8).
- Online container reconfiguration to keep services uninterrupted. For stateful services such as databases and caches, adjusting CPU, memory, and disk at runtime is a common requirement. On a VM, such adjustments require a restart, inevitably interrupting service availability, which has long been a pain point. Docker manages CPU, memory, and other resources through Linux cgroups, so reconfiguring a container only requires modifying its cgroup parameters, without restarting it (a sketch follows).
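As an illustration of why no restart is needed, the sketch below raises a running container's memory ceiling by writing the cgroup v1 limit file directly. The container ID is a placeholder; in the platform this is exposed through the extended docker update described earlier:

```go
package main

import (
	"fmt"
	"os"
)

// Raise a running container's memory ceiling by writing the cgroup v1
// memory limit file directly; no container restart is needed. Writing
// this file is the underlying effect of the extended `docker update`.
func main() {
	id := "0123456789ab" // full container ID in practice (placeholder)
	path := "/sys/fs/cgroup/memory/docker/" + id + "/memory.limit_in_bytes"

	newLimit := []byte("17179869184") // 16 GiB, up from e.g. 8 GiB
	if err := os.WriteFile(path, newLimit, 0o644); err != nil {
		panic(err)
	}
	fmt.Println("memory limit updated without restarting the container")
}
```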
Conclusion
This article has introduced Meituan-Dianping's Docker practice. After a year of promotion, the platform has grown from use within a single department to covering most of the company's business departments and product lines, and from a single business type to dozens of online business types. Docker container virtualization has proved its value in improving operational efficiency, streamlining release processes, and reducing IT costs.
At present, the Docker platform is still being rolled out across Meituan-Dianping. In the process we have found that Docker (and container technology in general) still has many problems and shortcomings. For example, Docker's I/O isolation is weak and cannot limit buffered I/O; the Docker daemon occasionally hangs and stops responding; and a container memory OOM causes the container to be destroyed, while enabling oom_kill_disable can hang the host kernel. In our view, Docker and virtual machines are therefore complementary: Docker cannot be expected to replace virtual machines in every scenario. Only by treating Docker and virtual machines as equals can users' cloud computing needs be met across all scenarios.