This article is based on the speech delivered by Wang Guoliang, Meituan Infrastructure Department, at the Cloud Native + Open Source Virtual Summit China 2020.

I. Background and current situation

Kubernetes is an open source system for running container applications in large-scale industrial production environments. It is also the de facto standard in the field of cluster scheduling, widely accepted and adopted at scale by the industry. Kubernetes has become the management engine of Meituan's cloud infrastructure. It not only brings efficient resource management but also greatly reduces cost, and it lays a solid foundation for the advancement of Meituan's cloud native architecture, supporting Serverless, cloud native distributed databases, and other platforms in completing their containerization and cloud native transformation.

Since 2013, Meituan has built its cloud infrastructure platform on virtualization technology. In 2016, the Hulk 1.0 container platform was built on top of OpenStack's resource management capability. In 2018, Meituan began building the Hulk 2.0 platform based on Kubernetes, and by the end of 2019 we had basically completed the container transformation of Meituan's cloud infrastructure. In 2020, firmly believing that Kubernetes is the cloud infrastructure standard of the future, we began exploring the implementation and evolution of a cloud native architecture.

At present, we have built a cloud infrastructure centered on Kubernetes, Docker, and related technologies that supports service and application management across all of Meituan. The containerization rate has exceeded 98%, with dozens of Kubernetes clusters, tens of thousands of nodes under management, and hundreds of thousands of Pods. However, for the sake of disaster recovery, we cap a single cluster at 5K nodes.

The following figure shows the current scheduling system architecture based on the Kubernetes engine. We have built a unified resource management system with Kubernetes at its core, serving each PaaS platform and business. In addition to directly supporting Hulk containerization, it also directly supports Serverless, Blade, and other platforms, enabling the containerization and cloud native transformation of the PaaS platforms.

II. Obstacles and benefits of the transition from OpenStack to Kubernetes

For a company with a mature technology stack, transforming the entire infrastructure is never smooth. During the OpenStack cloud platform era, we faced the following major problems:

  1. Complex architecture and difficult operation and maintenance (O&M): the compute resource management modules in the OpenStack architecture are huge and complex, so troubleshooting and reliability were always big problems.
  2. Inconsistent environments: this was a common problem across the industry before container images emerged, and it hurts fast service rollout and service stability.
  3. Virtualization overhead: virtualization itself consumes roughly 10% of a host's resources, which adds up to a huge waste once the cluster is large enough.
  4. Long resource delivery and reclamation cycles and inflexible allocation: on the one hand, the whole VM creation process is long; on the other hand, preparing the various initialization and configuration resources takes a long time and is error-prone, so the cycle from requesting to delivering machine resources is long, and fast resource allocation is a problem.
  5. With the rapid development of the mobile Internet, the company's business has ever more pronounced peaks and troughs. To guarantee service stability, resources have to be provisioned for peak demand, which leaves resources seriously idle during off-peak hours and causes waste.

2.1 Containerization process and obstacles

To solve the virtual machine problems, Meituan began exploring the adoption of more lightweight container technology, namely the Hulk 1.0 project. Given the resource environment and architecture at the time, however, Hulk 1.0 was a container platform built with the existing OpenStack as its underlying resource management layer. OpenStack provided resource management for the underlying hosts, which met the business demand for elastic resources and reduced the resource delivery cycle from minutes to seconds.

However, as Hulk 1.0 was promoted and rolled out, some new problems were exposed:

  1. Poor stability: because OpenStack's underlying resource management was reused, every scale-up went through two layers of resource scheduling; the data synchronization process was complex, and machine-room isolation was poor, so a problem in one machine room often affected scale-up or scale-down in other machine rooms.
  2. Lack of capability: with many systems involved and cross-department cooperation required, it was hard to implement migration and recovery of faulty nodes, only a narrow range of resource types was supported, and troubleshooting and communication were inefficient.
  3. Poor scalability: the control layer of Hulk 1.0 had limited ability to manage the underlying resources and could not be extended quickly for new scenarios and requirements.
  4. Performance: the business needed faster elastic scaling and resource delivery, while the weak isolation of the container technology caused growing interference between services and negative business feedback.

After a period of optimization and improvement, these problems still could not be completely solved. In this situation we had to rethink the soundness of the whole container platform's architecture. By then Kubernetes had been gradually recognized and adopted in the industry, and its clear architecture and advanced design ideas showed us the way forward. We therefore built a new container platform based on Kubernetes. In the new platform, Hulk is built entirely on the native Kubernetes API; the Hulk API connects to the internal release and deployment systems, so the two API layers decouple the whole architecture with clear domain boundaries, and Kubernetes' powerful orchestration and resource management capabilities stand out.

The core idea of containerization is to let Kubernetes do resource management well, and to use the upper control layer to resolve dependencies on Meituan's application management and O&M systems, keeping compatibility with native Kubernetes, reducing subsequent maintenance cost, and quickly converging resource management requirements. At the same time it lowers the learning cost for users applying for resources on the new platform, which is very important and is the "foundation" that lets us migrate infrastructure resources quickly and at scale.

2.2 Challenges and countermeasures of containerization process

2.2.1 Complex, flexible, dynamic, and configurable scheduling policies

Meituan has many products, business lines, and application characteristics, so we have correspondingly many requirements on resource types and scheduling strategies. For example, some businesses need specific resource types (SSD, high memory, high IO, and so on), and some need specific scatter strategies (by machine room, by service dependency, and so on), so how to satisfy these diverse requirements is a big problem.

To solve this, we added a policy engine to the scale-up link. A business can register custom scheduling policy requirements for its own APPKEY; at the same time, based on big-data analysis of the service, we can also recommend policies to the business according to its characteristics and the company's operational rules. All policies are saved in a policy center. During scale-up we automatically attach the corresponding requirement labels to the application instances, and these finally take effect in Kubernetes to deliver the expected resources, for example as sketched below.
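
As a rough illustration of how such labeled requirements can surface in the Pods submitted to Kubernetes, the sketch below expresses an "SSD only, spread across machine rooms" policy using standard nodeSelector and anti-affinity fields. The label keys (example.com/disk-type, example.com/room) and the appkey label are hypothetical placeholders, not Meituan's actual conventions.

```go
package policy

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildPodWithPolicy shows how a "needs SSD, spread across machine rooms"
// policy could be attached to the Pod that a scale-up engine submits to
// Kubernetes. All label keys here are hypothetical examples.
func buildPodWithPolicy(appkey string) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			GenerateName: appkey + "-",
			Labels:       map[string]string{"appkey": appkey},
		},
		Spec: corev1.PodSpec{
			// Resource-type requirement: schedule only onto SSD nodes.
			NodeSelector: map[string]string{"example.com/disk-type": "ssd"},
			Affinity: &corev1.Affinity{
				// Scatter requirement: keep replicas of the same appkey
				// in different machine rooms.
				PodAntiAffinity: &corev1.PodAntiAffinity{
					RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
						LabelSelector: &metav1.LabelSelector{
							MatchLabels: map[string]string{"appkey": appkey},
						},
						TopologyKey: "example.com/room",
					}},
				},
			},
			Containers: []corev1.Container{{Name: "app", Image: "app-image:latest"}},
		},
	}
}
```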

2.2.2 Fine resource scheduling and operation

We carry out fine-grained resource scheduling and operation for two reasons: resource demand scenarios are complex, and resource shortage is the norm.

Relying on private cloud and public cloud resources, we deploy multiple Kubernetes clusters; some carry general business and some are dedicated clusters for specific applications. Cloud resources are allocated at the cluster dimension, including the division of machine rooms and machine models. Within a cluster, we further build zones for different business types to isolate resource pools and meet business needs. At a finer level, we schedule at the cluster layer according to the application layer's resource, disaster recovery, and stability requirements. Finally, on top of the different underlying hardware and software, we achieve finer-grained resource isolation and scheduling for CPU, memory, disk, and so on.
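
As one illustrative mechanism (not necessarily the one Meituan uses), dedicated zones can be carved out of a shared cluster with node taints that only the owning business's Pods tolerate. The zone label key below is a hypothetical placeholder.

```go
package zones

import (
	corev1 "k8s.io/api/core/v1"
)

// taintForZone marks a node as belonging to a dedicated zone, so that
// ordinary Pods are not scheduled onto it.
func taintForZone(zone string) corev1.Taint {
	return corev1.Taint{
		Key:    "example.com/zone", // hypothetical key
		Value:  zone,
		Effect: corev1.TaintEffectNoSchedule,
	}
}

// tolerationForZone lets a Pod of the owning business be scheduled onto
// that zone's tainted nodes.
func tolerationForZone(zone string) corev1.Toleration {
	return corev1.Toleration{
		Key:      "example.com/zone",
		Operator: corev1.TolerationOpEqual,
		Value:    zone,
		Effect:   corev1.TaintEffectNoSchedule,
	}
}
```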

2.2.3 Improvement and governance of application stability

Both the VM platform and the original container platform had long-standing application stability problems, so we need to do more work to guarantee application SLAs.

2.2.3.1 Container Reuse

In the production environment, host restarts are very common, whether active or passive. From the user's point of view, a host restart means some on-host data may be lost, and the cost is relatively high. We want to avoid migrating or rebuilding the container and instead restore it directly after the restart. However, Kubernetes only offers the restart policies Always, OnFailure, and Never for the containers in a Pod, and containers are created anew after a host restart.

To solve this problem, we added the Reuse policy to the container’s restart policy type. The process is as follows:

  1. When kubelet runs SyncPod with the Reuse policy, it retrieves the Pod's most recent App container and reuses it, or creates a new one if none exists.
  2. The App container's recorded pauseID is updated to the ID of the new pause container, which establishes the mapping between the new pause container and the original App container.
  3. The App container is started again to complete Pod state synchronization, so container data is not lost even if the host restarts or the kernel is upgraded. A simplified sketch of this flow follows.
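
The sketch below only illustrates the flow described above; the types and runtime interface are stand-ins, not the real kubelet or CRI APIs, and the actual Hulk implementation is not public.

```go
package reuse

// Container is a simplified stand-in for a runtime container record.
type Container struct {
	ID      string
	Name    string
	PauseID string // ID of the pause (sandbox) container it belongs to
}

// Runtime is an illustrative interface to the container runtime.
type Runtime interface {
	// LatestAppContainer returns the Pod's most recent app container, if any.
	LatestAppContainer(podUID, name string) (*Container, bool)
	CreateAppContainer(podUID, name, pauseID string) (*Container, error)
	StartContainer(id string) error
}

// syncPodWithReuse runs after a new pause container has been created (e.g.
// following a host reboot). Instead of creating a fresh app container, it
// re-attaches the previous one so its writable data survives.
func syncPodWithReuse(rt Runtime, podUID, name, newPauseID string) (*Container, error) {
	c, ok := rt.LatestAppContainer(podUID, name)
	if !ok {
		// No previous container: fall back to the normal create path.
		return rt.CreateAppContainer(podUID, name, newPauseID)
	}
	// Re-map the old app container to the new pause container, then start it
	// again; its data is preserved across the reboot.
	c.PauseID = newPauseID
	if err := rt.StartContainer(c.ID); err != nil {
		return nil, err
	}
	return c, nil
}
```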

2.2.3.2 NUMA Awareness and Binding

Another pain point concerns container performance and stability. We kept receiving business feedback that containers with identical configurations showed considerable performance differences, mainly high request latency on some containers. Through testing and in-depth analysis, we found that these containers were accessing CPUs across NUMA nodes; after we restricted a container's CPU usage to a single NUMA node, the problem disappeared. Therefore, for latency-sensitive services, the scheduling side needs to be aware of NUMA node usage to ensure consistent and stable application performance.

To solve this, we collect NUMA node allocation at the node layer, add NUMA node awareness and scheduling at the scheduler layer, and keep resource usage balanced. For sensitive applications that require NUMA binding, scale-up fails if no suitable node can be found; for applications that do not need binding, a best-effort policy is used, as sketched below.
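
A minimal sketch of that filtering policy, under the assumption that the node layer reports free CPU cores per NUMA node; the data structures are illustrative, not the actual scheduler extension.

```go
package numasched

// NodeNUMAInfo is an illustrative report from the node layer for one host.
type NodeNUMAInfo struct {
	// FreeCPUsPerNUMA[i] is the number of unallocated CPU cores on NUMA node i.
	FreeCPUsPerNUMA []int
}

// fitsSingleNUMA reports whether cpuNeed cores fit entirely inside one NUMA
// node on this host.
func fitsSingleNUMA(info NodeNUMAInfo, cpuNeed int) bool {
	for _, free := range info.FreeCPUsPerNUMA {
		if free >= cpuNeed {
			return true
		}
	}
	return false
}

// filterNodes applies the policy: for binding-sensitive apps the check is a
// hard requirement (an empty result means scale-up fails); otherwise it is
// only a preference and all candidates are kept when none fit.
func filterNodes(nodes map[string]NodeNUMAInfo, cpuNeed int, mustBind bool) []string {
	var fit, all []string
	for name, info := range nodes {
		all = append(all, name)
		if fitsSingleNUMA(info, cpuNeed) {
			fit = append(fit, name)
		}
	}
	if len(fit) > 0 || mustBind {
		return fit
	}
	return all
}
```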

2.2.3.3 Other stability optimization

  1. At the scheduling level, we added load awareness and scatter/optimization strategies to the scheduler based on the service-profile characteristics of applications.
  2. For fault container discovery and handling, an alarm self-healing component built on a fault-signature database can discover, analyze, and handle alarms within seconds.
  3. Some applications with special resource requirements, such as high I/O and memory requirements, are isolated to avoid adverse impact on other applications.

2.2.4 Platform Service containerization

Anyone who has worked on ToB business knows that every product has its key-account programme, and the same situation exists inside a company like Meituan. Containerizing platform services has its own characteristics: the number of instances is large, often in the thousands, so the resource cost is high; and these services rank high in importance, they are generally core businesses with demanding performance and stability requirements. It is therefore unrealistic to expect a one-size-fits-all approach to such businesses.

Take the MySQL platform as an example. Database services have high requirements on stability, performance, and reliability, and they mainly ran on physical machines, so the cost pressure was very high. For database containerization, we mainly take the host side as the entry point for customizing and optimizing resource allocation.

  1. CPU resources are allocated as exclusive CPU sets to avoid contention between Pods.
  2. Stability is improved by allowing a custom swap size to absorb transient traffic spikes and by turning off NUMA and PageCache where needed.
  3. For disk allocation, Pods use exclusive disks to isolate IOPS, and disks are pre-allocated and pre-formatted to improve scale-up and resource delivery efficiency.
  4. Scheduling supports a dedicated scatter policy and scale-down confirmation to avoid the risks of scaling down.

As a result, database delivery efficiency improved 60-fold, and in most scenarios performance is better than on the previous physical machines.
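
For reference, the upstream Kubernetes way to obtain an exclusive CPU set (which may differ from Meituan's host-side customization) is a Guaranteed-QoS Pod with an integer CPU request running under the kubelet's static CPU manager policy. The sketch below uses placeholder sizes and image names.

```go
package dbpod

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// exampleMySQLPod is a Guaranteed-QoS Pod (requests == limits) with an integer
// CPU count; under the kubelet's static CPU manager policy such a Pod is
// eligible for exclusive CPU cores. Sizes and image are placeholders.
func exampleMySQLPod() *corev1.Pod {
	res := corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("8"), // integer CPUs -> exclusive cores
		corev1.ResourceMemory: resource.MustParse("32Gi"),
	}
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "mysql-0"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "mysql",
				Image: "mysql:8.0",
				Resources: corev1.ResourceRequirements{
					Requests: res,
					Limits:   res, // requests == limits -> Guaranteed QoS
				},
			}},
		},
	}
}
```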

2.2.5 Ensuring service resource priorities

For any enterprise, cost considerations mean resources will never be sufficient, so how to guarantee the supply and allocation of resources is very important.

  1. Business budget quotas determine resource supply, and dedicated resources are provided through zones.
  2. Build elastic resource pools and open public clouds to cope with unexpected resource demands.
  3. Ensure resource usage based on service and application priorities to ensure that core services get resources first.
  4. Multiple Kubernetes clusters and multiple machine rooms provide disaster recovery, coping with failures of a cluster or machine room.

2.2.6 Implementing the cloud native architecture

After migrating to Kubernetes, we went a step further and began implementing a cloud native architecture.

To remove the obstacles to cloud native application management, we designed and implemented KubeNative, Meituan's cloud native application management engine, which makes application configuration and information management transparent to the business platforms. A business platform only needs to create native Pod resources and does not need to care about the details of synchronizing and managing application information. It also lets each PaaS platform extend the control layer and run its own Operator.

The following figure shows our entire cloud native application management architecture, which supports Hulk container platform, Serverless, TiDB and other platforms.
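
For platforms that extend the control layer in this way, a minimal Operator skeleton built with the community controller-runtime library might look like the sketch below. It only watches native Pods, matching the contract described above, and the reconcile body is a placeholder rather than KubeNative's actual logic.

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// PodReconciler is a placeholder controller that a PaaS platform could run on
// top of the cluster; it reacts to changes on native Pods.
type PodReconciler struct {
	client.Client
}

func (r *PodReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var pod corev1.Pod
	if err := r.Get(ctx, req.NamespacedName, &pod); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Platform-specific logic would go here: sync configuration, register the
	// instance with service governance, and so on.
	return ctrl.Result{}, nil
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}
	r := &PodReconciler{Client: mgr.GetClient()}
	if err := ctrl.NewControllerManagedBy(mgr).For(&corev1.Pod{}).Complete(r); err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```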

2.3 Benefits after infrastructure migration

  1. 98% of the company’s business has been containerized, significantly improving the efficiency of resource management and business stability.
  2. Kubernetes cluster stability has reached 99.99% or above.
  3. Kubernetes has become the standard for Meituan's internal cluster management platform.

III. Challenges and countermeasures of operating large-scale Kubernetes clusters

Throughout the infrastructure migration, besides solving historical legacy problems and building new systems, the rapid growth in the size and number of Kubernetes clusters brought a new challenge: how to operate large Kubernetes clusters stably and efficiently. Over the past few years of operating Kubernetes, we have gradually worked out a set of operating practices whose feasibility has been verified.

3.1 Optimization and upgrade of core components

We started on Kubernetes 1.6, whose performance and stability were poor: problems began to appear at around 1K nodes, and at 5K nodes the cluster was basically unusable. For example, scheduling performance was very poor, cluster throughput was low, occasional "avalanches" occurred, and the scale-up/scale-down link grew longer and longer.

The analysis and optimization of the core components can be summarized in four areas: kube-apiserver, kube-scheduler, etcd, and the container layer.

  1. For kube-apiserver, to reduce long stretches of 429 request retries during restarts, we implemented multi-level flow control and cut the unavailability window from 15 min to 1 min; we reduced cluster load by limiting and avoiding List operations from external systems, and balanced load across control-plane nodes through an internal VIP to keep them stable.
  2. At the kube-scheduler layer, we enhanced the scheduling-awareness strategies, making scheduling results more stable than before; the predicate early-termination and locally-optimal strategies we proposed for scheduling performance have been merged into the community and become general-purpose strategies.
  3. For etcd operation, pressure on the main cluster was reduced by splitting Events out into an independent etcd cluster, and deployment on high-spec SSD physical machines can sustain 5 times the normal peak traffic.
  4. At the container level, container reuse and fine-grained CPU allocation improve container fault tolerance and application stability, and pre-mounting disks into containers speeds up node recovery.

In addition, the community iterates very quickly, and newer versions are better in stability and feature support, so upgrading is inevitable. How to make an upgrade succeed is a big challenge, however, especially when we do not have enough buffer resources to shift workloads around.

The common solution in the industry is to upgrade the cluster in place, directly on the original cluster. That approach has the following problems:

  1. The target versions are limited: you cannot upgrade across a large version gap, only step by step from an earlier version to a nearby later one, which is time-consuming, difficult, and has a low success rate.
  2. The risk of a control-plane upgrade is uncontrollable: when APIs change, data may be overwritten and the upgrade may not even be possible to roll back.
  3. It is visible to users because containers must be recreated, so the cost and impact are high; this is the most painful point, as new container creation is unavoidable.

To this end, we studied how Kubernetes controls the container level, and designed and implemented a scheme that smoothly migrates container data from a lower-version cluster to a higher-version cluster, refining the cluster upgrade into a node-granularity, in-place hot upgrade of the containers on each host that can be paused and rolled back at any time. The new scheme mainly uses external tools to migrate Node and Pod data from the lower-version cluster to the higher-version cluster and to resolve Pod object and container compatibility issues. The core ideas are: make the lower version compatible with the higher-version API; keep the containers under a Pod from being recreated by refreshing the container hash, as sketched below; and migrate Pod and Node resource data from the lower-version cluster to the higher-version cluster.
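
To illustrate why refreshing the hash matters: the kubelet recreates a container whenever the hash it computes from the desired container spec differs from the hash recorded on the running container. The sketch below is a simplified stand-in for that check (the real kubelet hashes the full Container object); the migration tooling's job is to make the recorded hash match what the new-version kubelet computes, so no container is recreated.

```go
package upgrade

import (
	"fmt"
	"hash/fnv"
)

// ContainerSpec is a simplified stand-in for the container spec fields that
// feed into the kubelet's container hash.
type ContainerSpec struct {
	Name  string
	Image string
	// ... other fields contribute to the hash in the real implementation
}

// hashContainer computes an illustrative hash of the desired spec.
func hashContainer(spec ContainerSpec) uint64 {
	h := fnv.New64a()
	fmt.Fprintf(h, "%#v", spec) // simplified stand-in for the kubelet's object hash
	return h.Sum64()
}

// needsRecreate mirrors the kubelet's decision: a hash mismatch forces the
// container to be recreated; a match lets the running container be kept.
func needsRecreate(desired ContainerSpec, recordedHash uint64) bool {
	return hashContainer(desired) != recordedHash
}
```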

The highlights of the scheme include the following four aspects:

  1. Cluster upgrades in large-scale production environments are no longer a problem.
  2. The uncontrollable risk of existing solutions is resolved: the risk is reduced to the granularity of a single host, making the upgrade much safer.
  3. The cluster can be upgraded to any version, and the solution has a long life cycle.
  4. It elegantly avoids recreating containers during the upgrade, truly achieving an in-place hot upgrade.

3.2 Platformization and operational efficiency

Operating large clusters is very challenging, and keeping up with the rapid growth of the business and user needs is also a great test for the team. We need to think about cluster operation and R&D capability from several dimensions.

In building O&M capability for the Kubernetes and etcd clusters, we focus on safe operation, efficient O&M, standardized management, and cost saving. To that end, we have completed platform-based management and operation of the Kubernetes and etcd clusters, covering six aspects: feature extension, performance and stability, daily O&M, fault recovery, data operation, and security control.

For a Kubernetes team that does not run a public cloud business, manpower is very limited; besides daily cluster operation there are also R&D tasks, so we pay great attention to operating efficiency. We have gradually moved daily O&M onto a Kubernetes cluster management platform built inside Meituan.

  1. Cluster management is standardized and visualized, avoiding ad-hoc manual O&M operations.
  2. Alarm self-healing and automatic inspection converge problem handling, so although we have dozens of clusters, O&M efficiency is high and the on-call engineers rarely need to intervene.
  3. Turning the whole O&M process into workflows not only improves O&M efficiency but also reduces the probability of failures caused by manual operations.
  4. By analyzing operating data, we can go further with fine-grained resource scheduling and failure prediction, discovering risks earlier and improving operating quality.

3.3 Risk control and reliability assurance

With such a large scale and broad business coverage, any cluster failure directly affects service stability and even user experience. Through many O&M incidents and constant pressure to operate safely, we have formed a replicable set of risk control and reliability assurance strategies.

We divide the whole risk control chain into five levels: metrics, alarms, tools, mechanisms & measures, and people:

  1. Collect core indicators from nodes, clusters, components, and resources as data sources.
  2. Risks are pushed through a multi-level, multi-dimensional alarm mechanism covering the core metrics.
  3. In terms of tooling, the risk of mis-operation is reduced through proactive, passive, and process-based means.
  4. In terms of mechanisms, testing, gray (canary) release verification, release confirmation, and drills reduce oversights.
  5. People are the root of risk; we keep building up the team and rotating on-call duty to make sure problems get a timely response.

In terms of reliability verification and operation, we firmly believe that evaluation must be the focus: cluster inspections assess cluster health and push out reports; regular fault drills ensure real failures can be recovered quickly; and everyday problems are fed back into full-link testing to form a closed loop.

IV. Summary and future outlook

4.1 Experience

  1. Our Kubernetes implementation stays fully compatible with the community Kubernetes API; we only extend through plug-ins and avoid changing the original behavior of the control layer as much as possible.
  2. Adopt community features selectively and upgrade with clear expectations rather than blindly chasing the latest community version; try to keep one stable core version per year.
  3. Use users' pain points as the breakthrough for adoption. Businesses are pragmatic: why should they migrate at all? They fear the hassle and may refuse to cooperate, so promotion should start from the business's pain points and from the angle of helping the business; the effect is then very different.
  4. Demonstrating the value of internal cluster management and operation is also very important: when users see the value and businesses see the potential benefits, they will come to you on their own initiative.

In the container era, Kubernetes is not the only thing to focus on; for infrastructure inside an enterprise, the "upward" and "downward" integration and compatibility issues are also critical. "Upward" means connecting to users' business scenarios, because containers cannot serve businesses directly; application deployment, service governance, and scheduling are also involved. "Downward" means combining containers with the infrastructure: supporting more resource types, stronger isolation, more efficient resource usage, and so on.

4.2 Future Outlook

  1. Unified scheduling: VMs will continue to exist in small numbers for a long time, but maintaining two sets of infrastructure products is very expensive, so we are implementing unified management of VMs and containers with Kubernetes.
  2. VPA: Explore the use of VPA to further improve the efficiency of the entire resource.
  3. Cloud native application management: Currently, we have implemented cloud native application management in the production environment. In the future, we will further expand the coverage of cloud native application and constantly improve the efficiency of research and development.
  4. Cloud native architecture rollout: promote the implementation of cloud native systems in middleware, storage systems, big data, and search, in cooperation with those business teams.

About the author

Guo Liang, a technical expert at Meituan-Dianping, is currently responsible for the overall operation of Meituan-Dianping's Kubernetes clusters and for supporting the implementation of cloud native technology.

Recruitment information

The Meituan-Dianping infrastructure team is looking for senior engineers and technical experts based in Beijing and Shanghai. We are committed to building a unified, high-concurrency, high-performance distributed infrastructure platform for the whole company, covering major infrastructure areas such as databases, distributed monitoring, service governance, high-performance communication, message middleware, basic storage, containerization, cluster scheduling, Kubernetes, and cloud native. Interested candidates are welcome to send their resumes to [email protected] (subject: Infrastructure).
