This article introduces Meituan's approach to managing large-scale clusters and designing a sound cluster scheduling system, and discusses the problems, challenges, and corresponding rollout strategies Meituan faced when adopting cloud native technologies such as Kubernetes. It also covers special support built for Meituan's business scenarios, in the hope of helping or inspiring readers interested in the cloud native field.
Introduction
Cluster scheduling systems play an important role in enterprise data centers. As the number of clusters and applications grows, the complexity developers face in dealing with business problems increases significantly. How do we manage large-scale clusters, design an excellent and reasonable cluster scheduling system, and ensure stability while reducing cost and improving efficiency? This article answers these questions one by one.
| Note: this article was first published in the "Developers in the Cloud Native Era" column of New Programmer 003.
Introduction to cluster scheduling systems
A cluster scheduling system, also known as a data center resource scheduling system, is generally used to solve resource management and task scheduling problems in data centers. Its goals are to use data center resources effectively, improve resource utilization, provide automated operational capabilities for the business, and reduce the cost of service operations management. Well-known cluster scheduling systems include the open source OpenStack, YARN, Mesos, and Kubernetes, as well as internal systems at well-known Internet companies such as Google's Borg, Microsoft's Apollo, Baidu's Matrix, and Alibaba's Fuxi and ASI.
As the core IaaS infrastructure of Internet companies, cluster scheduling systems have undergone many architectural evolutions over the past decades. As business architectures evolved from monolith to SOA (service-oriented architecture) and on to microservices, the underlying IaaS infrastructure gradually moved from the bare-metal era to the container era. Although the core issues we deal with have not changed over the course of this evolution, their complexity has grown exponentially as cluster sizes and application counts have ballooned. This article elaborates on the challenges of large-scale cluster management and the design choices behind cluster scheduling systems, using Meituan's cluster scheduling system as an example: how we continuously improve resource utilization through a unified multi-cluster scheduling service, provide a Kubernetes engine service to enable PaaS components, deliver a better computing service experience for the business, and carry out a series of cloud native practices.
The challenges of large-scale cluster management
Rapid business growth leads to an explosion in server fleet size and the number of data centers. For developers, two problems must be solved in large-scale cluster scheduling scenarios:
- How to manage the deployment and scheduling of large-scale data center clusters, especially in cross-data-center scenarios; how to achieve elastic, schedulable resources; and how to improve resource utilization and reduce data center costs while ensuring application service quality.
- How to transform the underlying infrastructure into a cloud native operating system for the business, improve the computing service experience, and achieve automatic disaster recovery and automated application deployment and upgrades, reducing the business side's mental burden of managing underlying resources so it can focus more on the business itself.
The challenges of operating large clusters
To solve these two problems in a real production environment, they can be broken down into the following four challenges of operating large-scale clusters:
- How to meet users' diverse needs and respond to them quickly. Business scheduling needs and scenarios are rich and dynamic. As a platform service, a cluster scheduling system must, on the one hand, deliver functionality quickly and meet business needs in time; on the other hand, it must stay generic enough to abstract individual business requirements into generic capabilities that can be implemented on the platform and iterated on over the long term. This is very challenging for the platform team's technology roadmap: without care, the team can get bogged down in endless development of business-specific features that meet immediate needs but leave the team doing duplicated, low-value work.
- How to improve the resource utilization of online application data centers while ensuring application service quality. Resource scheduling is a recognized hard problem in the industry. As the cloud computing market grows rapidly, cloud vendors keep increasing their investment in data centers, and low data center resource utilization makes the waste worse. Gartner found that global data center server CPU utilization is only 6% to 12%, and even Amazon's EC2 Elastic Compute Cloud reaches only 7% to 17%. The reason is that online applications are very sensitive to resource contention, so the industry has to reserve extra resources to guarantee the Quality of Service (QoS) of important applications. A cluster scheduling system therefore needs to eliminate interference between applications and achieve resource isolation between them.
- How to automatically handle application exceptions, especially for stateful applications, shielding differences between equipment rooms and reducing users' exposure to the underlying layers. As application scale keeps expanding and the cloud computing market matures, distributed applications are often deployed in data centers in different regions, or even across different cloud environments, in multi-cloud or hybrid-cloud deployments. A cluster scheduling system needs to provide unified infrastructure for the business to implement a hybrid multi-cloud architecture while hiding the underlying heterogeneous environments. It also needs to reduce the complexity of application O&M management and improve application automation, providing a better O&M experience for the business.
- How to cope with the performance and stability risks that a very large single cluster, or a very large number of clusters, brings to cluster management. The complexity of cluster lifecycle management increases with the size and number of clusters. At Meituan, our two-region, multi-data-center, multi-cluster scheme has to some extent avoided the hidden risks of very large clusters and solved the problems of business isolation and regional latency. With the emergence of edge cluster scenarios and cloud requirements from PaaS components such as databases, we can expect the number of small clusters to increase significantly, and with it the complexity of cluster management and the cost of monitoring configuration and O&M. The cluster scheduling system then needs to provide more effective operating norms and ensure operational safety, alarm self-healing, and change efficiency.
Trade-offs in designing a cluster scheduling system
To address these challenges, a good cluster scheduler plays a key role. But no system is perfect in reality, so when designing a cluster scheduling system we must make trade-offs among several tensions according to the actual scenario:
- System throughput versus scheduling quality. System throughput is usually an important criterion for evaluating a system, but in a cluster scheduling system serving online services, scheduling quality matters more, because each scheduling decision has a long-lived impact (days, weeks, or even months) and placements are not adjusted unless something abnormal happens. A poor placement therefore directly increases service latency. Yet the higher the scheduling quality, the more constraints must be computed, the worse the scheduling performance, and the lower the system throughput.
- Architecture complexity versus scalability. The more functions and configuration options the system exposes to upper-layer PaaS users, and the more features it supports to improve user experience (such as application resource preemption and reclamation, or application instance self-healing), the higher the system complexity and the more likely subsystems are to conflict with each other.
- Reliability versus single-cluster scale. The larger a single cluster, the larger the scheduling domain, but also the greater the reliability challenge, because the blast radius grows and the impact of failures increases. With smaller clusters, scheduling concurrency can improve, but the scheduling domain shrinks, scheduling failures become more likely, and cluster management complexity rises.
Cluster scheduling systems in the industry currently fall into five architectures: monolithic scheduler, two-level scheduler, shared-state scheduler, distributed scheduler, and hybrid scheduler (see Figure 1 below). Each makes different choices for its own scenario, and none is absolutely better or worse.
- The monolithic scheduler uses a complex scheduling algorithm combined with a global view of the cluster to compute high-quality placements, but with high latency. Examples are Google's Borg system and the open source Kubernetes system.
- The two-level scheduler addresses the limitations of the monolithic scheduler by separating resource scheduling from job scheduling. It allows different job-scheduling logic for specific applications while still sharing cluster resources across jobs, but it cannot preempt resources for high-priority applications. Typical systems are Apache Mesos and Hadoop YARN.
- The shared-state scheduler addresses the limitations of the two-level scheduler in a semi-distributed way. Each scheduler holds a copy of the cluster state and updates it independently; once a local copy changes, the state of the entire cluster is updated, but continuous resource contention can degrade scheduler performance. Typical systems are Google's Omega and Microsoft's Apollo. (A minimal sketch of this optimistic-concurrency commit appears after this list.)
- The distributed scheduler uses relatively simple scheduling algorithms to place tasks in parallel at large scale, with high throughput and low latency. However, because the algorithms are simple and lack a global view of resource usage, high-quality placements are hard to achieve. A representative system is Sparrow from UC Berkeley.
- The hybrid scheduler spreads the workload across centralized and distributed components, using complex algorithms for long-running tasks and relying on distributed placement for short-running ones. Microsoft's Mercury takes this approach.
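To make the shared-state approach concrete, below is a minimal, hypothetical Go sketch of the optimistic-concurrency commit such schedulers rely on: each scheduler works against a versioned copy of the cluster state, and a commit succeeds only if no one else changed the state in the meantime. The types and names are illustrative, not taken from Omega or Apollo.

```go
// Optimistic-concurrency commit, as used conceptually by shared-state
// schedulers. All names are illustrative, not from any real system.
package main

import (
	"errors"
	"fmt"
	"sync"
)

// CellState is the shared, versioned view of cluster resources.
type CellState struct {
	mu      sync.Mutex
	version int64
	freeCPU map[string]int // node -> free CPU cores
}

// Commit applies a placement only if the scheduler's copy was not
// stale; otherwise the scheduler must resync its copy and retry.
func (s *CellState) Commit(seenVersion int64, node string, cpu int) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if seenVersion != s.version {
		return errors.New("conflict: state changed, resync and retry")
	}
	if s.freeCPU[node] < cpu {
		return errors.New("insufficient CPU on node")
	}
	s.freeCPU[node] -= cpu
	s.version++
	return nil
}

func main() {
	state := &CellState{version: 1, freeCPU: map[string]int{"node-1": 8}}
	// Two schedulers read the same version, then race to commit.
	v := state.version
	fmt.Println(state.Commit(v, "node-1", 4)) // succeeds
	fmt.Println(state.Commit(v, "node-1", 4)) // conflict: version moved on
}
```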
How good a scheduling system is therefore depends mainly on the actual scheduling scenario. Take YARN and Kubernetes, the two most widely used systems in the industry. Although both are general-purpose resource schedulers, YARN focuses on short-lived offline batch tasks, while Kubernetes focuses on long-running online services. Beyond the differences in architecture and features (Kubernetes is a monolithic scheduler, YARN a two-level scheduler), their design philosophies and perspectives also differ: YARN focuses on tasks, on resource overcommitment and avoiding multiple copies of remote data, aiming to complete tasks at lower cost and higher speed; Kubernetes focuses on service state, on peak staggering, service profiling, and resource isolation, with the goal of guaranteeing quality of service.
Evolution of Meituan's cluster scheduling system
During containerization, Meituan switched the core engine of its cluster scheduling system from OpenStack to Kubernetes according to the requirements of its business scenarios, and by the end of 2019 reached its goal of containerizing more than 98% of online business. But we still faced low resource utilization and high O&M costs:
- Overall cluster resource usage was low. Average CPU utilization, for example, was still only at the industry average, far behind other first-tier Internet companies.
- The containerization rate of stateful services was insufficient; MySQL and Elasticsearch in particular did not run in containers, leaving large room for optimizing service O&M costs and resource costs.
- Given business requirements, VM products would exist for a long time, and maintaining two separate environments for VM scheduling and container scheduling kept the O&M cost of the team's virtualization products high.
We therefore decided to begin a cloud native transformation of the cluster scheduling system: build a large-scale, highly available scheduling system that provides multi-cluster management and automated O&M, supports scheduling policy recommendation and self-service configuration, offers cloud native low-level extension capabilities, and improves resource utilization while guaranteeing application service quality. The core work centers on stability, cost reduction, and efficiency.
- Stability: improve the robustness and observability of the scheduling system; reduce coupling between system modules to lower complexity; improve the automated O&M capability of the multi-cluster management platform; optimize the performance of the system's core components; and ensure the availability of large-scale clusters.
- Cost reduction: deeply optimize the scheduling model and link cluster scheduling with single-machine scheduling; move from static to dynamic resource scheduling; and introduce offline service containers to combine free competition with strong control, improving resource utilization and reducing IT costs while guaranteeing the service quality of high-priority business applications.
- Efficiency: let users adjust scheduling policies themselves to meet personalized business needs, embrace the cloud native domain, and provide core capabilities for PaaS, including orchestration, scheduling, cross-cluster management, and high availability, to improve O&M efficiency.
In the end, Meituan's cluster scheduling system architecture is divided into three layers by domain (see Figure 2): the scheduling platform layer, the scheduling policy layer, and the scheduling engine layer:
- The platform layer handles business access, integrates with Meituan's infrastructure, encapsulates native interfaces and logic, and provides container management interfaces such as scale-out, update, restart, and scale-in.
- The policy layer provides unified scheduling across multiple clusters, continuously optimizes scheduling algorithms and policies, and improves CPU utilization and allocation rates based on service levels and sensitive-resource information.
- The engine layer provides the Kubernetes service, guarantees the stability of the cloud native clusters of multiple PaaS components, and sinks common capabilities into the orchestration engine to lower the cost for businesses to adopt cloud native.
Through refined operations and product polishing, we now manage nearly one million container/VM instances at Meituan in a unified way, have raised average resource utilization to a first-class level in the industry, and have also supported the containerization and cloud native adoption of PaaS components.
Unified multi-cluster scheduling: improving data center resource utilization
Resource utilization is one of the most important metrics for evaluating a cluster scheduling system. Even though we completed containerization in 2019, containerization was never the goal, only a means. Our goal is to bring more benefits to users by switching from the VM stack to the container stack, such as lower computing costs across the board.
However, improvements in resource utilization are limited by hotspot hosts in the cluster: once a scale-out happens, business containers may land on hotspot hosts, and service performance metrics such as TP95 latency may fluctuate, leaving us no choice but to guarantee service quality by adding resource redundancy, just like others in the industry. The Kubernetes scheduling engine only considers the Request/Limit quota (the Request and Limit values Kubernetes sets for each container), which is static resource allocation. As a result, even when different hosts have the same amount of resources allocated, their real utilization varies greatly because the services running on them differ.
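The following toy Go sketch illustrates why static allocation alone is not enough. The numbers are fabricated for illustration, not Meituan data: two hosts with identical static Request allocation can have very different real utilization, and a scheduler that only sees allocation cannot tell them apart.

```go
// A toy illustration: identical static allocation, very different
// real utilization, which is how hotspot hosts stay hidden from a
// purely Request/Limit-based scheduler.
package main

import "fmt"

type Node struct {
	Name         string
	AllocatedCPU float64 // sum of container Requests (static)
	UsedCPU      float64 // measured usage (dynamic)
	TotalCPU     float64
}

func main() {
	nodes := []Node{
		{"node-a", 28, 6, 32},  // over-reserved, mostly idle
		{"node-b", 28, 26, 32}, // same allocation, but a hotspot
	}
	for _, n := range nodes {
		fmt.Printf("%s: allocation %.0f%%, real utilization %.0f%%\n",
			n.Name, 100*n.AllocatedCPU/n.TotalCPU, 100*n.UsedCPU/n.TotalCPU)
	}
}
```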
In academia and industry, there are two common approaches to resolving the tension between resource utilization and application service quality. The first is to solve the problem globally with an efficient task scheduler; the second is to strengthen resource isolation between applications through single-machine resource management. Either way, we needed to fully understand cluster state, so we did three things:
- Systematically established the association between cluster state, host state, and service state, and, combined with a scheduling simulation platform, took both peak and average utilization into account to achieve prediction and scheduling based on hosts' historical loads and services' real-time loads (a sketch of such load-aware scoring follows this list).
- Through a self-developed dynamic load regulation system and a cross-cluster rescheduling system, linked cluster scheduling with single-machine scheduling and implemented QoS guarantee policies for different resource pools according to service level.
- Over three iterations, built our own cluster federation service, solving the problems of resource pre-occupancy and state data synchronization, improving cross-cluster scheduling concurrency, and implementing compute separation, cluster mapping, load balancing, and cross-cluster scheduling control (see Figure 3 below).
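As referenced in the first item above, here is a hedged sketch of what load-aware node scoring could look like: combine a host's historical peak and average utilization into one score, so that spiky hosts rank lower even when their average looks fine. The weights and structure are assumptions for illustration, not Meituan's actual algorithm.

```go
// Illustrative load-aware host scoring: hosts with high predicted
// load (especially high peaks) score poorly even if their
// instantaneous allocation looks fine.
package main

import (
	"fmt"
	"sort"
)

type HostLoad struct {
	Name string
	Avg  float64 // average CPU utilization over a window, 0..1
	Peak float64 // peak CPU utilization over the same window, 0..1
}

// score favors hosts whose predicted load stays low; the 0.7/0.3
// weights are illustrative only.
func score(h HostLoad) float64 {
	return 1 - (0.7*h.Peak + 0.3*h.Avg)
}

func main() {
	hosts := []HostLoad{
		{"host-1", 0.30, 0.85}, // spiky: low average but high peak
		{"host-2", 0.45, 0.55}, // steady
	}
	sort.Slice(hosts, func(i, j int) bool { return score(hosts[i]) > score(hosts[j]) })
	for _, h := range hosts {
		fmt.Printf("%s score=%.2f\n", h.Name, score(h))
	}
}
```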
The third version of the cluster federation service (Figure 3) is split by module into a Proxy layer and a Worker layer, deployed independently:
- The Proxy layer selects an appropriate cluster for scheduling based on weighted factors of cluster state, and selects an appropriate Worker to distribute the request to. The Proxy module uses etcd for service registration, leader election, and discovery; the leader node is responsible for resource pre-occupancy tasks during scheduling, while all nodes can handle query tasks.
- The Worker layer handles the query requests of a subset of clusters. When a cluster's tasks become blocked, a corresponding Worker instance can be added quickly to relieve the problem. When a single cluster is large, it is served by multiple Worker instances, and the Proxy distributes scheduling requests across them to increase scheduling concurrency and reduce the load on each Worker. A simplified sketch of this dispatch flow follows.
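Below is a simplified Go sketch of the Proxy-layer flow just described: score candidate clusters by weighted state factors, pick the best one, then hash the request onto one of that cluster's Worker instances. The factor names and weights are illustrative assumptions, not the production implementation.

```go
// Simplified Proxy dispatch: weighted cluster selection, then
// hash-based fan-out to a Worker instance of the chosen cluster.
package main

import (
	"fmt"
	"hash/fnv"
)

type Cluster struct {
	Name    string
	Workers []string
	// State factors, normalized to 0..1 (higher is better).
	FreeCapacity float64
	Health       float64
}

func pickCluster(clusters []Cluster) Cluster {
	best, bestScore := clusters[0], -1.0
	for _, c := range clusters {
		s := 0.6*c.FreeCapacity + 0.4*c.Health // illustrative weights
		if s > bestScore {
			best, bestScore = c, s
		}
	}
	return best
}

func pickWorker(c Cluster, requestID string) string {
	h := fnv.New32a()
	h.Write([]byte(requestID))
	return c.Workers[int(h.Sum32())%len(c.Workers)]
}

func main() {
	clusters := []Cluster{
		{"cluster-bj", []string{"w1", "w2"}, 0.7, 1.0},
		{"cluster-sh", []string{"w3"}, 0.4, 0.9},
	}
	c := pickCluster(clusters)
	fmt.Println("dispatch to", c.Name, "via", pickWorker(c, "req-42"))
}
```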
Finally, through unified multi-cluster scheduling, we moved from a static to a dynamic resource scheduling model, reducing the share of hotspot hosts and of resource fragments while guaranteeing the service quality of high-priority business applications, and raised the average CPU utilization of servers in online business clusters by 10 percentage points. Average cluster resource utilization is computed as Sum(cores in use on each node) / Sum(total cores on each node), sampled once per minute, with all samples of the day averaged.
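The same averaging, expressed as a short runnable sketch (the sample data below is fabricated for illustration):

```go
// Daily average cluster CPU utilization: one sample per minute of
// (cores in use across all nodes) / (total cores across all nodes),
// then the mean of the day's samples.
package main

import "fmt"

type Sample struct{ UsedCores, TotalCores float64 } // summed over all nodes

func dailyAvgUtilization(samples []Sample) float64 {
	var sum float64
	for _, s := range samples {
		sum += s.UsedCores / s.TotalCores
	}
	return sum / float64(len(samples))
}

func main() {
	samples := []Sample{{400, 1000}, {520, 1000}, {460, 1000}}
	fmt.Printf("avg CPU utilization: %.1f%%\n", 100*dailyAvgUtilization(samples))
}
```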
Scheduling engine service: enabling PaaS services to go cloud native
A cluster scheduling system must solve not only resource scheduling but also how services use compute. As Software Engineering at Google notes, the cluster scheduling system is a key component of Compute as a Service. Beyond resource scheduling (decomposing physical machines into units of CPU and memory) and resource contention (noisy neighbors), it must also address application management: automatic instance deployment, environment monitoring, exception handling, keeping the right number of service instances, determining how many resources a service needs, and supporting different service types. To some extent, application management matters even more than resource scheduling, because it directly affects the efficiency of business development, O&M, and service disaster recovery; after all, on the Internet, the cost of people is higher than the cost of machines.
Containerizing complex stateful applications has always been an industry challenge, because distributed systems in these scenarios usually maintain their own state machines. When such an application is scaled out or upgraded, ensuring the availability of existing instances and the connectivity between them is far more complicated and thorny than for stateless applications. Although we had containerized stateless services, we had not yet realized the full value of a good cluster scheduling system. To manage compute resources well, we must also manage service state, decouple resources from services, and improve service resilience, which is exactly what the Kubernetes engine excels at.
We built MKE, the Meituan Kubernetes Engine service, on an optimized, customized version of Kubernetes:
- Strengthened cluster O&M capabilities, building out automated cluster operations, including cluster self-healing, an alarm system, Event log analysis, and so on, while continuously improving cluster observability.
- Established benchmarks with key businesses, cooperated deeply with several important PaaS components, and quickly resolved user pain points such as Sidecar upgrade management, canary (gray) iteration of Operators, and alarm separation.
- Continuously improved the product experience and optimized the Kubernetes engine. Besides supporting user-defined Operators, MKE also provides a general scheduling and orchestration framework (see Figure 4) to help users adopt it at lower cost and reap the technical dividends. (A minimal reconcile-loop sketch follows.)
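For readers new to Operators, the core of any such orchestration framework is the level-triggered reconcile loop: observe actual state, compare it with desired state, and converge. The Go sketch below is illustrative only; it uses no real Kubernetes API, and the types are hypothetical.

```go
// A minimal reconcile loop: the pattern underlying Kubernetes
// Operators and controllers. All types here are illustrative.
package main

import (
	"fmt"
	"time"
)

type Desired struct{ Replicas int }
type Actual struct{ Replicas int }

// reconcile moves actual state one step toward desired state.
func reconcile(d Desired, a *Actual) {
	switch {
	case a.Replicas < d.Replicas:
		a.Replicas++ // scale out one instance per pass
		fmt.Println("scaled out to", a.Replicas)
	case a.Replicas > d.Replicas:
		a.Replicas--
		fmt.Println("scaled in to", a.Replicas)
	default:
		fmt.Println("converged at", a.Replicas)
	}
}

func main() {
	desired, actual := Desired{Replicas: 3}, &Actual{}
	for i := 0; i < 4; i++ { // a real controller watches events instead
		reconcile(desired, actual)
		time.Sleep(10 * time.Millisecond)
	}
}
```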
In promoting cloud native adoption, a widely asked question is: how is managing stateful applications in the Kubernetes cloud native way different from building our own management platform as before?
To answer it, we need to consider the root of the problem: operability.
- Building on Kubernetes means the system forms a closed loop, so we no longer need to worry about the data inconsistencies that often arise between two separate systems.
- The system's Recovery Time Objective (RTO) is reduced. RTO refers to the maximum tolerable duration of service interruption, that is, the minimum time needed to recover service after a disaster.
- O&M complexity is reduced and services gain automated disaster recovery (DR); besides the service itself, the configuration and state data the service depends on can be recovered together with it.
- Compared with the siloed management platforms each PaaS component previously built for itself, common capabilities can be sunk into the engine service, reducing development and maintenance costs; and by relying on the engine service, the underlying heterogeneous environments can be hidden, enabling service management across data centers and multi-cloud environments.
Future outlook: building a cloud native operating system
We believe that in the cloud native era, cluster management will fully shift from its old role of managing hardware and resources to an application-centric cloud native operating system. Toward this goal, Meituan's cluster scheduling system needs to work on the following fronts:
- Application-link delivery management. As business scale and link complexity grow, the O&M complexity of the PaaS components and underlying infrastructure that businesses depend on has long exceeded common understanding, and it is even harder for newcomers taking over a project. We therefore need to let businesses deliver services through declarative configuration and achieve self-service O&M, providing a better O&M experience, improving application availability and observability, and lightening the business's burden of managing underlying resources.
- Edge computing solutions. As Meituan's business scenarios keep expanding, demand for edge computing nodes is growing faster than expected. We will draw on industry best practices to develop an edge solution suited to Meituan, deliver edge node management capabilities to the businesses that need them as soon as possible, and achieve cloud-edge-device collaboration.
- Capability building for online-offline colocation. There is a ceiling to how far the resource utilization of online business clusters alone can be improved: according to the data center cluster data disclosed in Google's 2019 paper "Borg: The Next Generation," the resource utilization of online tasks excluding offline tasks is only about 30%, which suggests that pushing further is risky and the return on investment is low. Going forward, Meituan's cluster scheduling system will keep exploring online-offline colocation, but because Meituan's offline machine rooms are relatively independent, our implementation path differs from the common industry solutions: we will start by colocating online services with near-real-time tasks, complete the construction of the underlying capabilities, and then explore colocating online and offline tasks.
Conclusion
In designing Meituan's cluster scheduling system, we followed an overall principle: while ensuring system stability and meeting basic business needs, gradually improve the architecture to raise performance and enrich functionality. Accordingly:
- Between system throughput and scheduling quality, we chose to satisfy the system's throughput needs first, not over-pursuing the quality of any single placement but improving it afterwards through rescheduling.
- Between architecture complexity and scalability, we chose to reduce coupling between system modules, lower system complexity, and require that extended functionality be degradable.
- Between reliability and single-cluster scale, we chose to control single-cluster scale through unified multi-cluster scheduling, ensuring system reliability and reducing the blast radius.
In the future, we will continue to optimize and iterate Meituan's cluster scheduling system along these same lines, and completely transform it into an application-centric cloud native operating system.
About the author
Tan Lin, from Meituan Basic R&D Platform/Basic Technology Department.