Elastic scaling delivers value in three ways: emergency response, cost saving, and automation. On the platform side, the scattered and idle resources of various businesses are consolidated into a large-scale resource pool, and elastic scheduling and inventory control techniques are used to strike a better balance between the company's operating cost and the business experience.
This article introduces the technical challenges, promotion work, and operational lessons from landing Meituan's elastic scaling system. Elastic scaling at Meituan spans diverse business scenarios; compared with public cloud vendors and companies running self-built private clouds, it shares commonalities but also has its own characteristics. We hope it offers new ideas or inspiration.
Preface
Stable, efficient, and reliable infrastructure is the cornerstone with which Internet companies handle peak business traffic. As Meituan's unified basic technology platform, the Basic Technology Department has long been committed to applying industry-leading technology so that the basic technology platform, on which all internal online production systems depend, runs stably, securely, at low cost, and sustainably.
The elastic scaling system is an automatic scaling platform built on Docker that has evolved over many years at Meituan.
As early as 2016, Meituan began trying containers in its online environment and launched Hulk 1.0, a container cluster platform based on OpenStack. As containers landed, elastic scaling 1.0 came into being, addressing problems such as slow instance expansion, slow bring-up of expanded instances, slow resource reclamation, and redundant computing resources.
Combining the exploration and practice of those two years with the maturing of related technologies in the industry, starting in 2018 we carried out a bottom-up overall upgrade of the container cluster platform, which became the current Hulk 2.0. At the lower layers, Hulk 2.0 builds its own operating system and container engine and replaces OpenStack with Kubernetes, the de facto standard for container orchestration. At the upper PaaS layer, elastic scaling 2.0 was introduced to resolve many business pain points from the elastic scaling 1.0 practice, including:
- Inconsistent code versions on expanded instances: if newly expanded instances run a different code version from the existing ones, business logic can go wrong and cause capital loss.
- Insufficient resources during some service peaks: services cannot carry the traffic, again causing capital loss.
- High platform maintenance cost: the Beijing and Shanghai sides each had their own elastic scaling management portal and scaling logic (before the Meituan-Dianping merger, the service governance framework, CMDB system, and release system had not been unified).
- Low flexibility in configuration and use: whenever a service cluster added or removed an IDC, the change had to be configured on the platform again, so resource scheduling could not effectively follow the company's traffic scheduling architecture (unitized/SET, same-region, same-center routing, and so on).
1. Evolution of elastic scaling system
Elastic scaling 1.0 architecture
Let's start with the product evolution path and the elastic scaling 1.0 architecture diagram.
The modules, from left to right and top to bottom, are:
- Client portals: the OCTO console and the TTT console. OCTO is Meituan's service governance platform; to match how Beijing-side businesses were used to operating service nodes, a corresponding elastic scaling management page was built on OCTO. TTT is the CMDB platform of the Shanghai side (formerly the Dianping side); likewise, an elastic scaling management page was built on TTT for Shanghai-side businesses.
- Hulk-apiserver: Gateway service for the Hulk 1.0 container platform.
- Hulk-policy: the business logic module of elastic scaling 1.0, covering metric aggregation, scale-out and scale-in logic, and policies. It runs in a master/slave mode, using ZooKeeper for master election with the slave acting as a cold standby.
- Hulk data sources: OCTO, Meituan's service governance platform; CAT, Meituan's server-side and mobile monitoring platform, mainly covering the application level; Falcon, a system monitoring platform customized from open-source Falcon, mainly covering the machine level.
- Scheduler: the scheduling system of the Hulk 1.0 container platform, built as a customization on top of OpenStack.
Elastic scaling 2.0 architecture
Elastic scaling 2.0 mainly evolved in four aspects:
1) Scheduling system, replaced: OpenStack was swapped for Kubernetes, the de facto standard of container orchestration, and a zone resource pool and an emergency resource pool were added on top of the resident resource pool.
2) Single service split into microservices:
- api-server: the elastic scaling gateway service.
- engine: the elastic scaling engine, responsible for initiating the scale-out and scale-in of each service instance; master/slave mode, using ZooKeeper for master election with the slave as a hot standby.
- metrics-server, metrics-data: [new] elastic scaling metric services, responsible for collecting and aggregating data from Raptor, OCTO, and other sources.
- resource-server: [new] the elastic scaling inventory control service, safeguarding the resource demand of every service's automatic scale-out and scale-in.
3) Service portrait data platformization: portrait-server and portrait-data, the elastic scaling system's self-built service portrait data system.
4) Observability: [new] Alarm and Scanner, responsible for monitoring and operating elastic scaling alarms and routine operations.
2. Challenges and solutions
Before going further, the year 2018 is an important reference point. As mentioned in the preface, before 2018 the elastic scaling 1.0 PaaS platform had a number of problems, and so did the landing of the Hulk 1.0 container platform. Strategically, we proposed the Hulk 2.0 plan, covering the scheduling system, the container engine, kernel isolation enhancements, elastic scaling enhancements, and other areas. Tactically, we followed the principle of top-down design and bottom-up implementation.
For a long stretch before April 2018, Hulk container platform R&D focused on upgrading the underlying systems (priority went to Hulk 2.0's Kubernetes-based container orchestration and the container creation/destruction flow), and investment in the elastic scaling PaaS platform was about 0.5 headcount: new businesses stopped onboarding while existing services continued to be maintained. Once the Hulk 2.0 underlying system had basically taken shape, the container platform team began smoothly migrating Hulk 1.0 containers that were not on the elastic scaling platform over to Hulk 2.0 containers. In parallel, we started building an elastic scaling 2.0 system that could work with the orchestration capabilities of the Hulk 2.0 container platform, as a technical reserve for smoothly migrating the Hulk 1.0 containers that were already on the elastic scaling platform.
The elastic scaling 2.0 system evolved along three tracks: technology, promotion, and operation. Let's look at some of the challenges and solutions in building the platform.
2.1 Technical Challenges
Based on the state of the existing elastic scaling system, and after some research into elastic scaling products in the industry, we set three goals for platform construction:
- Phase 1: an MVP of the elastic scaling 2.0 system; in short, replace the underlying OpenStack ecosystem with the Kubernetes ecosystem while keeping the upper-layer functionality unchanged for now.
- Phase 2: on the business side, pilot with services from our own department and iterate quickly in small steps based on feedback; on the technical side, unify the business logic around the Beijing- and Shanghai-side service governance platforms, CMDB systems, and related systems.
- Phase 3: merge the 1.0 system into a single platform, reducing the learning cost for businesses and the development and maintenance cost for R&D.
Once these three phases were largely complete, the elastic scaling 2.0 system could basically run and was no worse than the 1.0 system. Then, drawing on a review of past operation and maintenance problems and on demands gathered in business interviews, we made several relatively large upgrades to the back-end architecture.
2.1.1 Elastic scheduling
As mentioned in the preface, like most elastic scaling products, the early process for enabling elastic scaling was as follows (a minimal data-model sketch follows the list):
- Create an elastic group: the basic management unit of elastic scaling, used to manage homogeneous service instances. In Meituan's practice, an elastic group mainly groups instances within the same IDC.
- Create an elastic scaling configuration: determine the specification of the instances to be expanded, such as CPU, memory, and disk size.
- Create elastic scaling rules: concretely, how many instances to add or remove per scaling action.
- Create elastic scaling tasks, such as metric-monitoring tasks and scheduled tasks.
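The sketch below models these four objects as a minimal, hypothetical data model in Go. All type and field names (ElasticGroup, ScalingConfig, ScalingRule, ScalingTask) are illustrative assumptions rather than the platform's real API; it only shows how the four concepts relate.

```go
package main

import "fmt"

// ElasticGroup is the basic management unit: a set of homogeneous instances,
// in the 1.0 practice typically scoped to one IDC.
type ElasticGroup struct {
	Service string
	IDC     string
}

// ScalingConfig fixes the "shape" of every instance the group expands.
type ScalingConfig struct {
	CPUCores int
	MemoryGB int
	DiskGB   int
}

// ScalingRule says how many instances one scaling action adds or removes.
type ScalingRule struct {
	ScaleOutStep int
	ScaleInStep  int
}

// ScalingTask triggers a rule, either on a metric threshold or on a schedule.
type ScalingTask struct {
	Kind      string  // "metric" or "cron"
	Metric    string  // e.g. "cpu_utilization", used when Kind == "metric"
	Threshold float64 // scale out when the metric exceeds this value
	Cron      string  // e.g. "55 9 * * *", used when Kind == "cron"
}

func main() {
	group := ElasticGroup{Service: "demo-service", IDC: "bj-idc-01"}
	cfg := ScalingConfig{CPUCores: 8, MemoryGB: 16, DiskGB: 100}
	rule := ScalingRule{ScaleOutStep: 10, ScaleInStep: 5}
	task := ScalingTask{Kind: "metric", Metric: "cpu_utilization", Threshold: 0.6}

	fmt.Printf("group=%+v config=%+v rule=%+v task=%+v\n", group, cfg, rule, task)
}
```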
In Meituan's landing scenarios, we ran into the following problems:
- When a service cluster deploys instances into a new IDC, the business easily forgets to create an elastic group for that IDC. As a result, the other IDCs can scale out at peak time while this one cannot.
- When a service cluster stops using an IDC, the business does not realize the corresponding elastic group should be disabled. As a result, scheduled tasks keep expanding instances in that IDC, which in some cases causes business exceptions.
- Scaling decisions are made from the perspective of a single IDC rather than from a global, god's-eye view, so when some IDCs run short of resources it is hard to fail fast.
- Business logic differs across IDCs: some services bake IDC-specific policies into their code, so expanding the whole cluster from a single image can cause business exceptions.
For these reasons, we worked through all the relevant PaaS platforms and sorted out the relationships among traffic groups, elastic groups, and release groups, to ensure elastic scaling expands accurately in Meituan's private cloud. A minimal mapping sketch follows the list.
- Traffic group: the division comes from OCTO, Meituan's service governance platform, and Dayu, the SET platform. If a service runs in SET mode (a unitized architecture similar to industry practice), each SET is a traffic group. By analogy: with undifferentiated routing, the whole cluster is one traffic group; with same-region routing, each region is a traffic group; with same-center routing, each center is a traffic group; with same-IDC-preferred routing, each IDC is a traffic group.
- Elastic group: no manual division by the business is needed; elastic groups are created by mapping from the traffic groups.
- Release group: the division comes from Plus, Meituan's release platform. Clusters that do not use the SET architecture have only one default release group. Clusters with the SET architecture have one SET-specific release group per SET, and their code and images can differ. One SET may span multiple regions, centers, and IDCs (Meituan's current SET architecture is mainly divided by IDC), so the relationship between a release group and traffic groups is 1 to N.
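Below is a minimal sketch, in Go, of how traffic groups could be derived from a cluster's routing mode and then mapped one-for-one to elastic groups. The CallMode names and the "eg-" prefix are invented for illustration; the real derivation lives in OCTO, Dayu, and the elastic scaling platform.

```go
package main

import "fmt"

// CallMode mirrors the routing modes described above; names are illustrative.
type CallMode int

const (
	Undifferentiated CallMode = iota // no routing preference: one global traffic group
	SameRegion                       // each region is a traffic group
	SameCenter                       // each center is a traffic group
	SameIDC                          // each IDC is a traffic group
	SET                              // unitized architecture: each SET is a traffic group
)

// trafficGroups derives traffic groups from the cluster's call mode and topology.
func trafficGroups(mode CallMode, regions, centers, idcs, sets []string) []string {
	switch mode {
	case SET:
		return sets
	case SameRegion:
		return regions
	case SameCenter:
		return centers
	case SameIDC:
		return idcs
	default:
		return []string{"global"}
	}
}

func main() {
	idcs := []string{"bj-01", "bj-02", "sh-01"}
	regions := []string{"north", "east"}

	// Same-region call mode: each region maps to one elastic group,
	// so no manual elastic-group division is needed.
	for _, tg := range trafficGroups(SameRegion, regions, nil, idcs, nil) {
		fmt.Printf("traffic group %q -> elastic group %q\n", tg, "eg-"+tg)
	}
}
```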
Same-region routing is representative. The automatic scale-out modes of a business cluster with same-region routing without SET, and with same-region routing plus SET, are explained here; other combinations follow by analogy.
2.1.2 Inventory control
If an elastic scaling system merely provides automatic provisioning of IaaS resources, this is not difficult. In practice, however, the elastic system has to solve the resource supply problem as it lands:
- The business's concern: how are resources guaranteed? In the past there were quite a few cases where elastic scaling 1.0 failed to expand because resources were unavailable.
- The organization's concern: how large should the elastic resource pool be? If a large amount of resources is reserved but cannot be properly multiplexed, the PaaS platform itself ends up holding idle resources.
For the business's concern, public cloud vendors in the industry generally make no SLA commitment, or find such commitments hard to honor, and so they naturally do not face the organization's concern either. At Meituan, we addressed both mainly through the following measures.
2.1.2.1 Multi-tenant Management
The platform does not allocate unlimited resources to every business line; instead, each business cluster's elastic group gets a default Quota. If a business finds the default Quota insufficient, it can be reviewed through a work-order process (in practice most cases need no adjustment). For resources within the Quota, the platform commits to an SLA: a 99.9% scale-out success rate.
An obvious question: is the Quota pre-allocated to the business as dedicated resources? No. The platform needs to control the resource idle rate and therefore does some "overselling", and it handles the risks introduced by overselling with a waterline supervision mechanism, sketched below.
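A minimal sketch of the idea, under assumed numbers: the sum of all Quotas may exceed the physical pool (the oversell), while a pool-level waterline check on actual usage keeps scale-outs safe. The function and field names are hypothetical, not the platform's API.

```go
package main

import "fmt"

// poolCapacity is the physical capacity of the resident pool, in CPU cores
// (illustrative numbers; the real pool is dimensioned per resource type).
const poolCapacity = 10000.0

// waterline is the maximum fraction of the pool that running instances may occupy.
const waterline = 0.8

// canScaleOut checks a single scale-out request against both the group's
// Quota and the pool-level waterline that supervises the oversold Quotas.
func canScaleOut(groupUsed, groupQuota, poolUsed, request float64) (bool, string) {
	if groupUsed+request > groupQuota {
		return false, "exceeds group Quota, apply for a Quota raise via work order"
	}
	if poolUsed+request > poolCapacity*waterline {
		return false, "pool waterline exceeded, platform must replenish capacity"
	}
	return true, "ok"
}

func main() {
	// The sum of all Quotas (say 14000 cores) may exceed poolCapacity (10000):
	// that gap is the "oversell", kept safe by the waterline check on real usage.
	ok, reason := canScaleOut(300, 500, 7800, 150)
	fmt.Println(ok, reason)
}
```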
2.1.2.2 Regular resource guarantee
Regular resource guarantee refers to the resource supply mechanism for ordinary weekdays and holidays. These resources come from the resident resource pool (see the architecture diagram). The platform re-estimates the resources of all onboarded services at an hourly granularity, as illustrated in the schematic below:
When a service adds or updates scaling tasks or scaling rules, the system estimates the impact of the change on the water level of the whole resource pool. If, at any superposed point in time, the change would push the pool past its water level, the platform rejects the change and notifies both the user and the platform administrators; the user can then request resources through a work order, so the resource guarantee of already-onboarded services is not undermined. The sketch below illustrates this admission check.
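The sketch below illustrates that admission check under assumed numbers: the demand of all existing tasks is superposed per hour, the proposed change's extra demand is added on top, and the change is rejected if any hour would exceed the pool waterline. Names and constants are illustrative only.

```go
package main

import "fmt"

const (
	poolCapacity = 10000.0 // cores in the resident pool (illustrative)
	waterline    = 0.8     // maximum allowed occupancy fraction
)

// hourlyDemand is the estimated pool usage for each of the next 24 hours,
// superposed over all services' existing scaling tasks and rules.
type hourlyDemand [24]float64

// admitChange estimates the extra demand a new or updated task adds per hour
// and rejects the change if any hour would break the pool waterline.
func admitChange(current, delta hourlyDemand) (bool, int) {
	for h := 0; h < 24; h++ {
		if current[h]+delta[h] > poolCapacity*waterline {
			return false, h // first hour at which the waterline would be broken
		}
	}
	return true, -1
}

func main() {
	var current, delta hourlyDemand
	for h := range current {
		current[h] = 6000 // baseline demand
	}
	current[12] = 7900 // lunch peak already near the waterline
	delta[12] = 300    // the proposed task adds 300 cores at 12:00

	if ok, hour := admitChange(current, delta); !ok {
		fmt.Printf("change rejected: waterline exceeded at %02d:00, notify user and admin\n", hour)
	}
}
```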
2.1.2.3 Emergency resource guarantee
The resident resource pool is finite. Emergency resource guarantee refers to the resource supply mechanism for exceptional periods such as user-acquisition campaigns, big promotions, and holidays. These resources come from both the resident resource pool and the emergency resource pool (see the architecture diagram). The emergency resource pool is simply public cloud capacity purchased on a pay-as-you-go basis (with a certain lease term), a more flexible pool model. On top of it we built a more cost-effective hybrid-cloud elastic scheduling capability (previously scheduling happened only within the private cloud's resident resource pool), with the following ordering (a small sketch follows the list):
- Elastic scale-out: resident-pool instances are expanded first, emergency-pool instances second.
- Elastic scale-in: emergency-pool instances are released first, resident-pool instances second.
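A minimal sketch of that ordering, assuming a toy Pool model (the struct and function names are invented for illustration): scale-out fills the resident pool before spilling into the emergency pool, and scale-in drains the emergency pool first so pay-as-you-go hosts can be returned early.

```go
package main

import "fmt"

// Pool is a simplistic model of one resource pool; counts are container instances.
type Pool struct {
	Name     string
	Capacity int // how many more instances this pool can still host
	Running  int // instances of the service currently in this pool
}

// scaleOut fills the resident pool first and only spills into the emergency pool.
func scaleOut(resident, emergency *Pool, n int) {
	fromResident := min(n, resident.Capacity)
	resident.Running += fromResident
	resident.Capacity -= fromResident
	emergency.Running += n - fromResident
	emergency.Capacity -= n - fromResident
}

// scaleIn releases emergency-pool instances first so pay-as-you-go hosts
// can be vacated and returned to the public cloud vendor as early as possible.
func scaleIn(resident, emergency *Pool, n int) {
	fromEmergency := min(n, emergency.Running)
	emergency.Running -= fromEmergency
	emergency.Capacity += fromEmergency
	fromResident := n - fromEmergency
	resident.Running -= fromResident
	resident.Capacity += fromResident
}

func min(a, b int) int {
	if a < b {
		return a
	}
	return b
}

func main() {
	resident := &Pool{Name: "resident", Capacity: 8}
	emergency := &Pool{Name: "emergency", Capacity: 200}

	scaleOut(resident, emergency, 208) // the 208-container promotion example described below
	fmt.Printf("after scale-out: %+v %+v\n", *resident, *emergency)

	scaleIn(resident, emergency, 200) // after day T+1 the emergency instances go first
	fmt.Printf("after scale-in:  %+v %+v\n", *resident, *emergency)
}
```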
The following diagrams show a service that needs 208 containers during a promotion lasting T days, of which 8 are regular resource demand and 200 are emergency resource demand.
Before the promotion:
After the promotion (T+1):
From day T+1, the emergency resource pool is vacated and returned to the public cloud vendor, and the company pays nothing further. Multi-service emergency scenarios are more complex in practice; in special cases, rescheduling and governance are required.
2.2 Promotion Ideas
Before 2020, the elastic scaling 2.0 system was not promoted to businesses at scale, mainly for two reasons:
- The company was still migrating from virtual machines and Hulk 1.0 containers to Hulk 2.0 containers; Hulk 2.0 container instances were still being polished and gradually earning the trust of businesses. Containerization is the prerequisite first step for promoting the elastic scaling 2.0 system.
- In terms of basic functionality, the 2.0 system was not yet sufficient for complex business scenarios, such as SET-based services.
By the end of 2020, about 500 services had been onboarded. These were mainly "eat your own dog food" services, smoothly migrated over from the elastic scaling 1.0 system.
2020-2021 was the scaling-up phase of the elastic scaling 2.0 system.
- Data driven: out of the company's tens of thousands of services, we used the self-built service portrait data system to mine the several thousand services suitable for onboarding elastic scaling, mainly looking at peak-and-valley characteristics, statefulness, instance configuration, business cluster scale, and other factors; we drilled down to each business unit and built 30+ operation dashboards to lock in the businesses the platform side wanted to empower.
- Value quantification: after several rounds of iterating on how to express it, we distilled the value into three aspects: emergency response, cost saving, and automation.
- Going deep into the business: after the first ~500 services had stabilized, we spent about two or three months talking with the leaders of Meituan's business lines, mainly covering an introduction to their business (only by understanding it can we empower it), an introduction to the value of elastic scaling (with best practices carried over from other businesses), the business's OKRs for the year (to assess whether elastic scaling could help it perform better and create a win-win), and the likely benefits for that business after onboarding.
- Technical training: from the feedback gathered while going deep into the business, businesses care most about technical principles, onboarding cost, system robustness, best practices, and potential pitfalls. The team summarized these into an elastic scaling white paper (technical principles, FAQ, manuals, best practices, and so on) and pitfall-avoidance guidelines, and gave on-site technical sharing sessions to business departments that needed them; when one session was not enough, we came twice. Counting business teams large and small and company-level sharing, we did this ten or twenty times.
- Closing the loop through channels: the company runs large promotional activities such as the "Peace of Mind Recovery Plan", the "517 Food Festival", and "1001 Nights Live". Whenever we learned of such an activity, we proactively joined in to see how we could help, and the results were quite good. We also searched the company's COE system (a fault review and analysis tool) for keywords such as "load", "flash sale", "surge", and "capacity expansion" to study how incidents were handled and what the follow-up items were; where we judged elastic scaling could help, we proactively reached out to the business.
- Business follow-up visits: although we answer questions in the technical support group and review on-call issues every week, that feedback is relatively scattered. We first tried questionnaires, but in practice they did not work well, so we fell back to a more primitive approach: long conversations. We proactively drew out the problems businesses ran into after onboarding, prioritized them, and quickly iterated the system itself.
Through these efforts, 80%+ of services have been onboarded to elastic scaling, covering 90%+ of the company's business lines. To echo the earlier point: if an elastic scaling system only had to provide automatic provisioning of IaaS resources this would not be so hard, but our target is a PaaS.
2.3 Operation Problems
This section focuses on three typical classes of problems encountered when delivering elastic container instances to businesses: configuration, startup, and performance. Most of them cannot be closed out within the domain of elastic scaling 2.0 (the PaaS platform itself); they often depend on full standardization across the various PaaS platforms, on business logic, and on the performance of the infrastructure layer. We go this extra step for one reason only: an elastically expanded instance helps a service only if it can actually share the service's load.
2.3.1 Configuration Problems
2.3.2 Startup Problems
Startup problems largely stem from the container being delivered ready to serve "out of the box", unlike the old flow of applying for a machine and then deploying a code package onto it. This often draws a pointed question from the business: why does this problem appear with elastic scaling when my manual releases never had it?
2.3.3 Performance Problems
In production, the performance of a service container is a complicated matter: heavy-I/O containers, high host load, several containers bursting CPU at the same time, and so on. Because elastic scaling allocates resources frequently, performance problems are more likely than in the stable scenario where a service hoards computing power and never scales. To guarantee the experience, the Hulk project team tackled the performance problem thoroughly from three angles (a toy spread-scheduling sketch follows the list):
- Before scheduling: configure personalized scheduling policies at service granularity.
- During scheduling: build a scheduling algorithm that spreads load across multiple resource dimensions, based on the service time-series features and host time-series features provided by the service portrait data platform.
- After scheduling: reschedule the heavy-I/O, heavy-CPU, and similar containers already running on existing nodes.
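As a toy illustration of the "spread by resource dimension" idea, the sketch below scores candidate hosts by how contended the new container's dominant resource already is and picks the least-loaded one. The Host fields and scoring weights are invented placeholders, not the real portrait schema or scheduling algorithm.

```go
package main

import (
	"fmt"
	"sort"
)

// Host carries time-series-derived features such as the portrait platform provides;
// the fields here are illustrative placeholders, not the real portrait schema.
type Host struct {
	Name        string
	CPUPressure float64 // 0..1, recent CPU saturation
	IOPressure  float64 // 0..1, recent disk/network IO saturation
	HeavyCPU    int     // number of CPU-heavy containers already placed
	HeavyIO     int     // number of IO-heavy containers already placed
}

// score prefers hosts where the new container's dominant resource is least
// contended, which is the "spread by resource dimension" idea in a toy form.
func score(h Host, wantsCPU, wantsIO bool) float64 {
	s := 0.0
	if wantsCPU {
		s += h.CPUPressure + 0.1*float64(h.HeavyCPU)
	}
	if wantsIO {
		s += h.IOPressure + 0.1*float64(h.HeavyIO)
	}
	return s // lower is better
}

func main() {
	hosts := []Host{
		{Name: "node-a", CPUPressure: 0.7, IOPressure: 0.2, HeavyCPU: 3},
		{Name: "node-b", CPUPressure: 0.3, IOPressure: 0.6, HeavyIO: 2},
		{Name: "node-c", CPUPressure: 0.4, IOPressure: 0.3},
	}

	// Place a CPU-heavy container: pick the host with the lowest score.
	sort.Slice(hosts, func(i, j int) bool {
		return score(hosts[i], true, false) < score(hosts[j], true, false)
	})
	fmt.Println("schedule CPU-heavy container onto", hosts[0].Name)
}
```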
3. Business enabling scenarios
3.1 Expansion and contraction during holidays
Business scenario: the business has a pronounced "seven festivals in February" pattern. Weekend traffic is about 1.5 times that of weekdays, and holiday traffic is about 3-5 times that of weekends. Machines are basically provisioned for holiday traffic, so utilization is very low the rest of the time.
How elastic scaling is used:
- Scheduled elastic tasks expand capacity for holidays to absorb the holiday traffic peak, and shrink capacity outside holiday peaks to cut server cost.
- Task configuration example: a scheduled task expands capacity before the festival, another scheduled task shrinks it after the festival, and metric-triggered expansion tasks act as a backstop.
Onboarding effect: average cost savings of 20.4% on the business side.
3.2 Capacity Expansion during daily peak hours
Business scenario: delivery sits downstream of the food delivery (takeaway) business and has a pronounced lunchtime peak.
How elastic scaling is used:
- Configure scheduled tasks: expand enough machines before the lunch peak and release the surplus after it. For example, the group [global-Banner-east-no Swimlane] is bound to two scheduled tasks: expand 125 machines at 09:55 every day and shrink 125 machines at 14:00 every day (a sketch of such a configuration follows).
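A minimal sketch of what those two scheduled tasks could look like, expressed as cron-style entries. The CronTask type and its fields are illustrative assumptions; the real platform configures this through its own console.

```go
package main

import "fmt"

// CronTask is an illustrative stand-in for one scheduled scaling task bound
// to an elastic group; the real platform configures this through its UI.
type CronTask struct {
	Group    string
	Cron     string // standard cron expression: minute hour dom month dow
	Action   string // "scale_out" or "scale_in"
	Replicas int
}

func main() {
	tasks := []CronTask{
		// Expand 125 instances before the lunch peak...
		{Group: "global-banner-east-no-swimlane", Cron: "55 9 * * *", Action: "scale_out", Replicas: 125},
		// ...and release them once the peak is over.
		{Group: "global-banner-east-no-swimlane", Cron: "0 14 * * *", Action: "scale_in", Replicas: 125},
	}
	for _, t := range tasks {
		fmt.Printf("%s %s %d @ %q\n", t.Group, t.Action, t.Replicas, t.Cron)
	}
}
```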
Onboarding effect: before elasticity, the service needed 2,420 resident machines to cope with the lunch-peak traffic; after onboarding, 365 resident machines were released, and elastic machines account for 15% of the total machine count during peak periods.
3.3 Emergency resource guarantee
Business scenario: the risk-control anti-crawler service is the company's shield for business security and data security. Hundreds of business parties are connected to it today; it processes billions of key requests every day, preventing and controlling malicious traffic for major business parties and safeguarding business stability, data security, and the correctness of statistics. During activity periods and holidays, the load on the risk-control anti-crawler services rises sharply with the businesses they protect, and a large amount of resources must be added.
How elastic scaling is used: before activities and holidays, the risk-control anti-crawler services apply for elastic emergency resources (public cloud hosts purchased on demand) and raise the elastic scale-out amount during the activity to improve service stability. After the activity or holiday, the emergency resources are scaled in and the public cloud hosts are vacated, saving cost.
Onboarding effect: supported 5 large-scale online activities for the risk-control services; through the emergency resource guarantee mechanism of elastic scaling, provided public cloud resources totaling 700+ high-spec containers and roughly 7,000+ CPUs.
3.4 Service link capacity expansion
Business scenario: the catering SaaS business delivers features in a "release train" mode, which requires applying for dedicated grayscale-link machines for 70+ services to do grayscale verification of cloud-side functionality. These machines are needed only 2-3 working days a month, yet they were held all the time, wasting machine resources. At first the team handled this by manually applying for and releasing the machines at a fixed time every month, which cost a lot of manpower and hurt R&D efficiency to some extent.
How elastic scaling is used (a minimal link-task sketch follows the list):
- Configure the link topology.
- Before the monthly activity starts, configure a link task, including the expansion time, the SET/LiteSet identifier for the machines, and the scale-out amount of each service on the link.
- When the preset time arrives, the machines are expanded and shrunk automatically.
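A minimal sketch of how such a link task could be represented, assuming invented type names (LinkService, LinkTask) and example service names; the real platform drives this from its configured link topology.

```go
package main

import (
	"fmt"
	"time"
)

// LinkService is one of the 70+ services on the grayscale link; ScaleOut is how
// many grayscale instances it needs. Field names are illustrative only.
type LinkService struct {
	Name     string
	ScaleOut int
}

// LinkTask expands every service on the link at StartAt with the given SET /
// LiteSet identifier, and shrinks them all back at EndAt.
type LinkTask struct {
	SetID    string
	StartAt  time.Time
	EndAt    time.Time
	Services []LinkService
}

func main() {
	task := LinkTask{
		SetID:   "liteset-gray",
		StartAt: time.Date(2021, 6, 1, 8, 0, 0, 0, time.Local),
		EndAt:   time.Date(2021, 6, 3, 20, 0, 0, 0, time.Local),
		Services: []LinkService{
			{Name: "order-api", ScaleOut: 2},
			{Name: "pay-gateway", ScaleOut: 2},
			// ...the remaining services on the link topology
		},
	}
	fmt.Printf("link task %s: %d services, %s -> %s\n",
		task.SetID, len(task.Services),
		task.StartAt.Format("01-02 15:04"), task.EndAt.Format("01-02 15:04"))
}
```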
Onboarding effect:
- Before: the release train involves 70+ services, and every month the owners of all 70+ services had to expand machines before delivery and shrink them after verification.
- After: once the link topology has been configured the first time, a single RD engineer per month configures one unified link task, and capacity is expanded and shrunk automatically when the preset time arrives, greatly improving efficiency.
4. Future plans for elastic scaling
With the large-scale landing of the elastic scaling 2.0 system at Meituan, we will deepen and extend the work in four directions: stability, ease of use, business solutions, and new technology exploration.
(1) Stability: the stability of basic services is the cornerstone of everything, yet it is often overlooked by R&D engineers; we need to "fix the roof while the sun is shining".
- System robustness: full-link cluster partitioning (to reduce the blast radius), pooling, and building resource QoS capabilities.
- Elastic instance stability: strengthen detection by continuously improving anomaly-detection tools (unlike conventional health checks, decisions combine heterogeneous host computing power, differing kernel parameters, and various system metrics) and automatically replacing abnormal instances; strengthen data operation to improve the feedback loop and keep co-evolving with the scheduling algorithms.
(2) Ease of use
- Enhanced preview mode: predict, from the current elastic scaling rules, how the service will scale over the next 24 hours. We are currently building an MVP of this; next, besides further improving user reach, we plan to show onboarded users their post-onboarding benefit data.
- Automatic task configuration: threshold-based configuration depends heavily on engineers' experience, and inexperienced engineers may misconfigure tasks and cause exceptions. A task recommendation feature has already been released to some onboarded services; after the tool is further polished with operational data, it will periodically and automatically help businesses adjust their configuration.
(3) Business solutions
- Link scaling: configure link topologies in batches with per-service scaling rules that can be split out, and then feed the back-pressure signals between services, between services and middleware, and between services and storage into elastic decisions.
- Zone elastic scaling: the "emergency response" value of the elastic scaling system has already been realized in the finance business line and in Oceanus, Meituan's layer-7 load-balancing gateway service; in the future we plan to provide elastic scaling support for more businesses with zone scenarios.
(4) New technology exploration: learn from the design concepts of Knative and KEDA to build technical reserves for elastic scaling in cloud-native business scenarios.
About the author
Tu Yang is currently head of the Resilience Strategy Team in the Infrastructure Department.
Recruitment information
The Meituan infrastructure team is looking for senior and expert engineers based in Beijing and Shanghai. We are committed to building a unified, high-concurrency, high-performance distributed infrastructure platform for Meituan, covering the main infrastructure areas of databases, distributed monitoring, service governance, high-performance communication, message-oriented middleware, basic storage, containerization, and cluster scheduling. Interested candidates are welcome to send their resumes to [email protected].
This article was produced by the Meituan technical team and the copyright belongs to Meituan. You are welcome to reprint or use the content for non-commercial purposes such as sharing and communication; please credit "Content reprinted from the Meituan technical team". The article may not be reproduced or used commercially without permission. For any commercial use, please email [email protected] to request authorization.