Authors: Zai Ren, Mo Feng, Guang Nan

Preface

ASI (Alibaba Serverless Infrastructure) is Alibaba's unified infrastructure designed for cloud-native applications. Built on ACK, Alibaba Cloud's public container service, ASI is the Serverless infrastructure platform that supports the cloud-native transformation of Alibaba Group's businesses and of Alibaba Cloud products.

The 2021 Tmall Double 11 was another unforgettable one for ASI. This year we accomplished many "firsts":

  • The first fully unified scheduling: e-commerce, search, ODPS offline, and ASI workloads all ran on one unified scheduling architecture, covering tens of millions of cores across the business.
  • The first smooth, "imperceptible" migration of the search business to ASI: nearly 10 million cores were migrated to ASI without the business noticing (though it cost us many sleepless nights).
  • A single K8s cluster in the ASI scenario now exceeds 10,000 nodes and millions of cores, well beyond the community's 5,000-node guideline, and we keep optimizing the performance and stability of such large clusters.
  • Middleware services supported Group business with a cloud-product architecture for the first time: built on ASI's public cloud architecture, middleware migrated smoothly to the cloud and served Group business as a cloud product, realizing the "trinity".

Tempered by large-scale production use, ASI has not only accumulated a great deal of K8s stability and O&M capability, but also incubated many innovations to support Serverless scenarios. Anyone who operates K8s (especially large-scale clusters) will know the feeling: getting K8s running is easy; running it well is really not. ASI learned many lessons in its early days on the K8s scheduling architecture, and we kept growing, learning, and maturing. For example:

  • During a routine Kubernetes major-version upgrade, upgrading the Kubelet caused nearly a thousand business Pods in one cluster to be rebuilt;
  • An online non-standard operation deleted a large number of VIPServer services; thanks to the push protection of the middleware, it did not have a disastrous impact on the business;
  • Node certificates expired, and because the node self-healing component misjudged the fault and the risk-control/flow-control rules were wrong, the self-healing component mistakenly evicted all services on 300+ nodes in one cluster.

Even a professional K8s team cannot dodge every one of the failure scenarios listed above; users who know little about K8s certainly cannot prevent or avoid such risks. So, to everyone who is using or planning to use a K8s service, a word of advice: do not assume you can run a K8s cluster yourself; there are more pitfalls in it than you can imagine. Professional people should do professional things; let professional products and SRE teams handle the operation and maintenance. Here I also strongly recommend Alibaba Cloud Container Service (ACK), because the capability, hardening, and automated operations we accumulate in Alibaba's large-scale scenarios feed back into ACK and help maintain users' K8s clusters better.

To operate and maintain so many large K8s clusters, ASI must have a few real skills. Today I will introduce ASI's O&M system and stability engineering capabilities in detail, from the perspective of building cloud-native infrastructure for Alibaba Group and supporting the Serverless evolution of Alibaba Cloud products.

ASI technical architecture

Before introducing ASI's fully managed O&M system, let me take a moment to introduce ASI itself. ASI is a Serverless platform for the Group and for cloud products, built on ACK and ACR, aiming to support Alibaba's cloud-native applications and the Serverless transformation of Alibaba Cloud products. The container service family includes ACK, ASK, and ACR.

For the business scenarios of Alibaba Group and of cloud products, we provide users with enhanced K8s cluster capabilities, such as enhanced scheduling, Workload capabilities, networking, node elasticity, and a multi-tenant security architecture. At the cluster O&M level, we provide a Serverless, no-ops experience, covering cluster major-version upgrades, component upgrades, node component upgrades, node CVE vulnerability fixes, batch node operations, and so on, to guarantee the stability of users' K8s clusters.

ASI fully managed O&M support system

ASI provides a fully managed O&M experience for large-scale K8s clusters. These capabilities are not inherent to K8s; they are stability-hardening capabilities accumulated through countless practices and failures. Looking across the industry, it is Alibaba's large-scale, complex scenarios that could temper such a large-scale K8s O&M service system.

Before going into ASI, I want to emphasize an important principle of building system capabilities: do not reinvent the wheel, but do not rely entirely on the capabilities of other systems either. No single product or system can cover every problem of every business (especially one as large as ASI). Depend on the established capabilities of upstream and downstream systems, but not completely, and design the system in clear layers. If another system already provides a solid low-level operation channel, we do not build a second one; we build our own business change orchestration on top of it. If another system already covers the monitoring and alerting link, we focus on managing monitoring and alerting rules and their routing on top of it.

Another very important point: a stability team that wants to build a good O&M management and control system must have a comprehensive and deep understanding of the business architecture. The stability team cannot just handle operations, nor do 1-5-10 (discover in 1 minute, locate in 5, recover in 10) from outside the architecture; otherwise it can hardly control the stability of the whole architecture. ASI SRE is the team responsible for the stability of the ASI infrastructure, and many SRE members can independently onboard new businesses and decide the ASI architecture for them. In fact, when SRE and R&D jointly face a business requirement, there are usually fewer problems, because the two roles are highly complementary: R&D has good judgment about the technical architecture, while SRE has good judgment about the rationality of the architecture and its stability risks.

As shown in the figure above, the ASI cluster deployment architecture is entirely based on the KOK (Kube On Kube) base of the ACK product Infra architecture. The architecture is layered as follows:

  • Meta cluster (the underlying K in the KOK architecture): hosts the core control components of the K8s business clusters. Deploying the business clusters' control planes as containers ensures a more standardized deployment model and greatly improves deployment efficiency.
  • Control-Plane: the four core control components of a business cluster: kube-apiserver, kube-controller-manager, kube-scheduler, and the etcd cluster.
  • Add-Ons: Serverless core function components, scheduling enhancement components (the unified scheduler), network components, storage components, Workload components (OpenKruise), CoreDNS, and other bypass components.
  • Data-Plane: node management components such as Containerd, Kubelet, Kata, and other plug-ins on the node.

Based on this overall ASI architecture, through continuous exploration and abstraction we have distilled the ASI O&M system into several core modules, as shown in the figure below:

  • Unified change management and control: a system capability we have been building since ASI's first day, because the experience and lessons of Alibaba's technical development show that many major failures are caused by changes that are non-standard, unreviewed, and not gated for risk;
  • Cluster O&M management and control: ACK provides the standard product capability of fully managed K8s clusters, but we still need to work out how and when to orchestrate, verify, and monitor large-scale version upgrades, and we also need to build a reasonable backup mechanism to ensure cluster stability;
  • etcd O&M management and control: etcd also builds on the fully managed etcd Serverless product capability provided by ACK; together with ACK we build etcd performance optimization and backup capabilities for super-large clusters to serve ASI better;
  • Component O&M management and control: this is a core capability of the ASI O&M system. For a fully managed Serverless service, the most important thing is that every core component has a corresponding R&D team for feature extension and O&M support. How to define the R&D model for these engineers and ensure the stability and efficiency of daily O&M is key to ASI's ability to support a large number of businesses. Therefore, at the very beginning of ASI (in 2019, when we supported the Group's migration to the cloud), we established the ASI Component Center and gradually defined and optimized the R&D and O&M model for ASI core components;
  • Fully managed node O&M management and control: this is a capability that many cloud product teams hope the container service can provide, especially business product teams that know little about infrastructure and want someone to fully manage node O&M for them. Node O&M is also a very important capability that ASI accumulated while supporting Alibaba Group; we export this experience to the public cloud (sales) regions and keep optimizing it. The biggest feature of the cloud is resource elasticity, and ASI also provides cloud product users with extreme node elasticity in the public cloud regions;
  • 1-5-10 capacity building: cloud users have a very important characteristic: a very low tolerance for failure. This poses a huge challenge to ASI, and we keep exploring how to detect, locate, and recover from problems in time;
  • Resource operation: backup capacity and cost optimization have always been problems that infrastructure services must solve. We must ensure that services run stably (for example, no OOM and no CPU contention) while also reducing cost, especially the cost of serving cloud products.

Next, I will elaborate on the design ideas and concrete solutions of the key systems and technical capabilities in the ASI fully managed O&M system, and show how we built large-scale fully managed K8s O&M services step by step.

Fully managed cluster O&M capability

When operating and maintaining large-scale K8s clusters, our deepest impression is that scale not only makes a single operation far more complex, but also greatly expands the blast radius of a single operation. We are often challenged by the following questions:

  • Are all changes covered by change risk management and control?
  • With so many clusters and so many nodes (a single ASI cluster has more than ten thousand nodes), how do we minimize the stability risk of grayscale rollouts?
  • Black-screen changes (manual command-line operations outside the platform) cannot be eliminated entirely, so how do we control their risk?
  • Although a single O&M operation is not difficult, we often face complex operations composed of multiple O&M steps. How can the system orchestrate these O&M operations conveniently?

With these four questions in mind, I will describe in detail how we kept abstracting and optimizing our system capabilities in practice and crystallized the stability capabilities that are now so important to ASI's fully managed service.

Unified change risk control

When ASI SRE was first formed in 2019, we were still exploring how to manage change risk. At that time the stability system capability was very weak, and there was truly a great deal of work for a brand-new team to do. The members of the new SRE team were all drawn from the sub-teams of Alibaba's self-developed Sigma scheduling system, so they were very proficient in containers, K8s, etcd, and related technologies, but knew almost nothing about how to do SRE and stability work. At the beginning, it took us two to three weeks just to integrate ChangeFree change approval into the ASIOps system (then called ASI-Deploy). Facing a new architecture (Sigma -> ASI), new scenarios (Group business moving to the cloud), and such a complex and huge K8s business volume, we had little outside experience to draw on.

At the time, we judged that relying on the system to control change risk would not come fast enough (the Group's businesses were moving to the cloud at full speed, with a large number of new technical solutions and online changes), so we had to rely on "rule by people" first. We told the whole ASI team that every change must be approved by SRE, but SRE could not possibly understand every technical detail of the ASI architecture and make a complete risk assessment. So we set up a "change review" meeting, inviting experts from each domain to review the risks of change plans. It was this change review mechanism that carried ASI through the "difficult" period when the change risk blocking system was still very inadequate. The ASI change review meeting has continued to this day; except during special change-freeze periods, it is held as scheduled. During that period, by participating in the approval of every online change, SRE also accumulated a large set of safe-production rules:

At the same time, we began to implement these well-defined change risk blocking rules in the ASIOps system. At first, risk blocking was implemented only at the level of the underlying system architecture, but we later found that many changes could not be detected through ASI's own low-level indicators and checks alone. We needed a mechanism to link in the upper-layer business systems and trigger business-level risk blocking rules, so as to be as sure as possible that our changes would not affect the upper-layer business. Therefore, we built management of the change risk rule base into ASIOps and implemented a webhook mechanism to hook in the inspections and E2E tests of the upper-layer business side.
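
To make the webhook idea concrete, here is a minimal sketch in Go of a business-side risk-blocking webhook, assuming a hypothetical request/response schema (the endpoint path, field names, and health check are illustrative, not ASIOps' real protocol): the change platform POSTs the change context before a grayscale batch, and the business side replies whether the change should be blocked.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// ChangeEvent is the (hypothetical) payload the change platform sends
// before each grayscale batch.
type ChangeEvent struct {
	Component string   `json:"component"` // e.g. "kube-scheduler"
	Version   string   `json:"version"`
	Clusters  []string `json:"clusters"` // clusters in the next batch
}

// Verdict tells the change platform whether to continue or block.
type Verdict struct {
	Allow  bool   `json:"allow"`
	Reason string `json:"reason,omitempty"`
}

// businessHealthy stands in for the business side's own inspection or
// E2E test; a real implementation would query monitoring or run a probe.
func businessHealthy(cluster string) bool {
	return true // placeholder
}

func riskCheck(w http.ResponseWriter, r *http.Request) {
	var ev ChangeEvent
	if err := json.NewDecoder(r.Body).Decode(&ev); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	verdict := Verdict{Allow: true}
	for _, c := range ev.Clusters {
		if !businessHealthy(c) {
			verdict = Verdict{Allow: false, Reason: "business check failed on " + c}
			break
		}
	}
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(verdict)
}

func main() {
	http.HandleFunc("/risk-check", riskCheck)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```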

With this online change risk blocking capability in place, ASI has had no red-line violations such as private changes during freeze periods, changes without grayscale, or changes without verification.

Grayscale change capability

Practical experience tells us that for every online change, no matter how carefully and strictly the plan was reviewed beforehand, how complete the risk blocking was, or how well the O&M function was written, once the code goes live there is always something we "did not expect". The things we already know about, we can handle well; what is scary is what we cannot even think of, and that is not a matter of ability.

So every feature that goes online must be rolled out in grayscale. We also need to ensure that change actions are deterministic: it cannot be that Zhang San grayscales a change across clusters in one order while Li Si grayscales the same change in another. ASI's grayscale change capability has also gone through many iterations.

In the Sigma era, clusters were deployed across equipment rooms or regions, so Sigma needed fewer than ten clusters to carry such a large business volume. For R&D engineers, with so few clusters, what each cluster did and what kind of business it carried were very clear, so the cost of a release was not high (although, because the blast radius was so large, small release problems kept happening). After evolving to the ASI architecture, clusters are planned strictly by region/equipment room, and because of the scalability limits of K8s itself, a single cluster can no longer carry hundreds of thousands of nodes the way a Sigma cluster did. The community guidance at the time was that a single cluster should not exceed 5,000 nodes (although ASI has since optimized single clusters to 10,000 nodes, a larger cluster also means higher risk in terms of stability and blast radius). Under this architecture, the number of ASI clusters is necessarily far larger than the number of Sigma clusters. The R&D engineers were still in the late-Sigma, early-ASI transition, many habits still followed the Sigma model, and the release tooling was still a product of the Sigma era, unable to support fine-grained component releases across large numbers of K8s clusters. Every release left every team's engineers nervous and afraid of problems.

Before the number of ASI clusters in the Group grew large, we realized we had to solve the problem of change determinism. With so many clusters and hundreds of thousands of nodes, letting each R&D engineer decide for themselves how to roll out a change would certainly cause problems. But at the time our system capability was very limited; we could not yet weigh all kinds of conditions intelligently and determine the optimal grayscale order for a change. So what to do? The system was not smart, but the problem had to be solved. So we proposed the concept of a pipeline: the release order of the core online clusters is decided jointly by SRE and the core R&D TL and defined as a pipeline; all R&D engineers must bind this pipeline when upgrading components, and a release is then grayscaled cluster by cluster in the stipulated order. That is where the concept of the pipeline came from. This "unimpressive-looking" feature cost us a great deal of effort to release for the first time. And when we promoted the pipeline to R&D engineers with "full confidence", we did not receive the "flowers and applause" we had imagined, but a lot of "ridicule and suggestions for improvement". So we changed the promotion strategy: start small, keep revising, then roll out broadly, until everyone accepted it. Today the pipeline has become an indispensable release tool for ASI engineers. Looking back, it is quite interesting, and it taught us a lesson: no new feature can be built behind closed doors; it must be designed and optimized from the user's point of view, and only when users are satisfied is the value of the system or product proven.

Below is the release order defined for R&D of the Group's core transaction clusters, in the order test -> small traffic -> daily -> production:

The static pipeline's ability to orchestrate ASI clusters was not a problem while ASI supported only a small number of clusters within the Group. But when ASI expanded to Alibaba Cloud products, and especially after we incubated the hard multi-tenant VC (virtual cluster) architecture with the Flink product, where each user has a small cluster, the number of clusters increased sharply, and manually arranging cluster order exposed many problems:

  • Updates were not timely: when a new cluster was added, the relevant engineers were not informed and the cluster was not added to the corresponding pipeline;
  • Insufficient automatic adaptation: when a new cloud product onboarded ASI, a new pipeline had to be added manually and was often not updated in time;
  • High maintenance cost: as the business grew, each R&D owner had to maintain many pipelines by hand;
  • Poor extensibility: the pipeline order could not be adjusted dynamically; after ASI began supporting cloud products, a very important requirement was to grayscale by GC level, which static pipelines could not support at all.

Given all these shortcomings of static pipelines, we had long been thinking about and exploring better technical solutions. ASI's core is resource scheduling, and our scheduling capability is very strong; in the Group's unified scheduling project, the e-commerce, search, offline, and Ant businesses all connect to ASI through a unified scheduling protocol. It occurred to me that the ASI unified scheduler schedules resources such as CPU and memory, and that cluster information, node counts, Pod counts, and user GC information are also "resources". Why not use the same scheduling idea to solve the problem of ordering ASI clusters for grayscale? So, borrowing from the design of the scheduler, we implemented cluster-scheduler, which takes various kinds of cluster information, scores and sorts the clusters, and produces a cluster pipeline that R&D engineers then use for grayscale releases.

cluster-scheduler implements a "dynamic" pipeline capability that solves the various problems of static pipelines (a simplified scoring sketch follows the list below):

  • When a component is grayscaled, the cluster scope filtered by cluster-scheduler will not miss any cluster;
  • The cluster release order can follow GC levels, and cluster weights can be adjusted dynamically based on cluster scale data;
  • During development and release there is no need to maintain multiple static pipelines; you only select the component's release scope, and the cluster release order is arranged automatically.
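
As referenced above, here is a minimal sketch of the scoring-and-sorting idea behind cluster-scheduler. The attributes, weights, and the meaning of "GC level" here are all illustrative assumptions; the real scheduler integrates far more signals.

```go
package main

import (
	"fmt"
	"sort"
)

// Cluster holds the (illustrative) signals used to order a grayscale release.
type Cluster struct {
	Name    string
	GCLevel int // importance grade; a higher number means less critical (assumption)
	Nodes   int
	Pods    int
}

// score ranks a cluster: less critical and smaller clusters get lower scores
// and therefore go earlier in the pipeline. Weights are arbitrary here.
func score(c Cluster) float64 {
	return -float64(c.GCLevel)*1000 + float64(c.Nodes)*0.1 + float64(c.Pods)*0.01
}

// buildPipeline sorts the candidate clusters into a release order.
func buildPipeline(clusters []Cluster) []Cluster {
	sort.Slice(clusters, func(i, j int) bool {
		return score(clusters[i]) < score(clusters[j])
	})
	return clusters
}

func main() {
	pipeline := buildPipeline([]Cluster{
		{Name: "asi-prod-core", GCLevel: 1, Nodes: 10000, Pods: 300000},
		{Name: "asi-daily", GCLevel: 3, Nodes: 200, Pods: 5000},
		{Name: "asi-small-flow", GCLevel: 2, Nodes: 800, Pods: 20000},
	})
	for i, c := range pipeline {
		fmt.Printf("batch %d: %s\n", i+1, c.Name) // daily -> small-flow -> prod-core
	}
}
```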

Of course, the static pipeline has one great advantage: the cluster release order can be arranged by hand, which R&D needs in some scenarios, such as bringing a brand-new feature online. So going forward, static and dynamic pipelines will be used together and complement each other.

Cluster webshell tool

When doing stability risk control, SRE naturally wants every change to be white-screen (performed through the platform) and online. But given the reality of our K8s operations, it is impossible to make every O&M operation white-screen. We cannot hand cluster certificates directly to R&D engineers: first, there is the security risk of credential leakage; second, operating the cluster with a local certificate is uncontrollable behavior and uncontrollable risk. In ASI's early days there were several incidents of business Pods being deleted by mistake with a local kubectl. Although we cannot yet provide white-screen system support for every K8s operation, we can provide kubectl online for R&D to use, and then build stability, security hardening, risk control, and other capabilities on top of that online tool.

Therefore, we provide a cluster login tool, webshell, in the ASIOps system. Developers apply for cluster resource access following the principle of least privilege, and then access the cluster through webshell to perform the corresponding O&M operations. All operations performed in webshell are recorded and uploaded to the audit center.

Compared with accessing the cluster with a local certificate, the online webshell has been hardened for security and stability in many ways:

  • Fine-grained permission control: permissions are bound to users, with strictly controlled validity periods and permission scopes;
  • Security: no certificate is handed to the user, so there is no risk of certificate leakage;
  • Auditing: all operations are audited;
  • Risk control: dangerous operations are detected, and online approval is required before they are executed (see the sketch after this list).
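
As a sketch of the "risk control" item above, the webshell backend can match each command against dangerous-operation patterns before executing it; commands that match require online approval. The pattern list here is illustrative, not ASI's real rule base.

```go
package main

import (
	"fmt"
	"regexp"
)

// dangerousPatterns is an illustrative list; the real rule base is richer
// and managed on the platform side.
var dangerousPatterns = []*regexp.Regexp{
	regexp.MustCompile(`\bdelete\b.*\bnodes?\b`),
	regexp.MustCompile(`\bdelete\b.*--all\b`),
	regexp.MustCompile(`\bdrain\b`),
	regexp.MustCompile(`\bcordon\b`),
}

// needsApproval reports whether a kubectl command entered in the webshell
// must go through online approval before execution.
func needsApproval(cmd string) bool {
	for _, p := range dangerousPatterns {
		if p.MatchString(cmd) {
			return true
		}
	}
	return false
}

func main() {
	for _, cmd := range []string{
		"kubectl get pods -n kube-system",
		"kubectl delete pods --all -n app",
		"kubectl drain node-123 --ignore-daemonsets",
	} {
		fmt.Printf("%-50s approval needed: %v\n", cmd, needsApproval(cmd))
	}
}
```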

Change orchestration capability

The risk blocking, grayscale change, and convergence of black-screen changes described above are all aimed at solving ASI's stability problems. But who will help solve the challenges facing us SRE engineers?

Anyone who works on stability knows that only when changes are white-screen and online can we centrally manage them and control their risk. But for an infrastructure service as large and complex as ASI, change scenarios are numerous and complex. We SREs are responsible for building the ASIOps O&M control platform while also handling heavy daily O&M work; what's more, all of us are back-end engineers, and even a single front-end page takes us at least a day to build.

The SRE team is a technical service team: we must satisfy not only the teams we serve but also ourselves. So in building system capabilities, we have kept exploring how to lower the development cost of the O&M system. As you know, O&M capability differs from business system capability: an O&M operation is usually a composite operation orchestrated from several steps. For example, to clean up leaked ENI network cards on online ECS instances, the complete O&M capability is: first run a scan script on the node to find the leaked ENIs; then feed the leaked ENIs as input to a cleanup program; finally report the corresponding status once the ENIs have been cleaned up. So what we wanted to do was build an O&M operation orchestration engine that could quickly assemble multiple independent O&M operations into complex O&M logic. At the time, we also investigated open-source orchestration tools such as Tekton and Argo. Either the project looked impressive but the functionality was too basic for our scenario, or the design was oriented toward business scenarios and unfriendly to our low-level infrastructure.

So we decided to take the essence of existing orchestration tools, reference their designs, and implement ASI's own O&M orchestration engine. This is the origin of the Taskflow orchestration engine in ASIOps, whose architecture is shown below:

  • PipelineController: maintains dependencies between tasks
  • TaskController: maintains task state information
  • TaskScheduler: schedules tasks
  • Task/Worker: executes tasks

Take node lifecycle management as an example: if it were implemented as one monolithic function, every operation step would have to be written by the developer. With Taskflow's orchestration capability, only three executors need to be implemented: ESS scale-out, node initialization, and node import. Taskflow strings these three executors into one execution flow to perform a node scale-out operation.
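
Here is a minimal sketch of how a Taskflow-style engine strings independent executors into one flow, using the executor names from the node scale-out example above as stubs. This is only an illustration of the orchestration idea; the real engine also handles dependencies, state persistence, retries, and scheduling.

```go
package main

import (
	"context"
	"fmt"
)

// Executor is one independent O&M step; the orchestration engine chains
// several of these into a complete operation.
type Executor interface {
	Name() string
	Run(ctx context.Context) error
}

type step struct {
	name string
	fn   func(ctx context.Context) error
}

func (s step) Name() string                  { return s.name }
func (s step) Run(ctx context.Context) error { return s.fn(ctx) }

// runFlow executes the executors in order, stopping at the first failure.
func runFlow(ctx context.Context, execs ...Executor) error {
	for _, e := range execs {
		fmt.Println("running:", e.Name())
		if err := e.Run(ctx); err != nil {
			return fmt.Errorf("%s failed: %w", e.Name(), err)
		}
	}
	return nil
}

func main() {
	// The three executors from the node scale-out example; bodies are stubs.
	essScaleOut := step{"ess-scale-out", func(ctx context.Context) error { return nil }}
	nodeInit := step{"node-init", func(ctx context.Context) error { return nil }}
	nodeImport := step{"node-import", func(ctx context.Context) error { return nil }}

	if err := runFlow(context.Background(), essScaleOut, nodeInit, nodeImport); err != nil {
		fmt.Println("flow failed:", err)
	}
}
```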

Today the Taskflow orchestration engine is widely used in ASIOps, covering diagnosis, contingency plans, node import and export, VC cluster services, one-off O&M, releases, and other scenarios, greatly improving the efficiency with which system capabilities for new O&M scenarios are developed.

After more than two years of practice, the core R&D engineers of the SRE team are basically "full-stack engineers" (proficient in both front-end and back-end development). Front-end development in particular is no longer a burden on the team but one of its advantages. Many system capabilities need a front-end interface to be exposed to users, and in ASI, where most R&D engineers are back-end developers, the SRE team's front-end development capability has become an important part of our "competitiveness". It also fully proves the saying that extra skills are never a burden.

Summary

For ASI's fully managed cluster O&M capability, I have introduced how the system implements change risk blocking, change orchestration, grayscale change, and the convergence of black-screen changes. Of course, what we have done at the fully managed level goes far beyond these system capabilities, and there have also been many large online changes for architecture upgrades; it is precisely because we have so many scenarios that so many important system capabilities have been accumulated.

Fully managed component O&M capability

On the fully managed capability for ASI components, we have already published a detailed article on building the ASI component grayscale system; readers who are interested can take a closer look. In a scenario as large as ASI, there is indeed real technical and experiential accumulation. So here I will not go into too much technical detail, and will instead describe our technical evolution. ASI's component grayscale capability was also shared at KubeCon 2020 in the talk "How We Manage Our Widely Varied Kubernetes Infrastructures in Alibaba".

The fully managed capability for ASI components is a very important difference between the fully managed mode and today's semi-managed container service cloud products: ASI is responsible for maintaining the core components in the K8s cluster (R&D, troubleshooting, and O&M). This is related to ASI's origin: ASI arose during the period when the Group's businesses were migrating to the cloud in full. We provided a "large cluster + shared resource pool" model to gradually migrate businesses from the Sigma architecture to ASI. The Group's businesses were not going to maintain K8s clusters or the components inside them, so the ASI team took full responsibility, and ASI gradually incubated the system capability of fully managed components.

As shown in the figure above, all components at every layer of the ASI architecture are now managed uniformly through ASIOps for grayscale changes. Looking back, maintaining all ASI components on one platform and building unified grayscale capability seems only natural, but it took a long time to make today's architecture sound. After many intense discussions and all kinds of stability pressure, we finally arrived at a top-level design that fits the current K8s architecture well:

  • IaC component model: using K8s declarative design, all types of ASI component changes are expressed as changes toward a desired end state;
  • Unified change orchestration: grayscale is the most important part of a component change, and grayscale means the grayscale order of clusters and nodes; every component change goes through grayscale orchestration;
  • Component cloud-native transformation: node components, originally managed through package-based change management, are transformed into a K8s-native, Operator-oriented end-state design, so that node components gain basic change channel, batching, and pause capabilities, while the Ops system handles component version management and grayscale change orchestration (see the sketch after this list).
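
To make the IaC/end-state idea concrete, here is a hedged sketch of what a declarative component object might look like, expressed as Go types. The type and field names are illustrative assumptions, not ASI's real CRD.

```go
package main

import "fmt"

// ComponentSpec declares the desired end state of one component, together
// with how its grayscale rollout should be paced. All names are illustrative.
type ComponentSpec struct {
	Name      string   // e.g. "coredns"
	Version   string   // desired version (end state)
	Clusters  []string // grayscale scope, ordered by the pipeline
	BatchSize int      // clusters (or nodes) per grayscale batch
	Paused    bool     // operators can pause the rollout at any batch
}

// ComponentStatus reports how far the controller has converged the actual
// state toward the declared end state.
type ComponentStatus struct {
	CurrentBatch int
	Upgraded     []string
	Failed       []string
}

type Component struct {
	Spec   ComponentSpec
	Status ComponentStatus
}

func main() {
	c := Component{
		Spec: ComponentSpec{
			Name:      "coredns",
			Version:   "v1.9.3",
			Clusters:  []string{"asi-daily", "asi-small-flow", "asi-prod-core"},
			BatchSize: 1,
		},
	}
	fmt.Printf("declare end state: %s -> %s over %d clusters\n",
		c.Spec.Name, c.Spec.Version, len(c.Spec.Clusters))
}
```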

After more than two years of development, component changes under the ASI system have been completely unified on one platform, and a very complete grayscale capability has been built on top of cloud-native capabilities:

Fully managed node O&M capability

As mentioned earlier, we do not reinvent the wheel when building system capabilities, but we cannot rely entirely on the capabilities of other products either. ACK provides the basic product capability for node lifecycle management, and ASI, as a Serverless platform built on ACK, needs to build large-scale O&M capability on top of those basic ACK capabilities. From the Sigma era to ASI's support of the Group's large unified scheduling clusters, ASI has accumulated a great deal of capability and experience in operating nodes at scale. Next I will introduce how we built fully managed node capability in the public cloud regions.

Node lifecycle definition

To build a relatively complete fully managed node O&M capability, we first need to sort out what has to be done at each stage of a node's full lifecycle. As shown in the figure below, a node's full lifecycle can be roughly divided into five stages:

  • Before node production: the complexity in the public cloud regions is that each cloud product has one or more resource accounts, and many ECS images need to be customized; these must be defined in detail when a new business is onboarded;
  • Node import: importing nodes into a cluster requires node creation, scale-out, import, and offlining operations;
  • Node runtime: the runtime stage is where the most problems occur, and it is also the stage that requires the most capability building, such as node component upgrades, batch script execution, CVE vulnerability fixes, node inspection, and self-healing;
  • Node offline: scenarios such as node cost optimization and kernel CVE fixes require large-scale node O&M capabilities such as node migration and offlining;
  • Node failure: when a node fails, we need the ability to quickly detect the problem, diagnose it, and self-heal the node.

Overall picture of node capability building

The node hosting capability of ASI in the public cloud regions has been under construction for more than a year and now carries all ASI cloud products there. Most of the core capabilities are well established, and we are continuously optimizing and improving node self-healing.

Node elasticity

One of the biggest characteristics of the cloud is resource elasticity, and node elasticity is also a very important capability that ASI provides to cloud product users. ASI's node elasticity relies on the extreme elasticity of ECS resources: ECS can be purchased and released at minute granularity, helping cloud products control resource cost precisely. The video cloud product currently relies heavily on ASI node elasticity to control resource cost, scaling nodes more than 3,000 times a day on average; after continuous optimization, ASI node elasticity can fully bring up the video cloud business within a few minutes.

For node elasticity, we have optimized performance across the whole node lifecycle:

  • Control plane: by controlling concurrency, elastic tasks for hundreds of ECS instances can be completed quickly (see the sketch after this list);
  • Component deployment optimization:
    • All DaemonSet components are regionalized;
    • RPM components are pre-installed in the ECS image, and the deployment order of node components is orchestrated to speed up node component installation;
    • Finally, the yum source bandwidth was changed from shared to dedicated bandwidth, so that other RPM download tasks do not affect our node initialization.
  • Business initialization: DADI image pre-warming is introduced to pre-warm business images quickly during node import; currently a business with a 10 GB image can be started in only 3 minutes.
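
As referenced in the "control plane" item above, here is a minimal sketch of bounding concurrency when processing many elastic ECS tasks, using a plain channel semaphore. The task body is a stub and the limits are made up; the real system tunes concurrency against cloud API quotas and cluster load.

```go
package main

import (
	"fmt"
	"sync"
)

// scaleOutNode stands in for the real work of producing and importing one
// ECS node (purchase, initialize, import); here it is just a stub.
func scaleOutNode(id int) error {
	fmt.Println("processing ECS task", id)
	return nil
}

func main() {
	const tasks = 300      // e.g. a few hundred ECS instances to bring up
	const concurrency = 32 // bound concurrency to protect downstream APIs

	sem := make(chan struct{}, concurrency)
	var wg sync.WaitGroup

	for i := 0; i < tasks; i++ {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot
		go func(id int) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			if err := scaleOutNode(id); err != nil {
				fmt.Println("task", id, "failed:", err)
			}
		}(i)
	}
	wg.Wait()
}
```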

1-5-10 capacity building

For a fully managed service like ASI, the most important thing is that we can be accountable to cloud product users for the stability of the underlying clusters. The bar for ASI's 1-5-10 capability is very high. Next, I will introduce three core stability capabilities:

  • Risk control: ASI must be able to hit the brakes in any scenario;
  • KubeProbe: quickly detect stability problems on the cluster's core links;
  • Self-healing: at large node scale, we depend heavily on the nodes' self-healing capability.

Risk control

At any time, ASI must be able to "hit the brakes": whether it is a mis-operation by our own engineers or by the upper-layer business side, the system must be able to stop the loss in time. At the beginning of this article I mentioned the incidents in which ASI rebuilt Pods at scale and deleted Pods by mistake; it is precisely because of those lessons that we have built many risk control capabilities:

  • KubeDefender rate limiting: for core resources such as Pod, Service, and Node, operations (especially Delete) are given operation tokens over 1m, 5m, 1h, and 24h time windows; when the tokens run out, a circuit breaker is triggered (a simplified sketch follows this list);
  • UA rate limiting: limits, per time window, the QPS at which certain services (identified by UserAgent) may operate on certain resources, preventing frequent access from overwhelming the apiserver and affecting cluster stability. UA rate limiting is an enhanced capability of the ACK product;
  • APF rate limiting: considers request priority and fairness at the apiserver to prevent important controllers from being starved by floods of requests; this enhances the K8s-native APF (API Priority and Fairness) capability.
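
Here is a minimal sketch of the KubeDefender-style idea referenced in the first item above: each (resource, verb) pair gets a token budget per time window, and when any window's budget is exhausted the operation is circuit-broken. The quota numbers and key format are illustrative assumptions.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// windowQuota is a token budget within one fixed time window.
type windowQuota struct {
	window time.Duration
	limit  int
	used   int
	reset  time.Time
}

// defender tracks per-operation quotas over several windows (1m/5m/1h/24h).
type defender struct {
	mu     sync.Mutex
	quotas map[string][]*windowQuota // key: "resource/verb", e.g. "pods/delete"
}

func newDefender() *defender {
	mk := func(limits map[time.Duration]int) []*windowQuota {
		var qs []*windowQuota
		for w, l := range limits {
			qs = append(qs, &windowQuota{window: w, limit: l, reset: time.Now().Add(w)})
		}
		return qs
	}
	return &defender{quotas: map[string][]*windowQuota{
		// Illustrative numbers only; real limits are configured per cluster.
		"pods/delete": mk(map[time.Duration]int{
			time.Minute: 50, 5 * time.Minute: 200, time.Hour: 1000, 24 * time.Hour: 5000,
		}),
	}}
}

// Allow consumes one token from every window; if any window is exhausted,
// the operation is blocked (circuit broken) until that window resets.
func (d *defender) Allow(key string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	now := time.Now()
	for _, q := range d.quotas[key] {
		if now.After(q.reset) {
			q.used, q.reset = 0, now.Add(q.window)
		}
		if q.used >= q.limit {
			return false
		}
	}
	for _, q := range d.quotas[key] {
		q.used++
	}
	return true
}

func main() {
	d := newDefender()
	fmt.Println("delete pod allowed:", d.Allow("pods/delete"))
}
```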

KubeProbe

KubeProbe is ASI's inspection and diagnosis platform. Through iteration, it has evolved into two architectures: a central architecture and a resident Operator architecture. KubeProbe was also accepted as a talk at this year's KubeCon Shanghai; if you are interested, you can attend the KubeCon Shanghai online conference.

1) Central architecture

There is a central control system. Users' probe cases are onboarded as images in a unified repository, using our common SDK library to customize inspection and detection logic. On the central control system we configure the association between clusters and cases, such as which cluster groups a case should run on, along with various runtime configurations. Cases can be triggered periodically, manually, or by events (such as a release). Once a case is triggered, a Pod is created in the target cluster to run the inspection or detection logic; the Pod executes the customized business check and reports success or failure to the central end through a callback or a message queue. The central end is responsible for alerting and cleaning up the case's resources.
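
As an illustration of the run-check-then-report shape of a probe case Pod in this central architecture, here is a minimal sketch. The callback URL, environment variables, and payload fields are hypothetical; the common SDK mentioned above wraps this kind of reporting in the real system.

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"os"
	"time"
)

// ProbeResult is an illustrative payload reported back to the central end.
type ProbeResult struct {
	Case    string `json:"case"`
	Cluster string `json:"cluster"`
	Success bool   `json:"success"`
	Message string `json:"message,omitempty"`
}

// runCheck stands in for a customized inspection, e.g. create and delete a
// canary Pod and verify it becomes Ready within a deadline.
func runCheck() (bool, string) {
	return true, "canary pod ready in 12s" // placeholder
}

func main() {
	ok, msg := runCheck()
	result := ProbeResult{
		Case:    os.Getenv("PROBE_CASE"),   // injected by the platform (assumption)
		Cluster: os.Getenv("CLUSTER_NAME"), // injected by the platform (assumption)
		Success: ok,
		Message: msg,
	}
	body, _ := json.Marshal(result)
	client := &http.Client{Timeout: 10 * time.Second}
	// Hypothetical callback endpoint of the central end.
	resp, err := client.Post(os.Getenv("CALLBACK_URL"), "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatalf("report probe result: %v", err)
	}
	defer resp.Body.Close()
	if !ok {
		os.Exit(1) // a non-zero exit also lets the platform mark the case failed
	}
}
```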

2) Resident Operator architecture

For high-frequency, short-period detection cases that need to run 7x24, we also implemented a resident, distributed architecture: a ProbeOperator in each cluster watches probe config CR changes and executes the probe logic in a probe Pod over and over again. This architecture makes full use of the additional functions provided by KubeProbe's central end, such as alerting, root-cause analysis, and release blocking, while adopting the standard cloud-native Operator design. The resident mode greatly increases detection frequency (because it removes the overhead of creating inspection Pods and cleaning up data), basically achieves seamless 7x24 coverage of a cluster, and is also convenient for external systems to integrate with.
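
A hedged sketch of the resident mode using controller-runtime: a reconciler watches probe configuration (a ConfigMap stands in for the real probe config CR to keep the sketch self-contained), runs the probe, and requeues itself, so detection keeps running without creating a new Pod each time. Names and the requeue interval are illustrative, not the real ProbeOperator.

```go
package main

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// probeReconciler executes a probe case each time its config is reconciled.
type probeReconciler struct {
	client.Client
}

func (r *probeReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// A ConfigMap stands in for the probe config CR in this sketch.
	var cfg corev1.ConfigMap
	if err := r.Get(ctx, req.NamespacedName, &cfg); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Run the probe logic described by cfg.Data and report the result to the
	// central end (alerting / root-cause analysis / release blocking) -- omitted.
	// Requeue so the probe runs again shortly: high frequency without the
	// overhead of creating and cleaning up an inspection Pod every time.
	return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&corev1.ConfigMap{}).
		Complete(&probeReconciler{Client: mgr.GetClient()}); err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```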

Another important point: the platform only provides platform-level capability. The real effect depends on whether the cases built on it are rich and on how easy it is for more people to write all kinds of inspection and detection cases. The platform matters, but the cases matter more. Some generic workload and component detections can already find many problems on the control link, but exposing more problems, even business-layer problems, depends on the joint effort of infrastructure engineers and business-layer engineers. In our practice, test engineers and business engineers contributed many inspection cases, such as full-link create/delete detection for ACK and ASK, full-link scale-out cases for canary businesses, and application inspections from the local-services PaaS platform, which brought considerable stability results and benefits. At present there are dozens of inspection/detection cases, executed nearly 30 million times, a number that may exceed 100 million next year. More than 99% of cluster control-plane problems and hidden risks can be discovered in advance, and the effect is very good.

Self-healing

When the business reaches a certain scale, relying solely on the SRE team's online on-call to solve problems is far from enough; the system must have very strong self-healing capabilities. K8s is designed around the desired end state and facilitates Pod self-healing through the readiness and liveness mechanisms. But when a node fails, we also need the node to heal itself quickly, or to evict the businesses on it to healthy nodes quickly. The ACK product also provides self-healing capabilities, on which ASI has made many enhancements for its business scenarios. The architecture of node self-healing in the public cloud regions is designed as follows:
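
As a minimal sketch of the risk-control side of node self-healing ("do not make the status quo worse"), the decision below checks how many nodes are already unhealthy or being healed before acting, and refuses if a budget is exceeded. The thresholds and inputs are made up for illustration.

```go
package main

import "fmt"

// NodeState is the (illustrative) input to the self-healing decision.
type NodeState struct {
	Name    string
	Healthy bool
	Healing bool
}

// shouldHeal decides whether it is safe to self-heal one more node.
// maxUnhealthyRatio and maxConcurrentHealing are made-up risk-control knobs.
func shouldHeal(nodes []NodeState, maxUnhealthyRatio float64, maxConcurrentHealing int) bool {
	if len(nodes) == 0 {
		return false
	}
	unhealthy, healing := 0, 0
	for _, n := range nodes {
		if !n.Healthy {
			unhealthy++
		}
		if n.Healing {
			healing++
		}
	}
	if healing >= maxConcurrentHealing {
		return false // too many nodes are already being repaired
	}
	if float64(unhealthy)/float64(len(nodes)) > maxUnhealthyRatio {
		// A wide failure may be a control-plane or rule problem; evicting
		// even more nodes could make the situation worse.
		return false
	}
	return true
}

func main() {
	nodes := []NodeState{
		{"node-1", true, false}, {"node-2", false, true},
		{"node-3", false, false}, {"node-4", true, false},
	}
	fmt.Println("safe to heal another node:", shouldHeal(nodes, 0.3, 2))
}
```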

As ASI's business form evolves, node self-healing capability will be enhanced in the following scenarios:

  • Richer diagnosis and self-healing rules: the current rules do not cover many scenarios, so coverage needs continuous improvement, and there are always more node failure scenarios to handle;
  • Fine-grained self-healing risk control and flow control based on node pools: the premise of node self-healing is not to make the situation worse, so we need more accurate judgment when healing;
  • Linking node self-healing with upper-layer businesses: different businesses have different requirements for node self-healing. For example, Flink jobs are task-type workloads; when a node has a problem we should evict the business and trigger task rebuilding as fast as possible, since what we fear most is a task left "half dead". Middleware/database services are stateful and do not allow us to evict businesses at will. But if we connect self-healing with the upper-layer business logic, we can expose node failures to the business, so that the business can decide whether and how to self-heal.

Looking to the future

As the unified Serverless infrastructure continuously polished together with Container Service ACK within Alibaba, ASI keeps building more powerful, fully autonomous K8s clusters, providing fully managed clusters, nodes, and components, and, as always, sharing more of this experience with the whole industry. As the infrastructure base of Alibaba Group and Alibaba Cloud, ASI provides more and more cloud products with more professional services: hosting the underlying K8s clusters, shielding the complexity of K8s, making almost all infrastructure complexity transparent, and guaranteeing stability with professional product and technical capability, so that cloud products only need to take care of their own business. Professional platforms do professional things.
