Authors | Han Tang, Zhe Far, Drunken Source    Source | Alibaba Cloud Native official account

Preface

In an interview, the Taiwanese writer Lin Qingxuan looked back on more than 30 years of writing: "In my first decade I was brilliant and dazzling, outshining everything around me. In my second decade the light finally mellowed; I no longer stole the limelight, but complemented the beauty around me. Entering my third decade, with the bustle fading away, I reached the stage of 'mellow light' and truly came to appreciate the beauty of that state."

As the saying goes, true water has no fragrance. After marveling at the excitement of K8s and the beauty of its ecosystem, it is time to take a step back and appreciate the beauty of highly available systems. After all, only a system that can take a beating can stand alone in the martial arts world!

There is a well-known problem in the field of K8s high availability: as the scale of a single K8s cluster grows, how do we continue to guarantee its SLO? To give you a concrete feel for it, I will start from the high availability challenges brought by the growth of single-cluster scale.

A single ASI cluster now supports a scale well beyond the community's 5,000-node limit. This is both interesting and challenging, and it should be a worthwhile topic for anyone who needs to run K8s in production, or who already has production experience with K8s. Looking back at how a single ASI cluster grew from 100 to 10,000 nodes, the pressures and challenges we faced changed step by step, along with business growth and the innovations each jump in cluster scale required.

ASI: Alibaba Serverless Infrastructure, the unified infrastructure designed by Alibaba for cloud native applications. ASI is the Alibaba Group enterprise version of the ACK public cloud service.

As is well known, the K8s community version supports at most 5,000 nodes; beyond that, various performance bottlenecks appear, such as:

  • etcd suffers from high read/write latency.
  • kube-apiserver has very high latency when querying Pods/Nodes, which can even lead to etcd OOM.
  • Controllers cannot perceive data changes in time; for example, watch data latency occurs.

For example, when ASI grew from 100 to 4,000 nodes, we optimized the performance of both the apiserver clients and the apiserver server in advance. On the client side, we made components access the local cache preferentially and performed load balancing across clients; on the server side, we mainly optimized the watch path and cache indexes. In the etcd kernel, we used concurrent reads to improve the read throughput of a single etcd cluster, raised the storage ceiling of etcd with a new hashmap-based freelist management algorithm, and improved multi-replica capability with raft learner technology.
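
To make the client-side idea of "preferentially accessing the local cache" concrete, here is a minimal sketch using the standard client-go informer machinery; the namespace and kubeconfig path are illustrative, and this is not the actual ASI component code. The first call pays the full price of a LIST served by kube-apiserver (and etcd), while the second is answered from the process-local informer cache kept up to date by a single LIST/WATCH.

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig (illustrative; a real
	// component would typically use its in-cluster configuration).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Expensive path: every call is a full LIST served by kube-apiserver/etcd.
	pods, err := client.CoreV1().Pods("default").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("pods via apiserver:", len(pods.Items))

	// Cheaper path: LIST/WATCH once, then serve all reads from the local cache.
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	podLister := factory.Core().V1().Pods().Lister()
	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	cached, err := podLister.Pods("default").List(labels.Everything())
	if err != nil {
		panic(err)
	}
	fmt.Println("pods via local informer cache:", len(cached))
}
```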

From 4,000 to 8,000 nodes, we added QPS rate limiting management and capacity management optimization, split etcd storage by resource object, and rolled out the full component specification lifecycle, using client-side specification constraints to reduce the pressure on the apiserver and the pressure that penetrates through to etcd.
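
As one hedged illustration of what a client-side specification constraint can look like, the sketch below (a hypothetical helper, not the ASI implementation) replaces client-go's default rate limiter with an explicit token bucket, so a component's QPS and burst against the apiserver become bounded values that a specification review can check.

```go
package clientspec

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/util/flowcontrol"
)

// newConstrainedClient is a hypothetical helper: it copies a base rest.Config
// and pins the client-side rate limiter to an explicit token bucket, so the
// component's request rate against kube-apiserver is bounded by qps/burst.
func newConstrainedClient(base *rest.Config, qps float32, burst int) (*kubernetes.Clientset, error) {
	cfg := rest.CopyConfig(base)
	cfg.RateLimiter = flowcontrol.NewTokenBucketRateLimiter(qps, burst)
	return kubernetes.NewForConfig(cfg)
}
```

Combined with server-side limits, this kind of constraint keeps a single misbehaving component from pushing unbounded load through the apiserver into etcd.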

Finally, as the node count grew from 8,000 to 10,000, we went all out on optimizing the etcd compaction algorithm, the single-node multi-boltdb architecture of etcd, and server-side data compression in the apiserver, and reduced etcd write amplification through component governance. At the same time, we began to build a regular stress-testing service capability to continuously verify ASI's SLO.

High availability challenges like these are everywhere, and the capabilities listed above are only a small part of the whole, so it may be hard to see the connections between them and the logic behind their evolution from these fragments alone. Of course, much more capability building has settled into our systems and mechanisms. This article first gives an overview of the key parts of how we built ASI's global high availability system, and follow-up articles will explain the technical points and evolution paths in detail. If you have any questions, please leave them in the comments.

Overview of ASI global high availability

High availability is a complex proposition. Any routine change, such as a service upgrade, hardware replacement, data migration, or traffic surge, may damage the service SLO or even make the service unavailable.

ASI, as a container platform, does not exist in isolation; together with the underlying cloud layer and shared services, it forms a complete ecosystem. To solve ASI's high availability problem, we need to look at each layer from a global perspective, find the optimal solution for each, and then chain them into the overall optimum. The layers involved include:

  • Cloud infrastructure management, including availability zone selection, planning, and hardware asset management
  • Node management
  • ASI cluster management
  • Shared services
  • Cluster operations and maintenance
  • Application development

In particular, ASI supports very large business clusters, involves a large number of development and operations staff, releases and iterates features frequently, and serves a wide variety of complex and changing runtime workloads. Compared with other container platforms, ASI therefore faces far more high availability challenges, and the difficulty is self-evident.

ASI global high availability design

As shown in the figure below, the overall strategy for high availability construction at this stage targets 1-5-10 (detect a fault within 1 minute, locate it within 5 minutes, stop the loss within 10 minutes), with a focus on settling capabilities into systems and mechanisms so that SRE and Dev can share on-call duties without distinction.

Avoiding problems as much as possible, and detecting, locating, and recovering from them as quickly as possible when they do occur, are the keys to reaching this goal. We therefore build the ASI global high availability system in three parts: first, basic capability construction; second, emergency response system construction; and third, keeping these capabilities fresh and continuously evolving through regular stress testing and fault drills.

These three parts drive each other in rotation to form ASI's global high availability system. At the top sit the SLO system and the 1-5-10 emergency response system. Behind the emergency response and data operation systems, we have built a large amount of high availability infrastructure, including the defense system, high availability architecture upgrades, the fault self-healing system, and the continuous improvement mechanism. At the same time, we have built several basic platforms that support the high availability system, such as the regular fault drill platform, the full-link simulated stress testing platform, the alarm platform, and the contingency plan center.

Global high availability basic capability building

Before we built the global high availability capabilities, our rapidly evolving system kept producing incidents and near-misses that had to be handled every few days. Problems were always chasing us, we lacked efficient means to deal with them, and we faced several severe challenges:

  • How can we improve availability in terms of architecture and capabilities, and reduce the probability and impact of system failures?
  • How can we break through the performance and architecture of core links to support complex, ever-changing business scenarios and the common demands of business growth?
  • How can we keep problems from chasing us, doing prevention well and avoiding emergencies?
  • How can we detect, diagnose, and stop the loss quickly when an emergency does occur?

Looking into these problems, we summarized the following core causes:

  • Insufficient availability capabilities: in the Group scenario, components change constantly and the pressure and complexity on the system keep growing, while ASI's production-grade availability capabilities, such as rate limiting/degradation and load balancing, were lacking. Components were easy to misuse, and low-level errors kept affecting cluster availability.
  • Inadequate system risk control and pod protection: in case of human misoperation or a system bug, innocent business pods could easily be damaged, sometimes at large scale.
  • Capacity risks: there were already several hundred clusters running nearly a hundred components; in addition, because of podCIDR and node IP address configuration, most ASI meta-clusters were limited to 128 nodes. With the rapid growth of the business, capacity risk became a serious challenge.
  • Business development constrained by limited single-cluster scale and insufficient horizontal scaling: the growing size of a single cluster, together with changes in business types and components, affects the maximum scale a single cluster can support and the continuity of its SLO.

1. Top-level design of high availability basic capabilities

To solve these problems, we produced a top-level design for the basic high availability capabilities, which falls into the following parts:

  • Performance optimization and high availability architecture construction: increase the types and volume of business a cluster can support through performance optimization and architecture upgrades.
  • Full lifecycle management of component specifications: enforce specifications across the entire component lifecycle, from creation, enablement, and cluster onboarding to every subsequent change, to prevent misuse, uncontrolled growth, and unbounded expansion, keeping components within what the system can bear.
  • Attack-and-defense system construction: improve the security, defense, and risk control capabilities of the ASI system from an attack-and-defense perspective.

Below, we describe a few key capabilities built around some of our pain points.

2. Pain points of the K8s single-cluster architecture

  • Insufficient control over the apiserver and insufficient emergency response capability: in our own experience, cluster master anomalies occurred more than 20 times, and each recovery took more than an hour.
  • The apiserver is a single point within the cluster, with a large blast radius.
  • When a single cluster is large, the apiserver memory water level is high, driven by frequent queries and the writing of more and larger resource objects.
  • The business layer lacks cross-datacenter disaster recovery capability; when ASI becomes unavailable, it can only rely on ASI's own recovery ability.
  • As the cluster keeps growing, large numbers of offline tasks are created and deleted frequently, putting ever greater pressure on the cluster.

The availability of the cluster architecture can be improved from two major directions: besides architecture optimization and performance breakthroughs within a single cluster, we also need horizontal scaling across multiple clusters to support a larger overall scale.

  • One is to use multi-cluster capabilities such as federation to provide horizontal scaling beyond a single cluster and cross-cluster disaster recovery within a region.
  • The other is that the architecture of a single cluster itself can provide differentiated SLO guarantees through isolation and priority policies.

3. ASI architecture upgrades

1) APIServer multichannel architecture upgrade

The core solution is to split apiservers into groups and apply different priority policies to each group, thereby providing differentiated SLO guarantees for the business (a small client-side routing sketch follows the list below).

  • Divert traffic to reduce the pressure on the main-link apiserver (core demand)
    • Components at priority P2 and below connect to the bypass apiserver and can be throttled as a whole in an emergency (for example, when cluster stability is affected).
  • The bypass apiserver works with the main link to enable blue-green deployment and grayscale release (secondary demand)
    • The bypass apiserver can run an independent version to carry new functionality, such as an independent rate limiting policy or verification of new features.
  • SLB disaster recovery (secondary demand)
    • The bypass apiserver takes over service when the main apiserver is abnormal (controllers need to switch their target address).
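
The client-side effect of this grouping can be sketched as follows; the endpoint addresses and the priority convention are illustrative assumptions, not the actual ASI configuration. Components at P2 and below are handed the bypass apiserver endpoint, so they can be throttled or upgraded as a group without touching the main link.

```go
package apichannel

import "k8s.io/client-go/rest"

const (
	mainAPIServer   = "https://apiserver-main.example:6443"   // P0/P1 components (illustrative address)
	bypassAPIServer = "https://apiserver-bypass.example:6443" // P2 and below (illustrative address)
)

// configFor returns a copy of the base config pointed at the apiserver group
// that matches the component's priority class.
func configFor(base *rest.Config, priority int) *rest.Config {
	cfg := rest.CopyConfig(base)
	if priority >= 2 {
		cfg.Host = bypassAPIServer
	} else {
		cfg.Host = mainAPIServer
	}
	return cfg
}
```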

2) ASI multi-cluster federation architecture upgrade

At present, a single machine room in the Zhangbei central region already has tens of thousands of nodes. If multi-cluster management is not solved, the following problems arise:

  • Disaster recovery: the central unit of a core trading application is deployed in a single cluster; in the worst case, if that cluster becomes unavailable, the application's service becomes unavailable with it.

  • Performance: at certain times, core applications require a cap on how many instances can be stacked on a single machine and require exclusive CPUs. If they are confined to one cluster, the limited number of nodes forces instances to be stacked together, creating CPU hotspots and poor performance. And for the masters managed by ASI, expanding a single cluster without bound means its performance will always be the bottleneck.

  • Operations: when an application runs out of resources, SRE has to decide which cluster to add nodes to, which increases the burden of SRE cluster management.

ASI therefore needs a unified multi-cluster management solution that provides better multi-cluster management capabilities for upper-layer PaaS platforms, SRE, and application developers, shields the differences between clusters, and makes it easy for multiple parties to share resources.

ASI chose to build on the community Federation v2 to meet these needs.

4. Performance challenges brought by K8s cluster scale growth

What are the performance issues in a large K8s cluster?

  • First, query-related issues. In a large cluster, the most important thing is how to minimize expensive requests. With millions of objects, querying Pods by label or namespace, or listing all nodes, can easily cause etcd or kube-apiserver OOM, packet loss, avalanches, and other problems (see the paging sketch after this list).

  • Second, write-related issues. etcd is suited to read-heavy, write-light scenarios; a flood of write requests can make the db size grow continuously and push write performance to its bottleneck. For example, large numbers of offline jobs frequently create and delete Pods, and the write amplification of Pod objects along the ASI link ultimately magnifies the write pressure on etcd by dozens of times.

  • Finally, the issue of large resource objects. etcd is suited to storing small key-value data; its performance deteriorates rapidly when values are large.
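
As a hedged sketch of how a client can avoid the expensive full-LIST pattern described above, the example below pages through Pods with Limit/Continue and narrows the result set with a label selector, so no single request has to materialize millions of objects in kube-apiserver and etcd at once; the namespace and selector are illustrative.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	total := 0
	opts := metav1.ListOptions{
		LabelSelector: "app=offline-job", // narrow the result set on the server side
		Limit:         500,               // page size; each page stays a cheap request
	}
	for {
		page, err := client.CoreV1().Pods("batch").List(context.TODO(), opts)
		if err != nil {
			panic(err)
		}
		total += len(page.Items)
		if page.Continue == "" {
			break // last page reached
		}
		opts.Continue = page.Continue
	}
	fmt.Println("pods matched:", total)
}
```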

5. Breaking through ASI performance bottlenecks

Direction of ASI performance optimization

ASI performance can be optimized on three fronts: the apiserver client side, the apiserver server side, and etcd storage.

  • On the client side, we optimize caching so that clients preferentially read from their local informer cache, and we add load balancing towards both the apiserver and etcd. Component performance specifications let us verify, when a component is enabled and onboarded to a cluster, whether each of these client-side optimizations meets the requirements.
  • On the apiserver server side, optimization spans the access layer, cache layer, and storage layer. In the cache layer we focus on cache indexes and the watch path; in the storage layer we compress Pod data with the Snappy compression algorithm (a small illustration follows this list); in the access layer we focus on building rate limiting capabilities.
  • On the etcd storage side, we also did a lot of work: algorithm optimizations inside the etcd kernel, basic horizontal splitting by placing different resources into different etcd clusters, and improved scalability through multi-boltdb support in the etcd server layer.
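
For the storage-layer compression mentioned above, here is a tiny, generic illustration (not the ASI apiserver code) of Snappy-compressing a serialized object before it is written and decompressing it on read, using the github.com/golang/snappy library:

```go
package main

import (
	"fmt"

	"github.com/golang/snappy"
)

func main() {
	// Stand-in for a large serialized Pod object.
	raw := []byte(`{"kind":"Pod","metadata":{"name":"demo","annotations":{"big":"...lots of bytes..."}}}`)

	// Compress before writing to storage.
	compressed := snappy.Encode(nil, raw)
	fmt.Printf("raw=%dB compressed=%dB\n", len(raw), len(compressed))

	// Decompress on read.
	restored, err := snappy.Decode(nil, compressed)
	if err != nil {
		panic(err)
	}
	fmt.Println("round trip ok:", string(restored) == string(raw))
}
```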

6. Weak prevention capabilities of a K8s cluster

In K8s, kube-apiserver is the unified entry point, and all controllers and clients work around it. Although we as SRE keep improving the checkpoints in the full component lifecycle specification, for example, gated approval at the enablement and cluster-onboarding stages, together with cooperation and refactoring by each component owner, has prevented a large number of low-level errors, some controllers and behaviors remain uncontrollable.

Besides failures at the infrastructure level, changes in business traffic are another factor that destabilizes K8s: sudden bursts of pod creation and deletion, if not restricted, can easily bring the apiserver down.

In addition, improper operations or code bugs may harm business pods, for example by deleting pods that should not be deleted.

Taking all these risks together, we adopt a layered design with risk prevention and control at every layer.

7. Strengthening the prevention capabilities of a single ASI cluster

1) Fine-grained, multi-dimensional (resource/verb/client) rate limiting at the API access layer

The early rate limiting approach in the community mainly controlled total read and write concurrency through max-in-flight limits. Before APF (API Priority and Fairness) appeared, we already felt the lack of rate limiting: we could not limit traffic by request source. APF limits traffic per user (after the authn filter has run), which has drawbacks: authn is not cheap, and APF only distributes the apiserver's capacity according to configuration; it is not a rate limiting scheme or an emergency plan in itself. We urgently needed a rate limiting capability for emergencies, so we developed ua-limiter and, based on its simple configuration, built a rate limiting management capability that makes it easy to roll out default rate limits across hundreds of clusters and to execute emergency rate limiting plans.

The following compares our self-developed ua-limiter with other rate limiting schemes:

ua-limiter, APF, and Sentinel each emphasize different aspects of rate limiting:

  • ua-limiter provides a simple, hard QPS limit keyed on the user agent.
  • APF focuses more on concurrency control and considers traffic isolation and fairness after isolation.
  • Sentinel is feature-rich, but its fairness support is not as thorough as APF's, and its complexity is too high.

Given our requirements and scenarios at the time, ua-limiter was the most appropriate to implement, because we distinguish components by their user agent. For finer-grained rate limiting later on, APF and other schemes can be brought in as reinforcements. A minimal sketch of the idea follows.
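
The sketch below shows a user-agent keyed hard QPS limit, assuming an apiserver-style HTTP filter chain and using golang.org/x/time/rate; the rule structure and exact-match lookup are simplifications of what a real ua-limiter would need (for example, prefix matching and dynamic configuration), not the actual ASI implementation.

```go
package ualimit

import (
	"net/http"

	"golang.org/x/time/rate"
)

// Rule is a simplified limit entry: requests whose User-Agent exactly matches
// the map key get at most QPS requests per second with the given burst.
type Rule struct {
	QPS   rate.Limit
	Burst int
}

// NewUAFilter wraps a handler with a hard, per-User-Agent QPS limit.
func NewUAFilter(next http.Handler, rules map[string]Rule) http.Handler {
	limiters := make(map[string]*rate.Limiter, len(rules))
	for ua, r := range rules {
		limiters[ua] = rate.NewLimiter(r.QPS, r.Burst)
	}
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		if l, ok := limiters[req.Header.Get("User-Agent")]; ok && !l.Allow() {
			// Hard limit reached: reject instead of queueing, so the main
			// link stays protected during an emergency.
			http.Error(w, "throttled by ua-limiter", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, req)
	})
}
```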

How should rate limiting policies be managed? There are hundreds of clusters, each with a different scale, node count, and pod count; there are nearly a hundred internal components, each requesting on average four kinds of resources, and each resource has on average three different verbs. If everything were rate limited individually, the rules would explode, and even after consolidation the maintenance cost would be very high. So we focus on the core: the core resources (pod, node) and the core verbs (create, delete, large queries), plus the largest sources of traffic: DaemonSet components and PV/PVC resources. Combined with analysis of actual online traffic, we distilled about twenty general rate limiting policies and folded them into the cluster delivery process to close the loop.

When a new component is onboarded, we also design its rate limits; if it is special, we bind dedicated rules and automatically deliver the policies during cluster onboarding and deployment. If heavy rate limiting occurs, an alarm is triggered, and SRE and R&D follow up to optimize and resolve it.

2) Fine-grained rate limiting at the pod level

All pod-related operations are routed through Kube Defender, the unified risk control center, for flow control at the second, minute, hour, and day granularities. The global risk control and rate limiting component is deployed centrally and maintains the rate limiting of interface calls in the various scenarios.

Defender is a risk control system that, from a whole-K8s-cluster perspective, defends against (via flow control, circuit breaking, and validation) and audits risky operations initiated by users or by the system. We built Defender mainly for the following reasons:

  • Components such as kubelet and the controllers run as many instances across the cluster, and no single instance sees the global picture or can enforce accurate limits.
  • From an O&M perspective, rate limiting rules scattered across components are hard to configure and audit, and when an operation fails because of rate limiting, the troubleshooting path is long.
  • K8s is a distributed system designed around eventual consistency, where every component can make its own decisions, so a centralized service is needed to control the risk of dangerous decisions.

Defender looks like this:

  • Defender Server is a cluster-level K8s service. Multiple Defender Server instances can be deployed, one active and the others standby.
  • Users configure risk control rules through kubectl.
  • Components in K8s, such as controllers, kubelet, and extension controllers, can integrate the Defender SDK with minor changes so that, before performing a dangerous operation, they request a risk control decision from Defender and proceed only if the result allows it. Defender thus acts as a cluster-level risk control center safeguarding the overall stability of the K8s cluster (a minimal SDK-style sketch follows this list).
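
The SDK integration pattern described in the last bullet might look roughly like the sketch below; every name here (DefenderClient, Permit, the verdict fields) is illustrative, since the real SDK is internal, but the shape of the call, asking a central service before a destructive action, is the point.

```go
package defenderexample

import (
	"context"
	"fmt"
)

// Verdict is an illustrative risk control decision.
type Verdict struct {
	Allowed bool
	Reason  string
}

// DefenderClient is a stand-in interface for the real Defender SDK client.
type DefenderClient interface {
	Permit(ctx context.Context, operation, namespace, name string) (Verdict, error)
}

// deletePodWithRiskControl asks Defender before deleting a pod and refuses to
// act when risk control is unreachable, since pod deletion is destructive
// (whether to fail open or closed is a policy choice).
func deletePodWithRiskControl(ctx context.Context, d DefenderClient,
	deletePod func(ns, name string) error, ns, name string) error {

	v, err := d.Permit(ctx, "DeletePod", ns, name)
	if err != nil {
		return fmt.Errorf("risk control unavailable, refusing to delete %s/%s: %w", ns, name, err)
	}
	if !v.Allowed {
		return fmt.Errorf("delete %s/%s blocked by defender: %s", ns, name, v.Reason)
	}
	return deletePod(ns, name)
}
```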

3) Digital capacity governance

When there were only a few core clusters, managing capacity by expert experience was easy. But with the rapid growth of the container business, now covering trading, middleware, new ecosystems, new computing, and the public selling regions, ASI grew to hundreds of clusters in just a few years. Will it be thousands or tens of thousands in a few more? So many clusters can no longer be managed by the traditional manual approach, and the labor cost keeps rising. In particular, it becomes easy to run into problems such as low resource utilization and serious waste of machine resources, which ultimately lead to online risks caused by insufficient capacity in some clusters.

  • Components change constantly, and so do business types and pressure. The real capacity of an online service (how many QPS it can carry) is unknown to everyone. When traffic needs to grow, does the service need to be scaled up? Will horizontal scaling actually solve the problem?
  • Early container resource requests were made arbitrarily, leading to serious waste of resource cost. We therefore need to determine, on the premise of minimizing container cost, how much resource (CPU, memory, disk) each component should reasonably request. Within the same region and the same meta-cluster, resources wasted in one cluster translate into resource shortages in other clusters.

Component change is the norm in ASI, and how component capacity adapts to this change is a major challenge. Daily operations and diagnosis also require accurate capacity data to support capacity reservation.

We therefore decided to use data to guide components toward reasonable (low-cost yet safe) container resource requests, and to use data to provide the capacity information needed in daily O&M, so that capacity can be reserved in advance and emergency scale-out can be performed when production water levels become abnormal.

So far we have completed water level monitoring, full risk broadcasting, pre-scheduling, and scheduled capture of profiling performance data, and have driven CPU, memory, and CPU-to-memory-ratio optimizations through the component specifications. We are now working on automated specification recommendations, node resource replenishment recommendations, and automated node onboarding, and are closing a "one-click capacity reservation" loop with ChatOps. In addition, baselines for each component are derived from the data of the full-link stress testing service, and release checkpoints are gated by risk decisions to ensure that components go online safely. In the future, this will be combined with real online changes to continuously verify SLO performance in the real environment and predict capacity accurately.

Global high availability emergency response capability building

The high availability infrastructure gives ASI strong protection against risks, helping ensure service availability when various risks materialize. But how to intervene quickly and eliminate hidden dangers once a risk appears, and how to stop losses in an orderly way after a failure that the high availability capabilities cannot cover, is an engineering problem of great technical depth and horizontal complexity. This makes emergency response capability a very important investment direction for ASI.

When we began building the emergency response system, the rapid evolution of our platform and the steady stream of incidents and near-misses clearly exposed several serious problems we faced at the time:

  • Why do customers always find problems before we do?
  • Why is it taking so long to recover?
  • Why do the same problems keep cropping up?
  • Why are only a few people able to deal with problems online?

We brainstormed and discussed these problems thoroughly and summarized the following core causes:

  • Only one means of discovering problems: metrics were the only basic channel for exposing issues.
  • Insufficient ability to locate problems: there were only a handful of monitoring dashboards, and the observability of core components was built to very uneven degrees.
  • Recovery was not systematized: fixing online problems meant typing commands and writing scripts on the fly, which is inefficient and risky.
  • Emergency response lacked a system and norms: there was no linkage with the business side, engineers thought too much like engineers, stopping the loss was not treated as the first goal, and awareness of problem severity was lacking.
  • Long-standing problems lacked follow-up: hidden dangers found online, and follow-up items from incident reviews, were not tracked continuously, so we kept stepping into the same pits.
  • No capability preservation mechanism: the business changes so fast that, after a while, some capabilities fall into the awkward state of "cannot be used, dare not be used, and cannot be guaranteed to work".

1. Top-level design of emergency capacity building

For these pressing problems, we also produced a top-level design for emergency response capability; the architecture is shown below:

The overall emergency response capability building can be divided into several parts:

  • 1-5-10 emergency response system: the baseline capabilities and mechanisms needed to achieve "1-minute detection, 5-minute location, 10-minute recovery" for any unexpected risk online.
  • Problem tracking and follow-up: the ability to continuously track and push forward every potential risk found online, whether serious or not.
  • Capability preservation mechanism: because the 1-5-10 capabilities are used infrequently, a dedicated mechanism is needed to keep them fresh and ready to use.

2. Sub-module construction of the emergency response capability

For each sub-module in the top-level design, we have completed some staged work and results, described below.

1) One-minute detection: problem detection ability

To solve the problem of customers finding issues before we do, the most important goal of our work is this: no problem should be able to hide; the system should discover them proactively.

This is a protracted battle: we cover one new class of problems after another by every possible means, taking ground city by city.

Driven by this goal, we developed a very effective piece of "strategic thinking", namely **"1+1 thinking"**. Its core point is that any single means of discovering problems can occasionally fail because of external dependencies or its own stability defects, so there must always be a second, mutually redundant channel for fault tolerance.

Guided by this idea, our team built two core capabilities, dual black-box and white-box alarm channels, each with its own characteristics:

  • Black-box channel: treat ASI as a whole as a black box from the customer's perspective and directly issue commands that exercise forward functionality; for example, scale out a StatefulSet directly.
  • White-box channel: use abnormal fluctuations in the observable data exposed by every dimension inside the system to uncover potential problems; for example, an abnormal increase in apiserver memory.

The concrete product behind the black-box channel is KubeProbe, a new product our team built by heavily optimizing and reworking the ideas of the community Kuberhealthy project; it has become an important tool for judging whether a cluster carries serious risk. A minimal sketch of such a probe follows.
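
To make the black-box idea concrete, here is a minimal probe in the KubeProbe spirit (not KubeProbe itself): it scales a canary StatefulSet up by one replica from the "customer" side and checks that the new replica becomes ready within a deadline, which exercises the whole forward path from apiserver through controllers to kubelet. The namespace and object names are illustrative.

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	sts, err := client.AppsV1().StatefulSets("kubeprobe-canary").Get(ctx, "probe-sts", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Scale the canary up by one replica.
	want := *sts.Spec.Replicas + 1
	sts.Spec.Replicas = &want
	if _, err := client.AppsV1().StatefulSets("kubeprobe-canary").Update(ctx, sts, metav1.UpdateOptions{}); err != nil {
		panic(err) // the scale request itself failing already counts as a probe failure
	}

	// Poll until the extra replica is ready, or report the probe as failed.
	err = wait.PollImmediate(5*time.Second, 3*time.Minute, func() (bool, error) {
		cur, err := client.AppsV1().StatefulSets("kubeprobe-canary").Get(ctx, "probe-sts", metav1.GetOptions{})
		if err != nil {
			return false, nil // treat transient errors as "keep polling"
		}
		return cur.Status.ReadyReplicas >= want, nil
	})
	if err != nil {
		fmt.Println("probe FAILED: statefulset did not scale in time")
		return
	}
	fmt.Println("probe OK: forward path (apiserver -> controller -> kubelet) is healthy")
}
```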

Building the white-box channel is more complex, because it can only show its power on top of complete observable data. We therefore first built three data channels on SLS, covering metrics, logs, and events, to bring all observable data under unified management in SLS. We also built an alarm center responsible for managing and delivering the alarm rules of hundreds of clusters in batches. The result is a white-box alarm system with complete data and broad problem coverage. Recently, we further migrated our alarm capability to SLS Alarm 2.0 for richer alarm functionality.

2) Five-minute location: automatic location of the root cause

As our experience with online troubleshooting grew, we found that many problems recur relatively frequently, and their diagnosis and recovery methods have become fairly fixed. Even when a problem can have multiple underlying causes, rich investigation experience lets us gradually iterate a troubleshooting roadmap for it. The figure below shows the troubleshooting route for an "etcd cluster unhealthy" alarm.

If these relatively settled troubleshooting experiences are solidified into the system, decisions can be triggered automatically when problems occur, which greatly reduces the time needed to handle online problems. We have therefore started some capability building in this area.

On the black-box side, KubeProbe has a self-contained, closed-loop root cause location system that encodes expert troubleshooting experience into the system for fast, automatic problem location. A generic root cause analysis tree, plus a machine learning classifier for probe failure events and logs (still under development), locates the root cause of each failed KubeProbe case; a unified severity evaluation inside KubeProbe (whose rules are still fairly simple) then scores each alarm and decides how it is handled, for example whether it is self-healed or escalated to a phone alarm.

On the white-box side, we used the orchestration capability of an underlying pipeline engine, together with the multidimensional data platform we had already built, to implement a generic root cause diagnosis center. The process of chasing a root cause through various observable data is solidified into the system as YAML-orchestrated diagnosis tasks; when a task is triggered, it produces a diagnostic conclusion for the problem, and each conclusion is bound to a corresponding recovery action, such as invoking a contingency plan or self-healing.

Both channels use DingTalk bots and similar means to achieve a ChatOps-like experience, which speeds up how quickly on-call staff handle problems.

3) Ten-minute recovery: stop-loss and recovery capability

To recover and stop losses faster when failures occur, we also prioritized building stop-loss capabilities, following two core principles:

  • Stop-loss capabilities must be systematized, made available through white-screen (GUI) tooling, and accumulated over time.
  • Everything aims at stopping the loss first, not at finding the absolute root cause.

Driven by these two principles, we did two things:

  • Build a contingency plan center: consolidate all stop-loss capabilities into one system with white-screen management, onboarding, and execution. On the one hand, this unifies the plans previously scattered across individual engineers and documents and puts them under centralized control; on the other hand, the plan center lets users register plans through YAML orchestration, making onboarding cheap.
  • Build a common stop-loss capability set: based on historical experience and ASI's particular characteristics, build a set of common stop-loss capabilities as key levers during emergencies, including in-place component restart, rapid component scale-out, rapid controller/webhook degradation, and quickly switching a cluster to read-only (a sketch of the webhook degradation case follows this list).
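
As one hedged example of the "controller/webhook rapid degradation" lever, the sketch below flips a validating webhook's failurePolicy to Ignore so that the apiserver stops blocking requests while the webhook backend is being repaired; the webhook configuration name is illustrative, and a real contingency plan would pair this with scaling or fixing the backend.

```go
package main

import (
	"context"

	admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// degradeWebhook sets every webhook in the named ValidatingWebhookConfiguration
// to failurePolicy=Ignore, so a broken webhook backend no longer rejects or
// delays apiserver requests.
func degradeWebhook(ctx context.Context, client kubernetes.Interface, name string) error {
	cfg, err := client.AdmissionregistrationV1().ValidatingWebhookConfigurations().Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	ignore := admissionregistrationv1.Ignore
	for i := range cfg.Webhooks {
		cfg.Webhooks[i].FailurePolicy = &ignore
	}
	_, err = client.AdmissionregistrationV1().ValidatingWebhookConfigurations().Update(ctx, cfg, metav1.UpdateOptions{})
	return err
}

func main() {
	restCfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(restCfg)
	if err := degradeWebhook(context.Background(), client, "example-webhook"); err != nil {
		panic(err)
	}
}
```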

4) Problem tracking mechanism BugFix SLO

To solve the lack of follow-up, we proposed the BugFix SLO mechanism. As the name suggests, we treat every problem we find as a "bug" that must be fixed, and we do the following for each one:

  • On the one hand, we define a set of classification methods to ensure every problem is routed to the right team and to a specific owner.
  • On the other hand, we define a fix priority, that is, the SLO for resolving the problem, from L1 to L4. Each priority carries its own resolution standard; L1 means the problem must be followed up quickly and resolved within the day.

Every two weeks we produce a stability report based on the problems collected during that period, publicizing how well problems are being resolved and highlighting the key issues. We also hold a full alignment meeting every two weeks to confirm the owner and priority of each new problem.

5) Capability acceptance and preservation mechanism

Since stability risks occur relatively infrequently, the best way to keep stability capabilities fresh is to drill. On this basis, we designed or took part in two drill programs:

  • Regular fault drill mechanism
  • Production surprise drill mechanism

[Regular drill mechanism]

The core purpose of the regular fault drill mechanism is to exercise ASI failure scenarios and the corresponding recovery capabilities as frequently as possible, so as to uncover component stability defects and verify the effectiveness of the various recovery plans.

To raise the drill frequency as much as possible, we did two things:

  • On the one hand, we started building our own fault scenario library, registering, classifying, and managing all scenarios to ensure comprehensive coverage.
  • On the other hand, we cooperated with the quality assurance team to make full use of the fault injection capabilities of the Chorus platform, implementing our designed scenarios one by one and configuring them to run continuously in the background. We also used the platform's flexible plugin capabilities to connect it to our alarm system and plan system via API, so that once a fault scenario is triggered, its injection, inspection, and recovery can all be completed automatically in the background.

Given the high drill frequency, we usually run the continuous background drills in a dedicated cluster to reduce the stability risk the drills themselves introduce.

[Production surprise drill mechanism]

No matter how often we drill, we cannot guarantee that the same problem in a production cluster could be handled in the same way, nor can we really know whether a failure's impact would stay within the scope we expect. The root cause is that the clusters used in regular fault drills are test clusters carrying no production traffic.

Simulating faults in the production environment therefore reflects the real situation online more faithfully and increases our confidence in the correctness of our recovery methods. In practice, we actively join the quarterly production surprise drills organized by the cloud native team, using them to re-validate some of our more complex or important drill scenarios in a production environment, while also having our detection speed, response speed, and contingency plans assessed. This not only surfaced new problems but also gave us a lot of input on how to design test-cluster scenarios that match online reality more closely.

Closing thoughts

This article only introduces, at a high level, some of the exploration and thinking behind building ASI's global high availability system. In follow-up articles, the team will go deep into specific areas, such as building ASI's emergency response system and its prevention system, covering fault diagnosis and recovery, building and operating full-link fine-grained SLOs, and breaking through the performance bottlenecks of single-cluster scale. Please stay tuned.

As a leading cloud native implementation, ASI's high availability and stability affect, and even determine, the development of Alibaba Group's business and its cloud products. The ASI SRE team is recruiting on an ongoing basis; the technical challenges and opportunities are both here. Interested candidates are welcome to contact: [email protected], [email protected].
