On July 27, 2021, the Trusted Cloud Conference was held in Beijing. At the conference, the Alibaba Cloud fault drill platform was selected as a Trusted Cloud best technology practice and was among the first batch of platforms to be certified at the advanced level, the highest level of the Trusted Cloud chaos engineering platform capability requirements. At the same conference, the establishment of the “Chaos Engineering Laboratory” was announced; it is led by the China Academy of Information and Communications Technology (CAICT) and was jointly launched by Alibaba Cloud Computing Co., Ltd. and a number of other enterprises.

Double certification: Alibaba Cloud fault drill platform earns the highest-level Trusted Cloud certification

As enterprises deepen their understanding and practice of cloud computing, distributed architecture based on cloud computing has become the preferred way for more and more enterprises to build applications. How to use chaos engineering to improve the stability of cloud-native systems and ensure business continuity has become a topic of general concern in the industry.

Chaos engineering mainly uses fault injection to expose system stability problems and other issues in advance, so as to improve the resilience of the system and the organization, build a resilient architecture, and ensure business continuity. In the Trusted Cloud chaos engineering platform evaluation conducted by the China Academy of Information and Communications Technology (CAICT), the Alibaba Cloud fault drill platform passed all eight capability evaluations with the highest score: resource support, failure scenarios, scenario management, experiment process, experiment protection, experiment measurement, permission management and security audit. It was also selected as a Trusted Cloud best technology practice for 2021, earning a double certification that once again demonstrates Alibaba Cloud’s technology and product strength in chaos engineering.

As Alibaba’s system architecture evolved from microservices to containerization and then to cloud native, the company has accumulated nearly 10 years of practical experience in chaos engineering. The Alibaba Cloud fault drill platform exports that internal experience as a product. It provides a rich library of experiment scenarios, an expert experience base and domain solutions to meet users’ needs for fault scenarios; it integrates flexible process choreography and open capabilities; and it provides monitoring, reporting and other functions to close the loop of chaos engineering implementation. Permission management and experiment protection are used to control the risk of fault drills, helping enterprises improve system stability and business continuity during cloud migration, cloud readiness, and the move to cloud native.

Since chaos engineering theory was first proposed, many enterprises have explored and practiced it, but their implementations differ. What sets the Alibaba Cloud fault drill platform apart?

  • Flexible process choreography: A standardized drill process is provided, to which the desired process nodes can be added. Multiple scenarios can be run at the same time.
  • Visualized fault drill: Integrated with architecture awareness, the platform supports fault injection directly from a visualized architecture topology. It can also work with architecture inspection to discover system risk points and then verify them with fault drills.
  • Diverse expert experience base: Alibaba’s years of internal fault drill experience have been distilled into drill templates. The templates reflect real and practical drill scenarios, greatly improve the efficiency of creating drills, and make it easier for users to get started with chaos engineering.
  • Domain-based solutions: The platform provides product solutions for verifying the stability of service components and system architecture. It dynamically identifies components and architecture through architecture awareness and dependency analysis, and automatically generates drill plans so that drills are fast, accurate and complete.
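
The idea of process choreography described above can be illustrated with a small sketch. This is not the platform's actual API; the class `DrillProcess`, its `add_node` and `run` methods, and the example steps are all hypothetical names chosen for illustration. The sketch assumes a drill is an ordered list of nodes that share a context, and that recovery must always run, even when a node fails.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Context = Dict[str, object]
Step = Callable[[Context], bool]  # a node returns True when it succeeds


@dataclass
class DrillProcess:
    """Hypothetical drill flow: ordered nodes, plus a recovery step that always runs."""

    recover: Step
    nodes: List[Tuple[str, Step]] = field(default_factory=list)

    def add_node(self, name: str, step: Step) -> "DrillProcess":
        """Append a process node; returns self so calls can be chained."""
        self.nodes.append((name, step))
        return self

    def run(self, ctx: Context) -> List[str]:
        """Run nodes in order, stop at the first failure, then always recover."""
        log: List[str] = []
        try:
            for name, step in self.nodes:
                ok = step(ctx)
                log.append(f"{name}:{'ok' if ok else 'failed'}")
                if not ok:
                    break
        finally:
            self.recover(ctx)
            log.append("recover:done")
        return log


# Illustrative steps only: a real drill would call an injection tool and a monitor.
def inject(ctx: Context) -> bool:
    ctx["fault_active"] = True
    return True


def check_alarm(ctx: Context) -> bool:
    # Stand-in for "did monitoring notice the fault?"
    return bool(ctx.get("fault_active"))


def recover(ctx: Context) -> bool:
    ctx["fault_active"] = False
    return True


process = DrillProcess(recover=recover)
process.add_node("inject_fault", inject).add_node("verify_alarm", check_alarm)
ctx: Context = {}
result = process.run(ctx)
```

Running recovery in a `finally` clause reflects the risk-control concern raised in the article: whatever happens during the drill, the injected fault should be removed at the end.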

Using the fault drill platform for chaos engineering makes it possible to measure the fault tolerance of microservices and of the system as a whole, and to estimate the system’s fault tolerance threshold. The platform can also verify whether container configuration is reasonable, test whether the PaaS layer is robust, and verify and improve the accuracy and timeliness of monitoring alarms. Surprise fault drills, in which faults are randomly injected into the system, examine how well the relevant personnel respond to problems and whether the problem reporting and handling process is reasonable, training people’s ability to locate and solve problems through real practice. Through fault injection, problems such as system instability can be detected in advance to improve the resilience of the system and the organization, build a resilient architecture, and ensure business continuity.
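
As a rough illustration of how fault injection can measure fault tolerance, the sketch below wraps a stand-in service call so that a configurable fraction of calls fail, then compares the success rate with and without retries. It is a toy model written for this article, not ChaosBlade or the fault drill platform; every name in it (`with_fault_injection`, `call_with_retry`, `success_rate`) is hypothetical.

```python
import random
from typing import Callable


class InjectedFault(Exception):
    """Raised by the wrapper to simulate a dependency failure."""


def with_fault_injection(func: Callable[[], str], failure_rate: float,
                         rng: random.Random) -> Callable[[], str]:
    """Wrap func so that roughly `failure_rate` of calls fail before reaching it."""
    def wrapped() -> str:
        if rng.random() < failure_rate:
            raise InjectedFault("simulated dependency failure")
        return func()
    return wrapped


def call_with_retry(func: Callable[[], str], attempts: int) -> bool:
    """Return True if any of `attempts` calls succeeds."""
    for _ in range(attempts):
        try:
            func()
            return True
        except InjectedFault:
            continue
    return False


def success_rate(failure_rate: float, attempts: int, requests: int = 2000,
                 seed: int = 42) -> float:
    """Fraction of requests that succeed under the injected failure rate."""
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    faulty = with_fault_injection(lambda: "ok", failure_rate, rng)
    ok = sum(call_with_retry(faulty, attempts) for _ in range(requests))
    return ok / requests


if __name__ == "__main__":
    # Compare a caller without retries against one that retries up to 3 times.
    for rate in (0.1, 0.3, 0.5):
        print(rate, success_rate(rate, attempts=1), success_rate(rate, attempts=3))
```

Raising the injected failure rate until the success rate drops below an acceptable level gives a simple estimate of the fault tolerance threshold for this toy caller.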

The Alibaba Cloud fault drill platform has been commercially available since 2019. With multiple experiment tools, automated tool deployment, multi-dimensional drill methods, flexible process choreography, rich fault scenarios, practical drill templates, professional solutions, safety protection for drills, and deep integration with cloud products, it has attracted nearly a thousand enterprise customers. It has served customers such as Huatai Securities, Bixin Technology and Baby, helping enterprises build digital resilience in the cloud-native era.

Promoting unified standards: building the ChaosBlade open source project to shorten the path to chaos engineering

In recent years, more and more enterprises have begun to pay attention to and explore chaos engineering, which has gradually become an indispensable tool for testing system high availability and building confidence in systems. However, the field is still evolving rapidly, and there is no uniform standard for best practices or tool frameworks. Implementing chaos engineering carries some potential business risks, and a lack of experience and tools further discourages DevOps personnel from adopting it. There are many excellent open source tools in the field, each covering a particular area, but the ways of using them vary widely. Some are difficult to use, have a high learning cost, or offer only a single kind of chaos experiment, which deters many people from the field of chaos engineering.

Alibaba Group has worked in the field of chaos engineering for many years. To help enterprises find a better path to chaos engineering, Alibaba open-sourced the chaos engineering project ChaosBlade in 2019, and this year it became a CNCF Sandbox project. By bringing “self-developed technology”, “open source project” and “commercial products” together into a unified technology system, Alibaba Cloud maximizes technical value through the positive cycle of this trinity.

ChaosBlade is an open source tool that follows the principles of chaos engineering. It consists of the chaos engineering experiment tool ChaosBlade and the chaos engineering platform ChaosBlade-Box, and aims to help enterprises solve high availability problems in the move to cloud native through chaos engineering. The experiment tool ChaosBlade supports three system platforms and four programming languages, and covers more than 200 experiment scenarios and more than 3,000 experiment parameters, allowing fine-grained control over the scope of an experiment. ChaosBlade has become the foundational capability of the Alibaba Cloud fault drill platform, serving many enterprise customers.

In the future, ChaosBlade will continue to provide a cloud-native chaos engineering platform and chaos engineering experiment tools for multi-cluster, multi-environment and multi-language use. It will also host more chaos experiment tools and support mainstream platforms, implement scenario recommendation, integrate business and system monitoring, and output experiment reports, completing the closed loop of chaos engineering operations on a foundation of ease of use.

The industry’s first chaos engineering laboratory is officially established to promote the implementation of chaos engineering practice

Against the background of the digital industry’s requirements for system stability and high availability in cloud computing, the Chaos Engineering Laboratory, led by the China Academy of Information and Communications Technology (CAICT) with the participation of Alibaba Cloud and many other enterprises, was formally established. The laboratory will promote the practice and implementation of chaos engineering in typical application scenarios across fields, and will connect upstream and downstream cloud computing enterprises to jointly promote the rapid development of chaos engineering.

Alibaba Cloud has the richest chaos engineering practice experience in China and is committed to building a chaos engineering standard system for the cloud-native era. Through its massive Internet services and successive years of Double 11 events, Alibaba Cloud has developed core technologies including full-link stress testing, online traffic control and fault drills, and exports them through open source and cloud services, helping enterprise users and developers enjoy the technology dividend, improve development efficiency and shorten the business construction process.