The best way to reduce failures is to make them happen more often. By repeatedly triggering failures in a controlled environment, you can continuously improve a system’s fault tolerance and resilience.

So how many steps does it take to run a chaos engineering experiment efficiently?

Answer: two.

(1) Go to the ChaosBlade project page

(2) Download the release version and build your own fault drill tool

A high-availability architecture is the core of service stability.

Through its massive Internet services and years of Double 11 practice, Alibaba has accumulated core technologies such as full-link stress testing, online traffic control, and fault drills. It shares them as open source projects and cloud services so that enterprise users and developers can benefit from Alibaba’s technology, improve development efficiency, and shorten the time needed to build their businesses.

For example, Alibaba Cloud Performance Testing Service (PTS) can be used to build a full-link stress testing system efficiently, and the open source component Sentinel provides traffic limiting and degradation. This time, after six years of refinement and tens of thousands of online drills, we have condensed Alibaba’s ideas and practice in the field of fault drills into a chaos engineering tool named ChaosBlade.

To access the project and experience the Demo, click here.

What is ChaosBlade

ChaosBlade is a chaos engineering tool that follows the principles of chaos engineering experiments and provides a rich set of fault scenarios to help distributed systems improve their fault tolerance and recoverability. It can inject faults at the underlying layers, and it is simple to operate, non-invasive, and highly extensible.

ChaosBlade is open source under the Apache License 2.0. Two repositories are currently available: chaosblade and chaosblade-exec-jvm.

The chaosblade repository contains the CLI and the chaos experiment modules for basic resources and containers, all implemented in Go. chaosblade-exec-jvm is an executor that performs chaos experiments on applications running on the JVM.

The ChaosBlade community will continue to add chaos experiment executors for C++, Node.js, and other languages.


Why Open Source

Many companies have begun to pay attention to and explore chaos engineering, which has gradually become an indispensable means of testing high availability and building confidence in a system. However, the field is still evolving rapidly, and there is no unified standard for best practices or tool frameworks. Implementing chaos engineering can carry business risks, and a lack of experience and tools further discourages DevOps teams from adopting it.

There are already many excellent open source tools in the field of chaos engineering, each covering a particular area, but they differ greatly in how they are used. Some are hard to get started with, costly to learn, or limited to a single kind of chaos experiment, which has driven many people away from chaos engineering.

Alibaba Group has practiced chaos engineering for many years and has open sourced its chaos experiment tool, ChaosBlade, in order to:

  • Help more people understand and join the field of chaos engineering;
  • Shorten the path to adopting chaos engineering;
  • Draw on the strength of the community to add more chaos experiment scenarios and jointly advance the field of chaos engineering.

What problems can ChaosBlade solve

Measure the fault tolerance of microservices. Simulate call delays, service unavailability, and fully loaded machine resources, then check whether the faulty nodes or instances are automatically isolated and taken offline, whether traffic scheduling is correct, whether contingency plans are effective, and whether the overall QPS or RT of the system is affected. On this basis, gradually widen the range of faulty nodes to verify that upstream traffic limiting, degradation, and circuit breaking take effect. Finally, keep adding faulty nodes until service requests time out; this gives an estimate of the system’s fault tolerance red line and a measure of its fault tolerance capacity.
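
The stepwise procedure above can be sketched as a small loop. This is a minimal illustration and not part of ChaosBlade: `probe_success_rate` is a hypothetical callback that would inject faults into the given number of nodes, measure the request success rate, and then recover the nodes, and the 0.99 threshold is an assumed value.

```python
from typing import Callable


def estimate_red_line(
    total_nodes: int,
    probe_success_rate: Callable[[int], float],
    min_success_rate: float = 0.99,
) -> int:
    """Return the largest number of faulty nodes the system tolerates.

    probe_success_rate(n) is a hypothetical hook: it should inject a fault
    into n nodes, measure the request success rate, then recover the nodes.
    """
    tolerated = 0
    for faulty in range(1, total_nodes + 1):
        if probe_success_rate(faulty) < min_success_rate:
            break  # requests start failing: the red line has been crossed
        tolerated = faulty
    return tolerated


# Stand-in probe for illustration only: pretend the service stays healthy
# until more than 3 of its 10 nodes are faulty.
def fake_probe(faulty_nodes: int) -> float:
    return 1.0 if faulty_nodes <= 3 else 0.80


red_line = estimate_red_line(total_nodes=10, probe_success_rate=fake_probe)
print(red_line)
```

In a real drill the probe would be the slow and risky part, so the range of faulty nodes is widened one step at a time and the loop stops as soon as the threshold is crossed.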

Verify that the container orchestration configuration is reasonable. Simulate killing service Pods, killing nodes, and increasing Pod resource load, then observe service availability to verify that the replica configuration, the resource limits, and the containers deployed in each Pod are reasonable.

Test the robustness of the PaaS layer. Simulate load on upper-layer resources to verify the effectiveness of the scheduling system; simulate the unavailability of a dependent distributed storage service to verify the system’s fault tolerance; simulate an unavailable scheduling node to test whether scheduled tasks are automatically migrated to available nodes; and simulate faults on the active and standby nodes to test whether the active/standby switchover works.

Verify the timeliness of monitoring alarms. Inject faults into the system to verify whether monitoring metrics are accurate, monitoring dimensions are complete, alarm thresholds are reasonable, alarms fire quickly, alarm recipients are correct, and notification channels work, thereby improving the accuracy and timeliness of monitoring alarms.
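
One way to make the timeliness check concrete is to compare the fault injection time with the time the first alarm fires. The sketch below is illustrative only: the function name, the sample timestamps, and the 60-second target are assumptions, not something prescribed by ChaosBlade.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


def alarm_latency(
    injected_at: datetime, alarm_times: Iterable[datetime]
) -> Optional[timedelta]:
    """Time between fault injection and the first alarm raised after it.

    Returns None when no alarm fired after the injection.
    """
    after = [t for t in alarm_times if t >= injected_at]
    return min(after) - injected_at if after else None


injected = datetime(2019, 3, 1, 10, 0, 0)
alarms = [
    datetime(2019, 3, 1, 9, 58, 0),   # unrelated alarm before the experiment
    datetime(2019, 3, 1, 10, 0, 45),  # first alarm caused by the fault
    datetime(2019, 3, 1, 10, 2, 0),
]

latency = alarm_latency(injected, alarms)
target = timedelta(seconds=60)  # assumed alerting objective
print(latency is not None and latency <= target)
```

A result of `None` is as informative as a slow alarm: it means the fault went unnoticed by monitoring altogether.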

Exercise the ability to locate and resolve problems in an emergency. Through surprise fault drills that inject faults into the system at random, assess how well the people involved handle problems under pressure and whether the problem reporting and handling process is reasonable, so that the team builds its troubleshooting skills through real practice.

Functions and Features

Rich set of scenarios

ChaosBlade’s chaos experiment scenarios cover basic resources, such as full CPU load, high disk I/O, and network latency, as well as applications running on the JVM, such as Dubbo call timeouts and call exceptions, and delaying a specified method, making it throw an exception, or making it return a specific value. They also include container-related experiments, such as killing containers and killing Pods. We will continue to add scenarios.

Simple to use and easy to understand

ChaosBlade is used through a CLI that provides friendly command hints, which makes it easy to get started. Commands follow the fault injection model that Alibaba Group abstracted from many years of fault testing and drill practice; they have a clear hierarchy and are easy to read and understand, which lowers the barrier to adopting chaos engineering.
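
To illustrate the hierarchy, the sketch below assembles a command from a target, an action, and a set of flags. It assumes the `blade create <target> <action> [flags]` form used by the ChaosBlade CLI, which the article itself does not spell out; the `Experiment` class and the `build_command` helper are hypothetical names introduced here, and the interface name `eth0` is a placeholder.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Experiment:
    """Hypothetical container for one fault injection experiment."""
    target: str  # what to attack, e.g. "cpu" or "network"
    action: str  # what to do to it, e.g. "fullload" or "delay"
    flags: Dict[str, str] = field(default_factory=dict)


def build_command(exp: Experiment) -> List[str]:
    """Render an experiment as an argument list for the blade CLI."""
    cmd = ["blade", "create", exp.target, exp.action]
    for name, value in exp.flags.items():
        cmd += [f"--{name}", value]
    return cmd


cpu = Experiment("cpu", "fullload")
delay = Experiment("network", "delay", {"time": "3000", "interface": "eth0"})

print(" ".join(build_command(cpu)))
print(" ".join(build_command(delay)))
```

On a host where `blade` is installed, the resulting list could be passed to `subprocess.run`. An experiment is reverted afterwards with `blade destroy` and the UID that was returned when the experiment was created.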

Easy to extend with new scenarios

All ChaosBlade executors follow the fault injection model described above, which keeps the experiment scenario model uniform and makes it easy to develop and maintain. Because the model itself is easy to understand and cheap to learn, new chaos experiment scenarios can be added quickly and easily.
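
The extensibility argument can be illustrated with a small registry keyed by target and action. This is a conceptual sketch in Python, not ChaosBlade’s actual executor interface; every name below is hypothetical, and the handlers only describe what they would do instead of injecting a real fault.

```python
from typing import Callable, Dict, Tuple

# Maps (target, action) to the function that performs the injection.
Handler = Callable[[Dict[str, str]], str]
REGISTRY: Dict[Tuple[str, str], Handler] = {}


def scenario(target: str, action: str) -> Callable[[Handler], Handler]:
    """Decorator that registers a handler for one target/action pair."""
    def register(handler: Handler) -> Handler:
        REGISTRY[(target, action)] = handler
        return handler
    return register


@scenario("cpu", "fullload")
def cpu_fullload(flags: Dict[str, str]) -> str:
    return "would saturate all CPU cores"


# Adding a new scenario needs only one more registered function.
@scenario("network", "delay")
def network_delay(flags: Dict[str, str]) -> str:
    return f"would delay packets by {flags['time']} ms"


def run(target: str, action: str, flags: Dict[str, str]) -> str:
    """Dispatch an experiment to its handler; fail clearly if unsupported."""
    try:
        handler = REGISTRY[(target, action)]
    except KeyError:
        raise ValueError(f"unsupported scenario: {target} {action}") from None
    return handler(flags)


print(run("network", "delay", {"time": "3000"}))
```

Because every scenario is addressed the same way, the dispatcher never changes when a scenario is added, which is the property a uniform model is meant to provide.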


The evolution of ChaosBlade

EOS (2012-2015):

The early version of the fault drill platform. It implemented fault injection through bytecode enhancement, simulated common RPC failures, and addressed the governance of strong and weak dependencies among microservices.

MonkeyKing (2016-2018):

The upgraded version of the fault drill platform. It enriched the fault scenarios (for example, with resource and container-layer scenarios) and carried out large-scale drills in the production environment.

AHAS (2018.9-present):

Alibaba Cloud Application High Availability Service. It has all the functions of the drill platform built in, supports drill orchestration and drill plug-in extensions, and integrates architecture awareness and traffic limiting and degradation.

ChaosBlade (2019.3):

The tool that implements fault injection for the MonkeyKing platform. It defines a fault model by abstracting the drill platform’s fault injection capabilities, and it is open sourced together with a user-friendly CLI to help cloud-native users practice chaos engineering.


Near-term plans

Functional iterations:

  • Enhance JVM drill scenarios to support more mainstream Java frameworks, such as Redis and gRPC
  • Enhance Kubernetes drill scenarios
  • Add support for applications written in C++, Node.js, and other languages


This article is original content from the Yunqi Community and may not be reproduced without permission.