ChaosBlade is a chaos engineering project that Alibaba open-sourced in 2019 and that has since joined the CNCF Sandbox. It started as chaosblade, a chaos engineering experiment tool for multiple environments and languages, and has since grown to include ChaosBlade-Box, a chaos engineering platform for multiple clusters, environments and languages. The platform hosts experiment tools and deploys them automatically, so that users can focus on using chaos engineering to solve high-availability problems during cloud native adoption. This article introduces ChaosBlade in three stages: the abstraction of the chaos experiment model, the open-sourcing of the chaos experiment tool, and the upgrade to a chaos engineering platform.

Authors | Xiao Changjun (Qiong Gu), Sanjay

In this year’s Trusted Cloud evaluation, the Alibaba Cloud fault drill platform passed the advanced-level certification, the highest level defined for Trusted Cloud chaos engineering platforms, with the highest score.

Chaos experiment model

The ChaosBlade project covers chaos experiment scenarios for basic resources, application services and container services. From the start of the tool’s design, a unified scenario model was planned, both to make scenarios easy to extend and accumulate and to give a platform that hosts experiment tools a common basis for invoking scenarios uniformly. All experiment scenarios in the ChaosBlade project follow this model. The sections below cover the model’s derivation, definition, significance and application.

1. Derivation of the experimental model

Chaos experiments mainly consist of fault simulation. Faults are typically described like this:

  • The service on 10.0.0.1 is unavailable because disk A mounted on the machine is full;
  • Dubbo service B executes slowly on all nodes, which delays calls from the upstream Dubbo service A and, as a result, slows down user access;
  • All CPU cores on Node B in Kubernetes cluster A are fully used, causing Pod scheduling exceptions in cluster A;
  • In Kubernetes cluster C, the network of Pod D is abnormal, so access to services related to D is abnormal.

From these examples, a fault can be described with one sentence pattern: something happened to some component on some machine (or cluster resource, such as a Node or Pod), which caused some impact. The following figure breaks the fault description down into its parts:

These four parts are enough to describe existing fault scenarios, so we abstracted them into a fault scenario model, also known as the chaos experiment model.

2. Introduction of the experimental model

This experimental model is described in detail as follows:

  • Scope: the scope of the experiment, that is, the machines, clusters and their resources on which the experiment is carried out.
  • Target: the target of the experiment, that is, the component on which the experiment takes place. Examples are CPU, network and disk in the basic resource scenario; application components such as Dubbo, Redis, RocketMQ and the JVM in the Java scenario; and Node, Pod and Container in the container scenario.
  • Matcher: the matching rules of the experiment, defined according to the configured Target. More than one Matcher can be configured, because each Target may have its own matching conditions. For example, Dubbo and gRPC in the RPC domain can be matched by the service a provider exposes or the service a consumer calls, and Redis in the cache domain can be matched by set and get operations. Matchers can also be extended to carry the scenario execution strategy and control when the experiment is triggered.
  • Action: the fault scenario to simulate, which varies with the Target. For a disk, for example, the disk can be full, disk I/O can be high, or the disk hardware can be faulty. For an application, scenarios such as delay, exception, returning a specified value (error codes, large objects and so on), parameter tampering and repeated invocation can be abstracted. For container services, Node, Pod, Container or basic resource exceptions can be simulated.

Using this model, the questions that must be answered when running a chaos experiment can be stated clearly:

  • What is the scope of the chaos experiment?
  • What is the target of the chaos experiment?
  • Under what conditions does the target trigger the experiment?
  • Which experiment scenario will be carried out?
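
The four parts of the model can be sketched as a small data structure. This is only an illustration of the idea, not the definition used in the ChaosBlade code base: the class names, the field layout, the "host:10.0.0.1" scope notation and the "path" matcher are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Matcher:
    """One matching rule, for example name="path", value="/mnt/a"."""
    name: str
    value: str


@dataclass
class Experiment:
    """A chaos experiment expressed as Scope / Target / Matcher / Action."""
    scope: str    # where the experiment runs: a host, node, pod, container
    target: str   # component under experiment: cpu, network, disk, dubbo
    action: str   # simulated fault: fullload, delay, loss, fill
    matchers: List[Matcher] = field(default_factory=list)

    def describe(self) -> str:
        """Render the experiment as one line, mirroring the sentence pattern."""
        rules = ", ".join(f"{m.name}={m.value}" for m in self.matchers) or "no rules"
        return f"[{self.scope}] {self.target} {self.action} ({rules})"


# "Disk A mounted on machine 10.0.0.1 is full", from the derivation section.
# The scope notation and the matcher name are hypothetical.
disk_full = Experiment(
    scope="host:10.0.0.1",
    target="disk",
    action="fill",
    matchers=[Matcher("path", "/mnt/a")],
)
print(disk_full.describe())
```

Each of the four questions above maps to one field: scope, target, matchers and action.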

3. Significance of the experimental model

This model has the following characteristics:

  • Concise: clear hierarchy, easy to understand;
  • Universal: covers all fault scenarios, including basic resources, application services, container services and cloud resources;
  • Easy to implement: a clear interface specification is easy to define, and extending experiment scenarios is simple;
  • Language and domain independent: the model can be implemented in multiple languages and extended to multiple domains.

This model brings the following benefits:

  • A more accurate description of chaos experiment scenarios;
  • A better understanding of chaos experiment injection;
  • Easier accumulation of existing experiment scenarios;
  • More scenarios can be discovered by following the model;
  • Chaos experiment tools become more standard and concise.

4. Application of the experimental model

The application of the chaos experiment model can be summarized as follows:

  • The model turns the variables of an experiment scenario into parameters and normalizes those parameters;
  • New scenario domains can be added horizontally by following the model;
  • Combined with the standards of a given domain, the model makes it easy to extend scenarios vertically within that domain;
  • Higher-level domain scenarios can reuse the scenarios defined by the model;
  • Scenario descriptions declared with the model plug directly into ChaosBlade;
  • Following the model makes it easy to build a chaos experiment platform on top.

The next part introduces ChaosBlade, a chaos engineering tool built on this model.

Chaos engineering experimental tool: ChaosBlade

Alibaba first introduced chaos engineering to solve the dependency problems of microservices, then applied it to steady-state verification of business and cloud services, and later extended it to business continuity assurance for public and proprietary clouds, accumulating rich scenarios and practical experience in verifying the stability of cloud native systems along the way. At that time, the open source chaos engineering tools had several problems: scenario capabilities were scattered across tools, the tools were hard to get started with, there was no standard experiment model, and scenarios were hard to extend and accumulate. These problems made it difficult to bring the tools together on one platform. Alibaba therefore open-sourced ChaosBlade, a tool for executing chaos engineering experiments. The following sections introduce its scenarios, usage, architecture design and a use case.

1. Chaos experiment scenarios

From the beginning, the ChaosBlade tool was designed for ease of use and easy scenario extension, so that people can get started quickly and add experiment scenarios as needed. It follows the chaos experiment model and provides a unified, simple execution tool. It supports system platforms such as Linux, Windows, Docker and Kubernetes, covers applications written in Java, Golang, NodeJS and C++, and as of v1.0.0-GA includes more than 200 experiment scenarios and more than 3,000 experiment parameters. The scenario domains currently included are as follows:

  • Basic resources: CPU, memory, network, disk, process, kernel and so on;
  • Application services: databases, caches, messaging, the JVM itself, microservices and so on. You can also target any class method to inject complex experiment scenarios, such as injecting a delay or tampering with variables and return values at any method or line of code;
  • Docker containers: killing a container, plus container CPU, memory, network, disk, process and other experiment scenarios;
  • Kubernetes platform: node scenarios for CPU, memory, network, disk and process; Pod scenarios for the Pod network and the Pod itself, such as killing a Pod; and container scenarios matching the Docker container scenarios above;
  • Cloud resources: scenarios such as Alibaba Cloud ECS downtime.

2. Tool usage

ChaosBlade is distributed as an archive that can be downloaded and unpacked; no installation is required. It is then invoked from the command line by running blade commands directly.

Take a network scenario as an example: adding the -h parameter to a command prints detailed help. Suppose I want calls to port 9520 to suffer network packet loss. The matcher here is the remote service port, 9520. After the command executes successfully, the experiment result is returned. Each experiment scenario is treated as an object, and the command returns the UID of that experiment object. This UID is used for later experiment management, such as destroying or querying the experiment. To destroy the experiment, and thus recover from the fault, simply run the blade destroy command.
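
The create, UID and destroy cycle can be sketched as follows. This is a minimal sketch under assumptions: the flag names (percent, interface, remote-port), the interface eth0 and the JSON shape of the response are assumptions that should be checked against the -h output of your blade version, and the sample UID is made up. The sketch only assembles the commands and does not execute them.

```python
import json
from typing import Dict, List


def build_create_command(target: str, action: str, flags: Dict[str, str]) -> List[str]:
    """Assemble argv for `blade create <target> <action> --flag value ...`."""
    argv = ["blade", "create", target, action]
    for name, value in flags.items():
        argv += [f"--{name}", str(value)]
    return argv


def extract_uid(response_text: str) -> str:
    """Return the experiment UID from a JSON response, or raise on failure."""
    response = json.loads(response_text)
    if not response.get("success"):
        raise RuntimeError(f"experiment failed: {response}")
    return response["result"]


# Packet loss on calls to remote port 9520 (flag names and values are assumptions).
create_cmd = build_create_command(
    "network", "loss",
    {"percent": "100", "interface": "eth0", "remote-port": "9520"},
)

# Hypothetical response; in practice it would come from running create_cmd,
# for example with subprocess.run(create_cmd, capture_output=True, text=True).
sample_response = '{"code": 200, "success": true, "result": "a1b2c3d4e5f60718"}'
uid = extract_uid(sample_response)

# The UID is all that is needed to destroy (recover from) the experiment.
destroy_cmd = ["blade", "destroy", uid]

print(" ".join(create_cmd))
print(" ".join(destroy_cmd))
```

Keeping the UID from the create step is the important part: every later management operation on the experiment is keyed by it.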

The other way to call ChaosBlade is over the Web: running the server command exposes an HTTP service. If you build your own chaos experiment platform on top, it can call ChaosBlade directly through HTTP requests.
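
A platform calling the HTTP service mainly has to encode the blade sub-command into a request URL. The sketch below assumes an endpoint of the form /chaosblade?cmd=..., a server listening on port 9526 and a "create cpu fullload" sub-command; all three are assumptions to verify against the documentation of your ChaosBlade version, and the function name is hypothetical.

```python
from urllib.parse import quote


def blade_http_url(host: str, port: int, command: str) -> str:
    """Encode a blade sub-command into a request URL for the HTTP service."""
    # Assumed endpoint shape: /chaosblade?cmd=<url-encoded command>
    return f"http://{host}:{port}/chaosblade?cmd={quote(command)}"


url = blade_http_url("127.0.0.1", 9526, "create cpu fullload")
print(url)
```

The resulting URL could then be requested with any HTTP client, for example urllib.request.urlopen(url), once the server is running.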

3. Tool architecture design

ChaosBlade is split into independent projects by domain, and each project is implemented according to the best practices of its domain. This matches the usage habits of each domain, while the chaos experiment model ties every project back to the ChaosBlade CLI project so that all of them can be called uniformly through ChaosBlade. Each domain generates a YAML description file of its experiment scenarios from the experiment model and exposes it to the chaos experiment platform on top. The platform detects scenario changes automatically from changes in this description file, so adding a new scenario requires no platform development, and the platform can focus on the other parts of chaos engineering. The executor projects currently included are as follows:

  • chaosblade: the chaos experiment management tool. It provides commands for creating, destroying and querying experiments and for preparing and revoking the experiment environment, and it is the execution tool for chaos experiments, available through both CLI and HTTP. It provides complete descriptions of commands, experiment scenarios and parameters, and is simple and clear to operate.
  • chaosblade-spec-go: the chaos experiment model defined in Golang. Scenarios written in Golang can be implemented easily on top of this specification.
  • chaosblade-exec-os: implementation of basic resource experiment scenarios, such as CPU, network, memory and disk.
  • chaosblade-exec-docker: implementation of Docker container experiment scenarios, standardized by calling the Docker API.
  • chaosblade-operator: implementation of Kubernetes platform experiment scenarios. Chaos experiments are defined as standard Kubernetes CRDs, so experiment scenarios can be created, updated and deleted with the usual Kubernetes resource operations, including kubectl and client-go. They can also be executed with the chaosblade CLI tool above.
  • chaosblade-exec-jvm: implementation of Java application experiment scenarios. It attaches dynamically using Java Agent technology, requires no integration work and costs nothing to adopt, and supports unloading, which fully reclaims the resources created by the Agent.
  • chaosblade-exec-cplus: implementation of C++ application experiment scenarios, using GDB to inject experiment scenarios at the method and line-of-code level.
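
Because chaosblade-operator defines experiments as Kubernetes custom resources, an experiment can be written as a manifest whose fields follow the experiment model. The sketch below builds such a manifest as a plain dictionary. The API version, the matcher names (names, cpu-percent) and the node name node-b are assumptions for illustration and should be checked against the CRD installed in your cluster; the helper function is hypothetical.

```python
import json


def chaosblade_manifest(name, scope, target, action, matchers):
    """Build a ChaosBlade custom resource as a plain dict."""
    return {
        "apiVersion": "chaosblade.io/v1alpha1",  # assumed API group and version
        "kind": "ChaosBlade",
        "metadata": {"name": name},
        "spec": {
            "experiments": [
                {
                    "scope": scope,
                    "target": target,
                    "action": action,
                    # Each matcher is a name plus a list of string values.
                    "matchers": [
                        {"name": key, "value": list(values)}
                        for key, values in matchers.items()
                    ],
                }
            ]
        },
    }


# "All CPU cores on Node B are fully used" expressed as a resource.
manifest = chaosblade_manifest(
    name="node-cpu-fullload",
    scope="node",
    target="cpu",
    action="fullload",
    matchers={"names": ["node-b"], "cpu-percent": ["100"]},
)
print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON as well as YAML, the printed output could be saved to a file and applied with kubectl apply -f, and deleting the resource would end the experiment.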

4. Tool use case

A Dubbo microservice case illustrates how the chaosblade tool is used. The consumer calls the provider, and the provider calls both the base service and the Mk-Demo database. The provider and base services each run two instances.

The experiment scenario in this case is database call latency. We first define the monitoring indicators, namely the number of slow SQL statements and the alarm information, and state the expected hypothesis: the number of slow SQL statements increases and the DingTalk group receives a slow SQL alarm. Next, we run the experiment: for the database Demo and the table D_Discount, 50% of the queries are delayed by 600 ms.
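
The hypothesis check in this case can be illustrated with a small simulation. It does not call ChaosBlade or a database: the 20 ms baseline latency and the 500 ms slow-SQL threshold are made-up values, and the function names are hypothetical. It only shows how delaying about 50% of the queries by 600 ms moves the slow SQL count.

```python
import random


def inject_delay(latencies_ms, delay_ms, effect_percent, seed=0):
    """Add delay_ms to roughly effect_percent of the given query latencies."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    return [
        latency + delay_ms if rng.random() * 100 < effect_percent else latency
        for latency in latencies_ms
    ]


def count_slow(latencies_ms, threshold_ms):
    """Count queries at or above the slow-SQL threshold."""
    return sum(1 for latency in latencies_ms if latency >= threshold_ms)


baseline = [20] * 1000        # hypothetical: every query takes 20 ms
SLOW_THRESHOLD_MS = 500       # hypothetical slow-SQL threshold
during = inject_delay(baseline, delay_ms=600, effect_percent=50)

slow_before = count_slow(baseline, SLOW_THRESHOLD_MS)
slow_during = count_slow(during, SLOW_THRESHOLD_MS)
print(f"slow SQL before: {slow_before}, during experiment: {slow_during}")

# Steady-state hypothesis from the text: the slow SQL count increases.
hypothesis_holds = slow_during > slow_before
```

In a real drill the two counts would come from the monitoring system rather than from a simulation, and the alarm check would be a second, separate assertion.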

We use ARMS, an Alibaba Cloud product, for monitoring and alerting. Immediately after the chaos experiment was executed, an alarm arrived in the DingTalk group, so the results match the monitoring indicators defined earlier. Note, however, that meeting expectations this time does not guarantee meeting them in the future, so the behavior needs to be verified continuously through chaos engineering. When slow SQL occurs, the trace feature of ARMS can be used to locate the statements that are executing slowly.

Chaos engineering platform: ChaosBlade-Box

To let users focus on solving system high-availability problems through chaos engineering rather than on selecting and deploying experiment tools, the ChaosBlade brand has been upgraded and the ChaosBlade-Box chaos engineering platform has been open-sourced. The platform hosts mainstream chaos experiment tools, deploys them automatically, and lets chaos engineering be carried out from a unified operation page.

The following sections introduce the ChaosBlade-Box chaos engineering platform through its features, architecture design and usage.

1. Platform features

It has the following features:

  • Hosting of open source experiment tools: the platform can host mainstream experiment tools in the industry, such as its own ChaosBlade and the external LitmusChaos. Hosting of the Chaos Mesh experiment tool is planned.
  • Rich experiment scenarios: basic resources (CPU, memory, network, disk, process, kernel and file), multi-language application services (Java, C++, NodeJS and Golang) and the Kubernetes platform (Container, Pod and Node resource scenarios, including the scenarios above).
  • Automatic deployment of experiment tools: you do not need to deploy the experiment tools manually. They are deployed automatically on hosts or clusters.
  • Unified chaos experiment user interface: users do not need to learn how each tool is used and can run chaos experiments from one unified interface.
  • Multi-dimensional experiments: experiments can be orchestrated at the host level, the Kubernetes resource level and the application level.
  • Cloud native ecosystem integration: deployment is managed with Helm, Prometheus monitoring is integrated, and cloud native experiment tools can be hosted.

2. Platform architecture design

Hosted tools such as ChaosBlade and LitmusChaos can be deployed automatically from the console page. Experiment scenarios are unified according to the chaos experiment model established by the community. Target resources are divided into hosts, Kubernetes and applications and are controlled by the target manager, which makes it possible to select target resources from the graphical interface. The platform runs the experiment scenarios of different tools by calling the chaos experiment executor, can observe experiment metrics through its Prometheus integration, and will provide rich experiment reports in the future. Deploying ChaosBlade-Box is also very simple; see the details at:

Github.com/chaosblade-…

3. Instructions for use

After the Kubernetes cluster or host information is configured, you can view the cluster or host data on the host list page. Select Experiment Management to create an experiment. The experiment can target the host, Node, Pod or Container dimension; after you select a dimension, the matching resource list is displayed. The drill content lists all of the hosted experiment scenarios. After the experiment is created, the page jumps automatically to the drill details page. Click “Execute” to go to the task details page.

The drill task details page shows the basic information of the experiment and the status of the experiment task, making it easy to control the experiment and see the state of the task.

Future plans

1. ChaosBlade

Going forward, ChaosBlade will provide a cloud native chaos engineering platform and chaos engineering experiment tools for multiple clusters, environments and languages. The experiment tools will continue to focus on the richness and stability of experiment scenarios, supporting more Kubernetes resource scenarios, standardizing application service experiment scenarios, and providing standard implementations of experiment scenarios in multiple languages.

2. ChaosBlade-Box

In the future, the core functions of the Alibaba Cloud fault drill platform (which holds the advanced-level Trusted Cloud chaos engineering platform certification) will be open-sourced and merged into the existing chaos engineering platform to provide more capabilities. At the same time, the deployment and adoption of chaos engineering tools will be simplified. The platform will host more chaos experiment tools, be compatible with mainstream platforms, recommend scenarios, integrate business and system monitoring, and output experiment reports, closing the chaos engineering loop on an easy-to-use foundation.

Author introduction:

Xiao Changjun (alias: Qiong Gu): Alibaba technical expert, founder and maintainer of ChaosBlade, client-side lead of the Alibaba Cloud fault drill platform, Trusted Cloud standards expert and chaos engineering advocate, with years of experience in distributed system architecture and stability engineering.

Sanjay: works at the R&D Center of Agricultural Bank of China, developing big data systems for finance.

This article is original content from Alibaba Cloud and may not be reproduced without permission.