1. Introduction

The micro-service architecture has been in place at Qunar for many years, and the number of micro-service applications has reached the thousands. As the call links between services have grown more and more complex, failures occur frequently and bring huge economic losses to the company, so stability construction has become an important piece of work. Since Netflix proposed improving system stability through Chaos Engineering in 2010, it has proved to be an effective way to discover system weaknesses and an effective means of establishing confidence in a system's ability to withstand turbulent conditions in production. Since the end of 2019, Qunar has also explored chaos engineering in combination with its own technology system. The following is a brief introduction to our practical experience.

2. Technology selection

To avoid reinventing the wheel, at the beginning of the project we investigated the chaos engineering tools that were already open source and analyzed them against the characteristics of our own technical system:

  • At the time, the base resources were KVM and containerization was being explored, so both platforms needed support.
  • The main technology stack within the company is Java.

Based on the above two points, plus its active community, we chose ChaosBlade as the fault injection tool, combined with our own chaos engineering console (ChaosBlade-Box did not exist at the time), as the final solution.

3. Architecture

Based on the company’s internal system, the overall architecture is as follows:

Vertically, top-down:

  • Service governance. Portal (the CI/CD platform, which provides application portraits) supplies information about application dependencies, application properties, and runtime resources. Through the chaos engineering UI, fault drills can be created (a fault drill includes the application information, application resources, faults to be injected, etc.).

  • The chaos engineering console (Chaos console) provides orchestration of multi-fault task flows across multiple applications, as well as control of the fault drill process;

  • SaltStack and ChaosBlade-operator provide ChaosBlade installation and uninstallation capabilities;

  • Application resources are divided into KVM machines and containers hosted on K8s. The fault drill orchestrator communicates with the HTTP service exposed by ChaosBlade through RESTful APIs to inject and recover faults, as sketched below.
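
A minimal sketch of this interaction, assuming ChaosBlade's HTTP server mode has been started on the target machine (blade server start --port 9526). The endpoint and cmd format follow ChaosBlade's documented HTTP mode; the host, port, and fault policy values are illustrative assumptions, not our production configuration:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Orchestrator-side client for ChaosBlade's HTTP mode (sketch).
public class BladeHttpClient {
    private final HttpClient http = HttpClient.newHttpClient();
    private final String endpoint; // e.g. "http://10.0.0.12:9526/chaosblade"

    public BladeHttpClient(String endpoint) {
        this.endpoint = endpoint;
    }

    // Sends a blade command such as "create cpu fullload --cpu-percent 60".
    public String send(String cmd) throws Exception {
        String url = endpoint + "?cmd=" + URLEncoder.encode(cmd, StandardCharsets.UTF_8);
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        // ChaosBlade answers with a JSON body like {"code":200,"success":true,"result":"<uid>"}.
        return http.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        BladeHttpClient blade = new BladeHttpClient("http://10.0.0.12:9526/chaosblade");
        System.out.println(blade.send("create cpu fullload --cpu-percent 60")); // inject
        // To recover, parse the experiment uid from the create response and send:
        // blade.send("destroy <uid>");
    }
}
```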

Horizontally:

  • The automated test platform mainly provides regression of online cases during a drill, as well as the marking assertions for strong and weak dependencies.

  • When a drill starts, the Chaos console listens for core-metric alarms of the applications involved. If an alarm fires, it notifies the relevant people, terminates the drill, and recovers the injected faults, so losses are stopped in time. A sketch of this stop-loss loop follows.
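
The stop-loss loop can be pictured as a simple watchdog. AlarmClient and Drill below are hypothetical stand-ins for our internal monitoring and orchestration interfaces, not real APIs:

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical watchdog: poll core-metric alarms and stop the drill on the first hit.
public class DrillWatchdog {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void guard(Drill drill, AlarmClient alarms) {
        scheduler.scheduleAtFixedRate(() -> {
            List<String> firing = alarms.firingAlarms(drill.applications());
            if (!firing.isEmpty()) {
                drill.notifyOwners(firing);  // tell the relevant people which alarms fired
                drill.recoverAllFaults();    // destroy every injected fault policy
                drill.terminate();           // mark the drill as aborted
                scheduler.shutdown();
            }
        }, 0, 10, TimeUnit.SECONDS);
    }

    interface AlarmClient { List<String> firingAlarms(List<String> apps); }

    interface Drill {
        List<String> applications();
        void notifyOwners(List<String> alarms);
        void recoverAllFaults();
        void terminate();
    }
}
```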

4. System evolution

Qunar's chaos engineering project has gone through two stages:

1. Building fault injection capability. At this stage, the main goal was to let users manually create a fault drill and, through appropriate fault policies, verify whether certain aspects of the system meet expectations.

2. Providing strong/weak dependency scenarios: dependency marking, strong/weak dependency verification, and an automated strong/weak dependency closed loop, using chaos engineering to improve the efficiency of micro-service governance.

4.1 Fault Test

Simulating fault occurrence through fault injection is the basic capability of chaos engineering. This stage mainly provides three fault injection scenarios: machine shutdown, OS-layer faults, and Java application faults. On this basis, we also built scenario-oriented functions.
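
For concreteness, the sketch below shows what each scenario's fault policy can look like. The OS-layer and JVM commands use real ChaosBlade policy forms, but the class, method, and port values are made up; machine shutdown is assumed to go through the machine-management channel rather than ChaosBlade:

```java
// Illustrative fault policies for the three scenario types (target values are invented).
public final class ExamplePolicies {
    // OS layer: burn CPU up to 80% on the target machine.
    public static final String CPU_LOAD = "create cpu fullload --cpu-percent 80";

    // OS layer: add 3s of latency on traffic to a downstream port.
    public static final String NET_DELAY =
            "create network delay --time 3000 --interface eth0 --remote-port 8080";

    // Java application: throw a custom exception at a specific method.
    public static final String JVM_EXCEPTION =
            "create jvm throwCustomException --exception java.lang.RuntimeException"
            + " --classname com.example.OrderService --methodname queryOrder --process java";

    private ExamplePolicies() {}
}
```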

4.1.1 Drill process

A typical drill flow is as follows:

4.1.2 Difficulties

  • The open source fault policies were insufficient

chaosblade-exec-jvm provides the basic capabilities for Java fault injection, as well as plug-ins for some open source components, but this was not sufficient for our in-house components. Our middleware team therefore carried out secondary development, adding plug-ins for AsyncHttpClient and QRedis fault injection, and also added call-point-based fault injection for HTTP and Dubbo. A sketch of the underlying AOP mechanism follows.
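
chaosblade-exec-jvm hooks target methods through JVM-Sandbox AOP, and our in-house plug-ins work the same way. The fragment below is a minimal stand-alone JVM-Sandbox module in that spirit, not the actual plug-in SPI; the QRedis client class and method names are assumptions:

```java
import javax.annotation.Resource;

import com.alibaba.jvm.sandbox.api.Information;
import com.alibaba.jvm.sandbox.api.LoadCompleted;
import com.alibaba.jvm.sandbox.api.Module;
import com.alibaba.jvm.sandbox.api.ProcessController;
import com.alibaba.jvm.sandbox.api.listener.ext.Advice;
import com.alibaba.jvm.sandbox.api.listener.ext.AdviceListener;
import com.alibaba.jvm.sandbox.api.listener.ext.EventWatchBuilder;
import com.alibaba.jvm.sandbox.api.resource.ModuleEventWatcher;
import org.kohsuke.MetaInfServices;

// Demo module: force an assumed QRedis client method to fail.
@MetaInfServices(Module.class)
@Information(id = "qredis-fault-demo", version = "0.0.1", author = "demo")
public class QRedisFaultModule implements Module, LoadCompleted {

    @Resource
    private ModuleEventWatcher moduleEventWatcher;

    @Override
    public void loadCompleted() {
        new EventWatchBuilder(moduleEventWatcher)
                .onClass("com.example.qredis.QRedisClient") // assumed in-house client class
                .onBehavior("get")                          // assumed method to break
                .onWatch(new AdviceListener() {
                    @Override
                    protected void before(Advice advice) throws Throwable {
                        // Replace the real call with an injected failure.
                        ProcessController.throwsImmediately(
                                new RuntimeException("chaos: injected QRedis failure"));
                    }
                });
    }
}
```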

  • Containerization modification

In mid-2021, Qunar started the containerized migration of applications, so the fault drill platform also needed to support drills on containers. The following scheme selection was made based on ChaosBlade-operator:

The schemes were compared on three main issues:

  • Install and uninstall the Agent
  • Policy injection and recovery
  • Transformation cost on the control side

Based on the comparison of the above schemes, we ultimately implemented Scheme 3.

4.2 Automated closed loop for strong and weak dependencies

4.2.1 Background

Based on the fault drill platform, we provide fault drill functions for strong/weak dependency scenarios:

  • Application dependency display and dependency annotation
  • Back-filling fault policy parameters from the dependency information

However, verification of the whole strong/weak dependency relationship still had to be driven by people, so we combined the automated test tooling to develop automatic strong/weak dependency marking, completing the maintenance of dependency relationships through an automated process and forming a closed loop.

4.2.2 Scheme

The Chaos console periodically fetches application dependencies from the service governance platform and, for each downstream interface, generates a fault drill based on an exception-throwing policy. It then injects the faults into the application in the test environment, runs the cases through the automated test platform, and performs a real-time diff to obtain assertion results. The Chaos console combines the test assertions with the logs of fault policy hits to determine whether the current downstream interface is a strong or weak dependency, as sketched below.
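
The classification rule implied above can be written down directly; the names below are invented for illustration:

```java
// Hypothetical decision table for classifying a downstream interface.
public final class DependencyJudge {

    public enum Kind { STRONG, WEAK, UNKNOWN }

    /**
     * @param faultHit   the injected exception policy actually triggered (from hit logs)
     * @param casePassed the automated regression case still passed under the fault
     */
    public static Kind judge(boolean faultHit, boolean casePassed) {
        if (!faultHit) {
            return Kind.UNKNOWN; // the fault never fired, so this drill proves nothing
        }
        // Fault fired and the core flow still worked: the dependency is weak.
        // Fault fired and the case failed: the caller cannot survive it, so it is strong.
        return casePassed ? Kind.WEAK : Kind.STRONG;
    }

    private DependencyJudge() {}
}
```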

4.2.3 Difficulties

1. Java agent compatibility. The automated test platform supports a record-and-playback mode, which mocks some interfaces with pre-recorded traffic during regression testing. In this mode, a record-and-playback agent based on JVM-Sandbox is used. chaosblade-exec-jvm is also a JVM-Sandbox-based agent, so some compatibility issues had to be resolved for the two agents to run together.

  • Can both agents take effect at the same time? JVM-Sandbox added namespace functionality in 1.3.0, which means multiple JVM-Sandbox-based Java agents can be enabled at the same time, provided their namespaces differ. ChaosBlade uses the default namespace, so we resolved the conflict by modifying the namespace in ChaosBlade.
  • When AOP from both agents cuts into the same library, fault injection will not work if the mock takes effect first. We added a blacklist function to the record-and-playback agent to avoid this problem.

2. Test assertions differ from ordinary tests. In ordinary regression testing, the automated test platform pays most attention to the completeness and accuracy of the data. In a fault drill, however, it is usually a weak dependency that is failing, and beyond the regular status-code checks, what matters most is whether the core data nodes of the returned result are correct. For this reason, a separate assertion configuration was added to the automated test platform to accommodate fault drills, as sketched below.
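
One way to picture such an assertion configuration is a whitelist of core result nodes that must match the baseline, while everything else may drift. The helper below is hypothetical and uses flattened field maps instead of a real JSON library to stay self-contained:

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical drill-mode assertion: only core fields must match the baseline.
public final class CoreNodeAssert {

    public static boolean coreNodesMatch(Map<String, String> baseline,
                                         Map<String, String> actual,
                                         List<String> coreNodes) {
        for (String node : coreNodes) {
            // Non-core fields (e.g. data filled by a weak dependency) may differ freely.
            if (!Objects.equals(baseline.get(node), actual.get(node))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> baseline = Map.of("status", "0", "order.id", "42", "ads", "[banner]");
        Map<String, String> drilled  = Map.of("status", "0", "order.id", "42", "ads", "");
        // "ads" comes from a weak dependency, so it is not in the core-node list.
        System.out.println(coreNodesMatch(baseline, drilled, List.of("status", "order.id"))); // true
    }

    private CoreNodeAssert() {}
}
```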

5. Open source contributions

The main open source project used in Qunar's chaos engineering practice is ChaosBlade. During use, chaosblade, chaosblade-exec-jvm, and chaosblade-operator were all redeveloped and had bugs fixed to varying degrees, and some of the changes have been submitted to the official repo and merged. We have also communicated with the ChaosBlade community and plan to keep participating in community building and contributing to open source.

6. Plans for the future

At present, our fault drill platform has supported more than 80 simulated machine-room power outage drills and more than 500 routine drills, involving 50+ core applications and 4000+ machines. Business lines have also formed a good culture of quarterly drills and pre-launch verification.

Our next main goal is to automate online random drills: minimize the blast radius as determined by service dependency links, build static assertions for online drills, and finally achieve regular random drills covering all core page links across the department. We will also explore more use cases for chaos engineering in service governance and stability construction, providing technical support for the company's stable business development.