This article is part of the ByteDance Infrastructure Practices series, in which technical teams and experts from ByteDance's infrastructure department share the lessons learned while building and evolving the company's infrastructure and exchange ideas with fellow engineers. Chaos engineering uses fault injection to uncover a system's weak points and thereby improve its stability. As microservices and cloud-native technologies spread, distributed systems have become the industry norm, but they also bring sharply rising complexity, failure consequences that are hard to predict, and faults that are hard to prevent and verify. Chaos engineering addresses these problems through fault injection. This article describes ByteDance's practices since adopting chaos engineering, in the hope of providing a useful reference.

Background

What is chaos engineering

When a distributed system runs in production, all kinds of unexpected incidents are inevitable. At the same time, cloud-native development keeps pushing microservices toward further decoupling, while massive data volumes and user scale drive the infrastructure toward large-scale distribution. Distributed systems are inherently interdependent, and there are countless things that can go wrong, any of which can damage the business or cause other unexpected anomalies.

In a complex distributed system there is no way to prevent such failures entirely, so we should aim to identify as many risks as possible before the abnormal behavior is triggered, and then harden the system against those faults to avoid serious consequences.

Chaos engineering is a set of methods for discovering the weak links of a production distributed system through experiments. This kind of empirical verification clearly helps us build more flexible systems and gives us a much deeper grasp of how the system actually behaves in operation. It lets us build confidence in running highly available distributed systems while continuing to make them more resilient (resilience: the ability of a system to respond to and recover from failures).

Practicing chaos engineering can be as simple as running kill -9 in a production environment to simulate the sudden outage of a service node, or as complex as picking a small (but representative) portion of live traffic and automatically running a series of experiments against it according to a set of rules or on a schedule.

A more basic introduction to chaos engineering is beyond the scope of this article; there are many existing discussions, see Chaos Engineering [1].

Industry practice

In fact, all of the industry's major players practice chaos engineering; representative projects include:

  • Netflix was the first to systematically formulate the concept of chaos engineering and published the field's first book, Chaos Engineering [1], which proposes a chaos engineering maturity model and adoption model and summarizes five advanced principles, all of which have guided the field's development. Netflix also open-sourced its Chaos Monkey project [3].
  • Alibaba was one of the earliest companies in China to explore chaos engineering and open-source its work. Its open-source project ChaosBlade [4] can run chaos experiments in combination with Alibaba Cloud.
  • PingCAP, a leading open-source database company in China, has long invested in chaos engineering and recently open-sourced its internal chaos engineering platform, Chaos Mesh [5].
  • Gremlin, a company commercializing chaos engineering, provides an experimentation platform that triggers failures by installing its agent on cloud hosts, and it popularized the concept of Chaos GameDay [2].

How ByteDance practices chaos engineering

ByteDance's business lines have long run fault drills, supported by simple tools that gradually evolved into a drill platform. When that platform could no longer meet our needs, we introduced chaos engineering theory and rethought our approach. We discuss chaos engineering in three parts:

  • Fault injection
  • Automated metric analysis
  • Putting the practices into operation

While implementing chaos engineering, we found that it rests on two core atomic capabilities: fault injection and stability detection. Fault injection is the foundation of chaos engineering and needs no further explanation. Stability detection matters for two reasons: first, it reduces the time cost of experiments, because automated metric analysis can assist our judgment and increase the return on each experiment; second, it reduces the risk and cost of experiments, because automated metric analysis can judge stability and serve as the basis for automatically stopping a chaos experiment.

As for the third part, putting the practices into operation: chaos experiments run for different purposes always involve a certain amount of procedural and routine work, and we want that work to accumulate on the platform.

The first generation

In the first generation, ByteDance used a disaster recovery (DR) drill platform as its internal fault platform. Its architecture is as follows:

Architecture of the DR drill platform

The platform's main objective was fault injection, along with simple threshold-based metric analysis and automatic stop. Faults were mainly built on network interference to simulate failures of downstream dependencies, which helped some services run partial disaster recovery drills in the production environment. However, the platform had limitations: its early design focused on network failures, and the architecture and model did not extend easily to other fault modes. It had no clear, unified description of the fault domain, and therefore no clear description of the blast radius. Its metric analysis was also rudimentary, and its practical activities amounted to simply creating failures. So we introduced chaos engineering theory and rebuilt a new chaos engineering platform.

The second generation

This stage still focused on building fault capability, with the following objectives:

  1. Fault injection: design an extensible fault center to make faults precise and controllable.
  2. Practice: establish chaos engineering norms and explore best practices.
  3. Decouple fault implementation from chaos practice management.

System design

The overall design is as follows:

Preliminary overall design

The drill platform that users operate does not need to care about how faults are implemented or how fault state is maintained; it focuses on managing and orchestrating chaos experiment plans. Everything related to fault implementation sinks into the fault center, and the platform layer simply submits tasks to it. The fault center abstracts the fault model and exposes a set of declarative interfaces. It is responsible for translating and evaluating a fault declaration, determining the containers, physical machines, or middleware where the fault should occur, automatically installing the agent, and issuing commands, which makes precise fault control and maintenance possible.

The fault model

Before discussing how to do fault injection, we first need to define the fault model. How can we abstractly describe network failures, OS failures, downstream dependency failures, middleware failures, and other failures that occur at different levels? We first established that:

Every failure affects some microservice, directly or indirectly. Our ultimate goal is to observe how resilient the service itself is when its external dependencies misbehave.

Therefore, we take a microservice as the observation target and define faults around it. The fault model is as follows:

The fault model
  • Target – the target microservice mentioned above. Before a chaos experiment starts, it must be clear which service the fault is injected into; that service is the main observation target.
  • Scope filter – corresponds to the blast radius in chaos engineering. To reduce experimental risk, we never affect the service's entire traffic; the filter usually selects a deployment unit, whether a data center or a cluster, or narrows down to the instance level or traffic level.
  • Dependency – we assume a service is affected by a failure precisely because one of its dependencies is abnormal. The anomaly may come from middleware, a downstream service, or a dependent resource such as CPU, disk, or network.
  • Action – the failure event, describing how the external dependency fails, for example a downstream service rejecting requests, dropping packets, or adding delay, or a disk becoming overloaded or full.

Based on the fault model, pseudo-code for a fault declaration looks like this:

    // The CPUs of 10% of Application A's cluster1 instances suddenly become saturated
    spec.target("application A")
        .cluster_scope_filter("cluster1")
        .percent_scope_filter("10%")
        .dependency("cpu")
        .action("cpu_burn")
        .end_at("2020-04-19 13:36:23")

    // The downstream Application C that Application B's cluster2 depends on suddenly adds 100ms of delay
    spec.target("application B")
        .cluster_scope_filter("cluster2")
        .dependency("application C")
        .action("delay, 100ms")
        .end_at("2020-04-19 13:36:23")

Fault center design

We designed a set of declarative interfaces based on the fault model above. To inject a fault, you simply add a fault declaration following the model; to terminate the fault, you simply delete the declaration. After receiving a declaration, the fault center searches the internal R&D platform for qualifying instances, automatically installs the fault agent, and sends the relevant instructions to the agent to carry out the injection. The fault center borrows, where appropriate, from the architecture and concepts of Kubernetes; its design is as follows:

Fault center architecture diagram

The fault center consists of three core components, the API Server, the Scheduler, and the Controller, plus a core store, etcd. The API Server wraps etcd and exposes the declarative interface. The Scheduler parses the fault declaration and continuously resolves it into the Target's instances and the Dependency's downstream instances, middleware, or physical devices. The Controller then translates the Action into executable instructions, which are delivered to the agent on the corresponding instance or issued through the corresponding middleware's API, achieving precise fault injection.
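
To make the declarative model more concrete, here is a minimal Go sketch of what a fault declaration and the fault center's control loop could look like. It only illustrates the Target / Scope Filter / Dependency / Action model described above; all type and function names are hypothetical and are not ByteDance's internal API.

    package fault

    import "time"

    // FaultSpec is an illustrative declarative fault object mirroring the
    // Target / Scope Filter / Dependency / Action model. Field names are
    // hypothetical, not the real internal API.
    type FaultSpec struct {
        Target     string      // microservice under observation, e.g. "application A"
        Scope      ScopeFilter // blast radius: cluster, percentage of instances, traffic
        Dependency string      // what is made to fail: "cpu", "redis", a downstream service
        Action     string      // failure event, e.g. "cpu_burn" or "delay, 100ms"
        EndAt      time.Time   // when the fault center should automatically revert the fault
    }

    // ScopeFilter narrows the fault to a deployment unit so that the blast
    // radius stays explicit and controllable.
    type ScopeFilter struct {
        Cluster string // e.g. "cluster1"
        Percent int    // percentage of instances affected, e.g. 10
    }

    // Reconcile sketches the control loop: a Scheduler-like resolve step maps the
    // declaration to concrete instances, and a Controller-like apply step turns the
    // Action into agent commands. Deleting the spec would trigger the reverse path.
    func Reconcile(spec FaultSpec, resolve func(FaultSpec) []string,
        apply func(instance string, spec FaultSpec) error) error {
        for _, instance := range resolve(spec) { // Scheduler: find matching instances
            if err := apply(instance, spec); err != nil { // Controller: deliver the instruction to the agent
                return err
            }
        }
        return nil
    }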

Principles of experimental selection

Given the risks of chaos experiments and the different characteristics of each business, we defined principles for choosing experiments, which each business line can apply according to its own situation. The principles are:

  • From offline to production
  • From small to large
  • From facing the past to facing the future
  • From weekdays to rest days

From offline to production

This principle concerns the choice of environment. Strictly speaking, chaos engineering is only meaningful in production, but we believe a gentler path is to move from offline to production. This is a pragmatic compromise: starting offline puts everyone at ease. For distributed systems, however, different deployments and different traffic produce different results, so only experiments in production provide real verification. A better path is:

Test environment -> Pre-release environment -> Specific traffic in a preview environment -> Production traffic in production clusters

From small to large

This principle concerns the fault scope. We recommend that faults start small and mild, and that their scope expand only once sufficient confidence has been established. A better path is:

Controllable traffic -> Single interface -> Single machine -> Single cluster -> Single data center -> Full call chain

From facing the past to facing the future

This principle concerns the fault type. We give the highest experimental priority to failures that have already happened: history tells us that people trip over the same stone again and again, a production failure is likely to recur, and similar faults may appear on other links. A better path is:

Reproduce historical faults -> Historical fault types & similar links -> Random faults of all kinds & the full call chain

From weekdays to rest days

This principle concerns when to run chaos experiments; "rest day" here really means any time. We recommend starting on weekdays, ideally around 3 pm (each business can adjust for its own peak hours). During that window the relevant people are generally at work and can handle any situation promptly. The early goal of chaos engineering is to expose problems in advance under controlled conditions; as the practice matures, we will gradually start experimenting at random times. A better path is:

Weekday afternoon -> Weekday evening -> Rest day -> Random time

Experimental process design

In this phase we designed a best-practice chaos experiment process for business systems. Following it makes chaos experiments more purposeful and the observations more meaningful.

Before the experiment

  0. ⚠️ Before your first chaos experiment, make sure your service has adopted resilience patterns and is prepared to handle the errors it may encounter; otherwise do not attempt it casually.
  1. Prepare the ability to inject faults:
     a. fault simulation on the ByteDance chaos engineering platform;
     b. contacting each dependency owner to create faults manually.
  2. Choose the hypothesis for this experiment, for example:
     a. the business is not affected when a downstream service goes down;
     b. the service is not affected by Redis network jitter;
     c. the service is not affected when a pod is suddenly killed;
     d. when a core downstream dependency fails, the degradation plan takes effect and its side effects are acceptable.
  3. Choose metrics that reflect the experimental hypothesis and observe them.
  4. Choose metrics that reflect business loss and set a baseline for them.
  5. Communicate well within the organization.

During the experiment

  1. Watch the relevant metrics closely; the experiment may need to be terminated at any moment.
  2. Keep the hypothesis in mind and collect the relevant metrics to support later analysis of the results.
  3. Adjust the experiment parameters (fault scope and intensity) at any time according to metric fluctuations; more attempts give better results.

After the experiment

Analyze the results against the metrics and the business behavior. Experience shows the outcomes generally include:

  • finding weak points and driving improvements
  • verifying degradation/contingency plans and building confidence
  • finding the inflection points of system performance
  • cleaning up a batch of ineffective alarms and improving alarm efficiency

Summary

In this stage we completely rebuilt the fault center, making fault injection architecturally simpler and more controllable and, thanks to the model abstraction, more extensible. We also distilled best practices for choosing and running chaos experiments. In the next generation, besides continuing to enrich the fault capabilities, we would focus on filling in the metric analysis capability and consolidating more productive practices.

The third generation

After the initial practice, the core chaos capability, fault injection, was in place, and ByteDance's business lines had begun their chaos journey. Our goals at this stage were:

  1. Automated metric analysis: fill in the metric analysis capability.
  2. Fault injection: enrich the fault types.
  3. Practical activities: consolidate onto the platform the practices summarized in the previous stage, further explore new forms of practice, and extract greater value.

System design

Based on these goals, the overall design is as follows:

Mature stage system design

In the atomic capability layer, we added automated metric observation. By introducing machine learning, we achieve threshold-free anomaly detection based on each metric's historical patterns.
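
The article does not disclose the detection model itself. As a rough illustration of what threshold-free detection against a metric's history can mean, the Go sketch below flags the current value as anomalous when its modified z-score against historical samples (for example the same minute of day over recent weeks) exceeds a conventional cut-off; the production system uses machine learning and is certainly more sophisticated than this.

    package anomaly

    import (
        "math"
        "sort"
    )

    // IsAnomalous compares the current metric value against historical samples
    // taken at comparable times and flags it when it deviates strongly from the
    // historical distribution, so no fixed threshold has to be configured.
    // This is only a simple stand-in for the ML-based detector in the article.
    func IsAnomalous(current float64, history []float64) bool {
        if len(history) < 5 {
            return false // not enough history to judge
        }
        med := median(history)
        devs := make([]float64, len(history))
        for i, v := range history {
            devs[i] = math.Abs(v - med) // deviation from the historical median
        }
        mad := median(devs) // median absolute deviation, robust to outliers
        if mad == 0 {
            return current != med
        }
        // 3.5 is a conventional cut-off for the modified z-score.
        return math.Abs(0.6745*(current-med)/mad) > 3.5
    }

    func median(xs []float64) float64 {
        s := append([]float64(nil), xs...)
        sort.Float64s(s)
        n := len(s)
        if n%2 == 1 {
            return s[n/2]
        }
        return (s[n/2-1] + s[n/2]) / 2
    }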

In the platform layer, we added automated strong/weak dependency analysis and a red-blue confrontation module.

Automated metric observation

In chaos experiments, sorting out and collecting the relevant metrics is tedious and exhausting work. After observing how chaos experiments actually run, we grouped the metrics into three types:

  • Fault metrics – confirm whether the fault was injected successfully.
  • Stop-loss metrics – ensure the system does not lose too much because of the fault.
  • Observation metrics – reveal the details of the fault and the associated anomalies it causes.

Fault metrics

  • Definition – metrics that fluctuate directly because of the fault. For example, if we inject a fault that increases Redis latency by 30 ms, the corresponding metrics are the target-to-Redis mean latency, p99 latency, and so on. These metrics help users see when the fault starts and ends.
  • How we handle them – we only display them, so that users can clearly see when the fault starts and ends.
  • Where they come from – from the fault itself: when the platform creates a fault, it knows which metrics will be directly affected.

Stop-loss metrics

  • Definition – stop-loss metrics are vital to the target service or business: they express the maximum impact the drill can tolerate. They may come from the service itself (such as its external error rate), from business metrics further away (such as video plays per minute), or even from the metrics of downstream services, or some combination of these.
  • How we handle them – for these metrics we must accurately identify the critical threshold. Once a fluctuation reaches it, the loss bottom line has been hit: all operations must stop immediately and the fault must be reverted.

Observation metrics

  • Definition – observation metrics are a great help in finding new problems during a chaos experiment. They should cover anything related to the service and the fault: the service's own four golden signals of SRE (latency, traffic, errors, and saturation), correlated metrics the fault might touch (does a latency fault change other Redis metrics such as Redis QPS or Redis errors? does the QPS of the degraded path change?), as well as associated alarm records and logs.
  • How we handle them – these metrics have no clear threshold, yet analyzing them after the fact often surfaces all kinds of latent problems. We apply machine learning here, comparing each metric against its historical patterns to achieve automatic anomaly detection.

The core of metric observation is therefore intelligent metric selection plus threshold-free anomaly detection. Combined with a set of hand-written rules distilled from experience, this lets us make automatic judgments or assist decisions for the different practical activities, as sketched below.
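
The sketch below shows how the metric classes could drive one evaluation tick of an experiment, assuming that stop-loss metrics rise as loss increases and that a history-based anomaly detector such as the one above is available: a stop-loss metric crossing its agreed bound stops the experiment immediately, while anomalous observation metrics are only collected for later analysis (fault metrics are merely displayed, so they do not appear here). All names are illustrative, not the platform's real API.

    package chaos

    // StopLossMetric carries the critical bound agreed on before the experiment.
    type StopLossMetric struct {
        Name      string
        Current   func() float64
        Threshold float64 // loss bottom line; assumes the metric grows as loss grows
    }

    // ObservationMetric has no fixed threshold and is judged against its history.
    type ObservationMetric struct {
        Name    string
        Current func() float64
        History func() []float64
    }

    // Evaluate is a single tick of a hypothetical experiment guard. It returns
    // stop=true when any stop-loss metric crosses its bound, plus the names of
    // observation metrics flagged as anomalous for post-experiment analysis.
    func Evaluate(stopLoss []StopLossMetric, observation []ObservationMetric,
        isAnomalous func(current float64, history []float64) bool) (stop bool, anomalies []string) {
        for _, m := range stopLoss {
            if m.Current() >= m.Threshold { // loss bottom line reached: revert the fault at once
                return true, nil
            }
        }
        for _, m := range observation {
            if isAnomalous(m.Current(), m.History()) {
                anomalies = append(anomalies, m.Name)
            }
        }
        return false, anomalies
    }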

Red-blue confrontation practice

ByteDance's red-blue confrontation practice grew out of the Chaos GameDay concept introduced by Gremlin [2]. In our internal practice we kept adapting it to local conditions, and it eventually evolved into a red-blue confrontation with ByteDance characteristics. Its goal is to give the business system a comprehensive health check, which can also be seen as a concentrated verification of the system's stability-building goals.

Red-blue confrontation has already helped ByteDance's recommendation middle platform run several comprehensive checks, uncovering problems in areas ranging from monitoring and alerting to fallback, degradation, and circuit-breaking strategies.

Process design

Before a red-blue confrontation starts, communication between the two teams is especially critical. The red team (the defending side) needs to make a number of decisions, such as which services and scopes it is confident enough to bring into the confrontation, how fast the business is currently iterating, and how to balance business iteration against stability building. We follow the process below for the pre-confrontation activities and capture it on the platform:

Flowchart of red-blue confrontation before execution

Once the confrontation starts, the blue team leads the main operations, while the red team mostly stands by, intervening only when something unexpected happens (which usually means the defense has failed) or when a contingency switch needs to be flipped. The main process is as follows:

Flowchart of red-blue confrontation execution

The key step is the review after the confrontation. By summarizing the data recorded during it, we can clearly see the overall effect of the confrontation and understand how far the target business system's stability building has progressed against its plan.

Summary of a single red-blue confrontation

We also summarize and record the problems found along the way and keep a complete record of the confrontation, so that every problem is traceable and its context is preserved.

Summary of the results of a single red-blue confrontation

Automated strong/weak dependency analysis practice

Knowing which service dependencies are strong and which are weak is crucial for service governance and disaster recovery design, and the real strength of a dependency can only be verified when a fault actually occurs. We therefore began verifying strong and weak dependencies and, through practice, steadily raised the degree of automation. After introducing machine learning for threshold-free anomaly detection, the dependency analysis process became almost fully automated.

Strong/weak dependency analysis now covers essentially all the core scenarios of Douyin and Huoshan, providing substantial input for their service governance and disaster recovery design.

The overall automated analysis process is as follows:

Automated strong/weak dependency analysis process
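
As a rough illustration of the automated loop in the figure above, the Go sketch below classifies each dependency by injecting a fault into it, checking whether the target service's core metrics degrade, and reverting the fault before moving on. A dependency whose failure degrades the core metrics is treated as strong, otherwise as weak; the injectFault and coreMetricsDegraded parameters are hypothetical stand-ins for the fault center and the automated metric analysis described earlier.

    package deps

    // Classify walks the target service's downstream dependencies one by one,
    // injects a fault into each (for example delaying or rejecting all calls to it),
    // and marks the dependency as "strong" when the service's core metrics degrade
    // while the fault is active, or "weak" otherwise. The callbacks are illustrative.
    func Classify(dependencies []string,
        injectFault func(dep string) (revert func(), err error),
        coreMetricsDegraded func() bool) map[string]string {

        result := make(map[string]string, len(dependencies))
        for _, dep := range dependencies {
            revert, err := injectFault(dep)
            if err != nil {
                result[dep] = "unknown" // injection failed; needs manual follow-up
                continue
            }
            if coreMetricsDegraded() { // service loss observed while dep is failing
                result[dep] = "strong"
            } else {
                result[dep] = "weak"
            }
            revert() // always recover the fault before testing the next dependency
        }
        return result
    }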

Summary

In this stage we filled in the metric analysis capability and, by introducing machine learning, greatly reduced the cost of metric analysis.

Building on automated metric analysis, we experimented with new practices to extract more value: red-blue confrontation gives a business system a more complete picture of its stability, and strong/weak dependency analysis gives it a deeper view of the details of that stability.

Future phases

Chaos engineering for infrastructure

The discussion so far has focused on building the business layer's resilience to failure, but chaos engineering for the infrastructure matters even more. Computing and storage components are the cornerstone of the upper-layer business in an Internet company, and their stability is a precondition for the stability of everything above them.

However, the closer fault modeling gets to the infrastructure, the more challenging it becomes. For example, simulating faults in the kernel and disks that storage components depend on has to move step by step into the OS kernel and even down to the physical layer.

In addition, data validation for storage components is an even larger topic: can a distributed store keep its promised consistency guarantees under failure, and how do we verify that consistency [6]?

How to do chaos engineering for infrastructure services is a whole new direction we are about to explore.

IaaS with Chaos

As noted above, the dependencies of infrastructure services sit very low in the stack. We are therefore considering building an IaaS cluster with OpenStack, reproducing a production-equivalent deployment model inside it, and then simulating faults at the virtualization layer through the OpenStack API. This would give us an IaaS with built-in chaos capabilities.

Fully automated random chaos experiments

As red-blue confrontations become more common, the platform will gradually accumulate enough business defense targets, which describe the maximum failure a business system claims it can withstand. We can then start injecting random faults automatically from time to time to verify stability within the range of those defense targets.

Intelligent fault diagnosis

We are also considering whether chaos engineering's ability to actively inject faults can accumulate faults and metrics at a large enough scale to train a mapping from metric patterns to faults. That would help with troubleshooting in production and move us toward intelligent fault diagnosis [7].

Closing thoughts

Early explorations of chaos engineering have in fact existed in the industry for a long time, under names such as fault testing and disaster recovery drills. As microservice architectures keep developing and distributed systems keep growing, chaos engineering has come into its own and attracts more and more attention. Once Netflix formally proposed the concept, the related theory grew quickly, and Netflix's own practice demonstrates how much chaos engineering contributes to stability. Meanwhile, we keep exploring whether chaos engineering can do even more for us. As the Internet itself becomes an infrastructure service, its stability will only grow in importance. What we must do is face failure rather than fear it, and avoid black swan events. You are welcome to join us in practicing chaos engineering and pushing the field forward.

References

  1. Chaos Engineering (the Netflix book on system stability): https://www.oreilly.com/library/view/chaos-engineering/9781491988459/
  2. "How to Run a GameDay": https://www.gremlin.com/community/tutorials/how-to-run-a-gameday/
  3. Chaos Monkey, Netflix's open-source chaos engineering project: https://github.com/Netflix/chaosmonkey
  4. ChaosBlade, Alibaba's open-source chaos engineering project: https://github.com/chaosblade-io/chaosblade
  5. Chaos Mesh, PingCAP's open-source chaos engineering project: https://github.com/pingcap/chaos-mesh
  6. Jepsen, a distributed consistency testing framework: https://jepsen.io/
  7. Zhou, Xiang, et al. "Latent error prediction and fault localization for microservice applications by learning from system trace logs." Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2019). https://dl.acm.org/doi/10.1145/3338906.3338961
