This article is reprinted from InfoQ

Author: Tang Liu

Planning: Zhao Yuying

Original link: www.infoq.cn/article/bxG…

As a distributed database, TiDB faces a serious challenge: convincing users that the data they store in it is safe from loss or corruption. Therefore, early in TiDB's development, PingCAP introduced chaos engineering to ensure TiDB stays stable under all kinds of extreme conditions. This article is compiled from a talk at the ArchSummit 2019 Global Architect Summit (Shenzhen). It shares how chaos engineering is applied to TiDB, introduces Schrodinger, a Kubernetes-based automated test platform, and gives a practical example of how chaos is applied in Schrodinger to test the system.

Hello everyone! I am Tang Liu, chief architect at PingCAP, and I am responsible for the development of TiKV, the underlying storage component of TiDB. TiKV is a CNCF incubating project and should be the only database project from China to have entered the CNCF. I am also a typical open source enthusiast and have done a lot of work on open source components such as go-mysql, raft.rs, and grpc-rs.

Why chaos engineering?

Suppose we start building a system. Whatever its specific function, we need to ensure that it is stable, but how do we know that the system is in a stable state? Usually, teams verify this through unit tests, integration tests, and performance tests. However, no matter how well these tests are written, we don't think they are enough, because errors can happen at any time, especially in distributed systems. This is why we introduce Chaos Engineering.

Take a real production issue in TiDB as an example. Because TiDB uses the Raft consensus protocol for replica replication at the bottom layer, there are the concepts of Leader and Follower. Followers passively accept the Leader's logs and synchronize the relevant data. When a new Follower node joins the cluster, the Leader sends it a Snapshot: the Leader packages the entire data set into a Snapshot and sends it to the Follower. In TiDB, a Snapshot consists of four parts: a Meta file, default.sst, write.sst, and lock.sst. The Meta file records metadata about the data files, including their sizes; the other three are the data files themselves. After receiving the Snapshot, the Follower saves the four parts into different files, then checks the Snapshot, and if it is correct, applies it to the Follower's overall state. As shown in the figure above, a Panic occurred between Save Snapshot and Check Snapshot and the process was restarted. If a file has been written to the Page Cache, Linux is expected to flush the Page Cache to disk and keep the file safe. However, after our Follower crashed and restarted, we found that data was missing: as shown in the image above, write.sst had become 0 MB, even though according to the Meta file write.sst cannot be 0 MB. In other words, files were lost after the process restarted even though there was nothing wrong with the disk. Looking at dmesg, we saw the message "SLUB: Unable to allocate memory on node". This can happen when the Page Cache does not work properly because of other problems, such as insufficient memory, even though the system as a whole looks fine. For us, the lesson is that even if we can often assume the program itself is fine, the operating system it runs on can have bugs that cause the data to be lost, no matter how many unit tests we write for the program.
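
The lesson generalizes: data written to a file is not durable until it leaves the Page Cache. Below is a minimal Go sketch of the idea, not TiKV's actual snapshot code (TiKV is written in Rust); it simply shows that an explicit Sync is what pushes the bytes from the Page Cache to disk.

```go
package main

import (
	"log"
	"os"
)

// writeSnapshotFile is a simplified sketch, not TiKV's actual snapshot code
// (TiKV is written in Rust). It shows the general point of the incident above:
// bytes written to a file only live in the OS Page Cache until they are
// explicitly flushed, so the final Sync is what makes them durable on disk.
func writeSnapshotFile(path string, data []byte) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	if _, err := f.Write(data); err != nil {
		return err
	}
	// Without this fsync, a crash (or a misbehaving kernel) between the write
	// and the flush can leave a truncated or empty file behind.
	return f.Sync()
}

func main() {
	if err := writeSnapshotFile("write.sst", []byte("snapshot payload")); err != nil {
		log.Fatal(err)
	}
}
```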

The second example above shows a common Gray Failure problem in distributed systems. In general, the obvious way to determine whether a program is alive or dead is to write a Checker program and run it periodically to test its status. But we can run into a situation called Gray Failure: the checker can still communicate with the system while the clients can no longer interact with it at all, so we think the system is fine when it is actually broken. To sum up, there are many problems in distributed systems that cannot be solved by testing alone, which is why we turned to a very good solution: chaos engineering.
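
To make the Gray Failure point concrete, here is a rough Go sketch of a checker that exercises the same path a client would use, instead of only asking the process whether it is alive; the in-memory map is a stand-in for the system under test and the helper names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// The in-memory map stands in for the system under test; in practice write
// and read would go through the same client API that real users call.
var kv = map[string]string{}

func write(k, v string) error { kv[k] = v; return nil }

func read(k string) (string, error) {
	v, ok := kv[k]
	if !ok {
		return "", fmt.Errorf("key %q not found", k)
	}
	return v, nil
}

// endToEndCheck writes a fresh value and reads it back through the client
// path, instead of merely asking the process whether it is alive.
func endToEndCheck() error {
	key, val := "probe-key", fmt.Sprint(time.Now().UnixNano())
	if err := write(key, val); err != nil {
		return err
	}
	got, err := read(key)
	if err != nil {
		return err
	}
	if got != val {
		return fmt.Errorf("read back %q, want %q", got, val)
	}
	return nil
}

func main() {
	if err := endToEndCheck(); err != nil {
		fmt.Println("checker sees a real failure:", err)
		return
	}
	fmt.Println("end-to-end check passed")
}
```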

What is chaos engineering?

The concept of chaos engineering has been around for a long time, but it was not until 2012 that Netflix introduced it to the outside world. To promote chaos engineering, Netflix used the image of a Monkey: imagine a Monkey in the system that is usually quiet and does nothing, but suddenly goes crazy and starts messing things up. As an engineer, your job is to catch the Monkey and make it stop. That is roughly what chaos engineering is all about. Put simply, chaos engineering is also an engineering discipline, which means it requires experiments. Chaos experiments are designed to observe how the system truly responds to various faults, so that we can improve and ensure its stability. But before starting chaos engineering, we need to make sure the system is fault-tolerant, which is commonly known as active-active or multi-active deployment. If the system is a typical single-point architecture, then as soon as that single point fails the whole system collapses, and there is no way to verify the effect of chaos engineering. So the system must first be able to tolerate faults; only then can we verify its fault tolerance by continuously injecting faults. If the system cannot tolerate faults, we should first think about how to make it fault-tolerant, and only then think about chaos engineering.

For concrete practice, you can refer to the steps on the Principles of Chaos Engineering page. As shown in the figure above, the first step is to define the steady state of the system. In general, it can be defined through metrics or client-side indicators such as QPS and latency: as long as these indicators do not fluctuate much, the system can be considered stable. Second, after defining the steady state, we split the system into an experimental group and a control group and hypothesize that the experimental group will stay in the steady state no matter what we do to it. The third step is to introduce real-world variables, that is, to simulate errors that may occur in the real world, such as hardware failures or network delay and isolation, in the experimental group. Finally, we compare the steady state of the experimental group and the control group before and after the experiment and check whether it meets our expectations. If they are consistent, the system can be regarded as fault-tolerant; conversely, if the steady states of the two are inconsistent, we have found a system weakness that can be fixed to improve the system's reliability.

Take the TiDB example in the figure above. With the three-replica Raft algorithm, the Leader serves client writes. If the Leader is killed, the Followers will immediately elect a new Leader and continue to provide service. If we want to run chaos engineering on this system, what should we do? First, define the steady state of the system from indicators such as QPS. Second, form the hypothesis: if the Leader node is killed, client QPS will jitter, the Followers will quickly elect a new Leader, and QPS will quickly return to the steady state. The third step is to run the fault-injection experiment. Finally, observe the results: if the system's QPS drops to zero and never recovers, we have proven there is a bug in the system, and we need to find and fix the problem. Conversely, if QPS recovers, the system can tolerate this failure and we can proceed to the next experiment.

To help people practice chaos engineering better, Netflix provides the relevant principles on its website. The first principle is to build a hypothesis around steady-state behavior. The second is to introduce variables that reflect real-world events. The third is to run experiments in production; note that any operation in the production environment is risky, so you must communicate with the relevant teams in advance to avoid causing business failures or unavailability through negligence. The fourth is to automate experiments to run continuously, because doing everything manually is very inefficient. The last principle is to minimize the blast radius: when running chaos experiments you must pay attention to the scope of impact, because if it is not well estimated it is easy to make the service unusable for all users, which is a very serious problem.
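
Putting the four steps together, a chaos experiment for the leader-kill case can be sketched roughly as below. This is only an illustrative Go outline, not Schrodinger's actual code: qps() and killLeader() are hypothetical stand-ins for reading client QPS from the monitoring system and for the fault-injection tooling.

```go
package main

import (
	"fmt"
	"log"
	"time"
)

// qps and killLeader are hypothetical stand-ins: in practice the QPS comes
// from the monitoring system and the kill is done by fault-injection tooling.
func qps() float64      { return 10000 }
func killLeader() error { return nil }

// runExperiment walks through the four steps for the leader-kill case.
func runExperiment() error {
	baseline := qps() // step 1: measure the steady state (step 2: hypothesis is "QPS recovers")
	if err := killLeader(); err != nil { // step 3: inject the real-world event
		return err
	}
	time.Sleep(30 * time.Second) // allow re-election and recovery
	after := qps()               // step 4: compare against the hypothesis
	if after < 0.9*baseline {
		return fmt.Errorf("QPS did not recover: baseline=%.0f, after=%.0f", baseline, after)
	}
	return nil
}

func main() {
	if err := runExperiment(); err != nil {
		log.Fatal(err) // a weakness was found; fix it before the next experiment
	}
	fmt.Println("system tolerated the fault; proceed to the next experiment")
}
```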

PingCAP practices chaos engineering at TiDB

At PingCAP, we practice chaos engineering mainly on TiDB, focusing on two general directions: the first is fault detection, and the second is fault injection. For TiDB, we use three basic methods to analyze system state: Metrics, Log, and Tracing.

The first method is based on Metrics. TiDB uses Prometheus, and here is a typical QPS graph; you can see that at 2 a.m. the latency curve spikes.

So we wrote a very crude and simple script that raises an alarm when it detects that the latency exceeds a certain threshold.
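
As a rough illustration of what such a script can look like, here is a hedged Go sketch that queries the Prometheus HTTP API for a latency expression and prints an alert when the value crosses a fixed threshold; the Prometheus address, the metric expression, and the threshold are placeholders, not our production configuration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/url"
	"strconv"
)

// Placeholders for illustration: the Prometheus address, the latency
// expression, and the threshold are not our production configuration.
const (
	promAddr  = "http://127.0.0.1:9090"
	latencyQL = `histogram_quantile(0.99, sum(rate(tidb_server_handle_query_duration_seconds_bucket[1m])) by (le))`
	threshold = 0.5 // seconds
)

// promResp models only the part of the Prometheus query response we need.
type promResp struct {
	Data struct {
		Result []struct {
			Value [2]interface{} `json:"value"` // [timestamp, "value-as-string"]
		} `json:"result"`
	} `json:"data"`
}

func main() {
	resp, err := http.Get(promAddr + "/api/v1/query?query=" + url.QueryEscape(latencyQL))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var r promResp
	if err := json.NewDecoder(resp.Body).Decode(&r); err != nil {
		log.Fatal(err)
	}
	for _, res := range r.Data.Result {
		v, err := strconv.ParseFloat(res.Value[1].(string), 64)
		if err != nil {
			log.Fatal(err)
		}
		if v > threshold {
			fmt.Printf("ALERT: p99 latency %.3fs exceeds threshold %.3fs\n", v, threshold)
		}
	}
}
```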

However, if we look carefully at the curves across those two days, we can see that latency rises at the same fixed time every day, which can be considered the user's normal workload. If we judge only by a weak signal like a raw Metrics threshold, we cannot reliably find real system problems, so we need to check historical data, in particular the history of the Metrics, and compare against it to see whether the numbers look right. Of course, we will also use machine learning to make more accurate judgments.

The second method is based on Logs, because logs store detailed error information. As a startup, we are not yet able to build a complete log system at this stage, so we use mainstream open source solutions such as Fluent Bit or Promtail and import the data into Elasticsearch or Loki for correlation analysis. In the future, we will write our own log analysis components. For example, every transaction has a transaction ID, and a single transaction query may span multiple components, so we want to show all the related IDs in detail. This kind of analysis is done through logs.
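
As an illustration of why a shared transaction ID helps, here is a small Go sketch that emits structured, JSON-formatted log lines tagged with the same txn_id from different components, so a backend such as Elasticsearch or Loki can pull the whole transaction together; the field names are illustrative, not TiDB's actual log schema.

```go
package main

import (
	"encoding/json"
	"log"
	"os"
)

// logEntry is an illustrative structured log line; the field names are not
// TiDB's actual log schema.
type logEntry struct {
	Level     string `json:"level"`
	Component string `json:"component"`
	TxnID     string `json:"txn_id"`
	Message   string `json:"message"`
}

func main() {
	enc := json.NewEncoder(os.Stdout)
	// Every component tags its entries with the same transaction ID, so a log
	// backend can pull the whole transaction together with one query.
	entries := []logEntry{
		{"INFO", "tidb", "txn-42", "begin transaction"},
		{"INFO", "tikv", "txn-42", "prewrite keys"},
		{"WARN", "tikv", "txn-42", "prewrite retried after region leader change"},
	}
	for _, e := range entries {
		if err := enc.Encode(e); err != nil {
			log.Fatal(err)
		}
	}
}
```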

The third method is to introduce Tracing. Although TiDB supports OpenTracing, I have always believed that Tracing should be used only when Logs or Metrics cannot solve the problem, because enabling Tracing affects the performance of the whole system. TiDB therefore turns Tracing off by default and enables it only when necessary, for example when you need to find out where a query is spending its time. Metrics, Logs, and Tracing together are now often called Observability, and TiDB's observability stack still follows the industry's mainstream solutions without much customization.

Fault injection

Once you have learned how to detect errors, the next step is to consider how to inject errors and introduce various failures into the system. Because TiDB is a distributed database, we mainly focus on two kinds of failures: network failures and file system failures. Because it is distributed, we cannot avoid network problems; because it needs to store data, we must consider the file system. Although real network topologies vary widely, there are generally three models for network fault injection:

As shown in the figure above, the first is Complete, where the network between two nodes is completely disconnected. The second is Simplex: node A can send messages to node B, but B cannot reply to A. The third is Partial: A and B are disconnected from each other, but they can still reach each other through another node, C. For TiDB, we try to simulate these network environments and find as many errors as possible under network isolation.
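
For the Simplex case, one way such a one-way partition could be injected at the packet level is sketched below. The Go program shells out to iptables on node A so that A's packets still reach B while everything B sends back is dropped; this is an illustrative approach that requires root on Linux, the peer address is a placeholder, and it is not necessarily how our own tooling does it.

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

// The peer address is a placeholder; this program must run as root on node A.
const peerB = "10.0.0.2"

// run executes a single iptables command and aborts if it fails.
func run(args ...string) {
	out, err := exec.Command("iptables", args...).CombinedOutput()
	if err != nil {
		log.Fatalf("iptables %v: %v: %s", args, err, out)
	}
}

// injectSimplex drops everything B sends to A, while A's packets still reach B.
func injectSimplex() { run("-A", "INPUT", "-s", peerB, "-j", "DROP") }

// removeRule undoes the injection so the cluster can recover.
func removeRule() { run("-D", "INPUT", "-s", peerB, "-j", "DROP") }

func main() {
	injectSimplex()
	time.Sleep(30 * time.Second) // keep the simplex partition for the experiment window
	removeRule()
}
```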

A typical example here is what we call a Network Partition Ring. In this group, N1 can send messages to N2, N3, N4, and N5, but N1 can only receive messages from N2 and N3, not from N4 and N5. You would almost never encounter this network topology in real life, so why do we do it? We run chaos experiments so that we can proactively find and fix problems before they cause harm to users. Besides the network, storage also needs fault injection.

In TiDB, we interfere with the file system mainly through the FUSE mechanism. As shown in the figure above, the actual data may be stored under the /root/o path and is mounted to another path through FUSE, and applications interact with the mount path. Because of FUSE, error injection can be performed at the mount point along the entire IO chain. In this way, you can easily simulate various IO error situations; if you do not want to use FUSE, you can also consider other Linux debugging tools.
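
As a simplified stand-in for the FUSE approach, the Go sketch below wraps a file handle and makes writes fail with a given probability. This injects errors inside the application process rather than below it at the mount point, so it is only an illustration of the idea, not our actual setup.

```go
package main

import (
	"errors"
	"log"
	"math/rand"
	"os"
)

var errInjected = errors.New("injected I/O error")

// FaultyFile wraps a real file and makes a fraction of writes fail, so the
// application's error-handling path gets exercised.
type FaultyFile struct {
	*os.File
	failRate float64 // probability that a single Write returns an error
}

func (f *FaultyFile) Write(p []byte) (int, error) {
	if rand.Float64() < f.failRate {
		return 0, errInjected
	}
	return f.File.Write(p)
}

func main() {
	f, err := os.Create("test.dat")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	ff := &FaultyFile{File: f, failRate: 0.3}
	for i := 0; i < 5; i++ {
		if _, err := ff.Write([]byte("some data\n")); err != nil {
			log.Println("write failed:", err)
		}
	}
}
```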

For file systems, we sometimes take an even more brutal approach: in TiDB we often simply pull the power plug, manually triggering power and network outages to check whether the system stays stable. The following list of commonly used fault injections is for reference only:

In addition, Jepsen is a great tool for testing distributed systems, and anyone interested in error injection can refer to Jepsen’s code. However, Jepsen is written in Clojure, which is a little hard to understand.

Chaos engineering practice on the cloud

PingCAP introduced chaos engineering to TiDB early in its development. In the early days, chaos experiments were run on a few spare or idle machines applied for as needed, and everything had to be done manually, including building and deploying the entire TiDB cluster. Although this process did uncover some problems, it was very inefficient and time-consuming, and manual deployment also made poor use of the machines.

We decided to streamline the whole process. As shown above, the first step was to manage the machines better through Kubernetes; the second step was to automate the process. So, on top of Kubernetes, we built an automated chaos engineering platform: the Schrodinger platform.

As shown above, there are three Boxes in Kubernetes, and each Box runs two test cases that verify, through random fault injection, whether the system can remain stable. Once this is automated, all you need to do is submit the case to the Schrodinger platform, which automatically compiles the version and runs the relevant test cases. If a case fails, we are notified and can act accordingly. PingCAP is now working with other companies to optimize this into a more general chaos engineering platform that people can run their own business on. Since it is still based on Kubernetes, a business can run directly on our platform by combining the cluster's Helm configuration files with chaos engineering. If you are not familiar with some Kubernetes concepts, you can map them to Linux concepts to understand them.

Specifically, to run a business on this platform, we use the Chaos Operator. It defines all objects as CRDs and starts DaemonSets on the different physical nodes; these DaemonSets are responsible for interfering with the different workloads and Pods. A Sidecar is injected into the corresponding Pods, which you can think of as a thread: the Sidecar helps us inject faults and is responsible for messing with the Pod. As a user, you just provide your own Helm Chart and hand our Chaos CRD to the Chaos Operator. Once the Chaos Operator starts, it sets up the DaemonSets through webhooks and carries out the follow-up operations.
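
To give a feel for what "defining objects as CRDs" means, here is an illustrative Go sketch of what a network-chaos CRD type could look like; the type and field names are hypothetical and are not the Chaos Operator's actual API.

```go
// Package v1alpha1 sketches what a chaos CRD type could look like.
// The type and field names here are hypothetical, not the Chaos Operator's API.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// NetworkChaos describes one network fault to inject into selected Pods.
type NetworkChaos struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              NetworkChaosSpec `json:"spec"`
}

// NetworkChaosSpec is the user-facing part, written in YAML alongside the Helm Chart.
type NetworkChaosSpec struct {
	// Selector picks the target Pods by label.
	Selector map[string]string `json:"selector"`
	// Action is the fault to inject, e.g. "partition" or "delay".
	Action string `json:"action"`
	// Duration is how long the fault lasts, e.g. "30s".
	Duration string `json:"duration"`
}
```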

Guest introduction: Tang Liu, chief architect of PingCAP, is mainly responsible for the development of the distributed key-value store TiKV, as well as testing and tool development for the whole TiDB product.