The destruction and rebuilding of faith
When I first learned programming, I thought writing programs was easy. A program always executed in sequence, exactly as I had planned, and a given input always produced the same, expected output. The biggest challenge back then was probably understanding branches and loops, but as long as I handled them properly, everything was deterministic.
That was a very happy time, until I encountered multithreading. For the first time, my confidence was shaken. In a multithreaded world, things did not work the way I thought they should, and I had to think about data races and memory ordering. Fortunately, after a brief period of discomfort, I was able to embrace concurrency; after all, our world itself runs in parallel. Writing multithreaded programs is harder, but once you master a few concurrency primitives, such as mutexes, semaphores, and channels, the multithreaded world turns out to be quite interesting. In addition, newer programming languages such as Go and Rust have made concurrency problems easier to handle. Given an input, you can still get the output you want, but it is much harder to be certain of that than before.
But that didn't last long either. After getting comfortable with multithreading, my confidence was shattered again when I entered the world of distributed systems, where everything is uncertain. Given an input, I cannot know what result I will get, because I don't know whether the remote execution actually succeeded. And the unknown is what scares humans the most. I don't know when my system will hit a network failure, when a disk will suddenly fail, or whether the whole data center will suddenly go down. For example, here is a problem we ran into in practice:
Users running TiDB on a local cloud vendor's machines reported that read latency increased at irregular intervals. We looked at the monitoring and found that the only anomaly was a sudden drop in Cached memory during those periods. I had no idea what was going on. It turned out that a bug in the cloud vendor's O&M monitoring script occasionally hot-plugged the disk, which flushed the existing page cache, so most of TiDB's reads during that period had to go back to the disk.
As you can see, a distributed system is a very complex system, and failures are everywhere. So how do we survive in such a complex distributed world? Today, one good answer is chaos engineering.
Chaos engineering
Rather than worrying about what might happen to the system, it is better to simulate those failures in advance and see whether our system tolerates them and keeps providing service. Of course, we do not simply power off machines or unplug network cables in the production environment. Chaos engineering has a set of guidelines and a standard experimental procedure, which are described in the Principles of Chaos Engineering.
In short, to do a chaos experiment, we only need to do the following four steps:
1. Define the steady state of the system. The steady state is how the system behaves under normal conditions, measured by metrics such as QPS and latency.
2. Divide the system into an experimental group and a control group, and state a hypothesis. For example: if I introduce a fault in the experimental group, the experimental group still maintains the steady state.
3. Introduce real-world faults into the experimental group, such as unplugging a network card.
4. Verify the hypothesis from step 2. If the steady state of the experimental group differs from that of the control group, our system cannot tolerate the fault from step 3, and we need to improve it.
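The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Chaos Mesh® works internally: `SteadyState`, `summarize`, `hypothesis_holds`, and `run_experiment` are hypothetical names, the 10% tolerance is an arbitrary choice, and the fault injection and measurement are passed in as plain callables.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class SteadyState:
    qps: float
    latency_ms: float


def summarize(samples):
    """Step 1: reduce raw (qps, latency_ms) samples to a steady-state summary."""
    return SteadyState(
        qps=mean(s[0] for s in samples),
        latency_ms=mean(s[1] for s in samples),
    )


def hypothesis_holds(control, experiment, tolerance=0.10):
    """Step 4: the experimental group may be at most `tolerance` (a fraction)
    worse than the control group on every metric."""
    qps_ok = experiment.qps >= control.qps * (1 - tolerance)
    latency_ok = experiment.latency_ms <= control.latency_ms * (1 + tolerance)
    return qps_ok and latency_ok


def run_experiment(measure, inject_fault, recover, tolerance=0.10):
    """Steps 2 to 4. `measure(group)` returns samples for "control" or
    "experiment"; `inject_fault` and `recover` are supplied by the caller."""
    inject_fault()  # step 3: the fault only affects the experimental group
    try:
        control = summarize(measure("control"))
        experiment = summarize(measure("experiment"))
    finally:
        recover()  # always undo the fault, even if measuring fails
    return hypothesis_holds(control, experiment, tolerance)


if __name__ == "__main__":
    # Made-up samples standing in for real monitoring data.
    fake = {
        "control": [(1000, 20), (1020, 22)],
        "experiment": [(990, 21), (1005, 23)],
    }
    ok = run_experiment(lambda group: fake[group], lambda: None, lambda: None)
    print("hypothesis holds:", ok)
```

In a real setup, `measure` would query your monitoring system and `inject_fault` would start the actual fault; the comparison logic stays the same.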
These steps look simple, but doing chaos experiments well in practice is still difficult, mainly in the following respects:
- Automation. We need an automated system for fault injection, hypothesis verification, and so on.
- Introduce as many different faults as possible. Faults in a real environment are many and varied, and far less simple than a pulled network cable, so the more kinds of faults we can introduce, the better.
- The system under test should be unaware of the experiment. If every chaos test requires the business system to cooperate, for example by adding chaos-related code for the test to call, or by changing how the system is deployed, then the tests and the system become tightly coupled, which is what we want to avoid.
Hello Chaos Mesh®!!
So, to make chaos experiments easier to do well, we developed Chaos Mesh®, a cloud-native chaos engineering platform built on Kubernetes. The Chaos Mesh® architecture is as follows:
Compared with other Chaos platforms, Chaos Mesh® has the following advantages:
- Based on K8s. As long as your system runs on top of K8s, you can integrate Chaos Mesh® seamlessly without changing any business code, so the system under test is truly unaware of the experiment.
- A variety of fault injection. Chaos Mesh® can inject faults into networks, disks, file systems, and operating systems. Later, we will also provide the ability to inject chaos into K8s itself and into cloud services.
- Easy to use. You don't need to care about the low-level implementation details of Chaos Mesh®. Just configure the chaos experiment with YAML, and the rest of the experiment runs automatically. We also provide a dashboard, so you can run experiments from the web.
- Observability. The Chaos Mesh® Dashboard lets you observe your system, see which tests were run and when, and see how your system is performing. Of course, a bit of configuration is involved: you need to tell Chaos Mesh® how to get the steady-state metrics of your system. For example, if your system uses Prometheus, you can tell Chaos Mesh® how to query Prometheus for the relevant metrics.
- Strong open source community support. The Chaos Mesh® community has grown very quickly, and we are very happy to see that most of the features are supported by the community and have a large number of users. You don't have to worry about running into problems you can't solve. You might, however, have to worry that an experiment with Chaos Mesh® completely wipes out your data, so you have to control the blast radius of your experiments. That is one of the principles of chaos engineering.
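To give a feel for what "configure the chaos experiment with YAML" means, here is a minimal sketch that builds a network-delay experiment as a Python dict and prints it as JSON, which `kubectl` accepts in place of YAML. The field names follow the `chaos-mesh.org/v1alpha1` `NetworkChaos` resource as I understand it, so check them against the documentation for your installed version. The helper `network_delay_manifest`, the experiment name, the namespace, and the `app` label are all made up for illustration.

```python
import json


def network_delay_manifest(name, namespace, app_label, latency="10ms", duration="30s"):
    """Build a NetworkChaos manifest that delays traffic for one matching pod.

    Assumption: field names as in the chaos-mesh.org/v1alpha1 NetworkChaos
    resource; verify against the version you actually run.
    """
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "delay",
            "mode": "one",  # pick a single pod, which keeps the blast radius small
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
            "delay": {"latency": latency},
            "duration": duration,
        },
    }


if __name__ == "__main__":
    # Hypothetical names: a "web" app running in a "chaos-test" namespace.
    manifest = network_delay_manifest("web-delay", "chaos-test", "web")
    print(json.dumps(manifest, indent=2))
```

Saved to a file, the output could be applied with `kubectl apply -f`. Note that nothing in the target application has to change, which is the "unaware" property described above.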
A Chaos experiment?
Before you start a chaos experiment, two conditions must be met:
- Your own business runs on top of K8s.
- Chaos Mesh® is installed on that K8s cluster.
Before we start the experiment, I also want to emphasize a few things about chaos experiments. You may think I'm being long-winded, but you can never be too careful: if you are not, you can lose your data.
- If you are applying Chaos Mesh® to your system, be sure to use it in a test environment first. Your system is probably still fragile, and experimenting on it in production is dangerous.
- In a production system, keep the blast radius and the impact of every test under tight control. For example, start with the users on a single street, then expand to a district, and then to a whole city. If the blast radius is large from the very start, one careless moment could get you fired the next day.
- A chaos experiment is definitely not a random experiment. It has a purpose and needs a plan; we do not inject random faults into the system aimlessly. We should first ask ourselves: "To gain more confidence in how the system behaves under chaos, which chaos experiment is the most valuable?" In other words, we need to know our system well and run high-leverage chaos experiments.
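The advice about widening the blast radius step by step can be sketched as a small loop. This is a generic illustration, not a Chaos Mesh® feature: `pick_targets` and `expand_blast_radius` are hypothetical helpers, the stage percentages are arbitrary, and `run_stage` stands in for whatever runs one experiment and reports whether the steady state held.

```python
import math


def pick_targets(instances, percent):
    """Return the first `percent` percent of instances (at least one, if any exist)."""
    count = max(1, math.ceil(len(instances) * percent / 100))
    return instances[:count]


def expand_blast_radius(instances, stages, run_stage):
    """Run `run_stage(targets)` for each stage, smallest first.

    Stops at the first stage where the steady state does not hold and
    returns the list of percentages that passed.
    """
    passed = []
    for percent in stages:
        if not run_stage(pick_targets(instances, percent)):
            break  # do not widen the radius after a failure
        passed.append(percent)
    return passed


if __name__ == "__main__":
    pods = [f"pod-{i}" for i in range(100)]
    # Pretend the system tolerates faults on up to 10 pods at once.
    print(expand_blast_radius(pods, [1, 10, 50], lambda targets: len(targets) <= 10))
```

The point of the design is that a failure at a small stage prevents the larger stages from ever running.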
OK, now you are ready to start your journey with Chaos Mesh®. Because it is so easy to use, you only need to follow the user guide to get started, so I won't go over the details. If you still have problems, please feel free to open an issue for Chaos Mesh®. Trust me, the Chaos Mesh® community will be more than happy to help you out.
Conclusion
With the rise of Service Mesh, Serverless, and similar ideas, our systems are becoming more and more distributed. This simplifies the implementation of each individual module, but an overly distributed system can also become more complex as a whole. For keeping a system running normally and stably in such a complex environment, chaos engineering is a very good choice.
There are many platforms out there that support Chaos engineering, but I still recommend Chaos Mesh® because it gives you a huge boost of confidence in your system.
Finally, welcome to the world of complex distributed systems.