Authors: Ye Ben, Yin Chengwen

Chaos Mesh® is a Kubernetes-based chaos testing tool. Chaos Mesh provides the ability to simulate system anomalies, but fault injection is only one part of chaos engineering. A complete chaos engineering practice also includes defining the steady state of the system, forming hypotheses, running experiments, verifying the results, and making improvements.

This post introduces how we built our own automated testing platform, TiPocket, on top of Chaos Mesh and Argo to run fully automated chaos tests and close the chaos-testing loop.

Why TiPocket?

To keep user data safe, we need to make sure that every version of TiDB we ship has been rigorously tested. We therefore designed a variety of exception scenarios for TiDB and implemented dozens of test cases, which means that at any moment our Kubernetes cluster may be running anywhere from a dozen to several dozen chaos experiments at once. Even with Chaos Mesh managing fault injection for us, that is not enough: we still have to manage the TiDB clusters under test, collect metrics, and analyze the results of all those concurrent experiments. On top of that, we also needed to run chaos tests against other tools in the TiDB ecosystem. Doing all of this by hand was unthinkable, so we developed TiPocket to free ourselves.

TiPocket is a fully automated testing framework based on Kubernetes and Chaos Mesh. At present we mainly use it to test TiDB clusters, but thanks to its all-in-Kubernetes design and extensible interfaces it also supports testing other components in the TiDB ecosystem. Adding support for a new application is as simple as implementing its create/delete logic on Kubernetes.

Chaos Mesh provides the ability to simulate failures

Fault injection is an essential part of chaos testing, and in the field of distributed databases there are many possible failures: node failures, all kinds of network failures, file system failures, and even kernel failures. If TiDB does not handle these exceptions correctly, the consequences are unimaginable, which is one of the main reasons we developed Chaos Mesh in the first place. Chaos Mesh integrates well with TiPocket and serves as one of its most basic dependencies, providing the fault injection that chaos testing requires.

So far we have provided several types of fault injection in TiPocket:

  • Network: Based on Chaos Mesh's NetworkChaos, simulates network partitions, as well as random packet loss, reordering, duplication, and delay on network links.

  • Time skew: Based on TimeChaos, simulates clock offsets inside the containers under test. The implementation of TimeChaos is quite interesting in its own right; you can refer to our previous articles if you are curious.

  • Kill: Uses PodChaos to kill pods. We implement a variety of kill types: the simplest randomly deletes any pod in the cluster, and there are also component-specific variants that randomly kill one or two TiKV nodes or kill the PD leader.

  • IO: Based on IOChaos, we inject heavy I/O delays into TiKV and observe how writes behave.
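As an illustration, a NetworkChaos manifest that partitions one TiKV pod from the PD pods might look like the following. This is only a sketch: the namespace, label selectors, and some field details are illustrative and may differ across Chaos Mesh versions, so consult the Chaos Mesh documentation for your release.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: partition-one-tikv
  namespace: tipocket-bank       # illustrative namespace
spec:
  action: partition
  mode: one                      # pick one pod matching the selector
  selector:
    labelSelectors:
      app.kubernetes.io/component: tikv
  direction: both                # cut traffic in both directions
  target:
    mode: all
    selector:
      labelSelectors:
        app.kubernetes.io/component: pd
  duration: "5m"
```

Applying a manifest like this with `kubectl apply -f` injects the fault for five minutes, after which Chaos Mesh restores the network.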

With fault injection solved, we next need to determine whether the system still behaves as expected after faults are injected into TiDB.

How to determine whether TiDB is normal?

To achieve this efficiently, TiPocket implements dozens of test cases and combines them with different checking tools to verify that TiDB behaves correctly. The sections below walk through several test cases to briefly show how TiPocket verifies TiDB.

Fuzz testing: SQLsmith

TiPocket creates a TiDB cluster and a MySQL instance, and uses go-sqlsmith to generate the same random SQL statements and execute them against both TiDB and MySQL. It then injects all kinds of faults into the TiDB cluster and finally compares the execution results; if the results are inconsistent, we can conclude that there is a problem somewhere in our system.

Transaction consistency test: Bank/Porcupine

1. Bank

The bank test simulates the transfer process in a banking system. In this test we create a series of simulated bank accounts, and at any time we select two accounts and use a transaction to transfer money between them: a certain amount is subtracted from one account and the corresponding amount is added to the other, with such transactions executing concurrently and continuously. Under snapshot isolation, all transfers must preserve the invariant that the total amount across all accounts stays the same at any given moment. TiDB must maintain this constraint even while various faults are being injected; once the constraint is broken, we can conclude that the system is not behaving as expected.

2. Porcupine

Porcupine is a linearizability checker implemented in Go. It is based on the P-compositionality algorithm, which exploits the locality principle of linearizability: if every sub-history of a call history is linearizable, then the history itself is linearizable. Unrelated operations can therefore be partitioned into several smaller sub-histories, and only those sub-histories need to be checked. Many of TiPocket's test cases use the Porcupine checker to examine the generated history and determine whether TiDB satisfies the constraints of linearizability.

Transaction isolation level test: Elle

Elle is a checking tool that verifies a database's transaction isolation levels. Elle is a pure black-box tool: it ingeniously constructs a test scenario, builds a dependency graph from the history generated by the client, and determines the transaction isolation level by checking whether the graph contains cycles, analyzing any cycles it finds to classify the transaction anomalies. In TiPocket, we implemented go-elle, a Go version of the checker modeled on the Elle project, and use it to verify TiDB's isolation levels.

These are only a small part of what TiPocket uses to verify TiDB's correctness; readers who are interested can explore the source code for more verification methods. Now that we have fault injection, TiDB clusters to test, and ways to verify TiDB, how do we automate all of these chaos experiments, and how do we make the best use of our resources? We'll look at how TiPocket solves this in the next section.

Argo automates the process

Like most engineers, our first instinct was to build the wheel ourselves and equip TiPocket with its own scheduling and management functions. But given our manpower and time, and knowing that many open source tools already provide these capabilities, we ultimately chose to keep TiPocket lean and entrust scheduling and management to a more suitable tool. Given our all-in-Kubernetes design, Argo was the perfect choice.

Argo is a workflow engine designed for Kubernetes. It has been open source for quite a while and quickly gained widespread attention and adoption; the well-known Kubeflow project, for example, makes extensive use of Argo. We will first introduce Argo's basic concepts, and then discuss how TiPocket and Argo fit together.

Argo abstracts workflows into several CRDs, the main ones being WorkflowTemplate, Workflow, and CronWorkflow.

  • WorkflowTemplate: a reusable definition of a workflow. You can pre-define a template for each different test task and pass in different parameters when actually running the test.

  • Workflow: an orchestration of multiple workflow templates executed in a given order; in other words, the actual running task. On top of the capabilities Argo provides, we can also express conditionals, loops, DAGs, and other complex constructs in a pipeline.

  • CronWorkflow: as the name suggests, runs a Workflow on a cron schedule, which is perfect for test tasks you want to keep running over a long period.

Let’s look at a simple example:

```yaml
spec:
  entrypoint: call-tipocket-bank
  arguments:
    parameters:
      - name: ns
        value: tipocket-bank
      - name: nemesis
        value: random_kill,kill_pd_leader_5min,partition_one,subcritical_skews,big_skews,shuffle-leader-scheduler,shuffle-region-scheduler,random-merge-scheduler
  templates:
    - name: call-tipocket-bank
      steps:
        - - name: call-wait-cluster
            templateRef:
              name: wait-cluster
              template: wait-cluster
        - - name: call-tipocket-bank
            templateRef:
              name: tipocket-bank
              template: tipocket-bank
```

The example above is the Workflow we defined for the bank test. It uses a WorkflowTemplate and parameters to specify the fault injections we need, so we can reuse the template and run multiple workflows for different test scenarios at the same time. In TiPocket, fault injection is specified through the nemesis parameter: we provide a large number of fault injections out of the box, users simply set the corresponding parameters to enable them, and of course they can extend TiPocket to add more. More examples of workflows and templates can be found in the TiPocket repository. Argo lets us handle all kinds of complex logic gracefully, and we can define our workflows the way we write code, which is very developer-friendly. This is one of the important reasons we chose Argo.
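For scheduled runs, a workflow like the one above can be wrapped in a CronWorkflow. The sketch below is illustrative (the template name and schedule are assumptions, and `workflowTemplateRef` requires a reasonably recent Argo release), but it shows the general shape:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: tipocket-bank-cron
spec:
  schedule: "0 */6 * * *"    # run every six hours
  concurrencyPolicy: Forbid  # skip a run if the previous one is still going
  workflowSpec:
    workflowTemplateRef:
      name: tipocket-bank    # hypothetical template name
```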

Now that our chaos experiments are automated, how do we locate the problem when the results don't match our expectations? Fortunately, TiDB exposes a wealth of monitoring information, but logs are essential too, so we needed a better log collection method to make our system more observable.

Loki improves the observability of experiments

Observability is a very important part of cloud-native systems. Generally speaking, observability covers metrics, logging, and tracing. Since the test cases running in TiPocket mainly target TiDB clusters, problems can usually be located from metrics and logs alone.

Prometheus has, needless to say, become the de facto standard for monitoring in Kubernetes. For logging, however, there is no universal answer. The Elasticsearch, Fluent Bit, and Kibana (EFK) stack works well, but it consumes a lot of resources and carries a high maintenance cost. In the end we passed on EFK and adopted Grafana's open source Loki project as our logging solution.

Loki uses the same label system as Prometheus, which makes it easy to correlate Prometheus metrics with the logs of the corresponding pod, using a similar query language. Grafana already supports Loki dashboards, so it is easy to display metrics and logs side by side in Grafana. And since TiDB's own monitoring stack already includes a Grafana component, we can reuse that Grafana directly.
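For example, a LogQL query that pulls error lines from all TiKV pods in a test namespace reads much like a PromQL selector with a filter appended (the label names here are illustrative and depend on how your log collector is configured):

```
{namespace="tipocket-bank", pod=~"tikv-.*"} |= "error"
```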

Let’s see what happens

Finally, let’s look at what a complete chaos experiment looks like in TiPocket.

  1. Create an Argo CronWorkflow task that defines the cluster to be tested, the faults to be injected, the test case that checks the correctness of the TiDB cluster, and the schedule of the task. While the workflow runs, you can view the case's logs in real time if necessary.

  2. Prometheus runs in the cluster, managed by prometheus-operator, with alerting rules configured for Argo workflows. If a task fails, an alert is fired to Alertmanager, which sends the result to a Slack channel.

  3. The alert contains a link to the Argo Workflow page, and from there you can click through to Grafana to find the cluster's monitoring and logs, then query them in the log dashboard.

    One inconvenience remains: Grafana's Logs dashboard currently provides no way to set the step parameter for log queries. This parameter controls the sampling of a log query (to limit the number of results) and is adjusted automatically based on the total time range being queried. For example, when querying one minute of logs, step is automatically set to 1s; when querying a whole day, it may rise to 30s, at which point some log lines are no longer displayed. It is therefore best to add as many filter criteria as possible when searching, or to use Loki's command-line tool, logcli, to download all the logs and query them locally.

  4. If the test case ends normally, the cluster is cleaned up and TiPocket waits for Argo to schedule the next test run.
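Step 2 above can be sketched as a PrometheusRule picked up by prometheus-operator. This is only an illustration: the metric name and labels below are assumptions (Argo's workflow-controller exposes workflow counts by phase as Prometheus metrics, but the exact names vary by Argo version), so verify them against your deployment before use.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: argo-workflow-alerts
spec:
  groups:
    - name: argo
      rules:
        - alert: WorkflowFailed
          # hypothetical metric/label names; check your Argo version
          expr: argo_workflows_count{status="Failed"} > 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "An Argo workflow has failed"
```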

That's how we built an automated chaos testing platform using Chaos Mesh and other open source projects. If you are also interested in chaos engineering, please join us on TiPocket and Chaos Mesh!