My teammates (Qinyuan Du, Yubo Han, Baoping Huang, Junpeng Man) and I won the third prize at TiDB Hackathon 2019 for our project “Root Cause Analysis of SQL Bug Based on Path Statistics”.

I once read a post on Hacker News describing an Oracle engineer’s routine for fixing a bug:

  • Spend a couple of weeks or so figuring out how 20 parameters combine in some magical way to cause the bug.
  • Change a few lines of code in an attempted fix, then submit it to the test cluster, which starts running nearly a million test cases; this usually takes 20 to 30 hours.
  • If you are lucky, a hundred or so cases fail; sometimes it is thousands. Pick a few to dig into, and discover 10 more parameters you had never noticed before.
  • After another two weeks, finally find the real parameter combination that causes the bug, run all the tests again, and add more than 100 test cases to cover the change.
  • After a month or more of code review, the change is finally merged, and work turns to the next bug…

Later the engineer sighed, “I don’t work for Oracle anymore. I will never work for Oracle again!”

Oracle 12.2 contains nearly 25 million lines of C code, and testing a system that complex is hard, hard, and arduous. Testing a distributed database is harder still: we never know what SQL users will write or how many combinations of table structures and indexes they will create, and once you add in cluster node downtime, network jitter, disk performance degradation, and similar factors, the space of possible influences is practically unlimited.

So, is there a way to have a program find the bugs for us automatically?

It seemed like a good idea, so we formed a team around it, entered TiDB Hackathon 2019, and unexpectedly won the third prize.

How do we “let the program locate bugs automatically while we sleep”?

The idea behind the project is actually very simple: if we run enough test cases and statistically analyze the code paths taken in each run, we can single out the code suspected of harboring the bug. The final result is presented visually by the front end, which colors the code, producing the effect shown in the following figure:

This is an experiment we ran on a TiDB PR during the Hackathon. The darker and brighter a block’s color, the more likely it is to contain faulty logic. The method is not limited to testing database systems; it applies to any other complex system as well.

The principle behind it

The project was initially inspired by the VLDB paper APOLLO: Automatic Detection and Diagnosis of Performance Regressions in Database Systems; our thanks to its authors from Georgia Tech and eBay. The paper focuses on diagnosing the code that causes database performance regressions, but its core ideas apply to bug detection as well. The automatic diagnosis system it describes consists of the SQLFuzz, SQLMin, and SQLDebug modules.

  • SQLFuzz: randomly generates SQL and uses binary search over the version history to locate the two adjacent versions between which the performance regression appears, then passes them to the SQLMin module (a minimal sketch of this bisection step follows the list).
  • SQLMin: uses a pruning algorithm to simplify the SQL generated by SQLFuzz down to the minimal SQL that still reproduces the problem, and sends it to the SQLDebug module. The goal is to cut extraneous code paths and reduce noise.
  • SQLDebug: instruments the source code so that it can output the code’s execution path while the SQL runs. The code paths of the two versions are then compared, and a statistical model is built to locate the problem.
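To make the SQLFuzz bisection step concrete, here is a minimal Go sketch. The versions list and the isRegressed predicate are hypothetical stand-ins for “build and run this SQL against the given version, then compare its latency with the baseline”; the real APOLLO system is considerably more elaborate.

package main

import (
    "fmt"
    "sort"
)

// isRegressed reports whether the query is slow on the given version.
// In a real system this would deploy that version, run the SQL, and
// compare its latency against the last known-good measurement.
func isRegressed(version, sql string) bool {
    // Placeholder for an actual benchmark call.
    return version >= "v3.0.5"
}

// firstBadVersion binary-searches an ordered list of versions for the
// first one that exhibits the regression.
func firstBadVersion(versions []string, sql string) string {
    i := sort.Search(len(versions), func(i int) bool {
        return isRegressed(versions[i], sql)
    })
    if i == len(versions) {
        return "" // no regression found
    }
    return versions[i]
}

func main() {
    versions := []string{"v3.0.1", "v3.0.2", "v3.0.3", "v3.0.4", "v3.0.5", "v3.0.6"}
    fmt.Println(firstBadVersion(versions, "SELECT ..."))
}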

Finally, the system automatically generates a test report, including:

  • The commit that introduced the performance regression.
  • The source file containing the problem code.
  • The exact location of the offending function.

In practice, analyzing code execution paths can get quite complex once concurrency, loops, recursion, and so on are taken into account. To make sure we could show results within the short span of a Hackathon, we also drew on another paper, Visualization of Test Information to Assist Fault Localization. Its core idea is to count how many times each code block is executed by passing and by failing test cases, and then paint the blocks in different colors according to an analysis algorithm. Simple and practical.

In fact, this idea can be applied to other fields as well, which we will expand on later. First, let’s look at how SQLDebug is implemented.

The hands-on part

How to generate test cases automatically?

Since this is a statistics-based diagnosis, we need to build as many test cases as possible, and this is best done automatically by a program. Grammar-based testing has a long history in verifying compiler correctness, and the DBMS community uses a similar approach to verify database functionality. Examples include RAGS, a system the Microsoft SQL Server team developed for continuous automated database testing, and the well-known SQLSmith project. sql-spider, another award-winning project from this year’s TiDB Hackathon, serves a similar purpose.

For now we implement SQL fuzzing with go-randgen, PingCAP’s open source random testing framework, which has users write rule files to guide the generation of random SQL test cases. A rule file consists of a number of productions. Each time, randgen takes a random walk through the productions, starting from the query production, generating one SQL statement and tracing out a path like the red line in the following figure. A small illustrative rule file is sketched below.
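To make this concrete, here is a small rule file in the randgen style. The exact go-randgen syntax and built-in placeholders may differ from what is shown; the productions below are illustrative assumptions, not copied from a real rule file.

query:
    select | insert

select:
    SELECT _field FROM _table where_clause

where_clause:
    | WHERE _field > digit

insert:
    INSERT INTO _table VALUES (digit)

digit:
    0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Each generation starts at query and randomly picks one alternative at every production until only terminals remain, which is exactly the random walk described above.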

We use the ratio of failing cases to all cases generated through each production as that production’s color value and render it on a page, which is the SQLFuzz display page. From this page you can see at a glance which productions are more likely to generate SQL that triggers errors.

Code tracing

To track the code execution path of each SQL statement at runtime, the key operation is dynamic instrumentation of the program under test. The VLDB paper mentions DynamoRIO, a binary instrumentation tool, but we were not sure whether it would work on Go-compiled binaries. So instead: what if we instrumented the source code directly, before compiling it?

Referring to the implementation of the Go cover tool, we wrote a dedicated source instrumentation tool, tidb-wrapper. It can process the source code of any version of TiDB, generate wrapped code, and inject an HTTP server into the program. Assuming the digest of an SQL statement is df6bfbff, visiting http://<tidb-server-ip>:43222/trace/df6bfbff returns that statement’s trace:

// http://localhost:43222/trace/df6bfbff

{
  "sql": "show databases",
  "trace": [
    {
      "file": "executor/batch_checker.go",
      "line": null
    },
    {
      "file": "infoschema/infoschema.go",
      "line": [[113, 113], [261, 261]]
    }
    // ...
  ]
}

Each two-element tuple in the line field gives the starting and ending line numbers of a basic block (inclusive on both ends). A basic block is a stretch of code that never branches internally, and it is the smallest granularity of our statistics. So how do we identify the basic blocks in Go code? The workload is actually considerable; fortunately, the Go source code already contains logic for exactly this, so we carved it out into go-blockscanner.
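As a quick illustration of this granularity, here is a small Go function with its intuitive basic blocks marked in comments (go-blockscanner’s exact splits may differ in detail):

func classify(n int) string {
    s := "small" // block 1: straight-line code up to the first branch
    if n > 100 {
        s = "large" // block 2: executed only when n > 100
    }
    for i := 0; i < n; i++ {
        s += "!" // block 3: the loop body, one pass per iteration
    }
    return s // block 4: the join point after the branch and the loop
}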

Because our main goal was to validate the diagnosis, we did not let TiDB execute SQL concurrently in these experiments. Under that restriction, all basic blocks executed between the invocation of the handleQuery method in server/conn.go and the moment the SQLDebug module reads the trace interface can be treated as the execution path of that SQL statement. When the SQLDebug module accesses the HTTP interface, the trace information for that SQL is also deleted, to keep memory from growing without bound.
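To give a flavor of what the injected code might record, here is a hedged sketch. Names such as trace.Hit and trace.Flush are our own inventions for illustration, not tidb-wrapper’s actual API.

package trace

import "sync"

var (
    mu   sync.Mutex
    hits = map[string]map[int]struct{}{} // file -> set of executed basic block IDs
)

// Hit is the kind of call a wrapper would inject at the head of every
// basic block.
func Hit(file string, blockID int) {
    mu.Lock()
    defer mu.Unlock()
    if hits[file] == nil {
        hits[file] = map[int]struct{}{}
    }
    hits[file][blockID] = struct{}{}
}

// Flush returns the blocks executed since the last call and resets the
// recorder, mirroring how the /trace endpoint deletes data after a read.
func Flush() map[string]map[int]struct{} {
    mu.Lock()
    defer mu.Unlock()
    out := hits
    hits = map[string]map[int]struct{}{}
    return out
}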

Basic block statistics

After obtaining the basic block information for each SQL statement, the SQLDebug module builds the following visual model for each basic block.

First, color: the higher the proportion of failing cases among all the cases that pass through a basic block, the darker the block is drawn.

Then, brightness: the higher the proportion of all failing cases that pass through the basic block, the brighter the block is drawn.

Why have a brightness metric when there is already a color metric? The brightness metric compensates for a bias in the color score. For example, if a code path is executed by exactly one failing case and nothing else, it receives the maximum color score of 1. Yet the path is hardly representative: with only one failing case passing through it, it is most likely not the true cause of the error. The extra brightness metric filters out such paths; only blocks that are both dark and bright are truly suspect.

The two models above come mainly from the Visualization paper mentioned earlier. We also devised our own metric for ordering files: the greater the density of failing cases in a file (measured over its basic blocks), the higher the file ranks. A rough sketch of all three metrics follows.
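Here is a minimal Go sketch of the three metrics as described above. The color and brightness scores follow the prose directly; the file-density formula shown (averaging the failure share over a file’s blocks) is our own reading of the description, not necessarily the exact formula we used at the Hackathon.

package metrics

// Block holds, for one basic block, how many passing and failing
// test cases executed it.
type Block struct {
    Passed, Failed int
}

// Color: share of failing cases among all cases through the block.
// 1.0 means every case that touched this block failed.
func Color(b Block) float64 {
    total := b.Passed + b.Failed
    if total == 0 {
        return 0
    }
    return float64(b.Failed) / float64(total)
}

// Brightness: share of all failing cases that touched this block.
// Low brightness means few failing cases actually go through it.
func Brightness(b Block, totalFailed int) float64 {
    if totalFailed == 0 {
        return 0
    }
    return float64(b.Failed) / float64(totalFailed)
}

// FileDensity ranks files: the average failure share across a file's
// basic blocks (our assumed reading of "density of failed cases").
func FileDensity(blocks []Block) float64 {
    if len(blocks) == 0 {
        return 0
    }
    var sum float64
    for _, b := range blocks {
        sum += Color(b)
    }
    return sum / float64(len(blocks))
}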

The front end takes these metrics and displays the files in the order computed above; the higher a file ranks, the more likely it is to contain the problem.

When you click to expand a file, you can see the colored code blocks:

After some simple experiments, the file-level diagnosis proved fairly accurate, while the basic-block-level diagnosis was still rough. That has a lot to do with our not implementing SQLMin; after all, SQLMin would remove a great deal of statistical noise.

What else can this do?

Reading this far, you might think the project is nothing more than automated testing for a database system. In fact, the idea of automatically debugging code offers much more inspiration.

Source code teaching

Reading and analyzing the source code of a complex system is a headache. TiDB has a 24-article source code reading series that interprets the source in prose, known as “the 24 chapters”. Could we instead build a general-purpose tool based on runtime visual tracing of source code? As the program executed, you could intuitively watch the code run, which would be a great help in understanding source code quickly. Going further, could this become an online web application that executes source code in the browser?

Full-link test coverage statistics

The unit test coverage tools that languages ship with are fairly complete, but a typical testing process also involves E2E tests, integration tests, stability tests, and so on. Could the method in this article be used to compute coverage across all these kinds of tests and integrate it with CI systems and automated testing platforms? Using the code coloring technique, we could also output a heat map of code execution; combined with profiler tools, could that help locate performance problems in the code as well?

Chaos Engineering

PingCAP runs a number of chaos testing platforms internally to verify the robustness of distributed systems, such as Schrodinger and Jepsen. The downside of chaos testing is that a failing run is hard to reproduce, so you are left guessing where the code might have gone wrong. If we recorded the code execution path while the program ran, narrowed the window using the logs and monitoring around the time of the failure, and then analyzed the code paths, we could locate the cause of a problem accurately and quickly.

Integration with distributed tracing systems

Google has a paper on Dapper, its internal distributed tracing system; the community has the well-known OpenTracing project built on the same ideas, and Apache has the similar SkyWalking project. A typical tracing system mainly tracks the invocation relationships between a user request and multiple services, helping troubleshoot problems through visualization, and its granularity is usually at the service level. If trace_id and span_id were also passed down into the instrumented code blocks, wouldn’t it be cool to drill straight from the tracing system’s interface into the source code?

Next steps

With Hackathon time limited, we only completed a very rough prototype. There is still a long way to go before bugs truly get found automatically while we sleep, and we plan to keep improving the project.

First, we should support running multiple test cases in parallel, so that enough experimental samples can be collected in a short time and the analysis results become more accurate. We should also minimize the injected code to limit the impact on the performance of the program under test, so the technique can be used in a wider range of scenarios, such as performance testing, or even in production environments.

If you have read this far you may already be impatient to try it; the full source of the project is linked below. Welcome to hack!

pingcap.com/blog-cn/sql…