Today I will talk about how we use Go in TiDB. Let me introduce myself first. I started my own company in the infrastructure space in 2012, but it failed. Then I joined the infrastructure group at 360 to work on open-source MySQL middleware. It became clear that middleware was a limited solution, so we started exploring whether something like NewSQL was needed. Then I found out that TiDB was doing exactly that, so I joined. Today I'm going to focus on TiDB, from testing to optimization.
Figure 1
You know, it's hard to build a database: a standalone database is hard enough, and a distributed database with SQL and transactions is even harder. One of the hardest things for us with TiDB is making sure we write correct code. Internally, we divide testing into several layers: the simplest unit tests, plus integration tests, which run constantly. For example, when we submit code to GitHub, we run our unit tests and integration tests; some suites run daily, others weekly. Beyond that, we have a test platform called Schrodinger. Finally, we'll talk about how we do TiDB memory optimization. The flow of this talk mirrors the flow of building software: first make it right, then make it fast.
We are a to-B company, which is different from consumer Internet companies: our cost of trial and error is very high. We can't use customers as testers or rely on a gray-release strategy after going live the way Internet companies do, so we invest heavily in testing internally.
Figure 2
This is an overview of our main products. TiDB can be thought of as the SQL engine; it sends requests down to the storage layer, which is split into two parts: TiKV, our row store, and TiFlash, our column store. PD is the component that handles data scheduling. As you can see, Go is used in every component except storage, including PD, TiDB, Binlog, and the surrounding Kubernetes tooling, Schrodinger, and so on.
Figure 3
Let's talk about the TiDB SQL layer architecture. Starting from the left: a SQL statement arrives, we parse the protocol and parse the SQL into an AST, then rewrite it into a Logical Plan, the tree structure we use internally. Then we do two things to this tree. First, logical optimization, which applies relational-algebra equivalences without deciding which execution path to take. Then comes physical optimization, which finds a good path, for example whether to use an index or scan the table directly; cost is computed from the summary information in the statistics to select the path. After that comes the distributed executor, and of course some other modules. The SQL layer is a very complex thing, and implementing all of SQL involves many combinations of logic. So how do we test something this complex?
Figure 4
We have encountered testing problems in the distributed system that can appear in any module, not just our own code: disk failures, kernel bugs, network partitions, CPU problems, even clock problems we have hit. At the software level there are even more: bugs in our own code, bugs in the protocol stack. What should we do?
Figure 5
We introduced a framework for random error injection, called the Schrodinger platform. With configuration files you select the cluster you want to build, for example how many CPUs, how much memory, how many TiDB nodes and PD nodes to run, and it brings up the configured cluster for you. As these tests run, we inject random errors, and through the platform we have found a lot of problems we never noticed when testing by hand.
Figure 6
This is our interface. We can start a test with "Create New Box", and if something goes wrong it reports back to us. There are many, many tests. Schrodinger is a platform for randomly injecting errors that are unpredictable, and it does a lot of random testing for us. So how do we inject deterministic test scenarios?
Figure 7
We inject instrumentation points into the Go code. A failpoint lives in a normal function that by default returns the default string below it: the injection site is a comment, and at the end there is a return of SomeFuncStringDefault. If we activate the failpoint, the function returns the injected value instead. With this we can inject any error into a critical path, including random and probabilistic triggering. Why does something like Failpoint exist? Because some scenarios have been confirmed to trigger problems, for example issues hit by customers or found by Schrodinger. Once we have root-caused such a problem, we need to add it as a deterministic test to ensure it never happens again after the fix (regression testing), and for that we turn to failpoints. On one hand, random problems are hard to reproduce, so we use this method to reproduce them; on the other hand, some scenarios we think up simply cannot be injected randomly through the Schrodinger platform. Failpoint is therefore a complement to Schrodinger testing.
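As a sketch, a gofail-style annotation matching this description looks roughly like the following (SomeFuncString is the illustrative name from the slide):

```go
func someFunc() string {
	// gofail: var SomeFuncString string
	// return SomeFuncString
	return "SomeFuncStringDefault"
}
```

When the failpoint is activated, the tool turns the two comment lines into real code, so the function returns the injected value instead of the default.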
Figure 8
We started with error injection in Go a long time ago, using etcd's gofail. Have you used it? Failpoint originated in FreeBSD and was written in C; gofail is a port of the idea to Go. It takes the form of comments that, when the failpoint is triggered, are turned into actual code.
Figure 9
Let's look at an example. This is a piece of code in TiDB. Suppose transaction commit is a hotspot for problems: I inject an error here to see whether the system handles it correctly. As you can see, gofail code is actually not very readable as written. What happens if the logic gets more complex? I doubt I could read it in ten minutes.
Figure 10
So we asked: what's wrong with gofail? First, gofail can only be written one way, and static analysis tools like an IDE can't analyze it, because until it is activated it's just a comment, not code. Also, the usual workflow is to write the gofail comment, convert it to real code, test, and then convert it back, which is not a convenient process. Another point: gofail failpoints are global. If I inject a failpoint, any code path that reaches it may trigger it, which may not be what I want; for a concurrent test I have to manage the state myself.
Figure 11
If you change the converted code by accident and want to go back to the comment form, you can if you know the change history; but if you don't, it's very cumbersome, and you have to dig out the previous code. To address these issues, we implemented our own Failpoint, which is now open source.
Here’s how PingCAP’s Failpoint works.
Figure 12
TiKV is written in Rust, and writing failpoints in Rust is easy because Rust has macros, so failpoints can be switched on and off at compile time. Go has no macros and no compiler plugins, and build flags are neither elegant nor easy to write. Let's start with our design principles for Failpoint. The most important one: a failpoint must not affect normal code; we want failpoint code to read like normal Go code, the way a Rust macro invocation does. Look at the code below: it is the simplest injection of an error with Failpoint. The first argument is the failpoint name, which tells us which failpoint to fire; names are qualified by package, which isolates the failpoint namespace between packages. But the code as written above cannot, by itself, actually inject code into your program.
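A minimal sketch of the marker form, along the lines of the pingcap/failpoint README (the failpoint name and the surrounding function are illustrative):

```go
package txn

import (
	"fmt"

	"github.com/pingcap/failpoint"
)

// commitTxn is an illustrative wrapper around a transaction commit.
func commitTxn() error {
	// When "mock-commit-error" is enabled, the rewritten code returns
	// the injected error instead of running the real commit path.
	failpoint.Inject("mock-commit-error", func(val failpoint.Value) {
		failpoint.Return(fmt.Errorf("injected commit error: %v", val))
	})
	// ... real commit logic would go here ...
	return nil
}
```

In a test, it would be activated with failpoint.Enable, passing the fully qualified name and a term such as return("boom"), and turned off again with failpoint.Disable.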
Figure 13
Let me show how this is changed into the converted form below, which checks whether your failpoint is enabled and, if so, transfers control to the injected branch. We parse the Go source file into a set of AST structures until we reach the CallExpr of the marker function; we rewrite it into the if-statement form, and our modified AST tree, written back to the file, becomes the code below.
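Roughly, the rewrite turns the marker call into an if statement; a before/after sketch of the same site as above (the generated identifiers such as _curpkg_ may differ by version):

```go
// Before (marker form, as the developer writes it):
failpoint.Inject("mock-commit-error", func(val failpoint.Value) {
	failpoint.Return(fmt.Errorf("injected commit error: %v", val))
})

// After the rewrite tool converts the file (approximately):
if val, _err_ := failpoint.Eval(_curpkg_("mock-commit-error")); _err_ == nil {
	return fmt.Errorf("injected commit error: %v", val)
}
```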
Figure 14
The function we inject is really just a marker, mainly for our convenience when we parse: it identifies the rewrite site.
If you wrote a plain Break directly inside the closure, after rewriting it would break out of the wrong scope, so you can only use the marker functions.
Figure 15
And as I said, for the parallel tests we run, we can use the Context: we hook the Context and put conditions in the hook, for example only allowing certain contexts to fire the failpoint, so it won't fire in any other test case. Then you pass the normal Context through, and you can run tests concurrently.
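A sketch of the context-controlled variant, assuming the WithHook/InjectContext API of pingcap/failpoint (the failpoint name is illustrative):

```go
package txn_test

import (
	"context"

	"github.com/pingcap/failpoint"
)

// newTestContext returns a Context whose hook only lets this test
// case's failpoint fire; other test cases running concurrently pass
// their own contexts and are unaffected.
func newTestContext() context.Context {
	return failpoint.WithHook(context.Background(), func(ctx context.Context, fpname string) bool {
		return fpname == "mock-commit-error" // illustrative name
	})
}
```

In the production code, the site is then written as failpoint.InjectContext(ctx, "mock-commit-error", func() { ... }), so it only fires when the hooked Context is passed in.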
Figure 16
On the left-hand side we put a failpoint into the code, and it looks like the left; on the right-hand side is the rewritten version, and you can see there is a label on it. If we didn't add the label, the break or continue would be captured by the surrounding rewritten code, so we use a label marker function for that. As I just said, the rewriter parses CallExprs, so a failpoint can be injected anywhere a function call can appear; and not only there, we put failpoints in so many test-case contexts that you would never guess where we inject them. Written like this, the whole thing is much more readable: the IDE recognizes it as code, so even syntax errors are caught directly by the IDE. Before, gofail was inconvenient to write, which made people reluctant to add more failpoints; now that it is convenient, people are more willing to do it.
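A sketch of the loop-control markers, simplified from the failpoint docs (doWork is an illustrative placeholder):

```go
failpoint.Label("outer")
for i := 0; i < 10; i++ {
	failpoint.Inject("break-outer-loop", func() {
		// Rewritten into a real `break outer` against a real `outer:` label,
		// so it escapes the intended loop rather than the rewritten if-block.
		failpoint.Break("outer")
	})
	doWork(i)
}
```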
Figure 17
Figure 18
Does injecting these functions affect final performance? The marker functions just listed are empty functions; the closure argument on the right is never executed in a normal build. The reason they exist at all is so that our parser can recognize, find, and mark the rewrite sites. If you look at the assembly, you will find the calls to the empty functions are simply eliminated. The fundamental issue is that Go has no macros; in C this would be easy to do with macros, but in Go we just have to write it as elegantly as we can. That covers Failpoint, and I recommend you try it out.
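A sketch of why the markers are essentially free (simplified from the shape of the failpoint marker functions):

```go
package failpoint

// Empty markers: the compiler inlines and eliminates these calls, and
// the closure passed to Inject is never executed in a normal build.
func Inject(fpname string, fpbody interface{}) {}
func Return(results ...interface{})            {}
func Break(label ...string)                    {}
```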
Next: how do we detect goroutine leaks internally? Goroutine leaks are not very common, but when they happen they are a major production incident, and they are not easy to detect. So when we write code, we try to find these problems early, before they escalate.
Figure 19
What would cause a Goroutine leak?
Figure 20
In the middle we start a goroutine that reads from a channel. If that channel is never closed, because someone forgot to Close it or because no data ever arrives from the other side, the goroutine blocks forever. Logically, when the work is done we should Close the channel and the goroutine exits; that is the correct flow. But if the code is not written properly and you forget to Close, the goroutine leaks.
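A minimal illustration of that leak pattern (hypothetical code):

```go
package main

import "fmt"

func startWorker() chan<- int {
	ch := make(chan int)
	go func() {
		for v := range ch { // blocks forever if ch is never closed
			fmt.Println(v)
		}
	}()
	return ch
}

func main() {
	ch := startWorker()
	ch <- 1
	// Forgot close(ch): in a long-running process the worker
	// goroutine never exits and is leaked for good.
}
```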
We use a runtime function to get the stacks of all currently running goroutines, and we filter out the ones we consider normal. Before a test, we record all goroutines running at that point. Normally, after the test cases finish, all goroutines they started should have been reclaimed. If we call runtime.Stack again after the run and see a new goroutine, it is probably a leaked one, and then we don't let the change pass our CI: the PR can't be merged.
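A hedged sketch of the check, modeled on the testleak-style helpers used by etcd and TiDB (the filter list here is illustrative, not the real one):

```go
package testleak

import (
	"runtime"
	"strings"
)

// interestingGoroutines returns the stacks of goroutines that are not
// part of the runtime or the test framework; calling it before and
// after a test and diffing the two sets exposes leaked goroutines.
func interestingGoroutines() []string {
	buf := make([]byte, 2<<20)
	buf = buf[:runtime.Stack(buf, true)]
	var gs []string
	for _, g := range strings.Split(string(buf), "\n\n") {
		stack := strings.TrimSpace(g)
		if stack == "" ||
			strings.Contains(stack, "testing.RunTests") ||
			strings.Contains(stack, "runtime.goexit") ||
			strings.Contains(stack, "interestingGoroutines") {
			continue
		}
		gs = append(gs, stack)
	}
	return gs
}
```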
Figure 21
I've described the testing methods we use. In practice, our distributed database still hit some minor problems with customers, such as compatibility problems. We can only prevent these with more kinds of testing, and I believe this is the only direction to work in; after all, there will always be problems that get overlooked. When you find a new problem, be sure to add a corresponding test, and that is what reduces problems over time. In fact, between 2.1 and 3.0 we did a lot of this: we spent a great deal of time on testing, including building the Schrodinger platform and the internal CI platform, and we put a lot of machine and human resources into it.
Figure 22
Moving on: we don't start on performance optimization until we have reached this kind of stable state. Here is one of the big optimizations we introduced in 2.0.
Figure 23
Look at this table: we have four columns, each with a different type, and logically three rows look like the figure. Imagine how Go could represent one row at a time: what data structure represents a row? There are many options. In the past, our representation looked like this (Figure 23): each value in a column is a Datum struct, which contains a field K saying which type the value is; if it is a complex type, it goes in the field below. This is the first design that comes to mind, because it easily expresses that many types directly. The simplest alternative is interface{}, but interface{} does not perform well. So what's wrong with Datum? As we just saw, it carries a lot of unnecessary space overhead: a one-byte integer stored in a Datum takes tens of bytes, a huge waste of space. There is another problem: for a compound type, reading the Datum may require a type assertion. Furthermore, when computing over a column, every access jumps around between array elements, so CPU cache misses are severe; this is not CPU-friendly.
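A simplified sketch of the row-based Datum representation (an assumption reduced from TiDB's actual Datum, which has more fields):

```go
// Datum is a fat, type-tagged value container.
type Datum struct {
	k byte        // kind: which type this value holds
	i int64       // stores ints, uints, bit-cast floats, durations
	b []byte      // stores strings and byte slices
	x interface{} // complex types (decimal, time, ...) need a type assertion
}

// One row = one []Datum: even a 1-byte integer pays for the whole
// struct, tens of bytes including the slice and interface headers.
type Row []Datum
```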
Figure 24
Most of our database computations are performed over columns, and computations between columns can run in parallel and be output together. So how does our code optimize this problem?
Figure 25
Here is the Apache Arrow memory representation, a binary format. On the right is an overview diagram (Figure 25). A length field tells us how many elements the array holds; in the example below there are four elements, so the length is 4. There is also a NULL bitmap indicating which positions are NULL, mainly to save memory. Finally, the values themselves are stored in a binary array. This format is used by many AP database formats.
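An illustrative sketch of an Arrow-style fixed-width column holding [1, NULL, 2, 4]; the field names are assumptions, not the exact Arrow spec:

```go
type arrowArray struct {
	length     int    // 4 elements
	nullBitmap []byte // 0b1101: bit i set => element i is non-NULL (bit 1 is the NULL)
	values     []byte // four fixed-width 8-byte slots, packed contiguously
}

// isNull checks the bitmap; the NULL element keeps its slot in values
// so that indexing stays O(1).
func isNull(a *arrowArray, i int) bool {
	return a.nullBitmap[i/8]&(1<<uint(i%8)) == 0
}
```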
Figure 26
This gives a fairly compact memory representation. So how do we turn Arrow's memory representation into Go code? In TiDB we call it a Chunk, and a Chunk holds an array of columns, where each element is a Column. As you can see (Figure 26), the values I've marked with the same color belong to the same Column: all of column A's values 1, 2, 3, 4 are stored together in that Column's Data buffer. If the values are fixed-length, we don't need the Offsets array at all, which saves space. Column B is another Column object, and together the Columns form the in-memory representation. A Chunk contains everything for its batch of rows, but physically it is laid out column by column.
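A sketch of the Chunk/Column layout (an assumption simplified from TiDB's util/chunk package; the real structs carry more fields):

```go
type Column struct {
	length     int
	nullBitmap []byte  // bit i clear => row i is NULL
	offsets    []int64 // value boundaries; omitted for fixed-length types
	data       []byte  // every value of this column, packed contiguously
}

// A Chunk is a batch of rows stored column by column.
type Chunk struct {
	columns []*Column
}
```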
Figure 27
You might ask how a database could store billions of rows this way. We don't store all the data in one Chunk; we set a size, for example a Chunk holds at most 100 rows, and the size can be adjusted. As you know, one of the biggest performance killers in Go is constantly allocating pointer-bearing objects: the GC impact is huge and easy to overlook when writing code. In a database, though, the problem is obvious, and this approach brings a big performance boost and saves a lot of GC time in such scenarios.
Figure 28
An important advantage of keeping values in binary arrays is that you don't need a type assertion for each element of a complex structure; instead, you can reinterpret the binary data with unsafe, which is a very common operation here. A binary array can be viewed as any fixed-width type, and in Go you can do this with the unsafe package, which is a big boost to efficiency.
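Using the Column sketch above, a hedged example of the zero-copy reinterpretation (TiDB's real code differs in detail):

```go
import (
	"reflect"
	"unsafe"
)

// Int64s views the raw byte buffer as a []int64 without copying and
// without a per-element type assertion.
func (c *Column) Int64s() []int64 {
	var res []int64
	src := (*reflect.SliceHeader)(unsafe.Pointer(&c.data))
	dst := (*reflect.SliceHeader)(unsafe.Pointer(&res))
	dst.Data = src.Data
	dst.Len = src.Len / 8 // 8 bytes per int64
	dst.Cap = dst.Len
	return res // caller must keep c alive while using the view
}
```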
Figure 29
Another thing worth mentioning is the vectorized implementation. In the past, our expression evaluation (Figure 28) worked one row at a time: to compute A + C, I fetch A and C of the current row from the Datum array, compute, then move down to the next row's data and jump again, switching back and forth across arrays. If the data is large, the CPU keeps evicting the arrays from cache, which is a huge overhead, and it cannot be parallelized. If instead we work the way Chunk does, column by column, then to compute A + C I start from the A array and the C array, scan them once, compute all the results, and write them into an output array to get the final result. This is the basic idea of vectorized execution.
TiDB implements this, and our TiKV module supports vectorization as well. Here is the TiDB code (Figure 29) where vectorized execution happens: the calculation is usually done column by column. If you step into it, you can see each column has an iterator; every Next only advances the subscript, but it keeps accessing the same underlying array.
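A minimal sketch of a vectorized A + C over two int64 columns (illustrative; the real vectorized builtins also handle NULLs and selection vectors):

```go
func vecAddInt64(a, c, result *Column) {
	xs, ys, out := a.Int64s(), c.Int64s(), result.Int64s()
	// One tight loop over contiguous arrays: cache-friendly and easy
	// for the CPU to pipeline, unlike row-at-a-time Datum evaluation.
	for i := range xs {
		out[i] = xs[i] + ys[i]
	}
}
```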
That's all I'll share today about code and practice. But I also want to share some lessons we've learned in the process of building a database, software that is quite different from Internet services.
Figure 30
The first thing to say: make it right before you make it fast, and worry about performance optimization afterwards. Performance optimization tends to be close-quarters work: you deal directly with memory or with the operating system, and once you introduce that kind of operating-system detail into your code, it couples your code and makes it more complex. That is definitely not good for the stability of your code or your tests, and it is not easy to do well at the beginning. So we only consider making such a change once the surrounding facilities are complete, because only then can we be confident the change is covered by regression tests and won't break earlier code. The Chunk work is just the obvious example; there are others I haven't shared.
In addition, during testing we see many phenomena that don't meet our expectations, and if we didn't investigate them carefully, they would often be ignored. Now, when there is an error, the Schrodinger platform mails it to you directly, forcing you to read it, and only once the matter has been investigated and there is a conclusion can work continue. Over the past year or two, this workflow has uncovered many bugs that had been hidden in our code for years. The symptom often looks like an ordinary network fault, and because people consider network or storage failures common, such errors are easily ignored; but they may conceal many problems, and in fact it is often a bug in the code.
Another lesson: testing is a practical matter, and you have to work out what kind of testing fits your system. There are many theories out there now, including Chaos engineering and other methods, but all of them require a deep understanding of your own system followed by customization. Just as you might think of injecting random errors, you should also consider testing the upgrade path.
Some of our customers hit problems not before but after an upgrade, because an upgrade changes behavior; corresponding tests should be run around upgrades to ensure that after upgrading, old data still works well on the new cluster. Compatibility testing is a big topic, including whether performance regresses, and so on.
There is also stress and parallel testing. A lot of problems occur at boundary conditions: whatever the module, when it reaches a boundary, problems you didn't anticipate may appear, so put pressure on each module and check that it behaves correctly.
Another category of testing that I think is important is stability testing. Stability testing means that if you grow a cluster from scratch while writing at the scale of hundreds of terabytes or even petabytes, write and read latency must not degrade noticeably as the volume scales. Another aspect: for a cluster whose capacity is fairly stable and whose workload is mostly reads, as long as traffic does not grow, reads must remain stable. That is stability testing.
And since our database chooses plans by cost estimated from statistics, the execution plan may change as the cluster runs. Whether those changes cause problems for customers is something we also need to add tests for.
Q&A
Question: Hello, how do you monitor this in a production environment?
Yao Wei: You usually don't watch goroutines directly; you usually watch memory.
Question: Memory leak?
Yao Wei: It's much like a goroutine leak: by the time you find out, a lot has already leaked. There are two metrics: goroutine count and memory size. The count alone can be misleading, because under heavy load there are naturally many goroutines. But if your memory keeps going up and never comes back down, that's definitely some kind of leak, and memory itself can leak as well.
Question: Is the goroutine leak check accurate?
Yao Wei: We run CI continuously; every change runs CI, so there is a high probability a leak will surface. Once this problem shows up, we definitely investigate it. If it's a false positive, we fix the test case; if it's a real leak, you have to reclaim the goroutine.
Question: May I ask, is our database a distributed database?
Yao Wei: Yes.
Question: In a distributed database, how is a modification on one node synchronized to another?
Yao Wei: The data actually lives in TiKV, which is a stateful cluster. A write is not copied to all storage nodes as you suggested; it is written to a Region and then replicated to the other replicas by the Raft protocol.
Question: For that replication you use the Raft protocol; do you use distributed locks?
Yao Wei: Our implementation is optimistic, a two-phase commit. You can think of it as a lock, but it is not a blocking lock in the traditional sense.
Question: Two questions about testing. First, Failpoint is very interesting and I plan to try it out; do you have suggestions on where to add failpoints, since you've added them in many places? Second, does your platform randomly generate combinations of different configurations and then turn them into test cases? Could the simplest part be open-sourced for us to use?
Yao Wei: That's something we're working on. If you look at our test code in TiDB, it's easy to search: search for the keyword "failpoint" and you'll find all the code we inject. We have a PR that replaced gofail with Failpoint. As for code intrusion, I'd rather write more test code than have the problem found for me by a customer.
Question: Following up on the earlier question: since you have unit tests, how do you write failpoint tests? Do you focus on the UT level or on integration tests?
Yao Wei: With gofail, failpoints can be triggered through an HTTP interface, and Failpoint can be triggered via its API in integration tests. The HTTP interface can be called before a test starts, which is fairly complete and also works for integration testing.
Question: Is the expected failure result written manually?
Yao Wei: You know your plan. The expectation is written into the test case, knowing that this case is going to fail.
Question: Have you considered the possibility that the problem is in the test itself rather than the product code? That seems quite common, and such cases are not easy to reproduce, are they?
Yao Wei: I haven't really thought about that; I didn't quite follow you.
Question: Hello, suppose the data in a column is all single characters like A, B, C, D, and suddenly one value in a row grows longer, say A becomes the string ABC. Do you have to rewrite the Chunk?
Yao Wei: We only use Chunk for reads; modifications don't actually go through Chunk. Modifications access the KV interface directly instead of Chunk. Chunk is used when computing functions, where fetching data takes a lot of memory; we don't need it on the write path. Writes go through KV.
Question: To extend that: when I modify, rather than regenerating a Chunk, you don't write this table directly; you write KV, and KV writes the data again?
Yao Wei: Right. If you need to modify column data, the Chunk structure is not very suitable.
Question: I'd like to know about TiDB's computation model. When PD handles a complex data query, does it have cluster-wide computing power?
Yao Wei: Why would PD handle complex computation?
Question: I don't know much about it; my previous impression was that PD handles data queries.
Yao Wei: TiDB does the query. Go on.
Question: I just want to know whether TiDB can use the computing power of multiple nodes when processing a very complex query.
Yao Wei: Our computations can be pushed down to the storage nodes. In addition, we now have an engine called TiFlash that can handle more complex AP queries. At present, a TiDB query is still computed on a single node; there is no MPP architecture yet.
Question: One more question: does TiDB have a solution for synchronizing with MySQL?
Yao Wei: Our binlog component emits all the changes, in a format that is not MySQL's binlog format; we can also write the changes to a downstream MySQL.
Question: So we'd parse the binlog ourselves?
Yao Wei: We already have tools that do that.