This article is based on huang Yan’s speech delivered at the 2018 MySQL Technology Exchange Conference in Shanghai on August 3, 2018.
Huang Yan
As the R&D director of Aikson, he is deeply engaged in distributed database related technologies, and is good at MySQL middleware products and development in the industry, as well as the application practice of distributed middleware in enterprises.
Abstract: Today I share a system observation tool that is related to, but not related to, such a topic of MySQL itself.
Sharing Outline:
1. Slow diagnosis of MySQL
2. Introduction of system observation tools
3. BCC (eBPF script set) example
4. EBPF usage method/limitation
Today I share bring system observation tools associated with MySQL, but not MySQL, the main reason why choose this topic is speaking at 4 today, just now is the official experts to introduce the MySQL new features, followed by two experts, one is to introduce the MySQL large-scale application in real business, another is to introduce the source code, There wasn’t a lot of room left for me, so I chose a topic related to MySQL but not related to MySQL itself.
First of all, I would like to ask you, if you encounter MySQL slow, what is your first impression, if the performance of MySQL database is not good, what is your handling method?
I consulted some peers and got the following feedback. The first reaction was to try again, the second was to optimize SQL, and the third reaction was that we increased the buffer pool and started to change the hardware, such as SSD. Then, when it was really difficult, we asked a search engine and said, “What should I do if MySQL is slow?”
If you are using A domestic search engine, the search engine will recommend so-so know or so-so, recommend some MySQL tuning experience, increase the parameter A, lower the parameter, etc., similar websites can tell you how to do MySQL slow.
Let’s analyze the hidden meaning behind these phenomena:
-
If you try again and it works, it means you may have encountered some non-repeatable external factor that causes MySQL to slow down.
-
If tuning SQL can be solved, it means that the execution complexity of SQL is far greater than it needs to be.
-
If a larger buffer pool works, it means MySQL is running into some limitations of its own.
-
If changing SSDS can solve this problem, it means that server resources are limited.
-
If you need a search engine, that means tuning has become metaphysical.
So what I want to introduce to you today are four parts:
1.MySQL slow diagnostic thinking
2. Introduction of system observation tools
3. BCC (eBPF script set) example
4. EBPF usage method/limitation
In the first part, we will introduce you to the general idea of slow MySQL diagnostics, which is also common in the industry.
The second part is the main topic introduced today — related content of system observation tools, we will have a general understanding of what is called system observation tools.
In part three, we introduce you to a script set that is open source, out of the box and can help you quickly diagnose some problems with MySQL. We go straight to 10 examples and quickly show you what this script set can do for us.
Finally, we describe how eBPF is used and the script structure. Anything that works has its limits or it’s too good to be true, so we’ll explain what those limits are as well.
1. Slow diagnosis of MySQL
Let’s take a look at the first stage, the slow diagnostic thinking of MySQL. We usually do it in three directions:
-
The first direction is observations inside MySQL
-
The second direction is the observation of external resources
-
The third direction is the transformation of external demand
1.1 MySQL Internal Observation
The first part is Processlist to see which SQL is under abnormal stress, the second part is Explain, explain its execution plan, and the third part is Profilling, If the SQL can be executed again, create a Profilling, and then the senior DBA will use performance_SCHEMA directly. MySQL 5.7 will use sys_SCHEMA directly. Sys_schema is a view that contains all kinds of information in a convenient way. To help you diagnose performance. At a higher level, we will use Innodb_metrics to perform an engine diagnosis.
In addition to these methods, there are some other messy methods that people have proposed, which I won’t list here, but these are the general idea of a MySQL internal state observation. In addition to these, MySQL also provides some schemes to expose their status, but these schemes have not formed routines in practice, because of the high learning cost.
1.2 External resource observation
For the section on external resource observation, I quoted an article, the QR code of which I posted above. This article is written by a god abroad, the title is: 60 seconds of quick inspection, let’s take a look at it in 60 seconds on the server exactly what kind of inspection. There are ten of them. Here are the first five. Let’s go through them one by one.
1. Uptime. Uptime tells us how long the machine has lived and what its average load is.
2. The dmesg – T | tail, tell us is there any system log an error.
3. Vmstat 1 tells us the status of virtual memory, whether there is a problem with page in and out, and whether swap is in use.
4. Mpstat -p ALL tells us whether the CPU pressure is uniform across ALL cores.
Pidstat 1 tells us about the resource usage of each process.
Let’s look at the last five:
Iostat-xz 1 is used to check the IO problem, and free-m is used to check the MEMORY usage, and then the two SAR are used to check the network consumption status according to the dimension of the device network card device, and the overall TCP usage and error rate. The last command, top, takes a look at the general process and thread problems.
This is the diagnosis of external resources, and these 10 commands reveal which external resources to diagnose.
1.3 Transformation of external demand
The third diagnostic idea is external requirement transformation. I quoted a document here, which is a chapter of the official MySQL document called Examples of Common Queries. The document introduced how to write Common SQL and gave some Examples. The linked QR code for the article is on Slide.
Let’s take a look at one of its examples.
What it does is to select from a table, which has three columns, article, Dealer and Price. It selects the most expensive goods of each author and lists them in the result set. This is its most original SQL, which is very consistent with the writing method of business, but it is an associated sub-query.
Associative subqueries are expensive, so the documentation above will show you how to quickly turn it into a non-associative subquery. You can see that there is no correlation between the middle subquery and the outer query.
The third step, we will teach you to directly remove the subquery, and then into such a SQL, this is called business transformation, before and after the three SQL is different, the cost of removing the associated subquery, the SQL will run very well, but this SQL has not been well represented. This approach is recommended only if you have a diagnosis of high SQL cost.
Why can it tear down an associative subquery?
The principle behind this is relational algebra, all SQL can be expressed as equivalent relational algebras, there is an equivalent relationship between relational algebras, this equivalence relationship can be removed by transformation of the associated subquery.
The above document is a college textbook that teaches the relationship between algebra and SQL from the beginning. Then, step by step, explain how to simplify this SQL statement.
First, MySQL itself provides a lot of commands to observe the various state of MySQL itself, you can check from the top to the bottom generally can detect SQL problems or server problems.
Secondly, from the perspective of the server, we start from the perspective of the script of the inspection. There are only these kinds of server resources, and there are only so many observation methods. We just observe all the resources of the server around.
Third, if it is really difficult to deal with, the demand side must be in accordance with the database easy to accept the way to write SQL, this cost will drop very quickly, this is the conventional MySQL slow diagnosis idea.
2. Introduction of system observation tools
Let’s switch from a discussion of diagnostic thinking to a discussion of system observation tools, first understand what a system observation tool is and look at an example of it, and then go back to the diagnostic thinking to see how the introduction of new tools can change our thinking.
So what is a system observation tool? Citing this document, two microcodes as above, this is a document written by a foreigner, I tore it apart and described three things in the middle:
First, where do the data sources of the system observation tools come from?
Second, the data acquisition process, because the collection is the operation of the system, so how to collect this is a difficult point;
Third, how should we look at the data, whether to use a graph or a table, it is called the data processing front end;
The first step is to look at data sources. Linux provides us with these kinds of data sources, including the observation points provided by the operating system kernel mode and the observation points provided by the user mode. MySQL has provided the observation points in the user mode for a long time.
The second step is how to pull the data out, and when you pull it out, you can see that the tools that you’re most familiar with are PERf and Ftrace, and some of you are using sysdig, and some of the others you may have heard of, this is a way to pull data out of the operating system.
The third step is the data processing front end, which is also commonly used in perF and FTrace. If you are familiar with PERF, you will know that the data coming out of perF is a tree of data, and you can interact with the tree, for example, to see how long a function has been running, which function has been running the longest, and this is the data processing front end.
Today I’m going to introduce the third class, eBPF, and the fourth class that we used to use, and the first two are generic tools, so let’s compare these four and see why Linux provides so many tools.
First, take a look at fTrace. Ftrace is a peg in sysFS that provides a form of observation by typing the signature of the desired function into the peg, and then the operating system provides information about the status of the function. The structure of FTrace is shown in the figure on the left. The data processing front end and collection end are FTrace, and the data source is the following pile.
Second, perF, which is commonly used, is that the operating system provides a system call that writes data to a cache, and then the client pulls that data out and displays it on the display. This is how PERF works.
Third, eBPF is the solution that we are going to focus on today. EBPF is different from the previous two solutions, one is the stubs on the file system provided by the operating system, and the other is the system call provided by the operating system. EBPF is a mechanism for inserting a piece of code directly into a location in the operating system kernel.
Fourth, Systemtap works by compiling a piece of C code into a kernel module, and embedding the module into the kernel. It is not a mechanism provided by the kernel, but a function provided by the kernel module mechanism.
That’s the difference between the four observation tools.
And the reason why we’re going to talk about the differences between these four observation tools is because you know what you’re going to do when you’re choosing an observation tool.
Of the four observation tools, which do the least damage to the system?
The least damaging to the system are system calls, which are promised services. Then there’s fTrace, which is a port that the system provides at the file system level to tell you that you can interact with the system through this port.
Who is the most intrusive to the system?
EBPF is the most intrusive to the System, because it directly inserts a piece of code into the System. System Tap is the most unstable, because it is a module of the System and provides very complex functions.
This diagram shows the architecture of eBPF. EBPF first compilers a program into binary code, and then inserts it into the operating system. When the operating system runs the code, the collected data is spit out into a space of the operating system itself, and then returns the data uniformly.
The eBPF structure, at its core, is about inserting code into the operating system, and it requires all sorts of security to do that, so that’s where it gets complicated.
3. BCC (eBPF script set) example
We referenced BCC, an open source eBPF script set, to take a quick look at what eBPF can do, right out of the box.
First example, MySQL request delay analysis, one MySQL is doing a lot of work, thousands of concurrent requests there, which SQL is the slowest, which SQL is more than a second, in addition to slow log, you can also use this method to see.
The results of this command is divided into three columns, and it is the first column of the request of delay and increasing exponentially, unit is microseconds, middle column is the number of hits it, if you have made a request to the range of 64-127 microseconds, will hit the number plus one, the last column is the layout of it, it is in the same report provides a numerical approach and figure, it is easy to see the result.
For this server, I placed a select performance pressure, and most of its requests were concentrated between 64 and 127 microseconds. The performance of this database is probably not bad.
Let’s look at another pressure, and I did a select+ INSERT mix pressure in a database, and it changed again, and it showed a nice bimodal pattern, and I put two peaks in a different color, and these two peaks mean that there’s a good chance that there’s mixed pressure in a database, Or maybe the top part of the pressure is hitting some cache, and the bottom part of the pressure is not hitting the cache, so that part of the request is slower, and there is another peak, so you can see the general running state of the database through this peak analysis.
If you can do a better job, you can sample your database and graph the latency of the database, for example, at the same time today and yesterday, under the same business pressure. If the latency peak of the database keeps getting later, it means that the state of the database is getting worse. This is one of the first things that BCC can do, and again it can be downloaded straight out of the box.
Second example, slow query in MySQL, MySQL itself provides a good slow query, why would I use another mechanism to get slow query in MySQL?
Let’s take a look at its output, which is actually a little bit simpler than slow queries in MySQL itself. So why do we get slow queries in a different way?
Because it can do these things that MySQL’s slow queries can be difficult to do, it can be done cheaply compared to MySQL’s slow logging:
1. Obtain a small number of slow queries
2. Get slow queries for certain patterns
3. Obtain the slow query of a user
For example, get a small number of slow queries. Why a small number? Because we are not sure now online how much is the delay, slow query log moments only one second is pile up, the performance will be down, but if the slow query a 10 seconds, no request in this interval, so it bit by bit to adjust the value, such as online 1% of the slowest query can hit, but in the script, You can take the largest queries in a given interval and pull it out.
It is also possible to hit a pattern of slow queries by scripting. For example, if we only care about slow queries for update, then it is not very meaningful to get the results of select, or if I must get some queries related to a particular table, I can do it by scripting.
The third case is where I want to get a slow query from a particular user, and this is usually the case for multi-tenant systems, because multi-tenant systems where I only want to do slow query analysis for a particular user, this script works better, and this is the second example I want to talk about.
The next few examples are related to IO, so I refer to another document, this document is the Linux IO stack diagram, and on the right is the cited QR code. This stack diagram looks very complicated, but this is actually the first version of the IO stack diagram drawn in 2012. Now the IO stack diagram is much more complex than this. You can try it on this website. And then we pull out the key elements, and we look at what are the levels of the IO stack?
Starting with MySQL, which runs in user mode, an IO request goes down to kernel mode through an interface at the VFS layer, and then from VFS to the actual file system, the IO request goes down to the block device layer, where it goes through an IO scheduler. In common MySQL tuning recommendations, the scheduler Settings are either null or deadline. This is where the scheduler takes effect. Finally, the data request is sent to the physical device through the SCSI interface.
The third example, VFS latency analysis, is that we can script IO analysis for each layer. For example, I can do VFS latency analysis.
When you do a latency analysis of the VFS, which is a write stress on the database, you can clearly see a bimodal graph, the two peaks of write, which is the database’s response to the kernel’s write stress.
What does that mean? This might mean that since this part of the write hit the operating system’s file system cache, and this part of the write is actually written to the device, they have different latency. This is a typical bimodal diagram, and you need to separate the two peaks to do this analysis.
In other words, if the writing pressure is concentrated here, and there is no second peak, do we need to change the physical device? Probably not, because everything hits the operating system cache.
In the fourth example, IO latency analysis for Ext4 files, the graph we saw earlier is based on disk devices.
So can I see which file’s IO is slow according to the file dimension, this script can do directly. I put the simplest write pressure on the database.
The red ones are higher than the non-red ones, and what they have in common is that they’re all data files, and the non-red ones are all log files, and that’s why we say that log files are written sequentially, and data files are written randomly, and sequential writing is faster than random writing, and that’s reflected in this delay, So this way you can see what the average latency is for each file.
The main reason I use this tool is to capture evidence that other processes are affecting the IO of the database. DD is deliberately used in this place. It is not didi for taxi, but DD for writing IO. The IO pressure of DD will be caught by the tool, which is iron evidence. This is the fourth example THAT I want to introduce, which can do file-based IO analysis.
The fifth example, block device delay analysis, why add a block device delay analysis? This script shows how much latency is actually going down to the device, since the latency analysis is all about operating system caching.
I don’t know if the device is good or bad. When you do IO pressure, it’s hard to judge whether it’s good or bad by absolute value. You need to draw the right conclusion by sequential process.
This bimodal diagram, which was made by a colleague of mine, asked me if this was a typical bimodal diagram, was there something wrong with the IO, something wrong with the device, and one set of IO’s latency was significantly higher than the other.
So is there anything wrong with this picture? There’s not much of a problem with this graph, because it has a small count. Its actual select reads to the device are only five requests, three of which are obviously higher latency than the other two, which is not worth analyzing. Most of the requests are all held by InnoDB buffer and operating system Cache, so it is not worth analyzing. You can also complete the question of “do you need to change the device? Will MySQL become faster after changing the device?” by using this figure, it should not be in the case of the figure.
I want to know which MySQL thread is causing IO stress. Is InnoDB’s thread responsible for swiping data, or is it the thread that is loading data?
This script will help you do that, and if you look at the output of this script, it looks like this: the left column is TID, which is the thread number, the right column is the file, and the middle part is the size it was written to. In such a database, you can obviously see what is wrong with the database.
You don’t have to have the contents of my.cnf to know what’s wrong with the database, which may be that general Log is enabled. So this is thread-based, so if you look at these two threads, you can tell which one of these two threads came from, and these two threads might have an unusually high AMOUNT of SQL, so the General Log keeps flushing, and it keeps flushing like this. This is a thread-based, file-based IO analysis.
If the SQL is not written properly, it will create temporary tables. These temporary tables have a short lifetime, but they are very large, so you must write files instead of memory.
In this case, there is some stress on the operating system, which is not easy to diagnose because temporary files have a short lifetime, so this script can help provide a solution that looks something like this.
I made a temporary table, which lived for about 5.3 seconds, and it was displayed in the script result. If you scan your MySQL online and you see a lot of stuff here, you’re using a lot of temporary tables, and if you have a lot of IO pressure at this point, you might be affected by temporary tables.
The eighth example, short connection analysis, better applications will use connection pool, but we are not so lucky, often encounter such good applications, so often business will throw a large number of short connections.
In this example, sysbench has a large concurrency, but it only lives for 300 milliseconds. All of these connections live for 300 milliseconds. Repeatedly running sysbench can kill the database, make a thousand connections, and destroy them after 300 milliseconds, make a thousand more connections, and your business will go up and down. With this script, you can catch the pressure from which server, which port, and then fix it. This is the short connection analysis of the database.
The ninth example, long connection analysis, in addition to short connection analysis, there are long connection analysis, which business side is always doing my data, always writing in, always reading in, make the network is very slow.
And this is one of the things that gives you a sense of that, and this is long join analysis. It reads, it writes, it’s all here, this is the ninth script.
Example 10, CPU offCPU consumption analysis. Looking at the last script, I need to give some background on what offCPU is, what consumption analysis is, and what the resulting graph might look like.
Why do we analyze the OFFCPU of the CPU?
Because this is how the CPU normally works, MySQL runs in the user mode of the operating system. Sequence in the process of running state will cut into the kernel, has carried on the system calls for example program, the better is the program can always accounts for a CPU, so it is always in the running, if not very good case, for example met disk IO, network IO, the active sleep, some lock block, not the condition of the CPU, it would be in It gives up the CPU and gives it up to another thread.
But does this mean the database is working well? If you just do an onCPU analysis for MySQL, at this stage onCPU is 0, it doesn’t occupy the CPU, but because I/O is blocked, I want to know what’s blocking here, this is called offCPU analysis, it’s when MySQL starts to dump CPU from kernel state, and I want to know why it dumps, And how long it lasted.
The final output is this graph, this graph is called the flame graph and this is a cold flame graph, it’s offCPU, and the common flame graph is the flame graph, which is red, which refers to the onCPU analysis. We purposely made it cool. This is offCPU’s fire map, and obviously no one can read what it says. So let me introduce you to a flame chart, and a flame chart is a process.
For example to sample database, the process of sampling, picked four sample, the four sample this place on behalf of the database, the operation of the stack and then it is that the operation of the stack, such four stack operation, and then the flame on the drawing they will be merged into this appearance, all four of them involves the first call is A, so it will take A merge together, The second call has two B’s, merge the B’s together, and what you end up with is a Graph that looks like this, and when it gets bigger, it looks like a Flame, so it’s called the Flame Graph.
This is flame figure is formed, it is through sampling, and then the merged into A picture of samples, and then you can get any information on this figure, the entrance to A program may be the information obtained from A, because all the samples had A, the B run independently accounted for A quarter of the time, above B C accounts for A quarter of the time, And you can read this very quickly from this graph.
So if you were to tune this program, where would you tune it, where would you tune it most directly and easily? B runs on its own for a quarter, C runs on its own for a quarter, and E runs on its own for a quarter. The only thing we know is that tuning D is useless, because all of D’s time is taken up by E, so tuning D doesn’t take up any of its own time, and that’s the basic principle of the flame diagram.
So let’s take an example, and this is an example that I took from that graph, and this is part of the offCPU analysis, and it’s about 25% or so of the size of that graph, and the heap reads from the bottom up, and the bottom one, if you can read it, innobase: index_Read, which means the engine reads the index tree. Then read up, don’t know what it means, MVCC don’t know what it means, doesn’t matter, read up again, Btr_… To_nth_level, the NTH level of the index read, and then read up, I opened a page above the buffer, and it started to read the page, and then this place involved fil_io, in order to read the page I started to read the file, and then do_syscall_… Makes the system call, and then gets to the VFS to actually make the system call.
What does it mean if this stack appears in 25% of the entire MySQL stack? That means MySQL spends 25% of its time reading the page, reading the page out of the file system. What does the page do? This page is in the index, so even if you don’t know the code and you read the English, you can probably figure out that if you replace the disk, or if you make the buffer pool bigger, really big, and then you start adding memory, at best you can make the database 25% faster, you can probably eliminate the stack altogether, This is the method of IO analysis that the fire diagram brings to you.
Just now we introduced ten BCC relevant example, these examples are ready-made script, BCC this tool is able to provide you with a complete, can be observed that all aspects of the operating system, such as if something is OOM kill off, then memory leaks can also see, then there are N many other part, And this is basically a treasure trove that we’ve discovered over the years, where you can just call these scripts and do a lot of analysis that no one else can do, and the technology is eBPF, which is the system observation tool that we just introduced. Just go to Github and look it up.
4. EBPF usage method/limitation
If the script doesn’t cut it, we can write it ourselves. Here we look at script writing and the limitations of eBPF.
Let’s take MySQL delay analysis as an example. There are 1000 queries in a MySQL database, and these queries probably fall into the graph in which the delay time is. In order to fulfill this requirement, I need to write two programs, among which the first program is running in the kernel.
The logic of this procedure is like this, give me a little at the beginning of the query intercepted, it records a timestamp, and then request at the end of the intercepted a record a timestamp, then put two timestamps subtraction for a delay, then throw the delay to the result set inside, procedure is completed, the normal way of thinking. I take the end time minus the start time, subtract it to get a delay, and then throw the delay into a statistical container, and the thing is done. This is the first program I will write, is embedded in the kernel of the program, but requires a shell of the program responsible for embedding.
The logic of this shell is very simple, insert the previous kernel program into the MySQL observation point, insert it into the kernel, and then take the result set out, print it out, and that’s how to write an eBPF script, the only thing you need to do is these two programs, and run it.
How long is this program? This program is about 45 lines long, but I’ve left out some parts of it, which are error handling, and the heart of it is 45 lines, and then you just have to take your script and copy it and change it and you can do a lot of things.
Let’s talk about limits. Why don’t a lot of people know about limits?
This feature was introduced in Linux 4.4, but there is a statistical bug in Linux 4.4. If you use the distribution map, you will see that the number on this map is not correct. We recommend Linux 4.9+, and some useful features are only opened in 4.13+. This is the biggest limitation of eBPF. What to do? Can only wish everyone long life! Live until the Linux 4.x kernel is available in production environments.
The second biggest limitation is the build parameter of MySQL. MySQL has provided dtrace observation points for a long time. These observation points are public, but they are compiled without observation points in the official release package by default. So in the direct official release of the binary package is not used in this function, you need to compile it. You need to take this parameter when compiling, which is probably a big limitation.
So if you’re limited, we recommend switching to systemTap.
Linux 2.6 has been, but I just said, its mechanism is to write a kernel module, this mechanism is not very stable, it in order to solve the problem of not particularly stable increased the number of restrictions, such as can be used in the kernel memory size limit, gathering frequency, there are limits to the kernel percentage there are limits to the influence of the performance, With all these limits on, it’s safe.
But many functions such as offCPU flame figure, it is necessary to turn off these restrictions, once off the kernel is not very stable, so this tool, I didn’t dare to write its drawbacks in it because it’s a good tool, we also hard to say it this shortcoming is a fatal flaw, but is not recommended for production use, But it’s a really fun tool to use in a test environment, if you can’t use eBPF you can use SystemTap to do some diagnostics. This is the part about limitations.
Then you have SystemTap, you have eBPF, and you want to know if there are other options, and I also stole this diagram, it’s all alpaca, and there are so many tools, you can choose it.
As for how to choose, you can go to Google and there are articles on how to choose these observation tools, but generally there is no scientific idea, just try and try.
I recommend a book, the source of all the knowledge in this lecture comes from this book, the author of the script collection of BCC we just said, this book is written by him, he also does a lot of god-like things, I strongly recommend to you, this book was published in Chinese early, but it seems that many people have not read many people.
In addition to the Chinese version of the book, I recommend another document. This document is the official document of Red Hat, which does not require members of Red Hat enterprises and can be read for free. It is called: In the second section, the Performance Tuning Guide introduces various available observation tools of the operating system, covering all the external observation tools mentioned in the first part and the system observation tools mentioned in the middle. But I don’t like this document because it few principle analysis, and are on this side of said there is a parameter can be adjusted, there is a parameter can be adjusted, if you want to get this part of knowledge, the quality of this document is also unusually high, red hat I’ve always thought is a selling document by the operating system of the company.
All my content for today has been presented, thank you!
PPT download link:
github.com/actiontech/slides