CPU usage burst again? What the hell is MySQL doing

The author Wang Wen ‘an, Tencent CSIG database special database engineer, mainly responsible for Tencent cloud database MySQL related work, love technology, welcome to leave a message for communication. The article was first published in Tencent Cloud + community Tencent cloud database expert services column.

In my daily work, when I find that MySQL is not in the right state, I usually look at the monitoring indicators, and many times I see a familiar scene: CPU usage has exploded again. This article will introduce you to the relationship between MySQL and CPU, after a certain understanding of this can more accurately determine the cause of the problem, but also to find some hidden problems caused by CPU problems in advance.

How to read the CPU usage

Take the top command in Linux as an example. The result is as follows:

The %CPU column shows CPU usage, and the percentage refers to the percentage of time spent overall:

%us: indicates the CPU usage time of the user process (not scheduled by NICE)
% SY: indicates the CPU usage time of system processes, mainly used by the kernel.
%ni: indicates the elapsed time of the user process by CPU nice.
%id: idle CPU time
%wa: indicates the time the CPU is waiting for I/O
%hi: time spent by the CPU to process hard interrupts
%si: time spent by the CPU to process soft interrupts
%st: CPU time stolen by the virtual machine

In general, when we talk about excessive CPU usage, we refer to the %us metric, and the CPU usage in monitoring is usually the same value (there are other ways to calculate it, but for simplicity, we don’t consider other cases). Several other indicators that are too high also indicate that MySQL is in an abnormal state. For simplicity, this is mainly used to refer to the scenario where %us is too high.

MySQL and thread

MySQL is a single-process multithreaded structure, which means that only one row of data can be seen in an exclusive MySQL server using the top command.

If you want to see the thread status, you need to use top-h

You can see the ID of each thread in MySQL. You can see that after MySQL is started, it creates many internal threads to work on.

These internal threads include the system threads used by MySQL itself to scrub, read and write data, etc., as well as the threads used to process user SQL, let’s call them user threads. User threads have a special feature: The SQL sent from the program end to the MySQL end is executed by only one user thread (one-thread-per-connection). Therefore, when MySQL processes complex queries, there will be an awkward phenomenon of “one core is difficult, but multiple cores are around”.

For Linux, the MySQL process and all threads it starts are not kernel processes. Therefore, the system and user threads of MySQL are counted as %us CPU usage when they are busy.

When will the CPU be 100%

What does MySQL do when the CPU is 100%? From the previous analysis, MySQL has two main types of threads that occupy CPU: system threads and user threads. So on a mysql-only server, you only need to pay attention to these two types of threads to Cover most of the problem scenarios.

System thread

In the real world, system threads will encounter fewer problems. In general, multiple system threads will rarely run full at the same time, as long as the number of available cores on the server is greater than or equal to 4, it will not encounter 100% CPU. Of course, there are some bugs that may affect, such as this:

MySQL BUG
Copy the code

Although the situation is relatively rare, but in the face of the problem of the routine troubleshooting process, the system thread problems also need to be concerned.

User threads

When it comes to busy user threads, a lot of time will definitely be the first time to think of slow queries based on experience. It is true that more than 90% of the time “slow queries” are caused by “slow queries”, but as a methodology, it is based on analysis. See the us% definition, which refers to the amount of CPU time that the user thread occupies.

For example, order by, group by, temporary table, join, etc. This problem may be caused by low query efficiency, which causes a single SQL statement to occupy CPU time for a long time. It may also be caused by a large amount of data, which causes a large amount of computation. On the other hand, the pure QPS pressure is high, so the CPU time is used up. For example, the 4-core server is used to support the point query of 20K to 30K. Each SQL does not occupy much CPU time, but because the overall QPS is very high, so the CPU time is occupied.

Problem location

After the analysis, it is time to start the actual combat, here according to the previous analysis gives some classic CPU 100% scenario, and gives a brief positioning method as a reference.

PS: skip the scene of system thread bug, which will be analyzed as a detailed case in the future.

The slow query

After the 100% CPU problem has occurred, real slow queries are mixed up with regular queries affected by 100% CPU, making it difficult to visually look at processList or slowlog to find the culprit, and some specific characteristics are needed to identify the slowlog.

As can be seen from the previous simple analysis, slow queries with low query efficiency usually have the following situations:

Full table scan: Handler_read_rnd_next can be substantially increased, and row_EXAMINED values in this type of query can be very high in slowlog.
The value of Handler_read_next can be increased by a large amount. Note that this can also be caused by a large amount of traffic. This can be combined with QPS/TPS. This type of query can be difficult to find in slowlog, and the ROW_EXAMINED values generally differ significantly before and after failures, or are unreasonably high.

For example, in a data skews scenario, a small range query with row_EXAMINED high in one particular range and low in the other ranges may be inefficient.

Sort more: The order by, group by types of queries are usually not very good to judge from the Handler’s metrics. You can usually see a lot of these types of queries in processList and slowlog if there is no index or if the index is bad enough to cause sorting operations to not be eliminated.

Of course, if you don’t want to examine MySQL metrics in detail or are in an urgent situation, you can do a simple division in slowlog using ROws_SENT and ROW_examined. For example, row_EXAMINED/Rows_SENT > 1000 can be taken out as suspects. This kind of problem can be solved by optimizing the index.

PS: 1000 is just an empirical value, depending on the actual business situation.

Large amount of calculation

This type of problem is usually caused by a large amount of data, even if there is no index problem and the execution plan is OK, it can cause 100% CPU utilization, and combined with MySQL one-thread-per-connection features, it does not require much concurrency to maximize CPU utilization. This type of query is actually easier to track because it typically takes longer to execute and will be noticeable in processList rather than slowlog, since incomplete slowlog statements are not recorded.

There are three general solutions to this kind of problem:

Read/write separation, which puts this type of query into a read-only slave library that is not used by business.
Split SQL in the program segment, a single large query into multiple small queries.
OLAP solutions such as HBASE and Spark are used.

High QPS

This kind of problem is simply the bottleneck of hardware resources, whether row_EXAMINED/ROws_SENT ratio, or SQL index, execution plan, or SQL calculation, there are no obvious problems, but THE QPS indicator will be relatively high. And processList might not see anything at all, for example:

processlist
Copy the code

conclusion

In fact, CPU 100% problem is not only pure % US, there will be % IO, %sys, etc., these will involve MySQL and Linux associated part of the content, expand will be more. This paper only tries to sort out the ideas and methods of troubleshooting & positioning from %us. When analyzing problems in % IO, %sys and other aspects, similar ideas can also be used. Starting from the significance of these indicators, combined with some features or characteristics of MySQL, the reason behind the appearance can be gradually clarified.

CPU usage burst again? What the hell is MySQL doing

How to read the CPU usage

MySQL and thread

When will the CPU be 100%

Problem location

conclusion

Related Posts

The Python interview is so easy

Here are all the database questions you’ll be asked in an interview!

C++grammar formatting output and I/O stream functions