GitHub's 20K-star "Java Engineers Becoming Gods" roadmap — don't you want to check it out?
Some time ago, we launched a new application. Traffic was low: cluster QPS was only about 5, and the RT of the write interface was around 30 ms.

Recently a new business line was onboarded, and the business side projected a daily QPS of 2,000, with peaks possibly reaching 10,000.

To assess our capacity headroom, we ran a load test in the pre-release environment. During the test, we found that once single-machine QPS reached about 200, interface RT did not change noticeably, but CPU utilization climbed sharply until it was fully saturated.

CPU utilization dropped back immediately after the load test was stopped.

So we set out to find what was causing the CPU spike.
Troubleshooting and Resolution

Log on to the machine while the load test is running and start troubleshooting.

We used Alibaba's open-source Arthas tool for this investigation; Arthas is not strictly required, though, and the JDK's built-in commands can be used instead.

Before digging in, take a look at the machine's CPU usage. The simplest way is the top command:

```
top - 10:32:38 up 11 days, 17:56,  0 users,  load average: 0.84, 0.33, 0.18
Tasks:  23 total,   1 running,  21 sleeping,   0 stopped,   1 zombie
%Cpu(s): 95.5 us,  2.1 sy,  0.0 ni, 76.3 id,  0.0 wa,  0.0 hi,  0.0 si,  6.1 st
KiB Mem :  8388608 total,  4378768 free,  3605932 used,   403908 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  4378768 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  3480 admin     20   0 7565624   2.9g   8976 S 241.2 35.8 649:07.23 java
  1502 root      20   0  401768  40228   9084 S   1.0  0.5  39:21.65 ilogtail
181964 root      20   0 3756408 104392   8464 S   0.7  1.2   0:39.38 java
   496 root      20   0 2344224  14108   4396 S   0.3  0.2  52:22.25 staragentd
  1400 admin     20   0 2176952 229156   5940 S   0.3  2.7  31:13.13 java
235514 root      39  19 2204632  15704   6844 S   0.3  0.2  55:34.43 argusagent
236226 root      20   0   55836   9304   6888 S   0.3  0.1  12:01.91 systemd-journ
```
The Java process with PID 3480 is consuming most of the CPU. Next, we need to determine which thread, and which code, is responsible.
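If Arthas is not at hand, the same "which thread is burning CPU" question can also be answered from inside the JVM with the standard `ThreadMXBean` API. The sketch below (the class and method names are my own, not from the article's code) lists the threads with the highest accumulated CPU time; per-thread CPU measurement must be supported and enabled by the JVM, which it is on mainstream HotSpot builds:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Arrays;
import java.util.Comparator;

public class BusyThreads {

    // Return the IDs of the top-n threads by accumulated CPU time in this JVM.
    static long[] topCpuThreads(int n) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        return Arrays.stream(mx.getAllThreadIds())
                .boxed()
                // getThreadCpuTime returns nanoseconds (or -1 if unavailable)
                .sorted(Comparator.comparingLong(mx::getThreadCpuTime).reversed())
                .limit(n)
                .mapToLong(Long::longValue)
                .toArray();
    }

    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (long id : topCpuThreads(3)) {
            // Print name, CPU time, and a few stack frames for each busy thread
            ThreadInfo info = mx.getThreadInfo(id, 5);
            if (info != null) {
                System.out.println(info.getThreadName()
                        + " cpu(ns)=" + mx.getThreadCpuTime(id));
                for (StackTraceElement frame : info.getStackTrace()) {
                    System.out.println("    at " + frame);
                }
            }
        }
    }
}
```

This is essentially what Arthas's `thread` command does for you, with nicer sampling and formatting.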
First, download and install Arthas:

```
curl -L http://start.alibaba-inc.com/install.sh | sh
```

Then start it:

```
./as.sh
```
Next, use Arthas's thread command to find the busiest threads: `thread -n 3 -i 1000` samples for 1000 ms and prints the stacks of the 3 threads that consumed the most CPU during that interval.
As the stack traces showed, the CPU-hogging threads were mostly stuck in the TCP socket reads underlying JDBC. Sampling several times in a row showed many threads blocked in the same place.

By analyzing the call chain, I found that this corresponds to the database insert in my code, where I use TDDL to create the sequence; creating a sequence requires a round trip to the database.

However, as I understood TDDL, whenever it queries a sequence from the database it fetches 1,000 values by default, caches them locally, and only goes back to the database once all cached values have been used up.
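The range-caching behavior described above can be sketched as follows. This is a toy model with made-up names, not TDDL's actual implementation: each `nextValue()` is served from a locally cached range, and a "database" round trip happens only when the range is exhausted.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a range-caching sequence, in the spirit of TDDL's
// step-based Sequence; not the real TDDL code.
public class CachedSequence {
    private static final int STEP = 1000;          // values fetched per DB round trip
    private final AtomicLong dbCounter = new AtomicLong(); // stands in for the sequence table
    private long current;                          // next value to hand out
    private long max;                              // exclusive upper bound of cached range
    private int dbRoundTrips;                      // how often we "hit the database"

    public synchronized long nextValue() {
        if (current >= max) {                      // local cache exhausted: fetch next range
            long start = dbCounter.getAndAdd(STEP); // simulated database update
            current = start;
            max = start + STEP;
            dbRoundTrips++;
        }
        return current++;
    }

    public synchronized int dbRoundTrips() {
        return dbRoundTrips;
    }
}
```

With this design, handing out 1,000 IDs costs exactly one database interaction, which is why a QPS of a few hundred should barely touch the database.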
In theory, with a load-test QPS of only about 300, we should not have been interacting with the database that frequently. Yet repeated sampling with Arthas showed most of the CPU being burned right there.
So we turned to the code, and found a rather silly problem in how the sequence was created and used:
```java
public Long insert(T dataObject) {
    if (dataObject.getId() == null) {
        Long id = next();
        dataObject.setId(id);
    }
    if (sqlSession.insert(getNamespace() + ".insert", dataObject) > 0) {
        return dataObject.getId();
    } else {
        return null;
    }
}

public Sequence sequence() {
    return SequenceBuilder.create()
            .name(getTableName())
            .sequenceDao(sequenceDao)
            .build();
}

/**
 * Get the next sequence value.
 * @return the next value
 */
protected Long next() {
    try {
        return sequence().nextValue();
    } catch (SequenceException e) {
        throw new RuntimeException(e);
    }
}
```
Because we rebuilt a brand-new Sequence for every insert, the local cache was thrown away each time: every call pulled 1,000 values from the database, used exactly one, and then the next insert pulled another 1,000, and so on.
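The cost of this bug can be made concrete with a toy comparison (hypothetical classes, not the real TDDL code): rebuilding a range-caching sequence for every value costs one simulated database round trip per value, while reusing a single instance costs one round trip per 1,000 values.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Toy model of the bug: a new range-caching sequence per call throws the
// cached range away, so every single value costs a "database" round trip.
public class SequenceReuseDemo {
    static final int STEP = 1000;
    static final AtomicLong DB = new AtomicLong();            // stands in for the sequence table
    static final AtomicInteger ROUND_TRIPS = new AtomicInteger();

    static class RangeSequence {
        long current, max;                                    // cached range [current, max)
        long nextValue() {
            if (current >= max) {                             // fetch a fresh range from the "DB"
                current = DB.getAndAdd(STEP);
                max = current + STEP;
                ROUND_TRIPS.incrementAndGet();
            }
            return current++;
        }
    }

    public static void main(String[] args) {
        // Buggy pattern: a new sequence per insert -> one round trip per value
        for (int i = 0; i < 100; i++) new RangeSequence().nextValue();
        int buggy = ROUND_TRIPS.getAndSet(0);

        // Fixed pattern: one shared sequence -> one round trip per 1000 values
        RangeSequence shared = new RangeSequence();
        for (int i = 0; i < 100; i++) shared.nextValue();
        int fixed = ROUND_TRIPS.get();

        System.out.println(buggy + " vs " + fixed);           // prints "100 vs 1"
    }
}
```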
So the code was adjusted to create the Sequence instance once, at application startup. After that, obtaining a sequence value no longer hits the database every time: the local cache is checked first, and only when it is exhausted does TDDL go back to the database for a new range.
```java
public abstract class BaseMybatisDAO implements InitializingBean {

    @Override
    public void afterPropertiesSet() throws Exception {
        sequence = SequenceBuilder.create()
                .name(getTableName())
                .sequenceDao(sequenceDao)
                .build();
    }
}
```
By implementing Spring's InitializingBean interface and overriding its afterPropertiesSet() method, the Sequence is initialized once when the bean is created.

After modifying the code and redeploying for verification, monitoring showed the database read RT dropping significantly after the optimization:

The QPS of sequence-related write operations also dropped significantly:

We then started a new round of load testing, but CPU usage was still high and QPS still would not climb, so we went back to Arthas to look at the threads again.
This time a new set of CPU-hungry stacks appeared. They traced back to a diagnostics tool we use, which had TDDL log collection enabled in the pre-release environment (it is supposedly disabled there by default, but was in fact on).

As it prints logs, the tool desensitizes them, and the desensitization framework calls Google's re2j library to do regular-expression matching.

Because my operations are TDDL-heavy, a large volume of TDDL logs was being collected and desensitized by default, and that genuinely burns CPU.
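To see why desensitization is CPU-hungry, here is a minimal illustration of regex-based log masking. The masking rule and class name are hypothetical, and the sketch uses java.util.regex rather than re2j; the point is that the real framework runs many such patterns over every single log line.

```java
import java.util.regex.Pattern;

// Minimal illustration of regex-based log masking; the actual desensitization
// framework applies many such patterns (via re2j) to every log line it prints.
public class LogMasker {
    // Hypothetical rule: mask the middle four digits of 11-digit phone numbers.
    private static final Pattern PHONE = Pattern.compile("(\\d{3})\\d{4}(\\d{4})");

    public static String mask(String line) {
        return PHONE.matcher(line).replaceAll("$1****$2");
    }

    public static void main(String[] args) {
        System.out.println(mask("user=13812345678 insert ok"));
        // prints: user=138****5678 insert ok
    }
}
```

Multiply this per-line matching cost by a verbose TDDL log stream under load, and the CPU overhead becomes very visible.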
So this second problem was solved by turning off DP's TDDL collection in the pre-release environment.
Summary and Reflection
This article has walked through the troubleshooting of an online CPU spike. The problem itself was not hard, and was frankly a bit silly, but the troubleshooting process is worth learning from.

I had investigated CPU spikes many times before and followed the same routine this time, but at first I found nothing unusual, assuming the extra database activity was simply a normal consequence of increased traffic.

Only after repeated checks (using Arthas to inspect the sequence values, querying the database for the primary-key IDs of recently inserted rows, and so on) did I find the problem in how the TDDL sequence was initialized.

After fixing that, I thought the problem was completely solved, only to discover that DP's collection of TDDL logs was also driving the CPU up; once that was resolved as well, things improved significantly.

When something behaves abnormally, there is always a "demon" behind it, and troubleshooting is the process of hunting it down.
About the author: Hollis, someone with a distinctive passion for coding, is a technical expert at Alibaba, co-author of "Three Courses for Programmers", and author of the article series "Java Engineers Becoming Gods".

If you have comments or suggestions, or want to reach the author, follow the public WeChat account [Hollis] and leave a message in the background.