Preface
Recently I was looking through some old slide decks and came across a technical talk I gave in 2019 on troubleshooting Java problems. Since it contains no company secrets, I have tidied it up to share here.
Production problem handling process
Let me just drop in a screenshot from the original slides; it still holds up today.
Troubleshooting problems
We can approach troubleshooting from three angles:
- Knowledge: some problems can be answered just by thinking them through, like the legendary Duolong recalling that line 83 of the code is the culprit ~
- Tools: Of course, not everyone has a photographic memory, and you may not have written the code at all, so tools are needed to locate the problem
- Data: Data generated as the program runs can also provide a lot of clues
Knowledge
There are many aspects to knowledge; here is a brief list:
- Language (specifically Java in this article): such as JVM knowledge, multithreading knowledge, etc.
- Frameworks: such as Dubbo, Spring, etc.
- Components: such as MySQL, RocketMQ, etc.
- Others: such as networking and operating systems
For example, we need to understand the entire lifecycle of a Java object, from allocation to reclamation. This diagram lays it out clearly and is worth committing to memory:
Then learn about the common garbage collectors as well:
Throughput (here not requests per second, but the fraction of time spent doing useful work) = time running user code / (time running user code + garbage collection time)
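As a quick sanity check on that formula (with made-up numbers, not from the original talk): if an application spends 990 ms of every second running user code and 10 ms collecting garbage, throughput is 990 / (990 + 10) = 99%.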
Using ParNew + CMS as an example, try to answer the following questions (a sketch of the relevant JVM flags follows the list):
- Why generational collection? – Keyword: efficiency
- When does an object enter the old generation? – Keywords: age, size
- When do Young GC and Full GC occur? – Keywords: Eden insufficient, old generation insufficient, metaspace insufficient, jmap / System.gc()
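Here is that sketch of the Java 8 era flags behind those keywords (illustrative values, not settings from the original slides):

```
# Use ParNew for the young generation and CMS for the old generation (Java 8 flags)
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC
# Young generation size: determines how quickly Eden fills and Young GC fires
-Xmn1g
# Objects surviving this many Young GCs are promoted to the old generation
-XX:MaxTenuringThreshold=15
# Objects larger than this (in bytes) are allocated directly in the old generation
-XX:PretenureSizeThreshold=1048576
# Old-generation occupancy at which CMS starts a concurrent collection
-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
```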
With that knowledge in hand, here is a practical example: how do we optimize when we find that Young GC fires frequently and takes a long time?
First of all, when does a Young GC trigger? The answer: when Eden does not have enough space.
Next, where does a Young GC spend its time? The answer: scanning plus copying; scanning is usually fast, copying is slow.
So we tried increasing the size of the young generation, and it really did solve the problem. Why? Let's analyze it.
Assume objects live for 750 ms, and with a young generation of size M the Young GC interval is 500 ms; let the scan time be T1 and the copy time T2.
- When the young generation size is M, GCs fire twice per second, each costing T1 + T2 (objects allocated 500 ms ago are still alive and must be copied)
- When the young generation is expanded to 2M, GCs fire once per second, each costing roughly 2T1 (the objects have already died by then, so only scanning, over twice the space, is needed)
And since T2 is much larger than T1, 2T1 is less than T1 + T2, on top of the GC frequency being halved
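To make the comparison concrete with made-up numbers (purely illustrative): suppose T1 = 5 ms and T2 = 45 ms. At size M the GC cost is 2 × (5 + 45) = 100 ms per second; at 2M it is 1 × (2 × 5) = 10 ms per second.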
This is the power of knowledge
Tools
Tools in the Java ecosystem roughly fall into these categories (a few typical invocations follow the list):
- Bundled with the JDK: jstat, jstack, jmap, jconsole, jvisualvm
- Third party: MAT (an Eclipse plug-in), GCHisto, GCeasy (an online GC log analysis tool, gceasy.io)
- Open source: Arthas, Bistoury, async-profiler
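And here are those typical invocations of the JDK-bundled tools (a sketch; exact flags vary slightly between JDK versions):

```
# GC utilization and collection counts, sampled every 1000 ms
jstat -gcutil <pid> 1000
# Histogram of live objects by class (note: the :live option triggers a Full GC)
jmap -histo:live <pid>
# Thread dump with additional lock information
jstack -l <pid>
```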
CPU profilers mainly come in two types:
- Sampling-based: the advantage is low performance overhead; the disadvantages are the limits of the sampling frequency and the SafePoint bias problem
- Instrumentation-based: every method gets extra profiling logic woven in; the advantage is accurate measurement, the disadvantage is high performance overhead
For example, Uber's open-source uber-common/jvm-profiler is a sampling-based CPU profiler, and it suffers from SafePoint bias. Once, while investigating a CPU usage problem, we collected a flame graph with it; as you can see, it was almost useless.
A safepoint can be understood simply as a point at which the JVM is able to pause. If samples can only be taken at safepoints, they are not representative, because the code may be consuming more CPU between safepoints. This phenomenon is known as SafePoint bias.
But when I profiled the same problem again with jvm-profiling-tools/async-profiler, the performance bottleneck became visible:
Although async-profiler is also sampling-based, it avoids SafePoint bias by hooking into the JVM via AsyncGetCallTrace. After optimizing based on the flame graph from async-profiler, QPS rose from 58k to 81k while CPU usage dropped from 72% to 41%.
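For reference, a typical way to capture such a flame graph with async-profiler looks roughly like this (a sketch; the exact flags and output format depend on the async-profiler version):

```
# Sample CPU for 30 seconds and write a flame graph
./profiler.sh -e cpu -d 30 -f /tmp/flamegraph.svg <pid>
```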
Data
The data include:
- Monitoring data: APM, metrics, JVM monitoring, distributed tracing, and so on
- Data produced by the running program: business data, access logs, GC logs, system logs, etc.
How to use this data depends on the actual case; there is no single template to follow.
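One thing worth preparing in advance is GC logging, so the data is already there when a problem occurs. A typical Java 8 configuration looks something like this (illustrative; the path is an assumption, and Java 9+ replaces these flags with -Xlog:gc*):

```
-XX:+PrintGCDetails -XX:+PrintGCDateStamps
-Xloggc:/path/to/gc.log
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=20M
```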
Experience
Having said all that, from experience we can summarize which angle to start from for the following common problems:
- Execution errors: check the logs, debug, replay the request
- Application hangs: jstack
- Slow calls: trace the call, benchmark
- High CPU usage: CPU profiling
- Frequent and time-consuming GC: GC log analysis
- OOM, high memory usage, memory leaks: heap dump analysis (see the flags after this list)
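Those flags: for the OOM case in particular, it helps to have the JVM capture a heap dump automatically at the moment of failure (standard HotSpot flags; the path is just an example):

```
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdump.hprof
```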
Case sharing
Cobar hangs: the process is alive and the port is open, but it cannot process requests
First, take the faulty machine out of rotation, then deal with the fault on that machine. From the logs, the problem was identified as a memory leak.
A quick question: can logs alone tell you exactly where a memory leak is? Answer: no.
You can dump the heap and download it for local analysis. If the file is too large, compress it first.
jmap -dump:format=b,file=/cobar.bin ${pid}
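Compressing before downloading can be as simple as (plain shell, matching the dump path above):

```
gzip /cobar.bin
```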
In the end we found a bug in our customized version of Cobar. If you are interested in memory analysis, you can read the following articles:
- A lengthy Dubbo gateway memory leak investigation
- A Skywalking memory leak investigation
High gateway latency
Trace the call with Arthas's trace command:
trace com.beibei.airborne.embed.extension.PojoUtils generalize
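As a side note (a sketch of Arthas usage, not from the original slides), trace also accepts a condition expression, which is handy when only some calls are slow, e.g. capture only invocations costing more than 10 ms:

```
trace com.beibei.airborne.embed.extension.PojoUtils generalize '#cost > 10'
```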
Integrating Sentinel causes the application to hang
After Sentinel was integrated, configuring a rule caused the application to hang. A jstack thread dump reveals the problem at a glance:
jstack ${pid} > jstack.txt
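When reading the resulting dump, the quickest signals are blocked threads and deadlocks (plain shell; the file name matches the command above):

```
# Count threads blocked waiting on a monitor
grep -c 'java.lang.Thread.State: BLOCKED' jstack.txt
# HotSpot's jstack also reports detected deadlocks explicitly
grep -A 5 'Found one Java-level deadlock' jstack.txt
```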
Finally
This material was first shared in December 2019, just over two years ago. Since it was put together from slides, the writing does not flow perfectly, but the ideas and methods of troubleshooting are still the same.
Search for and follow the WeChat public account "bug catching master" for posts on back-end technology: architecture design, performance optimization, source code reading, troubleshooting, and hands-on practice.