Troubleshooting and solving 100% server CPU and application OOM caused by zipkin2.reporter.InMemoryReporterMetrics
This post records a problem I ran into and some simple troubleshooting ideas; if anything here is wrong, feel free to leave a comment and discuss. If you have already hit and resolved the OOM caused by InMemoryReporterMetrics, you can skip this article. If you are unsure how to troubleshoot 100% CPU and OOM problems in production, this article may help.
Problem symptoms
[Alarm Notification – Application Exception Alarm]
Connection refused
Environment
Spring Cloud F release train.
The project uses the spring-cloud-sleuth-zipkin dependency, which pulls in zipkin-reporter by default. The zipkin-reporter version analyzed here is 2.7.3.
```xml
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-sleuth-zipkin</artifactId>
</dependency>
```

Version: 2.0.0.RELEASE
Troubleshooting
The alarm information tells you which service on which server is faulty. Log in to that server first and take a look.
1. Check the service status and verify that the health check URL responds
This step can be skipped; it depends on how your company does health checks and is not universal.
① Check whether the service process exists.

```
ps -ef | grep <service name>
ps aux | grep <service name>
```

② Check whether the IP address and port used by the health check are correct.
Is the health check URL of the alarmed service configured incorrectly? Generally this is not the problem.
The health check address looks like http://192.168.1.110:20606/serviceCheck; confirm the IP and port are correct.
When the service is healthy, it returns a result normally:

```
$ curl http://192.168.1.110:20606/serviceCheck
{"appName":"test-app","status":"UP"}
```

When the service is down:

```
$ curl http://192.168.1.110:20606/serviceCheck
curl: (7) couldn't connect to host
```
2. View service logs
Check whether the service is still writing logs and whether any requests are coming in. Looking at the log, the service had thrown an OOM error.
Tips:
java.lang.OutOfMemoryError: GC overhead limit exceeded. Oracle's documentation gives the cause of this error and the action to take:

Exception in thread thread_name: java.lang.OutOfMemoryError: GC Overhead limit exceeded
Cause: The detail message "GC overhead limit exceeded" indicates that the garbage collector is running all the time and the Java program is making very slow progress. After a garbage collection, if the Java process is spending more than approximately 98% of its time doing garbage collection and if it is recovering less than 2% of the heap and has been doing so far the last 5 (compile time constant) consecutive garbage collections, then a java.lang.OutOfMemoryError is thrown. This exception is typically thrown because the amount of live data barely fits into the Java heap having little free space for new allocations.
Action: Increase the heap size. The java.lang.OutOfMemoryError exception for GC Overhead limit exceeded can be turned off with the command line flag -XX:-UseGCOverheadLimit.

In short: the JVM is spending roughly 98% of its time on garbage collection while recovering less than 2% of the heap, and this has gone on for at least five consecutive garbage collections, so the JVM throws java.lang.OutOfMemoryError: GC overhead limit exceeded.

Source of the tip above: an article analyzing the cause of and solution for java.lang.OutOfMemoryError: GC overhead limit exceeded.
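To make the error more concrete, here is a minimal, hedged demo (my own sketch, not code from the affected service). Run with a small heap such as -Xmx32m, it typically dies with GC overhead limit exceeded (or a plain Java heap space error, depending on the JVM), because everything it allocates stays reachable:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch: run with a small heap (e.g. -Xmx32m) to typically trigger
// "java.lang.OutOfMemoryError: GC overhead limit exceeded". Every entry stays
// reachable, so each GC cycle recovers almost nothing while the heap fills up.
public class GcOverheadDemo {
    public static void main(String[] args) {
        Map<Long, String> retained = new HashMap<>();
        long i = 0;
        while (true) {
            retained.put(i, "value-" + i); // keeps growing, never released
            i++;
        }
    }
}
```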
3. Check the server resource usage
Run the top command to query the resource usage of each process in the system. The CPU usage of process 11441 reaches 300%, as shown in the screenshot below:
Then query the CPU usage of all threads in this process:
top -H -p <pid>
To save the output to a file: top -H -n 1 -p <pid> > /tmp/<pid>_top.txt

```
# top -H -p 11441
  PID USER PR NI VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
11447 test 20  0 4776m 1.2g 13m R 92.4 20.3 74:54.19 java
11444 test 20  0 4776m 1.6g 13m R 91.8 20.3 74:52.53 java
11445 test 20  0 4776m 1.6g 13m R 91.8 20.3 74:50.14 java
11446 test 20  0 4776m 1.6g 13m R 91.4 20.3 74:53.97 java
...
```

Looking at the threads under PID 11441, several of them have very high CPU usage.
4. Save stack data
1. Save a snapshot of the system load:
top -b -n 2 > /tmp/top.txt
top -H -n 1 -p <pid> > /tmp/<pid>_top.txt

2. Print the process's thread list sorted by CPU usage in descending order:
ps -mp <pid> -o THREAD,tid,time | sort -k2r > /tmp/<pid>_threads.txt

3. Save the file handles opened by the process:
lsof -p <pid> > /tmp/<pid>_lsof.txt
lsof -p <pid> > /tmp/<pid>_lsof2.txt

4. Save the thread stacks (take several samples):
jstack -l <pid> > /tmp/<pid>_jstack.txt
jstack -l <pid> > /tmp/<pid>_jstack2.txt
jstack -l <pid> > /tmp/<pid>_jstack3.txt

5. Save the heap configuration and usage summary:
jmap -heap <pid> > /tmp/<pid>_jmap_heap.txt

6. Save the heap object statistics:
jmap -histo <pid> | head -n 100 > /tmp/<pid>_jmap_histo.txt

7. Save the GC statistics:
jstat -gcutil <pid> > /tmp/<pid>_jstat_gc.txt

8. Produce a heap dump (heap snapshot):
jmap -dump:format=b,file=/tmp/<pid>_jmap_dump.hprof <pid>
This dumps all objects in the heap, so the resulting file is larger.
jmap -dump:live,format=b,file=/tmp/<pid>_live_jmap_dump.hprof <pid>
The :live option dumps only objects that are currently live, i.e. objects that cannot be collected by GC. (A programmatic way to take the same kind of live dump is sketched right after this list.)

After collecting the snapshot data of the faulty process, restart the service.
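As referenced above, here is a minimal sketch (an assumption about how you might wire it, not something these troubleshooting steps required) of taking the same kind of dump from inside a HotSpot JVM via the HotSpotDiagnosticMXBean; the boolean argument corresponds to jmap's :live option:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Sketch: trigger a heap dump programmatically on a HotSpot JVM.
// The second argument matches jmap's :live option (dump only reachable objects).
public class HeapDumper {

    public static void dump(String filePath, boolean liveObjectsOnly) throws Exception {
        HotSpotDiagnosticMXBean diagnostic = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        diagnostic.dumpHeap(filePath, liveObjectsOnly);
    }

    public static void main(String[] args) throws Exception {
        // Example path only; recent JDKs require the file name to end with ".hprof".
        dump("/tmp/manual_live_dump.hprof", true);
    }
}
```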
Problem analysis
With the operations above, we now have the GC statistics, thread stacks, and heap snapshot of the faulty service. Next, analyze them to see where the problem lies.
1. Analyze threads with 100% CPU usage
Convert the thread IDs
Analyze the thread stacks in the file generated by jstack.
Convert the busy thread IDs found above to hexadecimal, because the thread ID (nid) recorded in the jstack output is hexadecimal: 11447 → 0x2cb7, 11444 → 0x2cb4, 11445 → 0x2cb5, 11446 → 0x2cb6. The first conversion method:

```
$ printf "0x%x\n" 11447
0x2cb7
```

The second method: use any decimal-to-hexadecimal converter and prefix the result with 0x.
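If you prefer to stay in Java (for example in jshell), the same conversion is a one-liner with Integer.toHexString; the sketch below simply feeds in the four thread IDs from above:

```java
// Convert the decimal thread IDs reported by top into the hexadecimal
// nid values that appear in the jstack output.
public class TidToNid {
    public static void main(String[] args) {
        int[] tids = {11447, 11444, 11445, 11446};
        for (int tid : tids) {
            System.out.println(tid + " -> 0x" + Integer.toHexString(tid)); // e.g. 11447 -> 0x2cb7
        }
    }
}
```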
Find the thread stacks

```
$ cat 11441_jstack.txt | grep "GC task thread"
"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007f971401e000 nid=0x2cb4 runnable
"GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007f9714020000 nid=0x2cb5 runnable
"GC task thread#2 (ParallelGC)" os_prio=0 tid=0x00007f9714022000 nid=0x2cb6 runnable
"GC task thread#3 (ParallelGC)" os_prio=0 tid=0x00007f9714023800 nid=0x2cb7 runnable
```
It turns out that these threads are doing GC operations.
2. Analyze the generated GC file
```
 S0     S1     E      O      M      CCS    YGC    YGCT     FGC      FGCT       GCT
 0.00   0.00 100.00  99.94  90.56  87.86   875    9.307    3223  5313.139  5322.446
```
- S0: utilization of the first survivor space
- S1: utilization of the second survivor space
- E: utilization of Eden space
- O: utilization of the old generation
- M: utilization of the metaspace
- CCS: utilization of the compressed class space
- YGC: number of young-generation garbage collections
- YGCT: time spent on young-generation garbage collections
- FGC: number of full (old-generation) garbage collections
- FGCT: time spent on full garbage collections
- GCT: total time spent on garbage collection

Full GCs are extremely frequent: 3223 full collections have already cost about 5313 seconds.
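The same counters can also be read from inside the application through the standard GarbageCollectorMXBean API; the sketch below (not part of the original troubleshooting) prints cumulative counts and times per collector, which roughly correspond to the YGC/YGCT and FGC/FGCT columns above:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Print cumulative GC counts and times per collector, roughly matching
// the YGC/YGCT and FGC/FGCT columns of jstat -gcutil.
public class GcStats {
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d, time=%dms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```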
3. Analyze the generated heap snapshot
Use the Eclipse Memory Analyzer (MAT) tool. Download address: www.eclipse.org/mat/downloa…
Results of the analysis:
Roughly speaking, the OOM was caused by InMemoryReporterMetrics:
zipkin2.reporter.InMemoryReporterMetrics @ 0xc1aeaea8, Shallow Size: 24 B, Retained Size: 925.9 MB
An online Java memory dump analysis tool can also be used, as shown in the screenshot below; its features are not as powerful as MAT's, and some of them require payment.
4. Cause analysis and verification
With this clue, I checked the Zipkin configuration of the faulty service: it was no different from the other services, the configuration is the same.
I then looked at the zipkin-reporter jar on the classpath and found that the faulty service depended on a lower version.
Jar used by the faulty service: zipkin-reporter-2.4.3.jar. Jar used by the other, unaffected services: zipkin-reporter-2.4.4.jar.
After upgrading the dependency of the faulty service and verifying it in the test environment, the heap snapshot no longer showed the problem.
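As a quick, hedged way to confirm at runtime which zipkin-reporter jar a service actually loaded (rather than relying only on the dependency tree), you can print the code source of the class in question; this is just an illustrative sketch:

```java
import zipkin2.reporter.InMemoryReporterMetrics;

// Print the jar that InMemoryReporterMetrics was loaded from, which reveals
// the zipkin-reporter version actually present on the runtime classpath.
public class ZipkinReporterVersionCheck {
    public static void main(String[] args) {
        System.out.println(InMemoryReporterMetrics.class
                .getProtectionDomain()
                .getCodeSource()
                .getLocation());
    }
}
```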
Digging into the cause
Searching the zipkin-reporter repository on GitHub (github.com/openzipkin/…), I found the related issue: github.com/openzipkin/…
The fix and its verification code: github.com/openzipkin/… Comparing the code of the two versions:
```java
// Code before the fix:
private final ConcurrentHashMap<Throwable, AtomicLong> messagesDropped =
    new ConcurrentHashMap<Throwable, AtomicLong>();

// Code after the fix:
private final ConcurrentHashMap<Class<? extends Throwable>, AtomicLong> messagesDropped =
    new ConcurrentHashMap<>();
```
The fix uses Class<? extends Throwable> as the map key instead of Throwable. Because Throwable does not override equals/hashCode, every failed report puts a brand-new exception instance into the map as a distinct key, so the map grows without bound; keying by the exception class keeps the number of entries small and constant.
A simple verification:
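Here is a minimal sketch of that verification (my own reconstruction, not the project's original test code). It simulates both keying strategies and shows that the pre-fix map gains one entry per exception instance while the post-fix map stays at one entry:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Simulate the two keying strategies used before and after the zipkin-reporter fix.
public class DroppedMessagesKeyDemo {
    public static void main(String[] args) {
        ConcurrentHashMap<Throwable, AtomicLong> byInstance = new ConcurrentHashMap<>();
        ConcurrentHashMap<Class<? extends Throwable>, AtomicLong> byClass = new ConcurrentHashMap<>();

        for (int i = 0; i < 10_000; i++) {
            // Each failed report creates a *new* exception instance.
            Throwable t = new RuntimeException("connection refused");
            byInstance.computeIfAbsent(t, k -> new AtomicLong()).incrementAndGet();
            byClass.computeIfAbsent(t.getClass(), k -> new AtomicLong()).incrementAndGet();
        }

        System.out.println("keyed by Throwable instance: " + byInstance.size()); // 10000
        System.out.println("keyed by Throwable class:    " + byClass.size());    // 1
    }
}
```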
The solution
Upgrade the zipkin-reporter version. The dependency configuration below pulls in zipkin-reporter 2.8.4.
```xml
<!-- Zipkin dependencies -->
<dependency>
    <groupId>io.zipkin.brave</groupId>
    <artifactId>brave</artifactId>
    <version>5.6.4</version>
</dependency>
```
Tip: when configuring JVM parameters, add the following options so that a heap snapshot is written automatically when memory runs out.
```
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=path/filename.hprof
```
Reference
A record of an OOM caused by exceptions while sleuth was sending data to zipkin: www.jianshu.com/p/f8c74943c…
Bonus
Attached: searching Baidu for this problem can be a bit of a trap.
Learn how to troubleshoot Java deadlocks and CPU 100% problems
Thank you for reading. If you found this post helpful, please like or share it so more people can see it. I wish you a happy day!
No matter what you do, as long as you keep at it, you will see the difference. On the road, be neither humble nor arrogant!
The blog’s front page: https://aflyun.blog.csdn.net/
Java Programming Technology Paradise: a public account that shares practical programming knowledge.