Troubleshooting and solving 100% server CPU and application OOM caused by zipkin2.reporter.InMemoryReporterMetrics
This post records a problem I ran into and some simple troubleshooting ideas; if anything here is wrong, feel free to leave a comment and discuss. If you have already hit and resolved the OOM caused by InMemoryReporterMetrics, you can skip this article. If you are unsure how to troubleshoot 100% CPU and OOM problems in production, this article may help.
Problem symptoms
[Alarm Notification – Application Exception Alarm]
Connection refused
Environment
Spring Cloud F release train.
The project uses the spring-cloud-sleuth-zipkin dependency, which pulls in zipkin-reporter by default. The zipkin-reporter version analyzed here is 2.7.3.
```xml
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-sleuth-zipkin</artifactId>
</dependency>
```

Version: 2.0.0.RELEASE
Troubleshooting
The alarm information tells you which service on which server is faulty. Log in to that server first and take a look.
1. Check the service status and verify that the health check URL responds
This step can be skipped; it depends on how your company does health checks and is not universal.
① Check whether the service process exists.

```
ps -ef | grep <service name>
ps aux | grep <service name>
```

② Check whether the IP address and port used by the health check are correct.
Is the health check URL of the alarmed service configured incorrectly? Generally this is not the problem.
The health check address looks like http://192.168.1.110:20606/serviceCheck; confirm the IP and port are correct.
When the service is healthy, it returns a result normally:

```
$ curl http://192.168.1.110:20606/serviceCheck
{"appName":"test-app","status":"UP"}
```

When the service is down:

```
$ curl http://192.168.1.110:20606/serviceCheck
curl: (7) couldn't connect to host
```
2. View service logs
Check whether the service is still writing logs and whether any requests are coming in. Looking at the log, the service had thrown an OOM error.
Tips:
java.lang.OutOfMemoryError: GC overhead limit exceeded. Oracle's documentation gives the cause of this error and the action to take:

Exception in thread thread_name: java.lang.OutOfMemoryError: GC Overhead limit exceeded
Cause: The detail message "GC overhead limit exceeded" indicates that the garbage collector is running all the time and the Java program is making very slow progress. After a garbage collection, if the Java process is spending more than approximately 98% of its time doing garbage collection and if it is recovering less than 2% of the heap and has been doing so far the last 5 (compile time constant) consecutive garbage collections, then a java.lang.OutOfMemoryError is thrown. This exception is typically thrown because the amount of live data barely fits into the Java heap having little free space for new allocations.
Action: Increase the heap size. The java.lang.OutOfMemoryError exception for GC Overhead limit exceeded can be turned off with the command line flag -XX:-UseGCOverheadLimit.

In short: the JVM is spending roughly 98% of its time on garbage collection while recovering less than 2% of the heap, and this has gone on for at least five consecutive garbage collections, so the JVM throws java.lang.OutOfMemoryError: GC overhead limit exceeded.

Source of the tip above: an article analyzing the cause of and solution for java.lang.OutOfMemoryError: GC overhead limit exceeded.
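To make the error more concrete, here is a minimal, hedged demo (my own sketch, not code from the affected service). Run with a small heap such as -Xmx32m, it typically dies with GC overhead limit exceeded (or a plain Java heap space error, depending on the JVM), because everything it allocates stays reachable:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch: run with a small heap (e.g. -Xmx32m) to typically trigger
// "java.lang.OutOfMemoryError: GC overhead limit exceeded". Every entry stays
// reachable, so each GC cycle recovers almost nothing while the heap fills up.
public class GcOverheadDemo {
    public static void main(String[] args) {
        Map<Long, String> retained = new HashMap<>();
        long i = 0;
        while (true) {
            retained.put(i, "value-" + i); // keeps growing, never released
            i++;
        }
    }
}
```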
3. Check the server resource usage
Run the top command to query the resource usage of each process in the system. The CPU usage of process 11441 reaches 300%, as shown in the screenshot below:
Then query the CPU usage of all threads in this process:
top -H -p <pid>
To save the output to a file: top -H -n 1 -p <pid> > /tmp/<pid>_top.txt

```
# top -H -p 11441
  PID USER PR NI VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
11447 test 20  0 4776m 1.2g 13m R 92.4 20.3 74:54.19 java
11444 test 20  0 4776m 1.6g 13m R 91.8 20.3 74:52.53 java
11445 test 20  0 4776m 1.6g 13m R 91.8 20.3 74:50.14 java
11446 test 20  0 4776m 1.6g 13m R 91.4 20.3 74:53.97 java
...
```

Looking at the threads under PID 11441, several of them have very high CPU usage.
4. Save stack data
1. Save a snapshot of the system load:
top -b -n 2 > /tmp/top.txt
top -H -n 1 -p <pid> > /tmp/<pid>_top.txt

2. Print the process's thread list sorted by CPU usage in descending order:
ps -mp <pid> -o THREAD,tid,time | sort -k2r > /tmp/<pid>_threads.txt

3. Save the file handles opened by the process:
lsof -p <pid> > /tmp/<pid>_lsof.txt
lsof -p <pid> > /tmp/<pid>_lsof2.txt

4. Save the thread stacks (take several samples):
jstack -l <pid> > /tmp/<pid>_jstack.txt
jstack -l <pid> > /tmp/<pid>_jstack2.txt
jstack -l <pid> > /tmp/<pid>_jstack3.txt

5. Save the heap configuration and usage summary:
jmap -heap <pid> > /tmp/<pid>_jmap_heap.txt

6. Save the heap object statistics:
jmap -histo <pid> | head -n 100 > /tmp/<pid>_jmap_histo.txt

7. Save the GC statistics:
jstat -gcutil <pid> > /tmp/<pid>_jstat_gc.txt

8. Produce a heap dump (heap snapshot):
jmap -dump:format=b,file=/tmp/<pid>_jmap_dump.hprof <pid>
This dumps all objects in the heap, so the resulting file is larger.
jmap -dump:live,format=b,file=/tmp/<pid>_live_jmap_dump.hprof <pid>
The :live option dumps only objects that are currently live, i.e. objects that cannot be collected by GC. (A programmatic way to take the same kind of live dump is sketched right after this list.)

After collecting the snapshot data of the faulty process, restart the service.
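As referenced above, here is a minimal sketch (an assumption about how you might wire it, not something these troubleshooting steps required) of taking the same kind of dump from inside a HotSpot JVM via the HotSpotDiagnosticMXBean; the boolean argument corresponds to jmap's :live option:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Sketch: trigger a heap dump programmatically on a HotSpot JVM.
// The second argument matches jmap's :live option (dump only reachable objects).
public class HeapDumper {

    public static void dump(String filePath, boolean liveObjectsOnly) throws Exception {
        HotSpotDiagnosticMXBean diagnostic = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        diagnostic.dumpHeap(filePath, liveObjectsOnly);
    }

    public static void main(String[] args) throws Exception {
        // Example path only; recent JDKs require the file name to end with ".hprof".
        dump("/tmp/manual_live_dump.hprof", true);
    }
}
```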
Problem analysis
With the operations above, we now have the GC statistics, thread stacks, and heap snapshot of the faulty service. Next, analyze them to see where the problem lies.
1. Analyze threads with 100% CPU usage
Convert the thread IDs
Analyze the thread stacks in the file generated by jstack.
Convert the busy thread IDs found above to hexadecimal, because the thread ID (nid) recorded in the jstack output is hexadecimal: 11447 → 0x2cb7, 11444 → 0x2cb4, 11445 → 0x2cb5, 11446 → 0x2cb6. The first conversion method:

```
$ printf "0x%x\n" 11447
0x2cb7
```

The second method: use any decimal-to-hexadecimal converter and prefix the result with 0x.
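If you prefer to stay in Java (for example in jshell), the same conversion is a one-liner with Integer.toHexString; the sketch below simply feeds in the four thread IDs from above:

```java
// Convert the decimal thread IDs reported by top into the hexadecimal
// nid values that appear in the jstack output.
public class TidToNid {
    public static void main(String[] args) {
        int[] tids = {11447, 11444, 11445, 11446};
        for (int tid : tids) {
            System.out.println(tid + " -> 0x" + Integer.toHexString(tid)); // e.g. 11447 -> 0x2cb7
        }
    }
}
```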
Find the thread stacks

```
$ cat 11441_jstack.txt | grep "GC task thread"
"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007f971401e000 nid=0x2cb4 runnable
"GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007f9714020000 nid=0x2cb5 runnable
"GC task thread#2 (ParallelGC)" os_prio=0 tid=0x00007f9714022000 nid=0x2cb6 runnable
"GC task thread#3 (ParallelGC)" os_prio=0 tid=0x00007f9714023800 nid=0x2cb7 runnable
```
It turns out that these threads are doing GC operations.
2. Analyze the generated GC file
```
 S0     S1     E      O      M      CCS    YGC    YGCT     FGC      FGCT       GCT
 0.00   0.00 100.00  99.94  90.56  87.86   875    9.307    3223  5313.139  5322.446
```
- S0: utilization of the first survivor space
- S1: utilization of the second survivor space
- E: utilization of Eden space
- O: utilization of the old generation
- M: utilization of the metaspace
- CCS: utilization of the compressed class space
- YGC: number of young-generation garbage collections
- YGCT: time spent on young-generation garbage collections
- FGC: number of full (old-generation) garbage collections
- FGCT: time spent on full garbage collections
- GCT: total time spent on garbage collection

Full GCs are extremely frequent: 3223 full collections have already cost about 5313 seconds.
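The same counters can also be read from inside the application through the standard GarbageCollectorMXBean API; the sketch below (not part of the original troubleshooting) prints cumulative counts and times per collector, which roughly correspond to the YGC/YGCT and FGC/FGCT columns above:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Print cumulative GC counts and times per collector, roughly matching
// the YGC/YGCT and FGC/FGCT columns of jstat -gcutil.
public class GcStats {
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d, time=%dms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```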
3. Analyze the generated heap snapshot
Use the Eclipse Memory Analyzer (MAT) tool. Download address: www.eclipse.org/mat/downloa…
Results of the analysis:
Roughly speaking, the OOM was caused by InMemoryReporterMetrics:
zipkin2.reporter.InMemoryReporterMetrics @ 0xc1aeaea8, Shallow Size: 24 B, Retained Size: 925.9 MB
An online Java memory dump analysis tool can also be used, as shown in the screenshot below; its features are not as powerful as MAT's, and some of them require payment.
4. Cause analysis and verification
With this clue, I checked the Zipkin configuration of the faulty service: it was no different from the other services, the configuration is the same.
I then looked at the zipkin-reporter jar on the classpath and found that the faulty service depended on a lower version.
Jar used by the faulty service: zipkin-reporter-2.4.3.jar. Jar used by the other, unaffected services: zipkin-reporter-2.4.4.jar.
After upgrading the dependency of the faulty service and verifying it in the test environment, the heap snapshot no longer showed the problem.
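As a quick, hedged way to confirm at runtime which zipkin-reporter jar a service actually loaded (rather than relying only on the dependency tree), you can print the code source of the class in question; this is just an illustrative sketch:

```java
import zipkin2.reporter.InMemoryReporterMetrics;

// Print the jar that InMemoryReporterMetrics was loaded from, which reveals
// the zipkin-reporter version actually present on the runtime classpath.
public class ZipkinReporterVersionCheck {
    public static void main(String[] args) {
        System.out.println(InMemoryReporterMetrics.class
                .getProtectionDomain()
                .getCodeSource()
                .getLocation());
    }
}
```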
Digging into the cause
Searching the zipkin-reporter repository on GitHub (github.com/openzipkin/…), I found the related issue: github.com/openzipkin/…
The fix and its verification code: github.com/openzipkin/… Comparing the code of the two versions:
```java
// Code before the fix:
private final ConcurrentHashMap<Throwable, AtomicLong> messagesDropped =
    new ConcurrentHashMap<Throwable, AtomicLong>();

// Code after the fix:
private final ConcurrentHashMap<Class<? extends Throwable>, AtomicLong> messagesDropped =
    new ConcurrentHashMap<>();
```
The fix uses Class<? extends Throwable> as the map key instead of Throwable. Because Throwable does not override equals/hashCode, every failed report puts a brand-new exception instance into the map as a distinct key, so the map grows without bound; keying by the exception class keeps the number of entries small and constant.
A simple verification:
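Here is a minimal sketch of that verification (my own reconstruction, not the project's original test code). It simulates both keying strategies and shows that the pre-fix map gains one entry per exception instance while the post-fix map stays at one entry:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Simulate the two keying strategies used before and after the zipkin-reporter fix.
public class DroppedMessagesKeyDemo {
    public static void main(String[] args) {
        ConcurrentHashMap<Throwable, AtomicLong> byInstance = new ConcurrentHashMap<>();
        ConcurrentHashMap<Class<? extends Throwable>, AtomicLong> byClass = new ConcurrentHashMap<>();

        for (int i = 0; i < 10_000; i++) {
            // Each failed report creates a *new* exception instance.
            Throwable t = new RuntimeException("connection refused");
            byInstance.computeIfAbsent(t, k -> new AtomicLong()).incrementAndGet();
            byClass.computeIfAbsent(t.getClass(), k -> new AtomicLong()).incrementAndGet();
        }

        System.out.println("keyed by Throwable instance: " + byInstance.size()); // 10000
        System.out.println("keyed by Throwable class:    " + byClass.size());    // 1
    }
}
```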
The solution
Upgrade the zipkin-reporter version. The dependency configuration below pulls in zipkin-reporter 2.8.4.
```xml
<!-- Zipkin dependencies -->
<dependency>
    <groupId>io.zipkin.brave</groupId>
    <artifactId>brave</artifactId>
    <version>5.6.4</version>
</dependency>
```
Tip: when configuring JVM parameters, add the following options so that a heap snapshot is written automatically when memory runs out.
```
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=path/filename.hprof
```
Reference
A record of an OOM caused by exceptions while sleuth was sending data to zipkin: www.jianshu.com/p/f8c74943c…
Bonus
Attached: searching Baidu for this problem can be a bit of a trap.
Learn how to troubleshoot Java deadlocks and CPU 100% problems
Thank you for reading. If you found this post helpful, please like or share it so more people can see it. I wish you a happy day!
No matter what you do, as long as you keep at it, you will see the difference. On the road, be neither humble nor arrogant!
The blog’s front page: https://aflyun.blog.csdn.net/
Java Programming Technology Paradise: a public account that shares practical programming knowledge.