I. Problem scenario

The test environment suddenly raised a CPU spike alert. The logs showed that GC was running continuously, and 8 GC threads were saturating the CPU.

II. Problem investigation

The first step is to preserve the scene by dumping the thread stacks and the heap.

1. Dump the thread stacks

jstack 85090 > code-api.log

2. Dump the heap

jmap -dump:format=b,file=heapdump1.hprof 85090

3. Analyze the dump file

As you can see in the figure, the number of StackTraceElement instances is staggering.

Each StackTraceElement represents a single stack frame. All stack frames (except the one on top of the stack) represent a method call.

So it is fairly certain that some method is calling itself recursively without end, constantly opening new stack frames.
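The link between recursion depth and the number of StackTraceElement objects can be checked with a small experiment. This is a minimal sketch, not code from the affected service; the class and method names are made up for illustration. It creates an exception at two different recursion depths and compares the length of the captured stack trace.

```java
class StackTraceDepthDemo {
    // Recurse `depth` levels, then create an exception and return how many
    // StackTraceElement objects its stack trace holds.
    static int framesAtDepth(int depth) {
        if (depth == 0) {
            return new RuntimeException("probe").getStackTrace().length;
        }
        return framesAtDepth(depth - 1);
    }

    public static void main(String[] args) {
        int shallow = framesAtDepth(10);
        int deep = framesAtDepth(200);
        System.out.println("frames captured at depth 10:  " + shallow);
        System.out.println("frames captured at depth 200: " + deep);
    }
}
```

Every exception created deep inside a recursion carries one StackTraceElement per frame, so many exceptions thrown from a deep stack produce a very large number of these objects.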

4. Analyze the thread dump

The thread dump shows that all worker threads keep calling the same method in AlarmService, so the problem is basically located.

5. Check the AlarmService

It turns out that this method makes an RPC request. The server in the test environment had been migrated, so the original IP address was no longer reachable. Changing the configured IP address to the domain name restores the connection.

III. Problem analysis

1. View Feign configurations

The RPC call uses Feign and is configured with NEVER_RETRY.

2. AOP

The alarm mechanism is implemented with an AOP aspect that catches exceptions and reports them. When the alarm service cannot connect to the alarm center server, it throws a timeout exception after 5 seconds. That timeout exception is caught by the AOP aspect, which calls the alarm service again to report it to the alarm center. The alarm center is still unreachable, so this call times out as well, the new timeout exception is caught and reported in turn, and every failed report opens another level of recursion. At this point the problem is quite clear.
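The loop described above can be reproduced in plain Java without Spring. This is a simplified simulation under stated assumptions: `aroundAdvice`, `sendAlarm`, and `callAlarmCenter` are hypothetical stand-ins for the aspect, the alarm service, and the RPC, and `MAX_DEPTH` is a safety cap that exists only in the demo (the real incident had no such cap, and each level blocked for the 5-second timeout).

```java
class AlarmLoopDemo {
    static final int MAX_DEPTH = 5;   // demo-only cap so the simulation terminates
    static int reportAttempts = 0;

    // Stand-in for the RPC to the unreachable alarm center: always fails.
    static void callAlarmCenter(String message) {
        throw new RuntimeException("timeout connecting to alarm center");
    }

    // Stand-in for the alarm service method, which the aspect also covers.
    static void sendAlarm(String message, int depth) {
        aroundAdvice(() -> callAlarmCenter(message), depth);
    }

    // Stand-in for the AOP advice: run the target and report any exception.
    static void aroundAdvice(Runnable target, int depth) {
        try {
            target.run();
        } catch (RuntimeException e) {
            reportAttempts++;
            if (depth >= MAX_DEPTH) {
                return;               // the real code had no such exit
            }
            // Reporting the failure fails too, which triggers the advice again.
            sendAlarm(e.getMessage(), depth + 1);
        }
    }

    // Runs one failing business call through the advice and counts report attempts.
    static int simulate() {
        reportAttempts = 0;
        aroundAdvice(() -> { throw new RuntimeException("business failure"); }, 0);
        return reportAttempts;
    }

    public static void main(String[] args) {
        System.out.println("report attempts for one failure: " + simulate());
    }
}
```

A single original failure produces one report attempt per nesting level, and only the artificial cap stops it.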

IV. Solutions

This exposed a potential pitfall: if the alarm center server is unstable, it will affect the normal operation of online services. That is unacceptable, so this situation must be prevented from happening again.

1. try-catch

  • Adding a try-catch inside the AlarmService method had no effect: no exception was caught there.
  • Further analysis showed that the exception is thrown when the lower layer of the service calls the RPC interface, and AOP intercepts it before the service-level catch can.
  • Adding the try-catch at the RPC layer solved the problem.

However, wrapping every RPC call in a try-catch is not an elegant solution.
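For reference, the RPC-layer try-catch might look roughly like the sketch below. `AlarmCenterClient` and `reportQuietly` are hypothetical names used for illustration, not the project's actual Feign interface. The idea is that a failure to reach the alarm center is logged locally and never propagates to a place where the aspect could pick it up.

```java
class SafeAlarmClientDemo {
    // Hypothetical stand-in for the Feign client interface.
    interface AlarmCenterClient {
        void report(String message);
    }

    // Wraps the RPC so that an unreachable alarm center never throws to the caller.
    static boolean reportQuietly(AlarmCenterClient client, String message) {
        try {
            client.report(message);
            return true;
        } catch (RuntimeException e) {
            // Log locally only; do not raise another alarm about the alarm center.
            System.err.println("alarm center unreachable: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        AlarmCenterClient broken = msg -> { throw new RuntimeException("timeout"); };
        System.out.println("reported: " + reportQuietly(broken, "disk full"));
    }
}
```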

2. AOP

The problem can instead be resolved through the AOP configuration. The original pointcut

@Pointcut("execution(* com.qbq.test..*.*(..))")

Modified to

@Pointcut("execution(* com.qbq.test..*.*(..)) && !execution(* com.qbq.test.alarm..*.*(..))")

That is, the AOP aspect no longer intercepts exceptions thrown in the alarm package.
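To make the effect of the exclusion concrete, the sketch below roughly emulates the modified pointcut with plain string matching on class names. It is only an approximation of how AspectJ matches types, and the class names passed in `main` (such as `com.qbq.test.order.OrderService`) are hypothetical examples.

```java
class PointcutFilterDemo {
    static final String BASE = "com.qbq.test.";
    static final String EXCLUDED = "com.qbq.test.alarm.";

    // Rough emulation of:
    // execution(* com.qbq.test..*.*(..)) && !execution(* com.qbq.test.alarm..*.*(..))
    static boolean adviceApplies(String fullyQualifiedClassName) {
        return fullyQualifiedClassName.startsWith(BASE)
                && !fullyQualifiedClassName.startsWith(EXCLUDED);
    }

    public static void main(String[] args) {
        // Hypothetical class names, used only to show which side of the filter they fall on.
        System.out.println(adviceApplies("com.qbq.test.order.OrderService"));
        System.out.println(adviceApplies("com.qbq.test.alarm.AlarmService"));
    }
}
```

Business classes are still covered by the aspect, while anything under the alarm package is left alone, so a failed alarm report can no longer trigger another report.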

V. Some questions

1. Why doesn't the infinite recursion cause a StackOverflowError?

Local verification showed that the reason is the 5-second request timeout. If the timeout is changed to 5 ms, a StackOverflowError is thrown quickly and the stack blows up. With a 5-second timeout the stack does not overflow, because each call is still waiting to finish and the recursion deepens only slowly. When many exceptions are reported at the same time, StackTraceElement objects quickly fill up the young generation, causing the JVM to run young GCs continuously.
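A back-of-the-envelope calculation shows why the timeout matters so much. The sketch below assumes, purely for illustration, that a thread overflows at about 10,000 nested calls; the real limit depends on the thread stack size (-Xss) and on how large each frame is, and one level of the real recursion spans several frames.

```java
class OverflowTimeEstimate {
    // Seconds until one thread reaches maxFrames nested calls,
    // if every level blocks for timeoutMillis before recursing.
    static double secondsToOverflow(long timeoutMillis, int maxFrames) {
        return timeoutMillis * (double) maxFrames / 1000.0;
    }

    public static void main(String[] args) {
        int assumedMaxFrames = 10_000;  // assumption, not a measured value
        System.out.println("5 ms timeout: " + secondsToOverflow(5, assumedMaxFrames) + " s");
        System.out.println("5 s timeout:  " + secondsToOverflow(5_000, assumedMaxFrames) + " s");
    }
}
```

Under that assumption, a 5 ms timeout reaches the limit in under a minute, while a 5-second timeout would need roughly 14 hours, long enough for the GC pressure to show up first.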

2. Is the CPU spike really caused by GC?

Because the log was constantly printing GC information, it was natural to assume that GC was the cause.

After reproducing the problem locally, Arthas's thread command was used to inspect what the threads were doing. It showed 8 GC threads with high CPU usage, which confirms that the CPU spike was due to GC.
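GC activity can also be cross-checked from inside the JVM with the standard `java.lang.management` API. The sketch below is a complementary check, not what was used in the original investigation; the class name is made up. It sums the collection count and accumulated collection time across all collectors. Note that the reported time is elapsed time in milliseconds, not CPU time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcActivityProbe {
    // Total number of collections so far; collectors reporting -1 (undefined) are skipped.
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    // Total accumulated collection time in milliseconds; -1 values are skipped.
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long time = gc.getCollectionTime();
            if (time > 0) {
                total += time;
            }
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        long countBefore = totalGcCount();
        long timeBefore = totalGcTimeMillis();
        Thread.sleep(1000);
        System.out.println("collections in the last second: " + (totalGcCount() - countBefore));
        System.out.println("GC time in the last second (ms): " + (totalGcTimeMillis() - timeBefore));
    }
}
```

If the number of collections per second stays high while the application is doing little real work, that supports the conclusion that GC is what is consuming the CPU.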