I mentioned in my previous post that I had been investigating a problem for nearly two weeks. The investigation has finally come to an end, so it is time to summarize it.
The investigation took many detours, partly because of gaps in my perspective and knowledge, and partly because I fell into some traps in my reasoning. I hope this write-up can serve as a warning and a reference for anyone who has to troubleshoot a similar problem later. In addition, many of the troubleshooting methods and ideas came from tips and inspiration from my department lead and the experts in the group, so writing a summary also helps me understand this knowledge more deeply.
The symptom: after the service finishes an HttpClient request to a third-party interface, it writes a log line (A), parses the response into a JSON object, and then writes another log line (B). The time gap between A and B was sometimes very large, anywhere from 100 to 700 ms, whereas normal JSON parsing is very fast and should take less than 1 ms.
GC
First, consider the possible causes of this phenomenon:
A lock in the application could block the method, but there is no lock anywhere on the JSON parsing path, so that possibility is ruled out. At the JVM level, GC can cause STW pauses. At the system level, if the load is high and the operating system is busy, thread scheduling delays could also produce this behavior, but monitoring showed the system load was always very low, so a system-level problem is ruled out as well. We all know that the JVM stops the world during GC, pausing every thread, so the focus turns to GC.
First we checked the GC status with the jstat command and found that full GC was infrequent, only about a hundred runs over two days of uptime, while minor GC happened roughly once every few seconds, and the GC times were always within the normal range.
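For reference, a typical way to watch GC counters with jstat; the &lt;pid&gt; placeholder and the 1000 ms sampling interval below are mine, not values from the original investigation:

    # utilization of each generation plus YGC/FGC counts and times, sampled every second
    jstat -gcutil <pid> 1000
    # one-shot dump of the GC counters for the same process
    jstat -gc <pid>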
Fortunately, our JVM had been started with the -XX:+PrintGCApplicationStoppedTime flag. The flag's name does not quite match what it does: with it enabled, the GC log records not only the STW pauses caused by GC but also STW pauses from other causes, which makes the GC log an important troubleshooting tool.
Checking the GC log, there were indeed abnormal entries.
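The original screenshots are not reproduced in this text version; for reference, the output produced by -XX:+PrintGCApplicationStoppedTime looks roughly like the lines below (the durations are illustrative, not the actual values from the incident):

    Total time for which application threads were stopped: 0.0456718 seconds, Stopping threads took: 0.0001257 seconds
    Total time for which application threads were stopped: 0.0623450 seconds, Stopping threads took: 0.0000890 seconds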
The STW pauses sometimes came very frequently and lasted a long time, with less than 1 ms between them; in other words, after pausing for several tens of milliseconds the application barely resumed before the next STW began. In the case captured in the original screenshot the pauses added up to roughly 120 ms in a row, which is quite enough to make an interface call time out.
Safepoints and biased locks
Safepoint log
So what was causing STW this frequently? Even minor GC does not happen that often.
We know that before an STW pause, the JVM waits for all threads to reach a safepoint, a point where each thread's state is well defined and heap references are not being mutated, so that the JVM can safely do things like GC. There are already plenty of articles about where safepoints are placed and how they are implemented, so I will not repeat that here; the focus is on the safepoint log.
The safepoint log is output by the JVM after all threads have entered the safepoint and the world is stopped. It records how long it took to reach the safepoint, how long the STW pause lasted, and which step inside the safepoint consumed the time. From it we can analyze why the JVM stopped the world.
Starting the service with the -XX:+UnlockDiagnosticVMOptions -XX:+PrintSafepointStatistics -XX:+LogVMOutput -XX:LogFile=./safepoint.log parameters exports the safepoint log to safepoint.log.
In the safepoint log I found a lot of output similar to the following:
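The original screenshot is again not reproduced here; PrintSafepointStatistics output has roughly this shape (column layout from memory, values purely illustrative):

             vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
    12.345: RevokeBias                       [      68          1              2    ]      [     0     0     0     0     0    ]  1
    12.347: RevokeBias                       [      68          0              0    ]      [     0     0     0     0     0    ]  0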
As the vmop column shows, the reason for entering the safepoint is RevokeBias, which, according to the references I found, is the operation that revokes biased locks.
Biased locking
Biased locking is a JVM optimization of locking. The JVM's statistics showed that most locks are only ever acquired by the same thread, the chance of contention is very small, and acquiring and releasing a lock through system calls is relatively expensive. Biased locking was therefore introduced: by default it assumes the lock will not be contended and "biases" the lock toward the first thread that acquires it.
After the first acquisition, the JVM no longer needs a system instruction to take the lock. Instead it marks the lock as biased: the biased-lock flag in the object header's MarkWord is set to 1, and the ID of the owning thread is recorded there. When that thread later asks for the lock again and finds it is already biased toward itself, it can simply use it. Moreover, a thread holding a biased lock never releases it on its own; only when another thread contends for the lock is the bias revoked, after which the lock is inflated to a heavier-weight lock.
Biased locking greatly reduces the cost of locking when only a single thread uses the lock, but in a highly concurrent multi-threaded environment threads contend frequently, and revoking a bias has to wait for a global safepoint to avoid conflicts, so every bias revocation carries a significant cost.
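Below is a minimal sketch of the two situations described above; the class name, loop count and sleep are mine, not from the original post. Run on a JDK 8 HotSpot JVM with -XX:+PrintSafepointStatistics, it should produce a RevokeBias entry once the second thread touches the monitor:

    // Minimal sketch: a monitor used by one thread gets biased toward it;
    // the first acquisition by another thread forces a bias revocation,
    // which is performed inside a global safepoint (STW).
    public class BiasedLockDemo {
        private static final Object LOCK = new Object();
        private static long counter = 0;

        public static void main(String[] args) throws InterruptedException {
            // On JDK 8 biased locking only kicks in a few seconds after startup
            // (-XX:BiasedLockingStartupDelay=4000 by default), hence the sleep.
            Thread.sleep(5000);

            // Phase 1: only the main thread locks -> acquisitions are nearly free.
            for (int i = 0; i < 1_000_000; i++) {
                synchronized (LOCK) {
                    counter++;
                }
            }

            // Phase 2: a second thread takes the same monitor -> the JVM revokes
            // the bias at a safepoint; with many such monitors and threads this
            // is exactly the RevokeBias STW seen in the safepoint log.
            Thread other = new Thread(() -> {
                synchronized (LOCK) {
                    counter++;
                }
            });
            other.start();
            other.join();
            System.out.println("counter = " + counter);
        }
    }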
Having understood why the JVM was entering STW so frequently, we modified the service's startup parameters, added -XX:-UseBiasedLocking to disable biased locking, and then observed the service again.
The frequency of the pauses dropped by about half, but they still occurred, so the investigation was back to square one.
Narrowing the problem down is partly guesswork: list the factors that might be responsible, then swap them out one at a time and try to reproduce the issue, in order to determine which one is the culprit.
The candidate problem points were HttpClient, Hystrix and Log4j2, and ruling them out one by one eventually pointed to Log4j2.
Although the problem was narrowed down to Log4j2, the logs still have to be written, so we needed to find out what exactly was going wrong.
The next step was to check the locks in Log4j2 with BTrace. The first suspicion was that contention on a lock inside the Log4j2 code was blocking the log output.
Going through the source code, I counted three places where locks are used:
- The rollover() method, which takes a lock when it detects that the log file needs to be rolled over.
- The encodeText() method, which converts log text in various character sets into byte arrays. Large log entries have to be encoded in blocks, and to avoid interleaving blocks from different threads the method is locked with the synchronized keyword.
- The flush() method, which likewise needs a lock to avoid interleaving output from different log entries.
To find out which of them was the problem, I turned to BTrace, a Java dynamic tracing tool.
With BTrace I injected probes before and after each of the three methods to print their execution time, ran a load test and waited for the problem to appear, and finally found the slow method: encodeText().
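A rough sketch of the kind of probe used (my reconstruction, not the original script), assuming the classic com.sun.btrace API; the clazz/method values here point at Log4j2's TextEncoderHelper and would need to be adjusted to the exact class and Log4j2 version being traced:

    import com.sun.btrace.annotations.*;
    import static com.sun.btrace.BTraceUtils.*;

    @BTrace
    public class Log4j2EncodeProbe {

        // Fires on return from encodeText(); @Duration gives the elapsed time in nanoseconds.
        @OnMethod(
            clazz = "org.apache.logging.log4j.core.layout.TextEncoderHelper",
            method = "encodeText",
            location = @Location(Kind.RETURN)
        )
        public static void onEncodeText(@Duration long durationNs) {
            // Only report calls slower than 10 ms to keep the output readable.
            if (durationNs > 10_000_000L) {
                println(strcat("encodeText took ns: ", str(durationNs)));
            }
        }
    }

It would be attached to the running service with something like btrace <pid> Log4j2EncodeProbe.java; probes around rollover() and flush() follow the same pattern.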
Scouring the code, however, I could not find anything wrong with the encodeText() method itself.
Analyze problems using JMC
At this point one of the experts in the group suggested that I use JMC (Java Mission Control) to capture abnormal events. I added the following parameters to docker-compose to enable JFR for that environment.
    environment:
      - JFR=true
      - JMX_PORT=port
      - JMX_HOST=ip
      - JMX_LOGIN=user:pwd
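These environment variables (JFR, JMX_PORT and so on) are specific to the Docker image we use, not standard JVM options. For anyone reproducing this on a plain JVM, a rough equivalent on Oracle JDK 8 would be to enable JFR at startup and drive a recording with jcmd (the recording name and file paths below are placeholders):

    # enable JFR at JVM startup (needed on Oracle JDK 8 before 8u262)
    java -XX:+UnlockCommercialFeatures -XX:+FlightRecorder -jar app.jar

    # start a two-minute recording on the running JVM and write it to a file,
    # then open the resulting .jfr file in Java Mission Control (jmc)
    jcmd <pid> JFR.start name=pause-check settings=profile duration=120s filename=/tmp/pause-check.jfr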
Searching the recording for events near the pause times, one item stood out in the duration statistics: a call to RandomAccessFile.write() that took 1063 ms, and the thread ID matched exactly.
Looking into this time-consuming method, it ends up calling the native write() method…
Digging below the native method would take the investigation into the operating system. My current suspicion is that it is a Docker-related problem; I will keep investigating and update this post when there is a conclusion.
Solution
The problem still has to be dealt with, and the fixes are fairly simple:
- Have the service write fewer logs; logging too much is what triggers the problem.
- Switch Log4j2 to asynchronous logging, accepting that log entries may be lost when the buffer fills up or the service is restarted.
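For reference, the usual way to make all Log4j2 loggers asynchronous is to put the LMAX Disruptor on the classpath and set a context-selector system property (a minimal sketch; the exact property name varies slightly across Log4j2 versions):

    # requires com.lmax:disruptor on the classpath
    java -DLog4jContextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector -jar app.jar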
Summary
I did learn a lot in the course of the investigation and became more familiar with the Java ecosystem, but I think the more valuable takeaway is the way of thinking about troubleshooting, so I have summarized a general troubleshooting procedure as a checklist for future problems.
- Try to analyze as many problem cases as possible and find what they have in common, to avoid focusing on the wrong thing. For example, looking at several cases at the very start of this investigation already showed that the pause could happen at any log statement, which directly ruled out any influence from HttpClient and Hystrix.
- Reproduce the problem in a controlled environment. Being able to reproduce the problem in a test environment lets us adjust the troubleshooting strategy on the fly and greatly shortens the cycle. Of course, the environment must stay consistent with production, otherwise it can lead the thinking astray; for example, when I reproduced the problem under load, the heap was set too small, which caused frequent GC and made me mistake the pauses for GC pauses.
- Compare changes and form hypotheses. When a service misbehaves, we always ask what was released recently; programs have inertia, and a system that has not changed tends to keep working. The comparison is not only along the time dimension: we can also compare differences between multiple services.
- Locate the problem by elimination. We usually come up with several possible causes; during the investigation, change one variable at a time and compare whether the problem still occurs, so as to work out how each variable affects it.
- Solve it. Once the problem is located, the fix is usually as simple as changing a line of code.
There is one more point that runs through the whole process: data. The investigation must be backed by data; convince others of your conclusion by comparing totals and percentages before and after.
Author: shen and books
Reference: club.perfma.com/article/190…