First, a recommended article

Recently I have written several articles about GC, mainly because we ran into some GC problems in production and I wanted to summarize the relevant knowledge points and troubleshooting ideas.

Some readers have also left messages asking whether I could write an article about hands-on experience.

Our project mainly uses the CMS + ParNew collector combination, so this article focuses on that setup.

While studying, I also read "Analysis and Solution of 9 Common CMS GC Problems in Java" by the Meituan technical team. It is a very high-quality article, covering everything from theory and source-code analysis to common GC problem cases, including analysis paths, root causes, and tuning strategies. It is detailed and comprehensive, and the troubleshooting SOP and root-cause fishbone diagram in the last part are especially nice. Strongly recommended, well worth reading!

Knowing that you like ready-made material, I copied those two diagrams here for quick reference:

GC Problem Cases

The cases I have encountered may not be as rich as those in the article above, but they are real ones, so I am sharing them here for reference, in the hope that they help you avoid stepping into similar pits.

Case 1: Frequent Young GC

We had a task that called an interface very frequently, so we used Guava Cache as a simple in-memory cache. After it went live, we started receiving frequent Young GC alarms, which coincided with the start time of this task.

The GC diagram from the monitor looks something like this:

As you can see, the number of Young GCs spikes at a certain point in time, accompanied by a rapid increase in old generation usage, and eventually a Full GC is triggered.

In this case it was clearly caused by the code change. Heap dump analysis showed that a Guava cache map occupied a large amount of memory.

Looking at the code, the Guava cache was configured with neither a maximum size nor weak references, only an expiration time of a few minutes. The task processed a large volume of data, so it quickly cached a large number of objects in production, triggering frequent Young GCs. Because some of those objects were still referenced and could not be reclaimed (which can be inferred from the size of the Survivor area in the chart), objects that survived enough collections were gradually promoted to the old generation. When old generation usage reached a certain threshold, the Full GC was triggered.

The problem was later resolved by setting a maximum number of cache entries. Another valuable lesson learned, perfect!
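
For illustration, here is a minimal sketch of that fix using Guava's CacheBuilder; the value type, the 10,000-entry cap, and the 5-minute TTL are placeholders, not our production values:

import java.util.concurrent.TimeUnit;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

// Bound the cache so old entries are evicted instead of piling up in the heap.
Cache<String, Object> cache = CacheBuilder.newBuilder()
        .maximumSize(10_000)                   // cap the entry count; excess entries get evicted
        .expireAfterWrite(5, TimeUnit.MINUTES) // keep the existing time-based expiration
        .build();

Alternatively, weakValues() lets the collector reclaim cached objects that are no longer strongly referenced elsewhere, at the cost of a less predictable hit rate.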

Case 2: Both Young and Old GC are frequent

We received frequent Young GC and Old GC alarms in the online gray (canary) environment. The GC chart from monitoring looked something like this:

As the GC chart shows, both Young and Old GCs are very frequent, and each collection reclaims a large number of objects. A simple guess: the application is allocating a lot of objects, most likely including some large ones. The small objects cause the frequent Young GCs, while the large objects (which may be allocated directly in the old generation) cause the frequent Old GCs.

The culprit was again a piece of code. There was a paginated query-and-sort SQL that needed to join multiple tables; with the database and tables sharded, executing the SQL directly performed poorly, and even timed out. So someone took a crude approach: query the full list, load all the data, and sort and paginate it in memory. A List was used to hold the data, and some of the data sets were large, which caused this phenomenon. Heap dump analysis also confirmed the suspicion: objects of type List occupied a lot of space.
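
For context, a sketch of what that crude approach looks like; the Order type, DAO method, and paging variables are hypothetical names for illustration:

import java.util.Comparator;
import java.util.List;

// Pull every matching row into memory, sort, then slice out one page.
// With large result sets this builds huge Lists, which matches what the
// heap dump showed.
List<Order> all = orderDao.findAllMatching(condition); // potentially a very large list
all.sort(Comparator.comparing(Order::getCreateTime).reversed());
int from = Math.min((pageNo - 1) * pageSize, all.size());
int to = Math.min(from + pageSize, all.size());
List<Order> page = all.subList(from, to);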

Case 3: Interface thread pool exhausted plus Full GC

This case started with alarms reporting that an interface's thread pool was full, and each time a Full GC occurred around the same moment. The monitoring chart looked something like this:

Looking at the timeline, the number of Java threads spiked first, and then the Full GC was triggered. After a subsequent restart, the number of Java threads returned to its normal level.

Here’s a bit of trivia: How much memory does a Java thread take by default?

The stack size can be controlled with the -XX:ThreadStackSize parameter. On 64-bit Linux the default is 1 MB (not entirely accurate, since the actual footprint also depends on stack depth). Java 11 includes an optimization that reduces the memory footprint of threads. See this article for details: dzone.com/articles/ho…
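
As a side note, the JDK also exposes a per-thread variant: one of the Thread constructors takes a requested stack size, which the JVM is free to treat as a mere hint. A minimal sketch:

// Request a 256 KB stack for a single thread instead of changing the global
// -Xss / -XX:ThreadStackSize setting. The JVM may round or ignore the value.
Runnable task = () -> System.out.println("working");
Thread worker = new Thread(null, task, "small-stack-worker", 256 * 1024);
worker.start();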

The root cause was a well-known performance problem in Log4j 1: every log call goes through callAppenders, which synchronizes on each Category in the logger hierarchy, so under high concurrency a large number of threads can block waiting for that lock.

void callAppenders(LoggingEvent event) {
    int writes = 0;

    for (Category c = this; c != null; c = c.parent) {
        // Protected against simultaneous call to addAppender, removeAppender,...
        synchronized (c) {
            if (c.aai != null) {
                writes += c.aai.appendLoopOnAppenders(event);
            }
            if (!c.additive) {
                break;
            }
        }
    }

    if (writes == 0) {
        repository.emitNoAppenderWarning(this);
    }
}

The solution was to reduce the amount of logging and upgrade the logging framework to Log4j 2 or Logback.
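
Before such an upgrade lands, one possible stop-gap (a sketch of a general mitigation, not what we ultimately shipped) is Log4j 1's own AsyncAppender, which hands events to a background thread so that very little work happens while the Category lock is held:

import org.apache.log4j.AsyncAppender;
import org.apache.log4j.ConsoleAppender;
import org.apache.log4j.Logger;
import org.apache.log4j.PatternLayout;

// Wrap the real appender in an AsyncAppender: appendLoopOnAppenders now only
// enqueues the event, shortening the time spent inside synchronized(c).
// Caveat: if the internal buffer fills up, callers block again, so this
// mitigates rather than removes the contention.
AsyncAppender async = new AsyncAppender();
async.addAppender(new ConsoleAppender(new PatternLayout("%d %p %t %c - %m%n")));
Logger.getRootLogger().addAppender(async);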

Case 4: Frequent Full GC at application startup

This is an earlier case, so the GC charts are no longer available.

But there are only a few possible causes of frequent Full GC:

  • A call to System.gc()
  • Insufficient old generation space
  • The permanent generation / Metaspace is full

Based on the code and GC logs, the first two possibilities were ruled out, which left the Metaspace being full. In Java 8, -XX:MaxMetaspaceSize has no upper limit by default; the maximum capacity is bounded only by the machine's memory. But -XX:MetaspaceSize, the threshold at which a Full GC is triggered to resize the Metaspace, defaults to about 21 MB. If the application needs a lot of Metaspace (for example, because it loads many classes), Full GC will be triggered repeatedly at startup while the threshold grows. The workaround is to raise the initial threshold via a JVM argument, e.g. -XX:MetaspaceSize=128M.
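
If you want to confirm this kind of tuning, here is a minimal sketch using the standard JMX memory-pool API (nothing project-specific) that prints the current Metaspace usage, so you can see whether the configured threshold is being outgrown:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class MetaspaceCheck {
    public static void main(String[] args) {
        // "Metaspace" is the standard HotSpot pool name on Java 8+.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                System.out.printf("Metaspace used=%d KB, committed=%d KB%n",
                        pool.getUsage().getUsed() / 1024,
                        pool.getUsage().getCommitted() / 1024);
            }
        }
    }
}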

Conclusion

As you can see, most of the cases above were caused by code changes or by application frameworks. A company usually has a default set of JVM parameters, and cases where you actually need to tune JVM parameters to solve a problem are rare.

Sometimes a GC problem is not the root cause itself but a symptom of another problem. During actual troubleshooting, you need to use the logs and the timeline to determine whether GC is really the issue.

How do you judge whether GC is "too frequent" or "too time-consuming"? There are formulas for this, but my view is that GC behavior is acceptable as long as it does not affect the business. If an application's GC behavior differs significantly from its own historical average, or from other similar applications, it is reasonable to suspect a GC problem.

Most GC problems cannot be fully exposed by load testing, and load testing is expensive. They are often first observed during a gray (canary) release, so it is important to watch system monitoring and alarms closely at that stage. If possible, use blue-green deployment to reduce risk.

GC problems are quite complex and require both experience and theoretical knowledge. When you encounter one, analyze and summarize the root cause afterwards. You can also learn from other people's blog posts in your spare time, so that you keep improving your own knowledge system.

A request for support

My name is Yasin, a blogger who insists on writing original technical content. My WeChat official account is: Made a Program

If you have read this far and think my articles are decent, please consider showing some support.

New articles are published first on the official account, where the reading experience is best. You are welcome to follow it.

Every share, follow, like, and comment is the biggest support for me!

There are also learning resources, as well as internal referrals to top Internet companies!
