An application running with the default JFR configuration was recently found to be suffering a safepoint avalanche. This post documents the incident as a cautionary tale.
Key takeaways
- JFR’s default collection period for thread dumps is `everyChunk`: one thread dump is collected per chunk. The default chunk size is 12 MB, and with a very large number of threads a single thread dump can exceed 12 MB. That forces JFR to keep rotating to new chunks, and each rotation fires `everyChunk` again, producing an endless cycle and an avalanche of safepoints.
- In the JFR default event configuration (located at `<JDK directory>/lib/jfr/default.jfc`), every collection period tied to the chunk lifecycle should be handled with care. It is safer to hard-code a fixed period, such as every minute, rather than use the chunk as the trigger, to prevent the problem above.
- CPU context switches occur only for Running threads. This application creates many threads, but few of them have tasks at any one time (relatively few are Running), so thread switches are not frequent, and the threads waiting for tasks consume little CPU.
Background
An online application with a large number of threads showed the following symptom: its threads were idle, a task was submitted, yet the task only started executing about 10 seconds later. There were many similar cases during that period. For example:
2020-12-01 13:07:34.515 INFO [microservice name,298ee1f359f67ff7,298ee1f359f67ff7] [19] [thread globally unique name: scheduling thread] task submitted to the thread pool
2020-12-01 13:07:43.623 INFO [microservice name,298ee1f359f67ff7,e881ba0d86fbc4fc] [19] [thread globally unique name: 24684780-0] start processing
2020-12-01 13:07:44.159 INFO [microservice name,298ee1f359f67ff7,e881ba0d86fbc4fc] [19] [thread globally unique name: 24684780-0] processing completed
The queue was empty and all threads were waiting for a task, which rules out the case where other tasks had filled the core thread pool.
A thread that stays idle yet cannot pick up a task may be stuck entering a safepoint. We checked GC first; GC was normal at the time, proving it was not GC that forced the threads into safepoints:
We then looked at the safepoint log for that period (our online processes have safepoint logging enabled with `-Xlog:safepoint=debug:file=safepoint.log:utctime,level,tags:filecount=50,filesize=100m`) and found that during this window the process was entering safepoints over and over again (we only looked at the "Entering safepoint region" lines):
[2020-12-01T13:07:32.290+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:32.335+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:32.635+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:32.655+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:32.853+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint
[2020-12-01T13:07:32.923+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:32.968+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:33.269+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:33.286+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:33.480+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:33.524+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:33.806+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:33.822+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:33.949+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint
[2020-12-01T13:07:34.019+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:34.064+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:34.346+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:34.361+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:34.510+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:34.555+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:34.839+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:34.861+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:35.023+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint
[2020-12-01T13:07:35.097+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:35.140+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:35.423+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:35.439+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:35.614+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:35.654+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:35.943+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:35.959+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:36.089+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint
[2020-12-01T13:07:36.172+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:36.216+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:36.498+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:36.515+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:36.706+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:36.749+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:37.036+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:37.052+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:37.184+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint
[2020-12-01T13:07:37.258+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:37.301+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:37.624+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:37.648+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:37.906+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:37.955+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:38.248+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:38.265+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:38.431+0000][info ][safepoint] Entering safepoint region: G1CollectForAllocation
[2020-12-01T13:07:38.551+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint
[2020-12-01T13:07:38.620+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:38.686+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:38.978+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:39.001+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:39.214+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:39.269+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:39.552+0000][info ][safepoint] Entering safepoint region: PrintJNI
[2020-12-01T13:07:39.568+0000][info ][safepoint] Entering safepoint region: FindDeadlocks
[2020-12-01T13:07:39.733+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint
[2020-12-01T13:07:39.813+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:39.860+0000][info ][safepoint] Entering safepoint region: PrintThreads
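A quick pipeline makes the pattern in a log like this obvious: count how often each safepoint operation appears. The sample lines below stand in for the real log (the `/tmp` path is just for the sketch) so the pipeline is self-contained:

```shell
# Count safepoint operations by type; on the real log, point grep at safepoint.log.
cat <<'EOF' > /tmp/safepoint-sample.log
[2020-12-01T13:07:32.335+0000][info ][safepoint] Entering safepoint region: PrintThreads
[2020-12-01T13:07:32.853+0000][info ][safepoint] Entering safepoint region: JFRCheckpoint
[2020-12-01T13:07:32.923+0000][info ][safepoint] Entering safepoint region: ClassLoaderStatsOperation
[2020-12-01T13:07:32.968+0000][info ][safepoint] Entering safepoint region: PrintThreads
EOF
# Extract the operation name, then count and rank occurrences.
grep -o 'Entering safepoint region: .*' /tmp/safepoint-sample.log \
  | sort | uniq -c | sort -rn
```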
This looks a lot like JFR’s periodic collection. Next we checked per-thread CPU usage with `top -Hp <pid>`:
top - 01:31:46 up 10:01,  1 user,  load average: 2.16, 2.55, 2.62
Threads: 25390 total,   2 running, 25388 sleeping,   0 stopped,   0 zombie
%Cpu(s): 17.0 us,  4.7 sy,  0.0 ni, 74.0 id,  4.1 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem : 32120452 total,   287936 free, 20438864 used, 11393652 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 11485760 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
63120 root      20   0   26.8g   9.4g  22016 R  73.2 30.8 178:50.29 VM Thread
63130 root      20   0   26.8g   9.4g  22016 S  15.7 30.8  45:19.81 JFR Periodic Ta
63114 root      20   0   26.8g   9.4g  22016 S   6.6 30.8  17:31.55 GC Thread#0
63185 root      20   0   26.8g   9.4g  22016 S   6.6 30.8  44:59.01 log4j2-tf-10-as
63346 root      20   0   26.8g   9.4g  22016 S   6.3 30.8  17:31.89 GC Thread#2
63345 root      20   0   26.8g   9.4g  22016 S   6.0 30.8  17:31.70 GC Thread#3
63128 root      20   0   26.8g   9.4g  22016 S   5.4 30.8  13:35.61 JFR Recorder Th
63134 root      20   0   26.8g   9.4g  22016 S   4.2 30.8  13:21.16 Java
64742 root      20   0   26.8g   9.4g  22016 S   1.2 30.8   1:03.26 XNIO-2 task-6
 2777 root      20   0   26.8g   9.4g  22016 S   1.2 30.8   0:46.15 XNIO-2 task-24
...
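As an aside, the TIDs shown by `top -Hp` are native thread ids, while `jstack` reports them as hex `nid=0x...`; converting the VM Thread’s TID from the capture above gives the string to search for in a thread dump:

```shell
# Convert a native thread id (from `top -Hp`) into jstack's hex nid form.
# 63120 is the VM Thread TID from the capture above.
printf 'nid=0x%x\n' 63120
```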
The thread using the most CPU was the VM Thread, followed by the JFR threads. We suspected that JFR collection was repeatedly bringing all threads to a safepoint. Our JFR event configuration is the default one, `default.jfc` in the `lib/jfr` directory of the JDK:
<event name="jdk.ThreadDump">
<setting name="enabled" control="thread-dump-enabled">true</setting>
<setting name="period" control="thread-dump-interval">everyChunk</setting>
</event>
The default is `everyChunk`, which means a thread dump is taken every time a new chunk is started. So what is a chunk, and when does JFR start a new one?
In JFR, all events (both those generated through the JFR API and those generated inside the JVM) are first stored in each thread’s own thread buffer. When a thread buffer fills up, its contents are flushed into the global buffer. The global buffer is a ring buffer holding the contents flushed from all thread buffers. When the ring buffer runs out of space, then depending on configuration the oldest data is either discarded or flushed to a file; flushing to a file is the default. That file is what we can think of as a chunk.
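The current chunk file and the chunk size can be inspected and tuned at the JVM level; a sketch follows (`<pid>` is a placeholder, and `64m` is an example value, so check your JDK’s `jcmd <pid> help JFR.check` and `java -XX:FlightRecorderOptions:help` output before relying on these exact options):

```shell
# Inspect running recordings; the output includes the destination file (the current chunk).
jcmd <pid> JFR.check

# Raise the maximum chunk size at JVM startup (default was 12 MB in the JDK used here).
java -XX:StartFlightRecording -XX:FlightRecorderOptions=maxchunksize=64m -jar app.jar
```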
A chunk has a limited size, 12 MB by default. **This process, however, has so many threads that a single thread dump exceeds 12 MB**, which forces JFR to keep rotating to new chunks; each rotation triggers `everyChunk` again, creating an endless loop of entering safepoints.
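The feedback condition is simple arithmetic. Using the thread count from the `top` capture and an assumed average serialized stack size (the 2 KB figure is our guess for illustration, not a measurement from the incident):

```shell
# Does one full thread dump overflow a chunk? If yes, every dump rotates the
# chunk, and everyChunk immediately schedules the next dump.
chunk=$((12 * 1024 * 1024))    # default chunk size, 12 MB
stack=$((2 * 1024))            # assumed average serialized stack size per thread
threads=25390                  # thread count from the top capture above
echo $(( threads * stack > chunk ))   # 1 means the dump overflows the chunk
```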
The solution
The first thing that comes to mind with a problem like this is to reduce the number of threads. But due to business requirements, this service’s threads mainly provide isolation and ordered execution, and the number of Running threads at any one time is small. CPU context switches occur only for Running threads: many threads exist, but few have tasks at once, so thread switches are infrequent and the threads waiting for tasks consume little CPU. So reducing the thread count was not urgent.
The final fix: in the JFR default event configuration, handle every chunk-tied collection period with care. It is better to hard-code a fixed period, such as every minute, and not use the chunk as the trigger, to prevent the infinite-collection avalanche caused by constantly switching between chunks.
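Concretely, one way to apply this is to copy `default.jfc`, change the `jdk.ThreadDump` period to a fixed interval, and start recordings with the modified file (the `60 s` value is an example we chose, not a JDK recommendation):

```xml
<event name="jdk.ThreadDump">
  <setting name="enabled" control="thread-dump-enabled">true</setting>
  <setting name="period" control="thread-dump-interval">60 s</setting>
</event>
```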