Overview

Java performance analysis is both a science and an art. The science lies in the fact that performance analysis involves a great many numbers, measurements, and statistics; the art lies in applying knowledge, experience, and intuition. The tools and methods of performance analysis vary, but the overall process is much the same. This article shares some proven Java performance analysis tips to help users understand and apply them better.

Tip 1: Thread stack analysis

Thread stack analysis is a snapshot analysis of running Java threads. It is a lightweight analysis that users can try first when they are not sure what the application's performance problem is. While there is no universal standard for deciding whether a Java thread is abnormal, there are several metrics users can use to make a quantitative assessment. The following four detection metrics are shared:

1) Thread deadlock check. Deadlock detection is a very valuable check. When threads deadlock, the usual consequences are wasted system resources or degraded service capacity, so a deadlock needs to be handled promptly once found. Deadlock detection shows the deadlock relationships between threads and the corresponding stack traces, and analyzing them locates the code that triggered the deadlock. The deadlock model in Figure 1 shows a complex four-thread deadlock scenario.
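As a complement to reading thread dumps by hand, deadlocks can also be detected programmatically through the standard ThreadMXBean API. The following is a minimal sketch, not part of any particular tool; the output format is illustrative:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockChecker {
    public static void main(String[] args) {
        ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();
        // Returns the IDs of threads deadlocked on monitors or ownable synchronizers, or null
        long[] deadlockedIds = threadMXBean.findDeadlockedThreads();
        if (deadlockedIds == null) {
            System.out.println("No deadlock detected");
            return;
        }
        // Dump the stack of each deadlocked thread to locate the triggering code
        ThreadInfo[] infos = threadMXBean.getThreadInfo(deadlockedIds, Integer.MAX_VALUE);
        for (ThreadInfo info : infos) {
            System.out.println(info.getThreadName() + " is waiting on " + info.getLockName()
                    + " held by " + info.getLockOwnerName());
            for (StackTraceElement frame : info.getStackTrace()) {
                System.out.println("    at " + frame);
            }
        }
    }
}
```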

2) Thread state statistics. Thread state statistics group and summarize the running threads by their state. Users who do not fully understand their business load often configure a very generous thread count, and too many threads can lead to performance degradation or resource exhaustion. As shown in Figure 2, more than 90% of the threads are in the blocked or waiting state, so appropriately reducing the number of threads can cut the overhead of thread scheduling and avoid unnecessary resource waste.



By contrast, in Figure 3 more than 90% of the threads are in the running state; further analysis may reveal a thread leak. At the same time, with so many threads running, the overhead of thread switching is very high.
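This kind of state summary can be reproduced without a full analysis tool by grouping a thread snapshot by Thread.State. A minimal sketch using ThreadMXBean; the output layout is illustrative:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

public class ThreadStateStats {
    public static void main(String[] args) {
        ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        // Snapshot all live threads (skip locked monitors/synchronizers for lower overhead)
        for (ThreadInfo info : threadMXBean.dumpAllThreads(false, false)) {
            counts.merge(info.getThreadState(), 1, Integer::sum);
        }
        int total = counts.values().stream().mapToInt(Integer::intValue).sum();
        counts.forEach((state, count) ->
                System.out.printf("%-13s %4d (%.1f%%)%n", state, count, 100.0 * count / total));
    }
}
```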

3) Java thread CPU usage check. The CPU usage of Java threads is measured and sorted, and the stacks of threads with very high CPU usage are analyzed to quickly locate program hot spots. As shown in Figure 4, the CPU utilization of the first task thread has reached 100%, and the developer can decide whether to optimize the code based on the business logic.
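A similar ranking can be approximated in-process by sampling per-thread CPU time twice and sorting by the delta. A rough sketch, assuming thread CPU time measurement is supported and enabled on the JVM; the sampling window and output are illustrative:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

public class ThreadCpuTop {
    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();
        Map<Long, Long> before = new HashMap<>();
        for (long id : threadMXBean.getAllThreadIds()) {
            before.put(id, threadMXBean.getThreadCpuTime(id)); // nanoseconds, -1 if unavailable
        }
        Thread.sleep(5000); // sampling window

        Map<Long, Long> deltas = new HashMap<>();
        for (long id : threadMXBean.getAllThreadIds()) {
            long now = threadMXBean.getThreadCpuTime(id);
            Long prev = before.get(id);
            if (now >= 0 && prev != null && prev >= 0) {
                deltas.put(id, now - prev);
            }
        }
        // Print the five busiest threads in the window
        deltas.entrySet().stream()
                .sorted((a, b) -> Long.compare(b.getValue(), a.getValue()))
                .limit(5)
                .forEach(e -> System.out.printf("thread %d used %.1f ms CPU%n",
                        e.getKey(), e.getValue() / 1_000_000.0));
    }
}
```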



4) Check the number of GC threads

The number of GC threads is a metric that users often overlook. It is easy to ignore the system's resource situation when setting the number of parallel GC threads, or to casually deploy the application onto a physical machine with many CPU cores. As shown in Figure 5, in a 4-core, 8 GB container the G1 collector was running with 9 parallel GC threads (the number of concurrent GC threads is usually 1/4 of the number of parallel GC threads), which means that up to nine parallel GC threads may appear when a GC occurs; in that case CPU resources are drained in a short time and the business is directly affected. So when using a GC collector such as CMS or G1, set, or at least pay attention to, the number of GC threads.
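One way to verify what the JVM actually chose inside a container is to read the flags at runtime. A small sketch using the HotSpot diagnostic MXBean; the flag names apply to HotSpot-based JDKs:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class GcThreadConfig {
    public static void main(String[] args) {
        HotSpotDiagnosticMXBean hotspot =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // Threads used for stop-the-world GC phases
        System.out.println("ParallelGCThreads = " + hotspot.getVMOption("ParallelGCThreads").getValue());
        // Threads used for concurrent GC work (typically about 1/4 of ParallelGCThreads)
        System.out.println("ConcGCThreads = " + hotspot.getVMOption("ConcGCThreads").getValue());
        // How many CPUs the JVM believes it has (affected by container limits)
        System.out.println("availableProcessors = " + Runtime.getRuntime().availableProcessors());
    }
}
```

If the values look out of proportion to the container quota, they can be pinned explicitly with -XX:ParallelGCThreads and -XX:ConcGCThreads.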

Tip 2: GC log analysis

GC log analysis works on the data recorded by the Java program's GC, and collecting this data requires enabling certain options (JDK 8: -Xloggc:logs/gc-%t.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps; JDK 11: -Xlog:gc*:logs/gc-%t.log:time,uptime). The GC log analysis results describe the program's memory-collection behavior over a past period. By analyzing this information, users can easily derive GC parameter tuning and even Java code optimizations. The following three analysis metrics are shared:

1) GC throughput. The throughput describes the percentage of the JVM's running time that is available for business processing, that is, non-GC time. The higher the value, the less time is spent in GC and the better the JVM performs. As a rule of thumb, this value should not fall below 90%; otherwise the overhead introduced by the JVM itself will seriously affect business performance. Figure 6 shows the log analysis results of a JDK 8 CMS collector running for about 3 hours. The throughput is above 99.2%, so the performance loss caused by the JVM (GC) is relatively low.
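The same throughput figure can be approximated at runtime from the collector MXBeans. A rough sketch; it only counts time reported by the collectors, so it is an approximation of what a log analyzer computes:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcThroughput {
    public static void main(String[] args) {
        long gcTimeMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            gcTimeMs += gc.getCollectionTime(); // accumulated collection time in milliseconds
        }
        long uptimeMs = ManagementFactory.getRuntimeMXBean().getUptime();
        double throughput = 100.0 * (uptimeMs - gcTimeMs) / uptimeMs;
        System.out.printf("GC time: %d ms, uptime: %d ms, throughput: %.2f%%%n",
                gcTimeMs, uptimeMs, throughput);
    }
}
```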

2) GC pause time. The GC pause time is the time for which business threads must be stopped during a GC, and it needs to stay within a reasonable range. If most pauses are longer than expected (that is, than what the user can accept), it may be necessary to adjust the GC parameters and heap size, or even the number of parallel GC threads. As shown in Figure 7, more than 95% of the GC pauses are within 40 ms; pauses of more than 100 ms can be a major contributor to latency spikes in business requests. To reduce fluctuation in pause times, you can switch to a collector such as G1 GC or ZGC, or adjust the number of parallel threads and other GC parameters.

3) GC reclamation scatter plot. The scatter plot reflects the distribution of the amount of memory freed by each GC. As shown in Figure 8, each GC frees roughly the same amount of memory, indicating that memory reclamation is relatively stable. If there are large fluctuations or many Full GCs, however, it may be that insufficient space in the young generation is causing a large amount of promotion. If each GC frees relatively little memory, it may be that the G1 GC's adaptive algorithm has shrunk the young generation, and so on. Because a scatter plot presents limited data, it generally needs to be analyzed together with other metrics and the user's JVM parameters.

Tip 3: JFR event profiling

JFR, short for Java Flight Recorder, is the event-based JDK monitoring and recording framework built into the JVM. In the community, JFR was open-sourced in OpenJDK 11 and later backported to OpenJDK 8 (8u262 and above), using the same unified interface and the jcmd operation command. Since JFR recording generally has little impact on the application (the default configuration's performance impact is less than 1%), it is suitable for long-term recording. In addition, JFR collects rich information such as Runtime, GC, Thread Stack, Heap, and IO data, which makes it very convenient for users to understand the running state of a Java program. JFR records more than 100 kinds of events, and for a complex program a recording of often less than 10 minutes can exceed 500 MB, so users tend not to look at all of the information when analyzing. The following analysis points, which commonly relate to business performance, are shared:
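Besides the jcmd command, a recording can also be controlled from inside the application via the jdk.jfr API (JDK 11+, or OpenJDK 8u262+). A minimal sketch; the file name and durations are chosen arbitrarily for illustration:

```java
import java.nio.file.Paths;
import java.time.Duration;
import jdk.jfr.Configuration;
import jdk.jfr.Recording;

public class JfrRecordingDemo {
    public static void main(String[] args) throws Exception {
        // Use the built-in "default" configuration (low overhead)
        Configuration config = Configuration.getConfiguration("default");
        try (Recording recording = new Recording(config)) {
            recording.setDuration(Duration.ofMinutes(5)); // stop automatically after 5 minutes
            recording.start();

            // ... run the workload to be profiled ...
            Thread.sleep(10_000);

            // Dump whatever has been recorded so far to a file for offline analysis
            recording.dump(Paths.get("app-profile.jfr"));
        }
    }
}
```

The command-line equivalent is jcmd with the JFR.start, JFR.dump, and JFR.stop commands on the target process.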

1) CPU usage sampling. The default CPU sampling interval is 1 s, which basically reflects the average CPU usage of the current process over time. It can be examined when the CPU is persistently high, or when it spikes occasionally, as shown in Figure 9. Further investigation showed that the CPU spikes here coincided with GC trigger times, so it was preliminarily confirmed that the CPU change was caused by GC.

2) GC configuration check. The GC configuration helps us understand the current process's GC collector and its main parameters. Because different collectors have different characteristics, parameters such as heap space and trigger thresholds are also important to the analysis; they help us understand the GC collection process. Take the G1 collector shown in Figure 10: the maximum heap is 8 GB and the GC pause time is around 40 ms (the default target is 200 ms), far lower than expected. Further analysis of the parameters found that a NewRatio value of 2 had been set (a mistake easily made by users unfamiliar with G1 GC), which caused frequent young-generation GCs, while the data showed that no mixed GC was ever triggered. To raise heap utilization, we can remove the NewRatio parameter, increase the maximum proportion of the young generation (since no mixed GC is triggered, the amount collected from the old generation is very low), lower the region-collection threshold, and so on. After optimization, heap usage rose from 4 GB to 7 GB, the young GC interval increased from about one every 20 s to one every 40 s on average, and the GC pause time did not change significantly.

3) Method sampling flame graph. The flame graph shows statistics of how often each method appears in the samples; the larger a method's share, the more often it is executed. Because each sample contains the full stack, the result is very intuitive and a great help for performance optimization. As shown in Figure 11, it is clear that groupheap.match accounts for nearly 30% of the samples and can be treated as a performance optimization point.

4) IO check. IO problems usually show up as a sudden change in the program's processing performance, such as a drop or a surge. For example, the amount of data read from a socket soars and the CPU spent on business processing skyrockets; or, because more data needs to be written out, business threads block and processing capacity drops. As shown in Figure 12, the IO behavior during the monitoring period can be judged from the read/write trend charts.
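When digging into IO from a JFR file, the raw socket read/write events can also be aggregated directly with the jdk.jfr.consumer API. A rough sketch; the field names follow the standard jdk.SocketRead/jdk.SocketWrite events, and the file name is illustrative:

```java
import java.nio.file.Paths;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

public class SocketIoSummary {
    public static void main(String[] args) throws Exception {
        long bytesRead = 0;
        long bytesWritten = 0;
        // Read every event in the recording and keep only the socket IO ones
        for (RecordedEvent event : RecordingFile.readAllEvents(Paths.get("app-profile.jfr"))) {
            String name = event.getEventType().getName();
            if ("jdk.SocketRead".equals(name)) {
                bytesRead += event.getLong("bytesRead");
            } else if ("jdk.SocketWrite".equals(name)) {
                bytesWritten += event.getLong("bytesWritten");
            }
        }
        System.out.printf("socket read: %d bytes, socket write: %d bytes%n", bytesRead, bytesWritten);
    }
}
```

Note that the recording configuration applies duration thresholds to IO events, so very short reads and writes may not appear in the file.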

Tip 4: Heap content analysis

Heap content analysis is a common way to analyze the cause of a Java OutOfMemoryError (OOM). OOM cases include heap space overflow, metaspace overflow, stack space overflow, and direct memory overflow, but not all of them can be diagnosed through heap content analysis. Given a heap dump file, the presence of a memory leak cannot be determined with certainty, but it can be judged through some quantitative metrics or agreed conditions, and finally confirmed by the developer or tester. Here are three valuable checks to share:
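The heap dump itself can be captured either with jmap/jcmd or programmatically when memory pressure is detected. A minimal sketch using the HotSpot diagnostic MXBean; the output path is illustrative:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    public static void main(String[] args) throws Exception {
        HotSpotDiagnosticMXBean hotspot =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // live=true dumps only reachable objects, which keeps the file smaller
        hotspot.dumpHeap("heap-dump.hprof", true);
        System.out.println("heap dump written to heap-dump.hprof");
    }
}
```

For OOM analysis specifically, running with -XX:+HeapDumpOnOutOfMemoryError lets the JVM write the dump automatically at the moment of failure.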

1) Large object distribution check. Large object statistics help us understand what share of memory these objects consume and whether their existence is reasonable. Too many large objects that cannot be freed exhaust memory faster and lead to OOM. The large object check is usually compared against a full analysis of all objects, as shown in Figure 13.

2) Class loading check. Class loading statistics summarize all classes currently loaded by the program and are important data for calculating metaspace occupancy. Loading too many classes also leads to a large metaspace footprint. In scenarios such as RPC, caching loaded class information can easily trigger OOM.
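The corresponding runtime numbers can be read cheaply from the class loading MXBean, which is useful for spotting an ever-growing loaded-class count before it becomes a metaspace OOM. A minimal sketch:

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

public class ClassLoadStats {
    public static void main(String[] args) {
        ClassLoadingMXBean classLoading = ManagementFactory.getClassLoadingMXBean();
        System.out.println("currently loaded: " + classLoading.getLoadedClassCount());
        System.out.println("total loaded:     " + classLoading.getTotalLoadedClassCount());
        System.out.println("unloaded:         " + classLoading.getUnloadedClassCount());
    }
}
```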

3) Object leak check. First, three concepts:
Shallow heap: the memory occupied by an object itself, independent of its contents and determined only by its structure.
Deep heap: the memory that can actually be freed once the object is collected by GC, that is, the sum of the shallow heaps of all objects reachable only through it (its dominator tree).
Dominator tree: in the object reference graph, if every path to object B passes through object A, then A dominates B; if A is the dominator closest to B, A is B's immediate dominator.
According to the GC policy, objects in the heap can be in only two states: reachable from GC roots, or unreachable from GC roots. Unreachable objects are collected by the GC and their memory is returned to the system, while reachable objects are those directly or indirectly referenced by the user. Object leak checking therefore targets objects that are still referenced by the user, and thus cannot be freed, but will never be used again. Object leakage is not absolute but relative, and there is generally no exact standard, yet it can be assessed through an object's deep heap size. For example, if you find that a HashMap holds 4,844 objects (as shown in Figure 14) and calculate its shallow heap to be only about 115 KB, you may feel there is no problem. But calculating the object's deep heap shows it is more than 500 MB. In that case, if the HashMap is never released and new keys keep being added, the heap will eventually be exhausted and an OOM will occur.
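The HashMap case above typically comes from a pattern like the one below, where a long-lived map keeps accumulating entries whose values make up a large deep heap. A deliberately simplified illustration; the class and field names are made up:

```java
import java.util.HashMap;
import java.util.Map;

public class LeakyCache {
    // Long-lived reference: everything reachable from this map can never be collected
    private static final Map<String, byte[]> CACHE = new HashMap<>();

    public static byte[] load(String key) {
        // Entries are added but never evicted, so the map's deep heap keeps growing
        // even though its own shallow heap (the table of references) stays small.
        return CACHE.computeIfAbsent(key, k -> new byte[128 * 1024]);
    }

    public static void main(String[] args) {
        for (int i = 0; ; i++) {
            load("request-" + i); // distinct keys -> unbounded growth -> eventual OOM
        }
    }
}
```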

About the author:

Nianwu, Senior Backend Engineer

Mainly responsible for the Java performance platform and JDK support, with in-depth research on defect detection and compilers.

For more exciting content, please follow the [OPPO Internet Technology] official account.