Contents
- Preparing for tuning
- Performance analysis
- Performance tuning
- Other Optimization Suggestions
- Advanced JVM parameters
Tuning generally involves three processes:
- Performance monitoring: no problem has occurred yet, so you do not know what needs tuning. At this stage you need system and application monitoring tools to detect problems as they appear.
- Performance analysis: a problem has occurred, but you do not know exactly what it is. Here you need tools and experience to analyze the bottlenecks of the system and the application in order to locate the cause.
- Performance tuning: after the analysis in the previous step has located the problem, it is solved and optimized through code and configuration changes.
Java tuning is all about these three steps.
In addition, the performance analysis and tuning discussed in this article do not cover the following factors:
- Underlying system environment: hardware and operating system
- Use of data structures and algorithms
- Use of external systems such as databases and caches
- Use of certain Java APIs, such as Random, StringBuilder, etc.
Preparing for tuning
Tuning requires preparation, because every application has different business goals and performance bottlenecks are not always at the same point. At the business application level, we need to:
- Understand the overall architecture of the system and identify where the pressure lies, for example which interfaces and modules receive the most traffic and face the challenge of high concurrency.
- Build a test environment to measure the application's performance, using ab, LoadRunner, or JMeter.
- Analyze the volume of key business data. This mainly means quantitative analysis of some data, such as the daily data volume of the database, how much data is cached, and so on.
- Understand the requirements on response time, throughput, TPS, QPS, and other indicators. For example, a flash-sale (seckill) system has very demanding requirements on response time and QPS.
- Understand the versions, modes, and parameters of the software the system depends on. Sometimes the version and mode of a service the application relies on can noticeably affect performance.
In addition, we need to be familiar with some Java fundamentals.
Performance analysis
At the system level, there are three factors that affect application performance: CPU, memory, and IO. You can analyze the performance bottleneck of a program from these three aspects.
CPU analysis
When the program responds slowly, run the top, vmstat, and ps commands to check whether CPU usage is abnormal and thus determine whether the CPU is busy. The key figure to watch is us (the percentage of CPU time spent in user processes): when us approaches 100% or stays very high, you can be fairly sure the slow response is caused by a busy CPU. Generally speaking, a busy CPU has the following causes:
- Threads stuck in infinite or empty loops with no blocking, heavy regular-expression matching, or pure computation
- Frequent GC
- Excessive context switching between threads
Once you have identified the process with the highest CPU usage, you can use jstack to print the stack information of that process:
jstack [pid]
The next thing to note is that on Linux all threads are ultimately represented as lightweight processes (LWPs), and jstack can only be run against the process; its output contains the stack information of all threads under that process. It is therefore further necessary to determine which thread is consuming the most CPU. You can check this with top -H -p [pid], or list the resource consumption of all lightweight processes directly with ps (e.g. ps -eLf). Finally, locate the corresponding stack by searching for the LWP id in the jstack output; note that jstack prints native thread ids (nid) in hexadecimal, so convert the LWP id to hex first. Pay attention to the thread states: RUNNABLE, WAITING, and so on. For RUNNABLE threads, watch for CPU-consuming computation; a WAITING thread is typically blocked on a lock.
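If you prefer to inspect this from inside the JVM instead of combining top and jstack, the same information is exposed through the standard ThreadMXBean API. The snippet below is a minimal sketch (the class name and output format are my own); threads whose CPU time keeps growing between two samples are the busy ones.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class BusyThreadProbe {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (long id : mx.getAllThreadIds()) {
            ThreadInfo info = mx.getThreadInfo(id);
            if (info == null) {
                continue; // the thread has already exited
            }
            long cpuNanos = mx.getThreadCpuTime(id); // -1 if the thread is gone or measurement is disabled
            System.out.printf("%-40s state=%-13s cpu=%d ms%n",
                    info.getThreadName(), info.getThreadState(), cpuNanos / 1_000_000);
        }
    }
}
```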
You can also use jstat to view the GC information for the process to determine if the GC is causing the CPU to be busy.
jstat -gcutil [pid]
You can also use vmstat to determine whether the CPU is busy because of context switching, by looking at the cs (context switches) column.
vmstat 1 5
Memory analysis
For Java applications, memory is mainly composed of off-heap memory and in-heap memory.
- Off-heap memory
Off-heap memory is mainly used by JNI, Deflater/Inflater, and DirectByteBuffer (NIO). To analyze off-heap memory, first check swap and physical memory consumption with vmstat, sar, top, and pidstat before making a judgment. In addition, calls to JNI and Deflater can be tracked with google-perftools to see resource usage.
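As a quick illustration of why off-heap usage does not show up in ordinary heap statistics, here is a minimal sketch (the 64 MB size is arbitrary) that allocates a direct buffer: the memory lives outside -Xmx, is only released when the owning buffer object is collected, and is capped by -XX:MaxDirectMemorySize.

```java
import java.nio.ByteBuffer;

public class DirectMemoryDemo {
    public static void main(String[] args) {
        // Allocated outside the Java heap; exceeding -XX:MaxDirectMemorySize
        // raises "OutOfMemoryError: Direct buffer memory".
        ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024 * 1024);
        buf.putLong(0, 42L);
        System.out.println("direct capacity = " + buf.capacity()
                + ", first long = " + buf.getLong(0));
        // Leaked direct buffers therefore show up in the process RSS
        // (top/vmstat) rather than in jmap heap statistics.
    }
}
```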
- In-heap memory
This is the main memory area of a Java application. The things most commonly associated with its performance are:
- Object creation: objects live in the heap, so control their number and size; large objects in particular easily go straight into the old generation
- Global collections: global collections tend to live for a long time, so they need special care (a toy sketch of this problem follows the OOM list below)
- Caches: the data structure chosen for a cache greatly affects memory footprint and GC
- ClassLoader: dynamically loading classes easily exhausts permanent generation memory
- Multithreading: thread stacks consume native memory, and too many threads can cause memory shortage
Improper use of the above easily causes:
- Frequent GC -> stop-the-world pauses that slow the application down
- OOM: an OutOfMemoryError terminates the program. OOM can be divided into the following categories:
- Java heap space: the heap is too small
- PermGen space: the permanent generation is too small
- unable to create new native thread: there is not enough native memory to allocate thread stacks
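The "global collection" case mentioned above deserves a concrete, if contrived, example. The toy sketch below (class and field names are made up) puts an entry into a static map on every request and never removes it, so the old generation fills up, full GCs grow longer, and the JVM eventually throws java.lang.OutOfMemoryError: Java heap space.

```java
import java.util.HashMap;
import java.util.Map;

public class GlobalCacheLeak {
    // Global, long-lived collection with no eviction policy.
    private static final Map<Long, byte[]> CACHE = new HashMap<>();

    static void handleRequest(long requestId) {
        // 16 KB retained per request, forever.
        CACHE.put(requestId, new byte[16 * 1024]);
    }

    public static void main(String[] args) {
        for (long id = 0; ; id++) {
            handleRequest(id);
        }
    }
}
```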
A common tool for troubleshooting heap memory problems is jmap, which ships with the JDK. Some common uses are:
- Check the JVM's memory usage: jmap -heap [pid]
- View a histogram of live objects: jmap -histo:live [pid]
- Dump the heap to a file: jmap -dump:format=b,file=xxx.hprof [pid]
- Dump only live objects to a file: jmap -dump:format=b,live,file=xxx.hprof [pid]
In addition, Eclipse MAT (Memory Analyzer Tool) can be used to analyze dump files, whether produced by jmap or generated on OOM, to see the specific stacks and in-memory object information. Of course, jhat, which also ships with the JDK, can read dump files as well and starts a web server so the result can be browsed in a browser.
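Besides jmap, a heap dump in the same .hprof format can be triggered programmatically through the HotSpot-specific HotSpotDiagnostic MXBean, which is handy for dumping on an application-defined condition. A minimal sketch (class name and dump path are just examples):

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    public static void dump(String path, boolean liveOnly) throws Exception {
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        // liveOnly = true dumps only reachable objects, like jmap -dump:live.
        bean.dumpHeap(path, liveOnly);
    }

    public static void main(String[] args) throws Exception {
        dump("/tmp/app-heap.hprof", true);
    }
}
```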
IO analysis
Typically, application performance is related to file IO and network IO.
- File IO
You can use the pidstat, iostat, and vmstat system tools to check the I/O status (the original article shows sample vmstat output here). Pay attention to the bi and bo columns, which indicate the number of blocks received from and sent to block devices per second respectively, to judge how busy I/O is. Going further, the strace tool can be used to trace the system calls an application makes for file IO. Poor file IO performance generally comes down to a few typical causes.
- Network IO
The netstat tool is generally used to check network I/O status. It can show the state, number, and port information of all connections. For example, too many connections in TIME_WAIT or CLOSE_WAIT will slow the application down.
netstat -anp
tcpdump can also be used to capture and analyze network I/O data. The capture files are binary; Wireshark can be used to read the connections and data contents.
tcpdump -i eth0 -w tmp.cap -tnn dst port 8080 # capture network traffic on port 8080 and write it to tmp.cap
You can also see which interrupts the system is currently using by looking at /proc/interrupts.
The columns are, in order:
IRQ number, number of interrupts on each CPU, the programmable interrupt controller, and the device name (the dev_name field passed to request_irq)
You can judge the network I/O status by looking at the interrupt counts of the network adapter.
Other analysis tools
The system and JDK tools above are used to analyze CPU, memory, and IO. Beyond those, there are some comprehensive analysis tools and frameworks that make it easier to troubleshoot, analyze, and locate Java application performance problems.
- VisualVM
This is a Java application monitoring tool that most Java developers are familiar with. It connects to a JVM process over JMX to inspect threads, memory, classes, and so on. If you want a closer look at GC, you can install the Visual GC plug-in. VisualVM also has a BTrace plug-in that lets you write BTrace scripts visually and view their output logs.
Similar to VisualVM, JConsole views remote JVM information over JMX. It can also show specific thread stacks and per-generation memory usage, and can invoke MBeans remotely. VisualVM gains the same capabilities if the JConsole plug-in is installed.
However, since both tools need a GUI, they are usually run locally and connected to a remote server JVM process, which is often not allowed in production server environments.
- Java Mission Control (JMC)
This tool is bundled starting with JDK 7u40. It originally came from JRockit and is a very powerful sampling-based suite for diagnosis, analysis, and monitoring. docs.oracle.com/javacompone…
- BTrace
It uses the Java Attach API + Java agents + the Instrumentation API to trace a JVM dynamically. Interception of class methods, log printing, and so on can be added without restarting the application. Refer to the BTrace "Beginner to Journeyman" complete guide for detailed usage.
- Jwebap
Jwebap is a JavaEE performance detection framework based on ASM bytecode enhancement. It supports tracing HTTP requests, JDBC connections, and method calls, with call counts and timing statistics. With it you can find the most time-consuming requests and methods, and check JDBC connection counts and whether connections were closed, among other things. However, the project was created in 2006 and has not been updated for nearly ten years; according to the author, applications compiled with JDK 7 are no longer supported. If you want to use it, it is suggested to continue developing it on top of the original project, for example also adding tracing of Redis connections. Of course, you can also implement your own JavaEE performance monitoring framework based on the same bytecode-enhancement principles.
The version of Jwebap redeveloped by our company (shown in the original article's screenshot) already supports JDK 8 and Redis connection tracing.
- useful-scripts
Here is an open-source project I am involved in: github.com/superhj1987… It wraps a number of commonly used performance analysis commands, such as printing the stacks of busy Java threads described above, into single scripts.
Performance tuning
Corresponding to performance analysis, performance tuning is also divided into three parts.
CPU tuning
- Do not leave a thread running flat out (an infinite while loop); use sleep to back off for a while. This situation is common in pull-based consumption scenarios: if a pull returns no data, it is recommended to sleep before the next pull.
- For polling, the wait/notify mechanism can be used instead (a minimal sketch follows this list)
- Avoid loops, regular-expression matching, and heavy computation where possible, including String's format, split, and replace methods (the StringUtils equivalents in Apache Commons Lang can be used instead), using regular expressions to validate email addresses (which can sometimes lead to near-endless backtracking), serialization/deserialization, and so on.
- Combine JVM settings and code changes to avoid frequent GC, especially full GC.
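As promised above, here is a minimal sketch (the queue type and names are illustrative) of replacing a busy polling loop with wait/notify: the consumer blocks when there is no work instead of spinning, and the producer wakes it when data arrives. In practice the java.util.concurrent BlockingQueue implementations already provide exactly this behaviour.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class WaitNotifyQueue<T> {
    private final Deque<T> queue = new ArrayDeque<>();

    public synchronized void put(T item) {
        queue.addLast(item);
        notifyAll(); // wake up any consumer waiting for data
    }

    public synchronized T take() throws InterruptedException {
        while (queue.isEmpty()) {
            wait(); // release the lock and sleep until notified, instead of spinning
        }
        return queue.removeFirst();
    }
}
```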
In addition, the following points need to be noted when using multithreading:
- Use thread pools to reduce the number of threads and thread switching
- For lock contention between threads, consider reducing lock granularity (using ReentrantLock), splitting locks (similar to ConcurrentHashMap's segmented locking), or using lock-free techniques such as CAS, ThreadLocal, and immutable objects. Better still, write multithreaded code on top of the JDK's java.util.concurrent framework, ForkJoin, and so on. Disruptor and actor frameworks can also be used in appropriate scenarios.
Memory tuning
Tuning memory is primarily tuning the JVM.
- Set the size of each generation appropriately. Avoid a young generation that is too small (not enough space, causing frequent minor GCs and premature promotion to the old generation) or too large (leaving the old generation too small and prone to fragmentation), as well as Survivor spaces that are too large or too small.
- Choose an appropriate GC strategy. Different scenarios need different GC strategies. It should be said that CMS is not a silver bullet: do not choose it unless your requirements call for it, since ParNew, the young-generation collector paired with CMS, is not the fastest, and CMS produces fragmentation. In addition, G1 was not widely used before JDK 8 and is not recommended here.
- Add the JVM startup parameters -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:[log_path] to record GC logs for later troubleshooting.
In regard to the first point, there are specific suggestions:
- Young generation size: for applications where response time comes first, make it as large as possible, up to the point where it approaches the system's minimum response-time requirement (chosen according to the actual situation). At that size, minor GCs in the young generation are least frequent, and fewer objects reach the old generation. For throughput-first applications, also set it as large as possible, since there is no strict response-time requirement and garbage collection can run in parallel; this is recommended for machines with 8 or more CPUs.
- Old generation size: for response-time-first applications, the old generation generally uses a concurrent collector, so its size needs to be set carefully, generally taking into account parameters such as the concurrent session rate and session duration. If the heap is set too small, it causes memory fragmentation, a high collection frequency, and application pauses that fall back to traditional mark-sweep collection; if the heap is large, collection takes longer. The optimal setting generally needs to refer to the following data:
- Concurrent garbage collection information
- Number of concurrent collections of the permanent generation
- Traditional GC information
- The proportion of time spent collecting the young generation versus the old generation
Generally, throughput-first applications should have a large young generation and a smaller old generation. This way, most short-lived objects are reclaimed as early as possible, fewer medium-lived objects are carried along, and the old generation only stores long-lived objects.
In addition, there is the fragmentation problem caused by a smaller heap: because the old generation's concurrent collector uses a mark-sweep algorithm, it does not compact the heap. When the collector runs, it merges adjacent free spaces so that they can be handed out to larger objects. However, when heap space is low, "fragmentation" appears after the application has run for a while, and if the concurrent collector cannot find enough contiguous space, it stops and collects using the traditional mark-sweep approach. If "fragmentation" occurs, the following configuration may be needed: -XX:+UseCMSCompactAtFullCollection, which, when the concurrent collector is used, turns on compaction of the old generation, together with -XX:CMSFullGCsBeforeCompaction=xx to set how many full GCs should pass before the old generation is compacted.
The remaining JVM optimization topics are covered in the Advanced JVM parameters section below.
In code, note also:
- Avoid keeping duplicate strings, and be careful with String.substring() and String.intern()
- Try not to use finalizers
- Release unneeded references: clean up ThreadLocal values to prevent memory leaks, and close streams when you are done with them.
- Object pools can avoid repeatedly creating objects and the resulting frequent GC, but do not use them except for expensive-to-initialize/create resources such as connection pools and thread pools.
- For cache eviction, SoftReference and WeakReference can be used to hold cached objects (a sketch follows the logging example below)
- Be careful with hot deployment/loading, especially with dynamically loaded classes
- Do not let Log4j output file names and line numbers, because Log4j obtains them by taking a thread stack snapshot, which generates a large number of strings. Also, when using Log4j, it is recommended to use the classic idiom below of checking whether the corresponding log level is enabled before doing any work; otherwise many strings are built for nothing.
if (logger.isInfoEnabled()) { logger.info(msg); }
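For the SoftReference-based caching suggestion above, here is a minimal sketch (class and method names are my own): the GC is allowed to clear soft references before throwing OutOfMemoryError, so the cache shrinks under memory pressure instead of contributing to OOM. A real implementation would also drain a ReferenceQueue to purge map entries whose values have been cleared.

```java
import java.lang.ref.SoftReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SoftCache<K, V> {
    public interface Loader<K, V> {
        V load(K key);
    }

    private final Map<K, SoftReference<V>> map = new ConcurrentHashMap<K, SoftReference<V>>();

    public V get(K key, Loader<K, V> loader) {
        SoftReference<V> ref = map.get(key);
        V value = (ref == null) ? null : ref.get(); // may have been cleared by the GC
        if (value == null) {
            value = loader.load(key);               // reload and re-cache on a miss
            map.put(key, new SoftReference<V>(value));
        }
        return value;
    }
}
```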
IO tuning
Note on file IO:
- Consider using asynchronous writes instead of synchronous writes, taking Redis's AOF mechanism as a reference.
- Use caching to reduce random reads
- Write in batches where possible, to reduce the number of I/O operations and seeks
- Use a database instead of file storage
Note the following on network I/O:
- As with file IO, use asynchronous IO and multiplexed, event-driven IO instead of synchronous blocking IO (a sketch follows this list)
- Batch network I/O operations to reduce the number of calls
- Use caching to reduce reads of network data
- Use coroutines: Quasar
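For the multiplexed, event-driven IO mentioned above, here is a minimal sketch of a single-threaded echo server built on java.nio's Selector (the port and buffer size are arbitrary): one thread services many connections instead of blocking a thread per socket. Production systems usually reach for a framework such as Netty rather than a hand-rolled Selector loop.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class SelectorEchoServer {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.socket().bind(new InetSocketAddress(8080));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        ByteBuffer buffer = ByteBuffer.allocate(4096);
        while (true) {
            selector.select(); // block until at least one channel is ready
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) {              // new connection
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {         // data available
                    SocketChannel client = (SocketChannel) key.channel();
                    buffer.clear();
                    int read = client.read(buffer);
                    if (read == -1) {
                        client.close();                // peer closed the connection
                    } else {
                        buffer.flip();
                        client.write(buffer);          // echo the bytes back
                    }
                }
            }
        }
    }
}
```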
Other Optimization Suggestions
- Algorithms and logic are the primary determinants of program performance. When you run into performance problems, optimize the program's logic first
- Prefer using return values rather than exceptions to indicate errors
- See if your code is inline friendly: Is your Java code JIT-compile-friendly?
In addition, JDK7 and 8 have some performance enhancements for the JVM:
- JDK 7 can enable tiered compilation with -XX:+TieredCompilation. Tiered compilation combines the advantages of the client-side C1 compiler and the server-side C2 compiler (client compilation starts quickly and optimizes early, while server compilation can apply more advanced optimizations), making very efficient use of resources: low-tier compilation runs at the start while profiling information is gathered, and high-tier compilation runs later to apply the advanced optimizations.
- Compressed Oops: compressed pointers are enabled by default in server mode since JDK 7.
- Zero-based Compressed Ordinary Object Pointers: on a 64-bit JVM, when the compressed pointers above are used, the JVM asks the operating system to reserve memory starting at virtual address 0. If the operating system grants the request, zero-based compressed oops are enabled, so a 32-bit object offset can be decoded into a 64-bit pointer without adding the Java heap base address.
- Escape Analysis: the server-mode compiler determines from the code whether an object escapes, and uses that to decide whether it needs to be allocated on the heap and whether scalar replacement can be applied (allocating its primitive fields as local variables on the stack). It can also eliminate synchronization automatically based on how an object is used, for example on a StringBuffer that never leaves its method. This feature is enabled by default since Java SE 6u23 (an illustrative sketch follows this list).
- NUMA Collector Enhancements: this matters for the Parallel Scavenge garbage collector. It enables faster GC by taking advantage of machines with a Non-Uniform Memory Access (NUMA) architecture, and can be turned on with -XX:+UseNUMA.
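To make the escape analysis point concrete, here is an illustrative sketch (class and method names are my own; whether the JIT actually applies the optimizations depends on the compiler and on -XX:+DoEscapeAnalysis, which is on by default in modern server VMs): the Point in sum() never escapes the method and is a candidate for scalar replacement, and the StringBuffer in label() is only used locally, so its internal locking can be elided.

```java
public class EscapeAnalysisDemo {
    static final class Point {
        final int x;
        final int y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    static long sum(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            Point p = new Point(i, i * 2); // never escapes: candidate for scalar replacement
            total += p.x + p.y;
        }
        return total;
    }

    static String label(int i) {
        StringBuffer sb = new StringBuffer(); // lock never contended: candidate for lock elision
        sb.append("item-").append(i);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(sum(1000000) + " " + label(42));
    }
}
```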
In addition, there is a lot of outdated advice on the Internet; stop following it blindly.
JVM parameters advanced
Setting JVM parameters has always been somewhat hit-and-miss, with plenty of confusion about which parameters can be configured, what they mean, and why. Here are some basics about them, along with some of the parameters that easily lead people astray.
Everything below refers to the Oracle/Sun JDK 6.
- Default values of startup parameters
Java has a great many startup parameters, and they differ between versions. The web is full of advice, much of which is ineffective, or simply restates the default, if applied indiscriminately. In general, you can see all configurable parameters and their default values with java -XX:+PrintFlagsInitial. You can also add -XX:+PrintCommandLineFlags at program startup to see which startup parameters differ from the defaults. To see all startup parameters in effect (including defaults), use -XX:+PrintFlagsFinal.
In the output, "=" means the default value is in use, while ":=" means the default is not in use; the value may come from the command line, from a configuration file, or from ergonomics automatically choosing a different value.
In addition, you can use the jinfo command to display startup parameters.
- jinfo -flags [pid] # view the startup parameters currently in effect
- jinfo -flag [flagName] [pid] # view the value of a specific parameter
It is important to note that when configuring JVM parameters, it is best to check a parameter's default value with the commands above before deciding whether to set it. It is also best not to configure parameters whose purpose you do not understand, because the defaults exist for a reason.
- Setting parameters dynamically
Sometimes a Java application is already running, the performance problem has been identified as GC-related, but the GC-logging parameters were not added at startup. In many cases people simply add the parameters and restart the application, which makes the service unavailable for a while. The better practice is to set parameters dynamically without restarting the application, which jinfo can do (again, essentially based on JMX).
jinfo -flag [+/-][flagName] [pid] # enable/disable a parameter
jinfo -flag [flagName]=[value] [pid] # set a parameter
In the above GC case, you can use the following command to open heap dump and set the dump path.
jinfo -flag +HeapDumpBeforeFullGC [pid]
jinfo -flag +HeapDumpAfterFullGC [pid]
jinfo -flag HeapDumpPath=/home/dump/dir [pid]
It can also be turned off dynamically.
jinfo -flag -HeapDumpBeforeFullGC [pid]
jinfo -flag -HeapDumpAfterFullGC [pid]
Other parameter Settings are similar.
- -verbose:gc and -XX:+PrintGCDetails
Many GC recommendations suggest using both of these parameters. In fact, once -XX:+PrintGCDetails is enabled, the former option is turned on as well.
- -XX:+DisableExplicitGC
This parameter turns System.gc() into a no-op, and it is recommended in many guides. However, if you are using NIO or other consumers of off-heap memory, this option can lead to OOM. You can use -XX:+ExplicitGCInvokesConcurrent or -XX:+ExplicitGCInvokesConcurrentAndUnloadsClasses (which, when used with CMS, make System.gc() trigger a concurrent GC) instead.
There is another interesting point: if you do not set this option and use RMI, a full GC will run periodically. This is caused by the distributed GC (DGC) that serves RMI. See the DGC-related content at this link: docs.oracle.com/javase/6/do…
- -XX:MaxDirectMemorySize
This parameter sets the upper limit of direct (off-heap) memory. When it is left unset (i.e. at its default of -1), the limit defaults to -Xmx minus the reserved size of one survivor space.
- Parameters that are equivalent for historical reasons
- -Xss and -XX:ThreadStackSize
- -Xmn and -XX:NewSize
- -XX:MaxTenuringThreshold
The default value is 15, but when CMS is selected it changes to 4. When this value is set to 0, all objects that are still live in Eden are promoted to the old generation at their first minor GC, and the Survivor spaces are effectively unused.
- -XX:HeapDumpPath
Use this parameter to specify where the heap dump files triggered by -XX:+HeapDumpBeforeFullGC, -XX:+HeapDumpAfterFullGC, and -XX:+HeapDumpOnOutOfMemoryError are stored.
References
- Java HotSpot™ Virtual Machine Performance Enhancements
- Java HotSpot Virtual Machine Garbage Collection Tuning Guide
- [HotSpot VM] Various pitfalls of “standard parameters” for JVM tuning