1. Code-related issues
When you encounter a performance problem, the first thing to do is check whether it is related to the business code, not by reading the code line by line to solve it, but by using logs and the code itself to rule out low-level, code-related mistakes. The best place for performance optimization is always inside the application itself.
For example, check whether the service logs report a large number of errors. Most application-layer and framework-layer performance problems can be found in the logs (an improperly set log level, for instance, can flood production with log calls). Also review the main code paths: misuse of for loops, NPEs, regular expressions, mathematical calculations, and other common problems can often be fixed with a simple code change.
Don’t equate performance optimization with buzzwords such as caching, asynchrony, and JVM tuning: complex problems may have simple solutions, and the 80/20 rule still holds in performance optimization. Of course, knowing the common "code potholes" up front speeds up problem analysis, and the bottleneck-optimization ideas that come out of CPU, memory, and JVM analysis often point back to the code as well.
Here are some high-frequency coding points that can cause performance problems.
1) Regular expressions are CPU-intensive (greedy mode may cause catastrophic backtracking), so be careful with String.split(), replaceAll(), and other methods that take a regex; regular expressions used on hot paths should be precompiled.
2) String.intern() on older JDKs (Java 6 and earlier) may cause the method area (permanent generation) to overflow. On older JDKs, if the string pool is sized too small while a large number of strings are interned, there is also a significant performance overhead.
3) When logging an exception, if the stack information is already clear enough, consider not printing the full stack trace, because constructing an exception stack is expensive. Note: when a large number of identical exceptions are thrown at the same call site, the JIT optimizes them into a preallocated, type-matched exception without a stack trace (the OmitStackTraceInFastThrow optimization), so the stack will no longer appear in the logs.
4) Avoid unnecessary boxing and unboxing between reference types and primitive types, and try to keep types consistent. Frequent autoboxing seriously hurts performance.
5) Stream API selection: for complex and parallelizable operations, the Stream API is recommended, since it simplifies the code while taking advantage of multiple CPU cores; for simple operations or single-CPU environments, explicit iteration is recommended.
6) Manually create thread pools with ThreadPoolExecutor according to the business scenario, specifying the number of threads and the queue size for each kind of task to avoid resource-exhaustion risks. Unified thread naming also makes later troubleshooting easier.
7) Choose concurrent containers according to the business scenario. For example, if strong data consistency is required when choosing a Map, use Hashtable or Map plus a lock. When reads far outnumber writes, use CopyOnWriteArrayList. Use ConcurrentHashMap when the amount of data is small, strong consistency is not required, and changes are infrequent; use ConcurrentSkipListMap when the amount of data is large, reads and writes are frequent, and strong consistency is not required.
8) Lock optimization ideas include reducing lock granularity, coarsening locks inside loops, and shortening lock hold times (choosing read/write locks appropriately). Also consider the concurrency classes the JDK has already optimized, such as LongAdder instead of AtomicLong for counters in statistics scenarios with relaxed consistency requirements, and ThreadLocalRandom instead of Random. A minimal sketch covering points 1), 3), 6), and 8) follows this list.
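The sketch below is illustrative only; the class name, pool sizes, queue length, and thread-name prefix are assumptions to be adapted to the real workload, not values taken from the original text.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;
import java.util.regex.Pattern;

public class CodingPointsSketch {

    // 1) Precompile the regex once instead of calling String.split()/replaceAll() on hot paths
    private static final Pattern COMMA = Pattern.compile(",");

    // 3) A "cheap" exception for expected business failures: skip filling the costly stack trace
    static class FastFailException extends RuntimeException {
        FastFailException(String message) { super(message); }
        @Override public synchronized Throwable fillInStackTrace() { return this; }
    }

    // 8) LongAdder scales better than AtomicLong for write-heavy counters with relaxed consistency
    private static final LongAdder REQUEST_COUNT = new LongAdder();

    // 6) Create the pool explicitly: bounded queue, named threads, explicit rejection policy
    private static final ExecutorService WORKER_POOL = new ThreadPoolExecutor(
            4, 8,                              // core / maximum pool size (assumed values)
            60L, TimeUnit.SECONDS,             // keep-alive for idle non-core threads
            new ArrayBlockingQueue<>(1000),    // bounded queue prevents unbounded memory growth
            runnable -> {
                Thread t = new Thread(runnable);
                t.setName("order-worker-" + t.getId()); // unified names make jstack output readable
                return t;
            },
            new ThreadPoolExecutor.CallerRunsPolicy()); // back-pressure instead of dropping tasks

    public static void handle(String line) {
        REQUEST_COUNT.increment();
        String[] fields = COMMA.split(line);   // reuses the precompiled pattern
        if (fields.length < 2) {
            throw new FastFailException("malformed record");
        }
        WORKER_POOL.submit(() -> process(fields));
    }

    private static void process(String[] fields) {
        // business logic
    }
}

In production code the pool parameters should come from configuration and be sized by measuring the actual task characteristics rather than hard-coded as above.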
Besides these code-level optimizations, there are many more that are not listed here. Some common optimization ideas can be extracted from the points above, such as:
Trading space for time: use memory or disk in exchange for more valuable CPU or network resources, as with caching.
Trading time for space: save memory or network resources by sacrificing some CPU, for example by splitting one large network transfer into several smaller ones. Other techniques include parallelization, asynchrony, pooling, and so on.
2. CPU
As mentioned earlier, we should pay more attention to CPU load: high CPU utilization alone is generally not a problem, while CPU load is the key basis for judging whether the system's computing resources are healthy.
2.1 High CPU usage && High load average
This is common in CPU-intensive applications, where a large number of threads are runnable and I/O is low. Common application operations that consume CPU are:
Regular expression matching
Mathematical computation
Serialization/deserialization
Reflection
Infinite loops or unreasonably large numbers of loop iterations
Defects in base libraries or third-party components
The common way to investigate high CPU usage is to print the thread stack several times (more than five) with jstack and locate the thread stacks that consume the most CPU, or to use profiling (event sampling or instrumentation) to produce an on-CPU flame graph over a period of time and quickly pin down the problem.
It is also possible that the application is doing frequent GC (Young GC, Old GC, or Full GC), which also raises CPU utilization and load. Run jstat -gcutil continuously to display GC counts and times for the application. Frequent GC pushes up the load and is usually accompanied by a shortage of available memory; use free or top to check the available memory on the machine.
Could a CPU performance bottleneck itself be the cause of high CPU utilization? It is possible. You can view detailed CPU utilization with vmstat: a high user-mode CPU percentage (us) means user processes are occupying the CPU; if it stays above 50% for a long time, look for performance problems in the application itself. A high kernel-mode percentage (sy) means the kernel is occupying a lot of CPU, so check the performance of kernel threads and system calls. If us + sy exceeds 80%, the CPU itself may be insufficient.
2.2 Low CPU usage && High load average
If CPU utilization is not high, the application is not busy computing but doing something else. Low CPU utilization with a high load average is typical of I/O-intensive processes. This is easy to understand: load average is the sum of processes in the R and D states; removing the first group leaves the D (uninterruptible sleep) processes, which typically come from waiting on I/O (disk I/O, network I/O, and so on).
Investigation and verification ideas: check the %wa (iowait) column, which indicates the percentage of CPU time spent waiting for disk I/O. If it exceeds 30%, disk I/O wait is severe. This may be caused by heavy random disk access or direct I/O (bypassing the system cache), or the disk itself may be the bottleneck. Verify it against the output of iostat or dstat; for example, if %wa (iowait) rises together with a large number of disk read requests, the problem may be disk reads.
In addition, long network requests (that is, network I/O), such as slow MySQL queries or fetching data through RPC interfaces, also raise the load average. Troubleshooting this usually requires combined analysis of the application's upstream and downstream dependencies and the trace logs from middleware instrumentation.
2.3 The Number of CPU Context Switches Becomes High
Use vmstat to check the number of context switches on the system, and pidstat to check the voluntary (cswch) and non-voluntary (nvcswch) context switches of the process. Voluntary context switches are caused by thread state transitions inside the application, for example calling sleep(), join(), or wait(), or using Lock or synchronized; non-voluntary context switches are caused by threads exhausting their time slice or being preempted by the scheduler for higher-priority work.
If the number of voluntary context switches is high, the CPU is waiting for resources to become available, for example when system resources such as I/O or memory are insufficient. If the number of non-voluntary context switches is high, the likely cause is too many threads in the application, leading to fierce competition for CPU time slices and frequent forced scheduling by the operating system; the thread count and thread-state distribution can serve as supporting evidence.
3. Memory related
As mentioned above, memory is divided into system memory and process memory (including the memory of Java application processes). Most of the memory problems we encounter fall on process memory; bottlenecks caused by system resources themselves are comparatively rare. For Java processes, the built-in memory management automatically solves two problems, how to allocate memory to objects and how to reclaim the memory allocated to them, with the garbage collection mechanism at its core.
Although garbage collection effectively prevents memory leaks and makes good use of memory, it is not a panacea: improper parameter configuration and code logic can still cause a series of memory problems. In addition, the early garbage collectors were not very capable or efficient, and the many GC parameter settings depended heavily on the developer's tuning experience. For example, an improper maximum heap size can cause problems such as heap overflow or heap thrashing.
Let’s look at some common memory problem analysis ideas.
3.1 Insufficient System Memory
Java applications usually monitor the memory water level of individual machines and of the cluster. If single-machine memory utilization exceeds 95%, or cluster memory utilization exceeds 80%, there may be a potential memory problem.
Except in some extreme cases, when the system runs out of memory it is most likely caused by the Java application. With the top command you can view the actual memory usage of the Java process: RES is the resident memory of the process and VIRT is its virtual memory, and the size relationship is VIRT > RES > the actual heap size of the Java application. Besides heap memory, the overall memory footprint of a Java process also includes the method area/metaspace, the JIT code cache, and so on, roughly composed as follows:
Java application memory usage = Heap + Code Cache + Metaspace + Symbol tables + Thread stacks + Direct buffers + JVM structures + Mapped files + Native libraries
You can run jstat -gc to view the memory usage of a Java process; its output shows the usage of each heap region and of the metaspace. Statistics on off-heap memory can be obtained with NMT (Native Memory Tracking, available in HotSpot since Java 8), which is enabled with -XX:NativeMemoryTracking=summary and queried with jcmd <pid> VM.native_memory summary. The memory used by thread stacks is easy to overlook: although thread stack memory is lazily allocated and -Xss does not commit the memory up front, too many threads still lead to unnecessary memory consumption. The script JStackMem can be used to estimate overall thread memory usage.
The troubleshooting steps for insufficient system memory are as follows:
1. Run free to check the currently available memory, and vmstat to check memory usage and its growth trend; at this stage you can usually locate the processes that occupy the most memory.
2. Analyze the cache/buffer memory usage. If this value does not change much over time, it can be ignored; if it keeps growing, use tools such as pcstat, cachetop, and slabtop to analyze cache/buffer usage further.
3. If memory keeps growing even after the cache/buffer portion is excluded, there may be a memory leak.
3.2 Java Memory Overflow
A memory overflow occurs when the application creates an object instance that requires more memory than the heap can provide. There are many types of memory overflow, and the keyword OutOfMemoryError usually appears in the error log. Common types and their analysis are as follows:
1) java.lang.OutOfMemoryError: Java heap space
Causes: objects in the heap (young and old generation) can no longer be allocated; references to some objects are held for a long time without being released, so the garbage collector cannot reclaim them; a large number of Finalizer objects are used, which are not reclaimed in the normal GC cycle; and so on. A heap overflow is usually caused by a memory leak; if you are sure there is no leak, you can increase the heap size appropriately.
2) java.lang.OutOfMemoryError: GC overhead limit exceeded
Cause: the garbage collector spends more than 98% of its time collecting but reclaims less than 2% of the heap memory, usually because of a memory leak or a heap that is too small.
3) java.lang.OutOfMemoryError: Metaspace or java.lang.OutOfMemoryError: PermGen space
Check whether classes are loaded dynamically but not unloaded in time, whether a large number of strings are interned into the constant pool, and whether the permanent generation/metaspace is set too small.
4) java.lang.OutOfMemoryError: unable to create new native thread
Cause: the virtual machine cannot obtain enough memory when creating or expanding thread stacks. The stack size of each thread (-Xss) and the overall number of threads in the application can be reduced appropriately. In addition, the total number of processes/threads that can be created is also limited by the system's free memory and operating-system limits, so check those as well.
Note: this is different from StackOverflowError, which is caused by method calls nested too deeply, leaving insufficient stack memory to create a new stack frame. There are also OutOfMemoryError types such as swap exhaustion, native method stack overflow, and oversized array allocation, which are less common and are not covered in detail here.
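If you want to reproduce a heap overflow in a test environment, for example to verify alerting or a -XX:+HeapDumpOnOutOfMemoryError setting, a trivial sketch like the following is enough; run it with a deliberately small heap such as -Xmx64m (the class name and sizes are illustrative assumptions).

import java.util.ArrayList;
import java.util.List;

public class HeapSpaceDemo {
    public static void main(String[] args) {
        List<byte[]> hold = new ArrayList<>();
        while (true) {
            hold.add(new byte[1024 * 1024]); // keep strong references so GC cannot reclaim them
        }
        // Eventually throws java.lang.OutOfMemoryError: Java heap space
    }
}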
3.3 Java Memory Leaks
Java memory leaks are a developer's nightmare. Unlike a memory overflow, which is simple and direct, a leak shows up as the application running for a while with memory utilization climbing higher and higher and responses getting slower and slower, until the process finally "hangs".
A Java memory leak may lead to insufficient available memory, process death, OOM, and so on. The troubleshooting approach is as follows:
Use jmap to periodically output heap object statistics and locate the objects whose count and size keep growing; use a Profiler to profile the application and look for memory-allocation hot spots. In addition, if heap memory keeps growing, it is advisable to dump a heap snapshot and base further analysis on it; although a snapshot is only an instantaneous value, it is still meaningful.
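As a reference point when reading periodic jmap -histo output, a classic leak pattern is an unbounded, statically held cache; this is a minimal illustrative sketch (the class and field names are assumptions, not from the original text), in which the byte[] and HashMap$Node counts in the histogram would keep growing.

import java.util.HashMap;
import java.util.Map;

public class SessionCache {
    // Entries are added on every request but never evicted, so the old generation
    // keeps growing until GC can no longer reclaim anything.
    private static final Map<String, byte[]> CACHE = new HashMap<>();

    public static void put(String sessionId, byte[] payload) {
        CACHE.put(sessionId, payload); // strong reference held for the life of the JVM
    }
}

Bounding the cache (an LRU with a size limit, expiration, or weak/soft references) or delegating to a proper cache library removes this kind of leak.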
3.4 Related to garbage collection
GC (garbage collection, same below) metrics are an important measure of the memory health of a Java process. The core GC metrics are the frequency and duration of GC pauses (including Minor GC and Major GC), which can be obtained directly from the jstat tool, and the memory details of each collection, which require analysis of the GC logs. Note that the FGC/FGCT columns in the jstat output represent the number of GC pauses (stop-the-world) during old-generation collection; for the CMS collector, for example, this value increases by 2 for each old-generation collection, because the initial mark and remark phases are both stop-the-world pauses.
When is GC tuning needed? That depends on the application's requirements for response time and throughput, on system resource constraints, and so on. Some rules of thumb: GC frequency and duration increase noticeably, the average GC pause exceeds 500 ms, Full GC runs more often than once a minute, and so on. If GC shows any of these characteristics, it is time to tune it.
Due to the variety of garbage collectors, tuning strategies vary for different applications, so here are some general GC tuning strategies.
1) Choose the appropriate garbage collector. Make a reasonable choice based on the application's latency and throughput requirements, combined with the characteristics of each collector. It is recommended to replace CMS with G1: G1's performance is still being improved and it is catching up with or even surpassing CMS on machines with 8 GB of RAM or less, and G1 is easier to tune, whereas the CMS parameters are too complex and CMS easily causes space fragmentation and high CPU consumption, so it is already deprecated. The new ZGC collector introduced in Java 11, which performs marking and collection concurrently in almost all phases, is worth looking forward to.
2) Set a reasonable heap size. Do not make the heap too large; it is recommended that the heap not exceed 75% of system memory to avoid exhausting it, and the maximum heap size should equal the initial heap size to avoid heap thrashing. Sizing the young generation is critical: most adjustments to GC frequency and duration are adjustments to the young generation, including the ratio of young to old generation and the ratio of Eden to Survivor spaces, and these ratios also need to take the promotion age of objects into account, so there is a lot to weigh. With the G1 collector, the young-generation size matters much less, since its adaptive policy decides the collection set (CSet) for each collection. Tuning the young generation is the core of GC tuning and relies heavily on experience, but in general, a high Young GC frequency means the young generation is too small (or the Eden/Survivor split is unreasonable), while long Young GC pauses mean the young generation is too large; those are roughly the two directions.
3) Reduce the frequency of Full GC. If frequent Full GC or old-generation GC occurs, there is very likely a memory leak causing objects to be held for a long time; dumping and analyzing a heap snapshot can quickly locate the problem. Alternatively, an inappropriate ratio between the young and old generation may cause objects to be promoted to the old generation too often, which also needs to be analyzed together with the business code and memory snapshots. In addition, GC parameters can provide a lot of key information for tuning: -XX:+PrintGCApplicationStoppedTime, -XX:+PrintSafepointStatistics, and -XX:+PrintTenuringDistribution give us the GC pause distribution, safepoint time statistics, and the age distribution of promoted objects, and -XX:+PrintFlagsFinal shows which GC parameters actually take effect, and so on.
4. Disk I/O and Network I/O
4.1 Troubleshooting Disk I/O Faults
1. Use tools such as top and iostat to check disk-related indicators such as %wa (iowait) and %util, and determine whether disk I/O is abnormal; for example, a high %util indicates heavy I/O activity.
2. Use pidstat to locate the specific process and check the volume and rate of its reads and writes.
3. Use lsof + process ID to view the list of files the abnormal process has open (including directories, block devices, dynamic libraries, and network sockets); combined with the business code, this usually locates the source of the I/O.
Note that an increase in %wa (iowait) does not necessarily mean there is a disk I/O bottleneck: this number represents the percentage of CPU time spent waiting on I/O, and it is normal if I/O is the main activity of the application during that time.
4.2 Network I/O Bottlenecks
Possible causes are as follows:
1. Too many objects are transferred in a single request, which may cause slow responses and frequent GC;
2. An unsuitable network I/O model leads to low overall QPS and long response times;
3. The thread pool for RPC calls is set improperly. You can use jstack to count the distribution of thread states; if many threads are in the TIMED_WAITING or WAITING state, pay special attention. For example, if the database connection pool is too small, the thread stacks will show many threads competing for the connection-pool lock;
4. The timeout for RPC calls is set improperly, causing many request failures (see the sketch below for explicitly setting timeouts).
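Most RPC frameworks expose connect and read timeouts in their own configuration; as a framework-agnostic illustration (a minimal sketch using java.net.HttpURLConnection as a stand-in, with assumed timeout values), explicitly setting both timeouts looks like this.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class TimeoutExample {
    public static byte[] call(String endpoint) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setConnectTimeout(1000); // fail fast if the connection cannot be established (1 s, assumed)
        conn.setReadTimeout(3000);    // bound the time spent waiting for the response (3 s, assumed)
        try (InputStream in = conn.getInputStream()) {
            return in.readAllBytes(); // JDK 9+; read the full response body
        } finally {
            conn.disconnect();
        }
    }
}

Timeouts that are too long tie up threads in the RPC thread pool, while timeouts that are too short cause spurious failures, so they should be chosen together with the downstream service's SLA.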
Thread stack snapshots of a Java application are very useful: in addition to the thread-pool configuration problems mentioned above, other scenarios such as high CPU usage and slow application response can also be investigated starting from the thread stack.
5. Useful one-line commands
This section describes several commands used to locate performance problems.
1) Count the network connections on the system, grouped by state
netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
2) View the top 50 objects in the heap by instance count (to help locate memory leaks)
jmap -histo:live $pid | sort -n -r -k2 | head -n 50
3) List the top 10 processes by CPU/ memory usage
# memory
ps axo %mem,pid,euser,cmd | sort -nr | head -10
#CPU
ps -aeo pcpu,user,pid,cmd | sort -nr | head -10
4) Display the overall CPU utilization and idle rate of the system
grep "cpu " /proc/stat | awk -F ' ' '{total = $2 + $3 + $4 + $5} END {print "idle \t used\n" $5*100/total "% " $2*100/total "%"}'
5) Count the number of threads by thread state (enhanced version)
jstack $pid | grep java.lang.Thread.State:|sort|uniq -c | awk '{sum+=$1; split($0,a,":"); gsub(/^[ \t]+|[ \t]+$/, "", a[2]); printf "%s: %s\n", a[2], $1}; END {printf "TOTAL: %s",sum}';
6) View the stack traces of the top 10 threads that consume the most CPU
The show-busy-java-threads script is recommended for this; it has been used in Alibaba's online operations environment. Link: github.com/oldratlee/u…
7) Flame graph generation (perf, perf-map-agent, and FlameGraph need to be installed):
# 1. Collect application runtime stack and symbol table information (sample time: 30 seconds, 99 events per second);
sudo perf record -F 99 -p $pid -g -- sleep 30; ./jmaps
# 2. Using perf script to generate analysis results, the resulting flamegraph.svg file is the flamegraph.
sudo perf script | ./pkgsplit-perf.pl | grep java | ./flamegraph.pl > flamegraph.svg
8) List the top 10 processes according to Swap partition usage
for file in /proc/*/status ; do awk '/VmSwap|Name|^Pid/{printf $2 " " $3}END{ print ""}' $file; done | sort -k 3 -n -r | head -10
9) Show JVM memory usage and garbage collection statistics
# Display the cause of the last (or currently occurring) garbage collection
jstat -gccause $pid
# Display the capacity and usage of each generation
jstat -gccapacity $pid
# Display the new generation capacity and usage
jstat -gcnewcapacity $pid
# Display the old generation capacity
jstat -gcoldcapacity $pid
# Display garbage collection statistics (continuous output at 1-second intervals)
jstat -gcutil $pid 1000
10) Other daily commands
# Quickly kill all Java processes
ps aux | grep java | awk '{ print $2 }' | xargs kill -9
# Find the top 10 files that occupy the most disk space under /
find / -type f -print0 | xargs -0 du -h | sort -rh | head -n 10
To summarize: performance optimization is a large field, and each of the small points above could be expanded into dozens of articles. Beyond what has been covered here, application performance optimization also includes front-end optimization, architectural optimization (distribution, caching, and so on), data-storage optimization, and code optimization (such as better use of design patterns); limited by space, they are not discussed one by one, and this article only serves as an introduction to the topic. At the same time, this article reflects my own experience and knowledge and is not necessarily correct in every respect; corrections and additions are welcome. Performance optimization is comprehensive work that requires continuous practice, turning tool knowledge and lessons learned into hands-on experience, to gradually form a tuning methodology of your own.
Also, while performance tuning is important, don’t invest too much in it too early (good architectural design and coding are of course necessary), because premature optimization is the root of all evil. On the one hand, optimization work done in advance may not fit rapidly changing business requirements and may instead hinder new requirements and features; on the other hand, premature optimization increases application complexity and reduces maintainability. When to optimize, and to what extent, is a question that needs to be weighed.
For more practical skills on performance troubleshooting and daily commands, please see the original article: developer.aliyun.com/article/741…