First, a review

Now that we have seen how a ThreadDump can be viewed and what it can show, let’s continue exploring the question. The earlier notes in this series:

ThreadDump Analysis Notes (1): Reading the Stack

ThreadDump Analysis Notes (2): Analyzing the Stack

Second, where is the bottleneck

Performance optimization means doing more with limited resources. When a thread is blocked waiting on a particular resource, we say the program is limited by that resource: by the database, by the processing power of the remote end, and so on.

Using concurrency to improve system performance means keeping the CPU as busy as possible. If the program is limited by the computing power of the current CPU, we can solve the problem by adding processors or clustering. But if the program cannot keep the existing CPU busy, adding more processors will not help; instead, make full use of multithreading so that idle processors can pick up unfinished work.
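As a minimal sketch of that idea (class and method names are hypothetical), independent work can be split across a thread pool sized to the number of processors, so no core sits idle while work remains:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSum {
    // Sum a large array by splitting it across all available processors.
    static long parallelSum(long[] data) throws Exception {
        int cpus = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cpus);
        int chunk = (data.length + cpus - 1) / cpus; // ceil division
        List<Future<Long>> parts = new ArrayList<>();
        for (int i = 0; i < data.length; i += chunk) {
            final int from = i, to = Math.min(i + chunk, data.length);
            parts.add(pool.submit(() -> {
                long s = 0;
                for (int j = from; j < to; j++) s += data[j];
                return s;
            }));
        }
        long total = 0;
        for (Future<Long> f : parts) total += f.get(); // wait for each chunk
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        long[] data = new long[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = i;
        System.out.println(parallelSum(data)); // 499999500000
    }
}
```

This only pays off when the work is genuinely CPU-bound and divisible; as the text notes, if one CPU is already saturated by an unparallelizable task, more threads change nothing.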

In general, performance improvements address limited resources, which may be:

CPU. If CPU usage is already close to 100% and neither the code nor the business logic can be simplified further, the system has reached its maximum performance; the only way to improve it is to add processors or machines.

Other resources, such as the number of database connections. If CPU utilization is not close to 100%, you can modify the code to drive CPU utilization higher and thereby improve overall performance.

Let’s look at a graph of a system whose CPU usage does not approach 100% as the load increases:

If you can’t get CPU utilization close to 100% under any amount of stress on a single-CPU machine, then there’s room for optimization. The analysis process of a system performance bottleneck is as follows:

  1. First analyze and optimize the performance of a single process. (There is no trick here beyond hard-coding extra timestamps to find where the most time is spent; it is a case-by-case exercise.)

  2. Then perform whole-system bottleneck analysis (the focus of this article).

So what is high performance? In fact, it means different things in different scenarios:

  1. Sometimes high performance means a fast user experience: clicking a menu in an interface and getting an immediate response.

  2. Sometimes it means high throughput, as in an SMS gateway: the system cares more about overall throughput than about the processing time of any single message.

  3. Sometimes it is a combination of the two: high throughput is required, and every message must also be processed within a fixed deadline, without delay.

The ultimate goal of performance tuning is to drive your system’s CPU utilization close to 100%. If the CPU is underutilized, there are the following possibilities:

Not enough pressure was applied

The application may simply not be under enough load. Increase the load while observing response times, failure rates, and CPU utilization. If increasing the load makes some requests fail, makes response times slow, or still fails to raise CPU usage, you have reached the system’s saturation pressure: the current throughput is the system’s maximum capacity.

A system bottleneck exists

When the system is under saturation pressure, if the CPU utilization is not close to 100%, then there is room for improvement.

There are several symptoms of system bottlenecks:

  1. The system runs slowly all the time. (The application is consistently slow, and changing the load, the number of database connections, and so on does not improve overall response time.)

  2. Performance degrades over time. (Under stable load, the longer the system runs, the slower it gets; past a certain threshold it may become erratic, deadlock, or crash.)

  3. Performance degrades as load increases. (As the number of users grows, the system slows down; after some users exit, it returns to normal.)

Third, several common performance bottlenecks

Contention for resources caused by improper synchronization

  1. Improper use of synchronized causes unrelated methods to share the same lock, or different shared variables to be protected by the same lock, creating unnecessary resource contention. For example:

    class MyClass {
        Object sharedObj;
        synchronized void fun1() { ... } // accesses sharedObj
        synchronized void fun2() { ... } // accesses sharedObj
        synchronized void fun3() { ... } // does not access sharedObj
        synchronized void fun4() { ... } // does not access sharedObj
        synchronized void fun5() { ... } // does not access sharedObj
    }

Java provides the implicit this lock by default, and many people like to put synchronized directly on methods. In many cases that is inappropriate: a synchronized instance method locks on this, so every synchronized method of the same object shares a single lock. Be sure to control the lock’s scope when you use it.

The code above applies synchronized to every method of the class, violating the principle that a lock should protect only what it needs to protect. Methods that share no resource end up using the same lock, artificially creating unnecessary lock waits.
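A sketch of one way to repair the class above (the second shared variable and counter fields are hypothetical, added to make the example runnable): give each group of methods its own lock object, so unrelated methods no longer contend:

```java
class MyClass {
    private final Object sharedObjLock = new Object(); // guards sharedCount only
    private final Object statsLock = new Object();     // guards statsCount only
    private int sharedCount;
    private int statsCount; // a second, unrelated shared variable

    void fun1() { // accesses sharedCount
        synchronized (sharedObjLock) { sharedCount++; }
    }

    void fun3() { // accesses statsCount, never sharedCount
        synchronized (statsLock) { statsCount++; }
    }
    // fun1 and fun3 no longer contend: they use different locks, so threads
    // touching unrelated state can proceed in parallel.

    int sharedCount() { synchronized (sharedObjLock) { return sharedCount; } }
    int statsCount()  { synchronized (statsLock)     { return statsCount; } }
}
```

Methods that touch no shared state at all (fun4, fun5 in the original) simply need no synchronization.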

  2. The lock granularity is too large: code that no longer touches the shared resource is left inside the synchronized block instead of being moved out once the access completes. The lock is therefore held too long, and other threads competing for it must wait. For example:

    void fun1() {
        synchronized (lock) {
            // 1. access the shared resource
            ...
            // 2. perform other time-consuming operations unrelated to the shared resource
            ...
        }
    }

The code above makes a thread hold the lock for too long, forcing other threads to wait. How much moving code out of the block helps, however, differs by situation:

On a single CPU, moving time-consuming operations out of the synchronized block improves performance in some cases and not in others.

If the time-consuming code in the block is CPU-intensive (pure computation, with no low-CPU code such as disk or network IO), the CPU is already at 100% utilization while executing it, so shrinking the synchronized block brings no performance improvement. It does not cause any degradation, either.

If the time-consuming code is low-CPU code such as disk or network IO, the CPU is idle while the current thread executes it. Keeping the CPU busy during that time improves overall performance, so moving that code out of the block definitely helps.

On multiple CPUs, moving time-consuming operations out of the synchronized block always improves performance.

If the time-consuming code in the block is pure CPU computation, with no disk or network IO, other CPUs may be idle; shrinking the synchronized block lets other threads execute the code immediately on those CPUs, which improves performance.

If the time-consuming code is low-CPU code such as disk or network IO, some CPU is idle while it runs; keeping the CPUs busy during that time improves overall performance, so in this scenario moving the code out of the synchronized block definitely improves overall performance.

So, in every case, narrowing the scope of synchronization can only be beneficial. The code above is therefore optimized as follows:

    void fun1() {
        synchronized (lock) {
            // 1. access the shared resource
            ...
        }
        // 2. perform other time-consuming operations unrelated to the shared resource
        ...
    }

Abuse of sleep()

sleep() is appropriate only when you genuinely need to wait a fixed length of time. If polling code contains sleep() calls, it is almost certainly a bad design.

Such a design can become a serious performance bottleneck. In a user-facing interactive system, the user directly feels the system slowing down.

In a background message-processing system, message handling is inevitably slow. Polling of this kind can always be replaced with wait() and notify().
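A minimal sketch of that replacement (the queue class and names are hypothetical): instead of sleep-polling, the consumer blocks in wait() and is woken the moment a producer calls notifyAll(), so no fixed sleep interval is ever added to the latency:

```java
import java.util.ArrayDeque;
import java.util.Deque;

class MessageBox<T> {
    private final Deque<T> queue = new ArrayDeque<>();

    // Producer: hand over a message and wake any waiting consumer at once.
    public synchronized void put(T msg) {
        queue.addLast(msg);
        notifyAll(); // waiters resume immediately, no polling delay
    }

    // Consumer: block until a message arrives instead of sleep-polling.
    public synchronized T take() throws InterruptedException {
        while (queue.isEmpty()) {
            wait(); // releases the lock while waiting
        }
        return queue.removeFirst();
    }
}
```

In practice, java.util.concurrent.BlockingQueue already packages this pattern and is usually preferable to hand-rolled wait/notify.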

Misuse of String +

String c = new String("abc") + new String("efg") + new String("12345");

Each + operation produces a temporary object and copies the data, which is a huge drain on performance. This notation has often been a system bottleneck; if it is, switching to StringBuffer brings a significant performance improvement.
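A sketch of the fix (the join helpers are illustrative, and the unsynchronized StringBuilder is used in place of StringBuffer, which is appropriate when no thread safety is needed). The cost is most severe when + runs in a loop, where each pass copies everything built so far:

```java
public class ConcatDemo {
    // Repeated + in a loop creates a fresh intermediate String each pass:
    // O(n^2) copying overall.
    static String slowJoin(String[] parts) {
        String s = "";
        for (String p : parts) s = s + p;
        return s;
    }

    // One mutable buffer, one final copy.
    static String fastJoin(String[] parts) {
        StringBuilder sb = new StringBuilder();
        for (String p : parts) sb.append(p);
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] parts = {"abc", "efg", "12345"};
        System.out.println(fastJoin(parts)); // abcefg12345
    }
}
```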

Inappropriate threading model

In multi-threaded situations, an inappropriate thread model also leads to poor performance. For network IO, for example, use a message send queue and a message receive queue to make the IO asynchronous; this modification can improve performance by tens of times.
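A minimal sketch of the send-queue side of that model (class names are hypothetical, and the network write is a stub that merely counts messages): workers enqueue and return immediately, while a dedicated IO thread drains the queue, so the slow write never blocks a worker:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

class AsyncSender {
    private final BlockingQueue<byte[]> sendQueue = new LinkedBlockingQueue<>();
    final AtomicInteger sent = new AtomicInteger(); // exposed for the demo

    AsyncSender() {
        Thread ioThread = new Thread(() -> {
            try {
                while (true) {
                    byte[] msg = sendQueue.take(); // blocks only this IO thread
                    writeToNetwork(msg);
                    sent.incrementAndGet();
                }
            } catch (InterruptedException e) {
                // interrupted: stop draining
            }
        });
        ioThread.setDaemon(true);
        ioThread.start();
    }

    // Workers call this and return immediately; the slow write happens elsewhere.
    void send(byte[] msg) {
        sendQueue.offer(msg);
    }

    // Placeholder for the real (slow) socket write.
    private void writeToNetwork(byte[] msg) { /* socket.write(msg) */ }
}
```

A receive queue works symmetrically: one thread reads from the socket and enqueues, and worker threads take from the queue.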

Insufficient threads

Where thread pools are used, a pool configured with too few threads also results in poor performance.
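As a rough sizing sketch (the formula is the commonly cited estimate from Java Concurrency in Practice; the wait/compute ratio below is an assumed measurement, not a universal constant), IO-heavy tasks need a pool much larger than the core count, because each thread spends most of its time off-CPU:

```java
public class PoolSizing {
    // threads ≈ cores * targetUtilization * (1 + waitTime / computeTime)
    static int poolSize(int cores, double targetUtilization,
                        double waitTime, double computeTime) {
        return (int) (cores * targetUtilization * (1 + waitTime / computeTime));
    }

    public static void main(String[] args) {
        // Example: 4 cores, full utilization targeted, tasks spend 50 ms
        // waiting on IO for every 5 ms of computation.
        System.out.println(poolSize(4, 1.0, 50, 5)); // 44
    }
}
```

For purely CPU-bound tasks the formula collapses to roughly the number of cores; sizing well beyond that only adds context-switch overhead.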

Frequent GC due to memory leaks

A memory leak makes GC run ever more frequently, and because GC is CPU-intensive, frequent GC seriously degrades overall system performance. We encounter this often.
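A classic shape of such a leak (the class and fields are hypothetical): objects added to a long-lived static collection and never removed stay reachable forever, so the heap keeps growing and each GC cycle reclaims less:

```java
import java.util.HashMap;
import java.util.Map;

class SessionRegistry {
    // Long-lived static map: entries placed here are never collected
    // until they are explicitly removed.
    private static final Map<String, byte[]> CACHE = new HashMap<>();

    static void onRequest(String sessionId) {
        // Leak: one entry per session, never evicted -> heap grows,
        // GC runs more and more often and frees less each time.
        CACHE.put(sessionId, new byte[1024]);
    }

    static void onSessionEnd(String sessionId) {
        CACHE.remove(sessionId); // the missing cleanup that fixes the leak
    }

    static int size() {
        return CACHE.size();
    }
}
```

If onSessionEnd is never called (or sessions never "end"), a bounded cache with eviction, or weak references, is the usual remedy.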

Fourth, analysis methods and tools

All of the performance bottlenecks mentioned above can be traced through thread stack analysis; ThreadDump is well suited to multi-threaded scenarios.
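Besides taking a dump externally (jstack, or kill -3 on Unix), a snapshot can be captured from inside the JVM. A minimal sketch using the standard ThreadMXBean API:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DumpThreads {
    // Print a stack snapshot of every live thread, similar to jstack output.
    static String dump() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');
            for (StackTraceElement e : info.getStackTrace()) {
                sb.append("    at ").append(e).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dump());
    }
}
```

This is handy for logging a dump automatically when the application detects that it is saturated.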

How do I make the bottleneck appear?

Several characteristics of performance bottlenecks:

  1. There is only one performance bottleneck at a time; only after it is solved does the next one become visible. It is like a highway: the narrowest stretch determines the road’s traffic capacity, and only widening that stretch improves throughput for the whole road. Widening the second-narrowest stretch first improves nothing, as shown in the figure:

  2. Performance bottlenecks are dynamic: what is not a bottleneck at low load can become one at high load. Bottlenecks appear under high pressure, but the overhead of attaching JProfiler or similar profilers to the JVM drags performance down so far that the load needed to trigger the bottleneck can no longer be reached. As a result, this kind of bottleneck cannot be found with profilers such as JProfiler or OptimizeIt; thread stack analysis is the genuinely effective approach.

Given these characteristics, the bottleneck must be reproduced at a pressure slightly higher than the system’s current load; otherwise it will not be visible.

How do I find performance bottlenecks through the thread stack?

In general, once a system hits a performance bottleneck, stack analysis shows one of three typical patterns:

  1. Most threads’ stacks show the same calling context, and very few idle threads remain. Possible reasons:

    (a) Too few threads

    (b) Lock competition caused by excessive lock granularity

    (c) Resource competition (e.g. insufficient connections in the database connection pool, causing some threads to acquire connections to be blocked)

    (d) A large number of time-consuming operations (such as a large amount of disk I/O) within the lock range, resulting in lock contention.

    (e) Slow processing on the remote end of a communication (e.g. a slow Dubbo provider), such as poorly performing SQL on the database side.

  2. The vast majority of threads are in the wait state, only a few threads are working, and overall performance is poor. The likely cause is that the system has a critical path, and some stage on that path lacks the capacity to deliver enough tasks to the next stage, leaving the rest of the system idle. In a message-distribution system, for example, distribution is typically one thread while processing is many threads; if distribution is the bottleneck, the thread stack shows exactly this pattern.

  3. The total number of threads is small. The cause is similar to the above: some thread pool implementations create threads on demand, spawning a new thread only when a task arrives. A small thread count therefore means that some stage on the critical path cannot deliver enough tasks to the next stage, so few threads are ever needed.

Here is an example of a stack with a performance bottleneck:

"Thread-243" prio=1 tid=0xa58f2048 nid=0x7ac2 runnable [0xaeedb000..0xaeedc480]
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:129)
    at oracle.net.ns.Packet.receive(Unknown Source)
    ...
    at oracle.jdbc.driver.LongRawAccessor.getBytes()
    at oracle.jdbc.driver.OracleResultSetImpl.getBytes()
    - locked <0x9350b0d8> (a oracle.jdbc.driver.OracleResultSetImpl)
    at oracle.jdbc.driver.OracleResultSet.getBytes()
    ...
    at org.hibernate.loader.hql.QueryLoader.list()
    at org.hibernate.hql.ast.QueryTranslatorImpl.list()
    ...
    at com.wes.NodeTimerOut.execute(NodeTimerOut.java:175)
    at com.wes.timer.TimerTaskImpl.executeAll(TimerTaskImpl.java:707)
    at com.wes.timer.TimerTaskImpl.execute(TimerTaskImpl.java:627)
    - locked <0x80df8ce8> (a com.wes.timer.TimerTaskImpl)
    at com.wes.threadpool.RunnableWrapper.run(RunnableWrapper.java:209)
    at com.wes.threadpool.PooledExecutorEx$Worker.run()
    at java.lang.Thread.run(Thread.java:595)

"Thread-248" prio=1 tid=0xa58f2048 nid=0x7ac2 runnable [0xaeedb000..0xaeedc480]
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:129)
    at oracle.net.ns.Packet.receive(Unknown Source)
    ...
    at oracle.jdbc.driver.LongRawAccessor.getBytes()
    at oracle.jdbc.driver.OracleResultSetImpl.getBytes()
    - locked <0x9350b0d8> (a oracle.jdbc.driver.OracleResultSetImpl)
    at oracle.jdbc.driver.OracleResultSet.getBytes()
    ...
    at org.hibernate.loader.hql.QueryLoader.list()
    at org.hibernate.hql.ast.QueryTranslatorImpl.list()
    ...
    at com.wes.NodeTimerOut.execute(NodeTimerOut.java:175)
    at com.wes.timer.TimerTaskImpl.executeAll(TimerTaskImpl.java:707)
    at com.wes.timer.TimerTaskImpl.execute(TimerTaskImpl.java:627)
    - locked <0x80df8ce8> (a com.wes.timer.TimerTaskImpl)
    at com.wes.threadpool.RunnableWrapper.run(RunnableWrapper.java:209)
    at com.wes.threadpool.PooledExecutorEx$Worker.run()
    at java.lang.Thread.run(Thread.java:595)

"Thread-238" prio=1 tid=0xa4a84a58 nid=0x7abd in Object.wait() [0xaec56000..0xaec57700]
    at java.lang.Object.wait(Native Method)
    at com.wes.collection.SimpleLinkedList.poll(SimpleLinkedList.java:104)
    - locked <0x6ae67be0> (a com.wes.collection.SimpleLinkedList)
    at com.wes.XADataSourceImpl.getConnection_internal(XADataSourceImpl.java:1642)
    ...
    at org.hibernate.impl.SessionImpl.list()
    at org.hibernate.impl.SessionImpl.find()
    at com.wes.DBSessionMediatorImpl.find()
    at com.wes.ResourceDBInteractorImpl.getCallBackObj()
    at com.wes.NodeTimerOut.execute(NodeTimerOut.java:152)
    at com.wes.timer.TimerTaskImpl.executeAll()
    at com.wes.timer.TimerTaskImpl.execute(TimerTaskImpl.java:627)
    - locked <0x80e08c00> (a com.facilities.timer.TimerTaskImpl)
    at com.wes.threadpool.RunnableWrapper.run(RunnableWrapper.java:209)
    at com.wes.threadpool.PooledExecutorEx$Worker.run()
    at java.lang.Thread.run(Thread.java:595)

"Thread-233" prio=1 tid=0xa4a84a58 nid=0x7abd in Object.wait() [0xaec56000..0xaec57700]
    at java.lang.Object.wait(Native Method)
    at com.wes.collection.SimpleLinkedList.poll(SimpleLinkedList.java:104)
    - locked <0x6ae67be0> (a com.wes.collection.SimpleLinkedList)
    at com.wes.XADataSourceImpl.getConnection_internal(XADataSourceImpl.java:1642)
    ...
    at org.hibernate.impl.SessionImpl.list()
    at org.hibernate.impl.SessionImpl.find()
    at com.wes.DBSessionMediatorImpl.find()
    at com.wes.ResourceDBInteractorImpl.getCallBackObj()
    at com.wes.NodeTimerOut.execute(NodeTimerOut.java:152)
    at com.wes.timer.TimerTaskImpl.executeAll()
    at com.wes.timer.TimerTaskImpl.execute(TimerTaskImpl.java:627)
    - locked <0x80e08c00> (a com.facilities.timer.TimerTaskImpl)
    at com.wes.threadpool.RunnableWrapper.run(RunnableWrapper.java:209)
    at com.wes.threadpool.PooledExecutorEx$Worker.run()
    at java.lang.Thread.run(Thread.java:595)
    ...

From the stack, a large number of threads are occupied by JDBC database access. This suggests the connection pool is exhausted: all other threads that cannot get a connection are blocked in Object.wait(). The performance bottleneck is therefore database access, which is consuming all the connections. Once the bottleneck is found, the next step is to analyze the source code: why does database access take so long? Is an index missing, or is an inefficient SQL statement being used?

End conditions for performance tuning

Performance tuning eventually comes to an end. Under what conditions is there no more room for optimization? Two, in summary:

  1. The algorithm is optimized enough, the code is optimized to the extreme
  2. Threads make full use of the CPU

If both conditions are met and performance still cannot satisfy the application’s requirements, the only way to support greater capacity is to buy better machines or build a cluster.

Performance tuning tools

Familiar tools include JProfiler, VisualVM, and the tools bundled with the JDK. However, once these profilers are attached to the system, overall performance drops sharply; in multi-threaded scenarios the load can no longer be pushed high enough, so the bottleneck never appears, and these tools are basically no help. They are better suited to single-threaded code analysis, for finding time-consuming algorithms and code, but are usually powerless against lock misuse in multi-threaded programs.