Background
One afternoon at about 4 o'clock, I got a message from a colleague: in production, an HTTP service I maintain was suddenly timing out heavily (the caller's timeout is set to 300ms). I went straight to the EagleEye platform to pull up sampled traces and found that the service's average QPS was around 120 and the average RT had climbed to 2-3 seconds, with spikes as high as 5-6 seconds (normally it sits at around 60ms).
QPS situation:
RT situation:
Problem solving
This service backs an internal operations platform (only two Docker instances are deployed). It was expected to see single-digit QPS, and nothing had been released online recently. Its core work is aggregating database queries; a single request can involve up to about 40 DB queries. I had previously agreed verbally with the caller that they would add caching, but a QPS of around 120 proved that no caching was in place, so I asked the caller to add caching and bring the QPS down. QPS then fell below 80 and RT returned to normal (about 60ms on average), and eventually QPS dropped all the way to around 40 (the caller still had to be pushed to finish the caching later so that QPS stays in the single digits).
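As an illustration only (the caller's actual implementation is not shown in the original article, and every name below is hypothetical), a minimal caller-side cache could be built with Guava's LoadingCache so that repeated lookups stop hitting the service:

```java
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

import java.util.concurrent.TimeUnit;
import java.util.function.Function;

public class CallerSideCache {

    private final LoadingCache<String, String> cache;

    // 'remoteQuery' stands in for the real HTTP call to the operations-platform service.
    public CallerSideCache(Function<String, String> remoteQuery) {
        this.cache = CacheBuilder.newBuilder()
                .maximumSize(10_000)                      // bound memory usage on the caller side
                .expireAfterWrite(5, TimeUnit.MINUTES)    // tolerate slightly stale data
                .build(new CacheLoader<String, String>() {
                    @Override
                    public String load(String key) {
                        return remoteQuery.apply(key);    // only executed on a cache miss
                    }
                });
    }

    public String get(String key) {
        return cache.getUnchecked(key);
    }
}
```

With a bounded size and a short TTL, the upstream QPS drops to roughly the cache-miss rate, which is how the caller brought the traffic back toward single digits.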
Locating the problem
Since the core work of the service is querying the database, with about 40 DB queries per request, the first suspect was slow SQL. The DB performance monitoring showed that the average DB RT was under 0.3ms, so the total DB time per request is roughly 40 × 0.3ms ≈ 12ms. Slow SQL was therefore ruled out as the cause of the high RT.
The next suspicion was that requests were queuing for DB connections under the higher concurrency; TDDL's default connection pool size is 10. The monitoring showed that the number of connections in use never exceeded 7 during the whole incident, so an exhausted connection pool was ruled out as well.
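TDDL is Alibaba's internal data-access middleware, so its pool usage came straight from the platform's monitoring. As a generic sketch of the same check with an open-source pool (HikariCP here, purely as a stand-in with placeholder connection settings, not what this service uses), the in-use and waiting counts can be read from the pool's MXBean:

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import com.zaxxer.hikari.HikariPoolMXBean;

public class PoolMonitor {

    public static void main(String[] args) throws InterruptedException {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:mysql://localhost:3306/demo"); // placeholder URL
        config.setUsername("demo");
        config.setPassword("demo");
        config.setMaximumPoolSize(10); // same size as the TDDL default mentioned above

        try (HikariDataSource ds = new HikariDataSource(config)) {
            HikariPoolMXBean pool = ds.getHikariPoolMXBean();
            // Sample the pool once per second; if threadsAwaitingConnection stays at 0
            // and activeConnections never reaches the maximum, the pool is not the bottleneck.
            for (int i = 0; i < 60; i++) {
                System.out.printf("active=%d idle=%d waiting=%d%n",
                        pool.getActiveConnections(),
                        pool.getIdleConnections(),
                        pool.getThreadsAwaitingConnection());
                Thread.sleep(1000);
            }
        }
    }
}
```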
At this point, the database level had been ruled out as the cause of the high RT.
Next, I looked at each execution point in the sampled invocation chains to see where the time was going. Many execution points shared the same pattern: the local call time was very long (hundreds of milliseconds), while the actual remote call (for example the DB query itself) was very short (0ms means the execution took less than 1ms, which also indirectly confirms the earlier observation that the average DB RT is under 0.3ms).
```
Local invocation time:    267ms
Client sends request:     0ms
Server processes request: 0ms
Client receives response: 1ms
Total time:               1ms
```
At this point the shape of the problem is clear: local method execution is taking far too long. However, after reviewing all of the service's code again, there was no local logic that should take anywhere near that long.
Turning to the machine metrics, the load average had been hovering around 4 for a long time. Our Docker container runs on a 4c8G host, but it does not have those 4 cores to itself, so sustaining such a high load is abnormal. Tracing the cause of the high CPU load led to the GC log, which showed a large number of Allocation Failure events, with ParNew running more than 100 times per minute, which is clearly abnormal. A sample GC log entry:
```
2020-03-25T16:16:18.390+0800: 1294233.934: [GC (Allocation Failure) 2020-03-25T16:16:18.391+0800: 1294233.935: [ParNew: 1770060K->25950K(1922432K), secs] 175763K->361653K(4019584K), secs] [Times: user=0.12 sys=0.00, real=0.04 secs]
```
Each ParNew collection only pauses the application for about 0.04s (the real time in the log), but at 100+ collections per minute that adds up to roughly 100 × 0.04s ≈ 4 seconds of GC work every minute, so the overall CPU cost is still significant.
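To confirm a collection rate like this without parsing GC logs, one option (a generic sketch, not the tooling used here) is to poll the JVM's GarbageCollectorMXBeans from inside the process or over JMX; with CMS enabled, the young-generation collector's bean is named "ParNew":

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcRateProbe {

    // Prints the number of young-generation collections and the accumulated pause time
    // once per minute, so a rate like "100+ ParNew per minute" is easy to spot.
    public static void main(String[] args) throws InterruptedException {
        long lastCount = 0;
        long lastTimeMs = 0;
        while (true) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                // With ParNew as the young-generation collector, the bean is named "ParNew".
                if ("ParNew".equals(gc.getName())) {
                    long count = gc.getCollectionCount();
                    long timeMs = gc.getCollectionTime();
                    System.out.printf("ParNew: %d collections, %d ms of GC time in the last minute%n",
                            count - lastCount, timeMs - lastTimeMs);
                    lastCount = count;
                    lastTimeMs = timeMs;
                }
            }
            Thread.sleep(60_000);
        }
    }
}
```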
Take a look at the JVM memory parameters:
```
Heap Configuration:
   MinHeapFreeRatio         = 40
   MaxHeapFreeRatio         = 70
   MaxHeapSize              = 4294967296 (4096.0MB)
   NewSize                  = 2147483648 (2048.0MB)
   MaxNewSize               = 2147483648 (2048.0MB)
   OldSize                  = 2147483648 (2048.0MB)
   NewRatio                 = 2
   SurvivorRatio            = 10
   MetaspaceSize            = 268435456 (256.0MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 536870912 (512.0MB)
   G1HeapRegionSize         = 0 (0.0MB)
```
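A quick back-of-the-envelope check on these numbers: the young generation is fixed at 2GB (NewSize = MaxNewSize = 2048MB), and SurvivorRatio = 10 means Eden takes 10 of the 12 equal parts of the young generation (10 for Eden plus one for each survivor space), so Eden ≈ 2048MB × 10 / 12 ≈ 1707MB, which matches the "about 1.7G" figure below.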
The young generation is assigned 2GB of memory, of which the Eden space is about 1.7GB. Using jmap to check which objects occupy the most space in the young generation, the top entries include several classes under the org.apache.tomcat.util.buf package, such as ByteChunk, CharChunk and MessageBytes, as well as the lists of temporary objects created while assembling the response. ByteChunk and the related classes belong to Tomcat's response output path.
At this point the problem is clear. Sending a large response body (about 50KB) means the data has to be copied from user space into kernel space and then on to the NIC (zero-copy techniques, as used in Netty, target exactly this kind of copying). On top of that, building the response involves a large number of temporary object lists, and under high concurrency these exhaust the young generation and trigger frequent GC (the interface will be throttled at an appropriate time later). One more point worth stressing: many people assume ParNew GC does not stop the world, but it does. Frequent ParNew collections block the user threads and give up their CPU time slices, so connection handling queues up and the interface's RT rises. Throughout the investigation, visual monitoring platforms such as EagleEye and the iDB performance monitoring were a huge help; without them, digging through logs everywhere would have been dizzying.
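As a toy reproduction only (this is not the service's actual code), the following sketch mimics the allocation pattern described above: each simulated request builds dozens of temporary lists and a response of roughly 50KB. Run with a deliberately small young generation on JDK 8, for example -Xmn64m -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails (CMS uses ParNew for the young generation), Eden fills quickly and ParNew collections fire very frequently.

```java
import java.util.ArrayList;
import java.util.List;

public class YoungGenPressureDemo {

    // Builds one ~50KB "response" the way the service does: many small result lists
    // are created per request and then serialized into one large string.
    static String handleRequest() {
        List<List<String>> intermediate = new ArrayList<>();
        for (int query = 0; query < 40; query++) {            // ~40 DB queries per request
            List<String> rows = new ArrayList<>();
            for (int row = 0; row < 30; row++) {
                rows.add("row-" + query + "-" + row + "-some-column-values-padding-padding");
            }
            intermediate.add(rows);                            // temporary lists kept until the end
        }
        StringBuilder response = new StringBuilder(50 * 1024);
        for (List<String> rows : intermediate) {
            for (String row : rows) {
                response.append(row).append(',');
            }
        }
        return response.toString();                            // roughly 50KB of response text
    }

    public static void main(String[] args) {
        // Simulate sustained traffic; all of the above becomes young-generation garbage
        // as soon as each "request" finishes, which is what drives the frequent ParNew GCs.
        long requests = 0;
        while (true) {
            handleRequest();
            requests++;
            if (requests % 100_000 == 0) {
                System.out.println("handled " + requests + " simulated requests");
            }
        }
    }
}
```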
Lessons learned
- Interface design: avoid large response bodies; divide and conquer by splitting one large interface into several smaller ones (see the pagination sketch after this list).
- Caching design: for a service like this one, where a single request triggers about 40 DB queries, caching needs to be considered on the server side (I was lazy at the time and pushed the caching onto the caller).
- Performance design: be responsible for the performance of your own system; establish its ceiling through load testing and similar means, otherwise one slow interface can drag down the availability of the whole application.
- Traffic isolation: internal applications and external traffic need to be isolated from each other; even with caching in place, cache breakdown can still happen.
- Verbal agreements are unreliable; even what is written down in documents needs to be verified against the actual implementation.
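As a sketch of the first point (all names here are hypothetical, not the service's real API), one large aggregate interface can be split into small paged calls so that each response body stays small:

```java
import java.util.List;

// Instead of one endpoint that returns the full ~50KB aggregate in a single response,
// the data is exposed in small pages so each response body stays small.
public class PagedQuery {

    // Simple page container: the items plus a flag telling the caller whether to continue.
    public static final class Page<T> {
        public final List<T> items;
        public final boolean hasNext;

        public Page(List<T> items, boolean hasNext) {
            this.items = items;
            this.hasNext = hasNext;
        }
    }

    // Hypothetical data source standing in for the aggregated DB queries.
    public interface OperationStore {
        List<String> findRecords(String key, int offset, int limit);
    }

    private static final int MAX_PAGE_SIZE = 100;

    private final OperationStore store;

    public PagedQuery(OperationStore store) {
        this.store = store;
    }

    // Each call returns at most MAX_PAGE_SIZE records; the caller iterates over pages
    // instead of receiving one large response body in a single call.
    public Page<String> query(String key, int pageNo, int pageSize) {
        int limit = Math.min(pageSize, MAX_PAGE_SIZE);
        List<String> items = store.findRecords(key, pageNo * limit, limit + 1); // fetch one extra row
        boolean hasNext = items.size() > limit;                                  // to detect a next page
        return new Page<>(hasNext ? items.subList(0, limit) : items, hasNext);
    }
}
```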
Author: edenbaby
Reference: club.perfma.com/article/184…