Daily Bug Check – System Unresponsive – Improper Use of Redis
Preface
The Daily Bug Check series is a collection of simple troubleshooting write-ups. Here I introduce some straightforward debugging tips and accumulate material along the way ^_^.
At the scene of the bug
An online system owned by the development team stopped responding and kept triggering MarkAndSweep (Full GC) alarms, so they asked the author to investigate.
Look at basic monitoring
The first step, of course, is to look at our monitoring: find the IP of the unresponsive machine and pull up its basic metrics.
The machine's memory kept rising. Since this is a Java system, the heap size was fixed at startup, with the initial size equal to the maximum:
-Xms2g -Xmx2g
So it looks like an off-heap memory leak: the heap is capped at 2 GB, yet machine memory keeps climbing. The Full GC alarms are only triggered by the in-heap objects associated with that off-heap memory.
Look at application monitoring
The second step, naturally, is application monitoring; here the author used CAT. Looking at the corresponding application in CAT, it is easy to see that its ActiveThread count is abnormal, climbing past 5000+, which matches the rising memory curve: each of those threads carries a stack allocated outside the heap.
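If CAT is not at hand, the live thread count can also be checked straight from the OS on Linux; a minimal sketch, assuming <pid> is the Java process id:
ls /proc/<pid>/task | wc -l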
jstack
When a Java application runs into a too-many-threads problem, the first tool to reach for is jstack. We dump the thread stacks to a file, open it with less, and find many threads stuck on the code stack below.
"Thread-1234"
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park
......
at org.apache.commons.pool2.impl.LinkedBlockingDeque.takeFirst
......
at redis.clients.util.Pool.getResource
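For reference, a thread dump like the one above can be captured with something along these lines (a sketch; <pid> is the Java process id, and jstack.txt is the file grepped below):
jstack <pid> > jstack.txt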
Obviously these threads are blocked waiting to get a connection from the pool. And the reason they stay stuck for so long must be that no wait timeout was configured. So, are most threads stuck here? Let's gather some statistics.
cat jstack.txt | grep 'prio=' | wc -l
======> 5648
cat jstack.txt | grep 'redis.clients.util.Pool.getResource' | wc -l
======> 5242
As you can see, out of 5648 threads, 5242, or 92%, are stuck in Redis getResource.
Look at the Redis connections
netstat -anp | grep 6379
tcp    0    0 x.x.x.2:111xx    3.4.5.6:6379    ESTABLISHED
......
There are five connections in total, all in the ESTABLISHED state, which looks normal. So the pool must have been configured with a maximum of 5 connections, and the pool is clearly exhausted, since the other threads are all still waiting to get a Redis resource.
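To illustrate, a Jedis pool configured roughly like the sketch below would reproduce exactly this behavior: maxTotal is 5, and with no maxWaitMillis set, getResource() blocks indefinitely once the pool is exhausted. The class name, host, and port here are illustrative, not taken from the actual project.

import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;

public class PoolSetup {
    static JedisPool buildPool() {
        JedisPoolConfig config = new JedisPoolConfig();
        config.setMaxTotal(5);            // only 5 connections, matching the 5 ESTABLISHED sockets
        // config.setMaxWaitMillis(5000); // NOT set, so an exhausted pool blocks getResource() forever
        return new JedisPool(config, "3.4.5.6", 6379); // host/port illustrative
    }
}

Had a wait timeout been set, getResource() would have failed with an exception instead of letting threads pile up past 5000.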
Redis connection leaks
It is natural to suspect a Redis connection leak, i.e. the application takes connections from the pool and they never come back. There are several possible causes for such a leak:
Case 1: The thread that borrowed the connection is stuck on some other operation and never gets the chance to return it.
Case 2: The code borrows the connection and simply forgets to return it.
Case 3: The call to Redis itself is stuck. We can rule this out, because other machines calling the same Redis are fine.
How to distinguish case 1 from case 2
Let's make a simple inference. In case 1, the Redis connection must be reachable from the blocked business Thread, and the reference path must pass through some business entity (a stack frame or a business bean). So we only need to look at the connection's reference path in the heap and check whether it leads back to business code. If there is no such path, it is almost certainly case 2.
Reachability analysis
We can dump the application's memory with jmap and then perform reachability analysis with MAT (Memory Analyzer Tool).
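As a sketch, the dump can be taken with a command like this (<pid> is the Java process id; the file name is illustrative):
jmap -dump:format=b,file=heap.hprof <pid>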
Find RedisConn first
Open the dump file in MAT and run the following OQL (redis.clients.jedis.Jedis is the connection entity class):
select * from redis.clients.jedis.Jedis
This turns up a pile of Jedis instances. On each of them we then run
Path To GC Roots -> with all references
which gives results like this:
redis.clients.jedis.Jedis
|->object
|->item
|->first
|->...
|->java.util.TimerThread
|->internalPool
So our connections are held only by a TimerThread and by internalPool (Jedis's own connection pool); no business thread references them. That makes case 2, forgetting to return the connection, by far the most likely explanation. Time to look at the business code:
void lock() {
    Jedis conn = jedisPool.getResource();
    conn.setnx(...);
    // missing: finally { returnResource(conn); }, or just use RedisTemplate instead
}
In the end it was very simple: after performing the setnx operation, the developer forgot to return the connection to the pool, so every invocation leaked one connection.
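For contrast, here is a leak-free version of the same logic as a minimal sketch (the class and key names are illustrative, not the project's actual code). It relies on try-with-resources: a Jedis borrowed from a pool returns itself to the pool when close() is called, even if setnx throws.

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;

public class RedisLock {
    private final JedisPool jedisPool;

    public RedisLock(JedisPool jedisPool) {
        this.jedisPool = jedisPool;
    }

    // setnx returns 1 when the key was newly set, i.e. the lock was acquired.
    public boolean tryLock(String lockKey, String lockValue) {
        try (Jedis conn = jedisPool.getResource()) { // close() hands the connection back to the pool
            return conn.setnx(lockKey, lockValue) == 1L;
        }
    }
}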
How to locate stuck code in case 1
At this point the problem is solved. But how would we have analyzed case 1? Simple: the heap dump tells us which business thread holds the Jedis instance; we take that thread's name and search the jstack output for it to see which code stack it is stuck on.
jmap:
redis.clients.jedis.Jedis
|->Thread-123
jstack:
Thread-123 prio=...
at xxx.xxx.xxx.blocked
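In practice, once MAT names the holding thread, its stack can be pulled out of the jstack file with a one-liner like this (the thread name here is the illustrative one above):
grep -A 20 'Thread-123' jstack.txt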
Conclusion
This is a very simple problem; once you know the pattern, it doesn't take much effort to track down. Although the root cause turned out to be a very low-level coding mistake, the analysis method itself is well worth learning.