Daily Bug Check – System Unresponsive – Improper Use of Redis
Preface
The Daily Bug Check series is a collection of simple troubleshooting write-ups. Here I introduce some straightforward debugging tips and accumulate material along the way ^_^.
At the scene of the bug
An online system owned by the development team stopped responding and kept triggering MarkAndSweep (Full GC) alarms, so they asked the author to investigate.
Look at basic monitoring
The first step, of course, is to look at our monitoring: find the IP of the unresponsive machine and pull up its basic metrics.
The machine's memory kept rising. Since this is a Java system, the heap size was fixed at startup, with the initial size equal to the maximum:
-Xms2g -Xmx2g
So it looks like an off-heap memory leak: the heap is capped at 2 GB, yet machine memory keeps climbing. The Full GC alarms are only triggered by the in-heap objects associated with that off-heap memory.
Look at application monitoring
The second step, naturally, is application monitoring; here the author used CAT. Looking at the corresponding application in CAT, it is easy to see that its ActiveThread count is abnormal, climbing past 5000+, which matches the rising memory curve: each of those threads carries a stack allocated outside the heap.
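If CAT is not at hand, the live thread count can also be checked straight from the OS on Linux; a minimal sketch, assuming <pid> is the Java process id:
ls /proc/<pid>/task | wc -l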
jstack
When a Java application runs into a too-many-threads problem, the first tool to reach for is jstack. We dump the thread stacks to a file, open it with less, and find many threads stuck on the code stack below.
"Thread-1234"
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park
......
at org.apache.commons.pool2.impl.LinkedBlockingDeque.takeFirst
......
at redis.clients.util.Pool.getResource
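For reference, a thread dump like the one above can be captured with something along these lines (a sketch; <pid> is the Java process id, and jstack.txt is the file grepped below):
jstack <pid> > jstack.txt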
Obviously these threads are blocked waiting to get a connection from the pool. And the reason they stay stuck for so long must be that no wait timeout was configured. So, are most threads stuck here? Let's gather some statistics.
cat jstack.txt | grep 'prio=' | wc -l
======> 5648
cat jstack.txt | grep 'redis.clients.util.Pool.getResource' | wc -l
======> 5242
As you can see, out of 5648 threads, 5242, or 92%, are stuck in Redis getResource.
Look at the Redis connections
netstat -anp | grep 6379
tcp    0    0 x.x.x.2:111xx    3.4.5.6:6379    ESTABLISHED
......
There are five connections in total, all in the ESTABLISHED state, which looks normal. So the pool must have been configured with a maximum of 5 connections, and the pool is clearly exhausted, since the other threads are all still waiting to get a Redis resource.
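To illustrate, a Jedis pool configured roughly like the sketch below would reproduce exactly this behavior: maxTotal is 5, and with no maxWaitMillis set, getResource() blocks indefinitely once the pool is exhausted. The class name, host, and port here are illustrative, not taken from the actual project.

import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;

public class PoolSetup {
    static JedisPool buildPool() {
        JedisPoolConfig config = new JedisPoolConfig();
        config.setMaxTotal(5);            // only 5 connections, matching the 5 ESTABLISHED sockets
        // config.setMaxWaitMillis(5000); // NOT set, so an exhausted pool blocks getResource() forever
        return new JedisPool(config, "3.4.5.6", 6379); // host/port illustrative
    }
}

Had a wait timeout been set, getResource() would have failed with an exception instead of letting threads pile up past 5000.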
Redis connection leaks
It is natural to suspect a Redis connection leak, i.e. the application takes connections from the pool and they never come back. There are several possible causes for such a leak:
Case 1: The thread that borrowed the connection is stuck on some other operation and never gets the chance to return it.
Case 2: The code borrows the connection and simply forgets to return it.
Case 3: The call to Redis itself is stuck. We can rule this out, because other machines calling the same Redis are fine.
How to distinguish case 1 from case 2
Let's make a simple inference. In case 1, the Redis connection must be reachable from the blocked business Thread, and the reference path must pass through some business entity (a stack frame or a business bean). So we only need to look at the connection's reference path in the heap and check whether it leads back to business code. If there is no such path, it is almost certainly case 2.
Reachability analysis
We can dump the application's memory with jmap and then perform reachability analysis with MAT (Memory Analyzer Tool).
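As a sketch, the dump can be taken with a command like this (<pid> is the Java process id; the file name is illustrative):
jmap -dump:format=b,file=heap.hprof <pid>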
Find RedisConn first
Open the dump file in MAT and run the following OQL (redis.clients.jedis.Jedis is the connection entity class):
select * from redis.clients.jedis.Jedis
This turns up a pile of Jedis instances. On each of them we then run
Path To GC Roots -> with all references
which gives results like this:
redis.clients.jedis.Jedis
|->object
|->item
|->first
|->...
|->java.util.TimerThread
|->internalPool
So our connections are held only by a TimerThread and by internalPool (Jedis's own connection pool); no business thread references them. That makes case 2, forgetting to return the connection, by far the most likely explanation. Time to look at the business code:
void lock() {
    Jedis conn = jedisPool.getResource();
    conn.setnx(...);
    // missing: finally { returnResource(conn); }, or just use RedisTemplate instead
}
In the end it was very simple: after performing the setnx operation, the developer forgot to return the connection to the pool, so every invocation leaked one connection.
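For contrast, here is a leak-free version of the same logic as a minimal sketch (the class and key names are illustrative, not the project's actual code). It relies on try-with-resources: a Jedis borrowed from a pool returns itself to the pool when close() is called, even if setnx throws.

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;

public class RedisLock {
    private final JedisPool jedisPool;

    public RedisLock(JedisPool jedisPool) {
        this.jedisPool = jedisPool;
    }

    // setnx returns 1 when the key was newly set, i.e. the lock was acquired.
    public boolean tryLock(String lockKey, String lockValue) {
        try (Jedis conn = jedisPool.getResource()) { // close() hands the connection back to the pool
            return conn.setnx(lockKey, lockValue) == 1L;
        }
    }
}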
How to locate stuck code in case 1
At this point the problem is solved. But how would we have analyzed case 1? Simple: the heap dump tells us which business thread holds the Jedis instance; we take that thread's name and search the jstack output for it to see which code stack it is stuck on.
jmap:
redis.clients.jedis.Jedis
|->Thread-123
jstack:
Thread-123 prio=...
at xxx.xxx.xxx.blocked
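In practice, once MAT names the holding thread, its stack can be pulled out of the jstack file with a one-liner like this (the thread name here is the illustrative one above):
grep -A 20 'Thread-123' jstack.txt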
Conclusion
This is a very simple problem; once you know the pattern, it doesn't take much effort to track down. Although the root cause turned out to be a very low-level coding mistake, the analysis method itself is well worth learning.