First, let’s talk about the problem phenomenon: the API of the sandbox environment on the Intranet is stuck for one week, and all apis are unresponsive
At first, when the test complained about the slow response of the environment, we restarted the application, and the application returned to normal, so no processing was done. But then the problems appeared more and more frequently, and more and more colleagues began to complain, so they felt that there might be something wrong with the code and began to troubleshoot.
First of all, I found that the development of the local IDE did not find the problem, the application freezes when the database, Redis is normal, and no special error log. I began to suspect that the sandbox environment machine was a problem (the test environment is inherently brittle! _!)
SSH runs the following command on the server
top
The machine was fine, so I decided to take a look at the JVM stack information
Let’s start with the problem of using resource-consuming threads
Run top-H-P 12798
Find the first three relatively expensive threads
Jstack looks at the heap memory
Jstack 12798 | grep hexadecimal ff 31 12799
I don’t see anything wrong, so I look at the next 10 lines and I do it, okay
See that some threads are locked. But no business-related code appears and is ignored. There was no clue at this point. Think about it. Decided to abandon the stuck machine
To protect the accident site, dump all the heap memory of the faulty process. Then restart the test environment application in debug mode, and plan to remotely debug the faulty machine when the problem appears again
The next day the problem reappeared, so I informed o&M nGINx forward to remove the problem application and remotely debug Tomcat myself.
They randomly find an interface, breakpoint in the interface entry place, the tragedy began, nothing happened! The API waited for the service response and did not enter the breakpoint. At this time a little confused, calm down for a while, before the entrance of the AOP place under a breakpoint, debug again, this time into the breakpoint, f8 N times later found that the execution of redis command when the main block. Keep following and finally find a problem at a place in Jedis:
/** * Returns a Jedis instance to be used as a Redis connection. The instance can be newly created or retrieved from a * pool. * * @return Jedis instance ready for wrapping into a {@link RedisConnection}.
*/
protected Jedis fetchJedisConnector() {
try {
if(usePool && pool ! = null) {return pool.getResource();
}
Jedis jedis = new Jedis(getShardInfo());
// force initialization (see Jedis issue # 82)
jedis.connect();
return jedis;
} catch (Exception ex) {
throw new RedisConnectionFailureException("Cannot get Jedis connection", ex); }}Copy the code
The thread starts wait after pool.getResource()
public T getResource() {
try {
return internalPool.borrowObject();
} catch (Exception e) {
throw new JedisConnectionException("Could not get a resource from the pool", e); }}Copy the code
return internalPool.borrowObject(); This code should be a rental code, and follow
public T borrowObject(long borrowMaxWaitMillis) throws Exception {
this.assertOpen();
AbandonedConfig ac = this.abandonedConfig;
if(ac ! = null && ac.getRemoveAbandonedOnBorrow() && this.getNumIdle() < 2 && this.getNumActive() > this.getMaxTotal() - 3) { this.removeAbandoned(ac); } PooledObject<T> p = null; boolean blockWhenExhausted = this.getBlockWhenExhausted(); longwaitTime = 0L;
while(p == null) {
boolean create = false;
if (blockWhenExhausted) {
p = (PooledObject)this.idleObjects.pollFirst();
if (p == null) {
create = true;
p = this.create();
}
if (p == null) {
if (borrowMaxWaitMillis < 0L) {
p = (PooledObject)this.idleObjects.takeFirst();
} else {
waitTime = System.currentTimeMillis();
p = (PooledObject)this.idleObjects.pollFirst(borrowMaxWaitMillis, TimeUnit.MILLISECONDS);
waitTime = System.currentTimeMillis() - waitTime; }}if (p == null) {
throw new NoSuchElementException("Timeout waiting for idle object");
}
Copy the code
There’s a code in there
if (p == null) {
if (borrowMaxWaitMillis < 0L) {
p = (PooledObject)this.idleObjects.takeFirst();
} else {
waitTime = System.currentTimeMillis();
p = (PooledObject)this.idleObjects.pollFirst(borrowMaxWaitMillis, TimeUnit.MILLISECONDS);
waitTime = System.currentTimeMillis() - waitTime; }}Copy the code
BorrowMaxWaitMillis <0 would always execute and then it would loop and start wondering if this value wasn’t configured
The redis pool configuration is not configured for MaxWaitMillis, and the else code is an Exception
Continue to F8
public E takeFirst() throws InterruptedException {
this.lock.lock();
Object var2;
try {
Object x;
while((x = this.unlinkFirst()) == null) {
this.notEmpty.await();
}
var2 = x;
} finally {
this.lock.unlock();
}
return var2;
}
Copy the code
When I find the lock word here, I begin to suspect that all request apis are blocked
Arthas is an open source Java diagnostic tool for Alibaba.
Executing the thread command
Http-nio-8083-exec – this thread is actually the Tomcat thread that produces the HTTP request
Pick a random thread to look at the heap
thread -428
This is a confirmation that the API is going around in circles, and this redis code is getting connections,
Read this memory code and all the threads are waiting for @53e5504e to release the lock. So JStack does a global search for 53E5504E and can’t find the thread.
Since then. The cause of the problem can be identified as a redis connection acquisition problem. But what causes the failure to obtain the connection can not be determined
Arthas thread-b again (thread-b, find threads currently blocking other threads)
Nothing. Instead of finding a blocking thread, I looked at the documentation for this command and found the following sentence
Well, we happen to be the latter…
Let me think again. This time modify the Redis Pool configuration to set the connection timeout to 2s, and then wait for the problem to recur to see what the application does when it finally works.
Add some configuration
JedisConnectionFactory jedisConnectionFactory = new JedisConnectionFactory(); . JedisPoolConfig config = new JedisPoolConfig(); config.setMaxWaitMillis(2000); . jedisConnectionFactory.afterPropertiesSet();Copy the code
Restart the service and wait…
Another day, and again
SSH server, check tomcat Accesslog, found a large number of API requests 500,
org.springframework.data.redis.RedisConnectionFailureException: Cannot get Jedis connection; nested exception is redis.clients.jedis.exceptions.JedisConnectionException: Could not get a resource fr om the pool at org.springframework.data.redis.connection.jedis.JedisConnectionFactory.fetchJedisConnector(JedisConnectionFactory.java:1 40) at org.springframework.data.redis.connection.jedis.JedisConnectionFactory.getConnection(JedisConnectionFactory.java:229) at org.springframework.data.redis.connection.jedis.JedisConnectionFactory.getConnection(JedisConnectionFactory.java:57) at org.springframework.data.redis.core.RedisConnectionUtils.doGetConnection(RedisConnectionUtils.java:128) at org.springframework.data.redis.core.RedisConnectionUtils.getConnection(RedisConnectionUtils.java:91) at org.springframework.data.redis.core.RedisConnectionUtils.getConnection(RedisConnectionUtils.java:78) at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:177) at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:152) at org.springframework.data.redis.core.AbstractOperations.execute(AbstractOperations.java:85) at org.springframework.data.redis.core.DefaultHashOperations.get(DefaultHashOperations.java:48)Copy the code
Find out where the source first appeared at 500,
Find the following code
. Cursor c = stringRedisTemplate.getConnectionFactory().getConnection().scan(options);while (c.hasNext()) {
.....,,
}
Copy the code
Analysis of this code, stringRedisTemplate. GetConnectionFactory () getConnection () to obtain the pool after the redisConnection, no follow-up action, In other words, the redisConnection in the Redis connection pool is not released or returned to the connection pool after the connection is leased. The redisConnection in the redis connection pool is idle, but the redisConnection in the pool has not returned to idle
Normal should be
Since then the problem has been found.
Summary: Spring stringRedisTemplate encapsulates redis general operations, but does not yet support Commands such as Scan SetNx, so you need to get jedis Connection to do special Commands
use
stringRedisTemplate.getConnectionFactory().getConnection()
Copy the code
Is not recommended
We can use
stringRedisTemplate.execute(new RedisCallback<Cursor>() {
@Override
public Cursor doInRedis(RedisConnection connection) throws DataAccessException {
returnconnection.scan(options); }});Copy the code
To perform,
Or after using connection, use the
RedisConnectionUtils.releaseConnection(conn, factory);
Copy the code
To release the connection.
In addition, it is not recommended to use the keys command in redis, and the redis pool configuration should be properly configured. Otherwise, there will be no error log and no error report when problems occur, which is quite difficult to locate.