First of all, let’s talk about the problem symptoms: ** The application stuck in the sandbox API on the Intranet for one week, and all the apis did not respond. ** At first, when the test complained about the slow response of the environment, we restarted the application. The application returned to normal, so no processing was done. But then the problems appeared more and more frequently, and more and more colleagues began to complain, so they felt that there might be something wrong with the code and began to troubleshoot. First of all, I found that the development of the local IDE did not find the problem, the application freezes when the database, Redis is normal, and no special error log. I began to suspect the Sandbox environment machine was a problem, the test environment itself is very brittle! _! Then SSH on the server to execute the following command top

At this point, I found that the machine was fine, so I decided to look at the JVM stack information first to look at the problem of the application of the resource-consuming thread top – H-P 12798

Find the first three relatively expensive threads jStack and look at the heap memory

Jstack 12798 | grep hexadecimal ff 31 12799

I don’t see anything wrong, so I look at the next 10 lines and I do it, okay

See that some threads are locked. But no business-related code appears and is ignored. There was no clue at this point. After some thinking, decided to give up the stuck state of the machine. In order to protect the accident site, all the heap memory of the affected process is dumped. Then, the test environment application is restarted in debug mode, with the intention of directly remotely debugging the affected machine when the problem recur. The next day the problem reappeared, so I informed o&M nGINx forward to remove the problem application and remotely debug Tomcat myself. They randomly find an interface, breakpoint in the interface entry place, the tragedy began, nothing happened! The API waited for the service response and did not enter the breakpoint. Mentally at this time, to calm for a while, before entrance of aop place a breakpoint, the debug again, this time into the breakpoints, f8 N times when stuck when redis command. Keep following and finally find a problem at a place in Jedis:

/** * Returns a Jedis instance to be used as a Redis connection. The instance can be newly created or retrieved from a *  pool. * * @return Jedis instance ready for wrapping into a {@link RedisConnection}. */ protected Jedis fetchJedisConnector() { try { if (usePool && pool ! = null) { return pool.getResource(); } Jedis jedis = new Jedis(getShardInfo()); // force initialization (see Jedis issue #82) jedis.connect(); return jedis; } catch (Exception ex) { throw new RedisConnectionFailureException("Cannot get Jedis connection", ex); }}Copy the code

The thread starts wait after pool.getResource()

public T getResource() { try { return internalPool.borrowObject(); } catch (Exception e) { throw new JedisConnectionException("Could not get a resource from the pool", e); }}Copy the code

return internalPool.borrowObject(); This code should be a rental code, and follow

public T borrowObject(long borrowMaxWaitMillis) throws Exception { this.assertOpen(); AbandonedConfig ac = this.abandonedConfig; if (ac ! = null && ac.getRemoveAbandonedOnBorrow() && this.getNumIdle() < 2 && this.getNumActive() > this.getMaxTotal() - 3) { this.removeAbandoned(ac); } PooledObject<T> p = null; boolean blockWhenExhausted = this.getBlockWhenExhausted(); long waitTime = 0L; while(p == null) { boolean create = false; if (blockWhenExhausted) { p = (PooledObject)this.idleObjects.pollFirst(); if (p == null) { create = true; p = this.create(); } if (p == null) { if (borrowMaxWaitMillis < 0L) { p = (PooledObject)this.idleObjects.takeFirst(); } else { waitTime = System.currentTimeMillis(); p = (PooledObject)this.idleObjects.pollFirst(borrowMaxWaitMillis, TimeUnit.MILLISECONDS); waitTime = System.currentTimeMillis() - waitTime; } } if (p == null) { throw new NoSuchElementException("Timeout waiting for idle object"); }Copy the code

There’s a code in there

if (p == null) { if (borrowMaxWaitMillis < 0L) { p = (PooledObject)this.idleObjects.takeFirst(); } else { waitTime = System.currentTimeMillis(); p = (PooledObject)this.idleObjects.pollFirst(borrowMaxWaitMillis, TimeUnit.MILLISECONDS); waitTime = System.currentTimeMillis() - waitTime; }}Copy the code

BorrowMaxWaitMillis <0 would always execute and then it would loop and start wondering if this value wasn’t configured

The redis pool configuration is not configured for MaxWaitMillis, and the else code is an Exception

Continue to F8

public E takeFirst() throws InterruptedException {
    this.lock.lock();

    Object var2;
    try {
        Object x;
        while((x = this.unlinkFirst()) == null) {
            this.notEmpty.await();
        }

        var2 = x;
    } finally {
        this.lock.unlock();
    }

    return var2;
}
Copy the code

When I find the lock word here, I begin to suspect that all request apis are blocked

Arthas is an open source Java diagnostic tool for Alibaba.

Executing the thread command

Http-nio-8083-exec: thread 428: thread 428: thread 428: thread 428: thread 428: thread 428: thread 428: thread 428: thread 428: thread 428

This is a problem that confirms that it’s the API that keeps going around and around, and it’s this redis code that gets the connection, reading this memory code that all the threads are waiting for @53e5504e to release the lock. So JStack does a global search for 53E5504E and can’t find the thread. Since then. The cause of the problem can be identified as a redis connection acquisition problem. Arthas thread-b (arthas thread-b, arthas thread-b, arthas thread-b)

Nothing. Instead, it should be able to find a blocking thread, so I looked at the documentation for this command and found the following sentence

Well, we happen to be the latter…

Let me think again. This time modify the Redis Pool configuration to set the connection timeout to 2s, and then wait for the problem to recur to see what the application does when it finally works.

Add some configuration

JedisConnectionFactory jedisConnectionFactory = new JedisConnectionFactory(); . JedisPoolConfig config = new JedisPoolConfig(); config.setMaxWaitMillis(2000); . jedisConnectionFactory.afterPropertiesSet();Copy the code

Restart the service and wait… The next day, the SSH server again, check the Tomcat Accesslog, found a large number of API requests 500,

org.springframework.data.redis.RedisConnectionFailureException: Cannot get Jedis connection; nested exception is redis.clients.jedis.exceptions.JedisConnectionException: Could not get a resource fr om the pool at org.springframework.data.redis.connection.jedis.JedisConnectionFactory.fetchJedisConnector(JedisConnectionFactory.java:1 40) at org.springframework.data.redis.connection.jedis.JedisConnectionFactory.getConnection(JedisConnectionFactory.java:229) at  org.springframework.data.redis.connection.jedis.JedisConnectionFactory.getConnection(JedisConnectionFactory.java:57) at  org.springframework.data.redis.core.RedisConnectionUtils.doGetConnection(RedisConnectionUtils.java:128) at org.springframework.data.redis.core.RedisConnectionUtils.getConnection(RedisConnectionUtils.java:91) at org.springframework.data.redis.core.RedisConnectionUtils.getConnection(RedisConnectionUtils.java:78) at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:177) at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:152) at org.springframework.data.redis.core.AbstractOperations.execute(AbstractOperations.java:85) at org.springframework.data.redis.core.DefaultHashOperations.get(DefaultHashOperations.java:48)Copy the code

Find out where the source first appeared at 500,

Find the following code

. Cursor c = stringRedisTemplate.getConnectionFactory().getConnection().scan(options); while (c.hasNext()) { ..... ,,}Copy the code

Analysis of this code, stringRedisTemplate. GetConnectionFactory () getConnection () to obtain the pool after the redisConnection, no follow-up action, In other words, the redisConnection in the Redis connection pool is not released or returned to the connection pool after the connection is leased. The redisConnection in the redis connection pool is idle, but the redisConnection in the pool has not returned to idle.

Normal should be

Since then the problem has been found.

** Spring stringRedisTemplate encapsulates redis general operations, but does not yet support Commands like Scan SetNx, requiring jedis Connection to do special Commands.

StringRedisTemplate. GetConnectionFactory (.) getConnection () is not recommended, we can use the following:

stringRedisTemplate.execute(new RedisCallback<Cursor>() { @Override public Cursor doInRedis(RedisConnection connection) throws DataAccessException { return connection.scan(options); }});Copy the code

Or, after using connection, use the

RedisConnectionUtils.releaseConnection(conn, factory);
Copy the code

To release the connection.

In addition, it is not recommended to use the keys command in redis, and the redis pool configuration should be properly configured. Otherwise, there will be no error log and no error report when problems occur, which is quite difficult to locate.