I. Background

One day, an upstream caller reported that a Dubbo interface provided by our service was being circuit-broken for a short period at a fixed time every day, with exceptions indicating that the provider's Dubbo thread pool was exhausted. At the time, the interface was receiving 1.8 billion requests per day, with about 940,000 error requests per day. The optimization journey began here.

II. Rapid Emergency Response

2.1 Quick Location

First, routine system metrics (machine, JVM memory, GC, threads) were checked. Although there were slight spikes, they were all within a reasonable range and did not match the times of the errors, so they were set aside for the moment.

Next, traffic analysis showed a surge in traffic at fixed time points every day, and the surge coincided with the time of the errors, so the preliminary judgment was that the errors were caused by short bursts of heavy traffic.

Traffic trend

Degraded request volume

Interface P99 line

III. Finding Performance Bottlenecks

3.1 Interface Process Analysis

3.1.1 flowchart

3.1.2 Process analysis

After receiving a request, the service calls the downstream interface; the call is wrapped with a Hystrix circuit breaker whose timeout is 500 ms.

The detail data is then assembled from the data returned by the downstream interface. It is first looked up in the local cache; if the local cache misses, it falls back to Redis.

If the downstream call in the first step fails, fallback ("bottom-pocket") data is used instead: it is likewise looked up in the local cache first, falls back to Redis on a miss, and an asynchronous thread reloads the data directly from the database.
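To make the flow concrete, below is a minimal sketch of how such a call can be wrapped in a Hystrix command with a 500 ms timeout and a fallback path. The class and method names (DownstreamService, FallbackStore, queryDetail, loadDetail) are hypothetical illustrations, not the project's actual code.

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;
import com.netflix.hystrix.HystrixThreadPoolProperties;

// Hypothetical collaborators, declared only so the sketch is self-contained.
interface DownstreamService { String queryDetail(String id); }
interface FallbackStore { String loadDetail(String id); }

public class DetailCommand extends HystrixCommand<String> {

    private final DownstreamService downstream;
    private final FallbackStore fallbackStore;
    private final String id;

    public DetailCommand(DownstreamService downstream, FallbackStore fallbackStore, String id) {
        super(Setter
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey("detailQuery"))
                .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                        .withExecutionTimeoutInMilliseconds(500))   // 500 ms breaker window from the flow above
                .andThreadPoolPropertiesDefaults(HystrixThreadPoolProperties.Setter()
                        .withCoreSize(50)));                        // coreSize = 50 as mentioned in 3.2.1
        this.downstream = downstream;
        this.fallbackStore = fallbackStore;
        this.id = id;
    }

    @Override
    protected String run() {
        // Normal path: call the downstream Dubbo interface, then assemble the
        // detail data from local cache -> Redis (step 2 of the flow).
        return downstream.queryDetail(id);
    }

    @Override
    protected String getFallback() {
        // Fallback ("bottom-pocket") path: local cache -> Redis, while an
        // asynchronous thread reloads the data from the database.
        return fallbackStore.loadDetail(id);
    }
}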

3.2 Troubleshooting performance bottlenecks

3.2.1 The Downstream Interface Takes Too Long

The call chain shows that the downstream interface's P99 line does spike above 1 s at peak traffic. However, given the circuit-breaker settings (break at 500 ms, coreSize & maxSize = 50) and the fact that the downstream interface averages under 10 ms, the downstream interface was judged not to be the key problem. To further eliminate interference and fail fast whenever the downstream service spikes, the circuit-breaker timeout was lowered to 100 ms and the Dubbo timeout to 100 ms.
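For reference, a hedged sketch of what tightening those two timeouts could look like; it reuses the hypothetical DownstreamService from the sketch above, and the annotation package depends on the Dubbo version actually in use.

import com.alibaba.dubbo.config.annotation.Reference;
import com.netflix.hystrix.HystrixCommandProperties;

public class DetailClientConfig {

    // Dubbo consumer side: lower the per-reference timeout to 100 ms so a slow
    // downstream fails fast (an XML <dubbo:reference timeout="100"/> is equivalent).
    @Reference(timeout = 100, check = false)
    private DownstreamService downstream;

    // Hystrix side: lower the execution timeout from 500 ms to 100 ms.
    static HystrixCommandProperties.Setter tightenedCommandProperties() {
        return HystrixCommandProperties.Setter()
                .withExecutionTimeoutInMilliseconds(100);
    }
}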

3.2.2 Fetching Detail Data: Local Cache Miss, Falling Back to Redis

With the help of the call-chain platform, the first step was to analyze the Redis request traffic in order to estimate the local cache hit ratio. It turned out that Redis traffic was twice the interface traffic, which by design should never happen. A code review revealed a logic problem:

the code never read from the local cache and fetched directly from Redis, and the maximum Redis response time indeed showed unreasonable spikes. Further analysis showed that the Redis response-time spikes basically coincided with the Dubbo P99 spikes. At that point it felt as if the root cause had been found, and I was secretly pleased.
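The fix is simply to consult the local cache before Redis. A minimal sketch of the corrected read path is shown below; the article does not name the local cache library, so Guava Cache is used here purely for illustration, and the size and TTL are assumed values.

import java.util.concurrent.TimeUnit;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

import redis.clients.jedis.JedisCluster;

public class DetailCacheReader {

    // Hypothetical local cache; size and TTL are illustrative only.
    private final Cache<String, String> localCache = CacheBuilder.newBuilder()
            .maximumSize(10_000)
            .expireAfterWrite(60, TimeUnit.SECONDS)
            .build();

    private final JedisCluster jedisCluster;

    public DetailCacheReader(JedisCluster jedisCluster) {
        this.jedisCluster = jedisCluster;
    }

    /** Corrected read path: local cache first, Redis only on a miss. */
    public String getDetail(String key) {
        String value = localCache.getIfPresent(key);
        if (value != null) {
            return value;               // local hit: no Redis round trip at all
        }
        value = jedisCluster.get(key);  // local miss: fall back to Redis
        if (value != null) {
            localCache.put(key, value); // populate the local cache for later requests
        }
        return value;
    }
}

With a read path like this, Redis traffic should drop to at most the local-cache miss rate rather than twice the interface traffic.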

Redis requests traffic

Service interface request traffic

Dubbo P99 line

Redis maximum response time

3.2.3 Fetching Fallback Data: Local Cache Miss, Falling Back to Redis

This path was normal.

3.2.4 Writing the Request Result to Redis

Because Redis is resource-isolated and no slow queries showed up in the DB platform's logs, there were many possible reasons for the slow Redis at this point. Other causes were (subjectively) set aside for the moment, and attention was focused on the doubled Redis request traffic, so the problem found in 3.2.2 was fixed first.
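For context on this write path (which becomes relevant again in 4.2), recording the result with an expiry is typically a single SETEX per request. The sketch below is illustrative only; the key format and TTL are hypothetical, not the project's actual values.

import redis.clients.jedis.JedisCluster;

public class ResultWriter {

    private final JedisCluster jedisCluster;

    public ResultWriter(JedisCluster jedisCluster) {
        this.jedisCluster = jedisCluster;
    }

    /** Cache the assembled result with a TTL; SETEX = SET + EXPIRE in one command. */
    public void cacheResult(String requestKey, String resultJson) {
        // 300-second TTL and the key prefix are illustrative values only.
        jedisCluster.setex("detail:" + requestKey, 300, resultJson);
    }
}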

IV. Solutions

4.1 Releasing the Fix for the Fault Located in 3.2.2

Number of Redis requests before going online

Number of Redis requests after going online

After the release, the doubled Redis traffic was gone and the spike in maximum Redis response time was alleviated to some extent, but not completely, indicating that the high-volume queries were not the most fundamental cause.

Maximum Redis response time (before going live)

Maximum Redis response time (after going live)

4.2 Redis Capacity Expansion

With the abnormal Redis traffic fixed but the problem still not fully solved, the only thing to do was calm down and carefully work through the possible causes of slow Redis, mainly along three lines:

  • A slow query occurred

  • The Redis service has a performance bottleneck

  • The client configuration is incorrect

These were investigated one by one. Querying the Redis slow log turned up no slow queries.

Using the call-chain platform to analyze the slow Redis commands in detail, and now free of the interference from the traffic-induced slow queries, locating the problem went very quickly: a large number of the time-consuming requests were SETEX calls, and the occasional slow read also came right after a SETEX. Given Redis's single-threaded nature, SETEX was held responsible for the Redis P99 spikes. After finding the specific statements and the specific service, the first step was to request expanding Redis from 6 masters to 8 masters.

Redis before expansion

Redis after expansion

Judging from the results, the expansion had essentially no effect, which means the Redis service itself was not the performance bottleneck; the remaining suspect was the client-side configuration.

4.3 Optimization of Client Parameters

4.3.1 Connection Pool Optimization

Since the Redis expansion had no effect, there were two possible directions on the client side.

First, the client might have a bug in connection management in Redis Cluster mode; second, the connection-pool parameters might be set unreasonably. Source-code analysis and connection-pool parameter tuning proceeded in parallel.

4.3.1.1 Checking for Bugs in Client Connection Management

Analysis of the client's connection-pool handling in the source code found no problem: as expected, connection pools are cached per slot, so the first hypothesis was ruled out. The relevant source code is shown below.

1. setex
  public String setex(final byte[] key, final int seconds, final byte[] value) {
    return new JedisClusterCommand<String>(connectionHandler, maxAttempts) {
      @Override
      public String execute(Jedis connection) {
        return connection.setex(key, seconds, value);
      }
    }.runBinary(key);
  }

2. runBinary
  public T runBinary(byte[] key) {
    if (key == null) {
      throw new JedisClusterException("No way to dispatch this command to Redis Cluster.");
    }

    return runWithRetries(key, this.maxAttempts, false, false);
  }

3. runWithRetries
  private T runWithRetries(byte[] key, int attempts, boolean tryRandomNode, boolean asking) {
    if (attempts <= 0) {
      throw new JedisClusterMaxRedirectionsException("Too many Cluster redirections?");
    }

    Jedis connection = null;
    try {
      if (asking) {
        // TODO: Pipeline asking with the original command to make it faster....
        connection = askConnection.get();
        connection.asking();

        // if asking success, reset asking flag
        asking = false;
      } else {
        if (tryRandomNode) {
          connection = connectionHandler.getConnection();
        } else {
          connection = connectionHandler.getConnectionFromSlot(JedisClusterCRC16.getSlot(key));
        }
      }

      return execute(connection);
    }
    // ... exception handling (redirections and retries) and connection release omitted

4. getConnectionFromSlot
  public Jedis getConnectionFromSlot(int slot) {
    JedisPool connectionPool = cache.getSlotPool(slot);
    if (connectionPool != null) {
      // It can't guaranteed to get valid connection because of node assignment
      return connectionPool.getResource();
    } else {
      renewSlotCache(); // It's abnormal situation for cluster mode, that we have just nothing for slot, try to rediscover state
      connectionPool = cache.getSlotPool(slot);
      if (connectionPool != null) {
        return connectionPool.getResource();
      } else {
        // no choice, fallback to new connection to random node
        return getConnection();
      }
    }
  }

4.3.1.2 Analyzing Connection-Pool Parameters

After communicating with the middleware team and consulting the official commons-pool2 documentation, the following changes were made:
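The concrete values agreed with the middleware team are in the table above and are not reproduced here. As a hedged illustration, this kind of tuning is applied through JedisPoolConfig, which extends commons-pool2's GenericObjectPoolConfig; only maxWaitMillis = 200 ms is taken from the article, the other values are placeholders.

import redis.clients.jedis.JedisPoolConfig;

public class RedisPoolTuning {

    public static JedisPoolConfig tunedConfig() {
        JedisPoolConfig config = new JedisPoolConfig();
        config.setMaxWaitMillis(200);        // from the article: cap the borrow wait at 200 ms
        config.setBlockWhenExhausted(true);  // block (up to maxWaitMillis) instead of failing immediately
        // Placeholder values below, not the project's actual settings.
        config.setMaxTotal(50);              // max connections per node pool
        config.setMaxIdle(50);               // let all connections stay idle rather than be torn down
        config.setMinIdle(20);               // keep a floor of idle connections
        config.setTestOnBorrow(false);       // avoid an extra PING per borrow on the hot path
        return config;
    }
}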

After the parameter adjustment, the number of requests taking more than 1 s decreased but did not disappear, and the errors reported upstream dropped from about 900,000 per day to about 60,000 per day (why requests can still exceed 200 ms after maxWaitMillis is set to 200 ms is explained below).

Maximum Redis response time after parameter optimization

Number of interface errors after parameter optimization

4.3.2 Further Optimization

Optimization could not stop there: how could all Redis write requests be brought under 200 ms? The idea at this point was to keep tuning the client configuration, so the Jedis connection-acquisition source code was analyzed.

Source code for how Jedis obtains a connection from the pool (commons-pool2's borrowObject):

final AbandonedConfig ac = this.abandonedConfig;
if (ac != null && ac.getRemoveAbandonedOnBorrow() &&
        (getNumIdle() < 2) &&
        (getNumActive() > getMaxTotal() - 3) ) {
    removeAbandoned(ac);
}

PooledObject<T> p = null;

// Get local copy of current config so it is consistent for entire
// method execution
final boolean blockWhenExhausted = getBlockWhenExhausted();

boolean create;
final long waitTime = System.currentTimeMillis();

while (p == null) {
    create = false;
    p = idleObjects.pollFirst();
    if (p == null) {
        p = create();
        if (p != null) {
            create = true;
        }
    }
    if (blockWhenExhausted) {
        if (p == null) {
            if (borrowMaxWaitMillis < 0) {
                p = idleObjects.takeFirst();
            } else {
                p = idleObjects.pollFirst(borrowMaxWaitMillis, TimeUnit.MILLISECONDS);
            }
        }
        if (p == null) {
            throw new NoSuchElementException(
                    "Timeout waiting for idle object");
        }
    } else {
        if (p == null) {
            throw new NoSuchElementException("Pool exhausted");
        }
    }
    if (!p.allocate()) {
        p = null;
    }

    if (p != null) {
        try {
            factory.activateObject(p);
        } catch (final Exception e) {
            try {
                destroy(p);
            } catch (final Exception e1) {
                // Ignore - activation failure is more important
            }
            p = null;
            if (create) {
                final NoSuchElementException nsee = new NoSuchElementException(
                        "Unable to activate object");
                nsee.initCause(e);
                throw nsee;
            }
        }
        if (p != null && (getTestOnBorrow() || create && getTestOnCreate())) {
            boolean validate = false;
            Throwable validationThrowable = null;
            try {
                validate = factory.validateObject(p);
            } catch (final Throwable t) {
                PoolUtils.checkRethrow(t);
                validationThrowable = t;
            }
            if (!validate) {
                try {
                    destroy(p);
                    destroyedByBorrowValidationCount.incrementAndGet();
                } catch (final Exception e) {
                    // Ignore - validation failure is more important
                }
                p = null;
                if (create) {
                    final NoSuchElementException nsee = new NoSuchElementException(
                            "Unable to validate object");
                    nsee.initCause(validationThrowable);
                    throw nsee;
                }
            }
        }
    }
}

updateStatsBorrow(p, System.currentTimeMillis() - waitTime);

return p.getObject();


The general process for obtaining connections is as follows:

If there is an idle connection, return it directly; if not, try to create one.

If the maximum number of connections has already been reached during creation, it checks whether other threads are creating connections: if not, it returns directly; if so, it waits up to maxWaitMillis (since the other threads' creation may fail), and if there is still no connection it creates one itself (so the total wait for obtaining a connection can exceed maxWaitMillis).

If creation fails, blockWhenExhausted decides what happens next: if the pool is not configured to block, a "Pool exhausted" exception is thrown immediately; if it is, the caller blocks waiting for an idle connection, indefinitely when maxWaitMillis is less than 0, otherwise for at most maxWaitMillis, after which a timeout exception is thrown.

Subsequent steps then decide, based on parameters such as testOnBorrow, whether to validate the connection, and so on.

Based on the analysis above, with maxWaitMillis currently set to 200 ms, these steps add up to at most about 400 ms of blocking, and 200 ms in most cases, so there should be no spikes beyond 400 ms.

That left connection creation as the likely culprit, since creating a connection is expensive and its duration is variable. The focus turned to confirming whether this scenario actually occurred, using the DB platform's monitoring of Redis connections.

Redis server connection count (DB monitoring platform)

The figure shows that at several time points (9:00, 12:00, 19:00, ...) the number of Redis connections did increase, which basically matches the Redis spike times. After the earlier false starts I felt, more cautiously this time, that the problem was finally located: when traffic surges, the available connections in the pool cannot meet demand, new connections have to be created, and requests end up waiting.

The idea was therefore to create the connections when the service starts, to minimize how many new connections have to be created at runtime. The connection-pool parameter vivo.cache.depend.common.poolConfig.minIdle was modified accordingly, but, unexpectedly, it had no effect!

Not willing to give up, I turned to the commons-pool2 source code (Jedis uses commons-pool2 for connection pooling) to find out why minIdle did not take effect.

commons-pool2 source code (the version used in the project):

public GenericObjectPool(final PooledObjectFactory<T> factory,
        final GenericObjectPoolConfig<T> config) {
 
    super(config, ONAME_BASE, config.getJmxNamePrefix());
 
    if (factory == null) {
        jmxUnregister(); // tidy up
        throw new IllegalArgumentException("factory may not be null");
    }
    this.factory = factory;
 
    idleObjects = new LinkedBlockingDeque<>(config.getFairness());
 
    setConfig(config);
}


The middleware team provided the following source code (commons-pool2-2.4.jar): here, startEvictor is called at the end of the constructor.

1. Initialize the connection pool
    public GenericObjectPool(PooledObjectFactory<T> factory, GenericObjectPoolConfig config) {
        super(config, ONAME_BASE, config.getJmxNamePrefix());
        if (factory == null) {
            jmxUnregister(); // tidy up
            throw new IllegalArgumentException("factory may not be null");
        }
        this.factory = factory;
        idleObjects = new LinkedBlockingDeque<PooledObject<T>>(config.getFairness());
        setConfig(config);
        startEvictor(getTimeBetweenEvictionRunsMillis());
    }


Why are they different? Checking the Jar packages showed the versions differ: the middleware version is v2.4.2, while the project is actually using v2.6.2. Analysis of startEvictor shows that one of its steps handles the connection-pool warm-up logic.

How the Jedis connection pool is warmed up:

1. startEvictor
    final void startEvictor(long delay) {
        synchronized (evictionLock) {
            if (null != evictor) {
                EvictionTimer.cancel(evictor);
                evictor = null;
                evictionIterator = null;
            }
            if (delay > 0) {
                evictor = new Evictor();
                EvictionTimer.schedule(evictor, delay, delay);
            }
        }
    }

2. Evictor
    class Evictor extends TimerTask {
        /**
         * Run pool maintenance. Evict objects qualifying for eviction and then
         * ensure that the minimum number of idle instances are available.
         * Since the Timer that invokes Evictors is shared for all Pools but
         * pools may exist in different class loaders, the Evictor ensures that
         * any actions taken are under the class loader of the factory
         * associated with the pool.
         */
        @Override
        public void run() {
            ClassLoader savedClassLoader =
                    Thread.currentThread().getContextClassLoader();
            try {
                if (factoryClassLoader != null) {
                    // Set the class loader for the factory
                    ClassLoader cl = factoryClassLoader.get();
                    if (cl == null) {
                        // The pool has been dereferenced and the class loader
                        // GC'd. Cancel this timer so the pool can be GC'd as
                        // well.
                        cancel();
                        return;
                    }
                    Thread.currentThread().setContextClassLoader(cl);
                }

                // Evict from the pool
                try {
                    evict();
                } catch (Exception e) {
                    swallowException(e);
                } catch (OutOfMemoryError oome) {
                    // Log problem but give evictor thread a chance to continue
                    // in case error is recoverable
                    oome.printStackTrace(System.err);
                }
                // Re-create idle instances.
                try {
                    ensureMinIdle();
                } catch (Exception e) {
                    swallowException(e);
                }
            } finally {
                // Restore the previous CCL
                Thread.currentThread().setContextClassLoader(savedClassLoader);
            }
        }
    }

3. ensureMinIdle
    void ensureMinIdle() throws Exception {
        ensureIdle(getMinIdle(), true);
    }

4. ensureIdle
    private void ensureIdle(int idleCount, boolean always) throws Exception {
        if (idleCount < 1 || isClosed() || (!always && !idleObjects.hasTakeWaiters())) {
            return;
        }

        while (idleObjects.size() < idleCount) {
            PooledObject<T> p = create();
            if (p == null) {
                // Can't create objects, no reason to think another call to
                // create will work. Give up.
                break;
            }
            if (getLifo()) {
                idleObjects.addFirst(p);
            } else {
                idleObjects.addLast(p);
            }
        }
        if (isClosed()) {
            // Pool closed while object was being added to idle objects.
            // Make sure the returned object is destroyed rather than left
            // in the idle object pool (which would effectively be a leak)
            clear();
        }
    }

After changing the Jar version and adding two parameters in the configuration center, vivo.cache.depend.common.poolConfig.timeBetweenEvictionRunsMillis (how often idle connections in the pool are checked; connections idle for more than minEvictableIdleTimeMillis milliseconds are closed until the pool is down to minIdle connections) and vivo.cache.depend.common.poolConfig.minEvictableIdleTimeMillis (how long a connection in the pool may stay idle, in milliseconds), the service was restarted, the connection pool warmed up normally, and the problem was finally solved at the Redis level.
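Expressed directly in commons-pool2 terms, the effect of those two configuration-center keys (together with minIdle) corresponds roughly to the sketch below; the concrete values are illustrative assumptions, not the project's settings. With timeBetweenEvictionRunsMillis greater than 0 the Evictor timer is scheduled, and each run ends with ensureMinIdle(), which pre-creates connections up to minIdle, which is exactly the warm-up behaviour described above.

import redis.clients.jedis.JedisPoolConfig;

public class RedisPoolWarmupConfig {

    public static JedisPoolConfig warmupConfig() {
        JedisPoolConfig config = new JedisPoolConfig();
        // A value > 0 schedules the Evictor TimerTask, whose run() calls
        // evict() and then ensureMinIdle().
        config.setTimeBetweenEvictionRunsMillis(30_000);  // illustrative: every 30 s
        // Idle connections older than this are closed by evict(), but the pool
        // is topped back up to minIdle by ensureMinIdle().
        config.setMinEvictableIdleTimeMillis(60_000);     // illustrative: 60 s
        config.setMinIdle(20);                            // illustrative warm-pool floor
        return config;
    }
}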

The optimization results are shown below; the performance problem was basically solved.

Redis response time (before optimization)

Redis response time (after optimization)

Interface P99 line (before optimization)

Interface P99 line (after optimization)

V. Summary

When an online fault occurs, the first priority is to restore service quickly and minimize business impact, so rate-limiting, circuit-breaking, and degradation strategies must be prepared for online services in advance, allowing a recovery plan to be found quickly when a fault does occur. Proficiency with the company's monitoring platforms determines how quickly a problem can be located; every developer should treat fluent use of the monitoring platforms (machines, services, interfaces, DB, and so on) as a basic skill.

When Redis responds slowly, check three things first: the Redis cluster servers (machine load, slow queries on the service), the business code (possible bugs), and the client (whether the connection-pool configuration is reasonable). This covers the majority of slow-Redis problems.

A Redis connection pool needs to be warmed up when the system cold-starts. Different versions of commons-pool2 have different cold-start strategies, but in either case the relevant parameters (such as minEvictableIdleTimeMillis) must be configured for the warm-up to take effect. Read the official commons-pool2 documentation and understand the common parameters well, so that problems can be located quickly when they occur.

The default connection-pool parameters hold up poorly under heavy traffic and need to be tuned for high-traffic scenarios; for low-traffic scenarios the defaults are fine.

Analyze each problem on its own terms; when a problem resists solution, think flexibly and try multiple approaches to work around it.

Vivo Internet Server Team -Wang Shaodong