Anomalous site review

The video task function was pressure-tested on July 7. Some interface requests failed during the pressure test.

Our task push is mainly to B. The amount of concurrent is not large, 10 concurrent, 25 times, a total of 250 tasks. The amount of concurrency is not large, why the exception, how the exception?

First, let’s look at ELK log monitoring.

First, query the gateway logs to search for the internal gateway. The query condition is internal gateway + request path and view the concurrency.

We saw a total of 207 concurrent requests, all visually successful. But what if it was supposed to be 250, and it was 207? To query the exception. How to query? Search conditions:

  1. Project. key is the name of the project
  2. response is not 200

A problem has been found. There are 21 logs requesting 500.

Step 2: Access the gateway to query the exception cause.

Querying the gateway log based on the traceId of failed requests in elK, burning the goose does not yield anything. Take the traceId to the business server to look for it

Step 3: Find the service server logs.

The service server threw an exception and failed to obtain the lock.

Abnormal cause analysis and solution

Looking back at yesterday’s outlier data, we saw several outliers. Here we add a redisson lock with key: paperId+originId+originType. In this case, the originId of the pressure test should be different. How can there be a lock.

Online logs are not comprehensive. In order to simulate this problem, jMeter was started locally for pressure testing. Test range originId generated 20 randomly from 100,000-999999. The probability of the ID repeating is very low, so it’s very effective

20 requests. Two of them failed. The problems in the field of pressure measurement are reproduced. This is easy to do.

Step 1: View local logs and find abnormal logs

It’s the same anomaly as the online pressure survey.

Step 2: Determine whether the randomly generated originId is duplicate.

Because the key of a Redisson lock is not allowed to be repeated, if repeated, it will be waited, wait timeout, the lock will be thrown to fail. Troubleshooting result: There are no duplicate OriginidsCopy the code

The third step: check Redisson source code, in the exception point of interruption, step by step.

@Around("controllerAspect(lock)")
public Object aroundAdvice(ProceedingJoinPoint proceedingJoinPoint, Lock lock) throws Throwable {
    String[] keys = lock.keys();
    if (keys.length == 0) {
        throw new RuntimeException("Keys cannot be empty.");
    }
    String[] parameterNames = new LocalVariableTableParameterNameDiscoverer().getParameterNames(((MethodSignature) proceedingJoinPoint.getSignature()).getMethod());
    Object[] args = proceedingJoinPoint.getArgs();

    long attemptTimeout = lock.attemptTimeout();
    if (attemptTimeout == 0) {
        attemptTimeout = redissonProperties.getAttemptTimeout();
    }
    long lockWatchdogTimeout = lock.lockWatchdogTimeout();
    if (lockWatchdogTimeout == 0) {
        lockWatchdogTimeout = redissonProperties.getLockWatchdogTimeout();
    }
    LockModel lockModel = lock.lockModel();
    if (lockModel.equals(LockModel.AUTO)) {
        LockModel lockModel1 = redissonProperties.getLockModel();
        if(lockModel1 ! =null) {
            lockModel = lockModel1;
        } else if (keys.length > 1) {
            lockModel = LockModel.MULTIPLE;
        } else{ lockModel = LockModel.REENTRANT; }}if(! lockModel.equals(LockModel.MULTIPLE) && ! lockModel.equals(LockModel.REDLOCK) && keys.length >1) {
        throw new RuntimeException("Multiple parameters, lock mode: ->" + lockModel.name() + ".cannot lock");
    }
    log.info("Lock mode ->{}, wait for lock time ->{} seconds. Maximum lock time ->{} seconds",lockModel.name(),attemptTimeout/1000,lockWatchdogTimeout/1000);
    boolean res = false;
    RLock rLock = null;
    // Keep waiting for the lock.
    switch (lockModel) {
        case FAIR:
            rLock = redissonClient.getFairLock(getVauleBySpel(keys[0],parameterNames,args));
            break;
        case REDLOCK:
            RLock[] locks=new RLock[keys.length];
            int index=0;
            for (String key : keys) {
                locks[index++]=redissonClient.getLock(getVauleBySpel(key,parameterNames,args));
            }
            rLock = new RedissonRedLock(locks);
            break;
        case MULTIPLE:
            RLock[] locks1=new RLock[keys.length];
            int index1=0;
            for (String key : keys) {
                locks1[index1++]=redissonClient.getLock(getVauleBySpel(key,parameterNames,args));
            }
            rLock = new RedissonMultiLock(locks1);
            break;
        case REENTRANT:
            rLock= redissonClient.getLock(getVauleBySpel(keys[0],parameterNames,args));
            break;
        case READ:
            RReadWriteLock rwlock = redissonClient.getReadWriteLock(getVauleBySpel(keys[0],parameterNames,args));
            rLock = rwlock.readLock();
            break;
        case WRITE:
            RReadWriteLock rwlock1 = redissonClient.getReadWriteLock(getVauleBySpel(keys[0],parameterNames,args));
            rLock = rwlock1.writeLock();
            break;
    }

    / / implementation of aop
    if(rLock! =null) {
        try {
            if (attemptTimeout == -1) {
                res = true;
                // Keep waiting for the lock
                rLock.lock(lockWatchdogTimeout, TimeUnit.MILLISECONDS);
            } else {
                res = rLock.tryLock(attemptTimeout, lockWatchdogTimeout, TimeUnit.MILLISECONDS);
            }
            if (res) {
                Object obj = proceedingJoinPoint.proceed();
                return obj;
            }else{
                throw new LockException("Failed to acquire lock"); }}finally {
           if(res) { rLock.unlock(); }}}throw new LockException("Failed to acquire lock");
}
Copy the code

The exception is in this code, so read this code carefully. How to generate keys, how to lock them, how to retry. Suddenly, notice the code snippet that parses the key

public String getVauleBySpel(String key, String[] parameterNames, Object[] values) {

    if(! key.startsWith("#")) {return key;
    }
    / / spel parser
    ExpressionParser parser = new SpelExpressionParser();
    / / spel context
    EvaluationContext context = new StandardEvaluationContext();
    for (int i = 0; i < parameterNames.length; i++) {
        context.setVariable(parameterNames[i], values[i]);
    }
    Expression expression = parser.parseExpression(key);
    Object value = expression.getValue(context);
    log.info("key={},value={}",key,value);
    if(value! =null) {return value.toString();
    }
    return "";
}
Copy the code

If (! key.startsWith(“#”)){return key; } If redisson’s key does not start with #, then the argument to the key is not parsed. That’s the problem.

Normally, when we define a key, we put a constant in front of the variable. Can facilitate log investigation.

It is this problem that causes all concurrent requests to always be a lock. Finally, there is a waiting timeout problem.

Step 4: Modify the lock to put the constant after it

Pressure test again, the result is normal.

conclusion

When setting the key, if you add a constant, do not load before, to add after, if you modify the source code, that is another story.

In fact, constants are also mentioned in the Reddison-spring-boot-starter documentation. On the one hand, the hint is written in a comment; on the other hand, it is not explicitly stated that the constant should be there. For those of us who use Redis and are used to putting constants first, this is a bit of a hole.

At this point, the abnormal investigation is over. Record the investigation process. I hope that when we use Redisson again, we can avoid going to similar pits.