scenario

  • A writes A service and calls it to B

  • This service of A is for one key and cannot be called for the second time before the first call is completed. If it is called consecutively, the second call will not execute business, but directly return to the system busy, which is controlled by ConcurrentHashMap + Atomic CAS. (Don’t make fun of our small company with a small number of users who don’t use the distributed service of horizontal expansion for the time being, here we consider the safety of single machine thread)

    // {key -> whether to calculate}
    Map<Long, AtomicBoolean> map = new ConcurrentHashMap<>();
    
    public* * (@RequestBody ..)  {
        / /.. Omit checksum to get primary key
        // Can ensure the safety of single-machine thread
        AtomicBoolean atomicBoolean = map.computeIfAbsent(key, (key)->new AtomicBoolean(false));
        // If you are executing, just go back and come back later
        if(! atomicBoolean.compareAndSet(false.true)) {return "System busy";
        }
        try {
            / /.. Business operation and return
        }finally {
            atomicBoolean.set(false); }}Copy the code

One day, caller B sees the log, and a key calls this service and returns to the system busy all the time, and does not succeed again, so he starts the troubleshooting process.

process

Caller log

First, we look at caller B’s log and find the location of the first failed call with this key:

The 20-06-06 12:09:12 INFO * * - began to call a key 20-06-06 12:09:22 ERROR * * - call fail: java.net.http.HttpTimeoutException: request timed outCopy the code

It can be seen that the Http call was used, and there is 10 seconds between the start of the call and the failure log, which is the request timeout time.

The request did not return, and the subsequent call kept returning busy, indicating that the above cas lock release block was not executed:

try {
    / /.. Business operation and return
}finally {
    // Failed to execute
    atomicBoolean.set(false);
}
Copy the code

Basically guess the problem, the business operation code has not been completed;

Server log

Then I went to the log of server A:

20-06-06 12:09:13.476 INFO ** - End // key, 20-06-06 12:09:13.476 INFO ** - start // This start is other key 20-06-06 12:09:23.267 INFO ** - startCopy the code

Sure enough, a call at 12:09:13 only started, but did not end.

process

Let’s look at the server process thread health of service A:

top
top -Hp <pid>
Copy the code

It is found that A thread has occupied nearly 43 minutes of CPU time, and about an hour has passed since the last call, so it is likely that there is an infinite loop in the business code of A that causes this thread to be running all the time.

Because this line only uses JRE, there is no JDK, there is no jstack command to view detailed thread information, as to why there is no, you have to ask A.

Jstack <pid> jstack <pid>Copy the code

The problem

After looking at the business code, I found the following loop. In order to save A large amount of data in batches, A used A while loop, which did not reach the threshold of saving in batches during the offline test. However, when it came to the online test, it needed to save in batches, but FLG did not exit the loop as expected, resulting in continuous data insertion.

while (flg){
    flg = false;
    if(..) { testService.save(..) ;if(..)
	        flg = true; }}Copy the code

The while loop inserts a number of duplicate data into the database during the last hour, but when we refresh the database, there is no significant new data:

Guess, because the transaction has been cycle here, and have not been submitted, so I went to other mysql connection session invisible uncommitted transaction records, of course, but in order to more confirmation, decided to go to check the following database table file, found the only occupied by more than eight hundred of the record table should have the following data from 700 m or so, so, even though the transaction have not been submitted, But its table files keep growing.

Furthermore, after the refresh, it was found that the change time of the table file kept changing, thus proving the existence of the dead-loop problem code.

To solve

In fact, A’s idea of batch insertion can be reliably implemented by using A buffer method in reactor:

Flux.fromIterable(list)
    .buffer(savePerNum)
    .subscribe(testService::save);
Copy the code

Avoid the use of endless loop, and more readable.

After release, the problem is resolved.

conclusion

  • In the case of a single machine, resources that need to control thread safety can be usedConcurrentHashMap + Atomic~ + cascontrol
  • When a problem occurs, analyze the log to find the key location of the problem, observe and guess the possible cause of the problem
  • To viewjavaProcess and thread information proves conjecture
  • If the transaction is not committed, the table file will continue to change and grow, even though other sessions will not see the uncommitted data
  • Streaming batch inserts can be usedreactorThe buffer is elegantly completed

What suggestions and ideas do you have