scenario
-
A writes A service and calls it to B
-
This service of A is for one key and cannot be called for the second time before the first call is completed. If it is called consecutively, the second call will not execute business, but directly return to the system busy, which is controlled by ConcurrentHashMap + Atomic CAS. (Don’t make fun of our small company with a small number of users who don’t use the distributed service of horizontal expansion for the time being, here we consider the safety of single machine thread)
// {key -> whether to calculate} Map<Long, AtomicBoolean> map = new ConcurrentHashMap<>(); public* * (@RequestBody ..) { / /.. Omit checksum to get primary key // Can ensure the safety of single-machine thread AtomicBoolean atomicBoolean = map.computeIfAbsent(key, (key)->new AtomicBoolean(false)); // If you are executing, just go back and come back later if(! atomicBoolean.compareAndSet(false.true)) {return "System busy"; } try { / /.. Business operation and return }finally { atomicBoolean.set(false); }}Copy the code
One day, caller B sees the log, and a key calls this service and returns to the system busy all the time, and does not succeed again, so he starts the troubleshooting process.
process
Caller log
First, we look at caller B’s log and find the location of the first failed call with this key:
The 20-06-06 12:09:12 INFO * * - began to call a key 20-06-06 12:09:22 ERROR * * - call fail: java.net.http.HttpTimeoutException: request timed outCopy the code
It can be seen that the Http call was used, and there is 10 seconds between the start of the call and the failure log, which is the request timeout time.
The request did not return, and the subsequent call kept returning busy, indicating that the above cas lock release block was not executed:
try {
/ /.. Business operation and return
}finally {
// Failed to execute
atomicBoolean.set(false);
}
Copy the code
Basically guess the problem, the business operation code has not been completed;
Server log
Then I went to the log of server A:
20-06-06 12:09:13.476 INFO ** - End // key, 20-06-06 12:09:13.476 INFO ** - start // This start is other key 20-06-06 12:09:23.267 INFO ** - startCopy the code
Sure enough, a call at 12:09:13 only started, but did not end.
process
Let’s look at the server process thread health of service A:
top
top -Hp <pid>
Copy the code
It is found that A thread has occupied nearly 43 minutes of CPU time, and about an hour has passed since the last call, so it is likely that there is an infinite loop in the business code of A that causes this thread to be running all the time.
Because this line only uses JRE, there is no JDK, there is no jstack command to view detailed thread information, as to why there is no, you have to ask A.
Jstack <pid> jstack <pid>Copy the code
The problem
After looking at the business code, I found the following loop. In order to save A large amount of data in batches, A used A while loop, which did not reach the threshold of saving in batches during the offline test. However, when it came to the online test, it needed to save in batches, but FLG did not exit the loop as expected, resulting in continuous data insertion.
while (flg){
flg = false;
if(..) { testService.save(..) ;if(..)
flg = true; }}Copy the code
The while loop inserts a number of duplicate data into the database during the last hour, but when we refresh the database, there is no significant new data:
Guess, because the transaction has been cycle here, and have not been submitted, so I went to other mysql connection session invisible uncommitted transaction records, of course, but in order to more confirmation, decided to go to check the following database table file, found the only occupied by more than eight hundred of the record table should have the following data from 700 m or so, so, even though the transaction have not been submitted, But its table files keep growing.
Furthermore, after the refresh, it was found that the change time of the table file kept changing, thus proving the existence of the dead-loop problem code.
To solve
In fact, A’s idea of batch insertion can be reliably implemented by using A buffer method in reactor:
Flux.fromIterable(list)
.buffer(savePerNum)
.subscribe(testService::save);
Copy the code
Avoid the use of endless loop, and more readable.
After release, the problem is resolved.
conclusion
- In the case of a single machine, resources that need to control thread safety can be used
ConcurrentHashMap + Atomic~ + cas
control - When a problem occurs, analyze the log to find the key location of the problem, observe and guess the possible cause of the problem
- To view
java
Process and thread information proves conjecture - If the transaction is not committed, the table file will continue to change and grow, even though other sessions will not see the uncommitted data
- Streaming batch inserts can be used
reactor
The buffer is elegantly completed
What suggestions and ideas do you have