Recently online occasionally will report a deadlock problem, last week finally solved, tidy up over the weekend. Although the problem is resolved, the deadlock graph in the trace file is still not understood. It would be great if someone could tell me about it, but if no one could… Turn to the document again.

background

In the past month, there have been occasional deadlock reports on the line. The error is shown as follows:

The error “ORA-00060: DEADLOCK detected while waiting for resource” indicates that the Oracle database has detected a deadlock. At this point, Oracle rolls back one of the transactions that caused the deadlock, the other transaction executes normally (without being aware of the deadlock), and throws the above error message to the transaction that did the rollback.

The two pieces of code causing the deadlock were located with the help of the DBA. Since the project has a lot of pessimistic locks, using “SELECT… “FOR UPDATE” locks resources exclusively, so the first instinct is to see if the two pieces of code lock two or more resources in reverse order.

However, after analyzing the code, the possible cause of the deadlock was not immediately found. The operation of the two codes on the database resource is shown in the following table.

In Session1, only acTOR1 is locked and updated. In Session 2, actor2 and ACtor1 are locked in sequence. Session2 should also report “lock wait timeout” instead of deadlock.

To verify that these two pieces of code were indeed causing the deadlock, test code was written and two threads were opened to simulate the two pieces of code for the deadlock and remove the business logic unrelated to the database to see if it could be reproduced. After all, I still have a little doubt that the DBA made a mistake, not these two pieces of code. The code is as follows:

@SpringApplicationConfiguration(classes = DeadLockTest.class)@ImportAutoConfiguration({CommonConfig.class})public class DeadLockTest extends BaseUnitDbTest {  Long lenderId = 16642L;  Long borrowerId = 16643L;  @Autowired  private ActorService actorService;  @Autowired  @Qualifier(CommonConfig.ORACLE_TRANSACTION_MANAGER_NAME)  private PlatformTransactionManager platformTransactionManager;  private ExecutorService es = Executors.newFixedThreadPool(5, new ThreadFactoryBuilder().setNameFormat("Test-Thread-%d").build());  @Test  public void testRefreshAndLockActor() throws Exception {    es.invokeAll(Lists.newArrayList(this::lock1, this::lock2));  }  public Void lock1() {    TransactionTemplate t = new TransactionTemplate(platformTransactionManager);    t.execute((s) -> {      System.out.println("Before Lock " + Thread.currentThread().getName());      Actor lender = actorService.refreshAndLockActor(lenderId);      try {        Thread.sleep(6000);      } catch (InterruptedException e) {        e.printStackTrace();      }      lender.setLockedForInv(BigDecimal.ONE);      actorService.update(lender);      System.out.println("After Lock " + Thread.currentThread().getName());      return null;    });    return null;  }  public Void lock2() {    TransactionTemplate t = new TransactionTemplate(platformTransactionManager);    t.execute((s) -> {      try {        Thread.sleep(1000);      } catch (InterruptedException e) {        e.printStackTrace();      }      System.out.println("Before Lock " + Thread.currentThread().getName());      Actor borrower = actorService.refreshAndLockActor(borrowerId);      Actor lender = actorService.refreshAndLockActor(lenderId);      System.out.println("After Lock " + Thread.currentThread().getName());      return null;    });    return null;  }}

As a result… Actually called a deadlock…



The console error is shown below

Question why

Now that we’ve identified the two pieces of code that cause the deadlock, we just need to figure out why. SELECT … FOR UPDATE, add row level exclusive lock to resource. There must have been some unknown operation during UPDATE to Actor1 that caused Session1 to acquire Actor2’s lock, resulting in a deadlock.

The first suspect is triggers, although they are currently banned by the company, there are still some triggers left over from the main library for historical reasons. We ran all the triggers with Minka, and we pretty much ruled them out as triggers.

Although this is not a trigger problem, it will be mentioned, when it comes to triggers, if you are not very familiar with the system, it is really difficult to detect errors…

After excluding triggers, the DBA suggested that it could only be caused by foreign keys. After the DBA removed the foreign key from the DEV environment database, the test code was executed again and the deadlock was no longer reported.

Actor1’s refer_id is a self-referential Integrity constraint key. Actor1’s refer_id is a self-referential Integrity constraint key. Actor2’s refer_id is a self-referential Integrity constraint key. Therefore, when the full acTOR1 field is updated, the refer_id is also updated (actually the same value). Due to the constraint of foreign key, when the acTOR1 refer_id is updated to the ID of ACTOR2, we need to ensure that acTOR2 exists and cannot be deleted during the update process. So Session1 will lock actor2. The lock on actor2 is already held by Session2, and Session2 is waiting for the lock on ACtor1.

Here’s a picture:

deadlock

The solution

It took more than a day to find the problem and a few minutes to fix it. In fact, I just need to update two fields on the Actor. I don’t need to update all fields at all.

So instead of calling the full field update method, add a partial field update method, so that the foreign key field of ACTOR1 will not be updated, so that the lock of ACTOR2 will not be requested during the update of ACTOR1.

conclusion

  • Although this problem is not caused by triggers, it still makes sense to disable triggers, otherwise the problem will result in blood vomiting
  • Foreign key this east east, also can not use it, by the program control. Companies don’t allow foreign keys and triggers anymore, but due to historical reasons, some older systems will have to be reconfigured slowly. The foreign key column does not add an index, resulting in the update of the primary table field lock child table, the next blog can learn under the foreign key and deadlock have to say those things. Forget it, or not to set a flag, basically said what to write next, there is no next…
  • Do not update the full field. Regardless of the deadlock issue, updating full fields is inefficient. Let’s update only the changed fields.
  • I can’t be lazy, I saved 5 minutes, I spent more than a day looking for bugs… Are tears

legacy

The problem is solved, but there’s still a little doubt. The deadlock graph in the trace file looks like this:

trace file

So the question is, it is understandable that two sessions hold an X lock for two resources, but why are they waiting for an S lock? At least Session2 is waiting for acTOR1 to lock exclusive rows. Ask good people to answer.

The next articleHow does Reddit count the number of views per post