As those of you who follow my blog (www.hollischuang.com) probably know, I previously wrote an article about 2PC and 3PC (see: About distributed transactions, two-phase commit protocols, and three-phase commit protocols). That article mainly introduced the concepts, commit processes, and advantages and disadvantages of these two distributed consistency protocols. This article builds on it to take a deeper look at both protocols. It mainly analyzes why 2PC has data consistency problems, how 3PC solves part of the problems in 2PC, and why 3PC can still leave data inconsistent.

If you are not familiar with the concepts of distributed systems, 2PC, and 3PC, I recommend reading the distributed systems series first.

The coordinator

In a distributed system, each node knows whether its own transaction succeeded or failed, but it cannot know the transaction status of the other nodes. Therefore, when a transaction spans multiple nodes (for example, in Taobao’s ordering process, the order system and the inventory system might be deployed on different nodes), a coordinator is introduced to ensure that the transaction meets the ACID requirements. The other nodes are called participants. The coordinator schedules the behavior of the participants and ultimately decides whether they commit the transaction.

Two-phase Commit Protocol (2PC)

The two-phase commit protocol is divided into two phases: the prepare phase and the commit phase.

Many things in daily life are committed in two phases like this. For example, this scene often appears in Western weddings:

Priest: “Will you marry this woman? To love and be true to her in need, in sickness and in disability, till death do you part. Do you?”

Groom: “I do!”

Priest: “Will you marry this man? To love and be true to him in need, in sickness and in disability, till death do you part. Do you?”

Bride: “I do!”

Priest: Now face each other, take each other’s hands, and say your vows to each other as husband and wife.

Groom: I, XXX, take you to be my wife with all my heart. For better or for worse, for richer or for poorer, in sickness and in health, in happiness and in sorrow, I will love you with all my heart, I will try to understand you, and I will trust you completely. We will become a whole, a part of each other, we will face life together, to share our dreams, as equal and faithful partners, through the rest of life.

Bride: I take you to be my husband with all my heart. In prosperity or adversity, rich or poor, in health or in sickness, happy or sad, I will love you without reservation, I will try to understand you, and I will trust you completely. We will become a whole, a part of each other, we will face all of life together, to share our dreams, and to spend the rest of our lives as equal and faithful partners.

The classic scenario above is a typical two-phase commit process.

First, the coordinator (the priest) asks the two participants (the couple) whether they can commit the transaction (whether they are willing to marry). Each participant executes the transaction first, without committing it, and then returns YES if the execution succeeded or NO if it did not.

After the coordinator has received feedback from all participants, the commit phase begins. If all participants returned YES, the coordinator sends a COMMIT request; if any participant returned NO, it sends a ROLLBACK request.

Note that in the prepare phase of the two-phase commit protocol, a participant does more than answer YES or NO: it actually executes the transaction, just without committing it or rolling it back. This is not quite the same as the wedding example above. If you must have an analogy, think of it as a couple exchanging tokens of love: once a token has been given to the other party, it cannot be used for anything else. That is, once the transaction has been executed, the resources stay locked until a COMMIT or ROLLBACK is performed. This can cause blocking.
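The two phases described above can be sketched in a few lines of Python. This is only a minimal in-memory illustration: the `Participant` class, its `can_execute` flag, and the state names are assumptions made up for this sketch, not part of any real transaction manager.

```python
class Participant:
    """Hypothetical in-memory participant (a real one would wrap a database)."""

    def __init__(self, name, can_execute=True):
        self.name = name
        self.can_execute = can_execute
        self.locked = False   # True while the resource is held
        self.state = "INIT"   # INIT -> PREPARED -> COMMITTED / ROLLED_BACK

    def prepare(self):
        # Phase 1: execute the transaction without committing, and hold the lock.
        if not self.can_execute:
            self.state = "ROLLED_BACK"
            return "NO"
        self.locked = True
        self.state = "PREPARED"
        return "YES"

    def commit(self):
        self.state = "COMMITTED"
        self.locked = False

    def rollback(self):
        self.state = "ROLLED_BACK"
        self.locked = False


def two_phase_commit(participants):
    # Phase 1 (prepare): collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2 (commit): commit only if every vote is YES, otherwise roll back.
    if all(v == "YES" for v in votes):
        for p in participants:
            p.commit()
        return "COMMIT"
    for p in participants:
        p.rollback()
    return "ROLLBACK"
```

Note that a participant that has answered YES keeps `locked` set until phase 2 reaches it, which is the blocking behavior mentioned above.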

2PC problems

Let’s analyze the problems of 2PC.

We will not discuss synchronous blocking, the single point of failure, or split brain here; we will only discuss data consistency. Since 2PC is a distributed consistency protocol, we focus on the consistency problems it can have.

During the execution of 2PC, the coordinator or a participant may crash suddenly, and the crash may occur at different points in time.

Case one: The coordinator fails, but the participants do not

This case is relatively easy to solve: find a replacement coordinator. When it takes over, the new coordinator asks all participants how the last transaction went and then knows what to do next. Therefore, this case does not lead to data inconsistency.

Case two: A participant fails, but the coordinator does not

This case is also relatively easy to solve. If a participant fails, there are two scenarios:

  • The participant fails and never recovers. It simply stays down, which does not cause data consistency problems.

  • The participant fails and later recovers. If it has unfinished transaction operations at that point, it can cancel them directly and ask the coordinator what it should do now. The coordinator compares its own transaction execution record with the participant’s and tells the participant what to do to keep the data consistent.
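The recovery path in the second scenario can be sketched as follows. The function name, the state strings, and the `coordinator_log` dictionary are hypothetical; treating a missing record as a rollback is also an assumption of this sketch rather than something the protocol description above specifies.

```python
def recover_participant(local_state, coordinator_log, txn_id):
    """Return the state a restarted participant should end up in.

    local_state: the state the participant was in when it crashed
                 ("PREPARED", "COMMITTED" or "ROLLED_BACK").
    coordinator_log: hypothetical dict mapping txn_id -> "COMMIT" / "ROLLBACK".
    """
    if local_state in ("COMMITTED", "ROLLED_BACK"):
        # The transaction had already finished locally; nothing to redo.
        return local_state
    # Unfinished work: drop it and follow the coordinator's recorded decision.
    decision = coordinator_log.get(txn_id)
    if decision == "COMMIT":
        return "COMMITTED"
    # Assumption: no recorded decision is treated the same as ROLLBACK.
    return "ROLLED_BACK"
```

Because the coordinator is still alive in this case, the participant always has an authoritative source to ask, which is why no inconsistency arises.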

Case three: A participant fails, and so does the coordinator

This is the complicated case, so let’s go through it scenario by scenario.

  • The coordinator and a participant both fail in the first phase.

    • Since no COMMIT has been performed yet, the newly elected coordinator can query each participant and then decide whether to commit or roll back. Because nothing has been committed, this does not cause data consistency problems.
  • The coordinator and a participant both fail in the second phase, and the failed participant either did not receive the coordinator’s command before failing, or received it but had not yet committed or rolled back.

    • In this case, the newly elected coordinator again queries all participants. If any machine has executed abort (rollback) or returned NO in the first phase, it executes ROLLBACK. If no machine has executed abort but some machine has executed COMMIT, it executes COMMIT directly. When the failed participant recovers, it simply follows the coordinator’s instructions to commit or roll back the transaction. Since the failed machine had neither committed nor rolled back, it ends up performing the same operation as the new coordinator, so this scenario does not result in data inconsistency.
  • The coordinator and a participant both fail in the second phase, and the failed participant had already executed the command before failing. But because it is down, no one knows what it did.

    • In this case, the newly elected coordinator, in order to take over the coordinator’s duties, can only decide on commit or rollback in the same way as before. That keeps the new coordinator and all surviving participants consistent; let’s assume they performed a COMMIT. But the failed participant had already finished the transaction before it went down, so what happens when it recovers? If it also committed, that is fine. But what if it rolled back? Doesn’t that lead to data inconsistency? It can later communicate with the coordinator again by some means and try to bring its data back in line, but in the meantime its data is already inconsistent!
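The decision rule a replacement coordinator applies in the scenarios above can be written down as a small function. This is a sketch only: the state strings and the `"UNKNOWN"` return value are assumptions introduced here to make the third scenario visible.

```python
def decide_2pc(survivor_states):
    """Decision a replacement coordinator can infer from surviving participants.

    survivor_states: list of "PREPARED", "COMMITTED", "ROLLED_BACK" or "VOTED_NO".
    """
    # Anyone aborted or voted NO in phase 1: the transaction must roll back.
    if any(s in ("ROLLED_BACK", "VOTED_NO") for s in survivor_states):
        return "ROLLBACK"
    # Nobody aborted and someone committed: the old decision was COMMIT.
    if any(s == "COMMITTED" for s in survivor_states):
        return "COMMIT"
    # Every survivor is only PREPARED: the old decision cannot be inferred.
    return "UNKNOWN"


def consistent_after_recovery(decision, crashed_action):
    # True if the crashed participant's earlier action matches the new decision.
    return decision == crashed_action
```

When `decide_2pc` returns `"UNKNOWN"`, whichever way the new coordinator guesses, the crashed participant may already have done the opposite, so `consistent_after_recovery` can turn out false once it comes back.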

Therefore, in the 2PC protocol, if both the coordinator and a participant fail, data inconsistency may occur.

To solve this problem, 3PC was derived. Let’s look at how 3PC solves this problem.

Three-phase Commit Protocol (3PC)

The main problem 3PC sets out to solve is the one where the coordinator and a participant fail at the same time. To do so, 3PC splits the prepare phase of 2PC in two, giving three phases: CanCommit, PreCommit, and DoCommit. In the first phase, participants are only asked whether they can execute the transaction; nothing is executed in this phase. Once the coordinator has received YES from all participants, the transaction is executed in the second phase, followed by commit or rollback in the third phase.

Here is an everyday example that resembles a three-phase commit:
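As a rough sketch of the three phases, assuming a hypothetical in-memory `Participant3PC` class with an `available` flag (neither comes from a real library), the happy path and the abort path look like this:

```python
class Participant3PC:
    """Hypothetical in-memory participant for illustrating the three phases."""

    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self.state = "INIT"

    def can_commit(self):
        # Phase 1 (CanCommit): only answer; nothing is executed or locked.
        return self.available

    def pre_commit(self):
        # Phase 2 (PreCommit): execute the transaction, hold resources, ACK.
        self.state = "PRE_COMMIT"
        return True

    def do_commit(self):
        # Phase 3 (DoCommit): make the executed transaction permanent.
        self.state = "COMMITTED"

    def abort(self):
        self.state = "ABORTED"


def three_phase_commit(participants):
    # Phase 1: ask everyone; any NO aborts before anything has been executed.
    if not all(p.can_commit() for p in participants):
        for p in participants:
            p.abort()
        return "ABORT"
    # Phase 2: everyone executes the transaction and acknowledges.
    acks = [p.pre_commit() for p in participants]
    if not all(acks):
        for p in participants:
            p.abort()
        return "ABORT"
    # Phase 3: all ACKs received, so tell everyone to commit.
    for p in participants:
        p.do_commit()
    return "COMMIT"
```

The point of the extra phase is visible in `can_commit`: a NO answer is discovered before any participant has executed the transaction or locked anything.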

The class monitor wants to organize a reunion dinner for the whole class. Since everyone graduated many years ago, he has to phone people one by one to settle the date, which is tentatively set for October 1. So he starts making calls.

Monitor: Little A, we’d like to hold the reunion on October 1. Are you free? If you are, say YES; if not, say NO. I will then ask the others and let you know the exact time and place later. (The coordinator asks whether the transaction can be executed; this step does not lock any resource.)

Little A: OK, I have time. (The participant gives feedback.)

Monitor: Little B, we’d like to meet on October 1… Don’t keep waiting for me.

After collecting everyone’s availability and seeing that everyone is free, the monitor calls everyone again. (The coordinator has received YES from all participants.)

Monitor: Little A, we have decided to hold the dinner on October 1. Keep that day free and don’t schedule anything else. I will notify the other classmates one by one and come back to confirm with you afterwards. By the way, if I don’t call you again, just come to dinner on October 1 anyway. So, are you sure you can make it? (The coordinator sends the command to execute the transaction, which locks the resource. If the participant does not receive a later command from the coordinator because of network problems, it will commit anyway.)

Little A circles October 1 on his calendar and tells the monitor that he can go. (The participant executes the transaction and reports its status.)

Monitor: Little B, about the dinner on October 1… Just come to dinner on the 1st.

After the monitor has made this round of calls, all the classmates have told him, “I’ve already set October 1 aside.” So on October 1 he calls each of them again and says, “Hey, you can head out now…” (The coordinator has received an ACK from all participants and notifies them all to commit the transaction.)

Little A, Little B: I’m on my way out. (The participants commit and report their status.)

Why is 3PC better than 2PC?

Let’s directly analyze the case where both the coordinator and a participant fail.

  • The coordinator and a participant both fail in the second phase, and the failed participant had already executed the command before failing. But because it is down, no one knows what it did.

    • In this case, the newly elected coordinator again asks all participants about their status to decide whether to commit or roll back. Doesn’t this look the same as two-phase commit? How does it solve the consistency problem?

    • It looks like the same situation that caused data inconsistency in the second phase of 2PC, but a careful look at the states of all participants shows that it is not. Suppose the failed participant had performed the commit. What states can the surviving participants be in? They must be in either the PreCommit state or the Commit state, because in 3PC the coordinator only sends doCommit in phase 3 after every participant has agreed in phase 1 and acknowledged PreCommit in phase 2. As a result, the newly elected coordinator performs a COMMIT if any surviving participant is in the Commit or PreCommit state; otherwise, it performs a rollback. This way the failed participant is consistent with the other machines when it recovers. (For simplicity, I have left out the details of what the newly elected coordinator does. The real situation is more complicated than described here.)

In a nutshell, if the failed machine had performed a COMMIT, the new coordinator can work that out from the states of the surviving participants and perform a COMMIT. If the failed participant had performed a ROLLBACK, the coordinator and the other participants will also perform a ROLLBACK.
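The simplified recovery rule described above can be sketched as a single function. The state strings are assumptions made for illustration, and the rule follows this article’s simplification rather than a complete 3PC termination protocol.

```python
def decide_3pc(survivor_states):
    """Simplified decision of a newly elected 3PC coordinator.

    survivor_states: list of "CAN_COMMIT", "PRE_COMMIT", "COMMITTED" or "ABORTED".
    """
    # A survivor in PreCommit or Commit proves that phase 1 was unanimous,
    # so a crashed participant that acted can only have committed.
    if any(s in ("PRE_COMMIT", "COMMITTED") for s in survivor_states):
        return "COMMIT"
    # Otherwise no doCommit can have been sent, so rolling back is safe.
    return "ROLLBACK"
```

Unlike the 2PC case, there is no undecidable outcome here: the extra PreCommit state carries enough information for the survivors to reconstruct what the failed participant could have done.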

So, with the introduction of one more phase, 3PC solves the data consistency problem that occurs in 2PC when both the coordinator and the participant fail.

3PC problems

During the DoCommit phase, if a participant does not receive the doCommit or abort request from the coordinator in time, it goes ahead and commits the transaction once its wait times out.

So if, because of network problems, the abort request sent by the coordinator does not reach a participant in time, that participant commits after its timeout expires. Its data is then inconsistent with the other participants that received abort and rolled back.
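The timeout behavior and the inconsistency it produces can be reproduced with a tiny simulation. The function name and the use of `None` to model a message lost to the network are assumptions of this sketch.

```python
def outcome_in_do_commit_phase(message):
    """Final state of a participant that is waiting in the PreCommit state.

    message: "doCommit", "abort", or None if nothing arrived before the timeout.
    """
    if message == "abort":
        return "ROLLED_BACK"
    # Either doCommit arrived, or the wait timed out and the participant
    # commits by default.
    return "COMMITTED"


# The coordinator decides to abort, but the message to B is lost.
delivered = {"A": "abort", "B": None}
outcomes = {name: outcome_in_do_commit_phase(msg) for name, msg in delivered.items()}
inconsistent = len(set(outcomes.values())) > 1
```

Here A rolls back while B commits on timeout, so `inconsistent` ends up true: the default-commit rule that keeps 3PC from blocking is exactly what lets the participants diverge.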



Unless otherwise stated, the articles on this website are original; reproductions must credit the source. HollisChuang’s Blog » In-depth understanding of 2PC and 3PC distributed systems