This is the 27th day of my participation in the More Text Challenge.

An ad from a training class once claimed: believe it or not, there is nothing about you I don't know. Anyway, I decided to study this properly, so let's get started.


Boss: Here, go build a distributed transaction

Me: ... What is a transaction?

Me: Let’s start with the theory

I don't understand what a transaction is

Since I didn't understand transactions, let alone distributed transactions, I immediately started learning.

A transaction is a rigorous series of operations in an application that must either all complete successfully, or else every change made by any of the operations is undone.

Transactions should have four properties: atomicity, consistency, isolation, and durability. These four properties are commonly referred to as the ACID properties.

To put it more simply, a transaction is a group of operations (inserts, deletes, updates, queries) that either all succeed or all fail, never leaving a half-done, inconsistent result.
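To make the all-or-nothing behavior concrete, here is a minimal sketch using Python's built-in sqlite3 module. The account table and the simulated crash are invented for illustration:

```python
import sqlite3

# An in-memory database with a made-up account table for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on exception
        conn.execute("UPDATE account SET balance = balance - 30 WHERE name = 'alice'")
        raise RuntimeError("crash before the second update")
        conn.execute("UPDATE account SET balance = balance + 30 WHERE name = 'bob'")
except RuntimeError:
    pass

# The transaction rolled back, so neither update is visible.
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(balances)  # {'alice': 100, 'bob': 0}
```

Because the exception aborts the transaction before commit, the first UPDATE is undone as well; that is atomicity in action.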

I don't understand what a distributed transaction is

Now that we know what a transaction is, here come distributed transactions. Why do we need them?

A plain transaction is mostly a single-server, single-database concept. In a distributed transaction, the participants, the transaction-aware servers, the resource servers, and the transaction manager are located on different nodes of a distributed system.

Put simply, a distributed transaction maintains transactional properties across multiple local transactions, that is, it guarantees that their combined result is consistent.

XA specification

Given the distributed transaction scenario, a specification emerged for solving the problem: the XA specification. See Wikipedia for details.

The XA specification contributes two important ideas:

1. Introduce a global transaction control node, the transaction coordinator

2. Split the commit of multiple local transactions into multiple phases (the 2PC and 3PC described below)

I don't understand the distributed transaction solutions

Where there is a specification, there are implementations. Below are several implementation protocols based on the XA specification.

Let's start with two-phase commit and three-phase commit.

2PC (Two-Phase Commit)

Two-phase commit, as the name implies, is committed in two steps.

The first phase is called the prepare (or voting) phase. A transaction manager is introduced to coordinate the local resource managers, which are typically implemented by databases.

In the first phase, the transaction manager asks each resource manager whether it is ready: each one performs everything except the actual commit and returns its result to the transaction coordinator.

If every resource responds with success, the transaction is committed in phase two; if any resource responds with failure, the transaction is rolled back.
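The prepare/vote flow described above can be sketched as a toy coordinator. All class and method names here are invented for illustration; real resource managers are databases, not Python objects:

```python
class Participant:
    """A toy resource manager that votes in phase 1 and acts in phase 2."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def vote(self):
        # Phase 1: do all the work except the final commit, then vote.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled_back"


def two_phase_commit(participants):
    # Phase 1: ask every participant whether it is ready.
    if all(p.vote() for p in participants):
        # Phase 2: all votes were yes, so commit everywhere.
        for p in participants:
            p.commit()
        return "committed"
    # Any "no" vote aborts the whole transaction.
    for p in participants:
        p.rollback()
    return "rolled_back"


ok = two_phase_commit([Participant("db1"), Participant("db2")])
bad = two_phase_commit([Participant("db1"), Participant("db2", can_commit=False)])
print(ok, bad)  # committed rolled_back
```

Note that this sketch ignores exactly the failure cases analyzed below: there are no timeouts, and the coordinator itself never crashes.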

The process is similar to queuing up for a team game: when we form a party, the captain asks everyone to ready up, letting teammates hit the bathroom and grab food first. If all teammates confirm they are ready, the game starts; if any one of them has not eaten and does not confirm, the game does not start.

However, this protocol has some problems:

Synchronous blocking. This is the biggest problem of 2PC: under strict 2PC execution, all participating nodes block while in a transaction. When one participant holds a shared resource, third-party nodes that need that resource have to wait.

Solution: introduce a timeout mechanism that performs a default action if no response arrives within a time limit.

Coordinator single point of failure. The coordinator is the most important role in 2PC; if it fails, the whole process stalls.

Solution: the common fix for a single point of failure is to introduce replicas and re-elect a master when the master node fails. Just like in a team game: if the captain stays in the bathroom too long after everyone is ready, the game usually kicks him and promotes another player to captain.

Data inconsistency. Even with the fixes above, network jitter and call failures in a distributed system can still leave data inconsistent. Below is a detailed analysis covering coordinator failures, participant failures, network faults, and combinations of them:

1. The coordinator crashes before sending the prepare command

This is equivalent to the transaction never starting, so there is little impact

2. The coordinator crashes after sending the prepare command

In this case, if the participants have no timeout mechanism, their resources stay locked

3. The coordinator crashes before sending the commit command

Similar to the previous case, this can also leave resources locked

4. The coordinator crashes after sending the commit command

In this case the distributed transaction will most likely succeed, because reaching the commit phase means all the participants were ready; if a commit fails, it is simply retried

5. The coordinator crashes before sending the rollback command

Similar to cases 2 and 3: since the participants never receive a command to act on, without a timeout they keep blocking and holding resources

6. The coordinator crashes after sending the rollback command

Similar to case 4: the rollback will most likely succeed, and if it does not, the decision has already been made, so it is simply retried

7. After the coordinator sends the prepare command, some participants crash

In this case the coordinator's timeout fires, it declares the transaction failed, and it tells all participants to roll back

8. After the coordinator sends the prepare command, both the coordinator and some participants crash

After a new coordinator is elected, it finds the transaction still in the first phase; since the crashed participants never responded, it declares failure and tells the remaining participants to roll back

9. After the coordinator sends the commit or rollback command, the coordinator crashes, and the participant that received the message also crashes

After a new coordinator is elected, the participants that never received the message have not executed the decision, but the coordinator cannot tell whether the crashed participant committed or rolled back in the second phase, so the transaction becomes inconsistent

3PC (Three-Phase Commit)

From the content above we can broadly see 2PC's shortcomings and their partial fixes, which leads to the next protocol: three-phase commit.

The encyclopedia entry explains that 3PC mainly addresses the 2PC shortcomings we just described. How does it do so?

1. 3PC is a non-blocking protocol

This solves the resource-holding problem, mainly by introducing a timeout mechanism on the participant side

2. A pre-commit phase is inserted between the first and second phases

In two-phase commit, a crash or error of the coordinator leaves participants in an "uncertain state" after voting: they cannot know whether to commit or roll back. The extra phase ensures the state of all participating nodes is consistent before the final commit phase.

3PC splits the first phase of 2PC into two. The extra phase confirms that the participants are healthy before the transaction executes, preventing the other participants from locking resources when an individual participant is abnormal.

Its general flow can be described by the four states a participant moves through:

0. Initial state: the transaction initiator triggers the global transaction, each participant switches its local state to started and registers itself with the coordinator.

1. Can-commit (waiting) state: the coordinator sends each registered participant a command asking whether it can commit, and waits for their answers.

2. Pre-commit state: once the coordinator has received every participant's confirmation that it can commit, it sends them a pre-commit message; the participants lock resources and switch to the pre-commit state, and the coordinator enters the pre-commit state as well.

3. Commit state: the coordinator performs the commit or rollback according to the participants' pre-commit results, and resources are then released.
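The phase progression above can be sketched as a toy function. The phase names and the participant dictionaries are invented for illustration, and real 3PC also needs the participant-side timeouts discussed earlier:

```python
def three_phase_commit(participants):
    # Phase 1 (can-commit): only probe whether everyone is alive and able;
    # nothing is locked yet, so aborting here is cheap.
    if not all(p["alive"] for p in participants):
        return "aborted"
    # Phase 2 (pre-commit): every participant locks resources and records
    # that a commit is coming.
    for p in participants:
        p["state"] = "pre_committed"
    # Phase 3 (do-commit): finish the transaction and release resources.
    # A participant that times out waiting here would commit by default.
    for p in participants:
        p["state"] = "committed"
    return "committed"


nodes = [{"alive": True, "state": "init"}, {"alive": True, "state": "init"}]
print(three_phase_commit(nodes))  # committed
```

The value of the extra phase is visible in the abort path: a participant that fails the can-commit probe causes an abort before anyone has locked anything.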

In this way, some of 2PC's state inconsistency issues are resolved. A summary (based on the JBoss documentation):

By introducing a pre-commit phase, the coordinator can determine the participants' state before the commit, and each participant can infer the state of the others

Normally, the coordinator decides whether to commit or roll back based on the results of the participants' state switches; the extra pre-commit phase exists to unify that state.

If a participant receives no message from the coordinator, it commits by default after the timeout, although this may cause data inconsistency.

After a coordinator failure and re-election, the new coordinator decides whether to commit or roll back according to the participants' states and the original master's state.

If the new coordinator takes over and finds the transaction in the waiting state with participants in waiting or rolled-back states, a rollback vote has occurred, so it issues the rollback command

If the new coordinator takes over and finds the transaction in the pre-commit state with participants in pre-commit or committed states, all the participants have been confirmed, so it issues the commit command

As you can see, the extra phase makes 3PC slower, and it still does not truly solve data consistency: the additional phase cannot guarantee a better outcome than 2PC, so 3PC is rarely used in practice.

TCC (Try-Confirm-Cancel)

The 2PC/3PC pattern relies on a relational database that supports local ACID transactions:

  • One-phase prepare behavior: in the local transaction, commit the business data update together with the corresponding rollback log record.
  • Two-phase commit behavior: the rollback logs are deleted automatically and in batches.
  • Two-phase rollback behavior: compensation operations are generated automatically from the rollback logs to roll the data back.

Correspondingly, the TCC mode is handled at the business level and does not depend on transaction support from the underlying data resources:

  • One-phase prepare behavior: calls custom prepare logic.
  • Two-phase commit behavior: calls custom commit logic.
  • Two-phase rollback behavior: calls custom rollback logic.

The TCC pattern supports integrating custom branch transactions into global transaction management. It can be independent of the local database, though implementations may also rely on it; most real scenarios combine the two.

TCC essentially lets the business code implement the two-phase commit idea itself, which is highly intrusive to the business

The Try phase is defined to lock (reserve) resources. In my view, this phase is the hardest to implement

A transfer scenario may require you to check whether the account balance is sufficient, subtract the transfer amount, and store it in a temporary field to lock that amount

A caching scenario may require distributed locks, either locking the cache value to be operated on or copying one cache entry into another cache

An upload or download scenario may require saving files to a temporary directory on the server

The Confirm phase executes using the resources locked in the Try phase: once Try succeeds, the operation can proceed, such as the actual transfer, the cache operation, the upload or download, and so on.

The Cancel phase releases the resources reserved by Try: because Try failed, a compensating operation or environment restoration is performed, such as deleting the temporary transfer field, releasing locks, cleaning up temporary files, and so on.
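The transfer example above can be sketched in TCC style. The Account class and its frozen field are assumptions made for illustration:

```python
class Account:
    """A toy account where Try freezes money, Confirm spends it, Cancel restores it."""

    def __init__(self, balance):
        self.balance = balance
        self.frozen = 0  # the temporary field that locks the reserved amount

    def try_debit(self, amount):
        # Try: check the balance is sufficient, then reserve the amount.
        if self.balance < amount:
            return False
        self.balance -= amount
        self.frozen += amount
        return True

    def confirm_debit(self, amount):
        # Confirm: the reservation succeeded, so the frozen amount is spent.
        self.frozen -= amount

    def cancel_debit(self, amount):
        # Cancel: compensate by giving the frozen amount back.
        self.frozen -= amount
        self.balance += amount


acct = Account(balance=100)
assert acct.try_debit(30)         # Try: 30 is frozen
acct.cancel_debit(30)             # Cancel: some other Try failed, restore it
print(acct.balance, acct.frozen)  # 100 0
```

Note how the global coordinator never touches the database directly: it only calls the three business-defined methods, which is exactly where TCC's intrusiveness comes from.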

TCC is difficult to implement: there are many abnormal scenarios to consider, including how to lock and release resources. But because it does not block resources, it is widely used; reportedly many companies remain keen on this kind of compensating transaction.

In addition, the TCC described here is more of an idea than a recipe; real implementations still need to be adapted to the specific business. The method is fixed, but people are flexible.

SAGA

Theoretical basis: the paper Sagas, published by Hector Garcia-Molina & Kenneth Salem in 1987

The Saga pattern provides a long transaction solution. In the Saga pattern, each participant in the business process commits a local transaction, and when one of the participants fails, the previous successful participants are compensated. Both the one-phase forward service and the two-phase compensation service are implemented by business development.

Application scenario:

  • The business process is long and has many steps
  • Participants include services from other companies or legacy systems that cannot provide the three interfaces required by the TCC pattern

The main idea behind Saga is state-machine transitions: a long transaction is broken into multiple short transactions that are executed in turn

If a short transaction fails, the compensation transaction is executed in the reverse order of the previous execution
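A minimal Saga executor under these rules might look like this. The step and compensation functions are stand-ins invented for illustration:

```python
def run_saga(steps):
    """steps: list of (action, compensation) pairs; each action returns True on success."""
    done = []
    for action, compensation in steps:
        if action():
            done.append(compensation)
        else:
            # A short transaction failed: compensate everything that already
            # succeeded, newest first (reverse order of execution).
            for comp in reversed(done):
                comp()
            return "compensated"
    return "completed"


log = []
steps = [
    (lambda: log.append("book flight") or True, lambda: log.append("cancel flight")),
    (lambda: log.append("book hotel") or True,  lambda: log.append("cancel hotel")),
    (lambda: False,                             lambda: log.append("never runs")),
]
print(run_saga(steps), log)
# compensated ['book flight', 'book hotel', 'cancel hotel', 'cancel flight']
```

The log shows the defining Saga property: compensations run in the reverse order of the forward steps.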

This pattern is rarely used; the implementation is complicated and the process very long. When encountering such scenarios, carefully consider whether a distributed transaction is truly necessary.

Local message table

When the business executes, the business operation and the insertion of a message into the message table are placed in the same local transaction, so a successful business execution guarantees the message has made it into the local table.

The next service can then be invoked, and if it succeeds, the message status in the message table can be changed to succeeded.

If the invocation fails, a background task periodically reads the local message table, picks out the unsuccessful messages, and re-invokes the corresponding service; once the service call succeeds, the message status is updated.

If the number of retries exceeds the limit, rollback is performed or manual intervention is notified.
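The background retry task described above can be sketched as follows. The table layout, MAX_RETRIES, and the call_service parameter are assumptions for illustration:

```python
MAX_RETRIES = 3  # assumed retry limit

def retry_pending(messages, call_service):
    """Scan the local message table and re-invoke the downstream service."""
    for msg in messages:
        if msg["status"] != "pending":
            continue  # only unsuccessful messages are retried
        if call_service(msg):
            msg["status"] = "succeeded"
        else:
            msg["retries"] += 1
            if msg["retries"] >= MAX_RETRIES:
                # Beyond the limit: flag for rollback or manual intervention.
                msg["status"] = "failed"


table = [{"id": 1, "status": "pending", "retries": 2}]
retry_pending(table, call_service=lambda m: False)  # downstream keeps failing
print(table[0]["status"])  # failed
```

In a real system the table lives in the same database as the business data, and the scan runs on a timer; this sketch only shows the state transitions.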

As you can see, the local message table can also suffer data inconsistency; it only tries to guarantee eventual consistency.

The message queue

The idea behind this scenario is to implement distributed transactions through message queues that support transactions.

Main process:

  1. The producer sends a half (semi-transactional) message to MQ
  2. After receiving MQ's success acknowledgement, the producer executes its local transaction, but the MQ transaction has not yet been committed
  3. Based on the result of the local transaction, the producer sends a commit or rollback message
  4. The producer must provide an interface for querying the transaction status; if MQ receives no commit or rollback within a period of time, it calls this interface to obtain the result of the producer's local transaction
  5. If the result is failure, MQ simply discards the half message and the consumer is never affected
  6. If the result is success, the half message becomes visible and the consumer consumes it like an ordinary message

This scheme differs from the local message table in that the table is removed and the local transaction is bound to an MQ transaction. The best-known implementation currently on the market is Alibaba's RocketMQ
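The half-message flow above can be modeled as a toy in-memory queue. Every class and method here is invented to mirror the numbered steps; it is not the real RocketMQ API:

```python
class TransactionalMQ:
    """A toy broker: half messages stay invisible until the producer commits."""

    def __init__(self):
        self.half = {}     # half messages, invisible to consumers
        self.visible = []  # committed messages, visible to consumers

    def send_half(self, msg_id, payload):
        self.half[msg_id] = payload                  # step 1: half message stored

    def commit(self, msg_id):
        self.visible.append(self.half.pop(msg_id))   # step 3: success, deliver it

    def rollback(self, msg_id):
        self.half.pop(msg_id)                        # step 3: failure, discard it


mq = TransactionalMQ()
mq.send_half("tx1", "order created")
# step 2: the producer runs its local transaction here; suppose it succeeds...
mq.commit("tx1")
print(mq.visible)  # ['order created']
```

The status-check callback of step 4 is omitted here; it would simply call commit or rollback on the broker's behalf when the producer's decision message is lost.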

Best-effort notification

As the name suggests, for this one please make your best effort to study it on your own

I don’t know how to do that

Even after learning so many schemes, implementing one yourself is still very difficult.

Common framework implementations include ByteTCC, the DTM solution from Huawei ServiceComb (available on the Huawei Cloud website), Alibaba's Seata (whose paid version is GTS), and Tencent's DTF

At present the most popular open-source option is still Seata: it supports more modes and its official documentation is detailed, so I won't describe it here

Many articles have been written about Seata, and the next article will try to put distributed transactions into practice with the Seata framework.

Is Seata perfect? Of course not. The following points may be improved in the future:

1. No console support and no visual interface; verification is done entirely by printing logs and connecting to the database

2. Seata-server high availability does not support the Raft protocol; transaction information depends entirely on external storage such as DB or Redis

3. The undo log takes up too much space, especially when large JSON fields are mirrored before and after the change; with large data volumes, writes may be slow and compression may be needed

4. Rollback can only be triggered through exceptions; rollback via a rollback-only flag, as in Spring, is not supported

5. Is the granularity of the global lock a little too coarse? Is it really necessary to report branch transaction status to the TC?

I found a talk shared by Seata open-source author slievrly (Ji Min) to learn from together

If you want the slides used in the video, just reply "Seata" in the official account.

I’ve learned

This article was written following the learning path of someone who had never touched transactions before. The mind map is as follows:

On the left are the basics and on the right are the solutions; I hope it helps if you are also learning distributed transactions.

This article is the first in a series: one will cover Seata in action, and one will cover Seata's principles and how to design a general distributed transaction framework. Thanks for reading, and you are welcome to follow along.