1. Background
In the era of traditional monolithic systems, one system operated one database. A single request, or multiple operations on the same data within a single thread, could be handled by opening a local transaction and relying on the transaction guarantees the database provides. However, as business grew rapidly, a single machine could no longer carry the load, so the common remedies became: splitting the database, choosing multiple data stores according to storage characteristics, and splitting the system into a microservice architecture. With this evolution, the local transactions that a monolithic system relied on no longer suffice in a distributed environment, because the data operations behind one request may span multiple services, multiple remote calls, and even multiple data stores. Therefore, in a microservice architecture, if a core business flow needs transaction management for a single request, a distributed transaction technology that fits both the architecture and the business must be introduced.

This article is based on my own research and reflection on various distributed transaction solutions while implementing distributed transaction management for specific business in a project. Since I found no mature open-source implementation of three-phase commit (3PC), it is not covered here. I also suggest gaining a basic understanding of distributed-systems theory, such as CAP and BASE, before digging into the individual solutions, so that they are easier to digest.
2. Common solutions
There are three common technical scenarios that give rise to distributed transactions:
- Operating different database instances across microservices
- A monolithic system spanning multiple database instances
- Multiple services operating on the same database instance
2.1 2PC (two-phase commit)
The two-phase commit protocol divides the whole transaction process into two phases: a prepare phase (P) and a commit phase (C). It involves two roles: the transaction manager, which decides whether the entire distributed transaction commits or rolls back, and the transaction participants, which commit or roll back their own local transactions. 2PC prescribes the following flow:
- Prepare phase: the transaction manager sends a Prepare message to each participant. Each participant executes its local transaction, locks resources, and writes local undo/redo logs, but does not commit.
- Commit phase: if the transaction manager receives a failure or timeout from any participant, it tells every participant to roll back its local transaction; otherwise it tells all participants to commit. In either case, the resource locks taken in the prepare phase are released.
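The two phases can be sketched as a minimal in-memory simulation; the `Participant` and `TransactionManager` classes here are illustrative, not a real database driver:

```python
class Participant:
    def __init__(self, name, will_succeed=True):
        self.name = name
        self.will_succeed = will_succeed
        self.state = "init"

    def prepare(self):
        # Execute the local transaction, lock resources, write undo/redo
        # logs -- but do NOT commit yet. Return the vote.
        self.state = "prepared" if self.will_succeed else "failed"
        return self.will_succeed

    def commit(self):
        self.state = "committed"    # commit local transaction, release locks

    def rollback(self):
        self.state = "rolled_back"  # undo local changes, release locks


class TransactionManager:
    def run(self, participants):
        # Phase 1: collect votes from all participants.
        votes = [p.prepare() for p in participants]
        # Phase 2: commit only on unanimous success, otherwise roll back.
        if all(votes):
            for p in participants:
                p.commit()
            return "committed"
        for p in participants:
            p.rollback()
        return "rolled_back"
```

A single "no" vote in the prepare phase forces every participant to roll back, which is exactly the all-or-nothing decision rule described above.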
Currently, Oracle and MySQL support the 2PC protocol. Against this protocol specification, the popular concrete implementations are the XA scheme and the AT mode provided by Seata, Alibaba's open-source distributed transaction framework.
2.1.1 XA scheme
To provide a unified interfacing model and standard, the X/Open organization proposed the Distributed Transaction Processing (DTP) reference model, which defines three roles in a distributed transaction:
- AP: application program
- TM: transaction manager, responsible for coordinating and managing the entire distributed transaction
- RM: resource manager, also known as the transaction participant, responsible for controlling a branch transaction
The DTP model also defines the interface specification for communication between TM and RM: this is XA, which is also the 2PC interface protocol exposed by databases. The transaction flow of the XA scheme is as follows:
- AP holds D1 and D2 data sources.
- Through the TM, the AP notifies the RMs of D1 and D2 to execute their branch transactions; the data involved in D1 and D2 is locked, and the RMs do not commit.
- After the TM receives the replies from both RMs, if either of them failed, the TM issues a rollback instruction to each RM, and each RM rolls back its branch transaction and releases its resource locks.
- If both succeeded, the TM issues a commit instruction to all RMs, and each RM commits its branch transaction and releases its resource locks.
The XA scheme is essentially a database-level distributed transaction: it requires the databases to support XA and achieves strong consistency. Throughout the transaction, from the prepare phase to the second-phase commit or rollback, the locks on the affected data are held the whole time. Any other transaction that wants to modify that data must wait for the locks to be released, which creates the risk of long transactions.
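The long-lock risk can be illustrated with a toy simulation, where a single `threading.Lock` stands in for a row lock inside one RM (purely illustrative, not an XA driver):

```python
import threading
import time

lock = threading.Lock()   # stands in for a row lock held by one RM
timeline = []

def xa_branch():
    with lock:                    # lock taken in the prepare phase
        timeline.append("prepare: lock held")
        time.sleep(0.2)           # waiting for the TM's global decision...
        timeline.append("commit: lock released")

def concurrent_update():
    time.sleep(0.05)              # arrives while the XA branch is prepared
    start = time.time()
    with lock:                    # blocks until the XA branch finishes
        timeline.append(f"update waited {time.time() - start:.2f}s")

t1 = threading.Thread(target=xa_branch)
t2 = threading.Thread(target=concurrent_update)
t1.start(); t2.start()
t1.join(); t2.join()
print(timeline)
```

The concurrent update cannot touch the row for the entire span between prepare and the global commit, which is the waiting behavior the paragraph above warns about.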
2.1.2 Seata's AT scheme
Seata is an open-source distributed transaction solution dedicated to providing high-performance, easy-to-use distributed transaction services, offering AT, TCC, SAGA, and XA transaction modes. The focus here is the AT mode. Seata's AT mode is essentially an optimized evolution of the 2PC protocol: it is middleware that works at the Java application layer and drives global transactions to completion by coordinating branch transactions on relational databases that support local ACID transactions. Seata divides the participating roles of a distributed transaction into the following three components:
- TC: transaction coordinator. Deployed and run as independent middleware, it maintains the running state of global transactions, receives TM instructions to commit or roll back a global transaction, and communicates with the RMs to coordinate the commit and rollback of branch transactions.
- TM: transaction manager. Embedded in the application as a JAR dependency, it starts the global transaction and ultimately issues the global commit or rollback instruction to the TC.
- RM: branch transaction controller. Also embedded in the application, it handles branch registration, status reporting, and receiving TC instructions, controlling the commit and rollback of branch transactions.
In Seata mode, the overall transaction flow is as follows:
- The TM applies to the TC to start a global transaction; the global transaction is created and a globally unique XID is generated.
- During remote invocation, the XID is propagated along the call chain in the invocation context, and each RM registers its branch transaction with the TC, bringing it under the global transaction identified by the XID.
- After registration, the RM executes the branch transaction, inserts a rollback (undo) log record in the same local transaction, and commits immediately, releasing the local resource locks.
- The TM sends the TC a global commit or rollback decision for the XID.
- The TC schedules all branch transactions under the XID to commit or roll back (rollback is the reverse operation driven by the undo logs).
Seata's AT mode evolves the 2PC protocol in two main ways. First, branch transactions commit in the prepare phase, which shortens the time resource locks are held; to keep atomicity, the business operation and the rollback (undo) log are written in the same local transaction. Second, the RM is moved up to the application layer. To solve the write-isolation problem created by committing early and rolling back via reverse operations, Seata introduces a global lock for the whole distributed transaction: before a local transaction commits in the first phase, it must obtain the global lock on the affected resources; without the global lock, the local transaction cannot be committed. If acquiring the global lock keeps failing within a certain limit, the branch gives up rather than blocking indefinitely, rolls back its local transaction, and reports that the branch failed in the first phase. For first-phase read-isolation problems, the default global isolation level of Seata's AT mode is read uncommitted. If read committed is required, add FOR UPDATE to the SELECT statement: Seata proxies SELECT FOR UPDATE and obtains the global lock on the resource before executing the query, so by the time the lock is acquired and the read is performed, it can only see data committed by other transactions.
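A rough sketch of the phase-one behavior under AT mode follows: acquire the global lock, then write the business change and its undo-log record in one local transaction, committing immediately. All names are illustrative, not Seata's actual implementation:

```python
class GlobalLockConflict(Exception):
    pass

global_locks = {}   # resource key -> XID of the holder (tracked by the TC)

def branch_execute(xid, key, new_value, db):
    # Phase one: try to take the global lock; on conflict the branch gives
    # up (and would roll back its local transaction).
    holder = global_locks.get(key)
    if holder is not None and holder != xid:
        raise GlobalLockConflict(f"{key} is locked by {holder}")
    global_locks[key] = xid
    # Business update + undo-log record in ONE local transaction, committed
    # right away -- this is what releases the local row lock early.
    before = db.get(key)
    db[key] = new_value
    db.setdefault("undo_log", []).append(
        {"xid": xid, "key": key, "before": before})

def global_rollback(xid, db):
    # Phase two rollback: replay this XID's undo logs in reverse, then
    # release its global locks.
    entries = [e for e in db.get("undo_log", []) if e["xid"] == xid]
    for entry in reversed(entries):
        db[entry["key"]] = entry["before"]
    for key in [k for k, v in global_locks.items() if v == xid]:
        del global_locks[key]
```

The global lock is what prevents a second global transaction from overwriting data that a first one may still need to reverse-compensate.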
2.2 TCC (compensation type)
TCC compensating distributed transactions are a pure business-layer solution. Each branch transaction must implement three operations: Try (pre-processing/resource reservation), Confirm, and Cancel. In essence, these three operations still form a two-phase protocol: Try is the first phase, and exactly one of Confirm or Cancel is executed in the second phase. TCC is called compensating precisely because it is a pure business-layer solution: it does not depend on the underlying data store supporting transactions, and it achieves the atomicity of the distributed transaction as a whole through the compensating logic of the Cancel operation. The general TCC transaction flow is as follows:
- The TM initiates a global transaction and all branch transactions perform a try operation
- When all try operations succeed, TM initiates confirm operations for all branch transactions
- In the event of a try failure, TM initiates cancels for all branch transactions
The TCC scheme assumes by default that Confirm and Cancel operations do not fail. In practice this cannot be guaranteed 100%, so a proper transaction recovery mechanism is generally required: in essence, a Confirm or Cancel that fails is retried according to some policy until it succeeds, and manual intervention is required if it never does. In a real distributed scenario, three problems must be handled to actually implement TCC correctly:
- Empty rollback: a second-phase Cancel is invoked even though Try was never executed, leading to inconsistent data. This typically happens when the service hosting the branch transaction is down and Try fails; the whole transaction rolls back, and by then the service has recovered, so Cancel executes normally.
- Idempotency: because TCC retries Confirm and Cancel, non-idempotent implementations risk inconsistent data.
- Suspension: in some abnormal scenarios Cancel executes before Try, so the resources later reserved by Try can never be handled. This typically happens when the RPC call to Try times out due to network congestion, the caller throws an exception and the transaction rolls back, and the branch's Cancel ends up executing before its Try arrives.
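All three pitfalls can be guarded against by recording each branch's phase transitions, as in this hypothetical sketch (the state machine and method names are assumptions for illustration):

```python
class TccBranch:
    def __init__(self):
        self.state = "init"   # init -> tried -> confirmed | cancelled

    def try_(self):
        if self.state == "cancelled":
            # Anti-suspension: Cancel already ran, so refuse the late Try.
            return False
        self.state = "tried"
        return True

    def confirm(self):
        if self.state == "confirmed":
            return "ok"       # idempotent retry: already done
        assert self.state == "tried"
        self.state = "confirmed"
        return "ok"

    def cancel(self):
        if self.state == "cancelled":
            return "ok"       # idempotent retry: already done
        if self.state == "init":
            # Empty rollback: Try never ran; just record "cancelled" so a
            # late-arriving Try will be rejected, and compensate nothing.
            self.state = "cancelled"
            return "empty-rollback"
        self.state = "cancelled"
        return "ok"
```

In a real system the state would live in a branch transaction table so that retries and late calls can consult it across processes.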
2.2.1 tcc-transaction
tcc-transaction is an open-source TCC compensating distributed transaction framework (TCC stands for Try, Confirm, Cancel). It is based on AOP: it enhances the transaction method, parses the confirm and cancel methods configured in its transaction annotation to generate the corresponding context information, and carries out the TCC compensating distributed transaction under the management of the TM. It also provides a transaction recovery mechanism based on scheduled tasks, with retry policies to retry transactions in abnormal scenarios. In this framework, the transaction initiator plays the TM role, managing and coordinating all TCC transactions initiated by its service. For each transaction method, developers must implement the corresponding confirm and cancel logic, so it is advisable to design distributed transactions from the perspective of the whole business flow rather than thinking purely in terms of local transactions.
tcc-transaction has two core interceptors, each tied to a corresponding aspect class: CompensableTransactionInterceptor and ResourceCoordinatorInterceptor, both of which enhance transaction methods with around advice. The former adds transaction handling to the transaction method, including transaction start, rollback, confirmation, and cleanup of transaction resources. The latter registers branch transactions. In microservice scenarios it supports Spring Cloud and Dubbo; the core mechanism is propagating the transaction context along the whole call chain through implicit parameter passing, thereby associating branch transactions with the global transaction. The actual flow of a tcc-transaction is as follows:
- The transaction initiator initiates the global transaction and performs the try operation of all branch transactions in sequence.
- Before the try operation is executed, each branch transaction is started, the global transaction ID is obtained from the transaction context, the branch transaction registration is completed, and each branch transaction is recorded in the corresponding branch transaction table.
- After the try phase, if all branches succeeded, the transaction initiator triggers the commit of the distributed transaction; the transaction manager fetches the current transaction, traverses all its participants, and invokes confirm on each in turn.
- If any try fails, the exception is caught and the transaction initiator triggers the rollback of the distributed transaction, invoking cancel on each participant in the same way as the confirm commit.
- After the entire transaction operation is completed, the transaction initiator clears and releases related transaction resources.
tcc-transaction does not rely on local transactions in any of the three operations. However, in a microservice scenario, whether confirming or rolling back, the number of RPC or HTTP requests is at least double that of the original logic. In addition, because every participating service maintains branch transaction records in storage, the performance impact is significant, making it unsuitable for large transactions with complex operations.
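The initiator-side flow above can be sketched as a simple in-process coordinator in the spirit of tcc-transaction's TM role (illustrative only; the function and dict shapes here are not the framework's API):

```python
def run_global_transaction(branches):
    """Each branch is a dict of "try"/"confirm"/"cancel" callables."""
    registered = []
    try:
        for branch in branches:
            registered.append(branch)   # branch registration
            branch["try"]()             # phase one: try in sequence
    except Exception:
        # Any try failure: traverse the registered participants and cancel.
        for branch in registered:
            branch["cancel"]()
        return "cancelled"
    # All tries succeeded: traverse the participants and confirm in turn.
    for branch in registered:
        branch["confirm"]()
    return "confirmed"
```

Note how coordination lives entirely inside the initiating process, which is exactly why the initiator's availability determines the transaction's fate, as discussed in the comparison section.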
2.3 Final consistency based on reliable messages
The reliable-message final-consistency scheme pursues eventual consistency across branch transactions. The whole distributed transaction is divided into a main transaction and branch transactions, decoupled through message middleware. The transaction initiator sends a message after executing its local transaction, and the scheme assumes the transaction participants will receive and successfully process that message. The core idea is that as long as the message reaches the transaction participants, consistency is eventually achieved; there is no scenario in which the whole transaction rolls back because a sub-transaction failed. The scheme requires the participants to succeed, with manual compensation as the last resort when necessary. It must address the following issues:
- Atomicity of the local transaction and message sending: the initiator must send the message only if the local transaction commits successfully; if the local transaction rolls back, the message must be discarded.
- Reliability of message reception by participants: transaction participants must be able to receive messages from the message queue, and receive them again if processing fails.
- Duplicate consumption: the consumption methods of the transaction participants must be idempotent.
2.3.1 Local message table scheme
Consistency between the business operation and the message is guaranteed by a local transaction; the message is then delivered to the message middleware by a scheduled task and deleted once the middleware confirms successful delivery. The whole transaction flow is as follows:
- The local transaction of the transaction initiator includes its own business operation and adding transaction message log. The atomicity is guaranteed by the local transaction.
- The transaction initiator scans the transaction message table with a scheduled task, sends any unsent messages, and deletes the corresponding message log records after the middleware confirms successful delivery, ensuring reliable delivery on the initiator's side.
- Transaction participants monitor message queues. Based on message acknowledgement (ACK) mechanism of message queues, participants send ACK to message middleware after receiving messages and completing business processing. At this time, the message middleware clears the messages and does not send any more messages to ensure the reliability of participants receiving messages.
- The consumption method of transaction participants realizes idempotency to solve the problem of repeated consumption of messages.
2.3.2 RocketMQ transaction message scheme
RocketMQ has supported transaction messages, also known as half (prepare) messages, since version 4.3. Such a message is not delivered to consumers immediately after being sent to MQ; instead, the sender performs a second confirmation, and RocketMQ decides whether to deliver or discard the message based on that confirmation. The transaction flow based on this scheme is as follows:
- The transaction initiator sends a prepare message to RocketMQ. RocketMQ stores the message and reports that the message is received successfully. The prepare message cannot be consumed by consumers.
- After receiving RocketMQ's acknowledgement, the initiator executes its local transaction. If the local transaction succeeds, the initiator sends a commit instruction for the prepare message to RocketMQ; otherwise it sends a rollback instruction.
- After RocketMQ receives the initiator's second acknowledgement of the prepare message, it performs the corresponding operation: commit moves the message into the actual consumption queue, and rollback deletes it, thus ensuring atomicity of the local transaction and message sending.
- For the case where the initiator's second confirmation never arrives, RocketMQ provides a back-check mechanism: a scheduled task queries the initiator for the transaction status. By default, the message is discarded after a maximum of 15 back-checks.
- The transaction participant listens to RocketMQ. Based on the message acknowledgement (ACK) mechanism, the participant sends an ACK to RocketMQ after receiving the message and completing the service processing to ensure the reliability of receiving the message.
- The consumption method of transaction participants realizes idempotency to solve the problem of repeated consumption of messages.
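The flow can be simulated with a toy broker; this illustrates the mechanism only and is not the RocketMQ client API:

```python
class Broker:
    def __init__(self):
        self.half = {}    # msg_id -> payload, invisible to consumers
        self.queue = []   # messages visible to consumers

    def send_half(self, msg_id, payload):
        # Step 1: store the prepare (half) message; not yet consumable.
        self.half[msg_id] = payload
        return "stored"

    def confirm(self, msg_id, action):
        # Step 2: second confirmation from the sender decides its fate.
        payload = self.half.pop(msg_id, None)
        if payload is not None and action == "commit":
            self.queue.append(payload)   # deliver to consumers
        # "rollback" (or an unknown id) simply discards the half message

    def check_back(self, msg_id, query_local_tx_state):
        # Back-check: if no second confirmation arrived, ask the sender
        # for its local transaction state and act on the answer.
        if msg_id in self.half:
            self.confirm(msg_id, query_local_tx_state(msg_id))
```

The half-message store plus the back-check callback is what makes the local transaction and the message send effectively atomic without a local message table.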
3. Comparison and selection
The above reflects my own considerations based on an actual project. Personally, if nothing rules it out, I would prefer Seata: it is barely intrusive to business code, and read/write isolation is handled for you. RocketMQ transaction messages should be considered for business scenarios where only the mainline logic matters and everything else can be asynchronous. In terms of performance, I personally find RocketMQ the best; after all, it only adds one extra confirmation step. tcc-transaction is similar to Seata in that both require roughly twice the number of remote calls, although the actual count depends on the implementation; in my own test of a small module integrated in the project environment, there was no significant difference.

Regarding high availability of the transaction coordinator role, Seata can deploy multiple TC nodes independently and share transaction data through a database, whereas in tcc-transaction, availability is determined by the transaction-initiating service itself, because transaction coordination is managed inside the process: if the initiator dies right after the try operations of all branch transactions have executed, all the data involved in that transaction requires manual intervention. In general, avoid distributed transactions when you can, and weigh whether the performance sacrifice is worthwhile; if they must be used, analyze the specific business scenario concretely, and do not design distributed transactions with a mindset based purely on local transactions.