Last week, Distributed transactions Fescar announced a brand upgrade:
Thanks, Fescar ❤ ️,
Hello, Seata 🚀.
Seata, which stands for Simple Extensible Autonomous Transaction Architecture, is a one-stop distributed Transaction solution.
Project address: github.com/seata/seata
Ant Financial added TCC mode in Seata 0.4.0 and will continue to do so.
In order to help you understand, Shao Hui, the head of distributed transaction open source, made an offline sharing, detailing the development of distributed transaction in Ant Financial, hoping to help you understand distributed transaction. The following is the text version of the sharing.
preface
Today’s share will start with the following three parts: the background of distributed transaction issues, Ant Financial distributed transaction Roadmap and Seata distributed transaction Roadmap.
Sharing guest: Shao Hui Ant Financial distributed transaction open source principal
1. The background of distributed transaction problems
1.1. Horizontal split of database
In the early days of Ant Financial, the business volume was relatively small, so a single database and single table could meet the business needs. But with the development of business, single database single table database gradually become the bottleneck. In order to solve the bottleneck problem of database, we split the database horizontally. One of the problems with splitting is that writes that used to be done on a single database now span multiple databases, creating cross-library transaction problems.
1.2. Service separation of business
Ant Financial had a single-system architecture in the early stage, with almost all business services in a few apps. With the development of services, services become more and more complex and the degree of coupling between services becomes higher and higher. Therefore, we reconstruct the system and decouple and split services vertically according to functions. The problem with this split is that a business activity that used to need to invoke only one service now needs to invoke multiple services, resulting in cross-service transaction problems.
1.3 Transfer cases illustrate data consistency problems
The problem with water splitting of databases and vertical splitting of services is that a business activity usually calls multiple services and accesses multiple databases.
For example, in the financial service scenario, the transfer service performs the following operations:
- Call the trading system service to create a trading order;
- Call payment system to record payment details;
- Call accounting system to execute A deduction;
- Call the accounting system to execute B plus money.
These four operations span three systems and access four databases. However, network, database and machine are all unreliable, so it is difficult for us to guarantee 100% success of the above four operations.
In the business with financial attributes, it is not allowed that the money of account A is deducted while the money of account B is not added. Therefore, we must find ways to ensure that the four operations from 1 to 4 are either all successful or all fail. Therefore, Ant Financial independently developed the distributed transaction middleware to solve the data consistency problem across services and databases.
2. Ant Financial Distributed transactions
2.1 theoretical basis of distributed transactions
Before introducing the distributed transaction middleware of Ant Financial, we first introduce some theoretical background of distributed transaction.
- 2PC
The Two Phase Commitment Protocol is the most basic Protocol for distributed transactions. In a two-phase commit protocol, there is one transaction manager and multiple resource managers, and the transaction manager coordinates the resource manager in two phases. In the first phase, the transaction manager asks all resource managers whether the preparation was successful. If all resources are prepared successfully, in the second phase the transaction manager asks all resource managers to commit. If any resource manager returns a preparation failure in the first phase, the transaction manager asks all resource managers to perform a rollback operation in the second phase. Through two-phase coordination of transaction managers, eventually all resource managers are either committed or rolled back, and the final state is consistent.
- TCC
Resource manager can be implemented in many ways, among which TCC is a servitization implementation of resource manager. TCC is a relatively mature distributed transaction solution, which can be used to solve the problem of data consistency across databases and services. TCC’s Try, Confirm, and Cancel methods are all implemented by the business code, so TCC can be called a service-oriented resource manager.
The Try operation of TCC as a phase is responsible for checking and reserving resources. The Confirm operation performs the real business as a two-phase commit operation. Cancel: a two-phase rollback operation is performed to Cancel a reserved resource and restore the resource to its initial state.
As shown in the figure below, after the user implements the TCC service, the TCC service will participate in the entire distributed transaction as one of the resources of the distributed transaction. The transaction manager coordinates TCC services in two phases, calling the Try methods of all TCC services in the first phase and executing the Confirm or Cancel methods of all TCC services in the second phase, and eventually all TCC services are either committed or rolled back.
2.2 Introduction to Ant Financial’s distributed products
Ant Financial has been doing distributed transactions for 12 years since 2007. The distributed transaction of Ant Financial was initially realized by USING TCC, which helped Ant business solve the data consistency problem in various financial core scenarios.
In 2007, we began to support Singles’ Day. In order to meet the performance requirements of singles’ Day, we made a series of performance optimizations for distributed transactions.
In 2013, Ant Financial began to do unitary transformation, and distributed transactions also began to support LDC, remote multi-activity and high availability DISASTER recovery, which solved the problem of fast service recovery in the case of machine room failure.
In 2014, Ant Financial distributed transaction middleware began to export through Ant Financial cloud, and we developed a large number of external users. In the process of developing external customers, external customers expressed a willingness to sacrifice some of their performance (ant-free business scale) in exchange for access convenience and non-intrusion.
So in 2015, we started working on non-intrusive transaction solutions: FMT mode and XA mode.
After long-term evolution, Ant Financial distributed transaction middleware has accumulated TCC, FMT and XA modes, with rich application scenarios. These three patterns are described below.
2.3 TCC mode
The TCC mode of Ant Financial is consistent with the TCC principle mentioned in the TCC theory introduced above. The difference is that transaction logs are logged throughout the execution of a distributed transaction. A distributed transaction generates one primary transaction record (for the initiator) and several branch transaction records (for the TCC participant). Record is the purpose of the transaction log, when abnormal interruption during the implementation of a distributed transaction, transaction recovery service by polling the transaction log, find out the transaction abort, compensation to execute the abnormal transaction remaining unfinished action, final state of the whole distributed transaction either all committed, or all rolled back.
TCC Design Specifications and notes:
When users access TCC, most of their work is focused on how to implement TCC services. After years of TCC application practice in Ant Financial, the following points for attention in TCC design and implementation are summarized:
1. Business operation is completed in two stages:
Before adding TCC, service operations can be completed in one step. However, after the TCC is connected, you need to consider how to divide it into two phases: check and reserve resources in the Try operation in the first phase, and execute the real business operations in the Confirm operation in the second phase.
The following is an example to illustrate how the business model is designed in two phases. The example scenario is “Account A has A balance of RMB 100 and RMB 30 needs to be deducted”.
Before accessing TCC, the user can write SQL: “Update account table SET balance = balance -20 where account = A” to complete the deduction operation in one step.
Once TCC is connected, you need to consider how to split the deduction operation into two steps:
- Try operation: Check and reserve resources.
In the deduction scenario, all the Try operation needs to do is check whether the balance of account A is sufficient, and then freeze the 30 yuan (reserved resources) to be deducted. No real deductions will occur at this stage.
- Confirm action: Performs the submission of the real business.
In the deduction scenario, what the Confirm stage does is to make the actual deduction and deduct the 30 yuan that has been frozen from A’s account.
- Cancel: Releases the reserved resource.
In the case of deduction, the deduction is cancelled, and the task of Cancel is to release the 30 yuan frozen by the Try operation and return account A to its original state.
2. Concurrency control
Users should consider concurrency when implementing TCC to minimize the granularity of locks to maximize the concurrency of distributed transactions.
As an example, “Account A has 100 yuan, 30 yuan will be deducted from transaction T1, and 30 yuan will be deducted from transaction T2, resulting in concurrency”.
In the first-stage Try operation, distributed transaction T1 and distributed transaction T2 freeze the portion of funds respectively without interference with each other. In this way, in the second phase of distributed transaction, whether T1 commits or rolls back has no impact on T2, so that T1 and T2 can execute in parallel on the same transaction data.
3. Allow empty rollback
As shown in the figure below, the transaction coordinator may experience a network timeout due to packet loss when calling the Try operation, a phase of the TCC service. At this point, the transaction manager triggers a two-phase rollback, calling the TCC service’s Cancel operation, and the Cancel operation invocation does not time out.
A scenario in which the TCC service receives a Cancel request without receiving a Try request is called a null rollback. Empty rollback is a common occurrence in production environments. When implementing TCC services, users should allow the execution of empty rollback, that is, return success upon receipt of empty rollback.
4. Anti-suspension control
As shown in the figure below, a timeout due to network congestion may occur when the transaction coordinator invokes the Try operation, a phase of the TCC service. At this point, the transaction manager triggers a two-phase rollback, calling the TCC service’s Cancel operation, and the Cancel call has not timed out. After that, the TCC service will receive the first-stage Try packet that is jammed on the network, and the second-stage Cancel request will be executed before the first-stage Try request. The TCC service will never receive the Confirm or Cancel of the second-stage after executing the late Try. The TCC service is suspended.
When implementing TCC services, users should allow empty rollback, but reject Try requests after empty rollback to avoid suspension.
5. Idempotent control
Try, Confirm or Cancel operations of THE TCC service will be repeated, whether network packet retransmission or compensation execution of abnormal transactions. When implementing TCC services, users need to take idempotent control into account. That is, the results of executing Try, Confirm, and Cancel once and for many times are the same.
2.4. FMT mode
FMT (Framework-Managed Transaction) Framework manages transactions and is a non-intrusive transaction solution. In this mode, the distributed transaction framework hosts all transaction operations. The first and second phase operations of the transaction are automatically generated by the framework. The user SQL is used as the first phase of the distributed transaction, and the second phase is automatically generated by the framework as “commit/rollback” operations.
1. FMT Stage 1
The first phase of FMT is automatically generated by the distributed transaction framework. In the first phase, the distributed transaction framework intercepts the SQL statement of the service and saves the data before the service SQL modification as the original undo log before the service execution. After the business SQL is executed, the updated business data is saved as a new snapshot (redo log). Finally, row locks are generated using table names and primary keys for concurrent control of distributed transactions.
The purpose of parsing SQL semantics is to make it easy to find the business data to be updated by the business; The purpose of extracting table metadata is to find the primary key and unique constraint key of the business table, so as to generate row lock.
2. FMT Stage 2
In FMT mode, intermediate data, such as undo log and redo log, is saved in phase 1 to generate phase 2 operations.
- FTM phase 2 submission
The two-phase commit operation is automatically generated. Since the business SQL has already been committed to the database in the first phase, the two-phase commit only needs to delete the intermediate data saved in the first phase (undo log, redo log, and row lock).
- FMT Phase 2 Rollback
The two-phase rollback operation is also automatically generated to roll back the service data updated by the one-phase SERVICE SQL using Undo log. The specific operation steps are as follows:
First, you need to verify dirty writes. Verify dirty write methods. Compare the redo log with the current database value. If the two data files are consistent, no dirty write occurs. If dirty write occurs, manual processing is required instead of using undo log to roll back service data.
Then, restore the business data. If no dirty write occurs, run the undo log command to roll back the service data and restore the service data to the initial value.
Finally, delete the intermediate data. After service data is restored, you can delete all intermediate data (undo log, redo log, and row lock) to complete rollback.
2.5 XA mode
XA mode is another non-invasive distributed transaction solution. Unlike FMT, all phases one and two are performed by the database. The ant distributed transaction framework invokes the first-phase XA interface of the data in one phase (XA Prepare) and the second-phase XA interface of the data in two phases (XA COMMIT/XA ROLLBACK).
The XA mode has the following features:
- Mainstream databases all support XA write with wide coverage.
- Deep customization with Oceanbase, a self-developed database of ants, to solve XA transaction performance problems;
- With the MVCC feature of database, distributed MVCC and global consistent read are implemented.
3. Ant Financial invests in Seata community of distributed transactions
In January 2019, based on technology accumulation, Alibaba middleware team launched the open source project Fescar (Fast & EaSy Commit And Rollback, Fescar) to jointly build distributed transaction solutions with the community. Fescar offers a unique solution to the problem of distributed transactions under microservices architecture. Fescar’s vision is to make the use of distributed transactions as simple and efficient as the use of local transactions. The ultimate goal is to make Fescar applicable to all distributed transaction scenarios.
In order to achieve the goal of being suitable for more distributed transaction business scenarios, Ant Financial joined the Fescar community and added TCC mode.
Ant Financial’s participation triggered a discussion among the core members of the community. In order to achieve the goal of applying to all distributed transaction business scenarios, and to make the community more neutral, open and ecological, the core members of the community decided to upgrade the brand and rename it Seata. Seata, which stands for Simple Extensible Autonomous Transaction Architecture, is a one-stop distributed Transaction solution.
Since open source, Seata has benefited from community contributions. So far, Seata has more than 7,000 STARS and over 55 Ficolin-3 developers contributing to a richer and more vibrant community.
In May 2019, Seata will be added to server-side HA cluster support, which will enable Seata to meet the standards used in production environments.
Developers who are passionate about distributed transactions are welcome to join the community and bring more imagination to Seata.
conclusion
- Live video and PPT to review the address in detail: tech.antfin.com/activities/…
- GitHub address: github.com/seata/seata
Financial Class Distributed Architecture (Antfin_SOFA)