Microservices advocate splitting a complex monolithic application into several loosely coupled services, each with a simple function, which can reduce development difficulty, improve scalability, and facilitate agile development. More and more developers are moving their systems to microservices. However, a seemingly simple function may internally need to call multiple services and operate on multiple databases, so the problem of distributed transactions across service calls becomes very prominent. Distributed transactions have become the biggest obstacle to adopting microservices, and they are also the most challenging technical problem.
1. What distributed transaction problems do microservices introduce?
First, imagine a traditional monolithic application in which three modules complete one business process by updating data in the same data source.
Naturally, data consistency across the whole business process is guaranteed by a local transaction.
As business requirements and architecture change, the monolithic application is split into microservices: the original three modules become three independent services, each using its own data source (Pattern: Database per service). The business process is now completed by calling the three services.
At this point, data consistency within each service is still guaranteed by a local transaction. But how can data consistency be guaranteed across the whole business process? This is the typical distributed transaction requirement under a microservices architecture: we need a distributed transaction solution that guarantees the global data consistency of the business.
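The gap between local and global consistency can be reproduced in a few lines. The following is a minimal sketch, not taken from Fescar: it assumes three hypothetical services (`stock`, `orders`, `account`), each backed by its own in-memory SQLite database, and each committing its own local transaction. When the third call fails, the first two updates have already been committed and nothing undoes them.

```python
import sqlite3


def new_service_db(table: str, initial: int) -> sqlite3.Connection:
    """Create a private in-memory database for one (hypothetical) service."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        f"CREATE TABLE {table} ("
        "id INTEGER PRIMARY KEY, "
        "value INTEGER NOT NULL CHECK (value >= 0))"
    )
    conn.execute(f"INSERT INTO {table} (id, value) VALUES (1, ?)", (initial,))
    conn.commit()
    return conn


def local_update(conn: sqlite3.Connection, table: str, delta: int) -> None:
    """Apply one update in a local transaction: commit on success, roll back on error."""
    try:
        conn.execute(f"UPDATE {table} SET value = value + ? WHERE id = 1", (delta,))
        conn.commit()
    except sqlite3.Error:
        conn.rollback()
        raise


def read_value(conn: sqlite3.Connection, table: str) -> int:
    return conn.execute(f"SELECT value FROM {table} WHERE id = 1").fetchone()[0]


# Database per service: three independent data sources.
stock_db = new_service_db("stock", 10)
orders_db = new_service_db("orders", 0)
account_db = new_service_db("account", 5)


def purchase() -> None:
    """One business process that now spans three services."""
    local_update(stock_db, "stock", -1)        # service 1 commits locally
    local_update(orders_db, "orders", 1)       # service 2 commits locally
    local_update(account_db, "account", -100)  # service 3 fails its CHECK constraint


try:
    purchase()
    failed = False
except sqlite3.IntegrityError:
    failed = True

# Each service is locally consistent, but the business as a whole is not:
# stock was deducted and an order was created, yet no payment was taken.
print(failed, read_value(stock_db, "stock"),
      read_value(orders_db, "orders"), read_value(account_db, "account"))
```

Each `local_update` behaves correctly on its own; the inconsistency only appears at the level of the whole business process, which is exactly what a distributed transaction solution has to address.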
2. Development history of Fescar
Alibaba was one of the first companies in China to move to a distributed (microservices) architecture, so it ran into the problem of distributed transactions under a microservices architecture early on.
In 2014, the Alibaba middleware team released TXC (Taobao Transaction Constructor) to provide distributed transaction services for applications within the group.
In 2016, TXC was turned into a product and launched on Alibaba Cloud as Global Transaction Service (GTS). At that time it was the only distributed transaction product offered on the cloud in the industry, and it began to serve many external customers through Alibaba Cloud's public cloud and private cloud solutions.
In 2019, building on the technology accumulated in TXC and GTS, the Alibaba middleware team launched the open source project Fescar (Fast & EaSy Commit And Rollback) to build this distributed transaction solution together with the community.
TXC, GTS, and Fescar offer a distinctive solution to the problem of distributed transactions under a microservices architecture.
2.1 Design Intention
In the fast-growing Internet age, the ability to experiment quickly and cheaply is critical to a business:
- On the one hand, introducing microservices and distributed transaction support into the technical architecture should not impose an additional R&D burden at the business level.
- On the other hand, a business that introduces distributed transaction support should keep roughly the same level of performance and should not be significantly slowed down by the transaction mechanism.
Based on these two points, the most important considerations at the beginning of our design were:
- No intrusion on the business: by “intrusion” we mean that the application has to be designed and modified at the business level because of the technical constraints of distributed transactions. This kind of design and modification often brings high development and maintenance costs. We want to solve the distributed transaction problem at the middleware level, without requiring the application to do extra work at the business level.
- High performance: introducing distributed transaction guarantees inevitably adds overhead and reduces performance. We want to reduce that cost to a very low level, so that introducing distributed transactions does not affect the availability of the business.
2.2 Why are the existing solutions not enough?
Existing distributed transaction solutions can be divided into two categories according to whether they intrude on the business: non-intrusive and intrusive.
Solutions that do not intrude on the business
Among the existing mainstream distributed transaction solutions, only XA-based solutions do not intrude on the business, but applying XA raises three problems:
- The database must support XA. XA cannot be used with a database that does not support it (or supports it poorly, such as MySQL before 5.7).
- Constrained by the protocol itself, transaction resources stay locked for a long time. Such long locking is often unnecessary from the business point of view, but because the resource manager of the transaction is the database itself, the application layer cannot intervene. As a result, XA-based applications tend to perform poorly and are difficult to optimize.
- Real-world XA-based distributed solutions all depend on heavyweight application servers (Tuxedo, WebLogic, WebSphere, etc.), which are not suitable for a microservices architecture.
Solutions that intrude on the business
In fact, XA was initially the only solution for distributed transactions. XA is complete, but in practice, for a variety of reasons (including but not limited to the three points above), it often has to be abandoned in favor of handling distributed transactions at the business level. For example:
- Eventual consistency based on reliable messaging
- TCC
- Saga
All of these fall into this category. Their specific mechanisms are not covered here, since many articles about them are available online. In summary, these solutions require the technical constraints of distributed transactions to be taken into account at the business level of the application, and typically each service must be designed to provide forward and reverse interfaces that are idempotent. Such design constraints often result in high development and maintenance costs.
2.3 What would the ideal solution look like?
It is undeniable that distributed transaction solutions that intrude on the business have been proven in a great deal of practice and solve the problem effectively. They play an important role in the business systems of many industries. But going back to first principles, these solutions are adopted only because there is no better choice. Imagine an XA-based solution that was less heavyweight and could meet the performance requirements of the business: no one would want to push the distributed transaction problem up to the business level.
An ideal distributed transaction solution should be as simple to use as a local transaction: the business logic focuses only on business requirements and does not need to consider the constraints of the transaction mechanism.
3. Principle and design
We want to design a solution that does not intrude on the business, so we start from XA, the existing non-intrusive solution, and ask: can we evolve from XA and solve the problems that XA solutions face?
3.1 How to define a distributed transaction?
First, it is natural to think of a distributed transaction as a global transaction that contains several branch transactions. The responsibility of a global transaction is to coordinate the branch transactions under its jurisdiction to agree on either a successful commit together or a failed rollback together. In addition, often a branch transaction is itself a local transaction that satisfies ACID. This is our basic understanding of distributed transaction structures, consistent with XA.
Second, similar to the XA model, we define three components that together implement the protocol for processing distributed transactions.
- Transaction Coordinator (TC): maintains the running state of global transactions, and coordinates and drives the commit or rollback of global transactions.
- Transaction Manager (TM): controls the boundaries of a global transaction; it is responsible for starting a global transaction and for finally initiating the resolution to commit or roll back globally.
- Resource Manager (RM): controls branch transactions; it is responsible for branch registration and status reporting, receives instructions from the transaction coordinator, and drives the commit and rollback of branch (local) transactions.
A typical distributed transaction process:
- TM asks TC to start a global transaction. The global transaction is created successfully and a globally unique XID is generated.
- The XID is propagated in the context of the microservice call chain.
- RM registers a branch transaction with TC and brings it under the jurisdiction of the global transaction that corresponds to the XID.
- TM sends TC a resolution to commit or roll back the global transaction identified by the XID.
- TC directs all branch transactions under the XID to complete the commit or rollback request.
So far, Fescar’s protocol mechanism is generally consistent with XA.
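The five steps above can be sketched as a small in-memory model. This is only an illustration of the message flow, under the assumption that everything runs in one process; the class and method names (`TransactionCoordinator`, `register_branch`, `RecordingResourceManager`, and so on) are hypothetical and are not Fescar's actual API.

```python
import itertools
import uuid
from dataclasses import dataclass, field


@dataclass
class GlobalSession:
    xid: str
    status: str = "BEGIN"
    branches: dict = field(default_factory=dict)  # branch_id -> resource manager


class TransactionCoordinator:
    """TC: keeps global transaction state and drives commit or rollback."""

    def __init__(self):
        self.sessions = {}
        self._branch_ids = itertools.count(1)

    def begin(self) -> str:
        xid = uuid.uuid4().hex  # globally unique XID
        self.sessions[xid] = GlobalSession(xid)
        return xid

    def register_branch(self, xid: str, rm) -> int:
        branch_id = next(self._branch_ids)
        self.sessions[xid].branches[branch_id] = rm
        return branch_id

    def commit(self, xid: str) -> None:
        session = self.sessions[xid]
        for branch_id, rm in session.branches.items():
            rm.branch_commit(xid, branch_id)
        session.status = "COMMITTED"

    def rollback(self, xid: str) -> None:
        session = self.sessions[xid]
        for branch_id, rm in session.branches.items():
            rm.branch_rollback(xid, branch_id)
        session.status = "ROLLED_BACK"


class RecordingResourceManager:
    """RM stand-in that only records what the coordinator asks it to do."""

    def __init__(self, name: str, tc: TransactionCoordinator):
        self.name = name
        self.tc = tc
        self.log = []

    def do_work(self, xid: str) -> int:
        branch_id = self.tc.register_branch(xid, self)  # step 3
        self.log.append(("registered", branch_id))
        return branch_id

    def branch_commit(self, xid: str, branch_id: int) -> None:
        self.log.append(("commit", branch_id))

    def branch_rollback(self, xid: str, branch_id: int) -> None:
        self.log.append(("rollback", branch_id))


tc = TransactionCoordinator()
rm_a = RecordingResourceManager("service-a", tc)
rm_b = RecordingResourceManager("service-b", tc)

xid = tc.begin()    # step 1: TM asks TC to start a global transaction
rm_a.do_work(xid)   # steps 2-3: the XID travels with the call, RM registers a branch
rm_b.do_work(xid)
tc.rollback(xid)    # steps 4-5: TM resolves to roll back, TC drives every branch
```

In this sketch the TM role is played by the top-level script, which is the only place that calls `begin` and `rollback`.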
3.2 What are the differences from XA?
Architectural layers
The RM of an XA solution actually sits at the database layer; the RM is essentially the database itself (used through an XA-enabled driver).
Fescar’s RM is deployed on the application side as a middleware layer in the form of a binary package. It does not depend on the database’s own protocol support, and it certainly does not require the database to support XA. This matters for a microservices architecture: the application layer does not need to accommodate two different sets of database drivers for local and distributed transaction scenarios.
This design removes the requirement that the database support the distributed transaction protocol.
Two-phase commit
First, let’s look at XA’s two-phase commit (2PC) process.
Regardless of whether the Phase2 resolution is COMMIT or ROLLBACK, the locks on transactional resources are held until Phase2 is complete.
For a business that is running normally, with high probability more than 90% of transactions eventually commit successfully. So can we commit the local transaction in Phase1? In that case, for more than 90% of transactions, the lock time of Phase2 is saved and overall efficiency improves.
This design shortens the time that transaction locks are held in most scenarios, and thereby increases transaction concurrency.
Of course, you will ask: if the local transaction has already committed in Phase1, how can it be rolled back in Phase2?
3.3 How do branch transactions commit and roll back?
First, the application needs to use Fescar’s JDBC data source proxy, which is Fescar’s RM.
Phase1:
Fescar’s JDBC data source proxy parses the business SQL, organizes the images of the business data before and after the update into a rollback log, and, relying on the ACID properties of local transactions, commits the business data update and the rollback log write in the same local transaction.
This ensures that every committed update to business data has a corresponding rollback log.
Based on this mechanism, the local transaction of a branch can commit in Phase1 of the global transaction, immediately releasing the resources locked by the local transaction.
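The Phase1 idea can be sketched with SQLite standing in for the business database. This is a simplified illustration, not Fescar's implementation: the real proxy parses arbitrary business SQL, whereas this sketch hard-codes one update, and the `product` table, the `undo_log` table, and its column names are all assumptions made for the example.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, stock INTEGER)")
# Hypothetical rollback-log table living in the same database as the business data.
conn.execute(
    "CREATE TABLE undo_log ("
    "xid TEXT, branch_id INTEGER, table_name TEXT, "
    "before_image TEXT, after_image TEXT)"
)
conn.execute("INSERT INTO product VALUES (1, 'widget', 10)")
conn.commit()


def snapshot(conn: sqlite3.Connection, pk: int) -> dict:
    """Read the current row image for one primary key."""
    row = conn.execute(
        "SELECT id, name, stock FROM product WHERE id = ?", (pk,)
    ).fetchone()
    return dict(zip(("id", "name", "stock"), row))


def phase1_update_stock(conn, xid: str, branch_id: int, pk: int, new_stock: int) -> None:
    """Business update and rollback-log write committed in ONE local transaction."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading the before image
    try:
        before = snapshot(conn, pk)
        conn.execute("UPDATE product SET stock = ? WHERE id = ?", (new_stock, pk))
        after = snapshot(conn, pk)
        conn.execute(
            "INSERT INTO undo_log VALUES (?, ?, ?, ?, ?)",
            (xid, branch_id, "product", json.dumps(before), json.dumps(after)),
        )
        conn.commit()  # local commit: database locks are released here
    except Exception:
        conn.rollback()  # neither the update nor the log entry survives
        raise


phase1_update_stock(conn, "xid-1", 1, pk=1, new_stock=8)
undo_rows = conn.execute("SELECT xid, branch_id, before_image FROM undo_log").fetchall()
```

Because both writes share one local transaction, there is no state in which the business row has changed but the rollback log is missing, or the other way around.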
Phase2:
If the resolution is a global commit, the branch transaction has already committed by this point, so no synchronous coordination is needed (only asynchronous cleanup of the rollback log), and Phase2 can complete very quickly.
If the resolution is a global rollback, the RM receives the rollback request from the coordinator, finds the corresponding rollback log record by XID and branch ID, generates the reverse update SQL from that record, and executes it to complete the rollback of the branch.
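Both Phase2 outcomes can be sketched against the same kind of rollback log. As before, this is an illustration under assumptions rather than Fescar's code: the `undo_log` layout is hypothetical, the reverse SQL is built only for a single-row update keyed by `id`, and the cleanup that Fescar performs asynchronously is done synchronously here for brevity.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, stock INTEGER)")
conn.execute(
    "CREATE TABLE undo_log ("
    "xid TEXT, branch_id INTEGER, table_name TEXT, "
    "before_image TEXT, after_image TEXT)"
)
# State left behind by Phase1: stock already changed from 10 to 8 and committed,
# together with the matching rollback-log record.
conn.execute("INSERT INTO product VALUES (1, 'widget', 8)")
conn.execute(
    "INSERT INTO undo_log VALUES (?, ?, ?, ?, ?)",
    ("xid-1", 1, "product",
     json.dumps({"id": 1, "name": "widget", "stock": 10}),
     json.dumps({"id": 1, "name": "widget", "stock": 8})),
)
conn.commit()


def build_reverse_update(table: str, before: dict):
    """Turn a before image into an UPDATE that restores the old column values."""
    columns = [c for c in before if c != "id"]
    assignments = ", ".join(f"{c} = ?" for c in columns)
    params = [before[c] for c in columns] + [before["id"]]
    return f"UPDATE {table} SET {assignments} WHERE id = ?", params


def phase2_rollback(conn, xid: str, branch_id: int) -> int:
    """Global rollback: replay before images, then drop the log records."""
    rows = conn.execute(
        "SELECT table_name, before_image FROM undo_log WHERE xid = ? AND branch_id = ?",
        (xid, branch_id),
    ).fetchall()
    try:
        for table, before_json in rows:
            sql, params = build_reverse_update(table, json.loads(before_json))
            conn.execute(sql, params)
        conn.execute(
            "DELETE FROM undo_log WHERE xid = ? AND branch_id = ?", (xid, branch_id)
        )
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    return len(rows)


def phase2_commit(conn, xid: str, branch_id: int) -> None:
    """Global commit: data is already committed, only the log needs cleaning up."""
    conn.execute("DELETE FROM undo_log WHERE xid = ? AND branch_id = ?", (xid, branch_id))
    conn.commit()


restored = phase2_rollback(conn, "xid-1", 1)
stock_after_rollback = conn.execute(
    "SELECT stock FROM product WHERE id = 1"
).fetchone()[0]
```

Note how little work `phase2_commit` does compared with `phase2_rollback`; that asymmetry is what makes the common, successful case cheap.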
3.4 Transaction propagation mechanism
The XID is the unique identifier of a global transaction. The job of the transaction propagation mechanism is to pass the XID along the service call chain and bind it to each service’s transaction context. In this way, every database update in the call chain registers a branch with the global transaction identified by that XID and comes under the jurisdiction of the same global transaction.
Based on this mechanism, Fescar can support any microservice RPC framework: it only needs a mechanism in that framework that can propagate the XID transparently, such as Dubbo’s Filter + RpcContext.
Corresponding to transaction propagation properties defined by the Java EE specification and Spring, Fescar supports the following:
- PROPAGATION_REQUIRED: Supported by default
- PROPAGATION_SUPPORTS: Supported by default
- PROPAGATION_MANDATORY: Implemented by the application through the API
- PROPAGATION_REQUIRES_NEW: Implemented by the application through the API
- PROPAGATION_NOT_SUPPORTED: Implemented by the application through the API
- PROPAGATION_NEVER: Implemented by the application through the API
- PROPAGATION_REQUIRED_NESTED: Not supported
3.5 Isolation
The isolation of global transactions is built on the local isolation level of branch transactions.
On the premise that the local isolation level of the database is read committed or above, Fescar provides a global exclusive write lock, maintained by the transaction coordinator, to guarantee write isolation between transactions. By default, the isolation level of a global transaction is defined as read uncommitted.
Our consensus on isolation levels is that the vast majority of applications work without problems at the read committed isolation level. In fact, the vast majority of those scenarios also work without problems at read uncommitted.
In extreme scenarios, if an application needs global read committed isolation, Fescar provides a mechanism to achieve it. By default, Fescar works at the read uncommitted isolation level, which keeps most scenarios efficient.
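The global write lock can be pictured as a table, kept by the coordinator, that maps each locked row to the global transaction that owns it. The sketch below is an assumption-laden illustration: the text does not specify the lock granularity or when locks are taken and released, so the row-level keys, the all-or-nothing acquisition, and the release at the end of the global transaction are choices made for this example only.

```python
class GlobalLockTable:
    """Coordinator-side exclusive write locks, keyed by (table, primary key)."""

    def __init__(self):
        self._owners = {}  # (table, pk) -> xid of the owning global transaction

    def acquire(self, xid: str, table: str, pks) -> bool:
        """Lock all rows for `xid`, or none of them if any row is owned by another xid."""
        keys = [(table, pk) for pk in pks]
        if any(self._owners.get(key, xid) != xid for key in keys):
            return False
        for key in keys:
            self._owners[key] = xid
        return True

    def release_all(self, xid: str) -> None:
        """Drop every lock held by `xid` once its global transaction has ended."""
        for key in [k for k, owner in self._owners.items() if owner == xid]:
            del self._owners[key]

    def owner(self, table: str, pk):
        return self._owners.get((table, pk))


locks = GlobalLockTable()
first = locks.acquire("xid-1", "product", [1, 2])    # granted
second = locks.acquire("xid-2", "product", [2, 3])   # refused: row 2 belongs to xid-1
locks.release_all("xid-1")                           # xid-1 commits or rolls back globally
third = locks.acquire("xid-2", "product", [2, 3])    # now granted
```

A branch that is refused the lock would have to wait or fail rather than overwrite a row that another, still unresolved global transaction has modified, which is what protects the rollback log's before image from becoming stale.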
The application of ACID properties to transactions in Fescar is a complex topic that will be covered in a separate article.
4. Application scenario analysis
The core principles of Fescar described above rest on an important premise: the resource involved in a branch transaction must be a relational database that supports ACID transactions, because the commit and rollback mechanisms of branches depend on the guarantees of local transactions. Therefore, if the application uses a database that does not support transactions, or is not a relational database at all, this mechanism does not apply.
In addition, the current implementation of Fescar still has some limitations: for example, the transaction isolation level goes only as high as read committed, and SQL parsing does not yet cover the full syntax.
To cover the application scenarios that Fescar’s native mechanism does not support for now, we have defined another working mode.
The native working mode of Fescar described above is called AT (Automatic Transaction) mode, and it does not intrude on the business. The corresponding alternative is called MT (Manual Transaction) mode, in which the application itself must define the branch transaction’s business logic together with its commit and rollback logic.
4.1 Basic behavior patterns of branches
A branch transaction that is part of a global transaction contains four behaviors that interact with the coordinator in addition to its own business logic:
- Branch registration: Before the data operation of a branch transaction can take place, it is necessary to register the data operation of the branch transaction with the coordinator and put it into the management of a global transaction that has been started. After the branch is registered, the data operation can take place.
- Status reporting: After the data operation of a branch transaction completes, the result of its execution needs to be reported to the transaction coordinator.
- Branch commit: Completes a branch commit in response to a request from the coordinator for a branch transaction commit.
- Branch rollback: Completes the branch rollback in response to a request from the coordinator to roll back a branch transaction.
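These four behaviors can be written down as a small abstract interface. The following is a sketch only: the class names, the status values, and the in-memory `Coordinator` stub are hypothetical and exist just to show where each behavior sits relative to the business logic.

```python
from abc import ABC, abstractmethod
from enum import Enum


class BranchStatus(Enum):
    REGISTERED = "registered"
    PHASE1_DONE = "phase1_done"
    PHASE1_FAILED = "phase1_failed"


class Coordinator:
    """Minimal stand-in for the transaction coordinator."""

    def __init__(self):
        self.statuses = {}  # (xid, branch_id) -> BranchStatus
        self._next_id = 0

    def register(self, xid: str) -> int:
        self._next_id += 1
        self.statuses[(xid, self._next_id)] = BranchStatus.REGISTERED
        return self._next_id

    def report(self, xid: str, branch_id: int, status: BranchStatus) -> None:
        self.statuses[(xid, branch_id)] = status


class Branch(ABC):
    """A branch transaction: business logic plus four coordinator interactions."""

    def run(self, coordinator: Coordinator, xid: str) -> int:
        branch_id = coordinator.register(xid)  # 1. branch registration, before any data operation
        try:
            self.do_business()
        except Exception:
            coordinator.report(xid, branch_id, BranchStatus.PHASE1_FAILED)  # 2. status reporting
            raise
        coordinator.report(xid, branch_id, BranchStatus.PHASE1_DONE)        # 2. status reporting
        return branch_id

    @abstractmethod
    def do_business(self) -> None: ...

    @abstractmethod
    def commit(self) -> None:
        """3. Branch commit, invoked at the coordinator's request."""

    @abstractmethod
    def rollback(self) -> None:
        """4. Branch rollback, invoked at the coordinator's request."""


class CounterBranch(Branch):
    """Toy branch whose 'data' is a single counter."""

    def __init__(self):
        self.value = 0

    def do_business(self) -> None:
        self.value += 1

    def commit(self) -> None:
        pass  # nothing left to do in this toy example

    def rollback(self) -> None:
        self.value -= 1


coordinator = Coordinator()
branch = CounterBranch()
branch_id = branch.run(coordinator, "xid-1")
```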
4.2 Branch behavior in AT mode
The business logic does not need to care about the transaction mechanism; the interaction between the branch and the global transaction happens automatically.
4.3 Branch behavior in MT mode
The business logic must be decomposed into three parts, Prepare, Commit, and Rollback, to form an MT branch that joins the global transaction.
MT mode complements AT mode. In addition, the value of MT mode is that many non-transactional resources can be brought under the management of global transactions.
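As an illustration of the Prepare/Commit/Rollback decomposition, the sketch below wraps a non-transactional resource, a plain dictionary standing in for a key-value store, as an MT-style branch. The class and its reservation scheme are hypothetical; they show one possible way to split the logic, not how Fescar requires it to be done.

```python
class KeyValueMTBranch:
    """MT-style branch over a non-transactional key-value store (a plain dict)."""

    def __init__(self, store: dict):
        self.store = store
        self.pending = {}  # xid -> (key, value) reserved but not yet applied

    def prepare(self, xid: str, key: str, value) -> bool:
        """Prepare: reserve the key without making the new value visible."""
        if any(k == key for k, _ in self.pending.values()):
            return False  # another global transaction has reserved this key
        self.pending[xid] = (key, value)
        return True

    def commit(self, xid: str) -> bool:
        """Commit: apply the reserved write. Safe to call more than once."""
        entry = self.pending.pop(xid, None)
        if entry is not None:
            key, value = entry
            self.store[key] = value
        return True

    def rollback(self, xid: str) -> bool:
        """Rollback: discard the reservation. Safe to call more than once."""
        self.pending.pop(xid, None)
        return True


store = {"greeting": "old"}
branch = KeyValueMTBranch(store)

branch.prepare("xid-1", "greeting", "new")
blocked = branch.prepare("xid-2", "greeting", "other")  # refused while xid-1 holds the key
branch.rollback("xid-1")
value_after_rollback = store["greeting"]

branch.prepare("xid-3", "greeting", "new")
branch.commit("xid-3")
branch.commit("xid-3")  # a repeated commit has no further effect
value_after_commit = store["greeting"]
```

Making `commit` and `rollback` safe to repeat is a design choice of this sketch: a coordinator that retries after a timeout can then call them again without corrupting the store.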
4.4 Mixed Mode
Because AT and MT branches fundamentally behave in the same way, the two modes are fully compatible: branches of both modes can coexist in one global transaction. This makes comprehensive coverage of business scenarios possible: use AT mode wherever it is supported, and use MT mode wherever it is not. In addition, non-transactional resources managed through MT mode can naturally take part in the same distributed transaction as transactional relational database resources.
4.5 Outlook for application scenarios
Back to our original design intention: an ideal distributed transaction solution should not intrude on the business. MT mode is a natural complement for the scenarios that AT mode cannot yet cover. We hope that, as AT mode continues to evolve and improve, the scenarios it supports will gradually expand and the need for MT mode will gradually shrink. In the future, we will also add native XA support as another non-intrusive way to cover scenarios that AT mode cannot reach.
5. Extension points
5.1 Support for microservices framework
Propagating the transaction context between microservices requires a solution tailored to the mechanism of each microservice framework, so that it is both efficient and transparent to the application layer. Developers interested in building in this area can refer to the built-in support for Dubbo to implement support for other microservices frameworks.
5.2 Supported database types
Because AT mode involves parsing SQL, different types of databases require specific adaptations. Developers interested in building in this area can refer to the built-in support for MySQL to implement support for other databases.
5.3 Configuration and Service Registration Discovery
Fescar supports plugging in different configuration and service registration and discovery solutions, such as Nacos, Eureka, and ZooKeeper.
5.4 Extending the scenarios of MT mode
An important role of MT mode is that resources other than relational databases can be wrapped in MT branches and brought under the jurisdiction of global transactions, for example Redis, HBase, and RocketMQ transactional messages. Developers interested in building in this area can contribute a range of ecosystem adapters here.
5.5 Distributed high availability solutions for the transaction coordinator
For different scenarios, different approaches are supported as high availability solutions on the server side of the transaction coordinator. For example, transaction state can be persisted to files or to a database, and state can be synchronized across the cluster through RPC communication or through a highly available KV store.
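The choice between file-based and database-based persistence can be sketched as two interchangeable stores behind the same `save`/`load` methods. Everything here is an assumption made for illustration: the class names, the JSON-lines file layout, the `global_session` table, and the `unfinished` recovery helper are not Fescar's storage format.

```python
import json
import os
import sqlite3
import tempfile


class FileSessionStore:
    """Append-only file: one JSON record per state change, last record wins."""

    def __init__(self, path: str):
        self.path = path

    def save(self, xid: str, status: str) -> None:
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"xid": xid, "status": status}) + "\n")

    def load(self) -> dict:
        sessions = {}
        if os.path.exists(self.path):
            with open(self.path, encoding="utf-8") as f:
                for line in f:
                    record = json.loads(line)
                    sessions[record["xid"]] = record["status"]
        return sessions


class DatabaseSessionStore:
    """Relational table holding the latest status of each global transaction."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS global_session (xid TEXT PRIMARY KEY, status TEXT)"
        )
        conn.commit()

    def save(self, xid: str, status: str) -> None:
        self.conn.execute("INSERT OR REPLACE INTO global_session VALUES (?, ?)", (xid, status))
        self.conn.commit()

    def load(self) -> dict:
        return dict(self.conn.execute("SELECT xid, status FROM global_session"))


def unfinished(store) -> set:
    """After a coordinator restart: which global transactions still need resolving?"""
    return {xid for xid, status in store.load().items() if status == "BEGIN"}


workdir = tempfile.TemporaryDirectory()
stores = [
    FileSessionStore(os.path.join(workdir.name, "sessions.log")),
    DatabaseSessionStore(sqlite3.connect(":memory:")),
]
results = []
for store in stores:
    store.save("xid-1", "BEGIN")
    store.save("xid-2", "BEGIN")
    store.save("xid-1", "COMMITTED")
    results.append(unfinished(store))
workdir.cleanup()
```

Because the coordinator logic depends only on `save` and `load`, swapping one store for the other does not change how recovery works.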
6. Roadmap
The blueprint
The green part has already been released as open source, the yellow part will be released by Alibaba in future versions, and the blue part is the ecosystem that we will build together with the community:
- For support of different databases, developers can refer to the implementation for MySQL.
- For support of different microservices frameworks, developers can refer to the implementation for Dubbo.
- For support of MQ and NoSQL, developers can refer to the TCC implementation.
- Configuration and service registration and discovery: with little effort, developers can plug in any framework that provides such services.
- Of course, community contributions of better solutions for the non-blue parts are also welcome.
- In addition, because XA is the standard for distributed transactions, a complete distributed transaction solution cannot do without it, so XA support must be part of our long-term plan.
Preliminary version planning
V0.1.0:
- Microservices framework support: Dubbo
- Database support: MySQL
- Annotation based on Spring AOP
- Transaction Coordinator: Standalone version
V0.5.x:
- Microservices framework support: Spring Cloud
- MT mode
- Support for TCC mode transaction adaptation
- Dynamic configuration and service discovery
- Transaction Coordinator: High availability cluster version
V0.8.x:
- Metrics
- Console: Monitor/deploy/upgrade/expand capacity
V1.0.0:
- General Availability: Applicable in production environment
V1.5.x:
- Database support: Oracle/PostgreSQL/OceanBase
- Annotations that do not rely on Spring AOP
- Optimized hotspot data processing mechanism
- RocketMQ transaction messages are incorporated into global transaction management
- Adaptation mechanism for bringing NoSQL under global transaction management
- Support HBase
- Support Redis
V2.0.0:
- Support XA
Of course, as the project evolves iteratively, we value the voice of the community above all, and the roadmap will be discussed fully with the community and adjusted in a timely manner.