Saga mode, Seata’s solution for long-running transactions | SOFAChannel #10 review

SOFA:Channel/ is a livestream channel about interesting and practical distributed architecture. This article reviews SOFAChannel #10, “Saga mode: the long-running transaction solution of the distributed transaction framework Seata”. Links to the video replay and the slides are at the end of the article. You are welcome to join the livestream DingTalk group (23372465) so that you do not miss a broadcast.

Hi everyone, I am Chen Long, also known as Yi Yuan (long187@github). I am a core developer of Ant Financial’s distributed transaction products and a Seata committer. Today’s topic is “Saga mode in detail: the long-running transaction solution of the distributed transaction framework Seata”. Starting from the pain points of developing distributed financial applications, and drawing on the theory and usage scenarios of Saga distributed transactions, I will explain how to use the Seata Saga state machine for service orchestration and distributed transaction processing in order to build more resilient financial applications. I will also dissect the implementation of the Saga state machine: its architecture, principles, design, high availability and best practices.

Seata: github.com/seata/seata

Pain points of financial distributed application development

An obvious problem with distributed systems is that one business process has to combine a group of services. This is even more pronounced with microservices, because the process needs business-level consistency guarantees. In other words, if a step fails, you either roll back the service calls made so far or keep retrying until every step succeeds. (From the “Hearing the Wind in the Left Ear” column, “Resilient Design: Compensating Transactions”.)

In finance, however, business processes under a microservice architecture are often more complicated and very long. For example, it is normal for an Internet micro-loan process to call more than a dozen services, and once exception handling is added the process becomes even more complicated. Anyone who has developed financial business systems will know this from experience.

Therefore, we face some pain points in the development of financial distributed applications:

Business consistency cannot be guaranteed

In most of the business systems we deal with (for example systems in the channel layer, product layer and integration layer), business consistency usually ends up being handled by “compensation”. Without a coordinator to support this, development is difficult: every step has to perform the “rollback” of all previous steps in its catch block, which produces “arrow-shaped” code that is hard to read and maintain. Alternatively, the failed operation is retried, and if the retries also fail it is handed over to asynchronous reprocessing or manual intervention. All of this places a heavy burden on developers, lowers development efficiency and invites mistakes.

Business state is difficult to manage

There are many business entities, each with many states. Typically, the entity’s state is updated in the database after a business activity completes. With no state machine managing the whole state transition process, the flow is not intuitive and is error-prone, so the business can end up in an incorrect state.

Business monitoring and operations are difficult

Business execution is usually monitored by printing logs and viewing them on a log monitoring platform. Most of the time this is fine, but when a business error occurs, this kind of monitoring lacks the business context at the time of the failure, which makes troubleshooting harder. Log printing also depends on developers remembering to add it, so logs are easily missed.

Lack of a unified error daemon capability

Compensating transactions usually need both “compensation triggered by an error daemon” and “manually triggered compensation”. Without a unified error daemon and handling standard, developers have to build these one by one, which is a heavy burden.

Theoretical basis

For transactions, we all know ACID, and we are familiar with the CAP theorem, under which at most two of the three properties can be satisfied at once. To improve performance, a variant of ACID called BASE emerged. ACID emphasizes consistency (the C in CAP), while BASE emphasizes availability (the A in CAP). In many cases we cannot achieve strong ACID consistency, especially when a transaction spans multiple systems, some of which are not even provided by the same company. BASE-style systems tend to be designed for resilience: for a short period we allow new transactions to proceed even at the risk of data being out of sync, and later we use compensation to repair the transactions that went wrong, so that eventual consistency is guaranteed.

So in practice we make trade-offs, and for the many business systems that sit above the financial core, compensating transactions are a good fit. The Saga theory of compensating transaction processing was proposed more than 30 years ago and has gradually attracted attention in recent years with the rise of microservices. Saga is now widely accepted as a solution for long-running transactions.

github.com/aphyr/dist-… microservices.io/patterns/da…

Saga mode handles consistency in a very simple way: compensation. The left side of the figure above shows the normal transaction flow. When an error occurs while T3 is executing, the compensation flow on the right starts: the compensation services for T3, T2 and T1 are called in reverse order, where C3 compensates T3, C2 compensates T2 and C1 compensates T1. In this way the data modified by T3, T2 and T1 is restored.
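To make the compensation order concrete, here is a minimal Java sketch. It is not Seata code: the Step interface, the run method and the in-memory log are made up for illustration, and a real coordinator would also persist its progress. Following the description above, the step that failed is compensated as well, since its call may have partly taken effect.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustration only: every name here is hypothetical, none of it is Seata's API.
class SagaSketch {

    /** One Saga participant: a forward action T and its compensation C. */
    interface Step {
        void execute() throws Exception;   // T_i
        void compensate();                 // C_i
    }

    /**
     * Runs the steps in order. If a step throws, compensates that step and every
     * step before it in reverse order, then returns false.
     */
    static boolean run(List<Step> steps) {
        Deque<Step> started = new ArrayDeque<>();
        for (Step step : steps) {
            started.push(step);            // recorded before the call: a failed call may have partly executed
            try {
                step.execute();
            } catch (Exception e) {
                while (!started.isEmpty()) {
                    started.pop().compensate();
                }
                return false;
            }
        }
        return true;
    }

    /** Builds a step that writes "T<id>" and "C<id>" to a shared log and fails on demand. */
    static Step step(String id, boolean fail, List<String> log) {
        return new Step() {
            public void execute() throws Exception {
                log.add("T" + id);
                if (fail) {
                    throw new Exception("T" + id + " failed");
                }
            }
            public void compensate() {
                log.add("C" + id);
            }
        };
    }

    /** T3 fails, so C3, C2 and C1 run in that order. */
    static List<String> demo() {
        List<String> log = new ArrayList<>();
        run(List.of(step("1", false, log), step("2", false, log), step("3", true, log)));
        return log;
    }
}
```

Calling SagaSketch.demo() returns [T1, T2, T3, C3, C2, C1], which is the left-to-right flow followed by the right-to-left compensation described above.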

Usage scenarios

In some scenarios we need strong data consistency and adopt a distributed transaction scheme that performs a “two-phase commit” at the business layer. In other scenarios we do not need such strong consistency, and eventual consistency is enough.

For example, Ant Financial currently uses TCC mode in its financial core systems, which are characterized by strict consistency requirements (business isolation), short processes and high concurrency.

Many business systems above the financial core (such as systems in the channel layer, product layer and integration layer) are different: eventual consistency is enough, there are many processes, the processes are long, and they may call services of other companies (for example over a financial network). Developing Try, Confirm and Cancel methods for every service in these systems would be costly. If the transaction involves another company’s service, we cannot require that company to follow the TCC development pattern. In addition, a long process means a long transaction boundary and a long lock-holding time, which hurts concurrency.

So the Saga mode applies to the following scenarios:

Business processes that are long and numerous;
Participants include services of other companies or legacy systems that cannot provide the three interfaces required by TCC mode;
Typical business systems: financial networks (connecting to external institutions), Internet micro-lending, channel integration, and service integration under a distributed architecture;
Widely used in banks and other financial institutions;

Advantages:

Local transactions are committed in a single phase with no locks held, so performance is high;
Participants can execute asynchronously, so throughput is high;
Compensation services are easy to implement, because the reverse of an update operation is relatively easy to understand;

Disadvantage:

Isolation is not guaranteed; how to deal with the lack of isolation is discussed later.

Saga implementation based on state machine engine

Basic principles of the Saga implementation based on a state machine engine:

Define the flow of service calls with a state diagram and generate a JSON definition file from it;
A node in the state diagram can call a service, and a node can be configured with a compensation node (the node linked to it by a dotted line);
The state diagram JSON is executed by the state machine engine. When an exception occurs, the engine executes in reverse, calling the compensation node of each node that has already succeeded in order to roll the transaction back;
Whether to compensate when an exception occurs can also be decided by the user;
It can meet service orchestration requirements, with features such as routing, asynchronous execution, retry, parameter conversion, parameter mapping, service execution status judgment and exception catching;

Seata’s current Saga mode adopts the state machine + DSL scheme for the following reasons:

The state machine + DSL scheme is more widely used in real production scenarios;
It can be executed by an asynchronous processing engine, such as the Actor model or the SEDA architecture, to improve overall throughput;
Business systems above the core system usually come with a “service orchestration” requirement, and service orchestration in turn requires transactional consistency. The two are hard to separate, and the state machine + DSL scheme can satisfy both at once;
Saga mode does not guarantee isolation in theory, so in extreme cases a dirty write can make a rollback impossible. To take an extreme example: a distributed transaction first tops up user A’s account and then deducts the balance of user B. If A spends the topped-up balance after the top-up succeeds but before the transaction commits, a rollback can no longer be compensated. Some business scenarios can instead allow the business to eventually succeed: when a rollback cannot be completed, keep retrying the remaining steps. The state machine + DSL scheme can restore the context and continue “forward”, so that the business eventually succeeds and eventual consistency is achieved.

Seata State Language

Define the flow of service calls with a state diagram and generate a JSON state language definition file;
In the state diagram, a node can call a service, and a node can be configured with a compensation node;
State types support single choice, concurrency, asynchronous execution, sub-state machines, parameter conversion, parameter mapping, service execution status judgment, exception catching and so on;
Compared with XML-based definitions (such as BPMN and BPEL), JSON definitions are more concise, easier to read and cheaper to learn.

Example of state machine JSON:

Description of the “state machine” attributes:

Name: the name of the state machine, which must be unique;
Comment: a description of the state machine;
Version: the version of the state machine definition;
StartState: the first “state” to run at startup;
States: the list of states, as a map whose key is the name of the “state”; the name must be unique within the state machine.

Description of the “state” attributes:

Type: the type of the “state”, for example:
- ServiceTask: performs a service call;
- Choice: single-condition choice routing;
- CompensationTrigger: triggers the compensation flow;
- Succeed: ends the state machine normally;
- Fail: ends the state machine abnormally;
- SubStateMachine: calls a sub-state machine;
ServiceName: the service name, usually the beanId of the service;
ServiceMethod: the name of the service method;
CompensateState: the compensation “state” of this “state”;
Input: the list of input parameters for the service call;
Output: assigns the values returned by the service to the state machine context;
Status: the mapping of service execution status. The framework defines three statuses: SU (success), FA (failure) and UN (unknown). The result of a service call must be mapped to one of these three so that the framework can judge the consistency of the whole transaction;
Catch: the routing to follow after an exception is caught;
Retry: the retry policy for the service call;
Next: the next “state” to execute after the service call completes.

For a more detailed explanation of the state language, see the Seata Saga documentation.
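To show how these attributes nest, the following Java sketch writes a small definition as nested maps that mirror the JSON structure, then checks that StartState, Next and CompensateState only name states that exist. It uses only attribute names described above. The state, service and method names are hypothetical, and the check is an illustration of how the attributes reference each other, not something taken from Seata’s engine.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: a definition held as nested maps, mirroring the JSON layout.
class StateMachineDefinitionSketch {

    /** Builds an insertion-ordered map from alternating key/value arguments. */
    static Map<String, Object> obj(Object... keysAndValues) {
        Map<String, Object> map = new LinkedHashMap<>();
        for (int i = 0; i < keysAndValues.length; i += 2) {
            map.put((String) keysAndValues[i], keysAndValues[i + 1]);
        }
        return map;
    }

    /** A two-step flow; every state, service and method name is hypothetical. */
    static Map<String, Object> sample() {
        return obj(
            "Name", "reduceInventoryAndBalance",
            "Comment", "reduce inventory, then reduce balance",
            "Version", "0.0.1",
            "StartState", "ReduceInventory",
            "States", obj(
                "ReduceInventory", obj(
                    "Type", "ServiceTask",
                    "ServiceName", "inventoryAction",
                    "ServiceMethod", "reduce",
                    "CompensateState", "CompensateReduceInventory",
                    "Next", "ReduceBalance"),
                "ReduceBalance", obj(
                    "Type", "ServiceTask",
                    "ServiceName", "balanceAction",
                    "ServiceMethod", "reduce",
                    "CompensateState", "CompensateReduceBalance",
                    "Next", "Succeed"),
                "CompensateReduceInventory", obj(
                    "Type", "ServiceTask",
                    "ServiceName", "inventoryAction",
                    "ServiceMethod", "compensateReduce"),
                "CompensateReduceBalance", obj(
                    "Type", "ServiceTask",
                    "ServiceName", "balanceAction",
                    "ServiceMethod", "compensateReduce"),
                "Succeed", obj("Type", "Succeed")));
    }

    /** Lists references (StartState, Next, CompensateState) that name an undefined state. */
    @SuppressWarnings("unchecked")
    static List<String> danglingReferences(Map<String, Object> definition) {
        Map<String, Object> states = (Map<String, Object>) definition.get("States");
        List<String> problems = new ArrayList<>();
        if (!states.containsKey(definition.get("StartState"))) {
            problems.add("StartState -> " + definition.get("StartState"));
        }
        for (Map.Entry<String, Object> entry : states.entrySet()) {
            Map<String, Object> state = (Map<String, Object>) entry.getValue();
            for (String attribute : List.of("Next", "CompensateState")) {
                Object target = state.get(attribute);
                if (target != null && !states.containsKey(target)) {
                    problems.add(entry.getKey() + "." + attribute + " -> " + target);
                }
            }
        }
        return problems;
    }
}
```

For the sample definition, danglingReferences returns an empty list. Input, Output, Status, Catch and Retry are left out because their value syntax is covered in the Seata Saga documentation rather than in this article.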

State machine designer

Seata Saga provides a visual state machine designer. For the code and a user guide, see: github.com/seata/seata…

State machine Designer screenshot:

State machine designer demo address: seata.io/saga_design…

State machine engine principles

In the state diagram, stateA runs first, then stateB, then stateC;
“State” execution is event-driven: when stateA finishes, it produces a routing message and puts it into the EventQueue; an event consumer takes the message from the EventQueue and executes stateB;
When the whole state machine starts, Seata Server is called to begin a distributed transaction and generate an XID, and the “state machine instance” start event is recorded in the local database;
When a “state” starts executing, Seata Server is called to register a branch transaction and generate a branch ID, and the “state instance” start event is recorded in the local database;
When a “state” finishes executing, the “state instance” end event is recorded in the local database, and then Seata Server is called to report the status of the branch transaction;
When the whole state machine finishes executing, the “state machine instance” end event is recorded in the local database, and then Seata Server is called to commit or roll back the distributed transaction.
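The event-driven loop described above can be sketched in a few lines of Java. This is a single-threaded toy under stated assumptions, not Seata’s engine: the queue holds plain state names as routing messages, the journal list stands in for the local database, and the calls to Seata Server are only marked with comments.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Illustration only: a toy event loop, with hypothetical names throughout.
class EventDrivenSketch {

    /** Runs the flow from startState; nextOf maps each state to its successor. */
    static List<String> run(String startState, Map<String, String> nextOf) {
        Queue<String> eventQueue = new ArrayDeque<>();
        List<String> journal = new ArrayList<>();   // stands in for the local database log

        // Here the real engine would ask Seata Server to begin the global transaction (XID).
        journal.add("machine started");
        eventQueue.add(startState);                 // the first routing message

        // The event consumer: take a routing message, run that state, publish the next message.
        String state;
        while ((state = eventQueue.poll()) != null) {
            // Here the real engine would register a branch transaction with Seata Server.
            journal.add(state + " started");
            // ... the service call for this state would happen here ...
            journal.add(state + " ended");
            // Here the real engine would report the branch status to Seata Server.
            String next = nextOf.get(state);
            if (next != null) {
                eventQueue.add(next);
            }
        }

        journal.add("machine ended");
        // Here the real engine would ask Seata Server to commit or roll back.
        return journal;
    }

    static List<String> demo() {
        return run("stateA", Map.of("stateA", "stateB", "stateB", "stateC"));
    }
}
```

Because each state only publishes a message instead of calling its successor directly, the consumer can be moved to other threads without changing the states themselves, which is what makes asynchronous execution possible.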

State machine engine design

The state machine engine is designed in three layers; each upper layer depends on the layer below it. From bottom to top:

Eventing layer:
- Implements the event-driven architecture: events can be pushed in and are consumed by a consumer. This layer does not care what the events are or what the consumer does with them; that is implemented by the layer above;
ProcessController layer:
- Driven by the Eventing layer described above, it executes an “empty” process: the behavior and routing of each “state” are not implemented here and are left to the layer above;

In theory, any “process” engine can be built as a custom extension on top of these two layers. Their design draws on the design of our internal financial network platform.

StateMachineEngine layer:
- Implements the behavior and routing logic of every kind of state in the state machine engine;
- Provides the API and the state machine language repository;

High availability of the state machine engine

The state machine engine is stateless; it is embedded in the application.

When the application is running normally:

The state machine engine reports its status to Seata Server;
State machine execution logs are stored in the business database.

When an application instance goes down:

Seata Server detects this and sends a transaction recovery request to a surviving application instance;
On receiving the recovery request, the state machine engine loads the logs from the database, restores the state machine context and continues execution.
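The reason a surviving instance can take over is that all progress lives in the log, not in the engine. The Java sketch below illustrates this under simplifying assumptions: ExecutionLog is an in-memory stand-in for the log table in the business database, the crash is simulated by stopping early, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: recovery by replaying a shared execution log.
class RecoverySketch {

    /** Stands in for the execution log table in the business database. */
    static final class ExecutionLog {
        final List<String> completedStates = new ArrayList<>();
    }

    /** A stateless engine: it keeps no progress of its own between calls. */
    static final class Engine {
        private final String instanceName;

        Engine(String instanceName) {
            this.instanceName = instanceName;
        }

        /**
         * Runs the flow, skipping states the log already marks as completed.
         * If crashAfter is non-negative, the instance "goes down" after that many states.
         */
        List<String> run(List<String> flow, ExecutionLog log, int crashAfter) {
            List<String> executedHere = new ArrayList<>();
            for (String state : flow) {
                if (log.completedStates.contains(state)) {
                    continue;                          // finished before the crash
                }
                if (crashAfter >= 0 && executedHere.size() == crashAfter) {
                    return executedHere;               // simulated crash
                }
                executedHere.add(instanceName + ":" + state);
                log.completedStates.add(state);        // the end event is written to the log
            }
            return executedHere;
        }
    }

    static List<String> demo() {
        List<String> flow = List.of("stateA", "stateB", "stateC");
        ExecutionLog log = new ExecutionLog();
        List<String> all = new ArrayList<>(new Engine("app-1").run(flow, log, 1)); // goes down after stateA
        all.addAll(new Engine("app-2").run(flow, log, -1));                        // recovery on a survivor
        return all;
    }
}
```

RecoverySketch.demo() returns [app-1:stateA, app-2:stateB, app-2:stateC]: the second instance resumes from the log instead of starting over.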

Practical experience of service design in Saga mode

The following are some lessons from designing microservices for Saga mode.

Seata Saga mode places no requirements on the interface parameters of a microservice, which makes it suitable for integrating services of legacy systems or external institutions.

Allow empty compensation

Empty compensation: the compensation service is executed although the original service was never executed;
Causes:
- The original service call times out (packet loss);
- The Saga transaction triggers a rollback;
- The compensation request arrives although the original service request was never received;

Therefore the service must be designed to allow empty compensation: if the business primary key to be compensated cannot be found, return compensation success and record that business primary key.

Anti-suspension control

Suspension: the compensation service is executed before the original service;
Causes:
- The original service call times out (congestion);
- The Saga transaction triggers a rollback and the compensation is called;
- The congested original service request finally arrives;

Therefore, before executing the original service, check whether the current business primary key is among the keys recorded by empty compensation; if it is, refuse to execute the service.
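The two rules above, allowing empty (null) compensation and refusing a late original request, fit together as in the following Java sketch. It is a simplified illustration with hypothetical names: the two sets stand in for database tables keyed by the business primary key, and a real service would update them in the same local transaction as its business data.

```java
import java.util.HashSet;
import java.util.Set;

// Illustration only: empty compensation plus anti-suspension for one service.
class CompensableServiceSketch {
    private final Set<String> executed = new HashSet<>();          // keys the original service has processed
    private final Set<String> emptyCompensated = new HashSet<>();  // keys compensated before the original arrived

    /** Original service. Returns false when the call is refused by anti-suspension control. */
    synchronized boolean execute(String businessKey) {
        if (emptyCompensated.contains(businessKey)) {
            return false;   // compensation already ran for this key, so a late request must not take effect
        }
        executed.add(businessKey);
        return true;
    }

    /** Compensation service. Reports success even when there is nothing to undo. */
    synchronized boolean compensate(String businessKey) {
        if (!executed.remove(businessKey)) {
            emptyCompensated.add(businessKey);  // empty compensation: record the key for anti-suspension
        }
        return true;
    }

    synchronized boolean isApplied(String businessKey) {
        return executed.contains(businessKey);
    }
}
```

If compensate("order-1") arrives first, it succeeds and records the key, and a later execute("order-1") is refused. In the normal order, execute followed by compensate simply undoes the work.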

Idempotent control

Both the original service and the compensation service must be idempotent. Because network calls can time out, a retry policy may be configured, and when a retry happens, idempotent control is what prevents the business data from being updated twice.

For services of older systems that cannot be made idempotent, there is a workaround: leave the retry policy unset so that the state machine does not retry the call, then determine the actual outcome of the call by a “reverse query” or by “manual correction”, and finally resume the execution of the state machine.
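One common way to obtain the idempotency described above is to remember the result of each request under a unique request ID. The Java sketch below shows the idea; the class, the requestId parameter and the in-memory map are assumptions made for illustration, and a real service would keep this record in its database.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: a debit that is safe to retry with the same request ID.
class IdempotentDebitSketch {
    private long balance;
    private final Map<String, Long> processed = new HashMap<>();  // request ID -> balance after that request

    IdempotentDebitSketch(long openingBalance) {
        this.balance = openingBalance;
    }

    /** Debits once per requestId; a retry returns the earlier result without debiting again. */
    synchronized long debit(String requestId, long amount) {
        Long earlier = processed.get(requestId);
        if (earlier != null) {
            return earlier;
        }
        if (amount > balance) {
            throw new IllegalStateException("insufficient balance");
        }
        balance -= amount;
        processed.put(requestId, balance);
        return balance;
    }

    synchronized long balance() {
        return balance;
    }
}
```

With an opening balance of 100, calling debit("r1", 30) twice returns 70 both times and leaves the balance at 70, so a retry after a timeout cannot deduct the amount a second time.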

Dealing with the lack of isolation

Because Saga transactions do not guarantee isolation, in extreme cases a dirty write can make a rollback impossible. To take an extreme example: a distributed transaction first tops up user A’s account and then deducts the balance of user B. If A spends the topped-up balance after the top-up succeeds but before the transaction commits, and the transaction then rolls back, there is no way to compensate.

This is a typical problem caused by the lack of isolation. In practice it is usually handled as follows:

Design business processes on the principle of “better long money than short money”. “Long money” means the customer has too little and the institution too much; the institution can rely on its credibility to refund the customer. “Short money” is the reverse: the institution is short, and the money may never be recovered. So the business process must always deduct first;
Some business scenarios can allow the business to eventually succeed: when a rollback cannot be completed, keep retrying to finish the remaining steps. So besides the “rollback” capability, the state machine engine also needs the ability to restore the context and continue “forward”, so that the business eventually succeeds and eventual consistency is achieved;
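The “deduct first” principle can be checked with a small worked example. The Java sketch below plays the A and B scenario in both orders. The Account class and the amounts are invented for illustration, and the failure of the second step is simply assumed in each case.

```java
// Illustration only: why the order of debit and credit decides whether rollback is possible.
class DeductFirstSketch {

    static final class Account {
        private long balance;

        Account(long balance) {
            this.balance = balance;
        }

        /** Returns false and changes nothing when the balance is insufficient. */
        boolean debit(long amount) {
            if (amount > balance) {
                return false;
            }
            balance -= amount;
            return true;
        }

        void credit(long amount) {
            balance += amount;
        }

        long balance() {
            return balance;
        }
    }

    /** Order 1: credit A, then debit B. Returns whether the rollback could be completed. */
    static boolean rollbackSucceedsWhenCreditingFirst() {
        Account a = new Account(0);
        Account b = new Account(0);        // B has no funds, so the debit below fails
        a.credit(100);                     // step 1: top up A
        a.debit(100);                      // A spends the money before the Saga ends (no isolation)
        if (b.debit(100)) {
            return true;                   // step 2 succeeded, nothing to roll back
        }
        return a.debit(100);               // compensation for step 1: take the 100 back from A
    }

    /** Order 2: debit B, then credit A (assumed to fail). Returns whether the rollback could be completed. */
    static boolean rollbackSucceedsWhenDebitingFirst() {
        Account b = new Account(100);
        if (!b.debit(100)) {
            return true;                   // step 1 failed, nothing to roll back
        }
        // Step 2 (top up A) is assumed to fail here, so A never held money it could spend.
        b.credit(100);                     // compensation for step 1: a refund, which cannot fail
        return b.balance() == 100;
    }
}
```

The first method returns false, because A’s balance is already gone when the compensation runs. The second returns true, because the only compensation ever needed is a refund from the institution.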

Advantages of Seata Saga

In practice, we have found that long-process business scenarios usually need service orchestration and, at the same time, data consistency between the services.

There are some Saga transaction frameworks in the open source community, such as Apache Camel Saga, Eventuate Tram Saga and Apache ServiceComb Saga. There are also service orchestration frameworks, such as Uber Cadence, Netflix Conductor, Zeebe-IO Zeebe, ING-Bank Baker and AWS Step Functions.

However, each of them offers either Saga transaction processing or service orchestration, not both. Seata Saga elegantly combines the two capabilities, giving users an event-driven product that is simple to develop with, easy to handle exceptions in, and high in performance.

Saga implementation based on annotations and interceptors (planned)

There is another way to implement Saga, based on annotations plus interceptors, which Seata does not currently implement. For example, the @SagaCompensable annotation is placed on a method one and names compensateOne as the compensation method of one. The @SagaTransactional annotation is then placed on the business process method processA to start a Saga distributed transaction. Interceptors intercept every forward method, and when an exception occurs they trigger the rollback and call the compensation method of each forward method.
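Since Seata does not ship this mode, the following Java sketch can only illustrate the idea. The annotation is declared locally and is hypothetical, a JDK dynamic proxy plays the role of the interceptor, and the runInSaga method stands in for what an interceptor around a @SagaTransactional method such as processA would do.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

// Illustration only: nothing here is an existing Seata API.
class AnnotationSagaSketch {

    /** Hypothetical annotation naming the method that compensates the annotated one. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface SagaCompensable {
        String compensationMethod();
    }

    interface Steps {
        @SagaCompensable(compensationMethod = "compensateOne")
        void one();
        void compensateOne();

        @SagaCompensable(compensationMethod = "compensateTwo")
        void two();
        void compensateTwo();
    }

    /** Wraps the target in an interceptor that records which compensations are owed. */
    static Steps intercept(Steps target, Deque<Method> owed) {
        return (Steps) Proxy.newProxyInstance(
            Steps.class.getClassLoader(),
            new Class<?>[] {Steps.class},
            (proxy, method, args) -> {
                SagaCompensable marker = method.getAnnotation(SagaCompensable.class);
                if (marker != null) {
                    owed.push(Steps.class.getMethod(marker.compensationMethod()));
                }
                try {
                    return method.invoke(target, args);
                } catch (InvocationTargetException e) {
                    throw e.getCause();    // rethrow the business exception itself
                }
            });
    }

    /** Runs the business method; on failure, calls the owed compensations in reverse order. */
    static boolean runInSaga(Steps target, Consumer<Steps> businessMethod) {
        Deque<Method> owed = new ArrayDeque<>();
        try {
            businessMethod.accept(intercept(target, owed));
            return true;
        } catch (RuntimeException businessFailure) {
            while (!owed.isEmpty()) {
                try {
                    owed.pop().invoke(target);
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException("compensation failed", e);
                }
            }
            return false;
        }
    }

    static List<String> demo() {
        List<String> log = new ArrayList<>();
        Steps impl = new Steps() {
            public void one() { log.add("one"); }
            public void compensateOne() { log.add("compensateOne"); }
            public void two() { log.add("two"); throw new IllegalStateException("two failed"); }
            public void compensateTwo() { log.add("compensateTwo"); }
        };
        runInSaga(impl, steps -> { steps.one(); steps.two(); });  // the body of processA
        return log;
    }
}
```

AnnotationSagaSketch.demo() returns [one, two, compensateTwo, compensateOne]. Note that the list of owed compensations lives only on the call stack of runInSaga, so it is lost if the process dies. This is the thread-context limitation that the comparison of the two approaches refers to.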

Compare the pros and cons of the two Saga implementations

The biggest advantage of the state machine engine is that event-driven asynchronous execution improves system throughput, and that it can satisfy service orchestration requirements. Given Saga mode’s lack of isolation, it also offers an additional recovery strategy, “forward retry”, which improves the fault tolerance of the system. Its disadvantage is that it is highly intrusive to the business.

The biggest advantage of annotations plus interceptors is that they are simple to develop with and cheap to learn. The disadvantage is that “forward retry after the fact” cannot be provided, because the thread context cannot be restored. Given the lack of isolation, having one fewer means of handling transactions can increase operation and maintenance costs.

Conclusion

Most of the time we do not need to insist on strong consistency. Based on BASE and Saga theory we can design more resilient systems and obtain better performance and fault tolerance under a distributed architecture. There is no silver bullet in distributed architecture, only solutions that suit particular scenarios. Seata Saga is in fact a product that combines “service orchestration” with “Saga distributed transactions”. Its applicable scenarios can be summarized as follows:

Handling “long transactions” under a microservice architecture;
“Service orchestration” requirements under a microservice architecture;
Business systems above the financial core that combine a large number of services (such as systems in the channel layer, product layer and integration layer);
Business processes that need to integrate services of legacy systems or external institutions (services that cannot be changed and on which no modification requirements can be imposed);

That’s all for this post. If you want to learn more about Seata, read the articles on the Seata website or look at the code in the project.

Video replay and slides for this session

tech.antfin.com/community/l…

Financial-Grade Distributed Architecture (Antfin_SOFA)