This post documents a recent business scenario from my work and a small Saga framework that I learned, understood, and implemented simply based on that scenario.
The business scenario
The business scenario, described briefly and abstractly: there are four modules A, B, C, and D, where B, C, and D are downstream services and A is the upstream caller. A request enters A from the front end. A first records the request in the database, then calls services B and C serially and service D asynchronously, and finally updates the request record in the database after the calls complete.
The most familiar e-commerce version of this is the following steps:
- The user places an order on the order platform
- Freeze/deduct the inventory
- Freeze/deduct the money
- Generate the order
As shown below:
Distributed system
Before getting into more detail about the business scenario, take a look at the diagram above. It is a simple representation of a distributed application in which an e-commerce business has been split into multiple services. In distributed invocation scenarios, network communication can fail, server nodes can misbehave, and servers can even go down. As the saying goes, in distributed applications exceptions are more common and more frequent than you might think.
So why deploy services in a distributed way when there are so many exceptions to consider? In the past, when a website's traffic was small, a single machine could handle everything. But with the rapid growth of Internet business, if 12306 could not bear the visits of millions of people during the Spring Festival travel rush, it would instantly become the target of online ridicule. So we began to pursue highly available, high-performance, easily scalable, and secure systems. That is why distributed systems exist.
Distributed systems have the following characteristics:
- High availability: when one or more server nodes fail or even go down, the system remains available. Redundant deployment across multiple nodes guarantees high availability.
- High performance: on a single server, the CPU, memory, and similar resources become the performance bottleneck. With distributed multi-server deployment, high performance is achieved by splitting services and spreading them across machines.
- Scalability: when traffic and storage requirements grow beyond what one or a few servers can handle, servers can simply be added to the cluster.
- Extensibility: this should be distinguished from scalability. Scalability is about the underlying hardware; extensibility is business-specific, i.e. making it easy to add new services or split existing ones. Modular design makes extension and module reuse easier, and as the business is split, development efficiency increases.
In addition, a distributed service call has three outcomes to deal with: success, failure, and timeout/exception. How to guarantee data consistency across them is our biggest concern: nobody wants to pay for a Spring Festival ticket only to see the ticket end up in someone else's hands. A distributed system that handles money must make sure payment and goods change hands together.
Data consistency
Consistency is the C of distributed systems. But what exactly do we mean by consistency here? Let's talk about it.
ACID
ACID refers to the four properties that a transaction must satisfy for database writes and updates to be correct and reliable: A (Atomicity), C (Consistency), I (Isolation), and D (Durability). Leaving the other three aside, the C here refers to transactional consistency; as I understand it, the other three ultimately exist to serve it. Consistency here means that the system moves from one correct state to another correct state across a transaction.
What does a "correct state" mean? Take a simple example: A and B each have 100 yuan. A transfers 50 yuan to B; afterwards A's account holds 50 yuan and B's holds 150 yuan, which is a correct state. If A then transfers 100 yuan to B, A's account becomes -50 yuan and B's becomes 250 yuan, which is an incorrect state. If a transaction would reach an incorrect state, it must be rolled back; that is consistency at the transaction level. This kind of consistency can usually be guaranteed by database constraints, but more often the business logic is complex and the constraints must be enforced at the code level.
CAP
CAP theory is usually applied to distributed systems. CAP refers to C (Consistency), A (Availability), and P (Partition tolerance). The theory holds that a distributed system can satisfy at most two of these three properties at the same time. The data consistency we care about is this consistency, and it is not the same thing as transactional consistency in ACID, so don't confuse the two.
- Consistency: all replicas of a piece of data in the distributed system hold the same value at the same time.
- Availability: when one or more nodes in the cluster fail, the system can still respond to client read and write requests.
- Partition tolerance: a partition occurs when nodes in the cluster cannot communicate with each other, or when the system cannot reach data consistency within a bounded time. Partition tolerance means the system keeps serving even after a partition occurs.
Why can only two be satisfied at once? Consider partition tolerance first: tolerating partitions means keeping multiple replicas, but the more replicas there are, the harder consistency becomes, because data synchronization takes more time. The longer synchronization takes, the more availability suffers, since the data may be unusable for a while; and insisting on full consistency stretches synchronization out so far that partition tolerance itself is no longer well satisfied. To sum up:
- Satisfying C and A but not P: partitioning is not required, so the number of nodes is limited; in other words, you give up the system's scalability, which runs contrary to the nature of distributed systems. Traditional databases such as Oracle and MySQL usually take this approach.
- Satisfying C and P but not A: availability is not required but consistency is, i.e. the data in all replicas must be identical. A partition may cause data synchronization to take a long time. This scheme sacrifices user experience: users can access the data only once all replicas agree. Distributed databases are typically in this category.
- Satisfying A and P but not C: consistency is not required. Once a partition occurs and the system stays available, each node serves requests from its locally stored data, so users may read stale data. The user experience is slightly affected, but high availability is guaranteed and no process is blocked.
Therefore, in practice we generally adopt some compromise. This is where several flavors of consistency, defined according to actual needs, come in:
- Strong consistency: once a piece of data is updated, every subsequent read returns the updated value; data must be strictly consistent across all replicas, and replicas are updated synchronously.
- Weak consistency: in contrast to strong consistency, a read after an update is not guaranteed to return the new value; replicas may be inconsistent and are updated asynchronously.
- Eventual (final) consistency: a special kind of weak consistency in which, after a brief period of inconsistency, all replicas of the system converge to the same value.
In our business scenario, upstream service A calls services B, C, and D and must guarantee eventual data consistency. For services B and C, the transaction is rolled back directly if the invocation fails. The asynchronous call to service D can tolerate a longer period of inconsistency, so we introduce a "soft state" (intermediate state) as a temporary state on the way to final consistency: while the asynchronous result has not yet returned, the request is marked with an "order generating" state.
Fault tolerance
Since our business always involves money, the first thing to think about is not making mistakes (data consistency), and the next is how to handle errors, ideally automatically, when they do happen (fault tolerance).
Fault tolerance has many scenarios to consider: the success path, automatic rollback on failure, and exceptional cases such as network or server-node timeouts that need special treatment.
The success scenario
The success scenario means that both the synchronous invocations and the asynchronous request succeed, and a response is returned to the user synchronously. Note, however, that a successful asynchronous request does not mean the order has actually been generated; we only return a response to the user without blocking the process. That is why we introduce a soft state, an intermediate state such as "order generating". Order generation may still succeed or fail, and if it fails, a rollback is also required.
Failure scenario
Three failure scenarios are shown here. For the first two, a rollback is required:
- If calling service B to freeze the inventory fails, terminate and roll back the database.
- If service C fails to freeze the money, terminate, call service B to unfreeze the inventory, and roll back the database.
For the third failure scenario (the asynchronous call fails), there are several more options:
- Roll back directly, the same way as in the previous two cases.
- Offer the user a "manual retry": once the inventory and the money have been frozen successfully, the order is highly likely to go through and an asynchronous failure is extremely unlikely, so a retry mechanism improves the user experience.
- Write the asynchronous request to a database or message queue and let a separate thread handle it and the follow-up work, including "automatic retry" and processing the success or failure of the asynchronous result. The benefit is that the code is well decoupled and the business flow stays clear; a sketch of this option follows.
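Below is a minimal sketch of that third option. An in-memory BlockingQueue stands in for the real database table or MQ topic, and all names (OrderRequest, callServiceD, markOrderGenerated, rollbackAndAlert) are hypothetical rather than part of the original system:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical request persisted by service A once B and C have succeeded.
record OrderRequest(String txId, long userId, long skuId, int attempts) {}

class AsyncOrderWorker {
    // In production this would be a database table or an MQ topic (Kafka/RabbitMQ);
    // an in-memory queue keeps the sketch self-contained.
    private final BlockingQueue<OrderRequest> queue = new LinkedBlockingQueue<>();
    private static final int MAX_ATTEMPTS = 3;

    public void submit(OrderRequest req) {
        queue.offer(req); // the caller returns to the user right away ("order generating")
    }

    public void start() {
        Thread worker = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    OrderRequest req = queue.take();
                    if (callServiceD(req)) {
                        markOrderGenerated(req.txId());      // final state: order generated
                    } else if (req.attempts() + 1 < MAX_ATTEMPTS) {
                        // automatic retry: requeue with an incremented attempt counter
                        queue.offer(new OrderRequest(req.txId(), req.userId(),
                                req.skuId(), req.attempts() + 1));
                    } else {
                        rollbackAndAlert(req.txId());        // give up: roll back B/C, alert
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "async-order-worker");
        worker.setDaemon(true);
        worker.start();
    }

    private boolean callServiceD(OrderRequest req) { return true; } // stub for the real call
    private void markOrderGenerated(String txId) { }                // stub
    private void rollbackAndAlert(String txId) { }                  // stub
}
```

In a real deployment the queue would be a persistent store or an MQ topic, so pending requests also survive a restart.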
Timeout/exception scenario
The causes of timeouts vary: network jitter, server-node failure, even a server crash. When timeouts occur, programmers are always hunting for the cause. A particular difficulty is that timeouts easily leave state inconsistent: when service A's call to service B times out, A does not know whether B actually received the request. There are two ways to handle this. One is to require B to implement an idempotent "status query interface" so that A can handle the timeout by querying and retrying; the other is a "forced rollback". If either of these times out yet again, manual intervention is required, so a manual-handling path must be reserved for O&M staff.
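A sketch of the first option, assuming service B exposes an idempotent freeze call plus a status-query interface (the names freezeInventory, queryStatus, and alertOperations are illustrative):

```java
import java.util.concurrent.TimeoutException;

class TimeoutHandler {
    enum Status { FROZEN, NOT_RECEIVED, UNKNOWN }
    private static final int MAX_RETRIES = 3;

    public boolean freezeWithTimeoutHandling(String txId) {
        for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
            try {
                return freezeInventory(txId);           // must be idempotent on B's side
            } catch (TimeoutException e) {
                Status s = queryStatus(txId);           // B's status-query interface
                if (s == Status.FROZEN) return true;    // B actually processed the request
                // NOT_RECEIVED or UNKNOWN: retry; a forced rollback is the other option
            }
        }
        alertOperations(txId);                          // retries exhausted: manual handling
        return false;
    }

    private boolean freezeInventory(String txId) throws TimeoutException { return true; } // stub
    private Status queryStatus(String txId) { return Status.FROZEN; }                     // stub
    private void alertOperations(String txId) { }                                         // stub
}
```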
Downtime scenario
An application cannot survive a sudden power outage, a pulled plug, or a malicious process kill; once that happens, all state and data held in memory are gone. If service A has already called service B successfully to freeze the inventory, but the power goes out before it calls service C, the data across services is bound to be inconsistent: the inventory has been deducted, the money has not, and no order has been generated. Isn't that inventory inexplicably deducted for nothing? Therefore, the on-site data must be persisted before each service call, so that it can be rolled back after a restart.
After the system restarts from downtime, a thread is started to read this data and roll back based on it. The record therefore needs to contain several pieces of information: the module being called, the parameters passed, and whether the record has already been used for rollback (already handled vs. still pending).
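A minimal sketch of what the persisted record and the recovery thread could look like; the field names and the PENDING/ROLLED_BACK states are illustrative assumptions:

```java
import java.util.List;

// Illustrative shape of the "on-site data" persisted before each downstream call.
class CallRecord {
    String txId;          // transaction ID
    String calledModule;  // which downstream service was being called (B, C or D)
    String params;        // serialized call parameters, enough to undo the call
    String state;         // PENDING / ROLLED_BACK / DONE
}

class CrashRecovery {
    // Runs once on startup: any record still PENDING was interrupted by the crash
    // and must be compensated before normal traffic resumes.
    void recover() {
        List<CallRecord> pending = loadRecordsByState("PENDING"); // read from the database
        for (CallRecord r : pending) {
            rollback(r);                                          // undo the recorded call
            markState(r.txId, "ROLLED_BACK");
        }
    }

    List<CallRecord> loadRecordsByState(String state) { return List.of(); } // stub
    void rollback(CallRecord r) { }                                         // stub
    void markState(String txId, String state) { }                          // stub
}
```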
Plan 1: TCC
Initially we intended to use the TCC pattern for the calls between microservices in the business scenario described above. So what is TCC?
TCC stands for the initials of three words: Try, Confirm, and Cancel. A light-hearted analogy: you and your friends plan to eat crayfish after work. Try is the tentative step, checking whether everyone is free and whether a table can be reserved. Confirm is when the friends confirm the reservation and everyone leaves work at 5:30 on the dot to have dinner. And if you have to stay late to fix a bug, or your friends are working overtime, or you didn't get a reservation, then there are no crayfish: that is Cancel.
Without further ado, let's combine the actual business scenario with plain words and diagrams to make it clear.
Normal logic
Phase 1: Try
In the Try phase, the caller calls multiple callees, and Confirm is performed only when all callees return success; otherwise Cancel is performed. In other words, Try is a preparatory operation: it locks a resource and freezes part of the data before the actual deduction.
Phase 2: Confirm
Confirm is the follow-up to a successful Try: the frozen or pre-occupied data is actually deducted (deducted in this scenario, though it may just as well be added in others). Two cases need to be considered.
In the first case, Confirm succeeds, the calls to services B and C are complete, and the data is consistent.
In the second case, some or all Confirms fail or time out. Unlike in the Try phase, this does not trigger a Cancel rollback; instead, the failed Confirms are retried. Note that the called party must implement an idempotent mechanism, otherwise retrying is not safe. You also need to cap the number of retries: if Confirm still fails or times out after the limit is exceeded, manual handling must be notified.
Phase 3: Cancel
Cancel is performed only if the Try phase fails or times out abnormally, and it is performed not against all callees but only against those whose Try already succeeded. As in the Confirm phase, Cancel also needs a retry mechanism, and if the retries exceed a certain number, an alarm should be raised for manual handling. A caller-side sketch of the three phases follows.
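A rough sketch of what the caller-side flow across the three phases might look like. The TccParticipant interface and the retry helper are assumptions made for illustration, not the API of any of the frameworks mentioned below:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Assumed participant contract: each called service exposes try/confirm/cancel.
interface TccParticipant {
    boolean tryReserve(String txId);  // phase 1: freeze / pre-occupy resources
    boolean confirm(String txId);     // phase 2: really deduct (retried until success)
    boolean cancel(String txId);      // phase 2: release the frozen resources
}

class TccCoordinator {
    private static final int MAX_RETRIES = 3;

    public boolean execute(String txId, List<TccParticipant> participants) {
        List<TccParticipant> tried = new ArrayList<>();
        for (TccParticipant p : participants) {
            if (p.tryReserve(txId)) {
                tried.add(p);
            } else {
                // Try failed: cancel only the participants whose Try already succeeded.
                tried.forEach(t -> retry(() -> t.cancel(txId)));
                return false;
            }
        }
        // All tries succeeded: confirm everyone, retrying instead of rolling back.
        participants.forEach(p -> retry(() -> p.confirm(txId)));
        return true;
    }

    private void retry(BooleanSupplier op) {
        for (int i = 0; i < MAX_RETRIES; i++) {
            if (op.getAsBoolean()) return;
        }
        alertManualIntervention();   // retries exhausted: notify a human
    }

    private void alertManualIntervention() { }   // stub
}
```

The key points mirror the text: Cancel is sent only to participants that already tried successfully, and Confirm/Cancel are retried rather than rolled back, falling back to manual intervention when retries run out.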
Think further
At this point you basically know what the TCC pattern is about. For implementation, open-source TCC frameworks such as ByteTCC, Hmily, TCC-Transaction, and Seata can be used, or you can write your own. But the implementation details of such a framework are still quite complex: how to track the execution of each phase, how to advance to the next phase, how to implement the retry mechanism, the alarm system, manual intervention, downtime rollback, and so on all need to be thought through.
Interface modification
In addition, if the TCC framework is adopted, all services should adopt it uniformly. For a called party, the original interface has to be transformed into three interfaces: Try, Confirm, and Cancel.
- The "Try" interface must freeze or pre-occupy the data, save a call record for later use, and support idempotence.
- The "Confirm" interface must actually deduct (or add) the frozen and pre-occupied data and be idempotent, returning the same result across repeated calls. (Generally Confirm succeeds; the usual problem is that the call times out and has to be retried. If a failure result is explicitly returned, manual intervention is most likely required, or a rollback may be allowed depending on the situation.)
- The "Cancel" interface must release the frozen or pre-occupied data, i.e. perform a rollback, returning the called party to its state before the call so the data stays consistent. Cancel likewise needs idempotence, allows retries on timeout, and falls back to manual intervention when a failure is explicitly returned. A sketch of what such a transformed interface might look like follows.
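As an illustration (not taken from the original system), here is how an inventory service's single "deduct" interface might be split into the three TCC interfaces; the names and the frozen-quantity idea are assumptions:

```java
// Before TCC the inventory service exposed roughly one interface:
//   void deduct(String skuId, int count);
// Under TCC it is split into three, each keyed by the transaction ID so that
// calls can be deduplicated (idempotence) and records kept for rollback.
interface InventoryTccService {

    // Try: check stock and move `count` from `available` into `frozen`,
    // recording (txId, skuId, count) in the transaction control table.
    boolean tryFreeze(String txId, String skuId, int count);

    // Confirm: subtract the frozen amount for txId from `frozen`.
    // Must be idempotent: a repeated confirm for the same txId returns the same result.
    boolean confirm(String txId);

    // Cancel: move the frozen amount for txId back into `available`.
    // Must also tolerate "empty rollback" (a cancel arriving before any try).
    boolean cancel(String txId);
}
```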
Idempotent mechanism
The reason the idempotent mechanism is so important is that the system may have to resend requests because of timeouts. The called party must guarantee idempotence so that repeated calls neither waste system resources nor prevent Confirm and Cancel from being executed correctly.
Idempotence means that one request and the same request repeated many times have the same effect on system resources and return the same result. To implement it, a "transaction control table" is usually introduced to record request data. Every request carries a transaction ID, a resent request carries the same transaction ID, and comparing transaction IDs is enough to achieve idempotence.
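A minimal sketch of that idea, with an in-memory map standing in for the real transaction control table (names are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class IdempotentConfirm {
    // Stand-in for the "transaction control table": transaction ID -> recorded result.
    private final Map<String, Boolean> controlTable = new ConcurrentHashMap<>();

    public boolean confirm(String txId) {
        // A resent request carries the same transaction ID, so the deduction runs at most
        // once; repeated calls just return the result recorded on the first call.
        return controlTable.computeIfAbsent(txId, this::doDeduct);
    }

    private boolean doDeduct(String txId) { return true; }  // stub for the real deduction
}
```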
Other Exception Handling
In the phases above, we only considered the various results returned by the called party. Exceptions on the calling side also need separate handling:
- An error, exception, or timeout during the Try phase requires a Cancel rollback. Two special cases need to be considered here:
  - The called party receives a Cancel but never received a Try: it should not do any real work, i.e. an "empty rollback";
  - The called party receives a Cancel and then receives the late Try: it must recognize that a Cancel has already been performed for this transaction and refuse to process the late Try. This is the so-called "anti-suspension" control.
  - Both empty rollback and anti-suspension control can be implemented with the transaction control table; see the sketch after this list.
- Exceptions in the Confirm or Cancel phases need to be retried or handled manually.
- If an exception occurs after the Confirm phase is over, whether to roll back depends on the situation: for an abnormal asynchronous invocation a retry mechanism may be enough, while a failed database operation most likely requires a rollback.
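A sketch of how the transaction control table can cover both empty rollback and anti-suspension on the called party's side. In a real system the check-and-record would be a single atomic database operation; all names here are illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class TccCalledParty {
    // Transaction control table: txId -> phase recorded for it.
    private final Map<String, String> controlTable = new ConcurrentHashMap<>();

    public boolean tryFreeze(String txId) {
        // Anti-suspension: if a cancel for this txId already arrived, refuse the late try.
        if ("CANCELLED".equals(controlTable.get(txId))) {
            return false;
        }
        controlTable.put(txId, "TRIED");
        return doFreeze(txId);
    }

    public boolean cancel(String txId) {
        String previous = controlTable.putIfAbsent(txId, "CANCELLED");
        if (previous == null) {
            // Empty rollback: cancel arrived but no try was ever received; do nothing
            // except record the cancel so that a late try will be rejected.
            return true;
        }
        controlTable.put(txId, "CANCELLED");
        return doUnfreeze(txId);
    }

    private boolean doFreeze(String txId) { return true; }    // stub
    private boolean doUnfreeze(String txId) { return true; }  // stub
}
```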
Also, as already considered in the downtime scenario, it is important to persist the on-site data, including the state of each phase of the TCC transaction, so that an outage can be recovered from.
PS: think about how to keep a TCC-based system highly available in production rather than relying on frequent manual intervention. The main approach the author has used in practice is to build on highly available MQ middleware such as Kafka or RabbitMQ.
Plan 2: Saga
In my view, the Saga pattern is roughly TCC without the Confirm phase, although that is not entirely accurate: Saga's first step is no longer a Try, since it does not pre-occupy or freeze data but deducts/adds the data directly.
Saga theory comes from Hector Garcia-Molina and Kenneth Salem's 1987 paper Sagas, whose most important idea is the compensation protocol:
Compensation protocol: in Saga mode a distributed transaction has multiple participants, and each participant is a forward service with a matching compensation service. In the figure above, T1~Tn are the "forward calls" and C1~Cn the "compensation calls", with a one-to-one correspondence between them. Suppose there are n called services: T1 calls service one, T2 calls service two, and T3 calls service three. If, say, T3 returns a failure, a rollback is required: C2, the compensation of T2, is called, then C1, the compensation of T1, returning the distributed transaction to its initial state.
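A minimal orchestration sketch of that protocol: forward calls run in order, and on the first failure the compensations of the steps that already completed run in reverse. The SagaStep abstraction and the orchestrator are assumptions for illustration, not any framework's API:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Assumed abstraction of one participant: a forward call Ti and its compensation Ci.
interface SagaStep {
    boolean forward(String txId);      // Ti: commit the local change directly (no freeze)
    boolean compensate(String txId);   // Ci: undo the change made by forward
}

class SagaOrchestrator {
    public boolean execute(String txId, List<SagaStep> steps) {
        Deque<SagaStep> completed = new ArrayDeque<>();   // stack: last completed on top
        for (SagaStep step : steps) {
            if (step.forward(txId)) {
                completed.push(step);
            } else {
                // Ti failed: run C(i-1)..C1 in reverse order to return to the initial state.
                while (!completed.isEmpty()) {
                    SagaStep done = completed.pop();
                    if (!done.compensate(txId)) {
                        alertManualIntervention(txId);    // compensation failed: human takes over
                        return false;
                    }
                }
                return false;
            }
        }
        return true;
    }

    private void alertManualIntervention(String txId) { }   // stub
}
```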
Saga Application Scenario
- Saga is a "long transaction" solution, better suited to business processes that are long and have many steps.
- If the participants include other companies' services or legacy systems that cannot provide the three TCC interfaces, Saga is required.
- Typical business systems: financial institutions integrating with external systems, channel integration (long processes), and services in distributed architectures;
- It is used especially widely by banks and financial institutions.
Similarities and differences between Saga and TCC
From the description above, you probably feel that Saga is very similar to TCC. If TCC already exists, why Saga? They do differ, mainly in the scenarios they suit.
Data isolation
Returning to our business scenario, suppose service A calls services B and C and the following happens:
- 1) Service B's try succeeds, but service C's try fails;
- 2) A user reads the data of services B and C;
- 3) A, B, and C are rolled back.
Whether under TCC or Saga, the failed try means entering the cancel (compensation) phase. Unfortunately, the user read the data of B and C before the rollback was performed, and in this situation TCC and Saga return different results.
Under TCC, the try is only a freeze operation: it touches only a "frozen" field, which does not affect the actual data the user queries, so the user still sees a correct result.
Under Saga, the first step operates on the data directly and changes the very data the user wants to read, so the user is handed back dirty data. Such dirty reads show that data under the Saga pattern is not isolated.
For Saga, the isolation problem is essentially about controlling concurrency, so it falls back on the business: concurrency control has to be implemented in the business logic layer. It can be handled by locking at the application layer or by serialization at the session level, but either choice costs some performance and throughput.
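One possible shape of such application-layer locking, sketched with a per-key lock registry (entirely illustrative): a Saga transaction and a user read on the same order would both go through withLock, so they cannot interleave.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

class BusinessLockRegistry {
    private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    // Run an action while holding the lock for one business key (e.g. the order ID),
    // so a user read cannot interleave with the Saga's forward/compensate calls.
    public <T> T withLock(String businessKey, Supplier<T> action) {
        ReentrantLock lock = locks.computeIfAbsent(businessKey, k -> new ReentrantLock());
        lock.lock();
        try {
            return action.get();
        } finally {
            lock.unlock();
        }
    }
}
```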
In addition, since Saga does not guarantee isolation, business design should follow the principle of "rather long than short" where money is concerned: it is better for an account to end up with too much money, which can still be recovered, than too little, which simply vanishes. So when designing the business, always record the credit before performing the deduction, to ensure that even without data isolation no money is lost.
Two-phase vs. one-phase commit
As described above, Saga's main shortcoming compared with TCC is isolation, because TCC commits the transaction in two phases while Saga commits it in one.
For TCC, phase one pre-occupies and locks resources, and phase two uses or releases them.
For Saga, the single phase commits the transaction directly. Compared with TCC's two-phase commit, a one-phase commit is lock-free and can be executed asynchronously in an event-driven way, which suits long-process businesses; asynchronous execution also means higher throughput.
The price is the isolation problem that one-phase commit brings. And whereas TCC can lean on its retry mechanism when a cancel (compensation) fails, Saga needs to provide additional handling, such as manual intervention.
Conclusion
TCC framework:
- It is more intrusive: the called party must provide the Try, Confirm, and Cancel interfaces;
- No data isolation issues;
- Two-phase transaction commit; although there is locking, concurrency can be improved through business-level locks.
Saga framework:
- Suitable for scenarios where the called interfaces cannot be modified;
- Data isolation problems may occur, such as dirty reads, lost updates, and fuzzy (non-repeatable) reads;
- One-phase transaction commit with no locks, suitable for long-process business logic;
- Suited to event-driven asynchronous execution, with higher throughput;
- When compensation fails, additional exception handling such as manual intervention is required.
They also have much in common: the two can borrow from each other in idempotence, empty rollback, anti-suspension design, business locks, and exception handling.
Simple Saga framework implementation
In the end we chose Saga, mainly because it does not require transforming the downstream interfaces and is simple to implement. Since our business has only a few process nodes, we did not adopt the event-driven asynchronous design; instead we implemented the forward services with an "annotation + interceptor intercepting the business" approach.
- The basic approach is to annotate the business-process code with an @SagaTransactional annotation; the logic behind the annotation rolls back when an exception is caught inside the annotated process code.
- A generic message structure, SagaAction, is defined. It holds the information that needs to be written to the Saga transaction control table in the database (transaction ID, business type, business information, downstream system ID, the forward message sent to the downstream system, the compensation message sent to the downstream system, and the transaction state), plus customizable flags: for example whether to roll back manually (for scenarios that need a rollback without an exception being thrown), the timeout-handling policy (resend or roll back after a timeout), and the exception-handling policy (resend or roll back after an exception).
- Every request is persisted in the database, so that after a system crash the framework can roll back or retry by reading the stored records.
- A fallback ("bottom-saving") mechanism is provided. It is registered at the start of the business process; after an exception is thrown and the Saga framework has rolled back automatically, the fallback program runs, mainly to adjust the intermediate state or otherwise ensure data consistency.
- A Saga transaction exception-handling mechanism is provided: when compensation fails or times out abnormally, follow-up actions such as alarming for manual intervention or automatic resending can be customized. A simplified sketch of the overall idea follows this list.
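A much simplified sketch of the "annotation + interceptor" idea, not the production implementation: the annotation name matches the description above, while SagaContext, the compensation stack, and the fallback Runnable are illustrative stand-ins for the SagaAction records and the control table.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayDeque;
import java.util.Deque;

// Marker annotation placed on the business-process method, as described above.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface SagaTransactional { }

// Per-transaction context: every successful forward call registers its compensation here.
class SagaContext {
    final String txId;
    final Deque<Runnable> compensations = new ArrayDeque<>();
    SagaContext(String txId) { this.txId = txId; }
    void registerCompensation(Runnable c) { compensations.push(c); }
}

// The "interceptor": in a real setup this would be AOP around-advice that finds the
// @SagaTransactional annotation; here it is a plain wrapper for clarity.
class SagaInterceptor {
    public void invoke(SagaContext ctx, Runnable businessProcess, Runnable fallback) {
        try {
            businessProcess.run();                  // forward calls register compensations
        } catch (RuntimeException e) {
            while (!ctx.compensations.isEmpty()) {  // automatic rollback on any exception
                ctx.compensations.pop().run();
            }
            fallback.run();                         // fallback program runs after the rollback
            throw e;
        }
    }
}
```

In the real framework the compensations and states would be persisted through the transaction control table rather than kept in memory, so they survive a crash.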
Review
To sum up, this is a simple record of studying and practicing TCC and Saga in the course of my work. It first introduced the relevant theory of distributed systems, and then focused on the business scenario and on how TCC and Saga implement distributed transactions. In practice, you can choose for yourself according to your needs and your business. Due to confidentiality, I cannot share the source code, but I will write a demo for reference later; if you spot errors, corrections are welcome!
Reference:
https://www.sofastack.tech/blog/sofa-meetup-3-seata-retrospect/