Distributed transactions: Based on reliable messaging services

Microservices advocate splitting a complex monolithic application into several simple, loosely coupled services, but this introduces the problem of distributed transactions across those services.

As we all know, a database can implement local transactions: within a single database it guarantees atomicity, meaning the operations either all succeed or all fail. The previous article also described a simple multi-data-source transaction solution (similar to 2PC).

However, today's systems tend to use a microservices architecture, and each business system has its own database, so transactions need to span multiple databases. These are called "distributed transactions".

The commonly used solutions to this problem are:

2PC/3PC (two-phase commit / three-phase commit protocols)
TCC (Try-Confirm-Cancel, compensation-based)
Distributed transactions based on a reliable message service (asynchronous assurance)

The whole process

Following the reliable-message-service approach, I implemented a distributed transaction middleware on top of RabbitMQ: Shine-MQ.

It was originally intended only to wrap MQ operations for ease of use; distributed transaction capabilities were added in later iterations. Here is how the middleware works:

Before service A processes task A, it sends a prepare record carrying a check ID to the Coordinator to start the distributed task.
The Coordinator persists the prepare record and then responds to service A.
After receiving the acknowledgement, service A processes task A and, once task A succeeds, sends a ready record. The Coordinator deletes the prepare record and persists the ready record together with the complete message.
Upon receiving the response confirming that the ready record and the message have been persisted, service A can publish the message to the messaging middleware. With RabbitMQ, you can call setPublisherConfirms(true) and implement setConfirmCallback so that the messaging middleware replies to service A once the message is persisted. Service A can then delete the previous ready record and move on to other tasks.
Once the messaging middleware has persisted the message (RabbitMQ can be made highly available via mirrored queues), it can deliver the message to service B.
Service B consumes the message and processes task B successfully, then returns an acknowledgement to the messaging middleware to indicate that the message has been consumed. At this point the distributed transaction is complete.
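
The steps above can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions, not Shine-MQ's actual code: InMemoryCoordinator, InMemoryBroker, run_service_a and all method names are hypothetical, and plain Python containers stand in for the real Coordinator storage and for RabbitMQ.

```python
class InMemoryCoordinator:
    """Hypothetical stand-in for the Coordinator; Python containers replace real storage."""

    def __init__(self):
        self.prepare = set()  # check IDs that have a persisted prepare record
        self.ready = {}       # check ID -> complete message

    def set_prepare(self, check_id):
        self.prepare.add(check_id)

    def set_ready(self, check_id, message):
        # Delete the prepare record, then persist the ready record and the message.
        self.prepare.discard(check_id)
        self.ready[check_id] = message

    def delete_ready(self, check_id):
        self.ready.pop(check_id, None)


class InMemoryBroker:
    """Hypothetical stand-in for the messaging middleware."""

    def __init__(self):
        self.queue = []

    def publish(self, check_id, message, confirm_callback):
        self.queue.append((check_id, message))  # "persist" the message
        confirm_callback(check_id, True)        # confirm back to service A

    def deliver(self, consumer):
        # A message leaves the queue only after the consumer acknowledges it.
        while self.queue:
            _, message = self.queue[0]
            if not consumer(message):
                break
            self.queue.pop(0)


def run_service_a(coordinator, broker, check_id, task, message):
    coordinator.set_prepare(check_id)         # prepare record before task A
    task()                                    # task A itself
    coordinator.set_ready(check_id, message)  # ready record after task A succeeds

    def on_confirm(confirmed_id, ack):
        if ack:  # the broker has persisted the message
            coordinator.delete_ready(confirmed_id)

    broker.publish(check_id, message, on_confirm)


coordinator = InMemoryCoordinator()
broker = InMemoryBroker()
done = []


def service_b(message):
    done.append("task B: " + message)
    return True  # acknowledgement to the broker


run_service_a(coordinator, broker, "check-1", lambda: done.append("task A"), "order created")
broker.deliver(service_b)
```

After the run, both Coordinator stores and the broker queue are empty, which corresponds to the point where the distributed transaction is complete.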

The above is the whole process. After service A completes task A, there is a certain delay before task B completes. During this window the data in the overall system is inconsistent, but this temporary inconsistency is acceptable: after a short time the system becomes consistent again, which satisfies the BASE theory.

The BASE theory:

BA: Basically Available
S: Soft State: different copies of the same data do not need to be consistent in real time.
E: Eventual Consistency: different copies of the same data do not need to be consistent in real time, but they must become consistent after a certain period of time.

Abnormal situation

The above is the ideal process, but many things can go wrong in a real environment. For example, if task A fails, the system needs to enter the rollback process.

Service A can catch an exception in task A directly. It rolls back task A and then deletes the prepare record. Once the record is deleted, service A considers the rollback complete and can move on to other tasks.

If the message is never delivered to the messaging middleware, service B is not affected. Because neither task A nor task B has taken effect, the system is in a consistent state again.

The Coordinator is exposed as an interface. By default I implement it with Redis, but you can implement the interface yourself if you want to use another storage method.

The image above shows a failure to send the ready record. In this case, service A receives an exception or no response, and it can roll back task A directly. The rollback also triggers the deletion of the ready record. If the exception occurs when the prepare record is sent, service A is unaffected, because this happens before the task is executed.
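
The two failure cases above (task A fails, or the ready record cannot be sent) can be sketched as one function. This is a minimal sketch under assumptions: run_task_a, its parameters and the dict used as record storage are hypothetical names for illustration, not part of the middleware.

```python
def run_task_a(records, check_id, task, rollback, send_ready):
    """Run task A; on any failure, roll back and delete the record for check_id."""
    records[check_id] = "prepare"
    try:
        task()               # may raise: task A failed
        send_ready(check_id) # may raise: the Coordinator did not accept the ready record
    except Exception:
        rollback()                   # undo task A's local changes
        records.pop(check_id, None)  # delete whatever record exists for this check ID
        return False
    records[check_id] = "ready"
    return True


def failing_task():
    raise RuntimeError("task A failed")


def failing_send(check_id):
    raise ConnectionError("coordinator unreachable")


records = {}
rolled_back = []

ok_task_failure = run_task_a(
    records, "check-1", failing_task, lambda: rolled_back.append("check-1"), lambda cid: None
)
ok_send_failure = run_task_a(
    records, "check-2", lambda: None, lambda: rolled_back.append("check-2"), failing_send
)
ok_success = run_task_a(
    records, "check-3", lambda: None, lambda: rolled_back.append("check-3"), lambda cid: None
)
```

Only the successful run leaves a record behind; in both failure cases no message ever reaches the messaging middleware, so service B is untouched.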

Having analyzed the situations between service A, the Coordinator, and the messaging middleware, let's now look at some special situations between the messaging middleware and service B.

After the message is successfully published to the messaging middleware, service A can get on with its other work, and the messaging middleware ensures that the message is delivered to service B. This is the reliability guarantee the messaging middleware provides for message delivery. Specifically, after delivering the message to the downstream system, the messaging middleware blocks and waits; the downstream system processes the task immediately and returns an acknowledgement once the task is done. When the messaging middleware receives the acknowledgement, it considers the transaction complete. If a message is lost during delivery, or its acknowledgement is lost on the way back, the messaging middleware waits for the acknowledgement to time out and then redelivers, repeating until the downstream consumer returns a successful consumption response.

The number of retries and the retry interval can be configured, and a dead-letter queue can be used if the message keeps failing. See the following figure for details:

When a message remains unconsumed and exceeds the configured retry threshold, it is delivered to the dead-letter queue, whose Exchange and routeKey default to the values set in @distributedTrans.

This exception can be handled by consuming the dead-letter queue messages (for example with SMS or email alerts followed by manual intervention). Rolling back service A is not implemented for now, because requiring service A to provide a rollback interface in advance would undoubtedly increase development cost and the complexity of the business system. The design goal of a business system is to minimize complexity while ensuring performance, and thereby reduce operation and maintenance costs.
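
The retry-then-dead-letter behavior can be sketched as follows. This is a simplified, hypothetical model rather than RabbitMQ's or Shine-MQ's actual mechanism: deliver_with_retry and its parameters are assumed names, the consumer is a plain function whose return value is the acknowledgement, and a list stands in for the dead-letter queue.

```python
import time


def deliver_with_retry(message, consumer, max_retries, dead_letters, interval_seconds=0.0):
    """Return the attempt number that was acknowledged, or None if dead-lettered."""
    for attempt in range(1, max_retries + 1):
        try:
            acked = consumer(message)
        except Exception:
            acked = False  # a crashed consumer counts as "no acknowledgement"
        if acked:
            return attempt
        if attempt < max_retries:
            time.sleep(interval_seconds)  # configurable retry interval
    dead_letters.append(message)  # retry threshold exceeded
    return None


attempts = {"count": 0}


def flaky_service_b(message):
    # Fails twice, then acknowledges on the third delivery.
    attempts["count"] += 1
    return attempts["count"] >= 3


dead_letters = []
first = deliver_with_retry("order created", flaky_service_b, 5, dead_letters)
second = deliver_with_retry("order cancelled", lambda message: False, 3, dead_letters)
```

Here the first message is acknowledged on its third delivery, while the second exhausts its three attempts and ends up in dead_letters, where an alerting consumer could pick it up.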

Design ideas

Finally, let's summarize the design ideas behind the middleware.

Some exceptions have been analyzed above. Atomicity between the downstream service and the messaging middleware is guaranteed by reliable message delivery (that is, ACK mode with retries on failure or on a missing reply). So to implement distributed transactions, all that remains is to ensure atomicity between the task performed by the upstream service and the delivery of the message to the messaging middleware.

For this there are generally two schemes: synchronous and asynchronous communication. As the earlier sequence diagram shows, the upstream system and the messaging middleware communicate asynchronously. That is, after submitting the message, the upstream service goes on to do other things; the subsequent commit or rollback is left entirely to the messaging middleware, which is fully trusted to complete the commit or rollback correctly. This is mainly to improve the concurrency of the system. In addition, the business system deals directly with users, so user experience is particularly important, and asynchronous communication greatly reduces the time users spend waiting.

RabbitMQ provides a transaction mechanism through txSelect(), txCommit() and txRollback(). In a simple test it took up to 300 ms, which is too time-consuming. So instead of using it, I introduced a Coordinator to do the job.

Another key component is a daemon (a daemon thread) that handles records that have timed out in the Coordinator (similar to RocketMQ's timeout inquiry mechanism). Therefore, in addition to implementing the normal service flow, service A needs to provide a transaction query interface that the Coordinator can call, to cover the case where service A is interrupted while performing its task. A prepare timeout triggers a call to this callback interface, which returns one of three results:

Commit: deliver the message.
Rollback: simply discard the message.
Processing: continue to wait and reset the timer.

A ready record that has timed out is picked up and sent directly to the messaging middleware, because once the ready record has been persisted to the Coordinator, service A has completed its task.

This ensures atomicity between the upstream service and the messaging middleware (see Distributed transactions: reliable delivery of messages). Combined with reliable delivery from the messaging middleware to the downstream service, the distributed transaction is complete.
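
One pass of such a timeout daemon could look roughly like this. It is a sketch under assumptions, not the middleware's real implementation: CheckResult, scan_timeouts, the dict layouts, and the idea that the callback returns the message together with its decision are all hypothetical choices made for illustration.

```python
from enum import Enum


class CheckResult(Enum):
    COMMIT = "commit"
    ROLLBACK = "rollback"
    PROCESSING = "processing"


def scan_timeouts(prepare, ready, now, timeout, check_callback, publish):
    """prepare: check ID -> start time; ready: check ID -> (stored time, message)."""
    for check_id, started_at in list(prepare.items()):
        if now - started_at < timeout:
            continue  # not timed out yet
        result, message = check_callback(check_id)  # ask service A what happened
        if result is CheckResult.COMMIT:
            if publish(check_id, message):  # keep the record if publishing fails
                del prepare[check_id]
        elif result is CheckResult.ROLLBACK:
            del prepare[check_id]           # discard; nothing is delivered
        else:
            prepare[check_id] = now         # still processing: reset the timer
    for check_id, (stored_at, message) in list(ready.items()):
        # Task A is known to be done, so a timed-out ready record is simply sent.
        if now - stored_at >= timeout and publish(check_id, message):
            del ready[check_id]


published = []


def publish(check_id, message):
    published.append((check_id, message))
    return True  # pretend the broker confirmed the message


answers = {
    "a": (CheckResult.COMMIT, "msg-a"),
    "b": (CheckResult.ROLLBACK, None),
    "c": (CheckResult.PROCESSING, None),
}
prepare = {"a": 0, "b": 0, "c": 0, "d": 95}
ready = {"r": (0, "msg-r")}

scan_timeouts(prepare, ready, now=100, timeout=10,
              check_callback=answers.__getitem__, publish=publish)
```

In this run, record "d" has not timed out and is left alone, "c" gets a fresh timer, and only the committed prepare record and the timed-out ready record are published.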

If this helped you, please give the project a star ^.^

Github address: github.com/7le/shine-m…
