Introduction
With distributed systems and microservice architectures now the norm, failed calls between services are a routine occurrence. How to handle these failures and how to keep data consistent are hard problems in microservice design.
Solutions vary according to service scenarios. Common solutions are as follows:
- Blocking retry;
- 2PC and 3PC traditional transactions;
- Queues with asynchronous background processing;
- TCC compensating transactions;
- Local message tables (asynchronous assurance);
- MQ transactions.
This article focuses on the items other than 2PC and 3PC. Traditional 2PC and 3PC transactions are already covered extensively elsewhere, so they are not repeated here.
Blocking retry
Blocking retry is a common approach in microservice architectures.
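Pseudocode example:
m := db.Insert(sql)
err := requestWithRetry("B-Service", m)

// requestWithRetry POSTs body to url and retries up to three times.
// It returns the last error if every attempt fails.
func requestWithRetry(url string, body interface{}) error {
    var err error
    for i := 0; i < 3; i++ {
        _, err = request.POST(url, body)
        if err == nil {
            return nil
        }
        log.Print(err)
    }
    return err
}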
As shown above, when a request to service B's API fails, it is retried up to three times. If all three attempts fail, the error is logged, and the caller either continues or returns the error to the upper layer.
- Originally published in Code Ape Technology
This approach brings the following problems:
- Service B actually processes the request successfully, but the current service treats the call as failed because of a network timeout and retries it. Service B then ends up with two identical records.
- The call fails because service B is unavailable, and it still fails after three attempts. The record that the current service inserted into the DB earlier in the code becomes dirty data.
- Retries increase the latency seen by upstream callers, and if the downstream service is already under heavy load, retries amplify that load.
First problem: this is solved by making service B's API idempotent.
Second problem: the data can be corrected by a scheduled script running in the background, but this is not a good approach.
Third problem: this is the price paid for improving consistency and availability through blocking retry.
Blocking retry suits scenarios where the business is not sensitive to consistency. If data consistency is required, additional mechanisms must be introduced.
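The idempotency requirement on service B can be illustrated with a minimal sketch. It assumes the caller attaches a unique request ID and reuses it on every retry; the Order type, the RequestID field, and the in-memory Store are all hypothetical stand-ins for service B's real API and database.

```go
package main

import (
	"fmt"
	"sync"
)

// Order is a hypothetical payload sent to service B.
type Order struct {
	RequestID string // unique per logical request, reused on every retry
	Item      string
}

// Store is an in-memory stand-in for service B's database.
type Store struct {
	mu   sync.Mutex
	seen map[string]bool // request IDs that were already applied
	rows []Order
}

func NewStore() *Store {
	return &Store{seen: make(map[string]bool)}
}

// Insert applies the order at most once per RequestID.
// It reports whether a new row was written.
func (s *Store) Insert(o Order) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.seen[o.RequestID] {
		return false // duplicate retry: acknowledge without writing again
	}
	s.seen[o.RequestID] = true
	s.rows = append(s.rows, o)
	return true
}

// Count returns the number of stored rows.
func (s *Store) Count() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.rows)
}

func main() {
	s := NewStore()
	o := Order{RequestID: "req-1", Item: "book"}
	// Simulate a caller that timed out and sent the same request three times.
	for i := 0; i < 3; i++ {
		s.Insert(o)
	}
	fmt.Println("rows stored:", s.Count())
}
```

In a real service the set of seen IDs would normally live in the database, for example as a unique index on the request ID, so that the check and the write happen atomically.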
Asynchronous queue
Introducing a queue is a common and natural next step in evolving the solution. For example:
m := db.Insert(sql)
err := mq.Publish("B-Service-topic",m)
After the current service writes data to the DB, it pushes a message to MQ, and a separate service consumes the message and runs the business logic. Compared with blocking retry, this relies on MQ, which is far more stable than an ordinary business service. Even so, the call that pushes the message to MQ can still fail, for example because of a network problem or because the current service crashes. The result is the same problem as with blocking retry: the DB write succeeds but the push fails.
In theory, any code in a distributed system that involves calls to multiple services is exposed to this situation, and over a long enough period a call failure is certain to occur. This is one of the core difficulties of distributed system design.
TCC compensation transactions
TCC compensating transactions are a better choice when transactional guarantees are required and the services involved cannot easily be decoupled.
TCC breaks each service invocation into two phases and three operations:
- Phase 1, Try: check and reserve business resources, for example check the inventory and withhold the requested quantity.
- Phase 2, Confirm: commit the reservation made by Try, for example turn the withheld inventory into an actual deduction.
- Phase 2, Cancel: when a Try fails, release the resources withheld by Try, for example add the withheld inventory back.
TCC requires each service to implement APIs for all three operations. Work that was done in a single call before the service adopted TCC now has to be done as three operations in two phases.
For example, a shopping application needs to call inventory service A, balance service B, and points service C, with the following pseudocode:
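m := db.Insert(sql)
_, aErr := A.Try(m)
_, bErr := B.Try(m)
_, cErr := C.Try(m)
if aErr != nil || bErr != nil || cErr != nil {
    // At least one Try failed: release whatever was reserved.
    A.Cancel()
    B.Cancel()
    C.Cancel()
} else {
    // Every Try succeeded: commit the reservations.
    A.Confirm()
    B.Confirm()
    C.Confirm()
}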
The code calls the Try APIs of services A, B, and C to check and reserve resources, and calls their Confirm operations if the reservations succeed. If the Try operation of the C service fails, the Cancel APIs of A, B, and C are called to release the reserved resources.
TCC solves the problem of data consistency across multiple services and multiple databases in a distributed system. However, TCC has problems of its own that need attention in practice, including the call failures discussed in the previous sections.
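To make the Try, Confirm, and Cancel split concrete, here is a minimal in-memory sketch. The Participant interface, the Inventory type, and the Run coordinator are hypothetical names chosen for illustration, not the API of any TCC framework, and a real participant would persist its reservations rather than keep them in a map.

```go
package main

import (
	"errors"
	"fmt"
)

// Participant is a hypothetical interface each TCC service would expose.
type Participant interface {
	Try(txID string) error
	Confirm(txID string) error
	Cancel(txID string) error
}

// Inventory is an in-memory participant that reserves one unit per Try.
type Inventory struct {
	Available int
	reserved  map[string]int // units withheld per transaction ID
}

func NewInventory(stock int) *Inventory {
	return &Inventory{Available: stock, reserved: make(map[string]int)}
}

// Try checks the stock and withholds one unit for txID.
func (inv *Inventory) Try(txID string) error {
	if inv.Available < 1 {
		return errors.New("out of stock")
	}
	inv.Available--
	inv.reserved[txID]++
	return nil
}

// Confirm turns the withheld unit into a real deduction.
func (inv *Inventory) Confirm(txID string) error {
	delete(inv.reserved, txID)
	return nil
}

// Cancel returns the withheld units, if any, to the available stock.
func (inv *Inventory) Cancel(txID string) error {
	inv.Available += inv.reserved[txID]
	delete(inv.reserved, txID)
	return nil
}

// Run tries every participant. If any Try fails it cancels all of them,
// otherwise it confirms all of them.
func Run(txID string, ps ...Participant) error {
	for _, p := range ps {
		if err := p.Try(txID); err != nil {
			for _, q := range ps {
				q.Cancel(txID) // errors ignored here for brevity
			}
			return err
		}
	}
	for _, p := range ps {
		if err := p.Confirm(txID); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	a, b := NewInventory(1), NewInventory(0)
	err := Run("tx-1", a, b) // b has no stock, so a's reservation is released
	fmt.Println(err, a.Available)
}
```

The sketch deliberately ignores errors from Cancel and Confirm; handling those failures is exactly the hard part that the rest of this section discusses.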
Empty release
If C.Try() in the code above genuinely fails, the subsequent C.Cancel() call releases a resource that was never locked. This happens because the current service cannot tell whether the failed call actually locked C's resource. If Cancel were not called, and the Try had in fact succeeded but was reported as failed because of a network problem, C's resource would stay locked and never be released.
Empty release occurs frequently in production environments, so services should tolerate it when implementing their TCC APIs.
Ordering
If C.Try() fails in the code above, C.Cancel() is called next. Because of network problems, the C.Cancel() request may reach the C service first and the C.Try() request may arrive later. The Cancel is then an empty release, and the late Try locks C's resource with nothing left to release it.
The C service should therefore reject a Try() that arrives after the resource has been released. In practice, a unique transaction ID can be used to tell a first Try() apart from a Try() that arrives after the release.
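Both rules, tolerating an empty release and rejecting a Try that arrives after Cancel, can be sketched by recording a per-transaction state on the participant. The Account type, its fields, and the error values below are hypothetical, and the state is kept in memory only for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	ErrCancelled    = errors.New("transaction already cancelled")
	ErrInsufficient = errors.New("insufficient balance")
)

type txState int

const (
	stateNone txState = iota
	stateTried
	stateCancelled
)

// Account is a hypothetical TCC participant that freezes part of a balance.
type Account struct {
	Balance int
	Frozen  int
	states  map[string]txState
	amounts map[string]int
}

func NewAccount(balance int) *Account {
	return &Account{
		Balance: balance,
		states:  make(map[string]txState),
		amounts: make(map[string]int),
	}
}

// Try freezes amount for txID unless the transaction was already cancelled.
func (a *Account) Try(txID string, amount int) error {
	switch a.states[txID] {
	case stateCancelled:
		return ErrCancelled // late Try: refuse to lock anything
	case stateTried:
		return nil // duplicate Try: already reserved
	}
	if a.Balance < amount {
		return ErrInsufficient
	}
	a.Balance -= amount
	a.Frozen += amount
	a.states[txID] = stateTried
	a.amounts[txID] = amount
	return nil
}

// Cancel releases the frozen amount if there is one. It always records the
// cancellation, even when nothing was frozen (an empty release), so that a
// Try arriving later is rejected.
func (a *Account) Cancel(txID string) error {
	if a.states[txID] == stateTried {
		amt := a.amounts[txID]
		a.Frozen -= amt
		a.Balance += amt
		delete(a.amounts, txID)
	}
	a.states[txID] = stateCancelled
	return nil
}

func main() {
	acc := NewAccount(100)
	acc.Cancel("tx-1")         // Cancel overtakes Try: empty release
	err := acc.Try("tx-1", 30) // the late Try is rejected
	fmt.Println(err, acc.Balance, acc.Frozen)
}
```

Because Cancel checks the recorded state before releasing, calling it twice for the same transaction is also harmless, which matters once Cancel itself is retried.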
Call failures
Cancel and Confirm calls can themselves fail, for example because of ordinary network problems.
If Cancel() or Confirm() fails, the resource stays locked and is never released. Common solutions include:
- Blocking retry. This has the same problems as before: if the service is down, every retry fails.
- Writing to a log or queue, and then having a separate asynchronous service step in, either automatically or with manual intervention. This has problems too, because the write to the log or queue can itself fail.
In theory, any two pieces of code that are not executed atomically within a transaction have an intermediate state, and wherever there is an intermediate state, failure is possible.
Local message table
The local message table was originally proposed by eBay. The message table is placed in the same database as the business tables, so that a local transaction can provide the transactional guarantees.
The approach is to insert a message record together with the business data in a single local transaction, and then perform the subsequent operations. If they succeed, the message is deleted. If they fail, the message is kept, and an asynchronous listener picks it up and retries.
Local message tables are a good idea and can be used in several ways:
Working with MQ
Sample pseudocode:
messageTx := tc.NewTransaction("order")
messageTxSql := messageTx.TryPlan("content")
// Insert the business row and the message row in one local transaction.
m, err := db.InsertTx(sql, messageTxSql)
if err != nil {
    return err
}
aErr := mq.Publish("B-Service-topic", m)
if aErr != nil { // Failed to push to MQ
    messageTx.Confirm() // Update the status of the message to confirm
} else {
    messageTx.Cancel() // Delete the message
}

// Asynchronously process the confirm message and continue pushing
func OnMessage(task *Task) {
    err := mq.Publish("B-Service-topic", task.Value())
    if err == nil {
        task.Cancel() // Pushed successfully: delete the message
    }
}
messageTxSql is the statement that inserts into the local message table:
insert into `tcc_async_task` (`uid`, `name`, `value`, `status`)
values (?, ?, ?, ?)
It is executed in the same transaction as the business SQL, so the two either both succeed or both fail.
If the message is pushed to the queue successfully, the local message is deleted by calling messageTx.Cancel(). If the push fails, the message is marked as confirm. The local message table has two statuses, try and confirm, and OnMessage can listen for either of them to initiate a retry.
The local transaction guarantees that the message and the business data are written to the database together. Whatever happens afterwards, whether the service goes down or the push fails because of a network problem, the asynchronous listener can carry on with the remaining steps, which ensures that the message eventually reaches MQ.
MQ in turn guarantees that the message reaches the consumer service. Using MQ's QoS policies, the consumer service either processes the message or forwards it to the next business queue, which preserves the integrity of the transaction.
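The flow described above can be sketched end to end with an in-memory stand-in for the database. This is not the author's library: the DB, Message, and Publisher types and the PlaceOrder and RetryPending functions are hypothetical, and InsertTx only imitates a local transaction by writing both records in one call.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrMQDown simulates a failed push to MQ.
var ErrMQDown = errors.New("mq unavailable")

// Message is a row in the hypothetical local message table.
type Message struct {
	ID     int
	Value  string
	Status string // "try" or "confirm"
}

// DB is an in-memory stand-in for a database holding both tables.
type DB struct {
	Orders   []string
	Messages map[int]*Message
	nextID   int
}

func NewDB() *DB {
	return &DB{Messages: make(map[int]*Message)}
}

// InsertTx writes the order and its message together, standing in for a
// single local transaction.
func (db *DB) InsertTx(order string) *Message {
	db.nextID++
	m := &Message{ID: db.nextID, Value: order, Status: "try"}
	db.Orders = append(db.Orders, order)
	db.Messages[m.ID] = m
	return m
}

// Publisher pushes a value to MQ. It is injected so failures can be simulated.
type Publisher func(value string) error

// PlaceOrder writes the order and then tries to publish it once.
func PlaceOrder(db *DB, publish Publisher, order string) {
	m := db.InsertTx(order)
	if err := publish(m.Value); err != nil {
		m.Status = "confirm" // keep the row for the background retry
		return
	}
	delete(db.Messages, m.ID) // pushed: the message is no longer needed
}

// RetryPending republishes every remaining message and deletes the ones
// that are pushed successfully.
func RetryPending(db *DB, publish Publisher) {
	for id, m := range db.Messages {
		if publish(m.Value) == nil {
			delete(db.Messages, id)
		}
	}
}

func main() {
	db := NewDB()
	down := func(string) error { return ErrMQDown }
	up := func(v string) error { fmt.Println("published:", v); return nil }

	PlaceOrder(db, down, "order-1") // push fails, message stays in the table
	RetryPending(db, up)            // background retry delivers it
	fmt.Println("pending messages:", len(db.Messages))
}
```

In a real system RetryPending would be driven by a timer or a listener, and the consumer would need to be idempotent because a message can be published more than once.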
Working with service invocation
Sample pseudocode:
messageTx := tc.NewTransaction("order")
messageTxSql := messageTx.TryPlan("content")
// Insert the business row and the message row in one local transaction.
body, err := db.InsertTx(sql, messageTxSql)
if err != nil {
    return err
}
aErr := request.POST("B-Service", body)
if aErr != nil { // Failed to invoke B-Service
    messageTx.Confirm() // Update the status of the message to confirm
} else {
    messageTx.Cancel() // Delete the message
}

// Asynchronously process confirm or try messages and continue calling B-Service
func OnMessage(task *Task) {
    // request.POST("B-Service", body)
}
This example combines a local message table with a direct call to another service, without introducing MQ. Asynchronous retry together with the local message table guarantees that the message is delivered reliably. It solves the problems of blocking retry and is common in day-to-day development.
If there is no local operation that writes to the DB, you can write only to the local message table and handle the message in OnMessage in the same way:
messageTx := tc.NewTransaction("order")
messageTx.Try("content")
aErr := request.POST("B-Service", body)
// ...
Message expiration
Configure handlers for Try and Confirm messages in the local message table:
TCC.SetTryHandler(OnTryMessage())
TCC.SetConfirmHandler(OnConfirmMessage())
In the message handler, you need to check whether the current message task has existed for too long. For example, if a task has been retried for an hour and still fails, send an alert by email, SMS, or log so that someone can intervene manually.
func OnConfirmMessage(task *tcc.Task) {
    if time.Since(task.CreatedAt) > time.Hour {
        _ = task.Cancel() // Delete the message and stop the retry.
        // doSomeThing() alarm, manual intervention
        return
    }
}
In the Try handler, you also need to check whether the message is too new, because a message in the Try state may have only just been created, and the request that created it may not yet have confirmed or deleted it. Retrying at this point would duplicate the normal business flow, so that calls which are about to succeed also get retried. To avoid this, check how long ago the message was created and skip it if that time is very short.
The retry mechanism necessarily relies on the downstream API being idempotent. Strictly speaking, the system would still work without this check, but a good design should avoid interfering with normal requests wherever possible.
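The two age checks can be combined into one small decision function. This is a sketch under assumed thresholds: MinAge, MaxAge, the Action type, and Decide are hypothetical names, and the values of ten seconds and one hour are examples rather than recommendations.

```go
package main

import (
	"fmt"
	"time"
)

// Action is what a handler should do with a pending message.
type Action int

const (
	Skip  Action = iota // too new: the original request may still be running
	Retry               // old enough to retry, not yet expired
	Alert               // too old: stop retrying and ask for manual intervention
)

// Assumed thresholds for illustration only.
const (
	MinAge = 10 * time.Second
	MaxAge = time.Hour
)

// Decide picks an action from the age of a message.
func Decide(age, minAge, maxAge time.Duration) Action {
	switch {
	case age < minAge:
		return Skip
	case age > maxAge:
		return Alert
	default:
		return Retry
	}
}

func main() {
	createdAt := time.Now().Add(-2 * time.Minute)
	action := Decide(time.Since(createdAt), MinAge, MaxAge)
	fmt.Println("retry:", action == Retry)
}
```

A Try handler would call Decide with the task's creation time and act on the result, while a Confirm handler only needs the upper bound, since a message in the confirm state is already known to have failed.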
Independent message service
The independent message service is an upgraded version of the local message table, in which the message table is split out into a separate service. Before any other operation, a message is added to the message service. If the subsequent operations succeed, the message is deleted. If they fail, the message is submitted as confirm.
An asynchronous process then listens for messages and handles them, with essentially the same logic as for the local message table. However, adding a message to the message service cannot share a transaction with local operations, so it is possible for the message to be added successfully and for the subsequent operation to fail, which leaves a useless message behind.
Consider the following scenario:
err := request.POST("Message-Service", body)
if err != nil {
    return err
}
aErr := request.POST("B-Service", body)
if aErr != nil {
    return aErr
}
MQ transactions
Some MQ implementations, such as RocketMQ, support transactions. An MQ transaction can be viewed as a concrete implementation of the independent message service, and the logic is the same.
Before any other operation, send a message to MQ. If the subsequent operations succeed, confirm (commit) the message. If they fail, cancel (delete) it. A message can also be left in the prepare state, in which case MQ's processing logic has to check whether the business operation succeeded.
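The prepare, commit, and rollback sequence can be sketched with a toy in-memory broker. The Broker type and its Prepare, Commit, and Rollback methods are hypothetical and are not RocketMQ's API; the sketch also leaves out the broker-side check for messages stuck in the prepare state.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrLocal simulates a failure of the local business operation.
var ErrLocal = errors.New("local operation failed")

// Broker is a toy in-memory stand-in for an MQ that supports transactions.
type Broker struct {
	Prepared  map[string]string // prepared messages, invisible to consumers
	Delivered []string          // committed messages, visible to consumers
}

func NewBroker() *Broker {
	return &Broker{Prepared: make(map[string]string)}
}

// Prepare stores a message without making it visible to consumers.
func (b *Broker) Prepare(id, body string) {
	b.Prepared[id] = body
}

// Commit makes a prepared message visible to consumers.
func (b *Broker) Commit(id string) {
	if body, ok := b.Prepared[id]; ok {
		b.Delivered = append(b.Delivered, body)
		delete(b.Prepared, id)
	}
}

// Rollback discards a prepared message.
func (b *Broker) Rollback(id string) {
	delete(b.Prepared, id)
}

// Send prepares a message, runs the local operation, and then commits the
// message on success or rolls it back on failure.
func Send(b *Broker, id, body string, local func() error) error {
	b.Prepare(id, body)
	if err := local(); err != nil {
		b.Rollback(id)
		return err
	}
	b.Commit(id)
	return nil
}

func main() {
	b := NewBroker()
	Send(b, "m1", "order created", func() error { return nil })
	Send(b, "m2", "order failed", func() error { return ErrLocal })
	fmt.Println(b.Delivered, len(b.Prepared))
}
```

If the process crashes between Prepare and Commit, the message stays in the prepared map, which is the situation where a real broker has to find out whether the business operation succeeded.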
Conclusion
Experience with distributed systems shows that ensuring data consistency inevitably requires additional mechanisms.
The advantages of TCC are that it works at the business service layer, does not depend on a specific database, is not coupled to a specific framework, and allows flexible granularity for resource locks, which makes it well suited to microservice scenarios. The disadvantage is that each service has to implement three APIs, which is intrusive, requires substantial changes to business code, and means handling many kinds of failure. It is hard for developers to cover every case themselves, so adopting a mature framework, such as Alibaba's Fescar, can greatly reduce the cost.
The advantages of the local message table are that it is simple, does not require changes to other services, works well with both service invocation and MQ, and is practical in most business scenarios. The disadvantage is that the local database gains extra message tables that are coupled with the business tables.
The local message table examples come from a library written by the author.
The advantage of MQ transactions and the independent message service is that they factor the transaction problem out into a common service, so that individual services do not each carry a coupled message table and the extra processing complexity that comes with it. The disadvantages are that very few MQ implementations support transactions, and that calling an API to add a message before every operation increases overall latency, which is unnecessary overhead in the majority of cases where the business call succeeds normally.