I was asked about distributed things in an interview with an Internet company in early April. Once again I have suffered a loss on the problem of not organizing properly, for record, or!!
background
At the beginning of April, I had an interview with a company in this city that had been engaged in unmanned shelves in the office. Although they are now facing transformation, they are still attractive to children like me who want to go from traditional enterprises to the Internet industry.
The issue of distributed things came up during the interview. I once again did not tidy up the problem on the loss, record, or long memory!!
Look at the interview process first
The interviewer first drew a picture on a piece of paper:
Let me take a look at this and follow the process. Any questions? The interviewer didn’t directly address the issue of distribution, but asked me to tell him, “That’s the interview formula.”
I answered the question that there may be distributed things in the process. When system B is called in Step 2, it may time out in response after the processing is completed, so that system A mistakenly thinks that the processing of B failed, resulting in the rollback of system A and data inconsistency between system B and system A.
Ok, at this point, I should have answered the interviewer’s first meaning, at least I was aware of it, he nodded.
Then he asked, “Do you have a good solution?”
At this time, I only had the impression of the general flowchart of the two-stage submission in my mind, and Then Barkabala told him something about a coordinator in the middle, pre-submission, and if there is a failure, rollback, if ok, and then the real submission, which is the theory of these online gods.
The interviewer then asks: “What does your code do when the line A calls B breaks?” How do you do rollback? Tell me about your code.
At this point, I was confused.
The final result, you can certainly guess, cool cool.
What is a transaction
Here we say the transaction generally refers to the database transaction, referred to as the transaction, is a logical unit in the database management system execution process, composed of a limited sequence of database operations. Wikipedia says so.
For example, account A needs to transfer RMB 100 to account B, which involves at least two operations:
1. 100 yuan is deducted from account A
2. Add 100 yuan to account B
In A transaction-enabled DBMS, you need to make sure that both of the above operations (the entire “transaction”) are completed and that you can’t have A situation where A’s $100 is deducted and B’s account is not added.
Database transactions contain four features, which are:
• Atomicity: The transaction is executed as a whole and all or none of the operations on the database contained within it are executed. In the case of transfers, account A takes money and account B adds money, and either it succeeds or fails at the same time.
• Consistency: Transactions should ensure that the state of the database changes from one consistent state to another. Consistent state means that the data in the database should meet integrity constraints
• Isolation: When multiple transactions are executed concurrently, the execution of one transaction should not affect the execution of other transactions. When transferring money from other accounts, the previous transactions of A and B above cannot be affected.
• Durability: Modifications to the database by committed transactions should be permanently saved in the database.
What are distributed things
As we know, the above transfer is a transaction operation in a database. We can use frameworks such as Spring’s transaction manager to do this for us.
However, if a crash occurs in our system, such as an operation where I need to operate multiple libraries, or an operation that crashes previous calls to the application, then Spring’s transaction management mechanism has no protection against such a scenario.
Just like the problem in the interview question above, when the step 2 of system A called B remotely, B did not respond normally due to network timeout, but A failed to call B and rolled back, and B submitted the thing again. At this point, it may lead to data inconsistency, contributing to the problem of distributed things.
The existence of distributed things is to solve data inconsistency.
Why do we want consistency
Theory of CAP
There is a popular theory in distributed systems: the CAP theorem
The theorem stems from a conjecture made by Eric Brewer, a computer scientist at the University of California, Berkeley, at the Principles of Distributed Computing conference (PODC) in 2000. Then, in 2002, Seth Gilbert and Nancy Lynch of the Massachusetts Institute of Technology (MIT) published a proof of Brewer’s conjecture, making it a theorem. [From Wikipedia]
He said it was impossible for a distributed computing system to satisfy all three requirements:
• Consistency
• Availability
• Partition tolerance
A distributed system can satisfy at most two of these.
So, what are these three things? Why can we only have both?
Let’s start with a scenario where we have two systems deployed (two nodes, web1 and web2), with the same business code, but maintaining the data generated by our own node. But when users come in, they may visit different nodes.
However, whether to access Web1 or Web2, after the user parameter data, this data must be synchronized to another node, so that no matter which node the user visits, it is the data he needs. The diagram below:
Partition fault tolerance
Let’s start with partition fault-tolerance: Even if there is a network failure between the two nodes, there is no synchronization problem, but users access to each node, that node has to provide a separate service, which is a must for Internet companies.
If the network between Web1 and Web2 is faulty, data cannot be synchronized. The user writes data on Web1, then accesses it to read the data and requests Web2, but web2 has no data at this point. So are we returning null to the user? Or do you give some hint that the system is unavailable and try again later?
All wrong, man.
consistency
If you want to ensure availability, and nodes with data return data and nodes with no data return NULL, then the user will see some data and some data, and there will be a consistency problem.
availability
If it’s consistent, then when the user accesses, whether it’s Web1 or Web2, we might return some message saying the system is not available, try again later, etc., to make sure it’s consistent every time. We have data, but our system is responding to a prompt, which is a usability issue.
Since partition fault tolerance (P) must be guaranteed, our distributed system is more of a balance between consistency (CP) and availability (AP), which can only meet both conditions.
In fact, if you think about it, ZK strictly implements CP, while Eureka guarantees AP.
In fact, distributed things emphasize consistency.
Several distributed transaction solutions
2PC
Before we talk about 2PC, what is the XA specification?
The XA specification describes the interface between a global transaction manager and a local resource manager. The purpose of the XA specification is to allow multiple resources (such as databases, application servers, message queues, and so on) to be accessed in the same transaction so that ACID properties remain valid across applications.
XA uses two-phase commit (2PC) to guarantee that all resources commit or roll back any particular transaction at the same time.
Think of a scenario, when making a single application, some of you have connected two libraries? In one transaction, data is inserted into two systems simultaneously. But for ordinary things, it doesn’t matter.
Take a look at the following figure (just to illustrate the pattern of this operation, not limited to the following business) :
There are two libraries to operate in one service, so how do you guarantee success?
Here we introduce a framework, Atomikos, that implements this XA routine. Look at the code:
Github AtomikosJTATest[1]: https://github.com/heyxyw/learn/blob/master/distributed-transaction/src/main/java/com/zhouq/jta/AtomikosJTATest.java[2]
See the diagram above, Atomikos implements a transaction manager itself. All the connections we get are taken from it.
• The first step is to start the transaction and then do the pre-commit. Db1 and DB2 do the pre-execute first. Note: there is no commit here. • The second step is the actual submission, which is initiated by Atomikos and rolled back if an exception occurs.
Does this process have two roles, a transaction manager and a resource manager (in this case the database, but also other components, message queues and so on)?
The entire execution process is as follows:
The picture above is normal, and the picture below is a failure on one side.
Pictures from: XA transaction processing [3] : https://www.infoq.cn/article/xa-transactions-handle [4], the specific detailed interpretation on XA, can have a good look at. The whole 2PC process:
Phase 1 (Submit request phase) :
-
The coordinator node asks all the participant nodes if they can commit and waits for the response from each participant node.
-
The participant node performs all transactions until the query is initiated and writes Undo and Redo information to the log.
-
Each participant node responds to the query initiated by the coordinator node. If the transaction of the participant node actually succeeds, it returns a “agree” message; If the transaction operation of the participant node actually fails, it returns an abort message.
Phase ii (Submission for implementation) :
Success, when the coordinator node gets the corresponding message “agree” from all the participant nodes:
1. The coordinator node issues a “formally commit” request to all participant nodes.
2. The participant node completes the operation and releases the resources occupied during the transaction.
3. The actor node sends a “done” message to the coordinator node.
4. The coordinator node completes the transaction after receiving the “complete” message from all the participant nodes.
Failure, if any participant node returns a “terminated” response message during phase 1, or if the coordinator node is unable to obtain the response message from all participant nodes before phase 1 query times out:
1. The coordinator node sends a rollback request to all the participant nodes.
2. The participant node uses the previously written Undo information to roll back and release the resources occupied during the entire transaction.
3. The participant node sends a rollback complete message to the coordinator node.
4. The coordinator node cancels the transaction after receiving the “rollback complete” message from all the participant nodes.
The second phase is sometimes referred to as the completion phase because this is where the coordinator must end the current transaction, regardless of the outcome.
The reliable message Final consistency scheme is based on common message queue middleware. We talked about the two-phase commit scheme above, then we will talk about how to solve the distributed transaction problem based on the reliable message Final consistency scheme.
In this scenario, the message service middleware role is involved. Let’s start with a flow chart:
Let’s illustrate the above figure using the process of creating an order placing and the subsequent process of shipping out.
In the ordering logic (Producer end), the data of an order, such as the order number, quantity and other key information, are first packaged into a message, and the state of the message is set as init, and then sent to the independent message service and stored in the database.
Next we move on to the rest of the local logic for the order.
After the processing is completed, the step of confirming the sending of the message indicates that my order can be placed successfully. Then we send a confirm message to the message service, which changes the status of the order to SEND and sends it to the message queue.
Next, the consumer consumes the message. Process its own logic, and then feed back the message processing results to the independent message service, independent message service set the message state to end, indicating the end. However, attention should be paid to ensure the idempotency of the interface to avoid the problems caused by repeated consumption.
The possible problems in this and how to solve each step:
1. For example, if an exception occurs in the prepare stage, the order will not be placed successfully. But we said, we’re based on reliable information here, and we need to make sure our messaging service is up and running.
2. There is something abnormal in comfirm, and the confirmation fails to be sent at this time, but our order has been placed successfully. This kind of situation, we can play a timed task in independent news service and timing to query message status to init data, to reverse the query order number in the system exists, if it exists, then we will put the message to the send status, and then sent to the message queue, if the query to there is no order, Then just discard the message. So here our order system has to provide a batch query order interface, and the downstream consumption system has to ensure idempotent. Ensure the consistency of repeated consumption.
3. Messages are discarded from the message queue or the downstream system fails to process the messages. As a result, no message is sent. In this case, we also need a scheduled task to process the messages in the SEND state. We can resend the messages until the system consumes them successfully.
4. At the end of the consumer side, when we consume, if there is any abnormal consumption, or the system bug causes the abnormal situation. So here we can also go to log, if it is not the system code problem, network jitter caused, then in the third case above, the message system will send a message again, we will deal with it. If it keeps failing, you need to think about whether your code is really buggy.
5. As the final guarantee scheme, record logs and process data when problems occur. Now our system error, with the current technical means is not able to do all rely on machines to solve, we have to rely on people. As far as I know, now many large factories will have such a person, specializing in dealing with this type of problem, to manually modify the way of the database. The small factory we stayed in before, basically rely on our own to write SQL to modify the data, think about it, right?
Post the key isSS core logic code framework:
Scheduled task:
Implemented based on RocketMQ
This solution is the same as the standalone messaging service mentioned above, where the standalone service is removed and only the message queue is implemented, namely Alibaba’s RocketMQ.
The flow chart is as follows:
The entire process here is the same as the message-based service above. Here, but more specific code refer to: https://www.jianshu.com/p/453c6e7ff81c [5], write very well.
In terms of the reliable message ultimate consistency scheme, when we say reliable, we mean that the message is guaranteed to be sent to the message middleware.
For the downstream system, if the consumption is not successful, it is generally taken as failure retry. If the retry fails for several times, logs will be recorded and subsequent manual intervention will be performed. So it’s important to emphasize that the next system has to deal with idempotent, retry, and logging.
If it is for the business of capital class, after the subsequent system rollback, we have to find a way to notify the system in front of the rollback, or send an alarm by manual rollback and compensation.
TCC scheme
The whole PROCESS of TCC is divided into three stages, namely Try, Confirm and Cancel:
1.Try phase: This phase checks the resources of each service and locks or reserves the resources
2.Confirm phase: This phase is about performing the actual operations in each service
3.Cancel phase: If the business method execution of any of the services fails, there is a need to compensate by performing a rollback of the business logic that has been successfully executed
Again, take the transfer example as an example. When the transfer is carried out across banks, it needs to involve the distributed things of two banks, transferring one piece from Bank A to bank B. If the TCC scheme is used to achieve:
Here’s the idea:
1.Try stage: Freeze 1 yuan of bank account A and pre-add 1 yuan of funds to bank account B.
2.Confirm stage: Perform the actual transfer operation, deduct RMB 1 from bank ACCOUNT A and increase RMB 1 from bank account B.
3.Cancel stage: If any bank fails to perform the operation, it needs to be rolled back for compensation. For example, if the account of Bank A has been deducted, but the fund of bank B fails to increase, the fund of bank A must be added back.
This scheme is more complex, one step operation to do multiple interfaces to cooperate to complete.
With ByteTCC framework implementation examples to describe about the above process, the sample address https://gitee.com/bytesoft/ByteTCC-sample/tree/master/dubbo-sample [6]
At the beginning, both bank accounts of A and B were set as follows: Amount =1000, frozen = 0
1 RMB from bank A account to bank B account:
In the try stage, the amount of the bank account of A is reduced by 1, and the frozen amount of the bank account of B is increased by 1.
At this time:
• Bank account A: Amount = 1000-1 = 999, frozen = 0 + 1 = 1
•B bank account: Amount = 1000, frozen = 0 + 1 = 1
Confirm stage: The frozen amount of bank account A shall be reduced by 1; the frozen amount of bank account B shall be increased by 1 and the frozen amount shall be reduced by 1
At this time:
• Bank account A: Amount = 999, frozen = 1-1 = 0
•B 银行账户:amount(数量)= 1000 + 1 = 1001,frozen(冻结金额)= 1 – 1 = 0
Cancel stage: Bank A account amount + 1, frozen amount -1, frozen amount -1 in bank B
At this time:
•A 银行账户:amount(数量)= 999 + 1 = 1000,frozen(冻结金额)= 1 – 1 = 0
•B 银行账户:amount(数量)= 1000,frozen(冻结金额)= 1 – 1 = 0
Now that I’ve done the whole thing, you should run through the code. In fact, it is quite complicated, there are many interfaces to complete the whole business, imagine if we use TCC to write a lot of projects, you can bear?
Again, BASE theory
BASE theory is an acronym for Basically Available, Soft State, and Eventually Consistent.
1. Basically Available: A distributed system is allowed to lose some of its availability in the event of an unpredictable failure.
2. Soft State: Data in the system is allowed to exist in an intermediate State, and the existence of the intermediate State is considered to have no impact on the overall availability of the system. That is, the system is allowed to delay data synchronization between data copies on different nodes.
Eventual Consistency: It emphasizes that all data update operations can finally reach a consistent state after a period of synchronization. Therefore, the essence of final consistency is that the system needs to ensure the consistency of the final data, rather than ensuring the strong consistency of the system data in real time.
Its core idea is:
Even if Strong consistency cannot be achieved, each application can adopt appropriate methods to achieve Eventual consistency according to its own service characteristics.
The account in the ABOVE TCC scheme designs a frozen field. Is this the soft state in the middle of BASE theory?
The last
For companies with a lot of microservices, the calls between services are extremely complex, so when introducing distributed things, you need to consider the complexity and development cost of implementing the system, or where distributed things are not needed at all.
In fact, there is no need to do distributed things everywhere, for most businesses, in fact, we do not need to do distributed things, directly do logging, do monitoring. Then there are problems, manual to deal with, a month will not have so many problems. If you’re having these problems every day, you might want to check your code for bugs.
For capital scenarios, distributed transaction schemes are basically used to ensure that other services, memberships, points, product information, etc., may not need to do so.
Java Geek technology public account, is set up by a group of technical people who love Java development, focus on sharing original, high quality Java articles. If you think our article is good, please help to appreciate, read, forward support, encourage us to share better articles.
Pay attention to the public account, you can reply “nuggets” in the background of the public account, to obtain the author Java knowledge system/interview must see information.