Preface
Back to today's topic: I am going to share some ways to handle exceptions in a payment system.
These techniques are not limited to payment systems. They apply to other systems as well, and we can borrow them to make our own systems more robust.
Exceptions are an unavoidable part of running a system. If everything always worked, system design would be fairly simple.
Unfortunately, no system achieves that, so we have to add a lot of extra design to deal with the problems exceptions cause.
It is fair to say that exception handling must be considered in system design, and that it takes up most of our effort.
Let's look at the most common exception in payment systems: dropped orders.
Dropped-order exceptions
One of the most common payment platform architectures looks like this:
The figure above takes the perspective of a third-party payment company. For a company's internal payment system, the external merchant is really one of the company's internal systems, such as the order system, and the external payment channel is the third-party payment company.
Taking Ctrip as an example, an order payment initiated on Ctrip passes through three systems in the following steps:
- Ctrip creates the order and sends a payment request to the third-party payment company
- The third-party payment company creates its order and sends a payment request to ICBC
- ICBC completes the deduction and returns the result to the third-party payment company
- The third-party payment company updates its order and returns the result to Ctrip
- Ctrip changes the order status
Simplified, the process looks like the figure below:
In this process it can happen that the user's ICBC card has been debited while the Ctrip order still shows as awaiting payment. We usually call this situation a dropped order.
Most dropped orders of this kind are caused by the loss of the return messages in links ③ and ⑤. These are called external dropped orders.
There is also a rarer case: the messages returned in links ③ and ⑤ are received, but the internal system fails to update the order status in links ④ and ⑥, so the successful payment result is lost. Because the problem is internal, this is usually called an internal dropped order.
External dropped orders
An external dropped order occurs because we did not receive the response from the peer. The cause is usually a network problem, or the peer's processing is so slow that our request times out and we close the connection.
Increasing the timeout
For this case, the first and easiest solution is to increase the timeout appropriately.
Note, however, that after increasing the network timeout we may also need to adjust the timeouts along the entire call chain. Otherwise an upstream hop may time out first, which in turn causes an internal dropped order.
Aside: when connecting to an external channel, always set both the network connect timeout and the read timeout.
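The point about keeping timeouts consistent along the whole chain can be checked mechanically. The following is a minimal sketch, not part of the original system: the hop names, the timeout values, and the one-second safety margin are all assumptions chosen for illustration. It flags any hop whose caller would give up before the hop itself has had time to finish.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    """One hop in the payment call chain, listed from outermost to innermost."""
    name: str
    timeout_seconds: float  # how long the caller of this hop is willing to wait


def find_timeout_violations(chain, margin_seconds=1.0):
    """Return (outer, inner) name pairs where the outer hop would give up
    before the inner hop has had time to finish and report back."""
    violations = []
    for outer, inner in zip(chain, chain[1:]):
        if outer.timeout_seconds < inner.timeout_seconds + margin_seconds:
            violations.append((outer.name, inner.name))
    return violations


# Hypothetical chain: the innermost timeout was raised to 15 s,
# but the callers in front of it were left unchanged.
chain = [
    Hop("merchant -> payment gateway", 10.0),
    Hop("payment gateway -> channel adapter", 8.0),
    Hop("channel adapter -> bank channel", 15.0),
]
violations = find_timeout_violations(chain)
for outer_name, inner_name in violations:
    print(f"{outer_name} times out before {inner_name} can answer")
```

A check like this could run at startup or in a configuration test, so that raising one timeout without the others is caught before it produces dropped orders.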
Receiving asynchronous notifications
The second method is to receive notifications asynchronously.
Payment channel interfaces generally let us submit an asynchronous callback address. When the channel finishes processing successfully, it sends the success message to that address.
We then only need to receive the notification, parse it, and update the internal order status.
Pay attention to the following:
- Verify the signature of every asynchronous notification, and check that the order amount in it matches the order amount on the merchant side. This prevents forged notifications, made possible by leaked data, from causing financial loss.
- Asynchronous notifications may be sent multiple times, so notification handling must be idempotent.
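The two points above (signature and amount verification, plus idempotent handling) can be sketched as follows. This is an illustration only: the HMAC-SHA256 signing scheme, the `order_no|amount|status` message layout, the shared secret, and the in-memory `orders` dictionary are all assumptions. Real channels define their own signing rules, and a real system would read and update the order inside a database transaction.

```python
import hashlib
import hmac
from decimal import Decimal

SHARED_SECRET = b"demo-secret"  # hypothetical key agreed with the channel

# In-memory stand-in for the order table: order_no -> amount and status
orders = {"P1001": {"amount": Decimal("99.00"), "status": "PENDING"}}


def sign(order_no, amount, status):
    """Assumed signing scheme: HMAC-SHA256 over 'order_no|amount|status'."""
    message = f"{order_no}|{amount}|{status}".encode()
    return hmac.new(SHARED_SECRET, message, hashlib.sha256).hexdigest()


def handle_notification(order_no, amount, status, signature):
    """Process one channel callback; safe to call repeatedly for the same order."""
    expected = sign(order_no, amount, status)
    if not hmac.compare_digest(expected, signature):
        return "REJECTED_BAD_SIGNATURE"
    order = orders.get(order_no)
    if order is None:
        return "REJECTED_UNKNOWN_ORDER"
    if Decimal(amount) != order["amount"]:
        return "REJECTED_AMOUNT_MISMATCH"
    if order["status"] != "PENDING":
        return "ALREADY_PROCESSED"  # idempotent: a repeated notification changes nothing
    order["status"] = status
    return "UPDATED"


signature = sign("P1001", "99.00", "SUCCESS")
first = handle_notification("P1001", "99.00", "SUCCESS", signature)
repeat = handle_notification("P1001", "99.00", "SUCCESS", signature)
print(first, repeat)
```

The order status acts as the idempotency guard here: only a `PENDING` order can be moved to a final state, so a repeated notification is acknowledged without changing anything.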
Dropped-order query
Some channels do not provide asynchronous notification and only offer an order query interface. In that case we have to use the third solution: scheduled dropped-order queries.
We can save orders whose status is unknown after a timeout to a dropped-order table, and then periodically query the channel for their status.
If the query returns a definite success or failure (for example, the order does not exist), we update the order status and delete the record from the dropped-order table.
If the result is still unknown, we wait for the next query.
Note that in some cases the order status may never become available, so we need to cap the number of queries per order to keep endless queries from wasting resources.
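One scheduled pass of such a query job might look like the sketch below. The names are hypothetical: `query_channel` stands in for the channel's order query interface, the cap of five attempts is an arbitrary choice, and marking exhausted orders as `NEEDS_RECONCILIATION` is an assumption about what happens once the cap is reached.

```python
from dataclasses import dataclass

MAX_QUERY_ATTEMPTS = 5  # assumed cap; tune per channel


@dataclass
class DroppedOrder:
    order_no: str
    attempts: int = 0


def poll_dropped_orders(drop_table, order_status, query_channel):
    """Run one scheduled pass over the table of orders with unknown status.

    query_channel(order_no) is a hypothetical callable returning
    "SUCCESS", "FAILED" (including "order does not exist") or "UNKNOWN".
    """
    for order_no in list(drop_table):
        record = drop_table[order_no]
        record.attempts += 1
        result = query_channel(order_no)
        if result in ("SUCCESS", "FAILED"):
            order_status[order_no] = result
            del drop_table[order_no]  # definite answer: stop tracking
        elif record.attempts >= MAX_QUERY_ATTEMPTS:
            order_status[order_no] = "NEEDS_RECONCILIATION"
            del drop_table[order_no]  # stop querying; leave it to a later step
        # otherwise keep the record and retry on the next pass


# Example: order A resolves immediately, order B never does.
drop_table = {"A": DroppedOrder("A"), "B": DroppedOrder("B")}
order_status = {"A": "UNKNOWN", "B": "UNKNOWN"}


def fake_channel(order_no):
    return "SUCCESS" if order_no == "A" else "UNKNOWN"


poll_dropped_orders(drop_table, order_status, fake_channel)
```

In a real job the pass would be triggered by a scheduler, and the interval between passes could grow with the attempt count so that slow channels are not hammered.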
Reconciliation
Finally, there are rare cases where neither order query nor asynchronous notification yields the payment result. That leaves one last-resort solution: reconciliation.
If the reconciliation file the channel sends the next day contains the payment result, we can update our internal payment record directly from that entry.
I have written about reconciliation before. If you are interested, see: Talk About the Reconciliation System Design Scheme
Aside: to play it safe, you can first initiate a query and then update the order record based on the query result.
In extreme cases where the query returns no result, you can update the internal record directly from the reconciliation file.
If the reconciliation file has no entry for the payment the next day, we can treat the payment as failed. If the user was in fact debited, the channel side will initiate a refund internally and return the amount to the user, so this case needs no handling on our side.
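The reconciliation rule described above fits in a few lines. This is a simplified sketch under stated assumptions: the file is assumed to be already parsed into `(order_no, status)` pairs, orders live in a plain dictionary, and the status names are invented for the example. Real reconciliation files also carry amounts and fees that need to be compared.

```python
def reconcile(internal_orders, channel_file_rows):
    """Apply the channel's next-day reconciliation file to unresolved orders.

    internal_orders: dict of order_no -> "SUCCESS", "FAILED" or "UNKNOWN"
    channel_file_rows: iterable of (order_no, status) parsed from the file
    """
    settled = dict(channel_file_rows)
    for order_no, status in internal_orders.items():
        if status != "UNKNOWN":
            continue  # already resolved by notification or query
        # Present in the file: take the channel's record. Absent: treat as failed.
        internal_orders[order_no] = settled.get(order_no, "FAILED")
    return internal_orders


internal = {"A": "UNKNOWN", "B": "UNKNOWN", "C": "SUCCESS"}
rows = [("A", "SUCCESS"), ("C", "SUCCESS")]
result = reconcile(internal, rows)
print(result)
```

Order B has no entry in the file, so it is closed as failed, which matches the reasoning that the channel refunds any amount it debited for such an order.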
Internal dropped-order exceptions
Order relationships inside the payment system
Next, internal dropped orders. Let's first look at why they occur, which comes down to our system architecture.
As shown in the figure above, inside a third-party payment company there is usually a 1-to-N relationship between payment orders and channel orders.
The payment order stores the external merchant system's order number and represents the relationship between the payment company's internal order and the external merchant's order.
The channel order represents the relationship between the third-party payment company and the external channel. From the external channel's point of view, the third-party payment company is itself an external merchant.
Why design it this way, rather than use the 1-to-1 relationship below?
With the 1-to-1 relationship in the figure above, if the first payment fails, the external merchant may initiate payment again using the same order number.
If the third-party payment company then calls the external channel with the same internal order number, the channel may refuse, because it may not allow the same order number to be submitted twice.
There is an alternative: generate a new internal order number, overwrite the original payment order's internal record, and then call the external channel. But the record of the earlier failed payment is then lost, which makes later statistics harder.
The third-party payment company could also simply refuse repeated requests with the same order number, in which case the external merchant has to generate a new order number.
That keeps the payment company's system simple and pushes all the complexity to the external merchants.
In practice, however, many external merchants cannot easily switch to a new order number, so third-party payment companies generally need to support repeated payment with the same merchant order number after a failure.
For that, we need the 1-to-N order relationship shown earlier.
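The 1-to-N relationship can be illustrated with two small record types. The class names, field names, and the `CH000001`-style number format are assumptions made for this sketch. The idea it shows is that every payment attempt gets its own channel order number, while the merchant's order number stays the same.

```python
from dataclasses import dataclass, field
from itertools import count

_channel_seq = count(1)  # stand-in for a real ID generator


@dataclass
class ChannelOrder:
    channel_order_no: str  # unique per attempt; this is what the external channel sees
    status: str = "PENDING"


@dataclass
class PaymentOrder:
    merchant_order_no: str  # the external merchant's number, reused across retries
    status: str = "PENDING"
    channel_orders: list = field(default_factory=list)

    def new_attempt(self):
        """Create a fresh channel order for each payment attempt."""
        attempt = ChannelOrder(f"CH{next(_channel_seq):06d}")
        self.channel_orders.append(attempt)
        return attempt


payment = PaymentOrder("M-0001")
first = payment.new_attempt()
first.status = "FAILED"         # the first attempt fails; its record is kept
second = payment.new_attempt()  # the merchant retries with the same order number
second.status = "SUCCESS"
payment.status = "SUCCESS"
```

Because the failed attempt remains attached to the payment order, failure statistics stay available, and the external channel never sees the same order number twice.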
Causes of internal dropped orders
When we receive a success message from the external channel, the channel order table is updated successfully. But the channel order table and the payment order table may live in different databases, or even in different applications, so the update to the payment order table can still fail.
The payment order table holds the relationship between the external merchant's order and the internal order. If its update fails, the external merchant cannot query the successful payment result.
At this point the channel order has already been marked successful, so the methods for external dropped orders above do not help with internal ones.
Solutions for internal dropped orders
The first solution is distributed transactions.
Internal dropped orders arise essentially because the payment order table and the channel order table cannot share a single database transaction that guarantees both updates succeed or fail together.
What we really need here is a distributed transaction.
However, we did not adopt distributed transactions: when we built the system there was no mature open-source distributed transaction framework available, and building one ourselves would have been very difficult.
So I don't have much experience with distributed transactions. If you have used them to solve this kind of problem, please leave a comment.
The second solution is asynchronous compensation.
When an internal dropped order occurs, that is, when updating the payment order fails, the payment order can be saved to an internal dropped-order table.
There is a catch, though: saving to the internal dropped-order table is not guaranteed to succeed either.
So we also periodically look for payment orders that have remained unpaid for some time while their channel order is already marked paid, and insert those into the internal dropped-order table as well.
A separate application then periodically scans the internal dropped-order table, updates each payment order to successful, and deletes the corresponding dropped-order record.
Note that these periodic queries can be slow when the payment order table is large, so run them against the standby database to avoid affecting the primary.
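The compensation flow can be sketched with an in-memory SQLite database. This is only an illustration under simplifying assumptions: the table and column names are invented, all three tables sit in one database (in the scenario described they may be in different databases or applications), and the "unpaid for a period of time" age filter is left out for brevity.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE payment_order (payment_no TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE channel_order (channel_no TEXT PRIMARY KEY, payment_no TEXT, status TEXT);
    CREATE TABLE internal_drop (payment_no TEXT PRIMARY KEY);
    INSERT INTO payment_order VALUES ('P1', 'PENDING'), ('P2', 'SUCCESS'), ('P3', 'PENDING');
    INSERT INTO channel_order VALUES ('C1', 'P1', 'SUCCESS'), ('C2', 'P2', 'SUCCESS'),
                                     ('C3', 'P3', 'FAILED');
""")


def collect_internal_drops(conn):
    """Find payment orders still unpaid although a channel order succeeded."""
    conn.execute("""
        INSERT OR IGNORE INTO internal_drop (payment_no)
        SELECT p.payment_no
        FROM payment_order p
        JOIN channel_order c ON c.payment_no = p.payment_no
        WHERE p.status = 'PENDING' AND c.status = 'SUCCESS'
    """)
    conn.commit()


def repair_internal_drops(conn):
    """Mark each dropped payment order as paid, then delete its drop record."""
    rows = conn.execute("SELECT payment_no FROM internal_drop").fetchall()
    for (payment_no,) in rows:
        conn.execute(
            "UPDATE payment_order SET status = 'SUCCESS' WHERE payment_no = ?",
            (payment_no,),
        )
        conn.execute("DELETE FROM internal_drop WHERE payment_no = ?", (payment_no,))
    conn.commit()
    return [payment_no for (payment_no,) in rows]


collect_internal_drops(db)
repaired = repair_internal_drops(db)
print(repaired)
```

`INSERT OR IGNORE` keeps the collection step idempotent, so an order already recorded by the failing update path is not inserted twice. In production the collection query would run against the standby database, as noted above.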
Conclusion
Today I mainly covered dropped-order exceptions in payment systems, where the user has actually been debited but the merchant's order still shows as awaiting payment.
If this exception is not handled properly, it leads to a poor user experience and possibly to customer complaints.
Dropped orders can originate in external or internal systems. Most are caused by external systems. Increasing timeouts, querying dropped orders, and accepting asynchronous notifications solve 99% of these, and the remaining 1% can only be resolved through reconciliation the next day.
Dropped orders caused by internal systems are a typical data consistency problem in a distributed environment. We do not need strong consistency here; eventual consistency is enough. We can solve the problem with distributed transactions, or periodically scan for orders in inconsistent states and update them in batches.
Finally, this article covered only one class of exception in payment systems. The next article will cover others. Stay tuned!