What is capital loss
Capital loss usually refers to the loss of funds in payment scenarios, and it can be viewed from two perspectives:
- From the user's point of view: more money is deducted from the user than should be, causing the user to lose money. Such problems generally have to be reported back through customer service or other channels; the extra money can be returned to the user, but the user experience is badly damaged.
- From the company's point of view: mainly paying out too much money, shipping too many goods, crediting too much recharge, and similar situations. This kind of loss is generally difficult to recover, so it is a real loss of assets.
For example, an e-commerce business may involve all of the above. Business logic is distributed across services and synchronized through callbacks, messages, and other mechanisms. If an interaction is lost, it can lead to abnormal inventory, incorrect fund settlement or conversion, documents stuck in abnormal intermediate states, repeated requests triggered by retry logic, improper concurrency control, and so on, all of which can result in the loss of assets or funds.
To prevent such situations, besides rigorous testing beforehand and post-incident analysis and remediation afterwards, events can also be monitored in near real time. For this aspect, the in-house DCheck platform is used for prevention and control.
Getting to know DCheck
This system is led by the "Transaction & Stability" team. Its main goal is to discover data problems in a timely manner and keep data running reliably, especially in scenarios where capital loss can occur. To achieve real-time and effective monitoring, a quasi-real-time check system, DCheck, was built. The platform is based on MySQL binlog monitoring and MQ message-flow subscription; by configuring trigger conditions, rules, task execution, and alarms, it verifies state consistency between the upstream and downstream of each business, as well as the accuracy of debited and credited amounts after calculation.
In-depth DCheck
Functional level
Architectural logic
Concepts
- Topic: a logical database or an MQ subscription
- Event: an Update/Insert operation on a table, or custom MQ message consumption
- Sub-event: the first layer of filtering on the data returned by the event
- Script pool: each script implements two methods, filter and check; filter performs the second layer of data filtering, and check verifies the upstream and downstream business logic
- Rule: the core execution unit that ties the trigger and the check together
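To make these concepts concrete, below is a minimal sketch of what a script in the pool looks like; the BaseScript method signatures here are inferred from the full example later in this article rather than taken from the platform's documentation:

import com.alibaba.fastjson.JSONObject

// Minimal sketch of a script-pool class (signatures inferred from the example below):
// filter() does second-layer filtering on the cleaned event data,
// check() returns "SUCCESS" or a human-readable reason that triggers an alarm.
class DemoScript implements BaseScript {

    @Override
    Boolean filter(JSONObject doneCleanData) {
        // e.g. only keep records of a particular type
        return doneCleanData.getInteger("type") == 101
    }

    @Override
    String check(JSONObject doneCleanData) {
        // compare upstream and downstream state / amounts here
        return "SUCCESS"
    }
}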
DCheck
The functions and use of this system are briefly demonstrated as follows:
Check configuration
Topic management
- Topic: the logical database
- Topic code: the table name
- Topic name: the table's Chinese name
- MQ instance address: use * for binlog
Event management
For binlog topics there are two event types, INSERT and UPDATE, which are generated automatically from the topic configuration above.
Sub-event management
The sub-event is the filtering layer on the cleaned data. In the rule configuration, only records for which the expression returns 'TRUE' go on to the next layer, for example: if( obj.status.toInteger() == 10000 && (obj.type.toInteger() == 101 || obj.type.toInteger() == 301) ) return 'TRUE'; to let everything through, simply return 'TRUE'.
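The same kind of sub-event filter, written out across multiple lines for readability (a sketch; status and type are fields of whichever table is being monitored):

// Only rows whose status is 10000 and whose type is 101 or 301 go on to the next layer;
// returning 'TRUE' unconditionally would pass everything through.
if (obj.status.toInteger() == 10000
        && (obj.type.toInteger() == 101 || obj.type.toInteger() == 301)) {
    return 'TRUE'
}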
Script Pool Management
All executed scripts are written in Groovy and implement the filter and check methods of BaseScript. Internally, the existing script library can be used as a reference:
- filter: performs secondary filtering after the event-level filtering, mainly for cases that cannot be judged from doneCleanData alone and require fetching additional upstream or downstream data, or other comparatively complicated logic.
- check: the verification logic, which mainly verifies that upstream and downstream states are in sync and compares complex calculated results (especially debit and credit amounts).
Code examples:
import com.alibaba.fastjson.JSON
import com.alibaba.fastjson.JSONObject
import groovy.util.logging.Slf4j
import org.springframework.stereotype.Service
import javax.annotation.Resource

/**
 * DCheck: cooling-off period - platform/customer cancels the order - refund the buyer's paid amount
 */
@Slf4j
@Service
class CheckRefundPayForLess30min implements BaseScript {

    @Resource
    private OrderDevOpsApi orderDevOpsApi
    @Resource
    DCheckApi dCheckApi
    @Resource
    private PayServiceFeignClient payServiceFeignClient

    String logTag = "TAG_CheckCrossAndOverSeaRefundPayForLess30min:{}"

    // 1. Order closing time - payment time < 30 minutes
    @Override
    Boolean filter(JSONObject doneCleanData) {
        // Query the payment time
        String unionId = doneCleanData.getString("order_no")
        String payTime = getOrderData(unionId, "payTime", DevOpsSceneEnum.FORWARD_PAY)
        long modifyTime = doneCleanData.getDate("modify_time").getTime()
        long diffTime = modifyTime - Long.valueOf(payTime)
        if (diffTime < 30 * 60 * 1000) {
            return true
        }
        log.info(logTag, "===> does not enter check")
        return false
    }

    @Override
    String check(JSONObject doneCleanData) {
        String subOrderNo = doneCleanData.getString("sub_order_no")
        Result<List<String>> listResult = dCheckApi.queryPayNoBySubOrderNo(subOrderNo)
        if (listResult == null || listResult.getData() == null) {
            return "pay serial number queried by sub-order number via the forward query interface is empty"
        }
        if (listResult.getData().size() > 1) {
            return "more than one pay serial number found for the sub-order number"
        }
        String outPayNo = listResult.getData().get(0)
        RefundQueryRequest refundQueryRequest = new RefundQueryRequest()
        refundQueryRequest.setPayLogNum(outPayNo)
        Result<List<RefundBillDTO>> resp = payServiceFeignClient.queryRefundsByPayLogNum(refundQueryRequest)
        // Check whether the payment query result is empty (report a data error directly) or contains more than one record
        if (resp == null || resp.getData() == null) {
            return "upstream data is empty: payment refund query (by pay serial number)"
        } else if (resp.getData().size() != 1) {
            return "upstream data has multiple records, please confirm the logic: payment refund query (by pay serial number)"
        }
        // Logical check point 1: the payment refund status must be 2
        if (resp.getData().get(0).getStatus() != 2) {
            return "the payment refund status is not 2"
        }
        // Logical check point 2: the trade refund amount must equal the amount returned by the RPC query
        if (resp.getData().get(0).getAmount() != doneCleanData.getLong("amount")) {
            return "the refund amount does not match the trade refund amount"
        }
        return "SUCCESS"
    }

    // Query the corresponding field value from the database
    String getOrderData(String unionNo, String key, DevOpsSceneEnum devOpsSceneEnum) {
        // Internal method omitted...
        return value
    }
}
Rule configuration
The rule configuration combines all of the basic configurations above and is the real execution core. It consists of two main blocks: basic information and the degrade policy.
Basic information: sub-events (searchable, multi-select) + script class (searchable) = the trigger and execution logic; other auxiliary configuration is filled in according to each domain's requirements.
Degrade policy:
- Sampling percentage: the proportion of online traffic that is sampled. For early testing, or for checks that have a large impact on the business, it should not be set to 100%.
- First delay: the delay before the check is triggered. In business processes data synchronization may lag, so to avoid false alarms caused by state that has not yet synchronized, a delay of about 10 seconds is recommended.
- Maximum timeout and validity time: how long the rule remains in effect.
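Purely as an illustration of how these pieces fit together (the platform itself is configured through its web UI, and the field names below are hypothetical), a rule can be thought of as roughly the following combination:

// Hypothetical sketch of a rule; the names are illustrative only, not the platform's actual schema.
def rule = [
    subEvents        : ['trade_order.UPDATE'],        // trigger: sub-events (multi-select)
    scriptClass      : 'CheckRefundPayForLess30min',   // execution: a script from the pool
    samplingPercent  : 10,                             // degrade policy: sample 10% of traffic
    firstDelaySeconds: 10,                             // wait ~10s for downstream data to sync
    validUntil       : '2021-12-31 23:59:59'           // maximum timeout / validity time
]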
Tool use
Exception check
When a check finds abnormal data, an alarm is first sent to the Feishu group or individuals configured for the rule. Clicking the alarm jumps to this page, where the concrete error data can be seen. After confirming whether the problem lies in the script or in the data itself: if it is a script problem, the script can be fixed and the record marked for resending so it is reprocessed; if it is confirmed to be a genuine data problem, locate the problem and stop the loss.
Mock
Because the scripts call RPC interfaces, there is no good way to debug them locally, so you need to configure the rule first and then debug it with the Mock tool. Its main purpose is debugging rules: select the target rule, search for or construct request parameters in JSON format that fit the DCheck scenario, submit the request, and view the response.
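As a rough illustration, a mock request for the refund rule shown earlier might look like the following; the field names come from the script above, while the values are made up:

{
  "order_no": "100888",
  "sub_order_no": "SUB100888",
  "modify_time": "2021-05-11 21:39:34.000",
  "amount": 100
}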
One problem here is that DCheck's internal logic handles script system exceptions uniformly, so in many cases the specific cause cannot be seen, for example whether the check itself failed or an external call did not go through. In that case you need to add more log statements to the script and then trace the specific logic problem through the log platform.
Some tips
Rule Configuration Techniques
- A rule can select multiple events, so script logic that performs the same or similar checks across different events can be merged, reducing the number of rules to maintain.
- Trigger conditions should be configured on the event where possible; minimize filtering inside the script's filter method and reserve code-level filtering for logic that cannot be expressed on the trigger data alone.
Groovy closures
Script data processing involves a lot of list and key-value handling, and Groovy closures can be used to greatly simplify logic that would be verbose in Java.
For example, suppose a query returns objectList in the following format:
[{id=10086, refundNo=RE10086, orderNo=100888, userId=15206, bizType=110, payTool=0, payStatus=404, amount=100, feature=, isDel=0, createTime=2021-05-11 21:39:34.000, modifyTime=2021-05-11 21:39:34.000, moneyFrom=1, countryCode=},
 {id=10087, refundNo=RE10087, orderNo=100999, userId=15206, bizType=202, payTool=0, payStatus=404, amount=400, feature=, isDel=0, createTime=2021-05-11 21:39:34.000, modifyTime=2021-05-11 21:39:34.000, moneyFrom=1, countryCode=}]
- filter: to check whether any element matches a condition, use any:
def filterResult = objectList.any{ it.bizType in [110, 119] && it.payStatus == 404 }; return filterResult
- check: to pick out the value of an element matching a condition, use find:
def amount = objectList.find{ it.bizType == 202 }.amount
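A couple more closure patterns that come up often in check scripts, shown here as a sketch against the same objectList (not taken from the platform's script library):

// Sum the refund amounts of all records of a given business type (0 if none match).
def totalAmount = objectList.findAll{ it.bizType == 110 }.sum{ it.amount } ?: 0

// Collect all refund numbers into a list, e.g. for logging in an error message.
def refundNos = objectList.collect{ it.refundNo }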
More Groovy features are covered in this article: www.jianshu.com/p/5d30f1443…
Some shortcomings of the platform
- There is no convenient debugging method
At present there is no environment in which scripts can be run and debugged locally. Even with the Mock tool, you first have to configure the event and upload the script before you can debug, and script logic problems can only be investigated by repeatedly modifying the script to add logs. On top of that, once debugging passes in the test environment, the whole configuration process has to be repeated in the production environment.
Suggestion: add a Debug button to the pages where sub-events, scripts, and rules are configured online, which returns a debugging result directly from manually set parameters or from a captured record that meets the conditions, ideally together with the relevant logs.
- Keeping the script pool's filter and check methods together causes a lot of redundancy
While developing scripts and configurations we found that much of the filter and check logic is duplicated across scripts, but because the two methods live in the same script, overlapping logic keeps being rewritten and cannot be effectively extracted into shared code.
Suggestion: split filter and check into separate shared pools and configure them independently in rules; a reference or import mechanism could also be provided.
- There is no circuit breaking when real problems occur
Although the effectiveness of the whole platform remains to be seen, even if it later works very well, when a large batch of data problems or fund losses occurs the platform only raises an alarm. Given its mechanism and the technology involved, it is by nature an after-the-fact means of handling, so it cannot achieve timely stop-loss on its own.
Suggestion: in the future, once the platform's checks are accurate enough, linkage with key domains (for example, circuit breaking on critical flows) should be considered so that losses can be stopped in time.
- The platform should have an online/offline switch for rules
At present, whether a rule executes can only be controlled by editing it to set the traffic to 0 or by adjusting the execution time window to disable it; there is no batch operation, which is somewhat inconvenient.
Suggestion: add online/offline switches and support batch operations on traffic percentage, switch state, and alarm recipients.
- The platform could consider adding a dynamic traffic-control mechanism
Because many check points call the interfaces of other domains, heavy check traffic may have a large impact on business systems, especially critical ones; and when an interface changes incompatibly, or a system exception produces a flood of errors, alarms blow up for everyone involved.
Suggestion: classify rules into levels, allow advanced configuration for core services, and automatically reduce traffic when such faults occur.
Article | daqi
Follow us for more technical articles, and let's walk toward the cloud of technology together.