Preface

It has been more than two years since I moved from e-commerce to finance. Because the two differ in business model and traffic, we stepped into many pits along the way, especially in monitoring. Early on, the lack of monitoring caused online incidents. After some fumbling we arrived at several reasonably workable monitoring methods that effectively safeguard the stability of the overall platform and the business. Here I summarize and share them, hoping to give you some ideas for monitoring in financial scenarios. If you have better ideas, you are welcome to discuss them.

This article covers the following:

  1. Common monitoring approaches in the e-commerce mall scenario
  2. Difficulties in financial monitoring
  3. Several reliable means of monitoring in financial scenarios

Common monitoring approaches in the e-commerce mall scenario

There are two main types of monitoring in e-commerce: traffic monitoring (interface requests) and monitoring of key nodes (such as registration and order placement).

For both types we commonly use event tracking: we record a tracking point on every interface request and every time a key node is reached, so that we can monitor by comparing today's tracking data with yesterday's. The following is our tracking data for one key event.

As shown in the chart, green represents today's data and yellow represents yesterday's.

With yesterday's and today's data, monitoring is simple: we compare the two for the same time period. If today's data drops by more than 50% compared with yesterday's, the path leading to this key node may have a problem and an alarm can be triggered, as shown in the following figure.
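The day-over-day comparison described above can be sketched as follows. This is a minimal illustration, not the article's production code: the class name `DropAlarm`, the method `shouldAlarm` and the handling of a missing baseline are assumptions made for the example.

```java
/** Day-over-day drop check for one tracking point (hypothetical helper). */
final class DropAlarm {
    // Alarm when today's count falls more than 50% below yesterday's.
    static final double MAX_DROP_RATIO = 0.5;

    /**
     * @param yesterdayCount tracking count for a period yesterday
     * @param todayCount     tracking count for the same period today
     * @return true if the drop exceeds MAX_DROP_RATIO
     */
    static boolean shouldAlarm(long yesterdayCount, long todayCount) {
        if (yesterdayCount <= 0) {
            return false; // assumption: without a baseline we do not alarm
        }
        double drop = (yesterdayCount - todayCount) / (double) yesterdayCount;
        return drop > MAX_DROP_RATIO;
    }

    public static void main(String[] args) {
        System.out.println(shouldAlarm(1000, 400)); // 60% drop
        System.out.println(shouldAlarm(1000, 700)); // 30% drop
    }
}
```

With large per-minute counts the ratio is stable. With counts of only a few events per period, a single missing event swings it heavily, which is the weakness discussed in the next section.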

There are two main reasons why these two types of monitoring work in an e-commerce mall:

First, traffic in an e-commerce mall is large, so the per-minute tracking counts are large and the error of an alarm triggered by a percentage drop is relatively small, which makes the approach feasible. Large traffic also means that any fault shows up quickly: an unavailable page, for example, causes complaints or a sharp drop in some metric within a short time. These signals give us early warning and let us spot problems in time.

Second, an e-commerce mall has relatively few key nodes, mainly “add to cart” and “place order”. Fewer key nodes mean fewer places where things can go wrong. When the percentage-drop alarm described above fires, we only need to check whether the path to that key node has a problem. Because there are few key nodes, the core path that needs attention is relatively short, so troubleshooting is relatively easy.

Difficulties in financial monitoring

The two schemes described in the previous section do not apply to finance, mainly because financial operations are low-frequency. Daily active users in an e-commerce mall can normally reach tens of millions, whereas in finance the figure is many times smaller. This means the tracking count for each key node may be only a few dozen per hour or less, and there may be stretches of tens of minutes in which the count for a key node is 0, so a percentage-drop alarm cannot be used. To better explain the pain points of monitoring a financial business, let me first briefly introduce the business.

Introduction to Financial Business

At present, we are mainly engaged in the cash loan business, which is a loan facilitation business. Loan facilitation means that the platform does not issue loans directly; it uses its own strengths, such as customer acquisition and post-loan management, to match borrowers with funders and complete the financing, and the platform charges a handling fee. The main business process is as follows:

The platform chooses a matching funder to grant credit to each user. Funders have different risk-control strategies, so core indicators such as the credit approval rate and the loan approval rate naturally differ. Leading funders such as Mashang Consumer Finance or 360 Jietiao have higher approval rates, so we give them more traffic, while some lesser-known funders perform less well on these key indicators and are naturally allocated less traffic.

After a funder has been matched to each user, the following cycle usually occurs:

  1. Pre-loan: the credit-granting stage. For a funder to grant you a credit limit, you have to provide your ID card, education background and other personal information, so that the funder can evaluate your creditworthiness and decide whether to grant a limit.
  2. Loan: the borrowing stage. After credit has been granted, the user can borrow money.
  3. Post-loan: the repayment stage.

As you can see, every stage, especially pre-loan and loan, has a great many key nodes in its core flow. Many key nodes mean a steep funnel: the further down the flow, the fewer users remain. Tracking counts for key nodes such as credit submission or loan submission may be only a few thousand per day, which averages out to a few per minute or even less. Finance is also inherently very low-frequency, and user behavior is highly uncertain. There may be 50 people who submit a credit application between 8 and 9 o'clock today, while the number for the same hour the next day drops to single digits. Swings like this would certainly trigger alarms in e-commerce but are normal in finance. As a result, the following often happened early on: the monitoring chart showed a huge difference between two days in the count of a key node (such as credit submissions or successful loans) for some funder in the same period, yet investigation found nothing wrong with the path. This caused us a lot of trouble.

From the above, it should be clear that e-commerce mall monitoring cannot simply be copied to finance. We have to design a monitoring system around the low-frequency nature of financial scenarios.

Several reliable means of monitoring in financial scenarios

1. Monitor each funder's success count or success rate in each stage (before and after the loan)

Given the low-frequency nature of finance described above, we designed a relatively effective monitoring scheme based on the following idea. Although the pre-loan, loan and post-loan stages each contain many key nodes, we do not need to monitor all of them. We only need to track and monitor each funder's successful outcomes for each critical stage (credit success count, loan success count, credit approval rate, loan approval rate), because if credit granting or borrowing succeeds, the pre-loan and loan flows leading up to it are working.

Note that credit and loan successes must be monitored separately for each funder. Counting total successes is meaningless because each funder has its own risk-control strategy and traffic allocation. If we judged the flow by the total, a funder could adjust its risk-control strategy one day (or hit some other bug) so that all of its credit grants or loans fail, and we would not notice.
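A small sketch of why per-funder counts matter. The class `PerFunderCounter`, the funder names and the event list are all hypothetical, chosen only to illustrate the point that a healthy total can hide a funder with zero successes.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Groups success events by funder (hypothetical sketch). */
final class PerFunderCounter {
    /**
     * @param funders       all integrated funders, so that funders with no successes still appear
     * @param successEvents one entry (the funder name) per successful credit grant or loan
     * @return success count per funder, including zeros
     */
    static Map<String, Integer> count(List<String> funders, List<String> successEvents) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String funder : funders) {
            counts.put(funder, 0);
        }
        for (String funder : successEvents) {
            counts.merge(funder, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count(
                List.of("funderA", "funderB", "funderC"),
                List.of("funderA", "funderA", "funderC", "funderA"));
        // The total is 4, which looks fine, but funderB has no successes at all.
        System.out.println(counts);
    }
}
```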

Of course, as mentioned above, a funder's count of successful credit grants or loans may well be 0 for tens of minutes, so we alarm on hourly totals instead. We record the number of successes in every hour of every day, and every half hour we compare the number of successes in the last X hours today with the average for the same period over the past week. If today's number is lower than half of the past week's average, the path may have a problem and an alarm is raised. How is X chosen? If the total number of successes in the last hour is below 20 (the threshold should be chosen based on actual conditions), we compare the last 2 hours today against the same period over the past week. If the total is still below 20, we use the last 3 hours, and so on until the total over the last X hours reaches 20, which keeps the error relatively small. So far the accuracy of alarms raised this way has been 100%, and they have uncovered many production problems. The DingTalk alarm looks like this:

High-quality funders have high approval rates and receive a lot of traffic, so their hourly success counts are relatively high, and comparing with the same period's average over the past week works well for them. Funders with low approval rates, however, may have only a handful of successes in a whole day, so the method above would produce false alarms. For these we lengthen the time window and count the funder's successes over the last 8 hours. If the count is 0, there may be a problem:

In this way we also found many cases where a funder's risk-control adjustment reduced the number of successful credit grants or loans, and we notified the funder promptly so the problem could be resolved.
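The two rules in this section can be sketched as follows. The class and method names are hypothetical. One assumption goes beyond the text: the window is widened until the past-week average total reaches the minimum sample size, so that a complete outage today (zero successes) still produces a comparison. The text does not say which of the two totals the threshold of 20 applies to.

```java
/** Success-count alarms for a single funder (hypothetical sketch). */
final class FunderSuccessAlarm {
    static final int MIN_SAMPLES = 20;       // threshold from the text; tune per business
    static final double ALARM_RATIO = 0.5;   // alarm when today < half of the weekly average
    static final int LOW_VOLUME_WINDOW = 8;  // look-back hours for low-volume funders

    /**
     * Widens the window hour by hour until the baseline has enough samples,
     * then compares today's total with the weekly average for the same hours.
     *
     * @param today     hourly success counts today, index 0 = most recent hour
     * @param weeklyAvg past-week average success counts for the same hours
     */
    static boolean belowWeeklyAverage(long[] today, double[] weeklyAvg) {
        long todaySum = 0;
        double baselineSum = 0;
        int hours = Math.min(today.length, weeklyAvg.length);
        for (int x = 0; x < hours; x++) {
            todaySum += today[x];
            baselineSum += weeklyAvg[x];
            if (baselineSum >= MIN_SAMPLES) { // assumption: threshold applies to the baseline
                return todaySum < baselineSum * ALARM_RATIO;
            }
        }
        return false; // not enough data in the available window to judge
    }

    /** For low-volume funders: alarm when there was no success in the last 8 hours. */
    static boolean noSuccessInWindow(long[] today) {
        if (today.length < LOW_VOLUME_WINDOW) {
            return false; // not enough history yet
        }
        for (int i = 0; i < LOW_VOLUME_WINDOW; i++) {
            if (today[i] > 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The baseline reaches 20 only after 3 hours (8 + 7 + 9 = 24); today has 6 < 12.
        System.out.println(belowWeeklyAverage(new long[] {3, 2, 1}, new double[] {8, 7, 9}));
        System.out.println(noSuccessInWindow(new long[] {0, 0, 0, 0, 0, 0, 0, 0}));
    }
}
```

A scheduler running every half hour would call `belowWeeklyAverage` for each high-volume funder and `noSuccessInWindow` for each low-volume one, sending a DingTalk message when either returns true.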

2. Use aspects (AOP) to find and resolve funder-side anomalies in time

So far we have integrated more than 20 funders. Each has its own interface specification and the interfaces all differ, so there may be hundreds or thousands of interfaces in total, which creates hidden risks. If an interface request fails because of a bug in our code or an internal problem at the funder (usually the interface returns a failure status code), it is hard for us to notice. Some would say this is simple: just raise a request-error alarm whenever the interface returns a failure status code.

There are two problems with that:

First, where should the alarm code be written? Some would say in the low-level request code of each funder. But that couples monitoring code tightly to business code. We have integrated many funders, and adding alarm code to every funder's low-level request file is a huge amount of work. It is also easy to forget the alarm code when a new funder is integrated.

Second, not every interface that returns a failure status code should raise an alarm. Some failures are normal, such as “account is permanently frozen” and “repayment is not allowed on the day of loan”. These are not caused by bugs, so we do not care about them. We only care about failures that are obviously bugs, such as “name cannot be empty”, so the normal failures need to be filtered out of the alarms.

Consider the first problem. The low-level interface request pseudocode for each funder looks like this:

// Call the funder's interface
String result = httpPost();
Response response = JSON.parseObject(result, Response.class);
// If the funder returned a failure status code, throw an exception
// that carries the failure message returned by the funder
if (!SUCCESS_CODE.equals(response.getCode())) {
    throw new CustomException(ErrorCodeEnum.ERROR_REQUEST_EXCEPTION, response.getMessage());
}

Clearly, if we intercept all of these exceptions in an aspect, we can write the alarm logic for them in the aspect, with no intrusion into existing code, and manage all alarms in one place. That solves the first problem neatly.

The pseudocode of the aspect implementation is as follows:


import org.aspectj.lang.annotation.AfterThrowing;
import org.aspectj.lang.annotation.Aspect;

@Aspect
public class LoggingAspect {
    // Intercept only the package that holds all funder request classes
    @AfterThrowing(pointcut = "execution(* com.howtodoinjava.app.service.impl.*.*(..))", throwing = "ex")
    public void logAfterThrowingAllMethods(CustomException ex) {
        // Send the thrown error message to the DingTalk alarm
        sendDingWarning(ex.getMessage());
    }
}

The second problem is how to filter out exceptions thrown by normal failed requests that we do not care about. The idea is also simple: set up a whitelist. When we find that a failed request is a normal failure, we add its failure message to the whitelist. Then, whenever a funder request fails and throws an exception, we check whether the failure message is in the whitelist. If it is, the failure is normal and no alarm is needed. If it is not, an alarm is triggered. After this change, our alarm process looks as follows:

One small point needs attention: how to configure the whitelist. At first we did not know which failures were normal, so we began by alarming on every failed request. Each time we identified a normal failure, we added its failure message to the whitelist. The whitelist therefore changes continuously, and changes need to take effect in real time. We chose QConf, open-sourced by 360, to configure our whitelist.

QConf is a distributed configuration management service based on Zookeeper. It is dedicated to separating configuration content from code and providing configuration access and update services in a timely, reliable and efficient manner. Once QConf was used for configuration, whitelist management was solved.
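The whitelist check itself is small. A minimal sketch, assuming exact matching on the failure message: the class `FailureWhitelist` is hypothetical, and an in-memory concurrent set stands in for the QConf-backed configuration, whose client API is not shown here.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Decides whether a funder failure message should raise an alarm (hypothetical sketch). */
final class FailureWhitelist {
    // Stand-in for the dynamically updated configuration; safe to modify at runtime.
    private final Set<String> normalFailures = ConcurrentHashMap.newKeySet();

    /** Marks a failure message as a normal, non-bug failure. */
    void add(String failureMessage) {
        normalFailures.add(failureMessage);
    }

    /** Alarm on every failure message that has not been marked as normal. */
    boolean shouldAlarm(String failureMessage) {
        // A missing message is treated as unknown, so it alarms.
        return failureMessage == null || !normalFailures.contains(failureMessage);
    }

    public static void main(String[] args) {
        FailureWhitelist whitelist = new FailureWhitelist();
        whitelist.add("account is permanently frozen");
        System.out.println(whitelist.shouldAlarm("account is permanently frozen"));
        System.out.println(whitelist.shouldAlarm("name cannot be empty"));
    }
}
```

In the aspect, the call to the alarm sender would be guarded by `shouldAlarm(ex.getMessage())`. Starting with an empty whitelist matches the rollout described above: everything alarms at first, and entries are added as normal failures are identified.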

3. Implement circuit breaking and degradation for funders

In our business scenario a single user request is likely to call the interfaces of several funders. If one funder's service has a problem, user requests that depend on that funder's interface will hang. In fact, almost every request in our business calls funder interfaces, which means that as long as one funder's service is unavailable, our business becomes unavailable. That is obviously unacceptable.

As shown in the figure, one user request calls multiple funders. If funder B's interface service hangs, the user request may hang as well, and the whole business may even become unusable!

Therefore we must introduce a circuit-breaking and degradation mechanism. When a funder's service becomes abnormal (for example, call timeouts or a rising proportion of exceptions), all calls to that resource are automatically cut off and fail fast for the duration of the degradation time window, which avoids the serious risk of our service becoming unavailable.

We use Alibaba's Sentinel for circuit breaking and degradation. Sentinel offers several circuit-breaking strategies; we use the one that trips the breaker when the number of exceptions within 1 minute exceeds a threshold. While the breaker is open, a request to the funder interface throws a “DegradeException”, which is also caught in the aspect and raises an alarm.
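To illustrate the exception-count strategy, here is a simplified hand-rolled breaker. It is not Sentinel's API or implementation; the class `FunderCircuitBreaker`, its parameters and the fixed statistics window are assumptions chosen to show the idea of tripping after too many errors and failing fast during the degradation window.

```java
import java.util.function.LongSupplier;

/** Minimal exception-count circuit breaker for one funder (illustrative only). */
final class FunderCircuitBreaker {
    private final int maxErrors;       // errors tolerated per statistics window
    private final long statWindowMs;   // statistics window, e.g. 60_000 for 1 minute
    private final long openMs;         // how long calls fail fast once tripped
    private final LongSupplier clock;  // time source in milliseconds, injectable for tests

    private long windowStart;
    private int errors;
    private long openUntil = Long.MIN_VALUE;

    FunderCircuitBreaker(int maxErrors, long statWindowMs, long openMs, LongSupplier clock) {
        this.maxErrors = maxErrors;
        this.statWindowMs = statWindowMs;
        this.openMs = openMs;
        this.clock = clock;
        this.windowStart = clock.getAsLong();
    }

    /** Returns false while the breaker is open, so the caller can fail fast. */
    synchronized boolean allowRequest() {
        return clock.getAsLong() >= openUntil;
    }

    /** Records a failed call and trips the breaker once errors exceed the threshold. */
    synchronized void recordError() {
        long now = clock.getAsLong();
        if (now - windowStart >= statWindowMs) {
            windowStart = now; // start a fresh statistics window
            errors = 0;
        }
        errors++;
        if (errors > maxErrors) {
            openUntil = now + openMs; // open the breaker for the degradation window
            windowStart = now;
            errors = 0;
        }
    }

    public static void main(String[] args) {
        FunderCircuitBreaker breaker =
                new FunderCircuitBreaker(2, 60_000L, 10_000L, System::currentTimeMillis);
        breaker.recordError();
        breaker.recordError();
        breaker.recordError(); // third error within the window exceeds the threshold of 2
        System.out.println(breaker.allowRequest());
    }
}
```

A caller would check `allowRequest()` before calling the funder and return a degraded result when it is false, so one failing funder cannot hold up the whole user request.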

Conclusion

This article summarized several ways to design monitoring for a low-frequency business such as finance. Together they largely guarantee the stability of the overall business. For the first method (monitoring success counts and success rates), predicting behavior with machine learning would probably be more efficient and more timely, but our team has no experience in that area, so for now we monitor success counts and success rates instead. If we get the chance to introduce machine learning in the future, I believe it will work well. If you have a better monitoring method, feel free to share it.

Finally, you are welcome to follow my WeChat official account so that we can exchange ideas and improve together!