Risk control is something of a mystery at most companies: it is rarely discussed online, and for security reasons its architecture and design are seldom disclosed. Having taken part in building a risk control system, I would like to talk about the technology behind it. (This article is adapted from slides I shared internally. It discusses the construction of a risk control system from a purely technical perspective and does not involve any internal company secrets. Because space is limited, some details cannot be explained in full.)
Risk control architecture evolution
Over more than a year of construction, the company's internal risk control system has evolved from an architecture in which risk control lived mainly in business code, to a platform-based second-generation architecture, then to a 2.5-generation architecture with dynamic and offline data models, and it is now evolving toward a third-generation architecture based on deep learning and online data models.
Admittedly, the company's risk control system is still relatively weak compared with those of large companies such as Alibaba, Tencent and JD.com, but the company is satisfied with having brought the whole system to its present level with limited resources.
Technical architecture
First, at the business and architecture levels, the current technical architecture of risk control is divided into five systems: the storage system, identification system, support system, operation system and data computing system.
The storage system includes HBase, MySQL, Redis, Elasticsearch (ES) and Hive, all of which are existing frameworks or open source projects.
The identification system includes the control platform (control system, batch system, decision system, bus system), the punishment platform (punishment system), the analysis platform (rule system, model system) and the data platform (data system, operational data system).
The support system mainly refers to the back-office configuration system.
The operation system mainly refers to the risk control operation system and the Kibana reporting system.
The data computing system mainly refers to the big data and offline computing platform and the data analysis services built on it.
The call relationship is shown as follows:
Business architecture
Second, consider the business architecture of the whole system. At present it provides the following capabilities in preliminary form:
Business capabilities: marketing cheating, transaction fraud, login and registration prevention and control, content prevention and control
Data model capabilities: user profiling and risk rating, association reverse lookup, risk dashboard, various reports, etc.
Operation capabilities: user early warning, merchant early warning, case review, comprehensive information query
The system also classifies its existing data into its own data assets, which fall into five categories: list, user, device, environment and location.
Performance of risk control system
The following figure shows the stress test results in the production environment. With 12,000 concurrent users, the system reached about 80,000 TPS, with an average response time of 141 ms and an error rate of 5 in 10,000 (0.05%).
In total, 170 million valid requests were accumulated, and the data volume reached 8 TB.
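As a rough sanity check, the quoted figures can be compared against Little's law (concurrency = throughput x response time). The sketch below assumes a closed-loop load test with no think time between requests, which the article does not state, so treat the result as an order-of-magnitude check only.

```python
# Figures quoted in the text above.
concurrent_users = 12_000
avg_response_s = 0.141      # 141 ms
errors_per_10k = 5

# Little's law, assuming zero think time (an assumption, not from the article):
# throughput ~= concurrency / average response time.
implied_tps = concurrent_users / avg_response_s
error_rate_pct = errors_per_10k / 10_000 * 100

print(f"implied throughput: {implied_tps:,.0f} requests/s")
print(f"error rate: {error_rate_pct:.2f}%")
```

Under that assumption the implied throughput lands in the same range as the reported TPS, so the three numbers are at least mutually consistent.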
Difficulties in risk control system construction
Flexible and efficient onboarding: a business usually has to be integrated within one week or less, and the services are complex and diverse; how can release errors and incidents be reduced?
Extremely short response times: a business typically allows a timeout of only 100 ms, and at most 200 ms.
High concurrency and throughput requirements: many services are integrated and the request volume is large; some businesses even rely on risk control to fend off attacks.
Large-scale data processing: the data volume is large, so how can it be used effectively? Requirements for data query and backtracking are also high.
Escalating confrontation: attackers constantly try to guess the internal rules; how can data serve the adversarial effort?
Stability during major promotions: how can the system be kept from breaking down when the call volume surges? How can it keep serving even if something goes wrong?
The next chapter will describe in detail how to solve these difficulties.
What else?
Multi-tenant and open platform services
Because multiple internal subsidiaries all want to run their controls through the risk control system, multi-tenant data isolation has become more and more important. Different tenants use the same platform and the same system, but every interface, data computation and report shows only the data belonging to that tenant, and a tenant cannot overstep its authority to reach another tenant's data. Multi-tenancy is essentially a form of permission control, but its isolation is deeper and more thorough than ordinary permissions.
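One way to picture this kind of isolation is a store wrapper that is bound to a single tenant and applies the tenant filter on every read and write. This is only a minimal sketch: `RiskEvent`, `TenantScopedStore` and the in-memory list are hypothetical names chosen for illustration, not the system described in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskEvent:
    tenant_id: str
    user_id: str
    score: float


class TenantScopedStore:
    """View over shared storage that only exposes one tenant's rows."""

    def __init__(self, events, tenant_id):
        self._events = events        # shared list standing in for a real table
        self._tenant_id = tenant_id  # fixed at construction, never taken from the caller

    def add(self, user_id, score):
        # Writes are always stamped with the bound tenant.
        event = RiskEvent(self._tenant_id, user_id, score)
        self._events.append(event)
        return event

    def query(self, min_score=0.0):
        # Reads always carry the tenant filter, so other tenants' rows are invisible.
        return [e for e in self._events
                if e.tenant_id == self._tenant_id and e.score >= min_score]


shared = []
tenant_a = TenantScopedStore(shared, "subsidiary-a")
tenant_b = TenantScopedStore(shared, "subsidiary-b")
tenant_a.add("u1", 0.9)
tenant_b.add("u2", 0.2)
print([e.user_id for e in tenant_a.query()])
```

The design point is that callers never pass a tenant id per request; the scope is fixed when the store is created, so a forgotten filter cannot leak data across tenants.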
In addition, because risk control has accumulated a good deal of data and services, many external systems want to share them. Offering part of the risk control business as an open platform service is therefore also an important step in taking risk control further.
Rule effectiveness analysis
Rules can initially only be set from experience, so data is needed to measure whether a rule is good or not. Rule effectiveness analysis is used for exactly this purpose.
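The article does not say which metrics are used. A common choice, assumed here, is to compare the events a rule fired on with the events later confirmed as fraud in case review, and to report precision, recall and hit rate. The function name and input shapes below are hypothetical.

```python
def rule_effectiveness(fired, fraud):
    """Compare a rule's hits against confirmed fraud labels.

    fired: set of event ids the rule flagged.
    fraud: dict mapping every reviewed event id to True (fraud) or False.
    """
    tp = sum(1 for e, is_fraud in fraud.items() if is_fraud and e in fired)
    fp = sum(1 for e, is_fraud in fraud.items() if not is_fraud and e in fired)
    fn = sum(1 for e, is_fraud in fraud.items() if is_fraud and e not in fired)
    return {
        # Of the events the rule flagged, how many were really fraud?
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Of the real fraud events, how many did the rule catch?
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # What share of all reviewed events did the rule fire on?
        "hit_rate": (tp + fp) / len(fraud) if fraud else 0.0,
    }


labels = {1: True, 2: True, 3: False, 4: False, 5: False}
stats = rule_effectiveness(fired={1, 3}, fraud=labels)
print(stats)
```

A rule with high hit rate but low precision mostly disturbs normal users, while one with high precision but low recall is safe to keep but catches little, so the numbers are best read together.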
Anti-fraud based on neural network
Following the paper “Session Based Fraud Detection”, the click sequence of each session is fed into an RNN, which is given suitable risk samples so that it learns to identify risky sessions. More could be done against fraud in this way, but neural networks are so poor at interpretability that when a decision is disputed there is no way to justify it. Still, the output can serve as one criterion.
Intelligent operations
For the online system, the goal is to move from intelligent operations and maintenance to intelligent decision-making, and from circuit breaking to self-healing, so that the system can eventually make decisions automatically and repair some problems on its own.
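One common way to get from fusing (circuit breaking) to a simple form of self-healing is a circuit breaker that opens after repeated failures, fails fast to a fallback while open, and then lets a trial call through after a cool-down. The sketch below is a generic illustration under those assumptions; the class name, thresholds and injectable clock are hypothetical and not taken from the system described here.

```python
import time


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=5.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                return fallback()  # open: fail fast without touching the dependency
            # Cool-down elapsed: half-open, allow this one trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            return fallback()
        # Success closes the circuit again: the "self-healing" step.
        self._failures = 0
        self._opened_at = None
        return result
```

For a risk control call, the fallback would typically be a default decision (for example, let the request pass and review it later), so that a failing dependency degrades accuracy rather than availability.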
TimeWheel algorithm
For applications with strict latency requirements such as risk control, timeouts need to be enforced on every request. The usual way to do this is to start a new thread per request and enforce the timeout by limiting that thread's execution time. Under heavy load, however, the large number of threads burdens the whole system with CPU context switches. The alternative is to use a single thread to track the timeouts of all computations, which greatly reduces the number of threads and therefore the CPU load.
The following is an algorithm called TimeWheel, which is modeled on the way timeouts are tracked after TCP packets are sent. Imagine if the operating system created a new thread to time each TCP packet: it would have collapsed long ago.
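A minimal sketch of a hashed timing wheel may make the idea concrete. It assumes a fixed tick length and leaves the driver out: in a real service a single thread would call `tick()` at a fixed interval (say every 10 ms). The class and method names are hypothetical and this is not the implementation used in the system described here.

```python
class Timer:
    def __init__(self, rounds, callback):
        self.rounds = rounds        # full wheel revolutions left before firing
        self.callback = callback
        self.cancelled = False

    def cancel(self):
        # Called when the request finishes in time; the wheel drops it lazily.
        self.cancelled = True


class TimingWheel:
    def __init__(self, slots=8):
        self._slots = [[] for _ in range(slots)]
        self._cursor = 0

    def schedule(self, delay_ticks, callback):
        if delay_ticks < 1:
            raise ValueError("delay_ticks must be >= 1")
        n = len(self._slots)
        slot = (self._cursor + delay_ticks) % n
        rounds = (delay_ticks - 1) // n   # extra revolutions for long delays
        timer = Timer(rounds, callback)
        self._slots[slot].append(timer)
        return timer

    def tick(self):
        """Advance one slot and fire every timer that is due there."""
        self._cursor = (self._cursor + 1) % len(self._slots)
        due, remaining = [], []
        for timer in self._slots[self._cursor]:
            if timer.cancelled:
                continue
            if timer.rounds == 0:
                due.append(timer)
            else:
                timer.rounds -= 1
                remaining.append(timer)
        self._slots[self._cursor] = remaining
        for timer in due:
            timer.callback()
        return len(due)
```

Scheduling and cancelling are both constant-time, and each tick only touches the timers in one slot, which is why one thread can watch the timeouts of a very large number of in-flight requests.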