About the author
Jiang Yixin is a technical expert in Ctrip’s Risk Control Department, responsible for the risk control operations platform, data distribution, relationship graph, list service, and other risk control subsystems, and the main implementer of the risk control engine’s performance optimization.
* This article was first published on GitChat; please credit the source when reprinting. *
Ctrip’s online risk control system was officially launched in 2011 to deal with the growing problem of payment fraud. Today it handles more than 100 million real-time risk events per day and preprocesses more than 10 billion pieces of data in quasi-real time. More than 10,000 rules and more than 20 models run in the system. Its scope has expanded from simple payment risk control to many kinds of business risk control (for example malicious hoarding of resources, scalper purchases, and merchants faking orders).
Below is the overall technical architecture of the current online risk control system:
The current architecture is a fairly mainstream one for risk control systems. Its main modules are the decision engine, Counter, list library, user profile, offline processing, offline analysis, and monitoring. Ctrip’s online risk control system reached this stage through three major revisions.
1. Initial version
The first version was launched in 2011 as a service developed in .NET, with SQL Server as the database. The brief architecture diagram is as follows:
At this stage, the risk control service combined all online decision-making functions in one system, including rule judgment, the list library, and traffic calculation. All of this logic was implemented in the database.
Traffic calculation: the system runs each SQL statement in a list (for example, SELECT COUNT(DISTINCT orderId) FROM t1 WHERE… [2]).
Rule judgment: the database stores rules expressed as greater-than, less-than, and equal-to comparisons. After receiving a risk event, the system obtains the traffic value and compares it against the rules to reach the final risk judgment.
List judgment: the database maintains blacklist and whitelist entries (attribute type, attribute value, judgment basis, and so on), and the program checks whether the values in a risk event match the lists.
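The rule judgment step described above can be sketched as follows. This is only a minimal illustration, not the original .NET/SQL implementation: the `Rule` fields, the counter names, the verdict labels, and the "first matching rule wins" order are all assumptions made for the example.

```python
import operator
from dataclasses import dataclass

# The three comparisons the rule table is described as supporting.
OPERATORS = {">": operator.gt, "<": operator.lt, "=": operator.eq}


@dataclass(frozen=True)
class Rule:
    counter: str    # hypothetical name of the traffic value to check
    op: str         # one of ">", "<", "="
    threshold: int
    verdict: str    # hypothetical risk label returned when the rule fires


def judge(counters, rules):
    """Compare traffic values against the rules; the first rule that fires decides."""
    for rule in rules:
        value = counters.get(rule.counter, 0)  # a missing counter counts as 0
        if OPERATORS[rule.op](value, rule.threshold):
            return rule.verdict
    return "PASS"


rules = [
    Rule("orders_per_card_1h", ">", 5, "REJECT"),
    Rule("orders_per_ip_1d", ">", 50, "REVIEW"),
]
verdict = judge({"orders_per_card_1h": 6}, rules)  # "REJECT"
```

In the system described here, the counter values would come from the SQL statistics in the traffic calculation step.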
Given Ctrip’s risk control needs at the time, the system focused on delivering functionality. After it had run online for a while, Ctrip’s business grew and the traffic reaching the risk control system kept rising. The time spent on SQL-based traffic statistics severely limited the system’s response time, which prompted the first performance optimization revision.
2. Traffic query performance optimization
Because the main performance bottleneck at this point was the database-backed traffic query, this optimization focused on how that query was implemented: the original single database was split into multiple databases and tables to spread the load evenly, giving faster response times and higher throughput. The architecture diagram is as follows:
After optimization, the traffic database was separated from the service database. The traffic database used multiple database instances, with traffic details partitioned across them by hash, and statistics were still computed with SQL. Because the load was shared by multiple instances, traffic query performance improved.
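A minimal sketch of hash-based partitioning, assuming the shard key is the dimension being counted (for example a card number). The article does not say which key or hash function was used, so `shard_for`, `ShardedTrafficStore`, and the in-memory lists standing in for database instances are all hypothetical.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index with a hash that is stable across processes.

    Python's built-in hash() is randomized per process for strings,
    so a digest is used instead.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedTrafficStore:
    """Each inner list stands in for one traffic database instance."""

    def __init__(self, shard_count):
        self.shards = [[] for _ in range(shard_count)]

    def record(self, key, order_id):
        self.shards[shard_for(key, len(self.shards))].append((key, order_id))

    def count_distinct_orders(self, key):
        # Only one shard is scanned, which is where the load sharing comes from.
        shard = self.shards[shard_for(key, len(self.shards))]
        return len({order_id for k, order_id in shard if k == key})
```

Because all details for one key land on the same instance, a statistic such as COUNT(DISTINCT orderId) for that key never has to be merged across instances.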
After the new version went online, Ctrip’s business raised further requirements for the risk control system:
1. Easier onboarding: in addition to payment risks, business risks also need risk control support;
2. More external data access: user information, location information, UBT information;
3. Richer rule logic: support rule judgment on arbitrary variables and support more kinds of judgment logic;
4. Higher performance: a 10x increase in traffic, with response times under 1 second;
5. Programming language change: Ctrip was promoting a company-wide move from .NET to Java.
Then came the major overhaul that laid the foundation for today’s online risk control system.
3. Risk control 3.0 (Aegis)
Starting with this version, the risk control system switched entirely to Java, and the core modules were split out into independent services, which defined the boundary of each subsystem as well as its position and role in the overall system. Compared with the earlier function-oriented applications, Aegis is a platform-style risk control system. Here is a simplified system architecture diagram:
Aegis began to use Drools scripts for writing rules, which greatly shortened the rule team’s response time to emergencies: an emergency rule can go online within 10 minutes.
Structurally, the risk control engine is divided into a synchronous engine and an asynchronous engine. The synchronous engine runs the rules and models that determine risk results in real time. The asynchronous engine handles rule/model validation, data distribution, persisting critical data, and so on. Both engines are designed to be stateless so that they can be scaled out quickly.
For traffic statistics, Aegis uses Counter Server, an in-house, customized TSDB-like service. Queries at any precision and over any time window complete within 5 ms. It also supports highly concurrent queries and is thousands of times faster than the SQL implementation. It currently serves tens of billions of queries per day from the services in the risk control system. Below is a brief description of its structure:
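Counter Server’s internals are not described in the text, so the following is only a minimal sketch of one common way a TSDB-like counter can answer window queries without scanning raw events: pre-aggregate counts into fixed-width time buckets and sum the buckets that overlap the window. The class name, the 60-second bucket width, and the in-memory dictionary are assumptions for illustration, not Ctrip’s design.

```python
from collections import defaultdict


class BucketCounter:
    """Pre-aggregates events into fixed-width time buckets."""

    def __init__(self, bucket_seconds=60):
        self.bucket_seconds = bucket_seconds
        self.buckets = defaultdict(int)  # (key, bucket index) -> count

    def add(self, key, timestamp, amount=1):
        self.buckets[(key, int(timestamp // self.bucket_seconds))] += amount

    def count(self, key, now, window_seconds):
        """Sum every bucket that overlaps (now - window_seconds, now].

        The answer is at bucket granularity, so it can over-count by up to
        one bucket width at the old edge of the window.
        """
        first = int((now - window_seconds) // self.bucket_seconds)
        last = int(now // self.bucket_seconds)
        return sum(self.buckets.get((key, b), 0) for b in range(first, last + 1))
```

The cost of a query depends on the number of buckets in the window rather than on the number of events, which is the property that makes windowed counts cheap compared with running COUNT over detail rows.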
At the data preprocessing level, the risk profile and DataProxy services were developed in-house.
The risk profile service provides real-time profiles of users and orders (user level, user behavior labels, order resource labels, and so on) to the engine as input variables for rules and models. Its data comes from real-time engine data, the quasi-real-time MQ data cleaning service, and offline data imports.
The DataProxy service wraps all external interfaces and database access and configures a different caching strategy for each kind of data according to its characteristics, so that 99.9% of requests get the data they need within 10 ms.
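As one illustration of a per-source caching strategy, here is a minimal time-to-live (TTL) cache in front of a slow loader. The article does not say which strategies DataProxy uses, so the `TtlCache` class, the `loader` callback, and the idea of giving each data source its own TTL are assumptions for this sketch.

```python
import time


class TtlCache:
    """Caches loader results for ttl_seconds; the clock is injectable for testing."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self.loader = loader            # hypothetical call to an external interface or database
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.entries = {}               # key -> (expires_at, value)

    def get(self, key):
        now = self.clock()
        entry = self.entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]             # fresh hit: no external call
        value = self.loader(key)        # miss or expired: reload and re-cache
        self.entries[key] = (now + self.ttl_seconds, value)
        return value
```

Under this approach, slowly changing data (such as a user level) could be given a long TTL, while fast-changing data gets a short TTL or bypasses the cache.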
In addition, there are several main services:
List library service: supports multiple independent list libraries and optimizes the list-matching logic so that a single query (10 dimensions) responds in under 10 ms.
Configuration service: centralizes the configuration functions required by each application in the system, so that each application can fetch its corresponding configuration.
Event processing platform: handles events that the engine cannot decide or that require human intervention.
Performance monitoring service: monitors the health of the services in the system and provides early warning and alerting.
Business monitoring service: monitors business-level data such as the running status of rules and models, returned risk results, and event processing time, and provides early warning and alerting.
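To make the list library’s multi-dimension query concrete, the sketch below keeps each library as a set of (attribute type, attribute value) pairs and checks every dimension of an event with a constant-time lookup. The `ListLibrary` class and the sample attribute names are hypothetical; the real service’s storage and matching logic are not described in the text.

```python
class ListLibrary:
    """One independent list library, e.g. a payment blacklist."""

    def __init__(self, name):
        self.name = name
        self.entries = set()  # {(attribute type, attribute value), ...}

    def add(self, attr_type, attr_value):
        self.entries.add((attr_type, attr_value))

    def match(self, event):
        """Return the event dimensions that appear in this library."""
        return [(t, v) for t, v in event.items() if (t, v) in self.entries]


blacklist = ListLibrary("payment_blacklist")
blacklist.add("card", "4111")
blacklist.add("device", "d-9")

event = {"uid": "u1", "card": "4111", "ip": "1.2.3.4", "device": "d-9"}
hits = blacklist.match(event)  # [("card", "4111"), ("device", "d-9")]
```

The cost grows with the number of dimensions in the event, not with the size of the list, which is what keeps a 10-dimension query cheap.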
After Aegis launched, onboarding a new business onto risk control was shortened to one week, and the average response time was held at 300+ ms despite a 10x increase in traffic and more complex rules, a performance improvement of more than 2x over the previous version.
The Aegis family then gained two more important subsystems: Sessionizer and DeviceID. Both are quasi-real-time processing applications, but both supply data to the real-time engine by preloading (warming up) it in advance.
Sessionizer condenses a user’s page-access session into a RiskSession that reflects the user’s operations. The flow is shown as follows:
Data processing uses Chloro, an in-house big data processing system. The architecture diagram is as follows:
DeviceID collects device fingerprint data and generates fingerprint identifiers to determine device uniqueness. It also uses Chloro for data processing, as shown in the following diagram:
These two services provide critical judgment data for rules and models as well as manual processing, further improving the accuracy of risk judgments.
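As a rough illustration of what sessionizing involves, the sketch below groups page views by user and starts a new session after a period of inactivity. The 30-minute gap, the tuple layout, and the dictionary fields are assumptions; the article does not describe how Sessionizer delimits a RiskSession, and the real service runs as a quasi-real-time stream rather than a batch function.

```python
from itertools import groupby


def sessionize(page_views, gap_seconds=1800):
    """Group (user_id, timestamp, page) tuples into per-user sessions.

    A new session starts when the gap since the user's previous page view
    exceeds gap_seconds.
    """
    sessions = []
    ordered = sorted(page_views, key=lambda v: (v[0], v[1]))
    for user_id, views in groupby(ordered, key=lambda v: v[0]):
        current = None
        for _, ts, page in views:
            if current is None or ts - current["end"] > gap_seconds:
                current = {"user_id": user_id, "start": ts, "end": ts, "pages": []}
                sessions.append(current)
            current["end"] = ts
            current["pages"].append(page)
    return sessions
```

Each resulting session carries the ordered pages a user visited, which is the kind of condensed behavior record that rules and models can consume.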
However, after Aegis had been running for more than a year, the response time of real-time risk control was found to degrade gradually as traffic and the number of rules and models kept growing, and about 1 in 1,000 requests timed out (exceeded 1 second). This led to continuous performance optimization.
4. Performance optimization for real-time risk control
Performance optimization started a year ago and consists of the following major optimizations:
1. Rules are distributed and executed in parallel
The rules and models that a single event needs to execute are split across multiple servers in the same logical group, and the results are merged at the end. This optimization reduced the average time to 200+ ms.
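The split-and-merge idea can be sketched in a single process, with worker threads standing in for the servers of one logical group. The round-robin split, the `(name, predicate)` rule shape, and merging by concatenating rule hits are assumptions for illustration; the article does not describe how rules are assigned or how results are combined.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rules(rules, group_size):
    """Round-robin the rules into group_size partitions, one per worker."""
    return [rules[i::group_size] for i in range(group_size)]


def run_partition(partition, event):
    """Run one partition's rules and return the names of the rules that fired."""
    return [name for name, predicate in partition if predicate(event)]


def evaluate(event, rules, group_size=3):
    """Fan the event out to every partition in parallel, then merge the hits."""
    partitions = split_rules(rules, group_size)
    with ThreadPoolExecutor(max_workers=group_size) as pool:
        results = pool.map(run_partition, partitions, [event] * len(partitions))
        hits = [name for partition_hits in results for name in partition_hits]
    return sorted(hits)
```

With this layout the latency of one event is bounded by the slowest partition instead of the sum of all rules, which is consistent with the drop in average time reported above.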
2. Script execution engine switched from Drools to Java
Templates hide Drools-specific syntax from rule authors, and each script is then compiled into a Java class for execution. Rule execution performance doubled, and the engine’s average real-time response time fell to 100 ms.
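The general pattern here (a template wraps the rule author’s condition into source code, which is compiled once and then called for every event) can be illustrated with a Python analogue. This is not the Java implementation: `RULE_TEMPLATE`, `compile_rule`, and the `event`/`counters` parameter names are hypothetical, and the sketch assumes rule text comes only from trusted rule authors, because executing untrusted text this way is unsafe.

```python
# The rule author writes only the condition; the template supplies the boilerplate.
RULE_TEMPLATE = "def rule(event, counters):\n    return ({condition})\n"


def compile_rule(name, condition):
    """Turn a condition string into a callable, compiling it once at publish time."""
    source = RULE_TEMPLATE.format(condition=condition)
    code = compile(source, "<rule:%s>" % name, "exec")
    namespace = {}
    exec(code, namespace)  # assumption: rule text is trusted
    return namespace["rule"]


rule = compile_rule(
    "high_amount_repeat",
    'event["amount"] > 5000 and counters["orders_1h"] >= 3',
)
fired = rule({"amount": 8000}, {"orders_1h": 3})  # True
```

The compiled callable is reused for every event, so the per-event cost is a plain function call rather than script interpretation.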
3. Developed a Java model execution engine
Java versions of the random forest and logistic regression algorithms have been completed, with an order-of-magnitude performance improvement over the Python versions.
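For reference, scoring with a trained logistic regression model is a weighted sum passed through the logistic function, p = 1 / (1 + e^(-z)) with z = b + Σ wᵢxᵢ. The sketch below shows that inference step only; the feature names and weights are made up, and it says nothing about how Ctrip’s Java engine is structured.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(weights, bias, features):
    """Return the predicted risk probability for one feature dict.

    Features missing from the event are treated as 0.0.
    """
    z = bias + sum(w * features.get(name, 0.0) for name, w in weights.items())
    return sigmoid(z)


weights = {"amount_k": 0.5, "new_device": 2.0}   # hypothetical trained weights
p = score(weights, -2.0, {"amount_k": 0.0, "new_device": 1.0})  # z = 0, so p = 0.5
```

Inference is a handful of multiplications per event, so the runtime around the model (feature lookup, serialization, interpreter overhead) tends to dominate, which is where a native engine can gain over a scripting runtime.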
After these optimizations, the system can expand capacity in a short time simply by adding servers.
The above is the architecture and evolution of the Aegis system. The evolution is still ongoing: at this stage, the goal is to continue building out the platform and to complete the transition to a service-oriented system.