Abstract: This article is based on a talk delivered by Guo Yubo, a senior expert on ZhongAn Insurance's big data platform, at the Flink Forward Asia 2021 Industry Practice session. The main contents include:
- Overall situation
- Intelligent marketing applications
- Real-time feature application
- Anti-fraud applications
- Future planning
Click to view live replay & Presentation PDF
1. Overall situation
The figure above shows the overall architecture of real-time computing. The bottom layer is the data source layer, which includes business data from application systems, application message data, user behavior tracking (buried-point) data, and application log data. All of this data enters the real-time data warehouse through Flink.
Real-time data warehouse is divided into three layers:
- The first layer is the ODS layer. After Flink writes data into the ODS layer, each raw table corresponds one-to-one to a data source, and views then perform simple cleaning and processing of the raw data.
- After that, the data is delivered to the DWD layer through Flink. The DWD layer is divided by subject domain; it currently covers the user, marketing, credit, and insurance data domains, among others.
- In addition, the DIM layer contains dimension-table data about users, products, and channels. DIM data is stored in HBase.
After cleaning at the DWD layer, the data is delivered to the DWS layer, which integrates and summarizes it, typically into indicator wide tables and multidimensional detail wide tables. The data then flows into the ADS layer, which contains the various OLAP storage engines. We currently use ClickHouse as the storage engine for large-scale real-time reports, HBase and Alibaba Cloud's TableStore as storage services for user labels and feature engineering, and ES for real-time monitoring scenarios.
The figure above is the architecture diagram of our real-time computing platform, which can be divided into three parts. The first part is the task management background, where tasks are edited and submitted in the task management module. The task editor supports both Flink SQL and Flink JAR tasks, provides convenient Flink SQL editing and debugging, and supports a variety of task startup strategies: starting from a checkpoint, from a specified offset, from a point in time, or from the earliest position. Once submitted, a task is sent to our self-built CDH cluster via the Flink client, and the task management service periodically fetches real-time task status from YARN.
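As an illustration, the startup strategies above could be resolved into job start positions roughly like this. This is a minimal Python sketch; the function name, option names, and value shapes are assumptions for illustration, not the platform's actual API.

```python
# Hypothetical sketch: map the platform's task-startup strategies to a
# start-position description for a restarted Flink job. All names and
# option shapes here are illustrative assumptions.

def resolve_start_position(strategy, value=None):
    """Return a dict describing where a restarted job should begin reading."""
    if strategy == "checkpoint":
        # resume from the last completed checkpoint/savepoint path
        return {"mode": "savepoint", "path": value}
    if strategy == "offset":
        # start each Kafka partition from an explicit offset map, e.g. {0: 1200}
        return {"mode": "specific-offsets", "offsets": value}
    if strategy == "timestamp":
        # start from the first record at or after a point in time (epoch millis)
        return {"mode": "timestamp", "timestamp-millis": value}
    if strategy == "earliest":
        # replay the source topic from its earliest retained position
        return {"mode": "earliest-offset"}
    raise ValueError(f"unknown startup strategy: {strategy}")
```

The value of encoding this as data rather than branching at call sites is that the task management background can persist the chosen strategy with the task and reapply it on every restart.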
For monitoring, Flink pushes metric data to PushGateway; Prometheus scrapes PushGateway, and the data is visualized in Grafana. In addition to monitoring abnormal task status, we provide real-time alerts for situations such as resource usage and message backlog. Flink also supports many connectors, such as Alibaba Cloud's ODPS, TableStore, and Hologres, and it ships with rich built-in UDFs as well as support for user-defined UDFs.
Above is the task editor of our real-time computing platform. It supports both Flink SQL and Flink JAR tasks. SQL tasks support DML and DDL, which can be combined in the editor and submitted together as one task. The editor also supports many advanced task configuration options, such as checkpoint configuration, Kafka message parallelism, and state management.
2. Intelligent marketing applications
Next, we will introduce how Flink is used in intelligent marketing scenarios.
The lowest level of the marketing platform's architecture is again the data source layer, including financial business data, insurance business data, user behavior data, third-party platform data, and operation result data. Offline data enters the offline data warehouse through ETL, and real-time data enters the real-time data warehouse through Flink.
Above the data warehouses sits the label service layer, which provides management functions for both offline and real-time labels. We also apply governance and control to these labels, such as data access control. In addition, label data monitoring lets us find label data anomalies in time, and label usage statistics give us an accurate picture of how labels are actually used.
Above the label layer is the label application layer. We have a marketing AB lab and a traffic AB lab. The difference is that marketing AB mainly runs marketing against customer groups, whether selected statically by rules or accessed in real time through Flink, and then carries out process marketing and intelligent reach for those groups. The traffic AB lab is a label-based data service used for personalized, per-user recommendation on the APP side. The platform also provides customer-group portrait analysis, which can quickly surface the data of similar customer groups and a group's historical marketing effect, to better assist operators in selecting customer groups and marketing to them.
After the marketing AB and traffic AB experiments, an effect analysis service feeds results back in real time, through which the operations team can quickly adjust strategy.
At present, the total number of labels exceeds 500, marketing tasks run about 2 million times per day, and traffic AB handles more than 20 million requests per day, mainly providing the front end with personalized resource-slot display and per-user recommendation in business scenarios.
The figure above is the data flow diagram of the intelligent marketing platform. On the left are the data sources, including business data from business systems as well as tracking and event data. These data reach the real-time data warehouse through Kafka. Part of them is processed into real-time customer groups, which are also sent to Kafka, and marketing AB then performs intelligent reach for these real-time customer groups.
Another part is label data processed by the offline warehouse; we use DataX as the ETL tool to synchronize it to Hologres. Hologres connects seamlessly to ODPS and, by taking advantage of its accelerated external-table access to ODPS, achieves data synchronization at millions of rows per second. Operators can select customer groups by themselves on the marketing platform, and Hologres's interactive analysis capability can generate complex customer groups with second-level latency.
The characteristics of the whole marketing platform can be summarized as follows:
- First, real-time portraits. By defining standardized real-time events and data structures and using Flink's real-time computing capability, labels are generated and connected automatically in real time;
- Second, intelligent marketing strategies. Users can configure component-based marketing flows directly on the marketing platform; the platform provides rich timing strategies and a variety of intelligent reach channels, supports flexible multi-branch business flows, and uses a consistent-hash splitting algorithm for user AB experiments;
- Third, real-time analysis. For marketing effect analysis, we also use Flink to feed effects back in real time. With funnel analysis and business-indicator effect analysis, we can better empower the marketing business.
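In the spirit of the hash-based traffic splitting mentioned above, a stable, salted hash can assign each user to an experiment bucket deterministically. This is a simplified stdlib sketch, not the platform's implementation; the bucket count, salt format, and function names are assumptions.

```python
import hashlib

def ab_bucket(user_id: str, experiment: str, buckets: int = 100) -> int:
    """Deterministically map a user to one of `buckets` slots for an experiment.

    Salting the hash with the experiment name keeps assignments independent
    across experiments while staying stable for the same user within one
    experiment, so a user never flips between variants mid-experiment.
    """
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % buckets

def assign_variant(user_id: str, experiment: str, treatment_pct: int = 50) -> str:
    """Split traffic: the first `treatment_pct` buckets get the treatment."""
    return "treatment" if ab_bucket(user_id, experiment) < treatment_pct else "control"
```

Because the assignment is a pure function of (user, experiment), the effect analysis service can recompute which variant any historical event belonged to without storing per-user assignment records.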
3. Real-time feature application
Feature engineering mainly serves financial risk-control scenarios such as the decision engine, anti-fraud, and risk-control model services. Its main purpose is to transform raw data into a better representation of the underlying problem. Such features improve the accuracy of predictions on unseen cases, and financial business scenarios use them to improve the ability to identify user risk.
Feature engineering is the most time-consuming and important step in the whole data mining workflow. It provides core data support for risk control across the entire financial business process, and is mainly divided into three parts:
- The first is feature mining, which is mainly completed by the risk-control strategy and model development teams. They analyze and process data according to business indicators, then extract effective, compliant features.
- The second is feature development. The feature development team connects to different data sources according to each feature's origin: some come from third parties, some are processed offline, and some are processed in real time. Some features are also processed again through machine learning models.
- The third is feature serving. Developed features are provided to online services through the feature center, which ensures the stability of the entire feature link.
The feature project currently runs more than 100 Flink real-time tasks, producing more than 10,000 features with more than 30 million feature calls per day.
Among the core indicators of financial risk-control features, the most important is compliance: everything is built on top of compliance. Beyond that, we must ensure the accuracy of feature processing, the freshness of feature values, fast feature computation, and the high availability and stability of the whole platform.
Based on these indicators, we adopted Flink as the real-time computing engine, HBase and Alibaba Cloud's TableStore as high-performance storage engines, and achieved overall service-ization and platformization through a microservice architecture.
The architecture of the feature platform can be divided into five parts. Upstream systems include front-office systems, decision systems, and protection systems. All business-side requests pass through the feature gateway, which orchestrates the call chain according to each feature's data sources: some call third-party data or credit bureau data, and some read from the data mart. After data access, requests enter the feature data processing layer, which includes feature processing services for third-party data and real-time financial feature computation, as well as anti-fraud feature computing services, including relationship graphs and list-based feature services.
Some underlying features processed by this layer are provided directly to upstream business systems, while others need to be processed again by the feature composition service. We implement the feature composition service and risk-control model services through a low-code editor, and feature re-processing through a machine learning platform.
The basic service layer mainly handles the background management and real-time monitoring of features. Real-time features depend on the real-time computing platform, and offline features depend on the offline scheduling platform.
To sum up, the feature platform is a feature service system built with microservices: a set of risk-control data products that accesses third-party data, credit bureau data, internal data, real-time data, and offline data for feature processing and serving.
The chart above shows the flow of real-time financial feature data. The data mainly comes from business databases across front-office and middle-office business systems and is sent to Kafka as binlog; the data middleware BLCS converts binlog into Kafka messages. User behavior data is sent directly to Kafka and enters the real-time data warehouse through Flink. After computation in the real-time data warehouse, detailed multidimensional data is written into TableStore.
At first we used HBase; later, for the sake of stability, we upgraded to TableStore. Since feature services have high stability requirements, we still keep both storage engines, with HBase serving as degraded (fallback) storage.
Since financial features must be able to describe a user's whole life cycle, they require not only real-time data but also full offline data. Offline data is synchronized to HDFS through DataX, then backfilled by Spark's offline computation into TableStore, the online storage engine.
Risk control places ever higher demands on feature processing. For example, even a simple feature such as spending amount may require many business windows: the last half hour, 3 hours, 6 hours, one day, 7 days, 15 days, 30 days, and so on. Computing all of these directly in real time would create too many windows, and computing over full data would also reduce Flink's throughput. Therefore, our real-time tasks mainly do data cleaning and simple integration; we then write this detail data back to the storage engine, and configurable feature processing is carried out by the application system's feature computing engine.
Risk-control feature scenarios are relatively fixed: features are basically computed from a user's ID card number, user ID, or mobile phone number. We therefore abstract a set of user entity-relationship tables, including a mapping table from user identifiers such as ID card number, mobile phone number, and device fingerprint to the userID; business data is stored in dimension tables keyed by userID. User detail data can then be queried from two dimensions: entity relationships and business data. Thanks to the high-performance lookup capability of TableStore, we can handle highly concurrent feature computation. Some features use not only real-time data but also data fetched from business-system interfaces; since real-time data and interface data must be aggregated together, not all feature computation can be completed inside Flink. Flink therefore only processes and aggregates detail data, and the feature computing engine produces the final feature results.
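The two-step lookup described above (any identifier → userID → detail rows) can be illustrated with in-memory dicts standing in for the TableStore mapping and dimension tables. All table shapes, names, and data here are made up for illustration.

```python
# Hypothetical in-memory stand-in for the userID mapping table and the
# user detail table stored in TableStore (all names and data are invented).

ID_MAPPING = {                         # (identifier type, value) -> userID
    ("phone", "138****0001"): "u_1001",
    ("id_card", "ID_A"): "u_1001",
    ("device", "dev_42"): "u_1002",
}

USER_DETAILS = {                       # userID -> detail rows (dimension data)
    "u_1001": [{"event": "loan_apply", "amount": 500}],
    "u_1002": [{"event": "login"}],
}

def lookup_details(id_type: str, id_value: str):
    """Resolve any user identifier to the userID, then fetch its detail rows.

    Mirrors the two-dimension query: entity relationship first, business
    data second. Unknown identifiers simply yield no rows.
    """
    user_id = ID_MAPPING.get((id_type, id_value))
    if user_id is None:
        return []
    return USER_DETAILS.get(user_id, [])
```

In production each dict access would be a point lookup against TableStore, which is why the mapping table must cover every identifier type a risk-control request might arrive with.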
At present, real-time feature computation mainly combines the DWD data in the real-time data warehouse with the feature computing engine. DWD data is written back to Alibaba Cloud's TableStore, and feature processing and computation are then driven by configuration. To save query cost, our computation granularity is the feature group: a feature group queries its data source only once, and a feature group has a one-to-many relationship with features.
Here is a brief description of the feature calculation process. First, the related detail data is scanned according to the feature's query conditions. Then, for each feature configured under the feature group, statistics are computed with built-in functions according to the feature's time granularity and dimensions. If features across multiple data sources need to be joined, the features they depend on are calculated first, and only then is the next feature's calculation completed. If the built-in functions do not satisfy a computation, the system also allows manipulating features through embedded Groovy scripts. In addition, some features come from business-system interfaces, in which case only the first step changes from querying TableStore to calling the interface; other feature data sources can likewise be added by implementing the standard data interface, with no change needed in the feature computing engine.
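The feature-group idea above (scan the detail rows once, then evaluate several window/statistic configurations against that one scan) can be sketched as follows. Field names, the config shape, and the supported statistics are assumptions for illustration, not the engine's real configuration format.

```python
# Sketch of a feature-group calculation: one scan of detail rows serves
# several window/statistic configurations. All names are illustrative.

import time

def compute_feature_group(rows, feature_configs, now=None):
    """Compute every feature in a group from one pass over the detail rows.

    rows:            [{"ts": epoch_seconds, "amount": float}, ...]
    feature_configs: {feature_name: {"window_s": seconds, "stat": "sum"|"count"}}
    """
    now = now if now is not None else time.time()
    results = {}
    for name, cfg in feature_configs.items():
        # select only the detail rows that fall inside this feature's window
        window_rows = [r for r in rows if now - r["ts"] <= cfg["window_s"]]
        if cfg["stat"] == "sum":
            results[name] = sum(r["amount"] for r in window_rows)
        elif cfg["stat"] == "count":
            results[name] = len(window_rows)
        else:
            raise ValueError(f"unsupported statistic: {cfg['stat']}")
    return results
```

Because the rows are fetched once and each feature only filters and aggregates them, adding a 15-day or 30-day variant of a feature costs one more config entry rather than one more storage query, which is the point of the one-to-many feature-group design.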
4. Anti-fraud applications
The figure above shows the data flow of the real-time anti-fraud feature application. It is somewhat similar to the financial real-time feature service, with some differences: besides business data, the data sources here pay more attention to user behavior data and user device data. Of course, these device and behavior data are collected with user permission. The data passes through Kafka and is processed in Flink. Anti-fraud mainly uses a graph database to store user relationship data. For complex features that require historical data, we use Bitmap as state storage inside Flink, combine it with timerService for state cleanup, and store feature results in Redis.
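The bitmap-in-state pattern above can be approximated with plain Python integers used as bitmaps: one bitmap per day bucket records which (numeric) user IDs were seen, and expired buckets are dropped the way a registered Flink timer would clear old state. The class name, bucketing scheme, and retention policy are all illustrative assumptions.

```python
# Rough analogue of keeping a Bitmap in Flink state with timer-driven cleanup.
# A Python int serves as the bitmap; bit i set means user ID i was seen.

class DailyUserBitmap:
    def __init__(self, retention_days: int = 30):
        self.retention_days = retention_days
        self.buckets = {}  # day index -> int used as a bitmap over user IDs

    def add(self, day: int, user_id: int):
        """Mark a user as seen on a given day."""
        self.buckets[day] = self.buckets.get(day, 0) | (1 << user_id)

    def seen_within(self, day: int, user_id: int) -> bool:
        """Was this user seen in the retention window ending on `day`?"""
        mask = 1 << user_id
        return any(self.buckets.get(d, 0) & mask
                   for d in range(day - self.retention_days + 1, day + 1))

    def expire(self, current_day: int):
        """What a registered timer would do: drop buckets past retention."""
        cutoff = current_day - self.retention_days
        for d in [d for d in self.buckets if d <= cutoff]:
            del self.buckets[d]
```

In Flink the equivalent state would live in keyed state (e.g. a RoaringBitmap per key) with `timerService` firing the `expire` step, so state size stays bounded no matter how long the job runs.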
GPS anti-fraud features are location-recognition features computed using TableStore's multivariate index and LBS functions. Anti-fraud relationship graphs and communities are presented to anti-fraud investigators through data visualization for case investigation.
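The GeoHash-based location clustering mentioned here works because coordinates that share a geohash prefix fall into the same grid cell. Below is a minimal encoder for the standard GeoHash algorithm (this is general background, not ZhongAn's implementation), which can be used to group users by cell before computing aggregation features.

```python
# Minimal GeoHash encoder (standard algorithm): interleaves longitude and
# latitude bisection bits, 5 bits per base32 character. Points that share a
# prefix of length n lie in the same cell at that precision.

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat: float, lon: float, precision: int = 6) -> str:
    lat_lo, lat_hi, lon_lo, lon_hi = -90.0, 90.0, -180.0, 180.0
    bits, ch, even = 0, 0, True   # even bit -> longitude, odd bit -> latitude
    out = []
    while len(out) < precision:
        if even:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                ch = ch * 2 + 1
                lon_lo = mid
            else:
                ch = ch * 2
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                ch = ch * 2 + 1
                lat_lo = mid
            else:
                ch = ch * 2
                lat_hi = mid
        even = not even
        bits += 1
        if bits == 5:               # every 5 bits becomes one base32 character
            out.append(_BASE32[ch])
            bits, ch = 0, 0
    return "".join(out)
```

A location-clustering feature can then simply count distinct users per geohash cell over a time window; an unusually dense cell is a candidate for the manual investigation described above.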
We classify anti-fraud features into four categories:
- The first type is location recognition, which mainly uses the user's location information and the GeoHash algorithm to compute location-clustering features. For example, we found some suspicious users through location aggregation, checked their facial recognition photos during anti-fraud investigation, and found that they all had similar backgrounds and applied for business at the same company. Combining location features with the AI capability of image recognition lets us locate similar fraud more accurately;
- The second type is device association, mainly realized through the relationship graph. By obtaining the users associated with the same device, we can relatively quickly locate bonus hunters and simple fraud.
- The third type is graph relationships, covering scenarios such as user login, registration, self-service, and credit application. We capture users' device fingerprints, mobile phone numbers, contacts, and other information in these scenarios in real time to build adjacency relationships in the graph. Through these adjacencies and the degree of the nodes associated with a user, we judge whether the user is connected to the black or grey list and thus identify risk.
- The fourth type is statistical community features based on community discovery algorithms: by judging the size of a community and the behavior of the users within it, we extract statistical rule-based features.
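The community features above rely on proper community-discovery algorithms; as a deliberately simplified stand-in, connected components over shared-device or shared-contact edges already group obviously linked users, and community size becomes a directly usable statistical feature. The function name and data are illustrative.

```python
# Simplified stand-in for community discovery: connected components via
# union-find over (user, device/contact) edges. All data is invented.

def communities(edges):
    """Group nodes connected by any chain of shared edges."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for a, b in edges:
        union(a, b)

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())
```

Real community detection (e.g. label propagation or Louvain, as typically run on Spark GraphX) further splits such components by edge density, but even the component size alone flags suspiciously large clusters for investigation.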
NebulaGraph is the service that carries most of the relationship-graph features mentioned above. We tested the commonly used JanusGraph and OrientDB, but once the data volume grew past a certain order of magnitude, unstable behavior appeared, so we tried NebulaGraph and found its stability comparatively high. Because it uses a shared-nothing distributed storage engine, it can support graph computation at the trillion-edge scale. Its services are divided into three parts:
- Graph service, mainly responsible for real-time graph calculation;
- Meta services, mainly responsible for data management, schema operations and user permissions, etc.
- Storage service, mainly responsible for data storage.
NebulaGraph also uses a compute-storage separation architecture, so the computing and storage layers can scale independently. It supports push-down computation, reducing data movement. Both the Meta layer and the storage layer use the Raft protocol to keep data consistent.
NebulaGraph also provides a variety of client access methods, supporting Java, Go, Python, and other clients, as well as a Flink connector and a Spark connector, so it integrates easily with the mainstream big data computing engines.
The realization of the relationship graph is divided into four parts. The first is the graph's data sources: to build a valuable relationship graph, accurate and rich data must be found for graph modeling. Our data mainly comes from user data, such as mobile phone numbers, ID cards, device information, and contacts, which are synchronized into the relationship graph. Besides real-time data, historical data is also cleaned by offline Spark tasks. NebulaGraph provides a query language with rich graph functions such as adjacent edges, maximum paths, and shortest paths. Community detection is implemented through Spark GraphX. Finally, data services are provided through APIs for graph database applications: graph data features are served directly to the decision engine, along with some anti-fraud data services, and community-based recommendation algorithms may even be considered for marketing later.
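The adjacency-based risk check described in the sections above ("is this user within a few hops of a black/grey-listed node?") can be sketched as a plain BFS over an adjacency dict. In production this would be a NebulaGraph query rather than an in-memory traversal; the function name, graph shape, and data here are illustrative.

```python
# Sketch of an N-hop black/grey-list association check as a BFS.
# graph: node -> iterable of neighbor nodes (users, phones, devices, ...).

from collections import deque

def hops_to_blacklist(graph, start, blacklist, max_hops=3):
    """Return the hop distance to the nearest blacklisted node, or None."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node in blacklist:
            return dist
        if dist >= max_hops:
            continue               # do not expand beyond the hop limit
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

The hop distance itself then becomes a feature for the decision engine: a user one hop from a blacklisted phone number is a much stronger signal than one three hops away.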
5. Future planning
In the future, we will first consolidate the real-time computing platform, implement data lineage management for real-time data, and try Flink on K8s to achieve dynamic scaling of resources.
Secondly, we hope to build a graph platform based on Flink + NebulaGraph. Real-time and offline computing are currently implemented with a Lambda architecture, so we also want to try Flink + Hologres to achieve stream-batch unification and solve this problem.
Finally, we will try to use Flink ML to realize online machine learning in risk-control anti-fraud scenarios, improve model development efficiency, iterate models quickly, and enable intelligent real-time risk control.