Build a real-time platform based on Flink

“This is the 24th day of my participation in the November Gwen Challenge. See details of the event: The Last Gwen Challenge 2021”.

One, foreword

In the era of big data, fintech companies often rely on consumption data to assess a user’s creditworthiness and ability to repay. In this process, some intermediary agencies will collect a large number of numbers and carry out the work of “raising numbers”, that is, in a one-year cycle, let these numbers form normal consumption and communication records, the purpose is to “cultivate” these numbers to be very healthy, and then sell to users with fraudulent intentions. This kind of user submits through the online information audit, cheated to the loan after “disappeared”.

So how can you more quickly prevent or identify potential fraud? How to realize online real-time anti-fraud from super-large scale, high concurrency and multi-dimensional data? These are the main challenges facing fintech companies today.

To address these issues, InfoQ spoke with Nine Fu group to reveal how flink-based ultra-large scale online real-time anti-fraud technology can quickly process massive amounts of data and create a good user experience.

Two, online real-time anti – fraud difficulties and pain points

There are three common scenarios of financial fraud:

One is material forgery. This was a common fraud in the early days when paper submissions were required;
The second is “Yang number”, which is common in intermediary agencies. They maintain the health status of a large number of numbers by charging service fees and sell them to fraudulent users for loan applications.
The third is the threat from professional hackers, who attack account security by looking for loopholes in systems and processes.

Due to the virtual characteristics of fintech, the main risks are concentrated in two aspects: fraud risk and credit risk, so the core risk assessment process is anti-fraud and credit assessment. For anti-fraud, information verification, high-risk group interception and real-time calculation, identification and decision-making are its core risk control methods. And the assessment of credit risk needs both internal and external repair.

Jiufu group’s credit rating of users is mainly evaluated dynamically by the fireeye scoring – Rainbow rating system independently developed by Jiufu Group, which covers the whole c-end lending service of Jiufu Group. Since its launch, the performance has been stable and the distinguishing effect is obvious. External also referred to Tencent, Ali and other scores as a reference.

At present, online real-time anti-fraud will face various pain points. In the business scenario of Jiufu Group, the main pain points are concentrated in the following three aspects:

Low latency requirements. The more data you need to compute, the longer it takes. In the era of popular online lending, a slogan often circulated is “three minutes credit, one minute lending”, and even some companies hit “one minute credit, half a minute lending”. However, in the big data scenario, data analysis and processing have an increasingly high demand for low latency.
Very large scale real-time computing requirements. In the big data scenario, large-scale data need to be calculated in real time. The Flink computing platform code-named “Fuxi” in Jiufu Group needs to perform rapid retrieval and calculation on the data set of close to 510TB every day. Changes in users’ behaviors will lead to changes in data, thus affecting decision-making. As a result, there is a growing need for real-time computing of very large amounts of data to ensure that users can abort transactions in the event of fraud.
Multi-dimensional, high concurrency requirements. As the scale of users in the same business scenario expands, the data generated by users also explodes. In the financial scenario, it is urgent to have a complete system that can obtain risk assessment reports according to the analysis of all dimensions of data and mine potential needs of users according to their characteristics. The simplest and most effective way for the system to obtain user-generated data is flowing data. A single data packet contains all the information of each dimension at the time point of occurrence. One of the characteristics of this scenario is high data concurrency, so it is a great challenge for data analysis with high time-sensitive requirements.

Aiming at the current pain point of online real-time anti-fraud, Jiufu Group adopts the large-scale online real-time anti-fraud system based on Flink, which not only improves user experience, but also reduces business losses.

Three, based on Flink large scale online real-time anti-fraud system

1. Why Flink?

Flink open source project is a rising star in the field of big data processing in the past two years. Although it is a rising star, it has been applied in the engineering practice of many large Domestic Internet enterprises, such as Ali, Meituan, JINGdong and so on. Then, in the iteration of Jiufu’s big data technology system, why did it choose Flink, the stream data processing engine?

In terms of technical languages, Spark uses JAVA and Scala, which have certain requirements on Scala. Flink, on the other hand, is mainly Based on JAVA, which makes the programming language more mature and versatile, making it easier to modify code. Therefore, from the comprehensive perspective of language level, Flink is relatively good. The comparison of Spark, Storm and Flink technologies is as follows:

From the perspective of delay and throughput: Flink is a pure streaming design. The calculation of streaming big data technology is logical first, that is, the calculation logic is defined first. When data flows through, the calculation results are calculated in real time and retained. When data needs to be used, the calculation results can be directly called, without re-calculation.

Streaming big data technology can be widely applied to scenarios requiring high timeliness of data processing, such as real-time transaction anti-fraud. The performance of Flink in terms of delay and throughput is good, which can meet the requirements of Jiufu Group for online real-time computing of super-large scale data stream.

In contrast, Spark is mainly in small-batch processing mode, which cannot meet the requirements of the anti-fraud system to process large-scale, multi-dimensional, and high-concurrency data flows in real time. Although Storm is based on stream processing, Flink’s throughput is about 3-5 times that of Storm, and Flink’s latency at full throughput is about half that of Storm. Overall, the Flink framework itself performs better than Storm.

In terms of integration with the existing ecosystem: Flink’s integration with very Large computing and storage (HBase) is much better than Spark and Storm, and has more friendly interfaces. HBase is the caching basis for the system advance check function, which is the most important technical optimization to reduce p99 latency.

Overall, Flink is a well-designed framework that is powerful and performs well. In addition, it has some good design, such as memory management and flow control. However, due to the low maturity of Flink, there are still many problems. For example, SQL support is relatively rudimentary, and it cannot dynamically adjust resources without stopping tasks like Storm. Does not provide good Streaming and Static Data interaction like Spark does.

2. Super large-scale online real-time anti-fraud system architecture

The basic process of online credit is as follows: the user initiates the demand through the App, and the App asks the user to fill in the information related to authorization. The main purpose is to evaluate the user’s credit limit. After that, the user data will enter the background data system for anti-fraud and credit evaluation. If the audit passes, the user will receive information and the account quota will be opened. The architecture of the large-scale online real-time anti-fraud system based on Flink is as follows:

As for the future planning of the online real-time anti-fraud system, the first step of Jiufu will be to build the large-scale online real-time anti-fraud system based on Flink technology into a data product combining jiufu’s accumulation in technology and scenarios, so that it can export data assets and process data.

Secondly, The Jiufu technical team will continue to invest in the function optimization of the system, and promote it to the community as an open source product, so that more developers can directly use the system.

Finally, to further improve the performance of the whole system through technical optimization, the current P99 delay of the system is 100ms, and the next goal of Jiufu in the future is to achieve P99 delay of 50ms.)

The architecture of Jiufu’s large-scale online real-time anti-fraud system based on Flink is divided into two parts: the data part and the decision-making part. The operation of the whole system is equivalent to a workflow, the user’s data and information in the form of flow from one node to the next node, will produce a lot of decision-making in the process of circulation of information, make filtering and judgment according to the conditions, and the judgment result quickly to the next node, which can real-time user’s data, and then decide whether to lend to the user.

The data part needs the fastest processing, and the whole data processing is completed by four parts.

The first part is the fastest transfer of data from the front end to the back end. The first step of flink-based large-scale online real-time anti-fraud system is to widen the data path, allowing more information to flow into the data processing at the same time.

The second part is a large column storage cluster, mainly implemented by HBase. HBase is a NoSQL database running on Hadoop. It is a distributed and scalable big data warehouse that utilizes the distributed processing mode of HDFS and benefits from the MapReduce program model of Hadoop. Most importantly, it supports high concurrent read and write operations. HBase is the basic guarantee for the entire architecture. When a large amount of data is flooded, HBase can quickly store data, reducing the over-dependence of data writing and data reading on the system architecture.

HBase has a large number of indexes, such as primary and secondary indexes. Customize the read/write cache of HBase to ensure the pre-check function. Through App or other channels, users’ behavioral data information can be obtained, and users’ intentions can be inferred. Then the system starts to do pre-query and puts relevant information of users into the cache. In this way, when users trigger operations in the front end, the back end directly calls data from the cache to carry out calculation, greatly improving the data processing speed. 99% of data can be matched in HBase cache, which depends on the strong user awareness capability of the system.

The third part is the computing engine, mainly completed by Flink. The computing engine is divided into two parts. One is the filtering engine, which mainly performs customized filtering of user information in different dimensions in large-scale and high-concurrency data streams, in order to reduce the magnitude of the entire data calculation. The other is a function engine, which, through a highly abstract approach, customizes some very good functions and loads them into the engine, so that developers don’t have to modify the code themselves. The combination of the filter engine and the function engine dramatically reduces the data magnitude for the entire user, and with some efficient code, further reduces latency.

The core of Flink is based on the flow execution engine. Flink provides many apis of higher abstraction layer to facilitate users to write distributed tasks. The three commonly used apis are as follows:

DataSet API performs batch processing operations on static data and abstracts static data into distributed data sets. Users can conveniently operate distributed data sets with various operators provided by Flink.

DataStream USES the Flink API to process data streams and abstract them into distributed data streams. Users can easily use the operators provided by Flink to perform various operations on distributed data streams.

Table API, query operations on structured data, the structured data is abstracted into relational tables, and through Flink to provide a SQL-like DSL on relational tables for various query operations.

Based on its service characteristics, Jiufu needs to rapidly process large-scale online real-time data streams. Therefore, DataStream API is adopted to reduce latency.

The fourth part is calculation force. The computing power depends on the Hadoop cluster, and YARN is used to manage the entire resource, which has good scalability. The basic idea of YARN is to decompose the resource management and job scheduling/monitoring functions into separate daemons, consisting of global resource scheduling (RM) and application-specific scheduling (AM). YARN enables Hadoop to integrate into multiple computing frameworks, manage and schedule them in a unified manner, rather than supporting only one computing model, MapReduce. YARN architecture is as follows:

3. System architecture iteration

The large-scale online real-time anti-fraud system based on Flink has experienced a significant architectural iteration in Jiufu Group. The initial goal of Jiufu group was to quickly obtain the risk control results within 1s, but the user experience was not fast enough, so the whole system underwent a technical upgrade, adding the pre-check technology. The pre-check technology includes two parts: retrieval and calculation, and its core relies on the powerful concurrent ability of Flink.

Do a quick pre-check in a large amount of data, use Flink concurrent ability to cover the data, and finally hit the result in the cache, so as to avoid the process of network I/O query and waiting for the return. Finally, the optimization of P99 delay was reduced from 1s to 100ms by upgrading part of the computing framework.

4. Application of AI technology

In the era of big data, the quality of data directly affects the effect of big data analysis and processing methods, as well as the decision-making process. By analyzing massive amounts of data, you can discover patterns and rules implicit in data sets. However, abnormal data can cause significant interference to the analysis process. Machine learning is used to detect outliers in a large-scale online real-time anti-fraud system based on Flink. Outlier detection (also called outlier detection) is a detection process to find out the behavior of the object is different from the expected. These objects are called outliers or outliers.

Abnormal data in big data has the following characteristics: it is obviously different from normal data; Its generating mechanism is different from normal data and may be unknown. The data dimension is high. Outlier detection is widely used in credit card fraud detection. When the number of users is very large, some users with low credit values need to be identified, and machine learning is used to detect outlier, and then the users with low credit values are screened out for manual confirmation.

AI knowledge graph technology is also applied in the large-scale online real-time anti-fraud system based on Flink. The society is composed of large and small groups, and users also have such group characteristics. The relationship between these groups is constructed with data, and the value of data is deeply mined through the algorithms of graph segmentation and retrieval. In practical applications, if a user’s credit is very poor and has been blacklisted, then the users related to him need to be investigated. Classifying users according to their behavior is called clustering. There are many kinds of clustering algorithms, and then divide the graph according to the user’s information to determine the risk coefficient of each person. Some means can also be used to open the access of high-quality circles and guide high-quality circles to carry out information interaction.

Iv. Future planning of flink-based ultra-large scale online real-time anti-fraud system

Finally, the performance of the whole system is further improved through technical optimization. At present, the P99 delay of the system is 100ms, and the next goal in the future is to achieve the P99 delay of 50ms.

Build a real-time platform based on Flink

One, foreword

Two, online real-time anti – fraud difficulties and pain points

Three, based on Flink large scale online real-time anti-fraud system

1. Why Flink?

2. Super large-scale online real-time anti-fraud system architecture

3. System architecture iteration

4. Application of AI technology

Iv. Future planning of flink-based ultra-large scale online real-time anti-fraud system

Related Posts

Detailed SpringBoot Tutorial Web Development (part 1)

Arrays and slices in Go

Run CMD in Python