Author | Bai Xueyu
This article is shared by Bai Xueyu, R&D engineer on the big data platform at Central Plains Bank. It introduces how the real-time financial data lake is applied at Central Plains Bank. The main contents include:
1. Background overview
2. Real-time financial data lake architecture
3. Scenario practice
I. Background overview
First, a brief introduction to Central Plains Bank (Zhongyuan Bank). Headquartered in Zhengzhou, Henan Province, it is the only provincial-level legal-person bank in Henan and the largest city commercial bank in the province. It was listed in Hong Kong on July 19, 2017. Since its founding, the bank has made building and growing the bank through technology its strategy, with the aim of becoming a technology bank and a data bank. We work in technology and believe in it, and we hope to solve current problems by technical means.
This article shares the construction background, architecture, and scenario practice of the real-time financial data lake.
1. Business background behind the data lake
■ Change in decision-making methods
For context, we believe banks are facing a major shift in how they make decisions.
- First, traditional bank data analysis focuses on the distribution of income, costs, and profit, and on meeting the requirements of regulators. This analysis is very complex, but it follows fixed rules and is essentially financial data analysis. As Internet finance keeps developing, the business of banks is being squeezed. Analysis that still focuses only on income, cost, profit distribution, and regulatory reporting can no longer meet business needs. The priorities now are to understand customers better, collect more data, and carry out more targeted marketing and decision analysis. Bank analysis and decision-making are therefore shifting from traditional financial analysis to KYC (know your customer) oriented analysis.
- Second, traditional banking relied mainly on business staff to make decisions. As banking develops, however, a wide range of applications produce large volumes of data of many types, and relying on business staff alone no longer meets business needs. Problems are more complex and influencing factors keep multiplying, which calls for more comprehensive and intelligent technical solutions. Banks therefore need to move from purely human decision-making toward decision-making that increasingly relies on machine intelligence.
■ Problem analysis
The defining characteristics of the big data era are large data volumes and many data types. Using data at this scale involves several kinds of analysis, including:
- Traditional offline data analysis for financial analysis
- Data analysis that is not finance-oriented
- Analysis of frequently changing data, such as events and logs
- Data analysis with strict real-time requirements
We need diversified digital marketing tools to build a more comprehensive, accurate, and scientific picture of our customers. We also need real-time risk decision technology to monitor business risks as they arise, and multi-model data processing technology to support structured, semi-structured, and unstructured data. Machine learning and artificial intelligence are needed as well, to support intelligent analysis and decision-making.
This range of technologies, together with data-driven decision-making scenarios, means that bank data analysis is undergoing a major change: from traditional finance-oriented offline analysis to customer-oriented real-time analysis. This is the first viewpoint behind building the real-time financial data lake.
2. Technical background behind the data lake
The second viewpoint behind building the real-time financial data lake is that, in the banking system, the traditional data warehouse, designed for standardized and precise processing, handles financial analysis and similar scenarios well and will remain the mainstream solution for a long time.
■ Traditional data warehouse architecture
The following figure shows the traditional data warehouse architecture. From bottom to top, it consists of the source-aligned base layer, the common data integration layer, the business data mart layer, and the application processing layer. Each layer runs a large number of batch jobs every day to produce the results the business wants. Banks have long relied heavily on the traditional data warehouse because it solves financial analysis very well. Its characteristics are clear:
- Precise and standardized
- Multi-layer data processing
- Unified metric definitions
- T+1 data processing
- High performance
- Matured through long-term accumulation
- Well suited to financial analysis
These are the advantages of the traditional data warehouse. Its disadvantages are equally clear:
- Hard to change
- High unit storage cost
- Poorly suited to massive logs, frequently changing behavior data, and real-time data
- Poor compatibility with semi-structured and unstructured data
This is the second viewpoint behind building the real-time financial data lake: the traditional data warehouse has both strengths and weaknesses and will remain in use for a long time.
■ Evolution of the data warehouse
The third viewpoint behind building the real-time financial data lake is that KYC-oriented and machine-intelligence analysis must support multiple data types, data with different timeliness requirements, and more agile use. A new architecture that complements the data warehouse is therefore needed.
3. Characteristics of real-time financial data lake
Based on the three viewpoints above, we arrive at today’s topic, the real-time financial data lake. It has three main characteristics:
- First, openness. It supports many types of scenarios, such as AI, unstructured data, and historical data.
- Second, timeliness. It has an effective architecture to support real-time analysis and decision-making.
- Third, integration. It integrates with the bank’s data warehouse architecture and provides a unified data view.
Overall, the real-time financial data lake is a converged data lake. The idea of convergence is reflected in six aspects:
- First, convergence of data: massive and diverse data is gathered in one place, including structured, semi-structured, and unstructured data.
- Second, convergence of technology: cloud computing, big data, data warehouse, stream computing, and batch processing technologies are combined.
- Third, convergence of design standards: data model subjects are designed flexibly, both schema-on-read and schema-on-write are supported, and both multidimensional and relational data models are supported.
- Fourth, convergence of data management: metadata management for the data lake and the data warehouse is unified, and so is the user development experience.
- Fifth, convergence of physical location: the lake can be one large physically centralized cluster, or a logical cluster that is physically distributed but logically centralized.
- Sixth, convergence of data storage: analytical data is stored on a unified technology platform, and data is placed according to lake and warehouse standards as required, reducing storage and operations costs.
II. Real-time financial data lake architecture
1. Real-time financial data lake architecture
■ Functional architecture
Let’s first look at the functional architecture of the real-time financial data lake. Functionally, it comprises data sources, unified data access, data storage, data development, data services, and data applications.
- First, data sources. Structured, semi-structured, and unstructured data are all supported.
- Second, unified data access. A unified access platform ingests data intelligently according to its type.
- Third, data storage. This includes the data warehouse and the data lake, with intelligent distribution of hot, warm, and cold data.
- Fourth, data development. This includes task development, task scheduling, monitoring and operations, and visual programming.
- Fifth, data services. These include interactive queries, data APIs, SQL quality assessment, metadata management, and lineage management.
- Sixth, data applications. These include digital marketing, digital risk control, data-driven operations, and customer profiling.
■ Logical Architecture
The logical architecture of the real-time financial data lake consists of four layers: the storage layer, the computing layer, the service layer, and the product layer.
- At the storage layer, an MPP data warehouse and a data lake based on OSS/HDFS provide intelligent storage management.
- At the computing layer, unified metadata services are implemented.
- At the service layer, there are federated data computing and data service APIs. The federated data computing service is a federated query engine that supports cross-database queries. It relies on the unified metadata service to query data in both the data warehouse and the data lake.
- At the product layer, there are three groups of services. Intelligent services include RPA, certificate recognition, language analysis, customer profiling, and intelligent recommendation. Business analytics services include self-service analytics, customer insight, and visualization. Data development services include the data development platform and automated governance.
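The idea of a federated query that relies on unified metadata can be illustrated with a minimal sketch. It is not the bank's engine: two in-memory SQLite databases stand in for the MPP warehouse and the data lake, and the catalog, table names, and helper functions are all hypothetical.

```python
import sqlite3

# Two independent in-memory databases stand in for the warehouse and the lake.
warehouse = sqlite3.connect(":memory:")
lake = sqlite3.connect(":memory:")

warehouse.execute("CREATE TABLE customer (cust_id INTEGER, name TEXT)")
warehouse.executemany("INSERT INTO customer VALUES (?, ?)",
                      [(1, "Alice"), (2, "Bob")])

lake.execute("CREATE TABLE click_event (cust_id INTEGER, page TEXT)")
lake.executemany("INSERT INTO click_event VALUES (?, ?)",
                 [(1, "loans"), (1, "funds"), (2, "loans")])

# Hypothetical unified metadata: table name -> store that owns the table.
CATALOG = {"customer": warehouse, "click_event": lake}


def scan(table, columns):
    """Resolve the table through the catalog and read it from its own store."""
    conn = CATALOG[table]
    return conn.execute(f"SELECT {', '.join(columns)} FROM {table}").fetchall()


def clicks_per_customer():
    """Cross-store join: names come from the warehouse, events from the lake."""
    names = dict(scan("customer", ["cust_id", "name"]))
    counts = {}
    for cust_id, _page in scan("click_event", ["cust_id", "page"]):
        counts[names[cust_id]] = counts.get(names[cust_id], 0) + 1
    return counts


result = clicks_per_customer()
```

The caller never names a store. Only the catalog knows where each table lives, which is the role the unified metadata service plays for the federated query engine.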
2. Real-time financial data lake engineering practice
Next is the engineering practice of the real-time financial data lake, which mainly targets real-time analysis of structured data. The overall design is based on open source components, as shown in the figure below. It has four layers: the storage layer, the table format layer, the query engine layer, and the federated computing layer.
- The storage layer and the table format layer form the intelligent data distribution component. They support Upsert/Delete, table schema, and ACID semantic guarantees, and can also store semi-structured and unstructured data.
- The query engine layer and the federated computing layer are part of the unified data development platform, which provides one-stop data development covering both real-time and offline data tasks.
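To make the Upsert/Delete semantics concrete, here is a minimal sketch that assumes a table keyed by a primary key and a changelog of (operation, key, row) tuples. The function name and record layout are hypothetical, and a real table format does this with data files and snapshots, not Python dictionaries.

```python
def apply_changelog(table, changes):
    """Return a new table snapshot after applying (op, key, row) changes in order."""
    snapshot = dict(table)  # copy, so readers of the old snapshot are unaffected
    for op, key, row in changes:
        if op == "upsert":
            # Insert the row if the key is new, otherwise merge the new values in.
            snapshot[key] = {**snapshot.get(key, {}), **row}
        elif op == "delete":
            snapshot.pop(key, None)
        else:
            raise ValueError(f"unknown op: {op}")
    return snapshot


v0 = {"A1": {"balance": 100}}
v1 = apply_changelog(v0, [
    ("upsert", "A1", {"balance": 80}),   # update an existing key
    ("upsert", "A2", {"balance": 50}),   # insert a new key
    ("delete", "A2", None),              # then remove it again
])
```

Returning a new snapshot and leaving the old one untouched loosely mirrors snapshot isolation: a query that started on `v0` keeps seeing `v0`.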
This talk focuses on real-time data task development. The following part introduces the one-stop stream computing development platform, which supports the development, management, and operation of real-time tasks and keeps them running stably.
3. Stream computing development platform
Why do banks need a stream computing development platform, and what are its advantages?
■ Advantages
The advantage of the stream computing development platform is that it lowers the barrier to real-time data development and speeds up the delivery of real-time services. It is a one-stop real-time data development platform that provides visual data development, task management, multi-tenant and multi-project management, unified operations management, and permission management, so real-time data tasks can be developed entirely on the platform. The platform is built on Flink SQL, which is itself a productivity tool.
As Flink SQL is adopted more widely, the capabilities of the stream computing development platform can be extended down to the sub-branches. Sub-branches can develop real-time data tasks independently on the platform according to their own business needs, which in turn drives the growth of the bank’s business.
■ Architecture
The architecture of the stream computing development platform is shown below. It mainly includes data storage, resource management, the computing engine, data development, and web visualization.
The platform supports multi-tenant and multi-project management, and users can operate and monitor their real-time tasks on it. For resource management, it supports physical machines, virtual machines, and a unified cloud base on K8s. The computing engine is based on Flink and provides data integration, real-time task development, an operations center, data management, and an IDE for visual data development.
■ “Straight-through” real-time scenario
With the architecture and advantages of the stream computing development platform covered, we turn to specific scenarios. The first is the “straight-through” real-time scenario architecture.
Different data sources are connected to Kafka in real time. Flink reads the Kafka data in real time, processes it, and sends the results to the business side, which can be Kafka or HBase. Business dimension table data is stored in Elastic. The “straight-through” architecture delivers T+0 data timeliness and is mainly used in real-time decision-making scenarios.
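The core of the straight-through path is enriching each incoming event with dimension data before handing it to the business side. The sketch below shows only that lookup step, under stated assumptions: a plain list stands in for the Kafka stream, a dictionary stands in for the dimension table held in Elastic, and the field names (`branch_id`, `branch_name`) are hypothetical.

```python
def enrich_events(events, branch_dim):
    """Lookup join: attach dimension attributes to each event by branch_id."""
    enriched = []
    for event in events:
        # Events with no matching dimension row get a placeholder, not a drop.
        dim = branch_dim.get(event["branch_id"], {"branch_name": "UNKNOWN"})
        enriched.append({**event, **dim})
    return enriched


# Stand-in for the dimension table (kept in Elastic in the real system).
branch_dim = {"B01": {"branch_name": "Zhengzhou"}}

# Stand-in for events read from Kafka.
events = [
    {"txn_id": 1, "branch_id": "B01", "amount": 100.0},
    {"txn_id": 2, "branch_id": "B99", "amount": 5.0},
]

enriched = enrich_events(events, branch_dim)
```

Because each event is handled as it arrives and nothing is persisted in between, the result reaches the business side with T+0 timeliness, at the cost of not keeping history.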
- Real-time decision analysis
Here is a simple example from real-time post-loan collection. A loan is about to fall due and needs to be collected. The business relies on three pieces of data: the account balance, the transaction amount, and the amount due in the current period. Based on these, it decides which collection method to use: SMS, smart voice, or a phone call.
With the original offline data warehouse architecture, the data available is T+1. Decisions based on outdated data can mean that a customer who has already repaid still receives a collection call. With the “straight-through” architecture, decisions are made in real time on T+0 account balance, transaction amount, and amount due in the current period, which improves the customer experience.
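The collection decision can be sketched as a small rule function over the three inputs. The rules and channel names below are purely illustrative assumptions; the article does not describe the bank's actual decision logic.

```python
from dataclasses import dataclass


@dataclass
class AccountSnapshot:
    balance: float                    # current (T+0) account balance
    recent_transaction_amount: float  # recent inflow on the account
    amount_due: float                 # amount due in the current period


def choose_collection_channel(snapshot):
    """Pick a collection channel from T+0 data (hypothetical rules)."""
    if snapshot.amount_due <= 0:
        return "none"          # already repaid: do not contact the customer
    shortfall = snapshot.amount_due - snapshot.balance
    if shortfall <= 0:
        return "sms"           # funds are available, a reminder is enough
    if snapshot.recent_transaction_amount >= shortfall:
        return "smart_voice"   # recent inflows could cover the gap
    return "phone"             # large gap: escalate to a phone call
```

With T+1 data, the first branch is the one that fails: a repayment made today is not visible until tomorrow, so a customer whose amount due is already zero would still be routed to a collection channel.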
- Real-time BI analysis
Here is another example: obtaining, in real time, the sales information of financial products from some point in the past up to now. The requirement contains several key phrases. “In real time” means T+0 data. “From some point in the past up to now” means querying historical data. And “sales information of financial products” involves banking business, which is generally complex and requires multi-stream JOINs.
This is a real-time BI requirement that the “straight-through” architecture cannot handle well. That architecture uses Flink SQL, which cannot query historical data effectively. In addition, bank business is generally complex, while dual-stream JOIN is what is mainly used at present. Solving this problem requires a new architecture beyond the “straight-through” one.
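For readers unfamiliar with the term, a dual-stream JOIN keeps state for both inputs and emits a result whenever a new event on one side matches buffered events on the other. The following is a minimal sketch of that mechanism in plain Python, not Flink code; the event layout and the order/payment example are hypothetical.

```python
from collections import defaultdict


def dual_stream_join(events):
    """Join two interleaved streams on a key.

    `events` is an iterable of (side, key, value) with side "L" or "R".
    Returns a list of (key, left_value, right_value) in emission order.
    """
    state = {"L": defaultdict(list), "R": defaultdict(list)}
    joined = []
    for side, key, value in events:
        other = "R" if side == "L" else "L"
        state[side][key].append(value)      # remember this event for later matches
        for match in state[other][key]:     # join with everything seen on the other side
            if side == "L":
                joined.append((key, value, match))
            else:
                joined.append((key, match, value))
    return joined


joined = dual_stream_join([
    ("L", "p1", "order-1"),
    ("R", "p1", "pay-1"),
    ("R", "p2", "pay-2"),
    ("L", "p1", "order-2"),
])
```

The state grows with every event and only covers what has flowed through the job, which hints at why joining several streams over a long historical window is hard to serve from the streaming job alone.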
■ “Floor” (landing) real-time scenario
In the “floor” scenario, results land in the data lake instead of going straight to the business side. After data sources are connected to Kafka in real time, Flink processes the Kafka data in real time and writes the results to the data lake. The data lake is built entirely on open source components: HDFS and S3 for data storage, and Iceberg as the table format. While processing, Flink can write intermediate results into the data lake and then process them step by step until the result the business needs is obtained. The results can be served to applications through query engines such as Flink, Spark, and Presto.
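The step-by-step pattern can be sketched as follows, assuming an append-only in-memory table in place of an Iceberg table. The class, function names, and event fields are hypothetical and only illustrate the flow: land intermediate results first, then answer "from some point in the past up to now" from the landed data.

```python
class LakeTable:
    """Append-only stand-in for a data lake table (hypothetical helper)."""

    def __init__(self):
        self.rows = []

    def append(self, batch):
        self.rows.extend(batch)

    def scan(self, since=0):
        """Return all landed rows with a timestamp at or after `since`."""
        return [row for row in self.rows if row["ts"] >= since]


def clean(raw_events):
    """Step 1: drop malformed events; the output is an intermediate result."""
    return [e for e in raw_events if e.get("amount", 0) > 0]


def sales_by_product(cleaned_table, since):
    """Step 2: aggregate landed rows from `since` up to now."""
    totals = {}
    for row in cleaned_table.scan(since):
        totals[row["product"]] = totals.get(row["product"], 0) + row["amount"]
    return totals


cleaned = LakeTable()
# Two micro-batches arrive at different times; both are landed after cleaning.
cleaned.append(clean([
    {"ts": 100, "product": "fund_a", "amount": 500},
    {"ts": 110, "product": "fund_b", "amount": -1},  # malformed, dropped
]))
cleaned.append(clean([
    {"ts": 200, "product": "fund_a", "amount": 300},
    {"ts": 210, "product": "fund_b", "amount": 700},
]))
```

Calling `sales_by_product(cleaned, since=0)` covers the full history, while a later `since` restricts the window. Either query reads landed data, so it does not depend on what the streaming job still holds in state.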
4. Real-time financial data lake
■ Architecture
The following is the architecture of the real-time financial data lake at Central Plains Bank. It covers both the “straight-through” and the “floor” real-time scenarios. Data is fed into Kafka in real time, and Flink reads it from Kafka in real time for processing. Any dimension table data involved is stored in Elastic. There are two cases:
- The business logic is simple. Flink reads event data from Kafka and dimension table data from Elastic in real time, processes it, and sends the result directly to the business side.
- The business logic is complex and is processed step by step. Intermediate results are first written to the data lake and then processed in stages to obtain the final result, which is served to different applications through a query engine.
■ Data flow
This is the data flow diagram of the real-time financial data lake. Real-time data comes from Kafka, and Flink SQL reads it from Kafka in real time for ETL. Real-time data ETL and the data lake platform together provide real-time and near-real-time results: real-time data ETL corresponds to the “straight-through” scenario architecture, while the data lake platform corresponds to the “floor” scenario architecture.
■ Features of real-time financial data lake
The real-time financial data lake has three features.
- First, openness. The data lake supports complex SQL and a large number of financial scenarios.
- Second, timeliness. It supports real-time and near-real-time data analysis and processing, and it can connect to applications with or without landing the data.
- Third, integration. The data lake supports unified stream and batch analysis and processing of structured data. Semi-structured and unstructured data are also supported, because the data lake uses distributed storage.
■ Construction achievements
Continuous construction of the data lake has produced a series of results. We now deliver T+0 data timeliness, already support more than 20 financial products, and can reduce storage costs by a factor of 5.
III. Scenario practice
1. Intelligent real-time anti-fraud
The real-time financial data lake is applied mainly in two areas: real-time BI and real-time decision-making. A typical real-time decision-making application is intelligent real-time anti-fraud. It relies on the real-time computing platform, the knowledge graph platform, machine learning, and real-time data models to provide a series of data services, including a relationship fraud service, device fingerprinting, a behavior monitoring service, a location analysis service, and a matching service. These support anti-fraud in transaction, application, and marketing scenarios.
At present, an average of 1.4 million risk data records are processed in real time each day, with an average of 110 real-time blocks and 108 real-time warnings per day.
2. Real-time BI
Now let’s look at a real-time BI scenario. It centers on a real-time customer insight platform, known internally as the Zhiqiu Platform, which relies on the real-time computing platform, the knowledge graph platform, the customer profiling platform, and the intelligent analysis platform. Together these platforms provide interactive query services, unified metadata management, SQL quality assessment, configuration-based development, unified visual data display, and more. They support trend analysis, circle analysis, retention analysis, customer group analysis, and other scenarios. The platform now covers the common requirements and services of real-time analysis scenarios, provides closed-loop visual real-time BI analysis, and lets branches carry out digital real-time BI analysis on their own. So far, 26,800 real-time BI analysis cases have been implemented, the platform has more than 10,000 monthly active users, and it supports more than 30,000 real-time BI analysis requests per day.