I. Problems and challenges in traditional data lakes
In the traditional data lake solution, Hive is used to build a T+1 data warehouse: HDFS stores massive data and scales horizontally, while Hive manages metadata and provides SQL-based data operations. Although this approach works well in large-scale batch scenarios, the following problems remain:
Problem 1: Transactions are not supported
Because traditional big data solutions do not support transactions, data that has not been fully written may be read, leading to incorrect statistics. To avoid this, the order of read and write tasks is controlled so that read tasks start only after write tasks complete. However, not all read tasks can be constrained by the scheduling system, so the problem persists on the read side.
Problem 2: Data update efficiency is low
In business system databases, apart from flow-type tables whose records are always appended, there are many status-type tables that require update operations (for example, account balance tables, customer status tables, and device status tables). Traditional big data solutions cannot perform incremental updates and usually adopt the zipper-table approach: a Join is executed first, and then an Insert Overwrite rewrites the table to complete the update. This usually runs in T+1 batch mode, so the end-to-end data delay is T+1, with low efficiency and high cost. A sketch of this pattern is shown below.
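As a minimal sketch of this overwrite pattern in HiveQL (table, column, and partition names are hypothetical), assuming account_balance_inc holds the day's changed rows and account_balance keeps day-level full snapshots:

    INSERT OVERWRITE TABLE account_balance PARTITION (dt = '2021-07-07')
    SELECT
      COALESCE(inc.account_id, his.account_id) AS account_id,
      COALESCE(inc.balance,    his.balance)    AS balance
    FROM (SELECT * FROM account_balance WHERE dt = '2021-07-06') his
    FULL OUTER JOIN account_balance_inc inc
      ON his.account_id = inc.account_id;

The full outer join rewrites the entire snapshot even when only a few rows changed, which is exactly the inefficiency described above.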
Problem 3: Unable to respond to business table changes in a timely manner
If an upstream business system changes its data schema, data can no longer be imported into the lake, so the data lake schema must be synchronized. Technically, this scenario is handled by rebuilding the data table, which makes table management and maintenance in the data lake complex and expensive. In addition, synchronizing table structures usually requires coordination between the business and data teams through management processes.
Problem 4: Redundant data in historical snapshot tables
In the traditional data lake solution, historical snapshot tables are stored as full copies. For example, a day-level historical snapshot table stores the complete data set every day, so large amounts of redundant data occupy substantial storage resources.
Problem 5: High cost of processing small batch incremental data
To implement incremental ETL, traditional data lakes usually store incremental data in partitions. To approach T+0 data processing, incremental data must be partitioned at the hour or minute level. This implementation causes small-file problems, and the large number of partitions puts pressure on the metadata service.
Huawei FusionInsight MRS integrates Apache Hudi to solve these problems of traditional data lakes.
II. Key features of the MRS cloud-native data lake Hudi
Apache Hudi is the file organization layer of the data lake. It manages Parquet and other format files, provides data lake capabilities, supports multiple compute engines, offers insert/update/delete (IUD) interfaces, and provides streaming primitives for upserts and incremental pulls on HDFS/OBS data sets. It has the following characteristics:
ACID support
- Supports snapshot-isolated data reads, ensuring read integrity and concurrent read/write capability
- Committed data becomes visible in the lake within seconds
Fast Upsert capability
- Supports pluggable indexes over the underlying storage for fast insert and update of data into the lake
- Extends merge semantics so that mixed insert, update, and delete data can enter the lake in a single operation (see the sketch after this list)
- Supports small-file merging synchronously with writes: written data is automatically merged to a preset file size
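As a hedged sketch of the upsert capability through Hudi's Spark SQL support (the key column id and the change-flag column op are hypothetical), a single MERGE INTO statement can apply mixed inserts, updates, and deletes:

    MERGE INTO table_hudi t
    USING table_changes s            -- hypothetical change feed with an op flag
    ON t.id = s.id
    WHEN MATCHED AND s.op = 'D' THEN DELETE
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *;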
Schema Evolution
- Supports synchronous evolution of in-lake data schemas
- Supports a variety of common schema change operations (see the sketch after this list)
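For example, adding a column is one common change. A minimal Spark SQL sketch (the column name is hypothetical):

    ALTER TABLE table_hudi ADD COLUMNS (device_region STRING);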
Multiple read view interfaces
- Supports reading real-time snapshot data
- Supports reading historical snapshot data (see the time-travel sketch after this list)
- Supports incremental reads of both current and historical data
- Supports fast data exploration and analysis
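As an assumed sketch of reading a historical snapshot (time travel), recent Hudi releases paired with Spark SQL accept roughly the following form; the table name and timestamp are illustrative, and availability depends on the Hudi and Spark versions:

    SELECT * FROM table_hudi TIMESTAMP AS OF '2021-07-07 12:12:12';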
Multi-version
- Data is stored by commit version, and historical operation records are retained to facilitate data tracing
- Data rollback is simple and fast (see the sketch after this list)
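As a hedged sketch, newer Hudi releases expose Spark SQL call procedures for listing commits and rolling back to an earlier instant; the procedure names and their availability depend on the Hudi version:

    CALL show_commits(table => 'table_hudi', limit => 10);
    CALL rollback_to_instant(table => 'table_hudi', instant_time => '20210707121212');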
III. Typical application scenarios of MRS-Hudi
Real-time data ingestion into the lake based on the MRS-CDL component
Scenario Description:
- Data is extracted directly from the business database
- Ingestion into the lake requires high real-time performance, with second-level delay
- Table changes must be synchronized to the data lake table structure in real time
Solution description:
The solution is built on the MRS-CDL component: operation events in the business database are captured and written into MRS-Hudi based data lake storage.
MRS-CDL is a real-time data synchronization service provided with FusionInsight MRS. It captures event information in traditional OLTP databases and pushes it to the data lake in real time. The solution has the following features:
- MRS-CDL supports capturing DDL and DML events from business system databases.
- Supports MRS-Hudi as the data target.
- Visualized operations: collection tasks, lake ingestion tasks, and task management are all handled visually.
- Supports multi-tenant tasks, ensuring data permissions remain consistent after data enters the lake.
- Zero-code task development throughout, saving development cost.
Solution benefits:
- Simple lake ingestion, with zero-code development end to end.
- Fast ingestion timeliness: a change in business system data can land in the lake within minutes.
Lake ingestion based on Flink SQL
Scenario Description:
- No direct connection to the database: data is sent to Kafka by existing collection tools or written to Kafka directly by business systems.
- There is no need to synchronize DDL operation events in real time.
Solution description:
The MRS FlinkSQL lake-ingestion link is based on Flink + Hudi. MRS-Flink supports the solution with the following features:
- Adds Flink engine integration with Hudi, supporting read and write operations on Hudi COW and MOR tables.
- FlinkServer (the Flink development platform) adds Hudi stream table support.
- Visualized job development and job maintenance.
Solution benefits:
- Lake-ingestion code is easy to develop. FlinkSQL completes ingestion with a single statement (a fuller sketch follows this list):
    INSERT INTO table_hudi SELECT * FROM table_kafka;
- Fast timeliness: at the fastest, data can enter the lake within seconds.
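As a fuller hedged sketch of such a job (the topic, storage path, and schema are hypothetical), a Flink SQL ingestion pipeline typically declares a Kafka source table and a Hudi sink table before running the INSERT:

    -- Kafka source table (topic and brokers are hypothetical)
    CREATE TABLE table_kafka (
      id BIGINT,
      name STRING,
      balance DECIMAL(18, 2),
      update_time TIMESTAMP(3)
    ) WITH (
      'connector' = 'kafka',
      'topic' = 'ods_account',
      'properties.bootstrap.servers' = 'kafka-broker:9092',
      'format' = 'json',
      'scan.startup.mode' = 'latest-offset'
    );

    -- Hudi sink table (path is hypothetical); the primary key becomes the record key
    CREATE TABLE table_hudi (
      id BIGINT,
      name STRING,
      balance DECIMAL(18, 2),
      update_time TIMESTAMP(3),
      PRIMARY KEY (id) NOT ENFORCED
    ) WITH (
      'connector' = 'hudi',
      'path' = 'hdfs:///warehouse/table_hudi',
      'table.type' = 'MERGE_ON_READ'
    );

    INSERT INTO table_hudi SELECT * FROM table_kafka;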
Rapid ETL for lake data
Scenario Description:
Data in the lake is usually stored in warehouse layers, such as the source layer (SDI), the summary layer (DWS), and the mart layer (DW). Different enterprises have their own layering standards, along with specifications governing how data flows across layers. The traditional data lake usually uses ETL processing to move data between layers.
Hudi now supports ACID, upsert, and incremental data query capabilities, enabling incremental ETL and fast data flow between layers.
Incremental ETL jobs keep the same business logic as traditional ETL jobs: the commit time is used to read the incremental table and obtain the incremental data. Multi-table associations in the job logic can join Hudi tables with other Hudi tables, or Hudi tables with existing Hive tables. ETL jobs can be developed in SparkSQL or FlinkSQL. An ETL statement based on the incremental view looks like: upsert table_dws select * from table_sdi where commit_time > '2021-07-07 12:12:12'. A more concrete sketch follows.
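As a hedged, more concrete Spark SQL sketch (the key column id is hypothetical; _hoodie_commit_time is Hudi's commit metadata column, whose timestamp format depends on the Hudi version):

    -- Merge rows committed after a given instant in the source layer
    -- into the summary layer.
    MERGE INTO table_dws t
    USING (
      SELECT * FROM table_sdi
      WHERE _hoodie_commit_time > '20210707121212'   -- incremental filter
    ) s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *;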
With incremental ETL, the amount of data processed per run decreases, depending on the actual business traffic and the increment granularity. For example, for Internet of Things business data whose traffic is stable around the clock, a 10-minute incremental ETL processes only 1/(24*60/10) = 1/144 of the full day's data per run. As the amount of data to process drops significantly, the required computing resources drop correspondingly.
Solution benefits:
- The processing latency of individual ETL jobs is reduced, shortening the end-to-end time.
- Resource consumption decreases: the amount of data processed per ETL job drops significantly, and the required computing resources drop with it.
- The original data models stored in the lake need no adjustment.
Supports interactive analysis scenarios
Scenario Description:
Data stored in the data lake covers all data types, many dimensions, and long historical periods, so the data that services require mostly already exists in the lake. An interactive analysis engine can therefore connect directly to the data lake to meet various business requirements.
In data exploration, BI analysis, report presentation, and other business scenarios, queries over massive data must return within seconds, and the analysis interface must be simple SQL.
Solution description:
In this scenario, MRS-HetuEngine can be used to implement the solution. MRS-HetuEngine is a distributed, high-performance interactive analysis engine, mainly used for fast real-time data query scenarios. It has the following features supporting this scenario:
- MRS-HetuEngine is integrated with MRS-Hudi and can quickly read data stored in Hudi.
- Supports querying both the latest snapshot data and historical snapshot data.
- Supports incremental queries: incremental data within any time range can be queried by commit_time.
- For the MOR storage model, especially in data exploration scenarios, Hudi MOR table data can be analyzed quickly through the read-optimized query interface (see the sketch after this list).
- Supports multi-source heterogeneous collaboration, with joint analysis across Hudi and other databases. For example, when dimension data resides in a TP database, fact tables in the data lake can be joined with TP dimension tables for analysis.
- SQL query statements, with JDBC interface support.
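As an assumed sketch: when a Hudi MOR table is synchronized to the catalog, it is commonly exposed as two views, a read-optimized view (often suffixed _ro) and a real-time view (often suffixed _rt); the table names here are illustrative:

    -- Read-optimized query: scans only compacted base files, fastest for exploration
    SELECT count(*) FROM table_hudi_ro;

    -- Real-time (snapshot) query: merges base files with the latest log files
    SELECT count(*) FROM table_hudi_rt;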
Solution benefits:
- Combined with MRS-CDL lake ingestion, changes in business system databases become visible in the data lake within minutes.
- Interactive queries over terabytes to petabytes of data return results in seconds.
- Data at every layer of the lake can be analyzed.
Build stream-batch integration based on Hudi
Scenario Description:
Traditional processing architectures use either Lambda or Kappa. Lambda is flexible and can cover many business scenarios, but it requires two systems and is complex to maintain, and once data is split between them the results are hard to associate, for example when a stream processing scenario needs to use batch processing results. The Kappa architecture targets real-time processing and lacks batch processing capability.
Solution description:
In many real-time scenarios, a minute-level delay is acceptable. Rapid data processing can then be achieved by combining MRS-Hudi's incremental computation with the real-time computing engines Flink and Spark Streaming, yielding minute-level end-to-end delay. In addition, MRS-Hudi is lake storage capable of holding massive data, so it also supports batch computing; Hive and Spark are common batch processing engines. The sketch below shows the same Hudi table serving both modes.
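As a hedged Flink SQL sketch of this stream-batch unification (option names follow the Hudi Flink connector; the path and schema are hypothetical):

    -- Declare the Hudi table with streaming read enabled
    CREATE TABLE table_hudi (
      id BIGINT,
      metric DOUBLE,
      update_time TIMESTAMP(3),
      PRIMARY KEY (id) NOT ENFORCED
    ) WITH (
      'connector' = 'hudi',
      'path' = 'hdfs:///warehouse/table_hudi',
      'table.type' = 'MERGE_ON_READ',
      'read.streaming.enabled' = 'true',       -- continuously consume new commits
      'read.streaming.check-interval' = '60'   -- poll for new commits every 60 s
    );

    -- Streaming job: processes each new commit incrementally as it arrives
    SELECT id, metric FROM table_hudi;

The same physical table can be read in batch mode by Hive or Spark without any copy of the data.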
Solution benefits:
- Unified data storage: real-time data and batch data share the same storage.
- Supports both real-time and batch computing, so processing results of the same business logic can be reused.
- Provides real-time processing capability alongside massive batch processing, meeting minute-level delay requirements.
V. Summary
Traditional big data solutions suffer from pain points such as lack of transaction support, resulting in T+1 delays. Although Flink-based streaming computing can deliver second-level processing of small data volumes in simple scenarios, it still lacks real-time update and transaction capabilities for massive, complex scenarios. With Hudi on Huawei Cloud FusionInsight MRS, minute-level data processing solutions can now be built, providing real-time processing of complex computations over large data volumes, greatly improving data timeliness and putting the value of data within reach.
This document is published by Huawei Cloud.