Abstract: There are a variety of scenarios, tools, and timeliness requirements for ingesting relational database data into the data lake. This article summarizes the relevant scenarios and gives corresponding recommendations.

This article is shared by HisonHuang from the Huawei Cloud community post "Summary of scenarios and solutions for ingesting relational database data into the lake".

There are a variety of scenarios, tools, and timeliness requirements for ingesting relational database data into the data lake. This article summarizes the related scenarios and the corresponding recommendations.

First, two types of ingestion tools are introduced: batch data migration tools (such as CDM) and real-time data access tools (such as CDL).

A batch data migration tool can migrate data from a relational database into the lake in full, or in full plus batched increments. Its ingestion latency (from data generation at the source to landing in the data lake's source layer) is typically 10 minutes or more, for example around 15 minutes, depending on the scheduling interval of the batch incremental migration job. The following is a functional architecture diagram of the batch data migration tool (CDM):

A real-time data access tool (such as CDL) captures the relational database's binlog in real time, stores the change events in a message queue such as Kafka, and parses the binlog to generate insert, update, and delete operations against the data lake's records. This ingests relational database data into the lake in real time, with latency measured in seconds to minutes, and generally well under 15 minutes. The following is a functional architecture diagram of the real-time data access tool (CDL):
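To make the change-capture flow above concrete, here is a minimal, hypothetical sketch of the consumption side (it is not the CDL implementation): it assumes binlog change events have already been published to a Kafka topic as JSON records with an `op` field, and it dispatches each event to a handler that would apply it to the lake. The topic name, broker address, and event layout are all assumptions.

```python
# Hypothetical sketch of the CDC consumption step (not the CDL implementation).
# Assumes binlog change events are already in Kafka as JSON, e.g.:
#   {"op": "insert", "table": "orders", "key": {"id": 1}, "row": {...}, "ts": 1690000000}
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "db.orders.changes",                      # assumed topic name
    bootstrap_servers="kafka:9092",           # assumed broker address
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

def apply_to_lake(event: dict) -> None:
    """Placeholder: a real pipeline would upsert/delete the record
    in the data lake's source layer (e.g. a Hudi table)."""
    if event["op"] in ("insert", "update"):
        print("upsert", event["key"], event["row"])
    elif event["op"] == "delete":
        print("delete", event["key"])

for message in consumer:
    apply_to_lake(message.value)
```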

Scenario 1: The relational database has existing historical data but generates no new data in real time

Solution:

1. Ingestion tool: batch data migration tool (such as CDM)

2. Ingestion mode: whole-table migration, as shown below:

3. Ingestion process:

3.1. Use the ingestion tool to configure the relational database as the source and an HDFS/Hive directory as the destination.

3.2. Use the ingestion tool to start the ingestion job, migrating data from the relational database to the HDFS directory (the data lake's source layer). A minimal sketch of this step follows.
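CDM configures this migration through its own job definitions; purely as an illustration, the same whole-table migration can be sketched with a generic Spark job. The connection details, table name, and target path below are hypothetical.

```python
# Illustrative whole-table migration sketch (not CDM itself); all names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("full-table-ingest").getOrCreate()

# Read the whole source table over JDBC.
src = (spark.read.format("jdbc")
       .option("url", "jdbc:mysql://db-host:3306/sales")   # assumed source database
       .option("dbtable", "orders")                        # assumed table
       .option("user", "reader")
       .option("password", "******")
       .load())

# Land it in the data lake's source layer on HDFS (CSV here; Parquet also works).
src.write.mode("overwrite").option("header", "true") \
   .csv("hdfs://ns1/lake/raw/sales/orders")                # assumed target path
```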

Scenario 2: The relational database initially has no data, and data writing is opened only after the ingestion pipeline is established

Solution: Prerequisite: binlog is enabled for the relational database.

1. Ingestion tool: real-time data access tool (such as CDL)

2. Ingestion mode: real-time incremental ingestion, as shown below:

3. Ingestion process:

3.1. Use the ingestion tool to configure the relational database as the source and a Hudi file directory as the destination;

3.2. Start the ingestion job;

3.3. Incremental data from the relational database is ingested into the lake in real time:

3.3.1. Data records are inserted, updated, and deleted in the relational database;

3.3.2. The relational database's change logs are captured in real time by the ingestion tool;

3.3.3. The ingestion tool parses the logs and calls the Hudi interface to insert, update, and delete data records in the Hudi file directory (the data lake's source layer). A minimal sketch of this step follows.
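The "calls the Hudi interface" step can be pictured with a generic Spark-on-Hudi sketch. This is not CDL's internal implementation; the table name, columns, and paths are assumptions, and only standard Hudi write options are used.

```python
# Illustrative sketch of applying parsed change records to a Hudi table
# (not CDL internals; table name, columns, and paths are assumed).
from pyspark.sql import SparkSession

# Hudi's Spark bundle must be on the classpath for format("hudi") to be available.
spark = SparkSession.builder.appName("cdc-to-hudi").getOrCreate()

hudi_options = {
    "hoodie.table.name": "orders_raw",                     # assumed table name
    "hoodie.datasource.write.recordkey.field": "id",       # primary key column
    "hoodie.datasource.write.precombine.field": "ts",      # change timestamp column
    "hoodie.datasource.write.operation": "upsert",         # insert/update path
}

# `changes_df` stands in for the insert/update rows parsed from the binlog.
changes_df = spark.createDataFrame([(1, "new_value", 1690000000)], ["id", "col1", "ts"])
changes_df.write.format("hudi").options(**hudi_options) \
    .mode("append").save("hdfs://ns1/lake/raw/sales/orders_hudi")

# Deletes are applied the same way, switching the write operation to "delete".
deletes_df = spark.createDataFrame([(2, "ignored", 1690000001)], ["id", "col1", "ts"])
deletes_df.write.format("hudi") \
    .options(**{**hudi_options, "hoodie.datasource.write.operation": "delete"}) \
    .mode("append").save("hdfs://ns1/lake/raw/sales/orders_hudi")
```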

Scenario 3: The relational database has existing historical data and generates new data in real time, and each data record has a timestamp field

Plan 1:

1. Ingestion tool: batch data migration tool (such as CDM)

2. Ingestion mode: initial ingestion of historical data + continuous batch incremental ingestion

3. Ingestion latency: near real-time (depends on the batch scheduling cycle)

4. Ingestion process:

4.1. Assume the relational database continuously generates a steady volume of new data;

4.2. Use the ingestion tool to configure the relational database as the source and an HDFS/Hive directory as the destination;

4.3. Use the ingestion tool to start the initial (historical) data ingestion job, with a WHERE filter on the timestamp field covering the range from the initial time to the current time;

Note: the initial ingestion job can take a long time, depending on the volume of historical data, network bandwidth, the throughput of the ingestion job, and other factors. During this period the relational database keeps receiving writes and accumulates a large amount of new data.

4.4. Repeatedly launch batch incremental ingestion jobs, each with a WHERE filter on the timestamp field covering the range from the cut-off time of the previous batch to the current time. The data volume of each batch decreases gradually, the interval between batches shortens and stabilizes, and the compute resources consumed by each batch also decrease and stabilize. The schematic is as follows, and a sketch of the incremental filter is given after it:
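As a hedged illustration only (CDM expresses this in its job configuration rather than hand-written code), a minimal sketch of such a time-window pull follows; the table, column, connection, and path names are hypothetical.

```python
# Illustrative batch-incremental pull (not CDM itself); all names are hypothetical.
from datetime import datetime
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-incremental").getOrCreate()

last_cutoff = "2024-01-01 00:00:00"                       # cut-off time of the previous batch
current_cutoff = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

# Only rows whose timestamp field falls in (last_cutoff, current_cutoff] are pulled.
incremental_query = f"""
    SELECT * FROM orders
    WHERE update_time >  '{last_cutoff}'
      AND update_time <= '{current_cutoff}'
"""

inc = (spark.read.format("jdbc")
       .option("url", "jdbc:mysql://db-host:3306/sales")  # assumed source database
       .option("query", incremental_query)                # assumed table and column names
       .option("user", "reader")
       .option("password", "******")
       .load())

inc.write.mode("append").option("header", "true") \
   .csv("hdfs://ns1/lake/raw/sales/orders")               # same source-layer directory

# The next batch would persist `current_cutoff` and use it as its `last_cutoff`.
```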

Plan 2: Prerequisite: binlog is enabled for the relational database.

1. Ingestion tools: batch data migration tool (such as CDM) + real-time data access tool (such as CDL)

2. Ingestion mode: initial ingestion of historical data + continuous batch incremental ingestion + real-time incremental ingestion

3. Ingestion latency: the first two stages are not real-time; the final stage is real-time

4. Ingestion process:

4.1. Assume the relational database continuously generates a steady volume of new data;

4.2. Use the batch data migration tool to configure the relational database as the source and an HDFS/Hive directory as the destination (table A; the file format may be CSV);

4.3. Use the batch data migration tool to start the initial (historical) data ingestion job, with a WHERE filter on the timestamp field covering the range from the initial time to the current time. Note: this job can take a long time, depending on the volume of historical data, network bandwidth, the throughput of the ingestion job, and other factors; during this period the relational database keeps receiving writes and accumulates a large amount of new data;

4.4. Repeatedly launch batch incremental ingestion jobs, each with a WHERE filter on the timestamp field covering the range from the cut-off time of the previous batch to the current time. The data volume of each batch decreases gradually, the interval between batches shortens and stabilizes, and the compute resources consumed by each batch also decrease and stabilize;

4.5. At a point in time Ts, stop data writing to the relational database, and ensure that all data written before Ts has been migrated by the batch jobs to the HDFS directory (the data lake's source layer);

4.6. Stop the batch data migration jobs;

4.7. Use the real-time data access tool to configure the relational database as the source and a Hudi file directory as the destination (table B; the file format is Hudi);

4.8. Start the real-time data access tool;

4.9. At a point in time Te, the relational database is reopened for data writing;

4.10. Incremental data from the relational database is ingested into the lake in real time:

4.10.1. The relational database's change logs are captured in real time by the real-time data access tool;

4.10.2. The real-time data access tool parses the logs and calls the Hudi interface to insert, update, and delete data records in the Hudi file directory (the data lake's source layer). The schematic is as follows, and a sketch of the cut-over check is given after it:
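Step 4.5 is the delicate part of the cut-over: before the real-time job is started and writes are reopened at Te, it is worth verifying that everything written before Ts has actually landed. Below is a minimal, hypothetical check, not a CDM/CDL feature; the path, column name, and timestamp format are assumptions, and timestamps are compared as formatted strings.

```python
# Hypothetical cut-over check before reopening writes at Te (not a CDM/CDL feature).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cutover-check").getOrCreate()

ts_cutoff = "2024-06-01 12:00:00"   # the agreed write-freeze point Ts

landed = spark.read.option("header", "true") \
    .csv("hdfs://ns1/lake/raw/sales/orders")              # table A in the source layer

# Timestamps are "YYYY-MM-DD HH:MM:SS" strings, so string comparison preserves order.
max_landed = landed.agg(F.max("update_time")).first()[0]
print("latest landed timestamp:", max_landed)

# Only if all rows up to Ts are present should the real-time (CDL) job be started
# and the database reopened for writes at Te.
assert max_landed is not None and max_landed >= ts_cutoff, "batch backlog not fully landed"
```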

Scenario 4: The relational database has existing historical data and generates new data in real time, and the data records have no timestamp field

Solution: Prerequisite: binlog is enabled for the relational database.

1. Ingestion tools: batch data migration tool (such as CDM) + real-time data access tool (such as CDL)

2. Ingestion mode: whole-table migration + real-time incremental ingestion

3. Ingestion latency: only the final stage is real-time

4. Ingestion process:

4.1. Suspend data writing to the relational database;

4.2. Use the batch data migration tool to configure the relational database as the source and an HDFS/Hive directory as the destination (table A; the file format may be CSV);

4.3. Use the batch data migration tool to start the ingestion job, migrating data from the relational database to the HDFS directory (the data lake's source layer);

4.4. Stop the batch data migration tool after the migration is complete;

4.5. Use the real-time data access tool to configure the relational database as the source and a Hudi file directory as the destination (table B; the file format is Hudi);

4.6. Start the real-time data access tool;

4.7. At a point in time Te, the relational database is reopened for data writing;

4.8. Incremental data from the relational database is ingested into the lake in real time:

4.8.1. The relational database's change logs are captured in real time by the real-time data access tool;

4.8.2. The real-time data access tool parses the logs and calls the Hudi interface to insert, update, and delete data records in the Hudi file directory (the data lake's source layer). The schematic is as follows, and a note on key configuration is given after it:
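Because these tables have no timestamp field, the Hudi keys cannot come from a business time column: the record key comes from the table's primary key, and the precombine (deduplication) field would typically be a change-sequence number added by the CDC capture step. The configuration sketch below is hypothetical and only illustrates this difference from Scenario 2; the column names are assumptions.

```python
# Hypothetical Hudi write configuration for a table without a timestamp field.
# The primary key identifies each record; `binlog_seq` is assumed to be a
# monotonically increasing change-sequence number added by the CDC capture step.
hudi_options = {
    "hoodie.table.name": "orders_raw",
    "hoodie.datasource.write.recordkey.field": "id",           # table's primary key
    "hoodie.datasource.write.precombine.field": "binlog_seq",  # CDC-supplied sequence
    "hoodie.datasource.write.operation": "upsert",
}

# The parsed change rows would then be written exactly as in the Scenario 2 sketch:
# changes_df.write.format("hudi").options(**hudi_options).mode("append").save(path)
```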
