Introduction:

ETL is responsible for combining data from distributed, heterogeneous data sources, such as relational databases and flat data files. The data is extracted into a temporary intermediate layer for cleaning, transformation, and integration, and finally loaded into a data warehouse or data mart, where it becomes the basis for online analytical processing and data mining.

ETL is a very important part of the data warehouse; it is the necessary link between the upstream business systems and the downstream analysis. Compared with relational databases, data warehouse technology has no strict mathematical foundation and is oriented more toward practical engineering. From an engineering point of view, data is loaded according to the requirements of the physical data model and put through a series of processing steps. This processing relies heavily on experience, and its quality directly determines the quality of the data in the data warehouse, which in turn affects the quality of the results of online analytical processing and data mining.

Key technologies in ETL

The main stages of the ETL process are data extraction, data conversion and processing, and data loading. To support these functions, ETL tools are generally augmented with capabilities such as workflow management, scheduling engines, rules engines, scripting support, and statistics.

Data extraction

Data extraction is the process of extracting data from a data source. In practice, the data source is usually a relational database. There are several ways to extract data from a database.

(1) Full extraction

Full extraction is similar to data migration or data replication: it extracts the data of a table or view from the source database in its entirety and converts it into a format that the ETL tool can recognize. Full extraction is relatively simple.
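A minimal sketch of full extraction, assuming an illustrative source table TB_ORDER and a staging table STG_ORDER of the same structure that the ETL tool can reach (for example through a database link); in practice the tool would read the full result set and convert it into its own internal format:

```sql
-- Empty the staging copy, then pull the whole source table (or view) across.
TRUNCATE TABLE STG_ORDER;

INSERT INTO STG_ORDER
SELECT * FROM TB_ORDER;
```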

(2) Incremental extraction

Incremental extraction only extracts the data that has been added or modified in the source tables since the last extraction. In ETL practice, incremental extraction is used far more widely than full extraction. The key to incremental extraction is how to capture the changed data. Generally, a capture method must meet two requirements: accuracy, meaning it can capture the changed data in the business system accurately at the required frequency; and performance, meaning it must not put too much pressure on the business system or affect existing services. The methods commonly used to capture changed data for incremental extraction are as follows:

**a. Triggers:** Triggers are created on the tables to be extracted, usually one each for insert, update, and delete. Every time the data in a source table changes, the corresponding trigger writes the changed data to a temporary table; the extraction thread then reads the data from the temporary table and marks or deletes the rows it has extracted. The advantage of the trigger approach is high extraction performance; the disadvantage is that triggers must be created on the business tables, which has a certain impact on the business system.
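A minimal sketch of the trigger approach in Oracle-style PL/SQL, assuming an illustrative source table TB_ORDER and a change table TB_ORDER_DELTA; for brevity a single trigger handles all three DML types instead of three separate triggers:

```sql
CREATE OR REPLACE TRIGGER TRG_TB_ORDER_CAPTURE
AFTER INSERT OR UPDATE OR DELETE ON TB_ORDER
FOR EACH ROW
BEGIN
  -- Record the primary key, the type of change, and when it happened;
  -- the extraction thread later drains TB_ORDER_DELTA.
  IF DELETING THEN
    INSERT INTO TB_ORDER_DELTA (ID, DML_TYPE, CHANGE_TIME)
    VALUES (:OLD.ID, 'D', SYSDATE);
  ELSIF INSERTING THEN
    INSERT INTO TB_ORDER_DELTA (ID, DML_TYPE, CHANGE_TIME)
    VALUES (:NEW.ID, 'I', SYSDATE);
  ELSE
    INSERT INTO TB_ORDER_DELTA (ID, DML_TYPE, CHANGE_TIME)
    VALUES (:NEW.ID, 'U', SYSDATE);
  END IF;
END;
/
```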

**b. Timestamp:** This is a change data capture method based on snapshot comparison. A timestamp field is added to the source table, and its value is changed whenever the table data is inserted or modified. During extraction, the system time is compared with the timestamp field value to decide which rows to extract. Some databases support automatically updated timestamps, i.e., the timestamp field is updated automatically whenever any other field in the row changes. For databases that do not, the business system must update the timestamp field manually when it updates business data. Like the trigger method, the timestamp method performs well and makes the extraction logic clear and simple, but it is also somewhat intrusive to the business system (an extra timestamp field must be added), and for databases without automatic timestamp updates the business system has to perform the extra update itself. In addition, the inability to capture delete operations, and update operations on data that predates the timestamp field, limits the accuracy of this approach.
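A minimal sketch of timestamp-based incremental extraction, assuming an illustrative source table TB_ORDER with a LAST_UPDATE_TS column and a bind variable :LAST_EXTRACT_TIME that holds the time of the previous extraction run:

```sql
-- Pull only the rows that changed since the previous extraction.
SELECT ID, TITLE, AMOUNT, LAST_UPDATE_TS
FROM   TB_ORDER
WHERE  LAST_UPDATE_TS > :LAST_EXTRACT_TIME;
```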

**c. Full table comparison:** The typical full table comparison method uses MD5 checksums. The ETL tool creates an MD5 temporary table with a structure similar to the table to be extracted; it records the primary key of the source table and an MD5 checksum calculated over all of its fields. During each extraction, the MD5 checksums computed from the source table are compared with those in the MD5 temporary table to decide which rows in the source table were added, modified, or deleted, and the stored MD5 checksums are updated at the same time. The advantage of the MD5 approach is that it is only mildly intrusive to the source system (it only needs an MD5 temporary table), but it has obvious disadvantages. Unlike the active notification of the trigger and timestamp methods, MD5 passively compares the data of the whole table, so performance is poor. When the table has no primary key or unique column and contains duplicate records, the accuracy of the MD5 approach also suffers.
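A minimal sketch of the MD5 comparison, assuming an illustrative source table TB_ORDER(ID, TITLE, AMOUNT), an MD5 temporary table TB_ORDER_MD5(ID, MD5_VALUE), and a database that exposes MD5() and CONCAT_WS() functions (for example MySQL or PostgreSQL):

```sql
-- New or changed rows: key missing from the MD5 table, or checksum differs.
SELECT s.ID
FROM   TB_ORDER s
LEFT JOIN TB_ORDER_MD5 m ON m.ID = s.ID
WHERE  m.ID IS NULL
   OR  m.MD5_VALUE <> MD5(CONCAT_WS('|', s.TITLE, s.AMOUNT));

-- Deleted rows: key present in the MD5 table but no longer in the source.
SELECT m.ID
FROM   TB_ORDER_MD5 m
LEFT JOIN TB_ORDER s ON s.ID = m.ID
WHERE  s.ID IS NULL;
```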

**d. Log comparison:** Changed data is determined by analyzing the database's own logs. Oracle's CDC (Changed Data Capture) technology is an example; the CDC feature was introduced in the Oracle9i database. CDC can identify the data that has changed since the last extraction. With CDC, data can be extracted while insert, update, or delete operations are performed on the source tables, and the changed data is stored in change tables in the database. The changed data can then be captured and made available to the target system in a controlled way through database views. The CDC architecture is based on a publisher/subscriber model: the publisher captures the change data and provides it to subscribers, and the subscribers consume the change data obtained from the publisher. Typically, a CDC system has one publisher and multiple subscribers. The publisher first identifies the source tables from which change data is to be captured, then captures the changed data and stores it in specially created change tables, and also controls subscribers' access to the change data. Subscribers need to know which changes they are interested in; a subscriber may not be interested in all the data a publisher publishes. A subscriber creates a subscriber view to access the change data that the publisher has authorized. CDC has a synchronous mode and an asynchronous mode: in synchronous mode, change data is captured in real time and stored in the change tables, and the publisher and subscribers are located in the same database; asynchronous mode is based on Oracle's Streams replication technology.
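A hedged sketch of the subscriber side, with names following Oracle's DBMS_CDC_SUBSCRIBE package; the change set ORDER_CHANGE_SET, subscription CDC_ORDER_SUB, source table TB_ORDER, and subscriber view ORDER_CHG_VIEW are illustrative, the publisher is assumed to have created the change set and change table already, and exact procedure parameters vary by Oracle release, so treat this as an outline rather than a tested script:

```sql
BEGIN
  -- Create a subscription on the publisher's change set.
  DBMS_CDC_SUBSCRIBE.CREATE_SUBSCRIPTION(
    change_set_name   => 'ORDER_CHANGE_SET',
    description       => 'Changes on TB_ORDER for the ETL job',
    subscription_name => 'CDC_ORDER_SUB');

  -- Subscribe to the columns of interest and name the subscriber view.
  DBMS_CDC_SUBSCRIBE.SUBSCRIBE(
    subscription_name => 'CDC_ORDER_SUB',
    source_schema     => 'APP',
    source_table      => 'TB_ORDER',
    column_list       => 'ID, TITLE, AMOUNT',
    subscriber_view   => 'ORDER_CHG_VIEW');

  DBMS_CDC_SUBSCRIBE.ACTIVATE_SUBSCRIPTION(subscription_name => 'CDC_ORDER_SUB');
END;
/

-- Each extraction cycle: extend the window, read the changes, then purge.
EXEC DBMS_CDC_SUBSCRIBE.EXTEND_WINDOW(subscription_name => 'CDC_ORDER_SUB');
SELECT * FROM ORDER_CHG_VIEW;
EXEC DBMS_CDC_SUBSCRIBE.PURGE_WINDOW(subscription_name => 'CDC_ORDER_SUB');
```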

In addition to relational databases, the data sources handled by ETL may also be files, such as TXT files, Excel files, and XML files. File data is generally extracted in full. Before extraction, the file's timestamp or MD5 checksum can be saved and compared at the next extraction; if they are identical, the file can be skipped.

Data conversion and processing

The data extracted from the data source may not completely meet the requirements of the target database, for example because of inconsistent data formats, data entry errors, or incomplete data. Therefore, the extracted data needs to be converted and processed.

Data conversion and processing can be carried out in the ETL engine, or the extraction process can take advantage of the capabilities of the relational database itself.

(1) Data conversion and processing in ETL engine

Data transformation is generally implemented in a componentized manner in ETL engines. Commonly used conversion components include field mapping, data filtering, data cleaning, data replacement, data calculation, data validation, data encryption and decryption, data merging, data splitting, and so on. These components are pluggable and composable, and exchange data through a data bus.

Some ETL tools also provide scripting support that allows users to customize the transformation and processing behavior of data in a programmatic manner.

(2) Data processing in the database

Relational databases themselves provide powerful SQL and built-in functions to support data processing, for example WHERE clauses for filtering, column aliases for mapping query fields to target-table fields, the substr function for truncating strings, case expressions for conditional logic, and so on. Here is an example of such an SQL query.

select ID as USERID, substr(TITLE, 1, 20) as TITLE, case when REMARK is null then ' ' else REMARK end as CONTENT from TB_REMARK where ID > 100;

Compared with conversion and processing in the ETL engine, performing the conversion directly in SQL statements is simpler, clearer, and faster. Processing that cannot be expressed in SQL can then be left to the ETL engine.

Data loading

Loading the transformed and processed data into the target database is usually the last step of the ETL process. The best way to load data depends on the type of operation being performed and on how much data needs to be loaded. When the target is a relational database, there are generally two loading methods:

(1) Perform insert, update, and delete operations directly through SQL statements.

(2) Use bulk-loading methods, such as BCP, BULK, or the bulk-loading tools and APIs specific to the relational database.

The first approach is used in most cases because the operations are logged and recoverable. However, bulk loading is easy to use and efficient when large amounts of data need to be loaded. Which loading method to use depends on the needs of the business system.
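For the first method, inserts and updates are often combined into an upsert with a MERGE statement; a minimal sketch, assuming illustrative staging and target tables STG_REMARK and DW_REMARK:

```sql
MERGE INTO DW_REMARK t
USING STG_REMARK s
ON (t.USERID = s.USERID)
WHEN MATCHED THEN
  UPDATE SET t.TITLE = s.TITLE, t.CONTENT = s.CONTENT
WHEN NOT MATCHED THEN
  INSERT (USERID, TITLE, CONTENT)
  VALUES (s.USERID, s.TITLE, s.CONTENT);
```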

Comprehensive comparison of ETL tools:

| No. | Country | Name | Price | Review |
| --- | --- | --- | --- | --- |
| 1 | United States | ODI (built into the database) | Bundled by the database vendor | Limitation: too deeply coupled to the Oracle database to work well across heterogeneous databases (C/S architecture) |
| 2 | United States | Informatica (founded in 1993) | RMB 700,000–1,000,000 | The most professional ETL tool in the industry; high CPU, memory, and disk requirements (C/S architecture) |
| 3 | China | Taskctl (founded in 2014) | Free to RMB 100,000 | The earliest ETL tool vendor in China; core technology developed domestically, closely following the international professional products (C/S architecture) |
| 4 | United States | Datastage (acquired in 2005) | RMB 400,000–800,000 | On par with Informatica; no breakpoint-recovery capability; weak scheduling and monitoring client functions (C/S architecture) |
| 5 | United Kingdom (later acquired by a Japanese company) | Kettle (open-sourced in 2006) | Free | The best-known open-source ETL tool in the industry. However: data security risks; uncertain service follow-up; a third-party scheduler is needed to control job execution times; job stability is poor and the software needs periodic restarts; no metadata management; later maintenance and management costs are hard to estimate... Closed source is a land mine! |