Abstract: ETL is the process of extracting data from business systems, cleaning and transforming it, and loading it into the data warehouse. It is a key step in data warehouse construction. Users extract the required data from the data sources, clean it, and finally load it into the warehouse according to a pre-defined warehouse model. The goal is to integrate an enterprise's scattered, disordered, and non-standard data and provide a basis for analysis and decision making.
1 Overview of ETL algorithms
> Algorithm application scenario overview
This article covers 8 ETL algorithms in total, grouped into 4 categories. The incremental-accumulation and zipper algorithms are the best fit for tracking historical data in a warehouse. In practice, however, for business and performance reasons, tables are most often loaded with the full-delete-and-insert and incremental-accumulation algorithms.
2 Full delete and full insert model
This is the Delete/Insert implementation logic.
> Application Scenarios
It is mainly used to load dimension tables, parameter tables, and master tables. The source table is a full data table, and the business logic only needs to keep the latest full data; there is no need to track past history.
> Algorithm implementation logic
1. Clear the target table.
2. Insert the full source table into the target table.
> ETL code prototype
```sql
-- 1. Clear the target table
TRUNCATE TABLE <target table>;
-- 2. Full insert
INSERT INTO <target table> (fields***)
SELECT fields*** FROM <source table>
JOIN <associated data***>
WHERE ***;
```
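As a runnable illustration of the two steps above, here is a minimal sketch using Python's sqlite3; the table names (`src_product`, `dim_product`) and columns are invented for the example:

```python
import sqlite3

# In-memory database standing in for the warehouse.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE src_product (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE dim_product (id INTEGER PRIMARY KEY, name TEXT)")

# Stale data in the target; a fresh full snapshot in the source.
cur.execute("INSERT INTO dim_product VALUES (1, 'old name')")
cur.executemany("INSERT INTO src_product VALUES (?, ?)",
                [(1, "widget"), (2, "gadget")])

# 1. Clear the target table (SQLite has no TRUNCATE; DELETE is equivalent).
cur.execute("DELETE FROM dim_product")
# 2. Full insert from the source.
cur.execute("INSERT INTO dim_product SELECT id, name FROM src_product")
conn.commit()

print(cur.execute("SELECT id, name FROM dim_product ORDER BY id").fetchall())
# [(1, 'widget'), (2, 'gadget')]
```

On a real warehouse engine TRUNCATE is usually cheaper than DELETE, since it deallocates storage instead of logging row deletions.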
3 Incremental merge model
This is the Upsert implementation logic.
> Application Scenarios
It is mainly used to load parameter tables and master tables. The source table can hold incremental or full data, and the target table always holds the latest, most complete records.
> Algorithm implementation logic
1. Compare the source and target tables on the PK (primary key);
2. Where the PK matches and the record has changed, update the target table;
3. Where a record exists in the source table but not in the target table, insert it directly.
> ETL code prototype
```sql
-- 1. Create a temporary table and load the source data
CREATE TEMP TABLE <temp table> ***;
INSERT INTO <temp table> (fields***)
SELECT *** FROM <source table> *** JOIN *** WHERE ***;
-- 2. Merge into the target table
MERGE INTO <target table> AS T1
USING <temp table> AS S1
ON (***PK***)
WHEN MATCHED THEN
  UPDATE SET T1.Colx = S1.Colx ***
WHEN NOT MATCHED THEN
  INSERT (fields***) VALUES (S1.fields***);
```
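The same Upsert logic can be sketched without MERGE (which SQLite does not support) by running the PK-matched UPDATE and the not-matched INSERT as two statements; the `target`/`source` table and column names are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE target (pk INTEGER PRIMARY KEY, val TEXT)")
cur.execute("CREATE TABLE source (pk INTEGER PRIMARY KEY, val TEXT)")
cur.executemany("INSERT INTO target VALUES (?, ?)", [(1, "a"), (2, "b")])
cur.executemany("INSERT INTO source VALUES (?, ?)", [(2, "B"), (3, "c")])

# Steps 1-2: same PK in both tables -> update the target from the source.
cur.execute("""
    UPDATE target
    SET val = (SELECT s.val FROM source s WHERE s.pk = target.pk)
    WHERE pk IN (SELECT pk FROM source)
""")
# Step 3: PK only in the source -> insert it into the target.
cur.execute("""
    INSERT INTO target
    SELECT * FROM source WHERE pk NOT IN (SELECT pk FROM target)
""")
conn.commit()

print(cur.execute("SELECT * FROM target ORDER BY pk").fetchall())
# [(1, 'a'), (2, 'B'), (3, 'c')]
```

The two-statement form is equivalent to MERGE here because the UPDATE runs before the INSERT, so a PK is never both updated and inserted in the same run.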
4 Incremental accumulation model
This is the Append implementation logic.
> Application Scenarios
It is mainly used to load flow (transaction) tables: the flow and event data generated each day is appended to the target table, retaining the full history. Flow tables, snapshot tables, and statistical analysis tables are all implemented with this logic.
> Algorithm implementation logic
1. Insert the source table directly into the target table.
> ETL code prototype
```sql
INSERT INTO <target table> (fields***)
SELECT *** FROM <source table>
WHERE ***;
```
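A minimal sketch of the append pattern, assuming invented `txn_daily` (the day's batch) and `txn_history` (the target) tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE txn_history (txn_date TEXT, amount REAL)")
cur.execute("CREATE TABLE txn_daily (txn_date TEXT, amount REAL)")

# Two daily batches: each load is appended to the target, never replaced.
for day, amounts in [("2020-02-01", [10.0, 20.0]), ("2020-02-02", [5.0])]:
    cur.execute("DELETE FROM txn_daily")
    cur.executemany("INSERT INTO txn_daily VALUES (?, ?)",
                    [(day, a) for a in amounts])
    cur.execute("INSERT INTO txn_history SELECT * FROM txn_daily")
conn.commit()

print(cur.execute("SELECT COUNT(*) FROM txn_history").fetchone()[0])  # 3
```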
5 Full history zipper model
> Concept
A zipper table is a data warehouse table consisting of at least PK fields, tracked attribute fields, an open-chain (start) date, and a closed-chain (end) date.
> Benefit
Using the open-chain and closed-chain dates, the records valid on any given date can be extracted quickly.
The zipper algorithm is most valuable for tracking non-event (non-flow) data from the source system. Since the volume of data that changes each day in a source business system is usually limited, zipper processing greatly reduces the storage cost of daily snapshots without losing any of the data's change history.
> Example: extracting the valid data for a specified date
Extract the data valid on February 5, 2020:
```sql
SELECT * FROM <target table>
WHERE start_date <= DATE '2020-02-05'
  AND end_date > DATE '2020-02-05';
```
The query returns only the rows valid on that date.
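A runnable sketch of the same extraction, using ISO-format date strings in SQLite (which compare correctly as text) and an invented `cust_zipper` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# A tiny zipper table: each row is valid from start_date (inclusive)
# to end_date (exclusive); 9999-12-31 marks the current open chain.
cur.execute("""CREATE TABLE cust_zipper
               (pk INTEGER, status TEXT, start_date TEXT, end_date TEXT)""")
cur.executemany("INSERT INTO cust_zipper VALUES (?, ?, ?, ?)", [
    (1, "active",   "2020-01-01", "2020-02-03"),
    (1, "inactive", "2020-02-03", "9999-12-31"),
    (2, "active",   "2020-02-04", "9999-12-31"),
])

# Valid records on 2020-02-05: start_date <= d AND end_date > d.
rows = cur.execute("""
    SELECT pk, status FROM cust_zipper
    WHERE start_date <= '2020-02-05' AND end_date > '2020-02-05'
    ORDER BY pk
""").fetchall()
print(rows)  # [(1, 'inactive'), (2, 'active')]
```

Note that customer 1's first row is excluded because its chain was closed on 2020-02-03, before the query date.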
> Application Scenarios
A full-history zipper tracks the complete change history of the source table. If a record no longer exists in the source table, its chain is closed; for new or changed data, a new valid record is opened according to the PK.
> Algorithm implementation logic
1. Extract the current valid (open-chain) records;
2. Extract the day's latest data from the source system;
3. Compare the current valid records with the latest source data on the PK fields, update the target table's current valid records, and close their chains;
4. Compare the latest source data with the current valid records on all fields, and insert the new records into the target table.
> ETL code prototype
```sql
-- 1. Extract the current open-chain records
INSERT INTO <temp table-open-pre>
SELECT *** FROM <target table>
WHERE end_date = DATE '<max date>';
-- 2. Extract the day's latest source data into <temp table-cur>
-- 3. Records to open: source rows not matching any open-chain row
INSERT INTO <temp table-increment-ins>
SELECT *** FROM <temp table-cur>
WHERE (*** excluding chain-date fields ***) NOT IN
      (SELECT *** FROM <temp table-open-pre>);
-- 4. Records to close: open-chain rows not matching any source row
INSERT INTO <temp table-increment-upd>
SELECT *** FROM <temp table-open-pre>
WHERE (*** excluding chain-date fields ***) NOT IN
      (SELECT *** FROM <temp table-cur>);
-- 5. Close the old chains: replace the open rows with closed ones
DELETE FROM <target table>
WHERE (PK***) IN (SELECT PK*** FROM <temp table-increment-upd>)
  AND end_date = DATE '<max date>';
INSERT INTO <target table>
SELECT ***, start_date, DATE '<data date>'
FROM <temp table-increment-upd>;
-- 6. Open the new chains
INSERT INTO <target table>
SELECT ***, DATE '<data date>', DATE '<max date>'
FROM <temp table-increment-ins>;
```
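The steps above can be condensed into a runnable sketch; here correlated NOT EXISTS subqueries replace the staging temp tables, SQLite stands in for the warehouse, and all names (`src`, `zipper`, `val`) are invented:

```python
import sqlite3

MAX_DATE = "9999-12-31"  # sentinel end date marking an open chain

def load_day(cur, data_date, snapshot):
    """Apply one day's full source snapshot to the zipper table."""
    cur.execute("DELETE FROM src")
    cur.executemany("INSERT INTO src VALUES (?, ?)", snapshot)
    # Close chains: open rows whose (pk, val) no longer appear in the source.
    cur.execute("""
        UPDATE zipper SET end_date = ?
        WHERE end_date = ?
          AND NOT EXISTS (SELECT 1 FROM src
                          WHERE src.pk = zipper.pk AND src.val = zipper.val)
    """, (data_date, MAX_DATE))
    # Open chains: source rows with no matching open row yet.
    cur.execute("""
        INSERT INTO zipper
        SELECT pk, val, ?, ? FROM src
        WHERE NOT EXISTS (SELECT 1 FROM zipper z
                          WHERE z.end_date = ?
                            AND z.pk = src.pk AND z.val = src.val)
    """, (data_date, MAX_DATE, MAX_DATE))

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE src (pk INTEGER, val TEXT)")
cur.execute("""CREATE TABLE zipper
               (pk INTEGER, val TEXT, start_date TEXT, end_date TEXT)""")

load_day(cur, "2020-02-01", [(1, "a"), (2, "b")])    # initial full load
load_day(cur, "2020-02-02", [(1, "a2"), (3, "c")])   # 1 changed, 2 gone, 3 new
print(cur.execute("SELECT * FROM zipper ORDER BY pk, start_date").fetchall())
```

After the second day, PK 1 has a closed chain for its old value and an open chain for the new one, PK 2's chain is closed because it vanished from the full snapshot, and PK 3 has a fresh open chain.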
6 Incremental zipper model
> Application Scenarios
An incremental zipper tracks the history of incremental data changes, opening a new chain record based on PK comparison.
> Algorithm implementation logic
1. Extract the previous day's open-chain data;
2. For records with the same PK that have changed, close the old record's chain and open a new one;
3. For PKs that exist only in the source table, open a new chain record.
> ETL code prototype
```sql
-- 1. Extract the previous day's open-chain records
INSERT INTO <temp table-open-pre>
SELECT *** FROM <target table>
WHERE end_date = DATE '<max date>';
-- 2. Extract the day's incremental source data into <temp table-cur>
-- 3. New records: PK exists in the source but not in the open chain
INSERT INTO <temp table-increment-ins>
SELECT *** FROM <temp table-cur>
WHERE (***PK***) NOT IN (SELECT ***PK*** FROM <temp table-open-pre>);
-- 4. Changed records: same PK, tracked fields differ
INSERT INTO <temp table-increment-upd>
SELECT S1.*** FROM <temp table-cur> S1
INNER JOIN <temp table-open-pre> T1 ON (***PK equal***)
WHERE (*** tracked fields not equal ***);
-- 5. Close the old chain for changed records
UPDATE <target table>
SET end_date = DATE '<data date>'
WHERE (***PK***) IN (SELECT ***PK*** FROM <temp table-increment-upd>)
  AND end_date = DATE '<max date>';
-- 6. Open new chains for changed and new records
INSERT INTO <target table>
SELECT ***, DATE '<data date>', DATE '<max date>'
FROM <temp table-increment-upd>;
INSERT INTO <target table>
SELECT ***, DATE '<data date>', DATE '<max date>'
FROM <temp table-increment-ins>;
```
7 Incremental add-delete zipper model
> Application Scenarios
Uses business fields (such as a delete flag) to track the change history of incremental data that includes deletions.
> Algorithm implementation logic
1. Extract the previous day's open-chain data;
2. Extract the non-deleted records from the source table;
3. For records with the same PK that have changed, close the old record's chain and open a new one;
4. For PKs that exist only in the source table, open a new chain record;
5. Extract the deleted records from the source table;
6. Where an old open-chain record exists for a deleted PK, close its chain.
> ETL code prototype
To be continued...
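While the code prototype is still to be continued, the six steps can be sketched under the assumption that the source increment carries a 'D' delete-flag field; all table and column names here are invented:

```python
import sqlite3

MAX_DATE = "9999-12-31"  # sentinel end date marking an open chain

def load_increment(cur, data_date, increment):
    """increment: (pk, val, del_flag) rows; del_flag 'D' marks a deletion."""
    cur.execute("DELETE FROM src")
    cur.executemany("INSERT INTO src VALUES (?, ?, ?)", increment)
    # Steps 3, 5-6: close old chains for PKs that changed or were deleted.
    cur.execute("""
        UPDATE zipper SET end_date = ?
        WHERE end_date = ?
          AND EXISTS (SELECT 1 FROM src
                      WHERE src.pk = zipper.pk
                        AND (src.del_flag = 'D' OR src.val <> zipper.val))
    """, (data_date, MAX_DATE))
    # Steps 2-4: open new chains for non-deleted rows with no open chain.
    cur.execute("""
        INSERT INTO zipper
        SELECT pk, val, ?, ? FROM src
        WHERE del_flag <> 'D'
          AND NOT EXISTS (SELECT 1 FROM zipper z
                          WHERE z.end_date = ? AND z.pk = src.pk)
    """, (data_date, MAX_DATE, MAX_DATE))

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE src (pk INTEGER, val TEXT, del_flag TEXT)")
cur.execute("""CREATE TABLE zipper
               (pk INTEGER, val TEXT, start_date TEXT, end_date TEXT)""")

load_increment(cur, "2020-02-01", [(1, "a", ""), (2, "b", "")])
load_increment(cur, "2020-02-02", [(1, "a2", ""), (2, "b", "D")])
print(cur.execute("SELECT * FROM zipper ORDER BY pk, start_date").fetchall())
```

After the second increment, PK 1's old chain is closed and a new one opened for the changed value, while PK 2's chain is closed by the delete flag with no new chain opened.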
8 Full add-delete zipper model
> Application Scenarios
Uses business fields (such as a delete flag) to track the change history of full data that includes deletions.
> Algorithm implementation logic
1. Extract the previous day's open-chain data;
2. Extract the non-deleted records from the source table;
3. For records with the same PK that have changed, close the old record's chain and open a new one;
4. For PKs that exist only in the source table, open a new chain record;
5. Extract the deleted records from the source table;
6. Where an old open-chain record exists for a deleted PK, close its chain;
7. Where a PK has an old open-chain record but no longer exists in the source table, close the old record's chain.
> ETL code prototype
To be continued...
9 Self zipper model
> Application Scenarios
Mainly converts flow (transaction) table data into zipper table data.
> Algorithm implementation logic
Using the source table's business date field, compare it against the target table's open-chain and closed-chain dates and join the records end to end, producing the full history zipper.
> ETL code prototype
To be continued...
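The prototype is still to be continued; one reading of the "end to end" logic is that each record's chain runs from its own business date to the PK's next business date. A pure-Python sketch under that assumption, with invented field names and data:

```python
from collections import defaultdict

MAX_DATE = "9999-12-31"  # sentinel end date marking an open chain

# Flow records: (pk, val, business_date).
flow = [
    (1, "a",  "2020-02-01"),
    (1, "a2", "2020-02-03"),
    (2, "b",  "2020-02-02"),
]

# Group by PK and sort by business date.
by_pk = defaultdict(list)
for pk, val, d in flow:
    by_pk[pk].append((d, val))

# Chain each record's end date to the next record's start date;
# the last record per PK stays open.
zipper = []
for pk, recs in sorted(by_pk.items()):
    recs.sort()
    for (start, val), nxt in zip(recs, recs[1:] + [(MAX_DATE, None)]):
        zipper.append((pk, val, start, nxt[0]))

print(zipper)
```

This reproduces the zipper invariant from section 5: for any date, exactly one chain per PK satisfies `start_date <= d < end_date`.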
10 Other Instructions
1. Following data warehouse best practice, every data table usually also carries some control fields (insert date, update date, update source), so that a warehouse that is sensitive to data changes can trace change history even further;
2. The ETL algorithms exist to serve the data processing workflow. Real implementations are not limited to these traditional algorithms; many customized ETL algorithms are adapted to specific business needs.