ETL is the process of extracting data from business systems, cleaning and transforming it, and loading it into the data warehouse. Its purpose is to integrate the scattered, messy, and non-standard data across the enterprise and provide a basis of analysis for enterprise decision-making.

ETL is a crucial link in a BI project. It usually consumes at least one third of the total project time, and the quality of the ETL design bears directly on the success or failure of the BI project.

**ETL design is divided into three parts:** data extraction, data cleaning and transformation, and data loading. We also approach ETL design from these three parts. Data extraction moves data from the various data sources into the Operational Data Store (ODS); some cleaning and transformation can already be done during this step. During extraction, different extraction methods need to be chosen to make the ETL run as efficiently as possible.

Of the three parts of ETL, the most time-consuming is the “T” (transformation and cleaning), which typically accounts for two thirds of the total ETL workload. Data loading is generally done by writing directly into the DW (data warehouse) once the data has been cleaned.

There are many ways to implement ETL; three are commonly used. One is to use ETL tools (such as Oracle OWB, SQL Server 2000 DTS, the SQL Server 2005 SSIS service, Informatica, etc.), one is to implement it in SQL, and the third is to combine ETL tools with SQL.

The first two approaches each have advantages and drawbacks. With tools, an ETL project can be set up quickly; they hide the complex coding work, speed up development and lower the difficulty, but they lack flexibility. The SQL approach is flexible and makes the ETL run more efficiently, but the coding is complex and demands more technical skill. The third approach combines the strengths of the other two and greatly improves both the speed and the efficiency of ETL development.

Data extraction

This part requires a great deal of work during the research stage. You first need to establish how many business systems the data comes from, which DBMS each business system's database server runs, whether manual (hand-maintained) data exists and how large it is, whether unstructured data exists, and so on. Only after this information has been collected can the data extraction be designed.

Handling data sources that use the same DBMS as the one hosting the DW

This type of data source is relatively easy to design for. Under normal circumstances, the DBMS (SQL Server, Oracle) provides a database link feature: establish a direct link between the DW database server and the original business system, and the data can then be accessed directly with SELECT statements, as sketched below.
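For instance, on SQL Server, once a linked server to the business system has been registered, loading the ODS can be a single INSERT … SELECT across the link. This is only a minimal sketch; the linked-server name, database, and table and column names below are illustrative assumptions, not taken from the article:

```sql
-- Minimal sketch: pull a source table into the ODS over a pre-registered linked
-- server named ERP_SRC (e.g. created with sp_addlinkedserver).
-- All object names are illustrative.
INSERT INTO ods.dbo.ods_sales_order (order_id, customer_id, amount, order_time)
SELECT order_id, customer_id, amount, order_time
FROM ERP_SRC.erp_db.dbo.sales_order;  -- four-part name: server.database.schema.table
```

On Oracle the equivalent would be a SELECT from sales_order@erp_link over a database link.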

Handling data sources that use a different DBMS from the DW

For this type of data source, it is also generally possible to establish a database link through ODBC, for example between SQL Server and Oracle. If a database link cannot be established, there are two options. One is to use a tool to export the source data as .txt or .xls files and then import those files into the ODS (see the sketch below). The other is to go through a program interface.
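As an illustration, on SQL Server a tab-delimited .txt export could be pulled into an ODS staging table with BULK INSERT; the file path, terminators and table name below are assumptions made for the sketch:

```sql
-- Minimal sketch: import a tab-delimited text export into an ODS staging table.
-- Path, terminators and table name are illustrative assumptions.
BULK INSERT ods.dbo.stg_supplier
FROM 'D:\etl\export\supplier.txt'
WITH (
    FIELDTERMINATOR = '\t',  -- tab between columns
    ROWTERMINATOR   = '\n',  -- newline between rows
    FIRSTROW        = 2      -- skip the header line
);
```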

For file-type data sources **(.txt, .xls)**, business users can be trained to use database tools to import these files into a specified database, from which the data is then extracted; alternatively, this can be automated with tools.

Incremental update issues

Incremental extraction must be considered for systems with large data volumes. Under normal circumstances, the business system records the time at which each transaction occurs, and this can be used to mark the increment. Before each extraction, first determine the latest timestamp already recorded in the ODS, then fetch from the business system all records newer than that time, as sketched below. This relies on the business system's timestamps; in practice, business systems often have no timestamps at all, or only on some tables.
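A minimal sketch of such a timestamp-based incremental pull, reusing the illustrative linked-server and table names from above and assuming an order_time column in the source:

```sql
-- Minimal sketch of timestamp-based incremental extraction.
-- Object names are illustrative; ERP_SRC is an assumed linked server.
DECLARE @last_load datetime;

-- 1. Find the latest business time already loaded into the ODS.
SELECT @last_load = MAX(order_time) FROM ods.dbo.ods_sales_order;
SET @last_load = ISNULL(@last_load, '19000101');  -- first run: load everything

-- 2. Pull only the source rows newer than that watermark.
INSERT INTO ods.dbo.ods_sales_order (order_id, customer_id, amount, order_time)
SELECT order_id, customer_id, amount, order_time
FROM ERP_SRC.erp_db.dbo.sales_order
WHERE order_time > @last_load;
```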

Data cleaning and transformation

Generally, the data warehouse is divided into ODS and DW. The usual practice is to clean the data on its way from the business system into the ODS, filtering out dirty and incomplete data, and then to transform it from the ODS into the DW, performing the business-rule calculations and aggregations there.

Data cleaning

The task of data cleaning is to filter out data that does not meet requirements and hand the filtered results to the responsible business department, which confirms whether the data should simply be discarded or corrected by the business unit and then re-extracted.

Data that does not meet requirements falls mainly into three categories: incomplete data, erroneous data, and duplicate data. A SQL sketch of typical checks for all three follows the list below.

(1) Incomplete data: this mainly refers to information that should be present but is missing, such as the supplier name, branch name, or the customer's region, or cases where header and detail tables in the business system do not match. Such data is filtered out, the missing content is written into separate Excel files and submitted to the customer, with a deadline for completing it. Once completed, the data is written into the data warehouse.

(2) Erroneous data: these errors arise because the business system is not robust enough and writes input straight into the back-end database without validation, for example numeric fields entered as full-width characters, string data followed by a carriage return, incorrect date formats, or out-of-range dates. This kind of data also needs to be classified. Problems such as full-width characters or invisible characters before and after the data can only be found by writing SQL statements; the customer is then asked to correct them in the business system before re-extraction. Errors such as incorrect date formats or out-of-range dates will cause the ETL run to fail; these need to be picked out of the business system database with SQL, handed to the responsible business department for correction within a set period, and extracted again once corrected.

(3) Duplicate data: for this kind of data, which occurs especially in dimension tables, all fields of the duplicate records are exported for the customer to confirm and sort out.
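A minimal SQL sketch of the three kinds of checks described above, with all table and column names invented for illustration (TRY_CONVERT assumes SQL Server 2012 or later; older versions could use ISDATE/ISNUMERIC instead):

```sql
-- (1) Incomplete data: rows missing key attributes, and detail rows whose
--     header row is missing (main/detail mismatch). Names are illustrative.
SELECT * FROM ods.dbo.ods_order
WHERE supplier_name IS NULL OR branch_name IS NULL OR customer_region IS NULL;

SELECT d.*
FROM ods.dbo.ods_order_detail AS d
LEFT JOIN ods.dbo.ods_order AS o ON o.order_id = d.order_id
WHERE o.order_id IS NULL;

-- (2) Erroneous data: values that will not convert to the expected type,
--     e.g. out-of-range dates or amounts typed with full-width digits.
SELECT * FROM ods.dbo.ods_order
WHERE TRY_CONVERT(date, order_date_text) IS NULL
   OR TRY_CONVERT(decimal(18, 2), amount_text) IS NULL;

-- (3) Duplicate data: business keys that appear more than once in a dimension table.
SELECT supplier_code, COUNT(*) AS cnt
FROM ods.dbo.dim_supplier
GROUP BY supplier_code
HAVING COUNT(*) > 1;
```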

Data cleaning is an iterative process that cannot be finished in a few days; problems can only be discovered and resolved continuously. Whether to filter data out or have it corrected should generally be confirmed with the customer. Filtered-out data should be written to Excel files or to a dedicated filter table; in the early stages of ETL development, the filtered data can be emailed to the business units every day to push them to correct the errors quickly, and it also serves as evidence for validating the data later. When cleaning data, take care not to filter out useful data: verify every filtering rule carefully and have the users confirm it.

Data transformation

The task of data transformation is mainly to convert inconsistent data, convert data granularity, and compute business rules; a SQL sketch of the first two follows the list.

(1) Converting inconsistent data: this is an integration step that unifies the same kind of data from different business systems. For example, the same supplier is coded XX0001 in the settlement system but YY0001 in the CRM; after extraction the two are converted into one unified code.

(2) Converting data granularity: business systems generally store very detailed data, whereas data in the data warehouse is used for analysis and does not need that level of detail. The business system data is therefore typically aggregated to the granularity of the data warehouse.

(3) Calculating business rules: different enterprises have different business rules and different data indicators, and sometimes these indicators cannot be derived by simple addition and subtraction. In that case the indicators need to be computed in the ETL and stored in the data warehouse for analysis.
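A minimal sketch of the first two kinds of transformation, assuming a hypothetical mapping table map_supplier_code and a daily-grain fact table; all object names are illustrative:

```sql
-- (1) Unify supplier codes from different source systems through a mapping table
--     (e.g. settlement code XX0001 and CRM code YY0001 map to one DW code).
-- (2) Aggregate detailed ODS rows to the warehouse grain: one row per supplier per day.
-- All object names are illustrative assumptions.
INSERT INTO dw.dbo.fact_sales_daily (date_key, dw_supplier_code, total_amount, order_count)
SELECT CONVERT(char(8), o.order_time, 112) AS date_key,   -- yyyymmdd
       m.dw_supplier_code,
       SUM(o.amount)  AS total_amount,
       COUNT(*)       AS order_count
FROM ods.dbo.ods_order AS o
JOIN dw.dbo.map_supplier_code AS m
  ON m.source_system = o.source_system
 AND m.source_code   = o.supplier_code
GROUP BY CONVERT(char(8), o.order_time, 112), m.dw_supplier_code;
```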

ETL logs and warning notifications

ETL log

ETL logs fall into three categories:

The first is the execution log, a running record of each step of the ETL run, including the start time of each step and the number of rows of data it affected.

The second is the error log: when a module fails, an error record is written with the time of the error, the module in which it occurred, and the error message.

The third is the overall log, which records whether the ETL run started and finished successfully, with its start and end times. If an ETL tool is used, the logs it generates automatically can also form part of the ETL log.

The purpose of logging is to keep track of how ETL is doing and, if something goes wrong, where it went wrong.
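As a sketch, the execution and error logs can be as simple as one table that every step writes to; the structure below is an illustrative assumption, not a prescribed schema:

```sql
-- Minimal illustrative ETL log table; every step writes one row on success or failure.
CREATE TABLE dw.dbo.etl_log (
    log_id     int IDENTITY(1, 1) PRIMARY KEY,
    job_name   varchar(100),
    step_name  varchar(100),
    start_time datetime,
    end_time   datetime,
    row_count  int,
    status     varchar(20),     -- 'OK' or 'ERROR'
    message    varchar(2000)    -- error text when status = 'ERROR'
);

-- Example entry written after a successful step (values are illustrative).
INSERT INTO dw.dbo.etl_log (job_name, step_name, start_time, end_time, row_count, status)
VALUES ('daily_load', 'load_fact_sales_daily',
        '2024-01-01 02:00:00', '2024-01-01 02:03:12', 15230, 'OK');
```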

Sending warnings

If an ETL error occurs, an error log entry should not only be written but a warning should also be sent to the system administrator. There are many ways to send a warning; the most common is to email the error information to the system administrator so that the errors can be investigated.
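On SQL Server, for example, the notification could be sent through Database Mail with msdb.dbo.sp_send_dbmail, assuming a mail profile has already been set up; the profile name, recipient and message below are illustrative:

```sql
-- Notify the administrator when a step fails.
-- Assumes Database Mail is configured; profile name, recipient and text are illustrative.
EXEC msdb.dbo.sp_send_dbmail
     @profile_name = 'etl_mail_profile',
     @recipients   = 'dba@example.com',
     @subject      = 'ETL failure: load_fact_sales_daily',
     @body         = 'Step load_fact_sales_daily failed; see dw.dbo.etl_log for details.';
```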

ETL is a key part of a BI project and also a long-running process. Only by constantly discovering and solving problems can the ETL run more efficiently and provide accurate, reliable data for the later stages of the BI project.

Afterword

ETL is a key part of a data warehouse system. On a large scale, ETL is a data integration solution; on a small scale, it is just a tool for shovelling data around. Looking back over my work, data migration and transformation tasks were actually not rare, but they were mostly one-off jobs or involved very small amounts of data. In a data warehouse system, however, ETL is raised to a certain theoretical level, unlike the casual use of small tools before. As the name suggests, the process of moving data is divided into three steps, with E, T and L standing for extraction, transformation and loading respectively.

In fact, the ETL process is simply the flow of data from various data sources to various target data stores. In a data warehouse, however, ETL has several distinguishing characteristics:

The first is data synchronization: ETL is not a one-off dump of data but a regular activity that runs on a fixed cycle; some people have even proposed the concept of real-time ETL.

The second is the amount of data, which is usually so large that it’s worth breaking down the flow of data into E, T and L.

There are now quite a few mature tools that provide ETL functionality, of varying quality. From an application point of view, the ETL process is not very complicated, and these tools bring great convenience to data warehouse engineering, especially in development and maintenance. On the other hand, developers can easily get lost in these tools.

For example, VB is a very simple language and a very easy-to-use programming tool, and it is especially fast to pick up, yet how many true VB masters are there? Microsoft usually designs products on the principle of “treating users like idiots”, and under that principle Microsoft products do work very well, but as a developer it is foolish to treat yourself as a fool.

The same goes for ETL tools: they provide graphical interfaces and let us concentrate on the rules, which improves development efficiency. It is true that with these tools a job to process a given set of data can be built very quickly, but overall the result is not necessarily much more efficient.

The problem lies not with the tools but with the designers and developers, who get lost in the tools and never reach the essence of ETL. Tools that have been used for so long, in so many projects and so many environments, must have their merits and must embody the essence of ETL.

If we do not look beyond the superficial use of these tools to the ideas behind them, we end up with a pile of individual jobs that still take an enormous amount of work to stitch together. As the saying goes, “combine theory with practice”: if you want to excel in a field, you must reach a certain height at the theoretical level.