1. What is ETL?
ETL (Extract-Transform-Load): in enterprise and industry applications we frequently need to process, transform, and migrate data of many kinds, so it is essential to understand and master at least one ETL tool.
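To make the three phases concrete, here is a minimal, generic sketch in plain Java (not Kettle code); the file names and the upper-casing transform are placeholders chosen purely for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

public class MiniEtl {
    public static void main(String[] args) throws IOException {
        // Extract: read raw rows from a source file (placeholder name)
        List<String> rows = Files.readAllLines(Paths.get("source.csv"));

        // Transform: clean and convert each row (trim + upper-case as a stand-in)
        List<String> transformed = rows.stream()
                .map(String::trim)
                .map(String::toUpperCase)
                .collect(Collectors.toList());

        // Load: write the result to the target file (placeholder name)
        Files.write(Paths.get("target.csv"), transformed);
    }
}
```

An ETL tool such as Kettle does the same extract-transform-load work, but lets you assemble it graphically and handle far more sources, targets, and transformations.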
2. What is Kettle?
Kettle is an open-source ETL tool written in pure Java that runs on Windows, Linux, and Unix. It requires no installation and is efficient and stable at data extraction.
Matt, the project's lead developer, wanted to put all kinds of data into a kettle and have it flow out in a specified format, hence the name.
Kettle is an ETL toolset that lets you manage data from different databases through a graphical environment in which you describe what you want done rather than how to do it.
Kettle has two types of script files: transformations and jobs. A transformation performs the basic data conversion, while a job controls the overall workflow.
Kettle has since been renamed PDI (Pentaho Data Integration).
3. Kettle download
Kettle's official download page: community.hitachivantara.com/s/article/d… . The latest version is 8.2. Downloading from the official site is not recommended: it is very slow and the archive is around 1 GB.
Chinese mirrors of Kettle: version 7.1 at mirror.bit.edu.cn/pentaho/Dat… , version 8.2 at mirror.bit.edu.cn/pentaho/Pen…
4. Kettle Installation
After downloading the archive, decompress it. The Kettle directory layout is as follows:
5. Kettle deployment
After downloading Kettle, configure the Java environment variables.
Because Kettle is developed in Java, running it depends on a Java runtime environment. Install a JDK (version 1.8 is sufficient) and configure the environment variables; the details are easy to find online and are not repeated here. On Windows 10, double-click Spoon.bat to start the tool.
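Before launching Spoon.bat you can quickly confirm which JDK the system will use. The tiny check below is just an illustration (not part of Kettle): it prints the Java version and the JAVA_HOME variable that the launcher will pick up.

```java
public class CheckJavaEnv {
    public static void main(String[] args) {
        // Verify the JDK that the Spoon launcher script will find
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("JAVA_HOME    = " + System.getenv("JAVA_HOME"));
    }
}
```

If the printed version is not 1.8, adjust the environment variables before starting Spoon.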
6. Overview of the Kettle Interface
7. Kettle structure
8. Kettle structure - Spoon and Data Integration Server
Spoon is the tool for building ETL jobs and transformations. It provides a graphical, drag-and-drop designer, and from Spoon you can invoke a dedicated data integration engine or cluster to run what you design.
Data Integration Server is a dedicated ETL server. Its main functions are:
9. Kettle structure - Enterprise Console
Enterprise Console provides a small client for managing a Pentaho Data Integration Enterprise Edition deployment. This includes license management for the Enterprise Edition, monitoring and controlling activity on remote Pentaho Data Integration servers, and analyzing the runtime performance of registered jobs and transformations.
10. Core components of Kettle.
11. Kettle Conceptual model
Kettle execution is organized in two levels: jobs and transformations.
12. Kettle in practice: copy data from a CSV file to an Excel file.
First, create a transformation. In the Core Objects panel, expand the Input category, find the CSV file input step, drag it onto the work area, and double-click it to configure it.
You can rename the step, click Browse to select the CSV file, leave the other parameters at their defaults, click Get Fields, and click OK.
Once the CSV file input is configured, configure the Excel output in the same way:
Next, hold down Shift and drag the mouse to draw a hop connecting the CSV file input to the Excel output.
Finally, open the Excel output step, go to the Fields tab, click Get Fields to map the input fields to the Excel output, and click OK.
Press Ctrl + S to save, then click the Run button.
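Besides clicking Run in Spoon, a saved transformation can also be executed from Java code. The sketch below assumes the PDI core libraries (kettle-core, kettle-engine and their dependencies) are on the classpath and that the transformation was saved as csv_to_excel.ktr (a hypothetical file name); the class names follow the org.pentaho.di API, but treat this as a sketch to adapt, not a definitive recipe.

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class RunCsvToExcel {
    public static void main(String[] args) throws Exception {
        // Initialize the Kettle environment (loads the step plugins, etc.)
        KettleEnvironment.init();

        // Load the transformation that was saved from Spoon (hypothetical file name)
        TransMeta transMeta = new TransMeta("csv_to_excel.ktr");

        // Execute it and wait for all steps to finish
        Trans trans = new Trans(transMeta);
        trans.execute(null);
        trans.waitUntilFinished();

        if (trans.getErrors() > 0) {
            throw new RuntimeException("Transformation finished with errors");
        }
    }
}
```

The same file can also be run from the command line with the Pan tool that ships in the Kettle directory.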
13. Kettle execution result.
14. Kettle can be called visual programming.
1) Kettle can be classified as a visual programming language (VPL), because it lets you define complex ETL programs and workflows in a graphical manner.
2) The diagrams you draw in Kettle are themselves the transformations and jobs.
3) Visual programming has always been a core concept of Kettle; it enables you to build complex ETL jobs quickly and reduces the maintenance workload. It brings the IT world closer to the business world by hiding many technical details.
15. Kettle transformations.
1) A transformation is the most important part of an ETL solution. It handles the extraction, transformation, and loading operations on rows of data.
2) A transformation consists of one or more steps, such as reading a file, filtering rows, cleaning data, or loading data into a database.
3) Steps in a transformation are connected by hops, which define a one-way channel that allows data to flow from one step to another.
4) In Kettle, the unit of data is the row, and the data flow is the movement of rows from one step to another.
5) Data flow is sometimes called record flow.
16. Kettle steps.
A step is a basic building block of a transformation. A step has the following key characteristics:
1) The step must have a name that is unique within the scope of the transformation.
2) Each step reads and writes rows (the only exception is the “Generate rows” step, which only writes data).
3) A step writes rows to one or more connected output hops, which carry them to the step at the other end of the hop.
4) Most steps can have multiple output hops. A step’s data sending can be configured either to distribute, where the target steps receive rows in turn, or to copy, where every row is sent to all target steps.
17. Hops in Kettle (the lines between steps)
1) A hop is the line with an arrow between two steps; hops define the data path between steps.
2) A hop is actually a cache of rows, called a row set, between two steps (the size of the row set can be defined in the transformation settings).
3) When the row set is full, the step writing to it pauses until space becomes available again.
4) When the row set is empty, the step reading from it pauses until more rows become available.
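A row set between two steps behaves like a bounded buffer. The following analogy in plain Java (not Kettle code) uses an ArrayBlockingQueue to show the same back-pressure behaviour: putting into a full queue blocks the writer, taking from an empty queue blocks the reader.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class RowSetAnalogy {
    public static void main(String[] args) throws InterruptedException {
        // A "row set" with room for 5 rows (Kettle's default row set size is much larger)
        BlockingQueue<String> rowSet = new ArrayBlockingQueue<>(5);

        // Writing step: put() blocks as soon as the row set is full
        Thread writer = new Thread(() -> {
            try {
                for (int i = 0; i < 20; i++) {
                    rowSet.put("row-" + i);
                    System.out.println("wrote row-" + i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Reading step: take() blocks as soon as the row set is empty
        Thread reader = new Thread(() -> {
            try {
                for (int i = 0; i < 20; i++) {
                    System.out.println("read  " + rowSet.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        writer.start();
        reader.start();
        writer.join();
        reader.join();
    }
}
```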
18. Kettle data rows: data types
Data moves between steps in the form of rows. A row is a collection of zero or more fields that can hold the following data types:
1) String: character data.
2) Number: double-precision floating-point number.
3) Integer: signed long integer (64-bit).
4) BigNumber: arbitrary-precision number.
5) Date: date and time value with millisecond precision.
6) Boolean: boolean values true and false.
7) Binary: binary fields can contain images, sounds, videos, and other binary data.
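In the PDI Java API these types correspond to ValueMeta classes, and a row of data is simply an Object[]. The sketch below assumes kettle-core is on the classpath and uses class names as found in recent PDI versions; verify the exact names against the version you downloaded.

```java
import java.math.BigDecimal;
import java.util.Date;

import org.pentaho.di.core.row.RowMeta;
import org.pentaho.di.core.row.RowMetaInterface;
import org.pentaho.di.core.row.value.ValueMetaBigNumber;
import org.pentaho.di.core.row.value.ValueMetaBinary;
import org.pentaho.di.core.row.value.ValueMetaBoolean;
import org.pentaho.di.core.row.value.ValueMetaDate;
import org.pentaho.di.core.row.value.ValueMetaInteger;
import org.pentaho.di.core.row.value.ValueMetaNumber;
import org.pentaho.di.core.row.value.ValueMetaString;

public class RowTypesExample {
    public static void main(String[] args) {
        // Row metadata: one field for each Kettle data type
        RowMetaInterface rowMeta = new RowMeta();
        rowMeta.addValueMeta(new ValueMetaString("name"));       // String
        rowMeta.addValueMeta(new ValueMetaNumber("price"));      // Number (double)
        rowMeta.addValueMeta(new ValueMetaInteger("quantity"));  // Integer (64-bit)
        rowMeta.addValueMeta(new ValueMetaBigNumber("total"));   // BigNumber
        rowMeta.addValueMeta(new ValueMetaDate("created"));      // Date
        rowMeta.addValueMeta(new ValueMetaBoolean("active"));    // Boolean
        rowMeta.addValueMeta(new ValueMetaBinary("photo"));      // Binary

        // The matching row of data, in the same field order
        Object[] row = new Object[] {
            "kettle", 9.99d, 3L, new BigDecimal("29.97"),
            new Date(), Boolean.TRUE, new byte[0]
        };

        System.out.println(rowMeta.size() + " fields, " + row.length + " values");
    }
}
```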
19. Kettle data rows: metadata.
Each step that outputs rows also describes the fields of those rows; this description is the metadata of the row. It usually contains the following information:
1) Name: the field name, which must be unique within the row.
2) Data type: data type of the field.
3) Format: the display mask for the data, for example # for an Integer or 0.00 for a Number with two decimals.
4) Length: the length of a String field or of a BigNumber value.
5) Precision: the decimal precision of a BigNumber value.
6) Currency symbol: the symbol used when formatting currency values, e.g. ¥.
7) Decimal symbol: the decimal separator of numeric data. It differs between locales and is usually a dot (.) or a comma (,).
8) Grouping symbol: the thousands separator of numeric data. It also differs between locales and is generally a dot (.), a comma (,), or an apostrophe (').
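These attributes can also be set programmatically on a single field. The sketch below reuses the PDI classes from the previous example; the setter names (for instance setConversionMask for the format mask) are an assumption to verify against your PDI version.

```java
import org.pentaho.di.core.row.value.ValueMetaNumber;

public class FieldMetadataExample {
    public static void main(String[] args) {
        // A Number field with explicit formatting metadata
        ValueMetaNumber amount = new ValueMetaNumber("amount");
        amount.setLength(10);              // total length
        amount.setPrecision(2);            // digits after the decimal separator
        amount.setConversionMask("0.00");  // display format mask
        amount.setDecimalSymbol(".");      // decimal separator
        amount.setGroupingSymbol(",");     // thousands (grouping) separator
        amount.setCurrencySymbol("¥");     // currency symbol

        System.out.println(amount);        // prints the field description
    }
}
```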
20. Parallelism in Kettle.
The row-set caching mechanism of hops allows each step to run in its own thread, giving maximum concurrency. It also allows data to be processed as a stream with minimal memory consumption. In data warehousing we often handle large amounts of data, so this combination of concurrency and low memory usage is a core requirement of any ETL tool.
For a Kettle transformation it is impossible to define an execution order, because all steps run concurrently. When the transformation starts, all steps start at the same time, read rows from their input hops, and write processed rows to their output hops. A step stops once there is no more data in its input hops, and when every step has stopped, the whole transformation stops. (Note that this start order is distinct from the direction of the data flow.)
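To picture this execution model, here is another plain-Java analogy (not Kettle code): three "steps" connected by bounded queues all start at the same time, and each one stops as soon as its input is exhausted, signalled here by a simple end-of-data marker.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelineAnalogy {
    private static final String END = "<END>"; // end-of-data marker (illustrative only)

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> hop1 = new ArrayBlockingQueue<>(10);
        BlockingQueue<String> hop2 = new ArrayBlockingQueue<>(10);

        // "Input step": produces a few rows, then signals end of data
        Thread input = new Thread(() -> {
            try {
                for (String row : List.of("a", "b", "c")) {
                    hop1.put(row);
                }
                hop1.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // "Transform step": reads from hop1, writes upper-cased rows to hop2
        Thread transform = new Thread(() -> {
            try {
                String row = hop1.take();
                while (!row.equals(END)) {
                    hop2.put(row.toUpperCase());
                    row = hop1.take();
                }
                hop2.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // "Output step": reads from hop2 until its input is exhausted
        Thread output = new Thread(() -> {
            try {
                String row = hop2.take();
                while (!row.equals(END)) {
                    System.out.println("output: " + row);
                    row = hop2.take();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // All "steps" start simultaneously; each stops when its input runs dry
        input.start(); transform.start(); output.start();
        input.join(); transform.join(); output.join();
    }
}
```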
If you want tasks to be executed in a specified order, use the jobs described below!