Compared with the current popularity of big data technology, you may feel that there are too few Kettle usage scenarios, or there is no need to use such a device. After checking the Github Kettle, you find that there are some recent updates. In addition, for data users without programming experience, a very simple Kettle is used. Through the graphic interface design to achieve what business, do not need to write code to achieve, you can do some experiments, such as: crawl the website stock data, foreign exchange information and so on.
Kettle supports a variety of input and output formats, including text files, data tables, and database engines. In summary, Kettle’s powerful input, output, and conversion functions make it very easy to manipulate data.
Common input steps
File entry steps
Common text file input steps include CSV file input, Excel file input, and text file input.
The HelloWorld level function, “copying data from CSV files to Excel files,” was described in previous articles. See the steps for more details.
You can select all files in the same directory by selecting the directory and wildcard number wildcard files, or you can choose whether to read files in the subdirectories of the current directory, as shown below:
XML input Steps
XML is an extensible Markup language, which is mainly used to transmit and store data. In some traditional systems, data transmission and docking are still carried out in this way. By means of the input step of “Get data from XML”, data information in XML files is obtained. Using xpath to determine the location of a portion of data in an XML document, xpath provides the ability to find nodes in a tree of data structures based on XML.
The sample
Follow the wizard to complete an example of reading property configuration information from a POM file, as shown below:
Common expressions in xpath
expression | instructions |
---|---|
nodename | Selects all children of this node |
/ | From the root node |
// | Nodes in the document are selected from the current node selected by the match, regardless of their location |
. | Select the current node |
. | Selects the parent of the current node |
@ | Select properties |
div | Selects all the children of the div element |
/div | Select the root element div |
div/p | Select the child p under the div element |
//div | Select all div elements |
div//p | Selects all p elements under the div element |
//@lang | Select all the properties named lang |
JSON input steps
Compared to XML, it is a lightweight data interchange format. JSON core concepts: arrays (data in []), objects (data in {}), attributes (data in k:v)
To implement the function of “call RESTful interface to import JSON results into the library”, whether using Java or Python to call RESTful interface to import JSON results into the library, there is a certain complexity. First, you need to load third-party REST component dependencies, and then connect to the database. Write the SQL statement and finally insert it into the target database. However, with the Kettle tool, Spoon can easily call the interface and import the database using the graphical interface.
Through a simple GET request, obtain “Shanghai and Shenzhen ipo notice” information, data situation and operation are as follows
The data sample
Interface information
Xueqiu.com/service/v5/…
Conversion steps:
The output
Configuration Procedure
Table entry step
Adding a database driver
If you want to manipulate the database, you need to download the database driver and place it in /data-integration/lib. The default driver includes the following:
(base) ~ / Documents/apps/data - integration/lib ls - al | grep SQL - rw - r - r - @ 1 Yezhiwei staff. 1473091 Jun 11 2019 Hsqldb-2.3.2. jar-rw-r --r-- @1 Yezhiwei Staff 825943 Jun 11 2019 Postgresql-42.2.5.jar-rw-r --r-- @1 Yezhiwei Staff 3201133 Jun 11 2019 SQLite-jDBC-3.7.2.jarCopy the code
Add MySQL driver jar package mysql-connector-java-5.1.41.jar to /data-integration/lib:
Configuring database Connections
Get table SQL query statement, data preview
Select the database connection configuration, then click “Get SQL Query Statement”
Select the table, complete the Settings as prompted, automatically generate SQL, “preview” can view the data.
Common output steps
After data is processed, it is usually saved to a place for future use. The output is usually Excel or a database. Excel output has been described in previous examples.
The common database output steps are Insert, update, and delete. The examples focus on the Insert, update, and Delete output steps
Table output
Outputs data from other input sources to the table
update
Update refers to comparing existing records in the database with those in the data stream and updating them if they are different
Insert the updated
To insert an update is to insert redundant data from the data stream on the basis of the update
delete
Delete can be used with custom constant input to define a constant condition and delete all data that does not meet this condition
The sample
Data preparation
Data is inserted or updated from Excel to MySQL
Result of new data insertion
Update the results
conclusion
ETL is an indispensable tool in data warehouse technology. As a classic ETL tool, Kettle can be used to implement services through GRAPHICAL interface design without writing codes. It is also very simple for data users without programming experience.
Of course, there are also some problems, such as: the amount of data processing is relatively small, relatively few documents and so on. Have the opportunity to summarize the problems encountered in the process of use. Welcome to follow the public account HelloTech for more content