This article grew out of a common requirement within the company: synchronizing data from the production environment to the test environment to make testing and bug locating easier.
I started with Navicat Premium, a database management tool powerful enough to meet developers' daily needs, including the data synchronization requirement mentioned above.
As the company's business grew, more stringent requirements were put forward for data security.
In practice, synchronizing data from any other environment into the production environment is simply not allowed; in other words, production data must not be tampered with. Navicat Premium's data transfer feature, however, can move data from one database to another without any restrictions.
In addition, to prevent developers from having direct access to the production database, the author took back the high-privilege accounts and granted only read-only rights, preserving the ability to view data. As a result, the burden of data synchronization fell on the author.
To find a more convenient and stable way to handle data synchronization, the author turned to ETL tools.
If the goal were only to fulfill the functional requirement of data synchronization, bringing in the whole concept of ETL would be a bit of an overkill. Considering future data processing requirements, however, it is well worth investing in an ETL tool (by the time of writing, a requirement for exporting to Excel had already emerged).
In theory, everyday internal ETL requirements can be implemented in a “code + script” manner. In the author's opinion, however, reinventing the wheel like this is pointless, time-consuming, and exhausting. Mastering an ETL tool instead will undoubtedly save a great deal of work.
Of course, there are many ETL tools on the market, with finished products from both domestic and foreign vendors. But applying the criteria of open source, free, powerful, and easy to use, only Kettle is left.
Before using Kettle, you need some basic knowledge of it. To put it bluntly, handing it to a non-technical PM will only end in swearing. You do not need to learn Kettle exhaustively, but you should know what it does and what it can accomplish so that you can make full use of it later. I recommend reading this article first. An ETL process is essentially a workflow. Here is the data synchronization flow to be implemented this time:
The client fills in the settings of the two databases (host, user, password, and database name), and these settings are passed into Kettle in the form of variables.
Since the specified database may contain multiple tables, Kettle iterates over them, repeating the process of fetching data, clearing the target table, and submitting the data for each table.
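To make this flow concrete, here is a rough, hand-rolled sketch of the same per-table loop in plain JDBC. The connection details and database name are placeholders, and the row-by-row copy with no batching, error handling, or statement cleanup is a simplification for illustration only; the Kettle job built below implements essentially this loop, just declaratively.

```java
import java.sql.*;
import java.util.Collections;

public class NaiveSync {
    public static void main(String[] args) throws SQLException {
        // Connection settings would normally come in as parameters/variables
        try (Connection source = DriverManager.getConnection(
                     "jdbc:mysql://prod-host:3306/app", "readonly", "***");
             Connection target = DriverManager.getConnection(
                     "jdbc:mysql://test-host:3306/app", "rw", "***")) {

            // 1. Get all table names of the specified database
            try (ResultSet tables = source.createStatement().executeQuery(
                    "SELECT table_name FROM information_schema.tables "
                    + "WHERE table_schema = 'app'")) {
                while (tables.next()) {
                    String table = tables.getString(1);

                    // 2. Clear the table in the test database
                    target.createStatement().executeUpdate("TRUNCATE TABLE " + table);

                    // 3. Copy the rows from production to test (row by row)
                    try (ResultSet rows = source.createStatement()
                            .executeQuery("SELECT * FROM " + table)) {
                        int cols = rows.getMetaData().getColumnCount();
                        String placeholders = String.join(",",
                                Collections.nCopies(cols, "?"));
                        try (PreparedStatement insert = target.prepareStatement(
                                "INSERT INTO " + table + " VALUES (" + placeholders + ")")) {
                            while (rows.next()) {
                                for (int i = 1; i <= cols; i++) {
                                    insert.setObject(i, rows.getObject(i));
                                }
                                insert.executeUpdate();
                            }
                        }
                    }
                }
            }
        }
    }
}
```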
Of course, there are some details between these steps. In the following sections, I will explain how to set up the database synchronization project with Kettle.
In this project, jobs and transformations are nested inside one another. Following the principle of “top-down design, bottom-up implementation”, we first build the several sub-processes and then chain them together.
Create a new transformation and save it as “Submit data.ktr”.
Prepare two database connections: Main object tree -> right-click DB connections -> New.
Fill in the host, database name, user, and password fields as variables, in the form ${variable}.
Note: even the password can be filled in this form; Kettle will still recognize it as a variable.
Because these two DB connections will be used by other jobs and transformations, set them as shared.
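For reference only, the same variable-driven connection can also be described with Kettle's Java API. This is just a sketch to show that the connection fields hold ${...} placeholders that are resolved from variables at run time; the variable names here are examples, not anything Kettle prescribes.

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.core.database.DatabaseMeta;

public class SharedConnectionSketch {
    public static void main(String[] args) throws Exception {
        // Initialize the Kettle environment (loads plugins, database types, etc.)
        KettleEnvironment.init();

        // Connection fields are stored with ${...} placeholders; Kettle substitutes
        // the actual values from variables when the job/transformation runs.
        DatabaseMeta source = new DatabaseMeta(
                "source",               // connection name
                "MySQL",                // database type
                "Native",               // access type (native JDBC)
                "${SOURCE_HOST}",
                "${SOURCE_DATABASE}",
                "${SOURCE_PORT}",
                "${SOURCE_USER}",
                "${SOURCE_PASSWORD}");

        // The placeholders remain in the URL until the variables are defined at run time
        System.out.println(source.getURL());
    }
}
```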
Create a new transformation and save it as “Data synchronization.ktr”.
A blocking step is used here because the table-clearing operation has to complete first; that is, T3 (submit data) must not proceed until T1 (clear table) has finished.
Core objects -> Flow -> Block this step until steps finish.
Next is the transformation for getting the table names:
Create a new transformation and save it as “Get table names.ktr”.
T1: get the table names. Core objects -> Input -> Table input; point it at the source connection and use a query along the lines of SELECT table_name FROM information_schema.tables WHERE table_schema = 'your_database' to list every table in the specified database.
T2: select the field. Core objects -> Transform -> Select values. Here we specify the fields we want to keep: obviously we need the table_name field, and we can also rename it, in this case to tablename.
Hold down the Shift key to drag hops between the steps, and the transformation for getting the table names is complete.
An intermediate transformation is needed to take the tablename from the previous result and set it as a variable, so that ${TABLENAME} can be referenced when the data is submitted.
Create a new transformation and save it as “Get variable.ktr”.
T1: get the tablename value. Core objects -> Job -> Get rows from result; specify the tablename field, renaming it if necessary. A Set variables step (Core objects -> Job -> Set variables) then writes the value into the TABLENAME variable so that ${TABLENAME} is available downstream.
Create a Job and save it as “Get variable.kjb” (jobs use the .kjb extension), then add a START entry and the two transformations (Get variable.ktr and Data synchronization.ktr) to the canvas.
Next comes the last Job. Create a new Job, save it as “entryPoint.kjb”, and add a START entry, a transformation, a job, and a Success node.
Get variable – data synchronization
On the job entry, enable the “Execute for every input row” option so that the sub-job runs once for each table name passed in.
At this point, the whole data synchronization project is set up; running entryPoint.kjb executes it. When running, you need to supply the database-related parameters.
This is still not convenient enough: you have to type in all of these parameters every time you run the job, and it is easy to make a mistake.
The power of Kettle is that it also provides a Java API, which can be used to build a higher-level abstraction on top of these jobs and further reduce the cost of operating them.
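As a small taste of what the next chapter covers, a minimal sketch of driving entryPoint.kjb through that API could look like the following. The file path and variable names are assumptions and have to match whatever your shared connections actually reference.

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.job.Job;
import org.pentaho.di.job.JobMeta;

public class SyncRunner {
    public static void main(String[] args) throws Exception {
        // Initialize the Kettle environment
        KettleEnvironment.init();

        // Load the entry-point job built above (path is an assumption)
        JobMeta jobMeta = new JobMeta("/path/to/entryPoint.kjb", null);
        Job job = new Job(null, jobMeta);

        // Pass the database settings as variables instead of typing them by hand;
        // the variable names are examples matching the ${...} placeholders used earlier
        job.setVariable("SOURCE_HOST", "prod-db.example.com");
        job.setVariable("SOURCE_DATABASE", "app");
        job.setVariable("SOURCE_USER", "readonly");
        job.setVariable("SOURCE_PASSWORD", "secret");
        job.setVariable("TARGET_HOST", "test-db.example.com");
        // ... the target database, user, and password are set the same way

        // Run the job and wait for it to finish
        job.start();
        job.waitUntilFinished();

        if (job.getResult().getNrErrors() > 0) {
            System.err.println("Data synchronization failed.");
        }
    }
}
```

Wrapped behind a small service or command-line tool, something like this removes the need to enter the database parameters by hand for every run.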
The next chapter explains how to implement a Kettle-based data synchronization system.