Word is that searching "Java fish" on WeChat will make you stronger!

This article is part of my Java series, which contains my complete collection of Java articles, useful for study or interview prep.

(i) What is DataX

I once worked on a project that required syncing data from SQL Server to MySQL every day. At the time I wrote Java code to do it, which meant writing connection logic for two data sources and two sets of SQL, which was very inconvenient. If Oracle, MySQL, and SQL Server all need to sync with each other, the code logic gets even more complex. Worse, syncing 6 million rows through that code took more than two hours, which is very inefficient.

Recently I came across a new tool called DataX at work and realized that data synchronization could be this simple.

DataX is Alibaba's open-source offline synchronization tool for heterogeneous data sources. It implements efficient data synchronization between heterogeneous data sources including MySQL, Oracle, SQL Server, PostgreSQL, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), Hologres, and DRDS.

Datax can synchronize data from one database to another by configuring a JSON file.

Datax is currently open source on Github: github.com/alibaba/Dat…

(ii) DataX architecture

DataX is built on a Framework + Plugin architecture, in which:

Reader: the data collection module; it collects data from the data source and sends it to the Framework

Writer: the data writing module; it continuously fetches data from the Framework and writes it to the destination

Framework: connects the Reader and Writer, acting as the data transfer channel between them, and handles core technical issues such as buffering, flow control, concurrency, and data conversion.
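To make the division of labor concrete, here is a toy sketch in plain Python (my own illustration, not DataX code) of the Reader → Framework → Writer pipeline, with the Framework modeled as a bounded queue that provides buffering and crude flow control:

```python
from queue import Queue
from threading import Thread

# Toy illustration only, not DataX code: the Framework sits between
# Reader and Writer as a bounded channel.
channel = Queue(maxsize=4)   # bounded buffer = crude flow control
SENTINEL = None              # marks the end of the data stream

def reader(records):
    # Reader: collect data from the source and push it to the Framework
    for record in records:
        channel.put(record)  # blocks when the buffer is full
    channel.put(SENTINEL)

def writer(out):
    # Writer: keep fetching from the Framework and write to the destination
    while True:
        record = channel.get()
        if record is SENTINEL:
            break
        out.append(record)

result = []
t = Thread(target=writer, args=(result,))
t.start()
reader([("1", "javayz"), ("2", "java")])
t.join()
print(result)  # both records arrive at the destination
```

The bounded queue is the key design point: when the Writer falls behind, `channel.put` blocks the Reader, which is the simplest possible form of flow control.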

(iii) How DataX works

Job: the management node of a single job, responsible for cleanup, splitting the job into subtasks, and monitoring TaskGroups

Task: the smallest unit of DataX work, split off from the Job; each Task is responsible for synchronizing a portion of the data

Schedule: assembles Tasks into TaskGroups; each TaskGroup runs 5 Tasks concurrently
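As a quick illustration of the scheduling arithmetic (my own sketch, not DataX source code): if each TaskGroup runs 5 Tasks concurrently, the number of TaskGroups a job needs for a given channel (concurrency) setting is simply the ceiling of the division:

```python
import math

# Sketch only, not DataX source code. Assumes the default of
# 5 concurrent tasks per TaskGroup mentioned above.
TASKS_PER_GROUP = 5

def task_group_count(channel: int) -> int:
    """Number of TaskGroups needed to provide `channel` concurrent tasks."""
    return math.ceil(channel / TASKS_PER_GROUP)

print(task_group_count(20))  # a job configured with 20 channels needs 4 TaskGroups
```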

(iv) DataX quick start

DataX's recommended environment is:

  • Linux
  • JDK (1.8 or above; 1.8 recommended)
  • Python (2.6.x recommended)
  • Apache Maven 3.x (only needed to compile DataX from source)

I will follow this recommended environment below.

First, download DataX. There are two ways to get it: download the prebuilt compressed package directly, or download the source code and compile it yourself. Here I will demonstrate the first way, using the compressed package.

Download the DataX package: datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.g…

Download it and upload it to a Linux server. Unzip it:

tar -zxvf datax.tar.gz

Go to DataX's bin directory and run the self-check job:

cd datax/bin
python datax.py ../job/job.json

If the job completes normally, DataX was installed successfully.

(v) Data synchronization in the DataX console

DataX's job is to transfer data between heterogeneous databases, and it is fairly simple to use: you only need to configure the corresponding JSON template to transmit data.

To get the JSON template for a given reader/writer pair, use the following command:

python datax.py -r streamreader -w streamwriter

You get the corresponding template:

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "column": [],
                        "sliceRecordCount": ""
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "encoding": "",
                        "print": true
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}

Let's do a simple configuration and see what happens. The goal is to print "hello world" ten times in the console. Create a new file in the job directory called stream2stream.json:

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "column": [
                            {
                                "type": "string",
                                "value": "hello"
                            },
                            {
                                "type": "string",
                                "value": "world"
                            }
                        ],
                        "sliceRecordCount": "10"
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "encoding": "UTF-8",
                        "print": true
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}

Run the job:

python datax.py ../job/stream2stream.json

The console prints hello and world ten times, as configured.

(vi) DataX MySQL data synchronization

Next, let's synchronize data between MySQL databases. First, get the MySQL template:

python datax.py -r mysqlreader -w mysqlwriter

A brief introduction to the template parameters:

column: the column names to read (reader) or write (writer)

connection: the connection information (jdbcUrl and table)

where: a filter condition applied when reading, e.g. "where": "id > 100"

Other parameters are described in more detail in the official documentation.

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": [],       # columns to synchronize
                        "connection": [     # connection info
                            {
                                "jdbcUrl": [],
                                "table": []
                            }
                        ],
                        "password": "",     # password
                        "username": "",     # username
                        "where": ""         # filter condition
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": [],       # columns to write, in the same order as the reader's columns
                        "connection": [     # connection info
                            {
                                "jdbcUrl": "",
                                "table": []
                            }
                        ],
                        "password": "",     # password
                        "preSql": [],       # SQL to execute before writing
                        "session": [],      # session variables to set before writing
                        "username": "",     # username
                        "writeMode": ""     # insert into, replace into, or ON DUPLICATE KEY UPDATE
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}

Next, prepare some data: create two tables, user and user2, in MySQL, and insert two rows into user:

CREATE TABLE `user` (
    `id` int(4) NOT NULL AUTO_INCREMENT,
    `name` varchar(32) NOT NULL,
    PRIMARY KEY (`id`)
);
CREATE TABLE `user2` (
    `id` int(4) NOT NULL AUTO_INCREMENT,
    `name` varchar(32) NOT NULL,
    PRIMARY KEY (`id`)
);
INSERT INTO `user` VALUES (1, 'javayz');
INSERT INTO `user` VALUES (2, 'java');

Next, configure mysql2mysql.json:

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": [
                            "id",
                            "name"
                        ],
                        "connection": [
                            {
                                "jdbcUrl": ["jdbc:mysql://10.10.128.120:3306/test"],
                                "table": ["user"]
                            }
                        ],
                        "password": "123456",
                        "username": "root"
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": [
                            "id",
                            "name"
                        ],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://10.10.128.120:3306/test",
                                "table": ["user2"]
                            }
                        ],
                        "password": "123456",
                        "username": "root"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}

Run the script as well:

python datax.py ../job/mysql2mysql.json

After the console reports success, check the database (e.g. `select * from user2;`) and you will see that the data has been synchronized.

(vii) Summary

Beyond what is shown here, DataX supports a large number of databases, and its documentation is very detailed. Give it a try yourself; it's a very handy tool.

I’m fish boy. See you next time!