Author: Roast chicken prince

Source: Hang Seng LIGHT Cloud Community

Following the introduction and installation of DataX, this article walks through a working example and collects some practical considerations that come up in real use.

Running DataX

Run the command

```
python ${DATAX_HOME}/bin/datax.py -r <reader plugin> -w <writer plugin>

# Example:
python ${DATAX_HOME}/bin/datax.py -r mysqlreader -w oraclewriter

# Run a job from a configuration file:
python ${DATAX_HOME}/bin/datax.py <json config file>
```

Example (transferring data from PostgreSQL to MySQL)

(1) View the template

```
python datax.py -r postgresqlreader -w mysqlwriter

DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.

Please refer to the postgresqlreader document:
    https://github.com/alibaba/DataX/blob/master/postgresqlreader/doc/postgresqlreader.md

Please refer to the mysqlwriter document:
    https://github.com/alibaba/DataX/blob/master/mysqlwriter/doc/mysqlwriter.md

Please save the following configuration as a json file and use
    python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json
to run the job.

{
    "job": {
        "content": [
            {
                "reader": {                      // source database configuration
                    "name": "postgresqlreader",  // reader plugin name
                    "parameter": {               // connection details
                        "column": [],
                        "connection": [
                            {
                                "jdbcUrl": [],
                                "table": []
                            }
                        ],
                        "password": "",
                        "username": "",
                        "where": ""              // row filter condition
                    }
                },
                "writer": {                      // target database configuration
                    "name": "mysqlwriter",       // writer plugin name
                    "parameter": {               // connection details
                        "column": [],
                        "connection": [
                            {
                                "jdbcUrl": "",
                                "table": []
                            }
                        ],
                        "password": "",
                        "postSql": [],           // SQL executed after the job
                        "preSql": [],            // SQL executed before the job
                        "username": ""
                    }
                }
            }
        ],
        "setting": {                             // job-level settings
            "speed": {                           // flow control: channel (concurrency),
                "channel": ""                    // record stream, and byte stream modes
            }
        }
        // dirty-data control and other options can also be configured here
    }
}
```

(2) Edit the configuration based on the template

```
{
    "job": {
        "setting": {
            "speed": {
                "channel": 3,
                "byte": 1048576
            },
            "errorLimit": {
                "record": 0,
                "percentage": 0.02
            }
        },
        "content": [
            {
                "reader": {
                    "name": "postgresqlreader",
                    "parameter": {
                        "username": "authenticator",
                        "password": "financegtn104",
                        "column": [
                            "\"id\"",
                            "\"name\"",
                            "\"content_type_id\"",
                            "\"codename\""
                        ],
                        "splitPk": "",
                        "connection": [
                            {
                                "table": [
                                    "public.auth_permission"
                                ],
                                "jdbcUrl": [
                                    "jdbc:postgresql://10.20.64.241:5433/finance"
                                ]
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "username": "root",
                        "password": "hundsun1234",
                        "column": [
                            "`id`",
                            "`name`",
                            "`content_type_id`",
                            "`codename`"
                        ],
                        "connection": [
                            {
                                "table": [
                                    "auth_permission"
                                ],
                                "jdbcUrl": "jdbc:mysql://10.20.64.151:3306/datax_test"
                            }
                        ]
                    }
                }
            }
        ]
    }
}
```
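Before submitting the job, a quick local sanity check catches the two most common configuration mistakes: invalid JSON and a reader/writer column-count mismatch (DataX maps columns by position). A minimal sketch; the `check_job` helper and the skeleton job below are illustrative, not part of DataX:

```python
import json

def check_job(job_json: str) -> int:
    """Sanity-check a DataX-style job config: valid JSON, matching column counts."""
    cfg = json.loads(job_json)  # fails fast on JSON syntax errors
    content = cfg["job"]["content"][0]
    reader_cols = content["reader"]["parameter"]["column"]
    writer_cols = content["writer"]["parameter"]["column"]
    # Columns are mapped reader->writer by position, so the lists must align.
    if len(reader_cols) != len(writer_cols):
        raise ValueError("reader/writer column count mismatch")
    return len(reader_cols)

# Minimal skeleton for illustration only:
job = """
{"job": {"content": [{
  "reader": {"parameter": {"column": ["\\"id\\"", "\\"name\\""]}},
  "writer": {"parameter": {"column": ["`id`", "`name`"]}}
}]}}
"""
print(check_job(job))  # → 2
```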

(3) Execute the job

python ./bin/datax.py ./job/mysql2pgsql.json
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
2020-08-05 13:01:22.518 [main] INFO VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2020-08-05 13:01:22.524 [main] INFO Engine - the machine info =>
  osInfo: Oracle Corporation 1.8 25.181-b13
  jvmInfo:       Linux amd64 3.10.0-862.11.6.el7.x86_64
  cpu num:       8
  totalPhysicalMemory:   -0.00G
  freePhysicalMemory:     -0.00G
  maxFileDescriptorCount: -1
  currentOpenFileDescriptorCount: -1
  GC Names       [PS MarkSweep, PS Scavenge]
  MEMORY_NAME                   | allocation_size               | init_size              
  PS Eden Space                 | 256.00MB                       | 256.00MB               
  Code Cache                     | 240.00MB                       | 2.44MB                 
  Compressed Class Space         | 1,024.00MB                     | 0.00MB                 
  PS Survivor Space             | 42.50MB                       | 42.50MB                
  PS Old Gen                     | 683.00MB                       | 683.00MB               
  Metaspace                     | -0.00MB                       | 0.00MB                 
2020-08-05 13:01:22.541 [main] INFO Engine -
{
  "content":[
          {
                  "reader":{
                          "name":"postgresqlreader",
                          "parameter":{
                                  "column":[
                                          ""id"",
                                          ""name"",
                                          ""content_type_id"",
                                          ""codename""
                                  ],
                                  "connection":[
                                          {
                                                  "jdbcUrl":[
                                                          "jdbc:postgresql://10.20.64.241:5433/finance"
                                                  ],
                                                  "table":[
                                                          "public.auth_permission"
                                                  ]
                                          }
                                  ],
                                  "password":"*************",
                                  "splitPk":"",
                                  "username":"authenticator"
                          }
                  },
                  "writer":{
                          "name":"mysqlwriter",
                          "parameter":{
                                  "column":[
                                          "`id`",
                                          "`name`",
                                          "`content_type_id`",
                                          "`codename`"
                                  ],
                                  "connection":[
                                          {
                                                  "jdbcUrl":"jdbc:mysql://10.20.64.151:3306/datax_test",
                                                  "table":[
                                                          "auth_permission"
                                                  ]
                                          }
                                  ],
                                  "password":"***********",
                                  "username":"root"
                          }
                  }
          }
  ],
  "setting":{
          "errorLimit":{
                  "percentage":0.02,
                  "record":0
          },
          "speed":{
                  "byte":1048576,
                  "channel":3
          }
  }
}
2020-08-05 13:01:22.557 [main] WARN Engine - prioriy set to 0, because NumberFormatException, the value is: null
2020-08-05 13:01:22.559 [main] INFO PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2020-08-05 13:01:22.559 [main] INFO JobContainer - DataX jobContainer starts job.
2020-08-05 13:01:22.560 [main] INFO JobContainer - Set jobId = 0
2020-08-05 13:01:22.676 [job-0] INFO OriginalConfPretreatmentUtil - Available jdbcUrl:jdbc:postgresql://10.20.64.241:5433/finance.
2020-08-05 13:01:22.704 [job-0] INFO OriginalConfPretreatmentUtil - table:[public.auth_permission] has columns:[id,name,content_type_id,codename].
2020-08-05 13:01:22.917 [job-0] INFO OriginalConfPretreatmentUtil - table:[auth_permission] all columns:[
id,name,content_type_id,codename
].
2020-08-05 13:01:22.924 [job-0] INFO OriginalConfPretreatmentUtil - Write data [
INSERT INTO %s (`id`,`name`,`content_type_id`,`codename`) VALUES(?,?,?,?)
], which jdbcUrl like:[jdbc:mysql://10.20.64.151:3306/datax_test?yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true]
2020-08-05 13:01:22.925 [job-0] INFO JobContainer - jobContainer starts to do prepare ...
2020-08-05 13:01:22.925 [job-0] INFO JobContainer - DataX Reader.Job [postgresqlreader] do prepare work .
2020-08-05 13:01:22.926 [job-0] INFO JobContainer - DataX Writer.Job [mysqlwriter] do prepare work .
2020-08-05 13:01:22.926 [job-0] INFO JobContainer - jobContainer starts to do split ...
2020-08-05 13:01:22.927 [job-0] INFO JobContainer - Job set Max-Byte-Speed to 1048576 bytes.
2020-08-05 13:01:22.929 [job-0] INFO JobContainer - DataX Reader.Job [postgresqlreader] splits to [1] tasks.
2020-08-05 13:01:22.930 [job-0] INFO JobContainer - DataX Writer.Job [mysqlwriter] splits to [1] tasks.
2020-08-05 13:01:22.944 [job-0] INFO JobContainer - jobContainer starts to do schedule ...
2020-08-05 13:01:22.947 [job-0] INFO JobContainer - Scheduler starts [1] taskGroups.
2020-08-05 13:01:22.948 [job-0] INFO JobContainer - Running by standalone Mode.
2020-08-05 13:01:22.955 [taskGroup-0] INFO TaskGroupContainer - taskGroupId=[0] start [1] channels for [1] tasks.
2020-08-05 13:01:22.958 [taskGroup-0] INFO Channel - Channel set byte_speed_limit to -1, No bps activated.
2020-08-05 13:01:22.958 [taskGroup-0] INFO Channel - Channel set record_speed_limit to -1, No tps activated.
2020-08-05 13:01:22.964 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
2020-08-05 13:01:22.967 [0-0-0-reader] INFO CommonRdbmsReader$Task - Begin to read record by Sql: [select "id","name","content_type_id","codename" from public.auth_permission
] jdbcUrl:[jdbc:postgresql://10.20.64.241:5433/finance].
2020-08-05 13:01:23.063 [0-0-0-reader] INFO CommonRdbmsReader$Task - Finished read record by Sql: [select "id","name","content_type_id","codename" from public.auth_permission
] jdbcUrl:[jdbc:postgresql://10.20.64.241:5433/finance].
2020-08-05 13:01:23.365 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[402]ms
2020-08-05 13:01:23.366 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] completed it's tasks.
2020-08-05 13:01:32.966 [job-0] INFO StandAloneJobContainerCommunicator - Total 24 records, 803 bytes | Speed 80B/s, 2 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 100.00%
2020-08-05 13:01:32.966 [job-0] INFO AbstractScheduler - Scheduler accomplished all tasks.
2020-08-05 13:01:32.967 [job-0] INFO JobContainer - DataX Writer.Job [mysqlwriter] do post work.
2020-08-05 13:01:32.967 [job-0] INFO JobContainer - DataX Reader.Job [postgresqlreader] do post work.
2020-08-05 13:01:32.967 [job-0] INFO JobContainer - DataX jobId [0] completed successfully.
2020-08-05 13:01:32.968 [job-0] INFO HookInvoker - No hook invoked, because base dir not exists or is a file: /home/dataX/datax/hook
2020-08-05 13:01:32.968 [job-0] INFO JobContainer -
    [total cpu info] =>
          averageCpu                     | maxDeltaCpu                   | minDeltaCpu            
          -1.00%                         | -1.00%                         | -1.00%
    [total gc info] =>
            NAME                 | totalGCCount       | maxDeltaGCCount   | minDeltaGCCount   | totalGCTime       | maxDeltaGCTime     | minDeltaGCTime   
            PS MarkSweep         | 0                 | 0                 | 0                 | 0.000s             | 0.000s             | 0.000s     
            PS Scavenge         | 0                 | 0                 | 0                 | 0.000s             | 0.000s             | 0.000s     
2020-08-05 13:01:32.969 [job-0] INFO JobContainer - PerfTrace not enable!
2020-08-05 13:01:32.969 [job-0] INFO StandAloneJobContainerCommunicator - Total 24 records, 803 bytes | Speed 80B/s, 2 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 100.00%
2020-08-05 13:01:32.969 [job-0] INFO JobContainer -
Job start time            : 2020-08-05 13:01:22
Job end time              : 2020-08-05 13:01:32
Total elapsed time        :                 10s
Average traffic           :               80B/s
Record write speed        :              2rec/s
Total records read        :                  24
Total read/write failures :                   0
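As a quick sanity check, the rates in the summary follow directly from the totals over the 10-second statistics window:

```python
# Figures taken from the job summary above.
records, total_bytes, seconds = 24, 803, 10

# Integer division reproduces the reported averages.
print(total_bytes // seconds, "B/s")   # → 80 B/s
print(records // seconds, "rec/s")     # → 2 rec/s
```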

DataX-Web operation

Too much hand-editing? Is there a way to generate configuration files graphically, as in Kettle, and to manage tasks in one place?

DataX-Web is a visual web UI integrated with DataX. You can select data sources and generate data synchronization tasks with one click. It supports batch creation of RDBMS synchronization tasks, integrates an open-source scheduling system, and supports distributed and incremental synchronization, real-time run-log queries, executor resource monitoring, and killing running processes. Compared with Kettle, DataX-Web's one-click task generation on the page makes setting up synchronization noticeably faster.

(1) Environment

• MySQL (5.5+): required; the corresponding client is optional.
• JDK (1.8.0_xxx): required.
• DataX: required.
• Python (2.x): required. To use Python 3, replace the three Python files under datax/bin with the ones provided in datax-web's datax-python3 directory. By default DataX is launched as a Java subprocess; users who need different behavior can customize the Python launcher.

(2) Installation

Option 1: compile from source.
Option 2: use the officially compiled package. After decompressing it, enter the extracted directory, locate install.sh under the bin directory, and for an interactive installation simply run ./bin/install.sh.

Note:

The database initialization script on the official website creates the job_info table without the user_id field, so clicking into the task list fails with an error that the user_id column cannot be found. Add the field manually (type int, length 11):

```
ALTER TABLE job_info ADD user_id int(11);
```

For a detailed tutorial, see: github.com/WeiYe-Jing/…

Installation Package Description

Finally, open ip:port in your browser to reach the DataX-Web page.

DataX versus Kettle

1) Kettle ships with its own management console, and ETL tasks can be configured and run directly on the client. Kettle uses a CS (client/server) architecture and does not support BS (browser) mode. DataX has no interface of its own (the DataX-Web extension adds BS support).

2) Both tools support a fairly complete range of databases, with Kettle supporting more. DataX, developed by Alibaba, has better support for Alibaba's own storage systems, such as ODPS and ADS.

3) Kettle has joined the BI vendor Pentaho; since then, its development pace and visibility have improved further.

4) DataX's open-source maintenance is less fine-grained, it receives far less attention than Kettle, and it has far fewer code commits.

5) According to benchmark reports available online, full extraction of large data volumes takes considerably longer with Kettle; DataX is faster in the same scenario.

Other

  • The DataX plugin ecosystem also offers extensions that you can download and compile yourself, for example: an ElasticSearch reader, incremental writes for PostgreSQL (writeMode=update), a Redis writer, and more.

  • The PostgreSQL writer does not support incremental (update) writes out of the box. A workaround is to load the changed incremental data into a temporary table (in the same database as the target table), delete the modified rows from the target table by joining against the temporary table, and then insert the new data directly.
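The temp-table workaround above can be sketched generically. The snippet uses SQLite purely for illustration (the table names and rows are made up), but the two statements at its core are the same DELETE-then-INSERT you would run on PostgreSQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Target table, plus a temp table holding the changed/new rows (same schema).
cur.execute("CREATE TABLE auth_permission (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE auth_permission_tmp (id INTEGER PRIMARY KEY, name TEXT)")
cur.executemany("INSERT INTO auth_permission VALUES (?, ?)",
                [(1, "old-view"), (2, "edit")])
cur.executemany("INSERT INTO auth_permission_tmp VALUES (?, ?)",
                [(1, "view"), (3, "delete")])  # id=1 was modified, id=3 is new

# Step 1: delete target rows that were modified (i.e. present in the temp table).
cur.execute("DELETE FROM auth_permission "
            "WHERE id IN (SELECT id FROM auth_permission_tmp)")
# Step 2: insert everything from the temp table (updated rows + brand-new rows).
cur.execute("INSERT INTO auth_permission SELECT * FROM auth_permission_tmp")
conn.commit()

print(sorted(cur.execute("SELECT id, name FROM auth_permission").fetchall()))
# → [(1, 'view'), (2, 'edit'), (3, 'delete')]
```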


Want to learn from more technology experts? Looking for a place to discuss problems you hit during development, or for a wealth of fintech resources?

Hang Seng LIGHT Cloud Community, a professional fintech community platform built by Hang Seng Electronics, shares practical technical content, resources, and fintech industry trends, and welcomes all financial developers.