Author: Roast chicken prince
Source: Hang Seng LIGHT Cloud Community
Following the earlier posts on introducing and installing DataX, this article uses a worked example to illustrate how DataX operates, and also collects some points to watch out for in practical use.
Running DataX
Run the command
python ${DATAX_HOME}/bin/datax.py -r {reader plugin type} -w {writer plugin type}   # print a job template
python ${DATAX_HOME}/bin/datax.py -r mysqlreader -w oraclewriter                    # example
python ${DATAX_HOME}/bin/datax.py {json configuration file}                         # run a job
Example (transferring data from PostgreSQL to MySQL)
(1) View the template
python datax.py -r postgresqlreader -w mysqlwriter

DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.

Please refer to the postgresqlreader document:
    https://github.com/alibaba/DataX/blob/master/postgresqlreader/doc/postgresqlreader.md

Please refer to the mysqlwriter document:
    https://github.com/alibaba/DataX/blob/master/mysqlwriter/doc/mysqlwriter.md

Please save the following configuration as a json file and use
    python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json
to run the job.

{
    "job": {
        "content": [
            {
                "reader": {                         // reader configuration
                    "name": "postgresqlreader",     // reader plugin name
                    "parameter": {                  // data source settings
                        "column": [],
                        "connection": [
                            {
                                "jdbcUrl": [],
                                "table": []
                            }
                        ],
                        "password": "",
                        "username": "",
                        "where": ""                 // data filtering condition
                    }
                },
                "writer": {                         // writer configuration
                    "name": "mysqlwriter",          // writer plugin name
                    "parameter": {                  // target database settings
                        "column": [],
                        "connection": [
                            {
                                "jdbcUrl": "",
                                "table": []
                            }
                        ],
                        "password": "",
                        "postSql": [],              // SQL to run after the job
                        "preSql": [],               // SQL to run before the job
                        "username": ""
                    }
                }
            }
        ],
        "setting": {                                // base settings
            "speed": {                              // flow control: channel (concurrency), record, and byte modes are available
                "channel": ""                       // number of concurrent channels
            }
        }                                           // dirty-data control and other settings can also be configured
    }
}
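To avoid copying the template by hand, the output can be redirected straight into a file; a minimal sketch (the path and file name here are just examples):

python ${DATAX_HOME}/bin/datax.py -r postgresqlreader -w mysqlwriter > job/pg2mysql.json

Note that the banner and hint lines printed above the JSON land in the file as well, and must be deleted before the file is valid JSON.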
(2) Edit the configuration based on the template
{ "job": { "setting": { "speed": { "channel": 3, "byte": 1048576 }, "errorLimit": { "record": 0, "percentage": 0.02}}, "content": [{"reader": {"name": "postgresqlReader ", "parameter": {"username": "authenticator", "password": "financegtn104", "column": [ ""id"", ""name"", ""content_type_id"", ""codename"" ], "splitPk": "", "connection": [{ "table": [ "public.auth_permission" ], "jdbcUrl": [" JDBC: postgresql: / / 10.20.64.241:5433 / finance "]}}}], "writer" : {" name ":" mysqlwriter ", "parameter" : {" username ": "root", "password": "hundsun1234", "column": [ "`id`", "`name`", "`content_type_id`", "`codename`" ], "connection": [{" table ": [" auth_permission"], "jdbcUrl" : "JDBC: mysql: / / 10.20.64.151:3306 / datax_test"}]]}}}}}Copy the code
(3) Execute Job
python ./bin/datax.py ./job/mysql2pgsql.json
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
2020-08-05 13:01:22.518 [main] INFO VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2020-08-05 13:01:22.524 [main] INFO Engine - the machine info =>
osInfo: Oracle Corporation 1.8 25.181-b13
jvmInfo: Linux amd64 3.10.0-862.11.6.el7.x86_64
cpu num: 8
totalPhysicalMemory: -0.00G
freePhysicalMemory: -0.00G
maxFileDescriptorCount: -1
currentOpenFileDescriptorCount: -1
GC Names [PS MarkSweep, PS Scavenge]
MEMORY_NAME | allocation_size | init_size
PS Eden Space | 256.00MB | 256.00MB
Code Cache | 240.00MB | 2.44MB
Compressed Class Space | 1,024.00MB | 0.00MB
PS Survivor Space | 42.50MB | 42.50MB
PS Old Gen | 683.00MB | 683.00MB
Metaspace | -0.00MB | 0.00MB
2020-08-05 13:01:22.541 [main] INFO Engine -
{
    "content":[
        {
            "reader":{
                "name":"postgresqlreader",
                "parameter":{
                    "column":[
                        "\"id\"",
                        "\"name\"",
                        "\"content_type_id\"",
                        "\"codename\""
                    ],
                    "connection":[
                        {
                            "jdbcUrl":[
                                "jdbc:postgresql://10.20.64.241:5433/finance"
                            ],
                            "table":[
                                "public.auth_permission"
                            ]
                        }
                    ],
                    "password":"*************",
                    "splitPk":"",
                    "username":"authenticator"
                }
            },
            "writer":{
                "name":"mysqlwriter",
                "parameter":{
                    "column":[
                        "`id`",
                        "`name`",
                        "`content_type_id`",
                        "`codename`"
                    ],
                    "connection":[
                        {
                            "jdbcUrl":"jdbc:mysql://10.20.64.151:3306/datax_test",
                            "table":[
                                "auth_permission"
                            ]
                        }
                    ],
                    "password":"***********",
                    "username":"root"
                }
            }
        }
    ],
    "setting":{
        "errorLimit":{
            "percentage":0.02,
            "record":0
        },
        "speed":{
            "byte":1048576,
            "channel":3
        }
    }
}
2020-08-05 13:01:22.557 [main] WARN Engine - prioriy set to 0, because NumberFormatException, the value is: null
2020-08-05 13:01:22.559 [main] INFO PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2020-08-05 13:01:22.559 [main] INFO JobContainer - DataX jobContainer starts job.
2020-08-05 13:01:22.560 [main] INFO JobContainer - Set jobId = 0
2020-08-05 13:01:22.676 [job-0] INFO OriginalConfPretreatmentUtil - Available jdbcUrl:jdbc:postgresql://10.20.64.241:5433/finance.
2020-08-05 13:01:22.704 [job-0] INFO OriginalConfPretreatmentUtil - table:[public.auth_permission] has columns:[id,name,content_type_id,codename].
2020-08-05 13:01:22.917 [job-0] INFO OriginalConfPretreatmentUtil - table:[auth_permission] all columns:[
id,name,content_type_id,codename
].
2020-08-05 13:01:22.924 [job-0] INFO OriginalConfPretreatmentUtil - Write data [
INSERT INTO %s (`id`,`name`,`content_type_id`,`codename`) VALUES(?,?,?,?)
], which jdbcUrl like:[jdbc:mysql://10.20.64.151:3306/datax_test?yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true]
2020-08-05 13:01:22.925 [job-0] INFO JobContainer - jobContainer starts to do prepare ...
2020-08-05 13:01:22.925 [job-0] INFO JobContainer - DataX Reader.Job [postgresqlreader] do prepare work .
2020-08-05 13:01:22.926 [job-0] INFO JobContainer - DataX Writer.Job [mysqlwriter] do prepare work .
2020-08-05 13:01:22.926 [job-0] INFO JobContainer - jobContainer starts to do split ...
2020-08-05 13:01:22.927 [job-0] INFO JobContainer - Job set Max-Byte-Speed to 1048576 bytes.
2020-08-05 13:01:22.929 [job-0] INFO JobContainer - DataX Reader.Job [postgresqlreader] splits to [1] tasks.
2020-08-05 13:01:22.930 [job-0] INFO JobContainer - DataX Writer.Job [mysqlwriter] splits to [1] tasks.
2020-08-05 13:01:22.944 [job-0] INFO JobContainer - jobContainer starts to do schedule ...
2020-08-05 13:01:22.947 [job-0] INFO JobContainer - Scheduler starts [1] taskGroups.
2020-08-05 13:01:22.948 [job-0] INFO JobContainer - Running by standalone Mode.
2020-08-05 13:01:22.955 [taskGroup-0] INFO TaskGroupContainer - taskGroupId=[0] start [1] channels for [1] tasks.
2020-08-05 13:01:22.958 [taskGroup-0] INFO Channel - Channel set byte_speed_limit to -1, No bps activated.
2020-08-05 13:01:22.958 [taskGroup-0] INFO Channel - Channel set record_speed_limit to -1, No tps activated.
2020-08-05 13:01:22.964 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
2020-08-05 13:01:22.967 [0-0-0-reader] INFO CommonRdbmsReader$Task - Begin to read record by Sql: [select "id","name","content_type_id","codename" from public.auth_permission
] jdbcUrl:[jdbc:postgresql://10.20.64.241:5433/finance].
2020-08-05 13:01:23.063 [0-0-0-reader] INFO CommonRdbmsReader$Task - Finished read record by Sql: [select "id","name","content_type_id","codename" from public.auth_permission
] jdbcUrl:[jdbc:postgresql://10.20.64.241:5433/finance].
2020-08-05 13:01:23.365 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[402]ms
2020-08-05 13:01:23.366 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] completed it's tasks.
2020-08-05 13:01:32.966 [job-0] INFO StandAloneJobContainerCommunicator - Total 24 records, 803 bytes | Speed 80B/s, 2 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 100.00%
2020-08-05 13:01:32.966 [job-0] INFO AbstractScheduler - Scheduler accomplished all tasks.
2020-08-05 13:01:32.967 [job-0] INFO JobContainer - DataX Writer.Job [mysqlwriter] do post work.
2020-08-05 13:01:32.967 [job-0] INFO JobContainer - DataX Reader.Job [postgresqlreader] do post work.
2020-08-05 13:01:32.967 [job-0] INFO JobContainer - DataX jobId [0] completed successfully.
2020-08-05 13:01:32.968 [job-0] INFO HookInvoker - No hook invoked, because base dir not exists or is a file: /home/dataX/datax/hook
2020-08-05 13:01:32.968 [job-0] INFO JobContainer -
[total cpu info] =>
averageCpu | maxDeltaCpu | minDeltaCpu
-1.00% | -1.00% | -1.00%
[total gc info] =>
NAME | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime
PS MarkSweep | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s
PS Scavenge | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s
2020-08-05 13:01:32.969 [job-0] INFO JobContainer - PerfTrace not enable!
2020-08-05 13:01:32.969 [job-0] INFO StandAloneJobContainerCommunicator - Total 24 records, 803 bytes | Speed 80B/s, 2 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 100.00%
2020-08-05 13:01:32.969 [job-0] INFO JobContainer -
Job start time            : 2020-08-05 13:01:22
Job end time              : 2020-08-05 13:01:32
Total elapsed time        : 10s
Average throughput        : 80B/s
Record write speed        : 2rec/s
Total records read        : 24
Total read/write failures : 0
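The summary shows 24 records read and 0 failures. As a quick sanity check (assuming the mysql command-line client is installed), you can count the rows on the target side:

mysql -h 10.20.64.151 -P 3306 -u root -p -e "SELECT COUNT(*) FROM datax_test.auth_permission;"

The count should match the 24 records reported by the job.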
DataX-Web operation
Too much hand-editing? Is there a graphical way to build configurations, as in Kettle, and a unified place to manage tasks?
DataX-Web is a visual web console built on top of DataX. You can select data sources and generate data synchronization tasks with one click. It supports batch creation of RDBMS synchronization tasks, integrates an open source scheduling system, and supports distributed execution, incremental synchronization, real-time viewing of run logs, monitoring of executor resources, and killing running processes. Compared with Kettle, generating tasks on the page with one click makes building synchronization jobs considerably faster.
(1) Environment
* MySQL (5.5+), mandatory; the corresponding client is optional
* JDK (1.8.0_xxx), mandatory
* DataX, mandatory
* Python (2.x), mandatory; it is used to launch the underlying DataX scripts. (Python 3 works if you replace the Python files under datax/bin with the python3-compatible versions shipped with datax-web.) By default DataX is executed as a Java subprocess; users can customize the execution in Python.

(2) Installation
There are two routes: (1) compile from source, or (2) use the officially compiled package. After decompressing the package, go to the decompressed directory and find install.sh under the bin directory; for interactive installation, run ./bin/install.sh directly, as sketched below.
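For the packaged route, the steps look roughly like this (a sketch; the package name and version are assumptions, adjust to what you actually download):

tar -zxvf datax-web-2.1.2.tar.gz    # package name/version assumed
cd datax-web-2.1.2
./bin/install.sh                    # interactive installation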
Note:
The job_info table created by the database script from the official site is missing the user_id field; when a user clicks into a task, an error is reported that the user_id field cannot be found. Manually add the user_id field, with type int and length 11:
ALTER TABLE job_info ADD user_id int(11);
A detailed tutorial is available at github.com/WeiYe-Jing/…
Installation Package Description
Finally, enter IP:port in your browser to open the DataX-Web page.
DataX versus Kettle
1) Kettle has its own management console, and ETL tasks can be defined directly in the client. Kettle uses a CS (client/server) architecture and does not support a BS (browser) mode. DataX itself has no interface (though DataX-Web has since been extended to provide browser-based access).
2) Both support a fairly complete range of databases, with Kettle supporting more. DataX is developed by Alibaba, so it has better support for Alibaba's own database family, such as ODPS and ADS.
3) Kettle has joined the BI organization Pentaho; since joining, the pace of Kettle's development and the attention it receives have further improved.
4) DataX's open source activity is lower: it receives far less attention than Kettle and has far fewer code commits.
5) According to benchmarks published online, a full extraction of a large volume of data takes a long time in Kettle, and DataX is faster than Kettle for the same full extraction.
Other
- The DataX plugin set also has community extensions that you can download and compile yourself: an ElasticSearch reader, incremental PostgreSQL writes (writeMode update), a Redis writer…
- The stock PostgreSQL writer does not support incremental configuration. As a workaround, land the changed (incremental) data in a temporary table in the same database as the target table, delete the modified rows from the target by comparing the two tables, and then insert the new data directly; see the sketch below.
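A sketch of that temp-table approach in SQL (the table and key names here are illustrative, not from the original job):

-- illustrative names: auth_permission is the target, tmp_auth_permission holds the changed rows
DELETE FROM auth_permission AS t
 USING tmp_auth_permission AS s
 WHERE t.id = s.id;                  -- remove rows that were modified at the source
INSERT INTO auth_permission
SELECT * FROM tmp_auth_permission;   -- insert modified and new rows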
Want to learn more from technical experts? Where can you discuss the problems you hit during development? How can you tap into the wealth of fintech resources?
Hang Seng LIGHT Cloud Community, a professional fintech community platform built by Hang Seng Electronics, shares practical technical content, resources, and fintech industry trends, and welcomes all financial developers.