DataX-Web
DataX Web is a distributed data synchronization tool built on top of DataX. It provides a simple, easy-to-use web interface that lowers the learning curve of DataX, shortens task configuration time, and prevents configuration errors. Users create data synchronization tasks by selecting data sources on the page; for RDBMS data sources, synchronization tasks can be created in batches. Synchronization progress and logs can be viewed in real time, and a running synchronization can be terminated. Built on XXL-Job, it supports incremental data synchronization based on time fields or auto-increment primary keys.
The task executor supports cluster deployment, with selectable multi-node routing policies, timeout control, retries, failure alarms, task dependencies, executor CPU/memory/load monitoring, and more. Support for more data sources, data transformation UDFs, table structure synchronization, data synchronization lineage, and more complex business scenarios will be provided in the future.
System Requirements
- Language: Java 8 (JDK version 1.8.201 or above recommended) and Python 2.7 (for Python 3 support, replace the three Python files under datax/bin with the replacement files in doc/datax-web/datax-python3).
- Environment: macOS, Windows, Linux
- Database: MySQL 5.7
Architecture diagram:
Features
- 1. Build DataX JSON via the Web UI;
- 2. DataX JSON is stored in the database, making task migration and management convenient;
- 3. View extraction logs on the Web in real time, similar to Jenkins' console log output;
- 4. Display DataX run records; running DataX jobs can be stopped from the page;
- 5. Support scheduled DataX tasks; task status can be modified dynamically, tasks can be started/stopped, and running tasks can be terminated, all taking effect immediately;
- 6. Scheduling adopts a centralized design and supports cluster deployment;
- 7. Task execution is distributed, and task executors support cluster deployment;
- 8. Executors register themselves periodically; the scheduling center automatically discovers registered executors and triggers task execution;
- 9. Routing policies: when executors are deployed as a cluster, a rich set of routing policies is provided: first, last, round-robin, random, consistent hash, least frequently used, least recently used, failover, and busy failover;
- 10. Blocking strategy: the policy applied when scheduling requests arrive faster than the executor can process them: single-machine serial (default), discard subsequent scheduling, or overwrite the previous scheduling;
- 11. Task timeout control: supports custom task timeouts; a task that runs past its timeout is actively interrupted;
- 12. Retry on task failure: the retry count is customizable; when a task fails, the system automatically retries according to the preset count;
- 13. Task failure alarms: email alarms are provided by default, and extension interfaces are reserved so SMS and DingTalk alarms can be added easily;
- 14. User management: system users can be managed online, with two roles, administrator and ordinary user;
- 15. Task dependencies: subtask dependencies can be configured; when a parent task finishes and succeeds, it actively triggers one execution of each subtask;
- 16. Running reports: view running data in real time, plus scheduling reports such as scheduling date distribution and scheduling success distribution;
- 17. Specify an incremental field and configure a scheduled task to obtain the data range automatically on each run; failed tasks are retried to keep the data safe;
- 18. DataX startup JVM parameters can be configured on the page;
- 19. A manual test function is available after a data source is configured successfully;
- 20. Templates can be configured for frequently used tasks; after building JSON you can associate a template to create the task directly;
- 21. JDBC adds Hive data source support; on the JSON build page, column information can be generated from the selected data source, simplifying configuration;
- 22. DataX file directories are read from environment variables first, so JSON and log directories need not be specified for cluster deployment;
- 23. Hive partitions can be specified with dynamic parameters, so incremental data is inserted into the matching partition together with the increment;
- 24. Task types extended from the original DataX task to Shell, Python, and PowerShell tasks;
- 25. Add HBase data source support; the JSON builder can obtain hbaseConfig and column from the HBase data source;
- 26. Add MongoDB data source support; users only need to select the collectionName to complete the JSON build;
- 27. Add a monitoring page for executor CPU, memory, and load;
- 28. Add DataX JSON configuration examples for 24 kinds of plugins;
- 29. Public fields (created at, created by, modified at, modified by) are filled automatically on insert or update;
- 30. Token verification for the Swagger interface;
- 31. Add a task timeout; the datax process of a timed-out task is killed, which, combined with the retry policy, avoids DataX getting stuck on network problems;
- 32. Add a project management module so tasks can be managed by category;
- 33. Add batch task creation for RDBMS data sources; after selecting a data source and tables, DataX synchronization tasks are generated in batches from a template;
- 34. JSON build adds ClickHouse data source support;
- 35. The executor CPU/memory/load monitoring page is now graphical;
- 36. RDBMS data source incremental extraction adds a primary-key increment mode and optimizes page parameter configuration;
- 37. Change the MongoDB data source connection mode and refactor the HBase data source JSON build module;
- 38. Add a stop function for script tasks;
- 39. RDBMS JSON build adds postSql and supports building multiple preSql and postSql statements;
- 40. The data source information encryption algorithm is modified and the code optimized;
- 41. Add DataX execution result statistics to the log page;
Quick Start:
Please click: Quick Start
Linux: one-click deployment
Docker image: address
Introduction:
1. Executor configuration (based on the open-source project XXL-Job)
- 1. "Scheduling center OnLine:" the list of online scheduling centers is shown on the right. After a task finishes, the executor calls back to a scheduling center in failover mode to report the execution result, avoiding a single point of failure in the callback.
- 2. "Executor List" shows the executors that are online; the cluster machines behind each executor can be viewed via "OnLine Machines".
Executor attribute description:
- 1. AppName: must be identical to datax.job.executor.appname in datax-executor.yml. Each executor cluster is uniquely identified by its AppName, under which executors periodically register themselves automatically. Through this setting, tasks automatically discover the registered executors they can be scheduled onto.
- 2. Name: the display name of the executor. Because AppName is restricted to letters and digits and is poorly readable, Name improves the executor's readability.
- 3. Sort order: the ordering of executors. Wherever executors are needed in the system, such as when creating a task, the list of available executors is read in this order.
- 4. Registration mode: how the scheduling center obtains executor addresses. Automatic registration: the executor registers itself, and the scheduling center dynamically discovers executor machine addresses through the underlying registry. Manual entry: executor address information is entered manually, separated by commas, for the scheduling center to use.
- 5. Machine addresses: valid only when the registration mode is "Manual entry"; supports manual maintenance of executor address information.
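For orientation, a minimal sketch of the executor-side counterpart of attribute 1, assuming the datax-executor.yml property path named above; the appname value is only an example:

```yaml
# datax-executor.yml (illustrative excerpt, not the full file)
datax:
  job:
    executor:
      # Must be identical to the AppName entered in the scheduling center UI,
      # otherwise the executor registers under a different cluster identity.
      appname: datax-executor-sample
```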
2. Create a data source
Used in Step 4.
3. Create a task template
Used in Step 4.
4. Build a JSON script
- 1. Steps 1 and 2: select the data sources created in step 2 above. The JSON builder currently supports these data sources: Hive, MySQL, Oracle, PostgreSQL, SQL Server, HBase, MongoDB, and ClickHouse. JSON building for other data sources is still under development; for now, write their JSON by hand.
- 2. Field mapping
- 3. Click Build to generate the JSON. At this point you can copy the JSON and create a task with it, or click Select Template to generate the task directly.
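For orientation, here is a minimal sketch of the kind of job JSON the builder emits, assuming a MySQL-to-MySQL synchronization using DataX's standard mysqlreader and mysqlwriter plugins; the tables, columns, credentials, and URLs are placeholders, and the preSql/postSql entries illustrate feature 39 from the list above:

```json
{
  "job": {
    "setting": {
      "speed": { "channel": 3 }
    },
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "username": "src_user",
            "password": "******",
            "column": ["id", "name", "update_time"],
            "connection": [
              { "table": ["orders"], "jdbcUrl": ["jdbc:mysql://127.0.0.1:3306/source_db"] }
            ]
          }
        },
        "writer": {
          "name": "mysqlwriter",
          "parameter": {
            "username": "dst_user",
            "password": "******",
            "column": ["id", "name", "update_time"],
            "preSql": ["truncate table orders_copy"],
            "postSql": ["analyze table orders_copy"],
            "connection": [
              { "table": ["orders_copy"], "jdbcUrl": "jdbc:mysql://127.0.0.1:3306/target_db" }
            ]
          }
        }
      }
    ]
  }
}
```

One detail worth noting when writing such JSON by hand: mysqlreader expects jdbcUrl as an array, while mysqlwriter expects a single string, which is easy to get wrong without the builder.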
5. Create tasks in batches
6. Task creation (creating a task from an associated template is not described here; for details, see 4. Build a JSON script.)
DataX task
Shell task
Python task
PowerShell task
- Task types: DataX task, Shell task, Python task, PowerShell task;
- Blocking strategy: the policy applied when scheduling requests arrive faster than the executor can process them;
- Single-machine serial: after a scheduling request reaches the single-machine executor, it enters a FIFO queue and runs serially;
- Discard subsequent scheduling: after a scheduling request reaches the single-machine executor, it is discarded and marked as failed if a task is already running on that executor;
- Overwrite the previous scheduling: after a scheduling request reaches the single-machine executor, if a task is already running, the running task is terminated, the queue is cleared, and the new request is executed;
- For incremental updates, it is recommended to set the blocking strategy to "discard subsequent scheduling" or "single-machine serial";
- When using single-machine serial, take care to set a reasonable retry count (retry count × time per run < task scheduling period); if the retry count is set too high, data is duplicated. For example, a task runs every 30 seconds, each run takes 20 seconds, and 3 retries are configured: if a run fails, the first retry's incremental window is 1577755680-1577756680, and if the retries are still unfinished when the next run starts, the new run's window is 1577755680-1577758680; the two windows overlap, so the overlapping data is extracted twice.
- Incremental parameter setting
- Partition parameter settings
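As a hedged illustration of the two settings above: the incremental parameters typically end up in the reader's where clause as a time window, and for Hive targets the partition parameter is substituted into the writer's output path. The placeholder names lastTime, currentTime, and partition below are examples, not necessarily the exact variables this page emits:

```json
{
  "name": "mysqlreader",
  "parameter": {
    "where": "update_time >= '${lastTime}' and update_time < '${currentTime}'"
  }
}
```

For a Hive writer, the same substitution would rewrite a path such as /user/hive/warehouse/sample.db/orders/pt=${partition} before each run, so the increment lands in the matching partition (feature 23 above).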
7. To-do list
8. You can click View Logs to obtain logs in real time and terminate a running datax process
9. Monitor task resources
10. Admin can create users and edit user information
UI
Front-end GitHub address
Contributing
Contributions are welcome! Open a pull request to fix a bug, or open an Issue to discuss a new feature or change.
Copyright and License
MIT License
Copyright (c) 2020 WeiYe
The product is open source and free, and free community technical support will continue to be provided. Individuals and enterprises may access and use it freely.
You are welcome to register at the registration address; registration is used only for product promotion and as motivation for community development.
V2.1.2
New:
- Add a project management module so tasks can be managed by category;
- Add batch task creation for RDBMS data sources; after selecting a data source and tables, DataX synchronization tasks are generated in batches from a template;
- JSON build adds ClickHouse data source support;
- The executor CPU/memory/load monitoring page is now graphical;
- RDBMS data source incremental extraction adds a primary-key increment mode and optimizes page parameter configuration;
- Change the MongoDB data source connection mode and refactor the HBase data source JSON build module;
- Script tasks add a stop function;
- RDBMS JSON build adds postSql and supports building multiple preSql and postSql statements;
- Merge the datax-registry module into datax-rpc;
- The data source information encryption algorithm is modified and the code optimized;
- Incremental time synchronization supports more time formats;
- Add DataX execution result statistics to the log page;
Update:
- PostgreSQL, SQL Server, and Oracle data source JSON builds add schema name selection;
- Optimize consistency of field names and data source keywords in DataX JSON;
- Task management page button display optimization;
- Add task descriptions to the log management page;
- [Fixed] JSON build front-end form could not cache data;
- Hive JSON build adds header and tail options;
Remark:
Upgrading from 2.1.1 is not recommended: the change to the data source encryption mode means previously encrypted data sources can no longer be decrypted, so task execution fails. If you must upgrade, rebuild your data sources and tasks.
V2.1.1
New:
- Add HBase data source support; hbaseConfig and column can be obtained from the HBase data source for JSON construction;
- Add MongoDB data source support; users only need to select the collectionName to complete the JSON build;
- Add an executor CPU/memory/load monitoring page;
- Add DataX JSON configuration examples for 24 kinds of plugins;
- Public fields (created at, created by, modified at, modified by) are filled automatically on insert or update;
- Token authentication for the Swagger interface;
- Add a task timeout; the datax process of a timed-out task is killed, which, combined with the retry policy, prevents DataX from getting stuck on network problems;
Update:
- Data source management encrypts usernames and passwords to improve security;
- Usernames and passwords in the JSON file are encrypted and decrypted when the DataX task is executed;
- Interactive optimization of page menu sorting, icons, prompt messages, etc.;
- Log output drops irrelevant information such as project class names to reduce file size; large-file output and page display are optimized;
- Logback is configured to obtain the log path from the YML file;
Fixed:
- Fixed the request-timeout error shown when a task log is too large.