DataX is Alibaba's open-source data transfer framework. This post does not walk through it in the usual order, and the project has little detailed documentation anyway; if you are not familiar with the framework, it is worth reading the official docs first.

How would you design a data transfer framework? The initial requirement is simple: pull data from a source and output it to a target, where the sources are not limited to databases. It is natural to think you need a Reader to pull the data and a Writer to receive it. You probably also need the concept of a channel, possibly several of them, sitting in between. The channel abstraction here is analogous to `java.nio.channels.Channel`: it is the data carrier. Data should not flow directly from Reader to Writer but through this pipe, because the channel layer is where you can throttle the flow and add buffering. With the basic architecture in place, what else is needed? Performance. The way to improve it is to increase parallelism, which implies task splitting and thread isolation: different tasks should be executed by separate threads, either from a thread pool or each on a fresh thread. Then there should be an entity that represents the whole: a large job can be divided into many smaller tasks by some splitting strategy, with a thread executing each task. Tasks may also need a retry mechanism, with an upper limit on the number of retries. (Honestly, I find the role of TaskGroup weak. The one justification I can think of: after the overall job is broken into sub-tasks, some grouping strategy produces multiple TaskGroups, and in a cluster mode these could be dispatched to other machines, whose results would then be collected to decide whether the whole job has completed. Frankly, it feels like forced complexity. The abstraction might make sense if tasks could be split into groups such that the failure of some groups did not affect the whole.)
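To make the Reader → channel → Writer picture concrete, here is a minimal sketch, assuming nothing about DataX's real classes (`Record`, the poison-pill marker, and the thread names are all made up for illustration): a bounded blocking queue plays the channel, decoupling a reader thread from a writer thread and giving back-pressure for free.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class MiniPipeline {
    // End-of-stream marker so the writer knows when to stop.
    private static final Record POISON = new Record(null);

    static class Record {
        final String payload;
        Record(String payload) { this.payload = payload; }
    }

    public static void main(String[] args) throws InterruptedException {
        // The bounded queue plays the channel: it buffers records and
        // naturally back-pressures the reader when the writer is slow.
        BlockingQueue<Record> channel = new ArrayBlockingQueue<>(1024);

        Thread reader = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++) {           // pretend to pull from a source
                    channel.put(new Record("row-" + i)); // blocks when the channel is full
                }
                channel.put(POISON);                     // signal end of data
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "readerRunner");

        Thread writer = new Thread(() -> {
            try {
                while (true) {
                    Record r = channel.take();           // blocks when the channel is empty
                    if (r == POISON) break;
                    System.out.println("write " + r.payload); // pretend to push to a sink
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "writerRunner");

        reader.start();
        writer.start();
        reader.join();
        writer.join();
    }
}
```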

With these assumptions in mind, let's look at the core components DataX actually defines.

Job: the overall task, which needs to be split.

Reader: inherits from a plugin skeleton class. Different sources need to be read in different ways, so this part of the logic is implemented by the plugin provider.

Reader.Job: an inner class of Reader. Defines the logic for splitting a job into multiple tasks.

Reader.Task: an inner class of Reader. Defines how to read data.

Writer: inherits from a plugin skeleton class; each target data source defines its own write logic.

Writer.Job: an inner class of Writer. Defines how to split a Job into multiple Tasks.

Writer.Task: an inner class of Writer. Defines how to write data.

Task: the minimum unit of data transfer.

ReaderRunner/WriterRunner: each Task contains both reading and writing logic, delegated to these two member variables. Each Runner wraps the corresponding Reader/Writer plugin instance and delegates the actual work to it. ReaderRunner and WriterRunner each run on their own thread.

TaskGroup: tasks are shuffled and assigned to groups.

Sender: the Reader actually writes data into the Sender, and the Sender decides when to push it to the channel, so it is effectively a buffer. When you don't know the channel's underlying implementation, buffering plus bulk writes is a safe choice: the channel might be backed by storage or other IO, where batched writes are far more efficient.

Receiver: the counterpart of the Sender. Called by the Writer to take data out of a channel, and also buffered. Writing to the target data source is typically the expensive operation, so batching improves efficiency here too.

Channel: the data transmission channel. The default implementation in DataX is a blocking queue.
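The buffer-plus-bulk-write idea behind the Sender can be sketched as follows; `BatchingSender` is a hypothetical name for illustration, not DataX's API. The reader calls `send()` per record, but the channel only sees batches:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

// Buffering sender sketch (illustrative names, not DataX's API): records
// accumulate locally and are handed to the channel in bulk, amortizing
// per-operation cost when the channel is backed by IO or a contended queue.
class BatchingSender<T> {
    private final BlockingQueue<List<T>> channel;
    private final int batchSize;
    private List<T> buffer = new ArrayList<>();

    BatchingSender(BlockingQueue<List<T>> channel, int batchSize) {
        this.channel = channel;
        this.batchSize = batchSize;
    }

    void send(T record) throws InterruptedException {
        buffer.add(record);
        if (buffer.size() >= batchSize) flush();
    }

    void flush() throws InterruptedException {
        if (buffer.isEmpty()) return;
        channel.put(buffer);          // one bulk hand-off instead of N small ones
        buffer = new ArrayList<>();
    }
}
```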

JobContainer: holds the job's running state and job-level configuration, and knows how to load plugins, how to split tasks into groups, and how to start multiple TaskGroupContainers through the Scheduler object.

TaskGroupContainer: manages the running state and configuration of one task group. Responsible for starting each task, monitoring whether tasks fail, and deciding whether to retry or terminate the program.
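A rough sketch of the split-then-group step, assuming a simple round-robin grouping (the names `TaskConfig` and `splitAndGroup` are mine; DataX's real code passes `Configuration` objects around and its assignment logic is more involved):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for one task's configuration.
class TaskConfig {
    final int id;
    TaskConfig(int id) { this.id = id; }
}

class JobSplitter {
    // Split one job into taskCount tasks, then deal them round-robin into
    // groupCount task groups. This only shows the shape of the step; the
    // real grouping also considers which resource each task touches.
    static List<List<TaskConfig>> splitAndGroup(int taskCount, int groupCount) {
        List<List<TaskConfig>> groups = new ArrayList<>();
        for (int g = 0; g < groupCount; g++) {
            groups.add(new ArrayList<>());
        }
        for (int t = 0; t < taskCount; t++) {
            groups.get(t % groupCount).add(new TaskConfig(t)); // deal tasks out in turn
        }
        return groups; // one TaskGroupContainer would be started per group
    }
}
```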

The overall run flow goes like this (ignoring configuration parsing and statistics collection). The Engine object creates a JobContainer, which splits and groups the job, producing a series of TaskGroupContainer objects. Each TaskGroupContainer creates and runs a number of tasks matching the channels assigned to it. The channel count is really a description of IO resource consumption: without a cap on concurrent tasks, the job could hog so much IO that other programs cannot run, or suddenly load too much data into memory and trigger an OOM. Unfortunately, unlike Elasticsearch, there is no strict control over memory usage; even with byte-level throttling it is virtually impossible to determine how much memory each task will consume. Running a task means starting its ReaderRunner and WriterRunner at the same time: the reader plugin writes data into a channel and the writer plugin consumes it.
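To show what byte-level throttling looks like, and why it does not bound memory, here is a crude, hypothetical limiter of the kind a channel could apply on push (not DataX's implementation): it caps throughput per one-second window, while the bytes actually buffered at any instant are bounded only by queue capacity.

```java
// Crude byte-level throttle sketch. It limits bytes per second, but says
// nothing about how much data sits buffered in the channel at once, which
// is why per-task memory use remains hard to predict.
class ByteRateLimiter {
    private final long bytesPerSecond;
    private long windowStart = System.currentTimeMillis();
    private long bytesInWindow = 0;

    ByteRateLimiter(long bytesPerSecond) {
        this.bytesPerSecond = bytesPerSecond;
    }

    // Called with the size of each batch before it enters the channel.
    synchronized void acquire(long bytes) throws InterruptedException {
        long now = System.currentTimeMillis();
        if (now - windowStart >= 1000) {       // start a fresh 1s window
            windowStart = now;
            bytesInWindow = 0;
        }
        bytesInWindow += bytes;
        if (bytesInWindow > bytesPerSecond) {  // over budget: wait out the window
            long sleepMs = windowStart + 1000 - now;
            if (sleepMs > 0) Thread.sleep(sleepMs);
            windowStart = System.currentTimeMillis();
            bytesInWindow = 0;
        }
    }
}
```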

Reading the code, DataX feels like a work in progress: lots of TODOs inside, IO throttling unfinished, and no cluster mode. There is nothing refreshing about it either; thinking it through the normal way, you can guess most of the architecture. It is also inflexible, leaving too much logic to the plugins, and it cannot do multi-directional transfers, where, for example, one input source fans out to many downstream consumers. A point of comparison is Camel: how to route data and how to configure routing rules may be the real essence of ETL.