Digital Stack is a cloud-native, one-stop data middle platform PaaS, and we have an interesting open source project on GitHub and Gitee: FlinkX. FlinkX is a unified batch and streaming data synchronization tool based on Flink. It can collect both static data and real-time data, and serves as a global, heterogeneous, batch-and-stream data synchronization engine. If you like it, please give us a star! Star! Star!
GitHub open source project: github.com/DTStack/fli…
Gitee open source project: gitee.com/dtstack_dev…
I. Where to collect data
When we talk about big data, about using it and making it valuable, we first have to have the data.
When we talk about the data middle platform, we talk about "storing", "connecting", and "using" data. The first step is "storing": the data should be brought into the middle platform, typically into the data warehouse. On the basis of "storing", we process data of different formats from different sources, break down the data silos one by one, and turn data resources into data assets, which are then applied to the user's specific scenarios.
Data does not appear out of thin air. Kangaroo Cloud Digital Stack provides two collection methods, offline data synchronization and real-time data synchronization, helping users efficiently collect scattered data resources and store them together, carrying out "global" data collection in a tool-based way and laying the foundation for building the data middle platform.
II. How to collect data
1. Offline data synchronization collection
The following figure shows a visually configured data synchronization task:
FlinkX, the Digital Stack data synchronization tool, acts as a "bridge" between different storage systems. It is a basic core function of the data middle platform and supports data from various heterogeneous storage systems (MySQL, SQL Server, Oracle, etc.). Its plug-in architecture can support new data sources at any time, and it supports high-volume, high-concurrency synchronization, providing better performance and stability than single-point synchronization.
This solution meets synchronization requirements at different granularities, such as minute level (5 minutes), hour level, and day level.
The data synchronization interface of Kangaroo Cloud Digital Stack is shown in the figure below:
The data synchronization module FlinkX is the conduit for data exchange between various storage units. To carry out large-scale data mining and computation on the data middle platform, the usual practice is to transfer the data to the platform before task execution, and to transfer the computed results to external storage units (such as application databases like MySQL) after task execution.
The functions of data integration are shown in the figure below:
The data synchronization module has the following features:
1) Rich data source support
The data synchronization module supports MySQL, Oracle, SQLServer, PostgreSQL, DB2, HDFS (TextFile/Parquet/ORC), Hive, HBase, FTP, ElasticSearch, MaxCompute, Redis, MongoDB, CarbonData, and other data sources, and can both read from and write to them. You only need to configure the connection information of the data source, such as the JDBC URL, user name, and password of an Oracle database, and then configure the corresponding data synchronization task.
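As a rough illustration (not the authoritative syntax for any particular FlinkX version), the connection information for an Oracle source in script mode typically boils down to a small reader block like the following; the plugin name, parameter keys, URL, schema, and table names are placeholders:

```json
{
  "name": "oraclereader",
  "parameter": {
    "username": "etl_user",
    "password": "******",
    "column": ["ORDER_ID", "AMOUNT", "CREATED_AT"],
    "connection": [
      {
        "jdbcUrl": ["jdbc:oracle:thin:@//db-host:1521/ORCL"],
        "table": ["SALES.ORDERS"]
      }
    ]
  }
}
```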
2) Distributed system architecture
The data synchronization module uses a distributed system architecture that reads and writes data concurrently from multiple nodes, which greatly improves data synchronization throughput. Compared with open source data synchronization tools such as Sqoop and Kettle, the data synchronization module offers higher throughput and richer supporting functions.
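A minimal sketch of how this concurrency is usually expressed in a script-mode job, assuming a DataX-style layout: the channel count sets the parallelism, and a split key (the hypothetical primary key id below) lets the reader shard the table across channels. Key names may differ between FlinkX versions, and credentials and the writer side are omitted for brevity.

```json
{
  "job": {
    "setting": {
      "speed": { "channel": 4 }
    },
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "splitPk": "id",
            "column": ["id", "amount"],
            "connection": [
              { "jdbcUrl": ["jdbc:mysql://mysql-host:3306/shop"], "table": ["orders"] }
            ]
          }
        }
      }
    ]
  }
}
```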
3) Wizard/custom configuration mode
Wizard mode:
Wizard mode is convenient and simple, with visualized field mapping, so synchronization task configuration can be completed quickly. You can use the wizard to create and configure synchronization tasks, including selecting the source database and source table, the target database and target table, field mapping, and synchronization speed.
Script mode:
Script mode is versatile and efficient, supports deep tuning, and supports all data sources. You write a JSON script to complete the configuration, as sketched below.
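For orientation, a script-mode job is typically a single JSON document that pairs a reader plugin with a writer plugin. The sketch below moves a hypothetical orders table from MySQL to HDFS; the plugin names and parameter keys follow the common FlinkX/DataX convention, but the exact schema should be checked against the FlinkX documentation for your version, and all hosts, paths, and table names are placeholders.

```json
{
  "job": {
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "username": "etl_user",
            "password": "******",
            "column": ["id", "user_id", "amount", "update_time"],
            "connection": [
              { "jdbcUrl": ["jdbc:mysql://mysql-host:3306/shop"], "table": ["orders"] }
            ]
          }
        },
        "writer": {
          "name": "hdfswriter",
          "parameter": {
            "defaultFS": "hdfs://namenode:9000",
            "path": "/warehouse/ods/orders",
            "fileType": "orc",
            "writeMode": "overwrite",
            "column": [
              { "name": "id", "type": "BIGINT" },
              { "name": "user_id", "type": "BIGINT" },
              { "name": "amount", "type": "DOUBLE" },
              { "name": "update_time", "type": "TIMESTAMP" }
            ]
          }
        }
      }
    ],
    "setting": {
      "speed": { "channel": 2 },
      "errorLimit": { "record": 0 }
    }
  }
}
```

Keeping the whole task in one JSON document is what makes the deep tuning mentioned above possible: any reader or writer parameter can be set explicitly rather than through the UI.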
4) Scheduling and dependency configuration
In an actual data production process, the data synchronization task is usually the first and the last task in the data processing pipeline, responsible for "extracting data from the business system" and "writing the resulting data out" respectively. [Offline Computing – Development Suite] supports configuring dependencies on synchronization tasks to constrain the execution order between synchronization tasks and other tasks.
Data synchronization tasks are usually executed periodically, for example daily, weekly, hourly, or every few minutes (e.g. 5 minutes). [Offline Computing – Development Suite] supports configuring the scheduling cycle of a synchronization task so that it runs periodically. For details on the scheduling and dependency configuration features, please refer to the "Data development: building data analysis logic" section.
5) Full/incremental synchronization
To minimize the impact on the business system, data is usually read from it incrementally. When the source table contains a data-change time field, [Offline Computing – Development Suite] supports incremental synchronization for relational databases: users only need to enter the corresponding data filter statement, as in the sketch below.
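As a sketch, the filter is typically just a WHERE-style condition on the change-time column. In a script-mode reader it could look roughly like this, assuming a where parameter on the reader (column names and the date range are illustrative):

```json
{
  "name": "mysqlreader",
  "parameter": {
    "username": "etl_user",
    "password": "******",
    "column": ["id", "amount", "update_time"],
    "where": "update_time >= '2021-01-01 00:00:00' AND update_time < '2021-01-02 00:00:00'",
    "connection": [
      { "jdbcUrl": ["jdbc:mysql://mysql-host:3306/shop"], "table": ["orders"] }
    ]
  }
}
```

In a periodically scheduled task, the two timestamps would typically be parameterized by the scheduling time rather than hard-coded.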
6) Entire library synchronization
Whole database synchronization is a quick tool that helps improve user efficiency and reduce user cost. It can quickly upload all tables in a MySQL database to the data middle platform, saving a lot of initialization effort. Suppose the database has 100 tables: you might otherwise need to configure 100 data synchronization tasks, but with whole database upload you can do it all at once (this requires a high degree of standardization in the database table design).
In the whole database synchronization configuration, you can select the tables to be synchronized in batches and configure settings such as full/incremental mode and batch synchronization. You can also customize table names and field types, which provides high flexibility.
7) Sharded database/table and FTP multi-path synchronization
The data synchronization module supports synchronizing relational databases in sharded database/table mode. The user only needs to select multiple tables and databases on the page (each table is required to have the same structure).
In addition to the sharded mode for relational databases, a single task can also read multiple files from multiple FTP paths, reducing repetitive synchronization task configuration. A script-mode sketch of the sharded case follows below.
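As a sketch of how the sharded case is commonly expressed in script mode, the reader's connection list can carry several databases, each with several identically structured tables; all hosts, database names, and table names below are placeholders, and key names may differ between FlinkX versions:

```json
{
  "name": "mysqlreader",
  "parameter": {
    "username": "etl_user",
    "password": "******",
    "column": ["id", "amount", "update_time"],
    "connection": [
      { "jdbcUrl": ["jdbc:mysql://shard-0:3306/shop_0"], "table": ["orders_0", "orders_1"] },
      { "jdbcUrl": ["jdbc:mysql://shard-1:3306/shop_1"], "table": ["orders_0", "orders_1"] }
    ]
  }
}
```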
8) Control of synchronization speed
During the initial data synchronization, a large amount of historical data needs to be synchronized to the middle platform, so the read speed needs to be increased; when the business database is under heavy operating pressure, the read and write speed needs to be lowered to reduce the pressure on the database.
The data synchronization module therefore supports synchronization speed control. You can adjust the synchronization speed by setting an upper limit, as sketched below; the appropriate value depends on the hardware configuration and the data volume.
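In script mode this limit usually lives in the job's setting block. The sketch below assumes a bytes key that caps throughput (roughly 1 MB per second here) alongside the channel parallelism; the exact key names and whether the cap is global or per channel depend on the FlinkX version.

```json
{
  "setting": {
    "speed": {
      "channel": 2,
      "bytes": 1048576
    }
  }
}
```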
2. Real-time data synchronization collection
The figure above shows the real-time data synchronization architecture, explained as follows:
1) Oracle and SQLServer data sources: the user needs to purchase and deploy the OGG real-time capture tool to collect Oracle redo log data in real time, and then, through the visual configuration of the Digital Stack DTinsightStream product, send the data to Kafka, where it is archived or consumed in real time.
2) MySQL data source: the Digital Stack DTinsightStream product has integrated the Canal capture tool to collect MySQL binlog data in real time; through visual configuration the data is sent directly to Kafka, where it is archived or consumed in real time.
3) Log data sources: the Digital Stack DTinsightStream product's real-time log collection module is based on the jLogstash component (a distributed modification of the open source version). It can perform distributed resource scheduling on YARN and, through visual configuration, send data directly to Kafka, where it is archived or consumed in real time.
Like offline data synchronization tasks, the real-time collection module is convenient and flexible to configure on the web, in either wizard or script mode. Taking MySQL real-time collection as an example, you only need to configure the data source, tables, and some filter conditions on the page; a script-mode sketch follows below.
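For orientation only, a script-mode real-time collection job typically pairs a binlog reader with a Kafka writer, roughly as follows. The plugin names (binlogreader, kafkawriter), the parameter keys, and all hosts, topics, and table names are assumptions made for illustration and should be checked against the FlinkX documentation for your version:

```json
{
  "job": {
    "content": [
      {
        "reader": {
          "name": "binlogreader",
          "parameter": {
            "jdbcUrl": "jdbc:mysql://mysql-host:3306/shop",
            "username": "canal_user",
            "password": "******",
            "cat": "insert,update,delete",
            "table": ["shop.orders"]
          }
        },
        "writer": {
          "name": "kafkawriter",
          "parameter": {
            "topic": "shop_orders_binlog",
            "producerSettings": {
              "bootstrap.servers": "kafka-host:9092"
            }
          }
        }
      }
    ],
    "setting": {
      "speed": { "channel": 1 }
    }
  }
}
```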
In addition to configuration, the system can also monitor the input and output data of a running real-time collection task and raise alarms.