1. Scenario Description
For example, an order database is sharded into multiple databases and tables, as shown in the figure below:
Rather than creating a separate task for each database instance to import its data into the MQ cluster, the requirement is to create a single task that synchronizes all of them to the MQ cluster, because apart from the different databases, the table structures and data mapping rules of the synchronization task are identical.
2. Flinkx's Solution in Detail
2.1 Basic Flink Stream API Development Process
The general steps for programming with the Flink Stream API are shown below:
Note: the details of the Stream API will be covered in a future article; this article focuses on InputFormatSourceFunction and, in particular, on how the data source is split.
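Since the figure is not reproduced here, the following minimal Java job illustrates those steps: obtain the execution environment, create a source, apply transformations, attach a sink, and execute. The element values are made up purely for illustration.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StreamJobSkeleton {
    public static void main(String[] args) throws Exception {
        // 1. Obtain the execution environment.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // 2. Create a source (an in-memory source, just for illustration).
        DataStream<String> source = env.fromElements("order-1", "order-2");
        // 3. Apply a transformation.
        DataStream<String> upper = source.map(String::toUpperCase);
        // 4. Attach a sink.
        upper.print();
        // 5. Trigger job execution.
        env.execute("stream-api-skeleton");
    }
}
```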
2.2 Flinkx Reader Core Class Diagram
Flinkx encapsulates each kind of data source as a Reader, whose base class is BaseDataReader. The figure above mainly shows the following key classes:
- InputFormat: the core Flink API for splitting and abstracting input sources. Its core methods are described as follows:
  - void configure(Configuration parameters): performs additional configuration of the input source; it is called only once during the lifetime of the InputFormat.
  - BaseStatistics getStatistics(BaseStatistics cachedStatistics): returns statistics about the input. Implementations that need no statistics may return null.
  - T[] createInputSplits(int minNumSplits): splits the input data so that it can be processed in parallel. The splitting mechanism is described below.
  - InputSplitAssigner getInputSplitAssigner(T[] inputSplits): returns the InputSplit assigner, which determines how a running task obtains its next InputSplit. Its declaration is included in the transcription after this list.
  - void open(T split): opens a data channel based on the specified InputSplit; to understand this method further, see Flinkx's JDBC and Elasticsearch implementations.
  - boolean reachedEnd(): indicates whether the end of the data has been reached. In Flink, an InputFormat data source usually represents a bounded DataSet.
  - OT nextRecord(OT reuse): retrieves the next record from the channel.
  - void close(): closes the input source.
- InputSplit: the root interface for data splits, which defines only the following method:
  - int getSplitNumber(): gets the sequence number of the current split among all splits.
  This article first briefly introduces its general implementation subclass GenericInputSplit, which holds:
  - int partitionNumber: the number of the current split.
  - int totalNumberOfPartitions: the total number of splits.
  To understand these fields, consider the following scenario: for a table with tens of millions of rows, suppose 10 threads are used to partition the data, i.e., it is cut into 10 splits; a row then belongs to the split for which id % totalNumberOfPartitions = partitionNumber. A runnable sketch of this rule follows the list.
- SourceFunction: Flink's abstract definition of a data source.
- RichFunction: a rich function that defines lifecycle methods and provides access to the runtime context.
- ParallelSourceFunction: a source function that supports parallel execution.
- RichParallelSourceFunction: a parallel source function with the rich-function lifecycle.
- InputFormatSourceFunction: the default RichParallelSourceFunction implementation provided by Flink; it can be used like any RichParallelSourceFunction, but its internal read logic is implemented by an InputFormat.
- BaseDataReader: the Flinkx data reading base class; in Flinkx, all data sources are wrapped as readers.
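Because the declarations above originally appeared as screenshots, here is an abridged transcription of the two Flink interfaces discussed. In Flink they are separate public interfaces (org.apache.flink.api.common.io.InputFormat and org.apache.flink.core.io.InputSplitAssigner); some overloads are omitted here.

```java
import java.io.IOException;
import java.io.Serializable;
import org.apache.flink.api.common.io.statistics.BaseStatistics;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.io.InputSplit;

// Abridged transcription of Flink's InputFormat contract.
interface InputFormat<OT, T extends InputSplit> extends Serializable {
    void configure(Configuration parameters);            // called once per lifetime
    BaseStatistics getStatistics(BaseStatistics cachedStatistics)
            throws IOException;                          // may return null
    T[] createInputSplits(int minNumSplits) throws IOException; // split for parallel reads
    InputSplitAssigner getInputSplitAssigner(T[] inputSplits);  // hands out the next split
    void open(T split) throws IOException;               // open a channel for one split
    boolean reachedEnd() throws IOException;             // bounded data: end reached?
    OT nextRecord(OT reuse) throws IOException;          // read the next record
    void close() throws IOException;
}

// The assigner declaration referenced in the list above.
interface InputSplitAssigner {
    // Returns the next split that the given task (running on the given host)
    // should process, or null when no splits remain.
    InputSplit getNextInputSplit(String host, int taskId);
}
```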
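And to make the id % totalNumberOfPartitions = partitionNumber rule concrete, the following runnable sketch prints the SQL query each split would use; the table name t_order, the column id, and the choice of 10 partitions are assumptions made for illustration.

```java
import org.apache.flink.core.io.GenericInputSplit;

public class ModuloSplitDemo {
    // A row belongs to a split when id % totalPartitions == partitionNumber,
    // which translates directly into a per-split WHERE clause.
    static String queryFor(GenericInputSplit split) {
        return String.format("SELECT * FROM t_order WHERE id %% %d = %d",
                split.getTotalNumberOfSplits(), split.getSplitNumber());
    }

    public static void main(String[] args) {
        int totalPartitions = 10; // e.g. 10 parallel read threads
        for (int i = 0; i < totalPartitions; i++) {
            System.out.println(queryFor(new GenericInputSplit(i, totalPartitions)));
        }
    }
}
```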
2.3 How a Flinkx Reader Builds a DataStream
Having sorted out the class diagram above, we should now have a general understanding of what these classes mean in Flink. But how are they used? Next, by walking through the readData call flow of Flinkx's DistributedJdbcDataReader (a BaseDataReader subclass), we can experience how they fit together.
The pattern is essentially to create the SourceFunction corresponding to an InputFormat, and then pass that SourceFunction to StreamExecutionEnvironment#addSource to create a DataStreamSource, as sketched below.
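This sketch is not the actual Flinkx code: the format and typeInfo parameters stand in for whatever the concrete reader builds, and the source name is invented.

```java
import org.apache.flink.api.common.io.InputFormat;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.core.io.InputSplit;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.InputFormatSourceFunction;
import org.apache.flink.types.Row;

public class ReaderWiringSketch {
    // readData-style wiring: wrap the InputFormat in an
    // InputFormatSourceFunction and register it with addSource.
    static DataStreamSource<Row> readData(StreamExecutionEnvironment env,
                                          InputFormat<Row, InputSplit> format,
                                          TypeInformation<Row> typeInfo) {
        InputFormatSourceFunction<Row> sourceFunction =
                new InputFormatSourceFunction<>(format, typeInfo);
        return env.addSource(sourceFunction, "jdbc-reader", typeInfo);
    }
}
```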
2.4 Flinkx's Task-Splitting Solution for Sharded Databases and Tables
As described in the scenario at the beginning of this article, the order system is designed with four databases and eight tables, each schema containing two tables. How can the performance of data export and extraction be improved? Common solutions are as follows:
- First, split by database and table: with 4 databases and 8 tables, the work can be divided into 8 shares, each share handling one table in one instance.
- Then split the data of a single table further, for example by taking the id modulo for further decomposition.
Flinkx does exactly that, so let's take a look at how it works.
Step 1: First, split the database instances and tables and organize them into a DataSource list along the table dimension; the split algorithm is then performed on this metadata.
The concrete task split is implemented in the InputFormat, in this example in DistributedJdbcInputFormat#createInputSplitsInternal.
Step 2: Create an InputSplit array based on numPartitions, where the concept of a partition corresponds to the first scheme mentioned above.
Step 3: If a splitKey is specified, the task split algorithm is as follows. DistributedJdbcInputSplit inherits from GenericInputSplit, and the total number of partitions is numPartitions; query parameters are then generated for each database so that splitKey mod totalNumberOfPartitions = partitionNumber, where splitKey is the shard key (such as id), totalNumberOfPartitions is the total number of partitions, and partitionNumber is the sequence number of the current partition. The data is split using SQL's modulo function.
Step 4: If no table-level split key is specified, the split strategy is to divide the sourceList itself, i.e., each partition processes a subset of the tables. A condensed sketch of these steps follows.
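This sketch paraphrases the Step 2 through Step 4 behavior rather than reproducing Flinkx's source: DataSource and JdbcSplit are hypothetical stand-ins for Flinkx's internal types.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Hypothetical holder for one database/table pair (the Step 1 metadata).
    record DataSource(String jdbcUrl, String table) {}

    // One split: the tables a subtask reads, plus an optional row filter.
    record JdbcSplit(int partitionNumber, int totalPartitions,
                     List<DataSource> sources, String filter) {}

    static List<JdbcSplit> createInputSplits(List<DataSource> sourceList,
                                             int numPartitions, String splitKey) {
        List<JdbcSplit> splits = new ArrayList<>();
        if (splitKey != null && !splitKey.isEmpty()) {
            // Step 3: every partition scans every table, but only the rows where
            // splitKey % numPartitions == partitionNumber (modulo applied in SQL).
            for (int p = 0; p < numPartitions; p++) {
                String filter = String.format("%s %% %d = %d", splitKey, numPartitions, p);
                splits.add(new JdbcSplit(p, numPartitions, sourceList, filter));
            }
        } else {
            // Step 4: no split key, so distribute whole tables round-robin;
            // each partition then handles a few of the tables.
            List<List<DataSource>> buckets = new ArrayList<>();
            for (int p = 0; p < numPartitions; p++) {
                buckets.add(new ArrayList<>());
            }
            for (int i = 0; i < sourceList.size(); i++) {
                buckets.get(i % numPartitions).add(sourceList.get(i));
            }
            for (int p = 0; p < numPartitions; p++) {
                splits.add(new JdbcSplit(p, numPartitions, buckets.get(p), null));
            }
        }
        return splits;
    }
}
```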
That’s it for task splitting in Flinkx.
3. Summary
This article mainly introduced how Flinkx splits tasks across a sharded MySQL database based on Flink, and briefly introduced Flink's basic programming paradigm along with the class systems around InputFormat and SourceFunction.
Tips: this article does not study the Flink API in depth; follow-up articles will analyze Flink's internals one by one. The Flink series will not be organized sequentially, however; the author will analyze Flink topics in the course of continued practice.
Well, that's all for this article. Your likes and retweets are the biggest encouragement for me to keep producing high-quality articles.
Welcome to add the author's WeChat ID (DINGwPMZ) to join the technical discussion group. The author's quality column catalog: 1. RocketMQ source analysis (40+ articles); 2. Sentinel source analysis (12+ articles); 3. Dubbo source analysis (28+ articles); 4. MyBatis source analysis; 5. Netty source analysis (18+ articles); 6. JUC source analysis; 7. MyCat source analysis.