Data extraction is the process of retrieving the required data from source systems, and it is the first step in building a data platform. The data source is usually a relational database. In recent years, with the rapid growth of the mobile Internet, other types of data sources have become common as well, such as website access logs, APP browsing logs, and IoT device logs. In terms of technical implementation, retrieving data from a relational database can be subdivided into two approaches, full extraction and incremental extraction, which suit different business scenarios.

Incremental extraction

  • Timestamp mode
Extracting incremental data by timestamp is the most common approach. The business system adds a timestamp field to the source table and updates its value whenever a record is created or modified. When the extraction task runs, it scans the whole table and selects records by comparing the timestamp field against the task's business time. In terms of accuracy, this kind of synchronization has two drawbacks: 1. Only the latest state is captured; intermediate state changes are lost. For example, in an e-commerce scenario, if a customer pays shortly after placing an order, the next day's incremental extraction only sees the latest payment status, and the "ordered" state may be lost. To address this, whether historical states need to be backtracked must be decided according to business requirements. 2. Deleted records are lost. If the business system physically deletes records, incremental extraction cannot detect them; the business system is generally required to mark records as deleted rather than remove them.
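As a minimal sketch of the timestamp comparison described above (the table `t_order` and its columns are hypothetical), an extraction task scheduled for the business date 2023-01-01 might issue:

```sql
-- Hypothetical source table t_order with a gmt_modified timestamp column:
-- select only the rows created or changed during the business day being extracted
SELECT id, status, gmt_create, gmt_modified
FROM t_order
WHERE gmt_modified >= '2023-01-01 00:00:00'
  AND gmt_modified <  '2023-01-02 00:00:00';
```

A half-open interval is used so that rows falling exactly on the day boundary are extracted exactly once.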

Business system maintains the timestamp. With traditional relational databases such as Oracle and DB2, the business system must maintain the timestamp field itself: whenever business data is updated, the application code also updates the timestamp. This approach is common, but because it relies on hand-written code, it can involve significant work and carries the risk of missed updates.

Triggers maintain the timestamp. Mainstream relational databases support triggers: whenever a database record changes, a specific function is invoked to update the timestamp field. A typical example is as follows:
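A sketch in Oracle PL/SQL (the table `t_order` and the column `gmt_modified` are hypothetical names used for illustration):

```sql
-- Hypothetical Oracle trigger: refresh gmt_modified whenever a row is inserted or updated
CREATE OR REPLACE TRIGGER trg_t_order_gmt_modified
BEFORE INSERT OR UPDATE ON t_order
FOR EACH ROW
BEGIN
  :NEW.gmt_modified := SYSTIMESTAMP;
END;
/
```

Because the trigger fires inside the database, the timestamp stays correct even if some application code path forgets to set it.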



MySQL can maintain the changed-timestamp field automatically, which reduces the development workload to a certain extent. When a record is created, the database fills in the timestamp on its own; when an existing record is updated, the database automatically refreshes the timestamp field, with no application code involved.



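As a sketch, assuming a hypothetical table `t_order` with a `gmt_modified` column, the MySQL DDL and the create/update statements might look like:

```sql
-- Hypothetical table: MySQL fills gmt_modified on insert and refreshes it on update
CREATE TABLE t_order (
  id           BIGINT PRIMARY KEY AUTO_INCREMENT,
  status       VARCHAR(32),
  gmt_create   DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
  gmt_modified DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
                        ON UPDATE CURRENT_TIMESTAMP
);

-- Create a record: gmt_create and gmt_modified are set automatically
INSERT INTO t_order (status) VALUES ('created');

-- Update the record: the database advances gmt_modified on its own
UPDATE t_order SET status = 'paid' WHERE id = 1;
```

The `ON UPDATE CURRENT_TIMESTAMP` clause is what removes the need for triggers or application code.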
  • Parsing the MySQL binlog
In recent years, with the booming development of the Internet, Internet companies have generally adopted MySQL as their primary database. Because it is open source, many companies have customized it; one of the most important extensions is read/write splitting and real-time master/slave synchronization, implemented by subscribing to the MySQL binlog. The typical architecture diagram is as follows:



Binlog parsing provides a new method for data synchronization: the parsed change records are pushed to big data platforms such as Hive and MaxCompute, achieving data synchronization with second-level latency.

The binlog-based incremental synchronization method is technically more advanced and has three major advantages: 1. Low data latency: in Alibaba's Double 11 scenario, second-level latency can be achieved even under enormous data volumes. 2. Deletions are captured, so no data is lost. 3. No extra requirements are imposed on the business table; even the timestamp field may be absent.

Of course, this approach also has downsides: 1. The technical bar is high. Most companies lack the technical depth to build the whole system themselves; in China, this is currently limited to leading Internet companies and large state-owned and central enterprises. However, with the rapid development of cloud computing, Alibaba Cloud has opened up the relevant tools and services, so real-time synchronization can be achieved directly; a classic combination is MySQL, DTS, DataHub, and MaxCompute. 2. The resource cost is high: a dedicated system is required to receive the business database's binlog in real time. 3. A primary key is required in the business table for data ordering.
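Whichever subscriber is used (DTS, Canal, and the like), the source MySQL instance generally needs row-format binlog enabled, and the subscriber connects like a replica. A sketch of the checks, assuming a hypothetical replication account named `sync_user`:

```sql
-- The binlog must be enabled and in ROW format for change-data capture
SHOW VARIABLES LIKE 'log_bin';        -- expected: ON
SHOW VARIABLES LIKE 'binlog_format';  -- expected: ROW

-- A binlog subscriber connects as a replica, so it needs replication privileges
GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'sync_user'@'%';
```

ROW format is important because it records the full before/after image of each changed row, rather than the SQL statement that caused the change.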

  • Parsing Oracle redo logs
Oracle is a powerful commercial database. Oracle GoldenGate can parse its redo logs in real time and publish the parsed results to specified target systems.

Full extraction

Full extraction retrieves an entire table or view from the source database and writes it to big data platforms such as Hive and MaxCompute, much like a data migration between business databases. Full synchronization is simple and is usually used in offline synchronization scenarios with small data volumes. However, it shares the same two drawbacks as offline incremental synchronization: 1. Only the latest state is captured, and intermediate state changes are lost. 2. Records that have been deleted are lost.
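A minimal sketch in Hive-style SQL (the table names `src_order` and `dw_order_full` are hypothetical): a daily full extraction simply overwrites the day's full partition with the entire source table.

```sql
-- Hypothetical daily full extraction: the whole source table is written
-- into a dated full partition of the warehouse table
INSERT OVERWRITE TABLE dw_order_full PARTITION (ds = '20230101')
SELECT id, status, gmt_create, gmt_modified
FROM src_order;
```

Each day gets its own partition, so earlier snapshots remain queryable while the current day always reflects the full source table.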

Business database table synchronization strategy

  • Synchronization architecture diagram. From a business perspective, offline data table synchronization can be divided into four scenarios. The overall architecture diagram is as follows:




In principle, only data mirroring is recommended when synchronizing data to the cloud; business-related data transformation should not be performed during synchronization. There are three motivations for moving from an ETL strategy to ELT: 1. Machine cost: transformation outside the warehouse requires additional machines and brings new cost. 2. Communication cost: business-system developers and data-platform users are familiar with the original business tables; extra transformation forces them to learn additional tools and products. 3. Execution efficiency: transformation machines outside the warehouse generally perform worse than MaxCompute and Hadoop clusters, which increases execution time. During synchronization, it is advisable to sync all fields of a table to the cloud to reduce the cost of later changes.

  • Small data table: the source data is fully refreshed daily; it is extracted in full over a direct database connection and written into a daily/monthly full partition table.
  • Log table: the original log increments are extracted into a daily incremental table and stored as daily increments. Because log data is append-only and never modified, there is no need to maintain a full table.
  • Large data table: in direct-connection mode, incremental data is selected by the business timestamp into the current day's incremental partition, which is then merged with the previous day's full partition and written into the current day's full partition.
  • Hourly/minute-level incremental table: when the source data is updated more frequently, at the hour or minute level, incremental data is extracted from the source database by timestamp into hourly/minute incremental partitions; the day's hourly/minute incremental partitions are merged into the daily incremental partition, which is then merged with the previous day's full partition and written into the current day's full partition table.
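The merge step shared by the last two scenarios can be sketched in Hive-style SQL (table and column names are hypothetical): today's increment is full-outer-joined with yesterday's full partition, and where a key appears in both, the newer incremental record wins.

```sql
-- Merge yesterday's full partition with today's increment
-- into today's full partition; incremental values take precedence
INSERT OVERWRITE TABLE dw_order_full PARTITION (ds = '20230102')
SELECT
  COALESCE(i.id, f.id)                     AS id,
  COALESCE(i.status, f.status)             AS status,
  COALESCE(i.gmt_modified, f.gmt_modified) AS gmt_modified
FROM (SELECT * FROM dw_order_full WHERE ds = '20230101') f
FULL OUTER JOIN
     (SELECT * FROM dw_order_inc  WHERE ds = '20230102') i
  ON f.id = i.id;
```

This is why a primary key on the business table matters: without a stable join key, incremental records cannot be matched against the previous full snapshot.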
The Alibaba data platform team is committed to delivering Alibaba Cloud's best practices in data intelligence, helping every enterprise build its own data platform and jointly usher in a new era of intelligent business. The core products of the Alibaba data solution: Dataphin, driven by Alibaba's core big data methodology OneData, provides one-stop data construction and management capabilities; Quick BI, drawing on Alibaba's data analytics experience, provides one-stop data analysis and presentation capabilities; Quick Audience, combining Alibaba's consumer insight and marketing experience, provides one-stop audience selection, insight, and marketing capabilities, connecting to the Alibaba ecosystem to drive user growth.






This article is original content of the Yunqi community (Alibaba Cloud) and may not be reproduced without permission.