I wonder whether you have ever run into requirements like these:

  • The same type of data lives in multiple systems, so getting the full picture means calling the interfaces of several systems one after another;
  • The business is complex, and a single requirement needs to join several or even dozens of tables to produce the desired result;
  • The system has been split into multiple databases and tables (sharded), yet statistics still need to be run across all of the data.

So how can such needs be met? We chose to integrate the data in advance through ETL.

What is ETL?

Many developers may not be familiar with ETL, since it is mostly used in roles related to big data and data analysis. I only came into contact with it during my work in recent years; my current project relies on ETL, and it is an important part of the project.

ETL stands for three words:

  • Extraction: pulling the data out of the source database;
  • Transformation: including but not limited to data screening and verification, data association, changes to data content and structure, calculation, statistics, and so on;
  • Loading: saving the processed data into the target database.

These three words basically capture what ETL does: the data of the various business systems is extracted, cleaned and converted, and the processed data is then landed in a target database (data warehouse). In the process, ETL brings together data that is scattered, messy and non-standardized.
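
Below is a minimal sketch of the three steps in Python. It assumes the pymysql driver and uses hypothetical table names (orders in the source system, dw_orders in the warehouse) and placeholder connection details; it is only an illustration, not the actual pipeline.

```python
import pymysql

# Placeholder connections: a read-only user on the source system and a writer on the warehouse.
source = pymysql.connect(host="source-db", user="etl_ro", password="...", database="biz")
warehouse = pymysql.connect(host="dw-db", user="etl_rw", password="...", database="dw")

def extract():
    # Extraction: pull the raw rows out of the business database.
    with source.cursor(pymysql.cursors.DictCursor) as cur:
        cur.execute("SELECT id, customer_id, amount FROM orders")
        return cur.fetchall()

def transform(rows):
    # Transformation: screen, verify and reshape the extracted rows.
    cleaned = []
    for row in rows:
        if row["amount"] is None:        # drop rows that fail verification
            continue
        row["amount"] = round(float(row["amount"]), 2)
        cleaned.append(row)
    return cleaned

def load(rows):
    # Loading: land the processed rows in the warehouse table.
    with warehouse.cursor() as cur:
        cur.executemany(
            "REPLACE INTO dw_orders (id, customer_id, amount) VALUES (%s, %s, %s)",
            [(r["id"], r["customer_id"], r["amount"]) for r in rows],
        )
    warehouse.commit()

if __name__ == "__main__":
    load(transform(extract()))
```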


Usage scenarios

In the projects I have worked on, ETL tools are used in the following scenarios:

1. Reports and BI systems:

When a company is young, there are few lines of business and few systems, and a single database is enough. As the business grows, it gets split into many systems; and as data volumes keep increasing, once a single system's data reaches a certain scale it is further split into multiple databases and tables.

At that point, when managers and business staff want to analyze data, the data may come from multiple tables across multiple systems, and it is very hard to produce results with one complex SQL query. Companies usually build a data warehouse, extract the data into it with ETL tools, and then do the aggregation and presentation there.

2. Cross-system data processing or query:

The company I work for now has hundreds of business systems. Because the business processes are complex, the front-end system performs many business checks before a transaction is formally submitted.

For example, a customer's transaction history may need to be queried from system X, system Y and system Z, which means calling the interfaces of X, Y and Z one by one; that is very unfriendly to the front-end system. So what are the common solutions?

  • Plan A: build an intermediate service that calls the interfaces of systems X, Y and Z, and have the client call this intermediate service directly. This only shifts the burden from the front end to the intermediate service;
  • Plan B: integrate systems X, Y and Z into a shared service center (mid-platform). This is a good approach, but for many companies it is extremely difficult to merge X, Y and Z into a mid-platform system, or to refactor one of them into that role;
  • Plan C: use ETL to extract and process the data needed from systems X, Y and Z into a data warehouse, and provide query services on top of it. The biggest advantage is that cross-system queries become possible without modifying X, Y or Z at all.

We went one step further than plan C: after the data was landed, we processed it again, pre-joining the data that would otherwise have to be associated across tables and storing the result in MongoDB to serve external queries. A query that used to join multiple tables becomes a single-collection lookup (a sketch of this idea follows).
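
The sketch below illustrates the pre-join idea, assuming the pymongo driver and hypothetical collection and field names (customer_view, accounts, transactions); it is not the actual schema.

```python
from pymongo import MongoClient, UpdateOne

mongo = MongoClient("mongodb://dw-mongo:27017")   # placeholder address
db = mongo["dw"]

def merge_customer(customer, accounts, transactions):
    # Fold rows that used to live in separate tables (or separate systems)
    # into one document, so reads become a single-collection query.
    doc = {
        "_id": customer["id"],
        "name": customer["name"],
        "accounts": accounts,          # rows joined from system X
        "transactions": transactions,  # rows joined from systems Y and Z
    }
    return UpdateOne({"_id": doc["_id"]}, {"$set": doc}, upsert=True)

def load_customers(joined_rows):
    # joined_rows: an iterable of (customer, accounts, transactions) tuples
    # produced by the ETL job after landing and joining the source data.
    ops = [merge_customer(c, a, t) for c, a, t in joined_rows]
    if ops:
        db.customer_view.bulk_write(ops)

def query_history(customer_id):
    # The front end now issues one query instead of calling X, Y and Z.
    return db.customer_view.find_one({"_id": customer_id})
```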


Spit data vs. suck data

Regarding plan C in the second scenario, some readers may wonder: why go and extract the data? Why not let these systems spit it out themselves?

The answer is simple: “Sometimes, data can’t be spit out.”

For the MySQL database there is mature middleware for pushing data out, such as Canal, which obtains data by listening to the MySQL binlog. With the binlog set to ROW mode, it can capture every insert, delete and update, including the row data before and after each modification.
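
Canal itself is Java middleware; as a hedged illustration of the same binlog-listening idea, the sketch below uses the python-mysql-replication package with placeholder connection settings. It only shows the mechanism, not the actual setup described here.

```python
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent, UpdateRowsEvent, WriteRowsEvent,
)

stream = BinLogStreamReader(
    connection_settings={"host": "source-db", "port": 3306,
                         "user": "repl", "passwd": "..."},   # placeholder credentials
    server_id=100,                 # must be unique among the source's replicas
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
    blocking=True,
)

for event in stream:
    for row in event.rows:
        if isinstance(event, UpdateRowsEvent):
            # ROW mode carries both the before- and after-image of the row.
            print("before:", row["before_values"], "after:", row["after_values"])
        elif isinstance(event, WriteRowsEvent):
            print("insert:", row["values"])
        else:
            print("delete:", row["values"])
```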

Other commercial databases, such as Oracle and DB2, also have trigger mechanisms that can send a notification when data changes, for example by invoking a program that publishes the data to a message queue, where other programs consume it for follow-up processing.

Whatever the database, this kind of "spit data" scheme places relatively high demands on the infrastructure and is somewhat invasive to the original system. So we adopted an approach that is less intrusive to the source systems: actively extracting the data ourselves.

Advantages and disadvantages of the ETL scheme

1. Advantages

  • The source system only needs to grant database access. To limit the impact of extraction on the live service, we usually read from the source system's standby (replica) database and set up a dedicated read-only database user.
  • Different types of data sources are supported: whether the source database is MySQL, DB2 or Oracle, ETL can extract from it easily;
  • Data integration: the same data from different business systems gets unified. For example, some systems use M/F for male/female while others use 1/0; after extraction and processing, ETL converts them into one unified code (see the sketch after this list).
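
As a small, hedged example of that last point, the mapping below unifies two hypothetical source encodings of gender during the Transform step; the system names and target values are made up for illustration.

```python
# Per-system code tables for the same business concept (gender).
GENDER_MAP = {
    "system_a": {"M": "male", "F": "female"},   # hypothetical system using M/F
    "system_b": {"1": "male", "0": "female"},   # hypothetical system using 1/0
}

def normalize_gender(system, raw_value):
    # Convert a system-specific code into the warehouse's unified code;
    # unknown values pass through unchanged so they can be flagged later.
    return GENDER_MAP.get(system, {}).get(raw_value, raw_value)

assert normalize_gender("system_a", "M") == normalize_gender("system_b", "1") == "male"
```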

2. Disadvantages

  • The most serious shortcoming is that extraction and processing introduce a certain delay; each business scenario needs to be evaluated to decide whether that delay is acceptable.
  • It may be affected by changes to the structure of the source tables;
  • If the source tables have no timestamp column, or the timestamps are unreliable, incremental extraction becomes difficult (see the watermark sketch after this list);
  • Dedicated ETL developers need to be hired, and in my experience they are not easy to recruit.
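
To make the timestamp point concrete, here is a hedged sketch of watermark-based incremental extraction. It assumes pymysql, a hypothetical orders table with a reliable updated_at column, and a local file for persisting the watermark between runs.

```python
import json
import pymysql

WATERMARK_FILE = "orders.watermark.json"   # hypothetical watermark store

def read_watermark():
    try:
        with open(WATERMARK_FILE) as f:
            return json.load(f)["last_extracted_at"]
    except FileNotFoundError:
        return "1970-01-01 00:00:00"       # first run: extract everything

def save_watermark(ts):
    with open(WATERMARK_FILE, "w") as f:
        json.dump({"last_extracted_at": ts}, f)

def extract_increment(conn):
    # Only rows touched since the last run are pulled; this is exactly what
    # breaks down when updated_at is missing or not maintained reliably.
    watermark = read_watermark()
    with conn.cursor(pymysql.cursors.DictCursor) as cur:
        cur.execute(
            "SELECT id, customer_id, amount, updated_at FROM orders "
            "WHERE updated_at > %s ORDER BY updated_at",
            (watermark,),
        )
        rows = cur.fetchall()
    if rows:
        save_watermark(str(rows[-1]["updated_at"]))
    return rows
```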

The uncle who can code | original article


Please follow "the uncle who can code" for more.