Preface
Yesterday someone in the group suddenly asked:
When I first heard of ELT, I was surprised too. I thought about it briefly and set it aside. Hearing it again today, it barely registered.
Then someone offered the most succinct explanation possible:
Just swap the order, right?
And the asker was dumbfounded: that's it?
Someone else chimed in:
Er… strictly speaking, ETL and ELT really are just the same letters in a different order:
ETL is Extract, Transform, Load;
ELT is Extract, Load and Transform.
Doesn't that make you feel like data warehouse folks spend all day playing tricks, coining new terms just to fool people?
Er… if you think that, you're underestimating us data warehouse people, and underestimating architecture itself. Let me explain ETL and ELT properly.
You can look down on me, but you can't look down on my profession!
Back in the day…
Old-school data warehouse folks ran projects in an orderly, methodical way.
We would typically start from the business systems and get to know the data structures of every data source.
At the same time, we would study the business processes to see how the business actually runs and what traces it leaves in the systems; cross-checking the two makes the logic clear.
By this point we already know a great deal. Experienced people have the users' needs roughly figured out in their heads and have started designing reports.
The next logical step is to capture user requirements and plan for ad hoc queries, multidimensional analytics, fixed reports, dashboards, and so on.
Then come the subject areas, the layering, and the logical models, all carried out with great gusto.
If you recall, I once wrote down the steps of data warehouse construction:
Note the last step in the figure above, "physical modeling": that is where we really start building tables in the data warehouse at scale, i.e. implementation.
And after that? ETL. Data is extracted from the business systems into the ODS, and then it works like an assembly line: run one Transform, put the result in a box; run the next step, put that in a box too.
Those boxes are the warehouse layers: DWD, DWB, DWS, DM, and so on. Data is processed one layer at a time, loaded into that layer of the warehouse, processed again for the next layer, and loaded onward.
So the whole flow of data processing is ETL: Extract, Transform, Load.
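As a toy illustration of that assembly line, here is a minimal sketch in plain Python. The layer names (ODS, DWD, DWS) come from the text above; every function, field, and the `warehouse` dict are invented for illustration, not any real tool's API.

```python
# A minimal sketch of the ETL "assembly line": each layer's transform runs,
# and the result is loaded into that layer's storage before the next step.
# All names here are illustrative.

warehouse = {}  # stands in for the layered warehouse storage

def extract_from_business_system():
    """E: pull raw records out of the source system."""
    return [
        {"order_id": 1, "amount": "19.9", "status": "PAID"},
        {"order_id": 2, "amount": "5.0",  "status": "CANCELLED"},
    ]

def to_dwd(rows):
    """T for the detail layer: clean types, keep only valid orders."""
    return [
        {"order_id": r["order_id"], "amount": float(r["amount"])}
        for r in rows if r["status"] == "PAID"
    ]

def to_dws(rows):
    """T for the summary layer: aggregate on top of DWD."""
    return {"paid_orders": len(rows), "revenue": sum(r["amount"] for r in rows)}

# The assembly line: extract, load into ODS, then transform-and-load per layer.
warehouse["ods"] = extract_from_business_system()   # E, then L into ODS
warehouse["dwd"] = to_dwd(warehouse["ods"])         # T, then L into DWD
warehouse["dws"] = to_dws(warehouse["dwd"])         # T, then L into DWS

print(warehouse["dws"])  # {'paid_orders': 1, 'revenue': 19.9}
```

Notice how each layer reads only from the layer just below it; that is exactly what makes the chain efficient when fixed, and painful to change later.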
The biggest advantage of an assembly line is that, as long as the processing steps stay fixed, it is the fastest and cheapest thing to build. Once built, it only needs maintenance.
I have a few friends who are data directors at ordinary companies, and once the data platform is built, the team relaxes. Their day is basically checking whether any jobs failed and handling some simple report maintenance.
As you can see, ETL is well suited to scenarios where the requirements are clear and the business is fixed.
But then why ELT?
The Qing dynasty fell…
Very simple: because the Qing dynasty fell! Times have changed.
ETL works very well: everything is automated, and the data ends up neatly arranged in the warehouse for everyone to use.
ETL is also very simple, mostly visual and low-code; you just design the flow.
And ETL is cheap. You build it once, never reinvest, and just check the nightly batch for problems, which is why so much of the staffing shifts to operations.
But ETL has one fatal flaw: the pipeline is too long, too cumbersome, and far too expensive to change!
In any case, I never want to modify someone else's ETL; it is too painful. I don't even want to touch the ones I wrote myself. Because ETL programs usually do the T and the L together, the logic inside a single program is often very, very complicated.
Here's a simple example:
Lidong from DolphinScheduler also sent me a more artistic one:
Complicated, isn't it? That's nothing. Let me show you an even more complicated one:
Doesn't look like much? Fewer nodes than the one above? That's only because a single screenshot can't fit them all!
The brother who provided this image, "Across the Newborn," tells me there are over 1,100 nodes, and he designed it all himself!
But I just want to ask him: does he dare touch it now? Well, does he? Heh heh~
He doesn't dare! So what is he most afraid of? Requirement changes, and changes to the source business databases!
If the business or the underlying data changes, the ETL flows have to change with them. Simple logic is easy enough to handle, but in a situation like brother "Across the Newborn"'s, never mind anything else, just locating the right nodes could be the death of you!
So ETL is simple to develop, but its maintenance cost is prohibitively high! The complexity is incredible, and the work is incredibly hard!
Frequent business changes, combined with ETL's time cost and throughput bottlenecks, mean that ETL-style data processing can no longer keep up with how enterprises develop today.
So what can be done?
Change!
Is there a way? Of course there is!
But how?
Turn ETL into ELT!
Right, but that alone doesn't get to the point.
Instead of simply reversing the process, we have to go back to the root of the ETL problem.
The complexity of ETL is due to the tight coupling between Transform and Load.
Think about it in the simplest architectural terms: what is the simplest way to make something complicated simple?
One word: split!
Separate the Transform from the Load, so the processing part focuses on computing and the movement part focuses on moving. Don't mix the two.
So the ETL tool is broken up into a data mover, a compute engine, and a scheduling engine.
The mover moves data and performs no processing whatsoever;
The compute engine processes data and cares about nothing else;
The scheduling engine orchestrates the flow and does nothing else.
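To make the split concrete, here is a toy sketch in plain Python. The three classes mirror the mover / compute engine / scheduler roles described above; all names are invented for illustration and do not come from any real tool.

```python
# A toy sketch of the split: each piece knows only its own job.

class Mover:
    """Moves data between stores verbatim; never inspects or reshapes it."""
    def move(self, src, dst, key):
        dst[key] = src[key]

class ComputeEngine:
    """Runs a pure transform; knows nothing about storage or scheduling."""
    def run(self, transform, data):
        return transform(data)

class Scheduler:
    """Orchestrates the flow; owns the order of steps, nothing else."""
    def __init__(self, mover, engine):
        self.mover, self.engine = mover, engine

    def run_pipeline(self, source, lake, transform):
        self.mover.move(source, lake, "raw")                     # L: land as-is
        lake["clean"] = self.engine.run(transform, lake["raw"])  # T: afterwards

source = {"raw": [3, 1, 2]}
lake = {}
Scheduler(Mover(), ComputeEngine()).run_pipeline(source, lake, sorted)
print(lake["clean"])  # [1, 2, 3]
```

Because the three roles only meet inside the scheduler, any one of them can be swapped out (the mover for Kafka, the engine for Spark) without touching the others.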
Some people will ask: then where did the ETL tool go? Here we need to separate ETL the process from ETL tools; in this article, ETL refers specifically to the data processing process.
Any of the roles above can still be played by an ETL tool. After all, ETL tools can do all three jobs.
Since the workflow has been broken apart, the process naturally changes too. The first step stays the same, but everything after it differs, and the whole thing now looks like this:
It is important to note that the above architecture is just an illustration, and all the specific components are interchangeable.
For example, for extraction you can use an ETL tool or Kafka; the wonderful thing about Kafka is that it can swallow as much data as you feed it.
The ODS can be a warehouse ODS or a data lake. Some shops even use Kafka for it, which is rather exotic.
Loading can likewise be done with Kafka or any ETL tool;
For the compute engine you can use Spark, Flink, or MapReduce, anything that can run the jobs, and the output can go straight to HBase, or be thrown at Kafka or Redis.
Compare the two diagrams and you can see the difference between ETL and ELT.
What are the benefits of ELT?
The fundamental change is that E, T, and L are completely decoupled. Decoupling brings many benefits: performance bottlenecks can be isolated, programs get simpler, components can be swapped, maintenance costs drop, and more.
Most important, though, is the extreme flexibility decoupling brings, which is what lets you adapt to today's complex and fast-changing market.
In such an environment, the traditional ETL routine for processing data is a terrible fit.
By the time you have slowly carved out subject areas, extracted entities, built models, written all the code, debugged it, and finally produced a report, the business side walks over and says: buddy, our playbook has changed, the app has been through three more iterations, here are the new requirements. Where do you even go to cry?
That is why, on new business lines at many big tech companies, modeling is de-emphasized and speed is prioritized. Underneath, that is exactly the logic of ELT.
The data goes straight into the lake; you write a script, throw it at Spark, drag a wide table into the serving database, and hook it up to a reporting tool.
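The ELT shortcut in miniature: land the raw data first, decide on the transform later, and produce one denormalized "wide table" for reporting. In practice the transform would be a Spark or Flink job; this plain-Python sketch only shows the shape, and every table and field name is made up.

```python
# Step 1 (E + L): raw tables land in the "lake" untouched, no modeling yet.
lake = {
    "raw_users":  [{"uid": 1, "city": "Beijing"},
                   {"uid": 2, "city": "Shanghai"}],
    "raw_orders": [{"uid": 1, "amount": 30.0},
                   {"uid": 1, "amount": 12.5},
                   {"uid": 2, "amount": 8.0}],
}

# Step 2 (T, whenever the requirement actually shows up):
# join the raw tables into one wide table for the reporting tool.
def build_wide_table(users, orders):
    spent = {}
    for o in orders:
        spent[o["uid"]] = spent.get(o["uid"], 0.0) + o["amount"]
    return [
        {"uid": u["uid"], "city": u["city"],
         "total_spent": spent.get(u["uid"], 0.0)}
        for u in users
    ]

wide = build_wide_table(lake["raw_users"], lake["raw_orders"])
print(wide[0])  # {'uid': 1, 'city': 'Beijing', 'total_spent': 42.5}
```

Nothing about the raw tables had to be modeled up front; if the requirement changes tomorrow, only `build_wide_table` is rewritten and the landed data stays put.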
Which reminds me of a question a Baidu engineer raised on Maimai:
In fact, those folks are doing ELT, not ETL. ETL requires modeling up front; ELT is flexible enough to just drag out a wide table.
Conclusion
Times keep changing, and so does technology. What we can do is keep learning, keep improving, and keep thinking deeply. Follow me, and we'll walk this road together!
Thanks for reading; that's it for this share. You're also welcome to add me on WeChat for a private chat.
Recommended reading:
What is the difference between traditional data warehouse and big data warehouse?
Did Ali brag about his Big Data Road?
Is the data warehouse dead? Long live the data lake!
I understand offline data Join, but how to Join real-time data?
Why Kafka? What about other MQ?