When real-time data was first introduced, it was called “streaming data”. I strongly suspect that the scenarios where real-time data truly fits are quite limited. Let’s just talk about Join, a very common data operation: how do you implement it?
What about Join?
In a relational database, the principle behind joining two tables is actually fairly easy to understand. The simplest way to think about it is this:
To Join a primary table with a secondary table, each record of the primary table is looked up and matched against each record of the secondary table. Someone may ask: isn’t it enough to stop at the first match? Of course not; there may be many-to-many relationships, so every record has to be checked.
Once the matching is done, the matched results are combined and output. That is the principle of a full-table Join. It’s simple and easy to understand, but far too expensive! It amounts to N full scans of the secondary table, one for every record in the primary table.
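To make this concrete, here is a minimal sketch of a nested-loop Join in plain Java; the Row type and its fields are illustrative, not taken from any particular engine.

```java
import java.util.ArrayList;
import java.util.List;

public class NestedLoopJoin {
    // Illustrative row type: a join key plus one payload column.
    record Row(int id, String value) {}

    // Naive full-table Join: one full scan of the secondary table for
    // every primary row. We collect every match rather than stopping at
    // the first, because the relationship may be many-to-many.
    static List<String> join(List<Row> primary, List<Row> secondary) {
        List<String> out = new ArrayList<>();
        for (Row p : primary) {
            for (Row s : secondary) {
                if (p.id() == s.id()) {
                    out.add(p.value() + " | " + s.value());
                }
            }
        }
        return out;
    }
}
```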
This is rarely seen in practice, though, because experienced data engineers use a second method:
Create an index on the secondary table’s Join field. That way, when the primary table looks for matching data, there is no full scan: the index is probed directly and the matching rows are fetched. That’s fast!
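The same idea as the sketch above, but with a hash map standing in for the index (again purely illustrative, reusing the same Row shape):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IndexJoin {
    record Row(int id, String value) {}

    // Index Join: build a hash index on the secondary table's join field
    // once, then probe it for each primary row. The inner full scan
    // becomes a constant-time lookup.
    static List<String> join(List<Row> primary, List<Row> secondary) {
        Map<Integer, List<Row>> index = new HashMap<>();
        for (Row s : secondary) {
            index.computeIfAbsent(s.id(), k -> new ArrayList<>()).add(s);
        }
        List<String> out = new ArrayList<>();
        for (Row p : primary) {
            for (Row s : index.getOrDefault(p.id(), List.of())) {
                out.add(p.value() + " | " + s.value());
            }
        }
        return out;
    }
}
```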
But what about tables that simply have no index? Then we reach for every data engineer’s favorite trick: add an intermediate layer. Here, a Join Buffer:
Primary-table rows are read into the buffer in batches, so the secondary table is scanned once per batch instead of once per row. That is at least much faster than a full scan on both sides for every row.
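A sketch of the idea, in the same illustrative style as above (a block nested-loop Join):

```java
import java.util.ArrayList;
import java.util.List;

public class BufferedJoin {
    record Row(int id, String value) {}

    // Block nested-loop Join with a Join Buffer: primary-table rows are
    // read into the buffer in chunks, and the secondary table is scanned
    // once per chunk instead of once per row. With a buffer of B rows,
    // secondary-table scans drop from N to ceil(N / B).
    static List<String> join(List<Row> primary, List<Row> secondary, int bufferSize) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < primary.size(); i += bufferSize) {
            List<Row> buffer =
                primary.subList(i, Math.min(i + bufferSize, primary.size()));
            for (Row s : secondary) {          // one scan per buffer-load
                for (Row p : buffer) {
                    if (p.id() == s.id()) {
                        out.add(p.value() + " | " + s.value());
                    }
                }
            }
        }
        return out;
    }
}
```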
As you may have noticed, though, the primary table is crucial here. If the primary table is also large, the efficiency impact is still significant, and if the ids in the primary table are repeated, it becomes even more taxing.
Therefore, several Join optimization principles fall out of this:
- Let the small table drive the large table, to reduce the number of outer-loop iterations over the primary table;
- Make sure the secondary table’s Join field has an index, to reduce the cost of matching;
- Where there really is no index, make the Join Buffer as large as possible.
In a big data environment such as Hive, techniques such as bucketing can be used to speed up Joins further.
At this point you may say: this is all offline Join, which we can understand. Offline data is already sitting in storage and can be Joined any way you like. But real-time data flows: you want to Join on an ID, and the matching record in the secondary table simply hasn’t arrived yet. How are you supposed to Join? Yes, it’s sad!
How do you Join real-time data?
Real-time data is also known as streaming data. As the name implies, the data keeps flowing: you never know when the next record will arrive, and you cannot sort and match a whole table. So there is no way to Join directly, as with offline data, and things do get a bit more complicated. Since we cannot Join directly, can we treat different situations differently? Of course we can! Let’s analyze them.
When we Join, there are basically these cases:
1. Joining a business table with a dimension table;
2. Joining a business table with another business table:
2.1 a large table with a small table;
2.2 a large table with a large table;
2.3 current data with current data;
2.4 current data with historical data.
You see, that basically covers everything, doesn’t it? All right, let’s break the cases down one by one and see what we can do.
Joining a business table with a dimension table
Real-time data generally travels through and is buffered on an MQ, with Spark Streaming or Flink as the computing engine, so the Join happens inside Spark Streaming or Flink.
In that case, can we first load the dimension tables that need to be Joined into a storage medium with very fast lookups, such as Redis or HBase, and then Join each incoming record against that store from within the computing engine?
Of course we can! That is the stream-to-dimension-table Join scheme.
There is one small wrinkle: dimension tables change. But that is easy to solve: update the Redis copy in real time as well, turning it into a real-time dimension table, and the problem goes away.
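A minimal sketch of what this can look like in a Flink job, assuming a Jedis client and a separate sync job that keeps the dimension table fresh in Redis; the event types, key scheme, and connection settings are all illustrative assumptions.

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import redis.clients.jedis.Jedis;

// Hypothetical event types, for illustration only.
class OrderEvent { long productId; double amount; }
class EnrichedOrder {
    OrderEvent order; String productName;
    EnrichedOrder(OrderEvent o, String n) { order = o; productName = n; }
}

// Each streaming record is enriched by looking up its dimension
// attributes in Redis, which holds a continuously updated copy of
// the dimension table.
public class DimensionJoin extends RichMapFunction<OrderEvent, EnrichedOrder> {
    private transient Jedis jedis;

    @Override
    public void open(Configuration parameters) {
        jedis = new Jedis("redis-host", 6379);  // assumed connection settings
    }

    @Override
    public EnrichedOrder map(OrderEvent event) {
        // The key scheme "dim:product:<id>" is an assumption for this sketch.
        String name = jedis.hget("dim:product:" + event.productId, "name");
        return new EnrichedOrder(event, name);
    }

    @Override
    public void close() {
        if (jedis != null) jedis.close();
    }
}
```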
Joining a business table with another business table
There are four sub-cases here, so let’s start with a large table and a small table. Since one side is small, this case is easier to handle. Our machines don’t lack memory: assuming a server with 32 GB of RAM, is sparing 1 GB for the small table acceptable? The impact won’t be huge. So put a copy of the small table on the same node as the large table; when the large table Joins it, the lookup happens directly in memory.
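A sketch of that pattern as a Flink map function, where the whole small table is loaded into each task’s memory up front; loadSmallTable() and the row types are illustrative stand-ins.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

// Hypothetical record types for the sketch.
class BigRow { long key; String data; }
class JoinedRow {
    BigRow big; String smallData;
    JoinedRow(BigRow b, String s) { big = b; smallData = s; }
}

// The small table is loaded once into each task's memory, so joining
// against it is a local in-memory probe with no per-record I/O.
public class SmallTableJoin extends RichMapFunction<BigRow, JoinedRow> {
    private transient Map<Long, String> smallTable;

    @Override
    public void open(Configuration parameters) {
        smallTable = loadSmallTable();  // stand-in for reading the small table once
    }

    @Override
    public JoinedRow map(BigRow row) {
        return new JoinedRow(row, smallTable.get(row.key));  // in-memory probe
    }

    private Map<Long, String> loadSmallTable() {
        Map<Long, String> m = new HashMap<>();
        // ... fill from a file, a JDBC query, etc. (omitted)
        return m;
    }
}
```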
What about Joining a large table with a large table? There is no clever way out: you fall back to a regular full Join, with all the data from both sides kept in state. This option occupies the most resources, but sometimes it is the only one left.
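For example, a regular streaming join in Flink SQL does exactly this; here is a sketch, assuming the source tables have been declared elsewhere with CREATE TABLE statements (the table and column names are made up):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class RegularJoinSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // A regular streaming join keeps every row of both inputs in
        // state; a state TTL bounds how long old rows are retained.
        tEnv.getConfig().getConfiguration()
            .setString("table.exec.state.ttl", "1 d");

        tEnv.executeSql(
            "SELECT o.order_id, o.amount, s.ship_time " +
            "FROM orders AS o JOIN shipments AS s " +
            "ON o.order_id = s.order_id");
    }
}
```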
For Joining current data with current data, both Spark Streaming and Flink have solutions. Flink uses window logic: once the data from both sides has arrived inside the same window, it is Joined directly. What we need to worry about is how to deal with late data; the general idea is either to wait a little longer or to patch the result afterwards.
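A minimal sketch of Flink’s window Join for this case, assuming both streams already carry event-time watermarks; the Order/Payment types and field names are illustrative.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Hypothetical event types for the sketch.
class Order { String orderId; }
class Payment { String orderId; }

public class WindowJoinSketch {
    static DataStream<String> windowJoin(DataStream<Order> orders,
                                         DataStream<Payment> payments) {
        return orders
            .join(payments)
            .where(o -> o.orderId)       // join key on the left stream
            .equalTo(p -> p.orderId)     // join key on the right stream
            .window(TumblingEventTimeWindows.of(Time.seconds(10)))
            // Only pairs landing in the same 10-second window are joined;
            // records arriving after the window closes count as late data.
            .apply((o, p) -> "order " + o.orderId + " matched a payment");
    }
}
```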
Joining current data with historical data works in much the same way, except that you add a specified time range to reach back into the past.
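One way to express that “specified time” in Flink is an interval Join, sketched here with the same illustrative Order/Payment types from the previous sketch; the 24-hour look-back is an assumed bound, and the streams must be on event time.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class IntervalJoinSketch {
    static DataStream<String> intervalJoin(DataStream<Order> current,
                                           DataStream<Payment> history) {
        return current
            .keyBy(o -> o.orderId)
            .intervalJoin(history.keyBy(p -> p.orderId))
            // Look back 24 hours relative to each current event.
            .between(Time.hours(-24), Time.seconds(0))
            .process(new ProcessJoinFunction<Order, Payment, String>() {
                @Override
                public void processElement(Order o, Payment p, Context ctx,
                                           Collector<String> out) {
                    out.collect("order " + o.orderId + " matched history");
                }
            });
    }
}
```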
Conclusion
Offline Join is, at heart, a process of many lookups and matches. The optimization ideas are also clear: reduce the amount of data read from the primary table, index the secondary table’s Join field, and so on.
Real-time data is harder, because there is no way to sort the full data set, so the problem splits into the various cases above. But every case has a solution. What we need to pay attention to is what to do when data arrives late; the answer is to wait a bit longer or to patch the result afterwards.
To conclude, I would like to borrow a diagram drawn by Atsushi Minuta, which sums this up very clearly.