Miao Haochong (Flink Community volunteer)
This article reviews the evolution of the data warehouse, covering its background, architecture, and the comparison between offline and real-time data warehouses, and then shares several approaches to implementing typical ETL scenarios with Flink.
1. Overview of real-time data warehouse
1.1 Background of real-time data warehouse generation
Let’s review the concept of a data warehouse.
The concept of the data warehouse was put forward by Bill Inmon in the 1990s. The background at the time was that traditional OLTP databases could not support long-cycle analysis and decision-making scenarios well, so the four core points of the data warehouse concept are best understood in contrast with the state of OLTP databases at that time.
- Subject-oriented: data warehouse data is organized differently from transaction-oriented OLTP processing. Because data warehouses serve analysis and decision-making, data is often organized by subjects such as analysis scenarios or analysis objects.
- Integrated: a data warehouse usually needs to collect multiple scattered, heterogeneous data sources, perform ETL processing such as data cleaning, and integrate them into a single warehouse; OLTP does not need similar integration.
- Relatively stable: an OLTP database is business-oriented and its main function is to accurately reflect the current business status, so it needs to support a large number of insert, delete, and update operations. For a data warehouse, once data has been stored, the typical usage is query, so the data is relatively stable.
- Reflects historical change: a data warehouse is a collection of data that reflects historical change and can be understood as a set of snapshots of historical data. An OLTP database, by contrast, simply reflects the latest state at the time.
These four points form the core definition of a data warehouse. We can also see that for real-time data warehouses, some parts of the traditional (that is, offline) data warehouse definition are weakened, for example reflecting historical change. With the basic concepts covered, data warehouse modeling uses some classical methods, mainly normal-form modeling, dimensional modeling, and Data Vault; in Internet big data scenarios, the most commonly used is dimensional modeling.
Next, let's take a look at the classical architecture of the offline data warehouse, shown in the diagram below:
This data warehouse architecture is mainly oriented toward Internet big data scenarios. As can be seen from the figure above, there are three core links.
- The first part is the data sources. Internet companies generally have two types of data sources:
- Type 1 is log data, which collects user behavior logs and some back-end logs through client-side tracking (buried points). Behavior logs typically go through a pipeline in which data is reported to Nginx, collected by Flume, stored in a message queue such as Kafka, and then pulled by real-time or offline tasks to HDFS, our offline data warehouse.
- Type 2 is the business database. Its binlog is generally collected through Canal, also written into the message queue, and finally pulled to HDFS by Camus.
These two data sources eventually land in the ODS layer on HDFS, also known as the source-aligned data layer, which stays consistent with the original data sources.
- The second link is the offline data warehouse, shown in the blue box. You can see that it is a layered structure whose model design follows the idea of dimensional modeling.
- At the bottom is the ODS layer, which stores data in HDFS without information loss and basically keeps the original log data unchanged.
- On top of the ODS layer, unified data cleaning and normalization are generally carried out to produce the DWD detailed data layer. This layer also contains unified dimensional data.
- Then, based on the DWD detailed data layer, we organize the data according to analysis scenarios, analysis entities, and so on, into subject-oriented summary data layers, the DWS layer.
- On top of DWS we build APP application data layers that are closer to the application scenarios. This data is highly aggregated and can be imported directly into our application services for use.
In the production of the offline data warehouse, offline processing engines such as MapReduce, Hive, and Spark are generally used, and the data is generally stored in HDFS.
- After the first two steps, part of the application-layer data is stored in data services, such as KV stores like HBase, Redis, and Kylin, and corresponding service interfaces are encapsulated on top of these stores to provide data externally. In the outermost layer we produce business-oriented reports, analysis-oriented data products, and support for online business products, and so on. This layer is called the data application part and is the closest to the business side.
The above is an introduction to the classical architecture of a basic offline warehouse.
As you all know, with the popularity of mobile devices we have gradually moved from the manufacturing era to the Internet era. In the manufacturing era, the traditional data warehouse mainly supported the business decision makers and managers of enterprises in traditional industries in making business decisions. The decision cycle was relatively long and the data volume was small, so databases such as Oracle and DB2 were sufficient for storage.
But with the development of distributed computing and intelligent technology, the improvement of overall computing power, and the growth of the Internet, the amount of data collected online has been increasing exponentially. Business no longer depends only on people to make decisions; the main decision makers have largely become computer algorithms, as in intelligent recommendation scenarios. The decision cycle has shrunk from day-level to second-level, and decision time has become very short. There are also more and more scenarios that require real-time data processing, such as real-time personalized recommendation and advertising; some traditional enterprises have begun to monitor in real time whether their products have quality problems, and the financial industry relies heavily on real-time anti-fraud. Against this background, the real-time data warehouse was bound to emerge.
1.2 Real-time warehouse architecture
First, let me introduce the classical real-time data warehouse architecture, the Lambda architecture:
This architecture was proposed by the author of Storm. The main idea of the Lambda architecture is to superimpose the real-time data warehouse on top of the original offline data warehouse architecture, and then merge the offline data with the T+0 real-time data to produce continuously updated results.
- Compared with the offline data warehouse architecture in section 1.1, the addition of the real-time data warehouse is clearly the yellow area in the figure above. We usually keep real-time warehouse data in message queues such as Kafka, and dimensional modeling still has its layers; but for the summarized data we do not put APP-layer data in the real-time warehouse, and instead move more of that computation to the data service side.
- For real-time computation, we often use engines such as Flink, Spark Streaming, and Storm, which improve timeliness from days and hours to seconds and minutes.
You can also see in this architecture diagram that the intermediate data warehouse link has two parts: the offline data warehouse and the real-time data warehouse. We have to operate two engines (real-time and offline), and at the code level we have to implement both real-time and offline business logic. When merging, we need to ensure consistency between the real-time and offline data, so whenever the code changes we also need to do a lot of comparison and verification of the real-time and offline data. This is relatively expensive in terms of both resources and operation and maintenance. This is an obvious and prominent problem of the Lambda architecture, hence the Kappa architecture.
One of the main ideas of the Kappa architecture is to remove the offline data warehouse and use the real-time data warehouse for all warehouse production. As you can see from the figure above, the offline warehouse module is no longer in the middle.
Those of you familiar with real-time warehouse production may have a question about the Kappa architecture. Because we constantly face business changes, a lot of business logic needs to be iterated. If the caliber of previously produced data changes, the data needs to be recomputed, or even historical data needs to be backfilled. How does a real-time data warehouse solve this recomputation problem?
The idea of the Kappa architecture here is that we need a message queue that can store historical data, such as Kafka, and the message queue must support restarting consumption from a historical point. A new task is then started to consume Kafka data from an earlier point in time; when the new task catches up with the currently running task, the downstream can be switched to the new task, the old task can be stopped, and the original output table can be deleted.
With some of the advances we’ve made in real-time OLAP technology, a new real-time architecture has been proposed, which we’ll call the real-time OLAP variant.
The idea is to move most of the aggregation, analysis, and computation into a real-time OLAP engine. The real-time data warehouse itself then does not need to do particularly heavy work, especially aggregation-related logic, which lets the data application layer flexibly face the changing needs of various business analyses and makes the whole architecture more flexible.
Finally, let's make an overall comparison of the three real-time data warehouse architectures:
- From the perspective of computing engines: the Lambda architecture needs to maintain two sets of engines, batch and streaming, while the Kappa architecture and the real-time OLAP variant only need to maintain the streaming engine.
- Development cost: the Lambda architecture has a higher development cost because it requires maintaining two sets of code, real-time and offline, while the Kappa architecture and the real-time OLAP variant only need one set of code.
- Analysis flexibility: the real-time OLAP variant is the most flexible.
- Dependence on a real-time OLAP engine: the real-time OLAP variant depends strongly on the capabilities of the real-time OLAP engine, while the former two do not.
- Computing resources: the Lambda architecture requires both batch and streaming computing resources, the Kappa architecture requires only streaming resources, and the real-time OLAP variant additionally requires OLAP resources.
- Recalculation after logic changes: the Lambda architecture recomputes through batch processing; the Kappa architecture needs to re-consume the message queue as described above; the real-time OLAP variant also needs to re-consume the message queue and re-import the data into the OLAP engine for computation.
1.3 Traditional data warehouse vs real-time data warehouse
Then let's look at the overall differences between the traditional and the real-time data warehouse.
- First, in terms of timeliness: the offline data warehouse supports hour-level and day-level latency, while the real-time data warehouse reaches second-level and minute-level, so its timeliness is much higher.
- In terms of data storage, the offline data warehouse is stored in HDFS or an RDS, while the real-time data warehouse generally uses message queues plus some KV storage; dimension data, for example, is kept in KV stores.
- In terms of production and processing, the offline data warehouse needs an offline computing engine and offline scheduling, while the real-time data warehouse mainly relies on a real-time computing engine.
2. Implementing typical ETL scenarios based on Flink
Here we mainly introduce two real-time ETL scenarios: dimension table join and dual-stream join.
- Dimension table join
  - Preloading the dimension table
  - Hot storage association
  - Broadcast dimension table
  - Temporal table function join
- Dual-stream join
  - Offline join vs. real-time join
  - Regular join
  - Interval join
  - Window join
2.1 Dimension table join
2.1.1 Preloading the dimension table
Option 1:
The specific implementation is to define a class that implements RichFlatMapFunction, read the dimension database in the open function and load the data fully into memory, and then associate against the in-memory dimension data at runtime when this operator is used on the probe stream.
The advantage of this solution is that it is relatively simple to implement, but the disadvantage is also obvious: because all dimension data is loaded into memory, it only supports small dimension tables. Also, to update the dimension table the job has to be restarted, so updating dimension data is somewhat expensive and introduces a period of delay. Preloading the dimension table is therefore suitable for small dimension tables that change infrequently and have low timeliness requirements for changes.
Let’s look at a simple code example:
The code captures the key fragment. A DimFlatMapFunction is defined that implements RichFlatMapFunction. It holds a Map-typed dim, which stores the dimension data after it is read from the DB. In the open function we connect to the DB and fetch its data; in this scenario we fetch the item ID and item name from an item table and store them in dim.
Then dim is used in the flatMap function. After a record from the probe stream arrives, we look it up in dim; if there is an entry for the same item ID, the corresponding item name is appended to the tuple and emitted as output. That is the basic process.
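A minimal sketch of the approach just described, assuming a JDBC-accessible dimension table dim_item with columns item_id and item_name (the connection URL, table, and column names are illustrative):

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

// Probe record: (itemId, timestamp); output: (itemId, timestamp, itemName).
public class DimFlatMapFunction
        extends RichFlatMapFunction<Tuple2<Integer, Long>, Tuple3<Integer, Long, String>> {

    private Map<Integer, String> dim;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Load the full dimension table into memory once, when the task starts.
        dim = new HashMap<>();
        try (Connection con = DriverManager.getConnection("jdbc:mysql://dim-db:3306/dim", "user", "pwd");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT item_id, item_name FROM dim_item")) {
            while (rs.next()) {
                dim.put(rs.getInt("item_id"), rs.getString("item_name"));
            }
        }
    }

    @Override
    public void flatMap(Tuple2<Integer, Long> value, Collector<Tuple3<Integer, Long, String>> out) {
        // Look up the item name for this probe record and append it if present.
        String itemName = dim.get(value.f0);
        if (itemName != null) {
            out.collect(Tuple3.of(value.f0, value.f1, itemName));
        }
    }
}
```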
This is the basic first version of the solution. There is an improvement to it: in the open function you can start a new thread that reloads the dimension table periodically, so that the job does not have to be restarted manually and the dimension data is refreshed on a schedule.
Option 2:
Distribute the dimension file to the TaskManagers through the Distributed Cache mechanism and then load it into memory for association. The implementation can be divided into three steps (see the sketch after the list):
- Step 1: register the file with env.registerCachedFile.
- Step 2: implement a RichFunction and fetch the cached file through the RuntimeContext in the open function.
- Step 3: parse and use the file data.
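A minimal sketch of these three steps, assuming the dimension data is a CSV file of item_id,item_name pairs on HDFS (the path, format, and field layout are illustrative):

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.io.File;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.Map;

public class DistributedCacheDimJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Step 1: register the dimension file; it will be shipped to every TaskManager.
        env.registerCachedFile("hdfs:///dim/item_dim.csv", "itemDim");

        env.fromElements(1001, 1002)
           .map(new RichMapFunction<Integer, String>() {
               private final Map<Integer, String> dim = new HashMap<>();

               @Override
               public void open(Configuration parameters) throws Exception {
                   // Step 2: fetch the cached file through the RuntimeContext.
                   File file = getRuntimeContext().getDistributedCache().getFile("itemDim");
                   // Step 3: parse the file into an in-memory map.
                   for (String line : Files.readAllLines(file.toPath())) {
                       String[] fields = line.split(",");
                       dim.put(Integer.parseInt(fields[0]), fields[1]);
                   }
               }

               @Override
               public String map(Integer itemId) {
                   return itemId + "," + dim.getOrDefault(itemId, "unknown");
               }
           })
           .print();

        env.execute("distributed-cache-dim-join");
    }
}
```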
One advantage of this approach is that you do not need to prepare or depend on an external database. The disadvantage is that the data is still loaded into memory, so the supported dimension table size is relatively small, and if the dimension data needs to be updated, the job has to be restarted. Therefore, this solution is not recommended for formal production: from a data warehouse perspective all data should be manageable by schema, and keeping data in files is not conducive to overall data management and standardization. It is fine for small demos or tests.
It is suitable for scenarios where the dimension data exists as files, the data volume is relatively small, and the update frequency is low, such as a static code table or a configuration file.
2.1.2 Hot storage association
The second big implementation idea for dimension table join is hot storage association. Specifically, dimension data is imported into hot storage such as Redis, Tair, or HBase and queried with asynchronous I/O; on top of this a Cache layer with an eviction mechanism is added, so that dimension data is cached in memory and the overall pressure on the hot storage is reduced.
The process is shown in the figure above. For the Cache part, Google's Guava Cache is recommended: it encapsulates the interaction with the cache as well as the cache eviction mechanisms and is quite convenient to use.
There are two important points in this scheme. The first is that we need asynchronous I/O to access the storage, so let's review the difference between synchronous and asynchronous I/O:
- With synchronous I/O, after sending a request we must wait for it to return before sending a new one, so the overall throughput is relatively low. Since real-time data processing is particularly concerned with latency, synchronous I/O is unacceptable in many scenarios.
- With asynchronous I/O, multiple requests can be issued in parallel, so the total throughput is higher and the latency lower. Using asynchronous I/O increases the load on the external storage, which can sometimes become the bottleneck of the whole pipeline; that is why the Cache mechanism is introduced, to reduce the number of accesses to the external storage (see the sketch below).
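Flink exposes asynchronous I/O through AsyncDataStream and RichAsyncFunction; the following is a minimal sketch of this pattern, where DimLookupClient stands in for a real non-blocking HBase/Redis/Tair client:

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class AsyncDimLookup extends RichAsyncFunction<String, String> {

    /** Placeholder for a real non-blocking client (HBase/Redis/Tair). */
    public interface DimLookupClient {
        CompletableFuture<String> lookup(String key);
    }

    private transient DimLookupClient client;

    @Override
    public void open(Configuration parameters) {
        // Create the real asynchronous client here; this stub always returns "unknown".
        client = key -> CompletableFuture.completedFuture("unknown");
    }

    @Override
    public void asyncInvoke(String itemId, ResultFuture<String> resultFuture) {
        // Issue the lookup without blocking; complete the Flink future when it returns.
        client.lookup(itemId).whenComplete((itemName, err) -> {
            if (err == null) {
                resultFuture.complete(Collections.singleton(itemId + "," + itemName));
            } else {
                resultFuture.complete(Collections.emptyList());
            }
        });
    }

    // Wiring: at most 100 in-flight requests, 1 second timeout per request.
    public static DataStream<String> apply(DataStream<String> probeStream) {
        return AsyncDataStream.unorderedWait(
                probeStream, new AsyncDimLookup(), 1000, TimeUnit.MILLISECONDS, 100);
    }
}
```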
Guava Cache is very simple to use. Here is an example of defining such a cache:
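A minimal sketch of such a cache definition, where the maximum size, expiry time, and fallback lookup method are illustrative:

```java
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

import java.util.concurrent.TimeUnit;

public class DimCache {

    // Cache item_id -> item_name lookups so that most probes never hit the hot storage.
    private final LoadingCache<String, String> cache = CacheBuilder.newBuilder()
            .maximumSize(10_000)                       // bound memory usage
            .expireAfterWrite(10, TimeUnit.MINUTES)    // tolerate slightly stale dimension data
            .build(new CacheLoader<String, String>() {
                @Override
                public String load(String itemId) {
                    // On a cache miss, fall back to the external hot storage (HBase/Redis/Tair).
                    return queryHotStorage(itemId);
                }
            });

    public String getItemName(String itemId) throws Exception {
        return cache.get(itemId);
    }

    private String queryHotStorage(String itemId) {
        // Placeholder for the real hot-storage lookup.
        return "unknown";
    }
}
```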
You can see that the interface is very simple and you can try it out. The advantage of the hot storage association scheme is that the dimension data does not have to be fully loaded into memory, so it is not limited by memory size; in Meituan-Dianping's traffic scenarios, our dimension data can reach the order of one billion records. On the other hand, the disadvantages are also obvious: we depend on hot storage resources, and dimension updates are reflected in the results with a certain delay, because the data first has to be imported into the hot storage and the cache also takes time to expire.
In general, this method is suitable for scenarios where the dimension data volume is large and a certain delay in dimension updates is acceptable.
2.1.3 Broadcast dimension table
The third big idea is the broadcast dimension table, which mainly uses broadcast state to broadcast the dimension data stream to downstream tasks for the join.
Implementation method:
- Dimension data is sent to Kafka as the raw dimension stream S1.
- Define the state descriptor MapStateDescriptor and call S1.broadcast() to get the broadcast stream S2.
- Call connect(S2) on the main data stream S3 to get the BroadcastConnectedStream S4.
- Implement the association logic in a KeyedBroadcastProcessFunction / BroadcastProcessFunction and pass it as the parameter of S4.process().
The advantage of this scheme is that dimension changes can be reflected in the results in time. The disadvantage is that the data still has to be kept in memory, because it is stored in state, so the supported dimension table size is still not very large. The applicable scenario is one where we need to perceive dimension changes promptly and the dimension data can be turned into a real-time stream.
Here is a small demo:
The broadcast stream pageStream actually defines a page ID and the page name. The non-broadcast stream probeStream is a JSON string containing a device ID, a page ID, and a timestamp; it can be interpreted as a record of the user's page-view behavior on that device.
The entire implementation follows the four steps above (a sketch is given after this description):
- Step 1: define the state descriptor for the broadcast state.
- Step 2: create the broadcast stream.
- Step 3: connect the two streams.
- Step 4, the most important one: implement the BroadcastProcessFunction. Its first type parameter is the probeStream element type, the second is the broadcast stream element type, and the third is the output type. The main data processing logic is in processElement.
In processElement we first obtain the broadcast state through the context using broadcastStateDesc, then parse the probe stream record to get its pageID and look up the state for the same pageID. If a matching pageID is found, the corresponding pageName is added to the JSON record and emitted as output.
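A minimal sketch of these four steps under the assumptions above, where pageStream carries (pageId, pageName) pairs, probeStream carries JSON strings, the field names are illustrative, and the fastjson library is assumed for JSON parsing:

```java
import com.alibaba.fastjson.JSONObject;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class BroadcastDimJoin {

    public static DataStream<String> join(DataStream<String> probeStream,
                                           DataStream<Tuple2<String, String>> pageStream) {
        // Step 1: define the state descriptor for the broadcast state (pageId -> pageName).
        MapStateDescriptor<String, String> broadcastStateDesc =
                new MapStateDescriptor<>("pageDim", Types.STRING, Types.STRING);

        // Step 2: create the broadcast stream from the dimension stream.
        BroadcastStream<Tuple2<String, String>> broadcastStream =
                pageStream.broadcast(broadcastStateDesc);

        // Step 3: connect the probe stream with the broadcast stream.
        return probeStream
                .connect(broadcastStream)
                // Step 4: implement the association logic in a BroadcastProcessFunction.
                .process(new BroadcastProcessFunction<String, Tuple2<String, String>, String>() {

                    @Override
                    public void processElement(String value, ReadOnlyContext ctx,
                                               Collector<String> out) throws Exception {
                        JSONObject event = JSONObject.parseObject(value);
                        String pageId = event.getString("pageId");
                        String pageName = ctx.getBroadcastState(broadcastStateDesc).get(pageId);
                        if (pageName != null) {
                            event.put("pageName", pageName);
                            out.collect(event.toJSONString());
                        }
                    }

                    @Override
                    public void processBroadcastElement(Tuple2<String, String> dim, Context ctx,
                                                        Collector<String> out) throws Exception {
                        // Each broadcast record updates the dimension state on every parallel task.
                        ctx.getBroadcastState(broadcastStateDesc).put(dim.f0, dim.f1);
                    }
                });
    }
}
```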
2.1.4 Temporal table function join
After the methods above, there is another important approach: the Temporal table function join. First, what is a Temporal table? It is essentially a view that can return the contents of a continuously changing table (a changing table) as of some point in time; the changing table can be a real-time changelog stream or a materialized dimension table on external storage.
It is implemented by joining the probe stream with the Temporal table through a UDTF, which is called a Temporal table function join. This kind of join applies to scenarios where the dimension data exists as a changelog stream and we need to associate records with a particular time version of the dimension.
Let's look at the exchange rate and currency trading example from the official website. The dimension data, the changelog stream mentioned above, is RatesHistory. It reflects the exchange rates of different currencies against the Japanese yen at different times.
Its first field is the time, the second is the currency, and the third is the exchange rate against the yen. The probe table records purchase orders in different currencies; for example, two euros were purchased at 10:15. What we want to compute is the total transaction amount in yen. How do we achieve this with a Temporal table function join?
- Step 1: define a TemporalTableFunction on the changelog stream. It takes two required parameters: the first is a time attribute used to identify the version, and the second is the key to associate on, in this case the currency.
- Step 2: register the TemporalTableFunction under a name in tableEnv.
Now let's look at what effect the registered TemporalTableFunction has.
If we use the Rates function to get the state at 11:50, we can see that for the dollar, 11:50 falls between 11:49 and 11:56, so we take 99; for the euro, 11:50 falls between 11:15 and 12:10, so we take 119. This realizes the concept of the Temporal table defined at the beginning: obtaining the valid version of the changelog at a certain time. Once the TemporalTableFunction is defined, we use it to implement the business logic.
Note that we need to specify the join key. Because both streams keep updating, the 11:00 record in the order table is associated with the 10:45 version of the euro rate, which is 116, so that record contributes 232 to the result.
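A minimal sketch of the definition, registration, and query, loosely following the official RatesHistory example; the table and column names are illustrative and the exact Table API calls vary slightly across Flink versions:

```java
import static org.apache.flink.table.api.Expressions.$;

import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.table.functions.TemporalTableFunction;

public class TemporalTableJoinExample {

    // Assumes RatesHistory(r_time, r_currency, r_rate) and Orders(o_time, o_currency, o_amount)
    // have already been registered, with r_time / o_time as time attributes.
    public static Table totalYenAmount(StreamTableEnvironment tableEnv) {
        Table ratesHistory = tableEnv.from("RatesHistory");

        // Step 1: define the TemporalTableFunction, versioned by time and keyed on currency.
        TemporalTableFunction rates =
                ratesHistory.createTemporalTableFunction($("r_time"), $("r_currency"));

        // Step 2: register it under a name so it can be used in SQL.
        tableEnv.registerFunction("Rates", rates);

        // Join each order with the rate version valid at the order's time and sum in yen.
        return tableEnv.sqlQuery(
                "SELECT SUM(o.o_amount * r.r_rate) AS yen_amount " +
                "FROM Orders AS o, " +
                "LATERAL TABLE (Rates(o.o_time)) AS r " +
                "WHERE o.o_currency = r.r_currency");
    }
}
```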
2.1.5 Comparison of dimension table join approaches
Now let's compare the dimension table join approaches along several dimensions, to better understand their applicable scenarios.
- In terms of implementation complexity: apart from hot storage association, which is slightly more complex, the other methods are all relatively simple.
- In terms of dimension table data volume: hot storage association and Temporal table function join can support larger data volumes; the other methods load the dimension table into memory, so they are limited by memory size.
- In terms of dimension table update frequency: because preloading DB data into memory and the Distributed Cache both require restarting the job to refresh dimension data, they are not suitable for frequently changing dimension tables. The broadcast dimension table and Temporal table function join can update dimension data in real time and reflect it in the results, so they support frequently updated dimension tables.
- In terms of how quickly dimension updates become visible: the broadcast dimension table and Temporal table function join can reflect updates relatively quickly. Hot storage association can also meet business requirements in most scenarios.
- In terms of the form of the dimension data: the first method mainly supports a small amount of data stored in a DB; the Distributed Cache supports files; hot storage association requires access to hot storage such as HBase and Tair; the broadcast dimension table and Temporal table function join both require the dimension data to be converted into a real-time stream.
- In terms of external storage: the first method and hot storage association depend on external storage.
In this dimension table join section we have introduced these basic methods first. Some of you may have other solutions, and we can exchange feedback later; the approaches mentioned here are the commonly used ones, but the options are not limited to them.
2.2 Dual-stream join
First, let's review how batch processing joins two tables. Batch engines generally implement this with two ideas.
One is the Sort-Merge join; the other is the Hash join, which converts one table into a hash table and loads it into memory. Do these two ideas still apply to the dual-stream join scenario? In a dual-stream join, the data to be processed is no longer bounded batch data but an unbounded data set. An unbounded data set cannot be sorted before processing, nor can it be fully converted into a hash table held in memory, so these two approaches are basically not applicable. Meanwhile, in the dual-stream join scenario the join operates on two streams, data keeps arriving, and the join results need to be continuously updated.
So what kind of solution do we need for dual-stream join? Flink's basic idea is to keep storing the data of both streams in state and then use it. Because the join results need continuous updating, earlier data theoretically cannot be discarded without additional conditions. In practice, however, state cannot keep all data forever, so we need to localize the scope of the join in some way, that is, to split the unbounded stream into bounded data sets as much as possible for the join.
That is basically the overall idea; next, let's look at the concrete implementations.
2.2.1 Offline Join vs. Real-time Join
Let’s take inner Join as an example to see a simple implementation idea:
The left stream is marked in black and the right stream in blue, and we want an inner join of the two. After an element arrives, both the left and the right stream first store it in their corresponding state. Besides storing elements in state, each element from the left stream is also compared against the right state to see whether it matches; similarly, each arriving element from the right stream is compared against the left state. Whatever matches is emitted as an inner join result. This figure shows the rough details of an inner join and gives a general feel for how dual-stream join is implemented (a sketch follows).
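A minimal sketch of this idea using a KeyedCoProcessFunction, assuming both streams carry (key, value) string tuples and, for simplicity, that state is never cleaned up (which, as discussed below, is only acceptable with additional constraints):

```java
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

// Inputs: (key, value) tuples on both sides; output: joined (leftValue, rightValue) pairs.
public class StreamingInnerJoin
        extends KeyedCoProcessFunction<String, Tuple2<String, String>, Tuple2<String, String>,
                Tuple2<String, String>> {

    private transient ListState<String> leftState;
    private transient ListState<String> rightState;

    @Override
    public void open(Configuration parameters) {
        leftState = getRuntimeContext().getListState(new ListStateDescriptor<>("left", Types.STRING));
        rightState = getRuntimeContext().getListState(new ListStateDescriptor<>("right", Types.STRING));
    }

    @Override
    public void processElement1(Tuple2<String, String> left, Context ctx,
                                Collector<Tuple2<String, String>> out) throws Exception {
        // Store the left element, then probe every right element seen so far for this key.
        leftState.add(left.f1);
        Iterable<String> rights = rightState.get();
        if (rights != null) {
            for (String right : rights) {
                out.collect(Tuple2.of(left.f1, right));
            }
        }
    }

    @Override
    public void processElement2(Tuple2<String, String> right, Context ctx,
                                Collector<Tuple2<String, String>> out) throws Exception {
        // Symmetric logic for the right stream.
        rightState.add(right.f1);
        Iterable<String> lefts = leftState.get();
        if (lefts != null) {
            for (String left : lefts) {
                out.collect(Tuple2.of(left, right.f1));
            }
        }
    }
}
```

It would be wired up roughly as leftStream.connect(rightStream).keyBy(l -> l.f0, r -> r.f0).process(new StreamingInnerJoin()).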
2.2.2 Regular join
Let's first look at the first type of dual-stream join, the Regular join. This type of join requires the state of both streams to be kept continuously without cleanup; all data on one side is visible to all data on the other side, so the data has to be stored in state continuously. Since state cannot grow without bound, this scenario is only suitable for bounded data streams. Its syntax is similar to offline batch SQL:
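A minimal sketch, assuming two tables Orders and Product have already been registered in the table environment (table and column names are illustrative):

```java
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class RegularJoinExample {

    // Regular join: both inputs are kept in state indefinitely, just like batch SQL.
    public static Table regularJoin(StreamTableEnvironment tableEnv) {
        return tableEnv.sqlQuery(
                "SELECT o.order_id, o.product_id, p.product_name " +
                "FROM Orders AS o " +
                "JOIN Product AS p ON o.product_id = p.product_id");
    }
}
```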
As you can see, the Regular join supported by Flink is basically the same as ordinary SQL.
2.2.3 Interval join
The second type of join supported by Flink is the Interval join. What does it mean? When two streams are joined, a record of one stream must fall within a certain time range around the timestamp of the other stream and have the same join key for the join to be produced. The added time window constraint allows us to clean up data outside the time range, so we no longer need to retain the full state.
Interval join supports both processing time and event time. If processing time is used, Flink internally uses the system time to divide the window and perform the related state cleanup; with event time, the Watermark mechanism is used to divide the window and clean up state.
Let’s look at some examples:
The example above uses two tables: an order table and a shipments table. The time constraint defined here is that the shipment must occur within 4 hours after the order is placed.
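A minimal sketch of such a query, assuming registered tables Orders(order_id, order_time, ...) and Shipments(order_id, ship_time, ...) whose time columns are event-time attributes (all names are illustrative):

```java
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class IntervalJoinExample {

    // Interval join: a shipment matches an order only if it happens within 4 hours of it,
    // which lets Flink drop state that falls outside the time bound.
    public static Table intervalJoin(StreamTableEnvironment tableEnv) {
        return tableEnv.sqlQuery(
                "SELECT o.order_id, o.order_time, s.ship_time " +
                "FROM Orders AS o " +
                "JOIN Shipments AS s ON o.order_id = s.order_id " +
                "WHERE o.order_time BETWEEN s.ship_time - INTERVAL '4' HOUR AND s.ship_time");
    }
}
```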
One of Flink's authors shared a very intuitive explanation of this earlier; here is an example from that talk:
We can see that the Interval join defines a lower time bound, which allows us to clean up data beyond that bound. For example, the SQL specifies that orderTime must be greater than shipTime minus 4 hours, so for the Shipments stream, if a Watermark of 12:00 is received, Orders data with timestamps before 8:00 can be discarded and no longer retained in state.
In the other direction, shipTime must be greater than orderTime, so for the Orders stream, if a Watermark of 10:15 is received, the Shipments state before 10:15 can be discarded. In this way the Interval join allows us to clean up part of the historical state.
2.2.4 Window join
Finally, Window join, the third type of dual-stream join. The concept is to join elements of the two streams that have the same key and fall into the same window. Its execution logic is similar to that of the Inner join: only elements with equal join keys that are in the same window appear in the final result. It is used as follows:
Currently, Window join only supports the DataStream API, so the example here is in DataStream form. We first join the two streams, then define the join key conditions in where and equalTo, specify a WindowAssigner in window, and finally define a JoinFunction or FlatJoinFunction to implement the matching logic.
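A minimal sketch of that pattern, using a tumbling event-time window of two milliseconds to match the figures below; the stream element types and key fields are illustrative:

```java
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowJoinExample {

    public static DataStream<String> windowJoin(DataStream<Tuple2<String, Integer>> greenStream,
                                                 DataStream<Tuple2<String, Integer>> yellowStream) {
        return greenStream
                .join(yellowStream)
                // Join key of the first (green) stream.
                .where(new KeySelector<Tuple2<String, Integer>, String>() {
                    @Override
                    public String getKey(Tuple2<String, Integer> value) {
                        return value.f0;
                    }
                })
                // Join key of the second (yellow) stream.
                .equalTo(new KeySelector<Tuple2<String, Integer>, String>() {
                    @Override
                    public String getKey(Tuple2<String, Integer> value) {
                        return value.f0;
                    }
                })
                // Tumbling event-time window; swap in SlidingEventTimeWindows or
                // EventTimeSessionWindows for the other two window join variants below.
                .window(TumblingEventTimeWindows.of(Time.milliseconds(2)))
                .apply(new JoinFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, String>() {
                    @Override
                    public String join(Tuple2<String, Integer> green, Tuple2<String, Integer> yellow) {
                        // Pairs of elements with the same key in the same window.
                        return green.f1 + "," + yellow.f1;
                    }
                });
    }
}
```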
Since windows come in three types, the window join is also divided into three types:
- Type 1, the Tumbling Window join: windows are divided by a fixed time interval.
You can see two streams in this diagram (the green stream and the yellow stream). In this example we define a two-millisecond window. Each circle is a single element of a stream, and the number is the element's timestamp, so the streams are divided into two-millisecond intervals with no overlap between windows. For the first window, the green stream has two matching elements and the yellow stream also has two; they are combined pairwise and finally passed to the JoinFunction or FlatJoinFunction for specific processing.
- The second type is the Sliding Window join, which uses a sliding window.
A sliding window first defines the window size and then the slide size; if the slide is smaller than the window size, windows overlap. As shown in the figure, the red window and the yellow window overlap, and the green element 0 lies in both of them, which shows that one element can belong to two windows at the same time. When the Sliding Window join is performed, the red window contains two elements, green 0 and yellow 0, which satisfy the window join condition, so they form the pair (0,0). For the yellow window, the green and yellow elements 0 and 1 combine into the pairs (0,1), (0,0), and (1,0). Finally, all pairs enter the JoinFunction we defined for processing.
- The third type is the Session Window join: the window used here is the session window.
The session window defines a gap; if no elements arrive within the gap, the stream opens a new window. In the figure above there is no overlap between windows, and the gap defined here is 1. For the first window, the green element 0 is in the same window as the yellow elements 1 and 2, so they form the pairs (1,0) and (2,0). The remaining pairs are similar, and all pairs that meet the conditions enter the final JoinFunction for processing.
To review: this section mainly introduced Flink ETL implementations for two categories of scenarios, dimension table join and dual-stream join. For dimension table join, four methods were introduced: preloading the dimension table, hot storage association, broadcast dimension table, and Temporal table function join. For dual-stream join, Regular join, Interval join, and Window join were introduced.
About the author:
Mai Rong, senior technical expert at Meituan-Dianping, is responsible for user panoramic data construction, including the Meituan-Dianping traffic data warehouse, traffic management and analysis systems, and other traffic data systems, focusing on massive data processing, data warehouse construction, and user and traffic analysis.