Hive number warehouse zipper table principle

First, the zipper table principle

1, the introduction of

In the process of data model design of data warehouse, such requirements are often encountered: (1) The amount of data is relatively large. (2) Some fields in the table will be updated, such as the user’s address, product description information, order status and so on. (3) It is necessary to view the historical snapshot information of a certain time point or period, for example, to view the status of an order at a certain time point in history. (4) check a certain user in the past a certain period of time, update several times and so on. (5) The proportion and frequency of change is not very large. For example, there are 10 million members in total, and about 100,000 new members are added and changed every day. (6) If a full amount of this table is kept every day, a lot of constant information will be saved in each full amount, which is a great waste of storage. The zipper table can not only meet the historical state of the response data, but also save the storage to the greatest extent.

2. Concept of zipper table

The zipper table is defined according to the way the table stores data in the data warehouse design. As the name implies, the so-called zipper is to record history. Recording information about all changes in an object from its beginning to its current state.

3. Give examples

Here is an original table containing 5 columns of data. The data contains three columns, orderid, create_time, and mod_time, which represent order number, order creation time, and order modification time, respectively. It can be seen that the data is collected on 2020-11-14, and only the previous day’s data can be counted on that day.

After we have an original table, we create a zipper table to record all the order updates and new details. As shown in the figure below, we can see that the zipper table has two more columns than the original table, start_time and end_time respectively, which represent the effective date and expiration date of the order respectively. The first effective date is the creation date, and the expiration date is set to a large number, indicating that as long as the order is not modified, it is permanent.

After creating the original table and the preliminary zipper table, we start to count the data of the 14th on the 15th, and have two more records, as shown in the following figure. Because the records we generate each day are stored in the partition table by date, the partition table has an extra column at the end of the table representing the partition name, which is actually the date.

Updating the original table as shown in the figure below, we find that in the record of 14th, order No. 2 was modified once on 14th, and an order No. 6 was created at the same time

Then we need to integrate the partition table of data No. 14 with the preliminarily designed zipper table, and make a left join. The left join partition table of the zipper table is left join, and the value of END_TIME field in the joined table which is NULL is reset to the permanent effective date.

Why do I do that? (1) First, we conducted the left outer link, and the result was that only order No. 2 could match the zipper table, indicating that only order No. 2 was updated on 14th, so we changed the end_time of order No. 2 in the zipper table to 14th, while the end_time of the other four orders would be null, indicating that no update occurred. So let’s make it permanent. (2) We also found that in the data of 14th, in addition to modifying Order No. 2, we created another Order No. 6, so we directly added an Order No. 6 in the zipper table at the end.

After the record of No. 14 is finally updated, the zipper table is shown as the following figure.

And so on, we start to collect the record data of No. 15 and import it into the partition table, as shown in the figure below. Notice that order number 2 has been updated again, order number 4 has been updated again, and order number 7 has been created.

Update the original table as shown in the figure below. Change end_time for orders 2 and 4 to 15, and add order 7 information.

Update the zipper table to make a left join of the new zipper table (No. 14) and the data partition table (No. 15).

There is a very important point here: Order No. 2 has been modified once before, and the end_time time is on 14th. Now it is modified again on 15th. At the same time, Order No. 4 was updated for the first time, so the END_TIME of Order No. 4 was changed to Order No. 15. Finally, create another order number 7. As shown in the figure below

The last step is to add the data of order No. 2 and order No. 4 from data No. 15 to the zipper table again. And reset END_TIME to the permanent effective date.

The final display of the zipper table: we can see the day on which an order has been changed. On the day of the change, we can find out whether the order has been changed after the order number by searching the order number. This process forms a zipper table. Take Order No. 2 as an example: we found that Order No. 2 was created on 13th and modified once on 14th, and then we counted the data of 15th, and found that Order No. 2 was modified again on 15th. Finally, whether Order No. 2 will be modified again needs to collect and check the data partition table every day after that.

Hive number warehouse zipper table principle

First, the zipper table principle

1, the introduction of

2. Concept of zipper table

3. Give examples

Related Posts

Hive Core Principles (HIVE-3.1.2)

Car Home: A Lake-Warehouse Architecture Practice based on Flink + Iceberg

Hive tuning