The previous article, Report Automation: Dimensional Modeling without Pressure, used a supermarket order as an example to briefly introduce the concepts of fact tables and dimension tables in dimensional modeling. This chapter focuses on a special kind of dimension data: the time dimension.
Why is the time dimension special? Consider product categories: vegetables, fruits, drinks, small household appliances… there are many categories, but their number is finite and fairly small. Time, on the other hand, has a distinct value at every single point.
Some might ask: if time has so many values, why treat it as a dimension table at all? But our statistics tend to use time as the basis of aggregation, such as daily turnover or year-over-year growth rate. So how do we deal with it?
Simple time dimension
First we need to determine the minimum time granularity we need. Business records are usually accurate to the second or millisecond, but reports rarely need that precision, so this step should be driven by our future reporting needs.
For example, if our business only needs to analyze daily information, date is the minimum-granularity dimension value we need. Our date table is then a minimum-granularity dimension table with one row per day, storing that day's dimension attributes.
If we need to distinguish the number of morning visitors from the number of afternoon visitors, the dimension might need to be accurate to morning and afternoon, and perhaps evening as well. Each day would then split into three segments, with three rows of data per day.
So what does our dimension table store? Let's look at a simple dimension table with date as the minimum granularity, DIM_DATE_TIME_UTC8:
UTC+8 date | Start time (UTC) | End time (UTC) |
---|---|---|
2019.10.10 | 2019.10.09 16:00:00 | 2019.10.10 16:00:00 |
2019.10.11 | 2019.10.10 16:00:00 | 2019.10.11 16:00:00 |
2019.10.12 | 2019.10.11 16:00:00 | 2019.10.12 16:00:00 |
2019.10.13 | 2019.10.12 16:00:00 | 2019.10.13 16:00:00 |
Using the dimension table above, we can quickly convert the UTC times in the database to UTC+8 dates and generate report data that the back end can read directly. Similarly, we can create dimension tables for other time zones to quickly compute multi-time-zone dates.
Note that this dimension table can be generated by a stored procedure, but instead of generating rows from the beginning of the universe to 1,000 years in the future, generate them according to actual data needs. For example, if the report only covers the last two years, we only need dimension values for the last two years. This reduces the amount of data scanned during analysis, and the time dimension table itself filters out data outside the time range.
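A minimal sketch of such a generation step, assuming MySQL 8 syntax; the column names utc_cn_date, start_time, and end_time are assumptions matching the table above:

```sql
-- generate one row per UTC+8 day, covering only the last two years
insert into dim_date_time_utc8 (utc_cn_date, start_time, end_time)
with recursive date_series (utc_cn_date) as (
    select date_sub(current_date, interval 2 year)
    union all
    select date_add(utc_cn_date, interval 1 day)
    from date_series
    where utc_cn_date < current_date
)
select utc_cn_date,
       -- a UTC+8 day starts at 16:00 UTC on the previous day
       timestamp(date_sub(utc_cn_date, interval 1 day), '16:00:00'),
       timestamp(utc_cn_date, '16:00:00')
from date_series;
```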
Finding value in time segments
Let's take it a step further: what about a time dimension with morning, afternoon, and evening granularity? Let's create a time dimension table with time segments as the minimum granularity, DIM_DATE_TIME_SEGMENT_UTC8:
UTC+8 date | Segment | Start time (UTC) | End time (UTC) |
---|---|---|---|
2019.10.10 | 1 | 2019.10.09 16:00:00 | 2019.10.09 23:00:00 |
2019.10.10 | 2 | 2019.10.09 23:00:00 | 2019.10.10 08:00:00 |
2019.10.10 | 3 | 2019.10.10 08:00:00 | 2019.10.10 16:00:00 |
The start and end times here are chosen arbitrarily; pick boundaries that fit your business. We now also need a second dimension table describing the segments themselves, DIM_TIME_SEGMENT_STATUS_UTC8:
Segment | Name | English name |
---|---|---|
1 | Morning | Forenoon |
2 | Afternoon | Afternoon |
3 | Evening | Night |
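With these two tables, counting visitors per segment becomes a pair of joins. A minimal sketch, assuming a hypothetical supermarket_visit table with a UTC visit_time column; the dimension column names segment and utc_cn_date are also assumptions:

```sql
-- count visitors per UTC+8 day and per time segment
select seg.utc_cn_date,
       seg_name.english as segment_name,
       count(*) as visitor_count
from supermarket_visit as visit
join dim_date_time_segment_utc8 as seg
  on visit.visit_time >= seg.start_time
 and visit.visit_time < seg.end_time
join dim_time_segment_status_utc8 as seg_name
  on seg.segment = seg_name.segment
group by seg.utc_cn_date, seg_name.english;
```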
Comprehensive time analysis
I mentioned that we need to find the minimum granularity of time; what if we also need coarser granularities, for example monthly or quarterly statistics?
We have two options. The first increases coupling but reduces the number of join operations: add the coarser-grained fields directly to the date table. Without further ado, here is the table structure:
UTC+8 date | Start time (UTC) | End time (UTC) | Year | Month | Quarter |
---|---|---|---|---|---|
2019.10.10 | 2019.10.09 16:00:00 | 2019.10.10 16:00:00 | 2019 | 10 | 4 |
2019.10.11 | 2019.10.10 16:00:00 | 2019.10.11 16:00:00 | 2019 | 10 | 4 |
2019.10.12 | 2019.10.11 16:00:00 | 2019.10.12 16:00:00 | 2019 | 10 | 4 |
2019.10.13 | 2019.10.12 16:00:00 | 2019.10.13 16:00:00 | 2019 | 10 | 4 |
The year, month, and quarter columns are all derived from the UTC+8 date (October falls in quarter 4).
The other option decouples: keep the date table unchanged, and store the UTC+8 date, year, month, and quarter fields in a separate table.
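That separate table, which the query below calls dim_date_year_month, might look like this (an illustrative sketch; the column layout is an assumption matching the query):

UTC+8 date | Year | Month | Quarter |
---|---|---|---|
2019.10.10 | 2019 | 10 | 4 |
2019.10.11 | 2019 | 10 | 4 |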
So far all you have seen is a bunch of tables; let's see how to use them in code:
select dim_invoiced_date_time.utc_cn_date as invoiced_date,
       dim_invoiced_date_year_month.month as invoiced_month,
       dim_invoiced_date_year_month.quarter as invoiced_quarter
from supermarket_sales_order as sales_order
-- map each UTC invoice time onto its UTC+8 date
join dim_date_time_utc8 as dim_invoiced_date_time
  on sales_order.invoiced_time >= dim_invoiced_date_time.start_time
 and sales_order.invoiced_time < dim_invoiced_date_time.end_time
-- attach the year/month/quarter attributes of that date
join dim_date_year_month as dim_invoiced_date_year_month
  on dim_invoiced_date_time.utc_cn_date = dim_invoiced_date_year_month.utc_cn_date
Because adjacent rows in the dimension table share boundary times, the code must be careful to use greater-than-or-equal on the start and strictly-less-than on the end.
BETWEEN is not recommended. If you do use it, be clear about how the current database handles the boundaries, and watch out for boundary data when migrating to another database.
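To make the pitfall concrete, here is an illustrative sketch. In standard SQL, BETWEEN is inclusive on both ends, so a record stamped exactly on a boundary matches two adjacent dimension rows:

```sql
-- ambiguous: an order invoiced exactly at 2019-10-10 16:00:00 matches
-- two adjacent dimension rows and is counted twice
select count(*)
from supermarket_sales_order as sales_order
join dim_date_time_utc8 as d
  on sales_order.invoiced_time between d.start_time and d.end_time;

-- unambiguous: the half-open interval assigns every order to exactly one row
select count(*)
from supermarket_sales_order as sales_order
join dim_date_time_utc8 as d
  on sales_order.invoiced_time >= d.start_time
 and sales_order.invoiced_time < d.end_time;
```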
Seizing the moment as time passes
Our reports often need to focus not only on the final, current state of the business, but also on its past states.
Take a supermarket order as an example: an order is created, paid, redeemed, refunded… it goes through multiple state changes. Suppose the business database records the time of each change in a table like this:
order_id | time | status |
---|---|---|
1233 | 2019.10.09 16:00:00 | 1 |
1233 | 2019.10.10 16:00:00 | 2 |
1233 | 2019.10.11 16:00:00 | 3 |
This user seems very hesitant: one step per day.
Assume the status values map as follows: 1 = created, 2 = paid, 3 = refunded.
If the business only records this structure, how do we use it in reports? Must we search this table for the row with the maximum time per order just to get its current state, and compute the "payment hesitation period" by hunting for two adjacent status rows of the same order?
For objects whose state changes at multiple points in time, we introduce the Slowly Changing Dimension (SCD) to support fast analysis of the current state, the duration of each state, and the state at any given point in time.
There are three approaches to SCD:
- Keep only the latest: we only care about the final state, so we simply update the row on every change (only the latest data is stored).
- Start and end time: we record the start and end time of each state; the end time of the latest state is null. Each time a new state arrives, the end time of the previous record is filled in first. This either pushes the bookkeeping into the business runtime (causing updates to business data), or the business records only the simple table above and the DW layer derives the intervals through a view (computationally heavy, so it should be done incrementally).
- Chain storage: we do not care about time intervals, only what the previous state was, or how the state kept changing; each row simply stores the ID of the previous state's row (extra queries to walk the chain, but historical data is never modified).
Here's an example of the start-and-end-time approach:
order_id | start_time | end_time | status |
---|---|---|---|
1234 | 2019.10.09 16:00:00 | 2019.10.10 15:59:59 | 1 |
1234 | 2019.10.10 16:00:00 | 2019.10.11 15:59:59 | 2 |
1234 | 2019.10.11 16:00:00 | | 3 |
end_time can also equal the next record's start_time; whichever convention you choose, keep it consistent, because it determines the interval boundary logic (closed vs. half-open intervals).
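A minimal sketch of the bookkeeping for this approach, assuming a hypothetical order_status_history table with the columns above and the end_time = next start_time convention:

```sql
-- 1. close the currently open record when the order enters a new state
update order_status_history
   set end_time = '2019-10-11 16:00:00'
 where order_id = 1234
   and end_time is null;

-- 2. insert the new state (3 = refunded) as the open record
insert into order_status_history (order_id, start_time, end_time, status)
values (1234, '2019-10-11 16:00:00', null, 3);

-- the state of the order at any point in time is then a single lookup
select status
from order_status_history
where order_id = 1234
  and start_time <= '2019-10-10 20:00:00'
  and (end_time > '2019-10-10 20:00:00' or end_time is null);
```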
Finally, you are welcome to test and compare the performance of the following three methods on large data sets:
- Time zone conversion through a join with the time dimension table
- Conversion at read time, through a view or an ETL tool using SQL's time functions (see the sketch below)
- Time computed by the ETL tool's own time functions
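For comparison, the second method might look like this (a sketch, assuming MySQL's convert_tz with explicit offsets):

```sql
-- convert the stored UTC time to a UTC+8 date at read time,
-- without joining a time dimension table
select date(convert_tz(invoiced_time, '+00:00', '+08:00')) as invoiced_date,
       count(*) as order_count
from supermarket_sales_order
group by invoiced_date;
```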
Copyright statement: this article is licensed under CC BY-SA 4.0; please attach the original source link and this statement when reprinting. Article link: Cologic Blog – Report automation: seize the moment of time – www.coologic.cn/2020/03/176…
Wechat official account: Coologic