The previous article, Report Automation: Dimensional Modeling without Pressure, used a supermarket order as an example to briefly introduce the concepts of fact tables and dimension tables in dimensional modeling. This chapter focuses on what makes the time dimension special.

Why is the time dimension special? Compare it with product categories: vegetables, fruit, drinks, small household appliances… There are many categories, but the number of categories is finite and fairly small. Time, by contrast, has a distinct value at every single point in time.

Some will ask: with so many possible values, how can time be treated as a dimension table at all? Yet most of our statistics use time as the basis of judgment: daily turnover, year-on-year growth rate… So how do we deal with it?

Simple time dimension

First we need to determine the minimum time granularity we need. Business systems usually record times accurate to the second or millisecond, but the granularity we choose should be driven by our future reporting needs.

For example, if our business only needs to analyze daily information, the date is our minimum-granularity dimension value. Our date dimension is then a minimum-granularity dimension table with one row per day, storing that day's dimension attributes.

If we need to distinguish the number of morning visitors from the number of afternoon visitors, our dimension needs to be accurate to morning and afternoon, and perhaps evening as well. That splits each day into three segments, giving three rows of data per day.

So what does the dimension table store? Let's look at a simple dimension table, DIM_DATE_TIME_UTC8, whose minimum granularity is the date:

| UTC+8 date | Start time (UTC)    | End time (UTC)      |
|------------|---------------------|---------------------|
| 2019.10.10 | 2019.10.09 16:00:00 | 2019.10.10 16:00:00 |
| 2019.10.11 | 2019.10.10 16:00:00 | 2019.10.11 16:00:00 |
| 2019.10.12 | 2019.10.11 16:00:00 | 2019.10.12 16:00:00 |
| 2019.10.13 | 2019.10.12 16:00:00 | 2019.10.13 16:00:00 |

Using the dimension table above, we can quickly map the UTC timestamps in the database to UTC+8 dates and generate report data that the back end can read directly. Similarly, we can create dimension tables for other time zones to compute dates across multiple time zones quickly.
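A minimal sketch of that lookup (the column names utc_cn_date, from_time, and end_time follow the query shown later in this article; the order table is hypothetical):

-- Map each order's UTC timestamp to its UTC+8 date via the dimension table.
select o.order_id,
       d.utc_cn_date as invoiced_date_utc8
from supermarket_sales_order as o
   join dim_date_time_utc8 as d
        on o.invoiced_time >= d.from_time
          and o.invoiced_time < d.end_time;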

Note that this dimension table can be generated by a single stored procedure. But instead of generating rows from the beginning of the universe to 1,000 years from now, generate them according to data needs. For example, if the report only needs to show the last two years, we only need dimension rows for the last two years. This reduces the amount of data scanned during analysis, and the time dimension table itself filters out data outside the time range.
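For illustration, a minimal sketch of such a generation script, assuming PostgreSQL (generate_series) and the column names used above:

-- Populate DIM_DATE_TIME_UTC8 for the last two years only.
insert into dim_date_time_utc8 (utc_cn_date, from_time, end_time)
select ts::date + 1                   as utc_cn_date, -- the UTC+8 calendar date this row covers
       ts::date + time '16:00:00'     as from_time,   -- UTC instant when that date begins
       ts::date + 1 + time '16:00:00' as end_time     -- UTC instant when it ends (exclusive)
from generate_series(current_date - interval '2 years',
                     current_date,
                     interval '1 day') as t(ts);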

Finding value in time segments

Let's take it a step further: what about morning, afternoon, and evening granularity? We create the time dimension table DIM_DATE_TIME_SEGMENT_UTC8, whose minimum granularity is a time segment:

| UTC+8 date | Segment | Start time (UTC)    | End time (UTC)      |
|------------|---------|---------------------|---------------------|
| 2019.10.10 | 1       | 2019.10.09 16:00:00 | 2019.10.09 23:00:00 |
| 2019.10.10 | 2       | 2019.10.09 23:00:00 | 2019.10.10 08:00:00 |
| 2019.10.10 | 3       | 2019.10.10 08:00:00 | 2019.10.10 16:00:00 |

I chose these segment boundaries arbitrarily. Now we need a second dimension table that describes the meaning of each segment, DIM_TIME_SEGMENT_STATUS_UTC8:

| Segment | Name      | English   |
|---------|-----------|-----------|
| 1       | Morning   | Forenoon  |
| 2       | Afternoon | Afternoon |
| 3       | Evening   | Night     |
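A minimal sketch of using the two tables together, assuming a hypothetical supermarket_visits table with a UTC visit_time column (segment-table column names are assumptions consistent with the tables above):

-- Count visitors per UTC+8 date and time segment.
select seg.utc_cn_date,
       seg_name.english as segment_name,
       count(*)         as visitors
from supermarket_visits as v
   join dim_date_time_segment_utc8 as seg
        on v.visit_time >= seg.from_time
          and v.visit_time < seg.end_time
   join dim_time_segment_status_utc8 as seg_name
        on seg.segment = seg_name.segment
group by seg.utc_cn_date, seg_name.english;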

Analyzing time comprehensively

Earlier I said we need to find the minimum granularity of time; what if we also need the maximum granularity? For example, monthly statistics or quarterly statistics?

We have two approaches. The first increases coupling but reduces the number of tables and join steps; without much introduction, here is the table structure:

| UTC+8 date | Start time (UTC)    | End time (UTC)      | Year | Month | Quarter |
|------------|---------------------|---------------------|------|-------|---------|
| 2019.10.10 | 2019.10.09 16:00:00 | 2019.10.10 16:00:00 | 2019 | 10    | 4       |
| 2019.10.11 | 2019.10.10 16:00:00 | 2019.10.11 16:00:00 | 2019 | 10    | 4       |
| 2019.10.12 | 2019.10.11 16:00:00 | 2019.10.12 16:00:00 | 2019 | 10    | 4       |
| 2019.10.13 | 2019.10.12 16:00:00 | 2019.10.13 16:00:00 | 2019 | 10    | 4       |

The year, month, and quarter columns are redundant with the date itself; they are precomputed precisely so that reports can group by them without extra calculation.

The other approach decouples: keep the original date dimension table unchanged, and move the UTC+8 date, year, month, and quarter fields into a separate table (dim_date_year_month below).

So far all you have seen is a bunch of tables; let's look at how to use them in code:

-- Convert each order's UTC invoiced_time to a UTC+8 date, month, and quarter
select dim_invoiced_date_time.utc_cn_date as invoiced_date,
       dim_invoiced_date_year_month.month as invoiced_month,
       dim_invoiced_date_year_month.quarter as invoiced_quarter
from supermarket_sales_order as sales_order
   join dim_date_time_utc8 as dim_invoiced_date_time
        -- half-open interval: >= start, < end (see the note below)
        on sales_order.invoiced_time >= dim_invoiced_date_time.from_time
          and sales_order.invoiced_time < dim_invoiced_date_time.end_time
   join dim_date_year_month as dim_invoiced_date_year_month
        on dim_invoiced_date_time.utc_cn_date = dim_invoiced_date_year_month.utc_cn_date

Because adjacent rows in the dimension table share boundary instants (one row's end time is the next row's start time), the join conditions must use greater-than-or-equal for the start and strictly-less-than for the end.

BETWEEN is not recommended. You must be clear about how BETWEEN handles boundaries in your current database (it is typically inclusive on both ends), and if you do use it, you must re-check the boundary values whenever you migrate to another database.
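A sketch of why, under standard SQL semantics:

-- between is inclusive on both ends, so an order invoiced exactly at
-- 2019.10.10 16:00:00 matches both the 2019.10.10 row and the 2019.10.11 row,
-- double-counting the boundary instant:
select count(*)
from supermarket_sales_order as o
   join dim_date_time_utc8 as d
        on o.invoiced_time between d.from_time and d.end_time;
-- The half-open form (>= from_time and < end_time) assigns every instant
-- to exactly one dimension row.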

Seizing the moments as time passes

Our reports often need to focus not only on the final state of the current business information, but also on its past states.

Take a supermarket order as an example: an order is created, paid, redeemed, refunded… there are multiple state changes. Suppose our business database records the time point of each change in a table like this:

| order_id | time                | status |
|----------|---------------------|--------|
| 1233     | 2019.10.09 16:00:00 | 1      |
| 1233     | 2019.10.10 16:00:00 | 2      |
| 1233     | 2019.10.11 16:00:00 | 3      |

This user seems very hesitant: one step per day.

Let's assume status has the following mapping: 1 = created, 2 = paid, 3 = refunded.

If the business records only this table structure, how do we use it in reports? Must we search for the maximum time per order every time just to get the current state, and compute the "payment hesitation period" by pairing each status row with the adjacent one for the same order? A sketch of that pain is shown below.
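As an illustration, assuming the change log above is stored in a hypothetical table order_status_log with the columns shown:

-- Time spent in the previous status of each order; the row with status = 2
-- gives the "payment hesitation period" (time from created to paid).
select order_id,
       status,
       time - lag(time) over (partition by order_id order by time)
           as time_in_previous_status
from order_status_log;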

For this problem, an object whose state changes at multiple points in time, we introduce the Slowly Changing Dimension (SCD), so that the current state, the duration of each state, and the state at any given point in time can all be analyzed quickly.

There are three approaches to SCD:

  • Keep only the latest: we only care about the final state, so we simply overwrite the record on every change (only the latest data is stored).
  • Start and end time: we record the start and end time of each state, with the end time of the latest state left null. Each time new data arrives, the end time of the previous record is filled in first. This either pushes the logic into the business runtime (causing update operations on business data), or the business records only a raw change log like the table above and the DW layer derives the intervals through a view (a large amount of computation, which calls for incremental calculation).
  • Chain storage: we don't care about the exact time intervals, only what the previous state was, or how the state kept changing, so each row simply stores the ID of the previous state's row (queries require traversal, but historical data is never modified).

Here's an example of the start-and-end-time approach:

| order_id | start_time          | end_time            | status |
|----------|---------------------|---------------------|--------|
| 1234     | 2019.10.09 16:00:00 | 2019.10.10 15:59:59 | 1      |
| 1234     | 2019.10.10 16:00:00 | 2019.10.11 15:59:59 | 2      |
| 1234     | 2019.10.11 16:00:00 | (null)              | 3      |

end_time can also be set equal to the next record's start_time; whichever convention you choose, keep it consistent, because it determines how your processing logic must handle interval boundaries.
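A minimal sketch of recording a new state under this approach, assuming a hypothetical table order_status_scd shaped like the example above and the end_time-equals-next-start_time convention:

-- Close the currently open record for the order...
update order_status_scd
set end_time = timestamp '2019-10-11 16:00:00'
where order_id = 1234
  and end_time is null;

-- ...then append the new state as the open record.
insert into order_status_scd (order_id, start_time, end_time, status)
values (1234, timestamp '2019-10-11 16:00:00', null, 3);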

Finally, you are welcome to test and compare the performance of the following three time-zone conversion methods on large data volumes:

  1. Converting time zones by joining the time dimension table

  2. Converting at read time through a view, using SQL's time calculation functions (a sketch follows this list)

  3. Converting with the time calculation functions built into the ETL tool
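For reference, a sketch of method 2 in PostgreSQL, assuming invoiced_time is stored as a UTC timestamp without time zone:

-- Convert at read time with SQL time functions instead of a dimension join.
select (invoiced_time at time zone 'UTC' at time zone 'Asia/Shanghai')::date
           as invoiced_date_utc8
from supermarket_sales_order;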

Copyright statement: this article is licensed under CC BY-SA 4.0; please attach the original source link and this statement when reprinting. Original link: Cologic Blog – Report Automation: Seize the Moment of Time – www.coologic.cn/2020/03/176…

WeChat official account: Coologic