0 x00 preface

While the previous article briefly introduced several commonly used data models, this article will provide an in-depth understanding of the most common of them, dimensional modeling.

0x01 What is Dimensional modeling

Dimensional Modeling was championed by Ralph Kimball, another master in The field of data warehousing, whose book The DataWarehouse Toolkit-The Complete Guide to Dimensona Modeling, It is the most popular data warehouse modeling classic in data warehouse engineering field.

According to the book, dimensional modeling does not require that the dimensional model satisfy the third paradigm. The emphasis on 3NF in the database is primarily to eliminate redundancy. Normalized 3NF divides the data into multiple distinct entities, each of which constitutes a relational table. An order database, for example, might start with one row per order representing one record, and then evolve into a spider’s web to satisfy 3NF, perhaps containing hundreds of normalized tables. Also, for BI queries, the canonical models are too complex for users to understand and document their use. Dimensional modeling solves the problem of overly complex patterns.

Let’s explain dimensional modeling in a different way. Those who have studied database should know star model, which is a typical dimensional model in the design of data warehouse. When we do dimensional modeling, we build a fact table, and that fact table is the center of the star model, and then we have a bunch of dimension tables, and those dimension tables are stars that are radiating out. What is a fact table and what is a dimension table?

0x02 Basic Concepts

There are some important concepts in dimensional modeling, and once you understand these concepts, you basically understand what dimensional modeling is.

To understand this, let’s start with a business scenario: chat! Such as SMS chat, software chat these chat scenarios. Below is a simple star model we designed for a chat scenario with three of the most commonly used concepts: fact tables, dimension tables, and metrics.

1. The fact table

Measurable values generated by operational events that occur in the real world are stored in fact tables. At the lowest level of granularity, a fact table row corresponds to a measure event and vice versa.

Well, it’s not easy to understand what a fact sheet is. The book is written so, just when the entry of the time to see the time a face meng force.

Going back to the previous figure, the chat table in the middle is a fact table. You can understand that it is an operational event that happens in reality. Every time we send a message to a friend, it is a fact, and its representation in the table is a record.

We can go back and look at the characteristics of the fact table, there’s no actual content in the dimension table, it’s a collection of primary keys, and each of these ids corresponds to a record in the dimension table.

2. The dimension table

Each dimension table contains a single primary key column. The primary key of a dimension table can act as a foreign key for any fact table associated with it, and of course the description environment of a dimension table row should correspond exactly to the fact table row. Dimension tables are generally wide, flat, non-standard tables that contain a large number of low-granularity text attributes.

The four tables surrounding the CHAT table in our figure are dimension tables, each of which has a unique primary key and stores detailed data information in the table.

3. The measurements

What are the metrics? Take a look at the last field in the Chat table: content_length, which tells us how many words we send in a text message. A metric is a measure of an action, such as the length of a text message, the duration of a phone call, or the amount of an order.

0 x03 practice

Let’s take a chat scenario as an example to walk through how dimensional modeling can be done, and give an example of how to use the model (which is still important).

1. Business scenarios

Suppose we are working in a communication company, and the company needs to model the business. We first limit the business scenario to chat, which has the following elements:

  1. The initiation of a chat requires the participation of several individuals: sender, receiver, time, IP, and device.

  2. A user can send messages to multiple users at different times.

Well, based on those two points, let’s design our model.

Second, model design

First of all, let me think about what we can do if we don’t model our design according to what dimension?

If I were you, I would design the following table. Believe it or not, I can list 50 fields!

If we had designed it as a dimension model, we would have gotten this table, which was posted earlier.

In fact, I personally think how to design this table has its rationality, whether we are right or wrong, just say the advantages and disadvantages of both.

Let’s start with our dimensional model:

  1. Small data redundancy (because a lot of specific information is stored in the corresponding dimension table, for example, there is only one user information)

  2. Clear structure (table structure at a glance)

  3. Easy to do OLAP analysis (data analysis is fun to use)

  4. Increased usage costs, such as multiple tables associated with queries

  5. Data inconsistencies, such as when a user initiates a purchase, and data stored in our dimension table

Here are the pros and cons of our big watch:

  1. Business intuitive, when doing business, this table is particularly convenient, directly to the business.

  2. Easy to use, write SQL is very convenient.

  3. The data redundancy is huge, really huge, and with hundreds of millions of users, his order behavior will be terrible

  4. The granularity is rigid, everything is written dead, and the table is not reusable.

Three, use examples

Data models have to be built to serve better applications, so let me give you an example to get a real feel for how to use our models.

Demand: Find out how many characters men sent to women in each city of Guangdong province in 2017.

Implementation:

Is the implementation simple? And then there are all kinds of up and down operations that can be done based on these tables.

0 XFF summary

Dimension modeling is a very excellent modeling method, which has many advantages, but it is difficult for us to completely implement it in practical work. There will be trade-offs, for example, we still need some wide tables for business, and sometimes there will be a lot of data redundancy.

Dimension models are supported in many open source systems, such as Kylin, which uses the star model in dimension modeling, and also supports the snowflake model in the latest version.