preface
The main content of this note is The Data Warehouse Toolkit, which can be called The Bible of Warehouse modeling
What is a star model
Start with a main table of business facts. An order, for example, is a business fact. The order has the SKU information of the goods, the sales market information, the date information, these basic attributes, called dimension.
snow
A product dimension, with its own classification, packaging and other information, is also made into a separate table, around the fact table, like a snowflake.
Why star model
- OLTP is designed for online transactions and write-heavy scenarios, so the granularity is fine. The application scenario of warehouse model is data analysis, which involves a large number of queries, so it needs less association and more integration
- Reduce the difficulty and complexity of business understanding, some business facts, across many tables, even across many libraries, such as the life cycle of an order, involving the order team, warehousing team, logistics team. If you don’t model, you need all the people who work with the data to understand the business details of the table’s data structure
Three model types
In summary, there are three types of fact tables.
- A transaction fact sheet, such as a record of a sale
- Periodic snapshot table. Records snapshots of service entities at a certain period of time. For example, keeping track of daily sales of promotional items
- A cumulative snapshot table that records the fact table of some columns of business process changes for a business entity
Warehouse data models often have all three to cope with different OLAP scenarios. The differences between the three are as follows
Cumulative fact model
For some business entities, a series of business process changes occur. In the fact table, a record is used to record all information about the key processes of the business entity, and the record is updated with the occurrence of each business event. In other words, a record accumulates various changes, which is called the cumulative snapshot table. For example, the whole process of a commodity entering the warehouse may include receipt, inspection, packing, transportation, etc. The model design example is as follows:
An example of a row change is as follows:
The cumulative snapshot fact table ensures a finite number of process nodes. Dynamic any number of processes is not suitable for cumulative snapshot fact tables because they change too frequently.
How is a model defined
- Choose a business fact. A fact must answer who, when, where, what was done, why and how? (who, when, where, what, why, and how)
- Define the granularity of data. The granularity of the fact table must be fine, so as to flexibly carry the indicator calculation of the higher logical caliber.
- Identify dimension table
- Identity fact table
How do I respond to dimension table changes
Dimension tables are relatively stable, but not completely immutable. For example, the user’s information dimension table can change the user’s age, address and so on. So how do you deal with these changes? There are mainly the following kinds
- Retains the original value
- Change dimension table property values
- Zipper table
- Add new fields to record old data
- Mini-Dimension
- hybrid
Retains the original value
The fact table stores raw values, not the associated dimension table. In this way, when the related property changes, the value corresponding to the new fact table record is the new property value. The current price of an item, for example, can be stored directly on a fact table
Modify dimension table attribute values
Directly modify the value of the dimension table corresponding to the changed field. This will lead to old data in the fact table, associated dimension table, also get new values, resulting in changes in the report results, OLAP Cube has to be recalculated and other problems. So be careful with this approach. Examples are as follows:
Zipper table
Add data validity time fields to the dimension table and turn it into a zipper table so that the dimension table can not only reflect current values, but also record historical information. The following is an example:
Each record has an ID, and the fact table is associated with the dimension ID at that time.
Add new fields to record old data
In some report scenarios, old fact table records may need to be unassociated based on new dimension information, and changes are not frequent. This requirement can be met by adding fields to the dimension table to record old data so that both new and old data exist. The following is an example:
Mini-Dimension
Some dimension tables have a large number of data fields, and some field data changes frequently. If zippered table design is adopted, the data in dimension tables will increase rapidly. So the zipper watch is not a good design. For example, in the user information, the information of origin, nationality and so on changes little. But users’ income levels and location range change frequently. So you can pull out these variations and make a mini-dimension table. Also to reduce the number of items in the mini-dimension table. You can use ranges as field values, such as a possible user mini-dimension table as shown in the following example:
The final model design is as follows:
Mini-dimensions record all possible combinations of property values that change frequently. The fact table is associated with the mini-dimension table to cover each of the changing requirements. Range values are used because of the mini-dimension table. So if you want to get the value of a specific attribute it is impossible. Therefore, for this kind of business data that changes frequently and has a large amount of data, we can find a way to make fact tables instead of dimension tables.
What about hierarchical dimensions
For example, a department has its own parent department and its own child department. How does this hierarchical dimension manifest itself in a dimension table? There are finite and infinite levels.
The hierarchy is small and finite
A limited number of levels, such as province, city, district, county, up to the village, can not be infinite. For this hierarchy, multiple fields can be displayed on a single line.
Province | city | zone | county |
---|
Unfixed hierarchy
Departments, for example, have a fluid hierarchy. It is not practical to add a field to the above line. At this point, you need to specify the relationship between the two rows through a more abstract parent-child relationship. For example, add parent key:
Some modeling specifications
Level all tiers
Some of the dimensional information itself is multi-tiered, such as product terms a brand, a brand belongs to a category. They’re all many-to-one relationships. When we build a product table, we should not build multiple tables in the same way as in a transactional database, but rather level the hierarchy. Easy to query, improve performance, because dimension table data volume is generally small, do not worry about redundancy. For example, a product dimension is expressed as follows:
Extract the information implicit in the code
Some businesses have business codes. A business code contains a variety of information, such as the first two digits representing regions, the middle two digits representing provinces, etc. In addition to storing the business code, we also store the implied information in the code in fields. The goal is to store all kinds of data in a fine-grained and intuitive way to facilitate report calculation and improve computing performance.
Avoid null fields
If a field, especially one that will be used as an association, should be avoided as much as possible. Because NULL can cause table association, or data statistics, presentation problems. So when a value is missing, use some special value instead, such as -1, or a text to fill the corresponding field. In the case of null in a dimension field, a record can be created in the dimension table that is specifically interpreted as empty, again populated with descriptive text. This ID is then associated in the fact table.
Degenerate Dimensions
Fields in the fact table can hold degenerate dimension primary keys in addition to normal dimension table primary keys. For example, if there is a productId resource in the fact table, the corresponding dimension table is the Product table. However, the business event corresponding to a record in the fact table, its ID in the relational database, or the invoice number, etc., are still meaningful for the subsequent calculation of the report. This field is still stored in the fact table as a dimension, but it does not have a standard dimension table like the product ID. We call this a degenerate dimension.
Factless Fact Tables that did not occur
Normally, we would model business events that actually occur, such as promotional items with records of actual purchases. Its data model is designed as follows:
But the data model in this way, but the fact model cannot answer which promotional item has not been purchased on the specified date. Because there is no purchase, there is no retail record, no information in the database at all, and no statistics. At this point, we can add another fact table that records only the items that participate in the promotion each day. At the same time, create a promotion dimension table to describe specific promotion information. The model design is as follows:
So in order to answer the previous question, the first step is to query a certain day according to the date to find out the commodities in the promotion fact table that day. The second step is to find out what items were actually sold that day from the actual sales fact sheet you created at the beginning. This question can be solved by making an external correlation between the two data.
Dimension table Primary Keys
Dimension tables do not use ids that correspond to operational databases, that is, do not use THE IDS of OLTP tables, but generate their own IDS of type Int. There are several reasons.
- The OLTP database mainly deals with online services, where the table structure and service data are added and updated frequently. Original OLTP tables may be divided into different databases and different tables as the volume of services increases
- The data source for dimension tables may be several tables in OLTP that do not have a fixed primary key
- To improve performance, the dimension table does not have a lot of data, so int is more than enough. The ID of the dimension table is stored by the fact table as a foreign key. The fact table usually has a large amount of data. Therefore, selecting a smaller storage type saves more storage space
- It is convenient to make some business records that do not exist in the OLTP library, such as record Promotion dimension table, record a record of No Promotion. Facilitates fact table association.
Primary key of the fact table
Although a fact table record can be uniquely identified by the combination of dimension foreign keys stored in it, it is better to set a separate fact table primary key like the dimension table primary key, which has the following benefits:
- In some cases, the primary key ID at the breakpoint can be recorded to determine the amount of recovery, since the primary key is usually ordered
- Quickly uniquely locate a fact table record
- Can be used as an association key when multiple fact tables have parent-child relationships
Multi-role dimension table
There may be multiple fields in the fact table, using the same dimension table. And then finally doing report calculations based on fact tables is going to be a problem. It is impossible to join the same dimension table twice. The solution is to create different dimension tables based on the common dimension tables and have the fact tables associate these new dimension tables, which can be entity tables or views. Suppose a fact table has multiple fields associated with a time dimension table, an example of this treatment is modeled as follows:
There are multiple extend index attributes in the fact table, whose function is to calculate some indicators that can be calculated in advance, to improve report performance, unified calculation caliber. That is, if they are uncontroversial and widely used
Fact table generic granularity
For example, the order has the order details of the business. We can build models based on the fact of order details. But how is the total amount of the order apportioned to the order details, according to what rules? If there is no apportionment, the total amount of the order is directly redundant in the order details. In some scenarios, the amount will be overcalculated in the aggregation calculation.
Therefore, it is best to do both the master table facts and the detail table facts, and then associate them with the primary foreign key. At the same time, for the sake of performance, some order attributes that do not participate in the calculation can be redundant to the order details. If only the order details fact table is used, there will not be too many tables associated. The following is a bad case, the customer information of the main table of orders, there is no redundancy to the detail table
Real-time calculation and counting warehouse
Usually data warehouse data is T+1. However, some business scenarios require real-time data. There are two types of real-time data:
- Instantaneous data changes data that are directly connected to the source data. In order to reduce the data delay, no ETL processing is performed in the middle.
- Intra-day (intra-day), the same as ordinary T+1 data processing, only to pull the source data several times in a Day, and then go through the complete ETL process
For the second, you also need to model, but the associated dimension tables need to respond in real time to changes in the day’s data
Macro process
Initiate meetings to establish the responsibilities of relevant personnel
Business owners, business drivers, project managers, data architecture, system architecture, business users, developers. Among them, it is very important to establish the authority responsible person who knows the business. Because the core purpose of the warehouse system is to meet the needs of the business side. It takes a lot of understanding and grooming of the business, and it takes very capable and powerful business leaders.
Macroscopic model combing
Use matrix diagrams to model the overall business of the company, identify common dimension tables, and unify various business terms.
- If the terminology is not uniform
In the final report calculated by data warehouse, different teams may have different names, resulting in ambiguity and mutual inunderstanding, which will increase the threshold of users
- If you don’t unify the dimensions
For example, the time dimension is inconsistent, for example, the sequential time period in the two time dimensions is inconsistent, which may lead to different values of the same indicator in the two reports. Conversely, the time dimension is unified so that two fact models using the same dimension table can join analysis
Computational selection and modeling
Training use, later maintenance upgrade
Unify all kinds of terms to train all kinds of business personnel in the relevant system knowledge, reduce information asymmetry, listen to their difficulties and pain points, and carry out system iteration and upgrade.
Data quality control
Data cleaning and data quality verification should be put in the initial stage of data architecture as far as possible to detect problems as soon as possible. Data Quality screening is mainly divided into three parts: 1, the field quality check, check the empty fields are free value, check whether the format of the field data meet the requirements 2, data structure, such as a two tables have a parent-child relationship, need to check whether this parent-child relationships exist 3, business rules, such as aviation business, check the source system to platinum user, whether the standard mileage? The processing of the problem data is as follows: 1. Interrupt the ETL processing process. It is not recommended that you immediately intervene to solve the problem and restart the etL process
The recording of faulty data
Some trampling best practices
- To lay a solid foundation, we usually put the star model on the DWD layer, and DWD must be built well. Great oaks grow from little acorns. The subsequent DWD model must have a design and review process.
- The construction of data quality system, the data quality of each layer of the warehouse monitoring, timely alarm, data quality management system should be embedded in each layer of the warehouse
- Building a metadata management system, this is a map of the warehouse.
- All report development, in principle, does not allow direct use of ODS layer tables, which can lead to data smokestacks
If there is no quality management system
- Report quality problems are found only on the user side. Troubleshooting takes time and requires the entire link
- After troubleshooting problems, rerun data, rerun time
If there is no metadata management system
- No map warehouse construction will be out of control, will be out of control, will be blind
- Counting capacity, word of mouth, leaving lost
- Unable to answer the leader’s soul torture, how many statements did? What report did you make? What are the assets of a warehouse?