The article directories
-
- preface
- Key concepts for dimensional modeling
-
- Measurement and Context
- Facts and dimensions
- The fact table
- A dimension table
- Star schema and Snowflake schema
- A general process for dimensional modeling
-
- 1. Select a service process
- 2. Define granularity
- 3. Determine dimensions
- 4. Establish your facts
preface
Whether it’s a Hadoop-based warehouse (Hive), a traditional MPP warehouse (Teradata), or a traditional Oracle, MySQL, or MS SQL Server relational database, we all face the following problems:
- How do you organize data in a data warehouse?
- How can you organize your data to make it easy and convenient to use?
- How can a data warehouse be organized to make it scalable and maintainable?
Ralph Kimball’s dimensional modeling theory answers and solves the above problems well.
The theory and technology of dimensional modeling is also the most widely used and recognized technology in the field of data warehouse. Today we are going to take an in-depth look at Ralph Kimball’s dimension modeling techniques, covering the basic theory, general process, dimension table design and fact table design, and laying the foundation for our future Hadoop data warehouse practice.
Key concepts for dimensional modeling
Measurement and Context
Dimensional modeling supports analysis of business processes, so it is achieved by modeling business process metrics.
So what is measurement?
In fact, we can easily identify metrics by talking to the business side, the demand side, or reading reports, charts, and so on.
Consider the following business requirements:
- How were the store’s sales last month?
- What are the store inventory trends?
- How is the store visited (PV, UV)?
- What percentage of regular customers visit the store?
Sales, inventory, visits, and repeat visitors are measures.
But talking about metrics alone is meaningless.
The concepts of measurement and environment form the basis of dimensional modeling. All dimension modeling is done through a detailed design of the metrics and their context and environment.
Facts and dimensions
In Kimball’s dimensional modeling theory, measures are called facts, and context and context are called dimensions.
In general, facts take the form of numbers and are usually surrounded by a lot of textual context.
The context in the form of text describes the “five W” information of facts (When, Where, What, Who and Why), which can be intuitively divided into independent logical blocks. Each independent logical block is a dimension. For example, an order can be intuitively divided into multiple dimensions, such as commodity, buyer and seller.
During dimensional modeling and design, information and analysis requirements can be easily categorized into facts and measures based on requirements descriptions or on existing reports.
For example, the demand of business personnel is “statistics the sales of the store last month according to the first-level category”. The description of “from the first-level category” clearly indicates that the demand side hopes to conduct statistical analysis on the sales of the first-level category, where the first-level category is a dimension. Similarly, “last month” is another dimension, and sales are clearly true.
The fact table
The fact table is the base table, or core table, in the dimensional model
In fact, all measures of a business process are stored in the fact table in dimension modeling, in addition to which the referenced dimensions are stored.
Fact tables are often closely related to an enterprise’s business processes, and since an enterprise’s business process data makes up the vast majority of all its data, fact tables also typically occupy the vast majority of data warehouse storage.
For example, for a supermarket, the detailed data of its sales usually accounts for the vast majority of its own data and is constantly accumulated and growing every day, while other data such as goods, stores, employees and equipment are relatively fixed and do not change much.
A row in the fact table corresponds to a measurement event
In fact, the measurement events corresponding to each row can be coarse or subtle. For example, for a supermarket, when designing its dimensional model, a row of the fact table representing customer purchase events can record either a customer receipt or a sub-item of the customer receipt.
So what level should we go to?
In dimensional modeling, it is believed that the fact table should contain the lowest, most atomic details, because this will bring the greatest flexibility. In dimensional modeling, the level of detail is called the granularity of the fact table. For example, the granularity of the fact table of customer purchase behavior should be small items, not small items.
The most commonly used measures in fact tables are generally numeric and additive
Such as small items, such as the number of sales, sales amount additivity for data analysis is crucial, because the data used in general not only retrieve the fact table data line, and often one-time retrieval hundreds, thousands or even millions of lines of the facts, and deal with so many lines of the most common thing is the most useful and put them together, add up and from different angles and dimensions.
But not all measures in the fact table are additive. Some are semi-additive and some are non-additive
Semi-additive fact means that only certain dimensions can be added, such as inventory, which can add up the inventory of warehouses in various places, or different items in a warehouse, but obviously cannot add up the inventory of the same item in a warehouse at different times.
The bank account balance is also a semi-additive example. You can add up the account balance of different branches or different account holders, but you cannot add up the account balance of different months.
Nonadditive facts, such as the price of an item or the status of an order, cannot be added at all.
In addition to the stored facts, a fact table will contain multiple related foreign keys
Use to associate and join the corresponding dimension tables.
For example, an order fact table might contain an item foreign key connected to an item table, a buyer foreign key connected to a member table, or a store foreign key connected to a store table, and so on.
It is through these foreign keys that all angles and dimensions of analysis can be carried out.
The fact table can be divided into transaction fact table, periodic snapshot fact table and cumulative snapshot fact table according to the role division of granularity.
- Transaction fact table is used to carry transaction data, usually with low granularity, such as product transaction fact and ATM transaction fact.
- The periodic snapshot fact table records the accumulated service data at a regular and fixed interval in a large granularity, for example, the monthly average account balance fact table.
- Cumulative snapshot fact tables are used to record the entire information of a business process with a time span and are usually relatively rare.
It should be noted that a fact table can only have one granularity when designing the fact table, and facts of different granularity cannot be established in the same fact table.
A dimension table
Dimension table is the soul of dimension modeling. Generally speaking, the design of dimension table directly determines the quality of dimension modeling
Dimension tables contain the context and context of the business process measures recorded by real tables, and they usually contain many description and label fields, in addition to the “five W’s” information.
Dimension tables usually have multiple columns or attributes
In practice, dimension tables containing dozens or even hundreds of attributes are not uncommon. Dimension tables should include as many meaningful textual descriptions as possible to facilitate downstream use.
Dimensional attributes are the basic source of query WHERE conditions, groups (SQL group statements), and report label generation. In query and report requirements, attributes are identified by the word by.
Dimensional attributes play an important role in data warehouse
Because they are the source of virtually all the interesting constraints and report labels, they are key to making a data warehouse easy to learn and use. In many ways, a data warehouse is nothing more than a manifestation of dimensional properties.
The capabilities of the data warehouse are directly proportional to the quality and depth of the dimension attributes.
- The more time you spend providing detailed business language attributes, the better the data warehouse will be;
- The more time you spend on a given aspect of the attribute column values, the better the data warehouse;
- The more time you spend ensuring the quality of the attribute column values, the better the data warehouse will be.
The dimension table is the entry point to the fact table
Rich dimensional attributes give rich analytical cutting capabilities. Dimensions provide the user with an interface to the data warehouse, the best properties are textual and discrete, and the properties should be real words rather than some coded shorthand notation.
We should try to minimize the use of encoding in dimension tables by replacing it with more detailed text attributes.
Sometimes when designing a database, it is not clear whether a numeric data field extracted from a data source should be treated as a fact or a dimensional attribute. The decision is usually made by looking at a field as a metric with many values and participating in the operation (when looking at facts). It is also a description of discrete values that do not vary much and act as constraints (when viewed as dimensional attributes).
Star schema and Snowflake schema
Once you understand fact tables and dimension tables, the next question is how to combine them. In dimension modeling, there are two basic architectures for combining dimension tables and fact tables: star schema and Snowflake schema.
When all dimension tables are directly connected to the fact table, the overall composition is shaped like a star, so it is called a star schema.
Star architecture is an informal structure with redundant data storage. For example, consider the dimension table of goods, whose brand information exists in every row of goods, including its brand ID, name, brand owner, etc.
Generally, many goods have the same brand, so the brand information in the commodity dimension table is stored many times repeatedly, which is redundant.
When one or more dimension tables are not directly connected to the fact table, but are connected to the fact table through other dimension tables, the overall composition is shaped like a snowflake. This architecture is called a snowflake architecture.
The Snowflake architecture is a normalization of the dimension table of the star architecture, such as the goods table example above. In the Snowflake architecture, each row stores only the brand ID, and all other information about the brand (including brand name, owner, registered place, and all other descriptive information) is stored in a separate brand dimension table. Through the foreign key of brand ID, the product list can obtain all brand description information indirectly.
The Snowflake architecture saves some storage by eliminating data redundancy, but it also brings inconvenience to downstream users
If the downstream users need to analyze the sales of the brand, they must first associate the goods table with the order table, and then associate the brand table with the goods table. Because of this, the snowflake schema is rarely used in the real world of dimensional modeling.
Sometimes simple solutions are the most beautiful, the most powerful, and the most effective
This is the case with star-based dimensional modeling. The star architecture sacrifices some storage redundancy, but provides extreme ease of use and low cost of use and learning for downstream users.
Even for business people with no technical or dimensional modeling background, this is easy to understand, and the current storage cost is extremely low. The extra storage cost is very cost-effective compared to the cost of each subsequent association calculation, user use and learning.
Star architecture, each dimension are equal, all dimension table is the equivalence entrance into the fact table, the user can from any dimensions, any dimension properties, or any number of dimensions combination, any number of dimension attributes combination, easy for data filtering and aggregation (summary, mean, maximum, minimum, etc.) operation, intuitive and very accord with business analysis.
The business is changing, and models must be designed to withstand the changing needs of the business. In practical design, new dimensions can be added or dimension attributes added to dimension tables to meet the analysis requirements of new business perspectives.
In most cases, the star architecture is used in the design of the data warehouse model, but in some special cases, such as when a bridge table must be used, the Snowflake architecture must be used.
A general process for dimensional modeling
Dimensional modeling typically takes a sequential approach to design, that is, selecting a business process, defining granularity, determining dimensions, and determining facts.
This step of dimensional modeling runs through the entire process and links of dimensional modeling, as described below.
1. Select a service process
Business processes are the business activities of enterprises and organizations, which are generally supported by corresponding source business systems.
For a supermarket, its most basic business activity is the user cashier pay; For an insurance company, the most basic business activities are claims and policies. Of course, in practice, business activities may not be so simple and direct, and listening to users’ opinions is usually the most efficient way to do this.
It is important to note, however, that the business process is not referred to as a business unit or function. In model design, focus on business processes rather than business departments. If you build dimensional models that are tied to departments, you cannot avoid data inconsistencies (business codes, meanings, etc.). Therefore, the best way to ensure data consistency is to establish a single, consistent dimensional model for a business process from the overall and holistic perspective of the enterprise and the company.
2. Define granularity
Defining granularity means giving a clear indication of what the fact table rows actually represent and what they mean, and granularity conveys information about the degree of detail associated with the fact table measures. The essence is how to describe individual rows of a fact table.
Typical granularity definitions include:
- Each subitem of a supermarket customer’s receipt;
- Detailed sub-items of hospital bills;
- Every deposit or withdrawal from an individual bank account;
- A snapshot of the monthly balance of your personal bank account;
For dimension design, it is important to agree on the granularity of the fact table, and without a clear definition of granularity, you cannot proceed to the next step.
During the process of defining granularity, the most atomic granularity of the business process should be selected to maximize the subsequent flexibility and meet the analysis needs of business users at any granularity.
3. Determine dimensions
Once granularity is defined, the details of the relevant business processes are determined, and the corresponding dimensions can be easily identified. As mentioned above.
A dimension is a description of the context and context of the measure
Through dimensions, business process measures and facts are enriched and fleshed out. For orders, common dimensions include item, date, buyer, seller, store, and so on.
Each dimension can also contain a large number of description information, for example, the commodity dimension table will contain the commodity name, label price, brand, category, online time and so on.
4. Establish your facts
Determine the facts through business process analysis of what might be analyzed. Once granularity is defined, facts and metrics are generally easy to determine, such as order activity in a supermarket, where the relevant metrics are obviously quantity sold and dollar sold.
In the actual dimensions in the design of the facts, may also run into measurement resolution of problems, such as supermarket for single receipts up to RMB 100 minus 10 activities, if small amount of more than 10 yuan, how the $10 discount amount assigned to each item docket in practical design, can discuss and business specific and formulate concrete split allocation algorithm.