**Abstract:** This article walks you through a comprehensive knowledge system for data modeling, with detailed explanations of the three normal forms, the star model, the snowflake model, the constellation model, modeling specifications, and more.
1. Foreword
Data modeling sounds very technical at first, and to the uninitiated it can feel lofty and hard to grasp.
In the current era, the volume of data is massive and keeps growing, so the question for enterprises is: how can they quickly and accurately extract the information they need from all this data?
2. What is data modeling
In simple terms, data modeling means integrating and associating various data based on an understanding of the business, ultimately improving the usability and readability of that data, so that users can quickly obtain the valuable information they care about and respond in time, thereby bringing benefits to the company.
3. Why modeling
Data modeling is a methodology that mainly guides how data is integrated and stored, emphasizing that data should be stored reasonably from every perspective. A proper data model brings many benefits:
- Improved query performance
- Improved user efficiency and user experience
- Improved data quality
- And so on
So big data systems need data modeling methods to organize and store data better, in order to strike a balance between performance, cost, efficiency, and quality.
4. Common modeling tools
PowerDesigner:
PowerDesigner is Sybase's CASE tool set, which makes it easy to analyze and design management information systems and covers almost the entire process of database model design. PowerDesigner can be used to create data flow diagrams, conceptual data models, physical data models, and structural models for data warehouses, and it also supports managing team design models. It works with many popular software development tools, such as PowerBuilder, Delphi, and VB, to shorten development time and optimize the system design.
PowerDesigner is a powerful database design tool commonly used by developers. It can design a database at both the conceptual data model and physical data model levels. Here, the conceptual data model describes entity definitions and entity-relationship definitions independently of any database management system (DBMS), while the physical data model is the concrete realization of the conceptual data model on the target DBMS.
5. Differences between business system and data warehouse modeling
Business systems typically face random reads and writes against the business database, so the third normal form (3NF) model is mainly used to store data.
Data warehouse modeling, by contrast, mainly deals with batch reads: transactions are not the concern; what matters is data integration and query performance. Other modeling methods are therefore adopted, of which Kimball dimensional modeling is the most classic.
6. Kimball and Inmon architectures
6.1 Inmon architecture
The hub-and-spoke Corporate Information Factory (CIF) approach was pioneered by Bill Inmon and other industry advocates. In this environment, data is obtained from operational data sources and processed in the ETL system. This process is called data acquisition, and the atomic data it produces is stored in a database that satisfies the third normal form. This normalized atomic data repository is called the enterprise data warehouse (EDW) in the CIF architecture.
Like the Kimball approach, CIF advocates coordinating and integrating enterprise data, but CIF holds that a normalized EDW should fulfill this role, whereas the Kimball architecture emphasizes the key role of an enterprise bus with conformed dimensions.
Analysis databases in an Inmon enterprise data warehouse are typically department-centric (rather than organized around business processes) and contain summary data rather than atomic-level data. If the business rules applied to the data during ETL go beyond basic summarization, such as department renaming or similar calculations, it becomes difficult to relate the analysis databases back to the EDW's atomic data.
6.2 Kimball architecture
The Kimball architecture leverages the EDW at the center of CIF, but this time the EDW is completely isolated from analysis and reporting users and serves only as a data source, where the data is dimensional, atomic, process-centric, and aligned with the enterprise data warehouse bus architecture.
6.3 Architecture Comparison
- Process
The Inmon architecture is top-down, i.e., data extraction -> data warehouse -> data mart. It is data-source-oriented, follows a waterfall development approach, and its model leans toward 3NF.
The Kimball architecture is bottom-up, i.e., data mart (divided by subject) -> data warehouse -> data extraction. It is requirements-oriented and typically uses the star model.
- Fact tables and dimension tables
The Inmon architecture does not emphasize the concepts of fact tables and dimension tables, because the data sources may vary greatly; it places more emphasis on data cleansing.
The Kimball architecture emphasizes that the model consists of fact tables and dimension tables and pays close attention to their design.
- Data marts
In the Inmon architecture, data marts have their own physical storage; they are real.
In the Kimball architecture, the data mart is a logical concept: it is just a subject-domain partition of the multidimensional data warehouse and has no physical storage of its own, so it can be called a virtual data mart. It is an access layer of the data warehouse, a collection of data organized by subject domain to support departmental decision making.
- Center
The Inmon architecture is department-centric, while the Kimball architecture is business-process-centric
- EDW access
The Inmon architecture lets users access the enterprise data warehouse (EDW) directly.
In the Kimball architecture, users do not access the enterprise data warehouse (EDW) directly; they access only the presentation-area data.
6.4 Summary
Kimball dimensional modeling is the common choice in enterprise development.
7. Several ways of data modeling
7.1 ER model
The ER model follows the three normal forms and is an enterprise-level abstraction of subjects rather than a description of an individual business process.
7.1.1 Normal forms
A relation whose attributes cannot be subdivided any further is normalized, and decomposing a lower normal form into a higher normal form is called normalization.
Data tables can satisfy 1NF through 5NF, with the first normal form being the minimum requirement and the fifth normal form the strictest.
The most commonly used are the first normal form (1NF), second normal form (2NF), and third normal form (3NF).
7.1.2 First normal form (1NF)
Every column in a table must hold an indivisible, atomic value.
As can be seen from the figure above, the phone field contains two values; it is divisible and therefore does not conform to 1NF. It can be changed to:
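Since the before/after tables appear as figures in the original article, here is a minimal sketch in Python (with hypothetical sample data) of the same 1NF fix:

```python
# Before 1NF: the "phone" field packs two values into one column (not atomic).
users_not_1nf = [
    {"user_id": 1, "name": "Alice", "phone": "138-0000-0001, 139-0000-0002"},
]

# After 1NF: split the multi-valued field so every column holds a single atomic value,
# e.g. one row per phone number (or separate columns, depending on the business).
user_phones_1nf = [
    {"user_id": 1, "name": "Alice", "phone": "138-0000-0001"},
    {"user_id": 1, "name": "Alice", "phone": "139-0000-0002"},
]
```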
7.1.3 Second normal form (2NF)
The second normal form must satisfy the following two conditions at the same time:
- Satisfies the first normal form.
- Has no partial dependencies.
As can be seen from the figure above, if a user places many orders, the user name, shipping address, and mobile phone number are repeated on every order row, causing data redundancy. This clearly does not conform to the second normal form and can be changed to:
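A minimal sketch of the 2NF fix described above, again with hypothetical order data: the user attributes depend only on the user (a partial dependency), so they move to their own table:

```python
# Before 2NF: user name / address / phone repeat on every order row
# (a partial dependency on the user, not on the whole order), causing redundancy.
orders_not_2nf = [
    {"order_id": 101, "user_id": 1, "user_name": "Alice", "address": "City D", "phone": "138-0000-0001"},
    {"order_id": 102, "user_id": 1, "user_name": "Alice", "address": "City D", "phone": "138-0000-0001"},
]

# After 2NF: order rows keep only the user_id; user attributes live in one place.
orders_2nf = [
    {"order_id": 101, "user_id": 1},
    {"order_id": 102, "user_id": 1},
]
users = [
    {"user_id": 1, "user_name": "Alice", "address": "City D", "phone": "138-0000-0001"},
]
```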
7.1.4 Third normal form (3NF)
The third normal form must satisfy the following two conditions at the same time:
- Satisfies the second normal form.
- Has no transitive dependencies.
Put simply, a transitive dependency means the same information is duplicated because one attribute can be derived from another.
As shown in the figure above, the province and city can be derived from the zip code, and conversely the zip code can be derived from the province and city. This transitive dependency causes redundancy, does not conform to the third normal form, and needs to be transformed:
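A minimal sketch of removing the transitive dependency described above, with hypothetical zip-code and region values:

```python
# Before 3NF: province/city and zip code can be derived from each other
# (a transitive dependency through the address), so the data is redundant.
users_not_3nf = [
    {"user_id": 1, "province": "B", "city": "C", "zip_code": "100000"},
]

# After 3NF: keep only the zip code on the user record and move the mapping
# to a separate region table (or the reverse, depending on query needs).
users_3nf = [
    {"user_id": 1, "zip_code": "100000"},
]
regions = [
    {"zip_code": "100000", "province": "B", "city": "C"},
]
```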
7.1.5 Summary
When designing a relational data model, it is generally necessary to satisfy the third normal form. A table with a well-designed primary-key/foreign-key structure should already satisfy 3NF.
The benefit of normalization is that it reduces data redundancy and thus improves the efficiency of data updates while ensuring data integrity. However, we should also guard against over-normalization in practice: the more normalized the model, the more tables it is split into, and the more likely queries are to require table joins.
If too many tables are joined, query performance suffers. The key is to carefully balance data querying against data updating and to determine the most appropriate level of normalization based on business requirements. It is also important not to distort business requirements just to follow strict normalization rules.
7.2 Dimensional Modeling
Dimensional modeling is a logical design method for structuring large amounts of data using dimensions and measures (indicators). Unlike the ER model, which aims to eliminate redundant data, dimensional modeling is analysis-oriented: its ultimate goal is to improve query performance, so it increases data redundancy and violates the three normal forms.
Dimensional modeling also focuses on enabling users to complete requirements analysis quickly and respond to complex queries in a timely manner. Dimensional models generally fall into three types:
- Star model
- Snowflake model
- Constellation model
The most common of these is actually the star model.
7.2.1 Background
In multidimensional-analysis business intelligence solutions, common models are divided into the star model, snowflake model, and constellation model according to the relationship between the fact table and the dimension tables. When designing a logical data model, consider whether the data should be organized as a star, snowflake, or constellation model.
7.2.2 Star model
In a star model, there is one fact table and zero or more dimension tables. The fact table and dimension tables are related through primary-key/foreign-key relationships, and there are no relationships among the dimension tables. Because all dimension tables connect directly to the fact table, the whole diagram looks like a star. The star model is the simplest and most commonly used model. Because the star model has only one large table (the fact table), it is more suitable for big data processing than the other models, and the other models can be transformed into star models.
The star structure is a non-normalized structure: every dimension of the cube connects directly to the fact table, the dimension hierarchies are not split into separate tables, and so the data carries some redundancy. For example, if the region dimension table contains the two records "country A, province B, city C" and "country A, province B, city D", the information for country A and province B is stored twice, i.e., redundantly.
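A minimal sketch in Python/pandas of a star-schema query over hypothetical sales data (table and column names are illustrative, not from the article): every dimension joins directly to the single fact table.

```python
import pandas as pd

# Fact table: one row per sale, with foreign keys to the dimension tables.
fact_sales = pd.DataFrame({
    "date_key": [20240101, 20240101],
    "region_key": [1, 2],
    "amount": [100.0, 250.0],
})

# Denormalized region dimension: country/province/city kept in one table (redundant by design).
dim_region = pd.DataFrame({
    "region_key": [1, 2],
    "country": ["A", "A"],
    "province": ["B", "B"],
    "city": ["C", "D"],
})

# A typical star-schema query: one join per dimension, then aggregate.
result = (fact_sales
          .merge(dim_region, on="region_key")
          .groupby("province")["amount"].sum())
print(result)
```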
7.2.3 Snowflake model
When one or more dimension tables are not connected to the fact table directly but through other dimension tables, the diagram looks like several snowflakes joined together, hence the name snowflake model. The snowflake model is an extension of the star model: it further normalizes the star model's dimension tables, so the original dimension tables may be expanded into several smaller dimension tables, forming local "hierarchy" areas. These decomposed tables connect to the main dimension table rather than to the fact table. As shown in the figure, the region dimension table is decomposed into country, province, and city dimension tables. Its advantage is that, by minimizing data storage and combining smaller dimension tables, it can improve some queries; the snowflake structure removes data redundancy.
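Continuing the same hypothetical example, a snowflake version normalizes the region dimension into country/province/city tables, so the same query needs a chain of joins:

```python
import pandas as pd

# The region dimension is normalized into city -> province -> country tables.
dim_city = pd.DataFrame({"city_key": [1, 2], "city": ["C", "D"], "province_key": [1, 1]})
dim_province = pd.DataFrame({"province_key": [1], "province": ["B"], "country_key": [1]})
dim_country = pd.DataFrame({"country_key": [1], "country": ["A"]})

fact_sales = pd.DataFrame({"city_key": [1, 2], "amount": [100.0, 250.0]})

# The same aggregation now requires walking the whole dimension hierarchy.
result = (fact_sales
          .merge(dim_city, on="city_key")
          .merge(dim_province, on="province_key")
          .merge(dim_country, on="country_key")
          .groupby("province")["amount"].sum())
print(result)
```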
7.2.4 Constellation model
The constellation model is another extension of the star model. Whereas the star model is based on a single fact table, the constellation model is based on multiple fact tables that share dimension table information, and it is often used when data relationships are more complex than in the star and snowflake models. Because the constellation model requires multiple fact tables to share dimension tables, it can be regarded as a collection of star models and is therefore also called the galaxy model.
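A minimal sketch of the constellation (galaxy) idea, with hypothetical order and refund fact tables sharing one date dimension:

```python
import pandas as pd

# Shared dimension used by both fact tables.
dim_date = pd.DataFrame({"date_key": [20240101, 20240102], "month": ["2024-01", "2024-01"]})

# Two fact tables for two business processes, both referencing dim_date.
fact_orders = pd.DataFrame({"date_key": [20240101, 20240102], "order_amount": [100.0, 250.0]})
fact_refunds = pd.DataFrame({"date_key": [20240102], "refund_amount": [30.0]})

# Each star can be queried on its own or combined through the shared dimension.
orders_by_month = fact_orders.merge(dim_date, on="date_key").groupby("month")["order_amount"].sum()
refunds_by_month = fact_refunds.merge(dim_date, on="date_key").groupby("month")["refund_amount"].sum()
print(orders_by_month, refunds_by_month, sep="\n")
```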
7.2.5 Comparison
- Because of its data redundancy, the star model can answer many statistical queries without external joins, so its efficiency is generally higher than that of the snowflake model.
- The star structure does not need to consider many normalization factors, so design and implementation are relatively simple.
- The snowflake model is less efficient because it removes redundancy, so some statistics must be produced through table joins.
- Normalization is also a relatively complex process, and the corresponding database structure design, data ETL, and later maintenance are all more complex.
7.2.6 Summary
Comparing the two, we find that most of the time a data warehouse is better served by using the star model to build the underlying Hive tables: a large amount of redundancy reduces the number of tables a query must touch and improves query efficiency, and the star model is better supported by OLAP analysis engines, as can be seen in Kylin. The snowflake model is very common in relational databases such as MySQL and Oracle, especially in e-commerce table designs. In data warehouses, the snowflake and constellation models have fewer application scenarios, but they do exist. Therefore, in a concrete design we can consider whether the advantages of both can be combined to achieve a better overall design.
7.2.7 Modeling principles
- High cohesion and low coupling
Design data with similar or related business meaning and the same granularity into one logical or physical model: store data with a high probability of being accessed together in the same place, and store data with a low probability of being accessed together separately.
- Separate the core model from the extension model
A core model and an extension model system should be established. The fields of the core model support the common core business, while the fields of the extension model support the needs of personalized or niche applications. The fields of the extension model should not intrude excessively into the core model, so as not to damage its architectural simplicity and maintainability.
- Push common processing logic down and keep it in one place
The more common a piece of processing logic is, the more it should be encapsulated and implemented at the lower levels of the data scheduling dependencies. Do not expose common processing logic to application-layer implementations, and do not let the same common logic exist in multiple places at once.
- Balance cost and performance
Appropriate data redundancy can be traded for query and refresh performance, but excessive redundancy and data copying are not recommended.
- Data can be rolled back
With the processing logic unchanged, re-running the data at different times must produce deterministic, unchanged results.
- Consistency
Fields with the same meaning must be named the same in different tables and must use the names in the specification definition.
- Clear and understandable naming
Table names must be clear and consistent, and table names must be easy for consumers to understand and use.
Steps of star model design:
- Select the business process to analyze for decision making. A business process can be a single business event, such as a payment or refund in a transaction; the state of something, such as the current account balance; or a flow made up of a series of related business events, depending on whether we are analyzing the occurrence of certain events, the current state, or the efficiency of the event flow.
- Choose the granularity. In event analysis, we determine the granularity by anticipating the level of detail required by all the analyses. The granularity is a combination of dimensions.
- Identify the dimension tables. After choosing the granularity, design the dimension tables based on that granularity, including the dimension attributes used for grouping and filtering during analysis.
- Choose the facts. Identify the measures to be used in the analysis.
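As an illustration only, applying the four steps to a hypothetical payment process might yield a design summary like the following (all names are assumptions, not from the article):

```python
# 1. Business process: payment transactions.
# 2. Granularity: one row per payment.
# 3. Dimensions: date, user, payment channel (hypothetical names).
# 4. Facts: the payment amount measure.
star_design = {
    "business_process": "payment",
    "grain": "one row per payment transaction",
    "dimensions": ["dim_date", "dim_user", "dim_channel"],
    "facts": ["payment_amount"],
}
print(star_design)
```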
7.3 Data Vault Model
Data Vault is a model initiated by Dan Linstedt. It is derived from the ER model and designed for data integration, but it cannot be used directly for data analysis and decision making. It emphasizes building an auditable base data layer, that is, the historicity, traceability, and atomicity of data, without requiring excessive consistency processing and integration of the data.
It also organizes enterprise data structurally around subject concepts and introduces further normalization to optimize the model so it can handle changes in the source systems. The Data Vault model consists of the following parts:
• Hub: the enterprise's core business entities, consisting of the entity key, the data warehouse surrogate (proxy) key, the load time, and the data source.
• Link: represents the relationship between Hubs. The biggest difference here is that the relationship is abstracted into a separate unit, which improves the model's extensibility. It can directly describe 1:1, 1:n, and n:n relationships without requiring any change. It consists of the proxy keys of the Hubs it relates, the load time, and the data source.
• Satellite: the detailed description of a Hub. One Hub can have multiple Satellites. It consists of the Hub's proxy key, the load time, the source type, and the detailed description of the Hub.
The Data Vault model is easier to design and produce than the ER model, and its ETL processing can be made configurable.
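A minimal sketch of the three Data Vault component types as Python dataclasses, following the field lists above (the concrete names and types are assumptions):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Hub:                # core business entity
    hub_key: int          # data warehouse surrogate (proxy) key
    business_key: str     # entity key from the source system
    load_time: datetime
    record_source: str

@dataclass
class Link:               # relationship between Hubs (1:1, 1:n, n:n)
    link_key: int
    hub_keys: tuple       # proxy keys of the related Hubs
    load_time: datetime
    record_source: str

@dataclass
class Satellite:          # descriptive detail for a Hub
    hub_key: int          # proxy key of the Hub it describes
    load_time: datetime
    record_source: str
    attributes: dict      # detailed, historized description of the Hub
```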
8. Model layering
8.1 Introduction
A data warehouse is generally divided into three layers: the source-aligned data layer (ODS, Operation Data Store), the common data model layer (CDM, Common Data Model), and the data application layer (ADS, Application Data Service).
8.2 The ODS layer
The source-aligned layer, kept consistent with the business database without any processing.
8.3 The CDM layer
The Common Data Model (CDM) layer consists of the DIM dimension tables plus the DWD, DW, and DWS layers, all produced by processing ODS-layer data. It mainly completes data processing and integration, establishes conformed dimensions, builds reusable detailed fact tables for analysis and statistics, and summarizes indicators at common granularities.
- Common dimension layer (DIM): based on dimensional modeling ideas, establish enterprise-wide conformed dimensions to reduce the risk of inconsistent calculation calibers and algorithms. Tables in the common dimension layer are also called logical dimension tables; dimensions and dimension logical tables usually correspond one to one.
- Detailed granular fact layer (DWD): performs normalized code conversion, cleansing, unified formatting, and desensitization of the data, without horizontal integration.
- Subject wide-table layer (DW): integrates information from multiple DWD tables and outputs subject wide tables (oriented to business processes; information from different business processes is not built redundantly but is associated in the form of foreign keys).
- Common summary granularity fact layer (DWS): driven by the analysis subject as the modeling target and based on the metric requirements of upper-level applications and products, build common-granularity summary indicator fact tables and physicalize the models as wide tables. Construct statistical indicators with standardized naming and consistent calibers, provide public indicators to the upper layers, and establish summary wide tables and detailed fact tables.
Tables for the common summary granularity fact layer are also commonly referred to as summary logical tables and are used to hold derived metric data.
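As a rough illustration of the layering (table names such as dwd_trade_order and dws_user_order_1d are hypothetical), a DWS summary table might be derived from a DWD detail table like this:

```python
import pandas as pd

# DWD: cleaned, detailed fact table at the finest grain (one row per order).
dwd_trade_order = pd.DataFrame({
    "user_id": [1, 1, 2],
    "order_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "amount": [100.0, 50.0, 80.0],
})

# DWS: common-granularity summary table (per user per day), built once and reused
# by the ADS layer instead of re-aggregating the detail data for every application.
dws_user_order_1d = (dwd_trade_order
                     .groupby(["user_id", "order_date"], as_index=False)["amount"]
                     .sum()
                     .rename(columns={"amount": "order_amount_1d"}))
print(dws_user_order_1d)
```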
8.4 The ADS layer
Application Data Service (ADS) layer: customized development oriented to service requirements; it stores the personalized statistical indicator data of data products.
8.5 Logical Hierarchical Architecture
8.6 Benefits of layering
- Clear data structure: each data layer has its own scope of responsibility, making tables easier to locate and understand when used.
- Data lineage tracing: simply put, what we ultimately present to the business is a table that can be used directly, but it is built from many sources. If a source table goes wrong, we want to locate the problem quickly and accurately and know the scope of its impact.
- Less duplicated development: standardizing the data layering and developing common intermediate-layer data can greatly reduce duplicated computation.
- Simpler handling of complex problems: break a complex task into multiple steps, with each layer handling a single, relatively simple and understandable step. This also makes it easier to guarantee data accuracy: when the data has a problem, we do not need to repair everything, only the steps where the problem occurred.
This article is shared from the Huawei Cloud community post "From the Three Normal Forms to Modeling Specifications: A Detailed Explanation of the Data Modeling Knowledge System"; original author: Fan MM.