“This is the 22nd day of my participation in the November Gwen Challenge. See details of the event: The Last Gwen Challenge 2021”.
OneData is alibaba’s internal data integration and management method system and tool.
The guiding ideology
First of all, conduct sufficient business research and demand analysis.
Secondly, the overall data architecture design is mainly divided according to the data domain. According to dimension modeling theory, bus matrix is constructed to abstract business process and dimension.
Thirdly, the report requirements were abstracted to sort out the relevant index system, and the OneData tool was used to complete the definition of index specifications and model design. Finally, it is code development and operation and maintenance.
Its implementation process is mainly divided into: data survey, architecture design, specification definition and model design.
Research data
Business research
It is necessary to confirm the business areas to be planned into the data warehouse and the functional modules contained in each business area. Taking Ali’s business as an example, the following matrix can be planned:
Demand research
What are the indicators of demand side relationship? What dimensions and measures are needed? Whether the data is precipitated to the summary layer.
Architecture design
Partition of data fields
Data domain is a collection of abstract business processes or dimensions. General data domain is related to application system (function module), so it can be considered to divide business processes of the same function module system into a data domain:
Building the Bus Matrix
After sufficient business research and requirements research, it is time to build the bus matrix, which requires two things:
- Identify what business processes are under each data domain.
- Which dimensions business processes relate to and define business processes and dimensions under each data domain by bus matrix:
Index System construction
The basic concept
Data domain: a collection of abstracted business processes or dimensions for business analysis.
Business processes: Events in the business activities of an enterprise.
Time period: indicates the event range or time point of data statistics, for example, the last 30 days or until now.
Modifier type: An abstract division of modifiers.
Modifiers: Business scenario-qualified abstractions of metrics other than statistical dimensions. Abstract words belong to an abstract type, such as PC, Android, and Apple in the access terminal type.
Metric/atomic metric: A business term with a clear meaning. Such as: amount of payment.
Dimension: A dimension is the environment of measurement, used to reflect a class of attributes of the business, the set of such attributes is called a dimension, can also be called entity objects, such as geographical dimension, time dimension.
Dimension attributes: A description of a dimension, belonging to a dimension. For example, countries and provinces under the geographical dimension.
Derived metric: atomic metric + multiple modifiers (optional)+ time period.
Clarify definitions of atomic indicators, modifiers, time periods, and derived indicators.
Operating conditions
Derived indicators come from three types of indicators: transactional indicators, stock-based indicators and composite indicators.
Transactional metrics: Metrics that measure business activities.
Stock indicator: Indicates the statistics of some states of entity objects.
Compound indexes are based on the above two indexes.
Model design
Data hierarchy
In the industry, there are similar views on the layering of log storehouse, which is generally divided into access layer, middle layer and application layer. However, there are some differences in the understanding of the middle layer.
Access layer (ods)
The business data is generally synchronized to the data warehouse with fixed frequency such as dataX or SQOOP to construct ODS layer.
If it is log data, it is synchronized to data warehouse through Flume or Kafka.
The access layer generally does not do any processing or cleaning of the source data, which is easy to trace back.
Detail layer (DWD)
Theoretically, data at the detail layer is to clean and process data at the ODS layer to improve the availability of data at the ODS layer. It is necessary to weigh whether data at the DWD layer is referenced at the same layer:
- In general, it is not recommended for the DWD layer to reference the same layer. This can reduce the dependency between tasks at the detail layer and reduce the node depth.
- However, in some scenarios, data processing logic from ODS layer to DWD layer is complex and calculation overhead is high. In this case, appropriate reuse of DWD table can be considered to construct new DWD table.
Summary layer (DWS)
This layer relies on our index system and aggregates the data of THE DWD layer according to various dimensions.
Data Mart Layer (DWM)
Put it in this layer when we have some aggregated statistical requirements across business domains.
The application layer (app)
This layer mainly aims at the summary layer, carries on the correlation indicator combination, generates the report.
The dimension design
In dimension modeling, measures are referred to as facts, and dimensions are used to analyze the diverse environments required for facts. Dimensions are generally used for querying, categorizing, and sorting.
Through the constraints of the report, as well as the previous data research and communication with the business side, we can obtain the dimensions.
Dimensions are associated with fact tables through primary keys, which can be divided into surrogate keys and natural keys. Proxy keys have no business meaning and are generally used to deal with slowly changing dimensions, while natural keys have business meaning.
Basic methods of dimensional design
- Select or create a dimension and master the dimensions in the current data warehouse architecture through the previous construction of the bus matrix.
- Determine the main dimension table. In this case, the main dimension table is generally an ODS table and synchronizes directly with the service system.
- Identify the relevant dimension tables. Data warehouse is the data integration of business source systems. There is correlation between tables in different business systems or the same business system. Based on the business combing, we can identify which tables are associated with the main dimension tables and select some of them to generate dimension attributes.
- Determine the dimension properties. This step is divided into two stages. The first stage is to select dimension attributes or generate new dimension attributes from the main dimension table. The second stage is to select dimension attributes or generate new dimension attributes from the related dimension tables.
Normalization and de-normalization
When there are multi-level dimension attributes, a series of dimension tables instead of a single dimension table are formed after normalization according to the third normal form, which is called the snowflake model.
The operation of merging a dimension’s attribute hierarchy into a single dimension is called de-normalization.
Consistency dimension and cross probe
We have many requirements for observing business processes in different data domains or merging different business processes in the same data domain. For example: Nearly one-day PV and UV statistics for the commodity dimension for the log data domain; For transaction data domain statistics commodity dimension nearly a day GMV.
This kind of data probe is called cross probe, which combines the commodity facts of different data domains together.
The premise that data warehouse can cross probe is that different data fields must have consistency dimension.
Dimension integration
Since data sources of data warehouse come from different application systems, which are independent from each other, the description and storage of the same information may be different.
And these different data need to be integrated after entering the data warehouse:
- Unification of naming conventions. The table and field names must be the same.
- Unification of field types. The field types of the same and similar fields are the same.
- Common code and unification of code values.
- Unification of tables with the same business meaning. Based on the concept of high cohesion and low coupling, the tables with large business relationship and small difference of source system influence are integrated.
Table level consolidation takes two main forms:
Vertical integration, where different source tables contain the same data set but store different information, can be consolidated into the same dimensional model.
Horizontal integration, that is, tables from different sources contain different data sets, there is no crossover or partial crossover among these subsets, and if there is crossover, the data is de-duplicated; If there is no crossover, consider whether the natural keys of different subsets conflict. If there is no conflict, consider the natural keys of each subset as integrated natural keys, or process each natural key into a supernatural key.
Zipper table
Zipper table, also known as extreme storage technology. The problem with this approach is that if I want to keep all of the user’s historical states, I may need to keep every historical partition permanently.
If you use a zipper table, each partition can keep the historical status of each user for the day, and the history partition can also be cleaned.
Thus, although more data is stored in a single partition, the entire table stores less data after some historical partitions are cleaned, because many snapshots of unchanged user information are cleaned.
The micro dimension
Microdimensions are created by removing some of the unstable attributes from the relatively stable main dimension and placing them in a new table with its own proxy key.
Recursive hierarchy
Recursive hierarchy refers to the hierarchical relationship of instance values of a dimension table. The recursive hierarchy of dimensions is divided into equilibrium hierarchy with fixed order of magnitude and non-equilibrium hierarchy without fixed order of magnitude.
Since the warehouse generally does not support recursive SQL to handle this kind of hierarchy, an alternative approach is needed.
- The hierarchy is flat, suitable for balancing the dimensions of the hierarchy.
- Hierarchical bridge table, suitable for unbalanced hierarchical dimensions.
More value dimensions
Multi-valued dimension means that one record in a fact table has multiple records corresponding to it in a dimension table.
For multi-valued dimensions, there are three common processing methods:
- Reduce the granularity of the fact table.
- Column extension.
- A more general way, the use of bridge table.
Miscellaneous dimension
The miscellaneous dimension is a combination of indicator or flag fields in the operational system and is generally not included in the consistency dimension.
If these dimensions exist in the fact table as facts, the fact table will occupy more space. If the dimension table is set up separately, many small and fragmentary dimension tables will appear.
At this point, the usual solution is to create miscellaneous dimensions, which can be understood as storing many small dimension tables into a large dimension table by row to column, by building these fields into a dimension table with only a single foreign key in the fact table.
The degradation of dimension
Refers to dimensions whose attributes are stored directly in the fact table.
Reference Documents:
-
Meituan data platform and data warehouse construction practice, over 100,000 words summary
-
Hundreds of quality big data books, with must-read lists (Big Data Treasure)
-
Fifty thousand words | spent a month unscrambles the Hadoop vomiting blood