(Hornet’s Nest Technology original content, public ID: MFWtech)
1. Hornet’s Nest data warehouse and data middle platform
In recent years, the data middle platform concept has remained persistently popular. Since 2018, Hornet’s Nest has also begun its own exploration of the data middle platform.
What exactly is the data middle platform? Should we build one? What is the essential difference between the data middle platform and the data warehouse? I believe many companies are asking these questions.
I think the data middle platform concept is very close to a combination of the traditional data warehouse and a big data platform. It is what an enterprise builds on top of its accumulated data centers, data warehouses, and the like: with platform thinking, data is integrated and unified more thoroughly; flexible data processing and application functions are realized through componentization; a clearer data organization copes with fast-changing business; and data value is released more effectively in the form of services.
Therefore, the data middle platform reflects, above all, a shift in management thinking and architecture. Guided by this idea, we combined our own business characteristics to build the Hornet’s Nest data middle platform, whose core architecture is as follows:
Before building the data middle platform, Hornet’s Nest had already established its own big data platform and accumulated a number of general, componentized tools, which supported rapid construction of the platform. As the other core component of the middle platform, the Hornet’s Nest data warehouse mainly undertakes unified data construction, including a unified data model, a unified indicator system, and so on. The following introduces Hornet’s Nest’s concrete practice in data warehouse construction.
2. Data warehouse core architecture
The Hornet’s Nest data warehouse follows a standard three-layer architecture and adopts a dimensional model design to position the data layers. Rather than abstractly breaking data apart, it pays more attention to integrating business process data. The existing warehouse is mainly offline, and its overall structure is as follows:
As shown in the figure, the warehouse is divided into three layers: the business data layer, the public data layer, and the application data layer. Each layer has its own positioning, objectives, and construction principles.
(1) Business data layer: it contains two sub-layers, STG (data staging layer) and ODS (operational data layer). The data structure of these two layers is almost identical to that of the business systems.
- STG: also known as the data staging area, it caches temporary data extracted from databases, from messages, and from log parsing. Its structure is consistent with the business systems, and it is responsible for cleaning and converting dirty and non-standard data. This layer serves only the ODS layer.
- ODS: the operational data layer retains detailed business data and is responsible for keeping the history of data that changes after ingestion. In principle, all data is retained. Depending on how each business table changes, the model design takes one of two forms: zipper tables (history tables that record a validity interval for each version of a row, much like SCD Type 2) or flow tables (append-only transaction logs). A minimal zipper-table sketch follows.
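To make the zipper-table idea concrete, here is a minimal sketch in Python/pandas; the table layout and column names are illustrative assumptions, not Hornet’s Nest’s actual schema. When a record changes, its open version is closed and a new version is opened, so reading any historical snapshot is a simple range filter.

```python
import pandas as pd

OPEN_END = "9999-12-31"  # sentinel meaning "still current"

def apply_zipper(history: pd.DataFrame, changes: pd.DataFrame, day: str) -> pd.DataFrame:
    """Merge one day's changed rows into a zipper (validity-interval) table.

    history: columns [order_id, status, start_date, end_date]
    changes: today's snapshot of changed rows, columns [order_id, status]
    """
    changed_ids = set(changes["order_id"])
    # Close the currently-open version of every changed record.
    is_open = history["end_date"].eq(OPEN_END)
    to_close = is_open & history["order_id"].isin(changed_ids)
    history.loc[to_close, "end_date"] = day
    # Open a new version valid from today onward.
    new_rows = changes.assign(start_date=day, end_date=OPEN_END)
    return pd.concat([history, new_rows], ignore_index=True)

# Querying the table "as of" any date d is then a range filter:
# snapshot = z[(z.start_date <= d) & (d < z.end_date)]
```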
(2) Public data layer: it is divided into three sub-layers: DWD (detail data layer), DWS (summary data layer), and DIM (public dimension layer). It processes and stores the integrated detailed business process data together with lightly and heavily aggregated public dimension-indicator data. As the core layer of the warehouse, the public data layer takes the business perspective, abstracts the warehouse’s common data access and statistical requirements, builds public data oriented toward supporting applications, and provides shared data access services.
- DWD: integrated, detailed business process data. This layer is responsible for the vertical and horizontal integration of data across business scenarios, redundant storage of common dimensions, and processing of detailed business label information.
- DWS: the summary data layer performs light and heavy aggregation of indicator data along common dimensions, organized by subject.
- DIM: defines dimensions in a unified, standardized way so that dimension information is shared.
(3) Application data layer: the DWA layer, which performs personalized data processing for individual products or business lines, such as commercial product data, search and recommendation, and risk control. A sketch of how data flows upward through the three layers follows.
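As an illustration of the flow from detail to summary, here is a minimal sketch; the table and column names (dwd_fact_order, dws_order_1d, and so on) are hypothetical. A lightly aggregated DWS daily summary is derived from a DWD order-detail table along a conformed dimension.

```python
import pandas as pd

# dwd_fact_order: hypothetical DWD detail table, one row per order.
dwd_fact_order = pd.DataFrame({
    "dt": ["2019-06-01"] * 3,
    "business_line": ["hotel", "hotel", "transport"],
    "order_amount": [500.0, 320.0, 180.0],
})

# dws_order_1d: lightly aggregated daily summary by business line.
dws_order_1d = (
    dwd_fact_order
    .groupby(["dt", "business_line"], as_index=False)
    .agg(order_cnt=("order_amount", "size"),
         gmv=("order_amount", "sum"))
)
print(dws_order_1d)
```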
3. Data model design
3.1 Method Selection
A data model is an abstraction of data features in the real world, and a data model design method is a method for organizing and generalizing data. Two model design methodologies dominate the industry. The first is the normal-form (ER) modeling method proposed by Bill Inmon, the father of the data warehouse, which advocates building the data model top-down from the enterprise perspective. The second is the dimensional modeling method advocated by Ralph Kimball, which builds data models bottom-up from business requirements.
In a big data environment, business system data is massive and complicated, data structures are diverse and change frequently, and the warehouse must respond quickly to a variety of complex business needs. Neither of the two traditional theories alone meets the needs of an Internet data warehouse. Against this background, the Hornet’s Nest data warehouse adopts a hybrid design mode of “demand-driven first, data-driven second,” selecting a modeling approach per data layer. Four aspects are considered:
1. Subject orientation: business data is classified using the subject-division method from normal-form model theory.
2. Consistency guarantee: the bus-architecture idea from dimensional model theory is adopted to establish unified conformed dimension tables and conformed fact tables (a small bus-matrix sketch follows this list).
3. Data quality assurance: both normal-form modeling and dimensional modeling attach great importance to data quality, and methods from both theories are used together to guarantee it.
4. Efficiency guarantee: reasonable techniques such as dimension degeneration, slowly changing dimensions, and added redundancy are adopted to keep data computation and queries efficient.
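The conformed-dimension bus matrix behind point 2 can be represented very simply. In this sketch the business processes and dimensions are illustrative, loosely matching the travel order domain discussed later:

```python
# Rows are business processes, columns are conformed dimensions.
# A True cell means the process's fact table references that dimension.
BUS_MATRIX = {
    #                  (user,  merchant, product, date)
    "order_placed":    (True,  True,     True,    True),
    "order_paid":      (True,  False,    True,    True),
    "order_confirmed": (True,  True,     True,    True),
}

def shared_dimensions(process_a: str, process_b: str) -> list[str]:
    """Dimensions two processes can be drilled across (conformed join keys)."""
    dims = ("user", "merchant", "product", "date")
    a, b = BUS_MATRIX[process_a], BUS_MATRIX[process_b]
    return [d for d, x, y in zip(dims, a, b) if x and y]

print(shared_dimensions("order_placed", "order_paid"))  # ['user', 'product', 'date']
```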
Specifically, the ODS layer keeps a source-aligned normal-form model with no further abstraction; from a storage-saving perspective, only zipper processing is applied at this layer. DWD and DWS mainly adopt dimensional models, together with some wide-table models, chosen for construction cost, performance, and ease of use. The wide-table model is essentially an extension of the dimensional model that integrates whole-business, whole-node information both vertically and horizontally. Degenerate dimensions are used to place the measures of different dimensions into different columns of one table, building a whole-process view of the business; this improves the wide table’s ease of use and query efficiency while keeping the model easy to extend. The two directions of integration are described below, with a minimal join sketch after the list.
- Horizontal integration: integrating data about the same business from multiple data sources into one model. Where sources overlap, one source is selected according to preset business rules so that the integrated business data does not intersect. For example, if commodity data is not under master data management, the commodity information of different business lines is scattered across business system tables and cannot support enterprise-level analysis; such commodity data must be integrated horizontally by business subject.
- Vertical integration: a complete business flow usually passes through multiple stages, and each node’s information is generated at a different time and stored in a different table. Vertical integration brings the information of every key node of one business flow into a whole-process wide-table model. The Hornet’s Nest order transaction model is built this way and is introduced in detail below.
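A minimal sketch of vertical integration in Python/pandas; the table and column names are hypothetical. Node tables for placing, paying, and confirming an order are merged on the order key into one whole-process wide table, with each node’s timestamp landing in its own column, a degenerate-dimension-style layout.

```python
import pandas as pd

placed = pd.DataFrame({"order_id": [1, 2], "user_id": [10, 11],
                       "place_time": ["2019-06-01 09:00", "2019-06-01 10:00"]})
paid = pd.DataFrame({"order_id": [1], "pay_time": ["2019-06-01 09:05"],
                     "pay_amount": [500.0]})
confirmed = pd.DataFrame({"order_id": [1], "confirm_time": ["2019-06-01 11:00"]})

# Left joins keep every order; nodes not yet reached simply stay NaN,
# so one row shows how far each order has progressed along the bus.
wide_order = (placed
              .merge(paid, on="order_id", how="left")
              .merge(confirmed, on="order_id", how="left"))
print(wide_order)
```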
3.2 Design Objectives
In the model design of the Hornet’s Nest data warehouse, accuracy, ease of use, and timeliness are the design goals, in order to meet business users’ diverse data needs.
- Accuracy: data quality control runs throughout the modeling process to ensure the data is correct.
- Ease of use: the model balances extensibility and understandability.
- Timeliness: the efficiency of using the model is fully considered, providing convenient and fast data query and computation services.
3.3 Design Process
The model design process of the Hornet’s Nest data warehouse involves four major stages: requirements survey, model design, development and testing, and model launch. The input and output documents of each stage are standardized.
- Requirements survey: collect and understand business requirements, reach consensus on the caliber of each specific requirement, analyze the business systems or modules involved down to the table and field level, and verify the data to make sure the existing data can support the requirements.
- Model design: first classify models according to the requirements and the survey results, and choose an appropriate subject area to house each model. Once the subject is determined, logical model design begins; key issues such as bus architecture construction and model specification definition must be considered at this stage. Physical model design takes the logical model and, weighing storage, performance, and other factors, produces its final physical realization; the physical model is generally consistent with the logical model. After design is complete, review and Mapping design are required.
- Model development: the coding of the model’s computation scripts, covering data mapping, script implementation, test verification, and other development steps. After unit testing, the business side is invited to verify the model data together; verification issues are collected and fed back to re-examine the rationality of the model design.
- Model launch: once validated, the model is deployed to the production environment. After launch, monitoring must be configured so that the status of the model’s data services is known in time. We also publish the model’s entity and attribute documentation to warehouse data consumers so the model can be used well.
3.4 Subject Classification
Based on a survey of the current departments and business systems, the Hornet’s Nest data warehouse defines four big data domains (transaction, traffic, content, and participant), subdivided into 11 subjects:
Take the construction of the Hornet’s Nest order transaction model as an example. Designing around the business production bus is a common pattern: first survey the whole order transaction process, locate the key nodes along it, and confirm the core factual information generated at each node. The model is the carrier of the data, and what we need to do is capture, through the model (or model system), the factual information of every node on the production bus.
Order production bus:
As shown in the figure above, we need to distill the core information of each node. To avoid omitting key information, a node’s participants, occurrence time, occurrence event, and occurrence agreement are generally treated as its core information and must be captured. Take the order-placing node as an example: participants include the ordering user, the service merchant, and the platform operator; occurrence times include the user’s order time and the merchant’s confirmation time; the occurrence event is the user buying a product, and the related information of this event must be recorded; the order generated when the agreement takes effect, the order amount, and the agreed content are the agreement information we need to record.
In this way, the bus architecture lets the core information of each node be added to the model continuously, so the range of applications the model supports gradually expands and improves. The depth of understanding of the business process therefore directly determines the quality of the resulting model.
The more business nodes are involved, the more complex the business process becomes. From a data point of view, these processes produce two basic scenarios: the splitting and the convergence of data. As the process advances, an atomic business unit from a previous node may split into more records at a new node, or several units may converge at a new node. Taking an order as an example, the order-placing node is at order granularity, while at the payment node the data splits (one order can map to multiple payment records). Splitting and convergence may keep occurring along the bus nodes.
Given this, when realizing the model we cannot simply stack the node data of different granularities together: doing so produces a great deal of redundant information and blurs the positioning of the model itself, hurting its usability. Models of different granularities must therefore be output to meet different application needs. For example, there is both a data model at order granularity and a data model for analyzing the status of each order node at different times, as the sketch below illustrates.
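A minimal sketch of the split/convergence point; all names are illustrative. Payments live at a finer granularity than orders, so they get their own model, and they converge back to order granularity only through aggregation rather than being stacked into the order table.

```python
import pandas as pd

# Order-granularity model: one row per order.
orders = pd.DataFrame({"order_id": [1, 2], "order_amount": [500.0, 300.0]})

# Payment-granularity model: one order can split into several payments.
payments = pd.DataFrame({"order_id": [1, 1, 2],
                         "pay_seq": [1, 2, 1],
                         "pay_amount": [200.0, 300.0, 300.0]})

# Converge payments back to order granularity via aggregation,
# instead of stacking payment rows into the order model.
paid_per_order = payments.groupby("order_id", as_index=False).agg(
    paid_amount=("pay_amount", "sum"), pay_cnt=("pay_amount", "size"))
print(orders.merge(paid_per_order, on="order_id", how="left"))
```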
Based on dimensional modeling thinking, the model integrates the core information of each node on the bus and further extends node information along the nodes’ common dimensions. To reduce the resource cost of the application layer repeatedly joining dimension tables, the model denormalizes the related dimension information into redundant columns, which makes the application layer’s use convenient. The result is a Hornet’s Nest whole-process order model that integrates the tourism, transportation, hotel, and other business lines together with their business node information.
4. Data warehouse tool chain construction
To improve data productivity, the Hornet’s Nest data warehouse has built a tool chain that automates the collection, development, and management processes. Three important tools at the current stage are as follows:
1. Data synchronization tool
Synchronization tools solve two main problems:
- Synchronizing data from source systems into the data warehouse
- Synchronizing data from the data warehouse to other environments
The following focuses on synchronizing data from the source system to the data warehouse.
Hornet’s Nest’s data synchronization design supports flexible data access: both the extraction mode and the processing mode can be chosen. The extraction mode is either incremental or full. The processing mode is storage-oriented and determines whether the data is stored as zipper tables or as flow logs.
During access, one only needs to fill in the table-level configuration and the field-level configuration; the data is then connected to the warehouse automatically and forms the ODS-layer data model, as follows:
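As an illustration only (the article does not show Hornet’s Nest’s actual configuration format), a table-plus-field access configuration might look like this:

```python
# Hypothetical sync configuration: table-level info plus field mappings.
SYNC_CONFIG = {
    "source": {"type": "mysql", "db": "trade", "table": "t_order"},
    "target": {"layer": "ods", "table": "ods_trade_order"},
    "extract_mode": "incremental",   # or "full"
    "store_mode": "zipper",          # or "flow_log"
    "incremental_key": "update_time",
    "fields": [
        {"source": "id",          "target": "order_id",     "type": "bigint"},
        {"source": "uid",         "target": "user_id",      "type": "bigint"},
        {"source": "amount",      "target": "order_amount", "type": "decimal(16,2)"},
        {"source": "update_time", "target": "update_time",  "type": "timestamp"},
    ],
}
```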
2. Task scheduling platform
We use Airflow together with our own task scheduling system, supporting not only routine task scheduling but also the various data reruns and historical backfills that scheduling work requires.
Do not underestimate reruns and historical backfills: these two functions are important criteria when choosing a scheduling tool. Anyone who works with data knows that real data processing constantly brings caliber changes, data anomalies, and the like, which require rerunning, refreshing, or backfilling data.
The “one-click rerun” function we designed can pull out all downstream tasks that depend on a given task, and supports selectively deleting or virtually executing the tasks of any node:
- If Delete is selected, the tasks that depend on this task will not be executed
- If virtual execution is selected, the task itself is skipped while all downstream dependent tasks continue to run normally
The figure below shows rerunning all downstream tasks from a given task; a specific execution node can be ignored or deleted by selecting it.
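For context, here is a minimal Airflow DAG sketch; the DAG and task names are made up, and this is not Hornet’s Nest’s actual scheduling code. Airflow’s clearing mechanism, which re-queues a task instance together with its downstream dependents, is the kind of primitive a “one-click rerun” feature builds on.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical three-step warehouse pipeline: ODS -> DWD -> DWS.
with DAG(
    dag_id="order_warehouse_daily",
    start_date=datetime(2019, 6, 1),
    schedule_interval="@daily",
    catchup=True,  # enables historical backfills for past dates
) as dag:
    load_ods = BashOperator(task_id="load_ods",
                            bash_command="run_etl.sh ods_trade_order {{ ds }}")
    build_dwd = BashOperator(task_id="build_dwd",
                             bash_command="run_etl.sh dwd_fact_order {{ ds }}")
    build_dws = BashOperator(task_id="build_dws",
                             bash_command="run_etl.sh dws_order_1d {{ ds }}")

    # Clearing load_ods together with its downstream reruns all three tasks.
    load_ods >> build_dwd >> build_dws
```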
3. Metadata management tool
Metadata falls into technical metadata, business metadata, and management metadata; the concepts will not be elaborated here. Metadata management plays an important role in data construction, with two main applications in the data warehouse:
(1) Lineage management
Lineage management can trace the whole data processing link and work out where a table comes from and where it goes, supporting scenarios such as the following (a small lineage-graph sketch follows the list):
- Analyzing the impact of upstream changes on downstream tables and adjusting accordingly
- Monitoring the running cost and efficiency of each node and of full-link tasks
- Monitoring how many models depend on a given data model, to identify which models are key
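A minimal sketch of the first scenario, impact analysis, using Python’s networkx; the table names are invented. Lineage is a directed graph from upstream to downstream tables, so the downstream impact of a change is just the set of reachable nodes.

```python
import networkx as nx

# Directed lineage edges: upstream_table -> downstream_table.
lineage = nx.DiGraph([
    ("ods_trade_order", "dwd_fact_order"),
    ("dim_product", "dwd_fact_order"),
    ("dwd_fact_order", "dws_order_1d"),
    ("dws_order_1d", "dwa_search_rank"),
])

def impacted_by(table: str) -> set[str]:
    """All downstream tables affected if `table` changes."""
    return nx.descendants(lineage, table)

print(impacted_by("ods_trade_order"))
# {'dwd_fact_order', 'dws_order_1d', 'dwa_search_rank'}

# A node's out-degree hints at which models have the most direct consumers:
print(max(lineage.nodes, key=lineage.out_degree))
```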
Below is the lineage graph of one data model, with upstream and downstream shown in different colors:
(2) Data knowledge management
Clear and detailed descriptions of technical and business metadata form data knowledge that gives data practitioners better guidance. Our data knowledge mainly consists of entity descriptions and attribute descriptions, as follows:
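A minimal sketch of what an entity/attribute description could carry; the fields are illustrative, as the article does not specify the actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class AttributeDoc:
    name: str            # physical column name
    business_name: str   # business-facing name
    data_type: str
    description: str     # business caliber / meaning

@dataclass
class EntityDoc:
    table: str
    subject: str                     # subject area, e.g. "transaction"
    layer: str                       # ods / dwd / dws / dim / dwa
    owner: str
    description: str
    attributes: list[AttributeDoc] = field(default_factory=list)

doc = EntityDoc(
    table="dwd_fact_order", subject="transaction", layer="dwd",
    owner="warehouse-team", description="Whole-process order detail model",
    attributes=[AttributeDoc("order_amount", "order amount", "decimal(16,2)",
                             "amount payable at order placement")],
)
```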
Of course, the data warehouse tool chain contains many more tools, such as automated modeling tools, data quality management tools, and data development tools, all of which have been well implemented.
5. Data warehouse application: the indicator platform
With a reasonable warehouse architecture and a tool chain supporting data development in place, the next step is to consider how to deliver data outward. Below is a brief introduction to one Hornet’s Nest data application tool, the indicator platform.
Almost every enterprise builds its own indicator platform, and each establishes different standards. Along the way, problems such as indicator proliferation, unclear definitions, and slow queries tend to appear. To avoid them as much as possible, the design of an indicator platform should follow several principles:
- Indicator definitions are standard, clear, easy to understand, unambiguous, and clearly classified
- The production process is simple, transparent, and configurable
- Indicator query performance meets the requirement of fast response
- Indicator permission management is flexible and controllable
Based on the above principles, the Hornet’s Nest indicator platform was built with careful design. Its architecture is shown below:
Among them:
- The data warehouse is the source of indicator data; at present all indicators are produced uniformly through the warehouse
- Indicator management covers indicator creation and indicator metadata management: the warehouse is responsible for producing and creating the most core, most basic indicators; other personnel can derive indicators from them according to the rules; metadata management records each indicator’s concrete source path, that is, whether its data source is a warehouse table, Kylin, MySQL, or ES (a small registry sketch follows this list)
- The indicator dictionary presents each indicator’s definition, caliber, and description, keeping indicators transparent and explainable
- The data service accepts indicator query requests, estimates the query cost per scenario, selects the optimal link to query the indicator, and returns the result
- Multi-dimensional query presents the queryable indicators and dimensions on the interface; users can select dimensions for an indicator, or indicators for a dimension, to query specific data
- Permission management can be applied at the table level, the indicator level, and the dimension-value level
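A minimal sketch of base and derived indicator metadata; everything here is illustrative, not the platform’s real schema. Each indicator records its caliber and source so that the dictionary and the data service can both read from one registry.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Indicator:
    name: str
    caliber: str                        # business definition, no ambiguity
    source: str                         # "warehouse" | "kylin" | "mysql" | "es"
    source_path: str                    # concrete table / cube / index
    derived_from: Optional[str] = None  # set for derived indicators

REGISTRY = {
    "gmv": Indicator("gmv", "sum of paid order amounts per day",
                     "warehouse", "dws_order_1d"),
    "gmv_7d_avg": Indicator("gmv_7d_avg", "7-day moving average of gmv",
                            "warehouse", "dws_order_1d", derived_from="gmv"),
}

def resolve(name: str) -> Indicator:
    """Walk a derived indicator back to its base definition."""
    ind = REGISTRY[name]
    while ind.derived_from:
        ind = REGISTRY[ind.derived_from]
    return ind

print(resolve("gmv_7d_avg").source_path)  # dws_order_1d
```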
6. Summary
The data construction of an enterprise needs to go through several major steps:
- Step one, digitalization of the business: as the name implies, the entire business can be reflected through data; this mainly refers to moving traditional offline processes online.
- Step two, intelligence of the data: data alone is not enough; it also has to be intelligent enough. How to support operations, marketing, and all kinds of business through intelligent data is the main problem the data middle platform currently solves.
- Step three, businessization of the data: this is what is often called data-driven business. Data cannot remain mere data; its greatest value lies in driving new business innovation and enterprise growth.
At present, most enterprises are still at the second stage, because this step must be solid enough to lay a good foundation for the third. That is why major enterprises invest heavily in big data platforms, data warehouses, and even data middle platforms.
The construction of the Hornet’s Nest data middle platform has only just begun. We believe an ideal data middle platform rests on three core premises: standardized data, componentized tools, and clear organization. To reach that goal, establishing a unified, standardized data warehouse is one of the key tasks of the current middle platform work.
Data comes from the business and is ultimately applied back to the business. Only when data is sufficiently valued and fully connected with the business can its value be maximized. At Hornet’s Nest, management, R&D, product, operations, sales, and other roles all attach great importance to data, and users of data products account for up to 75% of employees.
This large base of users drives us forward on the road of data middle platform construction. How to apply emerging technical capabilities to warehouse construction, and how to solve the problems enterprises face in data construction effectively at limited cost, will continue to guide the thinking behind Hornet’s Nest’s warehouse construction.
Author: Yan Bo, R&D director of the Hornet’s Nest data warehouse.