The rapid development of the digital economy has brought new opportunities and challenges to the operation of enterprises. How to effectively carry out data governance, break data silos, give full play to the business value of data, and protect data security has become a hot topic in the industry. Based on the process of Meituan distribution data governance, this article shares the construction and practice of a unified distribution data “base” spanning data definition, model design, and data production.
1 Introduction
With the rapid development of the digital economy, data has become a new factor of production. How to effectively carry out data governance, improve data quality, break data silos, and give full play to the business value of data has become a hot topic in the industry. Based on the Meituan distribution data governance process, this article shares the construction and practice of the distribution data “base”: how systematic modeling builds a bridge from data definition to data production, unifies the three links of data definition, model design, and data production, and eliminates the data trust problems caused by missing data standards and inadequate implementation. Besides completing the high-quality transformation from data to information, it also provides the data and metadata guarantee for convenient downstream data consumption. We hope it offers a useful reference for readers engaged in data governance during the transformation of data into assets.
2 What is systematic modeling
Systematic modeling is based on dimensional modeling and driven by the concept of governance in advance. Metadata runs through the whole modeling process, with indicators and dimensions defined above and actual data production below. First, through high-level model design, business indicators are structurally decomposed into combinations of atomic indicators/calculation indicators plus qualification conditions, and they are assigned to specific business processes and themes, completing the planned definition of business indicators. Second, the detailed physical model design is generated automatically from the high-level model design. Third, based on the generated physical model design, the data processing logic is generated semi-automatically or automatically, ensuring that the final business definition and the physical implementation are unified. The details are shown in the figure below:
From this definition, systematic modeling emphasizes two unifications: the unification of data requirements with model design, and the unification of model design with physical implementation.
The unification of data requirements and model design is the product of warehouse domain division and requirement-specific model design. Warehouse domain division abstracts the data based on the business itself while going beyond the limits of individual business requirements; abstracting subjects and business processes from the data is an important basis for attributing business indicator and dimension requirements and for achieving high cohesion and low coupling in data construction. Requirement-specific model design fills in content on top of the warehouse domain division: requirements are assigned, in the form of indicators and dimensions, to the corresponding subject and business process, so as to drive and constrain the detailed model design and outline a valuable information architecture asset.
The unification of model design and physical implementation uses the information architecture metadata deposited in the model design step to drive and constrain the actual physical model. It constrains the DDL of the corresponding physical model, preventing “chimney”-style development caused by the lack of effective constraints during data processing, and before a model goes online it automatically verifies the consistency between the business definition and the physical implementation, ensuring that the DML implementation is correct.
3 Why systematic modeling
After a period of distribution data construction, requirement management (indicators and dimensions), model design, and model development were not unified with one another; the data architecture specification could not be managed effectively; and metadata (indicator, dimension, and model design) was split from, and mismatched with, the actual physical models and other data asset information. In addition, due to the lack of systematic control, the quality of model design could not be fully standardized, and some requirements went straight to data development, degrading the quality of model construction. This lack of discipline and constraint led to “smokestack” development, which wasted technical resources and produced duplicate, unreliable data. The breakthrough point of systematic modeling for distribution is therefore to standardize “basic data construction” and eliminate the trouble and technical waste that “smokestack” development brings to the business.
3.1 Systematic modeling can effectively manage data architecture and eliminate “smokestack” development from the source
Systematic modeling not only realizes integrated design and development at the tool level, but also forms effective cooperation between model design and development at the mechanism level. Model design is driven by requirements, and development is driven and constrained by model design, preventing the disordered, “smokestack” development caused by the separation of model design from development and the lack of constraints on implementation.
3.2 The normative metadata deposited by systematic modeling can effectively eliminate the trouble the business has in retrieving and understanding data
Systematic modeling not only unifies the original data specification definitions, the model design, and the final physical model, but also deposits this information as metadata that describes the data assets. Each indicator is a standardized measure with a clear business definition and processing caliber, and it can also be mapped to the corresponding physical table, effectively eliminating the trouble the business has in retrieving and understanding data.
4 How to conduct systematic modeling
Implementing systematic modeling starts at the source, linking data specification definition, data model design, and ETL development to achieve “design is development, what you build is what you get.” The overall strategy is to start from the source, solve the definition of indicators at the requirement level, and then in turn drive model design and constrain data processing. Each link of the online business process in the data domain is abstracted and digitized, and business rules are implemented in the data, completing the digital twin of the “physical world” and forming a “digital world.” At the tool level, requirement-based integrated design and development is realized; at the mechanism level, model design and data development form effective cooperation.
Systematic modeling not only implements requirement-based integrated design and development but also forms effective cooperation between model design and data processing at the mechanism level. First, based on the data warehouse planning, business indicators and dimensions are mapped to the corresponding themes and business processes, and then, based on the data definition standards, business indicators are deconstructed structurally to achieve their technical definition and complete the high-level model design. Second, the metadata deposited during the high-level model design drives and constrains the final physical model design, determining the final DDL for subsequent data processing and thereby constraining subsequent data development.
4.1 High-level model design
Front-line data requirements are presented to data engineers in the form of indicators and dimensions. Data engineers first determine the business processes to be analyzed, complete the division and definition of business processes, and assign indicators to the corresponding business processes. Second, according to the business caliber of each indicator, the business indicator is decomposed into atomic indicator + qualification conditions + time period, or calculation indicator + qualification conditions + time period, completing the technical definition of the indicator. Third, the consistent dimensions of each business process are designed by integrating the analysis perspectives of all parties; together, these consistent dimensions constitute the bus matrix under the theme.
The above high-level model design involves two links. First, domain model division is completed through business abstraction: business processes are divided based on the actual flow of the business and assigned to analysis domains. For a specific business, the analysis domains and their corresponding business processes do not change as analysis requirements change, nor does the domain division; based on this division, a stable asset catalog can be constructed. Second, logical modeling is completed by giving business indicators their technical definitions, attributing them to specific business processes, and determining the analysis dimensions of those business processes. Logical modeling further outlines the specific analysis measures and dimensions within each analysis domain and business process, and completes the final high-level model design, which determines the specific physical output of each analysis domain and business process.
More specifically, identifying the analysis measures under a business process requires completing the technical definitions of business indicators and attributing them to that business process. In this step, we produce a structured technical definition of each business indicator from a technical perspective, forming a structured indicator system. On the one hand, a structured definition is easy to unify into a standard, avoiding the ambiguity caused by free-text descriptions. On the other hand, a structured definition helps the system guarantee consistency, solving the problem that consistency is difficult to enforce manually. In our structured indicator scheme, indicators are divided into atomic indicators, calculation indicators, and derived indicators, with the following definitions:
- Atomic indicator: an indicator under a business process that cannot be decomposed further; a noun with a clear business meaning. In physical implementation, it is the combination of a business entity field and a specific aggregation operator under a specific business process.
- Calculation indicator: an indicator obtained by combining atomic indicators and qualification conditions through the four arithmetic operations (addition, subtraction, multiplication, and division). A calculation indicator has an explicit calculation formula as its definition and can be combined with multiple qualification conditions. For the attribution of calculation indicators, we follow two principles: (1) since atomic indicators can be attributed to their business processes and business processes generally have a time sequence, the calculation indicator is attributed to the later business process; (2) if multiple business processes are involved and they have no time sequence, the correlation between the indicator description and the candidate business processes is judged, and the indicator is attributed to the most relevant process. In physical implementation, a calculation indicator can be generated directly and automatically from its defined calculation formula.
- Derived indicator: an indicator composed of “time period + one or more qualification conditions + atomic indicator/calculation indicator.” Since a derived indicator is derived from an atomic or calculation indicator, it belongs to the business process to which that atomic or calculation indicator belongs.
- Qualification condition: a logical encapsulation of an indicator's business caliber. The time period can also be regarded as a special qualification condition that must be included in every derived indicator. In physical implementation, we treat qualification conditions as logical tags on derived facts.
With these definitions, derived indicators can be clearly divided into atomic-derived indicators and calculation-derived indicators, which can easily be generated semi-automatically in a structured way. Derived indicators cover all the indicators in data products such as user-facing reports, while atomic indicators and calculation indicators, as the core content of the indicator system, are not provided to users directly. The implementation path of each indicator also becomes clear: the logic of atomic and calculation indicators should sink into the basic fact layer as far as possible, while derived indicators are implemented in the intermediate and application layers as required.
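To make the structured decomposition above concrete, here is a minimal sketch in Python of how atomic indicators, calculation indicators, derived indicators, and qualification conditions could be represented as metadata objects. All class, field, and indicator names are hypothetical illustrations, not the actual Meituan tooling.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class AtomicIndicator:
    """Non-decomposable measure under one business process:
    a business-entity field plus an aggregation operator."""
    name: str               # e.g. "delivered_order_count"
    business_process: str   # e.g. "order_delivered"
    source_field: str       # physical field, e.g. "order_id"
    aggregator: str         # e.g. "COUNT(DISTINCT order_id)"

@dataclass
class Qualification:
    """Logical encapsulation of a business caliber, e.g. 'on-time orders'."""
    name: str
    condition_sql: str      # e.g. "actual_arrival_time <= promised_arrival_time"

@dataclass
class CalculationIndicator:
    """Arithmetic combination (+ - * /) of atomic indicators and qualifications."""
    name: str
    formula: str
    operands: List[AtomicIndicator]
    qualifications: List[Qualification] = field(default_factory=list)

@dataclass
class DerivedIndicator:
    """Time period + qualification conditions + atomic/calculation indicator."""
    name: str
    time_period: str        # e.g. "last_7_days"
    base: Union[AtomicIndicator, CalculationIndicator]
    qualifications: List[Qualification] = field(default_factory=list)

# Hypothetical example: "on-time delivery rate over the last 7 days".
delivered = AtomicIndicator("delivered_order_count", "order_delivered",
                            "order_id", "COUNT(DISTINCT order_id)")
on_time = Qualification("on_time", "actual_arrival_time <= promised_arrival_time")
# The numerator is the atomic indicator restricted by the on_time qualification.
on_time_rate = CalculationIndicator(
    "on_time_rate", "on_time_delivered_order_count / delivered_order_count",
    operands=[delivered], qualifications=[on_time])
rate_7d = DerivedIndicator("on_time_rate_last_7_days", "last_7_days",
                           base=on_time_rate)
```

Representing indicators this way is what later allows DDL generation and pre-launch checks to be driven by metadata rather than by hand-written conventions.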
4.2 Detailed model design
Detailed model design is the bridge that transforms high-level model design into actual physical production. It must be combined with the data production process and give each physical model a design matched to its warehouse layer. Because responsibility boundaries differ between warehouse layers, detailed model design takes on different characteristics at each layer.
Specifically, engineers need to combine the business requirements with the logical model produced earlier to output the DDL of the physical model that will finally be processed; this is the core of our detailed model design. Intermediate-layer aggregation models, by contrast, exist to improve query performance and are processed on top of the detailed models; they involve no business caliber processing, so as long as the metadata definitions are clear, “TEXT2SQL” can be implemented through tools to achieve configuration-based production. Our engineers only need to focus on developing the basic fact layer, while the intermediate and application layers are built by tools, saving a great deal of time and energy. Before expanding on detailed model design, we first introduce data warehouse layering, and then introduce the detailed model design matched to each layer.
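As an illustration of the configuration-based production mentioned above, the sketch below (Python, with hypothetical table and column names) generates a middle-tier roll-up statement purely from metadata: a base table, a set of dimensions, and a map of aggregation expressions. It is a minimal sketch of the idea, not the actual tool.

```python
def build_rollup_sql(base_table: str, dimensions: list, measures: dict,
                     target_table: str) -> str:
    """Generate a simple roll-up statement from metadata.

    measures maps output column name -> aggregation expression,
    e.g. {"delivered_order_cnt": "COUNT(DISTINCT order_id)"}.
    """
    dim_cols = ", ".join(dimensions)
    measure_cols = ",\n       ".join(
        f"{expr} AS {alias}" for alias, expr in measures.items())
    return (
        f"INSERT OVERWRITE TABLE {target_table}\n"
        f"SELECT {dim_cols},\n       {measure_cols}\n"
        f"FROM {base_table}\n"
        f"GROUP BY {dim_cols};"
    )

# Hypothetical usage: daily city-level delivery summary.
print(build_rollup_sql(
    base_table="dwd_delivery_order_detail",
    dimensions=["dt", "city_id"],
    measures={"delivered_order_cnt": "COUNT(DISTINCT order_id)",
              "on_time_order_cnt": "SUM(IF(is_on_time = 1, 1, 0))"},
    target_table="dm_delivery_city_day",
))
```

Because no business caliber appears here (the calibers are already encoded in the base table's fields), such statements lend themselves to drag-and-drop configuration.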
4.2.1 Introduction to data warehouse layering
Along the overall data production link, data goes through generation, access, and processing before final consumption; data warehouse construction concentrates on the access and processing links. Access includes two processes, data acquisition and data cleaning, which move data from the business systems into the warehouse and provide the raw data for subsequent scenario-based analysis and modeling. We call the data produced by this process the data preparation area; the process is automated by tools and basically requires little manual participation or design.
In the other process, in order to support BI queries from users and report producers, we need to provide users with an open data area. We currently adopt dimensional modeling and warehouse layering theory, using star-schema models plus multidimensional aggregation models to satisfy, respectively, consumers' fixed online analysis and their ad-hoc, casual query demands. This area is the core of the data engineers' overall work, and the metadata deposited by online modeling can help us improve the efficiency and quality of data production. Within this area, we divide the data models into a basic detail layer (B3) and intermediate summary layers (B2, B1) to support the data requirements of different scenarios.
4.2.2 Detailed model design driven by metadata
Design concept
Metadata-driven detailed model design starts from the logical model produced by high-level model design and then drives and constrains the DDL of the physical model to be processed. It is roughly divided into three steps: first, determine the physical model name; second, determine the facts: basic facts are generated automatically from the model's attribution, and derived facts are determined from requirements; third, determine the consistent dimensions of the model based on the bus matrix.
The details of each step vary with the warehouse layer to which the model belongs. For the intermediate summary layer, the model is simply a multi-dimensional roll-up of the basic model; once the basic model is determined, its DDL and DML can be produced automatically through simple drag-and-drop of indicators, so it is relatively simple and will not be detailed here. Next, we focus on the detailed model design of the basic fact layer, as shown in the figure below:
The first step is to determine the model name according to the source of the model. This not only standardizes the model name but also automatically mounts the model onto the asset catalog before data production begins, which facilitates subsequent data management and operation. The second step, constrained by the mount point determined in the first step, determines the facts the model will produce: the basic fact fields the model contains are determined by the snapshot table of the corresponding business process and are produced automatically, while the derived facts the model contains are determined by the qualification conditions required by the derived indicators under that business process. This ensures the unity of requirements, model design, and physical implementation.
Through this process, we restrict the arbitrary construction of physical models in actual production, eliminating at the source the redundancy caused by “smokestack” development. Metadata constrains which facts each topic should produce, preventing the cross-coupling problems caused by unclear boundaries and ensuring the high cohesion and low coupling of the final physical models.
In the third step, the consistent dimensions of the physical model are determined based on the bus matrix rather than being added on demand. If dimensions were adjusted frequently later as requirements fluctuate, the reusability of the basic model would suffer; instead, the dimension design and production are completed in one pass when the model is first produced, improving the stability and reusability of the model.
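The following sketch pulls the three steps together: the model name comes from its asset mount point (layer, subject, business process), the basic fact fields come from the business-process snapshot table, the derived facts come from the qualification conditions required by the derived indicators, and the consistent dimensions come from the bus matrix. The function and all names are hypothetical illustrations of the idea described above, not the actual product.

```python
def build_fact_table_ddl(layer: str, subject: str, process: str,
                         snapshot_fields: dict, qualifications: dict,
                         bus_matrix: dict) -> str:
    """Derive a fact-table DDL from high-level model metadata.

    snapshot_fields: basic facts, column -> type, taken from the
                     business-process snapshot table.
    qualifications:  derived facts, flag column -> type, one per
                     qualification required by the derived indicators.
    bus_matrix:      business process -> list of consistent dimension columns.
    """
    table_name = f"{layer}_{subject}_{process}"         # step 1: name from asset mount
    dims = {d: "STRING" for d in bus_matrix[process]}    # step 3: consistent dimensions
    columns = {**dims, **snapshot_fields, **qualifications}  # step 2: facts
    col_lines = ",\n  ".join(f"{col} {typ}" for col, typ in columns.items())
    return f"CREATE TABLE IF NOT EXISTS {table_name} (\n  {col_lines}\n);"

# Hypothetical example for an order-delivered business process.
print(build_fact_table_ddl(
    layer="dwd", subject="delivery", process="order_delivered",
    snapshot_fields={"order_id": "BIGINT", "delivery_duration": "INT"},
    qualifications={"is_on_time": "TINYINT"},
    bus_matrix={"order_delivered": ["dt", "city_id", "rider_id"]},
))
```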
Product realization
Having explained the concepts and constraints of detailed model design, let us take a closer look at how this is implemented at the product level. Detailed model design builds on the high-level model design of the previous stage and the basic principles of physical modeling; the system guides data engineers through a standard process to complete the corresponding physical model design, with the final DDL as the deliverable of this link, which then guides the data engineers to complete the DML during production.
In addition to helping data engineers complete a standardized model design, this link also completes the contextual description of the physical model, including the mapping between the physical table and the asset catalog and the mapping between physical fields and indicators/dimensions, providing complete basic metadata for the subsequent asset consumption links. Taking the physical model design as the final deliverable, the design process mainly includes two parts: first, determine the name of the physical model according to the specifications and standards; second, determine the data dictionary of the physical model according to the specifications and standards.
- The name of the physical table is generated automatically by determining the warehouse layer, subject domain, and business process corresponding to the physical model.
- Based on the analysis measures and dimensions determined in the high-level model design, the data dictionary corresponding to the physical table is generated automatically (as sketched below), ensuring consistency between the model design and the final physical landing and preventing non-standard development at the source.
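A minimal sketch (Python, hypothetical names) of how such a data dictionary could be emitted alongside the DDL, so that every physical column stays mapped to the measure, qualification, or dimension that defined it. The article does not describe the real implementation; this only illustrates the mapping idea.

```python
def build_data_dictionary(table_name: str, dimension_cols: dict,
                          measure_cols: dict) -> list:
    """Return one dictionary entry per physical column.

    dimension_cols maps column -> consistent dimension name from the bus matrix;
    measure_cols  maps column -> indicator or qualification name from the
                  high-level model design.
    """
    entries = []
    for col, dim in dimension_cols.items():
        entries.append({"table": table_name, "column": col,
                        "maps_to": dim, "kind": "dimension"})
    for col, metric in measure_cols.items():
        entries.append({"table": table_name, "column": col,
                        "maps_to": metric, "kind": "measure"})
    return entries

# Hypothetical usage for the table generated in the previous sketch.
for row in build_data_dictionary(
        "dwd_delivery_order_delivered",
        dimension_cols={"city_id": "city", "rider_id": "rider"},
        measure_cols={"delivery_duration": "atomic:delivery_duration",
                      "is_on_time": "qualification:on_time"}):
    print(row)
```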
4.3 Pre-launch checkpoint
High-level model design and detailed model design constrain and regulate how data engineers determine the DDL of a model, but by themselves they provide no matching constraint to ensure that the actual processing logic (the model's DML) is consistent with the business definition. The pre-launch checkpoint uses the metadata generated in the high-level and detailed model design links to verify the consistency of the DML with the business definition automatically, eliminating the cost of manual verification. The checkpoint verification includes four types:
- Data consistency verification of the same indicator from different sources: the same indicator from different sources, rolled up to the same dimension, must have the same value (see the sketch after this list);
- Verification of consistency between the business definition and the concrete implementation, mainly for code-value fields: the specific values must be consistent with the corresponding business definition;
- Constraint-class verification of R&D compliance, for example: primary keys must be unique, no full-table scans, and code branch coverage (T+1 rerun, batch rerun, full rerun);
- Verification of the cascading impact of changes, including the impact on downstream production tasks and consumption tasks.
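Below is a sketch of the first kind of check. It assumes only a `run_query` callable that executes SQL and returns (dimension value, indicator value) pairs; the function name and signature are illustrative, not an API from the article.

```python
def check_indicator_consistency(run_query, indicator_expr: str, dimension: str,
                                table_a: str, table_b: str,
                                tolerance: float = 1e-6) -> bool:
    """Roll the same indicator up to the same dimension in both source tables
    and verify that every dimension value yields the same result."""
    sql = "SELECT {dim}, {expr} AS val FROM {table} GROUP BY {dim}"
    rows_a = dict(run_query(sql.format(dim=dimension, expr=indicator_expr,
                                       table=table_a)))
    rows_b = dict(run_query(sql.format(dim=dimension, expr=indicator_expr,
                                       table=table_b)))
    if rows_a.keys() != rows_b.keys():
        return False  # a dimension value is missing on one side
    return all(abs(rows_a[k] - rows_b[k]) <= tolerance for k in rows_a)

# Hypothetical usage: the delivered-order count per city must match between
# the new model and the existing source before the new model goes online.
# ok = check_indicator_consistency(run_query, "COUNT(DISTINCT order_id)",
#                                  "city_id", "dwd_delivery_order_delivered",
#                                  "legacy_delivery_orders")
```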
5 Conclusion
Systematic modeling is a product incubated by the distribution data team around the goal of asset-oriented data construction that improves data quality and the effectiveness of data application. Following the idea of turning standard processes into tools, we use tools to constrain and standardize the production work of data engineers and manage model standardization in advance, avoiding the “build first, govern later” pattern of the rapid business development stage. In terms of model quality, we have realized the unification of high-level model design with physical model design, and of business definition with physical implementation. In terms of efficiency, online modeling has deposited valuable metadata for us in a systematic way, which is the key to our subsequent metadata-based improvements in data application efficiency.
① Systematic modeling builds a bridge from data definition to data production, provides a complete process guarantee for the transformation from data to information, and has unified more than 10 themes, more than 180 atomic indicators, more than 300 calculation indicators, and more than 90 derived indicators within the distribution business.
Within Meituan, the governance scores for the normative construction of core themes such as delivery transaction and fulfillment have achieved excellent results; in particular, both the indicator integrity score and the physical model dimension integrity score exceeded 90 points.
② Thanks to the unification of metadata and data realized by systematic modeling, we have transformed data construction from a “nanny” mode to a “service + self-service” mode.
In terms of data retrieval, thanks to the high-quality metadata deposited by systematic modeling, we built a data map, solved the problem of making data searchable and accessible, and realized “what you build is what you get” for retrieval.
In terms of data consumption, thanks to the high-quality metadata deposited by systematic modeling, we implemented a “service + self-service” data service mode. This not only eliminates the problems of traditional report development that relies solely on data engineers, such as long development cycles, slow response to requirements, and limited user coverage, but also solves the problem that “zero-SQL” users could not perform ad-hoc analysis, satisfying business staff's demand to generate analysis reports quickly through simple drag-and-drop.
At present, this mode is widely used by “zero-SQL” data operations staff in all business regions for daily, weekly, and quarterly reports and other business scenarios. It has not only been widely praised by front-line staff, but has also freed our data R&D engineers from the heavy work of fetching and running numbers on request.
About the authors
Wang Peng, Xinxing, and Xiaofei are all from the data team of the Distribution Division.
Team introduction
The distribution data team is responsible for building the real-time and offline data computing systems and data product systems on top of the huge amounts of data produced by Meituan distribution orders and by millions of merchants and riders. It provides data support for the digital and intelligent capabilities of the new-generation intelligent on-demand delivery system, “Meituan Super Brain,” helping the business achieve its core objectives of safety, efficiency, and experience, and it provides a complete data system and data-science-based decision-making capabilities for business operation management, strategic decision-making, and algorithm strategy. As the foundation of Meituan's “everything home” strategy, Meituan Distribution has the industry's richest real-time and offline computing scenarios and applies the most advanced data computing technology architecture in the industry to guarantee the timeliness, consistency, accuracy, and integrity of data computation as well as the stability of data computing and services. Welcome to join us and build an industry-leading data support platform together with the Meituan distribution data team.