This article is based on a talk given at an iQIYI technology conference.

Firstly, I will introduce iQIYI's overall business situation and the design and problems of data warehouse 1.0. Then, in view of the shortcomings of 1.0, I will describe how the architecture evolved to data warehouse 2.0, the problems it needed to solve, and the goals it aimed to achieve.

This picture clearly shows the iQIYI product matrix. iQIYI started as a video business, and new businesses were then derived around it: with long-form video and its core IP at the center, it branched into short video, children's content, reading, comics, community, live streaming, knowledge, sports, e-commerce, and many other businesses. From an apple tree to an apple orchard, a pan-entertainment ecological matrix has been built.

As can be seen, the product matrix involves many businesses, each generating its own data and having its own unique product form. Besides business-oriented data exploration and analysis within a specific business scenario, common data also needs to be extracted and refined from the perspective of what multiple businesses share, enabling cross-business horizontal exploration and analysis so that data can guide and empower the businesses. At the same time, the businesses support and interact with one another, resulting in frequent data exchange between them.

Data Warehouse 1.0

As shown in the data warehouse 1.0 architecture diagram, the warehouse is divided into five parts overall: the original data layer at the bottom, with the detail layer, aggregation layer, and application layer above it. On the right is the dimension layer, which faces the entire data warehouse and manages the conformed dimensions.

1. The original data layer stores raw data from the various data production systems. It mainly has three sources. Pingback delivery: unified, standardized event tracking is embedded in each business product, and the collected data is reported. Business databases: mainly data generated by the business back ends, such as membership orders and literature orders, which is synchronized directly into the original data layer through data integration. Third-party external data: data from sources outside the company.

2. The detail layer restores business processes and stores data at the finest granularity. It performs ETL on the original data in different modes, completing data cleansing and part of the business-logic processing.

3. The aggregation layer stores non-detail data, usually lightly and heavily aggregated data produced by various calculations, and is built mainly with dimensional modeling.

4. The application layer contains result data generated to meet business needs. It is highly customized and is mainly provided to data applications, external systems, and people with specific data requirements. Interfaces connect the data warehouse to other systems, such as service databases and the reporting system.
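To make the layering above concrete, here is a minimal sketch of how a single piece of data (playback events) might flow through the four layers. The ods/dwd/dws/ads prefixes are a common industry convention assumed here for illustration; the article does not prescribe iQIYI's actual table naming.

```python
# Illustrative only: hypothetical table names tracing one data flow through
# the 1.0 layers described above. The ods/dwd/dws/ads prefixes are assumed.
LAYER_FLOW = [
    ("original data layer", "ods_pingback_play_log"),    # raw Pingback events
    ("detail layer",        "dwd_play_event_detail"),    # cleaned, finest grain
    ("aggregation layer",   "dws_play_user_daily_agg"),  # light/heavy aggregates
    ("application layer",   "ads_play_report_daily"),    # report-facing result
]

for layer, table in LAYER_FLOW:
    print(f"{layer:22s} -> {table}")
```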

Looking at the 1.0 architecture, the whole data warehouse was in fact built from a business perspective. Each business built a small data warehouse matching its own characteristics, which could respond quickly to its needs and flexibly support business decisions. However, as data grew and business scenarios became more complex, the lack of extraction and consolidation of common data led to repeated stovepipe-style construction, inconsistent indicator calibers, ambiguous data, and low production and usage efficiency, and there was no strong support from tool platforms.

When data crossed between businesses, data was pulled directly from another business's detail layer or even its original data layer in order to respond to requirements as quickly as possible, which easily produced inconsistent statistical calibers for indicators. The lack of accumulated common aggregation data meant a great deal of data overlapped, resulting in stovepipe-style repeated construction and high resource costs. Although data warehouse 1.0 had a fairly complete set of specifications, it lacked the support and enforcement of a tool platform, so indicators with different names but the same meaning, or with ambiguous definitions, appeared frequently, raising usage costs and lowering development efficiency downstream. A typical scenario: business line A has a dimension and business line B has a dimension with the same name, but the meaning or attributes differ between the two business lines; any cross-business data exploration involving both then consumes a great deal of offline communication and alignment cost.

Data Warehouse 2.0

In response to these 1.0 shortcomings, we upgraded the architecture and gradually evolved into the data warehouse 2.0 era. At the start of the 2.0 design, the overall direction and goals had to be clearly defined.

1. Define the layering and composition, as well as the positioning of the different parts and the relationships between them.

2. Standardize and streamline the entire data modeling process. Previously, the lack of tool support made it impossible to reflect the necessity and importance of data modeling work, so a complete tool platform is provided to support building data warehouse 2.0 from zero to one.

3. Keep data consistent; this is a never-ending topic. No matter which product line or business line the data comes from, the statistical caliber must be consistent, clear, and free of discrepancies.

4. Improve efficiency. Building on the first three points, clarify the structure of the entire data warehouse, minimize complex horizontal interaction between data, keep data flows clear and concise, and eliminate stovepipe-style repeated construction, so as to improve production and usage efficiency and reduce costs.

Data warehouse 2.0 was born to solve the various data problems of the 1.0 era and make data work more valuable. By unifying calibers, standardizing naming, establishing unified indicator and dimension systems, and adopting standardized modeling methods, it opens up and standardizes the company's data; at the same time, sinking common logic downward improves computing efficiency and reduces usage costs. With the support of a variety of data tools, more people can take part in data usage and analysis, so that data-driven decision-making reaches every corner of the company and the ultimate goal of data-driven development is achieved.

Looking at the overall data warehouse 2.0 architecture, the layering does not differ much from 1.0: it is still divided into the original data layer, the detail layer, the aggregation layer, the application layer, and the unified dimensions. As can be seen above, however, the whole data warehouse underwent a major adjustment and is now divided into three parts: the unified data warehouse, the business marts, and the subject data warehouses. The division of labor, positioning, and data reference flow of each component are also made explicit.

At the lowest level is the unified data warehouse, which consists mainly of a unified detail data layer and a unified aggregation data layer. The detail layer connects to all the original data below it, restores the data of all business domains and business processes 100%, and shields the upper layers from changes in the underlying raw data; it is the foundation of the whole 2.0 warehouse. In the detail layer, the logical transformation from business relationships to data relationships is completed, related dimensions are supplemented, the finest-grained data is stored, and ETL processes such as separating complex business logic, cleansing data, and unifying and standardizing data formats are carried out. The aggregation layer is responsible for consolidating common indicators, providing calculation indicators with a uniform caliber to the layers above, and avoiding duplicate computation. In addition, it provides a OneID system based on the unified cumulative device library and new-device library for the upper layers to use.

Business marts focus on business demands and build various data sets to satisfy business analysis. When building business marts we divide them at as fine a granularity as possible, and there is no data dependence or horizontal reference between marts; cross-mart summary calculations are done in the application layer, which provides data services externally. The advantage of this approach is that if the organizational structure or job responsibilities change, the individual business marts do not need to be adjusted; only the application layer needs to be modified accordingly. It also avoids the data change cost caused by mixing computing task code and splitting data permissions.

The subject data warehouses integrate and analyze data across the various businesses from the perspective of common subject domains within the company, based on the conformed dimensions; they include the traffic data warehouse, the content data warehouse, the user data warehouse, and so on.

The application layer includes data application products such as business reports, content analysis, and user operations. According to specific scenarios and requirements, the application layer obtains data from business marts and subject data warehouses.

Having covered the overall structure of warehouse 2.0, let us clarify and summarize the positioning of each of its components.

1. The unified data warehouse provides comprehensive, general-purpose data at the bottom and acts as the maker and enforcer of specifications, providing data and base models to the upper layers; it is the foundation of data warehouse construction.

2. Business marts are built on the data and models of the unified data warehouse; combined with the purpose of business data analysis, they construct data sets that meet each business's needs according to its characteristics.

3. The subject data warehouses are likewise based on the data and models of the unified data warehouse. They are oriented to the analysis of different entities and build data sets in fields such as users and content.

When building business marts and subject data warehouses, the principle of high cohesion and low coupling should be maintained as far as possible, preventing data flows from becoming too complex and too deep and cross-layer dependencies from driving maintenance and development costs ever higher.

Data Warehouse Construction

The following describes how we built the data warehouse platform and standardized the data warehouse modeling process based on the 2.0 architecture and the conformed dimension/indicator system.

The architecture of the data warehouse platform is shown in the figure above. At the bottom are the basic services; the resulting physical tables may live in Hive, MySQL, Kafka, ClickHouse, and so on. The next layer consists of work-order management, permission management, and resource management, which are auxiliary functions of the platform. Work-order management is used to review and approve the creation or modification of dimensions and indicators; we set up a dedicated data warehouse committee to govern the definition of dimensions and indicators and adjust warehouse construction according to the actual situation. Permission management controls the operation permissions of different users and displays the corresponding entries according to a developer's permissions.

The data warehouse management and data model modules are the core of the whole platform. Data warehouse management is responsible for the abstraction and integrated management of the atomic components needed for warehouse construction, including business management, subject management, dimension management, indicator management, and so on. Only when these basics are in place can the subsequent warehouse modeling work proceed. We divide data modeling into three stages, business modeling, data modeling, and physical modeling, each of which is described in more detail later.

The data warehouse platform provides a unified external API, covering dimensions and indicators, for connecting with peripheral systems and top-level product applications. For example, the definition, calculation caliber, and description of indicators are displayed uniformly in the report system, ensuring that indicators convey accurate information as data flows from production to use. The platform also pushes the metadata generated during data modeling to the metadata center for the data map service to support data discovery and understanding.
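As a rough illustration of how a peripheral system might consume such an API, here is a minimal sketch; the endpoint URL, response fields, and indicator code are hypothetical assumptions, not the platform's actual interface.

```python
import requests

# Hypothetical endpoint and response shape -- the article only states that the
# platform exposes a unified dimension/indicator API, not its concrete schema.
METADATA_API = "https://dw-platform.example.com/api/v1/indicators"

def fetch_indicator_definition(indicator_code: str) -> dict:
    """Fetch one indicator's definition so a report system can display a
    consistent name, calculation caliber, and description."""
    resp = requests.get(f"{METADATA_API}/{indicator_code}", timeout=5)
    resp.raise_for_status()
    return resp.json()  # e.g. {"name_cn": ..., "caliber": ..., "description": ...}

# Usage (commented out because the endpoint is hypothetical):
# info = fetch_indicator_definition("play_count_last_1d")
```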

Before introducing the data modeling process, let's talk about the construction of conformed dimensions and the indicator system. Current data warehouse construction is based mainly on dimensional modeling theory, so conformed dimensions are the cornerstone at the bottom. On the data warehouse platform, we divide dimensions into three types: common dimensions, enumerated dimensions, and virtual dimensions.

Common dimensions, the most typical type, usually have a corresponding dimension table; each consists of a primary key and multiple dimension attributes, for example the user dimension and the content dimension. Enumerated dimensions, a special case of common dimensions also known as dictionary dimensions, enumerate and standardize a set of values and represent dimension objects as key-value pairs, for example "whether XX": 0 means no, 1 means yes. Virtual dimensions contain no concrete business entity and cannot be logically defined by a fixable range of values, for example random numbers and session IDs.

When creating a dimension, you need to fill in label attributes such as the English name, Chinese name, description, generality, and the entity it belongs to (such as time, space, or application). By "generality", dimensions are divided into business dimensions and common dimensions: a dimension used by only one business is defined as a business dimension (it applies only to that business and is unavailable to others), while a dimension used by two or more businesses is defined as a common dimension. In fact the two types change over time: a dimension may be used by only one business at first but by multiple businesses as they develop. When a business dimension is upgraded to a common dimension, a common-dimension mirror of it is constructed.

A dimension contains several dimension attributes, each with an English name, Chinese name, data type, description, and so on. The final field name of each dimension attribute in physical tables must also be defined, so that the same name always carries the same meaning and is globally unique within the data warehouse.
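A minimal sketch of what registering such a dimension might look like, assuming hypothetical data structures that mirror the label attributes described above (the platform's real schema is not given in the article):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DimensionAttribute:
    name_en: str          # final field name in physical tables (globally unique)
    name_cn: str
    data_type: str
    description: str = ""

@dataclass
class Dimension:
    name_en: str
    name_cn: str
    entity: str           # e.g. time / space / application
    generality: str       # "business" or "common"
    attributes: List[DimensionAttribute] = field(default_factory=list)

# Hypothetical example: a common content dimension with two attributes.
content_dim = Dimension(
    name_en="dim_content",
    name_cn="内容维度",
    entity="content",
    generality="common",
    attributes=[
        DimensionAttribute("content_id", "内容ID", "string", "primary key"),
        DimensionAttribute("channel_name", "频道名称", "string"),
    ],
)
```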

We will not expand on dimension construction theory here; interested readers can find relevant articles and books online.

The indicator system

The indicator system consists of indicator metadata (atomic indicator metadata and composite indicator metadata), modifiers, time periods, and statistical indicators (atomic indicators and composite indicators).

**Indicator metadata:** an abstraction of statistical indicators; every indicator must be derived from some indicator metadata.

**Atomic indicator metadata / measure:** the smallest unit of a business process that cannot be broken down further, generally composed of an action plus a measure. Atomic indicator metadata is equivalent to a measure, a unit of measurement describing a fact. It is an abstraction of the indicators over a business-process fact and, without modifiers and a time period, carries no specific statistical meaning.

**Composite indicator metadata:** the result of computing over multiple atomic or composite indicator metadata; it is an abstraction of composite indicators. To be used as a statistical indicator it must be given a corresponding time period and modifiers, e.g., click-through rate, proportion, conversion rate.

**Modifiers:** the environment in which a statistical indicator exists, used to clarify its specific meaning and refine the description of its caliber. A statistical indicator may have one or more modifiers, and modifiers and dimension attributes can be converted into each other, e.g., Beijing users, the movie channel page.

**Time period:** describes the time range of a statistical indicator and can be regarded as a special modifier that must be specified for every statistical indicator, e.g., the current day or the last 30 days.

**Statistical indicators:** divided into atomic indicators and composite indicators; they are instantiations of indicator metadata and represent concrete factual measures.

**Atomic indicator:** atomic indicator = one piece of atomic indicator metadata + modifiers (optional) + a time period. It describes the statistics of a single business process, for example the number of playbacks in the last day.

**Composite indicator:** computed from multiple atomic or composite indicators, describing the relationship between multiple business processes, e.g., the play completion rate in the last day, or the average number of app launches per user in the last 30 days.
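A minimal sketch of how a statistical indicator is assembled from the pieces defined above (atomic indicator metadata + optional modifiers + a mandatory time period); the classes and the naming rule are illustrative assumptions, not the platform's actual implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AtomicIndicatorMetadata:
    action: str           # e.g. "play" -- the action part of the measure
    measure: str          # e.g. "count" -- no statistical meaning on its own

@dataclass
class StatisticalIndicator:
    metadata: AtomicIndicatorMetadata
    time_period: str      # mandatory, e.g. "last_1d", "last_30d"
    modifiers: List[str]  # optional, e.g. ["beijing_user", "movie_channel"]

    def code(self) -> str:
        # Illustrative naming rule, not the platform's actual convention.
        parts = [self.metadata.action, self.metadata.measure] + self.modifiers + [self.time_period]
        return "_".join(parts)

# "Playback count of Beijing users on the movie channel page in the last day"
play_cnt = StatisticalIndicator(
    AtomicIndicatorMetadata("play", "count"),
    time_period="last_1d",
    modifiers=["beijing_user", "movie_channel"],
)
print(play_cnt.code())    # play_count_beijing_user_movie_channel_last_1d
```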

Control of the indicator system on the data warehouse platform is very strict; indicators are generally built systematically and normatively, summarized and refined from real scenario needs, with development, product, analysts, and the business working together.

The modeling process

Unified data warehouse modeling is the basis of business-layer modeling and needs to cover as many business processes and dimensions as possible; it includes business modeling, data modeling, and physical modeling.

Business modeling sorts out the business based on existing information combined with the modelers' understanding of it. At this stage the analysis is not oriented to details; the focus is on confirming the scope, i.e., the relationships among business domains, business processes, and entities, and the output is the business bus matrix. The purpose of business modeling is to decompose business requirements and transform them into a data understanding. The specific steps are: divide business domains, confirm business processes, design event facts, confirm related entities, associate event facts, and construct the business bus matrix.

Divide business domains. A business domain is a collection of business processes and a coarse-grained division of the links of the business; related business processes are gathered into one business domain, such as the playback domain.

Confirm business processes. A business process is an atomic behavior in the business and cannot be broken down further. In business modeling we confirm each business process and specify the business domain it belongs to; a business process can belong to only one business domain.

Design event facts: define the event fact fields produced by each business process.

Confirm related entities. At a coarse granularity, identify the scope of entities involved in a business process to avoid missing analysis perspectives and to provide the connecting nodes for associating event facts.

Associate event facts. Unified data warehouse modeling needs to cover all existing event fact fields and associate more dimensions through entities.

Construct the business bus matrix. The rows and columns are the business domains and business processes that describe the facts themselves, and the dimensions and entities that describe the environment of the facts.
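A minimal sketch of what a business bus matrix might look like once captured as data; the domains, processes, and dimensions are hypothetical examples, not iQIYI's actual matrix.

```python
# Rows are (business domain, business process); columns are conformed
# dimensions/entities. True means the process can be described by that dimension.
DIMENSIONS = ["user", "content", "device", "time"]

BUS_MATRIX = {
    ("playback", "start_play"):    {"user": True, "content": True,  "device": True, "time": True},
    ("playback", "finish_play"):   {"user": True, "content": True,  "device": True, "time": True},
    ("membership", "place_order"): {"user": True, "content": False, "device": True, "time": True},
}

print(f"{'domain/process':28s}" + "".join(f"{d:>9s}" for d in DIMENSIONS))
for (domain, process), cols in BUS_MATRIX.items():
    cells = "".join(f"{'x' if cols[d] else '-':>9s}" for d in DIMENSIONS)
    print(f"{domain + '/' + process:28s}" + cells)
```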

The data modeling stage mainly refines the business bus matrix, completes the logical transformation from business relationships to data relationships, supplements the relevant dimensions, and outputs a star (snowflake) schema.

Confirm the business. Modeling is generally for a single business rather than across businesses.

Confirm the business processes, which may be one or several.

Confirm the dimensions involved in the business processes.

Confirm the measures involved in the business processes.

Degrade dimension attributes. To make downstream use more efficient, some commonly used dimension attributes are degraded (denormalized) into the detail-layer model, minimizing joins with dimension tables and improving efficiency (a sketch follows below).

Construct the star schema to guide subsequent development work.
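A minimal sketch of dimension-attribute degradation using pandas: a commonly used attribute of the content dimension is denormalized into the detail-layer fact table so that most downstream aggregations no longer need a join. The table and column names are illustrative.

```python
import pandas as pd

# Detail-layer fact table: one row per playback event (finest grain).
fact_play = pd.DataFrame({
    "play_id":           [1, 2, 3],
    "user_id":           ["u1", "u2", "u1"],
    "content_id":        ["c10", "c20", "c10"],
    "play_duration_sec": [120, 300, 45],
})

# Content dimension table.
dim_content = pd.DataFrame({
    "content_id":    ["c10", "c20"],
    "channel_name":  ["movie", "tv_series"],
    "content_title": ["Film A", "Series B"],
})

# Degrade the frequently used channel_name attribute into the detail model,
# so most downstream aggregations no longer need to join dim_content.
dwd_play_detail = fact_play.merge(
    dim_content[["content_id", "channel_name"]], on="content_id", how="left"
)
print(dwd_play_detail)
```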

Physical modeling is the process of materializing a data model. The materialization process differs slightly depending on the engine; finally the data model is materialized into Hive physical tables/views, or even Kafka topics with a schema.

Confirm the data models and select the ones that need to be materialized.

Confirm the table name, supplementing and completing it according to the warehouse specification with information such as the calculation period, table type, and business.

Confirm the description/usage instructions, supplementing the table's Chinese description and precautions for use.

Confirm the partition fields, such as day-level and hour-level partitions.

Confirm the life cycle, setting the data retention period according to the importance of the data, such as 30 days or 1 year.

Generate the physical tables and feed the tables' business metadata into the metadata center, with table names, field names, and field types standardized and unified.
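A minimal sketch of how physical modeling might turn a model into a Hive table, assuming an illustrative naming rule (layer prefix, business, subject, calculation period), a day-level partition, and a lifecycle recorded in the table comment; the real warehouse specification is not given here.

```python
def build_hive_ddl(layer: str, business: str, subject: str, period: str,
                   columns: dict, lifecycle_days: int) -> str:
    """Generate a Hive CREATE TABLE statement following an assumed naming
    convention: <layer>_<business>_<subject>_<period>."""
    table_name = f"{layer}_{business}_{subject}_{period}"
    cols = ",\n  ".join(f"{name} {ctype}" for name, ctype in columns.items())
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} (\n  {cols}\n)\n"
        f"COMMENT 'lifecycle={lifecycle_days}d'\n"
        f"PARTITIONED BY (dt STRING)\n"
        f"STORED AS ORC"
    )

print(build_hive_ddl(
    layer="dwd", business="play", subject="event_detail", period="1d",
    columns={"play_id": "BIGINT", "user_id": "STRING",
             "content_id": "STRING", "channel_name": "STRING"},
    lifecycle_days=365,
))
```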

As mentioned before, the unified data warehouse serves as the basic source of underlying models and data. Business marts and subject data warehouses are modeled on top of the existing underlying models, mainly through data modeling and physical modeling (the business bus matrix output in the unified warehouse's business modeling stage can of course help in understanding the business).

The goal of business-layer data modeling is to output subject-oriented star schemas. Depending on the subject and the analysis scenario, the relevant business processes are selected and a suitable modeling method is applied. The main steps are: confirm the subject, select the business processes, confirm the granularity, confirm the dimensions, confirm the statistical indicators, and finally output the star schema.

Confirm the subject based on the specific analysis requirements.

Confirm the business and the business processes to be analyzed.

Select unified data warehouse models: the system automatically recommends relevant models, and models meeting the conditions are chosen as the basis for subsequent modeling work.

Confirm the granularity; indicators from models of the same granularity can be combined.

Confirm the dimensions, selecting those for subsequent drill-down analysis. The selection is limited to the scope of the business processes and cannot exceed the dimensions that can be associated.

Confirm the statistical indicators, selecting those derived from the measures (atomic indicator metadata) related to the business processes.

Construct the star schema.

The physical modeling process is the same as described previously and will not be repeated.

The figure below shows an example star schema produced in the data modeling stage. The model diagram clearly expresses the associated business information and data logic to assist subsequent data development.

Data Map

With metadata at its center, the data map provides complete, standard metadata query capabilities, reduces the cost of data discovery and data understanding, and builds a catalog of core data assets to improve data usage efficiency.

The figure below shows the architecture of the metadata center. The open-source framework Apache Atlas is used at the bottom, with targeted secondary development; JanusGraph stores metadata and data lineage, while ES provides the unified metadata search service.

The metadata center is mainly responsible for collecting and managing metadata and building data lineage. Metadata is divided into technical and business metadata. Technical metadata is synchronized and collected automatically through the various platforms and infrastructure service components; for example, a HiveHook automatically collects Hive metadata. Meanwhile, the data warehouse platform synchronizes the business metadata produced during modeling into the metadata center. For lineage, the first part uses HiveHook and SparkHook mechanisms to automatically parse the input and output tables when a SQL/computing task finishes, and the workflow task information submitted with the task is intercepted so that lineage is injected automatically. In addition, the company's self-developed data integration product (BabelX) on the big data development platform implements the same hook mechanism to feed in integration lineage. The second part periodically pulls input/output relationships from peripheral system services. With this we have opened up full-link lineage from Pingback to BI reports, so the entire lineage link can be traced upstream to its source or downstream to its end from any node.

In the past, when data was needed during development or use, it was mostly found through offline communication, asking product or development colleagues, which consumed a great deal of communication and labor cost even after several rounds of effort. Once the data was found, there was still the question of how to understand it and use it correctly. Even with good documentation standards, documents are hard to keep up to date, information can be expressed inaccurately, and some information simply cannot be captured in a document. With bad luck or inaccurate information transfer, the data found would not match expectations, and the search and communication had to start over. Based on the metadata center, we therefore built the data map service for data discovery and data understanding. In building it we set a clear goal: create an efficient environment for the "data demand" scenarios of quickly finding data, intuitively understanding and using data, and guiding data development to improve development efficiency.

"Finding data", i.e., data discovery, requires keyword search based on a shared vocabulary, such as combinations of dimensions and indicators: data matching the target dimension + indicator matrix is presented automatically, and sorting and secondary filtering gradually narrow the target range until it is located. Searching by non-standardized information such as descriptions may be misleading because of inaccurate or ambiguous wording. In addition, a structured directory or wizard-style query is needed to present a complete view of the businesses and their data simply and quickly. For example, the data warehouse map provides a list mode and a graph mode. In list mode, data tables, dimensions, and indicators are displayed as a catalog along with information such as the business and subject a table belongs to, so data can be browsed and located after simple filtering. In graph mode, all businesses and business processes/subjects are summarized graphically as topologies to show a panoramic view of the businesses and data models, and wizard-driven, layer-by-layer filtering locates the target data.

"Using data" refers to how to effectively understand the business information reflected in the data and use it correctly once it has been found. For this, a knowledge graph of metadata is built for data usage scenarios, gathering the relevant technical and business metadata and displaying it in friendly categories. For example, basic information includes the project the data belongs to, the owner, and the permission approvers; business information includes the business, business domain, and subject domain; data warehouse labels include the Chinese name, description, subject model, and dimensions/indicators; data asset information includes the asset level, SLA, and quality score.

Taking Hive tables as an example, all Hive information is collected in the metadata center and classified by label, so data users can quickly find data through search and directory browsing and fully understand its meaning.

Data Lineage

As mentioned before, we built full-link lineage from Pingback to BI reports, injecting input/output information and the corresponding workflow task information at each link. The value of data lineage needs little elaboration: it supports impact assessment, troubleshooting, and link analysis. For example, when a table in a data link has a quality problem and needs backfilling, or a major upgrade of the table requires downstream cooperation and migration, lineage can export all downstream tables, the corresponding workflow tasks, and their owners, helping data producers or managers coordinate downstream changes quickly. New employees can also gain a clearer understanding of the whole data warehouse construction process through lineage; for this, the lineage tool provides link pruning and filtering, so that when there are too many downstream nodes a single branch can be seen clearly by pruning and filtering on certain conditions. When a BI report meets the criteria for retirement, the computing tasks of the corresponding link can be found and taken offline based on the full-link lineage. This solves the problem that it is easy to bring things online but hard to take them offline, improving resource utilization and saving costs.
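A minimal sketch of the downstream impact analysis described above, assuming the lineage is available as a simple table-to-downstream adjacency map (in the real system this lives in JanusGraph inside the metadata center):

```python
from collections import deque

# Hypothetical lineage: table -> directly downstream tables / reports.
LINEAGE = {
    "dwd_play_event_detail":   ["dws_play_user_daily_agg"],
    "dws_play_user_daily_agg": ["ads_play_report_daily", "ads_content_rank_daily"],
    "ads_play_report_daily":   ["bi_report_play_overview"],
}

def downstream_of(table: str) -> list:
    """Breadth-first traversal collecting every downstream node, e.g. to notify
    owners before backfilling or upgrading an upstream table."""
    seen, order, queue = set(), [], deque([table])
    while queue:
        for nxt in LINEAGE.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order

print(downstream_of("dwd_play_event_detail"))
```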

Another application based on full-link lineage is asset grading. In data governance, we first grade and mark data assets so that corresponding governance policies can be specified per level; higher levels have higher SLA and data quality requirements. Grading and marking is done through a combination of automation and manual labeling at the end of the lineage, i.e., at the data applications. Taking BI reports as an example, each report carries an importance label; because not all data ends up in reports, this part can only be annotated manually. In follow-up work we hope to close the loop through data services, that is, grade data APIs, integrate them into the full-link lineage, and connect the lineage between data APIs and data applications, so that data asset grade labels can be covered more comprehensively and automatically.

Summary and Outlook

Finally, a brief summary and outlook, introducing some directions for future work. In the subsequent construction of the data platform, we hope to achieve intelligence, automation, servitization, and model-orientation.

Intelligence

Take the construction of the data quality platform as an example. Typically the platform provides custom quality-check rules and checking capabilities, with developers setting rules and thresholds manually. As the number of tables grows over time, this consumes a lot of manpower, and even with expert experience it is impossible to account for holidays or to set thresholds dynamically. A solution we are currently trying is to collect the core fields of the data warehouse, formulate general data quality rules (such as row count, uniqueness, null rate), automatically collect historical data as samples for training, predict future trends, and set thresholds dynamically. On one hand this automates data quality work and saves manpower; on the other hand, holidays can be taken into account in the intelligent prediction, improving the accuracy of data quality monitoring and reducing false alarms caused by holidays.
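A minimal sketch of the dynamic-threshold idea, assuming a simple mean ± k·std band fitted from historical samples of one quality metric (daily row counts); the production approach with trained models and holiday features would be more elaborate.

```python
import statistics

def dynamic_threshold(history: list, k: float = 3.0) -> tuple:
    """Derive a (low, high) band from historical samples of a quality metric.
    In production, holiday-aware trained models would replace this simple rule."""
    mean = statistics.mean(history)
    std = statistics.stdev(history)
    return mean - k * std, mean + k * std

# Hypothetical history: daily row counts of a core table over the last week.
row_counts = [10_200_000, 10_450_000, 9_980_000, 10_310_000,
              10_120_000, 10_500_000, 10_280_000]
low, high = dynamic_threshold(row_counts)

today = 8_700_000
if not (low <= today <= high):
    print(f"ALERT: today's row count {today} is outside [{low:.0f}, {high:.0f}]")
```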

Automation

Based on the data warehouse platform, the data modeling work has been standardized and streamlined. In subsequent work we will accumulate common models and processes to realize automatic code generation.

Servitization

In the past, when data was connected to data applications or businesses, most of them accessed the underlying data sources directly, resulting in varied access methods, low access efficiency, data and interfaces that could not be shared, and underlying data changes that affected the data applications. In building the current data platform, we are constructing unified data services so that data interacts with data applications and business systems uniformly through APIs, solving the above problems.

Model-orientation

A data model hides the underlying physical implementation and describes business information and data-relationship logic in a standard, normative way. In the future we want users to face models instead of physical tables: a user only needs to select the relevant business and the required dimensions and indicators to be routed automatically to a model. Based on the dependency between the model and the underlying physical tables and on federated query capability, query tasks are generated automatically, the most appropriate physical table is accessed, and the data is delivered to the end user.
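A minimal sketch of the model-oriented query idea: given the dimensions and indicators a user selects, pick a registered model that covers them and generate a query against its backing physical table. The model registry, names, and routing rule are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class LogicalModel:
    name: str
    physical_table: str    # hidden from the end user
    dimensions: set
    indicators: set

# Hypothetical model registry.
MODELS = [
    LogicalModel("play_user_daily", "dws_play_user_daily_agg",
                 {"dt", "user_id", "channel_name"}, {"play_count", "play_duration"}),
    LogicalModel("play_content_daily", "dws_play_content_daily_agg",
                 {"dt", "content_id", "channel_name"}, {"play_count"}),
]

def route_query(dims: List[str], indicators: List[str]) -> Optional[str]:
    """Pick the first model covering the request and build an illustrative SQL text."""
    for model in MODELS:
        if set(dims) <= model.dimensions and set(indicators) <= model.indicators:
            select = ", ".join(dims + [f"SUM({i}) AS {i}" for i in indicators])
            return f"SELECT {select} FROM {model.physical_table} GROUP BY {', '.join(dims)}"
    return None   # no single model covers the request; a federated query would be needed

print(route_query(["dt", "channel_name"], ["play_count"]))
```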

