Hi, I’m Qi Yunqi!

I've been busy with a project delivery recently, so articles have been coming out much more slowly than usual. Thanks for your patience.

Tonight I finally have time to share an article on building an offline data warehouse on Alibaba Cloud MaxCompute.

I work on Dataphin data delivery myself (MaxCompute is the underlying engine), so I deal with this every day. This article is my set of working notes, covering:

  • Selection of technical architecture

  • Data warehouse layering

  • The data model

  • Hierarchical call specification

Tutorial reference: help.aliyun.com/document\_d…

Selection of technical architecture

The tutorial itself uses Aliyun MaxCompute as an example. In fact, the process and methodology are universal.

Before designing the data model, the technical architecture selection needs to be completed. This tutorial uses MaxCompute and DataWorks, Alibaba Cloud's big data products, to carry out the overall data modeling and development process. The methodology applies equally well to Alibaba Cloud Dataphin.

The complete technical architecture is shown in the following figure.

Here, DataWorks Data Integration handles data collection and basic ETL (an equivalent collection and ETL platform can also be built from open-source components). MaxCompute serves as the offline computing engine throughout the big data development process, while DataWorks provides data development, data quality, data security, data management, and related functions.

Data warehouse layering

In Alibaba's data system, we suggest dividing the data warehouse into three layers. From bottom to top, they are: the data introduction layer (ODS, Operation Data Store), the data common layer (CDM, Common Data Model), and the data application layer (ADS, Application Data Service).

The following figure shows the layers of the data warehouse and the uses of each layer.

  1. Operation Data Store (ODS): stores raw, unprocessed data in the data warehouse system. ODS is the data preparation area of the warehouse and keeps the same structure as the source systems. Its main responsibility is to bring the base data into MaxCompute while recording the historical changes of that data.

  2. Common Data Model (CDM): includes the DIM dimension tables, DWD, and DWS, all processed from ODS-layer data. It mainly performs data processing and integration, establishes consistent dimensions, builds reusable, analysis-oriented detailed fact tables, and summarizes indicators at common granularities.

  • Common dimension layer (DIM): based on the concepts of dimensional modeling, establishes consistent dimensions for the whole enterprise, reducing the risk of inconsistent calculation calibers and algorithms. Tables in the common dimension layer are also commonly referred to as logical dimension tables; dimensions and dimension logical tables usually correspond one to one.

  • Common summary granularity fact layer (DWS): driven by the subject objects of analysis, builds common-granularity summary indicator fact tables based on upper-layer application and product indicator requirements, and physicalizes the model as wide tables. It establishes naming conventions and consistent statistical indicators, provides public indicators to the upper layer, and builds summary wide tables and detailed fact tables. Tables at this layer are also commonly called summary logic tables and hold derived metric data.

  • Detailed granularity fact layer (DWD): driven by business processes, builds the finest-grained detailed fact table for each specific business process. Depending on how the enterprise uses its data, some important dimension attribute fields of the detailed fact tables can be made appropriately redundant, i.e. wide-table processing. Tables at this layer are also commonly called logical fact tables.

  3. Application Data Service (ADS): stores the personalized statistical indicators of data products, generated from the CDM and ODS layers.

At the ODS layer, the data classification architecture is divided into three parts: the data preparation area, the offline data area, and the quasi-real-time data area. The overall data classification architecture is shown in the figure below.

In this tutorial, data from the transaction system is synchronized to the ODS layer of the data warehouse through DataWorks Data Integration. After data development, fact wide tables are formed, and then public summaries are produced along the commodity and region dimensions.

The overall data flow is shown in the following figure. ETL (Extract, Transform, and Load) processing from ODS to DIM is performed in MaxCompute and is synchronized to all storage systems after processing.

The ODS and DWD layers are placed in the data middleware for downstream subscription. DWS and ADS layer data usually lands in online storage systems and is consumed downstream through interface calls.

The data model

1. Data Introduction Layer (ODS)


The ODS layer stores the raw data obtained from the business systems and is the data source for the upper layers. Business system data is usually highly detailed, accumulated over a long period, and frequently accessed; it is application-oriented data.

Data introduction layer table design

In this tutorial, the main data in the ODS layer includes transaction order details, user information details, commodity details, and so on. This data is unprocessed raw data; logically, it is stored as two-dimensional tables. Although the ODS layer does not strictly belong to data warehouse modeling, planning the ODS layer well and doing a good job of data synchronization is still very important.

In this tutorial, six ODS tables are used:

  • s_auction: information on items up for auction.
  • s_sale: information on items sold at a fixed price.
  • s_users_extra: user details.
  • s_biz_order_delta: newly added transaction order information.
  • s_logistics_order_delta: newly added logistics order information.
  • s_pay_order_delta: newly added payment order information.

Note: Tables ending in _delta are incremental tables. Some fields share names with reserved keywords; this can be resolved by appending the suffix _col1 to the field name.

Table creation example (s_auction)

```sql
CREATE TABLE IF NOT EXISTS s_auction
(
    id             STRING COMMENT 'Item ID',
    title          STRING COMMENT 'Item name',
    gmt_modified   STRING COMMENT 'Last modified date of the item',
    price          DOUBLE COMMENT 'Transaction price',
    minimum_bid    DOUBLE COMMENT 'Minimum bid',
    duration       STRING COMMENT 'Sale duration',
    incrementnum   DOUBLE COMMENT 'Bid increment',
    city           STRING COMMENT 'City where the item is located',
    prov           STRING COMMENT 'Province where the item is located',
    ends           STRING COMMENT 'Sale end time',
    quantity       BIGINT COMMENT 'Quantity',
    stuff_status   BIGINT COMMENT 'Item condition: 0 new, 1 unused, 2 used',
    auction_status BIGINT COMMENT 'Item status: 0 normal, 1 deleted by user, 2 off the shelves, 3 never on the shelves',
    cate_id        BIGINT COMMENT 'Category ID',
    cate_name      STRING COMMENT 'Category name',
    commodity_id   BIGINT COMMENT 'Commodity category ID',
    commodity_name STRING COMMENT 'Commodity category name',
    umid           STRING COMMENT 'Buyer ID'
)
PARTITIONED BY (ds STRING COMMENT 'Format: YYYYMMDD')
LIFECYCLE 400;
```

Data introduction layer storage

To meet historical data analysis requirements, you can add the time dimension as a partition field in ODS layer tables. In practice, you can choose incremental storage, full storage, or zipper storage.

  • Incremental storage: stores incremental business data by day, with the business date as the partition; each partition stores that day's incremental business data. Here are some examples:

    On January 1, user A visits e-commerce store B of Company A, and Company A's e-commerce log generates record T1. On January 2, user A visits Company A's e-commerce store C, and the log generates record T2. With incremental storage, T1 is stored in the January 1 partition and T2 in the January 2 partition.

    On January 1, user A buys product B on Company A's e-commerce site, and the transaction log generates record T1. On January 2, user A returns the product, and the transaction log updates record T1. With incremental storage, the originally purchased T1 record is stored in the January 1 partition, and the updated T1 record is stored in the January 2 partition.

Transactional ODS tables such as transactions and logs are suitable for incremental storage. These tables have large data volumes, and full storage would put heavy pressure on storage costs. In addition, downstream applications rarely need to access the full history of such tables (it can be derived from subsequent data warehouse summaries). For example, a log-type ODS table has no business process that updates data, so the union of all incremental partitions is exactly one full copy of the data.
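Where an incremental table does get updated (like the transaction order above), the latest full copy can be rebuilt by merging the previous full partition with the day's increment. Below is a minimal MaxCompute SQL sketch of such a merge; the full-table name (ods_trade_order_full), the column names assumed on s_biz_order_delta, and the ${bizdate} scheduling parameters are illustrative assumptions, not part of the tutorial.

```sql
-- Rebuild today's full partition from yesterday's full partition plus today's increment:
-- for each primary key, keep only the most recently modified record.
INSERT OVERWRITE TABLE ods_trade_order_full PARTITION (ds = '${bizdate}')
SELECT order_id, buyer_id, order_amt, gmt_modified
FROM (
    SELECT order_id, buyer_id, order_amt, gmt_modified,
           ROW_NUMBER() OVER (PARTITION BY order_id
                              ORDER BY gmt_modified DESC) AS rn
    FROM (
        SELECT order_id, buyer_id, order_amt, gmt_modified
        FROM ods_trade_order_full
        WHERE ds = '${bizdate_1}'      -- yesterday's full partition (assumed parameter)
        UNION ALL
        SELECT order_id, buyer_id, order_amt, gmt_modified
        FROM s_biz_order_delta
        WHERE ds = '${bizdate}'        -- today's increment
    ) merged
) ranked
WHERE rn = 1;
```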

  • Full storage: stores a full copy by day, with the business date as the partition; each partition stores the full business data as of that date.

    For example, on January 1, seller A publishes two items, B and C, on Company A's e-commerce site, generating two records, T1 and T2, in the front-end item table. On January 2, seller A takes item B off the shelves and publishes item D at the same time; record T1 is updated in the item table and record T3 is generated.

    With full storage, the January 1 partition stores the T1 and T2 records, and the January 2 partition stores the updated T1, T2, and T3 records.

For small volumes of slowly changing dimensional data, such as merchandise categories, full storage can be used directly.

  • Zipper storage: records all change data at daily granularity by adding two timestamp fields (start_dt and end_dt); usually the partition fields are these two timestamp fields as well.

The following is an example of zipper storage.

In this way, downstream applications can obtain historical data by constraining the timestamp fields. For example, to access the data of January 1, the user simply restricts start_dt <= 20160101 and end_dt > 20160101.
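As a concrete sketch (assuming a zipper dimension table named dim_item_zipper that carries the two fields above), the January 1 snapshot can be restored like this:

```sql
-- Restore the full snapshot as of 20160101 from a zipper table;
-- each row is valid for dates d with start_dt <= d < end_dt.
SELECT item_id, item_title, item_status
FROM dim_item_zipper
WHERE start_dt <= '20160101'
  AND end_dt   >  '20160101';
```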

Slowly changing dimension

With MaxCompute, we do not recommend using surrogate keys; use natural keys as the dimension primary keys instead, for two main reasons:

  • MaxCompute is a distributed computing engine, and generating globally unique surrogate keys takes a great deal of effort; at large data volumes this is complicated and unnecessary.

  • Surrogate keys increase the complexity of ETL, raising the development and maintenance cost of ETL tasks.

Slowly changing dimensions can be handled by snapshot mode without the use of surrogate keys.

In snapshot mode, the calculation period is usually once a day. Based on this period, dimension changes are processed by taking a full snapshot every day.

For example, in the commodity dimension, a full snapshot is kept every day. The fact table of any given day can obtain that day's commodity information, or the latest commodity information, by restricting the date and joining on the natural key. The advantages of this method are as follows:

  • The approach to dealing with slowly changing dimensions is simple and effective, with low development and maintenance costs.

  • Easy to use and understand. Data users simply take the snapshot of the specified day; the fact snapshot of any day can be joined with the dimension snapshot of that day via the dimension's natural key.

The main drawback of this method is the great waste of storage space. For example, if a dimension's daily changes are a very low proportion of its total data volume, or in the extreme case there are no changes at all, the storage waste is serious.

This method trades storage for ETL efficiency and logical simplicity. Avoid overusing it, and make sure a corresponding data lifecycle mechanism is in place to clear useless historical data.
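A minimal sketch of the snapshot-mode join, using the tutorial's dwd_asale_trd_itm_di fact table and dim_asale_itm dimension table (the assumption being that both are partitioned by ds and share the natural key item_id):

```sql
-- Join a day's fact snapshot to the same day's dimension snapshot
-- via the natural key; no surrogate key is needed.
SELECT f.item_id,
       f.item_price,
       d.commodity_name
FROM dwd_asale_trd_itm_di f
JOIN dim_asale_itm d
  ON d.item_id = f.item_id
 AND d.ds      = f.ds          -- same-day dimension snapshot
WHERE f.ds = '20160101';
```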

Data synchronization loading and processing

ODS data needs to be synchronized from each source system to MaxCompute before it can be used for further data development. This tutorial recommends using the DataWorks Data Integration feature for synchronization. When using Data Integration, you are advised to follow these specifications:

  • A system’s source tables can only be synchronized to MaxCompute once to maintain the consistency of the table structure.

  • Data integration is used only for full offline data synchronization. Incremental real-time data synchronization requires the data transfer service (DTS). For details, see Data Transfer Service (DTS).

  • Fully synchronized data from Data Integration goes directly into the day partition of the full table.

  • ODS layer tables are recommended to be partitioned by statistical date and time, to facilitate storage cost management and policy control.

  • Data integration adaptively handles changes in source system fields:

  • If a source system field's target column does not exist on MaxCompute, Data Integration can automatically add the missing field to the table.

  • If a target table field does not exist in the source system, Data Integration fills it with NULL.

2. Common Dimension Summary Layer (DIM)


The Common Dimension Summary Layer (DIM) builds consistency dimensions across the enterprise based on the concept of dimension modeling.

The common dimension summary layer (DIM) consists mainly of dimension tables. A dimension is a logical concept: the perspective from which business is measured and viewed. A dimension table is a physical table built on the data platform from a dimension and its attributes, following the wide-table design principle. Therefore, building the common dimension summary layer (DIM) first requires defining the dimensions.

Define the dimensions

When dividing data domains and building the bus matrix, the dimensions need to be defined alongside the analysis of business processes. Take the marketing business section of e-commerce Company A in this tutorial as an example: in the transaction data domain, we focus on the business process of confirming receipt (successful transaction).

In the confirming-receipt business process, the business can mainly be viewed from two perspectives: the commodity and the place of receipt (this tutorial assumes the receiving and purchasing locations are the same).

From the commodity perspective, the following dimension attributes can be defined:

  • Commodity ID
  • Commodity name
  • Commodity price
  • Commodity condition: 0 new, 1 unused, 2 second-hand
  • Commodity category ID
  • Commodity category name
  • Category ID
  • Category name
  • Buyer ID
  • Commodity status: 0 normal, 1 deleted by user, 2 off the shelves, 3 never on the shelves
  • City where the commodity is located
  • Province where the commodity is located

From the region perspective, the following dimension attributes can be defined:

  • Buyer ID
  • City code
  • City name
  • Province code
  • Province name

As the core of dimensional modeling, the uniqueness of each dimension must be guaranteed across the enterprise data warehouse. Taking Company A's commodity dimension as an example, only one dimension definition is allowed; for instance, the province code dimension must convey the same information in every business process.

Design the dimension tables

Once the dimension definitions are complete, you can supplement them with attributes to generate the dimension tables. Dimension table design needs attention to the following:

  • You are advised to keep the record count of a single dimension table within 10 million.

  • When joining dimension tables with other tables, Map Join is recommended (see the sketch after this list).

  • Avoid updating dimension table data too frequently.
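A minimal sketch of the Map Join recommendation in MaxCompute SQL: the MAPJOIN hint broadcasts the small dimension table to every mapper, so the join runs map-side without shuffling the large fact table (table names are the tutorial's; the join and filter columns are assumptions):

```sql
-- Broadcast the small dimension table dim_asale_itm for a map-side join.
SELECT /*+ MAPJOIN(d) */
       f.item_id,
       d.cate_name
FROM dwd_asale_trd_itm_di f
JOIN dim_asale_itm d
  ON d.item_id = f.item_id
WHERE f.ds = '20160101'
  AND d.ds = '20160101';
```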

When designing dimension tables, you need to consider the following:

  • Stability of the data in the dimension table. For example, Company A's e-commerce members usually do not expire, but their data may be updated at any time; in this case, a single partition storing the full data is appropriate. If some records are closed out and never updated again, you may need to create a history table and a daily table separately: the daily table stores the currently valid records and keeps the table from bloating, while the history table is partitioned by expiry date, each partition storing the records that expired within that partition's time range.

  • Whether vertical splitting is required. If a dimension table has a large number of properties that are not used, or the query is slowed down because it carries too many attribute fields, consider splitting the fields to create multiple dimension tables.

  • Whether horizontal split is needed. If there are clear boundaries between records, consider splitting them into multiple tables or designing multi-level partitioning.

  • The core dimension table output time usually has strict requirements.

The main steps for designing dimension tables are as follows:

  • Complete the initial definition of dimensions and ensure the consistency of dimensions.

  • Determine the main dimension table (this tutorial uses the star schema, with the fact table at the center). The main dimension table here is usually a data introduction layer (ODS) table synchronized directly from the business system. For example, s_auction, synchronized from the front-end commodity center system, is a main dimension table.

  • Determine the related dimension tables. The data warehouse integrates data from business source systems, and tables within or across business systems are associated with each other. Based on business analysis, determine which tables are associated with the main dimension table and select some of them to generate dimension attributes. Taking the commodity dimension as an example, business analysis shows that commodities are associated with categories, sellers, shops, and other dimensions.

  • Determine the dimension attributes, in two main stages. The first stage selects dimension attributes from the main dimension table or generates new ones; the second stage does the same from the related dimension tables. Taking the commodity dimension as an example, select or generate dimension attributes from the main dimension table (s_auction) and the related category, seller, and store dimension tables.

  • Generate as many dimensional properties as possible.

  • Give as many meaningful textual descriptions as possible.

  • Distinguish numeric attributes from facts.

  • Try to distill out common dimension attributes.

Dimension table naming convention for the common dimension summary layer (DIM): dim_{business section name/pub}_{dimension definition}[_{custom naming label}]. Here pub denotes a dimension unrelated to any specific business section or shared by all sections, such as the time dimension.

Here are some examples:

  • Public area dimension table: dim_pub_area
  • Full commodity table of Company A's e-commerce section: dim_asale_itm

Table creation examples

```sql
CREATE TABLE IF NOT EXISTS dim_asale_itm
(
    item_id           BIGINT COMMENT 'Item ID',
    item_title        STRING COMMENT 'Item name',
    item_price        DOUBLE COMMENT 'Item price',
    item_stuff_status BIGINT COMMENT 'Item condition: 0 new, 1 unused, 2 used',
    cate_id           BIGINT COMMENT 'Category ID',
    cate_name         STRING COMMENT 'Category name',
    commodity_id      BIGINT COMMENT 'Commodity category ID',
    commodity_name    STRING COMMENT 'Commodity category name',
    umid              STRING COMMENT 'Buyer ID',
    item_status       BIGINT COMMENT 'Item status: 0 normal, 1 deleted by user, 2 off the shelves, 3 never on the shelves',
    city              STRING COMMENT 'City where the item is located',
    prov              STRING COMMENT 'Province where the item is located'
)
COMMENT 'Full commodity table'
PARTITIONED BY (ds STRING COMMENT 'Date partition, format: YYYYMMDD');

CREATE TABLE IF NOT EXISTS dim_pub_area
(
    buyer_id  STRING COMMENT 'Buyer ID',
    city_code STRING COMMENT 'City code',
    city_name STRING COMMENT 'City name',
    prov_code STRING COMMENT 'Province code',
    prov_name STRING COMMENT 'Province name'
)
COMMENT 'Public area dimension table'
PARTITIONED BY (ds STRING COMMENT 'Date partition, format: YYYYMMDD')
LIFECYCLE 3600;
```

3. Detail granularity Fact Layer (DWD)


The detailed granularity fact layer is modeled around business processes: based on the characteristics of each specific business process, it builds the finest-grained detailed fact tables. Depending on how the enterprise uses its data, some important dimension attribute fields of the detailed fact tables can be made appropriately redundant, i.e. wide-table processing.

The fact tables of the common summary granularity fact layer (DWS) and the detailed granularity fact layer (DWD) are the core of dimensional modeling in the data warehouse and must be designed closely around business processes. A fact table describes a business process by capturing its measures, together with the referenced dimensions. Measures are usually numeric data and form the basis of the fact logical table; the descriptive information of the fact logical table constitutes its fact attributes, and the foreign key fields among the fact attributes are associated with the corresponding dimensions.

The level of business detail expressed by a single record in a fact table is called granularity. In general, granularity can be expressed in two ways: the level of detail represented by a combination of dimensional attributes, and the specific business meaning represented.

A measure is usually a numeric value of integer or floating-point type that quantifies a business process, and it comes in three kinds: additive, semi-additive, and non-additive:

  • Additive facts are those that can be summarized in terms of any dimension associated with the fact table.

  • Semi-additive facts can only be summarized along certain dimensions, not all of them. For example, inventory can be aggregated by place and item, but it makes no sense to add up inventory across months along the time dimension.

  • Non-additive facts, such as ratios, cannot be added at all. Non-additive facts can be decomposed into additive components to enable aggregation.

Fact tables are generally narrower than dimension tables but grow much faster in rows. Dimension attributes can also be stored directly in the fact table; storing dimension columns in the fact table this way is called dimension degeneration, and it speeds up queries. Like dimensions stored in dimension tables, degenerate dimensions can be used to filter fact table queries, implement aggregations, and so on.

There are three types of detailed granularity fact tables (DWD): transaction fact tables, periodic snapshot fact tables, and cumulative snapshot fact tables. For details, see the Data Warehouse Construction Guide.

  • Transaction fact tables describe business processes, track measured events at a point in space or time, and hold the most atomic data; they are also known as atomic fact tables.

  • Periodic snapshot Fact table records facts at regular, predictable intervals.

  • Cumulative snapshot fact tables represent critical step events between the start and end of a process, cover the entire lifecycle of the process, and typically have multiple date fields to record critical points in time. As the cumulative snapshot fact table changes over the life cycle, records are modified as the process changes.

Design principles for detailed granular fact tables

The detailed granularity fact table design principles are as follows:

  • Typically, a detailed granular fact table is associated with only one dimension.

  • Include as many business process-related facts as possible.

  • Select only the facts that are relevant to the business process.

  • Decompose non-additive facts into additive components.

  • The granularity must be declared before the dimensions and facts can be selected.

  • You cannot have multiple granular facts in the same fact table.

  • The units of fact should be consistent.

  • Handle Null values with caution.

  • Use degradation dimensions to improve the ease of use of fact tables.

  • The overall design process of detailed granular fact table is shown in the figure below.

The transaction business process and its measures were already defined during conformed-measure definition. Detailed fact tables focus on model design for business processes. Designing a detailed fact table takes four steps:

Select the business process, declare the granularity, choose the dimensions, and identify the facts (measures). Declaring the granularity mainly means giving a semantic description of the business activity recorded, before the dimensions are expanded. When building a detailed fact table, you need to choose which existing tables to develop the detailed data from, and know what granularity of data those table records store.

Detail granularity fact Layer (DWD) specification

Generally, the following naming convention applies: dwd_{business section/pub}_{data domain abbreviation}_{business process abbreviation}[_{custom table label abbreviation}]_{increment/full identifier}, where pub indicates that the data spans multiple business sections. The increment/full identifier is usually: i for increment, f for full. For example: dwd_asale_trd_ordcrt_trip_di (Company A's e-commerce air-travel order creation fact table, daily incremental update) and dwd_asale_itm_item_df (Company A's e-commerce commodity snapshot fact table, daily full refresh).

In this tutorial, the DWD layer consists of three main tables:

  • Transaction commodity information fact table: dwd_asale_trd_itm_di
  • Transaction member information fact table: dwd_asale_trd_mbr_di
  • Transaction order information fact table: dwd_asale_trd_ord_di

Table creation example (dwd_asale_trd_itm_di)

```sql
CREATE TABLE IF NOT EXISTS dwd_asale_trd_itm_di
(
    item_id           BIGINT COMMENT 'Item ID',
    item_title        STRING COMMENT 'Item name',
    item_price        DOUBLE COMMENT 'Item price',
    item_stuff_status BIGINT COMMENT 'Item condition',
    item_prov         STRING COMMENT 'Province where the item is located',
    item_city         STRING COMMENT 'City where the item is located',
    cate_id           BIGINT COMMENT 'Category ID',
    cate_name         STRING COMMENT 'Category name',
    commodity_id      BIGINT COMMENT 'Commodity category ID',
    commodity_name    STRING COMMENT 'Commodity category name',
    buyer_id          BIGINT COMMENT 'Buyer ID'
)
COMMENT 'Transaction commodity information fact table'
PARTITIONED BY (ds STRING COMMENT 'Date')
LIFECYCLE 400;
```
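For context, here is a hypothetical ETL sketch of how such a wide table might be populated: the ODS order increment is joined with the commodity table so that commodity dimension attributes are stored redundantly in the fact table. The columns assumed on s_biz_order_delta (item_id, buyer_id) are illustrative, not from the tutorial.

```sql
-- Populate the DWD detail wide table: join the day's order increment with
-- the commodity table, redundantly storing the commodity dimension attributes.
INSERT OVERWRITE TABLE dwd_asale_trd_itm_di PARTITION (ds = '${bizdate}')
SELECT CAST(a.id AS BIGINT) AS item_id,   -- s_auction.id is STRING in the ODS DDL
       a.title              AS item_title,
       a.price              AS item_price,
       a.stuff_status       AS item_stuff_status,
       a.prov               AS item_prov,
       a.city               AS item_city,
       a.cate_id,
       a.cate_name,
       a.commodity_id,
       a.commodity_name,
       o.buyer_id
FROM s_biz_order_delta o
JOIN s_auction a
  ON a.id = o.item_id
 AND a.ds = '${bizdate}'
WHERE o.ds = '${bizdate}';
```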

4. Common Summary Granularity Fact Layer (DWS)


Detail granularity ==> Summary granularity

The common summary granularity fact layer takes the subject objects of analysis as the modeling driver and builds common-granularity summary indicator fact tables based on upper-layer application and product indicator requirements. A table in the common summary layer usually corresponds to a derived metric.

Common summary fact table design principles

Aggregation summarizes data from the original detailed granularity. The DWS common summary layer models subject-oriented aggregations for analysis objects. In this tutorial, the final analysis targets are: the total sales of a category (e.g. kitchen appliances) in each province over the most recent day, the names of the top 10 best-selling items in that category, and the distribution of user purchasing power in each province. Therefore, we can summarize the most recent day's data on successfully traded commodities along the commodity, category, buyer, and other granularities.

  • Aggregations do not cross facts. An aggregation is a summary of the original star schema; to obtain query results consistent with the original model, its dimensions and measures must match the original model, so an aggregation never spans multiple fact tables.

  • Aggregation improves query performance but makes ETL maintenance harder. For instance, when the parent category of a subcategory changes, data already aggregated into the summary table needs to be re-adjusted.

In addition, the following principles should be followed when designing the DWS layer:

  • Data commonality: consider whether an aggregation can be made available to third parties. If a dimension-based aggregation is frequently used in data analysis, it is worth aggregating the detail data and depositing it into a summary table.

  • Not across data domains: Data domains are abstractions that classify and aggregate data at a high level. Data domains are usually classified by business processes, for example, transactions are unified under the transaction domain, and new additions and modifications of commodities are placed under the commodity domain.

  • Statistical period: the table name must indicate the statistical period of the data, e.g. _1d for the most recent day, _td for up to the current day, and _nd for the most recent N days.

Common Summary fact Table specification

Public summary fact table naming convention: dws_{business section abbreviation/pub}_{data domain abbreviation}_{data granularity abbreviation}[_{custom table label abbreviation}]_{statistical period abbreviation}. By default, offline computing should deliver the most-recent-day (_1d), most-recent-N-days (_nd), and history-to-date (_td) tables.

If an _nd table has too many fields and needs to be split, it may only be split by a single statistical period unit, i.e. one table per statistical period; for example, the most recent 7 days (_1w) becomes its own table. A split table is not allowed to store multiple statistical periods.

For hour tables (whether refreshed daily or hourly), use _hh. For minute tables (whether refreshed daily or hourly), use _mm.

Here are some examples:

  • dws_asale_trd_byr_subpay_1d: Company A's e-commerce buyer-granularity installment-payment transactions, one-day summary table
  • dws_asale_trd_byr_cod_nd: Company A's e-commerce buyer-granularity cash-on-delivery transaction summary fact table, last N days
  • dws_asale_itm_slr_hh: Company A's e-commerce seller-granularity hourly summary table (time dimension: hour)
  • dws_asale_itm_slr_mm: Company A's e-commerce seller-granularity minute summary table (time dimension: minute)

Table creation examples

The following DDL statements build the DWS layer tables that meet the business requirements.

```sql
CREATE TABLE IF NOT EXISTS dws_asale_trd_byr_ord_1d
(
    buyer_id                BIGINT COMMENT 'Buyer ID',
    buyer_nick              STRING COMMENT 'Buyer nickname',
    mord_prov               STRING COMMENT 'Consignee province',
    cate_id                 BIGINT COMMENT 'Category ID',
    cate_name               STRING COMMENT 'Category name',
    confirm_paid_amt_sum_1d DOUBLE COMMENT 'Total amount of confirmed-receipt orders in the last day'
)
COMMENT 'Buyer-granularity transaction summary fact table for the last day'
PARTITIONED BY (ds STRING COMMENT 'YYYYMMDD')
LIFECYCLE 36000;

CREATE TABLE IF NOT EXISTS dws_asale_trd_itm_ord_1d
(
    item_id                 BIGINT COMMENT 'Item ID',
    item_title              STRING COMMENT 'Item name',
    cate_id                 BIGINT COMMENT 'Category ID',
    cate_name               STRING COMMENT 'Category name',
    mord_prov               STRING COMMENT 'Consignee province',
    confirm_paid_amt_sum_1d DOUBLE COMMENT 'Total amount of confirmed-receipt orders in the last day'
)
COMMENT 'Item-granularity transaction summary fact table for the last day'
PARTITIONED BY (ds STRING COMMENT 'YYYYMMDD')
LIFECYCLE 36000;
```
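As a usage sketch, the item-granularity one-day table could be refreshed from the DWD detail like this; the source table (dwd_asale_trd_ord_di) and its measure column (confirm_paid_amt) are assumptions for illustration:

```sql
-- Daily aggregation from DWD detail into the item-granularity 1-day DWS table.
INSERT OVERWRITE TABLE dws_asale_trd_itm_ord_1d PARTITION (ds = '${bizdate}')
SELECT item_id,
       item_title,
       cate_id,
       cate_name,
       mord_prov,
       SUM(confirm_paid_amt) AS confirm_paid_amt_sum_1d
FROM dwd_asale_trd_ord_di     -- assumed order-detail wide table carrying these columns
WHERE ds = '${bizdate}'
GROUP BY item_id, item_title, cate_id, cate_name, mord_prov;
```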

Hierarchical call specification

After completing the layering of the data warehouse, you need to make conventions about the calling relationships between the levels of data.

The ADS application layer should preferentially call data from the common layer of the data warehouse. If CDM data already exists, the ADS layer is not allowed to skip the CDM middle layer and reprocess data from the ODS layer. The CDM middle layer should actively understand the application layer's data construction requirements, deposit common data into the common layer, and provide data services to the other layers; in turn, the ADS application layer should actively cooperate with the CDM middle layer in the ongoing transformation toward common data construction, avoiding excessive ODS references, unreasonable data copies, and redundant subsets. The overall principles of hierarchical invocation are as follows:

  • ODS layer data cannot be referenced directly by application layer tasks. If the middle layer has not deposited the needed ODS data, access it through a CDM layer view. CDM layer views must be managed with a scheduler to keep them maintainable and controllable.

  • The depth of CDM layer tasks should not be too large (no more than 10 layers are recommended).

  • A computing or refresh task is allowed only one output table, except in special cases.

  • If one table is output by multiple refresh tasks (different tasks inserting into different partitions), create a virtual task on DataWorks that depends on the refreshes of all those tasks; downstream tasks should typically depend on this virtual task.

  • The CDM summary layer should preferentially call the CDM detail layer and may accumulate indicators. The CDM summary layer should also preferentially call already-produced, coarser-grained summary layers, avoiding computing large summary-layer results directly from massive detail data.

  • CDM detail-layer cumulative snapshot fact tables should preferentially call CDM transaction fact tables, to keep data outputs consistent.

  • Build targeted CDM common summary layers to prevent the application layer from over-referencing and over-depending on CDM layer detail data.

I'm Qi Yunqi, a big data developer who loves technology and writes a little poetry. Welcome to follow me!
