**Abstract:** How to build a data warehouse, which methods and principles to follow in the process, and which practical techniques matter in real projects.
I. The “Heart” of the Data Warehouse
Let’s start with the data model. A model is a simulation and abstraction of real-world features, such as a map, an architectural sand-table model, or a model airplane.
A data model, likewise, is an abstraction of the characteristics of real-world data.
In a data warehouse project, establishing the data model is of great significance. The customer’s business scenarios, process rules, and industry knowledge are all reflected in the data model, which builds a bridge of communication between business staff and technical staff. This is why some foreign data warehouse literature calls the data model of a data warehouse “The Heart of the Data Warehouse”.
The quality of the data model design directly affects the data’s:
- Stability
- Ease of use
- Access efficiency
- Storage usage
- Maintenance cost
II. Data models, data layering, and ETL algorithms in the data warehouse
2.1 Overview
A data warehouse is an information system that integrates various external data sources in (quasi) real-time or in batches and provides data to end users for consumption in a variety of forms.
Facing a variety of upstream business systems, an important task of the data warehouse is to clean and integrate the data into a standardized, normalized data structure, providing a reliable foundation for subsequent, consistent data analysis.
On the other hand, the value of the data inside the warehouse needs to be released in many forms: regular reports used to understand the state of the business, KPI cockpits pushed to management, real-time data shown on large display screens, data marts serving applications, and data laboratories for analysts. To support these different modes of consumption, data needs to move from a highly consistent base model toward dimensional models that favor presentation and analysis. Data at different stages therefore needs models with different architectural characteristics, and this is why a data warehouse is layered.
Data flows between layers by being transformed from one data model to the next, and this transformation is carried out by ETL algorithms. By analogy: the data in the warehouse is the raw material, the data models are the molds that shape the product, and each data layer is a “workshop” in the factory. The conveyor line connecting the workshops is the scheduling tool that automates the whole process, the SQL client that executes the scripts is the robotic arm on the line, and the ETL algorithm is the core program that drives the arm to machine the product.
Above: the hybrid hub-and-spoke architecture from *The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling*.
2.2 Layered models in the financial industry
The financial industry places the highest demands on data warehouse model construction and is also the most mature in this respect. Over many years of data warehouse projects in the financial industry, a four-layer structure has largely taken shape: the buffer layer, the base model layer, the summary layer (common processing layer), and the mart layer. Different customers evolve these four layers in different ways, sometimes merging them into three layers, sometimes subdividing them into five or six. This article briefly introduces the most common four-layer model:
**Buffer layer:** also called the ODS layer in some projects. Simply put, the data model of this layer follows the source systems: to warehouse users it is a landing buffer for upstream systems inside the warehouse, preserving and reflecting the original production data. Data in this layer is therefore retained only for a short period, commonly 7 to 15 days. Its biggest use is to provide simple as-is access following the source system structure, for purposes such as auditing.
**Base layer:** also known as the core layer, base model layer, PDM layer, and so on. Data is divided and consolidated by subject domain, and detailed data is retained for a long period. This layer is highly integrated and is the core area of the entire data warehouse, on which all subsequent data layers are built. Data in this layer is retained for at least 13 months, and 2 to 5 years is common.
**Mart layer:** jumping to the last layer first. The data model of the mart layer carries strong business meaning and is easy for business staff to understand and use. It serves the access and query needs of department users, business users, and key management users, and typically feeds data queries from front-end portals, access by reporting tools, and exploration by data mining and analysis tools.
**Summary layer:** the summary layer is usually not built from the start. Often, after the base layer and mart layer have been established, it turns out that many statistics in the mart layer repeatedly scan and process the same base-layer data, and that the statistical calculations of marts belonging to different departments have much in common. A separate layer is therefore inserted between the two, where common summary results are computed once and shared downstream, saving computing resources for the entire system.
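To make the sharing concrete, here is a minimal PostgreSQL-style SQL sketch. All table and column names (`bas_transaction`, `smy_txn_daily`, `mkt_retail_monthly`) are hypothetical: the summary table is computed once from base-layer detail, and a mart then reads the summary instead of re-scanning the detail.

```sql
-- Build the daily summary once from base-layer detail
-- (all table and column names here are hypothetical).
INSERT INTO smy_txn_daily (acct_id, txn_date, txn_cnt, txn_amt)
SELECT acct_id, txn_date, COUNT(*), SUM(txn_amt)
  FROM bas_transaction            -- base-layer detail table
 GROUP BY acct_id, txn_date;

-- A department mart then reads the shared summary instead of
-- repeatedly scanning the much larger detail table.
INSERT INTO mkt_retail_monthly (acct_id, txn_month, txn_amt)
SELECT acct_id, DATE_TRUNC('month', txn_date), SUM(txn_amt)
  FROM smy_txn_daily
 GROUP BY acct_id, DATE_TRUNC('month', txn_date);
```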
2.3 Common ETL algorithms for the data warehouse
Although data models vary greatly across industries and business scenarios, in essence the processing from the buffer layer to the base layer is about efficiently applying incremental or full data to the base-layer tables and forming a sound chain of the data’s historical changes, while the processing from the base layer to the summary layer and on to the mart layer is about transforming data through joins, summarization, aggregation, and grouping. After long accumulation, the data transformations between layers can therefore be reduced to a fixed set of ETL algorithms, which is also why many data warehouse code generation tools on the market can generate warehouse ETL scripts automatically, without hand coding. For reasons of space, only the common ETL algorithms from the buffer layer to the base layer are listed here (a SQL sketch of two of them follows the list); the SQL scripts for each algorithm deserve a separate, detailed write-up.
1. **Full-table overwrite (A1).** Algorithm: delete all data in the target table, then insert the current source data. Source data volume: full. Scenario: only the latest snapshot of the data is needed.
2. **Upsert (A2).** Algorithm: compare source and target by primary key; update matched records and insert new ones. Source data volume: incremental or full. Scenario: only the latest status is needed and no historical trajectory is kept.
3. **History chain (A3).** Algorithm: compare the current day’s data with the previous day’s by primary key; for updated records, close the existing chain entry and open a new one dated today; for new records, open a chain entry dated today. Source data volume: incremental or full. Scenario: the historical change trajectory must be retained while deletion information is ignored, e.g. customer and account tables.
4. **Full history chain (A4).** Algorithm: compare the current day’s full data with the previous day’s records in the chain (zipper) table across all fields; records absent from today’s data are chain-closed, updated records are chain-closed and reopened today, and new records are opened today. Source data volume: full. Scenario: the historical change trajectory must be retained, with deletions inferred from the full comparison.
5. **Delta history chain with deletes (A5).** Algorithm: within the current day’s increment, close the chain for records marked as deleted by the change flag; compare updated and new records with the previous day’s data by primary key, then close and open chain entries for today as required. Source data volume: incremental. Scenario: the historical change trajectory must be retained, with deletions determined by a change flag such as CHG_CODE.
6. **Append (A6).** Algorithm: delete the current day’s (or month’s) increment from the target, then insert the current day’s (or month’s) increment. Source data volume: incremental. Scenario: flow or event data.
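As a rough illustration of what such generated scripts look like, here is a minimal SQL sketch of algorithms A2 (upsert, using standard SQL MERGE) and A3 (history chain). All table and column names (`src_customer`, `bas_customer`, `cust_id`, `start_dt`, `end_dt`) are hypothetical, and the change test is simplified to a single column; real generated code compares every tracked attribute.

```sql
-- A2 Upsert sketch: src_customer is the day's source extract,
-- bas_customer the base-layer target (names are hypothetical).
MERGE INTO bas_customer t
USING src_customer s
   ON (t.cust_id = s.cust_id)
WHEN MATCHED THEN
    UPDATE SET cust_name = s.cust_name, cust_status = s.cust_status
WHEN NOT MATCHED THEN
    INSERT (cust_id, cust_name, cust_status)
    VALUES (s.cust_id, s.cust_name, s.cust_status);

-- A3 History chain sketch: bas_customer_his keeps one row per record
-- version; open rows carry end_dt = '9999-12-31'.
-- Step 1: close the chain for records whose attributes changed today
-- (conventions for the close date vary by project).
UPDATE bas_customer_his h
   SET end_dt = CURRENT_DATE
 WHERE h.end_dt = DATE '9999-12-31'
   AND EXISTS (SELECT 1
                 FROM src_customer s
                WHERE s.cust_id = h.cust_id
                  AND s.cust_name <> h.cust_name);

-- Step 2: open a new chain entry for changed and brand-new keys,
-- i.e. every source key that no longer has an open row.
INSERT INTO bas_customer_his (cust_id, cust_name, start_dt, end_dt)
SELECT s.cust_id, s.cust_name, CURRENT_DATE, DATE '9999-12-31'
  FROM src_customer s
 WHERE NOT EXISTS (SELECT 1
                     FROM bas_customer_his h
                    WHERE h.cust_id = s.cust_id
                      AND h.end_dt = DATE '9999-12-31');
```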
III. GaussDB(DWS) and the data warehouse
Huawei GaussDB(DWS) is a distributed MPP database built on public cloud infrastructure and is mainly used for massive data analysis scenarios. The MPP database is the industry’s mainstream architecture for implementing data warehouse systems. Its main characteristic is a shared-nothing distributed architecture composed of many logical nodes (DN nodes), each with its own independent CPU, memory, and storage that is not shared with other nodes.
In such an architecture, business data is distributed across multiple nodes, and SQL is pushed down to execute as close to the data as possible, so that large-scale data processing is completed in parallel and the system responds to processing requests quickly. The shared-nothing architecture also ensures that processing capacity grows linearly as the cluster is expanded.
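For example, in GaussDB(DWS) the distribution of a table across DN nodes is declared when the table is created, using the `DISTRIBUTE BY` clause. A minimal sketch with a hypothetical fact table:

```sql
-- Hypothetical fact table: rows are hash-distributed across DN nodes
-- on acct_id, so joins and aggregations keyed on acct_id can run
-- locally on each node without redistributing data.
CREATE TABLE bas_transaction (
    txn_id   BIGINT,
    acct_id  BIGINT,
    txn_date DATE,
    txn_amt  NUMERIC(18,2)
)
DISTRIBUTE BY HASH (acct_id);
```

Choosing a distribution key that appears in frequent join and group-by conditions keeps those operations local to each DN, which is what allows processing capacity to scale with the cluster.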