This is the second day of my participation in Gwen Challenge
Learn a little every day, make a little progress!
1. What is a data warehouse?
You can think of a data warehouse as a granary, except that it stores data. A data warehouse (DW) is a subject-oriented, integrated, time-variant, and relatively stable (non-volatile) collection of data used to support management decision-making. It is a single data store created for analytical reporting and decision support. For businesses that need business intelligence, it provides guidance on improving business processes and on monitoring and controlling time, cost, and quality.
2. What can a data warehouse do?
1) The company sets annual sales targets, and these cannot be set arbitrarily; the decision needs to be based on historical reports.
2) Optimizing business processes also needs the help of the data warehouse. For example, completing an order on an e-commerce website involves browsing, placing the order, payment, and logistics. The logistics step may involve courier companies such as ZTO, Shentong, and Yunda. Every time a courier company delivers an order, it records the confirmed delivery time. From these delivery times we can analyze which courier company is more efficient, and so decide which couriers to keep working with and which to drop, improving the user experience.
Simply put, what a data warehouse does is collect and clean data so that it is ready for use.
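The courier comparison described above can be sketched in a few lines of Python. This is only an illustration: the order records, timestamps, and function names below are made up, and a real warehouse would run this kind of aggregation over far more data.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical order records: (courier, shipped_at, delivered_at).
# All timestamps are invented for illustration.
ORDERS = [
    ("ZTO", "2021-06-01 09:00", "2021-06-03 09:00"),
    ("ZTO", "2021-06-02 10:00", "2021-06-03 10:00"),
    ("Shentong", "2021-06-01 08:00", "2021-06-04 08:00"),
    ("Yunda", "2021-06-01 12:00", "2021-06-03 12:00"),
    ("Yunda", "2021-06-02 12:00", "2021-06-04 12:00"),
]

FMT = "%Y-%m-%d %H:%M"


def average_delivery_hours(orders):
    """Return {courier: mean hours from shipment to confirmed delivery}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for courier, shipped_at, delivered_at in orders:
        elapsed = datetime.strptime(delivered_at, FMT) - datetime.strptime(shipped_at, FMT)
        totals[courier] += elapsed.total_seconds() / 3600
        counts[courier] += 1
    return {courier: totals[courier] / counts[courier] for courier in totals}


def rank_couriers(orders):
    """Return courier names sorted from fastest to slowest average delivery."""
    averages = average_delivery_hours(orders)
    return sorted(averages, key=averages.get)


if __name__ == "__main__":
    averages = average_delivery_hours(ORDERS)
    for courier in rank_couriers(ORDERS):
        print(f"{courier}: {averages[courier]:.1f} h on average")
```

The ranking by itself does not decide which courier to drop; it is the kind of summarized evidence the decision would be based on.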
3. Characteristics of data warehouse
Feature 1: Subject-oriented. A data warehouse is organized around specific subjects; it keeps only the data related to those subjects and excludes irrelevant detail.
Feature 2: Integrated. Data from different sources is collected into a single store, which involves ETL operations.
Feature 3: Time-variant. Key data implicitly or explicitly carries a time element.
Feature 4: Non-updatable. Once data is loaded, it is generally only queried; the insert, delete, and update operations of a traditional database do not occur. Data warehouse data reflects historical data over a long period. It is a collection of database snapshots taken at different points in time, plus data derived from those snapshots through statistics, aggregation, and reorganization, rather than data from online processing.
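To make the "integrated" and "non-updatable" features more concrete, here is a minimal sketch, assuming two made-up source systems with different field names. The class `SnapshotStore` and every field name are hypothetical; the point is only that loads append dated snapshots and never modify earlier ones.

```python
# Hypothetical records from two source systems with different field names.
CRM_ROWS = [{"cust_id": 1, "cust_name": "Alice"}]
SHOP_ROWS = [{"customer": 2, "name": "Bob"}]


def integrate(crm_rows, shop_rows):
    """Map both sources onto one shared layout (the 'integrated' feature)."""
    unified = [{"customer_id": r["cust_id"], "name": r["cust_name"]} for r in crm_rows]
    unified += [{"customer_id": r["customer"], "name": r["name"]} for r in shop_rows]
    return unified


class SnapshotStore:
    """Append-only store: each load adds a dated snapshot; nothing is updated in place."""

    def __init__(self):
        self._snapshots = []  # list of (snapshot_date, rows) in load order

    def load(self, snapshot_date, rows):
        # Copy the rows so later changes in the source do not alter history.
        self._snapshots.append((snapshot_date, [dict(r) for r in rows]))

    def as_of(self, snapshot_date):
        """Return the rows of the latest snapshot taken on or before the date.

        Dates are ISO strings (YYYY-MM-DD), so string comparison orders them.
        """
        candidates = [(d, rows) for d, rows in self._snapshots if d <= snapshot_date]
        if not candidates:
            return []
        return max(candidates, key=lambda item: item[0])[1]


if __name__ == "__main__":
    store = SnapshotStore()
    store.load("2021-06-01", integrate(CRM_ROWS, SHOP_ROWS))
    new_crm = CRM_ROWS + [{"cust_id": 3, "cust_name": "Carol"}]
    store.load("2021-06-02", integrate(new_crm, SHOP_ROWS))
    print(len(store.as_of("2021-06-01")), len(store.as_of("2021-06-02")))
```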
4. Development history of data warehouse
The development of the data warehouse is fairly clear and can be roughly divided into three stages: the simple report stage, the data mart stage, and the data warehouse stage. Let's look at what happens at each stage.
1. Simple report stage
The goal of this stage is simple: produce reports to help decision-making. This stage produces the reports that business people need in their daily work, along with the simple summary data that leaders need to make decisions. The output of this stage mostly takes the form of a database plus front-end reporting tools.
2. Data mart stage
This stage is driven by ongoing business growth. Relevant business data is collected and organized according to the needs of a particular business department (mostly the general operations department or the marketing department), and cross-department multidimensional reports can be provided according to the needs of the various business departments. The purpose is still to give department leaders evidence on which to base decisions, but at this stage the reports are more highly integrated and apply to more departments.
3. Data warehouse stage
This stage is a further development of the data mart stage. Data for the entire enterprise is collected and organized according to a defined data model, and fully consistent cross-department business reports can be provided to the various business departments as needed. The data produced by the data warehouse offers guidance to the business and at the same time gives leaders comprehensive data support for decision-making.
To sum up, all stages serve the same purpose: to provide data support and help decision-making. They differ in focus, though: the important difference between building a data warehouse and building a data mart lies in the support of a data model. The data model therefore plays a decisive role in building a data warehouse.
5. The difference between database and data warehouse
For big data practitioners, the three terms database software, database, and data warehouse are all familiar, so let's go through them.
1. Database software
It is visible, operable application software at the physical layer that implements the logical functions of a database. Common examples are Oracle, MySQL, Redis, and MongoDB. We usually work with them through visual client tools such as Navicat and DBeaver.
2. Database
It is a logical concept: a store (a warehouse) for data, implemented through database software. A database consists of many tables; each table is two-dimensional and can have many fields. The fields form the columns, and the data is written into the table row by row. A table can represent multidimensional relationships in two dimensions. The popular databases on the market today are of this two-dimensional (relational) kind, for example Oracle, DB2, and MySQL.
3. Data warehouse
It is an upgrade of the database concept. Logically, there is no difference between a database and a data warehouse: both are places where data is stored through database software. In terms of data volume, however, a data warehouse is typically much larger than a database. A data warehouse is mainly used for data mining and data analysis to help leaders make decisions.
In an IT architecture, a database is required: the data has to be stored somewhere. For example, the order data from online shopping sites such as Taobao and Jingdong is stored in back-end databases. In a production environment the database does the work of data storage, and every business involved needs one.
The data warehouse is one of the technologies under BI. Because a database is tied to a business application, a single database cannot hold all of a company's data. Database tables are usually designed for a particular application. For example, if we want to analyze order details and trace an online shopping order through ordering and payment, we would need to redesign the database's table structure before doing the analysis. The data warehouse concept was introduced for this kind of data analysis and data mining: its table structure is designed according to analysis requirements, analysis dimensions, and analysis metrics.
6. Differences between OLTP and OLAP
The difference between a database and a data warehouse is really the difference between OLTP and OLAP.
1. Operational processing: also called On-Line Transaction Processing (OLTP), or transaction-oriented processing. It is the day-to-day online operation of a specific business against the database, and usually queries and modifies a few records. Users care about response time, data security, integrity, and the number of concurrent users. The traditional database system, as the main means of data management, is mainly used for operational processing.
2. Analytical processing: also called On-Line Analytical Processing (OLAP). It generally analyzes historical data on certain subjects to support management decisions, and relies on data ETL (extract-transform-load).
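The contrast can be illustrated with Python's built-in `sqlite3` module. This is only a sketch: SQLite is used here purely for convenience, and the `orders` table, its columns, and the sample rows are hypothetical. The OLTP-style function touches one record in a short transaction; the OLAP-style function scans and aggregates many records.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a small, made-up orders table."""
    conn = sqlite3.connect(":memory:")
    with conn:
        conn.execute(
            "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL, status TEXT)"
        )
        conn.executemany(
            "INSERT INTO orders (id, customer, amount, status) VALUES (?, ?, ?, ?)",
            [
                (1, "alice", 30.0, "paid"),
                (2, "bob", 20.0, "paid"),
                (3, "alice", 50.0, "created"),
            ],
        )
    return conn


def pay_order(conn, order_id):
    """OLTP-style operation: modify a single record inside a short transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("UPDATE orders SET status = 'paid' WHERE id = ?", (order_id,))


def revenue_by_customer(conn):
    """OLAP-style query: read many records and aggregate them."""
    cur = conn.execute(
        "SELECT customer, SUM(amount) FROM orders "
        "WHERE status = 'paid' GROUP BY customer ORDER BY customer"
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = make_db()
    print(revenue_by_customer(conn))
    pay_order(conn, 3)
    print(revenue_by_customer(conn))
```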
Comparison between OLTP and OLAP is shown in the following table:
7. Data warehouse architecture layering (key point)
After the groundwork laid in the previous sections, you should have a clearer understanding of the data warehouse. Next we discuss data warehouse layering, looking at both the architecture and the reasons for layering.
1. Data warehouse architecture
A data warehouse can be divided into four layers: ODS (temporary storage layer), DW (data warehouse layer), DM (data mart layer), and APP (application layer). A common data warehouse architecture diagram appears at the end of this section; first, let's look at what each layer does.
1) ODS layer
The temporary storage layer is a staging area for interface data awaiting the next processing step. Generally, data in the ODS layer has the same structure as data in the source system; the main purpose is to simplify subsequent processing. In terms of granularity, ODS data is the finest-grained. ODS tables are typically of two kinds: one stores the data currently waiting to be loaded, and the other stores historical data after processing. Historical data is generally cleared after 3-6 months to save space, but projects differ: if the source system's data volume is small, the data can be kept longer or even in full.
2) DW layer
The data warehouse layer. Data in the DW layer should be consistent, accurate, and clean, i.e. source system data after cleaning (with impurities removed). Data in this layer generally follows third normal form, and its granularity is usually the same as in the ODS layer. The DW layer keeps all historical data in the BI system, for example 10 years of data.
3) DM layer
The data mart layer. It organizes data by subject, usually in a star or snowflake structure. In terms of granularity, data at this layer is lightly summarized and no detail data exists. In terms of time span, it is usually a portion of the DW layer's data, and the main purpose is to meet user analysis needs; from an analysis perspective, users usually only need the data of recent years (such as the last three years). In terms of breadth, it still covers all business data.
4) APP layer
The application layer. Data at this layer is built entirely to meet specific analysis needs, again in a star or snowflake structure. In terms of granularity, it is highly aggregated. In terms of breadth, it does not necessarily cover all business data; it is a proper subset of DM layer data and in a sense duplicates it. In the extreme case, a model can be built in the APP layer for each report, trading space for time.
This layering is only a suggested standard. In practice the layers should be determined by the actual situation, and different types of data may be layered differently.
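The flow through the four layers can be sketched as a chain of small functions. This is a toy illustration under stated assumptions: the source rows, field names, and cleaning rules are invented, and in-memory lists stand in for what would be tables in a real warehouse.

```python
from collections import defaultdict

# Hypothetical raw rows as they might arrive from a source system.
SOURCE_ROWS = [
    {"order_id": "1", "region": " north ", "amount": "10.5"},
    {"order_id": "2", "region": "South", "amount": "4.5"},
    {"order_id": "3", "region": "north", "amount": "bad"},  # dirty row
    {"order_id": "4", "region": "NORTH", "amount": "5.0"},
]


def load_ods(source_rows):
    """ODS: copy source rows unchanged (same structure as the source system)."""
    return [dict(row) for row in source_rows]


def build_dw(ods_rows):
    """DW: clean and type the detail rows, dropping rows that cannot be parsed."""
    clean = []
    for row in ods_rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount is not a number
        clean.append({
            "order_id": int(row["order_id"]),
            "region": row["region"].strip().lower(),
            "amount": amount,
        })
    return clean


def build_dm(dw_rows):
    """DM: lightly summarize by subject (here: sales per region)."""
    summary = defaultdict(lambda: {"orders": 0, "amount": 0.0})
    for row in dw_rows:
        summary[row["region"]]["orders"] += 1
        summary[row["region"]]["amount"] += row["amount"]
    return dict(summary)


def build_app(dm_rows):
    """APP: a highly aggregated figure for one specific report."""
    return {"total_amount": sum(v["amount"] for v in dm_rows.values())}


if __name__ == "__main__":
    ods = load_ods(SOURCE_ROWS)
    dw = build_dw(ods)
    dm = build_dm(dw)
    print(dm, build_app(dm))
```

Because each layer is a separate step, a change in the cleaning rules only touches `build_dw`, and the later steps can be rerun from its output.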
2. Why layer the data warehouse?
There are three main reasons for layering a data warehouse:
1) Trade space for time: heavy pre-processing improves the user experience of the application system, so the data warehouse holds a lot of redundant data.
2) Without layering, a change to the business rules of a source system affects the entire data cleaning process, and the workload is huge.
3) Layered management simplifies data cleaning. Work that used to be one step is split across several steps, which turns one complex job into several simple ones and turns a large black box into a white box. The processing logic of each layer is relatively simple and easy to understand, so it is easier to ensure that each step is correct, and when data errors occur we often only need to adjust one step locally.
Common architecture diagrams for BI
8. Metadata
The importance of the yellow pages is evident when you need to find a local business and the services it provides. Metadata is similar to the telephone yellow pages. Below we cover metadata from three aspects: its definition, its storage, and its role.
1. Definition of metadata
Data warehouse metadata is data about the data in the data warehouse. It acts like the data dictionary of a database management system, holding information such as logical data structures, files, addresses, and indexes. Broadly speaking, metadata in a data warehouse describes the structure of the data in the warehouse and how that data was created.
Metadata is an important part of a data warehouse management system. The metadata manager is a key component of an enterprise-level data warehouse. It runs through the whole process of data warehouse construction and directly affects the construction, use, and maintenance of the data warehouse.
2. Storage mode of metadata
There are two common ways to store metadata. The first is data-set-based: each data set has its own metadata file, which contains the metadata content of that data set. The second is database-based, that is, a metadata database: a single metadata file (table) is made up of several fields, each field represents one metadata element, and each record holds the metadata content of one data set.
Each approach has advantages and disadvantages. With the first, the metadata travels as a separate file whenever the data is retrieved, so it is largely independent of the database; metadata can be retrieved using the database's own functions, and the metadata file can also be moved to another database system. The disadvantage is that, with one metadata file per data set, a large database ends up with a large number of metadata files that are inconvenient to manage.
With the second, the metadata database contains only one metadata file, which is convenient to manage: to add or delete a data set, you only add or delete the corresponding record. When retrieving the metadata for a data set, the user system must be able to accept this particular form, because what it actually gets is one record of a relational table. Overall, the metadata database approach is recommended.
A metadata database is used to store metadata, so it is best to choose a mainstream relational database management system for it. The metadata database also contains mechanisms for manipulating and querying metadata. The main benefit of establishing a metadata database is that it provides uniform data structures and business rules and makes it easy to integrate multiple data marts. At present, some enterprises tend to set up multiple data marts rather than one centralized data warehouse. In that case, before building the data warehouse or data marts, consider first establishing a metadata database that describes the data and serves application integration; this early groundwork greatly helps subsequent development and maintenance. The metadata database ensures the consistency and accuracy of data warehouse data and provides the basis for enterprise data quality management.
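The database-based approach, where each data set is one record in a metadata table, can be sketched with Python's built-in `sqlite3` module. The table name `dataset_metadata`, its columns, and the helper functions are all hypothetical; real metadata repositories track far more than this.

```python
import sqlite3


def create_metadata_db():
    """Create an in-memory metadata database with one table (hypothetical layout)."""
    conn = sqlite3.connect(":memory:")
    with conn:
        conn.execute(
            "CREATE TABLE dataset_metadata ("
            " dataset_name TEXT PRIMARY KEY,"
            " source_system TEXT,"
            " load_rule TEXT,"
            " owner TEXT)"
        )
    return conn


def register_dataset(conn, name, source_system, load_rule, owner):
    """Adding a data set means adding one record."""
    with conn:
        conn.execute(
            "INSERT INTO dataset_metadata VALUES (?, ?, ?, ?)",
            (name, source_system, load_rule, owner),
        )


def remove_dataset(conn, name):
    """Deleting a data set means deleting its record."""
    with conn:
        conn.execute("DELETE FROM dataset_metadata WHERE dataset_name = ?", (name,))


def describe_dataset(conn, name):
    """Return the metadata record as a dict, or None if the data set is unknown."""
    cur = conn.execute("SELECT * FROM dataset_metadata WHERE dataset_name = ?", (name,))
    row = cur.fetchone()
    if row is None:
        return None
    columns = [col[0] for col in cur.description]
    return dict(zip(columns, row))


if __name__ == "__main__":
    conn = create_metadata_db()
    register_dataset(conn, "dw_orders", "order_system", "daily full load", "bi_team")
    print(describe_dataset(conn, "dw_orders"))
```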
3. The role of metadata
The main functions of metadata in a data warehouse are as follows:
1) Describe what data is in the data warehouse, helping decision analysts locate its contents.
2) Define how data enters the data warehouse, serving as the guide for data summarization, mapping, and cleaning.
3) Record the schedule of the data extraction work that follows business events.
4) Record and check the system's data consistency requirements and how they are carried out.
5) Evaluate data quality.
Metadata not only defines the schema, sources, and extraction and transformation rules of the data in the data warehouse; it is also the basis on which the whole data warehouse system runs. Metadata connects the loose components of the data warehouse system into an organic whole, as shown in the figure below.
9. Star model and snowflake model
In multidimensional analysis for business intelligence solutions, common models are divided into the star model and the snowflake model according to the relationship between the fact table and the dimension tables. When designing the logical data model, decide whether the data will be organized as a star or as a snowflake.
- Star model
When all dimension tables are connected directly to the fact table, the whole diagram looks like a star, so the model is called a star model.
The star schema is a denormalized structure. Each dimension of the cube is connected directly to the fact table and no dimension is split into further levels, so the data has some redundancy. For example, if the region dimension table holds two records, city C in province B of country A and city D in province B of country A, then the information for country A and province B is stored twice, which is redundant.
- Snowflake model
When one or more dimension tables are connected to the fact table not directly but through other dimension tables, the diagram looks like a snowflake, so this is called the snowflake model. The snowflake model is an extension of the star model. It further layers the dimension tables of the star model. The original dimension tables may be expanded into small fact tables, forming local hierarchies. These decomposed tables are connected to the main dimension table instead of the fact table.
As shown in the figure, the region dimension table is decomposed into country, province, and city dimension tables. Its advantage is that it minimizes data storage and joins smaller dimension tables, which can improve query performance. The snowflake structure removes data redundancy.
Because of its data redundancy, the star model can answer many statistical queries without extra joins, so it is generally more efficient than the snowflake model. The star structure does not need to take many normalization factors into account, so it is relatively simple to design and implement. The snowflake model is not necessarily as efficient as the star model, because it removes redundancy and some statistics have to be produced by joining tables.
Normalization is also a relatively complex process, and the corresponding database structure design, data ETL, and later maintenance are all more complex. Therefore, as long as the redundancy is acceptable, the star model is used more often in practice and is more efficient.
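The trade-off can be seen in a small sketch using Python's built-in `sqlite3` module. It reuses the region example (cities C and D in province B of country A); all table and column names are hypothetical, and the fact table is shared by both variants only to keep the example short. Both queries return the same total per country, but the snowflake version needs three joins where the star version needs one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star: one denormalized region dimension joined directly to the fact table.
CREATE TABLE dim_region_star (
    region_id INTEGER PRIMARY KEY, country TEXT, province TEXT, city TEXT);
INSERT INTO dim_region_star VALUES (1, 'A', 'B', 'C'), (2, 'A', 'B', 'D');

-- Snowflake: the same dimension normalized into country -> province -> city.
CREATE TABLE dim_country (country_id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE dim_province (
    province_id INTEGER PRIMARY KEY, province TEXT,
    country_id INTEGER REFERENCES dim_country);
CREATE TABLE dim_city (
    city_id INTEGER PRIMARY KEY, city TEXT,
    province_id INTEGER REFERENCES dim_province);
INSERT INTO dim_country VALUES (1, 'A');
INSERT INTO dim_province VALUES (1, 'B', 1);
INSERT INTO dim_city VALUES (1, 'C', 1), (2, 'D', 1);

-- Fact table; region_id matches dim_region_star.region_id and dim_city.city_id.
CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY, region_id INTEGER, amount REAL);
INSERT INTO fact_sales VALUES (1, 1, 10.0), (2, 2, 15.0), (3, 1, 5.0);
""")

# Star: a single join reaches the country name.
STAR_QUERY = """
SELECT d.country, SUM(f.amount)
FROM fact_sales f
JOIN dim_region_star d ON d.region_id = f.region_id
GROUP BY d.country
"""

# Snowflake: the country name is three joins away from the fact table.
SNOWFLAKE_QUERY = """
SELECT co.country, SUM(f.amount)
FROM fact_sales f
JOIN dim_city ci ON ci.city_id = f.region_id
JOIN dim_province p ON p.province_id = ci.province_id
JOIN dim_country co ON co.country_id = p.country_id
GROUP BY co.country
"""

star_result = conn.execute(STAR_QUERY).fetchall()
snowflake_result = conn.execute(SNOWFLAKE_QUERY).fetchall()

# Country 'A' is stored once per city in the star dimension, once overall in the snowflake.
star_copies = conn.execute(
    "SELECT COUNT(*) FROM dim_region_star WHERE country = 'A'").fetchone()[0]
snowflake_copies = conn.execute(
    "SELECT COUNT(*) FROM dim_country WHERE country = 'A'").fetchone()[0]

if __name__ == "__main__":
    print(star_result, snowflake_result, star_copies, snowflake_copies)
```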
- Star model versus snowflake model
Star and snowflake models are two commonly used approaches in data warehousing, and the comparison between them is discussed from four perspectives.
1) Data optimization: The snowflake model uses normalized data, that is, the data is organized inside the database to eliminate redundancy, so it can effectively reduce the amount of data. Its business hierarchies and dimensions are stored in the data model through referential integrity. In contrast, the star model uses denormalized data. In the star model, dimensions refer directly to the fact table, and business hierarchies are not deployed through referential integrity between dimensions.
2) Business model: In the snowflake model, the business hierarchy of the data model is represented by primary key-foreign key relationships between different dimension tables. In the star model, all the necessary dimension tables are referenced only by foreign keys in the fact table.
3) Performance: The snowflake model has many joins between dimension tables and the fact table, so its performance is relatively low. For example, if you want to know the details of a user, the snowflake model has to join several tables to produce the result. The star model needs far fewer joins: to get the corresponding information, you simply join the dimension table with the fact table.
4) ETL: The snowflake model loads the data mart, so the ETL operation is more complex to design and, because of the dependencies between its tables, cannot be parallelized. The star model loads dimension tables without adding dependent models between dimensions, so ETL is relatively simple and can be highly parallelized.
Summary: The snowflake model makes it easier to analyze dimensions, for example: for a specific advertiser, which customers or companies are online? The star model is better suited to analyzing metrics, for example: for a given customer, how much revenue do they bring in?