What is a data warehouse

A data warehouse (DW) is a subject-oriented, integrated, nonvolatile, and time-variant collection of data used to support the management decision-making process.

The purpose of a data warehouse is to organize, understand, and use data systematically for strategic decision making across an enterprise.

Characteristics of a data warehouse

A data warehouse has the following characteristics:

(1) Subject oriented

Subjects are the areas of focus that users care about when making decisions with a data warehouse. A data warehouse gathers data from multiple business systems and organizes it around subjects (such as users, orders, or commodities), building an analytical view for each one; an operational database, by contrast, is built to support individual business functions.

(2) Integration

A data warehouse aggregates data from different source databases, but not by simple copying: the data passes through extraction, filtering, cleaning, transformation, synthesis, and other processing steps.

(3) Nonvolatile (stable)

Data warehouse data is built for enterprise analysis, so it is retained for a long time after being loaded. The data in a warehouse mostly represents snapshots of past points in time; it is used mainly for query and analysis and is rarely modified or deleted.

(4) Time-variant

A data warehouse stores historical data and periodically receives new data from operational applications, so data in a warehouse generally carries a time dimension. In effect, the warehouse records successive snapshots of the system; chained together, these snapshots let analysts replay the whole evolution of the business during data analysis.

Why use a data warehouse

Generally, the data of a data warehouse comes from various business application systems. The source data may be structured data in relational databases such as Oracle, MySQL, and SQL Server; unstructured data in flat files such as text and CSV or in Word and Excel documents; or self-describing semi-structured data such as HTML and XML. This data goes through a series of extraction, transformation, and cleaning steps before finally being loaded into the data warehouse in a unified format. The warehouse then serves as the data source for downstream systems: query and analysis systems, data marts, reporting systems, data mining systems, and so on.

Using a data warehouse has the following benefits:

  • Integrates multiple data sources into a single data store, so data can be presented through a single query engine.
  • Alleviates resource contention caused by running large analytical queries against transaction-processing databases.
  • Maintains historical data.
  • Provides a central view of the entire enterprise by consolidating data from multiple source systems.
  • Reduces or corrects bad data and improves data quality through consistent encoding and descriptions.
  • Represents organizational information consistently.
  • Provides a single, common data model for all data, regardless of source.
  • Restructures data to make it more meaningful to the business.
  • Delivers good performance on complex analytical queries without impacting operational systems.
  • Makes decision-support queries easier to develop.

The difference between data warehouse and traditional database

Although data warehouses developed out of traditional database systems, the two differ in many ways: an operational database serves day-to-day transaction processing, while a data warehouse serves analysis and decision support. The OLTP vs. OLAP comparison later in this article details these differences.

Data warehouse system composition

A data warehouse system, built around the warehouse itself, integrates data from all kinds of application systems and provides a solid, unified platform of historical data for analysis. Through data analysis and reporting modules, OLAP (online analytical processing) query and analysis tools, and decision-analysis tools, it completes the chain from information extraction to data mining and meets decision-making needs.

The whole data warehouse system is divided into four layers: the source data layer, the data storage and management layer, the OLAP server layer, and the front-end analysis tool layer.

Components of a data warehouse system:

  • Data warehouse: the core of the whole data warehouse environment. It is where data is stored and retrieved, supporting massive data volumes and fast retrieval.
  • Extraction tools: extract data from various data sources, clean and transform it, and store it in the data warehouse.
  • Metadata: data that describes the structure, location, and creation of the data in the warehouse. The warehouse is managed and used through its metadata.
  • Data mart: a subset of the complete data warehouse, separated out for a specific application purpose or scope; also called departmental data or subject data. Its purpose is to reduce the volume of data processed and make information access faster and more flexible.
  • OLAP services: provide the ability to analyze data stored in the warehouse, quickly query and aggregate complex data, and help users analyze multidimensional data along its various dimensions.
  • Front-end tools: mainly reporting tools, query tools, data analysis tools, data mining tools, and application development tools built on the data warehouse or data marts.

ETL

ETL describes the process of data extraction, cleaning and transformation, and loading. ETL integrates data and improves its quality according to uniform rules; it is the process of moving data from its sources into the target data warehouse (DW). ETL is the heart and soul of business intelligence and data warehousing.

1. Data extraction

Data extraction is the process of extracting data from various data sources and storing it in an operational data store (ODS).
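
As a minimal sketch of the extraction step, the following reads a CSV export from a hypothetical source system into an in-memory ODS staging list (all names and data here are illustrative assumptions, not from any specific tool):

```python
import csv
import io

# Hypothetical CSV export from a source business system.
SOURCE_CSV = "order_id,user_id,amount\n1001,u01,59.90\n1002,u02,120.00\n"

def extract_to_ods(csv_text):
    """Extract raw rows from a CSV source into an in-memory ODS staging list."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]

ods_rows = extract_to_ods(SOURCE_CSV)
# ods_rows[0] -> {'order_id': '1001', 'user_id': 'u01', 'amount': '59.90'}
```

In a real pipeline the staging target would be a database table or files rather than a Python list, but the shape of the step is the same: copy raw source records, untouched, into a landing area.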

2. Data cleaning and conversion

Data cleaning and transformation refers to transforming the extracted data according to pre-designed rules so that the heterogeneous formats of the source data are unified.

It comprises two processes: data cleaning and data transformation.

  • Data cleaning refers to filling null or missing values and replacing invalid data to ensure data correctness.

  • Data transformation refers to the integration, splitting and transformation of data.

    • Data integration refers to combining potentially related data of different types, typically through multi-table joins. Integration enriches the data's dimensions and helps surface more valuable information.
    • Data splitting refers to splitting one piece of data into multiple pieces according to certain rules.
    • Data transformation refers to operations such as row-column pivoting, sorting, renumbering, and removing duplicate records.
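
The cleaning and splitting steps above can be sketched in a few lines of Python; the field names and rules here are illustrative assumptions:

```python
def clean(row):
    """Data cleaning: replace null/missing values and normalize invalid data."""
    cleaned = dict(row)
    if not cleaned.get("amount"):
        cleaned["amount"] = "0"          # replace a missing amount with a default
    cleaned["amount"] = float(cleaned["amount"])
    cleaned["user_id"] = cleaned.get("user_id", "unknown").strip().lower()
    return cleaned

def split_full_name(row):
    """Data splitting: split one field into multiple fields by a rule."""
    first, _, last = row.pop("full_name", "").partition(" ")
    row["first_name"], row["last_name"] = first, last
    return row

raw = {"user_id": " U01 ", "amount": "", "full_name": "Ada Lovelace"}
record = split_full_name(clean(raw))
# record -> {'user_id': 'u01', 'amount': 0.0, 'first_name': 'Ada', 'last_name': 'Lovelace'}
```

Real ETL tools apply the same idea at scale: each rule is a small, deterministic function applied to every extracted record before loading.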

3. Data loading

Data loading refers to the loading of cleaned and transformed data into the data warehouse. Data can be loaded in the following ways:

  • Incremental load
    • Timestamp mode
    • Log table mode
    • Full-table comparison mode
  • Full load
    • Truncate the table and reinsert all rows
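
A minimal sketch of timestamp-mode incremental loading, using Python's built-in sqlite3 module (the table names and the `updated_at` column are illustrative assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source_orders (order_id INTEGER, amount REAL, updated_at TEXT);
CREATE TABLE dw_orders     (order_id INTEGER, amount REAL, updated_at TEXT);
INSERT INTO source_orders VALUES
  (1, 10.0, '2024-01-01'),
  (2, 20.0, '2024-01-05'),
  (3, 30.0, '2024-01-09');
""")

def incremental_load(conn, last_load_ts):
    """Timestamp mode: load only rows changed since the previous load."""
    conn.execute(
        "INSERT INTO dw_orders "
        "SELECT order_id, amount, updated_at FROM source_orders "
        "WHERE updated_at > ?",
        (last_load_ts,),
    )
    conn.commit()

incremental_load(conn, "2024-01-03")   # only rows updated after Jan 3 are loaded
loaded = conn.execute("SELECT COUNT(*) FROM dw_orders").fetchone()[0]
# loaded -> 2
```

The other modes differ only in how changed rows are detected: log-table mode reads a change log maintained by the source system, and full-table comparison diffs the source against the warehouse copy.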

The data mart

A data mart is a subset of the complete data warehouse: a portion of data separated from the warehouse for a specific application purpose or scope, also called departmental data or subject data. A data warehouse is composed of all of its data marts, and the marts must be coordinated to meet enterprise-wide analysis and decision-making needs.

Data marts and data warehouses are generally built following either a “top-down” or a “bottom-up” design approach.

The difference between data warehouse and data mart

  • A data warehouse provides data to individual data marts. The former is enterprise-level and large in scale; the latter is department-level and relatively small.
  • The data marts of several departments together form a data warehouse. A data mart's development cycle is short and fast, while a data warehouse's is long and slow.
  • Data in a data warehouse typically uses a normalized schema, while data in a data mart uses a star schema. In general, the granularity of data in a data warehouse is finer than in a data mart.

OLTP vs OLAP

OLTP

OLTP (online transaction processing) is one of the important applications of traditional relational databases. It handles basic, day-to-day transaction processing and demands fast response times; its strengths are intensive data updates and system reliability and efficiency.

OLTP is event-driven and application-oriented.

OLTP features:

  • Very strict response-time requirements;

  • A very large number of users, mainly operational staff;

  • Database operations are index-based;

  • Transactions are predefined and queries are simple, generally not involving multi-table joins.

OLAP

OLAP (online analytical processing) is a multidimensional analysis technology. It lets decision makers explore patterns in business activity and trends in market behavior from multiple perspectives across large volumes of business data, helping them make strategic decisions.

OLAP systems can be divided into the following data storage modes:

  • ROLAP: the multidimensional data for analysis is stored in a relational database, with a set of views defined selectively according to application requirements; the views are also stored in the relational database.
  • MOLAP: the multidimensional data for OLAP analysis is physically stored as multidimensional arrays, forming a “cube” structure.
  • HOLAP: an organic combination of MOLAP and ROLAP structures that can meet users' varied and complex analysis needs.

OLAP tools are problem-specific online data access and analysis tools; they analyze, query, and report on data in a multidimensional manner.

Multidimensional analysis applies actions such as slicing, dicing, drilling, and rotation to data organized in multidimensional form. It lets users observe the data from many angles and aspects, and so gain a deep understanding of the information it contains.

  • Drilling: change the level of a dimension, changing the granularity of analysis. Includes:
    • Drill up (roll up): generalize low-level detail data into higher-level summary data along a dimension, or remove a dimension.
    • Drill down: move from summary data down to detail data, or add a new dimension.
  • Slice and dice: after fixing a value on one dimension, examine how the measures are distributed over the remaining dimensions. If two dimensions remain, it is a slice; if three, a dice.
  • Rotation (pivot): change the orientation of dimensions, rearranging their placement in a table (e.g., swapping rows and columns).
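
Drill-up and slicing can be sketched over a small in-memory fact table; the dimensions (year, region, product) and the measure (sales) are illustrative assumptions:

```python
from collections import defaultdict

# Fact rows with three dimensions (year, region, product) and one measure (sales).
facts = [
    {"year": 2023, "region": "east", "product": "A", "sales": 100},
    {"year": 2023, "region": "west", "product": "A", "sales": 150},
    {"year": 2024, "region": "east", "product": "B", "sales": 200},
    {"year": 2024, "region": "west", "product": "B", "sales": 250},
]

def drill_up(rows, dim):
    """Drill up: aggregate the measure to a coarser granularity along one dimension."""
    totals = defaultdict(int)
    for r in rows:
        totals[r[dim]] += r["sales"]
    return dict(totals)

def slice_cube(rows, dim, value):
    """Slice: fix one dimension at a single value and keep the rest."""
    return [r for r in rows if r[dim] == value]

by_year = drill_up(facts, "year")                # -> {2023: 250, 2024: 450}
east_2024 = slice_cube(slice_cube(facts, "region", "east"), "year", 2024)
```

A MOLAP engine precomputes aggregations like `by_year` into cube cells, while a ROLAP engine would express the same operations as SQL `GROUP BY` and `WHERE` clauses over relational tables.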

Comparison between OLTP and OLAP

To summarize the points above:

  • Orientation: OLTP is application- and transaction-oriented; OLAP is analysis- and subject-oriented.
  • Users: OLTP serves very large numbers of operational staff; OLAP serves decision makers and analysts.
  • Queries: OLTP runs simple, predefined, index-based queries; OLAP runs complex, multidimensional aggregate queries.
  • Data: OLTP works on current, intensively updated detail data; OLAP works on historical, rarely modified, summarized data.
