What is a data warehouse

Data warehouse is a topic-oriented, integrated, time-characteristic, stable data collection to support the decision-making process in business management

  • Typical applications:

    • Report generation
    • The data analysis
    • Data mining
  • Other data warehouse characteristics

    • Very large amount of data (over TB)
    • It is a new application of database
    • Fewer users
  • Commercial data warehouse

    • Typical examples: DB2, Teradata, Vertica
    • It is expensive and usually supports terabytes or less of data
  • Data warehouse in big data era

    • It’s a huge amount of data
    • Extensibility and fault tolerance are important
    • Cost considerations

If you do not understand the basic concepts of data warehouse, you can refer to the article “Get to know data warehouse” before.

2 Basic architecture of data warehouse based on Hadoop

  • Technical means

    • Hive is commonly used as a data warehouse
      • Computational scalability of very large data set designs
      • Supports HQL query – Simple and cheap to learn
      • Unified metadata management
  • The basic characteristics of

    • Support massive data
    • Multidimensional data analysis
    • Fewer users
    • High data latency

2.1 Hadoop-based data warehouse: first edition

  • advantages

    • It meets the basic requirements of data warehouse
    • Able to handle massive amounts of data
    • The system has excellent scalability and fault tolerance
  • disadvantages

    • Low performance and poor real-time performance

2.2 Hadoop-based data warehouse: second edition

  • To improve the

    • Use the MPP(Presto) system to improve query performance
  • advantages

    • It meets the basic requirements of data warehouse
    • Able to handle massive amounts of data
    • The system has excellent scalability and fault tolerance
    • Real-time performance is good
  • disadvantages

    • High data latency (long cycle from data generation to storage and query)

2.3 Hadoop-based Data Warehouse: Version 3 (added real-time Pipeline)

  • To improve the

    • Spark Streaming system is used to reduce data latency
  • advantages

    • It meets the basic requirements of data warehouse
    • Able to handle massive amounts of data
    • The system has excellent scalability and fault tolerance
    • Real-time performance is good
    • Low data latency

3 Data warehouse examples

Website report system

  • The basic function

    • Generate reports according to service requirements
    • Reports can be generated in real time or on a daily basis
  • The data size

    • Data volume: TB
    • Table number: 100+
  • users

    • About a dozen

3.1 Collecting Data

3.2 the ETL

  • ETL

    • Extract, Transform, Load
    • You can use MapReduce, Spark, or Pig
    • Storage format: row storage and column storage
  • Row storage versus column storage

How do I create ORC tables with compression

  • ETL log format (text format) is as follows:

  • Temporary tables (text format) are defined as follows:
	CREATE EXTERNAL TABLE tmp_logs (
		domain_id INT,
		log_time STRING,
		log_date STRING,
		log_type INT,
		uin BIGINT
	)
	ROW FORMAT DELIMITED
	FIELDS TERMINATED BY ', '
	STORED AS TEXTFILE
	LOCATION '/user/hivetest/logs';
Copy the code
  • Tmp_logs: LOAD DATA INPATH ‘/nginx/logs/2016011206’ OVERWRITE INTO TABLE tmp_logs;

  • Import temporary table data into orC format table:

	CREATE TABLE logs (
		domain_id INT,
		log_time STRING,
		log_date STRING,
		log_type INT,
		uin BIGINT
	)
	PARTITION BY(log_time STRING)
	STORED AS ORC
	tblproperties("orc.compress"="SNAPPY");

	INSERT INTO TABLE logs PARTITION(dt='the 2016-01-12-06') SELECT * FROM tmp_logs;
Copy the code
  • Compression algorithm

  • The query
	SELECT domain_id, sum(log_type) FROM logs WHERE
	log_time>'the 2016-01-12-06' GROUP BY domain_id;
Copy the code

3.3 Parameterized Reports and Visualization

  • Parameterized report

    • Generate SQL according to customized data requirements
  • Visualization tool

    • Echarts: echarts.baidu.com/
    • D3.js: d3js.org/
    • Tableau: Commercial visualization software

4 Summary

  • Benefits of building a data warehouse based on Hadoop

    • Free open source
    • Support massive data
    • Peripheral tools mature
  • Process of building data warehouse based on Hadoop

    • The data collection
    • Data ETL
    • Parameterized reporting and visualization