Hadoop-based data warehouse

What is a data warehouse

Data warehouse is a topic-oriented, integrated, time-characteristic, stable data collection to support the decision-making process in business management

Typical applications:
- Report generation
- The data analysis
- Data mining
Other data warehouse characteristics
- Very large amount of data (over TB)
- It is a new application of database
- Fewer users
Commercial data warehouse
- Typical examples: DB2, Teradata, Vertica
- It is expensive and usually supports terabytes or less of data
Data warehouse in big data era
- It’s a huge amount of data
- Extensibility and fault tolerance are important
- Cost considerations

If you do not understand the basic concepts of data warehouse, you can refer to the article “Get to know data warehouse” before.

2 Basic architecture of data warehouse based on Hadoop

Technical means
- Hive is commonly used as a data warehouse
  - Computational scalability of very large data set designs
  - Supports HQL query – Simple and cheap to learn
  - Unified metadata management
The basic characteristics of
- Support massive data
- Multidimensional data analysis
- Fewer users
- High data latency

2.1 Hadoop-based data warehouse: first edition

advantages
- It meets the basic requirements of data warehouse
- Able to handle massive amounts of data
- The system has excellent scalability and fault tolerance
disadvantages
- Low performance and poor real-time performance

2.2 Hadoop-based data warehouse: second edition

To improve the
- Use the MPP(Presto) system to improve query performance
advantages
- It meets the basic requirements of data warehouse
- Able to handle massive amounts of data
- The system has excellent scalability and fault tolerance
- Real-time performance is good
disadvantages
- High data latency (long cycle from data generation to storage and query)

2.3 Hadoop-based Data Warehouse: Version 3 (added real-time Pipeline)

To improve the
- Spark Streaming system is used to reduce data latency
advantages
- It meets the basic requirements of data warehouse
- Able to handle massive amounts of data
- The system has excellent scalability and fault tolerance
- Real-time performance is good
- Low data latency

3 Data warehouse examples

Website report system

The basic function
- Generate reports according to service requirements
- Reports can be generated in real time or on a daily basis
The data size
- Data volume: TB
- Table number: 100+
users
- About a dozen

3.1 Collecting Data

3.2 the ETL

ETL
- Extract, Transform, Load
- You can use MapReduce, Spark, or Pig
- Storage format: row storage and column storage
Row storage versus column storage

How do I create ORC tables with compression

ETL log format (text format) is as follows:

Temporary tables (text format) are defined as follows:

	CREATE EXTERNAL TABLE tmp_logs (
		domain_id INT,
		log_time STRING,
		log_date STRING,
		log_type INT,
		uin BIGINT
	)
	ROW FORMAT DELIMITED
	FIELDS TERMINATED BY ', '
	STORED AS TEXTFILE
	LOCATION '/user/hivetest/logs';
Copy the code

Tmp_logs: LOAD DATA INPATH ‘/nginx/logs/2016011206’ OVERWRITE INTO TABLE tmp_logs;
Import temporary table data into orC format table:

	CREATE TABLE logs (
		domain_id INT,
		log_time STRING,
		log_date STRING,
		log_type INT,
		uin BIGINT
	)
	PARTITION BY(log_time STRING)
	STORED AS ORC
	tblproperties("orc.compress"="SNAPPY");

	INSERT INTO TABLE logs PARTITION(dt='the 2016-01-12-06') SELECT * FROM tmp_logs;
Copy the code

Compression algorithm

The query

	SELECT domain_id, sum(log_type) FROM logs WHERE
	log_time>'the 2016-01-12-06' GROUP BY domain_id;
Copy the code

3.3 Parameterized Reports and Visualization

Parameterized report
- Generate SQL according to customized data requirements
Visualization tool
- Echarts: echarts.baidu.com/
- D3.js: d3js.org/
- Tableau: Commercial visualization software

4 Summary

Benefits of building a data warehouse based on Hadoop
- Free open source
- Support massive data
- Peripheral tools mature
Process of building data warehouse based on Hadoop
- The data collection
- Data ETL
- Parameterized reporting and visualization

What is a data warehouse

2 Basic architecture of data warehouse based on Hadoop

2.1 Hadoop-based data warehouse: first edition

2.2 Hadoop-based data warehouse: second edition

2.3 Hadoop-based Data Warehouse: Version 3 (added real-time Pipeline)

3 Data warehouse examples

3.1 Collecting Data

3.2 the ETL

3.3 Parameterized Reports and Visualization

4 Summary

Related Posts

[JVM] Garbage collection algorithm

Dig into Lombok’s @slf4J annotation and I find it interesting.

[Dead knock Java collection] – ArrayList source analysis