What is a data warehouse
Data warehouse is a topic-oriented, integrated, time-characteristic, stable data collection to support the decision-making process in business management
-
Typical applications:
- Report generation
- The data analysis
- Data mining
-
Other data warehouse characteristics
- Very large amount of data (over TB)
- It is a new application of database
- Fewer users
-
Commercial data warehouse
- Typical examples: DB2, Teradata, Vertica
- It is expensive and usually supports terabytes or less of data
-
Data warehouse in big data era
- It’s a huge amount of data
- Extensibility and fault tolerance are important
- Cost considerations
If you do not understand the basic concepts of data warehouse, you can refer to the article “Get to know data warehouse” before.
2 Basic architecture of data warehouse based on Hadoop
-
Technical means
- Hive is commonly used as a data warehouse
- Computational scalability of very large data set designs
- Supports HQL query – Simple and cheap to learn
- Unified metadata management
- Hive is commonly used as a data warehouse
-
The basic characteristics of
- Support massive data
- Multidimensional data analysis
- Fewer users
- High data latency
2.1 Hadoop-based data warehouse: first edition
-
advantages
- It meets the basic requirements of data warehouse
- Able to handle massive amounts of data
- The system has excellent scalability and fault tolerance
-
disadvantages
- Low performance and poor real-time performance
2.2 Hadoop-based data warehouse: second edition
-
To improve the
- Use the MPP(Presto) system to improve query performance
-
advantages
- It meets the basic requirements of data warehouse
- Able to handle massive amounts of data
- The system has excellent scalability and fault tolerance
- Real-time performance is good
-
disadvantages
- High data latency (long cycle from data generation to storage and query)
2.3 Hadoop-based Data Warehouse: Version 3 (added real-time Pipeline)
-
To improve the
- Spark Streaming system is used to reduce data latency
-
advantages
- It meets the basic requirements of data warehouse
- Able to handle massive amounts of data
- The system has excellent scalability and fault tolerance
- Real-time performance is good
- Low data latency
3 Data warehouse examples
Website report system
-
The basic function
- Generate reports according to service requirements
- Reports can be generated in real time or on a daily basis
-
The data size
- Data volume: TB
- Table number: 100+
-
users
- About a dozen
3.1 Collecting Data
3.2 the ETL
-
ETL
- Extract, Transform, Load
- You can use MapReduce, Spark, or Pig
- Storage format: row storage and column storage
-
Row storage versus column storage
How do I create ORC tables with compression
- ETL log format (text format) is as follows:
- Temporary tables (text format) are defined as follows:
CREATE EXTERNAL TABLE tmp_logs (
domain_id INT,
log_time STRING,
log_date STRING,
log_type INT,
uin BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ', '
STORED AS TEXTFILE
LOCATION '/user/hivetest/logs';
Copy the code
-
Tmp_logs: LOAD DATA INPATH ‘/nginx/logs/2016011206’ OVERWRITE INTO TABLE tmp_logs;
-
Import temporary table data into orC format table:
CREATE TABLE logs (
domain_id INT,
log_time STRING,
log_date STRING,
log_type INT,
uin BIGINT
)
PARTITION BY(log_time STRING)
STORED AS ORC
tblproperties("orc.compress"="SNAPPY");
INSERT INTO TABLE logs PARTITION(dt='the 2016-01-12-06') SELECT * FROM tmp_logs;
Copy the code
- Compression algorithm
- The query
SELECT domain_id, sum(log_type) FROM logs WHERE
log_time>'the 2016-01-12-06' GROUP BY domain_id;
Copy the code
3.3 Parameterized Reports and Visualization
-
Parameterized report
- Generate SQL according to customized data requirements
-
Visualization tool
- Echarts: echarts.baidu.com/
- D3.js: d3js.org/
- Tableau: Commercial visualization software
4 Summary
-
Benefits of building a data warehouse based on Hadoop
- Free open source
- Support massive data
- Peripheral tools mature
-
Process of building data warehouse based on Hadoop
- The data collection
- Data ETL
- Parameterized reporting and visualization