
1. Hive Architecture

1.1 Introduction to Hive

Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides SQL-like query functionality.

In essence, Hive converts SQL into MapReduce/Spark jobs, with HDFS providing the underlying data storage. Hive can therefore be understood as a tool that translates SQL into MapReduce/Spark tasks, or even as a SQL client for MapReduce/Spark.
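
As a minimal illustration, the following HQL aggregation (using the students table created later in section 2.2) is the kind of statement Hive silently compiles into a MapReduce job:

-- A minimal sketch; Hive compiles this aggregation into a MapReduce (or Spark) job:
SELECT sex, COUNT(*) AS cnt
FROM students
GROUP BY sex;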

Why use Hive?

The main reasons are as follows:

  • Learning MapReduce is costly, project cycles are short, and implementing complex query logic directly in MapReduce is difficult.
  • Hive provides an operation interface with SQL-like syntax, which enables rapid development, avoids writing MapReduce code, lowers the learning cost, and makes functionality easy to extend.

Hive features:

  1. Scalable: Hive can expand cluster capacity without restarting services.
  2. Extensible: Hive supports user-defined functions, which users can implement according to their own requirements (see the sketch after this list).
  3. Fault tolerant: Hive has good fault tolerance; SQL execution can still complete even when some nodes have problems.
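
As a sketch of point 2, registering a custom function looks like this; the JAR path and the class name com.example.udf.Lower are placeholders for your own implementation:

-- Hypothetical JAR path and class name; substitute your own UDF:
ADD JAR /path/to/my_udfs.jar;
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.udf.Lower';
SELECT my_lower(name) FROM students;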

1.2 Hive Architecture

Basic components:

User interfaces include the CLI, JDBC/ODBC, and WebGUI. The command-line interface (CLI) is the Hive shell. JDBC/ODBC is Hive's Java client interface, similar to a traditional database's JDBC driver. WebGUI accesses Hive through a browser.

Metadata storage: metadata is typically stored in a relational database such as MySQL or Derby. Hive metadata includes the table name, columns, partitions and their attributes, table attributes (such as whether it is an external table), and the directory where the table data resides.

Interpreter, compiler, optimizer, and executor: these perform lexical analysis, syntax analysis, compilation, optimization, and query plan generation for HQL statements. The generated query plan is stored in HDFS and subsequently executed by MapReduce.
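
To inspect the plan the compiler and optimizer produce, Hive's EXPLAIN statement can be run on any query, for example on the aggregation shown in section 1.1:

-- Show the stages (map/reduce operators) of the generated query plan:
EXPLAIN
SELECT sex, COUNT(*) AS cnt FROM students GROUP BY sex;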

1.3 Relationship between Hive and Hadoop

Hive uses HDFS to store data and MapReduce to query and analyze it.

1.4 Comparison between Hive and Traditional Databases

Hive is designed for offline analysis of massive data. The differences from traditional databases are as follows:

  1. Query language. SQL is widely used in data warehouses, so an SQL-like query language, HQL, was designed specifically for Hive. Developers familiar with SQL can easily use Hive for development.
  2. Data storage location. Hive is built on Hadoop, and all Hive data is stored in HDFS. Databases can store data on block devices or local file systems.
  3. Data format. Hive does not define any specific data format; the format is specified by the user. A user-defined data format requires three attributes: the column separator (usually a space, '\t', or '\001'), the row separator ('\n'), and the method for reading file data (Hive's built-in file formats include TextFile, SequenceFile, and RCFile; TextFile is the default). Hive does not need to convert user data into a Hive-defined format during loading; it does not modify the data at all, but simply copies or moves it into the HDFS directory. In a database, by contrast, different storage engines define their own data formats, and all data is stored according to a fixed organization, so loading data into a database can be time-consuming (see the LOAD DATA sketch after this list).
  4. Data update. Hive is designed for data warehouse applications, where content is read much more often than it is written. Therefore, Hive does not support updating or appending to existing data; all data is determined at load time. Data in a database usually needs frequent modification, so you can use INSERT INTO ... VALUES to add data and UPDATE ... SET to modify it.
  5. Indexes. As mentioned earlier, Hive does not process or scan data during loading, so it does not build indexes on keys in the data. To access specific values that meet a condition, Hive scans the entire data set, resulting in high access latency. Thanks to MapReduce, Hive can access data in parallel, so even without indexes it still has an advantage when accessing large amounts of data. In a database, one or more columns are usually indexed, so access to small amounts of data under specific conditions is efficient and low-latency. Because of this high access latency, Hive is not suitable for online data queries.
  6. Execution. Most Hive queries are executed through MapReduce provided by Hadoop, while databases usually have their own execution engines.
  7. Execution latency. Hive has no indexes and scans whole tables, resulting in high latency. Another factor is the MapReduce framework itself: MapReduce has high startup overhead, so Hive queries executed on MapReduce also have high latency. By contrast, database execution latency is low. Of course, this holds only while the data is small; once the data size exceeds the database's processing capacity, Hive's parallel computation clearly has the advantage.
  8. Scalability. Since Hive is built on Hadoop, its scalability matches that of Hadoop (in 2009 the world's largest Hadoop cluster, at Yahoo!, was about 4,000 nodes). Databases, constrained by ACID semantics, scale far less: even Oracle, the most advanced parallel database, has a theoretical scaling limit of only about 100 nodes.
  9. Data size. Hive is built on a cluster and can use MapReduce for parallel computation, so it can support very large data sets. Databases, correspondingly, support much smaller data sizes.
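
To illustrate point 3, here is a hedged sketch of loading data into the students table defined in section 2.2; the local file path is a placeholder. Hive simply moves or copies the file into the table's HDFS directory without parsing or rewriting it:

-- Placeholder path; the file is copied into the table directory as-is:
LOAD DATA LOCAL INPATH '/tmp/students.txt'
INTO TABLE students;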

Summary: Hive looks like a SQL database, but its application scenarios are completely different. Hive is only suitable for batch statistical analysis of data.

1.5 Hive Data Storage

  1. All Hive data is stored in HDFS. There is no dedicated data storage format (TextFile, SequenceFile, RCFile, ORC, and Parquet are all supported).

SequenceFile is a Hadoop file format whose content is organized as serialized key-value (KV) pairs.

  2. When creating a table, you only need to tell Hive the column and row separators in the data, and Hive can then parse the data.

  3. Hive data models include DB, Table, External Table, Partition, and Bucket.

  • DB: corresponds to a folder under the ${hive.metastore.warehouse.dir} directory in HDFS.

  • Table: corresponds to a folder under its DB's directory in HDFS.

  • External Table: similar to Table, except that its data can be stored at any specified HDFS path.

  • Partition: corresponds to a subdirectory under the table's directory in HDFS.

  • Bucket: corresponds to multiple files under the same table (or partition) directory in HDFS; data is split among them by hashing a column.
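
A hedged sketch of how these models map to HDFS paths, assuming the default warehouse directory /user/hive/warehouse and a hypothetical database testdb with a partitioned, bucketed table logs:

-- Hypothetical layout under hive.metastore.warehouse.dir=/user/hive/warehouse:
-- DB:        /user/hive/warehouse/testdb.db
-- Table:     /user/hive/warehouse/testdb.db/logs
-- Partition: /user/hive/warehouse/testdb.db/logs/ds=20211114
-- Bucket:    /user/hive/warehouse/testdb.db/logs/ds=20211114/000000_0 ...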

2. Hive Table Types

2.1 Hive Data Types

The basic Hive data types are TINYINT, SMALLINT, INT, BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, TIMESTAMP (v0.8.0+), and BINARY (v0.8.0+).

Hive collection types include STRUCT, MAP, and ARRAY.
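
As a sketch of the collection types, the employee table assumed by the view examples in section 2.6 could be declared like this (the work_place and info columns are added purely for illustration):

-- A sketch of collection types; employee is the table queried in section 2.6:
CREATE TABLE employee(
  name         STRING,
  skills_score MAP<STRING, INT>,            -- e.g. {'DB': 80, 'Perl': 85}
  work_place   ARRAY<STRING>,               -- illustrative ARRAY column
  info         STRUCT<age:INT, sex:STRING>  -- illustrative STRUCT column
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':';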

Hive provides four table types: internal tables, external tables, partition tables, and bucket tables.

Table metadata is stored in a traditional relational database; Hive currently supports only Derby and MySQL.

2.2 Hive Internal Table

Internal tables in Hive are conceptually similar to tables in traditional databases. Each table has its own storage directory, and all of its data is stored under the ${hive.metastore.warehouse.dir}/table_name directory configured in hive-site.xml.

Create an internal table:

CREATE TABLE IF NOT EXISTS students(user_no INT, name STRING, sex STRING,
         grade STRING COMMENT 'class') COMMENT 'Student List'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
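
After creation, the table's metadata (table type, columns, and its location under the warehouse directory) can be checked directly:

-- MANAGED_TABLE and the warehouse location appear in the output:
DESC FORMATTED students;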

2.3 Hive External Table

External refers to an external table, whose data already exists in HDFS. An external table is similar to an internal table, except that when an external table is dropped, only its metadata is deleted; the data itself remains in HDFS. (Dropping an internal table deletes both.)

Create an external table:

CREATE EXTERNAL TABLE IF NOT EXISTS students(user_no INT, name STRING, sex STRING,
         class STRING COMMENT 'class') COMMENT 'Student List'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS SEQUENCEFILE
LOCATION '/usr/test/data/students';   -- LOCATION must be a directory, not a file
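
The practical difference shows up on deletion; a quick illustration:

-- Removes only the metastore entry; the files under
-- /usr/test/data/students remain in HDFS:
DROP TABLE students;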

2.4 Hive Partition Table

Each partition in a partitioned table is analogous to an index on the partition column in a database, but the organization is quite different from a traditional relational database: in Hive, each partition corresponds to a directory under the table's directory, and all the data of that partition is stored in that directory.

For example, the table partitionTable contains three partition columns: nation (country), ds (date), and city (city). The partition nation=China, ds=20130506, city=Shanghai corresponds to the HDFS directory:

/datawarehouse/partitionTable/nation=China/ds=20130506/city=Shanghai

Partition column names cannot be the same as the names of regular columns in the table.

Create a partition table:

CREATE TABLE IF NOT EXISTS students(user_no INT, name STRING, sex STRING,
         class STRING COMMENT 'class') COMMENT 'Student List'
PARTITIONED BY (ds STRING, country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS SEQUENCEFILE;
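
A hedged sketch of loading data into one partition (the local path is a placeholder); each distinct (ds, country) pair becomes its own subdirectory under the table directory:

-- Creates/fills the subdirectory .../students/ds=20130506/country=china:
LOAD DATA LOCAL INPATH '/tmp/students_cn.txt'
INTO TABLE students PARTITION (ds='20130506', country='china');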

2.5 Hive Bucket Table

A bucket table hashes a specified column and splits the data by the hash value: rows with different hash values are written to the files of their corresponding buckets.

In other words, the data is divided into buckets by the specified field, so a table's data can be spread across multiple files according to that field.

Create a bucket table:

CREATE TABLE IF NOT EXISTS students(user_no INT, name STRING, sex STRING,
         class STRING COMMENT 'class', score SMALLINT COMMENT 'total score') COMMENT 'Student List'
PARTITIONED BY (ds STRING, country STRING)
CLUSTERED BY (user_no) SORTED BY (score) INTO 32 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS SEQUENCEFILE;
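
A hedged sketch of populating the bucket table; students_raw is a hypothetical staging table, and on older Hive versions hive.enforce.bucketing must be enabled so that 32 reducers write the 32 bucket files:

-- students_raw is hypothetical; newer Hive versions enforce bucketing automatically:
SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE students PARTITION (ds='20130506', country='china')
SELECT user_no, name, sex, class, score FROM students_raw;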

2.6 Hive View

In Hive, a view is a logical data structure that simplifies queries by hiding complex operations (joins, subqueries, filters, data flattening).

Unlike in relational databases, Hive views do not store data and are not materialized. Once a Hive view is created, its schema is fixed immediately; subsequent changes to the underlying table, such as adding new columns, do not affect the view's schema, and if the underlying table is dropped or changed incompatibly, later queries against the view will fail. Given these characteristics, views should be used with caution in ETL and data warehouse pipelines on tables that change frequently.

Create a view:

CREATE VIEW employee_skills
AS
SELECT name,
       skills_score['DB'] AS DB,
       skills_score['Perl'] AS Perl,
       skills_score['Python'] AS Python,
       skills_score['Sales'] AS Sales,
       skills_score['HR'] AS HR
FROM employee;

MapReduce jobs are not triggered when a view is created because only metadata changes.

However, querying a view still triggers a MapReduce job. The SHOW CREATE TABLE or DESC FORMATTED statements can display the definition of a view created with CREATE VIEW, for example:
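
-- Print the original CREATE VIEW statement behind the view:
SHOW CREATE TABLE employee_skills;

The following are the DDL operations for Hive views.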

Change the properties of a view:

ALTER VIEW employee_skills 
SET TBLPROPERTIES ('comment' = 'This is a view');

Redefine the view:

ALTER VIEW employee_skills AS
SELECT * FROM employee;

Delete a view:

DROP VIEW employee_skills; 