1. Introduction
Hive is a data warehouse built on Hadoop. It maps structured data files to tables and provides SQL-like query capabilities: the SQL statements used for querying are converted into MapReduce jobs and submitted to Hadoop for execution.
Features:
- Simple and easy to use: it provides HQL, a query language similar to SQL, which lets people who are proficient in SQL but not in Java programming carry out big data analysis.
- Highly flexible: user-defined functions (UDFs) and custom storage formats are supported.
- Its computing and storage capabilities are designed for very large data sets, and the cluster scales out easily.
- Unified metadata management, so data can be shared with Presto, Impala, Spark SQL, and other engines.
- High execution latency: it is not suitable for real-time data processing, but it works well for offline processing of massive data sets.
2. Hive architecture
2.1 Command-line shell & Thrift/JDBC
There are two ways to manipulate data: command-line shell and Thrift/JDBC:
- Command-line shell: operate on data through the hive command line.
- Thrift/JDBC: operate on data in the standard JDBC manner over the Thrift protocol.
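Whichever access path is used, the same HQL runs underneath. A minimal sketch (the database and table names are illustrative assumptions) that could be typed into the hive shell or submitted over a Thrift/JDBC connection:

```sql
-- Illustrative HQL only; demo_db and page_view are assumed names.
SHOW DATABASES;
USE demo_db;
SELECT COUNT(*) FROM page_view;
```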
2.2 Metastore
In Hive, table names, table structures, field names, field types, and table separators are called metadata. All metadata is stored in Hive’s built-in Derby database by default, but because Derby can only have one instance, which means that multiple command-line clients cannot access it at the same time, MySQL is often used instead of Derby in real production environments.
Hive implements unified metadata management: a table created in Hive can be used directly in Presto, Impala, or Spark SQL, because they all read the same metadata from the Metastore. Likewise, a table created in Presto, Impala, or Spark SQL can be used directly in Hive.
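As a hedged sketch of what unified metadata means in practice (the table name is an assumption), a table defined once through Hive needs no separate definition in the other engines:

```sql
-- Create the table through Hive; the definition is recorded in the shared Metastore.
CREATE TABLE shared_logs (id BIGINT, msg STRING);

-- Presto, Impala, or Spark SQL can then query it directly,
-- because each engine reads the same definition from the Metastore.
SELECT id, msg FROM shared_logs LIMIT 10;
```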
2.3 HQL Execution Process
Hive executes an HQL statement through the following steps:
- Syntax parsing: ANTLR defines the grammar rules of SQL, performs lexical and syntax analysis, and transforms the SQL into an abstract syntax tree (AST);
- Semantic analysis: traverse the AST and abstract out the basic query unit, the QueryBlock;
- Generate the logical execution plan: iterate over the QueryBlocks and translate them into an OperatorTree;
- Optimize the logical execution plan: the logical-layer optimizer transforms the OperatorTree, merging unnecessary ReduceSinkOperators to reduce the amount of data shuffled;
- Generate the physical execution plan: traverse the OperatorTree and translate it into MapReduce tasks;
- Optimize the physical execution plan: the physical-layer optimizer transforms the MapReduce tasks and produces the final execution plan.
For more detail on the Hive SQL execution process, refer to the article on the compilation process of Hive SQL published by the Meituan technical team.
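To inspect the plan Hive produces for a given statement, the EXPLAIN command prints the stages described above (the table used here is the illustrative page_view table that appears later in this article):

```sql
-- Prints the execution plan: the stages and the operators in each stage.
EXPLAIN SELECT userid, COUNT(*) FROM page_view GROUP BY userid;

-- EXPLAIN EXTENDED prints additional detail about each operator.
EXPLAIN EXTENDED SELECT userid, COUNT(*) FROM page_view GROUP BY userid;
```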
3. Data types
3.1 Basic data types
Columns in Hive tables support the following basic data types:
| Category | Types |
|---|---|
| Integers | TINYINT – 1-byte signed integer; SMALLINT – 2-byte signed integer; INT – 4-byte signed integer; BIGINT – 8-byte signed integer |
| Boolean | BOOLEAN – TRUE/FALSE |
| Floating-point numbers | FLOAT – single-precision floating point; DOUBLE – double-precision floating point |
| Fixed-point numbers | DECIMAL – fixed-point number with user-defined precision, e.g. DECIMAL(7,2) |
| String types | STRING – character sequence in a specified character set; VARCHAR – character sequence with a maximum length; CHAR – fixed-length character sequence |
| Date and time types | TIMESTAMP – timestamp with nanosecond precision; TIMESTAMP WITH LOCAL TIME ZONE – time-zone-aware timestamp with nanosecond precision; DATE – date |
| Binary types | BINARY – byte sequence |
TIMESTAMP vs. TIMESTAMP WITH LOCAL TIME ZONE:
- TIMESTAMP WITH LOCAL TIME ZONE: the time submitted by the user is converted to the time zone of the database for storage, and is converted to the client's time zone when queried.
- TIMESTAMP: the submitted time is stored as-is, and queries return it without any conversion.
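A minimal sketch of the two semantics (assumes a Hive version that supports TIMESTAMP WITH LOCAL TIME ZONE, i.e. Hive 3.x; the table and column names are illustrative):

```sql
CREATE TABLE ts_demo (
  plain_ts TIMESTAMP,                       -- stored and returned exactly as written, no conversion
  local_ts TIMESTAMP WITH LOCAL TIME ZONE   -- normalized on write, rendered in the reader's time zone
);
```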
3.2 Implicit conversion
The basic data types in Hive form a type hierarchy that allows implicit conversion from a subtype to any of its ancestor types. For example, an INT value can be implicitly converted to BIGINT. In addition, the hierarchy allows STRING values to be implicitly converted to DOUBLE.
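A few hedged examples of how this shows up in queries (CAST is the explicit counterpart for conversions the hierarchy does not allow implicitly):

```sql
-- INT is implicitly widened to BIGINT when mixed with a BIGINT value.
SELECT 1 + 2147483648;

-- STRING participates in arithmetic by implicit conversion to DOUBLE.
SELECT '3.14' + 1;

-- Narrowing conversions must be explicit; an unparseable value yields NULL.
SELECT CAST('42' AS INT), CAST('not a number' AS INT);
```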
3.3 Complex Types
| Type | Description | Example |
|---|---|---|
| STRUCT | Similar to an object: a collection of fields that may have different types; a field is accessed as name.fieldname | STRUCT('xiaoming', 12, '2018-12-12') |
| MAP | A collection of key-value pairs; a value is accessed as name[key] | map('a', 1, 'b', 2) |
| ARRAY | A collection of variables of the same type and name; an element is accessed as name[index] | ARRAY('a', 'b', 'c', 'd') |
3.4 Example
Here is an example of the use of primitive and complex data types:
```sql
CREATE TABLE students(
  name    STRING,                    -- name
  age     INT,                       -- age
  subject ARRAY<STRING>,             -- subjects
  score   MAP<STRING, FLOAT>,        -- score of each subject
  address STRUCT<houseNumber:INT, street:STRING, city:STRING, province:STRING>  -- home address
) ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t";
```
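The complex columns above are read with the access syntax from section 3.3; a minimal query sketch against this table (the 'math' key is an assumption about the data):

```sql
SELECT
  name,
  subject[0]    AS first_subject,   -- ARRAY element by index
  score['math'] AS math_score,      -- MAP value by key (the key is illustrative)
  address.city  AS city             -- STRUCT field by name
FROM students;
```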
4. Content format
When data is stored in a text file, rows and columns need to be distinguished by agreed separators, as in comma-separated values (CSV) or tab-separated values (TSV) files. The drawback of those separators is that commas and tabs can also appear in ordinary field content.
Therefore, by default, Hive uses several characters that rarely appear in files. The following table lists the default row and column separators of Hive.
| Separator | Description |
|---|---|
| \n | In a text file each line is a record, so the newline character separates records |
| ^A (Ctrl+A) | Separates fields (columns); it can also be written as the octal escape \001 in CREATE TABLE statements |
| ^B | Separates elements in an ARRAY or STRUCT, and key-value pairs in a MAP; it can also be written as the octal escape \002 |
| ^C | Separates keys from values within a MAP; it can also be written as the octal escape \003 |
The following is an example:
```sql
CREATE TABLE page_view(viewTime INT, userid BIGINT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE;
```
5. Storage format
5.1 Supported Storage Formats
Hive creates a directory in HDFS for each database; each table in the database is a subdirectory of that directory, and the table's data is stored as files in the corresponding table directory. Hive supports the following file storage formats:
| Format | Description |
|---|---|
| TextFile | Stored as plain text files. This is Hive's default storage format. The data is not compressed, so disk overhead and data-parsing overhead are high. |
| SequenceFile | A binary file format provided by the Hadoop API that serializes data into a file as <key, value> pairs, using Hadoop's standard Writable interface for serialization and deserialization; it is compatible with MapFile in the Hadoop API. Hive's SequenceFile inherits from the Hadoop API's SequenceFile, but its key is empty and the actual data is stored in the value, which saves MapReduce from extra sorting work during the map phase. |
| RCFile | A Hive storage format developed by Facebook. A table is divided into several row groups, and within each row group the data is stored by column, with each column stored separately. |
| ORC Files | ORC extends RCFile to some extent and is an optimization of it. |
| Avro Files | Avro is a data serialization system designed for applications that exchange large volumes of data. Its main features: it supports binary serialization, so large amounts of data can be processed conveniently and quickly; it is friendly to dynamic languages, providing mechanisms that make it easy for them to process Avro data. |
| Parquet | A columnar storage format for analytical workloads, based on the Dremel data model and algorithm. It reduces storage space and improves IO efficiency through efficient per-column compression and special encoding techniques. |
ORC and Parquet are recommended because of their outstanding comprehensive performance and wide use.
5.2 Specifying the Storage format
The storage format is usually specified with the STORED AS clause when creating a table:
```sql
CREATE TABLE page_view(viewTime INT, userid BIGINT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE;
```
The storage file types can be specified as follows:
- STORED AS TEXTFILE
- STORED AS SEQUENCEFILE
- STORED AS ORC
- STORED AS PARQUET
- STORED AS AVRO
- STORED AS RCFILE
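For example, a minimal sketch of creating tables in the two recommended columnar formats (the table names are illustrative); no row-format delimiters are needed, since the format itself manages the layout:

```sql
CREATE TABLE page_view_orc (viewTime INT, userid BIGINT)
STORED AS ORC;

CREATE TABLE page_view_parquet (viewTime INT, userid BIGINT)
STORED AS PARQUET;
```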
6. Internal tables and external tables
An internal table is also called a managed table. A table created without any extra keyword is a managed table by default; to create an external table, the EXTERNAL keyword must be used. The main differences between internal and external tables are as follows:
| | Internal table | External table |
|---|---|---|
| Data storage location | The location is specified by the hive.metastore.warehouse.dir parameter; by default the table data is stored under the /user/hive/warehouse/database_name.db/table_name/ directory in HDFS | The location is specified by the LOCATION parameter |
| Importing data | The data is moved into the table's own data warehouse directory, and its life cycle is then managed by Hive | The data is not moved into the warehouse directory; only its location is recorded in the metadata |
| Dropping the table | Deletes both the metadata and the data files | Deletes the metadata only |
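A minimal sketch of the two kinds of table (the HDFS path and table names are assumptions):

```sql
-- Managed (internal) table: data lives under the warehouse directory
-- and is removed together with the metadata on DROP TABLE.
CREATE TABLE logs_managed (id BIGINT, msg STRING);

-- External table: Hive only records the location in the metadata;
-- DROP TABLE removes the metadata but leaves the files in place.
CREATE EXTERNAL TABLE logs_external (id BIGINT, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs/';   -- illustrative HDFS path
```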
References
- Hive Getting Started
- Hive SQL compilation process
- LanguageManual DDL
- LanguageManual Types
- Managed vs. External Tables
For more articles in this big data series, see the GitHub open-source project: Getting Started with Big Data.