1. Introduction
Hive is a data warehouse built on Hadoop. It maps structured data files to tables and provides SQL-like query capabilities: the SQL statements used for querying are converted into MapReduce jobs and submitted to Hadoop for execution.
Features:
- Simple and easy to use: it provides HQL, a query language similar to SQL, which lets people who are proficient in SQL but not in Java programming carry out big data analysis.
- Highly flexible: user-defined functions (UDFs) and custom storage formats are supported.
- Its computing and storage capabilities are designed for very large data sets, and the cluster scales out easily.
- Unified metadata management, so data can be shared with Presto, Impala, Spark SQL, and other engines.
- High execution latency: it is not suitable for real-time data processing, but it works well for offline processing of massive data sets.
2. Hive architecture
2.1 Command-line shell & Thrift/JDBC
There are two ways to manipulate data: command-line shell and Thrift/JDBC:
- Command-line shell: operate on data through the hive command line.
- Thrift/JDBC: operate on data in the standard JDBC manner over the Thrift protocol.
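Whichever access path is used, the same HQL runs underneath. A minimal sketch (the database and table names are illustrative assumptions) that could be typed into the hive shell or submitted over a Thrift/JDBC connection:

```sql
-- Illustrative HQL only; demo_db and page_view are assumed names.
SHOW DATABASES;
USE demo_db;
SELECT COUNT(*) FROM page_view;
```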
2.2 Metastore
In Hive, table names, table structures, field names, field types, and table separators are called metadata. All metadata is stored in Hive’s built-in Derby database by default, but because Derby can only have one instance, which means that multiple command-line clients cannot access it at the same time, MySQL is often used instead of Derby in real production environments.
Hive implements unified metadata management: a table created in Hive can be used directly in Presto, Impala, or Spark SQL, because they all read the same metadata from the Metastore. Likewise, a table created in Presto, Impala, or Spark SQL can be used directly in Hive.
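As a hedged sketch of what unified metadata means in practice (the table name is an assumption), a table defined once through Hive needs no separate definition in the other engines:

```sql
-- Create the table through Hive; the definition is recorded in the shared Metastore.
CREATE TABLE shared_logs (id BIGINT, msg STRING);

-- Presto, Impala, or Spark SQL can then query it directly,
-- because each engine reads the same definition from the Metastore.
SELECT id, msg FROM shared_logs LIMIT 10;
```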
2.3 HQL Execution Process
Hive executes an HQL statement through the following steps:
- Syntax parsing: ANTLR defines the grammar rules of SQL, performs lexical and syntax analysis, and transforms the SQL into an abstract syntax tree (AST);
- Semantic analysis: traverse the AST and abstract out the basic query unit, the QueryBlock;
- Generate the logical execution plan: iterate over the QueryBlocks and translate them into an OperatorTree;
- Optimize the logical execution plan: the logical-layer optimizer transforms the OperatorTree, merging unnecessary ReduceSinkOperators to reduce the amount of data shuffled;
- Generate the physical execution plan: traverse the OperatorTree and translate it into MapReduce tasks;
- Optimize the physical execution plan: the physical-layer optimizer transforms the MapReduce tasks and produces the final execution plan.
For more detail on the Hive SQL execution process, refer to the article on the compilation process of Hive SQL published by the Meituan technical team.
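To inspect the plan Hive produces for a given statement, the EXPLAIN command prints the stages described above (the table used here is the illustrative page_view table that appears later in this article):

```sql
-- Prints the execution plan: the stages and the operators in each stage.
EXPLAIN SELECT userid, COUNT(*) FROM page_view GROUP BY userid;

-- EXPLAIN EXTENDED prints additional detail about each operator.
EXPLAIN EXTENDED SELECT userid, COUNT(*) FROM page_view GROUP BY userid;
```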
3. Data types
3.1 Basic data types
Columns in Hive tables support the following basic data types:
| Category | Types |
|---|---|
| Integers | TINYINT – 1-byte signed integer; SMALLINT – 2-byte signed integer; INT – 4-byte signed integer; BIGINT – 8-byte signed integer |
| Boolean | BOOLEAN – TRUE/FALSE |
| Floating-point numbers | FLOAT – single-precision floating point; DOUBLE – double-precision floating point |
| Fixed-point numbers | DECIMAL – fixed-point number with user-defined precision, e.g. DECIMAL(7,2) |
| String types | STRING – character sequence in a specified character set; VARCHAR – character sequence with a maximum length; CHAR – fixed-length character sequence |
| Date and time types | TIMESTAMP – timestamp with nanosecond precision; TIMESTAMP WITH LOCAL TIME ZONE – time-zone-aware timestamp with nanosecond precision; DATE – date |
| Binary types | BINARY – byte sequence |
TIMESTAMP vs. TIMESTAMP WITH LOCAL TIME ZONE:
- TIMESTAMP WITH LOCAL TIME ZONE: the time submitted by the user is converted to the time zone of the database for storage, and is converted to the client's time zone when queried.
- TIMESTAMP: the submitted time is stored as-is, and queries return it without any conversion.
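A minimal sketch of the two semantics (assumes a Hive version that supports TIMESTAMP WITH LOCAL TIME ZONE, i.e. Hive 3.x; the table and column names are illustrative):

```sql
CREATE TABLE ts_demo (
  plain_ts TIMESTAMP,                       -- stored and returned exactly as written, no conversion
  local_ts TIMESTAMP WITH LOCAL TIME ZONE   -- normalized on write, rendered in the reader's time zone
);
```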
3.2 Implicit conversion
The basic data types in Hive form a type hierarchy that allows implicit conversion from a subtype to any of its ancestor types. For example, an INT value can be implicitly converted to BIGINT. In addition, the hierarchy allows STRING values to be implicitly converted to DOUBLE.
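A few hedged examples of how this shows up in queries (CAST is the explicit counterpart for conversions the hierarchy does not allow implicitly):

```sql
-- INT is implicitly widened to BIGINT when mixed with a BIGINT value.
SELECT 1 + 2147483648;

-- STRING participates in arithmetic by implicit conversion to DOUBLE.
SELECT '3.14' + 1;

-- Narrowing conversions must be explicit; an unparseable value yields NULL.
SELECT CAST('42' AS INT), CAST('not a number' AS INT);
```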
3.3 Complex Types
| Type | Description | Example |
|---|---|---|
| STRUCT | Similar to an object: a collection of fields that may have different types; a field is accessed as name.fieldname | STRUCT('xiaoming', 12, '2018-12-12') |
| MAP | A collection of key-value pairs; a value is accessed as name[key] | map('a', 1, 'b', 2) |
| ARRAY | A collection of variables of the same type and name; an element is accessed as name[index] | ARRAY('a', 'b', 'c', 'd') |
3.4 Example
Here is an example of the use of primitive and complex data types:
```sql
CREATE TABLE students(
  name    STRING,                    -- name
  age     INT,                       -- age
  subject ARRAY<STRING>,             -- subjects
  score   MAP<STRING, FLOAT>,        -- score of each subject
  address STRUCT<houseNumber:INT, street:STRING, city:STRING, province:STRING>  -- home address
) ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t";
```
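The complex columns above are read with the access syntax from section 3.3; a minimal query sketch against this table (the 'math' key is an assumption about the data):

```sql
SELECT
  name,
  subject[0]    AS first_subject,   -- ARRAY element by index
  score['math'] AS math_score,      -- MAP value by key (the key is illustrative)
  address.city  AS city             -- STRUCT field by name
FROM students;
```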
4. Content format
When data is stored in a text file, rows and columns need to be distinguished by agreed separators, as in comma-separated values (CSV) or tab-separated values (TSV) files. The drawback of those separators is that commas and tabs can also appear in ordinary field content.
Therefore, by default, Hive uses several characters that rarely appear in files. The following table lists the default row and column separators of Hive.
| Separator | Description |
|---|---|
| \n | In a text file each line is a record, so the newline character separates records |
| ^A (Ctrl+A) | Separates fields (columns); it can also be written as the octal escape \001 in CREATE TABLE statements |
| ^B | Separates elements in an ARRAY or STRUCT, and key-value pairs in a MAP; it can also be written as the octal escape \002 |
| ^C | Separates keys from values within a MAP; it can also be written as the octal escape \003 |
The following is an example:
```sql
CREATE TABLE page_view(viewTime INT, userid BIGINT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE;
```
5. Storage format
5.1 Supported Storage Formats
Hive creates a directory in HDFS for each database; each table in the database is a subdirectory of that directory, and the table's data is stored as files in the corresponding table directory. Hive supports the following file storage formats:
| Format | Description |
|---|---|
| TextFile | Stored as plain text files. This is Hive's default storage format. The data is not compressed, so disk overhead and data-parsing overhead are high. |
| SequenceFile | A binary file format provided by the Hadoop API that serializes data into a file as <key, value> pairs, using Hadoop's standard Writable interface for serialization and deserialization; it is compatible with MapFile in the Hadoop API. Hive's SequenceFile inherits from the Hadoop API's SequenceFile, but its key is empty and the actual data is stored in the value, which saves MapReduce from extra sorting work during the map phase. |
| RCFile | A Hive storage format developed by Facebook. A table is divided into several row groups, and within each row group the data is stored by column, with each column stored separately. |
| ORC Files | ORC extends RCFile to some extent and is an optimization of it. |
| Avro Files | Avro is a data serialization system designed for applications that exchange large volumes of data. Its main features: it supports binary serialization, so large amounts of data can be processed conveniently and quickly; it is friendly to dynamic languages, providing mechanisms that make it easy for them to process Avro data. |
| Parquet | A columnar storage format for analytical workloads, based on the Dremel data model and algorithm. It reduces storage space and improves IO efficiency through efficient per-column compression and special encoding techniques. |
ORC and Parquet are recommended because of their outstanding comprehensive performance and wide use.
5.2 Specifying the Storage format
The storage format is usually specified with the STORED AS clause when creating a table:
```sql
CREATE TABLE page_view(viewTime INT, userid BIGINT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE;
```
The storage file types can be specified as follows:
- STORED AS TEXTFILE
- STORED AS SEQUENCEFILE
- STORED AS ORC
- STORED AS PARQUET
- STORED AS AVRO
- STORED AS RCFILE
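For example, a minimal sketch of creating tables in the two recommended columnar formats (the table names are illustrative); no row-format delimiters are needed, since the format itself manages the layout:

```sql
CREATE TABLE page_view_orc (viewTime INT, userid BIGINT)
STORED AS ORC;

CREATE TABLE page_view_parquet (viewTime INT, userid BIGINT)
STORED AS PARQUET;
```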
6. Internal tables and external tables
An internal table is also called a managed table. A table created without any extra keyword is a managed table by default; to create an external table, the EXTERNAL keyword must be used. The main differences between internal and external tables are as follows:
| | Internal table | External table |
|---|---|---|
| Data storage location | The location is specified by the hive.metastore.warehouse.dir parameter; by default the table data is stored under the /user/hive/warehouse/database_name.db/table_name/ directory in HDFS | The location is specified by the LOCATION parameter |
| Importing data | The data is moved into the table's own data warehouse directory, and its life cycle is then managed by Hive | The data is not moved into the warehouse directory; only its location is recorded in the metadata |
| Dropping the table | Deletes both the metadata and the data files | Deletes the metadata only |
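A minimal sketch of the two kinds of table (the HDFS path and table names are assumptions):

```sql
-- Managed (internal) table: data lives under the warehouse directory
-- and is removed together with the metadata on DROP TABLE.
CREATE TABLE logs_managed (id BIGINT, msg STRING);

-- External table: Hive only records the location in the metadata;
-- DROP TABLE removes the metadata but leaves the files in place.
CREATE EXTERNAL TABLE logs_external (id BIGINT, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs/';   -- illustrative HDFS path
```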
References
- Hive Getting Started
- Hive SQL compilation process
- LanguageManual DDL
- LanguageManual Types
- Managed vs. External Tables
For more articles in this big data series, see the GitHub open-source project: Getting Started with Big Data.