1. Hive Overview

1.1 What is Hive

  • Open-sourced by Facebook to solve the problem of statistical analysis over massive structured logs
  • A data warehouse tool built on Hadoop that can map a structured data file to a table and provide SQL-like query functionality
  • Hive is just a tool: it does not store data itself, it only provides a way to manage data. It also involves no concept of distribution; it is simply a piece of software
  • Hive is essentially MapReduce: it converts SQL-like queries (HQL) into MapReduce programs

1.1.1 The HQL-to-MR conversion process

Explanation:

  1. The data Hive processes is stored in HDFS
  2. By default, Hive analyzes data with MapReduce underneath (this can be changed to Spark). Hive itself warns:
     Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  3. The generated program runs on YARN
  4. Hive is a client of Hadoop
  5. Hive is not distributed

1.2 Hive Advantages and Disadvantages

1.2.1 Advantages

  1. The operation interface uses SQL-like syntax, enabling rapid development (simple and easy to use)
  2. Avoids writing MR directly, reducing the learning cost for developers
  3. Hive has high latency (because MR has high latency), so it is commonly used for offline data analysis
  4. Hive's advantage lies in handling big data (for small data sets it is no match for MySQL and the like)
  5. Hive supports user-defined functions, so you can implement functions for your own requirements

1.2.2 Disadvantages

  1. HQL has limited expressive power (a consequence of MR)

    1. Iterative algorithms cannot be expressed
    2. Not well suited to data mining
  2. Hive is relatively inefficient

    1. The MR jobs Hive generates automatically are usually not smart enough
    2. Hive is hard to tune (tuning is limited to resources and the SQL level; the job's internal logic cannot be tuned)

1.3 Hive Architecture Principles

  1. User interface: Client

    CLI (Hive Shell), JDBC/ODBC (Java Access Hive), WebUI (Browser Access Hive)

  2. Metadata: Metastore

    This includes the table name, the database the table belongs to, the table's owner, column/partition fields, the table type, the directory where the table data resides, and so on (Hive ships with an embedded Derby database; MySQL is recommended instead)

  3. Underlying storage: HDFS

    HDFS is used for storage and MapReduce is used for computation

  4. Driver: Driver

    1. Parser (SQL Parser): converts the SQL string into an abstract syntax tree (AST) and runs syntax analysis on it, e.g. checking the SQL syntax and whether tables/fields exist
    2. Compiler (Physical Plan): compiles the AST to generate a logical execution plan
    3. Query Optimizer: optimizes the logical execution plan
    4. Execution: converts the logical execution plan into a runnable physical plan, i.e. an MR/Spark job

Hive accepts the SQL that users write through one of the interactive interfaces, uses its Driver together with the Metastore to translate the SQL into MapReduce jobs, submits them to Hadoop for execution, and returns the results to the user's interface.
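To see what the Driver makes of a statement, you can prefix it with explain (standard HiveQL; the query itself is just an illustration, reusing the student table that appears later in these notes):

explain select st_dept, count(*) from student group by st_dept;

The output lists the stages (the map/reduce work) that the Driver compiled out of the SQL.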

1.4 Hive and traditional database comparison

Hive has little in common with traditional databases beyond its SQL-like query syntax; it was designed from the start as a data warehouse.

1.4.1 Data storage location

Hive is built on Hadoop, and all Hive data is stored in HDFS, while traditional databases keep data in the local file system. As a result, Hive can handle far larger volumes of data.

1.4.2 Data update

Hive is designed for data warehouse applications, where data is written once and read many times; rewriting data in Hive is not recommended, and all data is fixed at load time. Databases, in contrast, usually require frequent inserts, deletes and updates.

1.4.3 Indexes

Hive does no processing of the data while loading it; since the data volume is huge, building indexes is not cost-effective. Hive therefore has to scan the whole data set to find rows matching a given value, so access latency is high. Thanks to MapReduce, Hive can access data in parallel, so even without indexes it copes well with very large data sets. Traditional databases typically build indexes on one or more columns, which makes access to small amounts of data fast and efficient; Hive's high latency means it is not suitable for real-time data analysis.

1.4.4 Execution engine

The execution engine for Hive is MR/Spark; traditional databases have their own execution engines

1.4.5 Scalability

Because Hadoop is highly scalable, Hive is too; the scalability of a traditional database is comparatively limited

1.4.6 Data Scale

Hive can use MapReduce for parallel computation over large-scale data; traditional databases support comparatively small data volumes

2. Getting Started with Hive

2.1 Hive installation

After configuring the metastore to use MySQL, you need to initialize it. The initialization command (the remaining installation steps are left to you):

schematool -dbType mysql -initSchema

2.2 Hive basic operation

  1. Start Hive

    [root@master hive-3.2.1]# hive
  2. View the database

    hive (hive)> show databases;
    OK
    database_name
    default
    hive
    Time taken: 0.02 seconds, Fetched: 2 row(s)

    Hive has a database named default, which is used when none is specified

  3. Switch Databases

    hive (hive)> use hive;
    OK
    Time taken: 0.031 seconds
  4. Create a table

    hive (hive)> create table if not exists tbl_1(id int,name string);
    OK
    Time taken: 0.628 seconds

    The syntax is basically the same as MySQL, except that Hive’s data types are similar to Java’s

  5. View table structure

    hive (hive)> desc tbl_1;
    OK
    col_name        data_type       comment
    id                      int
    name                    string
    Time taken: 0.084 seconds, Fetched: 2 row(s)

    # -------------------- divider --------------------
    # To check the table details
    hive (hive)> desc formatted tbl_1;
    OK
    col_name        data_type       comment
    # col_name              data_type               comment
    id                      int
    name                    string

    # Detailed Table Information
    Database:               hive
    OwnerType:              USER
    Owner:                  root
    CreateTime:             Wed Aug 26 19:55:58 CST 2020
    LastAccessTime:         UNKNOWN
    Retention:              0
    Location:               hdfs://master:9000/user/hive/warehouse/hive.db/tbl_1
    Table Type:             MANAGED_TABLE
    Table Parameters:
            COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"id\":\"true\",\"name\":\"true\"}}
            bucketing_version       2
            numFiles                0
            numRows                 0
            rawDataSize             0
            totalSize               0
            transient_lastDdlTime   1598442958

    # Storage Information
    SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
    InputFormat:            org.apache.hadoop.mapred.TextInputFormat
    OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
    Compressed:             No
    Num Buckets:            -1
    Bucket Columns:         []
    Sort Columns:           []
    Storage Desc Params:
            serialization.format    1
    Time taken: 0.154 seconds, Fetched: 32 row(s)
  6. Insert data (don’t, don’t, don’t)

    hive (hive)> insert into tbl_1 values(1,'zhangsan');
    ...
    Time taken: 84.754 seconds

    Run it once and you will understand why not.

  7. Query data

    hive (hive)> select * from tbl_1;
    OK
    tbl_1.id        tbl_1.name
    1       zhangsan
    Time taken: 0.214 seconds, Fetched: 1 row(s)
  8. Exit Hive

    hive (hive)> quit;
  9. Perform HDFS shell

    hive (hive)> dfs -ls /;
    Found 3 items
    drwxr-xr-x   - root supergroup          0 2020-07-21 15:57 /HBase
    drwx-wx-wx   - root supergroup          0 2020-07-21 18:27 /tmp
    drwxrwxrwx   - root supergroup          0 2020-07-21 18:00 /user
  10. Execute Linux shell

    hive (hive)> !pwd;
    /usr/local/soft/hive-3.2.1

2.3 Hive General Operations

hive (hive)> insert into tbl_1 values(1,'zhangsan');
...
Time taken: 84.754 seconds

Taking 84 seconds to insert a single row is obviously unworkable. The most blunt and direct way to get data into Hive is simply to put data files at the path the table uses in HDFS. But not just any file will do: you need to specify the delimiter when creating the table.

hive (hive)> create table if not exists tbl_2(id int,name string)
           > row format delimited fields terminated by '\t';
OK
Time taken: 0.118 seconds

Prepare the data

[root@master data]# cat student.txt
1	zhangsan
2	lisi
3	wangwu
4	zhaoliu
5	tianqi
[root@master data]# pwd
/usr/local/soft/hive-3.2.1/data

2.3.1 Inserting Data, Method One

Load local data into Hive

hive (hive)> load data local inpath '/usr/local/soft/hive-3.2.1/data/student.txt' into table tbl_2;
Loading data to table hive.tbl_2
OK
Time taken: 0.311 seconds
hive (hive)> select * from tbl_2;
OK
tbl_2.id        tbl_2.name
1       zhangsan
2       lisi
3       wangwu
4       zhaoliu
5       tianqi
Time taken: 0.192 seconds, Fetched: 5 row(s)

Evaluation: Convenient and recommended

2.3.2 Inserting Data, Method Two

The data Hive manages is stored in HDFS; the storage path can be set in the configuration files, and you can browse to that path to view the data. Running desc formatted table_name shows the storage path of a table.



  • The root path of Hive's storage in HDFS is under /user/hive/warehouse/
  • Each database is a folder, named database_name.db (except the default database)
  • Each table likewise corresponds to a folder, named after the table
  • Data files are placed directly in the table's folder, so under the hood LOAD calls hadoop fs -put
  • Tables in the default database are placed directly under warehouse, with the same naming scheme

Based on the rules above, there is a second way to insert data: simply put the file at the right HDFS path with the Hadoop shell.

[root@master data]# hadoop fs -put student.txt /user/hive/warehouse/hive.db/tbl_2/student_1.txt
hive (hive)> select * from tbl_2;
OK
tbl_2.id        tbl_2.name
1       zhangsan
2       lisi
3       wangwu
4       zhaoliu
5       tianqi
1       zhangsan
2       lisi
3       wangwu
4       zhaoliu
5       tianqi
Time taken: 0.396 seconds, Fetched: 10 row(s)

Summary: this also works, but you must know the table's path in HDFS, so its applicability is narrow.

Having seen how Hive stores databases, tables and data on HDFS, try loading data once more, this time from HDFS

hive (hive)> load data inpath '/student.txt' into table tbl_2;
Loading data to table hive.tbl_2
OK
Time taken: 0.683 seconds

When loading from HDFS you drop the local keyword (obviously). Looking at HDFS again, the original file is gone: the load is actually a hadoop fs -mv.

If the same data file is loaded multiple times, Hive renames the new copy before uploading it to the table's HDFS path, much like how Windows handles files with duplicate names.

2.4 Hive data types

2.4.1 Basic data types

Hive data type    Java data type    Length
tinyint           byte              1 byte
smallint          short             2 byte
int               int               4 byte
bigint            long              8 byte
boolean           boolean           true/false
float             float             single precision
double            double            double precision
string            String            string
timestamp
binary

The commonly used basic types are int, bigint, double and string, and type names are case-insensitive. boolean is usually replaced with 0/1 to save storage; string is the most used (e.g. for logs) and can in theory hold 2GB of data in one row.

2.4.2 Collection data types

Data type    Description                                      Syntax example
struct       structure; complex, unrelated data               struct<k1:v,k2:v>
map          dictionary; a set of key-value pairs             map<k,v>
array        array; a collection of values of the same type   array<v>

The difference between a struct and a map: a map stores groups of k-v pairs whose keys within one map must be distinct, and different rows may carry different keys; a struct has a fixed set of field names, so every row has the same keys in the same positions and only the values differ. Collection data types allow nesting to any depth.

2.4.3 Type Conversion

Hive supports Java-like data type conversions

Implicit conversion

  • tinyint -> smallint -> int -> bigint -> float -> double
  • A string can be implicitly converted only if it represents a number
  • boolean cannot be cast to any other type

Explicit conversion: cast

hive (hive)> select cast('1' as int);
OK
_c0
1
Time taken: 0.937 seconds, Fetched: 1 row(s)
hive (hive)> select cast('a' as int);
OK
_c0
NULL
Time taken: 0.184 seconds, Fetched: 1 row(s)

Note that a failed cast such as cast('a' as int) yields NULL rather than an error.

2.4.4 Test collection data types

You need to store data in the following format (JSON)

{" name ":" Chen xiao chun ", "friends" : [" cheng ", "Michael tse"], / / list Array "children" : {/ / key/value Map "jasper" : 3, "baby" : 1,} "address" : {/ / structure Struct "street" : "queen's road", "city" : "Hong Kong"}} {" name ":" liu ", "friends" : [" zhang ", "grandson couple"], "children" : {", a ": 8, "neon" : 6,} "address" : {" street ":" changan avenue ", "city" : "Beijing"}}

Flatten each JSON record into one line of text, dropping the unnecessary characters. Within a line, fields are separated by ',', collection elements by '_', and the key and value of a map entry by ':', giving the following format

chenxiaochun,zhengyijian_xietianhua,jasper:3_baby:1,queen's road_hongkong
liuye,zhangziyi_sunli,nuoyi:8_nina:6,chang'an avenue_beijing

Create the following table for the above data

hive (hive)> create table tbl_3(name string,friends array<string>,childress map<string,int>,address struct<street:string,city:string>)
           > row format delimited fields terminated by ','
           > collection items terminated by '_'
           > map keys terminated by ':';
OK
Time taken: 0.124 seconds

Explanation:

  • row format delimited fields terminated by ',' sets the field delimiter
  • collection items terminated by '_' sets the collection element delimiter
  • map keys terminated by ':' sets the map key-value delimiter
  • lines terminated by '\n' sets the line delimiter (default \n)

Import data test

hive (hive)> load data local inpath '/usr/local/soft/hive-3.2.1/data/test_collection' into table tbl_3;
Loading data to table hive.tbl_3
OK
Time taken: 0.281 seconds
hive (hive)> select * from tbl_3;
OK
tbl_3.name      tbl_3.friends   tbl_3.childress tbl_3.address
chenxiaochun    ["zhengyijian","xietianhua"]    {"jasper":3,"baby":1}   {"street":"queen's road","city":"hongkong"}
liuye   ["zhangziyi","sunli"]   {"nuoyi":8,"nina":6}    {"street":"chang'an avenue","city":"beijing"}
Time taken: 0.176 seconds, Fetched: 2 row(s)
hive (hive)> select name,friends[0],childress['baby'],address.street from tbl_3;
...
Time taken: 0.222 seconds, Fetched: 2 row(s)

3. DDL: Data Definition Language

3.1 Database Operation

3.1.1 Create the database

1. Method one

Create a database that is stored in HDFS by default /user/hive/warehouse/*.db

hive (default)> create database test;
OK
Time taken: 0.827 seconds

If the database already exists, an Execution Error is reported, so method two is recommended

2. Method two

To avoid the error when the database already exists, use the if not exists form

hive (default)> create database if not exists hive;
OK
Time taken: 0.029 seconds

3. Method three

Specifies where the database is stored in HDFS

create database test location '/hive/test.db';
OK
Time taken: 0.097 seconds

3.1.2 Query database

1. Show databases

List the databases

hive (hive)> show databases;
OK
database_name
default
hive
Time taken: 0.03 seconds, Fetched: 2 row(s)

Filter the listed databases with a pattern

hive (hive)> show databases like 'h*';
OK
database_name
hive
Time taken: 0.022 seconds, Fetched: 1 row(s)

2. Check the database details

Display database information

hive (hive)> desc database hive;
OK
db_name comment location        owner_name      owner_type      parameters
hive    hive test       hdfs://master:9000/user/hive/warehouse/hive.db  root    USER
Time taken: 0.049 seconds, Fetched: 1 row(s)

Displays database details

hive (hive)> desc database extended hive;
OK
db_name comment location        owner_name      owner_type      parameters
hive    hive test       hdfs://master:9000/user/hive/warehouse/hive.db  root    USER    {creator=wj}
Time taken: 0.03 seconds, Fetched: 1 row(s)

{creator=wj} is a custom property and can be used as a comment

3.1.3 Modify the database

Most information about an existing database cannot be changed, including its name and its directory location. "Modifying" a database here means modifying the key-value pairs of its dbproperties

hive (test)> alter database test set dbproperties('creator'='wj');
OK
Time taken: 0.234 seconds

3.1.4 Delete database

1. Delete empty database

hive (hive)> drop database d1;
OK
Time taken: 0.435 seconds

2. Force-delete a database

For a non-empty database, the command above fails

hive (d1)> drop database d1;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. InvalidOperationException(message:Database d1 is not empty. One or more tables exist.)

So you can use cascade to force deletion

hive (d1)> drop database d1 cascade;
OK
Time taken: 0.231 seconds

3.2 Table Operations

3.2.1 Create a table

Standard statement

create [external] table [if not exists] table_name
[(col_name data_type [comment col_comment], ...)]
[comment table_comment]
[partitioned by (col_name data_type [col_name data_type], ...)]
[clustered by (col_name, col_name)]
[row format ...]
[collection items ...]
[map keys ...]
[location hdfs_path]
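As a sketch, here is how several of these clauses combine in practice (the table and column names are made up for this illustration):

create table if not exists tbl_demo(
    id int comment 'record id',
    name string comment 'user name'
)
comment 'demo table'
partitioned by (month string)
row format delimited fields terminated by '\t';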

3.2.2 Modify a table

1. Rename a table

hive (hive)> alter table emp rename to tbl_emp;
OK
Time taken: 0.215 seconds
hive (hive)> show tables;
OK
tab_name
score
student
tbl_1
tbl_2
tbl_3
tbl_4
tbl_emp
Time taken: 0.025 seconds, Fetched: 7 row(s)

2. Add columns

hive (hive)> alter table tbl_emp add columns(emp_id int);
OK
Time taken: 0.147 seconds

3. Modify a column

This includes changing the column name and its type

hive (hive)> alter table tbl_emp change emp_id c_emp_id string;
OK
Time taken: 0.234 seconds

4. Replace columns

hive (hive)> alter table tbl_emp replace columns(s_id string,c_id string,c_grade string);
OK
Time taken: 0.157 seconds

3.2.3 Delete a table

hive (hive)> drop table tbl_emp;
OK
Time taken: 0.227 seconds

3.2.4 Internal/External Tables

1. Internal tables

Also called managed tables (MANAGED_TABLE). Because the counterpart is called an external table (EXTERNAL_TABLE), people like to call this kind the internal table. Tables are internal by default; when an internal table is dropped, both its metadata and its data in HDFS are deleted

2. External tables

The opposite of internal tables: when an external table is dropped, only the metadata is deleted and the data in HDFS is kept. If you later create a table with the same name, the data is "restored".

hive (hive)> create external table tbl_5(id int,name string);
OK
Time taken: 0.183 seconds
hive (hive)> desc formatted tbl_5;
OK
col_name        data_type       comment
# col_name              data_type               comment
id                      int
name                    string

# Detailed Table Information
Database:               hive
OwnerType:              USER
Owner:                  root
CreateTime:             Thu Aug 27 19:57:54 CST 2020
LastAccessTime:         UNKNOWN
Retention:              0
Location:               hdfs://master:9000/user/hive/warehouse/hive.db/tbl_5
Table Type:             EXTERNAL_TABLE

3. Internal and external table conversion

Internal table -> external table

hive (hive)> alter table tbl_4 set tblproperties('EXTERNAL'='TRUE');
OK
Time taken: 0.13 seconds

External table -> internal table

hive (hive)> alter table tbl_4 set tblproperties('EXTERNAL'='FALSE');
OK
Time taken: 0.13 seconds

4. Application scenarios

In production, shared data must go into external tables to prevent data loss from careless operations. For personal use, intermediate tables in a data analysis pipeline can be internal tables, which are easier to manage.

3.2.5 Partition tables

Suppose Hive has been storing data for a year and you now need the data of one particular day. With a plain where clause, Hive scans all the data in the table (folder), so the query is inefficient. Partition tables were introduced to optimize such queries (in the sense of predicate pushdown)
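For example, once the partitioned table tbl_6 built below is populated, a query that filters on the partition column reads only that partition's folder instead of the whole table:

select * from tbl_6 where month='2020-08';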

1. Create partition table

hive (hive)> create table tbl_6(id int,name string) partitioned by(month string) row format delimited fields terminated by '\t';
OK
Time taken: 0.106 seconds

Compared with an ordinary table, just add partitioned by (col_name data_type) when creating the table

2. Import data

Note that when loading data into a partition table you must specify the partition information, otherwise the load fails (Hive would not know where to put the data)

hive (hive)> load data local inpath '/usr/local/soft/hive-3.2.1/data/test_partition' into table tbl_6 partition(month='2020-08');
Loading data to table hive.tbl_6 partition (month='2020-08')
OK
Time taken: 0.782 seconds
hive (hive)> select * from tbl_6;
OK
tbl_6.id        tbl_6.name      tbl_6.month
1       tzhangsan       2020-08
2       tlisi   2020-08
3       twangwu 2020-08
Time taken: 0.141 seconds, Fetched: 3 row(s)

You can see that the partition field is automatically appended to the table's output (the original data file is untouched; Hive adds the partition value on the fly when reading the data). Next, add another partition

hive (hive)> load data local inpath '/usr/local/soft/hive-3.2.1/data/test_partition' into table tbl_6 partition(month='2020-09');
Loading data to table hive.tbl_6 partition (month='2020-09')
OK
Time taken: 0.634 seconds

Data in different partitions is placed in different folders. The benefit: when a where query filters on the partition column, Hive scans only the matching partition folder, which is far more efficient.

3. Operate partitions

Adding partitions (essentially creating folders)

hive (hive)> alter table tbl_6 add partition(month='2020-10');
OK
Time taken: 0.169 seconds

Delete partitions (essentially deleting folders)

hive (hive)> alter table tbl_6 drop partition(month='2020-10');
OK
Time taken: 0.269 seconds

Create a secondary partition

hive (hive)> create table tbl_6(id int,name string) partitioned by(month string,day string) 
row format delimited fields terminated by '\t';
OK
Time taken: 0.106 seconds

In theory and in practice there is no upper limit on the number of partition levels. Combining this with the earlier knowledge, you can also place files for a partition table directly at the corresponding HDFS path
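Note that if you create a brand-new partition directory by hand, the metastore does not yet know about it. One way to register it is msck repair table (standard HiveQL; the month value here is made up for illustration):

[root@master data]# hadoop fs -mkdir /user/hive/warehouse/hive.db/tbl_6/month=2020-11
[root@master data]# hadoop fs -put test_partition /user/hive/warehouse/hive.db/tbl_6/month=2020-11

hive (hive)> msck repair table tbl_6;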

Check the partition

hive (hive)> show partitions tbl_6;
OK
partition
month=2020-08
month=2020-09
month=2020-10
Time taken: 0.127 seconds, Fetched: 3 row(s)

4. DML: Data Manipulation Language

4.1 Data import

4.1.1 Load data to table

1. Standard syntax

load data [local] inpath path [overwrite] into table table_name [partition(p1=v1,...)]

Argument      Explanation
load data     load data
local         load from the local file system; omit it to load from HDFS
inpath path   the path of the data file to load
overwrite     overwrite existing data
into table    append the data to the table
table_name    the target table
partition     load the data into the specified partition

2. Examples

See 2.3 Hive General Operations

4.1.2 insert

Insert values into a partition table (not recommended)

hive (hive)> insert into tbl_6 partition(month='2020-07') values(4,'zhaoliu');

Basic insert mode, where data is inserted based on the result of the query

hive (hive)> insert into table tbl_5 select * from tbl_1;
hive (hive)> insert overwrite table tbl_5 select * from tbl_1;

4.1.3 create table ... as select

hive (hive)> create table tbl_7 as select * from tbl_6;

4.1.4 export

export is only used together with import; see the export/import usage in 4.2.3 below

4.2 Data Export

4.2.1 insert

hive (hive)> insert overwrite local directory '/tmp/hive' select * from tbl_6;
[root@master data]# cd /tmp/hive/
[root@master hive]# ll
-rw-r--r-- 1 root root 108 Aug 27 20:55 000000_0
[root@master hive]# cat 000000_0
1tzhangsan2020-08
2tlisi2020-08
3twangwu2020-08
1tzhangsan2020-09
2tlisi2020-09
3twangwu2020-09

The export overwrites everything in the target path, so be sure to give a path that does not exist. Also, cat shows the raw file is unfriendly to read (the default field delimiter is an invisible character), so format the exported data

hive (hive)> insert overwrite local directory '/tmp/hive' row format delimited fields terminated by '\t' select * from tbl_6;
[root@master hive]# cat 000000_0
1       tzhangsan       2020-08
2       tlisi   2020-08
3       twangwu 2020-08
1       tzhangsan       2020-09
2       tlisi   2020-09
3       twangwu 2020-09

4.2.2 hive shell

hive -e 'SQL' > file takes advantage of Linux redirection; it is not commonly used

4.2.3 export/import

Export, then import

hive (hive)> export table tbl_1 to '/hive';
OK
Time taken: 0.117 seconds
hive (hive)> truncate table tbl_1;
OK
Time taken: 0.149 seconds
hive (hive)> select * from tbl_1;
OK
tbl_1.id        tbl_1.name
Time taken: 0.141 seconds
hive (hive)> import table tbl_1 from '/hive';
Copying data from hdfs://master:9000/hive/data
Copying file: hdfs://master:9000/hive/data/000000_0
Loading data to table hive.tbl_1
OK
Time taken: 0.197 seconds

5. DQL: Data Query Language

5.1 Local mode

While learning you mostly work with small data sets, so you can switch Hive's run mode to local mode; for small inputs the following configuration shortens run times significantly

// enable local mode
set hive.exec.mode.local.auto=true;
// maximum total input size for local MR; below this value (default 128MB) local MR is used
set hive.exec.mode.local.auto.inputbytes.max=50000000;
// maximum number of input files for local MR; below this value (default 4) local MR is used
set hive.exec.mode.local.auto.input.files.max=10;

Turn off local mode

set hive.exec.mode.local.auto=false;

5.2 Basic Query

Same syntax as MySQL, basic syntax

select ... from table_name

5.2.1 Full table and field-specific queries

1. Full table query

select * from student;

2. Specific field queries

select st_id,st_name from student;

Conclusion:

  1. HQL syntax is not case-sensitive
  2. HQL can be written on one line or across several lines
  3. Keywords cannot be abbreviated or split across lines
  4. Clauses are usually written on separate lines
  5. Use indentation to keep statements readable

5.2.2 Column aliases

Basic syntax; the as keyword can be omitted

select st_id as id,st_name name from student;

5.2.3 Arithmetic Operators

Operator    Description
A+B         A plus B
A-B         A minus B
A*B         A times B
A/B         A divided by B
A%B         A modulo B
A&B         bitwise AND of A and B
A|B         bitwise OR of A and B
A^B         bitwise XOR of A and B
~A          bitwise NOT of A

5.2.4 limit statement

Basic syntax

select st_id,st_name from student limit 2;

5.2.5 where clause

Same syntax as MySQL, basic syntax

select * from student where st_age > 20;

1. Comparison operators

Operator                  Supported types      Description
A=B                       basic data types     TRUE if A equals B, otherwise FALSE
A!=B                      basic data types     TRUE if A does not equal B, otherwise FALSE
A<B                       basic data types     TRUE if A is less than B, otherwise FALSE
A<=B                      basic data types     TRUE if A is less than or equal to B, otherwise FALSE
A>B                       basic data types     TRUE if A is greater than B, otherwise FALSE
A>=B                      basic data types     TRUE if A is greater than or equal to B, otherwise FALSE
A [NOT] BETWEEN B AND C   basic data types     TRUE if A is between B and C inclusive, otherwise FALSE; NOT inverts the result
A IS NULL                 all data types       TRUE if A is NULL, otherwise FALSE
A IS NOT NULL             all data types       TRUE if A is not NULL, otherwise FALSE
A IN (v1, v2)             all data types       TRUE if A equals one of the listed values
A [NOT] LIKE B            STRING               B is a simple SQL pattern; TRUE if A matches it, otherwise FALSE. 'x%' means A begins with 'x', '%x' means A ends with 'x', and '%x%' means A contains 'x' anywhere. NOT inverts the result

2. Logical operators

Operator    Meaning
AND         logical and
OR          logical or
NOT         logical not

5.4 Group query

Same syntax as MySQL

5.4.1 group by

The group by statement is usually used with aggregate functions (count, max, min, avg, sum): it groups rows by one or more columns and then applies the aggregate to each group. Every selected field must appear either in the group by clause or inside an aggregate function

select st_dept,count(*) from student group by st_dept;

5.4.2 having

It’s used in the same way as MySQL

Differences between having and where:

  1. where filters the table's columns; having filters the columns of the query result
  2. where cannot be followed by an aggregate function; having can
  3. having can only be used after group by
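As a quick illustration (reusing the student table and columns from earlier; the 20-year threshold is arbitrary), this keeps only departments whose average age exceeds 20:

select st_dept, avg(st_age) from student group by st_dept having avg(st_age) > 20;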

5.5 Join Queries

5.5.1 Equi-joins

Hive supports only equi-joins, not non-equi-joins; usage matches MySQL syntax

5.5.2 Summary

Inner joins, left joins, right joins, full joins, multi-table joins and Cartesian products are all consistent with MySQL syntax
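A minimal equi-join sketch (emp appears later in this section; the dept table and its dname column are assumed here for illustration):

select e.*, d.dname
from emp e
join dept d on e.deptno = d.deptno;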

5.6 Sorting

Sorting in Hive differs quite a bit from MySQL. To demonstrate Hive's sorting, you first need to know this Hive setting:

hive (hive)> set mapreduce.job.reduces;
mapreduce.job.reduces=-1

This sets the number of reducers. The default -1 lets Hive plan the reducer count dynamically based on the SQL, data volume and so on; setting a value greater than 0 fixes the reducer count

5.6.1 Global sorting

This is MapReduce's global sort, surfaced in Hive as order by with a single reducer: from the MapReduce point of view, a global sort must produce a single output file, so there can be only one reducer.

1. Usage

select * from student order by age;

Ascending (asc) by default; use desc for descending order

2. Details

order by starts one reducer. If you manually set the reducer count, how many reducers actually start?

// set the number of reducers to 3
hive (test)> set mapreduce.job.reduces=3;
hive (test)> select * from student order by age;
Query ID = root_20200829092604_3b647fd2-3d10-46ac-b498-0f34941dee6a
Total jobs = 1
Launching Job 1 out of 1
...
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1

One job is submitted and one reducer starts: no matter how many reducers you configure, a global sort starts only one reducer.

3. Additional notes

Sorting by an alias is supported

select stu_id id  from student order by id;

Sorting by multiple fields is supported

hive (test)> select * from student order by age,stu_id;

5.6.2 Local sorting

In Hive, sort by gives each reducer locally ordered output; it does not guarantee a global order.

Set the number of reducers to three

set mapreduce.job.reduces=3;

The specific implementation

hive (test)> select * from emp sort by deptno desc;
...
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 3

The results are as follows

You can roughly see that there are three partitions (three reducers correspond to three partitions). You can also write the results out to files and find there are likewise three files

insert overwrite local directory '/tmp/hive' 
select * from emp sort by deptno desc;

One question: in MapReduce the default HashPartitioner partitions by the key's hash modulo the number of reducers. So how does Hive partition here, and by which field(s)? With the HQL above, Hive partitions by a random value.

Why? If Hive silently partitioned by some field when none was specified, who would be to blame when data skew appeared? So Hive distributes rows randomly to minimize data skew.

5.6.3 Partition sorting

distribute by: partitions the data the way MapReduce does (handing the chosen field to the HashPartitioner), combined with sort by to produce order within each partition

Partition by department id, sorting by salary in ascending order

hive (test)> select * from emp distribute by deptno sort by sal;

5.6.4 cluster by

When the distribute by and sort by fields are the same, cluster by can be used instead; cluster by then plays both the partitioning and the sorting role. But cluster by only sorts in ascending order; a desc or asc clause cannot be specified

cluster by col <=> distribute by col sort by col

So some people ask: what is the point of partitioning and sorting by the same field? When there is a lot of data, few partitions, and many distinct values of the field, many different values end up hashed into the same partition, and within each partition the rows are then sorted by that field.
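Concretely, the two statements below are equivalent (using the emp table from above):

select * from emp cluster by deptno;
select * from emp distribute by deptno sort by deptno;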

5.6.5 Summary

Syntax          Conclusion
group by        partitions by the field, then aggregates within each group
distribute by   only partitions by the field
order by        global sort; only one reducer is started
sort by         with the reducer count set to 1, behaves the same as order by

5.7 Bucket Tables

Be careful to distinguish bucket tables from partition tables (neither name is very friendly, in my view). A partition table only separates where data is stored, while a bucket table corresponds to MapReduce partitioning and splits the data files themselves. That is, partitioning works on the storage path, grouping files into folders, whereas bucketing splits each data file: one bucket corresponds to one file, so a data file is actually carved up.

Partitioning offers a convenient way to isolate data and optimize queries, but not every data set can be partitioned sensibly; bucketing is another technique for breaking a data set into more manageable pieces.

5.7.1 Create a bucket table

create table stu_buck(id int, name string)
clustered by(id) into 4 buckets
row format delimited fields terminated by '\t';

A bucket table uses clustered by (careful: distinguish this from the cluster by used for partition sorting) on a field that is part of the table (a partition field must not be in the table), here split into 4 buckets

View table structure

hive (test)>  desc formatted stu_buck;

Load the data into the bucket table; before loading, make sure mapreduce.job.reduces=-1 so that Hive automatically creates a number of reducers matching the table's bucket count
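(The original shows a screenshot of the load result here.) A minimal sketch of loading and inspecting, reusing the student.txt file from 2.3:

hive (test)> load data local inpath '/usr/local/soft/hive-3.2.1/data/student.txt' into table stu_buck;
hive (test)> select * from stu_buck;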



and you can roughly see that rows are assigned by the hash of the id modulo the bucket count

Older write-ups say to set hive.enforce.bucketing=true before loading; in Hive 3.1.2 this configuration is no longer available

hive (test)> set hive.enforce.bucketing;
hive.enforce.bucketing is undefined

Reading the official documentation shows that since Hive 2.x data is loaded into bucket tables directly, rather than by creating an intermediate table and inserting into the bucket table through MR

5.7.2 Applications of bucket tables

Buckets are mainly used for sampling queries: on a large data set you often want a representative sample rather than the full result, so Hive provides the tablesample clause to run sampling queries

tablesample(bucket x out of y on col)

col: the bucketing column

x: the bucket to start from

y: must be a multiple or a factor of the bucket count; it determines the sampling fraction. For example, with 4 buckets: when y=2, (4/2)=2 buckets of data are drawn; when y=4, (4/4)=1 bucket is drawn; when y=8, (4/8)=1/2 a bucket is drawn

Assuming the bucket count is z and z/y > 1, the buckets taken are x, x+y, x+2y, ...

tablesample(bucket 1 out of 2 on id)

So sampling starts from the first bucket and takes every second bucket, i.e. buckets 1 and 3

This is also why tablesample requires x <= y: the index of the last bucket taken must not exceed z, so

x + (z/y - 1) * y = x + z - y <= z  =>  x <= y

If x > y: FAILED: SemanticException [Error 10061]: Numerator should not be bigger than denominator in sample clause for table stu_buck
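Putting it all together, sampling buckets 1 and 3 of stu_buck looks like this:

hive (test)> select * from stu_buck tablesample(bucket 1 out of 2 on id);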

6. Functions

Like MySQL, Hive's functions are divided into built-in functions and user-defined functions, but the user-defined side differs a great deal from MySQL

Check the system’s built-in functions

hive (test)> show functions;

See the usage of the built-in functions

hive (test)> desc function upper;

See the usage of the built-in functions for details

hive (test)> desc function extended upper;

6.1 System built-in functions

Here are some common built-in functions; those that differ from MySQL are marked with * (there are 216 built-in functions in total)

Function            Explanation
round               round to nearest
ceil                round up
floor               round down
rand                random number between 0 and 1
lower               convert to lowercase
upper               convert to uppercase
length              string length
concat              string concatenation
concat_ws           concatenation with a given separator
collect_set         merge a field's values into a set *
substr              take a substring
trim                strip leading and trailing spaces
split               split a string
to_date             string to date
year, month, ...    extract the year, month, day... from a date
from_unixtime       timestamp to date *
unix_timestamp      date to timestamp *
case ... when ...   conditional expression
if                  conditional function
count               count (aggregate)
sum                 sum (aggregate)
min                 minimum (aggregate)
max                 maximum (aggregate)
avg                 average (aggregate)
explode             explode a collection into rows *
lateral view        used together with explode *
over                window function

6.1.1 collect_set

It is usually used with GROUP BY to collect the data of each group and encapsulate it into a SET.

Usage

hive (test)> select college,collect_set(stu_name) from student group by college;
OK
ES      ["zhengnan","songxia",...]
IS      ["lili","dengli",...]
MA      ["mayue","lixuewen",...]

This groups students by college and prints, for each college, which students it has; collect_set packs the students of the same college into a single set

6.1.2 Date related

Date to timestamp: unix_timestamp

hive (test)> select unix_timestamp('2020-08-29 14:14:00','yyyy-MM-dd HH:mm:ss');
OK
_c0
1598710440
Time taken: 0.24 seconds, Fetched: 1 row(s)

Timestamp to date: from_unixtime

hive (test)> select from_unixtime(1598710440,'yyyy-MM-dd HH:mm:ss');
OK
_c0
2020-08-29 14:14:00
Time taken: 0.146 seconds, Fetched: 1 row(s)

6.1.3 Explode functions

1.explode

explode turns one row into multiple rows, one per element; in Hive it only works on the array and map types

Used for the Array data type

hive (hive)> select * from tbl_3;
OK
tbl_3.name      tbl_3.friends   tbl_3.childress tbl_3.address
chenxiaochun    ["zhengyijian","xietianhua"]    {"jasper":3,"baby":1}   {"street":"queen's road","city":"hongkong"}
liuye   ["zhangziyi","sunli"]   {"nuoyi":8,"nina":6}    {"street":"chang'an avenue","city":"beijing"}
Time taken: 0.153 seconds, Fetched: 2 row(s)
hive (hive)> select explode(friends) as frnd from tbl_3;
OK
frnd
zhengyijian
xietianhua
zhangziyi
sunli
Time taken: 0.156 seconds, Fetched: 4 row(s)

For the Map data type

hive (hive)> select explode(childress) as (name,age) from tbl_3;
OK
name    age
jasper  3
baby    1
nuoyi   8
nina    6
Time taken: 0.133 seconds, Fetched: 4 row(s)

But the explode function has major drawbacks:

  1. It cannot reference other fields of the original table
  2. It cannot be grouped or sorted
  3. It cannot be nested inside another UDTF (user-defined table-generating function)

2.lateral view

lateral view is Hive's companion to UDTFs: it solves the problem that a UDTF cannot select additional columns. The principle is similar to a view in MySQL: the UDTF (e.g. explode) is applied to each input row to build a virtual view, and the input row is joined with its exploded rows.

Standard grammar

lateral view udtf(expression) tableAlias as columnAlias,columnAlias...

udtf(expression): the UDTF to use, e.g. explode()

tableAlias: the alias of the virtual table

columnAlias: the names of the columns of the virtual table

This resolves the shortcomings of explode()

hive (hive)> select name,friend from tbl_3 lateral view explode(friends) tmp_tbl as friend;
OK
name    friend
chenxiaochun    zhengyijian
chenxiaochun    xietianhua
liuye   zhangziyi
liuye   sunli
Time taken: 0.086 seconds, Fetched: 4 row(s)

6.2 User defined functions

Custom functions in Hive are divided into three types based on the number of lines of input and output:

  • UDF: user-defined function
  • UDAF: user-defined aggregate function
  • UDTF: user-defined table-generating function

Function type   Description
UDF             one row in, one row out, e.g. string functions
UDAF            many rows in, one row out, e.g. aggregate functions
UDTF            one row in, many rows out, e.g. explode

6.2.1 UDF

1. Programming steps

  1. Inherit org.apache.hadoop.hive.ql.exec.UDF
  2. Implement the evaluate function, which supports overloading
  3. Add a JAR package to Hive
  4. Create a function
  5. Using the function
  6. Delete function

Note: a UDF must have a return value; it may return NULL, but its return type must not be void

2. Implementation

Implement the length() function

1) Add dependencies

<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>3.1.2</version>
</dependency>

2) Create a class that inherits UDF

package hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

@SuppressWarnings("deprecation")
public class UDFLength extends UDF {

}

3) Implement the evaluate function

package hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

@SuppressWarnings("deprecation")
public class UDFLength extends UDF {
    public int evaluate(String str) {
        return str.length();
    }
}

4) Add a JAR package to Hive

Package the program as a jar and put it under hive/lib; jars in this directory are loaded automatically when Hive starts, but a jar can also be loaded manually

hive (hive)> add jar /usr/local/soft/hive-3.2.1/lib/hadoop-1.0-SNAPSHOT.jar;
Added [/usr/local/soft/hive-3.2.1/lib/hadoop-1.0-SNAPSHOT.jar] to class path
Added resources: [/usr/local/soft/hive-3.2.1/lib/hadoop-1.0-SNAPSHOT.jar]

5) Create a function

A custom function in Hive can be either temporary or permanent. A temporary function is available only in the current session and the current database.

Standard grammar

create [temporary] function fun_name as 'package.class';
hive (hive)> create temporary function getlength as 'hive.udf.UDFLength';
OK
Time taken: 0.047 seconds

6) Use functions

The logic: pass in a string and get back its length as an int

hive (hive)> select *,getlength(st_name) from student;
OK
student.st_id   student.st_name student.st_sex  student.st_age  student.st_dept _c1
10001   ...     male    20      ES      2
10002   lina    female  19      IS      2
Time taken: 0.145 seconds, Fetched: 2 row(s)
hive (hive)> desc function extended getlength;
OK
tab_name
There is no documentation for function 'getlength'
Function class:hive.udf.UDFLength
Function type:TEMPORARY
Time taken: 0.017 seconds, Fetched: 3 row(s)

7) Delete function

hive (hive)> drop function if exists getlength;
OK
Time taken: 0.648 seconds

6.2.2 GenericUDF

The UDF class is deprecated; Hive recommends GenericUDF instead, which supports more data types and is more efficient

package hive.udf;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;

public class GenericUDFLength extends GenericUDF {

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        return null;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        return null;
    }

    @Override
    public String getDisplayString(String[] children) {
        return null;
    }
}
  • initialize: initialization, argument validation, object instantiation, and so on
  • evaluate: the business logic
  • getDisplayString: displays the function's help information

So what belongs in initialize? Open GenericUDFCharacterLength, one of GenericUDF's implementation classes

@Description(name = "character_length,char_length",
    value = "_FUNC_(str | binary) - Returns the number of characters in str or binary data",
    extended = "Example:\n"
        + "  > SELECT _FUNC_('안녕하세요') FROM src LIMIT 1;\n"
        + "  5")

(This function was presumably written by a Korean.) Let's look at what its initialize returns

outputOI = PrimitiveObjectInspectorFactory.writableIntObjectInspector;
return outputOI;

From the static field of PrimitiveObjectInspectorFactory used in the return value, we can roughly tell what the function needs to return; presumably it returns an int, so our initialize can return the same inspector directly. Now look at evaluate:

@Override
public Object evaluate(DeferredObject[] arguments) throws HiveException {
    return null;
}

Compared with UDF, this evaluate is clearly superior: the DeferredObject[] parameter lets our function accept any number of arguments of any type.

package hive.udf;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.IntWritable;

public class GenericUDFLength extends GenericUDF {

    IntWritable result = new IntWritable();

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        return PrimitiveObjectInspectorFactory.writableIntObjectInspector;
    }

    @Override
    public IntWritable evaluate(DeferredObject[] arguments) throws HiveException {
        if (arguments.length > 1) {
            System.out.println("This function does not support multiple arguments for the time being");
            return null;
        }
        result.set(arguments[0].get().toString().length());
        return result;
    }

    @Override
    public String getDisplayString(String[] children) {
        return "get string length";
    }
}
hive (default)> create temporary function getlength as 'hive.udf.GenericUDFLength';
OK
Time taken: 0.519 seconds
hive (default)> select getlength('123');
OK
_c0
3
Time taken: 3.072 seconds, Fetched: 1 row(s)

There are pitfalls here: avoid Java casts, and it is recommended that the return value use Hadoop's writable data types

6.2.3 UDAF

UDAF differs from UDF in that a UDAF must accept any number of input values and return a result after computing over them, so its structure is more complex than a UDF's

1. Programming steps

  1. The function class inherits AbstractGenericUDAFResolver; the computing class implements the GenericUDAFEvaluator interface
  2. Implement the GenericUDAFEvaluator methods init, iterate, terminatePartial, merge, terminate

    1. init: initialization
    2. iterate: receives the incoming arguments and iterates internally
    3. terminatePartial: returns the partial data accumulated by iterate
    4. merge: receives terminatePartial's result and merges it
    5. terminate: returns the final result

2. Implementation
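The original leaves the implementation blank; below is a minimal sketch of a count-style UDAF following the steps above (the package, class names and the null-skipping behavior are my own choices, not from the original):

package hive.udaf;

import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.parse.SemanticException;
import org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.io.LongWritable;

public class GenericUDAFMyCount extends AbstractGenericUDAFResolver {

    @Override
    public GenericUDAFEvaluator getEvaluator(TypeInfo[] parameters) throws SemanticException {
        return new MyCountEvaluator();
    }

    public static class MyCountEvaluator extends GenericUDAFEvaluator {

        // the aggregation buffer holds the running count for one group
        static class CountAgg implements AggregationBuffer {
            long count;
        }

        private final LongWritable result = new LongWritable(0);
        private PrimitiveObjectInspector partialOI;

        @Override
        public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException {
            super.init(m, parameters);
            // in PARTIAL2/FINAL mode the input is a partial count, not a raw row
            if (m == Mode.PARTIAL2 || m == Mode.FINAL) {
                partialOI = (PrimitiveObjectInspector) parameters[0];
            }
            // both the partial and the final result are bigint
            return PrimitiveObjectInspectorFactory.writableLongObjectInspector;
        }

        @Override
        public AggregationBuffer getNewAggregationBuffer() throws HiveException {
            return new CountAgg();
        }

        @Override
        public void reset(AggregationBuffer agg) throws HiveException {
            ((CountAgg) agg).count = 0;
        }

        // iterate: consume one raw input row
        @Override
        public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException {
            if (parameters[0] != null) {
                ((CountAgg) agg).count++;
            }
        }

        // terminatePartial: hand the partial count to the next stage
        @Override
        public Object terminatePartial(AggregationBuffer agg) throws HiveException {
            return terminate(agg);
        }

        // merge: combine a partial count into this buffer
        @Override
        public void merge(AggregationBuffer agg, Object partial) throws HiveException {
            if (partial != null) {
                ((CountAgg) agg).count += PrimitiveObjectInspectorUtils.getLong(partial, partialOI);
            }
        }

        @Override
        public Object terminate(AggregationBuffer agg) throws HiveException {
            result.set(((CountAgg) agg).count);
            return result;
        }
    }
}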

6.2.4 UDTF

Implement the functionality of the split function

1. Programming steps

  1. Create a class that inherits GenericUDTF
  2. Implement the initialize, process and close methods

Looking at GenericUDTF you might think overriding just process and close is enough, but doing so throws an error, as the GenericUDTF source shows:

public StructObjectInspector initialize(StructObjectInspector argOIs)
      throws UDFArgumentException {
    List<? extends StructField> inputFields = argOIs.getAllStructFieldRefs();
    ObjectInspector[] udtfInputOIs = new ObjectInspector[inputFields.size()];
    for (int i = 0; i < inputFields.size(); i++) {
      udtfInputOIs[i] = inputFields.get(i).getFieldObjectInspector();
    }
    return initialize(udtfInputOIs);
  }
  
@Deprecated
  public StructObjectInspector initialize(ObjectInspector[] argOIs)
      throws UDFArgumentException {
    throw new IllegalStateException("Should not be called directly");
  }

The deprecated overload throws an exception directly, so we must override initialize ourselves

2. Code implementation

Likewise, see how someone else did it in GenericUDTFJSONTuple

The initialization logic is fairly easy to read: numCols is the number of formal parameters, fieldNames holds the names of the function's output columns, and fieldOIs holds the output column types. So we can write our own initialize, implementing our own logic at the same time

package hive.udtf;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Text;

import java.util.ArrayList;
import java.util.List;

public class UDTFSplit extends GenericUDTF {

    private final Text result = new Text();
    private final Object[] forwardObj = new Object[1];

    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
        List<String> fieldNames = new ArrayList<>();
        List<ObjectInspector> fieldOIs = new ArrayList<>();
        // this name is shown as the column name of the result
        fieldNames.add("result");
        // the type of the function's return value
        fieldOIs.add(PrimitiveObjectInspectorFactory.writableStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    @Override
    public void process(Object[] args) throws HiveException {
        if (args.length != 2) {
            throw new RuntimeException("the number of arguments does not match");
        }
        // split the first argument by the second
        String[] split = args[0].toString().split(args[1].toString());
        for (String value : split) {
            result.set(value);
            // write out one row
            forwardObj[0] = result;
            forward(forwardObj);
        }
    }

    @Override
    public void close() throws HiveException {
        // nothing to clean up
    }
}

Usage

hive (default)> select udtfsplit('hello_world_hello_hive','_');
OK
result
hello
world
hello
hive
Time taken: 2.406 seconds, Fetched: 4 row(s)

7. Enterprise Tuning

To be honest, you cannot really write about Hive tuning without a few years of work experience, yet interviews love to ask about it. In practice not much of it actually comes up. Why? For most people, managing to write the SQL that meets the requirement is already cause for gratitude; what tuning can you talk about when the requirement itself is barely implemented?

  • [ ] Fetch
  • [x] Local mode
  • [ ] Table optimization
  • [ ] MR optimization
  • [ ] Parallel execution
  • [ ] Strict mode
  • [ ] JVM reuse
  • [ ] Speculative execution
  • [ ] Compression