1. Hive Overview
1.1 What is Hive
- Open-sourced by Facebook to solve the analysis of massive volumes of structured logs
- A Hadoop-based data warehouse tool that can map structured data files onto a table and provide SQL-like query functionality
- Hive is just a tool: it does not store data itself but provides a way of managing it, and it does not itself involve the concept of distribution. It is just software
- Hive is essentially MapReduce: it converts SQL-like queries (HQL) into MapReduce programs
1.1.1 How HQL is converted to MR
Explanation:
- The data Hive processes is stored in HDFS
- By default, Hive analyzes data using MapReduce underneath (this can be changed to Spark)
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
- The resulting program runs on YARN
- Hive is a client of Hadoop
- Hive is not distributed
1.2 Hive Advantages and Disadvantages
1.2.1 Advantages
- The operation interface adopts SQL-like syntax, enabling rapid development (simple and easy to use)
- Avoids writing MR directly, lowering the learning cost for developers
- Hive has high latency (because MR has high latency), so it is commonly used for offline data analysis
- Hive's strength is handling big data (for small data volumes it is no match for MySQL and the like)
- Hive supports user-defined functions, so you can implement functions according to your own needs
1.2.2 Disadvantages
- Hive's HQL has limited expressive power (determined by MR)
  - Iterative algorithms cannot be expressed
  - It is not well suited to data mining
- Hive is less efficient
  - The MR jobs Hive generates automatically are usually not smart enough
  - Hive is hard to tune (only at the resource and SQL level, not in the deep logic of the job)
1.3 Principle of Hive Architecture
- User interface: Client
CLI (Hive Shell), JDBC/ODBC (Java Access Hive), WebUI (Browser Access Hive)
- Metadata: Metastore
This includes the name of the table, the database the table belongs to, the owner of the table, the column/partition fields, the type of the table, the directory the table data resides in, etc. (stored by default in an embedded Derby database; MySQL is recommended)
- Underlying storage: HDFS
HDFS is used for storage and MapReduce is used for computation
- Driver
  - Parser (SQL Parser): converts the SQL string into an abstract syntax tree (AST) and performs syntax analysis on it, e.g. checking SQL syntax and whether tables/fields exist
  - Compiler (Physical Plan): compiles the AST into a logical execution plan
  - Query Optimizer: optimizes the logical execution plan
  - Execution: converts the logical execution plan into a runnable physical plan, i.e. MR/Spark
Hive accepts the user's SQL through the interactive interfaces it provides, uses its own Driver together with the Metastore to translate the SQL into MapReduce jobs, submits them to Hadoop for execution, and returns the results to the user's interface.
1.4 Hive and traditional database comparison
Hive has little in common with traditional databases apart from providing SQL-like syntax; Hive was designed with data warehousing in mind.
1.4.1 Data storage location
Hive is built on Hadoop, and all Hive data is stored in HDFS, while traditional databases keep data in the local file system. As a result, Hive can process far larger amounts of data.
1.4.2 Data update
Hive is designed for data warehouse applications, so data is written once and read many times. Rewriting data in Hive is not recommended; all data is determined at load time. Databases, by contrast, usually need frequent inserts, deletes and updates.
1.4.3 Indexes
Hive does no processing of the data while loading it; given the volume of data, building indexes is not cost-effective. Hive therefore has to scan all of the data to find rows matching a condition, so access latency is high. Thanks to MapReduce, Hive can access data in parallel, so even without indexes it can handle very large datasets. Traditional databases typically build indexes on one or more columns, which makes access to small amounts of specific data fast and efficient. This also means Hive is not suitable for real-time data analysis.
1.4.4 Execution
The execution engine for Hive is MR/Spark; traditional databases have their own execution engines
1.4.5 Scalability
Because Hadoop is highly scalable, Hive is highly scalable too; the scalability of a traditional database is limited by comparison
1.4.6 Data Scale
Hive can use MapReduce for parallel computation of large-scale data. Traditional databases support smaller amounts of data
2. Getting Started with Hive
2.1 Hive installation
To store the metadata in MySQL, you need to initialize it. The initialization command (the remaining installation steps are left to you):
schematool -dbType mysql -initSchema
2.2 Hive basic operation
- Start Hive
[root@master hive-3.2.1]# hive
- View databases
hive (hive)> show databases;
OK
database_name
default
hive
Time taken: 0.02 seconds, Fetched: 2 row(s)
Hive has a default database, which is used by default
- Switch databases
hive (hive)> use hive;
OK
Time taken: 0.031 seconds
- Create a table
hive (hive)> create table if not exists tbl_1(id int,name string);
OK
Time taken: 0.628 seconds
The syntax is basically the same as MySQL, except that Hive's data types are similar to Java's
- View table structure
hive (hive)> desc tbl_1;
OK
col_name    data_type    comment
id          int
name        string
Time taken: 0.084 seconds, Fetched: 2 row(s)
# To check the table details
hive (hive)> desc formatted tbl_1;
OK
col_name    data_type    comment
# col_name  data_type    comment
id          int
name        string

# Detailed Table Information
Database:            hive
OwnerType:           USER
Owner:               root
CreateTime:          Wed Aug 26 19:55:58 CST 2020
LastAccessTime:      UNKNOWN
Retention:           0
Location:            hdfs://master:9000/user/hive/warehouse/hive.db/tbl_1
Table Type:          MANAGED_TABLE
Table Parameters:
    COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"id\":\"true\",\"name\":\"true\"}}
    bucketing_version       2
    numFiles                0
    numRows                 0
    rawDataSize             0
    totalSize               0
    transient_lastDdlTime   1598442958

# Storage Information
SerDe Library:       org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:         org.apache.hadoop.mapred.TextInputFormat
OutputFormat:        org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:          No
Num Buckets:         -1
Bucket Columns:      []
Sort Columns:        []
Storage Desc Params:
    serialization.format    1
Time taken: 0.154 seconds, Fetched: 32 row(s)
- Insert data (don't, don't, don't)
hive (hive)> insert into tbl_1 values(1,'zhangsan');
...
Time taken: 84.754 seconds
Anyone who has tried it knows why
- Query data
hive (hive)> select * from tbl_1;
OK
tbl_1.id    tbl_1.name
1           zhangsan
Time taken: 0.214 seconds, Fetched: 1 row(s)
- Exit Hive
hive (hive)> quit;
- Execute HDFS shell commands
hive (hive)> dfs -ls /;
Found 3 items
drwxr-xr-x   - root supergroup          0 2020-07-21 15:57 /HBase
drwx-wx-wx   - root supergroup          0 2020-07-21 18:27 /tmp
drwxrwxrwx   - root supergroup          0 2020-07-21 18:00 /user
- Execute Linux shell commands
hive (hive)> !pwd;
/usr/local/soft/hive-3.2.1
2.3 Hive General Operations
hive (hive)> insert into tbl_1 values(1,'zhangsan');
...
Time taken: 84.754 seconds
Taking 84 seconds to insert a single row is obviously unrealistic... The most brute-force and straightforward way to insert data into Hive is to simply put the data files at the table's HDFS path. However, not just any data will work: you need to specify the delimiter when creating the table.
hive (hive)> create table if not exists tbl_2(id int,name string)
> row format delimited fields terminated by '\t';
OK
Time taken: 0.118 seconds
Prepare the data
[root@master data]# cat student.txt
1	zhangsan
2	lisi
3	wangwu
4	zhaoliu
5	tianqi
[root@master data]# pwd
/usr/local/soft/hive-3.2.1/data
2.3.1 Hive Insert Data: Method 1
Load local data into Hive
hive (hive)> load data local inpath '/usr/local/soft/hive-3.2.1/data/student.txt' into table tbl_2;
Loading data to table hive.tbl_2
OK
Time taken: 0.311 seconds
hive (hive)> select * from tbl_2;
OK
tbl_2.id    tbl_2.name
1           zhangsan
2           lisi
3           wangwu
4           zhaoliu
5           tianqi
Time taken: 0.192 seconds, Fetched: 5 row(s)
Verdict: convenient and recommended
2.3.2 Hive Insert Data: Method 2
The data managed by Hive is stored in HDFS; the storage path can be set in the configuration file, and you can browse to that path to view the data. Run desc formatted table_name to see a table's storage path
- The root path for Hive data in HDFS defaults to /user/hive/warehouse/
- Each database corresponds to a folder, named database_name.db (except the default database)
- Each table also corresponds to a folder, named after the table
- Data files are placed directly in the table's folder, so under the hood LOAD simply calls hadoop fs -put
- Tables in the default database are placed directly under warehouse, with the same naming scheme
Based on the above rules, there is a second way to insert data: use the Hadoop shell to put the file directly at the table's HDFS path
[root@master data]# hadoop fs -put student.txt /user/hive/warehouse/hive.db/tbl_2/student_1.txt
hive (hive)> select * from tbl_2;
OK
tbl_2.id    tbl_2.name
1           zhangsan
2           lisi
3           wangwu
4           zhaoliu
5           tianqi
1           zhangsan
2           lisi
3           wangwu
4           zhaoliu
5           tianqi
Time taken: 0.396 seconds, Fetched: 10 row(s)
Summary: this also works, but you must know the table's path in HDFS, so the applicability of this approach is rather narrow!
Having seen how Hive databases, tables, and data are stored on HDFS, try loading data again, this time from HDFS
hive (hive)> load data inpath '/student.txt' into table tbl_2;
Loading data to table hive.tbl_2
OK
Time taken: 0.683 seconds
Loading a file from HDFS does not need the local keyword (obviously). Looking at HDFS again afterwards, the original file is gone: under the hood this load is a hadoop fs -mv.
If the same data file is loaded multiple times, Hive renames the new data file before placing it at the table's HDFS path, similar to how Windows handles files with the same name.
2.4 Hive data types
2.4.1 Basic data types
Hive data type | Java data type | Length |
---|---|---|
tinyint | byte | 1 byte |
smallint | short | 2 bytes |
int | int | 4 bytes |
bigint | long | 8 bytes |
boolean | boolean | true/false |
float | float | single precision |
double | double | double precision |
string | String | character string |
timestamp | | |
binary | | |
The most commonly used basic types are int, bigint, double and string; type names are case-insensitive. boolean is usually replaced with 0/1 to reduce storage; string is used the most, e.g. for logs, and a single value (one row) can theoretically hold 2GB of data.
2.4.2 Collection data types
The data type | describe | Syntax examples |
---|---|---|
struct | Structure, complex and unrelated data | struct\<k1:v,k2:v> |
map | Dictionary, set of key-value pair tuples | map<k,v> |
array | Arrays, collections of the same type | array\<v> |
The difference between a struct and a map: a map stores groups of k-v pairs, and the keys within one map must all be different; a struct stores fields with fixed names, i.e. each row's struct has the same keys in the same positions but different values. Collection data types allow nesting to any depth.
2.4.3 Type Conversion
Hive supports Java-like data type conversions
Implicit conversion
- tinyint -> smallint -> int -> bigint -> float -> double
- A string can be implicitly converted to a number only if it holds numeric content
- boolean cannot be cast to any other type
cast
hive (hive)> select cast('1' as int);
OK
_c0
1
Time taken: 0.937 seconds, Fetched: 1 row(s)
hive (hive)> select cast('a' as int);
OK
_c0
NULL
Time taken: 0.184 seconds, Fetched: 1 row(s)
Note that casting a non-numeric string does not raise an error; it simply returns NULL
2.4.4 Test collection data types
You need to store data in the following format (JSON)
{" name ":" Chen xiao chun ", "friends" : [" cheng ", "Michael tse"], / / list Array "children" : {/ / key/value Map "jasper" : 3, "baby" : 1,} "address" : {/ / structure Struct "street" : "queen's road", "city" : "Hong Kong"}} {" name ":" liu ", "friends" : [" zhang ", "grandson couple"], "children" : {", a ": 8, "neon" : 6,} "address" : {" street ":" changan avenue ", "city" : "Beijing"}}
Convert each record to a single line of data, eliminating the unnecessary characters. Within a line, fields are separated by ',', collection elements by '_', and the KV pairs of the map by ':', so the data can be converted to the following format
Chen Xiaochun,Zheng Yijian_Xie Tianhua,jasper:3_baby:1,Queen's Road_Hong Kong
Liu Ye,Zhang Ziyi_Sun Li,Nuoyi:8_Nina:6,Chang'an Avenue_Beijing
Create the following table for the above data
hive (hive)> create table tbl_3(name string,friends array<string>,childress map<string,int>,address struct<street:string,city:string>)
> row format delimited fields terminated by ','
> collection items terminated by '_'
> map keys terminated by ':';
OK
Time taken: 0.124 seconds
Explanation:
- row format delimited fields terminated by ',' sets the field delimiter
- collection items terminated by '_' sets the collection element delimiter
- map keys terminated by ':' sets the map key-value delimiter
- lines terminated by '\n' sets the line delimiter (default '\n')
Import data test
hive (hive)> load data local inpath '/usr/local/soft/hive-3.2.1/data/test_collection' into table tbl_3;
Loading data to table hive.tbl_3
OK
Time taken: 0.281 seconds
hive (hive)> select * from tbl_3;
OK
tbl_3.name      tbl_3.friends                     tbl_3.childress         tbl_3.address
Chen Xiaochun   ["Zheng Yijian","Xie Tianhua"]    {"jasper":3,"baby":1}   {"street":"Queen's Road","city":"Hong Kong"}
Liu Ye          ["Zhang Ziyi","Sun Li"]           {"Nuoyi":8,"Nina":6}    {"street":"Chang'an Avenue","city":"Beijing"}
Time taken: 0.176 seconds, Fetched: 2 row(s)
hive (hive)> select name,friends[0],childress['baby'],address.street from tbl_3;
...
Time taken: 0.222 seconds, Fetched: 2 row(s)
3. DDL: Data Definition Language
3.1 Database Operation
3.1.1 Create the database
1. Method one
Create a database that is stored in HDFS by default /user/hive/warehouse/*.db
hive (default)> create database test;
OK
Time taken: 0.827 seconds
If the database already exists, an Execution Error is reported; method two is recommended
2. Method two
To avoid an error when the database already exists, use the if not exists form
hive (default)> create database if not exists hive;
OK
Time taken: 0.029 seconds
3. Method three
Specifies where the database is stored in HDFS
hive (default)> create database test location '/hive/test.db';
OK
Time taken: 0.097 seconds
3.1.2 Query database
1. Display the database
Display database
hive (hive)> show databases;
OK
database_name
default
hive
Time taken: 0.03 seconds, Fetched: 2 row(s)
Filter the database that displays the query
hive (hive)> show databases like 'h*';
OK
database_name
hive
Time taken: 0.022 seconds, Fetched: 1 row(s)
2. Check the database details
Display database information
hive (hive)> desc database hive;
OK
db_name    comment      location                                          owner_name    owner_type    parameters
hive       hive test    hdfs://master:9000/user/hive/warehouse/hive.db    root          USER
Time taken: 0.049 seconds, Fetched: 1 row(s)
Displays database details
hive (hive)> desc database extended hive;
OK
db_name    comment      location                                          owner_name    owner_type    parameters
hive       hive test    hdfs://master:9000/user/hive/warehouse/hive.db    root          USER          {creator=wj}
Time taken: 0.03 seconds, Fetched: 1 row(s)
{creator=wj} is a custom property and can be used as a comment
3.1.3 Modify the database
The fixed metadata of a created database cannot be changed, including the database name and the directory where the database resides. Modifying a database here means modifying the key-value pairs of its dbproperties
hive (test)> alter database test set dbproperties('creator'='wj');
OK
Time taken: 0.234 seconds
3.1.4 Delete database
1. Delete empty database
hive (hive)> drop database d1;
OK
Time taken: 0.435 seconds
2. Forced delete database
For a non-empty database, the above command cannot delete it
hive (d1)> drop database d1;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. InvalidOperationException(message:Database d1 is not empty. One or more tables exist.)
So you can use cascade to force deletion
hive (d1)> drop database d1 cascade;
OK
Time taken: 0.231 seconds
3.2 Table Operations
3.2.1 Create a table
Standard syntax
create [external] table [if not exists] table_name
[(col_name data_type [comment col_comment], ...)]
[comment table_comment]
[partitioned by (col_name data_type [col_name data_type], ...)]
[clustered by (col_name, col_name)]
[row format ...]
[collection items ...]
[map keys ...]
[location hdfs_path]
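As a rough illustration of how several of these optional clauses combine (the table and column names below are made up for the example):

create external table if not exists emp_demo(
    emp_id int comment 'employee id',
    emp_name string comment 'employee name'
)
comment 'demo employee table'
partitioned by (month string)
row format delimited fields terminated by '\t'
location '/hive/emp_demo';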
3.2.2 Modify a table
1. Rename a table
hive (hive)> alter table emp rename to tbl_emp;
OK
Time taken: 0.215 seconds
hive (hive)> show tables;
OK
tab_name
score
student
tbl_1
tbl_2
tbl_3
tbl_4
tbl_emp
Time taken: 0.025 seconds, Fetched: 7 row(s)
2. Add columns
hive (hive)> alter table tbl_emp add columns(emp_id int);
OK
Time taken: 0.147 seconds
3. Modify the columns
This includes changing column names, column properties
hive (hive)> alter table tbl_emp change emp_id c_emp_id string;
OK
Time taken: 0.234 seconds
4. Replace columns
hive (hive)> alter table tbl_emp replace columns(s_id string,c_id string,c_grade string);
OK
Time taken: 0.157 seconds
3.2.3 Delete a table
hive (hive)> drop table tbl_emp;
OK
Time taken: 0.227 seconds
3.2.4 Internal/External Tables
1. Internal tables
Also known as a managed table (MANAGED_TABLE); because its counterpart is called an external table (EXTERNAL_TABLE), it is usually referred to as an internal table. A created table is internal by default. When an internal table is dropped, both its metadata and its data in HDFS are deleted
2. External tables
In contrast to internal tables, dropping an external table deletes only the metadata, not the data in HDFS. If you later create a table with the same name again, the table is "restored".
hive (hive)> create external table tbl_5(id int,name string);
OK
Time taken: 0.183 seconds
hive (hive)> desc formatted tbl_5;
OK
col_name data_type comment
# col_name data_type comment
id int
name string
# Detailed Table Information
Database: hive
OwnerType: USER
Owner: root
CreateTime: Thu Aug 27 19:57:54 CST 2020
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://master:9000/user/hive/warehouse/hive.db/tbl_5
Table Type: EXTERNAL_TABLE
3. Internal and external table conversion
Internal table -> external table
hive (hive)> alter table tbl_4 set tblproperties('EXTERNAL'='TRUE');
OK
Time taken: 0.13 seconds
External table -> internal table
hive (hive)> alter table tbl_4 set tblproperties('EXTERNAL'='FALSE');
OK
Time taken: 0.13 seconds
4. Application scenarios
In a real production environment, data that is shared must go in external tables, to prevent data loss from accidental operations. For personal use, intermediate tables in a data analysis can be internal tables, which are easier to manage.
3.2.5 Partitioned tables
Suppose Hive has stored a year's worth of data in a database, and you now need to find one particular day's data. With a plain where statement, Hive performs a full scan of all the data in the table (folder), which is inefficient. The partitioned table concept was therefore introduced to optimize such queries (in the spirit of predicate pushdown)
1. Create partition table
hive (hive)> create table tbl_6(id int,name string) partitioned by(month string) row format delimited fields terminated by '\t';
OK
Time taken: 0.106 seconds
Compared to creating an ordinary table, just add partitioned by (col_name data_type)
2. Import data
Note that for a partitioned table you need to specify the partition when loading data, or an error is reported (because Hive does not know where to put the data)
hive (hive)> load data local inpath '/usr/local/soft/hive-3.2.1/data/test_partition' into table tbl_6 partition(month='2020-08');
Loading data to table hive.tbl_6 partition (month='2020-08')
OK
Time taken: 0.782 seconds
hive (hive)> select * from tbl_6;
OK
tbl_6.id    tbl_6.name    tbl_6.month
1           zhangsan      2020-08
2           lisi          2020-08
3           wangwu        2020-08
Time taken: 0.141 seconds, Fetched: 3 row(s)
You can see that the partition field is automatically appended to the table data (the original data file is untouched; Hive adds the partition value as it reads the data). Next, add another partition
hive (hive)> load data local inpath '/usr/local/soft/hive-3.2.1/data/test_partition' into table tbl_6 partition(month='2020-09');
Loading data to table hive.tbl_6 partition (month='2020-09')
OK
Time taken: 0.634 seconds
The data in different partitions is placed in different folders. The advantage of this is that when Hive executes a WHERE query, it will scan the data directly in the partition folder, which is much more efficient.
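For instance, filtering on the partition column touches only that partition's folder; given the loads above, the query and result would look roughly like this:

hive (hive)> select * from tbl_6 where month='2020-08';
OK
tbl_6.id    tbl_6.name    tbl_6.month
1           zhangsan      2020-08
2           lisi          2020-08
3           wangwu        2020-08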
3. Operate partitions
Adding partitions (essentially creating folders)
hive (hive)> alter table tbl_6 add partition(month='2020-10');
OK
Time taken: 0.169 seconds
Delete partitions (essentially deleting folders)
hive (hive)> alter table tbl_6 drop partition(month='2020-10');
OK
Time taken: 0.269 seconds
Create a secondary partition
hive (hive)> create table tbl_6(id int,name string) partitioned by(month string,day string)
           > row format delimited fields terminated by '\t';
OK
Time taken: 0.106 seconds
In theory and in practice there is no hard upper limit on the number of partition levels. Combining the knowledge above, you can also put files for a partitioned table directly at the right HDFS path, as sketched below
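A sketch of that approach (the month=2020-11 partition directory is made up here): after putting a file into a partition directory the metastore does not yet know about, register the partition, e.g. with msck repair or alter table add partition, or Hive will not see it

[root@master data]# hadoop fs -mkdir -p /user/hive/warehouse/hive.db/tbl_6/month=2020-11
[root@master data]# hadoop fs -put test_partition /user/hive/warehouse/hive.db/tbl_6/month=2020-11
hive (hive)> msck repair table tbl_6;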
Check the partition
hive (hive)> show partitions tbl_6;
OK
partition
month=2020-08
month=2020-09
month=2020-10
Time taken: 0.127 seconds, Fetched: 3 row(s)
4. DML: Data Manipulation Language
4.1 Data import
4.1.1 Load data to table
1. Standard syntax
load data [local] inpath path [overwrite] into table table_name [partition(p1=v1,...)]
args | explanation |
---|---|
load data | load data |
local | load from the local file system; if omitted, load from HDFS |
inpath path | the path of the data file to load |
overwrite | overwrite existing data |
into table | append data |
table_name | the target table |
partition | load into the specified partition |
2. Operation cases
See 2.3 Hive General Operations
4.1.2 insert
INSERT into a partitioned table, not recommended
hive (hive)> insert into tbl_6 partition(month='2020-07') values(4,'zhaoliu');
Basic insert mode, where data is inserted based on the result of the query
hive (hive)> insert into table tbl_5 select * from tbl_1;
hive (hive)> insert overwrite table tbl_5 select * from tbl_1;
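Hive also supports a multi-insert form that scans the source table once while writing to several targets; a sketch reusing the tables above:

hive (hive)> from tbl_1
           > insert into table tbl_5 select *
           > insert overwrite table tbl_2 select *;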
4.1.3 as select
hive (hive)> create table tbl_7 as select * from tbl_6;
4.1.4 export
export can only be used together with import; see the import usage below (4.2.3)
4.2 Data Export
4.2.1 insert
hive (hive)> insert overwrite local directory '/tmp/hive' select * from tbl_6;
[root@master data]# cd /tmp/hive/
[root@master hive]# ll
-rw-r--r-- 1 root root 108 Aug 27 20:55 000000_0
[root@master hive]# cat 000000_0
1zhangsan2020-08
2lisi2020-08
3wangwu2020-08
1zhangsan2020-09
2lisi2020-09
3wangwu2020-09
The export overwrites all files under the target path, so be sure to use a path that does not already exist. cat shows the data in an unfriendly form (the fields are joined by Hive's default, non-printing delimiter), so format the exported data:
hive (hive)> insert overwrite local directory '/tmp/hive' row format delimited fields terminated by '\t' select * from tbl_6;
[root@master hive]# cat 000000_0
1	zhangsan	2020-08
2	lisi	2020-08
3	wangwu	2020-08
1	zhangsan	2020-09
2	lisi	2020-09
3	wangwu	2020-09
4.2.2 hive shell
hive -e 'SQL' > file takes advantage of Linux redirection; it is not commonly used
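A quick sketch of that redirection (the output path is arbitrary):

[root@master ~]# hive -e 'select * from hive.tbl_1' > /tmp/tbl_1.txt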
4.2.3 export/import
Export, then import
hive (hive)> export table tbl_1 to '/hive';
OK
Time taken: 0.117 seconds
hive (hive)> truncate table tbl_1;
OK
Time taken: 0.149 seconds
hive (hive)> select * from tbl_1;
OK
tbl_1.id    tbl_1.name
Time taken: 0.141 seconds
hive (hive)> import table tbl_1 from '/hive';
Copying data from hdfs://master:9000/hive/data
Copying file: hdfs://master:9000/hive/data/000000_0
Loading data to table hive.tbl_1
OK
Time taken: 0.197 seconds
5. DQL: Data Query Language
5.1 Local mode
Before learning the query commands, note that for small datasets you can set Hive to run in local mode, which significantly reduces execution time. Use the following configuration
// enable local mode
set hive.exec.mode.local.auto=true;
// maximum total input size for local MR; local MR is used when the input is smaller than this value (default 128MB)
set hive.exec.mode.local.auto.inputbytes.max=50000000;
// maximum number of input files for local MR; local MR is used when there are fewer input files than this value (default 4)
set hive.exec.mode.local.auto.input.files.max=10;
Turn off local mode
set hive.exec.mode.local.auto=false;
5.2 Basic Query
Same syntax as MySQL, basic syntax
select ... from table_name
5.2.1 Full table and field-specific queries
1. Full table query
select * from student;
2. Specific field queries
select st_id,st_name from student;
Conclusion:
- HQL syntax is not case sensitive
- HQL can be written in one line or multiple lines
- Keywords cannot be abbreviated or split across lines
- Clauses are usually written on separate lines
- Use indentation to make your code more readable
5.2.2 Column aliases
Basic syntax; as can be omitted
select st_id as id,st_name name from student;
5.2.3 Arithmetic Operators
Operator | Description |
---|---|
A+B | A plus B |
A-B | A minus B |
A*B | A times B |
A/B | A divided by B |
A%B | A modulo B |
A&B | bitwise AND of A and B |
A\|B | bitwise OR of A and B |
A^B | bitwise XOR of A and B |
~A | bitwise NOT of A |
5.2.4 limit statement
Basic syntax
select st_id,st_name from student limit 2;
5.2.5 where clause
Same syntax as MySQL, basic syntax
select * from student where st_age > 20;
1. Comparison operators
Operator | Supported data types | Description |
---|---|---|
A=B | basic data types | returns TRUE if A equals B, FALSE otherwise |
A!=B | basic data types | returns TRUE if A is not equal to B, FALSE otherwise |
A<B | basic data types | returns TRUE if A is less than B, FALSE otherwise |
A<=B | basic data types | returns TRUE if A is less than or equal to B, FALSE otherwise |
A>B | basic data types | returns TRUE if A is greater than B, FALSE otherwise |
A>=B | basic data types | returns TRUE if A is greater than or equal to B, FALSE otherwise |
A [NOT] BETWEEN B AND C | basic data types | TRUE if A is greater than or equal to B and less than or equal to C, FALSE otherwise; NOT reverses the result |
A IS NULL | all data types | returns TRUE if A is NULL, FALSE otherwise |
A IS NOT NULL | all data types | returns TRUE if A is not NULL, FALSE otherwise |
IN (value1, value2) | all data types | returns TRUE if A equals value1 or value2 |
A [NOT] LIKE B | STRING type | B is SQL's simple pattern syntax: returns TRUE if A matches it, FALSE otherwise. 'x%' means A must begin with 'x'; '%x' means A must end with 'x'; '%x%' means A contains 'x' at the beginning, end, or middle of the string. NOT reverses the result |
2. Logical operators
Operator | Meaning |
---|---|
AND | logical AND |
OR | logical OR |
NOT | logical NOT |
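As a small illustration combining comparison and logical operators on the student table used earlier (the values are arbitrary):

select * from student where st_age between 18 and 20 and st_dept in ('ES','IS');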
5.4 Group query
Same syntax as MySQL
5.4.1 group by
The group by statement is typically used with aggregate functions (count, max, min, avg, sum): group by one or more columns, then apply an aggregate to each group. Every selected field must either appear after group by or be inside an aggregate function
select st_dept,count(*) from student group by st_dept;
5.4.2 having
It’s used in the same way as MySQL
Differences between having and where:
- where acts on the table's columns; having acts on the columns of the query result
- where cannot be followed by an aggregate function; having can
- having can only be used after group by
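A small sketch on the student table from earlier (the threshold is arbitrary):

select st_dept,avg(st_age) from student group by st_dept having avg(st_age) > 19;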
5.5 Join Queries
5.5.1 Equi-joins
Hive only supports equi-joins, not non-equi joins; the usage is consistent with MySQL syntax
5.5.2 summary
Inner join, left join, right join, full join, multi-table join, and Cartesian product are all consistent with MySQL syntax
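As a sketch, an equi-join between the student and score tables seen earlier (the score column names here are assumptions):

select s.st_id,s.st_name,sc.c_id,sc.c_grade
from student s join score sc on s.st_id = sc.s_id;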
5.6 Sorting
Sorting in Hive is quite different from MySQL. Before looking at Hive's sorting, you need to know about this Hive configuration
hive (hive)> set mapreduce.job.reduces;
mapreduce.job.reduces=-1
This sets the number of reduces. The default -1 plans the number of reduces dynamically based on the SQL, the data volume, and so on; a value greater than 0 fixes the number of reduces
5.6.1 Global sorting
This corresponds to MapReduce's global sort, which appears in Hive as order by with a single reduce: from MapReduce's perspective a global sort must output a single file, so there can only be one reduce.
1. Usage
select * from student order by age;
Ascending order (asc) is the default; descending order uses desc
2. Details
order by starts one reduce. What happens when the number of reduces is set manually?
// set the number of reduces to 3
hive (test)> set mapreduce.job.reduces=3;
hive (test)> select * from student order by age;
Query ID = root_20200829092604_3b647fd2-3d10-46ac-b498-0f34941dee6a
Total jobs = 1
Launching Job 1 out of 1
...
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
One job is submitted and one reduce is started: no matter how many reduces are configured, a global sort only ever starts one reduce.
3. Additional notes
Support for alias sorting
select stu_id id from student order by id;
Support for sorting of multiple fields
hive (test)> select * from student order by age,stu_id;
5.6.2 Local sorting
In Hive, sort by produces local order within each reduce; it cannot guarantee global order.
Set the number of reduce to three
set mapreduce.job.reduces=3;
The specific implementation
hive (test)> select * from emp sort by deptno desc;
...
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 3
The results are as follows
It can be roughly seen that there are three partitions (three reduce corresponds to three partitions). Of course, you can output the results to a file and you can find that there are also three files
insert overwrite local directory '/tmp/hive'
select * from emp sort by deptno desc;
One question: in MapReduce the default partitioner is the HashPartitioner, which takes the hash of the key modulo the number of reduces. So how does Hive partition here? By which field(s)? For the HQL above, Hive partitions by random values.
Why? If Hive quietly partitioned by some field when we had not specified one and data skew resulted, who could we complain to? So Hive distributes rows randomly to minimize data skew.
5.6.3 Partition sorting
distribute by: partitions the data the way MapReduce does (the given field is handed to the HashPartitioner); combined with sort by it creates order within each partition
Partition by department id, sorting by salary in ascending order
hive (test)> select * from emp distribute by deptno sort by sal;
5.6.4 cluster by
When the distribute by field and the sort by field are the same, cluster by can be used instead, serving both the partitioning and the sorting role; however, cluster by only sorts in ascending order, and you cannot specify desc or asc
cluster by col <=> distribute by col sort by col
Some people ask what the point is of partitioning and sorting by the same field. When there is a lot of data, a small number of partitions, and many distinct values of the field, many different values necessarily land in the same partition, and each partition is then sorted by that field.
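Writing the equivalence out, these two queries should behave identically:

hive (test)> select * from emp distribute by deptno sort by deptno;
hive (test)> select * from emp cluster by deptno;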
5.6.5 summary
Syntax | Conclusion |
---|---|
group by | partitions by the field (identical values go to the same group), followed by an aggregation |
distribute by | only partitions by the field |
order by | global sort; only one reduce is started |
sort by | with the number of reduces set to 1, identical to order by |
5.7 Bucketed Tables
Be careful to distinguish these from partitioned tables (neither name is very friendly, I think). A partitioned table merely stores data in separate locations, while a bucketed table corresponds to MR partitioning, splitting the data files themselves. Note: partitioning acts on the data's storage path, putting files into folders, while bucketing splits each data file; one bucket corresponds to one file, i.e. a data file gets split up.
Partitioning provides a convenient way to isolate data and optimize queries, but not all data sets can be partitioned well; Bucking is another means of breaking up a data set into more manageable pieces.
5.7.1 Create buckets table
create table stu_buck(id int, name string)
clustered by(id) into 4 buckets
row format delimited fields terminated by '\t';
The bucketed table uses clustered by (note the distinction from the sorting clause cluster by) with a column that already exists in the table (a partition column must not exist in the table; a bucket column must), here dividing the data into 4 buckets
View table structure
hive (test)> desc formatted stu_buck;
Load data into the bucketed table (a sketch of the load follows below); before loading, make sure mapreduce.job.reduces=-1
In this way, Hive will automatically create the corresponding number of Reduce based on the number of buckets in the bucket table
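A sketch of the load (assuming the student.txt data file from section 2.3):

hive (test)> load data local inpath '/usr/local/soft/hive-3.2.1/data/student.txt' into table stu_buck;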
And you can see that rows are assigned to buckets roughly by hash(id) mod the number of buckets
Older guides also set hive.enforce.bucketing=true before loading; in Hive 3.1.2 this configuration no longer exists
hive (test)> set hive.enforce.bucketing;
hive.enforce.bucketing is undefined
After reading the official documentation, we can see that Hive 2.x loads data directly, instead of creating an intermediate table and then inserting it into a bucket table using MR
5.7.2 Application of bucket table
As buckets are mainly used for sampling queries, a large data set often requires a representative query result rather than all the query results, so HIVE uses sampling statements to implement sampling queries
tablesample(bucket x out of y on col)
col: the column the table is bucketed on
x: which bucket to start sampling from
y: must be a multiple or factor of the number of buckets; it determines the sampling fraction. For example, with 4 buckets: when y=2, (4/2)=2 buckets of data are sampled; when y=4, (4/4)=1 bucket; when y=8, (4/8)=1/2 a bucket
Assuming the number of buckets is z and z/y>1, the buckets taken are x, x+y, x+2y, ...
tablesample(bucket 1 out of 2 on id)
So this starts from the first bucket and takes every second bucket: buckets 1 and 3 are sampled
Therefore tablesample requires x <= y, because the last bucket taken is
x+(z/y-1)*y => x+z-y <= z => x <= y
If x > y: FAILED: SemanticException [Error 10061]: Numerator should not be bigger than denominator in sample clause for table stu_buck
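Putting it together, sampling the stu_buck table created above might look like this:

hive (test)> select * from stu_buck tablesample(bucket 1 out of 2 on id);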
6. Functions
Like MySQL, Hive's functions divide into system built-in functions and user-defined functions, but user-defined functions differ considerably from MySQL's
Check the system’s built-in functions
hive (test)> show functions;
See the usage of the built-in functions
hive (test)> desc function upper;
See the usage of the built-in functions for details
hive (test)> desc function extended upper;
6.1 System built-in functions
Here are some common built-in functions (216 in total); those that differ from MySQL are marked with *
Function | Explanation |
---|---|
round | round to nearest |
ceil | round up |
floor | round down |
rand | random number between 0 and 1 |
lower | convert to lowercase |
upper | convert to uppercase |
length | return string length |
concat | string concatenation |
concat_ws | concatenation with a specified separator |
collect_set | merge a field's values into a set * |
substr | take a substring |
trim | strip leading and trailing spaces |
split | split a string |
to_date | string to date |
year, month, ... | extract year, month, day from a date |
from_unixtime | timestamp to date * |
unix_timestamp | date to timestamp * |
case... when... | conditional expression |
if | conditional function |
count | count (aggregate) |
sum | sum (aggregate) |
min | minimum (aggregate) |
max | maximum (aggregate) |
avg | average (aggregate) |
explode | explode function * |
lateral view | used together with explode * |
over | window function |
6.1.1 collect_set
It is usually used with group by: it collects each group's values for a field and packs them into a set.
The specific use
hive (test)> select college,collect_set(stu_name) from student group by college;
...
ES  ["Zheng Nan","Song Xia","Liu Tian","Zhang Haitao","Zhang Wei","Liu Meng"]
IS  ["Li Na","Deng Lili","Liu Xiaotang","Chang Lili"]
MA  ["Ma Yi","Li Xuewen","Zhang Hailing","Liu Xiaopeng","Zheng Chenggong","Li Guowei","Zhou Hongwei","Sun Haibo"]
Students are grouped by college; to show which students each college has, collect_set packs the names of students from the same college into a set
6.1.2 Date related
Date to timestamp: unix_timestamp
hive (test)> select unix_timestamp('2020-08-29 14:14:00','yyyy-MM-dd HH:mm:ss');
OK
_c0
1598710440
Time taken: 0.24 seconds, Fetched: 1 row(s)
Timestamp to date: from_unixtime
hive (test)> select from_unixtime(1598710440,'yyyy-MM-dd HH:mm:ss');
OK
_c0
2020-08-29 14:14:00
Time taken: 0.146 seconds, Fetched: 1 row(s)
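Since the two functions round-trip, they can be combined, e.g. to shift a datetime by one hour (a small sketch):

hive (test)> select from_unixtime(unix_timestamp('2020-08-29 14:14:00','yyyy-MM-dd HH:mm:ss') + 3600,'yyyy-MM-dd HH:mm:ss');
OK
_c0
2020-08-29 15:14:00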
6.1.3 Expansion function
1.explode
Converts one row of data into multiple rows (one per element); in Hive it only works with the Array and Map data types
Used for the Array data type
hive (hive)> select * from tbl_3;
OK
tbl_3.name      tbl_3.friends                     tbl_3.childress         tbl_3.address
Chen Xiaochun   ["Zheng Yijian","Xie Tianhua"]    {"jasper":3,"baby":1}   {"street":"Queen's Road","city":"Hong Kong"}
Liu Ye          ["Zhang Ziyi","Sun Li"]           {"Nuoyi":8,"Nina":6}    {"street":"Chang'an Avenue","city":"Beijing"}
Time taken: 0.153 seconds, Fetched: 2 row(s)
hive (hive)> select explode(friends) as friend from tbl_3;
OK
friend
Zheng Yijian
Xie Tianhua
Zhang Ziyi
Sun Li
Time taken: 0.156 seconds, Fetched: 4 row(s)
For the Map data type
hive (hive)> select explode(childress) as (name,age) from tbl_3;
OK
name    age
jasper  3
baby    1
Nuoyi   8
Nina    6
Time taken: 0.133 seconds, Fetched: 4 row(s)
But the explode function has a big drawback
- Cannot associate other fields from the original table
- Cannot group, sort
- UDTF (user-defined table generation function) nesting is not possible
2.lateral view
A lateral view is used together with a UDTF; it solves the problem that a UDTF cannot select additional columns. The principle is somewhat like a MySQL view: the UDTF is applied to each row of the input, and the input row is joined with its exploded rows in a virtual table.
Standard grammar
lateral view udtf(expression) tableAlias as columnAlias,columnAlias...
udtf(expression): the UDTF function used, e.g. explode()
tableAlias: the alias of the virtual table
columnAlias: the names of the virtual table's fields
Resolve the shortcoming of explode()
hive (hive)> select name,friend from tbl_3 lateral view explode(friends) tmp_tbl as friend;
OK
name            friend
Chen Xiaochun   Zheng Yijian
Chen Xiaochun   Xie Tianhua
Liu Ye          Zhang Ziyi
Liu Ye          Sun Li
Time taken: 0.086 seconds, Fetched: 4 row(s)
6.2 User defined functions
Custom functions in Hive are divided into three types based on the number of lines of input and output:
- User-defined function (UDF)
- User-defined aggregate function (UDAF)
- User-defined table-generating function (UDTF)
Function type | Description |
---|---|
UDF | one row in, one row out, e.g. the string functions |
UDAF | many rows in, one row out, e.g. the aggregate functions |
UDTF | one row in, many rows out, e.g. the explode function |
6.2.1 UDF
1. Programming steps
- Extend org.apache.hadoop.hive.ql.exec.UDF
- Implement the evaluate method, which supports overloading
- Add the JAR to Hive
- Create the function
- Use the function
- Delete the function
Note: a UDF must have a return value; it may return NULL, but the return type must not be void
2. Implementation
Implement the length() function
1) Add dependencies
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>3.1.2</version>
</dependency>
2) Create class inheritance UDF
package hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;
@SuppressWarnings("deprecation")
public class UDFLength extends UDF {
}
3) Implement the evaluate function
package hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

@SuppressWarnings("deprecation")
public class UDFLength extends UDF {
    public int evaluate(String str) {
        return str.length();
    }
}
4) Add a JAR package to Hive
Package the program into a jar and place it under hive/lib; jars in that directory are loaded automatically when Hive starts, but a jar can also be loaded manually
hive (hive)> add jar /usr/local/soft/hive-3.2.1/lib/hadoop-1.0-SNAPSHOT.jar;
Added [/usr/local/soft/hive-3.2.1/lib/hadoop-1.0-SNAPSHOT.jar] to class path
Added resources: [/usr/local/soft/hive-3.2.1/lib/hadoop-1.0-SNAPSHOT.jar]
5) Create a function
A custom function in Hive can be created as a temporary or a permanent function. A temporary function is available only in the current session and the current database.
Standard grammar
create [temporary] function fun_name as 'package.class';
hive (hive)> create temporary function getlength as 'hive.udf.UDFLength';
OK
Time taken: 0.047 seconds
6) Use functions
Logically, it takes a string and returns an int
hive (hive)> select *,getlength(st_name) from student;
OK
student.st_id    student.st_name    student.st_sex    student.st_age    student.st_dept    _c1
10001            ...                male              20                ES                 2
10002            Li Na              female            19                IS                 2
Time taken: 0.145 seconds, Fetched: 2 row(s)
hive (hive)> desc function extended getlength;
OK
tab_name
There is no documentation for function 'getlength'
Function class:hive.udf.UDFLength
Function type:TEMPORARY
Time taken: 0.017 seconds, Fetched: 3 row(s)
7) Delete function
hive (hive)> drop function if exists getlength;
OK
Time taken: 0.648 seconds
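For a permanent function, Hive also allows referencing a jar on HDFS, so the function survives across sessions (the HDFS path below is hypothetical):

hive (hive)> create function getlength as 'hive.udf.UDFLength' using jar 'hdfs://master:9000/jars/hadoop-1.0-SNAPSHOT.jar';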
6.2.2 GenericUDF
The UDF class is deprecated; Hive recommends GenericUDF, which supports more data types and is more efficient
package hive.udf;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;

public class GenericUDFLength extends GenericUDF {
    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        return null;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        return null;
    }

    @Override
    public String getDisplayString(String[] children) {
        return null;
    }
}
- initialize: initialization, argument validation, object instantiation, and so on
- evaluate: the business logic
- getDisplayString: the function's help text
So what goes into initialize? Open one of GenericUDF's implementation classes, GenericUDFCharacterLength
@Description(name = "character_length,char_length",
    value = "_FUNC_(str | binary) - Returns the number of characters in str or binary data",
    extended = "Example:\n  > SELECT _FUNC_('안녕하세요') FROM src LIMIT 1;\n  5")
This function was probably written by a Korean. Let's see what its initialize returns
outputOI = PrimitiveObjectInspectorFactory.writableIntObjectInspector;
return outputOI;
From the static members of PrimitiveObjectInspectorFactory we can roughly tell what initialize needs to return; since this function returns an int type, we can return the same object inspector directly
@Override
public Object evaluate(DeferredObject[] arguments) throws HiveException {
return null;
}
Compared with UDF's evaluate, this signature is clearly superior: it lets our function accept any number of arguments of any type.
package hive.udf;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.IntWritable;

public class GenericUDFLength extends GenericUDF {
    IntWritable result = new IntWritable();

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        return PrimitiveObjectInspectorFactory.writableIntObjectInspector;
    }

    @Override
    public IntWritable evaluate(DeferredObject[] arguments) throws HiveException {
        if (arguments.length > 1) {
            System.out.println("This function does not support multiple arguments for the time being");
            return null;
        }
        result.set(arguments[0].get().toString().length());
        return result;
    }

    @Override
    public String getDisplayString(String[] children) {
        return "get string length";
    }
}
hive (default)> create temporary function getlength as 'hive.udf.GenericUDFLength';
OK
Time taken: 0.519 seconds
hive (default)> select getlength('123');
OK
_c0
3
Time taken: 3.072 seconds, Fetched: 1 row(s)
There are pitfalls here: avoid Java casts, and the return value should preferably use Hadoop's writable data types
6.2.3 UDAF
UDAF differs from UDF in that it must accept any number of values and return a result after computing over them, so a UDAF's structure is more complex than a UDF's
1. Programming steps
- The resolver class extends AbstractGenericUDAFResolver, and the computation class implements the GenericUDAFEvaluator interface
- Implement the GenericUDAFEvaluator methods init, iterate, terminatePartial, merge, terminate:
  - init: initialization
  - iterate: receives the input arguments and iterates over them
  - terminatePartial: returns the partial aggregation after iterate
  - merge: receives terminatePartial's result and performs the merge
  - terminate: returns the final result
2. Implementation
6.2.4 UDTF
Implement the functionality of the split function
1. Programming steps
- Create a class extending GenericUDTF
- Implement the initialize, process, and close methods
Looking at GenericUDTF, it seems we could get by overriding just the process and close methods, but doing so produces an error, as the GenericUDTF source code shows
public StructObjectInspector initialize(StructObjectInspector argOIs)
throws UDFArgumentException {
List<? extends StructField> inputFields = argOIs.getAllStructFieldRefs();
ObjectInspector[] udtfInputOIs = new ObjectInspector[inputFields.size()];
for (int i = 0; i < inputFields.size(); i++) {
udtfInputOIs[i] = inputFields.get(i).getFieldObjectInspector();
}
return initialize(udtfInputOIs);
}
@Deprecated
public StructObjectInspector initialize(ObjectInspector[] argOIs)
throws UDFArgumentException {
throw new IllegalStateException("Should not be called directly");
}
The source code throws the exception directly, so we must override initialize
2. Code implementation
Again, let's see how an existing implementation, GenericUDTFJSONTuple, does it.
The initialization logic is fairly simple: numCols is the number of the method's formal parameters, fieldNames holds the names of the function's return fields, and fieldOIs holds the return value types. So we can write our own initialize, implementing our own logic along the way
package hive.udtf;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Text;

import java.util.ArrayList;
import java.util.List;

public class UDTFSplit extends GenericUDTF {
    Text result = new Text();

    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
        List<String> fieldNames = new ArrayList<>();
        List<ObjectInspector> fieldOIs = new ArrayList<>();
        // this will be displayed as the field name of the result
        fieldNames.add("result");
        // the type of the function's return value
        fieldOIs.add(PrimitiveObjectInspectorFactory.writableStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    @Override
    public void process(Object[] args) throws HiveException {
        if (args.length != 2) {
            throw new RuntimeException("the number of arguments does not match");
        }
        // split the first argument by the second
        String[] split = args[0].toString().split(args[1].toString());
        for (String value : split) {
            result.set(value);
            // emit one output row per split value
            forward(new Object[]{result});
        }
    }

    @Override
    public void close() throws HiveException {
        // nothing to clean up
    }
}
Usage
hive (default)> select udtfsplit('hello_world_hello_hive','_');
OK
result
hello
world
hello
hive
Time taken: 2.406 seconds, Fetched: 4 row(s)
7. Enterprise Tuning
To be honest, you can't really write about Hive tuning without a few years of work experience, and yet interviews love to ask about it. In practice, not much of it actually gets used. Why? For most people, managing to write the SQL that meets the requirement is already something to be thankful for; what tuning can you talk about when the requirement isn't even implemented yet? Isn't that so.
- [ ] Fetch
- [x] Local mode
- [ ] Table optimization
- [ ] MR optimization
- [ ] Parallel execution
- [ ] Strict mode
- [ ] JVM reuse
- [ ] Speculative execution
- [ ] Compression