

Hive Basic Concepts

Advantages and Disadvantages of Hive

Advantages

  1. The operation interface uses SQL-like syntax to provide rapid development capabilities (simple and easy to use).
  2. Eliminating the need to write MapReduce reduces developer learning costs.
  3. Hive has high execution latency, so it is commonly used for data analysis scenarios that do not require real-time results.
  4. Hive excels at processing large data sets but offers no advantage for small data sets, again because of its high execution latency.
  5. Hive supports user-defined functions, so users can implement custom functions according to their own requirements.
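As an illustration (a sketch only: the jar path and class name below are placeholders for a UDF you would write and package yourself), a custom function is typically registered and called like this:

```sql
-- Sketch of registering and using a user-defined function.
-- The jar path and the fully qualified class name are placeholders.
add jar /opt/module/hive/lib/my_udf.jar;
create temporary function my_lower as 'com.example.hive.MyLowerUDF';
select my_lower(name) from some_table;
```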

Disadvantages

  1. Hive's HQL has limited expressive power
    • Iterative algorithms cannot be expressed
    • It is not well suited to data mining: the constraints of the MapReduce processing model prevent more efficient algorithms from being implemented
  2. Hive is relatively inefficient
    • The MapReduce jobs Hive generates automatically are usually not intelligent enough
    • Hive tuning is difficult and coarse-grained

Hive Architecture Principles

User interface: Client

CLI (command-line interface), JDBC/ODBC (JDBC access to Hive), and WebUI (browser access to Hive)

Metadata: Metastore

Metadata includes the table name, the database the table belongs to (default), the table's owner, column/partition fields, the table type (whether it is an external table), and the directory where the table data resides. By default, metadata is stored in the embedded Derby database; using MySQL to store the Metastore is recommended.

Hadoop/Spark

HDFS is used for storage, and MapReduce (or Spark) is used for computation.

Driver: Driver

  • SQL Parser: converts the SQL string into an abstract syntax tree (AST), usually with a third-party library such as ANTLR, then analyzes the AST, for example checking whether tables and fields exist and whether the SQL semantics are valid.
  • Compiler: compiles the AST into a logical execution plan.
  • Query Optimizer: optimizes the logical execution plan.
  • Execution: converts the logical execution plan into a physical plan that can be run; for Hive, this means MapReduce or Spark jobs.

Through a set of interactive interfaces, Hive receives SQL commands from users, uses its Driver together with the Metastore to translate them into MapReduce jobs, submits the jobs to Hadoop or Spark for execution, and returns the results to the user interface.

Hive Data Types

Basic data types

| Hive data type | Java data type | Length / description | Example |
| --- | --- | --- | --- |
| TINYINT | byte | 1-byte signed integer | 20 |
| SMALLINT | short | 2-byte signed integer | 20 |
| INT | int | 4-byte signed integer | 20 |
| BIGINT | long | 8-byte signed integer | 20 |
| BOOLEAN | boolean | Boolean type, true or false | TRUE, FALSE |
| FLOAT | float | Single-precision floating-point number | 3.14159 |
| DOUBLE | double | Double-precision floating-point number | 3.14159 |
| STRING | String | Character sequence; a character set can be specified; single or double quotes may be used | 'Now is the time', "For all good Men" |
| TIMESTAMP | — | Timestamp | |
| BINARY | — | Byte array | |

Hive's STRING type is equivalent to a database VARCHAR, except that it is a variable-length string whose maximum length cannot be declared; in theory it can store up to 2 GB of characters.
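For comparison, a minimal sketch (the table name is made up; VARCHAR is available in Hive 0.12 and later and, unlike STRING, requires a declared length):

```sql
create table string_demo(
  s string,       -- variable-length, no declared maximum
  v varchar(20)   -- at most 20 characters; longer values are truncated
);
```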

Collection data types

| Data type | Description | Syntax example |
| --- | --- | --- |
| STRUCT | Similar to a struct in C; element content is accessed with "dot" notation. For example, if a column's data type is STRUCT<first STRING, last STRING>, the first element can be referenced as field.first. | struct<street:string, city:string> |
| MAP | A collection of key-value tuples accessed with array notation. For example, if a column's data type is MAP and its key->value pairs are 'first'->'John' and 'last'->'Doe', the last element can be retrieved as field['last']. | map<string, int> |
| ARRAY | A collection of variables of the same type, called the elements of the array, each numbered starting from zero. For example, if the array value is ['John', 'Doe'], the second element can be referenced as field[1]. | array<string> |

Hive supports three complex data types: ARRAY, MAP, and STRUCT. ARRAY and MAP are similar to Array and Map in Java, while STRUCT is similar to a struct in C (or an object in Java) in that it encapsulates a collection of named fields. Complex data types can be nested to arbitrary depth.

Case practice

  1. Suppose a table has one row of data, whose structure is shown below in JSON format, and we want to access this data in Hive.

{
  "name": "songsong",
  "friends": ["bingbing", "lili"],      // list (Array)
  "children": {                         // key-value pairs (Map)
    "xiao song": 18,
    "xiaoxiao song": 19
  },
  "address": {                          // Struct
    "street": "hui long guan",
    "city": "beijing"
  }
}
  2. Based on this data structure, create a corresponding table in Hive and import the data.

Create a local test file named test.txt

songsong,bingbing_lili,xiao song:18_xiaoxiao song:19,hui long guan_beijing
yangyang,caicai_susu,xiao yang:18_xiaoxiao yang:19,chao yang_beijing

Note that the separators between elements inside MAP, STRUCT, and ARRAY values can all be the same character; here "_" is used.

  3. Create the test table in Hive
create table test(
name string,
friends array<string>,
children map<string, int>,
address struct<street:string, city:string>
)
row format delimited fields terminated by ','
collection items terminated by '_'
map keys terminated by ':'
lines terminated by '\n';

Field description:
  • row format delimited fields terminated by ',' — field (column) delimiter
  • collection items terminated by '_' — delimiter for elements of MAP, STRUCT, and ARRAY
  • map keys terminated by ':' — delimiter between key and value within a MAP
  • lines terminated by '\n' — line delimiter

  4. Import the text data into the test table
load data local inpath '/opt/module/hive/datas/test.txt' into table test;
  5. Access the data in the three collection columns; the following shows ARRAY, MAP, and STRUCT access, respectively
hive (default)> select friends[1],children['xiao song'],address.city from test where name="songsong";
OK
_c0 _c1 city
lili 18 beijing
Time taken: 0.076 seconds, Fetched: 1 row(s)

DDL data definition

Create a table

Table creation syntax

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
[AS select_statement]

Field Description

  • CREATE TABLE creates a table with the specified name. If a table with the same name already exists, an exception is thrown; the user can ignore this exception with the IF NOT EXISTS option.
  • The EXTERNAL keyword lets the user create an external table and specify a LOCATION that points to the actual data. When an internal table is deleted, its metadata and data are deleted together; deleting an external table removes only the metadata, not the data.
  • COMMENT: Adds comments to tables and columns.
  • PARTITIONED BY Creates a partition table
  • CLUSTERED BY Creates a bucket table
  • SORTED BY is uncommon and sorts one or more columns in the bucket separately
  • ROW FORMAT
ROW FORMAT DELIMITED [FIELDS TERMINATED BY char]
  [COLLECTION ITEMS TERMINATED BY char]
  [MAP KEYS TERMINATED BY char]
  [LINES TERMINATED BY char]
| SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]

For example:

row format delimited fields terminated by ','
collection items terminated by '_'
map keys terminated by ':'
lines terminated by '\n';

Users can use a custom SerDe or one of Hive's built-in SerDes when creating tables. If ROW FORMAT or ROW FORMAT DELIMITED is not specified, the built-in SerDe is used. When creating a table, you specify its columns, and you may also specify a custom SerDe; Hive uses the SerDe to interpret the data of each table column. SerDe is short for Serializer/Deserializer: Hive uses it to serialize and deserialize row objects.
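For example (a sketch: the table name and property values are illustrative), a built-in SerDe such as OpenCSVSerde, which ships with Hive for parsing quoted CSV data, can be specified like this:

```sql
create table csv_demo(id string, name string)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties (
  "separatorChar" = ",",
  "quoteChar"     = "\""
)
stored as textfile;
```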

  • STORED AS Specifies the type of the STORED file. Common storage file types include SEQUENCEFILE, TEXTFILE, and RCFILE. If the file data is plain text, you can use STORED AS TEXTFILE. If the data needs to be compressed, use STORED AS SEQUENCEFILE.
  • LOCATION: specifies the storage LOCATION of the table in the HDFS.
  • AS: Followed by a query statement to create a table based on the query result.
  • LIKE allows users to copy existing table structures, but not data.

Examples

(1) Create a common table

create table if not exists student(
id int, name string
)
row format delimited fields terminated by '\t'
stored as textfile
location '/user/hive/warehouse/student';

(2) Create table based on query result (query result will be added to the newly created table)

create table if not exists student2 as select id, name from student;

(3) Create a table based on the existing table structure

create table if not exists student3 like student;

DML data operations

Data import

Load data into a table

Syntax

hive> load data [local] inpath 'path of data' [overwrite] into table
student [partition (partcol1=val1, ...)];
  • load data: load data
  • local: load data from the local file system into the Hive table; if omitted, data is loaded from HDFS
  • inpath: the path of the data to load
  • overwrite: overwrite the data already in the table; if omitted, the data is appended
  • into table: which table to load into
  • student: the specific table
  • partition: upload the data into a specified partition
Case practice

(0) Create a table

hive (default)> create table student(id string, name string) row format delimited fields terminated by '\t';

(1) Load the local file to hive

hive (default)> load data local inpath '/opt/module/hive/datas/student.txt' into table default.student;

(2) Load an HDFS file into Hive. First upload the file to HDFS:

hive (default)> dfs -put /opt/module/hive/data/student.txt /user/atguigu/hive;

Load HDFS data

hive (default)> load data inpath '/user/atguigu/hive/student.txt' into table default.student;

(3) Load data to overwrite the existing data in the table

hive (default)> load data inpath '/user/atguigu/hive/student.txt'
overwrite into table default.student;

Insert data into a table using a query statement

Basic insert data
hive (default)> insert into table student_par values(1,'wangwu'),(2,'zhaoliu');
Basic insert based on the results of a single-table query
hive (default)> insert overwrite table student_par select id, name from student where month='201709';

insert into: appends data to a table or partition; the original data is not deleted. insert overwrite: overwrites the data already in the table.

Multi-table (multi-partition) insert mode (based on multi-table query results)
hive (default)> from student
insert overwrite table student partition(month='201707')
select id, name where month='201709'
insert overwrite table student partition(month='201706')
select id, name where month='201709';

Location specifies the path to load the data when creating the table

hive (default)> create external table if not exists student5(
id int, name string
)
row format delimited fields terminated by '\t'
location '/student';

Export data

Export the query results to a local directory

hive (default)> insert overwrite local directory '/opt/module/hive/data/export/student' select * from student;

Format and export the query results to a local directory

hive(default)>insert overwrite local directory '/opt/module/hive/data/export/student1'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' select * from student;

Export query results to HDFS (without local)

hive (default)> insert overwrite directory '/user/atguigu/student2'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' select * from student;

Queries

Basic queries

Query statement syntax:

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT number]

Notes: (1) SQL is case-insensitive. (2) SQL can be written on one line or across multiple lines. (3) Keywords cannot be abbreviated or split across lines. (4) Each clause should generally be written on its own line. (5) Use indentation to improve readability.

Like and RLike

  • % represents zero or more characters (any character).
  • _ represents a character.

RLIKE clause: the RLIKE clause is a Hive extension of this functionality that lets matching conditions be specified with Java's more powerful regular-expression language.

Case practice

(1) Find the information of employees whose names start with A

hive (default)> select * from emp where ename LIKE 'A%';

(2) Find the information of the employee whose name is the second letter A

hive (default)> select * from emp where ename LIKE '_A%';

(3) Find the information of the employee whose name contains A

hive (default)> select * from emp where ename RLIKE '[A]';

Having statements

Differences between having and where: (1) Group (aggregate) functions cannot be used after where, but they can be used after having. (2) having is used only in group by statistics statements.
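For instance, using the emp table from the examples below (assuming it has a sal salary column, as is conventional for this example table), having filters on an aggregate in a way where cannot:

```sql
-- Departments whose average salary exceeds 2000; avg(sal) is a group
-- function, so the filter must go in HAVING, not WHERE.
select deptno, avg(sal) avg_sal
from emp
group by deptno
having avg_sal > 2000;
```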

Multi-table join query

hive (default)>SELECT e.ename, d.dname, l.loc_name
FROM emp e
JOIN dept d
ON d.deptno = e.deptno
JOIN location l
ON d.loc = l.loc;

In most cases, Hive starts one MapReduce job for each pair of JOINed objects. In this example, one MapReduce job joins tables e and d, and a second MapReduce job joins the output of the first with table l.

Note: why not join tables d and l first? Because Hive always executes from left to right. Optimization: when joining three or more tables, only one MapReduce job is created if every ON clause uses the same join key.

Sorting

Global sort (Order By)

Order By: Global ordering, with only one Reducer

Internal Sort for each Reduce (Sort By)

Sort By: For large data sets, order by is very inefficient. In many cases global ordering is not required, and sort by can be used instead. Sort by produces one sorted file per Reducer; each Reducer sorts its own data internally, not over the global result set.

  1. Set the number of Reduce tasks
hive (default)> set mapreduce.job.reduces=3;
  2. Check the number of Reduce tasks
hive (default)> set mapreduce.job.reduces;
  3. View employee information in descending order by department number
hive (default)> select * from emp sort by deptno desc;
  4. Write the query results to a file (sorted in descending order by department number)
hive (default)> insert overwrite local directory '/opt/module/data/sortby-result' 
select * from emp sort by deptno desc;

Partition (Distribute By)

Distribute By: In some cases we need to control which Reducer a particular row goes to, usually for a subsequent aggregation. The distribute by clause does this. Distribute by is similar to partitioning (custom partitioners) in MapReduce and is used in combination with sort by. When testing distribute by, be sure to allocate multiple Reducers; otherwise the effect of distribute by cannot be observed.

Case practice:

Partition by department number, then sort by employee number in descending order within each partition.

hive (default)> set mapreduce.job.reduces=3;

hive (default)> insert overwrite local directory
'/opt/module/data/distribute-result' select * from emp distribute by
deptno sort by empno desc;

Note:

  • The partitioning rule of distribute by: the hash code of the distribute-by field is taken modulo the number of Reducers, and rows with the same remainder go into the same partition.
  • Hive requires that a DISTRIBUTE BY statement be written before a SORT BY statement.

Cluster By

When the distribute by and sort by fields are the same, cluster by can be used instead. Besides distributing the data, cluster by also sorts it; however, the sort is ascending only, and ASC or DESC cannot be specified.

The following two statements are equivalent:

hive (default)> select * from emp cluster by deptno;
hive (default)> select * from emp distribute by deptno sort by deptno;

Partition table and bucket table

Partition tables

A partition table corresponds to a separate folder in the HDFS file system, which holds all the data files of that partition. A Hive partition is a subdirectory: it divides a large data set into smaller ones according to business needs. When the expression in the WHERE clause selects only the partitions a query needs, queries become much more efficient.

Partitioning table basic operations

Create partition table syntax
hive (default)> create table dept_partition(
deptno int, dname string, loc string
)
partitioned by (day string)
row format delimited fields terminated by '\t';

Note: Partitioned fields cannot be data that already exists in the table; you can think of partitioned fields as pseudo columns of the table.

Load the data
hive (default)> load data local inpath
'/opt/module/hive/datas/dept_20200401.log' into table dept_partition
partition(day='20200401');
hive (default)> load data local inpath
'/opt/module/hive/datas/dept_20200402.log' into table dept_partition
partition(day='20200402');
hive (default)> load data local inpath
'/opt/module/hive/datas/dept_20200403.log' into table dept_partition
partition(day='20200403');
Query data in a partitioned table

Query a single partition:
hive (default)> select * from dept_partition where day='20200401';

Query multiple partitions:
hive (default)> select * from dept_partition where day='20200401'
union
select * from dept_partition where day='20200402'
union
select * from dept_partition where day='20200403';
hive (default)> select * from dept_partition where day='20200401' or day='20200402' or day='20200403';
Add partitions

Add a single partition:
hive (default)> alter table dept_partition add partition(day='20200404');

Add multiple partitions:
hive (default)> alter table dept_partition add partition(day='20200405') partition(day='20200406');
Delete partitions

Delete a single partition:
hive (default)> alter table dept_partition drop partition(day='20200406');

Delete multiple partitions:
hive (default)> alter table dept_partition drop partition(day='20200404'), partition(day='20200405');
View the partitions of a partition table
hive> show partitions dept_partition;

Secondary partition

Create a secondary partition table
hive (default)> create table dept_partition2(
deptno int, dname string, loc string
)
partitioned by (day string, hour string)
row format delimited fields terminated by '\t';
Load data normally

(1) Load data into the secondary partition table

hive (default)> load data local inpath
'/opt/module/hive/datas/dept_20200401.log' into table
dept_partition2 partition(day='20200401', hour='12');

(2) Query partition data

hive (default)> select * from dept_partition2 where day='20200401' and
hour='12';

There are three ways to associate data with a partitioned table by uploading the data directly into the partition directory.

Method 1: Upload the data, then repair the metadata

Upload data

hive (default)> dfs -mkdir -p
/user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=13;
hive (default)> dfs -put /opt/module/datas/dept_20200401.log
/user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=13;

Query data (the data just uploaded cannot be queried)

hive (default)> select * from dept_partition2 where day='20200401' and
hour='13';

Execute repair command

hive> msck repair table dept_partition2;

Query data again

hive (default)> select * from dept_partition2 where day='20200401' and
hour='13';
Method 2: Add a partition after uploading data

Upload data

hive (default)> dfs -mkdir -p
/user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14;
hive (default)> dfs -put /opt/module/hive/datas/dept_20200401.log
/user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14;

Adding a Partition

hive (default)> alter table dept_partition2 add
partition(day='20200401', hour='14');

Query data

hive (default)> select * from dept_partition2 where day='20200401' and
hour='14';
Method 3: Load data to a partition after creating a folder

Create a directory

hive (default)> dfs -mkdir -p
/user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=15;

Upload data

hive (default)> load data local inpath
'/opt/module/hive/datas/dept_20200401.log' into table
dept_partition2 partition(day='20200401', hour='15');

Query data

hive (default)> select * from dept_partition2 where day='20200401' and
hour='15';

Dynamic partition adjustment

In a relational database, when inserting into a partitioned table, the database automatically places each row into the correct partition based on the value of the partition field. Hive provides a similar mechanism, dynamic partitioning, but the corresponding settings must be configured first.

Example Enable dynamic partition parameter setting

(1) Enable dynamic partitioning (default true, enable)

hive.exec.dynamic.partition=true

(2) Set it to non-strict mode (dynamic partition mode: strict, indicating that at least one partition must be specified as static partition; nonstrict, indicating that dynamic partition is allowed for all partitioned fields.)

hive.exec.dynamic.partition.mode=nonstrict

(3) The maximum total number of dynamic partitions that can be created across all MapReduce nodes. Default: 1000

hive.exec.max.dynamic.partitions=1000

(4) The maximum number of dynamic partitions that can be created on each MR node. Set this parameter based on actual data. For example, if the source data contains one year’s data, that is, the day field has 365 values, the parameter must be set to a value greater than 365. If the default value 100 is used, an error occurs.

hive.exec.max.dynamic.partitions.pernode=100

(5) Maximum number of HDFS files that can be created in the entire MR Job. The default is 100000

hive.exec.max.created.files=100000

(6) Whether an exception is thrown when empty partitions are generated. Generally, no setting is required. The default false

hive.error.on.empty.partition=false
Case practice

Requirement: insert the data of the dept table into the corresponding partitions of the target table dept_partition_dy according to region (the loc field).

(1) Create the target partition table

hive (default)> create table dept_partition_dy(id int, name string)
partitioned by (loc int) row format delimited fields terminated by '\t';

(2) Enable dynamic partitioning and insert the data

set hive.exec.dynamic.partition.mode = nonstrict;
hive (default)> insert into table dept_partition_dy partition(loc) select
deptno, dname, loc from dept;

(3) Check the partition status of the target partition table

hive (default)> show partitions dept_partition_dy;

Bucket tables and sampling queries

Hive data buckets and their application scenarios
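As a brief sketch of what bucketing looks like (the table name stu_buck is illustrative; the syntax follows the CLUSTERED BY clause from the table-creation syntax above):

```sql
-- Rows are assigned to one of 4 buckets by hash(id) mod 4.
create table stu_buck(id int, name string)
clustered by(id)
into 4 buckets
row format delimited fields terminated by '\t';

-- Sampling query: read bucket 1 of 4, hashed on id.
select * from stu_buck tablesample(bucket 1 out of 4 on id);
```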