This is the 30th day of my participation in the August Challenge

Data Definition Language (DDL): The main commands include CREATE, ALTER, and DROP.

DDL is used to define and modify the structure or data type of a database object.

As shown in figure:

1. Database operations

Hive has a default database named default. If you do not specify a database when running HQL statements, the default database is used.

  • Hive database and table names are case-insensitive.

  • Names cannot start with a number;

  • Do not use keywords or special symbols.

  1. Create database syntax:
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[MANAGEDLOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];

-- Create a database; it is stored under /user/hive/warehouse/*.db on HDFS
hive (default)> create database mydb;
hive (default)> dfs -ls /user/hive/warehouse;
Found 2 items
drwxr-xr-x   - root supergroup          0 2020-08-16 17:37 /user/hive/warehouse/mydb.db
drwxr-xr-x   - root supergroup          0 2020-08-16 16:15 /user/hive/warehouse/test1.db


-- Use if not exists to avoid an error when the database already exists
hive (default)> create database if not exists mydb;


-- Create a database with a comment and a specified storage location
hive (default)> create database if not exists mydb2
              > comment 'this is mydb2'
              > location '/user/hive/mydb2.db';
  2. Viewing databases
-- View all databases
hive (default)> show databases;

-- Check database information
hive (default)> desc database mydb2;

hive (default)> desc database extended mydb2;

hive (default)> describe database extended mydb2;
  3. Using a database
use mydb;
  4. Deleting a database
-- Drop an empty database
hive (default)> drop database databasename;

-- If the database is not empty, use cascade to force-drop it
hive (default)> drop database databasename cascade;

2. Create table statement

create [external] table [IF NOT EXISTS] table_name
[(col_name col_type [comment 'col_comment'], ...)]
[comment table_comment]
[partitioned by (col_name col_type [comment col_comment], ...)]
[clustered by (col_name, col_name, ...)
  [sorted by (col_name [ASC|DESC], ...)] into num_buckets buckets]
[row format row_format]
[stored as file_format]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
[AS select_statement];


CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS]
[db_name.]table_name
LIKE existing_table_or_view_name
[LOCATION hdfs_path];
  1. CREATE TABLE: creates a table with the given name; an exception is thrown if the table already exists. Use if not exists to avoid this.

  2. EXTERNAL keyword: creates an external table; without it, an internal table (managed table) is created.

  • When an internal table is deleted, both data and table definitions are deleted.
  • When an external table is deleted, only the definition of the table is deleted, and data is retained.
  • In production environments, use external tables;
  3. COMMENT: the table's comment

  4. PARTITIONED BY: specifies the partition columns of the table

  5. CLUSTERED BY: creates a bucketed table and specifies the bucketing columns

  6. SORTED BY: sorts rows within each bucket by one or more columns. Rarely used

  7. ROW FORMAT (storage clause):

ROW FORMAT DELIMITED
[FIELDS TERMINATED BY char]
[COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char]
[LINES TERMINATED BY char]
| SERDE serde_name
[WITH SERDEPROPERTIES (property_name=property_value,
property_name=property_value, ...)]

A SerDe can be specified when creating a table. If neither ROW FORMAT nor ROW FORMAT DELIMITED is specified, the default SerDe is used. When you create a table you also specify its columns; when a custom SerDe is specified, Hive uses that SerDe to determine the table's actual column data. SerDe is short for Serializer/Deserializer; Hive uses it to serialize and deserialize row objects.
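As an illustration, ROW FORMAT DELIMITED is itself shorthand for Hive's built-in LazySimpleSerDe, which can also be named explicitly. A minimal sketch (t_serde is an assumed table name):

```sql
-- Sketch: naming the built-in LazySimpleSerDe explicitly (t_serde is an assumed name);
-- equivalent in effect to ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
create table t_serde(
  id   int,
  name string
)
row format serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
with serdeproperties ('field.delim' = ',');
```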

  8. stored as SEQUENCEFILE|TEXTFILE|RCFILE

If the file data is plain text, STORED AS TEXTFILE (the default) can be used. If the data needs to be compressed, use STORED AS SEQUENCEFILE.

  9. LOCATION: the table's location on HDFS

  10. TBLPROPERTIES: defines the properties of the table

  11. AS: can be followed by a query statement to create the table from the query result

  12. LIKE: copies an existing table's structure without copying its data
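A minimal sketch of the LIKE form (t1_copy is an assumed name; t1 is the internal table created in the internal-table section):

```sql
-- Sketch: t1_copy gets t1's column definitions and storage format, but no rows
create table if not exists t1_copy like t1;
```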

3. Internal tables & external tables

When creating a table, you can specify the type of the table.

There are two types of tables:

  • Internal table (Management table)
  • External tables

Tips:

  • By default, internal tables are created. To create an external table, use the external keyword
  • When an internal table is deleted, the definition of the table (metadata) and the data are deleted simultaneously
  • When an external table is dropped, only the table definition is deleted and the data is preserved
  • In a production environment, use external tables

(1) Internal table

t1.dat file content:

2;zhangsan;book,TV,code;beijing:chaoyang,shagnhai:pudong
3;lishi;book,code;nanjing:jiangning,taiwan:taibei
4;wangwu;music,book;heilongjiang:haerbin
  1. Create the table (SQL)
-- Create an internal table
create table t1(
  id int,
  name string,
  hobby array<string>,
  addr  map<string, string>
)
row format delimited
fields terminated by ";"
collection items terminated by ","
map keys terminated by ":";

-- Displays table definitions with less information
hive (mydb)> desc t1;


-- Display table definition, display information, friendly format
hive (mydb)> desc formatted t1;


-- Load data
hive (mydb)> load data local inpath '/home/hadoop/data/t1.dat' into table t1;
Loading data to table mydb.t1
OK
Time taken: 0.546 seconds


-- Query the data
hive (mydb)> select * from t1;
OK
t1.id	t1.name	t1.hobby	t1.addr
2	zhangsan	["book","TV","code"]	{"beijing":"chaoyang","shagnhai":"pudong"}
3	lishi	["book","code"]	{"nanjing":"jiangning","taiwan":"taibei"}
4	wangwu	["music","book"]	{"heilongjiang":"haerbin"}
Time taken: 1.214 seconds, Fetched: 3 row(s)


-- Query data files
hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db/t1;
Found 1 items
-rwxr-xr-x   3 root supergroup        148 2020-08-16 18:28 /user/hive/warehouse/mydb.db/t1/t1.dat


-- Drop the table. The table and its data are deleted together
hive (mydb)> drop table t1;
OK
Time taken: 0.317 seconds


-- Query the data file again; it has been deleted as well

(2) External table

-- Create an external table
create external table t2(
  id int,
  name string,
  hobby array<string>,
  addr  map<string, string>
)
row format delimited
fields terminated by ";"
collection items terminated by ","
map keys terminated by ":";


-- Displays table definitions
hive (mydb)> desc formatted t2;

-- Load data
hive (mydb)> load data local inpath '/home/hadoop/data/t1.dat' into table t2;
Loading data to table mydb.t2
OK
Time taken: 0.707 seconds

-- Query the data
hive (mydb)> select * from t2;
OK
t2.id	t2.name	t2.hobby	t2.addr
2	zhangsan	["book","TV","code"]	{"beijing":"chaoyang","shagnhai":"pudong"}
3	lishi	["book","code"]	{"nanjing":"jiangning","taiwan":"taibei"}
4	wangwu	["music","book"]	{"heilongjiang":"haerbin"}
Time taken: 0.298 seconds, Fetched: 3 row(s)


-- Drop the table. The table is dropped, but the directory and data remain
hive (mydb)> drop table t2;
OK
Time taken: 0.291 seconds


-- Query data file again, it still exists
hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db;
Found 1 items
drwxr-xr-x   - root supergroup          0 2020-08-16 18:32 /user/hive/warehouse/mydb.db/t2


hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db/t2;
Found 1 items
-rwxr-xr-x   3 root supergroup        148 2020-08-16 18:32 /user/hive/warehouse/mydb.db/t2/t1.dat


hive (mydb)> dfs -cat /user/hive/warehouse/mydb.db/t2/*;
2;zhangsan;book,TV,code;beijing:chaoyang,shagnhai:pudong
3;lishi;book,code;nanjing:jiangning,taiwan:taibei
4;wangwu;music,book;heilongjiang:haerbin

(3) Converting between internal and external tables

-- Create an internal table, load data, and check the data file and table definition
create table t1(
  id int,
  name string,
  hobby array<string>,
  addr map<string, string>
)
row format delimited
fields terminated by ";"
collection items terminated by ","
map keys terminated by ":";


load data local inpath '/home/hadoop/data/t1.dat' into table t1;
hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db/t1;
hive (mydb)> desc formatted t1;

-- Convert the internal table to an external table
hive (mydb)> alter table t1 set tblproperties('EXTERNAL'='TRUE');

-- Check the table definition to confirm the conversion
hive (mydb)> desc formatted t1;

-- Convert the external table back to an internal table. 'EXTERNAL' must be uppercase; the TRUE/FALSE value is case-insensitive
hive (mydb)> alter table t1 set tblproperties('EXTERNAL'='FALSE');

-- Check the table definition to confirm the conversion
hive (mydb)> desc formatted t1;

(4) Summary

  1. When creating a table:
  • Without the external keyword, an internal table is created;
  • With the external keyword, an external table is created;
  2. When dropping a table:
  • Dropping an external table deletes only the table definition; the data is unaffected
  • Dropping an internal table deletes both its data and its definition
  3. Use cases for external tables
  • Use them when the data must be retained; production environments mostly use external tables

4. Partitioned tables

Hive scans data in the entire table during query.

Due to the large amount of data in a table, full table scanning takes a long time and is inefficient.

Sometimes a query needs to scan only part of the data in a table. Hive's partitioned tables store data in different subdirectories; each subdirectory corresponds to one partition.

When only partial data is queried, full table scan is avoided and query efficiency is improved.

In practice, partitioning is usually based on information such as time or region.
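Concretely, filtering on the partition column lets Hive read only the matching subdirectory. A sketch, using the t3 table created below:

```sql
-- Only the dt=2020-06-01 subdirectory is scanned; other partitions are skipped
select * from t3 where dt = '2020-06-01';
```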

  1. Creating a partitioned table and loading data

Note: the partition column is not part of the data stored in the table; it can be regarded as a pseudo-column

-- Create the table
create table if not exists t3(
  id int,
  name string,
  hobby array<string>,
  addr map<String,string>
)
partitioned by (dt string)
row format delimited
fields terminated by ';'
collection items terminated by ','
map keys terminated by ':';

-- Load data.
load data local inpath "/home/hadoop/data/t1.dat" into table t3
partition(dt="2020-06-01");
load data local inpath "/home/hadoop/data/t1.dat" into table t3
partition(dt="2020-06-02");
  2. Viewing partitions
hive (mydb)> show partitions t3;
OK
partition
dt=2020-06-01
dt=2020-06-02
Time taken: 0.107 seconds, Fetched: 2 row(s)

hive (mydb)> select * from t3;
OK
t3.id	t3.name	t3.hobby	t3.addr	t3.dt
2	zhangsan	["book","TV","code"]	{"beijing":"chaoyang","shagnhai":"pudong"}	2020-06-01
3	lishi	["book","code"]	{"nanjing":"jiangning","taiwan":"taibei"}	2020-06-01
4	wangwu	["music","book"]	{"heilongjiang":"haerbin"}	2020-06-01
2	zhangsan	["book","TV","code"]	{"beijing":"chaoyang","shagnhai":"pudong"}	2020-06-02
3	lishi	["book","code"]	{"nanjing":"jiangning","taiwan":"taibei"}	2020-06-02
4	wangwu	["music","book"]	{"heilongjiang":"haerbin"}	2020-06-02
Time taken: 1.948 seconds, Fetched: 6 row(s)


-- A partition is in fact a subdirectory (pseudo-column)
hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db/t3;
Found 2 items
drwxr-xr-x   - root supergroup          0 2020-08-16 18:41 /user/hive/warehouse/mydb.db/t3/dt=2020-06-01
drwxr-xr-x   - root supergroup          0 2020-08-16 18:41 /user/hive/warehouse/mydb.db/t3/dt=2020-06-02

  3. Adding partitions and loading data
-- Add one partition without loading data
hive (mydb)> alter table t3 add partition(dt='2020-06-03');
OK
Time taken: 0.157 seconds


-- Add multiple partitions without loading data
hive (mydb)> alter table t3 add partition(dt='2020-06-05') partition(dt='2020-06-06');
OK
Time taken: 0.717 seconds


-- Add multiple partitions and load data: first prepare the data
hive (mydb)> dfs -cp /user/hive/warehouse/mydb.db/t3/dt=2020-06-01 /user/hive/warehouse/mydb.db/t3/dt=2020-06-07;

hive (mydb)> dfs -cp /user/hive/warehouse/mydb.db/t3/dt=2020-06-01 /user/hive/warehouse/mydb.db/t3/dt=2020-06-08;


-- Add multiple partitions, pointing each at its data location
hive (mydb)> alter table t3 add partition(dt='2020-06-07') location '/user/hive/warehouse/mydb.db/t3/dt=2020-06-07' partition(dt='2020-06-08') location '/user/hive/warehouse/mydb.db/t3/dt=2020-06-08';
OK
Time taken: 0.393 seconds


-- Query the data
hive (mydb)> select * from t3;
OK
t3.id	t3.name	t3.hobby	t3.addr	t3.dt
2	zhangsan	["book","TV","code"]	{"beijing":"chaoyang","shagnhai":"pudong"}	2020-06-01
3	lishi	["book","code"]	{"nanjing":"jiangning","taiwan":"taibei"}	2020-06-01
4	wangwu	["music","book"]	{"heilongjiang":"haerbin"}	2020-06-01
2	zhangsan	["book","TV","code"]	{"beijing":"chaoyang","shagnhai":"pudong"}	2020-06-02
3	lishi	["book","code"]	{"nanjing":"jiangning","taiwan":"taibei"}	2020-06-02
4	wangwu	["music","book"]	{"heilongjiang":"haerbin"}	2020-06-02
2	zhangsan	["book","TV","code"]	{"beijing":"chaoyang","shagnhai":"pudong"}	2020-06-07
3	lishi	["book","code"]	{"nanjing":"jiangning","taiwan":"taibei"}	2020-06-07
4	wangwu	["music","book"]	{"heilongjiang":"haerbin"}	2020-06-07
2	zhangsan	["book","TV","code"]	{"beijing":"chaoyang","shagnhai":"pudong"}	2020-06-08
3	lishi	["book","code"]	{"nanjing":"jiangning","taiwan":"taibei"}	2020-06-08
4	wangwu	["music","book"]	{"heilongjiang":"haerbin"}	2020-06-08
Time taken: 1.724 seconds, Fetched: 12 row(s)

  4. Modifying a partition's HDFS path
alter table t3 partition(dt='2020-06-01') set location
'/user/hive/warehouse/t3/dt=2020-06-03';
  5. Dropping partitions
-- One or more partitions can be dropped, separated by commas
hive (mydb)> alter table t3 drop partition(dt='2020-06-03'),
           > partition(dt='2020-06-04');
Dropped the partition dt=2020-06-03
OK
Time taken: 0.558 seconds

5. Bucketed tables

When the data in a single partition or table is still too large and partitioning cannot divide it finely enough, bucketing is used to split the data at a finer granularity.

Data is divided into multiple buckets based on a specified column; that is, the data is split into multiple files by that column.

Bucketing principle:

  • In MR: key.hashCode % number of reduceTasks
  • In Hive: bucketing column's hashCode % number of buckets
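This rule can be checked from inside Hive: the built-in hash() function returns the value Hive buckets on (for an int column it is the value itself), so for the course table created below, hash(id) % 3 predicts which bucket file each row lands in. A sketch:

```sql
-- Sketch: predict each row's bucket number (0, 1 or 2),
-- matching the files 000000_0, 000001_0, 000002_0 shown later
select id, name, hash(id) % 3 as bucket from course;
```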
-- Test data: /home/hadoop/data/course.dat
1	java	90
1	c	78
1	python	91
1	hadoop	80
2	java	75
2	c	76
2	python	80
2	hadoop	93
3	java	98
3	c	74
3	python	89
3	hadoop	91
5	java	93
6	c	76
7	python	87
8	hadoop	88
-- Create a bucketed table with 3 buckets; fields separated by Tab
create table course(
  id int,
  name string,
  score int
)
clustered by (id) into 3 buckets
row format delimited fields terminated by "\t";

-- Create a normal table
create table course_common(
  id int,
  name string,
  score int
)
row format delimited fields terminated by "\t";


-- Load data into the normal table
load data local inpath '/home/hadoop/data/course.dat' into table course_common;



-- Load data into the bucketed table with insert ... select ...
hive (mydb)> insert into table course select * from course_common;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20200816201230_d45d45b0-cc91- 473.a9 -ed00d512b941022
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 3
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Cannot run job locally: Number of reducers (= 3) is more than 1
Starting Job = job_1596350031879_0004, Tracking URL = http://linux123:8088/proxy/application_1596350031879_0004/
Kill Command = /opt/lagou/servers/hadoop-2.9.2/bin/hadoop job  -kill job_1596350031879_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 3
2020-08-16 20:12:40,851 Stage-1 map = 0%,  reduce = 0%
2020-08-16 20:12:47,369 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.07 sec
2020-08-16 20:12:53,622 Stage-1 map = 100%,  reduce = 33%, Cumulative CPU 2.17 sec
2020-08-16 20:12:54,653 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 3.77 sec
2020-08-16 20:12:55,710 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.47 sec
MapReduce Total cumulative CPU time: 5 seconds 470 msec
Ended Job = job_1596350031879_0004
Loading data to table mydb.course
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 3   Cumulative CPU: 5.47 sec   HDFS Read: 16087 HDFS Write: 374 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 470 msec
OK
course_common.id	course_common.name	course_common.score
Time taken: 27.092 seconds



-- Observe the bucket data. Rows are assigned by (bucketing column's hashCode) % (number of buckets)
hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db/course;
Found 3 items
-rwxr-xr-x   3 root supergroup         57 2020-08-16 20:12 /user/hive/warehouse/mydb.db/course/000000_0
-rwxr-xr-x   3 root supergroup         53 2020-08-16 20:12 /user/hive/warehouse/mydb.db/course/000001_0
-rwxr-xr-x   3 root supergroup         63 2020-08-16 20:12 /user/hive/warehouse/mydb.db/course/000002_0


hive (mydb)> dfs -cat /user/hive/warehouse/mydb.db/course/000000_0;
6	c	76
3	hadoop	91
3	python	89
3	c	74
3	java	98

hive (mydb)> dfs -cat /user/hive/warehouse/mydb.db/course/000001_0;
1	python	91
1	c	78
1	java	90
7	python	87
1	hadoop	80

hive (mydb)> dfs -cat /user/hive/warehouse/mydb.db/course/000002_0;
2	hadoop	93
2	python	80
2	c	76
8	hadoop	88
2	java	75
5	java	93

Remark:

  • Bucketing rule: bucketing column's hashCode % number of buckets

  • Load data into bucketed tables with insert ... select ...

  • In Hive 1.x and earlier, hive.enforce.bucketing=true must be set; in Hive 2.x this parameter was removed, and bucketing is always enforced.

6. Alter table & drop table

-- Rename a table: rename to
hive (mydb)> alter table course_common rename to course_common1;
OK
Time taken: 0.134 seconds


-- Change the column name. change column
hive (mydb)> alter table course_common1 change column id cid int;
OK
Time taken: 0.175 seconds


-- Change the field type. change column
hive (mydb)> alter table course_common1 change column cid cid string;
OK
Time taken: 0.16 seconds




-- When changing a column's data type, the conversion must be compatible: for example, int can be converted to string, but string cannot be converted to int
hive (mydb)> desc course_common1;
OK
col_name	data_type	comment
cid                 	string              	                    
name                	string              	                    
score               	string              	                    
Time taken: 0.072 seconds, Fetched: 3 row(s)

hive (mydb)> alter table course_common1 change column score score int;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. The following columns have types incompatible with the existing columns in their respective positions :
score



-- Add fields. add columns
hive (mydb)> alter table course_common1 add columns (common string);
OK
Time taken: 0.094 seconds

hive (mydb)> desc course_common1;
OK
col_name	data_type	comment
cid                 	string              	                    
name                	string              	                    
score               	string              	                    
common              	string              	                    
Time taken: 0.078 seconds, Fetched: 4 row(s)



-- Delete/replace columns: replace columns
-- This is just removing fields from metadata, not changing data files on HDFS
hive (mydb)> alter table course_common1 replace columns (id string, cname string, score string);
OK
Time taken: 0.131 seconds

hive (mydb)> desc course_common1;
OK
col_name	data_type	comment
id                  	string              	                    
cname               	string              	                    
score               	string              	                    
Time taken: 0.077 seconds, Fetched: 3 row(s)



-- Drop the table
hive (mydb)> drop table course_common1;
OK
Time taken: 0.153 seconds

hive (mydb)> show tables;
OK
tab_name
course
t3
Time taken: 0.076 seconds, Fetched: 2 row(s)
