Welcome to my GitHub

Github.com/zq2599/blog…

Contents: classified summaries of all original articles, with supporting source code, covering Java, Docker, Kubernetes, DevOps, and more.

Hive Learning Notes series navigation

  1. Basic data types
  2. Complex data types
  3. Internal and external tables
  4. Partition tables
  5. Bucketing
  6. HiveQL basics
  7. Built-in functions
  8. Sqoop
  9. Basic UDF
  10. User-defined aggregation functions (UDAF)
  11. UDTF

Overview of this article

A Hive partition table is essentially a hierarchy of directories: records in the same partition are stored in the same subdirectory. There are two kinds of partitioning, static and dynamic; this article tries each in turn.

Static partitioning (single-field partitioning)

Table t9 has three fields: name, age, and city, with city serving as the partition column.

  1. Create the table:
create table t9 (name string, age int) 
partitioned by (city string)
row format delimited 
fields terminated by ',';
  2. Check the table structure:
hive> desc t9;
OK
name                	string              	                    
age                 	int                 	                    
city                	string              	                    
	 	 
# Partition Information	 	 
# col_name            	data_type           	comment             
	 	 
city                	string              	                    
Time taken: 0.159 seconds, Fetched: 8 row(s)
  3. Create a text file named 009.txt in which each line contains only name and age. The partition field city is not set in the file; it will be specified in the import command:
tom,11
jerry,12
  4. The import command is as follows. Note that the city value is specified in the command itself, so every record imported in one load gets the same city value:
load data 
local inpath '/home/hadoop/temp/202010/25/009.txt' 
into table t9 
partition(city='shenzhen');
  5. Run the import again, this time changing the value of city from shenzhen to guangzhou:
load data 
local inpath '/home/hadoop/temp/202010/25/009.txt' 
into table t9 
partition(city='guangzhou');
  6. Query the data: there are now four records, and city has two distinct values:
hive> select * from t9;
OK
t9.name	t9.age	t9.city
tom	11	guangzhou
jerry	12	guangzhou
tom	11	shenzhen
jerry	12	shenzhen
Time taken: 0.104 seconds, Fetched: 4 row(s)
  7. As mentioned earlier, partitions are in fact separate subdirectories. Checking the t9 table's directory confirms this: it contains two subdirectories, city=guangzhou and city=shenzhen.

  8. Open one of the files: it contains only the name and age fields, exactly as in the source file:
[hadoop@node0 bin]$ ./hadoop fs -ls /user/hive/warehouse/t9/city=guangzhou
Found 1 items
-rwxr-xr-x   3 hadoop supergroup         16 2020-10-31 16:47 /user/hive/warehouse/t9/city=guangzhou/009.txt
[hadoop@node0 bin]$ ./hadoop fs -cat /user/hive/warehouse/t9/city=guangzhou/009.txt
tom,11
jerry,12
[hadoop@node0 bin]$
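The effect of the two load commands above can be sketched in a few lines of Python. This is only an illustration of the directory layout Hive produces, not Hive internals; the paths and the copy-based "load" are simplified assumptions:

```python
import os
import shutil
import tempfile

def static_load(warehouse, table, src_file, city):
    """Simulate 'load data ... into table ... partition(city=...)':
    copy the source file into that partition's subdirectory."""
    # every record in the file lands in the same partition directory
    part_dir = os.path.join(warehouse, table, "city=%s" % city)
    os.makedirs(part_dir, exist_ok=True)
    shutil.copy(src_file, part_dir)
    return part_dir

warehouse = tempfile.mkdtemp()
src = os.path.join(warehouse, "009.txt")
with open(src, "w") as f:
    f.write("tom,11\njerry,12\n")

static_load(warehouse, "t9", src, "shenzhen")
static_load(warehouse, "t9", src, "guangzhou")

# two subdirectories, each holding a copy of 009.txt -> four rows in total
print(sorted(os.listdir(os.path.join(warehouse, "t9"))))
# → ['city=guangzhou', 'city=shenzhen']
```

Loading the same file twice with different partition values duplicates the rows under two directories, which is exactly why the query above returned four records.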

That covers static partitioning on a single field. Next, try partitioning on multiple fields.

Static partitioning (multi-field partitioning)

  1. The table creation statement is as follows; compared with t9, t10 has two partition fields, province and city:
create table t10 (name string, age int) 
partitioned by (province string, city string)
row format delimited 
fields terminated by ',';
  2. In the statement above, province is written before city in the partition clause, which means the first-level subdirectories hold the province values, and under each province subdirectory further subdirectories are created per city value.

  3. Import data with province='shanxi', city='xian':

load data 
local inpath '/home/hadoop/temp/202010/25/009.txt' 
into table t10 
partition(province='shanxi', city='xian');
  4. Import data with province='shanxi', city='hanzhong':
load data 
local inpath '/home/hadoop/temp/202010/25/009.txt' 
into table t10 
partition(province='shanxi', city='hanzhong');
  5. Import data with province='guangdong', city='guangzhou':
load data 
local inpath '/home/hadoop/temp/202010/25/009.txt' 
into table t10 
partition(province='guangdong', city='guangzhou');
  6. Import data with province='guangdong', city='shenzhen':
load data 
local inpath '/home/hadoop/temp/202010/25/009.txt' 
into table t10 
partition(province='guangdong', city='shenzhen');
  7. The full data set is now:
hive> select * from t10;
OK
t10.name	t10.age	t10.province	t10.city
tom	11	guangdong	guangzhou
jerry	12	guangdong	guangzhou
tom	11	guangdong	shenzhen
jerry	12	guangdong	shenzhen
tom	11	shanxi	hanzhong
jerry	12	shanxi	hanzhong
tom	11	shanxi	xian
jerry	12	shanxi	xian
Time taken: 0.129 seconds, Fetched: 8 row(s)
  8. Check the HDFS directories: the first-level directories are the values of the province field.

  9. Open a first-level directory: the second-level directories are the city values.

10. Check the data:

[hadoop@node0 bin]$ ./hadoop fs -cat /user/hive/warehouse/t10/province=shanxi/city=hanzhong/009.txt
tom,11
jerry,12
  11. That is the basic operation of static partitioning. One inconvenience is apparent: adding data to each partition requires its own load command. Dynamic partitioning solves this problem.
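The mapping from a multi-column partition spec to a nested directory path, as seen in the HDFS listings above, can be sketched like this (an illustration only; the warehouse location is the default one shown earlier):

```python
def partition_path(warehouse, table, partition_spec):
    # partition_spec is an ordered list of (column, value) pairs; the
    # column order from 'partitioned by (...)' decides the nesting:
    # the first column becomes the first-level directory, and so on
    parts = "/".join("%s=%s" % (col, val) for col, val in partition_spec)
    return "%s/%s/%s" % (warehouse, table, parts)

p = partition_path("/user/hive/warehouse", "t10",
                   [("province", "shanxi"), ("city", "hanzhong")])
print(p)  # → /user/hive/warehouse/t10/province=shanxi/city=hanzhong
```

Swapping the order of the pairs would swap the directory levels, which is why province must come first in t10's partition clause.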

Dynamic partitioning

  1. With dynamic partitioning, Hive chooses the partition directory based on the data, instead of requiring the user to specify it.
  2. Run the following command to enable dynamic partitioning:
set hive.exec.dynamic.partition=true;
  3. The property hive.exec.dynamic.partition.mode defaults to strict, which requires at least one partition column to be static. Set it to nonstrict to lift this restriction and allow all partition columns to be dynamic:
set hive.exec.dynamic.partition.mode=nonstrict;
  4. Create an external table named t11 with four ordinary columns and no partition columns:
create external table t11 (name string, age int, province string, city string) 
row format delimited 
fields terminated by ',' 
location '/data/external_t11';
  5. Create a file named 011.txt with the following content:
tom,11,guangdong,guangzhou
jerry,12,guangdong,shenzhen
tony,13,shanxi,xian
john,14,shanxi,hanzhong
  6. Load the four records from 011.txt into table t11:
load data 
local inpath '/home/hadoop/temp/202010/25/011.txt' 
into table t11;
  7. Next, create table t12; the goal is to write all of t11's data into t12.
  8. t12 is partitioned by province and city:
create table t12 (name string, age int) 
partitioned by (province string, city string)
row format delimited 
fields terminated by ',';
  9. Write all data from t11 into t12; note the use of overwrite:
insert overwrite table t12 
partition(province, city) 
select name, age, province, city from t11;
  10. Check the directories through HDFS: the first- and second-level subdirectories are as expected.

11. Finally, check a data file in a second-level subdirectory to see the records of that partition:

[hadoop@node0 bin]$ ./hadoop fs -cat /user/hive/warehouse/t12/province=guangdong/city=guangzhou/000000_0
tom,11
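The behavior of the dynamic-partition insert can be sketched as a group-by over the partition columns (a simplification for illustration, not Hive's actual execution):

```python
from collections import defaultdict

rows = [  # contents of t11: (name, age, province, city)
    ("tom", 11, "guangdong", "guangzhou"),
    ("jerry", 12, "guangdong", "shenzhen"),
    ("tony", 13, "shanxi", "xian"),
    ("john", 14, "shanxi", "hanzhong"),
]

# Hive groups rows by the values of the dynamic partition columns and
# writes each group into its own province=.../city=... subdirectory;
# only the non-partition columns (name, age) are stored in the files
partitions = defaultdict(list)
for name, age, province, city in rows:
    partitions["province=%s/city=%s" % (province, city)].append((name, age))

for path in sorted(partitions):
    print(path, partitions[path])
```

Since every row in 011.txt has a distinct (province, city) pair, four partition directories are created, each holding one record, which matches the single tom,11 line seen in the guangdong/guangzhou file above.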

At this point, the study of partition tables is complete; I hope it provides a useful reference.

You are not alone; Xinchen's original articles accompany you all the way

  1. Java series
  2. Spring series
  3. Docker series
  4. Kubernetes series
  5. Database + middleware series
  6. Conversation series

Welcome to follow my WeChat official account: Programmer Xinchen

Search for "Programmer Xinchen" on WeChat. I am Xinchen, looking forward to exploring the Java world with you…

Github.com/zq2599/blog…