This section describes Hive partition tables

A concept,

Introduction to the

Hive tables correspond to specific directories in the HDFS. When querying data, all tables are scanned by default, consuming a large amount of time and performance.

A partition table is actually a separate folder corresponding to an HDFS file system, in which all data files of the partition are stored. Hive partitions are directories that divide a large data set into smaller data sets based on service requirements. Queries are much more efficient when the expression in the WHERE clause is used to select the specified partition required by the query.

meaning

The purpose of partitioning is to distribute data into multiple subdirectories. During query, you can select only the data in certain subdirectories to speed up query efficiency.
Only partitioned tables have subdirectories (partitioned directories);
The name of a partition directory is determined by two parts: partition column name = partition column value;
After data is imported into the specified partition, partition column information is appended to the meta information table.
The ultimate purpose of partitioning is to use partitioned columns for filtering when querying.

Second, the classification of

Hive partitions are classified into static partitions and dynamic partitions:

Static partitioning: If the value of a partition is determined, it is called static partitioning. The partition name is specified when you add a partition or load partition data.
Dynamic partitioning: The value of a partition is nondeterministic, determined by the input data.

Three, use scenarios

Partitioning is often required when managing large data sets, such as partitioning log files by day, to ensure fine-grained partitioning of data and improve query performance.

Basic operation of partition table

4.1 Static Partitioning

According toPARTITIONED BYCreate a partition table. A table can have one or more partitions. Each partition is stored as a separate folder in the directory of the table folder.
Partitions exist in table structures in the form of fields, throughdescribe tableYou can view that the field exists, but the field does not store the actual data content, but only represents the partition.
Partition table is divided into two kinds, one is single partition, that is, there is only one level folder directory under the table folder directory. Another is multi-partition, table folders appear under the multi-folder nested pattern.

Data preparation

dept_01.txt

10	ACCOUNTING	1700
20	RESEARCH	1800
Copy the code

dept_02.txt

30	SALES	1900
40	OPERATIONS	1700
Copy the code

dept_03.txt

50	TEST	2000
60	DEV	1900
Copy the code

Single stage partition

(1) Create partition table language method

create table dept_partition(
deptno int, dname string, loc string
)
partitioned by (day string)
row format delimited fields terminated by '\t';
Copy the code

Note: Partitioned fields cannot be data that already exists in the table; you can think of partitioned fields as pseudo columns of the table.

(2) View the partition table structure

describe formatted dept_partition;
Copy the code

(3) Load data

0: jdbc:hive2://hadoop001:10000> load data local inpath 
'/home/hadoop/datas/dept/dept_01.txt' into table dept_partition 
partition(day='20211205');

0: jdbc:hive2://hadoop001:10000> load data local inpath 
'/home/hadoop/datas/dept/dept_02.txt' into table dept_partition 
partition(day='20211206');

0: jdbc:hive2://hadoop001:10000> load data local inpath 
'/home/hadoop/datas/dept/dept_03.txt' into table dept_partition 
partition(day='20211207');
Copy the code

Note: When the partitioned table loads data, you must specify a partition, otherwise a default partition will be generated

(4) Query data in the partition table

Single partition query

select * from dept_partition where day='20211205';
Copy the code

Multi-partition federated query

select * from dept_partition where day='20211205'
union
select * from dept_partition where day='20211206'
union
select * from dept_partition where day='20211207';
Copy the code

Method 2:

select * from dept_partition where day='20211205' or day='20211206' or day='20211207';
Copy the code

(5) Add partitions

Creating a single partition

alter table dept_partition add partition(day='20211208');
Copy the code

Create multiple partitions at the same time

alter table dept_partition add partition(day='20211209') partition(day='20211210') partition(day='20211211');
Copy the code

Note: Each partition is separated by a space

(6) Delete partitions

Deleting a Partition

alter table dept_partition drop partition (day='20211211');
Copy the code

Delete multiple partitions simultaneously

alter table dept_partition drop partition (day='20211209'),partition (day='20211210');
Copy the code

Note: Partitions are separated by empty commas

(7) Check the number of partitions in the partition table

show partitions dept_partition;

-- Run result
+---------------+
|   partition   |
+---------------+
| day=20211205  |
| day=20211206  |
| day=20211207  |
| day=20211208  |
+---------------+
Copy the code

(8) Rename partitions

alter table default.dept_partition partition(day = '20211208') rename to partition(day = '20211209');
Copy the code

Secondary partition

(1) Create a secondary partition table

create table dept_partition2(
 deptno int, dname string, loc string
 )
 partitioned by (day string, hour string)
 row format delimited fields terminated by '\t';
Copy the code

(2) Load data into secondary partition table

load data local inpath 
'/home/hadoop/datas/dept/dept_01.txt' into table
dept_partition2 partition(day='20211205'.hour='10');

load data local inpath 
'/home/hadoop/datas/dept/dept_02.txt' into table
dept_partition2 partition(day='20211205'.hour='11');

load data local inpath 
'/home/hadoop/datas/dept/dept_03.txt' into table
dept_partition2 partition(day='20211206'.hour='10');
Copy the code

(3) Add zones

alter table  dept_partition2 add partition(day='20211206'.hour = '11');
Copy the code

(4) Query partition data

select * from dept_partition2 where day='20211205' and hour='10';

-- Multi-condition query
select * from dept_partition2 where (day='20211205' and hour='10') or (day='20211206' and hour='10');
Copy the code

(5) Rename partitions

alter table dept_partition2 partition(day='20211206',hour='11') rename to partition(day='20211206',hour='12');
Copy the code

Repair the partition

There are three ways to associate partitioned tables with data by directly uploading data to partitioned directories

(1) Method 1: Upload data and repair it

Upload data

-- Create directories in HDFS
dfs -mkdir -p /user/hive/warehouse/dept_partition2/day=20211205/hour=12;

-- Upload local data to the HDFS
dfs -put /home/hadoop/datas/dept/dept_01.txt /user/hive/warehouse/dept_partition2/day=20211205/hour=12;
Copy the code

Query data

0: jdbc:hive2://hadoop001:10000> select * from dept_partition2 where day='20211205' and hour='12';

-- Query result
+-------------------------+------------------------+----------------------+----------------------+----------------------- +
| dept_partition2.deptno  | dept_partition2.dname  | dept_partition2.loc  | dept_partition2.day  | dept_partition2.hour  |
+-------------------------+------------------------+----------------------+----------------------+----------------------- +
+-------------------------+------------------------+----------------------+----------------------+----------------------- +
Copy the code

The query result is empty, that is, the data we uploaded is not queried

Execute repair command

0: jdbc:hive2://hadoop001:10000> msck repair table dept_partition2;
Copy the code

The following results are obtained after the above query

(2) Method 2: Add a partition after uploading data

Upload data

-- Create directories in HDFS
dfs -mkdir -p /user/hive/warehouse/dept_partition2/day=20211205/hour=13;

-- Upload local data to the HDFS
dfs -put /home/hadoop/datas/dept/dept_02.txt /user/hive/warehouse/dept_partition2/day=20211205/hour=13;
Copy the code

Adding a Partition

alter table dept_partition2 add partition(day='20211205'.hour='13');
Copy the code

Query data

select * from dept_partition2 where day='20211205' and hour='13';
Copy the code

(3) Method 3: Load data to a partition after creating a folder

Create a directory

dfs -mkdir -p /user/hive/warehouse/dept_partition2/day=20211205/hour=14;
Copy the code

Upload data

load data local inpath '/home/hadoop/datas/dept/dept_03.txt' into table
dept_partition2 partition(day='20211205'.hour='14');
Copy the code

Query data

select * from dept_partition2 where day='20211205' and hour='14';
Copy the code

4.2 Dynamic Partitioning

Hive dynamic partition (Hive dynamic partition) : Hive dynamic partition (Hive dynamic partition) : Hive dynamic partition (Hive dynamic partition) : Hive dynamic partition

The value of a dynamic partition is nondeterministic and is determined by the input data.

1. Enable dynamic partition parameters

(1) Enable dynamic partitioning (default true, enable)

0: jdbc:hive2://hadoop001:10000> set hive.exec.dynamic.partition=true;
Copy the code

(2) Specify the dynamic partitioning mode

0: jdbc:hive2://hadoop001:10000> set hive.exec.dynamic.partition.mode=nonstrict;
Copy the code

Dynamic partitioning mode:

strict: Indicates that at least one partition must be static (default)
nonstrictThe: mode allows dynamic partitioning for all partitioned fields

Other unnecessary configurations:

The maximum number of dynamic partitions that can be created on all nodes that perform MR. The default is 1000
```
hive.exec.max.dynamic.partitions=1000
Copy the code
```
The maximum number of dynamic partitions that can be created on each node that performs MR
```
hive.exec.max.dynamic.partitions.pernode=100
Copy the code
```
Set this parameter based on actual data. For example, if the source data contains one year’s data, that is, the day field has 365 values, the parameter must be set to a value greater than 365. If the default value 100 is used, an error occurs.
Maximum number of HDFS files that can be created in a MR Job. The default is 100000
```
hive.exec.max.created.files=100000
Copy the code
```
Whether an exception is thrown when an empty partition is generated. Generally, no setting is required. The default false
```
hive.error.on.empty.partition=false
Copy the code
```

2. Case practice

demand

Insert data from dept into the appropriate partition of the target table dept_partition by locale (LOC field).

(1) Create a target partition table

create table dept_partition_dy(id int, name string) 
partitioned by (loc int) row format delimited fields terminated by '\t';
Copy the code

(2) Add data

insert into table dept_partition_dy partition(loc) select deptno, dname, loc from dept_partition;
Copy the code

(3) Check the partition status of the target partition table

show partitions dept_partition_dy;
Copy the code

conclusion

Try not to use dynamic partition, because when dynamic partition, reducer number will be allocated to each partition. When the number of partitions is large, reducer number will increase, which is a disaster to the server.
The difference between dynamic and static partitioning:
- A static partition will be created with or without data;
- Dynamic partitions are created with result sets, otherwise not.
Strict mode provided by Hive to prevent users from accidentally submitting malicious HQL
```
hive.mapred.mode=nostrict : strict
Copy the code
```
If the mode value is strict, the following three queries are blocked:
- SQL > select partition from table where filter field is not partition field;
- Join query without on condition or WHERE condition;
- For an ORDER by query, there is no limit statement for an order by query.

Attached is a reference article link:

www.boxuegu.com/news/372.ht…

Dunwu. Making. IO/bigdata – tut…

Segmentfault.com/a/119000003…