A concept,

Introduction to the

Hive tables correspond to specific directories in the HDFS. When querying data, all tables are scanned by default, consuming a large amount of time and performance.

A partition table is actually a separate folder corresponding to an HDFS file system, in which all data files of the partition are stored. Hive partitions are directories that divide a large data set into smaller data sets based on service requirements. Queries are much more efficient when the expression in the WHERE clause is used to select the specified partition required by the query.

meaning

  • The purpose of partitioning is to distribute data into multiple subdirectories. During query, you can select only the data in certain subdirectories to speed up query efficiency.
  • Only partitioned tables have subdirectories (partitioned directories);
  • The name of a partition directory is determined by two parts: partition column name = partition column value;
  • After data is imported into the specified partition, partition column information is appended to the meta information table.
  • The ultimate purpose of partitioning is to use partitioned columns for filtering when querying.

Second, the classification of

Hive partitions are classified into static partitions and dynamic partitions:

  • Static partitioning: If the value of a partition is determined, it is called static partitioning. The partition name is specified when you add a partition or load partition data.
  • Dynamic partitioning: The value of a partition is nondeterministic, determined by the input data.

Three, use scenarios

Partitioning is often required when managing large data sets, such as partitioning log files by day, to ensure fine-grained partitioning of data and improve query performance.

Basic operation of partition table

4.1 Static Partitioning

  • According toPARTITIONED BYCreate a partition table. A table can have one or more partitions. Each partition is stored as a separate folder in the directory of the table folder.
  • Partitions exist in table structures in the form of fields, throughdescribe tableYou can view that the field exists, but the field does not store the actual data content, but only represents the partition.
  • Partition table is divided into two kinds, one is single partition, that is, there is only one level folder directory under the table folder directory. Another is multi-partition, table folders appear under the multi-folder nested pattern.
Data preparation
  • dept_01.txt
10	ACCOUNTING	1700
20	RESEARCH	1800
Copy the code
  • dept_02.txt
30	SALES	1900
40	OPERATIONS	1700
Copy the code
  • dept_03.txt
50	TEST	2000
60	DEV	1900
Copy the code
Single stage partition

(1) Create partition table language method

create table dept_partition(
deptno int, dname string, loc string
)
partitioned by (day string)
row format delimited fields terminated by '\t';
Copy the code

Note: Partitioned fields cannot be data that already exists in the table; you can think of partitioned fields as pseudo columns of the table.

(2) View the partition table structure

describe formatted dept_partition;
Copy the code

(3) Load data

0: jdbc:hive2://hadoop001:10000> load data local inpath 
'/home/hadoop/datas/dept/dept_01.txt' into table dept_partition 
partition(day='20211205');

0: jdbc:hive2://hadoop001:10000> load data local inpath 
'/home/hadoop/datas/dept/dept_02.txt' into table dept_partition 
partition(day='20211206');

0: jdbc:hive2://hadoop001:10000> load data local inpath 
'/home/hadoop/datas/dept/dept_03.txt' into table dept_partition 
partition(day='20211207');
Copy the code

Note: When the partitioned table loads data, you must specify a partition, otherwise a default partition will be generated

(4) Query data in the partition table

  • Single partition query
select * from dept_partition where day='20211205';
Copy the code
  • Multi-partition federated query

A:

select * from dept_partition where day='20211205'
union
select * from dept_partition where day='20211206'
union
select * from dept_partition where day='20211207';
Copy the code

Method 2:

select * from dept_partition where day='20211205' or day='20211206' or day='20211207';
Copy the code

(5) Add partitions

  • Creating a single partition
alter table dept_partition add partition(day='20211208');
Copy the code

  • Create multiple partitions at the same time
alter table dept_partition add partition(day='20211209') partition(day='20211210') partition(day='20211211');
Copy the code

Note: Each partition is separated by a space

(6) Delete partitions

  • Deleting a Partition
alter table dept_partition drop partition (day='20211211');
Copy the code
  • Delete multiple partitions simultaneously
alter table dept_partition drop partition (day='20211209'),partition (day='20211210');
Copy the code

Note: Partitions are separated by empty commas

(7) Check the number of partitions in the partition table

show partitions dept_partition;

-- Run result
+---------------+
|   partition   |
+---------------+
| day=20211205  |
| day=20211206  |
| day=20211207  |
| day=20211208  |
+---------------+
Copy the code

(8) Rename partitions

alter table default.dept_partition partition(day = '20211208') rename to partition(day = '20211209');
Copy the code
Secondary partition

(1) Create a secondary partition table

create table dept_partition2(
 deptno int, dname string, loc string
 )
 partitioned by (day string, hour string)
 row format delimited fields terminated by '\t';
Copy the code

(2) Load data into secondary partition table

load data local inpath 
'/home/hadoop/datas/dept/dept_01.txt' into table
dept_partition2 partition(day='20211205'.hour='10');

load data local inpath 
'/home/hadoop/datas/dept/dept_02.txt' into table
dept_partition2 partition(day='20211205'.hour='11');

load data local inpath 
'/home/hadoop/datas/dept/dept_03.txt' into table
dept_partition2 partition(day='20211206'.hour='10');
Copy the code

(3) Add zones

alter table  dept_partition2 add partition(day='20211206'.hour = '11');
Copy the code

(4) Query partition data

select * from dept_partition2 where day='20211205' and hour='10';

-- Multi-condition query
select * from dept_partition2 where (day='20211205' and hour='10') or (day='20211206' and hour='10');
Copy the code

(5) Rename partitions

alter table dept_partition2 partition(day='20211206',hour='11') rename to partition(day='20211206',hour='12');
Copy the code
Repair the partition

There are three ways to associate partitioned tables with data by directly uploading data to partitioned directories

(1) Method 1: Upload data and repair it

  • Upload data
-- Create directories in HDFS
dfs -mkdir -p /user/hive/warehouse/dept_partition2/day=20211205/hour=12;

-- Upload local data to the HDFS
dfs -put /home/hadoop/datas/dept/dept_01.txt /user/hive/warehouse/dept_partition2/day=20211205/hour=12;
Copy the code

  • Query data
0: jdbc:hive2://hadoop001:10000> select * from dept_partition2 where day='20211205' and hour='12';

-- Query result
+-------------------------+------------------------+----------------------+----------------------+----------------------- +
| dept_partition2.deptno  | dept_partition2.dname  | dept_partition2.loc  | dept_partition2.day  | dept_partition2.hour  |
+-------------------------+------------------------+----------------------+----------------------+----------------------- +
+-------------------------+------------------------+----------------------+----------------------+----------------------- +
Copy the code

The query result is empty, that is, the data we uploaded is not queried

  • Execute repair command
0: jdbc:hive2://hadoop001:10000> msck repair table dept_partition2;
Copy the code
  • The following results are obtained after the above query

(2) Method 2: Add a partition after uploading data

  • Upload data
-- Create directories in HDFS
dfs -mkdir -p /user/hive/warehouse/dept_partition2/day=20211205/hour=13;

-- Upload local data to the HDFS
dfs -put /home/hadoop/datas/dept/dept_02.txt /user/hive/warehouse/dept_partition2/day=20211205/hour=13;
Copy the code
  • Adding a Partition
alter table dept_partition2 add partition(day='20211205'.hour='13');
Copy the code
  • Query data
select * from dept_partition2 where day='20211205' and hour='13';
Copy the code

(3) Method 3: Load data to a partition after creating a folder

  • Create a directory
dfs -mkdir -p /user/hive/warehouse/dept_partition2/day=20211205/hour=14;
Copy the code
  • Upload data
load data local inpath '/home/hadoop/datas/dept/dept_03.txt' into table
dept_partition2 partition(day='20211205'.hour='14');
Copy the code
  • Query data
select * from dept_partition2 where day='20211205' and hour='14';
Copy the code

4.2 Dynamic Partitioning

Hive dynamic partition (Hive dynamic partition) : Hive dynamic partition (Hive dynamic partition) : Hive dynamic partition (Hive dynamic partition) : Hive dynamic partition

The value of a dynamic partition is nondeterministic and is determined by the input data.

1. Enable dynamic partition parameters

(1) Enable dynamic partitioning (default true, enable)

0: jdbc:hive2://hadoop001:10000> set hive.exec.dynamic.partition=true;
Copy the code

(2) Specify the dynamic partitioning mode

0: jdbc:hive2://hadoop001:10000> set hive.exec.dynamic.partition.mode=nonstrict;
Copy the code

Dynamic partitioning mode:

  • strict: Indicates that at least one partition must be static (default)
  • nonstrictThe: mode allows dynamic partitioning for all partitioned fields

Other unnecessary configurations:

  • The maximum number of dynamic partitions that can be created on all nodes that perform MR. The default is 1000

    hive.exec.max.dynamic.partitions=1000
    Copy the code
  • The maximum number of dynamic partitions that can be created on each node that performs MR

    hive.exec.max.dynamic.partitions.pernode=100
    Copy the code

    Set this parameter based on actual data. For example, if the source data contains one year’s data, that is, the day field has 365 values, the parameter must be set to a value greater than 365. If the default value 100 is used, an error occurs.

  • Maximum number of HDFS files that can be created in a MR Job. The default is 100000

    hive.exec.max.created.files=100000
    Copy the code
  • Whether an exception is thrown when an empty partition is generated. Generally, no setting is required. The default false

    hive.error.on.empty.partition=false
    Copy the code
2. Case practice

demand

Insert data from dept into the appropriate partition of the target table dept_partition by locale (LOC field).

(1) Create a target partition table

create table dept_partition_dy(id int, name string) 
partitioned by (loc int) row format delimited fields terminated by '\t';
Copy the code

(2) Add data

insert into table dept_partition_dy partition(loc) select deptno, dname, loc from dept_partition;
Copy the code

(3) Check the partition status of the target partition table

show partitions dept_partition_dy;
Copy the code

conclusion

  • Try not to use dynamic partition, because when dynamic partition, reducer number will be allocated to each partition. When the number of partitions is large, reducer number will increase, which is a disaster to the server.

  • The difference between dynamic and static partitioning:

    • A static partition will be created with or without data;
    • Dynamic partitions are created with result sets, otherwise not.
  • Strict mode provided by Hive to prevent users from accidentally submitting malicious HQL

    hive.mapred.mode=nostrict : strict
    Copy the code

    If the mode value is strict, the following three queries are blocked:

    • SQL > select partition from table where filter field is not partition field;
    • Join query without on condition or WHERE condition;
    • For an ORDER by query, there is no limit statement for an order by query.

Attached is a reference article link:

www.boxuegu.com/news/372.ht…

Dunwu. Making. IO/bigdata – tut…

Segmentfault.com/a/119000003…