A concept,
Introduction to the
Hive tables correspond to specific directories in the HDFS. When querying data, all tables are scanned by default, consuming a large amount of time and performance.
A partition table is actually a separate folder corresponding to an HDFS file system, in which all data files of the partition are stored. Hive partitions are directories that divide a large data set into smaller data sets based on service requirements. Queries are much more efficient when the expression in the WHERE clause is used to select the specified partition required by the query.
meaning
- The purpose of partitioning is to distribute data into multiple subdirectories. During query, you can select only the data in certain subdirectories to speed up query efficiency.
- Only partitioned tables have subdirectories (partitioned directories);
- The name of a partition directory is determined by two parts: partition column name = partition column value;
- After data is imported into the specified partition, partition column information is appended to the meta information table.
- The ultimate purpose of partitioning is to use partitioned columns for filtering when querying.
Second, the classification of
Hive partitions are classified into static partitions and dynamic partitions:
- Static partitioning: If the value of a partition is determined, it is called static partitioning. The partition name is specified when you add a partition or load partition data.
- Dynamic partitioning: The value of a partition is nondeterministic, determined by the input data.
Three, use scenarios
Partitioning is often required when managing large data sets, such as partitioning log files by day, to ensure fine-grained partitioning of data and improve query performance.
Basic operation of partition table
4.1 Static Partitioning
- According to
PARTITIONED BY
Create a partition table. A table can have one or more partitions. Each partition is stored as a separate folder in the directory of the table folder. - Partitions exist in table structures in the form of fields, through
describe table
You can view that the field exists, but the field does not store the actual data content, but only represents the partition. - Partition table is divided into two kinds, one is single partition, that is, there is only one level folder directory under the table folder directory. Another is multi-partition, table folders appear under the multi-folder nested pattern.
Data preparation
- dept_01.txt
10 ACCOUNTING 1700
20 RESEARCH 1800
Copy the code
- dept_02.txt
30 SALES 1900
40 OPERATIONS 1700
Copy the code
- dept_03.txt
50 TEST 2000
60 DEV 1900
Copy the code
Single stage partition
(1) Create partition table language method
create table dept_partition(
deptno int, dname string, loc string
)
partitioned by (day string)
row format delimited fields terminated by '\t';
Copy the code
Note: Partitioned fields cannot be data that already exists in the table; you can think of partitioned fields as pseudo columns of the table.
(2) View the partition table structure
describe formatted dept_partition;
Copy the code
(3) Load data
0: jdbc:hive2://hadoop001:10000> load data local inpath
'/home/hadoop/datas/dept/dept_01.txt' into table dept_partition
partition(day='20211205');
0: jdbc:hive2://hadoop001:10000> load data local inpath
'/home/hadoop/datas/dept/dept_02.txt' into table dept_partition
partition(day='20211206');
0: jdbc:hive2://hadoop001:10000> load data local inpath
'/home/hadoop/datas/dept/dept_03.txt' into table dept_partition
partition(day='20211207');
Copy the code
Note: When the partitioned table loads data, you must specify a partition, otherwise a default partition will be generated
(4) Query data in the partition table
- Single partition query
select * from dept_partition where day='20211205';
Copy the code
- Multi-partition federated query
A:
select * from dept_partition where day='20211205'
union
select * from dept_partition where day='20211206'
union
select * from dept_partition where day='20211207';
Copy the code
Method 2:
select * from dept_partition where day='20211205' or day='20211206' or day='20211207';
Copy the code
(5) Add partitions
- Creating a single partition
alter table dept_partition add partition(day='20211208');
Copy the code
- Create multiple partitions at the same time
alter table dept_partition add partition(day='20211209') partition(day='20211210') partition(day='20211211');
Copy the code
Note: Each partition is separated by a space
(6) Delete partitions
- Deleting a Partition
alter table dept_partition drop partition (day='20211211');
Copy the code
- Delete multiple partitions simultaneously
alter table dept_partition drop partition (day='20211209'),partition (day='20211210');
Copy the code
Note: Partitions are separated by empty commas
(7) Check the number of partitions in the partition table
show partitions dept_partition;
-- Run result
+---------------+
| partition |
+---------------+
| day=20211205 |
| day=20211206 |
| day=20211207 |
| day=20211208 |
+---------------+
Copy the code
(8) Rename partitions
alter table default.dept_partition partition(day = '20211208') rename to partition(day = '20211209');
Copy the code
Secondary partition
(1) Create a secondary partition table
create table dept_partition2(
deptno int, dname string, loc string
)
partitioned by (day string, hour string)
row format delimited fields terminated by '\t';
Copy the code
(2) Load data into secondary partition table
load data local inpath
'/home/hadoop/datas/dept/dept_01.txt' into table
dept_partition2 partition(day='20211205'.hour='10');
load data local inpath
'/home/hadoop/datas/dept/dept_02.txt' into table
dept_partition2 partition(day='20211205'.hour='11');
load data local inpath
'/home/hadoop/datas/dept/dept_03.txt' into table
dept_partition2 partition(day='20211206'.hour='10');
Copy the code
(3) Add zones
alter table dept_partition2 add partition(day='20211206'.hour = '11');
Copy the code
(4) Query partition data
select * from dept_partition2 where day='20211205' and hour='10';
-- Multi-condition query
select * from dept_partition2 where (day='20211205' and hour='10') or (day='20211206' and hour='10');
Copy the code
(5) Rename partitions
alter table dept_partition2 partition(day='20211206',hour='11') rename to partition(day='20211206',hour='12');
Copy the code
Repair the partition
There are three ways to associate partitioned tables with data by directly uploading data to partitioned directories
(1) Method 1: Upload data and repair it
- Upload data
-- Create directories in HDFS
dfs -mkdir -p /user/hive/warehouse/dept_partition2/day=20211205/hour=12;
-- Upload local data to the HDFS
dfs -put /home/hadoop/datas/dept/dept_01.txt /user/hive/warehouse/dept_partition2/day=20211205/hour=12;
Copy the code
- Query data
0: jdbc:hive2://hadoop001:10000> select * from dept_partition2 where day='20211205' and hour='12';
-- Query result
+-------------------------+------------------------+----------------------+----------------------+----------------------- +
| dept_partition2.deptno | dept_partition2.dname | dept_partition2.loc | dept_partition2.day | dept_partition2.hour |
+-------------------------+------------------------+----------------------+----------------------+----------------------- +
+-------------------------+------------------------+----------------------+----------------------+----------------------- +
Copy the code
The query result is empty, that is, the data we uploaded is not queried
- Execute repair command
0: jdbc:hive2://hadoop001:10000> msck repair table dept_partition2;
Copy the code
- The following results are obtained after the above query
(2) Method 2: Add a partition after uploading data
- Upload data
-- Create directories in HDFS
dfs -mkdir -p /user/hive/warehouse/dept_partition2/day=20211205/hour=13;
-- Upload local data to the HDFS
dfs -put /home/hadoop/datas/dept/dept_02.txt /user/hive/warehouse/dept_partition2/day=20211205/hour=13;
Copy the code
- Adding a Partition
alter table dept_partition2 add partition(day='20211205'.hour='13');
Copy the code
- Query data
select * from dept_partition2 where day='20211205' and hour='13';
Copy the code
(3) Method 3: Load data to a partition after creating a folder
- Create a directory
dfs -mkdir -p /user/hive/warehouse/dept_partition2/day=20211205/hour=14;
Copy the code
- Upload data
load data local inpath '/home/hadoop/datas/dept/dept_03.txt' into table
dept_partition2 partition(day='20211205'.hour='14');
Copy the code
- Query data
select * from dept_partition2 where day='20211205' and hour='14';
Copy the code
4.2 Dynamic Partitioning
Hive dynamic partition (Hive dynamic partition) : Hive dynamic partition (Hive dynamic partition) : Hive dynamic partition (Hive dynamic partition) : Hive dynamic partition
The value of a dynamic partition is nondeterministic and is determined by the input data.
1. Enable dynamic partition parameters
(1) Enable dynamic partitioning (default true, enable)
0: jdbc:hive2://hadoop001:10000> set hive.exec.dynamic.partition=true;
Copy the code
(2) Specify the dynamic partitioning mode
0: jdbc:hive2://hadoop001:10000> set hive.exec.dynamic.partition.mode=nonstrict;
Copy the code
Dynamic partitioning mode:
strict
: Indicates that at least one partition must be static (default)nonstrict
The: mode allows dynamic partitioning for all partitioned fields
Other unnecessary configurations:
-
The maximum number of dynamic partitions that can be created on all nodes that perform MR. The default is 1000
hive.exec.max.dynamic.partitions=1000 Copy the code
-
The maximum number of dynamic partitions that can be created on each node that performs MR
hive.exec.max.dynamic.partitions.pernode=100 Copy the code
Set this parameter based on actual data. For example, if the source data contains one year’s data, that is, the day field has 365 values, the parameter must be set to a value greater than 365. If the default value 100 is used, an error occurs.
-
Maximum number of HDFS files that can be created in a MR Job. The default is 100000
hive.exec.max.created.files=100000 Copy the code
-
Whether an exception is thrown when an empty partition is generated. Generally, no setting is required. The default false
hive.error.on.empty.partition=false Copy the code
2. Case practice
demand
Insert data from dept into the appropriate partition of the target table dept_partition by locale (LOC field).
(1) Create a target partition table
create table dept_partition_dy(id int, name string)
partitioned by (loc int) row format delimited fields terminated by '\t';
Copy the code
(2) Add data
insert into table dept_partition_dy partition(loc) select deptno, dname, loc from dept_partition;
Copy the code
(3) Check the partition status of the target partition table
show partitions dept_partition_dy;
Copy the code
conclusion
-
Try not to use dynamic partition, because when dynamic partition, reducer number will be allocated to each partition. When the number of partitions is large, reducer number will increase, which is a disaster to the server.
-
The difference between dynamic and static partitioning:
- A static partition will be created with or without data;
- Dynamic partitions are created with result sets, otherwise not.
-
Strict mode provided by Hive to prevent users from accidentally submitting malicious HQL
hive.mapred.mode=nostrict : strict Copy the code
If the mode value is strict, the following three queries are blocked:
- SQL > select partition from table where filter field is not partition field;
- Join query without on condition or WHERE condition;
- For an ORDER by query, there is no limit statement for an order by query.
Attached is a reference article link:
www.boxuegu.com/news/372.ht…
Dunwu. Making. IO/bigdata – tut…
Segmentfault.com/a/119000003…