1. Partitioned tables
1. Understand the concept of a partitioned table
A partitioned table corresponds to a separate folder on the HDFS file system, and that folder holds all of the partition's data files. A Hive partition is a directory: it divides a large data set into smaller ones according to business requirements. When the expression in the WHERE clause selects only the partitions a query needs, the query becomes much more efficient because Hive scans just those directories.
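As a rough intuition for "partition = directory", the sketch below builds the path a partition's files live under. The helper is an illustration only, not a Hive API; the table location matches the one shown later in this article.

```python
# Sketch: a Hive partition is just a subdirectory named "<field>=<value>"
# under the table's HDFS location; the partition's data files live inside it.
def partition_path(table_location: str, **partition_spec: str) -> str:
    """Build the directory that holds one partition's data files."""
    parts = "/".join(f"{k}={v}" for k, v in partition_spec.items())
    return f"{table_location}/{parts}"

loc = "hdfs://mayi101:9000/other/warehouse/mayi.db/dept_partition"
print(partition_path(loc, month="20210405"))
# -> hdfs://mayi101:9000/other/warehouse/mayi.db/dept_partition/month=20210405
```

A query with `where month = '20210405'` then only has to scan that one directory.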
2. Create a partition table
create table if not exists dept_partition
(
deptno int,
dname string,
loc string
)
partitioned by (month string) -- unlike a normal CREATE TABLE, this PARTITIONED BY clause adds a month partition directory
row format delimited fields terminated by '\t';
3. Load data into the partitioned table
Note: a partition must be specified when loading data into a partitioned table
load data local inpath '/home/mayi/mayi_data/dept.txt'
into table mayi.dept_partition partition (month='20210405');
load data local inpath '/home/mayi/mayi_data/dept.txt'
into table mayi.dept_partition partition (month='20210406');
The loaded data can now be seen under the corresponding partition directories.
4. Query partition data
-- Single-partition query
select * from dept_partition where month = '20210405';
-- Multi-partition union query
select * from dept_partition where month = '20210405'
union
select * from dept_partition where month = '20210406';
5. Add partitions
Multiple partitions can be created at once:
alter table dept_partition add partition (month='20210301') partition (month='20210302');
6. Drop a partition
You can delete multiple partitions simultaneously
alter table dept_partition drop partition (month='20210301'), partition (month='20210302');
7. Show the partitions of a table
hive (mayi)> show partitions dept_partition;
OK
partition
month=20210405
month=20210406
Time taken: 0.666 seconds, Fetched: 2 row(s)
8. Display the partition table structure.
hive (mayi)> desc formatted dept_partition;
OK
# col_name        data_type    comment
deptno            int
dname             string
loc               string

# Partition Information
# col_name        data_type    comment
month             string

# Detailed Table Information
Database:         mayi
Owner:            mayi
CreateTime:       Tue Apr 06 09:49:20 CST 2021
LastAccessTime:   UNKNOWN
Protect Mode:     None
Retention:        0
Location:         hdfs://mayi101:9000/other/warehouse/mayi.db/dept_partition
Table Type:       MANAGED_TABLE
Table Parameters:
    transient_lastDdlTime    1617673760

# Storage Information
SerDe Library:    org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:      org.apache.hadoop.mapred.TextInputFormat
OutputFormat:     org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:       No
Num Buckets:      -1
Bucket Columns:   []
Sort Columns:     []
Storage Desc Params:
    field.delim              \t
    serialization.format     \t
Time taken: 0.207 seconds, Fetched: 34 row(s)
2. Notes on partitioned tables
1. Create a secondary (two-level) partitioned table
create table if not exists dept_partition2
(
deptno int,
dname string,
loc string
)
partitioned by (month string, day string) -- unlike a normal CREATE TABLE, this adds a month directory with a day directory nested inside it
row format delimited fields terminated by '\t';
2. Load data to the secondary partition
1. Load properly
load data local inpath '/home/mayi/mayi_data/dept.txt'
into table mayi.dept_partition2 partition (month='202103', day='01');
-- Query
select * from dept_partition2 where month = '202103' and day = '01';
2. Upload data directly to a partition directory and associate it with the partitioned table
Normally, data is loaded into a partition right after the partitioned table is created. If data files are instead uploaded straight to HDFS, there are three ways to make Hive aware of them:
1. Upload the data, then repair the metadata
-- 1. Upload data
hive (default)> dfs -mkdir -p
/user/hive/warehouse/dept_partition2/month=201709/day=12;
hive (default)> dfs -put /opt/module/datas/dept.txt /user/hive/warehouse/dept_partition2/month=201709/day=12;
-- 2. The uploaded data cannot be queried
hive (default)> select * from dept_partition2 where month='201709' and day='12';
-- 3. After running the repair command, the data can be queried
hive> msck repair table dept_partition2;
-- 4. Query information again
hive (default)> select * from dept_partition2 where month='201709' and day='12';
2. Add a partition after uploading data
-- 1. Upload data
hive (default)> dfs -mkdir -p
/user/hive/warehouse/dept_partition2/month=201709/day=11;
hive (default)> dfs -put /opt/module/datas/dept.txt /user/hive/warehouse/dept_partition2/month=201709/day=11;
-- 2. Add a partition
hive (default)> alter table dept_partition2 add partition(month='201709', day='11');
-- 3. Query partitions
hive (default)> select * from dept_partition2 where month='201709' and day='11';
3. Create the partition folder first, then load data into the partition
This is essentially the normal loading path: load data creates the partition directory on its own, so creating the folder beforehand is not actually necessary.
Create a folder
hive (default)> dfs -mkdir -p
/user/hive/warehouse/dept_partition2/month=201709/day=10;
-- 2. Upload data
hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table
dept_partition2 partition(month='201709', day='10');
-- 3. Query data
hive (default)> select * from dept_partition2 where month='201709' and day='10';
3. Dynamic partition adjustment
In a relational database, the database automatically inserts data into the corresponding Partition based on the value of the Partition field. Hive provides a similar Dynamic Partition mechanism, but Dynamic partitions need to be configured accordingly.
1. Enable dynamic partition parameters
1. Enable the dynamic partition function (default true, enabled):
   hive.exec.dynamic.partition=true
2. Set non-strict mode (for dynamic partitioning the default is strict, which requires at least one partition field to be static; nonstrict allows all partition fields to be dynamic):
   hive.exec.dynamic.partition.mode=nonstrict
3. Maximum number of dynamic partitions that can be created across all nodes running MR (default 1000):
   hive.exec.max.dynamic.partitions=1000
4. Maximum number of dynamic partitions that can be created on each node running MR. Set this according to the actual data: if the source data covers one year, the day field has 365 distinct values, so the parameter must be greater than 365; the default of 100 would cause an error:
   hive.exec.max.dynamic.partitions.pernode=100
5. Maximum number of HDFS files a single MR job may create (default 100000):
   hive.exec.max.created.files=100000
6. Whether to throw an exception when an empty partition is generated; usually no change is needed (default false):
   hive.error.on.empty.partition=false
2. Example
Requirement: insert data from the dept table into the target table dept_partition, partitioning dynamically by region (the loc field).
-- Create the target partitioned table
create table if not exists dept_partition(
    id int,
    name string
)
partitioned by (location int)
row format delimited fields terminated by '\t';
Set up dynamic partitioning
-- Enable dynamic partitioning
hive (mayi)> set hive.exec.dynamic.partition=true;
-- Set to non-strict mode
hive (mayi)> set hive.exec.dynamic.partition.mode=nonstrict;
-- Insert data into the partitioned table
insert into table dept_partition partition (location)
select deptno, dname, loc from dept;
View the partitioned table
hive (mayi)> show partitions dept_partition;
OK
partition
location=1700
location=1800
location=1900
Time taken: 0.163 seconds, Fetched: 3 row(s)
Note
How does the target table match the partition field? The insert selects deptno, dname and loc, while dept_partition has two regular columns (id, name) plus one partition column (location), so the last field of the SELECT, loc, is taken as the dynamic partition value. Partition fields can only be placed at the end of the select list and must not be out of order. Querying four fields would raise an error, because the table has only three columns including the partition field. Note that Hive infers the partition value from the position of the query fields, not from their names.
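The positional rule can be sketched in a few lines of Python (a toy illustration of the matching logic, not anything Hive executes):

```python
# Dynamic partitioning matches SELECT fields to target columns by POSITION:
# regular columns first, partition columns last. Field names are ignored.
table_columns = ["id", "name"]       # regular columns of dept_partition
partition_columns = ["location"]     # partition column, always last

select_fields = ["deptno", "dname", "loc"]

# Selecting four fields would fail: the table has only three in total.
assert len(select_fields) == len(table_columns) + len(partition_columns)

n = len(table_columns)
column_mapping = dict(zip(table_columns, select_fields[:n]))
partition_source = dict(zip(partition_columns, select_fields[n:]))
print(column_mapping)    # {'id': 'deptno', 'name': 'dname'}
print(partition_source)  # {'location': 'loc'}  matched by position, not name
```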
4. Bucketed tables
Partitioning provides a convenient way to isolate data and optimize queries, but not every data set can be partitioned sensibly. Hive can organize a table or a partition further into buckets, which give a finer-grained division of the data.
Bucketing is another technique for breaking a data set into more manageable parts.
Partitioning divides the storage paths of the data; bucketing divides the data files.
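For intuition, Hive picks a row's bucket from the hash of the CLUSTERED BY column modulo the bucket count; for an int column such as id, the hash code is the value itself. Below is a minimal Python sketch of that idea, not Hive's actual hashing code:

```python
# How `clustered by (id) into 4 buckets` distributes rows, in spirit:
# bucket = (hash & Integer.MAX_VALUE) % num_buckets, where the hash of a
# Hive int is simply its value (the mask keeps the result non-negative).
NUM_BUCKETS = 4

def bucket_of(id_value: int) -> int:
    return (id_value & 0x7FFFFFFF) % NUM_BUCKETS

rows = [1, 2, 3, 4, 5, 6, 7, 8]
for b in range(NUM_BUCKETS):
    print(f"bucket {b}: {[r for r in rows if bucket_of(r) == b]}")
```

Each bucket corresponds to one output file in the table directory, which is why the number of reducers must match the bucket count.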
1. Create a bucketed table and import data (do not use load data)
Create a bucket table
create table stu_buck(
id int,
name string
)
clustered by (id) into 4 buckets -- hash rows into 4 buckets by id value
row format delimited fields terminated by '\t';
Note: load data is not suitable for importing directly into a bucketed table,
because load data does not distribute the rows into buckets.
To prevent load data from being used on bucketed tables, enable the strict check:
set hive.strict.checks.bucketing = true;
This can also be set in the Hive configuration items in CM. Once enabled, a load data operation on a bucketed table reports an error.
Instead, create a temporary table and use load data to import the txt file into it.
Create temporary tables
create table temp_buck(id int, name string)
row format delimited fields terminated by '\t';
-- Import data
load data local inpath '/tools/test_buck.txt' into table temp_buck;
Then create the bucketed table in Hive, enable enforced bucketing, and use an insert ... select statement to import the data from the temporary table into the bucketed table:
-- Enable enforced bucketing
set hive.enforce.bucketing=true;
-- Restrict load operations on bucketed tables
set hive.strict.checks.bucketing = true;
-- insert select
insert into table stu_buck select id, name from temp_buck;
-- Bucketing succeeds
2. When creating a bucket table, import data in subquery mode
-- (1) First create a normal (non-bucketed) temporary table
create table temp_buck(
id int,
name string
)
row format delimited fields terminated by '\t';
-- (2) Import data into the temporary table
load data local inpath '/home/mayi/mayi_data/student.txt' into table temp_buck;
-- (3) Clear the stu_buck table
truncate table stu_buck;
select * from stu_buck;
-- (4) Import data to bucket table, through the way of sub-query
insert into table stu_buck
select id, name from temp_buck;
-- (5) If hive.enforce.bucketing is not enabled, the table directory still contains only a single bucket file
3. Pay special attention
hive (mayi)> set hive.enforce.bucketing=true;
hive (mayi)> insert into stu_buck select id, name from temp_buck;
Note:
-- Restrict load operations on bucketed tables
set hive.strict.checks.bucketing = true;
-- Set the number of reducers to -1 so Hive matches it to the number of buckets
set mapreduce.job.reduces=-1;
4. Query by bucket sampling
For very large data sets, sometimes users need to use a representative query result rather than the whole result. Hive can meet this requirement by sampling tables.
Query the data in table stu_buck.
hive (default)> select * from stu_buck tablesample(bucket 1 out of 4 on id);
Note: tablesample is the sampling clause. The syntax is: tablesample(bucket x out of y).
y must be a multiple or a factor of the table's total number of buckets. Hive determines the sampling proportion from y: for a table with 4 buckets, y=2 extracts data from (4/2=) 2 buckets, while y=8 extracts (4/8=) 1/2 of one bucket's data.
x indicates which bucket to start extracting from. When several buckets are extracted, each subsequent bucket number is the previous one plus y. For example, with 4 buckets in total, tablesample(bucket 1 out of 2) extracts (4/2=) 2 buckets: the 1st (x) and the 3rd (x+y).
Note: x must be less than or equal to y, otherwise an error is raised:
FAILED: SemanticException [Error 10061]: Numerator should not be bigger than denominator in sample clause for table stu_buck
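The x/y arithmetic above can be checked with a short sketch (an illustration of the rule only, assuming buckets are numbered from 1; when y exceeds the bucket count, it returns the single bucket that is partially read):

```python
# Which buckets does tablesample(bucket x out of y) touch, for a table
# with `total` buckets?  When y is a factor of total, Hive reads
# total/y whole buckets: x, x+y, x+2y, ...
def sampled_buckets(x: int, y: int, total: int) -> list[int]:
    if x > y:
        # Mirrors Hive's SemanticException [Error 10061]
        raise ValueError("Numerator should not be bigger than denominator")
    if total % y == 0:      # y is a factor of total: whole buckets
        return list(range(x, total + 1, y))
    return [x]              # y > total: a 1/(y/total) fraction of bucket x

print(sampled_buckets(1, 2, 4))  # buckets 1 and 3, as in the note above
```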