This is the 25th day of my participation in the November Gwen Challenge. See details of the event: The Last Gwen Challenge 2021.
Partition table and bucket table
1. Partition table
A partition table corresponds to a separate folder on HDFS, which holds all the data files of that partition. A Hive partition is simply a directory: a large data set is divided into smaller data sets according to business requirements. Queries that select only the required partitions through expressions in the WHERE clause are much more efficient.
1.1. Basic operations of partition table
- Introduce partitioned tables (logs need to be managed by date; simulated here with department information)

  ```
  dept_20211122.log
  dept_20211123.log
  dept_20211124.log
  ```

- Create partition table syntax

  ```sql
  hive (default)> create table dept_partition(
               > deptno int, dname string, loc string)
               > partitioned by (day string)
               > row format delimited fields terminated by '\t';
  ```

  Note: the partition field cannot be a column that already exists in the table; you can think of the partition field as a pseudo column of the table.
- Load data into a partitioned table

  - Data preparation

    dept_20211122.log

    ```
    10	ACCOUNTING	1700
    20	RESEARCH	1800
    ```

    dept_20211123.log

    ```
    30	SALES	1900
    40	OPERATIONS	1700
    ```

    dept_20211124.log

    ```
    50	TEST	2000
    60	DEV	1900
    ```

  - Load the data

    ```sql
    hive (default)> load data local inpath '/opt/module/hive-3.1.2/datas/dept_20211122.log' into table dept_partition partition(day='20211122');
    hive (default)> load data local inpath '/opt/module/hive-3.1.2/datas/dept_20211123.log' into table dept_partition partition(day='20211123');
    hive (default)> load data local inpath '/opt/module/hive-3.1.2/datas/dept_20211124.log' into table dept_partition partition(day='20211124');
    ```

    Note: the partition must be specified when loading data into a partitioned table.
- Query data in a partitioned table

  - Single partition query

    ```sql
    hive (default)> select * from dept_partition where day='20211124';
    ```

  - Multi-partition union query

    ```sql
    hive (default)> select * from dept_partition where day='20211123'
                 > union
                 > select * from dept_partition where day='20211124'
                 > union
                 > select * from dept_partition where day='20211122';
    ```

    or

    ```sql
    hive (default)> select * from dept_partition where day = '20211122' or day = '20211123' or day = '20211124';
    ```
- Add partitions

  - Add a single partition

    ```sql
    hive (default)> alter table dept_partition add partition(day='20211125');
    ```

  - Add multiple partitions

    ```sql
    hive (default)> alter table dept_partition add partition(day='20211126') partition(day='20211127');
    ```

- Drop partitions

  - Drop a single partition

    ```sql
    hive (default)> alter table dept_partition drop partition(day='20211125');
    ```

  - Drop multiple partitions

    ```sql
    hive (default)> alter table dept_partition drop partition(day='20211126'), partition(day='20211127');
    ```

  Note: when adding multiple partitions the partition clauses are separated by spaces, but when dropping multiple partitions they are separated by commas.
- See how many partitions the partition table has

  ```sql
  hive (default)> show partitions dept_partition;
  ```

- View the partitioned table structure

  ```sql
  hive (default)> desc formatted dept_partition;
  ```
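On HDFS, each partition of the table above is just a subdirectory named `day=<value>` under the table's directory; the `dfs -mkdir` commands later in this article follow the same pattern. A minimal Python sketch of that layout, assuming the default warehouse location `/user/hive/warehouse` (it is configurable via `hive.metastore.warehouse.dir`):

```python
# Sketch of how Hive lays out partition directories on HDFS.
# Assumes the default warehouse location (a configurable assumption).
WAREHOUSE = "/user/hive/warehouse"

def partition_path(table, **partition_spec):
    """Build the HDFS directory for one partition, e.g. day='20211122'."""
    parts = "/".join(f"{k}={v}" for k, v in partition_spec.items())
    return f"{WAREHOUSE}/{table}/{parts}"

print(partition_path("dept_partition", day="20211122"))
# /user/hive/warehouse/dept_partition/day=20211122
print(partition_path("dept_partition2", day="20211124", hour="18"))
# /user/hive/warehouse/dept_partition2/day=20211124/hour=18
```

A query with `where day='20211122'` only has to read the files under that one directory, which is exactly why partition pruning speeds up queries.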
1.2. Secondary partition
Consider: If there is a large amount of log data in a day, how can you split the data again?
- Create a secondary partition table

  ```sql
  hive (default)> create table dept_partition2(
               > deptno int, dname string, loc string)
               > partitioned by (day string, hour string)
               > row format delimited fields terminated by '\t';
  ```

- Load data normally

  - Load data into the secondary partitioned table

    ```sql
    hive (default)> load data local inpath '/opt/module/hive-3.1.2/datas/dept_20211124.log' into table dept_partition2 partition(day='20211124', hour=17);
    ```

  - Query partition data

    ```sql
    hive (default)> select * from dept_partition2 where day = '20211124' and hour = 17;
    ```
- Three ways to associate data that was uploaded directly to a partition directory with the partitioned table

- Method 1: upload data, then repair the table

  - Upload data

    ```sql
    hive (default)> dfs -mkdir -p /user/hive/warehouse/dept_partition2/day=20211124/hour=18;
    hive (default)> dfs -put /opt/module/hive-3.1.2/datas/dept_20211123.log /user/hive/warehouse/dept_partition2/day=20211124/hour=18;
    ```

  - Query data (the data just uploaded cannot be queried yet)

    ```sql
    hive (default)> select * from dept_partition2 where day = '20211124' and hour = 18;
    ```

  - Execute the repair command

    ```sql
    hive (default)> msck repair table dept_partition2;
    OK
    Partitions not in metastore:    dept_partition2:day=20211124/hour=18
    Repair: Added partition to metastore dept_partition2:day=20211124/hour=18
    Time taken: 0.172 seconds, Fetched: 2 row(s)
    ```

  - Query the data again

    ```sql
    hive (default)> select * from dept_partition2 where day = '20211124' and hour = 18;
    ```
- Method 2: add a partition after uploading data

  - Upload data

    ```sql
    hive (default)> dfs -mkdir -p /user/hive/warehouse/dept_partition2/day=20211124/hour=19;
    hive (default)> dfs -put /opt/module/hive-3.1.2/datas/dept_20211122.log /user/hive/warehouse/dept_partition2/day=20211124/hour=19;
    ```

  - Add the partition

    ```sql
    hive (default)> alter table dept_partition2 add partition(day='20211124', hour=19);
    ```

  - Query data

    ```sql
    hive (default)> select * from dept_partition2 where day = '20211124' and hour = 19;
    ```
- Method 3: create the directory, then load data into the partition

  - Create a directory

    ```sql
    hive (default)> dfs -mkdir -p /user/hive/warehouse/dept_partition2/day=20211124/hour=20;
    ```

  - Upload data

    ```sql
    hive (default)> load data local inpath '/opt/module/hive-3.1.2/datas/dept_20211122.log' into table dept_partition2 partition(day='20211124', hour=20);
    ```

  - Query data

    ```sql
    hive (default)> select * from dept_partition2 where day = '20211124' and hour = 20;
    ```
1.3. Dynamic partition adjustment
In a relational database, when inserting into a partitioned table, the database automatically routes each row to the corresponding partition based on the value of the partition field. Hive provides a similar mechanism, dynamic partition, but using it requires the corresponding configuration.
- Dynamic partition parameter settings

  - Enable dynamic partitioning (default true, enabled)

    ```
    hive.exec.dynamic.partition=true
    ```

  - Set non-strict mode (the dynamic partition mode defaults to strict, which requires at least one partition field to be static; nonstrict allows all partition fields to be dynamic)

    ```
    hive.exec.dynamic.partition.mode=nonstrict
    ```

  - Maximum number of dynamic partitions that can be created across all nodes executing the MR job (default 1000)

    ```
    hive.exec.max.dynamic.partitions=1000
    ```

  - Maximum number of dynamic partitions that can be created on each node executing the MR job. Set this based on the actual data: for example, if the source data contains one year of data, the day field has 365 distinct values, so this parameter must be set to a value greater than 365; with the default of 100, the job fails.

    ```
    hive.exec.max.dynamic.partitions.pernode=100
    ```

  - Maximum number of HDFS files a whole MR job may create (default 100000)

    ```
    hive.exec.max.created.files=100000
    ```

  - Whether to throw an exception when an empty partition is generated. Usually no setting is required (default false)

    ```
    hive.error.on.empty.partition=false
    ```
- A hands-on case

  Requirement: insert data from the dept table into the corresponding partitions of the target table dept_partition_dy, partitioned by region (the loc field).

  - Create the target partition table

    ```sql
    hive (default)> create table dept_partition_dy(id int, name string)
                 > partitioned by (loc int)
                 > row format delimited fields terminated by '\t';
    ```

  - Set dynamic partitioning and insert the data

    ```sql
    hive (default)> set hive.exec.dynamic.partition.mode = nonstrict;
    hive (default)> insert into table dept_partition_dy partition(loc) select deptno, dname, loc from dept;
    ```

  - View the partition status of the target partition table

    ```sql
    hive (default)> show partitions dept_partition_dy;
    OK
    partition
    loc=1700
    loc=1800
    loc=1900
    ```
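The routing the insert performs can be sketched in plain Python: each row lands in the partition named after the value of the last selected column (here loc). The sample rows below are hypothetical, only to illustrate the mechanism:

```python
from collections import defaultdict

# Hypothetical dept rows: (deptno, dname, loc). Dynamic partitioning routes
# each row by the value of the last selected column (loc).
rows = [(10, "ACCOUNTING", 1700), (20, "RESEARCH", 1800),
        (30, "SALES", 1900), (40, "OPERATIONS", 1700)]

partitions = defaultdict(list)
for deptno, dname, loc in rows:
    # (deptno, dname) fill the table columns; loc names the partition.
    partitions[f"loc={loc}"].append((deptno, dname))

print(sorted(partitions))  # ['loc=1700', 'loc=1800', 'loc=1900']
```

This is also why the strict-mode check exists: a mistake in the last select column can silently fan the data out into a huge number of partitions.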
2. Bucket table
Partitioning provides a convenient way to isolate data and optimize queries. However, not every data set can be partitioned sensibly. For a table or partition, Hive can go one step further and organize the data into buckets, a more fine-grained division of the data range.
Bucketing is another technique for breaking a data set up into more manageable parts.
Partitioning targets the data's storage path (directories); bucketing targets the data files.
- First create a bucket table

  - Data preparation

    ```
    1001	ss1
    1002	ss2
    1003	ss3
    1004	ss4
    1005	ss5
    1006	ss6
    1007	ss7
    1008	ss8
    1009	ss9
    1010	ss10
    1011	ss11
    1012	ss12
    1013	ss13
    1014	ss14
    1015	ss15
    1016	ss16
    ```

  - Create the bucket table

    ```sql
    hive (default)> create table stu_buck(id int, name string)
                 > clustered by(id) into 4 buckets
                 > row format delimited fields terminated by '\t';
    ```

  - View the table structure

    ```sql
    hive (default)> desc formatted stu_buck;
    Num Buckets:            4
    ```

  - Import data into the bucket table with load

    ```sql
    hive (default)> load data local inpath '/opt/module/hive-3.1.2/datas/student' into table stu_buck;
    ```

  - Check whether the created bucket table is divided into four buckets

  - Query the bucketed data

    ```sql
    hive (default)> select * from stu_buck;
    ```
- The bucketing rule

  Judging from the results, Hive decides which bucket a record goes into by taking the hash of the bucketing column's value modulo the number of buckets.
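The rule can be checked against the sample data. For an int column, Hive's hash is the value itself (an assumption that holds for Hive's int type but not for strings), so with 4 buckets the bucket of each id is simply id % 4:

```python
# Sketch of Hive's bucketing rule: bucket = hash(key) % num_buckets.
# For Hive int columns the hash is the value itself (assumption noted above).
NUM_BUCKETS = 4

def bucket_of(key: int) -> int:
    return key % NUM_BUCKETS

buckets = {}
for i in range(1001, 1017):      # the 16 sample ids 1001..1016
    buckets.setdefault(bucket_of(i), []).append(i)

# The 16 rows spread evenly: 4 ids per bucket.
print({b: len(v) for b, v in sorted(buckets.items())})  # {0: 4, 1: 4, 2: 4, 3: 4}
```

Each bucket becomes one file under the table directory, which is why four files appear on HDFS.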
- Precautions for bucket table operations

  - Set the number of reduce tasks to -1 and let the job decide how many it needs, or set it to a value greater than or equal to the number of buckets in the bucket table

  - Load data into the bucket table from HDFS, to avoid the problem of local files not being found

  - Do not use local mode

- Import data into the bucket table with insert

  ```sql
  hive (default)> insert into table stu_buck select * from student_insert;
  ```
3. Sample query
For very large data sets, users sometimes need a representative sample of the result rather than the whole result. Hive can meet this need by sampling the table.

Syntax: TABLESAMPLE(BUCKET x OUT OF y)

Query data in table stu_buck:

```sql
hive (default)> select * from stu_buck tablesample(bucket 1 out of 4 on id);
```

Note: the value of x must be less than or equal to the value of y; otherwise Hive reports an error (the numerator must not be bigger than the denominator in the sample clause).
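Which buckets the clause reads can be sketched as follows: on an n-bucket table, BUCKET x OUT OF y starts at bucket x-1 (0-based) and then steps by y, returning n/y of the buckets. This is a sketch assuming y divides the bucket count n:

```python
def sampled_buckets(x: int, y: int, n: int) -> list:
    """0-based buckets read by TABLESAMPLE(BUCKET x OUT OF y) on an n-bucket table.

    Sketch assuming y divides n; x must satisfy x <= y.
    """
    if x > y:
        raise ValueError("numerator x must not be bigger than denominator y")
    return list(range(x - 1, n, y))

print(sampled_buckets(1, 4, 4))  # [0]    -> the query above reads one of the 4 buckets
print(sampled_buckets(1, 2, 4))  # [0, 2] -> half of the 4 buckets
```

So `bucket 1 out of 4` on stu_buck returns roughly a quarter of the rows: exactly the contents of the first bucket.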
Related links
- Big data Hive learning journey 3
- Big data Hive learning journey 2
- Big data Hive learning journey 1