This is the 25th day of my participation in the November Gwen Challenge. See details of the event: The Last Gwen Challenge 2021.
Partition table and bucket table
1. Partition table
A partition table corresponds to a separate folder on HDFS, which holds all the data files of that partition. A Hive partition is simply a directory: a large data set is divided into smaller data sets according to business requirements. Queries that select only the required partitions through expressions in the WHERE clause are much more efficient.
1.1. Basic operations of partition table
- Introduce partitioned tables (logs need to be managed by date; simulated here with department information)

  ```
  dept_20211122.log
  dept_20211123.log
  dept_20211124.log
  ```

- Create partition table syntax

  ```sql
  hive (default)> create table dept_partition(
               > deptno int, dname string, loc string)
               > partitioned by (day string)
               > row format delimited fields terminated by '\t';
  ```

  Note: the partition field cannot be a column that already exists in the table; you can think of the partition field as a pseudo column of the table.
- Load data into a partitioned table

  - Data preparation

    dept_20211122.log

    ```
    10	ACCOUNTING	1700
    20	RESEARCH	1800
    ```

    dept_20211123.log

    ```
    30	SALES	1900
    40	OPERATIONS	1700
    ```

    dept_20211124.log

    ```
    50	TEST	2000
    60	DEV	1900
    ```

  - Load the data

    ```sql
    hive (default)> load data local inpath '/opt/module/hive-3.1.2/datas/dept_20211122.log' into table dept_partition partition(day='20211122');
    hive (default)> load data local inpath '/opt/module/hive-3.1.2/datas/dept_20211123.log' into table dept_partition partition(day='20211123');
    hive (default)> load data local inpath '/opt/module/hive-3.1.2/datas/dept_20211124.log' into table dept_partition partition(day='20211124');
    ```

    Note: the partition must be specified when loading data into a partitioned table.
- Query data in a partitioned table

  - Single partition query

    ```sql
    hive (default)> select * from dept_partition where day='20211124';
    ```

  - Multi-partition union query

    ```sql
    hive (default)> select * from dept_partition where day='20211123'
                 > union
                 > select * from dept_partition where day='20211124'
                 > union
                 > select * from dept_partition where day='20211122';
    ```

    or

    ```sql
    hive (default)> select * from dept_partition where day = '20211122' or day = '20211123' or day = '20211124';
    ```
- Add partitions

  - Add a single partition

    ```sql
    hive (default)> alter table dept_partition add partition(day='20211125');
    ```

  - Add multiple partitions

    ```sql
    hive (default)> alter table dept_partition add partition(day='20211126') partition(day='20211127');
    ```

- Drop partitions

  - Drop a single partition

    ```sql
    hive (default)> alter table dept_partition drop partition(day='20211125');
    ```

  - Drop multiple partitions

    ```sql
    hive (default)> alter table dept_partition drop partition(day='20211126'), partition(day='20211127');
    ```

  Note: when adding multiple partitions the partition clauses are separated by spaces, but when dropping multiple partitions they are separated by commas.
- See how many partitions the partition table has

  ```sql
  hive (default)> show partitions dept_partition;
  ```

- View the partitioned table structure

  ```sql
  hive (default)> desc formatted dept_partition;
  ```
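On HDFS, each partition of the table above is just a subdirectory named `day=<value>` under the table's directory; the `dfs -mkdir` commands later in this article follow the same pattern. A minimal Python sketch of that layout, assuming the default warehouse location `/user/hive/warehouse` (it is configurable via `hive.metastore.warehouse.dir`):

```python
# Sketch of how Hive lays out partition directories on HDFS.
# Assumes the default warehouse location (a configurable assumption).
WAREHOUSE = "/user/hive/warehouse"

def partition_path(table, **partition_spec):
    """Build the HDFS directory for one partition, e.g. day='20211122'."""
    parts = "/".join(f"{k}={v}" for k, v in partition_spec.items())
    return f"{WAREHOUSE}/{table}/{parts}"

print(partition_path("dept_partition", day="20211122"))
# /user/hive/warehouse/dept_partition/day=20211122
print(partition_path("dept_partition2", day="20211124", hour="18"))
# /user/hive/warehouse/dept_partition2/day=20211124/hour=18
```

A query with `where day='20211122'` only has to read the files under that one directory, which is exactly why partition pruning speeds up queries.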
1.2. Secondary partition
Consider: If there is a large amount of log data in a day, how can you split the data again?
- Create a secondary partition table

  ```sql
  hive (default)> create table dept_partition2(
               > deptno int, dname string, loc string)
               > partitioned by (day string, hour string)
               > row format delimited fields terminated by '\t';
  ```

- Load data normally

  - Load data into the secondary partitioned table

    ```sql
    hive (default)> load data local inpath '/opt/module/hive-3.1.2/datas/dept_20211124.log' into table dept_partition2 partition(day='20211124', hour=17);
    ```

  - Query partition data

    ```sql
    hive (default)> select * from dept_partition2 where day = '20211124' and hour = 17;
    ```
- Three ways to associate data that was uploaded directly to a partition directory with the partitioned table

- Method 1: upload data, then repair the table

  - Upload data

    ```sql
    hive (default)> dfs -mkdir -p /user/hive/warehouse/dept_partition2/day=20211124/hour=18;
    hive (default)> dfs -put /opt/module/hive-3.1.2/datas/dept_20211123.log /user/hive/warehouse/dept_partition2/day=20211124/hour=18;
    ```

  - Query data (the data just uploaded cannot be queried yet)

    ```sql
    hive (default)> select * from dept_partition2 where day = '20211124' and hour = 18;
    ```

  - Execute the repair command

    ```sql
    hive (default)> msck repair table dept_partition2;
    OK
    Partitions not in metastore:    dept_partition2:day=20211124/hour=18
    Repair: Added partition to metastore dept_partition2:day=20211124/hour=18
    Time taken: 0.172 seconds, Fetched: 2 row(s)
    ```

  - Query the data again

    ```sql
    hive (default)> select * from dept_partition2 where day = '20211124' and hour = 18;
    ```
- Method 2: add a partition after uploading data

  - Upload data

    ```sql
    hive (default)> dfs -mkdir -p /user/hive/warehouse/dept_partition2/day=20211124/hour=19;
    hive (default)> dfs -put /opt/module/hive-3.1.2/datas/dept_20211122.log /user/hive/warehouse/dept_partition2/day=20211124/hour=19;
    ```

  - Add the partition

    ```sql
    hive (default)> alter table dept_partition2 add partition(day='20211124', hour=19);
    ```

  - Query data

    ```sql
    hive (default)> select * from dept_partition2 where day = '20211124' and hour = 19;
    ```
- Method 3: create the directory, then load data into the partition

  - Create a directory

    ```sql
    hive (default)> dfs -mkdir -p /user/hive/warehouse/dept_partition2/day=20211124/hour=20;
    ```

  - Upload data

    ```sql
    hive (default)> load data local inpath '/opt/module/hive-3.1.2/datas/dept_20211122.log' into table dept_partition2 partition(day='20211124', hour=20);
    ```

  - Query data

    ```sql
    hive (default)> select * from dept_partition2 where day = '20211124' and hour = 20;
    ```
1.3. Dynamic partition adjustment
In a relational database, when inserting into a partitioned table, the database automatically routes each row to the corresponding partition based on the value of the partition field. Hive provides a similar mechanism, dynamic partition, but using it requires the corresponding configuration.
- Dynamic partition parameter settings

  - Enable dynamic partitioning (default true, enabled)

    ```
    hive.exec.dynamic.partition=true
    ```

  - Set non-strict mode (the dynamic partition mode defaults to strict, which requires at least one partition field to be static; nonstrict allows all partition fields to be dynamic)

    ```
    hive.exec.dynamic.partition.mode=nonstrict
    ```

  - Maximum number of dynamic partitions that can be created across all nodes executing the MR job (default 1000)

    ```
    hive.exec.max.dynamic.partitions=1000
    ```

  - Maximum number of dynamic partitions that can be created on each node executing the MR job. Set this based on the actual data: for example, if the source data contains one year of data, the day field has 365 distinct values, so this parameter must be set to a value greater than 365; with the default of 100, the job fails.

    ```
    hive.exec.max.dynamic.partitions.pernode=100
    ```

  - Maximum number of HDFS files a whole MR job may create (default 100000)

    ```
    hive.exec.max.created.files=100000
    ```

  - Whether to throw an exception when an empty partition is generated. Usually no setting is required (default false)

    ```
    hive.error.on.empty.partition=false
    ```
- A hands-on case

  Requirement: insert data from the dept table into the corresponding partitions of the target table dept_partition_dy, partitioned by region (the loc field).

  - Create the target partition table

    ```sql
    hive (default)> create table dept_partition_dy(id int, name string)
                 > partitioned by (loc int)
                 > row format delimited fields terminated by '\t';
    ```

  - Set dynamic partitioning and insert the data

    ```sql
    hive (default)> set hive.exec.dynamic.partition.mode = nonstrict;
    hive (default)> insert into table dept_partition_dy partition(loc) select deptno, dname, loc from dept;
    ```

  - View the partition status of the target partition table

    ```sql
    hive (default)> show partitions dept_partition_dy;
    OK
    partition
    loc=1700
    loc=1800
    loc=1900
    ```
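The routing the insert performs can be sketched in plain Python: each row lands in the partition named after the value of the last selected column (here loc). The sample rows below are hypothetical, only to illustrate the mechanism:

```python
from collections import defaultdict

# Hypothetical dept rows: (deptno, dname, loc). Dynamic partitioning routes
# each row by the value of the last selected column (loc).
rows = [(10, "ACCOUNTING", 1700), (20, "RESEARCH", 1800),
        (30, "SALES", 1900), (40, "OPERATIONS", 1700)]

partitions = defaultdict(list)
for deptno, dname, loc in rows:
    # (deptno, dname) fill the table columns; loc names the partition.
    partitions[f"loc={loc}"].append((deptno, dname))

print(sorted(partitions))  # ['loc=1700', 'loc=1800', 'loc=1900']
```

This is also why the strict-mode check exists: a mistake in the last select column can silently fan the data out into a huge number of partitions.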
2. Bucket table
Partitioning provides a convenient way to isolate data and optimize queries. However, not every data set can be partitioned sensibly. For a table or partition, Hive can go one step further and organize the data into buckets, a more fine-grained division of the data range.
Bucketing is another technique for breaking a data set up into more manageable parts.
Partitioning targets the data's storage path (directories); bucketing targets the data files.
- First create a bucket table

  - Data preparation

    ```
    1001	ss1
    1002	ss2
    1003	ss3
    1004	ss4
    1005	ss5
    1006	ss6
    1007	ss7
    1008	ss8
    1009	ss9
    1010	ss10
    1011	ss11
    1012	ss12
    1013	ss13
    1014	ss14
    1015	ss15
    1016	ss16
    ```

  - Create the bucket table

    ```sql
    hive (default)> create table stu_buck(id int, name string)
                 > clustered by(id) into 4 buckets
                 > row format delimited fields terminated by '\t';
    ```

  - View the table structure

    ```sql
    hive (default)> desc formatted stu_buck;
    Num Buckets:            4
    ```

  - Import data into the bucket table with load

    ```sql
    hive (default)> load data local inpath '/opt/module/hive-3.1.2/datas/student' into table stu_buck;
    ```

  - Check whether the created bucket table is divided into four buckets

  - Query the bucketed data

    ```sql
    hive (default)> select * from stu_buck;
    ```
- The bucketing rule

  Judging from the results, Hive decides which bucket a record goes into by taking the hash of the bucketing column's value modulo the number of buckets.
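The rule can be checked against the sample data. For an int column, Hive's hash is the value itself (an assumption that holds for Hive's int type but not for strings), so with 4 buckets the bucket of each id is simply id % 4:

```python
# Sketch of Hive's bucketing rule: bucket = hash(key) % num_buckets.
# For Hive int columns the hash is the value itself (assumption noted above).
NUM_BUCKETS = 4

def bucket_of(key: int) -> int:
    return key % NUM_BUCKETS

buckets = {}
for i in range(1001, 1017):      # the 16 sample ids 1001..1016
    buckets.setdefault(bucket_of(i), []).append(i)

# The 16 rows spread evenly: 4 ids per bucket.
print({b: len(v) for b, v in sorted(buckets.items())})  # {0: 4, 1: 4, 2: 4, 3: 4}
```

Each bucket becomes one file under the table directory, which is why four files appear on HDFS.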
- Precautions for bucket table operations

  - Set the number of reduce tasks to -1 and let the job decide how many it needs, or set it to a value greater than or equal to the number of buckets in the bucket table

  - Load data into the bucket table from HDFS, to avoid the problem of local files not being found

  - Do not use local mode

- Import data into the bucket table with insert

  ```sql
  hive (default)> insert into table stu_buck select * from student_insert;
  ```
3. Sample query
For very large data sets, users sometimes need a representative sample of the result rather than the whole result. Hive can meet this need by sampling the table.

Syntax: TABLESAMPLE(BUCKET x OUT OF y)

Query data in table stu_buck:

```sql
hive (default)> select * from stu_buck tablesample(bucket 1 out of 4 on id);
```

Note: the value of x must be less than or equal to the value of y; otherwise Hive reports an error (the numerator must not be bigger than the denominator in the sample clause).
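Which buckets the clause reads can be sketched as follows: on an n-bucket table, BUCKET x OUT OF y starts at bucket x-1 (0-based) and then steps by y, returning n/y of the buckets. This is a sketch assuming y divides the bucket count n:

```python
def sampled_buckets(x: int, y: int, n: int) -> list:
    """0-based buckets read by TABLESAMPLE(BUCKET x OUT OF y) on an n-bucket table.

    Sketch assuming y divides n; x must satisfy x <= y.
    """
    if x > y:
        raise ValueError("numerator x must not be bigger than denominator y")
    return list(range(x - 1, n, y))

print(sampled_buckets(1, 4, 4))  # [0]    -> the query above reads one of the 4 buckets
print(sampled_buckets(1, 2, 4))  # [0, 2] -> half of the 4 buckets
```

So `bucket 1 out of 4` on stu_buck returns roughly a quarter of the rows: exactly the contents of the first bucket.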
Related links
- Big data Hive learning journey 3
- Big data Hive learning journey 2
- Big data Hive learning journey 1