1. Partition Tables

1.1 Concept

Hive tables correspond to specific directories on HDFS. By default, a query scans the entire table directory, which costs a great deal of time and I/O.

A partition is a subdirectory of the table directory on HDFS, and data is stored in subdirectories according to its partition. If the WHERE clause of a query contains a partition condition, Hive reads directly from that partition instead of scanning the entire table directory. A reasonable partition design can greatly improve query speed and performance.

Note that partitioning is not unique to Hive; it is actually quite a common concept. For example, in a familiar Oracle database, as the amount of data in a table grows, queries slow down, and you can likewise partition the table. Once partitioned, the table's data is stored across multiple tablespaces (physical files), so queries no longer need to scan the whole table, which improves performance.

1.2 Application Scenarios

Partitioning is usually needed when managing large data sets, for example partitioning log files by day; this keeps the data split at a fine granularity and improves query performance.
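As a minimal sketch of the log-by-day scenario (the table and column names here are illustrative, not from the original):

CREATE TABLE access_logs (
    ip     STRING,
    url    STRING,
    status INT)
PARTITIONED BY (dt STRING);  -- one HDFS subdirectory per day, e.g. dt=2020-01-01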

1.3 Creating a Partition Table

In Hive, you create a partitioned table with the PARTITIONED BY clause. A table can contain one or more partition columns, and Hive creates a separate data directory for each distinct combination of partition column values. Let's create a table of employees as a test:

CREATE EXTERNAL TABLE emp_partition(
    empno INT,
    ename STRING,
    job STRING,
    mgr INT,
    hiredate TIMESTAMP,
    sal DECIMAL(7,2),
    comm DECIMAL(7,2))
    PARTITIONED BY (deptno INT)   -- partition by department number
    ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
    LOCATION '/hive/emp_partition';

1.4 Loading Data into a Partitioned Table

When loading data into a partitioned table, you must specify the partition where the data is stored:

-- load department 20's data into the table
LOAD DATA LOCAL INPATH "/usr/file/emp20.txt" OVERWRITE INTO TABLE emp_partition PARTITION (deptno=20);
-- load department 30's data into the table
LOAD DATA LOCAL INPATH "/usr/file/emp30.txt" OVERWRITE INTO TABLE emp_partition PARTITION (deptno=30);

1.5 Viewing Partition Directories

After loading, the table directory contains two partition subdirectories, deptno=20 and deptno=30, which you can view directly with the hadoop fs command:

#hadoop fs -ls hdfs://hadoop001:8020/hive/emp_partition/
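Alternatively, you can list the partitions from inside Hive; SHOW PARTITIONS is standard HiveQL:

SHOW PARTITIONS emp_partition;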

Now a query whose WHERE clause specifies a partition condition reads only the matching partition directory instead of scanning the whole table.
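For instance, the following query touches only the deptno=20 subdirectory:

SELECT * FROM emp_partition WHERE deptno = 20;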

2. Bucket Tables

2.1 Introduction

Partitioning offers a feasible way to isolate data and optimize queries, but not every data set can be partitioned sensibly, and more partitions are not always better: overly fine partition conditions may leave many partitions with no data at all. In addition, Hive caps the number of partitions that dynamic partitioning may create, to prevent an excessive number of partition files from overloading the file system. For such cases, Hive provides an even finer-grained way to split data: the bucket table.
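For reference, those caps are ordinary Hive settings; the values below are the commonly documented defaults, so verify them against your version:

set hive.exec.max.dynamic.partitions.pernode=100;  -- per-node limit
set hive.exec.max.dynamic.partitions=1000;         -- overall limit per statement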

A bucket table hashes the values of a specified column, takes the hash modulo the number of buckets, and stores each record in the corresponding bucket (file).
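You can get a rough preview of that mapping from within Hive using the built-in hash() and pmod() functions; note that the hash Hive applies internally for bucketing can differ by column type and version, so treat this as an approximation:

-- which of 4 buckets would each employee number land in?
SELECT empno, pmod(hash(empno), 4) AS bucket_id FROM emp;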

2.2 Understanding the Bucket Table

Like partitioning, bucketing is not unique to Hive; for Java developers it may even be an everyday concept, since bucketing in Hive follows the same principle as bucketing in Java's HashMap.

When you call put() on a HashMap, the program calls hashCode() on the key to compute a hash, takes it modulo the array length to get an index, and stores the entry in the linked list at that index. When a list grows past a certain threshold, it is converted into a red-black tree (JDK 1.8+). The following shows the data structure of HashMap:

(Figure: the internal data structure of HashMap. Image from: HashMap vs. Hashtable)

2.3 Creating a Bucket Table

In Hive, we specify the bucketing column with CLUSTERED BY, and use SORTED BY to specify the column by which data inside each bucket is sorted. The following is an example bucket table DDL:

  CREATE EXTERNAL TABLE emp_bucket(
    empno INT,
    ename STRING,
    job STRING,
    mgr INT,
    hiredate TIMESTAMP,
    sal DECIMAL(7,2),
    comm DECIMAL(7,2),
    deptno INT)
    CLUSTERED BY(empno) SORTED BY(empno ASC) INTO 4 BUCKETS  -- hash into four buckets by employee number
    ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
    LOCATION '/hive/emp_bucket';

2.4 Loading Data into the Bucket Table

If you use a plain Load statement on a bucket table, the data loads successfully, but it will not actually be bucketed.

This is because the essence of bucketing is to hash a specified field and store rows in the corresponding files, which means data must be inserted into a bucket table through MapReduce, with the number of Reducers equal to the number of buckets. For this reason, a bucket table can only be populated CTAS-style, that is, with an INSERT ... SELECT statement, because only such statements trigger MapReduce. The loading procedure is as follows:

1. Enable enforced bucketing

set hive.enforce.bucketing = true;  -- Hive 2.x does not require this step

In Hive 0.x and 1.x, you must set hive.enforce.bucketing = true to enforce bucketing, which lets Hive automatically choose the number of Reducers and the CLUSTER BY column based on the table definition.

2. Import the data with INSERT ... SELECT

INSERT INTO TABLE emp_bucket SELECT * FROM emp;  -- emp here is just an ordinary employee table

The execution log shows that this statement triggers a MapReduce job and that the number of Reducers equals the number of buckets specified when the table was created.

2.5 Viewing Bucket Files

A bucket is essentially just a specific file in the table directory.
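As with partitions, you can list the files with hadoop fs, reusing the namenode address from section 1.5 and the LOCATION from the DDL above:

#hadoop fs -ls hdfs://hadoop001:8020/hive/emp_bucket/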

3. Using Partition Tables and Bucket Tables Together

The essence of both partition tables and bucket tables is to split data at different granularities, so that a query only scans the relevant partitions or buckets rather than the whole table. The two can be used together to ensure that table data is split appropriately at each granularity. The following is an official Hive example:

CREATE TABLE page_view_bucketed(
    viewTime INT,
    userid BIGINT,
    page_url STRING,
    referrer_url STRING,
    ip STRING )
 PARTITIONED BY(dt STRING)
 CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
 ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '\001'
   COLLECTION ITEMS TERMINATED BY '\002'
   MAP KEYS TERMINATED BY '\003'
 STORED AS SEQUENCEFILE;

In this case, you need to specify a partition when importing data:

INSERT OVERWRITE TABLE page_view_bucketed
PARTITION (dt='2009-02-25')
SELECT * FROM page_view WHERE dt='2009-02-25';
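Beyond partition pruning, one practical payoff of this layout is efficient sampling. TABLESAMPLE's bucket syntax is standard HiveQL; the query below is an illustrative sketch, not part of the official example:

-- read only bucket 1 of 32 within a single day's partition
SELECT * FROM page_view_bucketed TABLESAMPLE(BUCKET 1 OUT OF 32 ON userid)
WHERE dt = '2009-02-25';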

References

  1. LanguageManual DDL BucketedTables

See the GitHub open source project Getting Started with Big Data for more articles in the big data series.