The index
- Indexes are introduced in Hive from version 0.7.0 to improve the query speed of specified columns in Hive tables. Without indexes, Hive will load the entire table or partition (partitioning the table and providing relevant filtering criteria) when executing a query. Even if you add a query with a predicate (such as ‘WHERE table.column = 10’), Hive will still load the entire table or partition and process all rows. But when there is an index on the specified column and the query is performed through the specified column, only part of the file is loaded and processed. Indexes are often superior to partitions in situations where you can expect to see very large volumes of partitioned data.
- In addition, like traditional relational databases, adding indexes increases query speed but consumes additional resources to create indexes and requires more disk space to store indexes. However, Hive only has limited index functions. It does not have the concept of keys in common relational databases, and indexes do not update automatically
- Bitmap indexes are added to Hive 0.8.0.
1. Mechanism and principle
- A Hive index is an index table (a physical table of Hive) that contains the value of the index column, HDFS file path corresponding to the value, and offset of the value in the file. When index fields are involved in the query, the HDFS file path and offset corresponding to index column values are searched in the index table first to avoid full table scan.
- When the Hive by index column of the query is executed, the first by a MRJob to query indexing table, according to the index column filter conditions, query out the indexed column value corresponding to the HDFS file directory and offset, and the data output to a file of HDFS, then according to this file to filter the original file, as the input query Job
Creating an index on the specified column produces an index table (the table structure is as follows),
+--------------+----------------+----------+--+
| col_name | data_type | comment |
+--------------+----------------+----------+--+
| Empno | int | | indexed columns
| _bucketname | string | | HDFS file path
| _offsets | array < bigint > | offset |
+--------------+----------------+----------+--+
Copy the code
2. Create indexes
CREATE INDEX index_name -- index name
ON TABLE base_table_name (col_name, ...) Columns to be indexed
AS index_type -- Index type
[WITH DEFERRED REBUILD] Alter table alter table alter table
[IDXPROPERTIES (property_name=property_value, ...) ]-- Index additional attributes
[IN TABLE index_table_name] -- Name of the index table[[ROWFORMAT ... ] STOREDAS.| STORED BY. ]-- Index table row separator and storage format
[LOCATION hdfs_path] -- The storage location of the index table
[TBLPROPERTIES (...)] -- Index table Table attributes
[PARTITIONED BY(...). ]-- Partition of index
[COMMENT "index comment"]; -- Index comment
Copy the code
- The AS Index_type statement specifies index handlers, such AS CompactIndexHandler, which is an implementation, and of course the BITMAP we mentioned above
- If the partition by statement is omitted, the index will contain all partitions of the original table.
CREATE TABLE employees(
name string,
salary float,
subordinates ARRAY<string>,
deductions MAP<string,float>,
address struct<street:STRING, city:STRING, state:STRING, zip:INT>
)
partitioned by(country string,state string);
CREATE INDEX employees_index
on table employees(country)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH deferred rebuild
idxproperties('create ='kingcall', 'create_at'='2020- 12- 20') in table employees_index_table partitioned by(country,state) comment 'employees indexed by country and state';
Copy the code
3. View the index
-- Displays indexes for all columns on the table
SHOW FORMATTED INDEX ON table_name;
Copy the code
4. Delete indexes
Deleting an index deletes the corresponding index table.
DROP INDEX [IF EXISTS] index_name ON table_name;
Copy the code
If a table with an index is deleted, both its corresponding index and index table are deleted. If a partition of the indexed table is deleted, the partition index corresponding to the partition is also deleted.
5. Rebuild the index
- The newly created Hive index table does not contain data. You need to generate index data
- If the original table data is updated and the index data needs to be rebuilt,Hive starts a MapReduce job to create indexes
alter index test_index on test rebuild;
Copy the code
For example, in the example above, if I update only one partition’s data I can update only that partition’s index
ALTERINDEX employees_indexon TABLE employees
Partition(country='US')
rebuild;
Copy the code
6. Use indexes
By default, indexes are created but are not automatically used during Hive query. Therefore, you need to enable related configurations. When configuration is enabled, queries involving indexed columns will use the index function to optimize the query.
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
SET hive.optimize.index.filter=true;
SET hive.optimize.index.filter.compact.minsize=0;
Copy the code
conclusion
The index table has the advantages of reducing data scanning, but has many disadvantages. The most important one is that the index table cannot be automatically rebuilt. If data is added or deleted from the index table, the MapReduce job must be manually rebuilt to generate index table data.
Maintaining an index requires additional storage space and creating an index is computationally expensive, so it is best to use static fields that are not updated or infrequently so that index data is not always rebuilt.
In addition, Hive removes the indexing function from 3.0 for the following reasons:
- Materialized views with automatic rewriting can produce similar effects to indexes (Hive 2.3.0 added support for Materialized views, formally introduced after 3.0)
- When using the Columnar Storage file Format (Parquet, ORC) for storage, these formats support selective scanning to skip unwanted files or blocks.