The view,

1.1 introduction

A view in Hive is the same concept as a view in an RDBMS. It is a logical representation of a set of data, essentially the result set of a SELECT statement. Views are purely logical objects with no associated storage (except materialized views introduced in Hive 3.0.0). When a query references a view, Hive can combine the definition of a view with the query, for example, pushing a filter from the query to the view.

1.2 Creating a View

CREATE VIEW [IF NOT EXISTS] [db_name.]view_name   -- View name
  [(column_name [COMMENT column_comment], ...) ]    - the column name
  [COMMENT view_comment]  -- View comments
  [TBLPROPERTIES (property_name = property_value, ...)]  -- Additional information
  AS SELECT. ;Copy the code

In Hive, you can use the CREATE VIEW to CREATE a VIEW. IF a table or VIEW with the same name EXISTS, an exception is thrown. You are advised to use the IF NOT EXISTS command. Note the following when using views:

  • Views are read-only and cannot be used as the target of LOAD/INSERT/ALTER;

  • The view is fixed when it is created, and subsequent changes to the base table (such as adding columns) will not be reflected in the view;

  • Deleting the base table does not delete the view, you need to manually delete the view;

  • Views may contain ORDER BY and LIMIT clauses. If a query that references a view also contains such clauses, the execution priority is lower than that of the view’s counterpart. For example, if the custom_view view specifies LIMIT 5 and the query statement is SELECT * from custom_view LIMIT 10, a maximum of 5 rows will be returned.

  • When a view is created, if no column name is provided, the column name is automatically derived from the SELECT statement;

  • When creating a view, if the SELECT statement contains other expressions, such as x + y, the column names will be generated as _C0, _C1, etc.

    CREATE VIEW  IF NOT EXISTS custom_view AS SELECT empno, empno+deptno , 1+2 FROM emp;
    Copy the code

1.3 Viewing Views

-- View all Views: There is no separate statement to view a list of views, only show tables
show tables;
-- View a view
desc view_name;
View the details of a view
desc formatted view_name;
Copy the code

1.4 Deleting a View

DROP VIEW [IF EXISTS] [db_name.]view_name;
Copy the code

If the deleted view is referenced by other views, the program does not issue a warning. However, other views referenced by this view become invalid and need to be rebuilt or deleted.

1.5 Modifying a View

ALTER VIEW [db_name.]view_name AS select_statement;
Copy the code

The view to be changed must exist and cannot have partitions. If the view has partitions, the modification fails.

1.6 Modifying View Properties

Grammar:

ALTER VIEW [db_name.]view_name SET TBLPROPERTIES table_properties;
 
table_properties:
  : (property_name = property_value, property_name = property_value, ...)
Copy the code

Example:

ALTER VIEW custom_view SET TBLPROPERTIES ('create'='heibaiying'.'date'='2019-05-05');
Copy the code

Second, the index

2.1 introduction

Hive introduced the index function in 0.7.0. Indexes are designed to speed up query of certain table columns. If there is no index, queries with predicates (such as ‘WHERE table1.column = 10’) load the entire table or partition and process all rows. But if column has an index, only part of the file needs to be loaded and processed.

2.2 Index Principles

After an index is created on a specified column, an index table (the table structure is as follows) is generated. The fields in the index column include the value of the index column, the HDFS file path corresponding to the value, and the offset of the value in the file. When index fields are involved in the query, the HDFS file path and offset corresponding to index column values are searched in the index table first to avoid full table scan.

+--------------+----------------+----------+--+ | col_name | data_type | comment | + -- -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- - + - + | empno | int | indexed column | | _bucketname | string | HDFS file path | | _offsets | array < bigint > | offset | + -- -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - + -- -- -- -- -- -- -- -- -- - + - +Copy the code

2.3 Creating An Index

CREATE INDEX index_name     -- index name
  ON TABLE base_table_name (col_name, ...)  Columns to be indexed
  AS index_type    -- Index type
  [WITH DEFERRED REBUILD]    Alter table alter table alter table
  [IDXPROPERTIES (property_name=property_value, ...)]  -- Index additional attributes
  [IN TABLE index_table_name]    -- Name of the index table[[ROW FORMAT. ]STORED AS. |STORED BY. ]-- Index table row separator and storage format
  [LOCATION hdfs_path]  -- The storage location of the index table
  [TBLPROPERTIES (...)]   -- Index table Table attributes
  [COMMENT "index comment"];  -- Index comment
Copy the code

2.4 Viewing Indexes

-- Displays indexes for all columns on the table
SHOW FORMATTED INDEX ON table_name;
Copy the code

2.4 Deleting an Index

Deleting an index deletes the corresponding index table.

DROP INDEX [IF EXISTS] index_name ON table_name;
Copy the code

If a table with an index is deleted, both its corresponding index and index table are deleted. If a partition of the indexed table is deleted, the partition index corresponding to the partition is also deleted.

2.5 Rebuilding An Index

ALTER INDEX index_name ON table_name [PARTITION partition_spec] REBUILD;
Copy the code

Rebuild the index. If a PARTITION is specified, only the index of the PARTITION is rebuilt.

Index case

3.1 Creating An Index

Create emp_index for the EMPNO field on the EMP table and store the index data in the EMP_INDEx_table index table

create index emp_index on table emp(empno) as  
'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' 
with deferred rebuild 
in table emp_index_table ;
Copy the code

At this time, there is no data in the index table, and you need to rebuild the index to create the index data.

3.2 Rebuilding An Index

alter index emp_index on emp rebuild; 
Copy the code

Hive starts MapReduce jobs to create indexes. After indexes are created, view the following table data: The three table fields represent the value of the index column, the HDFS file path corresponding to the value, and the offset of the value in the file.

3.3 Using Indexes automatically

By default, indexes are created but are not automatically used during Hive query. Therefore, you need to enable related configurations. When configuration is enabled, queries involving indexed columns will use the index function to optimize the query.

SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
SET hive.optimize.index.filter=true;
SET hive.optimize.index.filter.compact.minsize=0;
Copy the code

3.4 Viewing Indexes

SHOW INDEX ON emp;
Copy the code

4. Index defects

One of the major defects of index tables is that the index table cannot be automatically rebuilt. This means that if data is added or deleted from the table, the MapReduce job must be manually rebuilt to generate index table data.

In addition, Hive removes the indexing function from 3.0 for the following reasons:

  • Materialized views with automatic rewriting can produce similar effects to indexes (Hive 2.3.0 added support for Materialized views, formally introduced after 3.0).
  • When using the Columnar Storage file Format (Parquet, ORC) for storage, these formats support selective scanning to skip unwanted files or blocks.

See this article: Hive Performance Optimized ORC Indexes – Row Group Index vs Bloom Filter Index

The resources

  1. Create/Drop/Alter View
  2. Materialized views
  3. Hive index
  4. Overview of Hive Indexes

See the GitHub Open Source Project: Getting Started with Big Data for more articles in the big Data series