preface
For example, HQL processing is very complex and slow. At this time, we can use Hive to add an index to improve our speed. I won’t talk about that. Let’s just get started.
1. Hive View
1.1 introduction
A view in Hive is the same concept as a view in an RDBMS. It is a logical representation of a set of data, essentially the result set of a SELECT statement. Views are purely logical objects with no associated storage (except materialized views introduced in Hive 3.0.0). When a query references a view, Hive can combine the definition of a view with the query, for example, pushing a filter from the query to the view.
1.2 the characteristics of
- Materialized views are not supported
- You can only query data, but cannot load data
- View creation, just save
A piece of metadata
Before the query view executes the corresponding subquery - If the view definition contains an ORDER BY/LIMIT statement, the ORDER BY/LIMIT statement is also used when querying the view. The view definition has a higher priority.
- Hive views Support iteration views
1.3 Creating a View
CREATE VIEW [IF NOT EXISTS] [db_name.]view_name -- View name
[(column_name [COMMENT column_comment], ...) ] - the column name
[COMMENT view_comment] -- View comments
[TBLPROPERTIES (property_name =property_value, ...) ]-- Additional information
AS SELECT. ;Copy the code
Considerations for creating a view
- CREATE VIEW creates a VIEW with the given name. If a table or view already exists with the same name, an error is raised. You can use
IF NOT EXISTS
Skip this error. - Deleting the base table does not delete the view, you need to manually delete the view;
- The view is read-only and cannot be used
LOAD
/INSERT
/ALTER
The goal of - When a view is created, if no column name is provided, the column name is automatically derived from the SELECT statement;
- A view may contain ORDER BY and LIMIT clauses. If the reference query also contains these terms, the query level clause evaluates the back view terms (and any other actions later in the query). For example, if the view specifies LIMIT 5 and the reference query executes as (select * from V LIMIT 10), a maximum of 5 rows will be returned.
To prepare data
Create test table
create table default.user(
id string , - the primary key
sex string, - gender
name string The name of the -
);
-- Import data
insert into default.user (id, sex, name)
values(" 1 ", "male" and "zhang"), (" 2 ", "female", "flower"), (" 3 ", "male", "Zhao Liu"), (" 4 ", "male", "hey hey li");Copy the code
Create a test view
hive (default)> create view if not exists default.user_view as select * from default.user;
OK
id sex name
Time taken: 0.181 seconds
Copy the code
1.4 Querying views
Query view contents
select * from default.user_view;
Query the view structure
desc default.user_view;
-- Query view details
desc formatted default.user_view;
Query views are not specified in the same way as all tables are queried
show tables;
Copy the code
1.5 Deleting a View
Templates -
DROP VIEW [IF EXISTS] [db_name.]view_name;
Delete view
DROP VIEW IF EXISTS user_view;
Copy the code
1.6 Modifying View Properties
Grammar:
ALTER VIEW [db_name.]view_name SET TBLPROPERTIES table_properties;
table_properties:
: (property_name = property_value, property_name = property_value, ...)
Copy the code
Example:
alter view default.user_view set tblproperties ('name'='DSJLG'.'GZH'='DSJLG')
Copy the code
Display the display through desc formatted default.user_view; The details information
Second, the index
2.1 introduction
Hive introduced the index function in 0.7.0. Indexes are designed to speed up query of certain table columns. If there is no index, queries with predicates (such as ‘WHERE table1.column = 10’) load the entire table or partition and process all rows. But if column has an index, only part of the file needs to be loaded and processed.
2.2 Creating index Templates
CREATE INDEX index_name -- index name
ON TABLE base_table_name (col_name, ...) Columns to be indexed
AS index_type -- Index type
[WITH DEFERRED REBUILD] Alter table alter table alter table
[IDXPROPERTIES (property_name=property_value, ...) ]-- Index additional attributes
[IN TABLE index_table_name] -- Name of the index table[[ROWFORMAT ... ] STOREDAS.| STORED BY. ]-- Index table row separator and storage format
[LOCATION hdfs_path] -- The storage location of the index table
[TBLPROPERTIES (...)] -- Index table Table attributes
[COMMENT "index comment"]; -- Index comment
Copy the code
2.3 Creating An Index
We create the id column user_index using the user table created above, and the index is stored in the user_index_table index table
create index user_index on table user(id) as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
with deferred rebuild in table user_index_table;
Copy the code
At this time, there is no data in the index table, and you need to rebuild the index to create the index data.
2.4 Rebuilding an Index
hive (default)> ALTER index user_index on user rebuild ;
Query ID = root_20201015081313_879ce697-a6a44 -c38-a1a9-0e72a52feb6b
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1602711568359_0002, Tracking URL = http://node01:8088/proxy/application_1602711568359_0002/
Kill Command = /export/servers/hadoop2.6. 0-cdh514.. 0/bin/hadoop job -kill job_1602711568359_0002
Hadoop job information for Stage- 1: number of mappers: 1; number of reducers: 1
2020- 10- 15 08:13:47.425 Stage- 1 map = 0%, reduce = 0%
2020- 10- 15 08:13:48.546 Stage- 1 map = 100%, reduce = 0%, Cumulative CPU 1.66 sec
2020- 10- 15 08:13:49.576 Stage- 1 map = 100%, reduce = 100%, Cumulative CPU 2.5 sec
MapReduce Total cumulative CPU time: 2 seconds 500 msec
Ended Job = job_1602711568359_0002
Loading data to table default.user_index_table
Table default.user_index_table stats: [numFiles=1, numRows=4, totalSize=231, rawDataSize=227]
MapReduce Jobs Launched:
Stage-Stage- 1: Map: 1 Reduce: 1 Cumulative CPU: 2.5 sec HDFS Read: 12945 HDFS Write: 581944 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 500 msec
OK
Time taken: 12.85 seconds
Copy the code
Hive starts MapReduce jobs to create indexes. After indexes are created, view the following table data: The three table fields represent the value of the index column, the HDFS file path corresponding to the value, and the offset of the value in the file.
hive (default)> select * from user_index_table;
OK
user_index_table.id user_index_table._bucketname user_index_table._offsets
1 hdfs://node01:8020/user/hive/warehouse/user/000000_0 [0]
2 hdfs://node01:8020/user/hive/warehouse/user/000000_0 [13]
3 hdfs://node01:8020/user/hive/warehouse/user/000000_0 [26]
4 hdfs://node01:8020/user/hive/warehouse/user/000000_0 [39]
Time taken: 0.047 seconds, Fetched: 4 row(s)
Copy the code
2.5 Automatic Use of Indexes
By default, indexes are created but are not automatically used during Hive query. Therefore, you need to enable related configurations. When configuration is enabled, queries involving indexed columns will use the index function to optimize the query.
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
SET hive.optimize.index.filter=true;
SET hive.optimize.index.filter.compact.minsize=0;
Copy the code
2.6 Viewing Indexes
show index on user;
2.7 Deleting An Index
Deleting an index deletes the corresponding index table.
DROP INDEX [IF EXISTS] index_name ON table_name;
Copy the code
If a table with an index is deleted, both its corresponding index and index table are deleted. If a partition of the indexed table is deleted, the partition index corresponding to the partition is also deleted.
2.8 Principles of Indexes
After an index is created on a specified column, an index table (a physical table of Hive) is generated. The index table contains the value of the index column, HDFS file path corresponding to the value, and the offset of the value in the file. During index field query, an additional MapReduce job is generated. Based on the filtering conditions for index columns, the HDFS file path and offset corresponding to index column values are filtered from the index table and output to a file in HDFS. Then, based on the HDFS file path and offset in these files, Filter the original input file and generate a new split as a split for the entire job, thus eliminating the need for a full table scan.
conclusion
Indexes can improve query efficiency and grouping efficiency, but they also have disadvantages. Creating indexes cannot automatically rebuild data, which means that you need to manually modify data and add data. Indexes are not recommended for frequently modified data. Believe in yourself, hard work and sweat will always pay off. I’m big data, and I’ll see you next time
Flink Interview Questions, Spark Interview questions, Hive Interview Questions, Hadoop Interview questions, Docker interview questions, resume templates and other resources please go to GitHub to download github.com/lhh2002/Fra…