Small knowledge, big challenge! This article is participating in the creation activity of “Essential Tips for Programmers”.
Hive is a data warehouse tool based on Hadoop. Hive maps structured data files into a table and provides SQL-like query functions
Hive document address: cwiki.apache.org/confluence/…
table
Hive tables have the following types of tables
- MANAGED_TABLE internal table
- EXTERNAL_TABLE external tables
Table application scenarios
MANAGED_TABLE
- Also known as internal tables tables that are managed by Hive. Data and metadata are all managed by Hive
- When a table is dropped, both the table structure and data are deleted
EXTERNAL_TABLE
- External table data is also shared to external hive applications, such as Spark and Flink
- Data protection: Only the table structure is deleted when a table is deleted, and data is stored in the HDFS
Table creation
Create a table, and specify the row separator as comma, if not delimited, default to a SOH, and then all data together, there is no way to further analysis
create table test7(id int,name string) row format delimited fields terminated by ', ';
insert into test7 values(10001."zhangsan")
Copy the code
View table structure
desc formatted test7;
Copy the code
Modify the name of the table
An error is reported if the new table name exists
alter table test1 rename to test2;
Copy the code
Modifying column information
Change the column name. The column name type cannot be forcibly converted and can only be upgraded. For example, a string cannot be changed to an int
Alter table test2 change id new_id int; Alter table test2 add columns(email string); Alter table test2 replace columns(ID string,gender string) alter table test2 replace columns(ID string,gender string)Copy the code
Table type change
The keyword is case sensitive and must be uppercase
Change the internal table to the external table
alter table table_name set tblproperties('EXTERNAL'='TRUE')
Copy the code
Change the external table to the internal table
alter table table_name set tblproperties('EXTERNAL'='FALSE')
Copy the code
Data manipulation
Load the data
Load the data when the table is created, or import the data after the table is built
The data content needs to conform to the table structure
Load Data Loads data
Load data inpath from HDFS to hive. Load data inpath from HDFS to hive'/test/dmp.txt'into table test1; Load data local inpath load data local inpath'/root/dmp.txt' into table test1;
Copy the code
Import data from other tables
insert into table test7 select * from test1;
Copy the code
Specify load data when creating the table
create external table if not exists test10(id int,name string)
for format delimited fields terminated by '\t'
location '/test/data10.txt'
Copy the code
Load data from another table when creating a table
create table test11 as select id,name from test12;
Copy the code
Export data
Exporting data requires few operations but loading more data
To export generally according to the query conditions, write to the local, because the export is all the export
Export the query results to the local PC
insert overwirte local directory '/tmp/test2.txt' select * from test2;
Copy the code
The query results are formatted and exported to the local PC, separated by Spaces
insert overwirte local directory '/tmp/test2.txt'
FOR FORMAT DELIMITED FIELDS TERMINATED BY '\t'
select * from test2;
Copy the code
Send the query results to HDFS and separate them with commas
insert overwirte directory '/tmp/test2.txt'
FOR FORMAT DELIMITED FIELDS TERMINATED BY ', '
select * from test2;
Copy the code
Import and export, data migration
export table default.test7 to '/user/hive/warehouse/export/test7'
import table default.test7 from '/user/hive/warehouse/export/test7'
Copy the code
Clear data and drop tables
# display desc formatted table_name; Truncate table table_name; # remove external table data hadoop fs - rm -r -f/user/hive/warehouse/test. The db/talbe_name # delete table (internal table will delete all data and table structure, Alter table table_name drop partition(load_date= load_date'2019-01-01')
Copy the code
The partition table
A partition table corresponds to an independent folder on an HDFS file system. Hive partitions are subdirectories that divide a large amount of data into small folders and store them in a directory
Dynamic partitioning
The advantage of dynamic partitioning is that it is not necessary to add a partition each time data is written, but it is handled automatically. If you partition by day, and write data every day, you have to partition beforehand, but the number of tables is too large, this is really inefficient
Cwiki.apache.org/confluence/…
Create a partitioned table
create table if not exists test5(name string) partitioned by(age int) row format delimited fields terminated by ' ' lines terminated by '\n';
Copy the code
Data import partition table must first execute the following statements in Hive.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
Copy the code
Data can be imported into a partitioned table in static or dynamic partitioning modes
Insert into table test5 partition(age= 125) select name fromtest6; Insert overwrite table test5 partition select name,agefrom test6;
Copy the code
Partition by day example (static partition)
---------txt110 ACCOUNTING 1700
20 RESEARCH 1800
---------txt2
30 SALES 1900
40 OPERATIONS 1700
---------txt3
50 TEST 2000
60 DEV 1900
Copy the code
Create table dept_PAR (deptno int,dname string, loc string) partitioned by (day string) row format delimited fields terminated by'\t'; Load data local inpath'/opt/dept1.txt' into table dept_par partition(day='the 20221-09-01'); Select * from single partitionfrom dept_par where day='2021-09-01'; # select * from 'select *'from dept_par where day='2021-09-01'
union
select * from dept_par where day='2021-09-02'
union
select * from dept_par where day='2021-09-03';
----
select * from dept_par where day='2021-09-01' or day='2021-09-02' or day='2021-09-03'; Alter table dept_par add partition(day='2021-09-04'); Alter table dept_PAR DROP partition(day='2021-09-04'Show partitions dept_PAR desc fromatted dept_PAR desc fromatted dept_PAR Create table dept_par2(deptno int,dname string, loc string) partitioned by (day string,hour string) row format delimited fields terminated by'\t';
Copy the code