Hive data models
- Internal table (data is stored in Hive's own warehouse directory: /user/hive/warehouse)
- External table (unlike an internal table, the data does not live in Hive's warehouse; Hive stores only the table's metadata)
- Partition table (stores data in separate directories according to specified conditions to improve query efficiency; partition -> directory)
- Bucket table (essentially a finer-grained partitioning, similar to hash partitioning; bucket -> file; see the sketch after the note below)
- View (a virtual table that stores no data, used to simplify complex queries)
Note: When an internal table is dropped, its data is deleted as well; when an external table is dropped, the data remains in HDFS
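Bucket tables and views are listed above but not demonstrated later, so here is a minimal sketch of both; the table name, bucket count, and columns are illustrative, not from the original text:

# A bucketed table hashes rows on uid into 4 bucket files (name and columns are illustrative)
create table if not exists aiops.user_bucketed (
  uid string,
  name string)
clustered by (uid) into 4 buckets
row format delimited fields terminated by ',';
# Older Hive versions also need: set hive.enforce.bucketing=true;

# A view is a stored query with no data of its own
create view if not exists aiops.p2_apps as
select appname, leader from aiops.appinfo where level = 'P2';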
1. Internal (managed) tables
- Each table has a corresponding directory in Hive that stores its data
- All of the table's data is stored in that directory
# Create a table
create table if not exists aiops.appinfo (
appname string,
level string,
leader string,
appline string,
dep string,
ips array<string>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
COLLECTION ITEMS TERMINATED BY ',';
# database authorization
hive> grant create on database dbname to user hadoop;
# import data (local import and HDFS import)
hive> load data inpath 'hdfs://hdfs-name/sure.csv' overwrite into table aiops.appinfo;
load data local inpath '/home/hdfs/online_state1' overwrite into table online_state PARTITION (end_dt='99991231');
Check table structure
hive> describe extended bgops;
hive> describe bgops;
Change a column name
This command can change a column's name, data type, comment, and position in the table. FIRST moves the column to the first position; AFTER col_name places it immediately after col_name
hive> ALTER TABLE aiops.appinfo CHANGE hostnum ipnum int comment 'some comments' AFTER col3;
Replace the table's columns
hive> ALTER TABLE aiops.appinfo replace columns (appname string,level string,leader string,appline string,dep string,ips array<string>);
hive> ALTER TABLE appinfo replace columns (appname string,appline string,level string,leader string,dep string,idcnum int,idcs array<string>,hostnum int,ips array<string>);
Add a column to the table (appended as the last column by default)
hive> alter table appinfo add columns (appclass string comment 'app_perf_class');
Export query results to the local directory ./testoutput
hive> insert overwrite local directory './testoutput'
> row format delimited fields terminated by "\t"
> select ip,appname,leader from appinfo LATERAL VIEW explode(ips) tmpappinfo AS ip;
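Dropping the local keyword writes the results to an HDFS directory instead (Hive 0.11+ also accepts a row format clause here); the output path below is illustrative:

hive> insert overwrite directory '/tmp/testoutput'
    > row format delimited fields terminated by "\t"
    > select ip,appname,leader from appinfo LATERAL VIEW explode(ips) tmpappinfo AS ip;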
2. External tables
Usage scenarios for external tables:
- Raw log files, or data sets operated on simultaneously by multiple departments, should use external tables
- If the metadata is deleted accidentally, the data in HDFS is still there and can be recovered, which improves data safety
Note: Inserting data with insert creates temporary tables that disappear after reconnecting, so using insert is not recommended.
Tip 1: In the Hive directory on HDFS, every directory ending in .db is an actual database.
Tip 2: The default database lives directly in the Hive warehouse root directory.
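Since this section shows no DDL, here is a minimal sketch of an external table over an existing HDFS directory; the path and columns are illustrative, not from the original text:

hive> create external table if not exists aiops.rawlog (
    > ts string,
    > msg string)
    > row format delimited fields terminated by '\t'
    > location 'hdfs://hdfs-name/data/rawlog';
# Dropping the table removes only the metadata; the files under /data/rawlog stay in HDFS
hive> drop table aiops.rawlog;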
3. Partition tables
Note: Partitioned tables are usually divided into statically partitioned tables, where the partition is specified explicitly when data is imported, and dynamically partitioned tables, where partitions are derived directly from the imported data. The benefit of partitioning is that data is stored under per-partition directories, avoiding full table scans at query time.
Create an external partition table with static partition dt
CREATE EXTERNAL TABLE if not exists aiops.tmpOnline(ip string,
status string,
....
)
PARTITIONED BY (
dt string);
Import data into the static partition table (the partition dt must be specified explicitly)
load data local inpath '/home/hdfs/tmpOnline' overwrite into table aiops.tmpOnline PARTITION (dt='99991231');
Use of dynamically partitioned tables (the table definition is no different from a statically partitioned table's)
Note: Hive does not enable dynamic partitioning by default
For dynamic partitioning to be recognized, the query that feeds the table must include the partition field in the specified (last) position
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> insert overwrite table aiops.tmpOnline partition(dt)
    > select ip,appname,....,from_unixtime(unix_timestamp(),'yyyyMMdd') as dt from table;
Add partition manually
alter table tablename add partition (dt='20181009');
Delete a partition manually (deleting a partition also deletes its data)
Note: if the data has changed, the records cannot be loaded into the same time partition again
alter table tablename drop partition (dt='20181009');
# Toggle strict mode (in strict mode, a query on a partitioned table must filter on the partition field)
set hive.mapred.mode = strict|nonstrict;
Note: For an external partitioned table, if the table is deleted, you only need to add the partitions back to the rebuilt table to restore the data of the historical partitions.
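A minimal sketch of that recovery, assuming the partition files still sit at their old HDFS locations (the paths are illustrative); msck repair table can also rediscover all partitions under the table directory at once:

# Re-register one historical partition on the rebuilt table
hive> alter table aiops.tmpOnline add partition (dt='20181009')
    > location 'hdfs://hdfs-name/user/hive/warehouse/aiops.db/tmponline/dt=20181009';
# Or let Hive scan the table directory and re-create every partition it finds
hive> msck repair table aiops.tmpOnline;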
Use of multiple partitions
Create a multi-partition table
create table log_m (
id int,
name string,
age int
)
partitioned by (year string,month string,day string)
row format delimited
fields terminated by '|'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n';
# insert data
insert into table log_m partition (year='2018',month='10',day='10') values(1,'biaoge',24);
insert into table log_m partition (year='2018',month='10',day='09') values(2,'bgbiao',25);
hive> show partitions log_m;
OK
year=2018/month=10/day=09
year=2018/month=10/day=10
Time taken: 0.055 seconds, Fetched: 2 row(s)
# Multiple dynamic partitions
Note: a dynamically partitioned table cannot be loaded with load data directly; insert the data instead
hive> insert into table log_m partition(year,month,day) values(3,'xuxuebiao',28,'2016','09','10');
hive> show partitions log_m;
OK
year=2016/month=09/day=10
year=2018/month=10/day=09
year=2018/month=10/day=10
Query partition data
hive> select * from log_m where year = '2018';
OK
2 bgbiao 25 2018 10 09
1 biaoge 24 2018 10 10
2 bgbiao 25 2018 10 10
Use of Hive complex data types
Note: Hive's popularity in the big data space owes much to its support for data types more complex than those of other SQL-like storage systems
- map: (key1,value1,key2,value2,...) a set of key/value pairs, e.g. map<string,string>
- struct: (var1,var2,var3,...) e.g. struct<abc:string,def:int,...>
- array: (var1,var2,var3,...) a collection of values of a single type, e.g. array<string>
- uniontype: (string, map<>, struct<>, array<>)
Note: When creating a Hive table, you can match the types of the imported data and define the appropriate column types. Hive recognizes the structure of the data through the following delimiters:
| Field delimiter | Meaning |
|---|---|
| FIELDS TERMINATED BY | the separator between fields |
| COLLECTION ITEMS TERMINATED BY | the separator between items within a field (used for array and struct types) |
| MAP KEYS TERMINATED BY | the separator between key and value in a map type (used for map types) |
# Create a table
create table union_testnew(
foo uniontype<int, double, string, array<string>, map<string, string>>
)
row format delimited
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n'
stored as textfile;
# Data preparation
[root@master wadeyu]# vim union_test.log
1 0,1
2 1,3.0
3 2,world
4 3,wade:tom:polly
5 4,k1^Dv1:k2^Dv2
# Import data
hive (badou)> load data local inpath './union_test.log' overwrite into table union_testnew;
# query data
hive (badou)> select * from union_testnew;
OK
union_testnew.foo
{0:1}
{1:3.0}
{2:"world"}
{3:["wade","tom","polly"]}
{4:{"k1":"v1","k2":"v2"}}
Time taken: 0.225 seconds, Fetched: 5 row(s)
1. Using the array type
1.1 Basic array usage
Type structure: array<struct>, e.g. array<string>, array<int>
Data representation: e.g. [string1,string2], [int1,int2]
# Original file
bmpjob P2 bgops service_team 10.0.0.212,10.0.0.225,10.0.0.243,10.0.55.31
# Create the table
hive> create table appinfo
> (
> appname string,
> level string,
> leader string,
> dep string,
> ips array<string>)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ' '
> COLLECTION ITEMS TERMINATED BY ',';
Load data to Hive
hive> load data inpath 'hdfs://hdfs-name/aiops/wander/appinfo.txt' overwrite into table appinfo;
Loading data to table test.appinfo
Table test.appinfo stats: [numFiles=1, numRows=0, totalSize=32568, rawDataSize=0]
OK
# query related data
hive> select * from appinfo limit 1;
OK
bmpjob P2 bgops service_team ["10.0.0.212","10.0.0.225","10.0.0.243","10.0.55.31"]
hive> select appname,leader,ips[0] from appinfo limit 1;
OK
bmpjob bgops 10.0.0.212
1.2 Converting array-type data
Background: when the array structure is used, a field usually holds multiple values, and often a single value needs to be filtered out. In general this is done with a user-defined table-generating function (UDTF); the UDTF explode() turns one row of input into multiple rows of output.
Lateral view syntax structure
lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*
Convert array<string> to a string
Use the concat_ws function to join the array's contents into a single string separated by ','
select concat_ws(',', colname) from tablename   # colname is an array<string> column
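Applied to the appinfo table from earlier, the conversion looks roughly like this (the output row is assumed for illustration):

hive> select appname, concat_ws(',', ips) from appinfo limit 1;
OK
bmpjob 10.0.0.212,10.0.0.225,10.0.0.243,10.0.55.31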
Syntax: lateral view explode(col3) col3 as name
- explode(array): generates one row for each element in the array
- explode(map): generates one row for each key-value pair in the map
hive> select ip,appname from appinfo LATERAL VIEW explode(ips) tmpappinfo AS ip limit 2;
10.0.0.212 bmpjob
10.0.0.225 bmpjob
Merging rows with concat_ws() in Hive
Use the concat_ws() and collect_set() functions to merge and deduplicate values of the same column
# The collect_set() function merges the related rows into an array<> value; the concat_ws() function then joins the array<> using the specified delimiter
## Sample data
hive> select * from tmp_jiangzl_test;
tmp_jiangzl_test.col1 tmp_jiangzl_test.col2 tmp_jiangzl_test.col3
a b 1
a b 2
a b 3
c d 4
c d 5
c d 6
For the above data, we can merge the col3 values grouped by the col1 and col2 columns
hive> select col1,col2,concat_ws(',',collect_set(col3)) from tmp_jiangzl_test group by col1,col2;
col1 col2 _c2
a b 1,2,3
c d 4,5,6
2. Using the struct<> type
Type structure: struct<name:STRING, age:INT>
Data representation: name:age, e.g. zhou:30
Example:
# Metadata format
1,zhou:30
2,yan:30
3,chen:20
# Related database structure
hive> create table test_struct(id INT, info struct<name:STRING, age:INT>)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > COLLECTION ITEMS TERMINATED BY ':';
# load data
hive> LOAD DATA LOCAL INPATH '/home/work/data/test5.txt' INTO TABLE test_struct;
# query related data
hive> select info.age from test_struct;
Total MapReduce jobs = 1
...
Total MapReduce CPU Time Spent: 490 msec
OK
30
30
3. Using the map<> type
Type structure: map<string, int>
Data representation: key:value,key:value,...
Example:
# Raw data format
1 job:80,team:60,person:70
2 job:60,team:80
3 job:90,team:70,person:100
Create table structure with map structure
hive> create table employee(id string, perf map<string, int>)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'
> COLLECTION ITEMS TERMINATED BY ','
> MAP KEYS TERMINATED BY ':';
# Data import
hive> LOAD DATA LOCAL INPATH '/home/work/data/test7.txt' INTO TABLE employee;
# data query
hive> select perf['person'] from employee;
Total MapReduce jobs = 1
...
Total MapReduce CPU Time Spent: 460 msec
OK
70
NULL
# Use explode() to expand the map data
hive> select explode(perf) as (p_name,p_score) from employee limit 4;
OK
job 80
team 60
person 70
# use explode() and lateral view
hive> select id,p_name,p_score from employee lateral view explode(perf) perf as p_name,p_score limit 3;
OK
1 job 80
1 team 60
1 person 70
# use size() to check the number of key-value pairs in the map structure.
hive> select size(perf) from employee;
3
2
3
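Two related built-ins, map_keys() and map_values(), pull out just the keys or values of a map as arrays; a quick sketch against the same employee table (output assumed):

hive> select map_keys(perf), map_values(perf) from employee limit 1;
OK
["job","team","person"] [80,60,70]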
Hive common functions
Note: Use show functions to view the related functions supported by Hive
1. List of common Hive functions
Standard functions use:
| Function name | Function description |
|---|---|
| round()/floor() | converts a double to a bigint |
| abs() | returns the absolute value of a number |
| ucase() | converts a string to upper case |
| reverse() | reverses a string |
| concat() | concatenates its inputs into a single string |
Aggregate functions use:
| Function name | Function description |
|---|---|
| sum() | returns the sum of all input values |
| avg() | computes the average of all input values |
| min()/max() | compute the minimum and maximum of the input values |
Note: aggregate functions are usually used in combination with group by statements, as in the sketch below
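A minimal sketch using the appinfo table from earlier (grouping by department is illustrative):

# Number of apps and total ip count per department
hive> select dep, count(*), sum(size(ips)) from appinfo group by dep;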
Table-generating functions: they take zero or more inputs and produce multiple columns or rows of output.
| Function name | Function description |
|---|---|
| array() | turns its arguments into an array<> value |
| split(string, separator) | splits a string into an array<string> on the separator (a regex; escape special characters with \) |
| explode() | takes an array as input and iterates over it, returning one row per element |
| collect_set() | deduplicates the values of a column and collects them into an array field |
| collect_list() | same as collect_set(), but without deduplicating |
| concat_ws(separator, array) | joins the elements of an array<> into a string using the separator (supports string and array<string> arguments) |
| cast(column as type) | converts a column's data type to type |
Note: when the split pattern appears inside double quotes (for example in hive -e "..."), four backslashes are needed to escape it
## The array() function turns a list of inputs into an array output
hive> select array(1,2,3) from xuxuebiao;
OK
[1,2,3]
[1,2,3]
## explode() takes an array as input and returns multiple rows
hive> select explode(array(1,2,3));
OK
1
2
3
## Use explode() to look up an element in an array
hive> select * from appinfo LATERAL VIEW explode(ips) tmpappinfo AS realid where realid ='10.0.0.125' ;
## The collect_set() function
## collect_set() deduplicates a column's values and merges them into an array
hive> select * from test;
OK
1 A
1 C
1 B
hive> select id,collect_set(name) from test group by id;
OK
1 ["A","C","B"]
2. Common conditional and data-cleaning functions
When using Hive to process data, we usually need to clean and transform related data. In this case, we may use some conditional judgments and default value handlers.
| Function name | Function description |
|---|---|
| IF(test condition, true value, false value) | conditional: returns the true value when the condition holds, otherwise the false value |
| CASE statement | multi-branch conditional |
| parse_url() | parses URLs, providing the common URL-cleaning operations |
| parse_url_tuple() | like parse_url(), but can extract multiple URL parts at once |
| regexp_replace() | regular-expression replacement |
| regexp_extract() | regular-expression extraction |
| COALESCE(column,'') | NULL-value conversion (Hive stores NULL in HDFS as \N) |
Example:
# if() is often used to clean data with inconsistent formats
hive> select ip,if(assign != 'Allocation status unknown',0,assign) as fenpei from asset;
OK
10.0.0.1 Allocation status unknown
# case: multi-branch conditional
hive> select ip,
case
when assign = 'Allocated' then 1
when assign = 'Unallocated' then 2
else 0
end
as fenpei
from asset;
hive (ods)> select name,salary,
> case when salary < 800 then 'low'
> when salary >= 800 and salary <=5000 then 'middle'
> when salary >5000 and salary <10000 then 'high'
> else 'very high'
> end as bracket
> from emp1;
# The parse_url() function
hive> select parse_url('https://www.baidu.com/s?cl=3&tn=baidutop10&fr=top1000&wd=%E8%BF%AA%E5%A3%AB%E5%B0%BC%E6%94%B6%E8%B4%AD%E7%A6%8F%E5%85%8B%E6%96%AF&rsv_idx=2','HOST');
www.baidu.com
# Regular expressions
hive> select regexp_replace('foobar','oo|ar','');
select regexp_replace('foobar','oo|ar','-');
## Print what the first capture group (.*?) matched: "the"
select regexp_extract('foothebar','foo(.*?)(bar)',1);
## Print what the second capture group (bar) matched: "bar"
select regexp_extract('foothebar','foo(.*?)(bar)',2);
## Print the entire match
select regexp_extract('foothebar','foo(.*?)(bar)',0);
# Combining cleaning functions
select if(4 > 5, 500, 100), coalesce(null,1,3,5), coalesce(null,null,null,null),
       case 3 when 1 then 'lala' when 2 then 'chye' else 'abc' end;
3. Hive advanced functions
row_number() over(): assigns a sequence number to each row within a window, e.g. for top-N-per-group queries
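A minimal sketch, assuming we want one representative app per department from the appinfo table (the ordering column is illustrative):

# Number the rows within each dep, ordered by appname, and keep the first
hive> select dep, appname from (
    >   select dep, appname,
    >          row_number() over (partition by dep order by appname) as rn
    >   from appinfo) t
    > where t.rn = 1;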
Common Hive environment variables
| Environment variable | Meaning |
|---|---|
| set hive.cli.print.header=true | print the column headers in query output |
| set hive.exec.dynamic.partition=true | enable dynamic partitioning |
| set hive.exec.dynamic.partition.mode=nonstrict | set the dynamic-partition mode to non-strict |
| set hive.exec.max.dynamic.partitions.pernode=1000 | maximum number of partitions on each node executing MR |
| set hive.exec.max.dynamic.partitions=1000 | maximum number of partitions across all MR nodes |
| SET SERDEPROPERTIES('serialization.null.format' = '\N') | set how Hive stores NULL ('\N' in HDFS is read back as NULL) |
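These settings are typically issued at the start of a session (or placed in ~/.hiverc) before the queries that need them; a minimal sketch reusing the dynamic-partition insert from earlier (some_source is a placeholder table name):

hive> set hive.cli.print.header=true;
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> insert overwrite table aiops.tmpOnline partition(dt)
    > select ip,status,from_unixtime(unix_timestamp(),'yyyyMMdd') as dt from some_source;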