Hive data models

  • Internal Table (the table's data is stored in Hive's own warehouse directory, /user/hive/warehouse by default)

  • External Table (unlike an internal table, the data does not live in Hive's own warehouse; Hive stores only the table's metadata)

  • Partition Table (a partition table stores data separately according to specified conditions to improve query efficiency. Partition -> directory)

  • Bucket Table (a bucket table is essentially a special kind of partition table, similar to hash partitioning. Bucket -> file)

  • View (a view is a virtual table that does not store data and is used to simplify complex queries)

Note: When an internal table is deleted, its data is deleted with it. When an external table is deleted, the data is not removed from HDFS

1. Internal tables (managed tables)



  • Each table has a corresponding directory under Hive's warehouse path to store its data

  • All of the table's data is stored in that directory

# Create a table
create table if not exists aiops.appinfo (
    appname string,
    level string,
    leader string,
    appline string,
    dep string,
    ips  array<string>)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ' '
    COLLECTION ITEMS TERMINATED BY ',';

# Customize file and record formats
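
A minimal sketch of a customized table definition, with illustrative column names, explicit field/line delimiters, and an explicit storage format:

create table if not exists aiops.applog (
    ts string,
    msg string)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
    STORED AS textfile;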

# Database authorization
hive> grant create on database dbname to user hadoop;

# Import data (from the local filesystem or from HDFS)
hive> load data inpath 'hdfs://hdfs-name/sure.csv' overwrite into table aiops.appinfo;
hive> load data local inpath '/home/hdfs/online_state1' overwrite into table online_state PARTITION (end_dt='99991231');

Check table structure
hive> describe extended bgops;
hive> describe bgops;

Change a column
This command can change a column's name, data type, comment, and position in the table. FIRST moves the column to the first position; AFTER col_name places it immediately after col_name
hive> ALTER TABLE aiops.appinfo CHANGE hostnum ipnum int comment 'some comments' AFTER col3;

Replace the table's columns
hive> ALTER TABLE aiops.appinfo replace columns (appname string,level string,leader string,appline string,dep string,ips array<string>);
hive> ALTER TABLE appinfo replace columns (appname string,appline string,level string,leader string,dep string,idcnum int,idcs array<string>,hostnum int,ips array<string>);

Add a column to the table (by default the new column is appended after the last column)
hive> alter table appinfo add columns (appclass string comment 'app_perf_class');

Export query results to a local directory
hive> insert overwrite local directory './testoutput'
    > row format delimited fields terminated by "\t"
    > select ip,appname,leader from appinfo LATERAL VIEW explode(ips) tmpappinfo AS ip;

2. External tables

Usage scenarios for external tables

  • Raw log files, and data sets that are operated on simultaneously by multiple departments, call for external tables

  • If the metadata is deleted accidentally, the data in HDFS is still there and the table can be rebuilt, which improves data safety (a creation sketch follows this list)
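
A minimal sketch of creating an external table over an existing HDFS path (the table name, column, and location are illustrative):

# Only metadata is created; the files under the location are left as-is
create external table if not exists aiops.rawlog (
    line string)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    LOCATION 'hdfs://hdfs-name/aiops/rawlog/';

# Dropping the table removes only the metadata; the files stay in HDFS
drop table aiops.rawlog;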

Note: Inserting data with insert creates temporary tables that disappear once the session is reconnected, so loading data with insert is not recommended.
Tip 1: In Hive's directory on HDFS, every directory ending in .db is an actual database.
Tip 2: The default database lives directly under Hive's warehouse root directory.



3. Partition tables

Note: Partitioned tables come in two kinds: statically partitioned tables, where the partition is specified explicitly when data is imported, and dynamically partitioned tables, where partitions are derived directly from the imported data. The benefit of partitioning is that data is organized by partition, avoiding full-table scans at query time.

Create an external partition table with a static partition column dt
CREATE EXTERNAL TABLE if not exists aiops.tmpOnline(ip string,
status string,
....
)
PARTITIONED BY (
  dt string);

Load data into the static partition table (the partition dt must be specified explicitly)
load data local inpath '/home/hdfs/tmpOnline' overwrite into table aiops.tmpOnline PARTITION (dt='99991231');

Use of dynamically partitioned tables (the table definition is no different from that of a statically partitioned table)
Note: Hive does not enable dynamic partitioning by default
Note: The select statement must include the dynamic partition column, in the last position
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> insert overwrite table aiops.tmpOnline partition(dt)
    > select ip,appname,....,from_unixtime(unix_timestamp(),'yyyyMMdd') as dt from table;

Add a partition manually
alter table tablename add partition (dt='20181009');

Drop a partition manually (when a partition is dropped, its data is deleted with it)
alter table tablename drop partition (dt='20181009');
Note: If the data changes, it cannot simply be loaded again into the same time partition; the old partition records need to be overwritten

Toggle strict mode (in strict mode, a query against a partitioned table must filter on the partition column)
set hive.mapred.mode = strict|nonstrict;

Note: For an external partitioned table, if the table is deleted, you only need to rebuild the table and add the partitions back to restore the data of the historical partitions.
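
A sketch of that recovery flow, with an illustrative table and partition value; msck repair table can re-register all partitions at once:

# Recreate the external partitioned table with the same schema, then re-attach a partition
hive> alter table aiops.tmpOnline add partition (dt='99991231');
# Or let Hive scan the table's directory and register every partition it finds
hive> msck repair table aiops.tmpOnline;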

Use of multiple partitions

Create a multi-partition table
create table log_m (
    id  int,
    name string,
    age int
)
partitioned by (year string,month string,day string)
row format delimited
fields terminated by '|'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n';

# Insert data
hive> insert into table log_m partition (year='2018',month='10',day='10') values(1,'biaoge',24);
hive> insert into table log_m partition (year='2018',month='10',day='09') values(2,'bgbiao',25);
hive> show partitions log_m;
OK
year=2018/month=10/day=09
year=2018/month=10/day=10
Time taken: 0.055 seconds, Fetched: 2 row(s)

# Multiple dynamic partitions
Note: Data cannot be loaded into dynamic partitions with load data; insert must be used
hive> insert into table log_m partition(year,month,day) values(3,'xuxuebiao',28,'2016','09','10');
hive> show partitions log_m;
OK
year=2016/month=09/day=10
year=2018/month=10/day=09
year=2018/month=10/day=10

Query partition data
hive> select * from log_m where year = '2018';
OK
2 bgbiao  25  2018  10  09
1 biaoge  24  2018  10  10
2 bgbiao  25  2018  10  10


Use of Hive complex data types



Note :Hive is popular in the big data space in large part because it supports more complex data types than other SQL-like storage systems

  • map: (key1,value1,key2,value2,…) A set of key/value pairs, e.g. map<string,string>

  • struct: (var1,var2,var3,…) A group of named, typed fields, e.g. struct<abc:string,def:int>

  • array: (var1,var2,var3,…) A collection of values of a single type, e.g. array<string>

  • uniontype: a value that can hold exactly one of several types, e.g. uniontype<string,map<>,struct<>,array<>>

Note: When creating a Hive table, look at the format of the data to be imported and choose matching data types and delimiters. Hive data type delimiters:

Delimiter clause Meaning
FIELDS TERMINATED BY The separator between fields
COLLECTION ITEMS TERMINATED BY The separator between items within a field [usable for array and struct types]
MAP KEYS TERMINATED BY The separator between key and value in a map [usable for the map type]
# Create the table
create table union_testnew(
  foo uniontype<int, double, string, array<string>, map<string, string>>
)
row format delimited
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n'
stored as textfile;

# Data preparation
[root@master wadeyu]# vim union_test.log
1 0,1
2 1,3.0
3 2,world
4 3,wade:tom:polly
5 4,k1^Dv1:k2^Dv2

# Import data
hive (badou)> load data local inpath './union_test.log' overwrite into table union_testnew;

# query data
hive (badou)> select * from union_testnew;
OK
union_testnew.foo
{0:1}
{1:3.0}
{2:"world"} {3: ["wade"."tom"."polly"]} {4: {"k1":"v1"."k2":"v2"}} Time taken: 0.225 seconds, touch_type: 5 row(s)Copy the code

1. Using the array type

1.1 Basic array usage

Type definition: array<data_type>, e.g. array<string>, array<int>

Data representation: e.g. [string1,string2], [int1,int2]

# Original file
bmpjob P2 bgops service-team 10.0.0.212,10.0.0.225,10.0.0.243,10.0.55.31

# Create the table
hive> create table appinfo
    > (
    > appname string,
    > level string,
    > leader string,
    > dep string,
    > ips  array<string>)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ' '
    > COLLECTION ITEMS TERMINATED BY ',';

Load data to Hive
hive> load data inpath 'hdfs://hdfs-name/aiops/wander/appinfo.txt' overwrite into table appinfo;
Loading data to table test.appinfo
Table test.appinfo stats: [numFiles=1, numRows=0, totalSize=32568, rawDataSize=0]
OK

# Query the data
hive> select * from appinfo limit 1;
OK
bmpjob P2 bgops service-team ["10.0.0.212","10.0.0.225","10.0.0.243","10.0.55.31"]

hive> select appname,leader,ips[0] from appinfo limit 1;
OK
bmpjob bgops 10.0.0.212
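
A couple of other handy array operations, sketched on the same appinfo table (size() and array_contains() are standard Hive built-ins):

hive> select appname, size(ips) from appinfo limit 1;
bmpjob 4

hive> select appname from appinfo where array_contains(ips, '10.0.0.212') limit 1;
bmpjob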

1.2 Array type data conversion

Background: When the array type is used, a field usually holds multiple values, and often a single value needs to be filtered out. This is generally done with a user-defined table-generating function (UDTF); the built-in UDTF explode() covers the common need of turning one input row into multiple output rows.

Lateral view syntax

lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*

Array <struct> to string

Use the split function to convert the contents of the array
      
        structure into a string split with a ","
      
select split(array<string>,', ') from tablenameCopy the code

Syntax: lateral view explode(col3) tableAlias AS name

  • explode(array): generates one row for each element of the array

  • explode(map): generates one row for each key/value pair in the map


hive> select ip,appname from appinfo LATERAL VIEW explode(ips) tmpappinfo  AS ip limit 2;
10.0.0.212 bmpjob
10.0.0.225 bmpjob

Hive can use the concat_ws() function to merge multiple rows into a single column value

Use the concat_ws() and collect_set() functions together to deduplicate and merge data in the same column
# The collect_set() function merges the values of a column into an array<> (removing duplicates); concat_ws() then joins the array<> into a string with the given separator
## Sample data
hive> select * from tmp_jiangzl_test;
tmp_jiangzl_test.col1   tmp_jiangzl_test.col2   tmp_jiangzl_test.col3
a       b       1
a       b       2
a       b       3
c       d       4
c       d       5
c       d       6
For the above data, we can merge the col3 values for each (col1, col2) group
hive> select col1,col2,concat_ws(',',collect_set(col3)) from tmp_jiangzl_test group by col1,col2;
col1    col2    _c2
a       b       1,2,3
c       d       4,5,6
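
The reverse direction, expanding a delimited string back into rows, can be sketched by combining the built-ins split() and explode():

hive> select explode(split('1,2,3',','));
OK
1
2
3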

2. Use of struct<> types

Struct

Example:

# Raw data format
1,zhou:30
2,yan:30
3,chen:20
# Create the table
hive> create table test_struct(id INT, info struct<name:STRING, age:INT>)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > COLLECTION ITEMS TERMINATED BY ':';
# Load the data
hive> LOAD DATA LOCAL INPATH '/home/work/data/test5.txt' INTO TABLE test_struct;
# Query the data
hive> select info.age from test_struct;
Total MapReduce jobs = 1
...
Total MapReduce CPU Time Spent: 490 msec
OK
30
30
20
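
The whole struct can also be selected directly; Hive renders it as JSON-like text (a sketch, output format approximate):

hive> select id, info from test_struct limit 1;
OK
1 {"name":"zhou","age":30}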

3. Use of the map<> type

Data definition: map

Data representation: key:value,key:value… Example:
,int>

# Raw data format
1       job:80,team:60,person:70
2       job:60,team:80
3       job:90,team:70,person:100

# Create a table with a map structure
hive> create table employee(id string, perf map<string, int>)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'
    > COLLECTION ITEMS TERMINATED BY ','
    > MAP KEYS TERMINATED BY ':';

# Data import
hive>  LOAD DATA LOCAL INPATH '/home/work/data/test7.txt' INTO TABLE employee;

# Query the data
hive> select perf['person'] from employee;
Total MapReduce jobs = 1
...
Total MapReduce CPU Time Spent: 460 msec
OK
70
NULL
100

# Use explode() to expand the map
hive> select explode(perf) as (p_name,p_score) from employee limit 4;
OK
job 80
team 60
person 70

# use explode() and lateral view
hive> select id,p_name,p_score from employee lateral view explode(perf) perf as p_name,p_score limit 3;
OK
1 job 80
1 team 60
1 person 70

# Use size() to check the number of key/value pairs in the map
hive> select size(perf) from employee;
3
2
3
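
The built-ins map_keys() and map_values() are also handy here; a minimal sketch on the same employee table, output shown for its first row (key order may vary):

hive> select map_keys(perf), map_values(perf) from employee limit 1;
OK
["job","team","person"]	[80,60,70]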

Hive common functions

Note: Use show functions to view the related functions supported by Hive

1. List of common Hive functions

Standard functions use:

Function Description
round()/floor() Round a double to a bigint (to nearest / down)
abs() Returns the absolute value of a number
ucase() Converts a string to upper case
reverse() Reverses a string
concat() Concatenates its input strings into a single output string

Aggregate functions use:

Function Description
sum() Returns the sum of all input values
avg() Computes the average of all input values
min()/max() Compute the minimum and maximum of the input values

Note: Aggregate functions are usually combined with a group by clause; a quick sketch follows
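
A minimal sketch of aggregates with group by, reusing the tmp_jiangzl_test sample table from earlier (col3 is assumed to be numeric here):

hive> select col1, count(*), sum(col3), avg(col3) from tmp_jiangzl_test group by col1;
a       3       6       2.0
c       3       15      5.0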

Table generators: Table generators take zero or more inputs and produce multiple columns or rows of output.

Function Description
array() Converts its arguments into an array<> value
split(string, pattern) Splits a string into an array<string> on the pattern separator (escape special characters with \)
explode() Takes an array (or map) as input and returns one output row per element
collect_set() Aggregates the values of a column, deduplicated, into an array
collect_list() Same as collect_set(), but without deduplicating
concat_ws(separator, ...) Joins strings or an array<string> into a single string with the separator (supports string and array<string> arguments only)
cast(column as type) Converts the data type of column to type


Note: when a split() pattern appears inside double quotes (for example in hive -e "..."), four backslashes are needed for escaping

## The array() function converts a list of inputs into an array
hive> select array(1,2,3) from xuxuebiao;
OK
[1,2,3]
[1,2,3]

## explode() takes an array as input and returns one row per element
hive> select explode(array(1,2,3));
OK
1
2
3

## Use explode() with a lateral view to look up a single element of an array
hive> select * from appinfo LATERAL VIEW explode(ips) tmpappinfo AS realid where realid = '10.0.0.125';

## The collect_set() function
## collect_set() deduplicates the values of a column and merges them into an array
hive> select * from test;
OK
1       A
1       C
1       B
hive> select id,collect_set(name) from test group by id;
OK
1       ["A","C","B"]

2. Common condition judgment and data cleaning functions

When using Hive to process data, we usually need to clean and transform related data. In this case, we may use some conditional judgments and default value handlers.

The function name Function description
IF(condition, true_value, false_value) Evaluates the condition and returns true_value if it holds, otherwise false_value
CASE Statement Multiple conditional judgment
parse_url() It is used to clean URL-related functions and provides common URL parsing functions
parse_url_tuple() Same as parse_url(), but can extract multiple parts of the URL in one call
regexp_replace() Regular expression substitution
regexp_extract() Regular expression parsing
COALESCE(column,'') Returns the first non-NULL argument; used for NULL value conversion in Hive (NULL values are stored in HDFS as \N)

Example:

# if() is often used to clean data that comes in inconsistent formats
hive> select ip,if(assign != 'Allocation status unknown',0,assign) as fenpei from asset;
OK
10.0.0.1 Allocation status unknown

# case for multi-condition judgment
hive> select ip,
    case
when assign = 'Allocated' then 1
when assign = 'Unallocated' then 2
else 0
end
    as fenpei
from asset;

hive (ods)> select name,salary,
          > case when salary < 800 then 'low'
          > when salary >= 800 and salary <=5000 then 'middle'
          > when salary >5000 and salary <10000 then 'high'
          > else 'very high'
          > end as bracket
          > from emp1;


# The parse_url() function
hive> select parse_url('https://www.baidu.com/s?cl=3&tn=baidutop10&fr=top1000&wd=%E8%BF%AA%E5%A3%AB%E5%B0%BC%E6%94%B6%E8%B4%AD%E7%A6%8F%E5%85%8B%E6%96%AF&rsv_idx=2','HOST');
www.baidu.com

# Regular expressions
hive> select regexp_replace('foobar','oo|ar','');
select regexp_replace('foobar','oo|ar','-');
## Print what the first capture group (.*?) matched: the
select regexp_extract('foothebar','foo(.*?)(bar)',1);
## Print what the second capture group (bar) matched: bar
select regexp_extract('foothebar','foo(.*?)(bar)',2);
## Print the entire match
select regexp_extract('foothebar','foo(.*?)(bar)',0);


# Combining cleaning functions
select if(4 > 5, 5000, 100), coalesce(null,1,3,5), coalesce(null,null,null,null), case 3 when 1 then 'lala' when 2 then 'chye' else 'abc' end;

3. Hive advanced functions

row_number() over() — a window function that assigns a sequential number to each row within a partition of the result set; see the sketch below
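
A minimal sketch of row_number() over(), reusing the emp1 table from the CASE example above (assuming we want each employee's rank by salary):

hive> select name, salary,
    >        row_number() over (order by salary desc) as rn
    > from emp1;

# Take the top 3 salaries: compute rn in a subquery, then filter on it
hive> select name, salary from (
    >   select name, salary, row_number() over (order by salary desc) as rn from emp1
    > ) t where rn <= 3;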

Common Hive environment variables


Environment variable Meaning
set hive.cli.print.header=true Print column headers in query results
set hive.exec.dynamic.partition=true Enable dynamic partitioning
set hive.exec.dynamic.partition.mode=nonstrict Set the dynamic partition mode to non-strict
set hive.exec.max.dynamic.partitions.pernode=1000 Maximum number of dynamic partitions allowed on each node running MR
set hive.exec.max.dynamic.partitions=1000 Maximum number of dynamic partitions across all MR nodes
SET SERDEPROPERTIES('serialization.null.format' = '\N') Store Hive NULLs as '\N' (a '\N' in HDFS is read back as NULL)
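
A typical session preamble combining several of these settings before a dynamic-partition insert (values illustrative):

hive> set hive.cli.print.header=true;
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> set hive.exec.max.dynamic.partitions.pernode=1000;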
