Hive SQL
This article covers the SQL that Hive uses on a daily basis. Because there are many statements, they are organized as follows: Hive DDL and DQL syntax; Hive functions, including aggregate functions, conditional functions, date functions and string functions; row-to-column and column-to-row conversion with lateral view and explode; the reflect function; and window (analytic) functions.
## Hive DDL syntax
### Operations on a database
- CREATE DATABASE:
create database if not exists myhive;
The default location of Hive tables is specified by an attribute in hive-site.xml: hive.metastore.warehouse.dir
Create a database and specify its HDFS storage location: create database myhive2 location '/myhive2';
- Alter database
alter database myhive2 set dbproperties('createtime'='20210329');
Note: you can use the alter database command to modify some properties of a database, but its metadata is immutable, including the database name and the database location.
- View the database details
hive (myhive)> desc database myhive2;
hive (myhive)> desc database extended myhive2;
- Delete a database
Delete an empty database: drop database myhive2;
Force delete a database, including its tables: drop database myhive cascade;
#### Operations on data tables
Operations on managed (internal) tables:
- Create an internal table:
hive (myhive)> use myhive;
hive (myhive)> create table stu(id int, name string);
hive (myhive)> insert into stu values (1, "zhangsan");
hive (myhive)> insert into stu values (1, "zhangsan"), (2, "lisi");
hive (myhive)> select * from stu;
- Data types commonly used when building Hive tables:

| Classification | Type | Description | Literal example |
| --- | --- | --- | --- |
| Primitive type | BOOLEAN | true/false | true |
| | TINYINT | 1-byte signed integer, -128 to 127 | 1Y |
| | SMALLINT | 2-byte signed integer, -32768 to 32767 | 1S |
| | INT | 4-byte signed integer | 1 |
| | BIGINT | 8-byte signed integer | 1L |
| | FLOAT | 4-byte single-precision floating-point number | 1.0 |
| | DOUBLE | 8-byte double-precision floating-point number | 1.0 |
| | DECIMAL | arbitrary-precision signed decimal | 1.0 |
| | STRING | string, variable length | "a", "b" |
| | VARCHAR | variable-length string | "a", "b" |
| | CHAR | fixed-length string | "a", "b" |
| | BINARY | byte array | no literal form |
| | TIMESTAMP | timestamp, millisecond precision | 122327493795 |
| | DATE | date | '2016-03-29' |
| | INTERVAL | time interval | |
| Complex type | ARRAY | ordered collection of values of the same type | array(1, 2) |
| | MAP | key-value pairs; the key must be a primitive type, the value can be any type | map('a', 1, 'b', 2) |
| | STRUCT | a collection of fields that may have different types | struct('1', 1, 1.0), named_struct('col1', '1', 'col2', 1, 'col3', 1.0) |
| | UNION | a single value from a limited set of types | create_union(1, 'a', 63) |

> Usage note: DECIMAL(11,2) means a maximum of 11 digits, of which the last 2 are decimals and the integer part is 9 digits. If the integer part exceeds 9 digits, the field becomes NULL; if the decimal part has fewer than 2 digits, it is padded with 0 to 2 digits; if the decimal part has more than 2 digits, the excess is rounded off. You can also write DECIMAL without specifying digits; the default is DECIMAL(10,0), i.e. 10 integer digits and no decimals.
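To make the DECIMAL behavior above concrete, here is a small hedged sketch (the table name decimal_demo and the inserted values are made up purely for illustration):

create table if not exists decimal_demo (amount decimal(11,2));  -- 9 integer digits + 2 decimal digits
insert into decimal_demo values (123456789.1);     -- per the note above, stored with the decimal part padded to 2 digits: 123456789.10
insert into decimal_demo values (1.239);           -- extra decimal digits are rounded off: 1.24
insert into decimal_demo values (12345678901.12);  -- integer part exceeds 9 digits, so the stored value becomes NULL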
- Create a table and specify delimiters between fields
create table if not exists stu2(id int, name string) row format delimited fields terminated by '\t' stored as textfile location '/user/stu2';
> row format delimited fields terminated by '\t' specifies the field delimiter; Hive's default delimiter is '\001'. stored as specifies the storage format, and location specifies the storage path.
- Create a table from query results
create table stu3 as select * from stu2;
- Create a table from an existing table structure
create table stu4 like stu2;
- Query the table structure
desc stu2;
Query the table structure in detail:
desc formatted stu2;
- Query the statement that created the table
show create table stu2;
#### Operations on external tables
> When an external table is dropped, its data is still stored in HDFS and is not deleted; only the table's metadata is removed.
- Create an external table:
create external table student (s_id string, s_name string) row format delimited fields terminated by '\t';
- Load data from the local file system into the table
Append: load data local inpath '/export/servers/hivedatas/student.csv' into table student;
Overwrite: load data local inpath '/export/servers/hivedatas/student.csv' overwrite into table student;
- Load data from the HDFS file system to the table
load data inpath '/hivedatas/techer.csv' into table techer;
Load data and specify a partition:
load data inpath '/hivedatas/techer.csv' into table techer partition(cur_date=20201210);
> **Note**:
> 1. load data local loads a file from the local file system; the file is copied into HDFS.
> 2. load data (without local) loads a file from HDFS; the file is moved directly into the Hive table's directory rather than copied, because Hive assumes the file already has its replicas in HDFS, so there is no need to copy it again.
> 3. If a file with the same name already exists, the loaded file is automatically renamed.

#### Operations on partitioned tables
- Create a partitioned table:
create table score(s_id string, s_score int) partitioned by (month string);
- Create a table with multiple partitions
create table score2 (s_id string, s_score int) partitioned by (year string,month string,day string);
> **Note**: In Hive, a partition corresponds to a folder under the table's directory in HDFS, and Hive loads all files in that folder as the partition's data. For example, with partitioned by (day string), each folder under the table directory is one partition, and the folder name looks like day=20201123. To sync partitions whose folders already exist in HDFS into the metastore, use: msck repair table score;
- Load data into a partitioned table
load data local inpath '/export/servers/hivedatas/score.csv' into table score partition (month='201806');
- Load data into a multi-partitioned table
load data local inpath '/export/servers/hivedatas/score.csv' into table score2 partition(year='2018', month='06', day='01');
- View the partitions
show partitions score;
- Add a partition
alter table score add partition(month='201805');
- Add multiple partitions simultaneously
alter table score add partition(month='201804') partition(month='201803');
> Note: after adding a partition, you will see an extra folder under the table's directory in HDFS.
- Delete a partition
alter table score drop partition(month='201806');
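To illustrate the msck repair table command mentioned in the note above: if a partition folder is created directly in HDFS (bypassing Hive), the metastore will not know about it until the table is repaired. A minimal sketch, with a hypothetical warehouse path:

-- suppose this partition folder was created and filled manually, outside of Hive:
--   hdfs dfs -mkdir -p /user/hive/warehouse/myhive.db/score/month=201807
--   hdfs dfs -put score.csv /user/hive/warehouse/myhive.db/score/month=201807
msck repair table score;   -- sync the partition metadata with the folders found in HDFS
show partitions score;     -- month=201807 should now be listed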
#### Bucketed table operations
> Bucketing divides data into multiple buckets according to the specified field, i.e. rows are hashed on the bucket field into a fixed number of files. A partition is a folder; a bucket is a file.
> Advantages: 1. improve join query efficiency; 2. improve sampling efficiency.
- Enable the Hive bucketing function
set hive.enforce.bucketing=true;
- Set the number of reduce
set mapreduce.job.reduces=3;
- Create bucket table
create table course (c_id string,c_name string) clustered by(c_id) into 3 buckets;
> Loading data into a bucketed table: you cannot load data with hdfs dfs -put or with load data; it can only be loaded with insert overwrite. When loading data into a bucketed table, use insert overwrite ... select ... from a normal table.
insert overwrite table course select * from course_common cluster by(c_id);  -- finally, specify the bucket field
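One of the advantages listed above is cheaper sampling. A hedged sketch of bucket sampling on the course table created above, using Hive's tablesample clause:

-- read only bucket 1 out of the 3 buckets hashed on c_id, instead of scanning the whole table
select * from course tablesample(bucket 1 out of 3 on c_id);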
#### Modify and drop tables
- Rename a table
alter table old_table_name rename to new_table_name;
- Add/modify column information
desc score5;  -- query the table structure
alter table score5 add columns (mycol string, mysco string);  -- add columns
alter table score5 change column mysco mysconew int;  -- modify a column
- delete table operation
drop table score5;
- Clear table operation
truncate table score6;
Note: only managed (internal) tables can be truncated; truncating an external table produces an error.
> **Note on truncate vs drop**: if the HDFS trash (recycle bin) is enabled, data removed by drop can be restored from the trash.
#### Adding data to a Hive table
- Insert data directly into a partitioned table
insert into table score partition(month='201807') values ('001','002','100');
- Load data by means of load
load data local inpath '/export/servers/hivedatas/score.csv' overwrite into table score partition(month='201806');
- Load data by querying
insert overwrite table score2 partition(month='201806') select s_id,c_id,s_score from score1;
- Create a table and load data in the query statement
create table score2 as select * from score1;
- When creating the table, specify the path to load data via location
create external table score6 (s_id string,c_id string,s_score int) row format delimited fields terminated by ',' location '/myscore';
- Import data via export and import (only data exported with export can be imported)
create table techer2 like techer;  -- create a table from an existing table structure
export table techer to '/export/techer';
import table techer2 from '/export/techer';
#### Hive data export
- insert export
Export the query results to a local directory:
insert overwrite local directory '/export/servers/exporthive' select * from score;
Export the query results to a local directory in a specified format:
insert overwrite local directory '/export/servers/exporthive' row format delimited fields terminated by '\t' collection items terminated by '#' select * from student;
Export the query results to HDFS (not local):
insert overwrite directory '/export/servers/exporthive' row format delimited fields terminated by '\t' collection items terminated by '#' select * from score;
- Export to the local file system with a Hadoop command
dfs -get /export/servers/exporthive/000000_0 /export/servers/exporthive/local.txt;
- Hive shell command export
Basic syntax: hive -f/-e "statement or script" > file
hive -e "select * from myhive.score;" > /export/servers/exporthive/score.txt
hive -f export.sh > /export/servers/exporthive/score.txt
- Export to HDFS
export table score to '/export/exporthive/score';
## Hive DQL query syntax
### Single table query
SELECT [ALL | DISTINCT] select_expr, select_expr, …
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list [HAVING condition]]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY| ORDER BY col_list]
]
[LIMIT number]
> Note:
> 1. ORDER BY performs a global ordering of the input, so there is only one reducer, which leads to long computation time when the input is large.
> 2. SORT BY is not a global sort; it sorts the data before it enters each reducer. Therefore, if you sort with sort by and set mapred.reduce.tasks > 1, sort by only guarantees the output order within each reducer, not the global order.
> 3. DISTRIBUTE BY (field) distributes rows to different reducers according to the specified field, using a hash algorithm.
> 4. CLUSTER BY (field) not only distributes by the field but also sorts by it. Therefore, if the distribute and sort fields are the same, cluster by = distribute by + sort by.
- WHERE statement
select * from score where s_score < 60;
> Note: comparisons such as < do not match NULL; the query above excludes rows whose s_score is NULL.
- GROUP BY statement
select s_id ,avg(s_score) from score group by s_id;
select s_id, avg(s_score) as avgscore from score group by s_id having avgscore > 85;
> Note: if group by is used, the select list can only contain the grouped fields or aggregate functions. The difference between where and having: 1. having filters the data after group by, so the fields it filters can only be grouped fields or aggregate functions; 2. where filters rows directly from the table data, so it cannot be placed after group by and cannot use aggregate functions.
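A minimal sketch of the where/having difference described above, reusing the score table from the earlier examples:

select s_id, avg(s_score) as avgscore
from score
where s_score is not null      -- where filters individual rows before grouping
group by s_id
having avgscore > 85;          -- having filters groups and may reference aggregates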
- JOIN statement
INNER JOIN: select * from techer t [inner] join course c on t.t_id = c.t_id;  -- inner can be omitted
LEFT OUTER JOIN: select * from techer t left join course c on t.t_id = c.t_id;  -- outer can be omitted
RIGHT OUTER JOIN: select * from techer t right join course c on t.t_id = c.t_id;
FULL OUTER JOIN: returns all records from both tables that meet the condition; if the join field in either table has no matching value, NULL is used instead. select * from techer t full join course c on t.t_id = c.t_id;
> Note:
> 1. Hive 2 already supports non-equi joins, which means you can use >, < symbols in the join on condition; it also supports or in the on condition (earlier versions only supported = and and after on, not >, < or or).
> 2. If the Hive execution engine is MapReduce, each join starts one job; a statement with multiple joins starts multiple jobs.
> Note: select * from table_a, table_b where table_a.id = table_b.id; (a comma join followed by where) and a join ... on ... have no difference in execution efficiency; only the syntax differs: the comma join is followed by where, and the join is followed by on.
- ORDER BY sorting
Global sort; there is only one reducer. ASC: ascending (default); DESC: descending.
select * from student s left join score sco on s.s_id = sco.s_id order by sco.s_score desc;
> Note: order by is a global sort, so there is only one reducer at the end, executed on a single node; if the amount of data is very large, it will take a long time.
- SORT BY local sorting
Each MapReduce is sorted internally, not for the global result set.
Set the number of reducers:
set mapreduce.job.reduces=3;
View the configured number of reducers:
set mapreduce.job.reduces;
select * from score sort by s_score;
Export the query results to a file (sorted by score in descending order):
insert overwrite local directory '/export/servers/hivedatas/sort' select * from score sort by s_score;
- DISTRIBUTE BY partitioned sorting
DISTRIBUTE BY: similar to the partitioner in MR; used in combination with sort by.
Set the number of reducers, so that the corresponding s_id values go to the corresponding reducers:
set mapreduce.job.reduces=7;
Partition by s_id and sort by s_score within each partition:
select * from score distribute by s_id sort by s_score;
> Note: Hive requires the distribute by statement to be written before the sort by statement.
- CLUSTER BY
When the distribute by and sort by fields are the same, cluster by can be used instead. cluster by has the function of both distribute by and sort by, but the sort can only be ascending; you cannot specify ASC or DESC.
select * from score cluster by s_id;  -- equivalent to: select * from score distribute by s_id sort by s_id;
## Hive functions
### Aggregate functions
Hive supports common aggregate functions such as count(), max(), min(), sum(), avg(), etc.
> Note: pay attention to NULL values when aggregating: count(*) includes NULL values and counts all rows; count(id) does not count NULL values; min takes the minimum of the non-NULL values and only returns NULL if all values are NULL; avg likewise ignores NULL values.
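A small hedged sketch of this NULL behavior, assuming the stu table from earlier contains a row whose id is NULL:

select count(*)  from stu;   -- counts every row, including the row whose id is NULL
select count(id) from stu;   -- rows whose id is NULL are not counted
select min(id)   from stu;   -- NULLs are ignored; only returns NULL if all ids are NULL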
- Population variance function: var_pop
Syntax: var_pop(col) Return value: double Description: the population variance of the non-null col values in the result set.
- Sample variance function: var_samp
Syntax: var_samp(col) Return value: double Description: the sample variance of the non-null col values in the result set.
- Population standard deviation function: stddev_pop
Syntax: stddev_pop(col) Return value: double Description: computes the population standard deviation, i.e. the square root of the population variance; it returns the same value as the square root of var_pop.
- Percentile (median) function: percentile
Syntax: percentile(BIGINT col, p) Return value: double Description: returns the exact p-th percentile; p must be between 0 and 1, and the col field currently only supports integer types, not floating-point types.
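A hedged example of percentile on the score table used earlier in this article (the column is cast to bigint because, as noted above, only integer types are supported):

select percentile(cast(s_score as bigint), 0.5) from score;   -- the median (0.5 percentile) of s_score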
### Relational Operations
Supported: equal (=), not equal (!= or <>), less than (<), less than or equal to (<=), greater than (>), greater than or equal to (>=)
- NULL check: IS NULL; non-NULL check: IS NOT NULL
- LIKE operation: LIKE
Syntax: A LIKE B. If string A or string B is NULL, the result is NULL; the result is TRUE if string A matches the pattern B, otherwise FALSE. In B, the character '_' matches any single character and the character '%' matches any number of characters.
- Java regular expression operation: RLIKE
Syntax: A RLIKE B. If string A or string B is NULL, the result is NULL; the result is TRUE if string A matches the Java regular expression B, otherwise FALSE.
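For comparison with the REGEXP example below, the same check can be written with RLIKE (REGEXP and RLIKE behave the same way in Hive):

select 1 from tableName where 'footbar' rlike '^f.*r$';  -- result: 1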
- REGEXP operation: REGEXP (functionally the same as RLIKE)
select 1 from tableName where 'footbar' REGEXP '^f.*r$';  -- result: 1
### Math
Supports all numeric types: addition (+), subtraction (-), multiplication (*), division (/), modulo (%), bitwise AND (&), bitwise OR (|), bitwise XOR (^), bitwise NOT (~)
### Logical Operation
Support: Logic AND (AND), Logic OR (OR), Logic NOT (NOT)
### Numeric functions
- Rounding function: round
hive> select round(3.1415926) from tableName;  3
- Round to a specified number of decimal places: round
hive> select round(3.1415926,4) from tableName;  3.1416
- Round-down function: floor
hive> select floor(3.641) from tableName;  3
- Round-up function: ceil
hive> select ceil(3.1415926) from tableName;  4
- Random number function: rand
Syntax: rand(), rand(int seed) Return value: double Description: returns a random number between 0 and 1.
hive> select rand() from tableName;  -- each execution returns a different result, e.g. 0.5577432776034763
hive> select rand(100);  -- with a fixed seed, each execution returns the same result, e.g. 0.7220096548596434
- Natural exponential function: exp
hive> select exp(2);  7.38905609893065
- Base-10 logarithm function: log10
hive> select log10(100);  2.0
> There are also: the base-2 logarithm function log2() and the general logarithm function log().
- Power function: pow
hive> select pow(2,4);  16.0
- Square root function: sqrt
hive> select sqrt(16);  4.0
- Binary function: bin
hive> select bin(7);  111
- Base conversion function: conv
Syntax: conv(bigint num, int from_base, int to_base) Description: converts num from base from_base to base to_base.
In addition, there are many other math functions: absolute value abs(), modulo pmod(), sine sin(), arcsine asin(), cosine cos(), arccosine acos(), positive(), negative().
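A short hedged example of conv, converting decimal 17 to hexadecimal and to binary:

hive> select conv(17,10,16);  11
hive> select conv(17,10,2);  10001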
### Conditional functions
- If function: if
Syntax: if(boolean testCondition, T valueTrue, T valueFalseOrNull) Return value: T Description: returns valueTrue when testCondition is TRUE, otherwise returns valueFalseOrNull.
hive> select if(1=2,100,200);  200
hive> select if(1=1,100,200);  100
- Non-null lookup function: COALESCE
Syntax: COALESCE(T v1, T v2, ...) Return value: T Description: returns the first non-null value in the parameter list; returns NULL if all values are NULL.
hive> select coalesce(NULL,'100','50');  100
- Conditional judgment function: case when
Syntax: case when a then b [when c then d]* [else e] end Return value: T Description: if a is TRUE, return b; if c is TRUE, return d; otherwise return e.
hive> select case when 1=2 then 'tom' when 2=2 then 'mary' else 'tim' end from tableName;  mary
- Conditional judgment function: case a when b
Syntax: case a when b then c [when d then e]* [else f] end Return value: T Description: if a equals b, return c; if a equals d, return e; otherwise return f.
hive> select case 100 when 50 then 'tom' when 100 then 'mary' else 'tim' end from tableName;  mary
### Date functions
- 1) Get the current UNIX timestamp: unix_timestamp
hive> select unix_timestamp() from tableName;  1616906976
- 2) UNIX timestamp to date function: from_unixtime
Syntax: from_unixtime(bigint unixtime [, string format]) Return value: string Description: converts a UNIX timestamp to a date string in the given format.
hive> select from_unixtime(1616906976,'yyyyMMdd') from tableName;  20210328
- 3) Date to UNIX timestamp function: unix_timestamp
Syntax: unix_timestamp(string date) Return value: bigint Description: converts a date in the format 'yyyy-MM-dd HH:mm:ss' to a UNIX timestamp; returns 0 if the conversion fails.
hive> select unix_timestamp('2021-03-08 14:21:15') from tableName;  1615184475
- 4) Date in a specified format to UNIX timestamp function: unix_timestamp
Syntax: unix_timestamp(string date, string pattern) Return value: bigint Description: converts a date in the given pattern to a UNIX timestamp; returns 0 if the conversion fails.
hive> select unix_timestamp('2021-03-08 14:21:15','yyyy-MM-dd HH:mm:ss') from tableName;  1615184475
- 5) Date function: to_date
Syntax: to_date(string timestamp) Return value: string Description: returns the date part of a datetime field.
hive> select to_date('2021-03-28 14:03:01') from tableName;  2021-03-28
- 6) Date to year function: year
Syntax: year(string date) Return value: int Description: returns the year in the date.
hive> select year('2021-03-28 10:03:01') from tableName;  2021
hive> select year('2021-03-28') from tableName;  2021
- 7) Date to month function: month
Syntax: month(string date) Return value: int Description: returns the month in the date.
hive> select month('2020-12-28 12:03:01') from tableName;  12
hive> select month('2021-03-08') from tableName;  3
- 8) Date to day function: day
Syntax: day(string date) Return value: int Description: returns the day in the date.
hive> select day('2020-12-08 10:03:01') from tableName;  8
hive> select day('2020-12-24') from tableName;  24
- 9) Date to hour function: hour
Syntax: hour(string date) Return value: int Description: returns the hour in the date.
hive> select hour('2020-12-08 10:03:01') from tableName;  10
- 10) Date to minute function: minute
Syntax: minute(string date) Return value: int Description: returns the minute in the date.
hive> select minute('2020-12-08 10:03:01') from tableName;  3
- 11) Date to second function: second
Syntax: second(string date) Return value: int Description: returns the second in the date.
hive> select second('2020-12-08 10:03:01') from tableName;  1
- 12) Date to week-of-year function: weekofyear
Syntax: weekofyear(string date) Return value: int Description: returns the week of the year that the date falls in.
hive> select weekofyear('2020-12-08 10:03:01') from tableName;  49
- 13) Date comparison function: datediff
Syntax: datediff(string enddate, string startdate) Return value: int Description: returns the number of days from startdate to enddate.
hive> select datediff('2020-12-08','2020-05-09') from tableName;  213
- 14) Date increment function: date_add
Syntax: date_add(string startdate, int days) Return value: string Description: returns the date that is days days after startdate.
hive> select date_add('2020-12-08',10) from tableName;  2020-12-18
- 15) Date decrement function: date_sub
Syntax: date_sub(string startdate, int days) Return value: string Description: returns the date that is days days before startdate.
hive> select date_sub('2020-12-08',10) from tableName;  2020-11-28
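A hedged example combining several of the date functions above, for instance computing yesterday's date from the current timestamp:

select date_sub(to_date(from_unixtime(unix_timestamp())), 1) from tableName;  -- yesterday's date, e.g. 2021-03-27 when run on 2021-03-28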
### String functions
- 1) String length function: length
hive> select length('abcedfg') from tableName;  7
- 2) String reversal function: reverse
hive> select reverse('abcedfg') from tableName;  gfdecba
- 3) String concatenation function: concat
Syntax: concat(string A, string B...)
hive> select concat('abc','def','gh') from tableName;  abcdefgh
- 4) Delimited string concatenation function: concat_ws
Syntax: concat_ws(string SEP, string A, string B...)
hive> select concat_ws(',','abc','def','gh') from tableName;  abc,def,gh
- 5) Substring function: substr, substring
Syntax: substr(string A, int start), substring(string A, int start) Description: returns the substring of A starting from position start.
hive> select substr('abcde',3) from tableName;  cde
hive> select substring('abcde',3) from tableName;  cde
hive> select substr('abcde',-1) from tableName;  e  (same as in Oracle)
- 6) Substring function with length: substr, substring
Syntax: substr(string A, int start, int len), substring(string A, int start, int len) Description: returns the substring of A of length len starting from position start.
hive> select substr('abcde',3,2) from tableName;  cd
hive> select substring('abcde',3,2) from tableName;  cd
hive> select substring('abcde',-2,2) from tableName;  de
- 7) String to uppercase function: upper, ucase
hive> select upper('abSEd') from tableName;  ABSED
hive> select ucase('abSEd') from tableName;  ABSED
- 8) String to lowercase function: lower, lcase
hive> select lower('abSEd') from tableName;  absed
hive> select lcase('abSEd') from tableName;  absed
- 9) Trim whitespace function: trim
hive> select trim(' abc ') from tableName;  abc
- 10) Trim whitespace on the left: ltrim
hive> select ltrim(' abc ') from tableName;  abc
- 11) Trim whitespace on the right: rtrim
hive> select rtrim(' abc ') from tableName;  abc
- 12) Regular expression replacement function: regexp_replace
Syntax: regexp_replace(string A, string B, string C) Return value: string Description: replaces the parts of string A that match the Java regular expression B with C. Note that escape characters may be needed, similar to the regexp_replace function in Oracle.
hive> select regexp_replace('foobar', 'oo|ar', '') from tableName;  fb
- 13) Regular expression extraction function: regexp_extract
Syntax: regexp_extract(string subject, string pattern, int index) Description: matches the string subject against the regular expression pattern and returns the group specified by index.
hive> select regexp_extract('foothebar', 'foo(.*?)(bar)', 1) from tableName;  the
hive> select regexp_extract('foothebar', 'foo(.*?)(bar)', 2) from tableName;  bar
hive> select regexp_extract('foothebar', 'foo(.*?)(bar)', 0) from tableName;  foothebar
Note that in some cases escape characters are needed: the equals sign below is escaped with a double backslash, which is a Java regular expression rule.
select data_field,
regexp_extract(data_field,'.*?bgStart\\=([^&]+)',1) as aaa,
regexp_extract(data_field,'.*?contentLoaded_headStart\\=([^&]+)',1) as bbb,
regexp_extract(data_field,'.*?AppLoad2Req\\=([^&]+)',1) as ccc
from pt_nginx_loginlog_st
where pt = '2021-03-28' limit 2;
- 14) URL parsing function: parse_url
Syntax: parse_url(string urlString, string partToExtract [, string keyToExtract]) Return value: string Description: returns the specified part of the URL. Valid values for partToExtract are: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE and USERINFO.
hive> select parse_url('https://www.tableName.com/pat…', 'HOST') from tableName;  www.tableName.com
hive> select parse_url('https://www.tableName.com/pat…', 'QUERY', 'k1') from tableName;  v1
- 15) JSON parsing function: get_json_object
Syntax: get_json_object(string json_string, string path) Return value: string Description: parses the JSON string json_string and returns the content at the specified path; returns NULL if the input JSON string is invalid.
hive> select get_json_object('{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.95,"color":"red"}},"email":"amy@only_for_json_udf_test.net","owner":"amy"}','$.owner') from tableName;  amy
- 16) Space string function: space
hive> select space(10) from tableName;
hive> select length(space(10)) from tableName;  10
- 17) Repeat string function: repeat
Syntax: repeat(string str, int n)
hive> select repeat('abc',5) from tableName;  abcabcabcabcabc
- 18) First character ASCII function: ascii
hive> select ascii('abcde') from tableName;  97
- 19) Left-pad function: lpad
Syntax: lpad(string str, int len, string pad)
hive> select lpad('abc',10,'td') from tableName;  tdtdtdtabc
Note: unlike GP and Oracle, pad cannot be omitted.
- 20) Right-pad function: rpad
Syntax: rpad(string str, int len, string pad)
hive> select rpad('abc',10,'td') from tableName;  abctdtdtdt
- 21) Split string function: split
Syntax: split(string str, string pat)
hive> select split('abtcdtef','t') from tableName;  ["ab","cd","ef"]
- 22) Set lookup function: find_in_set
Syntax: find_in_set(string str, string strList) Description: returns the position of the first occurrence of str in strList, where strList is a comma-separated string; returns 0 if str is not found.
hive> select find_in_set('ab','ef,ab,de') from tableName;  2
hive> select find_in_set('at','ef,ab,de') from tableName;  0
### Composite type construction
- 1) Map type construction: map
Syntax: map(key1, value1, key2, value2, ...)
hive> create table mapTable as select map('100','tom','200','mary') as t from tableName;
hive> describe mapTable;  t map<string,string>
hive> select t from mapTable;  {"100":"tom","200":"mary"}
- 2) Struct type construction: struct
Syntax: struct(val1, val2, val3, ...)
hive> create table struct_table as select struct('tom','mary','tim') as t from tableName;
hive> describe struct_table;  t struct<col1:string,col2:string,col3:string>
hive> select t from struct_table;  {"col1":"tom","col2":"mary","col3":"tim"}
- 3) Array type construction: array
Syntax: array(val1, val2, ...)
hive> create table arr_table as select array("tom","mary","tim") as t from tableName;
hive> describe arr_table;  t array<string>
hive> select t from arr_table;  ["tom","mary","tim"]
### Complex type access operations
- 1) Array access: A[n]
Description: A is an Array and n is an int. Returns the n-th element of array A; indexing starts at 0. For example, if A is an array of type ['foo', 'bar'], A[0] returns 'foo' and A[1] returns 'bar'.
hive> create table arr_table2 as select array("tom","mary","tim") as t from tableName;
hive> select t[0],t[1] from arr_table2;  tom mary
- 2) Map access: M[key]
Description: M is a Map and key is a key of the map. Returns the value corresponding to the specified key in map M. For example, if M is a map with values {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'}, then M['all'] returns 'foobar'.
hive> create table map_table2 as select map('100','tom','200','mary') as t from tableName;
hive> select t['200'],t['100'] from map_table2;  mary tom
- 3) Struct access: S.x
Description: S is a struct. Returns the x field of struct S. For example, for struct foobar {int foo, int bar}, foobar.foo returns the foo field.
hive> create table str_table2 as select struct('tom','mary','tim') as t from tableName;
hive> describe str_table2;  t struct<col1:string,col2:string,col3:string>
hive> select t.col1,t.col3 from str_table2;  tom tim
### Complex type length functions
- 1) Map size function: size(Map<K.V>)
Syntax: size(Map<K.V>) Return value: int Description: returns the number of key-value pairs in the map.
hive> select size(t) from map_table2;  2
- 2) Array size function: size(array<T>)
Syntax: size(array<T>) Return value: int Description: returns the number of elements in the array.
hive> select size(t) from arr_table2;  4
- 3) Type conversion function: cast
Syntax: cast(expr as <type>) Return value: the target type Description: converts the expression expr to the given type; returns NULL if the conversion fails.
hive> select cast('1' as bigint) from tableName;  1
### Row splitting with explode and lateral view
- explode can split a Map or an Array stored in a Hive column into multiple rows.
- lateral view is used together with UDTFs such as split and explode: it calls the UDTF for each row of the original table, the UDTF splits that row into one or more rows, and lateral view combines the results into a virtual table that can be joined with the other columns of the original table.
Requirement: the current data format is as follows:
zhangsan child1,child2,child3,child4 k1:v1,k2:v2
lisi child5,child6,child7,child8 k3:v3,k4:v4
Fields are separated with '\t'. The requirement is to split all the children into a single column:
mychild |
---|
child1 |
child2 |
child3 |
child4 |
child5 |
child6 |
child7 |
child8 |
The key and value of the map are also broken down to the following result
mymapkey | mymapvalue |
---|---|
k1 | v1 |
k2 | v2 |
k3 | v3 |
k4 | v4 |
- 1) Create the Hive database
hive (default)> create database hive_explode;
hive (default)> use hive_explode;
- 2) Create a table and explode it
hive (hive_explode)> create table t3(name string,children array<string>,address Map<string,string>) row format delimited fields terminated by ‘\t’ collection items terminated by ‘,’ map keys terminated by ‘:’ stored as textFile;
- 3) Load data
Node03 executes the following command to create the table data file
mkdir -p /export/servers/hivedatas/
cd /export/servers/hivedatas/
vim maparray
The content is as follows:
zhangsan child1,child2,child3,child4 k1:v1,k2:v2
lisi child5,child6,child7,child8 k3:v3,k4:v4
Load the data into the Hive table:
hive (hive_explode)> load data local inpath '/export/servers/hivedatas/maparray' into table t3;
- 4) Use explode to split the data
Split the data in the Array:
hive (hive_explode)> SELECT explode(children) AS myChild FROM t3;
Split the data in the Map
hive (hive_explode)> SELECT explode(address) AS (myMapKey, myMapValue) FROM t3;
Requirement: There are now some data formats as follows:
a:shandong,b:beijing,c:hebei|1,2,3,4,5,6,7,8,9|[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]
The field separator is '|'. We need to extract all the monthSales values as a column (row splitting): 4900 2090 6987
- 1) Create a Hive table
hive (hive_explode)> create table explode_lateral_view
> (`area` string,
> `goods_id` string,
> `sale_info` string)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '|'
> STORED AS textfile;
- 2) Prepare data and load data
Prepare the data:
cd /export/servers/hivedatas
vim explode_json
a:shandong,b:beijing,c:hebei|1,2,3,4,5,6,7,8,9|[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]
Load the data into the Hive table:
hive (hive_explode)> load data local inpath '/export/servers/hivedatas/explode_json' overwrite into table explode_lateral_view;
- 3) Use explode to split goods_id
hive (hive_explode)> select explode(split(goods_id,',')) as goods_id from explode_lateral_view;
- 4) Use explode to split area
hive (hive_explode)> select explode(split(area,',')) as area from explode_lateral_view;
- 5) Split the JSON array field
hive (hive_explode)> select explode(split(regexp_replace(regexp_replace(sale_info,'\[\{',''),'}]',''),'},\{')) as sale_info from explode_lateral_view;
Then we want to use get_json_object to get the value whose key is monthSales:
hive (hive_explode)> select get_json_object(explode(split(regexp_replace(regexp_replace(sale_info,'\[\{',''),'}]',''),'},\{')),'$.monthSales') as sale_info from explode_lateral_view;
This fails with: FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions.
When explode is used together with other fields, e.g. select explode(split(area,',')) as area, good_id from explode_lateral_view; it reports: FAILED: SemanticException 1:40 Only a single expression in the SELECT clause is supported with UDTF's. Error encountered near token 'good_id'.
When a UDTF is used, only a single field is supported in the select list; to query other fields as well, a LATERAL VIEW is required.
Use lateral view together with explode to query multiple fields:
hive (hive_explode)> select goods_id2,sale_info from explode_lateral_view LATERAL VIEW explode(split(goods_id,','))goods as goods_id2;
LATERAL VIEW explode(split(goods_id,','))goods creates a virtual table named goods, which is joined (a Cartesian product) with the original explode_lateral_view table.
It can also be used multiple times
hive (hive_explode)> select goods_id2,sale_info,area2
from explode_lateral_view LATERAL VIEW explode(split(goods_id,','))goods as goods_id2 LATERAL VIEW explode(split(area,','))area as area2; It's also the Cartesian product of three tables
Finally, we can use the following sentence to completely convert this JSON formatted row of data into a two-dimensional table
hive (hive_explode)> select get_json_object(concat('{',sale_info_1,'}'),'$.source') as source,get_json_object(concat('{',sale_info_1,'}'),'$.monthSales') as monthSales,get_json_object(concat('{',sale_info_1,'}'),'$.userCount') as userCount,get_json_object(concat('{',sale_info_1,'}'),'$.score') as score from explode_lateral_view LATERAL VIEW explode(split(regexp_replace(regexp_replace(sale_info,'\[\{',''),'}]',''),'},\{'))sale_info as sale_info_1;
Conclusion: lateral view usually appears together with UDTFs to work around the restriction that a UDTF cannot be combined with other select fields; multiple lateral views can produce a Cartesian product. The OUTER keyword can be used to keep rows whose UDTF result is empty (outputting NULL), preventing data loss.

### Row-to-column aggregation functions: CONCAT, CONCAT_WS, COLLECT_SET
- CONCAT(string A/col, string B/col...): returns the concatenation of the input strings; any number of input strings is supported.
- CONCAT_WS(separator, str1, str2, ...): a special form of CONCAT(). The first argument is the separator placed between the remaining arguments; the separator can be the same string as the remaining arguments. If the separator is NULL, the result is NULL. The function skips any NULL or empty strings after the separator argument, and the separator is added between the concatenated strings.
- COLLECT_SET(col): accepts only primitive data types; its main purpose is to de-duplicate and aggregate the values of a field into an Array.

Data preparation:

| name | constellation | blood_type |
| --- | --- | --- |
| Sun Wukong | Aries | A |
| Lao Wang | Sagittarius | A |
| Song Song | Aries | B |
| Zhu Bajie | Aries | A |
| Feng Jie | Sagittarius | A |

Requirement: group people of the same constellation and blood type together. The expected result:

Sagittarius,A Lao Wang|Feng Jie
Aries,A Sun Wukong|Zhu Bajie
Aries,B Song Song
- 1) Create the local constellation.txt and import the data
On the node03 server, run the following commands to create the file; note that the fields are separated with '\t':
cd /export/servers/hivedatas
vim constellation.txt
The data is as follows:
Sun Wukong Aries A
Lao Wang Sagittarius A
Song Song Aries B
Zhu Bajie Aries A
Feng Jie Sagittarius A
- 2) Create Hive table and import data
hive (hive_explode)> create table person_info(
name string,
constellation string,
blood_type string)
row format delimited fields terminated by "\t";
Load the data:
hive (hive_explode)> load data local inpath '/export/servers/hivedatas/constellation.txt' into table person_info;
- 3) Query data according to requirements
hive (hive_explode)> select
t1.base,
concat_ws('|', collect_set(t1.name)) name
from
(select
name,
concat(constellation, "," , blood_type) base
from
person_info) t1
group by
t1.base;
### Column-to-row functions: EXPLODE and LATERAL VIEW
- EXPLODE(col): splits the complex Array or Map stored in one Hive column into multiple rows.
- LATERAL VIEW syntax: LATERAL VIEW udtf(expression) tableAlias AS columnAlias. Used with UDTFs such as split and explode, it splits a column of data into multiple rows and then aggregates the split data.
Data preparation:
cd /export/servers/hivedatas
vim movie.txt
The file content is as follows (fields are separated with '\t'):
"Person of Interest" mystery,action,science fiction,drama
"Lie to Me" mystery,police,action,psychology,drama
"Wolf Warrior 2" war,action,disaster
Requirement: expand the array data in the movie category. The expected result:
"Person of Interest" mystery
"Person of Interest" action
"Person of Interest" science fiction
"Person of Interest" drama
"Lie to Me" mystery
"Lie to Me" police
"Lie to Me" action
"Lie to Me" psychology
"Lie to Me" drama
"Wolf Warrior 2" war
"Wolf Warrior 2" action
"Wolf Warrior 2" disaster
- 1) Create a Hive table
create table movie_info(
movie string,
category array<string>)
row format delimited fields terminated by "\t"
collection items terminated by ",";
- 2) Load data
load data local inpath “/export/servers/hivedatas/movie.txt” into table movie_info;
- 3) Query data according to requirements
select
movie,
category_name
from
movie_info lateral view explode(category) table_tmp as category_name;
### reflect function
The reflect function supports calling native Java methods directly in SQL, which can replace many custom UDFs.
Requirement 1: use max in java.lang.Math to find the maximum of two columns.
- 1) Create a Hive table
create table test_udf(col1 int,col2 int) row format delimited fields terminated by ',';
- 2) Prepare data and load data
cd /export/servers/hivedatas
vim test_udf
The file content is as follows (one record per line):
1,2
4,3
6,4
7,5
5,6
- 3) Load data
hive (hive_explode)> load data local inpath '/export/servers/hivedatas/test_udf' overwrite into table test_udf;
-4) Find the maximum value of the two columns using Max in java.lang.Math
hive (hive_explode)> select reflect("java.lang.Math","max",col1,col2) from test_udf;
Requirement 2: Different records in files to perform different Java built-in function implementation steps: - 1) Create Hive tables
hive (hive_explode)> create table test_udf2(class_name string,method_name string,col1 int,col2 int) row format delimited fields terminated by ',';
- 2) Prepare data
cd /export/servers/hivedatas
vim test_udf2
The file content is as follows:
java.lang.Math,min,1,2
java.lang.Math,max,2,3
- 3) Load data
hive (hive_explode)> load data local inpath '/export/servers/hivedatas/test_udf2' overwrite into table test_udf2;
-4) Execute the query
hive (hive_explode)> select reflect(class_name,method_name,col1,col2) from test_udf2;
Requirement 3: determine whether a string is numeric. Implementation: use a function from Apache Commons. The commons JARs are already on Hadoop's classpath, so they can be used directly.
select reflect("org.apache.commons.lang.math.NumberUtils","isNumber","123")
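Another common hedged use of reflect is calling a no-argument static JDK method, for example generating a UUID with java.util.UUID (part of the JDK, so no extra JAR is needed):

select reflect("java.util.UUID","randomUUID");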
In SQL there is a class of functions called aggregate functions, such as sum(), avg(), max(), etc. These functions aggregate multiple rows into one row according to some rule; generally, the number of rows after aggregation is smaller than the number of rows before aggregation. But sometimes we want to display both the pre-aggregation data and the aggregated data, and this is where window functions come in. Window functions, also known as OLAP functions/analytic functions, combine grouping and sorting. The most important keywords of a window function are **PARTITION BY** and **ORDER BY**, used as **over(partition by xxx order by xxx)**.
### SUM, AVG, MIN, MAX
Create the table:
create table test_t1(
cookieid string,
createtime string,   --day
pv int
) row format delimited fields terminated by ',';
load data local inpath '/root/hivedata/test_t1.dat' into table test_t1;
The data:
cookie1,2020-04-10,1
cookie1,2020-04-11,5
cookie1,2020-04-12,7
cookie1,2020-04-13,3
cookie1,2020-04-14,2
cookie1,2020-04-15,4
cookie1,2020-04-16,4
Enable smart local mode:
set hive.exec.mode.local.auto=true;
The SUM function is used in conjunction with the window function: the result is related to the ORDER BY and is in ascending ORDER BY default.
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime) as pv1
from test_t1;
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime rows between unbounded preceding and current row) as pv2
from test_t1;
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid) as pv3
from test_t1;
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and current row) as pv4
from test_t1;
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and 1 following) as pv5
from test_t1;
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime rows between current row and unbounded following) as pv6
from test_t1;
pv1: cumulative pv from the start of the group to the current row, e.g. pv1 of the 11th = pv of the 10th + pv of the 11th, pv1 of the 12th = pv of the 10th + 11th + 12th;
pv2: same as pv1;
pv3: cumulative pv of all rows in the group (cookie1);
pv4: current row plus the previous 3 rows within the group, e.g. the 11th = 10th + 11th, the 12th = 10th + 11th + 12th, the 13th = 10th + 11th + 12th + 13th, the 14th = 11th + 12th + 13th + 14th;
pv5: current row plus the previous 3 rows and the following 1 row within the group, e.g. the 14th = 11th + 12th + 13th + 14th + 15th = 5+7+3+2+4 = 21;
pv6: current row plus all following rows within the group, e.g. the 13th = 13th + 14th + 15th + 16th = 3+2+4+4 = 13, the 14th = 14th + 15th + 16th = 2+4+4 = 10.

If rows between is not specified, the default window is from the start of the partition to the current row; if order by is not specified, all values in the group are accumulated.
Key meanings of rows between: preceding: rows before; following: rows after; current row: the current row; unbounded: without bound; unbounded preceding: from the start; unbounded following: to the end.
AVG, MIN and MAX are used in the same way as SUM.
### ROW_NUMBER, RANK, DENSE_RANK, NTILE
Prepare the data:
cookie1,2020-04-10,1
cookie1,2020-04-11,5
cookie1,2020-04-12,7
cookie1,2020-04-13,3
cookie1,2020-04-14,2
cookie1,2020-04-15,4
cookie1,2020-04-16,4
cookie2,2020-04-10,2
cookie2,2020-04-11,3
cookie2,2020-04-12,5
cookie2,2020-04-13,6
cookie2,2020-04-14,3
cookie2,2020-04-15,9
cookie2,2020-04-16,7
CREATE TABLE test_t2 (
cookieid string,
createtime string,   --day
pv INT
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile;
load data local inpath '/root/hivedata/test_t2.dat' into table test_t2;
- ROW_NUMBER(): generates the sequence number of each row within the group, in order, starting at 1.
SELECT
cookieid,
createtime,
pv,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn
FROM test_t2;
- RANK and DENSE_RANK: RANK() generates the rank of each data item within the group; equal values get equal ranks, which leave gaps in the following ranks. DENSE_RANK() also generates the rank of each data item within the group; equal values get equal ranks without leaving gaps.
SELECT
cookieid,
createtime,
pv,
RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn1,
DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn2,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3
FROM test_t2
WHERE cookieid = 'cookie1';
- NTILE: sometimes there is a requirement like this: the data is split into three parts in sorted order, and the business only cares about one of them; how do you extract the middle third of the data? The NTILE function can do this. NTILE divides ordered data evenly into the specified number of buckets (num) and assigns a bucket number to each row. If an even split is not possible, the lower-numbered buckets are filled first, and the number of rows in each bucket differs by at most 1. You can then select the first or last n-th of the data based on the bucket number. The data is returned in full, just tagged with the corresponding bucket number; to get a specific fraction of the data, you need to nest an outer query that filters on that tag (see the sketch after the following query).
SELECT
cookieid,
createtime,
pv,
NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn1,
NTILE(3) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn2,
NTILE(4) OVER(ORDER BY createtime) AS rn3
FROM test_t2
ORDER BY cookieid,createtime;
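As noted above, NTILE only tags each row with a bucket number; to actually take, say, the first third of the rows, wrap the query and filter on that tag. A minimal sketch:

select cookieid, createtime, pv
from (
  select cookieid, createtime, pv,
         ntile(3) over(partition by cookieid order by createtime) as part
  from test_t2
) t
where part = 1;   -- keep only the first third of the rows within each cookieid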
### LAG, LEAD, FIRST_VALUE, LAST_VALUE
- LAG
**LAG(col, n, DEFAULT) fetches the value of the n-th row above the current row within the window.** The first argument is the column name; the second argument is n, the number of rows to look up (optional, default 1); the third argument is the default value (used when the n-th row above is NULL; if it is not specified, NULL is returned).
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time
FROM test_t4;
last_1_time: the value of the row 1 position up, with a specified default of '1970-01-01 00:00:00'.
For the first row of cookie1, there is no row above, so the default 1970-01-01 00:00:00 is used; for the second row of cookie1, the row 1 up is the first row; for the sixth row, the row 1 up is the fifth row, 2015-04-10 10:50:01.
last_2_time: the value of the row 2 positions up, with no default specified.
For the first row of cookie1, the row 2 up is NULL; for the second row, the row 2 up is NULL; for the fourth row, the row 2 up is the second row, 2015-04-10 10:00:02; for the seventh row, the row 2 up is the fifth row, 2015-04-10 10:50:01.
- LEAD
**LEAD(col, n, DEFAULT) fetches the value of the n-th row below the current row within the window.** The first argument is the column name; the second argument is n, the number of rows to look down (optional, default 1); the third argument is the default value (used when the n-th row below is NULL; if it is not specified, NULL is returned).
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LEAD(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time,
LEAD(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time
FROM test_t4;
- FIRST_VALUE: takes the first value within the group, from the first row up to the current row, after sorting within the group.
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1
FROM test_t4;
- LAST_VALUE: takes the last value within the group, from the first row up to the current row, after sorting within the group.
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1
FROM test_t4;
If you want the last value sorted in a group, you need to work around it:
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last2
FROM test_t4
ORDER BY cookieid,createtime;
**Pay special attention to ORDER BY**: if ORDER BY is not specified, the rows are processed in an undefined order and wrong results can occur.
SELECT cookieid,
createtime,
url,
FIRST_VALUE(url) OVER(PARTITION BY cookieid) AS first2
FROM test_t4;
### CUME_DIST and PERCENT_RANK
These two sequence analysis functions are not commonly used. **Note: sequence functions do not support the WINDOW clause.**
- Data preparation:
d1,user1,1000
d1,user2,2000
d1,user3,3000
d2,user4,4000
d2,user5,5000
CREATE EXTERNAL TABLE test_t3 (
dept STRING,
userid string,
sal INT
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile;
load data local inpath '/root/hivedata/test_t3.dat' into table test_t3;
- CUME_DIST: (number of rows less than or equal to the current value) / (total number of rows in the group). With order by, the default sort is ascending. For example: the proportion of people whose salary is less than or equal to the current salary, out of the total number of people.
SELECT
dept,
userid,
sal,
CUME_DIST() OVER(ORDER BY sal) AS rn1,
CUME_DIST() OVER(PARTITION BY dept ORDER BY sal) AS rn2
FROM test_t3;
rn1: no partition by, so all rows form one group; the total number of rows is 5.
Row 1: the number of rows with sal <= 1000 is 1, so 1/5 = 0.2. Row 3: the number of rows with sal <= 3000 is 3, so 3/5 = 0.6.
rn2: grouped by department; dept=d1 has 3 rows.
Row 2: the number of rows with sal <= 2000 is 2, so 2/3 = 0.6666666666666666.
- PERCENT_RANK: (rank of the current row within the group - 1) / (total number of rows in the group - 1). On investigation, the practical use of this function is unclear and needs further study.
SELECT
dept,
userid,
sal,
PERCENT_RANK() OVER(ORDER BY sal) AS rn1,    -- over the whole data set
RANK() OVER(ORDER BY sal) AS rn11,           -- rank over the whole data set
SUM(1) OVER(PARTITION BY NULL) AS rn12,      -- total number of rows
PERCENT_RANK() OVER(PARTITION BY dept ORDER BY sal) AS rn2
FROM test_t3;

rn1: rn1 = (rn11-1) / (rn12-1)
Row 1: (1-1)/(5-1) = 0/4 = 0; row 2: (2-1)/(5-1) = 1/4 = 0.25; row 4: (4-1)/(5-1) = 3/4 = 0.75.
rn2: grouped by dept,
dept=d1, row 1: (1-1)/(3-1) = 0
### GROUPING SETS, GROUPING__ID, CUBE, ROLLUP
- Data preparation:
2020-03,2020-03-10,cookie1
2020-03,2020-03-10,cookie5
2020-03,2020-03-12,cookie7
2020-04,2020-04-12,cookie3
2020-04,2020-04-13,cookie2
2020-04,2020-04-13,cookie4
2020-04,2020-04-16,cookie4
2020-03,2020-03-10,cookie2
2020-03,2020-03-10,cookie3
2020-04,2020-04-12,cookie5
2020-04,2020-04-13,cookie6
2020-04,2020-04-15,cookie3
2020-04,2020-04-15,cookie2
2020-04,2020-04-16,cookie1
CREATE TABLE test_t5 (
month STRING,
day STRING,
cookieid STRING
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile;
load data local inpath '/root/hivedata/test_t5.dat' into table test_t5;
- GROUPING SETS: a convenient way to write multiple GROUP BY logics in a single SQL statement. It is equivalent to the UNION ALL of the GROUP BY result sets of the different dimensions. **GROUPING__ID** indicates which grouping set a result row belongs to.
SELECT
month,
day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM test_t5
GROUP BY month,day
GROUPING SETS (month,day)
ORDER BY GROUPING__ID;
grouping__id indicates which grouping set a group of results belongs to; here, according to grouping sets (month, day), 1 represents month and 2 represents day. The statement above is equivalent to:
SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM test_t5 GROUP BY month UNION ALL SELECT NULL as month,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM test_t5 GROUP BY day;
Such as:
SELECT
month,
day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM test_t5
GROUP BY month,day
GROUPING SETS (month,day,(month,day))
ORDER BY GROUPING__ID;
Equivalent to SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM test_t5 GROUP BY month UNION ALL SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM test_t5 GROUP BY day UNION ALL SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM test_t5 GROUP BY month,day;
-Cube aggregates according to all combinations of the dimensions of Group By.
SELECT
month,
day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM test_t5
GROUP BY month,day
WITH CUBE
ORDER BY GROUPING__ID;
Equivalent to:
SELECT NULL,NULL,COUNT(DISTINCT cookieid) AS uv,0 AS GROUPING__ID FROM test_t5 UNION ALL SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM test_t5 GROUP BY month UNION ALL SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM test_t5 GROUP BY day UNION ALL SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM test_t5 GROUP BY month,day;
-Rollup is a subset of Cube, dominated by the leftmost dimension from which the hierarchy is aggregated.
For example, hierarchical aggregation with the Month dimension: SELECT month, day, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID FROM test_t5 GROUP BY month,day WITH ROLLUP ORDER BY GROUPING__ID;
If the order of month and day is switched, then the hierarchy is aggregated by day dimension:
SELECT day, month, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID FROM test_t5 GROUP BY day,month WITH ROLLUP ORDER BY GROUPING__ID; (Here, the aggregation according to the sky and the moon is the same as the aggregation according to the day, because there is a parent-child relationship, if it is a combination of other dimensions, it will be different.)