Hive SQL
This article covers the SQL used with Hive on a daily basis. Because there is a lot of it, the SQL is organized as follows:
- DDL statements (data definition statements): operations on databases and tables, including creating and modifying internal tables, external tables, partitioned tables, and bucketed tables
- DQL statements (data query statements): single-table queries and join queries
- Hive functions: aggregate, conditional, date, and string functions, row/column transposition, lateral view and explode, reflect, parsing functions, and window functions
This article was first published on the public account "Five Minutes to Learn Big Data", an original big data technology account with weekly articles and interview question analyses. Follow it to get a carefully prepared big data interview guide.
Hive DDL syntax
Operations on the database
- Create database:
create database if not exists myhive;
Note: Hive tables are stored under the location given by the hive-site.xml property hive.metastore.warehouse.dir.
Create a database and specify its HDFS storage location:
create database myhive2 location '/myhive2';
- Modify database:
alter database myhive2 set dbproperties('createtime'='20210329');
Note: you can use the alter database command to modify some properties of a database, but its metadata, such as the database name and location, cannot be changed.
- View database details
View basic database information:
hive (myhive)> desc database myhive2;
View detailed database information:
hive (myhive)> desc database extended myhive2;
- Deleting a database
Delete an empty database (an error is reported if the database still contains tables):
drop database myhive2;
Forcibly delete a database, including the tables under it:
drop database myhive cascade;
Operations on a data table
Operations on managed tables (internal tables):
- Create internal table:
hive (myhive)> use myhive; -- Use the MyHive database
hive (myhive)> create table stu(id int,name string);
hive (myhive)> insert into stu values (1,"zhangsan");
hive (myhive)> insert into stu values (1,"zhangsan"),(2,"lisi"); Insert multiple pieces of data at a time
hive (myhive)> select * from stu;
- Field types available when creating Hive tables:
Category | Type | Description | Example literal |
---|---|---|---|
Primitive types | BOOLEAN | true/false | TRUE |
 | TINYINT | 1-byte signed integer, -128 to 127 | 1Y |
 | SMALLINT | 2-byte signed integer, -32768 to 32767 | 1S |
 | INT | 4-byte signed integer | 1 |
 | BIGINT | 8-byte signed integer | 1L |
 | FLOAT | 4-byte single-precision floating point | 1.0 |
 | DOUBLE | 8-byte double-precision floating point | 1.0 |
 | DECIMAL | Signed decimal of arbitrary precision | 1.0 |
 | STRING | Variable-length string | "a", "b" |
 | VARCHAR | Variable-length string | "a", "b" |
 | CHAR | Fixed-length string | "a", "b" |
 | BINARY | Byte array | (cannot be written as a literal) |
 | TIMESTAMP | Timestamp, millisecond precision | 122327493795 |
 | DATE | Date | '2016-03-29' |
 | INTERVAL | Time interval | |
Complex types | ARRAY | Ordered collection of values of the same type | array(1, 2) |
 | MAP | Key-value pairs; keys must be primitive, values may be any type | map('a', 1, 'b', 2) |
 | STRUCT | Collection of fields of different types | struct('1', 1, 1.0), named_struct('col1', '1', 'col2', 1, 'col3', 1.0) |
 | UNION | A value from a finite set of types | create_union(1, 'a', 63) |
A brief note on the DECIMAL type:
Usage: decimal(11,2) allows up to 11 digits in total, of which 2 are after the decimal point, leaving up to 9 digits for the integer part. If the integer part exceeds 9 digits, the field becomes NULL. If the fractional part has fewer than 2 digits, it is padded with zeros; if it has more than 2 digits, it is rounded. If no precision is specified, the default is decimal(10,0).
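The following minimal sketch illustrates those three rules (the table name dec_demo and the sample values are hypothetical):

```sql
create table dec_demo (d decimal(11, 2));

insert into dec_demo values (123456789.1);    -- 9 integer digits fit; padded to 123456789.10
insert into dec_demo values (3.14159);        -- more than 2 decimals: rounded to 3.14
insert into dec_demo values (12345678901.0);  -- 11 integer digits exceed 9: stored as NULL

select * from dec_demo;
```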
- Create a table and specify the field delimiter
create table if not exists stu2(id int ,name string) row format delimited fields terminated by '\t' stored as textfile location '/user/stu2';
row format delimited fields terminated by '\t' specifies the field delimiter; the default is '\001'.
stored as specifies the storage format; location specifies the storage location.
- Create a table based on the query results
create table stu3 as select * from stu2;
- Create tables based on existing table structures
create table stu4 like stu2;
- Query table structure
Query only the fields and their types:
desc stu2;
Detailed query:
desc formatted stu2;
- Query the statement that created the table
show create table stu2;
Operations on external tables
An external table points to data loaded from other HDFS paths, so Hive does not assume exclusive ownership of that data: when the table is deleted, the data remains in HDFS.
- Building an external table
create external table student (s_id string,s_name string) row format delimited fields terminated by '\t';
- Load data from the local file system to the table
Append load:
load data local inpath '/export/servers/hivedatas/student.csv' into table student;
Overwrite load:
load data local inpath '/export/servers/hivedatas/student.csv' overwrite into table student;
- Load data from the HDFS file system to the table
load data inpath '/hivedatas/techer.csv' into table techer;
Load data and specify a partition:
load data inpath '/hivedatas/techer.csv' into table techer partition(cur_date=20201210);
- Note:
1. load data local loads a file from the local file system; the file is copied into HDFS.
2. load data (without local) loads a file that is already in HDFS; the file is moved, not copied, into the Hive table's directory.
3. If a file with the same name already exists, the loaded file is automatically renamed.
Operations on partitioned tables
- Syntax for creating a partitioned table
create table score(s_id string, s_score int) partitioned by (month string);
- Create a table with multiple partitions
create table score2 (s_id string, s_score int) partitioned by (year string,month string,day string);
Note:
When creating a Hive table, you can use location to point at a file or folder. If a folder is specified, Hive loads all files inside it. For a non-partitioned table, that folder must not contain subfolders. For a partitioned table, e.g. partitioned by (day string), each subfolder under the location is one partition, with folder names like day=20201123; after that, run msck repair table score; to repair the table metadata, and all the data appears in the table.
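A hedged sketch of that flow (the table name score_ext and the path /scoredatas are illustrative):

```sql
-- /scoredatas contains subfolders such as /scoredatas/day=20201123
create external table score_ext (s_id string, s_score int)
partitioned by (day string)
location '/scoredatas';

-- Register each day=... subfolder as a partition in the metastore
msck repair table score_ext;
show partitions score_ext;
```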
- Load data into a partitioned table
load data local inpath '/export/servers/hivedatas/score.csv' into table score partition (month='201806');
- Load data into a multipartitioned table
load data local inpath '/export/servers/hivedatas/score.csv' into table score2 partition(year='2018',month='06',day='01');
- View the partitions
show partitions score;
- Adding a Partition
alter table score add partition(month='201805');
- Add multiple partitions simultaneously
alter table score add partition(month='201804') partition(month = '201803');
Note: after adding a partition, an extra folder appears under the table's directory in HDFS.
- Delete a partition
alter table score drop partition(month = '201806');
Operations on bucketed tables
Bucketing divides data into a set of buckets by a specified field: rows are distributed across multiple files according to the hash of the bucketing field.
Partitioning splits data into folders; bucketing splits data into files.
Advantages of bucketing:
1. Improves join query efficiency
2. Makes sampling more efficient
- Enable Hive's bucketing feature
set hive.enforce.bucketing=true;
- Set the number of reducers
set mapreduce.job.reduces=3;
- Create a bucketed table
create table course (c_id string,c_name string) clustered by(c_id) into 3 buckets;
Loading data into a bucketed table: data cannot be loaded with hdfs dfs -put or load data; it can only be loaded with insert overwrite.
To load a file into a bucketed table, first create a normal staging table, then insert overwrite the data from the staging table into the bucketed table, as sketched after the next snippet.
- Load data into bucket table by insert overwrite
insert overwrite table course select * from course_common cluster by(c_id); -- The bucketing field is specified last with cluster by
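For completeness, a hedged sketch of the staging step that precedes that insert (course_common is the staging table referenced in the query above; the csv path is illustrative):

```sql
-- Staging table with the same schema as the bucketed table, loaded normally
create table course_common (c_id string, c_name string)
row format delimited fields terminated by '\t';

load data local inpath '/export/servers/hivedatas/course.csv' into table course_common;
```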
Modify tables and delete tables
- Modify table name
alter table old_table_name rename to new_table_name;
- Add/modify column information
Query the table structure:
desc score5;
Add columns:
alter table score5 add columns (mycol string, mysco string);
Update a column:
alter table score5 change column mysco mysconew int;
- Delete table
drop table score5;
- Clear table operation
truncate table score6;
Note: only managed (internal) tables can be truncated; truncating an external table reports an error.
Note on truncate vs. drop:
If the HDFS recycle bin (trash) is enabled, the data of a dropped table can be recovered from the trash, but the table structure cannot: the table must be recreated. Truncated data does not go to the trash and cannot be recovered, so use truncate with caution.
Loading data into Hive tables
- Inserts data directly into a partitioned table
insert into table score partition(month='201807') values ('001','002','100');
- Load data
load data local inpath '/export/servers/hivedatas/score.csv' overwrite into table score partition(month='201806');
- Load data by querying
insert overwrite table score2 partition(month = '201806') select s_id,c_id,s_score from score1;
- Create the table and load the data in the query statement
create table score2 as select * from score1;
- When creating a table, specify the data location with location
create external table score6 (s_id string,c_id string,s_score int) row format delimited fields terminated by ',' location '/myscore';
- export and import of Hive table data (managed table operations)
create table techer2 like techer; -- Create a table based on an existing table structure
export table techer to '/export/techer';
import table techer2 from '/export/techer';
Export data from the Hive table
- Insert the export
Export query results to the local file system:
insert overwrite local directory '/export/servers/exporthive' select * from score;
Format the query results, then export to the local file system:
insert overwrite local directory '/export/servers/exporthive' row format delimited fields terminated by '\t' collection items terminated by '#' select * from student;
Export query results to HDFS (without the local keyword):
insert overwrite directory '/export/servers/exporthive' row format delimited fields terminated by '\t' collection items terminated by '#' select * from score;
- Export to the local file system with a Hadoop command
dfs -get /export/servers/exporthive/000000_0 /export/servers/exporthive/local.txt;
- Export using the hive shell command
Basic syntax: hive -e "statement" (or hive -f script) > file
hive -e "select * from myhive.score;" > /export/servers/exporthive/score.txt
hive -f export.sh > /export/servers/exporthive/score.txt
- export to HDFS
export table score to '/export/exporthive/score';
Syntax of HIVE DQL query
Single table query
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list [HAVING condition]]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY| ORDER BY col_list]
]
[LIMIT number]
Note:
1. order by performs a global sort of all input, so there is only one reducer; with large inputs this takes a long time.
2. sort by is not a global sort; it sorts data before it enters each reducer. If mapred.reduce.tasks is set to more than 1, sort by only guarantees that each reducer's output is ordered, not that the overall output is.
3. distribute by (field) distributes data to different reducers by the hash of the specified field.
4. cluster by (field) combines distribute by with sort by on the same field. If the bucketing and sort fields are the same, cluster by = distribute by + sort by.
- WHERE clause
select * from score where s_score < 60;
Note:
A predicate such as s_score < 60 does not match rows where s_score is NULL.
- GROUP BY grouping
select s_id, avg(s_score) from score group by s_id;
After grouping, filter the data with having:
select s_id ,avg(s_score) avgscore from score group by s_id having avgscore > 85;
Note:
1. With group by, the select list can only contain the grouping columns and aggregate functions.
2. where filters data before grouping and cannot contain aggregate functions; having filters after grouping and can.
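A quick hedged illustration of the where/having distinction (score and its columns come from the earlier examples):

```sql
select s_id, avg(s_score) avgscore
from score
where s_score > 0          -- row-level filter: runs before grouping, no aggregates allowed
group by s_id
having avg(s_score) > 85;  -- group-level filter: runs after grouping, aggregates allowed
```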
- Joins
INNER JOIN: only rows that satisfy the join condition in both tables are returned.
select * from techer t [inner] join course c on t.t_id = c.t_id; -- inner can be omitted
LEFT OUTER JOIN: all rows from the left table are returned, together with matching rows from the right table.
select * from techer t left join course c on t.t_id = c.t_id; -- outer can be omitted
RIGHT OUTER JOIN: all rows from the right table are returned, together with matching rows from the left table.
select * from techer t right join course c on t.t_id = c.t_id;
FULL OUTER JOIN: all rows from both tables that satisfy the condition are returned; where one table has no match, NULL is used instead.
SELECT * FROM techer t FULL JOIN course c ON t.t_id = c.t_id;
Note: 1. Hive 2 already supports non-equi joins: the join on condition may use > or <, and on conditions combined with or (earlier versions only supported = and and after on, not > < or or).
2. If the Hive execution engine is MapReduce, each join starts one job; an SQL statement with multiple joins starts multiple jobs.
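For instance, a join with an or condition on Hive 2+ might look like the following hedged sketch (techer and course come from the examples above; the second condition is purely illustrative):

```sql
-- An OR condition in the on clause, accepted by Hive 2+
select *
from techer t
join course c
  on t.t_id = c.t_id
  or t.t_id = c.c_id;
```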
Note: a comma join between tables is equivalent to an inner join:
select * from table_a, table_b where table_a.id = table_b.id;
There is no difference in execution efficiency, only in syntax: the comma style is the SQL-89 standard, while join is SQL-92. With a comma join the filter condition goes in where; with join it goes in on.
- order by global sort
Global sort with only one reducer.
ASC: ascending order (the default); DESC: descending order.
SELECT * FROM student s LEFT JOIN score sco ON s.s_id = sco.s_id ORDER BY sco.s_score DESC;
Note: order by is a global sort, so there is only one reducer at the end, running on a single node; with large data volumes it can take a long time.
- sort by local sort
Each reducer sorts its own data internally; the global result set is not sorted.
Set the number of reducers:
set mapreduce.job.reduces=3;
View the number of reducers:
set mapreduce.job.reduces;
Query scores in descending order:
select * from score sort by s_score;
Export the query results to a file (in descending order of score):
insert overwrite local directory '/export/servers/hivedatas/sort' select * from score sort by s_score;
- Distribute by partition sort
distribute by: similar to an MR partitioner, used together with sort by.
Set the number of reducers so that s_id values map to the corresponding reducers:
set mapreduce.job.reduces=7;
Partition the data by s_id with distribute by, sorting each partition by s_score:
select * from score distribute by s_id sort by s_score;
Note: Hive requires that the distribute BY statement be written before the Sort BY statement
- cluster by
If the distribute by and sort by fields are the same, cluster by can be used instead. cluster by combines the functions of distribute by and sort by, but the sort can only be ascending; ASC or DESC cannot be specified.
The following two statements are equivalent:
select * from score cluster by s_id;
select * from score distribute by s_id sort by s_id;
Hive function
Aggregation function
Hive supports count(), max(), min(), sum(), avg(), and other built-in aggregate functions.
Note:
Pay attention to NULL handling when aggregating: count(*) counts all rows, including those containing NULLs; count(id) does not count NULL values; min and max ignore NULL unless all values are NULL.
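A minimal hedged sketch of that NULL behavior (the table t and its id values are hypothetical):

```sql
-- Suppose t has three rows whose id values are 1, 2, NULL
select count(*)  from t;  -- 3: every row is counted
select count(id) from t;  -- 2: the NULL id is skipped
select min(id)   from t;  -- 1: NULL is ignored
```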
- Population variance function: var_pop
Syntax: var_pop(col). Return type: double. Description: computes the population variance of the non-NULL values in col (NULLs are ignored).
- Sample variance function: var_samp
Syntax: var_samp(col). Return type: double. Description: computes the sample variance of the non-NULL values in col (NULLs are ignored).
- Population standard deviation function: stddev_pop
Syntax: stddev_pop(col). Return type: double. Description: computes the population standard deviation, i.e. the square root of the population variance; it equals the square root of the var_pop result.
- Percentile function: percentile
Syntax: percentile(BIGINT col, p). Return type: double. Description: returns the exact p-th percentile; p must be between 0 and 1. Currently only integer columns are supported, not floating-point ones.
Relational operators
Supported: equal (=), not equal (!= or <>), less than (<), less than or equal (<=), greater than (>), greater than or equal (>=), null check (is null), non-null check (is not null).
- LIKE comparison: LIKE
Syntax: A LIKE B. Operand types: strings. Description: returns NULL if A or B is NULL; TRUE if string A matches the SQL pattern B; otherwise FALSE. In B, the character '_' matches any single character and '%' matches any number of characters.
- Java-style LIKE operator: RLIKE
Syntax: A RLIKE B. Operand types: strings. Description: returns NULL if A or B is NULL; TRUE if string A matches the Java regular expression B; otherwise FALSE.
- REGEXP operator: REGEXP
Syntax: A REGEXP B. Operand types: strings. Description: same as RLIKE. Example: select 1 from tableName where 'footbar' REGEXP '^f.*r$'; Result: 1.
Mathematical operations
Supported for all numeric types: addition (+), subtraction (-), multiplication (*), division (/), modulus (%), bitwise AND (&), bitwise OR (|), bitwise XOR (^), bitwise NOT (~).
Logical operations
Supported: logical AND (and), logical OR (or), logical NOT (not).
Numeric functions
- Rounding function: round
Syntax: round(double a). Return type: BIGINT. Description: returns the value of a rounded to the nearest integer.
hive> select round(3.1415926) from tableName;
3
- Rounding to a specified precision: round
Syntax: round(double a, int d). Return type: DOUBLE. Description: returns a rounded to d decimal places.
hive> select round(3.1415926, 4) from tableName;
3.1416
- Floor function: floor
Syntax: floor(double a). Return type: BIGINT. Description: returns the largest integer less than or equal to a.
hive> select floor(3.641) from tableName;
3
- Ceiling function: ceil
Syntax: ceil(double a). Return type: BIGINT. Description: returns the smallest integer greater than or equal to a.
hive> select ceil(3.1415926) from tableName;
4
- Random number function: rand
Syntax: rand(), rand(int seed). Return type: double. Description: returns a random number in the range 0 to 1. If a seed is specified, a stable (repeatable) sequence of random numbers is produced.
hive> select rand() from tableName; -- the result differs on each execution
0.5577432776034763
hive> select rand(100); -- with a seed, the same result is returned every time
0.7220096548596434
- Natural exponential function: exp
Syntax: exp(double a). Return type: double. Description: returns e raised to the power a.
hive> select exp(2);
7.38905609893065
- Base-10 logarithm: log10
Syntax: log10(double a). Return type: double. Description: returns the base-10 logarithm of a.
hive> select log10(100);
2.0
In addition, there are the base-2 logarithm function log2() and the general logarithm function log().
- Power function: pow
Syntax: pow(double a, double p). Return type: double. Description: returns a raised to the power p.
hive> select pow(2, 4);
16.0
- Square root function: sqrt
Syntax: sqrt(double a). Return type: double. Description: returns the square root of a.
hive> select sqrt(16);
4.0
- Binary function: bin
Syntax: bin(BIGINT a). Return type: string. Description: returns the binary representation of a.
hive> select bin(7);
111
Hexadecimal function: hex(); convert a hexadecimal string back: unhex().
Base conversion function: conv(bigint num, int from_base, int to_base). Description: converts num from base from_base to base to_base.
There are also many other math functions: absolute value abs(), positive modulus pmod(), sine sin(), arcsine asin(), cosine cos(), arccosine acos(), and negation negative().
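For example, conv can be used for quick base conversion (the values here are illustrative):

```sql
select conv(17, 10, 16);   -- '11': 17 in decimal is 11 in hexadecimal
select conv('11', 16, 10); -- '17': and back again
```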
Conditional functions
- If function: If
Syntax: if(boolean testCondition, T valueTrue, T valueFalseOrNull). Return type: T. Description: returns valueTrue if testCondition is TRUE, otherwise valueFalseOrNull.
hive> select if(1=2, 100, 200);
200
hive> select if(1=1, 100, 200);
100
- First non-NULL value function: coalesce
Syntax: coalesce(T v1, T v2, ...). Return type: T. Description: returns the first non-NULL argument; returns NULL if all arguments are NULL.
hive> select coalesce(null, '100', '50');
100
- Conditional function: case when
Syntax: case when a then b [when c then d]* [else e] end. Return type: T. Description: returns b if a is TRUE; returns d if c is TRUE; otherwise returns e.
hive> select case when 1=2 then 'tom' when 2=2 then 'mary' else 'tim' end from tableName;
mary
- Conditional function: case when (second form)
Syntax: case a when b then c [when d then e]* [else f] end. Return type: T. Description: returns c if a equals b; returns e if a equals d; otherwise returns f.
hive> select case 100 when 50 then 'tom' when 100 then 'mary' else 'tim' end from tableName;
mary
Date functions
Note: In the following SQL statements, “from tableName” can be removed without affecting the query results
- Get the current UNIX timestamp: unix_timestamp
Syntax: unix_timestamp(). Return type: bigint. Description: returns the UNIX timestamp for the current time zone.
hive> select unix_timestamp() from tableName;
1616906976
- UNIX timestamp to date: from_unixtime
Syntax: from_unixtime(bigint unixtime[, string format]). Return type: string. Description: converts a UNIX timestamp (seconds since 1970-01-01 00:00:00 UTC) to a time in the current time zone, in the given format.
hive> select from_unixtime(1616906976, 'yyyyMMdd') from tableName;
20210328
- Date to UNIX timestamp: unix_timestamp
Syntax: unix_timestamp(string date). Return type: bigint. Description: converts a date in 'yyyy-MM-dd HH:mm:ss' format to a UNIX timestamp; returns 0 if the conversion fails.
hive> select unix_timestamp('2021-03-08 14:21:15') from tableName;
1615184475
- Date in a specified format to UNIX timestamp: unix_timestamp
Syntax: unix_timestamp(string date, string pattern). Return type: bigint. Description: converts a date in the given pattern to a UNIX timestamp; returns 0 if the conversion fails.
hive> select unix_timestamp('2021-03-08 14:21:15', 'yyyy-MM-dd HH:mm:ss') from tableName;
1615184475
- Datetime to date: to_date
Syntax: to_date(string timestamp). Return type: string. Description: returns the date part of a datetime string.
hive> select to_date('2021-03-28 14:03:01') from tableName;
2021-03-28
- Date to year: year
Syntax: year(string date). Return type: int. Description: returns the year of the date.
hive> select year('2021-03-28 10:03:01') from tableName;
2021
hive> select year('2021-03-28') from tableName;
2021
- Date to month: month
Syntax: month(string date). Return type: int. Description: returns the month of the date.
hive> select month('2020-12-28 12:03:01') from tableName;
12
hive> select month('2021-03-08') from tableName;
3
- Date to day: day
Syntax: day(string date). Return type: int. Description: returns the day of the date.
hive> select day('2020-12-08 10:03:01') from tableName;
8
hive> select day('2020-12-24') from tableName;
24
- Date to hour: hour
Syntax: hour(string date). Return type: int. Description: returns the hour of the date.
hive> select hour('2020-12-08 10:03:01') from tableName;
10
- Date to minute: minute
Syntax: minute(string date). Return type: int. Description: returns the minute of the date.
hive> select minute('2020-12-08 10:03:01') from tableName;
3
- Date to second: second
Syntax: second(string date). Return type: int. Description: returns the second of the date.
hive> select second('2020-12-08 10:03:01') from tableName;
1
- Date to week of year: weekofyear
Syntax: weekofyear(string date). Return type: int. Description: returns the week of the year in which the date falls.
hive> select weekofyear('2020-12-08 10:03:01') from tableName;
49
- Date difference: datediff
Syntax: datediff(string enddate, string startdate). Return type: int. Description: returns the number of days from startdate to enddate.
hive> select datediff('2020-12-08', '2020-05-09') from tableName;
213
- Date addition: date_add
Syntax: date_add(string startdate, int days). Return type: string. Description: returns the date days days after startdate.
hive> select date_add('2020-12-08', 10) from tableName;
2020-12-18
- Date subtraction: date_sub
Syntax: date_sub(string startdate, int days). Return type: string. Description: returns the date days days before startdate.
hive> select date_sub('2020-12-08', 10) from tableName;
2020-11-28
String functions
- String length: length
Syntax: length(string A). Return type: int. Description: returns the length of string A.
hive> select length('abcedfg') from tableName;
7
- String reversal: reverse
Syntax: reverse(string A). Return type: string. Description: returns string A reversed.
hive> select reverse('abcedfg') from tableName;
gfdecba
- String concatenation: concat
Syntax: concat(string A, string B, ...). Return type: string. Description: returns the concatenation of the input strings; any number of inputs is supported.
hive> select concat('abc', 'def', 'gh') from tableName;
abcdefgh
- Delimited string concatenation: concat_ws
Syntax: concat_ws(string SEP, string A, string B, ...). Return type: string. Description: returns the concatenation of the input strings, with SEP as the separator between them.
hive> select concat_ws(',', 'abc', 'def', 'gh') from tableName;
abc,def,gh
- Substring functions: substr, substring
Syntax: substr(string A, int start), substring(string A, int start). Return type: string. Description: returns the substring of A from position start to the end of the string.
hive> select substr('abcde', 3) from tableName;
cde
hive> select substring('abcde', 3) from tableName;
cde
hive> select substr('abcde', -1) from tableName; -- same as in Oracle
e
- Substring functions with length: substr, substring
Syntax: substr(string A, int start, int len), substring(string A, int start, int len). Return type: string. Description: returns the substring of A starting at position start with length len.
hive> select substr('abcde', 3, 2) from tableName;
cd
hive> select substring('abcde', 3, 2) from tableName;
cd
hive> select substring('abcde', -2, 2) from tableName;
de
- Uppercase functions: upper, ucase
Syntax: upper(string A), ucase(string A). Return type: string. Description: returns string A in uppercase.
hive> select upper('abSEd') from tableName;
ABSED
hive> select ucase('abSEd') from tableName;
ABSED
- Lowercase functions: lower, lcase
Syntax: lower(string A), lcase(string A). Return type: string. Description: returns string A in lowercase.
hive> select lower('abSEd') from tableName;
absed
hive> select lcase('abSEd') from tableName;
absed
- Whitespace trimming: trim
Syntax: trim(string A). Return type: string. Description: removes whitespace from both ends of the string.
hive> select trim(' abc ') from tableName;
abc
- Left trim: ltrim
Syntax: ltrim(string A). Return type: string. Description: removes whitespace from the left end of the string.
hive> select ltrim(' abc ') from tableName;
abc
- Right trim: rtrim
Syntax: rtrim(string A). Return type: string. Description: removes whitespace from the right end of the string.
hive> select rtrim(' abc ') from tableName;
abc
- Regular expression replacement: regexp_replace
Syntax: regexp_replace(string A, string B, string C). Return type: string. Description: replaces the parts of string A that match the Java regular expression B with C. Note that escape characters are sometimes needed, similar to the regexp_replace function in Oracle.
hive> select regexp_replace('foobar', 'oo|ar', '') from tableName;
fb
- Regular expression extraction: regexp_extract
Syntax: regexp_extract(string subject, string pattern, int index). Return type: string. Description: matches the string subject against the regular expression pattern and returns the capture group specified by index.
hive> select regexp_extract('foothebar', 'foo(.*?)(bar)', 1) from tableName;
the
hive> select regexp_extract('foothebar', 'foo(.*?)(bar)', 2) from tableName;
bar
hive> select regexp_extract('foothebar', 'foo(.*?)(bar)', 0) from tableName;
foothebar
Note that escape characters are required in some cases; below, the equals sign is escaped with a double backslash, following Java regular expression rules.
select data_field,
regexp_extract(data_field, '.*?bgStart\\=([^&]+)', 1) as aaa,
regexp_extract(data_field, '.*?contentLoaded_headStart\\=([^&]+)', 1) as bbb,
regexp_extract(data_field, '.*?AppLoad2Req\\=([^&]+)', 1) as ccc
from pt_nginx_loginlog_st
where pt = '2021-03-28' limit 2;
- URL parsing function: parse_url
Syntax: parse_url(string urlString, string partToExtract[, string keyToExtract]). Return type: string. Description: returns the specified part of the URL. Valid values for partToExtract are HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO.
hive> select parse_url('https://www.tableName.com/path1/p.php?k1=v1&k2=v2#Ref1', 'HOST') from tableName;
www.tableName.com
hive> select parse_url('https://www.tableName.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY', 'k1') from tableName;
v1
- JSON parsing function: get_json_object
Syntax: get_json_object(string json_string, string path). Return type: string. Description: parses the JSON string json_string and returns the content specified by path; returns NULL if the input JSON string is invalid.
hive> select get_json_object('{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.95,"color":"red"}},"email":"amy@only_for_json_udf_test.net","owner":"amy"}', '$.owner') from tableName;
amy
- Space string function: space
Syntax: space(int n). Return type: string. Description: returns a string of n spaces.
hive> select space(10) from tableName;
hive> select length(space(10)) from tableName;
10
- String repetition: repeat
Syntax: repeat(string str, int n). Return type: string. Description: returns str repeated n times.
hive> select repeat('abc', 5) from tableName;
abcabcabcabcabc
- First-character ASCII function: ascii
Syntax: ascii(string str). Return type: int. Description: returns the ASCII code of the first character of str.
hive> select ascii('abcde') from tableName;
97
- Left padding: lpad
Syntax: lpad(string str, int len, string pad). Return type: string. Description: left-pads str with pad to a length of len.
hive> select lpad('abc', 10, 'td') from tableName;
tdtdtdtabc
Note: unlike in GP and Oracle, pad cannot be omitted.
- Right padding: rpad
Syntax: rpad(string str, int len, string pad). Return type: string. Description: right-pads str with pad to a length of len.
hive> select rpad('abc', 10, 'td') from tableName;
abctdtdtdt
- String splitting: split
Syntax: split(string str, string pat). Return type: array. Description: splits str by the pattern pat and returns the resulting array of strings.
hive> select split('abtcdtef', 't') from tableName;
["ab","cd","ef"]
- Set lookup function: find_in_set
Syntax: find_in_set(string str, string strList). Return type: int. Description: returns the position of the first occurrence of str in strList, a comma-separated string; returns 0 if str is not found.
hive> select find_in_set('ab', 'ef,ab,de') from tableName;
2
hive> select find_in_set('at', 'ef,ab,de') from tableName;
0
Compound type construction
- Map construction: map
Syntax: map(key1, value1, key2, value2, ...). Description: builds a map from the given key/value pairs.
hive> create table mapTable as select map('100','tom','200','mary') as t from tableName;
hive> describe mapTable;
t map<string,string>
hive> select t from mapTable;
{"100":"tom","200":"mary"}
- Struct construction: struct
Syntax: struct(val1, val2, val3, ...). Description: builds a struct from the given values.
hive> create table struct_table as select struct('tom','mary','tim') as t from tableName;
hive> describe struct_table;
t struct<col1:string,col2:string,col3:string>
hive> select t from struct_table;
{"col1":"tom","col2":"mary","col3":"tim"}
- Array construction: array
Syntax: array(val1, val2, ...). Description: builds an array from the given values.
hive> create table arr_table as select array("tom","mary","tim") as t from tableName;
hive> describe arr_table;
t array<string>
hive> select t from arr_table;
["tom","mary","tim"]
Complex type access operations
- Array access: A[n]
Syntax: A[n]. Operand types: A is an array, n is an int. Description: returns the n-th element of array A; indexing starts at 0. For example, if A is an array with value ['foo', 'bar'], then A[0] returns 'foo' and A[1] returns 'bar'.
hive> create table arr_table2 as select array("tom","mary","tim") as t from tableName;
hive> select t[0], t[1], t[2] from arr_table2;
tom mary tim
- Map access: M[key]
Syntax: M[key]. Operand types: M is a map, key is a key of the map. Description: returns the value for the given key. For example, if M is {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'}, then M['all'] returns 'foobar'.
hive> create table map_table2 as select map('100','tom','200','mary') as t from tableName;
hive> select t['200'], t['100'] from map_table2;
mary tom
- Struct access: S.x
Syntax: S.x. Operand types: S is a struct. Description: returns field x of struct S. For example, for struct foobar {int foo, int bar}, foobar.foo returns the foo field.
hive> create table str_table2 as select struct('tom','mary','tim') as t from tableName;
hive> describe str_table2;
t struct<col1:string,col2:string,col3:string>
hive> select t.col1, t.col3 from str_table2;
tom tim
Complex type length functions
- Map length: size(Map<K.V>)
Syntax: size(Map<K.V>). Return type: int. Description: returns the number of entries in the map.
hive> select size(t) from map_table2;
2
- Array length: size(Array<T>)
Syntax: size(Array<T>). Return type: int. Description: returns the length of the array.
hive> select size(t) from arr_table2;
3
- Type conversion function: cast
Syntax: cast(expr as <type>). Return type: <type>. Description: converts the result of expr to the specified type.
hive> select cast('1' as bigint) from tableName;
1
Lateral view, explode, reflect, and window functions in Hive
Using the explode function to split Map and Array fields of a Hive table
lateral view is used together with UDTFs such as split and explode: it splits one row of data into multiple rows and then joins the results back together, producing a virtual table with its own alias.
explode splits the complex array or map structure of a single Hive column into multiple rows.
Requirement: the data is currently in the following format:
zhangsan child1,child2,child3,child4 k1:v1,k2:v2
lisi child5,child6,child7,child8 k3:v3,k4:v4
Fields are separated by \t. The requirement is to split all the children into a single column:
+----------+--+
| mychild |
+----------+--+
| child1 |
| child2 |
| child3 |
| child4 |
| child5 |
| child6 |
| child7 |
| child8 |
+----------+--+
Separate the key and value of the map to produce the following result
+-----------+-------------+--+
| mymapkey | mymapvalue |
+-----------+-------------+--+
| k1 | v1 |
| k2 | v2 |
| k3 | v3 |
| k4 | v4 |
+-----------+-------------+--+
- Creating a Hive Database
hive (default)> create database hive_explode;
hive (default)> use hive_explode;
- Create the Hive table, then explode the map and array
hive (hive_explode)> create table t3(name string,children array<string>,address Map<string,string>) row format delimited fields terminated by '\t' collection items terminated by ',' map keys terminated by ':' stored as textFile;
- Load the data
On node03, create the table data file:
mkdir -p /export/servers/hivedatas/
cd /export/servers/hivedatas/
vim maparray
Contents:
zhangsan child1,child2,child3,child4 k1:v1,k2:v2
lisi child5,child6,child7,child8 k3:v3,k4:v4
Load the data into the Hive table:
hive (hive_explode)> load data local inpath '/export/servers/hivedatas/maparray' into table t3;
- Use explode to split the data
Split the array:
hive (hive_explode)> SELECT explode(children) AS myChild FROM t3;
Split the map:
hive (hive_explode)> SELECT explode(address) AS (myMapKey, myMapValue) FROM t3;
Use Explode to split the JSON string
Requirement: there is now some data in the following format:
a:shandong,b:beijing,c:hebei|1,2,3,4,5,6,7,8,9|[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]
Fields are separated by |.
We want to parse out all the monthSales values as the following column (a column-to-row transformation):
4900
2090
6987
- Create a hive table
hive (hive_explode)> create table explode_lateral_view
> (`area` string,
> `goods_id` string,
> `sale_info` string)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '|'
> STORED AS textfile;
- Prepare and load the data
cd /export/servers/hivedatas
vim explode_json
Paste the sample row shown above into the file, then load it:
hive (hive_explode)> load data local inpath '/export/servers/hivedatas/explode_json' overwrite into table explode_lateral_view;
- Explode the Array
hive (hive_explode)> select explode(split(goods_id,',')) as goods_id from explode_lateral_view;
- Use explode to split the map
hive (hive_explode)> select explode(split(area,',')) as area from explode_lateral_view;
- Split the JSON field
hive (hive_explode)> select explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{')) as sale_info from explode_lateral_view;
If we then try to use get_json_object to extract monthSales:
hive (hive_explode)> select get_json_object(explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{')),'$.monthSales') as sale_info from explode_lateral_view;
FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions
A UDTF cannot be nested inside an expression. Selecting another field alongside a UDTF fails as well:
select explode(split(area,',')) as area, good_id from explode_lateral_view;
FAILED: SemanticException 1:40 Only a single expression in the SELECT clause is supported with UDTF's. Error encountered near token 'good_id'
A UDTF supports only a single expression in the select clause; this is where LATERAL VIEW comes in.
Use with LATERAL VIEW
Use lateral View to query multiple fields
hive (hive_explode)> select goods_id2, sale_info from explode_lateral_view LATERAL VIEW explode(split(goods_id,',')) goods as goods_id2;
Here LATERAL VIEW explode(split(goods_id,',')) goods acts as a virtual table that is joined with the original table explode_lateral_view as a Cartesian product.
It can also be used multiple times
hive (hive_explode)> select goods_id2,sale_info,area2
from explode_lateral_view
LATERAL VIEW explode(split(goods_id,',')) goods as goods_id2
LATERAL VIEW explode(split(area,',')) area as area2;
This is the Cartesian product of the three tables.
Finally, we can use the following query to convert the JSON in each row completely into a two-dimensional table:
hive (hive_explode)> select get_json_object(concat('{',sale_info_1,'}'),'$.source') as source, get_json_object(concat('{',sale_info_1,'}'),'$.monthSales') as monthSales, get_json_object(concat('{',sale_info_1,'}'),'$.userCount') as userCount, get_json_object(concat('{',sale_info_1,'}'),'$.score') as score from explode_lateral_view LATERAL VIEW explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{')) sale_info as sale_info_1;
Summary:
Lateral view is usually used together with a UDTF to work around the restriction that other fields cannot be selected alongside a UDTF. Multiple lateral views can produce an effect similar to a Cartesian product. The outer keyword keeps rows for which the UDTF produces no output (returning NULL instead), preventing data loss.
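As a hedged sketch of the outer keyword (t3 comes from the earlier examples; the empty-array case is hypothetical):

```sql
-- Without outer, a row whose children array is empty disappears from the output;
-- with outer, the row is kept and mychild is NULL.
select name, mychild
from t3
lateral view outer explode(children) tmp as mychild;
```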
Row-to-column transformation
Relevant functions:
CONCAT(string A/col, string B/col, ...): returns the concatenation of the input strings; any number of inputs is supported.
CONCAT_WS(separator, str1, str2, ...): a special form of CONCAT() whose first argument is the separator placed between the remaining arguments. The separator can be the same string as the other arguments. If the separator is NULL, the result is NULL. The function skips any NULL or empty strings after the separator argument.
COLLECT_SET(col): aggregates and deduplicates the values of a column, producing an array field; only primitive data types are accepted.
Data preparation:
name | constellation | blood_type |
---|---|---|
Sun Wukong | Aries | A |
Lao Wang | Sagittarius | A |
Song Song | Aries | B |
Zhu Bajie | Aries | A |
Feng Jie | Sagittarius | A |
Requirement: group people with the same constellation and blood type. The expected result:
Sagittarius,A Lao Wang|Feng Jie
Aries,A Sun Wukong|Zhu Bajie
Aries,B Song Song
Implementation steps:
- Create the local constellation.txt file and import the data
On node03, create the file; note that fields are separated by \t:
cd /export/servers/hivedatas
vim constellation.txt
Sun Wukong	Aries	A
Lao Wang	Sagittarius	A
Song Song	Aries	B
Zhu Bajie	Aries	A
Feng Jie	Sagittarius	A
- Create hive tables and import data
hive (hive_explode)> create table person_info(
name string,
constellation string,
blood_type string)
row format delimited fields terminated by "\t";
Load the data:
hive (hive_explode)> load data local inpath '/export/servers/hivedatas/constellation.txt' into table person_info;
- Query data on demand
hive (hive_explode)> select
t1.base,
concat_ws('|', collect_set(t1.name)) name
from
(select
name,
concat(constellation, "," , blood_type) base
from
person_info) t1
group by
t1.base;
Column-to-row transformation
Required functions:
EXPLODE(col): splits the complex array or map structure of one Hive column into multiple rows.
LATERAL VIEW
Usage: LATERAL VIEW UDTF (expression) tableAlias AS columnAlias
Explanation: used together with UDTFs such as split and explode; it splits one column of data into multiple rows and then aggregates the split results.
Data preparation:
cd /export/servers/hivedatas
vim movie.txt
Contents (fields separated by \t):
Person of Interest	suspense,action,sci-fi,drama
Lie to Me	suspense,police,action,psychology,drama
Wolf Warrior 2	war,action,disaster
Requirement: expand the array of categories for each movie. The expected result:
Person of Interest	suspense
Person of Interest	action
Person of Interest	sci-fi
Person of Interest	drama
Lie to Me	suspense
Lie to Me	police
Lie to Me	action
Lie to Me	psychology
Lie to Me	drama
Wolf Warrior 2	war
Wolf Warrior 2	action
Wolf Warrior 2	disaster
Implementation steps:
- Create a hive table
create table movie_info(
movie string,
category array<string>)
row format delimited fields terminated by "\t"
collection items terminated by ",";
- Load the data
load data local inpath "/export/servers/hivedatas/movie.txt" into table movie_info;
- Query data on demand
select
movie,
category_name
from
movie_info lateral view explode(category) table_tmp as category_name;
Reflect function
The reflect function lets you call Java's built-in methods directly from SQL, which can stand in for a great many UDFs.
Requirement 1: use java.lang.Math's max to find the larger value of two columns
Implementation steps:
- Create a hive table
create table test_udf(col1 int,col2 int) row format delimited fields terminated by ', ';
- Prepare and load the data
cd /export/servers/hivedatas
vim test_udf
Contents:
1,2
4,3
6,4
7,5
5,6
- Load the data
hive (hive_explode)> load data local inpath '/export/servers/hivedatas/test_udf' overwrite into table test_udf;
- Use java.lang.Math's max to find the larger value of two columns
hive (hive_explode)> select reflect("java.lang.Math","max",col1,col2) from test_udf;
Requirement 2: apply different Java built-in functions to different records in the file
Implementation steps:
- Create a hive table
hive (hive_explode)> create table test_udf2(class_name string,method_name string,col1 int , col2 int) row format delimited fields terminated by ', ';
- Prepare the data
cd /export/servers/hivedatas
vim test_udf2
Contents:
java.lang.Math,min,1,2
java.lang.Math,max,2,3
- Load the data
hive (hive_explode)> load data local inpath '/export/servers/hivedatas/test_udf2' overwrite into table test_udf2;
- Execute the query
hive (hive_explode)> select reflect(class_name,method_name,col1,col2) from test_udf2;
Requirement 3: Determine whether it is a number
Implementation method:
Use the methods in Apache Commons. The commons jars are already on Hadoop's classpath, so they can be used directly:
select reflect("org.apache.commons.lang.math.NumberUtils","isNumber","123")
Window function and analysis function
There is a class of SQL functions called aggregate functions, such as sum(), avg(), and max(). They collapse multiple rows into one row according to some rule, so the number of rows after aggregation is generally smaller than before. Sometimes, however, we want to display both the pre-aggregation data and the aggregated values at the same time; this is what window functions provide. Window functions, also called OLAP or analytic functions, combine grouping with sorting.
The most important keywords for window functions are Partition by and Order BY.
Over (partition by XXX order by XXX)
Window functions are commonly combined with sum, avg, min, and max, as sketched below.
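A minimal hedged sketch of the difference, reusing the score table from earlier sections:

```sql
-- Aggregate function: collapses each group to a single row
select s_id, sum(s_score) from score group by s_id;

-- Window function: keeps every row and shows the aggregate alongside it
select s_id, s_score,
       sum(s_score) over (partition by s_id) as total_score
from score;
```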
To prepare data
Create the table:
create table test_t1(
cookieid string,
createtime string,   -- day
pv int
) row format delimited
fields terminated by ',';
Load the data:
load data local inpath '/root/hivedata/test_t1.dat' into table test_t1;
Data:
cookie1,2020-04-10,1
cookie1,2020-04-11,5
cookie1,2020-04-12,7
cookie1,2020-04-13,3
cookie1,2020-04-14,2
cookie1,2020-04-15,4
cookie1,2020-04-16,4
Enable smart local mode:
SET hive.exec.mode.local.auto=true;
The sum function combined with a window: the result depends on order by; the default sort is ascending.
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime) as pv1
from test_t1;
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime rows between unbounded preceding and current row) as pv2
from test_t1;
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid) as pv3
from test_t1;
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and current row) as pv4
from test_t1;
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and 1 following) as pv5
from test_t1;
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime rows between current row and unbounded following) as pv6
from test_t1;

pv1: cumulative pv within the group from the first row to the current row; e.g. on the 11th, pv1 = pv of the 10th + pv of the 11th; on the 12th = 10th + 11th + 12th.
pv2: same as pv1.
pv3: the sum of all pv within the group (cookie1).
pv4: the current row plus the preceding 3 rows within the group; e.g. on the 11th = 10th + 11th; on the 12th = 10th + 11th + 12th; on the 13th = 10th + 11th + 12th + 13th; on the 14th = 11th + 12th + 13th + 14th.
pv5: the current row plus the preceding 3 rows and the following 1 row; e.g. on the 14th = 11th + 12th + 13th + 14th + 15th = 5+7+3+2+4 = 21.
pv6: the current row plus all following rows; e.g. on the 13th = 13th + 14th + 15th + 16th = 3+2+4+4 = 13; on the 14th = 14th + 15th + 16th = 2+4+4 = 10.
If rows between is not specified, the default is from the starting point to the current row.
If you do not specify order by, all values in the group are added.
The key is to understand the meaning of rows between, also known as the window clause:
preceding: earlier rows
following: later rows
current row: the current row
unbounded: no bound
unbounded preceding: from the start of the partition
unbounded following: to the end of the partition
AVG, MIN, MAX, and SUM are used the same way.
row_number, rank, dense_rank, ntile
To prepare data
cookie1,2020-04-10,1
cookie1,2020-04-11,5
cookie1,2020-04-12,7
cookie1,2020-04-13,3
cookie1,2020-04-14,2
cookie1,2020-04-15,4
cookie1,2020-04-16,4
cookie2,2020-04-10,2
cookie2,2020-04-11,3
cookie2,2020-04-12,5
cookie2,2020-04-13,6
cookie2,2020-04-14,3
cookie2,2020-04-15,9
cookie2,2020-04-16,7
CREATE TABLE test_t2 (
cookieid string,
createtime string,   -- day
pv INT
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile;
Load the data:
load data local inpath '/root/hivedata/test_t2.dat' into table test_t2;
- ROW_NUMBER
ROW_NUMBER() generates the sequence number of each record within a group, starting from 1 in order.
SELECT
cookieid,
createtime,
pv,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn
FROM test_t2;
- RANK and DENSE_RANK
RANK() generates the rank of each item within a group; equal values receive the same rank, and gaps are left in the sequence after ties.
DENSE_RANK() generates the rank of each item within a group; equal values receive the same rank, and no gaps are left after ties.
SELECT
cookieid,
createtime,
pv,
RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn1,
DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn2,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3
FROM test_t2
WHERE cookieid = 'cookie1';
- NTILE
Sometimes there is a requirement like: split the data into three equal parts and analyze only one of them. How do we take out, say, the middle third of the rows? The NTILE function handles this.
ntile divides the rows of an ordered partition into the specified number of buckets (num) and assigns each row its bucket number. If the rows cannot be divided evenly, lower-numbered buckets receive the extra rows, so the row counts of the buckets differ by at most one.
You can then select the first or last n-th fraction of the data by its bucket number. ntile itself keeps all the data and only labels each row; to extract a fraction, nest the query and filter on the label, as sketched after the next query.
SELECT
cookieid,
createtime,
pv,
NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn1,
NTILE(3) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn2,
NTILE(4) OVER(ORDER BY createtime) AS rn3
FROM test_t2
ORDER BY cookieid,createtime;
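A hedged sketch of that nesting step (bucket 2 of 3 is the "middle third"; names follow the query above):

```sql
SELECT cookieid, createtime, pv
FROM (
  SELECT cookieid, createtime, pv,
         NTILE(3) OVER (PARTITION BY cookieid ORDER BY createtime) AS bucket
  FROM test_t2
) t
WHERE bucket = 2;  -- keep only the middle third of each cookie's rows
```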
Some other window functions
lag,lead,first_value,last_value
- LAG
LAG(col, n, DEFAULT): returns the value of col from the n-th row before the current row within the window. The first argument is the column name; the second is the offset n (optional, default 1); the third is the default value used when the n-th preceding row does not exist (NULL if not specified).
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAG(createtime,1.'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time
FROM test_t4;

last_1_time: the value from 1 row back, with default '1970-01-01 00:00:00'.
For cookie1's first row there is no previous row, so the default 1970-01-01 00:00:00 is used.
For cookie1's third row, the value comes from the second row: 2015-04-10 10:00:02.
For cookie1's sixth row, the value comes from the fifth row: 2015-04-10 10:50:01.
last_2_time: the value from 2 rows back, with no default specified.
For cookie1's first and second rows there is no row 2 back, so the result is NULL.
For cookie1's fourth row, the value comes from the second row: 2015-04-10 10:00:02.
For cookie1's seventh row, the value comes from the fifth row: 2015-04-10 10:50:01.
- LEAD
LEAD(col, n, DEFAULT): returns the value of col from the n-th row after the current row within the window. The first argument is the column name; the second is the offset n (optional, default 1); the third is the default value used when the n-th following row does not exist (NULL if not specified).
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LEAD(createtime,1.'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time,
LEAD(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time
FROM test_t4;
- FIRST_VALUE
Takes the first value, in the group's sort order, up to the current row.
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1
FROM test_t4;
- LAST_VALUE
Takes the last value, in the group's sort order, up to the current row.
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1
FROM test_t4;
To take the overall last value in the group's sort order, a workaround is needed:
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last2
FROM test_t4
ORDER BY cookieid,createtime;
Pay special attention to order by:
if order by is omitted, the ordering is undefined and incorrect results are likely.
SELECT cookieid,
createtime,
url,
FIRST_VALUE(url) OVER(PARTITION BY cookieid) AS first2
FROM test_t4;
cume_dist,percent_rank
These two sequence analysis functions are less commonly used. Note: sequence functions do not support the window (rows between) clause.
- Data preparation
d1,user1,1000
d1,user2,2000
d1,user3,3000
d2,user4,4000
d2,user5,5000
CREATE EXTERNAL TABLE test_t3 (
dept STRING,
userid string,
sal INT
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile;
Load the data:
load data local inpath '/root/hivedata/test_t3.dat' into table test_t3;
- CUME_DIST (the result depends on the ORDER BY direction)
CUME_DIST: (number of rows less than or equal to the current value) / (total number of rows in the group). The default order is ascending. Example: compute the proportion of people whose salary is less than or equal to the current salary.
SELECT
dept,
userid,
sal,
CUME_DIST() OVER(ORDER BY sal) AS rn1,
CUME_DIST() OVER(PARTITION BY dept ORDER BY sal) AS rn2
FROM test_t3;

rn1: no partition, so all rows form one group with 5 rows in total.
Row 1: 1 row has sal <= 1000, so 1/5 = 0.2.
Row 3: 3 rows have sal <= 3000, so 3/5 = 0.6.
rn2: grouped by department; dept d1 has 3 rows.
Row 2: 2 rows have sal <= 2000, so 2/3 = 0.6666666666666666.
- PERCENT_RANK
PERCENT_RANK: (the current row's RANK within the group - 1) / (total rows in the group - 1).
Its practical use is not obvious; verify it carefully before relying on it.
SELECT
dept,
userid,
sal,
PERCENT_RANK() OVER(ORDER BY sal) AS rn1,   -- over all rows
RANK() OVER(ORDER BY sal) AS rn11,          -- rank over all rows
SUM(1) OVER(PARTITION BY NULL) AS rn12,     -- total number of rows
PERCENT_RANK() OVER(PARTITION BY dept ORDER BY sal) AS rn2
FROM test_t3;
rn1 = (rn11 - 1) / (rn12 - 1).
Row 1: (1-1)/(5-1) = 0/4 = 0; row 2: (2-1)/(5-1) = 1/4 = 0.25; row 4: (4-1)/(5-1) = 3/4 = 0.75.
rn2: grouped by dept; dept d1 has 3 rows in total.
Row 1: (1-1)/(3-1) = 0; row 3: (3-1)/(3-1) = 1.
grouping sets,grouping__id,cube,rollup
These analytic clauses are typically used in OLAP, where totals cannot simply be accumulated; statistics must instead be drilled up and down across different dimensions, such as UV counts by minute, hour, day, and month.
- Data preparation
2020-03,2020-03-10,cookie1
2020-03,2020-03-10,cookie5
2020-03,2020-03-12,cookie7
2020-04,2020-04-12,cookie3
2020-04,2020-04-13,cookie2
2020-04,2020-04-13,cookie4
2020-04,2020-04-16,cookie4
2020-03,2020-03-10,cookie2
2020-03,2020-03-10,cookie3
2020-04,2020-04-12,cookie5
2020-04,2020-04-13,cookie6
2020-04,2020-04-15,cookie3
2020-04,2020-04-15,cookie2
2020-04,2020-04-16,cookie1
CREATE TABLE test_t5 (
month STRING,
day STRING,
cookieid STRING
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile;
Load the data:
load data local inpath '/root/hivedata/test_t5.dat' into table test_t5;
- GROUPING SETS
grouping sets is a convenient way to combine multiple GROUP BY logics in a single SQL statement.
It is equivalent to a UNION ALL of the GROUP BY result sets for the different dimensions.
GROUPING__ID indicates which grouping set each result row belongs to.
SELECT
month, day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM test_t5
GROUP BY month, day
GROUPING SETS (month, day)
ORDER BY GROUPING__ID;

For GROUPING SETS (month, day), GROUPING__ID 1 represents month and 2 represents day. The query is equivalent to:
SELECT month, NULL, COUNT(DISTINCT cookieid) AS uv, 1 AS GROUPING__ID FROM test_t5 GROUP BY month
UNION ALL
SELECT NULL AS month, day, COUNT(DISTINCT cookieid) AS uv, 2 AS GROUPING__ID FROM test_t5 GROUP BY day;
Another example:
SELECT
month, day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM test_t5
GROUP BY month, day
GROUPING SETS (month, day, (month, day))
ORDER BY GROUPING__ID;

This is equivalent to:
SELECT month, NULL, COUNT(DISTINCT cookieid) AS uv, 1 AS GROUPING__ID FROM test_t5 GROUP BY month
UNION ALL
SELECT NULL, day, COUNT(DISTINCT cookieid) AS uv, 2 AS GROUPING__ID FROM test_t5 GROUP BY day
UNION ALL
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, 3 AS GROUPING__ID FROM test_t5 GROUP BY month, day;
- CUBE
Aggregated according to all combinations of the dimensions of GROUP BY.
SELECT
month, day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM test_t5
GROUP BY month, day
WITH CUBE
ORDER BY GROUPING__ID;

This is equivalent to:
SELECT NULL, NULL, COUNT(DISTINCT cookieid) AS uv, 0 AS GROUPING__ID FROM test_t5
UNION ALL
SELECT month, NULL, COUNT(DISTINCT cookieid) AS uv, 1 AS GROUPING__ID FROM test_t5 GROUP BY month
UNION ALL
SELECT NULL, day, COUNT(DISTINCT cookieid) AS uv, 2 AS GROUPING__ID FROM test_t5 GROUP BY day
UNION ALL
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, 3 AS GROUPING__ID FROM test_t5 GROUP BY month, day;
- ROLLUP
ROLLUP is a subset of CUBE: it is driven by the leftmost dimension and performs hierarchical aggregation starting from that dimension.
For example, hierarchical aggregation by the month dimension:
SELECT
month, day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM test_t5
GROUP BY month, day
WITH ROLLUP
ORDER BY GROUPING__ID;
Change the order of month and day to create a hierarchical aggregation of day:
SELECT
day, month,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM test_t5
GROUP BY day, month
WITH ROLLUP
ORDER BY GROUPING__ID;
(Here, aggregating by day and then month gives the same result as aggregating by day alone, because of the parent-child relationship between them; with a combination of other dimensions the results would differ.)
Search WeChat for the public account "Five Minutes to Learn Big Data" for weekly original big data articles, deep dives into framework internals, and big-company interview questions.