I. Functions
1. Built-in system functions
1.1. View the system's built-in functions

hive (default)> show functions;
1.2. Show the usage of a built-in function

hive (default)> desc function upper;
1.3. Show detailed usage of a built-in function

hive (default)> desc function extended upper;
2. Commonly used built-in functions
2.1. NULL value replacement (NVL)
- Function description
NVL assigns a default to data whose value is NULL. Its format is NVL(value, default_value): it returns default_value if value is NULL, returns value otherwise, and returns NULL if both arguments are NULL.
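The three cases can be checked quickly from the Hive CLI with literal values (expected results shown as comments):

hive (default)> select nvl(1, 2);       -- 1: value is not NULL, so value is returned
hive (default)> select nvl(null, 2);    -- 2: value is NULL, so default_value is returned
hive (default)> select nvl(null, null); -- NULL: both arguments are NULL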
- Data preparation: use the employee table emp
- Query: if comm is NULL, substitute -1

hive (default)> select comm, nvl(comm,-1) from emp;
OK
comm    _c1
NULL    -1.0
300.0   300.0
500.0   500.0
NULL    -1.0
1400.0  1400.0
NULL    -1.0
NULL    -1.0
NULL    -1.0
NULL    -1.0
0.0     0.0
NULL    -1.0
NULL    -1.0
NULL    -1.0
NULL    -1.0
- Query: if an employee's comm is NULL, substitute the manager's id (mgr)

hive (default)> select comm, nvl(comm,mgr) from emp;
OK
comm    _c1
NULL    7902.0
300.0   300.0
500.0   500.0
NULL    7839.0
1400.0  1400.0
NULL    7839.0
NULL    7839.0
NULL    7566.0
NULL    NULL
0.0     0.0
NULL    7788.0
NULL    7698.0
NULL    7566.0
NULL    7782.0
2.2. CASE WHEN ... THEN ... ELSE ... END
- Data preparation
- Requirement: count how many men and women there are in each department.
- Create emp_sex.txt and load the data

[moe@hadoop102 datas]$ vi emp_sex.txt
Wukong     A   male
Dahai      A   male
Song Song  B   male
Feng Jie   A   female
Ting Jie   B   female
Tingting   B   female
- Create the Hive table and load the data

hive (default)> create table emp_sex
              > (name string, dept_id string, sex string)
              > row format delimited fields terminated by '\t';
hive (default)> load data local inpath '/opt/module/hive-3.1.2/datas/emp_sex.txt' into table emp_sex;
- Query the data as required

hive (default)> select dept_id,
              >        sum(case sex when 'male' then 1 else 0 end) male_count,
              >        sum(case sex when 'female' then 1 else 0 end) female_count
              > from emp_sex
              > group by dept_id;
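Given the six rows loaded above, the query should produce:

dept_id  male_count  female_count
A        2           1
B        1           2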
2.3. Rows to columns
- Related function descriptions
CONCAT(string A/col, string B/col, ...): returns the string formed by concatenating the inputs; it accepts any number of input strings or columns.
CONCAT_WS(separator, str1, str2, ...): a special form of CONCAT(). The first argument is the separator placed between the remaining arguments. The separator can be any string, as can the remaining arguments. If the separator is NULL, the return value is also NULL. The function skips any NULL values after the separator argument; the separator is inserted between the strings being concatenated.
Note: the arguments of CONCAT_WS must be string or array<string>.
COLLECT_SET(col): accepts only basic data types; it aggregates the values of a field, removing duplicates, into an array.
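A quick sketch of the three functions before the full example (expected output in comments; the last line reuses the emp_sex table from section 2.2, and the element order of collect_set is not guaranteed):

hive (default)> select concat('Aries', ',', 'A');      -- Aries,A
hive (default)> select concat_ws('|', 'a', null, 'b'); -- a|b (the NULL is skipped)
hive (default)> select collect_set(sex) from emp_sex;  -- ["male","female"]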
- Data preparation
- Requirement: group together the names of people who have the same constellation and the same blood type.
- Create a local person_info.txt and load the data

[moe@hadoop102 datas]$ vim person_info.txt
Sun Wukong  Aries        A
Dahai       Sagittarius  A
Song Song   Aries        B
Zhu Bajie   Aries        A
Feng Jie    Sagittarius  A
Lao Shi     Aries        B
- Create the Hive table and load the data

hive (default)> create table person_info
              > (name string, constellation string, blood_type string)
              > row format delimited fields terminated by '\t';
hive (default)> load data local inpath '/opt/module/hive-3.1.2/datas/person_info.txt' into table person_info;
- Query the data as required

hive (default)> SELECT
              >     t1.c_b,
              >     CONCAT_WS("|", collect_set(t1.name))
              > FROM (
              >     SELECT
              >         name,
              >         CONCAT_WS(',', constellation, blood_type) c_b
              >     FROM person_info
              > ) t1
              > GROUP BY t1.c_b;
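With the six rows above, the expected result is (the order of names within a group may vary):

Aries,A        Sun Wukong|Zhu Bajie
Aries,B        Song Song|Lao Shi
Sagittarius,A  Dahai|Feng Jie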
2.4. Columns to rows
- Function descriptions
EXPLODE(col): splits a complex Array or Map stored in one Hive column into multiple rows.
LATERAL VIEW
Usage: LATERAL VIEW udtf(expression) tableAlias AS columnAlias
Used together with UDTFs such as split and explode: the UDTF splits one column of data into multiple rows, and LATERAL VIEW joins the split rows back to the source row so they can be aggregated.
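explode can be tried directly on a literal value, without creating any table:

hive (default)> select explode(split('a,b,c', ','));
OK
col
a
b
c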
- Data preparation
- Requirement: expand the comma-separated categories of each movie onto separate rows.
- Create a local movie_info.txt and load the data

[moe@hadoop102 datas]$ vi movie_info.txt
Person of Interest  Suspense,Action,Science Fiction,Plot
Lie to Me           Suspense,Police,Psychology,Plot
Wolf Warrior 2      War,Action,Disaster
- Create the Hive table and load the data

hive (default)> create table movie_info
              > (movie string, category string)
              > row format delimited fields terminated by '\t';
hive (default)> load data local inpath '/opt/module/hive-3.1.2/datas/movie_info.txt' into table movie_info;
- Query the data as required

hive (default)> SELECT
              >     movie,
              >     category_name
              > FROM movie_info
              > LATERAL VIEW
              >     explode(split(category, ",")) movie_info_tmp AS category_name;
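With the three movies above, the query expands into eleven rows:

movie               category_name
Person of Interest  Suspense
Person of Interest  Action
Person of Interest  Science Fiction
Person of Interest  Plot
Lie to Me           Suspense
Lie to Me           Police
Lie to Me           Psychology
Lie to Me           Plot
Wolf Warrior 2      War
Wolf Warrior 2      Action
Wolf Warrior 2      Disaster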
2.5. Window functions (windowing)
- Related function descriptions
OVER(): specifies the window of data over which an analytic function works; the window can change from row to row.
CURRENT ROW: the current row
n PRECEDING: the n rows before the current row
n FOLLOWING: the n rows after the current row
UNBOUNDED: the partition boundary. UNBOUNDED PRECEDING starts the window at the first row of the partition; UNBOUNDED FOLLOWING ends it at the last row.
LAG(col, n, default_val): the value of col from the nth row before the current row
LEAD(col, n, default_val): the value of col from the nth row after the current row
NTILE(n): distributes the rows of an ordered window into n numbered groups, starting from 1. For each row, NTILE returns the number of the group the row belongs to. Note: n must be of type int.
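LAG and NTILE are exercised by the requirements below, but LEAD is not. A minimal sketch against the business table created in this section (the '9999-99-99' default is an arbitrary sentinel meaning "no later purchase"):

select name, orderdate,
       lead(orderdate, 1, '9999-99-99') over(partition by name order by orderdate) next_orderdate
from business;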
- Data preparation: name, orderdate, cost

Jack,2017-01-01,10
Tony,2017-01-02,15
Jack,2017-02-03,23
Tony,2017-01-04,29
Jack,2017-01-05,46
Jack,2017-04-06,42
Tony,2017-01-07,50
Jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
Neil,2017-05-10,12
mart,2017-04-11,75
Neil,2017-06-12,80
mart,2017-04-13,94
- Requirements
- Query the customers who purchased in April 2017, and the total number of such customers
- Query customers' purchase details along with the total purchase amount per month
- In the scenario above, accumulate each customer's cost by date
- Query each customer's previous purchase date
- Query the order information of the first 20% of orders by time
- Create a local business.txt (the 14 rows shown above) and load the data

[moe@hadoop102 datas]$ vi business.txt
- Create the Hive table and load the data

hive (default)> create table business(
              >     name string,
              >     orderdate string,
              >     cost int)
              > row format delimited fields terminated by ',';
hive (default)> load data local inpath '/opt/module/hive-3.1.2/datas/business.txt' into table business;
- Query the data as required
- Query the customers who purchased in April 2017, and the total number of such customers
Without the over() function:

hive (default)> select name, count(*)
              > from business
              > where substring(orderdate,1,7) = '2017-04'
              > group by name;
With the over() function:

hive (default)> select
              >     name, count(*) over()
              > from business
              > where substring(orderdate,1,7) = '2017-04'
              > group by name;
- Query customers' purchase details and the total purchase amount per month

hive (default)> select
              >     name, orderdate, cost, sum(cost) over(partition by month(orderdate))
              > from business;
- Accumulate each customer's cost by date

select name, orderdate, cost,
       sum(cost) over() as sample1,  -- sum over all rows
       sum(cost) over(partition by name) as sample2,  -- sum within each name
       sum(cost) over(partition by name order by orderdate) as sample3,  -- running total within each name, by date
       sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row) as sample4,  -- same as sample3
       sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5,  -- previous row plus the current row
       sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and 1 FOLLOWING) as sample6,  -- previous row, current row, and next row
       sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING) as sample7  -- current row and all following rows
from business;

Note: the ROWS clause must be used together with ORDER BY; it restricts the window to a fixed number of rows within the partition.
- Query each customer's previous purchase date

select name, orderdate, cost,
       lag(orderdate, 1, '1900-01-01') over(partition by name order by orderdate) as time1,
       lag(orderdate, 2) over(partition by name order by orderdate) as time2
from business;
- Query the order information of the first 20% of orders by time

select * from (
    select name, orderdate, cost,
           ntile(5) over(order by orderdate) sorted
    from business
) t
where sorted = 1;
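Why ntile(5) works here: business has 14 rows, so ntile(5) places 3 rows in each of the first four groups and 2 in the last; sorted = 1 therefore selects the 3 earliest orders, roughly the first 20% by time.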
2.6. Rank
- Function descriptions
RANK(): rows with equal values receive the same rank, and the following ranks are skipped, leaving gaps in the numbering.
DENSE_RANK(): rows with equal values receive the same rank, but no ranks are skipped, so the total number of distinct ranks shrinks.
ROW_NUMBER(): numbers the rows consecutively in sort order, ignoring ties.
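For example, ranking the English scores from the data below in descending order (84, 84, 78, 68, where the two 84s tie): RANK() gives 1, 1, 3, 4; DENSE_RANK() gives 1, 1, 2, 3; ROW_NUMBER() gives 1, 2, 3, 4.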
- Data preparation
- Requirement: rank the scores within each subject.
- Create a local score.txt file and load the data

[moe@hadoop102 datas]$ vi score.txt
Sun Wukong  Chinese  87
Sun Wukong  Math     95
Sun Wukong  English  68
Dahai       Chinese  94
Dahai       Math     56
Dahai       English  84
Song Song   Chinese  64
Song Song   Math     86
Song Song   English  84
Tingting    Chinese  65
Tingting    Math     85
Tingting    English  78
- Create the Hive table and load the data

hive (default)> create table score(
              >     name string,
              >     subject string,
              >     score int)
              > row format delimited fields terminated by '\t';
hive (default)> load data local inpath '/opt/module/hive-3.1.2/datas/score.txt' into table score;
- Query the data as required

select name, subject, score,
       rank() over(partition by subject order by score desc) rp,
       dense_rank() over(partition by subject order by score desc) drp,
       row_number() over(partition by subject order by score desc) rmp
from score;
3. Custom functions
- Hive ships with a number of built-in functions, such as max/min, but they are limited. You can extend Hive conveniently by writing user-defined functions (UDFs).
- When the built-in functions provided by Hive cannot meet your business processing needs, you can turn to user-defined functions (UDFs).
- User-defined functions fall into the following three categories
- UDF (User-Defined Function)
One row in, one row out
- UDAF (User-Defined Aggregation Function)
Many rows in, one row out
Similar to: count/max/min
- UDTF (User-Defined Table-Generating Function)
One row in, multiple rows out
e.g. lateral view explode()
- Official documentation address
cwiki.apache.org/confluence/…
- Programming steps
- Inherit the class provided by Hive
org.apache.hadoop.hive.ql.udf.generic.GenericUDF
org.apache.hadoop.hive.ql.udf.generic.GenericUDTF
- Implement the abstract methods of the class
- Create the function in the Hive command-line window
Add the jar:

add jar linux_jar_path

Create the function:

create [temporary] function [dbname.]function_name AS class_name;
- Delete the function in the Hive command-line window

drop [temporary] function [if exists] [dbname.]function_name;
4. Custom UDF function
- Requirement: write a custom UDF that computes the length of a given string, for example:

hive (default)> select my_len("abcd");
4
- Create a Maven project named hive
- Import the dependency

<dependencies>
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>3.1.2</version>
    </dependency>
</dependencies>
- Create the class

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class MyUDF extends GenericUDF {

    // Validate the number of arguments
    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        if (arguments.length != 1) {
            throw new UDFArgumentException("The number of arguments is not one.");
        }
        // The function returns an int: the length of the input string
        return PrimitiveObjectInspectorFactory.javaIntObjectInspector;
    }

    // Process one row of data
    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        // 1. Take the input value; check for NULL before calling toString()
        Object input = arguments[0].get();
        if (input == null) {
            return 0;
        }
        // 2. Return the length of the input string
        return input.toString().length();
    }

    @Override
    public String getDisplayString(String[] children) {
        return "";
    }
}
- Upload the jar package to the server: /opt/module/hive-3.1.2/datas/myudf.jar
- Add the jar package to Hive's classpath

hive (default)> add jar /opt/module/hive-3.1.2/datas/myudf.jar;
- Create a temporary function associated with the developed Java class

hive (default)> create temporary function my_len as "com.moe.hive.udf.MyUDF";
- The custom function can now be used in HQL

hive (default)> select my_len('moe');
OK
_c0
3
hive (default)> select length('moe');
OK
_c0
3
5. Custom UDTF function
- Requirement: write a custom UDTF that splits a string on an arbitrary separator into individual words, for example:

hive (default)> select myudtf("hello,world,hadoop,hive", ",");
hello
world
hadoop
hive
- Code implementation

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class MyUDTF extends GenericUDTF {

    // Reusable collection that holds one row of output data
    private ArrayList<String> outPutList = new ArrayList<>();

    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
        // Default column name of the output; it can be overridden with an alias
        List<String> fieldNames = new ArrayList<>();
        fieldNames.add("word");

        // Type of the output column
        List<ObjectInspector> fieldOIs = new ArrayList<>();
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

        // The final return value describes the structure of the output rows
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    // Process the input data, e.g. ("hello,moe,hive", ",")
    @Override
    public void process(Object[] args) throws HiveException {
        // 1. Take the input string and the separator
        String input = args[0].toString();
        String splitKey = args[1].toString();

        // 2. Split the string on the separator
        String[] words = input.split(splitKey);

        // 3. Emit each word as a separate output row
        for (String word : words) {
            outPutList.clear();
            outPutList.add(word);
            forward(outPutList);
        }
    }

    // Nothing to clean up when the operator closes
    @Override
    public void close() throws HiveException {
    }
}
- Upload the jar package to the server: /opt/module/hive-3.1.2/datas/myudtf.jar
- Add the jar package to Hive's classpath

hive (default)> add jar /opt/module/hive-3.1.2/datas/myudtf.jar;
- Create a temporary function associated with the developed Java class

hive (default)> create temporary function myudtf as "com.moe.hive.udtf.MyUDTF";
- Use the custom function

hive (default)> select myudtf('hello,moe,hive', ',');
OK
word
hello
moe
hive
II. Related Links
Big data Hive learning journey 4
Big data Hive learning journey 3
Big data Hive learning journey 2
Big data Hive learning journey 1