
One, Functions

1. Built-in system functions

1.1. Check the built-in functions of the system

hive (default)> show functions;
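The list is long; a LIKE pattern can narrow it. A quick sketch, assuming the pattern-matching form of SHOW FUNCTIONS available in Hive 3.x:

    hive (default)> show functions like "*date*";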

1.2. Display the usage of a built-in function

hive (default)> desc function upper;

1.3. Display the usage of built-in functions in detail

hive (default)> desc function extended upper;

2. Commonly used built-in functions

2.1. Empty field assignment

  1. Function description

NVL: assigns a replacement value to data that is NULL, in the form NVL(value, default_value). It returns default_value if value is NULL, and value otherwise; if both arguments are NULL, the result is NULL.
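A quick check on literals (illustrative output):

    hive (default)> select nvl(null, -1), nvl(100, -1);
    OK
    _c0    _c1
    -1     100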

  2. Data preparation: use the employee table emp

  3. Query: if comm is NULL, use -1 instead

    hive (default)> select comm, nvl(comm,-1) from emp;
    OK
    comm     _c1
    NULL     -1.0
    300.0    300.0
    500.0    500.0
    NULL     -1.0
    1400.0   1400.0
    NULL     -1.0
    NULL     -1.0
    NULL     -1.0
    NULL     -1.0
    0.0      0.0
    NULL     -1.0
    NULL     -1.0
    NULL     -1.0
    NULL     -1.0
  4. Query: if an employee's comm is NULL, use mgr (the leader's ID) instead

    hive (default)> select comm, nvl(comm,mgr) from emp;
    OK
    comm     _c1
    NULL     7902.0
    300.0    300.0
    500.0    500.0
    NULL     7839.0
    1400.0   1400.0
    NULL     7839.0
    NULL     7839.0
    NULL     7566.0
    NULL     NULL
    0.0      0.0
    NULL     7788.0
    NULL     7698.0
    NULL     7566.0
    NULL     7782.0

2.2. CASE WHEN THEN ELSE END

  1. Data preparation

  2. Requirement

    Count how many men and women there are in each department; Hive's CASE expression, sketched below, handles the per-sex counting.
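    For reference, Hive's CASE expression comes in two forms; a minimal sketch on literals (illustrative):

    -- simple form: compares one expression against candidate values
    hive (default)> select case 'male' when 'male' then 1 else 0 end;   -- 1
    -- searched form: evaluates boolean conditions
    hive (default)> select case when 1 > 0 then 'yes' else 'no' end;    -- yes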

  3. Create a local emp_sex.txt and import the data

    [moe@hadoop102 datas]$ vi emp_sex.txt
    Wukong       A   male
    Dahai        A   male
    Song Song    B   male
    Feng sister  A   female
    Ting sister  B   female
    Tingting     B   female
  4. Create a Hive table and import the data

    hive (default)> create table emp_sex
                  > (name string, dept_id string, sex string)
                  > row format delimited fields terminated by '\t';
                  
    hive (default)> load data local inpath '/opt/module/hive-3.1.2/datas/emp_sex.txt' into table emp_sex;
  5. Query data as required

    hive (default)> select dept_id,
                  >        sum(case sex when 'male' then 1 else 0 end) male_count,
                  >        sum(case sex when 'female' then 1 else 0 end) female_count
                  > from emp_sex
                  > group by dept_id;
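    With the sample data above, department A has two men and one woman and department B has one man and two women, so the query should return:

      dept_id  male_count  female_count
      A        2           1
      B        1           2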

2.3. Row to column

  1. Related function descriptions

    CONCAT(string A/col, string B/col, …): returns the result of concatenating the input strings; any number of input strings is supported.

    CONCAT_WS(separator, str1, str2, …): a special form of CONCAT(). The first argument is the separator placed between the remaining arguments. The separator can be the same string as any of the remaining arguments. If the separator is NULL, the return value is also NULL. The function skips any NULL and empty strings after the separator argument, adding the separator between the concatenated strings.

    Note: the arguments to CONCAT_WS must be string or array&lt;string&gt;.

    COLLECT_SET(col): accepts only primitive data types; it aggregates and deduplicates the values of a field, producing an Array-typed field.
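    A quick illustration on literals (illustrative output; the array form matches the note above):

    hive (default)> select concat('2017', '-', '04');
    2017-04
    hive (default)> select concat_ws('-', 'a', null, 'b');
    a-b
    hive (default)> select concat_ws('.', 'www', array('example', 'com'));
    www.example.com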

  2. Data preparation

  3. Requirement

    Group people who share the same constellation and blood type (the expected result is shown after the query in step 6).

  4. Create a local person_info.txt and import the data

    [moe@hadoop102 datas]$ vim person_info.txt
    Sun Wukong      Aries        A
    Sea             Sagittarius  A
    Song Song       Aries        B
    Zhu Bajie       Aries        A
    Phoenix sister  Sagittarius  A
    Old teacher     Aries        B
  5. Create a Hive table and import the data

    hive (default)> create table person_info
                  > (name string, constellation string, blood_type string)
                  > row format delimited fields terminated by '\t';
    
    hive (default)> load data local inpath '/opt/module/hive-3.1.2/datas/person_info.txt' into table person_info;
  6. Query data as required

    hive (default)> SELECT
                  > t1.c_b,
                  > CONCAT_WS("|",collect_set(t1.name))
                  > FROM (
                  > SELECT
                  > NAME,
                  > CONCAT_WS(',',constellation,blood_type) c_b
                  > FROM person_info
                  > )t1
                  > GROUP BY t1.c_b;
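    With the sample data above, the expected groupings are (row order, and name order within a group, may vary):

      t1.c_b         _c1
      Aries,A        Sun Wukong|Zhu Bajie
      Aries,B        Song Song|Old teacher
      Sagittarius,A  Sea|Phoenix sister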

2.4. Column to row

  1. Function description

    EXPLODE(col): splits the complex Array or Map structure in a Hive column into multiple rows.

    LATERAL VIEW

    Usage: LATERAL VIEW udtf(expression) tableAlias AS columnAlias

    Use with split, explode, and other UDTFs to split a column of data into multiple rows and then aggregate the split data.
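    A standalone illustration on a literal array (for an exploded array the default output column is named col):

    hive (default)> select explode(array('A', 'B', 'C'));
    OK
    col
    A
    B
    C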

  2. Data preparation

  3. Requirement

    Expand the array data in the movie category column (the expected result is shown after the query in step 6).

  4. Create a local movie_info.txt and import the data

    [moe@hadoop102 datas]$ vi movie_info.txt
    Person of Interest  Suspense,Action,Science Fiction,Plot
    Lie to Me           Suspense,Police,Psychology,Plot
    Wolf Warrior 2      War,Action,Disaster
  5. Create a Hive table and import the data

    hive (default)> create table movie_info
                  > (movie string, category string)
                  > row format delimited fields terminated by '\t';
    
    hive (default)> load data local inpath '/opt/module/hive-3.1.2/datas/movie_info.txt' into table movie_info;
  6. Query data as required

    hive (default)> SELECT
                  > movie,
                  > category_name
                  > FROM
                  > movie_info
                  > LATERAL VIEW
                  > explode(split(category, ",")) movie_info_tmp AS category_name;
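    With the sample data above, each movie expands into one row per category:

      movie               category_name
      Person of Interest  Suspense
      Person of Interest  Action
      Person of Interest  Science Fiction
      Person of Interest  Plot
      Lie to Me           Suspense
      Lie to Me           Police
      Lie to Me           Psychology
      Lie to Me           Plot
      Wolf Warrior 2      War
      Wolf Warrior 2      Action
      Wolf Warrior 2      Disaster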

2.5. Window functions (windowing functions)

  1. Related function descriptions

    OVER(): specifies the window of data over which the analytic function works; the window size may change from row to row.

    CURRENT ROW: the current row

    N PRECEDING: Indicates the PRECEDING N rows of data

    N FOLLOWING: indicates the next n rows of data

    UNBOUNDED: unbounded. UNBOUNDED PRECEDING means the window starts at the first row of the partition; UNBOUNDED FOLLOWING means it ends at the last row.

    LAG(col, n, default_val): the value of col from the nth preceding row; default_val is used when that row does not exist.

    LEAD(col, n, default_val): the value of col from the nth following row; default_val is used when that row does not exist.

    NTILE(n): distributes the rows of an ordered window into n numbered groups, starting from 1. For each row, NTILE returns the number of the group the row belongs to. Note: n must be of type int.

  2. Data preparation: name, OrderDate, cost

    jack,2017-01-01,10
    tony,2017-01-02,15
    jack,2017-02-03,23
    tony,2017-01-04,29
    jack,2017-01-05,46
    jack,2017-04-06,42
    tony,2017-01-07,50
    jack,2017-01-08,55
    mart,2017-04-08,62
    mart,2017-04-09,68
    neil,2017-05-10,12
    mart,2017-04-11,75
    neil,2017-06-12,80
    mart,2017-04-13,94
  3. Requirements

    • Query the total number of customers who purchased in April 2017
    • Query customer purchase details and total monthly purchases
    • In the scenario above, accumulate each customer's cost by date
    • Query each customer's previous purchase time
    • Query the order information for the first 20% of the time
  4. Create local business.txt and import data

    [moe@hadoop102 datas]$ vi business.txt
  5. Create hive tables and import data

    hive (default)> create table business(
                  > name string,
                  > orderdate string,
                  > cost int)
                  > row format delimited fields terminated by ',';
    
    hive (default)> load data local inpath '/opt/module/hive-3.1.2/datas/business.txt' into table business;
  6. Query data as required

    • Query the total number of customers who purchased in April 2017

      Without the over() function:

      hive (default)> select name, count(*)
                    > from business
                    > where substring(orderdate,1,7) = '2017-04'
                    > group by name;

      With the over() function:

      hive (default)> select name, count(*) over()
                    > from business
                    > where substring(orderdate,1,7) = '2017-04'
                    > group by name;
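      The difference: without over(), count(*) is evaluated per group and returns each customer's own April purchase count (jack 1, mart 4). With over(), the empty window spans all rows that remain after grouping, so every row carries the total number of purchasing customers:

        jack  2
        mart  2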

    • Query customer purchase details and total monthly purchases

      hive (default)> select
                    > name,orderdate,cost,sum(cost) over(partition by month(orderdate))
                    > from business;

    • Accumulate each customer's cost by date

      select name, orderdate, cost,
      sum(cost) over() as sample1,  -- sum over all rows
      sum(cost) over(partition by name) as sample2,  -- sum within each name group
      sum(cost) over(partition by name order by orderdate) as sample3,  -- running total within each group
      sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row) as sample4,  -- same as sample3
      sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5,  -- previous row plus current row
      sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and 1 FOLLOWING) as sample6,  -- previous row, current row, and next row
      sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING) as sample7  -- current row and all following rows
      from business;

      A ROWS clause must follow the ORDER BY clause; it restricts the sorted result set, using a fixed number of rows to bound the window within the partition. (With an ORDER BY but no explicit frame, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which is why sample3 and sample4 return the same result.)

    • Query each customer's previous purchase date

      select 
      name,orderdate,cost,
      lag(orderdate,1,'1900-01-01') over(partition by name order by orderdate ) as time1,
      lag(orderdate,2) over (partition by name order by orderdate) as time2
      from business;
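      LEAD works in the opposite direction; a sketch (the alias next_time and the sentinel date are illustrative) returning each customer's next purchase date:

      select
      name,orderdate,cost,
      lead(orderdate,1,'9999-12-31') over(partition by name order by orderdate) as next_time
      from business;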

    • Query the order information for the first 20% of the time

      select * from (
      select name,orderdate,cost, ntile(5) over(order by orderdate) sorted
      from business
      ) t
      where sorted = 1;
      Here ntile(5) splits the date-ordered rows into five equal groups, so keeping group 1 (sorted = 1) returns the earliest 20% of orders.

2.6. Rank

  1. Function description

    RANK(): ties receive the same rank, and the total number of positions is preserved (gaps follow ties).

    DENSE_RANK(): ties receive the same rank, and the total number of distinct ranks shrinks (no gaps).

    ROW_NUMBER(): numbers rows consecutively in sort order, ignoring ties.
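    For example, three scores of 100, 100, and 99 sorted in descending order would be numbered as follows:

    score  rank()  dense_rank()  row_number()
    100    1       1             1
    100    1       1             2
    99     3       2             3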

  2. Data preparation

  3. Requirement

    Rank the scores within each subject.

  4. Create a local score.txt file and import the data

    [moe@hadoop102 datas]$ vi score.txt
    Sun Wukong  Chinese      87
    Sun Wukong  Mathematics  95
    Sun Wukong  English      68
    Sea         Chinese      94
    Sea         Mathematics  56
    Sea         English      84
    Song Song   Chinese      64
    Song Song   Mathematics  86
    Song Song   English      84
    Tingting    Chinese      65
    Tingting    Mathematics  85
    Tingting    English      78
  5. Create a Hive table and import the data

    hive (default)> create table score(
                  > name string,
                  > subject string,
                  > score int)
                  > row format delimited fields terminated by '\t';
    
    hive (default)> load data local inpath '/opt/module/hive-3.1.2/datas/score.txt' into table score;
  6. Query data as required

    select name,
    subject,
    score,
    rank() over(partition by subject order by score desc) rp,
    dense_rank() over(partition by subject order by score desc) drp,
    row_number() over(partition by subject order by score desc) rmp
    from score;

3. Custom functions

  1. Hive ships with a limited number of built-in functions, such as max/min; custom UDFs make it easy to extend.

  2. If the built-in functions provided by Hive cannot meet your service processing requirements, you can use user-defined functions (UDF).

  3. User-defined functions are classified into the following three categories

    • UDF (User-Defined Function)

      One row in, one row out

    • UDAF (User-Defined Aggregation Function)

      Many rows in, one row out

      Similar to: count/max/min

    • UDTF (User-Defined Table-Generating Function)

      One row in, many rows out

      Similar to: lateral view explode()

  4. Official Document Address

    cwiki.apache.org/confluence/…

  5. Programming steps

    • Inherit classes provided by Hive

      org.apache.hadoop.hive.ql.udf.generic.GenericUDF

      org.apache.hadoop.hive.ql.udf.generic.GenericUDTF

    • Implement abstract methods in a class

    • Create functions in the Hive command line window

      Add the jar

      add jar linux_jar_path

      Create a function

      create [temporary] function [dbname.]function_name AS class_name;
    • Delete functions in the Hive command line window

      drop [temporary] function [if exists] [dbname.]function_name;

4. Custom UDF functions

  1. Requirement

    Create a custom UDF that computes the length of a given string, for example:

    hive(default)> select my_len("abcd");
    4
  2. Create a Maven project named hive

  3. Import the dependency

    &lt;dependencies&gt;
        &lt;dependency&gt;
            &lt;groupId&gt;org.apache.hive&lt;/groupId&gt;
            &lt;artifactId&gt;hive-exec&lt;/artifactId&gt;
            &lt;version&gt;3.1.2&lt;/version&gt;
        &lt;/dependency&gt;
    &lt;/dependencies&gt;
  4. Create a class

    import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    import org.apache.hadoop.hive.ql.metadata.HiveException;
    import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
    
    public class MyUDF extends GenericUDF {
    
        // Validate the number of arguments
        @Override
        public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
    
            if (arguments.length != 1) {
                throw new UDFArgumentException("The number of arguments is not one.");
            }
    
            return PrimitiveObjectInspectorFactory.javaIntObjectInspector;
        }
    
        // Process data
        @Override
        public Object evaluate(DeferredObject[] arguments) throws HiveException {
    
            //1. Fetch the input data as an Object, so we can null-check before converting
            Object input = arguments[0].get();
    
            //2. Check whether the input data is null
            if (input == null) {
                return 0;
            }
    
            //3. Return the length of the input data
            return input.toString().length();
        }
    
        @Override
        public String getDisplayString(String[] children) {
            return "";
        }
    }
  5. Upload the jar package to the server /opt/module/hive-3.1.2/datas/myudf.jar

  6. Add jar packages to hive classpath

    hive (default)> add jar /opt/module/hive-3.1.2/datas/myudf.jar;
  7. Create temporary functions associated with developed Java classes

    hive (default)> create temporary function my_len as "com.moe.hive.udf.MyUDF";
  8. Use the custom function in HQL

    hive (default)> select my_len('moe');
    OK
    _c0
    3
    
    hive (default)> select length('moe');
    OK
    _c0
    3
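    A temporary function disappears when the session ends. Hive can also register a permanent function from a jar kept on HDFS; a sketch (the HDFS path and function name here are hypothetical):

    hive (default)> create function my_len2 as "com.moe.hive.udf.MyUDF"
                  > using jar "hdfs://hadoop102:8020/udf/myudf.jar";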

5. Custom UDTF functions

  1. Requirement

    Create a custom UDTF that splits a string into individual words on an arbitrary separator, for example:

    hive(default)> select myudtf("hello,world,hadoop,hive", ",");
    hello
    world
    hadoop
    hive
  2. Code implementation

    import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    import org.apache.hadoop.hive.ql.metadata.HiveException;
    import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
    
    import java.util.ArrayList;
    import java.util.List;
    
    public class MyUDTF extends GenericUDTF {
    
        // A collection of output data
        private ArrayList<String> outPutList = new ArrayList<>();
    
        @Override
        public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
    
            // The default column name of the output data, which can be overridden by an alias
            List<String> fieldNames = new ArrayList<>();
            fieldNames.add("word");
    
            // The type of output data
            List<ObjectInspector> fieldOIs = new ArrayList<>();
            fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
    
            // The final return value
            return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
        }
    
        // Process the input data, e.g. "hello,moe,hive" with separator ","
        @Override
        public void process(Object[] args) throws HiveException {
    
            //1. Fetch the input data
            String input = args[0].toString();
    
            //2. Fetch the separator (the second argument) and split the string on it
            String splitKey = args[1].toString();
            String[] words = input.split(splitKey);
    
            //3. Iterate over the words and output each one as a row
            for (String word : words) {
    
                // Clear the collection
                outPutList.clear();
    
                // Put the data into the collection
                outPutList.add(word);
    
                // Output the data
                forward(outPutList);
            }
        }
    
        // Called once processing is finished; nothing to release here
        @Override
        public void close() throws HiveException {
        }
    }
  3. Upload the jar package to the server /opt/module/hive-3.1.2/datas/myudtf.jar

  4. Add the JAR package to the Hive classpath

    hive (default)> add jar /opt/module/hive-3.1.2/datas/myudtf.jar;
  5. Create temporary functions associated with developed Java classes

    hive (default)> create temporary function myudtf as "com.moe.hive.udtf.MyUDTF";
  6. Use custom functions

    hive (default)> select myudtf('hello,moe,hive',',');
    OK
    word
    hello
    moe
    hive

Two, Related links

Big data Hive learning journey 4

Big data Hive learning journey 3

Big data Hive learning journey 2

Big data Hive learning journey 1