Hive query 18 ways, you have learned all?

preface

You must be familiar with Hive! Hive is a data warehouse tool based on Hadoop that maps structured data files to a database table and provides SQL-like query (HQL) functions. Hive has the advantages of low learning cost and rapid MapReduce statistics using SQL statements, simplifying MapReduce without developing special MapReduce applications. Hive is therefore ideal for statistical analysis of data warehouses.

In this issue, we will explore 18 ways to query Hive data.

To prepare

In this issue, most HQL operations need to rely on the following two tables, and the specific data content is as follows:

course

student

1. SELECT query statement

SELECT (); SELECT ();

hive (hypers)> select name from student;
OK
name
Rose
Jack
Jimmy
Tom
Jerry
Copy the code

Columns and tables in query statements can be aliased as follows:

hive (hypers)> select t.name from student t;
OK
t.name
Rose
Jack
Jimmy
Tom
Jerry
Copy the code

You can use the following statement for nested queries:

hive (hypers)> select a.name, b.coursename
             > from (select stuid, name from student) a
             >          join (select stuid, coursename from course) b on a.stuid =b.stuid; OK A. Name B. Court sename Rose C Language Jack Java Jimmy Advanced Mathematics Tom Discrete mathematics Jerry C++
Copy the code

You can use regular expressions to specify the columns for the query, as follows:

hive (hypers)> select t.* from student t;
OK
t.stuid t.name  t.sex   t.age
15317408        Rose    1       21
15317412        Jack    0       20
15317432        Jimmy   1       21
15317423        Tom     1       20
15317478        Jerry   0       19
15317467        Alice   0       20
Copy the code

You can use LIMIT to LIMIT the number of results in a query as follows:

hive (hypers)> select * from student limit 1;
OK
student.stuid   student.name    student.sex     student.age
15317408        Rose    1       21
Copy the code

We can use the ORDER BY statement to sort the results. In ascending ORDER we can not add ASC after the sorted fields (default), but in reverse ORDER we need to specify DESC, as shown below:

hive (hypers)> select * from student order by age desc;
OK
student.stuid   student.name    student.sex     student.age
15317432        Jimmy   1       21
15317408        Rose    1       21
15317467        Alice   0       20
15317423        Tom     1       20
15317412        Jack    0       20
15317478        Jerry   0       19
Time taken: 10.631 seconds, Fetched: 5 row(s)

hive (hypers)> select * from student order by age;
OK
student.stuid   student.name    student.sex     student.age
15317478        Jerry   0       19
15317467        Alice   0       20
15317423        Tom     1       20
15317412        Jack    0       20
15317432        Jimmy   1       21
15317408        Rose    1       21
Copy the code

We can also use CASE… WHEN… The THEN statement processes the values of a column as follows:

hive (hypers)> SELECT stuid,
             >        name,
             >        age,
             >        sex,
             >        CASE
             >            WHEN sex = '1' THEN 'male'
             >            WHEN sex = '0' THEN 'woman'
             >            ELSE 'unknown'
             >            END
             >        FROM student;
OK
stuid   name    age     sex     _c4
15317408        Rose    21      1       男
15317412        Jack    20      0       女
15317432        Jimmy   21      1       男
15317423        Tom     20      1       男
15317478        Jerry   19      0       女
15317478        Alice   20      0       女
Copy the code

2, WHERE condition statement

The WHERE condition statement mainly conditions the query, as shown below:

hive (hypers)> select * from student where age = 21;
OK
student.stuid   student.name    student.sex     student.age
15317408        Rose    1       21
15317432        Jimmy   1       21
Copy the code

Common operators for WHERE conditional statements are shown in this table

The operator	Supported data types	instructions
A=B	Basic data types	Return true if A is equal to B, false otherwise
A<=>B	Basic data types	Return true if both A and B are NULL, otherwise the same as if A=B
< A > B, A! = B	Basic data types	If A or B is NULL, NULL is returned; Return true if A is not equal to B, false otherwise
A<B	Basic data types	If A or B is NULL, NULL is returned; Return true if A is less than B, false otherwise
A<=B	Basic data types	If A or B is NULL, NULL is returned; Returns true if A is less than or equal to B, false otherwise
A>B	Basic data types	If A or B is NULL, NULL is returned; Return true if A is greater than B, false otherwise
A>=B	Basic data types	If A or B is NULL, NULL is returned; Returns true if A is greater than or equal to B, false otherwise
A IS NULL	All data types	Returns true if A is NULL, false otherwise
A IS NOT NULL	All data types	Return true if A is not NULL, false otherwise
A BETWEEN B AND C	Basic data types	If A, B, or C is NULL, NULL is returned. Returns true if A is greater than or equal to B and less than or equal to C, false otherwise
A NOT BETWEEN B AND C	Basic data types	If A, B, or C is NULL, NULL is returned. Return true if A is less than B or A is greater than C, false otherwise
A LIKE B	Type STRING	Return true if A vaguely matches B, false otherwise
A NOT LIKE B	Type STRING	Return true if A does not fuzzily match B, false otherwise
A RLIKE B, A REGEXP B	Type STRING	B is A regular expression that returns true if A matches the regular expression and false otherwise

3, GROUP BY statement

The GROUP BY statement, which groups the data of a query, is usually used in conjunction with an aggregate function, as follows:

hive (hypers)> select sex,avg(age) from student group by sex;
OK
sex     _c1
0       19.666666666666668
1       20.666666666666668
Copy the code

4. HAVING statements

The HAVING statement is used to restrict the results of a GROUP BY statement, as follows:

hive (hypers)> select sex,avg(age) from student group by sex having avg(age) > 20;
OK
sex     _c1
1       20.666666666666668
Copy the code

INNER JOIN statement

In the INNER JOIN statement, data matching the JOIN condition is displayed in the result data only if there is data matching the JOIN condition in both tables, as follows:

hive (hypers)> select t1.name,t2.coursename from student t1 join course t2 on t1.stuid =t2.stuid; OK T1. name t2.coursename Rose C Language Jack Java Jimmy Advanced Mathematics Tom Discrete Mathematics Jerry C++
Copy the code

LEFT OUTER JOIN statement

The LEFT OUTER JOIN statement represents the LEFT OUTER JOIN. The LEFT OUTER JOIN query will contain all the records in the LEFT OUTER JOIN table, and the right OUTER JOIN will be NULL, as shown below:

hive (hypers)> select t1.name,t2.coursename from student t1 left outer join course t2 on t1.stuid =t2.stuid; OK T1. name t2.coursename Rose C Language Jack Java Jimmy Advanced Mathematics Tom Discrete Mathematics Jerry C++
Alice   NULL
Copy the code

RIGHT OUTER JOIN statement

Select * from RIGHT OUTER JOIN; select * from RIGHT OUTER JOIN; select * from RIGHT OUTER JOIN; select * from RIGHT OUTER JOIN;

hive (hypers)> select t1.name,t2.coursename from student t1 right outer join course t2 on t1.stuid =t2.stuid; OK T1. name t2.coursename Rose C Language Jack Java Jimmy Advanced Mathematics Tom Discrete Mathematics Jerry C++
NULLBig data application developmentCopy the code

OUTER JOIN statement FULL OUTER JOIN statement

Select * from FULL OUTER JOIN; select * from FULL OUTER JOIN; select * from FULL OUTER JOIN;

hive (hypers)> select t1.name,t2.coursename from student t1 FULL  outer join course t2 on t1.stuid =t2.stuid; OK t1.name t2.coursename Rose C language Jack Java Tom Discrete mathematics Jimmy Advanced mathematicsNULLBig data application development AliceNULL
Jerry   C++
Copy the code

LEFT SEMI JOIN statement

The LEFT SEMI JOIN statement represents the LEFT half JOIN, and the result data corresponding to the right table satisfies the conditions in the ON statement, as shown below:

hive (hypers)> select t1.name from student t1 LEFT SEMI JOIN course t2 on t1.stuid = t2.stuid;
OK
t1.name
Rose
Jack
Jimmy
Tom
Jerry
Copy the code

Note: | LEFT SEMI JOIN statement, SELECT and the WHERE clause cannot be referenced in the table right in the field. |

10, Cartesian product JOIN statement

The Cartesian product JOIN statement says that the number of rows in the left table multiplied by the number of rows in the right table equals the size of the result set, as follows:

hive (hypers)> select * from student join course;
OK
student.stuid   student.name    student.sex     student.age     course.stuid    course.coursename     course.score
15317408        Rose    1       21      15317408The C language50
15317412        Jack    0       20      15317408The C language50
15317432        Jimmy   1       21      15317408The C language50
15317423        Tom     1       20      15317408The C language50
15317478        Jerry   0       19      15317408The C language50
15317467        Alice   0       20      15317408The C language50
15317408        Rose    1       21      15317412        Java    60
15317412        Jack    0       20      15317412        Java    60
15317432        Jimmy   1       21      15317412        Java    60
15317423        Tom     1       20      15317412        Java    60
15317478        Jerry   0       19      15317412        Java    60
15317467        Alice   0       20      15317412        Java    60
15317408        Rose    1       21      15317432Higher mathematics70
15317412        Jack    0       20      15317432Higher mathematics70
15317432        Jimmy   1       21      15317432Higher mathematics70
15317423        Tom     1       20      15317432Higher mathematics70
15317478        Jerry   0       19      15317432Higher mathematics70
15317467        Alice   0       20      15317432Higher mathematics70
15317408        Rose    1       21      15317423Discrete mathematics80
15317412        Jack    0       20      15317423Discrete mathematics80
15317432        Jimmy   1       21      15317423Discrete mathematics80
15317423        Tom     1       20      15317423Discrete mathematics80
15317478        Jerry   0       19      15317423Discrete mathematics80
15317467        Alice   0       20      15317423Discrete mathematics80
15317408        Rose    1       21      15317478        C++     90
15317412        Jack    0       20      15317478        C++     90
15317432        Jimmy   1       21      15317478        C++     90
15317423        Tom     1       20      15317478        C++     90
15317478        Jerry   0       19      15317478        C++     90
15317467        Alice   0       20      15317478        C++     90
15317408        Rose    1       21      15317463Big data application development100
15317412        Jack    0       20      15317463Big data application development100
15317432        Jimmy   1       21      15317463Big data application development100
15317423        Tom     1       20      15317463Big data application development100
15317478        Jerry   0       19      15317463Big data application development100
15317467        Alice   0       20      15317463Big data application development100
Copy the code

Note: | if will Hive property Hive. Mapred. Mode set to strict, will prevent implementation of cartesian product queries. |

11, map-side JOIN statement

The map-site JOIN statement reads the small table into the memory during the map phase, and joins the small table directly on the map side. This connection must be explicitly declared in the query statement, as shown below:

SELECT /* + MapJOIN(t1) */ s1.stuid,s2.stuid  from student s1 JOIN student s2 ON s1.stuid = s2.stuid;
Copy the code

You can enable map-side join automatically by setting the Hive attribute hive.auto-convert. join=true. Can also set properties of the Hive Hive. Mapjoin. Smalltable. The filesize define the table size, the default is 25, 000, 000 B.

JOIN statement for multiple tables

Hive supports joining multiple tables. The following statements are used:

hive (hypers)> SELECT *
               FROM test1 t1
               JOIN test2 t2 ON t1.id = t2.id
               JOIN test3 t3 ON t2.id = t3.id
Copy the code

Each JOIN starts a MapReduce job. The first MapReduce job connects table Test1 to table Test2, and the second MapReduce job connects the output of the first MapReduce job to table Test3.

ORDER BY and SORT BY statements

The ORDER BY statement in Hive is similar to the SQL statement to sort result sets, as shown in the following:

hive (hypers)> select * from student order by age asc,stuId desc;
OK
student.stuid   student.name    student.sex     student.age
15317478        Jerry   0       19
15317467        Alice   0       20
15317423        Tom     1       20
15317412        Jack    0       20
15317432        Jimmy   1       21
15317408        Rose    1       21
Time taken: 11.929 seconds, Fetched: 6 row(s)
Copy the code

The above statement indicates that the age field is in ascending order and the stuId field is in descending order.

If there is too much data in the Hive table, sorting BY ORDER may take too long. In this case, you can set the Hive attribute hive.mapred. Mode to strict. To avoid the problem of long execution time caused by too much data, as shown below:

hive (hypers)> SET hive.mapred.mode = strict;
hive (hypers)> select * from student order by age asc,stuId desc limit 100;
OK
student.stuid   student.name    student.sex     student.age
15317478        Jerry   0       19
15317467        Alice   0       20
15317423        Tom     1       20
15317412        Jack    0       20
15317432        Jimmy   1       21
15317408        Rose    1       21
Time taken: 9.378 seconds, Fetched: 6 row(s)
Copy the code

The SORT BY statement sorts the data in each Reduce, ensuring that the output data of each Reduce is ordered (not necessarily globally ordered) and improving the performance of global sorting, as shown below:

hive (hypers)> select * from student sort by age asc,stuId desc limit 100;
OK
student.stuid   student.name    student.sex     student.age
15317478        Jerry   0       19
15317467        Alice   0       20
15317423        Tom     1       20
15317412        Jack    0       20
15317432        Jimmy   1       21
15317408        Rose    1       21
Copy the code

The above statement sorts the AGE field in ascending order and the CREATE_time field in descending order for each Reduce. If the number of Reduce is 1, the query results of the ORDER BY and SORT BY statements are the same. If the number of Reduce is greater than 1, the output of SORT BY is locally ordered.

DISTRIBUTE BY and SORT BY statements

DISTRIBUTE statement combined with SORT BY statement can SORT the data in the second column if the data in the first column is the same, as shown below:

hive (hypers)> select * from student distribute by sex sort by age,stuId;
OK
student.stuid   student.name    student.sex     student.age
15317478        Jerry   0       19
15317412        Jack    0       20
15317423        Tom     1       20
15317467        Alice   0       20
15317408        Rose    1       21
15317432        Jimmy   1       21
Copy the code

DISTRIBUTE BY statement can ensure that data with the same sex enter the same Reduce function, and then sort data BY age and stuId in Reduce to sort data in the second column when the first column is the same.

15, CLUSTER BY statement

If the columns in DISTRIBUTE BY and SORT BY statements are exactly the same and are sorted in ascending order, you can use the CLUSTER BY statement to replace the DISTRIBUTE BY and SORT BY statements, as follows:

select * from student distribute by age sort by age;
Copy the code

The above statement is equivalent to:

hive (hypers)> select * from student cluster by age;
OK
student.stuid   student.name    student.sex     student.age
15317478        Jerry   0       19
15317467        Alice   0       20
15317423        Tom     1       20
15317412        Jack    0       20
15317432        Jimmy   1       21
15317408        Rose    1       21
Copy the code

Type conversion

TYPE conversions can use the cast(value As TYPE) syntax, As shown below:

hive (hypers)> select * from student where cast(stuId AS INT) > = 15317450;
OK
student.stuid   student.name    student.sex     student.age
15317478        Jerry   0       19
15317467        Alice   0       20
Copy the code

The above statement converts the stuId to an INT.

Sample by barrel

Hive supports bucket sampling query, as shown in the following:

hive (hypers)> SELECT * FROM student TABLESAMPLE (BUCKET 2 OUT OF 6 ON stuid);
OK
student.stuid   student.name    student.sex     student.age
15317467        Alice   0       20
Copy the code

The preceding statement indicates that six buckets are queried. The second bucket is divided by the remainder of dividing the hash value of the ID value by the number of buckets 6. Random sampling can also be adopted, as shown below:

hive (hypers)> SELECT * FROM student TABLESAMPLE (BUCKET 2 OUT OF 6 ON RAND());
OK
student.stuid   student.name    student.sex     student.age
15317478        Jerry   0       19
Time taken: 0.04 seconds, Fetched: 1 row(s)
Copy the code

You can specify buckets when creating a table. Set the Hive. Enforce. Bucketing property of Hive to true. This property can be configured in the hive-site.xml file, as follows:

<property>
        <name>hive.enforce.bucketing</name>
        <value>true</value>
</property>
Copy the code

You can also set the Hive command line interface (CLI) as follows:

hive (default)> SET hive.enforce.bucketing = true;
Copy the code

Select * from student; select * from student; select * from student;

hive (hypers)> CREATE TABLE test_bucket(id INT) CLUSTERED BY (id) INTO 3 BUCKETS ;
OK
Time taken: 0.086 seconds
hive (hypers)> INSERT OVERWRITE TABLE test_bucket SELECT stuid FROM student;
OK
stuid
Time taken: 24.261 seconds
Copy the code

The above statement first creates a test_bucket table, splits the test_bucket table into three buckets, and inserts the ID column data from the student table into the test_bucket table. The inserted data is stored in three files, one for each bucket, under the test_bucket table path.

18, UNION ALL statement

Hive supports UNION ALL query, which is mainly used to merge data of multiple tables. Use the UNION ALL statement to ask each table to query the field type must match exactly, as shown below:

SELECT t.id,t.name
FROM (
SELECT t1.id,t1.name FROM test1 t1
UNION ALL
SELECT t2.id,t2.name FROM test2 t2
UNION ALL
SELECT t3.id,t3.name FROM test3 t3
) t
Copy the code

Note: | using UNION ALL statements in the Hive, you must use a nested query. |

summary

Before Hadoop came along, there was a lot of focus on data analysis based on relational databases. With the emergence of Hadoop and the development of big data, people began to realize the advantages of using Hadoop to analyze massive data. The more you know, the more you don’t know. The more you know, the more you don’t know. I’m Alice, and I’ll see you next time!

Benefit of friends remember three even support small bacteria!

The article continues to update, you can search wechat “big data dreamers” for the first time to read, mind mapping, big data books, big data high-frequency interview questions, a large number of first-line big factory face… Looking forward to your attention!