preface
You must be familiar with Hive! Hive is a data warehouse tool based on Hadoop that maps structured data files to a database table and provides SQL-like query (HQL) functions. Hive has the advantages of low learning cost and rapid MapReduce statistics using SQL statements, simplifying MapReduce without developing special MapReduce applications. Hive is therefore ideal for statistical analysis of data warehouses.
In this issue, we will explore 18 ways to query Hive data.
To prepare
In this issue, most HQL operations need to rely on the following two tables, and the specific data content is as follows:
course
student
1. SELECT query statement
SELECT (); SELECT ();
hive (hypers)> select name from student;
OK
name
Rose
Jack
Jimmy
Tom
Jerry
Copy the code
Columns and tables in query statements can be aliased as follows:
hive (hypers)> select t.name from student t;
OK
t.name
Rose
Jack
Jimmy
Tom
Jerry
Copy the code
You can use the following statement for nested queries:
hive (hypers)> select a.name, b.coursename
> from (select stuid, name from student) a
> join (select stuid, coursename from course) b on a.stuid =b.stuid; OK A. Name B. Court sename Rose C Language Jack Java Jimmy Advanced Mathematics Tom Discrete mathematics Jerry C++
Copy the code
You can use regular expressions to specify the columns for the query, as follows:
hive (hypers)> select t.* from student t;
OK
t.stuid t.name t.sex t.age
15317408 Rose 1 21
15317412 Jack 0 20
15317432 Jimmy 1 21
15317423 Tom 1 20
15317478 Jerry 0 19
15317467 Alice 0 20
Copy the code
You can use LIMIT to LIMIT the number of results in a query as follows:
hive (hypers)> select * from student limit 1;
OK
student.stuid student.name student.sex student.age
15317408 Rose 1 21
Copy the code
We can use the ORDER BY statement to sort the results. In ascending ORDER we can not add ASC after the sorted fields (default), but in reverse ORDER we need to specify DESC, as shown below:
hive (hypers)> select * from student order by age desc;
OK
student.stuid student.name student.sex student.age
15317432 Jimmy 1 21
15317408 Rose 1 21
15317467 Alice 0 20
15317423 Tom 1 20
15317412 Jack 0 20
15317478 Jerry 0 19
Time taken: 10.631 seconds, Fetched: 5 row(s)
hive (hypers)> select * from student order by age;
OK
student.stuid student.name student.sex student.age
15317478 Jerry 0 19
15317467 Alice 0 20
15317423 Tom 1 20
15317412 Jack 0 20
15317432 Jimmy 1 21
15317408 Rose 1 21
Copy the code
We can also use CASE… WHEN… The THEN statement processes the values of a column as follows:
hive (hypers)> SELECT stuid,
> name,
> age,
> sex,
> CASE
> WHEN sex = '1' THEN 'male'
> WHEN sex = '0' THEN 'woman'
> ELSE 'unknown'
> END
> FROM student;
OK
stuid name age sex _c4
15317408 Rose 21 1 男
15317412 Jack 20 0 女
15317432 Jimmy 21 1 男
15317423 Tom 20 1 男
15317478 Jerry 19 0 女
15317478 Alice 20 0 女
Copy the code
2, WHERE condition statement
The WHERE condition statement mainly conditions the query, as shown below:
hive (hypers)> select * from student where age = 21;
OK
student.stuid student.name student.sex student.age
15317408 Rose 1 21
15317432 Jimmy 1 21
Copy the code
Common operators for WHERE conditional statements are shown in this table
The operator | Supported data types | instructions |
---|---|---|
A=B | Basic data types | Return true if A is equal to B, false otherwise |
A<=>B | Basic data types | Return true if both A and B are NULL, otherwise the same as if A=B |
< A > B, A! = B | Basic data types | If A or B is NULL, NULL is returned; Return true if A is not equal to B, false otherwise |
A<B | Basic data types | If A or B is NULL, NULL is returned; Return true if A is less than B, false otherwise |
A<=B | Basic data types | If A or B is NULL, NULL is returned; Returns true if A is less than or equal to B, false otherwise |
A>B | Basic data types | If A or B is NULL, NULL is returned; Return true if A is greater than B, false otherwise |
A>=B | Basic data types | If A or B is NULL, NULL is returned; Returns true if A is greater than or equal to B, false otherwise |
A IS NULL | All data types | Returns true if A is NULL, false otherwise |
A IS NOT NULL | All data types | Return true if A is not NULL, false otherwise |
A BETWEEN B AND C | Basic data types | If A, B, or C is NULL, NULL is returned. Returns true if A is greater than or equal to B and less than or equal to C, false otherwise |
A NOT BETWEEN B AND C | Basic data types | If A, B, or C is NULL, NULL is returned. Return true if A is less than B or A is greater than C, false otherwise |
A LIKE B | Type STRING | Return true if A vaguely matches B, false otherwise |
A NOT LIKE B | Type STRING | Return true if A does not fuzzily match B, false otherwise |
A RLIKE B, A REGEXP B | Type STRING | B is A regular expression that returns true if A matches the regular expression and false otherwise |
3, GROUP BY statement
The GROUP BY statement, which groups the data of a query, is usually used in conjunction with an aggregate function, as follows:
hive (hypers)> select sex,avg(age) from student group by sex;
OK
sex _c1
0 19.666666666666668
1 20.666666666666668
Copy the code
4. HAVING statements
The HAVING statement is used to restrict the results of a GROUP BY statement, as follows:
hive (hypers)> select sex,avg(age) from student group by sex having avg(age) > 20;
OK
sex _c1
1 20.666666666666668
Copy the code
INNER JOIN statement
In the INNER JOIN statement, data matching the JOIN condition is displayed in the result data only if there is data matching the JOIN condition in both tables, as follows:
hive (hypers)> select t1.name,t2.coursename from student t1 join course t2 on t1.stuid =t2.stuid; OK T1. name t2.coursename Rose C Language Jack Java Jimmy Advanced Mathematics Tom Discrete Mathematics Jerry C++
Copy the code
LEFT OUTER JOIN statement
The LEFT OUTER JOIN statement represents the LEFT OUTER JOIN. The LEFT OUTER JOIN query will contain all the records in the LEFT OUTER JOIN table, and the right OUTER JOIN will be NULL, as shown below:
hive (hypers)> select t1.name,t2.coursename from student t1 left outer join course t2 on t1.stuid =t2.stuid; OK T1. name t2.coursename Rose C Language Jack Java Jimmy Advanced Mathematics Tom Discrete Mathematics Jerry C++
Alice NULL
Copy the code
RIGHT OUTER JOIN statement
Select * from RIGHT OUTER JOIN; select * from RIGHT OUTER JOIN; select * from RIGHT OUTER JOIN; select * from RIGHT OUTER JOIN;
hive (hypers)> select t1.name,t2.coursename from student t1 right outer join course t2 on t1.stuid =t2.stuid; OK T1. name t2.coursename Rose C Language Jack Java Jimmy Advanced Mathematics Tom Discrete Mathematics Jerry C++
NULLBig data application developmentCopy the code
OUTER JOIN statement FULL OUTER JOIN statement
Select * from FULL OUTER JOIN; select * from FULL OUTER JOIN; select * from FULL OUTER JOIN;
hive (hypers)> select t1.name,t2.coursename from student t1 FULL outer join course t2 on t1.stuid =t2.stuid; OK t1.name t2.coursename Rose C language Jack Java Tom Discrete mathematics Jimmy Advanced mathematicsNULLBig data application development AliceNULL
Jerry C++
Copy the code
LEFT SEMI JOIN statement
The LEFT SEMI JOIN statement represents the LEFT half JOIN, and the result data corresponding to the right table satisfies the conditions in the ON statement, as shown below:
hive (hypers)> select t1.name from student t1 LEFT SEMI JOIN course t2 on t1.stuid = t2.stuid;
OK
t1.name
Rose
Jack
Jimmy
Tom
Jerry
Copy the code
Note: | LEFT SEMI JOIN statement, SELECT and the WHERE clause cannot be referenced in the table right in the field. |
10, Cartesian product JOIN statement
The Cartesian product JOIN statement says that the number of rows in the left table multiplied by the number of rows in the right table equals the size of the result set, as follows:
hive (hypers)> select * from student join course;
OK
student.stuid student.name student.sex student.age course.stuid course.coursename course.score
15317408 Rose 1 21 15317408The C language50
15317412 Jack 0 20 15317408The C language50
15317432 Jimmy 1 21 15317408The C language50
15317423 Tom 1 20 15317408The C language50
15317478 Jerry 0 19 15317408The C language50
15317467 Alice 0 20 15317408The C language50
15317408 Rose 1 21 15317412 Java 60
15317412 Jack 0 20 15317412 Java 60
15317432 Jimmy 1 21 15317412 Java 60
15317423 Tom 1 20 15317412 Java 60
15317478 Jerry 0 19 15317412 Java 60
15317467 Alice 0 20 15317412 Java 60
15317408 Rose 1 21 15317432Higher mathematics70
15317412 Jack 0 20 15317432Higher mathematics70
15317432 Jimmy 1 21 15317432Higher mathematics70
15317423 Tom 1 20 15317432Higher mathematics70
15317478 Jerry 0 19 15317432Higher mathematics70
15317467 Alice 0 20 15317432Higher mathematics70
15317408 Rose 1 21 15317423Discrete mathematics80
15317412 Jack 0 20 15317423Discrete mathematics80
15317432 Jimmy 1 21 15317423Discrete mathematics80
15317423 Tom 1 20 15317423Discrete mathematics80
15317478 Jerry 0 19 15317423Discrete mathematics80
15317467 Alice 0 20 15317423Discrete mathematics80
15317408 Rose 1 21 15317478 C++ 90
15317412 Jack 0 20 15317478 C++ 90
15317432 Jimmy 1 21 15317478 C++ 90
15317423 Tom 1 20 15317478 C++ 90
15317478 Jerry 0 19 15317478 C++ 90
15317467 Alice 0 20 15317478 C++ 90
15317408 Rose 1 21 15317463Big data application development100
15317412 Jack 0 20 15317463Big data application development100
15317432 Jimmy 1 21 15317463Big data application development100
15317423 Tom 1 20 15317463Big data application development100
15317478 Jerry 0 19 15317463Big data application development100
15317467 Alice 0 20 15317463Big data application development100
Copy the code
Note: | if will Hive property Hive. Mapred. Mode set to strict, will prevent implementation of cartesian product queries. |
11, map-side JOIN statement
The map-site JOIN statement reads the small table into the memory during the map phase, and joins the small table directly on the map side. This connection must be explicitly declared in the query statement, as shown below:
SELECT /* + MapJOIN(t1) */ s1.stuid,s2.stuid from student s1 JOIN student s2 ON s1.stuid = s2.stuid;
Copy the code
You can enable map-side join automatically by setting the Hive attribute hive.auto-convert. join=true. Can also set properties of the Hive Hive. Mapjoin. Smalltable. The filesize define the table size, the default is 25, 000, 000 B.
JOIN statement for multiple tables
Hive supports joining multiple tables. The following statements are used:
hive (hypers)> SELECT *
FROM test1 t1
JOIN test2 t2 ON t1.id = t2.id
JOIN test3 t3 ON t2.id = t3.id
Copy the code
Each JOIN starts a MapReduce job. The first MapReduce job connects table Test1 to table Test2, and the second MapReduce job connects the output of the first MapReduce job to table Test3.
ORDER BY and SORT BY statements
The ORDER BY statement in Hive is similar to the SQL statement to sort result sets, as shown in the following:
hive (hypers)> select * from student order by age asc,stuId desc;
OK
student.stuid student.name student.sex student.age
15317478 Jerry 0 19
15317467 Alice 0 20
15317423 Tom 1 20
15317412 Jack 0 20
15317432 Jimmy 1 21
15317408 Rose 1 21
Time taken: 11.929 seconds, Fetched: 6 row(s)
Copy the code
The above statement indicates that the age field is in ascending order and the stuId field is in descending order.
If there is too much data in the Hive table, sorting BY ORDER may take too long. In this case, you can set the Hive attribute hive.mapred. Mode to strict. To avoid the problem of long execution time caused by too much data, as shown below:
hive (hypers)> SET hive.mapred.mode = strict;
hive (hypers)> select * from student order by age asc,stuId desc limit 100;
OK
student.stuid student.name student.sex student.age
15317478 Jerry 0 19
15317467 Alice 0 20
15317423 Tom 1 20
15317412 Jack 0 20
15317432 Jimmy 1 21
15317408 Rose 1 21
Time taken: 9.378 seconds, Fetched: 6 row(s)
Copy the code
The SORT BY statement sorts the data in each Reduce, ensuring that the output data of each Reduce is ordered (not necessarily globally ordered) and improving the performance of global sorting, as shown below:
hive (hypers)> select * from student sort by age asc,stuId desc limit 100;
OK
student.stuid student.name student.sex student.age
15317478 Jerry 0 19
15317467 Alice 0 20
15317423 Tom 1 20
15317412 Jack 0 20
15317432 Jimmy 1 21
15317408 Rose 1 21
Copy the code
The above statement sorts the AGE field in ascending order and the CREATE_time field in descending order for each Reduce. If the number of Reduce is 1, the query results of the ORDER BY and SORT BY statements are the same. If the number of Reduce is greater than 1, the output of SORT BY is locally ordered.
DISTRIBUTE BY and SORT BY statements
DISTRIBUTE statement combined with SORT BY statement can SORT the data in the second column if the data in the first column is the same, as shown below:
hive (hypers)> select * from student distribute by sex sort by age,stuId;
OK
student.stuid student.name student.sex student.age
15317478 Jerry 0 19
15317412 Jack 0 20
15317423 Tom 1 20
15317467 Alice 0 20
15317408 Rose 1 21
15317432 Jimmy 1 21
Copy the code
DISTRIBUTE BY statement can ensure that data with the same sex enter the same Reduce function, and then sort data BY age and stuId in Reduce to sort data in the second column when the first column is the same.
15, CLUSTER BY statement
If the columns in DISTRIBUTE BY and SORT BY statements are exactly the same and are sorted in ascending order, you can use the CLUSTER BY statement to replace the DISTRIBUTE BY and SORT BY statements, as follows:
select * from student distribute by age sort by age;
Copy the code
The above statement is equivalent to:
hive (hypers)> select * from student cluster by age;
OK
student.stuid student.name student.sex student.age
15317478 Jerry 0 19
15317467 Alice 0 20
15317423 Tom 1 20
15317412 Jack 0 20
15317432 Jimmy 1 21
15317408 Rose 1 21
Copy the code
Type conversion
TYPE conversions can use the cast(value As TYPE) syntax, As shown below:
hive (hypers)> select * from student where cast(stuId AS INT) > = 15317450;
OK
student.stuid student.name student.sex student.age
15317478 Jerry 0 19
15317467 Alice 0 20
Copy the code
The above statement converts the stuId to an INT.
Sample by barrel
Hive supports bucket sampling query, as shown in the following:
hive (hypers)> SELECT * FROM student TABLESAMPLE (BUCKET 2 OUT OF 6 ON stuid);
OK
student.stuid student.name student.sex student.age
15317467 Alice 0 20
Copy the code
The preceding statement indicates that six buckets are queried. The second bucket is divided by the remainder of dividing the hash value of the ID value by the number of buckets 6. Random sampling can also be adopted, as shown below:
hive (hypers)> SELECT * FROM student TABLESAMPLE (BUCKET 2 OUT OF 6 ON RAND());
OK
student.stuid student.name student.sex student.age
15317478 Jerry 0 19
Time taken: 0.04 seconds, Fetched: 1 row(s)
Copy the code
You can specify buckets when creating a table. Set the Hive. Enforce. Bucketing property of Hive to true. This property can be configured in the hive-site.xml file, as follows:
<property>
<name>hive.enforce.bucketing</name>
<value>true</value>
</property>
Copy the code
You can also set the Hive command line interface (CLI) as follows:
hive (default)> SET hive.enforce.bucketing = true;
Copy the code
Select * from student; select * from student; select * from student;
hive (hypers)> CREATE TABLE test_bucket(id INT) CLUSTERED BY (id) INTO 3 BUCKETS ;
OK
Time taken: 0.086 seconds
hive (hypers)> INSERT OVERWRITE TABLE test_bucket SELECT stuid FROM student;
OK
stuid
Time taken: 24.261 seconds
Copy the code
The above statement first creates a test_bucket table, splits the test_bucket table into three buckets, and inserts the ID column data from the student table into the test_bucket table. The inserted data is stored in three files, one for each bucket, under the test_bucket table path.
18, UNION ALL statement
Hive supports UNION ALL query, which is mainly used to merge data of multiple tables. Use the UNION ALL statement to ask each table to query the field type must match exactly, as shown below:
SELECT t.id,t.name
FROM (
SELECT t1.id,t1.name FROM test1 t1
UNION ALL
SELECT t2.id,t2.name FROM test2 t2
UNION ALL
SELECT t3.id,t3.name FROM test3 t3
) t
Copy the code
Note: | using UNION ALL statements in the Hive, you must use a nested query. |
summary
Before Hadoop came along, there was a lot of focus on data analysis based on relational databases. With the emergence of Hadoop and the development of big data, people began to realize the advantages of using Hadoop to analyze massive data. The more you know, the more you don’t know. The more you know, the more you don’t know. I’m Alice, and I’ll see you next time!
Benefit of friends remember three even support small bacteria!
The article continues to update, you can search wechat “big data dreamers” for the first time to read, mind mapping, big data books, big data high-frequency interview questions, a large number of first-line big factory face… Looking forward to your attention!