1. Data preparation
To demonstrate the query operations, we first need to create three tables and load the test data.
The data files emp.txt and dept.txt can be downloaded from the Resources directory of this repository.
1.1 The employee table
-- Table creation statement
CREATE TABLE emp(
    empno INT,            -- Employee number
    ename STRING,         -- Employee name
    job STRING,           -- Job type
    mgr INT,              -- Manager's employee number
    hiredate TIMESTAMP,   -- Hire date
    sal DECIMAL(7,2),     -- Salary
    comm DECIMAL(7,2),    -- Commission
    deptno INT)           -- Department number
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t";
-- Load data
LOAD DATA LOCAL INPATH "/usr/file/emp.txt" OVERWRITE INTO TABLE emp;
1.2 The department table
-- Table creation statement
CREATE TABLE dept(
    deptno INT,   -- Department number
    dname STRING, -- Department name
    loc STRING    -- City where the department is located
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t";
-- Load data
LOAD DATA LOCAL INPATH "/usr/file/dept.txt" OVERWRITE INTO TABLE dept;
1.3 The partition table
We also create a partition table, mainly to demonstrate partition queries:
CREATE EXTERNAL TABLE emp_ptn(
empno INT,
ename STRING,
job STRING,
mgr INT,
hiredate TIMESTAMP,
sal DECIMAL(7,2),
comm DECIMAL(7,2)
)
PARTITIONED BY (deptno INT) -- By department number
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t";
-- Load data
LOAD DATA LOCAL INPATH "/usr/file/emp.txt" OVERWRITE INTO TABLE emp_ptn PARTITION (deptno=20);
LOAD DATA LOCAL INPATH "/usr/file/emp.txt" OVERWRITE INTO TABLE emp_ptn PARTITION (deptno=30);
LOAD DATA LOCAL INPATH "/usr/file/emp.txt" OVERWRITE INTO TABLE emp_ptn PARTITION (deptno=40);
LOAD DATA LOCAL INPATH "/usr/file/emp.txt" OVERWRITE INTO TABLE emp_ptn PARTITION (deptno=50);
2. Single-table queries
2.1 SELECT
-- Query all data in the table
SELECT * FROM emp;
2.2 WHERE
-- Query employees in department 10 whose employee number is greater than 7782
SELECT * FROM emp WHERE empno > 7782 AND deptno = 10;
2.3 DISTINCT
Hive supports the DISTINCT keyword for deduplication.
-- Query all job types
SELECT DISTINCT job FROM emp;
2.4 Partition queries
You can query a specific partition or a range of partitions.
-- Query employees in the partition table whose department number is in [20, 40]
SELECT emp_ptn.* FROM emp_ptn
WHERE emp_ptn.deptno >= 20 AND emp_ptn.deptno <= 40;
2.5 LIMIT
-- Query the top 5 highest-paid employees
SELECT * FROM emp ORDER BY sal DESC LIMIT 5;
2.6 GROUP BY
Hive supports aggregation with GROUP BY.
set hive.map.aggr=true;
-- Query the total salary of each department
SELECT deptno,SUM(sal) FROM emp GROUP BY deptno;
hive.map.aggr controls how aggregation is performed. The default is false; if set to true, Hive performs a first round of aggregation during the map phase. This makes aggregation more efficient but consumes more memory.
2.7 ORDER BY and SORT BY
Query results can be sorted with either ORDER BY or SORT BY. The sort field may be an integer or a string: integers sort numerically, strings lexicographically. ORDER BY and SORT BY differ as follows:
- ORDER BY sorts all query results in a single reducer, which guarantees a global order.
- SORT BY sorts within each reducer, which guarantees that each reducer's output is ordered, but not that the overall result is.
Because ORDER BY can take a long time, strict mode (hive.mapred.mode = strict) requires it to be followed by a LIMIT clause.
Note: hive.mapred.mode defaults to nonstrict, that is, non-strict mode.
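A sketch of the strict-mode restriction, assuming the emp table created earlier:

```sql
SET hive.mapred.mode = strict;
-- Fails in strict mode: ORDER BY without LIMIT
-- SELECT * FROM emp ORDER BY sal DESC;
-- Succeeds: the LIMIT bounds the single-reducer sort
SELECT * FROM emp ORDER BY sal DESC LIMIT 10;
SET hive.mapred.mode = nonstrict;
```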
-- Query employee salaries, ordered by department ascending and salary descending
SELECT empno, deptno, sal FROM emp ORDER BY deptno ASC, sal DESC;
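By contrast, a minimal SORT BY sketch: each reducer's output is sorted, but the combined result is not globally ordered (the reducer count setting is shown only to make the difference visible).

```sql
-- Use two reducers so the per-reducer ordering is observable
SET mapreduce.job.reduces = 2;
-- Each reducer emits rows sorted by sal, but the overall output is not globally sorted
SELECT empno, deptno, sal FROM emp SORT BY sal DESC;
```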
2.8 HAVING
You can use HAVING to filter grouped data.
-- Query all departments whose total salary is greater than 9000
SELECT deptno,SUM(sal) FROM emp GROUP BY deptno HAVING SUM(sal)>9000;
2.9 DISTRIBUTE BY
If you want rows with the same key to be sent to the same reducer for processing, use DISTRIBUTE BY. Note that although DISTRIBUTE BY routes rows with the same key to the same reducer, it does not guarantee that the data is ordered within each reducer. Consider the following:
Suppose the following five rows are distributed to two reducers:
k1
k2
k4
k3
k1
Reducer1 receives the following unordered data:
k1
k2
k1
Reducer2 receives the following data:
k4
k3
If you want the data ordered within each reducer, add SORT BY as shown below, or use CLUSTER BY, which is introduced next.
-- Distribute data to the Reducer by department
SELECT empno, deptno, sal FROM emp DISTRIBUTE BY deptno SORT BY deptno ASC;
2.10 CLUSTER BY
If SORT BY and DISTRIBUTE BY specify the same field and the sort order is ASC, CLUSTER BY can replace the pair. Note that CLUSTER BY guarantees order within each reducer, not a global order, unless the job runs with a single reducer.
SELECT empno, deptno, sal FROM emp CLUSTER BY deptno;
3. Multi-table join queries
Hive supports inner joins, left/right/full outer joins, and Cartesian joins. These concepts are consistent with their traditional database counterparts.
Note that the join condition must be specified with ON, not WHERE; otherwise the Cartesian product is computed first and then filtered, which can produce unexpected results (section 3.3 shows an example).
3.1 INNER JOIN
-- Query the details of the employee whose id is 7369
SELECT e.*,d.* FROM
emp e JOIN dept d
ON e.deptno = d.deptno
WHERE empno=7369;
-- Joining three or more tables uses the following syntax:
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1);
3.2 LEFT OUTER JOIN
LEFT OUTER JOIN and LEFT JOIN are equivalent.
-- Left join
SELECT e.*,d.*
FROM emp e LEFT OUTER JOIN dept d
ON e.deptno = d.deptno;
3.3 RIGHT OUTER JOIN
-- Right join
SELECT e.*,d.*
FROM emp e RIGHT OUTER JOIN dept d
ON e.deptno = d.deptno;
After the right join runs, the employee columns are NULL for department 40, which has no employees. This result reinforces the earlier point that join conditions must be specified with ON, not WHERE: if the condition is moved into WHERE, department 40 disappears from the result entirely, because no row of the Cartesian product pairs an employee with department 40, so no (NULL, 40) row survives the filter.
3.4 FULL OUTER JOIN
SELECT e.*,d.*
FROM emp e FULL OUTER JOIN dept d
ON e.deptno = d.deptno;
3.5 LEFT SEMI JOIN
LEFT SEMI JOIN is a more efficient implementation of an IN/EXISTS subquery. Its restrictions:
- The right-hand table can only be filtered in the ON clause.
- The result contains only data from the left table, so only columns of the left table may be selected.
-- Query information about all employees who work in New York
SELECT emp.*
FROM emp LEFT SEMI JOIN dept
ON emp.deptno = dept.deptno AND dept.loc="NEW YORK";
-- The above statement is equivalent to:
SELECT emp.* FROM emp
WHERE emp.deptno IN (SELECT deptno FROM dept WHERE loc="NEW YORK");
3.6 Cartesian JOIN
Cartesian joins are rarely useful and very expensive, so Hive blocks them in strict mode (hive.mapred.mode = strict).
SELECT * FROM emp JOIN dept;
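As of Hive 0.10, the same intent can be written with explicit CROSS JOIN syntax, a sketch that behaves identically to the JOIN-without-ON form above:

```sql
-- Explicit Cartesian product: every row of emp is paired with every row of dept
SELECT * FROM emp CROSS JOIN dept;
```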
4. JOIN optimization
4.1 STREAMTABLE
When joining multiple tables, if every ON clause uses the same join column (b.key in the example below), Hive optimizes the query to perform all joins in a single Map/Reduce job. Hive also assumes that the last table in the query (table c below) is the largest: while joining each row, it tries to cache the other tables and stream the last one through. Users should therefore list tables in increasing order of size, left to right.
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key) JOIN c ON (c.key = b.key);
If you would rather not always place the largest table last, Hive provides the /*+ STREAMTABLE(tbl) */ hint to identify the largest table, as in the following example:
SELECT /*+ STREAMTABLE(d) */ e.*,d.*
FROM emp e JOIN dept d
ON e.deptno = d.deptno
WHERE job='CLERK';
4.2 MAPJOIN
If all but one of the joined tables are small, Hive can load the small table into memory; the map tasks then match rows of the other table directly against the in-memory table during the map phase. Because the join happens in the map phase, the reduce phase is skipped, which greatly improves efficiency. Hive provides the /*+ MAPJOIN(tbl) */ hint to mark the small table, as in the following example:
SELECT /*+ MAPJOIN(d) */ e.*,d.*
FROM emp e JOIN dept d
ON e.deptno = d.deptno
WHERE job='CLERK';
5. Other uses of SELECT
View the current database:
SELECT current_database();
6. Local mode
Most of the preceding statements trigger MapReduce jobs; a few do not, for example SELECT * FROM emp LIMIT 5, where Hive simply reads the data files and formats the output. For queries that do require MapReduce, execution can take a long time even on small data, so you can enable local mode.
-- Local mode is disabled by default and must be enabled manually
SET hive.exec.mode.local.auto=true;
With local mode enabled, Hive analyzes the size of each map-reduce job in a query and runs it locally when all of the following hold:
- The job's total input size is below hive.exec.mode.local.auto.inputbytes.max (default 128 MB);
- The total number of map tasks is less than hive.exec.mode.local.auto.tasks.max (default 4);
- The total number of reduce tasks required is 1 or 0.
Since our test data set is small, rerunning the earlier queries that involve MapReduce should show a noticeable speedup.
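A sketch of enabling local mode together with its related thresholds (the values shown are the defaults described above):

```sql
SET hive.exec.mode.local.auto = true;
SET hive.exec.mode.local.auto.inputbytes.max = 134217728; -- 128 MB
SET hive.exec.mode.local.auto.tasks.max = 4;
-- This small aggregation should now run locally rather than on the cluster
SELECT deptno, COUNT(*) FROM emp GROUP BY deptno;
```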
References
- LanguageManual Select
- LanguageManual Joins
- LanguageManual GroupBy
- LanguageManual SortBy
For more articles in the big data series, see the GitHub open source project: Getting Started with Big Data.