1. Data preparation

To demonstrate the query operation, you need to pre-create three tables and load the test data.

The data files emp.txt and dept.txt can be downloaded from the Resources directory of this repository.

1.1 the employee table

 -- Build a predicate sentence
 CREATE TABLE emp(
     empno INT.-- Employee list Number
     ename STRING.-- Employee name
     job STRING.-- Job type
     mgr INT,   
     hiredate TIMESTAMP.-- Date of employment
     sal DECIMAL(7.2),  Wages,
     comm DECIMAL(7.2),
     deptno INT)   -- Department No.
    ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t";

  Load data
LOAD DATA LOCAL INPATH "/usr/file/emp.txt" OVERWRITE INTO TABLE emp;
Copy the code

Table 1.2 department

 -- Build a predicate sentence
 CREATE TABLE dept(
     deptno INT.-- Department No.
     dname STRING.-- Department Name
     loc STRING    -- The city where the department is located
 )
 ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t";
 
 Load data
 LOAD DATA LOCAL INPATH "/usr/file/dept.txt" OVERWRITE INTO TABLE dept;
Copy the code

1.3 the partition table

Here we need to create an additional partition table, mainly to demonstrate partition queries:

CREATE EXTERNAL TABLE emp_ptn(
      empno INT,
      ename STRING,
      job STRING,
      mgr INT,
      hiredate TIMESTAMP,
      sal DECIMAL(7.2),
      comm DECIMAL(7.2)
  )
 PARTITIONED BY (deptno INT)   -- By department number
 ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t";


Load data
LOAD DATA LOCAL INPATH "/usr/file/emp.txt" OVERWRITE INTO TABLE emp_ptn PARTITION (deptno=20)
LOAD DATA LOCAL INPATH "/usr/file/emp.txt" OVERWRITE INTO TABLE emp_ptn PARTITION (deptno=30)
LOAD DATA LOCAL INPATH "/usr/file/emp.txt" OVERWRITE INTO TABLE emp_ptn PARTITION (deptno=40)
LOAD DATA LOCAL INPATH "/usr/file/emp.txt" OVERWRITE INTO TABLE emp_ptn PARTITION (deptno=50)
Copy the code

Two, single table query

2.1 SELECT

Query all data in the table
SELECT * FROM emp;
Copy the code

2.2 the WHERE

Query the information about the employee whose id is greater than 7782 in department 10
SELECT * FROM emp WHERE empno > 7782 AND deptno = 10;
Copy the code

2.3 the DISTINCT

Hive supports the DISTINCT keyword for deduplication.

-- Query all job types
SELECT DISTINCT job FROM emp;
Copy the code

2.4 Partition Query

Partition Based Queries. You can specify a Partition or a Partition range.

Query the employees whose department ids are between [20,40] in the partition table
SELECT emp_ptn.* FROM emp_ptn
WHERE emp_ptn.deptno >= 20 AND emp_ptn.deptno <= 40;
Copy the code

2.5 LIMIT

-- Query the top 5 highest-paid employees
SELECT * FROM emp ORDER BY sal DESC LIMIT 5;
Copy the code

2.6 GROUP BY

Hive supports GROUP aggregation using GROUP BY.

set hive.map.aggr=true;

Query the comprehensive salary of each department
SELECT deptno,SUM(sal) FROM emp GROUP BY deptno;
Copy the code

Hive.map. aggr controls how the aggregation process is performed. The default value is false. If set to true, Hive performs aggregation once during the Map phase. This makes aggregation more efficient, but consumes more memory.

2.7 the ORDER AND SORT

You can Sort the query results using either ORDER BY or Sort BY. The Sort fields can be integers or strings: if integer, Sort BY size; If it is a string, it is sorted lexicographically. ORDER BY and SORT BY differ as follows:

  • When ORDER BY is used, there is a Reducer to sort all query results, which can ensure the global ORDER of data.
  • When SORT BY is used, only SORT is done in each Reducer, which can ensure that the output data of each Reducer is in order, but not global order.

Since the ORDER BY may take a long time, if you set strict mode (hive.mapred.mode = strict), it must be followed BY a limit clause.

Note: Hive.mapred. mode defaults to nonstrict, which means non-strict mode.

Query employee salary, result by department ascending, by salary descending
SELECT empno, deptno, sal FROM emp ORDER BY deptno ASC, sal DESC;
Copy the code

2.8 HAVING

You can use HAVING to filter grouped data.

Query all departments whose total salaries are greater than 9000
SELECT deptno,SUM(sal) FROM emp GROUP BY deptno HAVING SUM(sal)>9000;
Copy the code

2.9 DISTRIBUTE BY

If you want to DISTRIBUTE data with the same Key value to the same Reducer for processing, you can use the DISTRIBUTE BY statement. It should be noted that although DISTRIBUTE BY can DISTRIBUTE data with the same Key value to the same Reducer, it cannot ensure that data is in order on the Reducer. The situation is as follows:

The following five data were sent to two Reducer for processing:

k1
k2
k4
k3
k1
Copy the code

Reducer1 obtained the following out-of-order data:

k1
k2
k1
Copy the code

Reducer2 obtained the following data:

k4
k3
Copy the code

If you want the data to be ordered on the Reducer, you can use either SORT BY (as shown below) or CLUSTER BY, which we will introduce below.

-- Distribute data to the Reducer by department
SELECT empno, deptno, sal FROM emp DISTRIBUTE BY deptno SORT BY deptno ASC;
Copy the code

2.10 CLUSTER BY

If SORT BY and DISTRIBUTE BY specify the same field, and SORT BY is ASC, CLUSTER BY can be used to replace them, and CLUSTER BY can ensure that data is globally ordered.

SELECT empno, deptno, sal FROM emp CLUSTER  BY deptno ;
Copy the code

Three, multi table join query

Hive supports inner join, outer join, left outer join, right outer join, and Cartesian join. These concepts are consistent with traditional database concepts, as shown in the following figure.

It is important to note that the JOIN condition must be specified with ON, not WHERE. Otherwise, cartesian product will be performed first and then filtered, which will result in the failure of the desired result (as explained in the following illustration).

3.1 INNER JOIN

-- Query the details of the employee whose id is 7369
SELECT e.*,d.* FROM 
emp e JOIN dept d
ON e.deptno = d.deptno 
WHERE empno=7369;

If three or more tables are joined, the syntax is as follows
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
Copy the code

3.2 the LEFT OUTER JOIN

LEFT OUTER JOIN and LEFT JOIN are equivalent.

- left connection
SELECT e.*,d.*
FROM emp e LEFT OUTER  JOIN  dept d
ON e.deptno = d.deptno;
Copy the code

3.3 RIGHT OUTER JOIN

- the right connection
SELECT e.*,d.*
FROM emp e RIGHT OUTER JOIN  dept d
ON e.deptno = d.deptno;
Copy the code

After the right link is executed, the employee information is NULL because there is no employee in department 40. This query nicely repeats what was mentioned above — JOIN statement association conditions must be specified with ON, not WHERE. You can change ON to WHERE, and you’ll never find department 40 anyway, because cartesian operations don’t have (NULL, 40).

3.4 FULL OUTER JOIN

SELECT e.*,d.*
FROM emp e FULL OUTER JOIN  dept d
ON e.deptno = d.deptno;
Copy the code

3.5 LEFT SEMI JOIN

LEFT SEMI JOIN is a more efficient implementation of IN/EXISTS subqueries.

  • The table ON the right of the JOIN clause can only be filtered in the ON clause.
  • The query results only contain data from the left table, so only columns from the left table can be selected.
Query information about all employees working in New York
SELECT emp.*
FROM emp LEFT SEMI JOIN dept 
ON emp.deptno = dept.deptno AND dept.loc="NEW YORK";

The above statement is equivalent to
SELECT emp.* FROM emp
WHERE emp.deptno IN (SELECT deptno FROM dept WHERE loc="NEW YORK");
Copy the code

3.6 the JOIN

Cartesian product connections are rare and consume a lot of performance. For this reason, if hive is in strict mode (hive.mapred.mode = strict), Hive prevents users from performing this operation.

SELECT * FROM emp JOIN dept;
Copy the code

4. JOIN optimization

4.1 STREAMTABLE

When multiple tables are joined, if each ON statement uses the same column (B. Key in the following figure), Hive optimizes to JOIN multiple tables in the same Map/Reduce job. Also assume that the last table in the query (table C below) is the largest table, and when joining each row, it will attempt to cache the other tables and scan the last table for calculation. Therefore, the user needs to ensure that the size of the queried table increases from left to right.

`SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key) JOIN c ON (c.key = b.key)`
Copy the code

Instead of always placing the largest table at the end of a query, Hive provides the /*+ STREAMTABLE() */ flag to identify the largest table, as shown in the following example:

SELECT /*+ STREAMTABLE(d) */  e.*,d.* 
FROM emp e JOIN dept d
ON e.deptno = d.deptno
WHERE job='CLERK';
Copy the code

4.2 MAPJOIN

If only one of the tables is a small table, Hive loads that small table into memory. At this time, the program will directly match the data of another table with the data of the table in memory in the map stage. Since JOIN operation is performed in the map, the reduce process can be omitted and the efficiency can be greatly improved. Hive provides /*+ MAPJOIN() */ to mark small tables as shown in the following example:

SELECT /*+ MAPJOIN(d) */ e.*,d.* 
FROM emp e JOIN dept d
ON e.deptno = d.deptno
WHERE job='CLERK';
Copy the code

Other uses of SELECT

View the current database:

SELECT current_database()
Copy the code

Local mode

In most of the preceding statements, MapReduce is triggered. For example, Select * from EMp Limit 5 does not trigger MR. In this case, Hive simply reads the contents of data files and formats them for output. In queries that require MapReduce execution, you may find that the execution time may be very long, so you can choose to enable local mode.

- The local mode is disabled by default. You need to manually enable this function
SET hive.exec.mode.local.auto=true;
Copy the code

After Hive is enabled, Hive analyzes the size of each Map-Reduce job in a query and can run it locally if:

  • Homework to the total input of size below: hive. The exec. Mode. Local. Auto. Inputbytes. Max (default is 128 MB);
  • The map – the total number of tasks is less than: hive. Exec. Mode. Local. Auto. The tasks. Max (the default is 4);
  • The total number of reduce jobs required is 1 or 0.

Since the data set we tested was small, if you run the query above involving MR operations again, you will see a significant speed increase.

The resources

  1. LanguageManual Select
  2. LanguageManual Joins
  3. LanguageManual GroupBy
  4. LanguageManual SortBy

See the GitHub Open Source Project: Getting Started with Big Data for more articles in the big Data series