In order to help you as soon as possible to learn Hive SQL, raise the number, in the field of big data, to help you prepare the data, ready to prepare the SQL sample, according to learning, learning while practicing, is mainly to solve the problem of learning a new technology starting point high start difficult.
This self-written course is divided into the following parts, you are welcome to read and practice.
- Special instructions
- Prepare the environment
- Two sample tables
- SQL example
-
- Simple SELECT query
- Filter by criteria using the WHERE statement
- Use distinct statements to remove weights
- Use the order BY statement to sort the query results
- Use sum/count/ avG to calculate sum/count/ average etc
- Use group by to calculate by group
- The subquery
- Use joins to query data from multiple tables
- Use union All to splice multiple data
- Use case WHEN to control column output
- Advanced SQL Tutorial (to be continued)
-
- The window function
- Process data in Json format
- Process XML formatted data
- The ranks of conversion
- Use the with syntax
- Data sampling
Special instructions
The data set used in this article is the test data from the Internet, the data address: github.com/markgrover/…
Two sample tables
Ide_test. flights and IDE_test. airports tables ide_test.airports tables ide_test.flights and IDE_test. airports tables ide_test.airports The other two tables each have several rows of test data that you can use directly when practicing SQL.
Flights table
CREATE TABLE `ide_test.flights`(
`year` string COMMENT 'year'.`month` string COMMENT 'month'.`day` string COMMENT 'day'.`dep_time` string COMMENT 'Departure time'.`crs_dep_time` string COMMENT 'Scheduled time of departure'.`arr_time` string COMMENT 'Time of arrival'.`crs_arr_time` string COMMENT 'Planned time of arrival'.`carrier_code` string COMMENT 'Airline code'.`flight_num` string COMMENT 'Flight Number'.`actual_elapsed_time` string COMMENT 'Actual time'.`crs_elapsed_time` string COMMENT 'Planning time'.`air_time` string COMMENT 'Time of flight'.`arr_delay` string COMMENT 'Arrival delay'.`dep_delay` string COMMENT 'Take-off delay'.`origin` string COMMENT 'Airport of departure'.`dest` string COMMENT 'Target Airport'.`distance` string COMMENT 'distance')
COMMENT 'Test Data - Flight Information'
Copy the code
Airports table
CREATE TABLE `ide_test.airports`(
`name` string COMMENT 'name'.`country` string COMMENT 'country'.`area_code` string COMMENT 'area_code'.`code` string COMMENT 'code')
COMMENT 'Test Data - Airport Info'
Copy the code
SQL example
Simple SELECT query
Use the SELECT statement to view the data in the table
The select usage
select name from ide_test.airports limit 5Airports returns from the name column in ide_test.airports table5Line select *from ide_test.airports limit 5- the use ofThe '*'Represents all columns in the tableCopy the code
** It is a good practice to limit the number of rows returned by SQL. Limit 5 above means that a maximum of 5 rows of data will be returned
Information: In-depth understanding of Hive Select syntax
Filter by criteria using the WHERE statement
Using the WHERE statement, you can swipe data based on specified criteria, such as viewing information about flight 335
Where usage
Select * -- returns all columns in the tablefrom
ide_test.flights
where
flight_num= '335'-- Returns the flight_num column'335'A record of the limit5
Copy the code
Where is followed by a logical expression. Multiple logical expressions are concatenated by AND, OR
Use distinct statements to remove weights
A column (or columns) often has duplicate values, and sometimes you need to look at them later. You can use distinct statements to do this. For example, to see which years of data are contained in the IDE_test. flights table, use the following SQL statement:
Distinct usage
Select DISTINCT Year -- A distinct return is returned after the year column is de-duplicatedfrom
ide_test.flights
limit 10
Copy the code
There is a better way to see what values a column contains, but this is just to show the use of distinct statements
Use the order BY statement to sort the query results
Find the five latest flights
The order by usage
select
origin, dest, flight_num, dep_time
fromIde_test. flights ORDER by dep_time desc - ordered in reverse order desc and ASC order by dep_time DESC5
Copy the code
Use sum/count/ avG to calculate sum/count/ average etc
Use the count() function to count, for example, to see how many flights there are in the ide_test.flights table:
The count usage
select
count(flight_num) asThe CNT -- count function represents the countfrom
ide_test.flights
limit 10
Copy the code
Sum () means summing up the specified columns, such as how many kilometers flight 335 traveled
The sum usage
select
sum(cast(distance as int)) asThe total_distance --sum function represents the sumfrom
ide_test.flights
where
flight_num = '335'
limit 10
Copy the code
Count /sum = avg/ Max /min = avg/ Max /min = avg/ Max /min = avg/ Max /min = avg/ Max /min
Documentation: In-depth understanding of Hive functions
Use group by to calculate by group
The group by statement is a useful way to group data and then evaluate it, as you can see in an example. For example, to calculate the number of miles flown on each flight, take the top 10 from the largest to the smallest, as follows:
Group by usage
select
flight_num, sum(distance) asTotal_distance -- except aggregate functions such as sum/count/avg/min/ Max, columns explicitly written after select must also appear after group byfromIde_test. flights group by flight_num sum sum for each group by flight number order by total_distance desc limit10
Copy the code
In another case, the having statement is used to query flight numbers with a total flight mileage greater than 3000
Having the grammar
Select flight_num -- The column written after select must also appear after group byfrom
ide_test.flights
group by flight_num
having sum(distance) > 3000-- Swipe to select greater than total flight mileage3000The flight of limit5
Copy the code
Quick thought: Why use having as a filter in this case instead of where?
Documentation: In-depth understanding of Hive Group by syntax
The subquery
To calculate the number of flights with more than 3000 total miles flown, the SQL can be written like this
Subquery usage
select count(1) as cnt
from
(
select flight_num
from ide_test.flights
group by flight_num
having sum(distance) > 3000
) t
Copy the code
The use of subqueries is the same as that of normal Hive tables. Generally, subqueries are enclosed by parentheses ‘()’ for clarity
Documentation: In-depth understanding of Hive subqueries
Use joins to query data from multiple tables
For example, we plan to query the mileage of each flight and the name of the starting airport. Because these two information are stored in two different Hive tables, we need to query both hive tables to get the result
The join usage
Select p. Flights from flights table. Flights q.name airports from airports tablefrom
ide_test.flights p
left join ide_test.airports q
on p.origin = q.code
limit 10
Copy the code
Tips: * * * *
Left JOIN ** : ** The left table is associated with the right table. If a value exists in the coordinate but the right table does not exist, the corresponding information of the right table is replaced by NULL
Inner join** : ** only if the left and right tables have corresponding values
Right join** : ** is the opposite of left join
Full outer join** : ** Null is used when no value exists on the left or right table
Note:
If there are m records in A and N records in B, at most m* N records will be generated when A join B.
data
Understand Hive Join syntax
Use union All to splice multiple data
The UNION statement is used to combine the result sets of two or more SELECT statements.
Select flights with flight mileage greater than 1000 or less than 100, SQL is as follows:
Note: This is just to show the use of union, there should be a better way to write it, think for yourself:)
The union usage
select flight_num
from ide_test.flights
where distance > 1000
union
select flight_num
from ide_test.flights
where distance < 100
Copy the code
Tips: Note that SELECT statements within a UNION must have the same number of columns. Columns must also have similar data types. Also, the columns in each SELECT statement must be in the same order
Documentation: In-depth understanding of Hive Union syntax
Use case WHEN to control column output
Query the flight distance by “long”, “medium”, “short” three levels, using the case when statement, example:
The case when use
select flight_num,
distance,
case when distance > 3000 then 'Long distance flight'
when distance > 1000 then 'Medium distance flight'
else 'Close flight'
end as flight_level
from ide_test.flights
limit 10
Copy the code
Tips: Use case When to customize query results flexibly. Case when can also be used with aggregate functions such as count/sum to implement conditional counting/summing
Author: Zhang Yanlong (from my official account: Data Old Blacksmith)
- Didi cloud full line standard cloud server limited time special, registration is a gift package for novices
- New purchase cloud services 50% off in January, 40% off in March and 30% off in June
- Didi cloud emissary recruitment, recommended maximum commission 50%