HQL film review analysis

1. Prerequisites

There is an existing crawler:

After the crawler runs, a log file is generated under the local data/collect/ directory.

There is also a Flume script:

It is a collection configuration whose task is to import the contents of that directory into HDFS,

i.e., to push the local files at that location into the cluster through Flume.

So in general: we get data from the crawler, then import it into HDFS via Flume (into /flume/ folders named in year-month-day format).
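For reference, a minimal sketch of what such an agent could look like (the agent name, paths, and config file name here are assumptions, not the actual script from class): a spooling-directory source watches the crawler's output folder and an HDFS sink writes into date-named folders.

# a hypothetical agent config; agent name and paths are assumptions
cat > film_collect.conf <<'EOF'
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# watch the crawler's output directory for new log files
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/hadoop/data/collect

a1.channels.c1.type = memory

# write into /flume/<year-month-day> on HDFS as plain text
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
EOF

# start the agent in the background
flume-ng agent -n a1 -c conf -f film_collect.conf &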

Based on this data, we need to build the Hive data warehouse (plan the warehouse, create the tables, load the data) and then run the corresponding analysis.

Teacher’s design:

The crawler runs once a day at 1:00

The Flume agent keeps running in the background; whenever a new log file appears in the directory, it is automatically pushed to HDFS

Hive loads the new data at 2:00 every day and processes the data generated the previous day

Create a folder and upload the film_rating script provided by the teacher

2. Ways to run Hive commands

  • Type commands directly in the Hive CLI

  • Write an SQL script and run it with hive -f script_name.sql

    However, Linux commands (such as the command that gets yesterday's date) cannot be written inside an SQL script file

  • Use a shell script

    Inside the script, call hive -e "specific Hive commands"

    [hadoop@hadoop01 1_22]$ hive -e "show databases;"
    which: no hbase in (/home/hadoop/local/bin:/home/hadoop/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/model/jdk1.8/bin:/opt/model/nginx/sbin:/opt/model/hadoop-3.2.1/bin:/opt/model/hadoop-3.2.1/sbin:/opt/model/flume-1.9.0/bin:/opt/model/hive-3.1.2/bin:/opt/model/sqoop2/bin:/opt/model/scala-2.12.8/bin:/opt/model/spark-3.0/sbin:/opt/model/spark-3.0/bin)
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/opt/model/hive-3.1.2/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/opt/model/hadoop-3.2.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
    Hive Session ID = 88a6a1fb-dcb1-4a3e-8bc1-419db61ae918
    Logging initialized using configuration in file:/opt/model/hive-3.1.2/conf/hive-log4j2.properties Async: true
    Hive Session ID = d3af9381-9532-4739-bd5f-ce003a0cb257
    OK
    apply_service_code
    city_db
    default
    demodb
    film_db
    film_db2
    gmall
    marketdb
    model_service_code
    myschooldb
    original_service_code
    userdb
    weather_db
    Time taken: 2.28 seconds, Fetched: 13 row(s)
    [hadoop@hadoop01 1_22]$

    Write a shell script

    Query the location of the Hive program:

    [hadoop@hadoop01 hive_dir]$ which hive
    /opt/model/hive-3.1.2/bin/hive
    [hadoop@hadoop01 film_rating]$ vi test_hive.sh
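The script body itself is not shown above; a minimal sketch that would produce the output below, using the absolute path found with which hive, is:

#!/bin/bash
# run a single Hive command non-interactively and print the result
/opt/model/hive-3.1.2/bin/hive -e "show databases;"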

Add executable permissions to the script.

[hadoop@hadoop01 film_rating]$ chmod +x test_hive.sh 

In the ls output the file name turns green, indicating it is executable.

Execute it:

[hadoop@hadoop01 film_rating]$ ./test_hive.sh
which: no hbase in (/home/hadoop/local/bin:/home/hadoop/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/model/jdk1.8/bin:/opt/model/nginx/sbin:/opt/model/hadoop-3.2.1/bin:/opt/model/hadoop-3.2.1/sbin:/opt/model/flume-1.9.0/bin:/opt/model/hive-3.1.2/bin:/opt/model/sqoop2/bin:/opt/model/scala-2.12.8/bin:/opt/model/spark-3.0/sbin:/opt/model/spark-3.0/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/model/hive-3.1.2/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/model/hadoop-3.2.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = 03e6ee68-0831-4f8c-8aae-453897b4d76e
Logging initialized using configuration in file:/opt/model/hive-3.1.2/conf/hive-log4j2.properties Async: true
Hive Session ID = 2f8675c0-cf62-4eef-afce-b5283d58efee
OK
apply_service_code
city_db
default
demodb
film_db
film_db2
gmall
marketdb
model_service_code
myschooldb
original_service_code
userdb
weather_db
Time taken: 1.946 seconds, Fetched: 13 row(s)

This way you can write shell commands and Hive commands in one script

Declare a variable in the shell script, e.g. nowDate="2021-01-25" (note: no spaces around the equals sign in shell)

It can then be referenced below as ${nowDate}
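A small sketch of the pattern (the query is only an illustration and assumes the film_rating table built later in this article):

#!/bin/bash
# no spaces around '=' when declaring a shell variable
nowDate="2021-01-25"
echo ${nowDate}
# the shell expands ${nowDate} inside the double-quoted HQL string
/opt/model/hive-3.1.2/bin/hive -e "select * from film_db.film_rating where rating_date like '${nowDate}%' limit 5;"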

3. Advance preparation

We analyze the previous day's data, so pull the existing data down from HDFS and re-upload it so that it looks like yesterday's data:

[hadoop@hadoop01 film_rating]$ hdfs dfs -get /flume/2021-01-23/* ./
INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
[hadoop@hadoop01 film_rating]$ ll
total 476
-rw-rw-r-- 1 hadoop hadoop   1415 Jan 26 14:16 film_rating.sh
-rw-rw-r-- 1 hadoop hadoop 292954 Jan 26 14:53 log_file_14.1611384696195.log
-rw-rw-r-- 1 hadoop hadoop 183953 Jan 26 14:53 log_file_20.1611403201575.log
-rwxrwxr-x 1 hadoop hadoop     92 Jan 26 14:41 test_hive.sh
[hadoop@hadoop01 film_rating]$ hdfs dfs -mkdir /flume/2021-01-24/
[hadoop@hadoop01 film_rating]$ hdfs dfs -mkdir /flume/2021-01-25/
[hadoop@hadoop01 film_rating]$ hdfs dfs -put ./*.log /flume/2021-01-25/    # upload the .log files in the current directory to that folder
2021-01-26 14:56:20,034 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2021-01-26 14:56:20,317 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false

Take a look at the first four records of the data.

4. Expected results

Our goals:

Generate three tables in the database:

  • Source data table film_rating, which stores the original film review data

Analysis results:

  • rating_class: the breakdown of reviews by rating (how many people strongly recommended the film, how many recommended it, how many thought it was just OK)

  • rating_detail: for each review title, how many people found the review useful (use_count), how many found it useless (not_use_count), and how many replied to it (reply_count)

Both analysis result tables are derived from the source data table.

5. Specific steps

The data we just pulled down locally is no longer needed, so delete it.

Use the -getmerge command to merge the two files and check how many records there are; there are more than 5,000.
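For example (using the directory uploaded above):

# merge all of that day's files on HDFS into a single local file
hdfs dfs -getmerge /flume/2021-01-25/ merged.log
# count the records (one review per line)
wc -l merged.log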

The film_rating.sh script:

#!/bin/bash
#The first line is the shebang identifying the interpreter; it must be included
#Get yesterday's date
yesterday=`date --date='1 days ago' +%Y-%m-%d`
year=`date --date='1 days ago' +%Y` 
month=`date --date='1 days ago' +%m`
echo ${yesterday}
echo ${year}
echo ${month}

#Create the ODS layer table film_rating
/opt/model/hive-3.1.2/bin/hive -e "
--1. Create the database
create database if not exists film_db;
--2. Reference the database
use film_db;
--3. Create the source data table
create table if not exists film_rating(
spider_time string,   -- crawl time
film_name string,     -- film name
title string,         -- review title
rating_score string,  -- rating
rating_date string,   -- review time
use_count int,        -- how many people found the review useful
not_use_count int,    -- how many people found it useless
reply_count int       -- how many people replied
)
partitioned by (year int, month int)  -- partitioned by year and month
row format delimited fields terminated by ','
stored as textfile
location '/input/film_ratings';
--4. Load yesterday's data from the specified HDFS path
load data inpath '/flume/${yesterday}/log_file*.log' into table film_rating partition(year=${year},month=${month});
--5. Statistical analysis of comments
drop table if exists rating_detail;
create table rating_detail
row format delimited fields terminated by ','
stored as textfile
as
select title,use_count,reply_count,not_use_count
from film_rating
order by use_count desc,reply_count desc,not_use_count asc
limit 10;
--6. Classification of comments
drop table if exists rating_class;
create table rating_class
row format delimited fields terminated by ','
stored as textfile
as
select rating_score,count(*)
from film_rating
where rating_score in('力荐','推荐','还行') -- strongly recommend / recommend / OK
group by rating_score;
"
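Before scheduling it, run the script once by hand and check that each stage succeeds:

chmod +x film_rating.sh
./film_rating.sh    # the echo lines should print yesterday's date, year, and month first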

Remember: when dropping a database that still contains tables, add cascade, e.g. drop database film_db cascade;

Extension:

Linux provides crontab for scheduled tasks; use it to execute this shell script once a day (research the details yourself)

Give the shell script executable permissions
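For example, a possible crontab entry matching the teacher's 2:00 schedule (the script path and log path here are assumptions):

# open the current user's crontab for editing
crontab -e
# add a line like this: run film_rating.sh at 2:00 every day, appending output to a log
0 2 * * * /home/hadoop/film_rating/film_rating.sh >> /home/hadoop/film_rating/cron.log 2>&1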

1. Construct the movie review data table
#!/bin/bash
#The first line is the shebang identifying the interpreter; it must be included
#Get yesterday's date
yesterday=`date --date='1 days ago' +%Y-%m-%d`
year=`date --date='1 days ago' +%Y` 
month=`date --date='1 days ago' +%m`
echo ${yesterday}
echo ${year}
echo ${month}

#Create the ODS layer table film_rating
/opt/model/hive-3.1.2/bin/hive -e "
--1. Create the database
create database if not exists film_db;
--2. Reference the database
use film_db;
--3. Create the source data table
create table if not exists film_rating(
spider_time string,   -- crawl time
film_name string,     -- film name
title string,         -- review title
rating_score string,  -- rating
rating_date string,   -- review time
use_count int,        -- how many people found the review useful
not_use_count int,    -- how many people found it useless
reply_count int       -- how many people replied
)
partitioned by (year int, month int)
row format delimited fields terminated by ','
stored as textfile
location '/input/film_ratings';
--4. Load yesterday's data from HDFS
load data inpath '/flume/${yesterday}/log_file*.log' into table film_rating partition(year=${year},month=${month});
2. Statistical analysis of film reviews
--5. Statistical analysis of comments
drop table if exists rating_detail;
create table rating_detail
row format delimited 
fields terminated by ','
stored as textfile
as
select title,use_count,reply_count,not_use_count
from film_rating
order by use_count desc,reply_count desc,not_use_count asc
limit 10;
3. Classification of statistical comments
--6. Classification of statistical comments
drop table if exists rating_class;
create table rating_class
row format delimited 
fields terminated by ','
stored as textfile
as
select rating_score,count(*)
from film_rating
where rating_score in('力荐','推荐','还行') -- strongly recommend / recommend / OK
group by rating_score;

Effect screenshot:

1. Screenshot of the script code


2. Generate three tables

Source data table film_rating

There is too much data, so only the first 10 rows are shown

rating_class

rating_detail (the duplicate rows are probably because I crawled the data twice in one day, so the records were duplicated)