This is the 28th day of my participation in the November Gwen Challenge. See details of the event: The Last Gwen Challenge 2021.
Hive in Practice
1. Requirement description
Compute the common metrics and various TopN rankings for the video website dataset:

- Top 10 videos by view count
- Top 10 video categories by heat
- The categories of the 20 most-viewed videos, and how many of those Top 20 videos fall into each category
- The category ranking of the related videos of the Top 50 most-viewed videos
- Top 10 most-viewed videos in each category, taking Music as an example
- Top 10 most-viewed videos in every category
- The 10 users who upload the most videos, and the 20 most-viewed videos among their uploads
2. Data structure

- Video table
- User table
3. Preparation

- Data to prepare: the video data and the user data
- Tables to prepare:
  - Raw data tables: gulivideo_ori, gulivideo_user_ori
  - Final ORC tables: gulivideo_orc, gulivideo_user_orc
Create the raw data tables.

gulivideo_ori:

```sql
create table gulivideo_ori(
    videoId string,
    uploader string,
    age int,
    category array<string>,
    length int,
    views int,
    rate float,
    ratings int,
    comments int,
    relatedId array<string>)
row format delimited fields terminated by "\t"
collection items terminated by "&"
stored as textfile;
```
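The DDL above declares tab-separated columns, with `&` separating items inside the two `array<string>` columns. As a quick illustration of how such a line splits into columns, here is a small Python sketch; the sample line and all field values in it are invented, not taken from the real dataset:

```python
# Illustrative only: how one raw gulivideo_ori line is parsed, given
# fields terminated by "\t" and collection items terminated by "&".

def parse_video_line(line: str) -> dict:
    """Split one TSV line into the columns declared in the DDL."""
    cols = line.rstrip("\n").split("\t")
    return {
        "videoId":   cols[0],
        "uploader":  cols[1],
        "age":       int(cols[2]),
        "category":  cols[3].split("&"),   # array<string>
        "length":    int(cols[4]),
        "views":     int(cols[5]),
        "rate":      float(cols[6]),
        "ratings":   int(cols[7]),
        "comments":  int(cols[8]),
        "relatedId": cols[9].split("&"),   # array<string>
    }

# Made-up sample line for illustration
sample = "v001\talice\t120\tMusic&Comedy\t240\t1500\t4.5\t300\t42\tv002&v003"
row = parse_video_line(sample)
print(row["category"])   # ['Music', 'Comedy']
print(row["relatedId"])  # ['v002', 'v003']
```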
gulivideo_user_ori:

```sql
create table gulivideo_user_ori(
    uploader string,
    videos int,
    friends int)
row format delimited fields terminated by "\t"
stored as textfile;
```
Create the ORC tables with SNAPPY compression.

gulivideo_orc:

```sql
create table gulivideo_orc(
    videoId string,
    uploader string,
    age int,
    category array<string>,
    length int,
    views int,
    rate float,
    ratings int,
    comments int,
    relatedId array<string>)
stored as orc
tblproperties("orc.compress"="SNAPPY");
```
gulivideo_user_orc:

```sql
create table gulivideo_user_orc(
    uploader string,
    videos int,
    friends int)
row format delimited fields terminated by "\t"
stored as orc
tblproperties("orc.compress"="SNAPPY");
```
Load the data into the ORI tables:

```sql
load data local inpath "/opt/module/data/video" into table gulivideo_ori;
load data local inpath "/opt/module/data/user" into table gulivideo_user_ori;
```
Insert the data into the ORC tables:

```sql
insert into table gulivideo_orc select * from gulivideo_ori;
insert into table gulivideo_user_orc select * from gulivideo_user_ori;
```
Install the Tez engine (background)

tez.apache.org/

Tez is an execution engine for Hive that performs better than MapReduce. Why is it faster? Consider the following.

When Hive compiles a query directly into MapReduce, it may produce several dependent MR jobs (four in the figure above; the green boxes are ReduceTasks and the cloud shapes mark the points where intermediate results are written). Every intermediate result has to be persisted to HDFS.

Tez converts the chain of dependent jobs into a single DAG job. HDFS is then written only once and there are far fewer intermediate stages, which greatly improves job performance.
Copy the Tez installation package to the cluster and extract it:

```shell
[moe@hadoop102 ~]$ mkdir /opt/module/tez
[moe@hadoop102 ~]$ tar -zxvf /opt/software/tez-0.10.1-SNAPSHOT-minimal.tar.gz -C /opt/module/tez/
```
Upload the Tez dependencies to HDFS:

```shell
[moe@hadoop102 ~]$ hadoop fs -mkdir /tez
[moe@hadoop102 ~]$ hadoop fs -put /opt/software/tez-0.10.1-SNAPSHOT.tar.gz /tez
```
Create a new tez-site.xml:

```xml
<configuration>
    <property>
        <name>tez.lib.uris</name>
        <value>${fs.defaultFS}/tez/tez-0.10.1-SNAPSHOT.tar.gz</value>
    </property>
    <property>
        <name>tez.use.cluster.hadoop-libs</name>
        <value>true</value>
    </property>
    <property>
        <name>tez.am.resource.memory.mb</name>
        <value>1024</value>
    </property>
    <property>
        <name>tez.am.resource.cpu.vcores</name>
        <value>1</value>
    </property>
    <property>
        <name>tez.container.max.java.heap.fraction</name>
        <value>0.4</value>
    </property>
    <property>
        <name>tez.task.resource.memory.mb</name>
        <value>1024</value>
    </property>
    <property>
        <name>tez.task.resource.cpu.vcores</name>
        <value>1</value>
    </property>
</configuration>
```
Modify the Hadoop environment variables:

```shell
[moe@hadoop102 ~]$ vim $HADOOP_HOME/etc/hadoop/shellprofile.d/tez.sh
```

Add the Tez jar information:

```shell
hadoop_add_profile tez
function _tez_hadoop_classpath {
    hadoop_add_classpath "$HADOOP_HOME/etc/hadoop" after
    hadoop_add_classpath "/opt/module/tez/*" after
    hadoop_add_classpath "/opt/module/tez/lib/*" after
}
```
Modify the Hive execution engine:

```shell
[moe@hadoop102 ~]$ vim $HIVE_HOME/conf/hive-site.xml
```

Add:

```xml
<property>
    <name>hive.execution.engine</name>
    <value>tez</value>
</property>
<property>
    <name>hive.tez.container.size</name>
    <value>1024</value>
</property>
```
Resolve the logging jar conflict:

```shell
[moe@hadoop102 ~]$ rm /opt/module/tez/lib/slf4j-log4j12-1.7.10.jar
```
4. Business analysis
4.1. Top 10 videos by view count
Idea: views must be sorted globally, so use ORDER BY on views in descending order and take the first 10 rows with LIMIT.
The final SQL:

```sql
SELECT
    videoId,
    views
FROM gulivideo_orc
ORDER BY views DESC
LIMIT 10;
```
4.2. Top 10 video categories by heat

(1) Count the number of videos in each category, and display the 10 categories containing the most videos.

(2) Group by category, then count the videoIds within each group.

(3) In the current table structure, one video maps to one or more categories, so before grouping by category the category array must be exploded.

(4) Finally, order by heat and take the top 10.
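Steps (2)–(4) can be sketched outside Hive. The toy Python emulation below (with invented data, not the real tables) mirrors `lateral view explode` followed by GROUP BY / COUNT / ORDER BY / LIMIT:

```python
# Toy emulation of: explode(category), group by category_name,
# count(videoId) as hot, order by hot desc, limit 10.
from collections import Counter

# Invented sample rows, standing in for gulivideo_orc
videos = [
    {"videoId": "v1", "category": ["Music", "Comedy"]},
    {"videoId": "v2", "category": ["Music"]},
    {"videoId": "v3", "category": ["Sports"]},
]

# lateral view explode(category): one (videoId, category_name) row per item
exploded = [(v["videoId"], name) for v in videos for name in v["category"]]

# group by category_name + count, then order by count desc, limit 10
hot = Counter(name for _, name in exploded)
top10 = hot.most_common(10)
print(top10)  # [('Music', 2), ('Comedy', 1), ('Sports', 1)]
```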
The final SQL:

```sql
SELECT
    t1.category_name,
    COUNT(t1.videoId) hot
FROM
(
    SELECT
        videoId,
        category_name
    FROM gulivideo_orc
    lateral VIEW explode(category) gulivideo_orc_tmp AS category_name
) t1
GROUP BY t1.category_name
ORDER BY hot DESC
LIMIT 10;
```
4.3. The categories of the Top 20 most-viewed videos, and how many of those Top 20 videos fall into each category

(1) Find the full records of the 20 most-viewed videos, in descending order of views.

(2) Explode the category arrays of those 20 records (column-to-row).

(3) Query the category name and the number of Top 20 videos in each category.
The final SQL:

```sql
SELECT
    t2.category_name,
    COUNT(t2.videoId) video_sum
FROM
(
    SELECT
        t1.videoId,
        category_name
    FROM
    (
        SELECT
            videoId,
            views,
            category
        FROM gulivideo_orc
        ORDER BY views DESC
        LIMIT 20
    ) t1
    lateral VIEW explode(t1.category) t1_tmp AS category_name
) t2
GROUP BY t2.category_name;
```
4.4. The category ranking of the related videos of the Top 50 most-viewed videos
The final SQL:

```sql
SELECT
    t6.category_name,
    t6.video_sum,
    rank() over(ORDER BY t6.video_sum DESC) rk
FROM
(
    SELECT
        t5.category_name,
        COUNT(t5.relatedid_id) video_sum
    FROM
    (
        SELECT
            t4.relatedid_id,
            category_name
        FROM
        (
            SELECT
                t2.relatedid_id,
                t3.category
            FROM
            (
                SELECT
                    relatedid_id
                FROM
                (
                    SELECT
                        videoId,
                        views,
                        relatedid
                    FROM gulivideo_orc
                    ORDER BY views DESC
                    LIMIT 50
                ) t1
                lateral VIEW explode(t1.relatedid) t1_tmp AS relatedid_id
            ) t2
            JOIN gulivideo_orc t3
            ON t2.relatedid_id = t3.videoId
        ) t4
        lateral VIEW explode(t4.category) t4_tmp AS category_name
    ) t5
    GROUP BY t5.category_name
    ORDER BY video_sum DESC
) t6;
```
4.5. Top 10 most-viewed videos in each category, taking Music as an example

Ideas:

(1) To find the Top 10 most-viewed videos in the Music category, the category array must be exploded first; a table can be created to store the category-exploded data.

(2) Insert the exploded data into that table.

(3) Count the video heat within the target category (Music) and take the Top 10 (any other category works the same way).
The final SQL:

```sql
SELECT
    t1.videoId,
    t1.views,
    t1.category_name
FROM
(
    SELECT
        videoId,
        views,
        category_name
    FROM gulivideo_orc
    lateral VIEW explode(category) gulivideo_orc_tmp AS category_name
) t1
WHERE t1.category_name = "Music"
ORDER BY t1.views DESC
LIMIT 10;
```
4.6. Top 10 most-viewed videos in every category
The final SQL:

```sql
SELECT
    t2.videoId,
    t2.views,
    t2.category_name,
    t2.rk
FROM
(
    SELECT
        t1.videoId,
        t1.views,
        t1.category_name,
        rank() over(PARTITION BY t1.category_name ORDER BY t1.views DESC) rk
    FROM
    (
        SELECT
            videoId,
            views,
            category_name
        FROM gulivideo_orc
        lateral VIEW explode(category) gulivideo_orc_tmp AS category_name
    ) t1
) t2
WHERE t2.rk <= 10;
```
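The query hinges on `rank() over(PARTITION BY … ORDER BY …)`. As a reminder of what that window function computes, here is a small Python sketch with invented rows: ties share a rank, and the rank after a tie skips ahead (1, 1, 3, …):

```python
# Toy emulation of rank() over(partition by category_name order by views desc).
from itertools import groupby

# Invented (videoId, category_name, views) rows after the explode step
rows = [
    {"videoId": "a", "category_name": "Music",  "views": 500},
    {"videoId": "b", "category_name": "Music",  "views": 900},
    {"videoId": "c", "category_name": "Music",  "views": 900},
    {"videoId": "d", "category_name": "Sports", "views": 700},
]

rows.sort(key=lambda r: (r["category_name"], -r["views"]))
ranked = []
for _, part in groupby(rows, key=lambda r: r["category_name"]):
    prev_views, rank, pos = None, 0, 0
    for r in part:
        pos += 1
        if r["views"] != prev_views:  # rank(): ties share a rank, gaps follow
            rank, prev_views = pos, r["views"]
        ranked.append({**r, "rk": rank})

top = [r for r in ranked if r["rk"] <= 10]  # the WHERE t2.rk <= 10 step
print([(r["videoId"], r["rk"]) for r in ranked])
# [('b', 1), ('c', 1), ('a', 3), ('d', 1)]
```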
4.7. The 10 users who upload the most videos, and the 20 most-viewed videos among their uploads

Ideas:

(1) Find the 10 users who upload the most videos.

(2) Join with the gulivideo_orc table to find all videos uploaded by these 10 users, then take the top 20 by view count.
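The two-step plan above (top uploaders first, then an inner join back to their videos) can be emulated in miniature. This Python sketch uses invented users and videos, with N=2 uploaders instead of 10:

```python
# Toy emulation of: top-N uploaders by video count, inner join on uploader,
# then order by views desc with a limit. All data is invented.

users = [
    {"uploader": "u1", "videos": 30},
    {"uploader": "u2", "videos": 25},
    {"uploader": "u3", "videos": 5},
]
videos = [
    {"videoId": "a", "uploader": "u1", "views": 100},
    {"videoId": "b", "uploader": "u2", "views": 900},
    {"videoId": "c", "uploader": "u3", "views": 999},  # u3 is not a top uploader
]

# step (1): top N uploaders by number of uploads (N=2 here instead of 10)
top_uploaders = {u["uploader"] for u in sorted(users, key=lambda u: -u["videos"])[:2]}

# step (2): inner join on uploader, then order by views desc, limit 20
joined = [v for v in videos if v["uploader"] in top_uploaders]
result = sorted(joined, key=lambda v: -v["views"])[:20]
print([v["videoId"] for v in result])  # ['b', 'a'] — 'c' is dropped by the join
```

Note that video `c` has the most views overall but is excluded, because the join keeps only videos from the top uploaders; the same happens in the Hive query.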
The final SQL:

```sql
SELECT
    t2.videoId,
    t2.views,
    t2.uploader
FROM
(
    SELECT
        uploader,
        videos
    FROM gulivideo_user_orc
    ORDER BY videos DESC
    LIMIT 10
) t1
JOIN gulivideo_orc t2
ON t1.uploader = t2.uploader
ORDER BY t2.views DESC
LIMIT 20;
```
Common errors and solutions
- If tasks hang after switching to the Tez engine, adjust the capacity scheduler's resource policy in $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml, changing

```xml
<property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>0.1</value>
    <description>
        Maximum percent of resources in the cluster which can be used to run
        application masters i.e. controls number of concurrent running
        applications.
    </description>
</property>
```

to

```xml
<property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>1</value>
    <description>
        Maximum percent of resources in the cluster which can be used to run
        application masters i.e. controls number of concurrent running
        applications.
    </description>
</property>
```
- JVM heap memory overflow

Symptom: java.lang.OutOfMemoryError: Java heap space

Solution: add the following to yarn-site.xml:

```xml
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>2048</value>
</property>
<property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>2048</value>
</property>
<property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>2.1</value>
</property>
<property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>
</property>
```
- Virtual memory limit exceeded

Solution: add the following to yarn-site.xml:

```xml
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
```
Related links
Big data Hive learning journey 6
Big data Hive learning journey 5
Big data Hive learning journey 4
Big data Hive learning journey 3
Big data Hive learning journey 2
Big data Hive learning journey 1