This is the 28th day of my participation in the November Gwen Challenge. See details of the event: The Last Gwen Challenge 2021.

Hive in action

1. Requirement description

Count the regular metrics and various TopN metrics of the Gulivideo video website:

  1. Top10 most-viewed videos

  2. Top10 video categories by heat

  3. The categories of the Top20 most-viewed videos, and the number of Top20 videos in each of those categories

  4. The categories of the related videos of the Top50 most-viewed videos

  5. Top10 videos in each category, taking Music as an example

  6. Top10 most-viewed videos in each category

  7. The Top10 users who upload the most videos, and the Top20 most-viewed videos among their uploads

2. Data structure

  1. Video table: videoId, uploader, age, category (array), length, views, rate, ratings, comments, relatedId (array)

  2. User table: uploader, videos (number of uploaded videos), friends

3. Preparation

  1. Prepare tables

    • Tables to be prepared

      • Create the raw data tables: gulivideo_ori, gulivideo_user_ori
      • Create the final tables: gulivideo_orc, gulivideo_user_orc
    • Create the raw data table

      • gulivideo_ori

        create table gulivideo_ori(
         videoId string, 
         uploader string, 
         age int, 
         category array<string>, 
         length int, 
         views int, 
         rate float, 
         ratings int, 
         comments int,
         relatedId array<string>)
        row format delimited fields terminated by "\t"
        collection items terminated by "&"
        stored as textfile;
      • gulivideo_user_ori

        create table gulivideo_user_ori(
         uploader string,
         videos int,
         friends int)
        row format delimited fields terminated by "\t" 
        stored as textfile;
    • Create ORC storage format tables with SNAPPY compression

      • gulivideo_orc

        create table gulivideo_orc(
         videoId string, 
         uploader string, 
         age int, 
         category array<string>, 
         length int, 
         views int, 
         rate float, 
         ratings int, 
         comments int,
         relatedId array<string>)
        stored as orc
        tblproperties("orc.compress"="SNAPPY");
      • gulivideo_user_orc

        create table gulivideo_user_orc(
         uploader string,
         videos int,
         friends int)
        row format delimited 
        fields terminated by "\t" 
        stored as orc
        tblproperties("orc.compress"="SNAPPY");
      • Load data into the ORI tables

        load data local inpath "/opt/module/data/video" into table gulivideo_ori;
        load data local inpath "/opt/module/data/user" into table gulivideo_user_ori;
      • Insert data into the ORC table

        insert into table gulivideo_orc select * from gulivideo_ori;
        insert into table gulivideo_user_orc select * from gulivideo_user_ori;
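A quick sanity check after the inserts (a sketch; the exact row counts depend on your copy of the dataset, so none are asserted here):

```sql
-- Both ORC tables should now be non-empty
select count(*) from gulivideo_orc;
select count(*) from gulivideo_user_orc;
```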
  2. Install the Tez engine (for understanding only)

    tez.apache.org/

    Tez is an execution engine for Hive that performs better than MR. Why is it better than MR? Let's take a look.

    When Hive runs on MR directly, suppose a query compiles into four dependent MR jobs. In the original figure, green marks the ReduceTasks and the cloud shapes mark writes: every intermediate result must be persisted to HDFS.

    Tez can merge multiple dependent jobs into a single job, so HDFS is written only once and there are fewer intermediate stages, which greatly improves job performance.

    • Copy the TEZ installation package to the cluster and decompress the tar package

      [moe@hadoop102 ~]$ mkdir /opt/module/tez
      [moe@hadoop102 ~]$ tar -zxvf /opt/software/tez-0.10.1-SNAPSHOT-minimal.tar.gz -C /opt/module/tez/
    • Upload the Tez dependencies to HDFS

      [moe@hadoop102 ~]$ hadoop fs -mkdir /tez
      
      [moe@hadoop102 ~]$ hadoop fs -put /opt/software/tez-0.10.1-SNAPSHOT.tar.gz /tez
    • Create a new tez-site.xml in $HADOOP_HOME/etc/hadoop

      <?xml version="1.0" encoding="UTF-8"?>
      <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
      <configuration>
      
              <property>
                      <name>tez.lib.uris</name>
                      <value>${fs.defaultFS}/tez/tez-0.10.1-SNAPSHOT.tar.gz</value>
              </property>
      
              <property>
                      <name>tez.use.cluster.hadoop-libs</name>
                      <value>true</value>
              </property>
      
              <property>
                      <name>tez.am.resource.memory.mb</name>
                      <value>1024</value>
              </property>
      
              <property>
                      <name>tez.am.resource.cpu.vcores</name>
                      <value>1</value>
              </property>
      
              <property>
                      <name>tez.container.max.java.heap.fraction</name>
                      <value>0.4</value>
              </property>
      
              <property>
                      <name>tez.task.resource.memory.mb</name>
                      <value>1024</value>
              </property>
      
              <property>
                      <name>tez.task.resource.cpu.vcores</name>
                      <value>1</value>
              </property>
      
      </configuration>
    • Modify Hadoop environment variables

      [moe@hadoop102 ~]$ vim $HADOOP_HOME/etc/hadoop/shellprofile.d/tez.sh

      Add the Tez jar classpath information:

      hadoop_add_profile tez
      function _tez_hadoop_classpath
      {
       hadoop_add_classpath "$HADOOP_HOME/etc/hadoop" after
       hadoop_add_classpath "/opt/module/tez/*" after
       hadoop_add_classpath "/opt/module/tez/lib/*" after
      }
    • Modify the Hive execution engine

      [moe@hadoop102 ~]$ vim $HIVE_HOME/conf/hive-site.xml

      add

      <property>
              <name>hive.execution.engine</name>
              <value>tez</value>
      </property>
      
      <property>
              <name>hive.tez.container.size</name>
              <value>1024</value>
      </property>
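After restarting the Hive CLI, the active engine can be confirmed before running anything heavy (a sketch):

```sql
-- Should print hive.execution.engine=tez once the change takes effect
set hive.execution.engine;
```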
    • Resolve log Jar package conflicts

      [moe@hadoop102 ~]$ rm /opt/module/tez/lib/slf4j-log4j12-1.7.10.jar

4. Business analysis

4.1. Top10 video views were counted

Idea: use ORDER BY for a global sort on views, then take the first 10.

The final SQL:

SELECT 
 videoId,
 views
FROM 
 gulivideo_orc
ORDER BY 
 views DESC
LIMIT 10;

4.2. Top10 video categories by heat were counted

(1) Count the number of videos in each category, and display the top 10 categories that contain the most videos.

(2) Aggregate with GROUP BY on category, then count the videoId values in each group.

(3) Because one video can belong to one or more categories, the category array must first be exploded before grouping and counting.

(4) Finally, display the top 10 in descending order of heat.

The final SQL:

SELECT 
 t1.category_name , 
 COUNT(t1.videoId) hot
FROM 
(
SELECT 
 videoId, 
 category_name 
FROM 
 gulivideo_orc 
lateral VIEW explode(category) gulivideo_orc_tmp AS category_name
) t1
GROUP BY 
 t1.category_name 
ORDER BY
 hot 
DESC 
LIMIT 10;
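To see what the lateral VIEW explode step does, a small sketch (the category values shown in the comments are hypothetical, not from the dataset):

```sql
-- Suppose one row of gulivideo_orc is:
--   videoId = "v1", category = ["Music", "Entertainment"]
-- lateral VIEW explode(category) turns it into two rows:
--   ("v1", "Music") and ("v1", "Entertainment")
SELECT videoId, category_name
FROM gulivideo_orc
lateral VIEW explode(category) gulivideo_orc_tmp AS category_name
LIMIT 5;
```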

4.3. The categories of the Top20 most-viewed videos, and the number of Top20 videos in each category, were counted

(1) First find all the information of the 20 most-viewed videos, in descending order of views

(2) Explode the categories of those 20 rows (column-to-row conversion)

(3) Finally, group by category name and count how many of the Top20 videos each category contains

The final SQL:

SELECT 
 t2.category_name,
 COUNT(t2.videoId) video_sum
FROM 
(
SELECT
 t1.videoId,
 category_name
FROM 
(
SELECT 
 videoId, 
 views ,
 category 
FROM 
 gulivideo_orc
ORDER BY 
 views 
DESC 
LIMIT 20 
) t1
lateral VIEW explode(t1.category) t1_tmp AS category_name
) t2
GROUP BY t2.category_name;

4.4. The categories of the related videos of the Top50 most-viewed videos were counted

The final SQL:

SELECT
 t6.category_name,
 t6.video_sum,
 rank() over(ORDER BY t6.video_sum DESC) rk
FROM
(
SELECT
 t5.category_name,
 COUNT(t5.relatedid_id) video_sum
FROM
(
SELECT
 t4.relatedid_id,
 category_name
FROM
(
SELECT
 t2.relatedid_id,
 t3.category
FROM
(
SELECT
 relatedid_id
FROM
(
SELECT
 videoId,
 views,
 relatedid
FROM
 gulivideo_orc
ORDER BY
 views
DESC
LIMIT 50
) t1
lateral VIEW explode(t1.relatedid) t1_tmp AS relatedid_id
) t2
JOIN gulivideo_orc t3
ON t2.relatedid_id = t3.videoId
) t4
lateral VIEW explode(t4.category) t4_tmp AS category_name
) t5
GROUP BY t5.category_name
ORDER BY video_sum DESC
) t6;

4.5. Top10 videos in each category were counted, taking Music as an example

Ideas:

(1) To count the Top10 hottest videos in the Music category, we first need to locate the Music category; since categories are stored as an array, the category column must be exploded first, so a table can be created to store the exploded category data.

(2) Insert data into the exploded-category table.

(3) Count the heat of the videos in the target category (Music) and take the Top10 (other categories can be counted the same way).
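Steps (1) and (2) describe an intermediate table holding the exploded categories; the final SQL in this section simply inlines the explode instead. A sketch of that table (the name gulivideo_category and the CTAS form are assumptions, not taken from the source):

```sql
-- Hypothetical intermediate table: one row per (video, category) pair
CREATE TABLE gulivideo_category
STORED AS orc
TBLPROPERTIES("orc.compress"="SNAPPY")
AS
SELECT
 videoId,
 uploader,
 views,
 category_name
FROM gulivideo_orc
lateral VIEW explode(category) gulivideo_orc_tmp AS category_name;
```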

The final SQL:

SELECT 
 t1.videoId, 
 t1.views,
 t1.category_name
FROM 
(
SELECT
 videoId,
 views,
 category_name
FROM gulivideo_orc
lateral VIEW explode(category) gulivideo_orc_tmp AS category_name
)t1 
WHERE 
 t1.category_name = "Music" 
ORDER BY 
 t1.views 
DESC 
LIMIT 10;

4.6. The Top10 most-viewed videos in each category were counted

The final SQL:

SELECT
 t2.videoId,
 t2.views,
 t2.category_name,
 t2.rk
FROM
(
SELECT
 t1.videoId,
 t1.views,
 t1.category_name,
 rank() over(PARTITION BY t1.category_name ORDER BY t1.views DESC) rk
FROM
(
SELECT
 videoId,
 views,
 category_name
FROM
 gulivideo_orc
lateral VIEW explode(category) gulivideo_orc_tmp AS category_name
) t1
) t2
WHERE t2.rk <= 10;

4.7. The Top10 users who upload the most videos, and the Top20 most-viewed videos among their uploads, were counted

Ideas:

(1) Find the 10 users who upload the most videos

(2) Join with the gulivideo_orc table to find all the videos uploaded by these 10 users, then take the top 20 by view count

The final SQL:

SELECT 
 t2.videoId,
 t2.views,
 t2.uploader
FROM
(
SELECT 
 uploader,
 videos
FROM gulivideo_user_orc 
ORDER BY 
 videos
DESC
LIMIT 10 
) t1
JOIN gulivideo_orc t2 
ON t1.uploader = t2.uploader
ORDER BY 
 t2.views 
DESC
LIMIT 20;

Common errors and solutions

  1. If tasks hang after switching to the Tez engine, adjust the resource scheduling policy of the capacity scheduler

    $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml

    <property>
     <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
     <value>0.1</value>
     <description>
     Maximum percent of resources in the cluster which can be used to run 
     application masters i.e. controls number of concurrent running
     applications.
     </description>
    </property>

    to

    <property>
     <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
     <value>1</value>
     <description>
     Maximum percent of resources in the cluster which can be used to run 
     application masters i.e. controls number of concurrent running
     applications.
     </description>
    </property>
  2. JVM heap memory overflow

    Symptom: java.lang.OutOfMemoryError: Java heap space

    Solution: add the following configuration to yarn-site.xml

    <property>
            <name>yarn.scheduler.maximum-allocation-mb</name>
            <value>2048</value>
    </property>
    
    <property>
            <name>yarn.scheduler.minimum-allocation-mb</name>
            <value>2048</value>
    </property>
    
    <property>
            <name>yarn.nodemanager.vmem-pmem-ratio</name>
            <value>2.1</value>
    </property>
    
    <property>
            <name>mapred.child.java.opts</name>
            <value>-Xmx1024m</value>
    </property>
  3. Virtual memory limits

    Add the following configuration to yarn-site.xml:

    <property>
            <name>yarn.nodemanager.vmem-check-enabled</name>
            <value>false</value>
    </property>

Related links

Big data Hive learning journey 6

Big data Hive learning journey 5

Big data Hive learning journey 4

Big data Hive learning journey 3

Big data Hive learning journey 2

Big data Hive learning journey 1