This is the 28th day of my participation in the November Gwen Challenge. See details of the event: The Last Gwen Challenge 2021.

Hive in action

1. Requirement description

Count the regular metrics and various TopN metrics of the Gulivideo video website:

  1. Top10 most-viewed videos

  2. Top10 video categories by heat

  3. The categories of the Top20 most-viewed videos, and the number of Top20 videos in each of those categories

  4. The categories of the related videos of the Top50 most-viewed videos

  5. Top10 videos in each category, taking Music as an example

  6. Top10 most-viewed videos in each category

  7. The Top10 users who upload the most videos, and the Top20 most-viewed videos among their uploads

2. Data structure

  1. Video table: videoId, uploader, age, category (array), length, views, rate, ratings, comments, relatedId (array)

  2. User table: uploader, videos (number of uploaded videos), friends

3. Preparation

  1. Prepare tables

    • Tables to be prepared

      • Create the raw data tables: gulivideo_ori, gulivideo_user_ori
      • Create the final tables: gulivideo_orc, gulivideo_user_orc
    • Create the raw data table

      • gulivideo_ori

        create table gulivideo_ori(
         videoId string, 
         uploader string, 
         age int, 
         category array<string>, 
         length int, 
         views int, 
         rate float, 
         ratings int, 
         comments int,
         relatedId array<string>)
        row format delimited fields terminated by "\t"
        collection items terminated by "&"
        stored as textfile;
      • gulivideo_user_ori

        create table gulivideo_user_ori(
         uploader string,
         videos int,
         friends int)
        row format delimited fields terminated by "\t" 
        stored as textfile;
    • Create ORC storage format tables with SNAPPY compression

      • gulivideo_orc

        create table gulivideo_orc(
         videoId string, 
         uploader string, 
         age int, 
         category array<string>, 
         length int, 
         views int, 
         rate float, 
         ratings int, 
         comments int,
         relatedId array<string>)
        stored as orc
        tblproperties("orc.compress"="SNAPPY");
      • gulivideo_user_orc

        create table gulivideo_user_orc(
         uploader string,
         videos int,
         friends int)
        row format delimited 
        fields terminated by "\t" 
        stored as orc
        tblproperties("orc.compress"="SNAPPY");
      • Load data into the ORI tables

        load data local inpath "/opt/module/data/video" into table gulivideo_ori;
        load data local inpath "/opt/module/data/user" into table gulivideo_user_ori;
      • Insert data into the ORC table

        insert into table gulivideo_orc select * from gulivideo_ori;
        insert into table gulivideo_user_orc select * from gulivideo_user_ori;
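A quick sanity check after the inserts (a sketch; the exact row counts depend on your copy of the dataset, so none are asserted here):

```sql
-- Both ORC tables should now be non-empty
select count(*) from gulivideo_orc;
select count(*) from gulivideo_user_orc;
```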
  2. Install the Tez engine (for understanding only)

    tez.apache.org/

    Tez is an execution engine for Hive that performs better than MR. Why is it better than MR? Let's take a look.

    When Hive runs on MR directly, suppose a query compiles into four dependent MR jobs. In the original figure, green marks the ReduceTasks and the cloud shapes mark writes: every intermediate result must be persisted to HDFS.

    Tez can merge multiple dependent jobs into a single job, so HDFS is written only once and there are fewer intermediate stages, which greatly improves job performance.

    • Copy the TEZ installation package to the cluster and decompress the tar package

      [moe@hadoop102 ~]$ mkdir /opt/module/tez
      [moe@hadoop102 ~]$ tar -zxvf /opt/software/tez-0.10.1-SNAPSHOT-minimal.tar.gz -C /opt/module/tez/
    • Upload the Tez dependencies to HDFS

      [moe@hadoop102 ~]$ hadoop fs -mkdir /tez
      
      [moe@hadoop102 ~]$ hadoop fs -put /opt/software/tez-0.10.1-SNAPSHOT.tar.gz /tez
    • Create a new tez-site.xml in $HADOOP_HOME/etc/hadoop

      <?xml version="1.0" encoding="UTF-8"?>
      <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
      <configuration>
      
              <property>
                      <name>tez.lib.uris</name>
                      <value>${fs.defaultFS}/tez/tez-0.10.1-SNAPSHOT.tar.gz</value>
              </property>
      
              <property>
                      <name>tez.use.cluster.hadoop-libs</name>
                      <value>true</value>
              </property>
      
              <property>
                      <name>tez.am.resource.memory.mb</name>
                      <value>1024</value>
              </property>
      
              <property>
                      <name>tez.am.resource.cpu.vcores</name>
                      <value>1</value>
              </property>
      
              <property>
                      <name>tez.container.max.java.heap.fraction</name>
                      <value>0.4</value>
              </property>
      
              <property>
                      <name>tez.task.resource.memory.mb</name>
                      <value>1024</value>
              </property>
      
              <property>
                      <name>tez.task.resource.cpu.vcores</name>
                      <value>1</value>
              </property>
      
      </configuration>
    • Modify Hadoop environment variables

      [moe@hadoop102 ~]$ vim $HADOOP_HOME/etc/hadoop/shellprofile.d/tez.sh

      Add the Tez jar classpath information:

      hadoop_add_profile tez
      function _tez_hadoop_classpath
      {
       hadoop_add_classpath "$HADOOP_HOME/etc/hadoop" after
       hadoop_add_classpath "/opt/module/tez/*" after
       hadoop_add_classpath "/opt/module/tez/lib/*" after
      }
    • Modify the Hive execution engine

      [moe@hadoop102 ~]$ vim $HIVE_HOME/conf/hive-site.xml

      add

      <property>
              <name>hive.execution.engine</name>
              <value>tez</value>
      </property>
      
      <property>
              <name>hive.tez.container.size</name>
              <value>1024</value>
      </property>
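After restarting the Hive CLI, the active engine can be confirmed before running anything heavy (a sketch):

```sql
-- Should print hive.execution.engine=tez once the change takes effect
set hive.execution.engine;
```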
    • Resolve log Jar package conflicts

      [moe@hadoop102 ~]$ rm /opt/module/tez/lib/slf4j-log4j12-1.7.10.jar

4. Business analysis

4.1. Top10 video views were counted

Idea: use ORDER BY for a global sort on views, then take the first 10.

The final SQL:

SELECT 
 videoId,
 views
FROM 
 gulivideo_orc
ORDER BY 
 views DESC
LIMIT 10;

4.2. Top10 video categories by heat were counted

(1) Count the number of videos in each category, and display the top 10 categories that contain the most videos.

(2) Aggregate with GROUP BY on category, then count the videoId values in each group.

(3) Because one video can belong to one or more categories, the category array must first be exploded before grouping and counting.

(4) Finally, display the top 10 in descending order of heat.

The final SQL:

SELECT 
 t1.category_name , 
 COUNT(t1.videoId) hot
FROM 
(
SELECT 
 videoId, 
 category_name 
FROM 
 gulivideo_orc 
lateral VIEW explode(category) gulivideo_orc_tmp AS category_name
) t1
GROUP BY 
 t1.category_name 
ORDER BY
 hot 
DESC 
LIMIT 10;
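To see what the lateral VIEW explode step does, a small sketch (the category values shown in the comments are hypothetical, not from the dataset):

```sql
-- Suppose one row of gulivideo_orc is:
--   videoId = "v1", category = ["Music", "Entertainment"]
-- lateral VIEW explode(category) turns it into two rows:
--   ("v1", "Music") and ("v1", "Entertainment")
SELECT videoId, category_name
FROM gulivideo_orc
lateral VIEW explode(category) gulivideo_orc_tmp AS category_name
LIMIT 5;
```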

4.3. The categories of the Top20 most-viewed videos, and the number of Top20 videos in each category, were counted

(1) First find all the information of the 20 most-viewed videos, in descending order of views

(2) Explode the categories of those 20 rows (column-to-row conversion)

(3) Finally, group by category name and count how many of the Top20 videos each category contains

The final SQL:

SELECT 
 t2.category_name,
 COUNT(t2.videoId) video_sum
FROM 
(
SELECT
 t1.videoId,
 category_name
FROM 
(
SELECT 
 videoId, 
 views ,
 category 
FROM 
 gulivideo_orc
ORDER BY 
 views 
DESC 
LIMIT 20 
) t1
lateral VIEW explode(t1.category) t1_tmp AS category_name
) t2
GROUP BY t2.category_name;

4.4. The categories of the related videos of the Top50 most-viewed videos were counted

The final SQL:

SELECT
 t6.category_name,
 t6.video_sum,
 rank() over(ORDER BY t6.video_sum DESC) rk
FROM
(
SELECT
 t5.category_name,
 COUNT(t5.relatedid_id) video_sum
FROM
(
SELECT
 t4.relatedid_id,
 category_name
FROM
(
SELECT
 t2.relatedid_id,
 t3.category
FROM
(
SELECT
 relatedid_id
FROM
(
SELECT
 videoId,
 views,
 relatedid
FROM
 gulivideo_orc
ORDER BY
 views
DESC
LIMIT 50
) t1
lateral VIEW explode(t1.relatedid) t1_tmp AS relatedid_id
) t2
JOIN gulivideo_orc t3
ON t2.relatedid_id = t3.videoId
) t4
lateral VIEW explode(t4.category) t4_tmp AS category_name
) t5
GROUP BY t5.category_name
ORDER BY video_sum DESC
) t6;

4.5. Top10 videos in each category were counted, taking Music as an example

Ideas:

(1) To count the Top10 hottest videos in the Music category, we first need to locate the Music category; since categories are stored as an array, the category column must be exploded first, so a table can be created to store the exploded category data.

(2) Insert data into the exploded-category table.

(3) Count the heat of the videos in the target category (Music) and take the Top10 (other categories can be counted the same way).
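Steps (1) and (2) describe an intermediate table holding the exploded categories; the final SQL in this section simply inlines the explode instead. A sketch of that table (the name gulivideo_category and the CTAS form are assumptions, not taken from the source):

```sql
-- Hypothetical intermediate table: one row per (video, category) pair
CREATE TABLE gulivideo_category
STORED AS orc
TBLPROPERTIES("orc.compress"="SNAPPY")
AS
SELECT
 videoId,
 uploader,
 views,
 category_name
FROM gulivideo_orc
lateral VIEW explode(category) gulivideo_orc_tmp AS category_name;
```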

The final SQL:

SELECT 
 t1.videoId, 
 t1.views,
 t1.category_name
FROM 
(
SELECT
 videoId,
 views,
 category_name
FROM gulivideo_orc
lateral VIEW explode(category) gulivideo_orc_tmp AS category_name
)t1 
WHERE 
 t1.category_name = "Music" 
ORDER BY 
 t1.views 
DESC 
LIMIT 10;

4.6. The Top10 most-viewed videos in each category were counted

The final SQL:

SELECT
 t2.videoId,
 t2.views,
 t2.category_name,
 t2.rk
FROM
(
SELECT
 t1.videoId,
 t1.views,
 t1.category_name,
 rank() over(PARTITION BY t1.category_name ORDER BY t1.views DESC) rk
FROM
(
SELECT
 videoId,
 views,
 category_name
FROM
 gulivideo_orc
lateral VIEW explode(category) gulivideo_orc_tmp AS category_name
) t1
) t2
WHERE t2.rk <= 10;

4.7. The Top10 users who upload the most videos, and the Top20 most-viewed videos among their uploads, were counted

Ideas:

(1) Find the 10 users who upload the most videos

(2) Join with the gulivideo_orc table to find all the videos uploaded by these 10 users, then take the top 20 by view count

The final SQL:

SELECT 
 t2.videoId,
 t2.views,
 t2.uploader
FROM
(
SELECT 
 uploader,
 videos
FROM gulivideo_user_orc 
ORDER BY 
 videos
DESC
LIMIT 10 
) t1
JOIN gulivideo_orc t2 
ON t1.uploader = t2.uploader
ORDER BY 
 t2.views 
DESC
LIMIT 20;

Common errors and solutions

  1. If tasks hang after switching to the Tez engine, adjust the resource scheduling policy of the capacity scheduler

    $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml

    <property>
     <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
     <value>0.1</value>
     <description>
     Maximum percent of resources in the cluster which can be used to run 
     application masters i.e. controls number of concurrent running
     applications.
     </description>
    </property>

    to

    <property>
     <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
     <value>1</value>
     <description>
     Maximum percent of resources in the cluster which can be used to run 
     application masters i.e. controls number of concurrent running
     applications.
     </description>
    </property>
  2. JVM heap memory overflow

    Symptom: java.lang.OutOfMemoryError: Java heap space

    Solution: add the following configuration to yarn-site.xml

    <property>
            <name>yarn.scheduler.maximum-allocation-mb</name>
            <value>2048</value>
    </property>
    
    <property>
            <name>yarn.scheduler.minimum-allocation-mb</name>
            <value>2048</value>
    </property>
    
    <property>
            <name>yarn.nodemanager.vmem-pmem-ratio</name>
            <value>2.1</value>
    </property>
    
    <property>
            <name>mapred.child.java.opts</name>
            <value>-Xmx1024m</value>
    </property>
  3. Virtual memory limits

    Add the following configuration to yarn-site.xml:

    <property>
            <name>yarn.nodemanager.vmem-check-enabled</name>
            <value>false</value>
    </property>

Related links

Big data Hive learning journey 6

Big data Hive learning journey 5

Big data Hive learning journey 4

Big data Hive learning journey 3

Big data Hive learning journey 2

Big data Hive learning journey 1