This is the 14th day of my participation in the August More Text Challenge

Preface

Next, let's learn about Hive, a classic data warehouse tool.

Hive 3.1.2 and mysql 8.0.20 are used this time.

Basic Hive concepts

1.1 What Is Hive

1.1.1 Summary

Hive is an open-source data statistics and analysis tool developed by Facebook to handle massive structured logs.

Hive is a Hadoop-based data warehouse tool that maps structured data files to tables and provides SQL-like query functionality.
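To make the mapping concrete, here is a hedged sketch in HQL — the table name, columns, and HDFS path below are all illustrative, not from the original text:

```sql
-- Hypothetical example: expose a tab-separated HDFS directory as a queryable table
CREATE EXTERNAL TABLE access_log (
    ip  STRING,
    url STRING,
    ts  BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/access_log';

-- Hive compiles this SQL-like query into a MapReduce job
SELECT url, count(*) AS hits
FROM access_log
GROUP BY url;
```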

1.1.2 Hive nature

Hive maps structured data into tables, using MapReduce as the computing engine, HDFS for storage, and YARN for resource scheduling.

A. Data processed by Hive is stored in HDFS

B. MapReduce is the underlying implementation of Hive's data analysis

C. The execution program can be run on YARN.

1.2 Advantages and disadvantages of Hive

1.2.1 Advantages

(1) The operation interface uses SQL-like syntax, enabling rapid development

(2) It avoids writing MapReduce directly, reducing development and learning costs

(3) Hive has high execution latency, so it is commonly used for data analysis scenarios that do not require real-time results

(4) Hive is good at processing big data but has no advantage with small data, because of its high execution latency

(5) Hive supports user-defined functions, so users can implement functions to meet their own requirements

1.2.2 Shortcomings

(a) The HQL expression capability of Hive is limited

  • (1) Iterative algorithms cannot be expressed
  • (2) It is not good at data mining: due to the limitations of the MapReduce processing model, more efficient algorithms cannot be implemented

(b) Hive is inefficient

  • (1) MapReduce jobs automatically generated by Hive are usually not intelligent enough
  • (2) Hive tuning is difficult and coarse-grained

1.3 Hive Architecture Principles

The data Hive computes over is stored in HDFS.

Hive metadata (table information) is stored in a third-party database. Derby is used by default, but it only supports a single client connection; switch to MySQL if you want to open multiple client windows.

  • User interface: Client

CLI (command-line interface), JDBC/ODBC (JDBC access to Hive), and WebUI (browser access to Hive)

  • Metadata: Metastore

Metadata includes the table name, database to which the table belongs (default), owner of the table, column/partition fields, type of the table (whether it is an external table), and directory where the table data resides.

By default, metadata is stored in Hive's own Derby database. Using MySQL to store the Metastore is recommended.

  • Hadoop

HDFS is used for storage and MapReduce is used for calculation.

  • Driver: Driver

    • (1) SQL Parser: converts the SQL string into an abstract syntax tree (AST); this step is usually done by a third-party tool library such as ANTLR. The AST is then analyzed, e.g. whether tables exist, whether fields exist, and whether the SQL semantics are wrong.
    • (2) Physical Plan: compiles the AST to generate a logical execution plan.
    • (3) Query Optimizer: optimizes the logical execution plan.
    • (4) Execution: converts the logical execution plan into a physical plan that can be executed; for Hive, this means MR/Spark.

Hive receives SQL commands from users through a series of interactive interfaces, uses its own Driver and MetaStore to translate these commands into MapReduce and submit them to Hadoop for execution. Finally, Hive outputs the execution results to user interaction interfaces.

1.4 Comparison between Hive and Databases

Hive uses the Hive Query Language (HQL), which is similar to SQL, so Hive is easily mistaken for a database. In fact, apart from the similar query language, Hive has little in common with a database structurally. This section explains the differences between Hive and databases in several respects. Databases can be used in online applications, but Hive is designed for data warehouses; keeping this in mind helps you understand Hive's characteristics from an application perspective.

1.4.1 Query language

SQL is widely used in data warehouses, so an SQL-like query language, HQL, was designed specifically around Hive's characteristics. Developers familiar with SQL can easily use Hive for development.

1.4.2 Data Update

Hive is designed for data warehouse applications, and data warehouse content is read far more often than it is written. Therefore, rewriting data is not recommended in Hive; all data is determined at load time. Data in a database, by contrast, usually needs frequent modification, so you can use INSERT INTO ... VALUES to add data and UPDATE ... SET to modify it.
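A hedged sketch of the contrast (table names and paths below are illustrative): in Hive, data typically arrives in bulk at load time, while a database modifies individual rows.

```sql
-- Hive style: write once at load time (illustrative path and table)
LOAD DATA INPATH '/staging/2021-08-14' INTO TABLE user_log;

-- Database style: frequent row-level changes (typical in MySQL, discouraged in Hive)
INSERT INTO users VALUES (1, 'lei');
UPDATE users SET name = 'leilei' WHERE id = 1;
```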

1.4.3 Execution Delay

Hive does not have indexes and scans the entire table during queries, resulting in high latency. Another factor is the MapReduce framework itself: because MapReduce has high latency, Hive queries executed via MapReduce inherit that latency. In contrast, database execution latency is low. Of course, this low latency is conditional on the data size being small; once the data size exceeds the database's processing capacity, Hive's parallel computing clearly has the advantage.

1.4.4 Data scale

Hive is built on clusters and can use MapReduce for parallel computing, so it can support large amounts of data. Correspondingly, the database can support a smaller data scale.

Hive Installation

2.1 Hive Installation Address

  • Hive official website: hive.apache.org/
  • Document to check the address: cwiki.apache.org/confluence/…
  • Download address: archive.apache.org/dist/hive/
  • Github address: github.com/apache/hive

2.2 MySQL Installation

Hive uses MySQL to store metadata, so install and configure MySQL first.

Mysql installation: RPM installation and source code installation

juejin.cn/post/688635…

juejin.cn/post/688855…

However, I will use yum to install it, which is far more efficient than the two methods above. I recommend it when setting up your own environment.

1. Install the MySQL 8.0 repository

 yum localinstall https://repo.mysql.com//mysql80-community-release-el7-1.noarch.rpm

2. Install MySQL 8.0

 yum install mysql-community-server

3. Start MySQL and configure it to start automatically on boot

 systemctl start mysqld
 systemctl enable mysqld

4. View the default password and reset it

 grep 'temporary password' /var/log/mysqld.log

Log in and change the password. Note that there is no space between -p and the password:

 [root@hadoop101 package]# mysql -p+TkQU4tYowE6
 mysql: [Warning] Using a password on the command line interface can be insecure.
 Welcome to the MySQL monitor.  Commands end with ; or \g.
 Your MySQL connection id is 13
 Server version: 8.0.25

5. Change the password and grant remote access:

 mysql> alter user 'root'@'localhost' identified with mysql_native_password by 'Admin12@2021';
 Query OK, 0 rows affected (0.00 sec)
 mysql> rename user 'root'@'localhost' to 'root'@'%';
 Query OK, 0 rows affected (0.01 sec)
 mysql> flush privileges;
 Query OK, 0 rows affected (0.00 sec)
 mysql> grant all privileges on *.* to 'root'@'%';
 Query OK, 0 rows affected (0.00 sec)

6. Remote access

2.3 Hive deployment

2.3.1 Hive Installation

I downloaded apache-hive-3.1.2-bin.tar.gz.

  • 1. Upload the Hive JAR package to /opt/package of Linux

  • 2. Decompress apache-hive-3.1.2-bin.tar.gz to /opt/module and rename it to hive

    tar -zxvf /opt/software/apache-hive-3.1.2-bin.tar.gz -C /opt/module/
    mv /opt/module/apache-hive-3.1.2-bin/ /opt/module/hive
  • 3. Modify /etc/profile.d/my_env.sh to add environment variables

     vim /etc/profile.d/my_env.sh

     #HIVE_HOME
     HIVE_HOME=/opt/module/hive
     PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin
     export PATH JAVA_HOME HADOOP_HOME HIVE_HOME
  • 4. Go to the lib directory to resolve the log JAR package conflict

    [leilei@hadoop100 lib]$ mv log4j-slf4j-impl-2.10.0.jar log4j-slf4j-impl-2.10.0.bak
2.3.2 Configuring Hive metadata to Mysql
  • 1. Download the MySQL connector driver. I am using the latest MySQL, 8.0.25, so I need the matching driver version. Download link: dev.mysql.com/downloads/f…

  • 2. Copy the downloaded MySQL JDBC driver to Hive's lib directory:

    cp mysql-connector-java-8.0.25.jar $HIVE_HOME/lib/

  • 3. Create hive-site.xml in $HIVE_HOME/conf with the following content:

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
        <!-- JDBC connection URL -->
        <property>
            <name>javax.jdo.option.ConnectionURL</name>
            <value>jdbc:mysql://hadoop101:3306/metastore?useSSL=false</value>
        </property>
        <!-- JDBC connection Driver -->
        <property>
            <name>javax.jdo.option.ConnectionDriverName</name>
            <value>com.mysql.cj.jdbc.Driver</value>
        </property>
        <!-- JDBC connection username -->
        <property>
            <name>javax.jdo.option.ConnectionUserName</name>
            <value>root</value>
        </property>
        <!-- JDBC connection password -->
        <property>
            <name>javax.jdo.option.ConnectionPassword</name>
            <value>Admin@2021</value>
        </property>
        <!-- Default Hive working directory on HDFS -->
        <property>
            <name>hive.metastore.warehouse.dir</name>
            <value>/user/hive/warehouse</value>
        </property>
        <!-- Hive metadata schema version verification -->
        <property>
            <name>hive.metastore.schema.verification</name>
            <value>false</value>
        </property>
        <!-- Metastore URI -->
        <property>
            <name>hive.metastore.uris</name>
            <value>thrift://hadoop101:9083</value>
        </property>
        <!-- Specify the hiveserver2 connection port -->
        <property>
            <name>hive.server2.thrift.port</name>
            <value>10000</value>
        </property>
        <!-- hiveserver2 bind host -->
        <property>
            <name>hive.server2.thrift.bind.host</name>
            <value>hadoop101</value>
        </property>
        <!-- Metadata storage authorization -->
        <property>
            <name>hive.metastore.event.db.notification.api.auth</name>
            <value>false</value>
        </property>
    </configuration>
2.3.3 Initializing the Metadata Database
  • 1. Create mysql library MetaStore to store hive metadata tables.

  • 2. Initialize the Hive metadata database. This process will add Hive information to the corresponding mysql database

     cd $HIVE_HOME/bin
     [root@hadoop101 bin]# ./schematool -initSchema -dbType mysql -verbose

After this finishes, we find that there are many more tables in the database.
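To see them, query the metastore database from the MySQL client; TBLS, DBS, and COLUMNS_V2 are among the tables the standard Hive metastore schema defines:

```sql
-- Run in the MySQL client after initialization
use metastore;
show tables;                                   -- dozens of metadata tables
select DB_ID, NAME from DBS;                   -- registered Hive databases
select TBL_ID, TBL_NAME, TBL_TYPE from TBLS;   -- registered Hive tables
```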

2.4 Starting Hive

2.4.1 Starting MetaStore and Hiveserver2

You can start these two services directly with the commands below, but the window then becomes unusable and cannot be closed, so this method is not recommended.

 hive --service metastore
 hive --service hiveserver2

Below, we start them in background mode, generally using the combined command: nohup [command] > file 2>&1 &. This writes the command's output to file and keeps the process it starts running in the background.

 [root@hadoop101 hive]# nohup hive --service metastore 2>&1 &
 [1] 108879
 [root@hadoop101 hive]# nohup hive --service hiveserver2 2>&1 &
 [2] 109675
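If you are unsure how the nohup pattern behaves, you can dry-run it with a harmless command standing in for the Hive services (the file name below is illustrative):

```shell
# Use echo as a stand-in for "hive --service metastore"
nohup echo "metastore would start here" > /tmp/nohup_demo.log 2>&1 &
wait   # wait for the background job to finish
# The command's output went to the log file, not the terminal
grep "would start" /tmp/nohup_demo.log
```

The real service processes keep running after the shell that started them exits, which is the point of nohup.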

For convenience, we use a script to manage starting and stopping these services. You don't need to master the script; just use it and pay attention to the paths.

 vim $HIVE_HOME/bin/hiveservices.sh
Copy the code
#!/bin/bash
HIVE_LOG_DIR=$HIVE_HOME/logs
if [ ! -d $HIVE_LOG_DIR ]
then
    mkdir -p $HIVE_LOG_DIR
fi

# Check whether a process is running: $1 = process name, $2 = port
function check_process()
{
    pid=$(ps -ef 2>/dev/null | grep -v grep | grep -i $1 | awk '{print $2}')
    ppid=$(netstat -nltp 2>/dev/null | grep $2 | awk '{print $7}' | cut -d '/' -f 1)
    echo $pid
    [[ "$pid" =~ "$ppid" ]] && [ "$ppid" ] && return 0 || return 1
}

function hive_start()
{
    metapid=$(check_process HiveMetastore 9083)
    cmd="nohup hive --service metastore >$HIVE_LOG_DIR/metastore.log 2>&1 &"
    cmd=$cmd" sleep 4; hdfs dfsadmin -safemode wait >/dev/null 2>&1"
    [ -z "$metapid" ] && eval $cmd || echo "Metastore service has started"
    server2pid=$(check_process HiveServer2 10000)
    cmd="nohup hive --service hiveserver2 >$HIVE_LOG_DIR/hiveServer2.log 2>&1 &"
    [ -z "$server2pid" ] && eval $cmd || echo "HiveServer2 service has started"
}

function hive_stop()
{
    metapid=$(check_process HiveMetastore 9083)
    [ "$metapid" ] && kill $metapid || echo "Metastore service is not running"
    server2pid=$(check_process HiveServer2 10000)
    [ "$server2pid" ] && kill $server2pid || echo "HiveServer2 service is not running"
}

case $1 in
"start")
    hive_start
    ;;
"stop")
    hive_stop
    ;;
"restart")
    hive_stop
    sleep 2
    hive_start
    ;;
"status")
    check_process HiveMetastore 9083 >/dev/null && echo "Metastore service is running" || echo "Metastore service is abnormal"
    check_process HiveServer2 10000 >/dev/null && echo "HiveServer2 service is running" || echo "HiveServer2 service is abnormal"
    ;;
*)
    echo Invalid Args!
    echo 'Usage: '$(basename $0)' start|stop|restart|status'
    ;;
esac

Add execute permission to this script:

 [root@hadoop101 hive]# chmod +x $HIVE_HOME/bin/hiveservices.sh

Start the Hive background service:

 hiveservices.sh start
2.4.2 Starting a Cluster

Execute the script we wrote earlier:

 [root@hadoop101 sbin]# mycluster.sh start

Check whether the started services are normal:

[root@hadoop101 sbin]# myjps.sh
===== hadoop101 jps======
18993 DataNode
122563 RunJar
20882 Jps
122123 RunJar
20619 NodeManager
18813 NameNode
===== hadoop102 jps======
39191 NodeManager
37848 DataNode
39002 ResourceManager
39677 Jps
===== hadoop103 jps======
26724 Jps
25333 DataNode
25482 SecondaryNameNode
26463 NodeManager
[root@hadoop101 sbin]# hiveservices.sh status
Metastore service is running
HiveServer2 service is running

2.5 Hive access

2.5.1 Hive JDBC Access
  $HIVE_HOME/bin/beeline -u jdbc:hive2://hadoop101:10000 -n root

If the following information is displayed, Hive was accessed successfully.

[root@hadoop101 hadoop]# $HIVE_HOME/bin/beeline -u jdbc:hive2://hadoop101:10000 -n root
Connecting to jdbc:hive2://hadoop101:10000
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 3.1.2 by Apache Hive
0: jdbc:hive2://hadoop101:10000>

If startup fails:

  • 1. Check whether the Hiveserver2 service is running properly
  • 2. Check whether Hadoop's etc/hadoop/core-site.xml is configured so that the root user is allowed to proxy other users
<!-- This is compatibility configuration; it can be skipped at first -->
<!-- Hosts from which the root user (superuser) is allowed to proxy -->
<property>
    <name>hadoop.proxyuser.root.hosts</name>
    <value>*</value>
</property>
<!-- Groups whose users root (superuser) is allowed to proxy -->
<property>
    <name>hadoop.proxyuser.root.groups</name>
    <value>*</value>
</property>
<!-- Users root (superuser) is allowed to proxy -->
<property>
    <name>hadoop.proxyuser.root.users</name>
    <value>*</value>
</property>
2.5.2 Accessing a Hive Client
  • 1. Start the Hive client
[root@hadoop101 sofeware]# hive
which: no hbase in (/opt/rh/devtoolset-9/root/usr/bin:/opt/sofeware/java8/bin:/opt/sofeware/java8/jre/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/sofeware/hadoop-3.1.3/bin:/opt/sofeware/hadoop-3.1.3/sbin:/opt/sofeware/hive/bin:/home/lei/bin:/root/bin)
Hive Session ID = fb510c68-35b2-4505-86c6-f962b3dbdedb
Logging initialized using configuration in jar:file:/opt/sofeware/hive/lib/hive-common-3.1.2.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Hive Session ID = 8f32c100-e1d7-4313-8bea-3b72441ab1ce
hive>

The printout shows that the engine in use is MR, which is deprecated; we can replace it later.

  • 2. Check the database

    hive> show databases;
    OK
    default
    Time taken: 0.517 seconds, Fetched: 1 row(s)
  • 3. To pretty-print query output, we can add the following two configurations to hive-site.xml:

    <property>
        <name>hive.cli.print.header</name>
        <value>true</value>
        <description>Whether to print the names of the columns in query output.</description>
    </property>
    <property>
        <name>hive.cli.print.current.db</name>
        <value>true</value>
        <description>Whether to include the current database in the Hive prompt.</description>
    </property>
  • 4. The final effect is modest: the header and current database are now displayed, but it is still not as pretty as Beeline, so the Beeline approach above is recommended

    hive (default)> create database mydb;
    OK
    Time taken: 1.209 seconds
    hive (default)> use mydb;
    OK
    Time taken: 0.057 seconds
    hive (mydb)> show databases;
    OK
    database_name
    default
    mydb

2.6 Hive Interaction Commands

[root@hadoop101 conf]# hive -help
which: no hbase in (/opt/rh/devtoolset-9/root/usr/bin:/opt/sofeware/java8/bin:/opt/sofeware/java8/jre/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/sofeware/hadoop-3.1.3/bin:/opt/sofeware/hadoop-3.1.3/sbin:/opt/sofeware/hive/bin:/home/lei/bin:/root/bin)
Hive Session ID = e8e97f61-a6de-440e-be53-8f3e1592414f
usage: hive
 -d,--define <key=value>          Variable substitution to apply to Hive
                                  commands. e.g. -d A=B or --define A=B
    --database <databasename>     Specify the database to use
 -e <quoted-query-string>         SQL from command line
 -f <filename>                    SQL from files
 -H,--help                        Print help information
    --hiveconf <property=value>   Use value for given property
    --hivevar <key=value>         Variable substitution to apply to Hive
                                  commands. e.g. --hivevar A=B
 -i <filename>                    Initialization SQL file
 -S,--silent                      Silent mode in interactive shell
 -v,--verbose                     Verbose mode (echo executed SQL to the
                                  console)
  • 1. -e executes an SQL statement without entering the Hive interactive window

    [root@hadoop101 ~]# hive -e "select * from mydb.mytable;"
    which: no hbase in (/opt/rh/devtoolset-9/root/usr/bin:/opt/sofeware/java8/bin:/opt/sofeware/java8/jre/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/sofeware/hadoop-3.1.3/bin:/opt/sofeware/hadoop-3.1.3/sbin:/opt/sofeware/hive/bin:/home/lei/bin:/root/bin)
    Hive Session ID = 05e49b32-b3d3-4e8e-bf01-64293c905fa6
    Logging initialized using configuration in jar:file:/opt/sofeware/hive/lib/hive-common-3.1.2.jar!/hive-log4j2.properties Async: true
    Hive Session ID = 539ebc53-ff74-4def-b715-321edbe50966
    OK
    mytable.id	mytable.name
    Time taken: 2.709 seconds
  • 2. Run -f to execute the SQL statement in the script

      1. Create hivef.sql and write the SQL statements in it
      2. Execute the SQL in the file, and redirect the results to a result file
     hive -f hivef.sql  > /opt/module/datas/hive_result.txt
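Putting the two steps together — a sketch where the file path and table name are illustrative, and the hive invocation is guarded so the snippet is safe to run on a machine without Hive:

```shell
# 1. Create the SQL file (table name is hypothetical)
cat > /tmp/hivef.sql <<'EOF'
select * from mydb.mytable;
EOF

# 2. Execute it with -f and redirect the result to a file,
#    but only if the hive CLI is actually installed
if command -v hive >/dev/null 2>&1; then
    hive -f /tmp/hivef.sql > /tmp/hive_result.txt
fi
```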

2.7 Other Hive Command Operations

  • 1. Exit the Hive window

    hive (default)> exit;
    hive (default)> quit;

    In the old version there was a difference between the two:

    • exit: implicitly commits data, then exits
    • quit: exits without committing data.
  • 2. How do I view HDFS file systems in the Hive CLI

     hive(default)>dfs -ls /;
  • 3. View all historical commands entered in Hive

    • 1. Go to the root directory /root or /home/lei of the current user
    • 2. View the.hivehistory file

2.8 Configuring Common Hive Properties

2.8.1 Configuring Hive Run Logs
  • 1. Hive logs are stored in /tmp/root/hive.log (/tmp/<current user name>/hive.log) by default.

  • 2. Modify hive logs to save logs in /opt/software/hive/logs

    • 1. Rename hive/conf/hive-log4j2.properties.template to hive-log4j2.properties
     [root@hadoop101 conf]# mv hive-log4j2.properties.template hive-log4j2.properties
     [root@hadoop101 conf]# vim hive-log4j2.properties
    • 2. Modify the configuration inside
     property.hive.log.dir = /opt/software/hive/logs
2.8.2 Parameter Setting Mode
  • 1. View the current configuration information

     hive> set;
  • 2. There are three ways to configure parameters

    • 1. Configuration file

      • The default configuration file is hive-default.xml
      • User-defined configuration file: hive-site.xml
      • Note: User-defined configurations override the default configurations. Hive also reads Hadoop's configuration, because Hive is started as a Hadoop client, and Hive's configuration overrides Hadoop's.
    • 2. Command line parameter mode

      • During Hive startup, you can add --hiveconf param=value on the command line to set a parameter. The setting is valid only for this Hive session
       hive --hiveconf mapred.reduce.tasks=10
      • View parameter Settings:

         hive (default)> set mapred.reduce.tasks;
    • 3. Parameter declaration method

      You can use the SET keyword in HQL to SET parameters, which are valid only for this Hive startup.

       hive (default)> set mapred.reduce.tasks=100;

      The priority of the three settings increases in turn: configuration file < command-line parameter < parameter declaration. Note that some system-level parameters, such as log4j-related settings, must be set in the first two ways, because they are read before the session is established.