
Getting started with Hive

1. What is Hive

Hive: a data statistics tool, open-sourced by Facebook, for processing massive structured logs.

Hive is a data warehouse tool based on Hadoop. Hive maps structured data files into a table and provides SQL-like query functions.

In essence, Hive converts HQL into MapReduce programs; a short sketch follows the list below.

  • Data processed by Hive is stored in HDFS
  • The underlying implementation of Hive data analysis is MapReduce
  • The execution program runs on Yarn
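
For example, here is a minimal sketch (the table name and HDFS path are hypothetical): a structured file already in HDFS is mapped to a table, and an SQL-like query over it is compiled into a MapReduce job that runs on Yarn.

    -- Map a structured HDFS file to a table (hypothetical table and path)
    CREATE EXTERNAL TABLE access_log (ip STRING, url STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/access_log';

    -- This SQL-like query is converted into a MapReduce job and executed on Yarn
    SELECT ip, count(*) AS pv FROM access_log GROUP BY ip;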

2. Advantages and disadvantages of Hive

2.1 Advantages

  • The operation interface uses SQL-like syntax, which provides rapid development capability (simple and easy to use).
  • It eliminates the need to write MapReduce programs, lowering the learning cost for developers.
  • Hive's execution latency is high, so it is commonly used for data analysis scenarios that do not demand real-time results.
  • Hive's advantage lies in processing big data; it has no advantage for small data, because of its high execution latency.
  • Hive supports user-defined functions, so users can implement functions tailored to their own requirements (a sketch follows this list).
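
As a sketch of the last point: a UDF packaged in a jar can be registered and invoked from HQL. The jar path, class name, and table below are hypothetical; the ADD JAR / CREATE TEMPORARY FUNCTION syntax itself is standard HiveQL.

    -- Hypothetical jar, class, and table; the registration syntax is standard HiveQL
    ADD JAR /opt/module/hive-3.1.2/lib/my_udf.jar;
    CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.MyLower';
    SELECT my_lower(url) FROM access_log;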

2.2 Disadvantages

  • HQL's expressive power is limited
    • Iterative algorithms cannot be expressed
    • Hive is not good at data mining; constrained by the MapReduce processing model, more efficient algorithms cannot be implemented
  • Hive is relatively inefficient
    • The MapReduce jobs Hive generates automatically are usually not intelligent enough
    • Hive tuning is difficult and coarse-grained

3. Hive architecture principle

  • User interface: Client
    • CLI (Command-line interface), JDBC/ODBC(JDBC Access to Hive), and WEBUI (Browser Access to Hive)
  • Metadata: Metastore
    • Metadata includes the table name, the database the table belongs to (default if not specified), the owner of the table, column/partition fields, the type of the table (whether it is an external table), and the directory where the table data resides.
    • By default, metadata is stored in Hive's embedded Derby database; using MySQL to store the Metastore is recommended.
  • Hadoop
    • HDFS is used for storage and MapReduce is used for calculation.
  • Driver
    • Parser (SQL Parser): converts the SQL string into an abstract syntax tree (AST), usually with a third-party library such as ANTLR, and then checks the AST: whether the tables and fields exist and whether the SQL semantics are wrong.
    • Compiler (Logical Plan): compiles the AST into a logical execution plan.
    • Optimizer (Query Optimizer): optimizes the logical execution plan.
    • Executor (Execution): converts the logical execution plan into a physical plan that can be run; for Hive, that means MapReduce (or Spark) jobs.

Hive receives SQL from the user through one of its interactive interfaces, uses its Driver together with the MetaStore to translate the SQL into a MapReduce job, submits the job to Hadoop for execution, and finally returns the execution results to the user interface.
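
One way to observe this pipeline is Hive's EXPLAIN statement, which prints the stage plan the Driver produces for a query before it is submitted to Hadoop (the table is the hypothetical one from the earlier sketch):

    -- Print the stage plan Hive generates instead of running the query
    EXPLAIN SELECT ip, count(*) FROM access_log GROUP BY ip;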

4. Comparing Hive with databases

Hive uses the SQL-like Hive Query Language (HQL), so it is easy to mistake Hive for a database. In fact, apart from sharing a similar query language, Hive has little in common with databases structurally. This section explains the differences between Hive and databases from several angles. Databases can be used in online applications, while Hive is designed for data warehouses; keeping this in mind helps in understanding Hive's characteristics from an application perspective.

4.1. Query language

SQL is widely used in data warehouses, so an SQL-like query language, HQL, was designed around Hive's characteristics. Developers familiar with SQL can easily use Hive for development.

4.2 Data Storage Location

Hive is built on Hadoop, and all Hive data is stored in HDFS. Databases can store data on block devices or local file systems.

4.3 Data update

Hive is designed for data warehouse applications, and data warehouse content is read often but written rarely. Therefore, rewriting data is not recommended in Hive; all data is determined at load time. Data in a database, on the other hand, usually needs to be modified frequently, so you can use INSERT INTO ... VALUES to add data and UPDATE ... SET to modify it.
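
As a sketch of the contrast (table names are hypothetical; note that Hive does support UPDATE/DELETE, but only on transactional tables):

    -- Hive: data is written at load time and appended afterwards
    LOAD DATA INPATH '/data/access_log/2021-11-19' INTO TABLE access_log;
    INSERT INTO access_log VALUES ('10.0.0.1', '/index.html');

    -- RDBMS: in-place modification is routine
    UPDATE users SET name = 'moe' WHERE id = 1;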

4.4 Indexes

Hive does not process or scan the data during loading, and therefore does not build indexes on any keys in the data. To access specific values that satisfy a condition, Hive must scan the whole data set by brute force, which leads to high access latency. Thanks to MapReduce, Hive can access the data in parallel, so even without indexes Hive still has an advantage when accessing large amounts of data. In a database, one or more columns are usually indexed, so the database can access a small amount of data that meets certain conditions efficiently and with low latency. This high access latency makes Hive unsuitable for online data queries.

4.5 Execution

Most Hive queries are executed using MapReduce provided by Hadoop. Databases usually have their own execution engine.

4.6. Execution delay

Hive has no indexes and scans the entire table during a query, which results in high latency. Another factor behind Hive's high execution latency is the MapReduce framework: MapReduce itself has high latency, so queries executed through MapReduce inherit it. By contrast, database execution latency is low. Of course, this low latency is conditional on the data size being small; once the data size exceeds the processing capacity of the database, Hive's parallel computing clearly has the advantage.

4.7 Scalability

Since Hive is built on top of Hadoop, Hive's scalability is consistent with Hadoop's (the largest Hadoop cluster in the world, at Yahoo! in 2009, was about 4,000 nodes). Databases, by contrast, have very limited scalability because of their ACID semantics; Oracle, the most advanced parallel database, has a theoretical scaling limit of only about 100 nodes.

4.8. Data scale

Hive is built on clusters and can use MapReduce for parallel computing, so it can support large amounts of data. Correspondingly, the database can support a smaller data scale.

Hive Installation

1. Hive installation address

  • Hive official website address

    • hive.apache.org/
  • Document viewing address

    • cwiki.apache.org/confluence/…
  • Download address

    • archive.apache.org/dist/hive/
  • GitHub address

    • github.com/apache/hive

2. Hive installation and deployment

2.1. Install Hive

  • Upload apache-hive-3.1.2-bin.tar.gz to the /opt/software directory of Linux

  • Decompress apache-hive-3.1.2-bin.tar.gz into the /opt/module/ directory

    [moe@hadoop102 software]$ tar -zxvf apache-hive-3.1.2-bin.tar.gz -C /opt/module/
  • Rename the apache-hive-3.1.2-bin directory to hive-3.1.2

    [moe@hadoop102 module]$ mv apache-hive-3.1.2-bin hive-3.1.2
  • Modify /etc/profile.d/my_env.sh to add environment variables

    [moe@hadoop102 module]$ vim /etc/profile.d/my_env.sh
  • Add content

    #HIVE_HOME
    export HIVE_HOME=/opt/module/hive-3.1.2
    export PATH=$PATH:$HIVE_HOME/bin
  • Resolve log Jar package conflicts

    [moe@hadoop102 lib]$ mv log4j-slf4j-impl-2.10.0.jar log4j-slf4j-impl-2.10.0.bak
  • Initialize the metadata database

    [moe@hadoop102 hive-3.1.2]$ bin/schematool -dbType derby -initSchema

2.2. Start and use Hive

Note: The Hadoop cluster must be started first!! cwiki.apache.org/confluence/…

  • Start Hive

    [moe@hadoop102 hive-3.1.2]$ bin/hive
  • Using Hive

    hive> show databases;
    hive> show tables;
    hive> create table test(id int);
    hive> insert into test values(1);
    hive> select * from test;
  • Open another CRT terminal window, start Hive again, and monitor the hive.log file in /tmp/moe

    Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /opt/module/hive-3.1.2/metastore_db.

    Hive uses the embedded Derby metadata database by default. While one Hive client has it open, the metadata database is locked and cannot be shared with other clients, so the Hive metadata store needs to be changed to MySQL.

3. MySQL installation

  • Check whether MySQL has been installed on the current system

    [moe@hadoop102 hive-3.1.2]$ rpm -qa | grep mariadb
    mariadb-libs-5.5.56-2.el7.x86_64
    # If it is present, uninstall it as follows
    [moe@hadoop102 hive-3.1.2]$ sudo rpm -e --nodeps mariadb-libs
  • Copy the MySQL installation package to the /opt/software directory

    [moe@hadoop102 software]$ ll
    -rw-r--r--. 1 moe moe 609556480 11月 19 15:30 mysql-5.7.28-1.el7.x86_64.rpm-bundle.tar
  • Decompress the MySQL installation package

    [moe@hadoop102 software]$ tar -xf mysql-5.7.28-1.el7.x86_64.rpm-bundle.tar
  • Install the RPM packages, in order, in the directory where they were extracted

    sudo rpm -ivh mysql-community-common-5.7.28-1.el7.x86_64.rpm
    sudo rpm -ivh mysql-community-libs-5.7.28-1.el7.x86_64.rpm
    sudo rpm -ivh mysql-community-libs-compat-5.7.28-1.el7.x86_64.rpm
    sudo rpm -ivh mysql-community-client-5.7.28-1.el7.x86_64.rpm
    sudo rpm -ivh mysql-community-server-5.7.28-1.el7.x86_64.rpm
  • Delete all contents of the directory pointed to by datadir in /etc/my.cnf, if any:

    View the value of datadir:

    [mysqld]
    datadir=/var/lib/mysql

    Delete all contents of /var/lib/mysql directory:

     [moe@hadoop102 mysql]$ cd /var/lib/mysql
     [moe@hadoop102 mysql]$ sudo rm -rf ./*
  • Initializing the database

    [moe@hadoop102 opt]$ sudo mysqld --initialize --user=mysql
  • View the temporary password of user root

    [moe@hadoop102 opt]$ sudo cat /var/log/mysqld.log
    2021-11-19T07:41:32.459045Z 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
    2021-11-19T07:41:32.602719Z 0 [Warning] InnoDB: New log files created, LSN=45790
    2021-11-19T07:41:32.638515Z 0 [Warning] InnoDB: Creating foreign key constraint system tables.
    2021-11-19T07:41:32.645648Z 0 [Warning] No existing UUID has been found, so we assume that this is the first time that this server has been started. Generating a new UUID: 1d96221d-490c-11ec-ba3f-000c296d2c15.
    2021-11-19T07:41:32.646270Z 0 [Warning] Gtid table is not ready to be used. Table 'mysql.gtid_executed' cannot be opened.
    2021-11-19T07:41:34.241353Z 0 [Warning] CA certificate ca.pem is self signed.
    2021-11-19T07:41:34.493581Z 1 [Note] A temporary password is generated for root@localhost: Ek/9.yjp#AN
  • Start the MySQL service

    [moe@hadoop102 opt]$ sudo systemctl start mysqld
  • Logging in to the MySQL database

    [moe@hadoop102 opt]$ mysql -uroot -p
    Enter password:
  • You must change the password of user root first; otherwise, errors will be reported when you perform other operations

    set password = password("New password");
  • Modify the root user in the mysql.user table so that it can connect from any host

    mysql> update mysql.user set host='%' where user='root';
    Query OK, 1 row affected (0.00 sec)
    Rows matched: 1  Changed: 1  Warnings: 0
    
    mysql> flush privileges;
    Query OK, 0 rows affected (0.00 sec)

4. Configure Hive metadata to MySQL

4.1 Copy the driver

Copy the MySQL JDBC driver to the Hive lib directory

[moe@hadoop102 software]$ cp mysql-connector-java-5.1.37.jar $HIVE_HOME/lib/

4.2. Configure Metastore to MySQL

  • Create the hive-site.xml file in the $HIVE_HOME/conf directory

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
    
            <!-- JDBC connection URL -->
            <property>
                    <name>javax.jdo.option.ConnectionURL</name>
                    <value>jdbc:mysql://hadoop102:3306/metastore?useSSL=false</value>
            </property>

            <!-- JDBC connection driver -->
            <property>
                    <name>javax.jdo.option.ConnectionDriverName</name>
                    <value>com.mysql.jdbc.Driver</value>
            </property>

            <!-- JDBC connection username -->
            <property>
                    <name>javax.jdo.option.ConnectionUserName</name>
                    <value>root</value>
            </property>

            <!-- JDBC connection password -->
            <property>
                    <name>javax.jdo.option.ConnectionPassword</name>
                    <value>xxx</value>
            </property>

            <!-- Hive metadata store version verification -->
            <property>
                    <name>hive.metastore.schema.verification</name>
                    <value>false</value>
            </property>

            <!-- Metadata store authorization -->
            <property>
                    <name>hive.metastore.event.db.notification.api.auth</name>
                    <value>false</value>
            </property>

            <!-- Hive default working directory in HDFS -->
            <property>
                    <name>hive.metastore.warehouse.dir</name>
                    <value>/user/hive/warehouse</value>
            </property>
    
    </configuration>
  • Log in to MySQL

    [moe@hadoop102 conf]$ mysql -uroot -p
    Enter password:
  • Create the Hive metadata database

    mysql>  create database metastore;
  • Initialize the Hive metadata database (a quick verification follows below)

    [moe@hadoop102 conf]$ schematool -initSchema -dbType mysql -verbose
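
    If the initialization succeeded, the metastore database now holds Hive's schema tables. A quick way to verify from the MySQL prompt (DBS and TBLS are tables of the standard metastore schema):

    mysql> use metastore;
    mysql> show tables;                          -- expect DBS, TBLS, COLUMNS_V2, ...
    mysql> select NAME, DB_LOCATION_URI from DBS;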

4.3. Start Hive again

  • Start Hive

    [moe@hadoop102 hive-3.1.2]$ hive
  • Using Hive

    hive> show databases;
    hive> show tables;
    hive> create table test (id int);
    hive> insert into test values(1);
    hive> select * from test;
  • Open another CRT terminal window and start Hive; this time both clients can run at the same time

5. Use the metadata service to access Hive

  • Add the following configuration information to the hive-site.xml file

    <!-- Specify the address for connecting to the metadata store -->
    <property>
            <name>hive.metastore.uris</name>
            <value>thrift://hadoop102:9083</value>
    </property>
  • Start the metastore

    [moe@hadoop102 conf]$ hive --service metastore
  • Start Hive

    [moe@hadoop102 hive-3.1.2]$ hive

6. Use JDBC to access Hive

  • Add the following configuration information to the hive-site.xml file

    
    <!-- Host address to which hiveserver2 binds -->
    <property>
            <name>hive.server2.thrift.bind.host</name>
            <value>hadoop102</value>
    </property>
    
    <!-- hiveserver2 connection port number -->
    <property>
            <name>hive.server2.thrift.port</name>
            <value>10000</value>
    </property>
  • Start the hiveserver2

    [moe@hadoop102 conf]$ hive --service hiveserver2
  • Start the Beeline client (wait a little longer)

    Note: you need to configure a proxy user for moe in core-site.xml!!

    <property>
            <name>hadoop.proxyuser.moe.hosts</name>
            <value>*</value>
    </property>
    <property>
            <name>hadoop.proxyuser.moe.groups</name>
            <value>*</value>
    </property>
    [moe@hadoop102 hive-3.1.2]$ bin/beeline -u jdbc:hive2://hadoop102:10000 -n moe
  • See the following interface (an example of running HQL from this prompt follows below)

    Connecting to jdbc:hive2://hadoop102:10000
    Connected to: Apache Hive (version 3.1.2)
    Driver: Hive JDBC (version 3.1.2)
    Transaction isolation: TRANSACTION_REPEATABLE_READ
    Beeline version 3.1.2 by Apache Hive
    0: jdbc:hive2://hadoop102:10000>
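
    From this prompt you can run ordinary HQL, just as in the Hive CLI. For example:

    0: jdbc:hive2://hadoop102:10000> show databases;
    0: jdbc:hive2://hadoop102:10000> select * from test;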

7. Common Hive interaction commands

[moe@hadoop102 bin]$ hive -help
usage: hive
 -d,--define <key=value>          Variable substitution to apply to Hive
                                  commands. e.g. -d A=B or --define A=B
    --database <databasename>     Specify the database to use
 -e <quoted-query-string>         SQL from command line
 -f <filename>                    SQL from files
 -H,--help                        Print help information
    --hiveconf <property=value>   Use value for given property
    --hivevar <key=value>         Variable substitution to apply to Hive
                                  commands. e.g. --hivevar A=B
 -i <filename>                    Initialization SQL file
 -S,--silent                      Silent mode in interactive shell
 -v,--verbose                     Verbose mode (echo executed SQL to the
                                  console)
  • “-e” executes an SQL statement from the command line without entering the Hive interactive shell

    [moe@hadoop102 bin]$ hive -e "select * from test;"
  • “-f” executes the SQL statements in a script file

    • Create a datas directory under /opt/module/hive-3.1.2/ and create a hivef.sql file in the datas directory

      [moe@hadoop102 datas]$ vim hivef.sql
    • Write the correct SQL statement to the file

      select * from test;
    • Execute the SQL statement in the file

      [moe@hadoop102 datas]$ hive -f hivef.sql
    • Execute the SQL statements in the file and write the results to the file

      [moe@hadoop102 datas]$ hive -f hivef.sql > hive_result.txt

8. Other Hive commands

  • Exit the Hive window

    hive(default)>exit;
    hive(default)>quit;
  • How to view the HDFS file system in the Hive CLI

    hive> dfs -ls /;
    Found 2 items
    drwx------   - moe supergroup          0 2021-11-19 13:26 /tmp
    drwx------   - moe supergroup          0 2021-11-19 13:29 /user
  • View all history commands entered in Hive

    • Access the root directory of the current user

      [moe@hadoop102 datas]$ cd ~
    • View the .hivehistory file

9. Configure common Hive properties

9.1. Configuring Hive Run Log Information

  • Hive logs are stored in /tmp/<username>/hive.log by default (for this user, /tmp/moe/hive.log)

  • Change Hive's log directory to /opt/module/hive-3.1.2/logs

    • Rename /opt/module/hive-3.1.2/conf/hive-log4j2.properties.template to hive-log4j2.properties

      [moe@hadoop102 conf]$ mv hive-log4j2.properties.template hive-log4j2.properties
    • Change the log storage location in the hive-log4j2.properties file

      property.hive.log.dir=/opt/module/hive-3.1.2/logs

9.2. Print the current library and table header

Add the following two configurations to hive-site.xml

<!-- Print the header -->
<property>
	<name>hive.cli.print.header</name>
	<value>true</value>
</property>

<!-- Print the current database -->
<property>
	<name>hive.cli.print.current.db</name>
	<value>true</value>
</property>

9.3. Parameter Configuration Mode

  • View all current configurations

    hive> set;
  • There are three ways to configure parameters

    • Configuration file Mode

      The default configuration file is hive-default.xml

      User-defined configuration file: hive-site.xml

      Note: user-defined configuration overrides the default configuration. Hive also reads the Hadoop configuration, because Hive starts as a Hadoop client, and the Hive configuration overrides the Hadoop configuration. Settings in the configuration files apply to all Hive processes started on the host.

    • Command line parameter mode

      When starting Hive, you can add -hiveconf param=value on the CLI to set parameters.

      [moe@hadoop102 hive-3.1.2]$ bin/hive -hiveconf mapred.reduce.tasks=10

      Note: this setting is valid only for the current Hive session

      View parameter Settings:

      hive (default)> set mapred.reduce.tasks;
    • Parameter declaration mode

      You can use the SET keyword in HQL to set parameters

      Such as:

      hive (default)> set mapred.reduce.tasks=100;

      Note: this setting is valid only for the current Hive session.

      Viewing Parameter Settings

      hive (default)> set mapred.reduce.tasks;

      The priority of the three methods increases in that order: configuration file < command-line parameter < parameter declaration. Note that some system-level parameters, such as log4j-related settings, must be set by the first two methods, because they are read before the session is established.