
Getting started with Hive

1. What is Hive

Hive: a data statistics tool, open-sourced by Facebook, for processing massive structured logs.

Hive is a data warehouse tool based on Hadoop. Hive maps structured data files into a table and provides SQL-like query functions.

In essence, Hive converts HQL into MapReduce programs; a short sketch follows the list below.

  • Data processed by Hive is stored in HDFS
  • The underlying implementation of Hive data analysis is MapReduce
  • The execution program runs on Yarn
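
For example, here is a minimal sketch (the table name and HDFS path are hypothetical): a structured file already in HDFS is mapped to a table, and an SQL-like query over it is compiled into a MapReduce job that runs on Yarn.

    -- Map a structured HDFS file to a table (hypothetical table and path)
    CREATE EXTERNAL TABLE access_log (ip STRING, url STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/access_log';

    -- This SQL-like query is converted into a MapReduce job and executed on Yarn
    SELECT ip, count(*) AS pv FROM access_log GROUP BY ip;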

2. Advantages and disadvantages of Hive

2.1 Advantages

  • The operation interface uses SQL-like syntax, which provides rapid development capability (simple and easy to use).
  • It eliminates the need to write MapReduce programs, lowering the learning cost for developers.
  • Hive's execution latency is high, so it is commonly used for data analysis scenarios that do not demand real-time results.
  • Hive's advantage lies in processing big data; it has no advantage for small data, because of its high execution latency.
  • Hive supports user-defined functions, so users can implement functions tailored to their own requirements (a sketch follows this list).
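
As a sketch of the last point: a UDF packaged in a jar can be registered and invoked from HQL. The jar path, class name, and table below are hypothetical; the ADD JAR / CREATE TEMPORARY FUNCTION syntax itself is standard HiveQL.

    -- Hypothetical jar, class, and table; the registration syntax is standard HiveQL
    ADD JAR /opt/module/hive-3.1.2/lib/my_udf.jar;
    CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.MyLower';
    SELECT my_lower(url) FROM access_log;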

2.2 Disadvantages

  • HQL's expressive power is limited
    • Iterative algorithms cannot be expressed
    • Hive is not good at data mining; constrained by the MapReduce processing model, more efficient algorithms cannot be implemented
  • Hive is relatively inefficient
    • The MapReduce jobs Hive generates automatically are usually not intelligent enough
    • Hive tuning is difficult and coarse-grained

3. Hive architecture principle

  • User interface: Client
    • CLI (Command-line interface), JDBC/ODBC(JDBC Access to Hive), and WEBUI (Browser Access to Hive)
  • Metadata: Metastore
    • Metadata includes the table name, the database the table belongs to (default if not specified), the owner of the table, column/partition fields, the type of the table (whether it is an external table), and the directory where the table data resides.
    • By default, metadata is stored in Hive's embedded Derby database; using MySQL to store the Metastore is recommended.
  • Hadoop
    • HDFS is used for storage and MapReduce is used for calculation.
  • Driver
    • Parser (SQL Parser): converts the SQL string into an abstract syntax tree (AST), usually with a third-party library such as ANTLR, and then checks the AST: whether the tables and fields exist and whether the SQL semantics are wrong.
    • Compiler (Logical Plan): compiles the AST into a logical execution plan.
    • Optimizer (Query Optimizer): optimizes the logical execution plan.
    • Executor (Execution): converts the logical execution plan into a physical plan that can be run; for Hive, that means MapReduce (or Spark) jobs.

Hive receives SQL from the user through one of its interactive interfaces, uses its Driver together with the MetaStore to translate the SQL into a MapReduce job, submits the job to Hadoop for execution, and finally returns the execution results to the user interface.
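
One way to observe this pipeline is Hive's EXPLAIN statement, which prints the stage plan the Driver produces for a query before it is submitted to Hadoop (the table is the hypothetical one from the earlier sketch):

    -- Print the stage plan Hive generates instead of running the query
    EXPLAIN SELECT ip, count(*) FROM access_log GROUP BY ip;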

4. Comparing Hive with databases

Hive uses the SQL-like Hive Query Language (HQL), so it is easy to mistake Hive for a database. In fact, apart from sharing a similar query language, Hive has little in common with databases structurally. This section explains the differences between Hive and databases from several angles. Databases can be used in online applications, while Hive is designed for data warehouses; keeping this in mind helps in understanding Hive's characteristics from an application perspective.

4.1. Query language

SQL is widely used in data warehouses, so an SQL-like query language, HQL, was designed around Hive's characteristics. Developers familiar with SQL can easily use Hive for development.

4.2 Data Storage Location

Hive is built on Hadoop, and all Hive data is stored in HDFS. Databases can store data on block devices or local file systems.

4.3 Data update

Hive is designed for data warehouse applications, and data warehouse content is read often but written rarely. Therefore, rewriting data is not recommended in Hive; all data is determined at load time. Data in a database, on the other hand, usually needs to be modified frequently, so you can use INSERT INTO ... VALUES to add data and UPDATE ... SET to modify it.
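
As a sketch of the contrast (table names are hypothetical; note that Hive does support UPDATE/DELETE, but only on transactional tables):

    -- Hive: data is written at load time and appended afterwards
    LOAD DATA INPATH '/data/access_log/2021-11-19' INTO TABLE access_log;
    INSERT INTO access_log VALUES ('10.0.0.1', '/index.html');

    -- RDBMS: in-place modification is routine
    UPDATE users SET name = 'moe' WHERE id = 1;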

4.4 Indexes

Hive does not process or scan the data during loading, and therefore does not build indexes on any keys in the data. To access specific values that satisfy a condition, Hive must scan the whole data set by brute force, which leads to high access latency. Thanks to MapReduce, Hive can access the data in parallel, so even without indexes Hive still has an advantage when accessing large amounts of data. In a database, one or more columns are usually indexed, so the database can access a small amount of data that meets certain conditions efficiently and with low latency. This high access latency makes Hive unsuitable for online data queries.

4.5 Execution

Most Hive queries are executed using MapReduce provided by Hadoop. Databases usually have their own execution engine.

4.6. Execution delay

Hive has no indexes and scans the entire table during a query, which results in high latency. Another factor behind Hive's high execution latency is the MapReduce framework: MapReduce itself has high latency, so queries executed through MapReduce inherit it. By contrast, database execution latency is low. Of course, this low latency is conditional on the data size being small; once the data size exceeds the processing capacity of the database, Hive's parallel computing clearly has the advantage.

4.7 Scalability

Since Hive is built on top of Hadoop, Hive's scalability is consistent with Hadoop's (the largest Hadoop cluster in the world, at Yahoo! in 2009, was about 4,000 nodes). Databases, by contrast, have very limited scalability because of their ACID semantics; Oracle, the most advanced parallel database, has a theoretical scaling limit of only about 100 nodes.

4.8. Data scale

Hive is built on clusters and can use MapReduce for parallel computing, so it can support large amounts of data. Correspondingly, the database can support a smaller data scale.

Hive Installation

1. Hive installation address

  • Hive official website address

    • hive.apache.org/
  • Document viewing address

    • cwiki.apache.org/confluence/…
  • Download address

    • archive.apache.org/dist/hive/
  • GitHub address

    • github.com/apache/hive

2. Hive installation and deployment

2.1. Install Hive

  • Upload apache-hive-3.1.2-bin.tar.gz to the /opt/software directory of Linux

  • Decompress apache-hive-3.1.2-bin.tar.gz into the /opt/module/ directory

    [moe@hadoop102 software]$ tar -zxvf apache-hive-3.1.2-bin.tar.gz -C /opt/module/
  • Rename the apache-hive-3.1.2-bin directory to hive-3.1.2

    [moe@hadoop102 module]$ mv apache-hive-3.1.2-bin hive-3.1.2
  • Modify /etc/profile.d/my_env.sh to add environment variables

    [moe@hadoop102 module]$ vim /etc/profile.d/my_env.sh
  • Add content

    #HIVE_HOME
    export HIVE_HOME=/opt/module/hive-3.1.2
    export PATH=$PATH:$HIVE_HOME/bin
  • Resolve log Jar package conflicts

    [moe@hadoop102 lib]$ mv log4j-slf4j-impl-2.10.0.jar log4j-slf4j-impl-2.10.0.bak
  • Initialize the metadata database

    [moe@hadoop102 hive-3.1.2]$ bin/schematool -dbType derby -initSchema

2.2. Start and use Hive

Note: The Hadoop cluster must be started first!! cwiki.apache.org/confluence/…

  • Start Hive

    [moe@hadoop102 hive-3.1.2]$ bin/hive
  • Using Hive

    hive> show databases;
    hive> show tables;
    hive> create table test(id int);
    hive> insert into test values(1);
    hive> select * from test;
  • Open another CRT terminal window, start Hive again, and monitor the hive.log file in /tmp/moe

    Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /opt/module/hive-3.1.2/metastore_db.

    Hive uses the embedded Derby metadata database by default. While one Hive client has it open, the metadata database is locked and cannot be shared with other clients, so the Hive metadata store needs to be changed to MySQL.

3. MySQL installation

  • Check whether MySQL has been installed on the current system

    [moe@hadoop102 hive-3.1.2]$ rpm -qa | grep mariadb
    mariadb-libs-5.5.56-2.el7.x86_64
    # If it is present, uninstall it as follows
    [moe@hadoop102 hive-3.1.2]$ sudo rpm -e --nodeps mariadb-libs
  • Copy the MySQL installation package to the /opt/software directory

    [moe@hadoop102 software]$ ll
    -rw-r--r--. 1 moe moe 609556480 11月 19 15:30 mysql-5.7.28-1.el7.x86_64.rpm-bundle.tar
  • Decompress the MySQL installation package

    [moe@hadoop102 software]$ tar -xf mysql-5.7.28-1.el7.x86_64.rpm-bundle.tar
  • Install the RPM packages, in order, in the directory where they were extracted

    sudo rpm -ivh mysql-community-common-5.7.28-1.el7.x86_64.rpm
    sudo rpm -ivh mysql-community-libs-5.7.28-1.el7.x86_64.rpm
    sudo rpm -ivh mysql-community-libs-compat-5.7.28-1.el7.x86_64.rpm
    sudo rpm -ivh mysql-community-client-5.7.28-1.el7.x86_64.rpm
    sudo rpm -ivh mysql-community-server-5.7.28-1.el7.x86_64.rpm
  • Delete all contents of the directory pointed to by datadir in /etc/my.cnf, if any:

    View the value of datadir:

    [mysqld]
    datadir=/var/lib/mysql

    Delete all contents of /var/lib/mysql directory:

     [moe@hadoop102 mysql]$ cd /var/lib/mysql
     [moe@hadoop102 mysql]$ sudo rm -rf ./*
  • Initializing the database

    [moe@hadoop102 opt]$ sudo mysqld --initialize --user=mysql
  • View the temporary password of user root

    [moe@hadoop102 opt]$ sudo cat /var/log/mysqld.log
    2021-11-19T07:41:32.459045Z 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
    2021-11-19T07:41:32.602719Z 0 [Warning] InnoDB: New log files created, LSN=45790
    2021-11-19T07:41:32.638515Z 0 [Warning] InnoDB: Creating foreign key constraint system tables.
    2021-11-19T07:41:32.645648Z 0 [Warning] No existing UUID has been found, so we assume that this is the first time that this server has been started. Generating a new UUID: 1d96221d-490c-11ec-ba3f-000c296d2c15.
    2021-11-19T07:41:32.646270Z 0 [Warning] Gtid table is not ready to be used. Table 'mysql.gtid_executed' cannot be opened.
    2021-11-19T07:41:34.241353Z 0 [Warning] CA certificate ca.pem is self signed.
    2021-11-19T07:41:34.493581Z 1 [Note] A temporary password is generated for root@localhost: Ek/9.yjp#AN
  • Start the MySQL service

    [moe@hadoop102 opt]$ sudo systemctl start mysqld
  • Logging in to the MySQL database

    [moe@hadoop102 opt]$ mysql -uroot -p
    Enter password:
  • You must change the password of user root first; otherwise, errors will be reported when you perform other operations

    set password = password("New password");
  • Modify the root user in the mysql.user table so that it can connect from any host

    mysql> update mysql.user set host='%' where user='root';
    Query OK, 1 row affected (0.00 sec)
    Rows matched: 1  Changed: 1  Warnings: 0
    
    mysql> flush privileges;
    Query OK, 0 rows affected (0.00 sec)

4. Configure Hive metadata to MySQL

4.1 Copy the driver

Copy the MySQL JDBC driver to the Hive lib directory

[moe@hadoop102 software]$ cp mysql-connector-java-5.1.37.jar $HIVE_HOME/lib/

4.2. Configure Metastore to MySQL

  • Create the hive-site.xml file in the $HIVE_HOME/conf directory

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
    
            <!-- JDBC connection URL -->
            <property>
                    <name>javax.jdo.option.ConnectionURL</name>
                    <value>jdbc:mysql://hadoop102:3306/metastore?useSSL=false</value>
            </property>

            <!-- JDBC connection driver -->
            <property>
                    <name>javax.jdo.option.ConnectionDriverName</name>
                    <value>com.mysql.jdbc.Driver</value>
            </property>

            <!-- JDBC connection username -->
            <property>
                    <name>javax.jdo.option.ConnectionUserName</name>
                    <value>root</value>
            </property>

            <!-- JDBC connection password -->
            <property>
                    <name>javax.jdo.option.ConnectionPassword</name>
                    <value>xxx</value>
            </property>

            <!-- Hive metadata store version verification -->
            <property>
                    <name>hive.metastore.schema.verification</name>
                    <value>false</value>
            </property>

            <!-- Metadata store authorization -->
            <property>
                    <name>hive.metastore.event.db.notification.api.auth</name>
                    <value>false</value>
            </property>

            <!-- Hive default working directory in HDFS -->
            <property>
                    <name>hive.metastore.warehouse.dir</name>
                    <value>/user/hive/warehouse</value>
            </property>
    
    </configuration>
  • Log in to MySQL

    [moe@hadoop102 conf]$ mysql -uroot -p
    Enter password:
  • Create the Hive metadata database

    mysql>  create database metastore;
  • Initialize the Hive metadata database (a quick verification follows below)

    [moe@hadoop102 conf]$ schematool -initSchema -dbType mysql -verbose
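
    If the initialization succeeded, the metastore database now holds Hive's schema tables. A quick way to verify from the MySQL prompt (DBS and TBLS are tables of the standard metastore schema):

    mysql> use metastore;
    mysql> show tables;                          -- expect DBS, TBLS, COLUMNS_V2, ...
    mysql> select NAME, DB_LOCATION_URI from DBS;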

4.3. Start Hive again

  • Start Hive

    [moe@hadoop102 hive-3.1.2]$ hive
  • Using Hive

    hive> show databases;
    hive> show tables;
    hive> create table test (id int);
    hive> insert into test values(1);
    hive> select * from test;
  • Open another CRT terminal window and start Hive; this time both clients can run at the same time

5. Use the metadata service to access Hive

  • Add the following configuration information to the hive-site.xml file

    <!-- Specify the address for connecting to the metadata store -->
    <property>
            <name>hive.metastore.uris</name>
            <value>thrift://hadoop102:9083</value>
    </property>
  • Start the metastore

    [moe@hadoop102 conf]$ hive --service metastore
  • Start Hive

    [moe@hadoop102 hive-3.1.2]$ hive

6. Use JDBC to access Hive

  • Add the following configuration information to the hive-site.xml file

    
    <!-- Host address to which hiveserver2 binds -->
    <property>
            <name>hive.server2.thrift.bind.host</name>
            <value>hadoop102</value>
    </property>
    
    <!-- hiveserver2 connection port number -->
    <property>
            <name>hive.server2.thrift.port</name>
            <value>10000</value>
    </property>
  • Start the hiveserver2

    [moe@hadoop102 conf]$ hive --service hiveserver2
  • Start the Beeline client (wait a little longer)

    Note: you need to configure a proxy user for moe in core-site.xml!!

    <property>
            <name>hadoop.proxyuser.moe.hosts</name>
            <value>*</value>
    </property>
    <property>
            <name>hadoop.proxyuser.moe.groups</name>
            <value>*</value>
    </property>
    [moe@hadoop102 hive-3.1.2]$ bin/beeline -u jdbc:hive2://hadoop102:10000 -n moe
  • See the following interface (an example of running HQL from this prompt follows below)

    Connecting to jdbc:hive2://hadoop102:10000
    Connected to: Apache Hive (version 3.1.2)
    Driver: Hive JDBC (version 3.1.2)
    Transaction isolation: TRANSACTION_REPEATABLE_READ
    Beeline version 3.1.2 by Apache Hive
    0: jdbc:hive2://hadoop102:10000>
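
    From this prompt you can run ordinary HQL, just as in the Hive CLI. For example:

    0: jdbc:hive2://hadoop102:10000> show databases;
    0: jdbc:hive2://hadoop102:10000> select * from test;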

7. Common Hive interaction commands

[moe@hadoop102 bin]$ hive -help
usage: hive
 -d,--define <key=value>          Variable substitution to apply to Hive
                                  commands. e.g. -d A=B or --define A=B
    --database <databasename>     Specify the database to use
 -e <quoted-query-string>         SQL from command line
 -f <filename>                    SQL from files
 -H,--help                        Print help information
    --hiveconf <property=value>   Use value for given property
    --hivevar <key=value>         Variable substitution to apply to Hive
                                  commands. e.g. --hivevar A=B
 -i <filename>                    Initialization SQL file
 -S,--silent                      Silent mode in interactive shell
 -v,--verbose                     Verbose mode (echo executed SQL to the
                                  console)
  • “-e” executes an SQL statement from the command line without entering the Hive interactive shell

    [moe@hadoop102 bin]$ hive -e "select * from test;"
  • “-f” executes the SQL statements in a script file

    • Create a datas directory under /opt/module/hive-3.1.2/ and create a hivef.sql file in the datas directory

      [moe@hadoop102 datas]$ vim hivef.sql
    • Write the correct SQL statement to the file

      select * from test;
    • Execute the SQL statement in the file

      [moe@hadoop102 datas]$ hive -f hivef.sql
    • Execute the SQL statements in the file and write the results to the file

      [moe@hadoop102 datas]$ hive -f hivef.sql > hive_result.txt

8. Other Hive commands

  • Exit the Hive window

    hive(default)>exit;
    hive(default)>quit;
  • How to view the HDFS file system in the Hive CLI

    hive> dfs -ls /;
    Found 2 items
    drwx------   - moe supergroup          0 2021-11-19 13:26 /tmp
    drwx------   - moe supergroup          0 2021-11-19 13:29 /user
  • View all history commands entered in Hive

    • Access the root directory of the current user

      [moe@hadoop102 datas]$ cd ~
    • View the .hivehistory file

9. Configure common Hive properties

9.1. Configuring Hive Run Log Information

  • Hive logs are stored in /tmp/<username>/hive.log by default (for this user, /tmp/moe/hive.log)

  • Change Hive's log directory to /opt/module/hive-3.1.2/logs

    • Rename /opt/module/hive-3.1.2/conf/hive-log4j2.properties.template to hive-log4j2.properties

      [moe@hadoop102 conf]$ mv hive-log4j2.properties.template hive-log4j2.properties
    • Change the log storage location in the hive-log4j2.properties file

      property.hive.log.dir=/opt/module/hive-3.1.2/logs

9.2. Print the current library and table header

Add the following two configurations to hive-site.xml

<!-- Print the header -->
<property>
	<name>hive.cli.print.header</name>
	<value>true</value>
</property>

<!-- Print the current database -->
<property>
	<name>hive.cli.print.current.db</name>
	<value>true</value>
</property>

9.3. Parameter Configuration Mode

  • View all current configurations

    hive> set;
  • There are three ways to configure parameters

    • Configuration file Mode

      The default configuration file is hive-default.xml

      User-defined configuration file: hive-site.xml

      Note: user-defined configuration overrides the default configuration. Hive also reads the Hadoop configuration, because Hive starts as a Hadoop client, and the Hive configuration overrides the Hadoop configuration. Settings in the configuration files apply to all Hive processes started on the host.

    • Command line parameter mode

      When starting Hive, you can add -hiveconf param=value on the CLI to set parameters.

      [moe@hadoop102 hive-3.1.2]$ bin/hive -hiveconf mapred.reduce.tasks=10

      Note: this setting is valid only for the current Hive session

      View parameter Settings:

      hive (default)> set mapred.reduce.tasks;
    • Parameter declaration mode

      You can use the SET keyword in HQL to set parameters

      Such as:

      hive (default)> set mapred.reduce.tasks=100;

      Note: this setting is valid only for the current Hive session.

      Viewing Parameter Settings

      hive (default)> set mapred.reduce.tasks;

      The priority of the three methods increases in that order: configuration file < command-line parameter < parameter declaration. Note that some system-level parameters, such as log4j-related settings, must be set by the first two methods, because they are read before the session is established.