Hive architecture and deployment modes
[TOC]
Preface
This document is based on Hive 3.1.2.
Hive basics
Basic architecture
- Hive consists of HiveServer2 and Hive clients
- There are three types of Hive clients: Beeline, JDBC clients, and the Hive CLI
- The Hive server side consists of HiveServer2 and the Metastore
- The Metastore is Hive's metadata management component
- HCatalog, built on top of the Metastore, exposes a set of APIs that let other frameworks such as Pig and Flink use Hive's metadata management capabilities and work with data from a table perspective
- WebHCat exposes RESTful interfaces on top of HCatalog
- The actual Hive data is stored in Hadoop's HDFS
- Hue provides a graphical interface that facilitates SQL-based development, plus other extras
Metastore
Hive data is ultimately stored in HDFS, so how is that data presented as tables? The Metastore is responsible for storing each table's schema information, serialization information, storage location information, and so on.
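As a quick illustration (the database and table names here are hypothetical), DESCRIBE FORMATTED surfaces exactly the kind of information the Metastore keeps:

```bash
# The schema, SerDe, input/output formats, and HDFS location in the output
# are all served from the Metastore DB, not read from HDFS itself.
$HIVE_HOME/bin/beeline -u jdbc:hive2://$HS2_HOST:$HS2_PORT -e "DESCRIBE FORMATTED mydb.mytable;"
```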
The Metastore itself consists of two parts:
- the Metastore server
- the Metastore DB
In this classic architecture, as with any single Java application, the server is the application itself and the DB stores the data. However, there are three overall deployment modes for the Metastore.
Embedded service and database
The Metastore server and Metastore DB are deployed in embedded mode together with HiveServer2; the Metastore DB is an embedded Derby database.
Embedded service
The Metastore server is still deployed inside Hive, but the Metastore DB is a separately deployed MySQL instance.
Service and database deployed separately
In addition to the database being deployed independently, the Metastore service itself is also deployed independently.
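A minimal sketch of this fully separated mode, assuming a hypothetical metastore-host and the default Thrift port 9083; in practice hive.metastore.uris would live in hive-site.xml rather than on the command line:

```bash
# On the Metastore machine: start a standalone Metastore server
# (a Thrift service, listening on port 9083 by default).
$HIVE_HOME/bin/hive --service metastore &

# On the HiveServer2 machine: point Hive at the remote Metastore service.
$HIVE_HOME/bin/hiveserver2 --hiveconf hive.metastore.uris=thrift://metastore-host:9083
```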
HCatalog
HCatalog, built on top of the Metastore, exposes a set of APIs that let other frameworks such as Pig and Flink use Hive's metadata management capabilities and work with data from a table perspective.
Demo
```bash
hadoop distcp file:///file.dat hdfs://data/rawevents/20100819/data
hcat "alter table rawevents add partition (ds='20100819') location 'hdfs://data/rawevents/20100819/data'"
```
The commands above copy the file into HDFS and then, through HCatalog, register that data as a new partition of the rawevents table.
Clients
Local mode of the client
The embedded and remote Metastore deployments described above are from HiveServer2's perspective; HiveServer2 itself is deployed standalone. A Hive client, however, can either connect to an already deployed HiveServer2 in remote mode, or start a local HiveServer2, with its own Metastore, inside the client process. This distinction must be kept clear.
Beeline
Beeline is the new-generation client recommended by Hive; it talks to HiveServer2 via Thrift remote calls. Beeline's remote mode:
```bash
$HIVE_HOME/bin/hiveserver2                                  # start a standalone HiveServer2
$HIVE_HOME/bin/beeline -u jdbc:hive2://$HS2_HOST:$HS2_PORT  # then connect to it by host and port
```
Beeline's local mode:
```bash
$HIVE_HOME/bin/beeline -u jdbc:hive2://   # starts HiveServer2, the Metastore, and Beeline in one process; only for unit testing, not recommended otherwise
```
The difference between local and remote mode is whether a remote host and port are specified; if not, Beeline runs in local mode.
Beeline’s automatic mode
Each time we connect to the remote HiveServer2 through Beeline, we have to specify a long JDBC URL, which is cumbersome. If we want to connect to the remote HiveServer2 just by typing the beeline command, we can add a beeline-site.xml configuration file to the Hive configuration directory, with content like the following:
```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>beeline.hs2.jdbc.url.tcpUrl</name>
    <value>jdbc:hive2://localhost:10000/default;user=hive;password=hive</value>
  </property>
  <property>
    <name>beeline.hs2.jdbc.url.httpUrl</name>
    <value>jdbc:hive2://localhost:10000/default;user=hive;password=hive;transportMode=http;httpPath=cliservice</value>
  </property>
  <property>
    <name>beeline.hs2.jdbc.url.default</name>
    <value>tcpUrl</value>
  </property>
</configuration>
```
It configures two named Beeline connection profiles, each a concrete JDBC connection string: one over TCP and one over HTTP. The default is the TCP one.
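With that file in place, and per the behavior described above, connecting no longer needs the long URL:

```bash
$HIVE_HOME/bin/beeline   # expected to pick up the default profile (tcpUrl) from beeline-site.xml
```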
JDBC
JDBC connections to Hive also come in two modes:
- For a remote server, the URL format is jdbc:hive2://<host>:<port>/<db>;initFile=<file> (the default HiveServer2 port is 10000).
- For an embedded server, the URL format is jdbc:hive2:///;initFile=<file> (no host or port).
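As a usage sketch (the script path and its contents are hypothetical), the optional initFile parameter runs a script right after the session is established, which is handy for per-connection defaults:

```bash
# init.sql could contain statements such as "USE mydb;" or session-level SET commands.
beeline -u "jdbc:hive2://master:10000/default;initFile=/opt/hive/init.sql" -n hive
```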
Deployment
A full Hive deployment involves five components, of which the first three are required:
- Metastore deployment
- HiveServer2 deployment
- Client deployment
- HCatalog server deployment (optional)
- WebHCat server deployment (optional)
Here HiveServer2 is deployed with its embedded Metastore; only the Metastore's DB is pointed at MySQL.
In terms of physical servers, the three required components can be deployed on three different machines; the Hive client, of course, can be deployed on many machines.
All three components can be configured through hive-site.xml, but the configuration content they need is not the same. For example, HiveServer2 needs the meta DB connection information, while clients do not.
In addition to hive-site.xml, Hive supports configuration in several other places:
- via the --hiveconf parameter when a command is started, to specify custom configuration, e.g. bin/hive --hiveconf hive.exec.scratchdir=/tmp/mydir
- Metastore-related configuration is specified in the hivemetastore-site.xml file
- HiveServer2-exclusive configuration is specified in hiveserver2-site.xml
The precedence of these sources, from lowest to highest, is: hive-site.xml -> hivemetastore-site.xml -> hiveserver2-site.xml -> --hiveconf.
Therefore, the best configuration policy is:
- Put configuration shared by HiveServer2 and Hive clients in hive-site.xml, so the file can be distributed to multiple machines as-is
- Keep Metastore-related configuration in hivemetastore-site.xml, which exists only on the machines where the Metastore server is deployed, so meta-database-specific settings such as database passwords are not spread around
- Put HiveServer2-specific configuration in hiveserver2-site.xml
Deploying HiveServer2
In the following configuration, HiveServer2 is deployed with an embedded Metastore server and a remote Metastore DB.
On the machine where Hive is to be deployed, create a hive account (adduser hive) and add it to the hadoop group. All subsequent configuration and startup steps are performed as the hive user.
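A sketch of that account setup on a Debian-style system (assuming a hadoop group already exists):

```bash
sudo adduser hive             # create the hive account
sudo usermod -aG hadoop hive  # add it to the hadoop group
su - hive                     # perform the remaining steps as the hive user
```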
Download the Hive installation package, strip the .template suffix from the files under conf, and adjust them as required; in particular, hive-default.xml.template needs to be renamed to hive-site.xml.
hive-default.xml.template contains the default values of all Hive configuration options.
Detailed deployment document: cwiki.apache.org/confluence/…
Create paths for storing Hive data in HDFS
Create the following directories in HDFS and grant write permission to the hive user's group:
```bash
$HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
```
/user/hive/warehouse is where Hive table data is actually stored. It is only the default path; you can, of course, specify a new path by setting the hive.metastore.warehouse.dir property in hive-site.xml.
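For example, a sketch of pointing the warehouse at a hypothetical /data/hive/warehouse path (which you would first create and make group-writable in HDFS, as above):

```xml
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/data/hive/warehouse</value>
</property>
```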
Configure Hive environment variables
```bash
export HIVE_HOME=/opt/apache-hive-3.1.2-bin
export PATH=$HIVE_HOME/bin:$PATH
```
Configure the log output path
Hive's default log output path is /tmp/<user.name>/hive.log; if we start Hive as the hive user, that is /tmp/hive/hive.log.
On Linux, /tmp stores the intermediate state of running applications and is cleared automatically when the operating system restarts. You can of course modify Hive's log4j properties file and set hive.log.dir to a more durable location.
You can also set logging dynamically when starting HiveServer2: bin/hiveserver2 --hiveconf hive.root.logger=INFO,DRFA
Hive temporary file configuration
The Hive runtime also stores temporary files, called scratch files, both on the local host and in HDFS:
- HDFS: /tmp/hive-${user.name}
- Local: ${java.io.tmpdir}/${user.name}/, where ${java.io.tmpdir} defaults to /tmp and ${user.name} is the current hive user
If you have nothing to customize here, delete the corresponding configuration items, because a literal ${java.io.tmpdir}/${user.name}/ left in the XML is not recognized.
Configure and initialize the Metastore's database information
In the directory that holds hive-site.xml, create a hivemetastore-site.xml file to configure the Metastore information.
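A sketch of such a hivemetastore-site.xml, assuming a MySQL meta DB on a hypothetical mysql-host (the JDO property names are the standard ones; the credentials are placeholders):

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://mysql-host:3306/hive?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
  </property>
</configuration>
```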
Initialize the Metastore DB:
```bash
$HIVE_HOME/bin/schematool -dbType <db type> -initSchema
```
The dbType can be derby, oracle, mysql, mssql, or postgres.
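Once initialized, schematool can also be used to check the result; a sketch for a MySQL-backed Metastore:

```bash
$HIVE_HOME/bin/schematool -dbType mysql -info       # show the schema version recorded in the DB
$HIVE_HOME/bin/schematool -dbType mysql -validate   # validate the schema against this Hive release
```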
Reference: cwiki.apache.org/confluence/…
Start HiveServer2
```bash
$HIVE_HOME/bin/hiveserver2
```
The command above runs in the foreground; it is better to run it in the background with nohup:
```bash
nohup $HIVE_HOME/bin/hiveserver2 > /opt/apache-hive-3.1.2-bin/logs/hive_runtime_log.log < /dev/null &
```
After startup, port 10002 serves HiveServer2's web UI.
Basic client deployment
Package distribution
Copy the configured Hive distribution package to the machine where Beeline is to be started; that completes the client setup. The hivemetastore-site.xml and hiveserver2-site.xml files are not required on clients, and the remaining configuration can be adjusted to the actual machine environment.
Environment Variable Configuration
```bash
export HIVE_HOME=/opt/apache-hive-3.1.2-bin
export PATH=$HIVE_HOME/bin:$PATH
```
Log Path Configuration
As with HiveServer2, modify the log path configuration as required.
Startup
To connect to HiveServer2, run $HIVE_HOME/bin/beeline -u jdbc:hive2://$HS2_HOST:$HS2_PORT.
The default directory used by Hive in HDFS is /user/hive/warehouse. To avoid permission-related errors, add the -n parameter to the Beeline connection to specify the user the client acts as, and that user must have permissions on /user/hive/warehouse and its files; if it does not, grant them separately (see the sketch after the error below). The permission model is similar to Linux's. A permission problem typically produces an error like:
```
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: org.apache.hadoop.security.AccessControlException Permission denied: user=yarn, access=WRITE, inode="/user/hive/warehouse/test.db":hive:hadoop:drwxr-xr-x
```
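A sketch of granting the missing permissions, assuming as in the error above that the warehouse is owned by hive:hadoop and the connecting user belongs to the hadoop group:

```bash
# Either make the warehouse group-writable...
$HADOOP_HOME/bin/hadoop fs -chmod -R g+w /user/hive/warehouse
# ...or hand a specific subtree to a specific user.
$HADOOP_HOME/bin/hadoop fs -chown -R test:hadoop /user/hive/warehouse/test.db
```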
A connection with a user name looks like:
```bash
beeline -u jdbc:hive2://master:10000 -n hive
```
This connects to HiveServer2 as the hive user.
Full Beeline usage documentation: cwiki.apache.org/confluence/…
HiveServer2 high availability deployment
Server Configuration
On all machines that need to run HiveServer2, add the following to hiveserver2-site.xml:
```xml
<!-- enable high availability / dynamic service discovery -->
<property>
  <name>hive.server2.support.dynamic.service.discovery</name>
  <value>true</value>
</property>
<!-- the namespace to register under in ZooKeeper -->
<property>
  <name>hive.server2.zookeeper.namespace</name>
  <value>hiveserver2</value>
</property>
<!-- the ZooKeeper quorum addresses -->
<property>
  <name>hive.zookeeper.quorum</name>
  <value>master:2181,slave1:2181,slave2:2181</value>
</property>
```
At the same time, remember that all machines running HiveServer2 must have the same Metastore configuration, ensuring they connect to the same MySQL; you can copy hivemetastore-site.xml to every machine that needs to start HiveServer2.
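To verify that the instances have registered, you can list the configured namespace in ZooKeeper; a sketch using the standard ZooKeeper CLI:

```bash
# Each live HiveServer2 instance registers a child znode under the namespace.
zkCli.sh -server master:2181 ls /hiveserver2
```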
Reference: lxw1234.com/archives/20…
Client connection
The Beeline connection is made as follows:
beeline -u "jdbc:hive2://master:2181,slave1:2181,slave2:2181/; serviceDiscoveryMode=zooKeeper; zooKeeperNamespace=hiveserver2" -n hiveCopy the code
Note that the JDBC URL must be quoted
Authentication
Kerberos is chosen as the authentication option here. The following items need to be configured:
- hive.server2.authentication: the authentication mode, default NONE. Options are NONE (uses plain SASL), NOSASL, KERBEROS, LDAP, PAM and CUSTOM.
- For KERBEROS mode, also set:
  - hive.server2.authentication.kerberos.principal: the Kerberos principal for the server.
  - hive.server2.authentication.kerberos.keytab: the keytab for the server principal.
With Kerberos authentication in place, the Beeline connection becomes:
beeline -u "jdbc:hive2://master:2181,slave1:2181,slave2:2181/; serviceDiscoveryMode=zooKeeper; zooKeeperNamespace=hiveserver2"Copy the code
Before using the preceding command, make sure the current machine has completed Kerberos authentication through kinit; otherwise the beeline command will fail because no Kerberos ticket has been fetched.
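A sketch of obtaining the ticket first (the principal and realm are placeholders):

```bash
kinit test@EXAMPLE.COM   # obtain a TGT for the user who will own the queries
klist                    # confirm the ticket cache before connecting
```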
Beeline automatically reads the identity of the currently logged-in Kerberos user and carries it with the statements it executes:
```
INFO : Executing with tokens: [Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:haixue-hadoop, Ident: (token for test: HDFS_DELEGATION_TOKEN owner=test, renewer=yarn, realUser=hive/[email protected], issueDate=1594627135369, maxDate=1595231935369, sequenceNumber=66, masterKeyId=27)]
```
For example, if the current Kerberos TGT belongs to the test user, then the owner of the submitted HiveQL is test. The actual communication between components is authenticated as the hive/[email protected] principal, but authorization is enforced at the granularity of test.
This behavior is controlled by Hive's hive.server2.enable.doAs property: when it is true, SQL is executed as the submitting user.
In addition, it is best to set hive.server2.allow.user.substitution to false in the HiveServer2 configuration. That option lets a user be specified with the -n argument, which would let someone manipulate other people's databases and tables while holding only their own Kerberos credentials. Note that once this function is disabled, Hue can no longer submit jobs under the identity of the logged-in Hue user.
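A sketch of the corresponding hiveserver2-site.xml fragment combining the two settings discussed above:

```xml
<property>
  <name>hive.server2.enable.doAs</name>
  <value>true</value>
</property>
<property>
  <name>hive.server2.allow.user.substitution</name>
  <value>false</value>
</property>
```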
Client connection based on hive-site.xml
The connections to HiveServer2 recorded above are all made through JDBC. However, some applications that depend on Hive can only connect to HiveServer2 through hive-site.xml; Hue is a typical example.
Hue looks for the hive-site.xml and beeline-hs2-connection.xml configuration files in the directory given by the ${HIVE_CONF_DIR} environment variable, or in /etc/hive/conf, and reads them to connect to HiveServer2.
If Kerberos is configured for the cluster, the Kerberos authentication settings need to be present in hiveserver2-site.xml, for example:
```xml
<property>
  <name>hive.server2.authentication</name>
  <value>KERBEROS</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.principal</name>
  <value>hive/[email protected]</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.keytab</name>
  <value>/opt/keytab_store/hive.service.keytab</value>
</property>
```
If Hue connects through Beeline, you can also configure beeline-hs2-connection.xml to specify the proxy-user information used to connect to HiveServer2, as follows:
```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>beeline.hs2.connection.user</name>
    <value>hive</value>
  </property>
  <property>
    <name>beeline.hs2.connection.password</name>
    <value>hive</value>
  </property>
</configuration>
```
Common errors
Error 1: guava
An error is reported when the Metastore schema is initialized:
```
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
```
The version of the guava jar under Hadoop's share/hadoop/common/lib/ differs from the one in Hive's lib directory.
To fix it, delete Hive's guava jar and copy in Hadoop's.
Error 2: the MySQL driver
```
[hive@master bin]$ ./schematool -dbType mysql -initSchema
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/apache-hive-3.1.2-bin/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-3.2.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL:        jdbc:mysql://xxx.xx.xx.xx:3306/hive?createDatabaseIfNotExist=true
Metastore Connection Driver :    com.mysql.jdbc.Driver
Metastore connection User:       root
org.apache.hadoop.hive.metastore.HiveMetaException: Failed to load driver
Underlying cause: java.lang.ClassNotFoundException : com.mysql.jdbc.Driver
Use --verbose for detailed stacktrace.
*** schemaTool failed ***
```
Hive is missing the MySQL driver.
Install the MySQL driver jar into Hive's lib directory.
Error 3: impersonation
When connecting to HiveServer2 with the Beeline client via beeline -u jdbc:hive2://master:10000, the following error is reported:
```
Error: Could not open client transport with JDBC Uri: jdbc:hive2://master:10000: Failed to open new session: java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: hive is not allowed to impersonate anonymous (state=08S01,code=0)
Beeline version 3.1.2 by Apache Hive
```
We started HiveServer2 as the hive user, so no matter which user a client connects as, HiveServer2 ultimately accesses the Hadoop cluster as the hive user.
By default, Hadoop does not allow the hive user to act on the cluster as a proxy for other users; this has to be enabled in Hadoop's core-site.xml:
```xml
<!-- Allow the hive user, from the hosts master, slave1 and slave2,
     to impersonate users of any group on the Hadoop cluster -->
<property>
  <name>hadoop.proxyuser.hive.hosts</name>
  <value>master,slave1,slave2</value>
</property>
<property>
  <name>hadoop.proxyuser.hive.groups</name>
  <value>*</value>
</property>
<!-- The following is an alternative to hadoop.proxyuser.hive.groups; configure one or the
     other, not both. It lists the specific users that hive may impersonate from the hosts
     given in hadoop.proxyuser.hive.hosts -->
<property>
  <name>hadoop.proxyuser.hive.users</name>
  <value>user1,user2</value>
</property>
```
References
cwiki.apache.org/confluence/…
cwiki.apache.org/confluence/…
stackoverflow.com/questions/2…
cwiki.apache.org/confluence/…
Welcome to follow my personal account "North by Northwest UP", where I record the coding life, industry thoughts, and technology commentary.