Background

The author recently needed to build a data platform and found that installing and running components such as Hadoop and Hive on Windows involves a lot of pitfalls. It still took several evenings after work, with the help of a number of online references, to get a working Hadoop and Hive development environment on Windows 10. This article records the specific steps, the problems encountered, and the corresponding solutions throughout the setup process.

Environment preparation

Out of the author's preference for up-to-date software versions, every selected component uses the highest version available at the time of writing (2020-10-30).

Software | Version | Note
Windows | 10 | Operating system
JDK | 8 | Do not use JDK 9 or later for now; the virtual machine fails to start with an unknown exception
MySQL | 8.x | Used to store the Hive metadata
Apache Hadoop | 3.3.0 |
Apache Hive | 3.1.2 |
Apache Hive src | 1.2.2 | Only the 1.x versions of the Hive source package provide the .bat startup scripts; this package is not needed if you can write the scripts yourself
winutils | hadoop-3.3.0 | Startup dependencies for Hadoop on Windows

The following lists the download addresses of some components:

  • Apache Hadoop 3.3.0: https://mirror.bit.edu.cn/apache/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
  • Apache Hive 3.1.2: https://mirrors.bfsu.edu.cn/apache/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
  • Apache Hive 1.2.2 src: https://mirrors.bfsu.edu.cn/apache/hive/hive-1.2.2/apache-hive-1.2.2-src.tar.gz
  • winutils: https://github.com/kontext-tech/winutils (if the download is slow, you can first import the repository into gitee.com and download from there, or use the repository that the author has already synchronized: https://gitee.com/throwableDoge/winutils)

After downloading these packages, install MySQL as a system service that starts with the system, and extract hadoop-3.3.0.tar.gz, apache-hive-3.1.2-bin.tar.gz, apache-hive-1.2.2-src.tar.gz and winutils to the specified directories:

After extracting, copy all the files under the bin directory of the apache-hive-1.2.2-src package into the bin directory of apache-hive-3.1.2-bin.

Then copy the hadoop.dll and winutils.exe files from the hadoop-3.3.0\bin directory of winutils into the bin folder of the extracted Hadoop directory:
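For reference, a sketch of the copy from a command prompt, assuming the winutils repository was also extracted under E:\LittleData (adjust the paths to your own layout):

copy E:\LittleData\winutils\hadoop-3.3.0\bin\hadoop.dll E:\LittleData\hadoop-3.3.0\bin\
copy E:\LittleData\winutils\hadoop-3.3.0\bin\winutils.exe E:\LittleData\hadoop-3.3.0\bin\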

Finally, configure the JAVA_HOME and HADOOP_HOME environment variables, and append %JAVA_HOME%\bin and %HADOOP_HOME%\bin to the Path variable:

The JDK version installed on the author's machine is 1.8.0_212; in theory any minor version of JDK 8 works.

Test from the command line. If the above steps are correct, the console prints the JDK and Hadoop version information.
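For example, both of the following commands should print version banners (assuming the environment variables above have taken effect; the exact output depends on your installation):

java -version
hadoop version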

Configure and start Hadoop

In the etc\hadoop subdirectory of HADOOP_HOME, find and modify the following configuration files:

core-site.xml (the tmp directory must be configured to a concrete, non-default directory; do not use the default tmp directory, otherwise you will run into permission assignment failures later)

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>  
    <property>
        <name>hadoop.tmp.dir</name>
        <value>E:/LittleData/hadoop-3.3.0/data/tmp</value>
    </property>  
</configuration>

hdfs-site.xml (create the nameNode and dataNode subdirectories under HADOOP_HOME/data first)

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.http.address</name>
        <value>0.0.0.0:50070</value>
    </property>
    <property>    
        <name>dfs.namenode.name.dir</name>    
        <value>E:/LittleData/hadoop-3.3.0/data/nameNode</value>
    </property>    
    <property>    
        <name>dfs.datanode.data.dir</name>    
        <value>E:/LittleData/hadoop-3.3.0/data/dataNode</value>
    </property>
    <property>
        <name>dfs.permissions.enabled</name>
        <value>false</value>
    </property>
</configuration>

mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

yarn-site.xml

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
</configuration>

At this point, the minimal configuration is complete. Next, format the NameNode and start the Hadoop services. Go to the $HADOOP_HOME/bin directory and run hdfs namenode -format (do not run this formatting command repeatedly):
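A sketch of this step from the command prompt, using the HADOOP_HOME value assumed in this article:

cd /d %HADOOP_HOME%\bin
hdfs namenode -format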

After formatting the NameNode, switch to the $HADOOP_HOME/sbin directory and run the start-all.cmd script:
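For reference, the commands are roughly:

cd /d %HADOOP_HOME%\sbin
start-all.cmd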

The script prints a warning that start-all.cmd is deprecated and recommends using start-dfs.cmd and start-yarn.cmd instead (stop-all.cmd prints a similar message, and stop-dfs.cmd and stop-yarn.cmd can be used in its place). After start-all.cmd runs successfully, four JVM instances are started (as shown in the shell window above, four tabs are created automatically), and the running JVM instances can be listed with jps:

λ jps
19408 ResourceManager
16324 NodeManager
14792 Jps
15004 NameNode
2252 DataNode

You can see that the ResourceManager, NodeManager, NameNode, and DataNode processes have all started, so the standalone Hadoop instance is up (to exit the four processes later, run the stop-all.cmd script). The scheduling task status can be viewed at http://localhost:8088/.

To view HDFS status and files, visit http://localhost:50070/.

To restart Hadoop, run the stop-all.cmd script and then the start-all.cmd script.

Configure and start Hive

Hive is built on top of HDFS, so make sure Hadoop has been started. The default Hive file path prefix in HDFS is /user/hive/warehouse, so this folder can be created in HDFS with the following commands:

hdfs dfs -mkdir /user/hive/warehouse
hdfs dfs -chmod -R 777 /user/hive/warehouse

You also need to run the following commands to create the /tmp directory and grant permissions on it:

hdfs dfs -mkdir /tmp
hdfs dfs -chmod -R 777 /tmp

Add a HIVE_HOME system variable with the value E:\LittleData\apache-hive-3.1.2-bin, and append %HIVE_HOME%\bin to the Path variable, similar to the earlier configuration of HADOOP_HOME. Then download a mysql-connector-java-8.0.x.jar and copy it into $HIVE_HOME/lib:
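A hedged sketch of this step from the command prompt (the connector file name 8.0.22 is only an example, setx only takes effect in newly opened shells, and %HIVE_HOME%\bin still needs to be appended to Path manually):

setx HIVE_HOME "E:\LittleData\apache-hive-3.1.2-bin"
copy mysql-connector-java-8.0.22.jar E:\LittleData\apache-hive-3.1.2-bin\lib\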

To create the Hive configuration files, copy and rename the configuration file templates in the $HIVE_HOME/conf directory as follows (the corresponding copy commands are sketched after the list):

  • $HIVE_HOME/conf/hive-default.xml.template => $HIVE_HOME/conf/hive-site.xml
  • $HIVE_HOME/conf/hive-env.sh.template => $HIVE_HOME/conf/hive-env.sh
  • $HIVE_HOME/conf/hive-exec-log4j.properties.template => $HIVE_HOME/conf/hive-exec-log4j.properties
  • $HIVE_HOME/conf/hive-log4j.properties.template => $HIVE_HOME/conf/hive-log4j.properties
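The renaming can be done from a command prompt roughly as follows (file names as listed above):

cd /d %HIVE_HOME%\conf
copy hive-default.xml.template hive-site.xml
copy hive-env.sh.template hive-env.sh
copy hive-exec-log4j.properties.template hive-exec-log4j.properties
copy hive-log4j.properties.template hive-log4j.properties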

Add the following information to the end of the hive-env.sh script:

export HADOOP_HOME=E:\LittleData\hadoop-3.3.0
export HIVE_CONF_DIR=E:\LittleData\apache-hive-3.1.2-bin\conf
export HIVE_AUX_JARS_PATH=E:\LittleData\apache-hive-3.1.2-bin\lib

Modify the following properties in the hive-site.xml file:

Property name | Value | Note
hive.metastore.warehouse.dir | /user/hive/warehouse | This is the Hive default value
hive.exec.scratchdir | /tmp/hive | This is the Hive default value
javax.jdo.option.ConnectionURL | jdbc:mysql://localhost:3306/hive?characterEncoding=UTF-8&amp;serverTimezone=UTC | Connection URL of the database that stores the Hive metadata
javax.jdo.option.ConnectionDriverName | com.mysql.cj.jdbc.Driver | Driver of the database that stores the Hive metadata
javax.jdo.option.ConnectionUserName | root | User of the database that stores the Hive metadata
javax.jdo.option.ConnectionPassword | root | Password of the database that stores the Hive metadata
hive.exec.local.scratchdir | E:/LittleData/apache-hive-3.1.2-bin/data/scratchDir | Create the local directory $HIVE_HOME/data/scratchDir first
hive.downloaded.resources.dir | E:/LittleData/apache-hive-3.1.2-bin/data/resourcesDir | Create the local directory $HIVE_HOME/data/resourcesDir first
hive.querylog.location | E:/LittleData/apache-hive-3.1.2-bin/data/querylogDir | Create the local directory $HIVE_HOME/data/querylogDir first
hive.server2.logging.operation.log.location | E:/LittleData/apache-hive-3.1.2-bin/data/operationDir | Create the local directory $HIVE_HOME/data/operationDir first
datanucleus.autoCreateSchema | true | Optional
datanucleus.autoCreateTables | true | Optional
datanucleus.autoCreateColumns | true | Optional
hive.metastore.schema.verification | false | Optional
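As an illustration, the metadata connection part of hive-site.xml ends up looking roughly like this (values taken from the table above; adjust the URL, user and password to your own MySQL instance):

<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive?characterEncoding=UTF-8&amp;serverTimezone=UTC</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>root</value>
</property>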

Before initializing the metadata, create a database named hive in MySQL; the name must match the one used in the javax.jdo.option.ConnectionURL property above.
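A minimal way to create it from the command line (assuming the root account from the table above; the character set is left at the server default here):

mysql -u root -p -e "CREATE DATABASE IF NOT EXISTS hive;"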

To initialize the Hive metadata, run the following script in the $HIVE_HOME/bin directory:

hive.cmd --service schematool -dbType mysql -initSchema

There is a small pitfall here: line 3215 of the hive-site.xml file contains a strange, unparseable character:

This unrecognized character causes the Hive command to throw an exception, so it must be removed. When the console prints "Initialization script completed" and "schemaTool completed", the metadata database has been initialized.

Run hive.cmd in the $HIVE_HOME/bin directory to connect to Hive:

> hive.cmd

Try creating a table t_test:

hive>  create table t_test(id INT,name string);
hive>  show tables;

Check http://localhost:50070/ to confirm that the t_test table has been created.

Try executing a write statement and a query statement:

hive>  insert into t_test(id,name) values(1,'throwx');
hive>  select * from t_test;

It took more than 30 seconds to write and 0.165 seconds to read.

Connect to Hive using JDBC

HiveServer2 is the Hive server-side interface module; it must be started so that remote clients can write to and query Hive. The module is currently based on Thrift RPC, is an improved version of HiveServer, and supports multi-client access, authentication and other features. The following common HiveServer2 properties can be modified in the hive-site.xml configuration file:

Property name | Value | Note
hive.server2.thrift.min.worker.threads | 5 | Minimum number of worker threads; the default is 5
hive.server2.thrift.max.worker.threads | 500 | Maximum number of worker threads; the default is 500
hive.server2.thrift.port | 10000 | TCP port to listen on; the default is 10000
hive.server2.thrift.bind.host | 127.0.0.1 | Host to bind to; the default is 127.0.0.1
hive.execution.engine | mr | Execution engine; the default is mr
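For example, to move HiveServer2 off the default port (the author later uses 10091 in the JDBC test), the relevant fragment of hive-site.xml would look something like this:

<property>
    <name>hive.server2.thrift.bind.host</name>
    <value>127.0.0.1</value>
</property>
<property>
    <name>hive.server2.thrift.port</name>
    <value>10091</value>
</property>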

To start HiveServer2, run the following command in the $HIVE_HOME/bin directory:

hive.cmd --service hiveserver2

The hadoop-common and hive-jdbc dependencies must be added on the client side, and their versions should match the deployed Hadoop and Hive versions.

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>3.3.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>3.1.2</version>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-jdbc</artifactId>
    <version>2.3.5.RELEASE</version>
</dependency>

The hadoop-common dependency chain is quite long and pulls in a large number of transitive dependencies, so it is best to let the Maven dependency download run when you have spare time. (There is also an issue downloading the org.glassfish:javax.el snapshot package, but it does not affect normal use.) Finally, add a unit test class HiveJdbcTest:

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import lombok.Data;
import lombok.extern.slf4j.Slf4j;
import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;
import org.springframework.jdbc.core.JdbcTemplate;

import java.util.ArrayList;
import java.util.List;

@Slf4j
public class HiveJdbcTest {

    private static JdbcTemplate TEMPLATE;
    private static HikariDataSource DS;

    @BeforeClass
    public static void beforeClass() throws Exception {
        HikariConfig config = new HikariConfig();
        config.setDriverClassName("org.apache.hive.jdbc.HiveDriver");
        // The port is not the default 10000 because the author changed it in hive-site.xml
        // config.setJdbcUrl("jdbc:hive2://127.0.0.1:10091");
        config.setJdbcUrl("jdbc:hive2://127.0.0.1:10091/db_test");
        DS = new HikariDataSource(config);
        TEMPLATE = new JdbcTemplate(DS);
    }

    @AfterClass
    public static void afterClass() throws Exception {
        DS.close();
    }

    @Test
    public void testCreateDb() throws Exception {
        TEMPLATE.execute("CREATE DATABASE db_test");
    }

    @Test
    public void testCreateTable() throws Exception {
        TEMPLATE.execute("CREATE TABLE IF NOT EXISTS t_student(id INT,name string,major string)");
        log.info("Created table t_student successfully");
    }

    @Test
    public void testInsert() throws Exception {
        int update = TEMPLATE.update("INSERT INTO TABLE t_student(id,name,major) VALUES(?,?,?)", p -> {
            p.setInt(1, 10087);
            p.setString(2, "throwable");
            p.setString(3, "math");
        });
        log.info("Wrote to t_student successfully, number of updated records: {}", update);  // the returned update count is 0
    }

    @Test
    public void testSelect() throws Exception {
        List<Student> result = TEMPLATE.query("SELECT * FROM t_student", rs -> {
            List<Student> list = new ArrayList<>();
            while (rs.next()) {
                Student student = new Student();
                student.setId(rs.getLong("id"));
                student.setName(rs.getString("name"));
                student.setMajor(rs.getString("major"));
                list.add(student);
            }
            return list;
        });
        // Output: [HiveJdbcTest.Student(id=10087, name=throwable, major=math)]
        log.info("Query t_student result: {}", result);
    }

    @Data
    private static class Student {

        private Long id;
        private String name;
        private String major;
    }
}

Possible problems

The following is a summary of possible problems.

The Java virtual machine fails to start

Do not use JDK 9 or later with Hadoop here; switch to any minor version of JDK 8.

The Hadoop execution file cannot be found

Ensure that the hadoop.dll and winutils.exe files from the hadoop-3.3.0\bin directory of winutils have been copied into the bin folder of the extracted Hadoop directory.

A "batch script cannot be found" error may also appear when the start-all.cmd script is executed. The solution is to add cd $HADOOP_HOME at the beginning of the start-all.cmd script, for example cd E:\LittleData\hadoop-3.3.0.

Cannot access localhost:50070

This usually happens because the dfs.http.address property was omitted from the hdfs-site.xml configuration:

<property>
    <name>dfs.http.address</name>
    <value>0.0.0.0:50070</value>
</property>

Then run stop-all.cmd followed by start-all.cmd to restart Hadoop.

The Hive connection to MySQL is abnormal

Make sure the MySQL driver package has been copied to the right place, $HIVE_HOME/lib, and check that the four properties such as javax.jdo.option.ConnectionURL are configured correctly. If they are, check whether the MySQL version is wrong or the driver version does not match the MySQL server version.

Hive cannot find the batch file

The error usually reads like "'xxx.cmd' is not recognized as an internal or external command…". The .cmd scripts in the bin directory of the Hive 1.x source package must be copied into the $HIVE_HOME/bin directory.
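A hedged sketch of the copy, assuming both packages were extracted under E:\LittleData (adjust the paths to your own layout):

xcopy /s /y E:\LittleData\apache-hive-1.2.2-src\bin\* E:\LittleData\apache-hive-3.1.2-bin\bin\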

Folder Permission Problem

A CreateSymbolicLink error can prevent Hive from writing data with the INSERT or LOAD commands. This can be solved in either of the following ways:

  • Press Win + R, run gpedit.msc, then go to Computer Configuration – Windows Settings – Security Settings – Local Policies – User Rights Assignment – Create symbolic links, and add the current user.

  • Alternatively, start cmd with the administrator account or with administrator privileges and run the corresponding scripts to start Hadoop or Hive.

SessionNotRunning exception

This exception is likely to appear when HiveServer2 starts or when an external client connects to HiveServer2; the concrete error is java.lang.ClassNotFoundException: org.apache.tez.dag.api.TezConfiguration. The solution is as follows: change the value of hive.execution.engine in the hive-site.xml file from tez to mr, and restart HiveServer2. Because Tez is not integrated, an error is still reported after the restart, but startup is automatically retried after 60000 ms (the retry generally succeeds):

This is left as a known issue; it does not affect client connections, but it makes startup about 60 seconds slower.

HiveServer2 Port conflict

Change the value of hive.server2.thrift.port in the hive-site.xml configuration file to an unused port and restart HiveServer2.

The node is stuck in safe mode

A SafeModeException is displayed, indicating that safe mode is ON. Run the hdfs dfsadmin -safemode leave command to leave safe mode.

AuthorizationException

This exception occurs when a JDBC client connects to the HiveServer2 service. The specific message is: User: xxx is not allowed to impersonate anonymous. In this case, you only need to modify the Hadoop configuration file core-site.xml and add:

<property>
    <name>hadoop.proxyuser.xxx.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.xxx.groups</name>
    <value>*</value>
</property>

Here xxx is the specific system user name in the error message; for example, the system user name of the author's development machine is doge.

Then restart the Hadoop service.

MapRedTask permission error

This exception commonly appears when Hive performs INSERT or LOAD operations through a JDBC client connected to the HiveServer2 service. The general description is: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. Permission denied: user=anonymous, access=EXECUTE, inode="/tmp/hadoop-yarn":xxxx:xxxx:drwx------. Run hdfs dfs -chmod -R 777 /tmp to grant the anonymous user read and write permission on the /tmp directory.

Summary

It is better to set up the Hadoop and Hive development environment directly on a Linux or Unix system; file path and permission issues on Windows cause many unexpected problems. This article drew on a large number of online materials as well as introductory books on Hadoop and Hive, which are not listed here one by one; it stands on the shoulders of giants.

(C-4-D E-A-20201102)

Personal blog

  • Throwable’s Blog