Hadoop In Action part 1

Author | WenasWei

Preface

Having introduced the overall architecture of Hadoop, the offline batch-processing technology, we will now learn how to install, configure, and use Hadoop. The following points are covered:

  • Linux environment configuration and installation of Hadoop
  • The three Hadoop installation modes
  • Installation in local mode
  • Installation in pseudo-cluster mode

One Linux environment configuration and installation of Hadoop

Hadoop requires some basic preparation of the Linux environment: adding a Hadoop user group and user, configuring key-based (password-free) login, and installing the JDK.

1.1 Ubuntu Network Configuration in VMWare

When installing the Ubuntu 18.04 Linux operating system in VMware, the system network configuration is covered in a separate shared blog post; CSDN link: VMWare Ubuntu network configuration.

It includes the following important steps (a brief command sketch follows the list):

  • Ubuntu system information and changing the host name
  • Configuring the NAT network for VMware on Windows
  • Linux gateway setup and static IP configuration
  • Modifying the Linux hosts file
  • Linux password-free login
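
The linked post covers these steps in detail; as a quick reference, here is a minimal sketch of the host-name, hosts-file, and static-IP part. The host name hadoop1 and IP 192.168.254.130 match the values used later in this article, while the netplan file name, interface name ens33, gateway, and DNS are assumptions to adjust for your own environment:

# Set the host name (assumed: hadoop1)
$ hostnamectl set-hostname hadoop1

# Map the static IP to the host name (assumed IP: 192.168.254.130)
$ echo "192.168.254.130 hadoop1" >> /etc/hosts

# Ubuntu 18.04 configures static IPs with netplan; interface name, gateway
# and DNS below are assumptions
$ cat <<'EOF' > /etc/netplan/01-netcfg.yaml
network:
  version: 2
  ethernets:
    ens33:
      dhcp4: no
      addresses: [192.168.254.130/24]
      gateway4: 192.168.254.2
      nameservers:
        addresses: [192.168.254.2]
EOF
$ netplan apply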

1.2 Adding Hadoop User groups and Users

1.2.1 Adding Hadoop User groups and users

Log in to the Linux Ubuntu 18.04 VM as the root user and run the following commands:

$ groupadd hadoop
$ useradd -r -g hadoop hadoop
1.2.2 Granting Directory Permissions to Hadoop Users

To grant the hadoop user ownership of the /usr/local, /tmp, and /home directories, run the following commands:

$ chown -R hadoop:hadoop /usr/local/
$ chown -R hadoop:hadoop /tmp/
$ chown -R hadoop:hadoop /home/
1.2.3 Granting Sudo Permission to the Hadoop User

Add the line "hadoop ALL=(ALL:ALL) ALL" below "root ALL=(ALL:ALL) ALL" in /etc/sudoers:

$ vi /etc/sudoers

Defaults        env_reset
Defaults        mail_badpass
Defaults        secure_path="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin"
root    ALL=(ALL:ALL) ALL
hadoop  ALL=(ALL:ALL) ALL
%admin ALL=(ALL) ALL
%sudo   ALL=(ALL:ALL) ALL
1.2.4 Assigning a Hadoop User login Password
$ passwd hadoop
Enter new UNIX password:        # enter a new password
Retype new UNIX password:       # confirm the new password
passwd: password updated successfully

1.3 JDK installation

Install the JDK on Linux by referring to the shared blog post Logstash - Dataflow Engine, Section 3: Logstash Installation (subsection 3.2, Installing the JDK on Linux), and apply the configuration on each host; CSDN link: Logstash - Dataflow Engine.
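
For convenience, a minimal sketch of the JDK installation is shown here, assuming the JDK 1.8.0_152 tarball (jdk-8u152-linux-x64.tar.gz) and the /usr/local/java directory used by the environment variables in section 1.5; refer to the linked post for the full procedure:

# Assumed: the JDK tarball has already been downloaded to the current directory
$ mkdir -p /usr/local/java
$ tar -zxvf jdk-8u152-linux-x64.tar.gz -C /usr/local/java

# Add JAVA_HOME to /etc/profile (the same exports appear again in section 1.5)
$ vi /etc/profile
export JAVA_HOME=/usr/local/java/jdk1.8.0_152
export PATH=$JAVA_HOME/bin:$PATH

$ source /etc/profile
$ java -version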

1.4 Downloading Hadoop from the official website

Download the binary package from the official website: hadoop.apache.org/releases.ht…

  • Download with wget (the download goes to the current directory):

For example, version 3.3.0 from the mirror: mirrors.bfsu.edu.cn/apache/hado…

$ wget https://mirrors.bfsu.edu.cn/apache/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
  • Move the archive to the target directory /usr/local and unpack it:
$ mv ./hadoop-3.3.0.tar.gz /usr/local

$ cd /usr/local

$ tar -zvxf hadoop-3.3.0.tar.gz

1.5 Configuring the Hadoop Environment

  • Modify the configuration file /etc/profile:
Add the following lines:
export JAVA_HOME=/usr/local/java/jdk1.8.0_152
export JRE_HOME=/usr/local/java/jdk1.8.0_152/jre
export CLASSPATH=$CLASSPATH:$JAVA_HOME/lib:$JAVA_HOME/jre/lib
export HADOOP_HOME=/usr/local/hadoop-3.3.0
export PATH=$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$PATH:$HOME/bin:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
  • Make the configuration file take effect
$ source /etc/profile 
  • Check whether the Hadoop configuration is successful
$ hadoop version
Hadoop 3.3.0
Source code repository https://gitbox.apache.org/repos/asf/hadoop.git -r aa96f1871bfd858f9bac59cf2a81ec470da649af
Compiled by brahma on 2020-07-06T18:44Z
Compiled with protoc 3.7.1
From source with checksum 5dc29b802d6ccd77b262ef9d04d19c4
This command was run using /usr/local/hadoop-3.3.0/share/hadoop/common/hadoop-common-3.3.0.jar

The Hadoop version is Hadoop 3.3.0, indicating that the Hadoop environment is successfully installed and configured.

Two Hadoop installation modes

Hadoop provides three different installation modes: single-machine mode, pseudo-cluster mode, and cluster mode.

2.1 Single-machine Mode

Single-machine mode (local mode) is the default, non-distributed mode of Hadoop: everything runs in a single Java process, no extra configuration is required, and it is convenient for debugging, tracing, and troubleshooting. You only need to configure JAVA_HOME in Hadoop's hadoop-env.sh file.

In local single-machine deployment mode, the hadoop jar command runs the program and writes the results to the local disk.

2.2 Pseudo-cluster mode

In pseudo-cluster mode, Hadoop runs in a pseudo-distributed fashion on a single node (a single point of failure): the Hadoop daemons run as separate Java processes, the node acts as both NameNode and DataNode, and files are read from HDFS. Logically it provides the same operating environment as cluster mode, but physically it is deployed on a single server; cluster mode is deployed on multiple servers and is therefore fully distributed physically as well.

In pseudo-cluster mode, you need to configure JAVA_HOME in Hadoop's hadoop-env.sh file, the number of HDFS replicas, the YARN address, and password-free SSH login on the server. The hadoop jar command then runs the Hadoop program and writes the result to HDFS.
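
The password-free SSH login mentioned above can be configured with a few commands; a minimal sketch for a single node (run as the user that will start the Hadoop processes) is:

# Generate an RSA key pair (press Enter at each prompt)
$ ssh-keygen -t rsa

# Authorize the public key for login to this node
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys

# Verify that logging in no longer asks for a password
$ ssh localhost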

2.3 Cluster Mode

The cluster mode, also known as the full cluster mode, is fundamentally different from the pseudo-cluster mode. The cluster mode is a fully distributed cluster implemented on physical servers and deployed on multiple physical servers. The pseudo-cluster mode is logically a cluster mode, but it is deployed on a single physical server.

In production environments, Hadoop must be highly reliable and available: the failure of a single node must not make the entire cluster unavailable. In addition, data in production must be reliable and recoverable when a data node fails or data is lost. This is why Hadoop must be deployed in cluster mode in production to meet these requirements.

Cluster deployment is the most complex of the three installation modes. It must be deployed on multiple physical servers, and the server environment must be planned in advance, including the file system used by Hadoop, the number of HDFS replicas, and the YARN address. You also need to configure password-free SSH login between servers, RPC communication between Hadoop nodes, the automatic NameNode failover mechanism, and HA high availability. In addition, the distributed coordination service ZooKeeper must be installed and configured.
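
As an illustration of the planning step only, a hypothetical three-node layout could map host names to static IPs in /etc/hosts on every node and list the worker nodes in Hadoop's workers file (the host names and IPs below are assumptions, not part of this article's single-node setup):

# /etc/hosts (identical on all nodes; example values)
192.168.254.130 hadoop1
192.168.254.131 hadoop2
192.168.254.132 hadoop3

# $HADOOP_HOME/etc/hadoop/workers (nodes that run DataNode/NodeManager)
hadoop2
hadoop3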

In cluster mode, the hadoop jar command runs the program and writes the results to HDFS.

Three single-machine mode

3.1 Modifying the Hadoop Configuration File

In single-machine deployment mode, modify the Hadoop configuration file hadoop-env.sh and add the Java environment path:

$ vi /usr/local/hadoop-3.3.0/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/local/java/jdk1.8.0_152

3.2 Creating test data files

  • Create the directory /home/hadoop/input:
$ mkdir -p /home/hadoop/input
  • Create the test data file data.input:
$ cd /home/hadoop/input/
$ vi data.input
# Write the following data content:
hadoop mapreduce hive flume
hbase spark storm flume
sqoop hadoop hive kafka
spark hadoop storm

3.3 Running Hadoop Test Cases

Run Hadoop’s MapReduce sample program to count the number of words in a specified file.

  • Run the MapReduce program of Hadoop:
$ hadoop jar /usr/local/hadoop-3.3.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar wordcount /home/hadoop/input/data.input /home/hadoop/output
  • The general format is described as follows:

    • hadoop jar: the Hadoop command-line format for running a MapReduce program.
    • /usr/local/hadoop-3.3.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar: the full path to the JAR package containing Hadoop's example MapReduce programs.
    • wordcount: selects the word-count MapReduce program, because hadoop-mapreduce-examples-3.3.0.jar contains multiple MapReduce programs (the note at the end of this section shows how to list them).
  • The remaining parameters are described as follows.

    • /home/hadoop/input/data.input: the full local path of the input data file;
    • /home/hadoop/output: the local output directory for the result data. Do not create this directory manually; the Hadoop program creates it automatically.
  • Result:

2021-06-02 01:08:40,374 INFO mapreduce.Job:  map 100% reduce 100%
2021-06-02 01:08:40,375 INFO mapreduce.Job: Job job_local794874982_0001 completed successfully
  • Viewing the result files

The /home/hadoop/output directory and the generated files are as follows:

$ cd /home/hadoop/output
$ ll
total 20
drwxr-xr-x 2 root root 4096 Jun  2 01:08 ./
drwxr-xr-x 4 root root 4096 Jun  2 01:08 ../
-rw-r--r-- 1 root root   76 Jun  2 01:08 part-r-00000
-rw-r--r-- 1 root root   12 Jun  2 01:08 .part-r-00000.crc
-rw-r--r-- 1 root root    0 Jun  2 01:08 _SUCCESS
-rw-r--r-- 1 root root    8 Jun  2 01:08 ._SUCCESS.crc

View the statistics file part-r-00000:

flume	2
hadoop	3
hbase	1
hive	2
kafka	1
mapreduce	1
spark	2
sqoop	1
storm	2
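
As mentioned in the parameter description above, hadoop-mapreduce-examples-3.3.0.jar bundles several MapReduce programs besides wordcount. Running the jar without a program name should print the list of valid program names, which is a handy way to explore the other examples:

# Prints the list of bundled example programs (wordcount, grep, pi, ...)
$ hadoop jar /usr/local/hadoop-3.3.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar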

Four Installation in pseudo-cluster mode

Hadoop runs in pseudo-distributed mode on a single node, with the Hadoop daemons running as separate Java processes. The node acts as both NameNode and DataNode, and files are read from HDFS. The configuration files to be modified are core-site.xml, hdfs-site.xml, and mapred-site.xml; each configuration item is declared through the name and value of a property.

4.1 Pseudo-Cluster File Configuration

To configure Hadoop in pseudo-cluster mode, in addition to the hadoop-env.sh file you need to configure four files: core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml, all in the same directory as hadoop-env.sh. The purposes of these files are as follows:

4.1.1 core-site.xml

fs.defaultFS specifies the NameNode address, and hadoop.tmp.dir is the base directory on which the Hadoop file system depends; many other paths are derived from it. If the NameNode and DataNode storage locations are not specified in hdfs-site.xml, they default to subdirectories of this path.

  • The configuration file core-site.xml is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/local/hadoop-3.3.0/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop1:9000</value>
    </property>
</configuration>

Note: hadoop1 is the configured host name.

4.1.2 hdfs-site.xml

Configure the directory where NameNode and DataNode files are stored and the number of copies.

  • The configuration file hdfs-site.xml is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/usr/local/hadoop-3.3.0/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/usr/local/hadoop-3.3.0/tmp/dfs/data</value>
    </property>
</configuration>

Note: pseudo-distributed mode has only one node, so dfs.replication must be set to 1; cluster mode requires at least three nodes. The storage locations of the NameNode and DataNode are also configured here.

4.1.3 mapred-site.xml

In earlier versions of Hadoop there is no mapred-site.xml file; you need to rename mapred-site.xml.template to mapred-site.xml. The properties below also set HADOOP_MAPRED_HOME to ${HADOOP_HOME}.

  • The configuration file mapred-site.xml is as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
    </property>
    <property>
        <name>mapreduce.map.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
    </property>
    <property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
    </property>
</configuration>
4.1.4 yarn-site.xml

Configure the host name of the ResourceManager and the list of auxiliary services run by the NodeManager.

  • The configuration file yarn-site.xml is as follows:
<?xml version="1.0"?>
<configuration>
    <!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop1</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

4.2 Formatting NameNode and Starting Hadoop

4.2.1 Granting Run Permission to The Root Account

The script directory is /usr/local/hadoop-3.3.0/sbin. The root account needs permission to run the scripts start-dfs.sh, start-yarn.sh, stop-dfs.sh, and stop-yarn.sh.

  • (1) start-dfs.sh and stop-dfs.sh start and stop the HDFS process nodes respectively. Add the root run permission at the top of both scripts as follows:
HDFS_DATANODE_USER=root
HADOOP_SECURE_DN_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
  • (2) start-yarn.sh and stop-yarn.sh start and stop the YARN process nodes respectively. Add the root run permission at the top of both scripts as follows:
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root
4.2.2 Formatting NameNode
  • Run the following command to format the NameNode:
$ hdfs namenode -format 

If the following information is displayed, NameNode is successfully formatted:

INFO common.Storage: Storage directory /usr/local/hadoop-3.3.0/tmp/dfs/name has been successfully formatted.
4.2.3 Starting Hadoop
  • (1) Start HDFS

Run the script on the cli to start HDFS:

$ sh start-dfs.sh

Use jps to view the processes:

$ jps 
13800 Jps
9489 NameNode
9961 SecondaryNameNode
9707 DataNode
  • (2) Start YARN

Run the script on the CLI to start YARN:

$ sh start-yarn.sh

Use jps to view the processes:

$ jps 
5152 ResourceManager
5525 NodeManager
13821 Jps
4.2.4 Viewing Hadoop Node Information

There are two ways to verify that Hadoop has started in pseudo-cluster mode: one is to check in a browser, through the web interface exposed by Hadoop, whether the NameNode status is "Active"; the other is to run a MapReduce program to verify that installation and startup succeeded.

Enter the address in your browser to access:

http://192.168.254.130:9870/

The overview page shows the node in the "Active" state.
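
Besides the web UI, a quick command-line check can also be used; for example, hdfs dfsadmin -report should list one live DataNode in this single-node pseudo-cluster (an optional sanity check, not part of the original steps):

# Summarizes HDFS capacity and the live DataNodes
$ hdfs dfsadmin -report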

4.3 Setting up a Verification Environment by Running MapReduce

To set up a verification environment by running MapReduce, perform the following steps:

  • Create the input file directory on HDFS
  • Upload data files to the HDFS
  • Run the MapReduce program
4.3.1 Creating an Input File Directory on the HDFS

Create the /data/input directory on the HDFS as follows:

$ hadoop fs -mkdir /data
$ hadoop fs -mkdir /data/input
$ hadoop fs -ls /data/

Found 1 items
drwxr-xr-x   - root supergroup          0 2021-06-05 11:11 /data/input
4.3.2 Uploading data Files to the HDFS

Upload the local data file data.input to the HDFS directory /data/input:

$ hadoop fs -put /home/hadoop/input/data.input /data/input
$ hadoop fs -ls /data/input
Found 1 items
-rw-r--r--   1 root ...
$ hadoop fs -cat /data/input/data.input
hadoop mapreduce hive flume
hbase spark storm flume
sqoop hadoop hive kafka
spark hadoop storm
4.3.3 Running the MapReduce Program
  • Run Hadoop's wordcount program as follows:
$ hadoop jar /usr/local/hadoop-3.3.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar wordcount /data/input/data.input /data/output
  • Viewing the Execution Result

During the execution of the wordcount program, the /data/output directory is created automatically. Run the following commands to check the /data/output directory created in HDFS:

$ hadoop fs -ls /data/output

Found 2 items
-rw-r--r--   1 root supergroup          0 2021-06-05 11:19 /data/output/_SUCCESS
-rw-r--r--   1 root supergroup         76 2021-06-05 11:19 /data/output/part-r-00000

$ hadoop fs -cat /data/output/part-r-00000

flume	2
hadoop	3
hbase	1
hive	2
kafka	1
mapreduce	1
spark	2
sqoop	1
storm	2

Each word and its count in the test data file are correctly output in the part-r-00000 file, indicating that Hadoop in pseudo-cluster mode correctly writes the MapReduce results to HDFS.

END

This article mainly covers the network preparation required for the subsequent deployment of Hadoop and other big data components, including the most important steps of setting a static IP, changing the host name, and configuring password-free login. The next chapter will introduce the installation of Hadoop in cluster mode. You are welcome to follow the WeChat public account: Progressive moqing. I am a worker in the wave of the Internet, hoping to learn and progress together with you, adhering to the belief that the more you know, the more you don't know.

Reference Documents:

  • [1] RongT. Blog Garden: www.cnblogs.com/tanrong/p/1…, 2019-04-02.
  • [2] Hadoop official website: hadoop.apache.org/
  • [3] Jiang Xiang. Massive Data Processing and Big Data Technology in Action [M]. 1st edition. Beijing: Peking University Press, 2020-09.