In this deployment, two servers are used: Master (64.115.5.32) and Slave0 (64.115.5.158). You can add more servers; a cluster of more than three is preferable.

Note: The Hadoop NameNode, SecondaryNameNode, and ResourceManager services all consume significant memory, so avoid placing them on the same server where possible (if resources are sufficient, a single server is acceptable). In particular, the NameNode and SecondaryNameNode form a primary/backup relationship and are best kept on different servers.

The installation and configuration of each node are basically the same. In practice, the installation and configuration are usually completed on the master node, and the installation directory is then copied to the other nodes. Note: All operations are performed as the root user, and Hadoop depends on the Java environment, so ensure that Java is configured first.
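For example, a quick way to confirm that Java is ready before proceeding (a minimal check; the JDK path shown is the one assumed later in this article):

java -version                  # should print the JDK version
echo $JAVA_HOME                # should print the JDK path, e.g. /usr/local/java/jdk1.8.0_207/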

Download and configure Hadoop

Perform the following operations only on the master node. After the configuration is complete, distribute the configuration to other nodes.

Download the Hadoop installation package

Hadoop’s official website is hadoop.apache.org. I use Hadoop 2.7.5 here; the download address is archive.apache.org/dist/hadoop…

Decompress the Hadoop installation package

Upload the downloaded hadoop-2.7.5.tar.gz file to the /opt/hadoop directory on the server. After uploading, execute the following on the master host:

cd /opt/hadoop
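Alternatively, if the master node has Internet access, you can download the package directly on the server. A sketch, assuming the standard Apache archive layout for the 2.7.5 release:

wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.5/hadoop-2.7.5.tar.gz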

Go to the /opt/hadoop directory and run the decompression command.

tar -zxvf hadoop-2.7.5.tar.gz

After decompression succeeds, a hadoop-2.7.5 subdirectory is created in the /opt/hadoop directory.

For convenience, change the folder name hadoop-2.7.5 to Hadoop, which is the Hadoop installation directory. Run the command to change the folder name:

mv hadoop-2.7.5 hadoop

Enter the installation directory and check the files. If the expected file list is displayed, the decompression succeeded.
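Since the original screenshot is not reproduced here, for reference the top level of a Hadoop 2.7.5 installation directory looks like this:

ls /opt/hadoop/hadoop
# bin  etc  include  lib  libexec  sbin  share  LICENSE.txt  NOTICE.txt  README.txt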

Configuration File Introduction

Hadoop’s main configuration files are core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml. These four files hold the configuration parameters of different components, as shown in the following table:

Configuration file | Configuration object | Main content
core-site.xml | Cluster-wide parameters | System-level parameters such as the HDFS URL and the Hadoop temporary directory
hdfs-site.xml | HDFS parameters | Storage locations of the name node and data nodes, the number of file replicas, file access permissions, and so on
mapred-site.xml | MapReduce parameters | JobHistory Server and application parameters, such as the default number of reduce tasks and the default memory limits for tasks
yarn-site.xml | Cluster resource-management parameters | ResourceManager and NodeManager communication ports and web monitoring ports

Important parameters for configuring the cluster

This section discusses the important parameters in each configuration file. Among the four configuration files, the most important parameters and their explanations are as follows:

1. core-site.xml

Parameter | Default value | Description
fs.defaultFS | file:/// | File system host and port
io.file.buffer.size | 4096 | Buffer size for stream files
hadoop.tmp.dir | /tmp/hadoop-${user.name} | Temporary folder

2. hdfs-site.xml

Parameter | Default value | Description
dfs.namenode.secondary.http-address | 0.0.0.0:50090 | HTTP server address and port of the SecondaryNameNode
dfs.namenode.name.dir | file://${hadoop.tmp.dir}/dfs/name | Location on the local file system where the DFS name node stores its metadata
dfs.datanode.data.dir | file://${hadoop.tmp.dir}/dfs/data | Location on the local file system where a DFS data node stores its blocks
dfs.replication | 3 | Default number of block replicas
dfs.webhdfs.enabled | true | Whether HDFS files can be read over HTTP (WebHDFS); enabling this weakens cluster security

3. mapred-site.xml

Parameter | Default value | Description
mapreduce.framework.name | local | Can be local, classic, or yarn; if not set to yarn, the YARN cluster is not used to allocate resources
mapreduce.jobhistory.address | 0.0.0.0:10020 | IP address and port of the history server, used to view records of completed MapReduce jobs
mapreduce.jobhistory.webapp.address | 0.0.0.0:19888 | Address and port of the history server's web application

4. yarn-site.xml

Parameter | Default value | Description
yarn.resourcemanager.address | 0.0.0.0:8032 | Address the ResourceManager exposes to clients, who use it to submit and kill applications
yarn.resourcemanager.scheduler.address | 0.0.0.0:8030 | Address the ResourceManager exposes to the ApplicationMaster, which uses it to request and release resources
yarn.resourcemanager.resource-tracker.address | 0.0.0.0:8031 | Address the ResourceManager exposes to NodeManagers, which use it to send heartbeats and pick up tasks
yarn.resourcemanager.admin.address | 0.0.0.0:8033 | Address the ResourceManager exposes to administrators, who use it to send management commands
yarn.resourcemanager.webapp.address | 0.0.0.0:8088 | Address of the ResourceManager web UI; you can view cluster information in a browser at this address
yarn.nodemanager.aux-services | (none) | Allows users to attach custom auxiliary services; the MapReduce shuffle, for example, is implemented this way, letting users extend NodeManager with their own services

Configuring the env File

To configure the JDK path, edit the hadoop-env.sh file:

vi /opt/hadoop/hadoop/etc/hadoop/hadoop-env.sh

Find the line "export JAVA_HOME" to configure the JDK path.

Change the value to export JAVA_HOME=/usr/local/java/jdk1.8.0_207/
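If you prefer a non-interactive edit, the same change can be made with sed (a sketch; substitute your own JDK path):

sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/usr/local/java/jdk1.8.0_207/|' /opt/hadoop/hadoop/etc/hadoop/hadoop-env.sh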

Configure the core component file

The Hadoop core component file is core-site.xml, located in the /opt/hadoop/hadoop/etc/hadoop subdirectory. Edit it with vi; the configuration code below must be placed between the <configuration> and </configuration> tags.

Execute the command to edit the core-site.xml file:

vi /opt/hadoop/hadoop/etc/hadoop/core-site.xml

Code to add between <configuration> and </configuration>:

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
    <description>Define the default file system host and port</description>
</property>
<!-- Hadoop temporary directory -->
<property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/hadoop/hadoopdata</value>
    <description>A base for other temporary directories.</description>
</property>

Create the data directories (note: the name and data paths must match the values configured in hdfs-site.xml below):

mkdir -p /opt/hadoop/hadoopdata
mkdir -p /opt/hadoop/hadoop/hdfs/name
mkdir -p /opt/hadoop/hadoop/hdfs/data

After editing, exit and save!

Configuring the File System

The Hadoop file system configuration file is hdfs-site.xml, located in the /opt/hadoop/hadoop/etc/hadoop subdirectory. Edit the file with vi; the following code must be placed between the <configuration> and </configuration> tags.

Run the following command to edit the hdfs-site.xml file:

vi /opt/hadoop/hadoop/etc/hadoop/hdfs-site.xml

Code to add between <configuration> and </configuration>:

<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/opt/hadoop/hadoop/hdfs/name</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/opt/hadoop/hadoop/hdfs/data</value>
</property>
<property>
    <name>dfs.replication</name>
    <value>2</value>
</property>


After editing, exit and save!

Configure the yarn-site.xml file

The site configuration file of YARN is yarn-site.xml, also located in the /opt/hadoop/hadoop/etc/hadoop subdirectory. Again use vi to edit the file and add the following code between the <configuration> and </configuration> tags.

Run the command to edit the yarn-site.xml file:

vi /opt/hadoop/hadoop/etc/hadoop/yarn-site.xml

Code to add between <configuration> and </configuration>:

<property>
	  <name>yarn.nodemanager.aux-services</name>
	  <value>mapreduce_shuffle</value>
</property>
<property>
	  <name>yarn.resourcemanager.address</name>
	  <value>master:18040</value>
</property>
<property>
	  <name>yarn.resourcemanager.scheduler.address</name>
	  <value>master:18030</value>
</property>
<property>
	  <name>yarn.resourcemanager.resource-tracker.address</name>
	  <value>master:18025</value>
</property>
<property>
	  <name>yarn.resourcemanager.admin.address</name>
	  <value>master:18141</value>
</property>
<property>
	  <name>yarn.resourcemanager.webapp.address</name>
	  <value>master:18088</value>
</property>

Configure the MapReduce computing framework file

The /opt/hadoop/hadoop/etc/hadoop subdirectory already contains a mapred-site.xml.template file. We need to copy it and rename it, leaving it in the same location.

Execute the copy-and-rename command:

cp /opt/hadoop/hadoop/etc/hadoop/mapred-site.xml.template /opt/hadoop/hadoop/etc/hadoop/mapred-site.xml

Then edit the mapred-site.xml file with vi and add the following code between the <configuration> and </configuration> tags.

Execute command:

vi /opt/hadoop/hadoop/etc/hadoop/mapred-site.xml

Code to add between <configuration> and </configuration>:

<property>
		<name>mapreduce.framework.name</name>
		<value>yarn</value>
</property>

After editing, save and exit!

Configure the master's slaves file

The slaves file provides the list of slave nodes of the Hadoop cluster. This file is very important because, when Hadoop starts, the system starts the cluster according to the list of slave node names in the current slaves file; slave nodes not in the list are not treated as compute nodes.

Run the command to edit the slaves file:

vi /opt/hadoop/hadoop/etc/hadoop/slaves

Note: Edit the slaves file according to the actual cluster you have built. For example, I only have slave0.

So the following code should be added:

slave0

Note: Remove the original localhost line in the slaves file!

Configure domain name resolution (for both master and slave)

vi /etc/hosts

Add the following two lines:

64.115.5.32 master
64.115.5.158 slave0
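You can then verify that the names resolve correctly by running the following on both machines:

ping -c 1 master
ping -c 1 slave0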

Copy Hadoop from master to slave

Since I only have slave0 here, I only need to copy once. If there are many machines, the copy can be scripted instead; see the sketch after the copy command below.

Copy command:

scp -r /opt/hadoop root@slave0:/opt
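With many slaves, a minimal sketch of scripting the copy, assuming a hosts.txt file (a hypothetical helper file, not part of Hadoop) that lists one slave hostname per line:

# copy the Hadoop installation to every host listed in hosts.txt
while read host; do
    scp -r /opt/hadoop "root@${host}:/opt"
done < hosts.txt

Tools such as rsync or pdsh can also make bulk distribution faster.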

Hadoop cluster startup

Configure operating system environment variables (master and slave required)

Then use vi to edit /etc/profile:

vi /etc/profile

Finally, append the following code to the end of the file:

# HADOOP
export HADOOP_HOME=/opt/hadoop/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

After saving the configuration and exiting, run the following command:

source /etc/profile

The source /etc/profile command makes the configuration take effect.

Tip: Use the same configuration method as above in Slave0.
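To confirm that the environment variables are in effect, you can run a quick check on each node:

echo $HADOOP_HOME              # should print /opt/hadoop/hadoop
hadoop version                 # should print the Hadoop 2.7.5 version banner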

Format file system (do at master)

Run the command to format the file system:

hadoop namenode -format

Start and shut down the Hadoop cluster (do it at Master)

First go to the installation home directory and run the following command:

cd /opt/hadoop/hadoop/sbin

Then the startup command is:

start-all.sh

To shut down the Hadoop cluster, run the following command:

stop-all.sh

The next time you start Hadoop, you do not need to format the NameNode again; simply run start-dfs.sh and then start-yarn.sh.

In fact, Hadoop has deprecated start-all.sh and stop-all.sh in favor of start-dfs.sh and start-yarn.sh (and their stop counterparts).
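In other words, the recommended start and stop sequences look like this (a sketch; the JobHistory Server line is optional and assumes you want job history, which this article does not otherwise configure):

# start
start-dfs.sh
start-yarn.sh
mr-jobhistory-daemon.sh start historyserver    # optional

# stop
mr-jobhistory-daemon.sh stop historyserver     # optional
stop-yarn.sh
stop-dfs.sh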

Verify that the Hadoop cluster is started successfully

You can run the jps command on the terminal to check whether Hadoop started successfully.

On the master node, execute:

jps

If the SecondaryNameNode, ResourceManager, Jps, and NameNode processes are displayed, the master node is successfully started.
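For reference, the output looks roughly like the following (the process IDs are illustrative and will differ on your machine):

jps
# 2482 NameNode
# 2680 SecondaryNameNode
# 2831 ResourceManager
# 3100 Jps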

Then execute the jps command on the slave0 node:

If NodeManager, Jps, and DataNode are displayed, the slave node (Slave0) is successfully started.

After the server firewall is disabled, open the Hadoop web UIs in a browser, for example the HDFS NameNode UI (default port 50070) and the YARN ResourceManager UI at the address configured above (master:18088). If the status pages are displayed, the deployment succeeded.
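For example, on a CentOS 7 system with firewalld (an assumption; adapt this to your distribution), the firewall can be disabled like this:

systemctl stop firewalld       # stop the firewall immediately
systemctl disable firewalld    # keep it off after reboot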

More information about configuration parameters

The Hadoop configuration covered above includes only the important parameters. There are many other configuration parameters, which you can look up on Hadoop’s official website at hadoop.apache.org/docs/curren… (there is one reference page for each of the four configuration files).

These pages provide the latest Hadoop configuration information, including deprecated properties, which helps with maintaining a Hadoop cluster.

Conclusion

This article deploys Hadoop 2.7.x, while Hadoop 3.x is the latest version, and many configurations (such as the JobHistory Server address) are not covered here. If you need a more professional and detailed deployment walkthrough, see the official documentation, or the Silicon Valley (尚硅谷) Big Data Hadoop 3.x course on Bilibili.

Finally, thank my girlfriend for her tolerance, understanding and support in work and life!