1 Fully Distributed Mode

Fully distributed mode is more complex than local mode and pseudo-distributed mode: in a real deployment, Hadoop runs on multiple Linux hosts planned as a cluster, with the Hadoop components spread across several machines. This article builds such a cluster from three virtual machines. The main steps are listed below, followed by a summary of the cluster plan:

  • Prepare VMs: set up the basic VM environment
  • IP + host configuration: manually configure each VM's IP address and host name, and make sure the three VMs can ping each other
  • SSH configuration: generate a key pair and copy the public key to all three VMs for password-free login
  • Hadoop configuration: core-site.xml + hdfs-site.xml + workers
  • YARN configuration: yarn-site.xml
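
For reference, the cluster plan used throughout the article is summarized below in /etc/hosts style; the host names and IP addresses are the ones configured in later sections.

# cluster plan used in this article
192.168.1.8    master     # NameNode + ResourceManager
192.168.1.9    worker1    # DataNode + NodeManager
192.168.1.10   worker2    # DataNode + NodeManager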

2 VM Installation

Three VMs are needed: one Master node and two Worker nodes. The first step is to install the VMs and configure the environment, then test.

2.1 Image Download

The VM is installed with VirtualBox; download the latest image from the CentOS official website.

There are three different images:

  • boot: Network installation version
  • dvd1: the full version
  • minimal: Minimum installation version

Select the minimal install version for convenience, that is, the one without the GUI.

2.2 installation

After downloading, open VirtualBox, click New, and select Expert mode:

Name it CentOSMaster to act as the Master node, and allocate 1 GB of memory (or 2 GB if you have memory to spare):

A 30 GB disk is enough; everything else can keep its defaults:

Once created, go to Storage in Settings and select the downloaded image:

After startup, the system prompts you to select a boot disk.

When it is ready, the following screen appears; select the first option to install:

After a while, the installation screen is displayed:

To configure the installation location and time zone, select the installation location first:

Since the VM has a single empty virtual disk, select automatic partitioning:

For the time zone, you can choose Shanghai, China:

Select the network and change the host name to master:

Then click Configure:

Add the IP address and DNS server. The IP address can be based on your local machine's; for example, the author's local IP address is 192.168.1.7. (A sketch of the corresponding interface configuration file follows the list below.)

  • The VM's IP can be set to 192.168.1.8
  • The subnet mask is usually 255.255.255.0
  • The default gateway is 192.168.1.1
  • The DNS server is 114.114.114.114 (you can also use other public DNS servers, such as Alibaba's 223.5.5.5 or Baidu's 180.76.76.76)
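
For reference, these values end up in the VM's interface configuration file under /etc/sysconfig/network-scripts/ (the same file edited again in section 6). A minimal sketch for the Master node; the interface name enp0s3 is an assumption and varies per machine:

# /etc/sysconfig/network-scripts/ifcfg-enp0s3  (interface name is an assumption)
BOOTPROTO=static        # use the static values below instead of DHCP
ONBOOT=yes              # bring the interface up at boot
IPADDR=192.168.1.8      # the Master VM's IP from the list above
NETMASK=255.255.255.0
GATEWAY=192.168.1.1
DNS1=114.114.114.114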

Click Save, apply the host name, and enable the connection:

If everything looks fine, start the installation:

Set root user password and create user:

Create a user named hadoopuser; all subsequent operations are performed as this user:

Wait a while for the installation to complete, then restart.

2.3 start

Before booting, remove the installation image first:

A black console screen appears after startup:

Log in to the hadoopuser user created earlier.

3 Connecting to the VM via SSH

By default, the VM cannot reach the outside network. Select Network from the Devices menu and set the adapter to Bridged Adapter:

Ping test:

Then test whether the local (host) machine can be pinged from the VM:

Once that works, you can connect to the VM over SSH from a local terminal, just as you would connect to any server.

Then enter the password to connect:

To avoid typing the password every time, you can set up key-based login from the local machine:

ssh-keygen -t ed25519 -a 100
ssh-copy-id -i ~/.ssh/id_ed25519.pub hadoopuser@192.168.1.8

4 Basic Environment construction

The basic environment consists of the JDK and Hadoop; use scp to upload the OpenJDK and Hadoop archives to the VM.

4.1 JDK

First download OpenJDK locally, then upload it with scp:

scp openjdk-11+28_linux-x64_bin.tar.gz hadoopuser@192.168.1.8:/home/hadoopuser

Then, in the SSH session to the VM:

cd ~
tar -zxvf openjdk-11+28_linux-x64_bin.tar.gz 
sudo mv jdk-11 /usr/local/java

Next, edit /etc/profile and add the Java bin directory to the PATH environment variable by appending at the end:

sudo vim /etc/profile
# if vim is not installed, use vi instead or install it first:
# sudo yum install vim
# append at the end:
export PATH=$PATH:/usr/local/java/bin

Then apply it:

. /etc/profile

Testing:
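
A minimal check, assuming the PATH change above has been applied (the exact version string depends on the JDK build you downloaded):

java -version
# should print something like: openjdk version "11" 2018-09-25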

4.2 Hadoop

Upload the Hadoop package to the VM with scp, then decompress it and move it to /usr/local:

scp hadoop-3.3.0.tar.gz hadoopuser@192.168.1.8:/home/hadoopuser

In the VM's SSH terminal:

cd ~
tar -xvf hadoop-3.3.0.tar.gz
sudo mv hadoop-3.3.0 /usr/local/hadoop

Also edit the hadoop-env.sh configuration file under etc/hadoop and set the Java path:

sudo vim /usr/local/hadoop/etc/hadoop/hadoop-env.sh
# add:
export JAVA_HOME=/usr/local/java  # change to your Java directory
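
To confirm that the unpacked distribution and the Java path work together, a quick sanity check (a sketch, using the install path from above):

/usr/local/hadoop/bin/hadoop version
# should print the Hadoop 3.3.0 version banner; an error here usually means JAVA_HOME is wrong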

5 cloning

Because one Master node and two Worker nodes are required, shut down the Master node, select the configured CentOSMaster, and right-click to clone:

And select full clone:

Clone CentOSWorker1 and CentOSWorker2.

6 Host Name + IP Setup

Starting with Worker1 (Worker2 is handled the same way later), first change the host name:

sudo vim /etc/hostname
# replace the contents with:
worker1

For IP, since the IP address of the Master node is 192.168.1.8, the two Worker nodes are modified as follows:

  • 192.168.1.9
  • 192.168.1.10

sudo vim /etc/sysconfig/network-scripts/ifcfg-xxxx  # the exact file name varies per machine
# modify IPADDR
IPADDR=192.168.1.9

After the modification, restart Worker1 and perform the same operations to change the host name and IP address of Worker2.
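
After the reboot, a quick way to confirm the changes took effect on Worker1 (standard CentOS tools, nothing Hadoop-specific):

hostname    # should print worker1
ip addr     # the bridged interface should show 192.168.1.9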

7 Hosts Setup

The hosts file needs to be edited on both the Master and the Worker nodes:

7.1 Master node

sudo vim /etc/hosts
# add
192.168.1.9 worker1 # correspond to the IP address above
192.168.1.10 worker2

7.2 Worker1 node

sudo vim /etc/hosts
# add
192.168.1.8 master
192.168.1.10 worker2

7.3 Worker2 node

sudo vim /etc/hosts
# add
192.168.1.8 master
192.168.1.9 worker1

7.4 Mutual ping test

On each of the three VMs, ping the IP addresses or host names of the other two. Once the test passes, you can proceed to the next step. Here the test is run from the Worker1 node:
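
A minimal sketch of that test, using the host names configured above:

ping -c 3 master     # or ping -c 3 192.168.1.8
ping -c 3 worker2    # or ping -c 3 192.168.1.10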

8 SSH Configuration

8.1 sshd service

You need to configure password-free (key-based) SSH connections among the three nodes, including each node to itself.

systemctl status sshd

Check whether the sshd service is running. If it is not, start it:

systemctl start sshd

8.2 Copying a Public Key

Perform the following operations on all three nodes:

ssh-keygen -t ed25519 -a 100
ssh-copy-id master
ssh-copy-id worker1
ssh-copy-id worker2
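
After all three nodes have run the commands above, each node's authorized_keys file should contain the three public keys. A quick check (a sketch; ssh-copy-id appends one key per line):

wc -l ~/.ssh/authorized_keys    # expect 3, one public key per node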

8.3 test

From any node, SSH directly to the other nodes to confirm password-free login, for example from the Master:

ssh master   # the user is hadoopuser on every node
ssh worker1
ssh worker2

9 Master Node Hadoop Configuration

On the Master node, modify the following configuration files:

  • /usr/local/hadoop/etc/hadoop/core-site.xml
  • /usr/local/hadoop/etc/hadoop/hdfs-site.xml
  • /usr/local/hadoop/etc/hadoop/workers

9.1 core-site.xml

<configuration>
	<property>
		<name>fs.defaultFS</name>
		<value>hdfs://master:9000</value>
	</property>
	<property>
		<name>hadoop.tmp.dir</name>
		<value>/usr/local/hadoop/data/tmp</value>
	</property>
</configuration>
  • fs.defaultFS: the NameNode address
  • hadoop.tmp.dir: the Hadoop temporary directory

9.2 hdfs-site.xml

<configuration>
	<property>
		<name>dfs.namenode.name.dir</name>
		<value>/usr/local/hadoop/data/namenode</value>
	</property>
	<property>
		<name>dfs.datanode.data.dir</name>
		<value>/usr/local/hadoop/data/datanode</value>
	</property>
	<property>
		<name>dfs.replication</name>
		<value>2</value>
	</property>
</configuration>
  • dfs.namenode.name.dir: the directory where the NameNode stores its metadata (FSImage)
  • dfs.datanode.data.dir: the directory where the DataNode stores HDFS data blocks
  • dfs.replication: the number of HDFS replicas; with two Worker nodes, the value is 2
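
A quick way to confirm that Hadoop picks these values up (a sketch, using the install path from above):

/usr/local/hadoop/bin/hdfs getconf -confKey fs.defaultFS     # expect hdfs://master:9000
/usr/local/hadoop/bin/hdfs getconf -confKey dfs.replication  # expect 2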

9.3 workers

Finally, modify workers and enter the Worker host names (the same names set above):

worker1
worker2

9.4 Copying a Configuration File

Copy the Master configuration to the Worker:

scp /usr/local/hadoop/etc/hadoop/* worker1:/usr/local/hadoop/etc/hadoop/
scp /usr/local/hadoop/etc/hadoop/* worker2:/usr/local/hadoop/etc/hadoop/
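
To verify that the files arrived, you can read one of them back from a Worker over SSH (a sketch):

ssh worker1 cat /usr/local/hadoop/etc/hadoop/workers    # should print worker1 and worker2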

10 HDFS Format and Start

10.1 start

In the Master node:

cd /usr/local/hadoop
bin/hdfs namenode -format
sbin/start-dfs.sh

You can run the jps command to check which processes are running, first on the Master node and then on the Worker nodes.
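
Roughly, the expected output looks like this (a sketch; process IDs and ordering will differ):

jps
# Master node:  NameNode, SecondaryNameNode, Jps
# Worker nodes: DataNode, Jps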

10.2 test

Browser input:

master:9870
# if you have not added master to your local hosts file, use the IP instead
# 192.168.1.8:9870

But…

I thought I would finally see the page after all this work.

I checked the local hosts file, the VM hosts files, and the Hadoop configuration files, and found no problems.

In the end,

I located the problem:

The firewall.

10.3 the firewall

CentOS 8 enables the firewall by default. You can use:

systemctl status firewalld

to check the firewall status.

Since port 9870 is used for access, check whether 9870 is open. On the Master node, enter:

sudo firewall-cmd --query-port=9870/tcp
# or
sudo firewall-cmd --list-ports

If the output is no, the port is not open; open it manually:

sudo firewall-cmd --add-port=9870/tcp --permanent
sudo firewall-cmd --reload  # apply the change

Type again in your browser:

master:9870
# or, if your local hosts file is not modified
# 192.168.1.8:9870

You can now see a friendly page:

However, there is a problem: the Worker nodes are not displayed. The number of Live Nodes above is 0, and the Datanodes tab shows nothing:

Yet the DataNode processes are clearly running on the Worker nodes:

Looking at the Worker node log (/usr/local/hadoop/logs/hadoop-hadoopuser-datanode-worker1.log), the problem appears to be that port 9000 on the Master node is not open:

On the Master node, stop HDFS with stop-dfs.sh, open port 9000, and start it again with start-dfs.sh:

/usr/local/hadoop/sbin/stop-dfs.sh
sudo firewall-cmd --add-port=9000/tcp --permanent
sudo firewall-cmd --reload
/usr/local/hadoop/sbin/start-dfs.sh

Visit again in browser:

master:9870
# or
# 192.168.1.8:9870

Now you can see the Worker nodes:

11 YARN Configuration

11.1 YARN configuration

On the two Worker nodes, modify /usr/local/hadoop/etc/hadoop/yarn-site.xml:

<configuration>
	<property>
		<name>yarn.resourcemanager.hostname</name>
		<value>master</value>
	</property>
</configuration>

11.2 Start YARN

Start YARN on the Master node:

cd /usr/local/hadoop
sbin/start-yarn.sh

Also open port 8088 in preparation for the following tests:

sudo firewall-cmd --add-port=8088/tcp --permanent
sudo firewall-cmd --reload

11.3 test

Browser input:

master:8088
# or
# 192.168.1.8:8088

You should be able to access the following page:

Similarly, no Worker nodes are shown. Checking the Worker node logs shows that the problem is again a port:

On the Master node, stop YARN, open port 8031, and restart YARN:

/usr/local/hadoop/sbin/stop-yarn.sh
sudo firewall-cmd --add-port=8031/tcp --permanent
sudo firewall-cmd --reload
/usr/local/hadoop/sbin/start-yarn.sh
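
Besides the web UI, you can also list the registered NodeManagers from the Master once YARN is back up (a sketch, using the install path from above):

/usr/local/hadoop/bin/yarn node -list
# should list worker1 and worker2 once the NodeManagers have registered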

Visit again:

master:8088
# or
# 192.168.1.8:8088

You can now see the Worker nodes:

At this point, a Hadoop cluster made up of VMs has been set up.

12 References

  • CSDN GitChat · Big Data | The most detailed Hadoop environment setup in history
  • How To Set Up a Hadoop 3.2.1 Multi-Node Cluster on Ubuntu 18.04 (2 Nodes)
  • How to Install and Set Up a 3-Node Hadoop Cluster
  • CSDN: Enabling a host and a VM to ping each other and configuring a static IP address in VirtualBox