🚀 author: “Big Data Zen”
🚀 Article introduction: This article explains how to build a Hadoop cluster. To make it easier to follow and reproduce, the blogger has included screenshots of the key steps to reduce the chance of errors.
🚀 Article resources: the PDF version of this walkthrough and the relevant installation packages are available; follow me and leave a message if you would like them.
🚀 Welcome to like 👍, collect ⭐, leave a message 💬
1. Hadoop overview and cluster planning
• Hadoop is an open-source distributed storage + distributed computing platform from the Apache Software Foundation
• It is a distributed system infrastructure: users can build on it without needing to understand the low-level details of the distributed internals.
• Distributed file system: HDFS implements distributed storage of files on many servers
• Distributed computing framework: MapReduce implements distributed parallel computing across many machines
• Distributed resource scheduling framework: YARN manages cluster resources and schedules jobs
Cluster planning
HDFS: NameNode and DataNodes ==> NN DN
YARN: ResourceManager, NodeManager ==> RM NM
| node1 | node2 | node3 |
| --- | --- | --- |
| NN RM DN NM | DN NM | DN NM |
2. Prepare the basic environment
To build a Hadoop cluster, a Java environment is essential, and every machine in the cluster must have it. In this step, we will install Java and configure the environment.
Version notes: the Java version is JDK 8, the Hadoop version is 2.7, and the cluster runs on Linux CentOS 7 with three machines named node1, node2, and node3. You can also contact me for the relevant installation packages. The command to change a host name is
hostnamectl set-hostname xxxx
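For instance, with the three hostnames used in this article, the command on each machine would be:

```bash
# Run the matching command on each of the three machines
hostnamectl set-hostname node1   # on the first machine
hostnamectl set-hostname node2   # on the second machine
hostnamectl set-hostname node3   # on the third machine
```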
First, use an SSH client to connect to the Linux machines and upload the JDK 8 installation package. It is a good idea to create a dedicated folder so that uploaded files are kept in one place. The package is jdk-8u212-linux-x64.tar.gz; I put it in the /app directory. After the upload finishes, decompress it. The command is
tar -xf jdk-8u212-linux-x64.tar.gz -C /app
After decompressing, configure the environment variables so that the JDK's bin directory is on the PATH. Edit the configuration file and add the JDK settings (a sketch of the additions follows).
vi /etc/profile
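The exact lines are shown in the screenshot; a minimal sketch, assuming the extracted JDK directory is /app/jdk1.8 (matching the JAVA_HOME used in hadoop-env.sh later in this article), would be:

```bash
# Append to /etc/profile: point JAVA_HOME at the extracted JDK and add its bin directory to PATH
export JAVA_HOME=/app/jdk1.8
export PATH=$JAVA_HOME/bin:$PATH
```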
After adding the settings, reload them with source /etc/profile and run java -version on the command line. If the JDK version information is printed, the configuration was added successfully.
3. Disable the firewall
The firewall is turned off so that the local machine can also reach the cluster's web pages. If this step is skipped, the cluster's pages may be unreachable while it is running. Run the following command to stop the firewall.
systemctl stop firewalld.service
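Stopping the service only lasts until the next reboot; a common companion step (not shown in the original screenshot) is to disable it as well and confirm its state:

```bash
# Optional but common: prevent firewalld from starting again after a reboot,
# then confirm it is inactive
systemctl disable firewalld.service
systemctl status firewalld.service
```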
4. Configure IP address mapping
In this step, we map the three hosts' IP addresses to their hostnames to simplify later configuration and cluster communication. Perform the same operation on all three hosts, substituting your own IP addresses, as shown in the following figure and in the sketch below.
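As a sketch, the /etc/hosts entries would look like the following; the addresses are placeholders that you should replace with your own machines' IPs:

```bash
# Append to /etc/hosts on all three machines (addresses below are placeholders)
192.168.1.101 node1
192.168.1.102 node2
192.168.1.103 node3
```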
5. Add a Hadoop user and grant permissions
In the cluster construction process, using the root user is actually possible and even more convenient. However, this is not the usual practice; instead a dedicated hadoop user is created for the job, which also improves the security of the cluster. The operation is as follows:
Start by adding the hadoop user, doing the same on all three machines.
Then edit the sudoers configuration with vi /etc/sudoers and add a line granting the user permissions, which makes subsequent operations easier (a sketch of both steps follows).
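As a sketch (the exact screenshots are not reproduced here), the user creation and the sudoers entry might look like this; the password and how broad a sudo grant you give are up to you:

```bash
# On each of the three machines, as root: create the hadoop user and set its password
useradd hadoop
passwd hadoop

# Then add a line like the following in /etc/sudoers (vi /etc/sudoers, or visudo)
# hadoop ALL=(ALL) ALL
```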
6. No-password login for the cluster
Password-free login is an important step; here we configure it across the three machines in the cluster. As we all know, logging in to another host over SSH normally requires entering a password before the login succeeds. If password-free login is not configured, you will be prompted for passwords repeatedly when the cluster starts. The operation is as follows:
Perform this step as the hadoop user. Running ssh-keygen -t rsa on the primary node, node1, generates a key pair, which is then distributed to the other machines so the nodes can reach each other without passwords.
Copy the public key to each node with the commands below; afterwards, as you can see, logging in to node2 no longer requires a password.
ssh-copy-id -i ~/.ssh/id_rsa.pub node1
ssh-copy-id -i ~/.ssh/id_rsa.pub node2
ssh-copy-id -i ~/.ssh/id_rsa.pub node3
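As a quick optional check, verify from node1 that the login really is password-free:

```bash
# From node1, as the hadoop user: this should log in to node2 without a password prompt
ssh node2
exit
```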
7. Decompress the Hadoop installation package and modify the configuration file
We are using Hadoop 2.7; although 3.x is available now, the blogger recommends 2.7 for stability. After uploading, decompress the package (the tar command is the same as for the JDK above), then add Hadoop's bin directory to the environment variables so the system can find its commands. The next step, modifying the configuration files, is equally important. The values below come from my own setup and are for reference; all of the files to modify live in the etc/hadoop directory under the Hadoop installation directory. A sketch of the /etc/profile additions comes first, followed by the configuration files.
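A minimal sketch, assuming the package was extracted to /app/hadoop-2.7 (the directory name used in the scp step later in this article):

```bash
# Append to /etc/profile: point HADOOP_HOME at the extracted package and add its bin and sbin to PATH
export HADOOP_HOME=/app/hadoop-2.7
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
```

Run source /etc/profile afterwards so the hadoop commands are picked up, then edit the configuration files listed below.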
hadoop-env.sh
export JAVA_HOME=/app/jdk1.8
core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://node1:8020</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/app/tmp/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/app/tmp/dfs/data</value>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Note that this file may not exist by default; copy it from the template first and then modify it:
cp mapred-site.xml.template mapred-site.xml
yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>node1</value>
</property>
</configuration>
slaves
node1
node2
node3
All configuration files have been modified
8. Distribute cluster files
So far the base environment has been configured on node1, but all three machines need the same environment, so next we distribute the configured Hadoop installation package to the other machines in the cluster. This is done quickly with the scp command. After running it on node1, you can see that the installation package has been synchronized to the corresponding directory on the other nodes (a sketch for synchronizing the rest of the environment follows the commands).
scp -r hadoop-2.7 node2:/app
scp -r hadoop-2.7 node3:/app
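A possible way to push the rest of the base environment over is sketched below; it assumes the paths used earlier in this article and is run as root on node1 (the password-free keys above were set up for the hadoop user, so you may still be prompted for the target machines' passwords here):

```bash
# Sketch: synchronize the JDK, the hosts mapping and the profile to the other nodes;
# adjust directory names to your own setup
scp -r /app/jdk1.8 node2:/app
scp -r /app/jdk1.8 node3:/app
scp /etc/hosts /etc/profile node2:/etc/
scp /etc/hosts /etc/profile node3:/etc/
```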
9. Format NameNode
Before starting the cluster for the first time, run the following command on node1 to format the NameNode.
hadoop namenode -format
10. The cluster is started
At this point all the preparations are complete, and we start the cluster using the scripts in the sbin directory of the Hadoop installation. Run the following command to start the cluster.
./start-all.sh
To check whether the cluster started successfully, run the jps command on each node to view the running Java processes. You can see that the expected processes have started on all three machines; a rough sketch of what to expect is given below.
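Roughly, based on the cluster plan in section 1 (exact output varies, and a SecondaryNameNode process may also appear on node1 with the default settings):
node1: NameNode, ResourceManager, DataNode, NodeManager
node2: DataNode, NodeManager
node3: DataNode, NodeManager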
Check whether the page is accessible
You can see that the HDFS NameNode web UI on port 50070 and the YARN ResourceManager web UI on port 8088 can both be accessed (for example http://node1:50070 and http://node1:8088, or use node1's IP address from your local machine).
11. Summary
The cluster setup is now complete. The build process is fairly involved and error-prone, so pay close attention to the details as you go. If you need any of the installation packages, you can message me directly and note which packages you want. Thank you for your support! 💪