When learning big data systems, building a Hadoop cluster is a fundamental exercise, and many higher-level big data applications rely on HDFS. This article introduces one way to build a Hadoop cluster.

The required software and environment are as follows:

  • Local server cluster
  • OpenJDK 1.8
  • Hadoop 2.9.2

I’ve written about how to set up a local server cluster before, so refer to that article if you still need to set one up.

Environment preparation

Before you start setting up Hadoop, you need to do some preparation.

First, turn off the firewall on each server and disable it at startup, so you don’t have to manage ports on the individual servers.

$ systemctl stop firewalld.service
$ systemctl disable firewalld.service

Next, set a hostname for each server. Take the server whose IP address is 192.168.56.3 as an example:

$ vi /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=bigdata1
GATEWAY=192.168.56.2

Edit the other two servers in the same way, naming them bigdata2 and bigdata3.

Then edit the /etc/hosts file for each server and add the hostname to IP mapping:

$ vi /etc/hosts
192.168.56.3 bigdata1
192.168.56.4 bigdata2
192.168.56.5 bigdata3

Finally, in order to easily access each server on the host, we also need to edit the hosts file on the host:

$ vi /etc/hosts
192.168.56.3 bigdata1
192.168.56.4 bigdata2
192.168.56.5 bigdata3

At this point, the configuration of the server cluster is complete, and the next step is to configure the Hadoop environment.

Hadoop also relies on a Java environment, so you need to configure the JDK first. This article uses OpenJDK 1.8.

All software in this article will be installed in the /opt/module directory. If this directory does not exist, create one.

$ mkdir /opt/module

You can copy the downloaded JDK from the host to the server using the SCP command, and then decompress it to the target directory:
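For example, from the host machine the copy step might look like this (the destination path is only an illustration; adjust it to your setup):

$ scp openjdk-8u41-b04-linux-x64-14_jan_2020.tar.gz root@bigdata1:/root/

Then, on the server: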

$ tar -zxvf openjdk-8u41-b04-linux-x64-14_jan_2020.tar.gz -C /opt/module/

After the decompression is complete, configure environment variables:

$ vi /etc/profile

#Java config
export JAVA_HOME=/opt/module/java-se-8u41-ri
export PATH=$PATH:$JAVA_HOME/bin
$ source /etc/profile
$ java -version

openjdk version "1.8.0_41"
OpenJDK Runtime Environment (build 1.8.0_41-b04)
OpenJDK 64-Bit Server VM (build 25.40-b25, mixed mode)

The JDK configuration is complete.

Then copy the Hadoop installation package to the server using the SCP command and decompress it to the /opt/module directory:
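As before, the copy step from the host might look like this (path is illustrative):

$ scp hadoop-2.9.2.tar.gz root@bigdata1:/root/

Then, on the server: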

$ tar -zxvf hadoop-2.9.2.tar.gz -C /opt/module/

Also configure the environment variables:

$ vi /etc/profile
#Hadoop config
export HADOOP_HOME=/opt/module/hadoop-2.9.2
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
$ source /etc/profile
$ hadoop version
Hadoop 2.9.2
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 826afbeae31ca687bc2f8471dc841b66ed2c6704
Compiled by ajisaka on 2018-11-13T12:42Z
Compiled with protoc 2.5.0
From source with checksum 3a9939967262218aa556c684d107985
This command was run using /opt/module/hadoop-2.9.2/share/hadoop/common/hadoop-common-2.9.2.jar

The Hadoop installation is complete. At this point, however, the software has only been installed on a single machine.

If repeating this configuration on every machine is too tedious, you can use rsync to copy the installed software directly to the other machines. You could also use SCP, but SCP always does a full copy, which is slow when there are many files, while rsync copies incrementally: only files that have changed are transferred, so the copy is much faster.

If rsync is not available on your system, install it with the following command:

$ yum install rsync

After the installation, use rsync to synchronize the installed software to the other two servers:

$ rsync -rvl /opt/module/ root@bigdata2:/opt/module
$ rsync -rvl /opt/module/ root@bigdata3:/opt/module

Then configure the environment variables on each server separately (alternatively, the /etc/profile file itself can be synchronized to the other servers; see the example after this block):

$ vi /etc/profile
#Java config
export JAVA_HOME=/opt/module/java-se-8u41-ri
export PATH=$PATH:$JAVA_HOME/bin

#Hadoop config
export HADOOP_HOME=/opt/module/hadoop-2.9.2
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
$ source /etc/profile
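If you would rather synchronize the profile than edit it by hand, something like the following should work (remember to run source /etc/profile on each server afterwards):

$ rsync -v /etc/profile root@bigdata2:/etc/profile
$ rsync -v /etc/profile root@bigdata3:/etc/profile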

At this point, all the software is installed.

Cluster planning

After configuring the basic environment, plan servers in the cluster so that each server plays a different role.

The planning is as follows: the most important NameNode is placed on the first server, the YARN ResourceManager on the second, and the SecondaryNameNode on the third. All three servers also run a DataNode and a NodeManager (as the slaves file below shows).

The specific plan is as follows:

  • bigdata1 (192.168.56.3): NameNode, DataNode, NodeManager
  • bigdata2 (192.168.56.4): ResourceManager, DataNode, NodeManager
  • bigdata3 (192.168.56.5): SecondaryNameNode, DataNode, NodeManager

Cluster configuration

Then you can configure the cluster according to the above plan. Take the server whose IP address is 192.168.56.3 as an example.

All Hadoop configuration files are in the etc/hadoop directory:

$ cd /opt/module/hadoop-2.9.2/etc/hadoop

First, configure the Hadoop core file: open core-site.xml and add the following configuration, which sets the server where the NameNode runs and the temporary data directory:

$ vi core-site.xml

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://bigdata1:9000</value>
</property>
<property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/module/hadoop-2.9.2/data/tmp</value>
</property>

Open hadoop-env.sh to configure the JDK:

$ vi hadoop-env.sh

export JAVA_HOME=/opt/module/java-se-8u41-ri/

Configure the number of replicas for each file and the server on which the SecondaryNameNode runs by editing the hdfs-site.xml file:

$ vi hdfs-site.xml

<property>
    <name>dfs.replication</name>
    <value>3</value>
</property>
<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>bigdata3:50090</value>
</property>

Next, configure YARN:

$ vi yarn-env.sh
export JAVA_HOME=/opt/module/java-se-8u41-ri/

Configure the MapReduce shuffle service and the ResourceManager server by editing yarn-site.xml:

$ vi yarn-site.xml
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>bigdata2</value>
</property>

Then configure MapReduce:

$ vi mapred-env.sh
export JAVA_HOME=/opt/module/java-se-8u41-ri/

Configure MapReduce to run its jobs on YARN:

$ cp mapred-site.xml.template mapred-site.xml
$ vi mapred-site.xml
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>

Finally, configure the list of cluster nodes so that each server knows which servers are in the cluster. Edit the slaves file:

$ vi slaves
bigdata1
bigdata2
bigdata3

Then synchronize the configuration to the other two machines:

$ rsync -rvl /opt/module/hadoop-2.9.2/ root@bigdata2:/opt/module/hadoop-2.9.2
$ rsync -rvl /opt/module/hadoop-2.9.2/ root@bigdata3:/opt/module/hadoop-2.9.2

All the configuration is done here.

Cluster operations

Before starting the cluster for the first time, format the NameNode:

$ cd /opt/module/hadoop-2.9.2
$ bin/hdfs namenode -format

Start HDFS on the first server (IP: 192.168.56.3):

$ sbin/start-dfs.sh

Start YARN on the second server (IP: 192.168.56.4):

$ sbin/start-yarn.sh

Then visit http://192.168.56.5:50090/. If the SecondaryNameNode web page loads, the cluster is up and running.
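You can also do a quick sanity check from the command line. Running jps on each server should show the daemons planned for that node, and the following report, run from any server, should list three live DataNodes:

$ hdfs dfsadmin -report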

To stop the cluster, run the following command:

$ sbin/stop-yarn.sh
$ sbin/stop-dfs.sh

Text / Rayjun

Follow my WeChat official account, where I also write about other topics.