If you are setting up Hadoop for the first time, you are advised to start from scratch and first install and configure Hadoop on a single Linux machine.

System design

Hadoop fully distributed mode uses a master-slave architecture: one NameNode and multiple DataNodes. Because this mode has only one NameNode, it is a single point of failure and is rarely used in production.

Host distribution

Host               IP              Hostname  OS
Virtual machine 1  192.168.56.101  hadoop1   CentOS 7
Virtual machine 2  192.168.56.102  hadoop2   CentOS 7
Virtual machine 3  192.168.56.103  hadoop3   CentOS 7

Functional allocation

      hadoop1               hadoop2                        hadoop3
HDFS  NameNode, DataNode    SecondaryNameNode, DataNode    DataNode
YARN  NodeManager           NodeManager                    ResourceManager, NodeManager

Environment preparation

As shown above, we need three virtual machines. You can set up one machine first and then use the clone function of the virtualization software to create the other two. Note that cloned machines need adjustments such as changing the IP address, NIC configuration, and MAC address. How to solve the problems caused by cloning is not the focus of this article and there are many tutorials online, so we assume that step is already done.

Configuring the Java Environment

You are advised to configure the Java environment on one machine before cloning, so that every machine ends up with the Java environment configured.

For details about how to configure the Java environment, see Installing and Configuring the Java Environment in Linux
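If you only need a quick reference, a minimal sketch of the relevant /etc/profile entries follows. The JDK path is hypothetical and should be replaced with wherever your JDK is actually installed.

# Append to /etc/profile (hypothetical JDK path; adjust to your installation)
export JAVA_HOME=/usr/local/java/jdk1.8.0_271
export PATH=$PATH:$JAVA_HOME/bin
# Reload and verify
source /etc/profile
java -version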

Setting the hostname

hostnamectl set-hostname <hostname>

Restart after the modification is complete

reboot

The same goes for the other machines
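For example, following the host plan above, you would run the following on each machine respectively (and then reboot as described):

# On virtual machine 1
hostnamectl set-hostname hadoop1
# On virtual machine 2
hostnamectl set-hostname hadoop2
# On virtual machine 3
hostnamectl set-hostname hadoop3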

Configuring the hosts file

First configure hosts on hadoop1

vi /etc/hosts

Append the following information to the end of the file

192.168.56.101 hadoop1
192.168.56.102 hadoop2
192.168.56.103 hadoop3

Copy the file to another host

scp /etc/hosts root@hadoop2:/etc/hosts
scp /etc/hosts root@hadoop3:/etc/hosts
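As a quick sanity check (optional, not part of the original steps), ping each host by name from any of the machines:

ping -c 3 hadoop1
ping -c 3 hadoop2
ping -c 3 hadoop3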

Configuring a Static IP Address

First go to hadoop1 and check the network configuration

ifconfig

You can see that the network card is called enp0s3, the IP address is 192.168.56.101, the subnet mask is 255.255.255.0, and the broadcast address is 192.168.56.255. This IP is the one we want, but it is assigned by the DHCP server and can change, so we will configure it to be static.

vi /etc/sysconfig/network-scripts/ifcfg-<NIC name>

The NIC name here is the enp0s3 found above

To change the IP address to static, change BOOTPROTO="dhcp" to BOOTPROTO="static" and add the broadcast address, IP address, subnet mask, and gateway address at the end

BROADCAST=192.168.56.255
IPADDR=192.168.56.101
NETMASK=255.255.255.0
GATEWAY=192.168.56.1
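Putting it together, the resulting ifcfg-enp0s3 on hadoop1 would look roughly like the sketch below. Only the lines relevant here are shown; installer-generated fields such as UUID are assumed to stay as they were.

TYPE=Ethernet
NAME=enp0s3
DEVICE=enp0s3
ONBOOT=yes
BOOTPROTO=static
BROADCAST=192.168.56.255
IPADDR=192.168.56.101
NETMASK=255.255.255.0
GATEWAY=192.168.56.1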

Finally restart the network

/etc/init.d/network restart

The same goes for the other machines

Time synchronization

Set the time zone

Start by setting all machines to the same time zone, for example UTC+8

Viewing the System Time Zone

date -R

If +0800 is displayed, it means UTC+8 and you don’t need to change it. If not, you need to change the time zone

tzselect

Once you're done, you need to copy the appropriate time zone file to replace the system time zone file (or create a symbolic link to it)

cp /usr/share/zoneinfo/<region>/<city> /etc/localtime

For example, use Asia/Shanghai (UTC+8) when setting the time zone for China:

cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime

Synchronizing with network time

Install ntpdate

yum -y install ntp ntpdate

Synchronize the system time with network time

ntpdate cn.pool.ntp.org
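ntpdate performs a one-off synchronization. If you want it repeated automatically, one optional approach (not part of the original steps) is a cron entry, for example:

# Run 'crontab -e' and add a line like this to sync once an hour
0 * * * * /usr/sbin/ntpdate cn.pool.ntp.org > /dev/null 2>&1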

Synchronize system time with hardware time

Linux time is divided into the system clock (System Clock) and the hardware clock (Real Time Clock, RTC).

  • System time: indicates the time in the current Linux Kernel.
  • Hardware time: battery time on the motherboard.

Each time the system starts, it reads the hardware time and uses it as the system time. To prevent drift, write the synchronized system time back into the hardware clock

hwclock -w

The same goes for the other machines

Disabling a Firewall

My operating system is CentOS 7

Disabling the Firewall

systemctl stop firewalld

Disable it from starting at boot

systemctl disable firewalld
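You can confirm the firewall state afterwards (an optional check):

systemctl status firewalld
firewall-cmd --state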

Configure SSH password-free login

There are two purposes for configuring SSH password-free login

  1. Allow the NameNode to send commands to the DataNodes
  2. Allow the ResourceManager to send commands to the NodeManagers

In addition, the NameNode also needs passwordless login to itself, and the same is true for the ResourceManager. According to the functional allocation at the beginning of the article, this means:

  • Configure hadoop1 for passwordless login to hadoop2, hadoop3, and itself
  • Configure hadoop3 for passwordless login to hadoop1, hadoop2, and itself

For details about how to configure passwordless (SSH key) login between Linux systems, see Linux Passwordless SSH Login
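If you don't want to jump to that article, a minimal sketch follows; run it on hadoop1, then repeat the equivalent on hadoop3 according to the list above.

# Generate a key pair (press Enter to accept the defaults)
ssh-keygen -t rsa
# Copy the public key to every target host, including the local machine
ssh-copy-id root@hadoop1
ssh-copy-id root@hadoop2
ssh-copy-id root@hadoop3
# Verify that no password is requested
ssh root@hadoop2 hostname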

Cluster setup

We configure Hadoop on one machine first and then copy it to the others. Here I choose hadoop1, the machine that will run the NameNode.

Download hadoop

It is recommended to download and decompress Hadoop on one machine before cloning, so that all machines will have it

Download

wget https://archive.apache.org/dist/hadoop/common/hadoop-3.1.3/hadoop-3.1.3.tar.gz

Unpack

tar -zxvf hadoop-3.1.3.tar.gz -C /usr/local/hadoop

I unzip to /usr/local/hadoop, where /hadoop is a directory I created in advance

Configure hadoop environment variables

vi /etc/profile

Append the following information to the end of the file

export HADOOP_HOME=/usr/local/hadoop/hadoop-3.1.3
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Note the configuration of PATH

Refresh after the configuration is complete

source /etc/profile

Check whether the variable takes effect

hadoop version

Configure hadoop-env.sh

First, check the JAVA_HOME path

echo $JAVA_HOME

Configure the JAVA_HOME path

vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Locate the export JAVA_HOME line and add the following below it

export JAVA_HOME=<JAVA_HOME path>
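For example, if echo $JAVA_HOME printed /usr/local/java/jdk1.8.0_271 (a hypothetical path; use whatever yours shows), the line would be:

export JAVA_HOME=/usr/local/java/jdk1.8.0_271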

Configure core-site.xml

Create the HDFS data storage directory and a temporary directory; I put hdfs_data/ and tmp/ under the $HADOOP_HOME directory

mkdir /usr/local/hadoop/hadoop-3.1.3/hdfs_data
mkdir /usr/local/hadoop/hadoop-3.1.3/tmp
vi $HADOOP_HOME/etc/hadoop/core-site.xml
<configuration>
    <property>
        <!-- Set the address and port of the NameNode -->
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop1:9000</value>
    </property>

    <property>
        <!-- Temporary directory where data is stored. The path must be absolute; do not use the ~ symbol -->
        <name>hadoop.tmp.dir</name>
        <value>/usr/local/hadoop/hadoop-3.1.3/tmp</value>
    </property>

    <property>
        <!-- Where HDFS DataNode data is stored on the local file system -->
        <name>dfs.datanode.data.dir</name>
        <value>/usr/local/hadoop/hadoop-3.1.3/hdfs_data</value>
    </property>
</configuration>

Configure hdfs-site.xml

vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<configuration>
    <property>
        <!-- Set the host and port of the SecondaryNameNode -->
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop2:9868</value>
    </property>
</configuration>

Configure mapred-site.xml

If there is no mapred-site.xml file, check whether there is a template file mapred-site.xml.template and copy it to create mapred-site.xml
vi $HADOOP_HOME/etc/hadoop/mapred-site.xml
<configuration>
    <property>
        <!-- Set MapReduce to use the YARN framework -->
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

Configure yarn-site.xml

vi $HADOOP_HOME/etc/hadoop/yarn-site.xml
<configuration>
    <property>
        <!-- Set the shuffle service for MapReduce; use the default shuffle algorithm -->
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>

    <property>
        <!-- Specify the address of the ResourceManager -->
        <name>yarn.resourcemanager.address</name>
        <value>hadoop3</value>
    </property>
</configuration>

Configure workers

Note that in Hadoop 2 this configuration file is named slaves; in Hadoop 3 it has been renamed to workers

vi $HADOOP_HOME/etc/hadoop/workers

After the original content is deleted, add the following configuration

hadoop1
hadoop2
hadoop3

Copy the configured Hadoop to the other machines

scp -r /usr/local/hadoop/ root@hadoop2:/usr/local/hadoop/
scp -r /usr/local/hadoop/ root@hadoop3:/usr/local/hadoop/
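If the /etc/profile changes were made only on hadoop1 (that is, after cloning rather than before), the other machines also need them; one way, mirroring the /etc/hosts step above, is:

scp /etc/profile root@hadoop2:/etc/profile
scp /etc/profile root@hadoop3:/etc/profile
# Then, on each machine
source /etc/profile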

Formatting HDFS

Formatting initializes HDFS, the distributed file system, preparing the block storage used by the DataNodes; the initial metadata generated is stored on the NameNode.

When formatting, pay attention to the permissions of the hadoop.tmp.dir directory.

$HADOOP_HOME/bin/hdfs namenode -format

Formatting succeeded if "has been successfully formatted" is displayed.

After formatting succeeds, check whether the dfs directory exists in the directory specified by hadoop.tmp.dir in core-site.xml

tree -C -pf $HADOOP_HOME/tmp/

  • fsimage_* is the local file to which NameNode metadata in memory is persisted.
  • fsimage_*.md5 is a checksum file used to verify the integrity of fsimage_*.
  • seen_txid records the latest transaction ID.
  • VERSION file contents:
    • namespaceID: the unique ID of the NameNode.
    • clusterID: the cluster ID; the NameNode and DataNode cluster IDs must be the same.

cat $HADOOP_HOME/tmp/dfs/name/current/VERSION


Start the cluster

Run the following command on hadoop1

start-all.sh

If, when running as root, an error such as "Attempting to operate on hdfs namenode as root ... but there is no HDFS_NAMENODE_USER defined" occurs, it is because the startup scripts lack user definitions

Add the following information to the start-dfs.sh and stop-dfs.sh headers

HDFS_NAMENODE_USER=root
HDFS_DATANODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root

Add the following content to the start-yarn.sh and stop-yarn.sh headers

YARN_RESOURCEMANAGER_USER=root
YARN_NODEMANAGER_USER=root
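Alternatively, a sketch of an equivalent approach (not what this article describes): set these variables once in $HADOOP_HOME/etc/hadoop/hadoop-env.sh, which the startup scripts read.

export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root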

After the configuration is complete, run start-all.sh again
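To confirm that the cluster matches the functional allocation above (an optional check), run jps on each node and, assuming the default Hadoop 3 ports, open the web UIs:

# On hadoop1: expect NameNode, DataNode, NodeManager
# On hadoop2: expect SecondaryNameNode, DataNode, NodeManager
# On hadoop3: expect ResourceManager, DataNode, NodeManager
jps

# Web UIs (default ports in Hadoop 3.x)
# NameNode:        http://hadoop1:9870
# ResourceManager: http://hadoop3:8088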