I have written a blog post about getting started with and installing Hadoop before, but I always felt it had some shortcomings. This time, I will sort out Hadoop's related concepts and installation methods more thoroughly.

What is Hadoop

We are living in an era of data explosion. With the rapid growth of data, there is an urgent need to solve the storage and computation problems of massive data, and that is where Hadoop comes in. Hadoop is a framework for distributed storage and distributed computing of massive data. Distributed storage simply means that data is not stored on a single machine but spread across multiple machines. Distributed computing means that many machines process the data in parallel; the Java programs we usually write are standalone programs that run on one machine, so their processing capacity is limited. This is just an intuitive way to understand it, and we will analyze it in detail later. The author of Hadoop, Doug Cutting, named the framework almost by accident: his kid had a stuffed elephant toy and kept pointing at it and calling it "Hadoop, Hadoop", so he named the framework after it.

Hadoop distribution introduction

Let's look at the distributions of Hadoop. What is a distribution? Take mobile operating systems as an example: there are two camps, Apple's iOS and Google's Android. iOS is closed source, so there are no third-party distributions; if you built a new mobile system based on iOS, Apple would sue you into bankruptcy. So iOS has only the official version. Android is open source, so many phone manufacturers repackage it: they feel the native Android interface looks plain, or that some functions do not suit local users, so they modify it. For example, domestic phone manufacturers such as Meizu, Xiaomi and Smartisan have built their own operating systems based on Android; these are distributions of Android. The same is true for Hadoop, which has become almost synonymous with big data and has formed a complete big data ecosystem. Moreover, Hadoop is open sourced by Apache, and its license allows anyone to modify it and release it as an open source or commercial version. So there are many distributions of Hadoop, including those from Huawei and Intel as well as Cloudera's CDH and Hortonworks' HDP, all of which are derived from Apache Hadoop. Here we pick a few key ones to analyze:

Official native version: Apache Hadoop

Apache is a non-profit organization in the IT field, somewhat like the Red Cross; all software under Apache is open source. We will learn native Hadoop, and since the other distributions are compatible with it, compatibility is not a concern. The disadvantage of native Hadoop is that there is no commercial technical support: you have to solve problems yourself or ask questions in the community on the official website, where replies are usually slow and there is no guarantee the problem will be solved. Another drawback is that building a cluster with native Hadoop is troublesome, because many configuration files need to be modified. If the cluster has many machines, the pressure on operations staff is considerable, which we will feel when we build a cluster later.

Cloudera Hadoop (CDH)

Note that CDH is a commercial version that makes some improvements to the official version. It provides paid technical support and a graphical interface, which makes cluster operation and maintenance easier. CDH is currently widely used in enterprises. Although CDH is a paid product, some of its basic functions are free and can be used indefinitely; only the advanced functions require payment. If you do not want to pay, you can still make do with the free parts.

Hortonworks (HDP)

HDP is open source and also provides a graphical interface for convenient operation and maintenance management, so Internet companies generally prefer it. One more piece of news: Hortonworks has been acquired by Cloudera, so HDP and CDH now belong to the same company. Whether HDP will eventually be merged into CDH remains to be seen and depends on the company's strategy. Final suggestion: when building a big data platform in real work, choose CDH or HDP, which are convenient for operation and maintenance management; otherwise the operations team will cry when managing a native Hadoop cluster with thousands of machines.

Hadoop version evolution history

Hadoop has gone through three major versions, from 1.x to 2.x to 3.x.

Each major version upgrade has brought qualitative improvements. Let's first analyze the changes of the three versions at the architectural level:

Hadoop 1.x: HDFS + MapReduce

Hadoop 2.x: HDFS + YARN + MapReduce

Hadoop 3.x: HDFS + YARN + MapReduce

Detailed optimizations in Hadoop 3.x

In Hadoop 3, HDFS supports erasure coding, a durable data storage method that saves more space than replica storage: with the same fault tolerance, storage overhead can be roughly halved. Details: hadoop.apache.org/docs/r3.0.0…

HDFS in Hadoop 2 supports at most two NameNodes (one active, one standby), while HDFS in Hadoop 3 supports more than two NameNodes (one active, multiple standbys). Details: hadoop.apache.org/docs/r3.0.0…

MapReduce adds task-level native optimization: support for a native implementation of the map output collector. For shuffle-intensive jobs this can bring a performance improvement of about 30%. Details: issues.apache.org/jira/browse…

Overall, the main difference between Hadoop 3 and Hadoop 2 is that the new version provides better optimization and usability. Details: hadoop.apache.org/docs/r3.0.0…
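As a small illustration of the erasure coding feature (not part of the installation steps below; it assumes an already running Hadoop 3 cluster, and /ec-data is a hypothetical directory), the policy for a directory can be managed with the hdfs ec command:

# List the erasure coding policies built into Hadoop 3
hdfs ec -listPolicies
# Enable one of the built-in Reed-Solomon policies
hdfs ec -enablePolicy -policy RS-6-3-1024k
# Apply it to a directory; new files written under it are stored with erasure coding instead of replicas
hdfs ec -setPolicy -path /ec-data -policy RS-6-3-1024k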

The three core components of Hadoop

Hadoop consists of three core components: HDFS + MapReduce + YARN. HDFS is responsible for the distributed storage of massive data. MapReduce is a computing model responsible for the distributed computation of massive data. YARN manages and schedules cluster resources.

Pseudo-distributed cluster installation

Next, let’s first look at the installation of a pseudo-distributed cluster. Take a look at this diagram

This diagram represents a Linux machine, or node, on which the JDK environment is installed

NameNode, SecondaryNameNode, and DataNode are HDFS service processes. ResourceManager and NodeManager are YARN service processes. MapReduce has no process here because it is a computing framework; once the Hadoop cluster is installed, MapReduce jobs can run on it.

Before installing the cluster, you need to download the Hadoop installation package. In this case, Hadoop 3.2.0 is used.

There is a download button on the Hadoop website. If you go to the Apache Release Archive link, you can find the installation packages for all versions.
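For example, assuming the standard Apache archive layout, the 3.2.0 package can be downloaded from the command line roughly like this (the URL pattern is an assumption; verify it on the download page):

# Download Hadoop 3.2.0 from the Apache release archive
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz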



Note: if you find that downloading from this foreign address is slow, you can use a domestic mirror instead. However, the versions provided by domestic mirrors may be incomplete, so if you cannot find the version we need there, you still have to download it from the official website.

These domestic mirrors contain not only Hadoop installation packages but also most of the software packages of the Apache organization.

Address 1:

mirror.bit.edu.cn/apache/

Address 2:

mirrors.tuna.tsinghua.edu.cn/apache

With the installation package downloaded, we’ll start installing the pseudo-distributed cluster.

The machine bigdata01 is used here.

First configure the base environment

IP, hostname, firewalld, SSH password-free login, JDK
  • IP: Set a static IP address
[root@bigdata01 ~]# vi /etc/sysconfig/network-scripts/ifcfg-ens33
TYPE="Ethernet"
PROXY_METHOD="none"
BROWSER_ONLY="no"
BOOTPROTO="static"
DEFROUTE="yes"
IPV4_FAILURE_FATAL="no"
IPV6INIT="yes"
IPV6_AUTOCONF="yes"
IPV6_DEFROUTE="yes"
IPV6_FAILURE_FATAL="no"
IPV6_ADDR_GEN_MODE="stable-privacy"
NAME="ens33"
UUID="9a0df9ec-a85f-40bd-9362-ebe134b7a100"
DEVICE="ens33"
IPADDR=192.168.182.100
GATEWAY=192.168.182.2
DNS1=192.168.182.2
[root@bigdata01 ~]# service network restart 
Restarting network (via systemctl): [ OK ] 
[root@bigdata01 ~]# ip addr 
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group def link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 
inet 127.0.0.1/8 scope host lo 
valid_lft forever preferred_lft forever 
inet6 ::1/128 scope host 
valid_lft forever preferred_lft forever 
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state U 
link/ether 00:0c:29:9c:86:11 brd ff:ff:ff:ff:ff:ff 
inet 192.168.182.100/24 brd 192.168.182.255 scope global noprefixroute en 
valid_lft forever preferred_lft forever
inet6 fe80::c8a8:4edb:db7b:af53/64 scope link noprefixroute 
valid_lft forever preferred_lft forever
  • Hostname: Set the temporary and permanent host names
[root@bigdata01 ~]# hostname bigdata01 
[root@bigdata01 ~]# vi /etc/hostname 
bigdata01
  • Firewalld: temporarily disable the firewall + permanently disable the firewall
[root@bigdata01 ~]# systemctl stop firewalld 
[root@bigdata01 ~]# systemctl disable firewalld
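To confirm the firewall is really off (an optional check, not part of the original steps), query its status:

[root@bigdata01 ~]# systemctl status firewalld
# Expect "Active: inactive (dead)" and, after the disable command above, "disabled" in the Loaded line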
  • SSH Password-free login

SSH is a secure shell that you can use to log in to a remote Linux machine. The Hadoop cluster uses SSH: when we start the cluster, we only need to run the start command on one machine, and Hadoop then connects to the other machines over SSH and starts the corresponding processes on them. The problem is that SSH normally asks for a password when connecting to another machine, so we need to set up password-free SSH login. Some readers may wonder: you said multiple machines need password-free login, but our pseudo-distributed cluster has only one machine. Note that no matter how many machines the cluster has, the startup procedure is the same: everything is done through SSH remote connections. Even with a single machine, Hadoop uses SSH to connect to the machine itself, and right now connecting to ourselves over SSH still requires a password.

SSH password-free login will be discussed in detail

SSH is a secure (encrypted) shell that uses asymmetric encryption. There are two types of encryption: symmetric and asymmetric. Asymmetric encryption is considered secure because the private key cannot be derived from the public key.

Asymmetric encryption generates a key pair consisting of a public key and a private key: the public key can be shared openly, while the private key must be kept secret.

The SSH password-free login process works roughly like this: the first machine (the client) first gives its public key to the second machine (the server), which stores it in its list of authorized keys.

When the first machine wants to log in to the second machine,

the second machine sends it a random challenge string.

The first machine encrypts (signs) the challenge with its private key and sends the result back to the second machine.

The second machine then uses the stored public key to verify the response. Because the public key and private key are mathematically paired, only the holder of the matching private key can produce a valid response. If the verification succeeds, the second machine considers the first machine trusted and allows the login; otherwise it is treated as an illegitimate machine.

Now let's actually configure SSH password-free login. Since we are configuring password-free login to ourselves, the "first machine" and the "second machine" here are the same machine.

First, execute ssh-keygen -t rsa on bigdata01.

Here rsa refers to the RSA encryption algorithm.

Note: after executing this command, just press Enter four times until you return to the Linux command line; you do not need to type anything at the prompts.

After the command finishes, the public key and private key files are generated in the ~/.ssh directory.

[root@bigdata01 ~]# ll ~/.ssh/ 
total 12 
-rw-------. 1 root root 1679 Apr 7 16:39 id_rsa 
-rw-r--r--. 1 root root 396 Apr 7 16:39 id_rsa.pub 
-rw-r--r--. 1 root root 203 Apr 7 16:21 known_hosts

The next step is to copy the public key to the machine that requires password-free login

[root@bigdata01 ~]# cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
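Alternatively (not part of the original steps), if the ssh-copy-id tool is available it achieves the same thing and also works for the remote machines we will add later:

# Append the local public key to bigdata01's authorized_keys (asks for the password once)
[root@bigdata01 ~]# ssh-copy-id root@bigdata01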

You can then log in to the bigdata01 machine over SSH without a password.

[root@bigdata01 ~]# ssh bigdata01 
Last login: Tue Apr 7 15:05:55 2020 from 192.168.182.1 
[root@bigdata01 ~]
  • JDK installation
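The JDK installation steps are not repeated here. As a rough sketch (the tarball name and extracted directory are assumptions; the target path follows the JAVA_HOME used later in this article), it looks like this:

# Extract a JDK 8 tarball (name assumed; use whichever you downloaded) and expose it at /usr/java/jdk1.8
mkdir -p /usr/java
tar -zxvf jdk-8u202-linux-x64.tar.gz -C /usr/java/
mv /usr/java/jdk1.8.0_202 /usr/java/jdk1.8

# Make JAVA_HOME available system-wide
echo 'export JAVA_HOME=/usr/java/jdk1.8' >> /etc/profile
echo 'export PATH=$JAVA_HOME/bin:$PATH' >> /etc/profile
source /etc/profile
java -version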

Install Hadoop

  • First, upload the Hadoop installation package to the /home/software directory
[root@bigdata01 software]# ll
total 527024
-rw-r--r--. 1 root root 345625475 Jul 19 2019 hadoop-3.2.0.tar.gz
drwxr-xr-x. 4 root root       ... Dec 16 2018 jdk1.8
  • Decompress the Hadoop installation package and go to /usr/local/.
[root@bigdata01 software]# tar -zxvf hadoop-3.2.0.tar.gz

There are two important directories under the Hadoop directory, the bin directory and the sbin directory



Take a look at the bin directory, which contains HDFS and YARN scripts. These scripts are used to operate HDFS and YARN components in the Hadoop cluster

Take a look at the sbin directory, which contains many start-stop scripts that start or stop components in the cluster. Since we will be using some scripts under the bin and sbin directories, we need to configure the environment variables for ease of use.
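A minimal sketch of that environment-variable setup (assuming the package was extracted to /home/software/hadoop-3.2.0; adjust the path to wherever you unpacked it):

# Point HADOOP_HOME at the extracted directory and put bin/ and sbin/ on the PATH
echo 'export HADOOP_HOME=/home/software/hadoop-3.2.0' >> /etc/profile
echo 'export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH' >> /etc/profile
source /etc/profile
hadoop version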

  • Modify Hadoop configuration files



Mainly modify the following files:

hadoop-env.sh
core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml
workers

First modify the hadoop-env.sh file and add the environment variable information at the end of the file. JAVA_HOME specifies the Java installation location; HADOOP_LOG_DIR specifies the directory where Hadoop logs are stored.

export JAVA_HOME=/usr/java/jdk1.8
export HADOOP_LOG_DIR=/usr/local/hadoop_repo/logs/hadoop

Modify the core-site.xml file. Note that the host name in the fs.defaultFS property must match the hostname you configured.

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://bigdata01:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/local/hadoop_repo</value>
    </property>
</configuration>

Modify the hdfs-site.xml file to set the number of file replicas in HDFS to 1, because the pseudo-distributed cluster has only one node.

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

Modify mapred-site.xml to set the resource scheduling framework used by MapReduce.

<configuration>
   <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
   </property>
</configuration>

Modify yarn-site.xml to set the services supported by YARN and the environment variable whitelist.

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>

Modify workers to set the host names of the slave nodes in the cluster. Since there is only one node here, just fill in bigdata01.

[root@bigdata01 hadoop]# vi workers 
bigdata01

The configuration files have been modified, but the cluster cannot be started directly yet, because HDFS, the distributed file system in Hadoop, must be formatted before its first use, similar to how a newly bought disk needs to be formatted before it can be used.

  • Formatting HDFS
hdfs namenode -format

If an error is reported, it is usually caused by the configuration files; of course, you need to analyze the problem according to the specific error message. Note: the format operation should only be performed once. If formatting fails, fix the configuration files and format again; if it succeeds, do not format again, otherwise the cluster will stop working properly. If you really need to reformat, delete the contents of the /usr/local/hadoop_repo directory first and then format. Think of it like a disk: you would not casually reformat a disk that is already in use, because after formatting even the operating system on it would have to be reinstalled.
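As a sketch of that re-format procedure (only for the failure case described above; the path follows the configuration in this article):

# Only if you must reformat: wipe the data directory first, then format again
rm -rf /usr/local/hadoop_repo/*
hdfs namenode -format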

  • Start the pseudo-distributed cluster

Run the start-all.sh script in the sbin directory

[root@bigdata01 hadoop-3.2.0]# sbin/start-all.sh 
Starting namenodes on [bigdata01] 
ERROR: Attempting to operate on hdfs namenode as root 
ERROR: but there is no HDFS_NAMENODE_USER defined. Aborting operation. 
Starting datanodes 
ERROR: Attempting to operate on hdfs datanode as root 
ERROR: but there is no HDFS_DATANODE_USER defined. Aborting operation. 
Starting secondary namenodes [bigdata01] 
ERROR: Attempting to operate on hdfs secondarynamenode as root 
ERROR: but there is no HDFS_SECONDARYNAMENODE_USER defined. Aborting operation. 
Starting resourcemanager 
ERROR: Attempting to operate on yarn resourcemanager as root 
ERROR: but there is no YARN_RESOURCEMANAGER_USER defined. Aborting operation. 
Starting nodemanagers 
ERROR: Attempting to operate on yarn nodemanager as root 
ERROR: but there is no YARN_NODEMANAGER_USER defined. Aborting operation.

Many ERROR messages are displayed, indicating that some user definitions for HDFS and YARN are missing. The solution: modify the start-dfs.sh and stop-dfs.sh scripts in the sbin directory and add the following content at the beginning of each script.

HDFS_DATANODE_USER=root 
HDFS_DATANODE_SECURE_USER=hdfs 
HDFS_NAMENODE_USER=root 
HDFS_SECONDARYNAMENODE_USER=root

Modify the start-yarn.sh and stop-yarn.sh scripts in the sbin directory, and add the following content to the files

YARN_RESOURCEMANAGER_USER=root 
HADOOP_SECURE_DN_USER=yarn 
YARN_NODEMANAGER_USER=root

Now run sbin/start-all.sh again and the cluster starts successfully.

  • Verify cluster process information

You can run the jps command to view the process information of the cluster. Apart from the Jps process itself, the cluster starts normally only if the other five processes are present.
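The expected output looks roughly like this (the <pid> placeholders stand for process IDs, which will differ on your machine):

[root@bigdata01 hadoop-3.2.0]# jps
<pid> NameNode
<pid> DataNode
<pid> SecondaryNameNode
<pid> ResourceManager
<pid> NodeManager
<pid> Jps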



You can also verify cluster services on the web user interface (webui)

HDFS webui:

http://192.168.182.100:9870

YARN webui:

http://192.168.182.100:8088

  • Stop the cluster
[root@bigdata01 hadoop-3.2.0]# sbin/stop-all.sh

Distributed Cluster Installation

  • Environment preparation

Now that we're done with the pseudo-distributed cluster, let's see what a real distributed cluster looks like.

If you look at this diagram, there are three nodes: the one on the left is the master node, and the two on the right are slave nodes. The Hadoop cluster supports a master/slave architecture.

By default, different processes are started on different nodes.

Let's implement a Hadoop cluster with one master and two slaves according to the plan in the figure.

Environment: Three nodes

Bigdata01 192.168.182.100

Bigdata02 192.168.182.101

Bigdata03 192.168.182.102
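According to this plan, the processes are distributed across the three nodes roughly as follows (this matches the configuration and the jps output shown later in this article):

bigdata01 (master): NameNode, SecondaryNameNode, ResourceManager
bigdata02 (slave):  DataNode, NodeManager
bigdata03 (slave):  DataNode, NodeManager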

Note: The basic environment of each node must be configured in advance, including IP, hostname, Firewalld, SSH password-free login and JDK

Delete the Hadoop that was previously installed on bigdata01: remove the decompressed directory and adjust the environment variables accordingly. Suppose we now have three Linux machines, each with a brand-new environment, and start from there.

Note: the steps for configuring the basic environment (IP, hostname, firewalld, JDK) on these three machines are not recorded here again; refer to the earlier steps for details. Assume that bigdata01, bigdata02 and bigdata03 already have the basic environment configured: IP, hostname, firewalld, SSH password-free login (to themselves) and JDK.

Once these basic environments are configured, there are still a few more things to set up.

Configure /etc/hosts. Because the master node needs to connect to the two slave nodes remotely, it must be able to resolve their host names so that it can use host names instead of IP addresses for remote access (by default only IP addresses can be used). Therefore, configure the IP address and host name of every node in the /etc/hosts file of each node. It is best to include the current node's own entry as well, so that the file content is identical on every machine and can be copied directly to the other two slave nodes. The following needs to be configured in /etc/hosts on bigdata01:

[root@bigdata01 ~]# vi /etc/hosts 
192.168.182.100 bigdata01 
192.168.182.101 bigdata02
192.168.182.102 bigdata03
[root@bigdata02 ~]# vi /etc/hosts 
192.168.182.100 bigdata01 
192.168.182.101 bigdata02
192.168.182.102 bigdata03
[root@bigdata03 ~]# vi /etc/hosts 
192.168.182.100 bigdata01 
192.168.182.101 bigdata02
192.168.182.102 bigdata03
  • Synchronize time between cluster nodes

Whenever a cluster involves multiple nodes, their clocks need to be synchronized. If the time difference between nodes is too large, the stability of the cluster will suffer and it may even malfunction. Use the ntpdate -u ntp.sjtu.edu.cn command to synchronize time; at first the command cannot be found:

[root@bigdata01 ~]# ntpdate -u ntp.sjtu.edu.cn 
-bash: ntpdate: command not found

ntpdate is not available by default. Install it with yum install -y ntpdate, then run ntpdate -u ntp.sjtu.edu.cn manually to confirm that time synchronization works. Finally, add the command to the Linux crontab so that it runs once a minute:

[root@bigdata01 ~]# vi /etc/crontab 
* * * * * root /usr/sbin/ntpdate -u ntp.sjtu.edu.cn

Then configure time synchronization on bigdata02 and bigdata03 as well. First, on bigdata02:

[root@bigdata02 ~]# yum install -y ntpdate 
[root@bigdata02 ~]# vi /etc/crontab 
* * * * * root /usr/sbin/ntpdate -u ntp.sjtu.edu.cn

Operate on the BigData03 node

[root@bigdata03 ~]# yum install -y ntpdate 
[root@bigdata03 ~]# vi /etc/crontab 
* * * * * root /usr/sbin/ntpdate -u ntp.sjtu.edu.cn
  • Complete the SSH password-free login setup

Note: so far, each machine can only log in to itself without a password. Ultimately, the master node needs to be able to log in to all nodes without a password, so the password-free login setup must be completed. First, run the following commands on bigdata01 to copy its public key to the two slave nodes:

[root@bigdata01 ~]# scp ~/.ssh/authorized_keys bigdata02:~/ 
[root@bigdata01 ~]# scp ~/.ssh/authorized_keys bigdata03:~/ 

Then execute on BigData02 and BigData03

[root@bigdata02/3 ~]# cat ~/authorized_keys >> ~/.ssh/authorized_keys

To verify, use SSH on the bigdata01 node to connect to the two slave nodes. If no password is required, it works; the master node can now log in to all nodes without a password.

[root@bigdata01 ~]# ssh bigdata02 
Last login: Tue Apr 7 21:33:58 2020 from bigdata01 
[root@bigdata02 ~]# exit 
logout
Connection to bigdata02 closed. 
[root@bigdata01 ~]# ssh bigdata03
Last login: Tue Apr 7 21:17:30 2020 from 192.168.182.1 
[root@bigdata03 ~]# exit
logout
Connection to bigdata03 closed.
[root@bigdata01 ~]#

Do the slave nodes also need password-free login to each other? No, because only the master node needs to connect to the other nodes remotely when starting the cluster.

  • Install Hadoop

OK. Now that the basic environment of the three nodes is configured, we need to install Hadoop on them, starting with the bigdata01 node.

1: Upload the hadoop-3.2.0.tar.gz installation package to the /home/software directory on the Linux machine.
2: Decompress the Hadoop installation package.
3: Modify the Hadoop configuration files; go to the directory where the configuration files are located:

[root@bigdata01 soft]# cd hadoop-3.2.0/etc/hadoop/
[root@bigdata01 hadoop]#

First modify the hadoop-env.sh file and add the environment variable information at the end of the file

[root@bigdata01 hadoop]# vi hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.8
export HADOOP_LOG_DIR=/data/hadoop_repo/logs/hadoop

Modify the core-site.xml file. Note that the host name in fs.defaultFS must be that of the master node.

[root@bigdata01 hadoop]# vi core-site.xml
<configuration>
<property> 
  <name>fs.defaultFS</name> 
  <value>hdfs://bigdata01:9000</value> 
</property>
<property> 
  <name>hadoop.tmp.dir</name>
  <value>/usr/local/hadoop_repo</value> 
</property> 
</configuration>

Modify the hdfs-site.xml file to set the number of file replicas in HDFS to 2 (at most 2, because the cluster has two slave nodes), and to specify the node where the SecondaryNameNode process runs.

[root@bigdata01 hadoop]# vi hdfs-site.xml 
<configuration>
  <property> 
    <name>dfs.replication</name> 
    <value>2</value> 
  </property> 
  <property> 
    <name>dfs.namenode.secondary.http-address</name> 
    <value>bigdata01:50090</value> 
  </property>
</configuration>

Modify mapred-site.xml to set the resource scheduling framework used by MapReduce.

[root@bigdata01 hadoop]# vi mapred-site.xml 
<configuration> 
  <property> 
      <name>mapreduce.framework.name</name> 
      <value>yarn</value>
  </property> 
</configuration>

Modify yarn-site.xml to set the services supported by YARN and the environment variable whitelist. For a distributed cluster, this file must also set the hostname of the ResourceManager; otherwise the NodeManagers cannot find the ResourceManager node.

[root@bigdata01 hadoop]# vi yarn-site.xml
<configuration>
  <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
  </property>
  <property>
      <name>yarn.nodemanager.env-whitelist</name>
      <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
  <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>bigdata01</value>
  </property>
</configuration>

Modify the workers configuration to list the slave nodes.

[root@bigdata01 hadoop]# vi workers 
bigdata02 
bigdata03

Modify the startup scripts: edit the start-dfs.sh and stop-dfs.sh script files and add the following content at the beginning of each file.

HDFS_DATANODE_USER=root 
HDFS_DATANODE_SECURE_USER=hdfs 
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root

Modify the start-yarn.sh and stop-yarn.sh script files and add the following content to the beginning of the files

YARN_RESOURCEMANAGER_USER=root 
HADOOP_SECURE_DN_USER=yarn 
YARN_NODEMANAGER_USER=root
  • Copy the installation package with the modified configuration from bigdata01 to the other two slave nodes
[root@bigdata01 sbin]# cd /usr/local/
[root@bigdata01 soft]# scp -rq hadoop-3.2.0 bigdata02:/usr/local/
[root@bigdata01 soft]# scp -rq hadoop-3.2.0 bigdata03:/usr/local/
  • Format HDFS on bigData01 node
[root@bigdata01 hadoop-3.2.0]# hdfs namenode -format

If the following line appears in the log output, the NameNode has been formatted successfully.

Storage: Storage directory /usr/local/hadoop_repo/dfs/name has been successfully formatted.
  • To start the cluster, run the following command on the BigData01 node
[root@bigdata01 hadoop-3.2.0]# sbin/start-all.sh
  • Validate the cluster

Run the jps command on the three machines and check the process information. On the bigdata01 node:

[root@bigdata01 hadoop-3.2.0]# jps 
6128 NameNode 
6621 ResourceManager 
6382 SecondaryNameNode

Execute on the BigData02 node

[root@bigdata02 ~]# jps 
2385 NodeManager 
2276 DataNode

Execute on the BigData03 node

[root@bigdata03 ~]# jps 
2385 NodeManager 
2276 DataNode
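Besides jps, an optional check (not part of the original steps) is to ask the NameNode for a cluster report from the master node; it should list two live DataNodes, bigdata02 and bigdata03:

# Should report "Live datanodes (2)"
[root@bigdata01 hadoop-3.2.0]# hdfs dfsadmin -report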
  • Stop the cluster

Run the stop command on the BigData01 node

[root@bigdata01 hadoop-3.2.0]# sbin/stop-all.sh

At this point, the Hadoop distributed cluster is installed successfully!

Hadoop client node

In practice, it is not recommended to log in to the cluster nodes directly to operate the cluster; exposing cluster nodes to ordinary developers is not safe.

Instead, it is recommended to install Hadoop on a separate business machine and keep its Hadoop configuration consistent with that of the cluster. The Hadoop cluster can then be operated from this machine, which is called a Hadoop client node.

There may be multiple Hadoop client nodes; in theory, any machine from which we want to operate the Hadoop cluster can be configured as a client node. Note that you do not start any Hadoop processes on the client node; it is only used as a client.
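A minimal sketch of setting up such a client node (the host name bigdata04 and the paths are assumptions; the key points are that the client reuses the cluster's configuration, has a JDK and /etc/hosts entries for the cluster, and starts no daemons):

# On the master, from the directory that holds the configured hadoop-3.2.0: copy it to the client machine
scp -rq hadoop-3.2.0 bigdata04:/usr/local/

# On bigdata04 (hypothetical client host): set the environment, then operate the cluster directly
echo 'export HADOOP_HOME=/usr/local/hadoop-3.2.0' >> /etc/profile
echo 'export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH' >> /etc/profile
source /etc/profile
hdfs dfs -ls /    # goes to hdfs://bigdata01:9000 as configured in core-site.xml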