Hadoop official guide portal

As of this writing (January 8, 2020), the latest version of Hadoop is 3.2.1. This document describes deployment and installation based on 3.2.1.

This article shows how to install and configure Hadoop clusters ranging from a few nodes to very large clusters with thousands of nodes. To get a quick feel for Hadoop, you may only need to install it on a single server.
This article does not cover advanced topics such as Hadoop security or high availability.

Preparing the server

Four servers are planned. The operating system is CentOS 7.

Changing the host name

hostnamectl set-hostname centos-x

where x corresponds to the number of each of our servers

The completed host plan is as follows

Host name    IP              Planned services
centos-1     10.211.55.11    DataNode NodeManager NameNode
centos-2     10.211.55.12    DataNode NodeManager SecondaryNameNode
centos-3     10.211.55.13    DataNode NodeManager ResourceManager
centos-4     10.211.55.14    DataNode NodeManager HistoryServer
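
For these host names to resolve between the servers, each machine's /etc/hosts presumably needs matching entries (a sketch, not from the original article; skip this if you already have DNS for these names):

# /etc/hosts entries on every server (assumed)
10.211.55.11 centos-1
10.211.55.12 centos-2
10.211.55.13 centos-3
10.211.55.14 centos-4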

Installing JDK 8

Install JDK through yum

yum update
yum install java-1.8.0-openjdk-devel -y

Modifying environment variables

vim /etc/profile

Append the following at the end of the file

export JAVA_HOME=$(dirname $(dirname $(readlink $(readlink $(which javac)))))
export PATH=$JAVA_HOME/bin:$PATH
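
The change does not take effect in the current shell until the file is reloaded; assuming a bash-compatible shell, you can apply it with:

source /etc/profile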

Configure a static IP address for the server

vim /etc/sysconfig/network-scripts/ifcfg-enp0s5

The complete configuration is as follows

TYPE="Ethernet" PROXY_METHOD="none" BROWSER_ONLY="no" BOOTPROTO="static" # static DEFROUTE="yes" IPV4_FAILURE_FATAL="no" IPV6INIT="yes" IPV6_AUTOCONF="yes" IPV6_DEFROUTE="yes" IPV6_FAILURE_FATAL="no" IPV6_ADDR_GEN_MODE="stable-privacy" NAME="enp0s5" UUID="e2bda9d6-dc4f-4513-adbc-fdf3a1e2f384" DEVICE="enp0s5" ONBOOT="yes" # add GATEWAY=10.211.55.1 NAT IPADDR=10.211.55.12 # Allocate IP address NETMASK=255.255.255.0 # use ali public DNS1 DNS2=223.6.6.6 # Use Ali public DNS2Copy the code

Adding an HDFS user

You are advised to run HDFS and YARN as separate users.

In most installations, the HDFS processes run as the "hdfs" user, and YARN commonly uses a "yarn" account.

adduser hdfs
passwd hdfs    # set the password

Setting up SSH passwordless login

Configure this on all four servers. First generate an SSH key:

ssh-keygen -t rsa
  1. Distribute the SSH key to each host
ssh-copy-id centos-1
ssh-copy-id centos-2
ssh-copy-id centos-3
ssh-copy-id centos-4
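
To confirm that passwordless login works, you can try logging in to another node from each server; if no password prompt appears, the setup succeeded (a quick check, assuming the host names resolve):

ssh centos-2 hostname    # should print "centos-2" without asking for a password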

Installing and deploying Hadoop

Switch to the hdfs user

su - hdfs

Download

curl http://mirror.bit.edu.cn/apache/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz -O

Unpack

Unzip to /usr/local/

tar -zxf hadoop-3.2.1.tar.gz -C /usr/local/
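
Depending on how you extracted the archive, the hdfs user may not own the resulting directory. Since the Hadoop daemons will run as hdfs, a sketch of granting ownership (run as root; an assumption, not from the original article):

chown -R hdfs:hdfs /usr/local/hadoop-3.2.1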

Modifying environment variables

sudo vim /etc/profile

Change the configuration added earlier to

export JAVA_HOME=$(dirname $(dirname $(readlink $(readlink $(which javac)))))
export HADOOP_HOME=/usr/local/hadoop-3.2.1
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH

Modify the configuration

Now switch to the $HADOOP_HOME directory and start by creating the data and temp directories:

mkdir -p $HADOOP_HOME/hdfs/data
mkdir -p $HADOOP_HOME/tmp

Configuring hadoop-env.sh

sudo vim $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Add or modify

export JAVA_HOME=$(dirname $(dirname $(readlink $(readlink $(which javac)))))

Configuring core-site.xml

vim etc/hadoop/core-site.xml

The configuration is as follows

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://10.211.55.11:4000</value>
        <description>URI of HDFS: filesystem://namenode-host:port</description>
    </property>
    
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/local/hadoop-3.2.1/tmp</value>
        <description>Local hadoop temporary folder on namenode</description>
    </property>

</configuration>

fs.defaultFS is the NameNode address. hadoop.tmp.dir is the Hadoop temporary directory; by default, the NameNode and DataNode data files are stored in subdirectories of this directory.

Configuring hdfs-site.xml

vim etc/hadoop/hdfs-site.xml
<configuration>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>10.211.55.12:50090</value>
    </property>
    <property>
        <name>dfs.http.address</name>
        <value>10.211.55.11:50070</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>File: / usr/local/hadoop - 3.2.1 / HDFS/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>File: / usr/local/hadoop - 3.2.1 / HDFS/data</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
</configuration>

dfs.namenode.secondary.http-address specifies the HTTP address and port of the SecondaryNameNode; in our plan, the SecondaryNameNode runs on centos-2.

dfs.http.address is the HTTP address of the NameNode web UI. dfs.namenode.name.dir specifies the name directory, dfs.datanode.data.dir specifies the data directory, and dfs.replication specifies the number of replicas.

Configuring workers

This file was called slaves in Hadoop 2.x and was renamed to workers in 3.x. It lists the DataNodes of the HDFS cluster, one IP address or host name per line.

vim etc/hadoop/workers

We’ll use the hostname here

centos-1
centos-2
centos-3
centos-4

Configuring yarn-site.xml

vim etc/hadoop/yarn-site.xml

The configuration is as follows

<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>centos-3</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>106800</value>
  </property>
</configuration>

As planned, centos-3 is used as the ResourceManager. yarn.log-aggregation-enable enables log aggregation, and yarn.log-aggregation.retain-seconds sets how long aggregated logs are kept in HDFS.

Configuring mapred-site.xml

vim etc/hadoop/mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
    </property>
    <property>
        <name>mapreduce.map.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
    </property>
    <property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
    </property>
    
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>centos-4:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>centos-4:19888</value>
    </property>
</configuration>

mapreduce.framework.name sets MapReduce jobs to run on YARN. mapreduce.jobhistory.address sets the address of the MapReduce history server, which is installed on centos-4. mapreduce.jobhistory.webapp.address sets the history server's web UI address and port. yarn.app.mapreduce.am.env, mapreduce.map.env, and mapreduce.reduce.env must be set to HADOOP_MAPRED_HOME=${HADOOP_HOME}; otherwise the required JARs cannot be found when the job runs on YARN.

Starting a Hadoop Cluster

After completing all the necessary configuration, distribute the files to the HADOOP_CONF_DIR directory on all servers (here the installation lives under /usr/local/hadoop-3.2.1). This should be the same directory path on every machine. One way to do this is shown below.
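
A minimal sketch of distributing the installation from centos-1 to the other nodes (assuming rsync is installed and the hdfs user has write access to /usr/local on each host):

# run on centos-1; copies the whole Hadoop directory to the other three servers
for host in centos-2 centos-3 centos-4; do
  rsync -a /usr/local/hadoop-3.2.1/ ${host}:/usr/local/hadoop-3.2.1/
done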

Formatting

To start the Hadoop cluster, both the HDFS and YARN services need to be started. When starting HDFS for the first time, you must format it, creating a new distributed file system:

$HADOOP_HOME/bin/hdfs namenode -format <cluster name>

After formatting, you can check the $HADOOP_HOME/hdfs/data directory to verify the result.

Start HDFS

If workers and SSH trust are configured as above, we can simply run

$HADOOP_HOME/sbin/start-dfs.sh

Start YARN

If workers and SSH trust are configured as above, we can simply run

$HADOOP_HOME/sbin/start-yarn.sh

If workers and SSH trust are not configured as above, we can start each daemon individually

  1. Start the NameNode
$HADOOP_HOME/bin/hdfs --daemon start namenode
  1. Start the DataNode
$HADOOP_HOME/bin/hdfs --daemon start datanode

Start the NodeManager

According to the plan, NodeManagers run on every node, so execute this on each of the four servers

$HADOOP_HOME/bin/yarn --daemon start nodemanager

Start the ResourceManager

According to the plan, the ResourceManager runs on centos-3, so we execute this on centos-3

$HADOOP_HOME/bin/yarn --daemon start resourcemanager

Start the HistoryServer

According to the plan, the HistoryServer runs on centos-4, so we execute this on centos-4

$HADOOP_HOME/bin/mapred --daemon start historyserver
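
At this point you can do a quick sanity check with jps on each node; the exact process list varies per node, but as a rough sketch of what our plan expects on centos-1:

jps
# Expected (roughly) on centos-1: NameNode, DataNode, NodeManager, plus Jps itself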

View the HDFS Web page

This is on port 50070 of centos-1: http://centos-1:50070/

View the YARN Web page

This is on port 8088 of centos-3: http://centos-3:8088/

View the history server web page

This is on port 19888 of centos-4: http://centos-4:19888/

Testing

To test the cluster, we run the WordCount example.

  1. Create a new file
sudo vim /opt/word.txt
  1. Add the following text content
hadoop mapreduce hive
hbase spark storm
sqoop hadoop hive
spark hadoop
  1. Create a demo folder in HDFS
hadoop fs -mkdir /demo
  1. Upload the file to HDFS
hdfs dfs -put /opt/word.txt /demo/word.txt
  1. Run the WordCount job with the file as input, writing results to /output
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount /demo/word.txt /output
  1. View the output file list
hdfs dfs -ls /output
Found 2 items
-rw-r--r--   3 hdfs supergroup          0 2020-01-07 02:20 /output/_SUCCESS
-rw-r--r--   3 hdfs supergroup         60 2020-01-07 02:20 /output/part-r-00000
  1. View the contents of the file
hdfs dfs -cat /output/part-r-00000
2020-01-07 16:40:19,951 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
hadoop	3
hbase	1
hive	2
mapreduce	1
spark	2
sqoop	1
storm	1
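
Note that if you rerun the job, the /output directory must not already exist; MapReduce refuses to overwrite an existing output directory, so remove it first:

hdfs dfs -rm -r /output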