Hadoop official guide: portal
As of this writing (January 8, 2020), the latest version of Hadoop is 3.2.1, and this document describes deployment and installation based on 3.2.1.
This article shows how to install and configure Hadoop clusters ranging from a few nodes to very large clusters with thousands of nodes. To get a quick feel for Hadoop, you may only need to install it on a single server.
This article does not cover advanced topics such as Hadoop security or high availability.
Preparing the server
Four servers are planned, all running CentOS 7.
Changing the host name
hostnamectl set-hostname centos-x
where x is the number of each server
The completed host plan is as follows:
Host name | IP | Planned services |
---|---|---|
centos-1 | 10.211.55.11 | DataNode NodeManager NameNode |
centos-2 | 10.211.55.12 | DataNode NodeManager SecondaryNameNode |
centos-3 | 10.211.55.13 | DataNode NodeManager ResourceManager |
centos-4 | 10.211.55.14 | DataNode NodeManager HistoryServer |
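For these host names to resolve on every node (the workers file and ssh-copy-id below use them), each server needs a hosts mapping. A minimal sketch based on the plan above, appended to /etc/hosts on all four servers (assumed here; skip it if DNS already resolves these names):
10.211.55.11 centos-1
10.211.55.12 centos-2
10.211.55.13 centos-3
10.211.55.14 centos-4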
Installing JDK 8
Install JDK through yum
yum update
yum install java-1.8.0-openjdk-devel -y
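Optionally, verify that the JDK is installed and on the PATH before continuing:
java -version
javac -version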
Modifying environment variables
vim /etc/profile
Append the following at the end:
export JAVA_HOME=$(dirname $(dirname $(readlink $(readlink $(which javac)))))
export PATH=$JAVA_HOME/bin:$PATH
Configure a static IP address for the server
vim /etc/sysconfig/network-scripts/ifcfg-enp0s5
The complete configuration is as follows (this example is for centos-2; set IPADDR on each server according to the plan above):
TYPE="Ethernet" PROXY_METHOD="none" BROWSER_ONLY="no" BOOTPROTO="static" # static DEFROUTE="yes" IPV4_FAILURE_FATAL="no" IPV6INIT="yes" IPV6_AUTOCONF="yes" IPV6_DEFROUTE="yes" IPV6_FAILURE_FATAL="no" IPV6_ADDR_GEN_MODE="stable-privacy" NAME="enp0s5" UUID="e2bda9d6-dc4f-4513-adbc-fdf3a1e2f384" DEVICE="enp0s5" ONBOOT="yes" # add GATEWAY=10.211.55.1 NAT IPADDR=10.211.55.12 # Allocate IP address NETMASK=255.255.255.0 # use ali public DNS1 DNS2=223.6.6.6 # Use Ali public DNS2Copy the code
Adding an hdfs user
You are advised to run HDFS and YARN as separate users.
In most installations, the HDFS processes run as the user "hdfs", and the YARN processes commonly run as the "yarn" account.
adduser hdfs
passwd hdfs    # set the password
Setting up passwordless SSH login
Configure this on all four servers:
ssh-keygen -t rsa
- Distribute the SSH keys
ssh-copy-id centos-1
ssh-copy-id centos-2
ssh-copy-id centos-3
ssh-copy-id centos-4
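To confirm the trust works, log in to another node over SSH; it should not prompt for a password. For example, from centos-1:
ssh centos-2 hostname    # should print centos-2 without asking for a password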
Installing and deploying Hadoop
Switch to the hdfs user
su - hdfs
Download
curl http://mirror.bit.edu.cn/apache/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz -O
Unpack
Extract the archive to /usr/local/:
tar -zxf hadoop-3.2.1.tar.gz -C /usr/local/
Modifying environment variables
sudo vim /etc/profile
Change the original configuration to
export JAVA_HOME=$(dirname $(dirname $(readlink $(readlink $(which javac)))))
export HADOOP_HOME=/usr/local/hadoop-3.2.1
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
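Reload the profile so that JAVA_HOME and HADOOP_HOME take effect in the current shell:
source /etc/profile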
Modify the configuration
First, create the data and temporary directories under $HADOOP_HOME:
mkdir -p $HADOOP_HOME/hdfs/data
mkdir -p $HADOOP_HOME/tmp
Configure hadoop-env.sh
sudo vim $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Add or modify
export JAVA_HOME=$(dirname $(dirname $(readlink $(readlink $(which javac)))))
Configure core-site.xml
vim etc/hadoop/core-site.xml
The configuration is as follows:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://10.211.55.11:4000</value>
<description>URI of HDFS, in the form filesystem://namenode-identifier:port</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop-3.2.1/tmp</value>
<description>Local hadoop temporary folder on namenode</description>
</property>
</configuration>
fs.defaultFS is the address of the NameNode. hadoop.tmp.dir is Hadoop's temporary directory; by default, the data files of the NameNode and DataNode are stored in subdirectories under it.
Configure hdfs-site.xml
vim etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>10.211.55.12:50090</value>
</property>
<property>
<name>dfs.http.address</name>
<value>10.211.55.11:50070</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop-3.2.1/hdfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop-3.2.1/hdfs/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
</configuration>
dfs.namenode.secondary.http-address specifies the HTTP address and port of the SecondaryNameNode; in our plan, the SecondaryNameNode runs on centos-2.
dfs.http.address is the HTTP address of the NameNode web UI. dfs.namenode.name.dir specifies the name directory, dfs.datanode.data.dir specifies the data directory, and dfs.replication specifies the number of replicas.
Configure workers
This file was called slaves in Hadoop 2.x and was renamed workers in 3.x. It specifies which nodes in the cluster are DataNodes; the IP addresses or host names of the nodes are separated by newlines.
vim etc/hadoop/workers
We'll use the host names here:
centos-1
centos-2
centos-3
centos-4
Configure yarn-site.xml
vim etc/hadoop/yarn-site.xml
The configuration is as follows:
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>centos-3</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>106800</value>
</property>
</configuration>
As planned, centos-3 serves as the ResourceManager. yarn.log-aggregation-enable enables log aggregation, and yarn.log-aggregation.retain-seconds sets how long aggregated logs are kept in HDFS.
Configure mapred-site.xml
vim etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>centos-4:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>centos-4:19888</value>
</property>
</configuration>
mapreduce.framework.name sets MapReduce jobs to run on YARN. mapreduce.jobhistory.address sets the MapReduce history server to run on centos-4, and mapreduce.jobhistory.webapp.address sets the web address and port of the history server. yarn.app.mapreduce.am.env, mapreduce.map.env, and mapreduce.reduce.env must all be set to HADOOP_MAPRED_HOME=${HADOOP_HOME}; otherwise the JAR packages cannot be found when the job runs on YARN.
Starting a Hadoop Cluster
After completing all the necessary configuration, distribute the Hadoop files to the same directory (/usr/local) on all servers; the HADOOP_CONF_DIR directory should be the same on every machine.
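A minimal way to distribute the directory, using scp and the SSH trust configured earlier (assuming the hdfs user can write to /usr/local on the target hosts), would be:
# run on centos-1 after finishing the configuration
for host in centos-2 centos-3 centos-4; do
  scp -r /usr/local/hadoop-3.2.1 ${host}:/usr/local/
done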
Formatting
To start the Hadoop cluster, both the HDFS and YARN clusters need to be started. When starting HDFS for the first time, you must format it. Format the new distributed file system as HDFS.
$HADOOP_HOME/bin/hdfs namenode -format <cluster name>
The formatted data is stored under $HADOOP_HOME/hdfs/data (the directory created earlier).
Start the cluster.
If workers and SSH trust are configured, we can run:
$HADOOP_HOME/sbin/start-dfs.sh
Start YARN
If workers and SSH trust are configured, we can run:
$HADOOP_HOME/sbin/start-yarn.sh
If workers and SSH trust have not been configured as above, we can start each daemon individually:
- Start the NameNode
$HADOOP_HOME/bin/hdfs --daemon start namenode
- Start the DataNode
$HADOOP_HOME/bin/hdfs --daemon start datanode
- Start the NodeManager
The NodeManager runs on every node in our plan, so execute this on each server
$HADOOP_HOME/bin/yarn --daemon start nodemanager
- Start the ResourceManager
The plan is on centos-3, so we execute it on centos-3
$HADOOP_HOME/bin/yarn --daemon start resourcemanager
- Start the HistoryServer
The plan is on centos-4, so we execute it on centos-4
$HADOOP_HOME/bin/mapred --daemon start historyserver
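Whichever way the daemons are started, you can check which processes are running on each node with jps (shipped with the JDK); on centos-1, for example, the plan above expects NameNode, DataNode, and NodeManager to appear:
jps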
View the HDFS Web page
It is on port 50070 of centos-1: http://centos-1:50070/
View the YARN Web page
It is on port 8088 of centos-3: http://centos-3:8088/
View the history Web page
It is on port 19888 of centos-4: http://centos-4:19888/
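The HDFS cluster state can also be checked from the command line; the following lists the live DataNodes and their capacity:
hdfs dfsadmin -report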
Testing
To test the cluster, we use the WordCount example
- Create a new file
sudo vim /opt/word.txt
- File contents
hadoop mapreduce hive
hbase spark storm
sqoop hadoop hive
spark hadoop
- Create a new directory named demo in HDFS
hadoop fs -mkdir /demo
- Write the file to HDFS
hdfs dfs -put /opt/word.txt /demo/word.txt
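You can verify that the file has landed in HDFS:
hdfs dfs -ls /demo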
- Run the job; the output is written to /output in HDFS
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount /demo/word.txt /output
- View the output file list
hdfs dfs -ls /output
Found 2 items
-rw-r--r-- 3 hdfs supergroup 0 2020-01-07 02:20 /output/_SUCCESS
-rw-r--r-- 3 hdfs supergroup 60 2020-01-07 02:20 /output/part-r-00000
- View the contents of the file
hdfs dfs -cat /output/part-r-00000
2020-01-07 16:40:19,951 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
hadoop  3
hbase   1
hive    2
mapreduce       1
spark   2
sqoop   1
storm   1