1. Environment description

  • Based on Apache Hadoop version 3.1.3
  • Requires a JDK environment
  • Hadoop 3.1.3 download address

2. Hadoop Architecture

  • In its narrow sense, Hadoop consists of three main components: distributed storage (HDFS), distributed computing (MapReduce), and distributed resource management and scheduling (YARN)

2.1 HDFS architecture

  • Mainly responsible for data storage

  • NameNode: Manages the namespace, stores block mapping information (metadata), and handles client access to HDFS
  • SecondaryNameNode: Despite the name, not a hot standby of the NameNode. It periodically merges the namespace image (fsimage) with the edit log (edits) and returns the merged fsimage to the NameNode, which shortens NameNode restart and recovery time
  • DataNode: Stores the actual file data. Files are split into blocks, and each block is stored as multiple replicas on different DataNodes

2.2 Yarn architecture

  • Responsible for job scheduling and cluster resource management (a short sketch after the component list shows how to inspect these components from the command line)

  • ResourceManager(RM):
    • Handles job submissions and resource requests
    • Monitors NodeManager status
    • Starts and monitors ApplicationMasters
  • NodeManager(NM):
    • Manages the resources on its node
    • Periodically reports the node's resource usage and the running status of its containers to the RM
    • Handles requests from the AM to start and stop containers
  • Container:
    • The unit in which tasks run and YARN's abstraction of resources. It encapsulates multi-dimensional resources on a node, such as memory, CPU, disk, and network. Resources granted by the RM to an AM are represented as Containers; YARN assigns a Container to each task, and the task can use only the resources described in its Container
  • ApplicationMaster(AM):
    • Each job starts one AM on an NM; the AM sends start requests for MapTask and ReduceTask to the NMs and applies to the RM for the resources the tasks need
    • Interacts with the RM to request resource containers (for example, resources for running the job and for individual tasks)
    • Starts and stops tasks and monitors the running status of all tasks; if a task fails, it reapplies for resources and restarts the task
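
Once the cluster is running (section 6), these components can be observed with the standard YARN CLI; a minimal sketch:

yarn node -list            # NodeManagers registered with the ResourceManager
yarn application -list     # running applications and their ApplicationMasters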

3. Cluster planning

Unless otherwise specified, keep the same configuration for each server

                    hadoop300    hadoop301    hadoop302
NameNode            ✓
DataNode            ✓            ✓            ✓
SecondaryNameNode                             ✓
ResourceManager                  ✓
NodeManager         ✓            ✓            ✓

4. Download and decompress

4.1 Placing the Installation package

  • Download hadoop-3.1.3, unzip it, and create a symbolic link to it in the ~/app directory; do the same on hadoop301 and hadoop302
[hadoop@hadoop300 app]$ pwd
/home/hadoop/app
[hadoop@hadoop300 app]$ ll
lrwxrwxrwx   1 hadoop hadoop  47 Feb 21 12:33 hadoop -> /home/hadoop/app/manager/hadoop_mg/hadoop-3.1.3

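A minimal sketch of the unpack-and-link step, assuming the tarball was downloaded to /home/hadoop/app/manager/hadoop_mg as in the listing above:

cd /home/hadoop/app/manager/hadoop_mg
tar -zxvf hadoop-3.1.3.tar.gz                                        # unpack the distribution
ln -s /home/hadoop/app/manager/hadoop_mg/hadoop-3.1.3 ~/app/hadoop   # shortcut used throughout this guide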

4.2 Configuring Hadoop Environment Variables

  • vim ~/.bash_profile
# ============ java =========================
export JAVA_HOME=/home/hadoop/app/jdk
export PATH=$PATH:$JAVA_HOME/bin

# ======================= Hadoop ============================
export HADOOP_HOME=/home/hadoop/app/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
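Apply the new variables in the current shell and verify (paths as configured above):

[hadoop@hadoop300 ~]$ source ~/.bash_profile
[hadoop@hadoop300 ~]$ hadoop version    # should report Hadoop 3.1.3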

5. Hadoop configuration

5.1 env file

  • Under ${HADOOP_HOME}/etc/hadoop, add the JDK environment variable to the hadoop-env.sh, mapred-env.sh, and yarn-env.sh files
export JAVA_HOME=/home/hadoop/app/jdk
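One way to append this line to all three files at once (a convenience sketch; editing each file by hand works just as well):

cd $HADOOP_HOME/etc/hadoop
for f in hadoop-env.sh mapred-env.sh yarn-env.sh; do
    echo 'export JAVA_HOME=/home/hadoop/app/jdk' >> "$f"    # same JAVA_HOME as in ~/.bash_profile
done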

5.2 core-site.xml

  • Modify the ${HADOOP_HOME}/etc/hadoop/core-site.xml file; the properties below go inside its <configuration> element
  • For details about the proxy user configuration, see Proxy User in the official documentation
<!-- Set the HDFS NameNode address -->
<property>
	<name>fs.defaultFS</name>
	<value>hdfs://hadoop300:8020</value>
</property>

<!-- hadoop.tmp.dir defaults to /tmp/hadoop-${user.name} -->
<!-- <property> <name>hadoop.tmp.dir</name> <value>/tmp/hadoop-${user.name}</value> </property> -->

<!-- Configure the static user used to access the HDFS web UI as hadoop -->
<property>
	<name>hadoop.http.staticuser.user</name>
	<value>hadoop</value>
</property>

<!-- Hosts from which the hadoop user may act as a proxy -->
<property>
	<name>hadoop.proxyuser.hadoop.hosts</name>
	<value>*</value>
</property>
<!-- Groups whose users the hadoop user may impersonate -->
<property>
	<name>hadoop.proxyuser.hadoop.groups</name>
	<value>*</value>
</property>
<!-- Users the hadoop user may impersonate; * means all -->
<property>
	<name>hadoop.proxyuser.hadoop.users</name>
	<value>*</value>
</property>

5.3 hdfs-site.xml (HDFS configuration)

  • Configure HDFS properties
<!-- Specify the number of HDFS copies -->
<property>
	<name>dfs.replication</name>
	<value>2</value>
</property>

<!-- SecondaryNameNode HTTP address -->
<property>
	<name>dfs.namenode.secondary.http-address</name>
	<value>hadoop302:9868</value>
</property>
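Once HDFS is running (section 6), the effective values can be checked with the standard getconf tool:

hdfs getconf -confKey dfs.replication                        # expect 2
hdfs getconf -confKey dfs.namenode.secondary.http-address    # expect hadoop302:9868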

5.4 yarn-site.xml (YARN configuration)

  • Configure yarn properties
<!-- Reducers fetch map output through the mapreduce_shuffle auxiliary service -->
<property>
 		<name>yarn.nodemanager.aux-services</name>
 		<value>mapreduce_shuffle</value>
</property>

<!-- Specify the host address of YARN's ResourceManager -->
<property>
	<name>yarn.resourcemanager.hostname</name>
  <value>hadoop301</value>
</property>

<!-- Minimum memory (in MB) that the RM allocates per container request -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>200</value>
</property>
<!-- Maximum memory (in MB) per container; memory requests to the RM above this limit throw an InvalidResourceRequestException -->
<property>
	<name>yarn.scheduler.maximum-allocation-mb</name>
  <value>2048</value>
</property>

<!-- Physical memory (in MB) on this node that YARN can allocate to containers. If set to -1 and yarn.nodemanager.resource.detect-hardware-capabilities is true, it is calculated automatically (on Windows and Linux); otherwise it defaults to 8192 MB -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>
</property>

<!-- Disable YARN's physical and virtual memory checks. Because memory usage is computed differently, a job may otherwise be judged to exceed its memory limit and be killed -->
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>

<!-- Log server (JobHistory) URL used to view aggregated logs -->
<property>
        <name>yarn.log.server.url</name>
        <value>http://hadoop300:19888/jobhistory/logs/</value> 
</property>

<!-- Enable log aggregation: after an application finishes, the container logs collected on each node are uploaded to HDFS, which makes them easy to view -->
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>

<!-- Log retention time set to 7 days -->
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>604800</value>
</property>

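With log aggregation enabled, the container logs of a finished application can be pulled from HDFS with the standard YARN CLI (the application ID below is a placeholder; take the real one from the RM web UI or yarn application -list):

yarn logs -applicationId application_1600000000000_0001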

5.5 mapred-site.xml (MapReduce configuration)

  • Configure MapReduce Settings
<!-- Set MR to run on YARN (the default is local) -->
<property>
		<name>mapreduce.framework.name</name>
		<value>yarn</value>
</property>
	
  <!-- JobHistory server addresses -->
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop300:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop300:19888</value>
  </property>

<!-- Hadoop environment variables for the MR ApplicationMaster, map tasks, and reduce tasks (required on Hadoop 3.x so MR jobs can locate the MapReduce framework) -->
<property>
  <name>yarn.app.mapreduce.am.env</name>
  <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
  <name>mapreduce.map.env</name>
  <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
  <name>mapreduce.reduce.env</name>
  <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>


5.6 Workers File Configuration

  • Modify the ${HADOOP_HOME}/etc/hadoop/workers file to list the Hadoop cluster worker nodes
    • Tip: make sure the file contains no blank lines or trailing spaces
hadoop300
hadoop301
hadoop302
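Since all three servers must share the same configuration (section 3), one way to sync ${HADOOP_HOME}/etc/hadoop to the other nodes is rsync over SSH (assumes passwordless SSH between the hosts):

for host in hadoop301 hadoop302; do
    rsync -av $HADOOP_HOME/etc/hadoop/ ${host}:$HADOOP_HOME/etc/hadoop/
done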

6. Start and test

6.1 Formatting NameNode

  • On hadoop300 (only needed once, when the cluster is first set up)
[hadoop@hadoop300 app]$ hdfs namenode -format

6.2 Start HDFS

  • Start on hadoop300
[hadoop@hadoop300 ~]$ start-dfs.sh
Starting namenodes on [hadoop300]
Starting datanodes
Starting secondary namenodes [hadoop302]
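To confirm that all three DataNodes have registered with the NameNode, the standard dfsadmin report can be used:

hdfs dfsadmin -report | grep -E 'Live datanodes|^Name:'    # expect 3 live DataNodes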

6.3 Start YARN

  • Start on hadoop301 (the ResourceManager node)
[hadoop@hadoop301 ~]$ start-yarn.sh
Starting resourcemanager
Starting nodemanagers

6.4 Start the JobHistory server

[hadoop@hadoop300 hadoop]$ mapred --daemon start historyserver

6.5 Result

  • Use jps to check the processes after startup
  • The HDFS NameNode (NN), DataNodes (DN), and SecondaryNameNode (SN) are running
  • The YARN ResourceManager (RM) and NodeManagers (NM) are also running
  • The MR JobHistoryServer is also running
[hadoop@hadoop300 hadoop]$ xcall jps
--------- hadoop300 ----------
16276 JobHistoryServer
30597 DataNode
19641 Jps
30378 NameNode
3242 NodeManager
--------- hadoop301 ----------
24596 DataNode
19976 Jps
27133 ResourceManager
27343 NodeManager
--------- hadoop302 ----------
24786 SecondaryNameNode
27160 NodeManager
24554 DataNode
19676 Jps


Access the HDFS NameNode web UI at hadoop300:9870

Access the HDFS SecondaryNameNode web UI at hadoop302:9868

Access the YARN management page at hadoop301:8088

Access the JobHistory page at hadoop300:19888
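
As a quick end-to-end smoke test, the example job bundled with the distribution can be submitted (the jar path below is the standard location inside the hadoop-3.1.3 distribution):

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar pi 2 10

The job should show up on the hadoop301:8088 ResourceManager page while running and on the hadoop300:19888 JobHistory page after it finishes.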

7. Unified Hadoop cluster startup script

  • vim hadoop.sh
#!/bin/bash

case $1 in
"start") {
        echo "---------- Starting the Hadoop cluster ------------"
        echo "Starting HDFS"
        ssh hadoop300 "source ~/.bash_profile; start-dfs.sh"
        echo "Starting JobHistory"
        ssh hadoop300 "source ~/.bash_profile; mapred --daemon start historyserver"
        echo "Starting YARN"
        ssh hadoop301 "source ~/.bash_profile; start-yarn.sh"
};;
"stop") {
        echo "---------- Stopping the Hadoop cluster ------------"
        echo "Stopping HDFS"
        ssh hadoop300 "source ~/.bash_profile; stop-dfs.sh"
        echo "Stopping JobHistory"
        ssh hadoop300 "source ~/.bash_profile; mapred --daemon stop historyserver"
        echo "Stopping YARN"
        ssh hadoop301 "source ~/.bash_profile; stop-yarn.sh"
};;
esac
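Assuming the script is saved as hadoop.sh on a node that has passwordless SSH to hadoop300 and hadoop301, it can be used like this:

chmod +x hadoop.sh
./hadoop.sh start    # bring up HDFS, JobHistory, and YARN across the cluster
./hadoop.sh stop     # shut everything down again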