1. Environment description

  • Based on Apache Hadoop version 3.1.3
  • Requires a JDK environment
  • Hadoop 3.1.3 download address

2. Hadoop Architecture

  • In its narrow sense, Hadoop consists of three main components: distributed storage (HDFS), distributed computing (MapReduce), and distributed resource management and scheduling (YARN)

2.1 HDFS architecture

  • Mainly responsible for data storage

  • NameNode: Manages the namespace, stores block mapping information (metadata), and handles client access to HDFS
  • SecondaryNameNode: Despite the name, not a hot standby of the NameNode. It periodically merges the namespace image (fsimage) with the edit log (edits) and returns the merged fsimage to the NameNode, which shortens NameNode restart and recovery time
  • DataNode: Stores the actual file data. Files are split into blocks, and each block is stored as multiple replicas on different DataNodes

2.2 Yarn architecture

  • Responsible for job scheduling and cluster resource management (a short sketch after the component list shows how to inspect these components from the command line)

  • ResourceManager(RM):
    • Handles job submissions and resource requests
    • Monitors NodeManager status
    • Starts and monitors ApplicationMasters
  • NodeManager(NM):
    • Manages the resources on its node
    • Periodically reports the node's resource usage and the running status of its containers to the RM
    • Handles requests from the AM to start and stop containers
  • Container:
    • The unit in which tasks run and YARN's abstraction of resources. It encapsulates multi-dimensional resources on a node, such as memory, CPU, disk, and network. Resources granted by the RM to an AM are represented as Containers; YARN assigns a Container to each task, and the task can use only the resources described in its Container
  • ApplicationMaster(AM):
    • Each job starts one AM on an NM; the AM sends start requests for MapTask and ReduceTask to the NMs and applies to the RM for the resources the tasks need
    • Interacts with the RM to request resource containers (for example, resources for running the job and for individual tasks)
    • Starts and stops tasks and monitors the running status of all tasks; if a task fails, it reapplies for resources and restarts the task
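
Once the cluster is running (section 6), these components can be observed with the standard YARN CLI; a minimal sketch:

yarn node -list            # NodeManagers registered with the ResourceManager
yarn application -list     # running applications and their ApplicationMasters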

3. Cluster planning

Unless otherwise specified, keep the same configuration for each server

                    hadoop300    hadoop301    hadoop302
NameNode            ✓
DataNode            ✓            ✓            ✓
SecondaryNameNode                             ✓
ResourceManager                  ✓
NodeManager         ✓            ✓            ✓

4. Download and decompress

4.1 Placing the Installation package

  • Download hadoop-3.1.3, unzip it, and create a symbolic link to it in the ~/app directory; do the same on hadoop301 and hadoop302
[hadoop@hadoop300 app]$ pwd
/home/hadoop/app
[hadoop@hadoop300 app]$ ll
lrwxrwxrwx   1 hadoop hadoop  47 Feb 21 12:33 hadoop -> /home/hadoop/app/manager/hadoop_mg/hadoop-3.1.3

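A minimal sketch of the unpack-and-link step, assuming the tarball was downloaded to /home/hadoop/app/manager/hadoop_mg as in the listing above:

cd /home/hadoop/app/manager/hadoop_mg
tar -zxvf hadoop-3.1.3.tar.gz                                        # unpack the distribution
ln -s /home/hadoop/app/manager/hadoop_mg/hadoop-3.1.3 ~/app/hadoop   # shortcut used throughout this guide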

4.2 Configuring Hadoop Environment Variables

  • vim ~/.bash_profile
# ============ java =========================
export JAVA_HOME=/home/hadoop/app/jdk
export PATH=$PATH:$JAVA_HOME/bin

# ======================= Hadoop ============================
export HADOOP_HOME=/home/hadoop/app/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
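Apply the new variables in the current shell and verify (paths as configured above):

[hadoop@hadoop300 ~]$ source ~/.bash_profile
[hadoop@hadoop300 ~]$ hadoop version    # should report Hadoop 3.1.3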

5. Hadoop configuration

5.1 env file

  • Under ${HADOOP_HOME}/etc/hadoop, add the JDK environment variable to the hadoop-env.sh, mapred-env.sh, and yarn-env.sh files
export JAVA_HOME=/home/hadoop/app/jdk
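One way to append this line to all three files at once (a convenience sketch; editing each file by hand works just as well):

cd $HADOOP_HOME/etc/hadoop
for f in hadoop-env.sh mapred-env.sh yarn-env.sh; do
    echo 'export JAVA_HOME=/home/hadoop/app/jdk' >> "$f"    # same JAVA_HOME as in ~/.bash_profile
done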

5.2 core-site.xml

  • Modify the ${HADOOP_HOME}/etc/hadoop/core-site.xml file; the properties below go inside its <configuration> element
  • For details about the proxy user configuration, see Proxy User in the official documentation
<!-- Set the HDFS NameNode address -->
<property>
	<name>fs.defaultFS</name>
	<value>hdfs://hadoop300:8020</value>
</property>

<!-- hadoop.tmp.dir defaults to /tmp/hadoop-${user.name} -->
<!-- <property> <name>hadoop.tmp.dir</name> <value>/tmp/hadoop-${user.name}</value> </property> -->

<!-- Configure the static user used to access the HDFS web UI as hadoop -->
<property>
	<name>hadoop.http.staticuser.user</name>
	<value>hadoop</value>
</property>

<!-- Hosts from which the hadoop user may act as a proxy -->
<property>
	<name>hadoop.proxyuser.hadoop.hosts</name>
	<value>*</value>
</property>
<!-- Groups whose users the hadoop user may impersonate -->
<property>
	<name>hadoop.proxyuser.hadoop.groups</name>
	<value>*</value>
</property>
<!-- Users the hadoop user may impersonate; * means all -->
<property>
	<name>hadoop.proxyuser.hadoop.users</name>
	<value>*</value>
</property>

5.3 hdfs-site.xml (HDFS configuration)

  • Configure HDFS properties
<!-- Specify the number of HDFS copies -->
<property>
	<name>dfs.replication</name>
	<value>2</value>
</property>

<!-- SecondaryNameNode HTTP address -->
<property>
	<name>dfs.namenode.secondary.http-address</name>
	<value>hadoop302:9868</value>
</property>
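Once HDFS is running (section 6), the effective values can be checked with the standard getconf tool:

hdfs getconf -confKey dfs.replication                        # expect 2
hdfs getconf -confKey dfs.namenode.secondary.http-address    # expect hadoop302:9868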

5.4 yarn-site.xml (YARN configuration)

  • Configure yarn properties
<!-- Reducers fetch map output through the mapreduce_shuffle auxiliary service -->
<property>
 		<name>yarn.nodemanager.aux-services</name>
 		<value>mapreduce_shuffle</value>
</property>

<!-- Specify the host address of YARN's ResourceManager -->
<property>
	<name>yarn.resourcemanager.hostname</name>
  <value>hadoop301</value>
</property>

<!-- Minimum memory (in MB) that the RM allocates per container request -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>200</value>
</property>
<!-- Maximum memory (in MB) per container; memory requests to the RM above this limit throw an InvalidResourceRequestException -->
<property>
	<name>yarn.scheduler.maximum-allocation-mb</name>
  <value>2048</value>
</property>

<!-- Physical memory (in MB) on this node that YARN can allocate to containers. If set to -1 and yarn.nodemanager.resource.detect-hardware-capabilities is true, it is calculated automatically (on Windows and Linux); otherwise it defaults to 8192 MB -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>
</property>

<!-- Disable YARN's physical and virtual memory checks. Because memory usage is computed differently, a job may otherwise be judged to exceed its memory limit and be killed -->
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>

<!-- Log server (JobHistory) URL used to view aggregated logs -->
<property>
        <name>yarn.log.server.url</name>
        <value>http://hadoop300:19888/jobhistory/logs/</value> 
</property>

<!-- Enable log aggregation: after an application finishes, the container logs collected on each node are uploaded to HDFS, which makes them easy to view -->
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>

<!-- Log retention time set to 7 days -->
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>604800</value>
</property>

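With log aggregation enabled, the container logs of a finished application can be pulled from HDFS with the standard YARN CLI (the application ID below is a placeholder; take the real one from the RM web UI or yarn application -list):

yarn logs -applicationId application_1600000000000_0001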

5.5 mapred-site.xml (MapReduce configuration)

  • Configure MapReduce Settings
<!-- Set MR to run on YARN (the default is local) -->
<property>
		<name>mapreduce.framework.name</name>
		<value>yarn</value>
</property>
	
  <!-- JobHistory server addresses -->
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop300:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop300:19888</value>
  </property>

<!-- Hadoop environment variables for the MR ApplicationMaster, map tasks, and reduce tasks (required on Hadoop 3.x so MR jobs can locate the MapReduce framework) -->
<property>
  <name>yarn.app.mapreduce.am.env</name>
  <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
  <name>mapreduce.map.env</name>
  <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
  <name>mapreduce.reduce.env</name>
  <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>


5.6 Workers File Configuration

  • Modify the ${HADOOP_HOME}/etc/hadoop/workers file to list the Hadoop cluster worker nodes
    • Tip: make sure the file contains no blank lines or trailing spaces
hadoop300
hadoop301
hadoop302
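Since all three servers must share the same configuration (section 3), one way to sync ${HADOOP_HOME}/etc/hadoop to the other nodes is rsync over SSH (assumes passwordless SSH between the hosts):

for host in hadoop301 hadoop302; do
    rsync -av $HADOOP_HOME/etc/hadoop/ ${host}:$HADOOP_HOME/etc/hadoop/
done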

6. Start and test

6.1 Formatting NameNode

  • On hadoop300 (only needed once, when the cluster is first set up)
[hadoop@hadoop300 app]$ hdfs namenode -format

6.2 Start HDFS

  • Start on hadoop300
[hadoop@hadoop300 ~]$ start-dfs.sh
Starting namenodes on [hadoop300]
Starting datanodes
Starting secondary namenodes [hadoop302]
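To confirm that all three DataNodes have registered with the NameNode, the standard dfsadmin report can be used:

hdfs dfsadmin -report | grep -E 'Live datanodes|^Name:'    # expect 3 live DataNodes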

6.3 Start YARN

  • Start on hadoop301 (the ResourceManager node)
[hadoop@hadoop301 ~]$ start-yarn.sh
Starting resourcemanager
Starting nodemanagers

6.4 Start the JobHistory server

[hadoop@hadoop300 hadoop]$ mapred --daemon start historyserver

6.5 Result

  • Use jps to check the processes after startup
  • The HDFS NameNode (NN), DataNodes (DN), and SecondaryNameNode (SN) are running
  • The YARN ResourceManager (RM) and NodeManagers (NM) are also running
  • The MR JobHistoryServer is also running
[hadoop@hadoop300 hadoop]$ xcall jps
--------- hadoop300 ----------
16276 JobHistoryServer
30597 DataNode
19641 Jps
30378 NameNode
3242 NodeManager
--------- hadoop301 ----------
24596 DataNode
19976 Jps
27133 ResourceManager
27343 NodeManager
--------- hadoop302 ----------
24786 SecondaryNameNode
27160 NodeManager
24554 DataNode
19676 Jps


Access the HDFS NameNode web UI at hadoop300:9870

Access the HDFS SecondaryNameNode web UI at hadoop302:9868

Access the YARN management page at hadoop301:8088

Access the JobHistory page at hadoop300:19888
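
As a quick end-to-end smoke test, the example job bundled with the distribution can be submitted (the jar path below is the standard location inside the hadoop-3.1.3 distribution):

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar pi 2 10

The job should show up on the hadoop301:8088 ResourceManager page while running and on the hadoop300:19888 JobHistory page after it finishes.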

7. Unified Hadoop cluster startup script

  • vim hadoop.sh
#!/bin/bash

case $1 in
"start") {
        echo "---------- Starting the Hadoop cluster ------------"
        echo "Starting HDFS"
        ssh hadoop300 "source ~/.bash_profile; start-dfs.sh"
        echo "Starting JobHistory"
        ssh hadoop300 "source ~/.bash_profile; mapred --daemon start historyserver"
        echo "Starting YARN"
        ssh hadoop301 "source ~/.bash_profile; start-yarn.sh"
};;
"stop") {
        echo "---------- Stopping the Hadoop cluster ------------"
        echo "Stopping HDFS"
        ssh hadoop300 "source ~/.bash_profile; stop-dfs.sh"
        echo "Stopping JobHistory"
        ssh hadoop300 "source ~/.bash_profile; mapred --daemon stop historyserver"
        echo "Stopping YARN"
        ssh hadoop301 "source ~/.bash_profile; stop-yarn.sh"
};;
esac
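Assuming the script is saved as hadoop.sh on a node that has passwordless SSH to hadoop300 and hadoop301, it can be used like this:

chmod +x hadoop.sh
./hadoop.sh start    # bring up HDFS, JobHistory, and YARN across the cluster
./hadoop.sh stop     # shut everything down again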