Building a Hadoop cluster on Linux cloud servers

The process is divided into five steps:

  1. Create the hadoop user
  2. Download and install
  3. Configure SSH password-free login
  4. Modify the configuration files
  5. Initialization, start, and stop

1. Add the hadoop user

useradd -d /home/hadoop -m hadoop
usermod -a -G root hadoop
passwd hadoop
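To confirm that the user was created and added to the root group, you can optionally run:

id hadoop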

2. Download and install

1. JDK8

sudo yum install java-1.8.0-openjdk java-1.8.0-openjdk-devel -y
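You can optionally verify the installation and locate the JDK path (useful when setting JAVA_HOME later):

java -version
readlink -f "$(which java)"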

2. Hadoop 3.0.1

cd ~
wget https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-3.0.1/hadoop-3.0.1.tar.gz
mv hadoop-3.0.1.tar.gz /home/hadoop/
cd /home/hadoop/
tar -xzvf hadoop-3.0.1.tar.gz
chown -R hadoop hadoop-3.0.1
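Optionally confirm that the archive extracted correctly and that the hadoop user owns the directory:

ls -ld /home/hadoop/hadoop-3.0.1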

3. Configure SSH password-free login

1. Edit /etc/hosts

(In the following, IPn stands for a cloud server's external IP address, in the form 192.168.1.1. Note: for the entry that refers to the machine itself, use its internal IP address instead.)

IP1 master
IP2 slave1
IP3 slave2
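After editing /etc/hosts on every server, you can check that the hostnames resolve:

getent hosts master slave1 slave2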

2. Switch to the hadoop user and generate id_rsa.pub

su hadoop
cd ~
ssh-keygen -t rsa
cd ~/.ssh/
cat id_rsa.pub >> authorized_keys
chmod 700 /home/hadoop/.ssh
chmod 644 /home/hadoop/.ssh/authorized_keys

  • All cloud servers must run the commands above.

3. Exchange and share the id_rsa.pub contents

(If you are building a pseudo-distributed setup on a single machine, you can skip this exchange step and test SSH directly.)

(1) Operation on the master cloud server (send its authorized_keys, which now contains the master key, to slave1):
scp /home/hadoop/.ssh/authorized_keys slave1:/home/hadoop/.ssh/
(2) Operation on the slave1 cloud server (append its own key to the received file, then pass it on to slave2):
cat /home/hadoop/.ssh/id_rsa.pub >> /home/hadoop/.ssh/authorized_keys
scp /home/hadoop/.ssh/authorized_keys slave2:/home/hadoop/.ssh/
(3) Operation on the slave2 cloud server (append its own key, then send the complete file back to master and slave1):
cat /home/hadoop/.ssh/id_rsa.pub >> /home/hadoop/.ssh/authorized_keys
scp /home/hadoop/.ssh/authorized_keys master:/home/hadoop/.ssh/
scp /home/hadoop/.ssh/authorized_keys slave1:/home/hadoop/.ssh/
  • The ultimate goal of this step is for the authorized_keys file on every cloud server to contain the id_rsa.pub of all three servers, with identical content everywhere (see the alternative sketch below).
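A simpler alternative that reaches the same end state, sketched here on the assumption that the hostnames above are configured and the hadoop user's password is still accepted: run ssh-copy-id from every server to every server, since it appends the local public key to the remote authorized_keys instead of overwriting the file.

# Run as the hadoop user on each of master, slave1, and slave2
for host in master slave1 slave2; do
    ssh-copy-id "hadoop@$host"
done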
(4) Test whether the configuration is successful
Run the following commands on master:
ssh slave1
exit
ssh slave2
exit
Run the following commands on slave1:
ssh master
exit
ssh slave2
exit
Run the following commands on slave2:
ssh master
exit
ssh slave1
exit
  • Make sure every cloud server can ssh into every other one.
  • The first ssh connection asks for a password and for host-key confirmation; enter the password and answer yes so the host is recorded. After that, logging in no longer requires a password.
  • If an exception occurs, restart the SSH service and try again: sudo service sshd restart.

4. Modify the configuration files

1. Configure environment variables in /etc/profile

export JAVA_HOME=/usr/lib/jvm/jre
export HADOOP_HOME=/home/hadoop/hadoop-3.0.1/
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HADOOP_HOME/lib:$JAVA_HOME/bin

2. Enable environment variables

source /etc/profile
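To confirm the variables took effect, you can run:

echo $HADOOP_HOME
hadoop version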

3. Locate the configuration files

For more details, see the official documentation

cd ~/hadoop-3.0.1/etc/hadoop
ls
  • You will find many configuration files here
(1) Add to core-site.xml
<configuration>
	<property>
		<name>fs.defaultFS</name>
		<value>hdfs://master:9000</value>
	</property>
	<property>
		<name>io.file.buffer.size</name>
		<value>131072</value>
	</property>
</configuration>
  • Set the HDFS NameNode URI to IP1:9000 (master <=> IP1)
  • Set the I/O file buffer size to 131072 bytes (128 KB)
(2) Add to hdfs-site.xml
<configuration>
	<property>
		<name>dfs.namenode.name.dir</name>
		<value>file:/home/hadoop/hadoop-3.0.1/hdfs/name</value>
	</property>
	<property>
		<name>dfs.replication</name>
		<value>2</value>
	</property>
	<property>
		<name>dfs.namenode.secondary.http-address</name>
		<value>slave1:9001</value>
	</property>
	<property>
		<name>dfs.webhdfs.enabled</name>
		<value>true</value>
	</property>
	<property>
		<name>dfs.datanode.data.dir</name>
		<value>file:/home/hadoop/hadoop-3.0.1/hdfs/data</value>
	</property>
</configuration>
  • Set the directory where NameNode metadata is stored
  • Set the number of block replicas to 2
  • Set the Secondary NameNode HTTP address to IP2:9001 (slave1 <=> IP2)
  • Enable the WebHDFS module
  • Set the directory path for DataNode data
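If you prefer to create these directories ahead of time (optional, using the paths configured above), run as the hadoop user:

# Name directory on master, data directory on slave1 and slave2
mkdir -p /home/hadoop/hadoop-3.0.1/hdfs/name
mkdir -p /home/hadoop/hadoop-3.0.1/hdfs/data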
(3) Add to yarn-site.xml
<configuration>
	<property>
		<name>yarn.resourcemanager.address</name>
		<value>master:8032</value>
	</property>
	<property>
		<name>yarn.resourcemanager.scheduler.address</name>
		<value>master:8030</value>
	</property>
	<property>
		<name>yarn.resourcemanager.resource-tracker.address</name>
		<value>master:8031</value>
	</property>
	<property>
		<name>yarn.resourcemanager.admin.address</name>
		<value>master:8033</value>
	</property>
	<property>
		<name>yarn.resourcemanager.webapp.address</name>
		<value>master:8088</value>
	</property>

	<property>
		<name>yarn.scheduler.minimum-allocation-mb</name>
		<value>512</value>
	</property>
	<property>
		<name>yarn.scheduler.maximum-allocation-mb</name>
		<value>2048</value>
	</property>

	<property>
		<name>yarn.nodemanager.resource.memory-mb</name>
		<value>1024</value>
	</property>
	<property>
		<name>yarn.nodemanager.pmem-check-enabled</name>
		<value>true</value>
	</property>
	<property>
		<name>yarn.nodemanager.aux-services</name>
		<value>mapreduce_shuffle</value>
	</property>
</configuration>
  • Set the address clients use to submit jobs to IP1:8032
  • Set the scheduler address (used by ApplicationMasters to request resources) to IP1:8030
  • Set the resource-tracker address (reported to by NodeManagers) to IP1:8031 and the admin address to IP1:8033
  • Set the ResourceManager web UI address to IP1:8088
  • The preceding address settings are optional and have default values
  • Set the minimum memory allocation per task container to 512 MB
  • Set the maximum memory allocation per task container to 2048 MB
  • Set the memory available to each NodeManager to 1024 MB
  • Enable the physical-memory check: a task that exceeds its memory limit is automatically killed
(4) Add to mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
		<value>yarn</value>
	</property>
	<property>
		<name>mapreduce.jobhistory.address</name>
		<value>master:10020</value>
	</property>
	<property>
		<name>mapreduce.jobhistory.webapp.address</name>
		<value>master:19888</value>
	</property>
</configuration>
  • Set the JobHistory Server address to master:10020 (the default port)
  • Set the JobHistory web UI address to master:19888 (the default port)
(5) Edit hadoop-env.sh

Around line 54, change the line to

export JAVA_HOME=${JAVA_HOME}
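If the daemons later refuse to start and complain that JAVA_HOME is not set, a common workaround is to hard-code the path here instead, matching the value used in /etc/profile above:

export JAVA_HOME=/usr/lib/jvm/jre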
(6) Create the files masters and workers in the same directory
The masters file contains:
IP1
The workers file contains:
IP2
IP3
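If you edited the configuration files only on master, copy them to the other nodes so every machine has the same configuration (assuming the same installation path everywhere):

scp /home/hadoop/hadoop-3.0.1/etc/hadoop/* slave1:/home/hadoop/hadoop-3.0.1/etc/hadoop/
scp /home/hadoop/hadoop-3.0.1/etc/hadoop/* slave2:/home/hadoop/hadoop-3.0.1/etc/hadoop/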

5. Initialization, start, and stop

1. Format the NameNode (run this on master only)

su hadoop
hdfs namenode -format

2. Start

start-dfs.sh
start-yarn.sh

or

start-all.sh
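To check that the daemons are running, run the JDK's jps tool on each machine:

jps

With the configuration above, master should show NameNode and ResourceManager; slave1 should show DataNode, NodeManager, and SecondaryNameNode; slave2 should show DataNode and NodeManager. To stop the cluster, use the matching stop scripts:

stop-dfs.sh
stop-yarn.sh

or

stop-all.sh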

3. Start and stop the JobHistory Server

mr-jobhistory-daemon.sh start historyserver

mr-jobhistory-daemon.sh stop historyserver

6. Other

  1. Do not format the NameNode before every startup; repeated formatting makes the clusterIDs of the NameNode and DataNodes inconsistent, which causes startup to fail.
  2. If you do need to reformat, first delete the runtime directories specified in the configuration files, such as hdfs/name, hdfs/data, and tmp (under the Hadoop installation directory).
  3. When something goes wrong, check the logs folder under the Hadoop installation directory to troubleshoot the error.