Building a Hadoop cluster on Linux cloud servers

The process is divided into five steps:

  1. Create the hadoop user
  2. Download and install
  3. Configure SSH password-free login
  4. Modify the configuration files
  5. Initialization, start, and stop

1. Add the hadoop user

useradd -d /home/hadoop -m hadoop
usermod -a -G root hadoop
passwd hadoop
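To confirm that the user was created and added to the root group, you can optionally run:

id hadoop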

2. Download and install

1. JDK8

sudo yum install java-1.8.0-openjdk java-1.8.0-openjdk-devel -y
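You can optionally verify the installation and locate the JDK path (useful when setting JAVA_HOME later):

java -version
readlink -f "$(which java)"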

2. Hadoop 3.0.1

cd ~
wget https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-3.0.1/hadoop-3.0.1.tar.gz
mv hadoop-3.0.1.tar.gz /home/hadoop/
cd /home/hadoop/
tar -xzvf hadoop-3.0.1.tar.gz
chown -R hadoop hadoop-3.0.1
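Optionally confirm that the archive extracted correctly and that the hadoop user owns the directory:

ls -ld /home/hadoop/hadoop-3.0.1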

3. Configure SSH password-free login

1. Edit /etc/hosts

(In the following, IPn stands for a cloud server's external IP address, in the form 192.168.1.1. Note: for the entry that refers to the machine itself, use its internal IP address instead.)

IP1 master
IP2 slave1
IP3 slave2
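After editing /etc/hosts on every server, you can check that the hostnames resolve:

getent hosts master slave1 slave2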

2. Switch to the hadoop user and generate id_rsa.pub

su hadoop
cd ~
ssh-keygen -t rsa
cd ~/.ssh/
cat id_rsa.pub >> authorized_keys
chmod 700 /home/hadoop/.ssh
chmod 644 /home/hadoop/.ssh/authorized_keys

  • All cloud servers must run the commands above.

3. Exchange and share the id_rsa.pub contents

(If you are building a pseudo-distributed setup on a single machine, you can skip this exchange step and test SSH directly.)

(1) Operation on the master cloud server (send its authorized_keys, which now contains the master key, to slave1):
scp /home/hadoop/.ssh/authorized_keys slave1:/home/hadoop/.ssh/
(2) Operation on the slave1 cloud server (append its own key to the received file, then pass it on to slave2):
cat /home/hadoop/.ssh/id_rsa.pub >> /home/hadoop/.ssh/authorized_keys
scp /home/hadoop/.ssh/authorized_keys slave2:/home/hadoop/.ssh/
(3) Operation on the slave2 cloud server (append its own key, then send the complete file back to master and slave1):
cat /home/hadoop/.ssh/id_rsa.pub >> /home/hadoop/.ssh/authorized_keys
scp /home/hadoop/.ssh/authorized_keys master:/home/hadoop/.ssh/
scp /home/hadoop/.ssh/authorized_keys slave1:/home/hadoop/.ssh/
  • The ultimate goal of this step is for the authorized_keys file on every cloud server to contain the id_rsa.pub of all three servers, with identical content everywhere (see the alternative sketch below).
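A simpler alternative that reaches the same end state, sketched here on the assumption that the hostnames above are configured and the hadoop user's password is still accepted: run ssh-copy-id from every server to every server, since it appends the local public key to the remote authorized_keys instead of overwriting the file.

# Run as the hadoop user on each of master, slave1, and slave2
for host in master slave1 slave2; do
    ssh-copy-id "hadoop@$host"
done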
(4) Test whether the configuration is successful
Run the following commands on master:
ssh slave1
exit
ssh slave2
exit
Run the following commands on slave1:
ssh master
exit
ssh slave2
exit
Run the following commands on slave2:
ssh master
exit
ssh slave1
exit
  • Make sure every cloud server can ssh into every other one.
  • The first ssh connection asks for a password and for host-key confirmation; enter the password and answer yes so the host is recorded. After that, logging in no longer requires a password.
  • If an exception occurs, restart the SSH service and try again: sudo service sshd restart.

4. Modify the configuration files

1. Configure environment variables in /etc/profile

export JAVA_HOME=/usr/lib/jvm/jre
export HADOOP_HOME=/home/hadoop/hadoop-3.0.1/
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HADOOP_HOME/lib:$JAVA_HOME/bin

2. Enable environment variables

source /etc/profile
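To confirm the variables took effect, you can run:

echo $HADOOP_HOME
hadoop version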

3. Locate the configuration files

For more details, see the official documentation

cd ~/hadoop-3.0.1/etc/hadoop
ls
  • You will find many configuration files here
(1) Add to core-site.xml
<configuration>
	<property>
		<name>fs.defaultFS</name>
		<value>hdfs://master:9000</value>
	</property>
	<property>
		<name>io.file.buffer.size</name>
		<value>131072</value>
	</property>
</configuration>
  • Set the HDFS NameNode URI to IP1:9000 (master <=> IP1)
  • Set the I/O file buffer size to 131072 bytes (128 KB)
(2) Add to hdfs-site.xml
<configuration>
	<property>
		<name>dfs.namenode.name.dir</name>
		<value>file:/home/hadoop/hadoop-3.0.1/hdfs/name</value>
	</property>
	<property>
		<name>dfs.replication</name>
		<value>2</value>
	</property>
	<property>
		<name>dfs.namenode.secondary.http-address</name>
		<value>slave1:9001</value>
	</property>
	<property>
		<name>dfs.webhdfs.enabled</name>
		<value>true</value>
	</property>
	<property>
		<name>dfs.datanode.data.dir</name>
		<value>file:/home/hadoop/hadoop-3.0.1/hdfs/data</value>
	</property>
</configuration>
  • Set the directory where NameNode metadata is stored
  • Set the number of block replicas to 2
  • Set the Secondary NameNode HTTP address to IP2:9001 (slave1 <=> IP2)
  • Enable the WebHDFS module
  • Set the directory path for DataNode data
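If you prefer to create these directories ahead of time (optional, using the paths configured above), run as the hadoop user:

# Name directory on master, data directory on slave1 and slave2
mkdir -p /home/hadoop/hadoop-3.0.1/hdfs/name
mkdir -p /home/hadoop/hadoop-3.0.1/hdfs/data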
(3) Add to yarn-site.xml
<configuration>
	<property>
		<name>yarn.resourcemanager.address</name>
		<value>master:8032</value>
	</property>
	<property>
		<name>yarn.resourcemanager.scheduler.address</name>
		<value>master:8030</value>
	</property>
	<property>
		<name>yarn.resourcemanager.resource-tracker.address</name>
		<value>master:8031</value>
	</property>
	<property>
		<name>yarn.resourcemanager.admin.address</name>
		<value>master:8033</value>
	</property>
	<property>
		<name>yarn.resourcemanager.webapp.address</name>
		<value>master:8088</value>
	</property>

	<property>
		<name>yarn.scheduler.minimum-allocation-mb</name>
		<value>512</value>
	</property>
	<property>
		<name>yarn.scheduler.maximum-allocation-mb</name>
		<value>2048</value>
	</property>

	<property>
		<name>yarn.nodemanager.resource.memory-mb</name>
		<value>1024</value>
	</property>
	<property>
		<name>yarn.nodemanager.pmem-check-enabled</name>
		<value>true</value>
	</property>
	<property>
		<name>yarn.nodemanager.aux-services</name>
		<value>mapreduce_shuffle</value>
	</property>
</configuration>
  • Set the address clients use to submit jobs to IP1:8032
  • Set the scheduler address (used by ApplicationMasters to request resources) to IP1:8030
  • Set the resource-tracker address (reported to by NodeManagers) to IP1:8031 and the admin address to IP1:8033
  • Set the ResourceManager web UI address to IP1:8088
  • The preceding address settings are optional and have default values
  • Set the minimum memory allocation per task container to 512 MB
  • Set the maximum memory allocation per task container to 2048 MB
  • Set the memory available to each NodeManager to 1024 MB
  • Enable the physical-memory check: a task that exceeds its memory limit is automatically killed
(4) Add to mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
		<value>yarn</value>
	</property>
	<property>
		<name>mapreduce.jobhistory.address</name>
		<value>master:10020</value>
	</property>
	<property>
		<name>mapreduce.jobhistory.webapp.address</name>
		<value>master:19888</value>
	</property>
</configuration>
  • Set the JobHistory Server address to master:10020 (the default port)
  • Set the JobHistory web UI address to master:19888 (the default port)
(5) Edit hadoop-env.sh

Around line 54, change the line to

export JAVA_HOME=${JAVA_HOME}
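If the daemons later refuse to start and complain that JAVA_HOME is not set, a common workaround is to hard-code the path here instead, matching the value used in /etc/profile above:

export JAVA_HOME=/usr/lib/jvm/jre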
(6) Create the files masters and workers in the same directory
The masters file contains:
IP1
The workers file contains:
IP2
IP3
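If you edited the configuration files only on master, copy them to the other nodes so every machine has the same configuration (assuming the same installation path everywhere):

scp /home/hadoop/hadoop-3.0.1/etc/hadoop/* slave1:/home/hadoop/hadoop-3.0.1/etc/hadoop/
scp /home/hadoop/hadoop-3.0.1/etc/hadoop/* slave2:/home/hadoop/hadoop-3.0.1/etc/hadoop/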

5. Initialization, start, and stop

1. Format the NameNode (run this on master only)

su hadoop
hdfs namenode -format

2. Start

start-dfs.sh
start-yarn.sh

or

start-all.sh
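To check that the daemons are running, run the JDK's jps tool on each machine:

jps

With the configuration above, master should show NameNode and ResourceManager; slave1 should show DataNode, NodeManager, and SecondaryNameNode; slave2 should show DataNode and NodeManager. To stop the cluster, use the matching stop scripts:

stop-dfs.sh
stop-yarn.sh

or

stop-all.sh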

3. Start and stop the JobHistory Server

mr-jobhistory-daemon.sh start historyserver

mr-jobhistory-daemon.sh stop historyserver

6. Other

  1. Do not format the NameNode before every startup; repeated formatting makes the clusterIDs of the NameNode and DataNodes inconsistent, which causes startup to fail.
  2. If you do need to reformat, first delete the runtime directories specified in the configuration files, such as hdfs/name, hdfs/data, and tmp (under the Hadoop installation directory).
  3. When something goes wrong, check the logs folder under the Hadoop installation directory to troubleshoot the error.