Building a Hadoop cluster on Linux cloud servers
The process is divided into five steps:
- New users
- Download and install
- Configure SSH password-free login
- Modify the configuration
- Initialization, start, and stop
1. Add user Hadoop
useradd -d /home/hadoop -m hadoop
usermod -a -G root hadoop
passwd hadoop
2. Download and install
1. JDK8
sudo yum install java-1.8.0-openjdk java-1.8.0-openjdk-devel -y
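To confirm the JDK installed correctly before moving on (the output should mention openjdk version 1.8.0):
java -version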
2. Hadoop 3.0.1
cd ~
wget https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-3.0.1/hadoop-3.0.1.tar.gz
mv hadoop-3.0.1.tar.gz /home/hadoop/
cd /home/hadoop/
tar -xzvf hadoop-3.0.1.tar.gz
chown -R hadoop hadoop-3.0.1
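A quick check that the archive unpacked and that the hadoop user now owns the directory:
ls -ld /home/hadoop/hadoop-3.0.1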
3. Configure SSH password-free login
1. Edit /etc/hosts
(In the following, IPn stands for a cloud server's public IP address, e.g. 192.168.1.1. Note: if an entry refers to the machine you are currently editing, use its internal IP address instead.)
IP1 master
IP2 slave1
IP3 slave2
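Once /etc/hosts has been updated on every server, it is worth confirming that the names resolve, for example:
getent hosts master slave1 slave2
ping -c 1 slave1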
2. Switch to the Hadoop user and generate id_rsa.pub
su hadoop
cd ~
ssh-keygen -t rsa
cd ~/.ssh/
cat id_rsa.pub >> authorized_keys
chmod 700 /home/hadoop/.ssh
chmod 644 /home/hadoop/.ssh/authorized_keys
- All cloud servers must run the above commands.
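ssh-keygen prompts for a key location and a passphrase; pressing Enter at each prompt accepts the defaults and leaves the passphrase empty. If you prefer a non-interactive form, the equivalent (assuming the default key path) is:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa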
3. Exchange and share the id_rsa.pub contents
(If you are setting up pseudo-distributed mode on a single machine, skip this exchange step and test SSH directly.)
(1) On the master cloud server:
scp /home/hadoop/.ssh/authorized_keys slave1:/home/hadoop/.ssh/
(2) On the slave1 cloud server, append its own key to the file received from master, then pass it on:
cat /home/hadoop/.ssh/id_rsa.pub >> /home/hadoop/.ssh/authorized_keys
scp /home/hadoop/.ssh/authorized_keys slave2:/home/hadoop/.ssh/
(3) On the slave2 cloud server, append its own key, then copy the complete file back to the other nodes:
cat /home/hadoop/.ssh/id_rsa.pub >> /home/hadoop/.ssh/authorized_keys
scp /home/hadoop/.ssh/authorized_keys master:/home/hadoop/.ssh/
scp /home/hadoop/.ssh/authorized_keys slave1:/home/hadoop/.ssh/
- The end result of this step is that the authorized_keys file on every cloud server contains every server's id_rsa.pub and is identical on all of them.
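Alternatively, if ssh-copy-id is installed on the servers, the same result can be reached without copying authorized_keys around by hand; a minimal sketch, run as the hadoop user on each of the three servers:
# each run appends the local id_rsa.pub to the remote authorized_keys (password asked the first time)
ssh-copy-id hadoop@master
ssh-copy-id hadoop@slave1
ssh-copy-id hadoop@slave2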
(4) Test whether the configuration is successful
Run the following commands on master:
ssh slave1
exit
ssh slave2
exit
Run the following commands on slave1:
ssh master
exit
ssh slave2
exit
Run the following commands on slave2:
ssh master
exit
ssh slave1
exit
- Make sure every cloud server can ssh to every other one.
- The first time you ssh to a host, you are asked for its password; after entering it, answer yes so the host key is recorded. After that, you can log in without a password.
- If an exception occurs, restart the SSH service and try again:
sudo service sshd restart
4. Modify the configuration files
1. Configure environment variables in /etc/profile
export JAVA_HOME=/usr/lib/jvm/jre
export HADOOP_HOME=/home/hadoop/hadoop-3.0.1
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HADOOP_HOME/lib:$JAVA_HOME/bin
2. Enable environment variables
source /etc/profile
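If the variables took effect, the following should print the paths set above and report Hadoop 3.0.1 (a quick sanity check, assuming /etc/profile was edited as shown):
echo $JAVA_HOME
echo $HADOOP_HOME
hadoop version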
3. Locate the configuration files
For more details, see the official documentation
cd ~/hadoop-3.0.1/etc/hadoop
ls
- You will find many configuration files in this directory
(1) Add to core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
</property>
</configuration>
- Sets the HDFS NameNode URI to IP1:9000 (master <=> IP1)
- Sets the I/O file buffer size to 131072 bytes (128 KB)
(2) Add to hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/hadoop/hadoop-3.0.1/hdfs/name</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>slave1:9001</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/hadoop/hadoop-3.0.1/hdfs/data</value>
</property>
</configuration>
- Sets the directory where NameNode metadata is stored
- Sets the number of block replicas to 2
- Sets the Secondary NameNode URI to IP2:9001 (slave1 <=> IP2)
- Enables the WebHDFS module
- Sets the directory where DataNode blocks are stored
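The NameNode and DataNodes can create these directories themselves, but creating them up front as the hadoop user (paths taken from the configuration above) avoids permission surprises:
mkdir -p /home/hadoop/hadoop-3.0.1/hdfs/name /home/hadoop/hadoop-3.0.1/hdfs/data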
(3) Add to yarn-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.address</name>
<value>master:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>master:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>master:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>master:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>master:8088</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>512</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>1024</value>
</property>
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
- Sets the URI clients use to submit jobs to IP1:8032
- Sets the scheduler URI used by ApplicationMasters to IP1:8030
- Sets the URI NodeManagers use to register with the ResourceManager to IP1:8031
- Sets the ResourceManager admin URI to IP1:8033
- Sets the ResourceManager web UI URI to IP1:8088
- The preceding address settings are optional and have default values
- Sets the minimum memory allocated per container to 512 MB
- Sets the maximum memory allocated per container to 2048 MB
- Sets the memory available to each NodeManager to 1024 MB
- Keeps physical memory checks enabled: a container that exceeds its memory limit is automatically killed
- Enables the mapreduce_shuffle auxiliary service required by MapReduce jobs
- Note that with 1024 MB per NodeManager and a 512 MB minimum allocation, each node can run at most two containers at once, and a 2048 MB request can never be satisfied on a single node; scale these values to your servers' actual memory
(4) Add to mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>master:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>master:19888</value>
</property>
</configuration>
- Runs MapReduce jobs on YARN
- The job history server port, 10020, is also the default value
- The job history web UI port, 19888, is also the default value
(5) Edit hadoop-env.sh
Around line 54, change the JAVA_HOME line to:
export JAVA_HOME=${JAVA_HOME}
(6) Create the files masters and workers in the same directory
The content of masters is:
IP1
The content of workers is:
IP2
IP3
5. Initialization, start, and stop
1. Format NameNode
su hadoop
hdfs namenode -format
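If formatting succeeds, the output contains a "successfully formatted" message and the NameNode directory configured in hdfs-site.xml is populated; the clusterID referred to in the notes at the end can be seen with:
cat /home/hadoop/hadoop-3.0.1/hdfs/name/current/VERSION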
2. Start
start-dfs.sh
start-yarn.sh
or
start-all.sh
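To verify that everything came up: jps (shipped with the JDK) should show NameNode and ResourceManager on master, DataNode and NodeManager on the slaves, and SecondaryNameNode on slave1; the web interfaces are another quick check:
jps
curl -I http://master:9870   # NameNode web UI (9870 is the Hadoop 3.x default port)
curl -I http://master:8088   # ResourceManager web UI (configured above)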
3. Start and stop the job history server
mr-jobhistory-daemon.sh start historyserver
mr-jobhistory-daemon.sh stop historyserver
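As an end-to-end test you can submit the pi example bundled with the Hadoop binary distribution (the jar path below assumes the default layout of the 3.0.1 tarball):
hadoop jar /home/hadoop/hadoop-3.0.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.1.jar pi 2 10
A successfully submitted job appears on master:8088 and, once the history server is running, on master:19888.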
6. Other
- Do not format the NameNode every time you start the cluster; reformatting gives the NameNode a new clusterID that no longer matches the DataNodes', and startup fails.
- If you do need to reformat, first delete the directories that are created at run time and pointed to by the configuration files, such as hdfs/name, hdfs/data, and tmp (under the Hadoop installation directory).
- When an error occurs, check the logs directory under the Hadoop installation directory to troubleshoot it.
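For example, to look at the most recent NameNode log on master (the exact file name includes the user and hostname, so adjust it to what you see under logs/):
tail -n 100 /home/hadoop/hadoop-3.0.1/logs/hadoop-hadoop-namenode-master.log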