Recently, someone asked me to post some content about big data. No problem! Today, let's start with the installation environment and build our own learning environment.
There are three ways to build and use Hadoop:
- Stand-alone mode, suitable for development and debugging;
- Pseudo-distributed mode, suitable for simulating a cluster while learning;
- Fully distributed mode, used in production environments.
This document describes how to build a fully distributed Hadoop cluster with one master node and two data nodes.
Prerequisites
- Prepare three servers
Virtual machines, physical machines, or cloud instances can all be used. Here, three instances in an OpenStack private cloud are used for installation and deployment.
- Operating system and software version
Server | System | Memory | IP | Role | JDK | Hadoop |
---|---|---|---|---|---|---|
node1 | Ubuntu 18.04.2 LTS | 8 GB | 10.101.18.21 | master | JDK 1.8.0_222 | Hadoop 3.2.1 |
node2 | Ubuntu 18.04.2 LTS | 8 GB | 10.101.18.8 | slave1 | JDK 1.8.0_222 | Hadoop 3.2.1 |
node3 | Ubuntu 18.04.2 LTS | 8 GB | 10.101.18.24 | slave2 | JDK 1.8.0_222 | Hadoop 3.2.1 |
- Install the JDK on all three machines
Since Hadoop is written in Java, a Java environment needs to be installed on each server. I used JDK 1.8.0_222 here (the Sun JDK is also recommended).
Install command
sudo apt install openjdk-8-jdk-headless
To configure the Java environment variables, add the following to the bottom of the .profile file in the current user's home directory:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
Use the source command to make it effective immediately
source .profile
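If everything is set up correctly, a quick check should print the installed JDK version and path (the exact build string may differ on your system):
# Verify the JDK installation and JAVA_HOME
java -version
echo $JAVA_HOME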
- Host configuration
Modify the hosts files of the three servers
vim /etc/hosts
Add the following entries, adjusted to your own server IPs:
10.101.18.21 master
10.101.18.8 slave1
10.101.18.24 slave2
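Optionally, verify from each node that the hostnames resolve to the expected addresses, for example:
# Check host name resolution
ping -c 1 master
ping -c 1 slave1
ping -c 1 slave2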
Passwordless SSH login configuration
- Generate the SSH key pair
ssh-keygen -t rsa
- Copy the public key so the Master can log in to each node without a password
ssh-copy-id -i ~/.ssh/id_rsa.pub master
ssh-copy-id -i ~/.ssh/id_rsa.pub slave1
ssh-copy-id -i ~/.ssh/id_rsa.pub slave2
- Test passwordless login
ssh master
ssh slave1
ssh slave2
Hadoop setup
We download the Hadoop package on the Master node, modify the configuration, and then copy it to the Slave nodes, where only minor adjustments are needed.
- Download the installation package and create a Hadoop directory
# Download
wget http://apache.claz.org/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
# Unzip to /usr/local
sudo tar -xzvf hadoop-3.2.1.tar.gz -C /usr/local
# Change ownership of the hadoop files
sudo chown -R ubuntu:ubuntu /usr/local/hadoop-3.2.1
# Rename the folder
sudo mv /usr/local/hadoop-3.2.1 /usr/local/hadoop
- Configure Hadoop environment variables for the Master node
As with JDK environment variables, edit the.profile file in the user directory to add Hadoop environment variables:
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
Execute source .profile for the changes to take effect immediately.
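To confirm that the Hadoop environment variables are picked up, you can run a quick check; it should report Hadoop 3.2.1 along with build information:
hadoop version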
- Configuring the Master Node
All Hadoop components are configured through XML files located in the /usr/local/hadoop/etc/hadoop directory:
- core-site.xml: common properties, such as I/O settings used by HDFS and MapReduce
- hdfs-site.xml: HDFS daemon configuration, including the NameNode, secondary NameNode, and DataNodes
- mapred-site.xml: MapReduce daemon configuration
- yarn-site.xml: resource scheduling configuration
A. Modify the core-site.xml file as follows:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/usr/local/hadoop/tmp</value>
<description>Abase for other temporary directories.</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
</configuration>
Parameter Description:
- fs.defaultFS: the default file system; HDFS clients need this parameter to access HDFS
- hadoop.tmp.dir: the temporary directory where Hadoop stores data; other directories are derived from it
If hadoop.tmp.dir is not configured, the system uses the default temporary directory /tmp/hadoop-hadoop. That directory is deleted on every restart, so you would have to rerun format each time to avoid errors.
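Since this article points hadoop.tmp.dir at /usr/local/hadoop/tmp, you can create that directory ahead of time on the Master node (optional):
mkdir -p /usr/local/hadoop/tmp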
B. Edit hdfs-site.xml as follows:
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/usr/local/hadoop/hdfs/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/usr/local/hadoop/hdfs/data</value>
</property>
</configuration>
Parameter Description:
- dfs.replication: number of replicas of each data block
- dfs.name.dir: storage directory for NameNode files
- dfs.data.dir: storage directory for DataNode files
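These directories can also be created in advance on each node so that ownership is correct before formatting (optional):
mkdir -p /usr/local/hadoop/hdfs/name /usr/local/hadoop/hdfs/data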
C. Edit mapred-site.xml as follows:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value>$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*</value>
</property>
</configuration>
D. Edit yarn-site.xml and make the following changes:
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME</value>
</property>
</configuration>
E. Edit workers and modify as follows:
slave1
slave2
Configuring the Worker node
- Configuring the Slave Node
Package the Hadoop configured on the Master node and send it to the other two nodes:
# Package Hadoop
tar -czf hadoop.tar.gz -C /usr/local hadoop
# Copy to the other two nodes
scp hadoop.tar.gz ubuntu@slave1:~
scp hadoop.tar.gz ubuntu@slave2:~
Unpack the Hadoop package to /usr/local on the other nodes:
sudo tar -xzvf hadoop.tar.gz -C /usr/local/
Configure the Hadoop environment variables for Slave1 and Slave2:
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
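As on the Master node, run source .profile on each slave so the variables take effect, then optionally verify:
# On slave1 and slave2
source .profile
hadoop version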
Start the cluster
- Format the HDFS file system
Go to the Hadoop directory on the Master node and perform the following operations:
bin/hadoop namenode -format
This formats the NameNode. It only needs to be done once, before starting the services for the first time; it does not need to be repeated afterwards.
Part of the log is shown below (line 5 of the log indicates that formatting succeeded):
2019-11-11 13:34:18.960 INFO util.GSet: VM type = 64-bit
2019-11-11 13:34:18.960 INFO util.GSet: 0.029999999329447746% max memory 1.7 GB = 544.5 KB
2019-11-11 13:34:18.961 INFO util.GSet: capacity = 2^16 = 65536 entries
2019-11-11 13:34:18.994 INFO namenode.FSImage: Allocated new BlockPoolId: BP-2017092058-10.101.18.21-1573450458983
2019-11-11 13:34:19.010 INFO common.Storage: Storage directory /usr/local/hadoop/hdfs/name has been successfully formatted.
2019-11-11 13:34:19.051 INFO namenode.FSImageFormatProtobuf: Saving image file /usr/local/hadoop/hdfs/name/current/fsimage.ckpt_0000000000000000000 using no compression
2019-11-11 13:34:19.186 INFO namenode.FSImageFormatProtobuf: Image file /usr/local/hadoop/hdfs/name/current/fsimage.ckpt_0000000000000000000 of size 401 bytes saved in 0 seconds .
2019-11-11 13:34:19.207 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2019-11-11 13:34:19.214 INFO namenode.FSImage: FSImageSaver clean checkpoint: txid=0 when meet shutdown.
- Starting a Hadoop Cluster
sbin/start-all.sh
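Note: start-all.sh is marked as deprecated in recent Hadoop versions; the HDFS and YARN daemons can also be started separately with the same effect:
sbin/start-dfs.sh
sbin/start-yarn.sh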
Problems and solutions during startup:
A. Error: master: rcmd: socket: Permission denied
Solution:
Run echo ssh > /etc/pdsh/rcmd_default
B. Error: JAVA_HOME is not set and could not be found.
Solution:
Modify hadoop-env.sh on all three nodes and add the following Java environment variable:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
- Run the jps command to check the running processes
Output on the Master node:
19557 ResourceManager
19914 Jps
19291 SecondaryNameNode
18959 NameNode
Output on the Slave nodes:
18580 NodeManager
18366 DataNode
18703 Jps
- View the Hadoop cluster status
hadoop dfsadmin -report
View the results:
Configured Capacity: 41258442752 (38.42 GB)
Present Capacity: 5170511872 (4.82 GB)
DFS Remaining: 5170454528 (4.82 GB)
DFS Used: 57344 (56 KB)
DFS Used%: 0.00%
Replicated Blocks:
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
Low redundancy blocks with highest priority to recover: 0
Pending deletion blocks: 0
Erasure Coded Block Groups:
Low redundancy block groups: 0
Block groups with corrupt internal blocks: 0
Missing block groups: 0
Low redundancy blocks with highest priority to recover: 0
Pending deletion blocks: 0
-------------------------------------------------
Live datanodes (2):
Name: 10.101.18.24:9866 (slave2)
Hostname: slave2
Decommission Status : Normal
Configured Capacity: 20629221376 (19.21 GB)
DFS Used: 28672 (28 KB)
Non DFS Used: 16919797760 (15.76 GB)
DFS Remaining: 3692617728 (3.44 GB)
DFS Used%: 0.00%
DFS Remaining%: 17.90%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Mon Nov 11 15:00:27 CST 2019
Last Block Report: Mon Nov 11 14:05:48 CST 2019
Num of Blocks: 0
Name: 10.101.18.8:9866 (slave1)
Hostname: slave1
Decommission Status : Normal
Configured Capacity: 20629221376 (19.21 GB)
DFS Used: 28672 (28 KB)
Non DFS Used: 19134578688 (17.82 GB)
DFS Remaining: 1477836800 (1.38 GB)
DFS Used%: 0.00%
DFS Remaining%: 7.16%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Mon Nov 11 15:00:24 CST 2019
Last Block Report: Mon Nov 11 13:53:57 CST 2019
Num of Blocks: 0
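As an optional smoke test, you can run one of the MapReduce examples bundled with Hadoop to confirm that HDFS and YARN work end to end (the jar path assumes the Hadoop 3.2.1 layout used in this article):
# Estimate pi with 2 map tasks and 10 samples per map
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar pi 2 10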
- Stop Hadoop
sbin/stop-all.sh
View the Hadoop cluster status in a web browser
Enter http://10.101.18.21:9870 in your browser to open the HDFS (NameNode) web UI.
Enter http://10.101.18.21:8088 in your browser to open the YARN (ResourceManager) web UI.