1. Configuring the VM Network (NAT Mode)

Host ipconfig screenshot:



VMnet8 network configuration:



VM network configuration:





2. Configuring a Single-Node Environment

2.1 Uploading files to CentOS and Configuring the Java and Hadoop environment

Uploading the installation package to the server:



After the upload is successful, you can see two compressed packages:



Unzip the two packages:



Rename the file so that you can configure environment variables later:



JDK + Hadoop environment variables
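In practice these variables go into /etc/profile and take effect after `source /etc/profile`. A minimal sketch using the paths from this guide (the exact directories depend on where you unpacked and renamed the archives):

```shell
# Append to /etc/profile, then run `source /etc/profile`.
# Paths follow this guide's layout (the renamed jdk1.8 and hadoop2.8.5 dirs).
export JAVA_HOME=/usr/java/jdk1.8
export HADOOP_HOME=/usr/java/hadoop2.8.5
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
echo "$JAVA_HOME"
```

With these in place, `java -version` and `hadoop version` should work from any directory.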



Check whether JDK environment variables are configured successfully:



Check whether the Hadoop environment variables are configured successfully:



—————————————————————————————

Now that the JDK and Hadoop are installed, the next step is to modify some configuration files.

2.2 Changing the CentOS Host Name

Default host name:



View and modify host names:




To permanently change the host name, edit the configuration file by running the vi /etc/sysconfig/network command.




2.3 Binding the Hostname to an IP Address

To bind the host name to an IP address, edit /etc/hosts by running the vi /etc/hosts command:
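The entries this guide assumes look like the following. Only the first address appears in the text (see the Windows hosts note later on); the node1/node2 addresses are illustrative, and the snippet writes to a temp file so it can be tried safely (in practice, append to /etc/hosts):

```shell
# Hostname-to-IP mappings for the cluster. 192.168.158.128 is the
# address used in this guide; the other two are example values.
cat > /tmp/hosts.demo <<'EOF'
192.168.158.128 node
192.168.158.129 node1
192.168.158.130 node2
EOF
grep -c ' node' /tmp/hosts.demo
```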



2.4 Disabling the Firewall
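The text relies on a screenshot here. On CentOS 7 the firewall is typically disabled through firewalld, as sketched below; on CentOS 6 the equivalents are `service iptables stop` and `chkconfig iptables off`. Errors are ignored so the snippet is safe to run on systems without systemd:

```shell
# Stop firewalld now and keep it from starting on boot (needs root).
systemctl stop firewalld 2>/dev/null || true
systemctl disable firewalld 2>/dev/null || true
echo "firewall disabled (or not present)"
```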



2.5 Hadoop Directory Structure

1. View the Hadoop directory structure by running ll:



2. Important Contents

(1) bin directory: stores scripts for operating Hadoop-related services (HDFS, YARN)

(2) etc directory: stores the Hadoop configuration files

(3) lib directory: stores Hadoop's native libraries (data compression and decompression)

(4) sbin directory: stores scripts for starting and stopping Hadoop-related services

(5) share directory: stores Hadoop's dependency JARs, documentation, and official examples


Three operating modes of Hadoop

Hadoop runs in local mode, pseudo-distributed mode, and fully distributed mode.

Hadoop official website: hadoop.apache.org/

Mode 1: Local running mode

The official grep example



1. Create an input folder under the hadoop2.8.5 directory

[root@node hadoop2.8.5]$ mkdir input

2. Copy the Hadoop XML configuration files to input

[root@node hadoop2.8.5]$ cp etc/hadoop/*.xml input

3. Run the MapReduce program in the share directory

[root@node hadoop2.8.5]$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.5.jar grep input output 'dfs[a-z.]+'

4. View the output

[root@node hadoop2.8.5]$ cat output/*
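The grep job above extracts every string matching 'dfs[a-z.]+' from the copied XML files and counts the matches. Its effect can be sketched locally with plain shell tools, independent of Hadoop (the sample file below is a stand-in for the real configs):

```shell
# Simulate what the grep example job computes: extract matching
# strings from XML input and count occurrences of each.
mkdir -p /tmp/grepdemo/input
cat > /tmp/grepdemo/input/sample.xml <<'EOF'
<name>dfs.replication</name>
<name>dfs.namenode.name.dir</name>
EOF
grep -ohE 'dfs[a-z.]+' /tmp/grepdemo/input/*.xml | sort | uniq -c
```

Each matched property name is printed once with its count, which is the shape of the output the MapReduce job writes to the output directory.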

Console results display:






The official WordCount case

1. Create a wcinput folder under the hadoop2.8.5 directory

[root@node hadoop2.8.5]$ mkdir wcinput

2. Create a wc.input file under the wcinput folder

[root@node hadoop2.8.5]$ cd wcinput
[root@node wcinput]$ touch wc.input

3. Edit wc.input

[root@node wcinput]$ vi wc.input

Enter the following information in the file

hadoop 

hadoop 

mapreduce

yarn

Save and exit with :wq

4. Go back to /usr/java/hadoop2.8.5

5. Execute the program

[root@node hadoop2.8.5]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.5.jar wordcount wcinput wcoutput

6. View the results

[root@node hadoop2.8.5]$ cat wcoutput/part-r-00000

hadoop 2

mapreduce 1

yarn 1
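The part-r-00000 result above can be reproduced with plain shell tools; a local sketch of what the wordcount job computes on this small input:

```shell
# Simulate wordcount on the wc.input contents from this guide:
# split into words, sort, count duplicates, print "word count".
printf 'hadoop\nhadoop\nmapreduce\nyarn\n' > /tmp/wc.input
tr -s ' ' '\n' < /tmp/wc.input | sort | uniq -c | awk '{print $2, $1}'
# prints:
# hadoop 2
# mapreduce 1
# yarn 1
```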

Case result display:








Mode 2: Pseudo-distributed operation mode

Start HDFS and run the MapReduce program

1. Configure the cluster

(1) Configuration: hadoop-env.sh

export JAVA_HOME=/usr/java/jdk1.8/

(2) Configuration: core-site.xml

<!-- Specify the address of the NameNode in HDFS -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://node:9000</value>
</property>
<!-- Specify the storage directory for files generated at Hadoop runtime -->
<property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/java/hadoop2.8.5/data/tmp</value>
</property>

(3) Configuration: hdfs-site.xml

<!-- Specify the number of HDFS replicas -->
<property>
    <name>dfs.replication</name>
    <value>3</value>
</property>


2. Start the cluster

(1) Format the NameNode.

[root@node hadoop2.8.5]$ hdfs namenode -format



(2) Start the NameNode

[root@node hadoop2.8.5]$ hadoop-daemon.sh start namenode



(3) Start the DataNode

[root@node hadoop2.8.5]$ hadoop-daemon.sh start datanode




3. View the cluster

(1) Check whether the startup succeeded by running the jps command

Note: jps is a JDK command, not a Linux command; it is unavailable if the JDK is not installed

(2) View the HDFS file system on the web

http://node:50070

Note: On Windows, for the URL to resolve, add the line 192.168.158.128 node to C:\Windows\System32\drivers\etc\hosts.



(3) View generated logs



View logs locally:



Viewing logs on the Web:



(4) Think: why can't the NameNode be formatted over and over? What should be noted when formatting the NameNode?

[root@node hadoop2.8.5]$ cd data/tmp/dfs/name/current/
[root@node current]$ cat VERSION

clusterID=CID-1e77ad8f-5b3f-4647-a13a-4ea3f01b6d65

[root@node hadoop2.8.5]$ cd data/tmp/dfs/data/current/
[root@node current]$ cat VERSION

clusterID=CID-1e77ad8f-5b3f-4647-a13a-4ea3f01b6d65

Note: Formatting the NameNode generates a new cluster ID. If the DataNode still holds the old one, the NameNode and DataNode cluster IDs no longer match, and the cluster cannot find its past data. Therefore, before reformatting the NameNode, delete the data and log directories first, and only then format.
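The check described above can be sketched as a small script. The paths mirror the data/tmp/dfs layout under hadoop.tmp.dir; demo files are created here so the snippet runs without a Hadoop install:

```shell
# Compare the clusterID recorded by the NameNode and a DataNode.
# If they differ, delete data/ and logs/ before reformatting.
mkdir -p /tmp/dfs/name/current /tmp/dfs/data/current
echo 'clusterID=CID-1e77ad8f-5b3f-4647-a13a-4ea3f01b6d65' > /tmp/dfs/name/current/VERSION
echo 'clusterID=CID-1e77ad8f-5b3f-4647-a13a-4ea3f01b6d65' > /tmp/dfs/data/current/VERSION
nn=$(grep '^clusterID=' /tmp/dfs/name/current/VERSION)
dn=$(grep '^clusterID=' /tmp/dfs/data/current/VERSION)
if [ "$nn" = "$dn" ]; then
  echo "clusterIDs match"
else
  echo "clusterIDs differ"
fi
```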


4. Operate the cluster

(1) Create an INPUT folder in the HDFS

Run: hdfs dfs -mkdir -p /usr/java/hadoop/input






(2) Upload the local test file to the file system

Run: hdfs dfs -put wcinput/wc.input /usr/java/hadoop/input/






(3) Check whether the uploaded file is correct

Run: hdfs dfs -cat /usr/java/hadoop/input/wc.input




(4) Run MapReduce

Run: hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.5.jar wordcount /usr/java/hadoop/input/ /usr/java/hadoop/output




(5) View the output

Run: hdfs dfs -cat /usr/java/hadoop/output/*








(6) Download the test file to the local PC

Run: hdfs dfs -get /usr/java/hadoop/output/part-r-00000 ./wcoutput/




(7) Delete the output result

Run: hdfs dfs -rm -r /usr/java/hadoop/output





Start YARN and run MapReduce

1. Configure the cluster

(1) Configure yarn-env.sh



(2) Configure yarn-site.xml

<!-- Specify how the reducer obtains data -->
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<!-- Specify the hostname of the YARN ResourceManager -->
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>node</value>
</property>




(3) Configure mapred-env.sh




(4) Rename mapred-site.xml.template to mapred-site.xml and configure it
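The rename itself is a single mv run from the Hadoop root. It is sketched here in a temporary directory so it can be tried without a Hadoop install; in practice run the mv from /usr/java/hadoop2.8.5:

```shell
# Hadoop ships only mapred-site.xml.template; rename it before editing.
mkdir -p /tmp/hconf/etc/hadoop
touch /tmp/hconf/etc/hadoop/mapred-site.xml.template
cd /tmp/hconf
mv etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml
ls etc/hadoop
# prints: mapred-site.xml
```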

<!-- Specify that MapReduce runs on YARN -->
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>




2. Start the cluster

(1) Before startup, ensure that NameNode and DataNode have been started

(2) Start ResourceManager

Run the yarn-daemon.sh start resourcemanager command

(3) Start NodeManager

Run the yarn-daemon.sh start nodemanager command




3. Perform cluster operations

(1) View YARN in the browser, as shown in Figure 2-35

http://node:8088/cluster



(2) Delete the output file from the file system

Run: hdfs dfs -rm -r /usr/java/hadoop/output

(3) Run the MapReduce program

Run: hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.5.jar wordcount /usr/java/hadoop/input/ /usr/java/hadoop/output

(4) View the running result




Mode 3: Fully distributed operation mode


1. Clone the VM



2. Modify the configuration file

(1) run the vi /etc/sysconfig/network-scripts/ifcfg-ens33 command



  (2)  vi /etc/sysconfig/network



  (3)  vi /etc/hosts





3. Plan cluster deployment

HDFS:

  • node: NameNode, DataNode

  • node1: DataNode

  • node2: SecondaryNameNode, DataNode

YARN:

  • node: NodeManager

  • node1: ResourceManager, NodeManager

  • node2: NodeManager

4. Configure the cluster

(1) Configure core-site.xml

<!-- Specify the address of the NameNode in HDFS -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://node:9000</value>
</property>
<!-- Specify the storage directory for files generated at Hadoop runtime -->
<property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/java/hadoop2.8.5/data/tmp</value>
</property>



(2) HDFS configuration file

  • Configure hadoop-env.sh

export JAVA_HOME=/usr/java/jdk1.8/

  • Configure hdfs-site.xml

<property>
    <name>dfs.replication</name>
    <value>3</value>
</property>
<!-- Specify the host of the Hadoop auxiliary name node (SecondaryNameNode) -->
<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>node2:50090</value>
</property>




(3) YARN configuration file

  • Configure yarn-env.sh

export JAVA_HOME=/usr/java/jdk1.8/

  • Configure yarn-site.xml

<!-- Specify how the reducer obtains data -->
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<!-- Specify the hostname of the YARN ResourceManager -->
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>node1</value>
</property>




(4) MapReduce configuration file

  • Configure mapred-env.sh

export JAVA_HOME=/usr/java/jdk1.8/

  • Configure mapred-site.xml

<!-- Specify that MapReduce runs on YARN -->
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>




5. Password-free communication between nodes: configure SSH passwordless login
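The text relies on screenshots here. A common way to set this up is sketched below; the ssh-copy-id targets use the hostnames from the cluster plan, and the demo key path is hypothetical so the snippet does not touch ~/.ssh:

```shell
# Generate an RSA key pair with no passphrase (demo path, quiet mode).
ssh-keygen -t rsa -N '' -f /tmp/demo_id_rsa -q
ls /tmp/demo_id_rsa.pub
# In the real setup, run on each node (including to itself):
#   ssh-keygen -t rsa
#   ssh-copy-id node && ssh-copy-id node1 && ssh-copy-id node2
```

After the public key is copied to every node, start-dfs.sh and start-yarn.sh can start daemons on all nodes without password prompts.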


















6. Start the cluster

Start HDFS: start-dfs.sh




Start YARN: start-yarn.sh




[node] jps 




[node1] jps




[node2] jps



Cluster commands:

Start/stop the HDFS

start-dfs.sh / stop-dfs.sh

Start/stop YARN

start-yarn.sh / stop-yarn.sh

Start all/stop all

start-all.sh / stop-all.sh