Apache version
Open the official Hadoop documentation for your version (here hadoop.apache.org/docs/r2.10….). It is best to work from the documentation matching your version to avoid compatibility problems.
Installation and configuration
- Several Linux machines (three CentOS 7.6 machines in this article)
- Cluster time synchronization via NTP (strictly speaking you can do without it, but in practice you really do need it); a minimal sketch follows.
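A minimal sketch of one-shot time synchronization on CentOS 7, assuming the machines can reach a public NTP server (the server address below is just an example):
$ sudo yum install -y ntpdate
$ sudo ntpdate ntp.aliyun.com
# Optionally add a line like the following via crontab -e to re-sync every hour
# 0 * * * * /usr/sbin/ntpdate ntp.aliyun.com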
- Configure the JDK (a sketch follows).
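A minimal sketch of a JDK setup, assuming a JDK 8 tarball unpacked under /home/justin/env (the archive name and paths are examples, not from the original setup):
# Unpack the JDK
$ tar -zxvf jdk-8u241-linux-x64.tar.gz -C /home/justin/env
# Append to ~/.bashrc (or /etc/profile), then reload and verify
export JAVA_HOME=/home/justin/env/jdk1.8.0_241
export PATH=$PATH:$JAVA_HOME/bin
$ source ~/.bashrc
$ java -version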
- Configure the hosts file (a sketch follows).
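A sketch of /etc/hosts, to be added on all three machines (the IP addresses are placeholders; use your own):
192.168.1.101 hadoop01
192.168.1.102 hadoop02
192.168.1.103 hadoop03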
- Install the required software: SSH (required for cluster deployment) and rsync (required for synchronizing cluster configuration).
- Set up passwordless SSH. First check whether ssh localhost works without a passphrase; if it does not, run:
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
- Hadoop package (this article uses hadoop-2.10.0.tar.gz)
- Configure etc/hadoop/hadoop-env.sh: change export JAVA_HOME=${JAVA_HOME} to an explicit JDK path.
- Configure HADOOP_HOME (optional, but convenient):
export HADOOP_HOME=/home/justin/env/hadoop-2.10.0
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
The Hadoop configuration is complete (but not started yet). You can use the Hadoop command to test:
[justin@hadoop01]$ hadoop
Usage: hadoop [--config confdir] [COMMAND | CLASSNAME]
  CLASSNAME            run the class named CLASSNAME
 or
  where COMMAND is one of:
  fs                   run a generic filesystem user client
  ...
Start Hadoop
Hadoop has three startup modes:
- Local (Standalone) Mode: a single process
- Pseudo-Distributed Mode
- Fully-Distributed Mode
Single-process mode
Only one Hadoop process runs. This deployment mode is generally used to debug programs before deploying them to a cluster.
This mode requires no configuration beyond the environment set up above; you simply execute a JAR package against it. We can use the official example to check whether the environment is OK.
Validation
$ mkdir input
$ cp etc/hadoop/*.xml input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.0.jar grep input output 'dfs[a-z.]+'
$ cat output/*   # The basic environment is OK
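With the default configuration files, the output typically contains a single match, something like:
1       dfsadmin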
Pseudo distributed
Start the Hadoop-related background services on a single machine, each in a separate process.
Configure the cluster:
- Configure etc/hadoop/core-site.xml
<!-- Configure the HDFS NameNode address -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop01:9000</value>
</property>
<!-- By default Hadoop keeps its working files in the system tmp folder, so they disappear after a restart; specify a persistent folder instead -->
<property>
    <name>hadoop.tmp.dir</name>
    <value>/data/my/tmp</value>
</property>
- For details, see core-default.xml
- Configure etc/hadoop/hdfs-site.xml
<!-- Specify the number of HDFS replicas -->
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
- For details, see hdfs-default.xml
- Format the NameNode (only before the first startup; do not format again afterwards)
$ bin/hdfs namenode -format
Formatting the NameNode generates a new cluster ID. If old DataNode data is left behind, the NameNode and DataNode cluster IDs no longer match and the cluster cannot find its past data. Therefore, before reformatting the NameNode, delete the data and log directories first. The two IDs can be compared as shown below.
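A quick way to compare the two cluster IDs, assuming hadoop.tmp.dir=/data/my/tmp as configured above (dfs/name and dfs/data are the default subdirectories):
# clusterID recorded by the NameNode
$ grep clusterID /data/my/tmp/dfs/name/current/VERSION
# clusterID recorded by the DataNode; the two must match
$ grep clusterID /data/my/tmp/dfs/data/current/VERSION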
- Start the NameNode & DataNode
$ sbin/start-dfs.sh
# They can also be started separately
$ sbin/hadoop-daemon.sh start namenode
$ sbin/hadoop-daemon.sh start datanode
- Stop
$ sbin/stop-dfs.sh
Verify HDFS
If the NameNode or DataNode fails to start, check the logs under the logs folder; a sketch follows.
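A minimal sketch of inspecting a daemon log; log file names follow the pattern hadoop-<user>-<daemon>-<hostname>.log (user justin on hadoop01 is assumed here):
$ tail -n 50 logs/hadoop-justin-namenode-hadoop01.log
$ tail -n 50 logs/hadoop-justin-datanode-hadoop01.log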
Two ways to verify:
- Run jps:
[justin@hadoop01]$ jps
4899 DataNode
5316 Jps
5096 SecondaryNameNode
4783 NameNode
- Visit http://hadoop01:50070/ in a browser
- Run the MapReduce example test directly on HDFS
# Upload the test files into a newly created input folder
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/justin        # mkdir creates only one level at a time
$ bin/hdfs dfs -mkdir /user/justin/input
$ bin/hdfs dfs -put etc/hadoop/*.xml /user/justin/input
# Run the official example MR program; the output directory must not already exist
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.0.jar grep /user/justin/input /user/justin/output 'dfs[a-z.]+'
# Check the result
$ bin/hdfs dfs -cat /user/justin/output/*
# Delete the result
$ hdfs dfs -rm -r /user/justin/output
Configure YARN
- Configure etc/hadoop/mapred-site.xml (copy it from mapred-site.xml.template)
<!-- Run MR programs on YARN -->
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
- For details, see mapred-default.xml
- Configure etc/hadoop/yarn-site.xml
<!-- How the Reducer obtains data -->
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<!-- Specify the address of the YARN ResourceManager -->
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop01</value>
</property>
- For details, see yarn-default.xml
- Start ResourceManager and NodeManager (make sure NameNode and DataNode are already started)
$ sbin/start-yarn.sh
# Or start them separately
$ sbin/yarn-daemon.sh start resourcemanager
$ sbin/yarn-daemon.sh start nodemanager
- Stop
$ sbin/stop-yarn.sh
Verify YARN
- Run jps:
[justin@hadoop01]$ jps
4899 DataNode
6150 Jps
5096 SecondaryNameNode
5897 ResourceManager
4783 NameNode
5999 NodeManager
- Visit http://hadoop01:8088/cluster in a browser
- Run the MapReduce example test on YARN (the same example as the MR run above); see the sketch after this list.
- The result can be viewed on the command line or in the browser.
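While or after the job runs, applications can also be listed from the command line; a minimal sketch:
# List running applications
$ bin/yarn application -list
# Include finished applications as well
$ bin/yarn application -list -appStates ALL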
Configure JobHistory
Now that everything is running, one problem remains:
History links cannot be opened, because the JobHistory service has not been configured yet.
- Configure etc/hadoop/mapred-site.xml
The address here can point to any machine (preferably a relatively idle one), but JobHistory must later be started on whichever machine the address names, otherwise it will fail to start!
<!-- JobHistory service address -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop01:10020</value>
</property>
<!-- JobHistory web address -->
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop01:19888</value>
</property>
<!-- Where finished job history is stored in HDFS -->
<property>
    <name>mapreduce.jobhistory.done-dir</name>
    <value>/history/done</value>
</property>
<!-- Path for intermediate files while MR jobs are running -->
<property>
    <name>mapreduce.jobhistory.intermediate-done-dir</name>
    <value>/history/done_intermediate</value>
</property>
- Restart YARN
$ sbin/stop-yarn.sh
$ sbin/start-yarn.sh
- Start the JobHistory service
$ sbin/mr-jobhistory-daemon.sh start historyserver
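If it started successfully, jps should now also show a JobHistoryServer process (the PID below is just an example):
$ jps
...
6288 JobHistoryServer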
- Visit http://hadoop01:19888/jobhistory
- Run a MapReduce job and click History to verify (if the page cannot be opened, check whether the URL domain name is correct and configure the hosts file).
Configure log aggregation
With the history server configured, clicking logs in a Job's History panel still produces an error:
Aggregation is not enabled. Try the nodemanager at hadoop01:27695
Or see application log at http://hadoop01:27695/node/application/application_1595579336183_0002
This requires us to enable log aggregation, so that we can see the details of a program run, which makes development and debugging easier.
- Configure yarn-site.xml
<!-- Enable log aggregation -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<!-- Keep logs for 7 days -->
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>
- Restart the YARN and JobHistory services. Aggregated logs can then also be fetched from the command line, as sketched below.
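Besides the web UI, aggregated logs can be pulled with the yarn CLI; a minimal sketch (the application ID is just an example):
$ bin/yarn logs -applicationId application_1595579336183_0002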
Fully distributed
For a real cluster, we will deploy Hadoop’s NameNode and DataNode on different machines.
Before configuration, you need to plan how to distribute these service nodes. In this example:
| | hadoop01 | hadoop02 | hadoop03 |
|---|---|---|---|
| HDFS | NameNode, DataNode | DataNode | SecondaryNameNode, DataNode |
| YARN | NodeManager | ResourceManager, NodeManager | NodeManager |
Try to distribute the nodes evenly rather than stacking them on one machine. By default every machine runs a DataNode and a NodeManager, which manage storage and compute respectively.
Ensure that hadoop01 can log in to the other machines over SSH. On the first connection, answer yes to "Are you sure you want to continue connecting (yes/no)?".
Because ResourceManager is configured on hadoop02, hadoop02 must also be able to SSH into the other two machines.
For convenience, just set up passwordless SSH between all three machines, as sketched below.
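A minimal sketch using ssh-copy-id, run on each of the three machines in turn (assuming the user justin and a key pair already generated in the earlier step):
$ ssh-copy-id justin@hadoop01
$ ssh-copy-id justin@hadoop02
$ ssh-copy-id justin@hadoop03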
- First configure the basic environment on hadoop01; for the specific steps, refer to the pseudo-distributed section.
- Set JAVA_HOME in hadoop-env.sh. In my tests, the other xxx-env.sh files worked without setting JAVA_HOME.
- core-site.xml
<!-- The NameNode runs on hadoop01 -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop01:9000</value>
</property>
<!-- Optional -->
<!-- hdfs-site.xml below sets the name and data directories explicitly, so this is not strictly needed -->
<property>
    <name>hadoop.tmp.dir</name>
    <value>/data/my/tmp</value>
</property>
- hdfs-site.xml
<!-- Set the HTTP address of the SecondaryNameNode -->
<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>hadoop03:50090</value>
</property>
<!-- Set the path where the NameNode stores its data -->
<property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/justin/env/hadoop-2.10.0/tmp_dfs/name</value>
</property>
<!-- Set the path where the DataNode stores its data -->
<property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/justin/env/hadoop-2.10.0/tmp_dfs/data</value>
</property>
<!-- Optional -->
<!-- Set the number of HDFS replicas (default: 3) -->
<property>
    <name>dfs.replication</name>
    <value>3</value>
</property>
- yarn-site.xml
<!-- How the Reducer obtains data: mapreduce_shuffle -->
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<!-- Specify the machine on which ResourceManager runs -->
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop02</value>
</property>
<!-- Enable log aggregation -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<!-- Keep logs for 7 days -->
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>
- mapred-site.xml
<!-- Run MapReduce on YARN -->
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
<!-- JobHistory service address -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop01:10020</value>
</property>
<!-- JobHistory web address -->
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop01:19888</value>
</property>
<!-- Where finished job history is stored in HDFS -->
<property>
    <name>mapreduce.jobhistory.done-dir</name>
    <value>/history/done</value>
</property>
<!-- Path for intermediate files while MR jobs are running -->
<property>
    <name>mapreduce.jobhistory.intermediate-done-dir</name>
    <value>/history/done_intermediate</value>
</property>
- Slaves: vim etc/hadoop/slaves. The main purpose of the slaves file is to tell the cluster scripts which nodes run Hadoop.
hadoop01
hadoop02
hadoop03
- Then copy everything to the other machines.
scp -r /home/justin/env justin@hadoop02:/home/justin/
scp -r /home/justin/env justin@hadoop03:/home/justin/
# Or use rsync for synchronization
rsync -av /home/justin/env/hadoop-2.10.0/ hadoop0X:/home/justin/env/hadoop-2.10.0/
- If the cluster is being started for the first time, format the NameNode (before formatting, empty the tmp and logs directories).
$ bin/hdfs namenode -format
- Start the cluster
Note: if NameNode and ResourceManager are not on the same machine, YARN must be started on the machine where ResourceManager resides, not on the NameNode machine.
# Because NameNode is on hadoop01 and ResourceManager is on hadoop02:
hadoop01: start-dfs.sh
hadoop02: start-yarn.sh
# If NameNode and ResourceManager were both on hadoop01, start-all.sh could be used; otherwise ResourceManager fails to start
start-all.sh
- Start the JobHistory service on the corresponding machine.
$ sbin/mr-jobhistory-daemon.sh start historyserver
Validation
The processes match the planned distribution:
[justin@hadoop01]$ jps
32176 Jps
32033 NodeManager
31064 NameNode
31208 DataNode
[justin@hadoop02]$ jps
10899 ResourceManager
11012 NodeManager
8581 DataNode
11421 Jps
[justin@hadoop03]$ jps
24960 DataNode
25953 NodeManager
25082 SecondaryNameNode
26124 Jps
Browser access:
http://hadoop01:50070/ (the NameNode address)
http://hadoop02:8088/cluster (the ResourceManager address)
http://hadoop03:50090/status.html (the SecondaryNameNode)
If the page is empty, modify line 61 of share/hadoop/hdfs/webapps/static/dfs-dust.js as follows, rsync the file to all machines and restart HDFS, then clear the browser cache and refresh.
'date_tostring' : function (v) {
//return moment(Number(v)).format('ddd MMM DD HH:mm:ss ZZ YYYY');
return new Date(Number(v)).toLocaleString();
},
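A sketch of distributing the patched file, assuming HADOOP_HOME=/home/justin/env/hadoop-2.10.0 on every machine (run from $HADOOP_HOME on hadoop01):
$ rsync -av share/hadoop/hdfs/webapps/static/dfs-dust.js hadoop02:/home/justin/env/hadoop-2.10.0/share/hadoop/hdfs/webapps/static/
$ rsync -av share/hadoop/hdfs/webapps/static/dfs-dust.js hadoop03:/home/justin/env/hadoop-2.10.0/share/hadoop/hdfs/webapps/static/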