Apache version
Open the official Hadoop documentation for your version (here hadoop.apache.org/docs/r2.10….). It is best to work from the documentation matching your version to avoid compatibility problems.
Installation and configuration
- Several Linux machines (three CentOS 7.6 machines in this article)
- Cluster time synchronization via NTP (strictly speaking you can do without it, but in practice you really do need it); a minimal sketch follows.
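A minimal sketch of one-shot time synchronization on CentOS 7, assuming the machines can reach a public NTP server (the server address below is just an example):
$ sudo yum install -y ntpdate
$ sudo ntpdate ntp.aliyun.com
# Optionally add a line like the following via crontab -e to re-sync every hour
# 0 * * * * /usr/sbin/ntpdate ntp.aliyun.com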
- Configure the JDK (a sketch follows).
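A minimal sketch of a JDK setup, assuming a JDK 8 tarball unpacked under /home/justin/env (the archive name and paths are examples, not from the original setup):
# Unpack the JDK
$ tar -zxvf jdk-8u241-linux-x64.tar.gz -C /home/justin/env
# Append to ~/.bashrc (or /etc/profile), then reload and verify
export JAVA_HOME=/home/justin/env/jdk1.8.0_241
export PATH=$PATH:$JAVA_HOME/bin
$ source ~/.bashrc
$ java -version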
- Configure the hosts file (a sketch follows).
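A sketch of /etc/hosts, to be added on all three machines (the IP addresses are placeholders; use your own):
192.168.1.101 hadoop01
192.168.1.102 hadoop02
192.168.1.103 hadoop03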
- Install the required software: SSH (required for cluster deployment) and rsync (required for synchronizing cluster configuration).
- Set up passwordless SSH. First check whether ssh localhost works without a passphrase; if it does not, run:
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
- Hadoop package (this article uses hadoop-2.10.0.tar.gz)
- Configure etc/hadoop/hadoop-env.sh: change export JAVA_HOME=${JAVA_HOME} to an explicit JDK path.
- Configure HADOOP_HOME (optional, but convenient):
export HADOOP_HOME=/home/justin/env/hadoop-2.10.0
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
The Hadoop configuration is complete (but not started yet). You can use the Hadoop command to test:
[justin@hadoop01]$ hadoop
Usage: hadoop [--config confdir] [COMMAND | CLASSNAME]
  CLASSNAME            run the class named CLASSNAME
 or
  where COMMAND is one of:
  fs                   run a generic filesystem user client
  ...
Start Hadoop
Hadoop has three startup modes:
- Local (Standalone) Mode: a single process
- Pseudo-Distributed Mode
- Fully-Distributed Mode
Single-process mode
Only one Hadoop process runs. This deployment mode is generally used to debug programs before deploying them to a cluster.
This mode requires no configuration beyond the environment set up above; you simply execute a JAR package against it. We can use the official example to check whether the environment is OK.
Validation
$ mkdir input
$ cp etc/hadoop/*.xml input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.0.jar grep input output 'dfs[a-z.]+'
$ cat output/*   # The basic environment is OK
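With the default configuration files, the output typically contains a single match, something like:
1       dfsadmin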
Pseudo distributed
Start the Hadoop-related background services on a single machine, each in a separate process.
Configure the cluster:
- Configure etc/hadoop/core-site.xml
<!-- Configure the HDFS NameNode address -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop01:9000</value>
</property>
<!-- By default Hadoop keeps its working files in the system tmp folder, so they disappear after a restart; specify a persistent folder instead -->
<property>
    <name>hadoop.tmp.dir</name>
    <value>/data/my/tmp</value>
</property>
- For details, see core-default.xml
- Configure etc/hadoop/hdfs-site.xml
<!-- Specify the number of HDFS replicas -->
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
- For details, see hdfs-default.xml
- Format the NameNode (only before the first startup; do not format again afterwards)
$ bin/hdfs namenode -format
Formatting the NameNode generates a new cluster ID. If old DataNode data is left behind, the NameNode and DataNode cluster IDs no longer match and the cluster cannot find its past data. Therefore, before reformatting the NameNode, delete the data and log directories first. The two IDs can be compared as shown below.
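A quick way to compare the two cluster IDs, assuming hadoop.tmp.dir=/data/my/tmp as configured above (dfs/name and dfs/data are the default subdirectories):
# clusterID recorded by the NameNode
$ grep clusterID /data/my/tmp/dfs/name/current/VERSION
# clusterID recorded by the DataNode; the two must match
$ grep clusterID /data/my/tmp/dfs/data/current/VERSION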
- Start the NameNode & DataNode
$ sbin/start-dfs.sh
# They can also be started separately
$ sbin/hadoop-daemon.sh start namenode
$ sbin/hadoop-daemon.sh start datanode
- Stop
$ sbin/stop-dfs.sh
Verify HDFS
If the NameNode or DataNode fails to start, check the logs under the logs folder; a sketch follows.
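A minimal sketch of inspecting a daemon log; log file names follow the pattern hadoop-<user>-<daemon>-<hostname>.log (user justin on hadoop01 is assumed here):
$ tail -n 50 logs/hadoop-justin-namenode-hadoop01.log
$ tail -n 50 logs/hadoop-justin-datanode-hadoop01.log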
Two ways to verify:
- Run jps:
[justin@hadoop01]$ jps
4899 DataNode
5316 Jps
5096 SecondaryNameNode
4783 NameNode
- Visit http://hadoop01:50070/ in a browser
- Run the MapReduce example test directly on HDFS
# Upload the test files into a newly created input folder
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/justin        # mkdir creates only one level at a time
$ bin/hdfs dfs -mkdir /user/justin/input
$ bin/hdfs dfs -put etc/hadoop/*.xml /user/justin/input
# Run the official example MR program; the output directory must not already exist
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.0.jar grep /user/justin/input /user/justin/output 'dfs[a-z.]+'
# Check the result
$ bin/hdfs dfs -cat /user/justin/output/*
# Delete the result
$ hdfs dfs -rm -r /user/justin/output
Configure YARN
- Configure etc/hadoop/mapred-site.xml (copy it from mapred-site.xml.template)
<!-- Run MR programs on YARN -->
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
- For details, see mapred-default.xml
- Configure etc/hadoop/yarn-site.xml
<!-- How the Reducer obtains data -->
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<!-- Specify the address of the YARN ResourceManager -->
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop01</value>
</property>
- For details, see yarn-default.xml
- Start ResourceManager and NodeManager (make sure NameNode and DataNode are already started)
$ sbin/start-yarn.sh
# Or start them separately
$ sbin/yarn-daemon.sh start resourcemanager
$ sbin/yarn-daemon.sh start nodemanager
- Stop
$ sbin/stop-yarn.sh
Verify YARN
- Run jps:
[justin@hadoop01]$ jps
4899 DataNode
6150 Jps
5096 SecondaryNameNode
5897 ResourceManager
4783 NameNode
5999 NodeManager
- Visit http://hadoop01:8088/cluster in a browser
- Run the MapReduce example test on YARN (the same example as the MR run above); see the sketch after this list.
- The result can be viewed on the command line or in the browser.
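While or after the job runs, applications can also be listed from the command line; a minimal sketch:
# List running applications
$ bin/yarn application -list
# Include finished applications as well
$ bin/yarn application -list -appStates ALL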
Configure JobHistory
Now that everything is running, one problem remains:
History links cannot be opened, because the JobHistory service has not been configured yet.
- Configure etc/hadoop/mapred-site.xml
The address here can point to any machine (preferably a relatively idle one), but JobHistory must later be started on whichever machine the address names, otherwise it will fail to start!
<!-- JobHistory service address -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop01:10020</value>
</property>
<!-- JobHistory web address -->
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop01:19888</value>
</property>
<!-- Where finished job history is stored in HDFS -->
<property>
    <name>mapreduce.jobhistory.done-dir</name>
    <value>/history/done</value>
</property>
<!-- Path for intermediate files while MR jobs are running -->
<property>
    <name>mapreduce.jobhistory.intermediate-done-dir</name>
    <value>/history/done_intermediate</value>
</property>
- Restart YARN
$ sbin/stop-yarn.sh
$ sbin/start-yarn.sh
- Start the JobHistory service
$ sbin/mr-jobhistory-daemon.sh start historyserver
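If it started successfully, jps should now also show a JobHistoryServer process (the PID below is just an example):
$ jps
...
6288 JobHistoryServer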
- Visit http://hadoop01:19888/jobhistory
- Run a MapReduce job and click History to verify (if the page cannot be opened, check whether the URL domain name is correct and configure the hosts file).
Configure log aggregation
With the history server configured, clicking logs in a Job's History panel still produces an error:
Aggregation is not enabled. Try the nodemanager at hadoop01:27695
Or see application log at http://hadoop01:27695/node/application/application_1595579336183_0002
This requires us to enable log aggregation, so that we can see the details of a program run, which makes development and debugging easier.
- Configure yarn-site.xml
<!-- Enable log aggregation -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<!-- Keep logs for 7 days -->
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>
- Restart the YARN and JobHistory services. Aggregated logs can then also be fetched from the command line, as sketched below.
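Besides the web UI, aggregated logs can be pulled with the yarn CLI; a minimal sketch (the application ID is just an example):
$ bin/yarn logs -applicationId application_1595579336183_0002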
Fully distributed
For a real cluster, we will deploy Hadoop’s NameNode and DataNode on different machines.
Before configuration, you need to plan how to distribute these service nodes. In this example:
| | hadoop01 | hadoop02 | hadoop03 |
|---|---|---|---|
| HDFS | NameNode, DataNode | DataNode | SecondaryNameNode, DataNode |
| YARN | NodeManager | ResourceManager, NodeManager | NodeManager |
Try to distribute the nodes evenly rather than stacking them on one machine. By default every machine runs a DataNode and a NodeManager, which manage storage and compute respectively.
Ensure that hadoop01 can log in to the other machines over SSH. On the first connection, answer yes to "Are you sure you want to continue connecting (yes/no)?".
Because ResourceManager is configured on hadoop02, hadoop02 must also be able to SSH into the other two machines.
For convenience, just set up passwordless SSH between all three machines, as sketched below.
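A minimal sketch using ssh-copy-id, run on each of the three machines in turn (assuming the user justin and a key pair already generated in the earlier step):
$ ssh-copy-id justin@hadoop01
$ ssh-copy-id justin@hadoop02
$ ssh-copy-id justin@hadoop03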
- First configure the basic environment on hadoop01; for the specific steps, refer to the pseudo-distributed section.
- Set JAVA_HOME in hadoop-env.sh. In my tests, the other xxx-env.sh files worked without setting JAVA_HOME.
- core-site.xml
<!-- The NameNode runs on hadoop01 -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop01:9000</value>
</property>
<!-- Optional -->
<!-- hdfs-site.xml below sets the name and data directories explicitly, so this is not strictly needed -->
<property>
    <name>hadoop.tmp.dir</name>
    <value>/data/my/tmp</value>
</property>
- hdfs-site.xml
<!-- Set the HTTP address of the SecondaryNameNode -->
<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>hadoop03:50090</value>
</property>
<!-- Set the path where the NameNode stores its data -->
<property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/justin/env/hadoop-2.10.0/tmp_dfs/name</value>
</property>
<!-- Set the path where the DataNode stores its data -->
<property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/justin/env/hadoop-2.10.0/tmp_dfs/data</value>
</property>
<!-- Optional -->
<!-- Set the number of HDFS replicas (default: 3) -->
<property>
    <name>dfs.replication</name>
    <value>3</value>
</property>
- yarn-site.xml
<!-- How the Reducer obtains data: mapreduce_shuffle -->
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<!-- Specify the machine on which ResourceManager runs -->
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop02</value>
</property>
<!-- Enable log aggregation -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<!-- Keep logs for 7 days -->
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>
- mapred-site.xml
<!-- Run MapReduce on YARN -->
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
<!-- JobHistory service address -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop01:10020</value>
</property>
<!-- JobHistory web address -->
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop01:19888</value>
</property>
<!-- Where finished job history is stored in HDFS -->
<property>
    <name>mapreduce.jobhistory.done-dir</name>
    <value>/history/done</value>
</property>
<!-- Path for intermediate files while MR jobs are running -->
<property>
    <name>mapreduce.jobhistory.intermediate-done-dir</name>
    <value>/history/done_intermediate</value>
</property>
- Slaves: vim etc/hadoop/slaves. The main purpose of the slaves file is to tell the cluster scripts which nodes run Hadoop.
hadoop01
hadoop02
hadoop03
- Then copy everything to the other machines.
scp -r /home/justin/env justin@hadoop02:/home/justin/
scp -r /home/justin/env justin@hadoop03:/home/justin/
# Or use rsync for synchronization
rsync -av /home/justin/env/hadoop-2.10.0/ hadoop0X:/home/justin/env/hadoop-2.10.0/
- If the cluster is being started for the first time, format the NameNode (before formatting, empty the tmp and logs directories).
$ bin/hdfs namenode -format
- Start the cluster
Note: if NameNode and ResourceManager are not on the same machine, YARN must be started on the machine where ResourceManager resides, not on the NameNode machine.
# Because NameNode is on hadoop01 and ResourceManager is on hadoop02:
hadoop01: start-dfs.sh
hadoop02: start-yarn.sh
# If NameNode and ResourceManager were both on hadoop01, start-all.sh could be used; otherwise ResourceManager fails to start
start-all.sh
- Start the JobHistory service on the corresponding machine.
$ sbin/mr-jobhistory-daemon.sh start historyserver
Validation
The processes match the planned distribution:
[justin@hadoop01]$ jps
32176 Jps
32033 NodeManager
31064 NameNode
31208 DataNode
[justin@hadoop02]$ jps
10899 ResourceManager
11012 NodeManager
8581 DataNode
11421 Jps
[justin@hadoop03]$ jps
24960 DataNode
25953 NodeManager
25082 SecondaryNameNode
26124 Jps
Browser access:
http://hadoop01:50070/ (the NameNode address)
http://hadoop02:8088/cluster (the ResourceManager address)
http://hadoop03:50090/status.html (the SecondaryNameNode)
If the page is empty, modify line 61 of share/hadoop/hdfs/webapps/static/dfs-dust.js as follows, rsync the file to all machines and restart HDFS, then clear the browser cache and refresh.
'date_tostring' : function (v) {
//return moment(Number(v)).format('ddd MMM DD HH:mm:ss ZZ YYYY');
return new Date(Number(v)).toLocaleString();
},
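A sketch of distributing the patched file, assuming HADOOP_HOME=/home/justin/env/hadoop-2.10.0 on every machine (run from $HADOOP_HOME on hadoop01):
$ rsync -av share/hadoop/hdfs/webapps/static/dfs-dust.js hadoop02:/home/justin/env/hadoop-2.10.0/share/hadoop/hdfs/webapps/static/
$ rsync -av share/hadoop/hdfs/webapps/static/dfs-dust.js hadoop03:/home/justin/env/hadoop-2.10.0/share/hadoop/hdfs/webapps/static/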