In this deployment, two servers are used: the master, Master (64.115.5.32), and one slave, Slave0 (64.115.5.158). You can add more servers; a cluster of more than three servers is preferable.
Note: The Hadoop NameNode, SecondaryNameNode, and ResourceManager services all consume a lot of memory, so avoid placing them on the same server where possible; if resources are sufficient, putting them on one server also works. The NameNode and SecondaryNameNode have a primary/backup relationship, so it is best not to put those two on the same server.
The installation and configuration of each node are basically the same. In practice, the installation and configuration are usually completed on the master node, and the installation directory is then copied to the other nodes. Note: All operations here run Hadoop as the root user, and Hadoop depends on the Java environment, so make sure Java is configured first.
Download and configure Hadoop
Perform the following operations only on the master node. After the configuration is complete, distribute the configuration to other nodes.
Download the Hadoop installation package
Hadoop’s official website is hadoop.apache.org/. Here I use Hadoop 2.7.5; the download address is archive.apache.org/dist/hadoop…
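If the server has direct internet access, you could also download the tarball directly on the master with wget. The exact URL below is an assumption based on the Apache archive's usual layout for version 2.7.5:
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.5/hadoop-2.7.5.tar.gz -P /opt/hadoop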
Decompress the Hadoop installation package
Upload the downloaded hadoop-2.7.5.tar.gz file to the /opt/hadoop directory on the server. After uploading, execute the following code on the master host:
cd /opt/hadoop
Go to the /opt/hadoop directory and run the decompress command.
tar -zxvf hadoop-2.7.5.tar.gz
After the decompression is successful, the system automatically creates a hadoop-2.7.5 subdirectory in the Hadoop directory.
For convenience, rename the hadoop-2.7.5 folder to hadoop; this is the Hadoop installation directory. Run the command to rename the folder:
mv hadoop-2.7.5 hadoop
Enter the installation directory and check the installed files. If the expected Hadoop file list (bin, etc, sbin, share, and so on) is displayed, decompression succeeded.
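For example, a quick check (assuming the installation directory from the rename above):
cd /opt/hadoop/hadoop
ls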
Configuration File Introduction
Hadoop's main configuration files are core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml. These four files hold the configuration parameters of the different components, as shown in the following table:
Configuration file name | Configuration object | Main content |
---|---|---|
core-site.xml | Cluster-wide parameters | Defines system-level parameters, such as the HDFS URL and the Hadoop temporary directory |
hdfs-site.xml | HDFS parameters | For example, storage locations of name nodes and data nodes, the number of file replicas, and file access permissions |
mapred-site.xml | MapReduce parameters | Includes JobHistory Server and application parameters, such as the default number of reduce tasks and the default upper and lower limits of memory a job can use |
yarn-site.xml | Cluster resource management system parameters | Configures ResourceManager and NodeManager communication ports and web monitoring ports |
Important parameters for configuring the cluster
This section covers the important parameters specified in each configuration file. Among the four configuration files, the most important parameters and their explanations are as follows:
1. core-site.xml
Parameter name | Default value | Description |
---|---|---|
fs.defaultFS | file:/// | Host and port of the default file system |
io.file.buffer.size | 4096 | Buffer size for stream file I/O |
hadoop.tmp.dir | /tmp/hadoop-${user.name} | Temporary directory |
2. hdfs-site.xml
Parameter name | Default value | Description |
---|---|---|
dfs.namenode.secondary.http-address | 0.0.0.0:50090 | HTTP address and port of the SecondaryNameNode |
dfs.namenode.name.dir | file://${hadoop.tmp.dir}/dfs/name | Location on the local file system where the DFS name node stores its data |
dfs.datanode.data.dir | file://${hadoop.tmp.dir}/dfs/data | Location on the local file system where a DFS data node stores its blocks |
dfs.replication | 3 | Default number of block replicas |
dfs.webhdfs.enabled | true | Whether HDFS files can be read over HTTP (WebHDFS); enabling it weakens cluster security |
3. mapred-site.xml
Parameter name | Default value | Description |
---|---|---|
mapreduce.framework.name | local | Takes the value local, classic, or yarn; if it is not set to yarn, the YARN cluster is not used to allocate resources |
mapreduce.jobhistory.address | 0.0.0.0:10020 | IP address and port of the JobHistory server, used to view records of MapReduce jobs that have already run |
mapreduce.jobhistory.webapp.address | 0.0.0.0:19888 | Address and port of the JobHistory server web application |
4. yarn-site.xml
Parameter name | Default value | Description |
---|---|---|
yarn.resourcemanager.address | 0.0.0.0:8032 | Address the ResourceManager exposes to clients; clients use it to submit applications to the RM and to kill applications |
yarn.resourcemanager.scheduler.address | 0.0.0.0:8030 | Address the ResourceManager exposes to ApplicationMasters; an ApplicationMaster uses it to request and release resources from the RM |
yarn.resourcemanager.resource-tracker.address | 0.0.0.0:8031 | Address the ResourceManager exposes to NodeManagers; NodeManagers use it to report heartbeats and receive tasks from the RM |
yarn.resourcemanager.admin.address | 0.0.0.0:8033 | Address the ResourceManager exposes to administrators; administrators use it to send management commands to the RM |
yarn.resourcemanager.webapp.address | 0.0.0.0:8088 | Address of the ResourceManager web UI; open it in a browser to view cluster information |
yarn.nodemanager.aux-services | (empty) | Lets users define auxiliary services on the NodeManager; for example, the MapReduce shuffle is implemented this way, so users can extend their own services on the NodeManager |
Configuring the env File
To configure the JDK path, edit hadoop-env.sh with the following command:
vi /opt/hadoop/hadoop/etc/hadoop/hadoop-env.sh
Find the line that contains "export JAVA_HOME" and set it to the JDK path.
Change the value to export JAVA_HOME=/usr/local/java/jdk1.8.0_207/
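If you prefer a non-interactive edit, a sed substitution like the following would also work (a sketch; the JDK path is the one assumed above):
sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/usr/local/java/jdk1.8.0_207/|' /opt/hadoop/hadoop/etc/hadoop/hadoop-env.sh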
Configure the core component file
The Hadoop core component file is core-site.xml, located in the /opt/hadoop/hadoop/etc/hadoop subdirectory. Edit core-site.xml with vi and place the configuration code below between the <configuration> and </configuration> tags of the file.
Execute the command to edit the core-site.xml file:
vi /opt/hadoop/hadoop/etc/hadoop/core-site.xml
Code to add between <configuration> and </configuration>:
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
    <description>Define the default file system host and port</description>
</property>
<property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/hadoop/hadoopdata</value>
    <description>A base for other temporary directories.</description>
</property>
Create the Hadoop temporary directory and the HDFS name/data directories (the HDFS paths below must match dfs.namenode.name.dir and dfs.datanode.data.dir configured in hdfs-site.xml in the next step):
mkdir /opt/hadoop/hadoopdata
mkdir -p /opt/hadoop/hadoop/hdfs/name
mkdir -p /opt/hadoop/hadoop/hdfs/data
After editing, exit and save!
Configuring the File System
The Hadoop file system configuration file is hdfs-site.xml, located in the /opt/hadoop/hadoop/etc/hadoop subdirectory. Edit the file with vi and place the following code between the <configuration> and </configuration> tags of the file.
Run the following command to edit the hdfs-site.xml file:
vi /opt/hadoop/hadoop/etc/hadoop/hdfs-site.xml
Code to add between <configuration> and </configuration>:
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/opt/hadoop/hadoop/hdfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/opt/hadoop/hadoop/hdfs/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
After editing, exit and save!
Configure the yarn-site.xml file
The YARN site configuration file is yarn-site.xml, located in the /opt/hadoop/hadoop/etc/hadoop subdirectory. Again use vi to edit the file and add the following code between the <configuration> and </configuration> tags of the file.
Run the command to edit the yarn-site.xml file:
vi /opt/hadoop/hadoop/etc/hadoop/yarn-site.xml
Code to add between <configuration> and </configuration>:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>master:18040</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>master:18030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>master:18025</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>master:18141</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>master:18088</value>
</property>
Configure the MapReduce computing framework file
The /opt/hadoop/hadoop/etc/hadoop subdirectory already contains a mapred-site.xml.template file. We need to copy it and rename the copy to mapred-site.xml, keeping it in the same location.
Execute copy and rename operation commands:
cp /opt/hadoop/hadoop/etc/hadoop/mapred-site.xml.template /opt/hadoop/hadoop/etc/hadoop/mapred-site.xml
Then edit the mapred-site.xml file with vi and add the following code between the <configuration> and </configuration> tags of the file.
Execute command:
vi /opt/hadoop/hadoop/etc/hadoop/mapred-site.xml
Code to add between <configuration> and </configuration>:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
After editing, save and exit!
Configure the slaves file on the master
The slaves file lists the slave nodes of the Hadoop cluster. This file is very important: when Hadoop starts, it always starts the cluster according to the slave node names in the current slaves file, and slave nodes not in the list are not treated as compute nodes.
Execute the command to edit the slaves file:
vi /opt/hadoop/hadoop/etc/hadoop/slaves
Note: Edit the slaves file according to the actual layout of the cluster you are building. For example, I have only slave0.
So the following code should be added:
slave0
Note: Remove the original localhost line in the slaves file!
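If your cluster has more slave nodes, simply list one host name per line; for example, with three slaves (host names hypothetical) the file would contain:
slave0
slave1
slave2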
Configure domain name resolution (for both master and slave)
vi /etc/hosts
Add the following two lines:
64.115.5.32 master
64.115.5.158 slave0
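You can verify that the names resolve correctly, for example:
ping -c 1 master
ping -c 1 slave0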
Copy Hadoop from master to slave
Since I only have slave0 here, I only need to copy once. If there are many machines, you can script the copy instead; see the sketch after the command below.
Copy command:
scp -r /opt/hadoop root@slave0:/opt
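If there are many slave machines, one option is to loop over a host list instead of copying by hand. A minimal sketch, assuming the slave host names are listed one per line in the slaves file edited above:
# copy the Hadoop installation to every host listed in the slaves file
for host in $(cat /opt/hadoop/hadoop/etc/hadoop/slaves); do
  scp -r /opt/hadoop root@"$host":/opt
done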
Hadoop cluster startup
Configure operating system environment variables (required on both master and slave)
Then use vi to edit /etc/profile:
vi /etc/profile
Finally, append the following code to the end of the file:
# HADOOP
export HADOOP_HOME=/opt/hadoop/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
After saving the configuration and exiting, run the following command:
source /etc/profile
The source /etc/profile command enables the configuration to take effect
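To verify that the variables took effect, you can check the Hadoop version:
hadoop version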
Tip: Use the same configuration method as above in Slave0.
Format file system (do at master)
Run the command to format the HDFS file system (this only needs to be done once, before the first start; hadoop namenode -format still works in Hadoop 2.x, but hdfs namenode -format is the preferred form):
hadoop namenode -format
Start and shut down the Hadoop cluster (do it at Master)
First go to the installation home directory and run the following command:
cd /opt/hadoop/hadoop/sbin
Then the startup command is:
start-all.sh
To shut down the Hadoop cluster, run the following command:
stop-all.sh
The next time Hadoop is started, you do not need to format the NameNode again; just run start-dfs.sh and then start-yarn.sh to start YARN.
In fact, Hadoop has deprecated start-all.sh and stop-all.sh in favor of start-dfs.sh/stop-dfs.sh and start-yarn.sh/stop-yarn.sh.
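A sketch of the recommended start sequence (the JobHistory server line is optional and assumes the JobHistory settings are left at their defaults):
start-dfs.sh      # starts NameNode, SecondaryNameNode, and DataNodes
start-yarn.sh     # starts ResourceManager and NodeManagers
mr-jobhistory-daemon.sh start historyserver   # optional: start the JobHistory server
To shut down, run the corresponding stop-yarn.sh and stop-dfs.sh.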
Verify that the Hadoop cluster is started successfully
You can run the jps command in the terminal to check whether Hadoop started successfully.
On the master node, execute:
jps
If the SecondaryNameNode, ResourceManager, Jps, and NameNode processes are displayed, the master node is successfully started.
Then run the jps command on the slave0 node:
If NodeManager, Jps, and DataNode are displayed, the slave node (Slave0) is successfully started.
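For example, the output on slave0 might look roughly like this (the process IDs will differ):
2609 DataNode
2722 NodeManager
2935 Jps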
After the server firewall is disabled, open the web UIs in a browser: the NameNode UI is at http://master:50070 by default, and the ResourceManager UI is at http://master:18088 as configured above. If the pages are displayed, the deployment succeeded.
More information about configuration parameters
The Hadoop configuration discussed above covers only the important settings; there are many other configuration options, which you can look up on Hadoop's official website at the following addresses: hadoop.apache.org/docs/curren… hadoop.apache.org/docs/curren… hadoop.apache.org/docs/curren… hadoop.apache.org/docs/curren…
These pages list the latest Hadoop configuration options, including deprecated definitions, which helps with maintaining a Hadoop cluster.
Conclusion
This article deploys Hadoop 2.7.5 rather than the latest Hadoop 3.x, and many configuration items are not covered, such as the JobHistory server address. If you need a more professional and detailed deployment process, see the official documentation or the Silicon Valley Big Data Hadoop 3.x course on Bilibili.
Finally, thank my girlfriend for her tolerance, understanding and support in work and life!