Preparation
Dependencies
Hadoop depends on a Java environment, and different Hadoop versions support different Java versions:
| Hadoop version | Java version |
|---|---|
| 2.7.x – 2.x | Java 7, Java 8 |
| 3.0 – 3.2 | Java 8 |
| 3.3 | Java 8, Java 11 |
The Apache Hadoop community uses OpenJDK for its build/test/release environment. Other JDKs/JVMs should work fine, but OpenJDK is the safest choice.
Hadoop Version selection
As of this writing, the latest version of Hadoop is 3.3.0, but as the release announcements show, 3.1.3 is the latest stable version. Choosing a stable release is recommended to avoid pitfalls.
This article also uses version 3.1.3 to build the Hadoop environment!
Java installation
Since Hadoop 3.1.3 is used here, the matching OpenJDK version is Java 8.
How do you install and configure the Java environment on Linux? See Installing and configuring the Java environment for Linux.
Installing Other Components
Install pdsh

```shell
sudo yum install -y pdsh
```
Installation and deployment
Download

```shell
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.1.3/hadoop-3.1.3.tar.gz
```
Unpack

```shell
tar -zxvf hadoop-3.1.3.tar.gz
```
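Apache publishes a `.sha512` checksum file alongside each release artifact, and it is worth verifying the download before unpacking. A sketch using a stand-in file (for the real tarball, fetch `hadoop-3.1.3.tar.gz.sha512` from the same mirror directory; depending on the checksum file's format you may need `shasum -a 512 -c` instead):

```shell
# Stand-in demonstration of checksum verification; replace the echo'd file
# with the real hadoop-3.1.3.tar.gz and the generated .sha512 with the
# one published on the Apache mirror.
cd "$(mktemp -d)"
echo 'stand-in for the release tarball' > hadoop-3.1.3.tar.gz
sha512sum hadoop-3.1.3.tar.gz > hadoop-3.1.3.tar.gz.sha512
sha512sum -c hadoop-3.1.3.tar.gz.sha512   # prints "hadoop-3.1.3.tar.gz: OK"
```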
Hadoop installation and deployment modes
Hadoop can be installed and deployed in several modes:
- Single-machine (local) mode: no distributed file system and no daemons; data is read from the local file system. Mainly used for developing and debugging the application logic of MapReduce programs.
- Pseudo-distributed mode: Hadoop is installed on one node with a distributed file system. Functionally the same as fully distributed, but with poor performance; usually used for personal testing.
- Fully distributed mode: a Hadoop cluster of multiple machines in a master-slave architecture. Rarely used in production in this basic form, because the single NameNode is a single point of failure.
- High-availability (HA) mode: solves the fully distributed single point of failure. There are multiple NameNodes, but only one is active; the others are standbys. When the active NameNode fails, the cluster automatically fails over to a standby. This is the setup most often used in production, but it has a drawback: only one NameNode serves requests at a time, so as data and cluster size grow, that NameNode comes under increasing pressure.
- Federated mode: for large-scale clusters. Multiple NameNodes serve simultaneously, each maintaining only part of the file system metadata.
Local mode
Local mode is mainly used for running and debugging during local development. After downloading and unpacking Hadoop, it runs in local mode by default with no configuration. In local mode, all modules run in a single JVM process and use the local file system rather than HDFS.
To verify that local mode works, we can run Hadoop's built-in word-count example.
1. Prepare a text file to analyze, with arbitrary content

```shell
vi ~/test.txt
```

2. Run Hadoop's MapReduce wordcount demo

```shell
~/hadoop-3.1.3/bin/hadoop jar ~/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount ~/test.txt ~/test
```
~/test is the output directory for the results. Do not create it yourself; the job creates it.
If your machine does not have enough memory, you may also hit an OOM error.
Solution: lower Hadoop's maximum heap size.
```shell
vi ~/hadoop-3.1.3/etc/hadoop/hadoop-env.sh
```

Find the line `export HADOOP_HEAPSIZE=` and add below it:

```shell
export HADOOP_HEAPSIZE=512
```

(My machine is small, so only 512 MB is allocated.)
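The edit can also be scripted idempotently. A sketch run here against a temporary stand-in file; the real target would be `~/hadoop-3.1.3/etc/hadoop/hadoop-env.sh`:

```shell
# Append the heap setting only if it is not already set.
f="$(mktemp)"                                  # stand-in for hadoop-env.sh
printf '# export HADOOP_HEAPSIZE=\n' > "$f"    # the commented default line
grep -q '^export HADOOP_HEAPSIZE=' "$f" || \
  echo 'export HADOOP_HEAPSIZE=512' >> "$f"    # 512 MB max heap
tail -n 1 "$f"                                 # prints: export HADOOP_HEAPSIZE=512
```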
Run the preceding command again. If the job ID contains local, the job is run in local mode.
View the output files
In local mode, MapReduce output is written to the local file system. If a `_SUCCESS` file exists in the output directory, the job ran successfully; `part-r-00000` is the result file.
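For reference, wordcount emits tab-separated `word<TAB>count` pairs sorted by word. A rough coreutils approximation of what the job computes (no Hadoop needed; the file path is arbitrary):

```shell
# Emulate wordcount on a two-line sample file with plain coreutils.
printf 'hello world\nhello hadoop\n' > /tmp/wc-demo.txt
tr -s ' ' '\n' < /tmp/wc-demo.txt | sort | uniq -c | awk '{print $2 "\t" $1}'
# prints:
# hadoop	1
# hello	2
# world	1
```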
Pseudo-distributed mode
Pseudo-distributed mode simulates a distributed environment on a single machine and has all the functions of Hadoop.
Five files need to be configured:
- ~/hadoop-3.1.3/etc/hadoop/hadoop-env.sh
- ~/hadoop-3.1.3/etc/hadoop/core-site.xml
- ~/hadoop-3.1.3/etc/hadoop/hdfs-site.xml
- ~/hadoop-3.1.3/etc/hadoop/mapred-site.xml
- ~/hadoop-3.1.3/etc/hadoop/yarn-site.xml
Configure hadoop-env.sh
First, look up the JAVA_HOME path:

```shell
echo $JAVA_HOME
```

Configure the JAVA_HOME path:

```shell
vi ~/hadoop-3.1.3/etc/hadoop/hadoop-env.sh
```

Find the `export JAVA_HOME` line and add below it:

```shell
export JAVA_HOME=<your JAVA_HOME path>
```
Configure core-site.xml

```shell
vi ~/hadoop-3.1.3/etc/hadoop/core-site.xml
```

Configure the default HDFS access port and the temporary data directory:

```xml
<configuration>
  <property>
    <!-- Temporary directory where data is stored. The path must be absolute; do not use the ~ symbol -->
    <name>hadoop.tmp.dir</name>
    <value>/root/hadoop-3.1.3/hdfs/tmp</value>
  </property>
  <property>
    <!-- HDFS access port -->
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```
`hadoop.tmp.dir` is Hadoop's temporary directory; the HDFS NameNode data is stored under it. By default, `hadoop.tmp.dir` points to `/tmp/hadoop-${user.name}`, in which case a reboot of the operating system clears everything under `/tmp`.
Note that the temporary directory `~/hadoop-3.1.3/hdfs/tmp` in this example must be created in advance:

```shell
mkdir -p ~/hadoop-3.1.3/hdfs/tmp
tree -C -fp ~/hadoop-3.1.3/hdfs/
```
Configure hdfs-site.xml

```shell
vi ~/hadoop-3.1.3/etc/hadoop/hdfs-site.xml
```

Configure the replication factor:

```xml
<configuration>
  <property>
    <!-- Set the replication factor to 1, that is, no replication -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```
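A typo in any of these files makes the daemons fail at startup, so a quick well-formedness check after editing can save time. A sketch assuming `python3` is available (file path here is a temp sample, not the real config):

```shell
# Write a sample config and verify it parses as XML, echoing each property.
cat > /tmp/hdfs-site-sample.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
python3 - <<'EOF'
import xml.etree.ElementTree as ET
root = ET.parse('/tmp/hdfs-site-sample.xml').getroot()  # raises on malformed XML
for p in root.findall('property'):
    print(p.findtext('name'), '=', p.findtext('value'))
EOF
# prints: dfs.replication = 1
```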
Formatting HDFS
Formatting initializes the distributed file system: it creates a new namespace and generates the initial metadata, which is stored on the NameNode.
When formatting, pay attention to the permissions on the hadoop.tmp.dir directory.

```shell
~/hadoop-3.1.3/bin/hdfs namenode -format
```
If "has been successfully formatted" is displayed, formatting succeeded.
After formatting, check that a `dfs` directory exists under the directory specified by `hadoop.tmp.dir` in core-site.xml:

```shell
tree -C -pf ~/hadoop-3.1.3/hdfs/tmp/
```
- `fsimage_*` files are checkpoints of the NameNode metadata persisted to local disk.
- `fsimage_*.md5` files are checksums used to verify the integrity of the corresponding fsimage.
- `seen_txid` records the latest transaction ID.
- `VERSION` contains, among other fields:
  - `namespaceID`: the unique ID of the NameNode
  - `clusterID`: the cluster ID; the NameNode and DataNodes must share the same cluster ID

```shell
cat /root/hadoop-3.1.3/hdfs/tmp/dfs/name/current/VERSION
```
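VERSION is a plain key=value file, so individual fields are easy to extract. A sketch against a sample file with hypothetical values (the real IDs will differ):

```shell
# Sample VERSION file; the IDs below are made up for illustration.
cat > /tmp/VERSION-sample <<'EOF'
namespaceID=1452417672
clusterID=CID-example-0001
storageType=NAME_NODE
EOF
# Pull out a single field, e.g. the cluster ID.
awk -F= '$1 == "clusterID" { print $2 }' /tmp/VERSION-sample
# prints: CID-example-0001
```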
Start the cluster.
We have configured hadoop-env.sh, core-site.xml, and hdfs-site.xml, and initialized HDFS. Now we are ready to start HDFS.
A complete HDFS deployment has the following components:
- A NameNode
- A SecondaryNameNode (here running on the same machine as the NameNode)
- N DataNodes
Start the NameNode

```shell
~/hadoop-3.1.3/bin/hdfs --daemon start namenode
```

Start the SecondaryNameNode

```shell
~/hadoop-3.1.3/bin/hdfs --daemon start secondarynamenode
```

Start the DataNode

```shell
~/hadoop-3.1.3/bin/hdfs --daemon start datanode
```
Check the startup status

```shell
jps
```
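`jps` should list NameNode, DataNode, and SecondaryNameNode. A small sketch that checks a captured `jps` output for all three (the output below is a hard-coded sample; on a live machine use `jps_out="$(jps)"`):

```shell
# Sample jps output; process IDs are made up for illustration.
jps_out='12001 NameNode
12102 DataNode
12203 SecondaryNameNode
12304 Jps'
for d in NameNode DataNode SecondaryNameNode; do
  # Anchor the match so "NameNode" does not also match "SecondaryNameNode".
  if printf '%s\n' "$jps_out" | grep -q " ${d}\$"; then
    echo "$d: running"
  else
    echo "$d: NOT running"
  fi
done
```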
Create directories, upload files, and download files on HDFS
Create a directory on HDFS:

```shell
~/hadoop-3.1.3/bin/hdfs dfs -mkdir /demo
```

Upload a local file to HDFS:

```shell
~/hadoop-3.1.3/bin/hdfs dfs -put /root/test.txt /demo
```

/root/test.txt is the file to upload; /demo is the directory created in HDFS.
View the file in HDFS:

```shell
~/hadoop-3.1.3/bin/hdfs dfs -cat /demo/test.txt
```

Download an HDFS file to the local machine:

```shell
~/hadoop-3.1.3/bin/hdfs dfs -get /demo/test.txt /root/test
```

/demo/test.txt is the file to download from HDFS; /root/test is a local directory that must be created in advance.
Configure mapred-site.xml
If there is no mapred-site.xml file, check whether the configuration template mapred-site.xml.template exists, and copy the template to create mapred-site.xml.

```shell
vi ~/hadoop-3.1.3/etc/hadoop/mapred-site.xml
```

Set the framework MapReduce uses; YARN is used here:

```xml
<configuration>
  <property>
    <!-- Set MapReduce to use the YARN framework -->
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```
Configure yarn-site.xml

```shell
vi ~/hadoop-3.1.3/etc/hadoop/yarn-site.xml
```

Set the YARN auxiliary shuffle service to the default MapReduce shuffle:

```xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```
Start YARN
We have configured hadoop-env.sh, core-site.xml, mapred-site.xml, and yarn-site.xml. Now we only need to start YARN.
The YARN system consists of the ResourceManager and the NodeManager.
Start the ResourceManager

```shell
~/hadoop-3.1.3/bin/yarn --daemon start resourcemanager
```

In YARN, the ResourceManager manages and allocates all resources in the cluster. It receives resource reports from each node's NodeManager and allocates resources to applications according to configured policies.
Start the NodeManager

```shell
~/hadoop-3.1.3/bin/yarn --daemon start nodemanager
```

Check the startup status

```shell
jps
```

The YARN web UI listens on port 8088; you can view it at http://localhost:8088.
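Besides the web UI, the ResourceManager exposes a REST API on the same port; `GET /ws/v1/cluster/info` returns cluster status as JSON. A sketch that parses an abbreviated sample response (on a live cluster, replace the hard-coded string with `info="$(curl -s http://localhost:8088/ws/v1/cluster/info)"`):

```shell
# Abbreviated sample of the /ws/v1/cluster/info response, for illustration.
info='{"clusterInfo":{"id":1,"state":"STARTED","haState":"ACTIVE"}}'
# Extract the cluster state with python3 (assumed available).
printf '%s' "$info" | python3 -c \
  'import sys, json; print(json.load(sys.stdin)["clusterInfo"]["state"])'
# prints: STARTED
```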
Running a MapReduce job
Hadoop's share directory ships a jar with small MapReduce examples, located at hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar. You can run these examples to try out the newly built Hadoop platform. Here we run the classic WordCount example as a test.
Check the startup status of all services and make sure everything is running.

```shell
~/hadoop-3.1.3/bin/yarn jar ~/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /demo/test.txt /demo/output
```

/demo/test.txt is a file in HDFS; if it does not exist, prepare it in advance. /demo/output is the output directory in HDFS.
If /demo/output already exists, delete it first:

```shell
hadoop fs -rm -r /demo/output
```
If the error "Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster" appears, you need to configure the Hadoop classpath in yarn-site.xml.
Check the Hadoop classpath:

```shell
~/hadoop-3.1.3/bin/hadoop classpath
```

Configure the Hadoop classpath:

```shell
vi ~/hadoop-3.1.3/etc/hadoop/yarn-site.xml
```

```xml
<configuration>
  <property>
    <name>yarn.application.classpath</name>
    <value>(paste the output of the hadoop classpath command here)</value>
  </property>
</configuration>
```
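Pasting the classpath by hand is error-prone, so generating the snippet can be scripted. A sketch using a sample classpath value (substitute `cp_val="$(~/hadoop-3.1.3/bin/hadoop classpath)"` on a real machine; the paths below are hypothetical):

```shell
# Sample classpath; on a live install, capture it from `hadoop classpath`.
cp_val='/opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/*'
# Generate the yarn-site.xml property fragment with the value filled in.
cat > /tmp/yarn-classpath-snippet.xml <<EOF
<property>
  <name>yarn.application.classpath</name>
  <value>${cp_val}</value>
</property>
EOF
grep '<value>' /tmp/yarn-classpath-snippet.xml
```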
Try running the WordCount example again:

```shell
~/hadoop-3.1.3/bin/yarn jar ~/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /demo/test.txt /demo/output
```

No error this time. Check the output directory:

```shell
~/hadoop-3.1.3/bin/hdfs dfs -ls /demo/output
```
Fully distributed mode
Pseudo-distributed mode can be built on a single VPS, while fully distributed mode requires multiple VPSes (at least three). For details, see my other article on building a fully distributed Hadoop cluster.