Objectives

  • Learn how to install CentOS 7 in VMware
  • Learn how to build a Hadoop cluster
  • Learn how to install the related software
  • Know how to handle common problems

1 Install CentOS7 on VMware

1.1 Install VMware 15

Follow the public account EZ Big Data and reply "VM" to get the VMware 15 installation package and activation key.

  • Background: the difference between Bridged, NAT, and Host-only modes
    • Bridged mode: suited to office/LAN environments and commonly used in production. The VM appears as an independent host on the LAN and can reach any machine on the network, but you have to configure its IP address manually.
    • NAT (Network Address Translation) mode: suited to a home environment with no router. The VM can reach the Internet through address translation performed by the host, and it sits on the VMnet8 subnet.
    • Host-only mode: the VM cannot reach the Internet and can only talk to the host; it sits on the VMnet1 subnet.

1.2 Install CentOS 7

  • Install the system

    Reference: blog.51cto.com/13880683/21…

  • Set a static IP address

    Edit the ifcfg-ens33 file in /etc/sysconfig/network-scripts/, then run service network restart to restart the network. A quick verification is shown after the config below.

    The key entries are as follows:

    BOOTPROTO=static
    ONBOOT=yes
    IPADDR=192.168.xxx.200
    NETMASK=255.255.255.0
    GATEWAY=192.168.xxx.2
    DNS1=114.114.114.114
    DNS2=8.8.8.8
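    To confirm the static IP took effect after the restart, a minimal check, assuming the interface is ens33 and the IPADDR/GATEWAY/DNS values above:

    ip addr show ens33        # should show the IPADDR configured above
    ip route                  # the default route should point at the GATEWAY
    ping -c 3 www.baidu.com   # confirms DNS resolution and outbound connectivity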
  • Disable the firewall

    firewall-cmd --state                   # show firewall status (not running when disabled, running when enabled)
    systemctl stop firewalld.service       # stop the firewall
    systemctl disable firewalld.service    # stop the firewall from starting on boot
  • Set up shared folders (uploading over FTP also works; EZ prefers shared folders)

    Reference: www.cnblogs.com/skyheaving/…

  • Network Exception

    Failed to start LSB: Bring up/down networking.
    
    # Solution: disable NetworkManager
    systemctl stop NetworkManager
    systemctl disable NetworkManager

2 Hadoop Cluster Construction (Fully Distributed)

Hadoop has three operating modes: local, pseudo-distributed, and fully distributed.

Note: this article covers the fully distributed installation. The configuration below requires three machines in VMware: master, slave1, and slave2. On each of them, turn off the firewall, set a static IP, and change the hostname.

Set each VM's hardware configuration (memory, disk capacity) according to what your physical machine can spare.

2.1 Install Hadoop 2.7.7

Official documentation: hadoop.apache.org/docs/r2.7.7…

Download: archive.apache.org/dist/hadoop…

Installation reference: www.cnblogs.com/thousfeet/p…

  • Uninstall the Java that ships with the system

    java -version            # check the current Java version
    rpm -qa | grep jdk       # list the installed JDK packages (.noarch included)
    rpm -e --nodeps xxx      # delete each listed package; xxx is the package name
  • Change the hostname

    # change the hostname
    hostnamectl set-hostname xxx
    
    # establish the mapping between master, slave1, and slave2:
    # add the IP and hostname of each node (an example follows this block)
    vim /etc/hosts
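    A minimal /etc/hosts sketch. The master address matches the static IP set earlier and the slave1 address matches the scp example later in this article; the slave2 address is a placeholder to adjust to your own subnet:

    192.168.xxx.200 master
    192.168.xxx.201 slave1
    192.168.xxx.202 slave2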
  • Add environment variables (example entries follow this block)

    # takes effect for the current user only
    vim ~/.bash_profile
    
    # takes effect for all users
    vim /etc/profile
    
    # make the changes take effect
    source ~/.bash_profile
    # or
    source /etc/profile
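    For reference, a sketch of the entries to append. JAVA_HOME matches the path used in hadoop-env.sh later in this article; HADOOP_HOME and the PATH line are common additions not shown in the original configs, so adjust them to your own layout:

    export JAVA_HOME=/home/amos/SoftWare/jdk1.8.0_251
    export HADOOP_HOME=/home/amos/SoftWare/hadoop-2.7.7
    export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin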
  • Passwordless SSH login

    The steps for passwordless login are as follows (a shortcut using ssh-copy-id is sketched after this list):

    1. Generate a public/private key pair
    ssh-keygen -t rsa
    2. Create the authorized_keys file and change its permissions to 600
    cd ~/.ssh
    touch authorized_keys
    chmod 600 authorized_keys
    3. Append the public keys to the authorized_keys file
    # append the public keys of master, slave1, and slave2 to authorized_keys
    cat id_rsa.pub >> authorized_keys
    # then verify the login: ssh master / ssh slave1 / ssh slave2
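    As a shortcut, ssh-copy-id performs steps 2-3 against a remote machine in one go: it appends the local public key to the remote authorized_keys and sets the permissions. A sketch, run on each of the three nodes:

    ssh-copy-id master
    ssh-copy-id slave1
    ssh-copy-id slave2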
  • Important directories

    • bin: scripts for operating the Hadoop services (HDFS, YARN)
    • etc: directory holding the Hadoop configuration files
    • lib: Hadoop native libraries (used to compress and decompress data)
    • sbin: scripts for starting and stopping the Hadoop services
    • share: Hadoop dependency JARs, documentation, and official examples

2.2 Configure Hadoop

  • Modify the configuration

    1. Modify core-site.xml to configure the tmp directory

      <configuration>
        <property>
          <name>fs.defaultFS</name>
          <value>hdfs://master:9000</value>
          <description>HDFS URI, in the form hdfs://namenode-host:port</description>
        </property>
        <property>
          <name>hadoop.tmp.dir</name>
          <value>/home/amos/SoftWare/hadoop-2.7.7/hdfs/tmp</value>
          <description>Local temporary folder for Hadoop on the namenode</description>
        </property>
        <property>
          <name>hadoop.proxyuser.root.hosts</name>
          <value>*</value>
        </property>
        <property>
          <name>hadoop.proxyuser.root.groups</name>
          <value>*</value>
        </property>
      </configuration>
    2. Modify hadoop-env.sh to configure JAVA_HOME

      export JAVA_HOME=/home/amos/SoftWare/jdk1.8.0_251
    3. Modify hdfs-site.xml to set the namenode (hdfs/name) and datanode (hdfs/data) directories

      <configuration>
        <property>
          <name>dfs.namenode.secondary.http-address</name>
          <value>master:9001</value>
        </property>
        <property>
          <name>dfs.namenode.name.dir</name>
          <value>/home/amos/SoftWare/hadoop-2.7.7/hdfs/name</value>
          <description>Where the namenode stores the HDFS namespace metadata</description>
        </property>
        <property>
          <name>dfs.datanode.data.dir</name>
          <value>/home/amos/SoftWare/hadoop-2.7.7/hdfs/data</value>
          <description>Physical storage location of data blocks on the datanode</description>
        </property>
        <property>
          <name>dfs.replication</name>
          <value>3</value>
          <description>Number of replicas</description>
        </property>
      </configuration>
    4. Modify mapred-site.xml to set YARN as the MapReduce framework

      <configuration>
        <property>
          <name>mapreduce.framework.name</name>
          <value>yarn</value>
          <description>Run MapReduce on YARN (Hadoop 1 did not use YARN)</description>
        </property>
        <!-- Hadoop history server -->
        <property>
          <name>mapreduce.jobhistory.address</name>
          <value>master:10020</value>
          <description>Address of the MR JobHistory Server that manages the job logs</description>
        </property>
        <property>
          <name>mapreduce.jobhistory.webapp.address</name>
          <value>master:19888</value>
          <description>Web address for viewing completed MapReduce jobs on the history server; the service must be started separately</description>
        </property>
        <property>
          <name>mapreduce.jobhistory.done-dir</name>
          <value>/mr-history/done</value>
          <description>Directory for logs managed by the MR JobHistory Server; default: /mr-history/done</description>
        </property>
        <property>
          <name>mapreduce.jobhistory.intermediate-done-dir</name>
          <value>/mr-history/tmp</value>
          <description>Directory for in-progress MapReduce job logs; default: /mr-history/tmp</description>
        </property>
        <property>
          <name>yarn.app.mapreduce.am.staging-dir</name>
          <value>/mr-history/hadoop-yarn/</value>
          <description>Staging directory for the application ID, required JAR files, etc.</description>
        </property>
        <property>
          <name>mapreduce.map.memory.mb</name>
          <value>2048</value>
          <description>Physical memory limit for each map task</description>
        </property>
        <property>
          <name>mapreduce.reduce.memory.mb</name>
          <value>2048</value>
          <description>Physical memory limit for each reduce task</description>
        </property>
      </configuration>
    5. Modify the slaves file to configure the slave nodes

      slave1
      slave2
    6. Modify yarn-site.xml to configure the ResourceManager ports

      <configuration>
        <!-- Enable log aggregation -->
        <property>
          <name>yarn.log-aggregation-enable</name>
          <value>true</value>
        </property>
        <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>mapreduce_shuffle</value>
        </property>
        <property>
          <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
          <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
        <property>
          <name>yarn.resourcemanager.address</name>
          <value>master:8032</value>
        </property>
        <property>
          <name>yarn.resourcemanager.scheduler.address</name>
          <value>master:8030</value>
        </property>
        <property>
          <name>yarn.resourcemanager.resource-tracker.address</name>
          <value>master:8031</value>
        </property>
        <property>
          <name>yarn.resourcemanager.admin.address</name>
          <value>master:8033</value>
        </property>
        <property>
          <name>yarn.resourcemanager.webapp.address</name>
          <value>master:8088</value>
        </property>
        <property>
          <name>yarn.nodemanager.pmem-check-enabled</name>
          <value>false</value>
        </property>
        <property>
          <name>yarn.nodemanager.vmem-check-enabled</name>
          <value>false</value>
        </property>
        <property>
          <name>yarn.log-aggregation.retain-seconds</name>
          <value>86400</value>
        </property>
      </configuration>
  • Remote copy

    • scp: secure copy

    Note: after copying files to slave1 and slave2, update the related files on those machines, e.g. run source ~/.bash_profile

    # remote copy to slave1 and slave2 (this assumes the hostnames slave1 and slave2 are already set)
    scp -r hadoop-2.7.7/ [email protected]:/home/amos/SoftWare
    • rsync: remote synchronization tool

    rsync differs from scp in that it is faster and only transfers files that have changed, whereas scp copies everything. A loop for pushing a config to every slave is sketched after the command below.

    rsync -rvl ./yarn-site.xml root@slave1:/home/amos/test
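    When a configuration file changes it has to reach every node. A small distribution-loop sketch, assuming the Hadoop config directory /home/amos/SoftWare/hadoop-2.7.7/etc/hadoop and the hostnames set earlier:

    #!/bin/bash
    # push the Hadoop config directory to both slaves
    for host in slave1 slave2; do
        rsync -rvl /home/amos/SoftWare/hadoop-2.7.7/etc/hadoop/ \
              root@"$host":/home/amos/SoftWare/hadoop-2.7.7/etc/hadoop/
    done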
  • Formatting & Launching

    Important: format the NameNode only before the first startup. Formatting generates a new cluster ID; if the DataNodes still hold the old ID, the NameNode and DataNode cluster IDs no longer match and the cluster cannot find its previous data. So before reformatting a NameNode, delete the data directories and the logs first (a cleanup sketch follows the format command below).

    • Formatting on the master node

      bin/hadoop namenode -format
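      If you ever reformat an existing cluster, clear the old data and logs first, as warned above. A minimal cleanup sketch, assuming the hdfs/tmp, hdfs/name and hdfs/data directories configured earlier (run on master, slave1 and slave2):

      rm -rf /home/amos/SoftWare/hadoop-2.7.7/hdfs/tmp/*
      rm -rf /home/amos/SoftWare/hadoop-2.7.7/hdfs/name/*
      rm -rf /home/amos/SoftWare/hadoop-2.7.7/hdfs/data/*
      rm -rf /home/amos/SoftWare/hadoop-2.7.7/logs/*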
    • Start the cluster

      Note: if the NameNode and the ResourceManager are not on the same machine, do not start YARN from the NameNode; start YARN on the machine where the ResourceManager resides.

      sbin/start-all.sh
      # jps is a JDK command that lists the running Java processes
      jps
      4464 ResourceManager
      4305 SecondaryNameNode
      4972 Jps
      4094 NameNode
    • The jps output on each node

      # master node:
      16747 Jps
      16124 NameNode
      16493 ResourceManager
      16334 SecondaryNameNode
      # slave1 node:
      10485 DataNode
      10729 Jps
      10605 NodeManager
      # slave2 node:
      10521 NodeManager
      10653 Jps
      10399 DataNode
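    • Verify the cluster

      A quick smoke test of HDFS and the web UIs once every process is up. This is a sketch: the URLs assume the master hostname and ports above (50070 is the Hadoop 2.x NameNode web UI default, 8088 was configured in yarn-site.xml).

      # write and read back a small file on HDFS
      bin/hdfs dfs -mkdir -p /test
      bin/hdfs dfs -put etc/hadoop/core-site.xml /test
      bin/hdfs dfs -ls /test
      # web UIs, opened from a browser on the host:
      #   HDFS NameNode:        http://master:50070
      #   YARN ResourceManager: http://master:8088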

2.3 Troubleshooting

  • process information unavailable

    This state shows up when a process is started and killed by different accounts, e.g. a regular user starts a Java process and it is later killed as root. Clean up the leftover hsperfdata directories under /tmp:

    ll /tmp/ | grep hsperfdata
    rm -rf /tmp/hsperfdata*
  • Turn off safe mode

    hdfs dfsadmin -safemode leave
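    To simply check whether safe mode is on rather than force-leave it:

    hdfs dfsadmin -safemode get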

3 Summary

Everything is hard at the beginning, and I have always believed that the initial installation and configuration is the hardest part of any project.

This article summarizes the installation and configuration of CentOS 7 on VMware and of Hadoop. When I first fumbled through this myself, I ran into all kinds of pitfalls and spent nearly two days (and more than a few hairs) getting it done. The pitfalls covered in this article are only part of the story; the ones that are easy to look up are not elaborated here.

Identify the problem, summarize your thinking, then try to solve it yourself, and the progress will be obvious.

Well, that’s all for today, bye bye ~