No matter how you approach big data, you can never get around the little yellow elephant, Hadoop. Installing Hadoop is arguably the first step into the field. As a student still studying big data at school, I have accumulated some experience over years of learning, so I want to share a nanny-level Hadoop installation tutorial.
This guide assumes you know some basic Linux and have VMware Workstation or similar virtualization software installed on your computer (of course, if you have the money to buy a cloud server, forget what I just said).
Linux VM installation
- Here is the download link for the CentOS 7.6 image
- Install the CentOS VM
- Click Create VM
- In the installer disc (CD/DVD) settings, select the image we downloaded; it will detect the system to be installed
- Click Next and click Finish. This will open the virtual machine automatically
- Press Enter until the visual interface is displayed. Set the installation language to Chinese
- We'd better choose the minimal installation, for speed
- The next step is to set the password of the root account and create the user we normally use
- Then there is a slow wait (about five or six minutes)
- Then click Reboot, and the installation is done!
Preparations before Hadoop installation
- Static IP address and host name configuration
- Open the ifcfg-ens33 file and modify the configuration

vi /etc/sysconfig/network-scripts/ifcfg-ens33
............
BOOTPROTO=static        # change dhcp to static
ONBOOT=yes              # change no to yes
IPADDR=192.168.10.200   # add the IPADDR attribute and the IP address
PREFIX=24
GATEWAY=                # set this to your network's gateway address
DNS1=114.114.114.114    # add DNS1 and a backup DNS
DNS2=8.8.8.8
- Restart the Network service
systemctl restart network    # or: service network restart
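After restarting the network service, it is worth confirming that the static address actually took effect. A minimal check, assuming the 192.168.10.200/24 address configured above:

```bash
ip addr show ens33            # should show "inet 192.168.10.200/24"
ping -c 3 114.114.114.114     # confirm outbound connectivity through the new configuration
```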
- Changing the host name

hostnamectl set-hostname master
Note: After configuring the IP address and host name, reboot
- Configure the /etc/hosts file

vi /etc/hosts
# append the following line (the master IP configured earlier)
192.168.10.200 master
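A quick check that the host name and mapping are in place (host name and IP as set above):

```bash
hostname            # should print: master
ping -c 2 master    # should resolve to 192.168.10.200 via /etc/hosts
```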
- Disabling the firewall

systemctl stop firewalld
systemctl disable firewalld

# It is also a good idea to turn off SELinux, a security mechanism on Linux systems.
# Open its config file and set SELINUX to disabled:
vi /etc/selinux/config
SELINUX=disabled
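To double-check that both are really off (note that the SELinux change in the config file only takes effect after a reboot):

```bash
systemctl is-active firewalld    # should print: inactive
systemctl is-enabled firewalld   # should print: disabled
getenforce                       # prints Disabled after a reboot; before that it may still say Enforcing
```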
- Time synchronization
- Enter tzselect and select 5, 9, 1, and 1 (Asia, China, Beijing Time, then confirm)
- Install NTP

yum install -y ntp
- Configure /etc/ntp.conf

vim /etc/ntp.conf
# add the following lines to use the local clock as the time source
server 127.127.1.0
fudge 127.127.1.0 stratum 10
- Start the service: /bin/systemctl restart ntpd.service
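To confirm that ntpd is running and using the local clock we just configured:

```bash
systemctl status ntpd    # should show the service as active (running)
ntpq -p                  # the peer list should include LOCAL(0), the local clock source
date                     # confirm the system time and time zone look right
```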
- Set up passwordless (no-password) SSH login
- Run ssh-keygen and press Enter at every prompt
- Run ssh-copy-id -i /root/.ssh/id_rsa -p 22 root@master and enter the password when prompted
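To verify that passwordless login works (this also clears the first-connection host-key prompt):

```bash
ssh master    # should log in without asking for a password
exit          # return to the original shell
```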
Hadoop single-node installation and configuration
- JDK installation
- Check whether a JDK is already installed or built into the system; if so, uninstall it

rpm -qa | grep jdk          # query for a built-in JDK
rpm -e xxxxxxxx --nodeps    # force-uninstall whatever the query found
- Upload the JDK to /opt/software/
- Decompress the JDK to /opt/apps/

cd /opt/software
tar -zxvf jdk-8u152-linux-x64.tar.gz -C /opt/apps/
- Rename the JDK directory

cd /opt/apps
mv jdk1.8.0_152/ jdk
- Configure the JDK environment variables in /etc/profile

vim /etc/profile
# append at the end
#jdk environment
export JAVA_HOME=/opt/apps/jdk
export PATH=$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$PATH
- Make it take effect in the current shell

source /etc/profile
- Verify the Java environment

java -version
javac
- Hadoop Standalone Installation
- Upload Hadoop to /opt/software/
- Decompress Hadoop to /opt/apps/

cd /opt/software/
tar -zxvf hadoop-2.7.6.tar.gz -C /opt/apps/
- Rename the Hadoop directory

cd /opt/apps
mv hadoop-2.7.6/ hadoop
- Configure environment variables for Hadoop

vi /etc/profile
# append at the end
#hadoop environment
export HADOOP_HOME=/opt/apps/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
- Make it take effect in the current shell

source /etc/profile
- Verify Hadoop

hadoop version
- Configure the hadoop-env.sh file

vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
# change the following line to the JDK path we set up earlier
export JAVA_HOME=/opt/apps/jdk
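At this point standalone mode already works. A quick smoke test using the example jar that ships with Hadoop (the jar name matches the 2.7.6 release used in this article; the output directory must not exist beforehand):

```bash
cd /opt/apps/hadoop
mkdir input
cp etc/hadoop/*.xml input/
# run the bundled "grep" example against the copied config files
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.6.jar grep input output 'dfs[a-z.]+'
cat output/*    # should print the matched property names and their counts
```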
Hadoop pseudo-distributed installation and configuration
Introduction to pseudo-distributed mode
First of all, we need to understand the characteristics of pseudo-distributed mode.
1. Characteristics
- Installed on a single machine, but built on the distributed idea: it uses a distributed file system rather than the local file system.
- All the HDFS daemons (namenode, datanode, secondarynamenode) run on one machine, each as an independent Java process.
- Easier to debug than standalone mode: you can check memory usage, HDFS input/output, and the interaction between daemons.
Since we have already configured passwordless login, the static IP, and the host mapping, and installed the JDK and Hadoop, we can go straight to the file configuration.
Configuration file
- core-site.xml configuration

[root@master ~]# cd $HADOOP_HOME/etc/hadoop
[root@master hadoop]# vi core-site.xml

<configuration>
    <!-- Configure the default file system -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost/</value>
    </property>
</configuration>

Extension: the default port in hadoop 1.x is 9000, and in hadoop 2.x it is 8020; either can be used.
- hdfs-site.xml configuration

[root@master hadoop]# vi hdfs-site.xml

<configuration>
    <!-- Configure the number of replicas. Note: in pseudo-distributed mode it can only be 1 -->
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
- Configuration of hadoop-env.sh: specify the JDK environment (as in single-machine mode, not detailed here)
Format the NameNode
hdfs namenode -format
Start the cluster.
start-dfs.sh
- Run jps to view the processes
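If jps shows the namenode, datanode, and secondarynamenode, a small HDFS smoke test confirms the pseudo-distributed file system is usable (the paths here are only examples):

```bash
hdfs dfs -mkdir -p /user/root/input                            # create a test directory in HDFS
hdfs dfs -put $HADOOP_HOME/etc/hadoop/*.xml /user/root/input   # upload a few files
hdfs dfs -ls /user/root/input                                  # list them back
```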
WebUI_50070
You can enter 192.168.10.200:50070 in the browser to view the pseudo-distributed cluster information.
- 1. Browse the page for the Cluster ID and Block Pool ID
- 2. Check the number of Live Nodes; it should be 1
Simple explanation:
- Compiled: this Hadoop build was compiled by kshvachk
- Cluster ID: the ID of the cluster
- Block Pool ID: the ID of the datanodes' block pool
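If you are working in a terminal without a browser, a rough check that the web UI is up (address as above):

```bash
curl -s http://192.168.10.200:50070/ | head -n 5    # should return the namenode web page's HTML
```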
Fully distributed cluster installation and configuration
Standalone and pseudo-distributed modes cannot be used in a production environment; they are only for daily debugging and learning. What we really use is the fully distributed cluster.
VM description
Use VM cloning to create two more VMs. The configuration of the three VMs is as follows:

| Host name | IP             |
| --------- | -------------- |
| master    | 192.168.10.200 |
| slave1    | 192.168.10.201 |
| slave2    | 192.168.10.202 |

Note: since slave1 and slave2 are clones, you do not need to turn off the firewall again; just add the mappings for both clones in /etc/hosts and change their IP addresses.
Note, note, note:
1. If you are coming from the pseudo-distributed setup, it is best to stop its daemons first: stop-all.sh
2. Delete the old namenode and datanode data directories
Daemon layout
Let’s set up the full distribution of HDFS and set up YARN. The layout of HDFS and YARN daemons is as follows:
master: namenode, datanode, resourcemanager, nodemanager
slave1: datanode, nodemanager, secondarynamenode
slave2: datanode, nodemanager
Hadoop configuration file configuration
- Note before configuration:
  1. We first configure the Hadoop-related properties on the master machine node
  2. <value></value>
  3. After the master is configured, clone two VMs and change their IP addresses
- Configure the core-site.xml file
[root@master ~]# cd $HADOOP_HOME/etc/hadoop/
[root@master hadoop]# vi core-site.xml

<configuration>
    <!-- The default file system -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:8020</value>
    </property>
    <!-- The HDFS base path, which other properties depend on -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/apps/tmp</value>
    </property>
</configuration>
- Configure the hdfs-site.xml file

[root@master hadoop]# vi hdfs-site.xml

<configuration>
    <!-- Location of the fsimage metadata files managed by the namenode daemon -->
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file://${hadoop.tmp.dir}/dfs/name</value>
    </property>
    <!-- Where the datanode should store its blocks in the local file system -->
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file://${hadoop.tmp.dir}/dfs/data</value>
    </property>
    <!-- Number of replicas of each block -->
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <!-- Block size (128 MB) -->
    <property>
        <name>dfs.blocksize</name>
        <value>134217728</value>
    </property>
    <!-- The secondarynamenode HTTP address: host name and port of the daemon; see the daemon layout -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>slave1:50090</value>
    </property>
    <!-- Checkpoint directory -->
    <property>
        <name>fs.checkpoint.dir</name>
        <value>file:///${hadoop.tmp.dir}/checkpoint/dfs/cname</value>
    </property>
    <!-- Checkpoint edits directory -->
    <property>
        <name>fs.checkpoint.edits.dir</name>
        <value>file:///${hadoop.tmp.dir}/checkpoint/dfs/cname</value>
    </property>
    <!-- The namenode web UI address -->
    <property>
        <name>dfs.http.address</name>
        <value>master:50070</value>
    </property>
</configuration>
- Configure the mapred-site.xml file
Strictly speaking, only the core-site.xml and hdfs-site.xml files are required. However, to learn MapReduce we also need the YARN resource manager, so we configure the related files in advance.
[root@master hadoop]# cp mapred-site.xml.template mapred-site.xml
[root@master hadoop]# vi mapred-site.xml

<configuration>
    <!-- Specify that MapReduce uses the YARN resource manager -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <!-- Configure the job history server address -->
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>master:10020</value>
    </property>
    <!-- Configure the job history server web address -->
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>master:19888</value>
    </property>
</configuration>
- Configure the yarn-site.xml file

[root@master hadoop]# vi yarn-site.xml

<configuration>
    <!-- Specify the YARN shuffle service -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- Specify the resourcemanager host name -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master</value>
    </property>

    <!-- The following are optional -->
    <!-- Specify the class that implements the shuffle service -->
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <!-- The resourcemanager internal address -->
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>master:8032</value>
    </property>
    <!-- The internal address of the scheduler inside the resourcemanager -->
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>master:8030</value>
    </property>
    <!-- The resource-tracker internal address used for resource scheduling -->
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>master:8031</value>
    </property>
    <!-- The resourcemanager admin internal address -->
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>master:8033</value>
    </property>
    <!-- The resourcemanager web UI monitoring page -->
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>master:8088</value>
    </property>
</configuration>
- Configure the hadoop-env.sh script (the same as the single-machine deployment mode).
- Configure the slaves file, which is used to specify the host name of the machine node where the Datanode daemon is located
[root@master hadoop]# vi slaves
master
slave1
slave2
- Configure the yarn-env.sh file. This is optional, but it is better to set the JDK path for YARN here as well.
- Configuration instructions for the other two machines
After configuring hadoop related files on the Master machine, we have the following two ways to configure Hadoop on other machines.
- scp synchronization (this method applies when the other VMs have already been created in advance)
- VM cloning
It's still a hassle to install two more virtual machines from scratch, so let's go with cloning (a sketch of the scp route is included after the cloning steps, for reference).
- Open a newly cloned VM and change the host name
- Changing an IP Address
- Restart the Network service
- Repeat steps 1 to 3 for other newly cloned VMS
- Verify passwordless login by connecting from the master machine to every other node; this confirms it works and also clears the first-connection confirmation prompt
- Suggestion: reboot each machine after changing its network configuration
The specific operation steps above are not repeated here
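For reference, if you take the scp route instead of cloning, a minimal sketch is shown below. The paths follow the layout used in this article, and it assumes passwordless SSH from master to the slaves is already set up:

```bash
# Run on master: make sure the target directory exists, then copy the software
ssh root@slave1 'mkdir -p /opt/apps'
ssh root@slave2 'mkdir -p /opt/apps'
scp -r /opt/apps/jdk /opt/apps/hadoop root@slave1:/opt/apps/
scp -r /opt/apps/jdk /opt/apps/hadoop root@slave2:/opt/apps/
# Copy the environment variables and the host mappings as well
scp /etc/profile /etc/hosts root@slave1:/etc/
scp /etc/profile /etc/hosts root@slave2:/etc/
# Then run "source /etc/profile" on each slave
```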
Format the NameNode
# Perform this operation on the master
hdfs namenode -format
If you are successful, you are ready to start your Hadoop cluster!
This section describes the start and stop scripts
1. Start scripts
   - start-dfs.sh: starts the HDFS cluster
   - start-yarn.sh: starts the YARN daemons
   - start-all.sh: starts both HDFS and YARN
2. Stop scripts
   - stop-dfs.sh: shuts down the HDFS cluster
   - stop-yarn.sh: stops the YARN daemons
   - stop-all.sh: shuts down both HDFS and YARN
3. Single-daemon scripts
   - hadoop-daemons.sh: starts or stops a given type of HDFS daemon on all nodes
   - hadoop-daemon.sh: starts or stops an HDFS daemon on the local node
     e.g. hadoop-daemon.sh [start|stop] [namenode|datanode|secondarynamenode]
   - yarn-daemons.sh: starts or stops a given type of YARN daemon on all nodes
   - yarn-daemon.sh: starts or stops a YARN daemon on the local node
     e.g. yarn-daemon.sh [start|stop] [resourcemanager|nodemanager]
Finally, run jps on each host to check the processes. If the running processes match our daemon layout, then congratulations, your Hadoop cluster has been set up successfully!
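One way to run that check from the master without logging in to each machine by hand, assuming the passwordless SSH set up earlier (a sketch, not the only way):

```bash
# Start everything, then inspect the daemons on every node from the master
start-all.sh
for host in master slave1 slave2; do
    echo "==== $host ===="
    ssh "$host" 'source /etc/profile; jps'
done

# HDFS-level view of the datanodes that have joined the cluster
hdfs dfsadmin -report | grep -E 'Live datanodes|Name:'
```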
Of course, if some process fails to start, go back and fix its corresponding configuration file, then try again.
You are welcome to reach out to exchange ideas and learn together.
Personal blog
CSDN home page