Objectives

  • Learn how to install CentOS 7 in VMware
  • Learn how to build a Hadoop cluster
  • Learn how to install the related software
  • Know how to handle common problems

1 Install CentOS7 on VMware

1.1 Install VMware 15

Follow the public account EZ Big Data and reply "VM" to get the VMware 15 installation package and activation key.

  • Background: the difference between Bridged, NAT, and Host-only modes
    • Bridged mode: suited to office/LAN environments and commonly used in production. The VM appears as an independent host on the LAN and can reach any machine on the network, but you have to configure its IP address manually.
    • NAT (Network Address Translation) mode: suited to a home environment with no router. The VM can reach the Internet through address translation performed by the host, and it sits on the VMnet8 subnet.
    • Host-only mode: the VM cannot reach the Internet and can only talk to the host; it sits on the VMnet1 subnet.

1.2 Install CentOS 7

  • Install the system

    Reference: blog.51cto.com/13880683/21…

  • Set a static IP address

    Edit the ifcfg-ens33 file in /etc/sysconfig/network-scripts/, then run service network restart to restart the network. A quick verification is shown after the config below.

    The key entries are as follows:

    BOOTPROTO=static
    ONBOOT=yes
    IPADDR=192.168.xxx.200
    NETMASK=255.255.255.0
    GATEWAY=192.168.xxx.2
    DNS1=114.114.114.114
    DNS2=8.8.8.8
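    To confirm the static IP took effect after the restart, a minimal check, assuming the interface is ens33 and the IPADDR/GATEWAY/DNS values above:

    ip addr show ens33        # should show the IPADDR configured above
    ip route                  # the default route should point at the GATEWAY
    ping -c 3 www.baidu.com   # confirms DNS resolution and outbound connectivity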
  • Disable the firewall

    firewall-cmd --state                   # show firewall status (not running when disabled, running when enabled)
    systemctl stop firewalld.service       # stop the firewall
    systemctl disable firewalld.service    # stop the firewall from starting on boot
  • Set up shared folders (uploading over FTP also works; EZ prefers shared folders)

    Reference: www.cnblogs.com/skyheaving/…

  • Network Exception

    Failed to start LSB: Bring up/down networking.
    
    # Solution: disable NetworkManager
    systemctl stop NetworkManager
    systemctl disable NetworkManager

2 Hadoop Cluster Construction (Fully Distributed)

Hadoop has three operating modes: local, pseudo-distributed, and fully distributed.

Note: this article covers the fully distributed installation. The configuration below requires three machines in VMware: master, slave1, and slave2. On each of them, turn off the firewall, set a static IP, and change the hostname.

Set each VM's hardware configuration (memory, disk capacity) according to what your physical machine can spare.

2.1 Install Hadoop 2.7.7

Official documentation: hadoop.apache.org/docs/r2.7.7…

Download: archive.apache.org/dist/hadoop…

Installation reference: www.cnblogs.com/thousfeet/p…

  • Uninstall the Java that ships with the system

    java -version            # check the current Java version
    rpm -qa | grep jdk       # list the installed JDK packages (.noarch included)
    rpm -e --nodeps xxx      # delete each listed package; xxx is the package name
  • Change the hostname

    # change the hostname
    hostnamectl set-hostname xxx
    
    # establish the mapping between master, slave1, and slave2:
    # add the IP and hostname of each node (an example follows this block)
    vim /etc/hosts
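    A minimal /etc/hosts sketch. The master address matches the static IP set earlier and the slave1 address matches the scp example later in this article; the slave2 address is a placeholder to adjust to your own subnet:

    192.168.xxx.200 master
    192.168.xxx.201 slave1
    192.168.xxx.202 slave2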
  • Add environment variables (example entries follow this block)

    # takes effect for the current user only
    vim ~/.bash_profile
    
    # takes effect for all users
    vim /etc/profile
    
    # make the changes take effect
    source ~/.bash_profile
    # or
    source /etc/profile
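    For reference, a sketch of the entries to append. JAVA_HOME matches the path used in hadoop-env.sh later in this article; HADOOP_HOME and the PATH line are common additions not shown in the original configs, so adjust them to your own layout:

    export JAVA_HOME=/home/amos/SoftWare/jdk1.8.0_251
    export HADOOP_HOME=/home/amos/SoftWare/hadoop-2.7.7
    export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin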
  • Passwordless SSH login

    The steps for passwordless login are as follows (a shortcut using ssh-copy-id is sketched after this list):

    1. Generate a public/private key pair
    ssh-keygen -t rsa
    2. Create the authorized_keys file and change its permissions to 600
    cd ~/.ssh
    touch authorized_keys
    chmod 600 authorized_keys
    3. Append the public keys to the authorized_keys file
    # append the public keys of master, slave1, and slave2 to authorized_keys
    cat id_rsa.pub >> authorized_keys
    # then verify the login: ssh master / ssh slave1 / ssh slave2
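    As a shortcut, ssh-copy-id performs steps 2-3 against a remote machine in one go: it appends the local public key to the remote authorized_keys and sets the permissions. A sketch, run on each of the three nodes:

    ssh-copy-id master
    ssh-copy-id slave1
    ssh-copy-id slave2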
  • Important directories

    • bin: scripts for operating the Hadoop services (HDFS, YARN)
    • etc: directory holding the Hadoop configuration files
    • lib: Hadoop native libraries (used to compress and decompress data)
    • sbin: scripts for starting and stopping the Hadoop services
    • share: Hadoop dependency JARs, documentation, and official examples

2.2 Configure Hadoop

  • Modify the configuration

    1. Modify core-site.xml to configure the tmp directory

      <configuration>
        <property>
          <name>fs.defaultFS</name>
          <value>hdfs://master:9000</value>
          <description>HDFS URI, in the form hdfs://namenode-host:port</description>
        </property>
        <property>
          <name>hadoop.tmp.dir</name>
          <value>/home/amos/SoftWare/hadoop-2.7.7/hdfs/tmp</value>
          <description>Local temporary folder for Hadoop on the namenode</description>
        </property>
        <property>
          <name>hadoop.proxyuser.root.hosts</name>
          <value>*</value>
        </property>
        <property>
          <name>hadoop.proxyuser.root.groups</name>
          <value>*</value>
        </property>
      </configuration>
    2. Modify hadoop-env.sh to configure JAVA_HOME

      export JAVA_HOME=/home/amos/SoftWare/jdk1.8.0_251
    3. Modify hdfs-site.xml to set the namenode (hdfs/name) and datanode (hdfs/data) directories

      <configuration>
        <property>
          <name>dfs.namenode.secondary.http-address</name>
          <value>master:9001</value>
        </property>
        <property>
          <name>dfs.namenode.name.dir</name>
          <value>/home/amos/SoftWare/hadoop-2.7.7/hdfs/name</value>
          <description>Where the namenode stores the HDFS namespace metadata</description>
        </property>
        <property>
          <name>dfs.datanode.data.dir</name>
          <value>/home/amos/SoftWare/hadoop-2.7.7/hdfs/data</value>
          <description>Physical storage location of data blocks on the datanode</description>
        </property>
        <property>
          <name>dfs.replication</name>
          <value>3</value>
          <description>Number of replicas</description>
        </property>
      </configuration>
    4. Modify mapred-site.xml to set YARN as the MapReduce framework

      <configuration>
        <property>
          <name>mapreduce.framework.name</name>
          <value>yarn</value>
          <description>Run MapReduce on YARN (Hadoop 1 did not use YARN)</description>
        </property>
        <!-- Hadoop history server -->
        <property>
          <name>mapreduce.jobhistory.address</name>
          <value>master:10020</value>
          <description>Address of the MR JobHistory Server that manages the job logs</description>
        </property>
        <property>
          <name>mapreduce.jobhistory.webapp.address</name>
          <value>master:19888</value>
          <description>Web address for viewing completed MapReduce jobs on the history server; the service must be started separately</description>
        </property>
        <property>
          <name>mapreduce.jobhistory.done-dir</name>
          <value>/mr-history/done</value>
          <description>Directory for logs managed by the MR JobHistory Server; default: /mr-history/done</description>
        </property>
        <property>
          <name>mapreduce.jobhistory.intermediate-done-dir</name>
          <value>/mr-history/tmp</value>
          <description>Directory for in-progress MapReduce job logs; default: /mr-history/tmp</description>
        </property>
        <property>
          <name>yarn.app.mapreduce.am.staging-dir</name>
          <value>/mr-history/hadoop-yarn/</value>
          <description>Staging directory for the application ID, required JAR files, etc.</description>
        </property>
        <property>
          <name>mapreduce.map.memory.mb</name>
          <value>2048</value>
          <description>Physical memory limit for each map task</description>
        </property>
        <property>
          <name>mapreduce.reduce.memory.mb</name>
          <value>2048</value>
          <description>Physical memory limit for each reduce task</description>
        </property>
      </configuration>
    5. Modify the slaves file to configure the slave nodes

      slave1
      slave2
    6. Modify yarn-site.xml to configure the ResourceManager ports

      <configuration>
        <!-- Enable log aggregation -->
        <property>
          <name>yarn.log-aggregation-enable</name>
          <value>true</value>
        </property>
        <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>mapreduce_shuffle</value>
        </property>
        <property>
          <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
          <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
        <property>
          <name>yarn.resourcemanager.address</name>
          <value>master:8032</value>
        </property>
        <property>
          <name>yarn.resourcemanager.scheduler.address</name>
          <value>master:8030</value>
        </property>
        <property>
          <name>yarn.resourcemanager.resource-tracker.address</name>
          <value>master:8031</value>
        </property>
        <property>
          <name>yarn.resourcemanager.admin.address</name>
          <value>master:8033</value>
        </property>
        <property>
          <name>yarn.resourcemanager.webapp.address</name>
          <value>master:8088</value>
        </property>
        <property>
          <name>yarn.nodemanager.pmem-check-enabled</name>
          <value>false</value>
        </property>
        <property>
          <name>yarn.nodemanager.vmem-check-enabled</name>
          <value>false</value>
        </property>
        <property>
          <name>yarn.log-aggregation.retain-seconds</name>
          <value>86400</value>
        </property>
      </configuration>
  • Remote copy

    • scp: secure copy

    Note: after copying files to slave1 and slave2, update the related files on those machines, e.g. run source ~/.bash_profile

    # remote copy to slave1 and slave2 (this assumes the hostnames slave1 and slave2 are already set)
    scp -r hadoop-2.7.7/ [email protected]:/home/amos/SoftWare
    • rsync: remote synchronization tool

    rsync differs from scp in that it is faster and only transfers files that have changed, whereas scp copies everything. A loop for pushing a config to every slave is sketched after the command below.

    rsync -rvl ./yarn-site.xml root@slave1:/home/amos/test
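    When a configuration file changes it has to reach every node. A small distribution-loop sketch, assuming the Hadoop config directory /home/amos/SoftWare/hadoop-2.7.7/etc/hadoop and the hostnames set earlier:

    #!/bin/bash
    # push the Hadoop config directory to both slaves
    for host in slave1 slave2; do
        rsync -rvl /home/amos/SoftWare/hadoop-2.7.7/etc/hadoop/ \
              root@"$host":/home/amos/SoftWare/hadoop-2.7.7/etc/hadoop/
    done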
  • Formatting & Launching

    Important: format the NameNode only before the first startup. Formatting generates a new cluster ID; if the DataNodes still hold the old ID, the NameNode and DataNode cluster IDs no longer match and the cluster cannot find its previous data. So before reformatting a NameNode, delete the data directories and the logs first (a cleanup sketch follows the format command below).

    • Formatting on the master node

      bin/hadoop namenode -format
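      If you ever reformat an existing cluster, clear the old data and logs first, as warned above. A minimal cleanup sketch, assuming the hdfs/tmp, hdfs/name and hdfs/data directories configured earlier (run on master, slave1 and slave2):

      rm -rf /home/amos/SoftWare/hadoop-2.7.7/hdfs/tmp/*
      rm -rf /home/amos/SoftWare/hadoop-2.7.7/hdfs/name/*
      rm -rf /home/amos/SoftWare/hadoop-2.7.7/hdfs/data/*
      rm -rf /home/amos/SoftWare/hadoop-2.7.7/logs/*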
    • Start the cluster

      Note: if the NameNode and the ResourceManager are not on the same machine, do not start YARN from the NameNode; start YARN on the machine where the ResourceManager resides.

      sbin/start-all.sh
      # jps is a JDK command that lists the running Java processes
      jps
      4464 ResourceManager
      4305 SecondaryNameNode
      4972 Jps
      4094 NameNode
    • The jps output on each node

      # master node:
      16747 Jps
      16124 NameNode
      16493 ResourceManager
      16334 SecondaryNameNode
      # slave1 node:
      10485 DataNode
      10729 Jps
      10605 NodeManager
      # slave2 node:
      10521 NodeManager
      10653 Jps
      10399 DataNode
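    • Verify the cluster

      A quick smoke test of HDFS and the web UIs once every process is up. This is a sketch: the URLs assume the master hostname and ports above (50070 is the Hadoop 2.x NameNode web UI default, 8088 was configured in yarn-site.xml).

      # write and read back a small file on HDFS
      bin/hdfs dfs -mkdir -p /test
      bin/hdfs dfs -put etc/hadoop/core-site.xml /test
      bin/hdfs dfs -ls /test
      # web UIs, opened from a browser on the host:
      #   HDFS NameNode:        http://master:50070
      #   YARN ResourceManager: http://master:8088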

2.3 Troubleshooting

  • process information unavailable

    This state shows up when a process is started and killed by different accounts, e.g. a regular user starts a Java process and it is later killed as root. Clean up the leftover hsperfdata directories under /tmp:

    ll /tmp/ | grep hsperfdata
    rm -rf /tmp/hsperfdata*
  • Turn off safe mode

    hdfs dfsadmin -safemode leave
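    To simply check whether safe mode is on rather than force-leave it:

    hdfs dfsadmin -safemode get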

3 Summary

Everything is hard at the beginning, and I have always believed that the initial installation and configuration is the hardest part of any project.

This article summarizes the installation and configuration of CentOS 7 on VMware and of Hadoop. When I first fumbled through this myself, I ran into all kinds of pitfalls and spent nearly two days (and more than a few hairs) getting it done. The pitfalls covered in this article are only part of the story; the ones that are easy to look up are not elaborated here.

Identify the problem, summarize your thinking, then try to solve it yourself, and the progress will be obvious.

Well, that’s all for today, bye bye ~