This article describes how to set up a Spark cluster.

To set up a Spark cluster, you need to:

  1. Configure passwordless SSH login for the cluster
  2. Java JDK 1.8
  3. Scala 2.11.12
  4. spark-2.4.0-bin-hadoop2.7
  5. Hadoop 2.7.6

All of the above files are installed in the /home/zhuyb/opt folder.
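
As a sanity check, the layout under /home/zhuyb/opt should end up roughly like this (directory names are assumed from the versions listed above):

ls ~/opt
# hadoop-2.7.6  jdk1.8.0_201  scala-2.11.12  spark-2.4.0-bin-hadoop2.7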

The servers

The servers are laboratory machines: one master and three slaves. Each IP address and hostname is mapped in the hosts file so that every machine can be reached directly by hostname.

IP address       hostname
219.216.64.144   master
219.216.64.200   hadoop0
219.216.65.202   hadoop1
219.216.65.243   hadoop2
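
One way to apply this mapping is to append the entries above to /etc/hosts on every machine (root privileges assumed):

sudo tee -a /etc/hosts <<'EOF'
219.216.64.144 master
219.216.64.200 hadoop0
219.216.65.202 hadoop1
219.216.65.243 hadoop2
EOF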

Configure passwordless SSH login

For details, see the cluster SSH passwordless login setup guide. Passwordless SSH login is actually very easy to configure. We have four machines; first, generate a new public/private key pair on master.

ssh-keygen -t rsa                     # generate id_rsa and id_rsa.pub under ~/.ssh
cat id_rsa.pub >> authorized_keys     # copy the public key into the authorized_keys file

Then generate public and private keys on the three slave machines in the same way, and append each slave's public key to master's authorized_keys file.

ssh-copy-id -i master                 # append this machine's public key to master's authorized_keys

Finally, copy master's authorized_keys file back to the three slave machines. At this point, all four machines can log in to each other over SSH without a password.
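
A minimal sketch of that last step, assuming the zhuyb user and default ~/.ssh paths on every machine:

scp ~/.ssh/authorized_keys zhuyb@hadoop0:~/.ssh/
scp ~/.ssh/authorized_keys zhuyb@hadoop1:~/.ssh/
scp ~/.ssh/authorized_keys zhuyb@hadoop2:~/.ssh/

# Each of the following should now log in without a password prompt
ssh hadoop0 hostname
ssh hadoop1 hostname
ssh hadoop2 hostname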

Install the JDK and Scala

The JDK version is 1.8 and the Scala version is 2.11. Scala 2.12 is somewhat incompatible with Spark 2.4 and causes issues during development, so 2.11 is used here. After decompressing the JDK and Scala archives, configure the environment variables in ~/.bashrc.

export JAVA_HOME=/home/zhuyb/opt/jdk1.8.0_201
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib:$CLASSPATH
export JAVA_PATH=${JAVA_HOME}/bin:${JRE_HOME}/bin
export PATH=$PATH:${JAVA_PATH}

export SCALA_HOME=/home/zhuyb/opt/scala-2.11.12
export PATH=${SCALA_HOME}/bin:$PATH

Then run source ~/.bashrc to apply the changes. All of these steps need to be performed on each machine.
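
A quick way to confirm the variables took effect on a machine (expected versions taken from the list above):

java -version      # should report 1.8.0_201
scala -version     # should report 2.11.12
echo $JAVA_HOME $SCALA_HOME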

Configure Hadoop

  1. Unzip the Hadoop archive into the ~/opt/ folder:
tar -zxvf hadoop-2.7.6.tar.gz
mv hadoop-2.7.6 ~/opt
  2. Configure the environment variables in ~/.bashrc and run source ~/.bashrc for them to take effect:
export HADOOP_HOME=/home/zhuyb/opt/hadoop-2.7.6
export PATH=.:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
export CLASSPATH=.:$HADOOP_HOME/lib:$CLASSPATH
export HADOOP_PREFIX=/home/zhuyb/opt/hadoop-2.7.6
  3. Modify the corresponding configuration files:

    A. Modify $HADOOP_HOME/etc/hadoop/slaves: delete the original localhost and replace it with the following content:

        hadoop0
        hadoop1
        hadoop2

    B. Modify $HADOOP_HOME/etc/hadoop/core-site.xml:

    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://master:9000</value>
        </property>
        <property>
            <name>hadoop.tmp.dir</name>
            <value>file:/home/zhuyb/opt/hadoop-2.7.6/tmp</value>
        </property>
    
    </configuration>
    

    C. Modify $HADOOP_HOME/etc/hadoop/hdfs-site.xml:

    <configuration>
        <property>
            <name>dfs.datanode.address</name>
            <value>0.0.0.0:50010</value>
        </property>
        <property>
            <name>dfs.namenode.secondary.http-address</name>
            <value>master:50090</value>
        </property>
        <!-- Number of backups: the default is 3 -->
        <property>
            <name>dfs.replication</name>
            <value>3</value>
        </property>
        <!-- namenode -->
        <property>
            <name>dfs.namenode.name.dir</name>
            <value>file:/home/zhuyb/tmp/dfs/name</value>
        </property>
        <!-- datanode -->
        <property>
            <name>dfs.datanode.data.dir</name>
            <value>file:/home/zhuyb/tmp/dfs/data</value>
        </property>
    </configuration>

    D. Copy the template to create the XML file (cp mapred-site.xml.template mapred-site.xml), then modify $HADOOP_HOME/etc/hadoop/mapred-site.xml:

    <configuration>
        <!-- The MapReduce job execution framework is YARN -->
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
        <!-- MapReduce job history access addresses -->
        <property>
            <name>mapreduce.jobhistory.address</name>
            <value>master:10020</value>
        </property>
        <property>
            <name>mapreduce.jobhistory.webapp.address</name>
            <value>master:19888</value>
        </property>
    </configuration>

    E. Modify $HADOOP_HOME/etc/hadoop/yarn-site.xml:

    <configuration>
        <!-- Site specific YARN configuration properties -->
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
        <property>
            <name>yarn.resourcemanager.hostname</name>
            <value>master</value>
        </property>
    </configuration>

    F. Modify $HADOOP_HOME/etc/hadoop/hadoop-env.sh and set JAVA_HOME:

    export JAVA_HOME=/home/zhuyb/opt/jdk1.8.0_201
  4. Copy master's Hadoop folder to hadoop0, hadoop1, and hadoop2:

scp -r ~/opt/hadoop-2.7.6 zhuyb@hadoop0:~/opt
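# The same copy is repeated for the other two slaves (user and paths assumed
# from the layout above):
scp -r ~/opt/hadoop-2.7.6 zhuyb@hadoop1:~/opt
scp -r ~/opt/hadoop-2.7.6 zhuyb@hadoop2:~/opt

# A minimal sketch of bringing the cluster up from master with the standard
# Hadoop 2.x scripts (on PATH via the sbin entry added above): format the
# NameNode once, then start HDFS and YARN.
hdfs namenode -format
start-dfs.sh
start-yarn.sh

# jps on master should show NameNode, SecondaryNameNode and ResourceManager;
# jps on each slave should show DataNode and NodeManager.
jps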

Configure Spark

  1. Decompress the Spark archive into ~/opt, then configure the environment variables in ~/.bashrc and run source ~/.bashrc:
export SPARK_HOME=/home/zhuyb/opt/spark-2.4.0-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME
  2. Copy the template to spark-env.sh (cp spark-env.sh.template spark-env.sh), then modify $SPARK_HOME/conf/spark-env.sh and add the following:
export JAVA_HOME=/home/zhuyb/opt/jdk1.8.0_201
export SPARK_MASTER_IP=master
export SPARK_MASTER_PORT=7077
  3. Copy the template to slaves (cp slaves.template slaves), then modify $SPARK_HOME/conf/slaves and add the following:
hadoop0
hadoop1
hadoop2
  4. On hadoop0, hadoop1, and hadoop2, modify $SPARK_HOME/conf/spark-env.sh and change export SPARK_LOCAL_IP to the IP address of that particular machine.
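
With the configuration distributed, a minimal sketch of starting and smoke-testing the standalone cluster (script and example jar names assumed from the spark-2.4.0-bin-hadoop2.7 layout):

# On master: start the Spark master and all workers listed in conf/slaves
$SPARK_HOME/sbin/start-all.sh

# The web UI at http://master:8080 should list the three workers

# Optional smoke test with the bundled SparkPi example
$SPARK_HOME/bin/spark-submit \
  --master spark://master:7077 \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.0.jar 100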
