This article describes how to set up a Spark cluster.
To set up a Spark cluster, you need:
- Passwordless SSH login configured across the cluster
- JDK 1.8
- Scala 2.11.12
- spark-2.4.0-bin-hadoop2.7
- Hadoop 2.7.6
All of the above files are installed in the /home/zhuyb/opt folder.
Servers
The servers are laboratory machines: one master and three slaves. The IP addresses and host names are mapped in the hosts file so that each machine can be reached directly by its hostname; the corresponding hosts entries are sketched after the table.
| IP address | hostname |
| --- | --- |
| 219.216.64.144 | master |
| 219.216.64.200 | hadoop0 |
| 219.216.65.202 | hadoop1 |
| 219.216.65.243 | hadoop2 |
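The /etc/hosts entries would look roughly like this on every machine (a sketch based on the table above; adjust to your own addresses):

```
# /etc/hosts (identical on all four machines)
219.216.64.144  master
219.216.64.200  hadoop0
219.216.65.202  hadoop1
219.216.65.243  hadoop2
```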
Configure SSH passwordless login
For details, see the cluster SSH passwordless login setup. Configuring SSH passwordless login is straightforward. We have four machines; first, generate a new public/private key pair on master:
```
ssh-keygen -t rsa                    # generate id_rsa and id_rsa.pub
cat id_rsa.pub >> authorized_keys    # copy the public key into the authorized_keys file (run inside ~/.ssh)
```
Then generate a public/private key pair on each of the three slaves in the same way, and append each slave's public key to master's authorized_keys file; ssh-copy-id does this in one step:
```
ssh-copy-id -i master    # copy this machine's public key into master's authorized_keys
```
Finally, copy master's authorized_keys file back to the three slaves, as sketched below. At this point, all four machines can SSH into each other without a password.
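A minimal sketch of this last step, assuming the zhuyb account and the default ~/.ssh layout on every machine:

```
# on master: push the combined authorized_keys out to each slave
scp ~/.ssh/authorized_keys zhuyb@hadoop0:~/.ssh/
scp ~/.ssh/authorized_keys zhuyb@hadoop1:~/.ssh/
scp ~/.ssh/authorized_keys zhuyb@hadoop2:~/.ssh/

# quick check: this should log in without prompting for a password
ssh hadoop0 hostname
```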
Install the JDK and Scala
We use JDK 1.8 and Scala 2.11. Scala 2.12 is not fully compatible with Spark 2.4 and causes some issues during development, so we stay on 2.11. After decompressing the JDK and Scala archives, configure the environment variables in ~/.bashrc:
```
export JAVA_HOME=/home/zhuyb/opt/jdk1.8.0_201
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib:$CLASSPATH
export JAVA_PATH=${JAVA_HOME}/bin:${JRE_HOME}/bin
export PATH=$PATH:${JAVA_PATH}
export SCALA_HOME=/home/zhuyb/opt/scala-2.11.12
export PATH=${SCALA_HOME}/bin:$PATH
```
Then run source ~/.bashrc to apply the changes. All of these steps must be performed on every machine.
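As a quick sanity check after sourcing ~/.bashrc, confirm the versions on each machine (the expected output assumes the jdk1.8.0_201 and scala-2.11.12 installs above):

```
java -version     # should report 1.8.0_201
scala -version    # should report 2.11.12
```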
Configure Hadoop
- Unzip the Hadoop archive into the ~/opt/ folder:

```
tar -zxvf hadoop-2.7.6.tar.gz
mv hadoop-2.7.6 ~/opt
```
- Configure environment variables in ~/.bashrc and run source ~/.bashrc for them to take effect:

```
export HADOOP_HOME=/home/zhuyb/opt/hadoop-2.7.6
export PATH=.:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
export CLASSPATH=.:$HADOOP_HOME/lib:$CLASSPATH
export HADOOP_PREFIX=/home/zhuyb/opt/hadoop-2.7.6
```
- Modify the corresponding configuration files:
A. Modify $HADOOP_HOME/etc/hadoop/slaves: delete the original localhost and replace it with the following content:
```
hadoop0
hadoop1
hadoop2
```
B. Modify $HADOOP_HOME/etc/hadoop/core-site.xml:
```
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/home/zhuyb/opt/hadoop-2.7.6/tmp</value>
  </property>
</configuration>
```
C. Modify $HADOOP_HOME/etc/hadoop/hdfs-site.xml:
```
<configuration>
  <property>
    <name>dfs.datanode.address</name>
    <value>0.0.0.0:50010</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>master:50090</value>
  </property>
  <!-- Number of replicas; the default is 3 -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <!-- namenode storage directory -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/zhuyb/tmp/dfs/name</value>
  </property>
  <!-- datanode storage directory -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/zhuyb/tmp/dfs/data</value>
  </property>
</configuration>
```
D. Copy the template to create the file (cp mapred-site.xml.template mapred-site.xml), then modify $HADOOP_HOME/etc/hadoop/mapred-site.xml:
```
<configuration>
  <!-- Run MapReduce jobs on YARN -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <!-- MapReduce job history server address -->
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>master:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>master:19888</value>
  </property>
</configuration>
```
E. Modify $HADOOP_HOME/etc/hadoop/yarn-site.xml:
```
<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
</configuration>
```
F. Modify $HADOOP_HOME/etc/hadoop/hadoop-env.sh and set JAVA_HOME:
```
export JAVA_HOME=/home/zhuyb/opt/jdk1.8.0_201
```
- Copy the Hadoop folder from master to hadoop0, hadoop1, and hadoop2:
```
scp -r ~/opt/hadoop-2.7.6 zhuyb@hadoop0:~/opt    # repeat for hadoop1 and hadoop2
```
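The same copy can also be looped over all three slaves; a sketch, assuming the zhuyb account and the ~/opt directory exist on every host:

```
for host in hadoop0 hadoop1 hadoop2; do
  scp -r ~/opt/hadoop-2.7.6 zhuyb@"$host":~/opt
done
```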
Configure Spark
- Decompress the Spark archive into ~/opt, then configure environment variables in ~/.bashrc and run source ~/.bashrc:

```
export SPARK_HOME=/home/zhuyb/opt/spark-2.4.0-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
```
- Copy the template to create spark-env.sh (cp spark-env.sh.template spark-env.sh), then modify $SPARK_HOME/conf/spark-env.sh and add the following:

```
export JAVA_HOME=/home/zhuyb/opt/jdk1.8.0_201
export SPARK_MASTER_IP=master
export SPARK_MASTER_PORT=7077
```
- Copy the template to create slaves (cp slaves.template slaves), then modify $SPARK_HOME/conf/slaves and add the following:

```
hadoop0
hadoop1
hadoop2
```
- On hadoop0, hadoop1, and hadoop2, modify each machine's $SPARK_HOME/conf/spark-env.sh and change export SPARK_LOCAL_IP to that machine's own IP address.
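For example, on hadoop0 (219.216.64.200 in the table above) the change would look roughly like this; hadoop1 and hadoop2 use their own addresses:

```
# $SPARK_HOME/conf/spark-env.sh on hadoop0
export SPARK_LOCAL_IP=219.216.64.200
```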
Reference

Hadoop 2.7.3 + Spark 2.1.0 fully distributed cluster setup process