1. Job submission

1.1 spark-submit

In all Spark modes, the spark-submit command is used to submit a job in the following format:

# General syntax of spark-submit
./bin/spark-submit \
  --class <main-class> \        # entry class of the application
  --master <master-url> \       # master URL of the cluster
  --deploy-mode <deploy-mode> \ # deploy mode
  --conf <key>=<value> \        # optional configuration
  ...                           # other options
  <application-jar> \           # path to the application jar
  [application-arguments]       # arguments passed to the main class

Note that in a cluster environment, application-jar must be accessible to all nodes in the cluster: it can be an HDFS path or a local file system path. If a local path is used, the jar must exist at the same path on every node in the cluster.
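As a sketch of the HDFS approach (the target directory and the use of Yarn as the master are illustrative assumptions, not part of the walkthrough):

# Upload the jar to HDFS so that every node reads the same copy
hdfs dfs -mkdir -p /spark-jars
hdfs dfs -put /usr/app/spark-2.4.0-bin-hadoop2.6/examples/jars/spark-examples_2.11-2.4.0.jar /spark-jars/

# Reference the HDFS path instead of a local path when submitting
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  hdfs:///spark-jars/spark-examples_2.11-2.4.0.jar \
  100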

1.2 deploy-mode

deploy-mode takes one of two values, cluster or client; the default is client. Both are described below using Spark on Yarn mode as an example.

  • In cluster mode, the Spark Driver runs inside the ApplicationMaster process, which is managed by YARN on the cluster; the client that submits the job can exit once the application has started (see the note after this list).
  • In client mode, the Spark Driver runs in the client process that submits the job, and the ApplicationMaster is only used to request resources from YARN.
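A practical consequence of cluster mode is that the Driver's console output is not printed on the submitting client. Assuming YARN log aggregation is enabled, it can be retrieved after the fact with yarn logs (the application ID below is a placeholder):

# Fetch the aggregated logs of a finished application
yarn logs -applicationId application_1570000000000_0001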

1.3 master-url

All optional values of master-url are shown in the following table:

| Master URL | Meaning |
| --- | --- |
| local | Run Spark locally with one worker thread |
| local[K] | Run Spark locally with K worker threads |
| local[K,F] | Run Spark locally with K worker threads, where the second parameter F is the number of failed retries per task |
| local[*] | Run Spark locally with as many worker threads as there are CPU cores |
| local[*,F] | Run Spark locally with as many worker threads as there are CPU cores, where the second parameter F is the number of failed retries per task |
| spark://HOST:PORT | Connect to the Master of the given standalone cluster. The default port is 7077 |
| spark://HOST1:PORT1,HOST2:PORT2 | For a standalone cluster made highly available with Zookeeper, list all master host addresses registered in Zookeeper |
| mesos://HOST:PORT | Connect to the given Mesos cluster. The default port is 5050. For Mesos clusters that use ZooKeeper, specify the address as mesos://zk://... and submit with --deploy-mode cluster |
| yarn | Connect to a YARN cluster, whose location is taken from the HADOOP_CONF_DIR or YARN_CONF_DIR variable. Use the --deploy-mode parameter to choose client or cluster mode |

The following sections describe the three common deployment modes and how jobs are submitted in each.

2. Local mode

It is easiest to submit a job in Local mode and no configuration is required. The submit command is as follows:

# Submit an application in local mode
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[2] \
  /usr/app/spark-2.4.0-bin-hadoop2.6/examples/jars/spark-examples_2.11-2.4.0.jar \
  100   # argument passed to SparkPi

spark-examples_2.11-2.4.0.jar is a test-case package that ships with Spark, and SparkPi computes an approximation of Pi.
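If the submission succeeds, the console output contains a line like the one below; the exact digits vary from run to run because SparkPi estimates Pi by random sampling:

Pi is roughly 3.1416151141615114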

3. Standalone mode

Standalone is Spark's built-in cluster mode, managed by its own built-in resource manager. The following builds a cluster of one Master and two Worker nodes using two hosts:

  • hadoop001: since there are only two hosts, hadoop001 acts as both the Master node and a Worker node;
  • hadoop002: Worker node only.

3.1 Environment Configuration

Make sure Spark is decompressed to the same path on both hosts. Then go to ${SPARK_HOME}/conf/ on hadoop001 and copy the configuration template:

# cp spark-env.sh.template spark-env.sh

Edit spark-env.sh to configure the JDK installation directory, then distribute the file to hadoop002 with the scp command (a sketch follows the snippet):

# JDK installation location
JAVA_HOME=/usr/java/jdk1.8.0_201
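A minimal distribution command, assuming Spark is installed at /usr/app/spark-2.4.0-bin-hadoop2.6 on both hosts:

# Copy the edited spark-env.sh to the same location on hadoop002
scp /usr/app/spark-2.4.0-bin-hadoop2.6/conf/spark-env.sh \
    hadoop002:/usr/app/spark-2.4.0-bin-hadoop2.6/conf/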

3.2 Cluster Configuration

In ${SPARK_HOME}/conf/, copy the sample cluster configuration and configure it:

# cp slaves.template slaves

Specify the hostname of all Worker nodes:

# A Spark Worker will be started on each of the machines listed below.
hadoop001
hadoop002

Here are three points to note:

  • The mapping between each host name and its IP address must be configured in /etc/hosts; otherwise, use IP addresses directly;
  • Each host name must be on its own line;
  • Spark's Master accesses all Worker nodes over SSH, so passwordless SSH login must be configured in advance (see the sketch after this list).
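A minimal sketch of the passwordless login, run as the same user on hadoop001 (since hadoop001 is also a Worker, it must be able to SSH to itself as well):

# Generate a key pair (accept the defaults), then install the public key on both nodes
ssh-keygen -t rsa
ssh-copy-id hadoop001
ssh-copy-id hadoop002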

3.3 Startup

Run start-all.sh to start Master and all Worker services.

./sbin/start-all.sh

Access port 8080 to view Spark's Web UI; it should show two alive Worker nodes.
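The daemons can also be verified from the shell with jps, which ships with the JDK (the process IDs below are illustrative):

# On hadoop001, both Master and Worker should be running
$ jps
2324 Master
2445 Worker

# On hadoop002, only a Worker
$ jps
1987 Worker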

3.4 Submitting a Job

# Submit to the standalone cluster in client mode
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://hadoop001:7077 \
  --executor-memory 2G \
  --total-executor-cores 10 \
  /usr/app/spark-2.4.0-bin-hadoop2.6/examples/jars/spark-examples_2.11-2.4.0.jar \
  100

# Submit to the standalone cluster in cluster mode
# --supervise restarts the driver automatically if it exits unexpectedly
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://hadoop001:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 2G \
  --total-executor-cores 10 \
  /usr/app/spark-2.4.0-bin-hadoop2.6/examples/jars/spark-examples_2.11-2.4.0.jar \
  100

3.5 Optional Configuration

A common problem when submitting a job on a virtual machine is that the job cannot obtain sufficient resources:

Initial job has not accepted any resources; 
check your cluster UI to ensure that workers are registered and have sufficient resources


At this point, check the Web UI: either lower the executor-memory requested in the submit command, or increase the memory available to the Worker, which defaults to the host's total available memory minus 1 GB.
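For example, on a small virtual machine you might resubmit with a request that fits inside the Worker's advertised memory (the values here are illustrative):

# Request resources that fit on a small VM
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://hadoop001:7077 \
  --executor-memory 512M \
  --total-executor-cores 2 \
  /usr/app/spark-2.4.0-bin-hadoop2.6/examples/jars/spark-examples_2.11-2.4.0.jar \
  100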


All optional configurations for the Master and Worker nodes are listed below; they can be set in spark-env.sh:

| Environment Variable | Meaning |
| --- | --- |
| SPARK_MASTER_HOST | Host address the Master binds to |
| SPARK_MASTER_PORT | Master port (default: 7077) |
| SPARK_MASTER_WEBUI_PORT | Master Web UI port (default: 8080) |
| SPARK_MASTER_OPTS | Configuration properties that apply only to the Master, in the format "-Dx=y" (default: none). For all properties, see the official documentation: spark-standalone-mode |
| SPARK_LOCAL_DIRS | Spark's temporary storage directory, used for map output files and RDDs persisted to disk. Separate multiple directories with commas |
| SPARK_WORKER_CORES | Number of CPU cores a Worker node may use (default: all available) |
| SPARK_WORKER_MEMORY | Amount of memory a Worker node may use (default: total memory minus 1 GB) |
| SPARK_WORKER_PORT | Worker port (default: random) |
| SPARK_WORKER_WEBUI_PORT | Worker Web UI port (default: 8081) |
| SPARK_WORKER_DIR | Directory in which a Worker runs applications, containing logs and scratch space (default: SPARK_HOME/work) |
| SPARK_WORKER_OPTS | Configuration properties that apply only to the Worker, in the format "-Dx=y" (default: none). For all properties, see the official documentation: spark-standalone-mode |
| SPARK_DAEMON_MEMORY | Memory allocated to the Spark Master and Worker daemons themselves (default: 1 GB) |
| SPARK_DAEMON_JAVA_OPTS | JVM options for the Spark Master and Worker daemons, in the format "-Dx=y" (default: none) |
| SPARK_PUBLIC_DNS | Public DNS name of the Spark Master and Workers (default: none) |
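As an illustration, a spark-env.sh fragment that caps each Worker at 2 CPU cores and 2 GB of memory might look like this (the values and the scratch directory are assumptions for a small test cluster):

# Limit the resources each Worker advertises to the Master
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=2g
# Keep Spark scratch space on a dedicated disk
SPARK_LOCAL_DIRS=/data/spark-tmp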

4. Spark on Yarn mode

Spark can submit jobs directly to Yarn; in this mode, Spark's own Master and Worker nodes do not need to be started.

4.1 Configuration

In spark-env.sh, specify the location of the Hadoop configuration directory using YARN_CONF_DIR or HADOOP_CONF_DIR:

# Hadoop configuration directory
YARN_CONF_DIR=/usr/app/hadoop-2.6.0-cdh5.15.2/etc/hadoop
# JDK installation location
JAVA_HOME=/usr/java/jdk1.8.0_201

4.2 Startup

Ensure that Hadoop has been started, including YARN and HDFS. During calculation, Spark uses HDFS to store temporary files. If HDFS is not started, an exception will be thrown.

# start-yarn.sh
# start-dfs.sh
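A quick sanity check that both services are up before submitting (both commands ship with Hadoop):

# HDFS NameNode/DataNode report
hdfs dfsadmin -report
# Registered YARN NodeManagers
yarn node -list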

4.3 Submitting an Application

# Submit to the YARN cluster in client mode
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode client \
  --executor-memory 2G \
  --num-executors 10 \
  /usr/app/spark-2.4.0-bin-hadoop2.6/examples/jars/spark-examples_2.11-2.4.0.jar \
  100

# Submit to the YARN cluster in cluster mode
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 2G \
  --num-executors 10 \
  /usr/app/spark-2.4.0-bin-hadoop2.6/examples/jars/spark-examples_2.11-2.4.0.jar \
  100