1. Job submission

1.1 spark-submit

In all Spark modes, the spark-submit command is used to submit a job in the following format:

# General syntax of spark-submit
./bin/spark-submit \
  --class <main-class> \        # entry class of the application
  --master <master-url> \       # master URL of the cluster
  --deploy-mode <deploy-mode> \ # deploy mode
  --conf <key>=<value> \        # optional configuration
  ...                           # other options
  <application-jar> \           # path to the application jar
  [application-arguments]       # arguments passed to the main class

Note that in a cluster environment, application-jar must be accessible to all nodes in the cluster: it can be an HDFS path or a local file system path. If a local path is used, the jar must exist at the same path on every node in the cluster.
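As a sketch of the HDFS approach (the target directory and the use of Yarn as the master are illustrative assumptions, not part of the walkthrough):

# Upload the jar to HDFS so that every node reads the same copy
hdfs dfs -mkdir -p /spark-jars
hdfs dfs -put /usr/app/spark-2.4.0-bin-hadoop2.6/examples/jars/spark-examples_2.11-2.4.0.jar /spark-jars/

# Reference the HDFS path instead of a local path when submitting
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  hdfs:///spark-jars/spark-examples_2.11-2.4.0.jar \
  100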

1.2 deploy-mode

deploy-mode takes one of two values, cluster or client; the default is client. Both are described below using Spark on Yarn mode as an example.

  • In cluster mode, the Spark Driver runs inside the ApplicationMaster process, which is managed by YARN on the cluster; the client that submits the job can exit once the application has started (see the note after this list).
  • In client mode, the Spark Driver runs in the client process that submits the job, and the ApplicationMaster is only used to request resources from YARN.
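A practical consequence of cluster mode is that the Driver's console output is not printed on the submitting client. Assuming YARN log aggregation is enabled, it can be retrieved after the fact with yarn logs (the application ID below is a placeholder):

# Fetch the aggregated logs of a finished application
yarn logs -applicationId application_1570000000000_0001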

1.3 master-url

All optional values of master-url are shown in the following table:

| Master URL | Meaning |
| --- | --- |
| local | Run Spark locally with one worker thread |
| local[K] | Run Spark locally with K worker threads |
| local[K,F] | Run Spark locally with K worker threads, where the second parameter F is the number of failed retries per task |
| local[*] | Run Spark locally with as many worker threads as there are CPU cores |
| local[*,F] | Run Spark locally with as many worker threads as there are CPU cores, where the second parameter F is the number of failed retries per task |
| spark://HOST:PORT | Connect to the Master of the given standalone cluster. The default port is 7077 |
| spark://HOST1:PORT1,HOST2:PORT2 | For a standalone cluster made highly available with Zookeeper, list all master host addresses registered in Zookeeper |
| mesos://HOST:PORT | Connect to the given Mesos cluster. The default port is 5050. For Mesos clusters that use ZooKeeper, specify the address as mesos://zk://... and submit with --deploy-mode cluster |
| yarn | Connect to a YARN cluster, whose location is taken from the HADOOP_CONF_DIR or YARN_CONF_DIR variable. Use the --deploy-mode parameter to choose client or cluster mode |

The following sections describe the three common deployment modes and how jobs are submitted in each.

2. Local mode

It is easiest to submit a job in Local mode and no configuration is required. The submit command is as follows:

# Submit an application in local mode
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[2] \
  /usr/app/spark-2.4.0-bin-hadoop2.6/examples/jars/spark-examples_2.11-2.4.0.jar \
  100   # argument passed to SparkPi

spark-examples_2.11-2.4.0.jar is a test-case package that ships with Spark, and SparkPi computes an approximation of Pi.
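If the submission succeeds, the console output contains a line like the one below; the exact digits vary from run to run because SparkPi estimates Pi by random sampling:

Pi is roughly 3.1416151141615114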

3. Standalone mode

Standalone is Spark's built-in cluster mode, managed by its own built-in resource manager. The following builds a cluster of one Master and two Worker nodes using two hosts:

  • hadoop001: since there are only two hosts, hadoop001 acts as both the Master node and a Worker node;
  • hadoop002: Worker node only.

3.1 Environment Configuration

Make sure Spark is decompressed to the same path on both hosts. Then go to ${SPARK_HOME}/conf/ on hadoop001 and copy the configuration template:

# cp spark-env.sh.template spark-env.sh

Edit spark-env.sh to configure the JDK installation directory, then distribute the file to hadoop002 with the scp command (a sketch follows the snippet):

# JDK installation location
JAVA_HOME=/usr/java/jdk1.8.0_201
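A minimal distribution command, assuming Spark is installed at /usr/app/spark-2.4.0-bin-hadoop2.6 on both hosts:

# Copy the edited spark-env.sh to the same location on hadoop002
scp /usr/app/spark-2.4.0-bin-hadoop2.6/conf/spark-env.sh \
    hadoop002:/usr/app/spark-2.4.0-bin-hadoop2.6/conf/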

3.2 Cluster Configuration

In ${SPARK_HOME}/conf/, copy the sample cluster configuration and configure it:

# cp slaves.template slaves

Specify the hostname of all Worker nodes:

# A Spark Worker will be started on each of the machines listed below.
hadoop001
hadoop002

Here are three points to note:

  • The mapping between each host name and its IP address must be configured in /etc/hosts; otherwise, use IP addresses directly;
  • Each host name must be on its own line;
  • Spark's Master accesses all Worker nodes over SSH, so passwordless SSH login must be configured in advance (see the sketch after this list).
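A minimal sketch of the passwordless login, run as the same user on hadoop001 (since hadoop001 is also a Worker, it must be able to SSH to itself as well):

# Generate a key pair (accept the defaults), then install the public key on both nodes
ssh-keygen -t rsa
ssh-copy-id hadoop001
ssh-copy-id hadoop002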

3.3 Startup

Run start-all.sh to start Master and all Worker services.

./sbin/start-all.sh

Access port 8080 to view Spark's Web UI; it should show two alive Worker nodes.
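The daemons can also be verified from the shell with jps, which ships with the JDK (the process IDs below are illustrative):

# On hadoop001, both Master and Worker should be running
$ jps
2324 Master
2445 Worker

# On hadoop002, only a Worker
$ jps
1987 Worker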

3.4 Submitting a Job

# Submit to the standalone cluster in client mode
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://hadoop001:7077 \
  --executor-memory 2G \
  --total-executor-cores 10 \
  /usr/app/spark-2.4.0-bin-hadoop2.6/examples/jars/spark-examples_2.11-2.4.0.jar \
  100

# Submit to the standalone cluster in cluster mode
# --supervise restarts the driver automatically if it exits unexpectedly
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://hadoop001:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 2G \
  --total-executor-cores 10 \
  /usr/app/spark-2.4.0-bin-hadoop2.6/examples/jars/spark-examples_2.11-2.4.0.jar \
  100

3.5 Optional Configuration

A common problem when submitting a job on a virtual machine is that the job cannot obtain sufficient resources:

Initial job has not accepted any resources; 
check your cluster UI to ensure that workers are registered and have sufficient resources


At this point, check the Web UI: either lower the executor-memory requested in the submit command, or increase the memory available to the Worker, which defaults to the host's total available memory minus 1 GB.
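For example, on a small virtual machine you might resubmit with a request that fits inside the Worker's advertised memory (the values here are illustrative):

# Request resources that fit on a small VM
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://hadoop001:7077 \
  --executor-memory 512M \
  --total-executor-cores 2 \
  /usr/app/spark-2.4.0-bin-hadoop2.6/examples/jars/spark-examples_2.11-2.4.0.jar \
  100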


All optional configurations for the Master and Worker nodes are listed below; they can be set in spark-env.sh:

| Environment Variable | Meaning |
| --- | --- |
| SPARK_MASTER_HOST | Host address the Master binds to |
| SPARK_MASTER_PORT | Master port (default: 7077) |
| SPARK_MASTER_WEBUI_PORT | Master Web UI port (default: 8080) |
| SPARK_MASTER_OPTS | Configuration properties that apply only to the Master, in the format "-Dx=y" (default: none). For all properties, see the official documentation: spark-standalone-mode |
| SPARK_LOCAL_DIRS | Spark's temporary storage directory, used for map output files and RDDs persisted to disk. Separate multiple directories with commas |
| SPARK_WORKER_CORES | Number of CPU cores a Worker node may use (default: all available) |
| SPARK_WORKER_MEMORY | Amount of memory a Worker node may use (default: total memory minus 1 GB) |
| SPARK_WORKER_PORT | Worker port (default: random) |
| SPARK_WORKER_WEBUI_PORT | Worker Web UI port (default: 8081) |
| SPARK_WORKER_DIR | Directory in which a Worker runs applications, containing logs and scratch space (default: SPARK_HOME/work) |
| SPARK_WORKER_OPTS | Configuration properties that apply only to the Worker, in the format "-Dx=y" (default: none). For all properties, see the official documentation: spark-standalone-mode |
| SPARK_DAEMON_MEMORY | Memory allocated to the Spark Master and Worker daemons themselves (default: 1 GB) |
| SPARK_DAEMON_JAVA_OPTS | JVM options for the Spark Master and Worker daemons, in the format "-Dx=y" (default: none) |
| SPARK_PUBLIC_DNS | Public DNS name of the Spark Master and Workers (default: none) |
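As an illustration, a spark-env.sh fragment that caps each Worker at 2 CPU cores and 2 GB of memory might look like this (the values and the scratch directory are assumptions for a small test cluster):

# Limit the resources each Worker advertises to the Master
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=2g
# Keep Spark scratch space on a dedicated disk
SPARK_LOCAL_DIRS=/data/spark-tmp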

4. Spark on Yarn mode

Spark can submit jobs directly to Yarn; in this mode, Spark's own Master and Worker nodes do not need to be started.

4.1 Configuration

In spark-env.sh, specify the location of the Hadoop configuration directory using YARN_CONF_DIR or HADOOP_CONF_DIR:

# Hadoop configuration directory
YARN_CONF_DIR=/usr/app/hadoop-2.6.0-cdh5.15.2/etc/hadoop
# JDK installation location
JAVA_HOME=/usr/java/jdk1.8.0_201

4.2 Startup

Ensure that Hadoop has been started, including YARN and HDFS. During calculation, Spark uses HDFS to store temporary files. If HDFS is not started, an exception will be thrown.

# start-yarn.sh
# start-dfs.sh
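A quick sanity check that both services are up before submitting (both commands ship with Hadoop):

# HDFS NameNode/DataNode report
hdfs dfsadmin -report
# Registered YARN NodeManagers
yarn node -list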

4.3 Submitting an Application

# Submit to the YARN cluster in client mode
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode client \
  --executor-memory 2G \
  --num-executors 10 \
  /usr/app/spark-2.4.0-bin-hadoop2.6/examples/jars/spark-examples_2.11-2.4.0.jar \
  100

# Submit to the YARN cluster in cluster mode
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 2G \
  --num-executors 10 \
  /usr/app/spark-2.4.0-bin-hadoop2.6/examples/jars/spark-examples_2.11-2.4.0.jar \
  100