3. Introduction to the server

1. Hardware and deployment suggestions

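  1. Deploy Spark on the same machines as HDFS; if that is not possible, deploy it on the same LAN.
  2. Deploy Spark separately from low-latency stores such as HBase, so that computing jobs do not interfere with storage.
  3. Use four to eight disks per node, configured without RAID.
  4. On Linux, mount the disks with the noatime option to reduce unnecessary writes.
  5. In the Spark configuration, set spark.local.dir to a comma-separated list of the local disks. These can be the same disks that HDFS uses.
  6. Provide between 8 GiB and 200 GiB of memory per machine, and allocate at most 75% of it to Spark, leaving the rest for the operating system and buffer cache.
  7. Use a 10 Gigabit network card.
  8. Provide at least 8 to 16 CPU cores per machine.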
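
As a rough illustration of items 5 and 6, the sketch below derives a 75% memory budget and a comma-separated spark.local.dir value. The function names and mount points are hypothetical and only serve the example.

```shell
# Sketch only: spark_memory_mib, join_local_dirs and the /mnt/... paths
# are assumptions made for illustration, not part of Spark.

# Print 75% of the given total memory (in MiB), rounded down.
spark_memory_mib() {
  local total_mib=$1
  echo $(( total_mib * 75 / 100 ))
}

# Join the given directories with commas, the format spark.local.dir expects.
join_local_dirs() {
  local IFS=,
  echo "$*"
}

spark_memory_mib 65536                              # 64 GiB machine, prints 49152
join_local_dirs /mnt/disk1/spark /mnt/disk2/spark   # prints /mnt/disk1/spark,/mnt/disk2/spark
```

The first value could feed a memory setting such as SPARK_WORKER_MEMORY, and the second could be used as the value of spark.local.dir.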

2. Environment variables

| Environment variable | Meaning |
| --- | --- |
| JAVA_HOME | Location where Java is installed (if it is not on your default PATH). |
| PYSPARK_PYTHON | Python binary executable to use for PySpark in both the driver and the workers (default is python2.7 if available, otherwise python). The property spark.pyspark.python takes precedence if it is set. |
| PYSPARK_DRIVER_PYTHON | Python binary executable to use for PySpark in the driver only (default is PYSPARK_PYTHON). The property spark.pyspark.driver.python takes precedence if it is set. |
| SPARKR_DRIVER_R | R binary executable to use for the SparkR shell (default is R). The property spark.r.shell.command takes precedence if it is set. |
| SPARK_LOCAL_IP | IP address of the machine to bind to. |
| SPARK_PUBLIC_DNS | Hostname that your Spark program advertises to other machines. |
| SPARK_MASTER_HOST | Bind the master to a specific hostname or IP address, for example a public one. |
| SPARK_MASTER_PORT | Start the master on a different port (default: 7077). |
| SPARK_MASTER_WEBUI_PORT | Port for the master web UI (default: 8080). |
| SPARK_MASTER_OPTS | Configuration properties that apply only to the master, in the form "-Dx=y" (default: none). The possible options are listed in the Spark standalone documentation. |
| SPARK_LOCAL_DIRS | Directory to use for "scratch" space in Spark, including map output files and RDDs that are stored on disk. It should be on a fast local disk in your system. It can also be a comma-separated list of directories on different disks. |
| SPARK_WORKER_CORES | Total number of cores that Spark applications may use on the machine (default: all available cores). |
| SPARK_WORKER_MEMORY | Total amount of memory that Spark applications may use on the machine, for example 1000m or 2g (default: total memory minus 1 GiB). Note that each application's individual memory is configured through its spark.executor.memory property. |
| SPARK_WORKER_PORT | Start the Spark worker on a specific port (default: random). |
| SPARK_WORKER_WEBUI_PORT | Port for the worker web UI (default: 8081). |
| SPARK_WORKER_DIR | Directory in which to run applications, which will include both logs and scratch space (default: SPARK_HOME/work). |
| SPARK_WORKER_OPTS | Configuration properties that apply only to the worker, in the form "-Dx=y" (default: none). The possible options are listed in the Spark standalone documentation. |
| SPARK_DAEMON_MEMORY | Memory to allocate to the Spark master and worker daemons themselves (default: 1g). |
| SPARK_DAEMON_JAVA_OPTS | JVM options for the Spark master and worker daemons themselves, in the form "-Dx=y" (default: none). |
| SPARK_DAEMON_CLASSPATH | Classpath for the Spark master and worker daemons themselves (default: none). |
| SPARK_PUBLIC_DNS | In standalone mode, the public DNS name of the Spark master and workers (default: none). |
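
These variables are typically set in conf/spark-env.sh on each node. The fragment below is a minimal sketch of such a file; every value in it (paths, hostname, sizes) is an illustrative assumption and should be replaced with values that match your own machines.

```shell
# conf/spark-env.sh (sketch; all values are illustrative assumptions)

# Java installation to use when it is not on the default PATH.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk

# Scratch space spread over two local disks.
export SPARK_LOCAL_DIRS=/mnt/disk1/spark,/mnt/disk2/spark

# Master binding and ports (the ports shown are the documented defaults).
export SPARK_MASTER_HOST=spark-master.example.internal
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080

# Resources each worker offers to applications.
export SPARK_WORKER_CORES=16
export SPARK_WORKER_MEMORY=48g

# Memory for the master and worker daemons themselves.
export SPARK_DAEMON_MEMORY=1g
```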

4. Introduction to the client

1. Key configuration

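    import org.apache.spark.{SparkConf, SparkContext}

    // Cap the total number of cores this application may use across the cluster.
    val conf = new SparkConf()
      .set("spark.cores.max", "10")
    val sc = new SparkContext(conf)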
  1. spark.executor.cores: the number of CPU cores used by each executor.
  2. spark.cores.max: the maximum total number of CPU cores an application may request from across the cluster (standalone and Mesos coarse-grained modes).
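
To see how the two properties interact, the sketch below estimates an upper bound on the number of executors one application can obtain in standalone mode, and prints the matching --conf arguments for spark-submit. It assumes the workers have enough free cores and memory; the function names are hypothetical.

```shell
# Sketch only: max_executors and conf_args are hypothetical helpers.

# Upper bound on executors: total core cap divided by cores per executor,
# rounded down. Actual numbers also depend on free worker resources.
max_executors() {
  local cores_max=$1 executor_cores=$2
  echo $(( cores_max / executor_cores ))
}

# Print the two properties as spark-submit --conf arguments.
conf_args() {
  printf '%s %s\n' "--conf spark.cores.max=$1" "--conf spark.executor.cores=$2"
}

max_executors 10 4   # prints 2
conf_args 10 4       # prints --conf spark.cores.max=10 --conf spark.executor.cores=4
```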

2. Submit a job

    ${SPARK_HOME}/bin/spark-submit \
      --class <main-class> \
      --master <master-url> \
      --deploy-mode <deploy-mode> \
      --conf <key>=<value> \
      ... # other options
      <application-jar> \
      [application-arguments]

    # Run application locally on 8 cores
    ./bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master local[8] \
      /path/to/examples.jar \
      100

    # Run on a Spark standalone cluster in client deploy mode
    ./bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master spark://<master-host>:7077 \
      --executor-memory 20G \
      --total-executor-cores 100 \
      /path/to/examples.jar \
      1000

    # Run on a Spark standalone cluster in cluster deploy mode with supervise
    ./bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master spark://<master-host>:7077 \
      --deploy-mode cluster \
      --supervise \
      --executor-memory 20G \
      --total-executor-cores 100 \
      /path/to/examples.jar \
      1000

    # Run on a YARN cluster (--deploy-mode can be client for client mode)
    export HADOOP_CONF_DIR=XXX
    ./bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master yarn \
      --deploy-mode cluster \
      --executor-memory 20G \
      --num-executors 50 \
      /path/to/examples.jar \
      1000

    # Run a Python application on a Spark standalone cluster
    ./bin/spark-submit \
      --master spark://<master-host>:7077 \
      examples/src/main/python/<python-file> \
      1000

    # Run on a Mesos cluster in cluster deploy mode with supervise
    ./bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master mesos://<master-host>:<port> \
      --deploy-mode cluster \
      --supervise \
      --executor-memory 20G \
      --total-executor-cores 100 \
      http://path/to/examples.jar \
      1000

    # Run on a Kubernetes cluster in cluster deploy mode
    ./bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master k8s://xx.yy.zz.ww:443 \
      --deploy-mode cluster \
      --executor-memory 20G \
      --num-executors 50 \
      http://path/to/examples.jar \
      1000
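
The submissions above differ mainly in the master URL and a few resource flags. One way to check a command before sending it to a cluster is a dry run that only prints it. The helper below is a minimal sketch under that assumption; build_submit_cmd is a hypothetical name, and it falls back to the current directory when SPARK_HOME is not set.

```shell
# Dry-run sketch: print the spark-submit command instead of executing it.
# build_submit_cmd and its argument order are assumptions for illustration.
build_submit_cmd() {
  master_url=$1   # e.g. local[8], yarn, or a spark:// URL
  app_jar=$2      # path or URL of the application jar
  shift 2         # remaining arguments are passed to the application
  echo "${SPARK_HOME:-.}/bin/spark-submit" \
    --class org.apache.spark.examples.SparkPi \
    --master "$master_url" \
    "$app_jar" "$@"
}

build_submit_cmd 'local[8]' /path/to/examples.jar 100
```

Once the printed command looks right, it can be run directly or the echo can be replaced with the actual invocation.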