
3. Introduction to the server

1. Hardware and deployment suggestions

  1. Deploy on the same machines as HDFS or, failing that, on the same LAN.
  2. Deploy separately from low-latency stores such as HBase.
  3. Use four to eight data disks per node, without RAID.
  4. On Linux, mount the disks with the noatime option to avoid unnecessary access-time writes.
  5. In the Spark configuration, set spark.local.dir to a comma-separated list of local disks; these disks may be shared with HDFS (see the sketch after this list).
  6. Provision 8 GiB to 200 GiB of memory per machine and allocate at most 75% of it to Spark.
  7. Use a 10-gigabit or faster network card.
  8. Provide at least 8-16 CPU cores per machine.
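To make suggestions 4 and 5 concrete, here is a minimal sketch; the data disk /dev/sdb1 and the mount points /mnt/disk1 and /mnt/disk2 are hypothetical placeholders, not recommendations:

# Mount a data disk with noatime so reads do not trigger access-time writes
sudo mount -o defaults,noatime /dev/sdb1 /mnt/disk1

# Point Spark's scratch space at a comma-separated list of local disks
# (appended to conf/spark-defaults.conf)
echo "spark.local.dir  /mnt/disk1/spark,/mnt/disk2/spark" >> "${SPARK_HOME}/conf/spark-defaults.conf"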

2. Environment variables

Environment variable    Meaning
JAVA_HOME    Location of the Java installation (if it is not on your default PATH).
PYSPARK_PYTHON    Python binary executable to use for PySpark in both the driver and the workers (default: python2.7 if available, otherwise python). The spark.pyspark.python property takes precedence if it is set.
PYSPARK_DRIVER_PYTHON    Python binary executable to use for PySpark in the driver only (default: PYSPARK_PYTHON). The spark.pyspark.driver.python property takes precedence if it is set.
SPARKR_DRIVER_R    R binary executable to use for the SparkR shell (default: R). The spark.r.shell.command property takes precedence if it is set.
SPARK_LOCAL_IP    IP address of the machine to bind to.
SPARK_PUBLIC_DNS    Hostname your Spark program advertises to other machines.
SPARK_MASTER_HOST    Bind the master to a specific hostname or IP address, for example a public one.
SPARK_MASTER_PORT    Start the master on a different port (default: 7077).
SPARK_MASTER_WEBUI_PORT    Port for the master web UI (default: 8080).
SPARK_MASTER_OPTS    Configuration properties that apply only to the master, in the form "-Dx=y" (default: none). See the list of possible options below.
SPARK_LOCAL_DIRS    Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. It should be on a fast, local disk in your system; it can also be a comma-separated list of directories on different disks.
SPARK_WORKER_CORES    Total number of cores that Spark applications may use on the machine (default: all available cores).
SPARK_WORKER_MEMORY    Total amount of memory that Spark applications may use on the machine, e.g. 1000m or 2g (default: total memory minus 1 GiB); note that each application's individual memory is configured with its spark.executor.memory property.
SPARK_WORKER_PORT    Start the Spark worker on a specific port (default: random).
SPARK_WORKER_WEBUI_PORT    Port for the worker web UI (default: 8081).
SPARK_WORKER_DIR    Directory to run applications in, which will include both logs and scratch space (default: SPARK_HOME/work).
SPARK_WORKER_OPTS    Configuration properties that apply only to the worker, in the form "-Dx=y" (default: none). See the list of possible options below.
SPARK_DAEMON_MEMORY    Memory to allocate to the Spark master and worker daemons themselves (default: 1 GiB).
SPARK_DAEMON_JAVA_OPTS    JVM options for the Spark master and worker daemons themselves, in the form "-Dx=y" (default: none).
SPARK_DAEMON_CLASSPATH    Classpath of the Spark master and worker daemons themselves (default: none).
SPARK_PUBLIC_DNS    Public DNS name of the Spark master and workers (default: none).
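These variables are usually set per node in conf/spark-env.sh, which the launch scripts source at startup. The snippet below is a minimal sketch; the hostname, core, memory, and directory values are illustrative placeholders, not tuned recommendations:

# conf/spark-env.sh
export SPARK_MASTER_HOST=master.example.com                 # hypothetical master hostname
export SPARK_WORKER_CORES=16                                # cores the worker may offer to applications
export SPARK_WORKER_MEMORY=24g                              # memory the worker may offer to applications
export SPARK_LOCAL_DIRS=/mnt/disk1/spark,/mnt/disk2/spark   # scratch space on fast local disks
export SPARK_WORKER_DIR=/var/lib/spark/work                 # application logs and work directories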

4. Introduction to the client

1. The key configuration

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster(...)               // cluster URL to connect to
  .setAppName(...)              // application name shown in the web UI
  .set("spark.cores.max", "10") // cap the total cores this application may use
val sc = new SparkContext(conf)
  1. spark.executor.cores sets the number of CPU cores each executor uses.
  2. spark.cores.max caps the total number of CPU cores the application may use across the cluster (see the sketch below).
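Both properties can also be supplied at submission time instead of being hard-coded in SparkConf. A minimal sketch, where the application class com.example.MyApp and the jar path are hypothetical placeholders:

# Give each executor 2 cores and cap the whole application at 10 cores
./bin/spark-submit \
  --class com.example.MyApp \
  --master spark://207.184.161.138:7077 \
  --conf spark.executor.cores=2 \
  --conf spark.cores.max=10 \
  /path/to/my-app.jar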

2. Submit a job

${SPARK_HOME}/bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

# Run application locally on 8 cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100

# Run on a Spark standalone cluster in client deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a Spark standalone cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a YARN cluster
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \  # can be client for client mode
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000

# Run a Python application on a Spark standalone cluster
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  examples/src/main/python/pi.py \
  1000

# Run on a Mesos cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  http://path/to/examples.jar \
  1000

# Run on a Kubernetes cluster in cluster deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master k8s://xx.yy.zz.ww:443 \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  http://path/to/examples.jar \
  1000