
II. Basics

2. Cluster manager modes

(Diagram from the official Spark website.)

  1. Standalone: Spark's built-in standalone mode, typically used in non-production environments such as testing;
  2. Apache Mesos: a general-purpose cluster manager that can also run Hadoop MapReduce and other workloads;
  3. Hadoop YARN: the Hadoop resource manager, which lets Spark share cluster resources with other Hadoop components;
  4. Kubernetes (K8s): an open-source container orchestration system that supports automatic deployment and scaling (the master URL used to select each mode is sketched below).
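As a rough illustration, the sketch below (Scala) shows how the cluster manager is chosen through the master URL passed to SparkSession. The host names and ports are placeholders; in practice the URL is usually supplied via spark-submit --master rather than hard-coded.

```scala
// Minimal sketch: selecting a cluster manager via the master URL.
// All host names/ports below are placeholders, not real endpoints.
import org.apache.spark.sql.SparkSession

object MasterUrlExamples {
  def main(args: Array[String]): Unit = {
    // Pick ONE master URL; "local[*]" is used here as a test default.
    val master = args.headOption.getOrElse("local[*]")

    // "spark://master-host:7077"          -> Standalone cluster (port 7077)
    // "yarn"                              -> Hadoop YARN (needs HADOOP_CONF_DIR/YARN_CONF_DIR)
    // "mesos://mesos-host:5050"           -> Apache Mesos
    // "k8s://https://k8s-apiserver:6443"  -> Kubernetes

    val spark = SparkSession.builder()
      .appName("cluster-manager-demo")
      .master(master)
      .getOrCreate()

    println(s"Running against master: ${spark.sparkContext.master}")
    spark.stop()
  }
}
```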

3. Monitoring & security

(1) Standalone mode only
| From | To | Default Port | Purpose | Configuration Setting | Notes |
|------|----|--------------|---------|-----------------------|-------|
| Browser | Standalone Master | 8080 | Web UI | spark.master.ui.port / SPARK_MASTER_WEBUI_PORT | Jetty-based. Standalone mode only. |
| Browser | Standalone Worker | 8081 | Web UI | spark.worker.ui.port / SPARK_WORKER_WEBUI_PORT | Jetty-based. Standalone mode only. |
| Driver / Standalone Worker | Standalone Master | 7077 | Submit job to cluster / Join cluster | SPARK_MASTER_PORT | Set to "0" to choose a port randomly. Standalone mode only. |
| External Service | Standalone Master | 6066 | Submit job to cluster via REST API | spark.master.rest.port | Use spark.master.rest.enabled to enable/disable this service. Standalone mode only. |
| Standalone Master | Standalone Worker | (random) | Schedule executors | SPARK_WORKER_PORT | Set to "0" to choose a port randomly. Standalone mode only. |
(2) All cluster managers

| From | To | Default Port | Purpose | Configuration Setting | Notes |
|------|----|--------------|---------|-----------------------|-------|
| Browser | Application | 4040 | Web UI | spark.ui.port | Jetty-based |
| Browser | History Server | 18080 | Web UI | spark.history.ui.port | Jetty-based |
| Executor / Standalone Master | Driver | (random) | Connect to application / Notify executor state changes | spark.driver.port | Set to "0" to choose a port randomly. |
| Executor / Driver | Executor / Driver | (random) | Block Manager port | spark.blockManager.port | Raw socket via ServerSocketChannel |
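If the defaults clash with other services, these ports can be overridden in the application configuration. The following is a minimal sketch assuming a local test run; the concrete port numbers are arbitrary examples, not recommendations.

```scala
// Minimal sketch: pinning some of the ports from the tables above via SparkConf.
// Port numbers here are arbitrary illustrative values.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object PortConfigSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("port-config-demo")
      .setMaster("local[*]")                   // placeholder master for the sketch
      .set("spark.ui.port", "4050")            // application web UI (default 4040)
      .set("spark.driver.port", "35000")       // executors connect back to the driver here
      .set("spark.blockManager.port", "35010") // block manager data transfers

    val spark = SparkSession.builder().config(conf).getOrCreate()
    println(s"Web UI: ${spark.sparkContext.uiWebUrl.getOrElse("(UI disabled)")}")
    spark.stop()
  }
}
```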

4. Execution flow

  1. The driver creates a SparkContext, which requests resources, allocates tasks, and monitors their execution;
  2. The resource manager allocates resources for the executors and starts the Executor processes;
  3. The SparkContext builds a DAG from the RDD dependencies and submits it to the DAGScheduler, which splits it into stages; each stage's TaskSet is then handed to the lower-level TaskScheduler;
  4. Executors request tasks from the SparkContext; the TaskScheduler dispatches tasks to the executors and ships the application code along with them;
  5. Executors run the tasks and report the results back to the TaskScheduler and DAGScheduler. When all tasks are complete, the results are written out and all resources are released (a minimal job illustrating these steps follows this list).
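To make the flow concrete, here is a minimal Scala job annotated against the steps above; the data and settings are illustrative only.

```scala
// Minimal sketch of the execution flow: the driver creates the SparkContext (step 1),
// transformations only build up the DAG (step 3), and the final action triggers the
// DAGScheduler/TaskScheduler to run tasks on executors (steps 4-5).
import org.apache.spark.{SparkConf, SparkContext}

object ExecutionFlowSketch {
  def main(args: Array[String]): Unit = {
    // Step 1: the driver creates the SparkContext.
    val sc = new SparkContext(new SparkConf().setAppName("flow-demo").setMaster("local[2]"))

    // Step 3: transformations are lazy; they only extend the RDD lineage / DAG.
    val squaresOfEven = sc.parallelize(1 to 1000, numSlices = 4)
      .filter(_ % 2 == 0)
      .map(n => n * n)

    // Steps 4-5: the action submits the DAG; the DAGScheduler splits it into stages,
    // the TaskScheduler ships tasks to executors, and results come back to the driver.
    val total = squaresOfEven.reduce(_ + _)
    println(s"Sum of squares of even numbers: $total")

    sc.stop() // resources are released when the application finishes
  }
}
```

Running it with local[2] keeps everything in one JVM, but the driver, DAGScheduler, and TaskScheduler play the same roles as on a real cluster.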

5. RDD dependencies

  1. Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD, either one parent partition mapping to one child partition or several parent partitions mapping to a single child partition;
  2. Wide dependency: one partition of the parent RDD is consumed by multiple partitions of the child RDD, which requires a shuffle;
  3. Stage division: chains of narrow dependencies are grouped into the same stage as far as possible so the computation can be pipelined, while wide dependencies mark stage boundaries (see the example after this list).
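The example below shows the two kinds of dependencies in a word-count-style job: map is a narrow dependency and is pipelined, while reduceByKey introduces a shuffle (wide dependency) and therefore a new stage; toDebugString prints the resulting lineage with its stage split. The data is made up for illustration.

```scala
// Minimal sketch of narrow vs. wide dependencies.
// map keeps a one-to-one partition mapping (narrow); reduceByKey shuffles data
// across partitions (wide) and introduces a stage boundary.
import org.apache.spark.{SparkConf, SparkContext}

object DependencySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dependency-demo").setMaster("local[2]"))

    val words  = sc.parallelize(Seq("spark", "rdd", "spark", "stage", "rdd", "spark"), 3)
    val pairs  = words.map(w => (w, 1))   // narrow dependency: pipelined with parallelize
    val counts = pairs.reduceByKey(_ + _) // wide (shuffle) dependency: a new stage starts here

    // The indentation in toDebugString reflects the stage split at the shuffle.
    println(counts.toDebugString)
    counts.collect().foreach(println)

    sc.stop()
  }
}
```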