II. Basics
2. Cluster management modes
(Figure: cluster manager overview, from the official Spark website.)
- Standalone: Spark's built-in standalone mode, for non-production use such as testing;
- Apache Mesos: a general cluster manager that can also run Hadoop MapReduce;
- Hadoop YARN: the resource manager in Hadoop 2/3;
- K8S (Kubernetes): an open-source container orchestration tool that supports automatic deployment and scaling.
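In application code, the manager is selected purely through the master URL. A minimal Scala sketch (host names and ports are hypothetical placeholders):

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch: the cluster manager is chosen entirely by the master URL
// passed to the application (or to spark-submit's --master flag).
object MasterUrlDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cluster-manager-demo")
      .master("local[*]")                       // local threads, for testing
      // .master("spark://master-host:7077")    // Standalone
      // .master("mesos://mesos-host:5050")     // Apache Mesos
      // .master("yarn")                        // Hadoop YARN (reads HADOOP_CONF_DIR)
      // .master("k8s://https://api-host:6443") // Kubernetes
      .getOrCreate()

    println(spark.sparkContext.master) // prints the effective master URL
    spark.stop()
  }
}
```

The same URLs work with the --master flag of spark-submit.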
3. Monitoring & security
(1) Standalone mode only
| From | To | Default Port | Purpose | Configuration Setting | Notes |
| --- | --- | --- | --- | --- | --- |
| Browser | Standalone Master | 8080 | Web UI | spark.master.ui.port / SPARK_MASTER_WEBUI_PORT | Jetty-based. Standalone mode only. |
| Browser | Standalone Worker | 8081 | Web UI | spark.worker.ui.port / SPARK_WORKER_WEBUI_PORT | Jetty-based. Standalone mode only. |
| Driver / Standalone Worker | Standalone Master | 7077 | Submit job to cluster / Join cluster | SPARK_MASTER_PORT | Set to "0" to choose a port randomly. Standalone mode only. |
| External Service | Standalone Master | 6066 | Submit job to cluster via REST API | spark.master.rest.port | Use spark.master.rest.enabled to enable/disable this service. Standalone mode only. |
| Standalone Master | Standalone Worker | (random) | Schedule executors | SPARK_WORKER_PORT | Set to "0" to choose a port randomly. Standalone mode only. |
(2) All cluster managers
| From | To | Default Port | Purpose | Configuration Setting | Notes |
| --- | --- | --- | --- | --- | --- |
| Browser | Application | 4040 | Web UI | spark.ui.port | Jetty-based |
| Browser | History Server | 18080 | Web UI | spark.history.ui.port | Jetty-based |
| Executor / Standalone Master | Driver | (random) | Connect to application / Notify executor state changes | spark.driver.port | Set to "0" to choose a port randomly. |
| Executor / Driver | Executor / Driver | (random) | Block Manager port | spark.blockManager.port | Raw socket via ServerSocketChannel |
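When firewalls are involved, the per-application ports above can be pinned instead of left random. A minimal sketch (the port values are arbitrary examples, not recommendations):

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch (arbitrary example ports): pinning the per-application
// ports from the tables above so a firewall can whitelist them. The standalone
// daemon ports (SPARK_MASTER_PORT, SPARK_WORKER_PORT, the web UI ports) are
// set in spark-env.sh instead, since they belong to the cluster, not the app.
object FixedPortsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("fixed-ports-demo")
      .master("local[*]")
      .config("spark.ui.port", "4040")            // application web UI
      .config("spark.driver.port", "40000")       // executors connect back here
      .config("spark.blockManager.port", "40001") // block manager transfers
      .getOrCreate()

    spark.stop()
  }
}
```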
4. Execution flow
- The Driver creates a SparkContext, which applies for resources, assigns tasks, and monitors their execution.
- The resource manager allocates resources to the application and starts the Executor processes.
- SparkContext builds a DAG from the RDD dependencies and submits it to the DAGScheduler, which parses the DAG into stages; each stage's TaskSet is then handed to the underlying TaskScheduler.
- Executors request tasks from SparkContext; the TaskScheduler dispatches tasks to the Executors, and SparkContext also ships the application code to them.
- Executors run the tasks and report results back to the TaskScheduler and DAGScheduler; when all tasks complete, the results are written out and all resources are released.
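A small example makes the flow concrete. In this sketch (made-up data and app name), the collect() action submits a job, and the shuffle introduced by reduceByKey causes the DAGScheduler to split the DAG into two stages:

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch: the action triggers job submission through the
// DAGScheduler; the shuffle from reduceByKey creates a stage boundary.
object DagDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dag-demo").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val words  = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _) // shuffle => stage boundary
    counts.collect().foreach(println)                      // action => job submission

    spark.stop()
  }
}
```

The two stages are visible on the application web UI (port 4040) while the job runs.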
5. RDD dependencies
- Narrow dependency: one parent RDD partition corresponds to one child RDD partition, or several parent RDD partitions correspond to a single child RDD partition;
- Wide dependency: one partition of a parent RDD corresponds to multiple partitions of a child RDD;
- Stage division: operators connected by narrow dependencies are placed in the same stage as far as possible, enabling pipelined computation; a wide dependency forces a stage boundary (see the sketch below).
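The difference is easy to observe from an RDD's lineage: toDebugString marks the shuffle that a wide dependency introduces. A minimal sketch:

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch: map and filter are narrow dependencies (no shuffle),
// while groupByKey is a wide dependency; toDebugString prints the lineage
// with the shuffle (stage) boundary.
object DependencyDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dep-demo").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val pairs  = sc.parallelize(1 to 100, 4).map(n => (n % 10, n)) // narrow
    val evens  = pairs.filter { case (_, n) => n % 2 == 0 }        // narrow
    val groups = evens.groupByKey()                                // wide: shuffle

    println(groups.toDebugString) // ShuffledRDD appears at the stage boundary

    spark.stop()
  }
}
```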