II. Basics
2. Cluster management modes
(Figure: cluster manager overview, from the official Spark website.)
- Standalone: Spark's built-in standalone mode, for non-production use such as testing;
- Apache Mesos: a general cluster manager that can also run Hadoop MapReduce;
- Hadoop YARN: the resource manager in Hadoop 2/3;
- K8S (Kubernetes): an open-source container orchestration tool that supports automatic deployment and scaling.
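In application code, the manager is selected purely through the master URL. A minimal Scala sketch (host names and ports are hypothetical placeholders):

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch: the cluster manager is chosen entirely by the master URL
// passed to the application (or to spark-submit's --master flag).
object MasterUrlDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cluster-manager-demo")
      .master("local[*]")                       // local threads, for testing
      // .master("spark://master-host:7077")    // Standalone
      // .master("mesos://mesos-host:5050")     // Apache Mesos
      // .master("yarn")                        // Hadoop YARN (reads HADOOP_CONF_DIR)
      // .master("k8s://https://api-host:6443") // Kubernetes
      .getOrCreate()

    println(spark.sparkContext.master) // prints the effective master URL
    spark.stop()
  }
}
```

The same URLs work with the --master flag of spark-submit.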
3. Monitoring & security
(1) Standalone mode only
| From | To | Default Port | Purpose | Configuration Setting | Notes |
| --- | --- | --- | --- | --- | --- |
| Browser | Standalone Master | 8080 | Web UI | spark.master.ui.port / SPARK_MASTER_WEBUI_PORT | Jetty-based. Standalone mode only. |
| Browser | Standalone Worker | 8081 | Web UI | spark.worker.ui.port / SPARK_WORKER_WEBUI_PORT | Jetty-based. Standalone mode only. |
| Driver / Standalone Worker | Standalone Master | 7077 | Submit job to cluster / Join cluster | SPARK_MASTER_PORT | Set to "0" to choose a port randomly. Standalone mode only. |
| External Service | Standalone Master | 6066 | Submit job to cluster via REST API | spark.master.rest.port | Use spark.master.rest.enabled to enable/disable this service. Standalone mode only. |
| Standalone Master | Standalone Worker | (random) | Schedule executors | SPARK_WORKER_PORT | Set to "0" to choose a port randomly. Standalone mode only. |
(2) All cluster managers
| From | To | Default Port | Purpose | Configuration Setting | Notes |
| --- | --- | --- | --- | --- | --- |
| Browser | Application | 4040 | Web UI | spark.ui.port | Jetty-based |
| Browser | History Server | 18080 | Web UI | spark.history.ui.port | Jetty-based |
| Executor / Standalone Master | Driver | (random) | Connect to application / Notify executor state changes | spark.driver.port | Set to "0" to choose a port randomly. |
| Executor / Driver | Executor / Driver | (random) | Block Manager port | spark.blockManager.port | Raw socket via ServerSocketChannel |
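When firewalls are involved, the per-application ports above can be pinned instead of left random. A minimal sketch (the port values are arbitrary examples, not recommendations):

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch (arbitrary example ports): pinning the per-application
// ports from the tables above so a firewall can whitelist them. The standalone
// daemon ports (SPARK_MASTER_PORT, SPARK_WORKER_PORT, the web UI ports) are
// set in spark-env.sh instead, since they belong to the cluster, not the app.
object FixedPortsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("fixed-ports-demo")
      .master("local[*]")
      .config("spark.ui.port", "4040")            // application web UI
      .config("spark.driver.port", "40000")       // executors connect back here
      .config("spark.blockManager.port", "40001") // block manager transfers
      .getOrCreate()

    spark.stop()
  }
}
```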
4. Execution flow
- The Driver creates a SparkContext, which applies for resources, assigns tasks, and monitors their execution.
- The resource manager allocates resources to the application and starts the Executor processes.
- SparkContext builds a DAG from the RDD dependencies and submits it to the DAGScheduler, which parses the DAG into stages; each stage's TaskSet is then handed to the underlying TaskScheduler.
- Executors request tasks from SparkContext; the TaskScheduler dispatches tasks to the Executors, and SparkContext also ships the application code to them.
- Executors run the tasks and report results back to the TaskScheduler and DAGScheduler; when all tasks complete, the results are written out and all resources are released.
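A small example makes the flow concrete. In this sketch (made-up data and app name), the collect() action submits a job, and the shuffle introduced by reduceByKey causes the DAGScheduler to split the DAG into two stages:

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch: the action triggers job submission through the
// DAGScheduler; the shuffle from reduceByKey creates a stage boundary.
object DagDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dag-demo").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val words  = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _) // shuffle => stage boundary
    counts.collect().foreach(println)                      // action => job submission

    spark.stop()
  }
}
```

The two stages are visible on the application web UI (port 4040) while the job runs.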
5. RDD dependencies
- Narrow dependency: one parent RDD partition corresponds to one child RDD partition, or several parent RDD partitions correspond to a single child RDD partition;
- Wide dependency: one partition of a parent RDD corresponds to multiple partitions of a child RDD;
- Stage division: operators connected by narrow dependencies are placed in the same stage as far as possible, enabling pipelined computation; a wide dependency forces a stage boundary (see the sketch below).
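The difference is easy to observe from an RDD's lineage: toDebugString marks the shuffle that a wide dependency introduces. A minimal sketch:

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch: map and filter are narrow dependencies (no shuffle),
// while groupByKey is a wide dependency; toDebugString prints the lineage
// with the shuffle (stage) boundary.
object DependencyDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dep-demo").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val pairs  = sc.parallelize(1 to 100, 4).map(n => (n % 10, n)) // narrow
    val evens  = pairs.filter { case (_, n) => n % 2 == 0 }        // narrow
    val groups = evens.groupByKey()                                // wide: shuffle

    println(groups.toDebugString) // ShuffledRDD appears at the stage boundary

    spark.stop()
  }
}
```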