Spark Technology Insider: Stage Partitioning and Submission Source Code Analysis
Narrow and wide dependencies
Narrow dependency:
Each partition of the parent RDD is used by at most one partition of the child RDD. There are two cases: (1) one parent partition corresponds to one child partition, and (2) the partitions of two parent RDDs (one from each parent) correspond to one child partition. In the figure, map/filter and union belong to case 1, and join with co-partitioned inputs belongs to case 2.
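A minimal sketch of both cases (the local setup, partitioner, and sample data are illustrative, not from the source):

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("deps-demo").setMaster("local[*]"))

// Case 1: one parent partition -> one child partition (map/filter/union)
val nums     = sc.parallelize(1 to 8, 4)
val doubled  = nums.map(_ * 2)        // OneToOneDependency
val filtered = doubled.filter(_ > 4)  // OneToOneDependency

// Case 2: one partition from each co-partitioned parent -> one child partition
val part  = new HashPartitioner(4)
val left  = sc.parallelize(Seq(1 -> "a", 2 -> "b")).partitionBy(part)
val right = sc.parallelize(Seq(1 -> "c", 2 -> "d")).partitionBy(part)
val joined = left.join(right)         // co-partitioned join: still a narrow dependency
```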
Wide dependency:
A partition of the child RDD depends on all partitions of the parent RDD, which forces a shuffle; examples are groupByKey and a join whose inputs are not co-partitioned, as shown in the figure.
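Continuing with the sc from the snippet above, both wide-dependency cases look like this (sample data is illustrative):

```scala
// groupByKey: every child partition may need data from every parent partition
val pairs   = sc.parallelize(Seq(1 -> "a", 2 -> "b", 1 -> "c"), 4)
val grouped = pairs.groupByKey()          // ShuffleDependency (wide)

// join without co-partitioned inputs: neither side matches the result's
// partitioner, so both sides must be shuffled
val other      = sc.parallelize(Seq(1 -> "x", 2 -> "y"), 3)
val joinedWide = pairs.join(other)        // ShuffleDependency on both parents
```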
DAG
DAG is short for Directed Acyclic Graph and is commonly used for modeling. Spark uses a DAG to model the relationships between RDDs, describing each RDD's dependencies; this is also called the lineage. An RDD's dependencies are maintained through the Dependency class (see Spark RDD Dependency). The component that operates on the DAG in Spark is the DAGScheduler.
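You can inspect the lineage and the dependency types directly (again assuming the sc defined earlier; the sample chain is illustrative):

```scala
import org.apache.spark.{NarrowDependency, ShuffleDependency}

val rdd = sc.parallelize(1 to 10, 2).map(_ + 1).groupBy(_ % 2)

// toDebugString prints the lineage; indentation marks shuffle boundaries
println(rdd.toDebugString)

// The final RDD's direct dependencies, as maintained by Dependency
rdd.dependencies.foreach {
  case _: ShuffleDependency[_, _, _] => println("wide (shuffle) dependency")
  case _: NarrowDependency[_]        => println("narrow dependency")
}
```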
Stage
In Spark, the DAG is divided into stages based on the dependencies between RDDs. A job is split into multiple task sets, and each task set is called a stage. For narrow dependencies, a partition's transformations can be completed within a single task because its parent partitions are known deterministically, so consecutive narrow dependencies are grouped into the same stage. For wide dependencies, the next stage can start computing only after the parent RDD's shuffle has completed.
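For example, a word count job contains exactly one wide dependency and therefore runs as two stages (the input path is hypothetical):

```scala
val counts = sc.textFile("hdfs:///tmp/input.txt")   // hypothetical path
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)   // wide dependency: the stage boundary is here
counts.collect()
// Stage 0 (ShuffleMapStage): textFile -> flatMap -> map, ends with the shuffle write
// Stage 1 (ResultStage):     shuffle read -> reduce -> collect
```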
Stage division idea
Therefore, Spark's overall approach to dividing stages is: traverse from back to front, break the graph whenever a wide dependency is met so that a new stage begins, and when a narrow dependency is met, add that RDD to the current stage. So in the figure above, RDD C, RDD D, RDD E, and RDD F are placed in one stage; RDD A is in a stage of its own; and RDD B and RDD G are in the same stage. A simplified sketch of this traversal follows.
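The sketch below captures only the idea; the real implementation lives in DAGScheduler (getOrCreateParentStages / getShuffleDependencies) and tracks considerably more state:

```scala
import scala.collection.mutable
import org.apache.spark.ShuffleDependency
import org.apache.spark.rdd.RDD

// Walk back from the final RDD: narrow dependencies stay in the current
// stage, shuffle (wide) dependencies mark where a new parent stage begins.
def shuffleBoundaries(finalRdd: RDD[_]): Seq[ShuffleDependency[_, _, _]] = {
  val boundaries = mutable.ArrayBuffer.empty[ShuffleDependency[_, _, _]]
  val visited    = mutable.HashSet.empty[RDD[_]]
  val toVisit    = mutable.Stack[RDD[_]](finalRdd)
  while (toVisit.nonEmpty) {
    val rdd = toVisit.pop()
    if (visited.add(rdd)) {
      rdd.dependencies.foreach {
        case shuffle: ShuffleDependency[_, _, _] =>
          boundaries += shuffle        // break: the parent RDD starts a new stage
        case narrow =>
          toVisit.push(narrow.rdd)     // same stage: keep walking backwards
      }
    }
  }
  boundaries.toSeq
}
```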
In Spark, there are two types of tasks: ShuffleMapTask and ResultTask.
- ResultTask: for the last stage in the DAG (the ResultStage), Spark generates one ResultTask per partition of the final RDD in the DAG.
- ShuffleMapTask: for every non-final stage (a ShuffleMapStage), Spark generates one ShuffleMapTask per partition of the last RDD in that stage.
The number of tasks in each stage is therefore determined by the number of partitions of the last RDD in that stage.
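A quick way to see this (assuming the sc defined earlier; the partition counts are chosen arbitrarily):

```scala
val result = sc.parallelize(1 to 100, 6)   // 6 partitions on the map side
  .map(x => (x % 3, x))
  .reduceByKey(_ + _, 4)                   // final RDD has 4 partitions

println(result.getNumPartitions)           // 4
result.collect()
// Stage 0 runs 6 ShuffleMapTasks (the last map-side RDD has 6 partitions);
// Stage 1 runs 4 ResultTasks    (the final RDD has 4 partitions).
```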
Note: execution within a single stage is pipelined record by record. For example, for RDD C → D → F in Stage 2, assuming only one CPU core, Spark pushes one record through the whole C → D → F chain before processing the next record, instead of first computing all of RDD C into RDD D and then computing RDD F.
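You can observe this pipelining locally with side-effecting prints (illustrative only):

```scala
sc.parallelize(1 to 3, 1)
  .map    { x => println(s"map    $x"); x * 10 }
  .filter { x => println(s"filter $x"); true  }
  .collect()
// The interleaved output shows each record flowing through the whole chain first:
// map 1, filter 10, map 2, filter 20, map 3, filter 30
```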
Conclusion
Spark divides stages at the wide dependencies of RDDs in order to implement the pipeline computation model. In pipelined computation, the transformation logic (higher-order functions) is shipped to where the data lives and records stream through the chain, so it is the logic that moves rather than the data. Data is materialized only when the pipeline reaches a persistence operator or a shuffle (a wide-dependency operator).