1. Application

An Application is a Spark application program written by the user. After the Application is submitted to Spark, Spark allocates resources for it, translates the program into Jobs, Stages, and Tasks, and executes them. An Application contains one or more Jobs.
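
As an illustration, a minimal Spark Application might look like the following Scala sketch; the object name WordCountApp and the argument paths are hypothetical placeholders, not part of the original text.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical minimal Application: it is submitted as a whole, Spark
// allocates executors for it, and each action inside it submits a Job.
object WordCountApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WordCountApp")          // this process is one Application
      .getOrCreate()
    val sc = spark.sparkContext

    val counts = sc.textFile(args(0))   // transformations only build the RDD lineage
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)

    counts.saveAsTextFile(args(1))      // the action submits a Job inside this Application

    spark.stop()
  }
}
```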

2. Job

A Job is a computation consisting of one or more Stages, triggered by an action operator. Common action operators include: reduce, collect, count, first, take, takeSample, takeOrdered, saveAsTextFile, saveAsSequenceFile, saveAsObjectFile, and countByKey.
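
To make the trigger point concrete, the sketch below (assuming an existing SparkContext named sc) shows that transformations alone submit nothing, while each action call submits a Job.

```scala
// Assumes `sc` is an existing SparkContext.
val nums    = sc.parallelize(1 to 100, 4)   // 4 partitions
val doubled = nums.map(_ * 2)               // transformation: no Job submitted yet

val total = doubled.reduce(_ + _)           // action -> Job 0
val first = doubled.take(5)                 // action -> Job 1 (take may launch extra internal jobs)
val count = doubled.count()                 // action -> Job 2
```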

3. Stage

A Stage is a scheduling unit corresponding to one TaskSet. Each Job is divided into Stages along the wide (shuffle) dependencies between its RDDs, and each Stage contains one TaskSet.

Operators that cause a Shuffle: repartition, repartitionAndSortWithinPartitions, coalesce (when shuffle = true), reduceByKey, groupByKey, sortByKey, join, cogroup, etc. These operators fall into three main categories (a minimal sketch follows the list below):

  • 1. Repartition operators: repartition generally causes a Shuffle, because the data in the existing partitions must be redistributed across the cluster into the newly specified downstream partitions.
  • 2. ByKey operators: to aggregate by key, all records with the same key across the cluster must be brought to the same node for processing.
  • 3. Join operators: to join two RDDs, records with the same key are shuffled to the same node, and the records of the two RDDs that share that key are then combined (a Cartesian product per key).
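
A minimal sketch of the Stage boundary created by one such operator (assuming an existing SparkContext sc; the input path is a placeholder):

```scala
// Narrow dependencies (flatMap, map) are pipelined into the map-side Stage.
val pairs = sc.textFile("hdfs:///tmp/input")   // placeholder path
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1L))

// reduceByKey introduces a wide (shuffle) dependency, so the DAGScheduler
// cuts the Job here: everything above is Stage 0, the aggregation below is Stage 1.
val counts = pairs.reduceByKey(_ + _)

counts.collect()   // the action submits the Job, which runs as two Stages
```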

4. TaskSet

A set of associated Tasks with no Shuffle dependencies among them; the Tasks of one Stage form one TaskSet.

5. Task

Each partition of an RDD corresponds to one Task; a Task is the smallest unit of execution and processes the data of a single partition.
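
As a small illustration (again assuming an existing SparkContext sc), the number of Tasks launched for a Stage equals the number of partitions of the RDD it computes:

```scala
val data = sc.parallelize(1 to 1000, 8)
println(data.getNumPartitions)        // 8 -> the map-side Stage runs 8 Tasks

val sums = data.map(x => (x % 10, x)).reduceByKey(_ + _, 4)
println(sums.getNumPartitions)        // 4 -> the reduce-side Stage runs 4 Tasks
```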

6. Summary

The relationship between Stages and Tasks: a Stage can be thought of as one map-like or reduce-like phase of a MapReduce job, and each Task within a Stage can run to completion inside a single Executor without any Shuffle.

Shuffle: Map + Reduce -> the intermediate data is written out by the map side and re-transferred across the network to the reduce side.

The figure below takes an Application with a single action, and therefore a single Job, as an example.

Note: The picture is original; please credit the source if you reproduce it.