Introduction to Spark
Apache Spark is a unified analytics engine for large-scale data processing. Built around in-memory computing, it improves real-time data processing in big data environments, provides high fault tolerance and scalability, and lets users deploy Spark across large numbers of machines to form clusters.
The Spark source code has grown from roughly 400,000 lines in the 1.x releases to more than 1,000,000 lines today, with more than 1,400 contributors. The entire Spark framework is a huge project. Let's take a look at the underlying execution principles of Spark.
Spark Running Process
The specific operation process is as follows:
- SparkContext registers with the resource manager and applies for resources to run Executors
- The resource manager allocates Executors and then starts them
- Executors send heartbeats to the resource manager
- SparkContext builds a DAG (directed acyclic graph)
- The DAG is split into Stages (TaskSets)
- Stages are sent to the TaskScheduler
- Executors apply to SparkContext for Tasks
- The TaskScheduler sends Tasks to the Executors to run
- SparkContext also distributes the application code to the Executors
- Tasks run on the Executors; when all Tasks finish, all resources are released
1. Build a DAG diagram from a code perspective
```scala
val lines1 = sc.textFile(inputPath1).map(...).map(...)
val lines2 = sc.textFile(inputPath2).map(...)
val lines3 = sc.textFile(inputPath3)

val dtinone1 = lines2.union(lines3)
val dtinone  = lines1.join(dtinone1)

dtinone.saveAsTextFile(...)
dtinone.filter(...).foreach(...)
```
The DAG diagram for the above code looks like this:
The Spark kernel draws a directed acyclic graph of the computation path, known as the DAG, at the moment the computation is needed.
Spark's computation happens when an Action is called on an RDD. For all the transformations before the Action, Spark only records the lineage that produces the RDD and does not trigger the actual computation.
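For example, here is a minimal, self-contained sketch (the input path and the filtering logic are placeholders of my own) showing that the transformations are only recorded until the Action runs:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvalDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LazyEvalDemo").setMaster("local[*]"))

    // Transformations: nothing is computed here, Spark only records the lineage (the DAG).
    val words  = sc.textFile("input.txt").flatMap(_.split("\\s+"))
    val longer = words.filter(_.length > 3)

    // The Action is what actually triggers a Job and executes the recorded DAG.
    val n = longer.count()
    println(s"words longer than 3 characters: $n")

    sc.stop()
  }
}
```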
2. The core algorithm for dividing the DAG into Stages
An Application can have multiple jobs and stages:
A Spark Application can trigger many Jobs because it contains different Actions. An Application can therefore have many Jobs, and each Job is composed of one or more Stages. Later Stages depend on earlier ones, and a Stage runs only after the Stages it depends on have finished computing.
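For instance, in the sketch below (reusing the local `sc` from the earlier example) each Action submits its own Job, and each Job then appears separately in the Spark UI:

```scala
val data = sc.parallelize(1 to 100).map(_ * 2)

data.count()         // Action #1 -> Job 1
data.reduce(_ + _)   // Action #2 -> Job 2; each Job is then split into one or more Stages
```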
Basis of division:
Stages are divided at wide dependencies. Operators such as reduceByKey and groupByKey produce wide dependencies.
A quick review of narrow and wide dependencies:
- Narrow dependencies: each partition of the parent RDD is used by at most one partition of the child RDD, i.e. a one-to-one or many-to-one relationship. Common narrow-dependency operators include map, filter, union, mapPartitions, mapValues, and join (when the parent RDDs are hash-partitioned).
- Wide dependencies: a partition of the parent RDD is used by multiple partitions of the child RDD, which involves a shuffle, i.e. a one-to-many relationship. Common wide-dependency operators include groupByKey, partitionBy, reduceByKey, and join (when the parent RDDs are not hash-partitioned).
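The distinction can be checked programmatically by inspecting an RDD's dependencies; a small sketch, reusing the local `sc` from above:

```scala
import org.apache.spark.{OneToOneDependency, ShuffleDependency}

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// map transforms each element in place: a narrow (one-to-one) dependency.
val mapped = pairs.map { case (k, v) => (k, v * 2) }
println(mapped.dependencies.head.isInstanceOf[OneToOneDependency[_]])        // true

// reduceByKey regroups data by key across partitions: a wide (shuffle) dependency.
val reduced = pairs.reduceByKey(_ + _)
println(reduced.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]])  // true
```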
Core algorithm: backtracking algorithm
Backtracking (reverse parsing): when a narrow dependency is encountered, the RDD is added to the current Stage; when a wide dependency is encountered, the Stage is split.
The Spark kernel starts from the RDD that triggered the Action and works backwards: it first creates a Stage for the last RDD, then, walking backwards, whenever it finds a wide dependency on some RDD it creates a new Stage with that RDD as the new Stage's last RDD. It continues in this way, dividing Stages according to narrow and wide dependencies, until all RDDs have been traversed.
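As an illustration only (this is not Spark's DAGScheduler code, just a toy model of the backward traversal), the splitting logic looks roughly like this:

```scala
// Toy model: each node knows its parents and whether the edge to each parent is wide (shuffle).
case class Node(id: Int, parents: Seq[(Node, Boolean)])

// Walk backwards from the final RDD; cut a new Stage whenever a wide dependency is met.
def buildStages(finalNode: Node): List[List[Int]] = {
  def visit(node: Node, current: List[Int], done: List[List[Int]]): (List[Int], List[List[Int]]) =
    node.parents.foldLeft((node.id :: current, done)) {
      case ((cur, acc), (parent, isWide)) =>
        if (isWide) {
          // Wide dependency: the parent becomes the last RDD of a brand-new Stage.
          val (parentStage, rest) = visit(parent, Nil, acc)
          (cur, parentStage :: rest)
        } else {
          // Narrow dependency: the parent joins the current Stage (pipelining).
          visit(parent, cur, acc)
        }
    }
  val (lastStage, earlier) = visit(finalNode, Nil, Nil)
  lastStage :: earlier
}

// Example: r1 --narrow--> r2 --wide--> r3 yields two stages: List(List(3), List(1, 2)).
val r1 = Node(1, Nil)
val r2 = Node(2, Seq((r1, false)))
val r3 = Node(3, Seq((r2, true)))
println(buildStages(r3))
```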
3. Analysis of dividing the DAG into Stages
A Spark application can have multiple DAGs: there is one DAG per Action. The example above ends with a single Action (not shown in the figure), so it has one DAG.
A DAG can have multiple Stages, divided at wide dependencies (shuffles).
Multiple Tasks in the same Stage can be executed in parallel (number of Tasks = number of partitions; in the figure above, Stage1 has three partitions P1, P2, and P3, corresponding to three Tasks).
As you can see, the reduceByKey operation in the DAG is a wide dependency, and the Spark kernel will divide it into different stages based on this boundary.
At the same time, notice that in the figure textFile, flatMap, and map in Stage1 are all narrow dependencies. These steps can form a pipeline: the partitions produced by flatMap can go straight into the map operation without waiting for the whole RDD computation to finish, which greatly improves efficiency.
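A classic word count makes the boundary concrete (same local `sc` as before; the input and output paths are placeholders):

```scala
// ---- Stage 1: textFile -> flatMap -> map are all narrow dependencies, so they are
// pipelined partition by partition, with no intermediate RDD materialized.
val counts = sc.textFile("input.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  // Stage boundary: reduceByKey introduces a wide (shuffle) dependency.
  .reduceByKey(_ + _)

// ---- Stage 2: reads the shuffled data and finishes the aggregation; the Action
// saveAsTextFile is what triggers the whole Job.
counts.saveAsTextFile("output")
```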
4. Submit Stages
DAGScheduler submits a TaskSet through the TaskScheduler interface, which triggers the TaskScheduler to build a TaskSetManager instance to manage the lifecycle of that TaskSet. For DAGScheduler, the work of submitting this Stage is now complete.
The implementation of TaskScheduler uses TaskSetManager to schedule specific tasks to the corresponding Executor nodes when computing resources are available.
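The shape of this hand-off can be sketched as follows. These are simplified stand-in classes of my own, not Spark's actual TaskScheduler or TaskSetManager signatures:

```scala
// Simplified stand-ins: a Stage's tasks arrive as one TaskSet; a manager tracks them
// and hands out individual tasks whenever an Executor has a free slot.
case class SketchTask(stageId: Int, partition: Int)
case class SketchTaskSet(stageId: Int, tasks: Seq[SketchTask])

class SketchTaskSetManager(taskSet: SketchTaskSet) {
  private var pending = taskSet.tasks.toList

  // Called when an Executor offers a free slot; returns the next task for it, if any.
  def resourceOffer(executorId: String): Option[SketchTask] = pending match {
    case head :: rest => pending = rest; Some(head)
    case Nil          => None
  }
}

class SketchTaskScheduler {
  private var managers = List.empty[SketchTaskSetManager]

  // What the DAG scheduler calls: hand over the whole TaskSet and let the scheduler manage it.
  def submitTasks(taskSet: SketchTaskSet): Unit =
    managers ::= new SketchTaskSetManager(taskSet)

  // Called as the cluster reports free Executor slots.
  def offerResource(executorId: String): Option[SketchTask] =
    managers.iterator.map(_.resourceOffer(executorId)).collectFirst { case Some(t) => t }
}
```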
5. Monitor Jobs, tasks, and Executors
- DAGScheduler monitors Jobs and Tasks:
DAGScheduler needs to monitor the state of the current scheduling Stages, and even of individual Tasks, so that Stages that depend on each other can be executed smoothly.
This is done through a set of callback functions exposed to the TaskScheduler. These callbacks mainly cover the start, completion, and failure of Tasks, and the failure of TaskSets. Based on this lifecycle information, DAGScheduler maintains the state of the Job and its scheduling Stages.
- DAGScheduler monitors Executor health:
The TaskScheduler uses callback functions to notify DAGScheduler of the lifecycle of each Executor. If an Executor crashes, the output of the ShuffleMapTasks in the corresponding Stage's TaskSet is marked as unavailable. This changes the state of that TaskSet and causes the related computing Tasks to be re-executed so that the missing data can be recomputed.
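These internal callbacks are not public, but Spark's public listener API exposes very similar lifecycle events, which is a convenient way to observe them in practice (reusing the `sc` from the earlier examples):

```scala
import org.apache.spark.scheduler._

// Logs task and Executor lifecycle events similar to the ones DAGScheduler reacts to.
class LifecycleLogger extends SparkListener {
  override def onTaskStart(e: SparkListenerTaskStart): Unit =
    println(s"task ${e.taskInfo.taskId} started in stage ${e.stageId}")

  override def onTaskEnd(e: SparkListenerTaskEnd): Unit =
    println(s"task ${e.taskInfo.taskId} ended: ${e.reason}")

  override def onExecutorRemoved(e: SparkListenerExecutorRemoved): Unit =
    println(s"executor ${e.executorId} removed: ${e.reason}")
}

sc.addSparkListener(new LifecycleLogger)
```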
6. Obtain the task execution result
- Returning results to DAGScheduler:
After a specific task is executed in Executor, the result needs to be returned to DAGScheduler in some form, depending on the type of task.
- There are two outcomes, intermediate and final:
For the Tasks of the FinalStage, what is returned to DAGScheduler is the result of the computation itself.
For the ShuffleMapTasks of the intermediate Stages, however, what is returned to DAGScheduler is a MapStatus that holds the storage location of the output rather than the result itself; Tasks in the next Stage use this location information to fetch their input data.
- There are two types, DirectTaskResult and IndirectTaskResult:
Results returned by ResultTask are divided into two categories according to the size of task results:
If the result is small enough, it is placed directly in the DirectTaskResult object.
If it is larger than a certain size, the DirectTaskResult is serialized on the Executor side and stored in the BlockManager as a block; the BlockID returned by the BlockManager is then sent back to the TaskScheduler inside an IndirectTaskResult. The TaskScheduler then uses the TaskResultGetter to read the BlockID from the IndirectTaskResult and fetch the corresponding DirectTaskResult from the BlockManager.
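A simplified sketch of that size-based decision (these are stand-in types and a stand-in threshold parameter of my own, not Spark's actual implementation):

```scala
import java.nio.ByteBuffer

// Stand-ins for the two result shapes described above.
sealed trait SketchResult
case class SketchDirectResult(bytes: ByteBuffer) extends SketchResult              // value travels inline
case class SketchIndirectResult(blockId: String, size: Int) extends SketchResult   // only a reference travels

// Executor side: small results are sent back directly; large ones are stored in the
// block manager first, and only the returned block id is sent back to the driver.
def packageResult(serialized: ByteBuffer,
                  maxDirectResultBytes: Int,              // stand-in for Spark's configured limit
                  putBlock: ByteBuffer => String): SketchResult =
  if (serialized.limit() <= maxDirectResultBytes)
    SketchDirectResult(serialized)
  else
    SketchIndirectResult(putBlock(serialized), serialized.limit())
```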
7. Overall interpretation of task scheduling
A diagram shows the overall task scheduling:
Spark operating architecture features
1. The Executor process is exclusive
Each Application gets its own Executor processes, which stay alive for the duration of the Application and run Tasks in a multithreaded manner.
Spark Applications cannot share data with each other unless the data is written to an external storage system. As shown in the figure:
2. Supports multiple resource managers
Spark is agnostic to the resource manager: all it needs is to acquire Executor processes and keep communicating with them.
Spark supports several resource managers: Standalone, Mesos, YARN, and EC2. As shown in the figure:
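In code, the choice of resource manager is largely a matter of the master URL passed to Spark; for example (the host names and ports below are placeholders):

```scala
import org.apache.spark.SparkConf

// Local mode: all cores of one machine, handy for development.
val localConf      = new SparkConf().setAppName("demo").setMaster("local[*]")

// Spark's own Standalone cluster manager.
val standaloneConf = new SparkConf().setAppName("demo").setMaster("spark://master-host:7077")

// Hadoop YARN: the cluster is found through the Hadoop configuration on the classpath.
val yarnConf       = new SparkConf().setAppName("demo").setMaster("yarn")

// Apache Mesos.
val mesosConf      = new SparkConf().setAppName("demo").setMaster("mesos://mesos-host:5050")
```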
3. Submit jobs nearby
The Client that submits the SparkContext should be close to the Worker nodes (the nodes running Executors), preferably in the same rack, because a lot of information is exchanged between the SparkContext and the Executors while a Spark Application runs.
If the SparkContext has to be run against a remote cluster, it is better to use RPC to submit it to the cluster than to run the SparkContext far away from the Workers.
As shown in the figure:
4. Implement the principle of moving programs, not data
Following the principle of moving programs rather than data, Task scheduling uses data locality and speculative execution as optimization mechanisms.
Key methods: taskIdToLocations and getPreferredLocations.
As shown in the figure:
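Both mechanisms can be influenced through configuration; a small sketch (the values shown are illustrative, not tuning advice):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("locality-demo")
  // How long the scheduler waits for a data-local slot before falling back to a
  // less local one (process-local -> node-local -> rack-local -> any).
  .set("spark.locality.wait", "3s")
  // Speculative execution: relaunch slow tasks on other Executors and take
  // whichever attempt finishes first.
  .set("spark.speculation", "true")
```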