1. Runtime architecture
The core of the Spark framework is a computing engine that adopts the standard master-slave structure.
In the basic execution structure of Spark, the Driver is the master, responsible for scheduling jobs across the entire cluster, while the Executor is a slave, responsible for the actual execution of tasks.
2. Core components
2.1 Driver
The Spark Driver node runs the main method of a Spark application and executes the user's actual code. During Spark job execution, the Driver is responsible for the following tasks:
- Converting the user program into Jobs
- Scheduling tasks between Executors
- Tracking Executor execution
- Displaying query execution status through the UI
In fact, it is hard to define the Driver precisely, because the word "Driver" never appears anywhere in the code we write. Simply put, the Driver is the program that drives the entire application, and it is sometimes called the Driver class.
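As a concrete illustration, here is a minimal sketch of a driver program in Scala; the object name, application name, and input path are hypothetical. The JVM process that runs this main method is the Driver:

```scala
import org.apache.spark.sql.SparkSession

// A hypothetical minimal driver program: the process that executes this
// main method is the Driver. It builds the session, declares the job,
// and schedules the resulting tasks onto Executors.
object WordCountDriver {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("word-count-driver")   // application name shown in the UI
      .getOrCreate()
    val sc = spark.sparkContext

    // Transformations are only recorded here; the Driver converts them
    // into a Job and distributes tasks when the action (collect) runs.
    val counts = sc.textFile("hdfs:///tmp/input.txt")   // hypothetical path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.collect().foreach(println)
    spark.stop()
  }
}
```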
2.2 Executor
A Spark Executor is a JVM process on a Worker node in the cluster that runs individual Tasks of a Spark job. Executors are started when the Spark application starts and exist for the entire life cycle of the application. If an Executor fails or crashes, the Spark application can continue to run: the tasks on the failed Executor are rescheduled onto other Executors.
Executor has two core functions:
- Running the tasks that make up the Spark application and returning the results to the Driver process
- Providing in-memory storage, through its Block Manager, for RDDs that the user program asks to cache. Because RDDs are cached directly inside the Executor process, tasks can take advantage of the cached data to speed up computation at run time (a caching sketch follows this list).
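A minimal caching sketch, assuming an existing SparkSession named spark and a hypothetical input path:

```scala
// Assuming an existing SparkSession `spark`; the input path is hypothetical.
val sc = spark.sparkContext

val logs = sc.textFile("hdfs:///tmp/access.log")
val errors = logs.filter(_.contains("ERROR"))

// cache() asks each Executor's Block Manager to keep this RDD's
// partitions in memory after the first action computes them.
errors.cache()

println(errors.count())   // first action: computes and caches the partitions
println(errors.count())   // second action: served from the Executors' memory
```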
2.3 Master and Worker
In a standalone deployment of a Spark cluster, Spark implements resource scheduling itself without relying on an external resource scheduling framework, so the environment contains two additional core components: Master and Worker. The Master is a process mainly responsible for resource scheduling and allocation as well as cluster monitoring, similar to the ResourceManager (RM) in a Yarn environment. A Worker is also a process; one Worker runs on each server in the cluster. It is allocated resources by the Master and uses them to process and compute data in parallel, similar to the NodeManager (NM) in Yarn.
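As a rough sketch of the standalone setup, an application can point directly at a standalone Master via a spark://HOST:PORT URL (the host name below is hypothetical; 7077 is the default Master port):

```scala
import org.apache.spark.sql.SparkSession

// Connect to a standalone Master (hypothetical host); the Master will ask
// Workers to launch Executors for this application.
val spark = SparkSession.builder()
  .appName("standalone-example")
  .master("spark://master-host:7077")   // standalone Master URL
  .getOrCreate()
```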
2.4 ApplicationMaster
When a user submits an application to a Hadoop YARN cluster, the application includes an ApplicationMaster. The ApplicationMaster applies to the resource scheduler for Containers in which to execute tasks, runs the user's program, monitors the execution of the whole job, tracks its status, and handles exceptions such as task failures. Simply put, the ApplicationMaster decouples the ResourceManager from the Driver.
3. Core concepts
3.1 Executor and Core
A Spark Executor is a JVM process that runs on a Worker node in the cluster and is dedicated to computation. When submitting an application, you can specify the number of Executors and their resources, where resources generally mean the memory size of each Executor and the number of virtual CPU cores it uses.
The startup parameters of the application are as follows:
Parameter | Description
---|---
--num-executors | Configure the number of Executors
--executor-memory | Configure the memory size of each Executor
--executor-cores | Configure the number of virtual CPU cores per Executor
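These spark-submit flags map to configuration properties that can also be set programmatically. A minimal sketch, with example values only:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Equivalent of `--num-executors 3 --executor-memory 2g --executor-cores 2`
// expressed as configuration properties (example values only).
val conf = new SparkConf()
  .setAppName("resource-config-example")
  .set("spark.executor.instances", "3")   // --num-executors
  .set("spark.executor.memory", "2g")     // --executor-memory
  .set("spark.executor.cores", "2")       // --executor-cores

val spark = SparkSession.builder().config(conf).getOrCreate()
```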
3.2 Parallelism
In a distributed computing framework, many tasks are executed at the same time. Because the tasks are distributed across different compute nodes, multiple tasks can truly execute in parallel. Note that this is parallelism, not concurrency. We refer to the number of tasks executed in parallel across the whole cluster as the parallelism. So what is the parallelism of a job? By default it depends on the framework's configuration, and an application can also change it dynamically at run time, as sketched below.
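A minimal sketch of inspecting and adjusting parallelism, assuming an existing SparkSession named spark (the partition counts are example values):

```scala
// Assuming an existing SparkSession `spark`.
val sc = spark.sparkContext

// Explicitly request 8 partitions, i.e. up to 8 tasks in parallel for this RDD.
val rdd = sc.parallelize(1 to 1000, numSlices = 8)
println(rdd.getNumPartitions)    // 8

// Repartitioning changes the parallelism of downstream stages.
val wider = rdd.repartition(16)
println(wider.getNumPartitions)  // 16

// Cluster-wide default used when no partition count is given.
println(sc.defaultParallelism)
```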
3.3 Directed Acyclic Graph (DAG)
- Big data computing engine frameworks are generally divided into four generations according to how they are used. The first generation is MapReduce, carried by Hadoop, which splits computation into two stages, Map and Reduce. Upper-layer applications therefore have to go to great lengths to split their algorithms, and even chain multiple jobs together at the application layer to complete one full algorithm, such as iterative computation. Because of these drawbacks, frameworks supporting DAGs emerged, and DAG-capable frameworks such as Tez and Oozie are classified as the second generation of computing engines. We will not go into the differences between the various DAG implementations here, but Tez and Oozie at that time were still mostly aimed at batch tasks. Next came the third generation of computing engines, represented by Spark, whose main features are DAG support within a job (not across jobs) and real-time computing.
- The directed acyclic graph is not a literal picture, but a high-level abstract model of the data flow mapped directly from the Spark program. Put simply, it is a graph of the entire execution process of the computation, which makes the program's topology more intuitive and easier to understand (see the lineage sketch after this list).
- A DAG (Directed Acyclic Graph) is a topological graph composed of vertices and edges; it has direction and contains no closed loops.
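As a rough illustration, the lineage (the DAG of an RDD's ancestry) can be inspected with toDebugString; a minimal sketch assuming an existing SparkSession named spark and a hypothetical input path:

```scala
// Assuming an existing SparkSession `spark`; the input path is hypothetical.
val sc = spark.sparkContext

// Each transformation adds a node to the DAG; nothing runs yet.
val counts = sc.textFile("hdfs:///tmp/input.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Print the lineage graph that Spark has recorded for this RDD.
println(counts.toDebugString)

// Only the action below triggers a Job; Spark walks the DAG and
// splits it into stages at the shuffle introduced by reduceByKey.
counts.count()
```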
4. Submission process
The submission process refers to submitting an application, written by a developer according to requirements, to the Spark runtime environment for execution through the Spark client. The process is basically the same in different deployment environments, with slight differences that we will not compare in detail here. In practice, however, Spark applications are mostly deployed on Yarn, so the submission process in this course is described based on the Yarn environment.
When a Spark application is submitted to the Yarn environment for execution, there are two deployment and execution modes: Client and Cluster. The main difference between them is where the Driver program runs.
4.1 Yarn Client mode
- In Client mode, the Driver module used for monitoring and scheduling runs on the client machine rather than inside Yarn, so this mode is generally used for testing.
- The Driver runs on the local machine where the task is submitted
- After the Driver starts, it communicates with ResourceManager to apply for starting ApplicationMaster
- ResourceManager allocates a Container and starts ApplicationMaster on an appropriate NodeManager; the ApplicationMaster then applies to ResourceManager for the resources needed for the Executors
- After receiving the ApplicationMaster's resource request, ResourceManager allocates Containers to it, and the ApplicationMaster starts Executor processes on the designated NodeManagers
- After an Executor process starts, it registers back with the Driver; once all Executors have registered, the Driver begins executing the main function
- Then, when an action operator is executed, a Job is triggered and stages are divided at wide (shuffle) dependencies. Each stage generates a TaskSet, and its tasks are then distributed to the Executors for execution (see the sketch after this list).
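A minimal sketch of that last step, assuming an existing SparkSession named spark: the count() action triggers a Job, and the shuffle introduced by reduceByKey splits it into two stages.

```scala
// Assuming an existing SparkSession `spark`.
val sc = spark.sparkContext

val pairs = sc.parallelize(Seq("a", "b", "a", "c"))
  .map(word => (word, 1))                // narrow dependency: stays in the first stage

val counted = pairs.reduceByKey(_ + _)   // wide dependency: shuffle boundary, new stage

// The action triggers a Job with two stages; each stage's TaskSet is
// scheduled onto the registered Executors.
println(counted.count())
```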
4.2 Yarn Cluster mode
- In Cluster mode, the Driver module used for monitoring and scheduling starts and runs inside the Yarn cluster's resources; this mode is generally used in production environments (a small sketch for checking where the Driver runs follows this list).
- In Yarn Cluster mode, after a task is submitted, the client communicates with ResourceManager to apply for starting the ApplicationMaster.
- ResourceManager allocates a Container and starts ApplicationMaster on an appropriate NodeManager. In this case, ApplicationMaster is the Driver.
- After the Driver starts, it applies to ResourceManager for Executor resources. ResourceManager allocates Containers after receiving the ApplicationMaster's request, and Executor processes are then started on the appropriate NodeManagers.
- After an Executor process starts, it registers back with the Driver; once all Executors have registered, the Driver executes the main function.
- Then, when an action operator is executed, a Job is triggered and stages are divided at wide (shuffle) dependencies. Each stage generates a TaskSet, and its tasks are then distributed to the Executors for execution.
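As a quick way to confirm where the Driver ended up, the SparkContext exposes the master URL and, in recent Spark versions, the deploy mode; a minimal sketch assuming an existing SparkSession named spark:

```scala
// Assuming an existing SparkSession `spark`; shows whether this Driver
// is running on the submitting machine (client) or inside the cluster.
val sc = spark.sparkContext

println(sc.master)       // e.g. "yarn"
println(sc.deployMode)   // "client" or "cluster"
```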