Spark is generally deployed on YARN, so questions about the YARN submission process come up most often. The biggest difference between the two modes (Client and Cluster) is where the Driver runs.
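As a rough illustration, the mode is normally chosen at submission time through the `--deploy-mode` option of `spark-submit`. The sketch below only builds the argument list; the helper name `spark_submit_args`, the class name `com.example.WordCount`, and the jar name `wordcount.jar` are placeholders invented for this example, not taken from the text.

```python
def spark_submit_args(deploy_mode, main_class, app_jar):
    """Build a spark-submit argument list for a YARN deployment."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return [
        "spark-submit",
        "--master", "yarn",             # run on a YARN cluster
        "--deploy-mode", deploy_mode,   # decides where the Driver runs
        "--class", main_class,
        app_jar,
    ]

# Placeholder application details (hypothetical).
client_cmd = spark_submit_args("client", "com.example.WordCount", "wordcount.jar")
cluster_cmd = spark_submit_args("cluster", "com.example.WordCount", "wordcount.jar")

print(" ".join(client_cmd))
print(" ".join(cluster_cmd))
```

The two commands differ only in the value passed to `--deploy-mode`; the rest of the submission is the same.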
YARN Client mode
Step 1: The Driver runs on the local machine from which the application is submitted.
Step 2: After the Driver starts, it contacts ResourceManager and asks it to start an ApplicationMaster.
Step 3: ResourceManager allocates a Container and starts the ApplicationMaster on a suitable NodeManager. The ApplicationMaster then requests resources (such as memory) for the Executors from ResourceManager.
Step 4: On receiving the ApplicationMaster’s request, ResourceManager allocates Containers to it, and the ApplicationMaster starts the Executor processes on the assigned NodeManagers.
Step 5: Once started, each Executor registers back with the Driver (reverse registration). After the Executors have registered, the Driver continues executing the main function.
Step 6: When execution reaches an action operator, a Job is triggered and divided into stages at the wide dependencies. Each stage produces a TaskSet, whose tasks are then distributed to the Executors for execution.
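The stage division in the last step can be illustrated with a toy model. This is a minimal sketch, not Spark's actual scheduler: it assumes a purely linear lineage, and the helper names `split_into_stages` and `task_set` are made up for the example. Each operation is tagged as having a narrow or a wide dependency on its parent, a wide dependency starts a new stage, and each stage gets one task per partition.

```python
def split_into_stages(lineage):
    """Split an ordered list of (operation, dependency) pairs into stages.

    dependency is "narrow" or "wide". A wide dependency requires a shuffle,
    so the operation that carries it begins a new stage.
    """
    stages = [[]]
    for op, dependency in lineage:
        if dependency == "wide" and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return stages


def task_set(stage_id, num_partitions):
    """Toy TaskSet: one task per partition of the stage."""
    return [f"stage{stage_id}-task{p}" for p in range(num_partitions)]


# A word-count style lineage (illustrative only).
lineage = [
    ("textFile", "narrow"),
    ("flatMap", "narrow"),
    ("map", "narrow"),
    ("reduceByKey", "wide"),   # shuffle boundary
    ("mapValues", "narrow"),
]

stages = split_into_stages(lineage)
for stage_id, ops in enumerate(stages):
    print(stage_id, ops, task_set(stage_id, num_partitions=3))
```

Here the lineage is cut once, at `reduceByKey`, giving two stages: `textFile`, `flatMap`, and `map` in the first, and `reduceByKey` and `mapValues` in the second.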
YARN Cluster mode
Step 1: In YARN Cluster mode, after the application is submitted, the client contacts ResourceManager and asks it to start an ApplicationMaster.
Step 2: ResourceManager allocates a Container and starts the ApplicationMaster on a suitable NodeManager. In this mode, the ApplicationMaster is the Driver.
Step 3: After the Driver starts, it requests resources (such as memory) for the Executors from ResourceManager. On receiving the ApplicationMaster’s request, ResourceManager allocates Containers, and the Executor processes are then started on suitable NodeManagers.
Step 4: Once started, each Executor registers back with the Driver (reverse registration). After the Executors have registered, the Driver continues executing the main function.
Step 5: When execution reaches an action operator, a Job is triggered and divided into stages at the wide dependencies. Each stage produces a TaskSet, whose tasks are then distributed to the Executors for execution.
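To compare the two flows side by side, the steps can be written down as a small trace. This is only a toy restatement of the steps described above, under the assumption that the two modes differ solely in how the Driver gets started; the function name `submission_trace` and the step wording are invented for the example.

```python
def submission_trace(deploy_mode):
    """Return (driver_host, steps) for a toy YARN submission.

    steps is an ordered list of (actor, action) pairs.
    """
    if deploy_mode == "client":
        driver_host = "submitting machine"
        start = [
            ("Driver", "starts on the submitting machine"),
            ("Driver", "asks ResourceManager to start an ApplicationMaster"),
            ("ResourceManager", "starts ApplicationMaster on a NodeManager"),
        ]
    elif deploy_mode == "cluster":
        driver_host = "ApplicationMaster container"
        start = [
            ("Client", "asks ResourceManager to start an ApplicationMaster"),
            ("ResourceManager", "starts ApplicationMaster on a NodeManager"),
            ("ApplicationMaster", "acts as the Driver"),
        ]
    else:
        raise ValueError("deploy_mode must be 'client' or 'cluster'")

    # From here on, both modes follow the same sequence.
    common = [
        ("ApplicationMaster", "requests Executor resources from ResourceManager"),
        ("ResourceManager", "allocates Containers"),
        ("ApplicationMaster", "starts Executors on NodeManagers"),
        ("Executor", "registers back with the Driver"),
        ("Driver", "triggers Jobs and distributes tasks to Executors"),
    ]
    return driver_host, start + common


for mode in ("client", "cluster"):
    host, steps = submission_trace(mode)
    print(f"{mode}: Driver runs on the {host}")
    for actor, action in steps:
        print(f"  {actor}: {action}")
```

Only the first three entries differ between the two traces, which mirrors the point that the main difference between the modes is where the Driver runs.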