First, the reader needs to know what a Spark task is. See our previous article "Spark Performance Tuning: Task Parallelism" for details.
Task description:
- The Spark program we write is called an Application.
- An Application contains multiple jobs; each action (such as collect) triggers one job.
- Each job is divided into stages wherever a shuffle occurs (such as reduceByKey).
- Each stage is divided into multiple tasks, which are assigned to executors; each task is executed by one thread.
- Each task processes one small slice of the data (one RDD partition). A minimal code sketch follows this list.
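To make the hierarchy concrete, here is a minimal word-count sketch (the class name LocalityDemo and the input path hdfs:///tmp/input are placeholders of ours): the whole program is one Application, collect triggers one job, and the reduceByKey shuffle splits that job into two stages whose tasks each process one partition.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class LocalityDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("LocalityDemo"); // this program is one Application
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaPairRDD<String, Integer> counts = sc
            .textFile("hdfs:///tmp/input") // placeholder input path
            .flatMap(line -> java.util.Arrays.asList(line.split(" ")).iterator())
            .mapToPair(word -> new Tuple2<>(word, 1)) // map side: stage 1
            .reduceByKey(Integer::sum);               // shuffle boundary -> stage 2

        counts.collect(); // the action: triggers one job made of the two stages above
        sc.stop();
    }
}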
Why is there a "data locality wait time" (spark.locality.wait)?
Before the driver assigns the tasks of each stage of an Application, Spark computes which slice of data (which RDD partition) each task will process. Spark's task-assignment algorithm preferentially assigns each task to the node where the data it will compute resides; this avoids transferring data over the network.
In practice, however, a task is sometimes not assigned to the node that holds its data. Why?
It may be that the node's computing resources are already used up, while the task needs those resources to process its data. In that case, Spark typically waits for a period of time to see whether the task can still be assigned to the node that holds its data. The default wait is 3 seconds (3s is not absolute; the wait can be set separately for each locality level). If the wait expires and waiting longer is not worthwhile, Spark falls back to a locality level with worse performance, for example assigning the task to a node close to the one holding its data and transferring the data over for computation.
For us, the best case is that the task is assigned to the node holding its data and fetches the data directly from the local executor's BlockManager: pure in-memory transfer, or at most some local disk IO.
So what are these locality levels we keep mentioning?
Locality describes the location relationship between a task and the slice of data it is assigned to process; the two may be on different nodes. Based on this relationship, there are five locality levels:
1. PROCESS_LOCAL: process-local. The task runs in an executor, and the data it computes resides in that same executor's BlockManager. This level gives the best performance.
2. NODE_LOCAL: node-local. Either the data resides on the task's node as an HDFS block while the task runs in an executor on that node, or the task and its data are in different executors on the same node, so the data is transferred between processes.
3. NO_PREF: no preference. For the task, it does not matter where the data is fetched from.
4. RACK_LOCAL: rack-local. The task and its data are on different nodes in the same rack, and the data is transferred between nodes over the network.
5. ANY: the task and its data may be anywhere in the cluster, not even in the same rack, and the data is transferred across racks. Worst performance.
When Spark assigns tasks, it prefers PROCESS_LOCAL. If a PROCESS_LOCAL assignment cannot be made, it waits, 3s by default, as set by spark.locality.wait, before downgrading to the next level. You can therefore set the spark.locality.wait parameter in your Spark job to adjust this wait time.
When to adjust the spark.locality.wait parameter?
When testing a Spark job, run it in client mode and watch the driver's logs: each "Starting task …" line reports the task's locality level (PROCESS_LOCAL, NODE_LOCAL, and so on).
Look at the locality level of the majority of tasks. If most are PROCESS_LOCAL, there is no need to adjust spark.locality.wait. If most tasks are NODE_LOCAL or ANY, it is worth increasing the locality wait time and then checking whether the locality level of most tasks improves and whether the job's running time shortens.
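If grepping the driver log is inconvenient, the same information can be collected programmatically. A minimal sketch (the class name LocalityListener is ours; SparkListener, SparkListenerTaskStart, and TaskInfo come from Spark's scheduler listener API):

import org.apache.spark.scheduler.SparkListener;
import org.apache.spark.scheduler.SparkListenerTaskStart;

// Prints the locality level of every task as it starts.
// Register it on the driver, e.g. sc.sc().addSparkListener(new LocalityListener());
public class LocalityListener extends SparkListener {
    @Override
    public void onTaskStart(SparkListenerTaskStart taskStart) {
        System.out.println("task " + taskStart.taskInfo().taskId()
                + " locality = " + taskStart.taskInfo().taskLocality());
    }
}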
How to adjust spark.locality.wait?
SparkConf sparkConf = new SparkConf()
    .set("spark.locality.wait", "10s"); // wait 10 seconds (write the unit explicitly)
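The wait can also be tuned per level rather than globally. A short sketch (spark.locality.wait.process, spark.locality.wait.node, and spark.locality.wait.rack are standard Spark configuration keys; the values below are illustrative, not recommendations):

SparkConf sparkConf = new SparkConf()
    .set("spark.locality.wait", "10s")          // base wait; each per-level wait defaults to this
    .set("spark.locality.wait.process", "10s")  // how long to hold out for PROCESS_LOCAL
    .set("spark.locality.wait.node", "6s")      // how long to hold out for NODE_LOCAL
    .set("spark.locality.wait.rack", "3s");     // how long to hold out for RACK_LOCAL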
Copyright notice: this is an original article by the CSDN blogger "Knowledge plus", released under the CC 4.0 BY-SA license. Please include a link to the original source and this statement when reposting.
Original link: blog.csdn.net/lilei199211…