A record of troubleshooting for some common Spark job problems.
1. OOM caused by the shuffle reduce-side buffer size
During shuffle, when the reduce side pulls data from the map side, each task has its own buffer to hold the pulled data. The default buffer size is 48 MB.
If the map side produces a large amount of data and writes it out quickly, the buffers of all the reduce-side tasks can fill up completely, so every task holds a large amount of pulled data in memory while it computes. At the same time, the amount of memory each executor's JVM process can devote to this is fixed (roughly a 0.2 fraction by default), so many tasks computing over full buffers at once can easily exhaust memory and cause an OOM.
In this case, we can reduce the reduce-side buffer size so that each task pulls its data in more, smaller batches. This costs some extra network-transfer overhead, but it keeps reduce-side memory usage within bounds.
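The buffer in question is controlled by the spark.reducer.maxSizeInFlight setting, whose default is 48m. Below is a minimal sketch, assuming a programmatic SparkConf setup; the app name is hypothetical and 24m is only an illustrative value, not a recommendation.

import org.apache.spark.{SparkConf, SparkContext}

// Shrink the reduce-side pull buffer from its 48m default so each task holds
// less fetched data in memory at once, at the cost of more fetch rounds.
val conf = new SparkConf()
  .setAppName("ReduceBufferTuning") // hypothetical app name
  .set("spark.reducer.maxSizeInFlight", "24m") // illustrative value only
val sc = new SparkContext(conf)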
2. The GC of the JVM causes a shuffle file pull failure
“Shuffle file not found” is a common error in Spark. In some cases, the failed stage and tasks are simply resubmitted, the job is rerun, and the error no longer appears.
This happens because the executors of the previous stage used too much memory during shuffle and triggered frequent GC; while GC is running, the entire JVM process is paused. If an executor of the next stage tries to pull the data it needs at that moment, the pull fails. It then waits for a while and retries; often the previous stage's GC has finished by the time the retry happens, and the data can be pulled successfully.
In this case, the following two parameters can be tuned (a configuration sketch follows the list):
1. spark.shuffle.io.maxRetries — the number of times a failed shuffle pull is retried; the default is 3.
2. spark.shuffle.io.retryWait — the wait between retries when pulling shuffle files; the default is 5s.
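Below is a sketch of raising both settings when building the SparkConf; the app name and the specific numbers are assumptions chosen only to illustrate giving the next stage more and longer retries.

import org.apache.spark.{SparkConf, SparkContext}

// Let the next stage's executors keep retrying while the previous stage's
// executors are paused in a long GC.
val conf = new SparkConf()
  .setAppName("ShuffleRetryTuning") // hypothetical app name
  .set("spark.shuffle.io.maxRetries", "60") // default 3; illustrative value
  .set("spark.shuffle.io.retryWait", "60s") // default 5s; illustrative value
val sc = new SparkContext(conf)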
3. The Yarn cluster resources are insufficient, causing application failure
Suppose one Spark job is already running in the Yarn cluster and you submit another, and the resources requested by the two jobs add up to roughly the total resources of the whole cluster, or slightly less. The second Spark job may then fail to run, because a Spark job often ends up consuming more resources at runtime than what we allocated in the submit shell script.
To prevent this from happening, there are several options:
1. In the J2EE layer, allow only one Spark job to be submitted to the Yarn cluster at a time. That way the cluster's resources are always sufficient for the one running Spark job.
2. Run Spark jobs one after another through a queue, so that each job can be given as many resources as possible and finish in the shortest possible time (a minimal queueing sketch follows).
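As a minimal sketch of option 2 (all names here are hypothetical), submissions can be pushed through a single-threaded queue so that only one Spark job is sent to the Yarn cluster at a time:

import java.util.concurrent.Executors

// Hypothetical helper: a single-threaded executor acts as the submission queue,
// so at most one Spark job is submitted to the Yarn cluster at any moment.
object SparkJobQueue {
  private val queue = Executors.newSingleThreadExecutor()

  def submit(jobName: String)(runJob: () => Unit): Unit = {
    queue.submit(new Runnable {
      override def run(): Unit = {
        println(s"Submitting Spark job: $jobName")
        runJob() // e.g. shell out to spark-submit, or use SparkLauncher
      }
    })
  }
}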
4. An error was reported due to serialization problems
When a Spark job is submitted in client mode, an error mentioning Serializable (such as a "task not serializable" exception) may appear in the locally printed log. The error is caused by using a class that does not implement the serialization interface.
There are two common situations that require serialization (a short example follows the list):
1. Operator functions reference variables of external, custom types, so those types must implement the serialization interface;
2. A custom type is used as the element type of an RDD; that type must also implement the serialization interface.
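Here is a minimal sketch covering both situations, using a hypothetical Student class and a local master for illustration:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical custom type. A plain class must mix in Serializable explicitly
// (Scala case classes already do; a plain Java class must implement java.io.Serializable).
class Student(val name: String, val score: Int) extends Serializable {
  override def toString: String = s"Student($name, $score)"
}

object SerializationExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SerializationExample").setMaster("local[*]"))

    val bonus = new Student("bonus", 10) // external variable captured by the operator below

    val students = sc.parallelize(Seq(new Student("a", 80), new Student("b", 90)))
      // Situation 1: the map closure references `bonus`, so Student must be serializable.
      // Situation 2: the RDD's element type is Student, so it must be serializable as well.
      .map(s => new Student(s.name, s.score + bonus.score))

    students.collect().foreach(println)
    sc.stop()
  }
}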
5. Excessive network adapter traffic occurs in yarn-client mode
In a production environment, if Spark runs in yarn-client mode, the driver process runs on the local machine that submits the job, and a large amount of data is transferred between it and the cluster. As a result, the network adapter traffic on the local machine becomes very heavy.
Therefore, yarn-cluster mode should be used in the production environment.
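For reference, the deploy mode is chosen at submission time; a spark-submit invocation of roughly the following shape (the class name and jar path are placeholders) keeps the driver on a cluster node instead of the local machine:

spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar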
6. Memory overflow of the JVM in yarn-cluster mode
Sometimes a Spark job that contains Spark SQL runs properly in yarn-client mode, but in yarn-cluster mode it fails with an out-of-memory error in the JVM's PermGen (permanent generation).
In yarn-client mode, the driver runs on the local machine, so the JVM's PermGen settings come from the local Spark installation, where the permanent generation is 128 MB. In yarn-cluster mode, the driver runs on a node of the Yarn cluster, where the default PermGen size is only 82 MB.
Internally, Spark SQL does a great deal of complex semantic parsing, syntax conversion, and so on. If the SQL is not written carefully, memory consumption and permanent-generation usage can be large and can easily exceed the default size on the Yarn cluster.
Solution:
Add the following configuration to the spark-submit script:
--conf spark.driver.extraJavaOptions="-XX:PermSize=128M -XX:MaxPermSize=256M"
This sets the driver's permanent generation to an initial size of 128 MB and a maximum of 256 MB. With this in place, the Spark job is very unlikely to hit the PermGen out-of-memory problem described above in yarn-cluster mode.