1 Troubleshooting 1: Control the buffer size on the Reduce end to avoid OOM

  • During the Shuffle process, the reduce task does not wait until the map task has written all of its data to disk before pulling it. Instead, as soon as the map task has written some data, the reduce task pulls a portion of it and performs the subsequent operations, such as aggregation and applying operator functions.

  • How much data the reduce task can pull at a time is determined by the size of the buffer it pulls data into: data is first stored in the buffer and then processed. The default buffer size is 48MB. Tasks on the reduce end pull data and compute at the same time, so each pull can bring in at most 48MB; in most cases the buffer is only partially filled before the data is processed.

  • Increasing the reduce-side buffer reduces the number of pulls and improves Shuffle performance. However, when the data volume on the map end is very large and is written very quickly, every task on the reduce end may fill its buffer to the 48MB limit while the aggregation code running on the reduce end is simultaneously creating a large number of objects. Together these can cause a memory overflow, i.e. OOM.

  • If memory overflow occurs on the reduce end, reduce the size of the reduce-side pull buffer, for example to 12MB, as shown in the sketch after this list.

  • This problem has occurred in real production environments and is a typical case of trading performance for successful execution. With a smaller buffer the reduce end no longer runs out of memory, but it has to pull data more times, which means more network transmission overhead and worse performance.

  • Note: make sure the task can run first, and only then consider performance optimization.
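
The reduce-side pull buffer is controlled by the spark.reducer.maxSizeInFlight property (48m by default). A minimal sketch of lowering it to 12m, with the rest of the job configuration assumed unchanged:

val conf = new SparkConf()
  // spark.reducer.maxSizeInFlight controls the reduce-side pull buffer (default 48m);
  // a smaller value trades extra pull round-trips for a lower risk of OOM
  .set("spark.reducer.maxSizeInFlight", "12m")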

2 Troubleshooting 2: The shuffle file fails to be pulled due to JVM GC

  • In Spark jobs, a shuffle file not found error may occur. This is a very common error; after it happens, simply re-running the job is often enough to make it disappear.

  • During the Shuffle operation, a task of a later stage tries to fetch its data from the Executor that ran the corresponding task of the earlier stage. If that Executor happens to be performing GC, all work in the Executor, including the BlockManager and Netty-based network communication, is stopped. When the task cannot pull its data for a long time, it reports a shuffle file not found error; on the second run the Executor is usually no longer in GC, so the error does not reappear.

  • To work around this, increase the number of data pull retries on the reduce end and the wait interval between retries, so that a fetch that fails during a long GC pause has more chances to succeed:

val conf = new SparkConf()
  // spark.shuffle.io.maxRetries: maximum number of fetch retries (default 3)
  .set("spark.shuffle.io.maxRetries", "60")
  // spark.shuffle.io.retryWait: wait between retries (default 5s)
  .set("spark.shuffle.io.retryWait", "60s")

3 Troubleshooting 3: Solve various serialization errors

  • If an error occurs while a Spark job is running and the error message contains a word such as Serializable, the error is likely caused by a serialization problem.
  • Note the following points about serialization (a sketch follows this list):
    • Custom classes used as elements of an RDD must be serializable;
    • External custom variables used inside operator functions must be serializable;
    • Third-party types that do not support serialization, such as Connection, must not be used as RDD element types or inside operator functions.
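
A minimal sketch illustrating these three points; the Student class, the checkId variable, and sc (an existing SparkContext) are hypothetical names used only for illustration:

// Custom classes used as RDD elements or captured by operator functions must be serializable
class Student(val id: Int, val name: String) extends Serializable

val checkId = 1  // external variable captured by the filter closure; Int is serializable
val students = sc.parallelize(Seq(new Student(1, "a"), new Student(2, "b")))
val others = students.filter(_.id != checkId)

// A database Connection is not serializable, so do not capture one from the driver.
// Create such resources inside the operator, e.g. once per partition:
others.foreachPartition { iter =>
  // val conn = DriverManager.getConnection(...)  // created on the executor, per partition
  iter.foreach(s => println(s.name))
  // conn.close()
}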

4 Troubleshooting 4: Solve the problem of network adapter traffic surge caused by yarn-client mode

(Figure: the working principle of yarn-client mode)

  • In yarn-client mode, the Driver starts on the local machine from which the job is submitted. The Driver is responsible for scheduling all tasks and therefore communicates frequently with the many Executors in the YARN cluster.

  • Suppose there are 100 Executors and 1,000 tasks; each Executor is then assigned 10 tasks. The Driver has to communicate frequently with all 1,000 tasks running on the Executors, and both the volume and the frequency of this communication are very high. As a result, the network adapter traffic on the local machine may surge from the constant network communication while the Spark job is running.

  • Note that yarn-client mode should only be used in the test environment, and the reason to use it there is that it lets you view detailed log output locally. By reading the logs you can locate problems in the program and avoid carrying those faults into the production environment.

  • In the production environment, yarn-cluster mode must be used. In yarn-cluster mode the Driver runs inside the cluster, so the network adapter traffic on the local machine does not surge; if a network communication problem does occur in yarn-cluster mode, it is the cluster O&M team's job to resolve it. A sketch of selecting the deploy mode follows.
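
A minimal sketch of selecting the deploy mode through standard Spark configuration properties; in practice the mode is usually passed to spark-submit on the command line (--master yarn --deploy-mode cluster) rather than set in code:

val conf = new SparkConf()
  // run the Driver inside the YARN cluster so Driver-to-Executor traffic stays in the cluster
  .set("spark.master", "yarn")
  .set("spark.submit.deployMode", "cluster")  // use "client" only for testing and debugging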