
Solution 3: Filter the key that causes skew

  • If a Spark job can tolerate discarding some data, you can filter out the keys that may cause the skew. The skewed data then never takes part in the computation, so data skew cannot occur in the job.
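This filtering idea can be sketched in plain Python, with a list of (key, value) pairs standing in for an RDD; the key names and data below are made up for illustration (in PySpark the same idea would be a single `rdd.filter(...)` call):

```python
# Plain-Python sketch of Solution 3: drop the keys known to cause skew.
# 'records' stands in for an RDD of (key, value) pairs; the data is illustrative.
records = [
    ("user_0", 1), ("user_0", 2), ("user_0", 3), ("user_0", 4),  # hot key
    ("user_1", 5),
    ("user_2", 6),
]

# Keys found (e.g. via countByKey) to cause the skew, and safe to discard.
skewed_keys = {"user_0"}

# In PySpark this would be: rdd.filter(lambda kv: kv[0] not in skewed_keys)
filtered = [kv for kv in records if kv[0] not in skewed_keys]

print(filtered)  # [('user_1', 5), ('user_2', 6)]
```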

Solution 4: Convert Reduce Join to Map Join

  • In normal cases, a join goes through a shuffle and is executed as a reduce join: all records with the same key, together with their values, are pulled into the same reduce task and joined there. The process of an ordinary join is shown in the figure below:

  • A normal join therefore involves a shuffle, which is equivalent to pulling all the data with the same key into one shuffle read task and then joining it; this is called a reduce join. However, if one RDD is small, you can broadcast the small RDD's full data and use a map operator to achieve the same effect as a join. In that case no shuffle takes place, so no data skew can occur.

(Note that an RDD itself cannot be broadcast. You must first collect the RDD's data into the Driver's memory and then broadcast it.)

  • The core idea:
    • Instead of using the join operator, use a broadcast variable together with a map operator to implement the join, which completely avoids the shuffle operation and therefore completely avoids data skew. Pull the data of the smaller RDD into the Driver's memory with the collect operator and create a broadcast variable from it. Then run a map operator over the other RDD; inside the operator function, fetch the smaller RDD's full data from the broadcast variable and compare each record of the current RDD with it by join key. If the keys match, connect the two records in the required way.

    • With this approach no shuffle operation occurs at all, which fundamentally eliminates the data skew caused by the join operation.

    • When a join operation suffers from data skew and one of the RDDs is small, this approach is preferred and works very well. The process of a map join is shown in the figure:

  • Inapplicable scenario analysis:
    • Each Executor keeps its own full copy of every Spark broadcast variable. If both RDDs are large, using the one with the larger amount of data as a broadcast variable may cause a memory overflow.
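The map-join approach above can be sketched in plain Python, with lists of pairs standing in for RDDs and a dict standing in for the broadcast variable; all names and data are illustrative (in PySpark the dict would come from something like `sc.broadcast(dict(small_rdd.collect()))`):

```python
# Plain-Python sketch of Solution 4 (map join). 'big' plays the large RDD,
# 'small' the small one; the data is made up for illustration.
big = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
small = [("a", "x"), ("b", "y")]  # small enough to fit in Driver memory

# Step 1: "collect" the small side to the Driver and broadcast it as a dict.
broadcast_small = dict(small)

# Step 2: do the join inside a map over the big side. No shuffle is needed,
# so a skewed key in 'big' cannot pile up in a single reduce task.
def map_side_join(kv):
    key, value = kv
    if key in broadcast_small:          # inner-join semantics
        return [(key, (value, broadcast_small[key]))]
    return []                           # unmatched keys are dropped

joined = [pair for kv in big for pair in map_side_join(kv)]
print(joined)  # [('a', (1, 'x')), ('b', (2, 'y')), ('a', (3, 'x'))]
```

Note that the map function returns a list (flatMap-style), so keys missing from the small side simply produce no output rather than an error.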

Solution 5: Sample the skewed key and join it separately

  • In Spark, if an RDD contains only a single key, the data for that key is scattered during the shuffle by default and processed by different reduce tasks.

  • If the data skew is caused by a single key, you can extract that key's data into a separate RDD and join it with the other RDD on its own. According to Spark's execution mechanism, the data in this RDD will then be dispersed across multiple tasks during the shuffle phase to perform the join. The process of joining the skewed key separately is shown in the figure:

  • 1. Application scenario analysis:

    • You can convert the data in the RDD into an intermediate table, or use countByKey() to see how much data corresponds to each key in the RDD. If you find that only one key accounts for the skew, consider this method.

    • When the data volume is very large, consider using sample to take about 10% of the data, analyze which keys in that sample may cause data skew, and then extract the data corresponding to those keys separately.

  • 2. Inapplicable scenario analysis:

    • If many keys in the RDD cause data skew, this scheme is not applicable.
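The whole scheme can be sketched in plain Python, with lists of pairs standing in for RDDs; `Counter` over a deterministic slice plays the role of countByKey on a sample, and all data and names are illustrative (in PySpark you would use `rdd.sample(...)`, `countByKey()`, `filter`, `join`, and `union`):

```python
# Plain-Python sketch of Solution 5: sample, find the skewed key, split the
# RDD, join each part separately, then union the results.
from collections import Counter

# 'left' stands in for the skewed RDD: 90 rows share the hot key "hot".
left = [("hot", i) for i in range(90)] + [("k1", 1), ("k2", 2)]
right = [("hot", "H"), ("k1", "A"), ("k2", "B")]

# Step 1: take roughly 10% of the data (every 10th row, as a deterministic
# stand-in for sample()) and count keys to find the skewed one.
sample = left[::10]
counts = Counter(k for k, _ in sample)
skewed_key, _ = counts.most_common(1)[0]

# Step 2: split the left side into the skewed part and the rest.
skewed = [kv for kv in left if kv[0] == skewed_key]
rest = [kv for kv in left if kv[0] != skewed_key]

# Step 3: join each part separately (a naive hash join here), then union.
# In Spark, the skewed part's own shuffle spreads its data across tasks.
def join(a, b):
    lookup = dict(b)
    return [(k, (v, lookup[k])) for k, v in a if k in lookup]

result = join(skewed, right) + join(rest, right)
print(skewed_key, len(result))
```

With this illustrative data the sample is dominated by "hot", so all 90 hot rows are joined in their own pass and the 2 remaining rows in another.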