[course links] – www.bilibili.com/video/av543…

I. Problem description

Data skew in Spark refers to skew during the Shuffle phase: different keys correspond to different amounts of data, so different tasks end up processing very different volumes of data. Note the distinction between data skew and data overload. Data skew means a few tasks are allocated most of the data, so only those few tasks run slowly. Data overload means all tasks are allocated a large amount of data, so all tasks run slowly.

Manifestations of data skew: a. Most tasks in the Spark job execute quickly, and only a limited number of tasks execute very slowly; in this case data skew may have occurred, and the job can run, but it runs very slowly. b. Most tasks in the Spark job execute quickly, but some tasks suddenly report an OOM error during execution; in this case data skew may have occurred, and the job cannot run normally.

Locating data skew: a. Check the Shuffle operators in the code, such as reduceByKey, countByKey, groupByKey, and join. b. Check the log file of the Spark job; errors in the log are recorded down to a specific line of code, which helps pinpoint where the skew occurs.

II. Solutions

1 Aggregate the original data

1.1 Avoiding the Shuffle Process

In most cases, the data for a Spark job is read from Hive tables, and the data in those tables is generated by the previous day's ETL. To avoid data skew, we can consider avoiding the Shuffle process altogether: if the data source is a Hive table, aggregate the data in the Hive table first. For example, group the data by key and concatenate all values corresponding to a key into a single string in a special format, so that each key has exactly one record. After that, any processing of all the values of a key can be done with a map operator; no Shuffle operation is required.
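Below is a minimal sketch of this idea, assuming a hypothetical Hive table etl_db.pre_aggregated whose ETL has already concatenated all values of each key into a comma-separated string column; the database, table, and column names are only illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Assumes a Hive-enabled Spark environment; the names etl_db.pre_aggregated,
// user_id and value_list are hypothetical.
val spark = SparkSession.builder()
  .appName("PreAggregatedRead")
  .enableHiveSupport()
  .getOrCreate()

// One row per key: the heavy grouping already happened during the Hive ETL,
// so reading and processing this table triggers no Shuffle in Spark.
val preAggregated = spark.sql("SELECT user_id, value_list FROM etl_db.pre_aggregated")

// All values of a key can now be handled with a plain map (narrow dependency).
val result = preAggregated.rdd.map { row =>
  val key    = row.getString(0)
  val values = row.getString(1).split(",")   // unpack the concatenated values
  (key, values.length)                       // e.g. count the values per key
}
```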

1.2 Reducing the Key Granularity

Reducing the key granularity decreases the amount of data processed by each task, but it also increases the number of keys, which may make the data skew worse.

1.3 Increasing the Key Granularity

Increasing the key granularity reduces the possibility of data skew, but increases the amount of data per task. If it is not possible to aggregate the data down to a single record per key, you can consider enlarging the aggregation granularity of keys in certain scenarios.
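As a minimal sketch of enlarging the key granularity, assume a hypothetical pair RDD keyed at (province, city) level where some city-level keys are heavily skewed; aggregating at the coarser province level merges many fine-grained keys into one. An existing SparkContext sc (for example in spark-shell) is assumed.

```scala
// Hypothetical data keyed at (province, city) granularity; assumes an
// existing SparkContext `sc`.
val byCity = sc.parallelize(Seq(
  (("zhejiang", "hangzhou"), 10.0),
  (("zhejiang", "ningbo"),    5.0),
  (("jiangsu",  "nanjing"),   8.0)
))

// Enlarge the key granularity: drop the city and aggregate per province,
// so many fine-grained keys collapse into one coarser key.
val byProvince = byCity
  .map { case ((province, _), amount) => (province, amount) }
  .reduceByKey(_ + _)
```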

2 Filter keys that cause skew

If certain data is allowed to be discarded during Spark jobs, you can filter out the data corresponding to keys that may cause data skew.
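A minimal sketch of the filtering approach, assuming an existing SparkContext sc and that the records under the skewed keys may safely be discarded; the sampling fraction and the count threshold are only illustrative.

```scala
import org.apache.spark.rdd.RDD

// Assumes an existing SparkContext `sc`.
val pairs: RDD[(String, Int)] = sc.parallelize(Seq(
  ("hot", 1), ("hot", 2), ("hot", 3), ("a", 1), ("b", 2)
))

// Sample the data to estimate which keys carry most of the records.
val sampledCounts = pairs.sample(withReplacement = false, fraction = 0.6)
  .countByKey()
val skewedKeys = sampledCounts.filter { case (_, cnt) => cnt >= 2 }.keySet

// Drop the skewed keys before the shuffle operation.
val filtered = pairs.filter { case (k, _) => !skewedKeys.contains(k) }
val result   = filtered.reduceByKey(_ + _)
```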

3 Increase the reduce-side parallelism of the Shuffle operation

If methods 1 and 2 have no effect, consider increasing the parallelism on the reduce side of the Shuffle. A higher reduce-side parallelism increases the number of reduce tasks, so each task is allocated less data, which alleviates the data skew. Most Shuffle operators accept a parallelism parameter, such as reduceByKey(500); this parameter determines the reduce-side parallelism of the Shuffle, and the specified number of reduce tasks is created during the Shuffle operation. For Shuffle operations in Spark SQL statements, such as group by and join, set the parallelism with sparkConf.set("spark.sql.shuffle.partitions", "50"); the default value is 200. By increasing the number of shuffle read tasks, multiple keys originally assigned to one task can be distributed across multiple tasks, so each task processes less data.
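A minimal sketch of both settings, assuming an existing SparkContext sc and SparkSession spark; the values 500 and 50 are just illustrative.

```scala
// Assumes an existing SparkContext `sc` and SparkSession `spark`.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Pass the desired reduce-side parallelism directly to the shuffle operator:
// this shuffle will use 500 reduce tasks.
val summed = pairs.reduceByKey(_ + _, 500)

// For Spark SQL shuffles (group by, join, ...), set the shuffle partition
// number on the configuration instead; the default is 200.
spark.conf.set("spark.sql.shuffle.partitions", "50")
```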

4 Use random keys to achieve double aggregation

When operators such as groupByKey and reduceByKey are used, you can consider using random keys to achieve double (two-stage) aggregation: first add a random number as a prefix to each key to form a new key and perform a first aggregation, then remove the prefix to restore the original key and aggregate again.
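A minimal sketch of the two-stage aggregation, assuming an existing SparkContext sc; the prefix range of 10 is illustrative.

```scala
import scala.util.Random

// Assumes an existing SparkContext `sc`.
val pairs = sc.parallelize(Seq(("hot", 1), ("hot", 1), ("hot", 1), ("other", 1)))

// Stage 1: prefix each key with a random number so one hot key is split
// across several reduce tasks, then do a partial aggregation.
val prefixed            = pairs.map { case (k, v) => (s"${Random.nextInt(10)}_$k", v) }
val partiallyAggregated = prefixed.reduceByKey(_ + _)

// Stage 2: strip the prefix to restore the original key and aggregate again
// to obtain the final, global result per key.
val restored        = partiallyAggregated.map { case (k, v) => (k.split("_", 2)(1), v) }
val fullyAggregated = restored.reduceByKey(_ + _)
```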

5 Convert reduce Join to Map Join

The data of the relatively small RDD is pulled directly into the memory of the Driver through the collect operator, and a Broadcast variable is created for it. Then a map-class operator is applied to the other RDD; inside the operator function, the full data of the smaller RDD is obtained from the Broadcast variable, and the key of each record in the current RDD is compared against the keys of the broadcast data. If the join keys are the same, the records of the two RDDs are connected in the required way. Note that an RDD itself cannot be broadcast; only the data inside the RDD can be pulled to the memory of the Driver through collect and then broadcast.
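A minimal sketch of the broadcast-based map join, assuming an existing SparkContext sc and that rdd2 is the small RDD whose data fits in the Driver's and executors' memory.

```scala
// Assumes an existing SparkContext `sc`.
val rdd1 = sc.parallelize(Seq((1, "order-a"), (1, "order-b"), (2, "order-c")))
val rdd2 = sc.parallelize(Seq((1, "user-x"), (2, "user-y")))   // the small RDD

// Pull the small RDD's data to the Driver and broadcast it (the RDD itself
// cannot be broadcast, only the collected data can).
val smallMap       = rdd2.collectAsMap()
val smallBroadcast = sc.broadcast(smallMap)

// Join inside a map-class operator: look each key up in the broadcast map,
// so no Shuffle is needed and the skewed key never crosses the network.
val joined = rdd1.flatMap { case (k, v1) =>
  smallBroadcast.value.get(k) match {
    case Some(v2) => Seq((k, (v1, v2)))
    case None     => Seq.empty
  }
}
```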

6 Sample the skewed key and join it separately

In Spark, if an RDD has only one key, the data corresponding to that key is scattered by default during the Shuffle process and is handled by different reduce-side tasks. Therefore, when the data skew is caused by a single key, you can extract the key that causes the skew into a separate RDD and perform the join on it separately.
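A minimal sketch of splitting out the skewed key, assuming an existing SparkContext sc; here the skewed key is estimated simply as the most frequent key in a small sample.

```scala
// Assumes an existing SparkContext `sc`; "hot" plays the role of the skewed key.
val rdd1 = sc.parallelize(Seq(("hot", 1), ("hot", 2), ("a", 3)))
val rdd2 = sc.parallelize(Seq(("hot", "x"), ("a", "y")))

// Estimate the heaviest key by sampling (here simply the most frequent key).
val skewedKey = rdd1.sample(withReplacement = false, fraction = 0.6)
  .countByKey()
  .maxBy(_._2)._1

// Split both RDDs into the skewed part and the rest, join them separately,
// then union the results. The skewed join can additionally be salted.
val skewed1    = rdd1.filter { case (k, _) => k == skewedKey }
val common1    = rdd1.filter { case (k, _) => k != skewedKey }
val skewedJoin = skewed1.join(rdd2.filter { case (k, _) => k == skewedKey })
val commonJoin = common1.join(rdd2)
val joined     = skewedJoin.union(commonJoin)
```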

7 Use random numbers and expansion to join

If a large number of keys in an RDD cause data skew during a join, you can dilute one RDD with random prefixes and expand the other RDD before the join. For example, for rdd1.join(rdd2): add a random prefix from 1 to 10 to each key in rdd1, then expand rdd2 by a factor of 10 by adding each of the prefixes 1 to 10 to every key, and then join. Limitation: if both RDDs are very large, expanding one of them by a factor of N will not work.
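A minimal sketch of this salt-and-expand join, assuming an existing SparkContext sc; the expansion factor of 10 matches the 1-to-10 prefixes described above.

```scala
import scala.util.Random

// Assumes an existing SparkContext `sc`.
val rdd1 = sc.parallelize(Seq(("hot", 1), ("hot", 2), ("a", 3)))
val rdd2 = sc.parallelize(Seq(("hot", "x"), ("a", "y")))

// Dilute rdd1: prepend a random prefix in [1, 10] to every key.
val salted1 = rdd1.map { case (k, v) => (s"${Random.nextInt(10) + 1}_$k", v) }

// Expand rdd2: create 10 copies of every record, one per possible prefix,
// so each salted key in rdd1 still finds its matching row.
val expanded2 = rdd2.flatMap { case (k, v) =>
  (1 to 10).map(prefix => (s"${prefix}_$k", v))
}

// Join on the salted keys, then strip the prefix to restore the original key.
val joined = salted1.join(expanded2)
  .map { case (saltedKey, value) => (saltedKey.split("_", 2)(1), value) }
```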