This is the 21st day of my participation in the More text Challenge. For more details, see more text Challenge

Adjust operator 1: mapPartitions

The ordinary MAP operator operates on each element in the RDD, while the mapPartitions operator operates on each partition in the RDD. If it is a common map operator, assuming that a partition has 10,000 data, then the function in the map operator is executed 10,000 times, that is, each element is operated on.

If the mapPartition operator is used, because a task processes a PARTITION of an RDD, a task executes function only once. Function receives all partition data at a time, which is highly efficient.

For example, if you want to write all the data in the RDD through JDBC, if you use the MAP operator, you need to create a database connection for each element in the RDD, which consumes a lot of resources. If you use the mapPartitions operator, you only need to establish one database connection for the data in a partition.

The mapPartitions operator also has some disadvantages: for normal MAP operations, one piece of data is processed at a time. If memory is insufficient after 2000 pieces of data are processed, the 2000 pieces of data that have been processed can be garbage collected from memory. However, if the mapPartitions operator is used and the data volume is very large, function processes the data of one partition at a time. If the memory is insufficient, the memory cannot be reclaimed at this time, and the OOM may occur, that is, memory overflow.

Therefore, the mapPartitions operator is suitable for the time when the amount of data is not very large, and the performance improvement of using mapPartitions is good. (If there is a large amount of data, once you use the mapPartitions operator, it will go directly to OOM)

In your project, you should first estimate the amount of RDD data, the amount of data per partition, and the memory resources allocated to each Executor, and consider using the mapPartitions operator instead of maps if resources permit.

Operator tuning 2: foreachPartition optimizes database operations

In a production environment, the foreachPartition operator is used to write data to the database. The foreachPartition operator can optimize the database write performance.

If you use the foreach operator to perform database operations, the foreach operator traverses each data of the RDD. Therefore, a database connection is established foreach data, which is a great waste of resources. Therefore, you should use the foreachPartition operator to write data to the database.

Much like the mapPartitions operator, foreachPartition processes each partition of the RDD as a traversal object, one partition at a time. That is, if database-related operations are involved, only one database connection needs to be created foreachPartition, as shown in the figure below:

  • After using the foreachPartition operator, you can obtain the following performance improvements:
    • For the function we wrote, the data is processed one partition at a time;
    • Create a unique database connection for data within a partition.
    • You only need to send one SQL statement and multiple sets of parameters to the database;

In a production environment, all database operations are performed using the foreachPartition operator. The foreachPartition operator has a problem. Similar to the mapPartitions operator, if a partition has a large amount of data, it may cause memory overflow.

Operator tuning 3: Repartition solves the SparkSQL low parallelism problem

In the first section of General Performance Tuning, we explained how to adjust the parallelism. However, the parallelism setting does not take effect for Spark SQL. The parallelism set by the user only takes effect for all Spark stages except Spark SQL.

Users cannot specify the parallelism of Spark SQL. By default, Spark SQL automatically sets the parallelism of the stage where Spark SQL resides based on the split number of HDFS files corresponding to hive tables. The users themselves through spark. Default. Parallelism parameter specifies the parallelism, will only take effect in no spark SQL stage.

The parallelism of the Spark SQL stage cannot be manually set. If the amount of data in the stage is large, the subsequent transformation operation in the stage has complex service logic, and the number of tasks automatically set by Spark SQL is small, This means that each task has to process a large amount of data and then perform very complex processing logic, which may mean that the first stage with Spark SQL is very slow and the subsequent stages without Spark SQL are very fast.

To solve the problem that Spark SQL cannot set the parallelism and number of tasks, we can use the repartition operator.

The parallelism and number of tasks in the Spark SQL step cannot be changed. However, for the RDD queried by Spark SQL, use the repartition operator to repartition the RDD immediately. In this way, the RDD can be repartited into multiple partitions. Since Spark SQL is no longer designed for RDD operations after repartition, the parallelicity of the stage is equal to the value you set manually. In this way, the stage where Spark SQL resides can only process a large amount of data and execute complex algorithm logic with a few tasks.