Background

Some time ago, I worked on a streaming project. The scenario: incoming streaming data is filtered in real time against a set of filtering rules to produce result data. The stream is a continuous flow of IP addresses; IPs that match the qualified IP set are selected and sent on to downstream message-oriented middleware.

Technology selection

The upstream data is read from message-oriented middleware and processed with Spark Streaming; the results are then written back to message-oriented middleware downstream.

Broadcast variables

Application scenarios for broadcast variables

In a distributed computing framework such as Spark, if every operator needs to read the same variable and that variable is of non-trivial size (on the order of hundreds of records or more), it is worth broadcasting the variable to the executors instead of shipping it with every task.

Advantages of broadcast variables over external variables

  • Less network transfer of data between nodes. Once broadcast, the variable is present on every executor and does not change as tasks change. An external variable, by contrast, travels with each task: every task that uses it must pull it to its node again.
  • Smaller memory footprint. A broadcast variable is stored in memory shared by all tasks on an executor, so each executor holds exactly one copy. An external variable is stored per task: the more tasks an executor runs, the more copies of the variable it holds.

Ps: Spark executor memory management is covered in a previous article: blog.csdn.net/qq_35583915…

Use of broadcast variables

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val sparkConf = new SparkConf().setAppName("broadcast-in-spark")
sparkConf.set("spark-config-key", "spark-config-value")
val sparkSession = SparkSession.builder.config(sparkConf).enableHiveSupport().getOrCreate()
// Only one Spark context can exist in a Spark application, so the SparkContext
// used for broadcasting must be obtained from the SparkSession defined above;
// this avoids ending up with two context definitions.
val sparkContext = sparkSession.sparkContext
val broadcastUse = sparkContext.broadcast(useValue)
println(s"${broadcastUse.value}")
```
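To connect this back to the IP-filtering scenario, here is a minimal end-to-end sketch of reading a broadcast variable inside an operator. It runs Spark in local mode, and the names (`BroadcastFilterSketch`, `filterQualified`, `qualifiedIps`) are illustrative, not from the original project:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: broadcast the qualified-IP set once, then filter a batch of IPs
// against it inside a filter operator. Each executor keeps one shared copy
// of the set; tasks only read broadcastQualified.value.
object BroadcastFilterSketch {
  def filterQualified(ips: Seq[String], qualifiedIps: Set[String]): Array[String] = {
    val spark = SparkSession.builder
      .appName("broadcast-filter-sketch")
      .master("local[*]") // local mode, just for the sketch
      .getOrCreate()
    val sc = spark.sparkContext
    // One copy per executor, shared by all tasks on that executor.
    val broadcastQualified = sc.broadcast(qualifiedIps)
    val result = sc.parallelize(ips)
      .filter(ip => broadcastQualified.value.contains(ip)) // read-only access inside the operator
      .collect()
    spark.stop()
    result
  }
}
```

In a real Spark Streaming job the same `filter` would be applied to each micro-batch rather than to a parallelized sequence.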

Broadcast variable use tips

  • Treat a broadcast variable as a constant wherever possible. It is not that a broadcast variable cannot be modified, but modifying one takes two steps: 1. delete the existing broadcast variable; 2. redefine a new broadcast variable under the same name. So broadcast variables fit best where the value is effectively constant;
  • Broadcast variables cannot be defined, modified, or deleted inside an operator. Apart from reads, every operation on a broadcast variable needs the application's Spark context, i.e. sparkContext. The Spark context can only be defined and used in the Driver and cannot be serialized out to the executors. As you know, operators execute in the Executor, while operations such as handling external variables and configuring the Spark execution environment happen in the Driver. [ps: See an earlier article summarizing Driver/Executor and related concepts: blog.csdn.net/qq_35583915…]
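The two-step modification described in the first tip can be sketched as a driver-side helper. This is a sketch, not the original project's code: `BroadcastRefresh`, `refreshRules`, and `loadRules` are hypothetical names, and it uses the real `Broadcast.unpersist` API to release the old value before re-broadcasting:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

object BroadcastRefresh {
  // Runs in the Driver only: operators may read a broadcast variable,
  // but deleting and redefining it requires the SparkContext.
  def refreshRules(sc: SparkContext,
                   old: Broadcast[Set[String]],
                   loadRules: () => Set[String]): Broadcast[Set[String]] = {
    old.unpersist(blocking = true) // step 1: delete the cached copies on the executors
    sc.broadcast(loadRules())      // step 2: redefine with the freshly loaded value
  }
}
```

The caller reassigns its reference (e.g. a `var` holding the broadcast) to the returned value, so subsequent tasks see the new rule set.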