This is the 7th day of my participation in the November Gwen Challenge. Check out the details: The Last Gwen Challenge of 2021.
Last time we used Spark Streaming to read and process data from Redis; this time let's solve the output problem. The simplest option is to call the print function on the computed result object, which prints to the console of the running job. But I run the program in a remote distributed environment, pushing it out to the cluster, so I need a more convenient way to observe what it is doing.
Spark Streaming does offer ways to save results as files, such as saveAsTextFiles, but they are not as convenient as you might expect.
Beyond these, Spark Streaming provides the foreachRDD function, with which you can customize the output.
For ease of use, I decided to use Redis's publish/subscribe mechanism: push the data to Redis, broadcast it externally, and read it with other programs, in effect a bus-like mechanism. Kafka would be the heavyweight alternative; it persists data so messages can be pulled at any time, whereas Redis pub/sub delivers only in real time. But as far as custom output is concerned, the principle is the same.
foreachRDD is an operation on RDDs, and it behaves essentially like the other computation operations. This brings up a classic situation: RDD operations are distributed, which means the same function runs on different machines. If we pass the Redis connection into this function directly, we get a serialization error, because Spark would need to serialize the connection object and ship it to the executors, which is not possible.
It is also worth noting that creating a Redis connection for every record in the RDD is not recommended either. Although this avoids the serialization problem described above, establishing a connection object for each record incurs a lot of redundant overhead.
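To make these two pitfalls concrete, here is a sketch of what not to do. The host name "master" and port 47777 are taken from the working example below; `behaviorCount` is the DStream from the earlier processing step.

```scala
import redis.clients.jedis.Jedis

// Anti-pattern 1: connection created on the driver and captured by the
// closure. Jedis is not serializable, so Spark fails with a
// NotSerializableException when it tries to ship the task to executors.
val jedis = new Jedis("master", 47777)
behaviorCount.foreachRDD { rdd =>
  rdd.foreach(record => jedis.publish("channel", record.toString))
}

// Anti-pattern 2: one connection per record. This serializes fine because
// the connection is created on the executor, but opening and closing a
// connection for every single record is very wasteful.
behaviorCount.foreachRDD { rdd =>
  rdd.foreach { record =>
    val perRecordJedis = new Jedis("master", 47777)
    perRecordJedis.publish("channel", record.toString)
    perRecordJedis.close()
  }
}
```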
The appropriate approach exploits the distributed nature of RDDs. An RDD uses the classic partition computing model: each node processes the data in its own partitions. Establishing one connection object per partition is generally acceptable. We can send data to Redis as follows.
```scala
behaviorCount.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // One connection per partition, created on the executor side
    val jedis = new Jedis("master", 47777)
    partitionOfRecords.foreach { record =>
      jedis.publish("channel", record.toString)
    }
    jedis.close()
  }
}
```
If you want more efficiency, you can also use connection pooling: retrieve connections from a pool instead of creating a new one per partition, further reducing the overhead.
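A minimal sketch of the pooled variant, assuming the Jedis client library and the same host/port as above. `RedisConnectionPool` is a hypothetical helper object: the lazy val ensures one pool is created per executor JVM, not serialized from the driver.

```scala
import redis.clients.jedis.{JedisPool, JedisPoolConfig}

// Hypothetical pool holder; instantiated lazily on each executor
object RedisConnectionPool {
  lazy val pool = new JedisPool(new JedisPoolConfig(), "master", 47777)
}

behaviorCount.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val jedis = RedisConnectionPool.pool.getResource
    try {
      partitionOfRecords.foreach(record => jedis.publish("channel", record.toString))
    } finally {
      // For a pooled Jedis instance, close() returns it to the pool
      jedis.close()
    }
  }
}
```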
Another question is how to group the data by batch and tell when it was computed. foreachRDD helps here too: it has an overload that takes two arguments, the RDD followed by the batch's timestamp, which can also be sent to Redis for other programs to handle.
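A sketch of the two-argument form, assuming the same channel and host/port as before. Prefixing each message with the batch time is one possible convention, not something the source prescribes.

```scala
import org.apache.spark.streaming.Time
import redis.clients.jedis.Jedis

behaviorCount.foreachRDD { (rdd, time: Time) =>
  rdd.foreachPartition { partitionOfRecords =>
    val jedis = new Jedis("master", 47777)
    partitionOfRecords.foreach { record =>
      // Prefix each message with the batch time (milliseconds since epoch)
      // so downstream consumers can group records by batch
      jedis.publish("channel", s"${time.milliseconds},${record.toString}")
    }
    jedis.close()
  }
}
```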