Spark Streaming
When it comes to Spark Streaming, we have to start with Spark. In today’s big data world, Spark has long been a well-known framework for big data processing and analysis. From its very beginning, Spark used in-memory computing and a DAG execution model to simplify and dramatically speed up big data processing, and compared with the traditional batch processing framework Hadoop MapReduce, it quickly became the most prominent framework in the big data field.
Later, with the rise of stream computing technology, Spark, having achieved great success in batch processing, began to extend its reach into the field of stream computing, and thus Spark Streaming was born. Spark Streaming is a stream computing framework built on Spark’s batch processing technology, providing scalable, high-throughput, and fault-tolerant stream data processing capabilities.
From the system architecture of Spark Streaming, we can see how Spark Streaming’s stream computing technology carries on from Spark’s batch processing technology.
System architecture
The following figure describes the working principle of Spark Streaming:
When Spark Streaming receives stream data, it first divides the data into small batches, each represented as an RDD (Resilient Distributed Dataset); every RDD is in effect a small block of data. The Spark engine then processes these RDD blocks, and the processed results are likewise output in sequence as RDD blocks.
Therefore, Spark Streaming is, in essence, continuous batch processing of stream data after it has been divided into blocks.
The description of streams
Next, let’s look at how to describe a stream computation process in Spark Streaming.
Spark Streaming is built on Spark, and the core of Spark is an execution engine that batch-processes RDD data blocks. Therefore, Spark Streaming uses the concept of a “template” when describing streams.
Spark Streaming describes the stream computing process with the following concepts:
- The first is the RDD. It is a core concept of the Spark engine: it represents a data set and serves as the unit of data processing in Spark.
- Then there is the DStream. It is Spark Streaming’s abstraction of a stream and represents a continuous data stream. In the Spark Streaming system, a DStream is the template for RDDs of the same class, so we can also regard a DStream as a sequence of RDDs of the same class, with each RDD representing the stream data within one interval.
- Next comes the Transformation. It represents Spark Streaming’s processing logic for a DStream (RDDs of the same class). DStream provides a rich set of Transformation APIs, including map, flatMap, filter, reduce, union, join, transform, and updateStateByKey. With these APIs, you can perform various transformations on DStreams and turn one data stream into another.
- Then we have the DStreamGraph. It is Spark Streaming’s description of the stream computing process, in other words its DAG. In the Spark Streaming system, the DStreamGraph represents the template of the RDD processing DAG. DStreams are connected into a DStreamGraph through Transformations, which is the equivalent of drawing a DAG with points and lines.
- And finally, Output Operations. These are the operations by which Spark Streaming outputs a DStream to an external system such as a console, database, or file system. The Output Operations currently supported by DStream include print, saveAsTextFiles, saveAsObjectFiles, saveAsHadoopFiles, and foreachRDD. Because these operations trigger access to external systems, they are what actually trigger the execution of a DStream’s various transformations.
In short, in Spark Streaming the DStreamGraph plays the role of the DAG, DStreams are the “points” in this DAG, and Transformations and Output Operations are the “lines” connecting them, i.e., the places where processing tasks are actually performed.
The processing of streams
Next, let’s look at how streams are processed in Spark Streaming. As we did with Storm, we will discuss four aspects: stream input, stream processing, stream output, and backpressure.
The first is the input to the stream. Spark Streaming provides three ways to create input data streams.
- The first is basic data sources, where the input data stream is built directly through the related APIs of StreamingContext. These APIs typically build input data streams from sockets, files, or memory, for example socketTextStream, textFileStream, queueStream, and so on.
- The second is advanced data sources, where the input data stream is built from message middleware or message sources such as Kafka, Flume, and Kinesis through external tool classes.
- The third is custom data sources: by implementing the org.apache.spark.streaming.receiver.Receiver abstract class, users can implement their own data sources (a minimal sketch follows this list).
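For the third case, here is a minimal sketch of a custom receiver; the class name MyReceiver and the emitted record are made up for illustration, and a real receiver would read from an actual external source:
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.receiver.Receiver;

public class MyReceiver extends Receiver<String> {
    public MyReceiver() {
        super(StorageLevel.MEMORY_AND_DISK_2());
    }

    @Override
    public void onStart() {
        // Start a thread that reads from the external source and hands records to Spark via store()
        new Thread(() -> {
            while (!isStopped()) {
                store("some record"); // replace with data read from the real source
            }
        }).start();
    }

    @Override
    public void onStop() {
        // The reading thread checks isStopped() and exits on its own
    }
}

// The custom receiver is then turned into an input DStream:
// JavaReceiverInputDStream<String> custom = jssc.receiverStream(new MyReceiver());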
Since Spark Streaming uses DStream to represent data streams, the input data stream is also represented as a DStream. For example, the following code builds an input stream from a basic data source:
SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("WordCountExample");
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
In the above code, we use socketTextStream to create an input stream that receives text data from local port 9999.
Then there is the processing of the stream. Spark Streaming streams are processed through DStream’s various transformation APIs. DStream’s transformation operations generally fall into three categories.
- The first category is common transformation operations, such as map, filter, reduce, count, transform, and so on.
- The second category is operations related to the stream data state, such as union, join, cogroup, window, and so on.
- The third category is operations related to the stream information state, currently updateStateByKey and mapWithState.
Here is an example of how to transform a DStream:
// Divide each line into words and count the number of words
JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(x.split(" ")).iterator());
JavaPairDStream<String, Integer> pairs = words.mapToPair(s -> new Tuple2<>(s, 1));
JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey((i1, i2) -> i1 + i2);
In the above code, we start from the text stream lines read from the socket. We first use flatMap to split each line into words, converting lines into the word stream words. The words stream is then converted into the counting-tuple stream pairs using mapToPair. Finally, reduceByKey groups by word and counts the occurrences, converting pairs into the word-count stream wordCounts.
Next is the output of the stream. Spark Streaming allows a DStream to be output to external systems, which is done through DStream’s various output operations. A DStream’s output operations can write data to a console, a file system, or a database, among others.
At present, the output operations of DStream include print, saveAsTextFiles, saveAsHadoopFiles, and foreachRDD, among others. foreachRDD is a general-purpose DStream output interface; users can use foreachRDD to customize Spark Streaming’s output. The following example demonstrates printing the word-count stream to the console.
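A minimal sketch, continuing the wordCounts stream from the example above:
// Print the first ten elements of each batch of wordCounts to the console
wordCounts.print();

// Start the streaming computation and wait for it to finish
// (awaitTermination declares InterruptedException, so call it from a method that handles it)
jssc.start();
jssc.awaitTermination();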
Finally, backpressure. Earlier versions of Spark Streaming did not support backpressure, but since Spark 1.5 a backpressure mechanism has been introduced, making Spark Streaming more complete as a streaming system. Spark Streaming’s backpressure is disabled by default; to use it, set spark.streaming.backpressure.enabled to true.
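For example, backpressure can be enabled on the SparkConf created earlier; the initial rate shown below is optional and its value is made up:
// Let the receiving rate adapt dynamically to the processing rate
conf.set("spark.streaming.backpressure.enabled", "true");
// Optionally cap the receiving rate before the first rate estimate is available
conf.set("spark.streaming.backpressure.initialRate", "1000");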
In general, Spark Streaming’s backpressure borrows the idea of the PID controller from industrial control. Its working principle is as follows.
- First, after each batch of data is processed, Spark collects the processing end time, processing delay, waiting delay, and number of processed messages for that batch.
- The processing speed is then estimated from these statistics, and the estimate is communicated to the data producer.
- Finally, based on the estimated processing speed, the data producer dynamically adjusts its production speed to match the processing speed.
The state of streams
Next, let’s look at the state of streams in Spark Streaming. In Spark Streaming, the state management of streams is also implemented through the transformation operations provided by DStream.
Let’s look at the stream data state first. Since a DStream itself divides the data stream into RDDs for batch processing, Spark Streaming naturally needs to cache data and manage its state; in other words, the RDDs that make up a DStream are themselves a form of stream data state. DStream provides a number of window-related APIs for window management of stream data, and implements window-based aggregation functions such as count and reduce. In addition, DStream provides APIs such as union, join, and cogroup to associate multiple streams. The window-related and association-related APIs above are Spark Streaming’s support for the stream data state.
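As an illustration, here is a minimal sketch of a window-based aggregation that continues the pairs stream from the word-count example above; the 30-second window and 10-second slide are made-up values:
// Count each word over the last 30 seconds of data, computed every 10 seconds
JavaPairDStream<String, Integer> windowedCounts =
    pairs.reduceByKeyAndWindow((i1, i2) -> i1 + i2, Durations.seconds(30), Durations.seconds(10));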
After the stream data state, let’s look at the stream information state. DStream’s updateStateByKey and mapWithState operations provide the means of managing the stream information state. Both updateStateByKey and mapWithState can record historical information by key and update it when new data arrives.
The difference is that updateStateByKey returns all of the recorded history, whereas mapWithState only returns the information that was updated while processing the current batch of data. It is as if the former returns a complete histogram, while the latter returns only the bars of the histogram that changed.
As you can see, mapWithState performs much better than updateStateByKey. Functionally, too, unless you need the complete history for scenarios such as report generation, mapWithState is more appropriate for most real-time streaming applications.
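Here is a minimal sketch of mapWithState that keeps a running count per word, continuing the pairs stream from the word-count example; it assumes the Spark 2.x Java API, and names such as mappingFunc are made up for illustration:
import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.Function3;
import org.apache.spark.streaming.State;
import org.apache.spark.streaming.StateSpec;
import org.apache.spark.streaming.api.java.JavaMapWithStateDStream;

// For each word, add the new count to the stored state and emit the updated total
Function3<String, Optional<Integer>, State<Integer>, Tuple2<String, Integer>> mappingFunc =
    (word, one, state) -> {
        int sum = one.orElse(0) + (state.exists() ? state.get() : 0);
        state.update(sum);
        return new Tuple2<>(word, sum);
    };

// Note: stateful transformations require a checkpoint directory, set via jssc.checkpoint(...)
JavaMapWithStateDStream<String, Integer, Integer, Tuple2<String, Integer>> runningCounts =
    pairs.mapWithState(StateSpec.function(mappingFunc));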
Message processing reliability
Finally, let’s look at the reliability of message processing in Spark Streaming. The reliability of Spark Streaming is determined by three parts: data receiving, data processing, and data output. Since version 1.2, Spark has provided a Write Ahead Log (WAL) mechanism that saves received data to fault-tolerant storage. When the WAL is turned on and combined with reliable data receivers such as Kafka, Spark Streaming can provide an “at least once” guarantee for data receiving. Starting with version 1.3, Spark introduced the Kafka Direct API, which makes “exactly once” data receiving possible.
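For example, the WAL can be turned on through configuration. A minimal sketch, in which the checkpoint path is a made-up placeholder:
// Save received data to a write-ahead log in fault-tolerant storage
conf.set("spark.streaming.receiver.writeAheadLog.enable", "true");
// The WAL is written under the checkpoint directory, so one must be set
jssc.checkpoint("hdfs://namenode:8020/spark/checkpoints");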
The data processing of Spark Streaming is performed on RDDs, and RDDs provide “exactly once” message processing. Therefore, in the data processing part, Spark Streaming naturally has an “exactly once” message reliability guarantee.
However, the data output of Spark Streaming only has an “at least once” reliability guarantee. That is, the processed data may be output to external systems more than once.
In some scenarios, this won’t be a problem. For example, if the output is saved to a file system, sending the same result again simply overwrites the data that was written before.
But in other scenarios, such as incrementally updating a database based on the output, some additional deduplication handling may be required. One possible approach is to attach a unique identifier to each RDD to represent the batch of data, and then use this identifier to check whether the batch has already been written to the database. Of course, the database writes then need to use transactions to guarantee integrity. A sketch of this idea follows.
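Here is a minimal sketch of that approach using foreachRDD; alreadyCommitted and writeWithTransaction are hypothetical placeholders for database-specific code:
// Output each batch with a unique identifier so the write can be made idempotent
wordCounts.foreachRDD((rdd, time) -> {
    // Use the batch time as the unique identifier for this batch of data
    String batchId = String.valueOf(time.milliseconds());
    rdd.foreachPartition(partition -> {
        // Hypothetical helpers: skip the batch if it was already committed,
        // otherwise write this partition's records in a single transaction
        // if (!alreadyCommitted(batchId)) {
        //     writeWithTransaction(batchId, partition);
        // }
    });
});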