1 Spark Streaming does not lose data on its first run
- Set the Kafka parameter auto.offset.reset to earliest so that the first run starts consuming from the earliest offset
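Below is a minimal sketch of the Kafka consumer configuration, assuming the spark-streaming-kafka-0-10 integration; the broker addresses, group id, and topic are placeholders.

```scala
import org.apache.kafka.common.serialization.StringDeserializer

// Kafka consumer config for a first run: start from the earliest offset
// so no data already in the topic is skipped (broker/group names are placeholders).
val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker1:9092,broker2:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "demo-consumer-group",
  "auto.offset.reset"  -> "earliest",                 // read from the earliest offset on first start
  "enable.auto.commit" -> (false: java.lang.Boolean)  // commit offsets manually (see section 2)
)
```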
2 Spark Streaming exactly-once consumption
- Manually maintain offsets
- Commit the offset only after the business data has been processed
In extreme cases, a crash or power failure during the offset commit causes the data to be consumed again when the Spark program restarts. Therefore, transactions are used to guarantee exactly-once consumption in scenarios with high requirements on monetary amounts or accuracy.
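A sketch of the "process first, commit afterwards" pattern with the spark-streaming-kafka-0-10 API is shown below; `stream` is the DStream returned by createDirectStream (see the Direct-mode sketch later in this post), and the business logic is a placeholder. Storing offsets and results in one external transaction (e.g. in a database) follows the same structure, with the transactional write replacing commitAsync.

```scala
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  // Capture the offset ranges of this batch before processing.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // 1. Process the business data first (placeholder logic, runs on the executors).
  rdd.map(record => record.value()).foreach(println)

  // 2. Only after processing succeeds, commit the offsets back to Kafka.
  //    If the job dies between step 1 and step 2, the batch is re-consumed on
  //    restart; use an external transaction when duplicates are not acceptable.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```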
3 Spark Streaming controls the per-second consumption rate
- The spark.streaming.kafka.maxRatePerPartition parameter sets the maximum number of records per second that Spark Streaming pulls from each Kafka partition (see the config sketch after the next section)
4 Spark Streaming backpressure mechanism
- Setting spark.streaming.backpressure.enabled to true turns on backpressure. With it enabled, Spark Streaming dynamically adjusts the rate at which it consumes data from Kafka according to the processing delay, capped by the limit set with spark.streaming.kafka.maxRatePerPartition, so the two parameters are typically used together
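A minimal config sketch combining the two parameters; the app name, rate limit, and batch interval are example values.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Backpressure and the per-partition rate limit are usually set together:
// backpressure adapts the rate to the processing delay, and maxRatePerPartition
// caps how many records per second each Kafka partition may deliver.
val conf = new SparkConf()
  .setAppName("RateLimitDemo")
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.kafka.maxRatePerPartition", "1000") // records/second/partition (example value)

val ssc = new StreamingContext(conf, Seconds(5)) // 5-second batches (example interval)
```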
5 Spark Streaming stage duration
- The time a Spark Streaming stage takes is determined by its slowest task. Therefore, when data is skewed, one task runs slowly and drags down the whole Spark Streaming job.
6 Spark Streaming graceful shutdown
- When spark.streaming.stopGracefullyOnShutdown is set to true, Spark shuts down the StreamingContext gracefully when the JVM exits normally, instead of stopping it immediately
Kill command: yarn application -kill <applicationId>
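A short sketch of enabling graceful shutdown; the app name is a placeholder, and the kill command above then lets the in-flight batch finish before the context stops.

```scala
import org.apache.spark.SparkConf

// With stopGracefullyOnShutdown enabled, a normal JVM shutdown lets the
// current batch finish before the StreamingContext is stopped,
// instead of killing it mid-batch.
val conf = new SparkConf()
  .setAppName("GracefulShutdownDemo")
  .set("spark.streaming.stopGracefullyOnShutdown", "true")

// Programmatic alternative, stopping gracefully from code:
// ssc.stop(stopSparkContext = true, stopGracefully = true)
```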
7 Spark Streaming default number of partitions
- The default number of Spark Streaming partitions equals the number of partitions of the Kafka topic it consumes. Generally, Spark Streaming does not use the repartition operator to increase partitions, because repartition introduces a shuffle and costs extra time
What are the different ways Spark Streaming can consume data from Kafka?
- Based on Receiver mode
This approach uses a Receiver to fetch the data. The Receiver is implemented with Kafka’s high-level consumer API. It receives data from Kafka and stores it in the Spark executors’ memory, and the Spark Streaming job then processes that data.
However, with the default configuration, this approach can lose data when the underlying infrastructure fails. To achieve high reliability and zero data loss, Spark Streaming must enable the Write Ahead Log (WAL). This mechanism synchronously writes the received Kafka data to a write-ahead log on a distributed file system such as HDFS, so even if an underlying node fails, the data can be recovered from the write-ahead log.
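A hedged sketch of the receiver-based approach, assuming the older spark-streaming-kafka-0-8 integration; the ZooKeeper quorum, group id, topic, and checkpoint path are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf()
  .setAppName("ReceiverModeDemo")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true") // enable WAL for zero data loss

val ssc = new StreamingContext(conf, Seconds(5))
ssc.checkpoint("hdfs:///checkpoint/receiver-demo") // the WAL is kept under the checkpoint directory

// Receiver mode: consume through ZooKeeper with Kafka's high-level consumer API.
// Map("demo-topic" -> 1) means one receiver thread for that topic.
val lines = KafkaUtils.createStream(
  ssc,
  "zk1:2181,zk2:2181",             // ZooKeeper quorum (placeholder)
  "demo-consumer-group",           // consumer group id (placeholder)
  Map("demo-topic" -> 1),
  StorageLevel.MEMORY_AND_DISK_SER // serialized storage, pairs well with the WAL
)

lines.map(_._2).print() // elements are (key, message) pairs
ssc.start()
ssc.awaitTermination()
```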
- Based on Direct mode
This direct approach, not based on a Receiver, was introduced in Spark 1.3 to provide a more robust mechanism. Instead of using a Receiver to receive data, it periodically queries Kafka for the latest offset of each topic+partition, thereby defining the offset range of each batch. When the batch’s processing job starts, Kafka’s simple consumer API is used to fetch the data in those offset ranges from Kafka.
The advantages are as follows:
Simplified parallel reading: to read multiple partitions you do not need to create multiple input DStreams and union them. Spark creates as many RDD partitions as there are Kafka partitions and reads from Kafka in parallel, so there is a one-to-one mapping between Kafka partitions and RDD partitions.
High performance: to guarantee zero data loss in receiver mode, the WAL mechanism must be enabled, which is inefficient because the data is effectively copied twice: Kafka already keeps a highly reliable replica of the data, and the receiver copies it again into the WAL. Direct mode does not rely on a Receiver and does not need the WAL; as long as the data is replicated in Kafka, it can be recovered from the Kafka replicas.
Once-and-only-once transaction mechanism: elaborated in the contrast below.
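A minimal Direct-mode sketch based on the spark-streaming-kafka-0-10 integration; the broker, topic, and group id are placeholders.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("DirectModeDemo")
val ssc  = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker1:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "demo-consumer-group",
  "auto.offset.reset"  -> "earliest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

// Direct mode: no receiver; each Kafka partition maps to one RDD partition,
// and each batch reads exactly the offset ranges it planned for itself.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Seq("demo-topic"), kafkaParams)
)

stream.map(record => record.value()).print()
ssc.start()
ssc.awaitTermination()
```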
- Contrast:
The receiver approach uses Kafka’s high-level API to store consumed offsets in ZooKeeper. This is the traditional way to consume Kafka data. Combined with the WAL, it ensures high reliability and zero data loss, but it does not guarantee that data is processed once and only once: data may be processed twice, because the offsets tracked by Spark and those recorded in ZooKeeper can fall out of sync.
The direct approach uses Kafka’s simple API, and Spark Streaming itself is responsible for tracking consumed offsets and saving them in checkpoints. The offsets therefore stay in sync with what Spark has actually processed, so data can be consumed once and only once.
In actual production environments, the Direct mode is used most of the time.