Preface

Old Liu is a second-year graduate student who is about to start looking for a job. He writes this blog partly to review and summarize the knowledge points of big data development, and partly in the hope of helping more friends who are teaching themselves. Since Old Liu taught himself big data development, there are bound to be some shortcomings, and he welcomes everyone's criticism and corrections so that we can all make progress together!

Today's topic is the integration of SparkStreaming and Kafka. This article is very suitable for beginners, and everyone is welcome to share their opinions. This time Old Liu will use diagrams to cover details that other people's technical blogs leave out, and these details are very useful for beginners!

Main text

Why integrate SparkStreaming with Kafka?

First we need to understand why SparkStreaming is integrated with Kafka. Nothing happens without a reason!

As a real-time computing framework, SparkStreaming only handles computation and does not handle data storage, so in practice we need to connect it to external data sources. As a submodule of Spark, SparkStreaming has four types of data sources:

  1. Socket data source (used for testing; a minimal sketch follows this list)
  2. HDFS data source (it works, but is rarely used)
  3. Custom data sources (not important; you rarely see anyone write their own)
  4. Extended data sources (such as the Kafka data source, which is important and comes up in interviews)
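
To make the simplest of these concrete, here is a minimal sketch of the Socket data source from item 1. It is only a test-style example, not from a real project: local mode, port 9999, and a netcat server started with `nc -lk 9999` are all assumptions.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketWordCount {
  def main(args: Array[String]): Unit = {
    // Local mode with 2 threads: one thread receives data, the other computes
    val conf = new SparkConf().setAppName("SocketWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Socket data source: read lines from a netcat server (`nc -lk 9999`)
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Start `nc -lk 9999`, type a few words, and the word counts are printed every 5 seconds. That is exactly why this source is only used for testing.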

For the integration of SparkStreaming and Kafka, this article focuses on the principles. Old Liu will not post the full code, since there is plenty of it online; instead he will write down some of his own understanding!


SparkStreaming integration with Kafka-0.8

How SparkStreaming integrates with Kafka depends on the Kafka version, so let's first talk about integrating SparkStreaming with Kafka-0.8.

When integrating SparkStreaming with Kafka-0.8, the easiest way to keep data from being lost is to use a checkpoint, but checkpoints have a problem: when the code is updated, the checkpoint stops working. So if you want to prevent data loss, you need to manage the offsets yourself.

Code upgrades are nothing unfamiliar, but Old Liu will explain them properly!

In daily development we often run into two situations: the initial code is not well written, so we change it, repackage it, and resubmit it; or the business logic has changed, and we need to rework the code!

The first time we persist a checkpoint, the entire jar is serialized into binary files stored under a directory tied to that specific build. When SparkStreaming later tries to recover from the checkpoint, if the code has changed even a little, it can no longer use what was packaged before, and the result is data loss!
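
To show roughly what the checkpoint approach looks like, here is a small sketch. The HDFS path and app name are made-up examples, and the DStream logic is omitted:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointDemo {
  // Made-up checkpoint directory; it must live on reliable storage such as HDFS
  val checkpointDir = "hdfs://node01:8020/spark/checkpoint-demo"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("CheckpointDemo")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint(checkpointDir)
    // Build the DStream logic here; the whole DStream graph is serialized
    // into the checkpoint directory together with this build of the code
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover from the checkpoint if it exists, otherwise build a fresh context.
    // Once the repackaged code no longer matches the serialized graph, recovery breaks
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
    ssc.start()
    ssc.awaitTermination()
  }
}
```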

So we need to manage the offsets ourselves!

With ZooKeeper managing offsets, the flow is: when the program starts, it reads the last committed offset; SparkStreaming then reads data from Kafka starting at that offset and runs the batch; after the batch finishes, the new offset is committed to the ZooKeeper cluster. There is a small problem here: if the program dies after part of the results have already been written to HBase but before the offset is committed, then on restart the same data is read again and duplicates appear, although only one batch of data is affected.
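
A rough sketch of this flow with the Kafka-0.8 direct API is below. The ZooKeeper read and write helpers (readOffsetsFromZk, saveOffsetsToZk) are hypothetical stubs standing in for whatever ZooKeeper client you use, and the broker address and group id are made-up:

```scala
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ZkOffsetDemo {
  // Hypothetical helpers: imagine they read/write Map[TopicAndPartition, Long] in ZooKeeper
  def readOffsetsFromZk(): Map[TopicAndPartition, Long] = ???
  def saveOffsetsToZk(offsets: Map[TopicAndPartition, Long]): Unit = ???

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("ZkOffsetDemo"), Seconds(5))
    val kafkaParams = Map("metadata.broker.list" -> "node01:9092", "group.id" -> "demo-group")

    // 1. On startup, read the offsets that were committed last time
    val fromOffsets = readOffsetsFromZk()

    // 2. Pull data from Kafka starting at those offsets
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // 3. Run the business logic (for example, write results to HBase)
      rdd.foreach { case (_, value) => println(value) }

      // 4. Commit the new offsets to ZooKeeper only after the batch succeeds;
      //    if the job dies between step 3 and step 4, this batch is re-processed (duplicates)
      saveOffsetsToZk(offsetRanges.map(o => TopicAndPartition(o.topic, o.partition) -> o.untilOffset).toMap)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```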

However, there is a more serious problem. When a large number of consumers consume data, the offsets have to be read and written constantly. As a distributed coordination framework, ZooKeeper is not suited to frequent read and write operations, especially writes, so it cannot handle high-concurrency requests. It works as a lightweight metadata store, but not as a data store for high-concurrency reads and writes.

This is why we move on to SparkStreaming integration with Kafka-1.0.

SparkStreaming integration with Kafka-1.0

Storing offsets directly in Kafka avoids the risks of storing them in ZooKeeper. Note, however, that Kafka has an automatic offset commit feature, which may result in data loss.

That is because with auto-commit enabled, offsets are committed at a fixed interval, say every 2 seconds. Suppose a record has just been pulled and the 2-second mark is reached before we have had time to process it: the offset is committed anyway, and if the program then fails, that record is lost. So in practice we usually commit offsets manually!
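
Here is a minimal sketch of committing offsets manually with the newer direct API (the spark-streaming-kafka-0-10 module). The broker address, topic, and group id are made-up examples, and the business logic is just a print:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, KafkaUtils}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ManualCommitDemo {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("ManualCommitDemo"), Seconds(5))

    // Turn off auto-commit so offsets are committed only after processing succeeds
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "node01:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "demo-group",
      "auto.offset.reset" -> "earliest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("demo-topic"), kafkaParams))

    stream.foreachRDD { rdd =>
      // Grab the offset ranges before any transformation hides the KafkaRDD
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      rdd.foreach(record => println(record.value()))  // business logic goes here

      // Commit the offsets back to Kafka only after the batch has been processed
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With this pattern the worst case is the same as with ZooKeeper: if the job dies after writing results but before the commit, one batch is re-processed, but nothing is silently lost.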

How do we design an alarm and monitoring scheme?

In daily development work we need to design a monitoring scheme for real-time tasks, because without monitoring, the program is running naked: we cannot tell whether a task is delayed or anything else about its state, and that is a very scary situation!

Here is just one possible solution: use KafkaOffsetMonitor to monitor the tasks, then use crawler techniques to fetch the monitoring information and import the data into OpenFalcon. In OpenFalcon, alarms can be configured according to policies, or we can develop an alarm system ourselves. Finally, the alerts are sent to developers via enterprise WeChat or SMS!

Conclusion

All right! This article mainly explained the integration of SparkStreaming and Kafka. Old Liu put a lot of thought into the details, so if you are interested in big data, remember to give Old Liu a like. Finally, if you have any questions, contact the official account "hardworking Old Liu" and let's have a pleasant exchange!