Hello everyone, this is the second installment of The MQ Series. It arrived later than planned, and I'm sorry to the readers who have been messaging me to ask when it was coming!
The article dragged on for several weeks because, for each specific message middleware, it was hard to both cover the topic thoroughly and keep the length under control.
In the end, I decided to split the content into more, shorter articles. On the one hand, that speeds up the publishing cadence; on the other, shorter pieces are easier to digest.
Without further ado, the second round starts.
Why start with Kafka?
The opening article, Understand MQ, revolved around the essence of MQ (sending, storing, and consuming messages), explaining general MQ knowledge and systematically answering the question: how would you go about designing an MQ?
Starting with this article, I’ll be looking at specific message-oriented middleware. I chose Kafka for three reasons:
First, RocketMQ and Kafka are currently the two most popular messaging middlewares, the most widely used at Internet companies, and they will be the focus of this series.
Second, from the perspective of MQ history, Kafka was born before RocketMQ, and the Alibaba team borrowed heavily from Kafka's design ideas when implementing RocketMQ. Once you know how Kafka is designed, it will be easier to understand RocketMQ later.
Third, Kafka is a lightweight MQ: it has the most basic capabilities of an MQ but does not support advanced features such as delayed queues and retry mechanisms, which keeps its implementation complexity low. Starting with Kafka is a great way to quickly get to the heart of MQ.
With the background out of the way, follow my train of thought as we analyze Kafka from the shallow end to the deep.
Lifting Kafka's veil
Before diving into a technology, I don't recommend starting with the architecture and technical details. Start instead with two questions: what is it, and what problem was it created to solve?
Once you have this background, it becomes much easier to understand the design considerations and design ideas behind it.
While writing this article I looked up a lot of material, and the definitions of Kafka vary widely; without careful thought it is easy to get confused, so I think it is worth walking through them together.
Let’s take a look at Kafka’s own definition:
Apache Kafka is an open-source distributed event streaming platform.
Isn't Kafka a messaging system? Why is it called a distributed event streaming platform? Are the two the same thing?
I’m sure some readers are wondering, but to explain this question, we need to start with the background of Kafka’s birth.
Kafka started out as an in-house incubated project at LinkedIn, designed as a "data pipeline" to handle two scenarios:
1. User activity scenario: recording users' browsing, searching, clicking, and other behaviors.
2. System O&M scenario: monitoring server performance metrics such as CPU, memory, and request latency.
Both kinds of data fall into the category of logs, characterized by real-time production and large volume.
LinkedIn initially tried ActiveMQ to solve the data transfer problem, but its performance could not keep up, so they decided to build Kafka themselves.
So from the beginning, Kafka was built for real-time log streaming. With that in mind, it’s easy to understand Kafka’s relationship to streaming data, and why Kafka is so widely used in big data. It is also because it was originally created to solve the big data pipeline problem.
To go one step further: why does the official definition call Kafka a streaming platform? If Kafka is just a data channel, where does the "platform" part come in?
This is because, starting around its 0.9 and 0.10 releases, Kafka began shipping data processing components, such as:
1. Kafka Streams: a lightweight stream-computing library, similar in purpose to Spark and Flink.
2. Kafka Connect: a data integration tool that can sync Kafka data into relational databases, Hadoop, and search engines.
Kafka’s ambition is not just to be a messaging system, but to be a real-time streaming platform.
Kafka’s website mentions three capabilities:
1. Publish and subscribe capability of data (message queue)
2. Distributed storage capability of data (storage system)
3. Real-time data processing capability (stream processing engine)
This clears up Kafka's history and definition. This series will focus only on Kafka's first two capabilities, since both are strongly related to MQ.
Start with Kafka’s message model
After understanding Kafka's positioning and the background of its birth, let's look at Kafka's design philosophy.
As I mentioned in the previous article, to understand an MQ it is best to start at the core theoretical level, the "message model", rather than with the technical architecture, let alone diving straight into the technical details.
The message model can be understood as a logical structure: it sits one layer of abstraction above the technical architecture and usually carries the most central design ideas.
Let’s take a look at Kafka’s message model and see how it evolved.
First, to distribute a single message to multiple consumers, with each consumer receiving the full set of messages, the natural idea is broadcasting.
Then a problem arises: a message is broadcast to all consumers, but not every consumer wants all of it. For example, consumer A only wants messages 1, 2, 3, and consumer B only wants messages 4, 5, 6.
The crux of the problem is that MQ does not understand message semantics, so it has no way to categorize and route them.
At this point, MQ came up with a clever solution: push the problem upstream to the producers, requiring them to logically categorize messages as they send them. This is how the Topic and the publish-subscribe model we know today evolved.
In this way, consumers can simply subscribe to the topics they are interested in and get messages from them.
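To make the idea concrete, here is a minimal, hypothetical sketch of topic-based publish-subscribe in plain Python (a toy model, not a real Kafka API): producers tag each message with a topic, and the broker routes it only to consumers subscribed to that topic, without ever inspecting the payload.

```python
from collections import defaultdict

class MiniBroker:
    """Toy pub-sub broker: routes messages by topic, never by content."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of consumer inboxes

    def subscribe(self, topic, inbox):
        self.subscribers[topic].append(inbox)

    def publish(self, topic, message):
        # The producer chose the topic; the broker just fans out by that label.
        for inbox in self.subscribers[topic]:
            inbox.append(message)

broker = MiniBroker()
inbox_a, inbox_b = [], []
broker.subscribe("orders", inbox_a)    # consumer A cares only about orders
broker.subscribe("payments", inbox_b)  # consumer B cares only about payments

broker.publish("orders", "order-1")
broker.publish("payments", "pay-1")
# inbox_a == ["order-1"], inbox_b == ["pay-1"]
```

The broker stays semantically ignorant of the messages; all categorization work happens at the producer, which is exactly the trade the publish-subscribe model makes.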
Having done so, however, a problem remains: what if multiple consumers are interested in the same Topic (consumer C in the figure below)?
With the traditional queue model (unicast), once one consumer takes a message off the queue, the message is deleted and other consumers cannot get it.
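A few lines with Python's standard-library queue (a stand-in, not a real MQ) show why destructive reads conflict with multiple interested consumers: once consumer A takes the message, nothing is left for consumer B.

```python
from queue import Queue

q = Queue()
q.put("msg-1")

consumed_by_a = q.get()  # consumer A takes the message off the queue...
# ...and the read is destructive: the queue is now empty,
# so consumer B would block or find nothing.
print(consumed_by_a)     # msg-1
print(q.empty())         # True
```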
At this time, it is natural to think of the following solution:
That is, each time a Topic gains a new consumer, the underlying data queue is "replicated" into an identical copy.
This solves the problem, but as the number of downstream consumers grows, MQ performance degrades rapidly. For Kafka in particular, which was built for big-data scenarios, the cost of this replication was simply too high.
This is where Kafka's most elegant solution comes in: it persists all messages in storage, and each consumer can fetch whichever message it wants, whenever it wants, simply by passing the message's offset.
This fundamental change shifts the complexity of tracking consumption onto the consumers, which greatly reduces Kafka's own complexity and sets the stage for its high performance and scalability. (This is the core difference between Kafka and ActiveMQ or RabbitMQ.)
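The log-plus-offset idea can be sketched in a few lines of Python (a toy model under simplifying assumptions, not Kafka's actual implementation): messages live in an append-only list, and each consumer tracks its own read position, so any number of consumers read independently and nothing is ever deleted on read.

```python
class MiniLog:
    """Append-only log: messages are kept, never removed when read."""

    def __init__(self):
        self.records = []

    def append(self, message):
        self.records.append(message)
        return len(self.records) - 1  # the new message's offset

    def read(self, offset):
        return self.records[offset]

class MiniConsumer:
    """Each consumer owns its offset; the log keeps no per-consumer state."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        if self.offset >= len(self.log.records):
            return None  # caught up, nothing new yet
        message = self.log.read(self.offset)
        self.offset += 1
        return message

log = MiniLog()
for m in ["m0", "m1", "m2"]:
    log.append(m)

a, b = MiniConsumer(log), MiniConsumer(log)
# Both consumers see the full stream, each at its own pace.
assert [a.poll(), a.poll(), a.poll()] == ["m0", "m1", "m2"]
assert b.poll() == "m0"  # b is unaffected by a's reads
b.offset = 2             # consumers can even rewind or skip by offset
assert b.poll() == "m2"
```

Because the log holds no per-consumer state, adding a consumer costs the broker almost nothing, which is why this design scales where the replicate-the-queue approach does not.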
Finally, putting it all together, here is the picture:
This is the original message model for Kafka.
This also indirectly explains why, in the official definition quoted earlier, Kafka lists a storage system among its capabilities.
Of course, Kafka's actual model is far more sophisticated than this; for the sake of space, later articles in this series will cover the rest.
Final words
This article started from the background of Kafka's birth and clarified Kafka's definition and the problems it was created to solve.
It then analyzed Kafka's message model and design ideas step by step; the message model is Kafka's top-level abstraction.
I hope you found something interesting. The next article will take a closer look at Kafka’s architecture. See you next time!
About the author: master's degree from a 985 university, former Amazon engineer, now a technical director at a major tech company.
You are welcome to follow my personal public account, Wuge Ramble IT, where I share core technology and career-growth content!