Reprinted from a WeChat official account

Kafka, the message middleware we've been using all these years

Why message queues

One boring weekend, I was idly scrolling on my phone when the Taobao app suddenly popped up a message: "To reward our loyal customers, girlfriends are buy one get one free. Today only!"

Buy one get one free! I couldn't miss it! I placed an order on the spot: picked two of the latest models, ordered, and paid in one go. I lay in bed feeling satisfied, thinking I'd have a girlfriend any moment now, and was so happy I couldn't sleep…

The next day, while I was at work, I suddenly got a call from the courier:

Courier: "Are you xx? Your girlfriend has arrived. I'm downstairs at your place now, come take the delivery!" Me: "This… I'm at work. Can you bring it tonight?" Courier: "Tonight won't work, I'll be off work by then!"

So the two of us were at a stalemate for a long time…

Finally, the courier said: how about I leave it at the Xiaofang convenience store downstairs, and you pick it up after work this evening? The awkward standoff was resolved!

Back to the point: if there were no Xiaofang convenience store, the interaction between the courier and me would look like this:


What could happen?

1. For the sake of this girlfriend, I take leave and go home to collect it (the boss wouldn't approve).

2. The courier keeps waiting downstairs (he has other deliveries to make).

3. He delivers it over the weekend (obviously I can't wait that long).

4. I give up on this girlfriend (absolutely impossible)!

After the Xiaofang convenience store appears, the interaction looks like this instead:


In the example above, the courier and "me, who bought a girlfriend" are the two systems that need to interact, and the Xiaofang convenience store is the "message middleware" this article is about. Here are the benefits the Xiaofang convenience store (message middleware) brings:

1. Decoupling

The courier has lots of packages to deliver. Without the store, for each one he has to call and confirm whether the recipient is available and in which time slot, and only then plan the delivery. That makes him completely dependent on the recipients! With more than a few deliveries, he'd be run off his feet… With a convenience store, the courier only needs to drop all the packages for a neighborhood at that store and notify the recipients to pick them up. The courier and the recipients are now decoupled!

2. Asynchrony

Without the store, after calling me the courier had to wait downstairs until I collected my delivery, and he couldn't deliver to anyone else in the meantime. With the store, he drops the package at the Xiaofang convenience store and moves on to other work, instead of sitting blocked in a waiting state until I show up. His efficiency goes up.

3. Peak clipping

Suppose on Double Eleven I bought various goods from different stores, shipped by different carriers: ZTO, YTO, Shentong, and so on. What's worse, they all arrived at the same time! The ZTO courier called me to pick up at the north gate, the YTO courier called me to the south gate, and the Shentong courier to the east gate. I was torn in three directions… With the convenience store, all the packages simply accumulate there, and I collect them one by one at my own pace: the store absorbs the arrival peak.

As we can see, in scenarios where systems need to interact, message queue middleware really does help. Following this idea, more professional "middleware" than the Xiaofang convenience store emerged, such as Fengchao smart lockers and Cainiao stations.

Finally, the above story is pure fiction…

Modes of message queue communication

The example above introduced message middleware and the benefits message queues bring. Next we need to cover the two communication modes of message queues:

1. Point-to-point mode

As shown in the figure above, the point-to-point mode is usually a pull- or polling-based messaging model, characterized by each message sent to the queue being processed by one and only one consumer. After the producer puts a message into the queue, a consumer actively pulls it for consumption.

The advantage of the point-to-point model is that consumers control how often they pull messages. The drawback is that the consumer side cannot tell whether there are messages waiting in the queue, so it needs an extra monitoring thread.

2. Publish-subscribe mode

As shown in the figure above, the publish-subscribe mode is a push-based messaging model that can have many different subscribers. After a producer puts a message into the queue, the queue pushes the message to all consumers that have subscribed to it.

Since consumers passively receive pushes, they don't need to detect whether the queue has messages waiting! But Consumer1, Consumer2, and Consumer3 may process messages at different speeds because their machines differ, while the queue cannot tell how fast each consumer is consuming! So push speed is a real problem for the publish-subscribe model. Suppose the three consumers process at 8M/s, 5M/s, and 2M/s respectively: if the queue pushes at 5M/s, Consumer3 can't keep up; if it pushes at 2M/s, Consumer1 and Consumer2 waste enormous resources!

Kafka

With why we need message queues and the two communication modes briefly covered, it's time to introduce the protagonist of this article: Kafka.

Kafka is a high-throughput, distributed publish-subscribe messaging system that can handle all activity-stream data on a consumer-scale website, with high performance, persistence, multi-replica backup, horizontal scalability, and more…

Infrastructure and terminology

Without further ado, look at the picture first; it gives an overview of the related concepts and their relationships:

If you’re confused by this picture, that’s fine!

Let’s start with the concepts

Producer: the producer of messages, and the entry point for data into Kafka.

Kafka cluster:

Broker: a broker is a Kafka instance. Each server can run one or more Kafka instances; for simplicity we assume each broker corresponds to one server. Every broker in a Kafka cluster has a unique number, such as broker-0, broker-1, and so on…

Topic: the topic of a message, which can be understood as a message category. Kafka data is stored in topics, and multiple topics can be created on each broker.

Partition: a topic can have multiple partitions, which spread load and increase Kafka's throughput. Data for the same topic is never duplicated across its partitions. On disk, a partition takes the form of a folder!

Replication: each partition has multiple replicas that serve as backups. If the primary replica (the Leader) fails, a Follower is elected to become the new Leader. In Kafka the default maximum number of replicas is 10, and the number of replicas cannot exceed the number of brokers. A follower and its leader are always on different machines, and a machine can hold at most one replica of a given partition (the leader included).

Message: the body of each message that is sent.

Consumer: the consumer of messages, and the exit point for data out of Kafka.

Consumer Group: multiple consumers can be grouped into a consumer group. In Kafka's design, the data of one partition can only be consumed by one consumer within a group. Consumers in the same group can consume data from different partitions of the same topic, which improves Kafka's throughput!

Zookeeper: a Kafka cluster relies on Zookeeper to store cluster metadata and ensure the system's availability.
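To make topics, partitions, and replicas concrete before moving on to the workflow, here is a minimal sketch that creates a topic with the official Kafka Java AdminClient. The topic name my-topic, the address localhost:9092, and the choice of 3 partitions with replication factor 2 are all assumptions for illustration:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions spread the topic's load across brokers; a replication
            // factor of 2 keeps a follower copy on a different broker than the leader.
            NewTopic topic = new NewTopic("my-topic", 3, (short) 2);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```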

Workflow analysis

With Kafka's basic architecture and concepts covered, let's analyze how Kafka works by walking through its workflow.


Sending data

In the diagram above, the producer is the entry point for data. Note the red arrows: when writing data, the producer always writes to the leader of the partition, never directly to a follower! How does it find the leader? What does the write process look like? Take a look at the picture below:

The sending process is explained in the picture, so it isn't repeated in the text. Note that after a message is written to the leader, the followers actively synchronize from the leader! The producer pushes data to the broker, and each message is appended to the partition and written to disk sequentially, which guarantees that data within the same partition is ordered! The write process is illustrated below:
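As a concrete illustration of the send path, here is a minimal producer sketch using the official Kafka Java client. The topic name my-topic and the address localhost:9092 are assumptions; the client locates the partition leader for us, and send() is asynchronous:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send() is asynchronous: the record is routed to one of the topic's
            // partitions and appended to that partition's leader; followers
            // then synchronize from the leader.
            producer.send(new ProducerRecord<>("my-topic", "key-1", "hello kafka"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace(); // send failed
                        } else {
                            System.out.printf("wrote to partition %d at offset %d%n",
                                    metadata.partition(), metadata.offset());
                        }
                    });
        } // close() flushes any buffered records
    }
}
```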

Data is written to different partitions. Why does Kafka partition at all? As you can probably guess, the main purposes of partitioning are:

1. Easy scaling. Since a topic can have multiple partitions, we can cope with ever-growing data volumes simply by adding machines.

2. Higher concurrency. Reads and writes happen at the partition level, so multiple consumers can consume data at the same time, improving message-processing efficiency.

Anyone familiar with load balancing knows that when we send requests to a server, a load balancer may distribute the traffic across different servers. It is similar in Kafka: if a topic has multiple partitions, how does the producer know which partition to send data to?

Kafka applies the following routing rules (a simplified sketch follows the list):

1. If a partition is specified when writing, the data is written to that partition.

2. If no partition is specified but a message key is set, a partition is chosen by hashing the key.

3. If neither a partition nor a key is specified, a partition is chosen by round-robin.
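The sketch below illustrates these three rules in the simplest possible form. It mirrors the idea only; Kafka's real DefaultPartitioner uses murmur2 hashing of the key bytes and, in newer clients, a "sticky" strategy for keyless records rather than strict round-robin:

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

// A simplified illustration of the three routing rules above, not Kafka's
// actual partitioner implementation.
public class PartitionRouterSketch {
    private final AtomicInteger counter = new AtomicInteger();

    public int route(Integer explicitPartition, byte[] key, int numPartitions) {
        if (explicitPartition != null) {
            return explicitPartition;                          // rule 1: caller chose a partition
        }
        if (key != null) {
            // rule 2: derive the partition from a hash of the key
            // (mask keeps the hash non-negative)
            return (Arrays.hashCode(key) & 0x7fffffff) % numPartitions;
        }
        // rule 3: no partition and no key -> rotate through partitions
        return (counter.getAndIncrement() & 0x7fffffff) % numPartitions;
    }
}
```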

Guaranteeing that messages are not lost is a basic duty of message queue middleware. How does the producer guarantee this when writing to Kafka? The write flow chart above already showed it: through the ACK response mechanism! When writing data, the producer can set a parameter that controls how Kafka confirms receipt. It can be set to 0, 1, or all; the three values are explained below, followed by a short config fragment.

0 means the producer sends data to the cluster without waiting for any acknowledgement, so successful delivery cannot be guaranteed. Least safe, but most efficient.

1 means the producer sends data and moves on to the next message as soon as the leader acknowledges it. This only guarantees that the leader received the data successfully.

all means the producer waits until all followers have completed synchronization with the leader, guaranteeing that the leader received the data and that every replica has a backup. Safest, but least efficient.
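Continuing the producer sketch above, the acknowledgement level is just a producer configuration entry. A small fragment (the retries value is an arbitrary illustration):

```java
// "0"  : don't wait for any acknowledgement (fastest, may lose messages)
// "1"  : wait for the partition leader only
// "all": wait until all in-sync replicas have the record (safest, slowest)
props.put("acks", "all");
props.put("retries", 3); // retry transient send failures instead of dropping data
```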

Finally, if we write to a topic that doesn't exist, can the write succeed? Yes: Kafka automatically creates the topic, with the number of partitions and replicas both defaulting to 1.

Saving data

After the producer writes data to Kafka, the cluster needs to persist it! Kafka stores data on disk, and writing to disk is, by our usual intuition, a time-consuming operation ill-suited to such a high-concurrency component. Instead, Kafka initially allocates a dedicated block of disk space and writes data to it sequentially (sequential writes are far more efficient than random writes).

Partition structure

As mentioned earlier, every topic can be divided into one or more partitions. If a topic feels abstract, a partition is quite concrete: on disk, each partition is divided into multiple segments, and each segment contains a .index file, a .log file, and a .timeindex file. The .log file is where messages are actually stored, while the .index and .timeindex files are indexes used to retrieve messages.

As shown in the figure above, this partition has three groups of segment files. The .log files all have the same size limit, but the number of messages they store is not necessarily the same. Each file is named after the segment's smallest offset: for example, the first segment (base offset 0) stores the messages with offsets 0 to 368795. Kafka uses segments plus indexes to solve the lookup-efficiency problem.

Message structure

The .log file, as we said, is where messages are actually stored; what we write from the producer, message after message, ends up there. So what does each message stored in the log contain? It includes the message body, the message size, the offset, the compression type… and so on! Three of these fields are key (a decoding sketch follows the list):

1. Offset: an ordered 8-byte ID that uniquely determines the position of each message within its partition!

2. Message size: a 4-byte field describing the size of the message.

3. Message body: the actual (possibly compressed) message data, occupying a varying amount of space depending on the message.
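As a toy illustration of these three fields, the sketch below decodes one record from a byte buffer under this simplified layout. Real Kafka log segments (the v2 record-batch format) are more involved, with batch headers, varint-encoded fields, CRCs, and timestamps:

```java
import java.nio.ByteBuffer;

public class SimpleLogRecordReader {
    // Simplified layout assumed above: 8-byte offset | 4-byte size | body bytes.
    public static void readOne(ByteBuffer log) {
        long offset = log.getLong(); // unique position of the message in its partition
        int size = log.getInt();     // length of the message body in bytes
        byte[] body = new byte[size];
        log.get(body);               // the actual (possibly compressed) payload
        System.out.printf("offset=%d size=%d%n", offset, size);
    }
}
```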

Storage policy

Kafka retains all messages, whether or not they have been consumed. So what is the deletion strategy for old data?

1. Time-based retention: the default is 168 hours (7 days).

2. Size-based retention: the default configuration is 1073741824 bytes (1 GB).

Note that the time complexity of reading a particular message in Kafka is O(1), so deleting expired files does not improve Kafka's read performance!

Consuming data

Once messages are stored in the log files, consumers can consume them. As with producing messages, consumers also pull messages from the partition leader.

Multiple consumers can form a consumer group, which shares a group ID! Consumers in the same group can consume data from different partitions of the same topic, but multiple consumers in one group can never consume the same partition's data!!

It’s a little convoluted. Take a look at the picture below:

The figure shows a case where the group has fewer consumers than there are partitions, so one consumer consumes multiple partitions, which is slower than handling just one partition each! If a group has more consumers than partitions, will several consumers share one partition? As just mentioned, no! The surplus consumers simply consume nothing. So in practice, it is recommended to make the number of consumers in a group equal to the number of partitions!
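Here is a minimal consumer sketch using the official Kafka Java client: start several copies with the same group.id, and Kafka assigns each one a share of the topic's partitions. The group name my-group, topic my-topic, and address localhost:9092 are assumptions:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        // Consumers sharing a group.id split the topic's partitions among
        // themselves; each partition goes to exactly one consumer in the group.
        props.put("group.id", "my-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                // Pull model: the consumer fetches from the leader at its own pace.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```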

In the section on saving data, we saw that a partition is divided into segments, each containing .log, .index, and .timeindex files, and that each stored message carries its offset, size, and body. We've mentioned segments and offsets several times; how are they used together to locate a message? Suppose we need to find the message whose offset is 368801. Take a look at the picture below:

1. First, find the segment file where the message with offset 368801 lives (using binary search over the segments' base offsets); here that is the second segment file.

2. Open the found segment's .index file (the 368796.index file, whose starting offset is 368796 + 1). The message with offset 368801 has a relative offset within this index of 368801 - 368796 = 5. Since the .index file is a sparse index storing the mapping from relative offsets to physical message positions, an entry with relative offset exactly 5 may not exist; binary search is used to find the largest index entry whose relative offset is less than or equal to 5, which here is the entry with relative offset 4.

3. From the found index entry with relative offset 4, read the stored physical position of the message, which is 256. Open the .log data file and scan sequentially from position 256 until the message with offset 368801 is found.

This mechanism is built on ordered offsets, combining segments + ordered offsets + sparse indexes + binary search + sequential scanning to locate data efficiently! At this point the consumer has the data it needs to process.
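The lookup can be sketched with two "floor" searches, mimicking the binary searches Kafka performs over the segment file names and the sparse index. This is an in-memory toy under the assumptions above, not Kafka's actual file handling:

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class OffsetLookupSketch {
    // segment base offset -> sparse index (relative offset -> physical byte position)
    private final NavigableMap<Long, NavigableMap<Long, Long>> segments = new TreeMap<>();

    // Assumes the target offset exists in some registered segment.
    public long scanStartPosition(long targetOffset) {
        // 1. find the segment whose base offset is the largest one <= target
        Map.Entry<Long, NavigableMap<Long, Long>> seg = segments.floorEntry(targetOffset);
        long relative = targetOffset - seg.getKey();
        // 2. find the largest sparse-index entry with relative offset <= relative
        Map.Entry<Long, Long> entry = seg.getValue().floorEntry(relative);
        // 3. return the physical position where the sequential scan begins
        return entry.getValue();
    }
}
```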

So how does each consumer record its own consumption position?

In early versions, consumers kept the offsets they had consumed in ZooKeeper, reporting them at intervals; this easily led to repeated consumption and performed poorly! In newer versions, the offsets consumed by consumers are maintained directly in an internal topic of the Kafka cluster: __consumer_offsets!
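Continuing the consumer sketch above, committing the position explicitly writes it to __consumer_offsets. A small fragment (process() is a hypothetical business handler):

```java
props.put("enable.auto.commit", "false"); // take over offset management ourselves

ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
for (ConsumerRecord<String, String> record : records) {
    process(record); // hypothetical handler: do the real work before committing
}
consumer.commitSync(); // record progress in the __consumer_offsets topic
```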

Copyright belongs to the original author, if there is infringement, please contact to delete.