Preface

In the previous two articles, both Kafka and ZooKeeper were installed successfully. Work has kept me too busy to go any further until now, so today I want to take a first look at how Kafka works under the hood. The goal is not to master everything at once; that would not be realistic. The main purposes are as follows:

  • If problems come up with Kafka in the future, I can locate them quickly
  • When integrating with Java, I will know where the important integration points are, for example the ACK behaviour I care about
  • For any component, using it is only the beginning; understanding the principles is the next step up

Kafka infrastructure and terminology explained

Without further ado, let’s go straight to the diagram:

If this picture confuses you, that’s fine! Let’s start with the concepts.

Producer: the producer of messages, i.e. the entry point for data into Kafka.

Kafka cluster:

  • Broker: a Broker is a Kafka instance. Each server can run one or more Kafka instances; for simplicity we assume one Broker per server. Every broker in a Kafka cluster has a unique id, such as broker-0, broker-1, and so on…

  • Topic: the topic of a message, which can be thought of as a message category. Kafka data is stored in topics, and multiple topics can be created on each broker.

  • Partition: a partition of a Topic. Each topic can have multiple partitions, which spread the load and increase Kafka's throughput. The data of the same topic is not duplicated across its partitions, and on disk each partition appears as a folder!

How does the partition load balancing work?

  • Replication: each partition has multiple replicas that act as backups. If the primary partition (the Leader) fails, a Follower is elected to become the new Leader. The default maximum number of replicas in Kafka is 10, and the number of replicas cannot exceed the number of brokers. A follower and its leader are always on different machines: one machine can hold at most one replica of a given partition (the leader included).

Why must the leader and follower replicas of a partition be on different machines?

  • Message: The body of each sent Message.

Consumer: the consumer of messages, i.e. the exit point for data out of Kafka.

Consumer Group: multiple consumers can be grouped into one consumer group. In Kafka's design, the data of one partition can only be consumed by a single consumer within a consumer group. Consumers in the same group can consume different partitions of the same topic in parallel, which improves Kafka's throughput!

What happens if a consumer group has more consumers than there are partitions?

Zookeeper: a Kafka cluster relies on ZooKeeper to store cluster metadata and to ensure the availability of the system.

That covers the terminology, along with a few questions that will be answered step by step in this article.
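To make the terms above a bit more concrete, here is a minimal sketch that creates a topic with the Java AdminClient. It assumes a broker reachable at localhost:9092; the topic name topicA, the partition count of 2, and the replication factor of 2 are illustrative values chosen to match the example used later in this article.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed broker address; adjust to your own cluster
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // "topicA" with 2 partitions and a replication factor of 2,
            // mirroring the partition0/partition1 example used later in the text
            NewTopic topicA = new NewTopic("topicA", 2, (short) 2);
            admin.createTopics(Collections.singletonList(topicA)).all().get();
        }
    }
}
```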

Process analysis

Next, let's walk through Kafka's workflow using the structure diagram above. If the terminology still feels hazy, don't worry; once we have gone through the whole process, everything will become clear.

I assume message-oriented middleware is already familiar territory for application developers, so I won't go over its patterns and purpose here. Kafka is simply one kind of message-oriented middleware.

Sending data

The diagram above reveals several details:

  • When a producer sends data, it first determines which partition the data belongs to (according to some balancing strategy) and sends it to the leader of that partition. For example, suppose topicA has two partitions, partition0 and partition1; partition0's two replicas are replica A and replica B, and partition1's two replicas are replica C and replica D. If, after load balancing, the data is routed to partition0, it is sent to partition0's leader replica!
  • When consumers consume, they also read only from the leader!

  • The data sent by the producer is then synchronized from the leader to the followers

The flow chart of sending data is as follows:

Well, don't get confused like I did: although the leader and its followers are drawn in the same box, they must be on different machines. Why? Because the leader is the primary partition and the followers are backups; keeping a backup on the same machine as the primary would make the backup meaningless.

Note that after a message is written to the leader, the followers actively synchronize it from the leader! The producer pushes data to the broker, and each message is appended to its partition and written to disk sequentially, which guarantees that data within the same partition is ordered! The write process is shown in the diagram below:

Data is written to different partitions. Why does Kafka partition at all? As you can probably guess, the main purposes of partitioning are:

  1. Easy scaling. Since a topic can have multiple partitions, we can cope with a growing volume of data simply by adding machines.
  2. Higher concurrency. Reads and writes happen at the partition level, so multiple consumers can consume data at the same time, improving message-processing efficiency.

Anyone familiar with load balancing knows that when we send a request to a server, the server may balance the request and route the traffic to different machines. In Kafka, if a topic has multiple partitions, how does the producer know which partition to send the data to? The rules are:

  1. If a partition is specified when writing, the data goes to that partition.
  2. If no partition is specified but a message key is set, a partition is chosen by hashing the key.
  3. If neither a partition nor a key is given, a partition is chosen by round-robin.
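As a rough sketch of these three cases with the Java producer client (the broker address, topic name, and string serializers are assumptions for illustration):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class PartitionSelectionExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // 1. Partition specified explicitly: the record goes to partition 0 of topicA
            producer.send(new ProducerRecord<>("topicA", 0, "key1", "to partition 0"));

            // 2. No partition, but a key: the partition is chosen by hashing the key
            producer.send(new ProducerRecord<>("topicA", "key2", "partition chosen by key hash"));

            // 3. Neither partition nor key: the producer picks the partition itself
            //    (round-robin in older clients, sticky partitioning in newer ones)
            producer.send(new ProducerRecord<>("topicA", "no key, producer picks the partition"));
        }
    }
}
```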

Guaranteeing that messages are not lost is a basic requirement for any message-queue middleware, so how does the producer guarantee this when writing to Kafka? As the write flow chart above already hints, the answer is the ACK response mechanism! When a producer writes data, it can set a parameter that controls how Kafka confirms receipt of the data; this parameter can be 0, 1, or all.

  • 0 means the producer sends data to the cluster without waiting for any acknowledgement, so successful delivery cannot be guaranteed. Least safe, but most efficient.
  • 1 means the producer sends the next message as soon as the leader acknowledges, which only guarantees that the leader received the data successfully.
  • all means the producer only continues after all followers have finished synchronizing with the leader, which guarantees that the leader received the data and that every replica has a backup. Safest, but least efficient.

It is easy to imagine that in practice most of us will end up using the all mode!
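For reference, here is a sketch of how the ack parameter is typically set on the Java producer; the broker address, topic, and retry count are illustrative assumptions:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class AcksExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");

        // acks=all: wait until the leader and all in-sync replicas acknowledge.
        // The alternatives are "0" (don't wait) and "1" (leader only).
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // retries give the producer a chance to resend on transient failures
        props.put(ProducerConfig.RETRIES_CONFIG, 3);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("topicA", "key", "a message that must not be lost"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace(); // the send ultimately failed
                        }
                    });
        }
    }
}
```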

Finally, what happens if we write to a topic that does not exist: will the write succeed? Kafka will automatically create the topic, with the number of partitions and replicas both defaulting to 1.

Storing data

After the producer writes data to Kafka, the cluster needs to persist it! Kafka stores data on disk, and writing to disk is usually considered a time-consuming operation, one that in our common sense should not suit such a high-concurrency component. Kafka starts by reserving a block of disk space and then writes data to it sequentially (which is far more efficient than random writes).

Each topic can be divided into one or more partitions, and a partition appears as a folder on the server. Inside each partition's folder there are groups of segment files, and each group consists of a .index file, a .log file, and a .timeindex file (not present in early versions). The .log file actually stores the messages, while the .index and .timeindex files are used to retrieve them.

A partition is composed of multiple segments to make log cleanup and deletion easier. Each segment's .log file is named after the offset of the first message in that segment. Alongside it is an index file, named the same way with the suffix .index, which records the offset range of the log entries contained in the segment, as follows:

As shown in the figure above, this partition has three groups of segment files. Each .log file has the same maximum size, but the number of messages stored in each is not necessarily the same. The files are named after the segment's smallest offset; for example, 00000000000000000000.index belongs to the segment that stores messages with offsets 0 to 368795. Kafka uses this segment + index scheme to solve the lookup-efficiency problem.
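For intuition, a partition folder might look roughly like the hypothetical listing below; the first two base offsets follow the example above, while the third is made up purely for illustration:

```
topicA-0/                          # folder for partition 0 of topicA
  00000000000000000000.index      # first segment: offsets 0 ~ 368795
  00000000000000000000.log
  00000000000000000000.timeindex
  00000000000000368796.index      # second segment: starts at offset 368796
  00000000000000368796.log
  00000000000000368796.timeindex
  00000000000000737337.index      # third segment (base offset is illustrative)
  00000000000000737337.log
  00000000000000737337.timeindex
```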

Message structure. The .log file is where the messages are actually stored; the messages we write from the producer end up here. So what exactly does a stored message contain? It mainly includes the message body, the message size, the offset, the compression type, and so on! The three key fields are:

  1. Offset: an ordered 8-byte id that uniquely identifies the position of each message within the partition.
  2. Message size: 4 bytes, describing how large the message is.
  3. Message body: the actual message data (possibly compressed); its size varies from message to message.

Storage policy. Kafka keeps all messages, whether or not they have been consumed. So what is the deletion strategy for old data?

  1. Time based: the default is 168 hours (7 days).
  2. Size based: the default configuration is 1073741824 bytes.

Note that the time complexity for Kafka to read a particular message is O(1), so deleting expired files does not improve Kafka's read performance!
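Retention can also be overridden per topic rather than relying on the broker defaults. Below is a sketch using the Java AdminClient's incrementalAlterConfigs, assuming the topicA from earlier and setting the time-based retention.ms config to 7 days; the broker address is an assumption:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class RetentionConfigExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "topicA");

            // retention.ms controls time-based retention; 7 days is used here as an example
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry("retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)),
                    AlterConfigOp.OpType.SET);

            Map<ConfigResource, Collection<AlterConfigOp>> updates = new HashMap<>();
            updates.put(topic, Collections.singletonList(setRetention));
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}
```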

Consuming data

Once messages are stored in the log files, consumers can consume them. As with producing messages, consumers also pull messages from the leader.

Multiple consumers can form a consumer group, sharing the same group id! Consumers in the same group can consume different partitions of the same topic, but two consumers in the same group can never consume the same partition!! If that sounds convoluted, take a look at the picture below:

The figure shows a case where the consumer group contains fewer consumers than there are partitions, so one consumer ends up consuming multiple partitions and will be slower than a consumer that handles only one partition! Conversely, if a consumer group contains more consumers than partitions, will several consumers end up consuming the same partition? As mentioned earlier, that will not happen! The surplus consumers simply consume no partition at all. Therefore, in practice it is recommended to make the number of consumers in a group equal to the number of partitions!
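Here is a minimal consumer sketch illustrating the group concept; the broker address, the group id groupA, the topic topicA, and the manual-commit choice are all assumptions. Running two copies of this program with the same group id would split topicA's two partitions between them, while a third copy would sit idle.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class GroupConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "groupA");                  // the consumer group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");         // commit offsets manually

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("topicA"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
                // commits the consumed offsets to the __consumer_offsets topic
                consumer.commitSync();
            }
        }
    }
}
```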

As covered in the section on storing data, a partition is divided into segments, each with .log, .index, and .timeindex files, and each stored message carries its offset, message size, and message body. So how do we use segment + offset to find a message? Suppose we now need to find the message whose offset is 368801. Take a look at the picture below:

1. First, locate the segment file that holds the message with offset 368801 (using binary search over the segments' base offsets).

2. Open that segment's .index file (368796.index, whose entries start at offset 368796 + 1). The message we want, offset 368801, corresponds to 368796 + 5, so the relative offset we are looking for is 5. Because the file is a sparse index that only maps some relative offsets to the physical positions of their messages, an entry with relative offset exactly 5 cannot be found directly. Binary search is used again to find the largest index entry whose relative offset is less than or equal to 5, which turns out to be the entry with relative offset 4.

3. The index entry with relative offset 4 tells us the message is stored at physical position 256. Open the data file and scan sequentially from position 256 until the message with offset 368801 is found.
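The following self-contained sketch mimics steps 2 and 3 in miniature; the index entries, positions, and array layout are made-up illustrative values, not Kafka's real on-disk format:

```java
import java.util.Arrays;

/**
 * A simplified sketch of the lookup described above: find the largest sparse-index
 * entry whose relative offset is <= the one we want, then scan the .log file forward
 * from that entry's physical position.
 */
public class SparseIndexLookup {

    // (relativeOffset, physicalPosition) pairs, sorted by relativeOffset (illustrative values)
    static final int[][] INDEX = { {1, 0}, {4, 256}, {7, 522} };

    /** Binary search for the last entry with relativeOffset <= target. */
    static int[] floorEntry(int targetRelativeOffset) {
        int lo = 0, hi = INDEX.length - 1, best = 0;
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            if (INDEX[mid][0] <= targetRelativeOffset) {
                best = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return INDEX[best];
    }

    public static void main(String[] args) {
        long baseOffset = 368796;          // segment name = offset of its first message
        long wantedOffset = 368801;        // the message we are looking for
        int relative = (int) (wantedOffset - baseOffset);   // = 5

        int[] entry = floorEntry(relative);                 // -> {4, 256}
        System.out.println("Start scanning the .log file at physical position " + entry[1]
                + " (index entry " + Arrays.toString(entry) + ") until offset " + wantedOffset);
    }
}
```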

This mechanism is built on top of ordered offsets, combining segments, ordered offsets, a sparse index, binary search, and sequential scanning to locate data efficiently! At this point the consumer has the data it needs to process. But how does each consumer keep track of where it has consumed up to? In early versions, consumers kept the offsets they had consumed in ZooKeeper and reported them at intervals, which easily led to duplicate consumption and poor performance! In newer versions, the offsets consumed by consumers are maintained directly in the __consumer_offsets topic inside the Kafka cluster.

That’s all for this piece.