Preface

Only a bald head can be strong.

Welcome to star our GitHub repository: github.com/ZhongFuChen…

I have written two basic articles before this one, which I strongly recommend reading first:

  • What is ZooKeeper?
  • What is a message queue?

As we all know, there are quite a few message queue products out there. Here's why I chose to study Kafka:

At work we use a modified version of Kafka together with our own message queues (Kafka and RocketMQ), so I wanted to learn more about Kafka itself. This article is an introduction to Kafka, and I hope you find it helpful.

A preview of this article:

The diagrams in this article took me a long time to draw; the goal is to introduce Kafka to you in the most accessible way possible. If you think it's good, please give me a thumbs up!

What is Kafka?

First, let's see how the official website describes Kafka:

  • kafka.apache.org/intro

When collecting materials and learning, I have found that many predecessors have translated and summarized the introduction of the official website, so I will not repeat here, and post the address for you to learn by yourself:

  • scala.cool/2018/03/lea…
  • colobu.com/2014/08/06/…

As mentioned in my earlier message queue introduction, there are several questions to consider when building a message queue:

  • A message queue cannot be a single machine (it must be distributed/clustered)
  • Data written to a message queue may be lost, so it needs to be persisted somewhere (disk? database? Redis? a distributed file system?)
  • How do we keep messages (data) in order?
  • Why might data in a message queue be consumed more than once?

Let me take Kafka as an example to answer these questions briefly, and then get started with Kafka.

1.1 Introduction to Kafka

As we all know, Kafka is a message queue: those who put messages into the queue are called producers, and those who take messages out of the queue are called consumers.

A message middleware usually has more than one queue. Producers and consumers need to know which queue to write data to and which queue to read messages from, so we give each queue a name, called a topic (roughly the equivalent of a table in a database).

Now that the queue has a name, the producer knows which queue to write data to, and the consumer knows which queue to read data from. We can have multiple producers writing data to the same queue, and multiple consumers reading data from the same queue.

To improve the throughput of a topic, Kafka partitions topics.

So the producer is actually writing data to a partition of the topic named Java3y, and the consumer is actually fetching data from a partition of the topic named Java3y.
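To make the partitioning idea concrete, here is a minimal sketch of how a producer might pick a partition for a record. This is only an illustration: Kafka's real default partitioner hashes the key with murmur2, and the function below is a made-up stand-in.

```python
# Simplified sketch of how a producer might choose a partition for a record.
# (Kafka's real default partitioner hashes the key with murmur2; this is
# an illustration, not the actual implementation.)

def choose_partition(key, num_partitions):
    if key is None:
        # keyless records can be spread across partitions (e.g. round robin);
        # here we just pick partition 0 for simplicity
        return 0
    return hash(key) % num_partitions

# All records with the same key land in the same partition,
# so their relative order is preserved within that partition.
p1 = choose_partition("order-42", 3)
p2 = choose_partition("order-42", 3)
assert p1 == p2
```

The useful property to notice: keyed records are sticky to one partition, which is what makes per-key ordering possible later on.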

A Kafka server is called a Broker. A Kafka cluster is a cluster of Kafka servers:

A topic can be divided into multiple partitions, which are actually distributed among different brokers, for example:

The takeaway: Kafka is naturally distributed.

If you are not familiar with distributed systems/clusters and the basic concepts around them, you can find my introductory articles on GitHub: github.com/ZhongFuChen… (search for the keywords "distributed" and "SpringCloud"). Give me a thumbs up if you think my writing is good!

We now know that data written to a topic is actually spread across different partitions, which live on different brokers. Distribution inevitably raises a problem: what if one of the brokers (Kafka servers) becomes unstable or goes down?

Here's what Kafka does: our data is stored on different partitions, and Kafka backs up those partitions. For example, suppose we have three partitions on three brokers; each partition has a backup, and these backups are scattered across different brokers.

The red partition represents the primary partition, and the purple partition represents the backup partition. Producers interact with the primary partition when they drop data to a topic, and consumers interact with the primary partition when they consume data from a topic.

A backup partition is only used for backup; it serves no reads or writes. If a broker fails, one of the backup partitions on the other brokers will be elected as the new primary partition, which gives us high availability.
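That failover idea can be sketched as follows. This is a toy model (real Kafka has a controller that elects the new leader from the in-sync replica set); the broker and partition names below are made up.

```python
# Toy sketch of leader failover: each partition has a leader broker (first
# in the list) plus backup brokers; when a broker dies, a surviving backup
# is promoted. (Real Kafka elects from the in-sync replica set via the
# controller; this only illustrates the idea.)

replicas = {
    "topic-p0": ["broker1", "broker2"],  # first entry is the leader
    "topic-p1": ["broker2", "broker3"],
    "topic-p2": ["broker3", "broker1"],
}

def fail_broker(replicas, dead):
    for partition, brokers in replicas.items():
        survivors = [b for b in brokers if b != dead]
        replicas[partition] = survivors  # new leader = survivors[0]

fail_broker(replicas, "broker2")
assert replicas["topic-p1"][0] == "broker3"  # backup promoted to leader
```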

It is also worth asking: when a producer writes data into a topic, we know it is written to a partition, so how does the partition persist it? If the data only lived in memory, it would be lost when the broker failed.

Kafka writes partition data to disk (the message log), but only allows appending (sequential I/O), avoiding slow random I/O operations.

  • Kafka also does not write each message to disk the moment it reaches a partition. It caches a batch of data and waits until enough data has accumulated, or enough time has passed, before flushing to disk.
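The append-and-flush behaviour can be sketched like this. `PartitionLog`, `flush_every` and the in-memory "disk" list are illustrative stand-ins, not Kafka internals.

```python
# Sketch of an append-only partition log that buffers writes in memory and
# flushes to "disk" once enough messages accumulate (illustrative only).

class PartitionLog:
    def __init__(self, flush_every=3):
        self.buffer = []
        self.disk = []          # stands in for the on-disk segment file
        self.flush_every = flush_every

    def append(self, msg):
        self.buffer.append(msg)          # sequential append, no random I/O
        if len(self.buffer) >= self.flush_every:
            self.flush()

    def flush(self):
        self.disk.extend(self.buffer)    # one batched sequential write
        self.buffer.clear()

log = PartitionLog(flush_every=3)
for m in ["a", "b", "c", "d"]:
    log.append(m)
assert log.disk == ["a", "b", "c"]  # first three flushed as a batch
assert log.buffer == ["d"]          # "d" still waiting in the cache
```

The trade-off is exactly the one in the bullet above: batching makes writes fast, at the cost of a small window where unflushed data could be lost.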

As described above, producers put data into a topic and consumers take data from a topic. Since the data is stored in partitions, the consumer is actually fetching the data from partitions.

There can be more than one producer, and there can be more than one consumer. As shown above, one consumer consumes the data of three partitions. Multiple consumers can form a consumer group.

Instead of one consumer consuming three partitions, with a consumer group each consumer can consume one partition (again, to improve throughput).

As shown in the picture, the points to note here are:

  • If a consumer in the group dies, one of the remaining consumers may have to take over and consume two partitions
  • If there are only three partitions and the consumer group has four consumers, one consumer will be idle
  • If another consumer group is added, both the new group and the original group can consume all of the topic's data (consumer groups are logically independent)
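The partition-to-consumer spreading described in these bullets can be sketched as a simple round-robin assignment. Kafka actually ships several assignment strategies; this one is just for illustration.

```python
# Sketch of spreading a topic's partitions across the consumers in a group
# (round-robin; illustrative only, not Kafka's actual assignor code).

def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 3 partitions, 2 consumers: one consumer has to take two partitions
assert assign([0, 1, 2], ["c1", "c2"]) == {"c1": [0, 2], "c2": [1]}
# 3 partitions, 4 consumers: one consumer ends up idle
assert assign([0, 1, 2], ["c1", "c2", "c3", "c4"])["c4"] == []
```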

To recap: the data that producers write into a topic is stored in partitions, and partitions are persisted to disk with sequential I/O. Data is written to a cache first and flushed to disk in batches, after a period of time or once enough data has accumulated.

The consumer side is also careful about reading: normally, reading data from disk means copying it from kernel space into user space, but Kafka calls sendfile() so that data goes directly from kernel space (the page cache) to kernel space (the socket buffer, via DMA), saving one copy through user space.
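Here is a small sketch of the sendfile(2) idea in Python, assuming Linux (where os.sendfile can write from a file into a socket). Kafka's Java code reaches sendfile via FileChannel.transferTo; everything below is just a demonstration of the syscall.

```python
import os
import socket
import tempfile

# Sketch of zero-copy with sendfile(2): the file content goes from the page
# cache straight to the socket without being copied into user space first.
# (Assumes Linux.)

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello kafka")   # stands in for a log segment on disk
    path = f.name

out_sock, in_sock = socket.socketpair()
with open(path, "rb") as segment:
    # kernel copies page cache -> socket buffer; no read() into our process
    os.sendfile(out_sock.fileno(), segment.fileno(), 0, 11)

data = in_sock.recv(11)
assert data == b"hello kafka"
os.unlink(path)
```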

Some readers may wonder: how does a consumer know how far it has consumed? Doesn't Kafka support replaying from an earlier point? How does that work?

  • For example, as mentioned above, if a consumer in a consumer group dies, the partition it was consuming may be taken over by a surviving consumer. The surviving consumer needs to know how far the dead consumer had consumed; otherwise, how would it carry on?

Kafka uses an offset to record how far a consumer has consumed. Each consumer has its own offset. Put bluntly, the offset represents the consumer's consumption progress.

In earlier versions of Kafka, this offset was managed by Zookeeper. Later, the Kafka developers decided that Zookeeper was not suited to frequent writes, so the offset is now stored on the brokers in an internal topic (__consumer_offsets).

The offset is committed as the consumer consumes, and Kafka lets you choose whether to commit automatically or manually.
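Offset bookkeeping can be sketched as a map from (group, topic, partition) to the next offset to read. This is only an illustration; real Kafka persists these commits in the __consumer_offsets topic, and the group, topic and numbers below are made up.

```python
# Sketch of offset bookkeeping: each (group, topic, partition) maps to the
# next offset to consume, committed after messages are processed.

committed = {}  # (group, topic, partition) -> next offset to consume

def commit(group, topic, partition, offset):
    committed[(group, topic, partition)] = offset

def position(group, topic, partition):
    # a consumer (or its replacement after a rebalance) resumes from here
    return committed.get((group, topic, partition), 0)

commit("group-A", "orders", 0, 42)
# a new consumer taking over partition 0 picks up where the old one stopped
assert position("group-A", "orders", 0) == 42
assert position("group-A", "orders", 1) == 0  # nothing committed yet
```

Because the offset is just a number stored per group, rewinding ("backtracking") is simply a matter of committing a smaller offset and consuming again.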

Since Zookeeper came up, one more note: Zookeeper is an important dependency of Kafka, even though new versions no longer use it to store consumer offsets. Kafka still relies on Zookeeper to:

  • Detect the addition and removal of brokers and consumers.
  • Maintain the leader/follower relationship of all partitions (primary partition and backup partitions); if a primary partition fails, a backup partition needs to be elected as the new primary.
  • Maintain metadata such as topic and partition configuration.
  • …and so on.

Finally


Having read this far, you should more or less be able to answer the questions from the beginning of the article. Let me answer them briefly:

A message queue cannot be a single machine (it must be distributed/clustered)

Kafka is distributed by nature: data written to a topic is actually stored in partitions on multiple brokers.

Data written to a message queue may be lost, so it needs to be persisted somewhere (disk? database? Redis? a distributed file system?)

Kafka stores partition data as message logs on disk, and speeds this up with sequential I/O and caching (waiting for enough data or enough time) before actually writing to disk.

How do we keep messages (data) in order?

Kafka writes data to partitions, and writes within a single partition are sequential. To guarantee strict global order, write to only one partition, consumed by only one consumer.

Why might data in a message queue be consumed more than once?

No distributed system can fully avoid network jitter, machine crashes and similar problems. It may well happen that consumer A reads some data but hangs before it has had time to process it. Zookeeper detects that consumer A is gone and lets consumer B consume the partition that A was consuming. When A reconnects, it finds the same data has been consumed twice. (There are all sorts of similar cases, such as a consumer timing out.)

If the business cannot tolerate repeated consumption, it is best for the consumer to check on its own side (if a message has already been consumed, skip it).
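That consumer-side check can be sketched as simple deduplication by message id. This is illustrative only: a real system would persist the "already seen" set (for example in a database), and the message ids and payloads below are made up.

```python
# Sketch of consumer-side deduplication: remember processed message ids and
# skip repeats, so redelivery after a crash or rebalance does no harm.

processed = set()
results = []

def handle(msg_id, payload):
    if msg_id in processed:
        return False          # already consumed, skip the business logic
    results.append(payload)   # stands in for the actual business operation
    processed.add(msg_id)
    return True

assert handle(1, "pay order 42") is True
assert handle(1, "pay order 42") is False  # redelivered duplicate ignored
assert results == ["pay order 42"]
```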


This article is primarily an introduction to Kafka; there are many other concepts and details it doesn't cover. In my experience, many interview questions come down to configuration, so before solving a problem by hand, check whether the existing configuration can already solve it (the more frameworks you learn, the more you'll find that many problems have official support, often just a configuration/parameter change away).

This has been included in my GitHub featured articles, welcome to Star: github.com/ZhongFuChen…

Java3y is a public account that happily shares practical Java content. It has more than 300 original technical articles, plenty of video resources and nicely drawn mind maps; follow it to get them all!

Thank you very much for reading this far. If you think this article is well written and that I put in some effort, please like 👍, follow ️, share 👥 and leave a comment 💬; it really means a lot to me!

Creating content isn't easy; your support and recognition are the biggest motivation for my writing. See you in the next article!