The previous article, “Overview of Messaging Systems”, introduced messaging systems. This time we will learn the basic concepts in Kafka. First of all, let’s review the three roles in a typical messaging scenario: producer, messaging system, and consumer. The producer generates messages and sends them to the messaging system, and the messaging system in turn provides those messages to consumers for processing, as shown in the following figure.
Kafka is a publish/subscribe messaging system, as shown below. Producers push messages to a Topic in Kafka. Topics exist to categorize messages, so that consumers can subscribe to just the Topics they need.
At this point Kafka can start working, but it faces a single point of failure. To solve this, Kafka introduces the concept of a Broker. A Broker is an instance of Kafka, and multiple brokers can run on one machine; here we consider only one Kafka instance per machine. Multiple brokers form a Kafka cluster, and a Topic can specify a number of replicas, which are then distributed across multiple machines. Kafka uses ZooKeeper to elect a Leader from among the replicas, and the other replicas act as followers. The Leader is responsible for reading and writing messages, that is, for dealing with producers and consumers, while its messages are also replicated to the followers. When a Broker fails and a Topic is left without a Leader, a new Leader is elected from the remaining replicas, which solves the single point of failure.
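To make the failover idea concrete, here is a minimal sketch in Python. It is a toy model, not Kafka's election protocol: the class `ReplicatedTopic`, its methods, and the rule "pick the first live replica in list order" are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ReplicatedTopic:
    """Toy model of one topic whose replicas live on several brokers."""

    name: str
    replicas: list[int]  # ids of the brokers that hold a copy
    alive: set[int] = field(init=False)
    leader: Optional[int] = field(init=False)

    def __post_init__(self) -> None:
        # Every replica starts alive, then an initial leader is chosen.
        self.alive = set(self.replicas)
        self.elect_leader()

    def elect_leader(self) -> None:
        # Hypothetical rule: the first live replica in list order wins.
        self.leader = next((b for b in self.replicas if b in self.alive), None)

    def broker_failed(self, broker_id: int) -> None:
        # Mark the broker as dead; re-elect only if it was the leader.
        self.alive.discard(broker_id)
        if self.leader == broker_id:
            self.elect_leader()

    def followers(self) -> list[int]:
        # All live replicas except the current leader.
        return [b for b in self.replicas if b in self.alive and b != self.leader]


topic = ReplicatedTopic("TopicA", replicas=[1, 2, 3])
print(topic.leader, topic.followers())  # 1 [2, 3]
topic.broker_failed(1)
print(topic.leader, topic.followers())  # 2 [3]
```

As long as at least one replica is still alive, the topic keeps a Leader, which is the whole point of having replicas.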
The introduction of brokers and replicas solves the single point of failure, but then comes the performance problem. For a single Topic, only the Broker that holds the Leader communicates with producers and consumers, so throughput is limited by the machine on which that Broker resides. So how can throughput be improved? Kafka splits a Topic into partitions, much like partitioning in a database, so that the load is balanced across servers instead of resting on a single one: each partition has its own Leader, and those Leaders can sit on different brokers. As shown in the figure below, producer A writes messages to partitions 0 and 1 of TopicA, while consumers A and B each get messages from partitions 0 and 1. Note that different partitions store different messages, so partitions must not be confused with replicas, which hold copies of the same messages.
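The difference between partitions and replicas may be easier to see in a small sketch. The function `place_replicas` below is hypothetical, and its placement rule (replica `r` of partition `p` goes to broker `(p + r) % number_of_brokers`) is a simplification chosen for the example, not the rule Kafka itself uses. The first broker in each list is treated as that partition's Leader.

```python
from __future__ import annotations


def place_replicas(
    num_partitions: int, replication_factor: int, brokers: list[str]
) -> dict[int, list[str]]:
    """Return {partition: [brokers holding a replica]}, leader first."""
    if replication_factor > len(brokers):
        # Two copies of the same partition on one broker would add no safety.
        raise ValueError("replication factor cannot exceed the number of brokers")
    layout = {}
    for p in range(num_partitions):
        # Shift the starting broker per partition so leaders are spread out.
        layout[p] = [
            brokers[(p + r) % len(brokers)] for r in range(replication_factor)
        ]
    return layout


layout = place_replicas(2, 2, ["broker-0", "broker-1", "broker-2"])
print(layout)
# {0: ['broker-0', 'broker-1'], 1: ['broker-1', 'broker-2']}
```

Partition 0 and partition 1 contain different messages and are led by different brokers, which is what spreads the load. The two entries inside each list are replicas: they contain the same messages and exist only for fault tolerance.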
In the figure above, we can see that consumer A gets messages from both partition 0 and partition 1 when consuming TopicA. To further improve throughput, Kafka introduces the concept of consumer groups, which split consumer A into multiple consumers that together form a single consumer group. We can think of consumer A as an instance of application A. To improve consumption throughput, we deploy several instances of consumer A, so that multiple consumers form a consumer group; they all do what application A does, and they need to be kept separate from consumer B (a different application). Generally, the number of consumers in a consumer group is set equal to the number of partitions, so that each consumer is responsible for exactly one partition. If the number of consumers in the group is less than the number of partitions, a single consumer is responsible for multiple partitions. If the number of consumers is greater than the number of partitions, some consumers will not be assigned any partition and will sit idle, which is wasteful. So the two numbers are generally kept equal. For brevity, consumer group B is not drawn in the figure below, since it is similar to consumer group A.
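The three cases above (as many consumers as partitions, fewer, and more) can be sketched with a simple round-robin assignment. This is an illustration only: `assign_partitions` and the consumer names are made up for the example, and Kafka's own assignment strategies are more involved.

```python
from __future__ import annotations


def assign_partitions(
    partitions: list[int], consumers: list[str]
) -> dict[str, list[int]]:
    """Hand out the partitions to the consumers of ONE group, in turn."""
    if not consumers:
        return {}
    assignment: dict[str, list[int]] = {c: [] for c in consumers}
    for i, partition in enumerate(sorted(partitions)):
        # Each partition goes to exactly one consumer in the group.
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


# Same number of consumers and partitions: one partition each.
print(assign_partitions([0, 1], ["A-1", "A-2"]))
# {'A-1': [0], 'A-2': [1]}

# Fewer consumers than partitions: each consumer handles several.
print(assign_partitions([0, 1, 2, 3], ["A-1", "A-2"]))
# {'A-1': [0, 2], 'A-2': [1, 3]}

# More consumers than partitions: A-3 gets nothing and sits idle.
print(assign_partitions([0, 1], ["A-1", "A-2", "A-3"]))
# {'A-1': [0], 'A-2': [1], 'A-3': []}
```

Consumer group B would run the same assignment independently over the same partitions, because each group consumes the Topic on its own.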
Those are the basic concepts in Kafka: producer, consumer, Topic, Broker, replica, partition, and consumer group. Finally, to give you a better idea of partitioning, let me draw one more detail.
A partition can be regarded as a separate queue. Producers write each message to a partition chosen according to one of three policies. First, if the message specifies a partition, it is written to that partition. Second, if no partition is specified, a hash function is applied to the key of the message to pick the partition. Third, if there is no key either, partitions are chosen in turn (round-robin). The point here is that the data in each partition is different, and a message enters exactly one partition. The consumers in the consumer group then fetch messages from their partitions by offset, that is, by position within the partition, and process them.
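The three policies and the role of the offset can be put together in one small sketch. Everything here is a toy: `ToyPartitioner` and `PartitionLog` are hypothetical names, and `zlib.crc32` is only a stand-in hash picked for the example, not the hash that Kafka's clients use.

```python
import itertools
import zlib


class ToyPartitioner:
    """Chooses a partition using the three policies described above."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def choose(self, key=None, partition=None):
        if partition is not None:
            # Policy 1: the message names its partition explicitly.
            return partition
        if key is not None:
            # Policy 2: hash the key, so equal keys share a partition.
            return zlib.crc32(key) % self.num_partitions
        # Policy 3: no key, so take the partitions in turn.
        return next(self._round_robin)


class PartitionLog:
    """One partition: an append-only queue addressed by offset."""

    def __init__(self):
        self.messages = []

    def append(self, value):
        # The offset of a message is its position in the partition.
        self.messages.append(value)
        return len(self.messages) - 1

    def read(self, offset, max_messages=10):
        # A consumer asks for messages starting at its current offset.
        return self.messages[offset:offset + max_messages]


partitioner = ToyPartitioner(num_partitions=3)
logs = [PartitionLog() for _ in range(3)]

# Keyless messages are spread over partitions 0, 1, 2, 0 in turn.
for value in ["m0", "m1", "m2", "m3"]:
    logs[partitioner.choose()].append(value)

print(logs[0].read(offset=0))  # ['m0', 'm3']
print(logs[0].read(offset=1))  # ['m3']
```

Each message lands in exactly one `PartitionLog`, and a consumer that remembers its offset can resume reading from where it left off.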