Introduction

Kafka is a distributed, partitioned, replicated messaging system. It provides the functionality of an ordinary messaging system, but with a unique design of its own. What does this design look like?

Let’s start with some basic messaging system terms:

Kafka categorizes messages by topic.

A program that publishes messages to a Kafka topic is called a producer.

An application that subscribes to topics and consumes the published messages is called a consumer.

Kafka runs as a cluster made up of one or more servers, each of which is called a broker.

Producers send messages over the network to the Kafka cluster, which provides messages to consumers, as shown in the following figure:

Clients communicate with the servers over TCP. Kafka provides a Java client, and clients are available for many other languages.

Topics and Logs

Let’s start with an abstraction that Kafka provides: the topic.

A topic is a category to which messages are published. For each topic, Kafka maintains a partitioned log, as shown below:

Each partition is an ordered, immutable sequence of messages to which new messages are continually appended. Each message in a partition is assigned a sequential number called the offset, which uniquely identifies the message within that partition.
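The append-only behaviour of a partition can be illustrated with a minimal in-memory sketch. The class name `PartitionLog` and its methods are hypothetical and exist only for this illustration; they are not part of any Kafka API, and a real broker stores the log on disk.

```python
class PartitionLog:
    """Toy in-memory stand-in for one partition: an append-only list."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Append a message and return the offset it was assigned."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset):
        """Return the message stored at the given offset."""
        return self._messages[offset]


log = PartitionLog()
first = log.append("order-created")   # gets offset 0
second = log.append("order-paid")     # gets offset 1
```

Because messages are only ever appended, an offset handed out once keeps pointing at the same message for as long as that message is retained.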

The Kafka cluster retains all published messages, whether or not they have been consumed, for a configurable period of time. For example, if the retention period is set to two days, a message can be consumed for two days after it is published; after that it is discarded to free up space. Kafka’s performance is effectively constant with respect to data size, so retaining a large amount of data is not a problem.
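As a rough sketch of time-based retention, the function below drops messages once they are older than the retention window, whether or not anyone has read them. The function name, the `(publish_time, message)` layout, and the per-message granularity are assumptions made for this example; a real broker manages retention at a coarser level on disk.

```python
from collections import deque

RETENTION_SECONDS = 2 * 24 * 60 * 60  # two days, as in the example above


def discard_expired(log, now):
    """Drop messages older than the retention window, oldest first.

    `log` is a deque of (publish_time, message) pairs in append order.
    Consumption state plays no role in the decision.
    """
    while log and now - log[0][0] > RETENTION_SECONDS:
        log.popleft()
    return log


log = deque([(0, "a"), (100_000, "b"), (200_000, "c")])
discard_expired(log, now=250_000)  # "a" is older than two days and is dropped
```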

In fact, the only metadata kept for each consumer is its position in the log, known as the offset. The offset is controlled by the consumer: normally it advances linearly as the consumer reads messages, but the consumer can read messages in any order. For example, it can reset the offset to an older value to re-read previous messages.
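A minimal sketch of a consumer that owns its offset might look like the following. The `Consumer` class and its `poll` and `seek` methods are hypothetical names for this toy model, and a plain list stands in for the partition.

```python
class Consumer:
    """Toy consumer that tracks its own position in a partition."""

    def __init__(self, partition):
        self.partition = partition  # a plain list standing in for the log
        self.offset = 0

    def poll(self):
        """Return the next message and advance, or None at the end."""
        if self.offset >= len(self.partition):
            return None
        message = self.partition[self.offset]
        self.offset += 1
        return message

    def seek(self, offset):
        """Move to an arbitrary offset, e.g. to re-read old messages."""
        self.offset = offset


partition = ["m0", "m1", "m2"]
consumer = Consumer(partition)
seen = [consumer.poll(), consumer.poll()]  # reads m0, then m1
consumer.seek(0)                           # rewind to the beginning
replayed = consumer.poll()                 # reads m0 again
```

Since reading never modifies the log, a second `Consumer` over the same list would progress independently of the first.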

The combination of these features makes Kafka consumers very lightweight: they can come and go without much impact on the cluster or on other consumers. For example, you can tail the messages of a topic from the command line without affecting the consumers that are already reading it.

Partitioning the log serves two purposes. First, it lets the log grow beyond what fits on a single server: each individual partition must fit on the server that hosts it, but a topic can have many partitions. Second, each partition can be published to and consumed from independently, so partitions act as the unit of parallelism for a topic.

Distribution

The partitions of a log are distributed across the servers in the Kafka cluster, with each server handling the data and requests for its share of the partitions. Each partition is replicated across a configurable number of servers, which makes Kafka fault tolerant.

Each partition has one server that acts as the “leader” and zero or more servers that act as “followers”. The leader handles all reads and writes for the partition, while the followers replicate the leader. If the leader goes down, one of the followers automatically becomes the new leader. Each server acts as the leader for some of its partitions and as a follower for others, so the load is well balanced across the cluster.
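The leader and follower roles can be sketched with a toy model. The `Partition` class and broker names are hypothetical, and promoting the first remaining follower is an assumption made for this example; the text above does not say how the cluster chooses the new leader.

```python
class Partition:
    """Toy model of one replicated partition."""

    def __init__(self, replicas):
        self.leader = replicas[0]
        self.followers = list(replicas[1:])

    def fail(self, broker):
        """Remove a failed broker; promote a follower if it was the leader."""
        if broker == self.leader:
            if not self.followers:
                raise RuntimeError("no replica left to take over")
            # Assumption: simply promote the first remaining follower.
            self.leader = self.followers.pop(0)
        elif broker in self.followers:
            self.followers.remove(broker)


p0 = Partition(["broker-1", "broker-2", "broker-3"])
p0.fail("broker-1")  # the leader goes down, broker-2 takes over
```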

Producers

The producer publishes messages to the topics of its choice and decides which partition within the topic each message goes to. Partitions can be chosen by a simple load-balancing scheme such as random or round-robin selection, or by a partitioning function, for example one based on a key in the message. The second approach is the more commonly used.
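The two ways of choosing a partition can be compared in a short sketch. Both function names are hypothetical, and the CRC32 hash is an arbitrary choice for illustration, not a statement about the hash Kafka itself uses.

```python
import itertools
import zlib


def round_robin_partitioner(num_partitions):
    """Return a function that cycles through partitions to spread load."""
    counter = itertools.cycle(range(num_partitions))
    return lambda key=None: next(counter)


def key_partitioner(num_partitions):
    """Return a function that maps equal keys to the same partition."""
    def choose(key):
        # crc32 is stable across runs; Python's built-in hash() of a
        # string varies per process, so it is unsuitable here.
        return zlib.crc32(key.encode("utf-8")) % num_partitions
    return choose


rr = round_robin_partitioner(4)
by_key = key_partitioner(4)
spread = [rr() for _ in range(5)]  # cycles 0, 1, 2, 3, then back to 0
```

With a key-based function, all messages that share a key land in the same partition, which matters once ordering is considered.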

Consumers

Messaging traditionally follows one of two models: queuing and publish-subscribe. In the queuing model, a pool of consumers reads from the server at the same time, and each message goes to only one of them. In the publish-subscribe model, each message is broadcast to all consumers.

Kafka covers both models with the consumer group. Consumers label themselves with a consumer group name, and each message published to a topic is delivered to one consumer within each subscribing group. Consumers in the same group can run in different processes or on different machines. If all consumers are in the same group, this works like a traditional queue, with load balanced across the consumers.

If every consumer is in a different group, this works like publish-subscribe, and every message is delivered to every consumer.

More commonly, a topic has a small number of consumer groups, each acting as one logical “subscriber”. Each group is made up of several consumers for scalability and fault tolerance. This is still publish-subscribe, except that the subscriber is a group of consumers rather than a single process.
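The delivery rule for consumer groups (one receiver per group, every group served) can be sketched as follows. The `deliver` function and the consumer names are hypothetical, and rotating through the members message by message is a simplification; in Kafka the work is divided by partition.

```python
import itertools


def deliver(messages, groups):
    """Deliver each message to exactly one consumer in every group.

    `groups` maps a group name to its list of consumer names. Returns
    a dict of consumer name -> messages received.
    """
    received = {c: [] for members in groups.values() for c in members}
    pickers = {g: itertools.cycle(members) for g, members in groups.items()}
    for message in messages:
        for group in groups:
            received[next(pickers[group])].append(message)
    return received


groups = {"A": ["a1", "a2"], "B": ["b1"]}
received = deliver(["m0", "m1", "m2", "m3"], groups)
```

Within group A the messages are split between `a1` and `a2` as in a queue, while group B, as a separate subscriber, still sees every message.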

A two-server Kafka cluster hosting four partitions (P0-P3) with two consumer groups. Group A has two consumers and group B has four.

Kafka also provides stronger ordering guarantees than a traditional messaging system.

A traditional queue keeps messages in order on the server, and when multiple consumers read from the queue, the server hands out messages in the order in which they are stored. However, although the server hands out messages in order, they are delivered to consumers asynchronously, so they may arrive out of order. In other words, ordering is lost under concurrent consumption. Messaging systems often work around this with the notion of a “dedicated consumer”, which allows only one consumer to read from a queue, but that of course gives up concurrency.

Kafka does this better. Because a topic is divided into partitions, Kafka can provide both ordering guarantees and load balancing over a pool of consumers. Each partition of a topic is assigned to exactly one consumer in a consumer group, so that consumer is the only reader of the partition within the group and consumes its messages in order. Because there are many partitions, the load is still balanced across the consumers in the group. Note that a group cannot usefully have more consumers than there are partitions: the partition count is the upper limit on concurrent consumption within a group.

Kafka guarantees the ordering of messages only within a partition, not across the partitions of a topic, which is sufficient for most applications. If a total order over all messages in a topic is required, the topic should have only one partition, which also means only one consumer per consumer group can consume it.
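The assignment of partitions to the consumers of one group can be sketched as below. The `assign_partitions` function is hypothetical, and the round-robin split is an assumption for this example rather than a description of how Kafka actually assigns partitions.

```python
def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer in the group.

    Returns a dict of consumer -> list of partitions. Consumers beyond
    the partition count end up with an empty list, i.e. they sit idle.
    """
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = ["P0", "P1", "P2", "P3"]
group_a = assign_partitions(partitions, ["a1", "a2"])
oversized = assign_partitions(partitions, ["c1", "c2", "c3", "c4", "c5"])
```

With four partitions, a fifth consumer in the same group receives nothing, which is why the partition count caps the useful size of a group.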
