Preface

Kafka introduces a number of concepts and terms, such as topics and partitions. Mastering these concepts up front will pay off when you later dig into Kafka's various functions and features. Let's walk through the main ones.

Service Node: Broker

For Kafka, a broker can be viewed simply as a single Kafka node or instance of a Kafka service. Typically, multiple brokers form a Kafka cluster. These brokers, also known as Kafka’s servers, receive and process requests from clients and persist messages.

While it is possible to run multiple brokers on the same physical machine, in production environments it is more common to run each broker on a different machine. If one machine in the Kafka cluster goes down, the remaining brokers can still provide service. This is one of the ways Kafka achieves high availability.

Important broker configuration parameters:

  1. broker.id. This parameter specifies the unique identifier of the broker within the Kafka cluster. The default value is -1; if it is not set explicitly, Kafka generates one automatically. To avoid conflicts between auto-generated ids and ids configured by users, auto-generated broker ids are drawn from a reserved range above the user-configurable ids.
  2. log.dirs. Kafka stores all messages on disk, and this parameter configures the root directory for the log files. The default path is /tmp/kafka-logs.
  3. zookeeper.connect. This parameter specifies the address of the ZooKeeper ensemble the broker connects to. It has no default value. If the ZooKeeper ensemble has multiple nodes, separate the addresses with commas, e.g. host1:port,host2:port.
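Put together, a minimal broker configuration (server.properties) might look like the sketch below; the host names and port are illustrative placeholders:

```properties
# Minimal illustrative broker configuration (server.properties)
broker.id=0
log.dirs=/tmp/kafka-logs
zookeeper.connect=host1:2181,host2:2181,host3:2181
```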

Producer

A client application that publishes a Record to Kafka is called a Producer. In Kafka, producers and consumers are completely decoupled and independent of each other, which is a key design point behind Kafka's well-known scalability. The producer does not need to care about or wait for the consumer's status; it usually just publishes messages continuously to one or more topics.

Important producer configuration parameters:

  1. bootstrap.servers. This parameter specifies the list of broker addresses the producer client uses to connect to the Kafka cluster, in the format host1:port,host2:port,host3:port. You can set one or more addresses, separated by commas; the default value is empty. It is not necessary to list all broker addresses, but setting at least two is recommended: if one of them goes down, the producer can still connect to the cluster.
  2. key.serializer and value.serializer. These two parameters specify the serializers used for message keys and values, respectively; they have no default. Note that you must supply the fully qualified class name, such as org.apache.kafka.common.serialization.StringSerializer; specifying just StringSerializer is wrong.
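A minimal producer configuration fragment combining these parameters might look as follows (host names are placeholders):

```properties
# Illustrative producer configuration
bootstrap.servers=host1:9092,host2:9092
key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=org.apache.kafka.common.serialization.StringSerializer
```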

Consumer

Client applications that subscribe to topics in Kafka are called consumers. A consumer subscribes to one or more topics in the Kafka cluster, pulls messages from those topics, and processes them according to its business logic.

On top of the consumer sits the concept of a Consumer Group: a group of consumer instances that jointly consume the subscribed topics. Every consumer belongs to a consumer group, and when a message is published to a topic, it is consumed by exactly one consumer instance within each subscribing group; no other instance in that group consumes it. Kafka introduced consumer groups precisely to scale up throughput on the consumption side.

A consumer group is a logical concept. Each consumer group has a fixed name, and a consumer must specify the name of the group it belongs to before consuming. This is set with the group.id parameter on the consumer client.
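The "one consumer per partition within a group" idea can be illustrated with a toy sketch (this is not the Kafka client's actual assignment code, just a round-robin simulation):

```python
# Toy sketch: within one consumer group, each partition is assigned to
# exactly one consumer, so each message is consumed once per group.
def assign_partitions(partitions, consumers):
    """Round-robin assignment of partitions to the consumers of one group."""
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment

partitions = ["topic1-0", "topic1-1", "topic1-2", "topic1-3"]
print(assign_partitions(partitions, ["c1", "c2"]))
# Every partition appears under exactly one consumer of the group.
```

Adding a consumer to the group spreads the same partitions over more instances, which is how consumer groups scale throughput.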

Important consumer configuration parameters:

  1. bootstrap.servers. This parameter specifies the list of broker addresses the consumer client uses to connect to the Kafka cluster, in the format host1:port,host2:port,host3:port. You can set one or more addresses, separated by commas; the default value is empty. It is not necessary to list all broker addresses, but setting at least two is recommended: if one of them goes down, the consumer can still connect to the cluster.
  2. group.id. The name of the consumer group the consumer belongs to. The default value is empty. In general it is recommended to give this parameter a name with some business meaning.
  3. key.deserializer and value.deserializer. These two parameters specify the deserializers used for message keys and values, respectively, mirroring the producer's serializers; they have no default. You must supply the fully qualified class name, such as org.apache.kafka.common.serialization.StringDeserializer; specifying just StringDeserializer is wrong.
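A minimal consumer configuration fragment might look as follows (host names and the group name are placeholders):

```properties
# Illustrative consumer configuration
bootstrap.servers=host1:9092,host2:9092
group.id=order-processing-group
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=org.apache.kafka.common.serialization.StringDeserializer
```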

Topic and Partition

Topics and partitions are two core concepts in Kafka. The producers and consumers described above actually operate at the topic and partition level. Both are logical concepts that organize business data, as distinct from the Log files actually stored at the physical layer. Each message sent is appended to the log file of its target partition and assigned a specific Offset. The offset is the unique identifier of a message within its partition; it does not carry across partitions, and Kafka uses it to guarantee that messages are ordered within a partition. Note that Kafka guarantees ordering per partition, not per topic.
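The append-plus-offset behavior can be sketched with a toy in-memory model (plain Python lists standing in for on-disk log files):

```python
# Toy sketch of per-partition logs: an offset uniquely identifies a message
# within its partition, and ordering holds only inside a partition.
logs = {0: [], 1: []}  # partition number -> list of messages

def append(partition, message):
    log = logs[partition]
    offset = len(log)  # the next offset is simply the current log length
    log.append(message)
    return offset

print(append(0, "a"))  # offset 0 in partition 0
print(append(1, "b"))  # offset 0 in partition 1 -- offsets do not cross partitions
print(append(0, "c"))  # offset 1 in partition 0
```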

As shown in the figure above, the four partitions of topic Topic1 can be distributed across the broker nodes. Before each message is sent from the producer to a broker, the target partition is selected according to partitioning rules. If the partitioning rules are set up sensibly, messages are spread evenly across the different partitions. Partitioning not only gives Kafka scalability through horizontal expansion; combined with the multi-replica mechanism, it also provides data redundancy that improves reliability.
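A typical partitioning rule hashes the message key so that the same key always lands in the same partition. Kafka's default partitioner uses murmur2 on the serialized key; the sketch below substitutes CRC32 purely for illustration:

```python
import zlib

# Sketch of a key-hash partitioning rule (CRC32 stands in for Kafka's murmur2).
def choose_partition(key: bytes, num_partitions: int) -> int:
    return zlib.crc32(key) % num_partitions

# The same key always maps to the same partition:
print(choose_partition(b"order-42", 4) == choose_partition(b"order-42", 4))  # True
```

Because all messages with one key go to one partition, per-key ordering follows directly from Kafka's per-partition ordering guarantee.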

What is the multi-replica mechanism? To improve disaster tolerance, Kafka stores data redundantly: the same message is copied to multiple places, and each stored copy is called a Replica. Replicas are classified into the Leader Replica and Follower Replicas. The leader replica handles read and write requests, while follower replicas only synchronize messages from the leader.

As shown in the figure above, the cluster has four brokers, and one topic has three partitions, each with one leader replica and two follower replicas. Producer and consumer clients interact only with the leader replica, while follower replicas merely synchronize messages from the leader (and most of the time lag slightly behind it).
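The leader/follower relationship can be sketched with a toy model (plain Python objects, not the real replication protocol):

```python
# Toy sketch of leader/follower replicas: writes go only to the leader,
# followers copy from it and may lag behind.
class Replica:
    def __init__(self):
        self.log = []

leader = Replica()
followers = [Replica(), Replica()]

def produce(message):
    leader.log.append(message)  # clients write to the leader only

def sync(follower):
    # a follower fetches whatever it is missing from the leader
    follower.log = list(leader.log)

produce("m1")
produce("m2")
sync(followers[0])  # follower 0 catches up; follower 1 still lags
print(len(followers[0].log), len(followers[1].log))  # 2 0
```

If the leader's broker fails, one of the up-to-date followers can be promoted to leader, which is how replication underpins Kafka's availability.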

Point-to-Point (P2P) and Publish/Subscribe (Pub/Sub)

For messaging middleware, traditional messaging systems offer two models: the point-to-point model and the publish/subscribe model. The point-to-point model is based on queues: a producer sends a message to a queue, a set of consumers reads from that queue, and each message is processed by exactly one of them. In the publish/subscribe model, messages sent to a topic are broadcast to all consumers.

Each traditional model has advantages and disadvantages. The advantage of point-to-point is that it distributes data across multiple consumers, scaling out processing capacity; the downside is that it cannot support multiple subscribers, because once a consumer reads a message, it disappears from the queue. Publish/subscribe allows you to broadcast data to multiple consumers, but it cannot scale out processing, because every message is delivered to every subscriber.

To get the best of both, Kafka introduced the concept of Consumer Groups, which covers both levels. If all consumers belong to the same group, messages are balanced across them, as in the point-to-point model; if consumers belong to different groups, every group receives all messages, as in the publish/subscribe model.
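The two-level behavior can be sketched in a toy delivery function (hypothetical group and consumer names, not real Kafka APIs):

```python
# Toy sketch of delivery under the consumer-group model: within a group,
# one consumer receives the message (queue semantics); across groups,
# every group receives it (broadcast semantics).
groups = {"billing": ["b1", "b2"], "audit": ["a1"]}

def deliver(message):
    deliveries = {}
    for group, members in groups.items():
        # exactly one member per group gets the message (pick the first here)
        deliveries[group] = members[0]
    return deliveries

print(deliver("order-created"))
# {'billing': 'b1', 'audit': 'a1'} -- one consumer per group, every group served
```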

Conclusion

To summarize all the terms mentioned here:

  1. Service node: Broker
  2. Topic
  3. Partition
  4. Producer
  5. Consumer
  6. Consumer Group
  7. Offset
  8. Replica (Leader Replica and Follower Replica)
  9. Log
  10. Point-to-point model: P2P
  11. Publish/subscribe model: Pub/Sub

Finally, a diagram is used to systematically show the relationship between these main concepts: