Publish and subscribe messaging system

Before formally discussing Apache Kafka (hereinafter referred to as Kafka), let’s look at the concept of a publish and subscribe messaging system and why it matters. A defining feature of publish and subscribe messaging systems is that the sender (publisher) of a piece of data (a message) does not send it directly to the receiver. Instead, publishers classify messages in some way, and receivers (subscribers) subscribe to them in order to receive specific types of messages. Publish and subscribe systems typically have a broker, a central point at which messages are published.

Most application scenarios for publish and subscribe messaging systems start with a simple message queue or a channel for interprocess communication. For example, an e-commerce system includes a membership module, an order module, a product module, a recommendation module, and a distribution and logistics module, and these modules (subsystems) all need to pass messages to one another.

The earliest solution was to connect the subsystems directly, which left them interleaved and complex. This kind of point-to-point connection forms a mesh of links and has many disadvantages, which we won’t enumerate here.

Later, queue systems appeared to untangle these direct connections between subsystems. The architecture shown below consists of three separate publish and subscribe systems.

This is much better than using point-to-point connections, but there is too much duplication: the company ends up maintaining multiple systems for queuing data, each with its own flaws and deficiencies. Moreover, there are likely to be more scenarios in which messaging systems will be used. What you really need at this point is a single, centralized system that can publish generic types of data and that can grow as your business grows. This is where Kafka comes in.

The emergence of Kafka

Kafka is a publish-and-subscribe messaging system designed to solve exactly this problem. It is often described as a “distributed commit log” or a “distributed streaming platform”. A file system’s or database’s commit log provides a persistent record of all transactions and can be replayed to reconstruct the state of the system. Similarly, Kafka’s data is persisted in a fixed order and can be read as needed. In addition, Kafka’s data is distributed across the system, providing protection against failures as well as opportunities for performance scaling.

Messages and Batches

Kafka’s unit of data is called a message. If you have database experience, you can think of a message as a “row” or a “record” in a database. A message is just an array of bytes, so its contents have no particular format or meaning to Kafka. A message can have an optional piece of metadata: a key. The key is also an array of bytes and, like the message, has no special meaning to Kafka. Keys are used when messages need to be written to partitions in a controlled manner. The simplest scheme is to compute a consistent hash of the key and take that hash modulo the number of topic partitions to select a partition for the message. This ensures that messages with the same key are always written to the same partition.
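
As a sketch, the key-to-partition mapping might look like the following Java snippet. The hash function here is a stand-in chosen only for illustration (Kafka’s own default partitioner uses a murmur2 hash of the serialized key):

    import java.nio.charset.StandardCharsets;
    import java.util.zip.CRC32;

    public class KeyPartitioning {
        // Hash the key bytes and take the result modulo the number of
        // partitions; the same key therefore always maps to the same partition.
        static int partitionFor(String key, int numPartitions) {
            CRC32 crc = new CRC32(); // illustrative hash; Kafka itself uses murmur2
            crc.update(key.getBytes(StandardCharsets.UTF_8));
            return (int) (crc.getValue() % numPartitions);
        }

        public static void main(String[] args) {
            // "user-42" lands in the same one of the 4 partitions on every call
            System.out.println(partitionFor("user-42", 4));
        }
    }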

For efficiency, messages are written to Kafka in batches. A batch is a group of messages that all belong to the same topic and partition. Sending each message across the network individually would incur a lot of overhead, and grouping messages into batches reduces it. There is, however, a tradeoff between latency and throughput: the larger the batch, the more messages can be processed per unit of time, but the longer an individual message takes to be transmitted. Batches can also be compressed, which makes data transfer and storage more efficient at the cost of extra processing.
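
In the Java producer, this tradeoff is controlled through configuration. The option names below are standard producer settings; the broker address, topic name, and chosen values are illustrative assumptions:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class BatchingProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            // Batching knobs: send a batch once it reaches 32 KB, or after
            // waiting at most 10 ms for more messages to accumulate.
            props.put("batch.size", "32768");
            props.put("linger.ms", "10");
            // Compress whole batches: less network and disk I/O, more CPU.
            props.put("compression.type", "snappy");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("page-views", "user-42", "/index"));
            }
        }
    }

Larger batch.size and linger.ms values raise throughput at the cost of latency; the values above lean toward low latency.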

Topics and Partitions

Kafka’s messages are organized into topics. Topics are like tables in a database, or folders in a file system. A topic can be divided into several partitions, and each partition is a commit log. Messages are appended to a partition and then read in first-in, first-out order. Note that since a topic typically contains several partitions, message ordering cannot be guaranteed across the whole topic, but it can be guaranteed within a single partition. The topic shown in the following figure has four partitions, and messages are appended to the end of each one. Kafka achieves data redundancy and scalability through partitioning. Partitions can be distributed across servers; that is, a single topic can span multiple servers to provide more performance than any single server could.
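
To make this concrete, here is a minimal sketch that creates a four-partition topic with Kafka’s Java AdminClient; the broker address and topic name are assumptions for illustration:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreatePageViewsTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
            try (AdminClient admin = AdminClient.create(props)) {
                // Four partitions; replication factor 1 keeps the sketch
                // runnable on a single-broker development setup.
                NewTopic topic = new NewTopic("page-views", 4, (short) 1);
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }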

We often use the term stream to describe how systems like Kafka work with data. Often, people treat a topic’s data as a single stream, no matter how many partitions it has. A stream is a set of data moving from producers to consumers, and this is how messages are usually described when we talk about stream processing. Frameworks such as Kafka Streams, Apache Samza, and Storm process messages in real time; this is what is meant by stream processing. We can contrast stream processing with offline processing, such as Hadoop, which is designed to work on large amounts of data at a later point in time.
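
As a small illustration of treating a topic as a stream, the following Kafka Streams sketch filters one topic into another in real time; the topic names and application id are hypothetical:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class StreamDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "page-view-filter");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            // Treat the topic as a continuous stream: each message is processed
            // as it arrives, rather than in a later batch job.
            KStream<String, String> views = builder.stream("page-views");
            views.filter((user, page) -> !page.startsWith("/internal"))
                 .to("public-page-views");

            new KafkaStreams(builder.build(), props).start();
        }
    }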

Producers and consumers

Kafka’s clients are the users of the Kafka system, and they come in two basic types: producers and consumers. There are also higher-level client APIs: the Kafka Connect API for data integration and Kafka Streams for stream processing. These advanced clients use producers and consumers as internal components and provide higher-level functionality on top of them.

The producer creates messages. In other publish and subscribe systems, a producer may be called a publisher or writer. A message is generally published to a specific topic. By default, the producer distributes messages evenly across all partitions of a topic without caring which partition a particular message ends up in. In some cases, however, the producer writes messages directly to a specified partition. This is typically done with a message key and a partitioner, which generates a hash of the key and maps it to a specific partition, ensuring that messages with the same key are written to the same partition. Producers can also use a custom partitioner that maps messages to partitions according to different business rules. Producers are covered in detail in the next chapter.
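
A minimal keyed producer might look like the sketch below; the callback just reports which partition the partitioner chose (broker address, topic name, and key are again assumptions):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class KeyedProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // With a key, the default partitioner hashes "user-42" to pick
                // a partition; every message with this key goes to that partition.
                ProducerRecord<String, String> record =
                        new ProducerRecord<>("page-views", "user-42", "/cart");
                producer.send(record, (metadata, e) -> {
                    if (e == null) {
                        System.out.printf("written to partition %d at offset %d%n",
                                metadata.partition(), metadata.offset());
                    }
                });
                producer.flush();
            }
        }
    }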

The consumer reads messages. In other publish and subscribe systems, a consumer may be called a subscriber or reader. Consumers subscribe to one or more topics and read the messages in the order in which they were produced. A consumer keeps track of which messages it has already read by checking the message offset. The offset is another piece of metadata: a monotonically increasing integer that Kafka assigns to each message as it is appended to a partition. Each message has a unique offset within a given partition. The consumer saves the last read offset for each partition in Zookeeper or in Kafka itself, so its read position is not lost if the consumer shuts down or restarts.
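
A minimal consumer sketch, assuming the same hypothetical local broker and topic, prints the per-partition offsets as messages are read:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class PageViewConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
            props.put("group.id", "page-view-readers");       // the consumer group this consumer joins
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("page-views"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        // The offset identifies this message's position within its
                        // partition; Kafka tracks the group's last committed offset.
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                record.partition(), record.offset(), record.value());
                    }
                }
            }
        }
    }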

A consumer is part of a consumer group, that is, one or more consumers that read a topic together. The group guarantees that each partition is consumed by only one of its members. In the group shown below, three consumers read one topic at the same time: two of them each read one partition, while the third reads the other two partitions. The mapping between a consumer and a partition is often called the consumer’s ownership of the partition.

In this way, consumers can scale to consume topics that contain very large numbers of messages. Also, if a single consumer fails, the other consumers in the group can take over the failed consumer’s work. Chapter 4 covers consumers and consumer groups in detail.

The broker and the cluster

A single Kafka server is called a broker. The broker receives messages from producers, assigns offsets to them, and commits the messages to disk for storage. The broker also serves consumers, responding to requests to read partitions and returning messages that have been committed to disk. Depending on the specific hardware and its performance characteristics, a single broker can easily handle thousands of partitions and millions of messages per second.

A broker can be thought of as a message-middleware processing node: one Kafka node is one broker, and one or more brokers form a Kafka cluster.

Brokers are part of a cluster. In each cluster, one broker also acts as the cluster controller (elected automatically from among the active members of the cluster). The controller is responsible for administrative work, including assigning partitions to brokers and monitoring brokers. Within a cluster, a partition belongs to one broker, which is called the partition leader. A partition can also be assigned to multiple brokers, in which case partition replication occurs (see the figure below). This replication mechanism provides message redundancy for the partition: if one broker fails, another broker can take over leadership, although the producers and consumers concerned must reconnect to the new leader.
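
You can observe the leader and replicas of each partition with the AdminClient, as in this sketch (broker address and topic name are assumed, and the exact describeTopics accessors vary slightly across client versions):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.TopicPartitionInfo;

    public class ShowLeaders {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
            try (AdminClient admin = AdminClient.create(props)) {
                TopicDescription desc = admin.describeTopics(Collections.singletonList("page-views"))
                                             .all().get().get("page-views");
                for (TopicPartitionInfo p : desc.partitions()) {
                    // leader() is the broker currently serving this partition;
                    // replicas() are the brokers holding copies of it.
                    System.out.printf("partition %d: leader=%s replicas=%s%n",
                            p.partition(), p.leader(), p.replicas());
                }
            }
        }
    }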

Retaining messages (for a certain period of time) is an important feature of Kafka. By default, a Kafka broker retains messages either for a fixed period of time (such as 7 days) or until the partition reaches a certain size in bytes (such as 1 GB). When either limit is reached, the oldest messages expire and are deleted, so the amount of data available at any time never exceeds what the configuration parameters specify. Each topic can also configure its own retention policy and keep messages for as long as they are still being used. For example, data used to track user activity might be retained for a few days, while application metrics might only be retained for a few hours. A topic can also be configured for log compaction, where only the most recent message for each key is retained. This is especially useful for changelog-style data, where only the latest change matters.
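
Retention is configured per topic with settings such as retention.ms, retention.bytes, and cleanup.policy. The sketch below applies them through the AdminClient; the broker address, topic name, and values are illustrative:

    import java.util.Collection;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class RetentionConfig {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
            try (AdminClient admin = AdminClient.create(props)) {
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "page-views");
                // Keep messages for 7 days, or until a partition holds 1 GB,
                // whichever limit is reached first.
                AlterConfigOp retainTime = new AlterConfigOp(
                        new ConfigEntry("retention.ms", "604800000"), AlterConfigOp.OpType.SET);
                AlterConfigOp retainBytes = new AlterConfigOp(
                        new ConfigEntry("retention.bytes", "1073741824"), AlterConfigOp.OpType.SET);
                // For changelog-style topics, log compaction would be enabled
                // instead with: new ConfigEntry("cleanup.policy", "compact")
                Map<ConfigResource, Collection<AlterConfigOp>> changes =
                        Map.of(topic, List.of(retainTime, retainBytes));
                admin.incrementalAlterConfigs(changes).all().get();
            }
        }
    }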

Why Kafka

Multiple producers

Kafka seamlessly supports multiple producers, whether the clients use one topic or many topics. This makes it well suited for collecting data from many front-end systems and presenting it in a unified format. For example, a website composed of multiple microservices could create a single topic for page views, to which all services write data in the same message format. Consumer applications then get a unified view of page-view data without having to coordinate data flows from the different producers.

Multiple consumers

In addition to supporting multiple producers, Kafka supports multiple consumers reading data from a single stream of messages without interfering with one another. This differs from other queuing systems, where once a message has been read by one client, it is no longer available to any other client. Furthermore, multiple consumers can form a group and share a stream of messages, with the guarantee that the group as a whole processes each message only once.
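
This behavior is driven entirely by the group.id setting, as the configuration sketch below suggests: consumers sharing a group.id split the partitions among themselves, while a consumer with a different group.id independently receives every message (the group names and broker address are hypothetical):

    import java.util.Properties;

    public class GroupIds {
        // Consumers that share this group.id divide a topic's partitions
        // among themselves: each message goes to exactly one of them.
        static Properties dashboardGroup() {
            Properties p = new Properties();
            p.put("bootstrap.servers", "localhost:9092");
            p.put("group.id", "dashboard");
            return p;
        }

        // A consumer configured with a different group.id reads the same
        // topic independently and also receives every message.
        static Properties auditGroup() {
            Properties p = new Properties();
            p.put("bootstrap.servers", "localhost:9092");
            p.put("group.id", "audit");
            return p;
        }
    }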

Disk-based data storage

Not only does Kafka support multiple consumers, it also allows consumers to read messages in non-real-time, thanks to Kafka’s data retention feature. Messages are committed to disk and kept according to the configured retention rules. Retention rules can be set individually for each topic, so different topics can retain different numbers of messages to meet the needs of different consumers. Durable storage ensures that data is not lost when consumers cannot read messages in time, whether because of slow processing or a sudden surge in traffic. Consumers can also be taken offline for a short period of time for application maintenance without worrying about losing messages or congestion on the producer side: the consumer can be shut down, but the messages remain in Kafka, and when it comes back it continues processing from where it left off.

Scalability

Kafka was designed from the start to be flexible and scalable so that it can easily handle large amounts of data. Users can start with a single broker during development, scale to a small development cluster with three brokers, and then move to a production cluster with hundreds of brokers as the data volume grows. Scaling an online cluster does not affect overall system availability; that is, a cluster containing multiple brokers can continue to serve clients even if individual brokers fail. To improve the fault tolerance of the cluster, you need to configure a higher replication factor.

High performance

All of these features make Kafka a high-performance publish and subscribe messaging system. By scaling out producers, consumers, and brokers, Kafka can easily handle very large streams of messages while still delivering sub-second message latency.