Kafka is a distributed publish-subscribe messaging system. It was originally developed at LinkedIn and later became an Apache project. Kafka is a distributed, partitioned, replicated, and persistent log service, used primarily for processing streams of activity data.

In big data systems, a common problem is that the overall system consists of many subsystems, between which data must flow continuously with high throughput and low latency. Traditional enterprise messaging systems are not well suited to large-scale data processing. Kafka was created to serve both online applications (messages) and offline applications (data files, logs) at the same time. Kafka serves two purposes:

It reduces system networking complexity.

It reduces programming complexity. Instead of every pair of subsystems negotiating its own interface, each subsystem plugs into Kafka as a high-speed data bus, like a plug into a socket.

Main features of Kafka

Provides high throughput for both publishing and subscribing. Kafka is reported to produce about 250,000 messages per second (50 MB) and consume about 550,000 messages per second (110 MB).

Persistence. Messages are persisted to disk, so they can be used for batch consumption such as ETL as well as by real-time applications. Data loss is prevented by persisting data to disk and by replication.

Distributed system, easy to scale out. Producers, brokers, and consumers can all be multiple, and all of them are distributed. Machines can be added without stopping the cluster.

The state of message processing (consumption progress) is maintained on the consumer side, not on the server side, with automatic rebalancing on failure.

Online and offline scenarios are supported.

Kafka architecture

The overall architecture of Kafka is very simple. It is an explicitly distributed architecture: there can be multiple producers, brokers (Kafka servers), and consumers. Producers and consumers implement Kafka's registration interface. Data is sent from producers to the broker, which acts as an intermediate buffer and distributor and delivers data to the consumers registered with it. The broker thus acts as a buffer between active data and the offline processing system. Communication between clients and servers is based on a simple, high-performance, language-independent TCP protocol. A few basic concepts (a producer sketch follows the list):

Topic: A category of message feed handled by Kafka.

Partition: A physical grouping of a topic. A topic can be divided into multiple partitions, each of which is an ordered queue. Each message in a partition is assigned an ordered ID called its offset.

Message: The basic unit of communication. Each producer can publish messages to a topic.

Producer: A publisher of messages and data; the client that publishes messages to a Kafka topic.

Consumer: A subscriber of messages and data; the client that subscribes to topics and processes the messages published to them.

Broker: A Kafka server. One or more servers in a Kafka cluster are collectively called brokers.
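
To make these concepts concrete, here is a minimal producer sketch using the Java client. The broker address, topic name, key, and serializer choices are assumptions for illustration, not values from the original text.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class SimpleProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Address of one or more brokers in the cluster (assumed host/port)
        props.put("bootstrap.servers", "localhost:9092");
        // Keys and values are sent as plain strings in this sketch
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one message to the hypothetical topic "page-views";
            // the key influences which partition the message lands in
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("page-views", "user-42", "clicked /home");
            RecordMetadata meta = producer.send(record).get();
            System.out.printf("partition=%d offset=%d%n", meta.partition(), meta.offset());
        }
    }
}
```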

Message Sending Process

The producer publishes messages to partitions of a specific topic according to a specified partitioning method (such as round-robin or hashing)

The Kafka cluster receives messages from Producer and persists them to hard disks for a specified (configurable) length of time, regardless of whether the messages are consumed.

The consumer pulls data from the Kafka cluster and controls the offset from which messages are retrieved (see the consumer sketch below)
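
The consuming side of this process can be sketched with the Java consumer below. The broker address, topic, and group id are illustrative assumptions; offsets are committed manually to show that consumption progress is controlled by the consumer.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "page-view-processors");       // assumed consumer group
        props.put("enable.auto.commit", "false");             // the consumer owns its offsets
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("page-views"));
            while (true) {
                // Pull a batch of messages; the consumer decides when and how much to pull
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            r.partition(), r.offset(), r.value());
                }
                // Commit offsets only after processing, so progress is tracked by the consumer
                consumer.commitSync();
            }
        }
    }
}
```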

Kafka design

1. Throughput

High throughput is one of the core goals of Kafka. Kafka does the following:

Data disk persistence: Messages are directly written to disks instead of being cached in memory, making full use of the sequential read and write performance of disks

Zero-copy: Reduces I/O operations

Batch sending of data (see the producer config sketch after this list)

Data compression

A topic is divided into multiple partitions to improve parallelism
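
The batching and compression points above map directly onto producer configuration. The following is a hedged sketch of settings that trade a little latency for throughput; the exact values are illustrative, not recommendations from the original text.

```java
import java.util.Properties;

public class ThroughputProducerConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Batch sending: accumulate up to 64 KB per partition, or wait up to 20 ms,
        // before sending a request (illustrative values)
        props.put("batch.size", "65536");
        props.put("linger.ms", "20");

        // Data compression: whole batches are compressed before going over the wire
        props.put("compression.type", "lz4");

        // Bound the memory used to buffer unsent batches (illustrative value)
        props.put("buffer.memory", "67108864");
        return props;
    }
}
```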

2. Load balancing

The producer sends messages to a specific partition based on an algorithm specified by the user (see the custom partitioner sketch after this list)

There are multiple partitions, and each partition has its own replicas, which are distributed across different broker nodes

For each partition, a leader must be elected from its replicas; the leader is responsible for reads and writes, and ZooKeeper is responsible for failover

Manage the dynamic joining and leaving of brokers and consumers through ZooKeeper
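
The "algorithm specified by the user" in the first point corresponds to the producer's Partitioner interface. Below is a minimal sketch that hashes the key across the topic's partitions; Kafka's own default partitioner already behaves similarly (using murmur2), so this is only meant to show the hook, and the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class KeyHashPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            // No key: fall back to a crude spread across partitions
            return (int) (System.nanoTime() % numPartitions);
        }
        // Keyed messages: the same key always lands in the same partition
        // (Kafka's default uses murmur2; a plain array hash is enough for this sketch)
        return (Arrays.hashCode(keyBytes) & Integer.MAX_VALUE) % numPartitions;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}
```

A producer would pick it up via the `partitioner.class` configuration property, e.g. `props.put("partitioner.class", KeyHashPartitioner.class.getName())`.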

3. Pull system

Because the Kafka broker persists data and is under no memory pressure, consumers are well suited to pulling data, which has the following benefits:

Simplify Kafka design

Consumers autonomously control the pull rate according to their consumption capacity

Consumers choose their own consumption mode according to their needs, such as batch consumption, repeated consumption, or consuming from the beginning (see the sketch after this list)
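
A hedged sketch of the last two points: `max.poll.records` caps how many messages come back per pull, and `seekToBeginning` re-consumes a topic from the start. Topic, group id, and broker address are illustrative, and the group-join handling is simplified.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "replay-etl");                 // assumed group id
        props.put("max.poll.records", "500");                // batch size per pull (illustrative)
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("page-views"));
            // Simplified: in practice you may need to poll until partitions are assigned
            consumer.poll(Duration.ofMillis(0));
            // Repeated consumption: rewind every assigned partition to its first offset
            consumer.seekToBeginning(consumer.assignment());
            ConsumerRecords<String, String> batch = consumer.poll(Duration.ofSeconds(1));
            System.out.println("replayed " + batch.count() + " messages");
        }
    }
}
```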

4. Scalability

When a broker node needs to be added, the new broker registers itself with ZooKeeper, and producers and consumers detect the change through the watchers they have registered with ZooKeeper and adjust accordingly.
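
A minimal sketch of that watcher mechanism using the plain ZooKeeper client: `/brokers/ids` is the path under which Kafka brokers register ephemeral nodes, while the connection string and the re-registration logic here are illustrative assumptions rather than Kafka's own client code.

```java
import java.util.List;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class BrokerWatcher {
    public static void main(String[] args) throws Exception {
        // Assumed ZooKeeper address; brokers appear and disappear under /brokers/ids
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> {});
        watchBrokers(zk);
        Thread.sleep(Long.MAX_VALUE);   // keep the process alive so watches can fire
    }

    static void watchBrokers(ZooKeeper zk) throws Exception {
        Watcher onChange = (WatchedEvent event) -> {
            try {
                watchBrokers(zk);       // re-register the watch and print the new broker list
            } catch (Exception e) {
                e.printStackTrace();
            }
        };
        List<String> ids = zk.getChildren("/brokers/ids", onChange);
        System.out.println("live brokers: " + ids);
    }
}
```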

Application scenarios of Kafka:

1. Message queues

Better throughput, built-in partitioning, replication, and fault tolerance than most messaging systems make Kafka a good solution for large-scale messaging applications. Messaging workloads typically need relatively modest throughput but much lower end-to-end latency, and they benefit from the strong persistence guarantees Kafka provides. In this area Kafka is comparable to traditional messaging systems such as ActiveMQ or RabbitMQ.

2. Behavioral tracking

Another use of Kafka is to track user behavior: browsing, searches, and other actions are published to topics in a publish-subscribe fashion. Subscribers can then pick up these records for further real-time processing or real-time monitoring, or load them into Hadoop or an offline data warehouse for processing.

3. Meta information monitoring

Kafka can be used as a channel for operational records, that is, for collecting and recording operational information; this can be thought of as data monitoring for operations and maintenance purposes.

4. Collect logs

In terms of log collection there are many open-source products, including Scribe and Apache Flume, and many people use Kafka for log aggregation instead. Log aggregation typically collects log files from servers and puts them in a centralized location (a file server or HDFS) for processing. Kafka, however, abstracts away the details of files and treats logs more cleanly as a stream of messages or events. This allows for lower processing latency and makes it easier to support multiple data sources and distributed consumption. Compared with log-centric systems such as Scribe or Flume, Kafka offers equally high performance, stronger durability guarantees thanks to replication, and lower end-to-end latency.

5. Stream processing

Scenarios of this kind are numerous and easy to understand: collected stream data is saved for subsequent processing by Storm or another stream-computing framework. Many users stage, aggregate, enrich, or otherwise move data from an original topic into a new topic for further processing. For example, an article-recommendation pipeline might grab article content from an RSS feed and drop it into a topic called "articles"; follow-up steps might clean the content, for instance normalizing the data or removing duplicates, before the matching results are finally returned to the user. This creates a chain of real-time data-processing stages beyond the original topic. Storm and Samza are well-known frameworks that implement this type of data transformation.

6. Event sourcing

Event sourcing is an application design style in which state changes are logged as a time-ordered sequence of records. Kafka can store very large amounts of log data, which makes it an excellent backend for applications built this way, such as a news feed.

7. Commit log

Kafka can serve as an external commit log for a distributed system. Such a log helps replicate data between nodes and acts as a resynchronization mechanism for restoring data on failed nodes. The log compaction feature in Kafka supports this usage. In this usage, Kafka is similar to the Apache BookKeeper project.

Kafka design requirements

1. Directly use the Linux file system page cache to cache data efficiently.

2. Use Linux zero-copy to improve sending performance. Traditional data sending requires four context switches; after switching to the sendfile system call, data is exchanged directly in kernel space, and the number of context switches drops to two. According to test results, data-sending performance can improve by about 60%. Technical details of zero-copy can be found at www.ibm.com/developerwo…
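
The zero-copy path can be illustrated with Java NIO's FileChannel.transferTo, which maps to sendfile on Linux. This is a sketch of the general mechanism, not Kafka's own code; the file name and destination port are illustrative.

```java
import java.io.FileInputStream;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public class ZeroCopySend {
    public static void main(String[] args) throws Exception {
        try (FileChannel file = new FileInputStream("segment.log").getChannel();   // illustrative file
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9000))) {
            long position = 0;
            long remaining = file.size();
            while (remaining > 0) {
                // transferTo hands the bytes to the socket inside the kernel (sendfile),
                // avoiding the copy into and back out of user space
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```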

3. The cost of accessing data on disk is O(1). Kafka uses topics to manage messages. Each topic contains multiple partitions, and each partition corresponds to a logical log made up of multiple segments, each of which stores multiple messages. A message's ID is determined by its logical position, so the ID can be used to locate the message's storage position directly, avoiding an extra ID-to-location mapping. Each partition keeps an index in memory that records the offset of the first message in each segment. A message sent by a publisher to a topic is distributed evenly across its partitions (randomly or according to a user-specified callback function). The broker receives a published message and appends it to the last segment of the corresponding partition. When the number of messages in a segment reaches a configured value, or when the time since publication exceeds a threshold, the messages are flushed to disk; only messages that have been flushed to disk can be seen by subscribers. When a segment reaches a certain size, no more data is written to it, and the broker creates a new segment.
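
A simplified sketch of the lookup just described (not Kafka's actual implementation): keep the first offset of each segment in a sorted map, find the segment whose base offset is the largest one not exceeding the requested message ID, then read sequentially within that segment. The segment file names follow Kafka's convention of naming files after their base offset, but the class itself is hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

public class SegmentIndex {
    // Maps a segment's base offset (the ID of its first message) to the segment file name
    private final TreeMap<Long, String> segments = new TreeMap<>();

    public void addSegment(long baseOffset, String fileName) {
        segments.put(baseOffset, fileName);
    }

    /** Find the segment that contains the given message ID, or null if it precedes all segments. */
    public String segmentFor(long messageId) {
        Map.Entry<Long, String> entry = segments.floorEntry(messageId);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        SegmentIndex index = new SegmentIndex();
        index.addSegment(0L, "00000000000000000000.log");
        index.addSegment(1000L, "00000000000000001000.log");
        index.addSegment(2000L, "00000000000000002000.log");
        // Message 1500 lives in the segment whose first message has offset 1000
        System.out.println(index.segmentFor(1500L));   // prints 00000000000000001000.log
    }
}
```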

4. Explicit distribution: all producers, brokers, and consumers are distributed. There is no load-balancing mechanism between producers and brokers; ZooKeeper is used for load balancing between brokers and consumers. All brokers and consumers are registered with ZooKeeper, which keeps some of their metadata, and when a broker or consumer changes, all other brokers and consumers are notified.
