These are the blogger's study notes. The content comes from the Internet; if anything infringes your rights, please contact me and I will delete it.

Personal Note: github.com/dbses/TechN…

01 | Introduction to Kafka

What is Kafka?

Apache Kafka is an open-source messaging engine system (Messaging System). Just as a mechanical engine transmits power, a messaging engine transmits messages.

What is Kafka for?

According to Wikipedia, a messaging engine system is a set of specifications. Enterprises use this set of specifications to deliver semantically accurate messages between different systems, enabling loosely-coupled asynchronous data transfer.

In plain English, system A sends a message to the messaging engine system, and system B reads the message sent by system A from the messaging engine system.

How is the message transmission format designed?

Kafka uses a pure binary sequence of bytes. Messages are still structured, but they are converted to a binary byte sequence before being transmitted.
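As a minimal sketch of "structured, but sent as bytes", the following converts a record to bytes and back. JSON with UTF-8 is an assumption chosen for readability; Kafka itself only sees the bytes and does not prescribe this encoding. The function names and the sample record are hypothetical.

```python
import json


def serialize(record: dict) -> bytes:
    """Turn a structured record into a byte sequence (JSON + UTF-8 here)."""
    return json.dumps(record, sort_keys=True).encode("utf-8")


def deserialize(payload: bytes) -> dict:
    """Recover the structured record from the bytes on the receiving side."""
    return json.loads(payload.decode("utf-8"))


order = {"order_id": 42, "amount": 19.9}  # hypothetical message
payload = serialize(order)                # what would travel over the wire
restored = deserialize(payload)           # what the reader reconstructs
```

Both sides only need to agree on the encoding; the messaging engine in between can treat the payload as opaque bytes.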

How are messages delivered?

There are two common methods:

  • Point-to-point model: a message sent by system A can be received only by system B; no other system can read it.
  • Publish/subscribe model: this model introduces the concept of a topic, which you can think of as a logical container for messages of the same kind. The sender is called the publisher and the receiver is called the subscriber. Unlike the point-to-point model, multiple publishers may send messages to the same topic, and multiple subscribers may all receive the messages of that topic. A daily newspaper subscription is a typical example of publish/subscribe.
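The difference between the two models above can be sketched with two tiny in-memory classes. This is an illustration only, not how Kafka implements either model; the class names `PointToPointQueue` and `Topic` and their methods are hypothetical.

```python
from collections import deque


class PointToPointQueue:
    """Each message is delivered to exactly one receiver, then it is gone."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        return self._messages.popleft() if self._messages else None


class Topic:
    """Every subscriber gets its own copy of every published message."""

    def __init__(self):
        self._inboxes = {}

    def subscribe(self, name):
        self._inboxes[name] = []

    def publish(self, message):
        for inbox in self._inboxes.values():
            inbox.append(message)

    def inbox(self, name):
        return list(self._inboxes[name])


queue = PointToPointQueue()
queue.send("order-1")
first_read = queue.receive()   # the single receiver gets the message
second_read = queue.receive()  # nothing is left for anyone else

news = Topic()
news.subscribe("alice")
news.subscribe("bob")
news.publish("daily paper")    # both subscribers receive it
```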

Why not send messages directly instead of sending them through a messaging engine?

Peak shaving and valley filling (smoothing out traffic bursts), decoupling, and asynchronous processing. This is where messaging engines like Kafka make the most sense.

Take an upstream order system as an example. Kafka can store a sudden surge of order traffic as messages in the corresponding topic, which avoids hurting the upstream service's TPS while giving downstream services enough time to consume the messages.
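The peak-shaving effect can be sketched as a small simulation: the upstream side writes bursts into a buffer, and the downstream side drains it at a fixed rate. The function `simulate`, the tick-based timing, and the numbers are all assumptions made for illustration.

```python
from collections import deque


def simulate(arrivals_per_tick, consume_rate):
    """Return the backlog size after each tick.

    arrivals_per_tick: number of orders the upstream system produces per tick.
    consume_rate: maximum number of messages downstream can handle per tick.
    """
    buffer = deque()
    backlog = []
    next_id = 0
    for arrivals in arrivals_per_tick:
        # Upstream writes at its own pace; it never waits for downstream.
        for _ in range(arrivals):
            buffer.append(next_id)
            next_id += 1
        # Downstream drains at most consume_rate messages per tick.
        for _ in range(min(consume_rate, len(buffer))):
            buffer.popleft()
        backlog.append(len(buffer))
    return backlog


# A burst of 10 orders in the second tick, followed by quiet ticks.
backlog = simulate([2, 10, 1, 0, 0, 0], consume_rate=3)
```

The backlog grows during the burst and shrinks back to zero in the quiet ticks, while the upstream side is never blocked.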


02 | Kafka terminology

  • Record: a message. Kafka is a message engine. Messages are the main objects that Kafka processes.
  • Topic: Topics are logical containers that carry messages and are used to distinguish specific businesses in practice.
  • Broker: A Kafka cluster consists of brokers that receive and process requests sent from clients and persist messages.
  • Replication: the backup mechanism. The same data is copied to multiple machines, and each copy is called a replica.
  • Replica: a copy. In Kafka, the same message can be copied to multiple places to provide data redundancy; each such place is called a replica. Replicas are divided into leader replicas and follower replicas, which play different roles. Replicas are defined at the partition level: each partition can be configured with multiple replicas for high availability.
  • Partition: an ordered, immutable sequence of messages. Each topic can have multiple partitions, and each message produced by a producer is sent to only one partition.
  • Offset: the position of a message within its partition. It increases monotonically and never changes once assigned.
  • Producer: an application that publishes new messages to a topic.
  • Consumer: an application that subscribes to topics and reads new messages from them.
  • Consumer Offset: a marker of how far a consumer has consumed. Each consumer has its own consumer offset.
  • Consumer Group: a group of consumer instances that consume multiple partitions at the same time to achieve high throughput. Consumer groups let Kafka implement both the point-to-point and publish/subscribe models.
  • Rebalance: the process by which, after a consumer instance in a consumer group fails, the remaining instances automatically redistribute the subscribed topic partitions among themselves. Rebalance is an important means of achieving high availability on the Kafka consumer side.
  • Log: Kafka stores data in message logs. A log is a physical file on disk that supports only append operations.
  • Log Segment: Kafka periodically deletes messages to reclaim disk space. It does so through the log segment mechanism.
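To make the consumer group and rebalance terms concrete, here is a minimal sketch that spreads partitions over the live members of a group and recomputes the assignment when one member fails. Round-robin assignment and the `assign` function are assumptions for illustration; this is not Kafka's actual group coordination protocol, and the sketch assumes at least one live consumer.

```python
def assign(partitions, consumers):
    """Round-robin partitions over the live consumers of one group."""
    assignment = {consumer: [] for consumer in consumers}
    ordered = sorted(consumers)
    for index, partition in enumerate(sorted(partitions)):
        owner = ordered[index % len(ordered)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2, 3, 4, 5]

# Three healthy instances share the six partitions.
before = assign(partitions, ["c1", "c2", "c3"])

# c2 fails: the group rebalances over the remaining instances,
# so every partition still has exactly one owner.
after = assign(partitions, ["c1", "c3"])
```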

How Kafka achieves high availability

  1. Brokers: when one broker goes down, brokers on other machines can continue to provide service.

  2. Replication: Kafka defines two types of replicas, the leader replica and the follower replica. The former serves clients, that is, it interacts with client programs; the latter only passively follows the leader replica and does not interact with the outside world.

    Producers always write messages to the leader replica, and consumers always read messages from the leader replica. A follower replica does only one thing: it sends requests to the leader replica asking for the most recently produced messages, so that it can stay in sync with the leader.

    The replication mechanism ensures that data is persisted, in other words that messages are not lost.

  3. The rebalance mechanism.
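The leader/follower behaviour described in point 2 can be sketched as follows: producers append to the leader, and a follower pulls everything beyond its own log end. The classes `Replica`, `Leader`, and `Follower` are hypothetical, and the sketch ignores networking, acknowledgements, and leader election.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.log = []  # messages in offset order

    def log_end_offset(self):
        return len(self.log)  # offset the next message will receive


class Leader(Replica):
    def append(self, message):
        """Producers write here; returns the offset of the new message."""
        self.log.append(message)
        return len(self.log) - 1

    def fetch(self, from_offset):
        """Serve a follower's fetch request starting at from_offset."""
        return self.log[from_offset:]


class Follower(Replica):
    def sync(self, leader):
        """Pull whatever the leader has beyond our own log end."""
        self.log.extend(leader.fetch(self.log_end_offset()))


leader = Leader("broker-1")
follower = Follower("broker-2")
for message in ["a", "b", "c"]:
    leader.append(message)
follower.sync(leader)

leader.append("d")
lag = leader.log_end_offset() - follower.log_end_offset()  # follower is behind
follower.sync(leader)                                      # and catches up
```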

How Kafka achieves scalability

What if the leader replica has accumulated too much data to fit into a single Broker machine?

Kafka divides data into multiple portions and stores them among different brokers, a mechanism known as Partitioning.
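One common way to decide which partition a message goes to is to hash its key. The sketch below uses `zlib.crc32` purely as a stand-in hash, which is an assumption and not the hash Kafka's clients use; `choose_partition` and `spread` are hypothetical helper names.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to one partition, deterministically."""
    return zlib.crc32(key) % num_partitions


def spread(keys, num_partitions):
    """Group keys by the partition they land on."""
    layout = {p: [] for p in range(num_partitions)}
    for key in keys:
        layout[choose_partition(key, num_partitions)].append(key)
    return layout


keys = [f"order-{i}".encode() for i in range(12)]
layout = spread(keys, num_partitions=3)  # each partition can live on a different broker
```

Because the mapping depends only on the key, messages with the same key always land in the same partition.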

How Kafka achieves high throughput

  1. Kafka uses message logs to store data. A log is a physical file that supports only append operations, which replaces slow random I/O with sequential I/O.
  2. Consumer groups (explained later).

How Kafka reclaims disk space

In Kafka, logs are used to store data. A log is further subdivided into multiple log segments, and messages are appended to the current (newest) segment. When the current segment is full, Kafka automatically rolls over to a new segment and seals the old one. Kafka also runs a scheduled background task that periodically checks whether old log segments can be deleted in order to reclaim disk space.
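The append, roll, and delete cycle can be sketched with an in-memory log. This is a simplified model under stated assumptions: segment size is counted in messages and cleanup keeps a fixed number of segments, whereas a real deployment would decide by configured size or age. `SegmentedLog` and its methods are hypothetical names.

```python
class SegmentedLog:
    """Append-only log split into segments (sizes counted in messages)."""

    def __init__(self, max_segment_messages):
        self.max_segment_messages = max_segment_messages
        self.segments = [{"base_offset": 0, "messages": []}]
        self.next_offset = 0

    def append(self, message):
        active = self.segments[-1]
        if len(active["messages"]) >= self.max_segment_messages:
            # Active segment is full: roll a new one and seal the old one.
            active = {"base_offset": self.next_offset, "messages": []}
            self.segments.append(active)
        active["messages"].append(message)
        offset = self.next_offset
        self.next_offset += 1
        return offset

    def delete_old_segments(self, keep_segments):
        """Background cleanup: drop the oldest segments, never the active one."""
        removable = max(0, len(self.segments) - max(1, keep_segments))
        deleted = self.segments[:removable]
        self.segments = self.segments[removable:]
        return [segment["base_offset"] for segment in deleted]


log = SegmentedLog(max_segment_messages=3)
offsets = [log.append(f"m{i}") for i in range(10)]  # creates four segments
deleted = log.delete_old_segments(keep_segments=2)  # reclaims the two oldest
```

Deleting whole segments is cheap because it never touches the segment that is currently being appended to.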

Relationship between partitions and replicas

Each partition can be configured with N replicas: exactly one leader replica and N-1 follower replicas.

The producer writes messages to a partition, and the position of each message in the partition is represented by a number called the offset. Partition offsets always start at 0. If a producer writes 10 messages to an empty partition, their offsets are 0, 1, 2, …, 9.

To summarize Kafka’s three-tier messaging architecture:

  • The first layer is the topic layer: each topic can be configured with M partitions, and each partition can be configured with N replicas.
  • The second layer is the partition layer: only one of the N replicas of each partition acts as the leader and serves clients. The other N-1 replicas are followers that exist only for data redundancy.
  • The third layer is the message layer: each partition contains a number of messages, whose offsets start at 0 and increase one by one.
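The three layers above map naturally onto a nested data structure. The following is a minimal sketch of that shape only; the dataclasses, the `create_topic` helper, and the rule used to pick a leader are all assumptions for illustration, not Kafka's metadata model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Replica:
    broker_id: int
    is_leader: bool
    messages: List[str] = field(default_factory=list)  # list index == offset


@dataclass
class Partition:
    index: int
    replicas: List[Replica]

    def leader(self) -> Replica:
        return next(r for r in self.replicas if r.is_leader)


@dataclass
class Topic:
    name: str
    partitions: List[Partition]


def create_topic(name: str, m: int, n: int) -> Topic:
    """Build a topic with m partitions and n replicas per partition."""
    partitions = []
    for p in range(m):
        # Exactly one leader per partition; replica (p % n) is chosen here
        # only so that leadership is spread across brokers.
        replicas = [Replica(broker_id=b, is_leader=(b == p % n)) for b in range(n)]
        partitions.append(Partition(index=p, replicas=replicas))
    return Topic(name=name, partitions=partitions)


topic = create_topic("orders", m=3, n=2)
topic.partitions[0].leader().messages.extend(["first", "second"])  # offsets 0 and 1
```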

Conclusion

The above can be illustrated with a picture.

03 | Is Kafka just a messaging engine system?

Apache Kafka is both a messaging engine and a distributed streaming platform.

How Kafka got its name

Jay Kreps took a lot of literature classes in college and loved Franz Kafka, so he named the system after the writer.

The birth of Kafka

Kafka started as a project incubated in-house at LinkedIn, which had a strong need to process data in real time. The main problems LinkedIn ran into were the following two.

  • Insufficient data correctness

    Data was collected mainly by polling, so choosing the polling interval became a matter of experience and guesswork.

  • The system is highly customized

    High maintenance costs.

The Kafka messaging engine system was designed to address these issues and to provide three features:

  • Provide a set of APIs for producers and consumers;
  • Reduce network transmission and disk storage overhead;
  • Implement a highly scalable architecture.

Is Kafka just a messaging engine?

Apache Kafka is both a messaging engine and a distributed streaming platform.

Kafka plays an important role in connecting upstream and downstream data streams: almost all data flows from one system into Kafka and then downstream to another system.

Such usage became so common that it prompted the Kafka community to ask: instead of only passing data from one system to the next for processing, why not build a stream processing framework ourselves? With this in mind, the community released Kafka Streams in version 0.10.0.0, which officially turned Kafka into a distributed streaming platform rather than just a messaging engine.

What are the advantages of Kafka as a stream processing platform?

  • First, end-to-end correctness is easier to achieve.

    Kafka implements end-to-end exactly-once processing semantics.

  • Second, Kafka has its own positioning in stream computing.

    The official website clearly describes Kafka Streams as a client library for building real-time stream processing applications rather than a fully featured system. This means you cannot expect out-of-the-box operational features such as cluster scheduling or elastic deployment. You need to choose suitable tools or systems to provide these features for your Kafka stream processing applications.

    The stream processing platform of a large company is bound to be deployed at scale, so cluster scheduling and flexible deployment are indispensable. But there are also many small and medium-sized enterprises whose stream processing volumes are modest and whose logic is simple; a handful or a dozen machines are enough for them. For such needs, building a heavyweight, full-featured platform is overkill, and this is exactly where Kafka's stream processing component fits. In this sense, Kafka should have a place among the stream processing frameworks of the future.

Kafka as distributed storage

Jay Kreps, one of the authors of Kafka, wrote an article explaining why Kafka can be used as distributed storage, but in practice few people use it that way.