Today I am starting to learn about a new piece of middleware called Kafka, and I will write a series of articles about it. This article covers the following three aspects:
- What is Kafka and how does Kafka work?
- Describe Kafka's end-to-end data flow and overall architecture
- Explain the specific meaning of Kafka's core terms
What is Kafka?
Kafka is officially defined as a distributed stream processing platform, and a stream processing platform should have the following three features:
- It lets you publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
- It stores streams of records durably, with good fault tolerance.
- It lets you process streams of records as they occur.
But I prefer to define Kafka as an open-source messaging engine system, often called messaging middleware. According to Wikipedia, a messaging engine system is a set of specifications that enterprises use to deliver semantically accurate messages between different systems, enabling loosely coupled, asynchronous data transfer.
In simple terms, system A sends a message to the messaging engine, and system B reads the message sent by system A from the messaging engine (as shown below).
By introducing a messaging engine, upstream and downstream services are effectively decoupled, and bursts of unexpected traffic can be absorbed (so-called "peak shaving and valley filling"), preventing downstream services from being brought down. Peak shaving and valley filling means buffering instantaneous bursts of upstream traffic so that the flow downstream stays smooth. This matters especially when the upstream can send far faster than the downstream can process: without the messaging engine as a buffer, a sudden surge of traffic could overwhelm the slower downstream service and trigger a cascading failure across the whole call chain. Given this design, Kafka is used in two broad categories of scenarios:
- Building real-time streaming data pipelines that reliably move data between systems or applications, much like a message queue.
- Building real-time streaming applications that transform or react to streams of data, i.e. stream processing.
Kafka’s macro architecture
Now that we know what Kafka is, let's take a look at Kafka's overall data flow (see figure below).
To avoid single points of failure and to improve performance and availability, Kafka is deployed as a cluster. Each machine in the cluster is called a broker; the party that sends messages to the cluster is called a producer, and the party that reads messages from the cluster is called a consumer.
Next, a messaging engine needs to transfer data between different systems, so designing an unambiguous data format is particularly important. There are many common options, such as CSV, XML, JSON, and serialization frameworks (Google's Protocol Buffers, Facebook's Thrift), but Kafka simply uses raw byte sequences as the data format for message transmission.
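Because Kafka only sees opaque byte sequences, serialization is left entirely to the client. A minimal sketch (plain Python, no Kafka client) of encoding a JSON payload to bytes and decoding it back:

```python
import json

def serialize(event: dict) -> bytes:
    # Kafka stores and transmits raw bytes; the application
    # chooses the encoding (JSON over UTF-8 here).
    return json.dumps(event).encode("utf-8")

def deserialize(payload: bytes) -> dict:
    return json.loads(payload.decode("utf-8"))

msg = serialize({"order_id": 42, "status": "paid"})
assert isinstance(msg, bytes)
assert deserialize(msg) == {"order_id": 42, "status": "paid"}
```

The same approach works with any format: producer and consumer just have to agree on how to interpret the bytes.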
In addition to the data format, a specific transport model is needed. The following two models are common, and Kafka supports both:
- Point-to-point model: Also called the message queue model, where a message sent by system A to the messaging engine can be received only by system B.
- Publish/subscribe model: This model introduces the concept of a topic; the sender is called a publisher and the receiver is called a subscriber. Multiple publishers can publish to the same topic at the same time, and multiple subscribers can each receive that topic's data. A daily newspaper subscription is a familiar example of this model.
Also, when designing a distributed system, we need to consider factors such as availability, reliability, scalability, consistency, and high performance. Let's look at how Kafka addresses each of these:
- Availability: Besides the high availability that clustering provides, Kafka offers a replication mechanism. A redundant copy in Kafka is called a replica; replicas are divided into leader replicas and follower replicas. The leader replica serves read and write requests, while follower replicas only synchronize data from the leader and do not serve client requests.
- Reliability: Data reliability is guaranteed by mechanisms such as replicas, ACKs, and the ISR (more on ACKs and the ISR in a later article).
- Scalability: When the request load grows beyond what the current cluster can handle, machines can be added to the cluster horizontally to spread the load, with no changes required on the client side.
- High performance: Kafka uses a partitioning mechanism together with techniques such as zero copy, sequential disk writes, a compact log format, and the page cache (more on these later). The partitioning mechanism splits each topic into partitions: a topic is a logical concept, while a partition is a physical one, so producers and consumers actually produce to and consume from partitions.
- Consistency: Data consistency is ensured through mechanisms such as the high watermark and the leader epoch.
Kafka terminology
Based on Kafka's overall macro architecture, we can summarize its main terms as follows:
- Topic: The objects that producers publish to and consumers subscribe to are called topics; we can create a separate topic for each business domain or application.
- Producer: A client that sends messages to a Kafka cluster is called a producer. A producer can send messages to one or more topics.
- Consumer: A client that reads messages from a Kafka cluster is called a consumer. A consumer can subscribe to data from multiple topics simultaneously.
- Broker: Each node in a Kafka cluster is called a broker. More precisely, a Kafka process started on a machine is a broker; it is responsible for receiving and processing client requests and for persisting messages.
- Partition: A topic is divided into partitions. Each partition holds an ordered log of messages, and every message a producer sends goes to exactly one partition. Scalability is achieved by spreading a topic's partitions across different brokers, which avoids a single broker having to hold more data than it can accommodate.
- Replica: Replicas are redundant copies of a partition's data, used to achieve high availability in Kafka. They are divided into leader replicas and follower replicas.
- Leader replica: Serves read and write requests from clients.
- Follower replica: Sends fetch requests to the leader replica to synchronize its data; it does not serve client requests.
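Tying these terms together, here is a toy in-memory model (an illustration only, not Kafka's implementation) of a topic made of partitions, where each partition is an append-only log with its own monotonically increasing offsets:

```python
class Partition:
    """An append-only, ordered log; each message gets a monotonically
    increasing offset within this partition."""
    def __init__(self):
        self.log = []

    def append(self, msg: bytes) -> int:
        self.log.append(msg)
        return len(self.log) - 1  # offset of the appended message

class Topic:
    """A logical name over a set of physical partitions."""
    def __init__(self, name: str, num_partitions: int):
        self.name = name
        self.partitions = [Partition() for _ in range(num_partitions)]

    def produce(self, partition: int, msg: bytes) -> int:
        return self.partitions[partition].append(msg)

    def consume(self, partition: int, offset: int):
        log = self.partitions[partition].log
        return log[offset] if offset < len(log) else None

orders = Topic("orders", num_partitions=3)
assert orders.produce(0, b"m1") == 0   # first offset in partition 0
assert orders.produce(0, b"m2") == 1   # offsets grow per partition
assert orders.produce(1, b"m3") == 0   # each partition counts independently
assert orders.consume(0, 1) == b"m2"
assert orders.consume(2, 0) is None    # partition 2 is still empty
```

Note that offsets are per partition, not per topic: ordering is guaranteed only within a single partition, which is exactly why keyed messages are routed to a fixed partition.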