
[TOC]

0 Foreword

This article is a little long, but it is written in plain language; read through it slowly and it should be fairly easy to follow. It moves from a general introduction to Kafka, to a comparison with surrounding products, to the relationship between Kafka and Zookeeper, and then to a closer look at Kafka's features, including its partitions, replicas, and consumer groups and their application scenarios.

1 Introduction

Apache Kafka is a distributed stream processing platform.

  • Publish and subscribe to streams of records, similar to a message system; with strong concurrency, a cluster can act as a data bus, making distributed reads and writes of streaming records easy
  • Store streams of records of massive scale in a highly fault-tolerant manner
  • Process streams of records as soon as they are produced

A Kafka cluster exchanges data with its surroundings through several kinds of inputs and outputs: Kafka Producer, Kafka Connect Source, Kafka Streams/KSQL, Kafka Consumer, and Kafka Connect Sink. Kafka provides four core API groups (in addition to the Kafka AdminClient API) to handle data input and output:

  • Producer API: allows an application to publish a stream of records to one or more Kafka topics (a minimal sketch follows this list)
  • Consumer API: allows an application to subscribe to one or more topics and process the stream of records delivered to it
  • Streams API: allows an application to act as a stream processor, consuming input streams from one or more topics and producing an output stream to one or more topics, transforming data efficiently between input and output streams
  • Connector API: allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems; for example, a connector to a relational database might capture every change to a table
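
To make the Producer API concrete, here is a minimal Java sketch that publishes a few records to a topic. It is only an illustration: the broker address and topic name are placeholders, and it assumes the official `kafka-clients` library is on the classpath.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Broker(s) used to bootstrap cluster metadata (placeholder address)
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                // Records with the same key always land in the same partition
                producer.send(new ProducerRecord<>("my-topic", "key-" + i, "value-" + i));
            }
            producer.flush(); // block until all buffered records are sent
        }
    }
}
```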

We already know what Kafka’s publish-subscribe functionality does, but what about KSQL and Kafka Streams?

First, what is stream processing? Stream processing can be thought of as the real-time processing of messages: data keeps arriving continuously, and at every moment the system maintains an up-to-date result over that data. If instead the data were processed once an hour or at even longer intervals, that would be batch analysis. Stream processing is characterized by real-time analysis. In the streaming computation model the input is continuous and can be considered unbounded in time, which means the full data set is never available for a single computation; the results are likewise emitted continuously and are unbounded in time. Hadoop, Storm, Spark Streaming, and Flink are commonly used distributed computing frameworks: Hadoop performs batch processing of non-real-time data, while Storm, Spark Streaming, and Flink process real-time streaming data. Kafka Streams is a late starter in this space.

What about KSQL?

  • KSQL is the streaming SQL engine for Apache Kafka, built on Kafka Streams; it lets you implement a stream processing task with SQL statements instead of writing large amounts of code
  • Because KSQL is built on the Kafka Streams API, it supports filters, transformations, aggregations, joins, windowing, and sessionization (capturing all stream events within a single session); a simple SQL statement starts a stream processing job that would otherwise require Java or Scala code written directly against the Streams API (see the sketch after this list)
  • KSQL statement execution is distributed, fault-tolerant, flexible, scalable, and real-time
  • Typical KSQL use cases include real-time reporting and dashboards, infrastructure and IoT device monitoring, anomaly detection, and fraud alerting
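
To see what KSQL saves you from writing, here is a hedged sketch of the kind of Kafka Streams code that a single KSQL filter statement would roughly replace. The topic names (`events`, `errors`) and the "contains ERROR" criterion are hypothetical, and the SQL in the comment is only an approximation of the equivalent KSQL.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StreamsFilterExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "ksql-equivalent-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events = builder.stream("events"); // hypothetical input topic
        // Roughly what a KSQL statement like
        //   CREATE STREAM errors AS SELECT * FROM events WHERE value LIKE '%ERROR%';
        // would express in one line of SQL
        events.filter((key, value) -> value != null && value.contains("ERROR"))
              .to("errors"); // hypothetical output topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```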

2 Introduction to related concepts

  • Broker: a Kafka cluster consists of one or more servers, each called a broker
  • Topic: every message published to a Kafka cluster belongs to a category called a topic
  • Partition: a physical concept; each topic contains one or more partitions
  • Replication: a partition can be configured with one or more replicas, so the system can keep serving without data loss when a broker fails; the number of replicas is set when the topic is created (see the AdminClient sketch after this list)
  • Segment: Kafka can store multiple topics; a topic contains one or more partitions, and each partition is physically a folder containing multiple segments. Each segment file is named after the offset of the first message it contains and ends in .log; messages published to a topic are appended sequentially to the current segment file. For every segment log file, Kafka creates an index file with the same name ending in .index
  • Offset: the position of a message within its partition, used to uniquely identify that message inside the partition
  • Producer: a message producer, responsible for publishing messages to Kafka brokers
  • Consumer: a message consumer, a client that reads messages from Kafka brokers
  • Consumer Group: each consumer belongs to a specific consumer group (you can assign a group name to each consumer; consumers without one join the default group)
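
To tie several of these concepts together (broker, topic, partition, replication), here is a small sketch that uses the Kafka AdminClient API to create a topic with a chosen partition count and replication factor. The broker address and topic name are placeholders.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 2: each partition folder is
            // duplicated on two different brokers
            NewTopic topic = new NewTopic("my-topic", 3, (short) 2);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```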

3 Kafka vs. ActiveMQ, ZeroMQ, RabbitMQ, RocketMQ, and Redis

This section is mainly a broad comparison of message-oriented middleware, so only the relevant concepts are described here. A more detailed comparison can be found at https://juejin.cn/post/6844903626171760653.

3.1 Message queues, point-to-point, and PUB/SUB

Before we begin, we also need to know a little about JMS (Java Message Service), an API for message-oriented middleware (MOM) on the Java platform. JMS supports two messaging models: P2P and publish/subscribe. Which middleware products support JMS is described later.

Message queues have two messaging models: point-to-point (PTP) and publish/subscribe (PUB/SUB).

Point-to-point is, as the name suggests, a queue: messages flow one-to-one, and a message is consumed by exactly one consumer, after which it no longer exists in the queue. It is like a postman delivering a letter: the letter cannot be duplicated, and delivery into the designated hands is guaranteed (a guarantee provided by the framework).

PUB/SUB (publish/subscribe) is different: it supports many-to-one, one-to-one, and one-to-many delivery. It is like a journal or newspaper: once an issue is published, it can be delivered to many different readers, and readers can also fetch issues from earlier dates (a capability provided by the framework).

  • RabbitMQ: a message broker that supports many messaging protocols, such as AMQP, MQTT 3.1, XMPP, SMTP, STOMP, HTTP, and WebSockets. It is developed in Erlang, a language with inherently high concurrency, and is used for real-time messaging with high reliability requirements. It supports both point-to-point queues and PUB/SUB, but it does not fully support all JMS features (https://www.rabbitmq.com/jms-client.html#limitations)
  • Redis: best known as an in-memory database, but it can also be used for point-to-point queues and PUB/SUB. Because messages sit in memory buffers, there is a good chance a message is lost if a consumer drops its connection to the queue. In addition, implementing point-to-point queue semantics requires at least three queues (a primary queue, a work queue, and a rejected queue), so the implementation is somewhat involved
  • Apache RocketMQ: a high-performance, high-throughput distributed message middleware open-sourced by Alibaba. PUB/SUB is its basic model, and it supports message priorities, ordering guarantees, message filtering, and at-least-once delivery. RocketMQ's cluster consumption mode is roughly equivalent to the PTP model: consumers within a single Consumer Group share the messages evenly, which amounts to point-to-point delivery with the group as the receiving unit
  • Apache ActiveMQ: supports point-to-point and PUB/SUB, many cross-language clients and protocols, easy-to-use enterprise integration patterns, and many advanced features, while fully supporting JMS 1.1 and J2EE 1.4
  • ZeroMQ: implemented in C++; high performance and a lightweight footprint are its hallmarks. ZeroMQ is strictly neither at-least-once nor at-most-once: its PUB/SUB mode has a message acknowledgement and retransmission mechanism but does not persist messages, so messages can be lost to memory exhaustion or a process crash, while retransmission can deliver a message anywhere from 1 to N times. In enterprise web services, and microservices in particular, ZeroMQ is rarely the choice
  • Kafka: primarily a publish/subscribe system; combined with Kafka Streams, it is also a stream processing system

3.2 About Persistence

  • ZeroMQ: supports memory and disk, but not database persistence
  • Kafka: supports memory and disk (disk is primary) as well as database persistence, and can accumulate very large volumes of data
  • RabbitMQ: supports memory, disk, and data accumulation, but accumulated backlogs reduce production throughput
  • ActiveMQ: supports memory, disk, and database persistence
  • RocketMQ: all messages are persisted: they are first written to the system pagecache (memory) and then flushed to disk, so both memory and disk hold a copy, and reads are served directly from memory

3.3 About Throughput

  • RabbitMQ: focuses on reliable message delivery; it supports transactions but not batch operations; depending on reliability requirements, storage can be in memory or on disk
  • Kafka: high throughput; internally it batches messages and uses a zero-copy mechanism, and data storage and retrieval are sequential batch operations on local disk with O(1) complexity, so message handling is very efficient
  • ZeroMQ: also offers high throughput
  • RocketMQ: higher throughput than RabbitMQ, but not as high as Kafka
  • ActiveMQ: weaker than RabbitMQ

3.4 About Clusters

  • Kafka: a natural 'Leader-Slave' stateless cluster in which every server is both master and slave: each broker leads some partitions and follows others
  • ZeroMQ: decentralized; does not support clustering
  • RabbitMQ: supports simple clusters
  • RocketMQ: supports clusters, commonly deployed as multiple 'master-slave' pairs
  • ActiveMQ: supports simple cluster modes such as 'active-standby', but not advanced cluster modes

3.5 About Load Balancing

  • Kafka: supports load balancing; working with Zookeeper, it effectively balances load across the Kafka cluster
  • ZeroMQ: decentralized and does not provide load balancing; it is essentially a multi-threaded networking library
  • RocketMQ: supports load balancing
  • RabbitMQ: load balancing support is weak
  • ActiveMQ: supports Zookeeper-based load balancing

3.6 Number of Single-machine Queues

The more queues a single machine supports, the more topics it can host, because each topic is composed of a set of queues. The size of a consumer cluster is proportional to the number of queues: the more queues, the larger the consumer cluster can grow.

  • When a single Kafka machine has more than 64 queues/partitions, load rises noticeably: the more queues, the higher the load and the longer the message-send latency. So the partition count of a Kafka deployment should not be set too high
  • A single RocketMQ machine supports up to 50,000 queues without significant load changes

4 Kafka Streams vs. Storm, Spark Streaming, and Flink

4.1 Characteristics and processing models of stream processing frameworks

As mentioned above, stream processing is the continuous processing, aggregation, and analysis of data sets, with latency kept as low as possible (milliseconds to seconds). Judging from the aspects that matter most in stream processing, a distributed stream processing framework should have the following characteristics:

  • Delivery guarantees: the framework should make clear which of the following it guarantees:
    • At Most Once: a message may be lost, but it is never delivered more than once
    • At Least Once: a message is always delivered, possibly more than once
    • Exactly Once: a message is delivered exactly once, neither lost nor duplicated (a producer-side configuration sketch follows this list)
  • High fault tolerance: on failures such as node or network failures, the framework should recover and resume from where it left off, typically by periodically checkpointing stream state to persistent storage
  • State management: most stream processing logic needs to maintain state, so the platform should provide the ability to store, access, and update state information
  • High performance: this covers low latency (time to process a record), high throughput (records processed per second), and scalability. Latency should be as low and throughput as high as possible, but the two pull against each other, so a balance must be struck
  • Advanced features: event-time processing, watermarks, and windowing; these are needed when the stream processing requirements are complex, for example processing records based on the time they were generated at the source (event time)
  • Maturity: it is a great sign if the framework has been proven and tested at scale by large companies; you are then more likely to find good community support and help on forums and elsewhere
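
As one concrete reference point, Kafka's own producer exposes settings for the stronger guarantees above. The sketch below shows the idempotence and transaction configuration involved; the broker address, topic, and transactional id are placeholders, and transactions additionally require a broker version that supports them (0.11+).

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ExactlyOnceProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Idempotence deduplicates broker-side retries (no duplicates from resends)
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        // A transactional id enables atomic multi-partition writes (exactly-once within Kafka)
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "demo-tx-1"); // placeholder

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("my-topic", "k", "v"));
            producer.commitTransaction(); // on error you would call abortTransaction() instead
        }
    }
}
```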

Streams can be processed in two ways:

  • Native streaming
    • Every incoming record is processed as soon as it arrives, without waiting for other records. Long-running processes (called operators, tasks, or bolts depending on the framework) run forever, and each record passes through them. Examples: Storm, Flink, Kafka Streams
  • Micro-batching
    • Fast batching: incoming records are grouped together every few seconds and then processed as one small batch, with a delay of a few seconds. Example: Spark Streaming

Both approaches have advantages and disadvantages. Processing each record as soon as it arrives feels natural and lets the framework achieve the lowest possible latency, but it makes fault tolerance expensive without hurting throughput, because the system must track and checkpoint every record after processing. On the other hand, state management is easy, since long-running processes can maintain the required state directly. Micro-batching is the opposite: because it is essentially batching, throughput is high and checkpointing is cheap, since processing and checkpoints cover a whole group of records at once. But this comes at the cost of latency and does not feel like natural stream processing, and efficient state management becomes a challenge.

4.2 Comparison of mainstream stream processing frameworks

| Framework | Characteristics | Disadvantages |
| --- | --- | --- |
| Storm | The "Hadoop of stream processing": the oldest open-source stream processing framework and one of the most mature and reliable. Very low latency, true streaming, high throughput; ideal for less complex streaming scenarios | Only an at-least-once delivery guarantee; no advanced features such as event-time processing, aggregation, windows, sessions, or watermarks |
| Spark Streaming | Supports the Lambda architecture and comes free with Spark; high throughput, suitable for the many scenarios that do not require sub-second latency; easy-to-use high-level APIs; good community support; Structured Streaming provides a higher-level abstraction, and since version 2.3.0 you can choose between micro-batch and continuous processing modes; exactly-once delivery guarantee | Not true streaming, so unsuitable for low-latency requirements; many parameters that are hard to tune; lags behind Flink in many advanced features |
| Flink | Supports the Lambda architecture; the innovation leader in open-source streaming; the first true streaming framework with all the advanced features, such as event-time processing and watermarks; low latency and high throughput, configurable as needed; little parameter tuning required; exactly-once delivery guarantee; adopted by big companies such as Uber and Alibaba | A late entrant to the stream processing world, so it took time to gain broad acceptance; community support is comparatively small, but thriving |
| Kafka Streams | A very lightweight library, well suited to microservices and IoT applications; no dedicated cluster required; inherits all of Kafka's good qualities; supports stream joins, using RocksDB internally to maintain state; exactly-once delivery guarantee | Tightly coupled to Kafka, unusable without it; still young, and no large company has yet chosen it; unsuitable for heavyweight stream processing |

Overall, Flink is a good choice for dedicated stream processing, while Kafka Streams is a good choice when you want something lightweight that works hand in hand with Kafka.

5 Zookeeper & Kafka?

Zookeeper is used for coordination and management in the Kafka cluster.

  • Kafka stores its metadata in Zookeeper (a sketch after this list shows how to inspect it)
  • Zookeeper's coordination and management enable dynamic scaling of the whole Kafka cluster
  • Load balancing is implemented across the entire cluster
  • Producers discover partition leaders through Zookeeper
  • Consumer consumption state (offsets) can be stored there (in older Kafka versions)
  • Cluster configuration is managed through ZK, the Kafka controller is elected through it, and rebalancing takes place when a Consumer Group changes
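
To see some of this metadata for yourself, here is a hedged sketch that uses the plain ZooKeeper Java client to list the broker ids Kafka registers under `/brokers/ids` (one ephemeral znode per live broker). The connection string is a placeholder.

```java
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class ZkBrokerListExample {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string; 3000 ms session timeout, no-op watcher
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });
        try {
            // Kafka registers one ephemeral znode per live broker under /brokers/ids
            List<String> brokerIds = zk.getChildren("/brokers/ids", false);
            System.out.println("Live brokers: " + brokerIds);
        } finally {
            zk.close();
        }
    }
}
```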

Zookeeper is written in Java, so you need to install the JDK first.

5.1 Is Zookeeper mandatory?

Yes. Even if you want only one broker, one topic, and one partition, with one producer and multiple consumers, and you would rather not pay Zookeeper's overhead, Kafka still requires Zookeeper to coordinate tasks, manage state, and hold configuration for the distributed system. Besides, a single-node setup clearly does not play to Kafka's strengths.

In addition, the Apache Kafka maintainers have started discussing the removal of Zookeeper (KIP-500, as of 6 November 2019). Today Kafka uses Zookeeper to store partition and broker metadata and to elect one broker as the Kafka controller. Removing the ZooKeeper dependency would let Kafka manage metadata in a more scalable and robust way, support more partitions, and simplify Kafka deployment and configuration, since ZooKeeper is a separate system with its own configuration file syntax, management tools, and deployment patterns. Moreover, because Kafka and ZooKeeper are configured separately, it is easy to make mistakes: for example, an administrator may set up SASL on Kafka and wrongly believe that all data transmitted over the network is secured, when in practice security must also be configured in the separate external ZooKeeper system. Unifying the two systems would yield a single security configuration model. In the future Kafka may also support a single-node mode, useful for quickly testing Kafka without starting multiple daemons; removing the ZooKeeper dependency makes that possible.

5.2 Kafka ships with a bundled Zookeeper; can I use a separate ZK installation?

Certainly. You do not have to start the ZK bundled with Kafka; any existing ZooKeeper installation will work.

6 Understanding Kafka's data model: Topics, Partitions, and Replication

Kafka's partitioning mechanism is what allows a topic to scale horizontally while preserving order within each partition. How does it work?

A topic can logically be thought of as a queue. Every message must specify its topic, which simply means specifying which queue the message goes into. To let Kafka's throughput scale horizontally, a topic is physically divided into one or more partitions, and each partition corresponds to a physical folder that stores all of its messages and index files. For example, suppose we create a topic named xiaobiao on a Kafka cluster with three brokers (see the companion article on setting up a Kafka/ZK development or deployment environment):

Use Kafka's command line script to create the topic xiaobiao with two partitions and two replicas:

```bash
./bin/kafka-topics.sh --create --zookeeper localhost:2181,localhost:2182,localhost:2183 --replication-factor 2 --partitions 2 --topic xiaobiao
```

Two partitions and two replicas: how should we understand that? The two partitions mean two physical folders, xiaobiao-0 and xiaobiao-1, created under the broker's log.dir directory. With multiple Kafka brokers, the partition folders are distributed across the log.dir directories of different brokers; when there is only one broker, all partitions are allocated to that broker. Messages are spread across partitions through load balancing, and a consumer tracks offsets to learn which partition has new data and pulls the message data from that partition. That is what partitioning buys us. More partitions can, to a point, increase message processing throughput; but because Kafka is based on file reads and writes, more partitions mean more open file handles, which carries some performance overhead. The Kafka community is working on supporting more partitions without a large impact on performance.

If there are too many partitions, the log is split across too many segment files; although each log is written in batches, writing to many files at once degenerates into random I/O, and random I/O has a severe impact on performance. So in general, a Kafka deployment should not have too many partitions.

What about replicas? The replication factor is, as the name implies, the number of copies of the topic's partitions. Our topic above has two partitions, i.e. two physical folders; with a replication factor of 2, each partition is duplicated, so there are two copies each of xiaobiao-0 and xiaobiao-1, placed on different brokers in the cluster. This also means the replication factor cannot exceed the number of brokers, or topic creation will fail. So what are replicas for? When a broker in the cluster fails and can no longer serve consumers, the extra copies of the data keep the system available, so a single broker failure is nothing to fear. Note again that the replicas of a partition must live on different brokers, and the replica partitions keep their data in sync with the leader. Inevitably, the replication factor has an impact on Kafka's throughput. Below is a diagram of data synchronization with a replication factor of 2:

Partition Leader: for each partition, one replica is designated the leader. The leader is responsible for sending and receiving data for that partition; all the other replicas are called in-sync replicas (or followers) of the partition.

In-Sync Replicas (ISR): the subset of a partition's replicas whose messages are identical to (in sync with) the leader partition.

For example, suppose Broker 2 fails while it is the leader of partition 1; partition 1 can then no longer be reached through it. When this happens, Kafka automatically selects one of the in-sync replicas (in the figure above, there is only one) and makes it the leader. When Broker 2 comes back online, its copy of partition 1 can catch up and try to become the leader again.
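
Leaders and in-sync replicas can be observed directly. The following hedged sketch uses the AdminClient API to print each partition's leader and ISR for our example topic; the broker address is a placeholder.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;

public class DescribeTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singleton("xiaobiao"))
                                         .all().get().get("xiaobiao");
            desc.partitions().forEach(p ->
                // leader() is the broker serving reads/writes; isr() lists replicas in sync with it
                System.out.printf("partition %d leader=%s isr=%s%n",
                        p.partition(), p.leader(), p.isr()));
        }
    }
}
```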

Of course, the discussion of replicas and partitions above has not gone into the details of how the internal mechanisms are implemented or how the guarantees are enforced; we will not expand on that here.

7 Kafka's Consumer Group

Consumer Group: each consumer instance belongs to a consumer group, and each message can be consumed by only one consumer instance within the same group (different consumer groups can consume the same message simultaneously). Unlike a traditional queue, Kafka does not delete messages after they have been consumed. This means Kafka can serve offline and real-time processing at the same time: Hadoop and Flink can consume the same topic in parallel, simply by assigning the two processing tasks to two different consumer groups.
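
As a sketch of how consumer groups divide work, consider the consumer below (broker address, topic, and group id are placeholders). Run two copies with the same `group.id` and the topic's partitions are split between them, PTP-style; run them with different `group.id`s and each copy receives every message, PUB/SUB-style.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        // All consumers sharing this group id split the topic's partitions among themselves
        props.put("group.id", "demo-group"); // placeholder
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("xiaobiao"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            r.partition(), r.offset(), r.value());
                }
            }
        }
    }
}
```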

8 Summary

This article has given us a basic understanding of Kafka: it can serve as a message publish/subscribe system and as a real-time stream processor, and we now have some understanding of Kafka's partitions and replicas as well as the characteristics of its consumer groups. Next comes practice; after that, we will take a closer look at Kafka's internals and implementation mechanisms.

9 Reference Materials

  • http://kafka.apache.org/intro.html
  • http://kafka.apachecn.org/intro.html
  • https://stackoverflow.com/questions/23751708/is-zookeeper-a-must-for-kafka
  • https://stackoverflow.com/questions/38024514/understanding-kafka-topics-and-partitions
  • https://www.bigendiandata.com/2016-11-15-Data-Types-Compared/
  • https://stackoverflow.com/questions/44014975/kafka-consumer-api-vs-streams-api
  • https://kafka.apache.org/documentation/streams/
  • https://medium.com/@stephane.maarek/the-kafka-api-battle-producer-vs-consumer-vs-kafka-connect-vs-kafka-streams-vs-ksql-ef584274c1e
  • https://cwiki.apache.org/confluence/display/KAFKA/KIP-500%3A+Replace+ZooKeeper+with+a+Self-Managed+Metadata+Quorum
  • https://juejin.cn/post/6844903626171760653
  • https://www.infoq.cn/article/democratizing-stream-processing-apache-kafka-ksql
  • https://cloud.tencent.com/developer/article/1031210
  • http://www.54tianzhisheng.cn/2018/01/05/SpringBoot-Kafka/#
  • http://www.54tianzhisheng.cn/2018/01/04/Kafka/
  • https://www.infoq.cn/article/kafka-analysis-part-7
  • http://kafka.apachecn.org/documentation.html
  • https://www.linkedin.com/pulse/message-que-pub-sub-rabbitmq-apache-kafka-pubnub-krishnakantha
  • https://www.rabbitmq.com/
  • https://medium.com/@anvannguyen/redis-message-queue-rpoplpush-vs-pub-sub-e8a19a3c071b
  • http://jm.taobao.org/2017/01/12/rocketmq-quick-start-in-10-minutes/
  • https://activemq.apache.org/components/artemis/documentation/2.0.0/address-model.html
  • https://www.journaldev.com/9731/jms-tutorial
  • https://medium.com/@chandanbaranwal/spark-streaming-vs-flink-vs-storm-vs-kafka-streams-vs-samza-choose-your-stream-processing-91ea3f04675b
  • https://medium.com/@_amanarora/replication-in-kafka-58b39e91b64e
  • http://www.jasongj.com/2015/01/02/Kafka%E6%B7%B1%E5%BA%A6%E8%A7%A3%E6%9E%90/
  • https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Client-side+Assignment+Proposal
