Hi, I’m Koba
I haven't published much lately. For one thing, I've been taking stock of what I've done and learned over the past two months. For another, my official account has been in the middle of a migration, so I used the downtime to rest and to study how others write good articles.
From article layout and content positioning to publishing cadence and later promotion, I have reflected deeply on my own shortcomings. I'll fix them as soon as possible and start again.
Preface
All right, let's get down to business. Today we bring you the past and present of our old friend Kafka.
As the demand for real-time data grows higher and higher, how do we ensure fast delivery while moving huge volumes of data? This is the problem the message queue was born to solve.
A "message" is a unit of data transferred between two computers. A message can be very simple, such as a plain text string, or more complex, possibly containing embedded objects.
Messages are sent to a queue. A message queue is a container that holds messages during their transmission. Kafka is a distributed message queue, and mastering it is essential for us.
This article introduces the implementation details of Kafka's basic components and their basic applications in detail. I also stayed up several nights drawing diagrams, hoping to give everyone a deeper understanding of Kafka's core concepts. Finally, it sums up how Kafka is applied in real business scenarios. Follow Koba to get familiar with Kafka's secrets:
Kafka concept
Kafka is a high-throughput, distributed, publish/subscribe-based messaging system, originally developed by LinkedIn and written in Scala. It is currently an open-source Apache project.
Kafka main components
- **broker**: the Kafka server, which stores and forwards messages
- **topic**: Kafka categorizes messages by topic
- **partition**: a topic can contain multiple partitions, on which the topic's messages are stored
- **offset**: the position of a message in the log; it is the message's offset within the partition and its unique sequence number
- **Producer**: message producer
- **Consumer**: message consumer
- **Consumer Group**: consumer grouping; each Consumer must belong to a group
- **Zookeeper**: stores cluster metadata such as brokers, topics, and partitions, and is also responsible for broker failure detection, partition leader election, load balancing, and other functions
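To make these components concrete, here is a minimal sketch that creates a topic with several partitions using the Java AdminClient. The broker address, topic name, partition count, and replication factor are all placeholder values, not recommendations.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class CreateTopicDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // "localhost:9092" is a placeholder broker address
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // A topic named "demo-topic" with 3 partitions, replication factor 1
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 1);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```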
Kafka advantages
- **Decoupling**: a message system inserts an implicit, data-based interface layer between two processes, which both sides implement. This lets you extend or modify the two processes independently, as long as they adhere to the same interface constraints.
- **Redundancy**: message queues persist data until it has been fully processed, avoiding the risk of data loss. In the "insert-get-delete" paradigm used by many message queues, your processing system must explicitly indicate that a message has been processed before it is removed from the queue, ensuring your data is stored safely until you are done using it.
- **Scalability**: because message queues decouple your processing, it is easy to increase the rate at which messages are enqueued and processed simply by adding more processing capacity, with no code changes and no parameter tuning. Scaling is as simple as turning a dial.
- **Flexibility & peak processing capability**: message queues let key components withstand sudden spikes in access pressure without collapsing completely under the overload.
- **Recoverability**: message queues reduce coupling between processes, so even if a process that handles messages dies, queued messages can still be processed after the system recovers.
- **Ordering guarantees**: most message queues are inherently ordered and can ensure that data is processed in a particular sequence. Kafka guarantees the ordering of messages within a partition.
- **Buffering**: message queues add a buffer layer that helps tasks execute efficiently: writes to the queue go as fast as possible, and the buffer helps control and optimize the speed at which data flows through the system.
- **Asynchronous communication**: message queues provide asynchronous processing, letting users put a message on the queue without processing it immediately. Put as many messages on the queue as you like, then process them as needed.
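As a small illustration of this asynchronous style, here is a sketch using the Java producer client: send() returns immediately, and a callback fires once the broker has acknowledged the message. The broker address and topic name are placeholders.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class AsyncSendDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send() returns immediately; the callback runs when the broker acknowledges
            producer.send(new ProducerRecord<>("demo-topic", "key", "hello"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace();
                        } else {
                            System.out.printf("stored at partition %d, offset %d%n",
                                    metadata.partition(), metadata.offset());
                        }
                    });
        }
    }
}
```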
Kafka application scenarios
- **Activity tracking**: tracking interactions between web users and front-end applications, e.g. PV/UV analysis of a website
- **Messaging**: asynchronous information exchange between systems, e.g. marketing activities (sending a coupon code after registration)
- **Log collection**: collecting system and application metrics and logs, e.g. application monitoring and alerting
- **Commit log**: publishing updates to Kafka, e.g. transaction statistics
Kafka data storage design
Partition data file
Each Message in a partition contains three attributes: offset, MessageSize, and data. The offset indicates the Message's position in the partition; it is not the Message's physical location in the partition's data file, but a logical value that uniquely identifies the Message within the partition. You can think of the offset as the Message's ID within the partition. MessageSize indicates the size of the message body, and data is the Message's actual content.
Data file segment
Physically, a partition consists of multiple segment files of equal size, which are read and written sequentially. Each segment data file is named after the smallest offset it contains, with the file extension .log. Thanks to this naming scheme, when looking up a Message by offset, a binary search quickly locates the segment data file that contains it.
Data file index
Kafka creates an index file for each data file in a partition. The index file has the same name as its data file, but with the extension .index. The index file does not index every Message in the data file; instead, it uses sparse storage, creating an index entry every certain number of bytes. This keeps the index file from taking up too much space, so it can be kept in memory.
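To make this two-step lookup concrete, here is a simplified, hypothetical model of the logic described above (Kafka's real implementation lives in its internal log classes). TreeMap.floorEntry plays the role of the binary search; all file names and offsets are made up for illustration.

```java
import java.util.TreeMap;

public class SegmentLookupDemo {
    // Segment files keyed by base offset (the smallest offset they contain),
    // mirroring names like 00000000000000000000.log, 00000000000000368769.log
    private final TreeMap<Long, String> segments = new TreeMap<>();

    // Sparse index for the second segment: message offset -> byte position
    // in the .log file. Only some messages get an entry, keeping it small.
    private final TreeMap<Long, Long> sparseIndex = new TreeMap<>();

    public SegmentLookupDemo() {
        segments.put(0L, "00000000000000000000.log");
        segments.put(368769L, "00000000000000368769.log");

        sparseIndex.put(368769L, 0L);     // first message starts at byte 0
        sparseIndex.put(368800L, 4096L);  // a later message starts at byte 4096
    }

    public void locate(long targetOffset) {
        // Step 1: "binary search" over segment base offsets finds the file.
        String segmentFile = segments.floorEntry(targetOffset).getValue();

        // Step 2: find the nearest preceding index entry, then scan forward
        // in the .log file from that byte position until targetOffset is hit.
        long startByte = sparseIndex.floorEntry(targetOffset).getValue();

        System.out.printf("offset %d: scan %s from byte %d%n",
                targetOffset, segmentFile, startByte);
    }

    public static void main(String[] args) {
        new SegmentLookupDemo().locate(368776L);
    }
}
```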
Zookeeper in Kafka
Kafka clusters, producers, and consumers all rely on ZooKeeper to keep the system available and to store the cluster's metadata.
Kafka uses ZooKeeper as its distributed coordination framework, tying together the processes of message production, message storage, and message consumption.
With ZooKeeper, Kafka can establish subscription relationships between producers, consumers, and brokers in a stateless way, and achieve load balancing between producers and consumers.
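As an illustration: on ZooKeeper-based Kafka versions, each live broker registers an ephemeral node under the /brokers/ids znode, which you can list with the plain ZooKeeper client. Note that newer Kafka versions running in KRaft mode no longer use ZooKeeper at all; the connection address below is a placeholder.

```java
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

import java.util.List;
import java.util.concurrent.CountDownLatch;

public class BrokerListDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // "localhost:2181" is a placeholder ZooKeeper address
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await(); // wait for the session to be established
        try {
            // Live brokers appear as children of /brokers/ids
            List<String> brokerIds = zk.getChildren("/brokers/ids", false);
            System.out.println("live brokers: " + brokerIds);
        } finally {
            zk.close();
        }
    }
}
```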
Producer design
Load balancing
Since a message topic is made up of multiple partitions that are evenly distributed across different brokers, the producer can use random or hash-based partitioning to make full use of the broker cluster's performance and improve message throughput: messages are spread evenly across multiple partitions, achieving load balancing.
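Here is a sketch of how message keys drive partitioning with the Java producer client. With a non-null key, the default partitioner hashes the key, so equal keys always land on the same partition; with a null key, records are spread across partitions (round-robin in older clients, sticky batching in newer ones). The address, topic, and keys are placeholders.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class PartitioningDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Non-null key: the default partitioner hashes the key, so all
            // messages with the same key go to the same partition (in order).
            producer.send(new ProducerRecord<>("demo-topic", "user-42", "event-1"));

            // Null key: records are spread across partitions for load balancing.
            producer.send(new ProducerRecord<>("demo-topic", null, "event-2"));
        }
    }
}
```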
Batch send
Batching is an important way to improve message throughput. By accumulating multiple messages in memory, the Producer can send a batch of messages to the Broker in a single request, which greatly reduces the number of I/O operations the broker needs to perform to store the messages. However, it also affects message latency to some extent; in effect, it trades latency for better throughput.
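This trade-off is controlled by two producer settings, batch.size and linger.ms. A minimal sketch, with illustrative rather than recommended values:

```java
import java.util.Properties;

public class BatchingConfigDemo {
    // Producer settings that control batching; values are illustrative.
    static Properties batchingProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("batch.size", 16384);  // max batch size in bytes, per partition
        props.put("linger.ms", 10);      // wait up to 10 ms to let a batch fill
        return props;
    }
}
```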
Compression
In addition to sending messages in batches, Kafka supports compressing message sets. The Producer can compress a message set in GZIP or Snappy format; after being compressed on the Producer side, it must be decompressed on the Consumer side. The advantage of compression is that it reduces the amount of data transmitted, easing the pressure on the network. In big data processing, the bottleneck usually shows up in the network rather than the CPU (though compression and decompression do consume some CPU resources).
To tell whether a message is compressed, Kafka adds a compression attribute byte to the message header. The last two bits of this byte indicate the codec used to compress the message; if those two bits are 0, the message is uncompressed.
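On the producer side, compression is enabled with the compression.type setting; the consumer decompresses transparently. A minimal sketch:

```java
import java.util.Properties;

public class CompressionConfigDemo {
    // The producer compresses whole batches; consumers decompress transparently.
    static Properties compressionProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("compression.type", "gzip"); // or "snappy"; newer versions also accept "lz4"/"zstd"
        return props;
    }
}
```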
Consumer design
Consumer Group
Multiple Consumer instances in the same Consumer Group do not consume the same partition at the same time, which is equivalent to queue mode. Messages within a partition are ordered, and Consumers consume messages in pull mode. Kafka does not delete messages after they are consumed. Partitions read and write disk data sequentially, providing message persistence with O(1) time complexity.
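A minimal sketch of a pull-mode consumer that joins a group and polls for messages; the broker address, group id, and topic name are placeholders:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class GroupConsumerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "demo-group");              // every consumer must belong to a group
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("demo-topic"));
            while (true) {
                // Pull mode: the consumer asks the broker for new messages
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```

Starting a second instance of this program with the same group.id would cause Kafka to split the topic's partitions between the two consumers.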
Practical application
Kafka as a messaging system
By having a notion of parallelism within a topic, the partition, Kafka can provide both ordering guarantees and load balancing across a pool of consumer processes. This works by assigning the partitions of a topic to the consumers in a consumer group, so that each partition is consumed by exactly one consumer in the group. Doing so guarantees that the consumer is the only reader of that partition and consumes its data in order. Since there are many partitions, this still balances the load over many consumer instances. Note, however, that the number of consumer instances in a consumer group cannot exceed the number of partitions.
Kafka serves as a storage system
Kafka is a very good storage system. Data written to Kafka is written to disk and replicated for fault tolerance. Kafka allows producers to wait for acknowledgment, so that a write is not considered complete until it is fully replicated, and is guaranteed to persist even if the server being written to fails.
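On the producer side, this wait-for-full-replication behavior is governed by the acks setting. A minimal sketch with illustrative values:

```java
import java.util.Properties;

public class DurabilityConfigDemo {
    // Settings that make the producer wait for full replication before a
    // write counts as complete; values are illustrative.
    static Properties durabilityProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("acks", "all");  // wait until all in-sync replicas acknowledge
        props.put("retries", Integer.MAX_VALUE); // retry transient send failures
        return props;
    }
}
```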
Kafka's disk structures also scale well: whether there are 50 KB or 50 TB of persistent data on the server, Kafka performs the same.
Because it takes storage seriously and lets clients control where they read from, you can think of Kafka as a special-purpose distributed file system dedicated to high-performance, low-latency commit log storage, replication, and propagation.
Kafka is used for stream processing
For complex transformations, Kafka provides a fully integrated Streams API. This lets you build applications that perform non-trivial processing, such as computing aggregations over streams or joining streams together.
This tool helps solve the challenges such applications face: handling out-of-order data, reprocessing input when code changes, performing stateful computations, and so on.
The Streams API builds on Kafka's core primitives: it uses the producer and consumer APIs for input and output, uses Kafka for stateful storage, and uses the same group mechanism across stream processor instances for fault tolerance.
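A minimal sketch of a Streams application, assuming hypothetical input-topic and output-topic topics; it performs a trivial stateless transformation just to show the API shape:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class StreamsDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "demo-streams-app"); // also serves as the group id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-topic");

        // A trivial stateless transformation: uppercase every value and
        // write the result to another topic.
        input.mapValues(value -> value.toUpperCase())
             .to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```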
See you next time!