Kafka concept
Kafka is a high-throughput, distributed, publish/subscribe messaging system originally developed by LinkedIn and written in Scala. It is now an open-source Apache project.
Key terms
1. Broker: a Kafka server that stores and forwards messages
2. Topic: Kafka categorizes messages by topic
3. Partition: a topic can contain multiple partitions, and the topic's messages are distributed across its partitions
4. Offset: the position of a message in the log; it is the message's offset within the partition and its unique sequence number
5. Producer: the producer of messages
6. Consumer: the consumer of messages
7. Consumer Group: each Consumer must belong to a Group
8. Zookeeper: stores metadata about brokers, topics, partitions, and so on; it is also responsible for broker failure detection, partition leader election, load balancing, and other functions
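To make these terms concrete, here is a minimal sketch that uses the Kafka Java AdminClient to create a topic whose partitions are spread across the brokers of a cluster; the bootstrap address, topic name, partition count, and replication factor are arbitrary example values.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Address of one broker in the cluster (placeholder value)
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // A topic named "orders" with 3 partitions, each replicated to 2 brokers
            NewTopic topic = new NewTopic("orders", 3, (short) 2);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```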
Kafka data storage design
1. Offset, MessageSize, data
Each Message in a partition contains three properties: offset, MessageSize, and data. The offset represents the Message's position within the partition; it is not the actual storage location of the Message in the partition's data file but a logical value that uniquely identifies a Message in the partition, so it can be regarded as the Message's ID within the partition. MessageSize indicates the size of the message content, and data is the concrete content of the Message.
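As a purely illustrative sketch (not Kafka's actual on-disk record format), one log entry can be modeled like this:

```java
// Conceptual model of one entry in a partition's log (illustrative only,
// not Kafka's real record format).
public class LogEntry {
    public final long offset;      // logical position of the message within the partition
    public final int messageSize;  // size of the message content in bytes
    public final byte[] data;      // the concrete message content

    public LogEntry(long offset, byte[] data) {
        this.offset = offset;
        this.messageSize = data.length;
        this.data = data;
    }
}
```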
2. Data file segments (sequential read and write, naming by offset, binary search)
A partition physically consists of multiple segment files of equal size that are read and written sequentially. Each segment data file is named after the smallest offset it contains and has the file extension .log. With this naming scheme, the Message at a given offset can be located quickly by binary-searching the segment data files.
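A minimal sketch of this lookup, assuming the base offsets of the .log files are kept in a sorted array (the names and structure are illustrative, not Kafka's internal code):

```java
import java.util.Arrays;

public class SegmentLookup {
    // Given the sorted base offsets of the segment files (each .log file is named
    // after its smallest offset), return the index of the segment containing targetOffset.
    public static int findSegment(long[] baseOffsets, long targetOffset) {
        int pos = Arrays.binarySearch(baseOffsets, targetOffset);
        if (pos >= 0) {
            return pos;                // exact hit: targetOffset is a segment's first message
        }
        int insertionPoint = -pos - 1; // index of the first base offset greater than targetOffset
        return insertionPoint - 1;     // the segment just before it contains the target
    }

    public static void main(String[] args) {
        long[] baseOffsets = {0L, 368769L, 737337L}; // e.g. 00000000000000368769.log
        System.out.println(findSegment(baseOffsets, 500_000L)); // prints 1
    }
}
```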
3. Data file index (segmented index, sparse storage)
Kafka creates an index file for each segment data file. The index file has the same name as the data file but the extension .index. The index file does not index every Message in the data file; instead, it uses sparse storage, creating an index entry for every certain number of bytes of data. This prevents the index file from taking up too much space, so the entire index file can be kept in memory.
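A minimal sketch of how such a sparse index is used, assuming the index is a sorted list of (offset, file position) pairs; this is illustrative and not Kafka's actual index format:

```java
public class SparseIndexLookup {
    /** One sparse index entry: a message offset and its byte position in the .log file. */
    record IndexEntry(long offset, long filePosition) {}

    // Return the file position to start scanning from: take the last index entry whose
    // offset is <= targetOffset, then scan the .log file forward from that position.
    static long startPosition(IndexEntry[] index, long targetOffset) {
        long position = 0;
        for (IndexEntry e : index) {
            if (e.offset() <= targetOffset) {
                position = e.filePosition();
            } else {
                break; // entries are sorted by offset, so no later entry can match
            }
        }
        return position;
    }
}
```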
Producer design
1. Load balancing (partitions are evenly distributed across brokers)
Since a topic is composed of multiple partitions that are evenly distributed across brokers, the Producer can use random or hash-based partitioning to send messages evenly to all partitions, making effective use of the broker cluster's performance and improving message throughput.
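As an illustration, the sketch below sends keyed and un-keyed records with the standard Java producer; with the default partitioner, keyed records are assigned to partitions by hashing the key, while un-keyed records are spread across partitions (round-robin or sticky, depending on the client version). The bootstrap address, topic name, and key are example values.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class PartitioningProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keyed record: the default partitioner hashes the key, so all messages
            // with the same key go to the same partition (and stay ordered there).
            producer.send(new ProducerRecord<>("orders", "user-42", "order created"));

            // Un-keyed record: spread across partitions for load balancing.
            producer.send(new ProducerRecord<>("orders", "order created"));
        }
    }
}
```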
2. Batch delivery
Batching is an important way to improve message throughput. The Producer accumulates multiple messages in memory and sends them to the Broker in a single request, which greatly reduces the number of I/O operations the Broker needs to perform to store the messages. However, it also affects the timeliness of messages to a certain extent: higher throughput is bought at the cost of latency.
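A minimal sketch of the batching settings on the Java producer (the values are arbitrary examples): batch.size caps how many bytes go into one batch per partition, and linger.ms adds a small wait so more records can accumulate before the request is sent.

```java
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Properties;

public class BatchingConfig {
    public static Properties producerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Group up to 32 KB of records per partition into one batch.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 32 * 1024);
        // Wait up to 10 ms for more records before sending: throughput at the cost of latency.
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);
        return props;
    }
}
```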
3. Compression (GZIP or Snappy)
The Producer can compress message sets in GZIP or Snappy format. Messages compressed on the Producer side must be decompressed on the Consumer side. The advantage of compression is that it reduces the amount of data to transmit and thus the pressure on the network. In big data processing, the bottleneck is usually the network rather than the CPU (although compression and decompression consume some CPU resources).
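On the Java producer this is a single setting; a minimal sketch (the choice of codec is just an example):

```java
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Properties;

public class CompressionConfig {
    public static Properties producerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Compress batches on the producer; the consumer decompresses transparently.
        // "snappy" costs less CPU than "gzip" at a somewhat lower compression ratio.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");
        return props;
    }
}
```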
Consumer design
Consumer Group
Multiple Consumer instances in the same Consumer Group do not consume the same partition at the same time, which is equivalent to the queue (point-to-point) mode. Messages within a partition are ordered, and Consumers pull messages from the Broker. Kafka does not delete messages after they have been consumed.
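A minimal sketch of a consumer joining a group with the Java client; the group id and topic name are example values. Each partition of the topic is assigned to at most one consumer instance within the group, and the consumer pulls records with poll().

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors"); // every consumer must belong to a group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                // Pull mode: the consumer asks the broker for new records.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            r.partition(), r.offset(), r.value());
                }
            }
        }
    }
}
```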
For each partition, sequential reads and writes of the disk data provide message persistence with O(1) time complexity.