Types of message consumption:
- Push: when a message sent by the Producer arrives at the server, the server immediately delivers it to the Consumer.
- Pull: the server holds the received message and does nothing further, waiting for the Consumer to actively pull it.
Kafka structure
Kafka supports only the Pull consumption type.
It is mainly designed for transferring logs, which limits it in many other scenarios, such as transactions, orders, and account top-ups.
Producer: the producer of messages and the entry point of data into Kafka.
Kafka cluster:
Broker: a Broker is a Kafka instance. Each server runs one or more Kafka instances; here we assume one Broker per server. The Brokers within a Kafka cluster have unique numbers, such as broker-0, broker-1, and so on.
Topic: the topic of a message, which can be understood as its category. Kafka data is stored in Topics, and multiple Topics can be created on each Broker.
Partition: a Topic can have multiple partitions, which spread load and raise Kafka's throughput. Data for the same topic is never duplicated across its partitions, and on disk each partition is a folder!
Replication: each partition has multiple replicas that serve as backups. If the primary replica (the Leader) fails, a Follower is elected to become the new Leader. The default maximum number of replicas in Kafka is 10, and the number of replicas cannot exceed the number of Brokers. A Follower and its Leader are always on different machines, and a machine holds at most one replica of any given partition (the Leader included).
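To make the partition and replica counts concrete, here is a minimal sketch using the Java AdminClient; the broker address and topic name are placeholders, not from the original article:

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 2: each partition gets a Leader
            // plus one Follower, and the two always sit on different Brokers.
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 2);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```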
Message: The body of each sent Message.
Consumer: the consumer of messages and the exit point of data from Kafka.
Consumer Group: multiple Consumers can be grouped into a Consumer Group. In Kafka’s design, data from one partition can only be consumed by one Consumer within a group, while Consumers in the same group can consume data from different partitions of the same topic, which improves Kafka’s throughput!
Zookeeper: Kafka clusters rely on Zookeeper to store cluster metadata and ensure system availability.
Kafka workflow
Producing data
The Producer writes data to the Leader rather than directly to the Followers; after messages are written to the Leader, the Followers actively synchronize from it.
The Producer publishes data to the Broker, and each message is appended to a partition and written to disk sequentially, which guarantees that data within the same partition is ordered.
How does the Producer know which partition to send data to? There are several rules (a code sketch follows the list):
- If a partition is specified when writing, the message goes to that partition.
- If no partition is specified but a message key is set, a partition is chosen by hashing the key.
- If neither a partition nor a key is specified, a partition is selected by round-robin.
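A minimal sketch of the three cases with the standard Java client; the topic and key names are made up, and the exact no-key strategy depends on the client version (newer clients use a sticky variant of round-robin):

```java
import org.apache.kafka.clients.producer.ProducerRecord;

public class PartitionRoutingSketch {
    public static void main(String[] args) {
        // 1. Explicit partition: the record always goes to partition 0.
        ProducerRecord<String, String> byPartition =
                new ProducerRecord<>("demo-topic", 0, "order-42", "payload");

        // 2. Key but no partition: the default partitioner hashes the key,
        //    so the same key always lands in the same partition.
        ProducerRecord<String, String> byKey =
                new ProducerRecord<>("demo-topic", "order-42", "payload");

        // 3. Neither partition nor key: the client spreads records across
        //    partitions (round-robin, or sticky per batch in newer versions).
        ProducerRecord<String, String> spread =
                new ProducerRecord<>("demo-topic", "payload");
    }
}
```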
What is the purpose of Kafka partitions?
- Easy scaling. Since a topic can have multiple partitions, we can cope with ever-growing data volumes simply by adding machines.
- Higher concurrency. The unit of reads and writes is the partition, so multiple consumers can consume data at the same time, improving message-processing efficiency.
How do we ensure messages are not lost? Through the ACK response mechanism. When the producer writes data to the queue, it can set a parameter that controls how Kafka confirms receipt of the data. The parameter can be 0, 1, or all:
- 0: the producer sends data to the cluster without waiting for any response, so successful delivery is not guaranteed. Least safe, but most efficient.
- 1: the producer sends the next message as soon as the Leader responds, which only guarantees that the Leader received the data successfully.
- all: the producer only proceeds after all Followers have completed synchronization with the Leader, which guarantees that the Leader received the data and that all replicas are backed up. Safest, but least efficient.
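As an illustration, a hedged sketch of setting acks on the Java producer; the broker address and topic are placeholders:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AckSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // acks=0: fire and forget; acks=1: Leader ack only;
        // acks=all: Leader plus all in-sync Followers must acknowledge.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("demo-topic", "key", "value"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace(); // delivery failed
                        }
                    });
        }
    }
}
```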
Saving data
After the Producer writes data to Kafka, the cluster needs to save it to disk.
Partition structure
As mentioned earlier, every topic can be divided into one or more partitions. If the topic feels abstract, the partition is concrete: each partition is split into segments, and each segment consists of an .index file, a .log file, and a .timeindex file. The .log file is where the messages are actually stored, while the .index and .timeindex files are indexes used to retrieve them.
Each segment file is named after the segment’s minimum (base) offset, zero-padded to 20 digits. For example, 00000000000000000000.index belongs to the segment storing messages with offsets 0 to 368795. Kafka uses this segment + index scheme to solve the lookup-efficiency problem.
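Conceptually, picking the right segment for a given offset is a floor lookup over the base offsets in the file names. A toy sketch, with illustrative file names rather than real broker internals:

```java
import java.util.TreeMap;

public class SegmentLookupSketch {
    public static void main(String[] args) {
        // Segment files keyed by their base offset (the number in the file name).
        TreeMap<Long, String> segments = new TreeMap<>();
        segments.put(0L, "00000000000000000000.log");
        segments.put(368796L, "00000000000000368796.log");
        segments.put(737562L, "00000000000000737562.log");

        long target = 368801L;
        // floorEntry is the binary-search step: the segment with the largest
        // base offset <= target is the one that holds the message.
        System.out.println(segments.floorEntry(target).getValue());
        // prints 00000000000000368796.log
    }
}
```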
Consuming data
Just as when producing messages, the consumer also pulls messages from the Leader.
Multiple consumers can form a consumer group, and each group has a group id! Consumers in the same group can consume data from different partitions under the same topic, but data in one partition can only be consumed by one fixed consumer within a given group. (Since each message produced by the Producer is placed in exactly one partition under a topic, no message is consumed twice.)
The figure shows a case where a consumer group has fewer consumers than partitions, so one consumer consumes several partitions, which is slower than handling a single partition! If a consumer group has more consumers than partitions, will several consumers share one partition? As mentioned above, this will not happen! The extra consumers simply consume no partition at all. In practice, it is therefore recommended to make the number of consumers in a group equal to the number of partitions!
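A minimal consumer-group sketch with the Java client; the group id, topic, and broker address are placeholders. Running several copies of this program with the same group.id makes Kafka split the topic’s partitions among them:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        // All consumers sharing this group.id split the topic's partitions
        // among themselves; each partition goes to exactly one of them.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("demo-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```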
How does Kafka use segment + offset to find a message? Suppose we need to find the message whose offset is 368801 (a code sketch follows the steps):
1. Find the segment file that contains the message with offset 368801, using binary search over the segments’ base offsets. Here it is the second segment, 00000000000000368796.index, whose base offset is 368796 (its messages start at 368796 + 1), so the relative offset we are looking for is 368801 - 368796 = 5.
2. Because the .index file uses a sparse index that maps relative offsets to the physical positions of messages, an entry with relative offset 5 may not exist. Binary search is used again to find the largest entry whose relative offset is less than or equal to 5; that is the entry with relative offset 4.
3. The index entry with relative offset 4 says the message is stored at physical position 256. Open the .log data file and scan sequentially from position 256 until the message with offset 368801 is found.
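The same three steps as a toy sketch; the index entries and positions are the illustrative numbers from the walkthrough, not real file contents:

```java
import java.util.TreeMap;

public class OffsetLookupSketch {
    public static void main(String[] args) {
        long baseOffset = 368796L;            // step 1: from the segment file name
        long target = 368801L;
        long relative = target - baseOffset;  // relative offset 5

        // Sparse index: relative offset -> byte position in the .log file.
        TreeMap<Long, Long> sparseIndex = new TreeMap<>();
        sparseIndex.put(0L, 0L);
        sparseIndex.put(4L, 256L);
        sparseIndex.put(8L, 512L);

        // Step 2: binary search (floorEntry) for the largest indexed relative
        // offset <= 5, which is the entry (4 -> 256).
        long scanFrom = sparseIndex.floorEntry(relative).getValue();

        // Step 3: a real broker now scans the .log file forward from byte 256,
        // comparing each message's offset to the target, until 368801 is reached.
        System.out.println("scan .log from byte position " + scanFrom);
    }
}
```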