preface
Kafka cluster > Broker > topic > partition > segment > .index file / .log file / .timeindex file > message
1/ Why message queues are needed (message middleware)
Bored on a weekend, I was scrolling through my phone when a Taobao notification popped up: "To reward loyal customers: girlfriend, buy one get one free. Today only!" Buy one get one free! I couldn't miss that! I immediately picked two of the latest models, placed the order, and paid in one go. Lying in bed, satisfied, thinking about soon having a girlfriend, I was so happy I couldn't sleep... The next day at work, I suddenly got a call from the delivery guy. Delivery guy: "Are you xx? Your girlfriend has arrived. I'm downstairs at your place now, come pick her up!" Me: "Well... I'm at work. Can you bring it tonight?" Delivery guy: "Tonight won't work, I'll be off work by then!" The two of us were at a stalemate for a long time... Finally, the delivery guy suggested: how about I leave it at the Xiaofang convenience store downstairs, and you pick it up on your way home tonight? Awkward situation resolved! Back to the topic: if there were no Xiaofang convenience store, the interaction between the delivery guy and me would look like this:
What happens with the interaction pattern shown above? 1) I take leave to go home for the pickup (the boss won't approve it). 2) The delivery guy waits downstairs until I'm off work (he has other deliveries to make). 3) He delivers on the weekend instead (obviously I can't wait that long). 4) I give up on this girlfriend (no way!). With the Xiaofang convenience store in place, the interaction becomes:
In this example, the delivery guy and "me, who bought the girlfriend" are the two systems that need to interact, and the Xiaofang convenience store is the "message middleware" this article is about. To sum up, the convenience store (message middleware) brings the following advantages: <1> Decoupling: two parties that were directly connected or dependent on each other are separated, removing the connection and the dependency. The delivery guy has many packages; without the store, he has to call every recipient one by one, confirm who is free and when, and then plan his delivery route around them, so he depends completely on the recipients! With many deliveries, he would be driven crazy... With a convenience store, he only needs to drop all packages for the same neighborhood at that store and notify the recipients to pick them up. Courier and recipient are thereby decoupled! <2> Asynchrony
Without the store, after calling me the delivery guy had to wait downstairs until I took the package, and he couldn't deliver to anyone else in the meantime. With the store, he drops the package off and moves on to other work; he no longer sits blocked waiting for my arrival. This improves working efficiency.
<3> Peak clipping
Suppose on Double Eleven I bought goods from several different stores, shipped by different couriers: ZTO, YTO, STO, and so on. Worse, they all arrive at the same time! The ZTO guy calls me to the north gate, the YTO guy calls me to the south gate, and the STO guy calls me to the east gate. I can't be in three places at once... With the convenience store, all of them simply leave the packages there, and I collect everything in a single trip after work: the store absorbs the delivery peak for me.
We can see that when systems need to interact, message queue middleware really pays off. Based on the same idea, more "professional" middlemen than the Xiaofang convenience store have appeared, such as Fengchao lockers and Cainiao stations.
2/ Mode of message queue (message middleware) communication
<1> Point-to-point mode
As shown in the figure above, the point-to-point pattern is usually a pull- or polling-based messaging model, characterized by each message sent to the queue being consumed by one and only one consumer. After the producer puts a message into the queue, a consumer actively pulls it for consumption. The advantage of the point-to-point model is that consumers control how often they pull messages. The drawback is that consumers cannot sense whether there are messages waiting in the queue, so an extra monitoring thread is needed on the consumer side.
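The "one and only one consumer per message" property can be sketched with an in-process queue (a minimal illustration with made-up names, not a real messaging API):

```python
import queue

# Point-to-point sketch: one shared queue; every message is pulled by
# exactly one consumer, so the two consumers split the messages.
q = queue.Queue()
for i in range(6):
    q.put(f"msg-{i}")

consumed = {"c1": [], "c2": []}
while not q.empty():
    for name in ("c1", "c2"):          # consumers actively pull (poll)
        try:
            consumed[name].append(q.get_nowait())
        except queue.Empty:
            break

print(consumed)
```

Note that the `while not q.empty()` loop is exactly the extra polling the text mentions: the consumers have no way of being told a message arrived, so they must keep checking.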
<2> Publish subscribe mode
As shown in the figure above, the publish/subscribe pattern is a push-based messaging model that can have many different subscribers. After the producer puts a message into the queue, the queue pushes it to every consumer that subscribed to that message (much like following a WeChat official account: when the account publishes a new article, it is actively pushed to you). Since consumers passively receive pushes, they don't need to sense whether the queue has pending messages! But Consumer1, Consumer2, and Consumer3 may process messages at different speeds because of different machine performance, and the queue cannot sense how fast each consumer is! So the push speed becomes a problem for the publish/subscribe model. Suppose the three consumers can process 8M/s, 5M/s, and 2M/s respectively. If the queue pushes at 5M/s, Consumer3 cannot keep up! If the queue pushes at 2M/s, Consumer1 and Consumer2 waste a huge amount of capacity!
3/ Kafka cluster
With the brief introduction to message queues (message middleware) and the two communication modes behind us, it's time to introduce the protagonist of this article: the Kafka cluster. Kafka is a high-throughput distributed publish/subscribe messaging system that can handle all the consumer action stream data in a website, with high performance, persistence, multi-replica backup, and horizontal scalability. We won't expand on the basics here; there are plenty of introductions online that interested readers can look up.
<1> Infrastructure and terminology
Without further ado, let's look at the picture first. It gives an overview of the related concepts and how they relate to each other:
If this picture confuses you, that's fine! Let's go through it piece by piece. 1) Producer: the producer, i.e. the source of messages and the entry point of data. 2) Kafka cluster: Broker: a Kafka instance. Each server can run one or more Kafka instances; for simplicity, we assume each broker corresponds to one server. Each broker in a Kafka cluster has a unique number, such as broker-0, broker-1, and so on. Topic: the topic of a message, which can be understood as a message category. Kafka data is stored in topics, and multiple topics can be created on each broker. Partition: a topic can have multiple partitions, which spread the load and raise Kafka's throughput. Data for the same topic is not duplicated across its partitions. On disk, a partition materializes as a folder! Replication: each partition has multiple replicas that act as backups. If the primary replica (the Leader) fails, a Follower is promoted to Leader. The default maximum number of replicas in Kafka is 10, and the number of replicas cannot exceed the number of brokers. A follower and its leader are always on different machines, and one machine holds at most one replica of a given partition (the leader counts as a replica too). Message: the body of each message that is sent. 3) Consumer: the consumer, i.e. the exit point of data. 4) Consumer Group: multiple consumers can be grouped into one consumer group. In Kafka's design, data in one partition can only be consumed by one consumer within a consumer group. Consumers in the same group can consume different partitions of the same topic, which raises Kafka's throughput! 5) Zookeeper: the Kafka cluster relies on Zookeeper to store cluster metadata and ensure system availability.
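The hierarchy above can be mirrored as plain data, which also makes the replica-placement rules easy to check (broker names, the topic name, and the placements are illustrative, not taken from any real cluster):

```python
# Cluster metadata modeled as dicts, mirroring
# cluster > broker > topic > partition > replica.
cluster = {
    "brokers": ["broker-0", "broker-1", "broker-2"],
    "topics": {
        "orders": {
            # partition id -> replica placement
            0: {"leader": "broker-0", "followers": ["broker-1"]},
            1: {"leader": "broker-1", "followers": ["broker-2"]},
            2: {"leader": "broker-2", "followers": ["broker-0"]},
        }
    },
}

# Check the two rules from the text: replica count <= broker count,
# and a machine holds at most one replica of a given partition.
for pid, placement in cluster["topics"]["orders"].items():
    replicas = [placement["leader"]] + placement["followers"]
    assert len(replicas) <= len(cluster["brokers"])
    assert len(set(replicas)) == len(replicas)

print("placement is valid")
```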
4/ Workflow analysis
That covers Kafka's basic architecture and terminology. You don't need to master every detail; a general impression is enough, and if you're still confused, that's fine too! Things will click as we walk through Kafka's workflow next, starting with how data is sent.
<1> Send data
As in the earlier diagram, the producer is the source of data and its entry point. Note the red arrows in the figure: when writing data, the producer always finds the leader of the partition and writes to the leader, never directly to a follower! How does it find the leader? What does the write process look like? See the picture below:
The sending flow is laid out in the picture, so we won't list the steps again in text. Note that after a message is written to the leader, the followers actively pull from the leader to synchronize it! The producer pushes data to the broker, and each message is appended to the partition and written to disk sequentially, which guarantees that data within the same partition is ordered! The write process is illustrated below:
Data is written to different partitions, which raises the question: why does Kafka partition at all? As you can probably guess, the main purposes of partitioning are: 1) Easy scaling. Since a topic can have multiple partitions, we can add machines to cope with growing data volumes. 2) Higher concurrency. The unit of reads and writes is the partition, so multiple consumers can consume data at the same time, improving message-processing efficiency. Those familiar with load balancing know that when requests are sent to a server, a load balancer may distribute them across different servers. In Kafka, when a topic has multiple partitions, which partition does the producer write a given message to? Kafka follows several rules: 1) The producer can specify the partition when writing; if a partition is specified, the data is written to that partition. 2) If no partition is specified but the message has a key, a partition is chosen by hashing the key. 3) If neither a partition nor a key is given, a partition is chosen by round-robin. Guaranteeing that messages are not lost is a basic property of message middleware. How does the producer ensure that messages written to Kafka are not lost? The write flow chart above already shows it: through the ACK response mechanism! When writing to the queue, the producer can set a parameter that determines how the Kafka cluster must acknowledge the data. The parameter can be set to 0, 1, or all: 1) 0 means the producer sends data to the cluster without waiting for any acknowledgment of success. Least safe, but most efficient. 2) 1 means the producer sends data and proceeds to the next message as soon as the leader acknowledges. This guarantees the leader received the data successfully. 3) all means the producer waits until all followers have finished synchronizing from the leader before sending the next message. This guarantees that the leader received the data and that every replica has a backup. Safest, but least efficient. Finally, what happens if we write to a topic that doesn't exist? Can the write succeed? Kafka automatically creates the topic, with the number of partitions and the number of replicas both defaulting to 1.
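The three partition-routing rules can be sketched in a few lines of Python. Note this is only an illustration: real Kafka clients use a murmur2 hash of the key bytes, and the byte-sum "hash" and all names here are toy stand-ins.

```python
import itertools

_round_robin = itertools.count()  # shared counter for rule 3

def choose_partition(num_partitions, partition=None, key=None):
    if partition is not None:                       # rule 1: explicit partition
        return partition
    if key is not None:                             # rule 2: hash the key
        return sum(key.encode()) % num_partitions   # toy hash, NOT murmur2
    return next(_round_robin) % num_partitions      # rule 3: round-robin

print(choose_partition(3, partition=2))    # explicit -> 2
print(choose_partition(3, key="user-42"))  # same key always -> same partition
print(choose_partition(3), choose_partition(3), choose_partition(3))
```

The key-hash rule is what gives Kafka per-key ordering: all messages with the same key land in the same partition, and within a partition writes are sequential.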
<2> Save the data
After the producer writes data to the Kafka cluster, the cluster needs to store it. Kafka stores data on disk, and writing to disk sounds like a time-consuming operation ill-suited to a high-concurrency component. But Kafka pre-allocates a chunk of disk space up front and always writes data sequentially by appending, which is far more efficient than random writes.
Partition structure:
As mentioned earlier, every topic can be divided into one or more partitions. If the topic feels abstract, the partition is quite concrete: it is a folder on disk, and that folder contains multiple groups of segment files. Each segment consists of a .index file, a .log file, and a .timeindex file. The .log file is where messages are actually stored, while the .index and .timeindex files are index files used to retrieve messages.
As shown in the figure above, this partition has three groups of segment files. Each .log file has the same maximum size, but the number of messages stored in each may differ. Each file is named after the smallest offset in its segment; for example, 000.index covers messages with offsets 0 through 368795. Kafka uses this segment + index scheme to keep lookups fast.
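The naming rule is mechanical: the base (minimum) offset of the segment, zero-padded to 20 digits, plus the extension. A small sketch:

```python
# Segment files are named after the base offset they start at,
# zero-padded to 20 digits, e.g. 00000000000000368796.log.
def segment_files(base_offset):
    stem = f"{base_offset:020d}"
    return [f"{stem}{ext}" for ext in (".log", ".index", ".timeindex")]

print(segment_files(0))
print(segment_files(368796))
```

This naming is what makes the first lookup step cheap: given any target offset, a binary search over the sorted file names finds the right segment without opening a single file.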
The Message structure:
The .log file is where messages are actually stored. We know the producer writes message after message into Kafka; what does a message look like inside the .log file? A stored message includes the message body, message size, offset, compression type, and so on. The three key fields are: 1) offset: an 8-byte ordered ID that uniquely determines the position of the message within the partition! 2) message size: 4 bytes, describing the size of the message. 3) message body: the actual message data (possibly compressed), whose size varies from message to message.
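The offset/size/body idea can be made concrete with a toy binary layout. To be clear, this is not Kafka's real on-disk record format (which also carries a CRC, timestamp, headers, and more); it only demonstrates the three fields described above:

```python
import struct

# Toy layout: 8-byte big-endian offset, 4-byte size, then the body.
def encode(offset, body):
    return struct.pack(">qi", offset, len(body)) + body

def decode(buf):
    offset, size = struct.unpack_from(">qi", buf, 0)
    body = buf[12:12 + size]          # header is 8 + 4 = 12 bytes
    return offset, size, body

raw = encode(368801, b"hello kafka")
print(decode(raw))  # -> (368801, 11, b'hello kafka')
```

Because each record carries its own size, a reader can scan a .log file sequentially, jumping from message to message, which is exactly what the final lookup step relies on.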
Storage policy:
Kafka keeps every message on disk, whether or not it has been consumed. So what is the deletion strategy for old data? 1) Time-based: the default retention is 168 hours (7 days). 2) Size-based: the default limit is 1073741824 bytes (1 GB). Note that Kafka reads a specific message in O(1) time, so deleting expired files does not improve Kafka's read performance; it only reclaims disk space!
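Retention is applied per segment, oldest first. A minimal sketch of the two checks, using the default thresholds quoted above (the segment representation is invented for illustration):

```python
RETENTION_MS = 168 * 60 * 60 * 1000   # 7 days
RETENTION_BYTES = 1073741824          # 1 GB

def segments_to_delete(segments, now_ms):
    """segments: oldest-first list of dicts with 'created_ms' and 'size'."""
    # Rule 1: too old.
    doomed = [s for s in segments if now_ms - s["created_ms"] > RETENTION_MS]
    # Rule 2: drop oldest segments until the total fits the size limit.
    total = sum(s["size"] for s in segments)
    for s in segments:
        if total <= RETENTION_BYTES:
            break
        if s not in doomed:
            doomed.append(s)
        total -= s["size"]
    return doomed

HOUR_MS = 3600 * 1000
segments = [
    {"created_ms": 0 * HOUR_MS,   "size": 600 * 1024 * 1024},  # 200 h old
    {"created_ms": 100 * HOUR_MS, "size": 600 * 1024 * 1024},  # 100 h old
    {"created_ms": 150 * HOUR_MS, "size": 600 * 1024 * 1024},  #  50 h old
]
doomed = segments_to_delete(segments, now_ms=200 * HOUR_MS)
print(len(doomed))  # first segment by age, second by total size
```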
<3> Consumption data
Once messages are stored in .log files, consumers can consume them. Just as the producer writes to the leader, the consumer asks the leader to pull messages. Multiple consumers can form a consumer group sharing one group id! Consumers in the same group can consume different partitions of the same topic, but a single partition is never consumed by more than one consumer in the same group!! Put plainly: the consumers in a group split the topic's partitions among themselves, and this is Kafka's consumption concurrency mechanism.
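The "split the partitions among the group" rule can be sketched with a simple round-robin assignment (real Kafka supports several assignment strategies such as range and round-robin; this is just the idea):

```python
# Each partition goes to exactly one consumer in the group;
# surplus consumers receive nothing.
def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

print(assign([0, 1, 2, 3], ["c1", "c2", "c3"]))  # c1 carries two partitions
print(assign([0, 1], ["c1", "c2", "c3"]))        # c3 sits idle
```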
The figure shows a consumer group with fewer consumers than partitions, so one consumer ends up consuming multiple partitions and is slower than a consumer handling only one partition! If a group has more consumers than partitions, will several consumers share one partition? As noted above, no! The surplus consumers simply consume no partition's data. So in practice, it is recommended to make the number of consumers in a group equal to the number of partitions; that is the ideal setup. Recall that each segment contains .log, .index, and .timeindex files, that messages live in the .log file, and that each message carries its offset, message size, message body, and so on. So how do we locate a message using segment + offset? Suppose we need to find the message whose offset is 368801. Look at the picture below:
1) Binary-search the segment files to find the segment containing offset 368801: that is the second segment, whose base offset is 368796. 2) Open that segment's .index file. The relative offset we need is 368801 - 368796 = 5. Because the .index file is a sparse index of (relative offset, physical position) pairs, an entry for relative offset 5 may not exist. Again using binary search, we find the largest entry whose relative offset is less than or equal to 5, which turns out to be the entry for relative offset 4. 3) That entry says the corresponding message starts at physical position 256 in the .log file. Open the .log file and scan sequentially from position 256 until the message with offset 368801 is found.
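Steps 2 and 3 above boil down to one binary search over a sparse index. A sketch, where the index entries are illustrative except for (4, 256), which matches the example:

```python
import bisect

segment_base = 368796
# Sparse .index: (relative offset, physical position in the .log file).
sparse_index = [(0, 0), (1, 70), (3, 150), (4, 256), (7, 520)]

def locate(target_offset):
    rel = target_offset - segment_base            # 368801 -> 5
    keys = [k for k, _ in sparse_index]
    i = bisect.bisect_right(keys, rel) - 1        # largest key <= rel
    return sparse_index[i]

print(locate(368801))  # -> (4, 256): start the sequential scan at byte 256
```

The index is sparse on purpose: it trades a short sequential scan in the .log file for a much smaller index that can be searched quickly (and kept in memory).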