Preface

To answer this question, let's follow a message through its whole flow: from the producer that generates it, to the Kafka service that stores it (the broker), to the consumer that reads it. Along the way we can identify where Kafka can lose messages and how Kafka can be configured to make sure messages are not lost.


Every service has its own clear responsibilities, and Kafka is no exception. Kafka's guarantee that messages are not lost rests on one important premise: the message must be "committed." If a message never reaches the Kafka service, or if it arrives but does not meet the definition of "committed," Kafka treats it as uncommitted and makes no delivery guarantee for it.


Let's take a look at the process.

The production end

We often hear complaints that the Kafka producer loses messages after sending them. In fact, the producer's send() interface can be used in three modes:

  1. Fire-and-forget. Calling send(record) lets the producer fire off the message without caring whether it reaches the Kafka service. This mode has the highest performance, but it is also the least reliable and may lose messages.
  2. Sync. Calling send(record).get() makes the producer block and wait for a response from the Kafka service. The send(record) method itself is asynchronous, but chaining get() on the returned Future turns the call into a synchronous, blocking one. This mode has the highest reliability, but much worse performance: the next message cannot be sent until the previous one has been acknowledged.
  3. Async. Calling send(record, callback) lets the producer neither ignore the result entirely as in mode 1 nor block and wait as in mode 2. The message is sent asynchronously, and if the Kafka service returns an error, the callback tells the application exactly what happened so it can decide how to handle it. For the same partition, callbacks are also invoked in order. In production applications, most people choose this third mode and use the callback to confirm whether the message was actually committed. On top of that, since sends can fail due to transient problems such as network jitter, we can configure a retry mechanism. In short, the producer is responsible for making sure messages are "committed" to the Kafka service. A minimal sketch of all three modes follows this list.
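To make the three modes concrete, here is a minimal sketch using the Java producer API. The broker address localhost:9092 and the topic name demo-topic are assumptions for illustration, not values from this article.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

import java.util.Properties;

public class SendModes {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("demo-topic", "key", "value");   // hypothetical topic

            // 1. Fire-and-forget: send and never check the result.
            producer.send(record);

            // 2. Sync: block until the broker responds (or an exception is thrown).
            RecordMetadata meta = producer.send(record).get();
            System.out.printf("sync send landed at offset %d%n", meta.offset());

            // 3. Async: attach a callback that reports success or failure per record.
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();   // log / retry / alert as the application requires
                } else {
                    System.out.printf("committed to %s-%d@%d%n",
                            metadata.topic(), metadata.partition(), metadata.offset());
                }
            });
        }
    }
}
```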

Important producer configuration parameters:

  1. Set acks = all. This parameter indicates how many replicas must confirm receipt of a message before the producer considers the send successful. It has three settings:

    • acks = 0. The producer sends the message without waiting for any response from the Kafka server. This is the least reliable setting: the producer is unaware of any exception on the Kafka side.

    • acks = 1. After sending a message, the producer only needs the leader replica to write it successfully in order to receive a success response; it does not wait for the follower replicas to synchronize. This is a compromise between reliability and performance. Messages can still be lost, however: if the leader replica crashes after the write succeeds and the response is returned, but before the followers have synchronized, the message is gone.

    • acks = all (equivalent to acks = -1). After sending a message, the producer receives a success response only after all in-sync replicas (ISR) have written it. This is the most reliable setting, but it needs to work together with another parameter, min.insync.replicas, described below.


  2. Set retries to a value greater than 0. This parameter sets how many times the producer retries after a send fails. Note that when retries is enabled, it is recommended to set max.in.flight.requests.per.connection = 1; otherwise the message order within a partition may not be preserved.

    The max.in.flight.requests.per.connection parameter specifies how many requests a producer can have in flight before receiving a response from the server; the default is 5. Setting it to 1 ensures that messages are written in order. If the value is greater than 1 and retries are enabled, reordering can occur: for example, if the first batch fails to be written and the second succeeds, the producer resends the first batch, and if that retry succeeds, the first batch ends up after the second, out of order.

    The retry.backoff.ms parameter specifies the time interval between retries, so that the producer does not hammer the broker with back-to-back retries that are likely to fail again. The default value is 100 milliseconds. A sketch combining these settings follows this list.
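Putting the parameters above together, a reliability-oriented producer configuration might look like the following sketch. The broker address is an assumption and the numeric values are illustrative, not mandatory.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;

public class ReliableProducerConfig {
    // Builds a producer with the reliability-oriented settings discussed above.
    // The broker address is an assumption; the concrete values are illustrative.
    static Producer<String, String> buildProducer() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");        // assumed broker address
        props.put("acks", "all");                                 // wait for all in-sync replicas
        props.put("retries", 10);                                 // retry transient send failures
        props.put("retry.backoff.ms", 100);                       // pause between retries (default 100 ms)
        props.put("max.in.flight.requests.per.connection", 1);    // preserve per-partition order when retrying
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        return new KafkaProducer<>(props);
    }
}
```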


The broker service

When a message arrives at the Kafka service, Kafka persists it according to our configuration. Only after this is done does Kafka consider the message "committed." So let's look at which broker-side configurations affect message loss.

  1. Set replication.factor >= 3. This is the number of replicas for each topic partition. In production environments, it is recommended to have at least three brokers, so that each broker can hold one replica of the data, improving reliability.

  2. Set min.insync.replicas > 1. When acks is set to all, this parameter specifies the minimum number of replicas a message must be written to before the write is considered successful; it is the concrete constraint behind acks=all. Make sure replication.factor > min.insync.replicas rather than setting them equal: if they are equal, losing a single replica leaves the partition unable to accept any writes at all.

    For example, suppose acks=all and min.insync.replicas=2 are configured, with replication.factor=3:

    1. All three replicas are alive (in-sync replica count is 3). Because acks=all, a message must be written to all three replicas; 3 is greater than the min.insync.replicas value of 2, so the lower limit is not touched.
    2. One replica goes down (in-sync replica count drops to 2). Because acks=all, a message must be written to both remaining replicas to succeed; 2 equals min.insync.replicas, so we are exactly at the lower limit but have not violated it.
    3. Two replicas go down (only 1 in-sync replica is left). acks=all by itself would only require the message to be written to that single replica, but 1 is below min.insync.replicas=2, so the broker rejects the write instead of accepting an under-replicated message that could easily be lost.
  3. Set unclean.leader.election.enable = false. This parameter controls whether replicas that are not in the in-sync replica set (ISR) are allowed to be elected as partition leader. The default is false, and you are advised not to set it to true: a replica outside the ISR may lag far behind the original leader, and if it becomes the new leader, the messages it never caught up on are inevitably lost. The sketch after this list shows one way to apply these topic-level settings.
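As a sketch of how these broker-side recommendations can be applied at the topic level, the following uses the Kafka AdminClient to create a topic with replication factor 3, min.insync.replicas=2, and unclean leader election disabled. The topic name and broker address are assumptions for illustration; the same settings can also be placed in the broker's server.properties as cluster-wide defaults.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReliableTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 3, plus the topic-level overrides discussed above.
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 3)
                    .configs(Map.of(
                            "min.insync.replicas", "2",
                            "unclean.leader.election.enable", "false"));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```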

The consumer end

Before getting to the consumer side, let's first explain the ISR concept mentioned above. There are three related concepts:

  1. AR. All the replicas of a partition are called the Assigned Replicas (AR).
  2. ISR. The replicas (including the leader itself) that are keeping up with the leader within a certain tolerance are called the In-Sync Replicas (ISR). The ISR set is a subset of the AR and can be viewed as the replicas whose data is reasonably close to the leader's.
  3. OSR. The replicas (excluding the leader) that lag too far behind the leader are called the Out-of-Sync Replicas (OSR). The OSR set is also a subset of the AR and is disjoint from the ISR.

A Kafka consumer commits its consumption offset to the server, so that even if the consumer goes down, it can resume pulling messages from the correct position once it comes back online, without losing messages. Here is an example based on the official documentation:

Let's look at an example of how Kafka records offsets. The following is a Kafka log file with 9 messages: the first message has an offset of 0 (the LSO, Log Start Offset), the last message has an offset of 8, and the next offset to be written is 9 (the LEO, Log End Offset).


The HW is known as the High Watermark. It marks the upper bound of what the consumer can consume: the consumer can only pull messages whose offset is below the HW.


The Log End Offset (LEO) identifies the offset of the next message to be written in the current log file.


Each replica in the partition's ISR maintains its own LEO, and the smallest LEO in the ISR set is the partition's HW, so consumers can only consume messages before the HW. In the figure above, messages with offsets 0 through 5 have been written to every replica, so consumers can pull them; messages with offsets 6 through 8 have only been written to some replicas in the ISR, while the other replicas are still catching up, so consumers cannot fetch them yet.
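This is not Kafka's internal code, but the rule itself can be illustrated with a tiny sketch using the numbers from the example above:

```java
// Simplified illustration only: partition HW = smallest LEO among the ISR replicas.
long leaderLeo = 9;      // leader has written offsets 0..8
long follower1Leo = 9;   // fully caught up
long follower2Leo = 6;   // has only synchronized offsets 0..5
long highWatermark = Math.min(leaderLeo, Math.min(follower1Leo, follower2Leo)); // = 6
// Consumers may only fetch messages with offset < highWatermark, i.e. offsets 0..5.
```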

Now let's break down the process. Assume the ISR of a partition contains three replicas, one leader and two followers, and that the partition's LEO and HW are both 3. Messages 3 and 4 are sent from the producer and are about to be stored in the leader replica, as shown in the figure below:


After the messages have been written to the leader replica, the follower replicas request to synchronize them, as shown below:



During synchronization, the followers make progress at different rates: some replicas are fully caught up while others are only partially synchronized. In the figure below, every replica has synchronized message 3, so the HW moves forward; but follower 2 has synchronized only message 3 and not message 4, so the HW takes the smallest LEO across the partition's ISR, which makes the HW 4.



Once all replicas have successfully written messages 3 and 4, the HW and LEO of the whole partition are both 5, so consumers can consume all messages with offsets less than 5.



This process shows clearly how Kafka updates offsets. On the consumer side, the rule is to finish consuming (processing) a message before committing its offset, which maximizes the guarantee that messages are not lost. A sketch of this pattern is shown below.
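As a sketch of that "process first, commit afterwards" pattern, the following consumer disables auto-commit and only commits offsets after the fetched records have been handled, so a crash mid-processing causes messages to be replayed rather than lost. The topic, group id, and broker address are assumptions for illustration.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "demo-group");                 // hypothetical consumer group
        props.put("enable.auto.commit", "false");             // commit only after processing
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));   // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Process the message first ...
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
                // ... then commit the offsets for everything that was processed.
                consumer.commitSync();
            }
        }
    }
}
```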

Conclusion

We followed a message's flow through three modules: the producer that sends it, the Kafka service that stores it, and the consumer that reads it, and analyzed Kafka's mechanisms for not losing messages, along with some practical configuration parameters. Of course, there are more details and practices left to explore. But I believe that after today, we can answer with some confidence the question of how Kafka avoids losing messages!

Writing is not easy, so let's keep learning together! Your likes are what keep me going.