This is the 16th day of my participation in the November Gwen Challenge.

This article discusses how to configure Kafka for zero message loss.

Establishing problem boundaries

We start by establishing the boundaries of the problem because no guarantee is absolute: zero message loss can only be achieved under stated premises. Two prerequisites must hold for Kafka to guarantee that messages are not lost:

  1. The message must be committed. Committed means that the producer has sent the message to the Broker, the Broker has saved it successfully, and the Broker has told the producer that the message was committed. A configuration parameter determines how many Brokers in the cluster must save a message before it counts as committed.
  2. At least one Broker must be alive in the cluster. This is easy to understand: if the entire Kafka cluster is down, for example because every data center has lost power, nothing can guarantee that messages will not be lost.

How messages are lost

To avoid message loss, we first need to know how messages get lost. This can be analyzed from three sides: the producer, the Broker, and the consumer.

Producer

The Kafka producer sends messages with the method producer.send(msg). This method is asynchronous: when the call returns, the message has only been handed to the producer. It has not necessarily reached the Broker, and it has not necessarily been committed. If the Broker never receives the message, because the network is unavailable or because the message itself is invalid (for example, too large for the Broker to accept), the message is lost on the producer side.

Therefore, the producer should send messages with producer.send(msg, callback), the variant that takes a callback, so that we learn whether each message was sent successfully or failed. When a send fails, we can handle that failure specifically.

If a message is lost due to a sending failure, the responsibility lies with the producer and the producer should handle it accordingly.
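
The difference between the two calls can be sketched with a toy model. The snippet below is plain Python, not the Kafka client: ToyProducer, send_all, and the broker_reachable flag are hypothetical names invented for illustration, and flush() stands in for the background I/O that a real producer performs. It only shows that a fire-and-forget send returns normally even when delivery fails, while a callback surfaces the failure so that the application can react.

```python
class ToyProducer:
    """Toy stand-in for an asynchronous producer (NOT the Kafka client API)."""

    def __init__(self, broker_reachable):
        self.broker_reachable = broker_reachable
        self.pending = []     # messages buffered by send() but not yet delivered
        self.delivered = []   # messages the simulated broker accepted

    def send(self, msg, callback=None):
        # Returns immediately: the message is only buffered at this point.
        self.pending.append((msg, callback))

    def flush(self):
        # Simulates the background step that actually talks to the broker.
        for msg, callback in self.pending:
            error = None if self.broker_reachable else ConnectionError("broker unreachable")
            if error is None:
                self.delivered.append(msg)
            if callback is not None:
                callback(msg, error)   # the callback is the only place failures show up
        self.pending.clear()


def send_all(producer, messages, use_callback):
    """Send every message and return the ones the caller knows have failed."""
    failed = []

    def on_completion(msg, error):
        if error is not None:
            failed.append(msg)   # here the application could retry or log

    for msg in messages:
        producer.send(msg, on_completion if use_callback else None)
    producer.flush()
    return failed
```

With an unreachable broker, send_all(producer, messages, use_callback=False) returns an empty list, so the loss goes unnoticed. With use_callback=True, it returns exactly the messages that still need to be handled.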

Broker

Messages lost on the Broker side are usually lost because the Broker is unavailable. If a message is lost because the Broker went down, Kafka does not regard it as a committed message, so this case falls outside the boundaries of our discussion.

Consumer

The last case is message loss caused by the consumer.

While consuming messages, the consumer updates its consumer offset, which records which messages have already been consumed. The question is the order of the two steps: process the message first and update the offset only after processing succeeds, or update the offset first and then process the message.

If the offset is updated first, the message is lost whenever processing then fails, or whenever the consumer goes down after the offset is updated but before the message is processed.

If the message is processed first and the offset is updated afterwards, the same message may be consumed more than once. We can handle that by making the consumer's processing logic idempotent.
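
The two orderings can be compared with a small simulation. This is a sketch under stated assumptions rather than real consumer code: run_consumer and apply_idempotently are hypothetical helpers, the consumer crashes exactly once between its two steps, and each message is assumed to carry a unique ID (here the message string itself) so that duplicates can be recognized.

```python
def run_consumer(messages, commit_first, crash_at):
    """Simulate a consumer that crashes once, between its two steps, at index crash_at.

    Returns every message that was actually processed, in order.
    """
    committed = 0        # committed offset: where the consumer resumes after a restart
    processed = []
    crashed = False
    steps = ["commit", "process"] if commit_first else ["process", "commit"]
    while committed < len(messages):
        offset = committed                      # (re)start from the committed offset
        for i, step in enumerate(steps):
            if step == "commit":
                committed = offset + 1
            else:
                processed.append(messages[offset])
            # Crash after the first step, before the second one runs.
            if i == 0 and offset == crash_at and not crashed:
                crashed = True
                break                           # restart from the committed offset
    return processed


def apply_idempotently(processed):
    """Apply each message at most once, assuming the message doubles as a unique ID."""
    seen = set()
    applied = []
    for msg in processed:
        if msg in seen:
            continue                            # duplicate delivery: skip the side effect
        seen.add(msg)
        applied.append(msg)
    return applied
```

For the messages m0, m1, m2 with a crash at index 1, committing first processes only m0 and m2, so m1 is lost. Processing first handles m1 twice, and the idempotent step reduces that to a single application.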

How to configure

Based on the above analysis, the configuration scheme is as follows:

  • Producer side:
    • Send messages with producer.send(msg, callback) rather than producer.send(msg).
    • Set acks = all, so that a message counts as "committed" only after all in-sync replicas have received it.
    • Set the producer's retries to a large enough value, so that transient network problems do not cause a send to fail.
  • Broker side:
    • Set unclean.leader.election.enable = false, so that a Broker that has fallen too far behind cannot be elected as the new Leader.
    • Keep at least three replicas (replication.factor >= 3), so that redundancy protects against loss.
    • Set min.insync.replicas to specify how many replicas a message must be written to before it counts as committed. The value must be greater than 1.
    • Make sure the number of replicas is greater than min.insync.replicas. Otherwise the partition becomes unavailable as soon as a single replica goes down. The recommended setting is replication.factor = min.insync.replicas + 1.
  • Consumer side:
    • Set enable.auto.commit = false to turn off automatic offset commits, and commit offsets manually after each message has been processed.
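
The checklist can also be written down as plain key/value settings with a small sanity check. The names acks, unclean.leader.election.enable, min.insync.replicas, and enable.auto.commit come from the list above. I am assuming that retries and default.replication.factor are the Kafka properties that correspond to the retry count and the number of copies. The concrete values and the helper check_no_loss_settings are illustrative assumptions, not part of Kafka.

```python
PRODUCER = {
    "acks": "all",
    "retries": 2147483647,                     # illustrative "large enough" value
}
BROKER = {
    "unclean.leader.election.enable": "false",
    "default.replication.factor": 3,           # illustrative: min.insync.replicas + 1
    "min.insync.replicas": 2,
}
CONSUMER = {
    "enable.auto.commit": "false",             # offsets are committed manually instead
}


def check_no_loss_settings(producer, broker, consumer):
    """Return a list of rule violations; an empty list means every rule holds."""
    problems = []
    if str(producer.get("acks")) not in ("all", "-1"):   # -1 is equivalent to all
        problems.append("acks must be all")
    if int(producer.get("retries", 0)) <= 0:
        problems.append("retries must be positive")
    if str(broker.get("unclean.leader.election.enable")).lower() != "false":
        problems.append("unclean leader election must be disabled")
    replicas = int(broker.get("default.replication.factor", 1))
    min_isr = int(broker.get("min.insync.replicas", 1))
    if min_isr <= 1:
        problems.append("min.insync.replicas must be greater than 1")
    if replicas <= min_isr:
        problems.append("replication factor must exceed min.insync.replicas")
    if str(consumer.get("enable.auto.commit")).lower() != "false":
        problems.append("auto commit must be disabled")
    return problems
```

Calling check_no_loss_settings(PRODUCER, BROKER, CONSUMER) returns an empty list. Raising min.insync.replicas to 3 while keeping three replicas is flagged, because a single replica failure would then block writes.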