The interview questions
How do you ensure reliable message delivery? In other words, how do you handle the problem of message loss?
The interviewer's mindset
This one is a given: a basic rule of MQ is that there can be no extra data and no missing data. "No extra data" is the repeated-consumption and idempotency problem covered earlier; "no missing data" means messages must not get lost, and that is what you have to think through here.
If you are using MQ to deliver truly core messages, such as billing and fee-deduction messages, you must make sure they are never lost anywhere along the MQ delivery path.
Analysis of the interview question
Data loss can happen at the producer, inside the MQ itself, or at the consumer, so let's analyze RabbitMQ and Kafka separately.
RabbitMQ
The producer loses data
When a producer sends data to RabbitMQ, the message can be lost in transit because of network problems and the like.
One option is to use RabbitMQ transactions: before sending, the producer opens a transaction with channel.txSelect(). If RabbitMQ fails to receive the message, the producer gets an exception; at that point it can roll back the transaction with channel.txRollback() and retry the send. If the message is received, it commits the transaction with channel.txCommit().
// Open the transaction
channel.txSelect();
try {
    // Send the message here, e.g. channel.basicPublish(...)
    // Commit the transaction after a successful send
    channel.txCommit();
} catch (Exception e) {
    // The send failed: roll back the transaction
    channel.txRollback();
    // Then resend the message here
}
The problem, however, is that RabbitMQ transactions are synchronous: they basically kill your throughput, because they cost too much performance.
So in general, if you want to make sure messages written to RabbitMQ are not lost, you enable confirm mode instead. With confirm mode on, every message you write to RabbitMQ is assigned a unique ID. If RabbitMQ takes the message successfully, it sends you back an ack saying the message is fine. If RabbitMQ fails to process the message, it calls your nack callback instead, telling you the write failed so you can retry. On top of this mechanism you can track the status of each message ID in memory yourself, and if you have not received a callback for a message after a certain amount of time, you resend it.
The big difference between transactions and confirm mode is that transactions are synchronous: you commit and block until RabbitMQ answers. Confirm mode is asynchronous: after you send one message you can immediately send the next, and RabbitMQ calls back into your interface later to tell you whether the message was received.
So on the producer side, the confirm mechanism is generally the way to avoid losing data.
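As a rough sketch of what this looks like with the RabbitMQ Java client (the connection setup, queue name, and pending-map bookkeeping here are illustrative, not the one true way to do it):
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.ConfirmListener;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import java.util.concurrent.ConcurrentSkipListMap;

public class ConfirmProducer {
    public static void main(String[] args) throws Exception {
        Connection conn = new ConnectionFactory().newConnection();
        Channel channel = conn.createChannel();
        // Unconfirmed messages, keyed by the sequence number RabbitMQ assigns
        ConcurrentSkipListMap<Long, byte[]> pending = new ConcurrentSkipListMap<>();
        // Switch the channel into confirm mode
        channel.confirmSelect();
        channel.addConfirmListener(new ConfirmListener() {
            public void handleAck(long seqNo, boolean multiple) {
                // RabbitMQ has taken the message(s); stop tracking them
                if (multiple) pending.headMap(seqNo, true).clear();
                else pending.remove(seqNo);
            }
            public void handleNack(long seqNo, boolean multiple) {
                // RabbitMQ failed to handle the message(s); resend from `pending`
            }
        });
        byte[] payload = "billing message".getBytes();
        // Record the sequence number before publishing, then send asynchronously
        pending.put(channel.getNextPublishSeqNo(), payload);
        channel.basicPublish("", "billing-queue", null, payload);
    }
}
A production version would also scan pending periodically and resend anything that has gone unconfirmed for too long, as described above.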
RabbitMQ loses data
To prevent RabbitMQ itself from losing data, you must enable RabbitMQ's persistence: messages are persisted to disk after they are written, so even if RabbitMQ dies, it reads back the persisted messages after recovery and the data is not lost. The rare exception is RabbitMQ dying before a message has actually been persisted, which can lose a small amount of data, but the probability of that is low.
There are two steps to setting up persistence (see the sketch after this list):
- The first is to declare the queue as durable when you create it. This ensures RabbitMQ persists the queue's metadata, but it does not persist the messages in the queue.
- The second is to set the message's deliveryMode to 2 when sending it. This marks the message itself as persistent, so RabbitMQ will persist it to disk.
Both must be set together; only then will RabbitMQ recover both the queue and its messages from disk after it crashes and restarts.
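A minimal sketch of both steps with the RabbitMQ Java client (the queue name is illustrative, and an open channel is assumed):
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.MessageProperties;

// Step 1: durable = true (second argument), so the queue's metadata survives a restart
channel.queueDeclare("billing-queue", true, false, false, null);
// Step 2: publish with deliveryMode = 2; PERSISTENT_TEXT_PLAIN is a built-in
// properties object whose deliveryMode is already set to 2
channel.basicPublish("", "billing-queue",
        MessageProperties.PERSISTENT_TEXT_PLAIN,
        "billing message".getBytes());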
Note that even with persistence enabled there is still a small window: a message may have been accepted by RabbitMQ but not yet persisted to disk when RabbitMQ crashes, and that little bit of in-memory data is lost.
So persistence is usually combined with the producer's confirm mechanism: RabbitMQ sends the producer the ack only after the message has been persisted to disk. That way, even if RabbitMQ dies before persisting and the data is lost, the producer never receives the ack and can simply resend the message.
The consumer loses data
If the consumer loses data, it is mainly because your process dies in the middle of consuming (for example, it restarts): RabbitMQ thinks you have already consumed the message, and the data is lost.
The fix is RabbitMQ's ack mechanism. In short, you turn off RabbitMQ's automatic ack (this is done through the API) and have your own code send the ack each time it finishes processing a message. Then, if you die before finishing, there is no ack; RabbitMQ sees that the message was never fully processed and hands it to another consumer, so the message is not lost.
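A sketch of a manual-ack consumer with the RabbitMQ Java client (the queue name and the process() helper are hypothetical):
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;

public class ManualAckConsumer {
    public static void main(String[] args) throws Exception {
        Connection conn = new ConnectionFactory().newConnection();
        Channel channel = conn.createChannel();
        DeliverCallback onDeliver = (consumerTag, delivery) -> {
            long tag = delivery.getEnvelope().getDeliveryTag();
            try {
                process(delivery.getBody()); // your business logic (hypothetical)
                // Ack only after processing succeeds
                channel.basicAck(tag, false);
            } catch (Exception e) {
                // No ack: requeue so the message can be redelivered
                channel.basicNack(tag, false, true);
            }
        };
        // autoAck = false: RabbitMQ does not consider the message consumed until we ack
        channel.basicConsume("billing-queue", false, onDeliver, consumerTag -> { });
    }

    static void process(byte[] body) { /* ... */ }
}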
Kafka
The consumer loses data
The only way the consumer can lose data: you have received the message and the consumer auto-commits the offset, so Kafka believes you have already consumed it, but in fact you are only just about to process it, and you die before you do. The message is then lost.
This is much like RabbitMQ. Kafka commits offsets automatically by default, so you simply turn off auto-commit and commit the offset manually after processing, which guarantees the data is not lost. You may still get duplicate consumption: if you die after processing but before committing the offset, you will certainly consume that message once more, so you just have to keep processing idempotent.
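A sketch of manual offset commits with the Kafka Java client (broker address, group ID, and topic are illustrative, and process() is a hypothetical business-logic helper):
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "billing-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // The key setting: no automatic offset commits
        props.put("enable.auto.commit", "false");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("billing-topic"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                process(record.value()); // must be idempotent: re-runs on redelivery
            }
            // Commit only after every record in the batch has been processed
            consumer.commitSync();
        }
    }

    static void process(String value) { /* ... */ }
}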
A real problem we hit in production: our Kafka consumers wrote consumed data into an in-memory queue to buffer it first. Sometimes a message had only just been written to the in-memory queue when the consumer auto-committed its offset; then we restarted the system, and the buffered messages that had not yet been processed were lost.
Kafka loses data
A fairly common scenario here is a Kafka broker going down and the partition leader being re-elected. If some of the followers' data is not yet in sync at the moment the leader dies, and one of those followers is then elected leader, the unsynchronized data is gone. We ran into this in production: a Kafka leader machine crashed, a follower was switched to leader, and we found that some data had been lost.
Therefore, you must set at least four parameters as follows:
- Set the replication.factor parameter on the topic: this value must be greater than 1, so that each partition has at least 2 replicas.
- Set the min.insync.replicas parameter on the Kafka server: this value must be greater than 1, requiring the leader to see at least one follower still in contact with it, so that there is a follower left to take over when the leader dies.
- Set acks=all on the producer side: this requires that each piece of data is considered successfully written only after it has been written to all replicas.
- Set retries=MAX on the producer side (a very, very large value, meaning infinite retries): this requires the producer to retry indefinitely once a write fails; it blocks there until the write succeeds.
Our production environment is configured in such a way that, at least on the Kafka broker side, data will not be lost in the event of a leader switch due to a failure of the leader broker.
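As a sketch, the producer side of this configuration might look as follows (broker address and topic are illustrative; replication.factor and min.insync.replicas are set on the topic and broker, not here):
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ReliableProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Wait for all in-sync replicas before treating the write as successful
        props.put("acks", "all");
        // Effectively infinite retries when a write fails
        props.put("retries", Integer.MAX_VALUE);

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.send(new ProducerRecord<>("billing-topic", "key", "billing message"));
        producer.close();
    }
}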
Will the producer lose the data?
If you set acks=all, writes will not be lost. The requirement is: the write is only considered successful after the leader has received the message and all followers have finished replicating it. If that condition is not met, the producer automatically retries, without limit.