I. Reliable delivery on the producer side
1. Ensure the message is sent by the producer.
2. Ensure the message is received by the MQ broker.
3. The sender receives an acknowledgement from the MQ broker.
4. A complete message compensation mechanism.
In real production it is hard to guarantee the first three points completely. For example, in extreme conditions the producer may fail to send the message, or the network may drop just as the sender is receiving the confirmation, so delivery reliability cannot be fully guaranteed. That is why the fourth point, a complete message compensation mechanism, is required.
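For point 3, the sender can be told by the broker whether a message was accepted through publisher confirms. Below is a minimal sketch using Spring AMQP, not a prescribed implementation: the exchange name, routing key, and message ID are made-up placeholders, and it assumes `spring.rabbitmq.publisher-confirm-type=correlated` is configured.

```java
import org.springframework.amqp.rabbit.connection.CorrelationData;
import org.springframework.amqp.rabbit.core.RabbitTemplate;

public class ConfirmingProducer {

    private final RabbitTemplate rabbitTemplate;

    public ConfirmingProducer(RabbitTemplate rabbitTemplate) {
        this.rabbitTemplate = rabbitTemplate;
        // Called by the broker for every message: ack = true means the broker accepted it.
        rabbitTemplate.setConfirmCallback((correlation, ack, cause) -> {
            String msgId = correlation != null ? correlation.getId() : null;
            if (ack) {
                // point 3: the sender has received the broker's acknowledgement
                System.out.println("broker confirmed message " + msgId);
            } else {
                // hand the message over to the compensation mechanism (point 4)
                System.out.println("broker rejected message " + msgId + ", cause: " + cause);
            }
        });
    }

    public void send(String msgId, Object payload) {
        // the CorrelationData id lets the confirm callback identify which message was acked
        rabbitTemplate.convertAndSend("biz.exchange", "biz.routing.key", payload,
                new CorrelationData(msgId));
    }
}
```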
II. Solutions used by large Internet companies
Option 1: persist the message to the database and mark its status. Specifically, the message is stored in the database with a status value, and the status of the record is updated when the consumer's confirmation is received. A polling task re-sends the messages for which no reply has arrived; note that a retry limit must be set here. Option 2: delayed delivery with a second confirmation and a callback check.
III. Option 1: persist the message to the database and mark its status
Flow chart: message persisted to the database and marked with a status
After the business data and the message have both been stored in the database, step 2 begins and the message is sent to the MQ broker. In the normal flow, the consumer listens for the message, the message's status is changed to "consumed" according to its unique ID, and an acknowledgement (ack) is returned to the listener. If something unexpected happens, for example the consumer never receives the message, or the network drops while the listener is receiving the confirmation, then a distributed scheduled task has to pick up the messages in the msg database that are still unconsumed after the timeout and send them again. The retry mechanism needs a retry limit: if sending keeps failing for some external reason, do not retry endlessly, or the retries will drag down the whole service. For example, after three failed attempts, set the message's status to 2 and handle it manually through the compensation mechanism. In real production this situation is fairly rare, but the compensation mechanism cannot be omitted, otherwise reliability cannot be achieved.
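To make the flow concrete, here is a minimal sketch of the producer-side retry task under option 1. It is not the implementation from the linked series: the `MsgDao` interface, the status values (0 = sending, 1 = consumed, 2 = failed), the timeout, and the exchange names are all assumptions for illustration.

```java
import java.util.List;

import org.springframework.amqp.rabbit.connection.CorrelationData;
import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.scheduling.annotation.Scheduled;

public class MsgRetryTask {

    // hypothetical DAO over the msg table: id, payload, status, retry_count, create_time
    interface MsgDao {
        List<MsgRecord> findUnconsumedOlderThan(int seconds); // status = 0 and timed out
        void markFailed(String msgId);                         // status = 2, manual compensation
        void incrementRetry(String msgId);
    }

    record MsgRecord(String id, Object payload, int retryCount) {}

    private static final int MAX_RETRIES = 3;

    private final MsgDao msgDao;
    private final RabbitTemplate rabbitTemplate;

    public MsgRetryTask(MsgDao msgDao, RabbitTemplate rabbitTemplate) {
        this.msgDao = msgDao;
        this.rabbitTemplate = rabbitTemplate;
    }

    // distributed scheduled task: re-send messages still unconsumed after the timeout
    @Scheduled(fixedDelay = 60_000)
    public void resendTimedOutMessages() {
        for (MsgRecord msg : msgDao.findUnconsumedOlderThan(300)) {
            if (msg.retryCount() >= MAX_RETRIES) {
                // give up after three attempts and leave it to manual compensation
                msgDao.markFailed(msg.id());
            } else {
                msgDao.incrementRetry(msg.id());
                rabbitTemplate.convertAndSend("biz.exchange", "biz.routing.key",
                        msg.payload(), new CorrelationData(msg.id()));
            }
        }
    }
}
```

When the consumer's confirmation comes back, the listener on the producer side would set the corresponding record's status to 1, so it drops out of the retry query.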
For the full code implementation, see my series: www.jianshu.com/c/c1785aa6c…
IV. Option 2: delayed delivery, second confirmation, and callback check
Think back to the first option: the producer side has to persist both the business data and the message data. Is that really appropriate under high concurrency? On a core link, every persistence step has to be weighed carefully, and each one costs roughly 100-200 milliseconds, which is unacceptable in high-concurrency scenarios. This is where the second option comes in. The flow chart is as follows.
The Upstream Server is our upstream service, the producer. Once the producer has successfully stored the business data, it generates two messages: one is sent immediately to the Downstream Server, and the other is a delayed message sent to the compensation service, the Callback Server.
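A minimal sketch of that sending step, assuming Spring AMQP and the rabbitmq_delayed_message_exchange plugin for the delayed message (a TTL queue plus dead-letter exchange would work as well); the exchange and routing key names are placeholders:

```java
import org.springframework.amqp.rabbit.core.RabbitTemplate;

public class UpstreamProducer {

    private final RabbitTemplate rabbitTemplate;

    public UpstreamProducer(RabbitTemplate rabbitTemplate) {
        this.rabbitTemplate = rabbitTemplate;
    }

    public void publishAfterBusinessCommit(String msgId, Object payload) {
        // 1. instant message for the downstream service
        rabbitTemplate.convertAndSend("downstream.exchange", "order.created", payload);

        // 2. delayed check message for the Callback Server (e.g. 5 minutes later)
        rabbitTemplate.convertAndSend("callback.check.exchange", "order.check", msgId, message -> {
            message.getMessageProperties().setDelay(5 * 60 * 1000);
            return message;
        });
    }
}
```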
In the normal case, the downstream service listens for the instant message and, after consuming it successfully, sends a confirmation message back to the Callback Server. Note that this is not just returning an ack to the broker: the downstream service actively sends a message to the Callback Server.
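A sketch of what that downstream consumer might look like; the queue and exchange names are made up, and the business processing is elided:

```java
import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.amqp.rabbit.core.RabbitTemplate;

public class DownstreamConsumer {

    private final RabbitTemplate rabbitTemplate;

    public DownstreamConsumer(RabbitTemplate rabbitTemplate) {
        this.rabbitTemplate = rabbitTemplate;
    }

    @RabbitListener(queues = "downstream.order.queue")
    public void onOrderMessage(String msgId) {
        // ... business processing of the instant message ...

        // not just an ack back to the broker: actively tell the Callback Server
        // that this message has been consumed successfully
        rabbitTemplate.convertAndSend("callback.confirm.exchange", "order.consumed", msgId);
    }
}
```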
The Callback Server listens for that confirmation message, learns that the message was consumed successfully, and persists this fact to its database. When the delayed message from the upstream service later reaches the Callback Server, the Callback Server queries its msg DB. If the record is there, the message has already been consumed. If it is not, the Callback Server makes an RPC call to the upstream service, telling it that the earlier message was not delivered and needs to be sent again. The upstream service then re-sends both the instant message and the delayed message, and the flow continues as before.
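And a sketch of the Callback Server's two listeners; `MsgDb` and `UpstreamClient` are hypothetical components standing in for the callback service's database and the RPC client back to the upstream service:

```java
import org.springframework.amqp.rabbit.annotation.RabbitListener;

public class CallbackServer {

    interface MsgDb {            // hypothetical store of consumed-message ids
        void save(String msgId);
        boolean exists(String msgId);
    }

    interface UpstreamClient {   // hypothetical RPC client to the upstream service
        void resend(String msgId);
    }

    private final MsgDb msgDb;
    private final UpstreamClient upstreamClient;

    public CallbackServer(MsgDb msgDb, UpstreamClient upstreamClient) {
        this.msgDb = msgDb;
        this.upstreamClient = upstreamClient;
    }

    // confirmation from the downstream service: the message was consumed
    @RabbitListener(queues = "callback.confirm.queue")
    public void onConsumed(String msgId) {
        msgDb.save(msgId);
    }

    // delayed check message from the upstream service
    @RabbitListener(queues = "callback.check.queue")
    public void onDelayedCheck(String msgId) {
        if (!msgDb.exists(msgId)) {
            // no confirmation arrived in time: ask the upstream service to send again
            upstreamClient.resend(msgId);
        }
    }
}
```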
The second option is not 100% reliable either; in extreme cases it still needs scheduled tasks and a compensation mechanism to back it up. But the core of the second option is to reduce database operations, and that is what matters!
In a high-concurrency scenario, the first concern is no longer 100% reliability but availability: can the system's performance withstand the traffic? So one database operation is taken off the critical path. With one less database write, the upstream service's performance improves, and the compensation logic is decoupled into the asynchronous Callback Server.
V. Conclusion
Both options are feasible; which one to choose depends on the actual business. The second option suits very high-concurrency scenarios, while the first option suits ordinary scenarios.