When an interviewer sees that a candidate’s project uses MQ technology (such as Kafka, RabbitMQ, or RocketMQ), one question almost always comes up: how do you ensure that 100% of messages are not lost while using MQ?

This question is common in practice; it both tests a candidate’s knowledge of MQ middleware and helps distinguish the candidate’s level of competence. Starting from this question, we will walk through the basic knowledge and answering approach we should master, as well as the follow-up points the interviewer may probe.

Case background

Take JD.com as an example. When users purchase goods, they often choose to pay part of the amount with Jingdou (JD beans). In this process, the transaction service and the Jingdou service communicate through an MQ message queue: when an order is placed, the transaction service sends a “deduct 100 Jingdou from account X” message to the MQ queue, and the Jingdou service consumes this command on the consumer side to perform the real deduction.

What problems might arise in this process?

Case analysis

Keep in mind that in Internet-company interviews, the most direct purpose of introducing MQ middleware is system decoupling and flow control, which in turn address the high-availability and high-performance requirements of Internet systems.

  • System decoupling: an MQ message queue isolates upstream and downstream systems from each other’s changes and instability. For example, no matter how the Jingdou service changes, the transaction service needs no modification; even if the Jingdou service fails, the main trading flow can keep running by degrading the Jingdou deduction. Decoupling the transaction service from the Jingdou service in this way improves the availability of the system.
  • Flow control: in scenarios with sudden traffic spikes, such as flash sales, MQ can also be used to “shave peaks and fill valleys”, letting the flow be consumed at whatever rate the downstream can handle.

But while introducing MQ enables decoupling and flow control, it also brings new problems.

Introducing MQ middleware to achieve system decoupling affects the consistency of data transmitted between systems.

In a distributed system, whenever data is synchronized between two nodes, data consistency problems arise. In this article, that is the problem of message-data consistency between producer and consumer, in other words, how to ensure that messages are not lost.

Likewise, introducing MQ middleware for flow control means the consumer side may lack the capacity to keep up, resulting in a message backlog, which is another problem we need to solve.

So you will find that these questions are linked, and the interviewer uses them to examine the coherence of our problem-solving approach and our mastery of the underlying knowledge.

So how should we answer the question “How do you ensure messages are not lost when using an MQ message queue?” First, we need to break it down:

  • How do I know if a message is lost?
  • What links may lose messages?
  • How do I ensure that messages are not lost?

When answering, we want to show the interviewer our analysis before offering a solution: network transmission is unreliable, so to keep messages from being lost we first need to know at which stages a message can be lost and how we can tell whether a message has been lost; only then do we present the solution (rather than blurting it out directly).

“Architecture” represents the architect’s thought process, and “design” is the final solution. Both are indispensable.

Case answer

Let’s first look at where a message can be lost. The journey of a message from production to consumption can be divided into three stages: the message production stage, the message storage stage, and the message consumption stage.

  • Message production stage: from the moment a message is created until it is submitted to MQ. As long as the producer receives an ACK from the MQ Broker, the send has succeeded, so no message is lost at this stage provided return values and exceptions are handled properly (for example, by retrying on failure).
  • Message storage stage: this stage is normally handled by the MQ middleware itself, but you need to understand how it works, for example, the Broker keeping replicas so that a message is synchronized to at least two nodes before the ACK is returned.
  • Message consumption stage: the consumer pulls messages from the Broker. No message is lost here as long as the consumer does not acknowledge the message to the Broker immediately on receipt, but only after the business logic has executed successfully.
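To make the first and third stages concrete, here is a minimal sketch, assuming Kafka as the MQ. The producer waits for the Broker’s ACK (acks=all, which together with the Broker-side replication settings such as replication.factor and min.insync.replicas covers the storage stage) and handles send failures in a callback, while the consumer disables auto-commit and acknowledges offsets only after the business logic succeeds. The Broker address, topic, consumer group, and the deductJingdou helper are illustrative.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ReliableMessagingSketch {

    // Production stage: only treat a send as successful after the Broker ACKs it.
    public static void produce(String topic, String payload) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all");   // wait until the message is replicated to all in-sync replicas
        props.put("retries", 3);    // retry transient send failures

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>(topic, payload), (metadata, exception) -> {
                if (exception != null) {
                    // handle the failure: log it, retry, or record it for later compensation
                    System.err.println("send failed, needs compensation: " + exception.getMessage());
                }
            });
            producer.flush();
        }
    }

    // Consumption stage: commit the offset only after the business logic has succeeded.
    public static void consume(String topic) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "jingdou-service");            // hypothetical consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");            // no automatic acknowledgement

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of(topic));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    deductJingdou(record.value());   // business logic first
                }
                consumer.commitSync();               // acknowledge only after success
            }
        }
    }

    private static void deductJingdou(String command) {
        // placeholder for the real deduction logic
    }
}
```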

However, in a distributed system, failures are inevitable. As the message producer, we cannot be certain whether MQ has lost our messages or whether the consumer has actually consumed them. So, following the principle of “Design for Failure”, we still need a mechanism to detect whether messages have been lost.

Next, you can explain to the interviewer how to implement message-loss detection.

The overall solution: on the producer side, attach a globally unique ID or a continuously increasing version number to every message sent, and then verify it on the consumer side.

How to implement it?

We can use an interceptor mechanism. Before the producer sends a message, an interceptor injects the version number into it (the version number can be a continuously increasing ID or a distributed globally unique ID). After the consumer receives the message, another interceptor checks the continuity of the version numbers, or the consumption status. The advantage of this approach is that the detection code does not intrude into the business code, and lost messages can be located by a separate task for further investigation.

Note: if there are multiple producers and consumers at the same time, the incrementing version number is hard to implement because its uniqueness cannot be guaranteed; in that case only the globally unique ID scheme works. The implementation principle is otherwise the same as with the incrementing version number.
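As a sketch of the interceptor idea, assuming Kafka: a ProducerInterceptor stamps each outgoing record with a unique ID header before it is sent. The header name x-message-id and the use of UUID here are illustrative choices; a matching ConsumerInterceptor would read the header and record the consumption status for later reconciliation.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.UUID;

import org.apache.kafka.clients.producer.ProducerInterceptor;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

/** Injects a globally unique message ID into a header before every send. */
public class MessageIdInterceptor implements ProducerInterceptor<String, String> {

    @Override
    public ProducerRecord<String, String> onSend(ProducerRecord<String, String> record) {
        // UUID stands in for whatever global-ID scheme is chosen (e.g. Snowflake)
        record.headers().add("x-message-id",
                UUID.randomUUID().toString().getBytes(StandardCharsets.UTF_8));
        return record;
    }

    @Override
    public void onAcknowledgement(RecordMetadata metadata, Exception exception) {
        // optionally record the ID of every ACKed message for a separate reconciliation task
    }

    @Override
    public void close() { }

    @Override
    public void configure(Map<String, ?> configs) { }
}
```

The interceptor is registered on the producer through the interceptor.classes configuration, so the business code that calls send() never sees it.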

Now that we know which stages can go wrong and how to detect message loss, it is time to present the design for preventing message loss.

After you answer “How do I ensure that messages are not lost?”, the interviewer will often follow up with “How do you handle messages being consumed more than once?”

For example, if a failure occurs during message consumption, the sender retries through a compensation mechanism, and those retries may produce duplicate messages. How do we handle that?

In other words, the problem is how to make the consumer side idempotent (an idempotent operation can be executed multiple times with the same effect as executing it once). Once the consumer side is idempotent, duplicate consumption of messages is no longer a problem.

Again, take the Jingdou deduction example: deduct 100 Jingdou from account X. We can make this business logic idempotent by restructuring it.

The simplest implementation is to create a message log table in the database with two fields: message ID and message execution status.

Our message consumption logic then becomes: insert a record into the message log table, and then asynchronously update the user’s Jingdou balance based on that record.

Because we check whether the message ID already exists before inserting (or simply rely on a unique constraint on the message ID), the same message is never executed more than once, which gives us an idempotent operation. Following the same idea, we can use not only a relational database but also Redis in place of the database to implement the unique-constraint scheme.
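Here is a minimal sketch of the message log approach, assuming a relational database reached over JDBC; the table name message_log, its column layout, and the deductJingdou helper are illustrative.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.SQLIntegrityConstraintViolationException;

/**
 * Idempotent consumption backed by a message log table, e.g.:
 *   CREATE TABLE message_log (
 *     message_id VARCHAR(64) PRIMARY KEY,   -- the unique constraint is the idempotency guard
 *     status     TINYINT NOT NULL           -- 0 = received, 1 = processed
 *   );
 */
public class IdempotentConsumer {

    private final Connection connection;   // assumed to be managed elsewhere (pool, transaction manager)

    public IdempotentConsumer(Connection connection) {
        this.connection = connection;
    }

    /** Returns true if this call processed the message, false if it was a duplicate. */
    public boolean handle(String messageId, String accountId, int amount) throws SQLException {
        try (PreparedStatement insert = connection.prepareStatement(
                "INSERT INTO message_log (message_id, status) VALUES (?, 0)")) {
            insert.setString(1, messageId);
            insert.executeUpdate();
        } catch (SQLIntegrityConstraintViolationException duplicate) {
            return false;   // message ID already exists: a duplicate delivery, skip the business logic
        }

        deductJingdou(accountId, amount);   // the real business operation

        try (PreparedStatement done = connection.prepareStatement(
                "UPDATE message_log SET status = 1 WHERE message_id = ?")) {
            done.setString(1, messageId);
            done.executeUpdate();
        }
        return true;
    }

    private void deductJingdou(String accountId, int amount) {
        // placeholder for the actual balance update
    }
}
```

With Redis instead of a database, a SET with the NX option (set only if the key does not exist) on the message ID plays the same role as the unique constraint.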

A prerequisite for solving both “message loss” and “duplicate message consumption” is a globally unique ID generation scheme. This is one of the questions interviewers like to ask and one you need to master.

In distributed systems, common ways to generate globally unique IDs include database auto-increment primary keys, UUIDs, Redis, and Twitter’s Snowflake algorithm. I have summarized the characteristics of these options for your reference.

Be aware that whichever approach you choose, there are trade-offs among simplicity, high availability, and high performance, so we should explain the balance we struck from the standpoint of the real business. Personally, I prefer the Snowflake algorithm, and in projects I have modified it, mainly to make the ID generation rules better fit business characteristics and to mitigate problems such as clock rollback.
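For reference, here is a minimal Snowflake-style generator, not the exact variant used in any particular project: 41 bits of timestamp, 10 bits of worker ID, and a 12-bit per-millisecond sequence, with the simplest possible handling of clock rollback (refusing to issue IDs until the clock catches up). The custom epoch is an arbitrary illustrative value.

```java
/** A minimal Snowflake-style ID generator: timestamp | worker ID | sequence. */
public class SnowflakeIdGenerator {

    private static final long EPOCH = 1672531200000L;   // 2023-01-01, an arbitrary custom epoch
    private static final long WORKER_BITS = 10L;
    private static final long SEQUENCE_BITS = 12L;
    private static final long MAX_WORKER_ID = (1L << WORKER_BITS) - 1;
    private static final long SEQUENCE_MASK = (1L << SEQUENCE_BITS) - 1;

    private final long workerId;
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public SnowflakeIdGenerator(long workerId) {
        if (workerId < 0 || workerId > MAX_WORKER_ID) {
            throw new IllegalArgumentException("workerId out of range");
        }
        this.workerId = workerId;
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now < lastTimestamp) {
            // clock rollback: simplest strategy is to refuse to generate IDs until the clock catches up
            throw new IllegalStateException("clock moved backwards by " + (lastTimestamp - now) + " ms");
        }
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & SEQUENCE_MASK;
            if (sequence == 0) {                       // sequence exhausted within this millisecond
                while (now <= lastTimestamp) {
                    now = System.currentTimeMillis();  // spin until the next millisecond
                }
            }
        } else {
            sequence = 0L;
        }
        lastTimestamp = now;
        return ((now - EPOCH) << (WORKER_BITS + SEQUENCE_BITS))
                | (workerId << SEQUENCE_BITS)
                | sequence;
    }
}
```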

Besides “How do I handle messages being consumed more than once?”, the interviewer will also ask about message backlog. The reason is that a backlog is a performance problem, and resolving it shows that a candidate can handle consumption in high-concurrency scenarios.

When answering this question, we still want to convey a thought process like this to the interviewer:

If a backlog occurs, it must be a performance problem. To solve a performance problem anywhere on the path from production to consumption, we need to know where a backlog can occur and then work out how to resolve it.

Since a backlog occurs after messages are sent, it has little to do with the producer side; and since most message queue nodes can handle tens of thousands of messages per second, the middleware’s storage is rarely the bottleneck compared with the business logic. The problem is almost certainly in the message consumption stage, so how do we answer from the consumer’s perspective?

If the backlog appears suddenly in production, the first step is to scale out temporarily by adding consumer instances, while downgrading some non-core business at the same time. Absorbing the traffic through scaling and degradation demonstrates your ability to handle emergencies.

Next, troubleshoot the underlying anomaly: for example, use monitoring and logs to analyze whether the consumer’s business logic has problems, and optimize the consumer’s processing logic.

Finally, if the root cause is simply insufficient processing capacity, the consumer side can be scaled horizontally to increase concurrency. One point deserves special attention: when adding consumer instances, the number of partitions of the topic must be expanded at the same time, so that the numbers of consumer instances and partitions stay equal. If consumer instances outnumber partitions, the extra instances do nothing, because each partition is consumed single-threaded.

For example, in Kafka a topic can be configured with multiple partitions, and data is written across them, but Kafka specifies that a partition can be consumed by only one consumer in a group, so consumer processing capacity is increased by adding partitions; a sketch of expanding partitions follows.
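As an illustration of expanding partitions in step with consumer instances, assuming Kafka and a hypothetical topic named jingdou-deduction, the AdminClient can raise the partition count (partitions can only ever be increased, never decreased):

```java
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewPartitions;

public class ExpandPartitions {

    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed address

        try (AdminClient admin = AdminClient.create(props)) {
            // Grow the hypothetical "jingdou-deduction" topic to 8 partitions so that
            // up to 8 consumer instances in the same group can consume in parallel.
            admin.createPartitions(Map.of("jingdou-deduction", NewPartitions.increaseTo(8)))
                 .all()
                 .get();
        }
    }
}
```

After the partition count is raised, adding consumer instances to the same group (up to the new partition count) is what actually increases parallelism.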

Conclusion

In this article we have shared solutions to some of the most common problems around MQ message queues, which we hope will be helpful to both junior and senior developers.

All of these points can serve as starting places for a constructive conversation with the interviewer. Let’s sum up today’s highlights.

How do I ensure that messages are not lost?

Understand each stage a message goes through from delivery to consumption, where a message can be lost, how to detect that a message has been lost, and finally how to solve the problem based on MQ’s reliable-message-delivery mechanisms.

How to ensure that messages are not re-consumed?

Compensation and retry inevitably produce duplicate messages, so the focus of this question is how to implement idempotency on the consumer side.

How to deal with message backlog?

To truly achieve high performance with MQ, the top priority is to resolve the online incident, then use monitoring and logs to troubleshoot and optimize the business logic, and finally scale out the number of consumers and partitions.

When answering questions, it’s important to give the interviewer a sense of your thought process. This kind of problem-solving ability is much more valuable to the interviewer than your direct answer to an interview question.

In addition, if you are applying for a position in the infrastructure department, you will need to master other knowledge systems of message-oriented middleware, such as:

  • How do I choose message-oriented middleware?
  • What is the difference between the queue model and the publish-subscribe model in messaging middleware?
  • Why can message queues achieve high throughput?
  • Serialization, transport protocols, and memory management

That’s all for today. If you found this helpful, please follow me!
