Based on my experience with RocketMQ, this article covers common problems with message sending. Each section follows the same pattern: describe the problem, analyze it, and solve it.
1. No route info of this topic
Routing information could not be found. The complete error stack is as follows:
Many readers report that this error appears even when the Broker has automatic topic creation enabled.
RocketMQ’s route finding process is shown below:
The key points above are as follows:
- If the Broker has automatic topic creation enabled, it creates the default topic TBW102 at startup and reports it to the Nameserver in the heartbeat packets it sends. The Nameserver then returns this routing information when it is queried.
- Before sending a message, the producer checks its local cache. If the route is cached, it is used directly.
- If the route is not cached, the producer queries the Nameserver. If the Nameserver has routing information for the topic, it returns it.
- If the Nameserver has no routing information for the topic and automatic topic creation is not enabled, No route info of this topic is thrown.
- If automatic topic creation is enabled, the producer queries the Nameserver with the default topic and uses the default topic's routing information as its own. In this case No route info of this topic is not thrown.
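The lookup order above can be sketched in a few lines of Java. This is a simplified model under my own assumptions, not RocketMQ's client code: the class RouteLookupSketch, its two maps, and the methods registerBroker and findRoute are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified model of the route lookup described above (hypothetical names throughout). */
class RouteLookupSketch {
    static final String DEFAULT_TOPIC = "TBW102";

    /** Stand-in for the Nameserver's route table: topic -> broker names. */
    final Map<String, List<String>> nameserver = new HashMap<>();
    /** Stand-in for the producer's local route cache. */
    final Map<String, List<String>> localCache = new HashMap<>();

    /** A Broker with automatic topic creation enabled reports the default topic. */
    void registerBroker(String brokerName, boolean autoCreateTopicEnable) {
        if (autoCreateTopicEnable) {
            nameserver.put(DEFAULT_TOPIC, Arrays.asList(brokerName));
        }
    }

    List<String> findRoute(String topic) {
        List<String> route = localCache.get(topic);      // 1. local cache
        if (route == null) {
            route = nameserver.get(topic);               // 2. query the Nameserver for the topic
        }
        if (route == null) {
            route = nameserver.get(DEFAULT_TOPIC);       // 3. fall back to the default topic
        }
        if (route == null) {
            throw new IllegalStateException("No route info of this topic, " + topic);
        }
        localCache.put(topic, route);
        return route;
    }

    public static void main(String[] args) {
        RouteLookupSketch withAutoCreate = new RouteLookupSketch();
        withAutoCreate.registerBroker("broker-a", true);
        // The route is borrowed from the default topic.
        System.out.println(withAutoCreate.findRoute("dw_test_0003"));

        RouteLookupSketch withoutAutoCreate = new RouteLookupSketch();
        withoutAutoCreate.registerBroker("broker-a", false);
        try {
            withoutAutoCreate.findRoute("dw_test_0003");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model the error can only occur when neither the topic nor TBW102 is known to the Nameserver the producer is talking to, which is why the checks below focus on the Broker setting and the Nameserver address.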
The No route info of this topic error usually appears when a RocketMQ cluster has just been set up, that is, when you are just getting started with RocketMQ. The usual troubleshooting steps are as follows:
- Check in rocketmq-console whether routing information for the topic exists, or query it with the following command:
cd ${ROCKETMQ_HOME}/bin
sh ./mqadmin topicRoute -n 127.0.0.1:9876 -t dw_test_0003
The output is as follows:
- If no routing information is found, check whether the Broker has automatic topic creation enabled. The parameter is autoCreateTopicEnable, which defaults to true. Enabling it in a production environment is not recommended, however.
- If automatic topic creation is enabled and the error is still thrown, check whether the Nameserver address the producer connects to is the same as the Nameserver address configured on the Broker.
These steps resolve the error in almost all cases.
2. Message sending times out
Message sending times out. The client logs are as follows:
When message sending times out on the client, the first suspect is usually the RocketMQ server: is Broker performance jittering so that it cannot handle the current volume?
So how do we find out if RocketMQ currently has a performance bottleneck?
First we run the following command to check the distribution of RocketMQ message write time:
cd ~/logs/rocketmqlogs/
grep -n 'PAGECACHERT' store.log | more
The following output is displayed:
Every minute, RocketMQ prints the distribution of message write times for the preceding minute. From this distribution we can tell whether message writing has a noticeable performance bottleneck. The intervals are as follows:
- [<=0ms]: the number of writes that took less than 1ms, that is, at the microsecond level.
- [0~10ms]: the number of writes that took less than 10ms.
- [10~50ms]: the number of writes that took between 10ms and 50ms.
- The intervals from 50ms upward follow the same pattern.
In my experience, if the intervals of 100~200ms and above contain more than 20 writes, the Broker does have a bottleneck. If they contain only a few, it is just memory or pagecache jitter and not a problem.
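The rule of thumb above can be checked mechanically. The sketch below reads it as "sum the counts of every interval from 100~200ms upward and compare the total with 20". The sample line is made up for illustration and its exact layout is an assumption on my part (bracketed interval, colon, count), so compare it with a real line from your own store.log before relying on it. PageCacheRtCheck and its methods are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that sums the slow buckets of one PAGECACHERT line. */
class PageCacheRtCheck {
    // Matches "[interval]:count" pairs such as "[100~200ms]:15" (assumed layout).
    private static final Pattern BUCKET = Pattern.compile("\\[([^\\]]+)\\]:(\\d+)");

    /** Returns interval -> count in the order the intervals appear on the line. */
    static Map<String, Long> parse(String line) {
        Map<String, Long> buckets = new LinkedHashMap<>();
        Matcher m = BUCKET.matcher(line);
        while (m.find()) {
            buckets.put(m.group(1), Long.parseLong(m.group(2)));
        }
        return buckets;
    }

    /** Sums the counts from firstSlowBucket onward; relies on ascending interval order. */
    static long slowWrites(Map<String, Long> buckets, String firstSlowBucket) {
        long total = 0;
        boolean slow = false;
        for (Map.Entry<String, Long> e : buckets.entrySet()) {
            if (e.getKey().equals(firstSlowBucket)) {
                slow = true;
            }
            if (slow) {
                total += e.getValue();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up sample line, not real Broker output.
        String line = "[PAGECACHERT] TotalPut 1250, PutMessageDistributeTime "
                + "[<=0ms]:1200 [0~10ms]:25 [10~50ms]:3 [50~100ms]:0 "
                + "[100~200ms]:15 [200~500ms]:7 [500ms~1s]:0";
        long slow = slowWrites(parse(line), "100~200ms");
        System.out.println("writes at 100ms or slower: " + slow);
        System.out.println(slow > 20 ? "possible Broker bottleneck" : "probably just jitter");
    }
}
```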
Timeouts usually have little to do with the Broker's processing capacity. Further evidence is the Broker's fast-failure mechanism: when the Broker receives a request from a client, it puts the request in a queue and processes the queue in order. If a request waits in the queue for more than 200ms, fast failure kicks in and [TIMEOUT_CLEAN_QUEUE]broker busy is returned to the client. This is covered in more detail in Part 3 of this article.
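To make the fast-failure mechanism concrete, here is a minimal sketch of the idea: requests are queued with their arrival time, and a periodic check removes those at the head of the queue that have waited longer than the threshold. It is a model under my own assumptions, not the Broker's implementation. FastFailureSketch and its methods are hypothetical names, and timestamps are passed in explicitly so the behavior is deterministic.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical model of "reject requests that have waited too long in the send queue". */
class FastFailureSketch {
    static final class Request {
        final String id;
        final long enqueuedAtMillis;

        Request(String id, long enqueuedAtMillis) {
            this.id = id;
            this.enqueuedAtMillis = enqueuedAtMillis;
        }
    }

    final Deque<Request> sendQueue = new ArrayDeque<>();
    final long maxWaitMillis;

    FastFailureSketch(long maxWaitMillis) {
        this.maxWaitMillis = maxWaitMillis;
    }

    void enqueue(String id, long nowMillis) {
        sendQueue.addLast(new Request(id, nowMillis));
    }

    /** Removes expired requests from the head of the queue and returns their error responses. */
    List<String> cleanExpired(long nowMillis) {
        List<String> rejected = new ArrayList<>();
        while (!sendQueue.isEmpty()
                && nowMillis - sendQueue.peekFirst().enqueuedAtMillis > maxWaitMillis) {
            Request expired = sendQueue.pollFirst();
            rejected.add(expired.id + ": [TIMEOUT_CLEAN_QUEUE]broker busy");
        }
        return rejected;
    }

    public static void main(String[] args) {
        FastFailureSketch broker = new FastFailureSketch(200);
        broker.enqueue("req-1", 0);
        broker.enqueue("req-2", 150);
        // At t=250ms req-1 has waited 250ms (rejected), req-2 only 100ms (kept).
        System.out.println(broker.cleanExpired(250));
        System.out.println("still queued: " + broker.sendQueue.size());
    }
}
```

With a threshold of 1000 instead of 200, the same two requests would both stay in the queue at t=250ms, which is the effect of raising the Broker-side wait time.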
When the RocketMQ client hits a timeout, also consider the application's own garbage collection: a GC pause can cause message sending to time out. I have seen this in a stress-test environment, but not in production.
The timeouts commonly seen with RocketMQ are usually related to network jitter. I am not a networking specialist and have not found direct evidence, but there is indirect evidence. In one application that connects to both a Kafka cluster and a RocketMQ cluster, the connections to every Broker in the RocketMQ cluster and the connections to the Kafka cluster all timed out at the same moment.
But if there is a network timeout, we have to deal with it. Is there a solution?
The least we expect from message middleware is high concurrency and low latency. The write-time distribution above shows that RocketMQ meets this expectation: most requests complete at the microsecond level. My solution is therefore to reduce the send timeout, increase the number of retries, and increase the maximum wait time before fast failure. The specific measures are as follows:
- Increase the Broker's fast-failure wait time; 1000 is recommended. Add the following to the Broker configuration file:
waitTimeMillsInSendQueue=1000
The main reason is that in the current version of RocketMQ, the error returned by fast failure is SYSTEM_BUSY, which does not trigger a client retry. Raising this value appropriately avoids triggering the mechanism as far as possible. For details, see Part 3 of this article, which focuses on System busy and Broker busy.
- If the RocketMQ client version is earlier than 4.3.0 (excluding 4.3.0), set the send timeout to 500ms and the number of retries to 6 (this can be adjusted, but should be greater than 3). The idea is to time out as early as possible and retry. Network jitter inside a LAN turns out to be momentary, so the next retry usually succeeds. In addition, RocketMQ has a fault-avoidance mechanism and tries to pick a different Broker when retrying. The related code is as follows:
DefaultMQProducer producer = new DefaultMQProducer("dw_test_producer_group");
producer.setNamesrvAddr("127.0.0.1:9876");
producer.setRetryTimesWhenSendFailed(5);      // synchronous sending: 5 retries, up to 6 attempts in total
producer.setRetryTimesWhenSendAsyncFailed(5); // asynchronous sending: number of retries
producer.start();
producer.send(msg, 500);                      // send timeout in milliseconds
- If the RocketMQ client version is 4.3.0 or later, the send timeout is the total timeout across all retries. Setting a short timeout on the RocketMQ send API alone therefore does not give the same effect. Instead, wrap the RocketMQ send API and perform the retry in the outer wrapper:
public static SendResult send(DefaultMQProducer producer, Message msg, int retryCount) {
    Exception lastError = null;
    for (int i = 0; i < retryCount; i++) {
        try {
            // 500ms timeout; the client's internal retries share this budget
            return producer.send(msg, 500);
        } catch (Exception e) {
            lastError = e; // remember the failure and try again
        }
    }
    throw new RuntimeException("Message sending exception", lastError);
}
3. System busy, Broker busy
When a RocketMQ cluster reaches a load of about 10,000 TPS, System busy and Broker busy become common problems. An example exception stack is shown below.
The RocketMQ error keywords related to System busy and Broker busy are as follows:
- [REJECTREQUEST]system busy
- too many requests and system thread pool busy
- [PC_SYNCHRONIZED]broker busy
- [PCBUSY_CLEAN_QUEUE]broker busy
- [TIMEOUT_CLEAN_QUEUE]broker busy
3.1 Principle Analysis
Let’s start with a graph to illustrate when these errors are thrown during the lifetime of the message.
The causes that trigger the five error types above fall into the following three categories.
- Pagecache is under high pressure
The following three errors fall into this category:
- [REJECTREQUEST]system busy
- [PC_SYNCHRONIZED]broker busy
- [PCBUSY_CLEAN_QUEUE]broker busy
Whether the pagecache is busy is judged by how long the lock is held while a message is appended to memory. By default, if the lock has been held for more than 1s, the pagecache is considered to be under pressure and the corresponding error is returned to the client.
- The send thread pool is backlogged
On the Broker, message sending is handled by a thread pool that has only one thread. The pool maintains a bounded queue with a default length of 10,000. If the number of requests backlogged in the queue exceeds 10,000, the thread pool's rejection policy is executed and the [too many requests and system thread pool busy] error is thrown.
- The Broker fails fast
By default, fast failure is enabled on the Broker. Even when the pagecache is not yet busy (the lock has not been held for more than 1s), if a request has waited in the send queue for 200ms, the Broker stops queuing it and returns System busy directly to the client. Because the RocketMQ client currently does not retry this error, extra handling is required to resolve this type of problem.
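The second category, a single-thread pool with a bounded queue, is easy to reproduce with the JDK alone. The sketch below is not Broker code: BoundedSendPoolSketch is a hypothetical name, the queue capacity is shrunk from 10,000 to 3 so the effect is visible, and the slow write is simulated by blocking on a latch.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical demo: one worker thread, bounded queue, default rejection policy. */
class BoundedSendPoolSketch {

    /** Submits blocked tasks and returns how many the pool rejected. */
    static int submitAndCountRejections(int queueCapacity, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(queueCapacity)); // default handler is AbortPolicy
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        for (int i = 0; i < tasks; i++) {
            try {
                pool.execute(() -> awaitQuietly(release)); // simulates a slow message write
            } catch (RejectedExecutionException e) {
                // The article's "too many requests and system thread pool busy" case.
                rejected++;
            }
        }
        release.countDown(); // let the worker drain the queue
        pool.shutdown();
        return rejected;
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // 1 task runs, 3 wait in the queue, the remaining 2 are rejected.
        System.out.println("rejected: " + submitAndCountRejections(3, 6));
    }
}
```

Because the single worker is stuck on the first task, everything beyond one running task plus the queue capacity is rejected immediately rather than waiting.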
3.2 PageCache busy solution
If the message server reports pagecache busy frequently (the lock held while appending data to memory exceeds 1s), this is a serious problem that requires manual intervention. The approach is as follows:
- Enable transientStorePoolEnable
To enable transientStorePoolEnable, add the following to the Broker configuration file:
transientStorePoolEnable=true
The following figure shows the working principle of transientStorePoolEnable:
transientStorePoolEnable relieves pagecache pressure for the following reasons:
- Messages are first written to off-heap memory. Because memory locking is enabled for this memory, writing a message is close to operating on memory directly, so performance is guaranteed.
- After messages enter off-heap memory, a background thread commits them to the pagecache in batches. In other words, pagecache writes change from one write per message to batched writes, which reduces the pressure on the pagecache.
Enabling transientStorePoolEnable does increase the possibility of data loss. If the Broker JVM process exits abnormally, messages already committed to the pagecache are not lost, but messages that are still in off-heap memory (DirectByteBuffer) and have not yet been committed to the pagecache are lost. In practice, an abnormal exit of the RocketMQ process is unlikely. Still, if transientStorePoolEnable is enabled, the message sender should have a resend mechanism (compensation).
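The two-stage write and the data-loss window can be illustrated with plain NIO. This is a sketch of the idea only, not RocketMQ's storage code: TransientStoreSketch, its methods, and the 1024-byte buffer are hypothetical, and the commit is called by hand here instead of by a background thread.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical model: stage messages off-heap, then write them to the file in one batch. */
class TransientStoreSketch implements AutoCloseable {
    // Off-heap staging buffer; a message larger than the remaining space would overflow it.
    private final ByteBuffer writeBuffer = ByteBuffer.allocateDirect(1024);
    private final FileChannel channel;

    TransientStoreSketch(Path file) throws IOException {
        channel = FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
    }

    /** Appends to off-heap memory only; nothing reaches the file yet. */
    void append(String message) {
        writeBuffer.put(message.getBytes(StandardCharsets.UTF_8));
    }

    /** Writes everything staged so far to the file channel in one batch. */
    int commit() throws IOException {
        writeBuffer.flip();
        int written = 0;
        while (writeBuffer.hasRemaining()) {
            written += channel.write(writeBuffer);
        }
        writeBuffer.clear();
        return written;
    }

    long committedBytes() throws IOException {
        return channel.size();
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("commitlog", ".bin");
        try (TransientStoreSketch store = new TransientStoreSketch(file)) {
            store.append("msg-1");
            store.append("msg-2");
            // Both messages exist only off-heap here; a crash now would lose them.
            System.out.println("before commit: " + store.committedBytes() + " bytes in file");
            store.commit();
            System.out.println("after commit: " + store.committedBytes() + " bytes in file");
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

The gap between append and commit is exactly the window in which an abnormal process exit loses messages, which is why a resend mechanism on the sender side is advisable.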
- Scale out
If the pagecache is still busy after transientStorePoolEnable is enabled, expand the cluster, or split some topics out of the cluster, to reduce the cluster load.
Note: when the Broker is busy because of the pagecache, the RocketMQ client does retry.
3.3 TIMEOUT_CLEAN_QUEUE Solution
Because the client does not retry a TIMEOUT_CLEAN_QUEUE error, the current recommendation is to raise the fast-failure threshold by adding the following to the Broker's configuration file:
# The default value is 200, indicating 200ms
waitTimeMillsInSendQueue=1000
This article is taken from the column RocketMQ In Action and Progression. The column introduces RocketMQ through usage scenarios: what problems come up, how to solve them, and why. It focuses on hands-on practice and combines theory with practice, aiming to help RocketMQ beginners level up quickly and become experts in the field.