Performance Tuning: The RocketMQ Timeout Exception That Bothered Me for Half a Year Is Finally Cracked

Be a positive person, the harder you work, the luckier you get!

In an intranet environment, when a timeout problem comes up, the network team's stance is: this one is not on us.

My understanding of timeouts has deepened steadily through practice at work. A timeout problem that RocketMQ hit in our production environment bothered me for nearly half a year, and I have now reached a good outcome, so I would like to share it with you.

This write-up touches on several areas the author is not deeply familiar with, such as networking. If there are any mistakes, please point them out promptly so we can grow and improve together.

1. Network timeout


From time to time, project teams reported that MQ message sending in the production environment had timed out.

Today’s story will begin with a picture.

2. Troubleshooting


2.1 Preliminary Analysis

There are two key logs in the figure above:

  • invokeSync: wait response timeout exception, i.e. the synchronous network call timed out

  • receive response, but not matched any request

    This log is critical: it shows that although the client timed out while waiting for the server's result, the response did eventually arrive; by then the client had already waited long enough and had given up processing it.

Let me elaborate on what lies behind the second log. It comes from a common programming technique: multiplexing multiple requests over a single connection. Suppose one long-lived connection sends two requests to the server; when the responses come back, how does the client tell which response belongs to which request? As shown below:

The client has multiple threads sending requests req1 and req2 over a single connection. The server is usually multithreaded as well, so the response to req2 may come back first. How does the client match the data returned by the server to the corresponding request?

The solution: before sending a request, the client generates a locally unique requestId for it, creates a Future object in the Future pattern, puts the (requestId, Future) pair into a map, and places the requestId into the request body. When the server returns the response, it echoes the requestId back unchanged. On receiving the response, the client first extracts the requestId, looks up the corresponding Future in the cache, wakes up the waiting business thread, and hands the result back to the caller, completing the whole communication round trip.
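Below is a minimal, self-contained Java sketch of this request/response matching pattern. It only illustrates the idea; it is not RocketMQ's actual implementation (RocketMQ uses its own ResponseFuture and an opaque field on the protocol header), and the class and method names here are mine.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch of request/response matching over one connection.
public class RequestResponseMatcher {

    private final AtomicInteger requestIdGenerator = new AtomicInteger(0);
    // requestId -> Future that the sending business thread waits on
    private final Map<Integer, CompletableFuture<byte[]>> pendingRequests = new ConcurrentHashMap<>();

    // Called by the business thread before writing the request to the connection.
    public byte[] sendSync(byte[] body, long timeoutMillis) throws Exception {
        int requestId = requestIdGenerator.incrementAndGet();
        CompletableFuture<byte[]> future = new CompletableFuture<>();
        pendingRequests.put(requestId, future);
        try {
            writeToChannel(requestId, body);            // the requestId travels inside the request
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // The caller gives up here; a late response will hit the
            // "receive response, but not matched any request" branch below.
            throw e;
        } finally {
            pendingRequests.remove(requestId);
        }
    }

    // Called by the IO thread when a response arrives with the same requestId echoed back.
    public void onResponse(int requestId, byte[] responseBody) {
        CompletableFuture<byte[]> future = pendingRequests.remove(requestId);
        if (future != null) {
            future.complete(responseBody);              // wakes up the waiting business thread
        } else {
            // The request already timed out and was removed from the map.
            System.err.println("receive response, but not matched any request, id=" + requestId);
        }
    }

    private void writeToChannel(int requestId, byte[] body) {
        // Placeholder for the actual network write (e.g. a Netty channel.writeAndFlush).
    }
}
```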

It follows that the client did not receive the server's response within the specified time, but the response did arrive in the end. So suspicion pointed straight at the Broker: a processing bottleneck was making it slow to respond.

2.2 Analysis of the Broker Processing Bottleneck

In my “experience”, when RocketMQ message sending hits a bottleneck, it usually returns various flavors of broker busy errors, and you can tell whether the Broker is the bottleneck by tracking how long the Broker takes to write messages to the PageCache.

`grep "PAGECACHERT" store.log`


The result is similar to the screenshot below:

Tips: The figure above is a screenshot from my own machine. When the problem was analyzed, the time each Broker in the MQ cluster took to write data to the PageCache never exceeded 100 ms.

Precisely because the PageCache writes looked healthy, and based on the rough network interaction model below, I proposed shifting the focus of the investigation to the network:

In addition, I confirmed with the business side that although message sending timed out, the messages were persisted in MQ and consumers could still consume them normally. Although in theory nothing should be wrong with the LAN, given the symptoms above the network team still started a network investigation.

Friendly reminder: it was eventually proven that the problem was on my side, not the network's.

2.3 Network Analysis

There are two approaches to network analysis: netstat and packet capture.

2.3.1 Querying Recv-Q and Send-Q with netstat

With netstat we can observe two indicators: Recv-Q and Send-Q.

  • Recv-Q: the receive buffer of the TCP connection

  • Send-Q: the send buffer of the TCP connection

The roles that Recv-Q and Send-Q play in TCP are illustrated as follows:

  • When the client writes data through a network channel (for example an NIO Channel), the data first goes into the TCP send buffer. If the send buffer is full, the client cannot keep writing to that channel; it has to register for write events so that the upper-layer application is notified and can resume sending once the send buffer has free space again (see the sketch after this list).

  • Once data is in the send buffer, it travels over the network to the peer and lands first in the peer's receive buffer. With NIO, the channel keeps firing read events until the receive buffer has been drained.

  • When the sender receives an ACK, it removes the acknowledged data from its send buffer. If the receiver does not read data out of its receive buffer, the window eventually closes and the sender can no longer send.
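As a concrete illustration of the first point, here is a minimal plain-NIO sketch (not RocketMQ code; the names are mine) of what "register a write event when the send buffer is full" looks like:

```java
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.SocketChannel;

// When the kernel send buffer (Send-Q) is full, a non-blocking write() returns 0;
// the application must register OP_WRITE and retry once the selector reports
// the channel as writable again.
public class WriteWhenBufferHasRoom {

    static void writeOrWait(SocketChannel channel, SelectionKey key, ByteBuffer data) throws Exception {
        while (data.hasRemaining()) {
            int written = channel.write(data);
            if (written == 0) {
                // Send buffer is full: stop writing and ask the selector
                // to wake us up when there is room again.
                key.interestOps(key.interestOps() | SelectionKey.OP_WRITE);
                return;
            }
        }
        // Everything was written: no need to be notified about writability any more.
        key.interestOps(key.interestOps() & ~SelectionKey.OP_WRITE);
    }
}
```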

The network colleagues collected netstat output on both the client and the MQ server every 500 ms. The results are summarized as follows:

From the client's perspective, there was a large backlog in the client's Recv-Q, and a large backlog in MQ's Send-Q.

Based on the communication model above, the inference this time was that the client might be slow at reading bytes off the network, since the client runs on a virtual machine. Judging from the netstat results, the client was suspected (note: it was in fact not a client problem, so don't be misled).

2.3.2 Network Packet Capture

To be more rigorous, the network colleagues also captured the following packets:

One very suspicious detail here: the sliding window advertised between the client and the server was only 190 bytes, while an MQ send-response packet is about 250 bytes, so a response had to be transmitted in two segments before it could be fully received. At first I thought this was the main cause, but further comparison showed that clients whose sliding window was larger than 250 bytes also timed out, so it was not the main cause either. Later, the network colleagues used various tools and concluded that there was nothing wrong with the network; the problem was in the application.

It is a LAN after all, so following the netstat results we focused on the client's read performance.

2.4 Analysis of Network Read Performance Bottlenecks on the Client

To measure read and write performance, I modified the RocketMQ client package and added Netty performance-capture code, shown in the screenshot below:

The main idea: for each read event a channel triggers on the client, count how many read operations are needed. If a single read event requires many reads, the channel has a large backlog of data and the client has a network read bottleneck.
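The screenshot of the patch is not reproduced here, but the instrumentation is roughly the shape of the handler below, placed at the very front of the client's Netty pipeline (a sketch under that assumption; this is not the exact code added to the RocketMQ client):

```java
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

// Counts how many channelRead() calls Netty fires between two channelReadComplete()
// calls, i.e. how many reads one read event needs to drain the channel's receive
// buffer. Installed first in the pipeline, before the decoder.
public class ReadBatchCounterHandler extends ChannelInboundHandlerAdapter {

    private int readsInCurrentEvent = 0;

    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
        readsInCurrentEvent++;
        ctx.fireChannelRead(msg);
    }

    @Override
    public void channelReadComplete(ChannelHandlerContext ctx) throws Exception {
        // A consistently large count here would point to a read-side backlog.
        System.out.printf("%s reads per read event: %d%n",
                ctx.channel().remoteAddress(), readsInCurrentEvent);
        readsInCurrentEvent = 0;
        ctx.fireChannelReadComplete();
    }
}
```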

Disappointingly, however, there was no network read bottleneck on the client side. Part of the collected data is shown below:

Analyzing it with awk showed that, for a single read event, most channels drained their receive buffer within two reads, so there was no bottleneck on the read side. The awk statistics are as follows:

Could it be the write side that is slow, then?

2.5 Network Bottlenecks at the Broker End

To sum up the analysis so far: the Broker writes to the PageCache quickly, but there is no monitoring of how the response results are written back to the network. Are the responses written out promptly, or is there a large backlog in Netty's write buffer that keeps data from reaching the TCP send buffer in time and so causes the timeouts?
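To make "a backlog in Netty's write buffer" concrete: data passed to writeAndFlush() first sits in the channel's outbound buffer before it reaches the TCP send buffer, and Netty exposes this backlog through isWritable() and the water-mark API. The probe below is purely hypothetical, only to show what such a check would look like; as explained next, I did not actually deploy code like this.

```java
import io.netty.channel.Channel;

// Hypothetical probe: once the channel's outbound buffer exceeds the high water
// mark, the channel reports itself as not writable, which is exactly the
// "response piling up before the TCP send buffer" situation suspected above.
public class WriteBacklogProbe {

    static void logIfBacklogged(Channel channel) {
        if (!channel.isWritable()) {
            // bytesBeforeWritable(): how many queued bytes must drain before
            // the channel becomes writable again.
            System.err.printf("channel %s not writable, %d bytes still queued%n",
                    channel.remoteAddress(), channel.bytesBeforeWritable());
        }
    }
}
```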

I originally intended to modify the code to monitor server-side write performance at the Netty level, but given the risk I did not change the code for now. Instead, I carefully re-read the code where RocketMQ wraps Netty. Before this round of source reading I had always assumed that RocketMQ's network layer needed no parameter tuning. In RocketMQ there are two thread-count parameters related to I/O:

  • serverSelectorThreads: default value 3.

  • serverWorkerThreads: default value 8.

serverSelectorThreads is Netty's worker group (the IO event loop group). Each of these threads holds an NIO Selector for event selection, and all channels are registered across the three threads in turn. All channels bound to a given thread perform their reads and writes sequentially on that thread; that is, reading data out of the TCP receive buffer and writing data into the send buffer for those channels all happens on this one thread.

Our MQ servers have 64 or more CPU cores, so handling all of this with 3 threads is a bit “stingy”. This parameter is worth tuning.

serverWorkerThreads: in Netty's threading model, events are by default propagated (encoded and decoded) on the IO thread, i.e. the thread holding the Selector mentioned above. To relieve pressure on the IO threads, RocketMQ defines a separate thread pool for event propagation, so the IO threads are only responsible for network reads and writes, while decoding received data and encoding outgoing data happen in a separate pool whose size is specified by serverWorkerThreads.
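Putting the two parameters together, the broker's Netty bootstrap is shaped roughly like the sketch below. This is a simplified illustration of the thread layout, not RocketMQ's actual NettyRemotingServer source; the real handler chain is collapsed into a placeholder.

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.handler.logging.LoggingHandler;
import io.netty.util.concurrent.DefaultEventExecutorGroup;

// Simplified sketch of how the two parameters map onto Netty's thread groups.
public class BrokerNettySketch {

    public static void main(String[] args) {
        int serverSelectorThreads = 3;  // default: IO threads that own the selectors and move bytes
        int serverWorkerThreads = 8;    // default: threads that run encode/decode off the IO threads

        NioEventLoopGroup bossGroup = new NioEventLoopGroup(1);                         // accepts connections
        NioEventLoopGroup selectorGroup = new NioEventLoopGroup(serverSelectorThreads); // "serverSelectorThreads"
        DefaultEventExecutorGroup workerGroup =
                new DefaultEventExecutorGroup(serverWorkerThreads);                     // "serverWorkerThreads"

        ServerBootstrap bootstrap = new ServerBootstrap()
                .group(bossGroup, selectorGroup)
                .channel(NioServerSocketChannel.class)
                .childHandler(new ChannelInitializer<SocketChannel>() {
                    @Override
                    protected void initChannel(SocketChannel ch) {
                        // Handlers registered with workerGroup execute on it, so the selector
                        // (IO) threads only move bytes between the socket and the pipeline.
                        // In the real broker this is the encoder/decoder/serverHandler chain.
                        ch.pipeline().addLast(workerGroup, new LoggingHandler());
                    }
                });
        // bootstrap.bind(...); // binding and shutdown are omitted in this sketch
    }
}
```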

Seeing this, my spirits finally lifted; I felt I was getting closer and closer to the truth. For the first round of tuning I set serverSelectorThreads=16 and serverWorkerThreads=32, and verified the change in the production environment.
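For reference, the change itself is just two properties. I am assuming here that they are set in the broker's configuration file (the one passed with -c), whose properties RocketMQ maps onto NettyServerConfig:

```properties
# First round of tuning on a 64-core broker (defaults are 3 and 8)
serverSelectorThreads=16
serverWorkerThreads=32
```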

After more than a month of verification, even with the cluster scaled down after Double 11, only a very small number of send timeouts occurred, which can basically be ignored.

Tips: For a detailed explanation of Netty's threading model, see my illustrated article on the Netty threading model; follow the public account and reply NPDF to get a free Netty PDF.


3. Summary


This article walked through the author's troubleshooting of MQ send timeouts, which were finally traced to the thread-model parameter settings of the RocketMQ server's network communication layer.

The reason it took so long is that I was misled by my “experience”. In the past I attributed timeouts directly to the network, because RocketMQ's PageCache write metrics looked excellent, with almost no writes exceeding 100 ms, and I never considered the network communication layer itself.


Well, that's all for this article. A one-click triple (follow, like, comment) is the biggest encouragement to me. You can also add the author on WeChat: DINGwPMz, with the note CSDN, for exchange and discussion.

Finally, I'd like to share a hard-core RocketMQ e-book with you, from which you can gain operations experience with message flows at the scale of billions.

How to get it: search for “Middleware Interest Circle” on WeChat and reply RMQPDF.

Middleware Interest Circle

Maintained by the author of RocketMQ Technology Insider, the account focuses on systematic analysis of the architecture and design principles of mainstream Java middleware, helping readers build a complete Internet distributed architecture knowledge system and break through career bottlenecks.