The charging-pile project has been running smoothly in production. Users operate through an H5 page: they scan the code on a charging pile, pay, and after entering the corresponding screen they can start discharging (power delivery) or cut the power on the pile.

The user interacts with the server over HTTPS. On receiving a request, the server assembles the parameters and publishes a message to the MQTT broker (RabbitMQ); the MQTT client on the charging pile receives the message. Feedback from the charging pile back to the page travels exactly the reverse path.
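
To make the command path concrete, here is a minimal sketch of the server-side publish step, assuming RabbitMQ's MQTT plugin on localhost:1883 and a hypothetical per-device command topic `pile/<device_id>/cmd`; none of these names come from the actual project.

```python
import json
import time

import paho.mqtt.client as mqtt

def send_command(device_id: str, action: str) -> None:
    """Assemble the command parameters and publish them to the pile's command topic."""
    payload = json.dumps({"action": action, "ts": int(time.time())})
    # paho-mqtt 1.x style constructor; 2.x additionally requires a CallbackAPIVersion argument.
    client = mqtt.Client(client_id="api-server-demo")
    client.connect("localhost", 1883)   # assumed broker address (RabbitMQ MQTT plugin default port)
    client.loop_start()
    # QoS 1 for at-least-once delivery between publisher and broker.
    info = client.publish(f"pile/{device_id}/cmd", payload, qos=1)
    info.wait_for_publish()
    client.loop_stop()
    client.disconnect()
    # In a real service the client connection would be long-lived, not opened per request.

# e.g. called by the HTTPS handler after payment succeeds:
# send_command("pile-0001", "start_discharge")
```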

After the project went live, the average time from sending a command to getting a hardware response was roughly 2 s (depending on the local 4G signal). One afternoon, however, customers at one residential community reported that the whole flow had become very slow: after placing an order the pile did start discharging, but the page would not jump to the discharge status screen, and tapping the power-off button on the page had no effect.

The first step was to check the logs of the H5 module on the server; they showed that every incoming MQTT message that reached this module had been processed promptly, with no delay. Suspicion therefore turned to the hardware, i.e. a signal problem with the charging piles on site (that community had recently had a large-scale episode of no signal and weak signal). To be safe, a set of hardware was set up and debugged in the office to simulate the production situation, and it responded promptly with no delay. Field staff were then asked to check on-site, but after more than ten minutes of testing they reported that the 4G signal was fine.

With no better option, we reproduced the problem in the office over and over, with on-site staff cooperating in the tests. It turned out that messages really were going missing on the way to the server: the server processed the messages that did arrive, but for the ones that never arrived there was nothing to process, so the page simply stayed in its waiting state and never moved on to the next step.

At a loss, and needing to determine whether the charging pile had failed to publish the message or the server had failed to handle it for some other reason, I spun up a separate MQTT client, configured it against the production broker, subscribed to the corresponding topic, and watched the consumer side. The result: every message the charging pile uploaded arrived perfectly normally, yet the production server had not received them all. That is when it occurred to me that more than one program might be consuming messages with the same ClientId in production. After checking every server that was allowed to reach the production environment's IP, we did find a consumer that was splitting all of the messages with the production MQTT consumption module. Killing that process restored the production environment immediately.
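
The throwaway observer can be as small as the sketch below. The broker address, credentials, and the uplink topic `pile/+/status` are hypothetical stand-ins; the point is only to watch the raw traffic on the topic with a client ID that nothing else in production uses.

```python
import paho.mqtt.client as mqtt

def on_connect(client, userdata, flags, rc):
    # Wildcard-subscribe to the uplink topic so every pile's report is visible.
    client.subscribe("pile/+/status", qos=1)

def on_message(client, userdata, msg):
    print(f"{msg.topic}: {msg.payload!r}")

# paho-mqtt 1.x style constructor; 2.x additionally requires a CallbackAPIVersion argument.
client = mqtt.Client(client_id="debug-observer")  # a UNIQUE id, so it never steals production traffic
client.username_pw_set("user", "password")        # assumed credentials
client.connect("mqtt.example.com", 1883)          # assumed broker address
client.loop_forever()
```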

In fact, this problem had occurred several times before, each time in a different scenario, and because which consumer actually receives a given message under a duplicated client ID is essentially arbitrary, a test might sometimes respond promptly and sometimes slowly, which muddied the troubleshooting. For example, when the hardware MQTT messaging was first being debugged, the hardware engineer was unclear about the details of the protocol and burned exactly the same firmware onto every device, so every pile had the same ClientId. With two or more devices online at the same time, this was equivalent to opening two consumers on one client ID: a message intended for one device might be received by the other, the intended device would not react, and once the server noticed the missing response it would start its retransmission mechanism until the intended client finally got the message, creating the illusion of a slow link.

At first, with only one device on the bench, everything was normal. Once several devices were being debugged at the same time, the "network" became sluggish and we suspected a network problem in the office area, when in fact several terminals were jointly consuming all the messages on the same channel and packets were effectively being missed. The retransmission mechanism did guarantee that the whole flow eventually completed, but it made every cycle painfully slow. I had run into a similar problem with Kafka before.
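
The failure mode is easy to reproduce against a local broker. Below is a minimal sketch, assuming a broker reachable on localhost:1883 (e.g. RabbitMQ's MQTT plugin) and a hypothetical topic `pile/status`: two subscribers that connect with the same client ID keep taking over each other's session, so each one sees only part of the published messages; the exact split depends on the broker and on reconnect timing.

```python
import threading
import time

import paho.mqtt.client as mqtt

BROKER = "localhost"          # assumed broker address
TOPIC = "pile/status"         # hypothetical topic
SHARED_ID = "charging-pile"   # the duplicated ClientId that caused the trouble

received = {"A": 0, "B": 0}

def run_consumer(label: str) -> None:
    # paho-mqtt 1.x style constructor; 2.x additionally requires a CallbackAPIVersion argument.
    client = mqtt.Client(client_id=SHARED_ID, clean_session=True)
    client.on_connect = lambda c, u, f, rc: c.subscribe(TOPIC, qos=1)

    def on_message(c, u, msg):
        received[label] += 1

    client.on_message = on_message
    client.connect(BROKER)
    # loop_forever() reconnects after every takeover, so A and B keep kicking each other off.
    client.loop_forever(retry_first_connection=True)

for label in ("A", "B"):
    threading.Thread(target=run_consumer, args=(label,), daemon=True).start()

time.sleep(1)  # let both consumers connect (and start fighting) before publishing

publisher = mqtt.Client(client_id="publisher")  # a properly unique id
publisher.connect(BROKER)
publisher.loop_start()
for i in range(100):
    publisher.publish(TOPIC, payload=str(i), qos=1)
    time.sleep(0.05)

time.sleep(2)
print(received)  # typically neither consumer has seen all 100 messages
```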

However, this behavior can also be used deliberately to spread traffic when processing large volumes of data. For example, when massive amounts of terminal data are uploaded to the message queue, a single consumer may not be able to withstand that traffic and process it in time, so multiple consumers are started on different servers with the same terminal ID to share the batch of messages and keep up with it.
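
For deliberate load sharing, the standardized mechanism (rather than duplicating client IDs) is an MQTT 5 shared subscription, the MQTT counterpart of a Kafka consumer group. Below is a minimal sketch of one worker, assuming a broker with MQTT 5 shared-subscription support and a hypothetical upload topic `pile/telemetry`; start the same script on several servers and the broker distributes the incoming messages across the group instead of copying every message to every worker.

```python
import socket
import uuid

import paho.mqtt.client as mqtt

BROKER = "localhost"                             # assumed broker address
SHARED_TOPIC = "$share/workers/pile/telemetry"   # hypothetical topic, shared by the "workers" group

def on_connect(client, userdata, flags, reason_code, properties):
    client.subscribe(SHARED_TOPIC, qos=1)

def on_message(client, userdata, msg):
    print(f"processing {msg.payload!r} from {msg.topic}")

# Each worker gets its own client id -- the opposite of the production accident above.
client_id = f"telemetry-worker-{socket.gethostname()}-{uuid.uuid4().hex[:8]}"
# paho-mqtt 1.x style constructor; 2.x additionally requires a CallbackAPIVersion argument.
client = mqtt.Client(client_id=client_id, protocol=mqtt.MQTTv5)
client.on_connect = on_connect
client.on_message = on_message
client.connect(BROKER)
client.loop_forever()
```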

For the record, duplicated consumer identities are a trouble spot worth keeping in mind whenever you troubleshoot problems involving message queues.