Among real-time message queues implemented in Go, NSQ is arguably the most popular.

NSQ is an easy-to-use messaging middleware designed to provide a robust infrastructure for decentralized services running in a distributed environment. Its distributed, decentralized topology has no single point of failure and offers fault tolerance, high availability, and reliable message delivery.

NSQ has won the hearts of gophers with its distributed architecture and its ability to handle hundreds of millions of messages, and our company is no exception: many of our services rely on NSQ for message delivery. Today I want to talk about monitoring NSQ.

Why deploy monitoring?

We all know how important monitoring is. A service without monitoring is like a blind man riding a blind horse toward a deep pool at midnight. That may sound a little abstract, so let me share a real case from my own experience.

I still remember that day: I was happily eating hot pot and singing along, feeling wonderful, when my phone suddenly rang. I opened it and saw a customer report that a CDN refresh had returned success but had not actually taken effect.

There was no way to keep enjoying the hot pot. I grabbed my laptop and got to work, but the results were discouraging: I went through the logs of every service along the call chain, yet the URL the customer wanted refreshed appeared in none of them. So where was the problem?

The service calls involved in this flow are illustrated in the figure above. Given the urgency of the customer's request, I tore my eyes away from the boiling hot pot and pondered the service link diagram:

  • As shown in the figure, the user's refresh request was submitted successfully, which means the request reached the OHM service layer and OHM processed it successfully.

  • OHM is the gateway for the refresh and preheat services, and it logs at the ERROR level. No record of the request appears in OHM's logs, which indicates that OHM successfully pushed the message to the downstream NSQ.

  • The consumers on the other side of NSQ are the Purge and Preheat components. Purge, which performs the refresh, logs every refreshed URL at the INFO level. So why was there no corresponding entry in the Purge log?

That is where I got stuck. The crux of the problem was that the expected log entry could not be found in the Purge service. I roughly listed the possible causes:

  • Service changes. A bug in recently updated code could make the Purge service misbehave. I quickly ruled this out, though: the release history showed the service had not been touched in recent months.

  • NSQ is down. This was even less likely: NSQ is deployed as a cluster, so a single point of failure is avoided, and a global outage would have blown up the whole company long before this ticket arrived.

  • NSQ did not deliver the message. But NSQ is a real-time message queue, so delivery should be fast, and the customer had submitted the refresh hours earlier.

Could it be that messages had piled up in NSQ and were therefore not delivered in time? This kind of problem had never surfaced in the test environment, because test traffic is nowhere near the scale of production traffic… The moment the thought crossed my mind, I had an epiphany. I logged into the NSQ console and checked the corresponding topic, and sure enough, there was the problem: NSQ had accumulated hundreds of millions of undelivered messages!

Once the problem was located, resolving it for the customer with our internal tools was routine work, so I won't expand on that here.

Deploying the monitoring

I was relieved once the work order was closed, but the story did not end there. The incident was a wake-up call: good NSQ performance does not mean messages will never pile up, and the necessary monitoring and alerting still had to be put in place.

Given our existing infrastructure, I decided to use Prometheus to monitor our NSQ services. (If you would like an introduction to Prometheus itself, leave a comment.)

Prometheus collects data from third-party services through an exporter, which means NSQ needs to run an exporter of its own for Prometheus to scrape.

Prometheus's official documentation [prometheus.io/docs/instru… lists recommended exporters. Following that link, I found the recommended NSQ exporter [https://github.com/lovoo/nsq_exporter]. Unfortunately the project has fallen into disrepair; its most recent commit was four years ago.

So I pulled the project down locally and made a few simple changes to migrate it to Go modules. (The PR is here: [github.com/lovoo/nsq_e…
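With the exporter built, hooking it up to Prometheus is just a matter of adding a scrape job. The snippet below is only a minimal sketch: the job name and the target address (nsq-exporter:9117) are placeholders for whatever host and port your nsq_exporter instance actually listens on.

```yaml
# prometheus.yml — minimal sketch of the scrape job for nsq_exporter.
# "nsq" and "nsq-exporter:9117" are placeholders; use your real job name
# and the address where the exporter serves its /metrics endpoint.
scrape_configs:
  - job_name: "nsq"
    scrape_interval: 15s
    static_configs:
      - targets: ["nsq-exporter:9117"]
```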

With the NSQ exporter deployed, the next question was: which metrics need to be monitored?

According to the NSQ documentation [nsq.io/components/…, the following statistics deserve the most attention (a sketch of the corresponding alert rules follows the list):

  • Depth: the number of messages currently backed up in NSQ. By default NSQ keeps only 8000 messages in memory; anything beyond that is persisted to disk.

  • Requeued: the number of times messages have been requeued.

  • Timed Out: the number of messages whose processing timed out (they were not acknowledged within the message timeout).
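To turn these statistics into alerts, plain Prometheus alerting rules over the exporter's metrics are enough. Below is only a rough sketch under a few assumptions: I'm assuming the exporter exposes a per-topic depth gauge named nsq_topic_depth, the backlog threshold is arbitrary, and the job label matches the scrape job shown earlier; check your exporter's /metrics output for the exact names before copying anything.

```yaml
# nsq-alerts.yml — sketch of alerting rules; metric names, labels and
# thresholds are assumptions to be checked against your exporter's output.
groups:
  - name: nsq
    rules:
      - alert: NSQTopicBacklog
        expr: nsq_topic_depth > 10000        # assumed depth gauge; tune the threshold per topic
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "NSQ topic backlog is growing"
      - alert: NSQExporterDown
        expr: up{job="nsq"} == 0             # "nsq" must match the scrape job name
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "nsq_exporter is unreachable"
```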

Prometheus recommends pairing it with Grafana so that metric changes can be seen more intuitively. The dashboard I ended up configuring looks roughly like this:

  • The timeout-message panel corresponds to the Timed Out metric.

  • The backlog panel corresponds to the Depth metric.

  • The load panel is driven by the formula sum(irate(nsq_topic_message_count{}[5m])) (a recording-rule sketch follows this list).

  • A health-check panel watches whether the NSQ exporter itself is up, because the exporter frequently buckles under the pressure coming from NSQ and becomes unavailable.
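About the load panel: sum(irate(nsq_topic_message_count{}[5m])) adds up the instantaneous per-second message rate across all topics. If you want a per-topic breakdown, or want to reuse the expression in several panels, recording rules are one option. This is just a sketch and assumes the exporter attaches a topic label to the series, which may vary by exporter version.

```yaml
# nsq-recording-rules.yml — optional recording rules for the load panel;
# the `topic` label is an assumption about how the exporter labels its series.
groups:
  - name: nsq-load
    rules:
      - record: nsq:messages:rate5m           # overall throughput in messages per second
        expr: sum(irate(nsq_topic_message_count{}[5m]))
      - record: nsq:messages:rate5m_by_topic  # the same rate broken down per topic
        expr: sum by (topic) (irate(nsq_topic_message_count{}[5m]))
```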

Since monitoring was set up for NSQ, we can quickly see its current state and step in to handle and follow up promptly once an alert fires. The stability of the related services has improved noticeably, and work orders caused by this kind of problem have become rare. In addition, the collected metrics have made our thinking clearer and our direction more obvious in subsequent performance optimization work.

Recommended reading

SSH, the standard protocol for servers: how much do you know?

Talking about eBPF, the current hot trend