Nginx is a critical layer, being the entry point for all HTTP requests. This article focuses on how to use Nginx logs to monitor request exceptions for each business module in real time.

This article is based on my previous article “Nginx Log Real-time Monitoring System based on Lua+Kafka+Heka”.

Because of its excellent performance, Nginx is widely used across the Internet, usually as the HTTP access layer responsible for routing traffic and serving static files. As a result it generates a large volume of logs every day, and those logs carry a lot of value: user behavior analysis, service performance and quality analysis, and the exception monitoring described in this article. An access log entry usually records the source of the request, the target resource, device information, the response status, and so on. Two fields matter most here: the response status code, where abnormal values such as 500 indicate errors, and upstream_response_time, which reflects how fast the back-end service responded. So there are two main things to do: 1. monitor errors; 2. monitor slow responses. The ultimate goal is to detect which module is failing and which machine the problem is on.
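For reference, a log_format along these lines captures the two fields mentioned above (an illustrative sketch, not necessarily the exact format used in the original setup):

```nginx
# illustrative log_format; variable names are standard Nginx variables
log_format monitor '$remote_addr $host "$request" $status '
                   '$body_bytes_sent $upstream_response_time '
                   '"$http_user_agent"';

access_log /var/log/nginx/access.log monitor;
```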

Solutions in small-traffic scenarios

Assume there is only one Nginx node and the QPS is low enough that performance is not a concern. The simplest approach is to write a script that counts the number of 500 status codes per minute. If the count exceeds a preset threshold, it sends an alarm email with as much detail as possible, such as the module name, error count, and alarm severity, and writes the abnormal log lines to another file to make troubleshooting easier. For slow-response monitoring, count the slow requests and compute the average based on upstream_response_time.
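For illustration only (not the original script), here is a minimal Lua sketch of this per-minute check, assuming the last minute's log lines are piped in on stdin (for example via cron plus tail) and that a log_format like the one above is in use; the thresholds and patterns are placeholders to adjust:

```lua
-- count_errors.lua: per-minute check for the single-node case (sketch)
local ERROR_THRESHOLD = 10   -- 500s per minute before alarming (assumed value)
local SLOW_THRESHOLD  = 1.0  -- seconds; slower upstream responses count as "slow"

local errors, slow, total_rt, n = 0, 0, 0, 0
for line in io.lines() do
  -- crude field extraction; adjust to your real log_format.
  -- upstream_response_time may be "-" for requests with no upstream.
  local status = line:match('" (%d%d%d) ')
  local rt = tonumber(line:match('" %d%d%d %d+ ([%d%.]+)'))
  if status == "500" then errors = errors + 1 end
  if rt then
    n, total_rt = n + 1, total_rt + rt
    if rt > SLOW_THRESHOLD then slow = slow + 1 end
  end
end

print(string.format("500s=%d slow=%d avg_upstream=%.3fs",
                    errors, slow, n > 0 and total_rt / n or 0))
if errors > ERROR_THRESHOLD then
  -- hand off to an external mailer with module name, count, severity, etc.
  os.exit(1)
end
```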

Solutions in heavy traffic scenarios

The above method is enough for small-traffic scenarios, but once traffic grows and a single node can no longer keep up, this scheme starts to struggle, both in raw performance and in aggregating metrics across nodes. I drew a simple diagram of the new solution:

From the bottom up, the pipeline is: Nginx cluster -> log collection -> message queue -> metric calculation and output -> visualization. Below I walk through the practice at each stage.

Log collection

There are plenty of tools to choose from, such as Logstash and Flume. I recommend using the lua-resty-kafka module to write a Lua extension that assembles the fields you care about into a message in a fixed format and writes it to the message queue. You can also turn off Nginx's own access log to reduce disk consumption.
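Here is a minimal sketch of such an extension, assuming an OpenResty environment with lua-resty-kafka available; the broker address, topic name, and field selection are placeholders rather than the original setup. It would be wired in with a log_by_lua_file (or log_by_lua_block) directive pointing at this file:

```lua
-- nginx_log_kafka.lua: run in the log phase for each request (sketch)
local producer = require "resty.kafka.producer"
local cjson = require "cjson"

local broker_list = {
    { host = "127.0.0.1", port = 9092 },  -- replace with your Kafka brokers
}

local message = cjson.encode({
    host                   = ngx.var.hostname,
    remote_addr            = ngx.var.remote_addr,
    request                = ngx.var.request,
    status                 = ngx.var.status,
    request_time           = ngx.var.request_time,
    upstream_response_time = ngx.var.upstream_response_time,
    time                   = ngx.now(),
})

-- The async producer buffers messages and flushes them from an ngx.timer,
-- so the log phase is not blocked waiting on Kafka.
local p = producer:new(broker_list, { producer_type = "async" })
local ok, err = p:send("nginx_access_log", nil, message)
if not ok then
    ngx.log(ngx.ERR, "failed to send access log to kafka: ", err)
end
```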

The message queue

You can choose Redis or Kafka, depending on whether you need the logs for anything else. Redis is lightweight; Kafka offers high throughput and a distributed architecture, and beyond exception monitoring the same data can be loaded into Hadoop or an offline data warehouse for user behavior analysis.

Abnormal monitoring calculation

In this step you need to calculate metrics, send alarms, and save the abnormal data. If you use Logstash for log collection, it is natural to use Logstash here as well; I won't go into the details, refer to the official documentation. But if you are collecting custom-format data with a Lua extension, I recommend Heka. Heka is written in Go, performs well, and ships with a rich set of built-in plug-ins that cover most requirements; when they don't, you can develop your own extensions in Go or Lua. In the Filter phase the metrics are calculated, and if an error is detected an alarm message is injected into the Heka message flow. SMTPOutput matches the alarm message, builds the email body through a custom Encoder, and sends the email.
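As a rough sketch of what such a filter might look like (assuming a decoder has already parsed the JSON from the Lua extension into message fields named status and module; the Type value and threshold are illustrative, not the original implementation):

```lua
-- A Heka SandboxFilter sketch (Lua).
require "string"

local threshold = read_config("error_threshold") or 10
local counts = {}

function process_message()
    local status = tonumber(read_message("Fields[status]"))
    local module = read_message("Fields[module]") or "unknown"
    if status and status >= 500 then
        counts[module] = (counts[module] or 0) + 1
    end
    return 0
end

-- Runs every ticker_interval (e.g. 60s). Injects an alarm message that an
-- SMTPOutput can match; note that some Heka versions prefix the Type of
-- filter-injected messages with "heka.sandbox.", so adjust the matcher.
function timer_event(ns)
    for module, n in pairs(counts) do
        if n > threshold then
            inject_message({
                Type = "nginx.alarm",
                Payload = string.format(
                    "module %s returned %d 5xx responses in the last interval",
                    module, n),
            })
        end
        counts[module] = 0
    end
end
```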

Visualization

Use Heka's message matcher syntax to route the abnormal logs into Elasticsearch (for example via ElasticSearchOutput) and view them in Kibana. After receiving an alarm email, you can open Kibana to inspect the abnormal logs, and you can also build custom charts to watch error trends in your system.
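A rough sketch of what the corresponding Heka configuration could look like, assuming the decoder sets Type to 'nginx.access' and parses status into a numeric message field; plugin option names and values are illustrative and should be checked against the Heka documentation for your version:

```toml
[ESJsonEncoder]
# index name pattern is illustrative
index = "nginx-%{%Y.%m.%d}"
es_index_from_timestamp = true

[ElasticSearchOutput]
# only ship abnormal requests; this matcher assumes a numeric Fields[status]
message_matcher = "Type == 'nginx.access' && Fields[status] >= 500"
server = "http://127.0.0.1:9200"
encoder = "ESJsonEncoder"
flush_interval = 5000
```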

Other improvements

  1. I recommend not sending emails directly from Heka, otherwise your inbox may get bombarded. Instead, write alarm messages to a separate service over an HTTP interface, where alarm policy, frequency control, and recovery checks can be handled.

  2. Monitor the Heka process and restart it automatically when it dies; otherwise you won't know when it has hung.

Conclusion

The main development work in this solution lies in the Nginx Lua extension and the Heka extensions. The Nginx Lua part is relatively simple; for Heka you need to get familiar with its whole message processing flow and mechanisms, as well as how to develop plug-ins. One pitfall is that Heka's error messages are sparse and debugging is inconvenient, sometimes coming down to guesswork; fortunately the codebase itself is not very complex, and some problems can be understood by reading the source.

Scan the QR code on WeChat and follow me.

I write about back-end technology, front-end engineering, DevOps, the occasional big data topic, and some fun stuff. I hope you like it.

Everything comes from love.