What kind of monitoring really shows that there is something wrong with the system?

No alarms from monitoring, so the system must be fine? What kind of monitoring really shows that something is wrong with the system? Today I want to talk about multi-dimensional monitoring.

What is multi-dimensional monitoring?

Most companies already have some automated monitoring in place, for example:

(1) HTTP interface monitoring;

(2) Log keyword monitoring;

(3) Operating system, process and port;

(4) HTTP status code;

(5) Service liveness;

(6) Interface processing time;

(7) RPC interface monitoring;

(8) User-level monitoring;

If only one or a few dimensions are monitored:

(1) When monitoring reports an anomaly, we can be fairly sure that the system has a problem;

(2) Conversely, when monitoring reports no anomaly, we cannot be sure that the system is free of problems;

For example:

(1) If operating system monitoring shows the CPU at 100%, the system very likely has a problem. A normal CPU, however, does not mean the system is normal: if Tomcat hangs, the CPU will look normal and operating system monitoring will not detect it, so process, port, liveness, and other monitoring is needed to assist;

(2) If process or port monitoring reports an anomaly, the system very likely has a problem. A running process and a listening port, however, do not mean the system is normal: when the program deadlocks, the process and port both look normal, so interface processing time and other monitoring is needed to assist;

(3) If interface processing time monitoring reports timeouts, the system very likely has a problem. The absence of timeouts, however, does not mean the system is normal: if the database is down and no database connection can be obtained, every interface in the service layer returns quickly without timing out;

The point here is: single-dimension monitoring easily misses problems; multi-dimensional monitoring is the fundamental approach for a monitoring platform.
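The idea can be sketched as a small aggregator that runs one check per dimension and reports every dimension that failed. This is only an illustration: the check names and the lambda stand-ins are hypothetical, and a real platform would plug in actual probes.

```python
from typing import Callable, Dict, List


def failed_dimensions(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check and return the names of the dimensions that failed."""
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            # A check that crashes is treated as a failure of that dimension.
            ok = False
        if not ok:
            failed.append(name)
    return failed


# Hypothetical stand-ins for a deadlocked service: CPU, process and port
# all look normal, and only the interface-time dimension reveals the fault.
deadlocked_service = {
    "cpu": lambda: True,
    "process_and_port": lambda: True,
    "interface_time": lambda: False,
}
```

With these stand-ins, `failed_dimensions(deadlocked_service)` reports only `interface_time`, which is exactly the fault that CPU and port monitoring alone would have missed.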

The two articles introduced earlier:

How to Get HTTP Monitoring Done in 12 Hours

How to Monitor Logs in 12 Hours

both emphasized generality and extensibility in their design.

The four monitoring dimensions below are likewise designed to be “universal” and “non-invasive”: the monitored sites and services need no instrumentation or modification, and the owners of the monitored modules do not have to do anything for the whole system to be covered.

Dimension 1: How to monitor operating systems, processes, and ports?

Monitoring requirements:

(1) Whether the network bandwidth is saturated, whether the disk has free space, whether the CPU is busy, whether memory is exhausted, whether the load is too high, and whether the JVM is healthy;

(2) Whether the service process is running;

(3) Whether the listening port is normal;

(4) Whether the machines are reachable;

Common scheme 1: Zabbix

Anyone who works in operations knows it well, so I will not go into detail; say too much and I might get flamed.

Common scheme 2: Shell

By writing a few simple scripts, you can collect network, disk, CPU, memory, load, and JVM information. Add some threshold configuration and you have alarms for values that exceed their thresholds.
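The original scheme uses shell scripts; the same threshold idea is sketched here in Python for illustration. The metric names and threshold values are hypothetical, and `os.getloadavg` is available on Unix-like systems only.

```python
import os
import shutil

# Hypothetical threshold configuration: metric name -> maximum allowed value.
THRESHOLDS = {"disk_used_pct": 90.0, "load_1m": 8.0}


def collect_metrics(path="/"):
    """Collect disk usage and 1-minute load average (Unix-like systems only)."""
    usage = shutil.disk_usage(path)
    load_1m, _, _ = os.getloadavg()
    return {
        "disk_used_pct": 100.0 * usage.used / usage.total,
        "load_1m": load_1m,
    }


def over_threshold(metrics, thresholds):
    """Return the metrics whose value exceeds the configured threshold."""
    return {
        name: value
        for name, value in metrics.items()
        if name in thresholds and value > thresholds[name]
    }


if __name__ == "__main__":
    for name, value in over_threshold(collect_metrics(), THRESHOLDS).items():
        print(f"ALARM {name}={value:.1f} exceeds {THRESHOLDS[name]}")
```

Keeping the thresholds in a plain mapping means they can be moved to a configuration file later without touching the collection code.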

If there is a cluster information management service, commands such as ps, netstat, and telnet can quickly provide simple monitoring of processes, ports, and connectivity.
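As a rough Python equivalent of a telnet-style port probe, the sketch below tries a TCP connection to each node. The node list is assumed to come from the cluster information management service or a configuration file; the function names are made up for this example.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable hosts, and timeouts.
        return False


def unreachable_nodes(nodes, timeout=1.0):
    """Return the (host, port) pairs from the cluster list that do not answer."""
    return [(host, port) for host, port in nodes
            if not port_is_open(host, port, timeout)]
```

A successful connection only proves that something is listening on the port; it says nothing about whether the service can actually handle requests.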

Key points:

(1) Focus on extensibility, configurability, and non-invasiveness;

(2) A cluster information management service (or a cluster information configuration file);

Dimension 2: How to monitor the 404 status code?

Monitoring requirements: monitor abnormal HTTP status codes.

Monitoring scheme: monitor the Nginx logs in a unified manner.

If you have already implemented unified monitoring of HTTP interfaces, the need for 404 monitoring is less pressing, but it is simple to implement, and making it general-purpose does not take much time.
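A minimal sketch of log-based status code monitoring follows. It assumes the access log uses Nginx's default `combined` format; the sample lines, the function names, and the choice to count 404 and 5xx as abnormal are illustrative assumptions.

```python
import re
from collections import Counter

# Matches the leading fields of the "combined" format:
# remote_addr - remote_user [time_local] "request" status ...
STATUS_RE = re.compile(r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) ')

# Made-up sample lines for illustration.
SAMPLE = [
    '10.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/8.0"',
    '10.0.0.2 - - [10/Oct/2023:13:55:37 +0000] "GET /missing HTTP/1.1" 404 153 "-" "curl/8.0"',
    '10.0.0.3 - - [10/Oct/2023:13:55:38 +0000] "GET /a HTTP/1.1" 200 99 "-" "curl/8.0"',
    '10.0.0.4 - - [10/Oct/2023:13:55:39 +0000] "POST /b HTTP/1.1" 500 0 "-" "curl/8.0"',
]


def count_statuses(lines):
    """Count HTTP status codes, skipping lines that do not match the format."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.match(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def abnormal_ratio(counts):
    """Fraction of requests that returned 404 or a 5xx status."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    bad = sum(n for status, n in counts.items() if status == 404 or status >= 500)
    return bad / total
```

An alarm can then be raised when `abnormal_ratio` for a time window exceeds a configured threshold, without any change to the sites behind the proxy.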

Before discussing liveness monitoring and interface processing time monitoring, a few more words about system architecture: if the frameworks and components are unified, unified monitoring takes far less effort.

The diagram above shows a typical layered Internet architecture:

(1) The most upstream layer is the APP and the Browser;

(2) The reverse proxy layer is Nginx; unified HTTP 404 status code monitoring is implemented at this layer;

(3) The Web layer, assumed to use a self-developed Web-framework;

(4) The Service layer, assumed to use a self-developed Service-framework; the Web layer calls services through the RPC-client;

(5) The data layer (DB), assumed to be accessed through the daojia-DAO component;

(6) The cache layer, assumed to be accessed through the daojia-KV component.

The D-DAO and D-KV components are not as complex as you might think; at the beginning they were just a thin layer of encapsulation.

Dimension 3: How to monitor service liveness?

Monitoring requirements: process and port monitoring can only confirm that the process and port exist; it cannot tell whether the service can actually respond to requests. We need to make sure that the service is “alive”.

Monitoring scheme: ping-pong monitoring, implemented uniformly at the site framework and service framework level, which provide a keepalive interface:

(1) The ping-pong interface is implemented at the framework level;

(2) The monitoring center obtains the cluster type (Web/Service) and the cluster IP address list from the cluster information management service (or configuration file);

(3) The monitoring center sends the built-in ping-pong request to every node in the cluster;
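The three steps above can be sketched over HTTP as follows. The `/ping` path, the `pong` body, and all names here are assumptions made for illustration, not the interface of any particular framework; an RPC framework would use its own protocol for the same exchange.

```python
import http.client
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Talk to nodes directly, ignoring any proxy configured in the environment.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


class PingHandler(BaseHTTPRequestHandler):
    """Framework side: a built-in keepalive endpoint (hypothetical /ping)."""

    def do_GET(self):
        if self.path == "/ping":
            body = b"pong"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def ping(base_url, timeout=1.0):
    """Monitoring side: True only if the node answers /ping with 'pong'."""
    try:
        with _opener.open(base_url + "/ping", timeout=timeout) as resp:
            return resp.status == 200 and resp.read().strip() == b"pong"
    except (OSError, ValueError, http.client.HTTPException):
        return False


def dead_nodes(cluster, timeout=1.0):
    """Return the nodes from the cluster address list that failed the ping."""
    return [url for url in cluster if not ping(url, timeout)]
```

Because the handler lives in the framework, business code needs no changes; the monitoring center only needs the address list to call `dead_nodes`.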

Two points to emphasize:

(1) If an open-source framework does not provide a ping-pong interface, it can be added through secondary development (be careful: secondary development of any open-source framework is where the pitfalls begin);

(2) A unified cluster information management service, or a unified cluster information configuration file, really matters: it is the cornerstone of a unified technology system;

Dimension 4: How to monitor the interface execution time?

Monitoring requirements:

(1) Whether the HTTP site interface times out;

(2) Whether the RPC service interface times out;

(3) Whether DB access times out;

(4) Whether cache access times out;

(5) Besides timeouts, monitor whether the execution time of the same interface fluctuates significantly compared with the same period earlier or with the immediately preceding period. For example, if the average response time of an interface is normally 100 ms but one day suddenly rises to 300 ms, there is good reason to suspect a problem with the interface even though nothing has timed out.
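Requirement (5) can be sketched as a comparison between a baseline window and the current window. The function name and the ratio of 2.0 are hypothetical choices; the sample values mirror the 100 ms to 300 ms example in the text.

```python
from statistics import mean


def slowed_down(baseline_ms, current_ms, max_ratio=2.0):
    """Return True when the current average exceeds the baseline average
    by more than max_ratio, even if no single call timed out."""
    if not baseline_ms or not current_ms:
        return False  # not enough data to compare
    base = mean(baseline_ms)
    if base <= 0:
        return False
    return mean(current_ms) / base > max_ratio


# Baseline averages 100 ms; the current window averages 300 ms.
BASELINE = [100, 98, 102]
CURRENT = [300, 310, 290]
```

Here `slowed_down(BASELINE, CURRENT)` is true because the average tripled, while a window that stays near 100 ms would not trigger.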

Monitoring scheme: the framework components report data in a unified manner (points 1, 2, 3, and 4 in figure 1).

(1) In the Web-framework, data for all HTTP interfaces can be reported, including the URL, parameters, execution time, and other core data;

(2) In the Service-framework, data for all RPC interfaces can be reported, including the interface, parameters, execution time, and other core data;

(3) In the DAO, data for all database SQL access can be reported, including the SQL, parameters, execution time, and other core data;

(4) In the KV-client, data for all cache access can be reported, including the key, execution time, and other core data;

Unified reporting is the idea. For the implementation, you can use Flume to collect logs, or Storm/Spark to process the data as a real-time stream.
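One way a framework component might report execution time without touching business code is a timing wrapper applied by the framework itself. This is a sketch under assumptions: the `reported` decorator, the record fields, and the in-memory `records` sink are invented for the example; in practice the sink would write a log line or push to a stream.

```python
import functools
import time


def reported(kind, sink):
    """Wrap a callable so that each call reports its execution time to sink."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Report even when the call raises, so failures are timed too.
                sink({
                    "kind": kind,              # e.g. "http", "rpc", "sql", "kv"
                    "name": func.__name__,
                    "args": args,
                    "elapsed_ms": (time.perf_counter() - start) * 1000.0,
                })
        return wrapper
    return decorator


records = []  # stand-in sink for the example


@reported("rpc", records.append)
def get_user(user_id):
    return {"id": user_id}
```

After a call such as `get_user(7)`, `records` holds one entry with the interface name, its arguments, and the elapsed time, which is the core data the text lists for each layer.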

Conclusion

Monitoring is a skill:

(1) The guiding idea of a monitoring platform is multi-dimensional monitoring;

(2) The core of unified monitoring for the four categories “operating system, HTTP 404, service liveness, and interface processing time” is being “non-invasive”: a technology platform that delivers many functions without requiring anyone to modify anything is a good technology platform;

(3) A unified cluster information management service, a unified personnel information management service, and a unified alarm policy service (or configuration files) are the cornerstones of a unified technology system;

The Architect’s Path: sharing technical ideas

The thinking matters more than the conclusion; I hope you found this useful.

Survey

Which aspects does monitoring cover at your company?
