
The Four Golden Signals of Prometheus monitoring

Introductory material on Prometheus describes two basic objectives of monitoring: first, to detect problems in time, and second, to locate them quickly. With traditional monitoring solutions, the system remains a black box to its users, who cannot see its real running state. Prometheus therefore encourages users to monitor everything.

The Four Golden Signals are Google's distillation of its experience monitoring large-scale distributed systems. Measured at the service level, they help quantify end-user experience, service interruptions, and business impact. They focus on four kinds of metrics: latency, traffic, errors, and saturation:

  • Latency: the time it takes to serve a request.

Track the latency of all user requests, and make a point of separating the latency of successful requests from that of failed ones. For example, when an HTTP 500 is triggered by a failure in a database or another critical backend service, the user may still receive the failed response very quickly; if these requests are folded indiscriminately into the latency calculation, the result can diverge significantly from what users actually experience. In addition, although microservices often advocate "fail fast", developers should pay special attention to errors with high latency: these slow errors drag down the performance of the whole system, so their latency is important to track.

  • Traffic: a measure of the demand currently placed on the system, used to gauge the capacity requirements of a service.

Traffic means different things for different types of systems. For an HTTP REST API, for example, traffic is typically measured in HTTP requests per second; for a streaming system it might be concurrent sessions or network I/O.

  • Errors: the rate at which requests fail in the current system.

Failures can be explicit (e.g., an HTTP 500 response) or implicit (e.g., an HTTP 200 response whose body indicates that the business operation actually failed).

Explicit errors such as HTTP 500s can be caught at a load balancer (such as Nginx), while internal exceptions may need to be counted by the service itself, for example through statistics hooks added in the application code.

  • Saturation: how "full" the service currently is.

Focus on the constrained resource that most affects the state of the service: if the system is primarily memory-bound, watch its memory state; if it is primarily limited by disk I/O, watch its disk I/O state. Typically, service performance degrades significantly as such a resource approaches saturation. Saturation can also be used for prediction, such as whether a disk is likely to fill up within four hours.
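The disk prediction mentioned above can be expressed directly in PromQL with the `predict_linear()` function. A minimal sketch, assuming the node_exporter metric `node_filesystem_avail_bytes` is available (the metric and label names may differ in your setup):

```promql
# Fires when the filesystem is predicted to run out of space within 4 hours,
# based on a linear extrapolation of the last hour of samples
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4 * 3600) < 0
```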

The Four Golden Signals for Spring Cloud microservices

Traffic (QPS)

sum(rate(http_server_requests_seconds_count{application="$application", instance="$instance"}[1m]))
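To see where that traffic goes, the same counter can be broken down by endpoint. A sketch, assuming the default `uri` tag that Micrometer attaches to `http_server_requests_seconds_count`:

```promql
# Requests per second for each endpoint, showing the 5 busiest
topk(5, sum by (uri) (rate(http_server_requests_seconds_count{application="$application", instance="$instance"}[1m])))
```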

Errors

Count of HTTP 5xx status codes

sum(rate(http_server_requests_seconds_count{application="$application", instance="$instance", status=~"5.."}[1m]))
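An absolute 5xx count is hard to judge on its own; for alerting, an error ratio is often more useful. A sketch dividing the 5xx rate by the total request rate, using the same metric and labels as above:

```promql
# Fraction of requests in the last minute that returned a 5xx status
sum(rate(http_server_requests_seconds_count{application="$application", instance="$instance", status=~"5.."}[1m]))
/
sum(rate(http_server_requests_seconds_count{application="$application", instance="$instance"}[1m]))
```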

Latency

sum(rate(http_server_requests_seconds_sum{application="$application", instance="$instance", status!~"5.."}[1m])) / sum(rate(http_server_requests_seconds_count{application="$application", instance="$instance", status!~"5.."}[1m]))
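The query above yields the mean latency, which can hide slow outliers. If the application also publishes histogram buckets (in Spring Boot this is typically enabled via `management.metrics.distribution.percentiles-histogram.http.server.requests=true`; whether your setup does so is an assumption here), a percentile can be computed with `histogram_quantile()`:

```promql
# 99th percentile request latency over the last 5 minutes,
# excluding 5xx responses, computed from histogram buckets
histogram_quantile(0.99, sum by (le) (rate(http_server_requests_seconds_bucket{application="$application", instance="$instance", status!~"5.."}[5m])))
```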

Saturation

Monitor the Tomcat/Jetty thread pools

# tomcat
A: tomcat_threads_busy_threads{application="$application", instance="$instance"} 
B: tomcat_threads_current_threads{application="$application", instance="$instance"} 
C: tomcat_threads_config_max_threads{application="$application", instance="$instance"} 
# jetty
D: jetty_threads_busy{application="$application", instance="$instance"} 
E: jetty_threads_current{application="$application", instance="$instance"} 
F: jetty_threads_config_max{application="$application", instance="$instance"}
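The raw thread counts above (the A–F labels are Grafana query references) are easier to interpret as a utilization ratio, which approaches 1 as the pool saturates. A sketch for Tomcat; the Jetty variant is analogous with the `jetty_threads_*` metrics:

```promql
# Fraction of the Tomcat thread pool that is busy; values near 1 indicate saturation
tomcat_threads_busy_threads{application="$application", instance="$instance"}
/
tomcat_threads_config_max_threads{application="$application", instance="$instance"}
```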

(Figure: monitoring dashboard)