Mac APM Monitoring on Android Based on Grafana + Prometheus + Pushgateway (3)

The previous article, Implementation of APM Monitoring on Android Based on Grafana + Prometheus + Pushgateway (2), covered setting up the service environment with Grafana, Prometheus, and Pushgateway, simple data uploading, and basic panel configuration. Because APM mainly collects data on the client, encapsulating and collecting that data on the client is just as important. The network library and the protocol encapsulation can be shared between Android and iOS, so this part uses KMM as the cross-platform solution. The overall architecture is shown in the diagram below:

The KMM layer mainly uses the cross-platform solution introduced by Kotlin, which is particularly well suited to SDKs that have no UI. The KMM framework is not covered in detail here; if you are interested, see the official site kotlinlang.org/docs/mobile… . The purpose of using KMM is that the network layer, the protocol layer, and the SDK's interface to the business layer can all share one set of code.

APM Platform – Data reporting

In parts 1 and 2, data is reported through Pushgateway, and Prometheus then pulls it into its own storage. Reported data therefore has to follow the Prometheus data format, which is documented at prometheus.io/docs/instru…

Prometheus defines four metric types for the data it receives:

  1. Counter

    The Counter type represents a cumulative metric whose value only increases and never decreases (apart from resetting to zero on restart). Scenarios such as request counts and error counts are well suited to Counter.

  2. Gauge

    The Gauge type represents a value that can change arbitrarily, going up or down. Typical system-level uses are CPU and memory statistics. In business scenarios, the length of a service queue can also be tracked with a Gauge, so that the queue size is measured in real time and a backlog is detected early.

  3. Histogram (cumulative histogram)

    The Histogram type samples observations over a period of time (typically request durations, response sizes, and so on) and counts them in configurable buckets. Samples can later be filtered by a given range, or counted in total.

    Histogram is very commonly used because it captures statistics grouped by interval. Link tracing is essential in distributed systems, and so is analyzing each link separately. For example, computing P90, P95, and P99 for RPC, SQL, HTTP, and Redis calls and alerting on them makes slow links visible quickly, so the impact of third-party systems can be found and reduced.

  4. Summary

    The Summary type also samples data over a period of time, but unlike Histogram, which stores per-bucket counts, it stores quantiles directly, calculated on the client. A Summary exposes three kinds of values: the quantile distribution of the sample values, the sum of all sample values, and the total number of samples.
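The difference between the first three types is easier to see in code. The following is a minimal sketch of their semantics only, not the Prometheus client library; the class and method names are hypothetical and chosen for illustration.

```python
import bisect


class Counter:
    """Value that only goes up; negative increments are rejected."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """Value that can be set to anything, so it may go up or down."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations per upper bound and exposes cumulative buckets."""

    def __init__(self, upper_bounds):
        # A final +Inf bucket catches every observation above the last bound.
        self.upper_bounds = sorted(upper_bounds) + [float("inf")]
        self.counts = [0] * len(self.upper_bounds)
        self.total = 0.0

    def observe(self, value):
        # bisect_left finds the first bucket whose upper bound is >= value.
        self.counts[bisect.bisect_left(self.upper_bounds, value)] += 1
        self.total += value

    def cumulative(self):
        # Each "le" bucket includes all observations from the smaller buckets.
        running, out = 0, []
        for bound, count in zip(self.upper_bounds, self.counts):
            running += count
            out.append((bound, running))
        return out
```

The cumulative form is what makes the `le` ("less than or equal") buckets in the text format add up: the `+Inf` bucket always equals the total sample count.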

The request body must follow this data format when reporting:

# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027 1395066363000
http_requests_total{method="post",code="400"} 3 1395066363000
# Escaping in label values:
msdos_file_access_time_seconds{path="C:\\DIR\\FILE.TXT",error="Cannot find file:\n\"FILE.TXT\""} 1.458255915e9
# Minimalistic line:
metric_without_timestamp_and_labels 12.47
# A weird metric from before the epoch:
something_weird{problem="division by zero"} +Inf -3982045
# A histogram, which has a pretty complex representation in the text format:
# HELP http_request_duration_seconds A histogram of the request duration.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 24054
http_request_duration_seconds_bucket{le="0.1"} 33444
http_request_duration_seconds_bucket{le="0.2"} 100392
http_request_duration_seconds_bucket{le="0.5"} 129389
http_request_duration_seconds_bucket{le="1"} 133988
http_request_duration_seconds_bucket{le="+Inf"} 144320
http_request_duration_seconds_sum 53423
http_request_duration_seconds_count 144320
# Finally a summary, which has a complex representation, too:
# HELP rpc_duration_seconds A summary of the RPC duration in seconds.
# TYPE rpc_duration_seconds summary
rpc_duration_seconds{quantile="0.01"} 3102
rpc_duration_seconds{quantile="0.05"} 3272
rpc_duration_seconds{quantile="0.5"} 4773
rpc_duration_seconds{quantile="0.9"} 9001
rpc_duration_seconds{quantile="0.99"} 76656
rpc_duration_seconds_sum 1.7560473e+07
rpc_duration_seconds_count 2693

As the format above shows, each metric begins with a fixed header:

# HELP <metric name> <description>
# TYPE <metric name> <type> (one of the four types mentioned above)

The sample lines that follow the header differ by metric type. For example, the most commonly used Gauge type requires the following format:

metric_name [
  "{" label_name "=" `"` label_value `"` { "," label_name "=" `"` label_value `"` } [ "," ] "}"
] value
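The grammar above can be turned into a small formatter. This is a minimal sketch under the assumption that label values need the three escapes used in the text format (backslash, double quote, line feed); the function names are hypothetical and are not taken from the article's KMM code.

```python
def escape_label_value(value: str) -> str:
    # Escape the backslash first so the escapes added below are not doubled.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name: str, labels: dict, value) -> str:
    # A sample without labels is just "<name> <value>".
    if not labels:
        return f"{name} {value}"
    pairs = ",".join(
        f'{key}="{escape_label_value(str(val))}"' for key, val in labels.items()
    )
    return f"{name}{{{pairs}}} {value}"


def format_gauge(name: str, help_text: str, labels: dict, value) -> str:
    # Header lines first, then the sample; the body ends with a line feed.
    lines = [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} gauge",
        format_sample(name, labels, value),
    ]
    return "\n".join(lines) + "\n"
```

For example, `format_gauge("StartupGauge", "Application startup time statistics", {"appname": "ApmDemo"}, 6837)` yields the two header lines followed by one sample line.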

Taking the startup-time performance metric as an example, the body of the final report looks like this:

# HELP StartupGauge Application startup time statistics
# TYPE StartupGauge gauge
StartupGauge{appid="12345678",appinfo="1.0",application_create="6090",application_create_scene="159",appname="ApmDemo",cType="android",instance="172.26.177.179",is_warm_start_up="false",job="ApmCollect",machine="MIDDLE",mobileinfo="samsung_SM-G9600",nt="0",osver="10",splash_activity_duration="",stage_between_app_and_activity="",startup_duration="6837",time_interval="5-7s",uid="1234",uuid="f23ed8f4-0093-4b3b-af14-7cf559482efb"} 6837
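A body like this is sent to Pushgateway with a plain HTTP request whose URL path carries the job name and any grouping labels. The sketch below only illustrates that request shape using the Python standard library; the article's real client is the KMM network layer, and the base URL `http://localhost:9091` (Pushgateway's default port) and the function names are assumptions.

```python
import urllib.parse
import urllib.request


def build_push_request(base_url: str, job: str, grouping: dict,
                       body: str) -> urllib.request.Request:
    # Pushgateway groups metrics by the URL path:
    # /metrics/job/<job>/<label>/<value>/...
    path = "/metrics/job/" + urllib.parse.quote(job, safe="")
    for label, value in grouping.items():
        path += f"/{label}/" + urllib.parse.quote(str(value), safe="")
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=body.encode("utf-8"),
        # POST replaces only same-named metrics in the group; PUT replaces all.
        method="POST",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


def push(request: urllib.request.Request, timeout: float = 5.0) -> int:
    # Performs the network call and returns the HTTP status code.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Building the request separately from sending it keeps the URL and body easy to inspect before anything goes over the network. Grouping values that contain a `/` need extra care and are outside the scope of this sketch.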

For the other metric types, refer to the format on the official website.

The details of the corresponding KMM network and protocol encapsulation are not covered here; the code is available at github.com/dengqu/KMMN…

APM Platform – Alarm configuration

The sections above covered data collection and reporting, data visualization, and panel configuration. The last topic is alert configuration, so that anomalies in the performance metrics you care about are noticed in time. First, go to the Grafana home page.

Click on Notification Channels and select New Channel

Once inside, you can set the name and the type of the alert notification. There are many types to choose from, such as Email, DingTalk, and Prometheus Alertmanager. Grafana does not directly support Enterprise WeChat, so select Prometheus Alertmanager, the alerting component of the Prometheus service (its setup was described in an earlier part), and configure the Enterprise WeChat alerts there. To connect Enterprise WeChat, follow the process at www.cnblogs.com/miaocbin/p/… : go to the Enterprise WeChat account and apply for the ID of the alert group. After obtaining the ID, go to the Prometheus Alertmanager directory and edit the alertmanager.yml file:

global:
  resolve_timeout: 1m   # check once every minute whether the alert has recovered
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
  wechat_api_corp_id: 'ww9ba8cb7b4AD14e66'   # Enterprise WeChat corp ID
  wechat_api_secret: 'Buhe4HSBhBIGr9Rx2j_OIaQRGqmu7e_zEiVlk_GdyMc'   # Enterprise WeChat secret
templates:
  - '/Users/dengqu/Downloads/alertmanager-0.23.0-rc.0.darwin-amd64/template/*.tmpl'
route:
  receiver: 'wechat'
  group_by: ['env','instance','type','group','job','alertname']
  group_wait: 10s
  # group_interval:
  repeat_interval: 5m   # alert resend interval
receivers:
  - name: 'wechat'
    wechat_configs:
      - send_resolved: true
        message: '{{ template "wechat.default.message" . }}'
        to_party: '2'   # Enterprise WeChat department ID
        agent_id: '1000002'   # Enterprise WeChat app ID
        api_secret: 'Buhe4HSBhBIGr9Rx2j_OIaQRGqmu7e_zEiVlk_GdyMc'   # Enterprise WeChat app secret

The above is the Enterprise WeChat configuration. The templates entry points to the alert templates: in the /Users/dengqu/Downloads/alertmanager-0.23.0-rc.0.darwin-amd64/template directory, create a new wechat.tmpl alert template:

{{ define "grafana.default.message" }}{{ range .Alerts }}
{{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ range .Annotations.SortedPairs }}{{ .Name }} = {{ .Value }}
{{ end }}{{ end }}{{ end }}

{{ define "wechat.default.message" }}
{{ if eq .Status "firing"}}[Warning]:{{ template "grafana.default.message" . }}{{ end }}
{{ if eq .Status "resolved" }}[Resolved]:{{ template "grafana.default.message" . }}{{ end }}
{{ end }}

Restart the Prometheus Alertmanager service. At this point the alerting service is set up, and Grafana is configured to connect to Prometheus Alertmanager. To configure alerts for a metric you want to monitor, open that metric's panel editing page, select Alert, and configure a rule, for example one that fires when my blog visits fall below 50. Then select the notification channel you just created, set the message, and click Test Rule.

When the alert fires, the notification arrives in Enterprise WeChat.
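To check the Alertmanager-to-WeChat route without waiting for a real threshold breach, a hand-made alert can be posted to Alertmanager's v2 HTTP API. The following is a minimal sketch, assuming Alertmanager listens on its default port 9093; the alert name, label values, and function name are hypothetical.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(alertmanager_url: str, labels: dict, summary: str,
                     minutes: int = 5) -> urllib.request.Request:
    now = datetime.now(timezone.utc)
    alert = {
        "labels": labels,  # "alertname" is the conventional identifying label
        "annotations": {"summary": summary},
        "startsAt": now.isoformat(),
        # After endsAt passes without a refresh, the alert counts as resolved.
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }
    return urllib.request.Request(
        alertmanager_url.rstrip("/") + "/api/v2/alerts",
        data=json.dumps([alert]).encode("utf-8"),  # the API takes a JSON list
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```

Passing the returned request to `urllib.request.urlopen` sends the alert. With the configuration above, the annotations then appear in the WeChat message as `name = value` lines rendered by the template.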

The alerting section is only a brief description; more detailed configuration can be found by searching online. With alerting in place, the APM platform forms a closed loop, and this APM series comes to an end. The client-side collection uses Matrix, Tencent's open-source framework, and its source code is worth reading as well.