The Prometheus monitoring system: environment setup & practice

1. Prepare and install components

The following software is installed with Docker. For the Docker commands used here, see "Docker: from getting started to practice".

1.1 Prometheus + ClickHouse installation

#Prometheus installed and started at localhost:9090
docker pull prom/prometheus
docker run -d -p 9090:9090 prom/prometheus
#Or specify the configuration path
docker run -d -p 9090:9090 -v /home/webedit/monitoring/prometheus/config:/etc/prometheus prom/prometheus

#Alertmanager is installed and started at localhost:9093
docker pull prom/alertmanager
docker run --name alertmanager -d -p 9093:9093 quay.io/prometheus/alertmanager

#Grafana is installed and started at the address localhost:3000
docker pull grafana/grafana
docker run -d -p 3000:3000 grafana/grafana

#Pushgateway installed and started with the address localhost:9091
docker pull prom/pushgateway
docker run -d -p 9091:9091 prom/pushgateway

#ClickHouse is installed and started for Grafana to read from. It is reachable at <IP>:8123 and is used to test the Grafana + ClickHouse integration
docker pull yandex/clickhouse-server
docker pull yandex/clickhouse-client
docker run -d --name clickhouse-server -p 8123:8123 -p 9009:9009 -p 9000:9000 --ulimit nofile=262144:262144 --volume=/home/webedit/monitoring/clickhouse/clickhouse-db:/var/lib/clickhouse yandex/clickhouse-server
#Use clickhouse-client connection
docker run -it --rm --link clickhouse-server:clickhouse-server yandex/clickhouse-client --host clickhouse-server

#After the above components are installed and started, visit each address first to check that nothing is wrong
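If you prefer the command line to the browser, here is a rough sketch of the same check (endpoints assume the default ports mapped above; adjust hosts to your environment):

#Prometheus readiness
curl -s http://localhost:9090/-/ready
#Alertmanager readiness
curl -s http://localhost:9093/-/ready
#Grafana health endpoint
curl -s http://localhost:3000/api/health
#Pushgateway metrics page
curl -s http://localhost:9091/metrics | head
#ClickHouse HTTP ping, should answer "Ok."
curl -s http://localhost:8123/ping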

1.2 Installing the Grafana plug-in

#Grafana does not ship with a ClickHouse data source plug-in by default, so it needs to be installed
#Enter the grafana container, use the grafana-cli plugins install command to install the plug-in, and then restart the container to take effect
docker exec -it `docker container ls | grep grafana | awk -F' ' '{print $1}'` bash
grafana-cli plugins install vertamedia-clickhouse-datasource
docker restart `docker container ls | grep grafana | awk -F' ' '{print $1}'`

Add a data source in Grafana and select ClickHouse. With the default settings, you only need to fill in the URL (http://<IP>:8123) and the user (default).
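Optionally, sanity-check both pieces from the command line first, as a rough sketch (replace <IP> with your host):

#List installed plug-ins inside the Grafana container
docker exec -it `docker container ls | grep grafana | awk -F' ' '{print $1}'` grafana-cli plugins ls
#Run a trivial query against the ClickHouse HTTP interface
curl -s 'http://<IP>:8123/' --data-binary 'SELECT 1'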

2. Prometheus use cases

2.1 Case 1: System monitoring with Node-Exporter

  • Start Node-Exporter on the machine you want to monitor
#After following the above process, you are ready to monitor some indicators.
#Here's another official example: Monitoring Linux host metrics with the Node Exporter
docker pull prom/node-exporter
docker run -d -p 9100:9100 \
  -v "/proc:/host/proc:ro" \
  -v "/sys:/host/sys:ro" \
  -v "/:/rootfs:ro" \
  --net="host" \
  --restart=always \
  --name node-exporter \
  prom/node-exporter
  • Configure metric scraping in prometheus.yml (Prometheus, Alertmanager, and Pushgateway also expose their own metrics, so scrape jobs are added for them too). A quick sanity check is sketched after this list.
#Prometheus scrape configuration

#Global configuration
global:
  #Scrape interval
  scrape_interval:     15s
  #Default scrape timeout [scrape_timeout: <duration> | default = 10s]
  #Rule evaluation interval. Default 1 minute [evaluation_interval: <duration> | default = 1m]
  #Labels attached to any time series or alerts when communicating with external systems (such as Alertmanager)
  external_labels:
    monitor: 'gucha-monitor'
#Rule file list
rule_files:
  - 'prometheus.rules.yml'
#Scrape configuration list
scrape_configs:
  #Prometheus monitors its own metrics
  - job_name: 'prometheus'
    #Per-job scrape interval, overrides the global setting
    scrape_interval: 5s
    static_configs:
      - targets: ['10.224.192.113:9090']
  #Node-exporter
  - job_name: 'node'
    #Per-job scrape interval, overrides the global setting
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:8080', 'localhost:8081']
        labels:
          group: 'not_exist'
      - targets: ['10.224.192.113:9100']
        labels:
          group: 'node_demo'
  #Pushgateway
  - job_name: 'pushgateway'
    static_configs:
      - targets: ['10.224.192.113:9091']
        labels:
          instance: Pushgateway
  #Alertmanager
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['10.224.192.113:9093']
        labels:
          instance: AlertManager
  • Now let's combine Node-Exporter + Grafana to get a nice system-metrics monitoring dashboard
1. Find a nice dashboard template on the Grafana website: [dashboards](https://grafana.com/grafana/dashboards). Click on a template you like and you will see its template ID. Here I picked template 8919: [the template](https://grafana.com/grafana/dashboards/8919)
2. Import the template in Grafana, select the Prometheus data source, and you have a nice-looking dashboard
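Before importing the dashboard it can help to confirm that node-exporter is really being scraped. A minimal check, assuming the addresses used above:

#Node-exporter exposes metrics directly on :9100
curl -s http://10.224.192.113:9100/metrics | grep node_cpu | head
#Ask Prometheus whether the 'node' job targets are up (value 1 means healthy)
curl -s -g 'http://10.224.192.113:9090/api/v1/query?query=up{job="node"}'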

2.2 Case 2: Application monitoring with clickhouse-exporter

  • Start clickhouse-exporter with the ClickHouse URL, and add a scrape job (a quick check is sketched after this list)
#Start clickhouse-exporter
docker pull f1yegor/clickhouse-exporter
docker run -d -p 9116:9116 f1yegor/clickhouse-exporter -scrape_uri=http://10.224.192.113:8123/
#Add the scrape job to prometheus.yml
- job_name: 'clickhouse-metrics'
  static_configs:
    - targets: ['10.224.192.113:9116']
      labels:
        instance: clickhouse
  • Find a nice dashboard template on the Grafana website with id: 882
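A minimal check that the exporter is up and pulling data from ClickHouse, assuming the addresses used above:

#The exporter's own /metrics endpoint; ClickHouse metrics should appear here once scraping works
curl -s http://10.224.192.113:9116/metrics | head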

2.3 Case 3: Monitoring SpringBoot Based on Eureka

  • Add service discovery to prometheus.yml (a verification sketch follows this list)

      Use eureka_sd_configs for service discovery. Cons: Requires configuration changes in the project
      - job_name: 'eureka'
        metrics_path: '/admin/prometheus'
        # Scrape Eureka itself to discover new services.
        eureka_sd_configs:
          - server: http://10.224.192.92:1111/eureka
        relabel_configs:
          - source_labels: [__address__, __meta_eureka_app_instance_metadata_prometheus_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
  • Import the dependencies into the SpringBoot project

    <!-- SpringBoot project dependencies: third-party open source toolkit -->
    <dependency>
      <groupId>io.micrometer</groupId>
      <artifactId>micrometer-registry-prometheus</artifactId>
      <version>1.1.2</version>
    </dependency>
    <dependency>
      <groupId>io.micrometer</groupId>
      <artifactId>micrometer-core</artifactId>
      <version>1.1.2</version>
    </dependency>
  • Specify the application name used to tag metrics

    # Specify the application name used to tag all metrics
    management.metrics.tags.application=${spring.application.name}
  • Expose the management port in the Eureka instance metadata (SpringBoot may expose a management port that differs from the service port, so service discovery needs to know which port to scrape)

    eureka:
      instance:
        metadataMap:
          "prometheus.port": "${management.server.port}"
  • An alternative configuration: service discovery with consul_sd_configs [not practiced]

      - job_name: 'consul-prometheus'
        scheme: http
        metrics_path: '/admin/prometheus'
        consul_sd_configs:
        # consul address
          - server: '10.224.192.92:1111'
            scheme: http
            services: [SPRINGBOOT_PROMETHEUS_CLIENT]
  • Look for a nice dashboard template on the Grafana website, such as ID: 11378
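Once the Eureka job is in place, you can verify what Prometheus actually discovered and that the application's metrics endpoint answers. A sketch, where <app-host> and <management-port> are placeholders for your own service:

#List the active targets Prometheus discovered (look for the 'eureka' job)
curl -s 'http://10.224.192.113:9090/api/v1/targets?state=active'
#Hit the SpringBoot management endpoint directly
curl -s http://<app-host>:<management-port>/admin/prometheus | head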

3. Use case for Prometheus and AlertManager

  • Configure the alert rules in prometheus.rules.yml (validation commands are sketched after this list)

    1. For 10.224.192.113:9100, when the disk write rate over the past 30 seconds exceeds 60, the generated alert carries labels such as severity: slight to make later grouping easier.

    2. For 10.224.192.113:9100, when the disk write rate over the past 30 seconds exceeds 70 and this lasts for 1 minute, the generated alert carries labels such as severity: critical.

    3. For other configurations, see the comments

groups:
- name: example
  rules:
  #The alert name
  - alert: node_disk_reads_completed_total
    #promQL: 10.224.192.113:9100 disk write rate over the past 30s exceeds 60
    expr: rate(node_disk_writes_completed_total{instance="10.224.192.113:9100"}[30s]) > 60
    #Labels attached to the alert
    labels:
      severity: slight
    #Alert template
    annotations:
      summary: "summary test {{ $labels.instance }} slight"
      description: "{{ $labels.instance }} rate(node_disk_writes_completed_total) = {{ $value }}"
  - alert: node_disk_reads_completed_total
    #promQL: 10.224.192.113:9100 disk write rate over the past 30s exceeds 70
    expr: rate(node_disk_writes_completed_total{instance="10.224.192.113:9100"}[30s]) > 70
    #Duration: 1 minute
    for: 1m
    #Labels attached to the alert
    labels:
      severity: critical
    annotations:
      summary: "summary test {{ $labels.instance }} critical"
      description: "{{ $labels.instance }} rate(node_disk_writes_completed_total) = {{ $value }}"
  • Alertmanager configuration alertmanager.yml
global:
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'FAMAPDNAOUPJMMOV'
  smtp_require_tls: false
route:                                          # Every incoming alert enters the root route
  receiver: 'team-b'                            # The root route must not have any matchers, because it is the entry point for all alerts
  group_by: ['alertname', 'instance']           # Labels used to group incoming alerts, e.g. alerts with the same alertname and instance values end up in the same group
  group_wait: 30s                               # How long to wait before sending the initial notification when an incoming alert creates a new group
  group_interval: 5m                            # How long to wait before sending a notification about new alerts added to an already firing group
  repeat_interval: 3h                           # If a notification was sent successfully, how long to wait before re-sending it
  routes:                                       # Child routes inherit all attributes of the parent route
  - match_re:                                   # Regular-expression match on the alert labels to catch the relevant alerts
     severity: ^(slight|critical)$
    receiver: team-a
    routes:                                     # Critical alerts go to team-b; anything not matched by a child route is sent via the parent route's receiver
    - match:
        severity: critical
      receiver: team-b
# Inhibition rules silence a set of alerts when another alert is already firing. Here, if the same alert is already critical, any slight-level notification is suppressed
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'slight'
  equal: ['alertname']
# Note: if all label names listed in "equal" are missing from both the source and target alerts, the inhibition rule will also apply!
receivers:
- name: 'team-a'
  email_configs:
  - to: '[email protected],[email protected]'
- name: 'team-b'
  email_configs:
  - to: '[email protected]'
  # Configure webhooks to extend notification methods
  # webhook_configs:
  # - url: 'http://prometheus-webhook-dingtalk.kube-ops.svc.cluster.local:8060/dingtalk/webhook1/send'
  # send_resolved: true
  • Start AlertManager with the configuration, specifying the configuration file
docker run -d -p 9093:9093 \
  -v /home/webedit/monitoring/alertmanager/config/alertmanager.yml:/etc/alertmanager/config.yml \
  --name alertmanager quay.io/prometheus/alertmanager \
  --config.file=/etc/alertmanager/config.yml
  • Associate Prometheus with Alertmanager by adding the following to prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['10.224.192.113:9093']
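It is worth validating both files before reloading, and checking what Alertmanager has received once a rule fires. A sketch using the tools bundled in the official images, assuming prometheus.rules.yml sits in the mounted config directory from section 1.1:

#Validate the rule file with promtool (shipped in the prom/prometheus image)
docker run --rm -v /home/webedit/monitoring/prometheus/config:/etc/prometheus --entrypoint promtool prom/prometheus check rules /etc/prometheus/prometheus.rules.yml
#Validate alertmanager.yml with amtool (shipped in the alertmanager image)
docker exec -it alertmanager amtool check-config /etc/alertmanager/config.yml
#List the alerts Alertmanager currently holds
curl -s http://10.224.192.113:9093/api/v2/alerts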

4. Business metrics monitoring case

  • Demo case based on SpringBoot + Eureka. The packages and configuration to import are the same as in "Monitoring SpringBoot based on Eureka" above
// This code runs periodically; Prometheus Server pulls the gauge values on each scrape.
Metrics.gauge("mp.mkt.automkt.cnt.gauge", ImmutableList.of(new ImmutableTag("jobType", "vipUserMktJob")),
              RandomUtil.randomLong(19000, 25000));
Metrics.gauge("mp.mkt.automkt.cnt.gauge", ImmutableList.of(new ImmutableTag("jobType", "dwUserMktJob")),
              RandomUtil.randomLong(19000, 25000));
  • Add a view (panel) for the metric in Grafana

  • Configure alarm rules. For details, see ~
  # Marketing system service alerts
  - alert: mkt-alert
    # promQL: average number of VIP automated marketing triggers over the past 5 minutes is below 25000
    expr: avg_over_time(mp_mkt_automkt_cnt_gauge{jobType="vipUserMktJob"}[5m]) < 25000
    # Labels added to the alert
    labels:
      severity: slight
      type: automkt
    annotations:
      summary: "The average number of VIP automated marketing triggers in the last 5 minutes is below 25000, severity: slight"
      description: "Average number of VIP automated marketing triggers in the last 5 minutes = {{ $value }}"
  - alert: mkt-alert
    # promQL: average number of VIP automated marketing triggers over the past 5 minutes is below 20000
    expr: avg_over_time(mp_mkt_automkt_cnt_gauge{jobType="vipUserMktJob"}[5m]) < 20000
    # Must hold for 1 minute
    for: 1m
    # Labels added to the alert
    labels:
      severity: critical
      type: automkt
    annotations:
      summary: "The average number of VIP automated marketing triggers in the last 5 minutes is below 20000, severity: critical"
      description: "Average number of VIP automated marketing triggers in the last 5 minutes = {{ $value }}"
  - alert: mkt-alert
    # promQL: average number of triggers for dwUserMktJob over the past 5 minutes is below 200000
    expr: avg_over_time(mp_mkt_automkt_cnt_gauge{jobType="dwUserMktJob"}[5m]) < 200000
    # Must hold for 1 minute
    for: 1m
    # Labels added to the alert
    labels:
      severity: slight
      type: automkt
    annotations:
      summary: "Plus automated marketing data is abnormal, severity: slight"
      description: "Trigger count: {{ $value }}"
  • The alerting configuration is a partial modification of the previous one.
global:
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'FAMAPDNAOUPJMMOV'
  smtp_require_tls: false
route:                                          # Every incoming alert enters the root route
  receiver: 'team-b'                            # The root route must not have any matchers, because it is the entry point for all alerts
  group_by: ['alertname', 'instance']           # Labels used to group incoming alerts: same alertname and instance values end up in the same group
  group_wait: 30s                               # Wait 30 seconds before sending the initial notification, to aggregate alerts of the same group
  group_interval: 5m                            # Interval at which notifications are sent for new alerts added to an already firing group
  repeat_interval: 3h                           # If a notification was sent successfully, how long to wait before re-sending it
  routes:                                       # Child routes inherit all attributes of the parent route
  - match_re:                                   # Regex match on the severity label: slight or critical
     severity: ^(slight|critical)$
    receiver: team-a
    routes:                                     # Critical alerts notify team-b
    - match:
        severity: critical
      receiver: team-b
      routes:                                   # Critical alerts from automated marketing notify team-automkt
      - match:
          alertname: mkt-alert
          type: automkt
        receiver: team-automkt

# For the same alertname, suppress slight alerts if a critical alert is already firing
# Suppression rule
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'slight'
  equal: ['alertname']
# Note: if all label names listed in "equal" are missing from both the source and target alerts, the inhibition rule will also apply!

receivers:
- name: 'team-a'
  email_configs:
  - to: '[email protected],[email protected]'
- name: 'team-b'
  email_configs:
  - to: '[email protected]'
# Responsible contacts for automated marketing
- name: 'team-automkt'
  email_configs:
  - to: '[email protected],[email protected]'
  # Configure webhooks to extend notification methods
  # webhook_configs:
  # - url: '<SMS / POPO webhook>'
  # send_resolved: true
  • Alertmanager alert example. Alerts can be viewed in the Alertmanager web UI, where silence rules can also be configured (a command-line sketch follows).
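For reference, a sketch of checking the business gauge from the command line and creating a silence with amtool instead of the web UI (the comment and duration are just examples):

#Query the custom business gauge from Prometheus
curl -s -g 'http://10.224.192.113:9090/api/v1/query?query=mp_mkt_automkt_cnt_gauge{jobType="vipUserMktJob"}'
#Silence mkt-alert for 2 hours from inside the Alertmanager container
docker exec -it alertmanager amtool silence add alertname=mkt-alert --comment="maintenance" --duration=2h --alertmanager.url=http://localhost:9093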

5. Use case of Pushgateway

<!-- Import the package -->
<dependency>
    <groupId>io.prometheus</groupId>
    <artifactId>simpleclient_pushgateway</artifactId>
    <version>0.10.0</version>
</dependency>
import io.prometheus.client.CollectorRegistry;
import io.prometheus.client.Gauge;
import io.prometheus.client.exporter.PushGateway;

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

// Push a single custom gauge value to the Pushgateway
String url = "10.224.192.113:9091";
CollectorRegistry registry = new CollectorRegistry();
Gauge gauge = Gauge.build("my_custom_metric", "This is my custom metric.").create();
gauge.set(23.12);
gauge.register(registry);
PushGateway pg = new PushGateway(url);
Map<String, String> groupingKey = new HashMap<String, String>();
groupingKey.put("instance", "my_instance");
pg.pushAdd(registry, "my_job", groupingKey);


// Push a gauge with labels; set()/inc()/dec() adjust the value per label set
String url = "10.224.192.113:9091";
CollectorRegistry registry = new CollectorRegistry();
Gauge gauge = Gauge.build("my_custom_metric", "This is my custom metric.").labelNames("app", "date").create();
String date = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date());
gauge.labels("my-pushgateway-test-0", date).set(25);
gauge.labels("my-pushgateway-test-1", date).dec();
gauge.labels("my-pushgateway-test-2", date).dec(2);
gauge.labels("my-pushgateway-test-3", date).inc();
gauge.labels("my-pushgateway-test-4", date).inc(5);
gauge.register(registry);
PushGateway pg = new PushGateway(url);
Map<String, String> groupingKey = new HashMap<>();
groupingKey.put("instance", "my_instance");
pg.pushAdd(registry, "my_job", groupingKey);
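The same push can also be done without the Java client. A sketch with curl against the Pushgateway address used above:

#Push a value for my_custom_metric under job "my_job" and instance "my_instance"
echo "my_custom_metric 23.12" | curl --data-binary @- http://10.224.192.113:9091/metrics/job/my_job/instance/my_instance
#Confirm the Pushgateway now exposes it for Prometheus to scrape
curl -s http://10.224.192.113:9091/metrics | grep my_custom_metric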

Remark:

The prometheus.yml and alertmanager.yml configuration is a bit tedious; Grafana can also be configured for alerting together with Alertmanager.