This article describes how we implemented a Kubernetes monitoring solution featuring high availability, persistent storage, and dynamic adjustment.


Review of previous article:
Kubernetes cluster exception: Kubelet connection to apiserver timeout

Xiaomi’s elastic scheduling platform (Ocean) and container platform are built mainly on the open-source container orchestration platform Kubernetes (K8S for short), and a sound monitoring system is a prerequisite for improving the quality of container services. Unlike a traditional physical host, each container behaves like a host of its own, so the number of system metrics on a single physical node multiplies, and the total number of monitoring metrics is very large (online statistics show 10,000+ metrics per node). In addition, to avoid reinventing the wheel, we wanted to make maximum use of the company’s existing monitoring and alerting system and integrate K8S monitoring and alerting into it. Implementing this monitoring on top of Xiaomi’s existing infrastructure is no small challenge.

When monitoring meets K8S

To simplify container management, K8S encapsulates containers with concepts such as Pod, Deployment, Namespace, and Service. Compared with a traditional cluster, monitoring a K8S cluster is more complex:

(1) There are more monitoring dimensions: in addition to traditional physical-cluster monitoring, there are core service monitoring (apiserver, etcd, etc.), container monitoring, Pod monitoring, Namespace monitoring, and so on.

(2) The monitored objects change dynamically: containers in the cluster are created and destroyed very frequently and cannot be enumerated in advance.

(3) Monitoring metrics grow explosively with the scale of containers, so processing and displaying such a large volume of monitoring data is a challenge.

(4) As the cluster grows dynamically, the monitoring system must be able to scale in and out dynamically.

Beyond the characteristics of K8S cluster monitoring itself, a concrete monitoring solution must also take the company’s internal situation into account:

(1) The K8S clusters provided by the elastic scheduling computing platform currently include the Fusion Cloud container clusters, part of the Ocean clusters, and the CloudML clusters: more than ten clusters and 1,000+ machines in total. Different K8S clusters use different deployment, network, and storage modes, so the monitoring solution must accommodate these differences.

(2) Open-Falcon is the company’s common monitoring and alerting system, with a complete mechanism for data collection, display, and alerting. However, Open-Falcon does not support the pull-based collection model used for K8S. In addition, K8S resources have a natural hierarchy, which means that integrating monitoring data requires strong and flexible aggregation capabilities that Falcon does not provide. Still, we did not want to reinvent the wheel; we needed to make the most of the existing infrastructure to save on development and maintenance costs.

(3) For persistent storage of monitoring data, how to use the company’s internal databases to store monitoring data long term also needs to be considered.

There are several mature solutions for K8S monitoring in the industry:

(1) Heapster/metrics-server + InfluxDB + Grafana

Heapster is the K8S-native cluster monitoring solution (now deprecated in favor of metrics-server). It obtains compute, storage, and network monitoring data from cAdvisor on each node and sends it to external storage such as InfluxDB, which is then visualized with a UI such as Grafana. This solution is simple to deploy, but the data it collects is limited, so it is not suitable for overall monitoring of a K8S cluster. It is only suitable for monitoring the resource usage of individual containers in the cluster, for example as the data source for the K8S Dashboard.

(2) Exporter + Prometheus + Adapter

Prometheus is an open-source system monitoring and alerting framework featuring a multidimensional data model, a flexible and powerful query language, and good performance. Prometheus can collect monitoring metrics through a variety of exporters, such as node-exporter, kube-state-metrics, and cAdvisor, and can dynamically discover objects such as Pods and Nodes in a K8S cluster. Prometheus collects data across these dimensions, aggregates it, and provides alerting; an Adapter can then write the data to remote storage (such as OpenTSDB or InfluxDB) for persistence. However, because data loss is possible, Prometheus is not suitable for scenarios that require the collected data to be 100% accurate, such as real-time monitoring.
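As a rough illustration of this pipeline, the sketch below shows a minimal Prometheus configuration that discovers an exporter through the K8S API and forwards samples to a remote-storage adapter. The job name, label match, and adapter URL are illustrative assumptions, not the actual configuration:

    global:
      scrape_interval: 30s
    scrape_configs:
      # Pull node-level metrics from node-exporter endpoints discovered via the K8S API
      - job_name: node-exporter
        kubernetes_sd_configs:
          - role: endpoints
        relabel_configs:
          - source_labels: [__meta_kubernetes_endpoints_name]
            regex: node-exporter
            action: keep
    remote_write:
      # An adapter forwards the samples to long-term storage such as OpenTSDB or InfluxDB
      - url: http://remote-storage-adapter:9201/write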

Monitoring solution and its evolution

The initial plan

In the early stage, to get K8S into production as quickly as possible, the monitoring system only collected core monitoring data, such as the CPU, memory, and network usage of Pods, using Falcon and internally developed exporters. The specific architecture is shown in the figure below.

cadvisor-exporter collects container monitoring data from cAdvisor; kube-state-exporter collects key K8S Pod metrics; falcon-agent collects physical node data. This initial scheme only collects core monitoring data to give a preliminary view of core resource usage; it lacks more comprehensive coverage, such as apiserver and etcd monitoring. Because Falcon does not support container monitoring, we had to implement various exporters by hand to meet K8S monitoring requirements. Moreover, the monitoring data was not stored persistently and did not support long-term queries.

Monitoring system based on Prometheus

Because of the shortcomings of the initial monitoring system, after investigation and comparison we chose Prometheus as the K8S monitoring solution, mainly for the following reasons:

(1) Native support for K8S monitoring: it provides service discovery for K8S objects, and the core K8S components expose Prometheus-format collection endpoints.

(2) Strong performance: a single Prometheus can ingest 100,000 metrics per second, which meets the monitoring needs of a K8S cluster of a certain scale.

(3) Good query capability: Prometheus provides the data query language PromQL, which offers a large number of calculation functions; in most cases users can query aggregated data directly from Prometheus through PromQL.

The architecture of the K8S monitoring system based on Prometheus is shown below:

Data sources: node-exporter collects physical node metrics; kube-state-metrics collects K8S-related metrics, including the resource usage and status of various objects; cAdvisor collects container-related metrics; monitoring data is also collected from core components such as apiserver, etcd, scheduler, K8S-LVM, and GPU; other custom metrics can be scraped automatically by adding the annotation prometheus.io/scrape: "true" to a Pod's YAML.
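For reference, a Pod opted into automatic scraping might look like the sketch below; the pod name, image, and the port/path annotations are illustrative assumptions based on the common prometheus.io annotation convention:

    apiVersion: v1
    kind: Pod
    metadata:
      name: my-app                      # hypothetical workload name
      annotations:
        prometheus.io/scrape: "true"    # opt into automatic scraping
        prometheus.io/port: "8080"      # metrics port (assumed convention)
        prometheus.io/path: "/metrics"  # metrics path (assumed convention)
    spec:
      containers:
        - name: my-app
          image: my-app:latest          # hypothetical image
          ports:
            - containerPort: 8080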

Prometheus data processing module: Prometheus is deployed on K8S as Pods, each containing a Prometheus container and a prom-reloader container. Prometheus collects and aggregates the data; prom-config holds the aggregation rules and scrape configuration for monitoring and is stored in a ConfigMap. Prom-reloader implements hot configuration updates: it watches the configuration files in real time and dynamically loads the latest configuration without restarting the application.
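As an illustration of what the aggregation rules in prom-config might look like, the sketch below defines recording rules whose names carry the pod:/node: prefixes that the federation configuration later matches on; the exact expressions and label names are assumptions, not the production rules:

    groups:
      - name: pod-aggregation
        rules:
          # Per-pod CPU usage aggregated from cAdvisor container metrics (illustrative)
          - record: pod:container_cpu_usage_seconds:rate5m
            expr: sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace, pod)
      - name: node-aggregation
        rules:
          # Per-node memory working set (illustrative; label names may differ)
          - record: node:memory_working_set_bytes:sum
            expr: sum(container_memory_working_set_bytes) by (node)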

Storage backends: Falcon and OpenTSDB.

Open-Falcon is the company’s unified monitoring and alerting system, providing comprehensive data collection, alerting, display, historical data storage, and permission management. Because Falcon was designed early on, it does not cover container-related metrics, while Prometheus supports K8S natively; however, Prometheus alerting can only be configured statically, it needs to be connected to the company’s account system so that users can configure their own monitoring, and some K8S metrics need to be exposed to container users. For these reasons we use Falcon as the alerting and user-facing display platform for K8S monitoring. A falcon-adapter forwards monitoring data to Falcon for alerting and display. Based on the K8S service objects, monitoring targets are divided into several levels: cluster, node, namespace, deployment, and Pod. Key alerting metrics are sent to Falcon through falcon-agent, and users can configure alerts on these metrics themselves.

A native Prometheus stores monitoring data locally (in its TSDB time-series database) for 15 days by default. Monitoring data is used not only for monitoring and alerting, but also for later operational analysis and fine-grained operations, so the data must be persisted. The Prometheus community provides several remote read/write integrations, such as InfluxDB, Graphite, and OpenTSDB. Xiaomi happens to have an OpenTSDB team, which stores time-series data in HBase, and our HBase is also supported by a stable team. Based on this, we chose OpenTSDB as the remote storage for monitoring data. An opentsdb-adapter forwards monitoring data to the OpenTSDB time-series database, achieving persistent storage and meeting the needs of long-term queries and later data analysis.

Deployment approach


The core components of the monitoring system are all deployed in the K8S cluster via Deployment/DaemonSet to ensure the reliability of the monitoring service. All configuration files are stored in ConfigMaps and updated automatically.

Storage mode

Prometheus uses both local and remote storage. Local storage keeps recent monitoring data: samples from the most recent two hours are grouped into a block per two-hour time window, and each block contains all the sample data (chunks), a metadata file (meta.json), and an index file (index) for that window. Because the storage types offered by the clusters differ, several storage modes are deployed, including PVC, LVM, and local disk.

Remote storage

Reads and writes to OpenTSDB are implemented through Prometheus’s remote read/write interfaces, which makes long-term data queries convenient.

To keep itself simple, Prometheus does not try to solve long-term storage on its own; instead it defines two standard interfaces (remote_write/remote_read) through which users can store data in any third-party storage service. This approach is called Remote Storage in Prometheus.

As shown in the figure above, a Remote Write URL can be specified in the Prometheus configuration file. Once it is set, Prometheus sends the collected sample data to the Adapter via HTTP, and the Adapter can forward it to any external service: a real storage system, a public cloud storage service, or a message queue. Remote Read is likewise implemented through an adapter. In the remote read flow, when a user issues a query, Prometheus sends a read request (matchers, time range) to the URL configured in remote_read; the Adapter fetches the matching data from the third-party storage service, converts it back into Prometheus raw sample data, and returns it to the Prometheus server. Once the samples are returned, Prometheus applies PromQL to them locally for further processing. Note that remote read only affects data queries; rule evaluation and the Metadata API are handled from Prometheus local storage only.
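A minimal sketch of this configuration, assuming illustrative adapter service names and ports (the real falcon-adapter and opentsdb-adapter endpoints are not given here):

    remote_write:
      - url: http://opentsdb-adapter:9201/write   # persists samples to OpenTSDB (assumed endpoint)
      - url: http://falcon-adapter:9202/write     # forwards key metrics to Falcon (assumed endpoint)
    remote_read:
      - url: http://opentsdb-adapter:9201/read    # long-term queries served from OpenTSDB (assumed endpoint)
        read_recent: false                        # recent data still comes from the local TSDB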

Remote storage currently supports Falcon and OpenTSDB: through Falcon, users can view monitoring data and configure alerts, while data written to OpenTSDB is stored persistently and can be read back and written through Prometheus remote read/write.

At present, the Prometheus-based monitoring solution has been deployed in each cluster, but some problems have surfaced as cluster scale grows.

First, as the number of monitoring metrics grows, the pressure on falcon-agent and transfer increases, often causing falcon-agent congestion and the delay or loss of some monitoring data. In online tests, when a single falcon-agent pushed more than 150,000 data points per minute, data loss occurred frequently, and we had to stop sending some monitoring data. The root cause is Prometheus’s centralized aggregation and push model, which funnels metrics scattered across different clusters onto a single host and therefore creates enormous pressure.

Second, in large clusters Prometheus consumes a large amount of CPU and memory (the table below shows the resource usage of Prometheus in one online cluster) and occasionally fails to scrape metrics. As the cluster grows, a single Prometheus may hit a performance bottleneck.

Partitioned monitoring solution

To address the shortcomings of the single-Prometheus monitoring solution, it needs to be scaled out to handle large K8S clusters and to match the throughput of the Falcon agents. Prometheus supports cluster federation; this partitioning approach increases the scalability of Prometheus itself and spreads the load across individual Falcon agents.

Federation is a special query interface that allows one Prometheus to scrape metrics from another Prometheus, which is what makes partitioning possible. It is illustrated below:

There are two common partitioning methods:

The first is functional partitioning. The federation feature lets users adapt the Prometheus deployment architecture to different monitoring scales: multiple Prometheus server instances are deployed in each data center, and each instance collects only a subset of the jobs in that data center. For example, different monitoring tasks can be assigned to different Prometheus instances and then aggregated by a central Prometheus instance.

The second is horizontal sharding. In extreme cases, the number of targets in a single collection job becomes so large that functional partitioning through federation alone is no longer something a single Prometheus server can handle efficiently. In that case, partitioning must continue at the instance level: the targets of the same job are divided across different Prometheus instances. With relabeling, each Prometheus server collects metrics for only a subset of the targets of that collection job.

Based on the actual K8S setup, the architecture of the partitioned scheme is as follows:

The Prometheus partitioning includes Master Prometheus, Slave Prometheus, and Kube-State Prometheus. Since a large number of metrics come from per-node services (for example, Kubelet, node-exporter, and cAdvisor all collect node-level data), those jobs are sharded by node: each Slave Prometheus collects the node and Pod data for its shard of nodes.

Kube-state-metrics cannot be sharded for the time being, so it is scraped by a separate Kube-State Prometheus, which the Master Prometheus then federates; other metrics such as etcd, apiserver, and custom metrics are collected directly by the Master Prometheus.

The Master Prometheus federates the Slave Prometheus instances with a job like the following:

- job_name: federate-slave
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{__name__=~"pod:.*|node:.*"}'
  kubernetes_sd_configs:
    - role: pod
      namespaces:
        names:
          - kube-system
  relabel_configs:
    - source_labels:
        - __meta_kubernetes_pod_label_app
      action: keep
      regex: prometheus-slave.*

Each Slave Prometheus shards its scrape targets using the hashmod relabel action provided by Prometheus:

- job_name: kubelet
  scheme: https
  kubernetes_sd_configs:
    - role: node
      namespaces:
        names: []
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
    # Copy node labels onto the target
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
      replacement: "$1"
    # Hash the hostname into ${modulus} buckets
    - source_labels: [__meta_kubernetes_node_label_kubernetes_io_hostname]
      modulus: ${modulus}
      target_label: __tmp_hash
      action: hashmod
    # Keep only the nodes assigned to this slave's partition
    - source_labels: [__tmp_hash]
      regex: ${slaveId}
      action: keep

Deployment approach

Master Prometheus and Kube-State Prometheus are deployed as Deployments. Slave Prometheus has multiple Pods, but each Pod needs a different configuration (the ${slaveId} in the configuration differs), because the partition number of each Slave must be reflected in its configuration, and a native Deployment/StatefulSet/DaemonSet cannot mount different ConfigMaps for Pods created from the same template. To make the Slaves easy to manage, Slave Prometheus is deployed as a StatefulSet, which gives each Pod a stable ordinal name such as slave-0, slave-1, and so on. Prom-reloader reads the Pod name, continuously watches for Prometheus configuration changes, and generates the numbered configuration that distinguishes the different partitions.
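A minimal sketch of such a StatefulSet, assuming illustrative names and images and that prom-reloader receives the Pod name through the downward API (the label and namespace match the federate-slave job shown earlier):

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: prometheus-slave
      namespace: kube-system
    spec:
      serviceName: prometheus-slave
      replicas: 2                          # number of partitions, i.e. ${modulus}
      selector:
        matchLabels:
          app: prometheus-slave
      template:
        metadata:
          labels:
            app: prometheus-slave          # matched by the federate-slave relabel rule
        spec:
          containers:
            - name: prometheus
              image: prom/prometheus       # illustrative image reference
            - name: prom-reloader
              image: prom-reloader         # internal component; image name assumed
              env:
                - name: POD_NAME           # ordinal in the name is used to derive ${slaveId}
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.name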

Validation testing

The testing covers two aspects: functional testing, to verify that the partitioned monitoring scheme works as expected, and performance testing.

In the functional test, we verified that the aggregation rules of the partitioned scheme work correctly, and in particular compared the data before and after partitioning. We compared one week of data and took the average difference ratio per hour, as shown in the figure below:

According to the statistics, for more than 95% of the time series the difference is below 1%; some metrics (such as network utilization) fluctuate sharply in the short term, but the difference evens out over time.

In the performance test, the partitioned monitoring setup was tested under different loads to verify its performance.

We created 1,000 virtual nodes in the test cluster and created varying numbers of Pods to test the performance of the Prometheus partitions:

For the Master Prometheus and Kube-State Prometheus, up to 80,000 Pods can be supported within a one-minute scrape interval; the main bottleneck is kube-state-metrics, whose data volume grows rapidly as the number of Pods increases, so each scrape takes longer and longer.

For the Slave Prometheus instances, which each collect only part of the data, the pressure is lower: a single Slave can scrape more than 400 nodes (60 Pods per node). As shown in the figure below, after remote write is enabled the scrape time keeps increasing, and the load on the remote-storage-adapter grows as well.

Verification on the K8S test cluster shows that the partitioned Prometheus monitoring architecture supports up to 80,000 Pods, which meets the expected cluster growth.

Looking forward to

Currently, the partitioned monitoring solution has been deployed in some clusters, providing high availability, persistent storage, and dynamic adjustment. We will continue to improve it: automatic scaling of the monitoring components and performance optimization of kube-state-metrics (which does not support sharding at present); on the deployment side, using prometheus-operator and Helm for simpler configuration management and deployment; and on the data side, applying algorithms to mine the monitoring data for valuable insights, for example using it to forecast capacity needs and find the right time to scale out. Through continuous optimization we aim to provide more stable, reliable, and intelligent monitoring services for K8S.



This article was first published on the public account “Miui Cloud Technology”. Please credit the author and source when reprinting.