Introduction

The TKE team is responsible for the O&M of nearly 10,000 clusters and millions of core nodes across public cloud and private cloud scenarios. To monitor such a large cluster federation, the TKE team explored and refined a Prometheus-based cluster monitoring system that is scalable, highly available, and compatible with native Prometheus configuration. In theory it can support an unlimited number of series and unlimited storage capacity, and it covers the monitoring needs of TKE clusters, EKS clusters, and self-built Kubernetes clusters.

Starting from the TKE architecture, this article walks through the evolution of the whole monitoring system, including the early scheme and the problems it ran into, the bottlenecks of the community solutions, our improvement principles, and so on.

Introduction to TKE architecture

To give readers a better understanding of our scenario, let’s start with a brief overview of the TKE infrastructure.

The TKE team was the first Kubernetes operations team in the public cloud sector to adopt Kubernetes-in-Kubernetes for cluster federation management. Its core idea is to use a Meta Cluster to host the apiservers of other clusters; non-business components such as the controller-manager, scheduler, and monitoring suite also run in the Meta Cluster and are hidden from users, as shown in the figure below.

The components in the Meta Cluster above are hidden from the user, while the support environment services handle requests coming directly from the TKE console.

  • Meta Cluster is used to host the control plane components of user clusters, such as the apiserver
  • Meta Cluster also includes some hidden functional components, such as the monitoring components
  • The support services receive requests from the console and connect to the user cluster to perform the actual operations

Early monitoring scheme

Requirements

The early TKE monitoring solution did not support user-defined, service-related metrics; it only covered monitoring related to cluster O&M. The main monitoring targets were as follows:

  • The core components of each user cluster, such as the apiserver, scheduler, and controller-manager
  • The basic resources of each user cluster, such as Pod status, Deployment load, and total cluster load
  • All components of the Meta Cluster, including cluster-monitor itself and some add-on components of the Meta Cluster
  • Components of the support environment, such as the request success rate of the support web server and the success rate of external API calls

Architecture

Cluster level

In the TKE architecture diagram in the previous section, we saw that the Meta Cluster runs a set of cluster-monitor components for each user cluster; this is the monitoring and collection suite at the single-cluster level. Cluster-monitor contains a series of Prometheus-based components whose basic function is to collect basic monitoring data for each user cluster, such as Pod load, Deployment load, and Node CPU usage. The collected data is written directly to the Argus system provided by the cloud monitoring team, where it is stored and used for alerting. The core components are shown below.

Barad: the multidimensional monitoring system provided by cloud monitoring, mainly used by other services on the cloud. It is relatively mature and stable but not very flexible: metrics and labels need to be configured in the system in advance.

Argus: the multi-dimensional business monitoring system provided by the cloud monitoring team, featuring a flexible metric reporting mechanism and strong alerting capabilities. This is the main monitoring system used by the TKE team.

The data flow:

  • Prometheus collects container load information from the kubelet and cluster metadata, such as Pod status and Node status, from kube-state-metrics. The data is aggregated in Prometheus to produce fixed aggregate metrics, such as container-level and Pod-level metrics. The aggregated data is written to the Argus system, which backs the monitoring panels and alerts on the TKE console, and to the Barad system, which backs container-specific alerts on the Barad console; the latter keeps the old version of alerting working.
  • In another data flow, the Barad-importer component pulls node-related data, such as CPU and memory usage, from Barad (cloud monitoring) and imports it into Argus, so that Argus can also display and alert on node data. We did not choose node-exporter, the community mainstream, to collect node data, because node-exporter would require deploying a DaemonSet in the user cluster, and we wanted the entire monitoring collection system to be hidden from users.

This part of the data is exposed to users through the console.

Region level

We could now collect data for each user cluster. However, cluster-monitor cannot cover some region-level monitoring, including:

  • Management components in the Meta Cluster
  • Cluster-monitor itself
  • Region-level aggregate information, such as the total number of clusters, the average number of nodes per cluster, and the average cluster creation time

To cover these, a higher, region-level layer of monitoring had to be built.

Region Prometheus not only pulls data from core components such as the meta-cluster-operator and the meta-cluster-service-controller, but also pulls single-cluster data from each cluster-monitor through the Prometheus federation interface and performs secondary aggregation to produce region-level cluster data. The region-level data is stored locally rather than written to Argus, because it needs to be connected to Grafana for the team's internal use.

Network-wide level

On top of the per-region monitoring, we built a network-wide layer. It is used to monitor:

  • Components of the support environment
  • Data aggregated from the Region Prometheus of every region, producing network-wide metrics

The network-wide data is likewise only viewed internally.

Architecture overview

Problems that gradually emerged

Although the architecture described above met our basic monitoring needs for large-scale cluster federation, it still had several shortcomings.

Prometheus performance is insufficient

Native Prometheus does not support high availability and cannot scale horizontally. When a cluster is large, a single Prometheus hits a performance bottleneck and data collection becomes unreliable. We present Prometheus stress test data in later sections.

The collection period is too long

The current collection interval is 1m; we want to reduce it to 15s.

Raw data is retained for only a short time

Because the aggregation capability of the Argus system provided by cloud monitoring is limited, we do not send the data collected by cluster-monitor to Argus directly. Instead, we aggregate it into a predetermined set of metrics and only send the aggregated results; the TKE console then only performs time-based aggregation when displaying data. Raw data is kept for only 15 minutes: keeping it locally for longer would require attaching a cloud disk to every cluster-monitor, and since TKE has some empty clusters (with 0 nodes), that would waste resources.

Cross-cluster queries are not supported

Because each cluster's data stays on local disk, and Region Prometheus only collects a subset of aggregate metrics due to its limited performance, it is impossible to run cross-cluster aggregation queries over the raw data. Such queries are useful for getting a comprehensive view of a single user's multiple clusters.

Operation and maintenance is difficult

Each level of Prometheus is managed separately, lacking global management tools.

Designing the ideal model

What kind of monitoring system could solve all of these problems at once? Let's start by designing an ideal model, which we call Kvass.

Collection [high performance]

Let's look at collection first. The problems we ran into on the collection side were mainly performance problems, and we wanted Kvass to have the following capabilities:

  • High performance: a collector with effectively unlimited capacity.
  • Native: supports mainstream native Prometheus configuration, including the ServiceMonitor and PodMonitor resources supported by prometheus-operator.

Storage [long-term storage]

On the storage side, our problems were retention duration and resource utilization, and we wanted Kvass storage to have the following capabilities:

  • Data can be retained for up to a year
  • Storage resources are used efficiently

Display [global view]

On the display side, our problem was the lack of a global view. For the ideal display, we wanted Kvass to have the following capabilities:

  • Can connect to Grafana
  • Queries can aggregate across clusters
  • Supports native Prometheus query syntax

Alerting [native]

On the alerting side, we want to support native Prometheus alerting rule configuration.

O&M [convenient]

We want Kvass to avoid overly complicated configuration items, to ship with a complete set of O&M tools, and to be manageable in a Kubernetes-native way.

The overall model

If we have such a model, our monitoring can be restructured as follows, with all the raw data we need kept within a single region.

  • Prometheus in Cluster-Monitor was removed
  • Region Prometheus was removed

High-performance collection

This section describes how we implemented the high-performance collector in our ideal model.

Prometheus collection principles

Relationships between modules

First, let's look at how Prometheus collection works, as a foundation for modifying Prometheus to achieve highly available sharding. The following figure shows the relationships between the modules involved in collection:

  • Configuration management module: receives configuration update actions. All modules that depend on the configuration file register configuration-update listener functions with this module during initialization.
  • Service discovery module: when a job uses service discovery, the set of targets changes dynamically. This module discovers targets, generates target change information, and notifies the scrape module.
  • Storage module: consists of two parts, the local TSDB module and the remote storage module. It stores the data scraped from targets locally and also manages the remote-write sending process.
  • Scrape module: the core of collection. It generates job objects from the configuration file and the target information supplied by the service discovery module; each job object contains multiple target scraper objects, and each target scraper starts a goroutine that periodically scrapes its target and sends the samples to the storage module (a minimal sketch of such a scrape loop follows the list).
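
To make the scrape module more concrete, here is a minimal Go sketch of a per-target scrape loop: one goroutine per target, scraping on a fixed interval and handing samples to a storage interface. The types and names are our own simplifications, not Prometheus's actual code.

    package main

    import (
        "fmt"
        "time"
    )

    // Sample is a simplified scraped data point.
    type Sample struct {
        Labels map[string]string
        Value  float64
        TS     time.Time
    }

    // Appender is a simplified stand-in for the storage module's interface.
    type Appender interface {
        Append(s Sample)
    }

    type printAppender struct{}

    func (printAppender) Append(s Sample) { fmt.Println(s.Labels, s.Value, s.TS) }

    // targetScraper periodically scrapes one target and forwards samples to storage.
    type targetScraper struct {
        url      string
        interval time.Duration
        app      Appender
        stop     chan struct{}
    }

    func (t *targetScraper) run() {
        ticker := time.NewTicker(t.interval)
        defer ticker.Stop()
        for {
            select {
            case <-ticker.C:
                // A real scraper would HTTP-GET the target's metrics endpoint and
                // parse the exposition format; here we just emit a fake sample.
                t.app.Append(Sample{
                    Labels: map[string]string{"instance": t.url, "__name__": "up"},
                    Value:  1,
                    TS:     time.Now(),
                })
            case <-t.stop:
                return
            }
        }
    }

    func main() {
        s := &targetScraper{url: "10.0.0.1:10250", interval: time.Second, app: printAppender{}, stop: make(chan struct{})}
        go s.run() // the scrape module starts one goroutine per target
        time.Sleep(3 * time.Second)
        close(s.stop)
    }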

Memory footprint

We know from Prometheus's performance characteristics that its memory usage grows with the number of collection targets. So where does Prometheus spend its memory?

Storage module

  • Prometheus does not write every scraped point straight to disk; it appends samples to WAL files and periodically compacts a window of data into a block. During this window, the storage module has to cache the label information of all series, and compaction itself causes a large temporary spike in memory consumption.
  • The remote storage path works by watching the WAL files and gradually sending their samples to the remote end. Until a WAL file has been fully sent, the remote storage manager also caches the label information of every series it has seen and maintains multiple send queues, which likewise consumes a lot of memory.

Scrape module

  • For each target, the labels of a series are sent to the storage module only the first time that series is stored; the storage module returns an ID, and the target scraper keeps a hash-to-ID mapping for the series. On subsequent scrapes it only passes the ID and the new value to the storage module. These hash and ID tables also occupy memory (a sketch of this cache follows).
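
The following Go sketch illustrates the kind of hash-to-ID cache described above; the types are our own simplification (real Prometheus gets the ID back from its storage module and handles hash collisions, which we ignore here).

    package main

    import (
        "fmt"
        "hash/fnv"
        "sort"
    )

    // hashLabels computes a stable hash of a label set from its sorted key/value pairs.
    func hashLabels(labels map[string]string) uint64 {
        keys := make([]string, 0, len(labels))
        for k := range labels {
            keys = append(keys, k)
        }
        sort.Strings(keys)
        h := fnv.New64a()
        for _, k := range keys {
            h.Write([]byte(k))
            h.Write([]byte{0xff})
            h.Write([]byte(labels[k]))
            h.Write([]byte{0xff})
        }
        return h.Sum64()
    }

    // seriesCache maps a label-set hash to a series ID.
    type seriesCache struct {
        ids    map[uint64]uint64
        nextID uint64
    }

    // getOrCreate returns the cached series ID, registering the series on first sight.
    func (c *seriesCache) getOrCreate(labels map[string]string) (id uint64, created bool) {
        h := hashLabels(labels)
        if id, ok := c.ids[h]; ok {
            return id, false
        }
        c.nextID++
        c.ids[h] = c.nextID
        return c.nextID, true
    }

    func main() {
        c := &seriesCache{ids: map[uint64]uint64{}}
        lbls := map[string]string{"__name__": "container_cpu_usage_seconds_total", "pod": "nginx-0"}
        fmt.Println(c.getOrCreate(lbls)) // first scrape: full labels go to storage
        fmt.Println(c.getOrCreate(lbls)) // later scrapes: only the ID and value are needed
    }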

Prometheus performance stress tests

Stress test goals

After analyzing how Prometheus collection works, we wanted to determine two things:

  • The relationship between the number of targets and Prometheus load
  • The relationship between the number of series and Prometheus load

Target count stress test

Stress test method

Stress test data

Stress test conclusions
  • The number of targets has little effect on the overall load of Prometheus

Series count stress test

Stress test method

Stress test data

Series produced by the resources of an official large-scale cluster

The resource counts in the table below are the officially recommended counts for a large-scale Kubernetes cluster. The series counts were calculated from the metrics exposed by cAdvisor and kube-state-metrics.

That is about 51.18 million series in total.

Stress test conclusions
  • When the series count exceeds 3 million, Prometheus memory usage explodes
  • Scaling proportionally, a single Prometheus would see a sharp increase in memory for a cluster of 300 nodes or more

Implementing a shardable, highly available Prometheus

We have a large number of clusters with more than 300 nodes, and the stress tests above show that a single Prometheus does hit performance bottlenecks at that scale. Based on the collection principles described earlier, we then tried to modify Prometheus to support horizontal scaling.

Design principles

However we modified it, we wanted to keep the following properties:

  • No data points are lost when scaling out or in
  • Load balancing
  • 100% compatible with original configuration files and collection capabilities

Core principles

Let's revisit the collection diagram to see where we can make changes.

The load comes from the target scrapers: if a Prometheus instance runs fewer target scrapers, it handles fewer series overall and therefore carries less load.

Since multiple Prometheus instances share the same configuration file, their sets of target scrapers are in theory identical. If those instances can be coordinated so that the target scrapers are divided among them according to how much data each target produces, the load is spread evenly, as shown in the figure below.

Implementing dynamic target distribution

  • To do this, we need a load coordinator, independent of all Prometheus instances, that periodically (every 15s) performs the load calculation: it collects all target scraper information and all Prometheus information, runs an allocation algorithm that assigns a set of target scrapers to each Prometheus, and synchronizes the result to all instances.

  • Correspondingly, each Prometheus needs a local coordination module that talks to the independent coordinator: it reports all targets discovered through service discovery together with the amount of data each target produced in the last scrape, receives collection tasks from the coordinator, and controls which target scrapers this Prometheus should run (the data involved is sketched below).
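
To make the coordination protocol concrete, here is a Go sketch of the data each shard might report and the assignment it might receive back. The struct names and fields are our own illustration, not Kvass's actual API.

    package main

    import "fmt"

    // TargetStatus is what a shard's local coordination module reports for one target.
    type TargetStatus struct {
        JobName string
        Hash    uint64 // identifies the same target across all shards
        Series  int64  // series count observed in the last scrape
        Health  string // up / down / unknown
    }

    // ShardReport is the periodic report from one Prometheus shard to the coordinator.
    type ShardReport struct {
        Shard   string
        Targets []TargetStatus
    }

    // Assignment is the coordinator's reply: the targets this shard should scrape.
    type Assignment struct {
        Shard        string
        TargetHashes []uint64
    }

    func main() {
        report := ShardReport{
            Shard: "prometheus-0",
            Targets: []TargetStatus{
                {JobName: "kubelet", Hash: 0xabc, Series: 120000, Health: "up"},
            },
        }
        // Every coordination cycle the coordinator aggregates reports from all shards
        // and returns one Assignment per shard.
        fmt.Printf("%+v\n", report)
        fmt.Printf("%+v\n", Assignment{Shard: "prometheus-0", TargetHashes: []uint64{0xabc}})
    }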

Target assignment algorithm

After the coordinator has collected all target information, the targets need to be assigned across the Prometheus instances. We follow these principles when assigning:

  • Prefer to assign a target to the Prometheus instance that is already collecting it
  • Balance the load as evenly as possible

We ended up with the following assignment algorithm:

  1. Define a target's load as its series count multiplied by the number of scrapes per minute.
  2. Summarize the target information from every Prometheus into a global view, called global_targets, with every target initially marked as unallocated.
  3. Calculate the average collection load each Prometheus should theoretically carry, called avg_load.
  4. For each Prometheus, try to keep assigning it the targets it is already collecting, provided its load does not exceed avg_load; targets assigned this way are marked as allocated in global_targets.
  5. Iterate over global_targets. For each target still unallocated after step 4, one of the following cases applies:

5.1 If the target has never been collected before, assign it to a random Prometheus.

5.2 If the Prometheus that originally collected it has a load not exceeding avg_load, assign the target back to that Prometheus.

5.3 Otherwise, find the Prometheus with the lowest load; if its current load plus the target's load is still below avg_load, assign the target to it; otherwise assign it back to the original collector.

We can also express the algorithm in pseudocode:

    func load(t target) int {
        return t.series * (60 / t.scrapeInterval)
    }

    func reBalance() {
        globalTargets := targets reported by all Prometheus instances, marked unallocated
        avgLoad := average load per Prometheus computed from globalTargets

        // Step 4: keep each Prometheus on the targets it is already collecting,
        // as long as its load stays within avgLoad.
        for each Prometheus p {
            for each target t currently collected by p {
                if p.load+load(t) <= avgLoad {
                    p.addTarget(t)
                    globalTargets[t] = allocated
                    p.load += load(t)
                }
            }
        }

        // Step 5: place the remaining targets.
        for each t in globalTargets {
            if globalTargets[t] == allocated {
                continue
            }
            p := the Prometheus that originally collected t
            if p does not exist {
                p = a random Prometheus
            } else if p.load > avgLoad {
                lowest := the Prometheus with the lowest load
                if lowest.load+load(t) <= avgLoad {
                    p = lowest
                }
            }
            p.addTarget(t)
            globalTargets[t] = allocated
            p.load += load(t)
        }
    }

Target handover

When a target's scrape task is reassigned from one Prometheus to another, a smooth handover mechanism ensures that no points are dropped during the transfer. Duplicate points are tolerated here, because the data is deduplicated later.

The handover itself is very simple: because target updates happen on all Prometheus instances almost simultaneously, the original owner only needs to keep the scrape task running for two more scrape intervals after it discovers that the task has been transferred, and then stop it.
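
A minimal sketch of this handover rule, with hypothetical names: when a shard learns that one of its targets has been reassigned, it keeps scraping for two more scrape intervals before stopping, so that the new owner has started before the old one stops; only duplicate points (removed later by deduplication) can result, never gaps.

    package main

    import (
        "fmt"
        "time"
    )

    // stopAfterHandover keeps a target scraper alive for two extra scrape intervals
    // after the coordinator reassigns its target, then signals it to stop.
    func stopAfterHandover(stop chan<- struct{}, scrapeInterval time.Duration) {
        go func() {
            time.Sleep(2 * scrapeInterval) // overlap window: old and new owner both scrape
            close(stop)
        }()
    }

    func main() {
        stop := make(chan struct{})
        stopAfterHandover(stop, 500*time.Millisecond)
        <-stop
        fmt.Println("target scraper stopped after the handover grace period")
    }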

Scaling out

In every coordination cycle, the coordinator calculates the load of all Prometheus instances to ensure that the average load does not exceed a threshold; otherwise it increases the number of Prometheus instances, and in the next coordination cycle hands a portion of the targets to the new instance using the target handover mechanism described above.
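
A sketch of the scale-out check under our own assumptions (the threshold value and function names are illustrative): each coordination cycle, if the average shard load exceeds the threshold, the coordinator adds another Prometheus shard and hands some targets to it in the next cycle.

    package main

    import "fmt"

    // needScaleOut reports whether another Prometheus shard is needed, given each
    // shard's current load (for example its series count) and a per-shard threshold.
    func needScaleOut(shardLoads []int64, maxLoadPerShard int64) bool {
        if len(shardLoads) == 0 {
            return true
        }
        var total int64
        for _, l := range shardLoads {
            total += l
        }
        avg := total / int64(len(shardLoads))
        return avg > maxLoadPerShard
    }

    func main() {
        loads := []int64{2_800_000, 3_200_000, 3_100_000}
        // Assume a shard starts to struggle around 3 million series (see the stress tests above).
        fmt.Println(needScaleOut(loads, 3_000_000)) // true: add a shard in the next cycle
    }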

Scaling in

Because each Prometheus instance holds local data, scaling in does not simply delete the excess instances. We scale in as follows (a code sketch of this lifecycle follows the list):

  • Mark the excess Prometheus instances as idle and record the current time.
  • Remove all targets from idle instances; they no longer participate in subsequent assignments.
  • Once all of an idle instance's data has been uploaded to the remote end (described later), delete the instance.
  • In particular, if a scale-out happens while instances are idle, the instance that has been idle the longest is reactivated and continues to work.
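
The idle-shard lifecycle above can be sketched roughly as follows; the states and helpers are hypothetical, meant only to show the ordering of "mark idle, drain, upload, delete (or reuse)".

    package main

    import (
        "fmt"
        "time"
    )

    type shard struct {
        name      string
        idleSince *time.Time // nil means the shard is active
        uploaded  bool       // true once local blocks are fully uploaded to remote storage
    }

    // markIdle flags a shard as excess; its targets are removed and it gets no new ones.
    func markIdle(s *shard) {
        now := time.Now()
        s.idleSince = &now
    }

    // canDelete reports whether an idle shard can be removed without losing data.
    func canDelete(s *shard) bool {
        return s.idleSince != nil && s.uploaded
    }

    // reuseLongestIdle reactivates the shard that has been idle the longest, which is
    // what happens if a scale-out is needed while shards are still draining.
    func reuseLongestIdle(shards []*shard) *shard {
        var oldest *shard
        for _, s := range shards {
            if s.idleSince == nil {
                continue
            }
            if oldest == nil || s.idleSince.Before(*oldest.idleSince) {
                oldest = s
            }
        }
        if oldest != nil {
            oldest.idleSince = nil
        }
        return oldest
    }

    func main() {
        s := &shard{name: "prometheus-2"}
        markIdle(s)
        s.uploaded = true
        fmt.Println(canDelete(s))
        fmt.Println(reuseLongestIdle([]*shard{s}).name)
    }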

High availability

In the scheme described above, when a Prometheus instance becomes unavailable, the coordinator immediately transfers its targets to other instances, and with a very short coordination interval (5s) the probability of gaps in the data is very low. If higher availability is required, a better approach is data redundancy, in which each target is assigned to multiple Prometheus instances.
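
A trivial sketch of such redundant assignment (our own illustration; it uses plain round-robin rather than Kvass's load-aware algorithm): each target ends up on several shards, so losing one shard loses no target.

    package main

    import "fmt"

    // assignReplicated assigns every target to `replicas` different shards, round-robin.
    func assignReplicated(targets, shards []string, replicas int) map[string][]string {
        out := make(map[string][]string)
        for i, t := range targets {
            for r := 0; r < replicas; r++ {
                s := shards[(i+r)%len(shards)]
                out[s] = append(out[s], t)
            }
        }
        return out
    }

    func main() {
        fmt.Println(assignReplicated(
            []string{"kubelet/node-1", "kubelet/node-2", "kube-state-metrics"},
            []string{"prometheus-0", "prometheus-1", "prometheus-2"},
            2, // each target is scraped by two shards
        ))
    }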

About storage

Although we have now successfully sharded Prometheus's collection, the collected data is scattered across instances, and a unified storage mechanism is needed to bring it back together.

Unified storage

At the end of the last section we saw that we needed unified storage for the sharded Prometheus data. There are many excellent open source projects in this area; we selected the two best-known and evaluated them in terms of architecture, ease of integration, community activity, performance, and so on.

Thanos vs Cortex

Overall comparison

Thanos overview

Thanos is a high-availability solution for Prometheus that is very popular in the community; its design is shown below.

On the collection side, Thanos places a Thanos Sidecar next to each Prometheus, which uploads Prometheus's local data blocks to object storage for long-term retention. There can be many Prometheus instances, each reporting its own data.

At query time, Thanos retrieves recent data from each Prometheus and historical data from the object store, then deduplicates the merged results. This design fits very well with the unified storage required by our sharded collection solution; the following figure shows how they are connected.
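
Conceptually, deduplication just means merging the same series from several sources and keeping one sample per timestamp. The sketch below is our own simplification, not Thanos's actual implementation.

    package main

    import (
        "fmt"
        "sort"
    )

    type sample struct {
        ts    int64 // millisecond timestamp
        value float64
    }

    // dedup merges samples of one series coming from multiple sources (for example two
    // collector shards during a handover), keeping a single sample per timestamp.
    func dedup(sources ...[]sample) []sample {
        byTS := make(map[int64]float64)
        for _, src := range sources {
            for _, s := range src {
                byTS[s.ts] = s.value
            }
        }
        out := make([]sample, 0, len(byTS))
        for ts, v := range byTS {
            out = append(out, sample{ts: ts, value: v})
        }
        sort.Slice(out, func(i, j int) bool { return out[i].ts < out[j].ts })
        return out
    }

    func main() {
        a := []sample{{ts: 10, value: 1}, {ts: 20, value: 2}}
        b := []sample{{ts: 20, value: 2}, {ts: 30, value: 3}} // overlaps at ts=20
        fmt.Println(dedup(a, b))
    }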

Cortex overview

Cortex is Weaveworks' open source, Prometheus-compatible TSDB. It natively supports multi-tenancy and is officially touted as a very powerful storage system capable of storing up to 25 million series. Its architecture is shown below.

From the architecture diagram it is not hard to see that Cortex is much more complex than Thanos and has many external dependencies, so we expected its overall O&M burden to be relatively high. Cortex does not reuse the storage provided by Prometheus; instead, Prometheus writes all its data into the Cortex system via remote write for unified storage. Cortex receives the data through a shardable receiver, then stores data blocks in object storage and data indexes in Memcache.

  • Architecturally, Cortex seems more complex and harder to operate.
  • In terms of integration, Thanos does not modify the original Prometheus configuration file at all and is non-intrusive, while Cortex requires adding a remote write section to the configuration. In addition, the current version of Prometheus has no parameter to disable local storage, so even if data is only sent to Cortex via remote write, Prometheus still keeps a local copy.

Community status

  • Thanos does even better in terms of community activity

Performance stress test

The comparison above looks at the two projects from an architectural perspective, but how do they perform in practice? We ran stress tests:

Stress test method

We fed both systems an identical, identically changing total number of series and evaluated their respective strengths and weaknesses in query performance, system load, and other aspects.

Stress test results
  • Stability: How well components work under different data sizes

    Thanos is statistically more stable.

  • Query performance: Query efficiency in different data sizes

    Statistically, Thanos is more efficient with queries.

  • Resource consumption with the Ruler disabled: the load of each component when the Ruler is not enabled

    Thanos has a much lower resource consumption than Cortex in terms of collection and query.

Throughout the process we found that Cortex's performance did not live up to the official claims. Our tuning may of course not have been ideal, but this also reflects how hard Cortex is to use: its operation is very complex (hundreds of parameters) and the overall experience was poor. In contrast, Thanos's performance matched its documentation, it is relatively easy to operate, and the system stays easy to control.

Selection

Based on the analysis above, Thanos has clear advantages over Cortex in performance, community activity, and ease of integration, so we chose Thanos as the unified storage.

Kvass system as a whole

So far, by combining shardable Prometheus with Thanos, we have built a high-performance, scalable Kvass monitoring system that is 100% compatible with native Prometheus configuration. The component relationships are shown below:

Supporting multiple Kubernetes clusters

The figure above shows only one collector group (that is, multiple Prometheus collectors and their coordinator sharing one configuration file), but the system actually supports multiple collector groups: a single system can monitor multiple Kubernetes clusters and provide a global view across them.

Kvass-operator

Looking back at the O&M shortcomings of the old monitoring, we wanted the new system to come with complete management tooling and to be managed in a Kubernetes-native way, so we adopted the operator pattern. Kvass-operator is the management center of the whole system and defines the following three custom resources (illustrated in code after the list):

  • Thanos: defines the configuration and status of the Thanos-related components; this resource is globally unique.
  • Prometheus: each Prometheus resource defines the configuration of one collector group, such as the basic information of its associated Kubernetes cluster and the thresholds used by the coordination algorithm.
  • Notification: defines the alert channels; Kvass-operator updates the alerting configuration on the cloud based on this definition.
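
For illustration only, the three custom resources might carry fields along the following lines; the Go structs and field names below are our own guesses at a plausible shape, not Kvass's actual CRD schema.

    package main

    // ThanosSpec describes the globally unique Thanos components (illustrative only).
    type ThanosSpec struct {
        Image             string `json:"image"`
        ObjectStoreSecret string `json:"objectStoreSecret"` // bucket credentials for long-term storage
    }

    // PrometheusSpec describes one collector group bound to a Kubernetes cluster.
    type PrometheusSpec struct {
        ClusterID         string `json:"clusterID"`         // which user cluster to collect from
        MaxSeriesPerShard int64  `json:"maxSeriesPerShard"` // threshold used by the coordination algorithm
        ScrapeInterval    string `json:"scrapeInterval"`    // e.g. "15s"
    }

    // NotificationSpec defines an alert channel; the operator syncs it to the cloud alerting backend.
    type NotificationSpec struct {
        Type     string `json:"type"` // e.g. webhook
        Receiver string `json:"receiver"`
    }

    func main() {}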

Prometheus-operator and in-cluster collection configuration management

Because Prometheus configuration files are hard to manage, CoreOS open-sourced the prometheus-operator project for managing Prometheus and its configuration. It defines two custom resource types, ServiceMonitor and PodMonitor, which are more readable than the native configuration file and help users generate the final scrape configuration.

We wanted a "virtual Prometheus" mechanism: each user cluster can manage its own Prometheus collection configuration from inside the cluster, creating, deleting, updating, and listing ServiceMonitor and PodMonitor resources as if Prometheus were deployed in that cluster.

To achieve this, we introduced and modified prometheus-operator. The modified prometheus-operator connects to the user cluster, watches its ServiceMonitor and PodMonitor resources, and generates the configuration files on the collector side.
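
A rough sketch of this "virtual Prometheus" idea, using entirely hypothetical interfaces rather than the real prometheus-operator or client-go APIs: monitors are listed from the user cluster, and the scrape configuration is regenerated on the collector side.

    package main

    import "fmt"

    // Monitor is a simplified stand-in for a ServiceMonitor/PodMonitor read from a user cluster.
    type Monitor struct {
        Namespace string
        Name      string
        Port      string
        Path      string
    }

    // UserCluster abstracts "connect to the user cluster and list its monitors".
    type UserCluster interface {
        ListMonitors() []Monitor
    }

    // renderScrapeJobs turns user-cluster monitors into scrape job descriptions, which
    // would then be written into the shared configuration used by all collector shards.
    func renderScrapeJobs(c UserCluster) []string {
        var jobs []string
        for _, m := range c.ListMonitors() {
            jobs = append(jobs, fmt.Sprintf("job %s/%s -> port %s, path %s",
                m.Namespace, m.Name, m.Port, m.Path))
        }
        return jobs
    }

    type fakeCluster struct{}

    func (fakeCluster) ListMonitors() []Monitor {
        return []Monitor{{Namespace: "default", Name: "my-app", Port: "metrics", Path: "/metrics"}}
    }

    func main() {
        for _, j := range renderScrapeJobs(fakeCluster{}) {
            fmt.Println(j)
        }
    }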

We also put the coordinator and prometheus-operator together.

TKE monitoring scheme based on Kvass

Step by step, we finally arrived at a highly available monitoring system that supports multi-cluster collection and scaling, and we used it to replace Cluster-monitor and Region Prometheus in the original monitoring scheme, fulfilling the requirements set out at the beginning of this article.

The first version

A new scheme

The solution introduced above fully replaces Region Prometheus and Cluster-monitor from the earlier scheme; on top of it, we add another set of Thanos to integrate data across the whole network.

Compared with the predefined monitoring metrics of the previous version, the new monitoring system supports user-defined metric reporting, because Prometheus can now scale.

Conclusion

Project ideas

The design of Kvass is not an arbitrary decision, but rather the result of solving a number of problems in the current situation.

An objective look at the old version

Although this whole article is about a new system replacing the old one, that does not mean we think the old monitoring was badly designed. As the business grew, the scenarios faced by the old monitoring system changed dramatically from those it was originally designed for, and decisions that were reasonable at the time no longer apply in the current scenario. This is evolution rather than replacement.

Design the model first

Rather than jumping straight into implementation, we prefer to design the system model first and settle on design principles. The system model is used to sort out exactly which problems we are solving, which core modules the system consists of, and which core problem each module solves. With a model and a line of reasoning, we have a design blueprint.

Define design principles

During the design process we paid particular attention to our design principles: whatever form or solution we adopted, and whatever features the new system needed, native compatibility was the first design principle of Kvass. However we designed it, we wanted it to look like Prometheus to the user cluster.

Putting it into practice

During the overall development we also stepped into plenty of pitfalls. Compared with Thanos, Cortex's architectural approach of separating index from data is arguably more reasonable in design, and in theory should give better read performance for very large data sets; as a natively multi-tenant system, Cortex also implements a large number of parameters to limit the scale of user queries, which is something Thanos still needs to strengthen. Our initial solution therefore tried to use Cortex as the unified storage, but in actual use we found that Cortex had problems such as high memory usage and complicated parameter tuning, while Thanos proved more stable and closer to our scenario, so based on the stress test report we switched to Thanos.

Productization

Since the problems the Kvass system solves are fairly general, TKE decided to expose it to users as a sub-product, providing a cloud-native monitoring system based on Kvass; it is currently in private beta.
