Here’s what you need to know about time series data and metrics aggregation tools.
Isn’t monitoring just monitoring? Doesn’t it include logging, visualization, and time series data?
The terminology around monitoring has caused a lot of confusion over the years and has led to some poor tools that claim to do everything in one place. Observability proponents recognize that there are many layers to an observability system. Metrics aggregation deals primarily with time series data, and that is what we will discuss in this article.
Characteristics of time series data
Counter
A counter is a metric whose value only increases. In other words, a counter never decreases; it accumulates values and presents the current total when requested. Counters are usually used for things like the total number of network requests, the number of errors, or the number of visitors. This is similar to a person with a tally counter standing at the entrance of an event, counting everyone who walks in. There is generally no option to decrement a counter without resetting it.
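To make this concrete, here is a minimal sketch of counters using the official Python client library for Prometheus (prometheus_client); the metric names, port, and request handler are illustrative, not taken from any particular system.

```python
# A minimal counter sketch using the Python prometheus_client library.
# Metric names and the port are illustrative.
import time
from prometheus_client import Counter, start_http_server

REQUESTS = Counter('http_requests_total', 'Total number of HTTP requests')
ERRORS = Counter('http_errors_total', 'Total number of HTTP errors')

def handle_request(ok: bool) -> None:
    REQUESTS.inc()    # a counter only ever goes up...
    if not ok:
        ERRORS.inc()  # ...and resets only when the process restarts

if __name__ == '__main__':
    start_http_server(8000)  # current totals served at :8000/metrics
    while True:
        handle_request(ok=True)
        time.sleep(1)
```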
Gauge
A gauge is similar to a counter in that it represents a single value, but that value can go down as well as up. A gauge is essentially a representation of some value at a point in time. A thermometer is a good example: it moves up and down with the temperature and provides point-in-time readings. Other common uses include CPU usage, memory usage, network usage, and thread count.
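A gauge looks almost identical in code, except the value can move in both directions or be set outright. Again, a hedged sketch with prometheus_client; the metric names are made up.

```python
# A minimal gauge sketch with the Python prometheus_client library.
import threading
from prometheus_client import Gauge

IN_PROGRESS = Gauge('inprogress_requests', 'Requests currently being served')
THREADS = Gauge('worker_threads', 'Current number of live threads')

IN_PROGRESS.inc()   # unlike a counter, a gauge can go up...
IN_PROGRESS.dec()   # ...and come back down
THREADS.set(threading.active_count())  # or be set to a point-in-time reading
```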
Quantile
Quantiles aren’t a type of metric, but they are closely related to the next two types: histograms and summaries. Let’s build our understanding of quantiles with an example:
A percentile is a kind of quantile. Percentiles are something we see all the time, so they make the general concept easier to grasp. A percentile splits the data into 100 buckets. We often see them related to testing or performance, usually expressed as someone scoring, say, within the 85th percentile. This means the person’s score falls between the cut points of the 85th and 86th percentile buckets: they did better than 85% of all test takers, putting them in the top 15 percent. We don’t know the exact scores inside the bucket from this metric alone, but we can calculate the bucket’s average by dividing the sum of all scores in the bucket by the count of those scores. The short example below works through both calculations.
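This sketch uses only the Python standard library; the scores are made-up sample data.

```python
# Percentiles and bucket averages, illustrated with synthetic scores.
import random
import statistics

random.seed(42)
scores = [random.gauss(70, 12) for _ in range(1000)]  # hypothetical test scores

# statistics.quantiles with n=100 returns the 99 cut points that split
# the data into 100 percentile buckets.
cuts = statistics.quantiles(scores, n=100)
p85 = cuts[84]  # 85th percentile: 85% of scores fall below this value
print(f'85th percentile: {p85:.1f}')

# The scores that landed in the 85th-86th percentile bucket.
bucket = [s for s in scores if p85 <= s < cuts[85]]

# As noted above, a bucket's average is its sum divided by its count.
print(f'bucket average: {sum(bucket) / len(bucket):.1f}')
```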
Quantiles give us a better understanding of our data than averages or other statistical functions that don’t account for outliers and uneven distributions.
Histogram
Histograms are slightly more complex than counters and gauges. A histogram is a sample of observations: it consists of a counter that counts all the observations and, essentially, a gauge that sums their values. It uses buckets (or bins) to segment the values, bounding the data set in a productive way. Histograms are commonly used for request service level agreements (SLAs). Suppose we want to ensure that 95% of requests complete in under 500 milliseconds. We can use a bucket with an upper bound of 0.5 seconds to collect all values at or below 500ms. We can then determine how many requests fell into that bucket and how close we are to the SLA. Estimating exact quantiles from the buckets, however, is harder (the Prometheus documentation explains this in more detail).
Histograms also aggregate well: because the buckets are simple counters, values collected from many instances can be summed on a central server. This provides an opportunity to understand the system as a whole, rather than node by node.
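Here is a sketch of that 500ms SLA bucket with prometheus_client; the bucket boundaries and metric name are assumptions for illustration.

```python
# Histogram sketch: buckets are cumulative, with upper bounds in seconds.
from prometheus_client import Histogram

LATENCY = Histogram(
    'request_latency_seconds',
    'Request latency',
    buckets=(0.1, 0.25, 0.5, 1.0, 2.5),  # 0.5 captures everything <= 500ms
)

LATENCY.observe(0.42)  # falls in the 0.5 bucket (and all larger ones)
LATENCY.observe(1.9)   # misses the SLA bucket

# The ratio of the 0.5 bucket's count to the total observation count
# approximates the fraction of requests meeting the 500ms target.
```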
Summary
Summaries are similar to histograms in that they are samples of observations, but the quantiles are calculated in the instrumented application itself rather than on the monitoring server, which makes the quantile estimates more accurate than a histogram’s. A summary uses a sliding time window, so it serves a slightly different case than a histogram, but it is generally used for the same kinds of measurements. I usually use a histogram unless I need very accurate quantile measurements.
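In code, a summary is typically used to time something. Here is a sketch with prometheus_client; note that this particular client exposes only the count and sum for summaries (no quantiles), while other clients, such as the Go and Java ones, can stream quantile estimates.

```python
# Summary sketch: the decorator observes each call's duration.
import time
from prometheus_client import Summary

REQUEST_TIME = Summary('request_processing_seconds',
                       'Time spent processing requests')

@REQUEST_TIME.time()
def process_request() -> None:
    time.sleep(0.1)  # stand-in for real work

process_request()
```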
Push and pull
No article on metrics aggregation tools would be complete without addressing the push versus pull debate.
The debate is whether it’s better for your services to push data to your metrics aggregation system or for the aggregation system to pull data by scraping your services’ endpoints. This has been discussed in many articles. My take is that it mostly doesn’t matter. Additional research is left to the reader.
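For the curious, both models look like this with the same Python client library; the Pushgateway address and job name are assumptions.

```python
# Pull: expose an endpoint and let the aggregation system scrape it.
# Push: short-lived jobs push to an intermediary such as the Pushgateway.
from prometheus_client import (CollectorRegistry, Gauge, push_to_gateway,
                               start_http_server)

start_http_server(8000)  # pull: a scraper fetches :8000/metrics on its schedule

registry = CollectorRegistry()
last_run = Gauge('job_last_success_unixtime',
                 'Last successful run of the batch job', registry=registry)
last_run.set_to_current_time()
push_to_gateway('pushgateway.example.com:9091', job='batch_job',
                registry=registry)
```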
Tool options
There are many tools available, both open source and commercial. We will focus on open source tools, though some of them follow an open core model with paid components.
Some of these tools include additional observability features, chiefly alerting and visualization. Those are covered as extra features in this section and will not be discussed in subsequent articles.
1. Prometheus
This is the most widely recognized time series monitoring solution for cloud-native applications. It is hosted by the Cloud Native Computing Foundation (CNCF) but was created by Matt Proud and Julius Volz and sponsored by SoundCloud, with outside contributors joining early in its development. Brian Brazil of Robust Perception built a business around helping companies adopt Prometheus, and he maintains an excellent blog on his website. The Prometheus documentation is extensive and provides a great deal of detail for understanding and using the tool.
Prometheus is a pull-based system that uses local configuration to describe the endpoints to scrape and the interval at which to collect. Each endpoint has a client that gathers the data and updates its representation on each request (or however the client is configured). The collected data is stored in an efficient storage engine on local disk, which uses an append-only file per metric. This storage is not lossy, meaning the fidelity of data from a year ago is as high as that of the data you collected today. However, you may not want to keep that much data locally; fortunately, there is a remote storage option for long-term retention and analysis.
Prometheus includes a high-level expression language for selecting and presenting data called PromQL. The data can be retrieved through a REST API, presented as tables or graphs, or consumed by external systems. The expression language lets users create regressions, analyze real-time data, or trend historical data. Labels are also a great tool for filtering and querying data; they can be associated with every metric name.
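As a taste of PromQL, here is a hedged sketch of calling the Prometheus HTTP API from Python; the server address, job label, and metric name are assumptions.

```python
# Query Prometheus's HTTP API with a PromQL expression.
import requests

resp = requests.get(
    'http://prometheus.example.com:9090/api/v1/query',
    params={'query': 'rate(http_requests_total{job="api"}[5m])'},
)
for result in resp.json()['data']['result']:
    print(result['metric'], result['value'])  # label set and [timestamp, value]
```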
Prometheus also provides a federation model that encourages more localized control: individual teams can run their own Prometheus instances while a central team runs its own. The central system can scrape the same endpoints as the local instances, but it can also scrape the local Prometheus servers to obtain the aggregated data they are already collecting, which reduces overhead on the endpoints. This federation model also allows local instances to collect data from each other.
Prometheus ships with AlertManager to handle alerts. The system allows for aggregated alerts as well as more complex routing to limit the number of alerts that get sent.
Suppose 10 nodes suddenly go down because a switch fails. It probably isn’t necessary to send alerts about all 10 nodes, because everyone receiving them can do nothing until the switch is fixed. With AlertManager, the alert can be sent only to the network team responsible for the switch and can include additional information about the other systems that may be affected. The systems team can instead get an email (rather than a page) letting them know the nodes are down and that they don’t need to respond unless the systems fail to come back up after the switch is fixed. If that happens, AlertManager will reactivate the alerts that were suppressed by the switch alert.
2. Graphite
Graphite has been around for a long time and is covered in detail in James Turnbull’s recent book, “The Art of Monitoring.” Graphite is widespread in the industry, with many large companies running it at scale.
Graphite is a push-based system: applications push their data into Graphite’s Carbon component. Carbon stores the data in the Whisper database, and both the database and Carbon are read by the Graphite web component, which lets users graph data in a browser or pull it through an API. A really cool feature is the ability to export graphs as images or data files so they can easily be embedded in other applications.
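Pushing a datapoint into Carbon takes only a few lines, since its plaintext protocol is one `metric.path value timestamp` line per datapoint, conventionally on TCP port 2003. The hostname and metric path below are assumptions.

```python
# Push one datapoint to Carbon using Graphite's plaintext protocol.
import socket
import time

line = f'web.app01.requests.count 42 {int(time.time())}\n'
with socket.create_connection(('graphite.example.com', 2003)) as sock:
    sock.sendall(line.encode('ascii'))
```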
Whisper is a fixed-size database that provides fast, reliable storage of numeric data over time. It is a lossy database, meaning the resolution of your metrics degrades over time: it keeps high-fidelity data for the most recent collections and progressively lower fidelity for older data.
Graphite also uses dot-delimited metric names to express dimensions. These dimensions allow for creative aggregation of metrics and relationships between them. They make it possible to aggregate a service across different versions or data centers or, more specifically, to look at a single version running in a single data center in a particular Kubernetes cluster. Comparisons at finer granularity can also be made to determine whether a particular cluster is underperforming.
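Here is a hedged sketch of that kind of aggregation through Graphite’s render API; the hostname and metric paths are assumptions.

```python
# Sum requests across every app server with a wildcard in the dotted path.
import requests

resp = requests.get(
    'http://graphite.example.com/render',
    params={
        'target': 'sumSeries(web.*.requests.count)',  # aggregate one dimension
        'from': '-1h',
        'format': 'json',
    },
)
print(resp.json())
```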
Another interesting feature of Graphite is the ability to store arbitrary events that can be related to time series metrics. In particular, application or infrastructure deployments can be added and tracked within Graphite. This gives the operator or developer investigating abnormal behavior more context about what has happened in the environment.
Graphite also has a substantial list of functions that can be applied to metric series. However, it lacks a powerful query language, which some other tools include, and it lacks any alerting functionality or built-in alerting system.
3. InfluxDB
InfluxDB is a relatively new entrant, newer than Prometheus. It uses an open core model, which means scaling and clustering cost extra. InfluxDB is part of the larger TICK stack (Telegraf, InfluxDB, Chronograf, and Kapacitor), so this analysis includes the functionality of all of those components.
InfluxDB uses key-value pairs called tags to add dimensions to metrics, and they work much as they do in Prometheus and Graphite. Metric data can be of type float64, int64, bool, or string, with nanosecond resolution. That range is broader than most other tools in this space; in fact, the TICK stack is more of an event aggregation platform than a native time series metrics aggregation system.
InfluxDB stores data using a system similar to a log-structured merge tree; in this context, it is called a time-structured merge tree. It uses a write-ahead log and a set of read-only data files, similar to sorted string tables, that contain series data rather than raw log data. These files are sharded by blocks of time. For more information, check out the excellent resources on InfluxData’s site.
The architecture of the TICK stack differs between the open source and commercial versions. The open source InfluxDB runs standalone on a single host, whereas the commercial version is distributed by nature; the same is true of the other central components. In the open source version, everything runs on one host, with no data or configuration stored on external systems, so it is fairly easy to manage, but it is not as robust or scalable as the commercial version.
InfluxDB includes an SQL-like language called InfluxQL for querying data in the database. The primary means of querying data is the HTTP API. The query language doesn’t have as many built-in helper functions as PromQL, but those familiar with SQL will likely feel more comfortable with it.
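Here is a quick sketch against the InfluxDB 1.x HTTP API, writing a point in line protocol and reading it back with InfluxQL; the host and database names are assumptions.

```python
# Write a point (measurement 'cpu', tag 'host', float field 'usage'),
# then query it back with InfluxQL over the same HTTP API.
import requests

requests.post(
    'http://influxdb.example.com:8086/write',
    params={'db': 'metrics'},
    data='cpu,host=server01 usage=0.64',
)

resp = requests.get(
    'http://influxdb.example.com:8086/query',
    params={
        'db': 'metrics',
        'q': 'SELECT mean("usage") FROM "cpu" '
             'WHERE time > now() - 1h GROUP BY "host"',
    },
)
print(resp.json())
```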
The TICK stack also includes an alerting system. Kapacitor is capable of some mild aggregation but lacks the full routing functionality of Prometheus’s AlertManager, though it does offer a lot of integrations. Also, to reduce the load on InfluxDB, continuous queries can be scheduled to store results that Kapacitor will pick up for alerting.
4. OpenTSDB
As its name suggests, OpenTSDB is an open source time series database. It is unique in this set of tools in that it stores its metrics in Hadoop, which means it is inherently scalable. If you already have a Hadoop cluster, this might be a good option for metrics you want to store long term. If you don’t, the operational overhead may be too large a burden. However, OpenTSDB now also supports Google’s Bigtable as a backend, a cloud service you don’t have to operate yourself.
OpenTSDB shares many features with the other systems. It uses a key-value pairing system, which it calls tags, to identify metrics and add dimensions. It has a query language, but it is more limited than Prometheus’s PromQL; it does, however, have several built-in functions that make it easier to learn and use. The API is the main entry point for queries, much like InfluxDB. The system can also store all data forever unless a time-to-live is set in HBase, so you don’t have to worry about fidelity degradation.
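Here is what that API looks like in practice, as a hedged sketch; the hostname, metric, and tag values are assumptions.

```python
# Store a datapoint via /api/put, then read it back via /api/query.
import time
import requests

requests.post('http://opentsdb.example.com:4242/api/put', json={
    'metric': 'sys.cpu.user',
    'timestamp': int(time.time()),
    'value': 42.5,
    'tags': {'host': 'web01'},  # tags add dimensions, much like labels
})

resp = requests.post('http://opentsdb.example.com:4242/api/query', json={
    'start': '1h-ago',
    'queries': [{'aggregator': 'avg', 'metric': 'sys.cpu.user',
                 'tags': {'host': '*'}}],
})
print(resp.json())
```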
OpenTSDB does not provide alerting, which will make it harder to integrate with your incident response process. This type of system can be very useful for long-term storage of Prometheus data and for more historical analysis to uncover systemic problems, rather than as a tool for quickly identifying and responding to acute concerns.
OpenMetrics standard
OpenMetrics is a working group seeking to establish a standard exposition format for metrics, influenced by the Prometheus format. If the effort succeeds, we will have an industry-wide abstraction that lets us switch between tools and vendors easily. Leading companies such as Datadog have already begun offering tools that can consume the Prometheus exposition format, which will be easy to convert to the OpenMetrics standard once it is released.
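If you want to see the exposition format OpenMetrics builds on, the Python client can render its registry as text; the metric name here is illustrative.

```python
# Render the default registry in the Prometheus text exposition format.
from prometheus_client import Counter, generate_latest

Counter('demo_events', 'Example counter').inc()
print(generate_latest().decode('utf-8'))
# Prints lines like:
#   # HELP demo_events_total Example counter
#   # TYPE demo_events_total counter
#   demo_events_total 1.0
```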
It’s also worth noting that the contributors to this project include Google and InfluxData (among others). This could mean that InfluxDB will eventually adopt the OpenMetrics standard, and that with Google behind a metrics standard, one of the big three cloud providers will be using it. Of course, the exposition format is already used in the Kubernetes project, which Google created. SolarWinds, Robust Perception, and SpaceNet are also involved.