These open source tools provide output to help users understand the health of the system and alert them to potential problems.

You probably already know, or can guess, what alerting and visualization tools are for. Here is why they are worth discussing as observability tools, and why some systems offer visualization as a distinguishing feature.

The concept of Observability, derived from control theory, describes our ability to understand a system through its inputs and outputs. This article focuses on observable output components.

Alerting and visualization tools analyze the output of other systems and present it in a structured form. An alert describes an abnormal state of a system, and a visualization is a structured representation of that output that users can grasp intuitively.

Common types of alerts and visualizations

Alerts

First, we need to be clear about what an alert is. An alert should not be sent when no one can act on it. That rules out alerts sent to many people when only a few of them can respond, and alerts fired for every exception in the system. Both lead to alert fatigue: recipients start ignoring the flood of alerts until the system has deteriorated so far that the problem reaches them some other way.

For example, an administrator who receives hundreds of emails a day from an alerting system will soon ignore all of them. The administrator then learns of a problem only by noticing it directly or by being asked about it by a customer or a manager. At that point the alerts have lost their meaning and purpose.

An alert is not a continuous stream of information or status updates. Its purpose is to expose a fault that the system cannot recover from automatically, and it should be sent only to the person most likely to resolve the problem. Anything outside this definition should not be treated as an alert, because it will only get in the way of actual work.
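The definition above can be read as a small routing rule. The following is only a sketch under stated assumptions: the `Event` fields and the `route` function are hypothetical names invented for illustration, not part of any tool discussed in this article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    """A hypothetical monitoring event; every field name is illustrative."""
    summary: str
    auto_remediated: bool   # the system already recovered on its own
    actionable: bool        # a person can actually do something about it
    owner: Optional[str]    # the person most likely to solve the problem


def route(event: Event) -> Optional[str]:
    """Return the single recipient to notify, or None if this is not an alert."""
    if event.auto_remediated or not event.actionable:
        return None         # keep it as a data point, notify nobody
    return event.owner      # one recipient, not a broadcast list


disk_full = Event("disk 95% full on db-1", False, True, "dba-on-call")
pod_restart = Event("pod restarted and recovered", True, False, None)

print(route(disk_full))     # dba-on-call
print(route(pod_restart))   # None
```

The point of returning `None` is that a non-actionable event is still recorded somewhere, but it never reaches a person's inbox.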

Different alerting systems have their own alert types, so it is not possible to generalize in terms of priority levels (P1-P5) or labels such as “information,” “warning,” and “critical.” Instead, I will describe some general groupings that are emerging in incident response for complex systems.

I mentioned an “information” alert type above, but an alert should not really be informational, although some people may disagree. In my view, if an alert is not sent to anyone, it is not an alert. It is a data point that many systems label as an alert: an event that should be known about but does not need a response. It belongs in a visualization tool rather than being an event that triggers an alert. Mike Julian, author of Practical Monitoring, a must-read book in the field, presents his own view of alerts there.

Non-informational alerts are those that require a response and an action. I roughly divide them into internal and external outages, and most companies have two or more levels for prioritizing the response. Degraded system performance counts as an outage, because its impact on users is usually unknown.

Internal outages have a lower priority than external outages, but they still require a quick response. They typically involve internal systems used by employees, or application failures that are visible only to employees.

External outages are system failures that have an immediate business impact. They do not include failures that only affect updates to the system. External outages include application, database, and network failures that break the availability or consistency of the system. Failures of dependent components that do not yet affect users directly also count, because as the application keeps running, a failed dependency will eventually degrade it. This is common in systems that rely on external services or data sources: these may not be part of the system’s primary functionality, yet errors in them can cause significant delays.
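To make the grouping concrete, here is a minimal sketch of the three categories described above. The `Priority` levels and the `classify` function are assumptions made for illustration; real alerting systems define their own levels.

```python
from enum import IntEnum


class Priority(IntEnum):
    """Hypothetical priority levels; lower numbers are handled first."""
    EXTERNAL_OUTAGE = 1   # customers are, or soon will be, affected
    INTERNAL_OUTAGE = 2   # only employees are affected
    INFORMATIONAL = 3     # a data point for dashboards, nobody is notified


def classify(needs_response: bool, affects_customers: bool) -> Priority:
    """Group an event the way the text above describes."""
    if not needs_response:
        return Priority.INFORMATIONAL
    if affects_customers:
        return Priority.EXTERNAL_OUTAGE
    return Priority.INTERNAL_OUTAGE


events = [
    ("wiki is down", classify(True, False)),
    ("nightly backup finished", classify(False, False)),
    ("checkout API returns errors", classify(True, True)),
]

# Handle the most urgent group first.
ordered = sorted(events, key=lambda event: event[1])
for summary, priority in ordered:
    print(priority.name, "-", summary)
```

Only the first two levels would ever notify a person; the informational level feeds the visualization side.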

Visualization

There are many types of visualization, and I will not go through them all. It is a fascinating research area, and in my years of data analysis work, learning and applying visualization has remained a constant challenge. To communicate effectively, we need to present the output of complex systems to others in an intuitive way. Both Google Charts and Tableau offer a wide range of visualization types. Here are some of the most common ones and some of the more innovative ones.

Line chart

Line charts are probably the most common form of visualization, giving the user an intuitive view of how the system behaves over time. Each individual or aggregated metric is drawn as a line. When too many lines share one chart, it becomes hard to read (as shown in the figure below), so in most cases you should view only a few lines at a time. A line chart makes it easy to spot a metric that moves outside its normal range, such as the abnormal purple, yellow, and light blue lines in the figure below.




Another use for line charts is to stack multiple lines to show how they relate. For example, if a line chart shows the number of requests handled by each server, the requests can be viewed per server or in aggregate. This lets you see the whole system and each instance in the same chart.
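Stacking is just a running total across servers at each point in time. The sketch below computes the upper edge of each band in a stacked chart using only the standard library; the server names and request counts are made-up sample data.

```python
from itertools import accumulate

# Made-up requests per interval for three hypothetical servers.
requests_per_server = {
    "web-1": [120, 130, 125, 400],
    "web-2": [110, 115, 120, 118],
    "web-3": [100, 105, 98, 102],
}


def stack(series):
    """Return the upper edge of each band in a stacked chart, bottom to top."""
    names = list(series)
    # zip(*...) groups the values of all servers at each time step.
    columns = zip(*(series[name] for name in names))
    # accumulate turns per-server values into running totals per time step.
    stacked_columns = [list(accumulate(column)) for column in columns]
    # Transpose back so each server has one line again.
    lines = (list(line) for line in zip(*stacked_columns))
    return dict(zip(names, lines))


stacked = stack(requests_per_server)
print(stacked["web-1"])   # [120, 130, 125, 400]
print(stacked["web-3"])   # [330, 350, 343, 620]
```

The top line (`web-3` here) is the total for the whole system, while the height of each band is still the contribution of a single server, so the spike on `web-1` in the last interval stays visible.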




Heat map

Another common form of visualization is the heat map. A heat map is similar to a bar chart, but it can also show how each part contributes to the whole. For example, when looking at network request latency, a heat map lets you quickly see the overall trend and the distribution of all requests. Different colors represent the values of the different parts.

In the following heat map, the vertical spread of the colored blocks in each time period shows that most of the data is concentrated in the middle of the range. The blocks are fairly sparse in most time periods but very dense from 14:00 to 15:00, which may indicate an unhealthy state.
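Behind a heat map is a two-dimensional histogram: each sample falls into one cell defined by a time bucket and a value bucket, and the count per cell drives the color. Here is a minimal sketch; the bucket sizes and the sample data are assumptions chosen for illustration.

```python
from collections import Counter


def heatmap_bins(samples, time_bucket_s=3600, latency_bucket_ms=100):
    """Count samples per (time bucket, latency bucket) cell."""
    cells = Counter()
    for timestamp_s, latency_ms in samples:
        cell = (timestamp_s // time_bucket_s, latency_ms // latency_bucket_ms)
        cells[cell] += 1
    return cells


# Made-up (seconds since midnight, latency in ms) samples.
samples = [
    (13 * 3600 + 10, 120),
    (13 * 3600 + 900, 480),
    (14 * 3600 + 5, 210),
    (14 * 3600 + 60, 230),
    (14 * 3600 + 700, 250),
    (14 * 3600 + 1800, 290),
]

cells = heatmap_bins(samples)
for (hour, bucket), count in sorted(cells.items()):
    low = bucket * 100
    print(f"{hour}:00  {low}-{low + 99} ms  {count} request(s)")
```

In this sample, the cell for 14:00 and 200-299 ms holds four of the six requests, which is the kind of dense block that stands out in the rendered chart.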




Gauge

Another common form of visualization is the gauge, which lets users read a single metric at a glance. A gauge usually displays one metric, just as a speedometer shows the speed of a car and a fuel gauge shows how much gasoline is left in the tank. Most gauges have one thing in common: they divide the range of the metric into states. In the figure below, green is normal, orange is bad, and red is very bad. The middle row of the figure imitates the look of a physical gauge.




Besides the conventional dial style, the chart above also shows a more direct display that prints the value itself on the same color scheme, so the status of each metric is visible at a glance, just as with a dial. The bottom row is probably the best way to display gauges, because the user can get a rough idea of the state of each metric without reading it closely. This is the type of visualization I use most: within seconds I get a complete overview of how the system is doing.
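The color scheme of a gauge boils down to two thresholds. The sketch below assumes that higher values are worse; the function name, thresholds, and metrics are hypothetical.

```python
def gauge_status(value, warn_at, critical_at):
    """Map a metric value to a gauge color, assuming higher is worse."""
    if value >= critical_at:
        return "red"
    if value >= warn_at:
        return "orange"
    return "green"


# Made-up CPU utilization percentages for three hosts.
cpu_percent = {"web-1": 42, "web-2": 75, "db-1": 95}
for host, value in cpu_percent.items():
    print(host, value, gauge_status(value, warn_at=70, critical_at=90))
```

For a metric where lower is worse, such as free disk space, the comparisons would need to be reversed.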

Flame graph

The flame graph, introduced in 2011 by Brendan Gregg of Netflix, is one of the less common forms of visualization. Unlike a gauge, which gives up its key information at a glance, a flame graph is usually consulted only when an application problem needs to be diagnosed. It visualizes CPU, memory, and related stack frames, with the X axis listing frames alphabetically and the Y axis showing stack depth. Each rectangle is a stack frame labeled with the function being called. The wider the rectangle, the more often it appears in the sampled stacks. Flame graphs are very helpful when analyzing performance problems, and they are worth trying.
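The width of a rectangle can be derived by counting how many sampled stacks contain a given call path. The following is a simplified sketch of that idea, not the actual flame graph tooling; the function name and the sampled stacks are invented for illustration.

```python
from collections import Counter


def frame_widths(stacks):
    """Count, for each call path, how many sampled stacks pass through it."""
    widths = Counter()
    for stack in stacks:
        # Every prefix of a stack is one frame at one depth.
        for depth in range(1, len(stack) + 1):
            widths[stack[:depth]] += 1
    return widths


# Made-up stack samples, outermost function first.
samples = [
    ("main", "handle_request", "parse_json"),
    ("main", "handle_request", "parse_json"),
    ("main", "handle_request", "query_db"),
    ("main", "idle"),
]

widths = frame_widths(samples)
# Sorting the paths keeps sibling frames in alphabetical order.
for path in sorted(widths):
    indent = "  " * (len(path) - 1)
    print(f"{indent}{path[-1]}: {widths[path]}/{len(samples)} samples")
```

Here `handle_request` is present in three of four samples, so it would be drawn three quarters as wide as `main`, with `parse_json` and `query_db` sitting on top of it.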




Tool selection

On the alerting side, there are several commercial tools that are quite good. But since this is an introduction to open source technology, I will cover only free tools that are already widely used. Hopefully you will contribute your own code to make them better.

Alerting tools

Bosun

If you have ever had a problem with your computer, Stack Exchange has probably helped you find a solution online. Stack Exchange runs many different sites on a crowdsourced Q&A model. Among them are Stack Overflow, which is popular with developers, and Super User, which is well known among operations staff. Beyond those, Stack Exchange covers everything from parenting to science fiction and from philosophy to bicycles.

Stack Exchange open-sourced its alert management system, Bosun, around the same time that Prometheus and its AlertManager were released. The two systems have a lot in common. Like Prometheus, Bosun is written in Go, but its scope is broader than that of Prometheus because it can interact with systems beyond metrics aggregation. Bosun can also ingest data from log and event collection systems, and it supports Graphite, InfluxDB, OpenTSDB, and Elasticsearch.

The Bosun architecture consists of a single server binary, a backend such as OpenTSDB, Redis, and the SCollector agent. The SCollector agent automatically detects the services running on a host and reports metrics for those processes and for other system resources. This data is sent to the backend. The Bosun server binary then queries the backend to determine whether an alert needs to be triggered. Tools such as Grafana can also query the underlying backends through Bosun using a common interface. Redis stores Bosun’s state and metadata.

Bosun has a neat feature that lets you test alerts against historical data. I badly needed this when I was using Prometheus a few years ago: I had an anomaly that should have produced an alert and no easy way to test the rule. To make sure the alert would fire correctly, I had to create matching data by hand. Bosun makes this step much quicker.
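To illustrate what testing a rule against history means, here is a generic sketch. It does not use Bosun’s expression language or API; the `backtest` function, the rule shape (“above a threshold for N consecutive points”), and the data are all assumptions made for illustration.

```python
def backtest(history, threshold, for_points=3):
    """Return the timestamps at which the rule would have fired.

    The rule fires once when the value has been above `threshold`
    for `for_points` consecutive data points.
    """
    fired = []
    streak = 0
    for timestamp, value in history:
        streak = streak + 1 if value > threshold else 0
        if streak == for_points:
            fired.append(timestamp)
    return fired


# Made-up CPU percentages; the index stands in for a timestamp.
history = list(enumerate([40, 45, 91, 93, 95, 97, 50, 92, 94, 60]))

print(backtest(history, threshold=90))                # [4]
print(backtest(history, threshold=90, for_points=2))  # [3, 8]
```

Replaying the rule with different parameters shows at once that the stricter setting would have missed the second, shorter spike, which is the kind of question that is hard to answer without historical data.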

Bosun covers all the usual features, including simple graphing and alert creation. It also comes with a powerful expression language for writing alert rules. However, Bosun ships with only email and HTTP notification configurations by default, so connecting it to Slack or other tools requires more customization (which is documented). Like Prometheus, Bosun uses templates for notifications, so you can use HTML and CSS to build the email notifications you need.

Cabot

Cabot was developed by Arachnys. You may not have heard of Arachnys, but it is influential: it has built an advanced cloud-based solution for preventing financial crime. At my previous company, I worked on something similar involving Know Your Customer (KYC). Most companies would consider it quite damaging to be associated with a terrorist group that uses their systems to move money. These solutions also help guard against fraud, which is a lesser offense but still poses a risk to an organization.

Why did Arachnys develop Cabot? Simply because its developers could not get comfortable with Nagios. Cabot is good news for a lot of people. It is built on Django and Bootstrap, so the barrier to contributing to the project is not high. (It should also be noted that Cabot is named after the developer’s dog.)

Like Bosun, Cabot does not collect data itself. Instead, it uses the data exposed through the APIs of the systems it monitors, so Cabot alerts on a pull model rather than a push model. It calls the API of each monitored system, retrieves the data it needs for a specific check, and stores the alert data in a Postgres database, with Redis as a cache.
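The pull model can be sketched in a few lines: the checker asks each system for a value and compares it with a limit. This is not Cabot’s code or API; `run_checks`, the metric names, and the dictionary standing in for a monitored system’s API are all hypothetical.

```python
from typing import Callable, Dict


def run_checks(fetch: Callable[[str], float], checks: Dict[str, float]) -> Dict[str, bool]:
    """Pull each metric and compare it with its upper limit.

    A value of True means the check is failing.
    """
    return {metric: fetch(metric) > limit for metric, limit in checks.items()}


# A dictionary stands in for the API of a monitored system.
fake_api = {"queue.depth": 1200.0, "api.error_rate": 0.01}

results = run_checks(
    fake_api.__getitem__,
    {"queue.depth": 1000.0, "api.error_rate": 0.05},
)
print(results)   # {'queue.depth': True, 'api.error_rate': False}
```

In a push model the roles would be reversed: the monitored system would send its values to the checker, and the checker would have no `fetch` step at all.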

A rare feature of Cabot is that it natively supports not only Graphite but also Jenkins. Jenkins is treated here as a centralized scheduler of timed jobs, and build failures are treated as outages. A build failure is certainly not as urgent as a system outage, but when a build fails the team still needs to act, and not everyone checks Jenkins after getting an email about a failed build.

Another interesting feature of Cabot is that it can read Google Calendar to determine who is on call. This feature, called Rota, is very useful, and I hope other alerting systems adopt it. Cabot currently supports scheduling only primary and secondary contacts, so there is room for improvement. Its own documentation notes that if you need something more complete, you should consider a paid solution.

StatsAgg

It is rare, and admirable, for a publishing company to be behind an alerting platform, but Pearson built StatsAgg. Pearson also operates several websites and has a joint venture with O’Reilly Media. Still, I think of it as the company that publishes textbooks.

In addition to being an alerting platform, StatsAgg is a metric aggregation platform and can even act as a proxy for other systems. It accepts data from Graphite, StatsD, InfluxDB, and OpenTSDB, and it can forward the metrics to various platforms. Routing everything through a central service becomes riskier as its load grows. On the other hand, if the StatsAgg infrastructure is robust enough, alerting keeps working even when a backend storage platform fails.

StatsAgg is written in Java and, to keep complexity low, consists only of the main service and a UI. It can send alerts based on regular expression matching, and it focuses on alerting per service rather than per server. Its stated goal is to fill a gap in open source monitoring tools, and I think it does.
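Alerting per service by regular expression can be sketched as follows. This is a generic illustration, not StatsAgg’s implementation; the `breaching` function and the `service.host.metric` naming scheme are assumptions.

```python
import re


def breaching(metrics, name_pattern, threshold):
    """Return the names of metrics that match the pattern and exceed the threshold."""
    pattern = re.compile(name_pattern)
    return sorted(
        name
        for name, value in metrics.items()
        if pattern.fullmatch(name) and value > threshold
    )


# Made-up metrics named service.host.metric.
metrics = {
    "checkout.web-1.latency_ms": 180,
    "checkout.web-2.latency_ms": 950,
    "search.web-1.latency_ms": 990,
    "checkout.web-2.cpu_pct": 97,
}

# One rule covers every host of the checkout service.
print(breaching(metrics, r"checkout\..*\.latency_ms", 500))
# ['checkout.web-2.latency_ms']
```

Because the rule is tied to the service name rather than to a host, hosts can be added or removed without touching the alert definition.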

Visualization tools

Grafana

Grafana is well known and widely adopted. Whenever I need a dashboard, it is the first tool I think of, because it is better than any similar product I have used. Grafana was developed by Torkel Odegaard and, like Cabot, was written over Christmas; it was released in January 2014. It has come a long way in just a few years. Grafana started as a fork of Kibana.

Grafana focuses on practical and attractive presentation of data. It natively reads data from Graphite, Elasticsearch, OpenTSDB, Prometheus, and InfluxDB. There is also a commercial version of Grafana that supports more data sources through plug-ins, but nothing prevents such plug-ins from being written as open source, and the Grafana plug-in ecosystem already provides a wide variety of data sources.

What can Grafana do? It provides a central place to understand your systems. It presents data over the web, so anyone can access it, although authentication can be used to restrict access. Grafana supports many types of visualization for an at-a-glance view of the system, and it now integrates alerting into those visualizations.

You can now set alerts visually. In Grafana, you can look at a chart, perhaps one showing where an alert should have fired because of degraded performance, click the point on the chart where you want the alert to trigger, and tell Grafana where to send it. This is a powerful addition. It will not necessarily replace an alerting platform, but it can certainly complement one by offering a different perspective on alert criteria.

Grafana has also introduced a number of team collaboration features. Dashboards can be shared between users, and you no longer need to build your own dashboards for a Kubernetes cluster, because dashboards maintained jointly by Kubernetes developers and Grafana developers are already available.

An important collaboration feature is annotation. Annotations let users add context to a chart so that others can understand it more easily. This matters when team members are working an incident and need to communicate and build a shared understanding. Having all the relevant information in the place where it is needed helps the team reach consensus quickly. It is equally useful afterward, when the team investigates the cause of the failure in a post-incident review.

Vizceral

Vizceral was developed by Netflix to better understand traffic during a failure. Where Grafana is a general-purpose tool, Vizceral is specialized. Netflix says it no longer uses Vizceral internally or actively maintains it, although it still updates the tool periodically. I mention it here mainly to describe its visualization approach and how it can help in troubleshooting. You can run it in the sample environment to get a better feel for this kind of system.


via: https://opensource.com/article/18/10/alerting-and-visualization-tools-sysadmins

By Dan Barker (lujun9972

This article was originally compiled by LCTT and published by Linux China.