A roundup of some useful open source monitoring tools
The monitoring system is a top priority in any IT architecture. From troubleshooting and problem location to business forecasting and operations management, a monitoring system is indispensable. It can be said that a stable and healthy IT architecture must have a reliable monitoring system behind it.
But is monitoring just monitoring? Over the years there has been a lot of confusion around monitoring terminology, and some rather poor tools claim to do everything in a single package.
In the era of DevOps and cloud native, "observability" has been introduced into the IT field, first visible in the appearance of an Observability group in the CNCF landscape. The group covers projects in monitoring, logging, tracing and related areas. What is the difference between observability and monitoring? In short, the latter is a subset of the former. Monitoring focuses on the factors that make a system fail, in order to define a failure model of the system; its core audience is operations, and its core concern is the monitoring facilities. Observability, beyond failures, is concerned with research and development, the application itself, and the system's self-inspection: it stands in the creator's shoes and asks how the system should properly expose its own state. One works from the outside in, the other from the inside out.
Observability tools include: metrics aggregation, log aggregation, alerting/visualization, and distributed tracing.
Monitoring tools
Prometheus
Prometheus, the most widely recognized time series monitoring solution for cloud native applications, is hosted by the CNCF and written in Go, as an open implementation in the spirit of Google's BorgMon monitoring system.
Prometheus uses a pull model, in which the Prometheus server pulls monitoring data from targets over HTTP. It relies on a local configuration that describes the endpoints to collect from and the collection interval. Each endpoint has a client that collects data and updates its representation on each request (or as the client configuration dictates). The collected data is stored in an efficient storage engine on local disk, using append-only files per metric.
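As a rough illustration of the pull model (not from the original article), the sketch below uses the official Python client library prometheus_client to expose a /metrics endpoint that a Prometheus server could scrape; the metric name, port, and update logic are made-up placeholders.

```python
# Minimal pull-model target: Prometheus scrapes this process over HTTP.
# Assumes the prometheus_client package is installed and port 8000 is free.
import random
import time

from prometheus_client import Gauge, start_http_server

# Hypothetical metric; it will appear on http://localhost:8000/metrics.
QUEUE_DEPTH = Gauge("app_queue_depth", "Items waiting in the work queue")

if __name__ == "__main__":
    start_http_server(8000)                      # expose /metrics for scraping
    while True:
        QUEUE_DEPTH.set(random.randint(0, 100))  # stand-in for a real reading
        time.sleep(5)
```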
Prometheus includes PromQL, a high-level expression language for selecting and displaying data. The data can be shown as graphs or tables through the REST API or consumed by external systems. The expression language lets users create regressions to analyze real-time data or trend historical data. Labels are also a great tool for filtering and querying data, and labels can be attached to each metric name.
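For instance, a hedged sketch of evaluating a PromQL expression through the HTTP API might look like the following; the server address, job label, and metric name are assumptions.

```python
# Evaluate a PromQL expression through Prometheus's HTTP API.
import requests

PROMETHEUS = "http://localhost:9090"             # assumed server address
# Per-second request rate over the last 5 minutes, filtered by a label.
query = 'rate(http_requests_total{job="api"}[5m])'

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": query})
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    print(series["metric"], series["value"])     # value = [timestamp, "number"]
```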
Prometheus ships with AlertManager to handle alerts. AlertManager supports alert aggregation and more sophisticated routing to limit when alerts are sent. If 10 nodes go down at the same time as a switch, you probably do not need alerts about those 10 nodes, because everyone receiving them can do nothing until the switch is fixed. With AlertManager, you can send the switch alert only to the network team, including information about the other systems that might be affected, and send an email (rather than a page) to the systems team so they know those nodes are down and do not need to respond unless the systems fail to recover after the switch is fixed. If that happens, AlertManager will reactivate the alerts that were suppressed by the switch alert.
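As a hedged sketch of how alerts enter AlertManager, the snippet below posts a hand-made test alert to the /api/v1/alerts endpoint (the interface Prometheus servers have traditionally pushed to); the host, labels, and annotations are invented, and routing and inhibition themselves are configured in alertmanager.yml rather than in code.

```python
# Post a synthetic alert to AlertManager; its routing tree then decides
# who gets paged, emailed, or silenced.
import datetime
import requests

ALERTMANAGER = "http://localhost:9093"           # assumed address

alerts = [{
    "labels": {
        "alertname": "SwitchDown",               # hypothetical alert name
        "severity": "critical",
        "team": "network",                       # routing can key off labels
    },
    "annotations": {"summary": "Core switch sw-01 is unreachable"},
    "startsAt": datetime.datetime.now(datetime.timezone.utc).isoformat(),
}]

requests.post(f"{ALERTMANAGER}/api/v1/alerts", json=alerts).raise_for_status()
```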
Graphite
Graphite is an open source enterprise-level monitoring and graphing tool written in Python that runs on inexpensive hardware. Graphite can collect, store and display time series data in real time. It consists of three software components:
- Carbon – a Twisted-based daemon that listens for and receives data;
- Whisper – a small database dedicated to storing time series data, similar in design to RRD;
- Graphite webapp – a Django-based web application that fetches time series data from Whisper databases and renders it.
Graphite architecture diagram
Graphite is a push-based system: applications push their data into Graphite's Carbon component. Carbon stores the data in the Whisper database, and the Graphite web component reads from Carbon and the database, letting users plot data in a browser or extract it through an API. One really cool feature is the ability to export graphs as images or data files so they can easily be embedded in other applications.
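A minimal sketch of that push model, assuming a Carbon plaintext listener on its usual port 2003 and a made-up metric path:

```python
# Push one data point to Carbon over the plaintext protocol:
# "<metric.path> <value> <unix_timestamp>\n"
import socket
import time

CARBON_HOST, CARBON_PORT = "localhost", 2003     # assumed Carbon listener
metric_path = "web.server01.requests"            # hypothetical metric path

line = f"{metric_path} 42 {int(time.time())}\n"
with socket.create_connection((CARBON_HOST, CARBON_PORT)) as sock:
    sock.sendall(line.encode("ascii"))
```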
Another interesting feature of Graphite is the ability to store arbitrary events related to the time series metrics. Application or infrastructure deployments can be added and tracked in Graphite, which allows operators or developers to troubleshoot problems with more context about the environment whose abnormal behavior is being investigated.
Graphite monitoring getting-started guide: http://www.infoq.com/cn/articles/graphite-intro
InfluxDB
InfluxDB is a relatively new time series database written in Go with no external dependencies. It is easy to install and configure and is well suited to building monitoring systems for large distributed systems. Its design goals include distribution and horizontal scaling.
Key features of InfluxDB:
- Schemaless: a measurement can have any number of columns;
- Metrics can be kept for a configurable period of time;
- Built-in time-related functions (such as min, max, sum, count, mean, median) make statistics easy;
- Retention policies: used to expire (delete) old data, since InfluxDB does not provide direct methods for deleting or modifying individual data points;
- Continuous queries: groups of statements that run automatically and periodically inside the database; combined with retention policies, they reduce InfluxDB's system usage;
- Native HTTP support, with a built-in HTTP API (see the sketch after this list);
- SQL-like query syntax;
- Configurable number of data replicas in a cluster;
- Periodic downsampling of data into another measurement, for storage at different granularities;
- A web management interface for convenience (http://<influxdb-ip>:8083).
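The sketch below (referenced in the HTTP item above) shows one write in line protocol and one SQL-like query against the 1.x-style HTTP API; the host, database, measurement, and field names are assumptions.

```python
# Write a point in line protocol, then run a SQL-like aggregation query.
import requests

INFLUX = "http://localhost:8086"                 # assumed HTTP API address
DB = "telemetry"                                 # hypothetical database

# Line protocol: measurement,tag_set field_set [timestamp]
point = "cpu,host=server01 usage_idle=91.5"
requests.post(f"{INFLUX}/write", params={"db": DB}, data=point).raise_for_status()

# Time-related functions such as mean() can be used directly in the query.
q = "SELECT mean(usage_idle) FROM cpu WHERE time > now() - 1h GROUP BY host"
resp = requests.get(f"{INFLUX}/query", params={"db": DB, "q": q})
print(resp.json())
```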
OpenTSDB
OpenTSDB is a distributed, scalable, HBase-based time series database; in essence, it is an HBase application. OpenTSDB is mainly used in monitoring systems, for example to collect, store and query monitoring data from large clusters (including network devices, operating systems, applications and environment status).
OpenTSDB can dynamically add metrics and flexibly supports collectors written in any language, which greatly helps operations staff and reduces development and maintenance costs.
Data stored in OpenTSDB is organized by metric. A metric is a monitoring item, such as a server's CPU usage or memory usage.
OpenTSDB uses HBase as its storage; thanks to its careful schema design, OpenTSDB supports storing metric data at second-level granularity.
OpenTSDB stores data permanently: saved data is never deleted and the raw data is always retained (some monitoring systems aggregate and roll up data from long ago).
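As a hedged sketch of a "collector in any language", the snippet below posts a single data point to OpenTSDB's standard /api/put HTTP endpoint; the host, metric name, and tags are made up.

```python
# Push one tagged data point to OpenTSDB over HTTP.
import time
import requests

OPENTSDB = "http://localhost:4242"               # assumed OpenTSDB address

datapoint = {
    "metric": "sys.cpu.user",                    # hypothetical metric
    "timestamp": int(time.time()),               # second-level precision
    "value": 23.5,
    "tags": {"host": "web01", "dc": "bj"},       # tags allow per-host queries
}

requests.post(f"{OPENTSDB}/api/put", json=[datapoint]).raise_for_status()
```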
Log aggregation tools
Some logging rules (a minimal sketch applying them follows this list):
- Include timestamps;
- Format as JSON;
- Do not log insignificant events;
- Log all application errors;
- Warnings may be logged;
- Turn logging on;
- Write messages in a human-readable form;
- Do not log informational data in production;
- Do not log anything a human cannot read or react to.
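A minimal sketch that applies several of these rules (timestamps, JSON format, human-readable messages) using only the Python standard library; the logger name and fields are illustrative, not a required schema.

```python
# Emit each log record as one JSON object with a UTC timestamp.
import datetime
import json
import logging


class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "timestamp": datetime.datetime.fromtimestamp(
                record.created, datetime.timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("payments")              # hypothetical application logger
log.addHandler(handler)
log.setLevel(logging.INFO)

log.error("charge failed for order %s", "A-1024")   # always log application errors
log.warning("payment gateway slow, retrying")       # warnings are acceptable
```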
ELK
ELK is short for Elasticsearch, Logstash, and Kibana. It is the most popular open source log aggregation stack on the market, used for real-time data retrieval and analysis by companies such as Netflix, Facebook, Microsoft, LinkedIn and Cisco. All three components are developed and maintained by Elastic. Elasticsearch is essentially a NoSQL database built on the Lucene search engine; Logstash is a log pipeline that extracts, transforms, and loads data into stores like Elasticsearch; Kibana is the visualization layer on top of Elasticsearch.
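To make the division of labor concrete, here is a hedged sketch of talking to Elasticsearch directly over its REST API: one log document is indexed and then searched back. The host, index name, and fields are assumptions; in a real ELK deployment Logstash or Beats would do the indexing.

```python
# Index a log document, then run a full-text search against it.
import requests

ES = "http://localhost:9200"                     # assumed Elasticsearch address
INDEX = "app-logs"                               # hypothetical index name

doc = {"timestamp": "2018-08-01T12:00:00Z", "level": "ERROR",
       "message": "upstream timeout talking to payments", "service": "checkout"}
# refresh=true makes the document searchable immediately.
requests.post(f"{ES}/{INDEX}/_doc", params={"refresh": "true"},
              json=doc).raise_for_status()

query = {"query": {"match": {"level": "ERROR"}}}
resp = requests.post(f"{ES}/{INDEX}/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"]["message"])
```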
A few years ago Beats, a family of lightweight data shippers, was introduced to simplify getting data into Logstash. You can install a Beat to export NGINX logs or Envoy proxy logs and use them effectively in Elasticsearch without having to know the correct syntax for each log type.
A production-grade ELK stack often includes other components such as Kafka, Redis, and NGINX, and Logstash is frequently replaced with Fluentd. The system used to be complicated to operate, and many early problems led to a lot of complaints. These issues have largely been fixed, but it is still a complex system, so smaller operations may not want to take it on.
The ELK stack also provides excellent visualization through Kibana, but it lacks alerting. Elastic offers alerting in the paid X-Pack plug-in, but nothing is built into the open source stack. Yelp's ElastAlert is one solution to this problem, and there are likely others. This additional software is quite powerful, but it adds further complexity to an already complex ELK stack.
The ELK Stack has risen rapidly over the last two years. Compared with traditional log processing solutions, it has the following advantages:
- Flexible processing: Elasticsearch is a real-time full-text index, and unlike Storm it requires no programming up front;
- Simple configuration: Elasticsearch uses a JSON interface and Logstash uses a Ruby DSL, both among the most common configuration styles in the industry;
- Efficient retrieval: although every query is computed in real time, a well-designed deployment can achieve second-level responses for queries over a full day of data;
- Linear cluster scaling: both Elasticsearch and Logstash clusters scale out linearly;
- Impressive front end: in the Kibana interface you can search, aggregate and build dazzling dashboards with a few mouse clicks.
Graylog
Graylog is a powerful log management and analysis tool built on Elasticsearch, Java, and MongoDB, which makes it as complex to run as the ELK stack, if not more so. However, the open source version of Graylog ships with built-in alerting, as well as other noteworthy features such as streams, message rewriting, and geolocation.
The streams feature routes data to particular streams in real time while it is being processed. With it, users can see all database errors in one stream and web server errors in another, and alerts can even be based on these streams when new items arrive or a threshold is exceeded. Latency is one of the biggest problems with log aggregation systems, and streams in Graylog eliminate it: as soon as logs arrive, they can be routed to other systems through a stream without further processing.
The message-rewriting feature uses the open source rules engine Drools, which evaluates every incoming message against a user-defined rules file. The rules can drop messages (known as blacklisting), add or remove fields, and modify the message content.
Perhaps Graylog's coolest feature is geolocation, which plots IP addresses on a map. This functionality is fairly common and also available in Kibana, but Graylog adds a lot of value here, especially if you want to use it as a SIEM system. Geolocation is available in the open source version of Graylog.
Image source: https://testerhome.com/topics/3026
Graylog's attractions:
- An all-in-one solution that is easy to install, unlike ELK, where three independent systems have to be integrated;
- It collects raw logs and lets you add fields later, such as http_status_code and response_time;
- You write your own log collection scripts and send the results to the Graylog server with curl/netcat. The sending format is GELF; Fluentd and Logstash have corresponding GELF output plug-ins, and writing your own sender gives you a lot of freedom. For example, you can use inotifywait to watch for modify events on a log file and send each new line to the Graylog server with curl or netcat (see the sketch after this list);
- Search results are highlighted, just like Google;
- The search syntax is simple, for example source:mongo AND response_time_ms:>5000, which saves you from writing Elasticsearch JSON query syntax;
- Search criteria can be exported as Elasticsearch JSON text, which makes it easy to develop search scripts that call the Elasticsearch REST API.
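A hedged sketch of the custom-collector approach mentioned above, sending one GELF message over HTTP instead of curl/netcat; it assumes a GELF HTTP input listening on port 12201, and the host and fields are invented.

```python
# Send a GELF message to Graylog; keys prefixed with "_" become custom fields.
import requests

GRAYLOG_GELF = "http://localhost:12201/gelf"     # assumed GELF HTTP input

message = {
    "version": "1.1",
    "host": "web01",
    "short_message": "GET /checkout 502",
    "level": 3,                                  # syslog severity: 3 = error
    "_http_status_code": 502,                    # custom field added by us
    "_response_time_ms": 5120,
}

requests.post(GRAYLOG_GELF, json=message).raise_for_status()
```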
Graylog’s official website is https://www.graylog.org/
Fluentd
Fluentd is a completely free and open source log collection tool that supports collecting logs from more than 125 types of systems. It is written in C and Ruby, has been accepted by the CNCF as an incubating project, and is recommended by AWS and Google Cloud. In many installations Fluentd has become a common replacement for Logstash, acting as a local aggregator that collects all node logs and sends them to central storage systems. It is not, however, a full log aggregation system by itself.
Image source: http://www.muzixing.com/pages/2017/02/05/fluentdru-men-jiao-cheng.html
Fluentd has a powerful plug-in system with over 500 plug-ins available, making it quick and easy to integrate different data sources and outputs and covering most use cases.
Fluentd has low memory requirements (only a few tens of megabytes) and high throughput, making it a common choice in Kubernetes environments. In a Kubernetes environment where each pod runs a Fluentd sidecar, memory consumption grows linearly with every new pod created, so using a lightweight collector such as Fluentd greatly reduces system utilization.
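As a hedged sketch of an application handing logs to a local Fluentd agent, the snippet below uses the fluent-logger Python library and assumes a forward input on the default port 24224; the tag and record fields are made up.

```python
# Emit one structured event to Fluentd; output plugins route it onward
# (to Elasticsearch, S3, etc.) based on the tag "app.access".
from fluent import sender

logger = sender.FluentSender("app", host="localhost", port=24224)

if not logger.emit("access", {"method": "GET", "path": "/health", "status": 200}):
    print(logger.last_error)   # the library records errors instead of raising
logger.close()
```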
Alerting and visualization tools
The purpose of alerting and visualization tools is clear from their names: they focus on understanding the output of other systems, which is why they are grouped together here. Visualization and alerting tools present that output in a structured way.
Common types of alerts and visualizations
First, we need to be clear about what should not be an alert: if the responder cannot do anything about the problem, no alert should be sent. That includes alerts sent to many people when only a few of them can actually respond.
For example, if an operator receives hundreds of emails a day from an alerting system, he will start ignoring everything it sends and only respond to real incidents when a customer emails or the boss calls. In that case the alerts have lost their meaning and usefulness.
An alert should not be a stream of messages or status updates. Many systems call such data points alerts, but they represent events that should be known about, not responded to. They belong in the visualization part of an alerting tool rather than among the events that trigger actual notifications.
Alerts fall into two types: internal outages and external outages. In this model, degraded system performance is treated as an outage, since it is often unclear how severely each user is affected.
Internal outages have a lower priority than external outages but still require a fast response. They typically involve internal systems used by company employees or application components visible only to employees.
External outages include any system outage that immediately affects customers. They do not include outages that merely prevent the release of system updates, but they do include customer-facing application failures, database outages, and network partitions, as well as failures of tools that may not affect users directly.
Visualization
Common scenarios for visualizing and understanding systems are:
Line charts: Line charts are probably the most common and ubiquitous visualization. They give a good sense of a system over time, and they can be stacked to show relationships; for example, you might want to view requests on each server individually, but also view them in aggregate.
Heat maps: Another common visualization is the heat map, which is useful when viewed alongside a histogram. This type of visualization is similar to a bar chart, but it can show gradients within the bars that represent the different percentiles of the overall metric.
For example, suppose you are looking at request latency and want to quickly understand the overall trend and distribution of all requests. A heat map is ideal for this, as the colors let you see at a glance how many requests fall into each bucket.
Gauge: A gauge is a visualization of a single-value metric. It can be a semicircle or a full circle; the thickness of the inner and outer lines can be customized to the desired aesthetic, and the gauge and text colors are fully customizable according to a set of rules.
Flame graph: A flame graph is an SVG image generated from perf results that shows the CPU call stack.
The Y-axis represents the call stack, with each layer being a function; the deeper the call stack, the taller the flame, with the function currently executing at the top and its parent functions below.
The X-axis represents the number of samples: the wider a function is along the X-axis, the more samples it was captured in, i.e. the longer it ran. Note that the X-axis does not represent the passage of time; the call stacks are simply ordered alphabetically.
Alerting tools
Bosun
Bosun is a new monitoring and alerting system built by the Stack Exchange team and written in Go. It supports defining complex alert rules and supports OpenTSDB, Graphite, Logstash/Elasticsearch and other data sources. Bosun is a strong competitor to Zabbix and Nagios.
Image source: http://blog.tangyingkang.com/post/2016/12/06/bosun-alert-guide/
One neat feature of Bosun is that it lets you test alerts against historical data. It also has some common features such as displaying simple graphs and creating alerts. Bosun uses an expression language to query monitoring metrics; it can be thought of as a small programming language, which makes it flexible and powerful, but also slightly complex.
Cabot
Cabot is a free, open source, lightweight monitoring and alerting service that combines some of the best features of PagerDuty, Server Density, Pingdom, and Nagios, but is neither as complex nor as expensive as these tools.
Cabot's architecture is similar to Bosun's in that it does not collect data itself. Its native support for Graphite and Jenkins is rare among such tools.
Cabot provides a web interface that lets you monitor services (e.g. "staging Redis server", "production Elasticsearch cluster") and send phone, SMS or email alerts to the on-duty team if a service fails, without you having to write a single line of code.
Cabot's alerts can be based on:
- monitoring data collected by Graphite;
- the response content and status code of a URL;
- the status of a Jenkins build job;
without the need to implement and maintain an entirely new data collection system.
Visualization tools
Grafana
Grafana is an open source program for visualizing large amounts of measurement data. Written in Go, it is full-featured, with beautiful dashboards and charts that can be used to analyze and graph logs (such as API request logs). It supports data sources such as Graphite, Elasticsearch, InfluxDB and OpenTSDB, is most commonly used for network infrastructure and application analysis, and offers hot-swappable dashboards and pluggable data sources.
Using Grafana, you can set up alerts intuitively: look at a chart, see where an alert should fire because of degraded performance, click on the chart to create the alert, and tell Grafana where to send it. This is a very powerful addition that will not necessarily replace an alerting platform, but will certainly enhance it.
Essentially, Grafana is a feature-rich replacement for Graphite-Web that helps users create and edit dashboards more easily. It includes a unique Graphite target parser that makes editing metrics and functions simple. Grafana's fast client-side rendering, using Flot by default, works even with long time ranges and lets users build complex charts with lines, points, and smart axis formats.
Vizceral
Vizceral is an open source project released by Netflix for monitoring network traffic between applications and clusters in near real time.
Vizceral is a set of components, implemented with WebGL, that dynamically render traffic graphs and support viewing and interacting with the data. It presents the data at three levels, from a global view down to individual regions and their internals, which makes the data much more intuitive to explore.
The Vizceral component can take multiple traffic graphs and generate a "global" graph showing all incoming traffic into each "region", including cross-region communication.
Netflix's cross-region traffic graph
Compiled from https://opensource.com/article/18/8/now-available-open-source-guide-devops-monitoring-tools