Preface

With the rapid development of cloud computing and the Internet, many applications now span different network terminals and rely heavily on third-party services (such as payment, login, and navigation), which makes IT system architectures increasingly complex. Rapid iteration of product requirements and a good user experience require IT operations and maintenance (O&M) managers to keep core business stable and available at all times, and several pain points in enterprise O&M urgently need to be solved:

  • 1. Business-oriented O&M cares not only about the running state of individual IT resources, but also about the health of the entire business system

  • 2. An enterprise that uses a large number of APIs and modular applications needs to track performance changes and metrics for each interface

  • 3. O&M supervisors and enterprise management particularly need a large wall-mounted monitoring screen

  • 4. O&M teams need weekly and monthly trend analysis reports, but exporting data from traditional O&M tools is difficult

  • 5. Faults must be reported immediately and the faulty node located quickly, to reduce the loss caused by business interruption

With the gradual adoption and rapid growth of DevOps, cloud computing, microservices, containers and related concepts, there are more and more machines, applications and microservices, and application runtime environments are increasingly diverse: containers, virtual machines, and physical machines. Faced with hundreds or thousands of virtual machines and containers, and dozens of types of objects to monitor, can the existing monitoring system cope? How can a single solution collect and analyze metric data from containers, VMs, physical machines, network devices, and middleware quickly and completely? What architecture and technical solution best suits such large and complicated monitoring requirements?

I. Analysis of unified monitoring platform architecture

To review, the unified monitoring platform consists of seven roles: monitoring source, data collection, data storage, data analysis, data presentation, early warning center, and CMDB (enterprise hardware and software asset management).

  • Monitoring source:

In terms of hierarchy, monitoring sources can be roughly divided into three layers: the business application layer, the middleware layer and the infrastructure layer. The business application layer mainly includes application software and the enterprise message bus; the middleware layer includes databases, caches, the configuration center and other system software; and the infrastructure layer mainly includes physical machines, virtual machines, containers, network devices, storage devices and so on.

  • Data collection:

With so many sources of data, data collection is not an easy task. Collected indicators can be divided into business indicators, application indicators, system software indicators and system indicators. Application monitoring indicators include availability, exceptions, throughput, response time, current wait count, request volume, resource utilization, log size, performance, queue depth, thread count, service call count, visit count, and service availability. Business monitoring indicators include large-value transactions, transaction region, transaction details, request count, response count, and response time. System monitoring indicators include CPU load, memory load, disk load, network I/O, disk I/O, TCP connection count, and process count.

Collection methods can be divided into interface collection, client agent collection, and active capture through network protocols (HTTP, SNMP, etc.).

  • Data storage:

Collected data is generally stored in file systems (such as HDFS), index systems (such as Elasticsearch), metric stores (such as InfluxDB), message queues (such as Kafka, for temporary message storage or buffering), and databases (such as MySQL).

  • Data analysis:

The collected data then has to be processed. Processing is either real-time or batch. The technologies include MapReduce computing, full log retrieval, stream computing, and metric computing; the computing method is chosen according to the scenario.

  • Data presentation:

In the multi-screen era, cross-device support is essential.

  • Warning:

If a problem is found during data processing, exceptions need to be analyzed, risks estimated, and alarm events triggered.

  • CMDB (enterprise hardware and software asset management):

The CMDB is an important part of the unified monitoring platform. Although monitoring sources are varied, most of them are related: an application runs in a runtime environment, its normal operation relies on network and storage devices, and it may also depend on other applications. Once any link fails, the application can become unavailable. In addition to storing hardware and software assets, the CMDB therefore also stores the relationships between assets. If an asset fails, these relationships can be used to quickly determine which other assets will be affected, and the problems can then be resolved one by one.
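
The impact lookup described above can be sketched as a graph traversal. This is only a minimal illustration, not how any particular CMDB product works: the asset names and the DEPENDS_ON mapping are hypothetical, and a real CMDB would load the relationships from its own store.

```python
from collections import deque

# Hypothetical CMDB relations: each asset maps to the assets it depends on.
DEPENDS_ON = {
    "order-service": ["mysql-01", "redis-01", "user-service"],
    "user-service": ["mysql-01"],
    "mysql-01": ["san-storage-01", "switch-01"],
    "redis-01": ["switch-01"],
}


def build_reverse_index(depends_on):
    """Invert the relation: asset -> assets that depend on it directly."""
    dependents = {}
    for asset, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(asset)
    return dependents


def impacted_assets(failed, depends_on):
    """Return every asset that directly or transitively depends on `failed`."""
    dependents = build_reverse_index(depends_on)
    seen = set()
    pending = deque([failed])
    while pending:
        current = pending.popleft()
        for parent in dependents.get(current, ()):
            if parent not in seen:  # visit each asset once, even with shared dependencies
                seen.add(parent)
                pending.append(parent)
    return seen
```

Calling impacted_assets("switch-01", DEPENDS_ON) walks upward from the switch to the database and cache that sit on it, and from there to the services that use them. A breadth-first walk is used so that assets closest to the failure are discovered first.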

OK, that’s all for the review. Let’s move on to system monitoring.

II. Technology stack of system monitoring

Part of the technology stack of system monitoring is shown in the figure below. There are far too many monitoring technologies to list them all here, so only some classic and popular open source technologies are selected.

System monitoring differs from log monitoring in that many open source packages already handle data collection, data storage, data presentation and event alarming by themselves. These packages are left out of the technology stack for now and are covered in a later section. Here we focus on how to build a unified system monitoring platform.

  • Data collection:

System monitoring data can be collected in two ways: active collection and client collection. Active collection is generally done remotely over SNMP, SSH, Telnet, IPMI, or JMX. Client collection requires a client to be deployed on each monitored host; the client collects data locally and sends it to a remote server.
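
As a rough illustration of client collection, the sketch below parses two Linux proc files into one sample that a client could send to the server. It is an assumption-laden example rather than the agent of any tool named here: the sample layout (host, timestamp, metrics) is made up for illustration, and it assumes a Linux kernel that reports MemAvailable in /proc/meminfo.

```python
import time


def parse_loadavg(text):
    """Parse the content of /proc/loadavg, e.g. '0.42 0.30 0.25 1/512 12345'."""
    one, five, fifteen = text.split()[:3]
    return {"load1": float(one), "load5": float(five), "load15": float(fifteen)}


def parse_meminfo(text):
    """Parse /proc/meminfo lines such as 'MemTotal:  16384256 kB' into integers."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])
    return values


def build_sample(host, loadavg_text, meminfo_text, timestamp=None):
    """Combine both sources into one sample (hypothetical layout)."""
    mem = parse_meminfo(meminfo_text)
    used_kb = mem["MemTotal"] - mem["MemAvailable"]
    metrics = parse_loadavg(loadavg_text)
    metrics["mem_used_percent"] = round(100.0 * used_kb / mem["MemTotal"], 2)
    return {
        "host": host,
        "timestamp": int(time.time()) if timestamp is None else timestamp,
        "metrics": metrics,
    }


def collect_local(host):
    """Read the real proc files; this part only works on Linux."""
    with open("/proc/loadavg") as load_file, open("/proc/meminfo") as mem_file:
        return build_sample(host, load_file.read(), mem_file.read())
```

Keeping the parsing separate from the file reads means the same functions can be reused for active collection, for example on text fetched over SSH.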

  • Data buffering:

As with log monitoring, when the volume of monitoring data is massive, network pressure and data processing bottlenecks have to be considered. A data buffer can be placed in front of data storage: collected data is put into a message queue, then read from the distributed queue and stored. If the data volume is small, this layer can be omitted.
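
The buffering idea can be shown with an in-process queue. This is a simplified sketch: Python's queue.Queue stands in for a distributed queue such as Kafka, and the store callback is a hypothetical placeholder for the real write to storage.

```python
import queue


def drain_batch(buffer, max_items):
    """Take up to max_items samples from the buffer without blocking."""
    batch = []
    while len(batch) < max_items:
        try:
            batch.append(buffer.get_nowait())
        except queue.Empty:
            break
    return batch


def flush_all(buffer, store, batch_size=100):
    """Move everything currently buffered into `store`, one batch per write.

    Returns the number of writes, which is what batching is meant to reduce.
    """
    writes = 0
    while True:
        batch = drain_batch(buffer, batch_size)
        if not batch:
            return writes
        store(batch)
        writes += 1
```

With 250 buffered samples and a batch size of 100, the storage layer sees three writes instead of 250, which is the pressure relief the buffer layer is there to provide.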

  • Data storage:

System monitoring data is usually stored in a time series database. A time series database mainly handles time-stamped data (data that changes in time order); such data is also called time series data. InfluxDB and OpenTSDB are among the best.

OpenTSDB is a distributed, scalable time series database that stores all time series (without downsampling) in HBase. It can obtain metrics from large-scale clusters (including network devices, operating systems, and applications) and store, index, and serve them, which makes the data easier to understand, for example through web pages and graphs. It is implemented in Java, which is good news for Java developers. However, the dependence on HBase may put some people off, because HBase has to be maintained first.

InfluxDB is a newer time series database written in Go, with no external dependencies, and it is developing fast; at the time of writing the latest version is 1.2. It has an SQL-like query syntax, is easy to install, and runs as a single node. Clustering exists, but that feature is not open source (single-node performance is basically enough to meet enterprise needs). It provides an HTTP API that is easy to call and wrap, which is very friendly for anyone who wants to build their own data processing and presentation on top of InfluxDB.
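
To give a feel for the HTTP API, here is a small sketch that renders points in the InfluxDB 1.x line protocol and posts them to the /write endpoint. It is a simplified illustration under stated assumptions: only float fields are handled, the base URL and database name are whatever your installation uses, and the helper names are our own, not part of InfluxDB.

```python
from urllib import parse, request


def _escape(text, chars):
    """Backslash-escape the characters that are special in the given position."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def to_line(measurement, tags, fields, timestamp_ns):
    """Render one point as: measurement,tag=value field=value timestamp."""
    head = _escape(measurement, ", ")
    for key in sorted(tags):
        head += "," + _escape(key, ",= ") + "=" + _escape(str(tags[key]), ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + repr(float(value))  # float fields only in this sketch
        for key, value in sorted(fields.items())
    )
    return f"{head} {body} {timestamp_ns}"


def write_points(base_url, database, lines):
    """POST the lines to the 1.x /write endpoint and return the HTTP status."""
    url = f"{base_url}/write?" + parse.urlencode({"db": database})
    req = request.Request(url, data="\n".join(lines).encode("utf-8"), method="POST")
    with request.urlopen(req, timeout=5) as resp:
        return resp.status
```

For example, to_line("cpu", {"host": "web-01"}, {"load1": 0.5}, 1500000000000000000) yields one line that can be passed to write_points together with others; sending several lines per request keeps the number of HTTP calls low.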

  • Data presentation:

When it comes to graphically presenting time series data, Grafana is a powerful tool. Grafana is open source query and presentation software for time series data that offers flexible and rich graph options. It can mix styles and has a full-featured metric dashboard and graph editor. It supports data query and chart display by connecting to many data stores, such as Graphite, Elasticsearch, CloudWatch, Prometheus and InfluxDB. Some open source monitoring software, such as Zabbix, Graphite and Prometheus, has its own graphical presentation capabilities, but it is generally recommended to use Grafana in place of their built-in pages, which shows how good Grafana is.

Of course, Grafana’s data comes from time series databases. In a real-world scenario, some of the data you want to see in a report may come from a business system, which Grafana and other monitoring software cannot easily handle. The alternative is to build charts to your own requirements: compute and analyze the time series data, combine it with business data, and present the result with an open source front-end chart framework such as ECharts. This is where InfluxDB’s advantages come into play, because its HTTP API is well suited to building your own graphical pages.
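
A possible shape for this self-built presentation is sketched below: the back end takes the JSON body returned by an InfluxDB 1.x query, merges it with business data, and hands the front end an ECharts option object. The measurement, the "mean" column and the orders_by_time mapping are hypothetical examples, and the option uses only basic ECharts fields.

```python
def series_to_pairs(response, column):
    """Extract (time, value) pairs for one column from an InfluxDB 1.x /query JSON body."""
    series = response["results"][0]["series"][0]
    time_index = series["columns"].index("time")
    value_index = series["columns"].index(column)
    return [(row[time_index], row[value_index]) for row in series["values"]]


def build_chart_option(title, load_pairs, orders_by_time):
    """Build an ECharts option with system load as a line and business orders as bars."""
    times = [t for t, _ in load_pairs]
    return {
        "title": {"text": title},
        "tooltip": {"trigger": "axis"},
        "legend": {"data": ["CPU load", "Orders"]},
        "xAxis": {"type": "category", "data": times},
        "yAxis": [
            {"type": "value", "name": "load"},
            {"type": "value", "name": "orders"},
        ],
        "series": [
            {"name": "CPU load", "type": "line", "data": [v for _, v in load_pairs]},
            {
                "name": "Orders",
                "type": "bar",
                "yAxisIndex": 1,  # business data gets its own axis
                # Time points with no business record are shown as 0.
                "data": [orders_by_time.get(t, 0) for t in times],
            },
        ],
    }
```

The resulting dictionary can be serialized with json.dumps and passed to the chart's setOption call in the browser. Putting the two series on separate y axes keeps a small load value readable next to a large order count.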

  • Warning:

Alarms were not covered in the earlier post on log monitoring. Open source monitoring software such as Zabbix, Nagios, Open-Falcon and Prometheus all have their own alarm capabilities, so if you use one of them as your monitoring platform, alerting is already there. With a self-built unified monitoring platform, the alarm center can also be implemented in-house. In our approach, events are generated during data processing according to the configured event triggering rules and sent to Kafka. The event processing engine listens for event data in Kafka, parses it, and sends alarm notifications according to the event processing policy.
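
The rule-to-event-to-notification flow can be sketched as follows. This is an illustration only: the rule format and event fields are hypothetical, a deque stands in for the Kafka topic, and the processing policy shown (cap the number of notifications per host and metric) is just one example of a policy.

```python
import operator
from collections import deque

# Hypothetical rule format: metric name, comparison operator, threshold, severity.
RULES = [
    {"metric": "cpu_load", "op": ">", "threshold": 8.0, "severity": "critical"},
    {"metric": "disk_used_percent", "op": ">=", "threshold": 90.0, "severity": "warning"},
]

OPERATORS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


def evaluate(sample, rules):
    """Return one event per rule that the sample violates."""
    events = []
    for rule in rules:
        value = sample["metrics"].get(rule["metric"])
        if value is not None and OPERATORS[rule["op"]](value, rule["threshold"]):
            events.append(
                {
                    "host": sample["host"],
                    "metric": rule["metric"],
                    "value": value,
                    "severity": rule["severity"],
                }
            )
    return events


def process_events(topic, notify, max_alerts_per_key=3):
    """Consume queued events and notify, capping repeats per (host, metric)."""
    seen = {}
    while topic:
        event = topic.popleft()
        key = (event["host"], event["metric"])
        seen[key] = seen.get(key, 0) + 1
        if seen[key] <= max_alerts_per_key:
            notify(event)
    return seen
```

Separating evaluation from processing mirrors the description above: the data processing side only produces events, while throttling and notification decisions live in the event processing engine.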

III. Open source system monitoring software

Zabbix vs. Nagios vs. Open-Falcon

The section above briefly introduced the technology stack of O&M monitoring, but there is also open source monitoring software with comprehensive functionality that covers everything from data collection to data presentation. For a small team that does not want to build its own monitoring platform, such software is a good choice.

Zabbix

Zabbix (zæbix) is an enterprise-level open source solution with a web-based interface for distributed system monitoring and network monitoring.

Zabbix can monitor various network parameters to ensure the safe operation of the server system. It also provides a flexible notification mechanism for system administrators to quickly locate and resolve problems.

Zabbix consists of two parts, Zabbix Server and optional Zabbix Agent.

Zabbix Server can use SNMP, Zabbix Agent, ping, port monitoring and other methods to provide remote server and network status monitoring, data collection and other functions. It can run on Linux, Solaris, HP-UX, AIX, FreeBSD, OpenBSD, OS X and other platforms.

As an enterprise-level open source distributed monitoring solution, Zabbix supports collecting millions of metrics from tens of thousands of servers, virtual machines, network devices, and so on. It offers the functions commonly found in commercial monitoring software, such as host performance monitoring, network device performance monitoring, database performance monitoring, monitoring of common protocols such as FTP, multiple alarm modes, and detailed report graphing. It supports automatic discovery of network devices and servers, distributed deployment with centralized display and management of distributed monitoring points, and strong extensibility: the server provides a common interface on which you can develop and refine all kinds of monitoring.

Important Zabbix components:

  • Zabbix Server: the core component responsible for receiving the report information sent by agent, and all configuration, statistics and operation data are organized by it;

  • Database Storage: dedicated to storing all configuration information and data collected by Zabbix;

  • Web Interface: Zabbix GUI interface;

  • Proxy: An optional component used to monitor distributed environments with many nodes. The proxy server collects some data and forwards it to the server to reduce the pressure on the server.

  • Agent: deployed on a monitored host, the agent collects local data, such as CPU, memory, and database data, and sends the data to the server or proxy.

Advantages:

  • All in One: Easy to deploy

  • Server has low requirements on host performance.

  • Automatically discover servers and network devices

  • Distributed monitoring and WEB centralized management

  • Both Agent and non-Agent collection modes are supported: data is collected by the Agent or over IPMI on hosts, and over SNMP on network devices and storage devices. The Agent supports common UNIX and Windows operating systems

  • Comprehensive functions: data collection, data storage, data display, event alarm.

  • Open interface, strong expansibility, easy to write plug-ins

Disadvantages:

  • Database bottleneck: MySQL is used as the underlying storage, and the pressure on the database is very high when reading and writing large volumes of data

  • The Agent must be installed on the host

  • Container monitoring support is poor; you need to extend it yourself.

Supplementary comparison:

Zabbix is an enterprise-level open source O&M platform with a web-based interface that provides distributed system monitoring and network monitoring. It is also currently the most widely used monitoring software among Internet users in China: more than 85% of the users Cloud Wisdom has met use Zabbix as their monitoring solution.

Easy to get started, powerful, open source and free: that is Cloud Wisdom’s most direct assessment of Zabbix. Zabbix is easy to manage and configure, and can generate attractive data graphs. Its automatic discovery function greatly reduces the workload of daily management. Rich data collection methods and APIs allow users to collect data flexibly. In theory, the plug-in architecture provided by Zabbix can meet any need of an enterprise.

User base: more than 85% of pan-Internet enterprises.

Advantages:

1. Enterprise-level distributed open source monitoring software with multi-platform support

2. Simple installation and deployment, flexible integration of various data collection plug-ins

3. Powerful functionality, including complex multi-condition alarms

4. Built-in graphing, so data can be drawn as charts

5. Provides a variety of API interfaces to support script invocation

6. Automatically execute commands remotely when problems occur (you need to set execution permission on the Agent)

Disadvantages:

1. It is inconvenient to modify monitoring items in batches

2. Although the community is mature, there is relatively little Chinese-language material and service support is limited

3. It is easy to get started and set up basic monitoring, but deeper requirements demand thorough familiarity with Zabbix and a lot of secondary custom development

4. There are many system-level alarm settings, and without filtering there will be a flood of alarm emails; alarms for custom items have to be set up by hand, which is tedious

5. Data aggregation is lacking: for example, the average value across a group of servers cannot be viewed without secondary development

6. Data reports require dedicated secondary development

Nagios 

Nagios’s full name is “Nagios Ain’t Gonna Insist On Sainthood”, and the original project name was NetSaint.

Nagios is a free, open source network monitoring tool that can effectively monitor the status of Windows, Linux and Unix hosts, switches, routers and other network devices, printers, and more. When a system or service becomes abnormal, it immediately notifies the website’s O&M staff by email or SMS, and it sends a recovery notification by email or SMS once the status returns to normal.

Nagios is a monitoring system that monitors system health and network information. Nagios monitors specified local or remote hosts and services, provides exception notification, and so on.

Nagios runs on Linux/Unix platforms and provides an optional browser-based WEB interface for system administrators to view network status, system issues, logs, and more.

Nagios is a free and open source IT infrastructure monitoring system that is powerful and flexible, and can effectively monitor the status of Windows, Linux, VMware and Unix hosts, switches, routers and other network devices. The core function of Nagios is monitoring and alarming. Its alarm capability is very good, but its graphical display is very poor. Nagios is also highly flexible, and many functions have to be implemented through plug-ins, so it is hard to get started for people with less technical experience. O&M veterans, of course, will pick it up quickly.

Nagios has the following features:

  • Monitor network services (SMTP, POP3, HTTP, NNTP, PING, etc.);

  • Monitor host resources (processor load, disk utilization, etc.);

  • Simple plug-in design makes it easy for users to extend the detection methods of their own services.

  • Parallel service inspection mechanism;

  • Ability to define the hierarchical structure of the network, using the “parent” host definition to express the relationship between network hosts, which can be used to detect and clarify host downtime or unreachable status;

  • When a service or host fault occurs or is resolved, an alarm is sent to the contacts by email, SMS, or a user-defined method;

  • Event handlers can be defined to run when service or host events occur, so that problems can be dealt with proactively;

  • Automatic log rotation;

  • Redundant host monitoring is supported;

  • The optional WEB UI is used to view the current network status, notification and fault history, and log files

Supplementary comparison:

Nagios is an open source enterprise monitoring system, which can realize the basic system monitoring of system CPU, disk, network and other parameters, as well as SMTP, POP3, HTTP, NNTP and other basic service types. In addition, by installing plug-ins and writing monitoring scripts, users can realize application monitoring and deploy a hierarchical monitoring architecture for a large number of monitoring hosts and multiple objects.

The biggest feature of Nagios is its powerful management center. Although its function is to monitor services and hosts, Nagios itself does not include this part of the function code. All the monitoring and alarm functions are completed by related plug-ins.
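
Because the checks live in plug-ins, a plug-in is simply a program that prints one status line and reports its result through its exit code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN). The sketch below shows a minimal disk check written that way; the thresholds and the output wording are our own illustrative choices, not those of an official Nagios plug-in.

```python
import shutil

# Exit codes that Nagios interprets as service states.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_disk(used_percent, warn=80.0, crit=90.0):
    """Map a usage percentage to (exit code, status line with performance data)."""
    if used_percent >= crit:
        status = CRITICAL
    elif used_percent >= warn:
        status = WARNING
    else:
        status = OK
    message = (
        f"DISK {LABELS[status]} - {used_percent:.1f}% used"
        f" | used={used_percent:.1f}%;{warn:g};{crit:g}"
    )
    return status, message


def main(path="/"):
    """Run the check against a real filesystem and print the status line."""
    try:
        usage = shutil.disk_usage(path)
        status, message = check_disk(100.0 * usage.used / usage.total)
    except OSError as exc:
        # Anything that prevents the check from running is reported as UNKNOWN.
        status, message = UNKNOWN, f"DISK UNKNOWN - {exc}"
    print(message)
    return status
```

When saved as an executable script, the file would end by calling sys.exit(main()) so that the return value becomes the process exit code. The text after the "|" character is optional performance data that graphing add-ons can pick up.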

User base: enterprises with complex IT environments

Advantages:

1. When an error occurs, the server, application or device can be restarted automatically, and logs are rotated automatically

2. Flexible configuration: shell scripts can be customized, and distributed monitoring is supported

3. Host monitoring in redundant mode is supported, and alarm settings are diverse

4. The configuration file can be reloaded with a command without interrupting Nagios

Disadvantages:

1. The event console is weak and the plug-ins are not easy to use

2. Weak handling of metrics such as performance and traffic

3. Only alarm events are visible, with no historical data, so it is difficult to trace the cause of a fault

4. The configuration is complex, and beginners need to invest a lot of time, energy and cost

Open-Falcon

Open-Falcon is an Internet enterprise-level monitoring system developed by the O&M department of Xiaomi. Currently, Xiaomi, Jinshanyun, Meituan, JD Finance and Ganji.com are all using Open-Falcon. Open-Falcon consists of two parts: the drawing component and the alarm component. The drawing component is responsible for data collection, storage, archiving, sampling, query, and display (Dashboard/Screen), and can work independently as a storage and display solution for time series data. The alarm component is responsible for alarm policy configuration (Portal), alarm determination (Judge), alarm processing (Alarm/Sender), and user group management (UIC), and can also work independently. The structure is as follows:

Key features are:

  • Configuration-free data collection: agent self-discovery, plugins, and active push are supported

  • Horizontal capacity scaling: the production environment handles 500,000 data collection, alarm, storage, and graphing operations per second

  • Alarm policy self-discovery: supports policy templates, template inheritance and overriding, multiple alarm modes, and callbacks

  • User-friendly alarm settings: supports a maximum number of alarms, alarm severities, alarm recovery notifications, alarm suspension, different thresholds for different time periods, maintenance periods, and alarm merging

  • Efficient historical data queries: a year of historical data for hundreds of metrics is returned within seconds

  • User-friendly dashboard: multi-dimensional data display, custom dashboards, and other functions

  • Highly available architecture: the system has no single point of failure at its core and is easy to operate, maintain, and deploy
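
As an illustration of the active push mode, the sketch below builds the kind of JSON item list that is pushed to the local Open-Falcon agent. Treat it as an assumption to verify against your own version: the field names (endpoint, metric, timestamp, step, value, counterType, tags) follow the Open-Falcon documentation as we understand it, the metric and tag values are made up, and the helper names are our own.

```python
import json
import time


def build_push_item(endpoint, metric, value, step=60, tags=None,
                    counter_type="GAUGE", timestamp=None):
    """Build one metric item; tags are flattened to 'k1=v1,k2=v2'."""
    tag_text = ",".join(f"{k}={v}" for k, v in sorted((tags or {}).items()))
    return {
        "endpoint": endpoint,          # usually the host name
        "metric": metric,
        "timestamp": int(time.time()) if timestamp is None else int(timestamp),
        "step": step,                  # reporting interval in seconds
        "value": value,
        "counterType": counter_type,   # GAUGE or COUNTER
        "tags": tag_text,
    }


def encode_payload(items):
    """Serialize a list of items as the JSON request body."""
    return json.dumps(items).encode("utf-8")
```

The encoded body would then be sent with an HTTP POST to the agent's push address on the local machine (commonly port 1988, path /v1/push, which is again an assumption to check). Pushing business metrics this way lets application code report values that the agent cannot discover on its own.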

Disadvantages:

  • Monitoring of common application servers, such as Tomcat, Apache, and Jetty, is not supported.

  • There is no dedicated O&M support, code updates are infrequent, and there is no larger community maintaining it, so you basically have to build any new capabilities you want later on yourself.

The overall comparison of Zabbix, Nagios and Open-Falcon is as follows:

IV. System monitoring practices based on K8S container cloud:

cAdvisor+Heapster+InfluxDB

All of the above are traditional system monitoring architectures. With the advent of the container era, their support for containers is unsatisfactory. Next we introduce our system monitoring scheme based on a K8S container cloud. First, a word on our DevOps platform architecture: the platform runs in a container cloud built on Kubernetes + Docker, and Kubernetes, Docker and the other services run on an IaaS platform (our production environment is Aliyun).

Our unified monitoring platform uses cAdvisor+Heapster+InfluxDB for system monitoring. The structure is as follows:

Why did we choose this scheme? Let’s take a look at these three tools.

cAdvisor is an open source tool from Google for analyzing the resource usage and performance characteristics of running Docker containers. It is deployed as a running daemon that collects, aggregates, processes and exports information about running containers. For each container this includes resource isolation parameters, historical resource usage, histograms of complete historical resource usage, and network statistics. Its Docker monitoring capability is very strong. It also provides its own web page, through which users can directly view the monitoring data of all containers on the host. cAdvisor’s functionality has been integrated into the kubelet component, which means that once Kubernetes is installed, cAdvisor is present on every compute node, and its pages can be reached on each compute node via IP and port (default: 4194).

Heapster is also provided by Google, for monitoring the K8S cluster. Heapster can be started as a container and is given the Kubernetes master address. It calls the Kubernetes API to get all Kubernetes compute nodes, then calls the kubelet HTTP API on each node through the kubelet’s external port (default: 10250). The kubelet in turn calls the cAdvisor interface to get the container data and host performance data of the current compute node, and returns them to Heapster. In this way Heapster collects the container data and host data of the whole Kubernetes cluster. Heapster supports sending the data to InfluxDB for storage. For data presentation, we call the InfluxDB API ourselves to obtain the data, combine it with our business data for calculation, and use ECharts to render charts on the front end.

Some readers may ask: this only monitors container data and host performance data on the compute nodes, so how are hosts that are not compute nodes monitored? Indeed, since Heapster only monitors the Kubernetes cluster, it gets no data from nodes without a kubelet, and we did not want to monitor those hosts separately in another way, because the data formats would differ. So we compromised: we installed the kubelet on each host outside the K8S cluster and joined it to the Kubernetes cluster, but configured it not to take part in cluster scheduling, so no containers are deployed on these machines. In this way, Heapster can collect performance data from these hosts as well.

V. Monitoring tool of the container age: Prometheus

Besides the cAdvisor+Heapster+InfluxDB solution we used to monitor container and host performance data together, a better solution is Prometheus. Prometheus is an open source combination of monitoring, alarming and time series database, developed by the social music platform SoundCloud in 2012. As it grew, more and more companies and organizations adopted Prometheus, and the community became active enough for it to become an open source project independent of any company. Prometheus was originally modeled on Borgmon, Google’s internal monitoring system, and today Kubernetes, the most widely used container management system, is commonly monitored with Prometheus.

In 2016 Prometheus joined the Cloud Native Computing Foundation (CNCF), a foundation formed by IT industry leaders with Google’s support that also steers the development of the Kubernetes container management system. Prometheus became the second project hosted by CNCF, after Kubernetes. Its features are as follows:

  • High dimensional data model

  • Efficient temporal data storage capability

  • Flexible query language

  • Graphical presentation of specific time series data

  • Easy operations

  • Provides a rich client development library

  • The alarm center has comprehensive functions

The architecture diagram for Prometheus is as follows:

  • Prometheus Server: the main Prometheus server, which collects and stores time series data

  • Client libraries: libraries for instrumenting application code

  • Push gateway: an intermediary gateway that supports short-lived jobs

  • GUI-based dashboard builder: a GUI dashboard based on Rails/SQL

  • Exporters: data collection probes; many kinds are available, covering databases, hosts, message queues, storage, application servers, GitHub, other monitoring systems, and more

  • Alertmanager: the alarm center

Prometheus is a monitoring solution championed by Google. The community is very active and growing rapidly, and features are being added and improved at a fast pace. Its monitoring coverage of containers, hosts, storage, databases and various middleware, together with its purpose-built and well-developed time series storage and alarm center, keeps expanding, and Prometheus will continue to gain popularity.

VI. Summary

There are many solutions for system monitoring, including good open source software. If requirements are modest, Zabbix may be a good solution, and if container monitoring is needed, Prometheus may be a good solution. In a word, there is no best solution, only the most suitable one.