Most systems that need to be deployed privately ship only their business features, such as user management, financial management, and customer management. Yet these systems still need to collect logs and application metrics, such as request rates, host disk usage, and memory usage. Convenient distributed log viewing, metric monitoring, and alerting are important guarantees of stable system operation.

To make privately deployed systems more robust without adding extra deployment and operations workload, this article proposes an out-of-the-box log and metric collection scheme based on ELK.

Background

In the current project, we already use Elasticsearch as a data store for the business, along with a combination of Ansible, Docker, and Jenkins for rapid deployment. After configuring the SSH connection information for the target hosts, we can deploy Elasticsearch and Kibana from Jenkins with one click.

The system follows these design principles:

Self-contained deployment: We package all deployment scripts, configuration files, and Jenkins jobs into a standardized Jenkins Docker image. Installing it on the target environment brings along all the tools needed for deployment at once.

Single Source of Truth: A YAML configuration manager embedded in Jenkins provides unified management of all variables that deployments depend on. For example, the port exposed by the XX system's backend is configured only once in Jenkins, and all scripts read that variable automatically.
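As a sketch of this idea, a single shared YAML file might hold every environment-specific value; the keys below are hypothetical placeholders, not the project's actual variable names:

```yaml
# deploy-vars.yml - single source of truth read by all deployment scripts
xx_system:
  backend_port: 8080        # configured once, consumed everywhere
  log_dir: /var/log/xx
elasticsearch:
  http_port: 9200
  heap_size: 3g
kibana:
  http_port: 5601
```

Each Ansible playbook and install script then looks up these keys instead of hard-coding values.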

Configuration as Code, Infrastructure as Code: Once all configuration is determined, the rest of the process can in principle be fully automated, so every installation step is performed by scripts.

Requirements analysis

Log collection and usage in a privately deployed environment have several characteristics:

Fast deployment. With a large number of customers, the monitoring system must be quick to deploy, and its own operational burden must be small.

Simple, robust components. Deployment environments are complex, so each component should be robust and interactions between components kept as simple as possible, avoiding complicated network topologies.

Functionality over stability. Logs and metrics are copies of data already on the host and application, so even if the monitoring system loses data, the impact is limited. Richer features, on the other hand, are genuinely helpful.

Modest performance requirements. The size and complexity of the systems in a privatized environment are bounded, so standalone deployment is acceptable and slow queries are tolerable.

Several requirements need to be met at the same time:

Distributed logs must be collected and viewable in one place.

Basic machine metrics, such as CPU and disk usage, must be collected and monitored.

Ideally, application-level metrics, such as the number of imported items, should also be collected and monitored.

Ideally, alerting on abnormal metrics should be supported.

Solution analysis

There are three candidate solutions:

Use ELK (Elasticsearch, Logstash, and Kibana) as the base monitoring stack, with Elastic's Beats as the collection agents.

Use operations monitoring tools such as Zabbix or Open-Falcon to monitor the system's basic components, supplemented by custom metrics for data monitoring and alerting.

Use TICK (Telegraf, InfluxDB, Chronograf, Kapacitor) as the base monitoring stack.

Currently, only open-source ELK and commercial Splunk handle logging well. If Splunk's license fee were affordable, options 2 and 3 could be combined with Splunk, but so far its hefty licensing fees are out of reach for most companies. Options 2 and 3 do not meet the log collection and viewing requirements on their own, so they are excluded.

Option 1 (ELK), measured against our requirements:

Fast deployment: with our Jenkins setup, one-click deployment is possible.

Simple components: we deploy only Elasticsearch and Kibana, and Elasticsearch itself is self-contained, depending on no external components. Instead of a cluster, we deploy Elasticsearch on a single machine, which keeps it simple and stable.

Functionality over stability: our business Elasticsearch is still on version 5.5.3, but the Elasticsearch used for log collection and analysis is upgraded to version 7.6.0; to upgrade, we simply delete the data and redeploy.

Modest performance requirements: Elasticsearch and Kibana run in single-node mode on the same machine.

Log-specific Elasticsearch, Kibana, and Beats

To avoid resource or configuration conflicts between the business ES and the logging ES, the logging ES is deployed separately and allocated about 3 GB of memory.

Log collection:

We use Ansible to deploy Filebeat on all relevant hosts for log collection. To keep the system simple, we skip Logstash preprocessing and rely on a straightforward Filebeat configuration, which is included in our Jenkins one-click deployment suite.
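A minimal sketch of such a Filebeat configuration, with placeholder paths and host addresses (the actual values would come from the deployment variables):

```yaml
# filebeat.yml - collect application logs and ship straight to ES
filebeat.inputs:
  - type: log
    paths:
      - /var/log/myapp/*.log     # hypothetical application log path
    fields:
      system: myapp              # tag events with the owning system

# Ship directly to the logging Elasticsearch, no Logstash in between
output.elasticsearch:
  hosts: ["log-es:9200"]
```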

Log view:

Since Filebeat ships logs directly into ES, we can view them directly in Kibana.

Collection of system indicators:

We use Ansible to deploy Metricbeat on all relevant hosts to collect metrics. Through its configuration file, we collect Docker resource usage; system CPU, memory, disk, and network usage; and open a metrics ingestion port in StatsD format.
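A sketch of the corresponding Metricbeat configuration, assuming default ports and a reachable `log-es` host; the exact module settings depend on the Metricbeat version in use:

```yaml
# metricbeat.yml - host, Docker, and StatsD metric collection
metricbeat.modules:
  - module: system
    metricsets: [cpu, memory, filesystem, network]
    period: 30s
  - module: docker
    metricsets: [container, cpu, memory]
    hosts: ["unix:///var/run/docker.sock"]
    period: 30s
  - module: statsd          # listen for application-pushed metrics
    host: "0.0.0.0"
    port: 8125              # conventional StatsD port
    period: 10s

output.elasticsearch:
  hosts: ["log-es:9200"]
```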

Service status probing:

We use Ansible to deploy Heartbeat on the gateway machine for active availability probing, monitoring the status of each system's databases, HTTP services, and so on, and writing the results to the default ES index.
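As a sketch, a Heartbeat configuration probing one hypothetical HTTP service and one database port might look like:

```yaml
# heartbeat.yml - active availability probes from the gateway machine
heartbeat.monitors:
  - type: http
    schedule: '@every 30s'
    urls: ["http://backend:8080/health"]   # hypothetical service endpoint
  - type: tcp
    schedule: '@every 30s'
    hosts: ["db-host:5432"]                # e.g. a database port

output.elasticsearch:
  hosts: ["log-es:9200"]
```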

ES-based alerting

Elasticsearch's native alerting is a paid feature. For a more general alerting system, there is an open-source project called ElastAlert, developed by Yelp in Python on top of Elasticsearch. It supports many notification channels, but most are overseas tools such as Slack, HipChat, and PagerDuty, so we currently use only the basic email notification.

ElastAlert supports several rule types, for example:

A condition matches at least N times within a time window (frequency type).

The rate of a metric rises or falls sharply (spike type).

No events are seen for N minutes (flatline type).

The core of each rule's configuration is an Elasticsearch query; the rule triggers based on the number of hits that query returns.

Currently, we use only the basic frequency rule type. Since these alerts target a handful of privately deployed systems, we pre-write the rule files for them. The deployment script copies them all into ElastAlert unless there are special requirements, so no manual configuration is needed.
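A sketch of one such frequency rule; the rule name, index pattern, thresholds, and address below are hypothetical:

```yaml
# Hypothetical rule: alert if 10+ ERROR log lines appear within 5 minutes
name: myapp-error-burst
type: frequency
index: filebeat-*
num_events: 10
timeframe:
  minutes: 5
filter:
  - query_string:
      query: "message: ERROR"
alert:
  - email
email:
  - "ops@example.com"
```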

Monitoring dashboard

Using Kibana's visualization capabilities, we build a monitoring dashboard for each business system, giving an at-a-glance view of the status of all system components and the health of the hosts.

Kibana configuration automation

All persistent configuration in Kibana is a Saved Object, including saved searches, dashboards, visualizations, and index patterns.

After configuring monitoring in Kibana in our internal test environment, we periodically export the configuration through the CI system and store it in a Git repository. The next time the base components are updated, the update script automatically imports the corresponding Kibana configuration into the private deployment environment, with no manual configuration at deploy time. This realizes Infrastructure as Code.
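As a sketch of how such an export could be scripted, the following builds a request against Kibana's Saved Objects export API (available in Kibana 7.x); the Kibana address is a placeholder, and the required `kbn-xsrf` header is included:

```python
import json
import urllib.request

def build_export_request(kibana_url: str) -> urllib.request.Request:
    """Build a POST to Kibana's Saved Objects export endpoint.

    Kibana returns the matching saved objects as an .ndjson stream,
    which the CI job can commit to Git.
    """
    body = json.dumps(
        {"type": ["dashboard", "visualization", "search", "index-pattern"]}
    )
    return urllib.request.Request(
        url=f"{kibana_url}/api/saved_objects/_export",
        data=body.encode("utf-8"),
        headers={"kbn-xsrf": "true", "Content-Type": "application/json"},
        method="POST",
    )

# Placeholder address; the real one comes from the deployment variables.
req = build_export_request("http://localhost:5601")
print(req.full_url)
```

The CI job would pass this request to `urllib.request.urlopen` and write the response body to a file in the repository.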

Expand monitoring scope

This deployment suite also has a standard process for extending its scope.

Monitor more application components

When we need to monitor a new application component:

For service status, simply add the component's access address to Heartbeat's configuration, and its status appears in the monitoring dashboard.

For application logs, add the log file path to the Filebeat configuration, and the logs become searchable in Kibana.

Monitor application metrics

When you need to monitor metrics for your application, you can publish them to Metricbeat via its StatsD interface, from which they are collected into Elasticsearch. The StatsD protocol is simple, so every major programming language has an SDK that can be used directly without heavy dependencies:

Github.com/statsd/stat…

However, the StatsD metrics collected by Metricbeat do not support tags, so only simple metric collection is possible; the same metric cannot be aggregated along different dimensions.
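Because the StatsD line protocol is just `name:value|type` over UDP, a minimal client needs no dependencies at all. The metric name and port below are illustrative:

```python
import socket

def format_metric(name: str, value: int, mtype: str) -> str:
    """Render one StatsD line: '<name>:<value>|<type>', where the type
    is 'c' (counter), 'g' (gauge), or 'ms' (timing)."""
    return f"{name}:{value}|{mtype}"

def send_metric(name: str, value: int, mtype: str = "c",
                host: str = "localhost", port: int = 8125) -> None:
    """Fire-and-forget UDP send to the Metricbeat StatsD listener."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_metric(name, value, mtype).encode("ascii"),
                    (host, port))

# e.g. count one imported item
send_metric("items.imported", 1)
```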

Add Service Tracing

Elasticsearch also ships with APM, which we have not adopted yet, but it is a great tool for performance monitoring and analysis.

Conclusion

In a private deployment environment, log collection and monitoring do not require the performance and scalability that Internet-scale products demand; what matters is powerful functionality out of the box. Elasticsearch and Kibana 7.6.0 do this very well. With a standardized deployment process and configuration files prepared ahead of time, a monitoring system can be set up in half an hour.