Introduction

Yunxin's public cloud products have been running stably for several years. As enterprises pay more attention to information security and data isolation, privatization of Yunxin products is bound to become an important branch of Yunxin's development. Building on increasingly mature virtualization and container technology, the Yunxin business system has been fully privatized and has been running stably at many enterprises, becoming an important part of the Yunxin product line.

In the public cloud service, Yunxin's log collection and its monitoring and alerting system depend on a platform that NetEase has developed in-house over many years. That platform, however, is too heavyweight for Yunxin's privatized products. In addition, unlike the public cloud, a privatized product is deployed inside the enterprise and is often isolated at the network level, so technical support staff can rarely access the privatized environment directly, and the enterprise's own operations and maintenance (O&M) staff have a relatively limited understanding of the Yunxin system. Privatized products therefore need a more lightweight, highly available logging and monitoring platform that shows real-time service status and operational data and gives early warnings to the enterprise's internal O&M staff.

This article describes in detail the architecture of the Yunxin privatized log monitoring service. If you need to build a log collection and monitoring platform for your own products, it can serve as a reference.

Main technologies in Yunxin privatization

Before introducing the log monitoring platform, let us first take a brief look at the overall Yunxin privatization scheme and the technology stack it uses. Yunxin's public cloud products currently involve numerous modules with different technical architectures and programming languages, running in different environments and networks. In Yunxin privatization we therefore use Docker to package each module and isolate its runtime resources. For container deployment and management we chose Ansible, a more lightweight O&M deployment tool, rather than Kubernetes, which is common in large container cloud platforms, for two reasons:

First, the vast majority of enterprises that adopt privatization are below the 100,000 scale (10w); enterprises beyond that level are already very large. The resources an enterprise can provide during privatization are therefore often limited, perhaps only a few servers or even a single one, and deploying a Kubernetes cluster would add extra resource cost for the enterprise;

Second, once a privatized deployment has been delivered, the enterprise's own O&M staff rarely need to adjust the service cluster dynamically in real time. What the enterprise needs is for Yunxin to run efficiently at the agreed scale, and the resource scheduling and node control that Kubernetes provides are too complex and heavy for privatization. We therefore built a visual deployment management platform on Ansible, which coordinates synchronized control across multiple machines, extended with Python, to deploy and manage Yunxin privatization quickly and uniformly.

Main requirements for the logging and monitoring platform

1. High performance: handle the collection of operational data and logs in the Yunxin privatization scheme;

2. High availability: the log platform must be highly available to a certain degree. If a single node or the network fails, the service system can continue to collect and monitor logs;

3. Real-time monitoring and alerting: visual monitoring of system and module information, with timely alerts when the network, resources, or service status of the system become abnormal;

4. Log system security: ensure the security of platform logs and operational data.

Basic components of ELK

Here is a basic ELK architecture diagram to illustrate the main components and architecture of the ELK Stack.

Elasticsearch is an open source distributed search engine. Its features include distributed operation, zero configuration, automatic discovery, automatic index sharding, index replicas, a RESTful interface, multiple data sources, and automatic search load balancing. As the core component of ELK, it is responsible for log storage, retrieval, and analysis.

Logstash is a data collection engine with real-time pipeline capability; its filter stage is well suited to preprocessing data.

Kibana is a web platform for Elasticsearch analytics and visualization. It can query Elasticsearch indexes, interact with the data, and generate tables and charts along various dimensions.

Metricbeat & Filebeat

Beats is responsible for the data collection part of the ELK Stack: it collects system information and logs and sends the data to Logstash or Elasticsearch. Filebeat collects log files and can perform preliminary searching and filtering of log content. Metricbeat collects metrics from systems and services.

Privatized logging platform architecture

In privatization, all modules are deployed in containers. Because the platform runs inside the enterprise, technical support and maintenance are difficult, so every module must be highly available. Elasticsearch is deployed as a cluster; see the ES cluster deployment scheme for details. Keepalived is deployed alongside Nginx to achieve high availability through a virtual IP address (VIP): when the primary host goes down, the standby Keepalived instance detects the failure and automatically takes over the VIP so that service continues.

For large enterprises, or for usage scenarios with pronounced business peaks, the high-performance message middleware Kafka can be added in front of Logstash as a message-queue buffer during deployment.

Managing ELK processes with Supervisor

In addition to the component-level high availability provided by the deployment scheme, privatization uses Supervisor for process management; it is packaged in the image together with the ELK components. When a managed process exits unexpectedly, Supervisor automatically restarts it, so processes recover without manual intervention.

Supervisor configuration file example
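As a minimal sketch of what such a configuration could look like, the Python snippet below renders one `[program:x]` section per managed process using standard Supervisor options (`command`, `autostart`, `autorestart`, `startretries`, `stdout_logfile`, `redirect_stderr`). The program names, binary paths, and log paths are assumptions made for illustration; they are not the configuration shipped in the Yunxin images.

```python
import configparser
import io

# Hypothetical program names and paths, chosen for illustration only.
PROGRAMS = {
    "elasticsearch": "/usr/share/elasticsearch/bin/elasticsearch",
    "logstash": "/usr/share/logstash/bin/logstash -f /etc/logstash/conf.d",
    "kibana": "/usr/share/kibana/bin/kibana",
}


def render_supervisor_conf(programs):
    """Render one [program:x] section per managed process."""
    conf = configparser.ConfigParser(interpolation=None)
    for name, command in programs.items():
        conf[f"program:{name}"] = {
            "command": command,
            "autostart": "true",      # start together with supervisord
            "autorestart": "true",    # restart the process whenever it exits
            "startretries": "3",      # give up after three failed start attempts
            "stdout_logfile": f"/var/log/supervisor/{name}.log",
            "redirect_stderr": "true",
        }
    buffer = io.StringIO()
    conf.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(render_supervisor_conf(PROGRAMS))
```

Setting `autorestart` to `true` is what gives the automatic recovery described above: Supervisor restarts the process whenever it exits, regardless of the exit code.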

Using an Nginx IP whitelist for access control

The three ELK components themselves offer few security-related settings. Although the official X-Pack component is open source, its free version restricts many features, so we introduce Nginx to control access with an IP whitelist and strengthen the security of the log platform. The same effect could be achieved through host firewall configuration, but in privatization enterprises manage host firewalls very differently, and some barely manage them at all. Using Nginx inside the image as a reverse proxy that enforces the whitelist at the service layer is therefore a good choice: the ELK components and Nginx provide service as a single unit. The ELK components bind to local IP addresses and Nginx serves as the only externally reachable entry point, which effectively gives the components whitelist management.

Supervisor manages the processes, with the Nginx whitelist proxy added alongside the ELK components.
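The whitelist arrangement described above can be sketched as follows. The snippet renders an Nginx `server` block that uses the standard `allow`, `deny`, and `proxy_pass` directives in front of a component bound to a local address, and it includes a small helper that mirrors the same allow-or-deny decision in Python. The listen port, the upstream address, and the whitelist entries are assumptions for illustration only.

```python
import ipaddress


def render_whitelist_server(listen_port, upstream, allowed):
    """Build an Nginx server block that proxies to a locally bound component."""
    lines = ["server {", f"    listen {listen_port};", "    location / {"]
    for entry in allowed:
        # Validate each entry as an IP address or CIDR range before emitting it.
        network = ipaddress.ip_network(entry, strict=False)
        lines.append(f"        allow {network};")
    # Nginx evaluates allow/deny rules in order, so the catch-all comes last.
    lines.append("        deny all;")
    lines.append(f"        proxy_pass http://{upstream};")
    lines += ["    }", "}"]
    return "\n".join(lines)


def is_allowed(client_ip, allowed):
    """Mirror the whitelist decision: True if the client matches any entry."""
    address = ipaddress.ip_address(client_ip)
    return any(address in ipaddress.ip_network(e, strict=False) for e in allowed)


if __name__ == "__main__":
    whitelist = ["10.0.0.0/24", "192.168.10.5"]  # hypothetical O&M addresses
    # Elasticsearch listens on 127.0.0.1:9200; 19200 is a hypothetical public port.
    print(render_whitelist_server(19200, "127.0.0.1:9200", whitelist))
```

Because the component itself listens only on the local address, the Nginx port is the single place where external requests arrive, so the whitelist cannot be bypassed by connecting to the component directly.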

Beyond whitelist management, a few other measures harden Elasticsearch security:

  • Change the default Elasticsearch cluster name.
  • Clean up expired indexes automatically to keep disk space available.
  • Bind ES to the intranet IP address so that data does not leave through the external network interface when multiple network adapters are configured.
  • Disable the ES interfaces that delete indexes in batches through wildcards or regular expressions. The configuration items are as follows:
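As a hedged illustration of the measures above (not the authors' actual configuration), the sketch below renders three standard `elasticsearch.yml` settings and selects date-suffixed indexes that have passed a retention period. `cluster.name`, `network.host`, and `action.destructive_requires_name` are real Elasticsearch settings; the cluster name, the intranet address, the index naming pattern, and the 14-day retention are assumptions.

```python
import re
from datetime import date, datetime, timedelta

HARDENING_SETTINGS = {
    "cluster.name": "yunxin-private-logs",       # hypothetical non-default name
    "network.host": "10.0.0.12",                 # hypothetical intranet address
    # Require explicit index names for destructive operations such as delete.
    "action.destructive_requires_name": "true",
}

_DATE_SUFFIX = re.compile(r"-(\d{4}\.\d{2}\.\d{2})$")


def render_settings(settings):
    """Render the settings as flat elasticsearch.yml lines."""
    return "\n".join(f"{key}: {value}" for key, value in settings.items())


def expired_indices(names, today, keep_days):
    """Return index names whose date suffix is older than the retention period."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in names:
        match = _DATE_SUFFIX.search(name)
        if not match:
            continue  # no date suffix, so never cleaned automatically
        try:
            created = datetime.strptime(match.group(1), "%Y.%m.%d").date()
        except ValueError:
            continue  # suffix looks like a date but is not a valid one
        if created < cutoff:
            expired.append(name)
    return expired


if __name__ == "__main__":
    print(render_settings(HARDENING_SETTINGS))
    names = ["filebeat-2024.01.01", "filebeat-2024.01.20", "kibana-config"]
    print(expired_indices(names, date(2024, 1, 31), keep_days=14))
```

A scheduled job could pass the selected names, one by one, to the Elasticsearch delete index API; deleting by explicit name keeps working when wildcard deletion is disabled.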

Developing customized system information collection tools

Extending Metricbeat is a good way to gather information about the modules in privatization. As the ELK Stack's tool for collecting system and application metrics, Metricbeat already integrates modules for system information, common middleware, and process-level resource usage, along with polished Kibana charts. Metricbeat is written in Go and has clear interfaces, which makes it well suited to adding custom business-metric collection modules through secondary development.

During secondary development we made good use of the service monitoring interfaces of the public cloud services and of the existing information collection scripts and system commands. The module itself only wraps the invocation interface for several types of command; a configuration file specifies which script or command the module calls and how the resulting data is parsed, which keeps the amount of secondary-development code to a minimum.
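The configuration-driven approach described above can be sketched as follows. The real module is a Metricbeat extension written in Go; this Python version only illustrates the idea of keeping the code generic and moving the choice of command and parser into configuration. The collector entries, metric names, and parser names are all hypothetical.

```python
import json
import subprocess
import sys

# Hypothetical collector configuration: each entry names a metric, the command
# to run, and which parser turns its standard output into a value.
COLLECTORS = [
    {
        "name": "queue_depth",
        "command": [sys.executable, "-c", "print(42)"],
        "parser": "int",
    },
    {
        "name": "service_status",
        "command": [sys.executable, "-c",
                    "import json; print(json.dumps({'up': True}))"],
        "parser": "json",
    },
]

PARSERS = {
    "int": lambda text: int(text.strip()),
    "json": json.loads,
    "text": str.strip,
}


def collect(collectors, timeout=10):
    """Run every configured command and parse its output into one event."""
    event = {}
    for item in collectors:
        result = subprocess.run(item["command"], capture_output=True,
                                text=True, timeout=timeout, check=False)
        if result.returncode != 0:
            # Report the failure instead of aborting the whole collection round.
            event[item["name"]] = {"error": result.stderr.strip()}
            continue
        event[item["name"]] = PARSERS[item["parser"]](result.stdout)
    return event


if __name__ == "__main__":
    print(collect(COLLECTORS))
```

Adding a new metric then means adding one configuration entry that points at an existing script or command, with no change to the collection code.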

During containerization, the Metricbeat and Filebeat collection modules are integrated into the base image, and the other images are built on top of it. When containers are deployed and created, the metrics each one collects are controlled by the configuration generated for each service.

Using ElastAlert for platform alerting

ElastAlert is an open source Elasticsearch alerting tool developed in Python. Pay attention to the Elasticsearch version when deploying ElastAlert, because the two must be compatible. We recommend using ElastAlert's extension API module together with the Kibana visual plugin so that alert rules can be edited directly in Kibana.

Alerting in privatization covers four areas. The first is server alerts, mainly host load, CPU, network interface traffic, disk, memory, and similar metrics. The second is alerts for the middleware and database layer of basic services, such as MySQL master/slave synchronization inconsistency and RabbitMQ queue overflow. The third is the process and JVM layer, such as unexpected process termination or restart and excessively frequent JVM GC. The fourth concerns specific services, where the data collected by the custom modules is abnormal.
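To make the third kind of alert concrete, the sketch below illustrates the idea behind a frequency-style rule, which fires when at least a given number of matching events occur within a time window (ElastAlert offers a rule type of this kind, configured with `num_events` and `timeframe`). This is an illustration of the rule semantics only, not ElastAlert's implementation; the thresholds and the restart scenario are assumptions.

```python
from datetime import datetime, timedelta


def frequency_alert(timestamps, num_events, timeframe):
    """Return True if num_events or more events fall inside any window of length timeframe."""
    ordered = sorted(timestamps)
    start = 0
    for end, current in enumerate(ordered):
        # Shrink the window from the left until it spans at most `timeframe`.
        while current - ordered[start] > timeframe:
            start += 1
        if end - start + 1 >= num_events:
            return True
    return False


if __name__ == "__main__":
    base = datetime(2024, 1, 1, 12, 0)
    # Hypothetical process restart events collected from the logs.
    restarts = [base, base + timedelta(minutes=2),
                base + timedelta(minutes=4), base + timedelta(minutes=30)]
    # Alert when a process restarts three or more times within five minutes.
    print(frequency_alert(restarts, num_events=3, timeframe=timedelta(minutes=5)))
```

A single restart is usually harmless because Supervisor recovers the process, whereas repeated restarts in a short window suggest a fault that needs attention from O&M staff.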

Conclusion

Because of the operation and maintenance constraints, an enterprise private cloud places higher demands on the availability of the overall service, so system monitoring and alerting matter even more in privatization. The log collection and alerting platform built on the mature ELK Stack collects and stores log and operational data in Yunxin privatization, monitors the state of Yunxin services and systems in real time, sends alerts on anomalies to the O&M staff in charge, and provides an important safeguard for the operation of the Yunxin system.

As an efficient search engine, Elasticsearch can also do more than that. Using ES's ability to search and aggregate massive amounts of data, we analyzed and organized the platform's daily operational data and, with Kibana's excellent chart display, quickly built a set of daily operations reports. These provide a data basis for decision makers on the direction of enterprise product research and development.