Nightingale is an enterprise-level monitoring solution developed and open-sourced by the Didi basic platform team and Didi Cloud. It aims to meet enterprise monitoring needs in the cloud-native era, and it is built to enterprise standards in product completeness, system availability, and user experience. It serves users of different sizes, from a few services to hundreds of thousands. It works in both cloud-native and bare-metal environments, supports application monitoring and system monitoring, and offers a flexible plug-in mechanism with a rich set of plug-ins, which makes it highly flexible and extensible.

Based on Open-Falcon and combined with Didi’s internal best practices, Nightingale makes numerous improvements in performance, maintainability, and ease of use. As the unified monitoring solution for the group, it supports billions of monitoring metrics within Didi, covers monitoring needs at the system, container, and application levels, and has thousands of weekly active users. After five years of refinement it is now open source: it draws on open source and gives back to open source.

Nightingale uses a tree of nodes for navigation, which we call the object tree. The object tree is a group management mechanism for monitored objects: it makes it easier to search for and view monitored objects and to set monitoring policies for them. A typical tree, from top to bottom, reflects the organizational structure, then the product, service, and module hierarchy, then the machine room and the machines mounted under it. The navigation tree can be customized flexibly to suit user requirements.

After a monitoring policy is applied to a node, it takes effect on all machines attached to all of that node's child nodes. When any of these machines crosses the threshold, an alarm is generated.
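
The inheritance rule above can be sketched in a few lines of Python. This is only an illustration, not Nightingale's implementation: the `Node` class, the `evaluate` function, the host names, and the `cpu.util` metric are all hypothetical, and the sketch assumes that a policy also covers machines mounted directly on the node it is bound to.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a hypothetical object tree."""
    name: str
    machines: list = field(default_factory=list)   # machines mounted on this node
    children: list = field(default_factory=list)   # child nodes
    policies: list = field(default_factory=list)   # (metric, threshold) pairs bound here

    def covered_machines(self):
        """Machines in this node's subtree (assumption: the node itself counts)."""
        found = list(self.machines)
        for child in self.children:
            found.extend(child.covered_machines())
        return found


def evaluate(root, latest):
    """Return (machine, metric, value) for every covered machine above a threshold.

    `latest` maps (machine, metric) to the most recent value.
    """
    alarms = []
    stack = [root]
    while stack:
        node = stack.pop()
        for metric, threshold in node.policies:
            for machine in node.covered_machines():
                value = latest.get((machine, metric))
                if value is not None and value > threshold:
                    alarms.append((machine, metric, value))
        stack.extend(node.children)
    return alarms


# Example tree: organization -> product -> module, with machines on the leaves.
api = Node("api", machines=["host-a", "host-b"])
db = Node("db", machines=["host-c"])
product = Node("ride-hailing", children=[api, db], policies=[("cpu.util", 90.0)])
root = Node("corp", children=[product])

latest = {("host-a", "cpu.util"): 42.0, ("host-c", "cpu.util"): 97.5}
alarms = evaluate(root, latest)
```

The policy is bound once, on the product node, yet `host-c` under the `db` module is the machine that raises the alarm. Moving a machine to another node changes which policies apply to it without editing any policy.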

Dashboard customization is significantly easier to use: charts support thresholds and categories, creating and sorting charts is what-you-see-is-what-you-get, and customizing an inspection dashboard is no longer difficult.

Nightingale is developed on the basis of Open-Falcon. As one of the most widely used monitoring solutions in China, Open-Falcon served as an important reference for Nightingale’s design and development.

Differences from Open-Falcon

  • Rebuilt alarm engine: In Open-Falcon, policy evaluation is triggered when monitoring data is pushed. This “push” mode makes policy evaluation timely, but it makes more advanced alarm policies hard to support and extend. Alarms that combine multiple conditions, for example, are difficult to support. Nightingale uses a combined push and pull model: the push mode keeps most policy evaluation efficient, and the pull mode supports “and” conditions and nodata alerts.
  • Navigation object tree introduced: The flat HostGroup used by Open-Falcon is replaced by Nightingale’s navigation object tree. In essence, the object tree is a group management mechanism for monitored objects, which makes it easier to search for and view monitored objects and to set monitoring policies for them. Nightingale also removes the concept of alarm templates: alarm policies are bound directly to tree nodes, which simplifies the design and greatly improves flexibility and ease of use.
  • Index module update: Open-Falcon stores metric index data in MySQL, which is a bottleneck for scalability and flexibility. To meet its monitoring requirements, Nightingale introduces a new in-memory index module, Index, which offers more varied query methods and higher query efficiency, and avoids the MySQL maintenance and tuning work that arises when index data reaches the 100-million level.
  • Time series database optimization: Building on Graph, the Open-Falcon storage module, Nightingale introduces Facebook’s Gorilla compression scheme to keep the most recent hours of data in memory, which greatly improves query efficiency, while long-term data is still stored on disk in the RRDTool data format. The performance and stability of the time series database are also further improved.
  • Improved high availability of the alarm engine: The Judge module of the alarm engine uses a heartbeat mechanism to remove failed instances automatically, so the breakdown of a single Judge no longer causes some policies to stop working. The Index module takes a similar approach to ensure availability.
  • Built-in log monitoring: The Nightingale client natively supports log matching and metric extraction. Log matching rules can be configured on web console pages, and the client can also read configuration files from specific directories on target machines, which makes business metric monitoring easier.
  • Simplified operation and maintenance: Portal (the Falcon-Plus API), UIC, Dashboard, HBS, and Alarm are integrated into a single module, Monapi. This makes the system easier to deploy and turns the original inter-module calls into in-process method calls, which improves performance.
  • Centralized configuration files: The configuration files are reorganized for ease of use. For example, general database configuration is extracted into mysql.yml, and ports, instance addresses, and other related settings are extracted into address.yml. Many settings have default values in the code, which keeps the configuration files clearer and easier to maintain.
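
To illustrate why a pull mode helps with the “and” conditions and nodata alerts mentioned in the first item, here is a minimal Python sketch. It is not Nightingale's Judge code: the function names, the in-memory `store` layout, the 60-second window, and the metric names are assumptions made for the example. The idea is that a pushed point carries a single series, whereas an evaluator that pulls recent points can look at several series together and can also notice that nothing has arrived.

```python
def last_value(points, now, window):
    """Latest value with a timestamp in [now - window, now], or None."""
    recent = [(ts, v) for ts, v in points if now - window <= ts <= now]
    return max(recent)[1] if recent else None


def judge_and(store, endpoint, conditions, now, window=60):
    """Pull-mode check: fire only if every (metric, threshold) condition holds."""
    for metric, threshold in conditions:
        value = last_value(store.get((endpoint, metric), []), now, window)
        if value is None or value <= threshold:
            return False
    return True


def judge_nodata(store, endpoint, metric, now, window=60):
    """Pull-mode check: fire if no point arrived within the window."""
    return last_value(store.get((endpoint, metric), []), now, window) is None


# Hypothetical recent data: (endpoint, metric) -> list of (timestamp, value).
store = {
    ("host-a", "cpu.util"): [(940, 95.0), (1000, 96.0)],
    ("host-a", "mem.used.percent"): [(1000, 88.0)],
    ("host-b", "cpu.util"): [(700, 10.0)],
}

both_high = judge_and(
    store, "host-a", [("cpu.util", 90.0), ("mem.used.percent", 80.0)], now=1000
)
silent = judge_nodata(store, "host-b", "cpu.util", now=1000)
```

Here `both_high` is true because both series on `host-a` exceed their thresholds in the window, and `silent` is true because `host-b` last reported 300 seconds ago. Neither decision can be made by looking only at the single point that was just pushed.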

Similarities to Open-Falcon

  • The data model is unchanged: data is still organized by metric, endpoint, and tags. Agents can largely be reused. The agent in Nightingale is called Collector. It combines the logic of the original Open-Falcon Agent and Falcon-log-Agent, and existing monitoring plug-ins can be reused.
  • The data flow and overall processing logic are similar. Nightingale still uses a flexible push model, split into two paths: data storage and alarm judgment.
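
As a rough illustration of the shared data model and the two paths, the Python sketch below models one pushed point. Only metric, endpoint, and tags come from the text above. The `timestamp` and `value` fields, the `series_key` format, and the `push` helper are assumptions made for the example and do not describe the real wire format.

```python
from dataclasses import dataclass


@dataclass
class Point:
    metric: str
    endpoint: str
    tags: dict
    timestamp: int   # assumed field: seconds since the epoch
    value: float     # assumed field: the sampled value

    def series_key(self):
        """Identify the series by endpoint, metric, and sorted tags."""
        tag_part = ",".join(f"{k}={v}" for k, v in sorted(self.tags.items()))
        return f"{self.endpoint}/{self.metric}/{tag_part}"


def push(point, storage, judge_inbox):
    """Hand one pushed point to both paths: data storage and alarm judgment."""
    storage.setdefault(point.series_key(), []).append((point.timestamp, point.value))
    judge_inbox.append(point)


storage, judge_inbox = {}, []
p = Point("disk.used.percent", "host-a", {"mount": "/data", "fstype": "ext4"}, 1000, 71.5)
push(p, storage, judge_inbox)
```

Sorting the tags before building the key means the same series gets the same key regardless of the order in which a reporter lists its tags.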

Nightingale architecture

  • Collector is the agent. It collects common machine metrics, supports log monitoring and a plug-in mechanism, and lets services report data directly through an interface.
  • Transfer provides an RPC interface to receive the data reported by Collector, and then forwards the data to multiple TSDB and Judge instances using consistent hashing.
  • TSDB corresponds to the Graph component of Open-Falcon. It stores historical data and can be configured in dual-write mode to improve the disaster recovery capability of the system. TSDB also forwards a copy of the monitoring data to Index for index construction.
  • Index is an in-memory index module that replaces the original MySQL scheme. It builds the index in memory to make later data retrieval convenient, and greatly improves both retrieval flexibility and retrieval performance.
  • Judge is the alarm engine. It synchronizes monitoring policies from Monapi (Portal) and evaluates them against the data it receives. When a threshold is met, it generates an alarm event and pushes it to the Redis queue.
  • Monapi (Alarm) reads the events generated by Judge from the Redis queue, performs secondary processing, adds some meta information, generates alarm messages, and pushes them back to the Redis queue.
  • The sender components, such as mail-sender and sms-sender, read alarm messages from Redis and send them. Each sending channel is abstracted as a separate sender to make later customization easier.
  • Monapi integrates the functions of several original modules and provides interfaces for the front-end JavaScript to call, with the API prefix /api/portal. Data queries go through Transfer, with the API prefix /api/transfer, and the original Query component of Open-Falcon is removed. The Index query API prefix is /api/index. An Nginx instance placed in front can therefore use different location blocks to forward requests to the different back ends.
  • The database is still MySQL. It mainly stores user information, team information, tree node information, alarm policies, monitors, masking policies, collection policies, and heartbeat information of some components.
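
The consistent hashing that Transfer uses to pick a TSDB or Judge instance can be sketched as follows. This is a generic textbook ring in Python, not Nightingale's code: the `HashRing` class, the SHA-256 hash, the 100 virtual points per instance, the instance names, and the key format are all assumptions made for the example.

```python
import bisect
import hashlib


class HashRing:
    """A generic consistent-hash ring with virtual points per instance."""

    def __init__(self, nodes, replicas=100):
        points = []
        for node in nodes:
            for i in range(replicas):
                points.append((self._hash(f"{node}#{i}"), node))
        points.sort()
        self._hashes = [h for h, _ in points]
        self._nodes = [n for _, n in points]

    @staticmethod
    def _hash(key):
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key):
        """Pick the first virtual point clockwise from the key's hash."""
        idx = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._hashes)
        return self._nodes[idx]


ring = HashRing(["tsdb-1", "tsdb-2", "tsdb-3"])
keys = [f"host-{i}/cpu.util" for i in range(200)]
before = {k: ring.node_for(k) for k in keys}

# Simulate losing one instance: only keys that lived on it should move.
smaller = HashRing(["tsdb-1", "tsdb-2"])
moved = [k for k in keys if before[k] != "tsdb-3" and smaller.node_for(k) != before[k]]
```

`moved` stays empty: when `tsdb-3` leaves the ring, series that were mapped to the other two instances keep their placement, so only the data of the failed instance has to be rerouted.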

Work in progress

  1. Provide a metric aggregation component. The current architecture handles machine-level and module-level monitoring, but cluster-level metrics require aggregating the metrics of all modules and machines in the whole cluster, with operations such as summing and averaging. We are actively working on open-sourcing the relevant aggregation components.
  2. Seamless integration with Kubernetes (K8s) is also in progress.
  3. We will add and improve monitoring plug-ins. Many plug-ins from the Open-Falcon community can already be used directly. We will do our best to supply the plug-ins the community lacks, and to reorganize and maintain those that already exist, in order to round out Nightingale’s ecosystem.
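
The kind of cluster-level aggregation described in item 1 can be pictured with the short Python sketch below. Since the aggregation component is not yet released, nothing here reflects its actual design: the `aggregate` function, the `cluster_of` mapping, and the `qps` metric are hypothetical and only illustrate the summing and averaging operations.

```python
from collections import defaultdict
from statistics import mean


def aggregate(points, cluster_of, how):
    """Combine per-machine values into one value per (cluster, metric)."""
    groups = defaultdict(list)
    for endpoint, metric, value in points:
        groups[(cluster_of[endpoint], metric)].append(value)
    reducer = {"sum": sum, "avg": mean}[how]
    return {key: reducer(values) for key, values in groups.items()}


# Hypothetical mapping from machine to cluster, and one round of reported values.
cluster_of = {"host-a": "api", "host-b": "api", "host-c": "db"}
points = [
    ("host-a", "qps", 120.0),
    ("host-b", "qps", 80.0),
    ("host-c", "qps", 30.0),
]

totals = aggregate(points, cluster_of, "sum")
averages = aggregate(points, cluster_of, "avg")
```

The aggregated values form new series keyed by cluster rather than by machine, so they could in principle be stored and alerted on like any other metric.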

Contact us

  • Our website is n9e.didiyun.com, and the documentation will be published there.
  • You can follow Nightingale on GitHub, where you are welcome to try it and join the community.
  • You can install and try the Nightingale image on Didi Cloud with one click.
  • You can also add the Didi Cloud assistant on WeChat, and the assistant will add you to the community discussion group.

Acknowledgments and explanations

  • Open-Falcon is an open source enterprise-level monitoring solution from Xiaomi's operation and maintenance team, and it is widely used in China.
  • Nightingale uses the Apache-2.0 open source license. Copyright © Didi 2020.