Background

EMonitor: a one-stop monitoring system serving all technical departments of Ele.me. It covers data storage and querying for system monitoring, container monitoring, network monitoring, middleware monitoring, business monitoring, access-layer monitoring, and front-end monitoring. It processes close to a petabyte of data per day, writes hundreds of terabytes of metric data per day, serves tens of millions of metric queries per day, and hosts tens of thousands of configured charts and thousands of dashboards.

CAT: a real-time application monitoring platform built on Java, providing Meituan-Dianping with comprehensive real-time monitoring and alerting services.

This article uses a comparison of these two systems as a starting point to discuss what a monitoring system can look like, and the stages a monitoring system passes through as it evolves.

What the open-source CAT does

First, note that only the latest open-source release of CAT, version 3.0.0, is available on GitHub, so the comparison below is based on that version.

So what does CAT do?

1 Abstracting the monitoring models

CAT abstracts four monitoring models: Transaction, Event, Heartbeat, and Metric.

  • Transaction: records how long a piece of code takes to execute and how many times it runs
  • Event: records how many times an event occurs
  • Heartbeat: statistics generated periodically inside the program, such as CPU utilization
  • Metric: records business metrics, including counts and totals

Two dimensions, Type and Name, are fixed for Transaction and Event, and minute-level aggregation of Type and Name is performed to form reports and display curves.
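
To make these models concrete, here is a minimal usage sketch. The Cat.* method names come from CAT's open-source client API; the surrounding class and the metric names are invented purely for illustration.

```java
import com.dianping.cat.Cat;
import com.dianping.cat.message.Transaction;

public class OrderService {
    public void createOrder() {
        // Transaction: records how long this code takes and how often it runs,
        // keyed by the two fixed dimensions type ("Order") and name ("createOrder").
        Transaction t = Cat.newTransaction("Order", "createOrder");
        try {
            // ... business logic ...

            // Event: counts how many times something happens.
            Cat.logEvent("Order.Cache", "miss");

            // Metric: business counters and durations.
            Cat.logMetricForCount("order.created");       // count +1
            Cat.logMetricForDuration("order.cost", 42);   // duration in milliseconds

            t.setStatus(Transaction.SUCCESS);
        } catch (Exception e) {
            t.setStatus(e);
            Cat.logError(e);
        } finally {
            t.complete();
        }
    }
}
```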


2 Trace sampling

CAT keeps minute-level sampled traces for each type and name of the Transaction and Event models described above.


3 User-defined Metric points

Counter and Timer types are currently supported, along with tags; a single Metric on a single machine can have at most 1,000 tag combinations. There is also a simple monitoring dashboard, as shown below:


4 Integration with other components

For example, when integrated with MyBatis, the client starts collecting statistics for the relevant SQL statements and files them under type=SQL in the Transaction report dashboard.
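
As a rough sketch of how such an integration can be wired on the client side: the example below is not CAT's bundled integration, only an illustration of the idea using the standard MyBatis Interceptor SPI together with the CAT client API; the class name is invented.

```java
import com.dianping.cat.Cat;
import com.dianping.cat.message.Transaction;
import org.apache.ibatis.executor.Executor;
import org.apache.ibatis.mapping.MappedStatement;
import org.apache.ibatis.plugin.*;
import org.apache.ibatis.session.ResultHandler;
import org.apache.ibatis.session.RowBounds;

import java.util.Properties;

// Wraps every MyBatis statement execution in a CAT Transaction so it shows up
// under type=SQL in the Transaction report (illustrative class name).
@Intercepts({
    @Signature(type = Executor.class, method = "update",
               args = {MappedStatement.class, Object.class}),
    @Signature(type = Executor.class, method = "query",
               args = {MappedStatement.class, Object.class, RowBounds.class, ResultHandler.class})
})
public class CatSqlInterceptor implements Interceptor {

    @Override
    public Object intercept(Invocation invocation) throws Throwable {
        MappedStatement ms = (MappedStatement) invocation.getArgs()[0];
        // type = "SQL", name = the MyBatis statement id, e.g. "UserMapper.selectById"
        Transaction t = Cat.newTransaction("SQL", ms.getId());
        try {
            Object result = invocation.proceed();
            t.setStatus(Transaction.SUCCESS);
            return result;
        } catch (Throwable e) {
            t.setStatus(e);
            Cat.logError(e);
            throw e;
        } finally {
            t.complete();
        }
    }

    @Override
    public Object plugin(Object target) {
        return Plugin.wrap(target, this);
    }

    @Override
    public void setProperties(Properties properties) {
        // no configuration needed for this sketch
    }
}
```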


5 Alarms

Simple threshold alarms can be generated for Transaction and Event

EMonitor vs. CAT

EMonitor borrows from CAT and improves on it.

1 Introducing Transaction and Event

Like CAT, EMonitor fixes the two dimensions type and name for Transaction and Event; the difference lies in how the data sent by users is aggregated.

The architecture diagram of CAT is shown below:


CAT’s consumer machine needs to do the following two things:

  • Aggregate message models such as Transaction and Event by type and name for the current hour, and write the aggregated data of completed hours into MySQL
  • Write trace data to local files or to remote HDFS

The architecture diagram for EMonitor is shown below:


EMonitor splits processing into two isolated channels:

  • Real-time streaming compute: pre-aggregates the monitoring models (Transaction, Event, and so on) carried in the traces sent by users into metric data, and pre-aggregates the Metric data sent by users, both at 10-second granularity. The 10s pre-aggregated data is then written to LinDB and to Kafka, and the alarm module Watchdog consumes Kafka to raise real-time alarms (a minimal sketch of this pre-aggregation follows this list)
  • Real-time data writer: builds trace indexes for the trace data sent by users, writes the indexes and trace data to HDFS and HBase, and builds the dependency graph between applications, writing it to Neo4j
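
The sketch below only illustrates the 10-second windowing and aggregation idea described above; EMonitor's real streaming job is not shown in this article, so all class, method, and field names are invented.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Invented names throughout; only the 10s windowing/aggregation idea comes from the text above.
public class TenSecondAggregator {

    // key: "windowStart|type|name" -> [call count, total latency in ms]
    private final Map<String, long[]> buckets = new ConcurrentHashMap<>();

    public void onTransaction(String type, String name, long timestampMillis, long costMillis) {
        long windowStart = timestampMillis / 10_000 * 10_000;   // align to a 10s window
        String key = windowStart + "|" + type + "|" + name;
        buckets.compute(key, (k, v) -> {
            if (v == null) v = new long[2];
            v[0] += 1;           // call count
            v[1] += costMillis;  // total latency; average = v[1] / v[0]
            return v;
        });
    }

    // Windows older than the current one are "closed"; their points would then be
    // written to LinDB and to Kafka (which the Watchdog alarm module consumes).
    public Map<String, long[]> drainClosedWindows(long nowMillis) {
        long currentWindow = nowMillis / 10_000 * 10_000;
        Map<String, long[]> closed = new HashMap<>();
        Iterator<Map.Entry<String, long[]>> it = buckets.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, long[]> e = it.next();
            long windowStart = Long.parseLong(e.getKey().substring(0, e.getKey().indexOf('|')));
            if (windowStart < currentWindow) {
                closed.put(e.getKey(), e.getValue());
                it.remove();
            }
        }
        return closed;
    }
}
```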

Therefore, a major difference between EMonitor and CAT lies in how metrics are handled: EMonitor delegates them to a dedicated time-series database, while CAT aggregates them itself, which makes it quite limited, as shown below:

  • CAT can only view type and name data within a single hour; it cannot view report data between two arbitrary points in time. EMonitor has no such restriction
  • CAT cannot view the response time and QPS summarized across all types; EMonitor can freely combine type and name for aggregation
  • CAT's type and name reports are at minute granularity; EMonitor's are at 10s granularity
  • CAT's type and name reports cannot be compared directly against historical curves; EMonitor can compare against historical curves, which makes it easier to spot problems
  • The front page of CAT's type and name lists is a wall of numbers, so some information cannot be read intuitively, for example whether a TP99 of 100 ms is good or bad; EMonitor shows the current curve next to the historical curve, so this can be judged at a glance
  • CAT's TP99 and TP999 are only accurate within a single machine's report for a single hour; aggregates across machines or hours are computed by weighted averaging, so their accuracy is questionable (illustrated below)
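
To see why weighted-averaging per-machine TP99 values can mislead, here is a small example with invented numbers:

```java
public class Tp99Example {
    public static void main(String[] args) {
        // Invented numbers, only to illustrate the accuracy problem described above.
        // Machine A: 1,000 fast requests with TP99 = 10 ms.
        // Machine B:    10 slow requests with TP99 = 500 ms.
        double weightedTp99 = (1_000 * 10 + 10 * 500) / 1_010.0;
        System.out.printf("weighted TP99 = %.1f ms%n", weightedTp99);   // about 14.9 ms
        // But over all 1,010 requests, the slowest 1% (about 10 requests) are exactly
        // machine B's slow calls, so the true combined TP99 is close to 500 ms,
        // far from what the weighted average suggests.
    }
}
```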

But CAT has its own advantages:

  • CAT provides TP999 and TP9999 lines (although their accuracy is questionable), while EMonitor only goes as fine as TP99
  • CAT's type and name reports can be filtered by machine, whereas EMonitor is not yet that fine-grained

2 Trace sampling

Currently both CAT and EMonitor can filter sampled traces by type and name, but they differ in that:

  • CAT samples traces at minute granularity, while EMonitor samples at 10s granularity
  • For a given type and name, CAT cannot easily find a desired trace, while EMonitor makes it easy to find a trace at a given point in time or within a given response-time range (a patent application has been filed for this).

EMonitor's trace view looks like this:


  • This screenshot shows the sampled traces for a particular 10s window under the selected type and name filters
  • The first row shows the traces sampled within that 10s window, sorted by response time
  • Clicking a response time opens the corresponding trace details

3 User-defined Metric points

EMonitor supports the Counter, Timer, Histogram, Payload, and Gauge metric types, as well as tags (a usage sketch follows the list below):

  • Counter: a cumulative count
  • Timer: records the elapsed time of a piece of code, including the number of executions and the maximum, minimum, and average elapsed time
  • Histogram: includes everything in Timer, plus support for computing the TP99 line and any other TP line (from 0 to 100)
  • Payload: records the size of a data packet, including the number of packets and the maximum, minimum, and average packet size
  • Gauge: an instantaneous value, used for things like queue length, connection count, CPU, and memory
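
A hypothetical client-side sketch, purely to make the five metric types and tags concrete; the real EMonitor client API is not shown in this article, so all names below are invented.

```java
import java.util.function.DoubleSupplier;

// All names are hypothetical; the real EMonitor client API is not described in the article.
interface MetricClient {
    Counter counter(String name);
    Timer timer(String name);
    Histogram histogram(String name);
    Payload payload(String name);
    void gauge(String name, DoubleSupplier value);

    interface Counter   { Counter tag(String key, String value); void increment(); }
    interface Timer     { Timer tag(String key, String value); void record(long millis); }
    interface Histogram { Histogram tag(String key, String value); void record(long millis); }
    interface Payload   { Payload tag(String key, String value); void record(long bytes); }
}

class MetricExamples {
    static void record(MetricClient metrics) {
        metrics.counter("order.created").tag("city", "shanghai").increment();    // Counter: cumulative count
        metrics.timer("order.create.cost").tag("method", "create").record(42);   // Timer: count/max/min/avg
        metrics.histogram("soa.response.time").record(87);                       // Histogram: also TP lines (TP99, ...)
        metrics.payload("api.response.size").record(2_048);                      // Payload: packet-size statistics
        metrics.gauge("redis.pool.active", () -> 17.0);                          // Gauge: instantaneous value
    }
}
```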

Any Metric points can be processed by EMonitor and written to the LinDB time-series database. This means EMonitor can unify arbitrary monitoring metrics (machine monitoring, for example, can be stored through EMonitor), which lays the foundation for a one-stop monitoring system.

Custom Metric dashboards

EMonitor has built a Metric dashboard comparable to Grafana, with the following advantages over Grafana:

  • Metrics are configured in a very simple, SQL-like way
  • Integration with the company's personnel and organization structure gives more elegant permission control, and each department can build its own dashboards
  • Metrics and dashboards can be favorited; when the source metric or dashboard changes, the people who favorited it do not need to change anything
  • One-click synchronization of metrics and dashboards across the alpha, beta, and prod environments, so they do not need to be configured repeatedly
  • Metrics and dashboards can be viewed on both PC and mobile

The SQL-like configuration of a query metric looks like this:


  • You can configure how the chart is presented
  • You can configure rich expressions, such as which fields to query and addition, subtraction, multiplication, and division between fields
  • You can configure multiple filter conditions on any tag
  • You can configure group by and order by (a hypothetical example of such a query follows this list)
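
As an invented approximation of what such a SQL-like query might look like (the exact EMonitor/LinDB syntax is not given in this article, so the measurement and field names below are hypothetical):

```java
public class DashboardQueryExample {
    // Invented approximation only; the "soa_provider" measurement, the field names,
    // and the tag values are all hypothetical.
    static final String SOA_TP99_BY_METHOD =
            "SELECT count(total) AS qps, tp99(duration) AS tp99Ms "
          + "FROM soa_provider "
          + "WHERE appId = 'order-service' AND ezone = 'SH' "
          + "GROUP BY method "
          + "ORDER BY qps DESC LIMIT 10";
}
```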

A full dashboard looks like this:


The mobile view shows the following:


4 Integration with other components



At present, EMonitor can monitor the traces and metrics of the IaaS layer, PaaS layer, and application layer, so there is no longer any need to switch between multiple monitoring systems, as shown below:



  • 1 IaaS layer: monitoring metrics for physical machines and network switches in the equipment room
  • 2 PaaS layer: monitoring metrics for middleware servers
  • 3 Application layer: metrics for clients such as SOA, Exception, JVM, and MQ
  • 4 Application layer: user-defined monitoring metrics

Take the integration with DAL, Ele.me's database and table sharding middleware, as an example:



  • You can filter by equipment room, execution status, table, and operation type (such as Insert, Update, and Select)
  • The list on the left shows the average execution time of each SQL statement
  • The two charts on the right show, for the selected SQL, the elapsed time at the DAL middleware layer and at the DB layer, as well as the invocation QPS
  • It shows how the SQL is distributed across the backend DALs and DBs, which can be used to check for hot spots
  • There are also curves for the data packet size of SQL query results, SQL throttling by DAL, and so on
  • You can view the trace information of a SQL invocation at any point in time

Similarly, take the integration with Ele.me's SOA services as an example:



  • You can filter by equipment room and status
  • The list on the left shows the SOA service interfaces provided by the application, along with their average response time and the comparison with yesterday
  • The two charts on the right show the response time and QPS of the selected service interface, together with the comparison with yesterday; you can switch the average response time to TP99 or another TP value, and a shortcut link lets you quickly add an alarm on the corresponding curve
  • You can switch to the single-machine view to see the response time and QPS of the SOA interface on each machine, to locate problems on a particular machine
  • It shows how invocations of the SOA interface are distributed across clusters
  • It shows all callers of the SOA interface and their QPS
  • You can view the invocation trace of the SOA interface at any point in time

5 Alarms

The following alarm modes can be configured for all monitoring metrics (a small sketch of the threshold and comparison-period checks follows the list):

  • Threshold: simple threshold alarms, suitable for metrics such as CPU and memory
  • Comparison period: alarms based on comparison with the same period in the past
  • Trend: intelligent alarms for smooth, continuous metrics that have no obvious threshold
  • Other alarm forms
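
An illustrative sketch of the threshold and comparison-period checks; EMonitor's real Watchdog rules are not described here, so the method names and tolerances below are invented.

```java
// Invented names and thresholds; only the alarm modes themselves come from the text above.
public class AlarmChecks {

    // Threshold: fire when the latest value crosses a fixed limit, e.g. CPU > 90%.
    static boolean thresholdAlarm(double latest, double limit) {
        return latest > limit;
    }

    // Comparison period: fire when the latest value deviates from the value at the same
    // time in a past period (yesterday, last week, ...) by more than a relative tolerance.
    static boolean comparisonAlarm(double latest, double samePeriodValue, double tolerance) {
        if (samePeriodValue == 0) {
            return latest > 0;
        }
        double relativeChange = Math.abs(latest - samePeriodValue) / samePeriodValue;
        return relativeChange > tolerance;
    }
}
```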

The development stages of monitoring systems

1 Log monitoring phase

How this phase works: the program writes logs, ELK is used to store and query the program's run logs, and ELK can also display simple metric curves.

Troubleshooting process: once a problem occurs, search ELK for suspicious exception logs and analyze them.

2 Trace monitoring phase

Problems with the previous phase: ELK only aggregates or searches logs line by line, with no context linking them, so it is hard to know at which stage a request spends the most time.

How this phase works: CAT emerged to meet this need, abstracting monitoring models such as Transaction and Metric and bringing trace analysis and simple reports into everyone's view.

Alarm mode: threshold alarms can be set on reports; once an alarm fires, you click into the report to locate the faulty type or name, then find the corresponding trace to view the details.

3 Metric monitoring phase

Problems with the previous phase: CAT's support for custom metrics is weak, and it cannot satisfy or display more diverse query and aggregation requirements.

How this phase works: support rich Metric types, split some of the report data out of traces into metrics, hand metric storage and querying over to a dedicated time-series database, and integrate or build rich metric dashboards such as Grafana.

Alarm mode: more diverse alarm policies are available on metrics; once an alarm occurs, you may need to check the metric dashboards of each system to roughly locate the root cause and then analyze the related traces.

4 Platform integration phase

Problems with the previous phase: system monitoring, middleware and business monitoring, some business-specific monitoring, trace monitoring, and metric monitoring each have their own pipeline for data collection, pre-processing, storage, querying, display, and alarming, and the data formats and usage patterns of these systems are not unified.

How this phase works: connect trace and metric monitoring across the system level, container level, middleware level, business level, and so on; unify the data processing pipeline; and integrate releases, changes, alarms, and monitoring curves to form a one-stop monitoring platform.

Alarm mode: alarms can be investigated uniformly against monitoring data at every level, and all monitoring curves and trace information can be viewed in a single monitoring system.

At present, EMonitor has completed this phase, merging three monitoring systems that had long existed independently within the company into the current system.

5 In-depth analysis phase

Problems with the previous phase:

  • Although users can see monitoring data from every level in one system, they still spend a lot of time checking each level for problems during troubleshooting, and missing one level may mean missing the root cause
  • There is no global monitoring view of an entire business, only the perspective of individual applications

In short: the previous phases built a monitoring platform that simply displays whatever metrics users query and does not care about the content of the data users store. Now the thinking has to change: the monitoring platform must take the initiative to help users analyze the data stored in it.

How this phase works: abstract the analysis process that users would otherwise do by hand, build application overviews and business overviews for them, and perform root-cause analysis on those overviews.

  • Application overview: builds monitoring of the upstream and downstream applications that the current application depends on, as well as of the machines, Redis, MQ, databases, and other resources associated with it, so that a health check of the application can be run at any time to proactively expose problems, rather than waiting for users to dig through metrics and discover them later
  • Business overview: overviews are built for each business, either curated by hand or generated automatically from traces; an overview can quickly tell users which step of a business flow has a problem

Root-cause analysis: an overview contains many steps, and each step is bound to many metrics, so every alarm may require analyzing the metrics of every step in detail. For example, an increase in consumption lag in Kafka can have many possible causes, and manually walking through the whole checklist for every alarm is exhausting. The process of locating the root cause therefore needs to be abstracted and modeled so that it can be solved uniformly.

Trend report analysis: proactively help users discover gradually worsening problems, for example an interface whose response time increases after a release, which users might otherwise not notice.

Proactive analysis also relies heavily on metric drill-down analysis: when a metric drops, the system can proactively work out which combination of tag dimensions is responsible for the decline. This underpins much of the intelligent analysis described above, and it is not easy to do well.
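
A toy sketch of the drill-down idea, under invented assumptions (this is not EMonitor's actual algorithm): given a metric's per-tag-combination values before and after a drop, rank the combinations by how much they declined.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration only; not EMonitor's actual drill-down algorithm.
public class TagDrillDown {

    // before/after: metric value per tag combination (e.g. "ezone=SH,method=create"),
    // measured in a window before and after the observed drop.
    public static List<Map.Entry<String, Double>> topContributors(
            Map<String, Double> before, Map<String, Double> after, int topN) {
        Map<String, Double> decline = new HashMap<>();
        for (Map.Entry<String, Double> e : before.entrySet()) {
            double drop = e.getValue() - after.getOrDefault(e.getKey(), 0.0);
            if (drop > 0) {
                decline.put(e.getKey(), drop);   // this combination lost 'drop' of the metric
            }
        }
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(decline.entrySet());
        ranked.sort(Map.Entry.comparingByValue());
        Collections.reverse(ranked);             // largest decline first
        return ranked.subList(0, Math.min(topN, ranked.size()));
    }
}
```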

Alarm mode: the alarm troubleshooting process can be unified across monitoring data at all levels. The NOC can quickly see which businesses or applications have problems from business metrics or the business overview. The application owner can get relevant change information from a health check of the application overview, such as Redis fluctuations, database fluctuations, or a fluctuating method in an upstream or downstream application, and so quickly locate the problem; or the root cause can be located by running root-cause analysis on the overview.

One more word on Logging, Tracing, and Metrics

The following diagram of the relationship between the three is widely circulated:


All three are indeed indispensable and mutually reinforcing, but I would like to say the following:

  • The share of effort the three take up in monitoring and troubleshooting differs greatly: Metrics comes first, Tracing second, and Logging last
  • Tracing carries the important dependency information between applications, while Metrics offers more room for in-depth analysis and mining. Future effort should therefore go into Metrics, combined with the application dependencies from Tracing for deeper, global analysis; together, Metrics and Tracing open up more possibilities

This article is original content from the Yunqi Community (Alibaba Cloud) and may not be reproduced without permission.