Monitoring is essential for troubleshooting and performance tuning in both operations and development. Monitoring really covers two things: metric monitoring and log monitoring. This article focuses on building an HBase metric monitoring system out of open-source components; log monitoring matters less in the early stages and can be handled with the ELK stack.

After some investigation, the candidate schemes are roughly the following:

  1. Implement a Gmetad-like Python plugin that collects data from gmond and writes it to OpenTSDB; Bosun can be used for alerting

  2. Use the Prometheus stack directly: collect via jmx_exporter started as a Java agent, and alert with Alertmanager

  3. Add a Hadoop metrics Sink that pushes to Pushgateway or OpenTSDB; alerting is similar to schemes 1 and 2

  4. Use the Hadoop metrics system with a custom GraphiteSink that sends to InfluxDB's Graphite-compatible endpoint

  5. Collect directly with TCollector and send to OpenTSDB
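To make scheme 2 concrete: jmx_exporter is attached as a Java agent when the HBase process starts, so no separate collector daemon is needed. A sketch of the hbase-env.sh change; the jar path, port 9100, and the hbase_jmx.yml rules file are all placeholders for your own deployment:

```sh
# hbase-env.sh -- expose the region server's JMX metrics on port 9100
# for Prometheus to scrape (paths and port are placeholders)
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -javaagent:/opt/jmx_prometheus_javaagent.jar=9100:/opt/hbase_jmx.yml"
```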

Comparison of schemes

The first and second schemes are pull-based (or pull plus push).

In the first scheme, metric parsing (region-level metrics, for example) has to be implemented inside the plug-in, whereas in the second scheme jmx_exporter parses JMX names with configurable regex rules. On deployment: the first scheme requires deploying and maintaining gmond independently, while the second uses the Java agent mechanism, so the collector is attached like an interceptor when the HBase process starts. On alerting: the first approach is more flexible because we parse the data ourselves; the metrics can be written in whatever format the company's internal monitoring system expects, or open-source Bosun (an alerting system built on OpenTSDB and Redis) can be used. The second approach uses Prometheus's own alerting via Alertmanager.
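The "metric parsing in the plug-in" point can be illustrated with HBase's region-level JMX keys, which encode namespace, table, and region into one flat name. A minimal Python sketch of what the scheme-1 plug-in has to do by hand (and what jmx_exporter does with a configured regex):

```python
import re

# Region-level keys from the HBase "Regions" JMX bean look like
# "Namespace_<ns>_table_<table>_region_<encoded-id>_metric_<name>".
# The plug-in must split them into a metric name plus tags itself.
REGION_KEY = re.compile(
    r"Namespace_(?P<namespace>.+?)"
    r"_table_(?P<table>.+?)"
    r"_region_(?P<region>.+?)"
    r"_metric_(?P<metric>.+)"
)

def parse_region_metric(key):
    """Return (metric_name, tags) or None if the key is not region-level."""
    m = REGION_KEY.match(key)
    if m is None:
        return None
    tags = {k: m.group(k) for k in ("namespace", "table", "region")}
    return m.group("metric"), tags
```

For example, `parse_region_metric("Namespace_default_table_usertable_region_1588230740_metric_storeFileCount")` yields the metric name `storeFileCount` with namespace, table, and region as tags.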

The third and fourth schemes use Hadoop's own metrics system, emitting monitoring data by rewriting an existing Sink or adding a new one. Pushgateway is a relay provided by Prometheus that accepts pushed data. The fourth scheme works because InfluxDB is compatible with the Graphite protocol, so metrics can be emitted to its Graphite endpoint. In testing, however, InfluxDB's Graphite parser could not handle the special characters in JMX names; for example, "Context=regionserver.Hostname=gdc-dn05-devbase.i.nease.net" parses horribly, because the hostname itself contains dots. If you have a good solution, please leave a message.
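The root of the parsing problem is that Graphite uses "." as its path separator, so every dot inside the hostname becomes another bogus path level. One common workaround (a sketch of a custom Sink's formatting step, not anything InfluxDB does itself) is to escape dots inside tag values before building the Graphite path:

```python
# Graphite treats "." as a hierarchy separator, so a hostname like
# "gdc-dn05-devbase.i.nease.net" splits into five path levels. Escaping
# the dots inside the tag value keeps the path depth predictable.

def graphite_path(context, hostname, metric):
    """Build a Graphite path, replacing dots in the hostname with underscores."""
    safe_host = hostname.replace(".", "_")
    return ".".join([context, safe_host, metric])
```

With this, `graphite_path("regionserver", "gdc-dn05-devbase.i.nease.net", "readRequestCount")` produces a three-level path instead of seven.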

The fifth scheme is push-based: a TCollector process deployed on each server collects JMX information and sends it to OpenTSDB, so TCollector itself must be deployed and managed. At my previous company we modified TCollector to send through Celery before writing to OpenTSDB.
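TCollector-style collectors emit one line per sample in OpenTSDB's text protocol: `put <metric> <epoch-seconds> <value> tag1=v1 ...`. A sketch of that formatting step, assuming the JMX values have already been fetched (for example from the region server's /jmx HTTP endpoint):

```python
import time

def to_put_lines(metrics, tags, ts=None):
    """Format samples as OpenTSDB text-protocol lines.

    metrics: {metric_name: numeric value}; tags: {tag: value}.
    """
    ts = int(ts if ts is not None else time.time())
    tag_str = " ".join("%s=%s" % kv for kv in sorted(tags.items()))
    return ["put %s %d %s %s" % (name, ts, value, tag_str)
            for name, value in sorted(metrics.items())]
```

A real collector would print these lines to stdout for TCollector to forward, or send them straight to OpenTSDB's telnet port.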

On storage: OpenTSDB is built on HBase, so it scales horizontally; its read path, however, is often criticized. A future OpenTSDB version is said to add a Redis cache for the most recent hour of data, but it has not been released yet. Prometheus, by contrast, currently uses stand-alone local storage, so capacity depends on disk size. As for remote storage, OpenTSDB and InfluxDB do not support full float64 precision; a reasonable setup is to keep about three months of data locally and ship everything to OpenTSDB. The officially reported record is 800K samples/s, which is enough to monitor over 10K node exporters at a 10s scrape interval.

Query performance at large data volumes has not been tested yet; at small volumes the systems are much the same. In query language, Prometheus is somewhat richer, for example it can add and subtract two metrics, but the basic information we need now does not require such complex queries. In client compatibility, Prometheus's jmx_exporter covers most monitoring needs, and TCollector can collect JMX information from most of the Hadoop ecosystem.

Choice of scheme

In the end I chose to read the data from gmond, for two reasons. First, writing the plug-in myself means I control where the data is written and can change the logic freely. Second, the components of the Hadoop ecosystem already support sending metrics to gmond. Testing showed that the Python plug-in could not keep up at our current scale, mainly because Python multithreading effectively runs on a single CPU. So I reimplemented the same functionality in Java: one thread reads from gmond and puts metrics on an in-memory queue, and multiple threads consume the queue and write to OpenTSDB and elsewhere. The code is not polished, so it is not on GitHub; contact me if you want it for reference. Currently there is no bottleneck in fetching or writing; HBase sees roughly 200,000 ops. There is, of course, still a single point of failure.
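The reader/queue/consumer design above can be sketched briefly (in Python here for readability; the real implementation is Java, and `write` stands in for the actual OpenTSDB client):

```python
import queue
import threading

def run_pipeline(samples, write, workers=4):
    """Producer fills a bounded queue; worker threads drain it via write()."""
    q = queue.Queue(maxsize=10000)
    stop = object()  # sentinel telling a worker to exit

    def consume():
        while True:
            item = q.get()
            if item is stop:
                break
            write(item)  # in reality: batch and send to OpenTSDB

    threads = [threading.Thread(target=consume) for _ in range(workers)]
    for t in threads:
        t.start()
    for s in samples:        # producer: in reality, a loop reading gmond XML
        q.put(s)
    for _ in threads:        # one sentinel per worker so all of them exit
        q.put(stop)
    for t in threads:
        t.join()
```

The bounded queue applies back-pressure on the gmond reader if the OpenTSDB writers fall behind, which is what keeps fetching and writing from becoming a bottleneck.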