The Internet business is booming and system architectures are growing more complex. As software products and engineering teams have changed, many open source monitoring tools have emerged, some of them well known, such as Zabbix, Nagios and StatsD. Some questions are debated constantly: which open source monitoring tool is better, Zabbix or Nagios? Is StatsD likely to replace Zabbix or Nagios as the new standard for system monitoring?
The birth of StatsD
Etsy is a large online marketplace for handmade goods that the New York Times has compared to eBay, Amazon and grandma's basement collection. Back in 2009, Etsy was struggling to grow, and the site was not reliable enough. Much of the reason lay in its architecture, which stemmed from a pre-DevOps culture in which developers, DBAs and system administrators were each locked into their own silos and developers had no access to production. At the time, this was the most common way to develop and operate websites.
During Kellan Elliott-McCrea's five years as vice president of engineering and CTO at Etsy, both the software product and the engineering team changed dramatically. The most visible change was in how the engineering team performed. This transformation produced a number of open source tools, some of which are quite well known, such as StatsD, a daemon that aggregates the metrics applications send to it.
StatsD has been arguably the most popular and useful DevOps tool of the past few years.
StatsD profile
StatsD is a simple network daemon that runs on the Node.js platform. It listens over UDP or TCP for statistics such as counters and timers, and sends aggregates to back-end services such as Graphite.
StatsD was originally written by Erik Kastner of Etsy as a front-end proxy for Graphite/Carbon that aggregates and summarizes application metrics. It is built around two main functions: counting and timing. The original implementation uses Node.js, and servers have since been written in other languages as well. With StatsD, you instrument an application through a language-specific client and can collect whatever data you need. Clients report to the StatsD server by sending UDP packets. Let's see why StatsD chose UDP over TCP.
Why UDP?
StatsD uses UDP to transmit data. Why UDP instead of TCP? First of all, it's fast: no one wants to slow down an app in order to track its performance. In addition, UDP is fire-and-forget: either StatsD receives the packet or it doesn't. The application doesn't care whether StatsD is running, down, or on fire; it simply trusts that everything is working. In other words, a crash of the back-end StatsD server does not affect the application in the foreground. (Of course, UDP packet receipt failures can themselves be tracked on a graph.)
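The fire-and-forget behavior can be illustrated with a minimal Python sketch. The helper name `send_metric` and the bucket `page.views` are made up for illustration; the host and port assume a StatsD server on the default local UDP port 8125. This is not the code of any particular client library.

```python
import socket

def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> bool:
    """Send one StatsD line over UDP without ever raising into the caller."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            # UDP has no handshake: the datagram is handed to the network
            # stack and the call returns immediately.
            sock.sendto(line.encode("ascii"), (host, port))
        return True
    except OSError:
        # Metrics are best effort: swallow network errors so the
        # application keeps running.
        return False

# Hypothetical bucket name; no StatsD server has to be listening.
send_metric("page.views:1|c")
```

The return value only says whether the datagram was handed to the operating system, not whether StatsD received it, which is exactly the trade-off described above.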
Some concepts of StatsD
To understand StatsD, start with three concepts: buckets, values, and the flush interval.
Buckets
In StatsD, each stat lives in its own bucket, identified by the metric name. On the Graphite side, these metrics are stored in Whisper files. When a Whisper file is created, it has a fixed size that does not change. The file may contain several buckets (archives) that hold data points at different resolutions, and each has a retention attribute that specifies how long its data points are kept. Whisper does some simple math to work out how many data points each bucket will actually store.
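The "simple math" amounts to dividing the retention period by the resolution. Here is a small worked sketch; the function name and the two retention settings are hypothetical examples, not values taken from any real configuration.

```python
def points_in_archive(seconds_per_point: int, retention_seconds: int) -> int:
    """Data points one archive holds: retention divided by resolution."""
    return retention_seconds // seconds_per_point

# Hypothetical retention settings: one point every 10 s kept for 6 hours,
# and one point every 60 s kept for 7 days.
archives = [(10, 6 * 3600), (60, 7 * 24 * 3600)]
for resolution, retention in archives:
    print(resolution, retention, points_in_archive(resolution, retention))
```

Because both numbers are fixed when the file is created, the file size is fixed too.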
Values
Each stat has a value. How the value is interpreted depends on the modifier. In general, values should be integers.
Flush Interval
When the flush interval (typically 10 seconds) elapses, the stats are aggregated and sent upstream to the back-end service.
Tracking all events is key to productivity. With StatsD, engineers can easily track whatever they need to focus on without making time-consuming configuration changes.
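As a rough sketch of what happens at each flush, the toy aggregator below sums counters in memory and resets them when flushed. The class and field names are illustrative assumptions, and the real daemon does more (timers, gauges, back-end plugins).

```python
from collections import defaultdict

class Aggregator:
    """Toy in-memory counter aggregation, reset on every flush."""

    def __init__(self, flush_interval: float = 10.0):
        self.flush_interval = flush_interval  # seconds between flushes
        self.counters = defaultdict(float)

    def count(self, bucket: str, value: float = 1) -> None:
        # Accumulate increments received during the current interval.
        self.counters[bucket] += value

    def flush(self) -> dict:
        """Return totals and per-second rates, then start a new interval."""
        snapshot = {
            bucket: {"count": total, "rate": total / self.flush_interval}
            for bucket, total in self.counters.items()
        }
        self.counters.clear()
        return snapshot

agg = Aggregator(flush_interval=10.0)
agg.count("logins")
agg.count("logins", 4)
print(agg.flush())
```

In a real daemon a timer would call `flush()` once per interval and forward the snapshot to the back end.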
Advantages of StatsD
Collecting and visualizing data is an important way to make informed decisions about servers and applications, and StatsD has the following benefits:
- Simple – instrumenting an application is very easy; the StatsD protocol is text-based and can be written and read directly.
- Low coupling – StatsD runs as a background daemon and uses a fire-and-forget protocol, UDP, so the application itself does not depend on metrics collection.
- Small footprint – the StatsD client is very lightweight, holds no state and needs no threads.
- Universal and multi-language support – there are clients based on Ruby, Python, Java, Erlang, Node, Scala, Go, Haskell, and almost all other languages.
How Etsy uses StatsD
Etsy has blogged about how and why it uses StatsD in the article "Measure Anything, Measure Everything". The article explains that Etsy tracks changes in its servers, applications and network in the form of charts. Of the three, application data is the most complex. Plotting everything on a timeline lets all the metrics be visualized and measured.
StatsD uses counters to count events and timers to measure durations. One of the great things about timers is that you get averages, totals, counts, and upper and lower bounds. The events Etsy tracks are very frequent and the StatsD client does not buffer any data, so each call between the application and StatsD is kept simple. For high-volume operations, a sample rate can be added when sending data to StatsD so that only a certain proportion of events is sent. The StatsD daemon listens for UDP traffic from all application libraries, collects data over time, and flushes it in the background at the configured interval. For example, for a timer wrapped around a function call, it can collect data for 10 seconds and then compute the maximum, minimum, average, median, and 90th and 95th percentile values.
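Sampling and timer summaries can be sketched in a few lines of Python. The function names are hypothetical. The `|@0.1` suffix is the sample-rate notation of the StatsD line protocol. The percentile uses the simple nearest-rank method, which may differ from the calculation a given StatsD server performs.

```python
import random
import statistics

def sampled_counter(bucket: str, sample_rate: float, rng=random.random):
    """Return a StatsD line for roughly `sample_rate` of calls, else None."""
    if rng() < sample_rate:
        # The server can scale the count back up using the advertised rate.
        return f"{bucket}:1|c|@{sample_rate}"
    return None

def summarize_timer(values_ms, percentiles=(90, 95)) -> dict:
    """Summarize one flush interval of timings (needs at least one value)."""
    ordered = sorted(values_ms)
    summary = {
        "count": len(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
    }
    for p in percentiles:
        # Nearest-rank percentile: ceil(p / 100 * n), in integer arithmetic.
        rank = max(1, (p * len(ordered) + 99) // 100)
        summary[f"p{p}"] = ordered[rank - 1]
    return summary

print(summarize_timer([120, 80, 95, 300, 110]))
```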
Etsy has also open-sourced StatsD. Metrics are sent using a simple line protocol in the following format:
<metricname>:<value>|<type>
If you are running StatsD locally with the default UDP server, the easiest way to send metrics on the command line is:
echo "foo:1|c" | nc -u -w0 127.0.0.1 8125
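To make the line format concrete, here is a minimal Python parser for single-metric lines such as the `foo:1|c` sent above. The `Metric` type and `parse_line` function are illustrative names, and the sketch ignores multi-metric packets and other extensions.

```python
from typing import NamedTuple

class Metric(NamedTuple):
    bucket: str
    value: float
    type: str
    sample_rate: float

def parse_line(line: str) -> Metric:
    """Parse '<metricname>:<value>|<type>' with an optional '|@<rate>'."""
    bucket, _, rest = line.strip().partition(":")
    fields = rest.split("|")
    if not bucket or len(fields) < 2:
        raise ValueError(f"malformed StatsD line: {line!r}")
    sample_rate = 1.0
    if len(fields) > 2 and fields[2].startswith("@"):
        sample_rate = float(fields[2][1:])
    return Metric(bucket, float(fields[0]), fields[1], sample_rate)

print(parse_line("foo:1|c"))
```

The type `c` marks a counter; timers use `ms`.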
collectd
Collectd is also a daemon that collects system performance data and provides a mechanism for storing different values in various ways. For details, see the Collectd website.
Collectd periodically gathers performance statistics about the system. Those statistics can be used to check current server performance and to predict future load. It cannot generate graphs by itself: although it can write RRD files, it cannot draw graphs from them, so it is generally paired with a graphing tool such as Graphite. For example, VPSee uses collectd to collect various performance parameters of its machines.
Collectd has some advantages over other projects that collect system performance metrics: it is written in C (efficient, and suitable for embedded systems), it runs as a standalone daemon that does not depend on cron, it is easy to use, and more. It also comes with more than 70 plug-ins and with documentation.
StatsD and Graphite
We believe in charts for presenting data. As Ian Malpass put it in his Code as Craft article: we track everything that changes. We tend to focus on network, machine, and application data. Application performance data is often the hardest of the three to measure but also the most important, because it is tied to your business. Is it possible to chart, in the easiest way possible, anything an engineer might want to count or time?
StatsD is usually mentioned alongside Graphite: StatsD collects and aggregates measurements, then passes them to Graphite, which stores them as time series and charts them. In other words, StatsD handles the preliminary data processing and Graphite handles presentation, and the two complement each other.
There are many reasons why we like Graphite: it is easy to use and has powerful graphing and data manipulation capabilities. We can combine data from StatsD with metrics from other collection systems. Most importantly, Graphite creates a new metric as soon as StatsD pushes it. This means engineers have no administrative overhead when tracking something new: they just tell StatsD "I want to track grue" and the metric automatically appears in Graphite. In addition, data is pushed to Graphite every 10 seconds, so StatsD's measurements are presented in near real time.
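The hand-off to Graphite can be pictured with Carbon's plaintext protocol, in which each line carries a metric path, a value and a Unix timestamp. The sketch below only formats such lines. The function name and the `stats` prefix are assumptions for illustration, and actually delivering the lines to a Carbon listener is left out.

```python
import time

def carbon_lines(metrics: dict, timestamp=None, prefix: str = "stats") -> str:
    """Format aggregated values as '<path> <value> <timestamp>' lines."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return "".join(
        f"{prefix}.{name} {value} {ts}\n"
        for name, value in sorted(metrics.items())
    )

# A metric name Graphite has never seen needs no registration step:
# it simply shows up as a new path in the payload.
print(carbon_lines({"grue": 3}, timestamp=1700000000), end="")
```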
The image shows the elapsed_time value of HTTP requests over time.
With StatsD, capturing rates and other measurements becomes very simple, and with Graphite's data processing, viewing and analyzing that data becomes just as simple.
The StatsD ecosystem
A series of tools built on StatsD has appeared outside China, and mature projects have started to add StatsD compatibility. They fall into several categories, as shown in the figure.
Integrations
Since StatsD itself does not define the meaning of metrics, collecting them from databases or operating systems requires scripts to be written. One of the biggest contributors here is Datadog.
Datadog's dd-agent has 150 contributors on GitHub and supports more than 60 operating systems, middleware products and databases.
In addition, Librato and AppFirst have joined the StatsD camp. Infrastructure management tools such as Puppet and Chef can now install StatsD in bulk across an infrastructure.
Visualization and data hosting
Graphite provides not only visualization but also its own storage. For visualization alone, however, Grafana does it best, with a wide variety of chart types and configurable options. SignalFx is a latecomer that has also joined the fray.
On top of data visualization, some services have started hosting the data as well, for example Hosted Graphite.
Time series database and event processing engine
In fact, StatsD and time series databases have grown up together and complement each other. StatsD adoption has expanded on top of databases such as OpenTSDB and InfluxDB.
Event processing engines such as Riemann are starting to integrate with time series databases or with StatsD-based all-in-one solutions, adding the alerting that presentation alone lacks.
Integrated solution
So, is there an integrated solution?
Beyond these specialized categories, some vendors outside China provide integrated solutions that pair lightweight StatsD collection with greater computing power on the back end to cope with increasingly complex infrastructure. Examples include Datadog and Librato.
In China, Cloud Insight provides system monitoring services based on the same idea. Cloud Insight uses StatsD for collection and integrates with MySQL, Redis, MongoDB, and the CentOS and RedHat operating systems. For storage, it uses OpenTSDB on top of HBase to aggregate, group, and filter performance metrics. Having studied the StatsD ecosystem, it combines different tools to offer users an integrated solution.
Datadog raised $31 million in Series C funding earlier this year. Its client list, from Facebook to Airbnb, is impressive. OneAPM, the company behind Cloud Insight, also recently raised RMB 165 million in a Series C round, and the industry has high expectations for it.
Conclusion
Admittedly, StatsD, as the core of a new generation of system monitoring, is still maturing technically. We believe that more and more open source projects will adopt it, and that more and more companies will commit R&D resources to it or invest in related areas.
Datadog, a company built on this technology, has done well thanks to its investment in it and its proven computing power. In China, Cloud Insight has joined the StatsD camp based on the same idea. Whether StatsD will eventually replace Zabbix or Nagios as the new standard for system monitoring remains to be seen.