We currently use Sensu to monitor our environment, and I’ve taken to using standalone checks to collect various metrics. Standalone checks don’t rely on the server to issue a check request, which gives a more reliable interval between checks. One of the metrics we collect is the number of TCP sockets in each of the possible states on each server. We started off using the metrics-netstat-tcp.rb check from the excellent set of Sensu community plugins.
This plugin was doing the job quite nicely, until I noticed that some of our machines had widely varying intervals for publishing this data, especially when under load. This started to be noticeable once the server passed roughly 10k connections, and got worse as the number of connections increased. Given that it’s not uncommon for some of our servers to handle in the region of 100k connections during busy times, I decided to have a closer look at what was going on. Closer inspection on one of the servers revealed that the script was pegging a CPU core at 100% and still taking around 10s to complete when a server had ~60k TCP connections in various states – not a good use of valuable resources.
Taking a look at the code of this plugin for the first time, everything looked pretty reasonable and nicely readable, but the large regular expression running against every line in /proc/net/tcp looked glaringly suspicious. As my skills with awk are greater than my skills with Ruby, I decided it would be quicker for me to simply rewrite the check using a tool that was built for running efficiently over large text files. The result, a few minutes later, was metrics-netstat-tcp.awk. Although the parameters are not the same, the output and functionality match, making it an almost, but not quite, drop-in replacement.
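To illustrate the general idea (this is not the awk script itself, just a minimal Python sketch of the same approach), the per-state counts can be had by splitting each line of /proc/net/tcp on whitespace and looking up the hex code in the "st" column, with no regular expression involved. The function names, the "scheme" prefix and the Graphite-style "name value timestamp" output are assumptions made for the example, not taken from either plugin.

```python
from collections import Counter

# Kernel TCP state codes as they appear (in hex) in the "st" column
# of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED",
    "02": "SYN_SENT",
    "03": "SYN_RECV",
    "04": "FIN_WAIT1",
    "05": "FIN_WAIT2",
    "06": "TIME_WAIT",
    "07": "CLOSE",
    "08": "CLOSE_WAIT",
    "09": "LAST_ACK",
    "0A": "LISTEN",
    "0B": "CLOSING",
}


def count_tcp_states(lines):
    """Count sockets per state from an iterable of /proc/net/tcp lines."""
    # Start every known state at zero so absent states are still reported.
    counts = Counter({name: 0 for name in TCP_STATES.values()})
    for line in lines:
        fields = line.split()
        # Skip blank or short lines and the header row (first field is "sl").
        if len(fields) < 4 or fields[0] == "sl":
            continue
        state = TCP_STATES.get(fields[3].upper())
        if state is not None:
            counts[state] += 1
    return counts


def format_metrics(counts, scheme, timestamp):
    """Render counts as 'scheme.STATE value timestamp' lines (assumed format)."""
    return [
        f"{scheme}.{state} {count} {timestamp}"
        for state, count in sorted(counts.items())
    ]
```

On a Linux host, passing an open handle on /proc/net/tcp to count_tcp_states and printing the lines from format_metrics with the current Unix time would give one metric line per state. The per-line work is a single split and a dictionary lookup.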
The more important feature for me though is that collecting the metrics on a machine with ~60k connections now completes in under 60ms instead of around 10s. Hopefully the lesson for everyone else is that the older tools are still around for a reason, and you need to know when and how to pick the right tool for the right job.
It’s been way too long since I’ve posted anything here, so I’ve decided to resume again with what I’ve been busy with lately.
We’ve been using Logstash for quite a while now, and one of the annoyances I’ve had is that with the default configuration we threw together we only got one-second resolution on our timestamps. 99% of the time this has been good enough for the simple things we’ve been trying to keep an eye on, but now and again it’s resulted in trouble stitching together the exact sequence of events when trying to diagnose a problem.
Rsyslog is the primary tool we use to get data into Logstash in the first place. As we don’t have the luxury of using anything cloud-oriented, we need to care about our hardware too, and using syslog makes this very easy. Add to this how simple it is to get most applications to log via syslog too, and using it becomes a no-brainer.
Getting the high resolution timestamps into your rsyslog messages is actually really simple. Just make sure you have the following configuration option set in your rsyslog configuration: