We currently use Sensu to monitor our environment, and I’ve taken to using standalone checks to collect various metrics. Standalone checks are scheduled by the client itself rather than relying on the server to issue a check request, which gives a more reliable interval between checks. One of the metrics we collect is the number of TCP sockets in each possible state on each server. We started off using the metrics-netstat-tcp.rb check from the excellent set of Sensu community plugins.
This plugin was doing the job quite nicely until I noticed that some of our machines had widely varying intervals for publishing this data, especially under load. The variation became noticeable once a server passed roughly 10k connections, and got worse as the number of connections increased. Given that it’s not uncommon for some of our servers to handle in the region of 100k connections during busy times, I decided to have a closer look at what was going on. On one server with ~60k TCP connections in various states, the script was pegging a CPU core at 100% and still taking around 10s to complete, which is not a good use of valuable resources.
Taking a look at the code of this plugin for the first time, everything looked pretty reasonable and nicely readable, but the large regular expression run against every line of /proc/net/tcp looked glaringly suspicious. As my skills with awk are greater than my skills with Ruby, I decided it would be quicker for me to simply rewrite the check using a tool that was built for running efficiently over large text files. The result, a few minutes later, was metrics-netstat-tcp.awk. Although the parameters are not the same, the output and functionality match, making it an almost, but not quite, drop-in replacement.
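To give a rough idea of the approach, here is a minimal sketch of counting sockets per state with awk. It is not the actual metrics-netstat-tcp.awk script. The function name tcp_state_counts, its two arguments, the default "tcp" scheme and the Graphite-style "path value timestamp" output are all assumptions made for illustration. The one thing it relies on from the kernel is that the fourth whitespace-separated column of /proc/net/tcp holds the socket state as a two-digit hex code, so a plain field lookup can stand in for a regular expression.

```shell
#!/bin/sh
# Hypothetical helper (illustration only, not the real plugin).
# Usage: tcp_state_counts [file] [scheme]
#   file   - a /proc/net/tcp-style file (defaults to /proc/net/tcp)
#   scheme - assumed metric prefix for the output lines (defaults to "tcp")
tcp_state_counts() {
    awk -v scheme="${2:-tcp}" -v now="$(date +%s)" '
        BEGIN {
            # Hex codes used in the "st" column, as defined by the Linux kernel.
            name["01"] = "ESTABLISHED"; name["02"] = "SYN_SENT"
            name["03"] = "SYN_RECV";    name["04"] = "FIN_WAIT1"
            name["05"] = "FIN_WAIT2";   name["06"] = "TIME_WAIT"
            name["07"] = "CLOSE";       name["08"] = "CLOSE_WAIT"
            name["09"] = "LAST_ACK";    name["0A"] = "LISTEN"
            name["0B"] = "CLOSING"
        }
        NR == 1 { next }                      # skip the column header row
        ($4 in name) { count[name[$4]]++ }    # field 4 is the state code
        END {
            # Emit every known state, printing 0 for states with no sockets.
            # The iteration order of "for ... in" is unspecified in awk.
            for (code in name)
                printf "%s.%s %d %s\n", scheme, name[code], count[name[code]], now
        }
    ' "${1:-/proc/net/tcp}"
}
```

Calling tcp_state_counts /proc/net/tcp "$(hostname).tcp" would print one line per state. The per-line work is just a field split and an array increment, which is the kind of job awk was designed for.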
The more important result for me, though, is that collecting the metrics on a machine with ~60k connections now completes in under 60ms instead of around 10s. Hopefully the lesson for everyone else is that the older tools are still around for a reason, and that it pays to know when and how to pick the right tool for the job.