We currently use Sensu to monitor our environment, and I’ve taken to using standalone checks to collect various metrics. Standalone metrics don’t rely on the server to issue a check request which provides more reliable interval between checks. One of the metrics we collect is the number of TCP sockets in each of the possible states on each server. We started off using the metrics-netstat-tcp.rb check from the excellent set of Sensu community plugins.
This plugin was doing the job quite nicely, until I noticed that some of our machines had widely varying intervals for publishing this data, especially when under load. This started to be noticeable once the server passed roughly 10k connections, and got worse as the number of connections increased. Given that it’s not uncommon for some of our servers to handle in the region of 100k connections during busy times, I decided to have a closer look at what was going on. Closer inspection on one of the servers revealed that the script was pegging a CPU core at 100% and still taking around 10s to complete when a server had ~60k TCP connections in various states – not a good use of valuable resources.
Taking a look at the code of this plugin for the first time, everything looked pretty reasonable and nicely readable, but the large regular expression running for every line in /proc/net/tcp looks glaringly suspicious. As my skills with awk are greater than my skills with ruby, I decided it would be quicker for me to simply rewrite the check using a tool that was built for running efficiently over large text files. The result a few minutes later was metrics-netstat-tcp.awk. Although the parameters are not the same, the output and functionality matches making it an almost but not quite drop in replacement.
The more important feature for me though is that collecting the metrics on a machine with ~60k connections now completes in under 60ms instead of around 10s. Hopefully the lesson for everyone else is that the older tools are still around for a reason, and you need to know when and how to pick the right tool for the right job.
I’ve just release Ecks into the wild, a Python library for accessing SNMP data from a server without having to deal with the pain of knowing about what a MIB or OID is. SNMP stands for Simple Network Management Protocol, but for most people it is anything but simple. It’s pretty straight forward once you understand what’s going on, but most people are daunted by the learning curve.
What results from this resistance is that when your average developer decides he wants to monitor CPU usage or disk space on his machine he or she ends up doing it in the most obtrusive way possible – SSH. While I’m a big fan of small shell scripts, this is one place they do not belong. Let me give you an example:
I set up a new server here in London for one of our Chicago teams. Being a conscientious team, the first thing they did was wire in some monitoring that wrote for their servers. It checks things like disk space, memory usage, CPU load and the state of various processes that they care about. They need pretty fine grained checking intervals, so they check these every minute. The easiest way they know how to do this though is to SSH in to their machines and run df, free, netstat,etc and scrape the output. Every minute. Which on this nice shiny server consumed almost 20% of the CPU right off the bat. Educating them on the use of SSH ControlMaster helped, but it’s still doing a lot of work on the machine.
This was the last straw that lead to the creation of Ecks. People will always follow the path of least resistance, so if you want people to do the right thing, you need to make it the easiest thing to do. SNMP has all this information available, modern snmpd implementations are stable, have a tiny footprint and are more secure than providing SSH access to your machine.
The hardest part of all though is what to name this little library. When discussing the problem with Julian Simpson (the @builddoctor), he pointed out that MIB always reminded him of the Men in Black. Reading the Wikipedia article on the original comic book series had some interesting snippets:
The Men in Black are a secret organization that monitors and suppresses paranormal activity on Earth…
Replace “Earth” with “a computer” and you’re starting to get somewhere. Then I noticed this gem:
An agent named Ecks went rogue after learning the truth behind the MiB: they seek to manipulate and reshape the world in their own image by keeping the supernatural hidden.
Many people think that the complexity of the MIB keeps SNMP data hidden. And so the name was chosen…
I found a link to Above the Clouds, a paper on Cloud Computing recently published by a quartet of UC Berkeley RAD Lab professors. I’ve been quite disappointed with publications on the subject of the latest buzzword taking the world by storm right now, so I was not expecting much when I first clicked on the link. The thing is, as I started reading through the Executive Summary it all sounded very familiar. The outline the give in the summary follows the same outline as a talk I gave in November last year at the ThoughtWorks London office for the London Java Community.
The only criticism I have is that they don’t put enough emphasis on one of my key reasons for why it’s suddenly taken off. Cloud computing is not a new idea – it’s an extension of the Utility Computing that John McCarthy talked about in 1961. Although they only make a passing remark in section 3, I think one of the most important reasons it’s taken off is that the services Amazon provide were the first that were not a “solution looking for a problem”. Earlier offerings by the likes of Sun, HP and Intel all created a solution that they tried to sell to clients. The problem was that there were remarkably few problems that their solutions solved. Amazon simply exposed services that they were using internally already. That’s not to say the other reasons they give are not valid, I totally agree with them. I think they just missed a good point.
One of the topics I only glanced over is covered cover quite well in section 6 – Cloud Computing Economics. They provide some interesting example cost calculations. Although the numbers are obviously US centric, they do provide a nice way for a company to approach making the old “build vs buy” comparison.
In summary, I highly recommend this paper for anyone who wants to get the head around what this Cloud stuff is all about and what they need to do to prepare for it.