It’s been way too long since I’ve posted anything here, so I’ve decided to resume again with what I’ve been busy with lately.
We’ve been using Logstash for quite a while now, and one of the annoyances I’ve had is that with the default configuration we threw together we only got second resolution on our timestamps. 99% of the time this has been good enough for the simple things we’ve been trying to keep an eye on, but now and again it’s resulted in trouble stitching together the exact sequence of events when trying to diagnose a problem.
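To illustrate the ordering problem, here is a minimal Python sketch. The events and their timestamps are made up for illustration, and the function name order_events is hypothetical. It sorts the same three events twice: once with timestamps truncated to whole seconds, and once with full sub-second precision.

```python
from datetime import datetime

# Hypothetical events, listed in the order they happened to arrive.
# Timestamps are ISO 8601 / RFC 3339 style with microseconds.
events = [
    ("2012-03-01T10:15:30.900100+00:00", "connection closed"),
    ("2012-03-01T10:15:30.120400+00:00", "request received"),
    ("2012-03-01T10:15:30.450000+00:00", "query failed"),
]


def order_events(events, truncate_to_seconds=False):
    """Return the event messages sorted by timestamp."""
    def sort_key(event):
        timestamp = datetime.fromisoformat(event[0])
        if truncate_to_seconds:
            # Simulate second-resolution logging by dropping the fraction.
            timestamp = timestamp.replace(microsecond=0)
        return timestamp

    return [message for _, message in sorted(events, key=sort_key)]


# With second resolution all three events tie, so the (stable) sort
# simply leaves them in arrival order.
coarse = order_events(events, truncate_to_seconds=True)

# With sub-second resolution the real sequence can be recovered.
fine = order_events(events)
```

With whole seconds there is nothing to distinguish the three events, so the sort cannot tell which came first.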
Rsyslog is the primary tool we use to get data into Logstash in the first place. As we don’t have the luxury of using anything cloud oriented we need to care about our hardware too, and using syslog makes this very easy. Add to this how simple it is to get most applications to log via syslog too and it makes using it a no brainer.
Getting the high resolution timestamps into your rsyslog messages is actually really simple. Just make sure you have the following configuration option set in your rsyslog configuration:
I’ve just released Ecks into the wild, a Python library for accessing SNMP data from a server without having to deal with the pain of knowing about what a MIB or OID is. SNMP stands for Simple Network Management Protocol, but for most people it is anything but simple. It’s pretty straightforward once you understand what’s going on, but most people are daunted by the learning curve.
What results from this resistance is that when your average developer decides they want to monitor CPU usage or disk space on a machine, they end up doing it in the most obtrusive way possible: SSH. While I’m a big fan of small shell scripts, this is one place they do not belong. Let me give you an example:
I set up a new server here in London for one of our Chicago teams. Being a conscientious team, the first thing they did was wire in some monitoring that they wrote for their servers. It checks things like disk space, memory usage, CPU load and the state of various processes that they care about. They need pretty fine-grained checking intervals, so they check these every minute. The easiest way they know how to do this though is to SSH in to their machines, run df, free, netstat, etc. and scrape the output. Every minute. Which on this nice shiny server consumed almost 20% of the CPU right off the bat. Educating them on the use of SSH ControlMaster helped, but it’s still doing a lot of work on the machine.
This was the last straw that led to the creation of Ecks. People will always follow the path of least resistance, so if you want people to do the right thing, you need to make it the easiest thing to do. SNMP has all this information available, and modern snmpd implementations are stable, have a tiny footprint and are more secure than providing SSH access to your machine.
The hardest part of all though is what to name this little library. When discussing the problem with Julian Simpson (the @builddoctor), he pointed out that MIB always reminded him of the Men in Black. Reading the Wikipedia article on the original comic book series had some interesting snippets:
The Men in Black are a secret organization that monitors and suppresses paranormal activity on Earth…
Replace “Earth” with “a computer” and you’re starting to get somewhere. Then I noticed this gem:
An agent named Ecks went rogue after learning the truth behind the MiB: they seek to manipulate and reshape the world in their own image by keeping the supernatural hidden.
Many people think that the complexity of the MIB keeps SNMP data hidden. And so the name was chosen…
With the price of storage dropping all the time, there is a constant perception among people who don’t deal with it every day, especially developers, that “disk space is cheap”. The problem is that so-called “Enterprise” storage costs are still astronomical compared to what people are used to paying for home storage, even when using SATA disks.
A lot of this extra cost comes from a perceived requirement for the highest available capacity, availability and performance. Achieving all three characteristics is expensive, but if you’re willing to sacrifice one of them then costs start to fall considerably. Lowering requirements on two of the three drops it even more.
One of the teams I work with has a requirement primarily on capacity. Performance and availability are nice, but capacity is the key. We generate gigabytes worth of log files every day, but didn’t have one place to store it all for easy analysis. Just before I joined the team they’d purchased the cheapest “Enterprise” storage system the IT team at the time would allow – it ended up costing in the region of £12k for 12TB of raw storage. That’s £1000 per TB!
In addition to the price, the other problems were getting at the data, managing it and managing its growth. This inspired a hunt for something that would provide a cheaper and more flexible solution.
Our requirements were:
- *nix based system. The current storage solution was based on Windows Storage Server, but all our systems and tools for this team are Linux based. Yes, Windows does technically provide things like an NFS server, but fighting with the file system permissions and overall performance are two things that impacted us.
- Cheap to expand. We need to have a clear path to grow the storage in the server easily by simply adding more disks.
- Large filesystems. There’s nothing more wasteful from a storage point of view than having lots of small filesystems. Besides the management overhead, there are also many wasted blocks lying around unused.
- Cheap to build. This inevitably means commodity hardware.
- Reasonable availability. We don’t need 99.999% uptime, but would be happy with somewhere in the region of 90%+
- Reasonable performance. Primary access to the data on this machine is via gigabit Ethernet. As long as it can keep up with the network card we’re happy…
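To put the availability and performance requirements into numbers, here is a rough back-of-the-envelope sketch in Python. The function names are made up for illustration, and the throughput figure is the theoretical line rate of gigabit Ethernet, ignoring protocol overhead.

```python
HOURS_PER_YEAR = 365 * 24  # ignoring leap years


def downtime_hours_per_year(availability_percent):
    """Hours of downtime per year allowed by a given availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


def gigabit_bytes_per_second():
    """Theoretical gigabit Ethernet line rate in bytes per second."""
    bits_per_second = 1_000_000_000
    return bits_per_second / 8


# "Five nines" allows only a few minutes of downtime per year...
five_nines_minutes = downtime_hours_per_year(99.999) * 60

# ...while 90% allows several weeks.
ninety_percent_days = downtime_hours_per_year(90) / 24

# The disks only need to sustain roughly this much to saturate the link.
link_megabytes_per_second = gigabit_bytes_per_second() / 1_000_000
```

That works out to a little over five minutes a year at 99.999%, about 36.5 days a year at 90%, and a ceiling of around 125 MB/s for the network card.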
When I got back from DevOpsDays in Hamburg this year I felt the need to explain my journey to the “DevOps” world and my view on where it’s headed. I started writing a State of the Nation paper to lay it all out. A couple of weeks in I got a message from Matthias Marschall asking if I’d like to do a guest post as part of their DevOps series. I agreed, and after a lot of effort (and help from a couple of great editors) you can now read it here.
It’s by far the longest article I’ve ever written, and I was amazed at how ideas that had been floating around in my head for a while crystallized through the process of writing them down. I found I got so passionate talking to people about what was in it that I’ve decided to make a talk out of it, the first iteration of which will be in Chicago on Tuesday (see previous post).
I’ll be in Chicago next week and I’ll be speaking at the local DevOps group meetup on the 7th of December. Places are limited, so please sign up here if you’d like to attend.
Julian Simpson, aka The Build Doctor, has been working away at a nice web based Build Status Monitor called XFD for a while now. One of my complaints for years has been that there’s no nice build status tool that’s easy to use, but I think he’s on to something.
It’s entered in the Ultimate Wallboard Challenge, and you can vote for it here.
A few years ago Martin Fowler introduced me and a few other ThoughtWorkers who were involved in the Continuous Integration and Deployment space to an editor he knew and told us we needed to write a book about what we were doing. The key thing we were focusing on was making sure that quality software could be released to production in a reliable and repeatable way. Software has absolutely no value until it’s running in production. If you can’t get it into production quickly and easily then you’re just wasting time and money.
There were a few false starts and a few changes of crew, but eventually Jez Humble and Dave Farley stuck it out all the way to the end and Continuous Delivery was published this year. No matter what you do in your organisation, if you’re filled with dread at the thought of a new software release then you need to buy it and read it and do what it says. Now. This article will still be here after you’ve ordered it. Although I didn’t have enough time to be a big contributor, Jez would drag me in front of a whiteboard whenever our paths crossed to discuss things, and he kept sending me drafts of key chapters that I had a specific interest in, so I am proud to have helped in at least some small way.
One of the things he did do was include a mention of a project that Tom Sulston and I started to make configuration management easier. It’s called ESCape, and I’ve written and talked about it a few times in a few different places. As more and more people are reading the book, though, I’m getting more questions about what the status of the project is and what our plans are going forward.
At the moment the project is in hibernation. We’ve not made any changes for over a year now, and I don’t think I’ll be working on it in its current incarnation any time soon. That does not mean it’s not based on a good idea though! It’s more a problem of implementation.
At its heart, ESCape is supposed to be a simple way to manage a hierarchical key/value store. I like the way the UI works; Dan North even has a wonderful acronym for it that I can’t for the life of me remember right now. The real problem with the current design is how we’re storing the data. Trying to wedge that kind of data into a relational database always felt dirty and I decided to stop before any real damage was done.
What I’d like to do though is to take the existing UI and functionality and use something like Neo4J or CouchDB to store the data. Conversations I’ve had with Jim Webber and Ian Robinson about it were one of the reasons I didn’t start immediately on a replacement as at the time Jim was making plans to write the REST interface into Neo4J. As an early release of it is now available I guess I’ve run out of excuses…
I’m busy wiring together a new server configuration environment using Windows Deployment Services (don’t ask), Cobbler and Chef. Things seemed to be going quite well until I bumped into the following error trying to get a new client to register with the Chef server:
HTTP Request Returned 401 Unauthorized: Failed to authenticate!
A quick sift through Google results didn’t get anything usable. A quick sniff of the packets going over the wire though showed that it was authenticating using a signed certificate. Normally when you sign HTTP requests like that you add some kind of timed expiry. Could the problem be clock related?
Sure enough, a quick check on the new client and the server showed that there was just over an hour time difference. Getting the time on the client and the server in sync got the client registered!
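If you want to check for this kind of problem before it bites, a rough sketch in Python is below. It compares the Date header from any HTTP response the server sends with the local clock. The function names are hypothetical, and the 15 minute tolerance is an assumption on my part, so check what your own server actually allows.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Assumed tolerance for signed requests; not taken from any server config.
MAX_SKEW_SECONDS = 15 * 60


def clock_skew_seconds(server_date_header, local_now):
    """Seconds the local clock is ahead of (+) or behind (-) the server."""
    server_time = parsedate_to_datetime(server_date_header)
    return (local_now - server_time).total_seconds()


def skew_is_acceptable(skew_seconds, max_skew=MAX_SKEW_SECONDS):
    """True if the clocks are close enough for timed signatures to work."""
    return abs(skew_seconds) <= max_skew


# Example values: the client clock is one hour and five minutes ahead.
header = "Tue, 07 Dec 2010 12:00:00 GMT"
client_now = datetime(2010, 12, 7, 13, 5, 0, tzinfo=timezone.utc)

skew = clock_skew_seconds(header, client_now)
ok = skew_is_acceptable(skew)
```

In real use you would pass datetime.now(timezone.utc) as the local time and read the header from an actual response. Fixing the clocks with NTP is still the real answer; this only tells you whether you need to.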
After more than 5 years at ThoughtWorks I’ve decided that it’s time for a change of scenery. As much as I enjoyed the challenge of consulting, meeting new people and seeing new places – I prefer spending time with my family more. It was a tough decision to make, but I think it’s the right one for me now. I will miss many people at TW, but on the plus side I’m again working with some great people who I missed when they left TW…
As I’m going to be doing DevOps type stuff all day every day at DRW now, I’m hoping I’ll have more time to document and share the things I discover.