Reboot...Reboot...Reboot...

And now, ladies and gentlemen, a rant on rebooting.

Paul Ramsey’s blog Clever Elephant had a great post recently titled “No One Ever Got Fired for Buying Linux.” This particular bit stuck in my head:

“Did you buy an expensive web mapping server and then have to put it on a nightly re-boot cycle to avoid service degradation? Don't worry, everyone is doing it, it doesn't reflect badly on you.”

I run Ubuntu Linux at home, with my hand-built Dark Tower being my main desktop. Aside from being my desktop, it is also my personal server. I run databases for development and testing (MySQL, PostgreSQL/PostGIS), ssh for remote access, a web server, a UPnP server for sharing media throughout the house, an rsync server for backups, all kinds of stuff. The only thing it doesn't run is a virus scanner, as it just isn't necessary. It runs 24x7*, and I don't reboot it. Ever.

Generally speaking, with modern Linux distributions reboots are only required when the Linux kernel itself is updated. Everything else, from your database engine to your web browser, will update without incident. This is fundamentally different from the Windows world, where an update to Notepad may require a reboot and a kernel update comes every ~4-5 years, costs you ~$200, and requires a good slagging of your hard drive.

But now I’m running Ksplice, which updates the Linux kernel without requiring a reboot, so even my rare kernel update reboot is gone. Barring a hardware failure or a power outage that lasts longer than my UPS, my Dark Tower never has to go down. Ever.

This is an image of our organization’s new Windows server reboot schedule. Every night of the week Windows servers are winking off and on like Christmas tree lights. Every Windows server in the county goes down at least once a week. The column on the far right is for servers that have to get rebooted every day.

I went through 4 of the 5 stages of grief in about a minute the first time I saw this. Denial lasted 10 seconds (the amount of time it took me to scan the halls for Ashton Kutcher). Anger took 40 seconds, just because I like the warm feeling anger gives off, and bargaining took 5 seconds while I realized the spreadsheet wasn’t going to negotiate with me. That left 5 seconds for depression, and it’s there that I’ve stayed.

I’m stuck in depression for this simple reason: it’s perfect. Our IT group is doing exactly the right thing. When you run Windows servers, this is absolutely the most responsible course of action to take. In order to both avoid security exploits and keep the darn things running, you have to do this. I can’t even fault them over the sheer number of servers, which largely derives from paranoia about running more than one thing on a Windows server, regardless of load. They are right to be paranoid. It’s very easy for two different pieces of software on a Windows server to cause problems for one another, and it’s often exceedingly difficult to debug. It is generally cheaper to have a separate server for each task, even if we have enough idle CPU time to wrap up all of SETI’s data in a week.

Of course, this sort of strategy brings its own set of problems. Every time a Windows server reboots, there are a lot of things that can go wrong (I’m looking at you, SQL Server services). Rebooting every week means those problems will occur that much more frequently. A developer recently talked to us about some intermittent errors in an application used by 3rd shift workers, and I basically had to forward him the spreadsheet and say good luck. But on balance, I can’t find fault with what they’re doing.

Still, I have a nagging voice in the back of my head saying THIS IS WRONG. It’s the same voice I hear when I’m on my laptop at work and the fan ratchets to full blast, the hard drive light stays on, and I walk away from my now useless PC while Norton pulverizes my hard drive scanning for Windows malware. Or when horrifically expensive map server software asks me about my “service refresh strategy” (read: we don’t trust our software not to break if left running for too long).

So I’ll conclude this rant with a question: as professionals, should we be embarrassed by things like this? Maybe that’s something everybody has to answer for themselves. My personal answer is keeping me from reaching the 5th stage of the grieving cycle - acceptance. But maybe that’s just me.

*80+ certified power supply and automatic CPU scaling, you hippies.