Using Python to Manage Windows Services

I recently ran into a little problem with a Windows service.

Apache Tomcat, about once every two weeks or so, simply croaks on one of our servers. When I say croaks, I mean croaks: a hit to :8080 gives a fast no-server-listening-whatsoever message.

The Tomcat logs lead me to believe it’s a Java memory issue, but since that service is moving to a new virtual server soon, I don’t want to dig too deep yet in case the problem just goes away after the move (lazy > smart). That doesn’t mean I want to spend a lot of time periodically checking on it, however.

There are any number of great enterprise monitoring tools for this sort of thing, such as Nagios and Zenoss, which are for the most part free and open source. That would have been like squirrel hunting with a bazooka in this case, however, and neither product would jibe well with our Linux-phobic IT department. But a little Python can easily do the trick.

First you need to install the Python for Windows extensions (pywin32), which add a lot of neat modules, including the one we’re looking for: win32serviceutil. Now we build a simple function to stop, start, restart, or query the status of a Windows service.

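import win32serviceutil

def service_manager(action, machine, service):
    """Stop, start, restart, or report the status of a Windows service."""
    if action == 'stop':
        win32serviceutil.StopService(service, machine)
    elif action == 'start':
        # StartService takes args as its second parameter, so pass machine by keyword
        win32serviceutil.StartService(service, machine=machine)
    elif action == 'restart':
        # Same for RestartService: machine has to be a keyword argument
        win32serviceutil.RestartService(service, machine=machine)
    elif action == 'status':
        # Index 1 of the status tuple is the current state; 4 means running
        if win32serviceutil.QueryServiceStatus(service, machine)[1] == 4:
            print("%s is happy" % service)
        else:
            print("%s is being a PITA" % service)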
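
The magic number 4 in the status check is the Win32 "running" state. As a rough sketch of how you might report the other states too, here is a small lookup table built from the standard Win32 service state codes. The names SERVICE_STATES and describe_state are made up for this example, and the sample tuple is hypothetical; it is only shaped like what QueryServiceStatus hands back.

```python
# Win32 service state codes (the dwCurrentState values of SERVICE_STATUS).
SERVICE_STATES = {
    1: "stopped",
    2: "start pending",
    3: "stop pending",
    4: "running",
    5: "continue pending",
    6: "pause pending",
    7: "paused",
}

def describe_state(status):
    """Return a readable name for the state in a QueryServiceStatus-style tuple.

    describe_state is a hypothetical helper, not part of pywin32.
    """
    state = status[1]  # index 1 holds the current state
    return SERVICE_STATES.get(state, "unknown (%d)" % state)

# Hypothetical status tuple, shaped like the one QueryServiceStatus returns.
sample_status = (16, 4, 5, 0, 0, 0, 0)
print("Service is %s" % describe_state(sample_status))
```

In the status branch you could then print describe_state(win32serviceutil.QueryServiceStatus(service, machine)) instead of a plain happy/unhappy message.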

Now we’re off and running. Since the problem here is that Tomcat just up and dies, a simple urlopen against Tomcat’s port will do it.

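import socket
import urllib.error
import urllib.request
import win32serviceutil

def service_manager(action, machine, service):
    ...  # same body as above

socket.setdefaulttimeout(30)
try:
    f = urllib.request.urlopen("http://servername:8080/")
    print("Tomcat is smokin'.")
except urllib.error.HTTPError:
    # Tomcat answered with an error status, so it is still listening
    print("Tomcat is smokin'.")
except OSError:
    # Connection refused or timed out: nothing is talking on that port
    print("Tomcat is dead. Restarting the service.")
    service_manager("restart", "servername", "Apache Tomcat")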

This little bit of code sets the default connection timeout to 30 seconds, so the script will wait up to 30 seconds for the URL to load into f. If that runs without error, we’re golden. If a socket error occurs (i.e. Tomcat is not talking), the except block is executed, restarting the Tomcat service. Set the tiny script to run every 10 minutes or so and, Bob’s your uncle, you have a self-healing server.
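
If you want to reuse the check for more than one server, the same idea can be wrapped in a couple of functions. This is only a sketch, written for Python 3: tomcat_is_alive and check_and_heal are names I made up, the restart argument is any callable you supply, and the opener parameter exists purely so the logic can be tried out without a real server. It passes the timeout per request instead of changing the global socket default.

```python
import urllib.error
import urllib.request

def tomcat_is_alive(url, timeout=30, opener=urllib.request.urlopen):
    """Return True if anything answers at url within timeout seconds."""
    try:
        opener(url, timeout=timeout).close()
        return True
    except urllib.error.HTTPError:
        # The server replied with an error status, so it is still listening.
        return True
    except OSError:
        # Connection refused, name lookup failure, or timeout.
        return False

def check_and_heal(url, restart, timeout=30, opener=urllib.request.urlopen):
    """Call restart() if url does not answer. Return True if restart was called."""
    if tomcat_is_alive(url, timeout=timeout, opener=opener):
        return False
    restart()
    return True
```

With the service_manager function from earlier, a call might look like check_and_heal("http://servername:8080/", lambda: service_manager("restart", "servername", "Apache Tomcat")).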

You can do quite a few service management tricks with this simple code. Say you’ve got an ArcIMS server that’s croaking on you (not an uncommon thing). Wouldn’t it be nice if the services would simply restart themselves?

Let’s say you have a map service named map_service on a server named servername (we’re also assuming you’re incredibly uncreative here). ArcIMS is making images that look like so:
map_service_SERVERNAME24642500375.xxx

If ArcIMS has taken a walk, a hit to a web page with the map on it won’t have a link to your map image. So we hit the page, look for the “map_service_SERVERNAME” text string, and if we don’t find it, we restart the services:

import urllib.request

f = urllib.request.urlopen("http://URL-to-page-with-map/")
# read() returns bytes, so decode before searching for text
s = f.read().decode("utf-8", errors="replace")

if "map_service_SERVERNAME" not in s:
    print("Restarting ArcIMS.")
    # Stop the three services, then start them again in the reverse order
    service_manager("stop", "servername", "ArcIMS Tasker 9.x")
    service_manager("stop", "servername", "ArcIMS Monitor 9.x")
    service_manager("stop", "servername", "ArcIMS Application Server 9.x")
    service_manager("start", "servername", "ArcIMS Application Server 9.x")
    service_manager("start", "servername", "ArcIMS Monitor 9.x")
    service_manager("start", "servername", "ArcIMS Tasker 9.x")
else:
    print("ArcIMS OK! Send ESRI a check!")

Voilà - if ArcIMS is down, our script will spank it back to life while we’re spinning about in our chairs. The maximum down time will be service_death + time_to_next_script_run + arcims_comes_back_to_life_time, the latter of which can take a while. But the customer-calls-help-desk-calls-you-login-manually-restart process takes even longer, and there’s no telling how long it’s down before it gets to your desk.
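
The ArcIMS restart stops three services in one order and starts them again in the reverse order. If you end up doing that for several groups of services, a tiny helper keeps the ordering in one place. This is just a sketch: restart_in_order and ARCIMS_SERVICES are names invented here, and stop and start are whatever callables you pass in.

```python
def restart_in_order(services, stop, start):
    """Stop each service in the given order, then start them in reverse order.

    stop and start are callables that take a service name.
    """
    for name in services:
        stop(name)
    for name in reversed(services):
        start(name)

# The three services from the ArcIMS example, in stop order.
ARCIMS_SERVICES = [
    "ArcIMS Tasker 9.x",
    "ArcIMS Monitor 9.x",
    "ArcIMS Application Server 9.x",
]
```

Hooked up to the earlier function, that would be something like restart_in_order(ARCIMS_SERVICES, lambda s: service_manager("stop", "servername", s), lambda s: service_manager("start", "servername", s)).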

Of course you can do quite a bit more with this stuff - send yourself an email or SMS message, log things to a file or the event log, etc. - but this should give you some ideas. Python is a great language for a lot of different server management tasks.
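
As a starting point for the logging and email ideas, here is a minimal sketch using only the standard library. The function names, the logger name, the log format, and the wording of the message are all assumptions made up for this example. It builds the email but does not send it; for that you would hand the message to smtplib with your own mail server details.

```python
import logging
from email.message import EmailMessage

def make_logger(path):
    """Return a logger that appends timestamped lines to the file at path."""
    logger = logging.getLogger("service_watchdog")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(path)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger

def build_alert(service, machine, sender, recipient):
    """Build (but do not send) an email saying a service was restarted."""
    msg = EmailMessage()
    msg["Subject"] = "Restarted %s on %s" % (service, machine)
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("The watchdog script restarted %s on %s." % (service, machine))
    return msg
```

You would call make_logger once at the top of the script, log a warning right before each restart, and pass the result of build_alert to smtplib.SMTP(...).send_message() if you want the email as well.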