Open Source ETL Tools

Moving spatial data into Postgres isn’t difficult to script. Suppose you have the shapefile ponies.shp in WGS84, and you want to dump it into PostGIS as ponies. A plain old batch script like this will do it for you:

shp2pgsql -d -s 4326 -i -I -D ponies.shp ponies > ponies.sql
psql -h server -d database -U username -f ponies.sql
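
(For the curious: -d drops and recreates the table, -s sets the SRID, -i loads integer fields as int4, -I builds a GiST spatial index, and -D writes PostgreSQL dump format, which loads a lot faster than plain INSERT statements.)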

That’s really it. You should probably vacuum the table after load, so there’s another line. You might want to set some particular permissions, which is another line, or get the data from SDE, which is another line (sde2shp is the easiest - just copy the SDE BIN folder to the running machine and put it in your path). But worst case scenario - 5 lines of code. On Windows I have a bat file that takes 4 arguments so I can cram a shapefile into Postgres without having to do anything rash like thinking.

@echo off

IF "%1"=="" GOTO DisplayUsage
IF "%2"=="" GOTO DisplayUsage
IF "%3"=="" GOTO DisplayUsage
IF "%4"=="" GOTO DisplayUsage

shp2pgsql -d -s %4 -i -I -D %1.shp %2 > %2.sql
psql -h %3 -d database -U username -f %2.sql
psql -h %3 -d database -U username -c "vacuum analyze %2"
psql -h %3 -d database -U username -c "grant select on %2 to read_only_login"
GOTO :EOF

:DisplayUsage
echo Command usage: pgload.bat shape_file_name postgis_layer_name server_name srid
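
Loading the ponies shapefile from earlier then becomes a one-liner:

pgload.bat ponies ponies server 4326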

That’s for ad-hoc stuff. For scheduled stuff, I have a Python script that hits a Postgres table with a record for each layer showing where to get it and what to call it. I schedule the Python script to run periodically at night, and Bob’s your uncle.
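
A minimal sketch of that kind of scheduled loader, assuming a hypothetical control table named load_manifest with source_path, layer_name, and srid columns (the table, column names, and connection details are all placeholders), might look something like this:

import subprocess
import psycopg2

# Read the control table: one record per layer to load
conn = psycopg2.connect(host="server", dbname="database", user="username")
cur = conn.cursor()
cur.execute("SELECT source_path, layer_name, srid FROM load_manifest")
layers = cur.fetchall()
conn.close()

for source_path, layer_name, srid in layers:
    # Dump the shapefile to SQL, same flags as the bat file
    dump = subprocess.run(
        ["shp2pgsql", "-d", "-s", str(srid), "-i", "-I", "-D",
         source_path, layer_name],
        check=True, capture_output=True)
    # Pipe the SQL into Postgres
    subprocess.run(
        ["psql", "-h", "server", "-d", "database", "-U", "username"],
        input=dump.stdout, check=True)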

But I also have to load some data that isn’t already in a spatial format. Sometimes I get it as a text file, sometimes I have to scrape it off a SQL Server, and sometimes I have to drag it kicking and screaming out of an Access database (I use the term “database” for this last item very loosely). Scripting these types of things by hand can be tedious, error-prone, and apt to run into unexpected problems. When you start getting into these types of jobs, it’s time to look for an ETL tool.
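
To give a flavor of the text-file case, here’s a rough sketch, assuming a hypothetical incidents.csv with name, x, and y columns in WGS84 (the file name, columns, and connection details are all made up for illustration), that loads the rows and builds point geometry from the coordinates:

import csv
import psycopg2

conn = psycopg2.connect(host="server", dbname="database", user="username")
cur = conn.cursor()
# Assumes PostGIS 2.0+ typmod syntax for the geometry column
cur.execute("""
    CREATE TABLE IF NOT EXISTS incidents (
        id serial PRIMARY KEY,
        name text,
        geom geometry(Point, 4326)
    )""")

with open("incidents.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Build the point geometry from the X and Y columns on the way in
        cur.execute(
            "INSERT INTO incidents (name, geom) "
            "VALUES (%s, ST_SetSRID(ST_MakePoint(%s, %s), 4326))",
            (row["name"], float(row["x"]), float(row["y"])))

conn.commit()
conn.close()

That works for one well-behaved file; it’s when you have a dozen of these, each with its own quirks, that an ETL tool starts earning its keep.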

ETL stands for extract, transform, and load. As in extract data from a data source, transform it into what you need, and load it somewhere else. There are lots of ETL tools out there, and some proprietary databases ship with their own. I’ve looked at a number of open source ETL solutions, and I always end up coming back to the same two: Talend Open Studio and Pentaho Kettle. I’m not going to recommend one over the other; although I use Kettle, I can’t find a major fault with either one.

Both of these ETL solutions have a lot of the same types of features:

  • Both are Java-based and run on Linux and Windows (Talend is Eclipse-based)
  • Visual designers for creating the transformations
  • Connectivity for a myriad of databases, including all the big DBs, text formats, etc.
  • Support for distributing jobs across multiple servers if you are doing serious lifting
  • Excellent error handling and error notification systems
  • Active and helpful forums (Kettle is older and seems to have the larger community, however)
  • Free and open source
  • They are complex enough to handle a lot of tasks, but not so much as to kill you
  • There are versions of both (GeoKettle and Spatial Data Integrator) that are tailored for GIS

The biggest difference between the two is in how they approach ETL. Talend Open Studio is, generally speaking, a code generator - it takes your job and scripts it out for you, in either Perl or Java. Kettle is metadata-driven - you are essentially building an XML description of a transformation, and you run a Kettle bat or sh script, pointing it at your file, to run the transformation (along with arguments for log files and whatnot). There are pros and cons to either approach, but both will get the job done.
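
To give a rough idea, kicking off a saved Kettle transformation from the command line looks something like the following (the path, log file, and logging level here are just illustrative):

pan.sh -file=/path/to/transform.ktr -log=transform.log -level=Basic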

I use Kettle, and after spending several hours figuring everything out, I’m very happy with it. The first job I tossed at it included connecting to a SQL Server database and joining a gob of tables with a heinous SQL call, transforming a few fields, tossing the result into Postgres, and building geometry on the X and Y columns, along with some error trapping. Runs like a top. And it’s easy to run on a server, as you just unzip the PDI folder there and hit the pan.bat or pan.sh script. I haven’t tried either of the GIS-centric versions of these ETL tools, just because that part of my life isn’t complex enough to rule out a little hand-coding, but they seem to have some fairly advanced functionality for dealing with open GIS formats, like shapefiles and PostGIS layers.

There are a lot of proprietary and open source ETL tools out there, so if you’re in the ETL market, it pays to do some Googling. If you’re looking for some great open source ETL packages that can handle a wide variety of tasks, I highly recommend checking out either Talend Open Studio or Pentaho Kettle.