Using Fuzzy Search in PostgreSQL

Or…I’m an idiot for not thinking of this sooner.

One of the things I’ve spent the most time thinking about in web application design is search. In my experience, that’s the spot your users will most often face-plant. As a developer, when a user face-plants, you face-plant.

What I do to avoid face-plants is single box text entry with autosuggest. Your autosuggest has to be blindingly fast to be useful, so I do some input preprocessing to detect addresses. If the search string is a number, followed by a space, followed by a character of some kind, it’s an address, and I have the house number. If they got in “100 ma” in the search box, the SQL ends up being something like so:

1
2
3
4
5
SELECT * from addresses
WHERE
house_number = 100
AND
full_address LIKE '100 ma%'

The PostgreSQL query optimizer will get the tiny subset of our .5 million record address table with the house number, and then smack only that tiny subset with the ever-inefficient LIKE operator. Super-fast autosuggest. For ~95% of my users, this has eliminated search face-plants.

The other 5% are fast typists and outrun the autosuggest. The problem with this type of query is the match has to be exact. If they put in an extra space or key one thing differently from what’s in the database, no matches are found. It doesn’t matter if you wrote directions on what they’re supposed to be doing on every square inch of your site. News flash: nobody reads directions. They’re stuck. And I’m not sure if there’s a grant for a psychology study into this or not, but 100% of that 5% are going to find time in their busy day to email you various iterations of you suck. And, in fact, you (I) do. What we need to do is turn this exact match query into a fuzzy query.

My backend database for serious work is PostgreSQL. Fortunately, there are fuzzy search options available in PostgreSQL, they just aren’t installed by default. Head over to your install’s share/contrib/ folder and run fuzzystrmatch.sql. This will add some fuzzy matching functions for strings: soundex, Levenshtein, metaphone, and double metaphone. You can read about them all on the PostgreSQL docs site, but we only need soundex.

The soundex function converts a string into a numeric code based on what it “sounds” like. It’s perfect for this type of search.

With soundex, our SQL looks something like this:

1
2
3
4
5
SELECT * from addresses
WHERE
house_number = 100
AND
soundex(full_address) = soundex('100 ma')

Viola! Our fast typists are cured! Now we’ve screwed our slow typists!

How? We’re getting the soundex of the entire address in the database, which can look like “100 main st, charlotte nc 28202”. Our slow typist only managed to get in “100 ma” before the suggest fired in the middle of their hunt and peck. Those two don’t sound alike.

What we need to do is get the soundex of the full address for the length of the query string. Suppose our search string has 6 characters, like “100 ma”. Our query should really look like this:

1
2
3
4
5
SELECT * from addresses
WHERE
house_number = 100
AND
soundex(substring(full_address from 1 for 6)) = soundex('100 ma')

And…done. Search/autosuggest results that can withstand the many ingenious ways humans can describe where their mailboxes sit. I can’t believe I didn’t do this sooner. Maybe those you suck emails were on to something.*

*Nah.