“Did you mean FOOBAR?” in PostgreSQL text indexing

This is a reminder-to-self kind of post.  Chris Rossi and I were talking about some of the “improved search” features we have planned for KARL thanks to pgtextindex and the use of PostgreSQL’s under-appreciated text indexing/retrieval story.

One thing we think we could add is “Did you mean?” support.  That is, if someone did a typo on a search term, offer some alternatives.  It’s a valuable feature (at least to me.)

The most natural thing to think about is brute-force spellchecking.  That has a number of flaws.  First, lots of things (names, for example) won’t be in any dictionary, much less some default language dictionary.  Second, you can’t show a bunch of corrections, which ones should you show?  Finally, what if you show a correction for which the word doesn’t occur in the corpus?

In Fuzzy String Matching With Trigram and Trigraphs you see how PostgreSQL helper functions can, well, help.  You can compare a term to all the reduced forms of all the words indexed in all the documents.  Then issue a query that does a soundex-style comparison to the query terms.  All using optimized indices, weighting, and scoring.  Perhaps you could also use corpus statistics to narrow the list down to words that occur in many documents.

Chris and I previously came across this when he had a neat idea on speeding up prefix searches.  Rather than do a prefix, instead find all the expansions of words/lexemes for that prefix which occur in the corpus, select the 100 most likely candidates, and do full-word searches for those 100 words.  You get better performance by omitting prefixes.  One might assert you get better quality results by letting all the full-word scoring machinery, synonyms, stemming, etc. kick in.

It’s interesting to look at some of the machinery PostgreSQL has which doesn’t get talked about much.  Like the ltree for hierarchies and tree structures.

About these ads

2 Responses to ““Did you mean FOOBAR?” in PostgreSQL text indexing”

  1. Marius Gedminas Says:

    I’d love to see more posts on this topic. Especially with small, easily digestible, code samples ;)

    • Paul Everitt Says:

      Hi Marius. Do you want code samples on the implementation, or on using it? If the latter, it’s just repoze.catalog with (at the moment) the same interface as zope.index. Switching back and forth in a project between Zope-based textindexing and PostgreSQL is invisible.

      Perhaps what you’d prefer is a buildout, to get all the bootstrapping setup?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


Follow

Get every new post delivered to your Inbox.

Join 411 other followers

%d bloggers like this: