This is a reminder-to-self kind of post. Chris Rossi and I were talking about some of the “improved search” features we have planned for KARL thanks to pgtextindex and the use of PostgreSQL’s under-appreciated text indexing/retrieval story.
One thing we think we could add is “Did you mean?” support. That is, if someone did a typo on a search term, offer some alternatives. It’s a valuable feature (at least to me.)
The most natural thing to think about is brute-force spellchecking. That has a number of flaws. First, lots of things (names, for example) won’t be in any dictionary, much less some default language dictionary. Second, you can’t show a bunch of corrections, which ones should you show? Finally, what if you show a correction for which the word doesn’t occur in the corpus?
In Fuzzy String Matching With Trigram and Trigraphs you see how PostgreSQL helper functions can, well, help. You can compare a term to all the reduced forms of all the words indexed in all the documents. Then issue a query that does a soundex-style comparison to the query terms. All using optimized indices, weighting, and scoring. Perhaps you could also use corpus statistics to narrow the list down to words that occur in many documents.
Chris and I previously came across this when he had a neat idea on speeding up prefix searches. Rather than do a prefix, instead find all the expansions of words/lexemes for that prefix which occur in the corpus, select the 100 most likely candidates, and do full-word searches for those 100 words. You get better performance by omitting prefixes. One might assert you get better quality results by letting all the full-word scoring machinery, synonyms, stemming, etc. kick in.
It’s interesting to look at some of the machinery PostgreSQL has which doesn’t get talked about much. Like the ltree for hierarchies and tree structures.