State-of-the-art mobile search part 7: spelling correction

State-of-the-Art Mobile Search is a series exploring how to implement advanced mobile search.

Previous installment: State-of-the-Art Mobile Search Part 6: Search Execution

Rod Smith

Search query terms that are absent from the corpus may be treated as potentially misspelled words. A search engine can improve the search experience by identifying such terms through its inverted index and by proposing likely corrections using an edit distance metric and an n-gram language model.

Edit distance

Edit distance quantifies the noise in an expressed misspelling of an intended term based on an error model. The Levenshtein distance is such a model, where the edit distance is the number of single-character insertions, deletions, and replacements required to convert an intended term into an expressed misspelling. For example, given the misspelling *‘nathing’, its Levenshtein edit distance from the candidate term ‘matching’ is 2 (one replacement of ‘n’ for ‘m’ and one deletion of ‘c’), but its Levenshtein edit distance from ‘nothing’ is 1 (just one replacement of ‘a’ for ‘o’).

function editDistance(s1, s2) {
    if (s1.length == 0) {
        return s2.length;
    } else if (s2.length == 0) {
        return s1.length;
    } else if (s1[0] == s2[0]) {
        return editDistance(s1.substring(1), s2.substring(1));
    } else {
        return Math.min(
            1 + editDistance(s1, s2.substring(1)),             // insert s2[0]
            1 + editDistance(s1.substring(1), s2),             // delete s1[0]
            1 + editDistance(s1.substring(1), s2.substring(1)) // replace s1[0] with s2[0]
        );
    }
}
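This recursive formulation is written for clarity; it recomputes the same subproblems exponentially many times. A practical implementation memoizes the recursion or builds the Wagner–Fischer dynamic-programming table, which runs in O(|s1| × |s2|) time.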

In other edit distance models, transpositions, repetitions, and specific substitutions may contribute different weights to the total edit distance. For instance, an edit distance model for typed words may assign a larger distance to typing the character ‘a’ where ‘o’ was intended than to typing ‘n’ where ‘m’ was intended, because ‘n’ is closer to ‘m’ on the keyboard than ‘a’ is to ‘o’. Edit models based on the expression channel (e.g., the proximity of keys on the QWERTY keyboard) and on linguistic features of the expression language (e.g., phonetic and semantic similarities in English such as the suffixes “-er” and “-or” or the suffixes “-ery” and “-ary”) are very effective in practice.
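As a minimal sketch of such a channel-based model, a keyboard-proximity substitution cost might look like the following. The 0.5 and 1.0 weights are illustrative assumptions rather than tuned values, and only same-row key adjacency is considered, for brevity.

// Sketch of a weighted substitution cost based on QWERTY key proximity.
// The 0.5/1.0 weights are illustrative assumptions, and only same-row
// adjacency is checked, for brevity.
var qwertyRows = ["qwertyuiop", "asdfghjkl", "zxcvbnm"];

function substitutionCost(a, b) {
    if (a == b) {
        return 0; // no edit, no cost
    }
    for (var r = 0; r < qwertyRows.length; r++) {
        var i = qwertyRows[r].indexOf(a);
        var j = qwertyRows[r].indexOf(b);
        if (i >= 0 && j >= 0 && Math.abs(i - j) == 1) {
            return 0.5; // adjacent keys, e.g., 'n' for 'm': a likelier typo
        }
    }
    return 1.0; // distant keys, e.g., 'a' for 'o': full substitution cost
}

A weighted editDistance would then add substitutionCost(s1[0], s2[0]) in its replacement branch instead of a flat 1.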

Given a misspelling, a search engine can generate all strings within a given edit distance of the misspelling by adding, deleting, substituting, and transposing characters.

function make1Edit(spelling) {
    var respellings = [];
    for (var i = 0; i < spelling.length; i++) {
        // edit at position i
        var head = spelling.substring(0, i);
        var mid = spelling[i];
        var tail = spelling.substring(i + 1); // "" when i is the last position
        if (tail.length > 0) {
            respellings.push(head + tail[0] + mid + tail.substring(1)); // transpose i and i+1
        }
        respellings.push(head + tail); // delete at i
        for (var code = 'a'.charCodeAt(0); code <= 'z'.charCodeAt(0); code++) {
            var c = String.fromCharCode(code);
            respellings.push(head + c + mid + tail); // insert before i
            if (c != mid) { // replace at i
                respellings.push(head + c + tail);
            }
        }
    }
    for (var code = 'a'.charCodeAt(0); code <= 'z'.charCodeAt(0); code++) {
        respellings.push(spelling + String.fromCharCode(code)); // append
    }
    return respellings;
}

function getCandidateCorrections(misspelling) {
    var candidates = [];
    var stringsWithin1Edit = make1Edit(misspelling);
    stringsWithin1Edit.forEach(function (s) {
        if (invertedIndex[s] != null) { // keep only strings that are corpus terms
            candidates.push(s);
        }
    });
    return candidates;
}

An approach like that is reasonable at an edit distance of 1, which reportedly covers at least 80% of spelling errors. Going further requires a greater edit distance, but naively extending the approach above (e.g., foreach (s1 in make1Edit(misspelling)) { foreach (s2 in make1Edit(s1)) { ... } }) would be inefficient due to the number of non-word strings it generates. To optimize for greater edit distances, a Bloom filter initialized with all of the words in the corpus vocabulary can cheaply reject generated strings by their hashes, so that candidates that cannot be valid words are discarded before any index lookup. Details of the Bloom filter optimization are beyond the scope of this blog series.
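As an illustrative sketch only, a distance-2 search might compose make1Edit with itself and consult the Bloom filter before the more expensive inverted-index lookup. Here vocabularyFilter.mayContain is a hypothetical membership test, not an API from earlier installments.

// Sketch: collect real words within edit distance 2 of a misspelling.
// vocabularyFilter.mayContain(s) is a hypothetical Bloom filter test: it
// never rejects a real vocabulary word but may rarely accept a non-word,
// so the inverted index still confirms each surviving candidate.
function getCandidatesWithin2Edits(misspelling) {
    var candidates = {}; // used as a set, to deduplicate
    make1Edit(misspelling).forEach(function (s1) {
        if (invertedIndex[s1] != null) {
            candidates[s1] = true; // a real word at distance 1
        }
        make1Edit(s1).forEach(function (s2) {
            if (vocabularyFilter.mayContain(s2) && invertedIndex[s2] != null) {
                candidates[s2] = true; // a real word at distance 2 or less
            }
        });
    });
    return Object.keys(candidates);
}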

Ranking spelling correction candidates

Since multiple candidate respellings may be found at a given edit distance, it can be helpful to rank the candidates. An n-gram language model can be very effective for that purpose. In the simplest, the unigram model, the frequency of each word in the corpus serves as the context-independent likelihood that the user intended that word but made a spelling error. That is, given two words with the same edit distance from a misspelling, the more common word is more likely the intended one.

function rankCandidates(misspelling) {
    var candidates = getCandidateCorrections(misspelling);
    // Sort ascending by IDF weight: a lower IDF means a more frequent
    // word, so the most likely corrections come first.
    candidates.sort(function (c1, c2) {
        return invertedIndex[c1].term.idfWeight - invertedIndex[c2].term.idfWeight;
    });
    return candidates;
}

If the spelling suggestion feature explores edit distances greater than 1 or language models more sophisticated than unigram, ranking candidates is more complex.

  • w: a misspelled word supplied by the user, e.g., “teh”
  • c: a candidate correction, e.g., “the” or “ten”
  • P(w): the probability that the author typed the misspelling w.
  • P(c): the probability that proposed correction c might appear in text, based on the language model. For example, in a unigram language model P(“the”) would have a high value (a probability of about 2.6%) in typical English document collections because it is a very common word, while P(“ten”) has a lower probability (a probability of around 0.015%).
  • P(w|c): the probability that the author typed w by mistake, assuming the author intended to type c, based on the error model.
  • P(c|w): the probability that the intended correct word is c, given that the author typed the misspelling w.
  • argmax_c: from all feasible values of c, the one that gives the maximum score.

To rank spelling suggestions across various edit distances and more complex language models, candidate corrections are scored by their probability of being the intended word c given the misspelling w. That is, the best spelling correction c is argmax_c P(c|w). By Bayes’ Theorem:

argmax_c P(c|w) = argmax_c P(w|c) P(c) / P(w)

Given two candidate corrections c1 and c2 of misspelling w, the one with the greater P(c|w) is more likely the intended word, so spelling correction candidates are best presented in decreasing order of their P(c|w) scores. Generally, corrections that are common words and those very similar to the misspelling are listed before rare words and those less similar to the misspelling.

Given that the author typed w, the probability P(w) is fixed, i.e., it’s the same for every possible correction c, so we can ignore P(w) and evaluate this simpler expression to find the best spelling suggestion:

argmax_c P(w|c) P(c)

That is, spelling suggestions can be ranked in decreasing order of the product of (a) the probability of making the errors required to change c into w and (b) the probability of the candidate word c according to the language model.
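As a hedged sketch of this ranking, the error model below is a crude stand-in that decays with edit distance (the 0.01 decay constant is an assumption, not a measured value), and unigramProbability is a hypothetical lookup into the language model.

// Sketch: rank candidates by P(w|c) * P(c) under stand-in models.
var ERROR_DECAY = 0.01; // assumed: each extra edit is 100x less likely

function rankByNoisyChannel(misspelling, candidates) {
    var scored = candidates.map(function (c) {
        var pwGivenC = Math.pow(ERROR_DECAY, editDistance(misspelling, c));
        // unigramProbability(c) is a hypothetical language model lookup.
        return { word: c, score: pwGivenC * unigramProbability(c) };
    });
    scored.sort(function (a, b) { return b.score - a.score; }); // best first
    return scored.map(function (entry) { return entry.word; });
}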

A bigram language model may rank candidate replacements better by considering the words immediately before and after the misspelling. For example, a bigram language model reveals that the second word in “the teh best movies” is likely “ten” because *“…the the…” is a very rare sequence. Unigram language models are inexpensive to store, since the frequency data needs only a byte or so per term in the inverted index. A full bigram language model takes considerably more space and only marginally increases accuracy, so it is generally not worth the data storage expense in an offline-capable mobile app.
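To make that example concrete, a bigram rerank might multiply the conditional probabilities of a candidate given its neighbors, as in this sketch; bigramProbability(next, prev) is a hypothetical lookup that returns P(next|prev) from the bigram model.

// Sketch: rescore a candidate using the words around the misspelling.
// For "the teh best", candidate "the" scores poorly because P(the|the)
// is tiny, while "ten" scores well between "the" and "best".
function bigramScore(candidate, prevWord, nextWord) {
    return bigramProbability(candidate, prevWord) *
           bigramProbability(nextWord, candidate);
}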

Sophisticated language models may account for other language features, such as the likelihood of a word appearing in a “novel continuation.” For example, “Francisco” has a low likelihood of appearing in a novel continuation in most English corpora because it almost always follows “San.” Such language models can measurably improve spelling suggestion, but at the expense of prohibitively expanded storage requirements for offline mobile apps.

About rodasmith
Rod is a Slalom Consulting Solution Architect and software developer specializing in mobile applications for HTML5, Android, and iOS, with a passion for natural language processing and machine learning.
