State-of-the-Art Mobile Search Part 5: Term Canonicalization

State-of-the-Art Mobile Search is a series exploring how to implement advanced mobile search.

State-of-the-Art Mobile Search Part 4: Fields and Phrases

Rod Smith

Rod Smith

The inverted index built in earlier parts of this series use an undefined function canonicalize(word) to convert strings of characters into a standard form. Doing so accounts for the fact that there are multiple forms of most words in English and similar languages. Consider a query like the following:

3d printing donuts”

Crude search engines match literal words of the search query against literal words from the document collection with case insensitive substring matching. Literal substring matching is obviously deficient given its failure to match the query above against documents that contain the following:

  • “3D Printers Make Donuts Healthy”
  • “… a 3D-printed donut….”
  • “Dunkin Donuts has made a 3D printer.”

To match the search query above with those documents, search engines can employ various types of term canonicalization that ignore non-semantic details like grammatical class, so printing matches print, printed, printers, etc.  The most common approach for English-language search is known as stemming.

Stemming

Besides normalizing the case of words (e.g. converting to lower-case), term canonicalization in English and most other Indo-European languages is typically implemented through stemming, which basically strips certain suffixes from each term to derive its lemma or citation form.  In English, for example, plural nouns and singular ones stem to a single form, as do past tense verbs, present participles, etc. Simply stripping -(e)s, -(e)d, -(e)r, -ing, and -ly, for example, maps the words prints, printing, and printer to the inverted index entry for the stem print. Stemming thus solves the problem illustrated above, matching the search query with the target documents through the stems of each term.

To implement stemming, both inverted index creation and query processing transform each word through a canonicalization function that recognizes certain special cases (e.g. “began” -> “begin”) and strips certain suffixes.  Following is a simplified illustration:

function canonicalize(word) {
    switch(word) {
        case “arose”:
        case “arisen”:
            return “arise”;
        case “began”:
        case “begun”:
            return “begin”;
        // …
        case “wrote”:
        case “written”:
            return “writ”;
        default:
            return word.replace(/^(.+)((at)?e?[dnr]|ing|ion|ize|ly|ness)?s?$/,"\1");
    }
}

For a more sophisticated stemming algorithm that can easily be incorporated directly into a mobile app’s search feature, see the many implementations of the Porter Stemmer.

For obvious reasons, the stemming algorithm used must be based on the natural language used in the corpus and the search queries.

Related term indexing

Although stemming allows the search engine to match independent of grammatical form, it does not enable matching between alternative spellings, synonyms, or related terms in general. For example, stemming does not make a search query with the word donut match a document with the word doughnut or one with words for other foods. A more advanced approach to term canonicalization augments the search index with synonyms and other related terms.

Yet more advanced than related term indexing is semantic search, wherein a natural language model determines the sense of each word in the corpus and in the search query.  For example, semantic search determines whether “donut” in a given sentence or query refers to deep-fried dough, a non-edible torus, a circular peel-out maneuver in a car, or a small spare tire. The search engine then matches search queries to documents according to word sense rather than by literal words. Doing so is at the leading edge of information retrieval technology, building on machine learning and sophisticated language models.  It is not yet practical for offline mobile apps to implement true semantic search.  Instead, search indexers on servers may roughly approximate semantic search by augmenting the inverted index with alternative expressions for all senses of each indexed term, i.e. mapping to each indexed term its related terms from a thesaurus (synonyms, hyponyms, hypernyms, meronyms, etc.).

Related terms data model

With such an augmented index, the engine processing the search query finds each stemmed query term in the inverted index under its main entry and as an alternative expression for any related terms.  For example, it may find donut as a standard index key and as a synonym of doughnut, and as a hyponym of food. It could then add documents containing various forms of the word food. The score for each match of a related expression is typically diluted so that documents matching on related terms appear later in the search results than documents matching on the verbatim search term.

function searchRelatedTerms(term, searchDepth, priorSubstitutionFrequency) {
    foreach (relatedTerm in term.relatedTerms) {
        var compositeSubstitutionFrequency =
            priorSubstitutionFrequency * relatedTerm.substitutionFrequency;
        foreach (posting in relatedTerm.relatedIndexEntry.postings) {
            documentScores[posting.document] +=
                compositeSubstitutionFrequency *
                    posting.tfWeight * relatedTerm.idfWeight;
        }
        if (searchDepth) {
            searchRelatedTerms(relatedTerm, searchDepth - 1,
                compositeSubstitutionFrequency);
        }
    }
}

function search(searchQuery) {
    var documentScores = [];
    foreach (word in tokenize(searchQuery)) {
        var indexEntry = word.canonicalTerm.indexEntry;
        foreach (posting in indexEntry.postings) {
            documentScores[posting.document] +=
                posting.tfWeight * indexEntry.term.idfWeight;
        }
        searchRelatedTerms(indexEntry.term, MAX_SEMANTIC_SEARCH_DEPTH, 1);
    }
    return documentScores;
}

Related term indexing is feasible in offline apps because the alternative expressions map requires a less storage than the position attribute of most inverted index postings. To adopt it, see online databases like WordNet, from which related terms may be imported. Due to the added complexity of the search function and the extra storage space required for the alternative expressions, though, an offline mobile search solution may forgo related term augmentation if storage or speed is at a premium.

About rodasmith
Rod is a Slalom Consulting Solution Architect and software developer specializing in mobile applications for HTML5, Android, and iOS, with a passion for natural language processing and machine learning.

One Response to State-of-the-Art Mobile Search Part 5: Term Canonicalization

  1. Pingback: State-of-the-Art Mobile Search Part 6: Search Execution | The Slalom Blog

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: