State-of-the-art mobile search part 6: search execution

State-of-the-Art Mobile Search is a series exploring how to implement advanced mobile search.

State-of-the-Art Mobile Search Part 5: Term Canonicalization

Rod Smith

Rod Smith

As explained in prior installments in this series, advanced search relies on term frequency and inverse document frequency. Together, those two factors reflect the importance of a term in a given document with respect to the rest of the corpus, which tells the search engine how relevant a document is given a single search term.

To optimize search execution, term frequency and independent document frequency are calculated for each term and document in the corpus before executing search queries. Then, when a search query is issued, the search engine quickly scores each document for the given search terms.

In the standard form of scoring shown in parts 2 and 3, each term-document pair is scored as a normalized product of the term frequency and the inverse document frequency:

function normalizedTfIdfWeight(posting, term) {
    return posting.document.normalizationFactor * posting.tfWeight * term.idfWeight;
}

function search(searchQuery) {
    var documentScores = [];
    foreach (word in tokenize(searchQuery)) {
        var term = word.canonicalTerm;
        var indexEntry = term.indexEntry;
        foreach (posting in indexEntry.postings) {
            documentScores[posting.document] +=
                normalizedTfIdfWeight(posting, term);
        }
    }
    return documentScores;
}

A high score for a given document d is achieved when a term t from the search query appears many times in d but few times across the rest of the document collection. Often, the tf and idf weights have a logarithmic scale, so addition replaces multiplication above, but the same concepts apply. Given a search query, a search engine aggregates the normalized tf-idf weight of all the search terms for each candidate document. That aggregation may be as simple as a summation.

When all document scores are known for a given query, the search engine ranks the documents by their scores and returns the top several results to the user and allows the user to page or scroll through additional search results.

Faceted search

Most natural language documents have metadata that can be used for faceted search.  For example, articles may each have a publication date, status, category, topic, author, or other classifications. Expert users typically know the documents’ metadata fields, and exposing faceted search can help such users find the right documents. The technical details of faceted search are fairly obvious (filter the documents by the specified search facets before or during document scoring). A good UX design, however, can make the difference between a confusing array of search filters and a simple but powerful tool. Effective UX design for faceted search is a domain-specific detail beyond the scope of this blog series.

Context snippets

If search results are merely links to matching documents, the user may need to investigate each link in order to determine whether it answers the information need.  Search results that include the context of the document near the matched terms are much easier to evaluate for relevance. So, good search results show document text before and after the matched search terms, typically with the search terms themselves somehow highlighted or emphasized.

Stems are not always proper substrings of the unstemmed variations (e.g. in popular stemmers, “countri” is the stemmed form of “country” and “countries”, and “run” is the stem of “ran”), so without a position index, it may be difficult to find the parts of the document that actually match the search query. If position information is included in the inverted index, though, the context may be quickly extracted based on the position of the matched terms.

function search(searchQuery) {
    var documentScores = [];
    var matchedPostings = [];
    foreach (word in tokenize(searchQuery)) {
        var term = word.canonicalTerm;
        var indexEntry = term.indexEntry;
        foreach (posting in indexEntry.postings) {
            matchedPostings[posting.position] = posting;
            documentScores[posting.document] +=
                normalizedTfIdfWeight(posting, term);
        }
    }
    var snippets = [];
    foreach (document in documentScores.keys) {
        var previousPosition = null;
        foreach (position in matchedPostings.sortedKeys) {
            if (previousPosition == null) {
                if (position > 0) {
                    snippets[document] += “…”;
                }
            } else if (position – previousPosition < CONTEXT_MAX_SPAN - 1) {
                snippets[document] += document.rangeOfWords(
                    previousPosition + 1, word.position);
            } else {
                snippets[document] += “…” + document.rangeOfWords(
                    word.position–CONTEXT_HEAD, word.position);
            }
            snippets[document] += MATCH_BEGIN + document.rangeOfWords(
                word.position, word.position+1) + MATCH_END;
            previousPosition = position;
        }
        if (previousPosition != null) {
            snippets[document] += document.rangeOfWords(
                previousPosition, previousPosition+CONTEXT_TAIL) + “…”;
        }
    }
    return {scores: documentScores, contextSnippets: snippets};
}

The snippet code above assumes that the positions are in word offsets. Some adjustment would be necessary if positions are recorded as character offsets.

About rodasmith
Rod is a Slalom Consulting Solution Architect and software developer specializing in mobile applications for HTML5, Android, and iOS, with a passion for natural language processing and machine learning.

One Response to State-of-the-art mobile search part 6: search execution

  1. Pingback: State-of-the-art mobile search part 7: spelling correction | The Slalom Blog

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: