State-of-the-Art Mobile Search Part 2: Ranked Retrieval

State-of-the-Art Mobile Search is a series exploring how to implement advanced mobile search.

State-of-the-Art Mobile Search Part 1: Offline Search | State-of-the-Art Mobile Search Part 3: TF-IDF Models

Rod Smith

Rod Smith

The inverted index explained in part 1 of this series allows an offline mobile search engine to quickly retrieve documents that match the search terms. A naïve search engine might use such an index for a Boolean search, which may simply return the documents that contain either all or any of the terms in the search query. More advanced Boolean search queries may be written in a query language of operators and expressions that defines precisely which documents to retrieve and which to ignore. That works well for experts who are very familiar with the corpus, but not for casual users. When Boolean searches are too specific, relevant documents may be omitted for lacking one search term, resulting in too few results. On the other hand, Boolean searches are often not specific enough, returning far too many results. It is difficult for most people to write an effective Boolean query that returns enough results to be useful, yet not so many as to be overwhelming.

State-of-the-art search engines resolve this problem by implementing ranked retrieval, which normally accepts a free-text query: a simple sequence of words that might appear in the sought documents. Documents that match the search query very well (for example, those containing each search term several times) appear near the top of the result set. Documents that match more loosely (for example, those containing fewer of the search terms) appear lower. Ideally, the most relevant documents appear early in the results, but since relevance determination is an inexact science, additional results follow for users to page through as needed. Long but well ordered results don’t overwhelm the user because the first page of results typically contains the sought information.

Since documents that match all of the words in the search query tend to appear early, while the search results for documents that match fewer query words tend to appear later in the ranked results. Ranked retrieval produces a kind of hybrid between Boolean intersections (traditionally invoked with “AND” operators) and unions (traditionally invoked with “OR” operators).

To rank documents against a search query, the search engine must know not only which documents contains each search term, but also how relevant each document is for each term and the relative information content of each term.

Term Count per Document

For a given document d and a given term t, the fundamental unit of relevance is the raw count of t in d, represented here as count(t,d). The inverted index below records that number as the “count”:

Document 1
Foo
Document 2
Foo, bar.
Document 3
Bar, bar.
{ invertedIndex: [
    { term:"foo", postings: [{docID:1, count:1}, {docID:2, count:1}] },
    { term:"bar", postings: [{docID:2, count:1}, {docID:3, count:1}] } ]
}

An inverted index capturing only terms’ counts in documents

A closely related measure is the term frequency tf(t,d), which is a function of count(t,d). The specific tf(t,d) function varies by the term frequency model implemented, but for simplicity, assume that the “tfWeight” attribute in the Core Data schema below stores the raw count of the given term in a particular document. Alternative (and better) formulae for that weight are described in part 3 of this series “TF-IDF Models.”

Information Content per Term

If the search engine treated each search term equally, words with little information content (“the”, “of”, etc.) would unduly influence the ranking toward documents that happen to contain those unimportant words many times.  To avoid that, good search models also assign a weight to each term t according to its inverse document frequency idf(t), which represents the information content of t. For simplicity, assume that the “idfWeight” attribute in the Core Data schema below stores the multiplicative inverse of the raw count of the given term across all documents. Alternative (and better) formulae for that weight are described in part 3 of this series “TF-IDF Models.”

Following is the Core Data schema of an inverted index that includes weights for term frequency (tfWeight) and inverse document frequency (idfWeight):

Document Scoring

With term frequency and inverse document frequency weights, the search engine can effectively rank search results by scoring each document d as the sum over all the query terms t of the product of the term frequency tf(t,d) and inverse document frequency idf(t), as demonstrated in the pseudo code below:

function search(searchQuery) {
    var documentScores = [];
    foreach (word in tokenize(searchQuery)) {
        var term = word.canonicalTerm;
        var indexEntry = invertedIndex.findEntryByTerm(term);
        if (indexEntry) {
            foreach (posting in indexEntry.postings) {
                documentScores[posting.document] += posting.tfWeight * term.idfWeight;
            }
        }
    }
    return documentScores;
}

Naturally, a document has no score if it contains none of the query terms. Part 3 of this series will explore formulae for the tfWeight of postings and the idfWeight of terms, some of which subtly change the document-scoring algorithm above.

About rodasmith
Rod is a Slalom Consulting Solution Architect and software developer specializing in mobile applications for HTML5, Android, and iOS, with a passion for natural language processing and machine learning.

2 Responses to State-of-the-Art Mobile Search Part 2: Ranked Retrieval

  1. Pingback: State-of-the-Art Mobile Search Part 3: TF-IDF Models | The Slalom Blog

  2. Pingback: State-of-the-Art Mobile Search Part 1: Offline Search | The Slalom Blog

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: