State-of-the-Art Mobile Search Part 1: Offline Search

State-of-the-Art Mobile Search is a series exploring how to implement advanced mobile search.

State-of-the Art Mobile Search: Introduction | State-of-the-Art Mobile Search Part 2: Ranked Retrieval

Rod Smith

Rod Smith

As mobile device memory capacity expands, so do opportunities to feature extensive natural-language content in offline-capable apps. Network connections are ever faster, but even under ideal connectivity conditions, apps serve local content noticeably faster than server-based equivalents. Moving outside of ideal network connectivity conditions further highlights the advantages of network independence.

So for the best user experience, apps with much offline natural-language content should support offline search. The key to efficient offline search is the data structure known as the “inverted index.”

Inverted index

All modern search engines are based on an inverted index, a key-value collection with each entry key representing a term—a form of a word from the corpus (the searchable collection)—and each value representing the term’s postings (the term’s distribution in the corpus).

More specifically, an inverted index is an associative array, which may be a dictionary, a map, a hash, or a table in various programming environments. Each key of the inverted index is a canonical term—that is, a standardized form of a word from the corpus. The corresponding value is a list of postings that identify the documents that contain the term, optionally accompanied by the frequency, positions, and/or raw forms of the term in that document.

Following is a simple set of three documents and a minimal example inverted index for them:

Document 1
Foo
Document 2
Foo, bar.
Document 3
Bar, bar.
{ invertedIndex: [
    { term: "foo", postings: [{docID: 1}, {docID: 2}] },
    { term: "bar", postings: [{docID: 2}, {docID: 3}] } ]
}

A minimal inverted index, capturing only terms’ presence in documents

The index is “inverted” by contrast with a document-centric index, whose keys are document IDs and whose values capture details of the associated document. The inverted index inverts the location of the document IDs within the structure in order to surface the terms.

When documents are added, removed, or updated, the inverted index must be rebuilt to include the updates in search results. It is fairly straightforward to do so, but it can be processor intensive depending on the details captured. So, the inverted index is built on a server, either through a packaged product like Google Search Appliance and Apache Lucene (e.g. through the IndexReader API) or from scratch.

The pseudo code below illustrates building a minimal inverted index:

function createInvertedIndex() {
    var invertedIndex = [];
    foreach (document in corpus) {
        foreach (word in document.words) {
            var term = canonicalize(word);
            var postings = invertedIndex[term];
            if (postings == null) {
                postings = [];
                invertedIndex[term] = postings;
            }
            postings.add({docId: document.Id});
        }
    }
    return invertedIndex;
}

To support offline search, a mobile app downloads the inverted index from the server along with each update of new and changed documents.  Typically, the mobile app stores the downloaded index in a client-side database.  For example, an iOS app may store the inverted index in Core Data, and a mobile web app may store it in a Web SQL database.

Search Results

A search engine uses the document IDs stored in the postings list to determine the relevant documents for each term in a search query. In this simple form, the search engine has very little information with which to assess the relevance of each document. The pseudo code search function below shows all of the documents that match all of the search terms:

function search(query) {
    var relevantDocuments = null;
    var searchWords = tokenize(query);
    foreach (word in searchWords) {
        var documentsWithThisWord = [];
        foreach (posting in invertedIndex[word].postings) {
            documentsWithThisWord.push(posting.docId);
        }
        if (relevantDocuments == null) {
            // Documents with the first search word may be relevant.
            relevantDocuments = documentsWithThisWord;
        } else {
            foreach (docId in relevantDocuments) {
                // Documents lacking this search word are not relevant.
                if (!documentsWithThisWord.contains(docId)) {
                    relevantDocuments.remove(docId);
                }
            }
        }
    }
    return relevantDocuments;
}

The documents deemed relevant above are those that match all of the search terms. It could easily be adjusted to include those that match any of the search terms instead, or as shown below, a simple ranking could be achieved by considering the number of search terms each document contains:

function search(query) {
    var documentRelevanceScores = [];
    var searchWords = tokenize(query);
    foreach (word in searchWords) {
        foreach (posting in invertedIndex[word].postings) {
            documentRelevanceScores[posting.docId]++;
        }
    }
    return documentRelevanceScores;
}

The number of search terms in a document is a crude way to score the relevance of a search result.  The next posting in this series (“State-of-the-Art Mobile Search Part 2: Ranked Retrieval”) will show a better scoring model that ranks the results by how well they match the search terms (e.g., by storing the number of occurrences of the term in each document and other statistics in the inverted index).

About rodasmith
Rod is a Slalom Consulting Solution Architect and software developer specializing in mobile applications for HTML5, Android, and iOS, with a passion for natural language processing and machine learning.

3 Responses to State-of-the-Art Mobile Search Part 1: Offline Search

  1. Pingback: State-of-the-Art Mobile Search Part 2: Ranked Retrieval | The Slalom Blog

  2. Pingback: State-of-the-Art Mobile Search Part 3: TF-IDF Models | The Slalom Blog

  3. Pingback: State-of-the-Art Mobile Search: Introduction | The Slalom Blog

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: