State-of-the-Art Mobile Search Part 4: Fields and Phrases

State-of-the-Art Mobile Search is a series exploring how to implement advanced mobile search.

State-of-the-Art Mobile Search Part 3: TF-IDF Models

Rod Smith

Rod Smith

The inverted index and the ranked retrieval model from earlier in this series did not distinguish between different fields of the indexed documents, nor did they make any special accommodation for multi-word phrases.

Phrase queries

A user may need to find documents that contain multi-word phrases like “all your base.” Users understand phrase queries well enough that explicit phrase queries are one of the few effective types of advanced queries. Typically, users identify phrase queries by enclosing each phrase in quotes, but implicit phrase queries are also possible, where a search engine identifies phrases without quotes or any other indication beyond the mere proximity of the words.

To answer a phrase query, a search engine must determine the distance between terms in a candidate document corresponding to sequential search query terms. Recall the inverted index:

invertedIndex = [
    {term: “all”, idfWeight: 1.234, postings: [
        {docId: 12, tfWeight: 3.4},
    …]},
    {term: “your”, idfWeight: 0.1234, postings: [
        {docId: 12, tfWeight: 5.6},
    …]},
    {term: “base”, idfWeight: 0.0123, postings: [
        {docId: 12, tfWeight: 0.78},
    …]},
…];

The postings in such an index identify each term’s frequency in each document but its positions. To find search phrases efficiently, the index may also store the terms’ positions:

invertedIndex = [
    {term: "all", idfWeight: 1.234, postings: [
        {docID: 12, tfWeight: 3.4, positions: [56, 78, 90, 123, 456…]},
    …]},
    {term: "your", idfWeight: 0.1234, postings: [
        {docID: 12, tfWeight: 5.6, positions: [127…]},
    …]},
    {term: "base", idfWeight: 0.0123, postings: [
        {docID: 12, tfWeight: 0.78, positions: [132…]},
    …]},
…] }

A considerable drawback of storing the positions of each term is a dramatic inflation of the inverted index. In the raw form above, the inverted index with term positions is comparable to the size of the corpus itself. Since the index is stored on the client, its size matters, so inflating the data storage size may not be acceptable for a mobile app. It’s possible to compress such postings so that frequent terms’ positions occupy about one byte each and infrequent terms’ positions occupy a few bytes each, making it around a third of the size of the raw corpus, but implementation details for such compression is outside of the scope of this blog series.

If a mobile app needs to support literal phrase queries and it can accommodate the inflated size of the inverted index, it can award score bonuses to implicit phrase matches. Such a bonus causes documents with phrase matches to appear earlier in the results than documents with separated search terms.

A generalization of phrase matching considers the distance between document terms that match sequential terms from the search query. This approach allows for short intervening words in the matched phrase, e.g. allowing the search phrase “all your base” to match document phrases like “all of your bases” with only a slight penalty for the intervening word. (Note: looping through the prior term positions introduces a performance penalty. Part 8 of this series lays out a framework for evaluating such penalties.)

function search(searchQuery) {
    var documentScores = [];
    var priorTerm = null;
    foreach (word in tokenize(searchQuery)) {
        var indexEntry = word.canonicalTerm.indexEntry;
        foreach (posting in indexEntry.postings) {
            documentScores[posting.document] +=
             posting.tfWeight * indexEntry.term.idfWeight;
            if (priorTerm) {
                var priorTermPosting = priorTerm.postings[posting.document];
                if (priorTermPosting) {
                    foreach (priorTermPosition in priorTermPosting.positions) {
                        var thisTermPosition = posting.positions.first(position =>
                            position.start > priorTermPosition.start);
                        if (thisTermPosition) {
                            documentScores[posting.document] += ProxityBonusRise /
                                (ProximityBonusRun + thisTermPosition.start –
                                priorTermPosition.start – priorTerm.length);
                        }
                    }
                }
            }
        }
        priorTerm = term;
    }
    return documentScores;
}

The proximity bonus calculation above differs in obvious ways if posting positions are stored as character ranges (with start position and length in order to handle multiple forms of the indexed term) or as word indices.

Field weights

A search engine may augment the term frequency function tf(t,d) according to which field of the document contains the term. For example, a document d1 that contains a search term t in its “title” field typically scores higher than a document d2 that only contains t in its content “body” field.

For example, a search engine with a corpus of television shows may match words “episodes” in extended metadata for every series. Doing so naïvely could yield numerous irrelevant results for searches targeting a TV series named “Episodes.” To improve the quality of matches, the search engine may award matches in the show title field a greater score than matches in other fields, causing the “Episodes” TV series to rank higher in the search results for queries including “episodes” than the irrelevant shows with “episodes” in extended metadata fields.

function fieldWeight(field) {
    switch(field) {
      case FIELD.Title:
        return 10.0;
      default:
        return 1.0;
    }
}

function createInvertedIndex() {
    var invertedIndex = [];
    // Determine the postings for each word.
    foreach (document in corpus) {
        foreach (word in document.words) {
            var term = canonicalize(word);
            var indexEntry = invertedIndex[term];
            if (indexEntry == null) {
                indexEntry = invertedIndex[term] = {
                    term: term,
                    postings: []
                };
            }
            var posting = indexEntry.postings[document];
            if (posting == null) {
                posting = indexEntry.postings[document] = {
                    document: document,
                    tfWeight: tf(term, document),
                    positions: []
                };
            }
            posting.positions += word.position;
        }
    }
    // Calculate inverse document frequency weights.
    foreach (indexEntry in invertedIndex) {
        indexEntry.term.idfWeight = idf(term);
    }
    // Determine document normalization factors.
    var squareOfCosSimNormFactors = [];
    foreach (indexEntry in invertedIndex) {
        foreach (posting in indexEntry.postings) {
            squareOfCosSimNormFactors[posting.document] +=
                (posting.tfWeight*indexEntry.term.idfWeight)^2;
        }
    }
    foreach (document in documents) {
        document.normalizationFactor =
            1/sqrt(squareOfCosSimNormFactors[document]);
    }
    // Normalize tfWeights, combine with idfWeights, and augment with field weights
    foreach (indexEntry in invertedIndex) {
        foreach (posting in indexEntry.postings) {
            posting.fieldWeightedNormalizedTfIdfScore =
                fieldWeight(posting.field) * posting.document.normalizationFactor *
                posting.tfWeight * indexEntry.term.idfWeight;
        }
    }
    return invertedIndex;
}

To optimize document scoring at query time, the inverted index above pre-calculates fieldWeightedNormalizedTfIdfScore by combining TF and IDF weights for each posting, normalizing the result by the document normalization factors, and augmenting the score by field weight. If weights (TF, IDF, and Field weights) have logarithmic scales, addition replaces multiplication above.

About rodasmith
Rod is a Slalom Consulting Solution Architect and software developer specializing in mobile applications for HTML5, Android, and iOS, with a passion for natural language processing and machine learning.

One Response to State-of-the-Art Mobile Search Part 4: Fields and Phrases

  1. Pingback: State-of-the-Art Mobile Search Part 5: Term Canonicalization | The Slalom Blog

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: