State-of-the-Art Mobile Search Part 3: TF-IDF Models

State-of-the-Art Mobile Search is a series exploring how to implement advanced mobile search.


Rod Smith


The ranked retrieval model explained in part 2 of this series established a framework for assessing the relevance of documents to a search query in terms of the information content of each query term and the focus of each document.

Term Frequency Models

If a search term t appears several times in one document d1 but fewer times in another document d2 of similar length, d1 is arguably more relevant to t than d2 is. However, if d1 contains ten occurrences of t while d2 contains just one, that does not make d1 ten times more relevant than d2.

To reflect term frequency relevance accurately in search result ranking, most search models define a term frequency weight tf(t,d) that is positively correlated with, but not directly proportional to, the raw count of occurrences of the term t in document d. tf(t,d) is stored in the postings of the inverted search index (described in part 1 of this series). Several alternative models may be used for tf(t,d). In the models below, count(term,document) measures the number of occurrences of term in document; max selects the maximum value of the expression over all terms in the document; and average selects the average value of the expression over all terms in the document:

// Natural term frequency
function tf(term, document) {
    return count(term,document);
}
// Logarithmic term frequency
function tf(term, document) {
    if (count(term,document)==0) return 0;
    else return 1+log(count(term,document));
}
// Augmented term frequency (normalized by the document's maximum term frequency)
function tf(term, document) {
    var maxTermCount = max(t => count(t,document));
    return 0.5 + 0.5*count(term,document)/maxTermCount;
}
// Boolean term frequency
function tf(term, document) {
    if (count(term,document)>0) return 1;
    else return 0;
}
// Log average term frequency
function tf(term, document) {
    if (count(term,document)==0) return 0;
    var avgTermCount = average(t => count(t,document));
    return (1+log(count(term,document)))/(1+log(avgTermCount));
}

Using a logarithmic function dampens the effect of large raw counts. The base of the logarithm turns out to matter little in practice once weights are normalized. For example, if tf(t,d) is the “Logarithmic term frequency” function above using base 10, the following term frequency weights would result:

raw count(t,d)    logarithmic tf(t,d)
0                 0.0
1                 1.0
2                 1.3
10                2.0
1000              4.0

Thus, a document with ten occurrences of a search query term would get twice the weight of one with only a single instance of that term.
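
As a quick check, the table above can be reproduced with a few lines of plain JavaScript. This is a standalone sketch of the base-10 logarithmic weight calculation, separate from the index-building pseudocode below:

// Base-10 logarithmic term frequency weight for a raw occurrence count.
function logTf(rawCount) {
    return rawCount === 0 ? 0 : 1 + Math.log10(rawCount);
}

[0, 1, 2, 10, 1000].forEach(function (count) {
    console.log(count + " -> " + logTf(count).toFixed(1)); // 0.0, 1.0, 1.3, 2.0, 4.0
});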

To create an inverted index with term frequency weights, the server calculates the weight for each posting. The following pseudocode does so using the logarithmic term frequency weight model:

function tf(term, document) {
    if (count(term,document)==0) return 0;
    else return 1+log(count(term,document));
}

function createInvertedIndex() {
    var invertedIndex = [];
    foreach (document in corpus) {
        foreach (word in tokenize(document)) {
            var term = canonicalize(word);
            var indexEntry = invertedIndex[term];
            if (indexEntry == null) {
                invertedIndex[term] = indexEntry = {
                    postings: []
                };
            }
            // Add at most one posting per term per document.
            if (!indexEntry.postings.any(p => p.doc == document)) {
                indexEntry.postings.add({
                    doc: document,
                    tfWeight: tf(term, document)
                });
            }
        }
    }
    return invertedIndex;
}
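
For readers who want something they can actually run, here is a minimal JavaScript sketch of the same two steps. The function name createSimpleIndex, the corpus shape (an array of {id, text} objects), and the lowercase/whitespace tokenizer are illustrative assumptions, not part of the model:

// Minimal runnable sketch: an inverted index with logarithmic tf weights.
// The corpus format and the naive tokenizer are simplifying assumptions.
function createSimpleIndex(corpus) {
    var invertedIndex = {};
    corpus.forEach(function (document) {
        // Count occurrences of each canonicalized term in this document.
        var counts = {};
        document.text.toLowerCase().split(/\W+/).filter(Boolean).forEach(function (term) {
            counts[term] = (counts[term] || 0) + 1;
        });
        // One posting per term per document, carrying the log tf weight.
        Object.keys(counts).forEach(function (term) {
            if (!invertedIndex[term]) {
                invertedIndex[term] = { postings: [] };
            }
            invertedIndex[term].postings.push({
                doc: document.id,
                tfWeight: 1 + Math.log10(counts[term])
            });
        });
    });
    return invertedIndex;
}

var index = createSimpleIndex([
    { id: 1, text: "galvanized iron tools" },
    { id: 2, text: "tools tools tools of the trade" }
]);
console.log(index["tools"].postings); // doc 2 scores higher on "tools" than doc 1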

Inverse Document Frequency Models (Information Content Weighting)

Rare terms tend to be more informative than frequent terms. Consider a query term that is rare in the collection, e.g., “galvanized.” A document containing this term is very likely to be relevant to the information need behind a query like “tools made of galvanized iron.” Frequently occurring terms tend to be less informative than rare ones, but not entirely useless. A document containing a more common term like “tools,” for example, is more likely to be relevant than a document without that term, but it’s not as big an indicator of relevance as “galvanized.”

If one search term appears in most documents (e.g., “of”), and another search term appears in fewer (e.g., “tools”), the one that appears more frequently is considered to have less information content than the less frequent term. For a given term t, a search index weights its information content by calculating the inverse document frequency idf(t), which measures the term’s rarity across all documents. It thus can assign a greater weight to rare terms, a lesser weight to more common terms, and little to no weight to very frequent terms.

A cruder approach is to define a set of “stop words” (such as “a,” “is,” “the,” “of,” and “to”) to ignore outright. Inverse document frequency handles stop words automatically: because they appear in nearly every document in the corpus, typical inverse document frequency models assign them an idf(t) weight near zero.

There are multiple inverse document frequency models. Below, N represents the total number of documents (corpus.documents.count) and dc(term) represents the raw count of documents containing term.

// Constant (no consideration for document frequency)
function idf(term) {
    return 1;
}
// Raw inverse document frequency
function idf(term) {
    return N/dc(term);
}
// Logarithmic inverse document frequency
function idf(term) {
    return log(N/dc(term));
}
// Probabilistic inverse document frequency
function idf(term) {
    return max(0, log((N-dc(term))/dc(term)));
}

In the logarithmic inverse-document-frequency model, idf(t) is obtained by dividing the total number of documents N by dc(t), the raw count of documents containing t, and then taking the logarithm of that quotient. The base of the logarithm does not affect the ranking, since changing the base merely scales every document’s score by the same constant factor.
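
As a concrete (hypothetical) example, suppose a corpus of 1,000 documents in which “of” appears in 990 documents and “galvanized” in only 5. The logarithmic model then weights the two terms very differently:

// Logarithmic inverse document frequency, base 10.
function idf(totalDocs, docsContainingTerm) {
    return Math.log10(totalDocs / docsContainingTerm);
}

console.log(idf(1000, 990).toFixed(3)); // 0.004 -- a stop word like "of" contributes almost nothing
console.log(idf(1000, 5).toFixed(3));   // 2.301 -- a rare term like "galvanized" dominates the score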

To create an inverted index with inverse document frequency weights, the server calculates the weight after processing all documents. The following pseudocode does so using the logarithmic inverse document frequency weight model:

function idf(term) {
    var postings = invertedIndex[term].postings;
    return log(N/postings.count);
}

function createInvertedIndex() {
    var invertedIndex = [];
    foreach (document in corpus) {
        foreach (word in tokenize(document)) {
            var term = canonicalize(word);
            var indexEntry = invertedIndex[term];
            if (indexEntry == null) {
                invertedIndex[term] = indexEntry = {
                    idf: null,
                    postings: []
                };
            }
            // Add at most one posting per term per document.
            if (!indexEntry.postings.any(p => p.doc == document.id)) {
                indexEntry.postings.add({
                    doc: document.id,
                    tfWeight: tf(term, document)
                });
            }
        }
    }
    // Calculate inverse document frequency weights.
    foreach (term in invertedIndex.keys) {
        invertedIndex[term].idf = idf(term);
    }
    return invertedIndex;
}

Cosine Similarity (Document Length/Focus) Normalization

If the corpus includes both long and short documents, the longer documents are more likely to contain the search terms than the shorter ones. To avoid this bias toward longer documents, tf-idf scores are normalized based on document length. For example, if document d1 contains only 100 distinct terms and document d2 contains 1,000 distinct terms, d1 has a narrower focus than d2. If both documents contain search term t an equal number of times, d1 is considered more relevant to t than the more broadly focused d2.

There are several document focus normalization models. The cosine normalization model, for example, works well in practice and has an intuitive geometric interpretation in which vectors represent the query and the documents. The multidimensional space for those vectors has a dimension for each term in the vocabulary, and a document or query vector’s component in a particular dimension is the tf-idf weight of that dimension’s term. Rather than measuring the distance between a document vector and the query vector, cosine similarity scores each document by the cosine of the angle between its vector and the query vector, which effectively normalizes away differences in vector length.
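
For intuition, here is a direct (if less efficient) way to compute the cosine similarity between two sparse tf-idf vectors in JavaScript; the object-as-sparse-vector representation is just an illustrative choice:

// Cosine similarity between two sparse vectors of tf-idf weights,
// represented as objects mapping term -> weight.
function cosineSimilarity(a, b) {
    var dot = 0, normA = 0, normB = 0;
    Object.keys(a).forEach(function (term) {
        normA += a[term] * a[term];
        if (b[term]) dot += a[term] * b[term];
    });
    Object.keys(b).forEach(function (term) {
        normB += b[term] * b[term];
    });
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// A long document with the same term proportions as a short one scores the same.
console.log(cosineSimilarity({ tools: 1, galvanized: 2 },
                             { tools: 10, galvanized: 20 })); // 1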

Algorithmically, documents can be ranked in decreasing order of cosine similarity (equivalently, increasing order of the angle between their vector and the query vector) by dividing each document’s score by the square root of the sum of the squares of its term weights. For efficiency, these normalization factors are calculated when the inverted index is built so that they need not be recalculated during query execution:

    var squareOfCosSimNormFactors = [];
    foreach (indexEntry in invertedIndex) {
        var idf = indexEntry.idf;
        foreach (posting in indexEntry.postings) {
            if (squareOfCosSimNormFactors[posting.doc] == null)
                squareOfCosSimNormFactors[posting.doc] = 0;
            squareOfCosSimNormFactors[posting.doc] += (posting.tfWeight*idf)^2;
        }
    }

    foreach (docId in squareOfCosSimNormFactors.keys) {
        documents[docId].normalizationFactor =
            1/sqrt(squareOfCosSimNormFactors[docId]);
    }

    function normalizedTfIdfWeight(posting, indexEntry) {
        return documents[posting.doc].normalizationFactor
            * posting.tfWeight * indexEntry.idf;
    }

If the implementation stores the logarithms of the tf weights, idf weights, and normalization factors rather than the values themselves, the multiplications above become additions of those logarithms, but the approach is otherwise identical.
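
To show how the pieces fit together at query time, the sketch below ranks documents by the sum of their normalized tf-idf weights over the query terms. It assumes an index and document collection shaped like the ones built above (entries carrying idf, postings carrying doc and tfWeight, documents carrying normalizationFactor); those field names are illustrative:

// Score each document as the sum of its normalized tf-idf weights for the
// query terms, then return document ids ordered by decreasing score.
function rank(query, invertedIndex, documents) {
    var scores = {};
    query.toLowerCase().split(/\W+/).filter(Boolean).forEach(function (term) {
        var indexEntry = invertedIndex[term];
        if (!indexEntry) return; // term does not occur in the corpus
        indexEntry.postings.forEach(function (posting) {
            var weight = documents[posting.doc].normalizationFactor
                * posting.tfWeight * indexEntry.idf;
            scores[posting.doc] = (scores[posting.doc] || 0) + weight;
        });
    });
    return Object.keys(scores).sort(function (a, b) {
        return scores[b] - scores[a];
    });
}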

About rodasmith
Rod is a Slalom Consulting Solution Architect and software developer specializing in mobile applications for HTML5, Android, and iOS, with a passion for natural language processing and machine learning.

3 Responses to State-of-the-Art Mobile Search Part 3: TF-IDF Models

  1. Pingback: State-of-the-Art Mobile Search Part 2: Ranked Retrieval | The Slalom Blog

  2. Pingback: State-of-the-Art Mobile Search Part 4: Fields and Phrases | The Slalom Blog

  3. TheOldHag says:

    I’m not sure I follow the very last sentence on this page. From my understanding, normalizing a vector with respect to the Euclidean norm is always via dividing each term weight by the square root of the sum of squared weights over the whole vector. This doesn’t change if the tf/idf weights use logarithms.
