# State-of-the-Art Mobile Search Part 3: TF-IDF Models

April 5, 2013

*State-of-the-Art Mobile Search* is a series exploring how to implement advanced mobile search.

The ranked retrieval model explained in part 2 of this series established a framework for assessing the relevance of documents for a search query in terms of the information content of each query term and the focus of each document.

## Term Frequency Models

If a search term *t* appears several times in one document *d₁* but fewer times in another document *d₂* of similar length, *d₁* is arguably more relevant to *t* than *d₂* is. However, if *d₁* has ten occurrences of *t*, while *d₂* has *t* just once, that does **not** make *d₁* ten times more relevant than *d₂*.

To reflect term frequency relevance accurately in search result ranking, most search models define a term frequency weight *tf(t,d)* that is positively correlated with but not directly proportional to the raw count of occurrences of the term *t* in document *d*. *tf(t,d)* is stored in the postings of the inverted search index (described in part 1 of this series). Several alternative models may be used for *tf(t,d)*. In the models below, *count(term,document)* measures the number of occurrences of *term* in *document*; *max* selects the maximum value of the expression over all valid arguments; and *average* selects the average value of the expression over all valid arguments:

```
// Natural term frequency
function tf(term, document) {
  return count(term, document);
}

// Logarithmic term frequency
function tf(term, document) {
  if (count(term, document) == 0) return 0;
  else return 1 + log(count(term, document));
}

// Augmented term frequency (documents normalized by their max term frequency)
function tf(term, document) {
  var maxTermCount = max(t => count(t, document));
  return 0.5 + 0.5 * count(term, document) / maxTermCount;
}

// Boolean term frequency
function tf(term, document) {
  if (count(term, document) > 0) return 1;
  else return 0;
}

// Log average term frequency
function tf(term, document) {
  if (count(term, document) == 0) return 0;
  var avgTermCount = average(t => count(t, document));
  return (1 + log(count(term, document))) / (1 + log(avgTermCount));
}
```

Using the logarithmic function dampens the effect of large raw counts. The base of the *log* function turns out to be fairly irrelevant after normalization. For example, if *tf(t,d)* is the “Logarithmic term frequency” function above using base 10, the following term frequency weights would result:

| raw count(t,d) | logarithmic tf(t,d) |
| --- | --- |
| 0 | 0.0 |
| 1 | 1.0 |
| 2 | 1.3 |
| 10 | 2.0 |
| 1000 | 4.0 |

Thus, a document with ten occurrences of a search query term would get twice the weight of one with only a single instance of that term.
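The table above can be reproduced with a few lines of JavaScript (a minimal sketch of the base-10 logarithmic model; `logTf` is a name chosen here for illustration):

```javascript
// Logarithmic term frequency, base 10: 0 for absent terms, otherwise
// 1 + log10(raw count), which dampens the effect of large counts.
function logTf(rawCount) {
  if (rawCount === 0) return 0;
  return 1 + Math.log10(rawCount);
}

logTf(0);    // 0
logTf(1);    // 1
logTf(2);    // ≈ 1.3
logTf(10);   // 2
logTf(1000); // 4
```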

To create an inverted index with term frequency weights, the server calculates the weight for each posting. The following pseudo code does so using the logarithmic term frequency weight model:

```
function tf(term, document) {
  if (count(term, document) == 0) return 0;
  else return 1 + log(count(term, document));
}

function createInvertedIndex() {
  var invertedIndex = [];
  foreach (document in corpus) {
    foreach (word in tokenize(document)) {
      var term = canonicalize(word);
      var indexEntry = invertedIndex[term];
      if (indexEntry == null) {
        invertedIndex[term] = indexEntry = { postings: [] };
      }
      // Add at most one posting per (term, document) pair; repeated
      // occurrences are already captured by tf(term, document).
      if (!indexEntry.postings.any(p => p.doc == document)) {
        indexEntry.postings.add({ doc: document, tfWeight: tf(term, document) });
      }
    }
  }
  return invertedIndex;
}
```
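As a rough, runnable sketch of the same idea in JavaScript (the tokenizer and canonicalization below are deliberately naive assumptions, not the article's actual implementation):

```javascript
// Build a tiny inverted index with logarithmic (base-10) tf weights.
// Each posting records the document id and the tf weight for that term.
function tfWeight(count) {
  return count === 0 ? 0 : 1 + Math.log10(count);
}

function createInvertedIndex(corpus) {
  const index = {};
  corpus.forEach((text, docId) => {
    // Count raw term occurrences per document first.
    const counts = {};
    for (const word of text.toLowerCase().split(/\W+/).filter(Boolean)) {
      counts[word] = (counts[word] || 0) + 1;
    }
    // One posting per (term, document) pair.
    for (const term of Object.keys(counts)) {
      (index[term] = index[term] || { postings: [] })
        .postings.push({ doc: docId, tfWeight: tfWeight(counts[term]) });
    }
  });
  return index;
}
```

For example, indexing `["tools made of galvanized iron", "iron iron iron"]` produces an `"iron"` entry with two postings, the second carrying a tf weight of 1 + log₁₀(3) ≈ 1.48.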

## Inverse Document Frequency Models (Information Content Weighting)

Rare terms tend to be more informative than frequent terms. Consider a query term that is rare in the collection, e.g., “galvanized.” A document containing this term is very likely to be relevant to the information need behind a query like “tools made of galvanized iron.” Frequently occurring terms tend to be less informative than rare ones, but not entirely useless. A document containing a more common term like “tools,” for example, is more likely to be relevant than a document without that term, but it’s not as big an indicator of relevance as “galvanized.”

If one search term appears in most documents (e.g., “of”), and another search term appears in fewer (e.g., “tools”), the one that appears more frequently is considered to have less information content than the less frequent term. For a given term *t*, a search index weights its information content by calculating the inverse document frequency *idf(t)*, which measures the term’s rarity across all documents. It thus can assign a greater weight to rare terms, a lesser weight to more common terms, and little to no weight to very frequent terms.

A cruder approach is to define a set of “stop words” (such as “a,” “is,” “the,” “of,” and “to”) and ignore them entirely. The inverse document frequency approach is more general: because stop words are abundant in the corpus, typical inverse document frequency models naturally assign them an *idf(t)* weight near zero, so no hand-maintained list is needed.

There are multiple inverse document frequency models. Below, *N* represents the total number of documents (*corpus.documents.count*) and *dc(term)* represents the raw count of documents containing *term*.

```
// Constant (no consideration for document frequency)
function idf(term) {
  return 1;
}

// Raw inverse document frequency
function idf(term) {
  return N / dc(term);
}

// Logarithmic inverse document frequency
function idf(term) {
  return log(N / dc(term));
}

// Probabilistic inverse document frequency
function idf(term) {
  return max(0, log((N - dc(term)) / dc(term)));
}
```

In the logarithmic inverse-document-frequency model, *idf(t)* is obtained by dividing the total number of documents *N* by *dc(t)*, the raw count of documents containing *t*, and then taking the logarithm of that quotient. The base of the log function does not affect the ranking since it constitutes a constant multiplicative factor toward the overall result for all documents.
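A quick worked example makes the scale concrete (the corpus size and document counts below are illustrative figures, not measurements from a real corpus):

```javascript
// Logarithmic inverse document frequency: idf(t) = log10(N / dc(t)).
const N = 1000000; // hypothetical total number of documents

function idf(docCount) {
  return Math.log10(N / docCount);
}

// A stop word appearing in nearly every document gets near-zero weight,
// while a rare term gets a large one:
idf(800000); // e.g. "the": log10(1.25) ≈ 0.097
idf(100);    // e.g. "galvanized": log10(10000) = 4
```

So a match on “galvanized” contributes roughly forty times the information-content weight of a match on “the,” without any explicit stop-word list.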

To create an inverted index with inverse document frequency weights, the server calculates the weight after processing all documents. The following pseudo code does so using the logarithmic inverse document frequency weight model:

```
function idf(term) {
  var postings = invertedIndex[term].postings;
  return log(N / postings.count);
}

function createInvertedIndex() {
  var invertedIndex = [];
  foreach (document in corpus) {
    foreach (word in tokenize(document)) {
      var term = canonicalize(word);
      var indexEntry = invertedIndex[term];
      if (indexEntry == null) {
        invertedIndex[term] = indexEntry = { idf: null, postings: [] };
      }
      if (!indexEntry.postings.any(p => p.doc == document.Id)) {
        indexEntry.postings.add({ doc: document.Id, tf: tf(term, document) });
      }
    }
  }
  // Calculate inverse document frequency weights after all documents
  // have been processed, since idf depends on the full posting lists.
  foreach (term in invertedIndex.keys) {
    invertedIndex[term].idf = idf(term);
  }
  return invertedIndex;
}
```

## Cosine Similarity (Document Length/Focus) Normalization

If the corpus includes both long and short documents, the longer documents are more likely to contain the search terms than the shorter ones. To avoid bias toward longer documents, *tf-idf* scores are normalized based on document length. For example, if document *d₁* has only 100 distinct terms in it and document *d₂* has 1000 distinct terms in it, *d₁* has a narrower focus than *d₂*. If both documents contain search term *t* an equal number of times, *d₁* is considered more relevant to *t* than the more broadly focused *d₂*.

There are several document focus normalization models. The cosine normalization model, for example, works well in practice and has an intuitive geometric interpretation in which vectors represent the query and the documents. The multidimensional space for those vectors has a dimension for each term in the vocabulary, and a document or query vector’s component in any particular dimension is the *tf-idf* weight for the dimension’s term. Rather than comparing documents by the raw distance between their vectors and the query vector, cosine similarity measures the cosine of the angle between each document vector and the query vector, which effectively normalizes away differences in vector length.
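The geometric interpretation can be sketched directly: cosine similarity is the dot product of two weight vectors divided by the product of their Euclidean lengths (this is the standard formula, shown here on plain arrays for illustration):

```javascript
// Cosine similarity between two equal-length weight vectors:
// dot(a, b) / (|a| * |b|). Parallel vectors give 1, orthogonal give 0.
function cosineSimilarity(a, b) {
  let dot = 0, sqA = 0, sqB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    sqA += a[i] * a[i];
    sqB += b[i] * b[i];
  }
  return dot / (Math.sqrt(sqA) * Math.sqrt(sqB));
}
```

Note that `cosineSimilarity([1, 2], [2, 4])` is 1: a document whose weight vector is simply a scaled-up copy of another’s (a longer but identically focused document) scores identically against any query, which is exactly the length normalization the model is after.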

Algorithmically, documents can be ranked in decreasing order of cosine similarity (equivalently, increasing angle) between their vector and the query vector by dividing each document’s score by the square root of the sum of the squares of all the term weights for that document. For efficiency, normalization factors are calculated when the inverted index is built so that they need not be calculated during query execution:

```
var squareOfCosSimNormFactors = [];
foreach (indexEntry in invertedIndex) {
  var idf = indexEntry.idf;
  foreach (posting in indexEntry.postings) {
    squareOfCosSimNormFactors[posting.docId] += (posting.tfWeight * idf)^2;
  }
}
foreach (docId in squareOfCosSimNormFactors.keys) {
  documents[docId].normalizationFactor = 1 / sqrt(squareOfCosSimNormFactors[docId]);
}

function normalizedTfIdfWeight(posting, term) {
  return posting.document.normalizationFactor * posting.tfWeight * term.idfWeight;
}
```

The same normalization applies whichever *tf* and *idf* models are chosen, logarithmic or otherwise: the two weights are multiplied together, and each document’s score is scaled by its precomputed normalization factor.
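Putting the pieces together, query-time scoring under this scheme might look like the following sketch. The index layout (`term → { idf, postings: [{ docId, tfWeight }] }`) and the `normFactors` map are assumptions modeled on the pseudocode above, not the article’s actual implementation:

```javascript
// Score documents for a query by accumulating normalized tf-idf
// contributions from each query term's posting list, then rank by
// decreasing score (i.e., decreasing cosine similarity to the query).
function scoreQuery(queryTerms, invertedIndex, normFactors) {
  const scores = {};
  for (const term of queryTerms) {
    const entry = invertedIndex[term];
    if (!entry) continue; // term absent from the corpus
    for (const posting of entry.postings) {
      const w = normFactors[posting.docId] * posting.tfWeight * entry.idf;
      scores[posting.docId] = (scores[posting.docId] || 0) + w;
    }
  }
  return Object.keys(scores).sort((a, b) => scores[b] - scores[a]);
}
```

With a two-document toy index where “galvanized” appears only in one well-focused document, that document outranks a longer document matching only on “tools,” as the idf and normalization weighting predict.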
