State-of-the-Art Mobile Search is a series exploring how to implement advanced mobile search.
State-of-the-Art Mobile Search Part 4: Fields and Phrases
The inverted index built in earlier parts of this series uses an undefined function canonicalize(word) to convert strings of characters into a standard form. Doing so accounts for the fact that most words in English and similar languages have multiple forms. Consider a query like the following:
“3d printing donuts”
Crude search engines match the literal words of the search query against the literal words of the document collection using case-insensitive substring matching. This approach is clearly deficient: it fails to match the query above against documents that contain the following:
- “3D Printers Make Donuts Healthy”
- “… a 3D-printed donut….”
- “Dunkin Donuts has made a 3D printer.”
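The failure is easy to reproduce. The following is a minimal Python sketch of literal matching, assuming that a document matches only when every query word occurs in it as a case-insensitive substring. The function name literal_match and the variable names are hypothetical and are not part of the index built earlier in the series.

```python
def literal_match(query: str, document: str) -> bool:
    """Return True only if every query word is a case-insensitive substring of the document."""
    text = document.lower()
    return all(word in text for word in query.lower().split())


QUERY = "3d printing donuts"
DOCUMENTS = [
    "3D Printers Make Donuts Healthy",
    "… a 3D-printed donut….",
    "Dunkin Donuts has made a 3D printer.",
]

# One boolean per document, in the same order as DOCUMENTS.
results = [literal_match(QUERY, doc) for doc in DOCUMENTS]
print(results)
```

None of the three documents matches, because the literal string "printing" appears in none of them, even though each is plainly relevant to the query.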
To match the search query above with those documents, search engines can employ various types of term canonicalization that ignore non-semantic details like grammatical class, so printing matches print, printed, printers, etc. The most common approach for English-language search is known as stemming.
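As a rough illustration of the idea, here is a toy version of canonicalize(word) in Python. It is only a sketch under stated assumptions: the suffix list, the minimum stem length of three characters, and the helpers tokenize and canonical_match are all hypothetical, and the suffix stripping is far cruder than a real stemming algorithm such as the Porter stemmer.

```python
import re

# Hypothetical suffix list, checked in order so that "ers" is tried before "er" and "s".
SUFFIXES = ("ing", "ers", "er", "ed", "s")


def canonicalize(word: str) -> str:
    """Lowercase the word and strip one common suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text: str) -> list[str]:
    """Split text into lowercase runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def canonical_match(query: str, document: str) -> bool:
    """Return True if every canonicalized query term appears among the document's terms."""
    doc_terms = {canonicalize(term) for term in tokenize(document)}
    return all(canonicalize(term) in doc_terms for term in tokenize(query))


print(canonicalize("printing"), canonicalize("printers"), canonicalize("donuts"))
print(canonical_match("3d printing donuts", "Dunkin Donuts has made a 3D printer."))
```

With this sketch, printing, printed, printers, and printer all reduce to print, and donuts reduces to donut, so the query matches all three example documents. A production stemmer handles many cases this toy ignores, such as doubled consonants and irregular plurals.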