State-of-the-art mobile search part 8: evaluation

State-of-the-Art Mobile Search is a series exploring how to implement advanced mobile search.


Rod Smith

This blog series has presented search features for offline-capable mobile apps. Most aspects of the solutions were explained with multiple implementation options, e.g., the TF-IDF models, a unigram vs. a bigram language model, the noisy channel edit model, and indexing of term positions and/or related terms. To evaluate the fitness of any particular combination of these implementation options for a particular mobile app, it helps to quantify the quality of the search results.

Evaluating the results

Ultimately, the most important measure is user satisfaction. Unfortunately, that’s a fairly subjective and abstract attribute. It may seem obvious, but ordinarily the chief measure of user satisfaction is the relevance of the results to the user’s information need. To measure the relevance of ranked search results, consider the metrics of precision, recall, and F-measure:

  • Precision: Proportion of all the search results that are relevant to the user’s information need, as opposed to irrelevant search results.
  • Recall: Proportion of all relevant documents that appear in the search results, as opposed to relevant documents missing from the results.
  • F-measure: the harmonic mean of precision and recall, i.e.,
    F = 2 * (precision * recall) / (precision + recall)

If a search engine scores poorly in either precision or recall, it compromises the ability to answer the user’s information need. Poor recall may exclude the best documents, while poor precision forces the user to wade through too many irrelevant results. (The extreme case of poor precision is a search engine that simply returns all the documents of the collection.) The F-measure combines these two metrics.
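
As a concrete illustration, these metrics reduce to a few lines of arithmetic. The following is a minimal sketch in Objective-C/C (the FMeasure helper and its example counts are hypothetical, not part of the series’ code):

    #import <Foundation/Foundation.h>

    // Minimal sketch: F-measure as the harmonic mean of precision and recall.
    // relevantRetrieved = relevant documents among the returned results,
    // totalRetrieved    = all returned results,
    // totalRelevant     = all relevant documents in the collection.
    static double FMeasure(double relevantRetrieved,
                           double totalRetrieved,
                           double totalRelevant) {
        if (relevantRetrieved == 0.0) return 0.0;  // avoid division by zero
        double precision = relevantRetrieved / totalRetrieved;
        double recall    = relevantRetrieved / totalRelevant;
        return 2.0 * (precision * recall) / (precision + recall);
    }

    // e.g., FMeasure(2, 3, 3) => 0.67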

Search engines are also commonly evaluated from the perspective of the speed of search execution, i.e., search latency. Typically, search latency is in the sub-second range and is a function of index size. Latency metrics may be relevant, but search engineers may put too much emphasis on speed and not enough on precision or recall. Given search engine E1 that retrieves 10 relevant results and 10 irrelevant results in 0.01 seconds vs. search engine E2 that retrieves 20 relevant results and 5 irrelevant results in 2.00 seconds, most users would prefer the slower but more relevant search engine E2.

Another common measurement is the indexing speed, which may be given in terms of documents per hour at a particular average document size. Indexing as presented here occurs on the server, where speed can be improved by adding more hardware (especially easy in the cloud). There are other ways to evaluate search engines, e.g., by the expressiveness of query language or the ability to express complex queries, but such types of evaluation are not considered here.

To evaluate the overall quality of ranked results, a common approach is to average the F-measure across multiple “levels of recall”:

  1. Translate an information need into a query, e.g.:
    • Information need: “I’m looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.”
    • Query: “red or white wine more effective against heart attacks”
  2. Select a “level of recall” appropriate for answering the information need from the document collection. For example, if only three documents in the corpus meet the information need, the level of recall to evaluate may be two or three, depending on the number of relevant documents required to answer the information need thoroughly. If 100 documents meet the information need, a level of recall of 5-10 may be more appropriate for the evaluation. For this evaluation, we choose level of recall 3, meaning we evaluate the results up to the point where all three relevant documents have been retrieved.
  3. Execute the search and review the results.
  4. Assess the relevance of each result relative to the information need rather than the query. That is, evaluate whether each document returned addresses the information need, not whether it contains the search query words.
  5. Consider successively larger sets of the top-ranked documents and calculate the F-measure at each level of recall relevant to the information need.
  6. Average the F-measure across the relevant levels of recall.

Doing so with a query that three documents answer may yield results like the following:

  • Result 1 is relevant (i.e., it helps answer the information need).
  • Result 2 is irrelevant (i.e., it does not help answer the information need).
  • Result 3 is relevant.
  • Result 4 is irrelevant.
  • Result 5 is irrelevant.
  • Result 6 is relevant.
  • Subsequent results are irrelevant.

Level of recall 1 is met with the first returned search result, while levels 2 and 3 are met at the 3rd and 6th results. So, the search engine has the following precision, recall, and F-measures at the first three levels of recall:

  • Level of recall 1:
    • Precision 1.00 (1 relevant/1 result)
    • Recall 0.33 (1 relevant retrieved/3 relevant documents)
    • F-measure 0.50 (2*(1.00*0.33)/(1.00+0.33))
  • Level of recall 2:
    • Precision 0.67 (2 relevant/3 results)
    • Recall 0.67 (2 relevant retrieved/3 relevant documents)
    • F-measure 0.67 (2*(0.67*0.67)/(0.67+0.67))
  • Level of recall 3:
    • Precision 0.50 (3 relevant/6 results)
    • Recall 1.00 (3 relevant retrieved/3 relevant documents)
    • F-measure 0.67 (2*(0.50*1.00)/(0.50+1.00))

The average F-measure over all levels of recall for the search results above is 61% (the average of F-measures 50%, 67%, and 67%). To ensure that the measurement captures the score for typical usage, repeat the above process for multiple typical information needs and average the resulting F-measures.
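
To automate this bookkeeping, a sketch along the following lines (reusing the hypothetical FMeasure helper from the earlier sketch, with relevance judgments collected by hand for each returned result) averages the F-measure over the levels of recall; any level of recall never reached in the results contributes zero:

    // Average F-measure over levels of recall 1..totalRelevant. `judgments` holds a
    // YES/NO relevance judgment for each returned result, in rank order.
    static double AverageFMeasure(NSArray<NSNumber *> *judgments, NSUInteger totalRelevant) {
        double sum = 0.0;
        NSUInteger relevantSeen = 0;
        for (NSUInteger rank = 0; rank < judgments.count && relevantSeen < totalRelevant; rank++) {
            if ([judgments[rank] boolValue]) {
                relevantSeen++;  // one more level of recall is reached at this rank
                sum += FMeasure(relevantSeen, rank + 1, totalRelevant);
            }
        }
        return sum / totalRelevant;
    }

    // The worked example above (results 1, 3, and 6 relevant; 3 relevant documents in total):
    // AverageFMeasure(@[@YES, @NO, @YES, @NO, @NO, @YES], 3) => 0.61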

The F-measure averaged over multiple levels of recall gives evidence of the fitness of a search engine. To select an app’s search features and implementation details based on such evidence, prototype the alternatives (e.g., a bigram language model vs. a unigram language model) and compare the prototypes’ F-measures along with the other pertinent differences, such as the size of the search indices. The F-measures of the prototypes thus enable a quantitative approach to user satisfaction with regard to search in any particular app.

Analytics

When a search solution is deployed, each search query executed is typically logged and batch uploaded for analytics when a connection is available. Search analytics can identify usage patterns that can inform product or program decisions. Note: Analytics should comply with privacy policies, ordinarily by anonymizing the data (excluding personally identifiable information) and by informing users of the anonymous data collection.
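
For example, an iOS app using the Google Analytics SDK might record each executed query as an event, with the query text as the label and the number of products fetched as the value (searchQuery and fetchedProductCount come from the surrounding search code):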

    // Obtain the app's default Google Analytics tracker.
    id<GAITracker> tracker = [[GAI sharedInstance] defaultTracker];
    // Record the executed query as a "search" event, using the query text as the
    // event label and the number of products fetched as the event value.
    [tracker trackEventWithCategory:@"search"
                         withAction:@"searchQuery"
                          withLabel:searchQuery
                          withValue:fetchedProductCount];

Search analytics can help improve search results. For example, if several users search for “metal garden shed” in a product catalog app, a review of search analytics could prompt a content editor to add the word “metal” to the searchable keywords for a product otherwise known as a “steel garden shed,” or a related terms index could be edited to include “metal” as a hypernym for “steel.”
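
For instance, if the related terms index is represented as a simple term-to-related-terms mapping (purely illustrative here; earlier parts of the series describe the actual index structure), the content edit amounts to adding one entry:

    // Illustrative only: map a catalog term to its related terms (synonyms/hypernyms).
    // After reviewing the search analytics, "metal" is added as a hypernym of "steel".
    NSDictionary<NSString *, NSArray<NSString *> *> *relatedTerms = @{
        @"steel": @[ @"metal" ]
    };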

Conclusion

Thanks for reading this mobile search blog series. I hope you use these suggestions to give your app users a great search experience. Feel free to reach out to me at rods@slalom.com with any questions or comments you might have about mobile app search, natural language processing, mobile development, etc.

About rodasmith
Rod is a Slalom Consulting Solution Architect and software developer specializing in mobile applications for HTML5, Android, and iOS, with a passion for natural language processing and machine learning.

