Retrieval and Evaluation Techniques for Personal Information
Jin Young Kim
Ph.D. Dissertation Seminar, 7/26

Personal Information Retrieval (PIR)
- The practice and the study of supporting users in retrieving their personal information effectively.

Personal Information Retrieval in the Wild
- Everyone has unique information and practices
  - Different information and information needs
  - Different preferences and behaviors
- Many existing software solutions
  - Platform-level: desktop search, folder structures
  - Application-level: email, calendar, office suites

Previous Work in PIR (Desktop Search)
- Focus
  - User interface issues [Dumais03,06]
  - Desktop-specific features [Solus06] [Cohen08]
- Limitations
  - Each based on a different environment and user group
  - None of them performed a comparative evaluation
  - Research findings do not accumulate over the years

Our Approach
- Develop general techniques for PIR
  - Start from the essential characteristics of PIR
  - Applicable regardless of users and information types
- Make contributions to related areas
  - Structured document retrieval
  - Simulated evaluation for known-item finding
- Build a platform for sustainable progress
  - Develop repeatable evaluation techniques
  - Share the research findings and the data

Essential Characteristics of PIR
- Many document types, with unique metadata for each type
- People combine search and browsing [Teevan04]
- Long-term interactions with a single user
- People mostly find known items [Elsweiler07]
- Privacy concerns around the data set
- These characteristics motivate the three components of this work: field-based search models, the associative browsing model, and simulated evaluation methods.

Search and Browsing Retrieval Models
- Challenge: users may remember different things about a document. How can we present effective results in both cases?
- [Diagram: lexical memory leads to search with a keyword query (e.g., "james registration"); associative memory leads to browsing.]

Information Seeking Scenario in PIR
- A user initiates a session with a keyword query ("james registration 2011").
- The user switches to browsing by clicking on an email document.
- The user switches back to search with a different query.
- (A small sketch of this alternating search-and-browse session appears after the research questions below.)

Simulated Evaluation Techniques
- Challenge: the user's query originates from what she remembers. How can we simulate the user's querying behavior realistically?

Research Questions
- Field-based Search Models
  - How can we improve retrieval effectiveness in PIR?
  - How can we improve type prediction quality?
- Associative Browsing Model
  - How can we enable browsing support for PIR?
  - How can we improve the suggestions for browsing?
- Simulated Evaluation Methods
  - How can we evaluate a complex PIR system by simulation?
  - How can we establish the validity of simulated evaluation?
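The following is a minimal sketch of the alternating search-and-browse session described in the scenario above. The class names, fields, and document ids (Action, Session, doc_42, ...) are hypothetical, chosen only to illustrate the interaction flow, not code from the dissertation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Action:
    kind: str                 # "search" (keyword query) or "browse" (click on a suggestion)
    user_input: str           # query string, or the id of the clicked item
    results: List[str] = field(default_factory=list)   # ranked ids shown to the user

@dataclass
class Session:
    target: str               # the known item the user is trying to re-find
    actions: List[Action] = field(default_factory=list)

    def succeeded(self) -> bool:
        # The session succeeds once any result list (from search or browsing)
        # contains the target document.
        return any(self.target in a.results for a in self.actions)

# Example: search, switch to browsing from a clicked email, reach the target.
s = Session(target="doc_42")
s.actions.append(Action("search", "james registration 2011", ["doc_7", "doc_13"]))
s.actions.append(Action("browse", "doc_13", ["doc_42", "doc_99"]))
print(s.succeeded())  # True
```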
Field-based Search Models

Searching for Personal Information
- An example of desktop search.

Field-based Search Framework for PIR
- Type-specific ranking: rank documents within each document collection (type).
- Type prediction: predict the document type relevant to the user's query.
- Final results generation: merge into a single ranked list.

Type-specific Ranking for PIR
- Individual collections have type-specific features
  - Thread-based features for emails
  - Path-based features for documents
- Most of these documents have rich metadata
  - Email: <sender, receiver, date, subject, body>
  - Document: <title, author, abstract, content>
  - Calendar: <title, date, place, participants>
- We focus on developing general retrieval techniques for structured documents.

Structured Document Retrieval
- Field operator / advanced search interface
- A user's search terms are found in multiple fields
  - Understanding Re-finding Behavior in Naturalistic Email Interaction Logs. Elsweiler, D., Harvey, M., Hacker, M. [SIGIR'11]

Structured Document Retrieval: Models
- Document-based retrieval model: score each document as a whole.
- Field-based retrieval model: combine evidence from each field f1..fn, using field weights w1..wn.

Field Relevance Model for Structured IR
- Field relevance: different fields are important for different query terms.
  - 'registration' is relevant when it occurs in <subject>.
  - 'james' is relevant when it occurs in <to>.

Estimating the Field Relevance: Overview
- If the user provides feedback: the relevant document provides sufficient information.
- If no feedback is available: combine field-level term statistics from multiple sources (collection + top-k documents ≅ relevant documents).

Estimating Field Relevance using Feedback
- Assume a user who marked document DR as relevant.
- Estimate the field relevance from the field-level term distribution of DR (e.g., <to> is relevant for 'james'; <content> is relevant for 'registration').
- We can personalize the results accordingly: rank higher the documents with a similar field-level term distribution.
- This weighting is provably optimal under the language-modeling retrieval framework.

Estimating Field Relevance without Feedback
- Linear combination of multiple sources; weights estimated using training queries.
- Features
  - Field-level term distribution of the collection: unigram and bigram LM (the unigram case is the same as PRM-S).
  - Field-level term distribution of the top-k documents: unigram and bigram LM (pseudo-relevance feedback).
  - A priori importance of each field (wj): estimated using held-out training queries (similar to MFLM and BM25F).

Retrieval Using the Field Relevance: Comparison with Previous Work
- Previous field-based models combine the per-field scores of each query term with fixed field weights w1..wn and sum them.
- The field relevance model instead weights each field by a per-term field relevance P(Fj|qi), sums over fields, and multiplies the resulting per-term scores across query terms.

Ranking in the Field Relevance Model
- Per-term field score: P(qi | θD,j), the language-model score of query term qi in field Fj of document D.
- Per-term field weight: P(Fj | qi), the estimated field relevance.
- score(Q, D) = Π_i Σ_j P(Fj | qi) · P(qi | θD,j)
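Below is a minimal sketch (not the dissertation's implementation) of the per-term field weighting idea above: each query term carries its own distribution over fields, in contrast to the fixed field weights of BM25F/MFLM. The smoothing constant, field names, and toy statistics are assumptions for illustration only.

```python
import math
from collections import Counter

MU = 100.0  # Dirichlet smoothing parameter (assumed value)

def field_lm(term, field_text, collection_field_text):
    """Dirichlet-smoothed P(term | document field)."""
    tf = Counter(field_text).get(term, 0)
    cf = Counter(collection_field_text).get(term, 0)
    p_coll = (cf + 1) / (len(collection_field_text) + 1)   # crude collection estimate
    return (tf + MU * p_coll) / (len(field_text) + MU)

def score(query, doc, collection, field_relevance):
    """sum_i log sum_j P(F_j | q_i) * P(q_i | theta_{D,j})"""
    total = 0.0
    for term in query:
        per_term = 0.0
        for f in doc:
            # Per-term field weight; falls back to uniform if unknown.
            w = field_relevance.get(term, {}).get(f, 1.0 / len(doc))
            per_term += w * field_lm(term, doc[f], collection[f])
        total += math.log(per_term)
    return total

# Toy example: 'james' matters in <to>, 'registration' in <subject>.
doc = {"to": "james smith".split(),
       "subject": "registration deadline".split(),
       "body": "please register before friday".split()}
collection = {f: doc[f] * 3 for f in doc}   # stand-in collection statistics
field_rel = {"james": {"to": 0.8, "subject": 0.1, "body": 0.1},
             "registration": {"to": 0.1, "subject": 0.7, "body": 0.2}}
print(score(["james", "registration"], doc, collection, field_rel))
```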
Evaluating the Field Relevance Model
- Retrieval effectiveness (metric: Mean Reciprocal Rank)

              DQL     BM25F   MFLM    FRM-C   FRM-T   FRM-R
    TREC      54.2%   59.7%   60.1%   62.4%   66.8%   79.4%
    IMDB      40.8%   52.4%   61.2%   63.7%   65.7%   70.4%
    Monster   42.9%   27.9%   46.0%   54.2%   55.8%   71.6%

- DQL, BM25F, and MFLM use fixed field weights; FRM-C, FRM-T, and FRM-R use per-term field weights.

Type Prediction Methods
- Field-based collection query-likelihood (FQL)
  - Calculate a QL score for each field of a collection.
  - Combine the field-level scores into a collection score.
- Feature-based method
  - Combine existing type-prediction methods.
  - Grid search / SVM for finding the combination weights.

Type Prediction Performance
- Evaluated on the pseudo-desktop collections and the CS collection (% of queries with the correct prediction).
- FQL improves performance over CQL.
- Combining features improves the performance further.

Summary So Far…
- Field relevance model for structured document retrieval
  - Enables relevance feedback through field weighting.
  - Improves performance using linear feature-based estimation.
- Type prediction methods for PIR
  - Field-based type prediction method (FQL).
  - A combination of features improves the performance further.
- We now move on to the associative browsing model: what happens when users cannot recall good search terms?

Associative Browsing Model

Recap: Retrieval Framework for PIR
- Keyword search and associative browsing over the same personal collection.

User Interaction for Associative Browsing
- Users reach a concept or document page by search.
- The system provides a list of suggestions for browsing.

Data Model and User Interface
- Data model: how can we build associations? Automatically? Manually?
- User interface: how would it match the user's preference?
- "Participants wouldn't create associations beyond simple tagging operations" - Sauermann et al. 2005

Building the Associative Browsing Model
1. Document collection
2. Concept extraction
3. Link extraction (term similarity, temporal similarity, co-occurrence)
4. Link refinement (click-based training)

Link Extraction and Refinement
- Link scoring: a combination of link-type scores, S(c1,c2) = Σi [ wi × Linki(c1,c2) ]. (A sketch of this scoring step follows this slide.)
- Link presentation: a ranked list of suggested items; users click on them for browsing.
- Link types
  - Concepts: term vector similarity, temporal similarity, tag similarity, string similarity, co-occurrence.
  - Documents: term vector similarity, temporal similarity, tag similarity, path/type similarity, concept similarity.
- Link refinement (training wi): maximize click-based relevance.
  - Grid search: maximize retrieval effectiveness (MRR).
  - RankSVM: minimize the error in pairwise preferences.
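A minimal sketch of the link-scoring step above, assuming precomputed per-type similarity features: each candidate link is scored as S(c1,c2) = Σi wi × Linki(c1,c2) and the top-scoring items are shown as browsing suggestions. The feature names, weights, and item ids are illustrative, not the values learned from click data in the dissertation.

```python
# Per-type link features assumed to be precomputed for each candidate item.
FEATURES = ["term_sim", "temporal_sim", "tag_sim", "cooccurrence"]

def link_score(features, weights):
    # Weighted sum of link-type scores: S(c1, c2) = sum_i w_i * Link_i(c1, c2)
    return sum(weights[name] * features.get(name, 0.0) for name in FEATURES)

def suggest(source, candidates, weights, k=5):
    """Rank candidate items linked to `source` by combined link score."""
    scored = [(link_score(feats, weights), target) for target, feats in candidates.items()]
    scored.sort(reverse=True)
    return [target for _, target in scored[:k]]

# Toy candidates for the concept "search engine".
candidates = {
    "doc_lecture_notes": {"term_sim": 0.6, "temporal_sim": 0.1, "cooccurrence": 0.3},
    "doc_sigir_email":   {"term_sim": 0.4, "temporal_sim": 0.7, "tag_sim": 0.2},
}
weights = {"term_sim": 0.5, "temporal_sim": 0.3, "tag_sim": 0.1, "cooccurrence": 0.1}
print(suggest("search engine", candidates, weights))
```

In the dissertation the weights wi are trained from clicks (grid search maximizing MRR, or RankSVM on pairwise preferences); here they are simply fixed by hand.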
Evaluating the Associative Browsing Model
- Data set: the CS collection (public documents collected in the UMass CS department).
- Value of browsing for known-item finding
  - CS department members competed in known-item finding tasks.
  - % of sessions in which browsing was used.
  - % of sessions in which browsing was used and led to success.
- Quality of browsing suggestions
  - Mean Reciprocal Rank using clicks as judgments.
  - 10-fold cross-validation over the collected click data.

Value of Browsing for Known-item Finding

    Evaluation Type                      Total (#sessions)   Browsing used    Successful outcome
    Simulation                           63,260              9,410 (14.8%)    3,957 (42.0%)
    User Study (1): Document only        290                 42 (14.5%)       15 (35.7%)
    User Study (2): Document + Concept   142                 43 (30.2%)       32 (74.4%)

- Comparison with the simulation results: the simulation roughly matches the user studies in overall usage and success ratio.
- The value of associative browsing: browsing was used in about 30% of all sessions (user study 2), and when used it led to success in about 75% of those sessions.

Quality of Browsing Suggestions
- [Charts: concept-browsing and document-browsing MRR on the CS collection (Top1 and Top5 clicks), comparing single link features (title, content, tag, time, string, path/type, topic, co-occurrence, concept) with uniform, grid-search, and RankSVM weight combinations.]

Simulated Evaluation Methods

Challenges in PIR Evaluation
- Hard to create a 'test collection'
  - Each user has different documents and habits.
  - People will not donate their documents and queries for research.
- Limitations of user studies
  - Experimenting with a working system is costly.
  - Experimental control is hard with real users and tasks.
  - The data is not reusable by third parties.

Our Approach: Simulated Evaluation
- Simulate the components of evaluation
  - Collection: the user's documents with metadata.
  - Task: search topics and relevance judgments.
  - Interaction: query and click data.

Simulated Evaluation Overview
- Simulated document collections
  - Pseudo-desktop collections: subsets of the W3C mailing list plus other document types.
  - CS collection: UMass CS mailing list / calendar items / crawl of homepages.
- Evaluation methods

                             Controlled User Study             Simulated Interaction
    Field-based search       DocTrack search game              Query generation methods
    Associative browsing     DocTrack search+browsing game     Probabilistic user modeling

Controlled User Study: DocTrack Game
- Procedure
  - Collect public documents in the UMass CS department (CS collection).
  - Build a web interface where participants can find documents.
  - People in the CS department participated.
- DocTrack search game: 20 participants / 66 games played; 984 queries collected for 882 target documents.
- DocTrack search+browsing game: 30 participants / 53 games played; 290 + 142 search sessions collected.

DocTrack Game
- Participants are shown a target item and asked to find it.
- Users can use both search and browsing in the DocTrack search+browsing game.

Query Generation for Evaluating PIR
- Known-item finding for PIR
  - A target document represents an information need.
  - Users would take terms from the target document.
- Query generation for PIR
  - Randomly select a target document.
  - Algorithmically take terms from the document.
- Parameters of query generation (see the sketch below)
  - Choice of extent: document [Azzopardi07] vs. field.
  - Choice of term: uniform vs. TF vs. IDF vs. TF-IDF [Azzopardi07].
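A minimal sketch of the term-selection side of this query generator: sample query terms from a target document under a uniform, TF, IDF, or TF-IDF distribution, roughly in the spirit of [Azzopardi07]. The toy collection, query length, and the particular IDF formula are assumptions for illustration only.

```python
import math
import random
from collections import Counter

def term_distribution(doc_terms, collection_docs, model="tfidf"):
    """Term-selection distribution over the target document's vocabulary."""
    tf = Counter(doc_terms)
    n_docs = len(collection_docs)
    weights = {}
    for term, count in tf.items():
        df = sum(1 for d in collection_docs if term in d)
        idf = math.log((n_docs + 1) / (df + 1)) + 1
        weights[term] = {"uniform": 1.0, "tf": count,
                         "idf": idf, "tfidf": count * idf}[model]
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}

def generate_query(doc_terms, collection_docs, length=2, model="tfidf"):
    # Sample `length` terms (with replacement) from the chosen distribution.
    dist = term_distribution(doc_terms, collection_docs, model)
    terms, probs = zip(*dist.items())
    return random.choices(terms, weights=probs, k=length)

# Toy collection of three documents (as term sets) and one target document.
collection = [set("registration deadline for sigir".split()),
              set("james sent the registration form".split()),
              set("weekly group meeting notes".split())]
target = "james sent the registration form".split()
print(generate_query(target, collection, length=2, model="tfidf"))
```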
Validating Generated Queries
- Basic idea
  - Use the set of human-generated queries for validation.
  - Compare at the level of query terms and retrieval scores.
- Validation by comparing query terms: the generation probability of a manual query q under the term-selection distribution Pterm.
- Validation by comparing retrieval scores [Azzopardi07]: two-sided Kolmogorov-Smirnov test.

Validation Results for Generated Queries
- Validation based on query terms.
- Validation based on the retrieval score distribution.

Probabilistic User Model for PIR
- Query generation model: term selection from a target document.
- State transition model: use browsing when the result looks marginally relevant.
- Link selection model: click on browsing suggestions based on perceived relevance.

A User Model for Link Selection
- User's level of knowledge
  - Random: clicks randomly on the ranked list.
  - Informed: more likely to click on the more relevant items.
  - Oracle: always clicks on the most relevant item.
- Relevance is estimated using the position of the target item.

Success Ratio of Browsing
- Varying the level of knowledge and the fan-out of the simulation.
- Exploration is valuable for users with a low level of knowledge.
- [Chart: success ratio (roughly 0.30-0.48) for random, informed, and oracle users at fan-out FO1-FO3, i.e., with increasing exploration.]

Community Efforts using the Data Sets

Conclusions & Future Work
- Major contributions
  - Field-based search models: the field relevance model for structured document retrieval; field-based and combination-based type prediction methods.
  - Associative browsing model: an adaptive technique for generating browsing suggestions; an evaluation of associative browsing in known-item finding.
  - Simulated evaluation methods for known-item finding: the DocTrack game for controlled user studies; a probabilistic user model for generating simulated interaction.

Field Relevance for Complex Structures
- The current work assumes documents with a flat structure.
- Field relevance for complex structures?
  - XML documents with a hierarchical structure.
  - Joined database relations with a graph structure.

Cognitive Model of Query Generation
- Current query generation methods assume that queries are generated from the complete document and that query terms are chosen independently of one another.
- Relaxing these assumptions: model the user's degradation in memory; model the dependency in query-term selection.
- Ongoing work: a graph-based representation of documents, where query terms can be chosen by a random walk.

Thank you for your attention!
- Special thanks to my advisor, coauthors, and all of you here!
- Are we closer to the superhuman now?

One More Slide: What I Learned…
- Start from what is happening in the user's mind: field relevance, query generation, …
- Balance user input and algorithmic support: generating suggestions for associative browsing.
- Learn from your peers and make contributions: the query generation method and DocTrack game; simulated test collections and a workshop.