Machine Learning for Personal Information Management
William W. Cohen, Machine Learning Department and Language Technologies Institute, School of Computer Science, Carnegie Mellon University
with Vitor Carvalho, Einat Minkov, Tom Mitchell, Andrew Ng (Stanford) and Ramnath Balasubramanyan

ML for email [Cohen, AAAI Spring Symposium on ML and IR 1996]
• Starting point: Ishmail, an emacs RMAIL extension written by Charles Isbell in summer '95 (largely for Ron Brachman).
• Users could manually write mailbox definitions and filtering rules in Lisp.
• Foldering tasks: a rule-learning method [Cohen, ICML 95] vs. a TfIdf baseline [Rocchio, 71].

Machine Learning in Email
• Why study learning for email?
  1. Email has more visible impact than anything else you do with computers.
  2. Email is hard to manage: people get overwhelmed; people lose important information in email archives; people make horrible mistakes.
• For which tasks can learning help?
  – Foldering ("search, don't sort!")
  – Spam filtering (important and well-studied)
  – Search: beyond keyword search
  – Recognizing errors ("Oops, did I just hit reply-to-all?")
  – Help for tracking tasks ("dropping the ball")

Learning to Search Email [SIGIR 2006, CEAS 2006, WebKDD/SNA 2007]
[Figure: a fragment of an email graph (CALO/CMU data) with typed nodes (terms, messages, email addresses, persons, dates) and typed, directed edges (sent-to, sent-from, has-term, similar-to, email-address-of). For the query "What are Jason's email aliases?", the term "Jason" connects through messages to jernst@cs.cmu.edu and jernst@andrew.cmu.edu via the person node Jason Ernst.]
• Basic idea: learning to search email is learning to query a graph for information.

How do you pose queries to a graph?
• An extended similarity measure via graph walks: propagate "similarity" from start nodes through edges in the graph, accumulating evidence of similarity over multiple connecting paths.
• A fixed probability of halting the walk at every step means that shorter connecting paths have greater importance (exponential decay).
• In practice we can approximate this with a short finite graph walk, implemented with sparse matrix multiplication.
• The result is a list of nodes, sorted by "similarity" to an input node distribution (final node probabilities).

Email, contacts, etc.: a graph
• Graph nodes are typed; edges are directed and typed.
• Multiple edges may connect two given nodes.
• Every edge type is assigned a fixed weight (e.g., uniform), which determines the probability of that edge being followed in a walk.
• Related approaches: random walk with restart, graph kernels, heat diffusion kernels, diffusion processes, Laplacian regularization, graph databases (BANKS, DbExplorer, …), graph mincut, associative Markov networks, …

A query language
• Q = {start nodes, answer node type}: returns a list of nodes (of the requested type) ranked by the graph-walk probabilities.
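The finite-walk approximation described above can be sketched in a few lines. This is a toy illustration, not the authors' implementation: the node names echo the example graph from the slides, but the edges and weights below are made up, and the real system uses typed edges, edge-type weights, and much larger graphs.

```python
def walk_step(dist, edges):
    """One sparse 'matrix multiplication': push probability along edges.

    dist  : {node: probability}
    edges : {node: {neighbor: transition probability}} (rows sum to 1)
    """
    out = {}
    for node, p in dist.items():
        for nbr, w in edges.get(node, {}).items():
            out[nbr] = out.get(nbr, 0.0) + p * w
    return out

def graph_walk(edges, start, steps=3, gamma=0.5):
    """Short finite graph walk with halting probability gamma at each step.

    Probability mass that halts after k steps is discounted by (1-gamma)^k,
    so shorter connecting paths contribute more (exponential decay).
    Returns {node: score}; rank nodes by these final probabilities.
    """
    dist = dict(start)
    scores = {n: gamma * p for n, p in dist.items()}  # walk may halt at a start node
    stay = 1.0
    for _ in range(steps):
        dist = walk_step(dist, edges)
        stay *= 1.0 - gamma          # probability the walk has not halted yet
        for n, p in dist.items():
            scores[n] = scores.get(n, 0.0) + gamma * stay * p
    return scores

# hypothetical mini-graph: a person node, two messages, and a term
edges = {"einat": {"msg2": 0.5, "msg5": 0.5},
         "msg2": {"einat": 1.0},
         "msg5": {"einat": 0.5, "jason": 0.5}}
scores = graph_walk(edges, {"einat": 1.0})
```

Nodes reachable only by longer paths (here, "jason") end up with lower probability than close neighbors, which is the exponential-decay behavior the slides describe.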
• The start nodes are the query "terms".

Tasks that are like similarity queries
• Person name disambiguation: which person does "andy" in this message refer to? [term "andy", file msgId] → "person"
• Threading: what are the adjacent messages in this thread? (A proxy for finding "more messages like this one".) [file msgId] → "file"
• Alias finding: what are the email addresses of Jason? [term "Jason"] → "email-address"
• Meeting attendees finder: which email addresses (persons) should I notify about this meeting? [meeting mtgId] → "email-address"

Learning to search better
• Task T (query class): query a + relevant answers a, query b + relevant answers b, …, query q + relevant answers q.
• The graph walk returns, for each query, a ranked list of nodes (rank 1, 2, 3, …, 50); a re-ranker is trained to reorder this list.
• Standard set of features used for a candidate node x on each problem:
  – edge n-grams in all paths from Vq to x
  – number of reachable source nodes
  – features of top-ranking paths (e.g., edge bigrams)

Learning approach: node re-ordering
• Train task: graph walk → feature generation → learn re-ranker → re-ranking function.
• Test task: graph walk → feature generation → score by the re-ranking function.
• Re-rankers: boosting [Collins & Koo, CL 2005], voted perceptron [Collins, ACL 2002], RankSVM [Joachims, KDD 2002], perceptron committees [Elsas et al., WSDM 2008], …

Tasks that are like similarity queries (recap)
• Person name disambiguation: [term "andy", file msgId] → "person"
• Threading: [file msgId] → "file"
• Alias finding: [term "Jason"] → "email-address"
• Meeting attendees finder: which email addresses (persons) should I notify about this meeting?
[meeting mtgId] → "email-address"

PERSON NAME DISAMBIGUATION

Corpora and datasets
• Person names are hard: nicknames (Dave for David, Kai for Keiko, Jenny for Qing), and common names are ambiguous.
• CSpace email corpus: collected at CMU; 15,000+ emails from a semester-long management course; students formed groups that acted as "companies" and worked together; dozens of groups with some known social connections (e.g., "president").

Results
[Figure: recall vs. rank (ranks 1-10) for name disambiguation on the Mgmt. game corpus, comparing the methods in the table below.]

Results on all three problems (MAP and accuracy; Δ is the relative improvement over the baseline):

Mgmt. Game          MAP   Δ     Acc.  Δ
Baseline            49.0        41.3
Graph - T           72.6  48%   61.3  48%
Graph - T + F       66.3  35%   48.8  18%
Reranked - T        85.6  75%   72.5  76%
Reranked - T + F    89.0  82%   83.8  103%

Enron: Sager-E      MAP   Δ     Acc.  Δ
Baseline            67.5        39.2
Graph - T           82.8  23%   66.7  70%
Graph - T + F       61.7  -9%   41.2  5%
Reranked - T        83.2  23%   68.6  75%
Reranked - T + F    88.9  32%   80.4  105%

Enron: Shapiro-R    MAP   Δ     Acc.  Δ
Baseline            60.8        38.8
Graph - T           84.1  38%   63.3  65%
Graph - T + F       56.5  -7%   38.8  1%
Reranked - T        87.9  45%   65.3  71%
Reranked - T + F    85.5  41%   77.6  103%

Threading: Results - Mgmt.
Game and Enron: Farmer corpora (MAP by message representation).
[Bar charts: threading MAP using different message representations (all; header & body; subject; reply lines; no reply lines; no subject; header & body only). Mgmt. Game values: 73.8, 71.5, 60.3, 58.4, 50.2, 36.2. Enron Farmer values: 79.8, 65.7, 65.1, 36.1.]

Learning approaches
• Edge weight tuning: for each training task, run the graph walk, then update the edge weights to obtain Θ* [Diligenti et al., IJCAI 2005; Toutanova & Ng, ICML 2005; …].
• Node re-ordering: graph walk → feature generation → learn re-ranker (boosting; voted perceptron) → score by the re-ranking function.
• Question: which is better?

Results (MAP)
[Charts: MAP on name disambiguation (M.game, Sager, Shapiro), threading (M.game, Farmer, Germany), alias finding, and meeting-attendee finding, comparing the two learning approaches and their combination; values range roughly from 0.4 to 0.85.]
• Reranking and edge-weight tuning are complementary.
• The best result is usually to tune weights, and then rerank.
• Reranking overfits on small datasets (meetings).

Machine Learning in Email
• Why study learning for email?
• For which tasks can learning help? Foldering; spam filtering; search beyond keyword search; recognizing errors ("Oops, did I just hit reply-to-all?"); help for tracking tasks ("dropping the ball").
[Screenshot: spam statistics from http://www.sophos.com/ …]

Preventing errors in Email [SDM 2007]
Email leak: an email accidentally sent to the wrong person.
• Goal: detect emails accidentally sent to the wrong person.
• Generate artificial leaks: email leaks may be simulated by various criteria: a typo, similar last names, identical first names, aggressive auto-completion of addresses, etc.
• Method: look for outliers.
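The leak-simulation step above can be sketched as follows. This is a hypothetical reading, not the paper's code: it assumes "similar" means sharing a character 3-gram with the true address (one interpretation of the 3g-address criterion described later), and the non-address-book draw is a stand-in constant rather than a real random address.

```python
import random

def simulate_leak(true_recipient, address_book, alpha=0.2, rng=random):
    """Swap in a plausible wrong address for one recipient (a simulated leak).

    With probability alpha, return a random non-address-book address;
    otherwise pick an address-book entry sharing a character 3-gram with
    the true recipient (assumed reading of the '3g-address' criterion),
    falling back to any other entry if none is similar.
    """
    if rng.random() < alpha:
        # stand-in for drawing a random address outside the address book
        return "random.outsider@example.com"
    trigrams = {true_recipient[i:i + 3] for i in range(len(true_recipient) - 2)}
    similar = [a for a in address_book if a != true_recipient
               and any(a[i:i + 3] in trigrams for i in range(len(a) - 2))]
    candidates = similar or [a for a in address_book if a != true_recipient]
    return rng.choice(candidates)

# made-up address book echoing the slides' marina.wang example
address_book = ["marina.wang@enron.com", "martin.wong@enron.com", "bob@other.org"]
```

With alpha = 0, the simulated leak is always a confusable address-book entry; with alpha = 1 it is always an outside address, matching the two branches the method describes.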
Preventing Email Leaks
• Method:
  – Create simulated/artificial email recipients.
  – Build a model of P(recipient | message): train a classifier on real data to detect synthetically created outliers (added to the true recipient list). Features: textual (subject, body) and network features (frequencies, co-occurrences, etc.).
  – Rank potential outliers; detect the outlier and warn the user based on confidence.
• P(rec_t) = the probability that recipient t is an outlier, given the message text and the other recipients in the message; recipients are ranked from most likely to least likely outlier.

Enron Data Preprocessing
• Realistic scenario: for each user, the 10% most recent sent messages are used as the test set.
• Construct address books for all users: the list of all recipients in that user's sent messages.

Simulating Leaks
• Several options: frequent typos, same/similar last names, identical/similar first names, aggressive auto-completion of addresses, etc.
• In this paper, we adopted the 3g-address criterion. On each trial, one of the message's recipients (e.g., marina.wang@enron.com) is randomly chosen and an outlier is generated: with probability α, a random non-address-book address; else (probability 1-α), a randomly selected address-book entry.

Experiments using textual features only
• Three baseline methods:
  – Random: rank recipient addresses randomly.
  – Rocchio/TfIdf centroid [Rocchio 71]: create a TfIdf centroid for each user in the address book; a user1 centroid is the sum of all training messages (in TfIdf vector form) that were addressed to user1. For testing, rank by cosine similarity between the test message and each centroid.
  – Knn-30 [Yang and Chute, SIGIR 94]: given a test message, retrieve the 30 most similar messages in the training set; rank each candidate user by the sum of similarities over that 30-message set.
• Email leak prediction results: Prec@1 in 10 trials.
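The Rocchio/TfIdf-centroid baseline described above can be sketched as follows. This is a minimal illustration with made-up messages and recipients; it defaults every IDF weight to 1.0, whereas the actual baseline would use IDF weights estimated from the training corpus.

```python
import math
from collections import Counter

def tfidf_vector(text, idf=None):
    """Bag-of-words TfIdf vector; unseen words default to IDF 1.0."""
    idf = idf or {}
    return {w: c * idf.get(w, 1.0) for w, c in Counter(text.lower().split()).items()}

def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_recipients(test_message, sent_mail, idf=None):
    """sent_mail: list of (recipient, message text) pairs from training data.

    Each recipient's centroid is the sum of the TfIdf vectors of the training
    messages addressed to them; candidates are ranked by cosine similarity
    between the test message and each centroid.
    """
    centroids = {}
    for recipient, text in sent_mail:
        cen = centroids.setdefault(recipient, {})
        for w, x in tfidf_vector(text, idf).items():
            cen[w] = cen.get(w, 0.0) + x
    query = tfidf_vector(test_message, idf)
    return sorted(centroids, key=lambda r: cosine(query, centroids[r]), reverse=True)

# hypothetical sent-mail history for one user
sent_mail = [("alice@x.com", "budget report for the quarter"),
             ("alice@x.com", "revised budget numbers"),
             ("bob@x.com", "football tickets for saturday")]
ranking = rank_recipients("updated budget figures", sent_mail)
```

The same ranking machinery also supports the outlier view: a listed recipient whose centroid is dissimilar to the message is a leak candidate.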
(On each trial, a different set of outliers is generated.)

Network Features
• How frequently a recipient was addressed, and how recipients co-occurred in the training set.
1. Frequency features: the number of messages received from this user, the number of messages sent to this user, and the number sent + received.
2. Co-occurrence features: the number of times a user co-occurred with all other recipients.
3. Max3g features: for each recipient R, find Rm (the address with the maximum score from R's 3g-address list), then use score(R) - score(Rm) as a feature.
• Combine with the text-only scores using voted-perceptron reranking, trained on simulated leaks with α = 0.
[Chart: precision at rank 1, text-only vs. text plus network features.]

Finding Real Leaks in Enron
• How can we find real leaks? Grep for "mistake", "sorry" or "accident"; the message must also be from one of the Enron users.
• Found 2 good cases:
  1. germanyc/sent/930: the message has 20 recipients; the leak is alex.perkins@
  2. kitchen-l/sent items/497: it has 44 recipients; the leak is rita.wynne@
• Results on real leaks: kitchen-l has 4 unseen addresses out of the 44 recipients; germany-c has only one.

The other kind of recipient error [ECIR 2008]
• How accurately can you fill in missing recipients, using the message text as evidence?
[Chart: mean average precision over 36 users, after using thread information.]

Current prototype (Thunderbird plug-in)
• Leak warnings: hit × to remove a recipient; suggestions: hit + to add one.
• Pause or cancel sending of a message; by default the message is sent after a 10-second timer.
• The classifiers/rankers are written in JavaScript.

Machine Learning in Email
• Why study learning for email?
• For which tasks can learning help?
– Foldering
– Spam filtering
– Search beyond keyword search
– Recognizing errors
– Help for tracking tasks ("dropping the ball")

Dropping the Ball: Speech Acts for Email [EMNLP 2004, SIGIR 2005, ACL Acts WS 2006]

Classifying Speech Acts
• [Carvalho & Cohen, SIGIR 2005]: a relational model including adjacent messages in the thread; a pseudo-likelihood/RDN model with an annealing phase.
• [Carvalho & Cohen, ACL workshop 2006]: IE preprocessing, n-grams, feature extraction, YFLA. (Released as the Ciranda package.)
• Related work: [Dabbish et al., CHI 05; Dredze et al., IUI 06; Khoussainov & Kushmerick, CEAS 2005; Goldstein et al., CEAS 2006; Goldstein & Sabin, HICSS 06]

[Screenshots of the task-tracking prototype: for a detected request with a time/date, "Add Task: follow up on 'request for screen shots' by ___ 2 days before", with candidate dates "next Wed" (12/5/07), "end of the week" (11/30/07), "Sunday" (12/2/07), or other; for a detected commitment, "Add Task: 'METAL - fairly urgent feedback sought' by 'tomorrow noon' (11/29/07)", plus the warning "You are making a commitment! Hit cancel to abort!"]

Conclusions/Summary
• Email is visible and important: a perfect ML application.
• There are lots of interesting problems associated with email processing:
  – Learning to query heterogeneous data graphs.
  – Modeling patterns of interactions: user-user textual communication; user-user communication frequency, recency, …
  – … to predict likely recipients/non-recipients, correct possible errors, and/or aid the user in tracking requests and commitments.

Bibliography: Our Group
• Einat Minkov and William Cohen (2007): Learning to Rank Typed Graph Walks: Local and Global Approaches, in WebKDD-2007.
• Vitor Carvalho, Wen Wu and William Cohen (2007): Discovering Leadership Roles in Email Workgroups, in CEAS-2007.
• Vitor Carvalho and William Cohen (2007): Ranking Users for Intelligent Message Addressing, to appear in ECIR-2008.
• Vitor Carvalho and William W. Cohen (2007): Preventing Information Leaks in Email, in SDM-2007.
• Einat Minkov and William W. Cohen (2006): An Email and Meeting Assistant using Graph Walks, in CEAS-2006.
• Einat Minkov, Andrew Ng and William W. Cohen (2006): Contextual Search and Name Disambiguation in Email using Graphs, in SIGIR-2006.
• Vitor Carvalho and William W. Cohen (2006): Improving Email Speech Act Analysis via N-gram Selection, in the HLT/NAACL ACTS Workshop 2006.
• William W. Cohen, Einat Minkov and Anthony Tomasic (2005): Learning to Understand Web Site Update Requests, in IJCAI-2005.
• Einat Minkov, Richard C. Wang, and William W. Cohen (2005): Extracting Personal Names from Email: Applying Named Entity Recognition to Informal Text, in EMNLP/HLT 2005.
• Vitor Carvalho and William W. Cohen (2005): On the Collective Classification of Email Speech Acts, in SIGIR 2005.
• William W. Cohen, Vitor R. Carvalho and Tom Mitchell (2004): Learning to Classify Email into "Speech Acts", in EMNLP 2004.
• Vitor R. Carvalho and William W. Cohen (2004): Learning to Extract Signature and Reply Lines from Email, in CEAS 2004.

Bibliography: Other Cited Papers
• M. Collins. Ranking algorithms for named-entity extraction: boosting and the voted perceptron. In ACL, 2002.
• M. Collins and T. Koo. Discriminative reranking for natural language parsing. Computational Linguistics, 31(1):25-69, 2005.
• M. Diligenti, M. Gori, and M. Maggini. Learning web page scores by error backpropagation. In IJCAI, 2005.
• T. Joachims. Optimizing search engines using clickthrough data. In KDD, 2002.
• Jonathan Elsas, Vitor R. Carvalho and Jaime Carbonell. Fast learning of document ranking functions with the committee perceptron.
In WSDM-2008 (ACM International Conference on Web Search and Data Mining).
• Y. Yang and C. G. Chute. An example-based mapping method for text categorization and retrieval. ACM Transactions on Information Systems, 12(3), 1994.
• L. A. Dabbish, R. E. Kraut, S. Fussell, and S. Kiesler. Understanding email use: predicting action on a message. In CHI '05, 2005, pp. 691-700.
• M. Dredze, T. Lau, and N. Kushmerick. Automatically classifying emails into activities. In IUI '06, 2006, pp. 70-77.
• D. Feng, E. Shaw, J. Kim, and E. Hovy. Learning to detect conversation focus of threaded discussions. In HLT/NAACL 2006, New York City, NY, 2006.
• J. Goldstein, A. Kwasinski, P. Kingsbury, R. E. Sabin, and A. McDowell. Annotating subsets of the Enron email corpus. In CEAS 2006.
• J. Goldstein and R. E. Sabin. Using speech acts to categorize email and identify email genres. In HICSS '06, vol. 3, p. 50b, 2006.
• R. Khoussainov and N. Kushmerick. Email task management: an iterative relational learning approach. In CEAS 2005.
• David Allen. Getting Things Done: The Art of Stress-Free Productivity. Penguin Books, 2001.