Machine Learning for
Personal Information Management
William W. Cohen
Machine Learning Department and Language Technologies Institute
School of Computer Science
Carnegie Mellon University
and
Vitor Carvalho, Einat Minkov, Tom Mitchell, Andrew Ng (Stanford)
and Ramnath Balasubramanyan
ML for email
[Cohen, AAAI Spring Symposium on ML and IR 1996]
• Starting point: Ishmail, an emacs RMAIL extension written by Charles Isbell in summer ’95 (largely for Ron Brachman)
• Users could manually write mailbox definitions and filtering rules in Lisp
• Foldering tasks, using a rule-learning method [Cohen, ICML 1995] and Rocchio’s method [Rocchio, 1971]
Machine Learning in Email
• Why study learning for email ?
1. Email has more visible impact than anything else you do with computers.
2. Email is hard to manage:
• People get overwhelmed.
• People lose important information in email archives.
• People make horrible mistakes.
Machine Learning in Email
• Why study learning for email ?
• For which tasks can learning help ?
– Foldering (“search, don’t sort!”)
– Spam filtering (important and well-studied)
– Search: beyond keyword search
– Recognizing errors (“Oops, did I just hit reply-to-all?”)
– Help for tracking tasks (“Dropping the ball”)
Learning to Search Email
[SIGIR 2006, CEAS 2006, WebKDD/SNA 2007] (CALO)
• Basic idea: learning to search email is learning to query a graph for information
• Q: “what are Jason’s email aliases?”
[figure: a graph of typed nodes (messages, email addresses, terms, dates, persons) joined by typed, directed edges such as sent-to, sent-from, sent-on, has-term, term-in-subject, similar-to, and email-address-of; e.g., messages containing the term “Jason” connect to email-address nodes at cs.cmu.edu and andrew.cmu.edu, which in turn connect to the person node Jason Ernst]
How do you pose queries to a graph?
• An extended similarity measure via graph walks:
– Propagate “similarity” from start nodes through edges in the graph, accumulating evidence of similarity over multiple connecting paths.
– Fixed probability of halting the walk at every step, i.e., shorter connecting paths have greater importance (exponential decay).
– In practice we can approximate this with a short finite graph walk, implemented with sparse matrix multiplication.
– The result is a list of nodes, sorted by “similarity” to an input node distribution (final node probabilities).
Email, contacts etc: a graph
• Graph nodes are typed; edges are directed and typed.
• Multiple edges may connect two given nodes.
• Every edge type is assigned a fixed weight, which determines the probability of its being followed in a walk: e.g., uniform.
• Related approaches: random walk with restart, graph kernels, heat diffusion kernels, diffusion processes, Laplacian regularization, graph databases (BANKS, DbExplorer, …), graph mincut, associative Markov networks, …
A query language:
• Q: { query “terms” (start nodes), desired output node type }
• Returns a list of nodes (of the desired type) ranked by the graph-walk probabilities.
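The short finite graph walk described above can be sketched as follows. This is a minimal illustration, not the CALO system's actual implementation; the transition matrix, halting probability, and step count are all assumed parameters.

```python
import numpy as np
from scipy.sparse import csr_matrix

def graph_walk(M, start, halt_prob=0.5, steps=6):
    """Approximate walk-based similarity with a short finite graph walk.

    M         : sparse row-stochastic transition matrix, with edge-type
                weights already folded into the transition probabilities
    start     : initial probability distribution over nodes (the query nodes)
    halt_prob : fixed probability of halting at every step, so longer
                connecting paths are exponentially down-weighted
    """
    scores = np.zeros_like(start, dtype=float)
    dist = start.astype(float)
    for _ in range(steps):
        dist = (1.0 - halt_prob) * (M.T @ dist)  # take one more step in the walk
        scores += halt_prob * dist               # mass that halts here adds evidence
    return scores                                # nodes sorted by score = the answer list
```

Nodes reached by many short paths from the start distribution accumulate the most probability mass, which is exactly the "evidence over multiple connecting paths" intuition.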
Tasks that are like similarity queries
• Person name disambiguation: [ term “andy”; file msgId ]  “person”
• Threading: What are the adjacent messages in this thread? A proxy for finding “more messages like this one”. [ file msgId ]  “file”
• Alias finding: What are the email-addresses of Jason?… [ term “Jason” ]  “email-address”
• Meeting attendees finder: Which email-addresses (persons) should I notify about this meeting? [ meeting mtgId ]  “email-address”
Learning to search better
• Task T (query class): query a + relevant answers a; query b + relevant answers b; … ; query q + relevant answers q
• For each query, the GRAPH WALK produces a ranked list of candidate nodes (ranks 1, 2, 3, …, 50)
• Standard set of features used for a candidate node x on each problem:
– Edge n-grams in all paths from Vq to x
– Number of reachable source nodes
– Features of top-ranking paths (e.g., edge bigrams)
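The edge n-gram features above might be extracted along these lines. This is a sketch under assumptions: paths are represented as plain sequences of edge labels, and features as label tuples; the actual feature encoding is not specified on the slide.

```python
def edge_ngrams(paths, n=2):
    """Extract edge-label n-grams from the connecting paths of a candidate
    node x (the paths from the query nodes Vq to x found by the walk)."""
    feats = set()
    for path in paths:                       # path = sequence of edge labels
        for i in range(len(path) - n + 1):
            feats.add(tuple(path[i:i + n]))  # one n-gram feature per window
    return feats
```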
Learning Approach
• Node re-ordering [Collins & Koo, CL 2005; Collins, ACL 2002]:
– Train task: graph walk  feature generation  learn re-ranker  re-ranking function
– Test task: graph walk  feature generation  score by re-ranking function
• Re-ranking learners: boosting; voted perceptron; RankSVM [Joachims, KDD 2002]; perceptron committees [Elsas et al., WSDM 2008]; …
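A node re-ranker of the kind cited above could be sketched as follows, using the averaged perceptron, a common practical stand-in for the voted perceptron of Collins (ACL 2002). The feature layout and training-problem format are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def train_reranker(problems, epochs=5):
    """Averaged-perceptron re-ranker.

    problems : list of (features, gold) pairs, where features is an
               (n_candidates x n_features) array for one query's graph-walk
               candidates and gold is the index of a correct answer node.
    Returns an averaged weight vector for re-scoring candidates.
    """
    n_feat = problems[0][0].shape[1]
    w = np.zeros(n_feat)
    w_sum = np.zeros(n_feat)
    for _ in range(epochs):
        for feats, gold in problems:
            pred = int(np.argmax(feats @ w))
            if pred != gold:                 # promote the gold node, demote the mistake
                w += feats[gold] - feats[pred]
            w_sum += w                       # accumulate for averaging
    return w_sum / (epochs * len(problems))

def rerank(feats, w):
    """Return candidate indices sorted by re-ranking score, best first."""
    return list(np.argsort(-(feats @ w)))
```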
Tasks that are like similarity queries
• Person name disambiguation: [ term “andy”; file msgId ]  “person”
• Threading: What are the adjacent messages in this thread? A proxy for finding “more messages like this one”. [ file msgId ]  “file”
• Alias finding: What are the email-addresses of Jason?… [ term “Jason” ]  “email-address”
• Meeting attendees finder: Which email-addresses (persons) should I notify about this meeting? [ meeting mtgId ]  “email-address”
PERSON NAME DISAMBIGUATION
Corpora and datasets
• Person names:
– Nicknames: Dave for David, Kai for Keiko, Jenny for Qing
– Common names are ambiguous
• CSpace Email corpus:
– collected at CMU
– 15,000+ emails from a semester-long management course
– students formed groups that acted as “companies” and worked together
– dozens of groups with some known social connections (e.g., “president”)
Results
PERSON NAME DISAMBIGUATION
[figure: recall (0–100%) vs. rank (1–10) curves on the Mgmt. game corpus]
Results On All Three Problems
PERSON NAME DISAMBIGUATION

Mgmt. Game           MAP    Δ      Acc.   Δ
Baseline             49.0   -      41.3   -
Graph - T            72.6   48%    61.3   48%
Graph - T + F        66.3   35%    48.8   18%
Reranked - T         85.6   75%    72.5   76%
Reranked - T + F     89.0   82%    83.8   103%

Enron: Sager-E       MAP    Δ      Acc.   Δ
Baseline             67.5   -      39.2   -
Graph - T            82.8   23%    66.7   70%
Graph - T + F        61.7   -9%    41.2   5%
Reranked - T         83.2   23%    68.6   75%
Reranked - T + F     88.9   32%    80.4   105%

Enron: Shapiro-R     MAP    Δ      Acc.   Δ
Baseline             60.8   -      38.8   -
Graph - T            84.1   38%    63.3   65%
Graph - T + F        56.5   -7%    38.8   1%
Reranked - T         87.9   45%    65.3   71%
Reranked - T + F     85.5   41%    77.6   103%
Tasks
• Person name disambiguation: [ term “andy”; file msgId ]  “person”
• Threading: What are the adjacent messages in this thread? A proxy for finding “more messages like this one”. [ file msgId ]  “file”
• Alias finding: What are the email-addresses of Jason?… [ term “Jason” ]  “email-address”
• Meeting attendees finder: Which email-addresses (persons) should I notify about this meeting? [ meeting mtgId ]  “email-address”
Threading: Results
[figure: MAP on the Mgmt. Game and Enron: Farmer corpora under different feature conditions (all features; no reply lines; no subject; header & body; header; body; subject; reply lines). Mgmt. Game MAP values include 73.8, 71.5, 60.3, 58.4, 50.2, and 36.2; Enron: Farmer values include 79.8, 65.7, 65.1, and 36.1]
Learning approaches
• Edge weight tuning [Diligenti et al., IJCAI 2005; Toutanova & Ng, ICML 2005; …]:
– task: graph walk  weight update  Θ*  graph walk
• Node re-ordering (boosting; voted perceptron):
– graph walk  feature generation  learn re-ranker  re-ranking function
– graph walk  feature generation  score by re-ranking function
• Question: which is better?
Results (MAP)
[figure: MAP (0.4–0.85) on name disambiguation (M.game, Sager, Shapiro), threading (M.game, Farmer, Germany), alias finding, and meetings, comparing edge-weight tuning and reranking]
• Reranking and edge-weight tuning are complementary.
• The best result is usually to tune weights, and then rerank.
• Reranking overfits on small datasets (meetings).
Machine Learning in Email
• Why study learning for email ?
• For which tasks can learning help ?
– Foldering
– Spam filtering
– Search: beyond keyword search
– Recognizing errors (“Oops, did I just hit reply-to-all?”)
– Help for tracking tasks (“Dropping the ball”)
http://www.sophos.com/
Preventing errors in Email
[SDM 2007]
Email Leak: an email accidentally sent to the wrong person
• Idea
– Goal: detect emails accidentally sent to the wrong person
– Generate artificial leaks: email leaks may be simulated by various criteria: a typo, similar last names, identical first names, aggressive auto-completion of addresses, etc.
– Method: look for outliers.
Preventing Email Leaks
• Method
– Create simulated/artificial email recipients
– Build a model for (msg, recipients): train a classifier on real data to detect synthetically created outliers (added to the true recipient list)
• Features: textual (subject, body), network features (frequencies, co-occurrences, etc.)
– Rank potential outliers; detect the outlier and warn the user based on confidence
• P(rec_t) = the probability that recipient t is an outlier, given the message text and the other recipients in the message; recipients are ranked from most likely to least likely outlier by P(rec_t)
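Given per-recipient outlier probabilities P(rec_t) from some trained model, the detect-and-warn step could look like this sketch; the confidence threshold is an assumed tuning parameter, not a value from the paper.

```python
def leak_warning(recipient_probs, threshold=0.9):
    """Rank a message's recipients by outlier probability and decide
    whether to warn the user.

    recipient_probs : dict mapping each recipient to P(recipient is an
                      outlier | message text, other recipients)
    Returns (most_likely_outlier, probability) if confident enough to
    warn, else None.
    """
    ranked = sorted(recipient_probs.items(), key=lambda kv: -kv[1])
    top, p = ranked[0]                     # most likely outlier
    return (top, p) if p >= threshold else None
```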
Enron Data Preprocessing
• Realistic scenario:
– for each user, the 10% most recent sent messages are used as test data
• Construct Address Books for all users:
– the list of all recipients in the user’s sent messages
Simulating Leaks
• Several options:
– frequent typos, same/similar last names, identical/similar first names, aggressive auto-completion of addresses, etc.
• In this paper, we adopted the 3g-address criteria. On each trial, one of the message’s recipients (e.g., Marina.wang@enron.com) is randomly chosen and an outlier is generated according to:
– with probability α: a random non-address-book address
– else: randomly select an address-book entry
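The leak-generation step above might be sketched as follows. This is a hedged approximation of the paper's 3g-address setup: the character-3-gram preference, the fabricated outside domain, and the sampling details are assumptions for illustration.

```python
import random

def char_3grams(s):
    """Set of character 3-grams of a string."""
    return {s[i:i + 3] for i in range(len(s) - 2)}

def simulate_leak(true_recipients, address_book, alpha=0.2, rng=random):
    """Generate one artificial leak recipient for a message.

    With probability alpha, return a random non-address-book address
    (here: the chosen recipient's local part on a hypothetical outside
    domain); otherwise randomly select an address-book entry, preferring
    entries that share a character 3-gram with the chosen recipient."""
    victim = rng.choice(true_recipients)     # recipient the leak is confusable with
    if rng.random() < alpha:
        return victim.split("@")[0] + "@outside.example"  # assumed domain
    pool = [a for a in address_book
            if a not in true_recipients and char_3grams(a) & char_3grams(victim)]
    pool = pool or [a for a in address_book if a not in true_recipients]
    return rng.choice(pool)
```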
Experiments: using Textual Features only
• Three baseline methods:
– Random: rank recipient addresses randomly.
– Rocchio/TfIdf Centroid [Rocchio, 1971]: create a “TfIdf centroid” for each user in the Address Book. A user1-centroid is the sum of all training messages (in TfIdf vector format) that were addressed to user1. For testing, rank according to the cosine similarity between the test message and each centroid.
– Knn-30 [Yang and Chute, SIGIR 1994]: given a test message, get the 30 most similar messages in the training set; rank each user according to the sum of similarities over that 30-message set.
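The Rocchio/TfIdf centroid baseline above amounts to a few lines. A minimal sketch, assuming TfIdf vectors are already computed as dense arrays and the centroids dict is keyed by address-book user:

```python
import numpy as np

def rank_recipients_by_centroid(test_vec, centroids):
    """Rank address-book users by cosine similarity between the test
    message's TfIdf vector and each user's centroid (the sum of TfIdf
    vectors of training messages addressed to that user)."""
    def cos(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return 0.0 if denom == 0 else float(a @ b) / denom
    scored = [(user, cos(test_vec, c)) for user, c in centroids.items()]
    return sorted(scored, key=lambda uc: -uc[1])  # most similar first
```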
Experiments: using Textual Features only
[figure: email leak prediction results for one Enron user over 10 trials; on each trial, a different set of outliers is generated]
Network Features
• How frequently a recipient was addressed
• How these recipients co-occurred in the training set
Using Network Features
1. Frequency features
– number of received messages (from this user)
– number of sent messages (to this user)
– number of sent+received messages
2. Co-occurrence features
– number of times a user co-occurred with all other recipients
3. Max3g features
– for each recipient R, find Rm (= the address with the max score from R’s 3g-address list), then use score(R) − score(Rm) as a feature
• Combine with the text-only scores using voted-perceptron reranking, trained on simulated leaks
[figure: precision at rank 1, with α = 0]
Finding Real Leaks in Enron
• How can we find them?
– Grep for “mistake”, “sorry” or “accident”.
– Note: the message must be from one of the Enron users.
• Found 2 good cases:
1. Message germanyc/sent/930: the message has 20 recipients; the leak is [email protected]
2. Message kitchen-l/sent items/497: it has 44 recipients; the leak is [email protected]
Results on real leaks
• Kitchen-l has 4 unseen addresses out of the 44 recipients
• Germany-c has only one
The other kind of recipient error
[ECIR 2008]
• How accurately can you fill in missing recipients, using the message text as evidence?
[figure: mean average precision over 36 users, after using thread information]
Current prototype (Thunderbird plug-in)
• Leak warnings: hit x to remove a recipient
• Suggestions: hit + to add one
• Pause or cancel send of a message
• Timer: the message is sent after 10 sec by default
• Classifiers/rankers written in JavaScript
Machine Learning in Email
• Why study learning for email ?
• For which tasks can learning help ?
– Foldering
– Spam filtering
– Search: beyond keyword search
– Recognizing errors
– Help for tracking tasks (“Dropping the ball”)
Dropping the Ball
Speech Acts for Email
[EMNLP 2004, SIGIR 2005, ACL Acts WS 2006]
Classifying Speech Acts
• [Carvalho & Cohen, SIGIR 2005]: relational model including adjacent messages in a thread; pseudo-likelihood/RDN model with an annealing phase.
• [Carvalho & Cohen, ACL workshop 2006]: IE preprocessing, n-grams, feature extraction, YFLA.
• Ciranda package
• Related work: [Dabbish et al, CHI 05; Dredze et al, IUI 06; Khoussainov & Kushmerick, CEAS 2005; Goldstein et al, CEAS 2006; Goldstein & Sabin, HICSS 06]
[screenshot: task-tracking prototype. For a detected Request (“request for screen shots”), the plug-in offers “Add Task: follow up on ‘request for screen shots’ by ___ (2 days before?)”, with time/date suggestions: “next Wed” (12/5/07), “end of the week” (11/30/07), “Sunday” (12/2/07), or other. For a detected Commitment (“METAL – fairly urgent feedback sought”), it offers “Add Task … by ‘tomorrow noon’ (11/29/07)” and shows a warning: “Warning! You are making a commitment! Hit cancel to abort!”]
Conclusions/Summary
• Email is visible and important  a perfect ML application
• There are lots of interesting problems associated with email processing:
– Learning to query heterogeneous data graphs
– Modeling patterns of interactions
• User  User textual communication
• User  User communication frequency, recency, …
– … to predict likely recipients/non-recipients, correct possible errors, and/or aid the user in tracking requests and commitments
Bibliography: Our Group
• Einat Minkov and William Cohen (2007): Learning to Rank Typed Graph Walks: Local and Global Approaches, in WebKDD-2007.
• Vitor Carvalho, Wen Wu and William Cohen (2007): Discovering Leadership Roles in Email Workgroups, in CEAS-2007.
• Vitor Carvalho and William Cohen (2007): Ranking Users for Intelligent Message Addressing, to appear in ECIR-2008.
• Vitor Carvalho and William W. Cohen (2007): Preventing Information Leaks in Email, in SDM-2007.
• Einat Minkov and William W. Cohen (2006): An Email and Meeting Assistant using Graph Walks, in CEAS-2006.
• Einat Minkov, Andrew Ng and William W. Cohen (2006): Contextual Search and Name Disambiguation in Email using Graphs, in SIGIR-2006.
• Vitor Carvalho and William W. Cohen (2006): Improving Email Speech Act Analysis via N-gram Selection, in the HLT/NAACL ACTS Workshop 2006.
• William W. Cohen, Einat Minkov & Anthony Tomasic (2005): Learning to Understand Web Site Update Requests, in IJCAI-2005.
• Einat Minkov, Richard C. Wang, and William W. Cohen (2005): Extracting Personal Names from Email: Applying Named Entity Recognition to Informal Text, in EMNLP/HLT-2005.
• Vitor Carvalho & William W. Cohen (2005): On the Collective Classification of Email Speech Acts, in SIGIR 2005.
• William W. Cohen, Vitor R. Carvalho & Tom Mitchell (2004): Learning to Classify Email into "Speech Acts", in EMNLP 2004.
• Vitor R. Carvalho & William W. Cohen (2004): Learning to Extract Signature and Reply Lines from Email, in CEAS 2004.
Bibliography: Other Cited Papers
• M. Collins. Ranking algorithms for named-entity extraction: Boosting and the voted perceptron. In ACL, 2002.
• M. Collins and T. Koo. Discriminative reranking for natural language parsing. Computational Linguistics, 31(1):25–69, 2005.
• M. Diligenti, M. Gori, and M. Maggini. Learning web page scores by error back-propagation. In IJCAI, 2005.
• T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2002.
• J. Elsas, V. R. Carvalho, and J. Carbonell. Fast learning of document ranking functions with the committee perceptron. In WSDM-2008 (ACM International Conference on Web Search and Data Mining).
• Y. Yang and C. G. Chute. An example-based mapping method for text classification and retrieval. ACM Transactions on Information Systems, 12(3), 1994.
• L. A. Dabbish, R. E. Kraut, S. Fussell, and S. Kiesler. Understanding email use: predicting action on a message. In CHI ’05: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2005, pp. 691–700.
Bibliography: Other Cited Papers (cont.)
• M. Dredze, T. Lau, and N. Kushmerick. Automatically classifying emails into activities. In IUI ’06: Proceedings of the 11th International Conference on Intelligent User Interfaces, 2006, pp. 70–77.
• D. Feng, E. Shaw, J. Kim, and E. Hovy. Learning to detect conversation focus of threaded discussions. In Proceedings of HLT/NAACL 2006 (Human Language Technology Conference - North American Chapter of the Association for Computational Linguistics), New York City, NY, 2006.
• J. Goldstein, A. Kwasinski, P. Kingsbury, R. E. Sabin, and A. McDowell. Annotating subsets of the Enron email corpus. In Conference on Email and Anti-Spam (CEAS 2006), 2006.
• J. Goldstein and R. E. Sabin. Using speech acts to categorize email and identify email genres. In Proceedings of the 39th Annual Hawaii International Conference on System Sciences (HICSS’06), vol. 3, p. 50b, 2006.
• R. Khoussainov and N. Kushmerick. Email task management: an iterative relational learning approach. In Conference on Email and Anti-Spam (CEAS 2005), 2005.
• David Allen. Getting Things Done: The Art of Stress-Free Productivity. Penguin Books, 2001.