Extracting Personal Names from Email: Applying Named Entity Recognition to Informal Text

Einat Minkov & Richard C. Wang, Language Technologies Institute
William W. Cohen, Center for Automated Learning and Discovery
School of Computer Science
Carnegie Mellon University
What is an informal text?
• A text that is…
– Written for a narrow audience
• Group/task-specific abbreviations often used
• Not self-contained (context shared by a related group of
people)
– Not carefully prepared
• Contains grammatical and spelling errors
• Does not follow capitalization conventions
• Some examples are…
– Instant messages
– Newsgroup postings
– Email messages
October 7, 2005
CMU School of Computer Science
Objective / Outline
• Investigate named entity recognition (NER) for
informal text
– Conduct experiments on recognizing personal names in
email
• Examine indicative features in email and newswire
• Suggest specialized features for email
• Evaluate performance of a state-of-the-art extractor (CRF)
• Analyze repetition of names in email and newswire
• Suggest and evaluate a recall-enhancing method that is effective for email
Corpora
• Mgmt corpora – Emails from a management course at CMU in which
students form teams to run simulated companies
– Teams: Each set (train/tune/test) formed by different simulation teams
– Game: Each set formed by different days during the simulation period
• Enron corpora – Emails from Enron Corporation
– Meetings: Each set formed by randomly selected meeting-related emails
– Random: Each set formed by repeatedly sampling a user then sampling an
email from that user, both at random
Note: The number of words and names refer to the whole annotated corpora
Extraction Method
• Train Conditional Random Fields (CRF) to label and
extract personal names
– A machine-learning based probabilistic approach to labeling
sequences of examples
• Learning reduces NER to the task of tagging, or
classifying, each word using a set of five tags:
– Unique: A one-token entity
– Begin: The first token of a multi-token entity
– End: The last token of a multi-token entity
– Inside: Any other token of a multi-token entity
– Outside: A token that is not part of an entity
Example:
Einat   and      Richard  Wang  met      William  W.      Cohen  today
Unique  Outside  Begin    End   Outside  Begin    Inside  End    Outside
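The five-tag scheme can be illustrated with a minimal sketch that turns entity spans into per-token tags. The function name `encode_tags` and the (start, end) span representation are assumptions made for illustration; they are not taken from the slides.

```python
def encode_tags(tokens, entity_spans):
    """Label each token with one of the five tags.

    entity_spans holds (start, end) token offsets, end exclusive
    (a hypothetical representation chosen for this sketch).
    """
    tags = ["Outside"] * len(tokens)
    for start, end in entity_spans:
        if end - start == 1:
            tags[start] = "Unique"          # one-token entity
        else:
            tags[start] = "Begin"           # first token
            tags[end - 1] = "End"           # last token
            for i in range(start + 1, end - 1):
                tags[i] = "Inside"          # any token in between
    return tags


tokens = "Einat and Richard Wang met William W. Cohen today".split()
spans = [(0, 1), (2, 4), (5, 8)]
tags = encode_tags(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token:8s} {tag}")
```

Run on the example sentence, this yields the same tag sequence as the slide.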
Top Learned Features
Features most indicative of a token being part of a name in a
Conditional Random Fields (CRF) extractor
[Table: top learned features for Email (Mgmt-Game) and Newswire (MUC-6); callouts: In Quoted Excerpt, Name Titles, In Email Signature, Job Titles]
Results show that…
Email and newswire text have very different characteristics
Note: A feature is denoted by its direction (left/right) relative to the focus word, its offset, and its lexical value
Our Proposed Features
Note: All features are instantiated for the focus word t, and 3 tokens to the left and right of t
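As a rough sketch of what instantiating features over a window of 3 tokens on each side might look like, the snippet below emits purely lexical features named by direction, offset, and lexical value. The function `window_features` and the `direction.offset.value` string format are hypothetical; the proposed feature set itself is not reproduced here.

```python
def window_features(tokens, t, width=3):
    """Lexical features for focus position t and up to `width` tokens per side.

    The "direction.offset.value" naming is an assumption for illustration.
    """
    feats = [f"focus.0.{tokens[t].lower()}"]
    for offset in range(1, width + 1):
        if t - offset >= 0:
            feats.append(f"left.{offset}.{tokens[t - offset].lower()}")
        if t + offset < len(tokens):
            feats.append(f"right.{offset}.{tokens[t + offset].lower()}")
    return feats


sentence = "Please contact Mr. Smith today".split()
feats = window_features(sentence, 3)   # focus word: "Smith"
print(feats)
```

In a real extractor, each non-lexical feature (dictionary membership, email structure, and so on) would presumably be evaluated at the same seven positions.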
Feature Evaluation
• Entity-level F1 of learned extractor (CRF) using:
– Basic features (B)
– Basic and Email features (B+E)
– Basic and Dictionary features (B+D)
– All features (B+D+E)
B+D+E results (row labels not recoverable):
Precision  Recall
93.8       81.3
95.3       87.8
83.6       70.2
83.0       69.4
Results show that…
1) Dictionary and Email features are useful (best when combined)
2) Generally high precision but low recall
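For reference, entity-level precision, recall, and F1 can be computed as in the sketch below. It assumes exact-boundary matching, where a predicted entity counts only if its document and both boundaries agree with a gold entity; the slides do not state the matching criterion, and `entity_prf` is a hypothetical helper.

```python
def entity_prf(gold, predicted):
    """Entity-level precision, recall and F1 under exact span matching.

    Each entity is a (doc_id, start, end) tuple; this representation
    is an assumption made for the sketch.
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [(0, 0, 1), (0, 2, 4), (0, 5, 8)]
predicted = [(0, 0, 1), (0, 2, 4)]      # one gold entity is missed
p, r, f1 = entity_prf(gold, predicted)
print(p, r, f1)
```

The toy input mirrors the pattern on this slide: every prediction is correct (high precision), but one entity is missed (lower recall).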
What’s Next?
• Previous experiments show high precision but
low recall
– Next goal: Improve recall
• One recall-enhancing method
– Look for multiple occurrences of names in a corpus
• We conduct experimental studies
– Examine repetition patterns of names in email and
newswire text
– Examine occurrences of names within a single
document and across multiple documents
Doc. Frequency of Names
Percentage of person-name tokens that appear in at most K distinct
documents as a function of K
[Figure: F(K) as a percentage (y-axis: Percentage) against K (x-axis: Document Frequency)]
• Only 1.3% of names in MUC-6 appear in 10+ documents
• About 20% of names in Mgmt-Game appear in 10+ documents
• Nearly 80% of names in MUC-6 appear only in one document
• 30% of names in Mgmt-Game appear only in one document
Results show that…
Repetition of names across multiple documents is more common in email corpora
F(K) = \sum_{i=1}^{K} \frac{\#\,\mathrm{unique}(w : df(w) = i)}{\#\,\mathrm{unique}(w : df(w) > 0)}
unique(A): duplicates removed from set A
df(w): # of documents containing token w
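A minimal sketch of how F(K) could be computed from a corpus, assuming each document is given as the list of its person-name tokens. The function name `cumulative_df_fraction` and the input layout are illustrative assumptions.

```python
from collections import Counter


def cumulative_df_fraction(docs, max_k):
    """Return [F(1), ..., F(max_k)] for a list of name-token lists."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))            # count each name once per document
    total = len(df)                    # unique names with df(w) > 0
    if total == 0:
        return [0.0] * max_k
    names_with_df = Counter(df.values())
    fractions, running = [], 0
    for k in range(1, max_k + 1):
        running += names_with_df[k]    # unique names with df(w) == k
        fractions.append(running / total)
    return fractions


docs = [["ann", "bob"], ["ann"], ["ann", "cy"]]
curve = cumulative_df_fraction(docs, 3)
print(curve)
```

In this toy corpus, two of the three unique names occur in a single document and the third occurs in all three, so the curve reaches 1.0 only at K = 3.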
Single vs. Multiple Documents
We define the following extractors:
1. CRF – baseline trained with all features
2. SDR (Single Document Repetition)
Rules that extract person-name tokens that appear more than once within a single document; hence an upper bound on recall using only name repetition within a single document
3. MDR (Multiple Document Repetition)
Rules that extract person-name tokens that appear in more than one document; hence an upper bound on recall using only name repetition across multiple documents
4. SDR+CRF
Union of extractions by SDR and CRF; hence an upper bound on recall using CRF and name repetition within a single document
5. MDR+CRF
Union of extractions by MDR and CRF; hence an upper bound on recall using CRF and name repetition across multiple documents
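The SDR and MDR upper bounds can be sketched as oracle rules over gold-labeled tokens. The snippet assumes each document is a list of (token, is_name) pairs and measures token-level recall; `repetition_recall` and this input format are hypothetical, and the union with CRF output is left out.

```python
from collections import Counter


def repetition_recall(docs):
    """Token-level recall upper bounds (SDR, MDR) from gold name labels."""
    df = Counter()
    for doc in docs:
        df.update({tok.lower() for tok, is_name in doc if is_name})

    total = sdr_hits = mdr_hits = 0
    for doc in docs:
        tf = Counter(tok.lower() for tok, is_name in doc if is_name)
        for tok, count in tf.items():
            total += count
            if count > 1:          # repeated within this document
                sdr_hits += count
            if df[tok] > 1:        # appears in more than one document
                mdr_hits += count
    if total == 0:
        return 0.0, 0.0
    return sdr_hits / total, mdr_hits / total


docs = [
    [("Ann", True), ("met", False), ("Ann", True), ("Bob", True)],
    [("Bob", True), ("called", False)],
    [("Cy", True), ("Bob", True)],
]
sdr, mdr = repetition_recall(docs)
print(sdr, mdr)
```

Here "Ann" repeats only inside one document (covered by SDR), while "Bob" recurs across documents (covered by MDR).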
Single vs. Multiple Documents
Token-level upper bounds on recall and potential recall-gains associated
with methods that look for name tokens that re-occur within a single
document or across multiple documents
• MUC-6 has the highest recall using SDR
• MUC-6 has the highest recall-gain using SDR
• MUC-6 has the lowest recall using MDR
• MUC-6 has the lowest recall-gain using MDR
Results show that…
Higher recall and potential recall-gains can be obtained for email corpora using the MDR method
What’s Next?
• Our studies show the potential of exploiting repetition of names over multiple documents for improving recall in email corpora
• We suggest a recall-enhancing method:
1. Auto-construct a dictionary of predicted names and their variants from test set
2. Statistically filter out noisy names from the dictionary
3. Match names globally from the inferred dictionary onto test set, exploiting repetition of names
Note: A “dictionary” is simply a list of one or more tokens
Name Dictionary Construction
Every name in the test set predicted by the learned extractor
(CRF), trained with all features, is transformed into a set of
name variants and inserted into a dictionary
Original name is
included by default
Transformation Example
Name variants of “Benjamin Brown Smith”
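The transformation table itself is not reproduced in this transcript, so the sketch below infers its rules from the variant listing on the Name Matching slide. The assumed rules are: first and last tokens appear in full or as initials, middle tokens appear in full, as initials, or are dropped, tokens are joined by a space or a hyphen (never a hyphen directly after an initial), and the first and last tokens are also added on their own. The function names are hypothetical.

```python
from itertools import product


def initial(token):
    """Abbreviate a token to its first letter plus a period."""
    return token[0] + "."


def join_forms(forms):
    """All ways to join forms with ' ' or '-', with no '-' after an initial."""
    results = [forms[0]]
    for prev, cur in zip(forms, forms[1:]):
        seps = [" "] if prev.endswith(".") else [" ", "-"]
        results = [r + sep + cur for r in results for sep in seps]
    return results


def name_variants(name):
    tokens = name.lower().split()
    variants = {" ".join(tokens)}              # original name kept by default
    if len(tokens) < 2:
        return variants
    first, *middle, last = tokens
    variants.update({first, last})             # single-token variants
    choices = [[first, initial(first)]]
    choices += [[m, initial(m), None] for m in middle]   # None = dropped
    choices += [[last, initial(last)]]
    for combo in product(*choices):
        forms = [f for f in combo if f is not None]
        variants.update(join_forms(forms))
    return variants


variants = name_variants("Benjamin Brown Smith")
print(len(variants))
print(sorted(variants, key=len, reverse=True)[:5])
```

Under these assumed rules the middle token alone ("brown") is never generated; whether the original method generated it and later filtered it out cannot be determined from the slides.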
Name Dictionary Filtering
• Previously constructed dictionary contains noisy names
– e.g., “brown” can also refer to a color
– Next goal: Filter out noisy names
• We suggest a filtering scheme to remove every single-token name w from the dictionary when PF.IDF(w) < Θ
– PF.IDF: Predicted Frequency × Inverse Document Frequency
– Words that get low PF.IDF scores are either highly ambiguous names or very common words in the corpus
cpf(w): # of times w is predicted as a name-token in corpus
ctf(w): # of occurrences of w in corpus
df(w): document frequency of w in corpus
N: # of documents in corpus
Θ = 0.16 optimizes entity-level F1 on the tune sets; thus, we apply the same threshold to our test sets
Note: “Corpus” mentioned here refers to the test set in our experiments
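The PF.IDF formula did not survive in this transcript, so the following sketch rests on an assumption: PF(w) = cpf(w) / ctf(w) and a normalized IDF(w) = log((N + 0.5) / df(w)) / log(N + 1), built from the quantities defined above. The function names and input layout (token lists with parallel boolean prediction flags) are hypothetical.

```python
import math
from collections import Counter


def pf_idf(docs, predictions):
    """Score every token predicted as a name at least once.

    Assumed form: (cpf / ctf) * log((N + 0.5) / df) / log(N + 1).
    """
    n_docs = len(docs)
    ctf, cpf, df = Counter(), Counter(), Counter()
    for tokens, flags in zip(docs, predictions):
        lowered = [t.lower() for t in tokens]
        ctf.update(lowered)                                   # corpus term frequency
        cpf.update(t for t, f in zip(lowered, flags) if f)    # predicted as name
        df.update(set(lowered))                               # document frequency
    return {
        w: (cpf[w] / ctf[w]) * math.log((n_docs + 0.5) / df[w]) / math.log(n_docs + 1)
        for w in cpf
    }


def filter_dictionary(dictionary, scores, theta=0.16):
    """Drop single-token names scoring below theta; keep multi-token names."""
    return {
        name for name in dictionary
        if len(name.split()) > 1 or scores.get(name, 0.0) >= theta
    }


docs = [["brown", "paper"], ["the", "brown", "box"],
        ["brown", "smith", "called"], ["a", "brown", "bag"]]
predictions = [[False, False], [False, False, False],
               [True, True, False], [False, False, False]]
scores = pf_idf(docs, predictions)
kept = filter_dictionary({"brown", "smith", "brown smith"}, scores)
print(scores, kept)
```

In this toy corpus "brown" occurs in every document but is predicted as a name only once, so it scores well below the threshold and is removed, while "smith" is kept.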
Name Matching
Filtered Dictionary
…
benjamin brown smith
benjamin-brown smith
benjamin brown-smith
benjamin-brown-smith
benjamin brown s.
benjamin-b. smith
benjamin b. smith
benjamin brown-s.
benjamin-brown s.
benjamin-brown-s.
benjamin-b. s.
benjamin-smith
benjamin smith
b. brown smith
benjamin b. s.
b. brown-smith
benjamin-s.
benjamin s.
b. brown s.
b. b. smith
b. brown-s.
benjamin
b. smith
b. b. s.
smith
b. s.
…
• A window slides through every token in the test set
• A match occurs when the tokens in a window start with the longest possible name variant in the dictionary
• All matched names are marked for evaluation
Name Matching Example (E-Mail)
I called Benjamin Brown Smith and left a message to send us
an e-mail if he could come. I have not received his e-mail yet.
He might not be able to come. We may want to postpone
until tomorrow morning. Do you still have our class schedule?
Please contact benjamin and confirm the meeting. I do not
have classes tomorrow morning.
Legend: Predicted by CRF / Missed by CRF
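The sliding-window matching can be sketched as a greedy longest-match scan. This is an illustration under assumptions: dictionary entries are compared as lowercased whitespace-separated token tuples, and the scan jumps past a matched name rather than re-examining its tokens, a choice the slides do not specify. `match_names` is a hypothetical helper.

```python
def match_names(tokens, dictionary):
    """Return (start, end) token spans, end exclusive, of dictionary matches."""
    entries = {tuple(entry.lower().split()) for entry in dictionary}
    max_len = max((len(e) for e in entries), default=0)
    lowered = [t.lower() for t in tokens]
    matches, i = [], 0
    while i < len(lowered):
        # Try the longest window first so the longest variant wins.
        for n in range(min(max_len, len(lowered) - i), 0, -1):
            if tuple(lowered[i:i + n]) in entries:
                matches.append((i, i + n))
                i += n
                break
        else:
            i += 1          # no variant starts here; slide the window
    return matches


dictionary = {"benjamin brown smith", "benjamin smith", "benjamin", "smith"}
text = ("I called Benjamin Brown Smith and left a message "
        "Please contact benjamin and confirm the meeting")
tokens = text.split()
matches = match_names(tokens, dictionary)
print([" ".join(tokens[s:e]) for s, e in matches])
```

The full name is matched as one three-token entity rather than as separate hits for "benjamin" and "smith", and the later lowercase "benjamin" is recovered through the dictionary.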
Experimental Results
Entity-level relative improvements (and final scores) after
applying our recall-enhancing method on test sets
– Baseline: learned extractor (CRF) trained with all features
Results show that…
1) Recall improved significantly with a small sacrifice in precision
2) F1 scores improved in all cases
Conclusion
• Email and newswire text have different characteristics
• We suggested a set of specialized features for name extraction on email that exploit structural regularities in email
• Exploiting name repetition over multiple documents is important for improving recall in email corpora
• We presented the PF.IDF recall-enhancing method, which improves recall significantly with a small sacrifice in precision
Thank You!