Email Data Cleaning
(KDD’05)
Jie Tang1, Hang Li2, Yunbo Cao2, Zhaohui Tang3
1 Tsinghua University
2 Microsoft Research Asia
3 Microsoft Corporation
Outline
- Motivation and Problem Description
- Related Work
- Our Approach
- Implementation
- Experimental Results
- Summary
Motivation
Email is one of the most common modes of communication
Text mining applications on emails
- Email classification
- Email summarization
- Term extraction from email
- …
Term Extraction
From: SY <sandeep....@gmail.com> - Find messages by this author
Date: Mon, 4 Apr 2005 11:29:28 +0530
Subject: Re: ..How to do addition??
Hi Ranger,
Your design of Matrix
class is not
good.
what are you doing with two
matrices in a single class?make class Matrix as follows
import java.io.*;
class Matrix {
public static int AnumberOfRows;
public static int AnumberOfColumns;
private int matrixA[][];
public void inputArray() throws IOException
{
InputStreamReader input = new InputStreamReader(System.in);
BufferedReader keyboardInput = new BufferedReader(input)
}
-- Sandeep Yadav
Tel: 011-243600808
Homepage: http://www.it.com/~Sandeep/
On Apr 3, 2005 5:33 PM, ranger <asiri....@gmail.com> wrote:
> Hi... I want to perform the addtion in my Matrix class. I got the program to
> enter 2 Matricx and diaplay them. Hear is the code of the Matrix class and
> TestMatrix class. I'm glad If anyone can let me know how to do the addition.....Tnx
Noise marked in the message text: extra line break, missing space, extra space, missing period, case errors
Message text after cleaning:
Hi Ranger,
Your design of Matrix class is not good. What are you doing with two matrices in a single class? Make class Matrix as follows:
Related Work -- Data Mining
Email Cleaning
- Several products have the feature of email cleaning by using rules
- E.g. eClean (2000), WinPure ListCleaner Pro (2004)
Information Extraction from Email
- Extracting contact information, etc.
- E.g. Kristjansson and Culotta (2004), Culotta, Bekkerman, and McCallum (2004), Viola (2005)
Web Page Cleaning
- Removing banner ads, decoration pictures
- E.g. Yi and Liu (2003), Lin and Ho (2002)
Tabular Data Cleaning
- Detecting and removing duplicate information
- E.g. Hernández and Stolfo (1998), Rahm and Do (2000), SQL Server 2005
Related Work -- Language Processing
Sentence Boundary Detection
- Palmer and Hearst (1997)
Case Restoration
- Lita and Ittycheriah (2003)
- Mikheev (2002)
Spelling Error Correction
- Golding and Roth (1996)
Our Approach -- Cascaded Approach
Cleaning = non-text block filtering + text normalization
• Non-text block filtering
- Quotation detection
- Header detection
- Signature detection
- Program code detection
• Text normalization
- Paragraph normalization
* Extra line break detection
- Sentence normalization
* Missing period detection
- Word normalization
* Case restoration
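The cascade can be sketched as a chain of text-to-text stages, each one working on the output of the previous one. This is a minimal illustration only: `run_cascade`, `drop_quoted_lines`, and `collapse_extra_spaces` are hypothetical names, and the two toy stages are simple rules standing in for the detectors named on the slide.

```python
import re
from typing import Callable, Iterable

Stage = Callable[[str], str]

def drop_quoted_lines(text: str) -> str:
    # Toy stand-in for quotation detection: drop lines starting with ">".
    kept = [line for line in text.split("\n") if not line.lstrip().startswith(">")]
    return "\n".join(kept)

def collapse_extra_spaces(text: str) -> str:
    # Toy stand-in for extra space detection: squeeze runs of spaces or tabs.
    return re.sub(r"[ \t]{2,}", " ", text)

def run_cascade(text: str, stages: Iterable[Stage]) -> str:
    # Each stage consumes the output of the previous stage.
    for stage in stages:
        text = stage(text)
    return text

noisy = "Hi  Ranger,\n> Hi... I want to perform the addtion"
cleaned = run_cascade(noisy, [drop_quoted_lines, collapse_extra_spaces])
```

Because every stage has the same signature, an application that wants to keep a block (for example the signature) can leave that stage out of the list.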
Cascaded Approach
Noisy Email Message
-> Non-text Block Filtering
   - Quotation Detection
   - Header Detection
   - Signature Detection
   - Program Code Detection
-> Paragraph Normalization
   - Extra Line Break Detection
-> Sentence Normalization
   - Missing Periods and Missing Spaces Detection
   - Extra Spaces Detection
-> Word Normalization
   - Case Restoration
-> Cleaned Email Message
In a particular text mining application, we can retain some of the blocks
Example: message text left after the header, program code, signature, and quotation are filtered and extra line breaks are removed:
Hi Ranger,
Your design of Matrix class is not good.
what are you doing with two matrices in a single class?make class Matrix as follows
After sentence normalization and case restoration:
Hi Ranger,
Your design of Matrix class is not good. What are you doing with two matrices in a single class? Make class Matrix as follows.
Technical Issues
Non-text filtering
- Quotation detection
- Header detection
- Signature detection
- Program code detection
Text normalization
- Extra line break detection
- Sentence normalization
- Case restoration
Non-text Filtering Using SVMs
Used for header detection, signature detection, and program code detection
Start line feature set: position feature, positive word feature, ...
End line feature set: position feature, positive word feature, ...
Training: Training data -> Feature extraction -> SVMs -> Two SVM models (start line model, end line model)
Testing: Test data -> Feature extraction -> Start line model and end line model -> Identified blocks
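One way the outputs of the start line model and the end line model could be turned into blocks is sketched below. The pairing rule (each predicted start is closed by the next predicted end) is an assumption for illustration; the slides do not say how the two predictions are combined, and `identify_blocks` is a hypothetical helper.

```python
from typing import List, Tuple

def identify_blocks(is_start: List[bool], is_end: List[bool]) -> List[Tuple[int, int]]:
    """Pair each predicted start line with the next predicted end line.

    is_start[i] / is_end[i] are the per-line decisions of the start line
    model and the end line model. Returns inclusive (start, end) indices.
    """
    blocks = []
    start = None
    for i, (s, e) in enumerate(zip(is_start, is_end)):
        if start is None and s:
            start = i  # open a block at the first unmatched start line
        if start is not None and e:
            blocks.append((start, i))  # close it at the next end line
            start = None
    return blocks

# A three-line header at the top of a five-line message.
header_blocks = identify_blocks(
    [True, False, False, False, False],
    [False, False, True, False, False],
)
```

A start line with no later end line is simply dropped in this sketch; a real system would need a policy for such unmatched predictions.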
Features Used in Header Detection
Position Feature
Is the first line?
Positive Word Features
Begins with: “From:”, “Re:”, “In article”, etc.
Contains: “original message”, “Fwd:”, etc.
Ends with: “wrote:”, “said:”, etc.
Negative Word Features
Contains: “Hi”, “dear”, “thank you”, “best regards”, etc.
Number of Words Feature
Number of words in the current line
Person Name Feature
Contains a person name?
Ending Character Features
Ends with: colon, semicolon, quotation mark, question mark, exclamation mark, etc.
Special Pattern Features
Contains one type of special patterns: email, date, number, URL, percentage, etc.
Number of Line Breaks Feature
Number of line breaks that exist before the current line
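A rough sketch of how a few of these header features might be computed for one line is given below. The word lists come from the slide; the regular expressions, the feature names, and the reading of "number of line breaks" as the count of blank lines directly above are assumptions. The person name feature is omitted because it would need a name lexicon that the slides do not describe.

```python
import re

POSITIVE_PREFIXES = ("from:", "re:", "in article")
POSITIVE_CONTAINS = ("original message", "fwd:")
POSITIVE_SUFFIXES = ("wrote:", "said:")
NEGATIVE_WORDS = ("hi", "dear", "thank you", "best regards")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")  # assumed pattern
URL_RE = re.compile(r"https?://\S+")                     # assumed pattern

def header_line_features(lines, i):
    """Feature dictionary for lines[i] (hypothetical feature names)."""
    line = lines[i].strip()
    low = line.lower()
    # Count blank lines immediately above the current line.
    blank_before = 0
    j = i - 1
    while j >= 0 and not lines[j].strip():
        blank_before += 1
        j -= 1
    return {
        "is_first_line": i == 0,
        "begins_with_positive": low.startswith(POSITIVE_PREFIXES),
        "contains_positive": any(w in low for w in POSITIVE_CONTAINS),
        "ends_with_positive": low.endswith(POSITIVE_SUFFIXES),
        "contains_negative": any(
            re.search(r"\b" + re.escape(w) + r"\b", low) for w in NEGATIVE_WORDS
        ),
        "num_words": len(line.split()),
        "ends_with_special_char": line.endswith((":", ";", '"', "?", "!")),
        "contains_email": bool(EMAIL_RE.search(line)),
        "contains_url": bool(URL_RE.search(line)),
        "line_breaks_before": blank_before,
    }

sample = ["From: SY <sy@example.com>", "", "Hi Ranger,"]
first = header_line_features(sample, 0)
third = header_line_features(sample, 2)
```

The resulting dictionaries would then be converted to numeric vectors before being passed to an SVM.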
Header Detection
From: SY <sandeep....@gmail.com> - Find messages by this author
Date: Mon, 4 Apr 2005 11:29:28 +0530
Subject: Re: ..How to do addition??
Hi Ranger,
Your design of Matrix
class is not
good.
what are you doing with two
matrices in a single class?make class Matrix as follows
import java.io.*;
class Matrix {
public static int AnumberOfRows;
public static int AnumberOfColumns;
private int matrixA[][];
public void inputArray() throws IOException
{
InputStreamReader input = new InputStreamReader(System.in);
BufferedReader keyboardInput = new BufferedReader(input)
}
-- Sandeep Yadav
Tel: 011-243600808
Homepage: http://www.it.com/~Sandeep/
On Apr 3, 2005 5:33 PM, ranger <asiri....@gmail.com> wrote:
> Hi... I want to perform the addtion in my Matrix class. I got the program to
> enter 2 Matricx and diaplay them. Hear is the code of the Matrix class and
> TestMatrix class. I'm glad If anyone can let me know how to do the addition.....Tnx
Features:
- Position Feature
- Positive Word Features (“From:”, “Subject:”)
- Negative Word Features
- Number of Words Feature
- Person Name Feature
- Ending Character Features (“??”)
- Special Pattern Features (“email”)
- Number of Line Breaks Feature
Two SVM models are employed to identify the start line and the end line, respectively.
Automatic Feature Generation
- Feature definition is tedious.
- Can we automate feature generation?
- Input: An annotated email dataset.
- Output: Discovered features.
- Algorithm:
Step 1: Preprocessing. This step first processes emails by using hard rules. It replaces several special patterns by a tag. For example, an email address “joke@hotmail.com” is replaced by the tag <email>.
Step 2: Learning patterns. This step takes the header lines as positive samples and the other lines as negative samples. It employs the pattern learning tool to discover the patterns. An example of the discovered patterns is: “<begin> Date: <week> <date> <time> <end>”.
Step 3: Generating features. This step generates features according to the learned patterns by using heuristic rules. For the above example, the corresponding feature can be: “^\s*Date: <week> <date> <time>\s*$”. The feature represents whether or not the current line contains the pattern.
Generated Features
From: <email>
Subject: (.*?) Re:
<<email>> wrote in message
Date: <week> <date> <time>
Subject:
-----Original Message-----
To: <email>
….
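Steps 1 and 3 can be sketched as follows. Step 2 is not shown, because the slides do not name the pattern learning tool. The tag rules are assumed regular expressions written for this illustration, not the hard rules used by the authors, and `preprocess` and `pattern_to_feature` are hypothetical names.

```python
import re

MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Step 1 (assumed rules): replace special patterns by tags.
TAG_RULES = [
    ("<email>", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("<time>", re.compile(r"\b\d{1,2}:\d{2}:\d{2}(?: [+-]\d{4})?")),
    ("<date>", re.compile(r"\b\d{1,2} (?:%s) \d{4}\b" % MONTHS)),
    ("<week>", re.compile(r"\b(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun)\b,?")),
]

def preprocess(line):
    """Apply every tag rule in order and return the tagged line."""
    for tag, rule in TAG_RULES:
        line = rule.sub(tag, line)
    return line

def pattern_to_feature(pattern):
    """Step 3: turn a learned pattern into a boolean line feature.

    The <begin>/<end> markers become line anchors, mirroring the
    "^\\s*...\\s*$" form shown on the slide.
    """
    body = pattern.replace("<begin>", "").replace("<end>", "").strip()
    regex = re.compile(r"^\s*" + re.escape(body) + r"\s*$")
    return lambda line: bool(regex.match(preprocess(line)))

date_feature = pattern_to_feature("<begin> Date: <week> <date> <time> <end>")
tagged = preprocess("Date: Mon, 4 Apr 2005 11:29:28 +0530")
```

Here the learned pattern is matched against the tagged line, so the same feature fires for any date and time that the tag rules recognize.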
Example Features Used in Signature Detection
Position Feature
Is the first line or the last line?
Positive Word Features
Contains: “Best Regards”, “Thanks”, “Sincerely”, “Good luck”, etc.
Number of Words Feature
Number of words in the current line
Person Name Feature
Contains a person name?
Ending Character Features
Ends with: colon, semicolon, quotation mark, question mark, exclamation mark…
Special Symbol Pattern Features
Contains consecutive special symbols such as: “-------”, “======”, “******”.
Case Features
Whether the tokens are all in upper-case, all in lower-case, all capitalized or only the first token is capitalized
Number of Line Breaks Feature
Number of line breaks that exist before the current line
Example Features Used in Program Code Detection
Position Feature
Position of the current line
Declaration Keyword Features
Starts with: “string”, “char”, “double”, “dim”, “typedef struct”, “#include”, “import”, “#define”, “#undef”, etc.
Statement Keyword Features
There are four kinds of statement keyword features:
- “i++”;
- “if”, “else if”, “switch”, and “case”;
- “while”, “do{”, “for”, and “foreach”;
- “goto”, “continue;”, “next;”, “break;”
Equation Pattern Features
There are four kinds of equation pattern features:
- “=”, “<=” and “<<=”
- “a=b+/*-c;”
- “a=B(bb,cc);”
- “a=b;”
Function Pattern Feature
Contains function pattern? E.g., pattern covering “fread(pbBuffer,1, LOCK_SIZE, hSrcFile);”
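The keyword and pattern features for program code could be approximated with regular expressions, as in the sketch below. The keyword lists follow the slide; the exact expressions and feature names are assumptions, and the equation patterns are reduced to a single simple-assignment check.

```python
import re

DECLARATION_RE = re.compile(
    r"^\s*(?:string|char|double|dim|typedef struct|#include|import|#define|#undef)\b",
    re.IGNORECASE,
)
STATEMENT_RES = {
    "increment": re.compile(r"\b\w+\+\+"),
    "branch": re.compile(r"\b(?:if|else if|switch|case)\b"),
    "loop": re.compile(r"\b(?:while|for|foreach)\b|\bdo\s*\{"),
    "jump": re.compile(r"\bgoto\b|\b(?:continue|next|break)\s*;"),
}
# Simplified stand-in for the equation patterns: "name = something;".
ASSIGNMENT_RE = re.compile(r"\b\w+\s*=\s*[^=;]+;")
# Call statement such as "f(a, b);" with no nested parentheses.
FUNCTION_CALL_RE = re.compile(r"\b\w+\s*\([^()]*\)\s*;")

def code_line_features(line):
    """Boolean features for one line (hypothetical feature names)."""
    features = {"declaration_keyword": bool(DECLARATION_RE.search(line))}
    for name, regex in STATEMENT_RES.items():
        features[name] = bool(regex.search(line))
    features["assignment"] = bool(ASSIGNMENT_RE.search(line))
    features["function_call"] = bool(FUNCTION_CALL_RE.search(line))
    return features

import_line = code_line_features("import java.io.*;")
call_line = code_line_features("fread(pbBuffer,1, LOCK_SIZE, hSrcFile);")
prose_line = code_line_features("Your design of Matrix class is not good.")
```

Ordinary prose can still trigger keywords such as "if" or "for", which is one reason these are treated as features for a classifier and not as hard rules.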
Extra Line Break Detection Using SVMs
Feature set: position feature, bullet feature, ...
Training: Training data -> Feature extraction -> SVMs -> One SVM model (extra line break model)
Testing: Test data -> Feature extraction -> Extra line break model -> Identified extra line breaks
Features Used in Extra Line Break Detection
Position Feature
Is the first line or the last line?
Greeting Word Features
Contains: “Hi” and “Dear”, etc.
Ending Character Features
Ends with: colon, semicolon, quotation mark, question mark, exclamation mark, etc.
Case Features
Whether the current line ends with a word in lower case letters and whether or not the next line starts with a word in lower case letters
Bullet Features
Is the next line one kind of bullet of a list item like “1.” and “a)”?
Number of Line Breaks Feature
Number of line breaks that exist after the current line
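The features for a single line break might be extracted roughly as follows. This is a sketch under assumptions: the regular expressions and feature names are illustrative, "number of line breaks" is read as the count of blank lines directly after the current line, and punctuation attached to a word is ignored when checking its case.

```python
import re

BULLET_RE = re.compile(r"^\s*(?:\d+\.|[A-Za-z]\))(?:\s|$)")
GREETING_RE = re.compile(r"\b(?:hi|dear)\b", re.IGNORECASE)
ENDING_CHARS = (":", ";", '"', "?", "!")

def line_break_features(lines, i):
    """Features for the line break that follows lines[i]."""
    current = lines[i].strip()
    # Skip blank lines after the current line to find the next text line.
    j = i + 1
    while j < len(lines) and not lines[j].strip():
        j += 1
    blank_after = j - (i + 1)
    following = lines[j].strip() if j < len(lines) else ""
    last_word = current.split()[-1] if current else ""
    first_word = following.split()[0] if following else ""
    return {
        "is_first_or_last_line": i == 0 or i == len(lines) - 1,
        "contains_greeting": bool(GREETING_RE.search(current)),
        "ends_with_special_char": current.endswith(ENDING_CHARS),
        "ends_with_lowercase_word": last_word.islower(),
        "next_starts_with_lowercase_word": first_word.islower(),
        "next_is_bullet": bool(BULLET_RE.match(following)),
        "line_breaks_after": blank_after,
    }

body = [
    "Hi Ranger,",
    "Your design of Matrix",
    "class is not",
    "good.",
    "",
    "1. Use one matrix per object",
]
mid_sentence = line_break_features(body, 2)
before_list = line_break_features(body, 3)
```

In the example, the break after "class is not" has lower-case words on both sides, which is the kind of evidence that suggests an extra line break.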
Extra Line Break Detection
Hi Ranger,
Your design of Matrix
class is not
good.
what are you doing with two
matrices in a single class?make class Matrix as follows
Features:
- Position Feature
- Greeting Word Features
- Ending Character Features
- Case Features
- Bullet Features
- Number of Line Breaks Feature
One SVM model is employed to identify whether a line break is an extra one or not.
Case restoration
tri-gram + sentence level decoding
$$P(w_i \mid w_{i-2} w_{i-1}) = \frac{C(w_{i-2} w_{i-1} w_i)}{C(w_{i-2} w_{i-1})}$$
Backoff scheme:
$$P(w_i \mid w_{i-2} w_{i-1}) = \begin{cases} \dfrac{C(w_{i-2} w_{i-1} w_i) - D(C(w_{i-2} w_{i-1} w_i))}{C(w_{i-2} w_{i-1})} & \text{if } C(w_{i-2} w_{i-1} w_i) > 0 \\ \alpha(w_{i-2} w_{i-1}) \, P(w_i \mid w_{i-1}) & \text{otherwise} \end{cases}$$
Example sentence: Jack utilize outlook express to retrieve emails.
Candidate case forms for each word:
Jack Utilize Outlook Express To Receive Emails
jack utilize outlook express to receive emails
JACK UTILIZE OUTLOOK EXPRESS TO RECEIVE EMAILS
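Sentence level decoding over the candidate case forms can be illustrated with the sketch below. It uses exhaustive search over all combinations, which is only practical for short sentences and is an assumption made for brevity; the slides do not say which search procedure was used. `toy_logprob` is a made-up scorer standing in for a trained tri-gram model with backoff.

```python
import math
from itertools import product

def case_candidates(word):
    """Lower-case, capitalized, and upper-case forms of a word."""
    forms = [word.lower(), word.capitalize(), word.upper()]
    return list(dict.fromkeys(forms))  # drop duplicates, keep order

def restore_case(words, trigram_logprob):
    """Pick the case assignment with the highest total tri-gram score."""
    best, best_score = None, -math.inf
    for candidate in product(*(case_candidates(w) for w in words)):
        padded = ("<s>", "<s>") + candidate  # sentence-start padding
        score = sum(
            trigram_logprob(padded[k - 2], padded[k - 1], padded[k])
            for k in range(2, len(padded))
        )
        if score > best_score:
            best, best_score = candidate, score
    return list(best)

def toy_logprob(w2, w1, w):
    # Made-up scores: "Express" is preferred only right after "Outlook".
    if w.lower() == "express":
        wanted = "Express" if w1 == "Outlook" else "express"
        return 0.0 if w == wanted else -1.0
    if w.lower() in {"jack", "outlook"}:
        return 0.0 if w == w.capitalize() else -1.0
    return 0.0 if w.islower() else -1.0

restored = restore_case(
    ["jack", "utilize", "outlook", "express", "to", "receive", "emails"],
    toy_logprob,
)
```

With a real language model the same interface would apply, but dynamic programming over the word lattice would be needed for sentences of realistic length.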
Datasets in Experiments
Data Set    # of Email   Containing Header   Containing Signature   Containing Prog. Code   Text Only
DC          100          1.00                0.87                   0.15                    0.0
Ontology    100          1.00                0.77                   0.02                    0.0
NLP         60           1.00                0.883                  0.0                     0.0
ML          40           1.00                0.975                  0.05                    0.0
Jena        700          0.996               0.97                   0.38                    0.0
Weka        200          0.995               0.975                  0.17                    0.0005
Protégé     500          0.28                0.822                  0.032                   0.168
OWL         500          0.384               0.932                  0.042                   0.048
Mobility    400          0.44                0.745                  0.0                     0.183
WinServer   400          0.449               0.672                  0.0125                  0.221
Windows     1000         0.476               0.653                  0.007                   0.218
PSS         1000         0.492               0.668                  0.01                    0.208
BR          310          0.495               0.643                  0.0                     0.244
J2EE        255          1.00                0.561                  0.094                   0
Total       5565         3256 (0.585)        4229 (0.760)           401 (0.072)             768 (0.138)
73.2% contain extra line breaks
85.4% need sentence normalization
47.1% contain case errors
Only 1.6% are absolutely clean
Cleaning Results -- 5-fold Cross Validation
Cleaning Task                   Precision   Recall   F1-Measure
Header (Our Method)             0.9695      0.9742   0.9719
Header (Baseline)               0.9981      0.6055   0.7537
Signature (Our Method)          0.9133      0.8838   0.8983
Signature (Baseline)            0.8854      0.2368   0.3736
Quotation                       0.9818      0.9201   0.9500
Program Code                    0.9297      0.7217   0.8126
Extra Line Break (Our Method)   0.8553      0.9765   0.9119
Extra Line Break (Baseline)     0.6355      0.9813   0.7715
Sentence                        0.9493      0.9391   0.9442
Baseline methods:
• Header detection (eClean2000)
• Signature detection (rule based)
• Extra line break detection baseline (eClean2000)
For case restoration:
- Our method can reach 98.15% in terms of accuracy
- The accuracy of Trucasing is about 97.7%
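The F1-Measure column can be checked against precision and recall, assuming the standard definition of F1 as their harmonic mean. `f1_measure` is a hypothetical helper name; small differences in the last digit are expected because the reported precision and recall are themselves rounded.

```python
def f1_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Header detection baseline from the results table.
header_baseline_f1 = f1_measure(0.9981, 0.6055)
```

The baseline rows show why F1 is reported: very high precision with low recall still yields a modest F1.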
Automatic Features vs. Manual Features
Detection Task          Precision   Recall   F1-Measure
Header (Manual)         0.9695      0.9742   0.9719
Header (Automatic)      0.9932      0.9626   0.9777
Signature (Manual)      0.9133      0.8838   0.8983
Signature (Automatic)   0.7616      0.6671   0.7112
Term Extraction Using Email Cleaning
[Figure: two bar charts, one for the BR data set and one for the J2EE data set, showing term extraction Precision, Recall, and F1-Measure (Percentage, %) for Original Data, Baseline, and Our Method]
How Cleaning Processing Helps Term Extraction
[Figure: bar chart for the BR data set showing Precision, Recall, and F1-Measure (Percentage, %) for Original Data, +Header, +Signature, +Quotation, +Program, +Paragraph; annotated improvements: +6.4%, +74.2%, +41%]
How Cleaning Processing Helps Term Extraction (cont.)
[Figure: bar chart for the J2EE data set showing Precision, Recall, and F1-Measure (Percentage, %) for Original Data, +Header, +Quotation, +Signature, +Program, +Paragraph; annotated improvements: +2.3%, +24.7%, +42.4%]
Summary
- Formalized email data cleaning as non-text filtering and text normalization
- Conducted email cleaning in a ‘cascaded’ approach
- Used SVM models for header, signature, program code, and extra line break detection
- Our approach significantly outperforms baseline methods
- When applied to term extraction, a significant improvement in extraction accuracy can be obtained
Thanks!