Email Data Cleaning (KDD’05)
Jie Tang1, Hang Li2, Yunbo Cao2, Zhaohui Tang3
1 Tsinghua University
2 Microsoft Research Asia
3 Microsoft Corporation

Outline
• Motivation and Problem Description
• Related Work
• Our Approach
• Implementation
• Experimental Results
• Summary

Motivation
• Email is one of the most common modes of communication
• Text mining applications on emails
- Email classification
- Email summarization
- Term extraction from email
- …

Term Extraction
Noisy email message:

From: SY <sandeep....@gmail.com> - Find messages by this author
Date: Mon, 4 Apr 2005 11:29:28 +0530
Subject: Re: ..How to do addition??

Hi Ranger,
Your design of Matrix class is not good. what are you doing with two matrices in a single class?make class Matrix as follows
import java.io.*;
class Matrix {
public static int AnumberOfRows;
public static int AnumberOfColumns;
private int matrixA[][];
public void inputArray() throws IOException
{
InputStreamReader input = new InputStreamReader(System.in);
BufferedReader keyboardInput = new BufferedReader(input)
}
-- Sandeep Yadav
Tel: 011-243600808
Homepage: http://www.it.com/~Sandeep/

On Apr 3, 2005 5:33 PM, ranger <asiri....@gmail.com> wrote:
> Hi... I want to perform the addtion in my Matrix class. I got the program to
> enter 2 Matricx and diaplay them. Hear is the code of the Matrix class and
> TestMatrix class. I'm glad If anyone can let me know how to do the addition.....Tnx

Noise marked in the message text: extra line break, missing space, extra space, missing period, case errors

Cleaned text:
Hi Ranger,
Your design of Matrix class is not good. What are you doing with two matrices in a single class?
Make class Matrix as follows:

Related Work -- Data Mining
• Email Cleaning
- Several products have the feature of email cleaning by using rules
- E.g. eClean (2000), WinPure ListCleaner Pro (2004)
• Information Extraction from Email
- Extracting contact information, etc.
- E.g.
Kristjansson and Culotta (2004), Culotta, Bekkerman, and McCallum (2004), Viola (2005)
• Web Page Cleaning
- Removing banner ads, decoration pictures
- E.g. Yi and Liu (2003), Lin and Ho (2002)
• Tabular Data Cleaning
- Detecting and removing duplicate information
- E.g. Hernández and Stolfo (1998), Rahm and Do (2000), SQL Server 2005

Related Work -- Language Processing
• Sentence Boundary Detection and Case Restoration
- Palmer and Hearst (1997), Lita and Ittycheriah (2003), Mikheev (2002)
• Spelling Error Correction
- Golding and Roth (1996)

Our Approach -- Cascaded Approach
Cleaning = non-text block filtering + text normalization
• Non-text block filtering
- Quotation detection
- Header detection
- Signature detection
- Program code detection
• Text normalization
- Paragraph normalization
* Extra line break detection
- Sentence normalization
* Missing period detection
- Word normalization
* Case restoration

Cascaded Approach
Noisy Email Message
• Non-text Block Filtering
- Quotation Detection
- Header Detection
- Signature Detection
- Program Code Detection
• Paragraph Normalization
- Extra Line Break Detection
• Sentence Normalization
- Missing Period and Missing Space Detection
- Extra Space Detection
• Word Normalization
- Case Restoration
Cleaned Email Message

In a particular text mining application, we can retain some of the blocks.

Example (noisy email message):

From: SY <sandeep....@gmail.com> - Find messages by this author
Date: Mon, 4 Apr 2005 11:29:28 +0530
Subject: Re: ..How to do addition??

Hi Ranger,
Your design of Matrix class is not good. what are you doing with two matrices in a single class?make class Matrix as follows
import java.io.*;
class Matrix {
public static int AnumberOfRows;
public static int AnumberOfColumns;
private int matrixA[][];
public void inputArray() throws IOException
{
InputStreamReader input = new InputStreamReader(System.in);
BufferedReader keyboardInput = new BufferedReader(input)
}
-- Sandeep Yadav
Tel: 011-243600808
Homepage: http://www.it.com/~Sandeep/

On Apr 3, 2005 5:33 PM, ranger <asiri....@gmail.com> wrote:
> Hi... I want to perform the addtion in my Matrix class. I got the program to
> enter 2 Matricx and diaplay them. Hear is the code of the Matrix class and
> TestMatrix class. I'm glad If anyone can let me know how to do the addition..

Example (message text after normalization):
Hi Ranger,
Your design of Matrix class is not good. What are you doing with two matrices in a single class? Make class Matrix as follows.

Technical Issues
• Non-text filtering
- Quotation detection
- Header detection
- Signature detection
- Program code detection
• Text normalization
- Extra line break detection
- Sentence normalization
- Case restoration

Non-text Filtering Using SVMs
• Used for header detection, signature detection, and program code detection
• Two feature sets: a start line feature set and an end line feature set (position feature, positive word feature, ...)
• Training: training data → feature extraction → SVMs → two SVM models (start line model, end line model)
• Testing: test data → feature extraction → start line model and end line model → identified blocks

Features Used in Header Detection
• Position Feature
- Is the current line the first line?
• Positive Word Features
- Begins with: “From:”, “Re:”, “In article”, etc.
- Contains: “original message”, “Fwd:”, etc.
- Ends with: “wrote:”, “said:”, etc.
• Negative Word Features
- Contains: “Hi”, “dear”, “thank you”, “best regards”, etc.
• Number of Words Feature
- Number of words in the current line
• Person Name Feature
- Contains a person name?
• Ending Character Features
- Ends with: colon, semicolon, quotation mark, question mark, exclamation mark, etc.
• Special Pattern Features
- Contains one type of special pattern: email, date, number, URL, percentage, etc.
• Number of Line Breaks Feature
- Number of line breaks before the current line

Header Detection
Two SVM models are employed to identify the start line and the end line, respectively.

Header lines of the example message:

From: SY <sandeep....@gmail.com> - Find messages by this author
Date: Mon, 4 Apr 2005 11:29:28 +0530
Subject: Re: ..How to do addition??

Features applied (matching text from the example in parentheses):
• Position Feature
• Positive Word Features (“From:”, “Subject:”)
• Negative Word Features
• Number of Words Feature
• Person Name Feature
• Ending Character Features (“??”)
• Special Pattern Features (“email”)
• Number of Line Breaks Feature

Automatic Feature Generation
- Feature definition is tedious.
- Can we automate feature generation?
- Input: An annotated email dataset.
- Output: Discovered features.
- Algorithm:
Step 1: Preprocessing. This step first processes emails by using hard rules. It replaces several special patterns with tags. For example, the email address “joke@hotmail.com” is replaced by the tag <email>.
Step 2: Learning patterns. This step takes the header lines as positive samples and the other lines as negative samples. It employs a pattern learning tool to discover the patterns. An example of the discovered patterns is: “<begin> Date: <week> <date> <time> <end>”.
Step 3: Generating features. This step generates features according to the learned patterns by using heuristic rules. For the above example, the corresponding feature can be: “^\s*Date: <week> <date> <time>\s*$”. The feature represents whether or not the current line contains the pattern.

Generated Features
- From: <email>
- Subject: (.*?)
- Re:
- <<email>> wrote in message
- <date>
- Date: <week>
- Subject:
- <week> <date> <time>
- Date:
- -----Original Message-----
- To: <email>
- ….

Example Features Used in Signature Detection
• Position Feature
- Is the first line or the last line?
• Positive Word Features
- Contains: “Best Regards”, “Thanks”, “Sincerely”, “Good luck”, etc.
• Number of Words Feature
- Number of words in the current line
• Person Name Feature
- Contains a person name?
• Ending Character Features
- Ends with: colon, semicolon, quotation mark, question mark, exclamation mark…
• Special Symbol Pattern Features
- Contains consecutive special symbols such as: “-------”, “======”, “******”.
• Case Features
- Whether the tokens are all in upper case, all in lower case, all capitalized, or only the first token is capitalized
• Number of Line Breaks Feature
- Number of line breaks before the current line

Example Features Used in Program Code Detection
• Position Feature
- Position of the current line
• Declaration Keyword Features
- Starts with: “string”, “char”, “double”, “dim”, “typedef struct”, “#include”, “import”, “#define”, “#undef”, etc.
• Statement Keyword Features
There are four kinds of statement keyword features:
- “i++”;
- “if”, “else if”, “switch”, and “case”;
- “while”, “do{”, “for”, and “foreach”;
- “goto”, “continue;”, “next;”, “break;”
• Equation Pattern Features
There are four kinds of equation pattern features:
- “=”, “<=” and “<<=”
- “a=b+/*-c;”
- “a=B(bb,cc);”
- “a=b;”
• Function Pattern Feature
- Contains a function pattern? E.g., a pattern covering “fread(pbBuffer,1, LOCK_SIZE, hSrcFile);”

Extra Line Break Detection Using SVMs
• Feature set: position feature, bullet feature, ...
• Training: training data → feature extraction → SVMs → one SVM model (extra line break model)
• Testing: test data → feature extraction → extra line break model → identified extra line breaks

Features Used in Extra Line Break Detection
• Position Feature
- Is the first line or the last line?
• Greeting Word Features
- Contains: “Hi”, “Dear”, etc.
• Ending Character Features
- Ends with: colon, semicolon, quotation mark, question mark, exclamation mark, etc.
• Case Features
- Whether the current line ends with a word in lower-case letters and whether or not the next line starts with a word in lower-case letters
• Bullet Features
- Is the next line one kind of bullet of a list item like “1.” and “a)”?
• Number of Line Breaks Feature
- Number of line breaks after the current line

Extra Line Break Detection
One SVM model is employed to identify whether a line break is an extra one or not.

Hi Ranger,
Your design of Matrix
class is not good.
what are you doing with two matrices in a single class?make class Matrix as follows

Features applied: Position Feature, Greeting Word Features, Ending Character Features, Case Features, Bullet Features, Number of Line Breaks Feature

Case restoration
• Tri-gram language model + sentence-level decoding
$$P(w_i \mid w_{i-2} w_{i-1}) = \frac{C(w_{i-2} w_{i-1} w_i)}{C(w_{i-2} w_{i-1})}$$
• Backoff scheme:
$$P(w_i \mid w_{i-2} w_{i-1}) = \begin{cases} \dfrac{C(w_{i-2} w_{i-1} w_i) - D(C(w_{i-2} w_{i-1} w_i))}{C(w_{i-2} w_{i-1})} & \text{if } C(w_{i-2} w_{i-1} w_i) > 0 \\ \alpha(w_{i-2} w_{i-1}) \, P(w_i \mid w_{i-1}) & \text{otherwise} \end{cases}$$
• Example sentence in different case forms:
- Jack utilize outlook express to retrieve emails.
- Jack Utilize Outlook Express To Receive Emails
- jack utilize outlook express to receive emails
- JACK UTILIZE OUTLOOK EXPRESS TO RECEIVE EMAILS

Datasets in Experiments

Data Set     # of Email
DC           100
Ontology     100
NLP          60
ML           40
Jena         700
Weka         200
Protégé      500
OWL          500
Mobility     400
WinServer    400
Windows      1000
PSS          1000
BR           310
J2EE         255
Total        5565

Totals over all data sets:

Block Type               # of Email   Fraction
Containing Header        3256         0.585
Containing Signature     4229         0.760
Containing Prog. Code    401          0.072
Text Only                768          0.138

Per-data-set fractions for the columns Containing Header, Containing Signature, Containing Prog. Code, and Text Only (values as they appear; row and column alignment not recoverable):
1.00 0.87 0.15 1.00 1.00 1.00 0.996 0.995 0.28 0.384 0.44 0.449 0.476 0.492 0.495
1.00 0.77 0.883 0.975 0.97 0.975 0.822 0.932 0.745 0.672 0.653 0.668 0.643 0.561
0.02 0.0 0.05 0.38 0.17 0.032 0.042 0.0 0.0125 0.007 0.01 0.0 0.094
0.0 0.0 0.0 0.0 0.0 0.0005 0.168 0.048 0.183 0.221 0.218 0.208 0.244 0

• 73.2% contain extra line breaks
• 85.4% need sentence normalization
• 47.1% contain case errors
• Only 1.6% are absolutely clean

Cleaning Results -- 5-fold Cross Validation

Cleaning Task      Method       Precision   Recall   F1-Measure
Header             Our Method   0.9695      0.9742   0.9719
                   Baseline     0.9981      0.6055   0.7537
Signature          Our Method   0.9133      0.8838   0.8983
                   Baseline     0.8854      0.2368   0.3736
Quotation                       0.9818      0.9201   0.9500
Program Code                    0.9297      0.7217   0.8126
Extra Line Break   Our Method   0.8553      0.9765   0.9119
                   Baseline     0.6355      0.9813   0.7715
Sentence                        0.9493      0.9391   0.9442

Baseline methods:
• Header detection (eClean2000)
• Signature detection (rule based)
• Extra line break detection (eClean2000)

For case restoration:
- Our method can reach 98.15% in terms of accuracy
- The accuracy of the TrueCasing baseline is about 97.7%

Automatic Features vs. Manual Features

Detection Task   Features    Precision   Recall   F1-Measure
Header           Manual      0.9695      0.9742   0.9719
                 Automatic   0.9932      0.9626   0.9777
Signature        Manual      0.9133      0.8838   0.8983
                 Automatic   0.7616      0.6671   0.7112

Term Extraction Using Email Cleaning
[Figure: two bar charts, one for the BR data set and one for the J2EE data set; y-axis Percentage(%); x-axis Precision, Recall, F1-Measure; series: Original Data, Baseline, Our Method]

How Cleaning Processing Helps Term Extraction
[Figure: bar chart for the BR data set; y-axis Percentage(%); x-axis Precision, Recall, F1-Measure; series: Original Data, +Header, +Signature, +Quotation, +Program, +Paragraph; annotated gains: +6.4%, +74.2%, +41%]

How Cleaning Processing Helps Term Extraction (cont.)
[Figure: bar chart for the J2EE data set; y-axis Percentage(%); x-axis Precision, Recall, F1-Measure; series: Original Data, +Header, +Quotation, +Signature, +Program, +Paragraph; annotated gains: +2.3%, +24.7%, +42.4%]

Summary
• Formalized email data cleaning as non-text filtering and text normalization
• Conducted email cleaning in a ‘cascaded’ approach
• Used SVM models for header, signature, program code, and extra line break detection
• Our approach significantly outperforms baseline methods
• When applied to term extraction, email cleaning yields a significant improvement in extraction accuracy

Thanks!
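
Backup: Header Detection Features in Code
The line-level features listed under "Features Used in Header Detection" can be sketched in code. What follows is a minimal Python sketch, not the authors' implementation. The function name `header_line_features`, the regular expressions for the special patterns, and the reading of "number of line breaks" as the count of blank lines directly before the current line are all assumptions made for illustration. The person name feature is left out because it would need a name lexicon that the slides do not describe.

```python
import re

# Word lists taken from the slides ("etc." entries are not expanded).
POSITIVE_PREFIXES = ("from:", "re:", "in article")
POSITIVE_CONTAINS = ("original message", "fwd:")
POSITIVE_SUFFIXES = ("wrote:", "said:")
NEGATIVE_WORDS = ("hi", "dear", "thank you", "best regards")
ENDING_CHARS = ':;"?!'

# Hypothetical regexes: the slides name the pattern types only.
SPECIAL_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "url": re.compile(r"https?://\S+"),
    "percentage": re.compile(r"\d+(?:\.\d+)?%"),
    "number": re.compile(r"\d+"),
}


def header_line_features(lines, index):
    """Return a dict of header-detection features for lines[index]."""
    line = lines[index].strip()
    lower = line.lower()

    # Assumption: "line breaks before" = blank lines directly above this line.
    breaks_before = 0
    j = index - 1
    while j >= 0 and not lines[j].strip():
        breaks_before += 1
        j -= 1

    features = {
        "is_first_line": index == 0,
        "begins_with_positive": lower.startswith(POSITIVE_PREFIXES),
        "contains_positive": any(w in lower for w in POSITIVE_CONTAINS),
        "ends_with_positive": lower.endswith(POSITIVE_SUFFIXES),
        # Whole-word match so that "this" does not count as "hi".
        "contains_negative": any(
            re.search(r"\b" + re.escape(w) + r"\b", lower)
            for w in NEGATIVE_WORDS
        ),
        "num_words": len(line.split()),
        # The ending character itself, or "" if it is not one of interest.
        "ending_char": line[-1] if line and line[-1] in ENDING_CHARS else "",
        "line_breaks_before": breaks_before,
    }
    # One boolean per special pattern type.
    for name, pattern in SPECIAL_PATTERNS.items():
        features["has_" + name] = bool(pattern.search(line))
    return features
```

In a full pipeline, each line's feature dictionary would be converted to a numeric vector and scored by the start line and end line SVM models; that step is omitted here because the slides do not specify the SVM configuration.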