Chad Mills
Program Manager, Windows Live Safety Platform
Microsoft

Training Time: training features → (feature, weight) pairs
Run Time: features → feature weights → score

(feature, weight) pairs:
(dollars, 0.2)
(million, 0.1)
(transfer, 0.1)
(guardian, 0.03)
(community, -0.01)
(social, -0.01)
(fellow, -0.01)
(March, -0.08)

Features: million, dollars, transfer, guardian. Score: 0.37
Features: March, community, social, fellow. Score: -0.11

Assumption:
◦ Spam words continue to appear in spam messages
◦ Good words continue to appear in good messages

From: "Chelsea Clark" <easyMoneySurveys@pointracer.com>
Subject: Get PaidFor yourOpinion
<style> …
<br Bij board bar atteindre jYST GCS re sonrisa fuse Kiviuq padded />
<br Star Honolulu />
<br Ons apporter /> opens NRSU syringe />
<br Jerusalem comfort HTTPS 2604 confidence Miles />
<br 27 mails Qty backwards Meditations bans sedative ect salve
<br insightful /> Korean relations header greeting Airllines Phantom CVS Rae 504 1009 perf
<br graphiques /> undertaking paced Liquidation reduction />
…

Group of words: newsletter peers month select these late click commissioner media smoothly off close support before okay sponsor rock go by ads none cases text membership
Group labels: Good, Good, Good, Good, Good
Overall: Good

Good Message + Spammy Words (Free, Nigeria, Viagra) = Borderline Spam Message
Borderline Spam + Unknown Words (late, click, commissioner) = Junk Folder: Non-Good Words
Borderline Spam + Unknown Words (newsletter, select, month) = Inbox: Good Words

Chaff Spam
[spam content]
newsletter peers month select these late click commissioner media smoothly off close support before okay sponsor rock go by ads none cases text membership

Legitimate Mail
March is all about the Zune community. This month, you can help create a new feature for The Social, get tips from a fellow Zune user and find out the winners of the Your Zune Your Choice Awards.
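The content filter scoring above can be sketched in a few lines. This is a minimal illustration, not the production filter: it assumes the score is a plain sum of per-word weights and that unknown words contribute zero, neither of which the slides state explicitly. The function name `content_filter_score` and the repeated-word chaff are hypothetical. The weights are the ones listed on the slide.

```python
# Feature weights as listed on the slide.
WEIGHTS = {
    "dollars": 0.2, "million": 0.1, "transfer": 0.1, "guardian": 0.03,
    "March": -0.08, "community": -0.01, "social": -0.01, "fellow": -0.01,
}

def content_filter_score(words, weights=WEIGHTS):
    """Sum the weights of the words in a message.

    Assumption: words without a learned weight contribute 0.
    """
    return sum(weights.get(word, 0.0) for word in words)

good = ["March", "community", "social", "fellow"]
spam = ["million", "dollars", "transfer", "guardian"]
# Hypothetical chaff: good words appended repeatedly to the spam content.
chaff = good * 5

print(round(content_filter_score(good), 2))          # -0.11
print(content_filter_score(spam) > 0)                # True
print(content_filter_score(spam + chaff) < 0)        # True
```

Appending enough good words drags the sum below zero even though the spam words are all still present, which is the weakness that good word chaff exploits.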
Metafeatures:
◦ Sum of weights (content filter score)
◦ Average weight
◦ Standard Deviation
◦ Percent of words that are good
◦ Percent of words that are spam
◦ Number of features
◦ Maximum feature weight
◦ Number of strong spam words
◦ Etc.

Training Time: training features → (feature, weight) pairs → metafeature extraction → Metafeatures → training → (Metafeature, weight) pairs
Run Time: features → feature weights → metafeature extraction → Metafeatures → Metafeature weights → score

Features: million, dollars, transfer, guardian
Metafeatures: Sum: 0.37, σ: 0.09, Max: 0.2
(Metafeature, weight): (Sum: 0.37, 1.0), (σ: 0.09, 0.8), (Max: 0.2, 0.1)
Score: 1.9

Features: March, community, social, fellow
Metafeatures: Sum: -0.11, σ: 0.04, Max: -0.1
(Metafeature, weight): (Sum: -0.11, -0.8), (σ: 0.04, -0.6), (Max: -0.1, -0.3)
Score: -1.7

Hotmail Feedback Loop
◦ Messages classified by recipients
Training Set: 1,800,000 messages
◦ Ending on 5/20/07
Evaluation Set: 50,000 messages
◦ Data from 5/21/07

45% improvement in TP at low FP levels
At a reasonable False Positive rate:
◦ 98% of unique catches are chaff spam
◦ Caught 99.5% of chaff spam missed by regular content filter
◦ Similar types of False Positives as regular filter
Challenges Remaining
◦ Primarily just helped on spam with chaff
◦ Relies on base content filter to detect spam with obfuscated content (e.g. v1agra) or naïve spam without any chaff

Spam messages with good word chaff have unnatural weight distributions
Metafeatures are able to identify and catch these messages
This resulted in a 45% improvement in TP
Gains were limited to spam with good word chaff
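The metafeature extraction step can be sketched as summary statistics over a message's per-word weights. This is an illustrative sketch under stated assumptions, not the implementation behind the slides: the function names are hypothetical, the sample standard deviation is assumed, "good" and "spam" words are assumed to be those with negative and positive weights, and the chaff message is made up by repeating the good words. The second-stage score is taken to be the sum of the metafeature weights, which matches the 1.9 and -1.7 shown for the two examples.

```python
import statistics

def extract_metafeatures(weights):
    """Summarize the distribution of per-word weights for one message."""
    if not weights:
        raise ValueError("message has no weighted features")
    n = len(weights)
    return {
        "sum": sum(weights),  # the base content filter score
        "mean": statistics.fmean(weights),
        # Assumption: sample standard deviation; 0.0 for a single word.
        "stdev": statistics.stdev(weights) if n > 1 else 0.0,
        "max": max(weights),
        "count": n,
        # Assumption: negative weight = good word, positive = spam word.
        "pct_good": sum(w < 0 for w in weights) / n,
        "pct_spam": sum(w > 0 for w in weights) / n,
    }

def metafeature_score(metafeature_weights):
    """Second-stage score: sum of the weights of a message's metafeatures."""
    return sum(metafeature_weights.values())

# Per-word weights from the slides.
good_weights = [-0.08, -0.01, -0.01, -0.01]   # March, community, social, fellow
spam_weights = [0.1, 0.2, 0.1]                # million, dollars, transfer
chaff_spam = spam_weights + good_weights * 5  # hypothetical chaff padding

good_mf = extract_metafeatures(good_weights)
chaff_mf = extract_metafeatures(chaff_spam)
print(round(good_mf["sum"], 2), round(good_mf["stdev"], 3))  # -0.11 0.035
print(round(chaff_mf["sum"], 2), chaff_mf["max"])            # -0.15 0.2

# (Metafeature, weight) pairs shown on the slides for the two examples.
first_message = {"Sum: 0.37": 1.0, "σ: 0.09": 0.8, "Max: 0.2": 0.1}
second_message = {"Sum: -0.11": -0.8, "σ: 0.04": -0.6, "Max: -0.1": -0.3}
print(round(metafeature_score(first_message), 1))   # 1.9
print(round(metafeature_score(second_message), 1))  # -1.7
```

In the chaff example the sum looks like good mail, but the maximum weight is still that of a strong spam word. A mismatch of this kind between the sum and the rest of the distribution is what a second-stage model trained on metafeatures can pick up.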