Spam Detection
Jingrui He
10/08/2007

Spam Types
    Email spam: unsolicited commercial email
    Blog spam: unwanted comments in blogs
    Splogs: fake blogs created to boost PageRank

From a Learning Point of View
    Spam detection is a classification problem (ham vs. spam)
    Feature extraction
        A Learning Approach to Spam Detection based on Social Networks. H.Y. Lam and D.Y. Yeung
    Fast classifier
        Relaxed Online SVMs for Spam Filtering. D. Sculley and G.M. Wachman

A Learning Approach to Spam Detection based on Social Networks
H.Y. Lam and D.Y. Yeung
CEAS 2007

Problem Statement
    n email accounts, forming a sender set and a receiver set
    Labeled sender set: a subset of the senders with known labels
    Goal: assign a score to each of the remaining (unlabeled) sender accounts

System Flow Chart: Social Network from Logs
    Directed graph: one node per email account
    Directed edge: an email was sent from one account to another
    Edge weight: the number of emails sent from the source account to the destination account

Features from Email Social Networks
    In-count / out-count
        The sum of incoming / outgoing edge weights
    In-degree / out-degree
        The number of email accounts that a node receives emails from / sends emails to
    Communication Reciprocity (CR)
        The percentage of interactive neighbors that a node has
        Computed from the set of accounts that sent emails to the node and the set of accounts that received emails from the node
    Communication Interaction Average (CIA)
        The level of interaction between a sender and each of the corresponding recipients
    Clustering Coefficient (CC)
        Friends-of-friends relationship between email accounts
        Computed from the number of connections between neighbors of the node and the number of neighbors of the node

System Flow Chart: Preprocessing
    Sender feature vector
    Weighted features: problematic?
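The count, degree, and reciprocity features listed above can be illustrated with a minimal sketch. It assumes the email log is available as a list of (sender, receiver) pairs, one per email. The function name `social_features` is hypothetical, and the CR formula used here (interactive neighbors divided by all neighbors) is an assumed reading of "percentage of interactive neighbors", since the exact formula is not given in the text. CIA and CC are left out for the same reason.

```python
from collections import defaultdict


def social_features(email_log):
    """Compute per-account features from (sender, receiver) pairs, one pair per email."""
    # Edge weight: number of emails sent from sender to receiver.
    weight = defaultdict(int)
    for sender, receiver in email_log:
        weight[(sender, receiver)] += 1

    out_nbrs = defaultdict(set)   # accounts a node sent emails to
    in_nbrs = defaultdict(set)    # accounts a node received emails from
    in_count = defaultdict(int)   # sum of incoming edge weights
    out_count = defaultdict(int)  # sum of outgoing edge weights
    for (s, r), w in weight.items():
        out_nbrs[s].add(r)
        in_nbrs[r].add(s)
        out_count[s] += w
        in_count[r] += w

    features = {}
    for node in set(out_nbrs) | set(in_nbrs):
        senders_to_node = in_nbrs.get(node, set())
        receivers_from_node = out_nbrs.get(node, set())
        neighbors = senders_to_node | receivers_from_node
        interactive = senders_to_node & receivers_from_node
        features[node] = {
            "in_count": in_count.get(node, 0),
            "out_count": out_count.get(node, 0),
            "in_degree": len(senders_to_node),
            "out_degree": len(receivers_from_node),
            # Assumed CR definition: share of neighbors with two-way traffic.
            "cr": len(interactive) / len(neighbors) if neighbors else 0.0,
        }
    return features
```

Under this reading, an account that sends to many recipients but never receives a reply gets a CR of 0, which is the kind of pattern the feature is meant to expose.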
System Flow Chart: Assigning Spam Score
    Similarity-weighted k-NN method
    Gaussian similarity $w_{ij}$ between feature vectors $x_i$ and $x_j$
    Similarity-weighted mean of the k-NN scores:
        $y_i = \frac{\sum_{j: x_j \in N_k(x_i)} w_{ij} y_j}{\sum_{j: x_j \in N_k(x_i)} w_{ij}}$
        where $N_k(x_i)$ is the set of k nearest neighbors of $x_i$
    Score scaling

Experiments
    Enron dataset: 9150 senders
    To get legitimate Enron senders: email transactions within the Enron email domain
    5000 generated spam accounts
    120 senders from each class
    Results averaged over 100 runs

Results (plots)
    Number of nearest neighbors
    Feature weights (CC)
    Feature weights (CIA)
    Feature weights (CR)
    Feature weights (in/out-count & in/out-degree)
    The smaller the better

Final Weights
    In/out-count & in/out-degree: 1
    CR: 1
    CIA: 10
    CC: 15

Conclusion
    Legitimacy score
        No content needed
    Can be combined with content-based filters
        Classifiers using combined features
    More sophisticated classifiers
        SVM, boosting, etc.

Relaxed Online SVMs for Spam Filtering
D. Sculley and G.M. Wachman
SIGIR 2007

Anti-Spam Controversy: Support Vector Machines (SVMs)
    Academic researchers
        Statistically robust
        State-of-the-art performance
    Practitioners
        Quadratic in the number of training examples
        Impractical!
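Returning briefly to the Lam and Yeung system: the similarity-weighted k-NN scoring on the "Assigning Spam Score" slide can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation. The Gaussian similarity is taken to be the common form exp(-||x_i - x_j||^2 / (2 sigma^2)), the bandwidth `sigma` and the score convention (1.0 for spam, 0.0 for legitimate) are assumptions, and the score-scaling step is omitted because its formula is not given in the text.

```python
import math


def knn_spam_score(x, labeled, k=3, sigma=1.0):
    """Similarity-weighted mean of the scores of the k nearest labeled senders.

    labeled: list of (feature_vector, score) pairs.
    Assumed convention: score 1.0 = spam, 0.0 = legitimate.
    """
    if not labeled:
        raise ValueError("at least one labeled sender is required")

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # The set of k nearest neighbors of x (Euclidean distance).
    nearest = sorted(labeled, key=lambda item: sq_dist(x, item[0]))[:k]

    # Assumed Gaussian similarity between x and each neighbor.
    weights = [math.exp(-sq_dist(x, xj) / (2.0 * sigma ** 2)) for xj, _ in nearest]
    total = sum(weights)
    if total == 0.0:
        # All weights underflowed to zero; fall back to the plain mean.
        return sum(yj for _, yj in nearest) / len(nearest)
    return sum(w * yj for w, (_, yj) in zip(weights, nearest)) / total
```

Because the weights decay with distance, a far-away neighbor that happens to fall inside the k nearest contributes very little to the final score.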
Solution: Relaxed Online SVMs

Background: SVMs
    Data set: $\{(x_i, y_i)\}_{i=1}^{n}$
    Class label $y_i$: 1 for spam; -1 for ham
    Classifier: $f(x) = w \cdot x + b$
    To find: $w$ and $b$
    Minimize: $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$
        First term: maximizing the margin
        Second term: minimizing the loss function
        $C$: tradeoff parameter; $\xi_i$: slack variable
    Constraints: $y_i f(x_i) \ge 1 - \xi_i$, $\xi_i \ge 0$

Online SVMs

Tuning the Tradeoff Parameter C
    SpamAssassin data set: 6034 examples
    Large C preferred

Email Spam and SVMs
    TREC05P-1: 92189 messages
    TREC06P: 37822 messages

Blog Comment Spam and SVMs
    Leave-one-out cross validation
    50 blog posts; 1024 comments

Splogs and SVMs
    Leave-one-out cross validation
    1380 examples

Computational Cost
    Online SVMs: quadratic training time

Relaxed Online SVMs (ROSVM)
    Objective function of SVMs: large C preferred
        Minimizing training error is more important than maximizing the margin
    ROSVM
        Full margin maximization is not necessary
        Relax this requirement

Three Ways to Relax SVMs (1): Only Optimize Over the Recent p Examples
    Dual form of SVMs and its constraints, restricted to the p most recent examples
    Older examples keep the last value found for their dual variable when they were still being optimized

Three Ways to Relax SVMs (2): Only Update on Actual Errors
    Original online SVMs: update when $y_i f(x_i) < 1$
    ROSVM: update when $y_i f(x_i) < m$
    m = 0: mistake-driven online SVMs
    No significant degradation in performance
    Significantly reduced cost

Three Ways to Relax SVMs (3): Reduce the Number of Iterations in Iterative SVMs
    SMO: repeated passes over the training set to minimize the objective function
    Parameter T: the maximum number of iterations
    T = 1: little impact on performance

Results (plots)
    Testing reduced size
    Testing reduced iterations
    Testing reduced updates

Online SVMs and ROSVM
    ROSVM results: email spam; blog comment spam; splog data set
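The three relaxations can be seen together in a small toy sketch. It is not the ROSVM implementation from the paper: hinge-loss sub-gradient steps stand in for the SMO solver, the model is linear, and the class name `RelaxedOnlineLinearSVM`, the step size `lr`, and all default values are assumptions made for illustration. What it does keep is the structure of the method: a window of the p most recent examples, an update only when y f(x) < m, and at most T passes per update.

```python
from collections import deque


class RelaxedOnlineLinearSVM:
    """Toy linear online SVM with the three relaxations (illustrative only).

    Assumption: hinge-loss sub-gradient steps replace the SMO solver.
    """

    def __init__(self, dim, C=10.0, m=0.8, p=100, T=1, lr=0.01):
        self.w = [0.0] * dim
        self.b = 0.0
        self.C = C    # tradeoff parameter
        self.m = m    # update threshold; m = 0 is mistake-driven
        self.T = T    # maximum number of passes per update
        self.lr = lr  # step size (hypothetical, not from the slides)
        # Relaxation 1: keep only the p most recent examples.
        self.window = deque(maxlen=p)
        self.updates = 0

    def decision(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def _one_pass(self):
        """One sub-gradient pass over the window on 0.5*||w||^2 + C * hinge loss."""
        n = len(self.window)
        for x, y in self.window:
            violated = y * self.decision(x) < 1.0
            for i in range(len(self.w)):
                grad = self.w[i] / n
                if violated:
                    grad -= self.C * y * x[i]
                self.w[i] -= self.lr * grad
            if violated:
                self.b += self.lr * self.C * y

    def observe(self, x, y):
        """Score x first, then learn from its true label y (+1 spam, -1 ham)."""
        score = self.decision(x)
        self.window.append((x, y))
        # Relaxation 2: retrain only when the example is poorly classified.
        if y * score < self.m:
            self.updates += 1
            # Relaxation 3: cap the number of optimization passes at T.
            for _ in range(self.T):
                self._one_pass()
        return score
```

A short usage example: feed a stream of labeled messages through `observe` and compare `updates` with the stream length to see how often retraining was skipped.

```python
clf = RelaxedOnlineLinearSVM(dim=2)
stream = [([1.0, 0.0], 1), ([0.0, 1.0], -1)] * 50
for x, y in stream:
    clf.observe(x, y)
print(clf.updates, "updates for", len(stream), "examples")
```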