Spam Detection Jingrui He 10/08/2007

Spam Types
Email Spam: Unsolicited commercial email
Blog Spam: Unwanted comments in blogs
Splogs: Fake blogs to boost PageRank
From Learning Point of View
Spam Detection: Classification problem (ham vs. spam)
Feature Extraction: A Learning Approach to Spam Detection based on Social Networks. H.Y. Lam and D.Y. Yeung
Fast Classifier: Relaxed Online SVMs for Spam Filtering. D. Sculley, G.M. Wachman
A Learning Approach to Spam
Detection based on Social Networks
H.Y. Lam and D.Y. Yeung
CEAS 2007
Problem Statement
n Email Accounts
Sender Set; Receiver Set
Labeled Sender Set: a subset of the senders whose labels are known
Goal: Assign a score to each of the remaining (unlabeled) sender accounts
System Flow Chart
Social Network from Logs
Directed Graph
Directed Edge from account a to account b: an email was sent from a to b
Edge Weight of (a, b) = the number of emails sent from a to b
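The graph construction can be sketched as follows. This is a minimal illustration, assuming the log is available as a list of (sender, receiver) pairs; the function name `build_email_graph` and the sample addresses are hypothetical and not taken from the paper.

```python
from collections import Counter

def build_email_graph(log):
    """Return a Counter mapping each directed edge (sender, receiver)
    to its weight, i.e. the number of emails sent along that edge."""
    weights = Counter()
    for sender, receiver in log:
        weights[(sender, receiver)] += 1
    return weights

# Hypothetical log entries, one per email.
log = [("alice", "bob"), ("alice", "bob"), ("bob", "alice"), ("carol", "alice")]
graph = build_email_graph(log)
```

Edges that never appear in the log simply have weight 0 when looked up in the `Counter`.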
Features from Email Social Networks
In-count / Out-count: The sum of in-coming / out-going edge weights
In-degree / Out-degree: The number of email accounts that a node receives emails from / sends emails to
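A short sketch of these four features, assuming the graph is stored as a dictionary that maps each directed edge (sender, receiver) to its email count. The function name and the sample data are illustrative assumptions.

```python
from collections import defaultdict

def count_and_degree(weights):
    """Compute in-count, out-count, in-degree and out-degree per account.

    weights: dict mapping (sender, receiver) -> number of emails sent.
    """
    in_count = defaultdict(int)
    out_count = defaultdict(int)
    in_degree = defaultdict(int)
    out_degree = defaultdict(int)
    for (src, dst), w in weights.items():
        out_count[src] += w   # total emails sent by src
        in_count[dst] += w    # total emails received by dst
        out_degree[src] += 1  # one more distinct recipient of src
        in_degree[dst] += 1   # one more distinct sender to dst
    return in_count, out_count, in_degree, out_degree

# Hypothetical edge weights.
weights = {("alice", "bob"): 2, ("bob", "alice"): 1, ("carol", "alice"): 1}
in_count, out_count, in_degree, out_degree = count_and_degree(weights)
```

Because each (sender, receiver) pair occurs once as a key, counting keys gives the number of distinct neighbors.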
Features from Email Social Networks

Communication Reciprocity (CR)
The percentage of interactive neighbors that a node has
Defined from two sets for each node:
The set of accounts that sent emails to the node
The set of accounts that received emails from the node
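The slide's formula did not survive, so the following is only a sketch under an explicit assumption: CR is taken as the fraction of a node's recipients that also sent email back to it, that is, the size of the intersection of the two sets divided by the size of the out-going set. The choice of denominator and the function name are assumptions.

```python
def communication_reciprocity(weights, node):
    """Fraction of the accounts `node` wrote to that also wrote to `node`.

    weights: dict mapping (sender, receiver) -> number of emails sent.
    Assumption: the denominator is the set of recipients of `node`.
    """
    senders_to_node = {src for (src, dst) in weights if dst == node}
    recipients_of_node = {dst for (src, dst) in weights if src == node}
    if not recipients_of_node:
        return 0.0  # the node never sent anything
    interactive = senders_to_node & recipients_of_node
    return len(interactive) / len(recipients_of_node)

# Hypothetical edge weights.
weights = {("alice", "bob"): 2, ("bob", "alice"): 1, ("carol", "alice"): 1}
cr_alice = communication_reciprocity(weights, "alice")
cr_carol = communication_reciprocity(weights, "carol")
```

An account that only sends and never receives replies, as a spammer typically would, gets a CR of 0 under this definition.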
Features from Email Social Networks

Communication Interaction Average (CIA)

The level of interaction between a sender and
each of the corresponding recipients
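The exact CIA formula is missing from the extracted slide. The sketch below assumes one plausible reading: for each recipient, take the ratio of emails received back to emails sent, then average over all recipients. Both this definition and the function name are assumptions, not the authors' stated formula.

```python
def communication_interaction_average(weights, node):
    """Average, over the recipients of `node`, of (emails received back
    from the recipient) / (emails sent to the recipient).

    weights: dict mapping (sender, receiver) -> number of emails sent.
    Assumed definition; the slide's own formula was lost.
    """
    recipients = [dst for (src, dst) in weights if src == node]
    if not recipients:
        return 0.0
    total = 0.0
    for dst in recipients:
        sent = weights[(node, dst)]
        received = weights.get((dst, node), 0)
        total += received / sent
    return total / len(recipients)

# Hypothetical edge weights.
weights = {("alice", "bob"): 2, ("bob", "alice"): 1, ("carol", "alice"): 1}
cia_alice = communication_interaction_average(weights, "alice")
cia_bob = communication_interaction_average(weights, "bob")
```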
Features from Email Social Networks

Clustering Coefficient (CC)
Friends-of-friends relationship between email accounts
Computed from:
Number of connections between neighbors of the node
Number of neighbors of the node
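As an illustration, the sketch below uses the standard undirected local clustering coefficient, 2E / (k(k-1)), where k is the number of neighbors of the node and E is the number of connections between those neighbors. Treating email edges as undirected is an assumption made here; the slide's own formula was not recoverable.

```python
from collections import defaultdict
from itertools import combinations

def clustering_coefficient(weights, node):
    """Local clustering coefficient of `node`, ignoring edge direction.

    weights: dict mapping (sender, receiver) -> number of emails sent.
    """
    neighbors = defaultdict(set)
    for (src, dst) in weights:
        if src != dst:
            neighbors[src].add(dst)
            neighbors[dst].add(src)
    nbrs = neighbors[node]
    k = len(nbrs)
    if k < 2:
        return 0.0  # fewer than two neighbors: no pair to connect
    links = sum(1 for a, b in combinations(nbrs, 2) if b in neighbors[a])
    return 2.0 * links / (k * (k - 1))

# Hypothetical edge weights: alice knows bob, carol and dave; bob knows carol.
weights = {("alice", "bob"): 1, ("carol", "alice"): 1,
           ("bob", "carol"): 3, ("alice", "dave"): 1}
cc_alice = clustering_coefficient(weights, "alice")
```

Here one of the three possible pairs among the neighbors of `alice` is connected, so the coefficient is 1/3.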
Preprocessing

Sender Feature Vector



Weighted Features

Problematic?
Assigning Spam Score
Similarity Weighted k-NN method
Gaussian similarity
Similarity weighted mean k-NN scores:
$y_i = \frac{\sum_{j: x_j \in N_i} w_{ij} y_j}{\sum_{j: x_j \in N_i} w_{ij}}$, where $N_i$ is the set of k nearest neighbors of $x_i$
Score scaling
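The scoring step can be sketched as below. The Gaussian similarity is written in its common form exp(-d^2 / (2 sigma^2)); the bandwidth `sigma`, the label convention (1 for spam, 0 for legitimate), and the function names are assumptions, and the score-scaling step is left out.

```python
import math

def gaussian_similarity(a, b, sigma=1.0):
    """Gaussian similarity between two feature vectors (assumed form)."""
    sq_dist = sum((u - v) ** 2 for u, v in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def knn_score(x, labeled, k=3, sigma=1.0):
    """Similarity-weighted mean of the labels of the k most similar senders.

    labeled: list of (feature_vector, label) pairs.
    """
    scored = [(gaussian_similarity(x, xj, sigma), yj) for xj, yj in labeled]
    nearest = sorted(scored, key=lambda t: t[0], reverse=True)[:k]
    total = sum(w for w, _ in nearest)
    if total == 0.0:
        # All similarities underflowed: fall back to the plain mean.
        return sum(y for _, y in nearest) / len(nearest)
    return sum(w * y for w, y in nearest) / total

# Hypothetical labeled senders: two legitimate (0) and two spam (1).
labeled = [([0.0, 0.0], 0), ([0.0, 1.0], 0), ([5.0, 5.0], 1), ([5.0, 6.0], 1)]
spam_like = knn_score([5.0, 5.5], labeled, k=2)
mixed = knn_score([0.0, 0.0], labeled, k=3)
```

With `k=3`, the query at the origin still picks up one distant spam neighbor, but its Gaussian weight is so small that the score stays close to 0.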
Experiments
Enron Dataset: 9150 Senders
To Get:
Legitimate Enron senders: email transactions within the Enron email domain
5000 generated spam accounts
120 senders from each class
Results Averaged over 100 Times
[Plots: Number of Nearest Neighbors; Feature Weights (CC); Feature Weights (CIA); Feature Weights (CR)]
Feature Weights
In/Out-Count & In/Out-Degree: The smaller the better
Final Weights




In/Out-count & In/Out-degree: 1
CR: 1
CIA: 10
CC: 15
Conclusion
Legitimacy Score: No content needed
Can Be Combined with Content-Based Filters
More Sophisticated Classifiers: SVM, boosting, etc.
Classifiers Using Combined Feature
Relaxed Online SVMs for
Spam Filtering
D. Sculley and G.M. Wachman
SIGIR 2007
Anti-Spam Controversy
Support Vector Machines (SVMs)
Academic Researchers: Statistically robust; State-of-the-art performance
Practitioners: Quadratic in the number of training examples; Impractical!
Solution: Relaxed Online SVMs
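Background: SVMs
Data Set = $\{(x_1, y_1), \ldots, (x_n, y_n)\}$
Class Label $y_i$: 1 for spam; -1 for ham
Classifier: $f(x) = \langle w, x \rangle + b$
To Find $w$ and $b$, Minimize: $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$
First term: maximizing the margin
Second term: minimizing the loss function
$C$: tradeoff parameter; $\xi_i$: slack variable
Constraints: $y_i f(x_i) \ge 1 - \xi_i$, $\xi_i \ge 0$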
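To make the tradeoff concrete, here is a small sketch that evaluates the standard soft-margin objective, 0.5 * ||w||^2 + C * (sum of slacks), for a linear classifier f(x) = <w, x> + b. It assumes that at the optimum each slack equals the hinge loss max(0, 1 - y * f(x)); the function name and the toy data are illustrative.

```python
def svm_objective(w, b, data, C):
    """Soft-margin SVM objective for a linear classifier.

    data: list of (feature_vector, label) with label +1 (spam) or -1 (ham).
    """
    margin_term = 0.5 * sum(wi * wi for wi in w)
    slack_total = 0.0
    for x, y in data:
        fx = sum(wi * xi for wi, xi in zip(w, x)) + b
        slack_total += max(0.0, 1.0 - y * fx)  # slack of this example
    return margin_term + C * slack_total

# Toy data: the third example lies inside the margin (slack 0.5).
data = [([2.0, 0.0], 1), ([-2.0, 0.0], -1), ([0.5, 0.0], 1)]
value = svm_objective([1.0, 0.0], 0.0, data, C=10.0)
```

With a large C the slack term dominates, which is the regime the later slides report as preferred for spam data.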
Online SVMs
Tuning the Tradeoff Parameter C

SpamAssassin data set: 6034 examples
Large C preferred
Email Spam and SVMs


TREC05P-1: 92189 Messages
TREC06P: 37822 messages
Blog Comment Spam and SVMs


Leave One Out Cross Validation
50 Blog Posts; 1024 Comments
Splogs and SVMs


Leave One Out Cross Validation
1380 Examples
Computational Cost

Online SVMs: Quadratic Training Time
Relaxed Online SVMs (ROSVM)

Objective Function of SVMs: $\frac{1}{2}\|w\|^2 + C \sum_i \xi_i$
Large C Preferred: Minimizing training error more important than maximizing the margin
ROSVM: Full margin maximization not necessary; relax this requirement
Three Ways to Relax SVMs (1)

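Only Optimize Over the Recent p Examples
Dual form of SVMs
Constraints: for every example older than the most recent p, the dual variable $\alpha_j$ is fixed to the last value found for $\alpha_j$ when $x_j$ was still among the recent p examples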
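The bookkeeping behind this relaxation can be sketched as a sliding window: only the dual variables of the most recent p examples stay open to optimization, and each older one is frozen at its last value. The class below is a hypothetical illustration of that bookkeeping only; it does not solve the dual problem.

```python
from collections import deque

class RecentWindow:
    """Track which dual variables are still optimized (the recent p)
    and which are frozen at their last value."""

    def __init__(self, p):
        self.p = p
        self.order = deque()  # example indices, oldest first
        self.active = {}      # index -> alpha, still optimized
        self.frozen = {}      # index -> alpha, held constant

    def add(self, index, alpha=0.0):
        self.order.append(index)
        self.active[index] = alpha
        if len(self.order) > self.p:
            old = self.order.popleft()
            # Freeze the last value found while `old` was in the window.
            self.frozen[old] = self.active.pop(old)

# Hypothetical run: pretend the optimizer set these alpha values.
alphas = [0.1, 0.2, 0.3, 0.4]
window = RecentWindow(p=2)
for i, a in enumerate(alphas):
    window.add(i)
    window.active[i] = a
```

After four examples with p = 2, examples 2 and 3 remain active and examples 0 and 1 are frozen.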
Three Ways to Relax SVMs (2)

Only Update on Actual Errors
Original online SVMs: Update when $y_i f(x_i) < 1$
ROSVM: Update when $y_i f(x_i) \le M$, with $0 \le M \le 1$
M = 0: mistake-driven online SVMs
No significant degradation in performance
Significantly reduces cost
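The two update rules can be compared with a small sketch. It assumes the usual margin-based conditions, y * f(x) < 1 for the original online SVM and y * f(x) <= M for the relaxed version; the slide's formulas were lost in extraction, so these conditions and the function names are assumptions, and the value M = 0.8 is only an illustrative setting.

```python
def online_svm_needs_update(y, fx):
    """Original online SVM: update whenever the example is misclassified
    or falls inside the margin."""
    return y * fx < 1.0

def rosvm_needs_update(y, fx, M):
    """Relaxed rule: update only when the margin is at most M (0 <= M <= 1).
    M = 0 gives a mistake-driven learner."""
    return y * fx <= M

# A correctly classified spam example (y = +1) with a small margin.
inside_margin = (1, 0.5)
# A misclassified spam example.
mistake = (1, -0.2)

always = online_svm_needs_update(*inside_margin)
relaxed = rosvm_needs_update(*inside_margin, M=0.8)
mistake_driven = rosvm_needs_update(*inside_margin, M=0.0)
on_error = rosvm_needs_update(*mistake, M=0.0)
```

Lowering M skips the re-optimization for examples that are already classified correctly, which is where the cost saving comes from.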
Three Ways to Relax SVMs (3)

Reduce the Number of Iterations in Iterative SVMs
SMO: repeated passes over the training set to minimize the objective function
Parameter T: the maximum number of iterations
T=1: little impact on performance
Testing Reduced Size
Testing Reduced Iterations
Testing Reduced Updates
Online SVMs and ROSVM

ROSVM:
Email Spam
Blog Comment Spam
Splog Data Set