slides - dimva 2013

Early Detection of Outgoing Spammers in
Large-Scale Service Provider Networks
Yehonatan Cohen
Daniel Gordon
Danny Hendler
Ben-Gurion University
Yehonatan Cohen, Daniel Gordon and Danny Hendler, DIMVA 2013
Talk outline
 Preliminaries
 ErDOS: An Early Detection Scheme for Outgoing Spam
 Evaluation
 Conclusions and Future Work
Preliminaries
 Spam
Unsolicited mail, typically
sent in large quantities
 Hazards
• Malware distribution
• Phishing
• Resource consumption
• Poor user experience
 Detection may be attempted when
• Mail is sent (outgoing spam detection)
• Mail is received (incoming spam detection)
Outgoing spam detection
 Spam can be blocked before leaving
the Email Service Provider (ESP)
 Advantages
• Reduces load on ESP infrastructure
• Prevents damage to ESP reputation
• Detection may be based on hosted accounts' activity
Outgoing spam filtering techniques
 Contents-based filtering: learns and identifies textual patterns typical of spam messages
• May be tricked by manipulating spam content
o Image-based spam
o Random string insertion (hash busters)
• Non-negligible false negative rate
Outgoing spam filtering techniques (cont'd)
 Inter-account communication patterns analysis:
• Models accounts' behaviour
• Based on inter-account social interactions
• Typically utilizes machine-learning techniques
• May leverage ESP account identification
Our goals
 Devise an effective detector
of outgoing spammers for large
ESPs (the ErDOS detector)
 Emphasis on early detection
• Detects spammers before the contents-based filter
 Short training periods
• Highly adaptive to changing spamming patterns
Most relevant related work
 Lam & Yeung, CEAS 2007
• Introduce “social-network”-based outgoing spam detection
• Use the k-NN classifier
• Relatively small dataset (ENRON)
• Labeling based on simulated spammer accounts
 Tseng & Chen, CSE 2009
• Use the same set of features
• Use an SVM classifier
• Larger, non-ESP dataset (University email server)
• Incremental model update
• Labeling based on pure accounts
• Account identification based on “from” header field
Comparison with data-sets of previous work
 Our data set was collected by a very large ESP
 Consists of incoming and outgoing log files
o 4 days of bi-directional data + 22 days of outgoing traffic only
 Both incoming and outgoing messages are labeled as spam/ham by a content-based detector

              Our data set      Our data set        NTU          Enron
              (in/out)          (outgoing)
#mails        9.86E7            2.13E8              2.86E6       5.17E5
#accounts     5.63E7            5.81E7              6.37E5       3.67E4
#edges        7.40E7            1.29E8              -            3.68E5
time period   4 days            26 days             10 days      3.5 years
contents      spam & ham        spam & ham          spam & ham   ham
Talk outline
 Preliminaries
 ErDOS: An Early Detection Scheme for Outgoing Spam
• Computation Flow
• Features
 Evaluation
 Conclusions and Future Work
The ErDOS detector: computation flow
• Pre-processing: compute account feature values based on a single day of email logs → feature values computed
• Determine accounts' classification → classified data set
• Undersampling: extract all spammers and an equal number of legitimate accounts as the training set → training set
• Build rotation forest model → classification model
• Assign account scores using the classification model (applied to the remainder of accounts not in the training set) → scored accounts
• Construct suspect accounts list of configurable size
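The undersampling, scoring, and suspects-list steps of this flow can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and a simple nearest-centroid scorer stands in for the rotation forest model, which the Python standard library does not provide.

```python
import random

def undersample(labels, seed=0):
    """Return ids of all spammers plus an equal number of random legitimate accounts."""
    rng = random.Random(seed)
    spammers = [a for a, is_spam in labels.items() if is_spam]
    legit = [a for a, is_spam in labels.items() if not is_spam]
    return spammers + rng.sample(legit, min(len(spammers), len(legit)))

def train_centroid_scorer(training_set):
    """Stand-in model (assumption, NOT a rotation forest): score is higher
    the closer a feature vector is to the spammer centroid."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    spam_c = centroid([f for f, is_spam in training_set if is_spam])
    legit_c = centroid([f for f, is_spam in training_set if not is_spam])
    def score(f):
        d_spam = sum((x - c) ** 2 for x, c in zip(f, spam_c))
        d_legit = sum((x - c) ** 2 for x, c in zip(f, legit_c))
        return d_legit - d_spam
    return score

def build_suspects(features, labels, list_size, train=train_centroid_scorer, seed=0):
    """Train on the undersampled set, score the remaining accounts,
    and return the list_size highest-scoring ones."""
    training_ids = undersample(labels, seed)
    scorer = train([(features[a], labels[a]) for a in training_ids])
    in_training = set(training_ids)
    remainder = [a for a in features if a not in in_training]
    ranked = sorted(remainder, key=lambda a: scorer(features[a]), reverse=True)
    return ranked[:list_size]
```

Because every account already classified as a spammer goes into the training set, the suspects list contains only accounts that have not yet been tagged, which is what makes early detection possible.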
Talk outline
 Preliminaries
 ErDOS: An Early Detection Scheme for Outgoing Spam
• Computation Flow
• Features
 Evaluation
 Conclusions and Future Work
ErDOS features: IOR
An account’s IOR = #incoming/#outgoing mails
Legitimate users
 Maintain social
interactions
 Often belong to
mailing lists
Spammers
 Sent messages are seldom replied to
A low IOR is characteristic of spammers
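As a minimal sketch of the IOR definition, assume the day's log is available as (sender, recipient) pairs; that log format and the function name are assumptions made for illustration. Accounts that sent no mail are skipped, since the ratio is undefined for them.

```python
from collections import Counter

def compute_ior(log):
    """Map each sending account to #incoming / #outgoing mails."""
    incoming = Counter()
    outgoing = Counter()
    for sender, recipient in log:
        outgoing[sender] += 1
        incoming[recipient] += 1
    # Only accounts with at least one outgoing mail appear in `outgoing`.
    return {acct: incoming[acct] / n_out for acct, n_out in outgoing.items()}

log = [
    ("spammer", "a"), ("spammer", "b"), ("spammer", "c"), ("spammer", "d"),
    ("a", "b"), ("b", "a"),
]
ior = compute_ior(log)
```

In this toy log the account that sends four messages and receives none has an IOR of 0, while the two accounts that exchange mail have an IOR of 2.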
ErDOS features: IOR (cont'd)
ErDOS features: IOR versus CR
 Communication Reciprocity (CR)
• Fraction of recipients who responded to an account's emails
• Defined by Gomes et al.
• IOR is superior for short training periods
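For comparison, CR can be sketched under the same assumed (sender, recipient) log format. The sketch treats any message from a recipient back to the account within the log as a response and ignores timing, which is a simplification rather than the exact definition of Gomes et al.

```python
from collections import defaultdict

def communication_reciprocity(log):
    """Map each sending account to the fraction of its recipients who wrote back."""
    recipients = defaultdict(set)
    for sender, recipient in log:
        recipients[sender].add(recipient)
    cr = {}
    for acct, rcpts in recipients.items():
        # A recipient counts as a responder if it sent any mail to acct.
        responders = {r for r in rcpts if acct in recipients.get(r, ())}
        cr[acct] = len(responders) / len(rcpts)
    return cr

log = [
    ("spammer", "a"), ("spammer", "b"), ("spammer", "c"), ("spammer", "d"),
    ("a", "b"), ("b", "a"),
]
cr = communication_reciprocity(log)
```

Unlike IOR, CR only credits mail that comes from an account's own recipients, so it needs enough observation time for replies to arrive.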
ErDOS features: IEBC
 IEBC (Internal/External Behaviour Consistency)
• An account can send/receive emails to/from
o Internal addresses (accounts hosted by ESP)
o External addresses
• Legitimate accounts show correlation between internal and external IOR, spammers less so
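The slides do not give a formula for IEBC, so the sketch below only shows the ingredient they name: an account's IOR computed separately over internal and external correspondents. The `is_internal` predicate, the example domains, and the function name are assumptions for illustration; how the two ratios are combined into a consistency score is left open.

```python
from collections import Counter

def split_ior(log, account, is_internal):
    """Return (internal IOR, external IOR) for one account.
    A ratio is None when the account sent no mail of that kind."""
    counts = Counter()
    for sender, recipient in log:
        if sender == account:
            counts["out_int" if is_internal(recipient) else "out_ext"] += 1
        if recipient == account:
            counts["in_int" if is_internal(sender) else "in_ext"] += 1

    def ratio(n_in, n_out):
        return n_in / n_out if n_out else None

    return (ratio(counts["in_int"], counts["out_int"]),
            ratio(counts["in_ext"], counts["out_ext"]))

def hosted_by_esp(addr):
    # Hypothetical rule: the ESP hosts every address under esp.example.
    return addr.endswith("@esp.example")

log = [
    ("u@esp.example", "v@esp.example"), ("v@esp.example", "u@esp.example"),
    ("u@esp.example", "x@other.example"), ("x@other.example", "u@esp.example"),
    ("x@other.example", "u@esp.example"),
    ("s@esp.example", "y@other.example"),
]
internal_ior, external_ior = split_ior(log, "u@esp.example", hosted_by_esp)
```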
ErDOS features: #outgoing messages
 Number of outgoing messages
• Spamming accounts send more emails than legitimate ones
• Insufficient for detecting low-volume spammers
ErDOS: Sender Accounts' Characteristics
 A large fraction of spammers' incoming mail is spam!
• Legitimate accounts seldom send emails to spamming accounts
• Dictionary attacks may cause spammers to spam each other
 Analyse senders' characteristics
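One sender-side characteristic mentioned above, the share of an account's incoming mail that is spam, can be sketched as follows. The (sender, recipient, is_spam) log format, with `is_spam` taken from the content-based detector's tag, is an assumption, as is the function name.

```python
def incoming_spam_fraction(log, account):
    """Fraction of the account's incoming messages tagged as spam,
    or None if the account received no mail."""
    tags = [is_spam for _sender, recipient, is_spam in log if recipient == account]
    return sum(tags) / len(tags) if tags else None

log = [
    ("s1", "s2", True), ("s3", "s2", True), ("a", "s2", False),
    ("a", "b", False),
]
fraction_s2 = incoming_spam_fraction(log, "s2")
```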
Talk outline
 Preliminaries
 ErDOS: An Early Detection Scheme for Outgoing Spam
 Evaluation
 Conclusions and Future Work
Accuracy for Single-Day training
 Evaluate accuracy attained for single-day logs
• Email accounts are classified based on the tags of the contents-based detector
• True Positive (TP) and False Positive (FP) values are averaged over the 4 available days of bidirectional data

       ErDOS    LY-knn*    MailNET**
TP     71       76.3       22.6
FP     8.9      47.8       44.2
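The per-day evaluation and averaging can be sketched as below, assuming TP and FP denote rates (detected spammers out of all spammers, and flagged legitimate accounts out of all legitimate accounts). The function names and the dictionary-based inputs are hypothetical.

```python
def tp_fp_rates(predicted, actual):
    """Return (TP rate, FP rate) for one day.
    Both arguments map account id -> bool (True means spammer)."""
    spammers = [a for a, is_spam in actual.items() if is_spam]
    legit = [a for a, is_spam in actual.items() if not is_spam]
    tp = sum(1 for a in spammers if predicted[a])
    fp = sum(1 for a in legit if predicted[a])
    return tp / len(spammers), fp / len(legit)

def average_rates(days):
    """Average (TP rate, FP rate) over a list of (predicted, actual) pairs."""
    rates = [tp_fp_rates(p, a) for p, a in days]
    return (sum(r[0] for r in rates) / len(rates),
            sum(r[1] for r in rates) / len(rates))

actual = {"a": True, "b": True, "c": False, "d": False}
day1 = ({"a": True, "b": False, "c": True, "d": False}, actual)
day2 = ({"a": True, "b": True, "c": False, "d": False}, actual)
avg_tp, avg_fp = average_rates([day1, day2])
```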
Early detection evaluation
 Spamming accounts detected before the contents-based detector
• Suspected by the detector, but send messages tagged as spam only on later days
• Evaluation uses all 26 days of data
 Early detection quality criteria:
• e-Precision: fraction of early detected accounts out of the suspects list
• Enrichment Factor (EF): ratio between the detector's e-Precision and that of a random accounts list
Early detection
 Early detection results, averaged over 4 days:

                    ErDOS's suspects    Entire population
#accounts           100                 100
Early detections    9                   0.53
e-Precision         0.09                0.0053

 Prior art's early detection results compared to ErDOS:

               ErDOS    LY-knn    MailNET
e-Precision    0.09     0.012     0.025
EF             16.9     2.3       4.7
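The two criteria can be sketched as follows. The function names and the `first_spam_day` mapping (account id to the first day on which the content-based detector tagged one of its messages as spam) are assumptions for illustration. Plugging in the figures above, 0.09 / 0.0053 is roughly 16.98, close to the reported EF of 16.9.

```python
def early_detected(suspects, first_spam_day, detection_day):
    """Suspects whose first spam-tagged message comes after detection_day."""
    return [a for a in suspects
            if a in first_spam_day and first_spam_day[a] > detection_day]

def e_precision(n_early, list_size):
    """Fraction of early detected accounts out of the suspects list."""
    return n_early / list_size

def enrichment_factor(detector_ep, random_ep):
    """Ratio between the detector's e-Precision and that of a random list."""
    return detector_ep / random_ep

# Toy example: only "a" starts spamming after day 1.
suspects = ["a", "b", "c", "d"]
early = early_detected(suspects, {"a": 3, "b": 1}, detection_day=1)
toy_ep = e_precision(len(early), len(suspects))

# Figures from the tables above.
ef_erdos = enrichment_factor(e_precision(9, 100), e_precision(0.53, 100))
```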
Early detection (cont’d)
 e-Precision for varying suspects list lengths:
Talk outline
 Preliminaries
 ErDOS: An Early Detection Scheme for Outgoing Spam
 Evaluation
 Conclusions and Future Work
Conclusions and Future Work
 Conclusions
• The case of outgoing spam detection for ESPs has its unique nature
• Contents-based filtering is not enough
• Early detection of spamming accounts can be achieved by a combination of a contents-based filter and a network-level-based detector
 Future Work
• Enhancement of ErDOS’s early detection performance by additional features
• A low-volume spammers expert detector, based on ErDOS’s computation flow and features