Detecting and Characterizing
Social Spam Campaigns
Yan Chen
Lab for Internet and Security Technology (LIST)
Northwestern Univ.
Detecting and Characterizing Social
Spam Campaigns: Roadmap
• Motivation & Goal
• Detection System Design
• Experimental Validation
• Malicious Activity Analysis
• Conclusions
2
Detecting and Characterizing Social
Spam Campaigns: Roadmap
• Motivation & Goal
• Detection System Design
• Experimental Validation
• Malicious Activity Analysis
• Conclusions
3
Motivation
• Online social networks (OSNs) are
exceptionally useful collaboration and
communication tools for millions of Internet
users.
– 400M active users for Facebook alone
– Facebook surpassed Google as the most
visited website
4
Motivation
• Unfortunately, the trusted communities in
OSNs could become highly effective
mechanisms for spreading miscreant
activities.
– Popular OSNs have recently become the
targets of phishing attacks
– Account credentials are already being sold
online in underground forums
5
Goal
• In this study, our goal is to:
– Design a systematic approach that can
effectively detect miscreant activities in
the wild in popular OSNs.
– Quantitatively analyze and characterize the
verified detection results to provide further
understanding of these attacks.
6
Detecting and Characterizing Social
Spam Campaigns: Roadmap
• Motivation & Goal
• Detection System Design
• Experimental Validation
• Malicious Activity Analysis
• Conclusions
7
Detection System Design
The system design, starting from raw data collection
and ending with accurate classification of malicious
wall posts and corresponding users.
8
Data Collection
• Based on “wall” messages crawled from
Facebook (crawling period: Apr. 09 ~ Jun.
09 and Sept. 09).
• Leveraging unauthenticated regional
networks, we recorded the crawled users’
profiles, friend lists, and interaction records
going back to January 1, 2008.
• 187M wall posts with 3.5M recipients are
used in this study.
9
Filter posts without URLs
• Assumption: All spam posts should
contain some form of URL, since the
attacker wants the recipient to go to some
destination on the web.
• Example (without URL):
Kevin! Lol u look so good tonight!!!
Filter out
10
Filter posts without URLs
• Assumption: All spam posts should
contain some form of URL, since the
attacker wants the recipient to go to some
destination on the web.
• Example (with URL):
Um maybe also this:
http://community.livejournal.com/lemonadepoem/54654.html
Guess who your secret admirer is??
Go here nevasubevd\t. blogs pot\t.\tco\tm (take out spaces)
Further process
11
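A minimal sketch of this filtering step; the regular expressions below are illustrative assumptions, not the exact patterns used in the study:

```python
import re

# Plain or hyperlinked URLs, e.g. "http://2url.org/?67592" or "mynewcrsh.com".
URL_RE = re.compile(r'https?://\S+|\b[\w-]+(?:\.[\w-]+)+\b', re.IGNORECASE)

# Obfuscated URLs, e.g. "nevasubevd . blogs pot . co m (take out spaces)".
OBFUSCATED_RE = re.compile(
    r'\b\w+\s+dot\s+\w+\b'                      # "1lovecrush dot com"
    r'|\w\s+\.\s*\w'                            # whitespace before a dot
    r'|(?:take out|remove)\s*(?:the\s*)?spaces',
    re.IGNORECASE)

def contains_url(wall_post: str) -> bool:
    """Keep a post for further processing only if it carries some form of URL."""
    return bool(URL_RE.search(wall_post) or OBFUSCATED_RE.search(wall_post))

posts = [
    "Kevin! Lol u look so good tonight!!!",                         # filtered out
    "Go here nevasubevd\t. blogs pot\t.\tco\tm (take out spaces)",  # kept
]
kept = [p for p in posts if contains_url(p)]
```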
Build Post Similarity Graph
• After filtering wall posts without URLs, we
build the post similarity graph on the
remaining ones.
– A node: a remaining wall post
– An edge: connects two wall posts that are
“similar” and thus likely generated from the
same spam campaign
12
Wall Post Similarity Metric
• Two wall posts are “similar” if:
– They share similar descriptions, or
– They share the same URL.
• Example (similar descriptions):
Guess who your secret admirer is??
Go here nevasubevd\t. blogs pot\t.\tco\tm (take out spaces)
Guess who your secret admirer is??
Visit: \tyes-crush\t.\tcom\t (remove\tspaces)
Establish an edge!
13
Wall Post Similarity Metric
• Two wall posts are “similar” if:
– They share similar descriptions, or
– They share the same URL.
• Example (same URL):
secret admirer revealed.
goto yourlovecalc\t.\tcom (remove the spaces)
hey see your love compatibility !
go here yourlovecalc\t.\tcom (remove\tspaces)
Establish an edge!
14
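A minimal sketch of the similarity test and edge construction. Using difflib’s ratio with a 0.8 threshold as the description-similarity measure is an assumption for illustration, and a real deployment over millions of posts would need something more scalable than this pairwise loop:

```python
import re
from difflib import SequenceMatcher
from itertools import combinations

URL_RE = re.compile(r'https?://\S+|\b[\w-]+(?:\.[\w-]+)+\b', re.IGNORECASE)

def shared_url(post_a: str, post_b: str) -> bool:
    """Edge condition 2: the two posts contain the same URL."""
    urls_a = {u.lower() for u in URL_RE.findall(post_a)}
    urls_b = {u.lower() for u in URL_RE.findall(post_b)}
    return bool(urls_a & urls_b)

def similar_description(post_a: str, post_b: str, threshold: float = 0.8) -> bool:
    """Edge condition 1: the textual descriptions (URLs stripped) are close."""
    desc_a = URL_RE.sub('', post_a).lower()
    desc_b = URL_RE.sub('', post_b).lower()
    return SequenceMatcher(None, desc_a, desc_b).ratio() >= threshold

def build_similarity_graph(posts):
    """Adjacency list over post indices; an edge means 'likely the same campaign'."""
    graph = {i: set() for i in range(len(posts))}
    for i, j in combinations(range(len(posts)), 2):
        if shared_url(posts[i], posts[j]) or similar_description(posts[i], posts[j]):
            graph[i].add(j)
            graph[j].add(i)
    return graph
```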
Extract Wall Post Clusters
• Intuition:
– If A and B are generated from the same spam
campaign while B and C are generated from
the same spam campaign, then A, B and C
are all generated from the same spam
campaign.
• We reduce the problem of extracting wall
post clusters to identifying connected
subgraphs inside the post similarity graph.
15
Extract Wall Post Clusters
A sample wall post similarity graph and the
corresponding clustering process (for illustrative
purposes only)
16
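Extracting clusters then amounts to finding the connected components of the similarity graph; a minimal BFS-based sketch over an adjacency-list representation (like the one built in the previous sketch):

```python
from collections import deque

def extract_clusters(graph: dict) -> list:
    """Return connected components of the post similarity graph as lists of node ids.

    `graph` maps each wall-post id to the set of ids it shares an edge with.
    """
    visited = set()
    clusters = []
    for start in graph:
        if start in visited:
            continue
        # Breadth-first search from an unvisited node collects one cluster.
        component, queue = [], deque([start])
        visited.add(start)
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in graph[node]:
                if neighbor not in visited:
                    visited.add(neighbor)
                    queue.append(neighbor)
        clusters.append(component)
    return clusters

# Example: posts 0-2 form one campaign cluster, post 3 is isolated.
example_graph = {0: {1}, 1: {0, 2}, 2: {1}, 3: set()}
print(extract_clusters(example_graph))  # [[0, 1, 2], [3]]
```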
Identify Malicious Clusters
• The following heuristics are used to
distinguish malicious clusters (spam
campaigns) from benign ones:
– Distributed property: the cluster is posted by
at least n distinct users.
– Bursty property: the median interval between two
consecutive wall posts is less than t.
17
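A minimal sketch of these two checks, assuming each cluster is given as (from_user, unix timestamp) pairs:

```python
from statistics import median

def is_malicious_cluster(cluster, n=6, t_seconds=3 * 3600):
    """cluster: list of (from_user, unix_timestamp) tuples for one wall-post cluster.

    Distributed property: posted by at least n distinct users.
    Bursty property: the median gap between consecutive posts is below t.
    """
    distinct_users = {user for user, _ in cluster}
    if len(distinct_users) < n:
        return False
    times = sorted(ts for _, ts in cluster)
    gaps = [b - a for a, b in zip(times, times[1:])]
    return bool(gaps) and median(gaps) <= t_seconds
```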
Identify Malicious Clusters
[Flow chart: each wall post cluster is tested with “from_user >= n && interval <= t”; clusters that pass the test are labeled malicious clusters, the rest benign clusters.]
A sample process of distinguishing malicious
clusters from benign ones (for illustrative
purposes only)
18
Identify Malicious Clusters
• (6, 3 hr) is found to be a good (n, t) value
by testing TP:FP rates near the borderline.
• Slightly modifying the value has only a
minor impact on the detection result.
• Sensitivity test: we vary the threshold
– from (6, 3 hr) to (4, 6 hr)
– This only results in a 4% increase in the
classified malicious clusters.
19
Detecting and Characterizing Social
Spam Campaigns: Roadmap
• Motivation & Goal
• Detection System Design
• Experimental Validation
• Malicious Activity Analysis
• Conclusions
20
Experimental Validation
• The validation is focused on detected
URLs.
• A rigorous set of approaches is adopted to
confirm the maliciousness of the detection
results.
• Any URL that cannot be confirmed by any
approach is assumed to be “benign”
(i.e., counted as a false positive).
21
Experimental Validation
• Step 1: Obfuscated URL
– URLs embedded with obfuscation are
malicious, since benign users have no
incentive to obfuscate their links.
– Detecting obfuscated URLs, e.g.,
• Replacing ‘.’ with “dot”, e.g., 1lovecrush dot com
• Inserting white spaces, e.g., abbykywyty\t. blogs
pot\t.\tco\tm, etc.
• A complete list of such obfuscation patterns is
available from anti-spam research
22
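A minimal sketch of Step 1; the patterns below cover the obfuscation styles listed above and are illustrative, not the complete list from anti-spam research:

```python
import re

# Illustrative obfuscation patterns (the study uses a complete list from anti-spam work).
OBFUSCATION_PATTERNS = [
    re.compile(r'\b\w+\s+dot\s+\w+\b', re.IGNORECASE),         # "1lovecrush dot com"
    re.compile(r'\w\s+\.\s*\w'),                                # space before a dot: "abbykywyty . blogs"
    re.compile(r'\.\s*\w{1,3}\s+\w{1,3}\b'),                    # split TLD: ". co m"
    re.compile(r'(?:take out|remove)\s*(?:the\s*)?spaces', re.IGNORECASE),
]

def is_obfuscated(post_text: str) -> bool:
    """Step 1: a URL embedded with obfuscation is treated as malicious outright,
    since benign users have no incentive to disguise their links."""
    return any(p.search(post_text) for p in OBFUSCATION_PATTERNS)
```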
Experimental Validation
• Step 2: Third-party tools
– Multiple tools are used, including:
• McAfee SiteAdvisor
• Google’s Safe Browsing API
• URL blacklist (SURBL, URIBL, Spamhaus,
SquidGuard)
• Wepawet, drive-by-download checking
– Any URL that is classified as “malicious” by at
least one of these tools is confirmed as
malicious
23
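A sketch of the “flagged by at least one tool” rule. The SURBL check uses its public DNS lookup zone (multi.surbl.org); the other checkers are left as placeholders, since each tool has its own API and terms of use:

```python
import socket

def listed_in_surbl(domain: str) -> bool:
    """URL blacklist check: a domain is listed in SURBL if
    <domain>.multi.surbl.org resolves (to a 127.0.0.x address)."""
    try:
        socket.gethostbyname(f"{domain}.multi.surbl.org")
        return True
    except socket.gaierror:
        return False

def check_siteadvisor(url: str) -> bool:
    raise NotImplementedError  # placeholder for McAfee SiteAdvisor

def check_safe_browsing(url: str) -> bool:
    raise NotImplementedError  # placeholder for Google's Safe Browsing API

def check_wepawet(url: str) -> bool:
    raise NotImplementedError  # placeholder for Wepawet drive-by-download checking

def confirmed_by_third_party(url: str, domain: str) -> bool:
    """A URL is confirmed as malicious if at least one tool flags it."""
    checks = [
        lambda: listed_in_surbl(domain),
        lambda: check_siteadvisor(url),
        lambda: check_safe_browsing(url),
        lambda: check_wepawet(url),
    ]
    for check in checks:
        try:
            if check():
                return True
        except NotImplementedError:
            continue
    return False
```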
Experimental Validation
• Step 3: Redirection analysis
– Any URL that redirects to a confirmed malicious URL
is also considered “malicious”.
• Step 4: Wall post keyword search
– If the wall post contains typical spam keywords, like
“viagra”, “enlarger pill”, “legal bud”, etc., the contained
URL is considered “malicious”.
– Human assistance is involved in acquiring such
keywords
24
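A sketch of Steps 3 and 4; the keyword set is just the sample quoted above, and the `requests` library merely stands in for whatever HTTP client was actually used to follow redirect chains:

```python
import requests

SPAM_KEYWORDS = {"viagra", "enlarger pill", "legal bud"}  # sample keywords from above

def redirects_to_malicious(url: str, confirmed_malicious: set) -> bool:
    """Step 3: a URL whose redirect chain reaches a confirmed malicious URL
    is itself considered malicious."""
    try:
        resp = requests.head(url, allow_redirects=True, timeout=10)
    except requests.RequestException:
        return False
    chain = [r.url for r in resp.history] + [resp.url]
    return any(u in confirmed_malicious for u in chain)

def contains_spam_keyword(wall_post: str) -> bool:
    """Step 4: a post containing typical spam keywords marks its URL as malicious."""
    text = wall_post.lower()
    return any(keyword in text for keyword in SPAM_KEYWORDS)
```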
Experimental Validation
• Step 5: URL grouping
– Some groups of URLs exhibit highly uniform features.
If part of a group has already been confirmed as
“malicious”, the rest of the group is also considered
“malicious”.
– Human assistance is involved in identifying such
groups.
• Step 6: Manual analysis
– We leverage the Google search engine to confirm the
maliciousness of URLs that appear many times in our trace.
25
Experimental Validation
The validation results. Each row gives the number of
confirmed URLs and wall posts in a given step.
The total # of wall posts after filtering is ~2M out of 187M.
26
Detecting and Characterizing Social
Spam Campaigns: Roadmap
• Motivation & Goal
• Detection System Design
• Experimental Validation
• Malicious Activity Analysis
• Conclusions
27
Usage summary of 3 URL Formats
• 3 different URL formats (with examples):
– Link: <a href=“...”>http://2url.org/?67592</a>
– Plain text: mynewcrsh.com
– Obfuscated: nevasubevu\t. blogs pot\t.\tco\tm
28
Usage summary of 4 Domain Types
• 4 different domain types (with examples):
– Content sharing service: imageshack.us
– URL shortening service: tinyurl.org
– Blog service: blogspot.com
– Other: yes-crush.com
29
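A sketch of how each URL could be bucketed into these formats and domain types; the service lists are tiny illustrative samples taken from the slides, not the full lists used in the study:

```python
import re

# Small samples from the slides; the study's lists are much longer.
CONTENT_SHARING = {"imageshack.us"}
URL_SHORTENING = {"tinyurl.org", "2url.org"}
BLOG_SERVICES = {"blogspot.com", "livejournal.com"}

def classify_url_format(post_text: str) -> str:
    """Bucket a wall post's URL into link / obfuscated / plain text."""
    if re.search(r'<a\s+href=', post_text, re.IGNORECASE):
        return "link"
    if re.search(r'(?:take out|remove)\s*(?:the\s*)?spaces|\w\s+\.\s*\w',
                 post_text, re.IGNORECASE):
        return "obfuscated"
    return "plain text"

def classify_domain_type(domain: str) -> str:
    """Bucket a domain into the four domain types above."""
    # Reduce e.g. "img123.imageshack.us" to its registered domain "imageshack.us".
    registered = ".".join(domain.lower().split(".")[-2:])
    if registered in CONTENT_SHARING:
        return "content sharing service"
    if registered in URL_SHORTENING:
        return "URL shortening service"
    if registered in BLOG_SERVICES:
        return "blog service"
    return "other"
```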
Spam Campaign Identification
30
Spam Campaign Temporal Correlation
31
Attack Categorization
• The attacks are categorized by purpose.
• Narcotics, pharma, and luxury refer to selling the
corresponding products.
32
User Interaction Degree
• Malicious accounts exhibit a higher interaction
degree than benign ones.
33
User Active Time
• Active time is measured as the time between the first and last
observed wall posts made by the user.
• Malicious accounts exhibit much shorter active times compared to
benign ones.
34
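A minimal sketch of the active-time computation, assuming wall posts are (user id, unix timestamp) pairs:

```python
def active_time_per_user(posts):
    """posts: iterable of (user_id, unix_timestamp).
    Active time = last observed wall post minus first observed wall post, per user."""
    first_last = {}
    for user, ts in posts:
        lo, hi = first_last.get(user, (ts, ts))
        first_last[user] = (min(lo, ts), max(hi, ts))
    return {user: hi - lo for user, (lo, hi) in first_last.items()}
```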
Wall Post Hourly Distribution
• The hourly distribution of benign posts is
consistent with the human diurnal pattern,
while that of malicious posts is not.
35
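A minimal sketch of the hourly histogram (timezone handling is simplified to UTC; the slide's comparison is presumably this distribution computed separately for benign and malicious posts):

```python
from collections import Counter
from datetime import datetime, timezone

def hourly_distribution(timestamps):
    """Count wall posts per hour of day (0-23) from unix timestamps."""
    counts = Counter(datetime.fromtimestamp(ts, tz=timezone.utc).hour
                     for ts in timestamps)
    return [counts.get(h, 0) for h in range(24)]
```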
Detecting and Characterizing Social
Spam Campaigns: Roadmap
• Motivation & Goal
• Detection System Design
• Experimental Validation
• Malicious Activity Analysis
• Conclusions
36
Conclusions
• We design automated techniques to detect
coordinated spam campaigns on Facebook.
• Based on the detection results, we conduct an
in-depth analysis of the malicious activities and
make interesting discoveries, including:
– Over 70% of the attacks are phishing attacks.
– Malicious posts do not exhibit human diurnal patterns.
– etc.
37
Thank you!
38
Extract Wall Post Clusters
The algorithm for wall post clustering. The details of
breadth-first search (BFS) are omitted.
39