Don’t Follow me : Spam Detection in Twitter
Alex Hai Wang
The Pennsylvania State University
International Conference on Security and Cryptography, 2010
January 12, 2011
In-seok An
SNU Internet Database Lab.
Outline
 Introduction
 Social Graph model
 Features
 Data Set
 Spam Detection
 Experiments
 Evaluation
 Conclusion
2 / 37
Introduction
 Social Network Service (SNS)
– An online service, platform, or site that focuses on building and reflecting
social networks or social relations among people
– Among the most popular applications of Web 2.0
 Twitter
– Founded in 2006
– One of the fastest growing SNSs
 Grew by more than 2,800% in 2009
– Social networking site and microblogging service
3 / 37
Introduction
 Twitter
You can post your latest updates
Messages (tweets) from the Twitter users
you are following (subscribed to)
4 / 37
Introduction
 Spammers on Twitter
– The goal of Twitter
 Allow friends to communicate and stay connected through the exchange of
short messages
– Spammers also use Twitter as a tool to post malicious links
– More than 3% of messages on Twitter are spam (Analytics, 2009)
– The offensive trending topic attack on February 20 (CNET, 2009)
5 / 37
Introduction
 Methods to report spam
– Click the “report as spam” link
– Post a tweet in the format “@spam @username”
 This reporting service is itself abused for both hoaxes and spam
 Legitimate users can be mistakenly suspended by Twitter’s anti-spam actions
6 / 37
Social Graph model
 Twitter can be modeled as a directed graph
– G = (V, A)
– V : a set of nodes (vertices)
– A : a set of arcs (edges)
 Four types of relationships on Twitter can be defined
– Follower
 Node v_j is a follower of node v_i if the arc a = (j, i) is contained in A
– Friend
 Node v_j is a friend of node v_i if the arc a = (i, j) is contained in A
– Mutual Friend
 Node v_i and node v_j are mutual friends if both arcs a = (i, j) and
a = (j, i) are contained in A
– Stranger
 Node v_i and node v_j are strangers if neither arc a = (i, j) nor
a = (j, i) is contained in A
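The four definitions above can be sketched as a small lookup over the arc set. This is only an illustration: the function name `relationship` and the three-user graph are assumptions, and an arc (x, y) is read as "x follows y", as in the follower definition.

```python
def relationship(arcs, i, j):
    """Return the relationship of node j as seen from node i.

    An arc (x, y) in `arcs` means x follows y.
    """
    i_follows_j = (i, j) in arcs
    j_follows_i = (j, i) in arcs
    if i_follows_j and j_follows_i:
        return "mutual friend"
    if j_follows_i:
        return "follower"   # arc (j, i) is in A
    if i_follows_j:
        return "friend"     # arc (i, j) is in A
    return "stranger"       # neither arc is in A


# Illustrative graph: A follows B, and B and C follow each other.
arcs = {("A", "B"), ("B", "C"), ("C", "B")}
```

Storing the arcs in a set keeps each membership check constant-time, which matters for a graph with tens of millions of relationships.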
8 / 37
Social Graph model
 A simple Twitter graph
A follows B: A is a follower of B, and B is a friend of A
B follows C and C follows B: B and C are mutual friends
A doesn’t follow C and C doesn’t follow A: A and C are strangers
9 / 37
Social Graph model
 Twitter Social Graph
10 / 37
Features
Graph-based features
 Twitter’s spam and abuse policy
– “if you have a small number of followers compared to the amount of
people you are following, it may be considered as a spam account”
 Three features
– The number of friends
 The outdegree d_O(v_i) of a node v_i
– The number of followers
 The indegree d_I(v_i) of a node v_i
– The reputation of a user
$R(v_i) = \frac{d_I(v_i)}{d_I(v_i) + d_O(v_i)}$
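A minimal sketch of the reputation feature follows. It assumes the indegree counts followers and the outdegree counts friends, which matches the follower definition in the social graph model (arc (j, i) points into node i) and Twitter's policy quoted above. The function name and the zero-degree fallback are assumptions.

```python
def reputation(followers, friends):
    """R = d_I / (d_I + d_O), with d_I = followers and d_O = friends (assumption)."""
    total = followers + friends
    if total == 0:
        return 0.0  # assumption: an account with no links gets reputation 0
    return followers / total
```

Under this reading, an account that follows 990 users but has only 10 followers gets a reputation of 0.01, the pattern the policy describes as suspicious.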
12 / 37
Features
Content-based features
 Duplicate Tweets
– An account may be considered spam if it posts duplicate content
– Detected by measuring the Levenshtein distance (edit distance)
 Minimum cost of transforming one string into another through a sequence of
edit operations (deletion, insertion and substitution of individual symbols)
 Clean the data by removing the words containing “@”, “#”, “http://” and “www.”
– The number of duplicate tweets can be used as a measurement
 In the user’s 20 most recent tweets
 Two tweets are considered duplicates only when they are exactly the same
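The duplicate-tweet measurement can be sketched as below. This is an illustration under assumptions, not the author's implementation: the function names are hypothetical, `tweets` is assumed to be ordered most recent first, and "exactly the same" is read as an edit distance of 0 after cleaning.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance with unit-cost deletion, insertion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution (or match)
        prev = curr
    return prev[-1]


def clean(tweet):
    """Drop every word containing "@", "#", "http://" or "www."."""
    markers = ("@", "#", "http://", "www.")
    return " ".join(w for w in tweet.split() if not any(m in w for m in markers))


def count_duplicate_pairs(tweets, recent=20):
    """Count pairs among the most recent tweets that are identical after cleaning.

    Assumes `tweets` is ordered most recent first.
    """
    cleaned = [clean(t) for t in tweets[:recent]]
    return sum(1 for a, b in combinations(cleaned, 2) if levenshtein(a, b) == 0)
```

Cleaning first matters because spam tweets often differ only in the target @username or the shortened link, so the raw strings would not compare as equal.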
13 / 37
Features
Content-based features
 Need for cleaning
14 / 37
Features
Content-based features
 HTTP Links
– An account is considered spam if its updates consist mainly of links and not
personal updates
– Twitter filters out the URLs linked to known malicious sites
 URL shortening services like bit.ly give attackers an opportunity to spam
– The number of tweets containing HTTP links can be used as a measurement
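Counting tweets that contain links could look like the following sketch. The regular expression is an assumption (it also accepts `https://` and bare `www.` links), as are the function name and the most-recent-first ordering of `tweets`.

```python
import re

# Assumption: a link is any token starting with http://, https:// or www.
LINK_PATTERN = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)


def count_tweets_with_links(tweets, recent=20):
    """Number of tweets, among the most recent ones, that contain at least one link."""
    return sum(1 for t in tweets[:recent] if LINK_PATTERN.search(t))
```

The count is per tweet rather than per link, so a tweet with several URLs still contributes 1.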
[Figure: a tweet with an HTTP link shortened by a URL shortening service (http://bit.ly/ab3cd) leads to a malicious site (http://porno.com)]
15 / 37
Features
Content-based features
 Replies and Mentions
– You can send a reply message to another user
 @username + message
– You can also mention another @username anywhere in the tweet
 Message + @username + message
– Twitter automatically collects all tweets containing your username
– You can reply to anyone, whether or not they are your friends or followers
– Spammers abuse this feature
– The number of tweets containing a mention or reply can be used as a
measurement
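A rough sketch of this measurement is shown below. The pattern for a username and the helper names are assumptions; the lookbehind is there so that e-mail addresses such as `a@b.com` are not counted as mentions.

```python
import re

# Assumption: a username is "@" followed by word characters, not preceded by one.
MENTION_PATTERN = re.compile(r"(?<!\w)@\w+")


def is_reply(tweet):
    """A reply starts with @username; a mention has it anywhere else in the tweet."""
    return MENTION_PATTERN.match(tweet.lstrip()) is not None


def count_tweets_with_reply_or_mention(tweets, recent=20):
    """Number of tweets, among the most recent ones, containing a reply or mention."""
    return sum(1 for t in tweets[:recent] if MENTION_PATTERN.search(t))
```

With a window of 20 recent tweets, the count tops out at 20.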
16 / 37
Features
Content-based features
 Spam tweets using mention or reply
17 / 37
Features
Content-based features
 Trending topic
– The most-mentioned terms on Twitter at that moment, in that week, or in that month
– Users can add a hashtag to a tweet
 #tagname
– If there are many tweets containing the same term,
 It may become a trending topic
– Twitter considers an account as spam
 If it posts multiple unrelated updates to a topic using the # symbol
18 / 37
Data Set
 Data Set
– 3 weeks from January 3 to January 24, 2010
– 25,847 users
– 500k tweets
– 49M follower/friend relationships
20 / 37
Spam Detection
 Several classification algorithms
– Decision tree
– Neural network
– Support vector machines
– k-nearest neighbors
– Naïve Bayesian
 Naïve Bayesian outperforms all the other methods
– The Bayesian classifier is robust to noise
 It uses posterior probability
– A spam probability is calculated for each individual user based on its behaviors,
instead of giving a general rule
22 / 37
Spam Detection
 Naïve Bayesian classifier
$P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}$
– X : each Twitter account is considered as a vector X with feature values
– Y : one of two classes, spam and non-spam
– The features are assumed to be conditionally independent given the class
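The classifier can be illustrated with a small from-scratch sketch. The slides do not say how P(X | Y) is modeled for each feature, so the Gaussian likelihood here is an assumption, as are the class name, the variance smoothing constant, and any feature values used to try it out.

```python
import math
from collections import defaultdict


class GaussianNaiveBayes:
    """Naive Bayes with one Gaussian per (class, feature); the Gaussian is an assumption."""

    def fit(self, X, y):
        by_class = defaultdict(list)
        for row, label in zip(X, y):
            by_class[label].append(row)
        self.priors, self.stats = {}, {}
        for label, rows in by_class.items():
            self.priors[label] = len(rows) / len(X)          # P(Y)
            self.stats[label] = []
            for column in zip(*rows):
                mean = sum(column) / len(column)
                var = sum((v - mean) ** 2 for v in column) / len(column)
                self.stats[label].append((mean, var + 1e-9))  # avoid zero variance
        return self

    def _log_joint(self, x, label):
        # log P(Y) + sum over features of log P(x_k | Y): conditional independence
        total = math.log(self.priors[label])
        for value, (mean, var) in zip(x, self.stats[label]):
            total += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        return total

    def predict_proba(self, x):
        """Posterior P(Y | X = x) per class; P(X) is the normalizing sum."""
        logs = {label: self._log_joint(x, label) for label in self.priors}
        peak = max(logs.values())
        weights = {label: math.exp(v - peak) for label, v in logs.items()}
        norm = sum(weights.values())
        return {label: w / norm for label, w in weights.items()}

    def predict(self, x):
        posterior = self.predict_proba(x)
        return max(posterior, key=posterior.get)
```

Working in log space and subtracting the largest value before exponentiating keeps the posterior numerically stable when one class is far more likely than the other.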
23 / 37
Experiments
 To evaluate the detection method
– 500 Twitter user accounts are manually labeled into two classes (spam or non-spam)
 By reading the 20 most recent tweets
 By checking the friends and followers of the user
– Results show that around 1% of the accounts in the data set are spam
 Additional spam data are added to the data set
 To simulate reality and avoid bias in the crawling and labeling methods
– The study in Analytics, 2009, shows that there is 3% spam on Twitter
 Search for “@spam” on Twitter and collect additional spam data
– Only a small number of the results report real spam
– The data set is mixed to contain around 3% spam data
25 / 37
Experiments
 Graph-based features
– The number of friends for each Twitter account
– Only 30% of spam accounts follow a large number of users
 Spammers do not need to follow other users
26 / 37
Experiments
 Graph-based features
– The number of followers for each Twitter account
– Spam accounts usually do not have a large number of followers
 Some spam accounts have a relatively large number of followers
27 / 37
Experiments
 Graph-based features
– The reputation for each Twitter account
– The reputation of most legitimate users is between 30% and 90%
 Some spam accounts have a 100% reputation
28 / 37
Experiments
 Content-based Features
– The number of pairwise duplication
– Not all spam accounts post multiple duplicate tweets
 We cannot depend on this feature alone
29 / 37
Experiments
 Content-based Features
– The number of mentions and replies
– Most spam accounts have the maximum of 20 “@” symbols
 This lures legitimate users into reading their spam messages or clicking their links
30 / 37
Experiments
 Content-based Features
– The number of links
– Some legitimate users also include links in all their tweets; some companies join
Twitter to promote their own web sites
31 / 37
Experiments
 Content-based Features
– The number of hashtag signs
32 / 37
Evaluation
 The evaluation of the overall process
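– Confusion matrix (a: spam classified as spam, b: spam classified as non-spam,
c: non-spam classified as spam, d: non-spam classified as non-spam)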
– Precision : P = a / ( a + c )
– Recall : R = a / ( a + b )
– F-measure : F = 2PR / ( P + R )
 Each classifier is trained 10 times (10-fold cross-validation)
– Each time using the 9 out of the 10 partitions as training data
– Computing the confusion matrix using the tenth partition as test data
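The metrics and the 10-partition procedure can be sketched as follows. Here a, b and c are read as spam classified as spam, spam classified as non-spam, and non-spam classified as spam, which is the reading the precision and recall formulas imply. The function names and the interleaved way of forming the partitions are assumptions.

```python
def precision_recall_f(a, b, c):
    """P = a / (a + c), R = a / (a + b), F = 2PR / (P + R)."""
    precision = a / (a + c) if a + c else 0.0
    recall = a / (a + b) if a + b else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def k_fold_splits(items, k=10):
    """Yield (train, test) pairs; each partition is used once as the test data."""
    folds = [items[i::k] for i in range(k)]  # assumption: simple interleaved partitions
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]
```

In use, a confusion matrix would be computed on each test partition, and the counts summed or the metrics averaged across the 10 runs.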
34 / 37
Evaluation
 The evaluation results
– Naïve Bayesian classifier has the best overall performance
 Finally, the Bayesian classifier learned from the labeled data is
applied to the entire data set
– Information about 25,817 users in total
– Precision of the spam detection system
 392 users are classified as spam
 348 users are real spam accounts and 44 users are false alarms
 89% precision
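The reported figure follows from the two counts above, taking the 348 real spam accounts as a and the 44 false alarms as c in P = a / ( a + c ):

```latex
P = \frac{a}{a + c} = \frac{348}{348 + 44} = \frac{348}{392} \approx 0.888 \approx 89\%
```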
35 / 37
Conclusion
 Spam behavior in a popular online SNS, Twitter, is studied
– To formalize the problem, a social graph model is proposed
 Novel content-based and graph-based features are proposed
– Graph-based features
 The number of friends
 The number of followers
 The reputation of the user
– Content-based features
 The number of pairwise duplications
 The number of mentions and replies
 The number of links
 The number of hashtags
 The data set is analyzed and the performance of the detection system is
evaluated
36 / 37
Conclusion
 Among the graph-based features
– The proposed reputation feature has the best performance
– Not many spam accounts follow a large number of users
– Some spammers have many followers
 For the content-based features
– Most spam accounts have multiple duplicate tweets
– But not all spam accounts post multiple duplicate tweets
 We cannot rely on this feature alone
 Several popular classification algorithms are studied and
evaluated
 The naïve Bayesian classifier achieves 89% precision
37 / 37