Don`t_Follow_me - Internet Database Lab.

Don't Follow Me: Spam Detection in Twitter
Alex Hai Wang, The Pennsylvania State University
International Conference on Security and Cryptography, 2010

January 12, 2011
In-seok An, SNU Internet Database Lab.

Outline

• Introduction
• Social Graph Model
• Features
  – Graph-based features
  – Content-based features
• Data Set
• Spam Detection
• Experiments
• Evaluation
• Conclusion

Introduction

• Social Network Service (SNS)
  – An online service, platform, or site that focuses on building and reflecting social networks or social relations among people
  – Among the most popular applications of Web 2.0
• Twitter
  – Founded in 2006
  – One of the fastest-growing SNSs
      • Grew by more than 2,800% in 2009
  – A social networking site and microblogging service
  – You can post your latest updates
  – You see messages (tweets) from the Twitter users you are following
• Spammers on Twitter
  – The goal of Twitter
      • Allow friends to communicate and stay connected through the exchange of short messages
  – Spammers also use Twitter as a tool to post malicious links
  – More than 3% of messages on Twitter are spam (Analytics, 2009)
  – The offensive trending-topic attack on February 20 (CNET, 2009)
• Methods for reporting spam
  – Clicking the "report as spam" link
  – Posting a tweet of the form "@spam @username"
• This reporting service is itself abused by both hoaxes and spam
• Legitimate users can be mistakenly suspended by Twitter's anti-spam action

Social Graph Model

• Twitter can be modeled as a directed graph G = (V, A)
  – V: a set of nodes (vertices)
  – A: a set of arcs (edges)
• Four types of relationships on Twitter can be defined
  – Follower: node v_j is a follower of node v_i if the arc a = (j, i) is contained in A
  – Friend: node v_j is a friend of node v_i if the arc a = (i, j) is contained in A
  – Mutual friend: nodes v_i and v_j are mutual friends if both arcs a = (i, j) and a = (j, i) are contained in A
  – Stranger: nodes v_i and v_j are strangers if neither arc a = (i, j) nor arc a = (j, i) is contained in A
• A simple Twitter graph
  – A follows B; B follows C; C follows B
  – A is a follower of B, and B is a friend of A
  – B and C are mutual friends
  – A does not follow C and C does not follow A, so A and C are strangers
• [Figure: Twitter social graph]

Features: Graph-based features

• Twitter's spam and abuse policy
  – "if you have a small number of followers compared to the amount of people you are following, it may be considered as a spam account"
• Three features
  – The number of friends: the outdegree d_O(v_i) of node v_i (arcs leaving v_i, that is, the accounts v_i follows)
  – The number of followers: the indegree d_I(v_i) of node v_i (arcs entering v_i)
  – The reputation of a user: R(v_i) = d_I(v_i) / (d_I(v_i) + d_O(v_i))

Features: Content-based features

• Duplicate tweets
  – An account may be considered spam if it posts duplicate content
  – Duplicates are detected by measuring the Levenshtein distance (edit distance)
      • The minimum cost of transforming one string into another through a sequence of edit operations (deletion, insertion, and substitution of individual symbols)
      • The data are cleaned first by removing words containing "@", "#", "http://", and "www."
  – The number of duplicate tweets is used as a measurement
      • Counted over the user's 20 most recent tweets
      • Two tweets are considered duplicates only when they are exactly the same
  – [Figure: need for cleaning]
• HTTP links
  – An account is considered spam if its updates consist mainly of links and not personal updates
  – Twitter filters out URLs linked to known malicious sites
      • URL-shortening services such as bit.ly give attackers an opportunity to spam
  – The number of tweets containing HTTP links is used as a measurement
  – [Figure: a tweet with an HTTP link, http://bit.ly/ab3cd, passes through a URL-shortening service and resolves to a malicious site, http://porno.com]
• Replies and mentions
  – You can send a reply message to another user: @username + message
  – You can also mention another @username anywhere in a tweet: message + @username + message
  – Twitter automatically collects all tweets containing your username
  – You can reply to anyone, whether or not they are your friends or followers
  – Spammers abuse this feature
  – The number of tweets containing a mention or reply is used as a measurement
  – [Figure: spam tweets using mention or reply]
• Trending topics
  – The most-mentioned terms on Twitter at that moment, week, or month
  – Users can add a hashtag to a tweet: #tagname
  – If many tweets contain the same term, it may become a trending topic
  – Twitter considers an account spam if it posts multiple unrelated updates to a topic using the # symbol

Data Set

• Collected over 3 weeks, from January 3 to January 24, 2010
• 25,847 users
• 500K tweets
• 49M follower/friend relationships

Spam Detection

• Several classification algorithms
  – Decision tree
  – Neural network
  – Support vector machines
  – K-nearest neighbors
  – Naïve Bayesian
• The naïve Bayesian classifier outperforms all other methods
  – The Bayesian classifier is robust to noise
      • It uses posterior probability
  – A spam probability is calculated for each individual user based on its behavior, instead of giving a general rule
• Naïve Bayesian classifier
  – P(Y | X) = P(X | Y) P(Y) / P(X)
  – X: each Twitter account is considered as a vector X of feature values
  – Y: one of two classes, spam and non-spam
  – The features are assumed to be conditionally independent

Experiments

• To evaluate the detection method
  – 500 Twitter user accounts are labeled manually into two classes (spam or not spam)
      • By reading the 20 most recent tweets
      • By checking the friends and followers of the user
  – The results show that around 1% of the accounts in the data set are spam
• Additional spam data are added to the data set
  – To simulate reality and avoid bias in the crawling and labeling methods
  – The study in Analytics, 2009, reports 3% spam on Twitter
  – Additional spam data are collected by searching for "@spam" on Twitter
      • Only a small number of the results report real spam
  – The data set is mixed to contain around 3% spam data
• Graph-based features
  – The number of friends for each Twitter account
      • Only 30% of spam accounts follow a large number of users
      • Spammers do not need to follow other users
  – The number of followers for each Twitter account
      • Spam accounts usually do not have a large number of followers
      • Some spam accounts have a relatively large number of followers
  – The reputation for each Twitter account
      • The reputation of most legitimate users is between 30% and 90%
      • Some spam accounts have a 100% reputation
• Content-based features
  – The number of pairwise duplications
      • Not all spam accounts post multiple duplicate tweets
      • This feature cannot be relied on alone
  – The number of mentions and replies
      • Most spam accounts have the maximum of 20 "@" symbols
      • This lures legitimate users into reading their spam messages or clicking their links
  – The number of links
      • Some legitimate users also include links in all their tweets; for example, some companies join Twitter to promote their own web sites
  – The number of hashtag signs

Evaluation

• The evaluation of the overall process
  – Confusion matrix
  – Precision: P = a / (a + c)
  – Recall: R = a / (a + b)
  – F-measure: F = 2PR / (P + R)
• Each classifier is trained 10 times
  – Each time, 9 of the 10 partitions are used as training data
  – The confusion matrix is computed using the tenth partition as test data
• The evaluation results
  – The naïve Bayesian classifier has the best overall performance
• Finally, the Bayesian classifier learned from the labeled data is applied to the entire data set
  – Information about 25,817 users in total
  – Precision of the spam detection system
      • 392 users are classified as spam
      • 348 users are real spam accounts and 44 users are false alarms
      • 89% precision

Conclusion

• The study examines spam behavior in a popular online SNS, Twitter
  – To formalize the problem, a social graph model is proposed
• Novel content-based and graph-based features are proposed
  – Graph-based features
      • The number of friends
      • The number of followers
      • The reputation of the user
  – Content-based features
      • The number of pairwise duplications
      • The number of mentions and replies
      • The number of links
      • The number of hashtags
• The data set is analyzed and the performance of the detection system is evaluated
• Among the graph-based features
  – The proposed reputation feature has the best performance
  – Not many spam accounts follow a large number of users
  – Some spammers have many followers
• For the content-based features
  – Most spam accounts have multiple duplicate tweets
  – But not all spam accounts post multiple duplicate tweets
      • This feature cannot be relied on alone
• Several popular classification algorithms are studied and evaluated
• The naïve Bayesian classifier achieves 89% precision
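
The graph-based features from the Features section can be sketched in Python as follows. This is a minimal sketch, not the author's implementation: the function name `graph_features` and the arc-list input format are assumptions made for illustration. Each arc (i, j) is read as "i follows j", in line with the follower and friend definitions of the social graph model.

```python
from collections import defaultdict


def graph_features(arcs):
    """Return followers, friends and reputation for every node.

    arcs: iterable of (i, j) pairs, each meaning "i follows j"
    (hypothetical input format chosen for this sketch).
    """
    followers = defaultdict(set)  # incoming arcs: accounts that follow the node
    friends = defaultdict(set)    # outgoing arcs: accounts the node follows
    for i, j in arcs:
        friends[i].add(j)
        followers[j].add(i)

    features = {}
    for node in set(followers) | set(friends):
        d_in = len(followers[node])
        d_out = len(friends[node])
        total = d_in + d_out
        # Reputation R = d_I / (d_I + d_O); guard against an isolated node.
        reputation = d_in / total if total else 0.0
        features[node] = {
            "followers": d_in,
            "friends": d_out,
            "reputation": reputation,
        }
    return features


# Toy graph from the slides: A follows B, B follows C, C follows B.
toy = graph_features([("A", "B"), ("B", "C"), ("C", "B")])
for name in sorted(toy):
    print(name, toy[name])
```

In the toy graph, A follows one account and has no followers, so its reputation is 0, while the mutual friends B and C have reputations of 2/3 and 1/2.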
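
The content-based features can be sketched in the same way. The code below is an illustration under stated assumptions rather than the method used in the study: the names `clean` and `content_features` are hypothetical, the tweet list is assumed to be ordered newest first, and the link, mention, and hashtag features are counted as the number of tweets containing the corresponding marker. Because the slides treat two tweets as duplicates only when they are exactly the same after cleaning (an edit distance of zero), a plain string comparison stands in for the Levenshtein distance here.

```python
import re
from itertools import combinations

# Words containing any of these markers are removed before comparison.
NOISE = re.compile(r"@|#|http://|www\.", re.IGNORECASE)


def clean(tweet):
    """Drop words containing '@', '#', 'http://' or 'www.'."""
    return " ".join(w for w in tweet.split() if not NOISE.search(w))


def content_features(tweets, window=20):
    """Count content-based features over the most recent tweets.

    tweets: list of tweet strings, assumed newest first.
    window: number of recent tweets to inspect (20 in the slides).
    """
    recent = tweets[:window]
    cleaned = [clean(t) for t in recent]
    # Pairwise duplicates: identical non-empty text after cleaning.
    duplicates = sum(1 for a, b in combinations(cleaned, 2) if a and a == b)
    return {
        "duplicates": duplicates,
        "links": sum(1 for t in recent if "http://" in t or "www." in t),
        "mentions": sum(1 for t in recent if "@" in t),
        "hashtags": sum(1 for t in recent if "#" in t),
    }


# Made-up tweets for illustration only.
sample = [
    "@bob Buy now http://bit.ly/ab3cd",
    "@alice Buy now http://bit.ly/xy",
    "Lunch was great #food",
    "@carol Buy now www.example.com",
]
features = content_features(sample)
print(features)
```

The three promotional tweets differ only in their mentions and links, so cleaning reduces them to the same text and they contribute three duplicate pairs. This is the situation the "need for cleaning" slide refers to.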
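
The naïve Bayesian classifier described in the Spam Detection section can be illustrated with a small sketch. The slides do not say how the likelihood P(X | Y) is estimated, so the choices here are assumptions made only for illustration: feature values are binned into discrete categories, likelihoods are estimated by counting with add-one smoothing, and the function names, bin labels, and training rows are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies.

    rows: list of tuples of discrete feature values (assumed pre-binned).
    labels: list of class names, one per row.
    """
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    vocab = defaultdict(set)             # feature index -> values seen in training
    for row, label in zip(rows, labels):
        for k, value in enumerate(row):
            value_counts[(label, k)][value] += 1
            vocab[k].add(value)
    return class_counts, value_counts, vocab


def posterior(model, row):
    """Return P(class | row), treating features as independent given the class."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior P(Y)
        for k, value in enumerate(row):
            count = value_counts[(label, k)][value]
            # Add-one smoothing keeps unseen values from zeroing the product.
            score += math.log((count + 1) / (n + len(vocab[k])))
        log_scores[label] = score
    # Normalising over the classes plays the role of dividing by P(X).
    top = max(log_scores.values())
    weights = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(weights.values())
    return {c: w / z for c, w in weights.items()}


# Made-up training rows: (reputation bin, mention-count bin).
rows = [("low", "many"), ("low", "many"),
        ("high", "few"), ("high", "few"), ("high", "many")]
labels = ["spam", "spam", "not_spam", "not_spam", "not_spam"]
model = train_naive_bayes(rows, labels)
probs = posterior(model, ("low", "many"))
print(probs)
```

The output is a probability per class for one account, which matches the point in the slides that a spam probability is calculated for each individual user instead of applying a general rule.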
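
The evaluation measures and the 10-partition training scheme can be sketched as follows. The reading of the confusion-matrix cells is the one implied by the formulas P = a / (a + c) and R = a / (a + b): a is spam classified as spam, b is spam classified as non-spam, and c is non-spam classified as spam. The function names and the deterministic partitioning are assumptions made for illustration; the slides do not describe how the partitions were drawn.

```python
def precision_recall_f(a, b, c):
    """Precision, recall and F-measure from confusion-matrix counts.

    a: spam classified as spam
    b: spam classified as non-spam
    c: non-spam classified as spam
    """
    precision = a / (a + c) if a + c else 0.0
    recall = a / (a + b) if a + b else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


def k_fold_indices(n, k=10):
    """Yield (train, test) index lists; each partition is the test set once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Made-up counts to exercise the formulas.
p, r, f = precision_recall_f(a=8, b=2, c=2)

# Precision reported in the slides: 348 real spam accounts among 392 flagged.
reported_precision = 348 / (348 + 44)

# 500 labeled accounts split into 10 partitions of 50.
splits = list(k_fold_indices(500, k=10))
print(round(reported_precision, 2), len(splits))
```

The reported figures are consistent with the formula: 348 / 392 rounds to 0.89, the 89% precision given on the evaluation slide. Recall on the full data set cannot be computed from the slides because the number of missed spam accounts is not reported.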
