Social Spam
Kristina Lerman, University of Southern California
CS 599: Social Media Analysis

Manipulation of social media
• Spam: use of electronic messaging systems to send unsolicited bulk messages indiscriminately, for financial gain
  – Malware: the page hosts malicious software or attempts to exploit a user's browser
  – Phishing: any website attempting to solicit a user's account credentials
  – Scam: any website advertising pharmaceuticals, software, adult content, and a multitude of other solicitations
• Deception

Motivations for spam
• Abusers drive traffic to a web site
  – Malicious sites: phishing, malware, selling products
  – Compromised accounts are then sold to other spammers
• "Click fraud": gain financially from showing ads to visitors

What is the cost of spam?
• Are users harmed by click fraud?
  – The advertiser gains, because real users click on ads
  – The intermediary gains fees from the advertiser
  – The spammer gains its cut from the clicked ads
  – The user gains, since she learns about products from ads
• So is no harm done?

What is the cost of spam?
• Costs to consumers
  – Information pollution: good content is hard to find
  – Search engines and folksonomies direct traffic in the wrong directions
  – Users end up with less relevant resources
• Costs to content producers
  – Less revenue for producers of relevant content
• Costs to search engines
  – Must develop algorithms to combat spam
• Everybody pays the cost of information pollution

Combatting spam
• Social media spam is successful
  – 8% of URLs posted on Twitter are spam [2010]
  – Much higher click-through rates than email
• Strategies designed to make spam more costly to spammers
  – Search engine spam: algorithms to combat rank manipulation (e.g., link farms); blacklists of suspected malware and phishing sites (e.g., Google's SafeBrowsing API)
  – Email spam: filters on servers and clients; blacklists of IPs, domains, and URLs
  – Social spam?
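The blacklist strategy above can be sketched as a minimal URL filter. This is an illustrative sketch only: the blacklist contents and the parent-domain matching rule are assumptions, not any specific provider's API.

```python
from urllib.parse import urlparse

# Hypothetical blacklists of the kinds mentioned above (domain- and URL-level).
DOMAIN_BLACKLIST = {"spam.example", "malware.example"}
URL_BLACKLIST = {"http://ok.example/bad-page"}

def is_blacklisted(url: str) -> bool:
    """Return True if the URL matches an exact-URL entry or a blacklisted
    domain (including parent domains, e.g. a.spam.example matches spam.example)."""
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    # All suffixes of the hostname: "a.spam.example", "spam.example", "example".
    suffixes = {".".join(parts[i:]) for i in range(len(parts))}
    return url in URL_BLACKLIST or bool(suffixes & DOMAIN_BLACKLIST)
```

Real systems (e.g., the SafeBrowsing API mentioned above) add hashing, update protocols, and reputation scores on top of this basic lookup.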
Social Spam Detection
Benjamin Markines, Ciro Cattuto, Filippo Menczer
Presented by Yue Cai

Introduction
• Web 2.0: social annotation is user-driven, simple, and open-ended
• Folksonomy: a set of triples (u, r, t), where user u annotates resource r with tag t
• Problem: social spam, where malicious users exploit collaborative tagging

Focus of the paper
• Six features of social spam in collaborative tagging systems, limited to a social bookmarking system (delicious.com)
• Show that each feature has predictive power
• Evaluate various supervised learning algorithms using these features

Background
• Why? Financial gain
• How? Create content (generated by NLP or plagiarized), place ads, and use misleading tags in social sites to attract traffic ("Gossip Search Engine")
• Outcome? Pollution of the web environment

Levels of spam
• Content of tagged resources: subjective to judge
• Posts: associate resources with tags
  – Spammers create artificial links between resources and unrelated tags
  – For questionable content, how a user annotates it reveals intent
• User accounts: flag users as spammers
  – BibSonomy does this; a broad brush, exceedingly strict

TagSpam
• Spammers may use tags and tag combinations that are statistically unlikely in legitimate posts
• Pr(t): probability that a given tag t is associated with spam
  – Users with tag t: U_t = {u : ∃r such that (u, r, t) ∈ F}
  – Pr(t) = |S_t| / |U_t|, where S_t ⊆ U_t is the set of known spammers using t
• For a post with tags T(u, r) = {t : (u, r, t) ∈ F}:
  f_TagSpam(u, r) = (1 / |T(u, r)|) Σ_{t ∈ T(u, r)} Pr(t)
• Time complexity: constant time for any post
• Cold-start problem: needs a body of labeled annotations to bootstrap tag probabilities

TagBlur
• Spam posts associate spam resources with popular tags that are often semantically unrelated to each other
• Semantic similarity of tags σ(t1, t2) ∈ [0, 1], based on prior work
• For a post with tags T(u, r) = {t : (u, r, t) ∈ F}:
  f_TagBlur(u, r) = (1/Z) Σ_{t1 ≠ t2 ∈ T(u, r)} [ 1/(σ(t1, t2) + ε) − 1/(1 + ε) ]
  – Z: number of tag pairs in T(u, r)
  – ε: tuning constant
• Time complexity: quadratic in the number of tags per post, which is considered constant
• Needs precomputed similarity
for any two tags.

DomFp
• Spam web pages tend to have similar document structure
• Estimate the likelihood of r being spam by its structural similarity with known spam pages
• Fingerprint: a string containing all HTML 4.0 elements of the page, with order preserved
  – K: set of fingerprints of known spam pages, each with frequency Pr(k)
  – The shingles method gives a fingerprint similarity σ(k1, k2) ∈ [0, 1]
  f_DomFp(r) = Σ_{k ∈ K} σ(k(r), k) · Pr(k) / Σ_{k ∈ K} σ(k(r), k)
• Time complexity: grows linearly with the size of the labeled spam collection
• Needs to crawl each resource and precompute spam fingerprint probabilities

Plagiarism
• Spammers often copy original content from all over the Web
• Estimate the likelihood that the content of r is not genuine
  – Submit a random sequence of 10 words from the page to the Yahoo search API and count the number of results y(r)
  f_Plagiarism(r) = y(r) / y_max, with y_max = 10
• Most expensive feature: requires a page download, and the API limits queries

NumAds
• Spammers create pages for serving ads
• g(r): number of times googlesyndication.com appears in page r
  f_NumAds(r) = g(r) / g_max, with g_max = 113
• Needs a complete download of the web page

ValidLinks
• Many spam resources are taken offline once detected, so a high portion of the links posted by a spam user become invalid after some time
  f_ValidLinks(u) = |V_u| / |R_u|, where R_u = {r : ∃t such that (u, r, t) ∈ F} and V_u ⊆ R_u is the set of still-valid resources

Evaluation
• Public dataset from BibSonomy.org: annotations of 27,000 users, 25,000 of whom are spammers
• Training dataset: 500 users, half spammers, half legitimate users
• Another training dataset of the same size for precomputing features like TagSpam, TagBlur, and DomFp
• Aggregation of features at the user level
  – TagSpam, TagBlur are post-level; DomFp, Plagiarism, NumAds are resource-level
  – A simple average works most effectively across all features:
  f(u) = (1 / |P(u)|) Σ_{(u,r) ∈ P(u)} f(u, r), with f(u) ∈ [0, 1]

Each feature has predictive power
• Each feature: contingency matrix n(l, f)
• TagSpam works the best

Classification
• Effect of feature selection (SVM):
  – A modest improvement in accuracy and a decrease in false positive rate by using both TagSpam and TagBlur
  – Performance is hindered by the
addition of the ValidLinks feature (the data are not linearly separable)
• All classifiers perform very well, with accuracy over 96% and false positive rates below 5%

Conclusion
• The features are strong
  – Used singly: 96% accuracy, 5% false positives
  – Combined: 98% accuracy, 2% false positives
• The TagBlur feature looks promising
  – Its reliance on tag-tag similarity can be kept updated
  – The others rely on resource content or search engines, so they are less reliable
• Bootstrapping is still an open issue: features like TagSpam and DomFp need spam labels
• Open question: are unsupervised features like ValidLinks and Plagiarism still needed?

Questions?

@spam: The Underground on 140 Characters or Less
Chris Grier, Kurt Thomas, Vern Paxson, Michael Zhang
Presented by Renjie Zhao

Focus of the paper
• Categorization and measurement of Twitter spam
  – Spammers' strategies, accounts, and tools
  – How effective are they? (Much better than junk email)
• Identification of spam campaigns
  – URL clustering
  – Extraction of distinct spam behaviors and targets
• Performance of URL blacklists against Twitter spam
  – Temporal effectiveness (lead/lag)
  – Spammers' counter-measures

Preparation
• Data collection
  – Tapping into Twitter's Streaming API: 7 million tweets per day over the course of one month (January 2010 – February 2010)
  – Total: 200 million tweets gathered
• Spam identification
  – Focus on tweets with URLs (25 million URLs)
  – Check URLs against 3 blacklists: Google Safebrowsing API, URIBL, Joewein
  – Result: 2 million URLs are flagged as spam
  – Validated by manual inspection

Spam breakdown (example tweets)
• "Win an iTouch AND a $150 Apple gift card @victim! http://spam.com" (call out)
• "RT @scammer: check out the iPads there having a giveaway http://spam.com" (retweet)
• "http://spam.com RT @barackobama A great battle is ahead of us" (tweet hijacking)
• "Buy more followers!
http://spam.com #fwlr" (trend setting)
• "Help donate to #haiti relief: http://spam.com" (trend hijacking)

Clickthrough analysis
• According to clickthrough data for 245,000 URLs:
  – Only 2.3% have any traffic
  – But those had over 1.6 million visitors
• Clickthrough rate
  – For a given spam URL, CR = <# of clicks> / <# of the URL's exposures>
  – Result: 0.13% of spam tweets generate a visit (compared to junk email's CR of 0.0003%–0.0006%)

Spam accounts
• Two tests to identify career spamming accounts
  – χ2 test on tweet timestamps: consistency with a uniform distribution
  – Tweet entropy: whether content is repeated throughout tweets
• Result: in a sample of 43,000 spam accounts, 16% are identified as career spammers
  – What about the remaining 84%?
• Compromised (non-career) spamming accounts
  – Phishing sites: 86% of 20,000 victims passed the career-spammer tests
  – Malware botnets: e.g., Koobface

Spam campaigns
• Multiple spamming accounts may cooperate to advertise a spam website
• URL clustering
  – Represent each account's spam URLs as a binary feature vector c ∈ {0, 1}^n
  – For two accounts i and j, if c_i ∩ c_j ≠ ∅ (they share a spam URL), then i and j are clustered together
• Phishing for followers
  – A pyramid scheme
  – Most spammers are compromised users advertising the service
• Personalized mentions
  – twitprize.com/<user name>
  – Unique, victim-specific landing pages shortened with tinyurl
  – Most relevant tweets are just RTs or mentions
• Buying retweets
  – retweet.it
  – Usually employed by spammers to spread malware and scams
  – Most accounts involved are career spammers (by the χ2 test)
• Distributing malware
  – "Free" software, drive-by downloads
  – Multiple hops of redirects mask the landing pages

URL blacklists
• Currently (2010), Twitter relies on the Google Safebrowsing API to block malicious URLs
  – Blacklists usually lag behind spam tweets
  – No retroactive blocking!
• Evading URL blacklists
  – URL shortening services hide the destination: bit.ly, goo.gl, ow.ly → spam.com
  – What about domain-wise blacklists?
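The two career-spammer tests above can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the minute-of-hour binning for the χ2 test and the use of whole-tweet texts for the entropy test are assumptions.

```python
import math
from collections import Counter

def chi2_uniform(timestamps, bins=60):
    """Chi-squared statistic of minute-of-hour counts against a uniform
    distribution. Large values mean posting times are far from uniform
    (e.g., automated regularity or bursts). Binning here is illustrative."""
    counts = Counter(int(t // 60) % bins for t in timestamps)
    expected = len(timestamps) / bins
    return sum((counts.get(b, 0) - expected) ** 2 / expected for b in range(bins))

def tweet_entropy(tweets):
    """Shannon entropy (bits) of the distribution of distinct tweet texts.
    Near-zero entropy means the account keeps repeating the same content."""
    counts = Counter(tweets)
    n = len(tweets)
    return -sum(c / n * math.log2(c / n) for c in counts.values())
```

An account flagged by both signals (timestamps inconsistent with uniform, low content entropy) is the kind the paper labels a career spammer.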
Conclusion
• 8% of URLs on Twitter are spam
• 16% of spam accounts are automated bots
• Spam clickthrough rate = 0.13%
• Spammers may coordinate thousands of accounts in a campaign
• URL blacklists don't work very well
  – because of delayed response
  – unable to resolve shortened URLs
• Advice
  – Dig deeper into redirect chains
  – Retroactive blacklisting would increase spammers' costs

Follow-ups
• More research on spammers' behaviors
• Twitter added a feature for users to report spam
• 'BotMaker' launched in August

Entropy-based Classification of 'Retweeting' Activity [Ghosh et al.]
• Question
  – Given the time series of retweeting activity on some user-generated content or tweet, how do we meaningfully categorize it as organic or spam?
• Contributions
  – Use information theory-based features to categorize tweeting activity: time interval entropy and user entropy

Dynamics of retweeting activity (example accounts)
(i) popular news website (nytimes); (ii) popular celebrity (billgates); (iii) politician (silva_marina); vs. (iv) an aspiring artist (youngdizzy); (v) post by a fan site (AnnieBeiber); (vi) advertisement using social media (onstrategy)

Measuring time interval and user diversity
• Measure the time interval Δt_i between consecutive retweets, and count distinct retweeting users
• Time interval diversity: many different time intervals vs. few distinct intervals observed
  – Frequency of time intervals of duration Δt_i:
  p_ΔT(Δt_i) = n_Δti / Σ_{k=1..n_T} n_Δtk
  – Time interval entropy:
  H_ΔT(ΔT) = − Σ_{i=1..n_T} p_ΔT(Δt_i) log p_ΔT(Δt_i)
• User diversity: many different users retweet a few times each vs. few users retweet many times each
  – Frequency of retweets by distinct user, f_i:
  p_F(f_i) = n_fi / Σ_{k=1..n_F} n_fk
  – User entropy:
  H_F(F) = − Σ_{i=1..n_F} p_F(f_i) log p_F(f_i)

Bloggers and news websites
Dynamics of retweeting activity: (i) popular news website (nytimes); (ii) popular celebrity (billgates)

Campaigners
Dynamics of retweeting activity: (iii) politician (silva_marina); (vi) animal rights activist (nokillanimalist)

Performers and their fans
Dynamics of retweeting activity: (iv) an aspiring artist (youngdizzy); (v) post by a fan site (AnnieBeiber)

Advertisers and spammers
Dynamics of retweeting activity: (vii) advertisement using social media (onstrategy); (viii) account eventually suspended by Twitter (EasyCash435); (ix) advertisement by a Japanese user (nikotono)

Validation
• Figure: manually annotated URLs shown in the entropy plane, covering news and blogs, campaigns, advertisements & spam, and bot activity (nytimes, billgates, silva_marina, animalist, DonnaCCasteel, onstrategy, EasyCash, AnnieBieber)

Conclusion
• A novel information-theoretic approach to activity recognition
  – Content independent
  – Scalable and efficient
  – Robust to sampling
• Results
  – Sophisticated tools exist for marketing and spamming; Twitter is exploited for promotional and spam-like activities
  – Able to identify distinct classes of dynamic activity on Twitter and the associated content
  – Separates popular from unpopular content
• Applications: spam detection, trend identification, trust management, user modeling, social search, content classification
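The two entropy measures above can be computed directly from their formulas. A minimal sketch, assuming raw retweet timestamps and user IDs as input; the paper may bin intervals differently.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (nats) of a discrete distribution given raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log(c / total) for c in counts if c)

def time_interval_entropy(retweet_times):
    """H_dT: entropy over the distribution of inter-retweet interval durations.
    Perfectly regular (bot-like) retweeting gives entropy 0."""
    times = sorted(retweet_times)
    intervals = [b - a for a, b in zip(times, times[1:])]
    return entropy(Counter(intervals).values())

def user_entropy(retweet_users):
    """H_F: entropy over the distribution of retweets per distinct user.
    A single user retweeting many times gives entropy 0."""
    return entropy(Counter(retweet_users).values())
```

Plotting (H_dT, H_F) for each tweet places it in the entropy plane used above: organic popular content tends to show high diversity on both axes, while bot and campaign activity collapses toward one of the axes.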