Social Spam Kristina Lerman University of Southern California CS 599: Social Media Analysis

Manipulation of social media
• Spam
use of electronic messaging systems to send unsolicited bulk
messages indiscriminately, for financial gain
– Malware
• if the page hosts malicious software or attempts to exploit a user’s
– Phishing
• pages include any website attempting to solicit a user’s account
– Scam
• any website advertising pharmaceuticals, software, adult content,
and a multitude of other solicitations
• Deception
Motivations for spam
• Abusers drive traffic to a web site
– Malicious sites
• phishing, malware, sell products
• Compromised accounts then sold to other spammers
– “Click fraud”
• Gain financially from showing ads to visitors
What is the cost of spam?
• Are users harmed by click fraud?
– advertiser gains, because real users click on ads
– intermediary gains fees from the advertiser.
– Spammer gains its cut from the clicked ads.
– User gains, since she learns about products from ads
• No harm is done?
What is the cost of spam?
• Costs to consumers
– Information pollution: good content is hard to find
– Search engines and folksonomies direct traffic in the
wrong directions
– Users end up with less relevant resources
• Costs to content producers
– Less revenue for producers of relevant content
• Costs to search engines
– Develop algorithms to combat spam
→ Everybody pays for the cost of information pollution
Combatting spam
• Social media spam is successful
– 8% of URLs posted on Twitter are spam [2010]
– Much higher click-through rates than email
Strategies designed to make spam more costly to spammers
• Search engine spam
– Algorithms to combat rank manipulations, e.g., link farms
– Blacklists of suspected malware and phishing (e.g.,
Google’s SafeBrowsing API)
• Email spam
– Filters on servers and clients
– Blacklists: IP, domain and URL
• Social spam?
Social Spam Detection
Benjamin Markines, Ciro Cattuto, Filippo Menczer
Presented by Yue Cai
• Web 2.0: social annotation
simplicity and open-ended nature
• Folksonomy: set of triples (u, r, t)
user annotates resource r with tag t
• Problem: social spam
malicious users exploit collaborative tagging
Focus of paper
• Six features of social spam in collaborative tagging
limited to social bookmarking system (
• Prove each feature has predictive power
• Evaluate various supervised learning algorithms
using these features
• Why?
financial gains
• How?
create content (generated by NLP or plagiarized)
place ads
misleading tagging in social sites to attract traffic
-- “Gossip Search Engine”
• Outcome?
Pollution of web environment
Levels of spam
• Content of tagged resources
• Posts: associate resources with tags
create artificial links between resources and unrelated tags
for questionable content, how user annotates it reveals intent
• User account
flag users as spammers – BibSonomy
broad brush: exceedingly strict
• Spammers may use tags and tag combinations that are
statistically unlikely in legitimate posts
• Pr(t): probability that a given tag t is associated with spam
users with tag t: $U_t = \{u : \exists r : (u, r, t) \in F\}$
spammers among them: $S_t \subseteq U_t$
→ $\Pr(t) = |S_t| / |U_t|$
• For a post: $T(u, r) = \{t : (u, r, t) \in F\}$
$f_{\mathrm{TagSpam}}(u, r) = \frac{1}{|T(u, r)|} \sum_{t \in T(u, r)} \Pr(t)$
• Time complexity: constant time for any post
• Cold start problem: needs a body of labeled annotations to
bootstrap tag probabilities
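The TagSpam definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the toy folksonomy, the function names, and the choice to score unseen tags as 0 are all assumptions made for the example.

```python
from collections import defaultdict

def tag_spam_probabilities(folksonomy, spammers):
    """Estimate Pr(t) = |S_t| / |U_t| from labeled triples (u, r, t)."""
    users_by_tag = defaultdict(set)
    for user, _resource, tag in folksonomy:
        users_by_tag[tag].add(user)
    return {
        tag: len(users & spammers) / len(users)
        for tag, users in users_by_tag.items()
    }

def f_tag_spam(post_tags, pr):
    """Average Pr(t) over the tags of one post.

    Tags never seen in the labeled data count as 0 (an assumption).
    """
    if not post_tags:
        return 0.0
    return sum(pr.get(t, 0.0) for t in post_tags) / len(post_tags)

# Toy labeled folksonomy (made-up data for illustration only).
F = [
    ("alice", "r1", "python"),
    ("bob", "r2", "python"),
    ("spam1", "r3", "python"),
    ("spam1", "r3", "cheap"),
    ("spam2", "r4", "cheap"),
]
pr = tag_spam_probabilities(F, spammers={"spam1", "spam2"})
score = f_tag_spam({"python", "cheap"}, pr)
```

Once `pr` is precomputed, scoring a post only touches its own tags, which is why the per-post cost is treated as constant.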
• Spam posts associate spam resources with popular tags that
are often semantically unrelated to each other
• Semantic similarity of tags, based on prior work:
$\sigma(t_1, t_2) \in [0, 1]$
• For a post: $T(u, r) = \{t : (u, r, t) \in F\}$
$f_{\mathrm{TagBlur}}(u, r) = \frac{1}{Z} \sum_{t_1 \neq t_2 \in T(u, r)} \left[ \frac{1}{\sigma(t_1, t_2) + \epsilon} - \frac{1}{1 + \epsilon} \right]$
Z: number of tag pairs in T(u, r)
ε: tuning constant
• Time complexity: quadratic in the number of tags per post,
treated as constant time
• Needs precomputed similarity for any two tags
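A rough sketch of the TagBlur score, assuming a precomputed, symmetric tag-tag similarity is available as a lookup. The `SIM` table, the `sigma` helper, and the value `eps=0.1` are hypothetical placeholders, not values from the paper.

```python
from itertools import combinations

# Hypothetical precomputed similarities; missing pairs default to 0.
SIM = {frozenset(("python", "programming")): 0.9}

def sigma(t1, t2):
    """Look up the similarity of two tags in [0, 1]."""
    return SIM.get(frozenset((t1, t2)), 0.0)

def f_tag_blur(post_tags, sigma, eps=0.1):
    """Average of 1/(sigma + eps) - 1/(1 + eps) over all tag pairs of a post."""
    pairs = list(combinations(sorted(set(post_tags)), 2))
    if not pairs:  # fewer than two tags: no pair to compare
        return 0.0
    total = sum(1.0 / (sigma(a, b) + eps) - 1.0 / (1.0 + eps) for a, b in pairs)
    return total / len(pairs)

coherent = f_tag_blur(["python", "programming"], sigma)
blurred = f_tag_blur(["python", "mortgage", "celebrity"], sigma)
```

Posts whose tags are closely related score near 0, while posts mixing unrelated tags score high, which matches the intuition that spam posts combine popular but unrelated tags.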
• Spam webpages tend to have similar document structure
• Estimate likelihood of r being spam by structure similarity with
spam pages
• Fingerprint:
string containing all HTML 4.0 elements with order preserved
K: fingerprints of spam pages, each with its frequency → Pr(k)
Shingles method: $\sigma(k_1, k_2) \in [0, 1]$
$f_{\mathrm{DomFp}}(r) = \frac{\sum_{k \in K} \sigma(k(r), k) \, \Pr(k)}{\sum_{k \in K} \sigma(k(r), k)}$
• Time complexity: grows linearly with size of labeled spam collection
• Needs to crawl each resource and precompute spam fingerprint
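The DomFp idea can be sketched as follows. Several choices here are assumptions made for illustration: the fingerprint records only start tags, shingles are windows of three consecutive tags, similarity is the Jaccard overlap of shingle sets, and the Pr(k) values are passed in as given weights.

```python
from html.parser import HTMLParser

class _TagCollector(HTMLParser):
    """Collects start-tag names in document order."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def fingerprint(html):
    """Return the ordered sequence of HTML element names in a page."""
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    return tuple(collector.tags)

def shingles(fp, w=3):
    """Set of all windows of w consecutive tags (whole sequence if shorter)."""
    if len(fp) < w:
        return {fp} if fp else set()
    return {fp[i:i + w] for i in range(len(fp) - w + 1)}

def sigma(fp1, fp2, w=3):
    """Jaccard similarity of the two shingle sets, in [0, 1]."""
    a, b = shingles(fp1, w), shingles(fp2, w)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def f_dom_fp(page_fp, spam_fps):
    """Similarity-weighted average of Pr(k) over known spam fingerprints.

    spam_fps maps each fingerprint k to its weight Pr(k).
    """
    sims = {k: sigma(page_fp, k) for k in spam_fps}
    denom = sum(sims.values())
    if denom == 0:
        return 0.0
    return sum(sims[k] * spam_fps[k] for k in spam_fps) / denom

spam_fp = fingerprint("<html><body><div><a></a><a></a><img></div></body></html>")
other_fp = fingerprint("<html><body><p></p><p></p><p></p></body></html>")
K = {spam_fp: 0.9, other_fp: 0.2}  # made-up Pr(k) values
score = f_dom_fp(spam_fp, K)
```

Each scored page is compared against every stored fingerprint, so the cost grows linearly with the size of the labeled spam collection.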
• Spammers often copy original content from all over the Web
• Estimate likelihood of content of r not being genuine
• Random sequence of 10 words from page
→ submit to Yahoo API → get number of results y(r)
$f_{\mathrm{Plagiarism}}(r) = y(r) / y_{\max}$, $y_{\max} = 10$
• Most expensive feature: page download, query limit
• Spammers create pages for serving ads
• g(r): number of appeared in page r
$f_{\mathrm{NumAds}}(r) = g(r) / g_{\max}$, $g_{\max} = 113$
• Needs complete download of a web page
• Many spam resources may be taken offline when
• High portion of links by a spam user are invalid after some time
$f_{\mathrm{ValidLinks}}(u) = |V_u| / |R_u|$, $R_u = \{r : \exists t : (u, r, t) \in F\}$, $V_u \subseteq R_u$
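A small sketch of the ValidLinks ratio. The `is_valid` callable is a hypothetical stand-in for whatever link check is used (for example an HTTP request); here it is replaced by a lookup in a made-up set so the example runs offline.

```python
def f_valid_links(user, folksonomy, is_valid):
    """Fraction of the user's distinct resources R_u that are still valid (V_u)."""
    resources = {r for (u, r, _t) in folksonomy if u == user}
    if not resources:
        return 0.0
    valid = {r for r in resources if is_valid(r)}
    return len(valid) / len(resources)

# Made-up triples (u, r, t) and a made-up set of reachable URLs.
F = [
    ("u1", "http://a.example", "x"),
    ("u1", "http://b.example", "y"),
    ("u1", "http://a.example", "z"),
    ("u2", "http://c.example", "x"),
]
alive = {"http://a.example"}
score = f_valid_links("u1", F, lambda r: r in alive)
```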
• Public Dataset by
annotations of 27,000 users, 25,000 of which are spammers
• Training dataset: 500 users, half spammers, half legitimate users
• Another training dataset of the same size for precomputing
features like TagSpam, TagBlur and DomFp
• Aggregation of features on user level:
TagSpam, TagBlur: post level
DomFp, Plagiarism, NumAds: resource level
Simple average works most effectively across all features:
$f(u) = \frac{1}{|P(u)|} \sum_{(u, r) \in P(u)} f(u, r)$, $f(u) \in [0, 1]$
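The user-level aggregation is a plain mean over a user's posts. A minimal sketch, assuming post-level or resource-level scores are already available as a mapping from (user, resource) to a value in [0, 1]; the data below is made up.

```python
from collections import defaultdict

def aggregate_by_user(post_scores):
    """Average f(u, r) over each user's posts P(u) to obtain f(u)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for (user, _resource), value in post_scores.items():
        sums[user] += value
        counts[user] += 1
    return {user: sums[user] / counts[user] for user in sums}

post_scores = {
    ("u1", "r1"): 0.2,
    ("u1", "r2"): 0.6,
    ("u2", "r3"): 1.0,
}
user_scores = aggregate_by_user(post_scores)
```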
Each feature has predictive power
• Each feature:
contingency matrix n(l, f)
TagSpam works the
Effect of feature selection (SVM):
• a modest improvement in accuracy and
decrease in false positive rate by using both
TagSpam and TagBlur
• Performance is hindered by the addition of the
ValidLinks feature (not for linear separation)
All classifiers perform very well,
with accuracy over 96% and false
positive rate below 5%.
• Features are strong
single feature: 96% accuracy, 5% false positive rate
combined: 98% accuracy, 2% false positive rate
• TagBlur feature looks promising
its reliance on tag-tag similarity could be updated
others rely on resource content or search engine so not
• Bootstrap still an open issue
features like TagSpam and DomFp need spam labels
• Whether unsupervised features are still needed,
like ValidLinks and Plagiarism
@spam: The Underground on 140 Characters or
Chris Grier, Kurt Thomas, Vern Paxson, Michael Zhang
Presented by Renjie Zhao
Focus of the Paper
• Categorization and measurement of Twitter spam
– Spammers’ strategies, accounts and tools
– How good are they? (Much better than junk emails)
• Identification of spam campaigns
– URL clustering
– Extraction of distinct spam behaviors and targets
• Performance of URL blacklists against Twitter spam
– Temporal effectiveness (lead/lag)
– Spammers’ counter-measures
• Data Collection
– Tapping into Twitter’s Streaming API
• 7 million tweets per day
• Over the course of one month (January 2010 – February 2010)
– Total: 200 million tweets gathered
• Spam Identification
– Focus on tweets with URL (25 million URLs)
– Check URLs with 3 blacklists: Google Safebrowsing API,
URIBL, Joewein
– Result: 2 million URLs are flagged as spam
• Challenged by manual inspection!
Spam Breakdown
• Call outs: “Win an iTouch AND a $150 Apple gift card @victim!”
• Retweets: “RT @scammer: check out the iPads there having a giveaway”
• Tweet hijacking: “RT @barackobama A great battle is ahead of us”
• Trend setting: “Buy more followers! #fwlr”
• Trend hijacking: “Help donate to #haiti relief:”
Clickthrough Analysis
• According to Clickthrough data
of 245,000 URLs:
– Only 2.3% have traffic
– They had over 1.6 million visitors
• Clickthrough rate
– For a certain spam URL,
CR = <# of clicks> / <# of URL’s exposure>
– Result: 0.13% of spam tweets generate a visit
(compared to junk email’s CR of 0.0003%–0.0006%)
Spam Accounts
• 2 tests to identify career spamming accounts
– χ² test on timestamps: consistency with a uniform distribution
– Tweet entropy: whether content is repeated throughout
• Result
In a sample of 43,000 spam accounts:
– 16% are identified as career spammers
– What about the remaining 84%?
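The two tests can be illustrated roughly as below. This is a sketch under stated assumptions, not the paper's procedure: timestamps are reduced to the second within the minute (60 bins), the χ² statistic is returned for comparison against a critical value chosen by the reader, and tweet entropy is taken as the Shannon entropy over exact tweet texts. All sample data is made up.

```python
import math
from collections import Counter

def chi_square_uniform(values, n_bins):
    """Chi-square statistic of integer bin values against a uniform distribution."""
    if not values:
        raise ValueError("need at least one observation")
    counts = Counter(values)
    expected = len(values) / n_bins
    return sum(
        (counts.get(b, 0) - expected) ** 2 / expected for b in range(n_bins)
    )

def text_entropy(tweets):
    """Shannon entropy (bits) of the distribution of distinct tweet texts."""
    counts = Counter(tweets)
    n = len(tweets)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

# Seconds-of-minute at which tweets were posted (toy data).
evenly_spread = list(range(60)) * 2   # every second used twice
scheduled_bot = [0] * 120             # always posts at second 0

stat_even = chi_square_uniform(evenly_spread, 60)
stat_bot = chi_square_uniform(scheduled_bot, 60)

repeated = ["Buy followers now!"] * 8
varied = ["good morning", "lunch time", "nice game", "see you soon"]
h_repeated = text_entropy(repeated)
h_varied = text_entropy(varied)
```

A large statistic means the posting times are far from uniform, and an entropy near zero means the same text is posted over and over; both are signals of automation in this simplified reading.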
Spam Accounts
• Compromised (non-career) spamming accounts
– Phishing sites
• 86% of 20,000 victims passed career spammer tests
– Malware botnet: Koobface
Spam Campaigns
• Multiple spamming accounts may co-operate to advertise a
spam website
• URL clustering
– Define a spam campaign as a binary feature vector c={0,
– For two accounts i and j, if ci∩cj ≠ Ø, then i and j are
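One way to picture the clustering step, assuming that accounts sharing at least one spam URL end up in the same campaign and that this grouping is applied transitively. The union-find approach, the function name, and the sample accounts are assumptions for illustration, not details taken from the paper.

```python
from collections import defaultdict

def cluster_campaigns(account_urls):
    """Group accounts that share at least one URL, transitively (union-find)."""
    parent = {account: account for account in account_urls}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_account_for_url = {}
    for account, urls in account_urls.items():
        for url in urls:
            if url in first_account_for_url:
                union(account, first_account_for_url[url])
            else:
                first_account_for_url[url] = account

    clusters = defaultdict(set)
    for account in account_urls:
        clusters[find(account)].add(account)
    return sorted(sorted(members) for members in clusters.values())

# Made-up accounts and the URLs they tweeted.
campaigns = cluster_campaigns({
    "a": {"u1", "u2"},
    "b": {"u2"},
    "c": {"u3"},
    "d": {"u3", "u4"},
    "e": {"u5"},
})
```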
Spam Campaigns
• Phishing for followers
– A pyramid scheme
– Most spammers are compromised users advertising the
• Personalized mentions
–<user name>
– Unique, victim-specific landing pages shortened with
– Most relevant tweets are just RT or mentions
Spam Campaigns
• Buying retweets
– Usually employed by spammers to spread malware and
– Most accounts are career spammers (by χ2 test)
• Distributing malware
– ‘Free’ software, drive-by download
– Use multiple hops of redirect to mask landing pages
URL Blacklists
• Currently (2010), Twitter relies on the Google Safebrowsing API to
block malicious URLs.
– Blacklists usually lag behind spam tweets
– No retroactive blocking!
Evading URL Blacklists
• URL shortening service
• What about domain-wise blacklists?
8% of URLs on Twitter are spam
16% of spam accounts are automated bots
Spam Clickthrough rate = 0.13%
Spammers may coordinate thousands of accounts in a
• URL blacklists don’t work very well
– because of delayed response
– unable to reveal shortened URLs
• Advice
– Dig deeper into redirect chains
– Retroactive blacklisting to increase spammers’ cost
• More research on spammers’ behavior
• Twitter added feature for user to report spam
• ‘BotMaker’ launched in August
Entropy-based Classification of ‘Retweeting’
Activity [Ghosh et al.]
• Question
– Given the time series of ‘retweeting’ activity on some
user-generated content or tweet, how do we meaningfully
categorize it as organic or spam?
• Contributions
– Use information theory-based features to categorize
tweeting activity
• Time interval entropy
• User entropy
Dynamics of Retweeting Activity
(i) Popular news website (nytimes)
(ii) Popular celebrity (billgates)
(iii) Politician (silva_marina)
(iv) An aspiring artist (youngdizzy)
(v) Post by a fan site (AnnieBeiber)
(vi) Advertisement using social media(onstrategy)
Measuring time interval and user diversity
• Measure time interval between consecutive retweets
• Count distinct tweeting users
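Time Interval Diversity
• Many different time intervals vs. few time intervals observed
• Frequency of time intervals of duration $\Delta t_i$, where $n_{\Delta t_i}$ counts the intervals of that duration:
$p_{\Delta T}(\Delta t_i) = \frac{n_{\Delta t_i}}{\sum_{k} n_{\Delta t_k}}$
Time Interval Entropy
$H_{\Delta T}(\Delta T) = -\sum_{i} p_{\Delta T}(\Delta t_i) \log p_{\Delta T}(\Delta t_i)$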
User Diversity
• Many different users retweet a few times each vs. few users retweet many times each
• Frequency of retweets by distinct user $f_i$, where $n_{f_i}$ counts that user's retweets:
$p_F(f_i) = \frac{n_{f_i}}{\sum_{j} n_{f_j}}$
User Entropy
$H_F(F) = -\sum_{i} p_F(f_i) \log p_F(f_i)$
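Both entropies can be computed from a list of retweet events. The sketch below assumes integer timestamps in seconds (so that interval durations are discrete), uses the natural logarithm, and runs on made-up data; the function names are illustrative.

```python
import math
from collections import Counter

def shannon_entropy(counts):
    """Entropy of the distribution obtained by normalizing the counts."""
    counts = list(counts)
    total = sum(counts)
    return sum(-(c / total) * math.log(c / total) for c in counts if c > 0)

def time_interval_entropy(timestamps):
    """Entropy of the durations between consecutive retweets."""
    ts = sorted(timestamps)
    intervals = [later - earlier for earlier, later in zip(ts, ts[1:])]
    if not intervals:
        return 0.0
    return shannon_entropy(Counter(intervals).values())

def user_entropy(retweeters):
    """Entropy of how retweets are distributed over distinct users."""
    if not retweeters:
        return 0.0
    return shannon_entropy(Counter(retweeters).values())

# Automated pattern: one account retweeting every 60 seconds.
bot_times = [0, 60, 120, 180, 240]
bot_users = ["bot"] * 5
# Organic pattern: different users at irregular times.
organic_times = [0, 1, 3, 6, 10]
organic_users = ["a", "b", "c", "d"]

h_time_bot = time_interval_entropy(bot_times)
h_time_organic = time_interval_entropy(organic_times)
h_user_bot = user_entropy(bot_users)
h_user_organic = user_entropy(organic_users)
```

Regular, single-account activity lands near the origin of the entropy plane, while activity spread over many users and many interval lengths lands far from it.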
Bloggers and News Website
Dynamics of Retweeting Activity
(i) Popular news website (nytimes)
(ii) Popular celebrity (billgates)
Dynamics of Retweeting Activity
(iii) Politician (silva_marina)
(vi) Animal Rights Activist (nokillanimalist)
Performers and their fans
Dynamics of Retweeting Activity
(iv) An aspiring artist (youngdizzy)
(v) Post by a fan site (AnnieBeiber)
Advertisers and spammers
(ix) Advertisement by a Japanese user (nikotono)
(vii) Advertisement using social media(onstrategy)
(viii) Account eventually suspended by Twitter(EasyCash435)
Figure: manually annotated URLs shown in the entropy plane (regions: news and blogs; bot activity; advertisements and spam)
– Novel information theoretic approach to activity
• Content independent
• Scalable and efficient
• Robust to sampling
– Results
• sophisticated tools for marketing and spamming
• Twitter is exploited for promotional and spam-like activities
• Able to identify distinct classes of dynamic activities in Twitter and
associated content
• Separation of popular with unpopular content
– Applications: spam detection, trend identification, trust
management, user-modeling, social search, content