Slides - Information Sciences Institute

Influence in Social Media
Kristina Lerman
University of Southern California
CS 599: Social Media Analysis
Identifying influential users in social networks
Hard problem for large online social networks
Topics covered
How do you quantitatively measure influence in social
networks? What metrics should you use?
• Cha et al. 2010 “Measuring User Influence on Twitter: The
Million Follower Fallacy”, in ICWSM.
• Bakshy, et al. 2011 “Everyone’s an Influencer: Quantifying
Influence on Twitter”, In WSDM.
• Ghosh & Lerman. 2010 “Predicting Influential Users in Online
Social Networks” In SNA-KDD workshop.
Measuring User Influence in Twitter: The Million
Follower Fallacy
Meeyoung Cha, Hamed Haddadi, Fabrício Benevenuto, Krishna P. Gummadi
Niki Parmar
University of Southern
California
Problem
• The notion of influence is important and long studied
- Business, fashion, voting trends, marketing, etc.
• Measuring influence is difficult
- Involves human choices and complex societies
• The modern view of the key factors in measuring influence:
- Interpersonal relationships among users
- Readiness to adapt to change
• Influence drives information diffusion, trend setting, friendship, news,
gossip, and controversial issues
• Can initiate large-scale chain reactions
Influence in Social Media
• Online communication is the new way to receive information
• Influence: “the power or capacity of causing an effect in indirect or
intangible ways.”
• Twitter usage
- 271 million monthly active users.
- 500 million Tweets are sent per day.
• Understand roles of different users
- Through followers, retweets, identity value
• Effective advertisement on Twitter using influential users
- Target only a few influential people to reach the majority of users.
Measuring Influence on Twitter
• Comparison of three influence measures
- Indegree influence: number of followers of the user, indicating the size of
the audience for that user
- Retweet influence: number of retweets containing one’s name,
indicating the ability to generate content with pass-along value
- Mention influence: number of replies to the user, indicating the ability of
the user to engage others in a conversation
• Measure dynamics of user influence over topics and time.
• How to gain and maintain influence? How can ordinary users be
influential?
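As a rough illustration of the three measures compared above, the sketch below counts them on a toy data set. The input layout (a list of follower/followee pairs and raw tweet strings using the `RT @name` and `@name` conventions) and all user names are assumptions made for this example, not the paper's actual pipeline.

```python
import re
from collections import Counter

# Hypothetical toy data: (follower, followee) pairs and (author, text) tweets.
follows = [("bob", "alice"), ("carol", "alice"), ("alice", "bob")]
tweets = [
    ("bob", "RT @alice: new paper on influence"),
    ("carol", "@alice great talk today"),
    ("alice", "RT @bob: conference photos"),
    ("carol", "RT @alice: new paper on influence"),
]

# Assumed conventions: "RT @name" or "via @name" marks a retweet,
# any other "@name" counts as a mention.
RETWEET = re.compile(r"\b(?:RT|via)\s+@(\w+)", re.IGNORECASE)
MENTION = re.compile(r"@(\w+)")

def influence_measures(follows, tweets):
    """Return (indegree, retweet, mention) counters keyed by user name."""
    indegree = Counter(followee for _, followee in follows)
    retweets, mentions = Counter(), Counter()
    for _author, text in tweets:
        retweeted = set(RETWEET.findall(text))
        for name in retweeted:
            retweets[name] += 1
        # Names already credited with a retweet are not double counted.
        for name in set(MENTION.findall(text)) - retweeted:
            mentions[name] += 1
    return indegree, retweets, mentions

indegree, retweets, mentions = influence_measures(follows, tweets)
```

Keeping the three counters separate makes it easy to rank users on each measure independently and then compare the rankings.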
Data Collection
• Collected 54,981,152 in-use user accounts, which were
connected to each other by 1,963,263,821 social links.
• Gathered the tweets posted by all users since the early days of
the service, which amount to 1,755,925,520 tweets.
• Focus on the largest component of the network, which contains
99% of the links and tweets.
• After filtering, measure the influence of 6,189,636 users based on
how the entire set of users interacted with them.
Comparing User Influence
Top Users for each measure
• Followers count – Public figures and news sources
- Get a lot of attention from their audience
• Retweets – Content aggregation services such as news channels and
news sources, businesspeople
- Their tweets are about the content and contain URLs
• Mentions – Celebrities, known figures
- Ordinary users show great passion for celebrities
Comparing User Influence
• Correlation across all users is high because that set includes the least
influential users, who have no retweets or mentions and are not well
connected.
• Indegree has a low correlation with retweets and mentions.
A large number of followers does not translate into influence:
the Million Follower Fallacy
• Retweets are content driven and mentions are identity driven
– The two are highly correlated and have greater impact on influence.
Influence Across Topics
• Measure across 3 diverse topics: the Iranian presidential
election, the outbreak of the H1N1 influenza and the death of
Michael Jackson.
• Distribution of user ranks for retweets and mentions follows
power law pattern!
Influence Across Topics
• Measure variation of a user’s influence across the three topics
using Spearman’s rank correlation coefficient
• Observed strong correlation between topics
• The most influential users hold influence across a range of topics. Top
users are similar for all three topics, and the correlation is high for the top 1%.
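For readers unfamiliar with Spearman's rank correlation, here is a minimal pure-Python sketch: rank each list (ties share their average rank), then take the Pearson correlation of the ranks. The per-topic retweet counts are invented for illustration.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks.

    Undefined (division by zero) if either list is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical retweet counts of the same five users on two topics.
topic_a = [120, 45, 30, 8, 2]
topic_b = [90, 50, 10, 12, 1]
rho = spearman(topic_a, topic_b)
```

Only two neighboring users swap ranks between the two hypothetical topics, so rho comes out close to 1. To study the top 1%, the same function would be applied to that subset of users.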
Influence over time – Rise or Fall?
• Engagement of top influentials
- Take the top 100 users on each of the three measures
- Calculate the probability P that a random tweet posted on Twitter during a
15-day period is a retweet (or a mention) of that user
- Calculate this over an 8-month period, normalizing by the total number of tweets
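The windowed probability described above could be computed along these lines. This is a sketch under assumed inputs: each tweet is reduced to a `(day, credited_user)` pair, where `credited_user` is the user being retweeted or mentioned (or `None`); that event format is hypothetical.

```python
from collections import Counter

WINDOW_DAYS = 15  # window length taken from the slide

def engagement_probability(events, user, window_days=WINDOW_DAYS):
    """Per window, the fraction of all tweets that credit `user`.

    events: iterable of (day, credited_user_or_None), one entry per tweet.
    Returns {window_index: probability}.
    """
    totals, hits = Counter(), Counter()
    for day, credited in events:
        window = day // window_days
        totals[window] += 1
        if credited == user:
            hits[window] += 1
    return {w: hits[w] / totals[w] for w in sorted(totals)}

# Hypothetical stream of eight tweets spanning two 15-day windows.
events = [
    (0, "alice"), (3, None), (10, "bob"), (14, "alice"),
    (15, None), (20, "alice"), (29, None), (29, None),
]
p_alice = engagement_probability(events, "alice")
```

Dividing by the total number of tweets in each window is the normalization step: it keeps the values comparable even as overall Twitter volume grows over the eight months.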
Influence of Ordinary Users
• Gather top 20 users for each topic who tweet only about one topic
• Calculate Probability P again for these 60 users over 8 months
- Increase in retweets and mentions over the time period
- Influentials spread information about protests, controversial news etc
- Users limited to a single topic show largest increase in influence scores
Conclusion
• Captures different perspectives of influence: indegree, retweets,
mentions
• A user’s popularity is not related to influence. Different groups are
influential depending on content and name value
• The most influential users hold influence over a variety of topics
• Top Twitter users had a disproportionate amount of influence,
indicating a power-law distribution
• Users need to self-advertise and sustain continuous effort and
involvement to become influential over time
Thank You !
#Questions?
Everyone’s an influencer [Bakshy et al.]
• Questions
– What makes some content spread far but not others?
– Can we reliably identify influential users on Twitter?
• Findings
– Largest cascades generated by users who have generated
them in the past
– Content matters? Positive URLs tend to spread farther
– But, cannot reliably predict which user or URL will
generate large cascade
Word-of-mouth diffusion & influence
• Diffusion: mechanism for information spread on networks
– Diffusion event is a cascade
• Influencer is someone who can consistently trigger large cascades
Marketing
• Can we maximize cascades by seeding information (or
product) with certain influential nodes?
– These nodes can influence disproportionately many others
• What characteristics of influentials can help identify them?
– Credibility
– Expertise
– Enthusiasm
– Centrality?
• Need a large-scale, unbiased sample of observed cascades on
a social network
– Not only case studies of the most successful cascades
Twitter
• Users tweet short messages
– Retweet posts of others
– Tweets may contain URLs
• Distinct markers that allow us to
track diffusion
• Social networks
– Users follow ‘friends’ to see their
tweets
• “who listens to whom”
• Compare impact of different users by
measuring observed activity on Twitter
– User who “seeds” content (URL)
– Seed’s influence is measured by
number of users connected to her
who subsequently retweet the URL
Data set
• Tweets (diffusion events)
– 87M tweets containing a bit.ly URL that were broadcast
Sep 13, 2009 – Nov 15, 2009
– 1.6M seed users active both months who initiated 74M
cascades (46.3 cascades each on average)
• Follower graph (social network)
– Collect followers of every active user, and their followers,
and so on
– 56M users and 1.7B edges
Computing influence on Twitter
• Influence on Twitter = causing others to propagate
information (URL) to their followers
• User A influenced user B to retweet the URL if
– User A tweets first
– User A is a friend of B (B is a follower of A)
– But, what if more than one friend tweets before B?
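One way to make the attribution rule concrete is sketched below. It resolves the open question by crediting the earliest-posting friend, which is an assumption of this example (crediting the most recent friend, or splitting credit, are alternatives). The function names and toy data are hypothetical.

```python
def assign_influencers(posts, friends):
    """Credit each poster of a URL to the friend who posted it first.

    posts:   list of (time, user) for a single URL.
    friends: friends[u] is the set of users that u follows.
    Returns {user: influencer_or_None}; None marks a seed.
    """
    parent, seen = {}, []  # seen = users who already posted, in time order
    for _, user in sorted(posts):
        if user in parent:
            continue  # only a user's first post of the URL counts
        earlier = [u for u in seen if u in friends.get(user, set())]
        parent[user] = earlier[0] if earlier else None
        seen.append(user)
    return parent

def influence_of(seed, parent):
    """Number of users in the cascade below `seed` (seed excluded)."""
    children = {}
    for user, p in parent.items():
        children.setdefault(p, []).append(user)
    count, stack = 0, [seed]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            count += 1
            stack.append(child)
    return count

# Hypothetical example: c follows both a and b, who both posted earlier.
posts = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
friends = {"b": {"a"}, "c": {"a", "b"}, "d": {"c"}, "e": set()}
parent = assign_influencers(posts, friends)
```

Here `c` is credited to `a` rather than `b` because `a` posted first, so the cascade seeded by `a` contains `b`, `c`, and `d`, while `e` starts a cascade of its own that does not spread.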
Some hard facts about cascades
• Most URLs do not spread at all!
Predicting individual influence
• User’s influence = log(average size of cascades user seeds)
• What attributes of a user consistently predict influence?
– Seed user attributes
• # of followers
• # of friends
• # of tweets
• Date of joining
– Past influence of seed users
• Average, minimum, maximum total influence
• Average, minimum, maximum local influence
– Local influence is the number of retweets by the seed’s followers
• Train regression tree model on these attributes to predict
future influence
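To show the general idea of fitting a regression tree to such attributes, here is a deliberately tiny pure-Python sketch. It is not the model used in the paper: the greedy squared-error splitting, the depth limit, the two features (number of followers, past local influence), and every data value are assumptions for illustration.

```python
import math

def influence(cascade_sizes):
    """Influence as defined on the slide: log of the average cascade size."""
    return math.log(sum(cascade_sizes) / len(cascade_sizes))

def sse(ys):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(rows, ys):
    """Return the (feature, threshold) that most reduces SSE, or None."""
    best, best_err = None, sse(ys)
    for f in range(len(rows[0])):
        # Excluding the largest value keeps both sides of the split non-empty.
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, ys) if r[f] <= t]
            right = [y for r, y in zip(rows, ys) if r[f] > t]
            err = sse(left) + sse(right)
            if err < best_err - 1e-12:
                best, best_err = (f, t), err
    return best

def fit_tree(rows, ys, depth=2):
    """Return a leaf (mean of ys) or a tuple (feature, threshold, left, right)."""
    split = best_split(rows, ys) if depth > 0 and len(ys) > 1 else None
    if split is None:
        return sum(ys) / len(ys)
    f, t = split
    lo = [(r, y) for r, y in zip(rows, ys) if r[f] <= t]
    hi = [(r, y) for r, y in zip(rows, ys) if r[f] > t]
    return (f, t,
            fit_tree([r for r, _ in lo], [y for _, y in lo], depth - 1),
            fit_tree([r for r, _ in hi], [y for _, y in hi], depth - 1))

def predict(tree, row):
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Hypothetical users: (number of followers, past local influence).
rows = [(100, 0.0), (150, 0.1), (9500, 0.2), (8000, 2.5), (120, 0.0), (9000, 3.0)]
cascades = [[1, 1], [1, 2], [1, 1, 2], [5, 9], [1, 1], [8, 12]]
ys = [influence(c) for c in cascades]
tree = fit_tree(rows, ys)
```

In this made-up data the root split lands on past local influence rather than follower count, because one user with many followers has only small cascades.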
Regression tree (part)
Most informative features: past local influence, # followers
Influence as a function of most predictive features
All users
Top 25
Prediction performance of the regression tree
Most cascades
→ Large cascades are driven by previously influential users. But
the extreme rarity of such cascades means that most users
with these attributes will not be successful
Does content matter?
Do YouTube videos spread farther than niche news?
• Manually classify 1000 URLs using Amazon Mechanical Turk
– Type of URL
• Spam/not spam/unsure
• Media sharing/social networking
• Blog/forum
• News/mass media
• Other
– Category of URL
• Lifestyle, tech, offbeat, entertainment, gaming, science, news,
business, sports, other
Average cascade size of types of content
Impact of content on cascade size
• Content judged by humans to be more interesting, or elicit
more positive feeling spreads a little further on average
• However, none of the content features had predictive value!
Summary
• What features help us predict whether a cascade will reach
many people?
– Content itself has no predictive power
– Seed user’s features are somewhat predictive
• Number of followers
• Number of followers who retweet seeder’s posts
• Targeting influencers
– What viral marketing strategies reach wide audience at
minimal cost?
– May be more cost-effective to target many “ordinary
influencers” than few highly influential users
– Moot point, since “social epidemics” are so rare
Predicting influential users in online social networks
[Ghosh & Lerman]
• Questions
– Does network structure predict influence?
– Which metric should we use to measure centrality in a
particular social network?
• Findings
– The choice of the metric depends on the nature of
interactions
– Fundamental relationship between dynamic processes and
measurement of network structure
Centrality
• SNA metrics examine topology of the network to identify
important, or central nodes
– Degree
– Alpha-centrality [Bonacich, 1987]
– PageRank [Brin et al, 1998]
– Betweenness [Freeman, 1977]
Claim: The nature of interactions between nodes affects how we
measure network structure
→ Consequences for network analysis
The Gossips, 1948
Norman Rockwell (American, 1894-1978)
War News from Mexico, 1848
Richard Caton Woodville (American, 1825-1855)
Types of interactions
Two classes of interactions between network nodes
• Conservative interactions
– One-to-one, e.g., phone calls
– Modeled by random walk
• Non-conservative interactions
– One-to-many, e.g., epidemic, information diffusion
– Modeled by contact process
Centrality metrics
Centrality identifies important nodes in the network, e.g., those that
are often visited by a process
• Conservative interactions
– Random walk-based metrics, e.g., PageRank, …
• Non-conservative interactions
– Path-based metrics, e.g., Alpha-centrality, …
Node size ~ centrality
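The contrast between the two families of metrics can be sketched with two short power iterations. This is an illustrative sketch, not the normalized variant studied by Ghosh & Lerman: the damping factor, the value of alpha, the uniform exogenous term, and the toy graph are all assumptions. Edges point in the direction that information flows.

```python
def pagerank(adj, damping=0.85, iters=100):
    """Conservative: a node divides its score among its out-links."""
    nodes = sorted(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1 - damping) / n for v in nodes}
        for u in nodes:
            if adj[u]:
                share = damping * rank[u] / len(adj[u])
                for v in adj[u]:
                    new[v] += share
            else:
                # Dangling node: spread its score uniformly over all nodes.
                for v in nodes:
                    new[v] += damping * rank[u] / n
        rank = new
    return rank

def alpha_centrality(adj, alpha=0.2, iters=100):
    """Non-conservative: every out-link receives the full attenuated score.

    Iterates c = e + alpha * A^T c with e = 1 for every node; this
    converges only when alpha is below 1 / (largest eigenvalue of A).
    """
    nodes = sorted(adj)
    score = {v: 1.0 for v in nodes}
    for _ in range(iters):
        new = {v: 1.0 for v in nodes}
        for u in nodes:
            for v in adj[u]:
                new[v] += alpha * score[u]
        score = new
    return score

# Hypothetical graph: a broadcasts to b, c, d; b passes the item on to e.
adj = {"a": ["b", "c", "d"], "b": ["e"], "c": [], "d": [], "e": []}
pr = pagerank(adj)
ac = alpha_centrality(adj)
```

The difference shows up in the inner loops: PageRank divides a node's score by its out-degree, so the total is conserved, while alpha-centrality copies the attenuated score along every edge, as in a broadcast.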
Which centrality metric is right?
• Empirical study of influence in social media
– Data from Digg and Twitter about how information (URL) spreads on the
follower graph
(Figure: a submitter and their followers)
Ground truth
• Re-broadcasting (retweeting) provides ground truth for
measuring influence
• Empirical measure of influence/importance
1. average number of re-broadcasts by followers
2. average size of cascades a node triggers
• Rank nodes by the empirical measure → ground truth
• Compare rankings produced by centrality metrics to the
ground truth
Statistical significance of the influence metric
URN MODEL
(Figure: a post, its submitter, and the submitter's fans)
Online social network ↔ urn
– Users in OSN (N) ↔ balls in the urn (N)
– Fans of submitter in OSN (K) ↔ white balls in the urn (K)
– No. of users who voted (n) ↔ no. of balls picked (n)
– No. of fans who voted (k) ↔ no. of white balls picked (k)
Probability that exactly k of the n picked balls are white (Hypergeometric Dist.):
$$P(X = k \mid K, N, n) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}$$
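The urn model's hypergeometric probability is straightforward to evaluate with the standard library. The sketch below uses the slide's symbols (N users, K fans, n voters, k fan voters); the tail-probability helper and the numeric example are additions for illustration, showing one plausible way to judge whether an observed number of fan votes exceeds chance.

```python
from math import comb

def hypergeom_pmf(k, K, N, n):
    """P(X = k | K, N, n): exactly k white balls among n drawn from an
    urn of N balls, K of which are white."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def p_at_least(k, K, N, n):
    """P(X >= k): chance of seeing k or more fan votes by luck alone."""
    return sum(hypergeom_pmf(i, K, N, n) for i in range(k, min(K, n) + 1))

# Hypothetical numbers: 10 users, 4 of them fans, 3 voters, 2 of them fans.
p_exact = hypergeom_pmf(2, K=4, N=10, n=3)
p_tail = p_at_least(2, K=4, N=10, n=3)
```

A small tail probability means the number of fan votes is unlikely under random voting, which is the sense in which the influence measure can be called statistically significant.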
Metric is statistically significant
(Figure panels: Digg, Twitter)
Which centrality metric is right for social media?
Correlation between the ground truth and rankings predicted
by Alpha-Centrality and PageRank
(Figure panels: Digg, Twitter)
Non-conservative Alpha-Centrality best predicts node centrality
(since information flow in social media is non-conservative)
Summary
• Network structure measurements
– How we measure network structure depends on the
nature of interactions between nodes
– Affects how we compute centrality, strength of ties, and
communities