Phenomenology of Social Media Kristina Lerman University of Southern California

Phenomenology of Social Media
Kristina Lerman
University of Southern California
CS 599: Social Media Analysis
Phenomenology and phenomenological models
• What phenomena can be observed in social media?
– Look for patterns and regularities in aggregate behavior of
a large population of users
– Average behavior? Distribution of behavior
• What mechanisms explain the observed phenomena?
– Simple rules
– Express rules through mathematical models to reproduce
observed regularities
– Link simple rules to psychological or sociological theories
Characteristic response to news events
[Figure: three panels with y-axis "% of all tweets". Twitter data from trendistic]
Response to news on Digg
Votes per hour received by story
Popularity (total votes) over time
Dynamics of response on Digg and Twitter
Digg
1: U.S. Government Asks Twitter to Stay Up
for #IranElection
2: Western Corporations Helped Censor
Iranian Internet
3: Iranian clerics defy ayatollah, join protests
Twitter
1: US gov asks twitter to stay up
2: Iran Has Built a Censorship Monster with
help of west tech
3: Clerics join Iran’s anti-government
protests - CNN.com
Topics covered
• Social information processing in social news, by Lerman
– Can simple models explain dynamics of popularity of news
stories?
• Strong regularities in online peer production, by Wilkinson
– Common properties of distribution of user and topic
activity across different systems
• Influence and Correlation in Social Networks, by
Anagnostopoulos, Kumar and Mahdian, KDD 2008
– Do people influence the behavior of others? How can we
tell?
The Wizards of Buzz
A new kind of Web site is turning ordinary people into hidden influencers, shaping what
we read, watch and buy.
By JAMIN WARREN and JOHN JURGENSEN
February 10, 2007; Page P1
… A new generation of hidden influencers is taking root online, fueled by a growing
love affair among Web sites with letting users vote on their favorite submissions. These
sites are the next wave in the social-networking craze -- popularized by MySpace and
Facebook. Digg is one of the most prominent of these sites, which are variously labeled
social bookmarking or social news. Others include Reddit.com (recently purchased by
Condé Nast), Del.icio.us (bought by Yahoo), Newsvine.com and StumbleUpon.com.
Netscape relaunched last June with a similar format.
The opinions of these key users have implications for advertisers shelling out money for
Internet ads, trend watchers trying to understand what's cool among young people, and
companies whose products or services get plucked for notice. It's even sparking a new
form of payola, as marketers try to buy votes.
Social news on Digg
Front page: 100 stories
promoted daily
Upcoming stories: 25,000+
submitted daily (2009)
promoted
Social networks: follow friends to get relevant news
Stories friends
voted on
Stories friends
submitted
Top users
• Digg ranked users by
the number of
submitted stories that
were promoted to the
front page
• Displayed Top Users List
to motivate users to
contribute
Troubles in Diggville
Michael Arrington. 09/06/2006
The incredibly successful news site Digg has hit a few speed bumps recently… A
number of people have recently complained about the ability for groups of users
to get a story to the home page by acting as a group. [One] blogger analyzed
Digg and concluded that a small group of powerful Digg users, acting together,
control a large percentage of total home page stories.
To some this is troubling because… unlike newspapers like the New York Times,
where a small group of editors decide what is “news,” Digg is a more democratic
process where the readers actually decide what is newsworthy. …Others respond
that these groups are just hard core Digg users that spend much of their day
scouring the web for good stories to promote on Digg. Digg ranks users based on
how successful their submitted stories become, and a handful of users are hypercompetitive about their Digg ranking. The argument is that these users are
simply more proficient at finding stories.
Today Digg responded to these complaints. …it will soon be implementing a new
algorithm that weighs a diversified group of Diggers more heavily than groups
acting together.
User success correlated with social network size
• Observation
– Users with more friends and
followers have more stories
promoted to the front page
• Conspiracy? Or natural
outcome of social voting?
[Figure: success (fraction of user’s stories promoted to front page) vs. social
network size (number of followers)]
– Conspiracy
• Users conspire to promote each
others’ stories
– Social voting
• Users look at friends’ posts to
discover interesting stories
Social voting
• Claim: Users tend to digg (vote for) stories friends submit
– We show this by demonstrating that it is highly unlikely to observe
as many follower votes purely by chance
[Figure: average number of follower votes, <k>, for stories a user submits vs.
the number of followers, K, the user has]
Could this happen purely by chance?
Urn model: voting as a stochastic process
• Assume there are N balls in an urn, K of which are white.
Suppose n balls are picked at random from the urn. What is
the probability that k are white?
K white balls in urn
Pick n balls from urn at random
Probability that k balls are white
Urn model: voting as a stochastic process
• Assume there are N users, K of whom follow the story
submitter. Suppose n users vote for the story. What is the
probability that k of them happen to be submitter’s followers?
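Probability that k of the first n votes are from submitter’s followers
(hypergeometric distribution):
$P(k \mid K, n) = \binom{K}{k}\binom{N-K}{n-k} \Big/ \binom{N}{n}$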
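The urn model can be checked numerically. The following is a minimal sketch, assuming votes are drawn uniformly at random without replacement from all N users (the null hypothesis of "no social voting"); the function names are illustrative and not from the lecture.

```python
from math import comb

def prob_follower_votes(N, K, n, k):
    """P(exactly k of n random voters are followers), hypergeometric.

    N: total users, K: followers of the submitter, n: votes cast.
    """
    # k must be feasible: at most min(n, K), at least n - (N - K).
    if k < max(0, n - (N - K)) or k > min(n, K):
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def prob_at_least(N, K, n, k):
    """P(k or more follower votes) under the same random-voting model."""
    return sum(prob_follower_votes(N, K, n, j) for j in range(k, min(n, K) + 1))

def expected_follower_votes(N, K, n):
    """Mean number of follower votes if voters were picked at random."""
    return n * K / N
```

If the observed number of follower votes is far above `expected_follower_votes` and `prob_at_least` is tiny, chance alone is an unlikely explanation.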
[Figure: average number of follower votes, <k>, for stories a user submits vs.
the number of followers, K, the user has]
⇒ For submitters with K>100 followers, it is highly unlikely to observe that many votes
from followers by chance. Therefore, users vote for stories friends submit.
Dynamics of social voting
User interface
Story popularity
Despite differences, each
story (colored line) has similar
dynamics of popularity
Mathematical model of social news
[Diagram: a user navigates Digg by browsing the front page, the friends
interface, or the upcoming pages (each paginated 1, 2, …), views a story there,
and then decides whether it is interesting]
Mathematical model of social news
[Diagram, annotated: each path leads from browsing to viewing a story to the
question "interesting?", which is governed by r]
– Browse front page: probability to view the story on the front page
– Browse friends: probability to view it in the social stream
– Browse upcoming: probability to view it on upcoming pages
Mathematical model of social news
[Diagram: browse (front page, friends, upcoming) → view story → interesting?,
with parameters labeled]
– N = number of Digg users
– vf = visibility on front page
– r = story interestingness
Model has only one adjustable parameter (r). Other parameters
are measured from data.
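To make the structure of such a model concrete, here is a toy sketch. It assumes that the vote rate is the product of interestingness r, the number of users, and a time-dependent visibility (probability per hour that a user sees the story). The exponential visibility function and all default values are hypothetical placeholders, not the forms measured in the lecture's data, and the social-stream channel is omitted for brevity.

```python
def front_page_visibility(t, v0=0.01, half_life=2.0):
    """Hypothetical visibility: probability per hour that a user sees the story.

    Decays as newer stories push this one down the page (assumed exponential).
    """
    return v0 * 0.5 ** (t / half_life)

def simulate_votes(r, n_users, visibility, hours, dt=0.1):
    """Euler integration of dV/dt = r * n_users * visibility(t).

    r is the probability that a user who sees the story votes for it.
    Returns the cumulative vote count after each time step.
    """
    votes = 0.0
    history = []
    steps = int(round(hours / dt))
    for i in range(steps):
        t = i * dt
        votes += r * n_users * visibility(t) * dt
        history.append(votes)
    return history
```

In this sketch the curve saturates because visibility decays, and stories with larger r saturate at proportionally higher vote counts.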
Probability to view the story on the front page
• Newer stories push a given story down the page, and on to page 2, 3, …
• A given story is less likely to be seen over time [phenomenological]
[Figure labels: upcoming, promoted story, front page]
Dynamics of social voting: model prediction
Evolution of popularity of six
real Digg stories. S is number
of submitter’s followers
Model predictions. Values of
story interestingness (r) are
estimated from data
Popular submitter advantage
[Figure legend: promoted story, not promoted, promotion threshold; 2006 data]
⇒ Less interesting (lower r) stories submitted by popular users (many followers)
will be promoted to the front page (no need for conspiracy theories)
Predict popularity
– Estimate how interesting a story is based on early votes
– Solve model for later times to predict future votes
[Figure: votes vs. time (hours), with prediction time t marked]
[Hogg & Lerman, “Social Dynamics of Digg”, EPJ Data Science, 2012]
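The two prediction steps can be sketched under a strong simplifying assumption: expected votes equal r times the cumulative number of views, where views follow from a known visibility function. The helper names and the visibility function passed in are hypothetical; the published model is richer than this.

```python
def cumulative_views(n_users, visibility, hours, dt=0.01):
    """Midpoint-rule integral of n_users * visibility(t) from 0 to `hours`."""
    steps = int(round(hours / dt))
    return sum(n_users * visibility((i + 0.5) * dt) * dt for i in range(steps))

def estimate_r(early_votes, n_users, visibility, t_early):
    """Step 1: interestingness = early votes divided by early views."""
    return early_votes / cumulative_views(n_users, visibility, t_early)

def predict_votes(early_votes, n_users, visibility, t_early, t_late):
    """Step 2: run the same model forward to a later time using the estimate."""
    r_hat = estimate_r(early_votes, n_users, visibility, t_early)
    return r_hat * cumulative_views(n_users, visibility, t_late)
```

With a constant visibility the forecast is a straight-line extrapolation; a decaying visibility makes the predicted curve flatten over time.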
Summary
• People use their social networks to find interesting content
– E.g., see stories friends post
– This affects how popular stories become and how
successful users are in having their stories promoted to
the front page
• Popular submitter advantage
• Simple phenomenological model explains dynamics of social
voting
– Story visibility (on front page, upcoming stories page,
social stream): all parameters measured from data
– Story interestingness: only adjustable parameter
Model explains and predicts story popularity
Strong regularities in social media (Wilkinson, 2008)
• Questions
– Are there regular patterns in the collective behavior of
social media users?
– Are there simple explanations of these regularities?
• Findings
– Heterogeneous distribution of user activity
• Small number of active users make most of the contributions
– Activity depends on level of effort
– Regularities can arise from simple dynamical rules
Social systems are complex but predictable
• Social systems are complex
– Many users
• High degree of variability in people’s decisions to participate
– Many possible interactions
• High degree of variability in people’s reactions to others
• Low barriers to interaction
• Social systems are predictable
– Macroscopic (large-scale) regularities in collective
behavior of large population
– Simple dynamical rules explain regularities
• Not psychological or sociological principles
• Distinguish between general and system-level trends
– Lots of data for empirical analysis!
Systems and data
System      Time span of data    Users    Topics   Contributions
Wikipedia   6 years 10 months    5.07M    1.50M    50.0M
Bugzilla    9 years 7 months     111K     357K     3.08M
Digg        3 years              1.05M    3.57M    105M
Essembly    1 year 4 months      12.4K    24.9K    1.31M
• Wikipedia: online encyclopedia
– Articles (topics), non-robot edits (contributions)
• Bugzilla: open source software development service
– Reported bugs (topics), discussion comments
(contributions)
• Digg: social news aggregator
– New articles (topics), votes (contributions)
• Essembly: online political forum
– Political resolves (topics), votes (contributions)
User participation: distribution of the number of
contributions
[Figure panels: Digg & Essembly votes; Bugzilla comments & Essembly resolve
submissions; Wikipedia edits & Digg story submissions]
Power law distribution of contribution
Power law behavior
• Number of users who made k contributions: $N(k) = C k^{-a}$
• Participation “momentum”: probability user quits after kth
contribution
$P(\text{stop} \mid k) = \frac{C k^{-a}}{C \sum_{b=0}^{\infty} (k+b)^{-a}} = \frac{1}{\sum_{b=0}^{\infty} (1 + b/k)^{-a}} \approx \frac{a-1}{k}$
– The more contributions made, the harder to quit
– Exponent a represents barrier to participation
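The 1/k behavior of the quitting probability can be checked numerically. This sketch truncates the infinite sum at a fixed number of terms (an assumption that is adequate for a well above 1) and compares it with the approximation (a-1)/k; the function names are illustrative.

```python
def stop_probability(k, a, terms=200_000):
    """1 / sum_{b>=0} (1 + b/k)^(-a), with the sum truncated at `terms`.

    Requires a > 1, otherwise the sum diverges.
    """
    total = sum((1.0 + b / k) ** (-a) for b in range(terms))
    return 1.0 / total

def stop_probability_approx(k, a):
    """Large-k approximation obtained by replacing the sum with an integral."""
    return (a - 1.0) / k
```

Replacing the sum by k times the integral of (1+x)^(-a) from 0 to infinity gives k/(a-1), which is where the approximation comes from.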
Contribution effort and power law exponent
The larger the value of a, the greater the
effort required to contribute
• Easy
– Digg and Essembly voting requires
little time or personal investment
• Moderately difficult
– Making a Bugzilla comment or
submitting a new resolve on
Essembly
• Difficult
– Submitting a new Digg story, or
editing Wikipedia page
Effort                 Contribution type       a
Easy                   Essembly votes          1.47
Easy                   Digg votes              1.53
Moderately difficult   Bugzilla comments       1.98
Moderately difficult   Essembly submissions    2.02
Difficult              Wikipedia edits         2.28
Difficult              Digg submissions        2.4
Contribution effort and power law exponent (continued)
[Figure: contribution types ordered from easy to difficult]
Topic activity
• How much activity does a single topic generate?
[Figure panels: number of edits of a Wikipedia article; number of votes for an
Essembly resolve]
⇒ Distribution is log-normal (normal distribution of log(x))
Where does log-normal come from?
• Multiplicative reinforcement as a model for log-normal
distribution
– Amount of new activity proportional to amount of existing
activity
• E.g., popularity (amount of activity) raises visibility, creating new
activity
– Phenomenological mathematical model
$dN_t = (m + s\,dB_t)\,N_t\,dt$
• N_t: number of contributions on a topic until time t
• m: average rate of contribution (independent of topic, time)
• s dB_t: stochastic noise accounting for fluctuations in human
behavior, with variance s
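A short simulation illustrates why multiplicative reinforcement yields a log-normal. The sketch assumes a discrete-time version of the model in which each step multiplies the current count by exp(m + s·ξ) with ξ standard normal; this discretization and the parameter values in the usage note are illustrative choices, not taken from the paper.

```python
import math
import random
import statistics

def simulate_topic(steps, m, s, rng, n0=1.0):
    """Grow one topic's activity by random multiplicative factors.

    Each step multiplies the count by exp(m + s * xi), xi ~ N(0, 1),
    so log(count) is a sum of independent normal increments.
    """
    n = n0
    for _ in range(steps):
        n *= math.exp(m + s * rng.gauss(0.0, 1.0))
    return n

def log_stats(samples):
    """Mean and standard deviation of log(x) over the simulated topics."""
    logs = [math.log(x) for x in samples]
    return statistics.mean(logs), statistics.stdev(logs)
```

For example, simulating a few thousand topics with `random.Random(0)` and passing the results to `log_stats` should give a mean near `steps * m` and a standard deviation near `s * sqrt(steps)`, while the raw counts are strongly right-skewed.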
Summary
Macroscopic properties of diverse social media systems where
people create, rate and share content are very similar and can
be explained in terms of simple dynamical rules
• User participation described by a power law
– Explained by “momentum” associated with participation,
where probability of quitting is inversely proportional to
the number of previous contributions
– Power law exponent related to effort required to
contribute
• Topic activity described by a log normal
– Explained by a multiplicative reinforcement mechanism in
which contributions increase popularity
• Systems depend on heavy contributors and popular topics
Influence and correlation in social networks
(Anagnostopoulos et al.)
• Questions
– Do social networks shape user behavior?
– How can we identify social influence and distinguish it
from other factors, such as homophily or other
confounding variables?
• Contributions
– Statistical test to identify influence as source of social
correlation
• Findings
– Correlation in tagging behavior on Flickr cannot be
attributed to social influence
Correlation in social networks
• Online social networks are important in shaping user
behavior. As a result, social behavior is often correlated.
• What is the source of correlation?
– Influence? Homophily? Confounding?
[Diagram: connected users A and B; the tag "donut" is used at time t1 and again
at time t2]
Sources of social correlation
Confounding
Correlation through
external (environmental)
factors, e.g., users
posted pictures of the
same place since they
live in the same city
Homophily
A and B became friends
because they are similar
to each other; therefore,
they perform similar
actions
Influence
B’s action is caused by
A’s action.
If correlation is caused
by influence, we can
leverage it to amplify
diffusion
[Diagrams, one per source of correlation: users A and B, with an external
factor X where applicable]
Correlation models
Confounding & homophily
Network G
Set of active users W
Select (G,W) according to joint
probability distribution
Time of activation of users in W
is picked from distribution T
Influence
At each time step, a non-active
user becomes active with
probability p(a), where a is
number of her active friends
[Figures: activation-time distribution T; p(a) vs. a = # active friends]
Measuring social correlation
• What is the form of p(a)? Empirically, from Flickr tags data:
$p(a) = \frac{e^{\alpha \ln(a+1) + \beta}}{1 + e^{\alpha \ln(a+1) + \beta}}$
– Parameter α measures the amount of social correlation
• Estimate α, β using maximum likelihood logistic regression
– Let Y_{a,t} be the number of users with a active friends who
performed the action at time t, and $Y_a = \sum_t Y_{a,t}$
– Let N_{a,t} be the number of such users who did not perform the action,
and $N_a = \sum_t N_{a,t}$
– Choose values of α, β that maximize
$\prod_a p(a)^{Y_a} (1 - p(a))^{N_a}$
– a, Y_a, N_a are observed
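The maximum likelihood fit can be sketched with Newton's method on the grouped counts. The code calls the two logistic parameters `alpha` and `beta`; the input format (a dict mapping a to the pair (Y_a, N_a)) and the function names are assumptions made for illustration, and a production analysis would more likely use a statistics library.

```python
import math

def p_active(a, alpha, beta):
    """Logistic activation probability with ln(a + 1) as the covariate."""
    z = alpha * math.log(a + 1) + beta
    return 1.0 / (1.0 + math.exp(-z))

def fit_alpha_beta(counts, iterations=50):
    """Maximize prod_a p(a)^Y_a (1 - p(a))^N_a by Newton-Raphson.

    counts: dict mapping a -> (Y_a, N_a).
    """
    alpha, beta = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for a, (y, n) in counts.items():
            x = math.log(a + 1)
            p = p_active(a, alpha, beta)
            total = y + n
            resid = y - total * p          # gradient contribution
            w = total * p * (1.0 - p)      # curvature weight
            g_a += resid * x
            g_b += resid
            h_aa += w * x * x
            h_ab += w * x
            h_bb += w
        det = h_aa * h_bb - h_ab * h_ab
        if det <= 1e-12:
            break  # degenerate data (e.g. a single value of a)
        d_alpha = (h_bb * g_a - h_ab * g_b) / det
        d_beta = (h_aa * g_b - h_ab * g_a) / det
        alpha += d_alpha
        beta += d_beta
        if abs(d_alpha) + abs(d_beta) < 1e-10:
            break
    return alpha, beta
```

A larger fitted `alpha` means the activation probability rises faster with the number of active friends, i.e. stronger social correlation.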
The shuffle test
• Does influence give rise to the observed series of user actions?
[Diagram: users A–J act at times t1–t10]
– Estimate social correlation α using maximum likelihood
• Shuffle actions in time
[Diagram: the same ten actions with their time stamps randomly permuted]
– Estimate social correlation α’ using maximum likelihood
⇒ there is no social influence if α ≈ α’
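The bookkeeping behind the shuffle test can be sketched as follows. The sketch assumes integer time steps, a dict `friends` mapping each user to the set of users whose activity they can see, and a dict of activation times; these data structures and names are illustrative. Fitting the correlation parameter to the resulting tallies is a separate logistic-regression step.

```python
import random

def tally(friends, activation_time, horizon):
    """Return {a: [Y_a, N_a]} accumulated over time steps 1..horizon.

    At each step, every not-yet-active user with `a` already-active
    friends adds to Y_a if they activate at that step, else to N_a.
    """
    counts = {}
    for t in range(1, horizon + 1):
        for user, nbrs in friends.items():
            t_user = activation_time.get(user)
            if t_user is not None and t_user < t:
                continue  # already active before this step
            a = sum(1 for v in nbrs
                    if activation_time.get(v, horizon + 1) < t)
            y_n = counts.setdefault(a, [0, 0])
            if t_user == t:
                y_n[0] += 1
            else:
                y_n[1] += 1
    return counts

def shuffle_times(activation_time, rng):
    """Randomly permute activation times among the users who acted."""
    users = list(activation_time)
    times = [activation_time[u] for u in users]
    rng.shuffle(times)
    return dict(zip(users, times))
```

Running `tally` on the original times and on `shuffle_times(...)` gives the two sets of counts from which the original and shuffled estimates are fitted and compared.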
Edge reversal test
• Alternate statistical test
[Diagram: users A–J act at times t1–t10 on the original directed graph]
– Estimate parameter α
• Reverse direction of edges in a (directed friendship) graph
[Diagram: the same users and times, with every edge direction reversed]
– Estimate α’
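Reversing a directed friendship graph is a one-pass operation. This sketch assumes the graph is stored as a dict mapping each user to the set of users they point to; the representation and function name are illustrative. Presumably, if influence flows along edge direction, the estimate on the reversed graph should differ noticeably from the original one.

```python
def reverse_edges(follows):
    """Return a new graph in which every edge u -> v becomes v -> u.

    follows: dict mapping user -> set of users they point to.
    Users with no incoming edges keep an empty set in the result.
    """
    reversed_graph = {u: set() for u in follows}
    for u, targets in follows.items():
        for v in targets:
            reversed_graph.setdefault(v, set()).add(u)
    return reversed_graph
```

The parameter is then re-estimated on `reverse_edges(graph)` with the activation times left unchanged.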
Validation on synthetic data
Generate activations (users adopting a new tag) according to
specific rules
1. No-correlation model
– At each time step, pick new users of a tag uniformly at
random
2. Influence model
– At each time step, an inactive user becomes active with
probability p(a), where a is the number of active friends
– Probability parameterized by α
3. Correlation model (no-influence)
– Select S users, and add their neighbors and neighbors of
neighbors to S
– Select active users randomly, as in model 1.
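Two of the generators above can be sketched in a few lines. The sketch assumes integer time steps, a logistic p(a) with ln(a + 1) as covariate, and a fixed number of new users per step in the no-correlation model; the function names and the `per_step` parameter are hypothetical.

```python
import math
import random

def simulate_influence(friends, alpha, beta, steps, rng):
    """Influence model: an inactive user activates with probability p(a).

    friends: dict mapping user -> set of friends. Returns {user: time}.
    """
    activation_time = {}
    for t in range(1, steps + 1):
        newly_active = []
        for user, nbrs in friends.items():
            if user in activation_time:
                continue
            # Count friends who became active in earlier steps only.
            a = sum(1 for v in nbrs if v in activation_time)
            z = alpha * math.log(a + 1) + beta
            if rng.random() < 1.0 / (1.0 + math.exp(-z)):
                newly_active.append(user)
        for user in newly_active:
            activation_time[user] = t
    return activation_time

def simulate_no_correlation(users, per_step, steps, rng):
    """No-correlation model: new users are picked uniformly at random."""
    inactive = list(users)
    rng.shuffle(inactive)
    activation_time = {}
    for t in range(1, steps + 1):
        for _ in range(min(per_step, len(inactive))):
            activation_time[inactive.pop()] = t
    return activation_time
```

Because the ground truth is known for each generator, the output can be used to check whether a test flags influence only when influence was actually simulated.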
Measuring correlation strength in synthetic data
• Frequency distribution (histogram) of α measured from data
[Figure panels: no correlation; influence model; correlation model]
Distinguishing influence: shuffle test
• Measured α of original and shuffled tagging time steps
[Figure panels: influence model; correlation model]
⇒ Values of α are close: no influence
Experiments on Flickr data
• Tagging behavior of Flickr users over a period of 16 months
– 340K users tagged a photo at least once
– 160K of these were connected
• 2.8M edges
• Rest are isolated
– Selected 1.7K of 10K tags these users used
• Most were used by more than 1K users
• “halloween”, “katrina”, “photos”, “moon”, etc.
Correlation and influence on Flickr
• Measuring correlation
– Correlation exists: α > 0
• Distinguishing influence: α of original vs. shuffled time steps for each tag
– Correlation cannot be attributed to influence
Summary
• Proposed statistical analysis to identify and measure social
influence as a source of correlation between the actions of
individuals with social ties.
– Distinguishing correlation from causation
– Availability of time-resolved data about human behavior
enables us to tackle this difficult problem
• Applied to data from a large social system
– There is correlation, but it cannot be explained by
influence