Phenomenology of Social Media
Kristina Lerman
University of Southern California
CS 599: Social Media Analysis

Phenomenology and phenomenological models
• What phenomena can be observed in social media?
  – Look for patterns and regularities in the aggregate behavior of a large population of users
  – Average behavior? Distribution of behavior?
• What mechanisms explain the observed phenomena?
  – Simple rules
  – Express rules through mathematical models to reproduce observed regularities
  – Link simple rules to psychological or sociological theories

Characteristic response to news events
[Figure: three panels, each plotting % of all tweets. Twitter data from trendistic]

Response to news on Digg
[Figure: votes per hour received by a story; popularity (total votes) over time]

Dynamics of response on Digg and Twitter
Digg
1: U.S. Government Asks Twitter to Stay Up for #IranElection
2: Western Corporations Helped Censor Iranian Internet
3: Iranian clerics defy ayatollah, join protests
Twitter
1: US gov asks twitter to stay up
2: Iran Has Built a Censorship Monster with help of west tech
3: Clerics join Iran’s anti-government protests - CNN.com

Topics covered
• Social information processing in social news, by Lerman
  – Can simple models explain dynamics of popularity of news stories?
• Strong regularities in online peer production, by Wilkinson
  – Common properties of distribution of user and topic activity across different systems
• Influence and Correlation in Social Networks, by Anagnostopoulos, Kumar and Mahdian, KDD 2008
  – Do people influence the behavior of others? How can we tell?

The Wizards of Buzz
A new kind of Web site is turning ordinary people into hidden influencers, shaping what we read, watch and buy.
By JAMIN WARREN and JOHN JURGENSEN
February 10, 2007; Page P1
… A new generation of hidden influencers is taking root online, fueled by a growing love affair among Web sites with letting users vote on their favorite submissions.
These sites are the next wave in the social-networking craze -- popularized by MySpace and Facebook. Digg is one of the most prominent of these sites, which are variously labeled social bookmarking or social news. Others include Reddit.com (recently purchased by Condé Nast), Del.icio.us (bought by Yahoo), Newsvine.com and StumbleUpon.com. Netscape relaunched last June with a similar format. The opinions of these key users have implications for advertisers shelling out money for Internet ads, trend watchers trying to understand what's cool among young people, and companies whose products or services get plucked for notice. It's even sparking a new form of payola, as marketers try to buy votes.

Social news on Digg
Front page: 100 stories promoted daily
Upcoming stories: 25,000+ submitted daily (2009)
Social networks: follow friends to get relevant news
  – Stories friends voted on
  – Stories friends submitted

Top users
• Digg ranked users by the number of submitted stories that were promoted to the front page
• Displayed Top Users List to motivate users to contribute

Troubles in Diggville
Michael Arrington. 09/06/2006
The incredibly successful news site Digg has hit a few speed bumps recently… A number of people have recently complained about the ability for groups of users to get a story to the home page by acting as a group. [One] blogger analyzed Digg and concluded that a small group of powerful Digg users, acting together, control a large percentage of total home page stories. To some this is troubling because… unlike newspapers like the New York Times, where a small group of editors decide what is “news,” Digg is a more democratic process where the readers actually decide what is newsworthy. …Others respond that these groups are just hard core Digg users that spend much of their day scouring the web for good stories to promote on Digg.
Digg ranks users based on how successful their submitted stories become, and a handful of users are hypercompetitive about their Digg ranking. The argument is that these users are simply more proficient at finding stories. Today Digg responded to these complaints. …it will soon be implementing a new algorithm that weighs a diversified group of Diggers more heavily than groups acting together.

User success correlated with social network size
• Observation
  – Users with more friends and followers have more stories promoted to the front page
• Conspiracy? Or natural outcome of social voting?
  – Conspiracy
    • Users conspire to promote each other's stories
  – Social voting
    • Users look at friends’ posts to discover interesting stories
[Figure: success (fraction of a user’s stories promoted to the front page) vs. social network size (number of followers)]

Social voting
• Claim: Users tend to digg (vote for) stories their friends submit
  – We will show this by demonstrating that it is highly unlikely to observe as many follower votes purely by chance
[Figure: average number of followers who vote for stories a user submits, ⟨k⟩, vs. the number of followers the user has, K. Could this happen purely by chance?]

Urn model: voting as a stochastic process
• Assume there are N balls in an urn, K of which are white. Suppose n balls are picked at random from the urn. What is the probability that k are white?
  – Drawing without replacement gives the hypergeometric distribution: $P(k) = \binom{K}{k}\binom{N-K}{n-k} \big/ \binom{N}{n}$
[Diagram: K white balls in urn; pick n balls from the urn at random; probability that k balls are white]

Urn model: voting as a stochastic process
• Assume there are N users, K of whom follow the story submitter. Suppose n users vote for the story. What is the probability that k of them happen to be the submitter’s followers?
[Diagram: probability that k of the first n votes are from the submitter’s followers]
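The urn question is a hypergeometric probability, so the chance of seeing at least k follower votes among the first n votes can be computed directly. The sketch below is illustrative only: the function names and the values of N, K, n and k are hypothetical and are not taken from the Digg data.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """P(exactly k of n draws are white) for an urn with N balls, K of them white."""
    if k < max(0, n - (N - K)) or k > min(n, K):
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def hypergeom_tail(k, N, K, n):
    """P(at least k of n draws are white)."""
    return sum(hypergeom_pmf(j, N, K, n) for j in range(k, min(n, K) + 1))


# Hypothetical numbers, chosen only to illustrate the calculation.
N = 100_000  # users who could vote
K = 200      # followers of the submitter
n = 30       # first votes the story receives
k = 5        # of those votes, how many came from followers

expected_by_chance = n * K / N            # mean of the hypergeometric distribution
p_value = hypergeom_tail(k, N, K, n)      # chance of k or more follower votes
print(f"expected follower votes by chance: {expected_by_chance:.3f}")
print(f"P(at least {k} follower votes by chance): {p_value:.3g}")
```

If the tail probability is very small, random voting is an implausible explanation for the observed number of follower votes, which is the logic of the argument on these slides.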
[Figure: average number of followers who vote for stories a user submits, ⟨k⟩, vs. the number of followers the user has, K]
For submitters with K > 100 followers, it is highly unlikely to observe that many votes from followers by chance. Therefore, users tend to vote for stories their friends submit.

Dynamics of social voting
[Figure: Digg user interface; story popularity over time]
Despite differences, each story (colored line) has similar dynamics of popularity.

Mathematical model of social news
[Diagram: a user browses the front page, the friends interface, or the upcoming pages; navigates through pages 1, 2, …; views a story; and votes if the story is interesting]
• Probability to view the story on the front page
• Probability to view it in the social stream
• Probability to view it on upcoming pages
• Interesting? A user who views the story finds it interesting with probability r
• N = number of Digg users, vf = visibility on front page, r = story interestingness
• Model has only one adjustable parameter (r). Other parameters are measured from data.

Probability to view the story on the front page
• Newer stories push a given story down the page, and on to page 2, 3, …
• A given story is less likely to be seen over time [phenomenological]
[Figure labels: upcoming; promoted story; front page]

Dynamics of social voting: model prediction
Evolution of popularity of six real Digg stories. S is the number of the submitter’s followers.
Model predictions.
Values of story interestingness (r) are estimated from data.

Popular submitter advantage
[Figure: promoted vs. not promoted stories, with promotion threshold; 2006 data]
Less interesting (lower r) stories submitted by popular users (many followers) will still be promoted to the front page (no need for conspiracy theories).

Predict popularity
  – Estimate how interesting a story is based on early votes
  – Solve the model for later times to predict future votes
[Figure: votes vs. time (hours), with prediction time t]
[Hogg & Lerman, “Social Dynamics of Digg” in EPJ Data Science, 2012]

Summary
• People use their social networks to find interesting content
  – E.g., see stories friends post
  – This affects how popular stories become and how successful users are in having their stories promoted to the front page
    • Popular submitter advantage
• Simple phenomenological model explains dynamics of social voting
  – Story visibility (on front page, upcoming stories page, social stream): all parameters measured from data
  – Story interestingness: only adjustable parameter
Model explains and predicts story popularity

Strong regularities in social media (Wilkinson, 2008)
• Questions
  – Are there regular patterns in the collective behavior of social media users?
  – Are there simple explanations of these regularities?
• Findings
  – Heterogeneous distribution of user activity
    • Small number of active users make most of the contributions
  – Activity depends on level of effort
  – Regularities can arise from simple dynamical rules

Social systems are complex but predictable
• Social systems are complex
  – Many users
    • High degree of variability in people’s decisions to participate
  – Many possible interactions
    • High degree of variability in people’s reactions to others
    • Low barriers to interaction
• Social systems are predictable
  – Macroscopic (large-scale) regularities in collective behavior of large population
  – Simple dynamical rules explain regularities
    • Not psychological or sociological principles
    • Distinguish between general and system-level trends
  – Lots of data for empirical analysis!

Systems and data
System      Time span of data    Users    Topics   Contributions
Wikipedia   6 years 10 months    5.07M    1.50M    50.0M
Bugzilla    9 years 7 months     111K     357K     3.08M
Digg        3 years              1.05M    3.57M    105M
Essembly    1 year 4 months      12.4K    24.9K    1.31M
• Wikipedia: online encyclopedia
  – Articles (topics), non-robot edits (contributions)
• Bugzilla: open source software development service
  – Reported bugs (topics), discussion comments (contributions)
• Digg: social news aggregator
  – News articles (topics), votes (contributions)
• Essembly: online political forum
  – Political resolves (topics), votes (contributions)

User participation: distribution of the number of contributions
[Figure panels: Digg & Essembly votes; Bugzilla comments & Essembly resolve submissions; Wikipedia edits & Digg story submissions]
Power law distribution of contribution

Power law behavior
• Number of users who made k contributions: $N(k) = C k^{-a}$
• Participation “momentum”: probability that a user quits after the kth contribution
  $P(\text{stop after } k) = \dfrac{C k^{-a}}{C \sum_{b=0}^{\infty} (k+b)^{-a}} = \dfrac{1}{\sum_{b=0}^{\infty} (1 + b/k)^{-a}} \approx \dfrac{a-1}{k}$
  – The more contributions made, the harder to quit
  – Exponent a represents barrier to participation

Contribution effort and power law exponent
The larger the value of a, the greater the effort required to
contribute
• Easy
  – Digg and Essembly voting requires little time or personal investment
• Moderately difficult
  – Making a Bugzilla comment or submitting a new resolve on Essembly
• Difficult
  – Submitting a new Digg story, or editing a Wikipedia page

Effort                 Contribution type       a
Easy                   Essembly votes          1.47
Easy                   Digg votes              1.53
Moderately difficult   Bugzilla comments       1.98
Moderately difficult   Essembly submissions    2.02
Difficult              Wikipedia edits         2.28
Difficult              Digg submissions        2.4

Contribution effort and power law exponent
[Figure: power law exponent a by contribution type, ordered from easy to difficult]

Topic activity
• How much activity does a single topic generate?
[Figure: number of edits of a Wikipedia article; number of votes for an Essembly resolve]
Distribution is log-normal (normal distribution of log(x))

Where does log-normal come from?
• Multiplicative reinforcement as a model for log-normal distribution
  – Amount of new activity proportional to amount of existing activity
    • E.g., popularity (amount of activity) raises visibility, creating new activity
  – Phenomenological mathematical model
    $dN_t = (\mu\,dt + \sigma\,dB_t)\,N_t$
    • $N_t$: number of contributions on a topic until time t
    • $\mu$: average rate of contribution (independent of topic, time)
    • $\sigma\,dB_t$: stochastic noise accounting for fluctuations in human behavior, with noise strength $\sigma$

Summary
Macroscopic properties of diverse social media systems where people create, rate and share content are very similar and can be explained in terms of simple dynamical rules
• User participation described by a power law
  – Explained by “momentum” associated with participation, where probability of quitting is inversely proportional to the number of previous contributions
  – Power law exponent related to effort required to contribute
• Topic activity described by a log-normal
  – Explained by a multiplicative reinforcement mechanism in which contributions increase popularity
• Systems depend on heavy contributors and popular topics

Influence and correlation in social networks (Anagnostopoulos et al.)
• Questions
  – Do social networks shape user behavior?
  – How can we identify social influence and distinguish it from other factors, such as homophily or other confounding variables?
• Contributions
  – Statistical test to identify influence as source of social correlation
• Findings
  – Correlation in tagging behavior on Flickr cannot be attributed to social influence

Correlation in social networks
• Online social networks are important in shaping user behavior. As a result, social behavior is often correlated.
• What is the source of correlation?
  – Influence? Homophily? Confounding?
[Diagram: friends A and B both use the tag "donut", A at time t1 and B at time t2]

Sources of social correlation
• Confounding: correlation through external (environmental) factors, e.g., users posted pictures of the same place since they live in the same city
• Homophily: A and B became friends because they are similar to each other; therefore, they perform similar actions
• Influence: B’s action is caused by A’s action. If correlation is caused by influence, we can leverage it to amplify diffusion
[Diagrams: one graph per case over users A and B, with an external factor X where applicable]

Correlation models
• Confounding & homophily
  – Network G, set of active users W
  – Select (G, W) according to a joint probability distribution
  – Time of activation of users in W is picked from distribution T
• Influence
  – At each time step, a non-active user becomes active with probability p(a), where a is the number of her active friends
[Figure: timeline from 0 to T; p(a) vs. a = # active friends]

Measuring social correlation
• What is the form of p(a)?
Empirically, from Flickr tags data:
  $p(a) = \dfrac{e^{\alpha \ln(a+1) + \beta}}{1 + e^{\alpha \ln(a+1) + \beta}}$
  – Parameter α measures the amount of social correlation
• Estimate α, β using maximum likelihood logistic regression
  – Let $Y_{a,t}$ be the number of users with a active friends who performed the action at time t, and $Y_a = \sum_t Y_{a,t}$
  – Let $N_{a,t}$ be the number of users with a active friends who did not perform the action at time t, and $N_a = \sum_t N_{a,t}$
  – Choose values of α, β that maximize $\prod_a p(a)^{Y_a} (1 - p(a))^{N_a}$
  – a, $Y_a$, $N_a$ are observed

The shuffle test
• Does influence give rise to the observed series of user actions?
[Diagram: users A B C D E F G H I J act at times t1 … t10; after shuffling, the order is G E I A J F D C B H, labeled with original times t7 t5 t9 t1 t10 t6 t4 t3 t2 t8]
  – Estimate social correlation α using maximum likelihood
• Shuffle actions in time
  – Estimate social correlation α’ using maximum likelihood
There is no social influence if α ≈ α’

Edge reversal test
• Alternate statistical test
  – Estimate parameter α
• Reverse direction of edges in a (directed friendship) graph
  – Estimate α’
[Diagram: users A … J acting at times t1 … t10, before and after edge reversal]

Validation on synthetic data
Generate activations (users adopting a new tag) according to specific rules
1. No-correlation model
  – At each time step, pick new users of a tag uniformly at random
2. Influence model
  – At each time step, an inactive user becomes active with probability p(a), where a is the number of active friends
  – Probability parameterized by α
3. Correlation model (no-influence)
  – Select S users, and add their neighbors and neighbors of neighbors to S
  – Select active users randomly from S, as in model 1.
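The three generative rules above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' simulation: the random friendship graph, the function names, and every parameter value (graph size, `per_step`, `alpha`, `beta`, number of centers) are hypothetical, and each function returns a map from user to activation step.

```python
import math
import random


def random_graph(n, picks, rng):
    """Hypothetical undirected friendship graph: each user picks a few random friends."""
    friends = [set() for _ in range(n)]
    for u in range(n):
        for v in rng.sample(range(n), picks):
            if v != u:
                friends[u].add(v)
                friends[v].add(u)
    return friends


def activate_uniformly(pool, steps, per_step, rng):
    """Each step, activate `per_step` not-yet-active users drawn uniformly from `pool`."""
    when = {}
    for t in range(1, steps + 1):
        inactive = [u for u in pool if u not in when]
        for u in rng.sample(inactive, min(per_step, len(inactive))):
            when[u] = t
    return when


def no_correlation_model(friends, steps, per_step, rng):
    """Model 1: new users of the tag are picked uniformly at random; the graph is ignored."""
    return activate_uniformly(range(len(friends)), steps, per_step, rng)


def influence_model(friends, alpha, beta, steps, rng):
    """Model 2: an inactive user with a active friends activates with probability p(a)."""
    when = {}
    for t in range(1, steps + 1):
        newly = []
        for u in range(len(friends)):
            if u in when:
                continue
            a = sum(1 for v in friends[u] if v in when)   # friends active before step t
            z = alpha * math.log(a + 1) + beta
            if rng.random() < 1.0 / (1.0 + math.exp(-z)):
                newly.append(u)
        for u in newly:                                   # synchronous update
            when[u] = t
    return when


def correlation_model(friends, centers, steps, per_step, rng):
    """Model 3: build S from a few users plus their 2-hop neighborhood, then act as model 1 inside S."""
    S = set(rng.sample(range(len(friends)), centers))
    for _ in range(2):                                    # neighbors, then neighbors of neighbors
        S |= {v for u in S for v in friends[u]}
    return S, activate_uniformly(sorted(S), steps, per_step, rng)


rng = random.Random(42)
friends = random_graph(500, 3, rng)
m1 = no_correlation_model(friends, steps=10, per_step=5, rng=rng)
m2 = influence_model(friends, alpha=1.5, beta=-5.0, steps=10, rng=rng)
S, m3 = correlation_model(friends, centers=3, steps=10, per_step=5, rng=rng)
print(len(m1), len(m2), len(m3), len(S))
```

In model 3 the activated users are clustered in the graph because they all come from S, yet no user's action depends on a friend's earlier action; that is what makes it a correlated but influence-free baseline.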
Measuring correlation strength in synthetic data
• Frequency distribution (histogram) of the social correlation coefficient α measured from data
[Figure panels: no-correlation model; influence model; correlation model]

Distinguishing influence: shuffle test
• Measured α for original and shuffled tagging time steps
[Figure panels: influence model; correlation model]
Values of α are close: no influence

Experiments on Flickr data
• Tagging behavior of Flickr users over a period of 16 months
  – 340K users tagged a photo at least once
  – 160K of these were connected
    • 2.8M edges
    • Rest are isolated
  – Selected 1.7K of 10K tags these users used
    • Most were used by more than 1K users
    • “halloween”, “katrina”, “photos”, “moon”, etc.

Correlation and influence on Flickr
[Figure: measuring correlation; distinguishing influence: α of original vs. shuffled time step for each tag]
Correlation exists: α > 0
Correlation cannot be attributed to influence

Summary
• Proposed statistical analysis to identify and measure social influence as a source of correlation between the actions of individuals with social ties.
  – Distinguishing correlation from causation
  – Availability of time-resolved data about human behavior enables us to tackle this difficult problem
• Applied to data from a large social system
  – There is correlation, but it cannot be explained by influence
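The estimation and shuffle steps summarized above can be sketched end to end on synthetic data. This is a rough illustration under stated assumptions rather than the authors' code: the random graph, the generating values of α and β, and all function names are hypothetical, and the fit uses a plain Newton iteration for the two-parameter logistic likelihood over the counts Y_a and N_a.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    return 1 / (1 + math.exp(-z)) if z >= 0 else math.exp(z) / (1 + math.exp(z))


def random_graph(n, picks, rng):
    """Hypothetical undirected friendship graph."""
    friends = [set() for _ in range(n)]
    for u in range(n):
        for v in rng.sample(range(n), picks):
            if v != u:
                friends[u].add(v)
                friends[v].add(u)
    return friends


def simulate_influence(friends, alpha, beta, steps, rng):
    """Influence model: activation probability depends on the number of active friends."""
    when = {}
    for t in range(1, steps + 1):
        newly = [u for u in range(len(friends)) if u not in when and rng.random()
                 < sigmoid(alpha * math.log(sum(v in when for v in friends[u]) + 1) + beta)]
        for u in newly:
            when[u] = t
    return when


def counts(friends, when, steps):
    """Y[a], N[a]: user-steps with a active friends that did / did not activate."""
    Y, N = {}, {}
    for t in range(1, steps + 1):
        for u in range(len(friends)):
            tu = when.get(u, steps + 1)
            if tu < t:
                continue                                  # already active before step t
            a = sum(1 for v in friends[u] if when.get(v, steps + 1) < t)
            target = Y if tu == t else N
            target[a] = target.get(a, 0) + 1
    return Y, N


def fit_alpha_beta(Y, N, iters=50):
    """Maximize prod_a p(a)^Y_a (1 - p(a))^N_a by Newton's method."""
    total_y, total_n = sum(Y.values()), sum(N.values())
    alpha, beta = 0.0, math.log(total_y / total_n)        # start at intercept-only fit
    for _ in range(iters):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for a in set(Y) | set(N):
            x = math.log(a + 1)
            y, m = Y.get(a, 0), Y.get(a, 0) + N.get(a, 0)
            p = sigmoid(alpha * x + beta)
            g_a += (y - m * p) * x
            g_b += y - m * p
            w = m * p * (1 - p)
            h_aa += w * x * x
            h_ab += w * x
            h_bb += w
        det = h_aa * h_bb - h_ab * h_ab
        alpha += (h_bb * g_a - h_ab * g_b) / det
        beta += (h_aa * g_b - h_ab * g_a) / det
    return alpha, beta


def shuffle_times(when, rng):
    """Shuffle test: same active users, same multiset of times, random assignment."""
    users = list(when)
    times = [when[u] for u in users]
    rng.shuffle(times)
    return dict(zip(users, times))


STEPS = 20
rng = random.Random(0)
friends = random_graph(1000, 5, rng)
when = simulate_influence(friends, alpha=1.5, beta=-5.0, steps=STEPS, rng=rng)
alpha_hat, _ = fit_alpha_beta(*counts(friends, when, STEPS))
shuffled = shuffle_times(when, rng)
alpha_shuffled, _ = fit_alpha_beta(*counts(friends, shuffled, STEPS))
print(f"alpha (original) = {alpha_hat:.2f}, alpha' (shuffled) = {alpha_shuffled:.2f}")
```

On data generated by the influence model, the fit on the original times should land near the α used to generate it. The test then compares that value with α’ from the shuffled times; how far apart they fall in a small simulation like this depends on the graph and parameters chosen, so the printed gap should be read as illustrative.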