Exploring Blog Networks
Patterns and a Model for Information Propagation
Mary McGlohon
In collaboration with Jure Leskovec, Christos Faloutsos, Natalie Glance, Matthew Hurst
Sandia National Labs, July 6, 2007

Long-term Goals
● How does information on the Web propagate?
● With what pattern do ideas catch on, diffuse, and decrease in popularity?
● Can we build a model for this propagation?

Why blogs?
● Blogs are a widely used medium of information for many topics and have become an important mode of communication.
● Blogs cite one another, creating a record of how information and ideas spread through a social network.
● This record is publicly available.

Why do we care?
● Understanding how the blog network works is important for:
– Social issues: political mapping, social trends and change, reactions to mass media.
– Economic issues: marketing, predicting commercial success, discovering links between companies.
● Example: blogs in the 2004 election. [Adamic, Glance 2005]

Immediate Goals
● Temporal questions: Does popularity have a half-life? Is there periodicity?
● Topological questions: What topological patterns do posts and blogs follow? What shapes do cascades take on? Stars? Chains? Something else?
● Generative model: Can we build a generative model that mimics properties of cascades?

Outline
● Motivation
● Preliminaries
– Concepts and terminology
– Data
● Temporal Observations
● Topological Observations
● Cascade Generation Model
● Discussion & Conclusions

What is a blog?
● A blog is a frequently-updated webpage.
● A blog's author updates the blog using posts.
● Each post has a permanent hyperlink, and may contain links to other blog posts.
● Example:
– slashdot: "The iPhone is here, hooray!"
– boingboing: "At this link, Slashdot says the iPhone has arrived. But I'm not buying one, because …"
– slashdot: "Here Boingboing says they're not buying an iPhone. They're just jealous."

From blogs to networks
[Figure: the blogosphere network (blogs slashdot, boingboing, MichelleMalkin, Dlisted and their posts) is reduced to a blog network (nodes B1 to B4, with weighted edges) and a post network (nodes a to e).]

From networks to cascades
● Non-trivial vs. trivial cascades.
● Cascade initiators are first sources of information.
● We also have stars and chains.
[Figure: the blogosphere network and the cascades extracted from it.]

Dataset (Nielsen Buzzmetrics)
● Gathered from August-September 2005*
● Used set of 44,362 blogs, traced cascades.
● 2.4 million posts, ~5 million out-links, 245,404 blog-to-blog links.
[Figure: number of posts vs. time (1-day bins).]

Outline
● Motivation
● Preliminaries
– Concepts and terminology
– Data
● Temporal Observations
– Does blog traffic behave periodically?
– How does popularity change over time?
● Topological Observations
● Cascade Generation Model
● Discussion & Conclusions
● Future Work

Temporal Observations
Does blog traffic behave periodically?
● Posts have a "weekend effect": less traffic on Saturday/Sunday.

Temporal Observations
Does blog traffic behave periodically?
● Monday appears to compensate for the weekend drop, but this is not actually the case.
● We normalize the data: $\text{count}_{\text{norm}} = \text{count} / p_d$, where $p_d$ is the percentage of links on that day.
[Figure: number of in-links (log) vs. days after post for Monday posts. Left: Monday post dropoff. Right: same data, normalized.]

Temporal Observations
How does post popularity change over time?
● Post popularity dropoff follows a power law identical to that found in communication response times in [Vazquez2006].
● Observation 1: The probability that a post written at time $t_p$ acquires a link at time $t_p + \Delta$ is $p(t_p + \Delta) \propto \Delta^{-1.5}$.
[Figure: number of in-links vs. days after post.]

Outline
● Motivation
● Preliminaries
● Temporal Observations
– Does blog traffic behave periodically?
– How does post popularity change over time?
● Topological Observations
– What are graph properties for blog networks?
– What shapes do cascades take on? Stars, chains, or something else?
● Cascade Generation Model
● Discussion & Conclusions
● Future Work

Topological Observations
What graph properties does the blog network exhibit?
● How connected?
– 44,356 nodes, 122,153 edges.
– Half of blogs belong to the largest connected component.
[Figure: blog network with nodes B1 to B4 and weighted edges.]

Topological Observations
What power laws does the blog network exhibit?
● Both in- and out-degree follow a power law distribution: in-degree PL exponent -1.7, out-degree PL exponent near -3.
● This suggests strong rich-get-richer phenomena.
[Figure: count (log scale) vs. number of blog in-links (log scale); count (log scale) vs. number of blog out-links (log scale).]

Topological Observations
How are blog in- and out-degree related?
● In-links and out-links are only weakly correlated (correlation coefficient 0.16).
[Figure: number of blog out-links (log scale) vs. number of blog in-links (log scale).]

Topological Observations
What graph properties does the post network exhibit?
● Very sparsely connected: 98% of posts are isolated.
[Figure: post network with nodes a to e.]

Topological Observations
What power laws does the post network exhibit?
● Both in- and out-degree follow power laws: in-degree has PL exponent -2.15, out-degree has PL exponent -2.95.
[Figure: count vs. post in-degree; count vs. post out-degree.]

Topological Observations
How do we measure how information flows through the network?
● We gather cascades using the following procedure:
– Find all initiators (out-degree 0).
– Follow in-links.
– This produces a directed acyclic graph.
[Figure: post network with nodes a to e and the cascades extracted from it.]
● Common cascade shapes are extracted using algorithms in [Leskovec2006].
● Number of edges increases linearly with cascade size, while effective diameter increases logarithmically, suggesting tree-like structures.
[Figure: number of edges vs. cascade size (# nodes); effective diameter vs. cascade size.]
● We work with a bag of cascades: each cascade is a disconnected subgraph.
● We now explore some graph properties of cascades.

Topological Observations
What graph properties do cascades exhibit?
● As before, in- and out-degree in the bag of cascades follow power laws.
[Figure: count vs. cascade node in-degree; count vs. cascade node out-degree.]
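
The cascade-gathering procedure described above (find initiators, then follow in-links) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the talk: the function name `extract_cascades`, the dictionary layout, and the five-post example network are all hypothetical, and the sketch assumes posts only link to earlier posts so that the result is acyclic.

```python
from collections import defaultdict, deque

def extract_cascades(out_links):
    """Return {initiator: (nodes, edges)} for every initiator in the post network.

    out_links maps each post to the list of posts it cites (its out-links).
    An initiator is a post with out-degree 0. Edges are (citing, cited) pairs.
    """
    # Invert the out-links so that in-links can be followed from each initiator.
    in_links = defaultdict(list)
    posts = set(out_links)
    for post, cited in out_links.items():
        for target in cited:
            in_links[target].append(post)
            posts.add(target)

    cascades = {}
    for post in sorted(posts):
        if out_links.get(post):          # cites something, so not an initiator
            continue
        # Breadth-first search along in-links, starting at the initiator.
        nodes, edges = {post}, set()
        queue = deque([post])
        while queue:
            current = queue.popleft()
            for citing in in_links[current]:
                edges.add((citing, current))
                if citing not in nodes:
                    nodes.add(citing)
                    queue.append(citing)
        cascades[post] = (nodes, edges)
    return cascades

# Hypothetical post network: b and c cite a, d cites b, e is isolated.
example = {"a": [], "b": ["a"], "c": ["a"], "d": ["b"], "e": []}
cascades = extract_cascades(example)

# Trivial cascades consist of a single isolated post.
nontrivial = {k: v for k, v in cascades.items() if len(v[0]) > 1}
print(nontrivial)
```

In this toy network the isolated post e yields a trivial cascade, while a initiates a cascade containing b, c, and d.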
Topological Observations
What graph properties do cascades exhibit?
● Cascade size distributions also follow a power law.
● Observation 2: The probability of observing a cascade on n nodes follows a Zipf distribution: $p(n) \propto n^{-2}$.
[Figure: count vs. cascade size (# of nodes).]
● Stars and chains also follow a power law, with different exponents (star -3.1, chain -8.5).
[Figure: count vs. size of star (# nodes); count vs. size of chain (# nodes).]

Outline
● Motivation
● Preliminaries
● Temporal Observations
● Topological Observations
– What are graph properties for blog networks?
– What shapes and patterns do cascades take on?
● Cascade Generation Model
– Epidemiological Background
– Proposed Model
– Experimental Validation
● Discussion & Conclusions
● Future Work

Epidemiological models
● We consider modeling cascade generation as an epidemic, with ideas as viruses.
● We use the SIS model [Hethcote2000]:
– At any time, an entity is in one of two states: susceptible or infected.
– One parameter, β, determines how easily conversations spread.

Cascade Generation Model
0. Begin with the blog network, but ignore edge weights.
   Example: B1 links to B2, B2 links to B1, B4 links to B2 and B1 as well as to itself, and B3 is isolated, linking only to itself.
1. Randomly pick a blog to infect, and add a node to the cascade.
2. Infect each in-linked neighbor with probability β.
3. Add infected neighbors to cascade.
[Figure: blog network B1 to B4 alongside the cascade so far, which contains B1 and B4.]
4. Set "old" infected nodes to uninfected.
Repeat steps 2-4 until no nodes are infected.
[Figure: no further neighbors are infected; the completed cascade contains B1 and B4.]

CGM matches observations
● After trying several values, we decide on β = 0.025.
● 10 simulations, 2 million cascades each.
● Most frequent cascades: 7 of 10 matched exactly.
[Figure: most frequent cascade shapes, model vs. data.]

CGM matches observations
● Cascade size in this model also follows a power law; the model distribution is shown with the real data points.
[Figure: count vs. cascade size (number of nodes).]
● Stars and chains both follow power laws, close to those observed in real data.
[Figure: count vs. star size; count vs. chain size.]

Results in brief
● Analyzed one of the largest available collections of blog information.
● Two networks: "post network" and "blog network".
● Discovered several properties of the networks.
● Also analyzed properties of "cascades".
● Presented a generative model for cascades.

Immediate questions: answered
● Temporal questions: Does popularity have a half-life? Is there periodicity?
– Popularity dropoff follows a power-law distribution, exactly as found in response times in other work [Vazquez2006]. We do find that posts follow weekly periodicity.
[Figure: number of in-links vs. days after post.]

Immediate questions: answered
● Topology: What topological patterns do posts and blogs follow? What shapes do cascades take on? Stars? Chains? Something else?
– We find power law distributions in almost every topological property. In cascade shapes, stars are more common than chains, and the sizes of cascades follow a power law. Cascades are tree-like.
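
Steps 0 to 4 of the cascade generation model above can be simulated with a short Python sketch. It is a minimal illustration under assumptions, not the authors' implementation: the function name `generate_cascade`, the `max_rounds` safety cap, and the choice to record each infection as a new cascade node are hypothetical. The four-blog network is the example from the slides, written as in-neighbor lists (who links to each blog), and the infection probability is the 0.025 reported above.

```python
import random
from collections import Counter

def generate_cascade(in_neighbors, beta, rng, max_rounds=1000):
    """Simulate one cascade; return a list of (node_id, parent_id, blog) tuples.

    in_neighbors maps each blog to the blogs that link to it (edge weights
    ignored). max_rounds is a safety cap added for this sketch only.
    """
    start = rng.choice(sorted(in_neighbors))       # step 1: pick a random blog
    cascade = [(0, None, start)]
    infected = [(0, start)]                        # (cascade node id, blog)
    rounds = 0
    while infected and rounds < max_rounds:
        newly_infected = []
        for parent_id, blog in infected:
            for neighbor in in_neighbors[blog]:    # step 2: try each in-linked neighbor
                if rng.random() < beta:
                    node_id = len(cascade)
                    cascade.append((node_id, parent_id, neighbor))   # step 3
                    newly_infected.append((node_id, neighbor))
        infected = newly_infected                  # step 4: old infections are cleared
        rounds += 1
    return cascade

# Example network from the slides: B1 <-> B2, B4 links to B1, B2 and itself,
# B3 links only to itself.
blog_net = {"B1": ["B2", "B4"], "B2": ["B1", "B4"], "B3": ["B3"], "B4": ["B4"]}

rng = random.Random(0)
sizes = Counter(len(generate_cascade(blog_net, 0.025, rng)) for _ in range(10000))
print(sorted(sizes.items()))   # histogram of cascade sizes (number of nodes)
```

On a network this small the size histogram is only illustrative; the comparison with real cascade shapes reported above used the full blog network.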
Immediate questions: answered
● Can a simple model replicate this behavior?
– Yes. We developed a model based on the SIS model in epidemiology. It is a simple model with only one parameter, and it produces behavior remarkably similar to that found in the dataset.
[Figure: count vs. star size; count vs. chain size.]

Future work and applications
● This work suggested that ideas may behave like viruses under an SIS model.
● This may be useful for mapping social/political trends.
● Further investigation into these properties may also allow early detection of changes in social or economic structure.

Related work
● For explanation of SIS model:
– [Hethcote2000] H.W. Hethcote. The mathematics of infectious diseases. SIAM Rev., 42(4):599–653, 2000.
● For algorithms for extracting cascade shapes:
– [Leskovec2006] J. Leskovec, A. Singh, and J. Kleinberg. Patterns of influence in a recommendation network. PAKDD 2006.
● For some modeling of power laws:
– [Vazquez2006] A. Vazquez, J. G. Oliveira, Z. Dezso, K. I. Goh, I. Kondor, and A. L. Barabasi. Modeling bursts and heavy tails in human dynamics. Physical Review E, 73:036127, 2006.

Additional Info
Mary McGlohon
www.cs.cmu.edu/~mmcgloho
mcglohon@cmu.edu

Acknowledgments
● Mary McGlohon was partially supported by an NSF Graduate Fellowship.
● Jure Leskovec was partially supported by a Microsoft Fellowship.

Questions?

EXTRA SLIDES BEGIN HERE!

Preliminaries- PCA
● We will work with very high-dimensional data (~9,000 dimensions).
● Principal Component Analysis is a method of dimensionality reduction.
● Hypothetically, for each blog...
[Figure: hypothetical scatter of blogs; axes: depth upwards and conversation mass upwards.]

Preliminaries- PCA
● We can represent any real N x M matrix X as $X = U \Lambda V^T$:

$$
\begin{bmatrix}
1 & 1 & 1 & 0 & 0 \\
2 & 2 & 2 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 \\
5 & 5 & 5 & 0 & 0 \\
0 & 0 & 0 & 2 & 2 \\
0 & 0 & 0 & 3 & 3 \\
0 & 0 & 0 & 1 & 1
\end{bmatrix}
=
\begin{bmatrix}
0.18 & 0 \\
0.36 & 0 \\
0.18 & 0 \\
0.90 & 0 \\
0 & 0.53 \\
0 & 0.80 \\
0 & 0.27
\end{bmatrix}
\times
\begin{bmatrix}
9.64 & 0 \\
0 & 5.29
\end{bmatrix}
\times
\begin{bmatrix}
0.58 & 0.58 & 0.58 & 0 & 0 \\
0 & 0 & 0 & 0.71 & 0.71
\end{bmatrix}
$$

(the first row of $V^T$ is labeled $v_1$ on the slide)
● Reduce dimensionality by keeping the leading components of $\Lambda$ and setting all other components to zero:

$$
X \approx U \times
\begin{bmatrix}
9.64 & 0 \\
0 & 0
\end{bmatrix}
\times V^T
$$

Reference: Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition, Academic Press.

Preliminaries- Regularizing data
● Not everything in life is normally distributed.
[Figure: blog properties, linear-linear scale; total in-links vs. total conversation mass downwards. Annotation: 99.4% of points!]
● Try to fit a line... Outliers dramatically affect fit.
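
Returning to the decomposition in the PCA slides above, the factors can be checked by multiplying them back together. The sketch below is a plain-Python illustration, with hypothetical helper names `reconstruct` and `max_abs_error`. It uses the rounded values from the slide, so the product matches X only up to rounding, and it shows what is lost when the second component is set to zero.

```python
import math

# Matrix and factors as printed on the slide (values rounded to two decimals).
X = [[1, 1, 1, 0, 0], [2, 2, 2, 0, 0], [1, 1, 1, 0, 0], [5, 5, 5, 0, 0],
     [0, 0, 0, 2, 2], [0, 0, 0, 3, 3], [0, 0, 0, 1, 1]]
U = [[0.18, 0], [0.36, 0], [0.18, 0], [0.90, 0],
     [0, 0.53], [0, 0.80], [0, 0.27]]
S = [9.64, 5.29]                      # diagonal of the middle matrix
Vt = [[0.58, 0.58, 0.58, 0, 0], [0, 0, 0, 0.71, 0.71]]

def reconstruct(U, S, Vt, rank):
    """Multiply U * diag(S) * Vt, keeping only the first `rank` components."""
    rows, cols = len(U), len(Vt[0])
    return [[sum(U[i][k] * S[k] * Vt[k][j] for k in range(rank))
             for j in range(cols)] for i in range(rows)]

def max_abs_error(A, B):
    """Largest absolute entry-wise difference between two matrices."""
    return max(abs(a - b) for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))

full = reconstruct(U, S, Vt, rank=2)
rank1 = reconstruct(U, S, Vt, rank=1)
print("max error, both components:", max_abs_error(X, full))
print("max error, first component only:", max_abs_error(X, rank1))

# X consists of two orthogonal rank-one blocks, so its singular values are the
# products of the norms of the generating vectors.
print(round(math.sqrt(31) * math.sqrt(3), 2), round(math.sqrt(14) * math.sqrt(2), 2))
```

Keeping only the first component reproduces the first three columns and drops the block in the last two columns entirely.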
Preliminaries: Regularizing data
● Not everything in life is normally distributed.
● Therefore, we propose to take log(count+1).
● Outliers' effects are minimized.
[Figure: blog properties, log-log scale; total in-links vs. total conversation mass downwards.]

● Suppose we want to cluster blogs based on content. What features do we use per blog?

CascadeType
● Perform PCA on sparse matrix.
● Use log(count+1).
● Project onto 2 PC…
[Figure: sparse matrix of ~44,000 blogs (rows such as slashdot, boingboing) by ~9,000 cascade types (columns), reduced to two principal components per blog.]

CascadeType: Results
● Observation: Content of blogs and cascade behavior are often related.
● Distinct clusters for "conservative" and "humorous" blogs (hand-labeling).

● Suppose we want to cluster blog posts. What features do we use?

Preliminaries- Blogs
● There are several terms we use to describe cascades:
● In-link, out-link
– Green node has one out-link.
– Yellow node has one in-link.
● Depth downwards/upwards
– Pink node has an upward depth of 1, downward depth of 2.
● Conversation mass upwards/downwards
– Pink node has upward CM 1, downward CM 3.

PostFeatures
[Figure: matrix of ~2,400,000 posts (rows such as slashdot-p001, slashdot-p002, boingboing-p001, boingboing-p002) by post features. Run PCA to obtain two components per post.]

PostFeatures: Results
● Observation: Posts within a blog tend to retain similar network characteristics.
– PC1 ~ CM upward
– PC2 ~ CM downward
– We show this scatter plot instead.
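
The depth and conversation mass features defined above can be computed with a small graph traversal. The sketch below rests on assumptions the slides do not spell out: "upward" is taken to mean following a post's out-links toward the posts it cites, "downward" means following in-links toward the posts that cite it, conversation mass counts all reachable posts, and depth is the longest such path. The helper `mass_and_depth` and the five-post cascade are hypothetical, chosen so that post p reproduces the numbers quoted for the pink node.

```python
def mass_and_depth(start, neighbors):
    """Return (conversation mass, depth) reachable from `start`.

    neighbors maps a post to the posts one step away in the chosen direction.
    Assumes the cascade is acyclic.
    """
    seen = set()

    def longest_path(node):
        seen.add(node)
        return max((1 + longest_path(n) for n in neighbors.get(node, [])), default=0)

    depth = longest_path(start)
    return len(seen) - 1, depth       # do not count the starting post itself

# Hypothetical cascade: p cites r; x and y cite p; z cites x.
out_links = {"r": [], "p": ["r"], "x": ["p"], "y": ["p"], "z": ["x"]}

# Invert the out-links to obtain in-links.
in_links = {}
for post, cited in out_links.items():
    for target in cited:
        in_links.setdefault(target, []).append(post)

upward = mass_and_depth("p", out_links)
downward = mass_and_depth("p", in_links)
print("upward (CM, depth):", upward)
print("downward (CM, depth):", downward)
```

Under these assumptions a star has downward conversation mass equal to its in-link count, while a chain has a larger conversation mass than in-link count.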
[Figure: scatter plot of posts from MichelleMalkin and Dlisted.]

Ranking blogs by PostFeatures
● Conversation mass up/down gives a better understanding of the blog posts than in-links and out-links.
● Therefore, we may choose to rank blogs based on these attributes.

Blogs ranked by CM vs in-links

Rank | Top blogs by conversation mass | Top blogs by in-links
1 | michellemalkin.com | boingboing.net
2 | boingboing.net | michellemalkin.com
3 | imao.us (75) | instapundit.com
4 | captainsquartersblog.com/mt | waxy.org/links
5 | instapundit.com | kottke.com/reminder
6 | radioequalizer.blogspot.com (53) | patriotdaily.com (11)
7 | powerlineblog.com | captainsquartersblog.com/mt
8 | waxy.org/links | powerlineblog.com
9 | washingtonmonthly.com | washingtonmonthly.com
10 | kottke.org/reminder | petashon.com (30)

– imao.us (in-links: 2, CM: 6): perhaps IMAO has longer cascades, just fewer in-links.
– petashon.com (in-links: 5, CM: 5): while petashun has "stars".

BlogTimeFractal: some time series
● Problem: time series data is nonuniform and difficult to analyze.
[Figure: in-links over time.]
● Any patterns? Any measures?

BlogTimeFractal: Definitions
● Any patterns? Self-similarity!
● The 80-20 law describes self-similarity.
● For any sequence, we divide it into two equal-length subsequences. 80% of traffic is in one, 20% in the other.
– Repeat recursively.

Self-similarity
● The bias factor for the 80-20 law is b=0.8.
[Figure: an interval split into 20% and 80% of the traffic.]
● Q: How do we estimate b?
● A: Entropy plots!

BlogTimeFractal
● An entropy plot plots entropy vs. resolution.
● From time series data, begin with resolution R= T/2.
● Record entropy $H_R$.
● Recursively take finer resolutions.

BlogTimeFractal: Definitions
● Entropy measures the non-uniformity of the histogram at a given resolution.
● We define the entropy of our sequence at a given resolution R as $H_R = -\sum_{t=1}^{2^R} p(t) \log_2 p(t)$, where p(t) is the percentage of posts from a blog on interval t, R is the resolution, and $2^R$ is the number of intervals.

BlogTimeFractal
● For a b-model (and self-similar cases), the entropy plot is linear.
● The slope s will tell us the bias factor.
● Lemma: For traffic generated by a b-model, the bias factor b obeys the equation $s = -b \log_2 b - (1-b) \log_2 (1-b)$.

Entropy Plots
● A linear plot indicates self-similarity.
● Uniform: slope s=1, bias=.5
● Point mass: s=0, bias=1
● Michelle Malkin in-links: s=0.85. By the lemma, b=0.72.
[Figure: entropy vs. resolution.]

BlogTimeFractal: Results
● Observation: Most time series of interest are self-similar.
● Observation: Bias factor is approximately 0.7, that is, more bursty than uniform (70/30 law).
● Entropy plots for MichelleMalkin:
– in-links, b=.72
– conversation mass, b=.76
– number of posts, b=.70

Other related work

[Ali-Hasen, Adamic 2007] Expressing Social Relationships on the Blog through Links and Comments
● Analyzed three blog communities:
– Dallas-Fort Worth: most links are external to community (91%); low centralization; low reciprocity.
– UAE: fewer links external to community; more centralization; obvious "hub" structure.
– Kuwait: fewest links external to community (53%); highly centralized; much reciprocity.

[Duarte et al. 2007]
● Classified blogs into parlor, register, and broadcast.
[Figure: fraction of sessions with comments vs. total sessions, with regions labeled register, parlor, and broadcast.]

[Adar et al. 2004] Implicit Structure and the Dynamics of Blogspace
● Suggested that ideas behaved like epidemics.
● Presented iRank based on how "infectious" a blog was.
(giant microbes, a site infectious in more ways than one)
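
The entropy-plot procedure from the BlogTimeFractal slides can be illustrated with a short Python sketch. It is a toy example under assumptions, not the analysis code behind the reported numbers: the function names are hypothetical, the traffic is synthetic and generated by a deterministic b-model with b = 0.8, resolution R is taken to mean $2^R$ equal-length intervals, and the lemma is inverted by simple bisection.

```python
import math

def b_model(b, levels):
    """Deterministic b-model: recursively split traffic b / (1 - b) between halves."""
    series = [1.0]
    for _ in range(levels):
        series = [part for x in series for part in (b * x, (1 - b) * x)]
    return series

def entropy_at_resolution(series, R):
    """Entropy in bits of the series aggregated into 2**R equal-length intervals."""
    width = len(series) // 2 ** R
    total = sum(series)
    H = 0.0
    for i in range(2 ** R):
        p = sum(series[i * width:(i + 1) * width]) / total
        if p > 0:
            H -= p * math.log2(p)
    return H

def fit_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

def bias_from_slope(s, tol=1e-9):
    """Invert s = -b log2 b - (1-b) log2 (1-b) for b in [0.5, 1] by bisection."""
    lo, hi = 0.5, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        h = -mid * math.log2(mid) - (1 - mid) * math.log2(1 - mid)
        if h > s:          # entropy decreases as b grows from 0.5 to 1
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

traffic = b_model(0.8, levels=10)                 # 1024 synthetic time bins
resolutions = list(range(1, 11))
entropies = [entropy_at_resolution(traffic, R) for R in resolutions]
s = fit_slope(resolutions, entropies)
print("slope:", s, "estimated bias:", bias_from_slope(s))
print("bias for slope 0.85:", bias_from_slope(0.85))
```

For this synthetic series the entropy plot is exactly linear, and inverting the lemma recovers the bias used to generate the traffic. Applying the same inversion to the slope of 0.85 reported for the Michelle Malkin in-links gives a bias close to the 0.72 quoted above.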