Graph mining techniques applied to blogs
Mary McGlohon
Seminar on Social Media Analysis, Oct 2 2007

Last week…
Lots of methods for graph mining and link analysis.
This week…
A few examples of these methods applied to blogs.

Paper #1
● Jure Leskovec, Mary McGlohon, Christos Faloutsos, Natalie Glance, and Matthew Hurst. Patterns of Cascading Behavior in Large Blog Graphs, SDM 2007.
– What temporal and topological features do we observe in a large network of blogs?

Representing blogs as graphs
[Figure: three views of the same data. Blogosphere network: blogs B1-B4 (examples shown: slashdot, boingboing, MichelleMalkin, Dlisted) with their posts. Blog network: blogs B1-B4 joined by weighted blog-to-blog edges. Post network: posts a-e joined by post-to-post links.]

Extracting subgraphs: Cascades
We gather cascades using the following procedure:
– Find all initiators (out-degree 0).
– Follow in-links.
– Produces directed acyclic graph.
[Figure: cascades extracted from the post network a-e]

Paper #1,2 Dataset (Nielsen Buzzmetrics)
● Gathered from August-September 2005*
● Used set of 44,362 blogs, traced cascades
● 2.4 million posts, ~5 million out-links, 245,404 blog-to-blog links
[Figure: number of posts over time (1-day bins)]

Temporal Observations
Does blog traffic behave periodically?
• Posts have “weekend effect”, less traffic on Saturday/Sunday.

Temporal Observations
How does post popularity change over time?
[Figure: number of in-links (log scale) vs. days after post, showing popularity on day 1, popularity on day 40, and the dropoff for posts made on a Monday]
● Post popularity dropoff follows a power law identical to that found in communication response times in [Vazquez2006].
● The probability that a post written at time $t_p$ acquires a link at time $t_p + \Delta$ is: $p(t_p + \Delta) \propto \Delta^{-1.5}$

Topological Observations
What graph properties does the blog network exhibit?
● 44,356 nodes, 122,153 edges
● Half of blogs belong to largest connected component.
[Figure: blog network of B1-B4 with weighted edges]

Topological Observations
What power laws does the blog network exhibit?
[Figure: count (log scale) vs. number of blog in-links (log scale), and count (log scale) vs. number of blog out-links (log scale)]
● Both in- and out-degree follow power-law distributions: the in-degree exponent is -1.7 and the out-degree exponent is near -3. This suggests a strong rich-get-richer phenomenon.

Topological Observations
What graph properties does the post network exhibit?
● Very sparsely connected: 98% of posts are isolated.
● In-links/out-links also follow power laws.
[Figure: post network a-e]

Topological Observations
How do we measure how information flows through the network?
● Common cascade shapes are extracted using algorithms in [Leskovec2006].
● Number of edges increases linearly with cascade size, while effective diameter increases logarithmically, suggesting tree-like structures.
[Figure: number of edges vs. cascade size (# nodes), and effective diameter vs. cascade size]

More on cascades
● Cascade sizes, including sizes of particular shapes (stars, chains), also follow power laws.
● This paper also presents a model for influence propagation that generates cascades based on the SIS model of epidemiology. The topic of influence propagation has been reserved for a later date.

Paper #2
● Mary McGlohon, Jure Leskovec, Christos Faloutsos, Matthew Hurst, and Natalie Glance. Finding patterns in blog shapes and blog evolution, SDM 2007.
– Do different kinds of blogs exhibit different properties?
– What tools can we use to describe the behavior of a blog over time?

● Suppose we wanted to characterize a blog based on the properties of its posts.
– Obtain a set of features for each post based on its role in a cascade.
– Use PCA for dimensionality reduction.

Post features
● There are several terms we use to describe cascades:
● In-link, out-link
– Green node has one out-link.
– Yellow node has one in-link.
● Depth downwards/upwards
– Pink node has an upward depth of 1, downward depth of 2.
● Conversation mass (CM) upwards/downwards
– Pink node has upward CM 1, downward CM 3.

Dimensionality reduction
● Post features may be correlated, so some information may be unnecessary.
● Principal Component Analysis is a method of dimensionality reduction.
Hypothetically, for each blog...
[Figure: depth upwards vs. conversation mass upwards]

Setting up the matrix
● ~2,400,000 posts, one row per post (slashdot-p001, slashdot-p002, …, boingboing-p001, boingboing-p002, …).
● Run PCA…
[Table: example post-feature values; cell layout not recoverable]

PostFeatures: Results
• Observation: Posts within a blog tend to retain similar network characteristics.
– PC1 ~ CM upward
– PC2 ~ CM downward
[Figure: posts projected onto PC1 and PC2, with MichelleMalkin and Dlisted labeled]

● Suppose we want to cluster blogs based on content. What features do we use?
– Get set of features based on cascade shapes.
– Run PCA to reduce dimensionality.

PCA on a sparse matrix
• ~44,000 blogs (rows) by ~9,000 cascade types (columns).
• This time, each blog is one row.
• Use log(count+1)
• Project onto 2 PC…
[Table: example rows for slashdot and boingboing; cell layout not recoverable]

CascadeType: Results
● Observation: Content of blogs and cascade behavior are often related.
• Distinct clusters for “conservative” and “humorous” blogs (hand-labeling).

● What about time series data? How can we deal with that?
● Problem: time series data is nonuniform and difficult to analyze.
[Figure: in-links over time]

BlogTimeFractal: Definitions
● Fortunately, we find that behavior is often self-similar.
● The 80-20 law describes self-similarity.
● For any sequence, we divide it into two equal-length subsequences. 80% of traffic is in one, 20% in the other.
– Repeat recursively.

Self-similarity
● The bias factor for the 80-20 law is b=0.8.
[Figure: a sequence split 20/80]
Q: How do we estimate b?
A: Entropy plots!
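
The recursive 80-20 split can be sketched in a few lines of Python. This is a minimal illustration and not code from the papers: the function name `b_model` and its parameters are hypothetical, and the sketch always gives the larger share to the first half, whereas a b-model of real traffic would typically choose the heavier half at random.

```python
def b_model(total, levels, b=0.8):
    """Split `total` traffic over 2**levels equal-length intervals.

    At every level each interval is cut into two halves; a fraction b of
    its traffic goes to the first half and (1 - b) to the second.
    (Assumption for illustration: the heavier half is always the first.)
    """
    traffic = [float(total)]
    for _ in range(levels):
        finer = []
        for mass in traffic:
            finer.append(b * mass)        # heavier half
            finer.append((1 - b) * mass)  # lighter half
        traffic = finer
    return traffic


if __name__ == "__main__":
    # Three levels of splitting give 8 intervals.
    print(b_model(100, 3))
```

With b=0.8, one level of splitting puts about 80 units in the first half and 20 in the second; each further level repeats the same split inside every half, which is what makes the sequence self-similar.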
BlogTimeFractal
● An entropy plot plots entropy vs. resolution.
● From time series data, begin with resolution R = T/2.
● Record entropy $H_R$.
● Recursively take finer resolutions.

BlogTimeFractal: Definitions
● Entropy measures the non-uniformity of the histogram at a given resolution.
● We define the entropy of our sequence at a given resolution R:
$H_R = -\sum_{t} p(t) \log_2 p(t)$
where p(t) is the percentage of posts from a blog on interval t, R is the resolution, and $2^R$ is the number of intervals.

BlogTimeFractal
● For a b-model (and self-similar cases), the entropy plot is linear. The slope s will tell us the bias factor.
● Lemma: For traffic generated by a b-model, the bias factor b obeys the equation:
$s = -b \log_2 b - (1-b) \log_2 (1-b)$

Entropy Plots
● A linear plot indicates self-similarity.
● Uniform: slope s=1, bias=.5
● Point mass: slope s=0, bias=1
● Michelle Malkin in-links: s=0.85. By Lemma 1, b=0.72.
[Figure: entropy vs. resolution]

BlogTimeFractal: Results
● Observation: Most time series of interest are self-similar.
● Observation: Bias factor is approximately 0.7; that is, more bursty than uniform (70/30 law).
[Figure: entropy plots for MichelleMalkin: in-links, b=.72; conversation mass, b=.76; number of posts, b=.70]

Papers #1,2 conclusions
● There are several power laws observed in a network of blogs.
● We can extract cascades to help describe how information propagates through a network.
● We can use cascade properties to describe behavior of some blogs.
● We can also use self-similarity to describe behavior of blogs over time.
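
The entropy-plot procedure and the lemma relating slope to bias factor can be combined into a small estimator. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, resolution R is taken to mean 2**R equal intervals, the series length is assumed divisible by 2**max_R, the slope comes from an ordinary least-squares fit, and the lemma is inverted numerically by bisection on b in [0.5, 1].

```python
import math


def entropy_at_resolution(series, R):
    """Entropy (in bits) of the series aggregated into 2**R equal intervals."""
    n_bins = 2 ** R
    width = len(series) // n_bins  # assumes len(series) is divisible by n_bins
    total = sum(series)
    entropy = 0.0
    for i in range(n_bins):
        p = sum(series[i * width:(i + 1) * width]) / total
        if p > 0:
            entropy -= p * math.log2(p)
    return entropy


def entropy_slope(series, max_R):
    """Least-squares slope of entropy vs. resolution for R = 0..max_R."""
    rs = list(range(max_R + 1))
    hs = [entropy_at_resolution(series, r) for r in rs]
    mean_r = sum(rs) / len(rs)
    mean_h = sum(hs) / len(hs)
    num = sum((r - mean_r) * (h - mean_h) for r, h in zip(rs, hs))
    den = sum((r - mean_r) ** 2 for r in rs)
    return num / den


def bias_from_slope(s, tol=1e-9):
    """Solve s = -b log2 b - (1-b) log2 (1-b) for b in [0.5, 1] by bisection."""
    def binary_entropy(b):
        if b <= 0.0 or b >= 1.0:
            return 0.0
        return -b * math.log2(b) - (1 - b) * math.log2(1 - b)

    lo, hi = 0.5, 1.0  # binary_entropy falls from 1 to 0 on this interval
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_entropy(mid) > s:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    # Slope reported on the slides for Michelle Malkin in-links.
    print(round(bias_from_slope(0.85), 2))
```

Searching only b in [0.5, 1] reflects the convention on the slides that uniform traffic has bias .5 and a point mass has bias 1; the equation itself is symmetric in b and 1-b.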
Paper #3
● Eytan Adar, Li Zhang, Lada A. Adamic, and Rajan M. Lukose. Implicit Structure and the Dynamics of Blogspace. WWW 2004.
– What are the large- and small-scale patterns of blog epidemics?

Large scale: Epidemic profiles
● Example: A popular website linking to a given blog may cause a popularity spike.
● Quantify popularity of a topic into a vector.
● Then, cluster different topics’ profiles.
● Used k-means clustering on topic buzz to identify different ways ideas gain and lose popularity. Found k=4 worked best.
[Figure: centroids of the clusters identified]
● ‘Catchall’ (48%): picked up by different communities, no major spike.
● ‘Back page’ news (20%): delayed spike, broader popularity.
● ‘Slashdot’ (14%): link picked up quickly, dies off quickly.
● ‘Front page’ news (18%): immediate spike, broader popularity.

Small scale: link mining
● Links acquired by blogrolls or automated trackbacks.
● Posts sometimes give information on source of information (‘via’).
May 16 2003, 8:48a
“GIANTmicrobes http://www.giantmicrobes.com/ ‘We make stuffed animals that look like tiny microbes– only a million times actual size! Now available: The Common Cold, The Flu, Sore Throat, and Stomach Ache.’ (via BoingBoing)”
[Figure labels: EpsteinBarr, Ebola]
● Unfortunately, since ‘via’ information is rare (O(.1%)), there needs to be a better way to infer infection paths.
– Solution: link prediction.
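
Link prediction needs a numeric score for how alike two blogs are, and the talk names cosine similarity as the measure behind the blog, link, and textual similarity features. As a minimal sketch only, the function below computes cosine similarity between two blogs represented as sets of link targets, which treats each blog as a binary vector. That representation, the function name, and the example URLs are assumptions for illustration; the paper may weight links differently.

```python
import math


def cosine_similarity(links_a, links_b):
    """Cosine similarity of two blogs given as collections of link targets.

    Each blog is treated as a binary vector over all link targets, so the
    dot product is the number of shared targets and each vector norm is
    the square root of the number of distinct targets.
    """
    a, b = set(links_a), set(links_b)
    if not a or not b:
        return 0.0  # no links means no evidence of similarity
    return len(a & b) / math.sqrt(len(a) * len(b))


if __name__ == "__main__":
    # Hypothetical link sets for two blogs.
    blog_1 = {"http://example.org/a", "http://example.org/b"}
    blog_2 = {"http://example.org/b", "http://example.org/c"}
    print(cosine_similarity(blog_1, blog_2))
```

The same function applies to common links to other blogs or to common non-blog links, depending on which set of targets is passed in.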
Link prediction
● Predict likelihood of 2 blogs linking to each other.
– Blog similarity: common links to other blogs
– Link similarity: common non-blog links
– Textual similarity: text vector similarity
– Timing of posts on certain topics
● First three are cosine similarity; timing is likelihood based on observed distributions of link timings.

Link prediction results
● Used SVMs to predict links.
– Undirected link prediction accuracy 91%
– (Directed link prediction, 57%)

More goodies from Paper #3
● And…
– Built Zoomgraph, a visualization tool (stay tuned next week).
– Proposed iRank, a ranking based on “infectiousness” of blogs (stay tuned Oct. 23).
A more in-depth slide show may be found here: http://www.blogpulse.com/papers/Adar_blogworkshop2_ppt.pdf

Paper #4
● Noor Ali-Hasan and Lada Adamic. Expressing Social Relationships on the Blog through Links and Comments. ICWSM 2007.
– Do different blog communities exhibit certain structural properties?

[Ali-Hasan and Adamic 2007]
● Dataset of 3 blogging communities
– Dallas/Ft. Worth
– United Arab Emirates (UAE)
– Kuwait
● Analyzed 3 types of links
– Blogrolls (on a blog’s webpage)
– Citations (link in a post)
– Comments (interaction in a post’s discussion)
[Figures: examples of a citation link, a blogroll link, and a comment link]

Link type analysis
● It is of interest to compare different types of links…
– Co-occurrences of different link types.
– Reciprocity among link types, between communities.
[Figures: co-occurrence of link types (Kuwait); link reciprocation rates]

Structural properties
● Centralization: to what extent links are not uniformly distributed.
(low in all communities, indicating “hubs”)
● Modularity: to what extent “subcommunities” have formed.
[Figures: links per blog; modularity]

Comparing communities
Dallas-Fort Worth:
– Most links are external to community (91%)
– Low centralization
– Low reciprocity
UAE:
– Fewer links external to community
– More centralization
– Obvious “hub” structure
Kuwait:
– Fewest links external to community (53%)
– Highly centralized
– Much reciprocity

Paper #4 Conclusions
● Based on a survey, they suggest that these different network characteristics indicate different mindsets inside each community.
– Kuwait bloggers more often reported blogging in order to make new friends.
– DFW bloggers more often reported blogging to update friends/family on events.

Conclusions
● Link analysis has discovered patterns in several aspects of the blogosphere.
– Observing general network characteristics.
– Describing behavior of specific blogs, or blog topics.
– Illustrating how influence propagates.
– Comparing different blogging communities.