Graph mining techniques applied to blogs Mary McGlohon

Seminar on Social Media Analysis, Oct 2 2007

Last week…
Lots of methods for graph mining and link analysis.
This week…
A few examples of these methods applied to blogs.

Paper #1
● Jure Leskovec, Mary McGlohon, Christos Faloutsos, Natalie Glance, and Matthew Hurst. Patterns of Cascading Behavior in Large Blog Graphs, SDM 2007.
– What temporal and topological features do we observe in a large network of blogs?

Representing blogs as graphs
[Figure, three panels. Blogosphere network: blogs B1-B4, with labels slashdot, boingboing, MichelleMalkin, Dlisted. Blog network: the same blogs B1-B4 as nodes, with numeric edge labels (1, 2, 3). Post network: posts a-e as nodes.]

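The step from a post network to a blog network can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each post-to-post link is a `(source_post, target_post)` pair, that edge labels in the blog network count links between two blogs, and that links within one blog are dropped. The names `blog_network`, `blog_of`, and the toy data are hypothetical.

```python
from collections import Counter


def blog_network(post_links, blog_of):
    """Collapse post-to-post links into weighted blog-to-blog edges."""
    weights = Counter()
    for src_post, dst_post in post_links:
        src_blog, dst_blog = blog_of[src_post], blog_of[dst_post]
        if src_blog != dst_blog:  # assumption: ignore links inside one blog
            weights[(src_blog, dst_blog)] += 1
    return dict(weights)


# Hypothetical toy data: five posts spread over four blogs.
blog_of = {"a": "B1", "b": "B2", "c": "B2", "d": "B3", "e": "B4"}
post_links = [("b", "a"), ("c", "a"), ("d", "b"), ("e", "d")]
edges = blog_network(post_links, blog_of)
```

Here two posts of B2 link to a post of B1, so the B2-to-B1 edge gets weight 2.
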
Extracting subgraphs: Cascades
We gather cascades using the following procedure:
– Find all initiators (out-degree 0).
– Follow in-links.
– Produces directed acyclic graph.
[Figure: example post network on posts a-e and the cascades extracted from it.]

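The procedure above can be sketched in a few lines. This is an illustrative reading, not the paper's implementation: links are assumed to be `(citing_post, cited_post)` pairs, and a cascade is taken to be every link reachable by walking in-links back from an initiator. The function name `extract_cascades` and the toy links are hypothetical.

```python
from collections import defaultdict, deque


def extract_cascades(post_links):
    """Map each initiator (out-degree 0) to the links of its cascade."""
    in_links = defaultdict(list)  # post -> posts that link to it
    out_degree = defaultdict(int)
    posts = set()
    for citing, cited in post_links:
        in_links[cited].append(citing)
        out_degree[citing] += 1
        posts.update((citing, cited))

    cascades = {}
    for initiator in sorted(p for p in posts if out_degree[p] == 0):
        seen, queue, edges = {initiator}, deque([initiator]), []
        while queue:
            node = queue.popleft()
            for citing in in_links[node]:  # follow in-links
                edges.append((citing, node))
                if citing not in seen:
                    seen.add(citing)
                    queue.append(citing)
        cascades[initiator] = edges
    return cascades


# Hypothetical links: b, c cite a; d, e cite b; e also cites c; g cites f.
links = [("b", "a"), ("c", "a"), ("d", "b"), ("e", "b"), ("e", "c"), ("g", "f")]
cascades = extract_cascades(links)
```

Post e reaches initiator a along two paths, so the cascade rooted at a is a directed acyclic graph rather than a tree.
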
Paper #1,2 Dataset
(Nielsen Buzzmetrics)
● Gathered from August-September 2005*
● Used set of 44,362 blogs, traced cascades
● 2.4 million posts, ~5 million out-links, 245,404 blog-to-blog links
[Figure: number of posts over time, in 1-day bins.]

Temporal Observations
Does blog traffic behave periodically?
• Posts show a “weekend effect”: less traffic on Saturday/Sunday.

Temporal Observations
How does post popularity change over time?
[Figure: number of in-links (log) vs. days after post, showing the popularity dropoff for posts made on a Monday, with popularity on day 1 and on day 40 marked.]
Post popularity dropoff follows a power law identical to that found in communication response times in [Vazquez2006].
The probability that a post written at time $t_p$ acquires a link at time $t_p + \Delta$ is:
$p(t_p + \Delta) \propto \Delta^{-1.5}$

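One rough way to check a power-law dropoff with exponent near -1.5 is a least-squares line fit in log-log space. The sketch below runs on synthetic data, not the Buzzmetrics counts; `fit_power_law_exponent` is a hypothetical helper, and a log-log regression gives only a coarse estimate.

```python
import math


def fit_power_law_exponent(days, counts):
    """Slope of the least-squares line through (log day, log count)."""
    xs = [math.log(d) for d in days]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Synthetic in-link counts that decay exactly as days ** -1.5.
days = list(range(1, 41))
counts = [1000.0 * d ** -1.5 for d in days]
exponent = fit_power_law_exponent(days, counts)
```

On real data the counts are noisy and include zeros, which would have to be binned or dropped before taking logarithms.
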
Topological Observations
What graph properties does the blog network exhibit?
● 44,356 nodes, 122,153 edges
● Half of blogs belong to largest connected component.
[Figure: blog network on B1-B4 with numeric edge labels.]

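The share of blogs in the largest connected component can be measured with a union-find pass over the edges. The sketch assumes "connected" means weakly connected (link direction ignored), which the slide does not state; `largest_component_fraction` and the toy graph are hypothetical.

```python
from collections import Counter


def largest_component_fraction(nodes, edges):
    """Fraction of nodes in the largest weakly connected component."""
    parent = {node: node for node in nodes}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for u, v in edges:
        parent[find(u)] = find(v)  # merge the two components

    sizes = Counter(find(node) for node in nodes)
    return max(sizes.values()) / len(nodes)


# Hypothetical graph: three linked blogs plus three isolated ones.
nodes = ["B1", "B2", "B3", "B4", "B5", "B6"]
edges = [("B1", "B2"), ("B3", "B2")]
fraction = largest_component_fraction(nodes, edges)
```
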
Topological Observations
What power laws does the blog network exhibit?
[Figure, two panels: count (log scale) vs. number of blog in-links (log scale); count (log scale) vs. number of blog out-links (log scale).]
Both in- and out-degree follow a power law distribution: in-link PL exponent -1.7, out-degree PL exponent near -3.
This suggests strong rich-get-richer phenomena.

Topological Observations
What graph properties does the post network exhibit?
Very sparsely connected: 98% of posts are isolated.
Inlinks/outlinks also follow power laws.
[Figure: post network on posts a-e.]

Topological Observations
How do we measure how information flows through the network?
Common cascade shapes are extracted using algorithms in [Leskovec2006].
[Figure, two panels: number of edges vs. cascade size (# nodes); effective diameter vs. cascade size (# nodes).]
Number of edges increases linearly with cascade size, while effective diameter increases logarithmically, suggesting tree-like structures.

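The slides do not define effective diameter. The sketch below assumes a common convention: the smallest number of hops within which 90% of connected node pairs can reach each other, computed here without interpolation and with link direction ignored. The function name and the two toy cascades are hypothetical.

```python
import math
from collections import defaultdict, deque


def effective_diameter(edges, quantile=0.9):
    """Smallest hop count d such that `quantile` of connected pairs are within d."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    distances = []
    for start in adjacency:
        hops = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search from `start`
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in hops:
                    hops[neighbor] = hops[node] + 1
                    queue.append(neighbor)
        distances.extend(d for d in hops.values() if d > 0)

    distances.sort()
    return distances[math.ceil(quantile * len(distances)) - 1]


# Two hypothetical 5-node cascades, each a tree with 4 edges.
star = [("b", "a"), ("c", "a"), ("d", "a"), ("e", "a")]
chain = [("b", "a"), ("c", "b"), ("d", "c"), ("e", "d")]
star_diameter = effective_diameter(star)
chain_diameter = effective_diameter(chain)
```

Both toy cascades have one edge fewer than nodes, as trees do, but the chain spreads over more hops than the star.
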
More on cascades
● Cascade sizes, including sizes of particular shapes (stars, chains) also follow power laws.
● This paper also presents a model for influence propagation that generates cascades based on SIS model of epidemiology. The topic of influence propagation has been reserved for a later date.

Paper #2
● Mary McGlohon, Jure Leskovec, Christos Faloutsos, Matthew Hurst, and Natalie Glance. Finding patterns in blog shapes and blog evolution, SDM 2007.
– Do different kinds of blogs exhibit different properties?
– What tools can we use to describe the behavior of a blog over time?

● Suppose we wanted to characterize a blog based on the properties of its posts.
– Obtain a set of features for each post based on its role in a cascade.
– Use PCA for dimensionality reduction.

Post features
● There are several terms we use to describe cascades:
● In-link, out-link
– Green node has one out-link.
– Yellow node has one in-link.
● Depth downwards/upwards
– Pink node has an upward depth of 1, downward depth of 2.
● Conversation mass upwards/downwards
– Pink node has upward CM 1, downward CM 3.

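The features listed above can be computed with two breadth-first searches per post. The slides do not say which direction counts as "upward", so this sketch assumes it: upward follows a post's out-links, downward follows its in-links. Depth is taken as the largest hop count reached and conversation mass as the number of posts reached. All names and the toy cascade are hypothetical.

```python
from collections import defaultdict, deque


def depth_and_mass(start, neighbors):
    """Return (largest hop count, number of posts reached) from `start`."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in neighbors.get(node, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return max(hops.values()), len(hops) - 1


def post_features(post, links):
    """Cascade features of one post; links are (citing, cited) pairs."""
    out_links, in_links = defaultdict(list), defaultdict(list)
    for citing, cited in links:
        out_links[citing].append(cited)
        in_links[cited].append(citing)
    depth_up, cm_up = depth_and_mass(post, out_links)  # assumed "upward"
    depth_down, cm_down = depth_and_mass(post, in_links)  # assumed "downward"
    return {
        "in_links": len(in_links.get(post, ())),
        "out_links": len(out_links.get(post, ())),
        "depth_up": depth_up,
        "cm_up": cm_up,
        "depth_down": depth_down,
        "cm_down": cm_down,
    }


# Hypothetical cascade built to match the pink node's values on the slide.
links = [("pink", "root"), ("x", "pink"), ("y", "pink"), ("z", "x")]
features = post_features("pink", links)
```

With these assumptions the toy "pink" post gets upward depth 1, upward CM 1, downward depth 2, and downward CM 3, the same numbers the slide quotes.
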
Dimensionality reduction
● Post features may be correlated, so some information may be unnecessary.
● Principal Component Analysis is a method of dimensionality reduction.
[Figure: hypothetical scatter plot, for each blog, of depth upwards vs. conversation mass upwards.]

Setting up the matrix
[Figure: matrix with one row per post (~2,400,000 posts; rows labeled slashdot-p001, slashdot-p002, …, boingboing-p001, boingboing-p002, …) and example numeric entries in each row. Run PCA…]

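For a small matrix, the first principal component can be found by power iteration on the covariance matrix. In practice a numerical library would normally be used; this standard-library sketch only illustrates the idea. The two columns (depth upwards, conversation mass upwards) and their values are hypothetical, and `first_principal_component` is not a name from the paper.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit direction, per-row scores) of the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    covariance = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration; assumes the start vector is not orthogonal to the
    # leading eigenvector and that the data are not all identical.
    direction = [1.0] * d
    for _ in range(iterations):
        product = [sum(covariance[i][j] * direction[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in product))
        direction = [x / norm for x in product]
    scores = [sum(r[j] * direction[j] for j in range(d)) for r in centered]
    return direction, scores


# Hypothetical post features: [depth upwards, conversation mass upwards].
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, scores = first_principal_component(rows)
```

Because the two toy columns are perfectly correlated, one component captures all of the variance, which is the situation the slide describes as "some information may be unnecessary".
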
PostFeatures: Results
• Observation: Posts within a blog tend to retain similar network characteristics.
– PC1 ~ CM upward
– PC2 ~ CM downward
[Figure: posts plotted on the first two principal components, with labels MichelleMalkin and Dlisted.]

● Suppose we want to cluster blogs based on content. What features do we use?
– Get set of features based on cascade shapes.
– Run PCA to reduce dimensionality.

PCA on a sparse matrix
• This time, each blog is one row.
• Use log(count+1)
• Project onto 2 PC…
[Figure: sparse matrix with ~44,000 blogs as rows (e.g. slashdot, boingboing) and ~9,000 cascade types as columns, with example numeric entries.]

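The blog-by-cascade-type matrix with the log(count+1) transform can be set up as below. This is a sketch under assumptions: the slide does not give the logarithm base (natural log is used here), a dense list of lists stands in for what would be a sparse matrix at the stated scale, and the cascade type labels are invented.

```python
import math
from collections import Counter


def blog_cascade_matrix(observations):
    """Rows are blogs, columns are cascade types, entries are log(count + 1)."""
    counts = Counter(observations)
    blogs = sorted({blog for blog, _ in counts})
    types = sorted({cascade_type for _, cascade_type in counts})
    matrix = [
        [math.log(counts[(blog, cascade_type)] + 1) for cascade_type in types]
        for blog in blogs
    ]
    return blogs, types, matrix


# Hypothetical (blog, cascade type) observations.
observations = [
    ("slashdot", "star3"),
    ("slashdot", "star3"),
    ("slashdot", "chain2"),
    ("boingboing", "chain2"),
]
blogs, types, matrix = blog_cascade_matrix(observations)
```

A blog that never takes part in a given cascade type gets log(0+1) = 0, so the transform keeps the matrix sparse while damping very large counts.
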
CascadeType: Results
● Observation: Content of blogs and cascade behavior are often related.
• Distinct clusters for “conservative” and “humorous” blogs (hand-labeling).

● What about time series data? How can we deal with that?
● Problem: time series data is nonuniform and difficult to analyze.
[Figure: in-links over time.]

BlogTimeFractal: Definitions
● Fortunately, we find that behavior is often self-similar.
● The 80-20 law describes self-similarity.
● For any sequence, we divide it into two equal-length subsequences. 80% of traffic is in one, 20% in the other.
– Repeat recursively.

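The recursive 80-20 split can be generated directly. The sketch below is a deterministic simplification: it always gives the larger share to the first half, whereas the slide only says that one half gets 80%. The function name `b_model` and its parameters are hypothetical.

```python
def b_model(total, levels, b=0.8):
    """Split `total` traffic recursively, giving fraction b to the first half."""
    series = [float(total)]
    for _ in range(levels):
        finer = []
        for volume in series:
            finer.extend((b * volume, (1 - b) * volume))
        series = finer
    return series


# Two levels of splitting give four intervals holding about 64, 16, 16, 4.
series = b_model(100, levels=2)
```

Each extra level doubles the number of intervals while conserving the total, which is what makes the resulting series self-similar.
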
Self-similarity
● The bias factor for the 80-20 law is b=0.8.
[Figure: a sequence split into two halves labeled 20 and 80.]
Q: How do we estimate b?
A: Entropy plots!

BlogTimeFractal
● An entropy plot plots entropy vs. resolution.
● From time series data, begin with resolution R = T/2.
● Record entropy $H_R$.
● Recursively take finer resolutions.

BlogTimeFractal: Definitions
● Entropy measures the non-uniformity of the histogram at a given resolution.
● We define entropy of our sequence at given R:
$H_R = -\sum_{t=1}^{2^R} p(t) \log_2 p(t)$
where p(t) is the percentage of posts from a blog on interval t, R is the resolution, and $2^R$ is the number of intervals.

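One point of an entropy plot can be computed as follows. The sketch assumes base-2 Shannon entropy over the shares p(t), a series whose length is divisible by the number of intervals, and at least one post overall. `entropy_at_resolution` is a hypothetical name.

```python
import math


def entropy_at_resolution(series, resolution):
    """Base-2 entropy of traffic shares over 2**resolution equal intervals."""
    intervals = 2 ** resolution
    width = len(series) // intervals  # assumes len(series) divides evenly
    total = sum(series)
    entropy = 0.0
    for i in range(intervals):
        share = sum(series[i * width:(i + 1) * width]) / total
        if share > 0:  # empty intervals contribute nothing
            entropy -= share * math.log2(share)
    return entropy


uniform = [1] * 8
point_mass = [8, 0, 0, 0, 0, 0, 0, 0]
uniform_entropies = [entropy_at_resolution(uniform, r) for r in (1, 2, 3)]
point_entropies = [entropy_at_resolution(point_mass, r) for r in (1, 2, 3)]
```

The uniform series gains one bit per resolution step, while the point mass stays at zero; these are the two extreme cases of an entropy plot.
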
BlogTimeFractal
● For a b-model (and self-similar cases), the entropy plot is linear. The slope s will tell us the bias factor.
● Lemma: For traffic generated by a b-model, the bias factor b obeys the equation:
$s = -b \log_2 b - (1-b) \log_2 (1-b)$

Entropy Plots
● Linear plot → Self-similarity
● Uniform: slope s=1, bias=.5
● Point mass: s=0, bias=1
[Figure: entropy vs. resolution for Michelle Malkin in-links, s=0.85.]
By Lemma 1, b=0.72

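The lemma has no closed-form inverse, but the bias factor can be recovered from a measured slope by bisection, since the slope falls steadily from 1 to 0 as b goes from 0.5 to 1. This is a minimal sketch; the function names are hypothetical.

```python
import math


def slope_for_bias(b):
    """Entropy-plot slope predicted by the lemma for bias factor b."""
    if b in (0.0, 1.0):
        return 0.0
    return -b * math.log2(b) - (1 - b) * math.log2(1 - b)


def bias_from_slope(slope, tolerance=1e-9):
    """Invert the lemma on [0.5, 1] by bisection."""
    low, high = 0.5, 1.0
    while high - low > tolerance:
        middle = (low + high) / 2
        if slope_for_bias(middle) > slope:
            low = middle  # predicted slope too large, so b must be larger
        else:
            high = middle
    return (low + high) / 2


bias = bias_from_slope(0.85)
```

For s=0.85 this gives a bias factor of about 0.72, in line with the value quoted on the slide.
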
BlogTimeFractal: Results
● Observation: Most time series of interest are self-similar.
● Observation: Bias factor is approximately 0.7, that is, more bursty than uniform (70/30 law).
[Figure: entropy plots for MichelleMalkin: in-links, b=.72; conversation mass, b=.76; number of posts, b=.70.]

Papers #1,2 conclusions
● There are several power laws observed in a network of blogs.
● We can extract cascades to help describe how information propagates through a network.
● We can use cascade properties to describe behavior of some blogs.
● We can also use self-similarity to describe behavior of blogs over time.

Paper #3
● Eytan Adar, Li Zhang, Lada A. Adamic, and Rajan M. Lukose. Implicit Structure and the Dynamics of Blogspace. WWW 2004.
– What are the large- and small-scale patterns of blog epidemics?

Large scale: Epidemic profiles
● Example: Popular websites linking to a given blog may cause popularity spikes.

Large scale: Epidemic profiles
● Quantify popularity of a topic into a vector.
● Then, cluster different topics’ profiles.

Large scale: Epidemic profiles
● Used k-means clustering on topic buzz to identify different ways ideas gain and lose popularity. Found k=4 worked best.
[Figure: centroids of clusters identified.]

Large scale: Epidemic profiles
● ‘Catchall’: picked up by different communities, no major spike.
● ‘Back page’ news: delayed spike, broader popularity.
● ‘Slashdot’: link picked up quickly, dies off quickly.
● ‘Front page’ news: immediate spike, broader popularity.
[Figure: cluster centroids labeled ‘catchall’ 48%, ‘back page’ 20%, ‘slashdot’ 14%, ‘front page’ 18%.]

Link gathering
● Links acquired by blogrolls or automated trackbacks.
● Posts sometimes give information on source of information (‘via’).
May 16 2003, 8:48a
“GIANTmicrobes
http://www.giantmicrobes.com/
‘We make stuffed animals that look like tiny microbes– only a million times actual size! Now available: The Common Cold, The Flu, Sore Throat, and Stomach Ache.’ (via BoingBoing)

Small scale: link mining
● Unfortunately, since ‘via’ information is rare (O(.1%)), there needs to be a better way to infer infection paths.
– Solution: link prediction.

Link prediction
● Predict likelihood of 2 blogs linking to each other.
– Blog similarity: common links to other blogs
– Link similarity: common non-blog links
– Textual similarity: text vector similarity
– Timing of posts on certain topics.
● First three are cosine similarity, timing is likelihood based on observed distributions of link timings.

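The cosine similarity used for the first three features can be sketched over sparse vectors stored as dictionaries. The vector contents here are hypothetical; for blog similarity, for instance, each key could be a linked blog and each value 1 if the link exists.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors given as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)


# Hypothetical link vectors: which blogs each of two blogs links to.
blog_x = {"slashdot": 1.0, "boingboing": 1.0}
blog_y = {"boingboing": 1.0, "dlisted": 1.0}
similarity = cosine_similarity(blog_x, blog_y)
```

The two toy blogs share one of their two links each, giving a similarity of one half (up to floating-point rounding).
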
Link prediction results
● Used SVMs to predict links.
– Undirected link prediction accuracy 91%
– (Directed link prediction, 57%)

More goodies from Paper #3
● And…
– Built Zoomgraph, a visualization tool (stay tuned next week.)
– Proposed iRank, a ranking based on “infectiousness” of blogs (stay tuned Oct. 23.)
A more in-depth slide show may be found here:
http://www.blogpulse.com/papers/Adar_blogworkshop2_ppt.pdf

Paper #4
● Noor Ali-Hasan and Lada Adamic. Expressing Social Relationships on the Blog through Links and Comments. ICWSM 2007
– Do different blog communities exhibit certain structural properties?

[Ali-Hasan and Adamic 2007]
● Dataset of 3 blogging communities
– Dallas/Ft. Worth
– United Arab Emirates (UAE)
– Kuwait
● Analyzed 3 types of links
– Blogrolls (on a blog’s webpage)
– Citations (link in a post)
– Comments (interaction in a post’s discussion)

[Figure: examples of a citation link and a blogroll link.]
[Figure: example of a comment link.]

Link type analysis
● It is of interest to compare different types of links…
– Co-occurrences of different link types.
– Reciprocity among link types, between communities.
[Figure, two panels: co-occurrence of link types (Kuwait); link reciprocation rates.]

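A reciprocation rate for one link type can be computed as the share of directed links whose reverse link also exists. This definition is an assumption made for illustration; the paper may measure reciprocity differently. The blog names are hypothetical.

```python
def reciprocity(links):
    """Fraction of distinct directed links (u, v) for which (v, u) also exists."""
    edges = {(u, v) for u, v in links if u != v}  # drop self-links and repeats
    if not edges:
        return 0.0
    mutual = sum(1 for u, v in edges if (v, u) in edges)
    return mutual / len(edges)


# Hypothetical blogroll links: A and B link to each other, the rest are one-way.
blogroll_links = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "D")]
rate = reciprocity(blogroll_links)
```

Running the same function separately on blogroll, citation, and comment links would give one rate per link type, which is the kind of comparison the slide describes.
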
Structural properties
● Centralization: to what extent links are not uniformly distributed. (low in all communities, indicating “hubs”)
● Modularity: to what extent “subcommunities” have formed.
[Figure, two panels: links per blog; modularity.]

Comparing communities
Dallas-Fort Worth
– Most links are external to community (91%)
– Low centralization
– Low reciprocity
UAE
– Fewer links external to community
– More centralization
– Obvious “hub” structure
Kuwait
– Fewest links external to community (53%)
– Highly centralized
– Much reciprocity

Paper #4 Conclusions
● Based on a survey, they suggest that these different network characteristics indicate different mindsets inside the community.
– Kuwait bloggers more often reported blogging in order to make new friends.
– DFW more often reported blogging to update friends/family on events.

Conclusions
● Link analysis has discovered patterns in several aspects of the blogosphere.
– Observing general network characteristics.
– Describing behavior of specific blogs, or blog topics.
– Illustrating how influence propagates.
– Comparing different blogging communities.