Web Networks Filippo Menczer Indiana University Department of Computer Science

Web Networks
Filippo Menczer
Department of Computer Science
School of Informatics
Indiana University
Research supported by NSF
CAREER Award IIS-0348940
csn.indiana.edu
Outline
Link network
Lexical network
Growth models
Semantic network
Peer search network
Traffic network
Three network
topologies
Text
Meaning
Links
The Web as a text corpus
Pages close in
word vector space
tend to be related
Cluster hypothesis
(van Rijsbergen 1979)
[Figure: pages p1 and p2 in a word vector space with axes "mass", "weapons", "destruction"]
The WebCrawler (Pinkerton 1994)
The whole first generation of search engines
Enter the
Web’s link
structure
Mining the Web’s link cues
Pages that
link to each other
tend to be related
Link-cluster
conjecture
(Menczer 1997)
Link eigenvector
analysis
HITS, hubs and authorities (Kleinberg & al 1998, …)
Google’s PageRank (Brin & Page 1998, …)
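As an illustration of link eigenvector analysis, here is a minimal sketch of PageRank by power iteration. It follows the textbook formulation rather than any search engine's actual implementation; the function name `pagerank`, the damping factor 0.85, the fixed iteration count, and the uniform treatment of pages without outlinks are all assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # every page receives the same "random jump" share
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # split this page's rank evenly among its outlinks
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # assumption: a page with no outlinks spreads its rank uniformly
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

In this toy graph, page "c" ends up ranked above "b" because it is linked from both "a" and "b".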
The second generation of search engines
Web growth models
How are links created and why content matters
Preferential attachment
“BA” model
Pr(i) ∝ k(i)
(Barabasi & Albert 1999,
de Solla Price 1976)
At each step t
add new page p
Create m new links
from p to i (i<t)
Rich-get-richer
$\Pr(i) = \frac{k(i)}{mt} \implies \Pr(k) \sim k^{-\gamma}$
?
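The growth steps above can be sketched as a short simulation. This is only an illustration under stated assumptions: the function name `preferential_attachment` is hypothetical, k(i) is taken to be total degree, the network starts from a small clique of m + 1 pages, and the m targets of each new page are distinct.

```python
import random


def preferential_attachment(n_pages, m=2, seed=0):
    """Return a list of (source, target) links grown by preferential attachment."""
    rng = random.Random(seed)
    edges = []
    # Each page appears in `stubs` once per link endpoint, so a uniform draw
    # from `stubs` picks page i with probability proportional to its degree k(i).
    stubs = []
    # assumption: seed network is a clique of m + 1 pages
    for i in range(m + 1):
        for j in range(i):
            edges.append((i, j))
            stubs += [i, j]
    for t in range(m + 1, n_pages):
        targets = set()
        while len(targets) < m:          # draw m distinct existing pages
            targets.add(rng.choice(stubs))
        for i in sorted(targets):
            edges.append((t, i))         # link from new page t to older page i
            stubs += [t, i]
    return edges


edges = preferential_attachment(1000, m=2)
```

Counting how often each page appears as a target in `edges` gives the indegree distribution, which can then be compared against a power law.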
Other growth models
Web copying
(Kleinberg, Kumar & al 1999, 2000)
Pr(i) ∝ Pr(j) · Pr(j → i)
same indegree distribution
no need to know degree
Other growth models
Random mixture
(Pennock & al. 2002,
Cooper & Frieze 2001,
Dorogovtsev & al 2000)
$\Pr(i) \propto \psi \cdot \frac{1}{t} + (1 - \psi) \cdot \frac{k(i)}{mt}$
winners don’t take all
general indegree distribution
fits non-power-law cases
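A small sketch of the mixture rule, computing the normalized link probability for each existing page. The function name `mixture_probabilities` and the list-based inputs are assumptions for illustration; t is taken to be the number of existing pages.

```python
def mixture_probabilities(degrees, m, psi):
    """degrees[i] = k(i) for each existing page; returns Pr(i) for each page."""
    t = len(degrees)
    # uniform term psi * 1/t plus preferential term (1 - psi) * k(i)/(m t)
    weights = [psi / t + (1 - psi) * k / (m * t) for k in degrees]
    total = sum(weights)
    return [w / total for w in weights]


uniform = mixture_probabilities([3, 1, 0, 0], m=1, psi=1.0)    # pure uniform choice
rich = mixture_probabilities([3, 1, 0, 0], m=1, psi=0.0)       # pure preferential attachment
```

With ψ = 1 every page is equally likely; with ψ = 0 pages with no links are never chosen, which is the rich-get-richer extreme.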
Other growth models
Mixture with Euclidean distance
(HOT: Fabrikant, Koutsoupias & Papadimitriou 2002)
$i = \arg\min(\phi r_{it} + g_i)$
tradeoff between centrality and geometric locality
fits power-law in certain critical trade-off regimes
[Figure: example networks; partial label "orks, power laws and phase transition"; credit: Raissa D’Souza, Microsoft Research]
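The trade-off rule can be sketched as a tiny simulation. This is an illustration under several assumptions not stated on the slide: nodes are placed uniformly at random in the unit square, each new node t creates one link, r_it is the Euclidean distance between i and t, and the centrality term g_i is the hop distance from i to the first node. The function name `hot_tree` is hypothetical.

```python
import math
import random


def hot_tree(n_nodes, phi, seed=0):
    """Return parent[t] = node that t links to (None for the first node)."""
    rng = random.Random(seed)
    points = [(rng.random(), rng.random())]
    hops = [0]          # g_i: hops from node i to the first node (assumed centrality measure)
    parent = [None]
    for t in range(1, n_nodes):
        x, y = rng.random(), rng.random()

        def cost(i):
            r_it = math.hypot(x - points[i][0], y - points[i][1])
            return phi * r_it + hops[i]

        best = min(range(t), key=cost)   # i = arg min (phi * r_it + g_i)
        points.append((x, y))
        parent.append(best)
        hops.append(hops[best] + 1)
    return parent


star = hot_tree(100, phi=0.0)       # distance ignored: every node links to the first node
local = hot_tree(100, phi=1000.0)   # distance dominates: nodes link to near neighbors
```

Small φ favors centrality and yields a star; large φ favors geometric locality. Intermediate values are where the trade-off becomes interesting.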
What about content?
Link probability
vs lexical distance
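$r = \frac{1}{\sigma_c} - 1$
$\Pr(\lambda \mid \rho) = \frac{|\{(p,q) : r = \rho \wedge \sigma_l > \lambda\}|}{|\{(p,q) : r = \rho\}|}$
Phase transition at $\rho^*$
Power law tail: $\Pr(\lambda \mid \rho) \sim \rho^{-\alpha(\lambda)}$
Proc. Natl. Acad. Sci. USA 99(22): 14014-14019, 2002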
Local content-based growth model

$\Pr(p_t \to p_{i<t}) = \begin{cases} \frac{k(i)}{mt} & \text{if } r(p_i, p_t) < \rho^* \\ c\,[r(p_i, p_t)]^{-\alpha} & \text{otherwise} \end{cases}$
• Similar to preferential attachment (BA)
• Use degree info (popularity/importance) only for nearby (similar/related) pages
γ = 4.3 ± 0.1
γ = 4.26 ± 0.07
So, many models can predict
degree distributions...
Which is “right” ?
Need an independent observation (other than
degree) to validate models
Distribution of content similarity across
linked pairs
Across all pairs: $\Pr(\sigma_c) \sim 10^{-7\sigma_c}$
(Why?!?)
None of these models is right!
Back to the mixture model
$\Pr(i) \propto \psi \cdot \frac{1}{t} + (1 - \psi) \cdot \frac{k(i)}{mt}$
degree-uniform mixture
[Figure: panels a, b, c showing a new page t and existing pages i1, i2, i3]
Bias choice by content similarity instead
of uniform distribution
Degree-similarity mixture model
$\Pr(i) \propto \psi \cdot \hat{\Pr}(i) + (1 - \psi) \cdot \frac{k(i)}{mt}$
$\hat{\Pr}(i) \propto [r(i, t)]^{-\alpha}$
ψ = 0.2, α = 1.7
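A sketch of how the degree-similarity mixture could be evaluated for one new page t. The function name `degree_similarity_probabilities` and the list inputs are assumptions for illustration; the defaults ψ = 0.2 and α = 1.7 are the values given above, and every lexical distance r(i, t) is assumed to be strictly positive.

```python
def degree_similarity_probabilities(degrees, distances, m, psi=0.2, alpha=1.7):
    """degrees[i] = k(i); distances[i] = r(i, t) > 0. Returns Pr(i) for each page."""
    t = len(degrees)
    # content term: normalized power law in lexical distance
    sim_weights = [r ** -alpha for r in distances]
    z = sum(sim_weights)
    weights = [
        psi * s / z + (1 - psi) * k / (m * t)
        for s, k in zip(sim_weights, degrees)
    ]
    total = sum(weights)
    return [w / total for w in weights]


probs = degree_similarity_probabilities(
    degrees=[5, 1, 1], distances=[4.0, 0.5, 2.0], m=1
)
```

Compared with the degree-uniform mixture, the only change is that the ψ term prefers lexically close pages instead of choosing uniformly.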
Build it...
(M.M.)
Both mixture models get the degree
distribution right…
…but the degree-similarity mixture model
predicts the similarity distribution better
Proc. Natl. Acad. Sci. USA 101: 5261-5265, 2004
Citation
networks
15,785 articles
published in
PNAS between
1997 and 2002
What now?
Understand exponential distribution of
similarity
Growth model to explain evolution of both
link topology and content similarity
With Alex Vespignani & Sandro Flammini
Mapping the relationship
between links, content,
and semantic topologies
• Given any pair of pages, need ‘similarity’ or
‘proximity’ metric for each topology:
– Content: textual/lexical (cosine) similarity
– Link: co-citation/bibliographic coupling
– Semantic: relatedness inferred from manual classification
• Data: Open Directory Project (dmoz.org)
– ~ 1 M pages after cleanup
– ~ $1.3 \times 10^{12}$ page pairs!
Content similarity
$\sigma_c(p_1, p_2) = \frac{p_1 \cdot p_2}{\|p_1\| \cdot \|p_2\|}$
[Figure: pages p1 and p2 as vectors in a space with axes term i, term j, term k]
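A minimal sketch of cosine similarity between two pages, assuming each page is represented by raw term counts from whitespace-split, lowercased text. Real systems typically add stop-word removal, stemming, and TF-IDF weighting, none of which is shown here; the function name `cosine_similarity` is chosen for this example.

```python
import math
from collections import Counter


def cosine_similarity(text1, text2):
    """Cosine of the angle between the term-count vectors of two texts."""
    v1 = Counter(text1.lower().split())
    v2 = Counter(text2.lower().split())
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0       # assumption: an empty page is similar to nothing
    return dot / (norm1 * norm2)


sim = cosine_similarity("mass destruction", "mass weapons")
```

The two example texts share one of their two terms, so the similarity is one half.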
Link similarity
$\sigma_l(p_1, p_2) = \frac{|U_{p_1} \cap U_{p_2}|}{|U_{p_1} \cup U_{p_2}|}$
[Figure: pages p1 and p2]
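The link similarity is a Jaccard coefficient over URL sets. The sketch below assumes U_p is supplied as a collection of URLs in the link neighborhood of page p; how that neighborhood is built (inlinks, outlinks, the page itself) is left to the caller, and the example URLs are made up.

```python
def link_similarity(u1, u2):
    """Jaccard coefficient between two collections of URLs."""
    u1, u2 = set(u1), set(u2)
    union = u1 | u2
    if not union:
        return 0.0       # assumption: two pages with no links have similarity 0
    return len(u1 & u2) / len(union)


# hypothetical neighborhoods sharing two of four distinct URLs
sim = link_similarity(
    {"a.example", "b.example", "c.example"},
    {"b.example", "c.example", "d.example"},
)
```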
Semantic similarity
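• Information-theoretic measure based on classification tree (Lin 1998)
• Classic path distance in special case of balanced tree
$\sigma_s(c_1, c_2) = \frac{2 \log \Pr[\mathrm{lca}(c_1, c_2)]}{\log \Pr[c_1] + \log \Pr[c_2]}$
[Figure: classification tree with root "top", topics c1 and c2, and their lowest common ancestor lca]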
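A sketch of the tree-based measure on a toy classification tree. It assumes Pr[c] is estimated as the fraction of all pages filed under topic c or any of its descendants, and that every topic compared has at least one page in its subtree. The function name, the `parent` and `page_counts` dictionaries, and the toy topics are all made up for illustration.

```python
import math


def lin_similarity(c1, c2, parent, page_counts):
    """parent: topic -> parent topic (root -> None); page_counts: topic -> pages filed directly in it."""
    # number of pages in the subtree rooted at each topic
    subtree = {topic: 0 for topic in parent}
    for topic, count in page_counts.items():
        node = topic
        while node is not None:
            subtree[node] += count
            node = parent[node]

    def ancestors(c):
        path = []
        while c is not None:
            path.append(c)
            c = parent[c]
        return path

    path1 = ancestors(c1)
    root = path1[-1]
    on_path1 = set(path1)
    # lowest common ancestor: first ancestor of c2 that is also an ancestor of c1
    lca = next(c for c in ancestors(c2) if c in on_path1)

    def log_pr(c):
        return math.log(subtree[c] / subtree[root])

    denom = log_pr(c1) + log_pr(c2)
    return 2 * log_pr(lca) / denom if denom else 1.0


parent = {"top": None, "Science": "top", "Arts": "top",
          "Physics": "Science", "Biology": "Science"}
page_counts = {"Physics": 10, "Biology": 10, "Arts": 20}
sim = lin_similarity("Physics", "Biology", parent, page_counts)
```

Here Physics and Biology meet at Science, which holds half of all pages, so they score 0.5, while Physics and Arts meet only at the root and score 0.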
Correlations between similarities
[Figure: content-link, content-semantic, and link-semantic correlations (axis 0 to 0.5) for News, Home, Reference, Sports, Computers, Kids and Teens, Recreation, Health, Adult, Shopping, Science, All pairs, Society, Business, Arts, Games]
European Physical Journal B 38(2): 211-221, 2004
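$\text{Precision} = \frac{|\text{Retrieved} \,\&\, \text{Relevant}|}{|\text{Retrieved}|}$
$\text{Recall} = \frac{|\text{Retrieved} \,\&\, \text{Relevant}|}{|\text{Relevant}|}$
Averaging semantic similarity:
$P(s_c, s_l) = \frac{\sum_{\{p,q : \sigma_c = s_c, \sigma_l = s_l\}} \sigma_s(p,q)}{|\{p,q : \sigma_c = s_c, \sigma_l = s_l\}|}$
Summing semantic similarity:
$R(s_c, s_l) = \frac{\sum_{\{p,q : \sigma_c = s_c, \sigma_l = s_l\}} \sigma_s(p,q)}{\sum_{\{p,q\}} \sigma_s(p,q)}$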
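The two definitions can be sketched over a list of page pairs. The binning of σc and σl into a fixed number of equal-width cells is an assumption made so that pairs with "equal" similarity values can be grouped; the function name and the sample triples are invented for illustration.

```python
from collections import defaultdict


def semantic_precision_recall(pairs, bins=10):
    """pairs: iterable of (sigma_c, sigma_l, sigma_s), each value in [0, 1].

    Returns (precision, recall), two dicts keyed by (content_bin, link_bin).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    total = 0.0
    for sc, sl, ss in pairs:
        key = (min(int(sc * bins), bins - 1), min(int(sl * bins), bins - 1))
        sums[key] += ss
        counts[key] += 1
        total += ss
    # precision: average semantic similarity within the cell
    precision = {k: sums[k] / counts[k] for k in sums}
    # recall: the cell's share of all semantic similarity
    recall = {k: sums[k] / total for k in sums} if total else {}
    return precision, recall


# made-up (sigma_c, sigma_l, sigma_s) triples
precision, recall = semantic_precision_recall(
    [(0.05, 0.05, 0.2), (0.05, 0.05, 0.4), (0.95, 0.95, 0.9)]
)
```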
[Figure: four panels (Business, Home, News, All pairs), each showing Precision and log Recall as functions of σc and σl]
[Figure: topic graph with nodes t1 to t8 and edge types T, S, R]
Better semantic similarity measure
Work w/Ana Maguitman, Heather Roinestad,
Alex Vespignani
Include cross-links (symbolic) and see-also
links (related)
Transitive closure of topic graph
Compute entropy based on fuzzy membership
matrix
$W = T^+ \circ G \circ T^+$
$\sigma_s(t_1, t_2) = \max_k \frac{2 \cdot \min(W_{k1}, W_{k2}) \cdot \log \Pr[t_k]}{\log(\Pr[t_1 \mid t_k] \cdot \Pr[t_k]) + \log(\Pr[t_2 \mid t_k] \cdot \Pr[t_k])}$
Differences
[Figure: actual difference and relative difference (%), comparing the graph-based measure σs^G with the tree-based measure σs^T]
         mean     stderr
tree     5.7%     0.8%
graph    84.7%    1.8%
Combining content & links
[Figure: Spearman rank correlation (y-axis, -0.1 to 0.6) versus σc threshold (x-axis, 0 to 1) for σc σl, σc H(σl), σl, 0.2 σc + 0.8 σl, 0.8 σc + 0.2 σl, and σc]
6S: peer distributed crawling and
collaborative searching
[Figure: peers A, B, and C exchanging queries and hits]
Data mining & referral opportunities
Emerging communities
Work with Le-Shin Wu & Ruj Akavipat
[Figure: networks of 70 peers and 500 peers; axis labels: cluster coefficient, diameter, average change (%), queries per peer, connections within groups, time]
WWW traffic network on Internet2
Work with Mark Meiss & Alex Vespignani
[Figure: log-log plots of <s_in(k_in)> versus k_in and <s_out(k_out)> versus k_out]
Web clients: super-linear growth of traffic handled as a function of number of connections
$s \sim k^{1.2}$
Web servers: no typical traffic (diverging average)
[Figure: log-log plot of P(s_in,S) versus s_in,S, with $P(s) \sim s^{-1.7}$]
[Figure: log-log plot of P(s_out,S) versus s_out,S, with $P(s) \sim s^{-1.8}$]
Thank you!
Questions?
http://informatics.indiana.edu/fil