Web Networks

Filippo Menczer
Department of Computer Science, School of Informatics
Indiana University
Research supported by NSF CAREER Award IIS-0348940
csn.indiana.edu

Outline
• Link network
• Lexical network
• Growth models
• Semantic network
• Peer search network
• Traffic network

Three network topologies
• Text
• Meaning
• Links

The Web as a text corpus
• Pages close in word vector space tend to be related
• Cluster hypothesis (van Rijsbergen 1979)
• The WebCrawler (Pinkerton 1994)
• The whole first generation of search engines
[Figure: pages p1 and p2 as vectors in a term space with axes "mass", "weapons", "destruction"]

Enter the Web’s link structure

Mining the Web’s link cues
• Pages that link to each other tend to be related
• Link-cluster conjecture (Menczer 1997)
• Link eigenvector analysis
  – HITS, hubs and authorities (Kleinberg & al 1998, …)
  – Google’s PageRank (Brin & Page 1998, …)
• The second generation of search engines

Web growth models
How are links created and why content matters

Preferential attachment
• “BA” model: $\Pr(i) \propto k(i)$ (Barabasi & Albert 1999, de Solla Price 1976)
• At each step $t$ add new page $p$
• Create $m$ new links from $p$ to $i$ ($i < t$)
• Rich-get-richer: $\Pr(i) = \frac{k(i)}{mt} \Longrightarrow \Pr(k) \sim k^{-\gamma}$ ?

Other growth models
Web copying (Kleinberg, Kumar & al 1999, 2000)
$\Pr(i) \propto \Pr(j) \cdot \Pr(j \to i)$
• same indegree distribution
• no need to know degree

Other growth models
Random mixture (Pennock & al. 2002, Cooper & Frieze 2001, Dorogovtsev & al 2000)
$\Pr(i) \propto \psi \cdot \frac{1}{t} + (1 - \psi) \cdot \frac{k(i)}{mt}$
• winners don’t take all
• general indegree distribution
• fits non-power-law cases

Other growth models
Mixture with Euclidean distance (HOT: Fabrikant, Koutsoupias & Papadimitriou 2002)
$i = \arg\min(\phi r_{it} + g_i)$
• tradeoff between centrality and geometric locality
• fits power-law in certain critical trade-off regimes
[Figure: example network layout for the HOT model; credit: Raissa D’Souza, Microsoft Research]

What about content?

Link probability vs lexical distance
$r = \frac{1}{\sigma_c} - 1$
$\Pr(\lambda \mid \rho) = \frac{|\{(p,q) : r = \rho \wedge \sigma_l > \lambda\}|}{|\{(p,q) : r = \rho\}|}$
• Phase transition at $\rho^*$
• Power law tail: $\Pr(\lambda \mid \rho) \sim \rho^{-\alpha(\lambda)}$
Proc. Natl. Acad. Sci. USA 99(22): 14014-14019, 2002

Local content-based growth model
$\Pr(p_t \to p_{i<t}) = \begin{cases} \frac{k(i)}{mt} & \text{if } r(p_i, p_t) < \rho^* \\ c\,[r(p_i, p_t)]^{-\alpha} & \text{otherwise} \end{cases}$
• Similar to preferential attachment (BA)
• Use degree info (popularity/importance) only for nearby (similar/related) pages
[Figure: degree distributions with fitted exponents $\gamma = 4.3 \pm 0.1$ and $\gamma = 4.26 \pm 0.07$]

So, many models can predict degree distributions...
Which is “right”?
Need an independent observation (other than degree) to validate models

Distribution of content similarity across linked pairs
Across all pairs: $\Pr(\sigma_c) \sim 10^{-7\sigma_c}$ (Why?!?)
None of these models is right!

Back to the mixture model
Degree-uniform mixture: $\Pr(i) \propto \psi \cdot \frac{1}{t} + (1 - \psi) \cdot \frac{k(i)}{mt}$
[Figure: panels a, b, c showing a new node t choosing among existing nodes i1, i2, i3]
Bias choice by content similarity instead of uniform distribution

Degree-similarity mixture model
$\Pr(i) \propto \psi \cdot \hat{\Pr}(i) + (1 - \psi) \cdot \frac{k(i)}{mt}$, with $\hat{\Pr}(i) \propto [r(i, t)]^{-\alpha}$
$\psi = 0.2$, $\alpha = 1.7$
Build it... (M.M.)
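Sketch: one growth step of the degree-similarity mixture model
A minimal Python sketch of the link-choice rule $\Pr(i) \propto \psi \cdot \hat{\Pr}(i) + (1 - \psi) \cdot k(i)/(mt)$ with $\hat{\Pr}(i) \propto [r(i,t)]^{-\alpha}$. The function names `mixture_link_probs` and `sample_targets` are hypothetical, and the lexical distances $r(i,t)$ are assumed to be precomputed and strictly positive. Targets are drawn with replacement, which is a simplification and not something the slides specify.

```python
import random

def mixture_link_probs(degrees, distances, psi=0.2, alpha=1.7, m=1):
    """Link probabilities for a new page t over existing pages i < t.

    degrees[i]   -- current degree k(i) of page i
    distances[i] -- lexical distance r(i, t) > 0 between page i and the new page
    """
    t = len(degrees)
    # Similarity-biased term: P_hat(i) proportional to r(i, t)^(-alpha)
    sim_weights = [r ** (-alpha) for r in distances]
    sim_total = sum(sim_weights)
    p_hat = [w / sim_total for w in sim_weights]
    # Preferential-attachment term: k(i) / (m t)
    degree_term = [k / (m * t) for k in degrees]
    # Mix the two terms, then normalize so the result sums to 1
    raw = [psi * ph + (1.0 - psi) * dt for ph, dt in zip(p_hat, degree_term)]
    total = sum(raw)
    return [x / total for x in raw]

def sample_targets(probs, m=1, seed=None):
    """Draw m link targets (with replacement) according to probs."""
    rng = random.Random(seed)
    return rng.choices(range(len(probs)), weights=probs, k=m)

# Example: three existing pages; page 1 is lexically closest to the new page.
probs = mixture_link_probs(degrees=[3, 1, 1], distances=[4.0, 0.5, 2.0])
targets = sample_targets(probs, m=2, seed=0)
```

Setting `psi=0.0` recovers pure preferential attachment; setting `psi=1.0` links by content similarity alone.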
Both mixture models get the degree distribution right…
…but the degree-similarity mixture model predicts the similarity distribution better
Proc. Natl. Acad. Sci. USA 101: 5261-5265, 2004

Citation networks
15,785 articles published in PNAS between 1997 and 2002

What now?
• Understand exponential distribution of similarity
• Growth model to explain evolution of both link topology and content similarity
With Alex Vespignani & Sandro Flammini

Mapping the relationship between links, content, and semantic topologies
• Given any pair of pages, need ‘similarity’ or ‘proximity’ metric for each topology:
  – Content: textual/lexical (cosine) similarity
  – Link: co-citation/bibliographic coupling
  – Semantic: relatedness inferred from manual classification
• Data: Open Directory Project (dmoz.org)
  – ~ 1 M pages after cleanup
  – ~ $1.3 \times 10^{12}$ page pairs!

Content similarity
$\sigma_c(p_1, p_2) = \frac{p_1 \cdot p_2}{\|p_1\| \cdot \|p_2\|}$
[Figure: pages p1 and p2 as vectors in a term space with axes term i, term j, term k]

Link similarity
$\sigma_l(p_1, p_2) = \frac{|U_{p_1} \cap U_{p_2}|}{|U_{p_1} \cup U_{p_2}|}$
[Figure: link neighborhoods of pages p1 and p2]

Semantic similarity
• Information-theoretic measure based on classification tree (Lin 1998)
• Classic path distance in special case of balanced tree
$\sigma_s(c_1, c_2) = \frac{2 \log \Pr[\mathrm{lca}(c_1, c_2)]}{\log \Pr[c_1] + \log \Pr[c_2]}$
[Figure: classification tree with root "top", topics c1 and c2, and their lowest common ancestor "lca"]

Correlations between similarities
[Figure: content-link, content-semantic, and link-semantic correlations (axis 0 to 0.5) by topic: News, Home, Reference, Sports, Computers, Kids and Teens, Recreation, Health, Adult, Shopping, Science, Society, Business, Arts, Games, and All pairs]
European Physical Journal B 38(2): 211-221, 2004

Precision = | Retrieved & Relevant | / | Retrieved |
Recall = | Retrieved & Relevant | / | Relevant |

Averaging semantic similarity:
$P(s_c, s_l) = \frac{\sum_{\{p,q : \sigma_c = s_c, \sigma_l = s_l\}} \sigma_s(p,q)}{|\{p,q : \sigma_c = s_c, \sigma_l = s_l\}|}$

Summing semantic similarity:
$R(s_c, s_l) = \frac{\sum_{\{p,q : \sigma_c = s_c, \sigma_l = s_l\}} \sigma_s(p,q)}{\sum_{\{p,q\}} \sigma_s(p,q)}$

[Figures: precision and log recall as functions of $\sigma_c$ and $\sigma_l$; panels: Business, Home, News, All pairs]

Better semantic similarity measure
Work w/Ana Maguitman, Heather Roinestad, Alex Vespignani
• Include cross-links (symbolic) and see-also links (related)
• Transitive closure of topic graph
• Compute entropy based on fuzzy membership matrix
$W = T^+ \odot G \odot T^+$
$\sigma_s(t_1, t_2) = \max_k \frac{2 \cdot \min(W_{k1}, W_{k2}) \cdot \log \Pr[t_k]}{\log(\Pr[t_1 | t_k] \cdot \Pr[t_k]) + \log(\Pr[t_2 | t_k] \cdot \Pr[t_k])}$
[Figure: topic graph with topics t1 to t8 and edge types T, S, R]

Differences
[Figure: graph-based $\sigma_s$ (G) vs tree-based $\sigma_s$ (T); actual difference and relative difference (%)]

        mean    stderr
tree    5.7%    0.8%
graph   84.7%   1.8%

Combining content & links
[Figure: Spearman rank correlation vs $\sigma_c$ threshold for $\sigma_c \sigma_l$, $\sigma_c H(\sigma_l)$, $\sigma_l$, $0.2\,\sigma_c + 0.8\,\sigma_l$, $0.8\,\sigma_c + 0.2\,\sigma_l$, and $\sigma_c$]

6S: peer distributed crawling and collaborative searching
• Data mining & referral opportunities
• Emerging communities
Work with Le-Shin Wu & Ruj Akavipat
[Figure: peers A, B, C exchanging queries and query hits]
[Figure: Pajek visualizations of the peer network; labels: 70 peers, 500 peers]
[Figure: average change (%) in cluster coefficient and diameter vs queries per peer]
[Figure: connections within groups vs time]

WWW traffic network on Internet2
Work with Mark Meiss & Alex Vespignani
Web clients: super-linear growth of traffic handled as a function of number of connections
[Figures: $\langle s_{in}(k_{in}) \rangle$ vs $k_{in}$ and $\langle s_{out}(k_{out}) \rangle$ vs $k_{out}$, log-log axes, with fit $s \sim k^{1.2}$]

Web servers: no typical traffic (diverging average)
[Figures: $P(s_{in,S})$ with $P(s) \sim s^{-1.7}$ and $P(s_{out,S})$ with $P(s) \sim s^{-1.8}$, log-log axes]

Thank you! Questions?
http://informatics.indiana.edu/fil
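
Backup sketch: the three pairwise similarity measures
A minimal Python sketch of the content (cosine), link (set overlap), and tree-based semantic (Lin 1998) similarities defined in the mapping slides. The function names and input representations are assumptions made for illustration: pages are term-weight dictionaries, link neighborhoods are plain sets, and topic probabilities are supplied directly instead of being computed from the directory.

```python
import math

def cosine_similarity(p1, p2):
    """sigma_c: cosine between two pages given as {term: weight} dicts."""
    dot = sum(w * p2.get(term, 0.0) for term, w in p1.items())
    norm1 = math.sqrt(sum(w * w for w in p1.values()))
    norm2 = math.sqrt(sum(w * w for w in p2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # convention chosen here for empty pages
    return dot / (norm1 * norm2)

def link_similarity(u1, u2):
    """sigma_l: |U1 & U2| / |U1 | U2| for two link-neighborhood sets."""
    union = u1 | u2
    return len(u1 & u2) / len(union) if union else 0.0

def lin_similarity(pr_c1, pr_c2, pr_lca):
    """sigma_s: 2 log Pr[lca] / (log Pr[c1] + log Pr[c2]) on a topic tree."""
    denom = math.log(pr_c1) + math.log(pr_c2)
    if denom == 0.0:
        return 1.0  # both topics are the root (probability 1)
    return 2.0 * math.log(pr_lca) / denom

# Example with made-up inputs
sigma_c = cosine_similarity({"web": 1.0, "graph": 2.0}, {"graph": 1.0, "crawl": 1.0})
sigma_l = link_similarity({"a", "b", "c"}, {"b", "c", "d"})
sigma_s = lin_similarity(pr_c1=0.01, pr_c2=0.02, pr_lca=0.1)
```

All three measures fall in [0, 1] for non-negative term weights and for an lca probability at least as large as that of either topic.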