The web as a graph: structure and interpretation. Sridhar Rajagopalan IBM Almaden

Ravi Kumar, Prabhakar Raghavan, Andrew Tomkins (IBM, Almaden)
Andrei Broder, Farzin Maghoul (AltaVista Corp.)
Raymie Stata, Janet Wiener (Compaq SRC)
Eli Upfal (Brown University)
A Picture of (~200M) pages.
Part A: Structure.
• The graph.
• The Questions ….
• … and some Answers.
• The picture.
The web graph
• Nodes (or vertices) = web pages. Edges = (non-nepotistic) links.
• The graph = all web pages and links.
• Many nodes: estimates range from 500M to over 1B.
• Very sparse: average links/page is between 5 and 10.
• Average (links/page | more than 6 links) > 30.
• Concentrate on graph structure, ignore content.
Questions about the web graph
• How big is the graph? How many links on a page (outdegree)? How many links to a page (indegree)?
• Can one browse from any web page to any other? How many clicks?
• Can we pick a random page on the web?
• How different is browsing from a “random walk”?
• Can we exploit the structure of the web graph for searching and mining?
• What does the web graph reveal about the social processes which result in its creation and dynamics?
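Power laws: How many pages point to a random page on the web?
• Indegrees: $I(u) = |\{v : v \to u\}|$, with $\Pr_u(I(u) = k) \propto k^{-2.1}$.
[Figure: log-log plot of the indegree distribution. Slope = 2.1]
How many links on a page?
[Figure: log-log plot of the outdegree distribution. Slope = 2.7]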
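A minimal Python sketch of how an indegree distribution and its log-log slope might be measured. The edge-list input, the function names, and the least-squares fit are illustrative assumptions, not the procedure used on the crawl; a least-squares line on log-log data is only a rough estimator of a power-law exponent.

```python
import math
from collections import Counter


def indegree_distribution(edges, nodes):
    """Return {k: fraction of nodes whose indegree is k}.

    edges: iterable of (source, destination) pairs (hypothetical input format).
    nodes: list of all node ids, so that indegree-0 nodes are counted too.
    """
    indeg = Counter(v for _, v in edges)
    counts = Counter(indeg.get(u, 0) for u in nodes)
    n = len(nodes)
    return {k: c / n for k, c in counts.items()}


def loglog_slope(dist):
    """Least-squares slope of log(fraction) against log(k), using k >= 1 only."""
    pts = [(math.log(k), math.log(f)) for k, f in dist.items() if k >= 1 and f > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return sxy / sxx
```

On data that follows $k^{-2.1}$ exactly, `loglog_slope` returns a value close to -2.1; the "Slope = 2.1" on the plot is the magnitude of that number.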
Yule/Pareto/Zipf and power laws.
• Inverse polynomial tail.
• Word frequency in text: Yule (later Mandelbrot), Statistical study of the literary vocabulary [Yule, 1944].
• Citation analysis [Lotka, 1926].
• Zipf, Human behavior and the principle of least effort [Zipf, 1949].
• Pareto, Cours d’économie politique [Pareto, 1897].
• Network graph [Faloutsos-Faloutsos-Faloutsos, 1999].
• Oligonucleotide sequences [Martindale-Konopka, 1996].
• Many other instances.
More Germane
• Access statistics for web pages. (From server logs) [Glassman97]
• User behavior (by instrumenting browsers and proxies) [Lukose-Huberman 98; Crovella and others, 97-99]
• Earliest analytical model, [Herb Simon, 1955].
Co-citation and Bibliographic coupling:
Signature of a community.
• Bipartite cores: small “complete” bipartite subgraphs.
• Bibliographic coupling, Co-citation analysis.
• Hubs and Authorities.
[Figure: a K(3,3) core]
Uses:
• Web searching (HITS/Clever).
• Mining communities (Campfire project).
• Backlink browsing, “find similar.”
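A small Python sketch of the two classical measures named on this slide, assuming the graph is given as a list of (source, destination) links. Function names and the input format are illustrative assumptions. Two pages are co-cited when some third page links to both; two pages are bibliographically coupled when both link to some third page.

```python
from collections import defaultdict
from itertools import combinations


def cocitation_counts(edges):
    """Map each pair (a, b) to the number of pages that link to both a and b."""
    out = defaultdict(set)
    for u, v in edges:
        out[u].add(v)          # a set, so duplicate links count once
    counts = defaultdict(int)
    for targets in out.values():
        for a, b in combinations(sorted(targets), 2):
            counts[(a, b)] += 1
    return dict(counts)


def coupling_counts(edges):
    """Map each pair (a, b) to the number of pages that both a and b link to."""
    # Bibliographic coupling is co-citation on the graph with every link reversed.
    return cocitation_counts([(v, u) for u, v in edges])
```

In a K(3,3) core, every pair of right-hand pages has a co-citation count of at least 3, and every pair of left-hand pages has a coupling count of at least 3.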
Small world.
• (Small World Prediction) [Barabasi and Albert 99, Albert-Jeong-Barabasi 99]. Based on a simple model, they predict that most pages are within 19 links of each other, and justify the model by a crawl of nd.edu.
Facts (about the crawl).
• Most of the time (75%) a random page u is not reachable from
another random page v.
• Indegree and outdegree distributions follow power laws, consistent over time and scale.
Component sizes.
• Component sizes are also distributed according to a power law.
Reachability
• How many vertices are reachable from a random vertex?
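A sketch of how the reachability questions could be explored on an edge list: breadth-first search gives the set of vertices reachable from one start vertex, and sampling random ordered pairs estimates how often one page cannot reach another. All names, the sampling scheme, and the default sample size are assumptions for illustration.

```python
import random
from collections import defaultdict, deque


def build_adjacency(edges):
    """Adjacency lists for a directed graph given as (source, destination) pairs."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
    return adj


def reachable_from(adj, start):
    """Set of vertices reachable from start by following links forward (BFS)."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def unreachable_pair_fraction(adj, nodes, samples=1000, seed=0):
    """Estimate the fraction of random ordered pairs (u, v) with no path u -> v."""
    rng = random.Random(seed)
    nodes = list(nodes)
    misses = 0
    for _ in range(samples):
        u, v = rng.sample(nodes, 2)
        if v not in reachable_from(adj, u):
            misses += 1
    return misses / samples
```

Running one BFS per sampled pair is affordable only on small graphs; on a crawl of web scale one would run a limited number of searches from random start vertices instead.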
A Picture of (~200M) pages.
Part B: Interpretation
• Random graph theory.
• Application 1: The Campfire Project.
• Application 2: Classical IR/Learning.
Random Graphs
• Erdős and Rényi’s $G_{n,p}$ model [Bollobás].
• Graph with n vertices.
• Each of the n(n-1) arcs appears independently with probability p.
• Graphical evolution [Palmer]: study properties of the resulting random graph as p is increased from 0 to 1.
• [Shelah and Spencer] 0-1 law: most properties exhibit threshold (“phase change”-like) behavior in p.
Facts about the Erdos-Renyi model
• A random graph with average degree 4 has a giant connected
component containing almost all (90%) of the vertices.
• Indegrees and outdegrees are concentrated around the mean and have exponentially declining tails.
• Most vertices in the graph are close to most others (small world).
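These facts can be checked empirically with a small simulation. The sketch below samples a directed $G_{n,p}$ graph and measures the largest weakly connected component (links treated as undirected) with a union-find structure. Reading "connected component" as weak connectivity, the function names, and the choice of p are assumptions made for illustration.

```python
import random


def random_digraph(n, p, seed=0):
    """Sample G(n, p): each of the n(n-1) possible arcs is present with probability p."""
    rng = random.Random(seed)
    return [(u, v) for u in range(n) for v in range(n)
            if u != v and rng.random() < p]


def largest_weak_component_fraction(n, arcs):
    """Fraction of vertices in the largest component, ignoring arc direction."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u, v in arcs:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
    sizes = {}
    for x in range(n):
        root = find(x)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / n


def indegrees(n, arcs):
    """List of indegrees, indexed by vertex."""
    deg = [0] * n
    for _, v in arcs:
        deg[v] += 1
    return deg
```

For example, `random_digraph(n, 4 / (n - 1))` gives every vertex an expected outdegree of 4 (one possible reading of "average degree 4"). Comparing `max(indegrees(n, arcs))` with the mean shows how tightly the degrees cluster, in contrast to the power-law tails measured on the web.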
A new random graph model.
Content creation hypothesis
• Some page creators create content without regard to what exists
on the web.
• Many create pages which are inspired by pre-existing content.
• Effectively, some links are random, others are copied from preexisting pages.
Probabilistic analysis: Evolving graphs.
• Creation and Deletion processes for nodes and edges.
– e.g. at each time step, a new node is created with a fixed probability $p_v$
– at each time step, a new edge is created with probability $p_e$; the new edge
• links two random nodes with probability $1 - \alpha$
• links to a node chosen in proportion to its indegree with probability $\alpha$ (i.e., copies a random link).
– at each time step, a node (resp. edge) is deleted with probability $p_u$ (resp. $p_d$)
Simple model: creation probabilities are 1 and deletion probabilities are 0.
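A rough Python sketch of the simple model (one node and one edge created per time step, no deletions). Several details are assumptions, since the slide does not pin them down: the mixing parameter is called `alpha` here, the source of every edge is chosen uniformly at random, and "in proportion to indegree" is implemented by copying the destination of a uniformly random existing link. Self-loops and repeated edges are allowed for simplicity.

```python
import random
from collections import Counter


def simulate_simple_model(steps, alpha, seed=0):
    """Grow a graph for `steps` time steps; return (number of nodes, edge list)."""
    rng = random.Random(seed)
    n_nodes = 0
    edges = []
    for _ in range(steps):
        n_nodes += 1                       # node creation probability is 1
        src = rng.randrange(n_nodes)       # assumption: uniform random source
        if edges and rng.random() < alpha:
            # Copy the destination of a random existing link: a node is picked
            # with probability proportional to its current indegree.
            dst = rng.choice(edges)[1]
        else:
            dst = rng.randrange(n_nodes)   # uniform random destination
        edges.append((src, dst))           # edge creation probability is 1
    return n_nodes, edges


def max_indegree(edges):
    """Largest indegree in the edge list."""
    return Counter(v for _, v in edges).most_common(1)[0][1]
```

With `alpha` near 0 the indegrees stay small, as in the Erdős-Rényi picture; with larger `alpha` a few nodes accumulate very high indegree, which is the qualitative behavior the copying hypothesis is meant to explain.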

Theory
1. The indegree distribution has an inverse polynomial tail; moreover, the exponent of the tail depends only on $\alpha$ and can be expressed by the formula:
Pru ( I (u )  k )  k
 ( 2 
1
)
2. Almost surely, the distribution converges to this in the limit.
3. Almost surely, the graph has a linear sized SCC.
Why study models?
• Good predictors of macroscopic behavior.
– Degree distributions. Existence and number of cores. [WWW8]
• Algorithmic advantages (speed and accuracy).
– Better and analyzable algorithmic methods. Inclusion-Exclusion pruning.
[VLDB].
– Applications to Data Mining.
• Better understanding of the data/corpus.
– What is “surprising” depends on what is typical. To find interesting stuff,
you must know what is expected.
Be careful about...
• Predicting and analyzing microscopic properties.
– Microscopic: properties which can be changed by the addition/deletion of a few nodes/edges/features.
• Examples: Diameter and girth, rare terms and features.
• Very susceptible to noise and systematic but small inconsistencies in
the model.
– Macroscopic: major dataset surgery is required to significantly alter the property.
• Examples: Degree distributions. Connectivity.
• Law of large numbers or equivalent applies.
Application 1: The campfire project.
Co-citation: Signature of a community.
• Bipartite cores: small “complete” bipartite subgraphs.
[Figure: a K(3,3) core]
Fact: Let $K(i,j)$ denote a core with i left-hand vertices and j right-hand vertices. If $i + j < ij$, then a sparse $G_{n,p}$ random graph almost surely contains no $K(i,j)$.
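A short first-moment argument shows why such a fact holds. It is a sketch under stated assumptions: "sparse" is taken to mean $p = c/n$ for a constant c, i and j are fixed, the condition is read as $i + j < ij$, and a core $K(i,j)$ means i vertices that each link to all of j other vertices.

```latex
% Assumptions: p = c/n with c constant, i and j fixed.
% There are at most \binom{n}{i}\binom{n}{j} placements of a core,
% and each placement needs ij specific arcs, each present independently w.p. p.
\begin{align*}
\mathbb{E}\bigl[\#K(i,j)\bigr]
  &\le \binom{n}{i}\binom{n}{j}\, p^{ij}
   \le n^{i+j} \left(\frac{c}{n}\right)^{ij}
   = c^{ij}\, n^{\,i+j-ij}, \\
\Pr\bigl[\text{some } K(i,j) \text{ exists}\bigr]
  &\le \mathbb{E}\bigl[\#K(i,j)\bigr] \longrightarrow 0
  \quad \text{as } n \to \infty \text{ whenever } i + j < ij .
\end{align*}
```

For K(3,3) the bound is $c^{9} n^{-3}$, so a single K(3,3) is already a surprise in a sparse random graph of web size.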
Campfire project
• Automatically find and organize communities on the web.
• Approach:
– Find all cores.
– Grow cores into the full community.
– Do IR/Categorization/Clustering etc. to organize the community space.
[KRRT] WWW8, and [KRRT] VLDB’99.
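A brute-force Python sketch of the "find all cores" step on a small edge list. It is not the trawling algorithm of [KRRT]; it only illustrates the idea of pruning by degree and then enumerating. A page can be a left-hand vertex ("fan") of a K(i,j) only if it has at least j out-links, and a right-hand vertex ("center") only if it has at least i in-links. The names `prune` and `find_cores` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def prune(edges, i, j):
    """Repeatedly drop links whose source has < j out-links or target has < i in-links."""
    edges = set(edges)
    while True:
        outdeg = defaultdict(int)
        indeg = defaultdict(int)
        for u, v in edges:
            outdeg[u] += 1
            indeg[v] += 1
        kept = {(u, v) for u, v in edges if outdeg[u] >= j and indeg[v] >= i}
        if kept == edges:
            return kept
        edges = kept


def find_cores(edges, i, j):
    """Return (fans, centers) pairs: each set of j centers shared by at least i fans."""
    out = defaultdict(set)
    for u, v in prune(edges, i, j):
        out[u].add(v)
    fans_of = defaultdict(set)
    for u, targets in out.items():
        # Exponential in j; acceptable only for the small j used for cores.
        for centers in combinations(sorted(targets), j):
            fans_of[centers].add(u)
    return [(sorted(fans), list(centers))
            for centers, fans in fans_of.items() if len(fans) >= i]
```

Each returned pair lists all fans that share the given j centers, so any i of those fans together with the centers form a K(i,j).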
The cores are interesting.
Explicit communities.
• Yahoo!, Excite, Infoseek
• webrings
• news groups
• mailing lists
(1) Implicit communities are defined by cores.
(2) There are an order of magnitude more of these. There are efficient heuristics to compute all cores.
(3) Can grow the core to the community using Clever.
Implicit communities
• hotels in costa rica
• clipart
• japanese elementary schools
• turkish student associations
• oil spills off the coast of japan
• australian fire brigades
• aviation/aircraft vendors
• guitar manufacturers
Costa Rican Hotels.
• The Costa Rica Inte...ion on arts, busi...
• Informatica Interna...rvices in Costa Rica
• Cocos Island Research Center
• Aero Costa Rica
• Hotel Tilawa - Home Page
• COSTA RICA BY INTER@MERICA
• tamarindo.com
• Costa Rica
• New Page 5
• The Costa Rica Internet Directory.
• Costa Rica, Zarpe Travel and Casa Maria
• Si Como No Resort Hotels & Villas
• Apartotel El Sesteo... de San José, Cos...
• Spanish Abroad, Inc. Home Page
• Costa Rica's Pura V...ry - Reservation ...
• YELLOW\RESPALDO\HOTELES\Orquide1
• Costa Rica - Summary Profile
• COST RICA, MANUEL A...EPOS: VILLA

• Hotels and Travel in Costa Rica
• Nosara Hotels & Res...els & Restaurants...
• Costa Rica Travel, Tourism & Resorts
• Association Civica de Nosara
• http://www...ca/hotels/mimos.html
• Costa Rica, Healthy...t Pura Vida
• Domestic & International Airline
• HOTELES / HOTELS - COSTA RICA
• tourgems
• Hotel Tilawa - Links
• Costa Rica Hotels T...On line Reservations
• Yellow pages Costa ...Rica Export
• INFOHUB Costa Rica Travel Guide
• Hotel Parador, Manuel Antonio, Costa Rica
• Destinations
Elementary Schools in Japan
• The American School in Japan
• The Link Page
• …崎市立井田小学校ホームページ (municipal Ida Elementary School home page)
• Kids' Space
• 安城市立安城西部小学校 (Anjo Municipal Anjo Seibu Elementary School)
• 宮城教育大学附属小学校 (Elementary School attached to Miyagi University of Education)
• KEIMEI GAKUEN Home Page ( Japanese )
• Shiranuma Home Page
• fuzoku-es.fukui-u.ac.jp
• welcome to Miasa E&J school
• 神奈川県・横浜市立中川西小学校のペ... (Kanagawa Prefecture, Yokohama Municipal Nakagawa-Nishi Elementary School page)
• http://www...p/~m_maru/index.html
• fukui haruyama-es HomePage
• Torisu primary school
• goo
• Yakumo Elementary,Hokkaido,Japan
• FUZOKU Home Page
• Kamishibun Elementary School...

• schools
• LINK Page-13
• 日本の学校 (Schools in Japan)
• …延…小学校ホームページ (elementary school home page)
• 100 Schools Home Pages (English)
• K-12 from Japan 10/...rnet and Education )
• http://www...iglobe.ne.jp/~IKESAN
• ＭＧＫ小学校６年１組物語 (MGK Elementary School, Grade 6 Class 1 Story)
• …蒲町立…蒲東小学校 (town-run elementary school)
• Koulutus ja oppilaitokset
• TOYODA HOMEPAGE
• Education
• Cay's Homepage(Japanese)
• 幌南小学校のホームページ (Konan Elementary School home page)
• UNIVERSITY
• 雨竜小学校 DRAGON97-TOP (Uryu Elementary School)
• 岡…小学校５年１組ホームページ (elementary school, Grade 5 Class 1 home page)
• 教育実践 メニュー メニュー (Educational practice: menu)
Application 2: Classical Learning/IR.
Vector space and other classical models.
• Document is a vector in a real-valued space with dimensions identified with “features.” [Salton]
$\vec{x} \in \mathbb{R}^{F}$, $F = \{\text{features}\}$
Some notion of similarity, usually cosine or dot-product.
$D(\vec{x}, \vec{y}) = \cos(\vec{x}, \vec{y})$
Built-in assumption: features are independent.
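A minimal Python sketch of the vector space model with cosine similarity, taking whitespace-separated lowercase terms as the features and raw term counts as the weights. Both choices are simplifying assumptions; practical systems apply term weighting.

```python
import math
from collections import Counter


def term_vector(text):
    """Sparse document vector: one dimension per term, value = raw term count."""
    return Counter(text.lower().split())


def cosine(x, y):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * y[term] for term, weight in x.items() if term in y)
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)
```

Treating each term as its own axis is exactly the built-in independence assumption: "hotel" and "hotels" contribute nothing to each other's similarity unless the features are merged beforehand.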
Uses of the Vector Space model.
• Search, Clustering, Classification.
• Term weighting. [Salton, Dumais, Sparck-Jones]
• SVD (for instance, LSI [Deerwester et al.]).
• Gaussian assumption and classification (for instance, [Koller-Sahami], [Chakrabarti et al.]).
• Many ad-hoc methods and heuristics, some of which work remarkably well. [Modha et al.]
• Clustering. [Drineas et al.]
• Dimensionality reduction. Feature selection. [Johnson-Lindenstrauss, Koller-Sahami, Chakrabarti et al. and others]
Two (new ?) ingredients.
• Hypertext -- the graph.
• Zipfian distributions on term occurrences.
Hypertext Classification/Clustering.
• Class of a page is a function of text + class of neighbor set.
– Classification problem -- Markov Random fields. [Chakrabarti-Dom-Indyk]
– Clustering problem -- [Modha]
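A toy Python sketch of the idea that a page's class depends on its own text plus the classes of its neighbors. This is a simple iterative voting scheme written for illustration, not the Markov random field method of [Chakrabarti-Dom-Indyk]; the two-class setup, the `weight` parameter, and the input formats are all assumptions.

```python
def classify_with_neighbors(text_score, neighbors, weight=0.5, rounds=10):
    """Label pages +1 or -1 from a text-only score blended with neighbor labels.

    text_score: {page: score in [-1, 1]} from a text classifier (hypothetical input).
    neighbors:  {page: list of linked pages}.
    weight:     how much the neighbors' average label counts against the text score.
    """
    labels = {p: (1 if s >= 0 else -1) for p, s in text_score.items()}
    for _ in range(rounds):
        updated = {}
        for page, score in text_score.items():
            votes = [labels[q] for q in neighbors.get(page, ()) if q in labels]
            vote = sum(votes) / len(votes) if votes else 0.0
            combined = (1 - weight) * score + weight * vote
            updated[page] = 1 if combined >= 0 else -1
        if updated == labels:      # labels stopped changing
            break
        labels = updated
    return labels
```

A page whose text is ambiguous but whose neighbors are confidently labeled gets pulled toward its neighbors' class; with `weight=0` the scheme reduces to text-only classification.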
Research Issue
• Rework applications in this new (or rather, old) context. OR
• Explain why the standard algorithms continue to work despite the sometimes questionable assumptions behind their derivation.