The Web Graph and PageRank

advertisement
The Web Graph and PageRank
Networked Life
Prof. Michael Kearns
Lecture Roadmap
•
•
•
•
Viewing the Web as a directed graph or network
Structure of the Web
Measures of vertex importance in a network
Google’s PageRank algorithm
Directed Networks
• The web graph is a directed network:
–
–
–
–
page A can point/link to page B, without a reciprocal link
represent directed edges with arrows
each web page/site thus has “in-links” and “out-links”
in-degree = # of in-links; out-degree = # of out-links
B
D
A
C
E
F
The Web as Network
• What is the structure of this network?
– connected components
– degree distributions
– etc.
• What does it say about the people building and using it?
– page and link generation
– visitation statistics
• What are the algorithmic consequences?
– web search
– community identification
Graph Structure in the Web
[Broder et al. paper]
• Report on the results of two massive “web crawls”
• Executed by AltaVista in May and October 1999
• Details of the crawls:
–
–
–
–
automated script following hyperlinks (URLs) from pages found
large set of starting points collected over time
crawl implemented as breadth-first search
have to deal with webspam, infinite paths, timeouts, duplicates, etc.
• May ’99 crawl:
– 200 million pages, 1.5 billion links
• Oct ’99 crawl:
– 271 million pages, 2.1 billion links
• Unaudited, self-reported Sep ’03 stats:
– 3 major search engines claim > 3 billion pages indexed
Five Easy Pieces
• Authors did two kinds of breadth-first search:
– ignoring link direction  weak connectivity
– only following forward links  strong connectivity
• They then identify five different regions of the web:
– strongly connected component (SCC):
• can reach any page in SCC from any other in directed fashion
– component IN:
• can reach any page in SCC in directed fashion, but not reverse
– component OUT:
• can be reached from any page in SCC, but not reverse
– component TENDRILS:
• weakly connected to all of the above, but cannot reach SCC or be
reached from SCC in directed fashion (e.g. pointed to by IN)
– SCC+IN+OUT+TENDRILS form weakly connected component (WCC)
– everything else is called DISC (disconnected from the above)
– here is a visualization of this structure
Size of the Five
•
•
•
•
•
•
•
SCC: ~56M pages, ~28%
IN: ~43M pages, ~ 21%
OUT: ~43M pages, ~21%
TENDRILS: ~44M pages, ~22%
DISC: ~17M pages, ~8%
WCC > 91% of the web --- the giant component
One interpretation of the pieces:
– SCC: the heart of the web
– IN: newer sites not yet discovered and linked to
– OUT: “insular” pages like corporate web sites
Diameter Measurements
• Directed worst-case diameter of the SCC:
– at least 28
• Directed worst-case diameter of IN  SCC  OUT:
– at least 503
• Over 75% of the time, there is no directed path between a
random start and finish page in the WCC
– when there is a directed path, average length is 16
• Average undirected distance in the WCC is 7
• Moral:
– web is a “small world” when we ignore direction
– otherwise the picture is more complex
Degree Distributions
• They are, of course, heavy-tailed
• Power law distribution of component size
– consistent with the Erdos-Renyi model
• Undirected connectivity of web not reliant on “connectors”
– what happens as we remove high-degree vertices?
Important Vertices
• So far: have emphasized macroscopic aspects of network structure
–
–
–
–
degree distribution: all degrees
diameter: average over all vertex pairs
connectivity: giant component with most vertices
clustering: average over all vertices, compare to overall edge density
–
–
–
–
high degree
link between communities
betweeness centrality
Google’s PageRank algorithm
• Also interesting to identify “important” individuals within the network
• Some purely structural definitions of importance:
Background on Web Search
• Most important sources of information: words in query and on web pages
• For instance, on query “mountain biking”:
– realize “bike” and “bicycle” are synonymous under context “mountain”
– but “bike” and “motorcylce” are not
– find documents with these words and their correlates (“Trek”, “trails”)
• Subject of the field of information retreival
• But many other “features” or “signals” may be useful for identifying
good or useful sites:
– font sizes
– frequency of exclamation marks
• PageRank idea: use the link structure of the web
Directed Networks
• Q: What constitutes an “important” vertex in a directed network?
– could just use in-degree; view directed links as referrals
• PageRank answer: an important page is pointed to by lots of other
important pages
• This, of course, is circular…
B
D
A
C
E
F
The PageRank Algorithm
•
•
•
•
•
Suppose p and q are pages where q  p
Let R(q) be rank of q; let out(q) be out-degree of q
Idea: q “distributes” its rank over its out-links
Each out-link of q receives R(q)/out(q)
So then p’s rank should be:
R( p) 
 R(q) /out(q)
qPOINTS( p )
B

D
A
C
E
F
The PageRank Algorithm
• What guarantee do we have that all these equations are consistent?
• Idea: turn the equations into an update rule (algorithm)
• At each time step, pick some vertex p to update, and perform:
R( p)
 R(q) /out(q)
qPOINTS( p )
• Right-hand side R(q) are current/frozen values; R(p) is new rank of p
• Initialize all R(p) to 1
• Claim: Under broad conditions, this algorithm will converge:
– at some point, updates no longer change any values
– have found solution to all the R(p) equations

0.31
0.15
B
C
A
0.38
E
D
0.38
F
0.64
0.54
0.15
G
0.21
0.40
B
C
A
0.44
E
D
0.44
F
0.70
0.67
Summary
• PageRank defines importance in a circular or self-referential fashion
• Circularity is broken with a simple algorithm that provably converges
• PageRank defines R(p) globally; more subtle than local properties of p
Download