The Web as Network
Networked Life
CSE 112
Spring 2005
Prof. Michael Kearns
The Web as Network
• Consider the web as a network
– vertices: individual (html) pages
– edges: hyperlinks between pages
– will view as both a directed and undirected graph
• What is the structure of this network?
– connected components
– degree distributions
– etc.
• What does it say about the people building and using it?
– page and link generation
– visitation statistics
• What are the algorithmic consequences?
– web search
– community identification
Graph Structure in the Web
[Broder et al. paper]
• Report on the results of two massive “web crawls”
• Executed by AltaVista in May and October 1999
• Details of the crawls:
– automated script following hyperlinks (URLs) from pages found
– large set of starting points collected over time
– crawl implemented as breadth-first search
– have to deal with spam, infinite paths, timeouts, duplicates, etc.
• May ’99 crawl:
– 200 million pages, 1.5 billion links
• Oct ’99 crawl:
– 271 million pages, 2.1 billion links
• Unaudited, self-reported Sep ’03 stats:
– 3 major search engines claim > 3 billion pages indexed
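The crawl procedure described above is essentially breadth-first search over the hyperlink graph with duplicate detection. A minimal Python sketch, assuming a hypothetical fetch_links(url) function that returns the URLs hyperlinked from a page (spam filtering, timeouts, and politeness are omitted):

```python
from collections import deque

def crawl(seed_urls, fetch_links, max_pages=1000):
    """Breadth-first crawl: visit pages level by level, skipping duplicates.

    fetch_links(url) is a stand-in for a real page fetcher; it should
    return the list of URLs hyperlinked from the page at url.
    """
    seen = set(seed_urls)
    frontier = deque(seed_urls)
    edges = []                      # (source, target) hyperlinks discovered
    while frontier and len(seen) <= max_pages:
        url = frontier.popleft()
        for target in fetch_links(url):
            edges.append((url, target))
            if target not in seen:  # duplicate detection
                seen.add(target)
                frontier.append(target)
    return seen, edges
```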
Five Easy Pieces
• Authors did two kinds of breadth-first search:
– ignoring link direction → weak connectivity
– only following forward links → strong connectivity
• They then identify five different regions of the web:
– strongly connected component (SCC):
• can reach any page in SCC from any other in directed fashion
– component IN:
• can reach any page in SCC in directed fashion, but not reverse
– component OUT:
• can be reached from any page in SCC, but not reverse
– component TENDRILS:
• weakly connected to all of the above, but cannot reach SCC or be
reached from SCC in directed fashion (e.g. pointed to by IN)
– SCC+IN+OUT+TENDRILS form weakly connected component (WCC)
– everything else is called DISC (disconnected from the above)
– here is a visualization of this structure
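A toy in-memory sketch of how the five regions can be computed with the two kinds of BFS, given forward and reverse adjacency dictionaries and a seed page known to lie in the SCC (the actual crawls needed far more scalable, external-memory machinery):

```python
def bowtie_regions(out_links, in_links, scc_seed):
    """Classify pages into SCC, IN, OUT, TENDRILS, DISC relative to the
    strongly connected component containing scc_seed.
    out_links / in_links: dicts mapping page -> set of pages reached by
    forward / reverse hyperlinks; every page is assumed to appear as a key."""

    def reachable(start, links):
        # plain BFS over whichever link direction we are given
        seen, frontier = {start}, [start]
        while frontier:
            u = frontier.pop()
            for v in links.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    frontier.append(v)
        return seen

    fwd = reachable(scc_seed, out_links)      # pages the seed can reach
    bwd = reachable(scc_seed, in_links)       # pages that can reach the seed
    scc = fwd & bwd                           # mutual reachability = SCC
    out = fwd - scc                           # reachable from SCC, not back
    inn = bwd - scc                           # can reach SCC, not reachable
    pages = set(out_links) | set(in_links)
    undirected = {u: out_links.get(u, set()) | in_links.get(u, set())
                  for u in pages}
    wcc = reachable(scc_seed, undirected)     # weakly connected component
    tendrils = wcc - scc - inn - out
    disc = pages - wcc
    return scc, inn, out, tendrils, disc
```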
Size of the Five
• SCC: ~56M pages, ~28%
• IN: ~43M pages, ~21%
• OUT: ~43M pages, ~21%
• TENDRILS: ~44M pages, ~22%
• DISC: ~17M pages, ~8%
• WCC > 91% of the web --- the giant component
• One interpretation of the pieces:
– SCC: the heart of the web
– IN: newer sites not yet discovered and linked to
– OUT: “insular” pages like corporate web sites
Diameter Measurements
• Directed worst-case diameter of the SCC:
– at least 28
• Directed worst-case diameter of IN → SCC → OUT:
– at least 503
• Over 75% of the time, there is no directed path between a
random start and finish page in the WCC
– when there is a directed path, average length is 16
• Average undirected distance in the WCC is 7
• Moral:
– web is a “small world” when we ignore direction
– otherwise the picture is more complex
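These figures come from breadth-first searches between sampled page pairs. A rough sketch of that measurement, assuming an in-memory adjacency dictionary (directed or undirected depending on which links are passed in); the helper names are illustrative:

```python
import random
from collections import deque

def bfs_distance(links, source, target):
    """Length of the shortest path from source to target over the given
    links (directed or undirected), or None if target is unreachable."""
    dist = {source: 0}
    frontier = deque([source])
    while frontier:
        u = frontier.popleft()
        if u == target:
            return dist[u]
        for v in links.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                frontier.append(v)
    return None

def sample_distances(links, pages, trials=1000):
    """Estimate the fraction of connected (start, finish) pairs and the
    average distance among connected pairs; pages is a list of page ids."""
    found = []
    for _ in range(trials):
        s, t = random.choice(pages), random.choice(pages)
        d = bfs_distance(links, s, t)
        if d is not None:
            found.append(d)
    frac_connected = len(found) / trials
    avg = sum(found) / len(found) if found else float("nan")
    return frac_connected, avg
```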
Degree Distributions
• They are, of course, heavy-tailed
• Power law distribution of component size
– consistent with the Erdos-Renyi model
• Undirected connectivity of web not reliant on “connectors”
– what happens as we remove high-degree vertices?
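One way to explore the last question: compute the degree histogram, then delete the highest-degree pages (the "connectors") and see how much the giant component shrinks. A small sketch, assuming undirected adjacency sets keyed by every page:

```python
from collections import Counter

def degree_histogram(undirected_links):
    """Counter mapping degree -> number of pages with that degree."""
    return Counter(len(nbrs) for nbrs in undirected_links.values())

def giant_component_size(undirected_links):
    """Size of the largest weakly connected component (undirected BFS)."""
    seen, best = set(), 0
    for start in undirected_links:
        if start in seen:
            continue
        comp, frontier = {start}, [start]
        while frontier:
            u = frontier.pop()
            for v in undirected_links.get(u, ()):
                if v not in comp:
                    comp.add(v)
                    frontier.append(v)
        seen |= comp
        best = max(best, len(comp))
    return best

def remove_connectors(undirected_links, fraction=0.01):
    """Delete the top `fraction` of pages by degree and return the rest."""
    ranked = sorted(undirected_links, key=lambda u: len(undirected_links[u]),
                    reverse=True)
    doomed = set(ranked[:int(fraction * len(ranked))])
    return {u: nbrs - doomed
            for u, nbrs in undirected_links.items() if u not in doomed}
```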
Beyond Macroscopic Structure
• Such studies tell us the coarse overall structure of the web
• Use and construction of the web are more fine-grained
– people browse the web for certain information or topics
– people build pages that link to related or “similar” pages
• How do we quantify & analyze this more detailed structure?
• We’ll examine two related examples:
– Kleinberg’s hubs and authorities
• automatic identification of “web communities”
– PageRank
• automatic identification of “important” pages
• one of the main criteria used by Google
– both rely mainly on the link structure of the web
– both have an algorithm and a theory supporting it
Hubs and Authorities
• Suppose we have a large collection of pages on some topic
– possibly the results of a standard web search
• Some of these pages are highly relevant, others not at all
• How could we automatically identify the important ones?
• What’s a good definition of importance?
• Kleinberg’s idea: there are two kinds of important pages:
– authorities: highly relevant pages
– hubs: pages that point to lots of relevant pages
• If you buy this definition, it further stands to reason that:
– a good hub should point to lots of good authorities
– a good authority should be pointed to by many good hubs
– this logic is, of course, circular
• We need some math and an algorithm to sort it out
The HITS System
(Hyperlink-Induced Topic Search)
• Given a user-supplied query Q:
– assemble root set S of pages (e.g. first 200 pages by AltaVista)
– grow S to base set T by adding all pages linked (undirected) to S
– might bound number of links considered from each page in S
• Now consider directed subgraph induced on just pages in T
• For each page p in T, define its
– hub weight h(p); initialize all to be 1
– authority weight a(p); initialize all to be 1
• Repeat “forever”:
– a(p) := sum of h(q) over all pages q → p
– h(p) := sum of a(q) over all pages p → q
– renormalize all the weights
• This algorithm will always converge!
– the computed weights are related to eigenvectors of the connectivity matrix
– further substructure revealed by different eigenvectors
• Here are some examples
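A compact Python sketch of the iteration just described, run on a toy in-memory base set T given as forward and reverse adjacency dictionaries (the real system builds T from a query as above):

```python
def hits(in_links, out_links, iterations=50):
    """Iterative hub / authority computation on the base set T.
    in_links[p]  = set of pages q with a hyperlink q -> p
    out_links[p] = set of pages q with a hyperlink p -> q"""
    pages = set(in_links) | set(out_links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # a(p) := sum of h(q) over all pages q -> p
        auth = {p: sum(hub.get(q, 0.0) for q in in_links.get(p, ()))
                for p in pages}
        # h(p) := sum of a(q) over all pages p -> q
        hub = {p: sum(auth.get(q, 0.0) for q in out_links.get(p, ()))
               for p in pages}
        # renormalize so the weights stay bounded
        na = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        nh = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return hub, auth
```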
The PageRank Algorithm
• Let’s define a measure of page importance we will call the rank
• Notation: for any page p, let
– N(p) be the number of forward links (pages p points to)
– R(p) be the (to-be-defined) rank of p
• Idea: important pages distribute importance over their forward links
• So we might try defining
– R(p) := sum of R(q)/N(q) over all pages q → p
– can again define an iterative algorithm for computing the R(p)
– if it converges, solution again has an eigenvector interpretation
– problem: cycles accumulate rank but never distribute it
• The fix:
– R(p) := [sum of R(q)/N(q) over all pages q → p] + E(p)
– E(p) is some external or exogenous measure of importance
– some technical details omitted here (e.g. normalization)
• Let’s play with the PageRank calculator
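A minimal sketch of the fixed-up update rule, assuming forward-link sets out_links[p] and an exogenous distribution e[p]; normalization is done crudely and dangling pages contribute only through E:

```python
def pagerank(out_links, e, iterations=100):
    """Iterate R(p) := [sum over q -> p of R(q)/N(q)] + E(p), renormalizing
    each round.  out_links[p] is the set of pages p points to, and e[p] is
    the exogenous importance E(p), assumed to sum to 1."""
    pages = list(out_links)
    r = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_r = {p: e.get(p, 0.0) for p in pages}
        for q in pages:
            targets = out_links[q]
            if not targets:
                continue                    # dangling page: rank flows only via E
            share = r[q] / len(targets)     # R(q)/N(q) per forward link
            for p in targets:
                if p in new_r:
                    new_r[p] += share
        total = sum(new_r.values())
        r = {p: v / total for p, v in new_r.items()}   # renormalize
    return r
```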
The “Random Surfer” Model
• Let’s suppose that E(p) sums to 1 (normalized)
• Then the resulting PageRank solution R(p) will
– also be normalized
– can be interpreted as a probability distribution
• R(p) is the stationary distribution of the following process:
– starting from some random page, just keep following random links
– if stuck in a loop, jump to a random page drawn according to E(p)
– so surfer periodically gets “bored” and jumps to a new page
– E(p) can thus be personalized for each surfer
• An important component of Google’s search criteria
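A small simulation of the surfer process; the empirical visit frequencies approximate R(p). The fixed boredom probability is an assumption for illustration (the slide only says the surfer jumps "periodically"), and E(p) is taken to be uniform over e_pages:

```python
import random

def random_surfer(out_links, e_pages, steps=100000, boredom=0.15):
    """Simulate the random surfer and return empirical visit frequencies.
    out_links[p]: pages p points to; e_pages: list of pages to jump to when
    bored or stuck (drawn uniformly here, i.e. a uniform E(p))."""
    counts = {p: 0 for p in out_links}
    page = random.choice(e_pages)
    for _ in range(steps):
        counts[page] = counts.get(page, 0) + 1
        links = out_links.get(page, [])
        if not links or random.random() < boredom:
            page = random.choice(e_pages)      # bored or stuck: jump via E(p)
        else:
            page = random.choice(list(links))  # follow a random forward link
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}
```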
But What About Content?
• PageRank and Hubs & Authorities
– both based purely on link structure
– often applied to a pre-computed set of pages filtered for content
• So how do (say) search engines do this filtering?
• This is the domain of information retrieval
Basics of Information Retrieval
• Represent a document as a “bag of words”:
– for each word in the English language, count the number of occurrences
– so d[i] is the number of times the i-th word appears in the document
– usually ignore common words (the, and, of, etc.)
– usually do some stemming (e.g. “washed” → “wash”)
– vectors are very long (~100Ks) but very sparse
– need some special representation exploiting sparseness
• Note all that we ignore or throw away:
– the order in which the words appear
– the grammatical structure of sentences (parsing)
– the sense in which a word is used
• firing a gun or firing an employee
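A minimal bag-of-words sketch in Python: a dictionary of counts stores only the nonzero entries, so sparseness comes for free. The tokenizer and stopword list are illustrative stand-ins, and stemming is omitted:

```python
import re
from collections import Counter

STOPWORDS = {"the", "and", "of", "a", "to", "in"}   # tiny illustrative list

def bag_of_words(text):
    """Sparse bag-of-words representation: word -> occurrence count.
    Word order, grammar, and word sense are all thrown away."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)
```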
Bag of Words Document Comparison
• View documents as vectors in a very high-dimensional space
• Can now import geometry and linear algebra concepts
• Similarity between documents d and e:
– Σ d[i]*e[i] over all words i
– may normalize d and e first
– this is their projection onto each other
• Improve by using TF/IDF weighting of words:
– term frequency (TF) --- how frequent is the word in this document?
– inverse document frequency (IDF) --- downweights words that appear in many documents
– give high weight to words with high TF and high IDF (frequent here, rare overall)
• Search engines:
– view the query as just another “document”
– look for similar documents via above
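A sketch of the comparison: weight each count by the standard inverse-document-frequency factor log(N/df), then take the normalized projection (cosine) of the two sparse vectors:

```python
import math

def tfidf(doc_counts, doc_freq, n_docs):
    """Weight each word by its count times log(n_docs / document frequency).
    doc_counts: word -> count in this document;
    doc_freq: word -> number of documents containing the word."""
    return {w: c * math.log(n_docs / doc_freq.get(w, 1))
            for w, c in doc_counts.items()}

def similarity(d, e):
    """Normalized projection of two sparse word-weight vectors:
    sum over words i of d[i]*e[i], divided by the vector lengths."""
    dot = sum(w * e.get(word, 0.0) for word, w in d.items())
    nd = math.sqrt(sum(w * w for w in d.values()))
    ne = math.sqrt(sum(w * w for w in e.values()))
    return dot / (nd * ne) if nd and ne else 0.0
```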