Network Structure and Web Search
Networked Life
CIS 112
Spring 2010
Prof. Michael Kearns
Beyond Macroscopic Structure
• Broder et al. report on coarse overall structure of the web
• Use and construction of the web are more fine-grained
– people browse the web for certain information or topics
– people build pages that link to related or “similar” pages
• How do we quantify & analyze this more detailed structure?
• We’ll examine two related examples:
– Kleinberg’s hubs and authorities
• automatic identification of “web communities”
– PageRank
• automatic identification of “important” pages
• one of the main criteria used by Google
– both rely mainly on the link structure of the web
– both have an algorithm and a theory supporting them
Hubs and Authorities
• Suppose we have a large collection of pages on some topic
– possibly the results of a standard web search
• Some of these pages are highly relevant, others not at all
• How could we automatically identify the important ones?
• What’s a good definition of importance?
• Kleinberg’s idea: there are two kinds of important pages:
– authorities: highly relevant pages
– hubs: pages that point to lots of relevant pages
• If you buy this definition, it further stands to reason that:
– a good hub should point to lots of good authorities
– a good authority should be pointed to by many good hubs
– this logic is, of course, circular
• We need some math and an algorithm to sort it out
The HITS System
(Hyperlink-Induced Topic Search)
• Given a user-supplied query Q:
– assemble root set S of pages (e.g. first 200 pages by AltaVista)
– grow S to base set T by adding all pages linked (undirected) to S
– might bound number of links considered from each page in S
• Now consider directed subgraph induced on just pages in T
• For each page p in T, define its
– hub weight h(p); initialize all to be 1
– authority weight a(p); initialize all to be 1
• Repeat “forever”:
– a(p) := sum of h(q) over all pages q → p
– h(p) := sum of a(q) over all pages p → q (sketched in code below)
– renormalize all the weights
• This algorithm will always converge!
– the computed weights are related to eigenvectors of the connectivity matrix
– further substructure revealed by different eigenvectors
• Here are some examples
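To make the iteration above concrete, here is a minimal Python sketch of this style of update (not the actual HITS implementation); the toy link structure, the fixed 50 iterations, and the Euclidean renormalization are illustrative assumptions.

```python
# Minimal HITS-style iteration: a(p) = sum of h(q) over q -> p,
# h(p) = sum of a(q) over p -> q, then renormalize.
from math import sqrt

# Toy base set T with forward links p -> [pages p points to] (illustrative only)
links = {"x": ["a", "b"], "y": ["a", "b"], "z": ["a"], "a": [], "b": []}

hub = {p: 1.0 for p in links}      # h(p), initialized to 1
auth = {p: 1.0 for p in links}     # a(p), initialized to 1

for _ in range(50):                # "repeat forever": in practice, until convergence
    auth = {p: sum(hub[q] for q in links if p in links[q]) for p in links}
    hub = {p: sum(auth[q] for q in links[p]) for p in links}
    # renormalize so the weights stay bounded
    na = sqrt(sum(v * v for v in auth.values())) or 1.0
    nh = sqrt(sum(v * v for v in hub.values())) or 1.0
    auth = {p: v / na for p, v in auth.items()}
    hub = {p: v / nh for p, v in hub.items()}

print(sorted(auth.items(), key=lambda kv: -kv[1]))  # a, b emerge as authorities
print(sorted(hub.items(), key=lambda kv: -kv[1]))   # x, y, z emerge as hubs
```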
The PageRank Algorithm
• Let’s define a measure of page importance we will call the rank
• Notation: for any page p, let
– N(p) be the number of forward links (pages p points to)
– R(p) be the (to-be-defined) rank of p
• Idea: important pages distribute importance over their forward links
• So we might try defining
– R(p) := sum of R(q)/N(q) over all pages q → p
– can again define iterative algorithm for computing the R(p)
– if it converges, solution again has an eigenvector interpretation
– problem: cycles accumulate rank but never distribute it
• The fix:
– R(p) := [sum of R(q)/N(q) over all pages q → p] + E(p) (sketched in code below)
– E(p) is some external or exogenous measure of importance
– some technical details omitted here (e.g. normalization)
• Let’s play with the PageRank calculator
[Figure: example graph with pages A–G used in the PageRank calculator demo]
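Below is a rough Python sketch of iterating R(p) := [sum of R(q)/N(q) over q → p] + E(p); the toy graph, the uniform choice of E(p), and the per-step renormalization are assumptions standing in for the omitted technical details.

```python
# Iterative computation of R(p) = [sum of R(q)/N(q) over q -> p] + E(p),
# with a renormalization step standing in for the omitted technical details.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A", "D"], "D": ["C"]}  # toy graph, illustrative
pages = list(links)
n = len(pages)

E = {p: 0.15 / n for p in pages}   # assumed uniform exogenous importance
R = {p: 1.0 / n for p in pages}    # start from the uniform distribution

for _ in range(100):
    R = {p: sum(R[q] / len(links[q]) for q in pages if p in links[q]) + E[p]
         for p in pages}
    total = sum(R.values())
    R = {p: r / total for p, r in R.items()}  # renormalize so the ranks sum to 1

print({p: round(r, 3) for p, r in R.items()})  # C collects the most rank in this toy graph
```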
The “Random Surfer” Model
• Let’s suppose that E(p) sums to 1 (normalized)
• Then the resulting PageRank solution R(p)
– will also be normalized
– can be interpreted as a probability distribution
• R(p) is the stationary distribution of the following process:
– starting from some random page, just keep following random links
– if stuck in a loop, jump to a random page drawn according to E(p)
– so surfer periodically gets “bored” and jumps to a new page
– E(p) can thus be personalized for each surfer
• An important component of Google’s search criteria
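To illustrate the stationary-distribution interpretation, here is a small Monte Carlo sketch: the surfer follows a random forward link most of the time and otherwise jumps to a page drawn from E(p). The 15% boredom probability and the toy graph are assumptions; the visit frequencies approximate a PageRank-style R(p).

```python
import random

links = {"A": ["B", "C"], "B": ["C"], "C": ["A", "D"], "D": ["C"]}  # same toy graph as above
pages = list(links)
E_weights = [1.0] * len(pages)     # assumed uniform E(p); could be personalized per surfer

visits = {p: 0 for p in pages}
page = random.choice(pages)        # start from some random page
for _ in range(200_000):
    visits[page] += 1
    if random.random() < 0.15 or not links[page]:
        # surfer gets "bored" (or is stuck): jump to a page drawn according to E(p)
        page = random.choices(pages, weights=E_weights)[0]
    else:
        page = random.choice(links[page])   # otherwise follow a random forward link

total = sum(visits.values())
print({p: round(v / total, 3) for p, v in visits.items()})  # empirical stationary distribution
```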
But What About Content?
• PageRank and Hubs & Authorities
– both based purely on link structure
– often applied to a pre-computed set of pages filtered for content
• So how do (say) search engines do this filtering?
• This is the domain of information retrieval
Basics of Information Retrieval
• Represent a document as a “bag of words”:
– for each word in the English language, count the number of occurrences
– so d[i] is the number of times the i-th word appears in the document
– usually ignore common words (the, and, of, etc.)
– usually do some stemming (e.g. “washed” → “wash”)
– vectors are very long (~100Ks) but very sparse
– need some special representation exploiting sparseness (see the code sketch below)
• Note all that we ignore or throw away:
– the order in which the words appear
– the grammatical structure of sentences (parsing)
– the sense in which a word is used
• firing a gun or firing an employee
– and much, much more…
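Here is a small Python sketch of building such a sparse bag-of-words vector; the tiny stopword list, the regex tokenizer, and the use of a plain dictionary as the sparse representation are illustrative assumptions, and stemming is skipped.

```python
import re
from collections import Counter

STOPWORDS = {"the", "and", "of", "a", "an", "to", "in", "is", "was"}  # tiny illustrative list

def bag_of_words(text):
    """Sparse word-count vector: only words that actually occur are stored."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

d = bag_of_words("The gun was fired, and the firing of the employee followed.")
print(d)  # word order, grammar, and word sense are all discarded;
          # without stemming, "fired" and "firing" remain distinct words
```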
Bag of Words Document Comparison
• View documents as vectors in a very high-dimensional space
• Can now import geometry and linear algebra concepts
• Similarity between documents d and e:
– Σ d[i]*e[i] over all words i
– may normalize d and e first
– this is their projection onto each other
• Improve by using TF/IDF weighting of words:
– term frequency --- how frequent is the word in this document?
– inverse document frequency --- how rare is the word across all documents?
– give high weight to words with high TF and high IDF (see the code sketch below)
• Search engines:
– view the query as just another “document”
– look for similar documents via above
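Here is a Python sketch of the weighted dot-product comparison (restating the bag-of-words helper so it runs on its own); the smoothed IDF formula, the unit-length normalization, and the toy documents and query are assumptions made for illustration.

```python
import math, re
from collections import Counter

STOPWORDS = {"the", "and", "of", "a", "an", "to", "in", "is"}

def bag_of_words(text):
    return Counter(w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS)

def tfidf_vector(counts, doc_freq, n_docs):
    """High weight for words frequent in this document but rare across all documents."""
    vec = {w: tf * math.log(n_docs / (1 + doc_freq.get(w, 0))) for w, tf in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {w: v / norm for w, v in vec.items()}        # normalize to unit length

def similarity(d, e):
    """Sum of d[i]*e[i] over all words i (the documents' projection onto each other)."""
    return sum(d[w] * e.get(w, 0.0) for w in d)

docs = [bag_of_words(t) for t in ["pagerank ranks web pages by links",
                                  "hubs point to many authorities on the web",
                                  "washing machines wash clothes"]]
doc_freq = Counter(w for d in docs for w in d)          # number of documents containing each word
vecs = [tfidf_vector(d, doc_freq, len(docs)) for d in docs]

# Treat the query as just another "document" and rank documents by similarity.
query = tfidf_vector(bag_of_words("pagerank and web links"), doc_freq, len(docs))
print([round(similarity(query, v), 3) for v in vecs])   # the first document scores highest
```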
Looking Ahead: Left Side vs. Right Side
• So far we are discussing the “left hand” search results on Google
– a.k.a. “organic” search
• “Right hand” or “sponsored” search: paid advertisements in a formal market
– We will spend a lecture on these markets later in the term
• Same two types of search/results on Yahoo!, MSN,…
• Common perception:
– organic results are “objective”, based on content, importance, etc.
– sponsored results are subjective advertisements
• But both sides are subject to “gaming” (strategic behavior)…
– organic: invisible terms in the html, link farms and web spam, reverse engineering
– sponsored: bidding behavior, “jamming”
– optimization of each side has its own industry: SEO and SEM
• … and perhaps to outright fraud
– organic: typo squatting
– sponsored: click fraud
• More later…