L21LinkAnalysis

advertisement
Link Analysis
Adapted from Lectures by
Prabhakar Raghavan (Yahoo, Stanford),
Christopher Manning (Stanford), and
Raymond Mooney (UT, Austin)
Prasad
L21LinkAnalysis
1
Evolution of Search Engines



1st Generation : Retrieved documents that
matched keyword-based queries based on
boolean model.
2nd Generation : Incorporated contentspecific relevance ranking based on vector
space model (TF-IDF), to deal with high
recall.
3rd Generation: Incorporated contentindependent source ranking, to overcome
spamming, and to exploit “collective web
wisdom”.
Prasad
L21LinkAnalysis
2
(cont’d)


3rd Generation: Tried to glean relative
semantic emphasis of various words based
on syntactic features such as fonts, span of
query term hits, etc. to enhance the efficacy
of VSM.
Future search engines will incorporate
context, profile, and past query history
associated with a user, to personalize
search, and apply additional reasoning and
heuristics to improve satisfaction of
information need.
Prasad
L21LinkAnalysis
3
Meta-Search Engines

Search engine that passes query to
several other search engines and
integrates results.





Submit queries to host sites.
Parse resulting HTML pages to extract search
results.
Integrate multiple rankings into a “consensus”
ranking.
Present integrated results to user.
Examples: Metacrawler, SavvySearch ,Dogpile
4
HTML Structure & Feature
Weighting

Weight tokens under particular HTML tags
more heavily:




<TITLE> tokens (Google seems to like title
matches)
<H1>,<H2>… tokens
<META> keyword tokens
Parse page into conceptual sections (e.g.
navigation links vs. page content) and weight
tokens differently based on section.
5
Bibliometrics: Citation Analysis



Many standard documents include
bibliographies (or references), explicit
citations to other previously published
documents.
Using citations as links, standard corpora
can be viewed as a graph.
The structure of this graph, independent
of the content, can provide interesting
information about the similarity of
documents and the structure of
6
information.
Bibliographic Coupling:
Similarity Measure




Measure of similarity of documents introduced by
Kessler in 1963.
The bibliographic coupling of two documents A
and B is the number of documents cited by both
A and B.
Size of the intersection of their bibliographies.
Maybe want to normalize by size of
bibliographies?
A
B
7
Co-Citation : Similarity Measure



An alternate citation-based measure of similarity
introduced by Small in 1973.
Number of documents that cite both A and B.
Maybe want to normalize by total number of
documents citing either A or B ?
A
B
8
Impact Factor (of a journal)





Developed by Garfield in 1972 to measure the
importance (quality, influence) of scientific
journals.
Measure of how often papers in the journal are
cited by other scientists.
Computed and published annually by the
Institute for Scientific Information (ISI).
The impact factor of a journal J in year Y is the
average number of citations (from indexed
documents published in year Y) to a paper
published in J in year Y1 or Y2.
Does not account for the quality of the citing
article.
9
Authorities




Authorities are pages that are recognized as
providing significant, trustworthy, and useful
information on a topic.
In-degree (number of pointers to a page) is
one simple measure of authority.
However in-degree treats all links as equal.
Should links from pages that are themselves
authoritative count for more?
10
Hubs

Hubs are index pages that provide
lots of useful links to relevant
content pages (topic authorities).

Hub pages for IR are included in the
course home page:

http://www.cs.utexas.edu/users/mo
oney/ir-course
11
Hubs and Authorities

Together they tend to form a bipartite
graph:
Hubs
Authorities
12
Today’s lecture:
Basics of 3rd Generation Search Engines

Role of anchor text

Link analysis for ranking

Pagerank and variants (Page and Brin)


Prasad
BONUS: Illustrates how to solve an important
problem by adapting topology/statistics of large
dataset for mathematically sound analysis and a
practical scalable algorithm
HITS (Klienberg)
L21LinkAnalysis
13
The Web as a Directed Graph
Page A
Anchor
hyperlink
Page B
Assumption 1: A hyperlink between pages denotes
author perceived relevance (quality signal)
Assumption 2: The anchor text of the hyperlink
describes the target page (textual context)
Prasad
L21LinkAnalysis
14
Anchor Text
WWW Worm - McBryan [Mcbr94]

For ibm how to distinguish between:



IBM’s home page (mostly graphical)
IBM’s copyright page (high term freq. for ‘ibm’)
Rival’s spam page (arbitrarily high term freq.)
“ibm”
A million pieces of
anchor text with “ibm”
send a strong signal
Prasad
“ibm.com”
“IBM home page”
www.ibm.com
L21LinkAnalysis
15
Indexing anchor text

When indexing a document D, include
anchor text from links pointing to D.
Armonk, NY-based computer
giant IBM announced today
www.ibm.com
Joe’s computer hardware links
Compaq
HP
IBM
Big Blue today announced
record profits for the quarter
16
Motivation and
Introduction to PageRank


Why is Page Importance Rating important?
 New challenges for information retrieval on the
World Wide Web.
 Huge number of web pages: 150 million by1998
1000 billion by 2008
 Diversity of web pages: different topics, different
quality, etc.
What is PageRank?
 A method for rating the importance of web
pages objectively and mechanically using the link
structure of the web.
Initial PageRank Idea

Can view it as a process of PageRank
“flowing” from pages to the pages they cite.
.1
.08
.05
.05
.03
.09
.08
.03
.03
.03
22
Sample Stable Fixpoint
0.4
0.2
0.2
0.4 0.4
23
0.2
0.2
Pagerank scoring : More details

Imagine a browser doing a random walk on
web pages:
1/3



1/3
Start at a random page
1/3
At each step, go out of the current page
along one of the links on that page, with
equal probability
“In the steady state” each page has a longterm visit rate - use this as the page’s score.
Prasad
L21LinkAnalysis
24
Not quite enough

The web is full of dead-ends.


Random walk can get stuck in dead-ends.
Makes no sense to talk about long-term visit
rates.
??
Prasad
L21LinkAnalysis
25
Teleporting


At a dead end, jump to a random
web page.
At any non-dead end, with
probability 10%, jump to a
random web page.
With remaining probability (90%),
go out on a random link.
 10% - a parameter.

Prasad
L21LinkAnalysis
26
Result of teleporting
Now cannot get stuck locally.
 There is a long-term rate at
which any page is visited (not
obvious, will show this).
 How do we compute this visit
rate?

Prasad
L21LinkAnalysis
27
Markov chains



A Markov chain consists of n states, plus an
nn transition probability matrix P.
At each step, we are in exactly one of these
states.
For 1  i,j  n, the matrix entry Pij tells us the
probability of j being the next state, given
we are currently in state i.
Pii>0
is OK.
Prasad
i
Pij
j
28
Markov chains
n



Clearly, for all i,  Pij  1.
j 1
Markov chains are abstractions of random
walks.
Exercise: represent the teleporting random
walk from 3 slides ago as a Markov chain.
For this case:
Prasad
29
Ergodic Markov chains

A Markov chain is ergodic if


Prasad
you have a path from any state to any
other.
For any start state, after a finite
transient time T0, the probability of
being in any state at a fixed time T>T0
is nonzero.
Not
ergodic
(even/
odd).
30
Ergodic Markov chains



For any ergodic Markov chain, there
is a unique long-term visit rate for
each state.
 Steady-state probability
distribution.
Over a long time-period, we visit
each state in proportion to this rate.
It doesn’t matter where we start.
Prasad
L21LinkAnalysis
31
Probability vectors


A probability (row) vector x = (x1, … xn)
tells us where the walk is, at any point.
E.g., (000…1…000) means we’re in state i.
1
i
n
More generally, the vector x = (x1, … xn)
means the walk is in state i with probability xi.
n
x
Prasad
i 1
i
 1.
32
Change in probability vector



If the probability vector is
x = (x1, … xn) at this step, what is it at the next
step?
Recall that row i of the transition prob. matrix P
tells us where we go next from state i.
So from x, our next state is distributed as xP.
n
x P
i 1
Prasad
i ij
 xj.
L21LinkAnalysis
33
Steady state example

The steady state looks like a vector of
probabilities a = (a1, … an):

ai is the probability of being in state i.
3/4
1/4
1
2
3/4
1/4
For this example, a1=1/4 and a2=3/4.
Prasad
L21LinkAnalysis
34
How do we compute this vector?



Let a = (a1, … an) denote the row vector of
steady-state probabilities.
If current state is described by a, then the next
step is distributed as aP.
But a is the steady state, so a=aP.
 So a is the (left) eigenvector for P.
 Corresponds to the “principal” eigenvector of
P with the largest eigenvalue.


Prasad
Transition probability matrices always have largest
eigenvalue 1 (and others are smaller).
Principal eigenvector can be computed iteratively
from any initial vector.
35
One way of computing a






Recall, regardless of where we start, we
eventually reach the steady state a.
Start with any distribution (say x=(10…0)).
After one step, we’re at xP;
after two steps at xP2 , then xP3 and so on.
“Eventually” means for “large” k, xPk = a.
Algorithm: multiply x by increasing powers
of P until the product looks stable.
Prasad
L21LinkAnalysis
36
Link Structure of the Web

150 million web pages  1.7 billion links
Backlinks and Forward links:
A and B are C’s backlinks
C is A and B’s forward link
Intuitively, a webpage is important if it has a lot of backlinks.
What if a webpage has only one link off www.yahoo.com?
A Simple Version of PageRank




u: a web page
Bu: the set of u’s backlinks
Nv: the number of forward links of page v
c: the normalization factor to make
||R||L1 = 1 (||R||L1= |R1 + … + Rn|)
An example of Simplified
PageRank (Transposed version)
PageRank Calculation: first iteration
An example of Simplified
PageRank (Transposed version)
PageRank Calculation: second iteration
An example of Simplified
PageRank (Transposed version)
Convergence after some iterations
A Problem with Simplified
PageRank
A loop:
During each iteration, the loop accumulates
rank but never distributes rank to other pages!
An example of the Problem
An example of the Problem
An example of the Problem
Random Walks in Graphs

The Random Surfer Model


The simplified model: the standing probability
distribution of a random walk on the graph of
the web. simply keeps clicking successive links
at random
The Modified Model

The modified model: the “random surfer”
simply keeps clicking successive links at
random, but periodically “gets bored” and
jumps to a random page based on the
distribution of E
Modified Version of PageRank
E(u): a distribution of ranks of web pages that “users” jump to
when they “gets bored” after successive links at random.
An example of Modified
PageRank
48
Pagerank summary

Preprocessing:




Given graph of links, build matrix P.
From it compute a.
The entry ai is a number between 0 and 1: the
pagerank of page i.
Query processing:



Prasad
Retrieve pages meeting query.
Rank them by their pagerank.
Order is query-independent.
L21LinkAnalysis
49
The reality

Pagerank is used in google,
but with so are many other
clever heuristics.
Prasad
L21LinkAnalysis
50
Pagerank: Issues and Variants


How realistic is the random surfer model?
 What if we modeled the back button? [Fagi00]
 Surfer behavior sharply skewed towards short
paths [Hube98]
 Search engines, bookmarks & directories make
jumps non-random.
Biased Surfer Models
 Weight link traversal probabilities based on
match with topic/query (skewed selection)
 Bias jumps to pages on topic (e.g., based on
personal bookmarks & categories of interest)
Prasad
51
Non-uniform Teleportation
Sports
Prasad
Teleport with 10% probability to a Sports page
L21LinkAnalysis
55
Hubs and Authorities
Alternative to Pagerank
Prasad
L21LinkAnalysis
60
Hyperlink-Induced Topic Search
(HITS) - Klei98

In response to a query, instead of an
ordered list of pages each meeting the
query, find two sets of inter-related pages:

Hub pages are good lists of links on a
subject.




e.g., “Bob’s list of cancer-related links.”
Authority pages occur recurrently on good
hubs for the subject.
Best suited for “broad topic” queries rather
than for page-finding queries.
Gets at a broader slice of common opinion.
Prasad
L21LinkAnalysis
61
Hubs and Authorities



Thus, a good hub page for a topic
points to many authoritative
pages for that topic.
A good authority page for a topic
is pointed to by many good hubs
for that topic.
Circular definition - will turn this
into an iterative computation.
Prasad
L21LinkAnalysis
62
The hope
Alice
AT&T
Authorities
Hubs
Bob
Prasad
Sprint
MCI
Long distance telephone companies
63
High-level scheme
Extract from the web a base
set of pages that could be
good hubs or authorities.
 From these, identify a small
set of top hub and authority
pages;


Prasad
iterative algorithm.
L21LinkAnalysis
64
Base set



Given text query (say browser), use a text
index to get all pages containing
browser.
 Call this the root set of pages.
Add in any page that either
 points to a page in the root set, or
 is pointed to by a page in the root set.
Call this the base set.
Prasad
L21LinkAnalysis
65
Visualization
Root
set
Base set
Prasad
L21LinkAnalysis
66
Assembling the base set [Klei98]



Root set typically 200-1000 nodes.
Base set may have up to 5000 nodes.
How do you find the base set nodes?



Prasad
Follow out-links by parsing root set pages.
Get in-links (and out-links) from a connectivity
server.
(Actually, suffices to text-index strings of the
form href=“URL” to get in-links to URL.)
L21LinkAnalysis
67
Distilling hubs and authorities




Compute, for each page x in the base
set, a hub score h(x) and an authority
score a(x).
Initialize: for all x, h(x)1; a(x) 1;
Iteratively update all h(x), a(x); Key
After iterations


Prasad
output pages with highest h() scores as
top hubs
highest a() scores as top authorities.
L21LinkAnalysis
68
Illustrated Update Rules
1
4
2
3
5
h4 = a5 + a6 + a7 4
6
7
69
a4 = h1 + h2 + h3
Iterative update

Repeat the following updates, for all x:
h( x ) 
 a( y )
y’s
x
x y
a ( x) 
 h( y )
y’s
x
y x
Prasad
L21LinkAnalysis
70
Scaling


To prevent the h() and a() values
from getting too big, can
normalize after each iteration.
Scaling factor doesn’t really
matter:

Prasad
we only care about the relative
values of the scores.
L21LinkAnalysis
71
Old Results

Authorities for query: “Java”



Authorities for query “search engine”





java.sun.com
comp.lang.java FAQ
Yahoo.com
Excite.com
Lycos.com
Altavista.com
Authorities for query “Gates”


Microsoft.com
roadahead.com
72
Similar Page Results from Hubs

Given “honda.com”







toyota.com
ford.com
bmwusa.com
saturncars.com
nissanmotors.com
audi.com
volvocars.com
73
Japan Elementary Schools
Hubs


















schools
LINK Page-13
“ú–{‚ÌŠw•
Z
a‰„

¬Šw
Zƒz
[ƒ
ƒy
[ƒW
100 Schools Home Pages (English)
K-12 from Japan 10/...rnet and Education )
http://www...iglobe.ne.jp/~IKESAN
‚l‚f‚j
¬Šw
Z‚U”N‚P‘g•¨Œê
ÒŠ—’¬—§

ÒŠ—“Œ
¬Šw
Z
Koulutus ja oppilaitokset
TOYODA HOMEPAGE
Education
Cay's Homepage(Japanese)
–y“ì
¬Šw
Z‚̃z
[ƒ
ƒy
[ƒW
UNIVERSITY
‰J—³
¬Šw
Z DRAGON97-TOP
‰ª

¬Šw
Z‚T”N‚P‘gƒz
[ƒ
ƒy
[ƒW
¶µ°é¼ÂÁ© ¥á¥Ë¥å¡¼ ¥á¥Ë¥å¡¼
Prasad
Authorities


















L21LinkAnalysis
The American School in Japan
The Link Page
以
èsŽ—§ˆä“c
¬Šw
Zƒz
[ƒ
ƒy
[ƒW
Kids' Space
ˆÀ•
ésŽ—§ˆÀ
é¼
 •”
¬Šw
Z
‹{
鋳ˆç‘åŠw•
‘®
¬Šw
Z
KEIMEI GAKUEN Home Page ( Japanese )
Shiranuma Home Page
fuzoku-es.fukui-u.ac.jp
welcome to Miasa E&J school
_“ޏ

쌧
E‰¡•l
s—§’†
ì
¼
¬Šw
Z‚̃y
http://www...p/~m_maru/index.html
fukui haruyama-es HomePage
Torisu primary school
goo
Yakumo Elementary,Hokkaido,Japan
FUZOKU Home Page
Kamishibun Elementary School...
76
Things to note


Pulled together good pages
regardless of language of page
content.
Use link analysis only after base set
assembled


Prasad
iterative scoring is query-independent.
Iterative computation after text index
retrieval - significant overhead.
L21LinkAnalysis
77
Proof of convergence

nn adjacency matrix A:


each of the n pages in the base set has a row
and column in the matrix.
Entry Aij = 1 if page i links to page j, else = 0.
1
2
3
Prasad
L21LinkAnalysis
1
1
0
2
1
3
0
2
1
1
1
3
1
0
0
78
Hub/authority vectors


View the hub scores h() and the authority
scores a() as vectors with n components.
Recall the iterative updates
h( x ) 
 a( y )
x y
a ( x) 
 h( y )
y x
Prasad
79
Rewrite in matrix form


h=Aa.
a=Ath.
Recall At
is the
transpose
of A.
Substituting, h=AAth and a=AtAa.
Thus, h is an eigenvector of AAt and a is an
eigenvector of AtA.
Further, our algorithm is known for computing
eigenvectors: the power iteration method.
Prasad
Guaranteed to converge.
80
Issues

Topic Drift

Off-topic pages can cause off-topic
“authorities” to be returned


E.g., the neighborhood graph can be about
a “super topic”
Mutually Reinforcing Affiliates

Affiliated pages/sites can boost each
others’ scores

Prasad
Linkage between affiliated pages is not a
useful signal
L21LinkAnalysis
81
Download