CS246
Link-Based Ranking
Problems of TFIDF Vector

Works well on a small, controlled corpus, but not on the Web



Easy to spam



Top result for the query “American Airlines”:
an accident report on American Airlines flights
Do users really care how many times “American Airlines” is mentioned?
Ranking purely based on page content
Authors can manipulate page content to get high ranking
Any idea?
Link-based Ranking



People “expect” to get the AA home page for the query “American Airlines”
Many pages point to the AA home page, but not to the accident report
Use link-count!
Simple Link Count

Still easy to spam


Create many pages and have each of them link to the target page
How to avoid spam?
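A minimal sketch of simple link counting, to make the spam weakness concrete. The toy Web, the page names, and the helper `link_count` are all hypothetical and only serve as an illustration.

```python
from collections import Counter

def link_count(links):
    """Return the number of in-links of every page in `links`."""
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):  # count each source page at most once
            counts[target] += 1
    return counts

# Hypothetical toy Web: three honest pages link to "aa.com".
web = {"p1": ["aa.com"], "p2": ["aa.com"], "p3": ["aa.com"],
       "aa.com": [], "spam": []}
print(link_count(web)["aa.com"], link_count(web)["spam"])

# A spammer creates ten throwaway pages that all link to "spam".
for i in range(10):
    web[f"farm{i}"] = ["spam"]
print(link_count(web)["spam"])
```

After the ten throwaway pages are added, the spam page has more in-links than the honest page, which is exactly the weakness the slide points out.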
PageRank




A page is important if it is pointed to by many important pages
PR(p) = PR(p1)/n1 + … + PR(pk)/nk
pi : a page pointing to p
ni : number of links in pi
The PageRank of p is the sum of its parents' PageRanks, each divided by that parent's number of links
One equation for every page

N equations, N unknown variables
Example: Web of 1842

Netscape, Microsoft and Amazon
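$$
\begin{aligned}
PR(n) &= PR(n)/2 + PR(a)/2 \\
PR(m) &= PR(a)/2 \\
PR(a) &= PR(n)/2 + PR(m)
\end{aligned}
$$
[Figure: link graph over Netscape (Ne), Microsoft (MS), and Amazon (Am)]
$$
\begin{pmatrix} n \\ m \\ a \end{pmatrix}
= \begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 0 & 1/2 \\ 1/2 & 1 & 0 \end{pmatrix}
\begin{pmatrix} n \\ m \\ a \end{pmatrix}
$$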
PageRank: Matrix Notation

Web graph matrix M = { mij }




Each page i corresponds to row i and column i of the matrix M
mij = 1/n if page i is one of the n children of page j
mij = 0 otherwise
PageRank vector
$$
p = \begin{pmatrix} p_1 \\ p_2 \\ p_3 \end{pmatrix}
$$
PageRank equation
$$
p = M p
$$
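As a rough sketch of the definition of M, the matrix can be built from a dictionary of out-links. The helper name `web_matrix` and the dictionary layout are assumptions made for this illustration; the links are read off the equations of the three-page example, and each child is assumed to be listed only once per parent.

```python
from fractions import Fraction

def web_matrix(links, pages):
    """Build M with m[i][j] = 1/n if page i is one of the n children of page j."""
    index = {page: k for k, page in enumerate(pages)}
    size = len(pages)
    m = [[Fraction(0)] * size for _ in range(size)]
    for parent, children in links.items():
        for child in children:
            m[index[child]][index[parent]] = Fraction(1, len(children))
    return m

# Three-page example: Netscape (n), Microsoft (m), Amazon (a).
links = {"n": ["n", "a"], "m": ["a"], "a": ["n", "m"]}
M = web_matrix(links, ["n", "m", "a"])
for row in M:
    print([str(x) for x in row])
```

Each column of M sums to 1 because every page in this example has at least one out-link.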
PageRank: Iterative Computation



Initially every page has a unit of importance
At each round, each page shares its importance among its
children and receives new importance from its parents
Eventually the importance of each page reaches a limit

Stochastic matrix
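Example: Web of 1842
$$
\begin{pmatrix} n \\ m \\ a \end{pmatrix}
= \begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 0 & 1/2 \\ 1/2 & 1 & 0 \end{pmatrix}
\begin{pmatrix} n \\ m \\ a \end{pmatrix}
$$
$$
\begin{pmatrix} n \\ m \\ a \end{pmatrix} =
\begin{pmatrix} 1/3 \\ 1/3 \\ 1/3 \end{pmatrix},
\begin{pmatrix} 1/3 \\ 1/6 \\ 1/2 \end{pmatrix},
\begin{pmatrix} 5/12 \\ 1/4 \\ 1/3 \end{pmatrix},
\begin{pmatrix} 3/8 \\ 1/6 \\ 11/24 \end{pmatrix},
\begin{pmatrix} 5/12 \\ 11/48 \\ 17/48 \end{pmatrix},
\ldots,
\begin{pmatrix} 2/5 \\ 1/5 \\ 2/5 \end{pmatrix}
$$
[Figure: link graph over Netscape (Ne), Microsoft (MS), and Amazon (Am)]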
PageRank: Eigenvector

PageRank equation


p M p


p is the principal eigenvector of M
PageRank: Random Surfer Model

The PageRank of a page is the probability that a Web surfer, following random links, is at that page after many clicks
[Figure: random click]
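The random surfer can be simulated directly. The sketch below walks the three-page example graph; the function name `random_surfer`, the fixed seed, and the number of clicks are arbitrary choices for the illustration.

```python
import random

# Out-links of the three-page example: Netscape (n), Microsoft (m), Amazon (a).
links = {"n": ["n", "a"], "m": ["a"], "a": ["n", "m"]}

def random_surfer(links, clicks, seed=0):
    """Follow random links and return the fraction of clicks landing on each page."""
    rng = random.Random(seed)
    page = rng.choice(sorted(links))
    visits = {p: 0 for p in links}
    for _ in range(clicks):
        page = rng.choice(links[page])  # click a random link on the current page
        visits[page] += 1
    return {p: count / clicks for p, count in visits.items()}

freq = random_surfer(links, clicks=200_000)
print({p: round(f, 3) for p, f in freq.items()})
```

With enough clicks the visit frequencies should come out close to the PageRank values 2/5, 1/5, and 2/5 of the example.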
Problems on the Real Web

Dead end



A page with no out-links through which to pass on its importance
All importance “leaks out of” the Web
Crawler trap


A group of one or more pages that has no links out of the group
Accumulates all the importance of the Web
Example: Dead End

No link from Microsoft
Dead end
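[Figure: link graph over Netscape (Ne), Microsoft (MS), and Amazon (Am), with no link out of Microsoft]
$$
\begin{pmatrix} n \\ m \\ a \end{pmatrix}
= \begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 0 & 1/2 \\ 1/2 & 0 & 0 \end{pmatrix}
\begin{pmatrix} n \\ m \\ a \end{pmatrix}
$$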
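Example: Dead End
$$
\begin{pmatrix} n \\ m \\ a \end{pmatrix}
= \begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 0 & 1/2 \\ 1/2 & 0 & 0 \end{pmatrix}
\begin{pmatrix} n \\ m \\ a \end{pmatrix}
$$
$$
\begin{pmatrix} n \\ m \\ a \end{pmatrix} =
\begin{pmatrix} 1/3 \\ 1/3 \\ 1/3 \end{pmatrix},
\begin{pmatrix} 1/3 \\ 1/6 \\ 1/6 \end{pmatrix},
\begin{pmatrix} 1/4 \\ 1/12 \\ 1/6 \end{pmatrix},
\begin{pmatrix} 5/24 \\ 1/12 \\ 1/8 \end{pmatrix},
\begin{pmatrix} 1/6 \\ 1/16 \\ 5/48 \end{pmatrix},
\ldots,
\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}
$$
[Figure: link graph over Netscape (Ne), Microsoft (MS), and Amazon (Am)]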
Solution to Dead End

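Assume the surfer jumps to a random page at a dead end
$$
\begin{pmatrix} n \\ m \\ a \end{pmatrix}
= \begin{pmatrix} 1/2 & 1/3 & 1/2 \\ 0 & 1/3 & 1/2 \\ 1/2 & 1/3 & 0 \end{pmatrix}
\begin{pmatrix} n \\ m \\ a \end{pmatrix}
$$
[Figure: link graph over Netscape (Ne), Microsoft (MS), and Amazon (Am)]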
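One way to express this fix in code is to replace every all-zero column with a uniform column. The helper `fix_dead_ends` is a hypothetical name, and the sketch assumes the matrix is stored as a list of rows.

```python
from fractions import Fraction as F

def fix_dead_ends(matrix):
    """Replace each all-zero column (a dead-end page) with a uniform 1/N column."""
    size = len(matrix)
    fixed = [row[:] for row in matrix]
    for j in range(size):
        if all(matrix[i][j] == 0 for i in range(size)):
            for i in range(size):
                fixed[i][j] = F(1, size)
    return fixed

M_dead = [[F(1, 2), F(0), F(1, 2)],
          [F(0),    F(0), F(1, 2)],
          [F(1, 2), F(0), F(0)]]
M_fixed = fix_dead_ends(M_dead)
for row in M_fixed:
    print([str(x) for x in row])
```

After the fix every column sums to 1 again, so no importance leaks out.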
Example: Crawler Trap

Only self-link at Microsoft
Crawler trap
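[Figure: link graph over Netscape (Ne), Microsoft (MS), and Amazon (Am), with Microsoft linking only to itself]
$$
\begin{pmatrix} n \\ m \\ a \end{pmatrix}
= \begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 1 & 1/2 \\ 1/2 & 0 & 0 \end{pmatrix}
\begin{pmatrix} n \\ m \\ a \end{pmatrix}
$$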
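Example: Crawler Trap
$$
\begin{pmatrix} n \\ m \\ a \end{pmatrix}
= \begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 1 & 1/2 \\ 1/2 & 0 & 0 \end{pmatrix}
\begin{pmatrix} n \\ m \\ a \end{pmatrix}
$$
$$
\begin{pmatrix} n \\ m \\ a \end{pmatrix} =
\begin{pmatrix} 1/3 \\ 1/3 \\ 1/3 \end{pmatrix},
\begin{pmatrix} 1/3 \\ 1/2 \\ 1/6 \end{pmatrix},
\begin{pmatrix} 1/4 \\ 7/12 \\ 1/6 \end{pmatrix},
\begin{pmatrix} 5/24 \\ 2/3 \\ 1/8 \end{pmatrix},
\begin{pmatrix} 1/6 \\ 35/48 \\ 5/48 \end{pmatrix},
\ldots,
\begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}
$$
[Figure: link graph over Netscape (Ne), Microsoft (MS), and Amazon (Am)]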
Crawler Trap: Damping Factor

“Tax” each page some fraction of its importance and distribute the tax equally among all pages
The tax rate is the probability of jumping to a random page
Assuming a 20% tax
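$$
\begin{pmatrix} n \\ m \\ a \end{pmatrix}
= 0.8 \begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 1 & 1/2 \\ 1/2 & 0 & 0 \end{pmatrix}
\begin{pmatrix} n \\ m \\ a \end{pmatrix}
+ 0.2 \begin{pmatrix} 1/3 \\ 1/3 \\ 1/3 \end{pmatrix}
$$
$$
PR(p_i) = d \sum_j PR(p_j)/c_j + (1-d)/N
$$
$$
\begin{pmatrix} n \\ m \\ a \end{pmatrix}
= \begin{pmatrix} 7/33 \\ 7/11 \\ 5/33 \end{pmatrix}
$$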
Link Spam Problem

Q: What if a spammer creates a lot of pages and links each of them to a single spam page?
PageRank is better than simple link count, but still vulnerable to link spam
Q: Any way to avoid link spam?
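A small experiment illustrates the vulnerability. Everything here is hypothetical: the four-page honest Web, the 20-page link farm, and the `pagerank` helper, which is a plain damped PageRank with d = 0.8 that assumes every page has at least one out-link.

```python
def pagerank(links, d=0.8, rounds=100):
    """Damped PageRank on a dictionary of out-links (no dead ends assumed)."""
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(rounds):
        new = {p: (1 - d) / len(pages) for p in pages}
        for parent, children in links.items():
            for child in children:
                new[child] += d * rank[parent] / len(children)
        rank = new
    return rank

# Hypothetical honest Web; nobody links to "spam".
honest = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "spam": ["a"]}
before = pagerank(honest)["spam"]

# The spammer adds 20 farm pages, each with a single link to "spam".
farm = dict(honest)
for i in range(20):
    farm[f"farm{i}"] = ["spam"]
after = pagerank(farm)["spam"]

print(round(before, 4), round(after, 4))
```

In this toy setting the spam page's share of the total PageRank rises from about 5% to about 14%, even though no honest page links to it.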
TrustRank


[Gyongyi et al. 2004]
Good pages don’t point to spam pages
Trust a page only if it is linked to by pages you trust
$$
TR(p_i) = d \sum_j TR(p_j)/c_j + b_i,
\qquad
b_i = \begin{cases} (1-d)/N_T & \text{if } p_i \text{ is trusted} \\ 0 & \text{otherwise} \end{cases}
$$
Same as PageRank except for the random-jump probability term
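The TrustRank equation can be sketched with the same iteration as PageRank, changing only the jump term. The sketch assumes that N_T is the number of trusted pages and that every page has at least one out-link; the toy graph, the trusted set, and the name `trustrank` are hypothetical.

```python
def trustrank(links, trusted, d=0.8, rounds=100):
    """Iterate TR = d * (link contributions) + b, where only trusted pages get b > 0."""
    pages = sorted(links)
    # Assumption: N_T is the number of trusted pages.
    b = {p: (1 - d) / len(trusted) if p in trusted else 0.0 for p in pages}
    rank = dict(b)
    for _ in range(rounds):
        new = dict(b)
        for parent, children in links.items():
            for child in children:
                new[child] += d * rank[parent] / len(children)
        rank = new
    return rank

# Hypothetical Web: honest pages a, b, c plus a spam page fed by a 20-page link farm.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "spam": ["a"]}
for i in range(20):
    links[f"farm{i}"] = ["spam"]

tr = trustrank(links, trusted={"a"})
print(round(tr["a"], 4), round(tr["b"], 4), round(tr["c"], 4), tr["spam"])
```

Because no trusted page links into the farm, the spam page and all farm pages stay at zero regardless of how many farm pages are added.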
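TrustRank: Theory [Bianchini et al. 2005]
Given
$$
P(p_i) = d \sum_j P(p_j)/c_j + b_i
$$
consider a set of pages S
[Figure: a set of pages S, with IN(S) (pages outside S linking into S), OUT(S) (pages in S linking out of S), and DP(S) (pages in S with no out-links)]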
TrustRank: Theory [Bianchini et al. 2005]
$$
P_S = \sum_{p \in S} P(p), \qquad B_S = \sum_{p_i \in S} b_i
$$
$$
P_{IN} = \sum_{p_i \in IN(S)} P(p_i)/c_i, \qquad
P_{OUT} = \sum_{p_i \in OUT(S)} P(p_i)/c_i, \qquad
P_{DP} = \sum_{p_i \in DP(S)} P(p_i)
$$
$$
P_S = B_S + \frac{d}{1-d} P_{IN} - \frac{d}{1-d} P_{OUT} - \frac{d}{1-d} P_{DP}
$$
What Does It Mean?
PS  BS  PIN   POUT   PDP 


PS = 0 if BS= 0 and PIN= 0
You cannot improve your TrustRank simply by creating
more pages and linking within yourself

To get non-zero TrustRank, you need to be either trusted or
get links from outside
Is TrustRank the Ultimate Solution?

Not really…

Honeypot: a page with good content and hidden links to spam pages
Good users link to the honeypot because of its quality content
Blogs, forums, wikis, mailing lists
Easy to add spam links
Link exchange
A set of sites exchanging links to boost ranking
A never-ending rat race…
Anti-Spamming at Search Engines

Anchor text




Consider what others think about your page
Give higher weights to anchors from high PageRank pages
More difficult to spam
TrustRank



To gain importance, you need to convince many pages under others’ control, or convince the search engines
More difficult to spam
Give higher weight to inter-site links
Hub and Authority


More detailed evaluation of importance
A page is useful if



It has good content, or
It has links to useful pages (a good bookmark)
Hub/Authority
Authority: a page with good content
Hub: a page pointing to pages with good content
Hub/Authority: Definition

Recursive definition similar to PageRank



Authority pages are linked to by many hub pages
Hub pages link to many authority pages
H(p) = A(p1) + … + A(pk), where p1, …, pk are the pages that p points to
A(p) = H(q1) + … + H(qm), where q1, …, qm are the pages pointing to p
Hub/Authority: Matrix Notation

Web graph matrix A = { aij }




Each page i corresponds to row i and column i of the matrix A
aij = 1 if page i points to page j
aij = 0 otherwise
A is not a stochastic matrix
A^T: similar to the PageRank matrix M, but without the stochastic restriction
Example: Web of 1842
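Vector order: [n, m, a]
$$
A = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix},
\qquad
A^T = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}
$$
[Figure: link graph over Netscape (Ne), Microsoft (MS), and Amazon (Am)]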
Hub/Authority: Iterative Computation


Hub/Authority vector
$$
h = \begin{pmatrix} h_n \\ h_m \\ h_a \end{pmatrix},
\qquad
a = \begin{pmatrix} a_n \\ a_m \\ a_a \end{pmatrix}
$$
$$
a = \mu A^T h \qquad (\mu: \text{divergence scaling factor})
$$
$$
h = \lambda A a \qquad (\lambda: \text{divergence scaling factor})
$$
→ Compute a and h iteratively with scaling
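A minimal sketch of the iteration on the three-page adjacency matrix A of the example. Scaling each vector so that its largest entry is 1 is just one possible choice of scaling factor, assumed here for illustration; the helper names are hypothetical.

```python
import math

# Adjacency matrix of the example, pages ordered [n, m, a].
A = [[1, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def scale(v):
    """Rescale so the largest entry is 1 (keeps the values from diverging)."""
    top = max(v)
    return [x / top for x in v]

def hits(adj, rounds=50):
    h = [1.0] * len(adj)
    a = [1.0] * len(adj)
    adj_t = transpose(adj)
    for _ in range(rounds):
        a = scale(matvec(adj_t, h))  # authority: sum of hub scores of pages pointing in
        h = scale(matvec(adj, a))    # hub: sum of authority scores of pages pointed to
    return h, a

h, a = hits(A)
print([round(x, 4) for x in h])
print([round(x, 4) for x in a])
print(round(math.sqrt(3) - 1, 4), round(2 - math.sqrt(3), 4))
```

With this scaling the authority vector approaches (1, 1, √3 − 1) and the hub vector approaches (1, 2 − √3, √3 − 1).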
Hub/Authority: Eigenvector


$$
a = \mu A^T h, \qquad h = \lambda A a
$$
$$
a = \mu A^T h = \mu A^T (\lambda A a) = \lambda\mu \, (A^T A) \, a
$$
$$
h = \lambda A a = \lambda A (\mu A^T h) = \lambda\mu \, (A A^T) \, h
$$
a: eigenvector of A^T A
h: eigenvector of A A^T
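Example: Web of 1842
$$
A^T A = \begin{pmatrix} 2 & 2 & 1 \\ 2 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix},
\qquad
A A^T = \begin{pmatrix} 3 & 1 & 2 \\ 1 & 1 & 0 \\ 2 & 0 & 2 \end{pmatrix}
$$
$$
a = \begin{pmatrix} a_n \\ a_m \\ a_a \end{pmatrix} =
\begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix},
\begin{pmatrix} 5 \\ 5 \\ 4 \end{pmatrix},
\begin{pmatrix} 24 \\ 24 \\ 18 \end{pmatrix},
\begin{pmatrix} 114 \\ 114 \\ 84 \end{pmatrix},
\ldots \;\propto\;
\begin{pmatrix} 1+\sqrt{3} \\ 1+\sqrt{3} \\ 2 \end{pmatrix}
$$
$$
h = \begin{pmatrix} h_n \\ h_m \\ h_a \end{pmatrix} =
\begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix},
\begin{pmatrix} 6 \\ 2 \\ 4 \end{pmatrix},
\begin{pmatrix} 28 \\ 8 \\ 20 \end{pmatrix},
\begin{pmatrix} 132 \\ 36 \\ 96 \end{pmatrix},
\ldots \;\propto\;
\begin{pmatrix} 2+\sqrt{3} \\ 1 \\ 1+\sqrt{3} \end{pmatrix}
$$
[Figure: link graph over Netscape (Ne), Microsoft (MS), and Amazon (Am)]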
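The unscaled iterates of this example can be checked with plain integer arithmetic. The helpers `transpose`, `matmul`, and `matvec` are written out for the illustration so that no library is needed.

```python
A = [[1, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]

AtA = matmul(transpose(A), A)   # drives the authority vector a
AAt = matmul(A, transpose(A))   # drives the hub vector h

a, h = [1, 1, 1], [1, 1, 1]
a_hist, h_hist = [a], [h]
for _ in range(3):
    a = matvec(AtA, a)
    h = matvec(AAt, h)
    a_hist.append(a)
    h_hist.append(h)

print(AtA, AAt)
print(a_hist)
print(h_hist)
```

Without scaling the entries grow quickly, which is why the iterative computation applies a scaling factor each round.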
Hub/Authority and Root Set

Apply the equations to a small neighborhood graph (base set)




Start with, say, 100 pages on “bicycling”
Add pages pointing to the 100 pages
Add pages that the 100 pages are pointing to
The pages identified are good hubs and authorities on “bicycling”
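The expansion from the starting pages to the base set can be sketched as a one-step neighborhood in both link directions. The function name `base_set`, the tiny link graph, and the page names are hypothetical.

```python
def base_set(links, root):
    """Expand a root set with the pages pointing to it and the pages it points to."""
    root = set(root)
    parents = {p for p, children in links.items() if root & set(children)}
    children = {c for p in root for c in links.get(p, [])}
    return root | parents | children

# Hypothetical link graph; "bike-club" plays the role of the starting pages.
links = {
    "bike-shop": ["bike-club"],
    "blog": ["bike-club"],
    "bike-club": ["trail-map"],
    "news": ["weather"],
    "trail-map": [],
    "weather": [],
}
print(sorted(base_set(links, {"bike-club"})))
# ['bike-club', 'bike-shop', 'blog', 'trail-map']
```

Pages with no link to or from the starting pages ("news" and "weather" here) stay out of the base set, which keeps the computation focused on the topic.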
Hub/Authority and Web Community

Hub/Authority is often used to identify Web communities


Nice notion of “Hub” and “Authority” of the community
Often Hub and Authority are tightly linked to each other
Any Questions?
Questions

Can we apply Hub/Authority to the entire Web like
PageRank?
Hub/Authority on the Entire Web?
Hub/Authority works well on a topic-specific subset,
but works poorly for the whole Web
Easy to spam
1. Create a page pointing to many authority pages (e.g., Yahoo, Google, etc.)
→ The page becomes a good hub page
2. On the page, add a link to your home page
Questions

Can we apply PageRank to a small base set?
PageRank on a Small Subset


In general, PageRank works better on larger datasets
We may be able to compute “topic-specific” PageRank

Any other way for “topic-specific” PageRank?
Summary: Link-Based Ranking

PageRank


TrustRank variation
Hub/Authority