
Web Searching & Ranking
Zachary G. Ives
University of Pennsylvania
CIS 455/555 – Internet and Web Systems
March 23, 2016
Some content based on slides by Marti Hearst, Ray Larson
Recall Where We Left Off
- We were discussing information retrieval ranking models
- The Boolean model captures some intuitions of what we want – AND, OR
- But it's too restrictive, and has no real ranking between returned answers
Vector Model
[Figure: the query q and a document dj drawn as vectors in term space, separated by angle θ.]

Sim(q,dj) = cos(θ)
          = (vec(dj) · vec(q)) / (|dj| * |q|)
          = (Σi wij * wiq) / (|dj| * |q|)

- Since wij ≥ 0 and wiq ≥ 0, 0 ≤ sim(q,dj) ≤ 1
- A document is retrieved even if it matches the query terms only partially
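To make the formula concrete, here is a minimal Python sketch (not from the slides) of the cosine computation, assuming q and dj are already represented as term-weight vectors over the same vocabulary:

```python
import math

def cosine_sim(q, d):
    """Cosine of the angle between query vector q and document vector d.
    Both are lists of term weights over the same vocabulary."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    norm_q = math.sqrt(sum(w * w for w in q))
    norm_d = math.sqrt(sum(w * w for w in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # an all-zero vector matches nothing
    return dot / (norm_q * norm_d)  # in [0, 1] since all weights are >= 0
```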
Weights in the Vector Model
Sim(q,dj) = (Σi wij * wiq) / (|dj| * |q|)

- How do we compute the weights wij and wiq?
- A good weight must take into account two effects:
  - quantification of intra-document contents (similarity)
    - the tf factor: the term frequency within a document
  - quantification of inter-document separation (dissimilarity)
    - the idf factor: the inverse document frequency

wij = tf(i,j) * idf(i)
TF and IDF Factors
- Let:
  - N be the total number of docs in the collection
  - ni be the number of docs which contain term ki
  - freq(i,j) be the raw frequency of ki within dj
- A normalized tf factor is given by
  f(i,j) = freq(i,j) / max(freq(l,j))
  where the maximum is computed over all terms l that occur within document dj
- The idf factor is computed as
  idf(i) = log(N / ni)
  The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki.
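As an illustration (a sketch, not the course's code), the two factors can be computed in a few lines of Python; the tokenized-document input format is an assumption:

```python
import math
from collections import Counter

def tf_idf(docs):
    """w_ij = f(i,j) * idf(i), with f and idf as defined above.
    docs: list of documents, each given as a list of tokens."""
    N = len(docs)
    n = Counter()                      # n_i: number of docs containing term i
    for doc in docs:
        n.update(set(doc))
    weights = []
    for doc in docs:
        freq = Counter(doc)            # raw frequencies freq(i,j)
        max_freq = max(freq.values())  # max over all terms in this document
        weights.append({t: (f / max_freq) * math.log(N / n[t])
                        for t, f in freq.items()})
    return weights
```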
Vector Model
Example II

[Figure: documents d1–d7 and query q plotted as vectors in the k1–k2–k3 term space.]

          d1   d2   d3   d4   d5   d6   d7
  k1       1    1    0    1    1    1    0
  k2       0    0    1    0    1    1    1
  k3       1    0    1    0    1    0    0
  q · dj   4    1    5    1    6    3    2

  with q = (1, 2, 3)
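The q · dj row can be verified with a few lines of Python (the weight matrix below is transcribed from the table; with these weights the score is simply the dot product). Example III below works the same way with its weights substituted:

```python
# Document-term weights from the table (rows: k1, k2, k3; columns: d1..d7).
W = [[1, 1, 0, 1, 1, 1, 0],   # k1
     [0, 0, 1, 0, 1, 1, 1],   # k2
     [1, 0, 1, 0, 1, 0, 0]]   # k3
q = [1, 2, 3]

scores = [sum(q[i] * W[i][j] for i in range(3)) for j in range(7)]
print(scores)   # -> [4, 1, 5, 1, 6, 3, 2]
```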
Vector Model
Example III

[Figure: the same documents, now with weighted term vectors in the k1–k2–k3 space.]

          d1   d2   d3   d4   d5   d6   d7
  k1       2    1    0    2    1    1    0
  k2       0    0    1    0    2    2    5
  k3       1    0    3    0    4    0    0
  q · dj   5    1   11    2   17    5   10

  with q = (1, 2, 3)
Vector Model, Summarized
- The best-known term-weighting schemes use tf-idf weights:
  wij = f(i,j) * log(N / ni)
- For the query term weights, a suggestion is
  wiq = (0.5 + [0.5 * freq(i,q) / max(freq(l,q))]) * log(N / ni)
- This model is very good in practice:
  - tf-idf works well with general collections
  - It is simple and fast to compute
  - The vector model is usually as good as the known ranking alternatives
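A small sketch of the suggested query-term weighting (the function name and input format are assumptions, not part of the slides):

```python
import math
from collections import Counter

def query_weights(query_terms, N, n):
    """w_iq per the formula above. query_terms: tokenized query;
    N: number of docs in the collection; n: dict term -> document frequency."""
    freq = Counter(query_terms)
    max_freq = max(freq.values())
    return {t: (0.5 + 0.5 * f / max_freq) * math.log(N / n[t])
            for t, f in freq.items() if n.get(t, 0) > 0}
```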
Pros & Cons of Vector Model
Advantages:
- term weighting improves the quality of the answer set
- partial matching allows retrieval of docs that approximate the query conditions
- the cosine ranking formula sorts documents according to their degree of similarity to the query

Disadvantages:
- assumes independence of index terms; not clear whether this is a good or bad assumption
Comparison of Classic Models
- The Boolean model does not provide for partial matches and is considered the weakest classic model
- Experiments indicate that the vector model generally outperforms the third alternative, the probabilistic model
- Most text search systems therefore use some variation of the vector model
Switching Our Sights to the Web
- Information retrieval on the web is more heterogeneous in nature:
  - No editor to control quality
  - Deliberately misleading information ("web spam")
  - Great variety in types of information: phone books, catalogs, technical reports, news, slide shows, …
  - Many languages; partial duplication; jargon
  - Diverse user goals
- Very short queries: ~2.35 words on average (Aug 2000; Google results)
- And much larger scale!
Handling Short Queries & Mixed-Quality Information

- Human processing
  - Web directories: Yahoo, Open Directory, …
  - Human-created answers: about.com, Search Wikia
  - (Still not clear that automated question-answering works)
- Capitalism: "paid placement"
  - Advertisers pay to be associated with certain keywords
- Clicks / page popularity: pages visited most often
- Link analysis: use link structure to determine credibility
- … or a combination of all of these?
Link Analysis for Starting Points: HITS (Kleinberg), PageRank (Google)

- Assumptions:
  - Credible sources will mostly point to credible sources
  - Names of hyperlinks suggest meaning
  - Ranking is a function of the query terms and of the hyperlink structure
- An example of why this makes sense:
  - The official Olympics site will be linked to by most high-quality sites about sports, the Olympics, etc.
  - A spammer who adds "Olympics" to his/her web site probably won't have many links to it
- Caveat: "search engine optimization"
Google’s PageRank (Brin/Page 98)
- Mine the structure of the web graph independently of the query!
  - Each web page is a node; each hyperlink is a directed edge
- Assumes a random walk (surf) through the web:
  - Start at a random page
  - At each step, the surfer proceeds
    - to a randomly chosen successor of the current page with probability d
    - to a randomly chosen web page with probability 1 – d
- The PageRank of a page p is the fraction of steps the surfer spends at p in the limit
Link Counts Aren’t Everything…
[Figure: a link graph around the "A-Team" page, with in-links from Mr. T's Hollywood page, a "Series to Recycle" page, a Team Sports page, a Cheesy TV Shows page, the Yahoo Directory, and Wikipedia — links that carry very different degrees of credibility.]
PageRank
The importance of page i is governed by the pages linking to it:

  xi = Σj∈Bi (1/Nj) * xj

where xi is the rank of page i, Bi is the set of pages j that link to i, and Nj is the number of links out from page j.
Computing PageRank (Simple version)
Initialize so the total rank sums to 1.0:

  xi(0) = 1/n

Iterate until convergence:

  xi(k+1) = Σj∈Bi (1/Nj) * xj(k)
Computing PageRank (Step 0)
Initialize so the total rank sums to 1.0:

  xi(0) = 1/n

[Figure: a three-page example; each page starts with rank 0.33.]
Computing PageRank (Step 1)
Propagate weights across out-edges:

  xi(k+1) = Σj∈Bi (1/Nj) * xj(k)

[Figure: each page splits its 0.33 of rank evenly across its out-edges — 0.17 along each of two out-links, or the full 0.33 along a single out-link.]
Computing PageRank (Step 2)
Compute weights based on in-edges:

  xi(1) = Σj∈Bi (1/Nj) * xj(0)

[Figure: after one step the three pages hold ranks 0.50, 0.33, and 0.17.]
Computing PageRank (Convergence)
( k 1)
i
x
1 (k )
  xj
jBi N j
0.4
0.40
0.2
21
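Putting the iteration into code: the sketch below runs the naive update on a three-page graph reconstructed from the figure values (an assumption: page 1 links to page 2; page 2 links to pages 1 and 3; page 3 links to page 1), and reproduces the 0.4 / 0.4 / 0.2 ranks:

```python
def naive_pagerank(out_links, iters=50):
    """Naive PageRank: x_i <- sum over j in B_i of x_j / N_j."""
    pages = list(out_links)
    x = {p: 1.0 / len(pages) for p in pages}   # total rank sums to 1.0
    for _ in range(iters):
        nxt = {p: 0.0 for p in pages}
        for j, succs in out_links.items():
            for i in succs:
                nxt[i] += x[j] / len(succs)    # j splits its rank evenly
        x = nxt
    return x

print(naive_pagerank({1: [2], 2: [1, 3], 3: [1]}))
# -> approximately {1: 0.4, 2: 0.4, 3: 0.2}
```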
Naïve PageRank Algorithm Restated
Let:
- N(p) = the number of outgoing links from page p
- B(p) = the set of pages that link to p (its back-links)

  PageRank(p) = Σb∈B(p) PageRank(b) / N(b)

- Each page b distributes its importance to all of the pages it points to (so we scale by N(b))
- Page p's importance is increased by the importance of its back set
In Linear Algebra Terms
Create an m x m matrix M to capture links:
- M(i, j) = 1/nj if page i is pointed to by page j, where page j has nj outgoing links
- M(i, j) = 0 otherwise

Initialize all PageRanks to 1, then multiply by M repeatedly until all values converge:

  [ PageRank(p1') ]       [ PageRank(p1) ]
  [ PageRank(p2') ]  = M  [ PageRank(p2) ]
  [      ...      ]       [      ...     ]
  [ PageRank(pm') ]       [ PageRank(pm) ]

(This computes the principal eigenvector via power iteration.)
A Brief Example
[Figure: three sites — Google, Yahoo, Amazon — linking to one another.]

  [ g' ]   [ 0    0.5  0.5 ]   [ g ]
  [ y' ] = [ 0    0    0.5 ] * [ y ]
  [ a' ]   [ 1    0.5  0   ]   [ a ]

Running for multiple iterations:

  [ g ]   [ 1 ]   [ 1   ]   [ 1    ]       [ 1    ]
  [ y ] = [ 1 ] , [ 0.5 ] , [ 0.75 ] , … , [ 0.67 ]
  [ a ]   [ 1 ]   [ 1.5 ]   [ 1.25 ]       [ 1.33 ]

Total rank sums to the number of pages.
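The same computation in matrix form, as a NumPy sketch (illustrative, not the original course code); repeated multiplication by M is exactly the power iteration, and it reproduces the limit above:

```python
import numpy as np

# Column-stochastic link matrix: M[i, j] = 1/n_j if page j links to
# page i (page order: Google, Yahoo, Amazon).
M = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 0.5],
              [1.0, 0.5, 0.0]])

x = np.ones(3)        # start every page with rank 1
for _ in range(50):
    x = M @ x         # power iteration
print(x.round(2))     # -> [1.   0.67 1.33]
```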
Oops #1 – PageRank Sinks: Dead Ends
[Figure: Google links to Yahoo and Amazon, Amazon links to Google and Yahoo, but Yahoo has no out-links — a dead end.]

  [ g' ]   [ 0    0  0.5 ]   [ g ]
  [ y' ] = [ 0.5  0  0.5 ] * [ y ]
  [ a' ]   [ 0.5  0  0   ]   [ a ]

Running for multiple iterations:

  [ g ]   [ 1 ]   [ 0.5 ]   [ 0.25 ]       [ 0 ]
  [ y ] = [ 1 ] , [ 1   ] , [ 0.5  ] , … , [ 0 ]
  [ a ]   [ 1 ]   [ 0.5 ]   [ 0.25 ]       [ 0 ]

The dead end drains all of the rank out of the system.
Oops #2 – Hogging all the PageRank
[Figure: the same graph, except that Yahoo now links only to itself.]

  [ g' ]   [ 0    0  0.5 ]   [ g ]
  [ y' ] = [ 0.5  1  0.5 ] * [ y ]
  [ a' ]   [ 0.5  0  0   ]   [ a ]

Running for multiple iterations:

  [ g ]   [ 1 ]   [ 0.5 ]   [ 0.25 ]       [ 0 ]
  [ y ] = [ 1 ] , [ 2   ] , [ 2.5  ] , … , [ 3 ]
  [ a ]   [ 1 ]   [ 0.5 ]   [ 0.25 ]       [ 0 ]

Yahoo's self-loop ends up hogging all of the rank.
Improved PageRank
- Remove out-degree-0 nodes (or consider them to refer back to their referrers)
- Add a decay factor to deal with sinks:

  PageRank(p) = d * Σb∈B(p) (PageRank(b) / N(b)) + (1 – d)

- Intuition in the idea of the "random surfer":
  - The surfer occasionally stops following the link sequence and jumps to a new random page, with probability 1 – d
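A minimal sketch of the damped update over an out-link dictionary (names and input format are assumptions); with d = 0.8 it behaves like the "Stopping the Hog" example that follows:

```python
def pagerank(out_links, d=0.8, iters=50):
    """Damped PageRank: x_p <- d * sum_{b in B(p)} x_b / N(b) + (1 - d).
    out_links: dict page -> list of successors (dead ends assumed removed)."""
    pages = list(out_links)
    x = {p: 1.0 for p in pages}
    for _ in range(iters):
        nxt = {p: 1.0 - d for p in pages}        # random-jump share
        for b, succs in out_links.items():
            for p in succs:
                nxt[p] += d * x[b] / len(succs)  # damped link share
        x = nxt
    return x
```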
Stopping the Hog
[Figure: the same three-site graph, now with decay factor d = 0.8.]

  [ g' ]         [ 0    0  0.5 ]   [ g ]   [ 0.2 ]
  [ y' ] = 0.8 * [ 0.5  1  0.5 ] * [ y ] + [ 0.2 ]
  [ a' ]         [ 0.5  0  0   ]   [ a ]   [ 0.2 ]

Running for multiple iterations:

  [ g ]   [ 0.6 ]   [ 0.44 ]   [ 0.38 ]   [ 0.35 ]
  [ y ] = [ 1.8 ] , [ 2.12 ] , [ 2.25 ] , [ 2.30 ] , …
  [ a ]   [ 0.6 ]   [ 0.44 ]   [ 0.38 ]   [ 0.35 ]

… though does this seem right?
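These iterates can be checked directly; a NumPy sketch with the matrix transcribed from the slide:

```python
import numpy as np

# Same three-site graph as before, but Yahoo links only to itself.
M = np.array([[0.0, 0.0, 0.5],
              [0.5, 1.0, 0.5],
              [0.5, 0.0, 0.0]])

x = np.ones(3)                  # (g, y, a)
for _ in range(4):
    x = 0.8 * (M @ x) + 0.2     # damped update with d = 0.8
    print(x.round(2))
# -> [0.6 1.8 0.6], [0.44 2.12 0.44], [0.38 2.25 0.38], [0.35 2.3 0.35]
```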
Summary of Link Analysis
- Use back-links as a means of adjusting the "worthiness" or "importance" of a page
- Use an iterative process over matrix/vector values to reach a convergence point
- PageRank is query-independent and considered relatively stable
  - But it is vulnerable to SEO
Can We Go Beyond?
- PageRank assumes a "random surfer" who starts at any node, and estimates the likelihood that the surfer will end up at a particular page
- A more general notion: label propagation
  - Take a set of start nodes, each with a different label
  - Estimate, for every node, the distribution of arrivals from each label
  - In essence, this captures the relatedness or influence of nodes
  - Used in YouTube video matching, schema matching, …
Overall Ranking Strategies in Web Search Engines

- Everybody has their own "secret sauce" that uses:
  - The vector model (TF/IDF)
  - Proximity of terms
  - Where terms appear (title vs. body vs. link)
  - Link analysis
  - Info from directories
  - Page popularity
  - (gorank.com, a "search engine optimization site", compares these factors)
- Some alternative approaches:
  - Some newer engines (Vivisimo, Teoma, Clusty) try to do clustering
  - A few engines (Dogpile, Mamma.com) try to do meta-search