Crawling The Web For a Search Engine Or Why Crawling is Cool

Talk Outline
- What is a crawler?
- Some of the interesting problems
- RankMass Crawler
- As time permits:
  - Refresh Policies
  - Duplicate Detection
What is a Crawler?
[Diagram: the crawler is initialized with a set of initial URLs; it repeatedly gets the next URL from the "to visit" list, gets the page from the web, extracts URLs from the downloaded web page, adds unseen ones to the "to visit" list, and records the URL among the "visited" URLs.]
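A minimal sketch of the loop the diagram describes, assuming hypothetical helper functions `fetch_page` and `extract_urls` for downloading a page and pulling links out of it:

```python
from collections import deque

def crawl(initial_urls, fetch_page, extract_urls, max_pages=1000):
    """Basic crawl loop: fetch pages breadth-first from a seed set."""
    to_visit = deque(initial_urls)     # "to visit urls"
    visited = set()                    # "visited urls"
    pages = {}                         # downloaded "web pages"

    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()       # "get next url"
        if url in visited:
            continue
        visited.add(url)
        page = fetch_page(url)         # "get page" from the web
        pages[url] = page
        for link in extract_urls(page):   # "extract urls"
            if link not in visited:
                to_visit.append(link)
    return pages
```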
Applications
- Internet search engines: Google, Yahoo, MSN, Ask
- Comparison shopping services
- Data mining: Stanford WebBase, IBM WebFountain

Is that it? Not quite.
Crawling: The Big Picture
- Duplicate pages and mirror sites: identifying similar pages, templates
- The Deep Web
- When to stop?
- Incremental crawlers
- Refresh policies
- Evolution of the Web
- Crawling the "good" pages first
- Focused crawling
- Distributed crawlers
- Crawler-friendly web servers
Today’s Focus
- A crawler which guarantees coverage of the Web
- As time permits:
  - Refresh Policies
  - Duplicate Detection Techniques
RankMass Crawler
A Crawler with High Personalized PageRank Coverage Guarantee
Motivation
- It is impossible to download the entire web; for example, a single online calendar can generate an endless number of pages.
- When can we stop?
- How do we gain the most benefit from the pages we download?
Main Issues
- Crawler guarantee:
  - A guarantee on how much of the "important" part of the Web is covered when the crawler stops
  - If we don't see the pages, how do we know how important they are?
- Crawler efficiency:
  - Download "important" pages early during a crawl
  - Obtain coverage with a minimum number of downloads
Outline
- Formalize the coverage metric
- L-Neighbor: crawling with a RankMass guarantee
- RankMass: crawling to achieve high RankMass
- Windowed RankMass: how greedy do you want to be?
- Experimental results
Web Coverage Problem
- D: the potentially infinite set of documents of the web
- D_C: the finite set of documents in our document collection
- Assign importance weights to each page
Web Coverage Problem
- What weights? Per query? Topic?
- PageRank? Why PageRank?
  - Useful as an importance measure (the random-surfer interpretation)
  - Effective for ranking
PageRank: A Short Review

  r_i = d \sum_{p_j \in I(p_i)} \frac{r_j}{c_j} + (1 - d) \frac{1}{|D|}

where I(p_i) is the set of pages linking to p_i, c_j is the number of out-links of p_j, and d is the damping factor.
[Example graph with pages p1, p2, p3, p4.]
Now It’s Personal
- Personalized PageRank, TrustRank, general PageRank
- Trust vector T = (t_1, t_2, \ldots)^T

  r_i = d \sum_{p_j \in I(p_i)} \frac{r_j}{c_j} + (1 - d)\, t_i
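A small sketch of computing personalized PageRank by power iteration over an in-memory graph, following the formula above; the adjacency representation, iteration count, and handling of dangling pages are assumptions made for illustration:

```python
def personalized_pagerank(out_links, trust, d=0.85, iters=50):
    """Power iteration for r_i = d * sum_{j -> i} r_j / c_j + (1 - d) * t_i.

    out_links: dict page -> list of pages it links to
    trust:     dict page -> t_i (non-negative, sums to 1)
    Dangling pages simply drop their mass here, for simplicity.
    """
    pages = set(out_links) | {q for links in out_links.values() for q in links}
    r = {p: trust.get(p, 0.0) for p in pages}        # start from the trust vector
    for _ in range(iters):
        nxt = {p: (1 - d) * trust.get(p, 0.0) for p in pages}
        for p, links in out_links.items():
            if not links:
                continue
            share = d * r[p] / len(links)            # d * r_j / c_j
            for q in links:
                nxt[q] += share
        r = nxt
    return r
```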
RankMass Defined
- Using personalized PageRank, formally define the RankMass of D_C:

  RM(D_C) = \sum_{p_i \in D_C} r_i

- Coverage guarantee: we seek a crawler that, given ε, has downloaded pages D_C when it stops such that

  RM(D_C) = \sum_{p_i \in D_C} r_i \geq 1 - \varepsilon

- Efficient crawling: we seek a crawler that, for a given N, downloads D_C with |D_C| = N such that RM(D_C) is greater than or equal to that of any other D_C ⊆ D with |D_C| = N.
How to Calculate RankMass
- Based on PageRank
- How do you compute RM(D_C) without downloading the entire web?
- We can't compute it exactly, but we can lower-bound it
- Let's start with a simple case
Single Trusted Page
- T(1): t_1 = 1; t_i = 0 for i ≠ 1
- The random surfer always jumps to p_1 when bored
- We can place a lower bound on the probability of being within L links of p_1
- N_L(p_1) = the set of pages reachable from p_1 within L links
Lower Bound Guarantee: Single Trusted Page
- Theorem 1: Assuming the trust vector T(1), the sum of the PageRank values of all L-neighbors of p_1 is at least d^{L+1} close to 1. That is:

  \sum_{p_i \in N_L(p_1)} r_i \geq 1 - d^{L+1}
Lower Bound Guarantee: General Case
- Theorem 2: The RankMass of the L-neighbors of the group of all trusted pages G, N_L(G), is at least d^{L+1} close to 1. That is:

  \sum_{p_i \in N_L(G)} r_i \geq 1 - d^{L+1}
RankMass Lower Bound
- Lower bound given a single trusted page p_1:

  \sum_{p_i \in N_L(p_1)} r_i \geq 1 - d^{L+1}

- Extension: given a set of trusted pages G:

  \sum_{p_i \in N_L(G)} r_i \geq 1 - d^{L+1}

- That is the basis of the crawling algorithm with a coverage guarantee.
The L-Neighbor Crawler
1. L := 0
2. N[0] = {p_i | t_i > 0}   // start with the trusted pages
3. While (ε < d^{L+1}):
   1. Download all uncrawled pages in N[L]
   2. N[L+1] = {all pages linked to by a page in N[L]}
   3. L = L + 1
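A minimal Python sketch of the pseudocode above; `fetch_links` (download a page and return the URLs it links to) and the `trusted` seed set are assumed helpers for illustration:

```python
def l_neighbor_crawl(trusted, fetch_links, d=0.85, epsilon=0.01):
    """L-Neighbor sketch: expand the L-neighborhood of the trusted pages
    until d^(L+1) <= epsilon, which guarantees RankMass >= 1 - epsilon."""
    L = 0
    frontier = set(trusted)             # N[L], starting from N[0] = trusted pages
    links_of = {}                       # cache of out-links per downloaded URL
    while epsilon < d ** (L + 1):
        for url in frontier:
            if url not in links_of:     # download all uncrawled pages in N[L]
                links_of[url] = fetch_links(url)
        # N[L+1] = all pages linked to by a page in N[L]
        frontier = {v for url in frontier for v in links_of[url]}
        L += 1
    return set(links_of)                # the downloaded pages
```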
But What About Efficiency?
- L-Neighbor is similar to BFS: simple and efficient
- We may wish to prioritize certain neighborhoods first (e.g., the neighborhood of a page with t_0 = 0.99 before that of a page with t_1 = 0.01)
- Page-level prioritization
Page-Level Prioritizing
- We want a more fine-grained, page-level priority
- The idea:
  - Estimate PageRank on a per-page basis
  - Give high priority to pages with a high PageRank estimate
  - We cannot calculate the exact PageRank, so we calculate a lower bound on the PageRank of undownloaded pages
  - ...but how?
Probability of Being at Page p
[Diagram: the random surfer either clicks a link or is interrupted and jumps to a trusted page; we track the probability of the surfer being at page p.]
Calculating the PageRank Lower Bound
- PageRank(p) = the probability that the random surfer is at p
- Break each surfer path down by "interrupts", i.e., jumps to a trusted page
- Sum up the probabilities of all paths that start with an interrupt and end at p
[Diagram: one such path — an interrupt (probability 1 − d) lands on a trusted page Pj (weight t_j), then the surfer clicks through pages P1 … P5, each click contributing a factor d · 1/c (e.g. d·1/5, d·1/3, …), until the path reaches Pi.]
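As a concrete illustration of this path decomposition (my own worked example, not the paper's exact bookkeeping), the probability of one specific path is the interrupt probability times the trust weight times a factor d/c for each click; the path and out-degrees below are hypothetical:

```python
def path_probability(trust_weight, out_degrees, d=0.85):
    """Probability of one surfer path that starts with an interrupt
    (probability 1 - d), lands on a trusted page with weight trust_weight,
    and then follows one link per entry of out_degrees, each chosen
    uniformly among c_k out-links with damping d."""
    p = (1 - d) * trust_weight
    for c in out_degrees:
        p *= d / c
    return p

# Hypothetical path: interrupt -> trusted page Pj (t_j = 0.5) -> three clicks
# through pages with out-degrees 5, 3, 3. Summing such terms over any subset
# of the paths that end at p_i gives a lower bound on PageRank(p_i).
print(path_probability(0.5, [5, 3, 3]))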
RankMass Basic Idea
[Diagram: a walk-through on a small example graph in which pages are annotated with their current PageRank lower bounds (p1 = 0.99, p2 = 0.01, p3–p5 = 0.25, p6 = 0.09, p7 = 0.09); at each step the crawler downloads the page with the highest current lower bound and propagates its mass to the pages it links to.]
RankMass Crawler: High Level
- That sounds complicated, but luckily we don't need all of that
- The crawler is based on this idea:
  - Dynamically update a lower bound on the PageRank of each page
  - Keep a running total of the RankMass crawled so far
  - Always download the page with the highest lower bound
RankMass Crawler (Shorter)
- Variables:
  - CRM: RankMass lower bound of the crawled pages
  - rm_i: lower bound on the PageRank of p_i
- RankMassCrawl():
  - CRM = 0
  - rm_i = (1 − d) t_i for each t_i > 0
  - While (CRM < 1 − ε):
    - Pick the p_i with the largest rm_i
    - Download p_i if not downloaded yet
    - CRM = CRM + rm_i
    - Foreach p_j linked to by p_i: rm_j = rm_j + (d / c_i) · rm_i
    - rm_i = 0
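A sketch of the loop above in Python; `fetch_links` (download p_i and return its out-links) and the `trust` vector are assumed inputs, and a simple max over a dictionary stands in for "pick the page with the largest rm_i":

```python
def rankmass_crawl(trust, fetch_links, d=0.85, epsilon=0.01):
    """Greedy RankMass crawl: repeatedly take the page with the largest
    PageRank lower bound rm_i until the crawled mass CRM reaches 1 - epsilon."""
    rm = {p: (1 - d) * t for p, t in trust.items() if t > 0}
    crm = 0.0                           # RankMass lower bound of crawled pages
    downloaded = {}                     # page -> out-links
    while crm < 1 - epsilon and rm:
        p = max(rm, key=rm.get)         # page with the largest lower bound
        mass = rm.pop(p)                # rm_i is consumed (set to 0)
        crm += mass
        if p not in downloaded:
            downloaded[p] = fetch_links(p)
        links = downloaded[p]
        if links:
            share = d * mass / len(links)   # (d / c_i) * rm_i
            for q in links:
                rm[q] = rm.get(q, 0.0) + share
    return downloaded, crm
```

A page may be picked again later if mass flows back to it; it is not downloaded twice, but its accumulated lower bound is still added to CRM and redistributed, matching the pseudocode.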
Greedy vs. Simple
- L-Neighbor is simple and efficient
- RankMass is very greedy, and each update is expensive: it requires random access to the web graph
- Compromise? Batching:
  - Batch downloads together
  - Batch updates together
Windowed RankMass
- Variables:
  - CRM: RankMass lower bound of the crawled pages
  - rm_i: lower bound on the PageRank of p_i
- Crawl():
  - rm_i = (1 − d) t_i for each t_i > 0
  - While (CRM < 1 − ε):
    - Download the top window% of pages according to rm_i
    - Foreach downloaded page p_i ∈ D_C:
      - CRM = CRM + rm_i
      - Foreach p_j linked to by p_i: rm_j = rm_j + (d / c_i) · rm_i
      - rm_i = 0
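A sketch of the windowed variant under the same assumptions as the previous snippet; `window` is the fraction of the current candidate set downloaded per iteration:

```python
def windowed_rankmass_crawl(trust, fetch_links, d=0.85, epsilon=0.01, window=0.2):
    """Batched variant: each iteration downloads the top `window` fraction of
    candidate pages by rm_i, then applies the lower-bound updates in one pass."""
    rm = {p: (1 - d) * t for p, t in trust.items() if t > 0}
    crm = 0.0
    downloaded = {}
    while crm < 1 - epsilon and rm:
        k = max(1, int(window * len(rm)))
        batch = sorted(rm, key=rm.get, reverse=True)[:k]   # top window% pages
        for p in batch:                                    # batched downloads
            if p not in downloaded:
                downloaded[p] = fetch_links(p)
        for p in batch:                                    # batched updates
            mass = rm.pop(p)
            crm += mass
            links = downloaded[p]
            if links:
                share = d * mass / len(links)
                for q in links:
                    rm[q] = rm.get(q, 0.0) + share
    return downloaded, crm
```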
Experimental Setup
- HTML files only
- Algorithms simulated over a web graph crawled between December 2003 and January 2004
- 141 million URLs spanning 6.9 million host names
- 233 top-level domains
Metrics of Evaluation
1. How much RankMass is collected during the crawl
2. How much RankMass is "known" to have been collected during the crawl
3. How much computational and performance overhead the algorithm introduces
[Result plots: RankMass collected vs. number of downloads for L-Neighbor, RankMass, and Windowed RankMass, and the effect of window size.]
Algorithm Efficiency

| Algorithm         | Downloads for 0.98 guaranteed RankMass | Downloads for 0.98 actual RankMass |
|-------------------|----------------------------------------|------------------------------------|
| L-Neighbor        | 7 million                              | 65,000                             |
| RankMass          | 131,072                                | 27,939                             |
| Windowed RankMass | 217,918                                | 30,826                             |
| Optimal           | 27,101                                 | 27,101                             |
Algorithm Running Time

| Algorithm    | Hours | Number of Iterations | Number of Documents |
|--------------|-------|----------------------|---------------------|
| L-Neighbor   | 1:27  | 13                   | 83,638,834          |
| 20% Windowed | 4:39  | 44                   | 80,622,045          |
| 10% Windowed | 10:27 | 85                   | 80,291,078          |
| 5% Windowed  | 17:52 | 167                  | 80,139,289          |
| RankMass     | 25:39 | Not comparable       | 10,350,000          |
Refresh Policies
Refresh Policy: Problem Definition
- You have N URLs that you want to keep fresh
- Limited resources: f documents per second
- Choose a download order that maximizes average freshness
- What do you do?
- Note: we can't always know what the page currently looks like
The Optimal Solution
- Depends on the definition of freshness
- Freshness is boolean:
  - A page is either fresh or not
  - One small change makes it unfresh
Understanding Freshness Better
- Two-page database: Pd changes daily, Pw changes once a week
- We can refresh one page per week
- How should we visit the pages?
  - Uniform: Pd, Pd, Pd, Pd, Pd, Pd, …
  - Proportional: Pd, Pd, Pd, Pd, Pd, Pd, Pw
  - Other?
Proportional Often Not Good!
- Visit the fast-changing page (Pd): gain about half a day of freshness
- Visit the slow-changing page (Pw): gain about half a week of freshness
- Visiting Pw is the better deal!
Optimal Refresh Frequency
- Problem: given change frequencies \lambda_1, \lambda_2, \ldots, \lambda_N and a total refresh rate f, find refresh frequencies f_1, f_2, \ldots, f_N that maximize the average freshness

  F(S) = \frac{1}{N} \sum_{i=1}^{N} F(e_i)

  subject to \frac{1}{N} \sum_{i=1}^{N} f_i = f
Optimal Refresh Frequency
- The shape of the optimal-frequency curve is the same in all cases
- Holds for any change frequency distribution
Do Not Crawl in the DUST: Different URLs Similar Text
Ziv Bar-Yossef (Technion and Google)
Idit Keidar (Technion)
Uri Schonfeld (UCLA)
Even the WWW Gets Dusty
- DUST: Different URLs, Similar Text
- Examples:
  - Default directory files: "/index.html" → "/"
  - Domain names and virtual hosts: "news.google.com" → "google.com/news"
  - Aliases and symbolic links: "~shuri" → "/people/shuri"
  - Parameters with little effect on content: ?Print=1
  - URL transformations: "/story_<num>" → "story?id=<num>"
Why Care About DUST?
- Reduce crawling and indexing: avoid fetching the same document more than once
- Canonization for better ranking: references to a document may be split among its aliases
- Avoid returning duplicate results
- Many algorithms that use URLs as unique IDs will benefit
Related Work
- Similarity detection via document sketches [Broder et al., Hoad-Zobel, Shivakumar et al., Di Iorio et al., Brin et al., Garcia-Molina et al.]
  - Requires fetching all document duplicates
  - Cannot be used to find "DUST rules"
- Mirror detection [Bharat, Broder 99], [Bharat, Broder, Dean, Henzinger 00], [Cho, Shivakumar, Garcia-Molina 00], [Liang 01]
  - Not suitable for finding site-specific "DUST rules"
- Mining association rules [Agrawal and Srikant]
  - A technically different problem
So what are we looking for?
Our Contributions
- DustBuster, an algorithm that:
  - Discovers site-specific DUST rules from a URL list without examining page content
  - Requires only a small number of page fetches to validate the rules
- A site-specific URL canonization algorithm
- Experiments on real data from both web access logs and crawl logs
DUST Rules
- Valid DUST rule: a mapping Ψ that maps each valid URL u to a valid URL Ψ(u) with similar content
  - "/index.html" → "/"
  - "news.google.com" → "google.com/news"
  - "/story_<num>" → "story?id=<num>"
- Invalid DUST rules: either do not preserve similarity or do not produce valid URLs
Types of DUST Rules
- Substring substitution DUST (the focus of this talk):
  - "story_1259" → "story?id=1259"
  - "news.google.com" → "google.com/news"
  - "/index.html" → ""
- Parameter DUST:
  - Removing a parameter, or replacing its value with a default value
  - "Color=pink" → "Color=black"
Basic Detection Framework
- Input: a list of URLs from a site (a crawl log or a web access log)
- Detect likely DUST rules (no page fetches at this stage)
- Eliminate redundant rules
- Validate the DUST rules using samples

Detect → Eliminate → Validate
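To make the "detect likely rules without fetching" step concrete, here is a toy sketch in Python under my own simplifications (it is not the paper's DustBuster algorithm): it proposes α → β as a candidate substring-substitution rule when many URLs in the list differ only by swapping α for β, i.e. when the two substrings share many (prefix, suffix) envelopes. This connects to the instances-and-support example on the next slide.

```python
from collections import defaultdict
from itertools import combinations

def candidate_dust_rules(urls, min_support=3, max_sub_len=12):
    """Toy sketch: propose substring pairs (alpha, beta) as likely DUST rules
    when many URLs differ only by substituting alpha for beta. Not tuned for
    large logs, and it skips the elimination and validation steps."""
    middles_by_envelope = defaultdict(set)
    for url in urls:
        n = len(url)
        for i in range(n + 1):
            for j in range(i, min(n, i + max_sub_len) + 1):
                # url == prefix + middle + suffix; record the middle under its envelope
                middles_by_envelope[(url[:i], url[j:])].add(url[i:j])
    support = defaultdict(int)
    for middles in middles_by_envelope.values():
        for a, b in combinations(sorted(middles), 2):
            support[(a, b)] += 1          # one shared envelope = one instance
    return [(a, b, s) for (a, b), s in support.items() if s >= min_support]
```

For example, a log containing many pairs like "/story_123" and "/story?id=123" would surface ("_", "?id=") style candidates with high support, which the later validation step would then confirm or reject by sampling pages.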
Example: Instances & Support
END