Searching the Web - Computer Science

Searching the Web
Junghoo Cho
UCLA Computer Science
1
Information Galore
Biblio server
Legacy database
Plain text files
2
Information Overload Problem
3
Solution

Indexing approach
– Google, Excite, AltaVista

Integration approach
– MySimon, BizRate
4
Indexing Approach
Central Index
5
Challenges

Page selection and download
– What page to download?

Page and index update
– How to update pages?

Page ranking
– What page is “important” or “relevant”?

Scalability
6
Integration Approach
Mediator
Wrapper
Source 1
Wrapper
Source 2
Wrapper
Source n
7
Challenges
Heterogeneous sources
– Different data models: relational, object-oriented
– Different schemas and representations: “Keanu Reeves” or “Reeves, K.” etc.
Limited query capabilities
Mediator caching
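The mediator and wrapper arrangement can be sketched as follows. This is a minimal illustration only: the class names, the `search` method, the common record schema, and the sample rows are all hypothetical and are not taken from the talk. Each wrapper translates its source's own name format into a shared representation, and the mediator fans a query out to every wrapper.

```python
from __future__ import annotations
from typing import Protocol


class Wrapper(Protocol):
    """Hypothetical interface: every wrapper answers queries in one common schema."""
    def search(self, last_name: str) -> list[dict]: ...


class FullNameSource:
    """Hypothetical source that stores names as 'First Last'."""
    def __init__(self, rows: list[tuple[str, str]]) -> None:
        self.rows = rows

    def search(self, last_name: str) -> list[dict]:
        results = []
        for name, record in self.rows:
            first, last = name.rsplit(" ", 1)
            if last == last_name:
                results.append({"last": last, "first_initial": first[0], "record": record})
        return results


class InitialSource:
    """Hypothetical source that stores names as 'Last, F.'."""
    def __init__(self, rows: list[tuple[str, str]]) -> None:
        self.rows = rows

    def search(self, last_name: str) -> list[dict]:
        results = []
        for name, record in self.rows:
            last, initial = name.split(", ")
            if last == last_name:
                results.append({"last": last, "first_initial": initial[0], "record": record})
        return results


class Mediator:
    """Sends the query to every wrapper and merges the normalized answers."""
    def __init__(self, wrappers: list[Wrapper]) -> None:
        self.wrappers = wrappers

    def search(self, last_name: str) -> list[dict]:
        merged = []
        for wrapper in self.wrappers:
            merged.extend(wrapper.search(last_name))
        return merged


mediator = Mediator([
    FullNameSource([("Keanu Reeves", "record A")]),
    InitialSource([("Reeves, K.", "record B")]),
])
matches = mediator.search("Reeves")
print(matches)
```

Both sources end up describing the same person with the same keys, which is the point of the wrapper layer. Limited query capabilities and caching are not modeled here.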
8
Focus of the Talk
Indexing approach
How to keep pages up-to-date?
9
Outline of This Talk
How can we keep pages fresh?
How does the Web change?
What do we mean by “fresh” pages?
How should we refresh pages?
10
Web Evolution Experiment
How often does a Web page change?
How long does a page stay on the Web?
How long does it take for 50% of the Web to change?
How do we model Web changes?
11
Experimental Setup
February 17 to June 24, 1999
270 sites visited (with permission)
– identified 400 sites with highest “PageRank”
– contacted administrators
720,000 pages collected
– 3,000 pages from each site daily
– start at root, visit breadth first (get new & old pages)
– ran only 9pm - 6am, 10 seconds between site requests
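The per-site collection step (start at the root, visit breadth first, stop at a page limit, pause between requests) could look roughly like the sketch below. It is not the crawler used in the experiment: `crawl_site`, `get_links`, and the toy `SITE` graph are hypothetical names, and link extraction is passed in as a function so the example runs without network access.

```python
import time
from collections import deque
from typing import Callable, Iterable, List


def crawl_site(root: str,
               get_links: Callable[[str], Iterable[str]],
               max_pages: int = 3000,
               delay_seconds: float = 0.0) -> List[str]:
    """Visit pages breadth first from root and return them in visit order."""
    order: List[str] = []
    seen = {root}
    queue = deque([root])
    while queue and len(order) < max_pages:
        page = queue.popleft()
        order.append(page)
        # get_links stands in for "download the page and extract its links".
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
        # Politeness delay between requests (10 seconds in the experiment).
        if delay_seconds > 0 and queue:
            time.sleep(delay_seconds)
    return order


# Toy in-memory site used only for illustration.
SITE = {"/": ["/a", "/b"], "/a": ["/c"], "/b": ["/c", "/"], "/c": []}
visit_order = crawl_site("/", lambda page: SITE.get(page, []))
print(visit_order)
```

Because the traversal restarts at the root every day, pages that appeared since the previous visit are picked up along with the old ones.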
12
Average Change Interval
[Figure: histogram of the fraction of pages (y-axis, 0.00 to 0.35) by average change interval (x-axis buckets: up to 1 day; 1 day to 1 week; 1 week to 1 month; 1 month to 4 months; over 4 months)]

13
Change Interval – By Domain
[Figure: fraction of pages (y-axis, 0 to 0.6) by average change interval (x-axis buckets: up to 1 day; 1 day to 1 week; 1 week to 1 month; 1 month to 4 months; over 4 months), shown separately for the com, netorg, edu, and gov domains]
14
Modeling Web Evolution
Poisson process with rate $\lambda$
T is time to next event
$f_T(t) = \lambda e^{-\lambda t}$ $(t > 0)$
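As a quick numerical check of the Poisson model, the sketch below draws change intervals from the exponential density and compares two summary values with what the model predicts: a mean interval of $1/\lambda$ and $P(T \le 1/\lambda) = 1 - e^{-1} \approx 0.632$. The rate of 0.1 changes per day (one change every 10 days on average), the sample size, and the function name are assumptions chosen for illustration.

```python
import math
import random


def sample_change_intervals(rate: float, n: int, seed: int = 0) -> list:
    """Draw n intervals between changes for a Poisson process with the given rate."""
    rng = random.Random(seed)
    return [rng.expovariate(rate) for _ in range(n)]


RATE = 0.1  # assumed: one change every 10 days on average
intervals = sample_change_intervals(RATE, 100_000)

mean_interval = sum(intervals) / len(intervals)
frac_within_mean = sum(1 for t in intervals if t <= 1 / RATE) / len(intervals)

print(f"mean interval: {mean_interval:.2f} days (model: {1 / RATE:.2f})")
print(f"fraction <= 10 days: {frac_within_mean:.3f} (model: {1 - math.exp(-1):.3f})")
```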
15
Change Interval of Pages
[Figure: fraction of changes with a given interval (y-axis) against interval in days (x-axis), for pages that change every 10 days on average, compared with the Poisson model]
16
Change Metrics

Freshness
– Freshness of element $e_i$ at time $t$ is
$F(e_i; t) = 1$ if $e_i$ is up-to-date at time $t$, $0$ otherwise
– Freshness of the database $S$ at time $t$ is
$F(S; t) = \frac{1}{N} \sum_{i=1}^{N} F(e_i; t)$
[Diagram: database copies of elements e_i alongside the corresponding elements on the web]
(Assume “equal importance” of pages)
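A small sketch of the freshness metric at a single instant, assuming the local database and the live web can both be represented as dictionaries from element name to content version. The function name and the sample data are hypothetical.

```python
def freshness(local_copy: dict, web: dict) -> float:
    """Fraction of locally stored elements that match the live web version."""
    if not local_copy:
        raise ValueError("database is empty")
    up_to_date = sum(1 for key, value in local_copy.items() if web.get(key) == value)
    return up_to_date / len(local_copy)


# Illustrative snapshot: e1 has changed on the web since it was last copied.
local_copy = {"e1": "v1", "e2": "v1", "e3": "v2"}
web = {"e1": "v2", "e2": "v1", "e3": "v2"}
snapshot_freshness = freshness(local_copy, web)
print(snapshot_freshness)
```

Two of the three copies match the web, so the database freshness at this instant is 2/3.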
17
Change Metrics
Age
– Age of element $e_i$ at time $t$ is
$A(e_i; t) = 0$ if $e_i$ is up-to-date at time $t$, $t - (\text{modification time of } e_i)$ otherwise
– Age of the database $S$ at time $t$ is
$A(S; t) = \frac{1}{N} \sum_{i=1}^{N} A(e_i; t)$
[Diagram: database copies of elements e_i alongside the corresponding elements on the web]
(Assume “equal importance” of pages)
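The age metric can be sketched the same way. Here each element is described by the time at which its web version was modified without the local copy being refreshed, or `None` if the copy is up to date. That representation, the function names, and the sample times are assumptions made for the example.

```python
from typing import Optional, Sequence


def element_age(now: float, modified_at: Optional[float]) -> float:
    """0 if the copy is up to date, otherwise time elapsed since the web page was modified."""
    return 0.0 if modified_at is None else now - modified_at


def database_age(now: float, modified_times: Sequence[Optional[float]]) -> float:
    """Average age over all elements, with every element weighted equally."""
    if not modified_times:
        raise ValueError("database is empty")
    return sum(element_age(now, m) for m in modified_times) / len(modified_times)


# Illustrative data at day 10: one fresh copy, one stale for 3 days, one stale for 6 days.
age_at_day_10 = database_age(10.0, [None, 7.0, 4.0])
print(age_at_day_10)
```

Unlike freshness, age keeps growing for as long as a stale copy goes unrefreshed, so it penalizes long neglect.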
18
Change Metrics
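Time averages:
$F(e_i) = \lim_{t \to \infty} \frac{1}{t} \int_0^t F(e_i; \tau)\, d\tau$
$F(S) = \lim_{t \to \infty} \frac{1}{t} \int_0^t F(S; \tau)\, d\tau$
[Figure: F(e_i) (between 0 and 1) and A(e_i) plotted against time, with update and refresh events marked]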
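Over a finite observation window, the time average of freshness is simply the fraction of the window during which the copy was up to date. The sketch below computes that from a list of change (update) times and refresh times. The function name and the event times are hypothetical, and the copy is assumed to start fresh at time 0.

```python
from typing import Sequence


def time_average_freshness(change_times: Sequence[float],
                           refresh_times: Sequence[float],
                           horizon: float) -> float:
    """Fraction of [0, horizon] during which the local copy is up to date."""
    events = sorted([(t, "change") for t in change_times] +
                    [(t, "refresh") for t in refresh_times])
    fresh = True        # assumption: the copy starts out fresh
    last_time = 0.0
    fresh_duration = 0.0
    for t, kind in events:
        if t > horizon:
            break
        if fresh:
            fresh_duration += t - last_time
        last_time = t
        # A change makes the copy stale; a refresh makes it fresh again.
        fresh = (kind == "refresh")
    if fresh:
        fresh_duration += horizon - last_time
    return fresh_duration / horizon


# Page changes at days 2 and 7; the copy is refreshed at days 5 and 10.
average = time_average_freshness([2.0, 7.0], [5.0, 10.0], horizon=10.0)
print(average)
```

The copy is fresh during days 0 to 2 and 5 to 7, giving an average freshness of 0.4 over the 10 days.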
19
Trick Question
Two page database
e1 changes daily
e2 changes once a week
Can visit one page per week
How should we visit pages?
– e1 e2 e1 e2 e1 e2 e1 e2 ... [uniform]
– e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 ... [proportional]
– e1 e1 e1 e1 e1 e1 ...
– e2 e2 e2 e2 e2 e2 ...
– ?
[Diagram: database copies of e1 and e2 alongside the web versions of e1 and e2]
20
Proportional Often Not Good!
Visit fast changing e1
→ get 1/2 day of freshness
Visit slow changing e2
→ get 1/2 week of freshness
Visiting e2 is a better deal!
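The two-page example can be checked numerically. The sketch below assumes that changes follow the Poisson model, that each page is revisited at evenly spaced intervals, and that a page which is never revisited has long-run freshness 0. Under those assumptions a page with change rate $\lambda$ refreshed at rate $f$ has time-averaged freshness $(1 - e^{-\lambda/f}) / (\lambda/f)$, which follows from averaging $e^{-\lambda s}$ over one refresh interval. Function and variable names are illustrative.

```python
import math


def avg_freshness(change_rate: float, refresh_rate: float) -> float:
    """Time-averaged freshness of one page under Poisson changes and periodic refresh."""
    if refresh_rate <= 0:
        return 0.0  # assumption: a page that is never refreshed is stale in the long run
    r = change_rate / refresh_rate
    return (1 - math.exp(-r)) / r


CHANGE_RATE = {"e1": 1.0, "e2": 1 / 7}  # changes per day
BUDGET = 1 / 7                           # one visit per week in total


def database_freshness(share_e1: float) -> float:
    """Average freshness when share_e1 of the visit budget goes to e1, the rest to e2."""
    f1 = BUDGET * share_e1
    f2 = BUDGET * (1 - share_e1)
    return (avg_freshness(CHANGE_RATE["e1"], f1) +
            avg_freshness(CHANGE_RATE["e2"], f2)) / 2


policies = {"uniform": 0.5, "proportional": 7 / 8, "only e1": 1.0, "only e2": 0.0}
scores = {name: database_freshness(share) for name, share in policies.items()}
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name:12s} {score:.3f}")
```

Under these assumptions, visiting only e2 scores highest of the four policies and the proportional policy scores below the uniform one, in line with the slide's conclusion.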
21
Optimal Refresh Frequency
Problem
Given $\lambda_1, \lambda_2, \ldots, \lambda_N$ and $f$, find $f_1, f_2, \ldots, f_N$ with $f = \frac{1}{N} \sum_{i=1}^{N} f_i$ that maximize
$F(S) = \frac{1}{N} \sum_{i=1}^{N} F(e_i)$
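One simple way to approximate a solution numerically is a greedy allocation: hand out the total refresh budget in small increments, each time to the page whose freshness gains the most. This is a sketch under the Poisson model with evenly spaced refreshes, where each page's freshness is concave in its refresh rate, and it is not presented as the method used in the talk. The function names and the step count are assumptions.

```python
import math
from typing import List, Sequence


def avg_freshness(change_rate: float, refresh_rate: float) -> float:
    """Time-averaged freshness of one page under Poisson changes and periodic refresh."""
    if refresh_rate <= 0:
        return 0.0  # assumption: a page that is never refreshed is stale in the long run
    r = change_rate / refresh_rate
    return (1 - math.exp(-r)) / r


def greedy_allocation(change_rates: Sequence[float],
                      avg_refresh_rate: float,
                      steps: int = 1000) -> List[float]:
    """Split the budget N * avg_refresh_rate across pages, one small increment at a time."""
    n = len(change_rates)
    delta = avg_refresh_rate * n / steps
    rates = [0.0] * n
    for _ in range(steps):
        # Freshness gained by each page if it received the next increment.
        gains = [avg_freshness(lam, f + delta) - avg_freshness(lam, f)
                 for lam, f in zip(change_rates, rates)]
        best = max(range(n), key=gains.__getitem__)
        rates[best] += delta
    return rates


# Two-page example: e1 changes daily, e2 weekly, one visit per week in total.
allocation = greedy_allocation([1.0, 1 / 7], avg_refresh_rate=1 / 14)
print(allocation)
```

For the two-page example the whole budget ends up on the slowly changing page, so the fast-changing page is not revisited at all.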
22
Optimal Refresh Frequency
• Shape of curve is the same in all cases
• Holds for any change frequency distribution
23
Optimal Refresh for Age
• Shape of curve is the same in all cases
• Holds for any change frequency distribution
24
Comparing Policies
Policy        Freshness  Age
Proportional  0.12       400 days
Uniform       0.57       5.6 days
Optimal       0.62       4.3 days
Based on statistics from the experiment and a revisit frequency of once every month
25
Not Every Page is Equal!
Some pages are “more important”
e1: accessed by users 10 times/day
e2: accessed by users 20 times/day
$F(S) = \frac{1}{3} \left[ 1 \cdot F(e_1) + 2 \cdot F(e_2) \right]$
In general,
$F(S) = \sum_{i=1}^{N} w_i F(e_i) \Big/ \sum_{i=1}^{N} w_i$
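The weighted definition translates directly into code. In this sketch the per-page freshness values 0.9 and 0.6 are made up for illustration, and the weights 1 and 2 follow the access-rate example (10 versus 20 accesses per day).

```python
from typing import Sequence


def weighted_freshness(page_freshness: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of per-page freshness values."""
    if len(page_freshness) != len(weights):
        raise ValueError("need one weight per page")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * f for w, f in zip(weights, page_freshness)) / total_weight


# Illustrative values: F(e1) = 0.9 with weight 1, F(e2) = 0.6 with weight 2.
overall = weighted_freshness([0.9, 0.6], [1, 2])
print(round(overall, 3))
```

With equal weights the same function reduces to the unweighted database freshness.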
26
Weighted Freshness
[Figure: f plotted against λ, with separate curves for w = 1 and w = 2]
27
Change Frequency Estimation
How to estimate change frequency?
– Naïve Estimator: X/T
– X: number of detected changes
– T: monitoring period
– 2 changes in 10 days: 0.2 times/day
Incomplete change history
[Figure: timeline with a 1 day scale marking when the page changed, when the page was visited, and when a change was detected]
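The naïve estimator and the effect of an incomplete change history can be sketched as follows. The second function is an illustration under the Poisson model: with evenly spaced visits, each visit can detect at most one change, and a visit detects a change with probability $1 - e^{-\lambda/f}$, so the naïve estimate tends toward $f (1 - e^{-\lambda/f})$ rather than $\lambda$. Function names are hypothetical.

```python
import math


def naive_estimate(detected_changes: int, monitoring_period: float) -> float:
    """Naïve estimator X/T."""
    return detected_changes / monitoring_period


def expected_naive_estimate(true_rate: float, visit_rate: float) -> float:
    """Long-run value of X/T when Poisson changes are only observed at periodic visits."""
    return visit_rate * (1 - math.exp(-true_rate / visit_rate))


slide_example = naive_estimate(2, 10)             # 2 changes in 10 days
daily_changer = expected_naive_estimate(1.0, 1.0)  # true rate 1/day, visited daily
print(slide_example, round(daily_changer, 3))
```

A page that really changes once a day on average but is visited only daily would be estimated at roughly 0.63 changes per day, because several changes between two visits count as one.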
28
Improved Estimator
Based on the Poisson model
 N 2 

f  log 

 N  X 1 

– X: number of detected changes
– N: number of accesses
– f: access frequency
3 changes in 10 days: 0.36 times/day
Accounts for “missed” changes
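To show how a Poisson-based estimator can account for missed changes, the sketch below uses the plain maximum-likelihood form $-f \log\left(\frac{N - X}{N}\right)$: the fraction of accesses with no detected change estimates $e^{-\lambda/f}$, and solving for $\lambda$ gives the expression. This simplified form is an assumption for illustration and omits any correction constants that an improved estimator may include. The function name is hypothetical.

```python
import math


def poisson_estimate(detected_changes: int, accesses: int, access_frequency: float) -> float:
    """Estimate the change rate from the share of accesses that found no change."""
    unchanged = accesses - detected_changes
    if accesses <= 0:
        raise ValueError("need at least one access")
    if unchanged <= 0:
        # Every access saw a change, so this simple form has no finite estimate.
        raise ValueError("estimate is unbounded when every access detects a change")
    return -access_frequency * math.log(unchanged / accesses)


# 3 detected changes in 10 daily accesses.
estimate = poisson_estimate(3, 10, access_frequency=1.0)
print(round(estimate, 2))
```

For 3 detected changes in 10 daily accesses this gives about 0.36 changes per day, compared with 0.3 from the naïve X/T.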
29
Improvement Significant?
Application to a Web crawler
– Visit pages once every week for 5 weeks
– Estimate change frequency
– Adjust revisit frequency based on the estimate
» Uniform: do not adjust
» Naïve: based on the naïve estimator
» Ours: based on our improved estimator
30
Improvement from Our Estimator
Policy   Detected changes  Ratio to uniform
Uniform  2,147,589         100%
Naïve    4,145,582         193%
Ours     4,892,116         228%
(9,200,000 visits in total)
31
Summary
Information overload problem
– Indexing approach
– Integration approach
Page update
– Web evolution experiment
– Change metric
– Refresh policy
– Frequency estimator
32
Research Opportunity
Efficient query processing?
Automatic source discovery?
Automatic data extraction?
33
Web Archive Project
Can we store the history of the Web?
– Web is ephemeral
– Study of the evolution of the Web
Challenges
– Update policy?
– Compression?
– New storage structure?
– New index structure?
34
The End
Thank you for your attention
For more information visit
http://www.cs.ucla.edu/~cho/
35