Evaluating Top-k Queries over Web

advertisement
Evaluating Top-k Queries over
Web-Accessible Databases
Nicolas Bruno
Luis Gravano
Amélie Marian
Columbia University
“Top-k” Queries Natural in
Many Scenarios
Example: NYC Restaurant Recommendation
Service.
Goal: Find best restaurants for a user:
Close to address: “2290 Broadway”
Price around $25
Good rating
Query: Specification of Flexible Preferences
Answer: Best k Objects for Distance Function
2/27/2002
2
Attributes often Handled by
External Sources
MapQuest returns the distance between
two addresses.
NYTimes Review gives the price range of
a restaurant.
Zagat gives a food rating to the
restaurant.
2/27/2002
3
“Top-k” Query Processing
Challenges
Attributes handled by external sources
(e.g., MapQuest distance).
External sources exhibit a variety of
interfaces (e.g., NYTimes Review,
Zagat).
Existing algorithms do not handle all
types of interfaces.
2/27/2002
4
Processing Top-k Queries over
Web-Accessible Data Sources
Data and query model
Algorithms for sources with different
interfaces
Our new algorithm: Upper
Experimental results
2/27/2002
5
Data Model
Top-k Query: assignment of weights and
target values to attributes
< $25, “2290 Broadway”, very good >
preferred price
close to address
weights: <4, 1, 2>
price: most important attribute
2/27/2002
preferred rating
Combined in
scoring function
6
Sorted Access Source S
Return objects sorted
by scores for a given
query.
GetNextS interface
Example: Zagat
S-Source
Access Time: tS(S)
2/27/2002
7
Random Access Source R
Return the score of a
given object for a given
query.
GetScoreR interface
Example: MapQuest
R-Source
Access Time: tR(R)
2/27/2002
8
Query Model
Attributes scores between 0 and 1.
Sequential access to sources.
Score Ties broken arbitrarily.
No wild guesses.
One S-Source (or SR-Source) and
multiple R-sources. (More on this later.)
2/27/2002
9
Query Processing Goals
Processing top-k queries over R-Sources.
Returning exact answer to top-k query q.
Minimizing query response time.
Naïve solution too expensive (access all
sources for all objects).
2/27/2002
10
Example: NYC Restaurants
S-Source:
Zagat: restaurants sorted by food rating.
R-Sources:
MapQuest: distance between two input
addresses.
User address: “2290 Broadway”
NYTimes Review: price range of the input
restaurant.
Target Value: $25
2/27/2002
11
TA Algorithm for SR-Sources
Fagin, Lotem, and Naor (PODS 2001)
Perform sorted access sequentially to all SR-Sources
Completely probe every object found for all
attributes using random access.
Keep best k objects.
Stop when scores of best k objects are no less than
maximum possible score of unseen objects
(threshold).
Does NOT handle R-Sources
2/27/2002
12
Our Adaptation of TA Algorithm
for R-Sources: TA-Adapt
Perform sorted access to S-Source S.
Probe every R-Source Ri for newly found
object.
Keep best k objects.
Stop when scores of best k objects are no
less than maximum possible score of unseen
objects (threshold).
2/27/2002
13
An Example Execution of
TA-Adapt
Object
S(Zagat)
R1(MQ) R2(NYT) Final Score
o1
0.9
0.1
0.5
0.56
o2
o3
0.8
0.45
0.7
0.6
0.7
0.3
0.75
0.55
GetScore
GetScore
GetNext
GetNext
(q)
(q,o
R1
R2
R1
R2
S(q,o
S(q)213))
Threshold
Threshold
Threshold===10.725
0.95
0.9
0.725
score
score
x
Total Execution Time = 9
xx
o1
o1
x
o2
o3
o3
tS(S)=tR(R1)=tR(R2)=1, w=<3, 2, 1>, k=1
2/27/2002
Final Score = (3.scoreZagat + 2.scoreMQ + 1.scoreNYT)/6
14
Improvements over TA-Adapt
Add a shortcut test after each randomaccess probe (TA-Opt).
Exploit techniques for processing
selections with expensive predicates
(TA-EP).
Reorder accesses to R-Sources.
Best weight/time ratio.
2/27/2002
15
The Upper Algorithm
Selects a pair (object,source) to probe next.
Based on the property:
The object with the highest upper bound will
be probed before top-k solution is reached.
score
score
erocs
xx
2/27/2002
x
Object is not
one one
of top-k
of top-k
objects
objects
16
An Example Execution of Upper
Object
Upper Bound
S(Zagat) R1(MQ) R2(NYT)
o1
0.65
0.95
0.9
0.1
o2
0.8
0.75
0.9
0.8
0.7
o
0.725
0.45
3
GetScore
GetScore
GetNext
GetNext
GetNext
(q)
(q)
o21)
R2
R1S(q,
S(q,o
S(q)
Threshold
Threshold
Threshold
Threshold
===
0.725
0.95
0.95
0.9
Threshold
==
10.9
Total Execution Time = 6
Final Score
0.7
0.75
score
x
o1
o2
o3
tS(S)=tR(R1)=tR(R2)=1, w=<3, 2, 1>, k=1
2/27/2002
Final Score = (3.scoreZagat + 2.scoreMQ + 1.scoreNYT)/6
17
The Upper Algorithm
Choose object with highest upper bound.
If some unseen object can have higher upper bound:
Access S-Source S
Else:
Access best R-Source Ri for chosen object
Keep best k objects
If top-k objects have final values higher than
maximum possible value of any other object, return
top-k objects.
Interleaves accesses on objects
2/27/2002
18
Selecting the Best Source
Upper relies on expected values to make its
choices.
Upper computes “best subset” of sources
that is expected to:
1. Compute the final score for k top objects.
2. Discard other objects as fast as possible.
Upper chooses best source in “best subset”.
Best weight/time ratio.
2/27/2002
19
Experimental Setting:
Synthetic Data
Attribute scores randomly generated (three
data sets: uniform, gaussian and correlated).
tR(Ri): integer between 1 and 10.
tS(S)  {0.1, 0.2,…,1.0}.
Query execution time: ttotal
Default: k=50, 10000 objects, uniform data.
Results: average ttotal of 100 queries.
Optimal assumes complete knowledge
(unrealistic, but useful performance bound)
2/27/2002
20
Experiments: Varying Number
of Objects Requested k
210
180
ttotal
150
Optimal
120
Upper
TA-EP
90
TA-Opt
60
TA-Adapt
30
0
0
20
40
60
80
100
k
2/27/2002
21
Experiments: Varying Number
of Database Objects N
350
300
ttotal
250
Optimal
200
Upper
150
TA-EP
TA-Opt
100
50
0
0
20000
40000
60000
80000
100000
Number of objects in S-Source S
2/27/2002
22
Experimental Setting:
Real Web Data
S-Source: Verizon Yellow Pages
(sorted by distance)
R-Sources:
Subway Navigator
Subway time
Altavista
MapQuest
NYTimes Review
Popularity
Zagat
2/27/2002
Driving time
Food and price
ratings
Food, Service, Décor
and Price ratings
23
Experiments: Real-Web Data
# of Random
nR Accesses
6000
5000
4000
Upper
3000
TA-EP
TA-Opt
2000
1000
2/27/2002
7
Q
ue
ry
6
Q
ue
ry
5
Q
ue
ry
4
Q
ue
ry
3
Q
ue
ry
2
Q
ue
ry
Q
ue
ry
1
0
24
Evaluation Conclusions
TA-EP and TA-Opt much faster than
TA-Adapt.
Upper significantly better than all
versions of TA.
Upper close to optimal.
Real data experiments: Upper faster
than TA adaptations.
2/27/2002
25
Conclusion
Introduced first algorithm for top-k processing
over R-Sources.
Adapted TA to this scenario.
Presented new algorithms: Upper and Pick (see
paper)
Evaluated our new algorithms with both real
and synthetic data.
Upper close to optimal
2/27/2002
26
Current and Future Work
Relaxation of the Source Model
Current source model limited
Any number of R-Sources and SR-Sources
Upper has good results even with only SR-Sources
Parallelism
Define a query model for parallel access to
sources
Adapt our algorithms to this model
Approximate Queries
2/27/2002
27
References
Top-k Queries:
Evaluating Top-k Selection Queries, S. Chaudhuri and L.
Gravano. VLDB 1999
TA algorithm:
Optimal Aggregation Algorithms for Middleware, R. Fagin, A.
Lotem, and M. Naor. PODS 2001
Variations of TA:
Query Processing Issues on Image (Multimedia) Databases,
S. Nepal and V. Ramakrishna. ICDE 1999
Optimizing Multi-Feature Queries for Image Databases, U.
Güntzer, W.-T. Balke, and W.Kießling. VLDB 2000
Expensive Predicates
Predicate Migration: Optimizing queries with Expensive
Predicates, J.M. Hellerstein and M. Stonebraker. SIGMOD
2/27/2002
1993
28
Real-web Experiments
6000
5000
ttotal
4000
Upper
3000
TA-EP
TA-Opt
2000
1000
2/27/2002
7
Q
ue
ry
6
Q
ue
ry
5
Q
ue
ry
4
Q
ue
ry
3
Q
ue
ry
2
Q
ue
ry
Q
ue
ry
1
0
29
Real-web Experiments with
Adaptive Time
TA-Opt
TA-EP
Upper
1200
ttotal (seconds)
1000
800
600
400
200
0
Query 1
2/27/2002
Query 2
Query 3
Query 4
30
Relaxing the Source Model
Upper_Weight
Upper-Relaxed
TA-Upper
TAz-EP-NODUP
TAz-EP
200000
180000
160000
140000
ttotal
120000
TA-EP
100000
80000
60000
Upper
40000
20000
0
0
1
2
3
4
5
6
7
Number of SR-Sources (out of 6 sources)
2/27/2002
31
Upcoming Journal Paper
Variations of Upper
Select best source
Data Structures
Complexity Analysis
Relaxing Source Model
Adaptation of our Algorithms
New Algorithms
Variations of Data and Query Model to handle
real web data
2/27/2002
32
Optimality
TA instance optimal over:
Algorithms that do not make wild guesses.
Databases that satisfy the distinctness property.
TAZ instance optimal over:
Algorithms that do not make wild guesses.
No complexity analysis of our algorithms, but
experimental evaluation instead
2/27/2002
33
Download