Monochromatic reverse top-k

Reverse Top-k Queries
Akrivi Vlachou*, Christos Doulkeridis*, Yannis Kotidis#, Kjetil Nørvåg*
*Norwegian University of Science and Technology (NTNU), Trondheim, Norway
#Athens University of Economics and Business (AUEB), Greece
Outline
• Motivation & Preliminaries
• Monochromatic Reverse Top-k Queries
• Bichromatic Reverse Top-k Queries
  • Threshold-based Algorithm
  • Materialized Views
• Experimental Evaluation
• Conclusions & Future Work
Rank-aware Query Processing

• Huge amount of available data
• Users prefer to retrieve a limited set of k ranked data objects that best match their preferences (top-k queries)
Top-k Query


• Given a scoring function f(), retrieve the k objects that best match the user preferences
• Linear scoring function: f_w(p) = Σ_i w[i]*p[i]
  • Weight w[i]: relative importance of attribute i
• Definition TOPk(w): Given a weighting vector w and a positive integer k, find the k data points p with the minimum f_w(p) scores
• Query line of w at point p: defines the score of p
• Query space of w defined by point p: the number of enclosed points determines the rank of p
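The TOPk(w) definition can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `score` and `top_k` and the (price, age) values are hypothetical, and the full sort stands in for whatever top-k algorithm is actually used.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def score(w: Sequence[float], p: Sequence[float]) -> float:
    """Linear score f_w(p) = sum_i w[i] * p[i]; lower is better."""
    return sum(wi * pi for wi, pi in zip(w, p))


def top_k(points: Sequence[Point], w: Sequence[float], k: int) -> List[Point]:
    """Return the k points with the minimum score under w."""
    return sorted(points, key=lambda p: score(w, p))[:k]


# Hypothetical (price, age) values, not taken from the slides.
cars = [(2.0, 8.0), (6.0, 1.0), (5.0, 5.0), (9.0, 9.0)]
cheap_first = top_k(cars, (0.9, 0.1), k=1)   # weight mostly on price
new_first = top_k(cars, (0.1, 0.9), k=1)     # weight mostly on age
```

With these values, the price-oriented vector selects the cheapest car and the age-oriented vector selects the newest one.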
Reversing the Top-k Query

From the perspective of manufacturers:
• it is important that a product is returned in the highest ranked positions for as many user preferences as possible
• estimate the impact of a product compared to competitors' products
• advertise a product to potential customers
Which customers would be interested?
Reversing the Top-k Query

Reverse top-k query: Given a potential product q and a positive integer k, which are the weighting vectors w for which q is in the top-k query result set?
Two different versions:
• Monochromatic: no knowledge of user preferences
• Bichromatic: a dataset with user preferences is given
Car Database Example




• A database containing information about different cars
• Different users have different preferences
• Bob prefers a cheap car, and does not care much about the age
  → the best choice (top-1) for Bob is the car p1 with score 2.5
• Tom prefers a newer car rather than a cheap car
  → the best choice for Tom and Max is the car p2
Car Database Example
Query point q=p2, k=1:
• Bichromatic reverse top-k: {(0.2,0.8), (0.5,0.5)}
  → advertise product to Tom and Max
• Monochromatic reverse top-k: line segment w[price]=[1/7,5/6]
  → estimate the impact of p2 as 69%
Query point q=p3, k=1: empty result set for the bichromatic query
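The 69% figure appears to be the share of the weight range covered by the segment. The check below assumes that the two weights sum to 1, so that w[price] ranges over [0, 1], and that impact is measured as the length of the segment relative to that range. Both assumptions are an interpretation of the slide.

```python
from fractions import Fraction

# Segment of w[price] for which p2 is the top-1 result (from the slide).
low, high = Fraction(1, 7), Fraction(5, 6)

# Assumed: w[price] ranges over [0, 1], so the full range has length 1.
full_range = Fraction(1)

impact = (high - low) / full_range
impact_percent = round(float(impact) * 100)
```

Here `impact` is exactly 29/42, which rounds to the 69% quoted on the slide.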
Monochromatic Reverse Top-k Query



mRTOPk(q): Given a point q, a positive number k and a dataset S, the result set of the monochromatic reverse top-k query is the locus of weighting vectors wi for which there exists p in TOPk(wi) such that f_wi(q) ≤ f_wi(p).
The solution space W can be split into a finite set of non-adjacent partitions such that the query point q has the same rank for all the weighting vectors within a partition.
For the monochromatic case, we focus on the 2-d space.
[Figure: solution space with the partition mRTOP1(q)]
Geometric Interpretation d=2, k =1

• If q belongs to the convex hull, then there exists exactly one partition in mRTOP1(q)
  • Weighting vectors that are perpendicular to pq and qr define the line segment
  • For weighting vectors with smaller and larger slopes than w1, the relative order of p and q changes
• Monochromatic reverse top-k, k>1:
  • The solution space may contain more than 1 partition
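The 2-d geometry can be explored with a small sweep over the weight space. This is a sketch under stated assumptions, not the method from the talk: weighting vectors are parameterized as w = (t, 1 - t) with t in [0, 1], and the function name `mrtop_k_2d` and the sample points are hypothetical. The rank of q can only change at values of t where some data point ties with q, so it suffices to test one weighting vector inside each interval between consecutive tie points.

```python
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]
Interval = Tuple[float, float]


def mrtop_k_2d(points: Sequence[Point2D], q: Point2D, k: int) -> List[Interval]:
    """Intervals of t in [0, 1] such that q is in the top-k for w = (t, 1 - t)."""
    # Breakpoints: values of t where some p ties with q, so the rank of q may change.
    cuts = {0.0, 1.0}
    for p in points:
        dx, dy = p[0] - q[0], p[1] - q[1]
        if dx != dy:                     # otherwise the score gap is constant in t
            t = -dy / (dx - dy)          # solves t*dx + (1 - t)*dy = 0
            if 0.0 < t < 1.0:
                cuts.add(t)
    ordered = sorted(cuts)

    result: List[Interval] = []
    for lo, hi in zip(ordered, ordered[1:]):
        mid = (lo + hi) / 2.0
        # Count the points that score strictly lower (better) than q at the midpoint.
        better = sum(
            1 for p in points
            if mid * (p[0] - q[0]) + (1.0 - mid) * (p[1] - q[1]) < 0
        )
        if better < k:                   # fewer than k points beat q
            if result and result[-1][1] == lo:
                result[-1] = (result[-1][0], hi)   # merge adjacent intervals
            else:
                result.append((lo, hi))
    return result


# Hypothetical data: q lies on the convex hull between p and r.
p, r, q = (1.0, 4.0), (4.0, 1.0), (2.0, 2.0)
segments_k1 = mrtop_k_2d([p, r], q, k=1)
segments_k2 = mrtop_k_2d([p, r], q, k=2)
```

For k=1 this sample yields a single segment, roughly t in [1/3, 2/3], bounded by the two vectors at which q ties with p and with r. For k=2 the whole range [0, 1] qualifies.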
Bichromatic Reverse Top-k Query

bRTOPk(q): Given a point q, a positive number k and two datasets S and W, where S represents data points and W is a dataset containing different weighting vectors, a weighting vector wi belongs to the result set if and only if there exists p in TOPk(wi) such that f_wi(q) ≤ f_wi(p)
• Naïve approach:
  • for each weighting vector, process the top-k query
  • test if query point q is in the top-k list
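The naïve approach above can be sketched as follows. The names `score` and `naive_brtop_k` and the sample data are hypothetical, and the sketch assumes that S holds at least k points. The membership test follows the bRTOPk(q) definition: q qualifies for w when its score is no worse than the k-th best score in S.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def score(w: Sequence[float], p: Sequence[float]) -> float:
    """Linear score f_w(p); lower is better."""
    return sum(wi * pi for wi, pi in zip(w, p))


def naive_brtop_k(S: Sequence[Vector], W: Sequence[Vector],
                  q: Vector, k: int) -> List[Vector]:
    """Evaluate one top-k query per weighting vector and test q against it."""
    result = []
    for w in W:
        kth_best = sorted(score(w, p) for p in S)[k - 1]   # one top-k evaluation
        if score(w, q) <= kth_best:                        # q makes the top-k list
            result.append(w)
    return result


# Hypothetical data points, weighting vectors and query point.
S = [(1.0, 4.0), (4.0, 1.0), (3.0, 3.0)]
W = [(0.5, 0.5), (0.9, 0.1), (0.1, 0.9)]
q = (2.0, 2.0)
top1_vectors = naive_brtop_k(S, W, q, k=1)
top2_vectors = naive_brtop_k(S, W, q, k=2)
```

The cost is always |W| top-k evaluations, regardless of how many vectors end up in the result.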
Threshold-based Algorithm (RTA)

Goal:
• reduce the number of top-k evaluations by discarding weighting vectors
Threshold-based Algorithm (RTA):
• sort the weighting vectors based on pairwise similarity
  • top-k queries defined by similar vectors have similar result sets
• evaluate the first top-k query, calculate a threshold
• for each weighting vector
  • possibly prune based on threshold
  • refine threshold
Example of RTA Algorithm (k=2)
W=[ w1, w2, w3 ]
Buffer: p1, p2
• Evaluate top-2 query for w1
• Set threshold based on w2
• f_w2(q) > threshold → discard w2
• Refine threshold for w3
[Figure: data points p1 to p10, query point q and the weighting vectors w1, w2, w3]
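The steps of the example can be sketched as a loop that keeps the most recent top-k result as a buffer. This is a simplified reading of the slides, not the authors' code: the names are hypothetical, W is assumed to arrive already ordered so that similar vectors are adjacent, and S is assumed to hold at least k points. The threshold for a vector w is the worst score that any buffered point gets under w. If q scores worse than that, k points already beat q and w can be discarded without a top-k evaluation.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def score(w: Sequence[float], p: Sequence[float]) -> float:
    """Linear score f_w(p); lower is better."""
    return sum(wi * pi for wi, pi in zip(w, p))


def top_k(S: Sequence[Vector], w: Vector, k: int) -> List[Vector]:
    """The k points of S with the minimum score under w, best first."""
    return sorted(S, key=lambda p: score(w, p))[:k]


def rta(S: Sequence[Vector], W: Sequence[Vector],
        q: Vector, k: int) -> Tuple[List[Vector], int]:
    """Return (reverse top-k result, number of top-k evaluations performed)."""
    result: List[Vector] = []
    buffer: List[Vector] = []      # top-k points of the last evaluated query
    evaluations = 0
    for w in W:                    # assumed: similar vectors are adjacent in W
        if buffer:
            threshold = max(score(w, p) for p in buffer)
            if score(w, q) > threshold:
                continue           # k buffered points beat q under w: discard w
        buffer = top_k(S, w, k)    # evaluate the query and refresh the buffer
        evaluations += 1
        if score(w, q) <= score(w, buffer[-1]):
            result.append(w)
    return result, evaluations


# Hypothetical data; W is listed in an order where neighbours are similar.
S = [(1.0, 4.0), (4.0, 1.0), (3.0, 3.0)]
W = [(0.9, 0.1), (0.8, 0.2), (0.5, 0.5)]
q = (2.0, 2.0)
vectors, evaluations = rta(S, W, q, k=1)
```

In this sample the second vector is pruned by the threshold derived from the first query, so only two of the three top-k queries are evaluated.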
Materialized Views

Threshold-based Algorithm (RTA)
• reduces the top-k evaluations by discarding some weighting vectors that are not in the reverse top-k result set
• processes at least as many top-k evaluations as the cardinality of the result set
Materialized Views
• find weighting vectors that definitely belong to the result without top-k evaluation
Materialized Views

• Grid-based space partitioning
  • cell Ci
  • lower left corner CiL
  • upper right corner CiU
• We store for each cell Ci the results of reverse top-k queries for corners CiL and CiU
[Figure: grid cell Ci with its corners, annotated with the weighting vectors w1, w2, w3]
Materialized Views

• Given a point q enclosed in cell Ci
  • all weighting vectors in RTOPk(CiU) belong to the result set of q
  • only weighting vectors in RTOPk(CiL) - RTOPk(CiU) have to be examined
• Materialized views can be generalized for arbitrary k<K values
[Figure: cell corners annotated with the sets w1, w2, w3 and w1, w2, w3, w4]
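The use of the stored corner results can be sketched as set arithmetic. The reasoning, assuming non-negative weights and lower scores being better, is that CiL never scores worse than q and CiU never scores better, so RTOPk(CiU) is contained in RTOPk(q), which in turn is contained in RTOPk(CiL). The function name, the callback `q_in_top_k` and the sample vectors are hypothetical.

```python
from typing import Callable, List, Set, Tuple

Vector = Tuple[float, ...]


def rtop_k_from_view(
    rtop_lower: Set[Vector],
    rtop_upper: Set[Vector],
    q_in_top_k: Callable[[Vector], bool],
) -> Set[Vector]:
    """Answer RTOPk(q) for a point q inside a cell with stored corner results."""
    definite = set(rtop_upper)              # in the result without any top-k evaluation
    candidates = rtop_lower - rtop_upper    # the only vectors that need a check
    return definite | {w for w in candidates if q_in_top_k(w)}


# Hypothetical stored results for the two corners of one cell.
w1, w2, w3, w4 = (0.2, 0.8), (0.4, 0.6), (0.6, 0.4), (0.8, 0.2)
examined: List[Vector] = []


def check(w: Vector) -> bool:
    examined.append(w)     # stand-in for a real top-k evaluation
    return False           # assume q misses the top-k for this vector


answer = rtop_k_from_view({w1, w2, w3, w4}, {w1, w2, w3}, check)
```

Only w4 is passed to the check in this sample. The other three vectors are reported directly from the stored view.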
Experimental Setup

• Comparison between Naïve and RTA (varying dimensionality, cardinality, data distribution – real data)
• Queries: uniform and k-skyband points
• Metrics:
  • time
  • I/Os
  • number of top-k evaluations
RTA vs. Naïve
uniform distribution of S and uniform weights W
|S|=10K, |W|=10K, top-k=10, skyband query points
• RTA outperforms Naïve by 1 to 2 orders of magnitude
• as dimensionality increases, |RTOPk(q)| decreases, leading to fewer top-k evaluations
Scalability of RTA Algorithm
various distributions (UN, AC, CO) of S and uniform weights W
|S|=10K or |W|=10K, d=5, top-k=10, skyband query points
• Naïve requires |W| top-k query evaluations
• |W|=5K, correlated dataset:
  • RTA needs only 544 out of 5000 top-k evaluations (saves 89.12% of the cost)
  • the average size of the result set is 459
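The quoted saving can be reproduced from the two counts on the slide. The gap to the result-set size is an added consistency check against the lower bound stated earlier (RTA evaluates at least as many top-k queries as there are vectors in the result). The variable names are hypothetical.

```python
# Figures quoted on the slide for |W|=5K on the correlated dataset.
evaluations_rta = 544
evaluations_naive = 5000
average_result_size = 459

saving = 1 - evaluations_rta / evaluations_naive      # fraction of evaluations avoided
extra_evaluations = evaluations_rta - average_result_size
```

The saving comes out at 0.8912, matching the 89.12% on the slide, and RTA performs 85 evaluations beyond the lower bound set by the result-set size.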
Performance of RTA on Real Data
NBA consists of 17265 tuples, d=5 (number of points scored, rebounds, assists, steals and blocks)
HOUSE consists of 127930 tuples, d=6 (income spent on gas, electricity, water, heating, insurance, and property tax)
• uniform and clustered weights W (|W|=10K)
• clustered weights lead to fewer top-k evaluations
Conclusions and Future Work

• We introduced reverse top-k queries
  • geometric interpretation of the solution space
  • efficient algorithm for bichromatic reverse top-k query
  • materialized reverse top-k views
• Future Work
  • interpretation of solution space for higher dimensions (monochromatic reverse top-k)
  • improve the performance of the bichromatic reverse top-k computation
Thank you!
Related work:
Akrivi Vlachou, Christos Doulkeridis, Yannis Kotidis, Kjetil Nørvåg: "Reverse Top-k Queries"
Akrivi Vlachou, Christos Doulkeridis, Kjetil Nørvåg, Yannis Kotidis: "Identifying the Most Influential Data Objects with Reverse Top-k Queries"
More information: http://www.idi.ntnu.no/~vlachou/