A Unified Approach for Computing Top-k Pairs in
Multidimensional Space
Presented By: Muhammad Aamir Cheema1
Joint work with
Xuemin Lin1, Haixun Wang2, Jianmin Wang3, Wenjie Zhang1
1 University of New South Wales, Australia
2 Microsoft Research Asia
3 Tsinghua University, China
Introduction
Top-k Pairs Query:
Given a scoring function f() that computes the score of
a pair of objects, return k pairs of objects with smallest
scores.
Examples:
• k-closest pairs: f(ou,ov) = dist(ou,ov)
• k-furthest pairs: f(ou,ov) = -dist(ou,ov)
• f(ou,ov) = (ou.x + ov.x) + (ou.y + ov.y)
[Figure: objects o1 to o5 plotted on x-y axes, with the answer pair for k=1 highlighted for each example.]
Related Work
K-Closest Pairs Queries
• Computational geometry [M Smid, Handbook on Comp. Geometry]
• Database community
[Hjaltason et al., SIGMOD 1998]
[Corral et al., SIGMOD 2000]
[Yang et al., IDEAS 2002]
[Shan et al., SSTD 2003]
K-Furthest Pairs Queries
[Supowit, SODA 1990]
[Katoh et al., IJCGA 1995]
[Corral et al., DKE 2004]
Top-k Queries
• Fagin’s Algorithm [Fagin, PODS 1996]
• Threshold Algorithm [Fagin, JCSS 1999], [Nepal et al., ICDE 1999], [Güntzer et al., VLDB 2000]
• No Random Access Algorithm [Fagin, JCSS 1999], [Mamoulis et al., TODS 2007]
Motivation
• No existing work for more general queries
• Other Lp distances (e.g., Manhattan distance)
• More general scoring functions
• Chromatic queries
SELECT a.id , b.id FROM AGENT a, AGENT b
WHERE a.id < b.id AND a.manager <> b.manager
ORDER BY |a.sold – b.sold| - |a.salary – b.salary|
LIMIT k;
• No existing unified algorithm
• One framework that answers a broad class of top-k pairs queries
Problem Definition (Preliminaries)
• Monotonic function
f() is monotonic if f(x1,…,xN) ≤ f(y1,…,yN) whenever xi ≤ yi for every 1 ≤ i ≤ N
Examples:
f(x1,…,xN) = x1 + x2 + … + xN (summation)
f(x1,…,xN) = (x1 + x2 + … + xN) / N (average)
Problem Definition (Preliminaries)
• Loose monotonic function
s() takes two parameters and is loose monotonic if both of the following hold for every fixed value x:
1. for every y > x, s(x,y) either monotonically increases or monotonically decreases as y increases
2. for every y < x, s(x,y) either monotonically increases or monotonically decreases as y decreases
Loose monotonic functions are more general than monotonic functions.
[Figure: number line from -∞ to ∞ showing how s1(x,y) = |x – y| and s2(x,y) = (x + y) change as y moves away from a fixed x.]
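The two conditions of the loose monotonic definition can be checked numerically on a finite grid of sample values. The sketch below is only an illustration: the helper names (`is_monotone`, `is_loose_monotonic`), the sample grid, and the counterexample `bump` are assumptions made for this example, and "monotonically" is read here as non-strict. Passing the check on a grid is evidence, not a proof.

```python
from typing import Callable, Sequence


def is_monotone(values: Sequence[float]) -> bool:
    """True if the sequence is entirely non-decreasing or entirely non-increasing."""
    steps = list(zip(values, values[1:]))
    return all(a <= b for a, b in steps) or all(a >= b for a, b in steps)


def is_loose_monotonic(s: Callable[[float, float], float],
                       samples: Sequence[float]) -> bool:
    """Check both conditions of the definition for every fixed x in samples."""
    ordered = sorted(samples)
    for x in ordered:
        # Condition 1: y > x, with y increasing.
        right = [s(x, y) for y in ordered if y > x]
        # Condition 2: y < x, with y decreasing.
        left = [s(x, y) for y in reversed(ordered) if y < x]
        if not (is_monotone(right) and is_monotone(left)):
            return False
    return True


def abs_diff(x: float, y: float) -> float:
    return abs(x - y)            # s1 from the slide


def total(x: float, y: float) -> float:
    return x + y                 # s2 from the slide


def bump(x: float, y: float) -> float:
    return (y - x - 3) ** 2      # hypothetical function that dips and rises again


grid = [-3, -2, 0, 1, 2, 3, 4, 5, 6]
print(is_loose_monotonic(abs_diff, grid))
print(is_loose_monotonic(total, grid))
print(is_loose_monotonic(bump, grid))
```

Both slide examples pass the check, while `bump` fails because for a fixed x its value first decreases and then increases as y grows.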
Problem Definition
• Return k pairs of objects with smallest scores.
SCORE (a,b) = f ( s1(a,b),…,sd(a,b) )
si( ) is called local scoring function and can be any loose monotonic
function of user’s choice.
f( ) is called global scoring function and can be any monotonic
function that involves an arbitrary set of attributes.
s1(a,b) = | a.sold – b.sold |
s2(a,b) = -| a.salary – b.salary |
f( ) = s1(a,b) + s2(a,b)
SELECT a.id , b.id FROM AGENT a, AGENT b
WHERE a.id < b.id
ORDER BY |a.sold – b.sold| - |a.salary – b.salary|
LIMIT k;
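To make the score definition concrete, the following sketch evaluates the example query by brute force: it enumerates every pair with a.id < b.id, computes f( s1, s2 ), and keeps the k smallest. It is a reference baseline, not the algorithm proposed in this talk. The `Agent` type, the sample rows, and the function name `top_k_pairs_naive` are assumptions made for illustration.

```python
import heapq
from itertools import combinations
from typing import List, NamedTuple, Tuple


class Agent(NamedTuple):
    id: int
    sold: int
    salary: int


def s1(a: Agent, b: Agent) -> int:
    return abs(a.sold - b.sold)          # |a.sold - b.sold|


def s2(a: Agent, b: Agent) -> int:
    return -abs(a.salary - b.salary)     # -|a.salary - b.salary|


def score(a: Agent, b: Agent) -> int:
    return s1(a, b) + s2(a, b)           # global scoring function f = sum


def top_k_pairs_naive(objects: List[Agent], k: int) -> List[Tuple[int, int, int]]:
    """Return the k smallest (score, a.id, b.id) over all pairs with a.id < b.id."""
    ordered = sorted(objects, key=lambda o: o.id)
    scored = ((score(a, b), a.id, b.id) for a, b in combinations(ordered, 2))
    return heapq.nsmallest(k, scored)


# Hypothetical sample rows, not taken from the slides.
agents = [Agent(1, 10, 50), Agent(2, 12, 90), Agent(3, 30, 55), Agent(4, 11, 52)]
print(top_k_pairs_naive(agents, 2))
```

This baseline examines all O(N^2) pairs, which is the cost the rest of the talk aims to avoid.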
Problem Definition
• Return k pairs of objects with smallest scores among
the valid pairs.
Let each object be assigned a color.
Chromatic Queries:
Homochromatic Queries: pairs containing objects of same color
Heterochromatic Queries: pairs containing objects of different colors
Homochromatic example (same manager):
SELECT a.id , b.id FROM AGENT a, AGENT b
WHERE a.id < b.id AND a.manager = b.manager
ORDER BY |a.sold – b.sold| - |a.salary – b.salary|
LIMIT k;
Heterochromatic example (different managers):
SELECT a.id , b.id FROM AGENT a, AGENT b
WHERE a.id < b.id AND a.manager <> b.manager
ORDER BY |a.sold – b.sold| - |a.salary – b.salary|
LIMIT k;
Contributions
Unified algorithm (internal and external)
• k-closest pairs, k-furthest pairs and variants (any Lp distance)
• queries involving any arbitrary subset of attributes
• chromatic and non-chromatic queries
• skyline pairs queries and rank based top-k pairs queries
No pre-built indexes required
• efficiently builds a simple data structure on-the-fly
• can answer queries involving filtering conditions on objects
SELECT a.id , b.id FROM AGENT a, AGENT b
WHERE a.id < b.id AND a.age > 40 AND b.age > 40
ORDER BY |a.sold – b.sold| - |a.salary – b.salary|
LIMIT k;
Known memory requirement
• existing R-tree based approaches may require arbitrarily large heaps
• our algorithm requires O(k) space + 2d buffer pages
Efficient
• Theoretically: optimal for d ≤ 2
• Experimentally
Framework
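[Figure: d sources, one per local scoring function s1(a,b), s2(a,b), …, sd(a,b). Each source lists pairs with their local scores, best first, e.g. (o1,o2) 1, (o3,o4) 2, (o1,o4) 5 for sd(a,b). A top-k algorithm (e.g., FA, TA, NRA) reads the sources and combines the local scores with f( s1(a,b), s2(a,b), …, sd(a,b) ).]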
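The way a top-k algorithm consumes the d sources can be sketched with a minimal Threshold Algorithm variant for smallest scores. This is an illustration under simplifying assumptions, not the implementation from the talk: every source is fully materialized as a sorted list containing every pair, random access is a dictionary lookup, and the function name `threshold_top_k` and the sample objects are hypothetical.

```python
import heapq
from itertools import combinations


def threshold_top_k(sources, f, k):
    """sources: d lists of (pair, local_score), each sorted by ascending score
    and each containing every pair. f: monotonic function over a list of d
    local scores. Returns the k smallest (score, pair) tuples."""
    lookup = [dict(src) for src in sources]      # random access to local scores
    seen = {}                                    # pair -> global score
    for row in zip(*sources):                    # one sorted access per source
        for pair, _ in row:
            if pair not in seen:
                seen[pair] = f([table[pair] for table in lookup])
        # No unseen pair can score below f of the scores at the current depth.
        threshold = f([local for _, local in row])
        best = heapq.nsmallest(k, ((sc, p) for p, sc in seen.items()))
        if len(best) == k and best[-1][0] <= threshold:
            return best
    return heapq.nsmallest(k, ((sc, p) for p, sc in seen.items()))


# Hypothetical objects with two attributes each.
objs = {"o1": (6, 3), "o2": (12, 1), "o3": (14, 7), "o4": (15, 2)}
pairs = list(combinations(sorted(objs), 2))


def abs_diff_attr0(p):
    return abs(objs[p[0]][0] - objs[p[1]][0])    # local score on attribute 0


def sum_attr1(p):
    return objs[p[0]][1] + objs[p[1]][1]         # local score on attribute 1


sources = [
    sorted(((p, abs_diff_attr0(p)) for p in pairs), key=lambda e: e[1]),
    sorted(((p, sum_attr1(p)) for p in pairs), key=lambda e: e[1]),
]
print(threshold_top_k(sources, sum, 1))
```

Here the sources are built by sorting all pairs, which is exactly the expensive step; the remaining slides address how to produce each source incrementally instead.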
How to efficiently create and maintain these sources???
Creating/maintaining sources
Naïve approach
• Create all possible pairs: O(N^2)
• Sort them according to their local scores: O(N^2 log N)
• Space requirement: O(N^2)
Features of our approach
• Optimal internal memory algorithm
• requires O(N) space
• returns first pair in O(N log N)
• each next best pair is returned in O( log N)
• Optimal external memory algorithm
• B = number of elements that can be stored in one disk page
• M = used internal memory (minimum M = 2B)
• returns first pair in O((N/B) log_{M/B} (N/B))
• each next best pair is returned in O(log_{M/B} (N/B))
Creating/maintaining sources
Initialize
• sort the objects
• for each object ou
• create its best pair (ou,ov)
• insert (ou,ov) in heap
getNextPair()
• report the top pair (ou,ov) of heap
• create next best pair of ou
• enheap the new pair and delete (ou,ov)
[Figure: worked example for s(x,y) = |x – y| on sorted objects o1 to o6 with values 6, 12, 14, 15, 20, 30, showing the heap of candidate pairs as pairs are reported.]
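The Initialize and getNextPair() steps can be sketched for the local scoring function s(x,y) = |x – y|. The sketch assumes that each object ou is paired only with objects to its right in sorted order, so its best pair is its right neighbour and its next best pair is one position further right; that pairing rule and the name `pairs_by_abs_diff` are assumptions made for this example. The heap never holds more than one pair per object, so it uses O(N) space.

```python
import heapq
from itertools import islice
from typing import Iterator, List, Tuple


def pairs_by_abs_diff(values: List[float]) -> Iterator[Tuple[float, int, int]]:
    """Yield (score, u, v) with u < v in ascending order of |values[u] - values[v]|.
    u and v are positions in the sorted list of values."""
    vals = sorted(values)                        # Initialize: sort the objects
    n = len(vals)
    # Best pair of each object: its right neighbour in sorted order.
    heap = [(vals[u + 1] - vals[u], u, u + 1) for u in range(n - 1)]
    heapq.heapify(heap)
    while heap:
        score, u, v = heapq.heappop(heap)        # report and delete the top pair
        yield score, u, v
        if v + 1 < n:                            # create the next best pair of ou
            heapq.heappush(heap, (vals[v + 1] - vals[u], u, v + 1))


values = [6, 12, 14, 15, 20, 30]                 # o1 to o6 from the slide
for score, u, v in islice(pairs_by_abs_diff(values), 4):
    print(f"(o{u + 1},o{v + 1}) {score}")
```

On the slide's values the first four pairs reported are (o3,o4) 1, (o2,o3) 2, (o2,o4) 3 and (o4,o5) 5. Because the generator is lazy, a caller that needs only a few pairs never materializes the remaining ones.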
Homochromatic Queries
[Figure: objects o1 to o6 in sorted order with values 6, 12, 14, 15, 20, 30.]
Heterochromatic Queries
• Let (ou,ov) be the pair
• ox = the object next to ov
• If ou and ox have different color
•(ou,ox) is the next best pair
• else
•oy = the adjacent object of ox
• (ou,oy) is the next best pair
[Figure: objects o1 to o6 in sorted order with values 6, 12, 14, 15, 20, 30.]
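The heterochromatic rule for choosing the next best pair can be sketched by extending a heap-based pair generator for s(x,y) = |x – y| with a color check. As a simplifying assumption, this sketch scans to the right until it finds an object whose color differs from ou, rather than inspecting only ox and oy as on the slide, so a single step may cost more than O(log N). The function names and the color assignment in the example are hypothetical.

```python
import heapq
from typing import Iterator, List, Optional, Tuple


def next_partner(u: int, start: int, colors: List[str]) -> Optional[int]:
    """First position >= start whose color differs from that of object u."""
    for v in range(start, len(colors)):
        if colors[v] != colors[u]:
            return v
    return None


def hetero_pairs_by_abs_diff(
        objects: List[Tuple[float, str]]) -> Iterator[Tuple[float, int, int]]:
    """objects: (value, color) tuples. Yields (score, u, v) for pairs of
    different colors, ascending by |value difference|; u and v are positions
    in the list sorted by value."""
    objs = sorted(objects)
    vals = [value for value, _ in objs]
    colors = [color for _, color in objs]
    heap = []
    for u in range(len(objs)):                   # best valid pair of each object
        v = next_partner(u, u + 1, colors)
        if v is not None:
            heap.append((vals[v] - vals[u], u, v))
    heapq.heapify(heap)
    while heap:
        score, u, v = heapq.heappop(heap)
        yield score, u, v
        nxt = next_partner(u, v + 1, colors)     # skip objects colored like ou
        if nxt is not None:
            heapq.heappush(heap, (vals[nxt] - vals[u], u, nxt))


# Values from the slide; the colors are made up for this example.
colored = [(6, "red"), (12, "red"), (14, "blue"),
           (15, "red"), (20, "blue"), (30, "blue")]
for score, u, v in hetero_pairs_by_abs_diff(colored):
    print(f"(o{u + 1},o{v + 1}) {score}")
```

A homochromatic variant would follow the same structure with the color test inverted.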
Experiments
K-closest pairs queries [Corral et. al, SIGMOD 2000]
• Data size: two datasets, each containing 100K objects
• k: 10
Experiments
• Naive: join the dataset with itself using nested loop (block nested loop for
external memory algorithm)
• Scoring function:
• Local scoring function is either sum or absolute difference (chosen
randomly)
• Global scoring function is weighted aggregate (weights are chosen
randomly and negative weights are allowed)
[Figures: results plotted against number of objects, number of attributes (d), value of k, and number of colors.]
Thanks
Complexity
Internal memory algorithm =
External memory algorithm =
d = number of local scoring functions involved
N = total number of objects
V = total number of valid pairs (N^2 at most)
M = internal memory used by the algorithm
B = the number of entries one disk page can store