ICDE2012_TopKPairs - School of Computer Science and

advertisement
Efficiently Monitoring Top-k
Pairs over Sliding Windows
Computer Science and Engineering
Presented By: Zhitao Shen1
Joint work with
Muhammad Aamir Cheema1, Xuemin Lin21, Wenjie Zhang1, Haixun Wang3
1The
University of New South Wales, Australia
2 East China Normal University
3 Microsoft Research Asia
Introduction
Top-k Pairs Query:
• Given a scoring function score() that computes the score of a pair of
objects, return k pairs of objects with the smallest scores.
Examples:
• k closest pairs queries
• k furthest pairs queries
Top-k Pairs against sliding windows
• Given a data stream, return top-k pairs among the most recent N objects.
Applications
• Wireless sensor network, stock market, traffic monitoring and transaction
monitoring
2
Motivation
No existing work for general pairs queries over sliding windows
Select a.id, b.id from trans a, trans b
Support arbitrary scoring functions.
where a.id <> b.id and a.account = b.account
order by |a.time - b.time| - dist(a.loc, b.loc)
limit k
Example:
window [24 hours]
Fraud detection over transaction streams
– Query the transaction pairs that have small time difference but the
locations are far away.
203-13845
10:15:20
New York
$1000
203-13845
3
10:18:10
L.A.
$1000
Problem Definitions (Preliminaries)
Sliding Windows
– A sliding window contains most recent N objects of
the data stream.
– The number of pairs is N(N – 1) / 2
The age of a pair
depends on the
older object.
older
. . . . .
o7
o6
Age of an object: 5
o5
o4
o3
o2
o01
4 Sliding
3 window
2
1
of size0 5
Lower bound runtime cost : O(N) for each new object
Lower bound storage cost : O(N)
4
newer
Contributions
Unified framework
• First to study top-k pairs queries over sliding windows.
• Support arbitrarily complex scoring functions
• Support efficient queries for any window size n ≤ N and any k ≤ K
Lower bound
5
Expected cost for our algorithms
Storage requirement
O(N)
O(N) + O(K log(N/K)) for each
scoring function
Skyband maintenance
cost for each object
O(N)
O(N (log (log N) + log K))
Answering top-k pairs
O(k)
O(log(log n) + log K + k)
Preliminaries
o4
Score
Map all the pairs to an age–score space
Top-2 pairs
Task1 : how we
efficiently
maintain the
K-skyband
Task2 : how we
use the Kskyband to
efficiently
obtain top-k
pairs against
any sliding
window n ≤ N
p6
p1
p9
p8
p2
p7
p4
2
3
o1
p2 dominates p5 because
p2.score < p5.score and p2
expires no later than p5.
Naive: O(N |SKB|) for
checking all N-1 pairs
Our: O(N log|SKB|)
4
Age
K-skyband[Papadias et al., TODS05] keeps the minimum set for the
candidate results.
6
o0
Expected size of skyband is
O(K log(N/K))
p5
1
o2
p1(o0, o1) (p1.age, p1.score)
(1, 3)
p10
p3
o3
Efficient Skyband Maintenance
p5 s1
How can we efficiently
compute the K-staircase
and K-skyband?
Update the K-staircase
and K-skyband in
O(|SKB| log K)),
Score
Can we find a boundary between the
skyband points and non-skyband points?
p1
p2
K-staircase
K-staircase
p7
s2
p6
Check if a pair is
dominated by K-skyband
in O(log |SKB|) time for
each new pair by doing
binary search.
p3
s1
p5
p4
2-skyband
7
s2
Age
p1
Efficient Query Answering
Score
Can we do better for
any sliding window
size n < N?
Use Priority Search
Tree to index the
skyband points
AnyWindow
windowsize
size==Nn < N
p2
p1
p3
p5
p1
p4
p7
p6
p5
p7
2-skyband
p2
3
p6
1
p8
p8
8
6
2
4
p3
p4
9
8
5
Priority Search Tree
Age
Self-balancing tree
Efficient 3-sides range query
Efficient Query Answering
Score
Our contribution: Retrieve
top-k pairs in the 1-sided
range.
An algorithm similar to
post-order traversal
costs O(log|SKB| + k)
Any window size = n < N
p2
p1
p3
p5
p1
p4
p7
p6
p5
p7
2-skyband
p2
3
p6
1
p8
p8
9
6
2
4
p3
p4
5
Priority Search Tree
Age
8
9
What else in the paper?
Efficient continuous queries on the skyband.
• Continuously monitoring the top-k results for any fixed k (k ≤ K) and
n (n ≤ N).
• Amortized O(k/n (log |SKB| + k)) time per update.
Optimization on monotonic scoring functions.
• Handling the k-closest pairs, k-furthest pairs queries.
• Applying Threshold Algorithm on sorted lists
• Improving the number of considered pairs for each new object from
N to (d+1) N d/(d+1) K 1/(d+1)
10
Experimental Settings
Real dataset.
– Sensor data in the Intel research lab
– 2.3 million records.
score(ox , o y ) 
| o x .time- o y .time|
| o x .temp- o y .temp| | o x .humidity- o y .humidity|
Synthetic data.
– Uniform, correlated and anti-correlated distributions.
– 2 million objects
– Closest and furthest pairs in Manhattan distance
11
Experiments (Overall Cost on real data)
SCase: our algorithm using K-staircase to maintain the skyband.
Naïve: maintains kN pairs and sort them on their scores.
LB: shows lower bound cost
Varying K
12
Varying N (in thousands)
Experiments (Query Answering)
Linear: scan the skyband points to find the top-k pairs.
Snapshot: our snapshot query algorithm.
Continuous: our continuous query algorithm.
LB: an algorithm to obtain top-k results in O(k) time.
Varying K
13
Varying |Q| (in thousands)
Conclusion:
• First to study a broad class of top-k pairs queries over
sliding windows.
• We present efficient algorithms and show that the
performance of our algorithm is reasonably close to the
lower bound cost.
• We provide extensive experiment results on both real
and synthetic data sets to show the efficiency and
scalability of the proposed algorithms.
14
Question and Answer
Thank You!
Any Questions?
15
Related Work
Top-k Query Processing
• Fagin’s Algorithm (FA), threshold Algorithm (TA), no-random access
(NRA)
Top-k Pairs Queries Processing
• k-closest pairs queries
• k-furthest pairs queries
• Top-k pairs queries [Cheema et al., ICDE’11]
Data Stream Processing
• Top-k query processing over data stream [Mouratidis et al.,
SIGMOD’06]
• k-nearest neighbour queries [Böhm et al., ICDE’07]
16
Experiments (Skyband Maintenance algorithm)
Basic: maintening algorithm without K-staircase
SCase: our algorithm using K-staircase to maintain the skyband.
TA: Optimized algorithm for monotonic scoring functions.
LB: show lower bound cost
Varying K
17
# of attributes
Download