Online Interval Skyline Queries on Time Series

Online Interval Skyline Queries
on Time Series
From: ICDE 2009
Authors: Bin Jiang, Jian Pei
Speaker: Zhifeng Lin
Date: Nov 28th, 2008
Outline
• What is an online interval skyline query?
• Definition and Notation
• On-the-fly Solution
• Conclusion
Electricity Consumption Diagram
Interval Query
An analyst asks:
“In the week of June 16-22, which regions have high power consumption?”
Region 1: Interesting
Highest daily consumption on June 20th
Region 2: Interesting
Highest average consumption in this query interval (June 16-22)
Region 3: Not Interesting
Lower than region 2 every day
Interval Skyline Query
Technically, a time series s is interesting if, in the query interval, there does not exist another time series s’ such that
(1) s’ is better than s on at least one timestamp, and
(2) s’ is not worse than s on any timestamp.
Naive Approach
time series: (t0,v0), (t1,v1), …, (tn,vn)
timestamp => dimension
point: (v0, v1, …, vn)
skyline computation
find skyline points (interesting series)
Bottleneck
(1) Many skyline algorithms cannot be applied to online query answering.
(2) For overlapping query intervals, we cannot share computation.
(3) High-dimensional skyline queries are not effective.
Interval Skyline Query
• Time series s is said to dominate time series q in interval [i:j], denoted by s ≻[i:j] q, if
  ∀k ∈ [i:j], s[k] ≥ q[k]; and
  ∃l ∈ [i:j], s[l] > q[l].
• Given a set S of time series and interval [i:j], the interval skyline is the set of time series that are not dominated by any other series in [i:j], denoted by Sky[i:j].
Electricity Consumption Diagram
Skyline = {region1, region2}
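The dominance test and the naive skyline computation can be sketched in a few lines of Python. This is only an illustration: the function names and the three-day consumption values are made up for this sketch (the slide's diagram data is not available), and positions are 0-based list indices rather than calendar dates.

```python
def dominates(s, q, i, j):
    """Return True if s dominates q on list positions i..j (inclusive)."""
    span = range(i, j + 1)
    # s must be at least as good everywhere and strictly better somewhere.
    return (all(s[k] >= q[k] for k in span)
            and any(s[k] > q[k] for k in span))


def naive_interval_skyline(series, i, j):
    """Names of all series not dominated by any other series in [i:j]."""
    return {name for name, s in series.items()
            if not any(dominates(q, s, i, j)
                       for other, q in series.items() if other != name)}


# Hypothetical data: region1 has the highest single day, region2 the
# highest average, region3 is below region2 on every day.
regions = {
    "region1": [5, 9, 4],
    "region2": [7, 7, 7],
    "region3": [6, 6, 6],
}
print(sorted(naive_interval_skyline(regions, 0, 2)))
```

Every pair of series is compared, so the cost grows quadratically with the number of series and linearly with the interval length, which is what the later slides try to avoid.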
Problem Definition
Notation           Meaning
s, q               time series
[i:j] (i ≤ j)      an interval
Sky[i:j]           the skyline in interval [i:j]
tc                 the most recent timestamp
w                  the size of the base interval
n                  the number of time series
W=[tc-w+1:tc]      the base interval of time series
s.max              the maximum value of s in the base interval W
s.min[i:j]         the minimum value of s in interval [i:j] in W
Problem:
Given a set of time series S such that each time series is in the base interval W=[tc-w+1:tc], we want to maintain a data structure D such that any interval skyline query in an interval [i:j] ⊆ W can be answered efficiently using D.
On-the-fly (OTF) Method
• Interval Skyline Query Answering Algorithm
• Online Interval Skyline Query Answering
Example data: time series s1, s2, s3, s4, s5 [value table]
Interval Skyline Query (OTF) Answering Algorithm
• Lemma 1 (max-min)
For two time series s, q and interval [i:j] ⊆ W, if s.min[i:j] > q.max, then s dominates q in [i:j], because every value of s in [i:j] is at least s.min[i:j], which exceeds every value of q in W.
s.max reflects the capability of s not being dominated by some other time series.
Basic Intention:
1. Store s.max and s.min for each time series to compute the interval skyline at query time.
2. Avoid unnecessary dominance checks, since they are costly.
Interval Skyline Query (OTF) Answering Algorithm
Sort the time series into a list L in descending order of s.max (the “capability” of not being dominated).
maxmin records the largest s.min[i:j] seen so far (the maximum “ability” of dominating other time series).
If maxmin > s.max, s and all remaining time series in L must be dominated by some time series in Sky, because:
(1) L is in descending order of s.max
(2) Lemma 1
Example
w = 3, W = [1:3]
Compute the skyline in interval [2:3]
Time series: s1, s2, s3, s4, s5 [value table]
Sky = {}, maxmin = -∞
check s2, maxmin=1, none can dominate s2, add s2
Sky={s2}
check s3, maxmin=2, none can dominate s3, add s3
Sky={s2, s3}
check s5, maxmin=4, none can dominate s5, add s5
Sky={s2, s3, s5}
check s1, maxmin=4, s1 is dominated by s5, discard
Sky={s2, s3, s5}
check s4, maxmin=4 > s4.max, discard
Result:
Sky = {s2, s3, s5}
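The OTF query procedure traced above can be sketched as follows. This is a minimal illustration, not the authors' code. The function names are invented, s.max and s.min[i:j] are recomputed with plain `max`/`min` instead of being read from the maintained structures, and the values of s2 to s5 are hypothetical, chosen only to be consistent with the trace (s1 = 4, 3, 2 comes from the max-sketch example later in the slides). Removing skyline members that a later series dominates is a defensive assumption of this sketch; it is not visible in the trace.

```python
def dominates(s, q, i, j):
    """True if s dominates q on timestamps i..j (timestamp t is list index t-1)."""
    span = range(i - 1, j)
    return (all(s[k] >= q[k] for k in span)
            and any(s[k] > q[k] for k in span))


def otf_interval_skyline(series, i, j):
    """Interval skyline over [i:j] with max-min pruning (Lemma 1)."""
    # L: series names in descending order of s.max over the base interval.
    order = sorted(series, key=lambda name: max(series[name]), reverse=True)
    sky = []
    maxmin = float("-inf")
    for name in order:
        s = series[name]
        if maxmin > max(s):
            break  # s and everything after it in L is dominated (Lemma 1)
        if not any(dominates(series[q], s, i, j) for q in sky):
            # Assumption: drop members dominated by s inside [i:j].
            sky = [q for q in sky if not dominates(s, series[q], i, j)]
            sky.append(name)
        maxmin = max(maxmin, min(s[i - 1:j]))
    return sky


# Hypothetical values for W = [1:3]; only s1 is taken from the slides.
data = {
    "s1": [4, 3, 2],
    "s2": [1, 7, 1],
    "s3": [3, 2, 6],
    "s4": [1, 2, 3],
    "s5": [1, 5, 4],
}
print(otf_interval_skyline(data, 2, 3))
```

With these values the loop visits s2, s3, s5, s1 and stops at s4, matching the order of the checks in the example.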
Bottleneck for Online Extension
• Data dependence in the algorithm
--- s.min[i:j]
--- s.max for all time series
• As time elapses, the base interval moves (sliding window)
--- check s.min[i:j]
(costly? Time: O(nw) Space: O(nw))
--- check s.max
(costly? Time: O(nw) Space: O(nw))
Online Interval Skyline Query Answering: how to maintain min[i:j]
• min[i:j]
--- search [i:j] in the timestamp dimension
(binary search tree)
--- return the min value
(min heap)
Therefore, the authors use a
Radix Priority Search Tree
(a hybrid of a heap on one dimension and a binary search tree on the other dimension)
Radix Priority Search Tree
• Points (x,y)
• All x values are different, integral, and in the
range [0, k – 1].
• Each node of the priority search tree has exactly
one element.
• The y value of the element in node w is <= the y
value of all elements in the subtree rooted at w
(y values define a min tree).
• Root interval is [0,k).
• Interval for node w is [a,b).
– Left child interval is [a, floor((a+b)/2)).
– Right child interval is [floor((a+b)/2), b).
Example of RPST
[0,16) (7,1)
  [0,8) (5,8)
    [0,4) (2,12)
    [4,8) (6,9)
  [8,16) (11,5)
Insert
• Start with empty RPST.
• k = 16.
• Root interval is [0,16).
• Insert (5,8).
[0,16) (5,8)
• Insert (6,9).
• (5,8) remains in root, because 8 < 9.
• (6,9) inserted in left subtree, because 6 is in the left child interval.
[0,16) (5,8)
  [0,8) (6,9)
Insert
• Insert (7,1).
• (7,1) goes into the root, because 1 < 8.
• (5,8) inserted in left subtree, because 5 is in the left child interval.
• (5,8) displaces (6,9), because 8 < 9.
• (6,9) inserted in right subtree, because 6 is in the right child interval.
[0,16) (7,1)
  [0,8) (5,8)
    [4,8) (6,9)
Insert
• Insert (11,5).
[0,16) (7,1)
  [0,8) (5,8)
    [4,8) (6,9)
  [8,16) (11,5)
Properties
• Height is O(log k).
• Insert time is O(log k).
Search
• Search time is O(log k).
• Example: Search (1:4)
Delete
• Similar to delete min of min heap.
• Delete time is O(log k).
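The insert, search and delete steps described on these slides can be sketched in Python as below. This is a minimal illustration under stated assumptions: the class and method names are invented, `range_min` takes inclusive x bounds, and `delete` removes a point by its x value by pulling up the smaller child, in the spirit of delete-min on a min heap. Distinct integer x values in [0, k) are assumed, as on the slides.

```python
class Node:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi        # node covers x in [lo, hi)
        self.point = None                # (x, y) or None
        self.left = self.right = None


class RPST:
    def __init__(self, k):
        self.root = Node(0, k)

    @staticmethod
    def _child(node, x, create=False):
        mid = (node.lo + node.hi) // 2
        if x < mid:
            if node.left is None and create:
                node.left = Node(node.lo, mid)
            return node.left
        if node.right is None and create:
            node.right = Node(mid, node.hi)
        return node.right

    def insert(self, x, y):
        node, p = self.root, (x, y)
        while node.point is not None:
            if p[1] < node.point[1]:     # smaller y stays higher up
                node.point, p = p, node.point
            node = self._child(node, p[0], create=True)
        node.point = p

    def range_min(self, lo, hi):
        """Smallest y among points with lo <= x <= hi, or None."""
        def visit(node):
            if (node is None or node.point is None
                    or node.hi <= lo or node.lo > hi):
                return None
            x, y = node.point
            if lo <= x <= hi:
                return y                 # heap order: nothing below is smaller
            found = [v for v in (visit(node.left), visit(node.right))
                     if v is not None]
            return min(found) if found else None
        return visit(self.root)

    def delete(self, x):
        node = self.root
        while node is not None and node.point is not None and node.point[0] != x:
            node = self._child(node, x)
        if node is None or node.point is None:
            return False
        while True:                      # pull up the smaller child repeatedly
            kids = [c for c in (node.left, node.right)
                    if c is not None and c.point is not None]
            if not kids:
                node.point = None
                return True
            best = min(kids, key=lambda c: c.point[1])
            node.point, node = best.point, best


tree = RPST(16)
for x, y in [(5, 8), (6, 9), (7, 1), (11, 5), (2, 12)]:
    tree.insert(x, y)
print(tree.root.point, tree.range_min(1, 4))
```

Inserting the five points in this order reproduces the example tree from the slides, and `range_min(1, 4)` corresponds to the Search (1:4) example.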
Maintain RPST for Time Series
• Use the time as the binary tree dimension (X).
• Use the data value as the heap dimension (Y).
Because the base interval size w is fixed, we map W into a fixed domain of X, which is {0, 1, …, w-1}.
The height of the RPST is therefore bounded by O(log w).
Insertion --- O(log w)
Deletion --- O(log w)
Search --- O(log w)
Maintain RPST for Time Series
• Map each timestamp t to t mod w.
• wt = tc mod w.
• W=[tc-w+1:tc] maps to wt+1, wt+2, …, w-1, 0, 1, …, wt.
• When time elapses (the sliding window moves), W=[tc-w+2:tc+1]:
  the mapping of [tc-w+2:tc] stays the same, so the RPST does NOT change there;
  tc+1 replaces the expired tc+1-w, so the RPST is updated only at that position
  ----- O(log w)
Retrieve the min[i:j]
Given the RPST, OTF needs to retrieve min[i:j].
wi = i mod w; wj = j mod w;
Case 1: wi <= wj
retrieve the minimum value falling into [wi, wj]
Case 2: wi > wj
retrieve the minimum values falling into [0, wj] and [wi, w-1], and pick the smaller one
--------------- O(log w)
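The modulo mapping and the two retrieval cases can be illustrated with a small Python sketch. To keep it short, a plain list of w slots stands in for the RPST, so each query here scans the slots instead of running in O(log w). The class and method names are invented for this illustration, and the stream 4, 3, 2, 5 is the s1 example used elsewhere in the slides.

```python
class CircularWindow:
    """Base interval of size w stored at positions t mod w."""

    def __init__(self, w):
        self.w = w
        self.slots = [None] * w
        self.tc = None                       # most recent timestamp

    def append(self, t, value):
        # The new timestamp t lands on the slot of the expired t - w;
        # all other slots keep their positions.
        self.slots[t % self.w] = value
        self.tc = t

    def min_in(self, i, j):
        """min[i:j] for an interval inside the current base interval."""
        assert self.tc - self.w + 1 <= i <= j <= self.tc
        wi, wj = i % self.w, j % self.w
        if wi <= wj:                         # Case 1: one contiguous range
            return min(self.slots[wi:wj + 1])
        # Case 2: the range wraps around; combine [0, wj] and [wi, w-1].
        return min(min(self.slots[:wj + 1]), min(self.slots[wi:]))


win = CircularWindow(3)
for t, v in [(1, 4), (2, 3), (3, 2)]:
    win.append(t, v)
print(win.min_in(2, 3))      # interval [2:3] wraps around: wi=2, wj=0
win.append(4, 5)             # window slides to [2:4]; only slot 1 changes
print(win.min_in(3, 4))
```

Replacing the list with an RPST keyed on `t % w` would keep the same two cases, with each range lookup becoming a tree search.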
Online Interval Skyline Query Answering: how to maintain the max value
• Store the max value for n time series
---- worst case: space cost O(nw)
• Strategy: use sketches
A pair (v,t) is maintained if there is no other pair (v’,t’) such that v’ >= v and t’ > t. In this way, we keep only O(log w) pairs on average.
Finding the maximum value costs O(1).
Updating with a new timestamp costs O(log w).
Example: max in s1, pairs written as (v,t)
(4,1) comes, save {(4,1)}
(3,2) comes, save {(4,1),(3,2)}
(2,3) comes, save {(4,1),(3,2),(2,3)}
(5,4) comes,
4>3 and 5>2, remove (2,3)
4>2 and 5>3, remove (3,2)
4>1 and 5>4, remove (4,1)
(5,4) left, save {(5,4)}
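The sketch of maximum values can be illustrated with a double-ended queue. This is a minimal Python illustration with invented names. It removes superseded pairs by popping from the back one at a time, which may differ from how the paper achieves its stated update cost, and it also drops pairs whose timestamp has left the base interval.

```python
from collections import deque


class MaxSketch:
    """Keeps (v, t) only if no later pair has a value >= v."""

    def __init__(self, w):
        self.w = w
        self.pairs = deque()     # values strictly decreasing, timestamps increasing

    def update(self, value, t):
        # Older pairs with a value <= the new one can never be the max again.
        while self.pairs and self.pairs[-1][0] <= value:
            self.pairs.pop()
        self.pairs.append((value, t))
        # Drop pairs that fell out of the base interval [t-w+1 : t].
        while self.pairs[0][1] <= t - self.w:
            self.pairs.popleft()

    def maximum(self):
        return self.pairs[0][0]  # the front pair always holds the window max


sketch = MaxSketch(3)
for v, t in [(4, 1), (3, 2), (2, 3)]:
    sketch.update(v, t)
print(list(sketch.pairs))        # all three pairs are kept
sketch.update(5, 4)
print(list(sketch.pairs), sketch.maximum())
```

Reading the maximum is a single lookup at the front of the queue, which corresponds to the O(1) cost quoted on the slide.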
Analysis of OTF
• For each time series (space)
use O(w) space to store an RPST
use O(log w) space to maintain the sketch of max values
---- Space consumption for n time series: O(nw)
• For each time series (time)
use O(log w) time to update the RPST
use O(log w) time to update the sketch
---- Time consumption for n time series: O(n log w)
Before the End
• The experiments fit the analysis of the OTF algorithm.
• There is another method: View-Materialization.
• This paper
proposes an innovative concept, and
proposes a concise and effective data structure for the solution.
Thanks!