Online Interval Skyline Queries on Time Series
From: ICDE 2009
Authors: Bin Jiang, Jian Pei
Speaker: Zhifeng Lin
Date: Nov 28th, 2008

Outline
• What is an online interval skyline query?
• Definition and Notation
• On-the-fly Solution
• Conclusion

Interval Query
[Figure: Electricity Consumption Diagram]
Analyst asks: “In the week of June 16-22, which regions have high power consumption?”
Region 1: Interesting. Highest daily consumption, on June 20th.
Region 2: Interesting. Highest average consumption over the query interval (June 16-22).
Region 3: Not interesting. Lower than Region 2 on every day.

Interval Skyline Query
Technically, a time series s is interesting if, in the query interval, there does not exist another time series s’ such that (1) s’ is better than s on at least one timestamp and (2) s’ is not worse than s on every timestamp.

Naive Approach
time series: (t0,v0),(t1,v1),…,(tn,vn)
=> treat each timestamp as a dimension
=> point: (v0,v1,…,vn)
=> skyline computation
=> skyline points (interesting series)
Bottleneck:
(1) Many skyline algorithms cannot be applied to online query answering.
(2) For overlapping query intervals, computation cannot be shared.
(3) Skyline queries are not effective in high dimensions.

Interval Skyline Query
• Time series s is said to dominate time series q in interval [i:j], denoted by s ≻[i:j] q, if ∀k ∈ [i:j], s[k] ≥ q[k]; and ∃l ∈ [i:j], s[l] > q[l].
• Given a set S of time series and interval [i:j], the interval skyline is the set of time series that are not dominated by any other series in [i:j], denoted by Sky[i:j].
[Figure: Electricity Consumption Diagram] Skyline = {region1, region2}

Problem Definition
Notation          Meaning
s, q              time series
[i:j] (i ≤ j)     an interval
Sky[i:j]          the skyline in interval [i:j]
tc                the most recent timestamp
w                 the size of the base interval
n                 the number of time series
W=[tc-w+1:tc]     the base interval of time series
s.max             the maximum value of s in the base interval W
s.min[i:j]        the minimum value of s in interval [i:j] in W
Problem: Given a set of time series S such that each time series is in the base interval W=[tc-w+1:tc], we want to maintain a data structure D such that any interval skyline query in an interval [i:j] ⊆ W can be answered efficiently using D.

On-the-fly (OTF) Method
• Interval Skyline Query Answering Algorithm
• Online Interval Skyline Query Answering
Example data: s1, s2, s3, s4, s5 [Figure: example time series]

Interval Skyline Query (OTF) Answering Algorithm
• Lemma 1 (max-min): For two time series s, q and interval [i:j] ⊆ W, if s.min[i:j] > q.max, then s dominates q in [i:j].
s.max reflects the capability of s not being dominated by some other time series.
Basic intention:
1. Store s.max and s.min for each time series to compute the interval skyline at query time.
2. Reduce unnecessary dominance checks, since they are costly.

Interval Skyline Query (OTF) Answering Algorithm
[Figure: algorithm listing, annotated as follows]
• L lists the time series by their “capability” of not being dominated (s.max), in descending order.
• maxmin records the maximum “ability” of dominating other time series, i.e. the largest s.min[i:j] among the series added to Sky so far.
• If maxmin > s.max, s and all remaining time series in L must be dominated by some series in Sky, because:
(1) L is in descending order of s.max, and
(2) Lemma 1.

Example
w=3, W=[1:3], compute the skyline in interval [2:3]
[Figure: time series s1, s2, s3, s4, s5]
Sky={}, maxmin = −∞
check s2, maxmin=1, none can dominate s2, add s2. Sky={s2}
check s3, maxmin=2, none can dominate s3, add s3. Sky={s2, s3}
check s5, maxmin=4, none can dominate s5, add s5. Sky={s2, s3, s5}
check s1, maxmin=4, s1 is dominated by s5, discard. Sky={s2, s3, s5}
check s4, maxmin=4 > s4.max, discard.
Result: Sky={s2, s3, s5}

Bottleneck for Online Extension
• Data dependence in the algorithm
--- s.min[i:j]
--- s.max for all time series
• As time elapses, the base interval moves (sliding window)
--- check s.min[i:j] (costly? Time: O(nw), Space: O(nw))
--- check s.max (costly? Time: O(nw), Space: O(nw))

Online Interval Skyline Query Answering: how to maintain min[i:j]
• min[i:j]
--- search [i:j] in the timestamp dimension (binary search tree)
--- return the min value (min heap)
Therefore, the authors use a Radix Priority Search Tree (RPST), a hybrid of a heap on one dimension and a binary search tree on the other dimension.

Radix Priority Search Tree
• Points (x,y).
• All x values are different, integral, and in the range [0, k-1].
• Each node of the priority search tree has exactly one element.
• The y value of the element in node w is <= the y value of all elements in the subtree rooted at w (y values define a min tree).
• Root interval is [0,k).
• Interval for node w is [a,b).
--- Left child interval is [a, floor((a+b)/2)).
--- Right child interval is [floor((a+b)/2), b).

Example of RPST
[0,16) (7,1)
  [0,8) (5,8)
    [0,4) (2,12)
    [4,8) (6,9)
  [8,16) (11,5)

Insert
• Start with an empty RPST.
• k = 16.
• Root interval is [0,16).
• Insert (5,8).
Tree after insertion:
[0,16) (5,8)

Insert
• Insert (6,9).
• (5,8) remains in the root, because 8 < 9.
• (6,9) is inserted in the left subtree, [0,8), because 6 is in the left child interval.
Tree after insertion:
[0,16) (5,8)
  [0,8) (6,9)

Insert
• Insert (7,1).
• (7,1) goes into the root, because 1 < 8.
• (5,8) is inserted in the left subtree, because 5 is in the left child interval.
• (5,8) displaces (6,9), because 8 < 9.
• (6,9) is inserted in the right subtree of [0,8), because 6 is in the right child interval [4,8).
Tree after insertion:
[0,16) (7,1)
  [0,8) (5,8)
    [4,8) (6,9)

Insert
• Insert (11,5).
Tree after insertion:
[0,16) (7,1)
  [0,8) (5,8)
    [4,8) (6,9)
  [8,16) (11,5)

Properties
[0,16) (7,1)
  [0,8) (5,8)
    [0,4) (2,12)
    [4,8) (6,9)
  [8,16) (11,5)
• Height is O(log k).
• Insert time is O(log k).

Search
• Search time is O(log k).
Search (1:4)

Delete
• Similar to delete-min of a min heap.
• Delete time is O(log k).

Maintain RPST for Time Series
• Use the time as the binary tree dimension (X).
• Use the data value as the heap dimension (Y).
Because the base interval size w is fixed, we map W into a fixed domain of X, which is {0,1,…,w-1}.
The height of the RPST is therefore bounded by O(log w).
Insertion --- O(log w)
Deletion --- O(log w)
Search --- O(log w)

Maintain RPST for Time Series
• Map each timestamp t to t mod w.
• wt = tc mod w.
• W=[tc-w+1:tc] maps to wt+1, wt+2, …, w-1, 0, 1, …, wt.
• When time elapses (the sliding window moves), W=[tc-w+2:tc+1]:
--- the mapping of [tc-w+2:tc] stays the same, so that part of the RPST does NOT change;
--- tc+1 replaces the expired tc+1-w, and the RPST is updated only there --- O(log w).

Retrieve the min[i:j]
Given the RPST, OTF needs to retrieve min[i:j].
wi = i mod w; wj = j mod w
Case 1: wi <= wj: retrieve the min value falling into [wi, wj].
Case 2: wi > wj: retrieve the min values falling into [0, wj] and [wi, w-1], and pick the smaller one.
--- O(log w)

Online Interval Skyline Query Answering: how to maintain the max value
• Storing the values needed to maintain the max for n time series --- worst case: space cost O(nw).
• Strategy: use sketches.
A pair (v,t) is maintained if there is no other pair (v’,t’) such that v’ >= v and t’ > t.
In this way, we only keep O(log w) pairs on average.
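The sketch rule above can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `MaxSketch`, its methods, and the window size `w=10` are hypothetical, and it uses a plain deque with pops from both ends (amortized O(1) per update) rather than whatever structure gives the O(log w) update cost quoted on the slide.

```python
from collections import deque


class MaxSketch:
    """Sketch of the window maximum for one time series (hypothetical helper).

    Keeps only pairs (v, t) for which no later pair (v', t') has v' >= v,
    so values strictly decrease from the head to the tail of the deque.
    """

    def __init__(self, w):
        self.w = w            # size of the base interval
        self.pairs = deque()  # (value, timestamp), oldest first

    def update(self, v, t):
        # Drop pairs that slid out of the base interval [t - w + 1 : t].
        while self.pairs and self.pairs[0][1] <= t - self.w:
            self.pairs.popleft()
        # Drop older pairs whose value is not larger than the new value.
        while self.pairs and self.pairs[-1][0] <= v:
            self.pairs.pop()
        self.pairs.append((v, t))

    def maximum(self):
        # The head always holds the maximum of the current base interval.
        return self.pairs[0][0]


# Replaying the pairs (4,1), (3,2), (2,3), (5,4); w=10 is an assumed window size.
sketch = MaxSketch(w=10)
for v, t in [(4, 1), (3, 2), (2, 3)]:
    sketch.update(v, t)
print(list(sketch.pairs))  # [(4, 1), (3, 2), (2, 3)]
sketch.update(5, 4)
print(list(sketch.pairs))  # [(5, 4)]
```

Because the head of the deque is always the current maximum, reading s.max is a constant-time lookup, which matches the O(1) cost stated for finding the maximum.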
• Finding the maximum value costs O(1).
• Updating for a new timestamp costs O(log w).
Example: max in s1
(4,1) comes, save {(4,1)}
(3,2) comes, save {(4,1),(3,2)}
(2,3) comes, save {(4,1),(3,2),(2,3)}
(5,4) comes:
--- 4>3, 5>2: remove (2,3)
--- 4>2, 5>3: remove (3,2)
--- 4>1, 5>4: remove (4,1)
--- (5,4) left: {(5,4)}

Analysis of OTF
• For each time series (space)
--- O(w) space to store an RPST
--- O(log w) space to maintain the sketch of max values
--- Space consumption for n time series: O(nw)
• For each time series (time)
--- O(log w) to update the RPST
--- O(log w) to update the sketch
--- Time consumption for n time series: O(n log w)

Before the End
• Experiments fit the analysis of the OTF algorithm.
• There is another method: View-Materialization.
• This paper proposes an innovative concept and a concise, effective data structure for the solution.

Thanks!
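As a recap of the OTF query procedure from the earlier slides, the following Python sketch puts the dominance test, the descending s.max order, and the Lemma 1 early stop together. It is an illustration under assumptions, not the paper's listing: the function names and the example data are hypothetical, s.max and s.min[i:j] are computed by direct scans instead of the sketch and the RPST, and the step that removes skyline members dominated by a newly added series is added here so the result stays correct when a series reaches its s.max outside the query interval.

```python
def dominates(s, q, i, j):
    """True if s dominates q on list indices i..j (inclusive)."""
    window = range(i, j + 1)
    return (all(s[k] >= q[k] for k in window)
            and any(s[k] > q[k] for k in window))


def interval_skyline(series, i, j):
    """series: dict mapping a name to its values over the base interval W."""
    # L: names in descending order of s.max (max over the whole base interval).
    order = sorted(series, key=lambda name: max(series[name]), reverse=True)
    sky = []
    maxmin = float("-inf")  # largest s.min[i:j] among series added so far
    for name in order:
        s = series[name]
        if maxmin > max(s):
            break  # Lemma 1: s and every remaining series are dominated
        if any(dominates(series[p], s, i, j) for p in sky):
            continue  # s is dominated by a skyline member, discard it
        # Assumption of this sketch: evict members that s dominates.
        sky = [p for p in sky if not dominates(s, series[p], i, j)]
        sky.append(name)
        maxmin = max(maxmin, min(s[i:j + 1]))
    return sky


# Hypothetical data (not the slide's s1..s5): w = 3, query interval = indices 1..2.
example = {
    "a": [1, 5, 2],
    "b": [2, 3, 4],
    "c": [9, 0, 1],
    "d": [1, 1, 1],
    "e": [0, 2, 2],
}
print(interval_skyline(example, 1, 2))  # ['a', 'b']
```

In this run, c is added first because of its large s.max, then evicted when a arrives; the loop stops at e because maxmin = 3 already exceeds e.max = 2, so neither e nor d needs a dominance check.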