Special Topics in Data Engineering
Panagiotis Karras
CS6234 Lecture, March 4th, 2009

Outline
• Summarizing Data Streams.
• Efficient Array Partitioning: 1D Case; 2D Case.
• Hierarchical Synopses with Optimal Error Guarantees.

Summarizing Data Streams
• Approximate a sequence $[d_1, d_2, \ldots, d_n]$ by $B$ buckets $s_i = [b_i, e_i, v_i]$, so that an error metric is minimized.
• Data arrive as a stream: each item is seen only once and cannot be stored.
• Objective functions:
  Maximum absolute error: $L_\infty(F, X) = \max_i |f_i - x_i|$
  Euclidean error: $L_2(F, X) = \left(\sum_{i=1}^{n} (f_i - x_i)^2\right)^{1/2}$

Histograms [KSM 2007]
• Solve the error-bounded problem: given a maximum absolute error bound ε, minimize the number of buckets.
  Example (ε = 2): the sequence 4, 5, 6, 2, 15, 17, 3, 6, 9, 12, … is segmented greedily into buckets ⟨4, 5, 6, 2⟩, ⟨15, 17⟩, ⟨3, 6⟩, ⟨9, 12, …⟩ with values [4], [16], [4.5], […
• Generalized to any weighted maximum-error metric: each value $d_i$ defines a tolerance interval $\left[d_i - \frac{\epsilon}{w_i},\; d_i + \frac{\epsilon}{w_i}\right]$.
  A bucket is closed when the running intersection of these intervals becomes null.
  Complexity: $O(n)$.

Histograms [KSM 2007]
• Apply to the space-bounded problem: perform binary search in the domain of the error bound ε.
  For an error value ε requiring space $\hat{B} \le B$, with actual error $\hat{\epsilon} \le \epsilon$, run an optimality test: run the error-bounded algorithm under a constraint just below $\hat{\epsilon}$ instead of ε; if it then requires space greater than B, the optimal solution has been reached.
  Complexity: $O(n \log \epsilon^*)$, independent of the number of buckets B.
• What about the streaming case?

Streamstrapping [Guha 2009]
• The error metric satisfies the property
  $\|X^H \oplus Y, H'\| - \|X, H\| \;\le\; \|X \oplus Y, H'\| \;\le\; \|X^H \oplus Y, H'\| + \|X, H\|$,
  where ⊕ denotes concatenation and $X^H$ denotes the prefix X replaced by its rendering under summary H.
• Run multiple algorithms:
  1. Read the first B items; keep reading until the first non-zero error occurs (under the precision assumption, any non-zero error exceeds 1/M).
  2. Start versions for the estimates $\epsilon_0, (1+\epsilon)\epsilon_0, \ldots, (1+\epsilon)^J \epsilon_0$, where $J = O(\frac{1}{\epsilon}\log\frac{1}{\epsilon})$.
  3. When the version for some estimate ψ fails:
     a) terminate all versions for estimates up to ψ;
     b) start new versions for the next higher estimates, each using the summary of the failed version as its first input.
  4. Repeat until the end of the input.

Streamstrapping [Guha 2009]
• Theorem: For any $\epsilon \le 1/10$, the StreamStrap algorithm achieves a $(1+3\epsilon)$-approximation, running $O(\frac{1}{\epsilon}\log\frac{1}{\epsilon})$ copies and performing $O(\frac{1}{\epsilon}\log(\epsilon^* M))$ initializations.
• Proof: Consider the lowest estimate α for which an algorithm copy still runs. Suppose the error estimate was raised j times before reaching α.
  $X_i$: prefix of the input just before the error estimate was raised for the i-th time.
  $Y_i$: suffix between the (i−1)-th and the i-th raising of the error estimate.
  $H_i$: summary built for $X_i$.
  Then, with Y the suffix after the j-th raise,
  $\|X_j^{H_j} \oplus Y, H\| - \|X_j, H_j\| \;\le\; \|X_j \oplus Y, H\| \;\le\; \underbrace{\|X_j^{H_j} \oplus Y, H\|}_{\text{target error}} + \underbrace{\|X_j, H_j\|}_{\text{added error}}$.
  Furthermore, $\|X_j, H_j\| \le \|X_{j-1}^{H_{j-1}} \oplus Y_j, H_j\| + \|X_{j-1}, H_{j-1}\|$,
  and the error estimate is raised by a $(1+\epsilon)$ factor at every raise: a recursion.

Streamstrapping [Guha 2009]
• Proof (cont'd): Putting it all together and telescoping,
  $\|X_j, H_j\| \;\le\; \sum_{1 \le i \le j} \|X_{i-1}^{H_{i-1}} \oplus Y_i, H_i\|$.
  Each term is a $(1+\epsilon)$ factor below the next, so the sum is geometric, and for $\epsilon \le 1/10$ the total added error is at most an O(ε) fraction of the optimal error.
  Moreover, $\|X_j^{H_j} \oplus Y, H^*\| \le \|X_j \oplus Y, H^*\| + \|X_j, H_j\|$ for the optimal summary $H^*$.
  However, $\|X_j^{H_j} \oplus Y, H^*\| > \frac{\alpha}{1+\epsilon}$, since the copy for estimate $\frac{\alpha}{1+\epsilon}$ failed on exactly this input.
  Thus $\|X_j \oplus Y, H^*\| = \epsilon^*$ is lower-bounded in terms of α. In conclusion, the total error is at most $(1+3\epsilon)\,\epsilon^*$.
  The bound on the number of initializations follows: the estimate starts above 1/M, ends at $O(\epsilon^*)$, and grows by $(1+\epsilon)$ factors, giving $O(\frac{1}{\epsilon}\log(\epsilon^* M))$ raises.

Streamstrapping [Guha 2009]
• Theorem: The algorithm runs in $O(\frac{B}{\epsilon}\log\frac{1}{\epsilon})$ space and $O\big(n + \frac{B}{\epsilon}\log^2 B \,\log(\epsilon^* M)\big)$ time.
• Proof: The space bound follows from B space per copy times $O(\frac{1}{\epsilon}\log\frac{1}{\epsilon})$ copies.
  Batch the input values in groups of $t = O(\frac{B}{\epsilon}\log\frac{1}{\epsilon})$.
  Define a binary tree over each batch of t values and compute the min & max over the tree nodes: $O(n)$ in total.
  Using the tree, the max & min of any interval are computed in $O(\log t)$.
  Every copy has to check violation of its bound over the t items; non-violation is decided in O(1).
  Total: $O(\frac{n}{t}) \cdot O(\frac{1}{\epsilon}\log\frac{1}{\epsilon}) = O(\frac{n}{B})$ checks.
  A violation is located in $O(\log^2 t)$; over all buckets, $O(B \log^2 t)$.
  Over all algorithm copies this becomes $O\big(B \log^2 t \cdot \frac{1}{\epsilon}\log(\epsilon^* M)\big) = O\big(\frac{B}{\epsilon}\log^2 B \,\log(\epsilon^* M)\big)$.

1D Array Partitioning [KMS 1997]
• Problem: Partition an array A of n items into p intervals so that the maximum weight $F(A, i, j) = \sum_{k=i}^{j} A_k$ of the intervals is minimized.
  Arises in load balancing in pipelined, parallel environments.
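The binary-search solution on the next slides repeatedly needs a feasibility test: can A be cut into at most p contiguous intervals, each of weight at most W? A greedy left-to-right scan answers this in O(n). Below is a minimal sketch, assuming non-negative weights; the function name is mine, not from [KMS 1997]:

```python
def partitionable(A, p, W):
    """Greedy O(n) test: can A be cut into <= p contiguous
    intervals, each of total weight <= W?  Assumes A[k] >= 0."""
    if max(A) > W:            # a single item already exceeds W
        return False
    intervals, running = 1, 0
    for x in A:
        if running + x > W:   # close the current interval,
            intervals += 1    # start a new one at x
            running = x
            if intervals > p:
                return False
        else:
            running += x
    return True
```

Greedily extending each interval as far as possible is optimal here: any feasible partition could only close intervals earlier, so if the greedy scan needs more than p intervals, no p-interval partition of maximum weight W exists.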
1D Array Partitioning [KMS 1997]
• Idea: Perform binary search over the weights of all $O(n^2)$ possible intervals; one of them, a bottleneck, is responsible for the maximum-weight result.
• Obstacle: an approximate median of these weights has to be calculated in O(n) time.

1D Array Partitioning [KMS 1997]
• Solution: Exploit the internal structure of the $O(n^2)$ intervals.
  Arrange them in n columns, column c consisting of the weights $F(A, i, c)$, $1 \le i \le c$: column c lists the intervals $(1, c), (2, c), \ldots, (c, c)$, and its weights are monotonically non-increasing from top to bottom.
• Calls to F(…) need O(1). (why? prefix sums; see the sketch after the Step 3 slide below)
• The median of any subcolumn is determined with one call to the F oracle. (how? a subcolumn is sorted, so its median is its middle element)

Splitter-finding algorithm:
• Find the median weight of each active subcolumn.
• Find the median of medians m in O(n) (standard selection).
• Let $C_l$ ($C_r$) be the set of columns whose median is < (>) m.

1D Array Partitioning [KMS 1997]
• The median of medians m is not always a splitter: m is a good splitter only if $\min(|C_l|, |C_r|) \ge \frac{|C_l| + |C_r|}{8}$.

1D Array Partitioning [KMS 1997]
• If the median of medians m is not a splitter, recur on the set of active subcolumns ($C_l$ or $C_r$) with more elements (ignored elements are still counted in future set-size calculations).
• Otherwise, return m as a good splitter (approximate median).
End of splitter-finding algorithm.

1D Array Partitioning [KMS 1997]
Overall algorithm:
1. Arrange the intervals in subcolumns.
2. Find a splitter weight m of the active subcolumns.
3. Check whether the array is partitionable into p intervals of maximum weight m. (how? the greedy feasibility test sketched earlier)
4. If so, m is an upper bound on the optimal maximum weight: eliminate half of the elements of each subcolumn in $C_l$; otherwise, in $C_r$.
5. Recur until convergence to the optimal m.
Complexity: O(n log n).

2D Array Partitioning [KMS 1997]
• Problem: Partition a 2D array of n × n items by a p × p partition (inducing $p^2$ blocks) so that the maximum weight of the blocks is minimized.
  Arises in particle-in-cell computations, sparse matrix computations, etc.
• NP-hard [GM 1996]; APX-hard [CCM 1996].

2D Array Partitioning [KMS 1997]
• Definition: Two axis-parallel rectangles are independent if their projections are disjoint along both the x-axis and the y-axis.
• Observation 1: If an array has a (W, p)-partition (a p × p partition with every block of weight at most W), then it may contain at most 2p independent rectangles of weight strictly greater than W. (why?)

2D Array Partitioning [KMS 1997]
• At least one partition line is needed to stab each of the independent rectangles, and each of the 2p lines stabs at most one of them.
• Best case: 2p independent rectangles.

2D Array Partitioning [KMS 1997]
The algorithm: Assume we know the optimal W, with $\max_{i,j} A[i,j] \le W$.
Step 1 (define $\hat{P}$): Given W, obtain a partition $\hat{P}$ such that each row/column within any block has weight at most 2W.
(how?) Independent horizontal/vertical scans, keeping track of the running weight of each row/column in the block.
(why does it exist?) Every single entry weighs at most W.

2D Array Partitioning [KMS 1997]
Step 2 (from $\hat{P}$ to S): Construct the set S of all minimal rectangles of weight more than W that are entirely contained in blocks of $\hat{P}$.
(how?) Start from each location within a block and consider all possible rectangles in order of increasing sides, until W is exceeded; keep the minimal ones.
Property of S: rectangle weight at most 3W. (why? Hint: rows/columns in blocks of $\hat{P}$ weigh at most 2W, and a minimal rectangle is a sub-rectangle of weight at most W plus one row or column.)

2D Array Partitioning [KMS 1997]
Step 3 (from S to M): Determine a locally 3-optimal set M ⊆ S of independent rectangles.
3-optimality: there exists no set of i ∈ {1, 2, 3} independent rectangles in S ∖ M that, added to M after removing i − 1 rectangles from it, does not violate the independence condition.
(how?) Polynomial-time construction with swaps: each swap grows M, so local optimality is easy to reach.
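Several "(why?)" prompts above, the O(1)-time 1D oracle F as well as the running block weights needed in Steps 1 and 2, come down to precomputed prefix sums. A minimal sketch; the function names are mine:

```python
def prefix_sums(A):
    """P[j] = A[0] + ... + A[j-1]; one pass, O(n)."""
    P = [0]
    for x in A:
        P.append(P[-1] + x)
    return P

def F(P, i, j):
    """Weight of interval A[i..j] (0-based, inclusive) in O(1)."""
    return P[j + 1] - P[i]

def prefix_sums_2d(A):
    """P[i][j] = total weight of the i x j top-left subarray of A."""
    n, m = len(A), len(A[0])
    P = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            P[i + 1][j + 1] = (A[i][j] + P[i][j + 1]
                               + P[i + 1][j] - P[i][j])
    return P

def block_weight(P, r1, c1, r2, c2):
    """Weight of the block with corners (r1,c1)-(r2,c2), inclusive,
    by inclusion-exclusion over four table entries: O(1)."""
    return (P[r2 + 1][c2 + 1] - P[r1][c2 + 1]
            - P[r2 + 1][c1] + P[r1][c1])
```

With these tables, the triangular column structure of the 1D algorithm never needs to be materialized: any entry (i, c) is produced on demand as F(P, i, c).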
2D Array Partitioning [KMS 1997]
Step 4 (from M to a new partition): For each rectangle in M, add the two straddling horizontal and the two straddling vertical lines that induce it.
New partition: $\hat{P}$ from Step 1 together with these lines, i.e., at most $h \le 2|M|$ new horizontal lines and $v \le 2|M|$ new vertical lines.

2D Array Partitioning [KMS 1997]
Step 5 (final): Retain every $\lceil h/p \rceil$-th horizontal line and every $\lceil v/p \rceil$-th vertical line.
The maximum weight is increased at most by a factor of $\lceil h/p \rceil \cdot \lceil v/p \rceil$.

2D Array Partitioning [KMS 1997]
Analysis: We have to show that:
a. Given W (large enough) such that there exists a (W, p)-partition, the maximum block weight in the constructed partition is O(W).
b. The minimum W for which the analysis holds (found by binary search) does not exceed the optimum W.

2D Array Partitioning [KMS 1997]
Lemma 1 (at Step 1): Let b be a block contained in a block of partition $\hat{P}$. If the weight of b exceeds 27W, then b can be partitioned into 3 independent rectangles of weight > W each.
Proof: Vertical scan in b; cut as soon as the slab seen so far exceeds weight 7W. (hence each slab weighs < 9W; why? each column in b weighs at most 2W)
Since b weighs more than 27W and each slab weighs at most 9W, at least 3 slabs result. Within each slab, a horizontal scan cuts as soon as the piece seen so far exceeds weight W.

2D Array Partitioning [KMS 1997]
[Figure: block b split into three vertical slabs of weight > 7W each; each slab is cut horizontally into pieces of weight in (W, 3W]; a staggered choice of one piece per slab yields 3 independent rectangles.]
Proof (cont'd): A piece whose weight exceeds W does not exceed 3W. (why? each row in b weighs at most 2W)
Eventually, 3 independent rectangles weighing > W each.

2D Array Partitioning [KMS 1997]
Lemma 2 (at Step 4): The weight of any block of the Step-4 partition is O(W).
Proof:
Case 1: b lies within a rectangle of M. The weight of b is O(W). (recall: rectangles in S weigh at most 3W)
Case 2: b lies outside the rectangles of M. The weight of b is < 27W: if it exceeded 27W, then b would be partitionable into 3 independent rectangles of weight > W (Lemma 1), which could substitute the at most 2 rectangles in M that are not independent of b, violating the 3-optimality of M.

2D Array Partitioning [KMS 1997]
Lemma 3 (at Step 3): If a (W, p)-partition exists, then |M| ≤ 2p.
Proof: The rectangles in M weigh > W each. By Observation 1, at most 2p such independent rectangles can be contained in the array.

2D Array Partitioning [KMS 1997]
Lemma 4 (at Step 5): If a (W, p)-partition exists, the weight of any block in the final solution is O(W).
Proof: At Step 5, the maximum weight is increased at most by a factor of $\lceil h/p \rceil \cdot \lceil v/p \rceil \le 25$, since $h, v \le 2|M| \le 4p$.
By Lemma 2, the maximum weight before Step 5 is O(W); hence, the final weight is O(W): claim (a).
The least W for which Steps 1 and 3 succeed does not exceed the optimum W, and it is found by binary search: claim (b).

Compact Hierarchical Histograms
• Assign arbitrary values to the CHH coefficients, so that a maximum-error metric is minimized.
[Figure: complete binary tree of coefficients $c_0, \ldots, c_6$ over the data values $d_0, \ldots, d_3$; each bottom-level coefficient $c_3, \ldots, c_6$ sits over one data value.]
• Heuristic solutions: Reiss et al., VLDB 2006: $O(nB \log^2 n \log B)$ time, $O(B \log^2 n + n)$ space.
• "The benefit of making node B a bucket (occupied) node depends on whether node A is a bucket node – and also on whether node C is a bucket node." [Reiss et al., VLDB 2006]

Compact Hierarchical Histograms
• Solve the error-bounded problem. Let S(i, v) be the minimum number of occupied nodes in the subtree of node $c_i$ that suffices for error bound ε when the nearest occupied ancestor passes down value v, and let $s_i^* = \min_v S(i, v)$. For every node, S(i, v) takes one of the values $s_i^*, s_i^* + 1, s_i^* + 2$ over a few intervals of v.
• Next-to-bottom level case: let [a, b] and [c, d] be the tolerance intervals of the two leaves under $c_i$. Then:
  S(i, v) = 0, if v ∈ [a, b] ∩ [c, d];
  S(i, v) = 1, if v ∉ [a, b] ∩ [c, d], but either [a, b] ∩ [c, d] ≠ ∅ (occupy $c_i$ with a value z in the intersection) or v ∈ [a, b] ∪ [c, d] (occupy with a suitable z the bottom node, $c_{2i}$ or $c_{2i+1}$, over the leaf that v misses);
  S(i, v) = 2, otherwise (occupy both bottom nodes).

Compact Hierarchical Histograms
• General, recursive case: let $L_0 = \{v : S(i_L, v) = s^*_{i_L}\}$ and $R_0 = \{v : S(i_R, v) = s^*_{i_R}\}$ for the left and right subtrees of $c_i$. Then:
  $S(i, v) = s^*_{i_L} + s^*_{i_R}$, if $v \in L_0 \cap R_0$;
  $S(i, v) = s^*_{i_L} + s^*_{i_R} + 1$, if either $L_0 \cap R_0 \ne \emptyset$ and $v \notin L_0 \cap R_0$, or $L_0 \cap R_0 = \emptyset$ and $v \in L_0 \cup R_0$;
  $S(i, v) = s^*_{i_L} + s^*_{i_R} + 2$, if $L_0 \cap R_0 = \emptyset$ and $v \notin L_0 \cup R_0$.
  Complexity (space-efficient version): $O(n \log^2 n)$ time, $O(n \log n)$ space.
• Apply to the space-bounded problem (binary search on ε).
  Complexity: $O(n \log n \log\log n)$. Polynomially tractable.
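To make the CHH semantics concrete: each data value is approximated by the value stored at its lowest occupied ancestor in the coefficient tree. Below is a minimal evaluation sketch; the heap-style node numbering and all names are mine, and leaves may be occupied too, mirroring the bottom-level nodes $c_3, \ldots, c_6$ above:

```python
def chh_estimates(n_leaves, occupied, default=0.0):
    """Leaf estimates of a CHH over n_leaves values (a power of two).

    Tree nodes are numbered heap-style: root = 1, children of k are
    2k and 2k+1; the data sit at leaf nodes n_leaves..2*n_leaves-1.
    `occupied` maps node index -> stored value; an unoccupied node
    inherits the value of its nearest occupied ancestor (or the
    default if no ancestor is occupied).
    """
    est = {1: occupied.get(1, default)}
    for k in range(2, 2 * n_leaves):
        est[k] = occupied.get(k, est[k // 2])
    return [est[k] for k in range(n_leaves, 2 * n_leaves)]

def max_abs_error(data, occupied):
    """L_inf error of the CHH given by `occupied` on `data`."""
    approx = chh_estimates(len(data), occupied)
    return max(abs(d - a) for d, a in zip(data, approx))

# Occupying the root (node 1) with 5.0 and node 3 (over d2, d3)
# with 9.5: the four leaves see 5, 5, 9.5, 9.5.
data = [4, 6, 9, 10]
print(max_abs_error(data, {1: 5.0, 3: 9.5}))   # -> 1.0
```

The error-bounded recursion above computes, per node and inherited value v, how few such occupied nodes suffice for a given ε; this is exactly the quantity S(i, v).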
References
1. P. Karras, D. Sacharidis, N. Mamoulis: Exploiting Duality in Summarization with Deterministic Guarantees. KDD 2007.
2. S. Guha: Tight Results for Clustering and Summarizing Data Streams. ICDT 2009.
3. S. Khanna, S. Muthukrishnan, S. Skiena: Efficient Array Partitioning. ICALP 1997.
4. F. Reiss, M. Garofalakis, J. M. Hellerstein: Compact Histograms for Hierarchical Identifiers. VLDB 2006.
5. P. Karras, N. Mamoulis: Hierarchical Synopses with Optimal Error Guarantees. ACM TODS 33(3), 2008.

Thank you! Questions?