Special Topics in Data Engineering
Panagiotis Karras
CS6234 Lecture, March 4th, 2009

Outline
• Summarizing Data Streams.
• Efficient Array Partitioning: 1D Case; 2D Case.
• Hierarchical Synopses with Optimal Error Guarantees.

Summarizing Data Streams
• Approximate a sequence $[d_1, d_2, \ldots, d_n]$ by $B$ buckets $s_i = [b_i, e_i, v_i]$, so that an error metric is minimized.
• Data arrive as a stream: each item is seen only once and cannot be stored.
• Objective functions:
  Maximum absolute error: $L_\infty(F, X) = \max_i |f_i - x_i|$
  Euclidean error: $L_2(F, X) = \left(\sum_{i=1}^{n} (f_i - x_i)^2\right)^{1/2}$

Histograms [KSM 2007]
• Solve the error-bounded problem: given a maximum absolute error bound ε, minimize the number of buckets.
  Example (ε = 2): the sequence 4, 5, 6, 2, 15, 17, 3, 6, 9, 12, … is segmented greedily into buckets ⟨4, 5, 6, 2⟩, ⟨15, 17⟩, ⟨3, 6⟩, ⟨9, 12, …⟩ with values [4], [16], [4.5], […
• Generalized to any weighted maximum-error metric: each value $d_i$ defines a tolerance interval $\left[d_i - \frac{\epsilon}{w_i},\; d_i + \frac{\epsilon}{w_i}\right]$.
  A bucket is closed when the running intersection of these intervals becomes null.
  Complexity: $O(n)$.

Histograms [KSM 2007]
• Apply to the space-bounded problem: perform binary search in the domain of the error bound ε.
  For an error value ε requiring space $\hat{B} \le B$, with actual error $\hat{\epsilon} \le \epsilon$, run an optimality test: run the error-bounded algorithm under a constraint just below $\hat{\epsilon}$ instead of ε; if it then requires space greater than B, the optimal solution has been reached.
  Complexity: $O(n \log \epsilon^*)$, independent of the number of buckets B.
• What about the streaming case?

Streamstrapping [Guha 2009]
• The error metric satisfies the property
  $\|X^H \oplus Y, H'\| - \|X, H\| \;\le\; \|X \oplus Y, H'\| \;\le\; \|X^H \oplus Y, H'\| + \|X, H\|$,
  where ⊕ denotes concatenation and $X^H$ denotes the prefix X replaced by its rendering under summary H.
• Run multiple algorithms:
  1. Read the first B items; keep reading until the first non-zero error occurs (under the precision assumption, any non-zero error exceeds 1/M).
  2. Start versions for the estimates $\epsilon_0, (1+\epsilon)\epsilon_0, \ldots, (1+\epsilon)^J \epsilon_0$, where $J = O(\frac{1}{\epsilon}\log\frac{1}{\epsilon})$.
  3. When the version for some estimate ψ fails:
     a) terminate all versions for estimates up to ψ;
     b) start new versions for the next higher estimates, each using the summary of the failed version as its first input.
  4. Repeat until the end of the input.

Streamstrapping [Guha 2009]
• Theorem: For any $\epsilon \le 1/10$, the StreamStrap algorithm achieves a $(1+3\epsilon)$-approximation, running $O(\frac{1}{\epsilon}\log\frac{1}{\epsilon})$ copies and performing $O(\frac{1}{\epsilon}\log(\epsilon^* M))$ initializations.
• Proof: Consider the lowest estimate α for which an algorithm copy still runs. Suppose the error estimate was raised j times before reaching α.
  $X_i$: prefix of the input just before the error estimate was raised for the i-th time.
  $Y_i$: suffix between the (i−1)-th and the i-th raising of the error estimate.
  $H_i$: summary built for $X_i$.
  Then, with Y the suffix after the j-th raise,
  $\|X_j^{H_j} \oplus Y, H\| - \|X_j, H_j\| \;\le\; \|X_j \oplus Y, H\| \;\le\; \underbrace{\|X_j^{H_j} \oplus Y, H\|}_{\text{target error}} + \underbrace{\|X_j, H_j\|}_{\text{added error}}$.
  Furthermore, $\|X_j, H_j\| \le \|X_{j-1}^{H_{j-1}} \oplus Y_j, H_j\| + \|X_{j-1}, H_{j-1}\|$,
  and the error estimate is raised by a $(1+\epsilon)$ factor at every raise: a recursion.

Streamstrapping [Guha 2009]
• Proof (cont'd): Putting it all together and telescoping,
  $\|X_j, H_j\| \;\le\; \sum_{1 \le i \le j} \|X_{i-1}^{H_{i-1}} \oplus Y_i, H_i\|$.
  Each term is a $(1+\epsilon)$ factor below the next, so the sum is geometric, and for $\epsilon \le 1/10$ the total added error is at most an O(ε) fraction of the optimal error.
  Moreover, $\|X_j^{H_j} \oplus Y, H^*\| \le \|X_j \oplus Y, H^*\| + \|X_j, H_j\|$ for the optimal summary $H^*$.
  However, $\|X_j^{H_j} \oplus Y, H^*\| > \frac{\alpha}{1+\epsilon}$, since the copy for estimate $\frac{\alpha}{1+\epsilon}$ failed on exactly this input.
  Thus $\|X_j \oplus Y, H^*\| = \epsilon^*$ is lower-bounded in terms of α. In conclusion, the total error is at most $(1+3\epsilon)\,\epsilon^*$.
  The bound on the number of initializations follows: the estimate starts above 1/M, ends at $O(\epsilon^*)$, and grows by $(1+\epsilon)$ factors, giving $O(\frac{1}{\epsilon}\log(\epsilon^* M))$ raises.

Streamstrapping [Guha 2009]
• Theorem: The algorithm runs in $O(\frac{B}{\epsilon}\log\frac{1}{\epsilon})$ space and $O\big(n + \frac{B}{\epsilon}\log^2 B \,\log(\epsilon^* M)\big)$ time.
• Proof: The space bound follows from B space per copy times $O(\frac{1}{\epsilon}\log\frac{1}{\epsilon})$ copies.
  Batch the input values in groups of $t = O(\frac{B}{\epsilon}\log\frac{1}{\epsilon})$.
  Define a binary tree over each batch of t values and compute the min & max over the tree nodes: $O(n)$ in total.
  Using the tree, the max & min of any interval are computed in $O(\log t)$.
  Every copy has to check violation of its bound over the t items; non-violation is decided in O(1).
  Total: $O(\frac{n}{t}) \cdot O(\frac{1}{\epsilon}\log\frac{1}{\epsilon}) = O(\frac{n}{B})$ checks.
  A violation is located in $O(\log^2 t)$; over all buckets, $O(B \log^2 t)$.
  Over all algorithm copies this becomes $O\big(B \log^2 t \cdot \frac{1}{\epsilon}\log(\epsilon^* M)\big) = O\big(\frac{B}{\epsilon}\log^2 B \,\log(\epsilon^* M)\big)$.

1D Array Partitioning [KMS 1997]
• Problem: Partition an array A of n items into p intervals so that the maximum weight $F(A, i, j) = \sum_{k=i}^{j} A_k$ of the intervals is minimized.
  Arises in load balancing in pipelined, parallel environments.
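The binary-search solution on the next slides repeatedly needs a feasibility test: can A be cut into at most p contiguous intervals, each of weight at most W? A greedy left-to-right scan answers this in O(n). Below is a minimal sketch, assuming non-negative weights; the function name is mine, not from [KMS 1997]:

```python
def partitionable(A, p, W):
    """Greedy O(n) test: can A be cut into <= p contiguous
    intervals, each of total weight <= W?  Assumes A[k] >= 0."""
    if max(A) > W:            # a single item already exceeds W
        return False
    intervals, running = 1, 0
    for x in A:
        if running + x > W:   # close the current interval,
            intervals += 1    # start a new one at x
            running = x
            if intervals > p:
                return False
        else:
            running += x
    return True
```

Greedily extending each interval as far as possible is optimal here: any feasible partition could only close intervals earlier, so if the greedy scan needs more than p intervals, no p-interval partition of maximum weight W exists.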
1D Array Partitioning [KMS 1997]
• Idea: Perform binary search over the weights of all $O(n^2)$ possible intervals; one of them, a bottleneck, is responsible for the maximum-weight result.
• Obstacle: an approximate median of these weights has to be calculated in O(n) time.

1D Array Partitioning [KMS 1997]
• Solution: Exploit the internal structure of the $O(n^2)$ intervals.
  Arrange them in n columns, column c consisting of the weights $F(A, i, c)$, $1 \le i \le c$: column c lists the intervals $(1, c), (2, c), \ldots, (c, c)$, and its weights are monotonically non-increasing from top to bottom.
• Calls to F(…) need O(1). (why? prefix sums; see the sketch after the Step 3 slide below)
• The median of any subcolumn is determined with one call to the F oracle. (how? a subcolumn is sorted, so its median is its middle element)

Splitter-finding algorithm:
• Find the median weight of each active subcolumn.
• Find the median of medians m in O(n) (standard selection).
• Let $C_l$ ($C_r$) be the set of columns whose median is < (>) m.

1D Array Partitioning [KMS 1997]
• The median of medians m is not always a splitter: m is a good splitter only if $\min(|C_l|, |C_r|) \ge \frac{|C_l| + |C_r|}{8}$.

1D Array Partitioning [KMS 1997]
• If the median of medians m is not a splitter, recur on the set of active subcolumns ($C_l$ or $C_r$) with more elements (ignored elements are still counted in future set-size calculations).
• Otherwise, return m as a good splitter (approximate median).
End of splitter-finding algorithm.

1D Array Partitioning [KMS 1997]
Overall algorithm:
1. Arrange the intervals in subcolumns.
2. Find a splitter weight m of the active subcolumns.
3. Check whether the array is partitionable into p intervals of maximum weight m. (how? the greedy feasibility test sketched earlier)
4. If so, m is an upper bound on the optimal maximum weight: eliminate half of the elements of each subcolumn in $C_l$; otherwise, in $C_r$.
5. Recur until convergence to the optimal m.
Complexity: O(n log n).

2D Array Partitioning [KMS 1997]
• Problem: Partition a 2D array of n × n items by a p × p partition (inducing $p^2$ blocks) so that the maximum weight of the blocks is minimized.
  Arises in particle-in-cell computations, sparse matrix computations, etc.
• NP-hard [GM 1996]; APX-hard [CCM 1996].

2D Array Partitioning [KMS 1997]
• Definition: Two axis-parallel rectangles are independent if their projections are disjoint along both the x-axis and the y-axis.
• Observation 1: If an array has a (W, p)-partition (a p × p partition with every block of weight at most W), then it may contain at most 2p independent rectangles of weight strictly greater than W. (why?)

2D Array Partitioning [KMS 1997]
• At least one partition line is needed to stab each of the independent rectangles, and each of the 2p lines stabs at most one of them.
• Best case: 2p independent rectangles.

2D Array Partitioning [KMS 1997]
The algorithm: Assume we know the optimal W, with $\max_{i,j} A[i,j] \le W$.
Step 1 (define $\hat{P}$): Given W, obtain a partition $\hat{P}$ such that each row/column within any block has weight at most 2W.
(how?) Independent horizontal/vertical scans, keeping track of the running weight of each row/column in the block.
(why does it exist?) Every single entry weighs at most W.

2D Array Partitioning [KMS 1997]
Step 2 (from $\hat{P}$ to S): Construct the set S of all minimal rectangles of weight more than W that are entirely contained in blocks of $\hat{P}$.
(how?) Start from each location within a block and consider all possible rectangles in order of increasing sides, until W is exceeded; keep the minimal ones.
Property of S: rectangle weight at most 3W. (why? Hint: rows/columns in blocks of $\hat{P}$ weigh at most 2W, and a minimal rectangle is a sub-rectangle of weight at most W plus one row or column.)

2D Array Partitioning [KMS 1997]
Step 3 (from S to M): Determine a locally 3-optimal set M ⊆ S of independent rectangles.
3-optimality: there exists no set of i ∈ {1, 2, 3} independent rectangles in S ∖ M that, added to M after removing i − 1 rectangles from it, does not violate the independence condition.
(how?) Polynomial-time construction with swaps: each swap grows M, so local optimality is easy to reach.
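Several "(why?)" prompts above, the O(1)-time 1D oracle F as well as the running block weights needed in Steps 1 and 2, come down to precomputed prefix sums. A minimal sketch; the function names are mine:

```python
def prefix_sums(A):
    """P[j] = A[0] + ... + A[j-1]; one pass, O(n)."""
    P = [0]
    for x in A:
        P.append(P[-1] + x)
    return P

def F(P, i, j):
    """Weight of interval A[i..j] (0-based, inclusive) in O(1)."""
    return P[j + 1] - P[i]

def prefix_sums_2d(A):
    """P[i][j] = total weight of the i x j top-left subarray of A."""
    n, m = len(A), len(A[0])
    P = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            P[i + 1][j + 1] = (A[i][j] + P[i][j + 1]
                               + P[i + 1][j] - P[i][j])
    return P

def block_weight(P, r1, c1, r2, c2):
    """Weight of the block with corners (r1,c1)-(r2,c2), inclusive,
    by inclusion-exclusion over four table entries: O(1)."""
    return (P[r2 + 1][c2 + 1] - P[r1][c2 + 1]
            - P[r2 + 1][c1] + P[r1][c1])
```

With these tables, the triangular column structure of the 1D algorithm never needs to be materialized: any entry (i, c) is produced on demand as F(P, i, c).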
2D Array Partitioning [KMS 1997]
Step 4 (from M to a new partition): For each rectangle in M, add the two straddling horizontal and the two straddling vertical lines that induce it.
New partition: $\hat{P}$ from Step 1 together with these lines, i.e., at most $h \le 2|M|$ new horizontal lines and $v \le 2|M|$ new vertical lines.

2D Array Partitioning [KMS 1997]
Step 5 (final): Retain every $\lceil h/p \rceil$-th horizontal line and every $\lceil v/p \rceil$-th vertical line.
The maximum weight is increased at most by a factor of $\lceil h/p \rceil \cdot \lceil v/p \rceil$.

2D Array Partitioning [KMS 1997]
Analysis: We have to show that:
a. Given W (large enough) such that there exists a (W, p)-partition, the maximum block weight in the constructed partition is O(W).
b. The minimum W for which the analysis holds (found by binary search) does not exceed the optimum W.

2D Array Partitioning [KMS 1997]
Lemma 1 (at Step 1): Let b be a block contained in a block of partition $\hat{P}$. If the weight of b exceeds 27W, then b can be partitioned into 3 independent rectangles of weight > W each.
Proof: Vertical scan in b; cut as soon as the slab seen so far exceeds weight 7W. (hence each slab weighs < 9W; why? each column in b weighs at most 2W)
Since b weighs more than 27W and each slab weighs at most 9W, at least 3 slabs result. Within each slab, a horizontal scan cuts as soon as the piece seen so far exceeds weight W.

2D Array Partitioning [KMS 1997]
[Figure: block b split into three vertical slabs of weight > 7W each; each slab is cut horizontally into pieces of weight in (W, 3W]; a staggered choice of one piece per slab yields 3 independent rectangles.]
Proof (cont'd): A piece whose weight exceeds W does not exceed 3W. (why? each row in b weighs at most 2W)
Eventually, 3 independent rectangles weighing > W each.

2D Array Partitioning [KMS 1997]
Lemma 2 (at Step 4): The weight of any block of the Step-4 partition is O(W).
Proof:
Case 1: b lies within a rectangle of M. The weight of b is O(W). (recall: rectangles in S weigh at most 3W)
Case 2: b lies outside the rectangles of M. The weight of b is < 27W: if it exceeded 27W, then b would be partitionable into 3 independent rectangles of weight > W (Lemma 1), which could substitute the at most 2 rectangles in M that are not independent of b, violating the 3-optimality of M.

2D Array Partitioning [KMS 1997]
Lemma 3 (at Step 3): If a (W, p)-partition exists, then |M| ≤ 2p.
Proof: The rectangles in M weigh > W each. By Observation 1, at most 2p such independent rectangles can be contained in the array.

2D Array Partitioning [KMS 1997]
Lemma 4 (at Step 5): If a (W, p)-partition exists, the weight of any block in the final solution is O(W).
Proof: At Step 5, the maximum weight is increased at most by a factor of $\lceil h/p \rceil \cdot \lceil v/p \rceil \le 25$, since $h, v \le 2|M| \le 4p$.
By Lemma 2, the maximum weight before Step 5 is O(W); hence, the final weight is O(W): claim (a).
The least W for which Steps 1 and 3 succeed does not exceed the optimum W, and it is found by binary search: claim (b).

Compact Hierarchical Histograms
• Assign arbitrary values to the CHH coefficients, so that a maximum-error metric is minimized.
[Figure: complete binary tree of coefficients $c_0, \ldots, c_6$ over the data values $d_0, \ldots, d_3$; each bottom-level coefficient $c_3, \ldots, c_6$ sits over one data value.]
• Heuristic solutions: Reiss et al., VLDB 2006: $O(nB \log^2 n \log B)$ time, $O(B \log^2 n + n)$ space.
• "The benefit of making node B a bucket (occupied) node depends on whether node A is a bucket node – and also on whether node C is a bucket node." [Reiss et al., VLDB 2006]

Compact Hierarchical Histograms
• Solve the error-bounded problem. Let S(i, v) be the minimum number of occupied nodes in the subtree of node $c_i$ that suffices for error bound ε when the nearest occupied ancestor passes down value v, and let $s_i^* = \min_v S(i, v)$. For every node, S(i, v) takes one of the values $s_i^*, s_i^* + 1, s_i^* + 2$ over a few intervals of v.
• Next-to-bottom level case: let [a, b] and [c, d] be the tolerance intervals of the two leaves under $c_i$. Then:
  S(i, v) = 0, if v ∈ [a, b] ∩ [c, d];
  S(i, v) = 1, if v ∉ [a, b] ∩ [c, d], but either [a, b] ∩ [c, d] ≠ ∅ (occupy $c_i$ with a value z in the intersection) or v ∈ [a, b] ∪ [c, d] (occupy with a suitable z the bottom node, $c_{2i}$ or $c_{2i+1}$, over the leaf that v misses);
  S(i, v) = 2, otherwise (occupy both bottom nodes).

Compact Hierarchical Histograms
• General, recursive case: let $L_0 = \{v : S(i_L, v) = s^*_{i_L}\}$ and $R_0 = \{v : S(i_R, v) = s^*_{i_R}\}$ for the left and right subtrees of $c_i$. Then:
  $S(i, v) = s^*_{i_L} + s^*_{i_R}$, if $v \in L_0 \cap R_0$;
  $S(i, v) = s^*_{i_L} + s^*_{i_R} + 1$, if either $L_0 \cap R_0 \ne \emptyset$ and $v \notin L_0 \cap R_0$, or $L_0 \cap R_0 = \emptyset$ and $v \in L_0 \cup R_0$;
  $S(i, v) = s^*_{i_L} + s^*_{i_R} + 2$, if $L_0 \cap R_0 = \emptyset$ and $v \notin L_0 \cup R_0$.
  Complexity (space-efficient version): $O(n \log^2 n)$ time, $O(n \log n)$ space.
• Apply to the space-bounded problem (binary search on ε).
  Complexity: $O(n \log n \log\log n)$. Polynomially tractable.
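To make the CHH semantics concrete: each data value is approximated by the value stored at its lowest occupied ancestor in the coefficient tree. Below is a minimal evaluation sketch; the heap-style node numbering and all names are mine, and leaves may be occupied too, mirroring the bottom-level nodes $c_3, \ldots, c_6$ above:

```python
def chh_estimates(n_leaves, occupied, default=0.0):
    """Leaf estimates of a CHH over n_leaves values (a power of two).

    Tree nodes are numbered heap-style: root = 1, children of k are
    2k and 2k+1; the data sit at leaf nodes n_leaves..2*n_leaves-1.
    `occupied` maps node index -> stored value; an unoccupied node
    inherits the value of its nearest occupied ancestor (or the
    default if no ancestor is occupied).
    """
    est = {1: occupied.get(1, default)}
    for k in range(2, 2 * n_leaves):
        est[k] = occupied.get(k, est[k // 2])
    return [est[k] for k in range(n_leaves, 2 * n_leaves)]

def max_abs_error(data, occupied):
    """L_inf error of the CHH given by `occupied` on `data`."""
    approx = chh_estimates(len(data), occupied)
    return max(abs(d - a) for d, a in zip(data, approx))

# Occupying the root (node 1) with 5.0 and node 3 (over d2, d3)
# with 9.5: the four leaves see 5, 5, 9.5, 9.5.
data = [4, 6, 9, 10]
print(max_abs_error(data, {1: 5.0, 3: 9.5}))   # -> 1.0
```

The error-bounded recursion above computes, per node and inherited value v, how few such occupied nodes suffice for a given ε; this is exactly the quantity S(i, v).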
References
1. P. Karras, D. Sacharidis, N. Mamoulis: Exploiting Duality in Summarization with Deterministic Guarantees. KDD 2007.
2. S. Guha: Tight Results for Clustering and Summarizing Data Streams. ICDT 2009.
3. S. Khanna, S. Muthukrishnan, S. Skiena: Efficient Array Partitioning. ICALP 1997.
4. F. Reiss, M. Garofalakis, J. M. Hellerstein: Compact Histograms for Hierarchical Identifiers. VLDB 2006.
5. P. Karras, N. Mamoulis: Hierarchical Synopses with Optimal Error Guarantees. ACM TODS 33(3), 2008.

Thank you! Questions?