Special Topics in Data Engineering
Panagiotis Karras
CS6234 Lecture, March 4th, 2009
Outline
• Summarizing Data Streams.
• Efficient Array Partitioning.
  - 1D Case.
  - 2D Case.
• Hierarchical Synopses with Optimal Error Guarantees.
Summarizing Data Streams
• Approximate a sequence [d1, d2, …, dn] with B buckets,
si = [bi, ei, vi] so that an error metric is minimized.
• Data arrive as a stream:
Seen only once.
Cannot be stored.
• Objective functions:
Max. abs. error: $L_\infty(F, X) = \max_i |f_i - x_i|$
Euclidean error: $L_2(F, X) = \left( \sum_{i=1}^{n} (f_i - x_i)^2 \right)^{1/2}$
Histograms [KSM 2007]
• Solve the error-bounded problem.
Maximum Absolute Error bound ε = 2
4 5 6 2 15 17 3 6 9 12 …
[4 5 6 2 → 4] [15 17 → 16] [3 6 → 4.5] […
• Generalized to any weighted maximum-error metric.
Each value $d_i$ defines a tolerance interval $\left[ d_i - \frac{\epsilon}{w_i},\; d_i + \frac{\epsilon}{w_i} \right]$.
A bucket is closed when the running intersection of these intervals becomes empty.
Complexity: $O(n)$
Histograms
• Apply to the space-bounded problem.
Perform binary search in the domain of the error bound ε
For error values requiring space $B' \le B$, with actual error $\hat{\epsilon}$, run an optimality test:
run the error-bounded algorithm under constraint error $< \hat{\epsilon}$ instead of error $\le \epsilon$.
If error $< \hat{\epsilon}$ requires $\tilde{B} > B$ space, then the optimal solution has been reached.
Complexity: $O(n \log \epsilon^*)$, independent of the number of buckets B.
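The dual use can be sketched as follows; for brevity this version binary-searches numerically with a tolerance, where the paper searches a discrete candidate domain and stops via the optimality test above. It reuses the sketch from the previous slide.

```python
def space_bounded_error(values, B, tol=1e-9):
    """Smallest max-error achievable with at most B buckets, by
    binary search on eps over [0, max - min], reusing the greedy
    error-bounded routine above.  (A numeric tolerance stands in
    for the paper's discrete search plus optimality test.)"""
    lo, hi = 0.0, max(values) - min(values)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if len(error_bounded_buckets(values, mid)) <= B:
            hi = mid          # mid is achievable: try a tighter bound
        else:
            lo = mid          # needs more than B buckets: relax
    return hi
```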
What about streaming case?
Streamstrapping [Guha 2009]
• The metric error satisfies the property:
$\epsilon(X^H \cup Y, H') - \epsilon(X, H) \;\le\; \epsilon(X \cup Y, H') \;\le\; \epsilon(X^H \cup Y, H') + \epsilon(X, H)$
where $X^H$ denotes the prefix X replaced by its summary H.
• Run multiple algorithms.
1. Read the first B items; keep reading until the first nonzero error $\epsilon_0$ (> 1/M) is incurred.
2. Start versions for $\epsilon = \epsilon_0,\, (1+\delta)\epsilon_0,\, \ldots,\, (1+\delta)^{J-1}\epsilon_0$, where $J = O\!\left(\tfrac{1}{\delta}\log\tfrac{1}{\delta}\right)$.
3. When the version for some $\epsilon'$ fails:
   a) Terminate all versions for $\epsilon'' \le \epsilon'$.
   b) Start new versions for $(1+\delta)^{J}\epsilon'$, using the summary of $\epsilon'$ as first input.
4. Repeat until end of input.
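A toy, single-chain rendition of the scheme (Guha runs the J geometrically spaced copies in parallel; the quadratic replay and all names here are mine, for illustration only). It reuses the error-bounded sketch above.

```python
def streamstrap_chain(stream, B, delta, M=1e6):
    """Summarize with the greedy error-bounded routine under a guessed
    bound eps; whenever more than B buckets would be needed, raise eps
    by a (1 + delta) factor and re-summarize, exploiting the property
    that replacing a prefix by its summary perturbs any later error by
    at most the summary's own error."""
    eps = 1.0 / M                    # initial (assumed) error estimate
    points = []                      # (value, multiplicity): summary of the prefix
    for x in stream:
        points.append((x, 1))
        flat = [v for v, m in points for _ in range(m)]
        buckets = error_bounded_buckets(flat, eps)
        while len(buckets) > B:      # current estimate failed: raise it
            eps *= 1 + delta
            buckets = error_bounded_buckets(flat, eps)
        # replace the prefix by its summary representatives
        points = [(v, e - s + 1) for s, e, v in buckets]
    return points, eps
```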
Streamstrapping [Guha 2009]
• Theorem:
For any $\delta \le \tfrac{1}{10}$, the StreamStrap algorithm achieves a $(1+\delta)(1+3\delta)$-approximation,
running $O\!\left(\tfrac{1}{\delta}\log\tfrac{1}{\delta}\right)$ copies with $O\!\left(\tfrac{1}{\delta}\log(\epsilon^* M)\right)$ initializations.
• Proof:
Consider the lowest value of $\epsilon$ for which an algorithm is still running.
Suppose the error estimate was raised j times before reaching $\epsilon$.
$X_i$: prefix of the input just before the error estimate was raised for the i-th time.
$Y_i$: suffix between the (i−1)-th and i-th raising of the error estimate.
$H_i$: summary built for $X_i$. Then (target error vs. added error):
$\epsilon(X_j^{H_j} \cup Y, H) - \epsilon(X_j, H_j) \;\le\; \epsilon(X_j \cup Y, H) \;\le\; \epsilon(X_j^{H_j} \cup Y, H) + \epsilon(X_j, H_j)$
Furthermore (recursion):
$\epsilon(X_j, H_j) \;\le\; \epsilon(X_{j-1}^{H_{j-1}} \cup Y_j, H_j) + \epsilon(X_{j-1}, H_{j-1})$
Along this chain the error estimate grows by a factor $(1+\delta)^J \ge \tfrac{1}{\delta}$ at every raising.
Streamstrapping [Guha 2009]
• Proof (cont’d):
Putting it all together, telescoping:
$\epsilon(X_j, H_j) \;\le\; \sum_{1 \le i \le j} \epsilon(X_{i-1}^{H_{i-1}} \cup Y_i, H_i) \;\le\; \sum_{i \ge 1} (1+\delta)\,\delta^{i}\,\epsilon \;\le\; \tfrac{13}{10}\,\delta\,\epsilon$
(the added error; the geometric-series bound uses $\delta \le \tfrac{1}{10}$).
Moreover,
$\epsilon(X_j^{H_j} \cup Y, H^*) \;\le\; (1+\delta)\,\epsilon(X_j \cup Y, H^*)$
However,
$\epsilon(X_j^{H_j} \cup Y, H^*) \;\ge\; \tfrac{\epsilon}{1+\delta}$ (the algorithm failed for it).
Thus,
$\epsilon(X_j \cup Y, H^*) \;\ge\; \tfrac{\epsilon}{(1+\delta)^2}$, i.e. the optimal error satisfies $\epsilon \le (1+\delta)^2\,\epsilon^*$.
In conclusion, the total error is at most
$\left(1 + \tfrac{13}{10}\,\delta\right)\epsilon \;\le\; (1+\delta)(1+3\delta)\,\epsilon^*$ for $\delta \le \tfrac{1}{10}$.
The number of initializations follows, since the estimate grows from 1/M by $(1+\delta)$ factors until it reaches $O(\epsilon^*)$.
Streamstrapping [Guha 2009]
• Theorem:

The algorithm runs in $O\!\left(\tfrac{B}{\delta}\log\tfrac{1}{\delta}\right)$ space and $O\!\left(n + \tfrac{B}{\delta}\,\log^2 B \cdot \log(\epsilon^* M)\right)$ time.

• Proof:
The space bound follows from the number of copies, each holding B buckets.
Batch the input values in groups of $t = O\!\left(\tfrac{B}{\delta}\log\tfrac{1}{\delta}\right)$.
Define a binary tree over the t values and compute min & max at every tree node: $O(n)$ overall.
Using the tree, the max & min of any interval are computed in $O(\log t)$; see the sketch below.
Every copy has to check violation of its error bound over the t items.
Non-violation is decided in O(1) per copy and batch. Total: $O\!\left(\tfrac{n}{t}\right) \cdot O\!\left(\tfrac{1}{\delta}\log\tfrac{1}{\delta}\right) = O\!\left(\tfrac{n}{B}\right)$.
A violation is located in $O(\log^2 t)$; over all buckets, $O(B \log^2 t)$.
Over all initializations this becomes
$O\!\left(B \log^2 t \cdot \tfrac{1}{\delta}\,\log(\epsilon^* M)\right)$, i.e. the theorem's $O\!\left(\tfrac{B}{\delta}\,\log^2 B \cdot \log(\epsilon^* M)\right)$ up to log log factors.
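A standard segment tree can stand in for the proof's binary tree; this sketch (my own rendering, not Guha's code) returns the min and max of any interval of a batch in O(log t):

```python
import math

class MinMaxTree:
    """Binary tree over a batch of t values: each node stores the min
    and max of its range, so min/max of any interval costs O(log t)."""
    def __init__(self, vals):
        n = 1 << max(1, math.ceil(math.log2(len(vals))))
        self.n = n
        self.mn = [math.inf] * (2 * n)
        self.mx = [-math.inf] * (2 * n)
        for i, v in enumerate(vals):              # leaves
            self.mn[n + i] = self.mx[n + i] = v
        for i in range(n - 1, 0, -1):             # internal nodes, bottom-up
            self.mn[i] = min(self.mn[2 * i], self.mn[2 * i + 1])
            self.mx[i] = max(self.mx[2 * i], self.mx[2 * i + 1])

    def query(self, lo, hi):
        """(min, max) over vals[lo:hi), in O(log t)."""
        lo += self.n; hi += self.n
        mn, mx = math.inf, -math.inf
        while lo < hi:
            if lo & 1:
                mn, mx = min(mn, self.mn[lo]), max(mx, self.mx[lo]); lo += 1
            if hi & 1:
                hi -= 1; mn, mx = min(mn, self.mn[hi]), max(mx, self.mx[hi])
            lo >>= 1; hi >>= 1
        return mn, mx
```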

1D Array Partitioning [KMS 1997]
• Problem:
Partition an array of n items into p intervals, so that the maximum weight $F(A[i,j]) = \sum_{k=i}^{j} A[k]$ of the intervals is minimized.
Arises in load balancing in pipelined, parallel
environments.
1D Array Partitioning [KMS 1997]
• Idea:
Perform binary search on all possible O(n²) intervals responsible for the maximum weight result (bottlenecks).
• Obstacle:
Approximate median has to be calculated in O(n)
time.
1D Array Partitioning [KMS 1997]
• Solution:
Exploit the internal structure of the O(n²) intervals.
n columns, column c consisting of $F(A[i,c]),\ 1 \le i \le c$:

column 1: (1,1)
column 2: (1,2) (2,2)
…
column c: (1,c) (2,c) (3,c) … (c,c)
…
column n: (1,n) (2,n) (3,n) … (n,n)

Within each column, the weights are monotonically non-increasing in i (for non-negative entries).
1D Array Partitioning [KMS 1997]
• Calls to F(...) need O(1). (why?)
• Median of any subcolumn determined with one call to the F oracle. (how? see the sketch below)
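Both questions reduce to prefix sums; a sketch (class and method names are mine):

```python
from itertools import accumulate

class WeightOracle:
    """O(1) interval weights via prefix sums: F(A[i,j]) = P[j] - P[i-1].
    Column c holds F(A[i,c]) for i = 1..c, sorted (non-increasing in i
    for non-negative entries), so the median of any subcolumn is just
    its middle element: a single oracle call."""
    def __init__(self, A):
        self.P = [0] + list(accumulate(A))

    def F(self, i, j):                        # 1-based, inclusive interval [i, j]
        return self.P[j] - self.P[i - 1]

    def subcolumn_median(self, i_lo, i_hi, c):
        return self.F((i_lo + i_hi) // 2, c)  # middle row of the sorted subcolumn
```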
Splitter-finding Algorithm:
• Find median weight in each active subcolumn.
• Find median of medians m in O(n) (standard).
• Cl (Cr): set of columns with median < (>) m.
1D Array Partitioning [KMS 1997]
• The median of medians m is not always a splitter.
A good splitter m satisfies $\min(|C_l|, |C_r|) \ge \frac{|C_l| + |C_r|}{8}$.
1D Array Partitioning [KMS 1997]
• If the median of medians m is not a splitter, recur on the set of active subcolumns (Cl or Cr) with more elements (ignored elements are still counted in future set-size calculations).
• Otherwise, return m as a good splitter
(approximate median).
$\min(|C_l|, |C_r|) \ge \frac{|C_l| + |C_r|}{8}$
End of Splitter-finding Algorithm.
1D Array Partitioning [KMS 1997]
Overall Algorithm:
1. Arrange intervals in subcolumns.
2. Find a splitter weight m of the active subcolumns.
3. Check whether the array is partitionable into p intervals of maximum weight m. (how? see the sketch below)
4. If true, then m is an upper bound of the optimal maximum weight; eliminate half of the elements of each subcolumn in Cl. Otherwise, eliminate in Cr.
5. Recur until convergence to the optimal m.
Complexity: O(n log n)
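The check in step 3 is a greedy sweep; a sketch, assuming non-negative weights (the function name is mine):

```python
def partitionable(A, p, m):
    """True iff A can be cut into at most p intervals of weight <= m:
    sweep left to right and close the current interval just before the
    running sum would exceed m (the greedy sweep is optimal here)."""
    intervals, running = 1, 0
    for a in A:
        if a > m:                 # a single item already busts the bound
            return False
        if running + a > m:       # close the interval, start a new one
            intervals, running = intervals + 1, a
        else:
            running += a
    return intervals <= p
```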
2D Array Partitioning [KMS 1997]
• Problem:
Partition a 2D array of n x n items into a p x p
partition (inducing p2 blocks) so that the
maximum weight of the blocks is minimized.
Arises in particle-in-cell computations, sparse matrix computations, etc.
• NP-hard [GM 1996]
• APX-hard [CCM 1996]
2D Array Partitioning [KMS 1997]
• Definition:
Two axis-parallel rectangles are independent if
their projections are disjoint along both the x-axis
and the y-axis.
• Observation 1:
If an array has a (W, p×p) partition, then it may contain at most 2p independent rectangles of weight strictly greater than W. (why?)
2D Array Partitioning [KMS 1997]
• At least one line needed to stab each of the
independent rectangles.
• Best case: 2p independent rectangles
2D Array Partitioning [KMS 1997]
The Algorithm:
Assume we know the optimal W, with $\max_{i,j} A[i,j] \le W$.
Step 1: (define P′)
Given W, obtain a p×p partition P′ such that each row/column within any block has weight at most 2W. (how?)
Independent horizontal/vertical scans, keeping track of the running sum of weights of each row/column in the block; see the sketch below. (why does a p×p one exist?)
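One direction of the Step-1 scan as a sketch (vertical cuts only; the horizontal pass is symmetric; it assumes $\max_{i,j} A[i,j] \le W$ as above, and all names are mine):

```python
def vertical_cuts(A, W):
    """Sweep columns left to right, tracking every row's running sum
    inside the current vertical slab; cut just before any row would
    exceed 2W, so each row piece within a slab weighs at most 2W."""
    n = len(A)
    cuts, run = [], [0.0] * n
    for c in range(n):
        col = [A[r][c] for r in range(n)]
        if any(run[r] + col[r] > 2 * W for r in range(n)):
            cuts.append(c)        # a new slab starts at column c
            run = col
        else:
            run = [run[r] + col[r] for r in range(n)]
    return cuts
```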
2D Array Partitioning [KMS 1997]
Step 2: (from P′ to S)
Construct the set S of all minimal rectangles of weight more than W that are entirely contained in blocks of P′. (how? see the sketch below)
Start from each location within a block and consider all possible rectangles in order of increasing sides, until W is exceeded; keep the minimal ones.
Property of S: rectangle weight at most 3W. (why?)
Hint: rows/columns in blocks of P′ weigh at most 2W.
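A brute-force sketch of Step 2 for a single block (the paper grows rectangles more carefully; here 2D prefix sums give O(1) weights, and for non-negative entries it suffices to test the four one-line shrinks for minimality):

```python
def minimal_heavy_rects(block, W):
    """All minimal rectangles of weight > W inside a block: the
    rectangle is heavy, yet shaving any one boundary row or column
    brings it to weight <= W (hence every proper sub-rectangle of a
    kept rectangle also weighs <= W)."""
    R, C = len(block), len(block[0])
    P = [[0] * (C + 1) for _ in range(R + 1)]      # 2D prefix sums
    for r in range(R):
        for c in range(C):
            P[r + 1][c + 1] = block[r][c] + P[r][c + 1] + P[r + 1][c] - P[r][c]

    def w(r1, c1, r2, c2):                         # weight, inclusive corners
        return P[r2 + 1][c2 + 1] - P[r1][c2 + 1] - P[r2 + 1][c1] + P[r1][c1]

    out = []
    for r1 in range(R):
        for c1 in range(C):
            for r2 in range(r1, R):
                for c2 in range(c1, C):
                    if w(r1, c1, r2, c2) > W:
                        shaves = [w(r1 + 1, c1, r2, c2) if r2 > r1 else 0,
                                  w(r1, c1, r2 - 1, c2) if r2 > r1 else 0,
                                  w(r1, c1 + 1, r2, c2) if c2 > c1 else 0,
                                  w(r1, c1, r2, c2 - 1) if c2 > c1 else 0]
                        if all(s <= W for s in shaves):
                            out.append((r1, c1, r2, c2))
    return out
```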
2D Array Partitioning [KMS 1997]
Step 3: (from S to M )
Determine a locally 3-optimal set $M \subseteq S$ of independent rectangles.
3-optimality: there is no set of $i \in \{1,2,3\}$ independent rectangles in $S \setminus M$ that, after removing some $i - 1$ rectangles from M, can be added to M without violating the independence condition.
Polynomial-time construction. (how? with swaps: local optimality is easy)
2D Array Partitioning [KMS 1997]
Step 4: (from M to new partition)
For each rectangle in M, add the two straddling horizontal and the two straddling vertical lines that induce it. An at most $2|M| \times 2|M|$ partition is derived.
New partition: P′ from Step 1 together with these lines,
$h = 2|M| + p$ horizontal lines,
$v = 2|M| + p$ vertical lines.
2D Array Partitioning [KMS 1997]
Step 5: (final)
Retain every (h/p)-th horizontal line and every (v/p)-th vertical line.
The maximum weight increases at most by a factor $\lceil h/p \rceil \cdot \lceil v/p \rceil$.
2D Array Partitioning [KMS 1997]
Analysis:
We have to show that:
a. Given a W (large enough) such that there exists a (W, p×p) partition, the maximum block weight in the constructed partition is O(W).
b. The minimum W for which the analysis holds (found by binary search) does not exceed the optimum W.
2D Array Partitioning [KMS 1997]
Lemma 1: (at Step 1)
Let block b be contained in partition P′.
If the weight of b exceeds 27W, then b can be partitioned into 3 independent rectangles of weight > W.
Proof:
Vertical scan in b, cutting as soon as the running slab weight exceeds 7W (hence each slab weighs < 9W). (why?)
Horizontal scan within each slab, cutting as soon as the running piece weight exceeds W.
2D Array Partitioning [KMS 1997]
[Figure: block b of weight > 27W, cut by the vertical scan into slabs of weight > 7W (and < 9W); horizontal cuts within the slabs carve out three pairwise independent rectangles, each of weight > W.]
Proof (cont’d):
A slab piece whose weight exceeds W does not exceed 3W. (why?)
Eventually, we obtain 3 rectangles weighing > W each.
2D Array Partitioning [KMS 1997]
Lemma 2: (at Step 4)
The weight of any block of the Step-4 partition is O(W).
Proof:
Case 1: $b \in M$.
The weight of b is O(W). (recall: rectangles in S weigh < 3W)
Case 2: $b \notin M$.
The weight of b is < 27W.
If it were > 27W, then b would be partitionable into 3 independent rectangles, which could substitute the at most 2 rectangles in M not independent of b: this violates the 3-optimality of M.
2D Array Partitioning [KMS 1997]
Lemma 3: (at Step 3)
If a (W, p×p) partition exists, then $|M| \le 2p$.
Proof:
The weight of every rectangle in M is > W.
By Observation 1, at most 2p independent rectangles can be contained in M.
2D Array Partitioning [KMS 1997]
Lemma 4: (at Step 5)
If a (W, p×p) partition exists, the weight of any block in the final solution is O(W).
Proof:
At Step 5, the maximum weight increases at most by a factor
$\left\lceil \frac{h}{p} \right\rceil \cdot \left\lceil \frac{v}{p} \right\rceil \le \left\lceil \frac{2|M| + p}{p} \right\rceil^2 \le 25$
By Lemma 2, the maximum block weight before Step 5 is O(W).
Hence, the final weight is O(W). (a)
The least W for which Step 1 and Step 3 succeed does not exceed the optimum W; it is found by binary search. (b)
Compact Hierarchical Histograms
• Assign arbitrary values to CHH coefficients, so that a maximum-error metric is minimized.
• Heuristic solutions: Reiss et al. VLDB 2006,
$O(nB \log^2 n \log B)$ time,
$O(B \log^2 n + n)$ space.
[Figure: a CHH over data values d0 … d3: a binary tree of coefficients c0 … c6, any subset of which may be occupied.]
"The benefit of making node B a bucket (occupied) node depends on whether node A is a bucket node, and also on whether node C is a bucket node." [Reiss et al. VLDB 2006]
Compact Hierarchical Histograms
• Solve the error-bounded problem.
Next-to-bottom level case: let the two leaf children $c_{2i}, c_{2i+1}$ of node $c_i$ have tolerance intervals $[a,b]$ and $[c,d]$, and let v be the value incoming from the ancestors of $c_i$. Then:
$S(i,v) = \begin{cases} 0, & [a,b] \cap [c,d] \neq \emptyset \,\wedge\, v \in [a,b] \cap [c,d] \\ 1, & ([a,b] \cap [c,d] \neq \emptyset \,\wedge\, v \notin [a,b] \cap [c,d]) \,\vee\, ([a,b] \cap [c,d] = \emptyset \,\wedge\, v \in [a,b] \cup [c,d]) \\ 2, & [a,b] \cap [c,d] = \emptyset \,\wedge\, v \notin [a,b] \cup [c,d] \end{cases}$
Note that in all cases $S(i,v) \in \{s_i^*,\, s_i^*+1\}$, where $s_i^* = \min_v S(i,v)$.
[Figure: the three configurations, showing which of $c_i, c_{2i}, c_{2i+1}$ store a value $z$ (with $z \in [a,b] \cap [c,d]$, $z \in [a,b]$, or $z \in [c,d]$) and which remain unoccupied (0).]
Compact Hierarchical Histograms
• Solve the error-bounded problem.
General, recursive case: let $i_L, i_R$ be the children of node i, and let $L_0 = \{v : S(i_L, v) = s_{i_L}^*\}$ and $R_0 = \{v : S(i_R, v) = s_{i_R}^*\}$. Then:
$S(i,v) = \begin{cases} s_{i_L}^* + s_{i_R}^*, & L_0 \cap R_0 \neq \emptyset \,\wedge\, v \in L_0 \cap R_0 \\ s_{i_L}^* + s_{i_R}^* + 1, & (L_0 \cap R_0 \neq \emptyset \,\wedge\, v \notin L_0 \cap R_0) \,\vee\, (L_0 \cap R_0 = \emptyset \,\wedge\, v \in L_0 \cup R_0) \\ s_{i_L}^* + s_{i_R}^* + 2, & L_0 \cap R_0 = \emptyset \,\wedge\, v \notin L_0 \cup R_0 \end{cases}$
Complexity (space-efficient version):
$O\!\left(\sum_{\ell=0}^{\log n} \tfrac{n}{2^{\ell}}\, 2^{\ell} (\ell+1)\right) = O(n \log^2 n)$ time,
$O\!\left(\sum_{\ell=0}^{\log n} 2^{\ell}\right) = O(n)$ space.
A sketch of the recursion follows.
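An illustrative bottom-up sketch of the error-bounded recursion (my own simplification, not the paper's space-efficient algorithm: it tracks per node only $s^*$ and the interval set attaining it, assumes n is a power of two, and charges the root one occupied node):

```python
def chh_min_space(data, eps):
    """Bottom-up CHH space computation for max-error bound eps:
    each node carries (s_star, I0), where I0 is the set of incoming
    values v with S(i, v) = s_star, kept as a list of intervals;
    children combine by the 0 / +1 / +2 rule on the I0 intersection."""
    def intersect(I, J):
        return [(max(a, c), min(b, d)) for a, b in I for c, d in J
                if max(a, c) <= min(b, d)]

    # leaf node over data value d: S = 0 iff |v - d| <= eps, else 1
    nodes = [(0, [(d - eps, d + eps)]) for d in data]
    while len(nodes) > 1:                          # combine level by level
        nxt = []
        for k in range(0, len(nodes), 2):
            (sL, L0), (sR, R0) = nodes[k], nodes[k + 1]
            both = intersect(L0, R0)
            if both:                               # v in L0 ∩ R0 costs sL + sR
                nxt.append((sL + sR, both))
            else:                                  # best is +1, on L0 ∪ R0
                nxt.append((sL + sR + 1, L0 + R0))
        nodes = nxt
    s_star, I0 = nodes[0]
    return s_star + 1                              # root stores some z in I0
```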
• Apply to the space-bounded problem.


Complexity: $O(n \log^2 n \cdot \log \epsilon^*)$ (binary search in the domain of the error bound, as before)
Polynomially Tractable
References
1. P. Karras, D. Sacharidis, N. Mamoulis: Exploiting duality in
summarization with deterministic guarantees. KDD 2007.
2. S. Guha: Tight results for clustering and summarizing data streams. ICDT 2009.
3. S. Khanna, S. Muthukrishnan, S. Skiena: Efficient Array
Partitioning. ICALP 1997.
4. F. Reiss, M. Garofalakis, and J. M. Hellerstein: Compact
histograms for hierarchical identifiers. VLDB 2006.
5. P. Karras, N. Mamoulis: Hierarchical synopses with optimal
error guarantees. ACM TODS 33(3): 2008.
Thank you! Questions?