Scalable Histograms on Large Probabilistic Data
Mingwang Tang and Feifei Li
School of Computing, University of Utah
September 21, 2015
Introduction

New challenges:
- Large-scale data size
- Distributed data sources
- Uncertainty

This motivates data synopses on large probabilistic data; in particular, scalable histograms on large probabilistic data.
Histograms on deterministic data

V-optimal histogram: given a frequency vector $\vec{v} = (v_1, \ldots, v_n)$, where $v_i$ is the frequency of item $i$ in $[n]$, and a space budget of $B$ buckets with boundaries $(s_k, e_k)$ and representatives $\hat{b}_k$, it seeks to minimize the SSE error:

$$\min \sum_{k=1}^{B} \sum_{i=s_k}^{e_k} (v_i - \hat{b}_k)^2. \qquad (1)$$

[Figure: a frequency vector $\vec{v}$ over domain $n = 12$, partitioned into $B = 3$ buckets.]

The optimal $B$-bucket histogram takes $O(Bn^2)$ time via dynamic programming.
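To make the $O(Bn^2)$ claim concrete, here is a minimal sketch of the classic dynamic program (our illustration, not code from the talk; the name `v_optimal_histogram` is ours). It is essentially the computation behind the OptHist baseline referenced later:

```python
import numpy as np

def v_optimal_histogram(v, B):
    """Classic O(B n^2) dynamic program for the V-optimal histogram.

    Returns the minimum total SSE of splitting the frequency vector v
    into B contiguous buckets, each represented by its mean.
    """
    v = np.asarray(v, dtype=float)
    n = len(v)
    # Prefix sums of v and v^2 give any bucket's SSE in O(1):
    # SSE(s, e) = sum v_i^2 - (sum v_i)^2 / (e - s + 1) over i in [s, e].
    P = np.concatenate(([0.0], np.cumsum(v)))
    Q = np.concatenate(([0.0], np.cumsum(v * v)))

    def sse(s, e):  # 0-indexed, inclusive
        t = P[e + 1] - P[s]
        return (Q[e + 1] - Q[s]) - t * t / (e - s + 1)

    INF = float("inf")
    # E[k][j]: min SSE covering items 0..j with k buckets.
    E = [[INF] * n for _ in range(B + 1)]
    for j in range(n):
        E[1][j] = sse(0, j)
    for k in range(2, B + 1):
        for j in range(k - 1, n):
            E[k][j] = min(E[k - 1][t] + sse(t + 1, j) for t in range(k - 2, j))
    return E[B][n - 1]
```

For example, `v_optimal_histogram([1, 1, 5, 5, 9, 9], 3)` returns 0.0, since three buckets fit this vector exactly.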
Probabilistic Database

A probabilistic database D is defined on the domain $[n] = \{1, \ldots, n\}$.

[Figure: item frequencies over domain 1..16 in two possible worlds W1 and W2.]

$$D = \{g_1, g_2, \ldots, g_n\}, \quad \text{where } g_i = \{(g_i(W), \Pr(W)) \mid W \in \mathcal{W}\}. \qquad (2)$$
Probabilistic Data Models

$g_i = \{(g_i(W), \Pr(W)) \mid W \in \mathcal{W}\}$

Tuple Model

Each tuple is $t_j = \langle (t_{j1}, p_{j1}), \ldots, (t_{j\ell_j}, p_{j\ell_j}) \rangle$, where each $t_{jk}$ is drawn from $[n]$ for $k \in [1, \ell_j]$, and $1 - \sum_{k=1}^{\ell_j} p_{jk}$ specifies the probability that $t_j$ generates no item.

Example (tuples $t_1, \ldots, t_{|\tau|}$):
t1: {(1, 0.2), (3, 0.3), (7, 0.2)}
t2: {(3, 0.3), (5, 0.1), (9, 0.4)}
t3: {(3, 0.5), (10, 0.4), (13, 0.1)}
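A small sketch (ours, not the talk's) of how per-item frequency moments arise in the tuple model, assuming tuples are mutually independent and each generates at most one item:

```python
from collections import defaultdict

def tuple_model_moments(tuples):
    """Per-item moments E[g_i] and E[g_i^2] under the tuple model.

    Assumes independent tuples, each generating at most one item, so
    g_i = sum_j X_ji with X_ji ~ Bernoulli(p_ji):
      E[g_i]   = sum_j p_ji
      Var(g_i) = sum_j p_ji (1 - p_ji),  E[g_i^2] = Var(g_i) + E[g_i]^2.
    """
    mean = defaultdict(float)
    var = defaultdict(float)
    for t in tuples:
        for item, p in t:
            mean[item] += p
            var[item] += p * (1.0 - p)
    return {i: (mean[i], var[i] + mean[i] ** 2) for i in mean}
```

On the example above, item 3 gets $\mathbb{E}[g_3] = 0.3 + 0.3 + 0.5 = 1.1$ and $\mathbb{E}[g_3^2] = 0.67 + 1.1^2 = 1.88$.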
Probabilistic Data Models

$g_i = \{(g_i(W), \Pr(W)) \mid W \in \mathcal{W}\}$

Value Model

Each tuple is $t_j = \langle j : f_j = ((f_{j1}, p_{j1}), \ldots, (f_{j\ell_j}, p_{j\ell_j})) \rangle$, where $j$ is drawn from $[n]$ and $f_j$ is the distribution of item $j$'s frequency, with $\Pr(f_j = 0) = 1 - \sum_{k=1}^{\ell_j} p_{jk}$.

Example (tuples $t_1, \ldots, t_n$):
t1: <1, (50, 0.2), (7, 0.1), (14, 0.2)>
t2: <2, (6, 0.4), (7, 0.3), (15, 0.3)>
t3: <3, (10, 0.3), (15, 0.2), (20, 0.5)>
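The value model is even more direct, since each tuple carries item j's frequency distribution explicitly; a matching sketch (again ours):

```python
def value_model_moments(tuples):
    """Per-item moments E[f_j] and E[f_j^2] under the value model.

    Each tuple is (j, [(freq, prob), ...]); frequency 0 carries the
    remaining probability mass and contributes nothing to either sum.
    """
    moments = {}
    for j, pdf in tuples:
        m1 = sum(f * p for f, p in pdf)
        m2 = sum(f * f * p for f, p in pdf)
        moments[j] = (m1, m2)
    return moments
```

These $(\mathbb{E}[f_i], \mathbb{E}[f_i^2])$ pairs are exactly what the distributed partition phase ships around later in the talk.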
Histograms on Probabilistic Data

Under possible-world semantics, the frequency $g_i$ of item $i$ becomes a random variable across possible worlds.

[Figure: item frequencies over domain 1..16 in possible worlds W1 and W2.]

Expectation-based histogram:

$$H(n, B) = \min \; \mathbb{E}_{\mathcal{W}}\left[\sum_{k=1}^{B} \sum_{j=s_k}^{e_k} (g_j - \hat{b}_k)^2\right].$$

[ICDE09] G. Cormode et al., Histograms and wavelets on probabilistic data, ICDE 2009
[VLDB09] G. Cormode et al., Probabilistic histograms for probabilistic data, VLDB 2009
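One step worth making explicit (a standard identity, not spelled out on the slide): for any fixed representative $\hat{b}_k$,

$$\mathbb{E}_{\mathcal{W}}\big[(g_j - \hat{b}_k)^2\big] = \operatorname{Var}(g_j) + \big(\mathbb{E}_{\mathcal{W}}[g_j] - \hat{b}_k\big)^2,$$

so the expected SSE depends on the data only through the first two moments $\mathbb{E}[g_j]$ and $\mathbb{E}[g_j^2]$. This is what the constant-time bucket-error formula on the next slide exploits.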
Efficient computation of bucket error

The optimal B-bucket histogram takes $O(Bn^2)$ time, provided each bucket's error can be evaluated cheaply. [TKDE10] shows that the minimal error of a bucket $b = (s, e, \hat{b})$ is:

$$\mathrm{SSE}(b, \hat{b}) = \sum_{i=s}^{e} \mathbb{E}_{\mathcal{W}}[g_i^2] - \frac{1}{e-s+1}\, \mathbb{E}_{\mathcal{W}}\Big[\sum_{i=s}^{e} g_i\Big]^2, \qquad (3)$$

obtained by setting $\hat{b} = \frac{1}{e-s+1}\, \mathbb{E}_{\mathcal{W}}\big[\sum_{i=s}^{e} g_i\big]$.

Based on two precomputed prefix arrays $(A, B)$, $\mathrm{SSE}(b, \hat{b})$ can be computed in constant time.

[TKDE10] G. Cormode et al., Histograms and wavelets on probabilistic data, TKDE 2010
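A minimal sketch of that constant-time evaluation (our code, following Eq. (3); the array names mirror the slide's A, B):

```python
import numpy as np

def prefix_arrays(m1, m2):
    """A[j] = sum_{i<=j} E[g_i^2], B[j] = sum_{i<=j} E[g_i], with a
    leading 0 so range sums like A[e+1] - A[s] work at s = 0.
    m1[i] = E[g_i], m2[i] = E[g_i^2], 0-indexed."""
    A = np.concatenate(([0.0], np.cumsum(np.asarray(m2, dtype=float))))
    B = np.concatenate(([0.0], np.cumsum(np.asarray(m1, dtype=float))))
    return A, B

def bucket_sse(A, B, s, e):
    """Eq. (3) in O(1): SSE of bucket [s, e] (0-indexed, inclusive) with
    the representative set to the bucket's mean expected frequency."""
    width = e - s + 1
    total = B[e + 1] - B[s]          # E[ sum of g_i over the bucket ]
    return (A[e + 1] - A[s]) - total * total / width
```

Plugging `bucket_sse` into the dynamic program shown earlier gives the $O(Bn^2)$ OptHist algorithm for probabilistic data.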
Pmerge Method

The Pmerge method is based on a partition-and-merge principle:

- Partition phase: partition the domain [n] into m sub-domains of equal size and compute the locally optimal B buckets for each sub-domain.
- Merge phase: merge the mB input buckets from the partition phase into B buckets (see the sketch after this list).

[Figure: the partition phase splits the domain 1..16 at sub-domain boundaries and buckets each sub-domain; the merge phase then treats each input bucket as a weighted item (weights w = 2, 3, 1, ... given by bucket widths, values given by the weighted frequencies from worlds W1, W2) and merges these into B buckets.]
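Below is a compact sketch of both phases (our illustration, not the paper's code). It operates on the expected frequencies $\mathbb{E}[g_i]$ alone: by Eq. (3), $\sum_i \mathbb{E}[g_i^2]$ is the same for every bucketing of a fixed domain, so bucket boundaries depend only on the $\mathbb{E}[g_i]$ values, and the merge phase can treat each input bucket as a single weighted item:

```python
import numpy as np

def opt_buckets(vals, wts, k):
    """Weighted V-optimal DP: split the sequence into k contiguous
    buckets minimizing sum_i w_i (v_i - bucket mean)^2.
    Returns (list of (start, end) boundaries, total weighted SSE)."""
    vals = np.asarray(vals, dtype=float)
    wts = np.asarray(wts, dtype=float)
    n = len(vals)
    W = np.concatenate(([0.0], np.cumsum(wts)))
    S = np.concatenate(([0.0], np.cumsum(wts * vals)))
    Q = np.concatenate(([0.0], np.cumsum(wts * vals * vals)))

    def sse(s, e):  # 0-indexed, inclusive
        w, t = W[e + 1] - W[s], S[e + 1] - S[s]
        return (Q[e + 1] - Q[s]) - t * t / w

    INF = float("inf")
    E = [[INF] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    E[0][0] = 0.0
    for kk in range(1, k + 1):
        for j in range(kk, n + 1):
            for t in range(kk - 1, j):
                c = E[kk - 1][t] + sse(t, j - 1)
                if c < E[kk][j]:
                    E[kk][j], cut[kk][j] = c, t
    bnds, j = [], n
    for kk in range(k, 0, -1):
        bnds.append((cut[kk][j], j - 1))
        j = cut[kk][j]
    return bnds[::-1], E[k][n]

def pmerge(m1, B, m):
    """Pmerge sketch over expected frequencies m1[i] = E[g_i]:
    locally bucket m equal sub-domains, then merge the resulting
    weighted buckets (value = bucket mean, weight = width) into B."""
    n = len(m1)
    size = -(-n // m)  # ceil(n / m)
    means, widths = [], []
    for lo in range(0, n, size):
        sub = m1[lo:lo + size]
        for s, e in opt_buckets(sub, [1.0] * len(sub), min(B, len(sub)))[0]:
            seg = sub[s:e + 1]
            means.append(sum(seg) / len(seg))
            widths.append(float(e - s + 1))
    return opt_buckets(means, widths, min(B, len(means)))
```

The SSE it reports omits the additive constant $\sum_i (\mathbb{E}[g_i^2] - \mathbb{E}[g_i]^2)$, which does not affect the choice of boundaries.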
Recursive Merging Method

Pmerge approximation quality: Pmerge produces a $\sqrt{10}$-approximation in $O(N + Bn^2/m + B^3 m^2)$ time.

Recursive merging (RPmerge):
- Partition [n] into $m^{\ell}$ sub-domains, producing $Bm^{\ell}$ input buckets.
- Use $\ell$ merge iterations, each reducing the number of sub-domains by a factor of m.
- RPmerge takes $O(N + Bn^2/m^{\ell} + B^3 \sum_{i=1}^{\ell} m^{i+1})$ time and gives a $10^{\ell/2}$-approximation of the optimal B-bucket histogram found by OptHist.

In practice, Pmerge and RPmerge consistently provide close-to-optimal approximation quality, as shown in our experiments. A recursion sketch follows below.
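A sketch matching the description above (ours; it reuses `opt_buckets` from the Pmerge sketch, so it is not independently self-contained):

```python
def rpmerge(m1, B, m, ell):
    """RPmerge sketch: level 0 buckets m^ell equal sub-domains of
    m1[i] = E[g_i]; each later level merges groups of m neighbouring
    sub-results, shrinking the count by a factor of m per level."""
    n = len(m1)
    size = -(-n // m ** ell)            # ceil(n / m^ell) sub-domain width
    # Level 0: locally optimal B buckets per sub-domain, as (value, weight).
    level = []
    for lo in range(0, n, size):
        sub = m1[lo:lo + size]
        items = []
        for s, e in opt_buckets(sub, [1.0] * len(sub), min(B, len(sub)))[0]:
            seg = sub[s:e + 1]
            items.append((sum(seg) / len(seg), float(e - s + 1)))
        level.append(items)
    # Levels 1..ell: merge each group of m neighbours back to B buckets.
    for _ in range(ell):
        nxt = []
        for g in range(0, len(level), m):
            vals, wts = zip(*[it for part in level[g:g + m] for it in part])
            merged = []
            for s, e in opt_buckets(list(vals), list(wts),
                                    min(B, len(vals)))[0]:
                w = sum(wts[s:e + 1])
                v = sum(vv * ww for vv, ww in
                        zip(vals[s:e + 1], wts[s:e + 1])) / w
                merged.append((v, w))
            nxt.append(merged)
        level = nxt
    return level[0]   # B (value, width) summaries spanning [n]
```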
Distributed and Parallel Pmerge

Partition phase in the distributed environment: the probabilistic database D is spread over sites $\tau_1, \ldots, \tau_\beta$. For each local item i, a site computes the sub-domain index $h(i) = \lceil i / \lceil n/m \rceil \rceil$ and sends the message $(i, (\mathbb{E}[f_i], \mathbb{E}[f_i^2]))$ to the node handling that sub-domain, which assembles the A, B arrays.

[Figure: sites $\tau_1, \ldots, \tau_\beta$ route messages $(i, (\mathbb{E}[f_i], \mathbb{E}[f_i^2]))$ into the m sub-domains of [n] via h(i); each sub-domain builds its A, B arrays.]

Communication cost:
- Computing the $A_k$, $B_k$ arrays in the partition phase: $O(\beta n)$ bytes in the tuple model; $O(n)$ bytes in the value model.
- $O(Bm)$ bytes in the merge phase for both models.
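A sketch of a site's partition-phase routing (our glue code; the slide specifies only h(i) and the message format):

```python
import math

def route(i, n, m):
    """Sub-domain of item i (1-indexed), per the slide:
    h(i) = ceil(i / ceil(n/m))."""
    return math.ceil(i / math.ceil(n / m))

def partition_messages(moments, n, m):
    """One message (i, (E[f_i], E[f_i^2])) per local item, grouped by
    destination sub-domain; `moments` maps item -> (E[f_i], E[f_i^2]),
    e.g. the output of value_model_moments above."""
    out = [[] for _ in range(m)]
    for i, (m1, m2) in moments.items():
        out[route(i, n, m) - 1].append((i, (m1, m2)))
    return out
```

Each sub-domain's receiver then builds its $A_k$, $B_k$ prefix arrays from the received pairs, exactly as in `prefix_arrays` earlier.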
Pmerge Based on Sampling

Rather than computing the prefix arrays exactly in the partition phase, sample them:

$$A_k[j] = \sum_{i=1}^{j} \mathbb{E}[f_i^2], \qquad B_k[j] = \sum_{i=1}^{j} \mathbb{E}[f_i].$$

Estimate the $A_k$, $B_k$ arrays using quantile sampling over the expected frequencies (the slide's example: $\mathbb{E}[f_i] = 2, 3, 5, 9, \ldots$ viewed as stacked unit masses along the item axis).
The sampling probability is set as $p = \min\{\Theta(\sqrt{\beta}/N),\ \Theta(1/(\varepsilon^2 N))\}$.
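A rough illustration of the sampling idea (ours; the paper's estimator and its analysis are more refined):

```python
import bisect
import random

def sampled_prefix(m1, p, seed=7):
    """Estimate the prefix sums B_k[j] = sum_{i<=j} E[f_i] from a sample:
    each unit of expected frequency survives with probability p, and a
    prefix sum is estimated by the scaled count of surviving units at
    positions <= j.  Assumes near-integer E[f_i] so mass can be split
    into units; illustrative only."""
    rng = random.Random(seed)
    samples = []                      # positions of surviving units, sorted
    for i, f in enumerate(m1):
        for _ in range(int(round(f))):
            if rng.random() < p:
                samples.append(i)
    def est(j):                       # unbiased estimate of B_k[j], 0-indexed
        return bisect.bisect_right(samples, j) / p
    return est
```

With `m1 = [2, 3, 5, 9]` as on the slide, `sampled_prefix(m1, 0.5)(3)` estimates $B_k[3] = 19$ from roughly half as many sampled units.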
Pmerge Based on Sketch

Tuple-model A, B arrays: estimate

$$F_2 = \sum_{i=s_k}^{e_k} \Big(\sum_{\ell=1}^{\beta} \mathbb{E}_{\mathcal{W},\ell}[g_i]\Big)^2$$

using AMS sketch techniques together with a binary (dyadic) decomposition of the domain $[s_k, e_k]$.

[Figure: (a) the binary decomposition of $[s_k, e_k]$ into dyadic ranges over $\mathbb{E}_{\mathcal{W}}[g_{s_k}], \ldots, \mathbb{E}_{\mathcal{W}}[g_{e_k}]$, with per-range error bounds $\varepsilon M_k''$ and $2\varepsilon M_k''$; (b) local Q-AMS: each site $\ell$ keeps AMS sketches over its local $\mathbb{E}_{\mathcal{W},\ell}[g_{s_k}], \ldots, \mathbb{E}_{\mathcal{W},\ell}[g_{e_k}]$.]
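For reference, the tug-of-war estimator underneath the AMS sketch, in a minimal didactic form (our code; the talk's Q-AMS structure additionally organizes such sketches over the dyadic ranges of the binary decomposition):

```python
import hashlib
import statistics

def _sign(seed, i):
    """Pseudo 4-wise independent +/-1 hash via SHA-1 (illustrative)."""
    return 1 if hashlib.sha1(f"{seed}:{i}".encode()).digest()[0] & 1 else -1

class AMSSketch:
    """Tug-of-war sketch for F2 = sum_i x_i^2: each counter holds
    sum_i sign_s(i) * x_i, whose square has expectation F2."""
    def __init__(self, copies=9):
        self.z = [0.0] * copies

    def update(self, i, delta):       # add delta to coordinate i
        for s in range(len(self.z)):
            self.z[s] += _sign(s, i) * delta

    def merge(self, other):           # sketches are linear, so they add
        for s in range(len(self.z)):
            self.z[s] += other.z[s]

    def estimate(self):
        return statistics.median(v * v for v in self.z)
```

In the distributed setting each site $\ell$ updates a local sketch with its $\mathbb{E}_{\mathcal{W},\ell}[g_i]$ values; merging the $\beta$ site sketches yields an estimator for the $F_2$ of the summed frequency vector over a dyadic range.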
Outline

1. Optimal B-bucket Histograms
2. Approximate Histograms
3. Pmerge Based on Sampling
4. Experiments
Experiment setup

We generate the tuple-model and value-model datasets using the client id field of the 1998 WorldCup dataset and atmospheric measurements from the SAMOS project.

Default experimental parameters:

Symbol | Definition        | Default value
B      | number of buckets | 400
n      | domain size       | 100k (600k)
ℓ      | recursion depth   | 2
Running time vs. domain size n

[Figure: running time (seconds, log scale) of OptHist, PMerge, and RPMerge as the domain size n grows from 10 to 200 (×10^3); left: tuple model, right: value model.]
Approximation ratio vs. domain size n

[Figure: approximation ratio of PMerge and RPMerge vs. domain size n from 10 to 200 (×10^3); the ratio stays below roughly 1.08 in the tuple model and below roughly 1.0004 in the value model.]
Running time on large-scale probabilistic data

[Figure: running time (log scale) of RPMerge, Parallel-PMerge, and Parallel-RPMerge vs. domain size n from 200 to 1000 (×10^3); left: tuple model (seconds), right: value model (×10^2 seconds).]
Conclusion & Future work

Conclusion:
- Novel approximation methods for constructing scalable histograms on large probabilistic data.
- In practice, the quality of the approximate histograms is almost as good as that of the optimal histogram.
- The techniques extend to distributed and parallel settings, further improving scalability.

Future work: extend the study to probabilistic histograms with pdf bucket representatives, and handle histograms under other error metrics.
The end
Thank You
Q and A