Scalable Histograms on Large Probabilistic Data
Mingwang Tang and Feifei Li
School of Computing, University of Utah
September 21, 2015

Introduction

New challenges:
- Large-scale data size
- Distributed data sources
- Uncertainty

Our focus: data synopses on large probabilistic data, specifically scalable histograms on large probabilistic data.

Histograms on Deterministic Data

V-optimal histogram: given a frequency vector $\vec{v} = (v_1, \ldots, v_n)$, where $v_i$ is the frequency of item $i$ in $[n]$, and a space budget of $B$ buckets, choose bucket boundaries $(s_k, e_k)$ and representatives $\hat{b}_k$ to minimize the SSE error:

\[ \min \sum_{k=1}^{B} \sum_{i=s_k}^{e_k} (v_i - \hat{b}_k)^2 \qquad (1) \]

[Figure: a 3-bucket histogram over a frequency vector with n = 12 and B = 3.]

Computing the optimal B-bucket histogram takes $O(Bn^2)$ time.

Probabilistic Database

A probabilistic database D on domain $[n] = \{1, \ldots, n\}$:

[Figure: two possible worlds W_1 and W_2, each giving a frequency for every item in 1..16.]

\[ D = \{g_1, g_2, \ldots, g_n\} \quad \text{where} \quad g_i = \{(g_i(W), \Pr(W)) \mid W \in \mathcal{W}\} \qquad (2) \]

Probabilistic Data Models: Tuple Model

- Each tuple $t_j = \langle (t_{j1}, p_{j1}), \ldots, (t_{j\ell_j}, p_{j\ell_j}) \rangle$, where each $t_{jk}$ is drawn from $[n]$ for $k \in [1, \ell_j]$.
- $1 - \sum_{k=1}^{\ell_j} p_{jk}$ specifies the probability that $t_j$ generates no item.

Example:
  t_1 = {(1, 0.2), (3, 0.3), (7, 0.2)}
  t_2 = {(3, 0.3), (5, 0.1), (9, 0.4)}
  t_3 = {(3, 0.5), (10, 0.4), (13, 0.1)}
  ...
  t_{|τ|}

Probabilistic Data Models: Value Model

- Each tuple $t_j = \langle j : f_j = ((f_{j1}, p_{j1}), \ldots, (f_{j\ell_j}, p_{j\ell_j})) \rangle$, where $j$ is drawn from $[n]$ and $f_j$ is the (random) frequency of item $j$.
- $\Pr(f_j = 0) = 1 - \sum_{k=1}^{\ell_j} p_{jk}$.

Example:
  t_1 = <1 : (50, 0.2), (7, 0.1), (14, 0.2)>
  t_2 = <2 : (6, 0.4), (7, 0.3), (15, 0.3)>
  t_3 = <3 : (10, 0.3), (15, 0.2), (20, 0.5)>
  ...
  t_n

Histograms on Probabilistic Data

Under possible-world semantics, the frequency $g_i$ of item $i$ becomes a random variable across possible worlds.

[Figure: the frequencies of items 1..16 in possible worlds W_1 and W_2.]

Expectation-based histogram:

\[ H(n, B) = \min \sum_{k=1}^{B} \sum_{j=s_k}^{e_k} \mathrm{E}_W[(g_j - \hat{b}_k)^2]. \]

[ICDE09] G. Cormode et al. Histograms and wavelets on probabilistic data. ICDE 2009.
[VLDB09] G. Cormode et al. Probabilistic histograms for probabilistic data. VLDB 2009.

Efficient Computation of the Bucket Error

The optimal B-bucket histogram takes $O(Bn^2)$ time. [TKDE10] shows that the minimal error of a bucket $b = (s, e, \hat{b})$ is

\[ \mathrm{SSE}(b, \hat{b}) = \sum_{i=s}^{e} \mathrm{E}_W[g_i^2] - \frac{1}{e-s+1} \Big( \mathrm{E}_W\Big[ \sum_{i=s}^{e} g_i \Big] \Big)^2, \qquad (3) \]

obtained by setting $\hat{b} = \frac{1}{e-s+1} \mathrm{E}_W[\sum_{i=s}^{e} g_i]$. Based on two precomputed prefix-sum arrays $(A, B)$, $\mathrm{SSE}(b, \hat{b})$ can be computed in constant time.

[TKDE10] G. Cormode et al. Histograms and wavelets on probabilistic data. TKDE 2010.
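The Python sketch below puts these pieces together for the tuple model: per-item moments, the prefix arrays (A, B), the O(1) bucket error of Eq. (3), and the classical O(Bn^2) dynamic program. It is a minimal sketch of ours, not the authors' code; it assumes independent tuples, each generating at most one item, and names like `item_moments` and `opt_hist` are illustrative.

```python
# A minimal sketch (ours, not the paper's code) of the expectation-based
# V-optimal histogram: per-item moments under the tuple model, the prefix
# arrays (A, B), the O(1) bucket error of Eq. (3), and the O(B n^2) DP.

def item_moments(tuples, n):
    """E[g_i] and E[g_i^2] for items 1..n under the tuple model.

    Each tuple is a list of (item, prob) pairs and generates at most one
    item; tuples are independent.  Hence g_i is a sum of independent
    Bernoulli indicators: E[g_i] = sum of probs, and
    E[g_i^2] = Var(g_i) + E[g_i]^2 with Var(g_i) = sum of p(1 - p).
    """
    e1 = [0.0] * (n + 1)   # E[g_i], 1-indexed
    var = [0.0] * (n + 1)  # Var(g_i)
    for t in tuples:
        for item, p in t:
            e1[item] += p
            var[item] += p * (1.0 - p)
    e2 = [v + mu * mu for v, mu in zip(var, e1)]  # E[g_i^2]
    return e1, e2

def prefix_arrays(e1, e2, n):
    """A[j] = sum_{i<=j} E[g_i^2] and B[j] = sum_{i<=j} E[g_i]."""
    A, B = [0.0] * (n + 1), [0.0] * (n + 1)
    for i in range(1, n + 1):
        A[i] = A[i - 1] + e2[i]
        B[i] = B[i - 1] + e1[i]
    return A, B

def bucket_sse(A, B, s, e):
    """Minimal expected SSE of bucket (s, e) in O(1) time, per Eq. (3)."""
    w = e - s + 1
    sum1 = B[e] - B[s - 1]   # sum of E[g_i] over the bucket
    sum2 = A[e] - A[s - 1]   # sum of E[g_i^2] over the bucket
    return sum2 - sum1 * sum1 / w

def opt_hist(A, B, n, nbuckets):
    """OptHist-style O(B n^2) DP for the optimal B-bucket histogram."""
    INF = float("inf")
    # H[k][i]: minimal error covering items 1..i with k buckets.
    H = [[INF] * (n + 1) for _ in range(nbuckets + 1)]
    cut = [[0] * (n + 1) for _ in range(nbuckets + 1)]
    H[0][0] = 0.0
    for k in range(1, nbuckets + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):  # last bucket is (j + 1, i)
                cand = H[k - 1][j] + bucket_sse(A, B, j + 1, i)
                if cand < H[k][i]:
                    H[k][i], cut[k][i] = cand, j
    buckets, i = [], n  # recover bucket boundaries from the cut points
    for k in range(nbuckets, 0, -1):
        buckets.append((cut[k][i] + 1, i))
        i = cut[k][i]
    return H[nbuckets][n], buckets[::-1]

if __name__ == "__main__":
    # The toy tuple-model data from the slides.
    data = [[(1, 0.2), (3, 0.3), (7, 0.2)],
            [(3, 0.3), (5, 0.1), (9, 0.4)],
            [(3, 0.5), (10, 0.4), (13, 0.1)]]
    e1, e2 = item_moments(data, 16)
    A, B = prefix_arrays(e1, e2, 16)
    print(opt_hist(A, B, 16, 3))
```

The constant-time `bucket_sse` is exactly what the precomputed (A, B) arrays buy: the DP still examines all O(Bn^2) boundary choices, but each candidate bucket is scored in O(1).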
Pmerge Method

Pmerge is based on a partition-and-merge principle:
- Partition phase: partition the domain [n] into m sub-domains of equal size and compute the locally optimal B buckets for each sub-domain.
- Merge phase: merge the mB input buckets from the partition phase into B buckets (a code sketch of both phases follows the next slide).

[Figure: the partition phase splits the domain 1..16 at sub-domain boundaries and builds local buckets per sub-domain; the merge phase treats each local bucket as a weighted frequency (weights w = 1, 2, 3 in the example) and merges the weighted buckets into B final buckets.]

Recursive Merging Method

Pmerge:
- Approximation quality: Pmerge produces a $\sqrt{10}$-approximation in $O(N + Bn^2/m + B^3 m^2)$ time.

Recursive merging (RPmerge):
- Partition [n] into $m^\ell$ sub-domains, producing $Bm^\ell$ local buckets.
- Merge over $\ell$ iterations, each iteration reducing the number of sub-domains by a factor of m.
- Takes $O(N + Bn^2/m^\ell + B^3 \sum_{i=1}^{\ell} m^{i+1})$ time and gives a $10^{\ell/2}$-approximation of the optimal B-bucket histogram found by OptHist.
- In practice, Pmerge and RPmerge always provide close-to-optimal approximation quality, as shown in our experiments.
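To make the partition-and-merge principle concrete, here is a minimal sketch of ours, reusing `item_moments`, `prefix_arrays`, `bucket_sse`, and `opt_hist` from the previous sketch. It follows one natural reading of the merge phase: each local bucket is collapsed to its representative, weighted by its width, and a weighted DP merges the mB representatives into B buckets. The paper's exact merge rule, from which the $\sqrt{10}$ guarantee is derived, may differ in details.

```python
# A minimal sketch (ours) of Pmerge's partition-and-merge principle.
# Assumes item_moments / prefix_arrays / bucket_sse / opt_hist from the
# previous sketch are in scope.

def pmerge(e1, e2, n, nbuckets, m):
    """Partition [1, n] into m equal sub-domains, compute B locally
    optimal buckets per sub-domain, then merge the resulting weighted
    bucket representatives into B final buckets via a weighted DP."""
    # --- Partition phase: O(B (n/m)^2) per sub-domain, O(B n^2 / m) total.
    reps = []                    # (mean, width, local SSE) per local bucket
    size = (n + m - 1) // m      # sub-domain width
    for lo in range(1, n + 1, size):
        hi = min(lo + size - 1, n)
        w = hi - lo + 1
        # Local prefix arrays over the sub-domain, re-indexed to 1..w.
        lA, lB = prefix_arrays([0.0] + e1[lo:hi + 1],
                               [0.0] + e2[lo:hi + 1], w)
        _, buckets = opt_hist(lA, lB, w, min(nbuckets, w))
        for s, e in buckets:
            width = e - s + 1
            reps.append(((lB[e] - lB[s - 1]) / width, width,
                         bucket_sse(lA, lB, s, e)))
    # --- Merge phase: weighted V-optimal DP over k = m*B points,
    # O(B k^2) = O(B^3 m^2) time, independent of n.
    k = len(reps)
    W = [0.0] * (k + 1); S = [0.0] * (k + 1); Q = [0.0] * (k + 1)
    for i, (r, w, _) in enumerate(reps, 1):
        W[i] = W[i - 1] + w          # total weight
        S[i] = S[i - 1] + w * r      # weighted sum of representatives
        Q[i] = Q[i - 1] + w * r * r  # weighted sum of squares

    def werr(s, e):  # weighted SSE of merging input buckets s..e
        ww, ss, qq = W[e] - W[s - 1], S[e] - S[s - 1], Q[e] - Q[s - 1]
        return qq - ss * ss / ww

    INF = float("inf")
    H = [[INF] * (k + 1) for _ in range(nbuckets + 1)]
    H[0][0] = 0.0
    for b in range(1, nbuckets + 1):
        for i in range(b, k + 1):
            H[b][i] = min(H[b - 1][j] + werr(j + 1, i)
                          for j in range(b - 1, i))
    # Total expected SSE = merge-phase error + (fixed) local bucket errors.
    return H[nbuckets][k] + sum(err for _, _, err in reps)
```

When a merged bucket aligns with local bucket boundaries, its exact expected SSE decomposes into the fixed local errors plus the weighted term the DP minimizes, so the merge phase loses accuracy only through that alignment restriction. The costs annotated above match the $O(N + Bn^2/m + B^3 m^2)$ bound on the slide, with the O(N) term coming from the moment computation.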
Distributed and Parallel Pmerge

Partition phase in the distributed environment:
- The probabilistic database D is spread across sites $\tau_1, \ldots, \tau_\beta$, and the domain [n] is split into m sub-domains.
- Each site maps item i to the sub-domain $h(i) = \lceil i / \lceil n/m \rceil \rceil$ and ships the pair $(i, (\mathrm{E}[f_i], \mathrm{E}[f_i^2]))$ to the node responsible for that sub-domain, which assembles the local A, B arrays.

Communication cost:
- Computing the $A_k, B_k$ arrays in the partition phase: O(βn) bytes in the tuple model; O(n) bytes in the value model.
- O(Bm) bytes in the merge phase for both models.

Pmerge Based on Sampling

Sampling the A, B arrays in the partition phase, where

\[ A_k[j] = \sum_{i=1}^{j} \mathrm{E}[f_i^2], \qquad B_k[j] = \sum_{i=1}^{j} \mathrm{E}[f_i]. \]

Estimate the $A_k, B_k$ arrays using quantile sampling, with sampling probability $p = \min\{\Theta(\sqrt{\beta}/N), \Theta(1/(\varepsilon^2 N))\}$.

[Figure: the per-item expected frequencies E[f_i] (2, 3, 5, 9, ... for items 1, 2, 3, 4, ...) are viewed as units of weight; sampled units act as approximate quantiles from which A_k and B_k are estimated.]

Pmerge Based on Sketch

Tuple-model A, B arrays:
- Estimate $F_2 = \sum_{i=s_k}^{e_k} \big( \sum_{\ell=1}^{\beta} \mathrm{E}_{W,\ell}[g_i] \big)^2$ using AMS sketch techniques and a binary decomposition of the domain $[s_k, e_k]$.

[Figure: (a) a binary decomposition of $[s_k, e_k]$ into dyadic ranges with per-level $F_2$ thresholds $\varepsilon M_k''$ and $2\varepsilon M_k''$; (b) local Q-AMS sketches, one per site $\ell$, summarizing the $\mathrm{E}_{W,\ell}[g_i]$ and combined to estimate the global $F_2$.]

Outline

1. Optimal B-bucket histograms
2. Approximate histograms
3. Pmerge based on sampling
4. Experiments

Experiment Setup

We generate tuple-model and value-model datasets using the client id field of the 1998 WorldCup dataset and atmospheric measurements from the SAMOS project.

Default experimental parameters:

  Symbol | Definition          | Default value
  B      | number of buckets   | 400
  n      | domain size         | 100k (600k)
  ℓ      | depth of recursion  | 2

Running Time vs. Domain Size n

[Figure: running time (seconds, log scale) of OptHist, PMerge, and RPMerge as n grows from 10k to 200k, for the tuple model and the value model.]

Approximation Ratio vs. Domain Size n

[Figure: approximation ratios of PMerge and RPMerge for n from 10k to 200k; the y-axis spans 1 to 1.08 for the tuple model and 1 to 1.0004 for the value model.]

Running Time on Large-Scale Probabilistic Data

[Figure: running time of RPMerge, Parallel-PMerge, and Parallel-RPMerge for n from 200k to 1,000k (tuple model) and 200k to 800k (value model).]

Conclusion & Future Work

Conclusion:
- Novel approximation methods for constructing scalable histograms on large probabilistic data.
- In practice, the quality of the approximate histograms is almost as good as that of the optimal histogram.
- Extended the techniques to distributed and parallel settings to further improve scalability.

Future work:
- Extend the study to probabilistic histograms with pdf bucket representatives, and to histograms under other error metrics.

The End

Thank you. Q & A.