Optimizing Probabilistic Query Processing on Continuous Uncertain Data
Liping Peng, Yanlei Diao, Anna Liu
VLDB 2011, Seattle, WA, US
University of Massachusetts Amherst · Department of Computer Science

Applications of Uncertain Data Management

Motivating Application: Sloan Digital Sky Survey
name                                      type    description
OBJ_ID                                    bigint  SDSS identifier
…                                                 …
(rowc, rowc_err)                          real    (row center position, error term)
(colc, colc_err)                          real    (column center position, error term)
(q_u, qErr_u)                             real    (Stokes Q parameter, error term)
(u_u, uErr_u)                             real    (Stokes U parameter, error term)
(ra, dec, ra_err, dec_err, ra_dec_corr)   real    (right ascension, declination, error in ra, error in dec, ra/dec correlation)
…                                                 …
Q1: SELECT * FROM Galaxy AS G
    WHERE G.r < 22 AND G.q_r^2 + G.u_r^2 > 0.25
Q2: SELECT * FROM Galaxy AS G1, Galaxy AS G2
    WHERE G1.OBJ_ID < G2.OBJ_ID
      AND |(G1.u-G1.g)-(G2.u-G2.g)| < 0.05
      AND |(G1.g-G1.r)-(G2.g-G2.r)| < 0.05
      AND (G1.rowc-G2.rowc)^2 + (G1.colc-G2.colc)^2 < 1E4
• Continuous uncertain data
• Complex selection and join predicates
• Return answers of high confidence efficiently

Previously Proposed Data Model
Gaussian Mixture Models (GMMs) for continuous uncertain attributes
X ~ 0.5 N(-6, 4) + 0.5 N(6, 8)
• Flexible
• Succinct
• Computational efficiency
Tuple model: columns Object_ID, Speed, X, Y, TEP; example tuple MA123456 with TEP 0.7
[Tran et al. PODS: A New Model and Processing Algorithms for Uncertain Data Streams. SIGMOD 2010;
 Tran et al. Conditioning and Aggregating Uncertain Data Streams: Going Beyond Expectations. PVLDB 2010]

Scope of Problem
Select-Project-Join (SPJ) queries with threshold λ
• Continuous uncertain data: Gaussian Mixture Models (GMMs)
• Results with tuple existence probability (TEP) > λ
Example (λ = 0.7): SELECT * FROM Galaxy AS G WHERE G.r > 22
  Input tuples (r is a distribution): id 1, TEP 1; id 2, TEP 1
  After the selection: id 1, TEP 0.8; id 2, TEP 0.5
Probabilistic threshold query processing and optimization
• Avoid expensive operations for non-viable tuples
• Find efficient plans based on predicates and distributions

Outline
• Motivation
• Optimize Probabilistic Threshold Selections
• Optimize Probabilistic Threshold Joins
• Per-tuple Based Planning and Execution
• Evaluation

Probabilistic Threshold Selections
SELECT * FROM Galaxy AS G WHERE G.q_r^2 + G.u_r^2 < 0.25
• Selection condition θ: G.q_r^2 + G.u_r^2 < 0.25
• Continuous uncertain attributes: S = {q_r, u_r}
• Selection region Rθ (a region in the (q_r, u_r) plane)
Given a tuple with distribution f, the probability to satisfy θ:
$\Pr[R_\theta] = \int_{R_\theta} f(x)\,dx$
Return tuples with TEP > λ (here λ = 0.8)

Probabilistic Threshold Selections
Q2: SELECT * FROM Galaxy AS G1, Galaxy AS G2
    WHERE G1.OBJ_ID < G2.OBJ_ID
      AND |(G1.u-G1.g)-(G2.u-G2.g)| < 0.05
      AND |(G1.g-G1.r)-(G2.g-G2.r)| < 0.05
      AND (G1.rowc-G2.rowc)^2 + (G1.colc-G2.colc)^2 < 1E4
• Continuous uncertain attributes: S = {G1.u, G1.g, G1.r, G1.rowc, G1.colc, G2.u, G2.g, G2.r, G2.rowc, G2.colc}
• Selection condition θ, selection region Rθ
Given a tuple with distribution f, the probability to satisfy θ:
$\Pr[R_\theta] = \int_{R_\theta} f(x)\,dx$
Return tuples with TEP > λ (here λ = 0.8)
A high-dimensional integral for each tuple!
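For a single uncertain attribute modeled as a GMM, a one-sided predicate such as G.r > 22 needs no numerical integration, because the probability is a weighted sum of Gaussian tail probabilities. The following is a minimal sketch under stated assumptions: the function names, the (weight, mean, standard deviation) encoding of mixture components, the sample tuples, and the rule that the resulting TEP is the prior TEP multiplied by the predicate probability are all illustrative choices, not taken from the slides.

```python
import math

def normal_cdf(z):
    """Standard normal CDF computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gmm_prob_greater(components, threshold):
    """Pr[X > threshold] for a 1-D Gaussian mixture.

    components: list of (weight, mean, std) triples (assumed encoding).
    """
    return sum(w * (1.0 - normal_cdf((threshold - mu) / sd))
               for w, mu, sd in components)

def threshold_select(tuples, threshold, lam):
    """Return (id, tep) pairs whose TEP after the selection exceeds lam.

    tuples: list of (id, prior_tep, components).
    Assumption: new TEP = prior_tep * Pr[X > threshold].
    """
    result = []
    for tid, prior_tep, components in tuples:
        tep = prior_tep * gmm_prob_greater(components, threshold)
        if tep > lam:
            result.append((tid, tep))
    return result

# Hypothetical tuples: tuple 1 is a single Gaussian, tuple 2 a two-component GMM.
galaxy = [
    (1, 1.0, [(1.0, 23.0, 1.0)]),
    (2, 1.0, [(0.5, 21.0, 0.5), (0.5, 23.0, 0.5)]),
]
print(threshold_select(galaxy, threshold=22.0, lam=0.7))
```

With these made-up inputs, tuple 1 passes the threshold while tuple 2, whose mixture is symmetric around 22 and therefore has probability 0.5, is dropped. The multi-attribute predicates in Q1 and Q2 have no such closed form, which is why the slides turn to filters and dimensionality reduction next.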
Applying Fast Filters to Avoid Integrals
Derive an upper bound (Ũ) for the integral at a low cost
• If Ũ < λ, filter tuples without computing integrals
• Otherwise, still integrate to compute the true probability
A general approach to derive an upper bound
• Given a tuple X, define a (multi-dim) Chebyshev region Rλ(X)
• Test the overlap of Rλ(X) with the predicate region Rθ
• If Rλ(X) and Rθ are disjoint, filter the tuple
A geometric intersection problem
• In general a constrained optimization problem; techniques such as Lagrange multipliers can be used
• Fast filters for common predicates
[Figure: predicate region Rθ for |u| < 0.2 and |g| < 0.2, axes u and g]

Reducing Dimensionality of Integration
Linear transformation (LT): let $X_n \sim N(\mu, \Sigma)$ and $Y = B_{m \times n} X + b_{m \times 1}$; then $Y_m \sim N(B\mu + b,\; B \Sigma B^T)$
σθ: n-dim space
• region: $R_\theta$
• distribution: $f_X(x)$
• integral: $\Pr[R_\theta(x)] = \int_{R_\theta} f_X(x)\,dx$
Applying Y = BX + b gives
σθ': m-dim space
• region: $R'_\theta = \{\, y \mid y = Bx + b,\ x \in R_\theta \,\}$
• distribution: $f_Y(y)$
• integral: $\Pr[R'_\theta(y)] = \int_{R'_\theta} f_Y(y)\,dy$
An algorithm to find a transformation matrix $B_{m \times n}$ with m ≤ n
• If m < n, LT helps to reduce dimensionality
• If m = n, LT does not help

Probabilistic Threshold Joins
A probabilistic threshold join of relations R and S with join condition θ and threshold λ returns the tuple pairs whose probability of satisfying θ exceeds λ
True match: a tuple pair (r, s) such that $\Pr[\theta(r, s)] > \lambda$
Large numbers of intermediate tuples!
Key idea: filtered cross-product using indexes
• For each tuple r, the index returns a subset of S to pair with r
• The (r, s) pairs returned by the index include all true matches
• The index test is a necessary condition for $\Pr[\theta(r, s)] > \lambda$
• "Tight" enough; a sufficient and necessary condition if possible

Designing an Index
Build an index on S for a < R.A - S.A < b
Deterministic case
• Search key: S.A
• Query region: instantiate with a deterministic value of R.A. E.g. when R.A = 5, 5-b < S.A < 5-a
Probabilistic case: a distribution instead of a deterministic value!
• Use a necessary condition for $\Pr[a < X_r - X_s < b] > \lambda$
• Search key: quantities concerning S
• Query region: instantiate with quantities concerning R

Band Join of GMMs (a < R.A - S.A < b)
r.A: $X_r$ with mean $\mu_r$ and variance $\sigma_r^2$; s.A: $X_s$ with mean $\mu_s$ and variance $\sigma_s^2$
Theorem 1: $Z = X_r - X_s$ follows a … GMM with $\mu_z = \mu_r - \mu_s$ and $\sigma_z^2 = \sigma_r^2 + \sigma_s^2$, and
$\Pr[a < Z < b] > \lambda \;\Rightarrow\; \mu_z - \sigma_z\sqrt{\tfrac{1-\lambda}{\lambda}} < b \ \text{ and } \ \mu_z + \sigma_z\sqrt{\tfrac{1-\lambda}{\lambda}} > a$
Necessary condition for $\Pr[a < X_r - X_s < b] > \lambda$:
$\mu_r - \mu_s - \sqrt{\tfrac{1-\lambda}{\lambda}}\,\sqrt{\sigma_r^2 + \sigma_s^2} < b$
$\mu_r - \mu_s + \sqrt{\tfrac{1-\lambda}{\lambda}}\,\sqrt{\sigma_r^2 + \sigma_s^2} > a$
Search key: $(x, y) = (\mu_s, \sigma_s^2)$
Query region: the necessary condition with $(\mu_s, \sigma_s^2)$ replaced by $(x, y)$:
$\mu_r - x - \sqrt{\tfrac{1-\lambda}{\lambda}}\,\sqrt{\sigma_r^2 + y} < b$
$\mu_r - x + \sqrt{\tfrac{1-\lambda}{\lambda}}\,\sqrt{\sigma_r^2 + y} > a$
Overlap test of $R_{Q1}$ and an index rectangle $R_I = [x_1, x_2; y_1, y_2]$: $R_I$ overlaps with $R_{Q1}$ if its upper left vertex $(x_1, y_2)$ is in $R_{Q1}$
[Figure: query region in the (x, y) plane with index rectangles R1 to R7; the point μr - a is marked on the x axis]

Band Join of Gaussians (a < R.A - S.A < b)
Gaussian properties offer a sufficient and necessary condition
Theorem 2: Given $Z \sim N(\mu, \sigma^2)$, $\Pr[a < Z < b] > \lambda$ iff there exists an $\alpha \in (0, 1-\lambda)$ such that
$a - \Phi^{-1}(\alpha)\,\sigma < \mu < b - \Phi^{-1}(\lambda + \alpha)\,\sigma$
where $\Phi^{-1}$ is the inverse of the standard normal cdf
• $Z = X_r - X_s$
• Search key: $(x, y) = (\mu_s, \sigma_s^2)$
• Query region: expressed in $(x', y') = (\mu_r - x,\ \sigma_r^2 + y)$
• Overlap test: requires math derivation; can be implemented efficiently

Query Planning
Logical operators: selection, projection (π), join
Physical operators
• Fast filters based on inequalities
• Filtered cross-product using indexes
• Exact selection using integrals (with LT)
How to arrange operators to get an efficient plan?
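The band-join conditions on the preceding slides can be sketched for the single-Gaussian case as a cheap necessary-condition filter (Theorem 1) followed by an exact probability check. This is an illustrative sketch only: it assumes independent Gaussian attributes (so variances add), uses a nested loop in place of the index, and all function names and sample values are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal CDF computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def band_probability(mu_r, var_r, mu_s, var_s, a, b):
    """Exact Pr[a < X_r - X_s < b], assuming independent Gaussians."""
    mu = mu_r - mu_s
    sigma = math.sqrt(var_r + var_s)
    return normal_cdf((b - mu) / sigma) - normal_cdf((a - mu) / sigma)

def passes_necessary_condition(mu_r, var_r, mu_s, var_s, a, b, lam):
    """Theorem 1 filter: False means the pair cannot be a true match."""
    mu = mu_r - mu_s
    sigma = math.sqrt(var_r + var_s)
    c = math.sqrt((1.0 - lam) / lam)
    return (mu - c * sigma < b) and (mu + c * sigma > a)

def band_join(r_tuples, s_tuples, a, b, lam):
    """Nested-loop stand-in for the filtered cross-product (no index here)."""
    matches = []
    for rid, mu_r, var_r in r_tuples:
        for sid, mu_s, var_s in s_tuples:
            if not passes_necessary_condition(mu_r, var_r, mu_s, var_s, a, b, lam):
                continue  # pruned without computing the exact probability
            if band_probability(mu_r, var_r, mu_s, var_s, a, b) > lam:
                matches.append((rid, sid))
    return matches

# Hypothetical tuples as (id, mean, variance).
R = [("r1", 5.0, 1.0)]
S = [("s1", 5.2, 0.5), ("s2", 100.0, 0.5)]
print(band_join(R, S, a=-3.0, b=3.0, lam=0.7))
```

In this example the pair (r1, s2) is rejected by the filter alone, while (r1, s1) passes the filter and the exact check. An index over (μs, σs²) would apply the same condition to whole rectangles of S tuples instead of one pair at a time.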
Per-tuple Based Planning
Consider both selectivity and cost, like the traditional planner
Differences
• Exact selections are expensive due to the use of integrals
• Selectivity should be defined on a per-tuple basis
=> The optimal order varies on a per-tuple basis
Q1: SELECT * FROM Galaxy WHERE r < 24 AND q_r^2 + u_r^2 > 0.25
    (θ1: r < 24; θ2: q_r^2 + u_r^2 > 0.25)
id  r           q_r        u_r           selectivity of θ1  selectivity of θ2
1   N(27, 2.2)  N(1, 2.2)  N(0.1, 1.1)   0.08               0.95
2   N(21, 0.1)  N(0, 0.1)  N(-0.1, 0.1)  1                  0.0002
Optimal plan for t1: θ1 → θ2
Optimal plan for t2: θ2 → θ1

Tuple-based Query Planning and Execution
Tuple t1 from R needs to go through three selection predicates and five join predicates
• Step 1: Estimate the selectivities and rank selection predicates
• Step 2: Execute filters first, then exact selections
• Step 3: Choose a relation to join with
• Step 4: (Filtered) cross-product
To-process tuple pool: t1, t2, t3, t4
Predicates on R  σθ1  σθ2  σθ3
Est. cost        100  300  10^4
Selectivity      0.8  0.2  0.1
Rank             2    1    3
Join R with S or T
Predicate    θ4   θ5   θ6   θ7    θ8
Est. cost    500  300  100  10^4  50
Has index    Y    Y    N    Y     N
#candidates (exponent formatting lost in extraction): 10 4 105 1 102
A ✓ in the Choose row marks the selected join.

Evaluation using Data and Queries from SDSS

Fast Filters for Selections
SELECT * FROM Galaxy WHERE 100 < rowc < 100+δ AND 100 < colc < 100+δ (λ = 0.7)
General filter vs. exact integration
Data characteristics
• Gaussians (from SDSS)
Parameters
• δ: predicate range
• λ: probability threshold
Metrics
• Time per tuple
Results
• Without filters, constant high cost for all ranges tested
• With filters, per-tuple cost is very low for small predicate ranges
• More improvement for larger λ values tested

Indexes for Band Joins (stream)
SELECT * FROM Galaxy AS R, Galaxy AS S WHERE |R.u - S.u| < δ (λ = 0.7, W = 500)
[Figures: xbound vs GaussJoin in filtering power; xbound vs GaussJoin in efficiency]
Xbound join index [R. Cheng et al. VLDB 2004 & CIKM 2006]
• Given a distribution f and [l, u], store x% quantiles from both ends
• A loose necessary condition for true matches
Results
• Our index for Gaussians returns exactly the true match set
• Xbound returns more candidates
• Our index significantly outperforms xbound in efficiency

Tuple Based Planning and Execution
Q1: SELECT * FROM Galaxy AS G WHERE G.r < δ1 AND G.q_r^2 + G.u_r^2 > δ2^2
    (θ1: G.r < δ1; θ2: G.q_r^2 + G.u_r^2 > δ2^2)
Dynamic query planning
• Rank predicates for each tuple
Static query planning [Y. Qi et al. SIGMOD 2010]
• A fixed plan for each query based on the selectivities of predicates over the entire data set
Optimal query planning
• Generate the best plan for each tuple offline and load it into memory before execution
δ1  δ2   static order  static time (ms)  dynamic time (ms)  performance gain  optimal time (ms)
20  0.2  [1 2]         0.6               0.181              70%               0.177
20  0.5  [1 2]         0.6               0.068              89%               0.067
20  1    [2 1]         9.6               0.050              99%               0.048
22  0.2  [2 1]         18.2              7.216              60%               7.007
22  0.5  [2 1]         13.9              1.515              89%               1.482
22  1    [2 1]         9.6               0.351              96%               0.348
24  0.2  [2 1]         18.2              15.613             14%               15.287
24  0.5  [2 1]         14.4              6.390              56%               6.334
24  1    [2 1]         9.6               2.264              76%               2.236
• Over 50% gains in most cases
• Very close to the optimal in all cases

Conclusions
Optimize probabilistic threshold selections
• Fast filters to avoid integrals
• Reducing dimensionality of integration by linear transformation
Optimize probabilistic threshold joins
• Filtered cross-product using new indexes
Dynamic, per-tuple based planning
Evaluation
• Significant performance gains over the state-of-the-art indexing technique and query optimizer
Future work
• Extend to a larger class of queries including group-by aggregates
• Support user-defined functions
• Query optimization with correlated tuples

Thank you! Q&A
http://claro.cs.umass.edu/
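Backup: a sketch of per-tuple predicate ranking
The slides do not state the ranking formula used by the planner. The sketch below assumes the classic predicate-ordering metric (selectivity - 1) / cost, which happens to reproduce the rank 2, 1, 3 shown for σθ1, σθ2, σθ3 under Tuple-based Query Planning and Execution. The function names and the per-predicate costs in the second example are hypothetical.

```python
def rank_metric(selectivity, cost):
    """Assumed ordering metric: a more negative value means 'apply earlier'."""
    return (selectivity - 1.0) / cost

def order_predicates(estimates):
    """Order predicate names for ONE tuple.

    estimates: dict mapping predicate name -> (selectivity, cost), where the
    selectivity is estimated for this tuple rather than for the whole table.
    """
    return sorted(estimates, key=lambda name: rank_metric(*estimates[name]))

# Selectivities and costs from the planning slide for tuple t1 of R.
slide_example = {
    "theta1": (0.8, 100.0),
    "theta2": (0.2, 300.0),
    "theta3": (0.1, 1e4),
}
print(order_predicates(slide_example))

# Per-tuple selectivities from the Q1 example; the costs (1 and 10) are made up.
plan_t1 = order_predicates({"theta1": (0.08, 1.0), "theta2": (0.95, 10.0)})
plan_t2 = order_predicates({"theta1": (1.0, 1.0), "theta2": (0.0002, 10.0)})
print(plan_t1, plan_t2)
```

Because the selectivities differ per tuple, the same query yields the order θ1, θ2 for t1 and θ2, θ1 for t2 under these assumed costs, which is the behavior a single static plan cannot express.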