Optimizing Probabilistic Query Processing on
Continuous Uncertain Data
Liping Peng
Yanlei Diao
Anna Liu
VLDB 2011
Seattle WA, US
University of Massachusetts Amherst · Department of Computer Science
Applications of Uncertain Data Management
Motivating Application – Sloan Digital Sky Survey
name                                       type    description
OBJ_ID                                     bigint  SDSS identifier
…                                          …
(rowc, rowc_err)                           real    (row center position, error term)
(colc, colc_err)                           real    (column center position, error term)
(q_u, qErr_u)                              real    (Stokes Q parameter, error term)
(u_u, uErr_u)                              real    (Stokes U parameter, error term)
(ra, dec, ra_err, dec_err, ra_dec_corr)    real    (right ascension, declination, error in ra, error in dec, ra/dec correlation)
…                                          …

Q1: SELECT *
    FROM Galaxy AS G
    WHERE G.r < 22
    AND G.q_r^2 + G.u_r^2 > 0.25

Q2: SELECT *
    FROM Galaxy AS G1, Galaxy AS G2
    WHERE G1.OBJ_ID < G2.OBJ_ID
    AND |(G1.u-G1.g)-(G2.u-G2.g)| < 0.05
    AND |(G1.g-G1.r)-(G2.g-G2.r)| < 0.05
    AND (G1.rowc-G2.rowc)^2 + (G1.colc-G2.colc)^2 < 1E4
Continuous uncertain data
Complex selection and join predicates
Return answers of high confidence efficiently
Previously Proposed Data Model
Gaussian Mixture Models (GMMs) for continuous uncertain attributes
X ~ 0.5 N(-6, 4) + 0.5 N(6, 8)
• Flexible
• Succinct
• Computationally efficient
Tuple model: Object_ID, Speed, X, Y, TEP (e.g. Object_ID = MA123456, TEP = 0.7)
[Tran et al. PODS: A New Model and Processing Algorithms for Uncertain Data Streams. SIGMOD 2010
Tran et al. Conditioning and Aggregating Uncertain Data Streams: Going Beyond Expectations. PVLDB 2010]
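To make the GMM representation concrete, the following minimal sketch evaluates the density and cumulative probability of the example X ~ 0.5 N(-6, 4) + 0.5 N(6, 8). It assumes the second parameter of each component is a variance, which the slide does not state; `gmm_pdf`, `gmm_cdf` and `gmm_interval_prob` are hypothetical helper names, not part of the system described here.

```python
from math import sqrt
from statistics import NormalDist

# Each component is (weight, mean, variance).
# Treating the second parameter as a variance is an assumption.
EXAMPLE_GMM = [(0.5, -6.0, 4.0), (0.5, 6.0, 8.0)]

def gmm_pdf(gmm, x):
    """Density of the mixture at x: weighted sum of component densities."""
    return sum(w * NormalDist(mu, sqrt(var)).pdf(x) for w, mu, var in gmm)

def gmm_cdf(gmm, x):
    """Pr[X <= x]: weighted sum of component cdfs."""
    return sum(w * NormalDist(mu, sqrt(var)).cdf(x) for w, mu, var in gmm)

def gmm_interval_prob(gmm, lo, hi):
    """Pr[lo < X < hi] for a one-dimensional mixture."""
    return gmm_cdf(gmm, hi) - gmm_cdf(gmm, lo)
```

A one-dimensional range predicate on a GMM attribute therefore needs only a few cdf evaluations, one per mixture component.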
Scope of Problem
Probabilistic threshold query processing and optimization
• Avoid expensive operations for non-viable tuples
• Find efficient plans based on predicates and distributions
Input: continuous uncertain data, modeled by Gaussian Mixture Models (GMMs)
id   r       TEP
1    (GMM)   1
2    (GMM)   1

Query: Select-Project-Join (SPJ) queries with threshold λ
SELECT *
FROM Galaxy AS G
WHERE G.r > 22      (λ=0.7)

Output: results with tuple existence probability (TEP) > λ
id   r       TEP
1    (GMM)   0.8
2    (GMM)   0.5    (below λ=0.7)
Outline
 Motivation
 Optimize Probabilistic Threshold Selections
 Optimize Probabilistic Threshold Joins
 Per-tuple Based Planning and Execution
 Evaluation
Probabilistic Threshold Selections
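SELECT *
FROM Galaxy AS G
WHERE G.q_r^2 + G.u_r^2 < 0.25
Selection condition θ: G.q_r^2 + G.u_r^2 < 0.25
Continuous uncertain attributes: S={q_r, u_r}
Selection region Rθ: the region of the (q_r, u_r) plane that satisfies θ
Given a tuple with distribution f, the probability to satisfy θ:
$\Pr[R_\theta] = \int_{R_\theta} f(x)\,dx$
Return tuples with TEP>0.8 (λ), i.e. $\int_{R_\theta} f(x)\,dx > \lambda$
[Figure: distribution f and selection region Rθ over the axes q_r and u_r]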
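As a rough illustration of what evaluating this probability involves, the sketch below estimates the integral of f over Rθ by Monte Carlo sampling for the predicate q_r^2 + u_r^2 < 0.25. Sampling is a stand-in chosen for brevity, not the integration method used by the authors, and the tuple is assumed to have independent Gaussian attributes; all function names are hypothetical.

```python
import random

def selection_probability(mean, std, predicate, samples=20000, seed=0):
    """Estimate Pr[predicate(X)] for X with independent Gaussian attributes.

    mean/std are per-attribute sequences; independence is an assumption.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        point = [rng.gauss(m, s) for m, s in zip(mean, std)]
        if predicate(point):
            hits += 1
    return hits / samples

def theta(point):
    """Selection condition from the slide: q_r^2 + u_r^2 < 0.25."""
    q_r, u_r = point
    return q_r ** 2 + u_r ** 2 < 0.25

def passes_threshold(mean, std, lam=0.8):
    """Keep the tuple only if its estimated TEP exceeds lambda."""
    return selection_probability(mean, std, theta) > lam
```

Each call draws thousands of samples for a single tuple, which gives a sense of why per-tuple integration is the expensive step.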
Probabilistic Threshold Selections
Q2: SELECT * FROM Galaxy AS G1, Galaxy AS G2
WHERE G1.OBJ_ID < G2.OBJ_ID
AND |(G1.u-G1.g)-(G2.u-G2.g)| < 0.05
AND |(G1.g-G1.r)-(G2.g-G2.r)| < 0.05
AND (G1.rowc-G2.rowc)^2+(G1.colc-G2.colc)^2 < 1E4
Continuous uncertain attributes:
S={G1.u, G1.g, G1.r, G1.rowc, G1.colc,
G2.u, G2.g, G2.r, G2.rowc, G2.colc}
Selection condition θ, selection region Rθ
Given a tuple with distribution f, the probability to satisfy θ:
$\Pr[R_\theta] = \int_{R_\theta} f(x)\,dx$
Return tuples with TEP>0.8 (λ), i.e. $\int_{R_\theta} f(x)\,dx > \lambda$
A high-dimensional integral for each tuple!
Applying Fast Filters to Avoid Integrals
Derive an upper bound (Ũ) for the integral at a low cost
• If Ũ<λ, filter tuples without computing integrals
• Otherwise, still integrate to compute the true probability
 A general approach to derive an upper bound
 Given a tuple X, define a (multi-dim) Chebyshev region Rλ(X)
 Test the overlap of Rλ(X) with predicate region Rθ
• If Rλ(X) and Rθ are disjoint, filter the tuple
• A geometric intersection problem: constrained optimization in general; techniques like Lagrange multipliers can be used
[Figure: Chebyshev region Rλ(X) and predicate region Rθ for |u|<0.2 and |g|<0.2, axes u and g]
 Fast filters for common predicates
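The filtering idea can be sketched for one simple case: a box predicate over attributes with known means and standard deviations. The sketch assumes independent attributes and uses the multivariate Chebyshev bound Pr[Σ((X_i-μ_i)/σ_i)^2 ≥ t] ≤ n/t, taking the Chebyshev region to be the ellipsoid with t = n/λ. This is one possible instantiation, not necessarily the region definition used in the talk, and `chebyshev_filter_box` is a hypothetical name.

```python
def chebyshev_filter_box(mu, sigma, lo, hi, lam):
    """Return True if the tuple can be filtered without integration.

    The predicate region is the box lo[i] < x[i] < hi[i]. If every point of
    the box has normalized squared distance >= n / lam from the mean, the box
    is disjoint from the Chebyshev ellipsoid and Pr[box] <= lam.
    Assumes independent attributes (diagonal covariance).
    """
    n = len(mu)
    d2 = 0.0
    for m, s, l, h in zip(mu, sigma, lo, hi):
        nearest = min(max(m, l), h)      # closest point of [l, h] to the mean
        d2 += ((nearest - m) / s) ** 2   # normalized squared distance
    return d2 >= n / lam
```

For the predicate |u|<0.2 and |g|<0.2, a tuple centered at (5, 5) with small standard deviations is filtered immediately, while a tuple centered inside the box is passed on to exact integration.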
Reducing Dimensionality of Integration
Linear transformation (LT):
Let $X \sim N(\mu, \Sigma)$ be n-dimensional and $Y = B_{m\times n}X + b_{m\times 1}$; then $Y \sim N(B\mu + b,\ B\Sigma B^{T})$ is m-dimensional
σθ : n-dim space
• region: Rθ
• distribution: $f_X(x)$
• integral: $\Pr[R_\theta(x)] = \int_{R_\theta} f_X(x)\,dx$
Y=BX+b
σθ’ : m-dim space
• region: $R'_\theta = \{\,y \mid y = Bx + b,\ x \in R_\theta\,\}$
• distribution: $f_Y(y)$
• integral: $\Pr[R'_\theta(y)] = \int_{R'_\theta} f_Y(y)\,dy$
An algorithm to find a transformation matrix $B_{m\times n}$
 m≤n
 if m<n, LT helps to reduce dimensionality
 if m=n, LT does not help
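The linear transformation property can be sketched on a predicate such as |(G1.u-G1.g)-(G2.u-G2.g)| < 0.05, where four uncertain attributes collapse into one. The code below is a minimal illustration in plain Python; the attribute values, the independence of attributes, and the helper name `transform_gaussian` are assumptions made for the example, and it does not reproduce the authors' algorithm for finding B.

```python
from math import sqrt
from statistics import NormalDist

def transform_gaussian(mu, cov, B, b):
    """Mean and covariance of Y = B X + b for X ~ N(mu, cov)."""
    m, n = len(B), len(mu)
    new_mu = [sum(B[i][k] * mu[k] for k in range(n)) + b[i] for i in range(m)]
    # First B * cov (m x n), then (B * cov) * B^T (m x m).
    b_cov = [[sum(B[i][k] * cov[k][j] for k in range(n)) for j in range(n)]
             for i in range(m)]
    new_cov = [[sum(b_cov[i][k] * B[j][k] for k in range(n)) for j in range(m)]
               for i in range(m)]
    return new_mu, new_cov

# Illustrative tuple pair, attributes ordered (G1.u, G1.g, G2.u, G2.g),
# independent with variance 0.01 each (made-up values).
mu = [1.00, 0.80, 0.90, 0.75]
cov = [[0.01 if i == j else 0.0 for j in range(4)] for i in range(4)]
B = [[1.0, -1.0, -1.0, 1.0]]   # Y = (G1.u - G1.g) - (G2.u - G2.g)
b = [0.0]

mu_y, cov_y = transform_gaussian(mu, cov, B, b)
y = NormalDist(mu_y[0], sqrt(cov_y[0][0]))
prob = y.cdf(0.05) - y.cdf(-0.05)   # Pr[|Y| < 0.05], a 1-dim integral
```

Here n = 4 and m = 1, so the four-dimensional integral reduces to two evaluations of a one-dimensional cdf.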
Probabilistic Threshold Joins
A probabilistic threshold join of relations R and S with join condition θ and threshold λ is:
$\{(r, s) \mid r \in R,\ s \in S,\ \Pr[\theta(r, s)] > \lambda\}$
True match: tuple pair (r,s) such that $\Pr[\theta(r, s)] > \lambda$
Large numbers of intermediate tuples!
Key idea: filtered cross-product using indexes
• For each tuple r, the index returns a subset of S to pair with r
• (r,s) pairs returned by the index include all true matches
• The index tests a necessary condition for $\Pr[\theta(r, s)] > \lambda$
• “Tight” enough: a sufficient and necessary condition if possible
Designing an Index
Build an index on S for a<R.A-S.A<b
Deterministic
• Search key: S.A
• Query region: instantiate with a deterministic value of R.A
  E.g. when R.A=5, 5-b<S.A<5-a
Probabilistic: a distribution instead of a deterministic value!
• Search key: quantities concerning S
• Query region: a necessary condition for $\Pr[a < X_r - X_s < b] > \lambda$, instantiated with quantities concerning R
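The deterministic case can be sketched with a sorted array standing in for the index on S.A. This is only an illustration of the query-region idea using Python's standard `bisect` module; `band_candidates` is a hypothetical name and a real system would use a disk-based index.

```python
import bisect

def band_candidates(sorted_keys, r_a, a, b):
    """Return the S.A values satisfying a < r_a - S.A < b.

    Equivalent to r_a - b < S.A < r_a - a, answered by two binary
    searches over the sorted search keys.
    """
    lo = bisect.bisect_right(sorted_keys, r_a - b)  # first key > r_a - b
    hi = bisect.bisect_left(sorted_keys, r_a - a)   # first key >= r_a - a
    return sorted_keys[lo:hi]
```

With keys 1 through 7, R.A=5, a=1 and b=3, the query region is 2<S.A<4 and only the key 3 is returned. In the probabilistic case the single value R.A is replaced by a distribution, which is why the query region has to be defined over summary quantities instead.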
Band Join of GMMs (a<R.A-S.A<b)
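r.A: $X_r$ with mean $\mu_r$ and variance $\sigma_r^2$
s.A: $X_s$ with mean $\mu_s$ and variance $\sigma_s^2$
Theorem 1: $Z = X_r - X_s$ follows a … GMM with $\mu_z = \mu_r - \mu_s$ and $\sigma_z^2 = \sigma_r^2 + \sigma_s^2$, and
$$\Pr[a < Z < b] > \lambda \;\Rightarrow\; \begin{cases} \mu_z - \sigma_z\sqrt{\dfrac{1-\lambda}{\lambda}} < b \\[2mm] \mu_z + \sigma_z\sqrt{\dfrac{1-\lambda}{\lambda}} > a \end{cases}$$
 Necessary condition:
$$\Pr[a < X_r - X_s < b] > \lambda \;\Rightarrow\; \begin{cases} \mu_r - \mu_s - \sqrt{\dfrac{1-\lambda}{\lambda}}\,\sqrt{\sigma_r^2 + \sigma_s^2} < b \\[2mm] \mu_r - \mu_s + \sqrt{\dfrac{1-\lambda}{\lambda}}\,\sqrt{\sigma_r^2 + \sigma_s^2} > a \end{cases}$$
 Search key: $(x, y) = (\mu_s, \sigma_s^2)$
 Query region:
$$\begin{cases} \mu_r - x - \sqrt{\dfrac{1-\lambda}{\lambda}}\,\sqrt{\sigma_r^2 + y} < b \\[2mm] \mu_r - x + \sqrt{\dfrac{1-\lambda}{\lambda}}\,\sqrt{\sigma_r^2 + y} > a \end{cases}$$
 Overlap test of RQ1 and RI [x1,x2;y1,y2]:
RI overlaps with RQ1 if its upper left vertex (x1,y2) is in RQ1
[Figure: index rectangles R1 to R7 in the (x, y) search-key space with the query region; μr-a marked on the x axis]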
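The necessary condition only needs the means and variances of the two attributes, so it can be checked without touching the full distributions. The sketch below is a direct transcription of the two inequalities; `may_match` is a hypothetical name, and the check is a filter only, so a pair that passes may still fail the exact probability test.

```python
import math

def may_match(mu_r, var_r, mu_s, var_s, a, b, lam):
    """Necessary condition for Pr[a < X_r - X_s < b] > lam.

    Returns False only when the pair certainly cannot be a true match.
    Uses mu_z = mu_r - mu_s and sigma_z^2 = var_r + var_s.
    """
    k = math.sqrt((1.0 - lam) / lam)
    mu_z = mu_r - mu_s
    sd_z = math.sqrt(var_r + var_s)
    return (mu_z - k * sd_z < b) and (mu_z + k * sd_z > a)
```

For example, with λ=0.7 and the band -1<R.A-S.A<1, a pair whose means differ by 5 and whose variances are both 1 is pruned, while a pair with equal means is kept as a candidate.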
Band Join of Gaussians (a<R.A-S.A<b)
Gaussian properties offer a sufficient and necessary condition
Theorem 2:
Given $Z \sim N(\mu, \sigma^2)$, $\Pr[a < Z < b] > \lambda$ iff there exists an $\alpha \in (0, 1-\lambda)$ such that
$$a - \Phi^{-1}(\alpha)\,\sigma < \mu < b - \Phi^{-1}(\lambda + \alpha)\,\sigma$$
where $\Phi^{-1}$ is the inverse of the standard normal cdf
 Search key: $(x, y) = (\mu_s, \sigma_s^2)$
 Query region: defined on $Z = X_r - X_s$ with $(x', y') = (\mu_r - x,\ \sigma_r^2 + y)$
 Overlap test: Requires math derivation; can be implemented efficiently
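Theorem 2 can be checked numerically for a single Gaussian Z. The sketch below computes the band probability directly and, when it exceeds λ, constructs one α that satisfies the theorem's inequality. It uses `statistics.NormalDist` from the Python standard library; the function names are hypothetical and this is not the index overlap test mentioned on the slide.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal, provides cdf and inv_cdf

def band_probability(mu, sigma, a, b):
    """Pr[a < Z < b] for Z ~ N(mu, sigma^2)."""
    z = NormalDist(mu, sigma)
    return z.cdf(b) - z.cdf(a)

def theorem2_witness(mu, sigma, a, b, lam):
    """Return an alpha in (0, 1 - lam) satisfying Theorem 2, or None.

    Any alpha strictly between Phi((a-mu)/sigma) and Phi((b-mu)/sigma) - lam
    works; the midpoint of that interval is returned.
    """
    lo = STD.cdf((a - mu) / sigma)
    hi = STD.cdf((b - mu) / sigma) - lam
    if lo >= hi:
        return None
    return (lo + hi) / 2.0

def theorem2_holds(mu, sigma, a, b, lam, alpha):
    """Evaluate a - inv(alpha)*sigma < mu < b - inv(lam+alpha)*sigma."""
    left = a - STD.inv_cdf(alpha) * sigma
    right = b - STD.inv_cdf(lam + alpha) * sigma
    return left < mu < right
```

A witness exists exactly when the band probability exceeds λ, which is what makes the condition both sufficient and necessary.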
Query Planning
Logical Operators: select, project (π), join
Physical Operators:
• Fast filters based on inequalities
• Filtered cross-product using indexes
• Exact selection using integrals (with LT)
How to arrange operators to get an efficient plan?
Per-tuple Based Planning
 Consider both selectivity and cost like the traditional planner
 Differences
• Exact selections are expensive due to the use of integrals
• Selectivity should be defined on a per-tuple basis
=> The optimal order varies on a per-tuple basis
Q1: SELECT *
    FROM Galaxy
    WHERE r < 24                    (θ1)
    AND q_r^2 + u_r^2 > 0.25        (θ2)

      Tuple Attributes                               Predicate Selectivities
id    r             q_r          u_r             θ1      θ2
1     N(27, 2.2)    N(1, 2.2)    N(0.1, 1.1)     0.08    0.95
2     N(21, 0.1)    N(0, 0.1)    N(-0.1, 0.1)    1       0.0002

Optimal plan for t1: θ1 → θ2
Optimal plan for t2: θ2 → θ1
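Per-tuple selectivity for a one-dimensional predicate such as r < 24 is just a cdf evaluation on that tuple's own distribution. The sketch below assumes the second parameter of N(·,·) in the table is a standard deviation, which the slide does not state; `selectivity_less_than` is a hypothetical helper.

```python
from statistics import NormalDist

def selectivity_less_than(mean, std, bound):
    """Pr[attribute < bound] for one tuple's Gaussian attribute."""
    return NormalDist(mean, std).cdf(bound)

# r attributes of the two example tuples (std-dev interpretation assumed).
tuples = {"t1": (27.0, 2.2), "t2": (21.0, 0.1)}
sel = {tid: selectivity_less_than(m, s, 24.0) for tid, (m, s) in tuples.items()}
```

Under this assumption t1 passes θ1 with low probability while t2 passes it almost surely, so the same predicate is highly selective for one tuple and not at all for the other.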
Tuple-based Query Planning and Execution
 Tuple t1 from R needs to go through three selection predicates and five join predicates
Step 1: Estimate selectivities and rank selection predicates
Step 2: Choose a relation to join with
Step 3: (Filtered) cross-product
Step 4: Execute the filters first, then exact selections

To-process tuple pool: t1, t2, t3, t4

Predicates on R    σθ1    σθ2    σθ3
Est. cost          100    300    10^4
Selectivity        0.8    0.2    0.1
Rank               2      1      3

Join R with: S, T
Predicate      θ4     θ5     θ6      θ7      θ8
Est. cost      500    300    100     10^4    50
Has index      Y      Y      N       Y       N
#candidates    10     4      10^5    1       10^2
Choose         ✓

selection: θ4 θ5 θ6
join: θ7 θ8
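The ranking step can be sketched with the classic rule for ordering expensive predicates: sort by cost / (1 - selectivity), ascending. The slides do not state the ranking formula, so this rule is an assumption; it does reproduce the ranks shown for the three selection predicates on R when the third cost is read as 10^4. The name `rank_predicates` is hypothetical.

```python
def rank_predicates(preds):
    """Order predicates for one tuple.

    preds is a list of (name, estimated_cost, selectivity) triples, where
    selectivity is the per-tuple probability of passing the predicate.
    Cheap predicates that are likely to reject the tuple run first.
    """
    def rank(pred):
        _, cost, sel = pred
        if sel >= 1.0:
            return float("inf")   # never rejects, so run it last
        return cost / (1.0 - sel)
    return [name for name, _, _ in sorted(preds, key=rank)]

# Costs and selectivities from the slide; 1e4 assumes "104" means 10^4.
order = rank_predicates([
    ("theta1", 100, 0.8),
    ("theta2", 300, 0.2),
    ("theta3", 1e4, 0.1),
])
```

Because the selectivities are estimated per tuple, the resulting order can differ from one tuple to the next.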
Evaluation using Data and Queries from SDSS
Fast Filters for Selections
SELECT * FROM Galaxy WHERE 100<rowc<100+δ AND 100<colc<100+δ (λ=0.7)
General filter vs. exact integration
Data Characteristics
• Gaussians (from SDSS)
Parameters
• δ: predicate range
• λ: probability threshold
Metrics
• Time per tuple
• Without filters, constant high cost for all ranges tested
• With filters, per tuple cost is very low for small predicate ranges
• More improvement for larger λ values tested
Indexes for Band Joins (stream)
SELECT * FROM Galaxy AS R, Galaxy AS S WHERE |R.u-S.u|<δ (λ=0.7, W=500)
Xbound join index [R. Cheng et al. VLDB 2004 & CIKM 2006]
• Given a distribution f and [l,u], store x% quantiles from both ends
• A loose necessary condition for true matches
[Figures: xbound vs GaussJoin in filtering power; xbound vs GaussJoin in efficiency]
• Our index for Gaussians returns exactly the true match set
• Xbound returns more candidates
• Our index significantly outperforms xbound in efficiency
Tuple Based Planning and Execution
Q1: SELECT *
    FROM Galaxy AS G
    WHERE G.r < δ1                       (θ1)
    AND G.q_r^2 + G.u_r^2 > δ2^2         (θ2)

Dynamic query planning
• Rank predicates for each tuple
Static query planning [Y. Qi et al. SIGMOD 2010]
• A fixed plan for each query based on the selectivities of predicates over the entire data set
Optimal query planning
• Generate the best plan for each tuple offline and load it into memory before execution

δ1    δ2     static order    static time (ms)    dynamic time (ms)    performance gain    optimal time (ms)
20    0.2    [1 2]           0.6                 0.181                70%                 0.177
20    0.5    [1 2]           0.6                 0.068                89%                 0.067
20    1      [2 1]           9.6                 0.050                99%                 0.048
22    0.2    [2 1]           18.2                7.216                60%                 7.007
22    0.5    [2 1]           13.9                1.515                89%                 1.482
22    1      [2 1]           9.6                 0.351                96%                 0.348
24    0.2    [2 1]           18.2                15.613               14%                 15.287
24    0.5    [2 1]           14.4                6.390                56%                 6.334
24    1      [2 1]           9.6                 2.264                76%                 2.236

• Over 50% gains in most cases
• Very close to the optimal in all cases
Conclusions
 Optimize probabilistic threshold selections
• Fast filters to avoid integrals
• Reducing dimensionality of integration by linear transformation
 Optimize probabilistic threshold joins
• Filtered cross-product using new indexes
 Dynamic, per-tuple based planning
 Evaluation
• Significant performance gains over the state-of-the-art indexing
technique and query optimizer
 Future work
• Extend to a larger class of queries including group-by aggregates
• Support user-defined functions
• Query optimization with correlated tuples
Thank you!
Q&A
Optimizing Probabilistic Query Processing on
Continuous Uncertain Data
Liping Peng Yanlei Diao Anna Liu
http://claro.cs.umass.edu/