A new class of lineage expressions over probabilistic databases computable in PTIME
SUM 2013
Batya Kenig
Avigdor Gal
Ofer Strichman
Probabilistic Databases for managing uncertain data
• A variety of data sources generate incomplete, noisy and uncertain data (sensor networks, information extraction, data integration, …).
• Probabilistic databases enable storing and querying such data.
• A lot of research in recent years: MayBMS [Cornell], Trio [Stanford], SPROUT [Oxford], PrDB [U.Md]
Tuple Independent Probabilistic Databases
𝑅     A    B    P
𝑟1    𝑎1   𝑏1   0.6
𝑟2    𝑎2   𝑏2   0.5

𝑆     B    C    P
𝑠1    𝑏1   𝑐1   0.6
𝑠2    𝑏2   𝑐1   0.5
𝑠3    𝑏2   𝑐2   0.25

𝑇     C    D    P
𝑡1    𝑐1   𝑑1   0.4
Each possible world 𝐼 is a standard database instance with probability 𝑃(𝐼)
𝑃(𝐼0) = 0.018:   𝑅 = ∅, 𝑆 = ∅, 𝑇 = ∅
𝑃(𝐼1) = 0.027:   𝑅 = {𝑟1 (𝑎1, 𝑏1)}, 𝑆 = ∅, 𝑇 = ∅
𝑃(𝐼2) = 0.0405:  𝑅 = {𝑟1 (𝑎1, 𝑏1)}, 𝑆 = {𝑠1 (𝑏1, 𝑐1)}, 𝑇 = ∅
𝑃(𝐼3) = 0.027:   𝑅 = {𝑟1 (𝑎1, 𝑏1)}, 𝑆 = {𝑠1 (𝑏1, 𝑐1)}, 𝑇 = {𝑡1 (𝑐1, 𝑑1)}
⋯
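The world probabilities above follow from tuple independence: multiply 𝑝 for every tuple that is present and 1 − 𝑝 for every tuple that is absent. A minimal Python sketch, using the probabilities from the example tables (the names `PROB`, `world_probability` and `all_worlds` are illustrative, not part of the talk):

```python
from itertools import product

# Marginal tuple probabilities, copied from the example tables R, S and T.
PROB = {"r1": 0.6, "r2": 0.5, "s1": 0.6, "s2": 0.5, "s3": 0.25, "t1": 0.4}


def world_probability(present):
    """Probability of the world that contains exactly the tuples in `present`."""
    p = 1.0
    for name, prob in PROB.items():
        # Independence: multiply p for present tuples, (1 - p) for absent ones.
        p *= prob if name in present else 1.0 - prob
    return p


def all_worlds():
    """Yield each of the 2^6 = 64 possible worlds as a frozenset of tuple names."""
    names = sorted(PROB)
    for bits in product([False, True], repeat=len(names)):
        yield frozenset(n for n, b in zip(names, bits) if b)


if __name__ == "__main__":
    # The four worlds I0..I3 shown on the slide.
    for world in [set(), {"r1"}, {"r1", "s1"}, {"r1", "s1", "t1"}]:
        print(sorted(world), round(world_probability(world), 4))
    print("total:", round(sum(world_probability(w) for w in all_worlds()), 10))
```

The probabilities of all 64 worlds sum to 1, which is a quick sanity check on the model.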
Query Semantics
• Let 𝑞 be a query evaluated against a probabilistic DB 𝐷.
• Let 𝐼𝑡 ⊆ 𝐼 be the possible worlds that return 𝑡 ∈ 𝑞[𝐷].
• Sum the probabilities of the instances that return 𝑡:
  $P(t) = \sum_{i \in I_t} P(i)$
• Goal: efficiently evaluate 𝑃(𝑞[𝐷])
– In time polynomial in |𝐷|
– Not always possible; in general #P-hard
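The semantics can be sketched directly as a brute-force sum over possible worlds. The Python below assumes the Boolean query 𝑅 ⋈𝐵 𝑆 ⋈𝐶 𝑇 over the example tables; the dictionaries and function names are illustrative. It is exponential in the number of tuples, which is exactly what the goal above wants to avoid.

```python
from itertools import product

# name -> (join attribute values, probability), from the example tables.
R = {"r1": ("a1", "b1", 0.6), "r2": ("a2", "b2", 0.5)}
S = {"s1": ("b1", "c1", 0.6), "s2": ("b2", "c1", 0.5), "s3": ("b2", "c2", 0.25)}
T = {"t1": ("c1", "d1", 0.4)}


def query_true(world):
    """Boolean query: is R join_B S join_C T non-empty in this world?"""
    return any(
        R[r][1] == S[s][0] and S[s][1] == T[t][0]
        for r in R if r in world
        for s in S if s in world
        for t in T if t in world
    )


def query_probability():
    """Sum P(i) over all worlds i in which the query returns true."""
    tuples = {**R, **S, **T}
    names = list(tuples)
    total = 0.0
    for bits in product([0, 1], repeat=len(names)):
        world = {n for n, b in zip(names, bits) if b}
        p = 1.0
        for n, b in zip(names, bits):
            p *= tuples[n][2] if b else 1.0 - tuples[n][2]
        if query_true(world):
            total += p
    return total


if __name__ == "__main__":
    print(round(query_probability(), 6))
```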
Probabilistic Inference for queries
• [Dalvi & Suciu 2004]: Conjunctive queries (without self joins) are either:
– Safe queries: have query plans that run in PTIME on all DB instances
– Unsafe queries: data complexity is #P-hard
• Example (unsafe): 𝑞: Π∅ 𝑅(𝑋) ⋈ 𝑆(𝑋, 𝑌) ⋈ 𝑇(𝑌)
• However,
– Even for unsafe queries there are DB instances that enable efficient computation
Why lineage?
Each tuple is associated with a binary random variable
𝑅     A    B    P
𝑟1    𝑎1   𝑏1   0.6
𝑟2    𝑎2   𝑏2   0.5

𝑆     B    C    P
𝑠1    𝑏1   𝑐1   0.6
𝑠2    𝑏2   𝑐1   0.5
𝑠3    𝑏2   𝑐2   0.25

𝑇     C    D    P
𝑡1    𝑐1   𝑑1   0.4

𝑞: Π∅ (𝑅 ⋈𝐵 𝑆 ⋈𝐶 𝑇)

𝑅 ⋈𝐵 𝑆    A    B    C    lineage
𝑖1        𝑎1   𝑏1   𝑐1   𝑟1 ∧ 𝑠1
𝑖2        𝑎2   𝑏2   𝑐1   𝑟2 ∧ 𝑠2
𝑖3        𝑎2   𝑏2   𝑐2   𝑟2 ∧ 𝑠3

(𝑅 ⋈𝐵 𝑆) ⋈𝐶 𝑇    A    B    C    D    lineage
𝑗1               𝑎1   𝑏1   𝑐1   𝑑1   𝑟1 ∧ 𝑠1 ∧ 𝑡1
𝑗2               𝑎2   𝑏2   𝑐1   𝑑1   𝑟2 ∧ 𝑠2 ∧ 𝑡1

Compute the probability of this formula:
(𝑟1 ∧ 𝑠1 ∧ 𝑡1) ∨ (𝑟2 ∧ 𝑠2 ∧ 𝑡1)
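A sketch of how the lineage is carried through the joins, assuming each row stores a conjunction of tuple variables next to its attribute values. The row representation and the helper `join` are illustrative choices, not the authors' implementation.

```python
# Each row is (lineage, attributes); a lineage is a tuple of tuple-variable
# names read as a conjunction.
R = [(("r1",), {"A": "a1", "B": "b1"}), (("r2",), {"A": "a2", "B": "b2"})]
S = [
    (("s1",), {"B": "b1", "C": "c1"}),
    (("s2",), {"B": "b2", "C": "c1"}),
    (("s3",), {"B": "b2", "C": "c2"}),
]
T = [(("t1",), {"C": "c1", "D": "d1"})]


def join(left, right, attr):
    """Equi-join on `attr`; the output lineage is the AND of both input lineages."""
    out = []
    for l_lin, l_row in left:
        for r_lin, r_row in right:
            if l_row[attr] == r_row[attr]:
                out.append((l_lin + r_lin, {**l_row, **r_row}))
    return out


rs = join(R, S, "B")      # rows i1, i2, i3
rst = join(rs, T, "C")    # rows j1, j2
# Projecting onto the empty attribute set ORs the lineages of all rows (a DNF).
lineage = [lin for lin, _ in rst]

if __name__ == "__main__":
    print(" OR ".join("(" + " AND ".join(c) + ")" for c in lineage))
```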
Why Lineage?
Efficient computation
• Expression in Read-Once form [Roy2011, Sen2010]
– Linear time probability computation
• 𝑃(𝑥 ∧ 𝑦) = 𝑃(𝑥) 𝑃(𝑦)   (𝑥, 𝑦 over disjoint sets of independent variables)
• 𝑃(𝑥 ∨ 𝑦) = 1 − (1 − 𝑃(𝑥))(1 − 𝑃(𝑦))
• 𝑓 = (𝑟1 ∧ 𝑠1 ∧ 𝑡1) ∨ (𝑟2 ∧ 𝑠2 ∧ 𝑡1) = 𝑡1 ∧ ((𝑟1 ∧ 𝑠1) ∨ (𝑟2 ∧ 𝑠2))
Read-once tree, evaluated bottom-up:
∧ (root): 𝑃(𝑓) = 𝑝(𝑡1) 𝑝(𝑖3)
   𝑡1
   ∨: 𝑝(𝑖3) = 1 − (1 − 𝑝(𝑖1)) ∙ (1 − 𝑝(𝑖2))
      ∧: 𝑝(𝑖1) = 𝑝(𝑠1) 𝑝(𝑟1)   (children 𝑟1, 𝑠1)
      ∧: 𝑝(𝑖2) = 𝑝(𝑠2) 𝑝(𝑟2)   (children 𝑟2, 𝑠2)
Safe plans for safe queries produce formulas in read-once form [Olteanu&Huang2008]
𝑅     A    B    P
𝑟1    𝑎1   𝑏1   0.6
𝑟2    𝑎2   𝑏2   0.5

𝑆     B    C    P
𝑠1    𝑏1   𝑐1   0.6
𝑠2    𝑏2   𝑐1   0.5
𝑠3    𝑏2   𝑐2   0.25

𝑇     C    D    P
𝑡1    𝑐1   𝑑1   0.4
𝑡2    𝑐2   𝑑2   0.3

Unsafe query 𝑄: Π∅ (𝑅 ⋈𝐵 𝑆 ⋈𝐶 𝑇)
$f = r_1 s_1 t_1 \lor r_2 s_2 t_1 \lor r_2 s_3 t_2$
Not read-once
Solutions:
• Jha&Suciu11: Compile to decision diagram (OBDD, FBDD)
• Jha&Suciu12: Exponential in pathwidth, double exponential in expression pathwidth
We will show how to compute the probability of disjoint branch lineage expressions in PTIME
Lineage as a hypergraph
$f = r_1 s_1 t_1 \lor r_2 s_2 t_1 \lor r_2 s_3 t_2$
[Figure: primal graph over the variables r1, s1, t1, r2, s2, s3, t2; the hyperedges are the clauses of f]
• In general, expanding a formula to its DNF form can lead to an exponential blowup.
• For SPJ queries without self joins, the primal graph can be generated directly from the formula [Roy 2011].
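For a lineage that is already in DNF, the hypergraph and its primal graph can be sketched as follows: every clause is a hyperedge, and the primal graph connects two variables whenever they share a clause. This is only the DNF case; the direct construction of [Roy 2011] is not reproduced here, and the function name is illustrative.

```python
from itertools import combinations

# Hyperedges: one per clause of f = r1 s1 t1 OR r2 s2 t1 OR r2 s3 t2.
CLAUSES = [{"r1", "s1", "t1"}, {"r2", "s2", "t1"}, {"r2", "s3", "t2"}]


def primal_graph(clauses):
    """Return the set of undirected edges (u, v), u < v, of the primal graph."""
    edges = set()
    for clause in clauses:
        # Every pair of variables in the same clause becomes an edge.
        for u, v in combinations(sorted(clause), 2):
            edges.add((u, v))
    return edges


if __name__ == "__main__":
    for edge in sorted(primal_graph(CLAUSES)):
        print(edge)
```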
Junction trees for lineages
• Let 𝐻(𝑉, ℰ) be a hypergraph.
• Hypergraph 𝐻(𝑉, ℰ) is acyclic iff it has a junction tree 𝑇(ℰ, A) [Beeri et al 1981]
– The junction tree property: for every 𝑣 ∈ 𝑉, the set of nodes in the tree that contain 𝑣, 𝐾𝑣 ⊆ ℰ, induces a connected subtree.
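The junction tree property can be checked mechanically. The sketch below assumes the caller passes a genuine tree (it does not verify that the edges form one) and tests, for every variable, that the nodes containing it are connected. Node ids and the function name are illustrative.

```python
def has_junction_tree_property(nodes, tree_edges):
    """nodes: id -> set of variables; tree_edges: list of (id, id) pairs of a tree."""
    variables = set().union(*nodes.values())
    for v in variables:
        holders = {n for n, vs in nodes.items() if v in vs}
        # Depth-first search that only walks through nodes containing v.
        start = next(iter(holders))
        seen = {start}
        stack = [start]
        while stack:
            cur = stack.pop()
            for a, b in tree_edges:
                for x, y in ((a, b), (b, a)):
                    if x == cur and y in holders and y not in seen:
                        seen.add(y)
                        stack.append(y)
        if seen != holders:
            return False  # the nodes containing v are not connected
    return True


# Hyperedges of f = r1 s1 t1 OR r2 s2 t1 OR r2 s3 t2.
NODES = {1: {"r1", "s1", "t1"}, 2: {"r2", "s2", "t1"}, 3: {"r2", "s3", "t2"}}

if __name__ == "__main__":
    print(has_junction_tree_property(NODES, [(1, 2), (2, 3)]))  # chain 1 - 2 - 3
    print(has_junction_tree_property(NODES, [(1, 3), (3, 2)]))  # chain 1 - 3 - 2
```

In the second tree the two nodes containing t1 are separated by a node that does not contain it, so the property fails.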
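Junction trees for lineages
$f = r_1 s_1 t_1 \lor r_2 s_2 t_1 \lor r_2 s_3 t_2$
Hypergraph: hyperedges {r1, s1, t1}, {r2, s2, t1}, {r2, s3, t2}
Junction tree: {r1, s1, t1} --[t1]-- {r2, s2, t1} --[r2]-- {r2, s3, t2}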
Background: junction trees for probabilistic inference
node {r1, s1, t1}:   F(r1, s1, t1)
separator {t1}:      F(t1)
node {r2, s2, t1}:   F(r2, s2, t1)
separator {r2}:      F(r2)
node {r2, s3, t2}:   F(r2, s3, t2)
Each node and separator stores a joint pdf
Send messages towards a given root node
• Messages are passed by multiplication of factor entries
• Once the root node has received messages from all of its neighbors, its factor holds the marginal of the joint probability distribution of the entire variable set.
(Naïve) Junction Tree Algorithm for lineage computation
$f = r_1 s_1 t_1 \lor r_2 s_2 t_1 \lor r_2 s_3 t_2$
$P(f = 1) = 1 - P(f = 0)$
Junction tree: {r1, s1, t1} --[t1]-- {r2, s2, t1} --[r2]-- {r2, s3, t2}
Factor $F(r_1, s_1, t_1)$: one entry per 0/1 assignment, with $\bar{p}_x = 1 - p_x$; the assignment that satisfies the clause gets 0:
r1 s1 t1 | P
0  0  0  | $\bar{p}_{r_1} \bar{p}_{s_1} \bar{p}_{t_1}$
0  0  1  | $\bar{p}_{r_1} \bar{p}_{s_1} p_{t_1}$
⋯
1  1  0  | $p_{r_1} p_{s_1} \bar{p}_{t_1}$
1  1  1  | 0
Messages sent towards the root {r2, s3, t2}:
• $F(t_1) = \sum_{r_1, s_1} F(r_1, s_1, t_1)$
• $F(r_2, s_2, t_1) = F(r_2, s_2, t_1) \cdot F(t_1)$
• $F(r_2) = \sum_{s_2, t_1} F(r_2, s_2, t_1)$
• $F(r_2, s_3, t_2) = F(r_2, s_3, t_2) \cdot F(r_2)$
• $P(f = 0) = \sum_{r_2, s_3, t_2} F(r_2, s_3, t_2)$
PROBLEM: The JT algorithm runs in time that is exponential in the size (number of variables) of the largest factor. → Restricted to lineage expressions that can be efficiently represented using a junction tree (i.e., low treewidth).
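The message schedule above can be sketched for the chain-shaped junction tree of this example. The code keeps a full table per node, so it is exponential in the node size, which is the problem the slide points out. Assigning each variable's prior to the first node that contains it is an assumption of this sketch; the slide does not say how priors are distributed.

```python
from itertools import product

P = {"r1": 0.6, "s1": 0.6, "t1": 0.4, "r2": 0.5, "s2": 0.5, "s3": 0.25, "t2": 0.3}
# Junction tree as a chain; the last node is the root.
CLIQUES = [("r1", "s1", "t1"), ("r2", "s2", "t1"), ("r2", "s3", "t2")]


def initial_factor(clique, owned):
    """Full table: priors of the owned variables, 0 where the clause is satisfied."""
    factor = {}
    for values in product([0, 1], repeat=len(clique)):
        weight = 0.0 if all(values) else 1.0
        for var, val in zip(clique, values):
            if var in owned:
                weight *= P[var] if val else 1.0 - P[var]
        factor[values] = weight
    return factor


def prob_f_false():
    """P(f = 0) by passing messages along the chain towards the last node."""
    seen, factors = set(), []
    for clique in CLIQUES:
        owned = set(clique) - seen  # each prior is multiplied in exactly once
        seen |= owned
        factors.append(initial_factor(clique, owned))
    for i in range(len(CLIQUES) - 1):
        src, dst = CLIQUES[i], CLIQUES[i + 1]
        sep = [v for v in src if v in dst]
        # Sum out everything except the separator variables.
        message = {}
        for values, w in factors[i].items():
            key = tuple(values[src.index(v)] for v in sep)
            message[key] = message.get(key, 0.0) + w
        # Multiply the message into the neighbouring factor.
        for values in factors[i + 1]:
            key = tuple(values[dst.index(v)] for v in sep)
            factors[i + 1][values] *= message[key]
    return sum(factors[-1].values())


if __name__ == "__main__":
    print("P(f=0) =", prob_f_false(), " P(f=1) =", 1.0 - prob_f_false())
```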
Take advantage of Junction Tree structure
Rooted Directed Path Graphs [Gavril1975]:
A graph 𝐺(𝑉, 𝐸) is a rooted directed path graph (RDPG) iff there exists a rooted directed junction tree 𝑇𝑟 such that for every vertex 𝑣 ∈ 𝑉, the set of nodes that contain 𝑣 forms a directed path of 𝑇𝑟
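The directed-path condition can be tested on a rooted tree given as a child-to-parent map. The sketch below uses the seven-node example of the following slides; the parent map is an assumed reading of that figure, and the function name is illustrative.

```python
def is_directed_path_tree(cliques, parent):
    """True iff, for every variable, the nodes containing it form a root-to-leaf-
    direction path: one top node, and at most one child inside the set per node."""
    variables = set().union(*cliques.values())
    for v in variables:
        holders = {n for n, vs in cliques.items() if v in vs}
        # Exactly one holder may have its parent outside the set (the path's top).
        tops = [n for n in holders if parent.get(n) not in holders]
        if len(tops) != 1:
            return False
        # No holder may branch into two holders.
        for n in holders:
            if sum(1 for c in holders if parent.get(c) == n) > 1:
                return False
    return True


CLIQUES = {
    "Cr": set("abcde"), "C1": set("abf"), "C2": set("cdeg"),
    "C3": set("afh"), "C4": set("bi"), "C5": set("cej"), "C6": set("dgk"),
}
PARENT = {"C1": "Cr", "C2": "Cr", "C3": "C1", "C4": "C1", "C5": "C2", "C6": "C2"}

if __name__ == "__main__":
    print(is_directed_path_tree(CLIQUES, PARENT))
    # Adding `a` to C4 makes the nodes containing `a` branch below C1.
    broken = dict(CLIQUES, C4=set("abi"))
    print(is_directed_path_tree(broken, PARENT))
```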
Disjoint Branch Junction Trees (DBJT)
𝑓 = 𝑎𝑏𝑐𝑑𝑒 + 𝑎𝑏𝑓 + 𝑐𝑑𝑒𝑔 + 𝑎𝑓ℎ + 𝑏𝑖 + 𝑐𝑒𝑗 + 𝑑𝑔𝑘
Cr {𝑎, 𝑏, 𝑐, 𝑑, 𝑒}
   C1 {𝑎, 𝑏, 𝑓}
      C3 {𝑎, 𝑓, ℎ}
      C4 {𝑏, 𝑖}
   C2 {𝑐, 𝑑, 𝑒, 𝑔}
      C5 {𝑐, 𝑒, 𝑗}
      C6 {𝑑, 𝑔, 𝑘}
Use compact factors
[Figure: the DBJT with root Cr {𝑎, 𝑏, 𝑐, 𝑑, 𝑒}, children C1 {𝑎, 𝑏, 𝑓} and C2 {𝑐, 𝑑, 𝑒, 𝑔}, and leaves C3 {𝑎, 𝑓, ℎ}, C4 {𝑏, 𝑖}, C5 {𝑐, 𝑒, 𝑗}, C6 {𝑑, 𝑔, 𝑘}]
Compact factor of the root (* stands for both 0 and 1):
a b c d e | P
0 * * * * | ?
1 0 * * * | ?
1 1 0 * * | ?
1 1 1 0 * | ?
1 1 1 1 0 | ?
1 1 1 1 1 | ?
We would ultimately like to calculate the entry probabilities. Their sum is exactly 𝑃(𝑓 = 0)
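To see why the compact representation is enough, consider a single clause in isolation. The sketch below builds the staircase of patterns (0,*,…), (1,0,*,…), … for one clause and checks that the entries sum to the probability that the clause is false. It ignores the other clauses of 𝑓, and the probabilities are hypothetical values chosen for illustration.

```python
from itertools import product

# Hypothetical probabilities for the variables of the clause abcde.
P = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6, "e": 0.5}


def compact_factor(variables, p):
    """One entry per pattern (1,...,1,0,*,...,*): the first variable that is 0."""
    entries = []
    prefix = 1.0  # probability that all earlier variables are 1
    for i, var in enumerate(variables):
        pattern = (1,) * i + (0,) + ("*",) * (len(variables) - i - 1)
        entries.append((pattern, prefix * (1.0 - p[var])))
        prefix *= p[var]
    return entries


def clause_false_bruteforce(variables, p):
    """P(clause = 0) by enumerating all 2^k assignments."""
    total = 0.0
    for bits in product([0, 1], repeat=len(variables)):
        if not all(bits):
            w = 1.0
            for var, bit in zip(variables, bits):
                w *= p[var] if bit else 1.0 - p[var]
            total += w
    return total


if __name__ == "__main__":
    factor = compact_factor("abcde", P)
    for pattern, prob in factor:
        print(pattern, prob)
    print(sum(prob for _, prob in factor), clause_false_bruteforce("abcde", P))
```

The compact factor has k entries for a clause of k variables, against 2^k rows in the full table.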
The Algorithm
[Figure: the DBJT; C1 {𝑎, 𝑏, 𝑓} sends 𝜇1,𝑟(𝑎, 𝑏) and C2 {𝑐, 𝑑, 𝑒, 𝑔} sends 𝜇2,𝑟(𝑐, 𝑑, 𝑒) to the root Cr {𝑎, 𝑏, 𝑐, 𝑑, 𝑒}]
𝐹𝑟(𝑎, 𝑏, 𝑐, 𝑑, 𝑒) = 𝜇1,𝑟(𝑎, 𝑏) × 𝜇2,𝑟(𝑐, 𝑑, 𝑒)

𝐹𝑟(𝑎, 𝑏, 𝑐, 𝑑, 𝑒):
a b c d e | P
0 * * * * | ?
1 0 * * * | ?
1 1 0 * * | ?
1 1 1 0 * | ?
1 1 1 1 0 | ?

𝜇1,𝑟(𝑎, 𝑏):
a b | P
0 * | ?
1 0 | ?
1 1 | ?

𝜇2,𝑟(𝑐, 𝑑, 𝑒):
c d e | P
0 * * | ?
1 0 * | ?
1 1 0 | ?
1 1 1 | ?

This can be done due to the disjoint branch property.
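Projection/Marginalization
• Sending a message involves summing out variables in the factor
C3(𝑎, 𝑓, ℎ), with $\bar{p}_x = 1 - p_x$:
a f h | P
0 * * | $\bar{p}_a$
1 0 * | $p_a \bar{p}_f$
1 1 0 | $p_a p_f \bar{p}_h$
Π𝑎,ℎ (summing out 𝑓):
a h | P
0 * | $\bar{p}_a$
1 * | $p_a \bar{p}_f$
1 0 | $p_a p_f \bar{p}_h$
No longer mutually exclusive!
Disables subsequent projections.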
Projection/Marginalization
• Solution: Perform marginalization by repeatedly projecting out only the last (rightmost) variable.
C3(𝑎, 𝑓, ℎ), with $\bar{p}_x = 1 - p_x$:
a f h | P
0 * * | $\bar{p}_a$
1 0 * | $p_a \bar{p}_f$
1 1 0 | $p_a p_f \bar{p}_h$
Π𝑎,𝑓 (summing out ℎ):
a f | P
0 * | $\bar{p}_a$
1 0 | $p_a \bar{p}_f$
1 1 | $p_a p_f \bar{p}_h$
• Requires ordering the message variables before those to be summed out.
• Due to the junction tree property this is always possible.
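The difference between the two projections can be checked on the factor C3(𝑎, 𝑓, ℎ). The sketch below stores a compact factor as a list of (pattern, probability) pairs, projects out the last column, and compares it with dropping the middle column. The probabilities and helper names are hypothetical, chosen for illustration only.

```python
STAR = "*"
# Hypothetical probabilities for a, f, h.
pa, pf, ph = 0.5, 0.4, 0.3
C3 = [
    ((0, STAR, STAR), 1 - pa),
    ((1, 0, STAR), pa * (1 - pf)),
    ((1, 1, 0), pa * pf * (1 - ph)),
]


def project_last(factor):
    """Sum out the rightmost variable; entries with equal patterns are merged."""
    merged = {}
    for pattern, prob in factor:
        key = pattern[:-1]
        merged[key] = merged.get(key, 0.0) + prob
    return list(merged.items())


def drop_column(factor, index):
    """Naively remove an arbitrary column without merging anything."""
    return [(pat[:index] + pat[index + 1:], prob) for pat, prob in factor]


def overlap(p, q):
    """Two patterns overlap if some full assignment matches both."""
    return all(x == y or STAR in (x, y) for x, y in zip(p, q))


def mutually_exclusive(factor):
    pats = [pat for pat, _ in factor]
    return not any(overlap(p, q) for i, p in enumerate(pats) for q in pats[i + 1:])


if __name__ == "__main__":
    print(project_last(C3), mutually_exclusive(project_last(C3)))
    print(drop_column(C3, 1), mutually_exclusive(drop_column(C3, 1)))
```

Dropping the middle column produces the patterns (1, *) and (1, 0), which both match 𝑎 = 1, ℎ = 0; removing the last column keeps the staircase intact.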
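[Figure: the DBJT; C3 {𝑎, 𝑓, ℎ} sends 𝜇3,1(𝑎, 𝑓) and C4 {𝑏, 𝑖} sends 𝜇4,1(𝑏) to C1 {𝑎, 𝑏, 𝑓}]
Throughout, $\bar{p}_x = 1 - p_x$.

C3(a, f, h) =
a f h | P
0 * * | $\bar{p}_a$
1 0 * | $p_a \bar{p}_f$
1 1 0 | $p_a p_f \bar{p}_h$

$\mu_{3,1}(a, f) = \Pi_{a,f}\, C_3(a, f, h)$ =
a f | P
0 * | $\bar{p}_a$
1 0 | $p_a \bar{p}_f$
1 1 | $p_a p_f \bar{p}_h$

C4(b, i) =
b i | P
0 * | $\bar{p}_b$
1 0 | $p_b \bar{p}_i$

$\mu_{4,1}(b) = \Pi_{b}\, C_4(b, i)$ =
b | P
0 | $\bar{p}_b$
1 | $p_b \bar{p}_i$

C1(a, b, f), from $\mu_{3,1}(a, f) \times \mu_{4,1}(b)$ =
a b f | P
0 * * | $\bar{p}_a (\bar{p}_b + p_b \bar{p}_i)$
1 0 * | $p_a \bar{p}_b (\bar{p}_f + p_f \bar{p}_h)$
1 1 0 | $p_a p_b \bar{p}_f \bar{p}_i$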
[Figure: the DBJT with all messages: 𝜇3,1(𝑎, 𝑓) and 𝜇4,1(𝑏) into C1; 𝜇5,2(𝑐, 𝑒) and 𝜇6,2(𝑑, 𝑔) into C2; 𝜇1,𝑟(𝑎, 𝑏) and 𝜇2,𝑟(𝑐, 𝑑, 𝑒) into Cr]
𝐹𝑟(𝑎, 𝑏, 𝑐, 𝑑, 𝑒), with $\bar{p}_x = 1 - p_x$ =
a b c d e | P
0 * * * * | $\bar{p}_a (\bar{p}_b + p_b \bar{p}_i)\, \mu_{2,r}(*,*,*)$
1 0 * * * | $p_a \bar{p}_b (\bar{p}_f + p_f \bar{p}_h)\, \mu_{2,r}(*,*,*)$
1 1 0 * * | $p_a p_b \bar{p}_f \bar{p}_i \bar{p}_c [\bar{p}_d + p_d (\bar{p}_g + p_g \bar{p}_k)]$
1 1 1 0 * | $p_a p_b \bar{p}_f \bar{p}_i p_c \bar{p}_d (\bar{p}_e + p_e \bar{p}_j)$
1 1 1 1 0 | $p_a p_b \bar{p}_f \bar{p}_i p_c \bar{p}_e p_d (\bar{p}_g + p_g \bar{p}_k)$
Complexity Analysis
• Let 𝑘 be the size of the largest factor
• Each node can have at most 𝑘 children
• Therefore, each entry in the factor is updated at most 𝑘 times.
• Overall 𝑂(𝑁 ⋅ 𝑘²)
Conclusions
• Define disjoint branch lineage expressions
• Provide an algorithm for computing the probability of disjoint branch lineage expressions in PTIME: 𝑂(𝑁 ⋅ 𝑘²)
Future Work
• Are there other structural properties of junction trees that can facilitate efficient probabilistic inference?
• Real data is correlated
– Drop the tuple-independence assumption
• Characterize queries and DB instances which induce lineage with “efficient” junction trees.
Thank You