Presentation

A new class of lineage expressions over probabilistic databases computable in PTIME
SUM 2013
Batya Kenig, Avigdor Gal, Ofer Strichman

Probabilistic Databases for managing uncertain data
• A variety of data sources generate incomplete, noisy and uncertain data (sensor networks, information extraction, data integration…).
• Probabilistic databases enable storing and querying such data.
• A lot of research in recent years: MayBMS [Cornell], Trio [Stanford], SPROUT [Oxford], PrDB [U.Md]

Tuple Independent Probabilistic Databases

R     A    B    P
r1    a1   b1   0.6
r2    a2   b2   0.5

S     B    C    P
s1    b1   c1   0.6
s2    b2   c1   0.5
s3    b2   c2   0.25

T     C    D    P
t1    c1   d1   0.4

Each possible world I is a standard database instance with probability P(I).

World   R       S       T       P(I)
I0      ∅       ∅       ∅       0.018
I1      {r1}    ∅       ∅       0.027
I2      {r1}    {s1}    ∅       0.0405
I3      {r1}    {s1}    {t1}    0.027
⋯

Query Semantics
• Let q be a query evaluated against probabilistic DB D.
• Let I_t ⊆ I be the possible worlds that return t ∈ q[D].
• Sum the probabilities of instances that return t: P(t) = Σ_{i ∈ I_t} P(i)
• Goal: efficiently evaluate P(q[D])
  – In time polynomial in |D|
  – Not always possible, in general #P-hard

Probabilistic Inference for queries
• DalviSuciu04: Conjunctive queries (without self joins) are either:
  – Safe queries: Have query plans that run in PTIME on all DB instances
  – Unsafe queries: Data complexity is #P-hard
• q: Π∅ (R(X) ⋈ S(X, Y) ⋈ T(Y))
• However,
  – Even for unsafe queries there are DB instances which will enable efficient computation

Why lineage?
Each tuple of R, S and T above is associated with a binary random variable.

q: Π∅ (R ⋈B S ⋈C T)

R ⋈B S    A    B    C    lineage
i1        a1   b1   c1   r1 ∧ s1
i2        a2   b2   c1   r2 ∧ s2
i3        a2   b2   c2   r2 ∧ s3

(R ⋈B S) ⋈C T    A    B    C    D    lineage
j1               a1   b1   c1   d1   r1 ∧ s1 ∧ t1
j2               a2   b2   c1   d1   r2 ∧ s2 ∧ t1

Compute the probability of this formula: (r1 ∧ s1 ∧ t1) ∨ (r2 ∧ s2 ∧ t1)
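The possible-world semantics above can be checked by brute force on the example tables. The following is a minimal sketch, not the method of the talk: it enumerates all 2^6 worlds over the tuples of R, S and T and sums the probabilities of those in which the lineage (r1 ∧ s1 ∧ t1) ∨ (r2 ∧ s2 ∧ t1) holds. The function names `world_probability`, `lineage` and `query_probability` are illustrative choices, not names from the slides.

```python
from itertools import product

# Tuple probabilities from the example tables R, S and T.
probs = {"r1": 0.6, "r2": 0.5, "s1": 0.6, "s2": 0.5, "s3": 0.25, "t1": 0.4}
names = list(probs)

def world_probability(world):
    """Probability of one possible world under tuple independence."""
    result = 1.0
    for name in names:
        result *= probs[name] if world[name] else 1.0 - probs[name]
    return result

def lineage(world):
    """Lineage of q: (r1 AND s1 AND t1) OR (r2 AND s2 AND t1)."""
    return (world["r1"] and world["s1"] and world["t1"]) or (
        world["r2"] and world["s2"] and world["t1"])

def query_probability():
    """Sum P(I) over all 2^n worlds in which the lineage is true."""
    total = 0.0
    for bits in product([False, True], repeat=len(names)):
        world = dict(zip(names, bits))
        if lineage(world):
            total += world_probability(world)
    return total

# I0 is the world in which no tuple is present.
p_i0 = world_probability({name: False for name in names})
p_q = query_probability()
print(round(p_i0, 4), round(p_q, 4))
```

The enumeration is exponential in the number of tuples, which is exactly why the rest of the deck looks for structure in the lineage instead.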
Efficient computation
• Expression in Read-Once form [Roy2011, Sen2010]
  – Linear time probability computation
• P(x ∧ y) = P(x) P(y)
• P(x ∨ y) = 1 − (1 − P(x))(1 − P(y))
• f = (r1 ∧ s1 ∧ t1) ∨ (r2 ∧ s2 ∧ t1) = t1 ∧ ((r1 ∧ s1) ∨ (r2 ∧ s2))

Read-once tree for f, evaluated bottom-up:
  i1 = r1 ∧ s1      p(i1) = p(r1) p(s1)
  i2 = r2 ∧ s2      p(i2) = p(r2) p(s2)
  i3 = i1 ∨ i2      p(i3) = 1 − (1 − p(i1))(1 − p(i2))
  f  = t1 ∧ i3      p(f)  = p(t1) p(i3)

Safe plans for safe queries produce formulas in read-once form [Olteanu&Huang2008].

Unsafe query

R     A    B    P
r1    a1   b1   0.6
r2    a2   b2   0.5

S     B    C    P
s1    b1   c1   0.6
s2    b2   c1   0.5
s3    b2   c2   0.25

T     C    D    P
t1    c1   d1   0.4
t2    c2   d2   0.3

Q: Π∅ (R ⋈B S ⋈C T)
f = r1 s1 t1 ∨ r2 s2 t1 ∨ r2 s3 t2
Not read-once.

Solutions:
• Jha&Suciu11: Compile to decision diagram (OBDD, FBDD)
• Jha&Suciu12: Exponential in pathwidth, double exponential in expression pathwidth
We will show how to compute the probability of disjoint branch lineage expressions in PTIME.

Lineage as a hypergraph
f = r1 s1 t1 ∨ r2 s2 t1 ∨ r2 s3 t2
[Figure: primal graph of f over the variables r1, r2, s1, s2, s3, t1, t2, with the hyperedges marked]
In general, expanding a formula to its DNF form can lead to an exponential blowup.
• For SPJ queries without self joins, the primal graph can be generated directly from the formula [Roy 2011].

Junction trees for lineages
• Let H(V, ℰ) be a hypergraph.
• Hypergraph H(V, ℰ) is acyclic iff it has a junction tree T(ℰ, A) [Beeri et al 1981]
  – The junction tree property: for every v ∈ V, the set of nodes in the tree that contain v, K_v ⊆ ℰ, induces a (connected) tree.

Junction trees for lineages
f = r1 s1 t1 ∨ r2 s2 t1 ∨ r2 s3 t2
Junction tree (separators in brackets): r1s1t1 --[t1]-- r2s2t1 --[r2]-- r2s3t2

Background: junction trees for probabilistic inference

Node       Factor
r1s1t1     F(r1, s1, t1)
r2s2t1     F(r2, s2, t1)
r2s3t2     F(r2, s3, t2)

Separator  Factor
t1         F(t1)
r2         F(r2)

Each node and separator stores a joint pdf.
Send messages towards a given root node:
• Messages are passed by multiplication of factor entries.
• Once the root node has received messages from all of its neighbors, its factor holds the marginal of the joint probability distribution of the entire variable set.
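The message passing described above can be sketched on the three-node junction tree r1s1t1 --[t1]-- r2s2t1 --[r2]-- r2s3t2. This is an illustrative sketch under stated assumptions rather than the exact procedure of the talk: each factor entry is set to 0 when its clause is satisfied, so the sum over the root factor is the probability that no clause holds, and each variable's prior is assigned to exactly one node (the `owned` argument, an assumption made here). The helper names `clause_factor`, `marginalize` and `absorb` are hypothetical.

```python
from itertools import product

# Tuple probabilities taken from the example tables R, S and T.
p = {"r1": 0.6, "r2": 0.5, "s1": 0.6, "s2": 0.5, "s3": 0.25,
     "t1": 0.4, "t2": 0.3}

def prior(var, value):
    """P(var = value) for an independent binary tuple variable."""
    return p[var] if value else 1.0 - p[var]

def clause_factor(clause, owned):
    """Full table over the clause's variables.

    An entry is 0 when the clause is satisfied (all variables are 1);
    otherwise it is the product of the priors of the variables in `owned`.
    Assumption: every variable is owned by exactly one node, so its prior
    is multiplied in only once across the tree.
    """
    table = {}
    for values in product((0, 1), repeat=len(clause)):
        weight = 0.0 if all(values) else 1.0
        for var, value in zip(clause, values):
            if var in owned:
                weight *= prior(var, value)
        table[values] = weight
    return table

def marginalize(table, variables, keep):
    """Sum out every variable that is not listed in `keep`."""
    idx = [variables.index(v) for v in keep]
    out = {}
    for values, weight in table.items():
        key = tuple(values[i] for i in idx)
        out[key] = out.get(key, 0.0) + weight
    return out

def absorb(table, variables, message, message_vars):
    """Multiply each factor entry by the matching message entry."""
    idx = [variables.index(v) for v in message_vars]
    return {values: weight * message[tuple(values[i] for i in idx)]
            for values, weight in table.items()}

c1, c2, c3 = ("r1", "s1", "t1"), ("r2", "s2", "t1"), ("r2", "s3", "t2")
f1 = clause_factor(c1, owned={"r1", "s1", "t1"})
f2 = clause_factor(c2, owned={"r2", "s2"})
f3 = clause_factor(c3, owned={"s3", "t2"})

# Messages flow towards the root node r2s3t2.
f2 = absorb(f2, c2, marginalize(f1, c1, ("t1",)), ("t1",))
f3 = absorb(f3, c3, marginalize(f2, c2, ("r2",)), ("r2",))
p_f0 = sum(f3.values())   # probability that no clause is satisfied
p_f1 = 1.0 - p_f0

def brute_force_p_f0():
    """Reference value: enumerate all 2^7 assignments."""
    names = list(p)
    total = 0.0
    for bits in product((0, 1), repeat=len(names)):
        w = dict(zip(names, bits))
        if any(all(w[v] for v in clause) for clause in (c1, c2, c3)):
            continue
        weight = 1.0
        for name in names:
            weight *= prior(name, w[name])
        total += weight
    return total
```

Each table here has 2^3 entries; in general the table size is exponential in the number of variables of the largest node, which is the cost the compact factors later in the deck are meant to avoid.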
(Naïve) Junction Tree Algorithm for lineage computation
f = r1 s1 t1 ∨ r2 s2 t1 ∨ r2 s3 t2
P(f = 1) = 1 − P(f = 0)

Junction tree (separators in brackets): r1s1t1 --[t1]-- r2s2t1 --[r2]-- r2s3t2

Factor F(r1, s1, t1), where p̄_x = 1 − p_x:

r1   s1   t1   P
0    0    0    p̄_r1 p̄_s1 p̄_t1
0    0    1    p̄_r1 p̄_s1 p_t1
0    1    0    p̄_r1 p_s1 p̄_t1
0    1    1    p̄_r1 p_s1 p_t1
1    0    0    p_r1 p̄_s1 p̄_t1
1    0    1    p_r1 p̄_s1 p_t1
1    1    0    p_r1 p_s1 p̄_t1
1    1    1    0   (the clause r1 s1 t1 is satisfied)

Messages are sent towards the root r2s3t2:
F(t1) = Σ_{r1,s1} F(r1, s1, t1)
F(r2, s2, t1) = F(r2, s2, t1) · F(t1)
F(r2) = Σ_{s2,t1} F(r2, s2, t1)
F(r2, s3, t2) = F(r2, s3, t2) · F(r2)
P(f = 0) = Σ_{r2,s3,t2} F(r2, s3, t2)

PROBLEM: The JT algorithm runs in time that is exponential in the size of the largest factor.
Restricted to lineage expressions that can be efficiently represented using a junction tree (i.e., low treewidth).

Take advantage of Junction Tree structure
Rooted Directed Path Graphs [Gavril1975]: A graph G(V, E) is a rooted directed path graph (RDPG) iff there exists a rooted directed junction tree T_r such that for every vertex v ∈ V, the set of nodes that contain v forms a directed path of T_r.

Disjoint Branch Junction Trees (DBJT)
f = abcde + abf + cdeg + afh + bi + cej + dgk

Cr {a, b, c, d, e}
    C1 {a, b, f}
        C3 {a, f, h}
        C4 {b, i}
    C2 {c, d, e, g}
        C5 {c, e, j}
        C6 {d, g, k}

Use compact factors
Compact factor of the root Cr {a, b, c, d, e} (* matches either value):

a   b   c   d   e   P
0   *   *   *   *   ?
1   0   *   *   *   ?
1   1   0   *   *   ?
1   1   1   0   *   ?
1   1   1   1   0   ?
1   1   1   1   1   ?

We would ultimately like to calculate the entry probabilities. Their sum is exactly P(f = 0).

The Algorithm
The root Cr {a, b, c, d, e} receives μ_{1,r}(a, b) from C1 and μ_{2,r}(c, d, e) from C2. Its factor is the product of the two messages:

a   b   c   d   e   P
0   *   *   *   *   ?
1   0   *   *   *   ?
1   1   0   *   *   ?
1   1   1   0   *   ?
1   1   1   1   0   ?

=

a   b   P
0   *   ?
1   0   ?
1   1   ?

×

c   d   e   P
0   *   *   ?
1   0   *   ?
1   1   0   ?
1   1   1   ?

This can be done due to the disjoint branch property.

Projection/Marginalization
• Sending a message involves summing out variables in the factor.

a   f   h   P
0   *   *   p̄_a
1   0   *   p_a p̄_f
1   1   0   p_a p_f p̄_h

Π_{a,h} gives:

a   h   P
0   *   p̄_a
1   *   p_a p̄_f
1   0   p_a p_f p̄_h

No longer mutually exclusive! This disables subsequent projections.
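The effect of projecting a compact factor can be reproduced in a few lines. The sketch below builds the compact factor of the clause afh, then drops either the middle column f or the rightmost column h and tests whether the remaining row patterns are still mutually exclusive. The probabilities and the helper names (`compact_clause_factor`, `project_out`, `overlaps`, `mutually_exclusive`) are assumptions made for illustration and do not come from the slides.

```python
STAR = "*"

# Hypothetical tuple probabilities for a, f and h.
p = {"a": 0.9, "f": 0.4, "h": 0.3}

def compact_clause_factor(variables, probs):
    """Compact factor for the event 'the clause over `variables` is false'.

    Row i fixes the first i variables to 1, variable i to 0 and leaves the
    remaining variables free, so the rows are mutually exclusive.
    """
    rows = []
    prefix = 1.0
    for i, var in enumerate(variables):
        pattern = (1,) * i + (0,) + (STAR,) * (len(variables) - i - 1)
        rows.append((pattern, prefix * (1.0 - probs[var])))
        prefix *= probs[var]
    return rows

def project_out(rows, position):
    """Drop one column from every row pattern; the weights are unchanged."""
    return [(pat[:position] + pat[position + 1:], w) for pat, w in rows]

def overlaps(pat1, pat2):
    """Two patterns overlap when some assignment matches both of them."""
    return all(x == STAR or y == STAR or x == y for x, y in zip(pat1, pat2))

def mutually_exclusive(rows):
    """True when no two row patterns overlap."""
    return not any(overlaps(rows[i][0], rows[j][0])
                   for i in range(len(rows))
                   for j in range(i + 1, len(rows)))

factor = compact_clause_factor(("a", "f", "h"), p)
drop_f = project_out(factor, 1)   # project onto (a, h): middle column removed
drop_h = project_out(factor, 2)   # project onto (a, f): rightmost column removed

print([pat for pat, _ in drop_f], mutually_exclusive(drop_f))
print([pat for pat, _ in drop_h], mutually_exclusive(drop_h))
```

Dropping f leaves the rows (1, *) and (1, 0), which both match a = 1, h = 0, whereas dropping the rightmost variable h keeps the staircase shape of the patterns intact.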
Projection/Marginalization
• Solution: Perform marginalization by repeatedly projecting out only the last (rightmost) variable.

Here p̄_x = 1 − p_x.

a   f   h   P
0   *   *   p̄_a
1   0   *   p_a p̄_f
1   1   0   p_a p_f p̄_h

Π_{a,f} gives:

a   f   P
0   *   p̄_a
1   0   p_a p̄_f
1   1   p_a p_f p̄_h

• Requires ordering message-vars before those to be summed out.
• Due to the junction tree property this is always possible.

Node C1 {a, b, f} receives μ_{3,1}(a, f) from C3 {a, f, h} and μ_{4,1}(b) from C4 {b, i}.

μ_{3,1}(a, f) = Π_{a,f} C3(a, f, h)

C3(a, f, h):
a   f   h   P
0   *   *   p̄_a
1   0   *   p_a p̄_f
1   1   0   p_a p_f p̄_h

μ_{3,1}(a, f):
a   f   P
0   *   p̄_a
1   0   p_a p̄_f
1   1   p_a p_f p̄_h

μ_{4,1}(b) = Π_b C4(b, i)

C4(b, i):
b   i   P
0   *   p̄_b
1   0   p_b p̄_i

μ_{4,1}(b):
b   P
0   p̄_b
1   p_b p̄_i

C1(a, b, f) after multiplying in both messages:
a   b   f   P
0   *   *   p̄_a (p̄_b + p_b p̄_i)
1   0   *   p_a p̄_b (p̄_f + p_f p̄_h)
1   1   0   p_a p_b p̄_f p̄_i

In the same way, C2 {c, d, e, g} receives μ_{5,2}(c, e) from C5 {c, e, j} and μ_{6,2}(d, g) from C6 {d, g, k}. The root Cr {a, b, c, d, e} then receives μ_{1,r}(a, b) from C1 and μ_{2,r}(c, d, e) from C2.

F_r(a, b, c, d, e) =
a   b   c   d   e   P
0   *   *   *   *   p̄_a (p̄_b + p_b p̄_i) · μ_{2,r}(*, *, *)
1   0   *   *   *   p_a p̄_b (p̄_f + p_f p̄_h) · μ_{2,r}(*, *, *)
1   1   0   *   *   p_a p_b p̄_f p̄_i p̄_c [p̄_d + p_d (p̄_g + p_g p̄_k)]
1   1   1   0   *   p_a p_b p̄_f p̄_i p_c p̄_d (p̄_e + p_e p̄_j)
1   1   1   1   0   p_a p_b p̄_f p̄_i p_c p_d p̄_e (p̄_g + p_g p̄_k)

Complexity Analysis
• Let k be the size of the largest factor.
• Each node can have at most k children.
• Therefore, each entry in the factor is updated at most k times.
• Overall O(N · k²)

Conclusions
• Define disjoint branch lineage expressions
• Provide an algorithm for computing the probability of disjoint branch lineage expressions in PTIME: O(N · k²)

Future Work
• Are there other structural properties of junction trees that can facilitate efficient probabilistic inference?
• Real data is correlated
  – Drop tuple-independence assumption
• Characterize queries and DB instances which induce lineage with "efficient" junction trees.

Thank You
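As a sanity check on the worked example, the root factor F_r(a, b, c, d, e) for f = abcde + abf + cdeg + afh + bi + cej + dgk can be evaluated numerically and compared with a brute-force enumeration. This is a sketch under stated assumptions: the probabilities are hypothetical, `q[x]` stands for 1 − p_x, and the entry of μ_{2,r} for the row (1, 1, 1) is derived here from the clauses rather than read off the slides.

```python
from itertools import product

# Hypothetical probabilities for the variables a..k (not from the slides).
p = dict(zip("abcdefghijk",
             [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.35, 0.45, 0.55]))
q = {v: 1.0 - pv for v, pv in p.items()}   # q[x] = 1 - p[x]

clauses = ["abcde", "abf", "cdeg", "afh", "bi", "cej", "dgk"]

# Message from C1 to the root, rows (a, b) = (0, *), (1, 0), (1, 1).
mu_1r = [
    q["a"] * (q["b"] + p["b"] * q["i"]),
    p["a"] * q["b"] * (q["f"] + p["f"] * q["h"]),
    p["a"] * p["b"] * q["f"] * q["i"],
]

# Probability that the clause dgk is false once d = 1 is fixed.
dgk_false = q["g"] + p["g"] * q["k"]

# Message from C2 to the root,
# rows (c, d, e) = (0, *, *), (1, 0, *), (1, 1, 0), (1, 1, 1).
mu_2r = [
    q["c"] * (q["d"] + p["d"] * dgk_false),
    p["c"] * q["d"] * (q["e"] + p["e"] * q["j"]),
    p["c"] * p["d"] * q["e"] * dgk_false,
    p["c"] * p["d"] * p["e"] * q["g"] * q["j"],   # derived here, an assumption
]

# Root factor rows in the order
# (0,*,*,*,*), (1,0,*,*,*), (1,1,0,*,*), (1,1,1,0,*), (1,1,1,1,0).
f_r = [
    mu_1r[0] * sum(mu_2r),
    mu_1r[1] * sum(mu_2r),
    mu_1r[2] * mu_2r[0],
    mu_1r[2] * mu_2r[1],
    mu_1r[2] * mu_2r[2],
]
p_f0 = sum(f_r)

def brute_force_p_f0():
    """Reference value: sum over all 2^11 assignments with every clause false."""
    names = sorted(p)
    total = 0.0
    for bits in product((0, 1), repeat=len(names)):
        w = dict(zip(names, bits))
        if any(all(w[v] for v in clause) for clause in clauses):
            continue
        weight = 1.0
        for v in names:
            weight *= p[v] if w[v] else q[v]
        total += weight
    return total

print(p_f0, brute_force_p_f0())
```

The two printed values should agree up to floating-point rounding, while the compact computation touches only a handful of rows instead of 2^11 assignments.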
