A new class of lineage expressions over probabilistic databases computable in PTIME
SUM 2013
Batya Kenig, Avigdor Gal, Ofer Strichman

Probabilistic Databases for managing uncertain data
• A variety of data sources generate incomplete, noisy and uncertain data (sensor networks, information extraction, data integration, …).
• Probabilistic databases enable storing and querying such data.
• A lot of research in recent years: MayBMS [Cornell], Trio [Stanford], SPROUT [Oxford], PrDB [U.Md]

Tuple Independent Probabilistic Databases

R     A    B    P
r1    a1   b1   0.6
r2    a2   b2   0.5

S     B    C    P
s1    b1   c1   0.6
s2    b2   c1   0.5
s3    b2   c2   0.25

T     C    D    P
t1    c1   d1   0.4

Each possible world $I$ is a standard database instance with probability $P(I)$:
• $P(I_0) = 0.018$: R, S and T are all empty
• $P(I_1) = 0.027$: R = {r1}; S and T are empty
• $P(I_2) = 0.0405$: R = {r1}, S = {s1}; T is empty
• $P(I_3) = 0.027$: R = {r1}, S = {s1}, T = {t1}
• …

Query Semantics
• Let $q$ be a query evaluated against a probabilistic DB $D$.
• Let $I_t \subseteq I$ be the possible worlds that return $t \in q[D]$.
• Sum the probabilities of the instances that return $t$: $P(t) = \sum_{i \in I_t} P(i)$
• Goal: efficiently evaluate $P(q[D])$
  – In time polynomial in $|D|$
  – Not always possible; in general #P-hard

Probabilistic Inference for queries
• DalviSuciu04: Conjunctive queries (without self joins) are either:
  – Safe queries: have query plans that run in PTIME on all DB instances
  – Unsafe queries: data complexity is #P-hard
• $q: \Pi_{\emptyset}(R(a) \bowtie S(a, b) \bowtie T(b))$
• However, even for unsafe queries there are DB instances that enable efficient computation.

Why lineage?
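To see why enumerating possible worlds does not scale, the query semantics above can be sketched as a brute-force sum over all 2^n worlds. This is only an illustrative sketch: the helper names (`world_probability`, `query_holds`, `brute_force_probability`) are hypothetical, and the Boolean query used here (R joined with S on B, then with T on C, projected onto the empty attribute set) is an assumption chosen to fit the example instance.

```python
from itertools import combinations

# Tuple-independent example instance: id -> ((attribute values), probability).
R = {"r1": (("a1", "b1"), 0.6), "r2": (("a2", "b2"), 0.5)}
S = {"s1": (("b1", "c1"), 0.6), "s2": (("b2", "c1"), 0.5), "s3": (("b2", "c2"), 0.25)}
T = {"t1": (("c1", "d1"), 0.4)}
TUPLES = {**R, **S, **T}


def world_probability(world):
    """Probability of the world containing exactly the tuple ids in `world`."""
    p = 1.0
    for tid, (_, prob) in TUPLES.items():
        p *= prob if tid in world else 1.0 - prob
    return p


def query_holds(world):
    """Assumed Boolean query: is R join_B S join_C T non-empty in this world?"""
    return any(
        r in world and s in world and t in world
        and R[r][0][1] == S[s][0][0]   # R.B = S.B
        and S[s][0][1] == T[t][0][0]   # S.C = T.C
        for r in R for s in S for t in T
    )


def brute_force_probability():
    """Sum P(world) over every world in which the query holds (2^n worlds)."""
    ids = list(TUPLES)
    total = 0.0
    for k in range(len(ids) + 1):
        for subset in combinations(ids, k):
            world = set(subset)
            if query_holds(world):
                total += world_probability(world)
    return total


print(world_probability(set()))    # close to 0.018, the empty world I0
print(brute_force_probability())   # close to 0.208
```

With six tuples this loop visits 64 worlds; every additional tuple doubles that count, which is the cost that lineage-based evaluation tries to avoid.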
• Each tuple is associated with a binary random variable (r1, r2 for R; s1, s2, s3 for S; t1 for T).

$q: \Pi_{\emptyset}(R \bowtie_B S \bowtie_C T)$

R ⋈_B S
A    B    C    lineage
a1   b1   c1   r1 ∧ s1
a2   b2   c1   r2 ∧ s2
a2   b2   c2   r2 ∧ s3

(R ⋈_B S) ⋈_C T
A    B    C    D    lineage
a1   b1   c1   d1   r1 ∧ s1 ∧ t1
a2   b2   c1   d1   r2 ∧ s2 ∧ t1

Compute the probability of this formula: (r1 ∧ s1 ∧ t1) ∨ (r2 ∧ s2 ∧ t1)

Why Lineage? Efficient computation
• Expression in read-once form [Roy2011, Sen2010]
  – Linear-time probability computation
• For independent $x$ and $y$: $P(x \wedge y) = P(x)\,P(y)$
• For independent $x$ and $y$: $P(x \vee y) = 1 - (1 - P(x))(1 - P(y))$
• $f = (r_1 \wedge s_1 \wedge t_1) \vee (r_2 \wedge s_2 \wedge t_1) = t_1 \wedge ((r_1 \wedge s_1) \vee (r_2 \wedge s_2))$
• Evaluating the read-once tree bottom-up: $P(f) = P(t_1)\,[1 - (1 - P(r_1)P(s_1))(1 - P(r_2)P(s_2))]$

Safe plans for safe queries produce formulas in read-once form [Olteanu&Huang2008]

Unsafe query

R     A    B    P
r1    a1   b1   0.6
r2    a2   b2   0.5

S     B    C    P
s1    b1   c1   0.6
s2    b2   c1   0.5
s3    b2   c2   0.25

T     C    D    P
t1    c1   d1   0.4
t2    c2   d2   0.3

$q: \Pi_{\emptyset}(R \bowtie_B S \bowtie_C T)$
$f = r_1 s_1 t_1 + r_2 s_2 t_1 + r_2 s_3 t_2$ (not read-once)

Solutions:
• Jha&Suciu11: Compile to a decision diagram (OBDD, FBDD)
• Jha&Suciu12: Exponential in pathwidth, double exponential in expression pathwidth
We will show how to compute the probability of disjoint branch lineage expressions in PTIME.

Lineage as a hypergraph
$f = r_1 s_1 t_1 + r_2 s_2 t_1 + r_2 s_3 t_2$
• Hyperedges: the clauses {r1, s1, t1}, {r2, s2, t1}, {r2, s3, t2}
• Primal graph: vertices r1, r2, s1, s2, s3, t1, t2; two variables are adjacent iff they appear together in a clause
• In general, expanding a formula to its DNF form can lead to an exponential blowup.
• For SPJ queries without self joins, the primal graph can be generated directly from the formula [Roy 2011].

Junction trees for lineages
• Let $H(V, \mathcal{E})$ be a hypergraph.
• $H(V, \mathcal{E})$ is acyclic iff it has a junction tree $T(\mathcal{E}, A)$ [Beeri et al 1981]
  – The junction tree property: for every $v \in V$, the set of tree nodes that contain $v$, $K_v \subseteq \mathcal{E}$, induces a connected subtree.
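The junction tree property stated above can be checked mechanically. The sketch below is a minimal illustration, not part of the presented algorithm: the function name `has_junction_tree_property` is hypothetical, the input is assumed to already be a tree over the hyperedges (this is not verified), and the two candidate trees over the clauses of f are chosen here for illustration.

```python
from collections import defaultdict, deque


def has_junction_tree_property(nodes, edges):
    """Check that, for every variable, the tree nodes containing it are connected.

    nodes: dict mapping a node name to its set of variables (one hyperedge each).
    edges: list of (name, name) pairs, assumed to form a tree over the nodes.
    """
    adjacent = defaultdict(set)
    for u, v in edges:
        adjacent[u].add(v)
        adjacent[v].add(u)

    for var in set().union(*nodes.values()):
        holders = {name for name, vs in nodes.items() if var in vs}
        # Breadth-first search restricted to the nodes that contain `var`.
        start = next(iter(holders))
        seen = {start}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adjacent[u]:
                if w in holders and w not in seen:
                    seen.add(w)
                    queue.append(w)
        if seen != holders:
            return False
    return True


# Hyperedges of f = r1 s1 t1 + r2 s2 t1 + r2 s3 t2.
nodes = {
    "r1s1t1": {"r1", "s1", "t1"},
    "r2s2t1": {"r2", "s2", "t1"},
    "r2s3t2": {"r2", "s3", "t2"},
}
chain = [("r1s1t1", "r2s2t1"), ("r2s2t1", "r2s3t2")]   # shares t1, then r2
bad = [("r2s2t1", "r1s1t1"), ("r1s1t1", "r2s3t2")]     # r2 is split by r1s1t1

print(has_junction_tree_property(nodes, chain))  # True
print(has_junction_tree_property(nodes, bad))    # False
```

In the second tree the two nodes containing r2 are separated by a node that does not contain r2, so the property fails even though the same hyperedges are used.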
Junction trees for lineages
$f = r_1 s_1 t_1 + r_2 s_2 t_1 + r_2 s_3 t_2$

  [r1 s1 t1] --(t1)-- [r2 s2 t1] --(r2)-- [r2 s3 t2]

Background: junction trees for probabilistic inference

  node        r1 s1 t1    F(r1, s1, t1)
  separator   t1          F(t1)
  node        r2 s2 t1    F(r2, s2, t1)
  separator   r2          F(r2)
  node        r2 s3 t2    F(r2, s3, t2)

• Each node and separator stores a joint pdf.
• Send messages towards a given root node.
• Messages are passed by multiplication of factor entries.
• Once the root node has received messages from all of its neighbors, its factor holds the marginal of the joint probability distribution of the entire variable set.

(Naïve) Junction Tree Algorithm for lineage computation
$f = r_1 s_1 t_1 + r_2 s_2 t_1 + r_2 s_3 t_2$
$P(f = 1) = 1 - P(f = 0)$

Initial factor of node r1 s1 t1, where $\bar{p}_x = 1 - p_x$:

r1  s1  t1   P
0   0   0    $\bar{p}_{r_1}\,\bar{p}_{s_1}\,\bar{p}_{t_1}$
0   0   1    $\bar{p}_{r_1}\,\bar{p}_{s_1}\,p_{t_1}$
0   1   0    $\bar{p}_{r_1}\,p_{s_1}\,\bar{p}_{t_1}$
0   1   1    $\bar{p}_{r_1}\,p_{s_1}\,p_{t_1}$
1   0   0    $p_{r_1}\,\bar{p}_{s_1}\,\bar{p}_{t_1}$
1   0   1    $p_{r_1}\,\bar{p}_{s_1}\,p_{t_1}$
1   1   0    $p_{r_1}\,p_{s_1}\,\bar{p}_{t_1}$
1   1   1    0   (the clause is satisfied, so $P(f = 0) = 0$)

Messages, with r2 s3 t2 as the root:
  $F(t_1) = \sum_{r_1, s_1} F(r_1, s_1, t_1)$
  $F(r_2, s_2, t_1) = F(r_2, s_2, t_1) \cdot F(t_1)$
  $F(r_2) = \sum_{s_2, t_1} F(r_2, s_2, t_1)$
  $F(r_2, s_3, t_2) = F(r_2, s_3, t_2) \cdot F(r_2)$
  $P(f = 0) = \sum_{r_2, s_3, t_2} F(r_2, s_3, t_2)$

PROBLEM: The JT algorithm runs in time that is exponential in the number of variables of the largest factor.
→ Restricted to lineage expressions that can be efficiently represented using a junction tree (i.e., low tree width)

Take advantage of Junction Tree structure
Rooted Directed Path Graphs [Gavril1975]: A graph $G(V, E)$ is a rooted directed path graph (RDPG) iff there exists a rooted directed junction tree $T_r$ such that for every vertex $v \in V$, the set of nodes that contain $v$ forms a directed path of $T_r$.

Disjoint Branch Junction Trees (DBJT)
$f = abcde + abf + cdeg + afh + bi + cej + dgk$

  Cr (a, b, c, d, e)
    C1 (a, b, f)
      C3 (a, f, h)
      C4 (b, i)
    C2 (c, d, e, g)
      C5 (c, e, j)
      C6 (d, g, k)

Use compact factors
Compact factor of Cr (a, b, c, d, e); "*" matches either value:

a   b   c   d   e   P
0   *   *   *   *   ?
1   0   *   *   *   ?
1   1   0   *   *   ?
1   1   1   0   *   ?
1   1   1   1   0   ?
1   1   1   1   1   ?
We would ultimately like to calculate the entry probabilities. Their sum is exactly $P(f = 0)$.

The Algorithm
The factor of Cr (a, b, c, d, e) is the product of the message $\mu_{1,r}(a, b)$ from C1 and the message $\mu_{2,r}(c, d, e)$ from C2:

Cr (a, b, c, d, e)
a   b   c   d   e   P
0   *   *   *   *   ?
1   0   *   *   *   ?
1   1   0   *   *   ?
1   1   1   0   *   ?
1   1   1   1   0   ?

  =

$\mu_{1,r}(a, b)$
a   b   P
0   *   ?
1   0   ?
1   1   ?

  ×

$\mu_{2,r}(c, d, e)$
c   d   e   P
0   *   *   ?
1   0   *   ?
1   1   0   ?
1   1   1   ?

This can be done due to the disjoint branch property.

Projection/Marginalization
• Sending a message involves summing out variables in the factor.

Factor over (a, f, h), where $\bar{p}_x = 1 - p_x$:
a   f   h   P
0   *   *   $\bar{p}_a$
1   0   *   $p_a\,\bar{p}_f$
1   1   0   $p_a\,p_f\,\bar{p}_h$

Projecting directly onto (a, h), $\Pi_{a,h}$:
a   h   P
0   *   $\bar{p}_a$
1   *   $p_a\,\bar{p}_f$
1   0   $p_a\,p_f\,\bar{p}_h$

The rows are no longer mutually exclusive, which disables subsequent projections.

• Solution: perform marginalization by repeatedly projecting out only the last (rightmost) variable.

Projecting out h, $\Pi_{a,f}$:
a   f   P
0   *   $\bar{p}_a$
1   0   $p_a\,\bar{p}_f$
1   1   $p_a\,p_f\,\bar{p}_h$

• Requires ordering the message variables before those to be summed out.
• Due to the junction tree property this is always possible.
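The rightmost-variable projection described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: a compact factor is represented here as a list of (row, probability) pairs whose rows are tuples of "0", "1" and "*", the function names are hypothetical, and the numeric probabilities p_a = 0.5, p_f = 0.4, p_h = 0.2 are made up for the example.

```python
def project_last(variables, rows):
    """Sum out the rightmost variable of a compact factor.

    Rows that agree on every remaining column are merged by adding their
    probabilities, so the result is again a set of mutually exclusive rows.
    """
    merged = {}
    for assignment, prob in rows:
        key = assignment[:-1]
        merged[key] = merged.get(key, 0.0) + prob
    return variables[:-1], list(merged.items())


def marginalize(variables, rows, keep):
    """Project onto the first `keep` variables, dropping one variable at a time."""
    while len(variables) > keep:
        variables, rows = project_last(variables, rows)
    return variables, rows


# Hypothetical probabilities, for illustration only.
p_a, p_f, p_h = 0.5, 0.4, 0.2

VARS = ("a", "f", "h")
ROWS = [
    (("0", "*", "*"), 1 - p_a),                # a = 0
    (("1", "0", "*"), p_a * (1 - p_f)),        # a = 1, f = 0
    (("1", "1", "0"), p_a * p_f * (1 - p_h)),  # a = 1, f = 1, h = 0
]

print(marginalize(VARS, ROWS, 2))  # rows over (a, f): nothing merges yet
print(marginalize(VARS, ROWS, 1))  # rows over (a): the two a = 1 rows merge
```

Dropping only the last column keeps every row of the form "ones, then a zero, then wildcards", which is why the rows stay mutually exclusive; this relies on the message variables being ordered before the variables that are summed out.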
Messages into C1 (a, b, f), where $\bar{p}_x = 1 - p_x$:

$\mu_{3,1}(a, f) = \Pi_{a,f}\, C_3(a, f, h)$

C3 (a, f, h)
a   f   h   P
0   *   *   $\bar{p}_a$
1   0   *   $p_a\,\bar{p}_f$
1   1   0   $p_a\,p_f\,\bar{p}_h$

$\mu_{3,1}(a, f)$
a   f   P
0   *   $\bar{p}_a$
1   0   $p_a\,\bar{p}_f$
1   1   $p_a\,p_f\,\bar{p}_h$

$\mu_{4,1}(b) = \Pi_{b}\, C_4(b, i)$

C4 (b, i)
b   i   P
0   *   $\bar{p}_b$
1   0   $p_b\,\bar{p}_i$

$\mu_{4,1}(b)$
b   P
0   $\bar{p}_b$
1   $p_b\,\bar{p}_i$

$C_1(a, b, f) = \mu_{3,1}(a, f) \times \mu_{4,1}(b)$
a   b   f   P
0   *   *   $\bar{p}_a\,(\bar{p}_b + p_b\,\bar{p}_i)$
1   0   *   $p_a\,\bar{p}_b\,(\bar{p}_f + p_f\,\bar{p}_h)$
1   1   0   $p_a\,p_b\,\bar{p}_f\,\bar{p}_i$

All messages, sent towards the root Cr (a, b, c, d, e):
  C3 (a, f, h) → C1 (a, b, f):     $\mu_{3,1}(a, f)$
  C4 (b, i) → C1 (a, b, f):        $\mu_{4,1}(b)$
  C5 (c, e, j) → C2 (c, d, e, g):  $\mu_{5,2}(c, e)$
  C6 (d, g, k) → C2 (c, d, e, g):  $\mu_{6,2}(d, g)$
  C1 (a, b, f) → Cr:               $\mu_{1,r}(a, b)$
  C2 (c, d, e, g) → Cr:            $\mu_{2,r}(c, d, e)$

$F_r(a, b, c, d, e)$
a   b   c   d   e   P
0   *   *   *   *   $\bar{p}_a\,(\bar{p}_b + p_b\,\bar{p}_i) \cdot \mu_{2,r}(*,*,*)$
1   0   *   *   *   $p_a\,\bar{p}_b\,(\bar{p}_f + p_f\,\bar{p}_h) \cdot \mu_{2,r}(*,*,*)$
1   1   0   *   *   $p_a\,p_b\,\bar{p}_f\,\bar{p}_i \cdot \bar{p}_c\,[\bar{p}_d + p_d\,(\bar{p}_g + p_g\,\bar{p}_k)]$
1   1   1   0   *   $p_a\,p_b\,\bar{p}_f\,\bar{p}_i \cdot p_c\,\bar{p}_d\,(\bar{p}_e + p_e\,\bar{p}_j)$
1   1   1   1   0   $p_a\,p_b\,\bar{p}_f\,\bar{p}_i \cdot p_c\,\bar{p}_e\,p_d\,(\bar{p}_g + p_g\,\bar{p}_k)$

Complexity Analysis
• Let $n$ be the size of the largest factor.
• Each node can have at most $n$ children.
• Therefore, each entry in the factor is updated at most $n$ times.
• Overall: $O(N \cdot n^2)$

Conclusions
• Define disjoint branch lineage expressions
• Provide an algorithm for computing the probability of disjoint branch lineage expressions in PTIME: $O(N \cdot n^2)$

Future Work
• Are there other structural properties of junction trees that can facilitate efficient probabilistic inference?
• Real data is correlated
  – Drop the tuple-independence assumption
• Characterize queries and DB instances which induce lineage with "efficient" junction trees.

Thank You