Lineage Processing over Correlated Probabilistic Databases
Bhargav Kanagal, Amol Deshpande
University of Maryland

Motivation: Information Extraction/Integration
Structured entities extracted from text in the internet
[Figure: the text snippet "...located at 52 A Goregaon West Mumbai ..." passes through ADDRESS SEGMENTATION, INFORMATION EXTRACTION and SENTIMENT ANALYSIS; resulting tables: Location, CarAds, Reputed; CORRELATIONS among the extracted tuples]
[Gupta & Sarawagi 2006, Jayram et al. 2006]

Why Lineage Processing?
Tables: Location, CarAds, Reputed
List all “reputed” car sellers in “Mumbai” who offer Honda cars

    SELECT SellerId
    FROM Location, CarAds, Reputed
    WHERE reputation = 'good' AND city = 'Mumbai'
      AND Location.SellerId = CarAds.SellerId
      AND CarAds.SellerId = Reputed.SellerId

We need to compute the probability of the boolean formula (the lineage) of each result tuple
[Das Sarma et al. 2006]

Motivation: RFID based Event Monitoring
A building instrumented with RFID readers to track assets / personnel
• RFID readings are noisy
  – Miss readings
  – Add spurious readings
• Subjected to probabilistic modeling
• Probabilities associated with events
• Spatial and temporal correlations
Example event: found(PC, X, 2pm), prob = 0.9
Was the PC correctly transferred from room A to the conference room?
found(x,PC) ∧ found(z,PC) ∧ [found(y1,PC) ∨ found(y2,PC)]
[RFID Ecosystem UW, Diao et al. 2009, Letchner et al. 2009, KD 2008]

PrDB System Overview
User:
• Insert data + correlations
    insert into reputation values ('z1', 219, uncertain('Good 0.5; Bad 0.5'));
    insert factor '0 0 1; 1 1 1' in address on 'y1.e','y2.e';
• Issue
  – SPJ queries
  – Inference queries
  – Aggregation queries
Components: Query Processor, PARSER, INDSEP Manager
A Relational DBMS stores: Data tables, INDSEP Indexes, Uncertainty Parameters
[Kanagal & Deshpande SIGMOD 2009, SDG08, www.cs.umd.edu/~amol/PrDB/]

Outline
• Motivation & Problem definition [done]
• Background
  – Probabilistic Databases as Junction trees
  – Query processing over Junction trees
  – INDSEP
• Lineage Processing over Junction trees
• Lineage Processing using INDSEP
• Results

Background: ProbDBs as Junction trees
id   Y    Exists ?
1    34   a ?
2    33   b ?
3    25   ? c
..   ..   ..
5    11   q ?
Random variable: 1 if the tuple exists, 0 otherwise
• Tuple uncertainty
• Attribute uncertainty: converted to tuple uncertainty
• Correlations
Concise encoding of the joint probability distribution
Query evaluation is performed directly over junction trees
Forest of junction trees

Background: Junction trees
Cliques: p(a,b,c), p(b,c,d)
Separator: p(b,c)
Each clique and separator stores a joint pdf (POTENTIAL)
Tree structure reflects the Markov property
Joint distribution: p(a,b,c,d) = p(a,b,c) · p(b,c,d) / p(b,c)
Marginal: p(a,d)
Given b, c: a is independent of d

Marginal Computation
Steiner tree + send messages toward a given pivot node
Query: {b, c, n}
[Figure: junction tree with the PIVOT node marked; keep query variables, keep correlations, remove others]
For ProbDBs ≈ 1 million tuples, not scalable
(1) Span of the query can be very large: almost the complete database is accessed even for a 3-variable query
(2) Searching for cliques is expensive: a linear scan over all the nodes is inefficient

Shortcut Potentials
How can we make marginal computation scalable?
Shortcut potential: junction tree on the set of variables {c, f, g, j, k, l, m}
(1) Boundary separators
(2) Distribution required to completely shortcut the partition
(3) Which to build?
[Figure labels: 100 ops; 50 ops]

INDSEP - Overview
Obtained by hierarchical partitioning of the junction tree
Contents of the Root node:
1. Variables: {a,b,..} {c,f,..} {j,n..q}
2. Child Separators: p(c), p(j)
3. Tree induced on the children
4. Shortcut potentials of children: {p(c), p(c,j), p(j)}
[Figure: index tree with Root; internal nodes I1, I2, I3; leaf partitions P1, P2, P3, P4, P5, P6]
Actual Construction: [Kanagal & Deshpande SIGMOD 2009]

Computing Marginals using INDSEP
Recursion on INDSEP
Query: {b, c, n}
[Figure: Root with children I1, I2, I3 and partitions P1 to P6; labels on the recursion: {b, c, n}, {b, c}, {c, j}, {j, n}, {n}; result assembled in an intermediate junction tree]
[Kanagal & Deshpande SIGMOD 2009]

Outline
• Motivation & Problem definition [done]
• Background [done]
  – Junction trees & Query processing over junction trees
  – INDSEP
• Lineage Processing over Junction trees
• Lineage Processing using INDSEP
• Results

Lineage Processing
Typically classified into 2 types
• Read-Once: (a∧b)∨(c∧d)
• Non-Read-Once: (a∧b)∨(b∧c)∨(c∧d)
The problem of lineage processing is #P-complete in general for correlated probabilistic databases, even for read-once lineages
Reduction from #DNF

Lineage Processing on Junction trees
Naïve: Evaluate marginal query over variables in formula
Query: (a∧b)∨(c∧d)
p(a, b, c, d)
  → Multiply with p(a∧b|a,b): p(a, b, a∧b, c, d)
  → Eliminate a, b: p(a∧b, c, d)
  → Multiply: p(a∧b, c, d, c∧d)
  → Eliminate: p(a∧b, c∧d)
  → Multiply / Eliminate: p((a∧b)∨(c∧d))
COMPLEXITY
(1) Simplification (name of the above process)
(2) Dependent on the size of the intermediate pdf
(3) Here, it is at least (n+1) (#terms in the formula)
(4) Not scalable to large formulae

Lineage Processing [Optimization opportunities]
1. EAGER
Exploit conditional independence & simplify early
Query: (a∧b)∨(c∧d)
[Figure labels: p(a, c, d); p(a, d); p(a, c∧d); p(a, d); PIVOT]
[Kanagal & Deshpande SIGMOD 2010]

Lineage Processing [Optimization opportunities]
2. EAGER+ORDER
Distribute simplification into the product
Query: (c∧h)∨(m∧n)
[Figure labels: p(f, h); p(c, f, g); p(g, m∧n); p(c, f, g, h, m∧n); p(c, f, g, h); p(c, h, m∧n); p(g, c∧h); p(c∧h, m∧n); p((c∧h)∨(m∧n)); Max pdf: 5]
How to compute a good ordering?
[Kanagal & Deshpande SIGMOD 2010]
[Figure label: Max pdf: 4]

Lineage Processing [Pivot Selection]
Also influences the intermediate pdf size
Query: (b∧c)∨g
Pivot = (ab): Max pdf: 3
Pivot = (cfg): Max pdf: 4
Optimal Pivot: Only n possible choices, estimate pdf size for each pivot location

Outline
• Motivation & Problem definition [done]
• Background [done]
  – Junction trees & Query processing over junction trees
  – INDSEP
• Lineage Processing over Junction trees [done]
• Lineage Processing using INDSEP
• Results

Lineage Processing using INDSEP
Query: (b∧c) ∨ ((d∨e) ∧ (n∨o))
[Figure: Root with children I1, I2, I3 and partitions P1 to P6; labels: {b∧c, d∨e, c}, {b, c, d, e}, {c, j}, {j, n∨o}, {n, o}]
Recursion bottomed out using EAGER+ORDER
But what is the running time?

Lineage Planning Phase
Query: (b∧c) ∨ ((d∨e) ∧ (n∨o))
QUERY PLAN
Estimate maximum intermediate pdf size at each node
[Figure: per-node estimates 4, 4, 5, 6, 4, 7, 4]
• If a node exceeds a threshold, do approximations to estimate probability
• In addition, modify query plan for:
  – Multiple lineages that share variables
  – Exploiting disconnections

Results
Datasets
(1) D1: Fully independent
(2) D2: Correlated
(3) D3: Highly Correlated (long chains)
Comparison Systems
(1) NAIVE
(2) EAGER
(3) EAGER + ORDER
[Plot: Query processing times for different heuristics. NOTE: LOG scale]
EAGER+ORDER is much more efficient than others

Results
Highly dependent on size of lineage
Multiquery processing exploits sharing
[Plots: Query processing time vs Lineage size; Ratio vs Sharing factor. NOTE: LOG scale]

Conclusions
• Proposed a scalable system for evaluating boolean formula queries over correlated probabilistic databases
• Future
  – Plan to further the approximation approaches
  – Envelopes of boolean formulas for upper and lower bounds

Thank you

Lineage Processing (contd.)
Construct complete graph on factors to be multiplied
Edge weight: amount of simplification possible when nodes are multiplied
[Figure: nodes p(f, h), p(c, f, g), p(g, c∧h); edge label 4-2]
1. Pick the biggest edge
2. Merge / Simplify nodes together
3. Recompute new edge weights

Lineage Processing via INDSEP [Improvement 1]
Multiple Lineage Processing: Exploit possibility of sharing
Queries: (m∧c)∨g and (n∧c)∨g
[Figure: Root with children I1, I2, I3 and partitions P1 to P6; labels: {j, m}, {c, g, j}, {c, g, j}, {j, n}]
Sharing across multiple levels
Need not even share variables, just paths

Lineage Processing via INDSEP [Improvement 2]
Extend to forest of junction trees: Real world data sets may have independences
Index constructed to minimize disk wastage, combining forests together
Query: (a∧o)
[Figure: Root with children I1, I2, I3 and partitions P1 to P6; labels: {a, c}, {a, c}, {j, o}, {c, j}, {j}, {o}]
j and o are disconnected !!
a and o are disconnected !!
Preprocess formula, keep variables in connected components together

Lineage Processing via INDSEP [Improvement 3]
What about complexity? Complexity not evident from the algorithm
[Figure: Root with children I1, I2, I3 and partitions P1 to P6; labels: {b, c, d, e}, {b∧c, d∨e, c}, {b∧c, c}, {d∨e}, {c, j}, {j, n}, {n, o}, {j, n∨o}, {o}; "Compute lwidth here" marked at two nodes; intermediate junction tree]
“Predict” how large the intermediate cliques will be
Approximate for all portions whose estimate is more than a threshold, e.g., 10
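
Backup: worked sketch of the naive multiply / eliminate procedure

The naive procedure from the "Lineage Processing on Junction trees" slide can be made concrete on a toy example. The sketch below is not PrDB code: the `Factor` class, the helpers `multiply`, `eliminate`, `gate` and `cpd`, and all probability values are assumptions invented for illustration. It builds a correlated joint p(a, b, c, d) as a chain, multiplies in deterministic factors such as p(a∧b | a, b), eliminates the consumed variables, and records the widest intermediate pdf. The result is checked against brute-force enumeration.

```python
from itertools import product

class Factor:
    """Hypothetical helper: a table over named binary variables."""
    def __init__(self, variables, table):
        self.vars = tuple(variables)
        self.table = dict(table)  # maps a tuple of 0/1 values to a weight

def multiply(f, g):
    """Pointwise product of two factors over the union of their variables."""
    variables = f.vars + tuple(v for v in g.vars if v not in f.vars)
    table = {}
    for values in product((0, 1), repeat=len(variables)):
        env = dict(zip(variables, values))
        fv = f.table[tuple(env[v] for v in f.vars)]
        gv = g.table[tuple(env[v] for v in g.vars)]
        table[values] = fv * gv
    return Factor(variables, table)

def eliminate(f, drop):
    """Sum out every variable listed in `drop`."""
    keep = tuple(v for v in f.vars if v not in drop)
    table = {}
    for values, weight in f.table.items():
        env = dict(zip(f.vars, values))
        key = tuple(env[v] for v in keep)
        table[key] = table.get(key, 0.0) + weight
    return Factor(keep, table)

def gate(out, op, x, y):
    """Deterministic factor p(out | x, y) for out = op(x, y)."""
    table = {(xv, yv, ov): float(ov == op(xv, yv))
             for xv, yv, ov in product((0, 1), repeat=3)}
    return Factor((x, y, out), table)

def cpd(child, parent, p_if_0, p_if_1):
    """p(child | parent), given p(child=1 | parent=0) and p(child=1 | parent=1)."""
    table = {}
    for pv, p1 in ((0, p_if_0), (1, p_if_1)):
        table[(pv, 1)] = p1
        table[(pv, 0)] = 1.0 - p1
    return Factor((parent, child), table)

AND = lambda x, y: x & y
OR = lambda x, y: x | y

# Illustrative correlated joint p(a, b, c, d) = p(a) p(b|a) p(c|b) p(d|c).
# All numbers are made up.
joint = Factor(("a",), {(0,): 0.4, (1,): 0.6})
for child, parent, p0, p1 in (("b", "a", 0.2, 0.7),
                              ("c", "b", 0.3, 0.9),
                              ("d", "c", 0.5, 0.8)):
    joint = multiply(joint, cpd(child, parent, p0, p1))

# Naive evaluation of (a AND b) OR (c AND d): multiply, then eliminate.
steps = ((gate("ab", AND, "a", "b"), {"a", "b"}),
         (gate("cd", AND, "c", "d"), {"c", "d"}),
         (gate("f", OR, "ab", "cd"), {"ab", "cd"}))
current, max_width = joint, 0
for g, drop in steps:
    widened = multiply(current, g)
    max_width = max(max_width, len(widened.vars))  # widest intermediate pdf
    current = eliminate(widened, drop)
result = current.table[(1,)]

# Brute-force check directly on the joint.
brute = sum(w for (a, b, c, d), w in joint.table.items() if (a & b) | (c & d))
print(result, brute, max_width)
```

In this toy run the widest intermediate pdf has five variables, one more than the four formula variables, because the full joint is materialized before any simplification. EAGER and EAGER+ORDER aim to shrink exactly this quantity by simplifying before the full product is formed.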