Lineage Processing over Correlated Probabilistic Databases
Bhargav Kanagal
Amol Deshpande
University of Maryland
Motivation: Information Extraction/Integration
Structured entities extracted from text in the internet
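Example text: "...located at 52 A Goregaon West Mumbai ..."
[Figure: ADDRESS SEGMENTATION, INFORMATION EXTRACTION and SENTIMENT ANALYSIS applied to the text populate the tables Location, CarAds and Reputed; the extracted data carries CORRELATIONS]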
[Gupta&Sarawagi’2006, Jayram et al. 2006]
Why Lineage Processing ?
Location
CarAds
Reputed
List all “reputed” car sellers in “Mumbai” who offer Honda cars
SELECT Location.SellerId
FROM Location, CarAds, Reputed
WHERE reputation = 'good' AND city = 'Mumbai' AND
Location.SellerId = CarAds.SellerId AND
CarAds.SellerId = Reputed.SellerId
We need to compute the probability
of the above boolean formula
[Das Sarma et al. 2006]
Motivation: RFID based Event Monitoring
A building instrumented with RFID readers to track assets / personnel
• RFID readings are noisy
– Miss readings
– Add spurious readings
• Subjected to probabilistic modeling
• Probabilities associated with events
• Spatial and Temporal correlations
found(PC, X, 2pm), prob = 0.9
Was the PC correctly transferred from
room A to the conference room ?
found(x,PC) ∧found(z,PC)∧
[found(y1,PC)∨found(y2,PC)]
[RFID Ecosystem UW, Diao et al. 2009, Letchner et al. 2009, KD 2008]
PrDB System Overview
User
• Insert data + correlations
insert into reputation values ('z1', 219, uncertain('Good 0.5; Bad 0.5'));
insert factor '0 0 1; 1 1 1' in address on 'y1.e','y2.e';
• Issue
– SPJ queries
– Inference queries
– Aggregation queries
[Figure: system architecture with PARSER, Query Processor and INDSEP Manager on top of A Relational DBMS that stores Data tables, INDSEP Indexes and Uncertainty Parameters]
[Kanagal & Deshpande SIGMOD 2009, SDG08, www.cs.umd.edu/~amol/PrDB/]
Outline
• Motivation & Problem definition [done]
• Background
– Probabilistic Databases as Junction trees
– Query processing over Junction trees
– INDSEP
• Lineage Processing over Junction trees
• Lineage Processing using INDSEP
• Results
Background: ProbDBs as Junction trees
id   Y    Exists ?
1    34   a ?
2    33   b ?
3    25   c ?
..   ..   ..
5    11   q ?
Random Variable: 1 tuple exists, 0 otherwise
Tuple Uncertainty
Attribute Uncertainty
Converted to Tuple Uncertainty
Correlations
Concise encoding of the joint probability distribution
Query evaluation is performed
directly over Junction Trees
Forest of junction trees
Background: Junction trees
Clique: p(a,b,c), p(b,c,d)
Separator: p(b,c)
Each clique and separator stores joint pdf (POTENTIAL)
Tree structure reflects Markov property
Joint distribution
Marginal: p(a,d)
Given b, c: a independent of d
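As a minimal sketch of how the two cliques and the separator encode the joint, the Python below uses the standard junction tree identity p(a,b,c,d) = p(a,b,c) p(b,c,d) / p(b,c) and sums out b and c to obtain the marginal p(a,d). The function `joint` and all numbers in it are made up for illustration; they are only chosen so that a is independent of d given b, c.

```python
from itertools import product

def joint(a, b, c, d):
    """Hypothetical joint pdf in which a is independent of d given (b, c)."""
    p_bc = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}[(b, c)]
    p_a1 = 0.2 + 0.5 * b + 0.2 * c          # p(a=1 | b, c), illustrative
    p_d1 = 0.9 - 0.3 * b - 0.4 * c          # p(d=1 | b, c), illustrative
    return p_bc * (p_a1 if a else 1 - p_a1) * (p_d1 if d else 1 - p_d1)

bits = (0, 1)

# Potentials stored in the tree: two cliques and the separator between them.
clique_abc = {(a, b, c): sum(joint(a, b, c, d) for d in bits)
              for a, b, c in product(bits, repeat=3)}
clique_bcd = {(b, c, d): sum(joint(a, b, c, d) for a in bits)
              for b, c, d in product(bits, repeat=3)}
sep_bc = {(b, c): sum(joint(a, b, c, d) for a in bits for d in bits)
          for b, c in product(bits, repeat=2)}

# Marginal p(a, d): multiply the cliques, divide by the separator, sum out b, c.
marginal_ad = {
    (a, d): sum(clique_abc[a, b, c] * clique_bcd[b, c, d] / sep_bc[b, c]
                for b, c in product(bits, repeat=2))
    for a, d in product(bits, repeat=2)
}

# Reference answer obtained by summing the full joint directly.
brute_ad = {(a, d): sum(joint(a, b, c, d) for b, c in product(bits, repeat=2))
            for a, d in product(bits, repeat=2)}
```

The two dictionaries agree up to floating-point error, which is the point of the tree: the marginal is computed from the stored potentials alone, without materializing the full joint.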
Marginal Computation
Steiner tree + Send messages toward a given pivot node (PIVOT)
Query: {b, c, n}
For ProbDBs ≈ 1 million tuples, not scalable
(1) Span of the query can be very large – almost the complete database accessed even for a 3 variable query
(2) Searching for cliques is expensive: Linear scan over all the nodes is inefficient
Steiner tree: Keep query variables, Keep correlations, Remove others
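The Steiner tree step (keep the query variables and the correlations between them, remove everything else) can be sketched as leaf pruning. This is an illustrative sketch, not the PrDB implementation: the tree, the clique names and the helper `steiner_subtree` are hypothetical. A leaf clique is dropped when every query variable it holds also appears in its only neighbor, which is safe under the running intersection property of junction trees.

```python
def steiner_subtree(cliques, edges, query):
    """Return the names of the cliques that must stay to answer `query`.

    cliques: dict mapping a clique name to its set of variables
    edges:   list of (name, name) pairs forming a tree
    query:   iterable of query variables
    """
    adj = {name: set() for name in cliques}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    query = set(query)
    changed = True
    while changed:
        changed = False
        for node in list(adj):
            if len(adj[node]) == 1:
                (nbr,) = adj[node]
                # The leaf adds nothing if its query variables are covered by the neighbor.
                if cliques[node] & query <= cliques[nbr]:
                    adj[nbr].discard(node)
                    del adj[node]
                    changed = True
    return set(adj)

# Hypothetical junction tree (not the one drawn on the slide).
cliques = {
    "ab": {"a", "b"}, "bc": {"b", "c"}, "cf": {"c", "f"}, "fg": {"f", "g"},
    "cj": {"c", "j"}, "jn": {"j", "n"}, "no": {"n", "o"},
}
edges = [("ab", "bc"), ("bc", "cf"), ("cf", "fg"),
         ("bc", "cj"), ("cj", "jn"), ("jn", "no")]

kept = steiner_subtree(cliques, edges, {"b", "c", "n"})
```

For the query {b, c, n} only the path bc, cj, jn survives in this toy tree. Note that the pruning itself still scans every node, which is the linear-scan cost named in point (2).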
Shortcut Potentials
How can we make marginal computation scalable ?
Shortcut Potential
Junction tree on set variables {c, f, g, j, k, l, m}
(1) Boundary separators
(2) Distribution required to completely shortcut the partition
(3) Which to build ?
[Figure annotations: 100 ops, 50 ops]
INDSEP - Overview
Obtained by hierarchical partitioning of the junction tree
1. Variables: {a,b,..} {c,f,..} {j,n..q}
2. Child Separators: p(c), p(j)
3. Tree induced on the children
4. Shortcut potentials of children: {p(c), p(c,j), p(j)}
[Figure: index tree with Root, internal nodes I1, I2, I3 and leaf partitions P1 to P6]
Actual Construction: [Kanagal & Deshpande SIGMOD 2009]
Computing Marginals using INDSEP
Recursion on INDSEP
Query: {b, c, n}
[Figure: the query {b, c, n} is pushed down the index tree (Root; I1, I2, I3; P1 to P6); node labels {b, c}, {c, j}, {j, n}, {n}; the returned results form an Intermediate Junction tree]
[Kanagal & Deshpande SIGMOD 2009]
Outline
• Motivation & Problem definition [done]
• Background [done]
– Junction trees & Query processing over junction trees
– INDSEP
• Lineage Processing over Junction trees
• Lineage Processing using INDSEP
• Results
Lineage Processing
Typically classified into 2 types
Read-Once: (a∧b)∨(c∧d)
Non-Read-Once: (a∧b)∨(b∧c)∨(c∧d)
The problem of lineage processing is #P-complete in general for correlated probabilistic databases, even for read-once lineages
Reduction from #DNF
Lineage Processing on Junction trees
Naïve:
Evaluate marginal query over variables in formula
Query: (a∧b)∨(c∧d)
p(a, b, c, d)
→ Multiply with p(a∧b|a,b): p(a, b, a∧b, c, d)
→ Eliminate a,b: p(a∧b, c, d)
→ Multiply: p(a∧b, c, d, c∧d)
→ Eliminate: p(a∧b, c∧d)
→ Multiply / Eliminate: p((a∧b)∨(c∧d))
Simplification (name of the above process)
COMPLEXITY
(1)
(2) Dependent on the size of the intermediate pdf
(3) Here, it is at least (n+1) (#terms in the formula)
(4) Not scalable to large formulae
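The naïve multiply / eliminate chain for (a∧b)∨(c∧d) can be sketched in Python as follows. This is an illustration under stated assumptions, not the system's code: the helpers `add_derived` and `eliminate` and the joint pdf (arbitrary normalized weights) are hypothetical. Multiplying with a deterministic factor such as p(a∧b|a,b) is stored sparsely here, so only the consistent rows are kept.

```python
from itertools import product

def add_derived(factor, name, inputs, fn):
    """Multiply with the deterministic factor p(name | inputs); zero rows are omitted."""
    names, table = factor
    idx = [names.index(v) for v in inputs]
    new_table = {}
    for assign, p in table.items():
        value = int(fn(*(assign[i] for i in idx)))
        new_table[assign + (value,)] = p
    return names + (name,), new_table

def eliminate(factor, drop):
    """Sum out the variables in `drop`."""
    names, table = factor
    keep = [i for i, v in enumerate(names) if v not in drop]
    out = {}
    for assign, p in table.items():
        key = tuple(assign[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return tuple(names[i] for i in keep), out

# Hypothetical correlated joint p(a, b, c, d): arbitrary positive weights, normalized.
weights = {x: 1 + x[0] + 2 * x[1] * x[2] + x[0] * x[3]
           for x in product((0, 1), repeat=4)}
total = sum(weights.values())
joint_table = {x: w / total for x, w in weights.items()}

factor = (("a", "b", "c", "d"), dict(joint_table))
sizes = [len(factor[0])]                      # number of variables in each intermediate pdf
steps = [
    ("a∧b", ("a", "b"), lambda a, b: a and b),
    ("c∧d", ("c", "d"), lambda c, d: c and d),
    ("lineage", ("a∧b", "c∧d"), lambda x, y: x or y),
]
for name, inputs, fn in steps:
    factor = add_derived(factor, name, inputs, fn)    # multiply
    sizes.append(len(factor[0]))
    factor = eliminate(factor, set(inputs))           # eliminate
    sizes.append(len(factor[0]))

probability = factor[1][(1,)]

# Reference answer by enumerating all assignments of the joint.
brute = sum(p for (a, b, c, d), p in joint_table.items() if (a and b) or (c and d))
```

In this run the widest intermediate pdf is p(a, b, a∧b, c, d) with five variables, one more than the number of variables in the formula, which is why the approach does not scale to large formulae.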
Lineage Processing [Optimization opportunities]
1. EAGER
Exploit conditional independence & simplify early
Query: (a∧b)∨(c∧d)
[Figure: messages sent toward the PIVOT, labeled p(a, c, d), p(a, d), p(a, c∧d), p(a, d)]
[Kanagal & Deshpande SIGMOD 2010]
Lineage Processing [Optimization opportunities]
2. EAGER+ORDER
Distribute simplification into the product
Query: (c∧h)∨(m∧n)
Factors: p(f, h), p(c, f, g), p(g, m∧n)
Ordering 1: p(c, f, g, h, m∧n) → p(c, h, m∧n) → p(c∧h, m∧n) → p((c∧h)∨(m∧n))   Max pdf: 5
Ordering 2: p(c, f, g, h) → p(g, c∧h), then with p(g, m∧n) → p(c∧h, m∧n) → p((c∧h)∨(m∧n))   Max pdf: 4
How to compute good ordering ?
[Kanagal & Deshpande SIGMOD 2010]
Lineage Processing [Pivot Selection]
Also influences the intermediate pdf size
(b∧c)∨g
Pivot = (ab)
Max pdf: 3
Pivot = (cfg)
Max pdf: 4
Optimal Pivot: Only n possible choices, estimate pdf size for each pivot location
Outline
• Motivation & Problem definition [done]
• Background [done]
– Junction trees & Query processing over junction trees
– INDSEP
• Lineage Processing over Junction trees [done]
• Lineage Processing using INDSEP
• Results
Lineage Processing using INDSEP
(b∧c) ∨((d∨e) ∧(n∨o) )
[Figure: the lineage is pushed down the index tree (Root; I1, I2, I3; P1 to P6); node labels {b∧c, d∨e, c}, {b, c, d, e}, {c, j}, {j, n∨o}, {n, o}]
Recursion bottomed out using EAGER+ORDER
But what is the running time ?
Lineage Planning Phase
(b∧c) ∨((d∨e) ∧(n∨o) )
QUERY PLAN
Estimate maximum intermediate pdf size at each node
• If a node exceeds a threshold, do approximations to estimate probability
• In addition, modify query plan for:
– Multiple lineages that share variables
– Exploiting disconnections
[Figure: query plan with the estimated sizes 4, 4, 5, 6, 4, 7, 4 at its nodes]
Results
Datasets
(1) D1: Fully independent
(2) D2: Correlated
(3) D3: Highly Correlated (long chains)
Comparison Systems
(1) NAIVE
(2) EAGER
(3) EAGER + ORDER
[Plot: Query Processing times for different heuristics. NOTE: LOG scale]
EAGER+ORDER is much more efficient than others
Results
[Plot: Query Processing time vs Lineage size. NOTE: LOG scale]
Highly dependent on size of lineage
[Plot: Ratio vs Sharing factor]
Multiquery processing exploits sharing
Conclusions
• Proposed a scalable system for evaluating
boolean formula queries over correlated
probabilistic databases
• Future
– Plan to further the approximation approaches
– Envelopes of boolean formulas for upper and
lower bounds
Thank you 
Lineage Processing (contd.)
Construct complete graph on factors to be multiplied
Edge weight: Amount of simplification possible when nodes are multiplied
Example: p(f, h) and p(c, f, g) multiply and simplify to p(g, c∧h), weight 4-2
1. Pick the biggest edge
2. Merge / Simplify nodes together
3. Recompute new edge weights
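The greedy ordering heuristic (complete graph on the factors, pick the biggest edge, merge / simplify, recompute the weights) can be sketched on variable sets alone. Everything here is an assumption made for illustration: the helpers `simplify` and `greedy_order`, the rule that an atom may be eliminated once no other factor and no lineage term needs it, and the edge weight taken as the number of variables in the product minus the number left after simplification (the reading suggested by the "4-2" annotation). Atoms are strings; a folded AND-term such as c∧h is represented by a frozenset of its atoms.

```python
from itertools import combinations

def simplify(scope, others, terms):
    """Variables left after multiplying two factors (sizes only, no numbers)."""
    needed = set().union(*others)             # symbols the remaining factors still mention
    lineage_atoms = set().union(*terms)
    out = set(scope)
    for term in terms:
        # Fold a complete AND-term into one symbol once no other factor needs its atoms.
        if term <= out and not (term & needed):
            out = (out - term) | {term}
    # Eliminate atoms that neither the lineage nor another factor needs.
    return frozenset(s for s in out
                     if isinstance(s, frozenset) or s in needed or s in lineage_atoms)

def greedy_order(factors, terms):
    """Return the chosen merge steps and the largest intermediate product size."""
    factors = [frozenset(f) for f in factors]
    max_size = max(len(f) for f in factors)
    steps = []
    while len(factors) > 1:
        best = None
        for i, j in combinations(range(len(factors)), 2):
            others = [f for k, f in enumerate(factors) if k not in (i, j)]
            prod = factors[i] | factors[j]
            reduced = simplify(prod, others, terms)
            weight = len(prod) - len(reduced)          # amount of simplification
            if best is None or weight > best[0]:
                best = (weight, i, j, prod, reduced)
        _, i, j, prod, reduced = best
        steps.append((factors[i], factors[j]))
        max_size = max(max_size, len(prod))
        factors = [f for k, f in enumerate(factors) if k not in (i, j)] + [reduced]
    return steps, max_size

# Factors p(f, h), p(c, f, g), p(g, m∧n) for the lineage (c∧h)∨(m∧n).
CH, MN = frozenset("ch"), frozenset("mn")
factors = [{"f", "h"}, {"c", "f", "g"}, {"g", MN}]
steps, max_pdf = greedy_order(factors, [CH, MN])
multiply_all_first = len(set().union(*factors))
```

On this example the heuristic first merges p(f, h) with p(c, f, g), whose four-variable product simplifies to the two-variable p(g, c∧h), and the largest product it ever builds has four variables, against five when all three factors are multiplied before simplifying.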
Lineage Processing via INDSEP [Improvement 1]
Multiple Lineage Processing: Exploit possibility of sharing
(m∧c)∨g
(n∧c)∨g
[Figure: both lineages are pushed down the index tree (Root; I1, I2, I3; P1 to P6); node labels {j, m}, {c, g, j}, {c, g, j}, {j, n}]
Sharing across multiple levels
Need not even share variables, just paths
Lineage Processing via INDSEP [Improvement 2]
Extend to forest of junction trees: Real world data sets may have independences
Index constructed to minimize disk wastage, combining forests together
Query: (a∧o)
[Figure: index tree (Root; I1, I2, I3; P1 to P6); node labels {a, c}, {a, c}, {c, j}, {j, o}, {j}, {o}]
j and o are disconnected !!
a and o are disconnected !!
Preprocess formula, keep variables in connected components together
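The preprocessing step can be sketched with a union-find over the variables of each clique: variables that end up in different components belong to different junction trees of the forest and are therefore independent, so for example p(a∧o) = p(a)·p(o) when a and o are disconnected. The helper `group_by_component` and the toy forest below are hypothetical.

```python
def group_by_component(cliques, formula_vars):
    """Group formula variables by the connected component (tree) they live in."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]     # path halving
            x = parent[x]
        return x

    # Variables that share a clique are in the same component.
    for clique in cliques:
        first, *rest = clique
        find(first)
        for v in rest:
            parent[find(v)] = find(first)

    groups = {}
    for v in formula_vars:
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())

# Hypothetical forest: one tree over {a, c, j}, and o in a tree of its own.
forest = [("a", "c"), ("c", "j"), ("o",)]
groups_ao = group_by_component(forest, ["a", "o"])
groups_ajo = group_by_component(forest, ["a", "j", "o"])
```

Each group can then be evaluated on its own tree and the results combined, instead of searching for a path between variables that are not connected at all.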
Lineage Processing via INDSEP [Improvement 3]
What about complexity ?
Complexity not evident from the algorithm
[Figure: index tree (Root; I1, I2, I3; P1 to P6) and the Intermediate junction tree; node labels {b, c, d, e}, {b∧c, d∨e, c}, {b∧c, c}, {d∨e}, {c, j}, {j, n}, {j, n∨o}, {n, o}, {o}; two nodes are marked "Compute lwidth here"]
“Predict” how large the intermediate cliques will be
Approximate for all portions whose estimate is more than a threshold, e.g., 10