Representing and Querying Correlated Tuples in Probabilistic Databases

OUTLINE
General Info
Introduction
Independent tuples model
Tuple correlations
Representing Dependencies
Query evaluation
Experiments
Conclusions & Work to be done
GENERAL INFO
• High demand for storing uncertain data
Issues with the use of probabilistic databases
1) Existing probabilistic databases make simplistic
assumptions about the data that make it difficult to
use them in applications that naturally produce
correlated data
2) Most probabilistic databases can only answer a
restricted subset of the queries that can be
expressed using traditional query languages
A framework that can represent not only
probabilistic tuples but also correlations among
them, to tackle these limitations
INTRODUCTION (1/2)
• Database research has primarily concentrated on how
to store and query exact data
• Many real-world applications produce large amounts of
uncertain data
Databases need to do more than simply
store and retrieve; they have to help the
user sift through the uncertainty and find
the results most likely to be the answer.
INTRODUCTION (2/2)
• Numerous approaches (models) have been proposed to handle
uncertainty.
• However, most models make assumptions about data
uncertainty that restrict their applicability (they cannot
easily model or handle dependencies and correlations
among tuples)
INDEPENDENT TUPLES MODEL (1/2)
One of the most commonly used tuple-level uncertainty models
associates existence probabilities with individual tuples and assumes
that the tuples are independent of each other
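As a minimal sketch of what this model implies, the snippet below enumerates every possible world for three tuples and computes each world's probability as a product of per-tuple terms. The tuple names follow the later slides; the existence probabilities are illustrative assumptions, not values taken from the slides.

```python
from itertools import product

# Hypothetical existence probabilities (illustrative values only).
tuple_probs = {"s1": 0.6, "s2": 0.5, "t1": 0.4}

def possible_worlds(probs):
    """Yield (world, probability) pairs under tuple independence.

    A world is the frozenset of tuples that exist in it.
    """
    names = list(probs)
    for bits in product([False, True], repeat=len(names)):
        p = 1.0
        for name, present in zip(names, bits):
            # Independence: multiply p for present tuples, (1 - p) for absent ones.
            p *= probs[name] if present else 1.0 - probs[name]
        yield frozenset(n for n, b in zip(names, bits) if b), p

worlds = dict(possible_worlds(tuple_probs))
print(len(worlds))           # one world per subset of tuples: 2 ** 3
print(sum(worlds.values()))  # world probabilities add up to 1 (up to rounding)
```

The number of worlds doubles with every additional tuple, which is why enumerating them is only practical for toy inputs like this one.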
INDEPENDENT TUPLES MODEL (2/2)
Evaluating a query via the set of possible worlds is clearly
intractable, as the number of possible worlds grows exponentially
with the number of tuples
Intensional semantics guarantee results in accordance
with possible worlds semantics but are computationally
expensive.
Extensional semantics are computationally cheaper but do not
guarantee results in accordance with the possible worlds
semantics.
o Although base tuples are independent of each other, the intermediate
tuples that are generated during query evaluation are typically correlated
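The gap between the two semantics can be illustrated with a small sketch. It assumes a hypothetical query, a join of S and T followed by a duplicate-eliminating projection, in which both s1 and s2 join with t1; the probabilities are illustrative assumptions. The extensional computation treats the two join results as independent even though both depend on t1.

```python
from itertools import product

# Hypothetical data: both S tuples join with the single T tuple.
p = {"s1": 0.6, "s2": 0.5, "t1": 0.4}

def result_exists(world):
    # The projected result tuple needs t1 and at least one of s1, s2.
    return world["t1"] and (world["s1"] or world["s2"])

def possible_worlds_answer(p):
    """Sum the probabilities of all worlds in which the result tuple exists."""
    names = list(p)
    total = 0.0
    for bits in product([False, True], repeat=len(names)):
        world = dict(zip(names, bits))
        weight = 1.0
        for n in names:
            weight *= p[n] if world[n] else 1.0 - p[n]
        if result_exists(world):
            total += weight
    return total

def extensional_answer(p):
    """Operator-by-operator computation that assumes independent inputs."""
    i1 = p["s1"] * p["t1"]                # join treated as independent AND
    i2 = p["s2"] * p["t1"]
    return 1.0 - (1.0 - i1) * (1.0 - i2)  # projection treated as independent OR

exact = possible_worlds_answer(p)
naive = extensional_answer(p)
print(exact, naive)  # the two values differ because i1 and i2 share t1
```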
TUPLE CORRELATIONS (1/2)
TUPLE CORRELATIONS (2/2)
• Although the tuple probabilities associated with s1, s2
and t1 are identical, the query results are drastically
different across these four databases.
• Since both intensional and extensional semantics
assume base tuple independence, neither can be
directly used to do query evaluation in such cases.
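The effect can be reproduced with a small sketch. The slide's four databases are not recoverable from the text, so the three joint distributions below are hypothetical stand-ins that share the same marginals, Pr(s1) = 0.6 and Pr(t1) = 0.4. The query is assumed to return a result tuple whenever t1 exists together with s1 or s2.

```python
# Joint distribution of (X_s1, X_t1) in three hypothetical databases.
# Every table has the same marginals: Pr(s1) = 0.6 and Pr(t1) = 0.4.
scenarios = {
    "independent": {(0, 0): 0.24, (0, 1): 0.16, (1, 0): 0.36, (1, 1): 0.24},
    "mutually exclusive": {(0, 0): 0.0, (0, 1): 0.4, (1, 0): 0.6, (1, 1): 0.0},
    "positively correlated": {(0, 0): 0.4, (0, 1): 0.0, (1, 0): 0.2, (1, 1): 0.4},
}
P_S2 = 0.5  # assumption: s2 is independent of the other tuples in all scenarios

def result_probability(joint_s1_t1, p_s2):
    """Pr(t1 and (s1 or s2)), summed over all possible worlds."""
    total = 0.0
    for (s1, t1), weight in joint_s1_t1.items():
        for s2 in (0, 1):
            world_prob = weight * (p_s2 if s2 else 1.0 - p_s2)
            if t1 and (s1 or s2):
                total += world_prob
    return total

answers = {name: result_probability(j, P_S2) for name, j in scenarios.items()}
for name, value in answers.items():
    print(f"{name}: {value:.2f}")
```

Only the correlation between s1 and t1 changes between scenarios, yet the probability of the result tuple changes with it.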
REPRESENTING CORRELATIONS (1/3)
1) Associate every tuple t with a Boolean-valued random variable Xt
2) A factor f(X) is a function of a (small) set of random variables X, where
0 <= f(X) <= 1
3) Associate with each tuple in the probabilistic database a random
variable
4) Define factors on (sub)sets of tuple-based random variables to
encode correlations.
5) The probability of an instantiation of the database is given by the
product of all the factors.
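A minimal sketch of the product-of-factors rule is given below. The data layout (a factor as a pair of a variable list and a lookup table) and the numeric values are assumptions made for illustration. The example uses one single-variable factor per tuple, which recovers the independent tuples model; correlations would be encoded by swapping in factors over several variables.

```python
from itertools import product

def joint_probability(factors, assignment):
    """Multiply all factors evaluated at one complete assignment.

    factors: list of (variables, table) pairs, where table maps a tuple of
    0/1 values (ordered like `variables`) to a number in [0, 1].
    """
    prob = 1.0
    for variables, table in factors:
        prob *= table[tuple(assignment[v] for v in variables)]
    return prob

# Hypothetical single-variable factors (complete independence).
factors = [
    (("Xs1",), {(0,): 0.4, (1,): 0.6}),
    (("Xs2",), {(0,): 0.5, (1,): 0.5}),
    (("Xt1",), {(0,): 0.6, (1,): 0.4}),
]

names = ["Xs1", "Xs2", "Xt1"]
distribution = {
    bits: joint_probability(factors, dict(zip(names, bits)))
    for bits in product([0, 1], repeat=len(names))
}
print(sum(distribution.values()))  # a valid factor set sums to 1 (up to rounding)
```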
REPRESENTING CORRELATIONS (2/3)
Suppose we want to represent mutual exclusivity between tuples s1 and t1. In
particular, let us try to represent the possible worlds:
REPRESENTING CORRELATIONS (3/3)
Suppose we want to represent positive correlation between t1 and s1. In particular,
let us try to represent the possible worlds:
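The possible-worlds tables from these slides did not survive extraction, so the sketch below uses hypothetical factor values. It shows how a single factor over (Xs1, Xt1) can encode either mutual exclusivity or positive correlation, and lists the worlds that keep non-zero probability in each case. Treating s2 as independent is also an assumption.

```python
from itertools import product

# Hypothetical joint factors over (X_s1, X_t1); keys are (x_s1, x_t1).
mutual_exclusivity = {(0, 0): 0.0, (0, 1): 0.4, (1, 0): 0.6, (1, 1): 0.0}
positive_correlation = {(0, 0): 0.4, (0, 1): 0.0, (1, 0): 0.2, (1, 1): 0.4}
f_s2 = {0: 0.5, 1: 0.5}  # assumption: s2 stays independent of s1 and t1

def worlds_with_support(f_s1_t1, f_s2):
    """Return {world: probability} for the worlds with non-zero probability."""
    worlds = {}
    for s1, s2, t1 in product([0, 1], repeat=3):
        prob = f_s1_t1[(s1, t1)] * f_s2[s2]  # product of all the factors
        if prob > 0.0:
            present = [n for n, x in (("s1", s1), ("s2", s2), ("t1", t1)) if x]
            worlds[frozenset(present)] = prob
    return worlds

exclusive_worlds = worlds_with_support(mutual_exclusivity, f_s2)
correlated_worlds = worlds_with_support(positive_correlation, f_s2)

for world, prob in exclusive_worlds.items():
    print(sorted(world), prob)  # no listed world contains both s1 and t1
```

In the positively correlated table, every world that contains t1 also contains s1, while both tables keep Pr(s1) = 0.6 and Pr(t1) = 0.4.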
PROBABILISTIC GRAPHICAL MODEL REPRESENTATION
A probabilistic graphical model is a graph whose nodes represent random variables
and whose edges represent correlations
[Figure: graphical models over Xt1, Xs1 and Xs2 for three cases: Complete Independence, Mutual Exclusivity, Positive Correlation]
PROBABILISTIC GRAPHICAL MODEL REPRESENTATION
[Figure: graphical model over the random variables X1, X2, X3]
QUERY EVALUATION: BASIC IDEA
• Treat intermediate tuples as regular tuples.
• Carefully represent correlations between intermediate
tuples, base tuples and result tuples to construct a
probabilistic graphical model.
• Cast the probability computations resulting from query
evaluation as inference in probabilistic graphical
models.
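The three steps can be sketched as follows, under stated assumptions: a join is modelled by a deterministic AND factor, a duplicate-eliminating projection by a deterministic OR factor, and the base tuples are independent with illustrative probabilities. The helper names are hypothetical, and the marginal is computed by brute-force enumeration purely to keep the sketch short.

```python
from itertools import product

def prior(var, p):
    # Single-variable factor for an independent base tuple.
    return ((var,), lambda x: p if x else 1.0 - p)

def and_factor(out, a, b):
    # 1 when out == (a AND b): a join result exists iff both inputs exist.
    return ((out, a, b), lambda o, x, y: 1.0 if o == (x and y) else 0.0)

def or_factor(out, a, b):
    # 1 when out == (a OR b): a projected tuple exists iff any input exists.
    return ((out, a, b), lambda o, x, y: 1.0 if o == (x or y) else 0.0)

# Model for project(join(S, T)) with intermediate tuples i1, i2 and result r1.
factors = [
    prior("Xs1", 0.6), prior("Xs2", 0.5), prior("Xt1", 0.4),
    and_factor("Xi1", "Xs1", "Xt1"),
    and_factor("Xi2", "Xs2", "Xt1"),
    or_factor("Xr1", "Xi1", "Xi2"),
]

def marginal(factors, query_var):
    """Pr(query_var = 1), obtained by summing the factor product."""
    variables = sorted({v for scope, _ in factors for v in scope})
    total = 0.0
    for bits in product([0, 1], repeat=len(variables)):
        world = dict(zip(variables, bits))
        if not world[query_var]:
            continue
        weight = 1.0
        for scope, fn in factors:
            weight *= fn(*(world[v] for v in scope))
        total += weight
    return total

p_r1 = marginal(factors, "Xr1")
print(p_r1)
```

Because the AND and OR factors tie each intermediate tuple to its inputs, the shared dependence on Xt1 is carried through to the result instead of being ignored.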
QUERY EVALUATION: EXAMPLE
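QUERY EVALUATION: EXAMPLE PROBABILISTIC GRAPHICAL MODEL
• Query evaluation problem in probabilistic databases:
compute the probability of the result tuple
summed over all possible worlds of the
database
• Equivalent problem in probabilistic graphical models:
marginal probability computation.
• Use inference algorithms
[Figure: graphical model over the base tuple variables Xs1, Xs2, Xt1, the intermediate tuple variables Xi1, Xi2 and the result tuple variable Xr1; a second panel shows Xs2, Xt1, Xi1, Xi2, Xr1 only]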
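The slides do not say which inference algorithm is used, so the sketch below picks variable elimination, one standard exact method, as an assumption. The factor layout, the helper names and the base tuple probabilities are all hypothetical. It computes the marginal of Xr1 by summing out the other variables one at a time.

```python
from itertools import product

def make_factor(scope, fn):
    """Tabulate fn over all 0/1 assignments of the variables in scope."""
    scope = tuple(scope)
    return (scope, {b: fn(*b) for b in product((0, 1), repeat=len(scope))})

def multiply(f, g):
    scope = tuple(dict.fromkeys(f[0] + g[0]))  # ordered union of both scopes
    table = {}
    for bits in product((0, 1), repeat=len(scope)):
        world = dict(zip(scope, bits))
        table[bits] = (f[1][tuple(world[v] for v in f[0])]
                       * g[1][tuple(world[v] for v in g[0])])
    return (scope, table)

def sum_out(f, var):
    scope = tuple(v for v in f[0] if v != var)
    table = {}
    for bits, value in f[1].items():
        key = tuple(b for v, b in zip(f[0], bits) if v != var)
        table[key] = table.get(key, 0.0) + value
    return (scope, table)

def eliminate(factors, order):
    """Sum out the variables in `order`; return the product of what remains."""
    for var in order:
        touching = [f for f in factors if var in f[0]]
        if not touching:
            continue
        rest = [f for f in factors if var not in f[0]]
        prod = touching[0]
        for f in touching[1:]:
            prod = multiply(prod, f)
        factors = rest + [sum_out(prod, var)]
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    return result

factors = [
    make_factor(["Xs1"], lambda x: 0.6 if x else 0.4),
    make_factor(["Xs2"], lambda x: 0.5),
    make_factor(["Xt1"], lambda x: 0.4 if x else 0.6),
    make_factor(["Xi1", "Xs1", "Xt1"], lambda o, a, b: float(o == (a and b))),
    make_factor(["Xi2", "Xs2", "Xt1"], lambda o, a, b: float(o == (a and b))),
    make_factor(["Xr1", "Xi1", "Xi2"], lambda o, a, b: float(o == (a or b))),
]

scope, table = eliminate(factors, ["Xs1", "Xs2", "Xi1", "Xi2", "Xt1"])
p_r1 = table[(1,)]  # marginal probability that the result tuple exists
print(scope, p_r1)
```

The elimination order here is arbitrary; in larger models the order strongly affects the size of the intermediate factors.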
REPRESENTING PROBABILISTIC RELATIONS
EXPERIMENTS (1/3)
• Database contains 860 publications from CiteSeer [GBL98].
• Searched for publications by a given (misspelt) author name.
• Naturally involves mutual exclusivity correlations
EXPERIMENTS (2/3)

Ran experiments on a randomly generated TPC-H dataset of size 10MB.

The first bar on each query indicates the time it took to run the full query
including all the database operations and the probabilistic computations.

The second one indicates the time it took to run only the database operations
using our Java implementation.
EXPERIMENTS (3/3)

The result of running an average query over a synthetically generated
dataset containing tuples
CONCLUSIONS
• There is an increasing need for database solutions for
efficiently managing and querying uncertain data
exhibiting complex correlation patterns.
• A simple and intuitive framework is presented, based
on probabilistic graphical models, for explicitly
modeling correlations among tuples in a probabilistic
database
WORK TO BE DONE
Problem: Although conceptually the approach presented
allows for capturing arbitrary tuple correlations, exact query
evaluation over large datasets exhibiting complex correlations
may not always be feasible.
Future Considerations:
• Develop approximate query evaluation techniques
that can be used in such cases
• Develop disk-based query evaluation algorithms so that
the techniques can scale to very large datasets.