Chapter 13: Incorporating Uncertainty into Data Integration
PRINCIPLES OF DATA INTEGRATION
ANHAI DOAN, ALON HALEVY, ZACHARY IVES
Outline
 Sources of uncertainty in data integration
 Representing uncertain data (brief overview)
 Probabilistic schema mappings
Managing Uncertain Data
 Databases typically model certain data:
 A tuple is either true (in the database) or false (not in the
database).
 Real life involves a lot of uncertainty:
 “The thief had either blond or brown hair”
 The sensor reading is often unreliable.
 Uncertain databases try to model such uncertain
data and to answer queries in a principled fashion.
 Data integration involves multiple facets of
uncertainty!
Uncertainty in Data Integration
 Data itself may be uncertain (perhaps it’s extracted
from an unreliable source)
 Schema mappings can be approximate (perhaps
created by an automatic tool)
 Reference reconciliation (and hence any join that relies on it)
is approximate
 If the domain is broad enough, even the mediated
schema could involve uncertainty
 Queries, often posed as keywords, have uncertain
intent.
Principles of Uncertain Databases
 Instead of describing one possible state of the world,
an uncertain database describes a set of possible
worlds.
 The expressive power of the data model determines
which sets of possible worlds the database can
represent.
 Is uncertainty on values of an attribute?
 Or on the presence of a tuple?
 Can dependencies between tuples be represented?
C-Tables: Uncertainty without
Probabilities
 Alice and Bob want to go on a vacation together, but
will go to either Tahiti or Ulaanbaatar. Candace will
definitely go to Ulaanbaatar.
 Possible worlds result from different assignments to
the variables.
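The c-table itself is not reproduced in this text, so the following is only a minimal sketch of how possible worlds arise from variable assignments. The encoding is an assumption: a single hypothetical variable x shared by Alice and Bob, with domain {Tahiti, Ulaanbaatar}, and a constant destination for Candace.

```python
from itertools import product

# Assumed encoding of the c-table: each row is (person, destination), where a
# destination is either a constant or the name of a variable.
VARIABLES = {"x": ["Tahiti", "Ulaanbaatar"]}  # hypothetical variable and its domain
C_TABLE = [
    ("Alice", "x"),              # Alice goes wherever x says
    ("Bob", "x"),                # Bob shares x, so he always travels with Alice
    ("Candace", "Ulaanbaatar"),  # constant: Candace's destination is certain
]

def possible_worlds(c_table, variables):
    """Yield one ordinary relation per assignment of values to the variables."""
    names = list(variables)
    for values in product(*(variables[n] for n in names)):
        assignment = dict(zip(names, values))
        # Replace variable names by their assigned values; constants pass through.
        yield {(person, assignment.get(dest, dest)) for person, dest in c_table}

worlds = list(possible_worlds(C_TABLE, VARIABLES))
for world in worlds:
    print(sorted(world))
```

With one two-valued variable the sketch yields two worlds, and Alice and Bob share a destination in both.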
Representing Complex
Distributions
 The c-table represents mutual exclusion of tuples,
but doesn’t represent probability distributions.
 Representing complex probability distributions and
correlations between tuples requires using
probabilistic graphical models.
 A couple of simpler models:
 Independent tuple probabilities
 Block independent probabilities
Tuple Independent Model
 Assign each tuple a probability.
 The probability of every possible world is the
product of one factor per row:
 p_i if row i is in the database, and (1 - p_i) if it is not.
 Cannot represent correlations between tuples.
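As a rough illustration of the product rule, the sketch below enumerates every possible world of a tiny tuple-independent table. The rows and their probabilities are made up for the example and do not come from the slides.

```python
from itertools import product
from math import prod

# Hypothetical rows with made-up probabilities, for illustration only.
ROWS = {
    ("Alice", "Tahiti"): 0.7,
    ("Bob", "Tahiti"): 0.4,
    ("Candace", "Ulaanbaatar"): 1.0,
}

def world_probability(world, rows):
    """Multiply p_i for rows present in the world and (1 - p_i) for rows absent."""
    return prod(p if row in world else 1.0 - p for row, p in rows.items())

def all_worlds(rows):
    """Yield every subset of the rows as a frozenset."""
    keys = list(rows)
    for mask in product([False, True], repeat=len(keys)):
        yield frozenset(k for k, keep in zip(keys, mask) if keep)

# Probability distribution over all 2^n possible worlds.
distribution = {w: world_probability(w, ROWS) for w in all_worlds(ROWS)}
```

The probabilities of all worlds sum to 1. Each row is included or excluded independently of the others, so no setting of the p_i can force two rows to appear together.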
Block Independent Model
 You choose one tuple from every block according to
the distribution of that block.
 Can represent mutual exclusion, but not co-dependence
(i.e., Alice and Bob going to the same location).
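A minimal sketch of the block-independent model, assuming one hypothetical block per person and made-up distributions over destinations. One tuple is drawn from each block, and the blocks are drawn independently of each other.

```python
from itertools import product
from math import prod

# Hypothetical blocks: one per person, each a distribution over destinations.
BLOCKS = {
    "Alice": {"Tahiti": 0.6, "Ulaanbaatar": 0.4},
    "Bob": {"Tahiti": 0.3, "Ulaanbaatar": 0.7},
}

def block_worlds(blocks):
    """Yield (world, probability), choosing exactly one tuple from every block."""
    keys = list(blocks)
    for choice in product(*(blocks[k].items() for k in keys)):
        world = frozenset((k, value) for k, (value, _) in zip(keys, choice))
        yield world, prod(p for _, p in choice)

worlds = dict(block_worlds(BLOCKS))

# Probability that Alice and Bob end up in the same place.
same_place = sum(p for w, p in worlds.items() if len({dest for _, dest in w}) == 1)
```

Within a block the alternatives are mutually exclusive, so Alice is never in two places at once. Across blocks the choices are independent, so with these numbers the two travel together with probability 0.6 * 0.3 + 0.4 * 0.7 = 0.46, and the model offers no way to tie Bob's choice to Alice's.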
Probabilistic Schema Mappings
 Source schema:
 S=(pname, email-addr, home-addr, office-addr)
 Target schema:
 T=(name, mailing-addr)
 We may not be sure which attribute of S mailing-addr should map to.
 Probabilistic schema mappings let us handle such
uncertainty.
Probabilistic Schema Mappings
Intuitively, we want to give each mapping a
probability:
 S=(pname, email-addr, home-addr, office-addr)
 T=(name, mailing-addr)
Possible Mapping                               Probability
{(pname,name),(home-addr, mailing-addr)}       0.5
{(pname,name),(office-addr, mailing-addr)}     0.4
{(pname,name),(email-addr, mailing-addr)}      0.1
What are the Semantics?
Should a single mapping apply to the entire table (by-table
semantics), or can different mappings apply to different
tuples (by-tuple semantics)?
By-Table versus By-Tuple Semantics
Ds=
pname   email-addr   home-addr       office-addr
Alice   alice@       Mountain View   Sunnyvale
Bob     bob@         Sunnyvale       Sunnyvale

There are 3 possible databases DT:

DT under m1, Pr(m1)=0.5:
name    mailing-addr
Alice   Mountain View
Bob     Sunnyvale

DT under m2, Pr(m2)=0.4:
name    mailing-addr
Alice   Sunnyvale
Bob     Sunnyvale

DT under m3, Pr(m3)=0.1:
name    mailing-addr
Alice   alice@
Bob     bob@
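To make by-table semantics concrete, the sketch below answers one hypothetical query, "return all mailing-addr values in T", over the example data. The query and the function name are assumptions for illustration. Each possible mapping is identified by the source attribute it maps to mailing-addr, and an answer's probability is the sum of the probabilities of the mappings under which it is returned.

```python
from collections import defaultdict

SOURCE = [
    {"pname": "Alice", "email-addr": "alice@",
     "home-addr": "Mountain View", "office-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@",
     "home-addr": "Sunnyvale", "office-addr": "Sunnyvale"},
]
# m1, m2, m3, keyed by the source attribute mapped to mailing-addr.
P_MAPPING = {"home-addr": 0.5, "office-addr": 0.4, "email-addr": 0.1}

def by_table_answers(source, p_mapping):
    """Probability of each mailing-addr answer when one mapping covers the whole table."""
    answers = defaultdict(float)
    for attr, prob in p_mapping.items():
        # Distinct answers the query returns if this mapping is the correct one.
        for value in {row[attr] for row in source}:
            answers[value] += prob
    return dict(answers)

answers = by_table_answers(SOURCE, P_MAPPING)
```

Here Sunnyvale is returned under both m1 and m2, so its probability is 0.5 + 0.4 = 0.9, while Mountain View is returned only under m1. The loop visits each mapping once, so the work grows linearly with the number of mappings.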
By-Table versus By-Tuple Semantics
Ds=
pname   email-addr   home-addr       office-addr
Alice   alice@       Mountain View   Sunnyvale
Bob     bob@         Sunnyvale       Sunnyvale

There are 9 possible databases DT:

DT under <m1,m3>, Pr(<m1,m3>)=0.05:
name    mailing-addr
Alice   Mountain View
Bob     bob@

DT under <m2,m3>, Pr(<m2,m3>)=0.04:
name    mailing-addr
Alice   Sunnyvale
Bob     bob@

DT under <m3,m3>, Pr(<m3,m3>)=0.01:
name    mailing-addr
Alice   alice@
Bob     bob@

…
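Under by-tuple semantics each source tuple picks its own mapping, so a possible target database corresponds to a sequence of mappings, one per tuple. The sketch below enumerates those sequences naively for the example data and answers the same hypothetical query, "return all mailing-addr values in T". The function names are assumptions, and the sketch treats the choices for different tuples as independent, which matches the probabilities listed above (for example 0.5 * 0.1 = 0.05 for <m1,m3>).

```python
from collections import defaultdict
from itertools import product
from math import prod

SOURCE = [
    {"pname": "Alice", "email-addr": "alice@",
     "home-addr": "Mountain View", "office-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@",
     "home-addr": "Sunnyvale", "office-addr": "Sunnyvale"},
]
# m1, m2, m3, keyed by the source attribute mapped to mailing-addr.
P_MAPPING = {"home-addr": 0.5, "office-addr": 0.4, "email-addr": 0.1}

def by_tuple_sequences(source, p_mapping):
    """Yield (sequence, probability) for every choice of one mapping per tuple."""
    attrs = list(p_mapping)
    for seq in product(attrs, repeat=len(source)):
        yield seq, prod(p_mapping[a] for a in seq)

def by_tuple_answers(source, p_mapping):
    """Probability of each mailing-addr answer, summed over all mapping sequences."""
    answers = defaultdict(float)
    for seq, prob in by_tuple_sequences(source, p_mapping):
        # Distinct answers returned when tuple i is translated with seq[i].
        for value in {row[attr] for row, attr in zip(source, seq)}:
            answers[value] += prob
    return dict(answers)

sequences = dict(by_tuple_sequences(SOURCE, P_MAPPING))
answers = by_tuple_answers(SOURCE, P_MAPPING)
```

With 3 mappings and 2 tuples there are 3^2 = 9 sequences. Sunnyvale now has probability 1 - 0.6 * 0.1 = 0.94 rather than the 0.9 obtained when a single mapping covers the whole table. The enumeration grows as 3^n for n source tuples, so this brute-force approach is only usable on toy inputs.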
Complexity of Query Answering
Answering queries is more expensive under
by-tuple semantics:

                      By-table    By-tuple
Data Complexity       PTIME       #P-complete
Mapping Complexity    PTIME       PTIME
Summary of Chapter 13
 Uncertainty is everywhere in data integration
 Work in this area is really only beginning
 Great opportunity for further research.
 Probabilistic schema mappings:
 By-table versus by-tuple semantics
 By-tuple semantics is computationally expensive, but
restricted cases can be found where query answering is still
polynomial.
 Where do the probabilities come from?
 Sometimes we interpret statistics as probabilities
 Sometimes the provenance of the data is more meaningful
than the probabilities