Managing Inconsistency in Data Exchange and Integration

Managing Inconsistent Data in Data Integration and Data Exchange
Renée J. Miller
University of Toronto
Periklis Andritsos, Ariel Fuxman
Tasos Kementsietsidis, Yannis Velegrakis
Outline
➢ Schema Mapping – reconciling differences in schemas
• Clio: Creating mappings (VLDB00, VLDB02)
• Using semantics of schemas and data
• ToMAS: Managing schema mappings (VLDB03)
• Evolving schemas and semantics
➢ Using Mappings
• Data Exchange (ICDT03)
• Querying Inconsistent Data (IJCAI/IIWeb03)
➢ Data Mapping – reconciling differences in data
• Hyperion: managing data mappings (SIGMOD03)
• Using networks of P2P data mappings
6/30/2016
R.J. Miller - U. Toronto
Mapping Independent Data Sources
[Figure: sources with schemas S, S’ and S’’ are related by a mapping to a target with schema T; the data on each side “conforms to” its schema; Q is a query.]
• Data Integration – answer target queries using data from source(s)
• Target data is virtual
Mapping Independent Data Sources
[Figure: a source with schema S is related by a mapping to a target with schema T; the data on each side “conforms to” its schema; Q is a query.]
• Data Exchange – target queries are answered locally
• Target data is materialized
Overview
➢ Goal: interoperability between independent data sources
• Creating Mappings
• Managing Mappings – as sources change
• Using Mappings – to query and exchange data
• Even when data is dirty or inconsistent
➢ Challenges
• Schemas can be arbitrarily different
• Still, data must not lose its meaning
  • Use semantics embedded in schemas & data
  • Facilitate specification of any additional semantics
• Performed manually: complex user queries, programs, etc.
  • Hard to debug, understand, and verify for correctness
Schema Mapping
[Figure: a user poses query Q against target schema T; a mapping relates source schema S to T, and the data “conforms to” its schema.]
User:
• Wants data from S
• Understands T
• May not understand S
Schema types shown:
• XML Schema
• DTD
• Relational
• Automate (to the extent possible) the creation of mappings
• Mappings used for (virtual) data integration or (materialized) data exchange
Illustration: Mapping Creation
➢ Support Nested Structures
➢ Element correspondences
• Human friendly
• Automatic discovery
➢ Preserve data meaning
• Discover data associations
• Use constraints & schema
➢ Create New Target Values
➢ Produce Correct Grouping
Creating Correspondences
➢ Graphical User Interface
• DBA interactively specifies
➢ Automatic Discovery
• Attribute (Element) Classifier
• Extensible to Other Schema Matchers (VLDB J. 01 Survey)
➢ Correspondence based on syntactic information
• Within schema or data
Interpreting Correspondences
Source schema:
expenseDB: Rcd
  companies: Set of Rcd
    company: Rcd
      cid
      name
      city
  grants: Set of Rcd
    grant: Rcd
      cid
      gid
      amount
      project

Target schema:
statDB: Set of Rcd
  cityStat: Rcd
    orgs: Set of Rcd
      org: Rcd
        cid
        name
        fundings: Set of Rcd
          funding: Rcd
            gid
            proj
            aid
    financials: Set of Rcd
      financial: Rcd
        aid
        date
        amount
    city

What semantics do we associate to an arrow?
• Good enough for one arrow!
  cid expenseDB.companies → cid statDB.cityStat.orgs
• Still works for these two arrows!
  cid,name expenseDB.companies → cid,name statDB.cityStat.orgs
• How about now?
  gid expenseDB.grants → gid statDB.cityStat.orgs.fundings
Associations btw Elements
[Figure: the statDB (target) and expenseDB (source) schema trees, as on the slide “Interpreting Correspondences”]
➢ We must recognize that grants are associated to companies
➢ Association (in the source): grants ⋈ companies
➢ Association (in the target): statDB ⋈ orgs ⋈ fundings ⋈ financials
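The source association grants ⋈ companies can be illustrated with a small join on the shared cid attribute. This is only a sketch: the attribute names come from the slides, while the sample rows and the helper name join_on_cid are assumptions made for illustration.

```python
# Hypothetical sample rows; only the attribute names come from the slides.
companies = [
    {"cid": 1, "name": "Acme", "city": "Toronto"},
    {"cid": 2, "name": "Globex", "city": "Ottawa"},
]
grants = [
    {"cid": 1, "gid": "g1", "amount": 500, "project": "p1"},
    {"cid": 1, "gid": "g2", "amount": 700, "project": "p2"},
]


def join_on_cid(grants, companies):
    """Pair each grant with the company that shares its cid (grants ⋈ companies)."""
    by_cid = {c["cid"]: c for c in companies}  # assumes cid identifies a company
    return [{**by_cid[g["cid"]], **g} for g in grants if g["cid"] in by_cid]


association = join_on_cid(grants, companies)
```

Each row of association carries both grant and company attributes, which is what lets a correspondence on gid be interpreted together with one on cid or name.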
Schema Mapping
➢ Enumerate ALL logical associations consistent with schema semantics
• Constraints
• Nesting (schema structure)
• Data
➢ Interpret correspondences (arrows) over pairs of source and target associations
Mappings as Views
[Figure: source schema S (local, instance I) and target schema T (global, instance J, virtual!) are related by Σst; Σt denotes constraints on the target; q is a query.]
➢ Views: Σst have a special form:
• GAV: Qs(S) ⊆ Ti where Ti is a relation in T, Qs is a query on S
  • Company(C,N,Ct), Grant(C,G,A,P) ⊆ Projects(P,Ct)
  • Plain old view: create view projects (p, ct) as (select p,ct from …)
• LAV: Si ⊆ Qt(T) where Si is a relation in S, Qt is a query on T
  • Company(C,N,Ct) ⊆ Org(C,N), City(C,Ct)
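As a rough illustration of the GAV example, the target relation Projects(P, Ct) can be computed as a query over the source relations Company(C, N, Ct) and Grant(C, G, A, P). The sample tuples and the function name projects_view are assumptions, and the sketch assumes C is a key of Company.

```python
def projects_view(company, grant):
    """GAV-style view: Projects(P, Ct) from Company(C, N, Ct) joined with Grant(C, G, A, P)."""
    city_of = {c: ct for (c, n, ct) in company}  # assumes C is a key of Company
    return {(p, city_of[c]) for (c, g, a, p) in grant if c in city_of}


# Hypothetical source tuples, for illustration only.
company = {(1, "Acme", "Toronto"), (2, "Globex", "Ottawa")}
grant = {(1, "g1", 500, "p1"), (2, "g2", 700, "p2"), (3, "g3", 100, "p3")}

projects = projects_view(company, grant)
```

The grant with C = 3 has no matching company, so it contributes nothing to Projects, just as it would in the SQL view.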
Clio Mappings
[Figure: source schema S (local, instance I) and target schema T (global, instance J, virtual or materialized) are related by Σst; Σt denotes constraints on the target; q is a query.]
➢ Clio Schema Mapping:
• Qs(S) ⊆ Qt(T) and constraints on S and T (Σs, Σt)
• More general than views
• Generality often required when S, T are fixed
  • No design control
Using Mappings and Views
[Figure: source schema S (instance I) and target schema T (instance J, virtual!) are related by Σst; Σt denotes constraints on the target; q is a query.]
➢ Data Integration
• The target is not materialized; it is just a querying interface
• Queries are posed on the target schema; data is in the source.
• Problem: how to answer the query in the “best” possible way
• AKA: Answering queries using views
➢ GAV/LAV (mostly) assumes conjunctive queries
• (mostly) assumes no target constraints – target is a view
• Uses relational (not nested relational) model
Using Mappings
[Figure: source schema S (local, instance I) and target schema T (global, instance J, materialized) are related by Σst; Σt denotes constraints on the target; q is a query.]
➢ Data Exchange
• The target is materialized
• Queries are posed on the target schema; answered using target data
• Problem: what is the “best” instance to exchange
➢ Mapping: Qs(S) ⊆ Qt(T) and constraints on S and T (Σs, Σt)
• Given an instance of S there may be many instances of T
• Which is the best instance to exchange?
• Grant(C,G,A,P,S) ⊆ Funding(C,G,Aid), Financials(Aid,D,A)
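One way to see why many target instances are possible: in the Grant example, Aid and D do not occur in the source, so any values for them satisfy the mapping. The sketch below materializes one candidate instance by inventing a fresh placeholder (a labeled null) for each unknown value. This is an illustrative choice, not necessarily the instance the talk considers best; the sample rows, the N1, N2, ... naming, and the function name exchange are assumptions.

```python
from itertools import count


def exchange(grant_rows):
    """Materialize one target instance, inventing a labeled null for each unknown value."""
    fresh = count(1)
    funding, financials = [], []
    for (c, g, a, p, s) in grant_rows:
        aid = f"N{next(fresh)}"   # labeled null standing for the unknown Aid
        date = f"N{next(fresh)}"  # labeled null standing for the unknown D
        funding.append((c, g, aid))
        financials.append((aid, date, a))
    return funding, financials


# Hypothetical Grant(C, G, A, P, S) tuples.
grant_rows = [(1, "g1", 500, "p1", "s1"), (1, "g2", 700, "p2", "s2")]
funding, financials = exchange(grant_rows)
```

The same null is reused for Aid in both target relations, which keeps the join between Funding and Financials intact for each source tuple.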
Semantics of Query Answering
➢ Answering queries using views
• Query is answered using source data
• Answer is the set of tuples in the query result on ALL possible target instances: certain answers
➢ Data Exchange
• Query is answered using ONE materialized target
• Can a single target give the same information as the source(s)?
➢ Is the query result the same in both settings?
Mappings at Data Level
[Figure: source relations Financial(Employee, Salary) and Human Resources(Employee, Salary) are mapped into a Target Database relation with columns (Employee, Salary).]
Mapping
Financial(e,s) ⊆ Global(e,s)
HumanRes(e,s) ⊆ Global(e,s)
Data Inconsistency
Financial
Employee  Salary
John      1000

Human Resources
Employee  Salary
John      2000
Mary      3000

Target Database
Employee  Salary
John      1000
John      2000
Mary      3000

Mapping
Financial(e,s) ⊆ Global(e,s)
HumanRes(e,s) ⊆ Global(e,s)
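The inconsistency on this slide can be reproduced in a few lines: both mappings copy source tuples into the target, and the result violates the expectation that each employee has one salary. A minimal sketch, assuming Employee is meant to be a key of the target relation; the function name key_conflicts is hypothetical.

```python
from collections import defaultdict

financial = [("John", 1000)]
human_res = [("John", 2000), ("Mary", 3000)]

# Both mappings copy source tuples into the target relation.
target = sorted(set(financial) | set(human_res))


def key_conflicts(rows):
    """Return the employees that appear with more than one salary."""
    salaries = defaultdict(set)
    for employee, salary in rows:
        salaries[employee].add(salary)
    return {e: sorted(s) for e, s in salaries.items() if len(s) > 1}


conflicts = key_conflicts(target)
```

Here conflicts contains only John, with the two salaries contributed by the two sources.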
Reconciling Inconsistencies (I)
1 – Delete all tuples for John

Sources: Financial = {(John, 1000)}; Human Resources = {(John, 2000), (Mary, 3000)}

Target Database
Employee  Salary
Mary      3000

Mapping
Financial(e,s) ⊆ Global(e,s)
HumanRes(e,s) ⊆ Global(e,s)
Reconciling Inconsistencies (II)
2 – Delete the salaries of John

Sources: Financial = {(John, 1000)}; Human Resources = {(John, 2000), (Mary, 3000)}

Target Database
Employee  Salary
John      null
Mary      3000

Mapping
Financial(e,s) ⊆ Global(e,s)
HumanRes(e,s) ⊆ Global(e,s)
Reconciling Inconsistencies (III)
3 – Delete only one tuple for John

Sources: Financial = {(John, 1000)}; Human Resources = {(John, 2000), (Mary, 3000)}

Target Database
Employee  Salary
John      1000
Mary      3000

Mapping
Financial(e,s) ⊆ Global(e,s)
HumanRes(e,s) ⊆ Global(e,s)
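The three reconciliation options can be compared side by side on the same inconsistent target. The following is a sketch only: the function names are hypothetical, None stands in for null, and option 3 arbitrarily keeps the first salary seen for each employee.

```python
from collections import defaultdict

target = [("John", 1000), ("John", 2000), ("Mary", 3000)]


def group(rows):
    """Collect the salaries recorded for each employee, in order of appearance."""
    salaries = defaultdict(list)
    for employee, salary in rows:
        salaries[employee].append(salary)
    return salaries


def delete_conflicting(rows):
    """Option 1: drop every tuple of an employee with conflicting salaries."""
    return [(e, s[0]) for e, s in group(rows).items() if len(set(s)) == 1]


def null_out(rows):
    """Option 2: keep the employee but replace conflicting salaries with None."""
    return [(e, s[0] if len(set(s)) == 1 else None) for e, s in group(rows).items()]


def keep_one(rows):
    """Option 3: keep one tuple per employee (arbitrarily, the first one seen)."""
    return [(e, s[0]) for e, s in group(rows).items()]
```

Each option yields a consistent target, but each discards different information, and option 3 depends on an arbitrary choice between the two sources.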
Repairing an integrated database

An integrated inconsistent database
Employee  Salary
John      1000
John      2000
Mary      3000

Repair 1
Employee  Salary
John      1000
Mary      3000

Repair 2
Employee  Salary
John      2000
Mary      3000
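For a single key constraint, the repairs of a small instance can be enumerated directly: keep exactly one salary per employee and leave everything else untouched. This sketch assumes Employee is the key and that a repair is a maximal consistent subset of the instance; the function name repairs is hypothetical.

```python
from collections import defaultdict
from itertools import product


def repairs(rows):
    """Yield every repair: one salary kept per employee, nothing else removed."""
    salaries = defaultdict(list)
    for employee, salary in rows:
        if salary not in salaries[employee]:
            salaries[employee].append(salary)
    employees = list(salaries)
    # One repair per way of choosing a single salary for each employee.
    for choice in product(*(salaries[e] for e in employees)):
        yield set(zip(employees, choice))


inconsistent = [("John", 1000), ("John", 2000), ("Mary", 3000)]
all_repairs = list(repairs(inconsistent))
```

The number of repairs is the product of the number of candidate salaries per employee, which is why it can grow exponentially with the number of conflicting keys.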
Consistent Query Answers
Intuition:
• Input: query Q
• Evaluate Q on each repair to get one query result per repair.
• A tuple is in the consistent answer if it appears in all query results.

Repair 1
Employee  Salary
John      1000
Mary      3000

Repair 2
Employee  Salary
John      2000
Mary      3000
Consistent Query Answers
Q(e,s)=Target(e,s)
“Get all employees and their salaries”

Repair 1
Employee  Salary
John      1000
Mary      3000

Repair 2
Employee  Salary
John      2000
Mary      3000

Consistent(Q,I)={(Mary,3000)}
Consistent Query Answers
Q(e)=∃s: Global(e,s)
“Get all employees”

Repair 1
Employee  Salary
John      1000
Mary      3000

Repair 2
Employee  Salary
John      2000
Mary      3000

Result 1
Employee
John
Mary

Result 2
Employee
John
Mary

Consistent(Q,I)={(John),(Mary)}
Consistent Query Answers
Q=∃e Target(e,2000)
“Is there an employee who earns $2000?”

Repair 1 (query result: FALSE)
Employee  Salary
John      1000
Mary      3000

Repair 2 (query result: TRUE)
Employee  Salary
John      2000
Mary      3000

Consistent(Q,I)=FALSE
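The three example queries can be checked mechanically by evaluating each query on every repair and intersecting the results. This is a brute-force sketch of the definition, not an efficient algorithm; the function names are hypothetical, and a Boolean query is modelled as returning the set containing the empty tuple for TRUE and the empty set for FALSE.

```python
repair_1 = {("John", 1000), ("Mary", 3000)}
repair_2 = {("John", 2000), ("Mary", 3000)}
all_repairs = [repair_1, repair_2]


def consistent(query, repairs):
    """Tuples returned by the query on every repair."""
    results = [query(r) for r in repairs]
    return set.intersection(*results)


def q_all(rows):          # Q(e,s) = Target(e,s)
    return set(rows)


def q_employees(rows):    # Q(e) = ∃s: Global(e,s)
    return {(e,) for e, _ in rows}


def q_earns_2000(rows):   # Q = ∃e Target(e,2000)
    # Boolean query: {()} stands for TRUE, the empty set for FALSE.
    return {()} if any(s == 2000 for _, s in rows) else set()
```

Applying consistent to q_earns_2000 gives the empty set, matching Consistent(Q,I)=FALSE: the query is true in Repair 2 but not in Repair 1.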
Our work (IJCAI/IIWeb03)
Problem: Retrieving consistent answers is co-NP-complete in general (i.e., we need to explore an exponential number of repairs) [Chomicki and Marcinkowski 2002, Calì et al. 2003]
Goal: Find a class of tractable queries (i.e., the consistent answers can be retrieved in polynomial time without explicitly building all repairs).
Example: A tractable query
Are there two employees with the same salary?

Inconsistent instance
Employee  Salary
John      1000
John      2000
Mary      1000
Mary      2000
Anna      1000
Anna      3000

[Figure: graph of the inconsistent instance, with employee nodes John, Mary, Anna and salary nodes 1000, 2000, 3000; two further builds of the slide step through the same graph.]
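One polynomial-time way to answer this particular query without building all repairs is to view the instance as a bipartite graph of employees and salaries. The query is false in some repair exactly when every employee can be given a different salary, that is, when the graph has a matching covering all employees. The sketch below uses augmenting paths (Kuhn's algorithm) for that check. It is an illustration consistent with the graph on the slide, not necessarily the algorithm used in the talk, and the function names are hypothetical.

```python
def some_repair_has_distinct_salaries(options):
    """True if each employee can be given a different salary from its options.

    options maps employee -> list of candidate salaries. The search uses
    augmenting paths (Kuhn's bipartite matching), which is polynomial.
    """
    holder = {}  # salary -> employee currently assigned to it

    def assign(employee, seen):
        for salary in options[employee]:
            if salary in seen:
                continue
            seen.add(salary)
            # Take a free salary, or move its current holder to another one.
            if salary not in holder or assign(holder[salary], seen):
                holder[salary] = employee
                return True
        return False

    return all(assign(e, set()) for e in options)


def consistent_same_salary(options):
    """Consistent answer to: are there two employees with the same salary?"""
    return not some_repair_has_distinct_salaries(options)


instance = {"John": [1000, 2000], "Mary": [1000, 2000], "Anna": [1000, 3000]}
```

For the instance on the slide the consistent answer is FALSE, since one repair gives the three employees three different salaries.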
Inexpressibility result
➢ Query rewriting
• Input: query Q
• Output: query Q’ s.t. Q’(I)=consistent(Q,I) for every I.
➢ Appealing approach
• tractable
• reuses existing DBMSs
➢ BUT: so far known to be applicable only to restricted classes of queries ([ABC, PODS 1999])
Inexpressibility result
➢ Can we use query rewriting? NO
Practical Considerations (I)
Conflicts are usually confined to a small portion of the database
[Figure: instance graph pairing Robert with 4000, Fred with 5000, Paul with 6000, Peter with 7000, and John, Mary, Anna with 1000, 2000, 3000; a second build shows only the John, Mary, Anna portion.]
Practical Considerations (II)
Reasonable assumption in integration and exchange: constant number of conflicts per key.

Financial
Employee  Salary
John      1000

Human Resources
Employee  Salary
John      2000
Mary      3000

Target Database (Employee → Salary)
Employee  Salary
John      1000
John      2000
Mary      3000
Bibliography
J. Chomicki and J. Marcinkowski. On the Computational Complexity of Consistent Query Answers. CoRR cs.DB/0204010, 2002.
M. Arenas, L. Bertossi, and J. Chomicki. Consistent Query Answers in Inconsistent Databases. Proc. ACM PODS, 1999.
A. Calì, D. Lembo, and R. Rosati. On the Decidability and Complexity of Query Answering over Inconsistent and Incomplete Databases. Proc. ACM PODS, 2003.
Data Mapping (SIGMOD03)
➢ What if sources are unwilling to share schemas?
• Common in more autonomous P2P settings
• How can such sources share data?
• Shared schema mappings not appropriate
➢ Need to manage and share
• Data mappings
➢ Hyperion – P2P data sharing
P2P File-Sharing Systems
Currently, P2P querying relies on the use of value searches.
e.g., retrieve songs for music band “New Order”
However, P2P query mechanisms do not capture the intricacies
of values, i.e., that values are often associated to each other.
e.g. the value “New Order” is an alias for the value “Joy Division”
We propose the use of mapping tables to record such associations
e.g. a mapping table that records artist aliases
old-name      new-name
Prince        The Artist
Puff Daddy    P. Diddy
Joy Division  New Order
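A mapping table of artist aliases could be used to widen a plain value search, roughly as follows. The table contents come from the slide; the song catalogue, the function names expand and search, and the decision to follow the table in both directions are assumptions made for illustration.

```python
# Mapping table from the slide: (old-name, new-name).
ALIASES = [
    ("Prince", "The Artist"),
    ("Puff Daddy", "P. Diddy"),
    ("Joy Division", "New Order"),
]


def expand(value, mapping=ALIASES):
    """Return the value together with every name the table associates with it."""
    names = {value}
    for old, new in mapping:
        if value == old:
            names.add(new)
        elif value == new:
            names.add(old)
    return names


def search(songs, artist):
    """Value search that also matches aliases of the requested artist."""
    wanted = expand(artist)
    return [title for title, by in songs if by in wanted]


# Hypothetical peer catalogue of (title, artist) pairs.
songs = [("Blue Monday", "New Order"), ("Atmosphere", "Joy Division"), ("Kiss", "Prince")]
```

A search for “New Order” then also returns songs stored under “Joy Division”, which a pure value search would miss.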
A P2P Genome Database System
Peers store information about genes, proteins, etc.

Gene(gid, name)
gid  name
001  NF1
002  NID
003  NGFR
004  NEU1

“alias”
gid  pid
001  101
003  102
004  104
004  105

SwissProt(pid, name)
pid  name
101  Neurofibromin
102  p75 ICD
103  Neuromedin
104  Sialidase 1
105  G9 Sialidase

Characteristics of mapping tables:
➢ The recorded associations can be 1:1, 1:n or m:n
➢ They are, in general, non-binary
➢ They associate values within or across domains
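The “alias” mapping table lets a query phrased over one peer's values be translated into another peer's values. A minimal sketch using the data on the slide; the function name proteins_for and the dictionary layout are assumptions.

```python
gene = {"001": "NF1", "002": "NID", "003": "NGFR", "004": "NEU1"}
alias = [("001", "101"), ("003", "102"), ("004", "104"), ("004", "105")]
swissprot = {
    "101": "Neurofibromin",
    "102": "p75 ICD",
    "103": "Neuromedin",
    "104": "Sialidase 1",
    "105": "G9 Sialidase",
}


def proteins_for(gene_name):
    """Follow gene name -> gid -> pid -> protein name through the mapping table."""
    gids = {gid for gid, name in gene.items() if name == gene_name}
    return [swissprot[pid] for gid, pid in alias if gid in gids]
```

The gene NEU1 maps to two proteins (a 1:n association), while NID has no entry in the mapping table and so translates to nothing.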
Contributions
State of the art:
Mapping tables represent expert knowledge.
Currently, they are created manually by domain specialists.
Our contributions:
We automate the creation and maintenance of these tables.
More specifically:
• We investigate alternative semantics for mapping tables.
• We motivate why reasoning capabilities are needed to
manage them.
• We propose efficient algorithms for both finding inconsistencies in mapping tables and inferring new mapping tables.
Conclusions
➢ Managing Data Inconsistency
• Tolerate inconsistency
• Identify inconsistency at query time
• Recognize that cleaning is not always possible or desirable
• Reconciling inconsistency
• Data mappings record reconciliation
• Manage use and combination of data mappings
www.cs.toronto.edu/db
www.cs.toronto.edu/db/tomas
www.cs.toronto.edu/db/hyperion
www.cs.toronto.edu/~miller