Managing Inconsistent Data in Data Integration and Data Exchange

Renée J. Miller
University of Toronto

Periklis Andritsos, Ariel Fuxman, Tasos Kementsietsidis, Yannis Velegrakis

Outline

Schema Mapping – reconciling differences in schemas
• Clio: Creating mappings (VLDB00, VLDB02)
    • Using semantics of schemas and data
• ToMAS: Managing schema mappings (VLDB03)
    • Evolving schemas and semantics

Using Mappings
• Data Exchange (ICDT03)
• Querying Inconsistent Data (IJCAI/IIWeb03)

Data Mapping – reconciling differences in data
• Hyperion: managing data mappings (SIGMOD03)
    • Using networks of P2P data mappings

6/30/2016 R.J. Miller - U. Toronto

Mapping Independent Data Sources

[Figure: source schemas S, S’, S’’ are connected by a mapping to target schema T; data "conforms to" each schema; query Q is posed on T]
• Data Integration – answer target queries using data from source(s)
• Target data is virtual

Mapping Independent Data Sources

[Figure: source schema S is connected by a mapping to target schema T; data "conforms to" each schema; query Q is posed on T]
• Data Exchange – target queries are answered locally
• Target data is materialized

Overview

Goal: interoperability between independent data sources
• Creating Mappings
• Managing Mappings – as sources change
• Using Mappings – to query and exchange data
    • Even when data is dirty or inconsistent

Challenges
• Schemas can be arbitrarily different
• Still, data must not lose its meaning
    • Use semantics embedded in schemas & data
    • Facilitate specification of any additional semantics
• Performed manually: complex user queries, programs, etc.
    • Hard to debug; understand; verify correctness

Schema Mapping

[Figure: source schema S is connected by a mapping to target schema T; data "conforms to" each schema; query Q is posed on T]

The user posing Q:
• Wants data from S
• Understands T
• May not understand S

Schema types:
• XML Schema
• DTD
• Relational

• Automate (to the extent possible) the creation of mappings
• Mappings used for (virtual) data integration or (materialized) data exchange

Illustration: Mapping Creation

[Figure callouts]
• Support Nested Structures
• Element correspondences
• Human friendly
• Automatic discovery
• Preserve data meaning
• Discover data associations
• Use constraints & schema
• Create New Target Values
• Produce Correct Grouping

Creating Correspondences

Graphical User Interface
• DBA interactively specifies

Automatic Discovery
• Attribute (Element) Classifier
• Extensible to Other Schema Matchers
    • VLDB J. 01 Survey

Correspondence based on syntactic information
• Within schema or data

Interpreting Correspondences

Source schema:
expenseDB: Rcd
    companies: Set of Rcd
        company: Rcd (cid, name, city)
    grants: Set of Rcd
        grant: Rcd (cid, gid, amount, project)

Target schema:
statDB: Set of Rcd
    cityStat: Rcd
        city
        orgs: Set of Rcd
            org: Rcd (cid, name)
                fundings: Set of Rcd
                    funding: Rcd (gid, proj, aid)
        financials: Set of Rcd
            financial: Rcd (aid, date, amount)

What semantics do we associate to an arrow?
• cid in expenseDB.companies → cid in statDB.cityStat.orgs
  Good enough for one arrow!
• cid, name in expenseDB.companies → cid, name in statDB.cityStat.orgs
  Still works for these two arrows!
• gid in expenseDB.grants → gid in statDB.cityStat.orgs.fundings
  How about now?

Associations between Elements

[Figure: the expenseDB (source) and statDB (target) schemas from the previous slide]
• We must recognize that grants are associated to companies
• Association (in the source): grants ⋈ companies
• Association (in the target): statDB ⋈ orgs ⋈ fundings ⋈ financials

Schema Mapping

Enumerate ALL logical associations consistent with schema semantics
• Constraints
• Nesting (schema structure)
• Data

Interpret correspondences (arrows) over pairs of source & target associations

Mappings as Views

[Figure: source schema S (local) with instance I; target schema T (global) with instance J, which is virtual; mapping Σst between S and T, constraints Σt on T; query q is posed on T]

Views: Σst have a special form:
• GAV: Qs(S) ⊆ Ti, where Ti is a relation in T and Qs is a query on S
    • Company(C,N,Ct), Grant(C,G,A,P) → Projects(P,Ct)
    • Plain old view: create view projects (p, ct) as (select p, ct from …)
• LAV: Si ⊆ Qt(T), where Si is a relation in S and Qt is a query on T
    • Company(C,N,City) → Org(C,N), City(C,Ct)

Clio Mappings

[Figure: as above, but the target instance J may be virtual or materialized]

Clio Schema Mapping:
• Qs(S) ⊆ Qt(T) and constraints on S and T (Σs, Σt)
• More general than views
• Generality often required when S, T are fixed
    • No design control

Using Mappings and Views

[Figure: source schema S with instance I; target schema T with virtual instance J; mapping Σst, constraints Σt; query q is posed on T]

Data Integration
• The target is not materialized; it is just a querying interface
• Queries are posed on the target schema; data is in the source.
• Problem: how to answer the query in the “best” possible way
• AKA: Answering queries using views

GAV/LAV
• (mostly) assumes conjunctive queries
• (mostly) assumes no target constraints – target is a view
• Uses relational (not nested relational) model

Using Mappings

[Figure: source schema S (local) with instance I; target schema T (global) with instance J, which is materialized; mapping Σst, constraints Σt; query q is posed on T]

Data Exchange
• The target is materialized
• Queries are posed on the target schema; answered using target data
• Problem: what is the “best” instance to exchange

Mapping: Qs(S) ⊆ Qt(T) and constraints on S and T (Σs, Σt)
• Given an instance of S there may be many instances of T
• Which is the best instance to exchange?
• Grant(C,G,A,P,S) → Funding(C,G,Aid), Financials(Aid,D,A)

Semantics of Query Answering

Answering queries using views
• Query is answered using source data
• Answer is set of tuples in query result on ALL possible target instances: certain answers

Data Exchange
• Query is answered using ONE materialized target
• Can single target give same information as source(s)?

Is query result the same in both settings?

Mappings at Data Level

[Figure: sources Financial (Employee, Salary) and Human Resources (Employee, Salary) feed a Target Database (Employee, Salary)]

Mapping
Financial(e,s) → Global(e,s)
HumanRes(e,s) → Global(e,s)

Data Inconsistency

Financial
Employee  Salary
John      1000

Human Resources
Employee  Salary
John      2000
Mary      3000

Target Database
Employee  Salary
John      1000
John      2000
Mary      3000

Mapping
Financial(e,s) → Global(e,s)
HumanRes(e,s) → Global(e,s)

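The Data Inconsistency example can be reproduced with a few lines of Python. This is only an illustrative sketch: the function names `exchange` and `key_conflicts` are hypothetical, and it assumes that Employee is meant to be a key of the target relation Global, which the slides suggest but do not state as a formal constraint.

```python
from collections import defaultdict

# Source instances from the Data Inconsistency slide.
financial = [("John", 1000)]
human_res = [("John", 2000), ("Mary", 3000)]

def exchange(*sources):
    """Apply Financial(e,s) -> Global(e,s) and HumanRes(e,s) -> Global(e,s):
    every source tuple is copied into the target relation."""
    target = set()
    for relation in sources:
        target.update(relation)
    return sorted(target)

def key_conflicts(global_rel):
    """Return employees with more than one salary.
    Assumption: Employee is a key of Global."""
    salaries = defaultdict(set)
    for employee, salary in global_rel:
        salaries[employee].add(salary)
    return {e: sorted(s) for e, s in salaries.items() if len(s) > 1}

global_rel = exchange(financial, human_res)
conflicts = key_conflicts(global_rel)
print(global_rel)
print(conflicts)
```

Both sources are individually consistent; the conflict on John only appears once their tuples meet in the target.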
Reconciling Inconsistencies (I)

1 – Delete all tuples for John

(Financial, Human Resources, and the mapping are unchanged.)

Target Database
Employee  Salary
Mary      3000

Reconciling Inconsistencies (II)

2 – Delete the salaries of John

Target Database
Employee  Salary
John      null
Mary      3000

Reconciling Inconsistencies (III)

3 – Delete only one tuple for John

Target Database
Employee  Salary
John      1000
Mary      3000

Repairing an integrated database

An integrated inconsistent database
Employee  Salary
John      1000
John      2000
Mary      3000

Repair 1
Employee  Salary
John      1000
Mary      3000

Repair 2
Employee  Salary
John      2000
Mary      3000

Consistent Query Answers

Intuition:
• Input: query Q
• Get a query result Q(r) for each repair r.
• A tuple is in the consistent answer if it appears in all query results.

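The intuition above can be written down directly as a brute-force procedure. The sketch below is for illustration only: it assumes Employee is the key, so a repair keeps exactly one tuple per employee, and it enumerates every repair, which is exponential in general. The helper names are hypothetical, and a Boolean query is encoded as returning `{()}` for TRUE and the empty set for FALSE.

```python
from itertools import product

def repairs(instance):
    """Yield every repair: keep exactly one tuple per employee (assumed key)."""
    by_employee = {}
    for employee, salary in instance:
        by_employee.setdefault(employee, []).append((employee, salary))
    for choice in product(*by_employee.values()):
        yield set(choice)

def consistent_answers(query, instance):
    """A tuple is a consistent answer if it is in the result on all repairs."""
    results = [set(query(r)) for r in repairs(instance)]
    return set.intersection(*results)

# The integrated inconsistent database from the slides.
instance = [("John", 1000), ("John", 2000), ("Mary", 3000)]

def all_tuples(r):
    # "Get all employees and their salaries"
    return set(r)

def employees(r):
    # "Get all employees" (salary is existentially quantified)
    return {(e,) for e, _ in r}

def earns_2000(r):
    # "Is there an employee who earns $2000?" as a Boolean query
    return {()} if any(s == 2000 for _, s in r) else set()

print(consistent_answers(all_tuples, instance))
print(consistent_answers(employees, instance))
print(bool(consistent_answers(earns_2000, instance)))
```

On this instance there are only two repairs, so enumeration is cheap; the point of the deck's later slides is that this does not scale.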
Consistent Query Answers

Q(e,s) = Target(e,s)
“Get all employees and their salaries”

Repair 1
Employee  Salary
John      1000
Mary      3000

Repair 2
Employee  Salary
John      2000
Mary      3000

Consistent(Q,I) = {(Mary,3000)}

Consistent Query Answers

Q(e) = ∃s: Global(e,s)
“Get all employees”

Result 1 (on Repair 1)
Employee
John
Mary

Result 2 (on Repair 2)
Employee
John
Mary

Consistent(Q,I) = {(John),(Mary)}

Consistent Query Answers

Q = ∃e: Target(e,2000)
“Is there an employee who earns $2000?”

Result on Repair 1: FALSE
Result on Repair 2: TRUE

Consistent(Q,I) = FALSE

Our work (IJCAI/IIWeb03)

Problem: Retrieving consistent answers is co-NP-complete in general (i.e., we need to explore an exponential number of repairs) [Chomicki and Marcinkowski 2002, Cali et al. 2003]

Goal: Find a class of tractable queries (i.e., the consistent answers can be retrieved in polynomial time without explicitly building all repairs).

Example: A tractable query

Are there two employees with the same salary?
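One way to see why a query like this can be answered without building all repairs is sketched here. It is an illustration under stated assumptions, not necessarily the algorithm of the cited work. Assume Employee is the key, so each repair picks one salary per employee. The query is FALSE in some repair exactly when every employee can be given a different salary, which is a bipartite matching between employees and salaries that covers all employees. The consistent answer is therefore TRUE only if no such matching exists. The function names are hypothetical, and the matching uses the standard augmenting-path method.

```python
def has_all_distinct_repair(instance):
    """True if some repair assigns every employee a different salary."""
    options = {}
    for employee, salary in instance:
        options.setdefault(employee, []).append(salary)
    owner = {}  # salary -> employee currently matched to it

    def try_assign(employee, seen):
        # Augmenting-path step: find a salary for this employee,
        # re-assigning earlier employees if needed.
        for salary in options[employee]:
            if salary in seen:
                continue
            seen.add(salary)
            if salary not in owner or try_assign(owner[salary], seen):
                owner[salary] = employee
                return True
        return False

    return all(try_assign(e, set()) for e in options)

def consistent_same_salary(instance):
    """Consistent answer to: are there two employees with the same salary?"""
    return not has_all_distinct_repair(instance)

# The inconsistent instance used in this example of the deck.
example = [("John", 1000), ("John", 2000), ("Mary", 1000),
           ("Mary", 2000), ("Anna", 1000), ("Anna", 3000)]
print(consistent_same_salary(example))
```

The matching runs in polynomial time in the size of the instance, whereas this instance alone already has eight repairs.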
Inconsistent instance
Employee  Salary
John      1000
John      2000
Mary      1000
Mary      2000
Anna      1000
Anna      3000

[Figure: graph of the inconsistent instance, with nodes for the employees John, Mary, Anna and the salaries 1000, 2000, 3000]

Inexpressibility result

Query rewriting
• Input: query Q
• Output: query Q’ s.t. Q’(I) = consistent(Q,I) for every I.

Appealing approach
• tractable
• reuses existing DBMSs

BUT: so far known to be applicable only to a restricted class of queries ([ABC, PODS 1999])

Inexpressibility result

Can we use query rewriting? NO

Practical Considerations (I)

Conflicts are usually confined to a small portion of the database

[Figure: database with Robert 4000, Fred 5000, Paul 6000, Peter 7000, John 1000, Mary 2000, Anna 3000; the second build of the slide keeps only John 1000, Mary 2000, Anna 3000]

Practical Considerations (II)

Reasonable assumption in integration and exchange: constant number of conflicts per key.

[Figure: the Financial, Human Resources, and Target Database tables from the Data Inconsistency slide, with the Employee column marked "!"; John has salaries 1000 and 2000 in the target]

Bibliography

J. Chomicki and J. Marcinkowski. On the Computational Complexity of Consistent Query Answers. CoRR cs.DB/0204010, 2002.

M. Arenas, L. Bertossi, and J. Chomicki. Consistent Query Answers in Inconsistent Databases, Proc. ACM PODS, 1999.

Andrea Calì, Domenico Lembo, Riccardo Rosati.
On the decidability and complexity of query answering over inconsistent and incomplete databases, Proc. ACM PODS, 2003.

Data Mapping (SIGMOD03)

What if sources unwilling to share schemas?
• Common in more autonomous P2P settings
• How can such sources share data?
• Shared schema mappings not appropriate

Need to manage and share
• Data mappings

Hyperion – P2P data sharing

P2P File-Sharing Systems

Currently, P2P querying relies on the use of value searches.
    e.g., retrieve songs for music band “New Order”

However, P2P query mechanisms do not capture the intricacies of values, i.e., that values are often associated to each other.
    e.g., the value “New Order” is an alias for the value “Joy Division”

We propose the use of mapping tables to record such associations
    e.g., a mapping table that records artist aliases

old-name      new-name
Prince        The Artist
Puff Daddy    P. Diddy
Joy Division  New Order

A P2P Genome Database System

Peers store information about genes, proteins, etc.

Gene(gid, name)
gid  name
001  NF1
002  NID
003  NGFR
004  NEU1

“alias” mapping table
gid  pid
001  101
003  102
004  104
004  105

SwissProt(pid, name)
pid  name
101  Neurofibromin
102  p75 ICD
103  Neuromedin
104  Sialidase 1
105  G9 Sialidase

Characteristics of mapping tables:
• The recorded associations can be 1:1, 1:n or m:n
• They are, in general, non-binary
• They associate values within or across domains

Contributions

State of the art: Mapping tables represent expert knowledge. Currently, they are created manually by domain specialists.

Our contributions: We automate the creation and maintenance of these tables. More specifically:
• We investigate alternative semantics for mapping tables.
• We motivate why reasoning capabilities are needed to manage them.
• We propose efficient algorithms for both finding inconsistencies in mapping tables and inferring new mapping tables

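To make the idea of a mapping table concrete, the genome example can be treated as a plain relation of associated values that is consulted when a query crosses from one peer to another. The sketch below is a simplified reading for illustration only: the function `proteins_for_gene` is hypothetical, and the alternative semantics and the inference algorithms mentioned under Contributions are not modeled.

```python
# Peer 1: Gene(gid, name)
gene = {"001": "NF1", "002": "NID", "003": "NGFR", "004": "NEU1"}

# Peer 2: SwissProt(pid, name)
swissprot = {
    "101": "Neurofibromin",
    "102": "p75 ICD",
    "103": "Neuromedin",
    "104": "Sialidase 1",
    "105": "G9 Sialidase",
}

# Mapping table associating gene ids with protein ids (1:n for gid 004).
gid_to_pid = [("001", "101"), ("003", "102"), ("004", "104"), ("004", "105")]

def proteins_for_gene(gene_name):
    """Translate a value search on gene names into protein names,
    following the mapping table instead of any shared schema."""
    gids = {gid for gid, name in gene.items() if name == gene_name}
    pids = [pid for gid, pid in gid_to_pid if gid in gids]
    return [swissprot[pid] for pid in pids]

print(proteins_for_gene("NEU1"))
print(proteins_for_gene("NID"))
```

Gene 002 has no row in the mapping table, so the lookup returns nothing for NID; under this simple reading, a missing association means the value cannot be translated.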
Conclusions

Managing Data Inconsistency
• Tolerate inconsistency
    • Identify inconsistency at query time
    • Recognize that cleaning is not always possible or desirable
• Reconciling inconsistency
    • Data mappings record reconciliation
    • Manage use and combination of data mappings

www.cs.toronto.edu/db
www.cs.toronto.edu/db/tomas
www.cs.toronto.edu/db/hyperion
www.cs.toronto.edu/~miller