Shuffling Data Around … Around… An introduction to the keywords in Data Integration, Exchange and Sharing Dr. Anastasios Kementsietsidis Special thanks to Prof. Ren ée J. Miller Renée The Cause and Effect Principle Cause: Data sources are autonomous, heterogeneous – Different data models, types and schemas – Different vocabularies (in data and schemas) – Different requirements for what/how data is shared Effect: – Integration: Provide uniform access to heterogeneous data – Exchange: Move data between heterogeneous sources – Sharing: Provide non-uniform access to data through each source’s schema and vocabulary © 2005 Anastasios Kementsietsidis and René Renée J. Miller 2 Data Warehousing Architecture User Query or or here? here? Relational Database (Warehouse) What What about about updates updates here? here? Data Extraction/Forwarding tools GDB GDB Database Database Swissprot Swissprot Database Database Outside Website Image server © 2005 Anastasios Kementsietsidis and René Renée J. Miller 3 Virtual Integration Architecture User Query Mediated Mediated Schema Schema Reformulation Engine Mediator Optimization Engine Execution Engine Wrapper Wrapper GDB GDB Database Database Swissprot Swissprot Database Database © 2005 Anastasios Kementsietsidis and René Renée J. Miller Metadata Metadata Wrapper Wrapper Image server Outside Website 4 Peer-to-Peer Architecture User Query Outside Website User Query GDB GDB Database Database Image server Swissprot Swissprot Database Database User Query © 2005 Anastasios Kementsietsidis and René Renée J. Miller 5 What are the Metadata? • Schemas (models of data) – Structured or Semi-structured • e.g., relational, object-oriented, XML data, … – Talk will not cover unstructured data • e.g., documents, images, audio files,… – Data(base) is an instance of a schema • Mappings – Model relationship between schemas or data • Schema mapping (e.g., views) • Data mapping (e.g., aliases) – Requirements for mapping specifications © 2005 Anastasios Kementsietsidis and René Renée J. Miller 6 Metadata Lifecycle • Creation – Automatic discovery or creation – Design tools facilitating creation • Maintenance – Maintain (integrated) schemas as sources change – Maintain mappings as schemas change • Use – Query answering – Data exchange (materialization), updates, etc… © 2005 Anastasios Kementsietsidis and René Renée J. Miller 7 Schema Integration Global Global Integrated Integrated Schema Schema G G What What about about schema schema changes changes ?? or or source source additions? additions? Local Local Source Source Schema Schema S S11 Local Local Source Source Schema Schema S S22 Local Local Source Source Schema Schema S Snn Data Data Data Data Data Data Schema Design Problem: Create global integrated schema G (and mappings) for a set of independently designed local schemas Si, 1 ≤ i ≤ n © 2005 Anastasios Kementsietsidis and René Renée J. Miller 8 Mappings (a.k.a. views) A view is just a query… works like a function…. • It accepts as input the local source instance(s) • It outputs an instance of the global (target) schema Global Global Integrated Integrated Schema Schema G G Relation: funding (aid, amount, project, date) Relation: finances (aid, date, amount) This This is is Global-as-View Global-as-View (GAV) (GAV) create view funding (aid, amount, project, date) as ( select grant.gid, grant.amount, grant.project, received.date from grant, received where grant.gid = received.gid) Local Local Source Source Schema Schema S S Relation: grant (gid, amount, project) Relation: received (gid, date) © 2005 Anastasios Kementsietsidis and René Renée J. Miller 9 This is Local-as-View (LAV) Global Global Integrated Integrated Schema Schema G G Relation: funding (aid, amount, project, date) Relation: finances (aid, date, amount) create view grant (gid, amount, project) as ( select funding.aid, funding.amount, funding.project, from funding) Local Local Source Source Schema Schema S S Relation: grant (gid, amount, project) Relation: received (gid, date) © 2005 Anastasios Kementsietsidis and René Renée J. Miller 10 GAV vs. LAV (in “plain’’ English) • GAV: – Gives direct information about which data satisfy the elements of the global schema – Not easily extendible on source schema changes or source additions – Query answering is “easy”… • LAV: – Does not give direct information about which data satisfy the global schema – Easily extendible on source schema changes or source additions – Query answering is “hard”… © 2005 Anastasios Kementsietsidis and René Renée J. Miller 11 But There Is More… Global-and-Local-as-View (GLAV) Global Global Integrated Integrated Schema Schema G G Relation: funding (aid, amount, project, date) Relation: finances (aid, date, amount) (select funding.aid, funding.amount, funding.project, funding.date from funding, finances where funding.aid = finances.aid) ⊇ (select grant.gid, grant.amount, grant.project, received.date from grant, received where grant.gid = received.gid) Local Local Source Source Schema Schema S S Relation: grant (gid, amount, project) Relation: received (gid, date) © 2005 Anastasios Kementsietsidis and René Renée J. Miller 12 Creating Mappings … with the help of Schema Matching grant gid amount project received gid date 0.95 0.85 0.56 0.40 0.65 funding aid amount proj date financial aid date amount It uses schema-level and/or instance-level information © 2005 Anastasios Kementsietsidis and René Renée J. Miller 13 … And Using Them In Data Integration In Data Exchange Virtual Integration Architecture Data Warehousing Architecture QuickTime™ and a decompressor are needed to see this picture. QuickTime™ and a decompressor are needed to see this picture. QuickTime™anda decompressor areneededtoseethispicture. QuickTime™ and a decompressor are needed to see this picture. QuickTime™ and a decompressor are needed to see this picture. QuickTime™anda decompressor areneededtoseethispicture. User Query QuickTime™ and a decompressor are needed to see this picture. QuickTime™ and a decompressor are needed to see this picture. QuickTime™ and a decompressor are needed to see this picture. QuickTime™ and a decompressor are needed to see this picture. Mediated Schema User Query QuickTime™ and a decompressor are needed to see this picture. QuickTime™ and a decompressor are needed to see this picture. QuickTime™ and a decompressor are needed to see this picture. QuickTime™ and a decompressor are needed to see this picture. QuickTime™ and a decompressor are needed to see this picture. QuickTime™ and a decompressor are needed to see this picture. QuickTime™ and a decompressor are needed to see this picture. QuickTime™ and a decompressor are needed to see this picture. Reformulation Engine QuickTime™ and a decompressor are needed to see this picture. Mediator Optimization Engine Wrapper Wrapper Metadata County Database Relational Database (Warehouse) County Database Outside Website Database Image server © 2005 Anastasios Kementsietsidis QuickTime™ and a decompressor are needed to see this picture. QuickTime™ and a decompressor are needed to see this picture. Data Extraction/Forwarding tools Wrapper City City Database QuickTime™ and a decompressor are needed to see this picture. QuickTime™ and a decompressor are needed to see this picture. Execution Engine Wrapper QuickTime™ and a decompressor are needed to see this picture. QuickTime™ and a decompressor are needed to see this picture. Image server Outside Website and Renée J. Miller © 2005 Anastasios Kementsietsidis and René Renée J. Miller 4 © 2005 Anastasios Kementsietsidis and Renée J. Miller 3 14 Data Integration vs. Exchange • Data Integration – Global schema is a reconciled virtual view of heterogeneous sources – Uses GAV or LAV mappings – No constraints in the global schema are considered – Query is answered using source data; integration is virtual – Answer is set of tuples in query result on ALL possible target instances: certain answers • Data Exchange – Global schema is an independently created local source schema – Uses GLAV mappings – Considers the presence of constraints – Query is answered using ONE materialized target – Can single target give same information as source(s)? © 2005 Anastasios Kementsietsidis and René Renée J. Miller 15 Something Slightly Different… Data Mappings and Data Sharing Useful in environments where: • Sources are unwilling to share schemas • The schema of one source cannot be expressed as a view of another • There is a need to map (data) vocabularies gene gene((gid gid,, name name)) gid name 001 NF1 002 NGFR 003 NEU1 “encodes for” gid pid GDB GDB © 2005 Anastasios Kementsietsidis and René Renée J. Miller 001 002 003 003 101 102 103 104 protein protein((pid pid,, name name)) SwissProt SwissProt pid 101 102 103 104 name Neurofibromin p75 ICD Sialidase 1 G9 Sialidase This is a mapping table 16 Data Sharing Architecture article id … 96281126 99455262 … … II have have article article 96281126 96281126 UofE UofE Library Library II want want info info for for protein protein AARE AARE Mapping Mapping Tables Tables Mapping Mapping Tables Tables Mapping Mapping Tables Tables Mapping Mapping Tables Tables SwissProt SwissProt GDB GDB II have have info info for for gene gene APEH APEH Mapping Mapping Tables Tables gene … NF1 APEH … … protein … AARE G9 Sialidase … … z Establish mapping tables between the vocabularies of different sources. z Use these tables to translate query requests between the sources © 2005 Anastasios Kementsietsidis and René Renée J. Miller 17 Management of Tables Network Network Mapping Mapping Tables Tables Mapping Mapping Tables Tables Source Source SS11 Mapping Mapping Tables Tables Mapping Mapping Tables Tables Inferred Inferred Tables Tables Consistent? Source Source SSnn Inferred Inferred Tables Tables The Consistency and Inference problems, the main vehicles for managing mapping tables. Solving these problems allows us to: – Infer new mapping tables from existing ones – Augment existing mapping tables with new associations – Validate mapping tables © 2005 Anastasios Kementsietsidis and René Renée J. Miller 18 Closing Remarks… … and things to remember (other than the keywords): • Integration, one of the oldest problems in database research. Research is still going strong in this area • Exchange, an interesting and practical problem (e.g. B2B apps) • Sharing, the latest twist in the integration problem, also of practical importance (e.g. in P2P apps) Disclaimer: This talk provides only a glimpse of the research issues in these areas. Note: If you find any of these interesting, TALK to us… We are in Appleton Tower, 2nd Floor © 2005 Anastasios Kementsietsidis and René Renée J. Miller 19 Questions!? GAV vs. LAV Local Local Source Source Schema Schema S S Data Data QS/G Global Global Integrated Integrated Schema Schema G G User Query Data Data (?) (?) • GAV: QS(S) ⊆ RG, where RG is a relation in G, QS is a query on S • LAV: RS ⊆ QG(G), where Rs is a relation in S, QG is a query on G © 2005 Anastasios Kementsietsidis and René Renée J. Miller 21 Query Answering GAV uses Query/View Unfolding select project from funding where amount > $45.000 create view funding (aid, amount, project, date) as ( select grant.gid, grant.amount, grant.project, received.date from grant, received where grant.gid = received.gid) select project from grant, received where grant.gid = received.gid AND grant.amount > $45.000 © 2005 Anastasios Kementsietsidis and René Renée J. Miller What What about about LAV? LAV? ItIt uses uses aa method method called called Query Query Rewriting Rewriting (not (not presented presented here) here) 22