CSE 636 Data Integration Introduction

Staff
• Instructor: Dr. Michalis Petropoulos
– Email: mpetropo@cse.buffalo.edu
– Location: 210 Bell Hall
– Office Hours: Wednesday & Friday 1:00-2:00pm & By Appointment
• Web Page
http://www.cse.buffalo.edu/~mpetropo/CSE636-FA08/
• Newsgroup
sunyab.cse.636
Course Goals
• Data integration applications and architectures
• Issues in building such applications
– Really big and currently active research area
• Solutions to several of them
• Provide foundation for
– understanding current research problems
– criticizing proposed solutions
– proposing your own solution!
• Acquire valuable experience by implementing the project
Prerequisites
• An introductory database course
– CSE 520, CSE 562 or equivalent
• Data structures and algorithms
• Knowledge Representation
• Distributed systems
• Complexity theory
• Mathematical Logic
• Curiosity!
– You should ask a lot of questions
• Have a lot of fun!
Relevant Material
Textbooks
• Database Systems: The Complete Book
– by Garcia-Molina, Ullman and Widom
• Database Management Systems
– by Ramakrishnan
• Fundamentals of Database Systems
– by Elmasri and Navathe
• Foundations of Databases
– by Abiteboul, Hull and Vianu
• Data on the Web
– by Abiteboul, Buneman and Suciu
Course Format
• Assignments: 15%
– Three assignments will be given, 5% each
• Final: 20% (take home)
• Projects: 60%
– Detailed specs will be given
– Can be used to satisfy the M.S. project requirement
• Participation: 5%
What is Data Integration?
The problem of providing
• uniform (sources transparent to users)
• access to (query)
• multiple (even 2 is a problem)
• autonomous (without affecting the behavior of sources)
• heterogeneous (different data models, schemas)
• structured (at least semistructured)
• data sources (not only databases)
The Data Integration Problem
[Figure: the MyBookstore.com mediated schema (Books, Inventory, Shipping, Orders, Reviews) sits above Internet and intranet sources: publisher databases (Morgan Kaufmann, Addison Wesley, Prentice Hall), East and West sites, FedEx and UPS web services (WS), and review sources (Customer Reviews, NY Times, …)]
Uniform query capability across autonomous,
heterogeneous data sources on the Internet
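A minimal sketch of the mediated-schema idea, under stated assumptions: the `Book` relation, the two sources, their layouts, and all data values are hypothetical and only illustrate how wrappers can hide heterogeneous sources behind one schema. It is not the architecture of any particular system.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List


@dataclass(frozen=True)
class Book:
    """One tuple of a hypothetical mediated relation Books."""
    isbn: str
    title: str
    publisher: str
    price: float  # US dollars


# Source A (made-up data): rows are dicts with their own attribute names.
SOURCE_A = [
    {"ISBN": "111", "name": "Book One", "usd": 50.0},
    {"ISBN": "222", "name": "Book Two", "usd": 80.0},
]

# Source B (made-up data): rows are positional tuples, price in cents.
SOURCE_B = [
    ("333", "Book Three", 6500),
]


def wrap_a() -> Iterator[Book]:
    """Wrapper: translate Source A rows into the mediated schema."""
    for row in SOURCE_A:
        yield Book(row["ISBN"], row["name"], "Publisher A", row["usd"])


def wrap_b() -> Iterator[Book]:
    """Wrapper: translate Source B rows, converting cents to dollars."""
    for isbn, title, cents in SOURCE_B:
        yield Book(isbn, title, "Publisher B", cents / 100)


WRAPPERS = [wrap_a, wrap_b]


def query_books(predicate: Callable[[Book], bool]) -> List[Book]:
    """Answer a selection query over the mediated schema by asking every source."""
    return [book for wrap in WRAPPERS for book in wrap() if predicate(book)]


# The user queries the mediated schema only; the sources stay transparent.
cheap = query_books(lambda b: b.price < 70)
```

The sources are read but never modified, which reflects the autonomy requirement. A real mediator would also push selections down to the sources instead of fetching everything.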
Motivation
• Enterprise data integration
– Web site construction
• WWW
– Comparison shopping
– Portals integrating data from multiple sources
– B2B, electronic marketplaces
• Sciences
– Geology: integrate geological data across the US continent (text as well as spatial data)
– Biology: integrate genomic data
Current Solutions
• Mostly ad-hoc programming
– Create a special solution for every case
– Pay consultants a lot of money
• Data Warehousing (Data Exchange)
– Load all the data periodically into a warehouse
– Separates operational DBMS from decision support DBMS (not only a solution to data integration)
– Performance is good
– Data may not be fresh
– Need to clean data
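The freshness trade-off of warehousing can be sketched as follows. This is an illustration under assumptions, not a real warehouse design: the `operational_stock` source, the `stock` table, and the data values are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical operational source: ISBN -> quantity in stock (made-up data).
operational_stock = {"111": 5, "222": 0}

# An in-memory SQLite database plays the role of the warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE stock (isbn TEXT PRIMARY KEY, qty INTEGER)")


def load_warehouse() -> None:
    """Periodic load: replace the warehouse copy with a fresh snapshot."""
    with warehouse:  # commit the delete and inserts as one transaction
        warehouse.execute("DELETE FROM stock")
        warehouse.executemany(
            "INSERT INTO stock VALUES (?, ?)",
            list(operational_stock.items()),
        )


def warehouse_qty(isbn: str) -> int:
    """Query the warehouse only; the operational source is not contacted."""
    row = warehouse.execute(
        "SELECT qty FROM stock WHERE isbn = ?", (isbn,)
    ).fetchone()
    return row[0]


load_warehouse()
operational_stock["111"] = 4   # the source changes after the load
stale = warehouse_qty("111")   # the warehouse still holds the old snapshot
load_warehouse()               # the next periodic load catches up
fresh = warehouse_qty("111")
```

Queries are fast because they touch only the local copy, but between loads the answer can lag behind the source.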
Course Outline (Tentative)
• Data Integration Scenarios & Architectures
– Find out what the problems are
• Data Models & Type Systems
– XML/Semistructured Data, DTDs, XML Schema
• Query & Transformation Languages
– Datalog, XPath, XQuery, XSLT
• Data Integration Approaches
– Different approaches depending on application characteristics
• Schema Integration
– Schema Mapping/Matching
– Semi-automate the discovery of schema mappings
Course Outline (cont)
• Distributed Query Processing Algorithms
• Query Rewriting Algorithms
• Limited Query Capabilities
– We don’t have full access to any database
• Consistent Query Answers
• Web Services
– What can they do for data integration?
• Semantic Web
– RDF & SPARQL
• Workflow Languages
– How is this related to data integration?
References
• Data Integration: a Status Report
– Alon Halevy
– German Database Conference (BTW), 2003
– Invited Talk
• Lecture Slides
– Alon Halevy
– http://www.cs.washington.edu/education/courses/cse544/00sp/lectures/ps/l12.ps