Shuffling Data Around… Shuffling Data Around … Dr. Anastasios Kementsietsidis

advertisement
Shuffling Data Around
…
Around…
An introduction to the keywords in
Data Integration, Exchange and Sharing
Dr. Anastasios Kementsietsidis
Special thanks to Prof. Ren
ée J. Miller
Renée
The Cause and Effect Principle
Cause:
Data sources are autonomous, heterogeneous
– Different data models, types and schemas
– Different vocabularies (in data and schemas)
– Different requirements for what/how data is shared
Effect:
– Integration: Provide uniform access to
heterogeneous data
– Exchange: Move data between heterogeneous
sources
– Sharing: Provide non-uniform access to data
through each source’s schema and vocabulary
© 2005 Anastasios Kementsietsidis and René
Renée J. Miller
2
Data Warehousing Architecture
User Query
or
or here?
here?
Relational Database
(Warehouse)
What
What about
about
updates
updates here?
here?
Data Extraction/Forwarding tools
GDB
GDB
Database
Database
Swissprot
Swissprot
Database
Database
Outside Website
Image server
© 2005 Anastasios Kementsietsidis and René
Renée J. Miller
3
Virtual Integration Architecture
User Query
Mediated
Mediated Schema
Schema
Reformulation Engine
Mediator
Optimization Engine
Execution Engine
Wrapper
Wrapper
GDB
GDB
Database
Database
Swissprot
Swissprot
Database
Database
© 2005 Anastasios Kementsietsidis and René
Renée J. Miller
Metadata
Metadata
Wrapper
Wrapper
Image server
Outside Website
4
Peer-to-Peer Architecture
User Query
Outside Website
User Query
GDB
GDB
Database
Database
Image server
Swissprot
Swissprot
Database
Database
User Query
© 2005 Anastasios Kementsietsidis and René
Renée J. Miller
5
What are the Metadata?
• Schemas (models of data)
– Structured or Semi-structured
• e.g., relational, object-oriented, XML data, …
– Talk will not cover unstructured data
• e.g., documents, images, audio files,…
– Data(base) is an instance of a schema
• Mappings
– Model relationship between schemas or data
• Schema mapping (e.g., views)
• Data mapping (e.g., aliases)
– Requirements for mapping specifications
© 2005 Anastasios Kementsietsidis and René
Renée J. Miller
6
Metadata Lifecycle
• Creation
– Automatic discovery or creation
– Design tools facilitating creation
• Maintenance
– Maintain (integrated) schemas as sources change
– Maintain mappings as schemas change
• Use
– Query answering
– Data exchange (materialization), updates, etc…
© 2005 Anastasios Kementsietsidis and René
Renée J. Miller
7
Schema Integration
Global
Global Integrated
Integrated
Schema
Schema G
G
What
What about
about
schema
schema changes
changes ??
or
or source
source additions?
additions?
Local
Local Source
Source
Schema
Schema S
S11
Local
Local Source
Source
Schema
Schema S
S22
Local
Local Source
Source
Schema
Schema S
Snn
Data
Data
Data
Data
Data
Data
Schema Design Problem:
Create global integrated schema G (and mappings) for a set
of independently designed local schemas Si, 1 ≤ i ≤ n
© 2005 Anastasios Kementsietsidis and René
Renée J. Miller
8
Mappings (a.k.a. views)
A view is just a query… works like a function….
• It accepts as input the local source instance(s)
• It outputs an instance of the global (target) schema
Global
Global Integrated
Integrated
Schema
Schema G
G
Relation: funding (aid, amount, project, date)
Relation: finances (aid, date, amount)
This
This is
is
Global-as-View
Global-as-View (GAV)
(GAV)
create view funding (aid, amount, project, date) as
( select grant.gid, grant.amount, grant.project, received.date
from grant, received
where grant.gid = received.gid)
Local
Local Source
Source
Schema
Schema S
S
Relation: grant (gid, amount, project)
Relation: received (gid, date)
© 2005 Anastasios Kementsietsidis and René
Renée J. Miller
9
This is Local-as-View (LAV)
Global
Global Integrated
Integrated
Schema
Schema G
G
Relation: funding (aid, amount, project, date)
Relation: finances (aid, date, amount)
create view grant (gid, amount, project) as
( select funding.aid, funding.amount, funding.project,
from funding)
Local
Local Source
Source
Schema
Schema S
S
Relation: grant (gid, amount, project)
Relation: received (gid, date)
© 2005 Anastasios Kementsietsidis and René
Renée J. Miller
10
GAV vs. LAV (in “plain’’ English)
• GAV:
– Gives direct information about which data satisfy the elements
of the global schema
– Not easily extendible on source schema changes or source
additions
– Query answering is “easy”…
• LAV:
– Does not give direct information about which data satisfy the
global schema
– Easily extendible on source schema changes or source
additions
– Query answering is “hard”…
© 2005 Anastasios Kementsietsidis and René
Renée J. Miller
11
But There Is More…
Global-and-Local-as-View (GLAV)
Global
Global Integrated
Integrated
Schema
Schema G
G
Relation: funding (aid, amount, project, date)
Relation: finances (aid, date, amount)
(select funding.aid, funding.amount, funding.project, funding.date
from funding, finances
where funding.aid = finances.aid) ⊇
(select grant.gid, grant.amount, grant.project, received.date
from grant, received
where grant.gid = received.gid)
Local
Local Source
Source
Schema
Schema S
S
Relation: grant (gid, amount, project)
Relation: received (gid, date)
© 2005 Anastasios Kementsietsidis and René
Renée J. Miller
12
Creating Mappings
… with the help of Schema Matching
grant
gid
amount
project
received
gid
date
0.95
0.85
0.56
0.40
0.65
funding
aid
amount
proj
date
financial
aid
date
amount
It uses schema-level and/or instance-level information
© 2005 Anastasios Kementsietsidis and René
Renée J. Miller
13
… And Using Them
In Data Integration
In Data Exchange
Virtual Integration Architecture
Data Warehousing Architecture
QuickTime™ and a
decompressor
are needed to see this picture.
QuickTime™ and a
decompressor
are needed to see this picture.
QuickTime™anda
decompressor
areneededtoseethispicture.
QuickTime™ and a
decompressor
are needed to see this picture.
QuickTime™ and a
decompressor
are needed to see this picture.
QuickTime™anda
decompressor
areneededtoseethispicture.
User Query
QuickTime™ and a
decompressor
are needed to see this picture.
QuickTime™ and a
decompressor
are needed to see this picture.
QuickTime™ and a
decompressor
are needed to see this picture.
QuickTime™ and a
decompressor
are needed to see this picture.
Mediated Schema
User Query
QuickTime™ and a
decompressor
are needed to see this picture.
QuickTime™ and a
decompressor
are needed to see this picture.
QuickTime™ and a
decompressor
are needed to see this picture.
QuickTime™ and a
decompressor
are needed to see this picture.
QuickTime™ and a
decompressor
are needed to see this picture.
QuickTime™ and a
decompressor
are needed to see this picture.
QuickTime™ and a
decompressor
are needed to see this picture.
QuickTime™ and a
decompressor
are needed to see this picture.
Reformulation Engine
QuickTime™ and a
decompressor
are needed to see this picture.
Mediator
Optimization Engine
Wrapper
Wrapper
Metadata
County
Database
Relational Database
(Warehouse)
County
Database
Outside Website
Database
Image server
© 2005 Anastasios Kementsietsidis
QuickTime™ and a
decompressor
are needed to see this picture.
QuickTime™ and a
decompressor
are needed to see this picture.
Data Extraction/Forwarding tools
Wrapper
City
City
Database
QuickTime™ and a
decompressor
are needed to see this picture.
QuickTime™ and a
decompressor
are needed to see this picture.
Execution Engine
Wrapper
QuickTime™ and a
decompressor
are needed to see this picture.
QuickTime™ and a
decompressor
are needed to see this picture.
Image server
Outside Website
and Renée J. Miller
© 2005 Anastasios Kementsietsidis and René
Renée J. Miller
4
© 2005 Anastasios Kementsietsidis and Renée J. Miller
3
14
Data Integration vs. Exchange
•
Data Integration
– Global schema is a reconciled virtual view of heterogeneous sources
– Uses GAV or LAV mappings
– No constraints in the global schema are considered
– Query is answered using source data; integration is virtual
– Answer is set of tuples in query result on ALL possible target
instances: certain answers
•
Data Exchange
– Global schema is an independently created local source schema
– Uses GLAV mappings
– Considers the presence of constraints
– Query is answered using ONE materialized target
– Can single target give same information as source(s)?
© 2005 Anastasios Kementsietsidis and René
Renée J. Miller
15
Something Slightly Different…
Data Mappings and Data Sharing
Useful in environments where:
•
Sources are unwilling to share schemas
•
The schema of one source cannot be expressed as a view of another
•
There is a need to map (data) vocabularies
gene
gene((gid
gid,, name
name))
gid name
001 NF1
002 NGFR
003 NEU1
“encodes for”
gid pid
GDB
GDB
© 2005 Anastasios Kementsietsidis and René
Renée J. Miller
001
002
003
003
101
102
103
104
protein
protein((pid
pid,, name
name))
SwissProt
SwissProt
pid
101
102
103
104
name
Neurofibromin
p75 ICD
Sialidase 1
G9 Sialidase
This is a
mapping table
16
Data Sharing Architecture
article id
…
96281126
99455262
…
…
II have
have article
article
96281126
96281126
UofE
UofE
Library
Library
II want
want info
info for
for
protein
protein AARE
AARE
Mapping
Mapping
Tables
Tables
Mapping
Mapping
Tables
Tables
Mapping
Mapping
Tables
Tables
Mapping
Mapping
Tables
Tables
SwissProt
SwissProt
GDB
GDB
II have
have info
info for
for
gene
gene APEH
APEH
Mapping
Mapping
Tables
Tables
gene
…
NF1
APEH
…
…
protein
…
AARE
G9 Sialidase
…
…
z Establish mapping tables between the vocabularies of different
sources.
z Use these tables to translate query requests between the sources
© 2005 Anastasios Kementsietsidis and René
Renée J. Miller
17
Management of Tables
Network
Network
Mapping
Mapping
Tables
Tables
Mapping
Mapping
Tables
Tables
Source
Source SS11
Mapping
Mapping
Tables
Tables
Mapping
Mapping
Tables
Tables
Inferred
Inferred
Tables
Tables
Consistent?
Source
Source SSnn
Inferred
Inferred
Tables
Tables
The Consistency and Inference problems, the main vehicles for
managing mapping tables. Solving these problems allows us to:
– Infer new mapping tables from existing ones
– Augment existing mapping tables with new associations
– Validate mapping tables
© 2005 Anastasios Kementsietsidis and René
Renée J. Miller
18
Closing Remarks…
… and things to remember (other than the keywords):
• Integration, one of the oldest problems in database research.
Research is still going strong in this area
• Exchange, an interesting and practical problem (e.g. B2B apps)
• Sharing, the latest twist in the integration problem, also of practical
importance (e.g. in P2P apps)
Disclaimer:
This talk provides only a glimpse of the research issues in these areas.
Note: If you find any of these interesting, TALK to us…
We are in Appleton Tower, 2nd Floor
© 2005 Anastasios Kementsietsidis and René
Renée J. Miller
19
Questions!?
GAV vs. LAV
Local
Local Source
Source
Schema
Schema S
S
Data
Data
QS/G
Global
Global Integrated
Integrated
Schema
Schema G
G
User Query
Data
Data (?)
(?)
• GAV: QS(S) ⊆ RG,
where RG is a relation in G, QS is a query on S
• LAV: RS ⊆ QG(G),
where Rs is a relation in S, QG is a query on G
© 2005 Anastasios Kementsietsidis and René
Renée J. Miller
21
Query Answering
GAV uses Query/View Unfolding
select project
from funding
where amount > $45.000
create view funding (aid, amount, project, date) as
( select grant.gid, grant.amount, grant.project, received.date
from grant, received
where grant.gid = received.gid)
select project
from grant, received
where grant.gid = received.gid AND
grant.amount > $45.000
© 2005 Anastasios Kementsietsidis and René
Renée J. Miller
What
What about
about LAV?
LAV?
ItIt uses
uses aa method
method called
called
Query
Query Rewriting
Rewriting
(not
(not presented
presented here)
here)
22
Download