Data sources

Describing data sources
Outline


Overview
Schema mapping languages
Source descriptions
• Which sources are
available
• What data exists in
each source
• How each source
can be accessed
Components of a data integration
system
User query is
reformulated into a
query over the data
sources
Working example
Mediated schema:
Movie(title, director, year, genre), Actors(title, name)
Plays(movie, location, startTime)
Reviews(title, rating, description)
Data sources:
S1:
Movie(MID, title)
Actor(AID, firstName, lastName, nationality, yearOfBirth)
ActorPlays(AID, MID)
MovieDetail(MID, director, genre, year)
S2:
Cinemas(place, movie, start)
S3:
NYCCinemas(name, title, startTime)
S4:
Reviews(title, date, grade, review)
S5:
MovieGenres(title, genre)
S6:
MovieDirectors(title, dir)
S7:
MovieYears(title, year)
Components of source descriptions
• Schema mappings
  • What data exists in sources
  • How to map terms used in source schemata to terms used in the mediated schema
• Information used to optimize queries to the sources and to avoid illegal access patterns
  • Access pattern limitations, because data sources may differ in the access patterns they support
  • Source completeness
Schema mapping


Main component of a source description
Specification of:



What data exists in the source
How the terms used in the source schema relate to the
terms used in the mediated schema
Needs to handle semantic heterogeneity:
discrepancies between the source schemata and
the mediated schema




Relation and attribute names
Tabular organization
Domain coverage
Data-level variations
Query reformulation

Besides schema mappings, source
descriptions specify information:

To enable the data integration system to optimize
queries posed to the sources


Knowing that a data source is complete saves work: other data
sources that have overlapping data need not be accessed
To avoid illegal access patterns

Data sources may differ on which access patterns they
support
Schema mapping languages

Schema mapping: a set of expressions that describe a
relationship between a set of schemata (typically two). In
our case, the mediated schema and the schemata of the sources


Used to reformulate a query formulated in terms of the mediated
schema into appropriate queries on the sources.
Result is called logical query plan (query expression that refers
only to the relations in the data sources)


It is not always possible to generate a query plan that produces
all the certain answers
Two types of algorithms involved:



To find the best possible logical plan
To find all the certain answers
Schema mapping languages based on: query
expressions
Semantics of schema mappings
A semantic mapping M defines a relation MR over:
I(G) × I(S1) × ... × I(Sn)
Where:
• I(G) denotes the possible instances of the mediated schema
• I(S1), ..., I(Sn) denote the possible instances of the source relations S1, ..., Sn, respectively
If (g, s1, ..., sn) ∈ MR, then g is a possible instance of
the mediated schema when the source relation
instances are s1, ..., sn
Certain answers

Let M be a schema mapping between a mediated
schema G and source schemata S1, ..., Sn that
defines the relation MR over I(G) × I(S1) × ... × I(Sn).
Let Q be a query over G, and let s1, ..., sn be
instances of the source relations.
We say that t is a certain answer of Q wrt M and s1, ..., sn
if t ∈ Q(g) for every instance g of G s.t.
(g, s1, ..., sn) ∈ MR
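The definition above can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the candidate instances of G are given as a small finite list (in general I(G) is infinite), and the names certain_answers, consistent, titles and the toy data are hypothetical, not part of the slides.

```python
def certain_answers(query, candidates, consistent):
    """Intersect query(g) over every candidate instance g accepted by consistent(g)."""
    possible = [g for g in candidates if consistent(g)]
    if not possible:
        return set()  # convention chosen for this sketch when no instance qualifies
    answers = set(query(possible[0]))
    for g in possible[1:]:
        answers &= set(query(g))
    return answers


# Toy setting (hypothetical): G has one relation Movie(title, genre) and there is
# one source instance s1 that every possible mediated instance must contain.
s1 = {("Annie Hall", "comedy")}
candidates = [
    frozenset({("Annie Hall", "comedy")}),
    frozenset({("Annie Hall", "comedy"), ("Alien", "horror")}),
    frozenset({("Alien", "horror")}),  # rejected below: misses a source tuple
]


def consistent(g):
    """Stand-in for the test (g, s1) in MR: g must contain every source tuple."""
    return s1 <= g


def titles(g):
    """Query Q over G: return all movie titles."""
    return {title for (title, _genre) in g}


print(certain_answers(titles, candidates, consistent))
```

Only "Annie Hall" is returned by Q on every consistent instance, so it is the only certain answer; "Alien" appears in some possible instances but not all of them.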
Properties of schema mapping languages


Flexibility: the formalism should be able to express
a wide variety of relationships between schemata.
Efficient reformulation: reformulation algorithms
should have well understood properties and be
efficient

Trade-off: flexibility/expressiveness vs efficiency

Easy update: Must be easy to add and remove
sources

Schema mapping languages:



Global-As-View (GAV)
Local-As-View (LAV)
Global-Local-As-View (GLAV)
Two systems

(GAV) TSIMMIS [Garcia-Molina+97] – Stanford
• Focus: semistructured data (OEM), OQL-based language (Lorel)
• Creates a mediated schema as a view over the sources
• Spawned a UCSD project called MIX, which led to a company now owned by BEA Systems
• Other important systems of this vein: Kleisli/K2 @ Penn
(LAV) Information Manifold [Levy+96] – AT&T Research
• Focus: local-as-view mappings, relational model
• Sources defined as views over mediated schema
• Led to peer-to-peer integration approaches (Piazza, etc.)
Focus: Web-based queryable sources
Global-As-View (GAV)

Defines the mediated schema as a set of
views over the data sources


The mediated schema is also referred to as the global schema
Let G be a mediated schema, and let S =
{S1, ..., Sn} be schemata of n data sources,
A Global-As-View schema mapping M is a set
of expressions of the form:
Gi(X) ⊇ Q(S) or Gi(X) = Q(S), where


Gi is a relation in G, and appears in at most one
expression in M, and
Q(S) is a query over the relations in S
Example of a GAV schema mapping
Movie(title, director, year, genre) ⊇
S1.Movie(MID, title),
S1.MovieDetail(MID, director, genre, year)
Movie(title, director, year, genre) ⊇
S5.MovieGenres(title, genre),
S6.MovieDirectors(title, director),
S7.MovieYears(title, year)
Plays(movie, location, startTime) ⊇
S2.Cinemas(location, movie, startTime)
Plays(movie, location, startTime) ⊇
S3.NYCCinemas(location, movie, startTime)
GAV semantics

Let M = M1, ..., Ml be a GAV schema
mapping between G and S = {S1, ..., Sn},
where Mi is of the form Gi(X) ⊇ Qi(S), or
Gi(X) = Qi(S).
Let g be an instance of the mediated schema
G, and let s = s1, ..., sn be instances of S1,
..., Sn, respectively. The tuple of instances (g,
s1, ..., sn) is in MR if for every 1 ≤ i ≤ l, the
following holds:
• If Mi is a = expression, then the extension of Gi in
g is equal to the result of evaluating Qi on s,
• If Mi is a ⊇ expression, then the extension of Gi in
g is a superset of the result of evaluating Qi on s
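The membership test above can be sketched as a small Python check. The function in_gav_relation, the encoding of a mapping as a (relation, op, query) triple with op written as "=" or ">=" (for superset), and the sample tuples are assumptions made for illustration only.

```python
def in_gav_relation(g, sources, mappings):
    """Return True if (g, sources) satisfies every GAV expression in mappings.

    g:        dict mapping a mediated relation name to a set of tuples
    sources:  dict mapping a source relation name to a set of tuples
    mappings: list of (mediated relation, op, query over sources);
              op is "=" or ">=" (">=" stands for the superset form)
    """
    for relation, op, query in mappings:
        result = set(query(sources))
        extension = g.get(relation, set())
        if op == "=" and extension != result:
            return False
        if op == ">=" and not extension >= result:
            return False
    return True


def plays_from_s2(s):
    """Plays(movie, location, startTime) defined over S2.Cinemas(location, movie, startTime)."""
    return {(movie, loc, start) for (loc, movie, start) in s["S2.Cinemas"]}


# Hypothetical sample data.
sources = {"S2.Cinemas": {("Odeon", "Alien", "21:00")}}
g_larger = {"Plays": {("Alien", "Odeon", "21:00"), ("Brazil", "Rex", "20:00")}}
g_empty = {"Plays": set()}

print(in_gav_relation(g_larger, sources, [("Plays", ">=", plays_from_s2)]))  # superset form holds
print(in_gav_relation(g_larger, sources, [("Plays", "=", plays_from_s2)]))   # equality form fails
print(in_gav_relation(g_empty, sources, [("Plays", ">=", plays_from_s2)]))   # source tuple missing
```

Under the superset form the mediated instance may hold tuples that no source provides; under the equality form it may not.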
Reformulation in GAV


To reformulate a query posed over the
mediated schema, simply unfold the query
with the view definitions
The reformulation resulting from the unfolding
is guaranteed to find all the certain answers

Example
The query Q, over the mediated schema, asks for comedies starting at or after 8pm:
Q(title, location, st) :- Movie(title, director, year, “comedy”),
Plays(title, location, st), st >= 8pm
Reformulating Q with the source descriptions yields the
following four logical query plans:

Q’(title, location, st) :- S1.Movie(MID, title),
S1.MovieDetail(MID, director, “comedy”, year),
S2.Cinemas(location, title, st), st >= 8pm
Q’(title, location, st) :- S1.Movie(MID, title),
S1.MovieDetail(MID, director, “comedy”, year),
S3.NYCCinemas(location, title, st), st >= 8pm
Q’(title, location, st) :- S5.MovieGenres(title, “comedy”),
S6.MovieDirectors(title, director),
S7.MovieYears(title, year),
S2.Cinemas(location, title, st), st >= 8pm
Q’(title, location, st) :- S5.MovieGenres(title, “comedy”),
S6.MovieDirectors(title, director),
S7.MovieYears(title, year),
S3.NYCCinemas(location, title, st), st >= 8pm
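The unfolding step behind these four plans can be sketched in Python. This is a simplified illustration, not the algorithm of any particular system: atoms are encoded as (relation, terms) pairs, the comparison st >= 8pm is left out, and existential variables such as MID are not renamed (so the sketch assumes each view is used at most once per plan). The names GAV and unfold are hypothetical.

```python
from itertools import product

# Each mediated relation maps to a list of (head variables, body atoms) definitions,
# transcribed from the GAV schema mapping of the working example.
GAV = {
    "Movie": [
        (("title", "director", "year", "genre"),
         [("S1.Movie", ("MID", "title")),
          ("S1.MovieDetail", ("MID", "director", "genre", "year"))]),
        (("title", "director", "year", "genre"),
         [("S5.MovieGenres", ("title", "genre")),
          ("S6.MovieDirectors", ("title", "director")),
          ("S7.MovieYears", ("title", "year"))]),
    ],
    "Plays": [
        (("movie", "location", "startTime"),
         [("S2.Cinemas", ("location", "movie", "startTime"))]),
        (("movie", "location", "startTime"),
         [("S3.NYCCinemas", ("location", "movie", "startTime"))]),
    ],
}


def unfold(query_body):
    """Replace each mediated atom by each of its definitions; return all plans."""
    choices = []
    for relation, args in query_body:
        expansions = []
        for head_vars, body in GAV[relation]:
            subst = dict(zip(head_vars, args))  # head variable -> query term
            expansions.append(
                [(src, tuple(subst.get(t, t) for t in terms)) for src, terms in body]
            )
        choices.append(expansions)
    # One plan per combination of definitions, one definition per query atom.
    return [[atom for part in combo for atom in part] for combo in product(*choices)]


query = [
    ("Movie", ("title", "director", "year", '"comedy"')),
    ("Plays", ("title", "location", "st")),
]
plans = unfold(query)
for plan in plans:
    print(", ".join(f"{rel}({', '.join(terms)})" for rel, terms in plan))
```

Two definitions for Movie times two definitions for Plays gives the four plans listed above.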
Limitations


The reformulation may not be the most efficient
method to answer the query
Some subgoals may be redundant



In the last two reformulations, the subgoals
S6.MovieDirectors and S7.MovieYears are not needed,
since all the query really needs from the Movie relation is the
genre of the movie.
But there is no way of concluding this from the GAV descriptions
Adding and removing sources involves considerable
work and knowledge of the sources -> potentially
not scalable


Ex: if we discover another source that includes only movie
directors, then to update the source descriptions we need to specify
exactly which sources it needs to be joined with in order to
produce tuples of Movie
TSIMMIS [Garcia-Molina+97]


One of the first systems to support semi-structured
data according to the OEM data model, which
predated XML by several years.
Mediator Specification Language (MSL): logic-based
OO language used as a view definition language
targeted to the OEM data model and to the
integration of heterogeneous data sources



Based on Datalog, among others
Wrappers accept queries expressed in MSL and
compare them with the patterns (MSL templates)
given in the wrapper specification file
An instance of a GAV mediation system

We define our global schema as views over the sources
XML vs. Object Exchange Model
XML:
<book>
<author>Bernstein</author>
<author>Newcomer</author>
<title>Principles of TP</title>
</book>
<book>
<author>Chamberlin</author>
<title>DB2 UDB</title>
</book>
OEM:
O1: book {
O2: author { Bernstein }
O3: author { Newcomer }
O4: title { Principles of TP }
}
O5: book {
O6: author { Chamberlin }
O7: title { DB2 UDB }
}
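The correspondence between the two notations can be sketched with Python's standard xml.etree.ElementTree module. The (object id, label, value) triple used here is a simplification chosen for illustration, and to_oem is a hypothetical helper, not part of TSIMMIS.

```python
import xml.etree.ElementTree as ET
from itertools import count


def to_oem(elem, ids=None):
    """Turn an XML element into a nested (object id, label, value) triple.

    Atomic objects carry their text as value; complex objects carry a list
    of sub-objects. Ids are assigned in document order: O1, O2, ...
    """
    ids = ids if ids is not None else count(1)
    oid = f"O{next(ids)}"
    children = [to_oem(child, ids) for child in elem]
    value = children if children else (elem.text or "").strip()
    return (oid, elem.tag, value)


book = ET.fromstring(
    "<book><author>Bernstein</author><author>Newcomer</author>"
    "<title>Principles of TP</title></book>"
)
print(to_oem(book))
```

The printed structure mirrors the O1 to O4 objects of the first book in the slide.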
User queries in TSIMMIS
Specified in an OQL-style language called Lorel


OQL was an object-oriented query language that looks like
SQL
Lorel is, in many ways, a predecessor to XQuery
Based on path expressions over OEM structures:
select book
where book.title = “DB2 UDB” and book.author =
“Chamberlin”
This is basically like XQuery, which we’ll use in place of
Lorel and the MSL template language. Previous query
restated:
for $b in AllData()/book
where $b/title/text() = “DB2 UDB” and
$b/author/text() = “Chamberlin”
return $b
Query Answering in TSIMMIS
Basically, it’s view unfolding, i.e., composing a
query with a view



The query is the one being asked
The views are the MSL templates for the
wrappers
Some of the views may actually require
parameters, e.g., an author name, before they’ll
return answers


Common for web forms (see Amazon, Google, …)
XQuery functions (XQuery’s version of views) support
parameters as well, so we’ll see these in action
Recall SQL View Unfolding/Expansion

A view consisting of branches and their customers
create view all_customer as
(select branch_name, customer_name
from depositor, account
where depositor.account_number =
account.account_number )
union
(select branch_name, customer_name
from borrower, loan
where borrower.loan_number = loan.loan_number )
Find all customers of the Perryridge branch
select customer_name
from all_customer
where branch_name = 'Perryridge'
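To see that the query over the view and its unfolded form return the same rows, here is a small sketch using Python's built-in sqlite3 module. The table layouts and the sample rows are assumptions made for the demonstration; the slides only give the view definition and the query. The parentheses around the two branches of the union are dropped because SQLite does not accept them there.

```python
import sqlite3

# Body of the all_customer view, reused verbatim when unfolding the query.
VIEW_BODY = """
    select branch_name, customer_name
    from depositor, account
    where depositor.account_number = account.account_number
    union
    select branch_name, customer_name
    from borrower, loan
    where borrower.loan_number = loan.loan_number
"""

conn = sqlite3.connect(":memory:")
conn.executescript(f"""
    create table account(account_number, branch_name);
    create table depositor(customer_name, account_number);
    create table loan(loan_number, branch_name);
    create table borrower(customer_name, loan_number);
    insert into account values ('A-1', 'Perryridge'), ('A-2', 'Downtown');
    insert into depositor values ('Hayes', 'A-1'), ('Jones', 'A-2');
    insert into loan values ('L-1', 'Perryridge');
    insert into borrower values ('Adams', 'L-1');
    create view all_customer as {VIEW_BODY};
""")

# The query as written against the view.
via_view = conn.execute(
    "select customer_name from all_customer where branch_name = 'Perryridge'"
).fetchall()

# The same query after unfolding: the view name is replaced by its definition.
unfolded = conn.execute(
    f"select customer_name from ({VIEW_BODY}) as v where branch_name = 'Perryridge'"
).fetchall()

print(sorted(via_view))
print(sorted(unfolded))
```

Both queries return the same customers, which is the property GAV reformulation relies on.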
A Wrapper Definition in MSL
Wrappers have templates and binding patterns ($X) in MSL:
B :- B: <book {<author $X>}>
// $$ = “select * from book where author=“ $X //

If the template is matched by the query issued to the mediator, an
SQL query is issued over Book(author, year, title), which is the
relation stored in the data source
In XQuery, this might look like:
…
define function GetBook($x AS xsd:string) as book {
for $b in
sql(“Amazon.DB”,
“select * from book where author=‘” + $x +”’”)
return <book>{$b/title}<author>{$x}</author></book>
}
GetBook’s results are unioned with others to form the view
Mediator()
How to Answer the Query
Given our query:
for $b in Mediator()/book
where $b/title/text() = “DB2 UDB” and
$b/author/text() = “Chamberlin”
return $b
Find all wrapper definitions that:


Contain enough “structure” to match the
conditions of the query
Or have already tested the conditions for us!
Query Composition with Views
We find all views that define book with author and title as
output, and we compose the query with each:
define function GetBook($x AS xsd:string) as book {
for $b in
sql(“Amazon.DB”,
“select * from book where author=‘” + $x + “’”)
return <book> {$b/title} <author>{$x}</author></book>
}
for $b in Mediator()/book
where $b/title/text() = “DB2 UDB” and
$b/author/text() = “Chamberlin”
return $b
[Figure: tree pattern with root book and children title and author]
Matching View Output to
Our Query’s Conditions

Determine that $b/author/text() corresponds to $x by matching the pattern on the function’s
output:
define function GetBook($x AS xsd:string) as book {
for $b in
sql(“Amazon.DB”,
“select * from book where author=‘” + $x + “’”)
return <book>{ $b/title }
<author>{$x}</author> </book>
}
let $x := “Chamberlin”
for $b in GetBook($x)/book
where $b/title/text() = “DB2 UDB”
return $b
[Figure: tree pattern with root book and children title and author]
The Final Step: Unfolding
let $x := “Chamberlin”
for $b in (
for $b’ in
sql(“Amazon.DB”,
“select * from book where author=‘” + $x + “’”)
return <book>{ $b’/title }<author>{$x}</author></book>
)/book
where $b/title/text() = “DB2 UDB”
return $b
This can be simplified into:
for $b in sql(“Amazon.DB”,
“select * from book where author=‘Chamberlin’”)
where $b/title/text() = “DB2 UDB”
return $b
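The same idea, a wrapper that needs its parameter bound before it can be called and a mediator that takes the binding from the query's own conditions, can be sketched in Python with sqlite3. The functions get_book and answer, the in-memory database, and the sample rows (including the years) are hypothetical; only the relation Book(author, year, title) comes from the slides. A parameterized SQL query replaces the string concatenation shown in the XQuery version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table book(author, year, title);
    insert into book values
        ('Chamberlin', 1998, 'DB2 UDB'),
        ('Chamberlin', 1996, 'Using the New DB2'),
        ('Bernstein', 1997, 'Principles of TP');
""")


def get_book(connection, author):
    """Wrapper view: the author must be bound before the source can be queried."""
    rows = connection.execute("select title from book where author = ?", (author,))
    return [{"title": title, "author": author} for (title,) in rows]


def answer(connection, conditions):
    """Bind the wrapper parameter from the query, then test the remaining conditions."""
    if "author" not in conditions:
        raise ValueError("the wrapper cannot be called without an author binding")
    remaining = {k: v for k, v in conditions.items() if k != "author"}
    return [
        b for b in get_book(connection, conditions["author"])
        if all(b[key] == value for key, value in remaining.items())
    ]


print(answer(conn, {"title": "DB2 UDB", "author": "Chamberlin"}))
```

The author condition is pushed into the source query, while the title condition is checked on the wrapper's output, mirroring the simplified query above.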
Virtues of TSIMMIS

• Early adopter of semistructured data, greatly predating XML
  • Can support data from many different kinds of sources
  • Obviously, doesn’t fully solve heterogeneity problem
• Presents a mediated schema that is the union of multiple views
  • Query answering based on view unfolding
• Easily composed in a hierarchy of mediators
Big limitation of TSIMMIS
Mediated schema is basically the union of the
various MSL templates – as they change, so
may the mediated schema
Local-As-View


Opposite approach to GAV
Focus on describing each data source as
precisely as possible and independently of
any other sources


Instead of specifying how to compute tuples of the
mediated schema,
LAV expressions describe data sources as
queries over the mediated schema
The Local-as-View Model
The basic model is the following:



Local sources are views over the mediated schema
Sources have the data – mediated schema is virtual
Sources may not have all the data from the domain –
“open-world assumption”
The system must use the sources (views) to
answer queries over the mediated schema
LAV schema mappings

Let G be a mediated schema and
let S = {S1, ..., Sn} be schemata of n data
sources.
A Local-As-View schema mapping M is a
set of expressions of the form Si(X) ⊆ Qi(G)
or Si(X) = Qi(G), where:


Qi is a query over the mediated schema G, and
Si is a source relation and it appears in at most
one expression in M
Recap. example
Mediated schema:
Movie(title, director, year, genre), Actors(title, name)
Plays(movie, location, startTime)
Reviews(title, rating, description)
Data sources:
S1:
Movie(MID, title)
Actor(AID, firstName, lastName, nationality, yearOfBirth)
ActorPlays(AID, MID)
MovieDetail(MID, director, genre, year)
S2:
Cinemas(place, movie, start)
S3:
NYCCinemas(name, title, startTime)
S4:
Reviews(title, date, grade, review)
S5:
MovieGenres(title, genre)
S6:
MovieDirectors(title, dir)
S7:
MovieYears(title, year)
LAV example
In LAV, sources S5-S7 would be described as projection
queries over the Movie relation in the mediated schema
S5.MovieGenres(title, genre) ⊆ Movie(title,
director, year, genre)
S6.MovieDirectors(title, dir) ⊆ Movie(title,
dir, year, genre)
S7.MovieYears(title, year) ⊆ Movie(title,
director, year, genre)
In LAV, we can express constraints on the contents of data
sources
S9(title, year, “comedy”) ⊆ Movie(title,
director, year, “comedy”), year >= 1970

LAV semantics

Let M = M1, ..., Ml be a LAV schema mapping
between G and S = {S1, ..., Sn}, where Mi is of the
form Si(X) ⊆ Qi(G) or Si(X) = Qi(G).
Let g be an instance of the mediated schema G, and
let s = s1, ..., sn be instances of S1, ..., Sn,
respectively. The tuple of instances (g, s1, ..., sn) is
in MR if for every 1 ≤ i ≤ l, the following holds:
• If Mi is a = expression, then the result of evaluating Qi over
g is equal to si
• If Mi is a ⊆ expression, then the result of evaluating Qi over
g is a superset of si
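A minimal Python sketch of this membership test, reading Si(X) ⊆ Qi(G) as "the source instance si is contained in the result of Qi over g" (the open-world reading). The function in_lav_relation, the encoding of a mapping as a (source, op, query) triple with op written as "=" or "<=", and the sample tuples are illustrative assumptions.

```python
def in_lav_relation(g, sources, mappings):
    """Return True if (g, sources) satisfies every LAV expression in mappings.

    mappings: list of (source name, op, query over g); op is "=" or "<="
              ("<=" stands for: the source holds a subset of the query result)
    """
    for name, op, query in mappings:
        result = set(query(g))
        s_i = sources[name]
        if op == "=" and s_i != result:
            return False
        if op == "<=" and not s_i <= result:
            return False
    return True


def s9_view(g):
    """S9(title, year, "comedy") over Movie(title, director, year, genre), year >= 1970."""
    return {
        (title, year, genre)
        for (title, _director, year, genre) in g["Movie"]
        if genre == "comedy" and year >= 1970
    }


# Hypothetical sample data.
g = {"Movie": {
    ("Annie Hall", "Allen", 1977, "comedy"),
    ("Airplane!", "Abrahams", 1980, "comedy"),
    ("Alien", "Scott", 1979, "horror"),
}}
sources = {"S9": {("Annie Hall", 1977, "comedy")}}

print(in_lav_relation(g, sources, [("S9", "<=", s9_view)]))  # incomplete source is allowed
print(in_lav_relation(g, sources, [("S9", "=", s9_view)]))   # equality requires completeness
```

With the containment form, S9 may omit "Airplane!" and still be consistent with g; with the equality form it may not.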
Reformulation in LAV
Main advantages: flexibility + enables expressing
incomplete information


data sources are described in isolation => the system, and not the
designer, will find ways of combining data from multiple sources
Easier for a designer to add/remove sources
Example:
Q(title) :- Movie(title, director, year,
“comedy”), year >= 1960
Using sources S5-S7, we obtain the reformulation:
Q’(title) :- S5.MovieGenres(title, “comedy”),
S7.MovieYears(title, year), year >= 1960
Using source S9, we obtain the reformulation:
Q’(title) :- S9(title, year, “comedy”)
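As a rough illustration, the two reformulations can be evaluated over sample source instances and their results unioned. The source contents below are made up for the example, and plan_s5_s7 and plan_s9 are hypothetical names that simply transcribe the two rules.

```python
# Hypothetical source instances.
s5_movie_genres = {("Annie Hall", "comedy"), ("Alien", "horror"), ("Some Like It Hot", "comedy")}
s7_movie_years = {("Annie Hall", 1977), ("Alien", 1979), ("Some Like It Hot", 1959)}
s9 = {("Airplane!", 1980, "comedy")}


def plan_s5_s7(genres, years):
    """Q'(title) :- S5.MovieGenres(title, "comedy"), S7.MovieYears(title, year), year >= 1960"""
    comedies = {title for (title, genre) in genres if genre == "comedy"}
    return {title for (title, year) in years if title in comedies and year >= 1960}


def plan_s9(rows):
    """Q'(title) :- S9(title, year, "comedy"); S9 only holds comedies from 1970 on."""
    return {title for (title, _year, genre) in rows if genre == "comedy"}


# The answer to Q is the union of the answers of the individual plans.
answers = plan_s5_s7(s5_movie_genres, s7_movie_years) | plan_s9(s9)
print(sorted(answers))
```

The plan over S9 needs no year test because the source description already guarantees year >= 1970, which implies year >= 1960.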
The Information Manifold [Levy+96]
When you integrate something, you have some
conceptual model of the integrated domain


Define that as a basic frame of reference, everything
else as a view over it
Local as View
May have overlapping/incomplete sources
Define each source as the subset of a query over the
mediated schema
We can use selection or join predicates to specify that a
source contains a range of values:
ComputerBooks(…) ⊆ Books(Title, …, Subj),
Subj = “Computers”

Advantages and Shortcomings of LAV

Enables expressing incomplete information
More robust way of defining mediated schemas
and sources
Mediated schema is clearly defined, less likely to
change
Sources can be more accurately described

Computationally more expensive!



References




AnHai Doan, Alon Halevy, and Zachary Ives. Chapter 4, draft of the book “Principles of Data Integration” (in preparation).
Sudarshan Chawathe, Hector Garcia-Molina, Joachim Hammer, Kelly Ireland, Yannis Papakonstantinou, Jeffrey Ullman, and Jennifer Widom. The TSIMMIS project: Integration of heterogeneous information sources. In Proceedings of IPSJ, Tokyo, Japan, October 1994.
Alon Y. Levy, Anand Rajaraman, and Joann J. Ordille. Querying Heterogeneous Information Sources Using Source Descriptions. In Proceedings of the International Conference on Very Large Databases (VLDB), 1996.
Zach Ives. Slides of the course “Database and Information Systems”, Fall 2007, available at: http://www.seas.upenn.edu/~zives/07f/cis550/