Job Shop Scheduling - Computer Science & Engineering

advertisement
Planning for the Web I
Data Integration
Dan Weld
University of Washington
June, 2003
Acknowledgements
•
•
•
•
Oren Etzioni
Alon Halevy
Zachary Ives
Rao Kambhampati
Caveat
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
2
My Two Talks for Today
• Data Integration
Providing uniform access to disparate data srcs
AI meets DB
Answering queries using views
Execution in the face of uncertainty, latency
• Service Integration
Invoking and composing web services
Query and update
Planning with incomplete information
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
3
Overview: Data Integration
•
•
•
•
Motivation / intro
Wrappers / information extraction
Database review
Integrating data sources
Content, completeness, capabilities
Reformulation algorithms: Bucket
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
4
What is Data Integration?
A system providing:
Uniform (same query interface to all sources)
Access to (queries; eventually updates too)
Multiple (we want many, but 2 is hard too)
Autonomous (DBA doesn’t report to you)
Heterogeneous (data models are different)
Structured (or at least semi-structured)
Data Sources (not only databases).
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
5
User enters query
Formulate queries
Remove duplicates
...
Post-process + rank
Lycos
Excite
Collate results
Download?
Present to user
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
6
•
•
•
•
•
•
•
•
•
•
•
•
Meta-?
Web Search
Shopping
Product Reviews
Chat Finder
Columnists (e.g. jokes, sports, ….)
Email Lookup
Event Finder
People Finder
Restaurant Reviews
Job Listings
Classifieds
Apartment + Real Estate
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
7
Intuition: Info Integration
• Info aggregation … on Steroids!
• Want agent such that
• User says what she wants
• Agent decides how & when to achieve it
• Example:
Show me all reviews of movies starring Matt
Damon that are currently playing in Seattle
Sidewalk
IMDB
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
Ebert
Info. Aggregation vs. Integration
prices of
laptop
with …
movies
in Seattle
starring …
IMDB
store1
store2
… storeN
sidewalk
join
rev1
rev2
…
revN
sort
•
•
•
•
Join, sort
aggregate
More complex queries
Dynamic generation/optimization of execution plan
Applicable to wider range of problems
Much harder to implement efficiently
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
9
Challenges
User must know which sites have relevant
info
User must go to each one in turn
Slow: Sequential access takes time
Confusing: Each site has a different
interface
User must manually integrate information
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
Practical Motivation
• Enterprise
Business “dashboard’’; web-site construction.
• WWW
Comparison shopping
Portals integrating data from multiple sources
B2B, electronic marketplaces
• Science and culture:
Medical genetics: integrating genomic data
Astrophysics: monitoring events in the sky.
Environment: Puget Sound Regional Synthesis Model
Culture: uniform access to all cultural databases
produced by countries in Europe.
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
11
The Problem: Data Integration
mybooks.com Mediated Schema
Books
Internet
Inventory
Orders
WAN
MorganKaufman
PrenticeHall
...
Shipping
Internet
East
Orders
West
Reviews
Internet
FedEx
Customer
Reviews
UPS
NYTimes
...
alt.books.
reviews
Uniform query capability across autonomous,
heterogeneous data sources on LAN, WAN,
or Internet
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
12
Current Solutions
• Mostly ad-hoc programming: create a
special solution for every case;
pay consultants a lot of money.
• Data warehousing: load all the data
periodically into a warehouse.
6-18 months lead time
Separates operational DBMS from decision
support DBMS. (not only a solution to data
integration).
Performance is good; data may not be fresh.
Need to clean, scrub you data.
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
13
Data Warehouse Architecture
User queries
OLAP / Decision support/
Data cubes/ data mining
Relational database (warehouse)
Data extraction,
cleaning/
scrubbing
Data extraction
programs
Data
source
Data
source
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
Data
source
14
• Pro
Warehouse Summary
Relatively simple
Good performance (OLAP support)
Mature technology (DB, ETL industries)
• Con
Expensive
Stale data
Risky – most warehouse projects fail
• Rigid architecture
• Fixed schema
• Must know all queries ahead of time
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
15
Architecture for Virtual
Integration
Leave the data in the sources.
When a query comes in:
1)
2)
3)
4)
Determine which sources are relevant to query.
Break query into sub-queries for each source.
Get answers from sources
Combine.
Data is fresh.
Challenge: performance.
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
16
Virtual Integration Architecture
User queries
Mediated schema
Reformulator
Mediator:
Optimizer
Execution engine
Data source
catalog
wrapper
wrapper
wrapper
Data
source
Data
source
Data
source
Sources can be: relational, hierarchical (IMS), structured files, web sites.
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
17
Research Projects
•
•
•
•
•
•
•
•
•
•
Garlic (IBM),
Information Manifold (AT&T)
Tsimmis, InfoMaster (Stanford)
Internet Softbot/Razor/Tukwila (U Wash.)
Hermes (Maryland)
Telegraph / Eddies (UC Berkeley)
Niagara (Univ Wisconsin)
DISCO, Agora (INRIA, France)
SIMS/Ariadne (USC/ISI)
Emerac/Havasu (ASU)
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
18
Industry
•
•
•
•
Nimble Technology
Enosys Markets
IBM
BEA
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
19
Dimensions to Consider
•
•
•
•
•
•
How many sources are we accessing?
How autonomous are they?
Meta-data about sources?
Is the data structured?
Queries or also updates?
Requirements: accuracy, completeness,
performance, handling inconsistencies.
• Closed world assumption vs. open world?
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
20
Outline
Motivation / introduction
• Wrappers / information extraction
• Database Review
• Integrating data sources
Content, completeness, capabilities
Reformulation algorithms: Bucket
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
21
• Task
Wrapper Programs
to communicate with the data sources and do
format translations.
• Built w.r.t. a specific source.
• Can sit either at the source or mediator.
• Often hard to build
(very little science).
• Can be “intelligent”
perform source-specific optimizations.
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
22
Example
<b> Introduction to DB </b>
Transform:
<i> Phil Bernstein </i>
<i> Eric Newcomer </i>
Addison Wesley, 1999
into:
<book>
<title> Introduction to DB </title>
<author> Phil Bernstein </author>
<author> Eric Newcomer </author>
<publisher> Addison Wesley </publisher>
<year> 1999 </year>
</book>
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
23
Wrapper Construction
• Use PERL, or
• Generate wrappers automatically
Get training examples
• Human marks up selected pages with GUI tool
Use shallow NLP to create features
Favorite learning method
• HMMs, VS on prefix, postfix strings, ??
Boosting
Co-training
• See research on information extraction
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
24
SemTag & Seeker
• WWW-03 Best Paper Prize
• Seeded with TAP ontology (72k concepts)
And ~700 human judgments
• Crawled 264 million web pages
• Extracted 434 million semantic tags
Automatically disambiguated
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
25
Outline
Motivation / Introduction
Wrappers / Information extraction
• Database Review
• Relational algebra, SQL, datalog
• Views
• Optimization (query planning)
• Integrating data sources
Content, completeness, capabilities
Reformulation algorithms: Bucket
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
26
Traditional Database
Architecture
Query
(SQL)
Answer
(relation)
Database Manager
(DBMS)
-Storage mgmt
-Query processing
-View management
-(Transaction processing)
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
Database
(relational)
27
Relational Data: Terminology
Product
attribute
relation (Arity=4)
Name
Price
Category
Manufacturer
gizmo
$19.99
gadgets
GizmoWorks
Power gizmo
$29.99
gadgets
GizmoWorks
SingleTouch
$149.99
photography
Canon
MultiTouch
$203.99
household
Hitachi
tuple
schema
Product(Name: string, Price: real, category: enum, Man: string)
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
28
Relational Algebra
• Operators
tuple sets as input,
new set as output
• Operations
Name
Price
Category
Manufacturer
gizmo
$19.99
gadgets
GizmoWorks
Power gizmo $29.99
gadgets
GizmoWorks
SingleTouch
$149.99
photography
Canon
MultiTouch
$203.99
household
Hitachi
Union, Intersection, difference, ..
Selection (s)
Projection ()
Cartesian product (X)
• Join ( )
City
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
Manufacturer
Tempe
GizmoWorks
Kyoto
Canon
Dayton
Hitachi
29
SQL: A query language for
Relational Algebra
Many standards out there: SQL92, SQL2,
SQL3, SQL99
Select attributes
From relations (possibly multiple, joined)
Where conditions (selections)
Other features:
aggregation, group-by
etc.
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
“Find companies that manufacture products
bought by Joe Blow”
SELECT Company.name
FROM
Company, Product
WHERE Company.name=Product.maker
AND Product.name IN
(SELECT product
FROM Purchase
WHERE buyer = “Joe Blow”);
30
Deductive Databases
• Tables viewed as predicates.
• Ops on tables expressed as “datalog” rules
(Horn clauses, without function symbols)
Enames(Name) :- Employe(Name, SSN) [Projection]
Wealthy-Employee(Name)
:- Employee(Name,SSN), Salary(SSN,Money),Money> 10
[Selection]
Ed(Name, Dname)
:- Employee(Name, SSN), E_Dependents(SSN, Dname)
[Join]
ERelated(Name,Dname) :- Ed(Name,Dname)
ERelated(Name,Dname) :- Ed(Name,D1), ERelated(D1,D2)
[Recursion]
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
31
Views
Views are relations,
except that they are not physically stored.
Uses:
• simplify complex queries, &
• define conceptually different views of DB for diff. users.
Example:
purchases of telephony products:
CREATE VIEW telephony-purchases AS
SELECT product, buyer, seller, store
FROM Purchase, Product
WHERE Purchase.product = Product.name
AND Product.category = “telephony”
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
33
A Different View
CREATE VIEW Seattle-view AS
SELECT buyer, seller, product, store
FROM Person, Purchase
WHERE Person.city = “Seattle” AND
Person.name = Purchase.buyer
We can later use the view:
SELECT name, store
FROM
Seattle-view, Product
WHERE Seattle-view.product = Product.name AND
Product.category = “shoes”
What’s really happening when we query a view??
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
34
Materialized Views
• Views whose corresponding queries have
been executed and the data is stored in a
separate database
Uses: Caching
• Issues
Using views in answering queries
• Normally, views are available in addition to DB
– (so, views are local caches)
• In information integration, views may be the
only things we have access to.
– An internet source specializing in tom hanks movies
can be seen as a view on a database of all movies.
– Except, there is no DB out there which contains all
movies..
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
35
DB Inference
• Type 1
Evaluate query on instance of data
Exponential in size of query
Key is performance in size of data
• Type 2
Query containment
Q1  Q2
No matter what data instances
Logical entailment
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
36
Query Optimization
Declarative query
SELECT S.buyer
FROM Purchase P, Person Q
WHERE P.buyer=Q.name AND
Q.city=‘seattle’ AND
Q.phone > ‘5430000’
Imperative execution plan:
buyer
s
City=‘seattle’
phone>’5430000’
Inputs:
Buyer=name
(Simple Nested Loops)
• the query
• available memory
• statistics about the data
Person
Purchase
• indexes,
(Table scan)
(Index scan)
• cardinalities,
• selectivity factors
Ideally: Want to find best plan. Practically: Avoid worst plans!
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
37
(On-the-fly)
sname
bid=100
(On-the-fly)
sname
(On-the-fly)
rating > 5
Pipelined
(Use hash
index; do
sid=sid
not write
result to
temp)
rating > 5
bid=100
(Simple Nested Loops)
hash join
sid=sid
Reserves
Reserves
Sailors
Goal of
optimization:
To find more
efficient plans
that compute the
same answer.
sname
Sailors
(On-the-fly)
rating > 5 (On-the-fly)
SELECT S.sname
FROM Reserves
R, Sailors S
(Sort-Merge Join)
sid=sid
(Scan;
write to
temp T1)
bid=100
Sailors
WHERE
R.sid=S.sid AND
R.bid=100
AND S.rating>5
Reserves
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
38
Relational Algebra Equivalences
• Allow us to choose different join orders
and to ‘push’ selections and projections
ahead of joins.
Selections: s c1... cn  R)  s c1  . . . s cn  R))
s c1 s c 2  R) )  s c 2 s c1  R) )
(Commute)
Projections:
Joins:
 a1 R ) a1 ... an R )))
R  (S T)
 (R S)  T
(R  S)  (S  R)
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
(Cascade)
(Associative)
(Commute)
Optimizing Joins
• Q(u,x) :- R(u,v), S(v,w), T(w,x)
R
S
T
• Many ways of doing a single join R S
Symmetric vs. asymmetric join operations
• Nested join, hash join, double pipe-lined hash join etc.
Processing costs alone vs. proc. + transfer costs
• Get R and S together vs, get R, get just the tuples of S that will
join with R (“semi-join”)
• Many orders in which to do the join
(R join S) join T
(S join R) join T
(T join S) join R etc.
• All with different costs
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
40
Determining Join Order
In principle, must consider all possible join orderings:
D
D
C
A
B
C
D
A
B
C
A
As # of joins increases, # plans grows rapidly
Must restrict search space.
System-R: consider only left-deep join trees.
This lets us generate all fully pipelined plans:
B
Intermediate results not written to temporary files.
(Not all left-deep trees are fully pipelined)
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
Cost Estimation
• For each plan considered, estimate cost:
Estimate cost of each operation in plan tree.
• Depends on input cardinalities.
Estimate size of result for each op in tree!
• Use information about the input relations.
• Selectivity (Histograms)
• For selections and joins, assume independence of
predicates.
• System R cost estimation approach.
Very inexact, but works ok in practice.
More sophisticated techniques known now.
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
Key Lessons in Optimization
• Classic planning / execution scenario
Uncertainty / replanning key for data integration
• Main points
Disk IO as cost metric
Algebraic rules / use in query transformation..
Join ordering via dynamic programming
Estimating cost of plans
• Size of intermediate results.
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
44
Integrator vs. DBMS
No common schema
Sources with heterogeneous schemas
Semi-structured sources
Legacy Sources
Not relational-complete
Access/process limitations
Autonomous sources
Uncontrolled source content overlap
Lack of source statistics
Tradeoffs: plan cost, coverage, quality, …
Multi-objective cost models
Unpredictable run-time behavior
Makes query execution hard
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
45
Outline
Motivation
Wrappers / information extraction
Database Review
• Integrating data sources
Content, completeness, capabilities
Reformulation algorithms: Bucket
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
46
Remnder: Info Integration
• Want agent such that
• User says what she wants
• Agent decides how & when to achieve it
• Example:
Show me all reviews of movies starring Matt
Damon that are currently playing in Seattle
Sidewalk
IMDB
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
Ebert
Data Source Catalog
• Contains meta-information about sources:
Logical source contents (books, new cars).
Source capabilities (can answer SQL queries?)
Source completeness (has all books).
Physical properties of source and network.
Statistics about the data (like in an RDBMS)
Source reliability
Mirror sources?
Update frequency.
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
48
Content Descriptions
• User queries refer to the mediated schema.
• Source data is stored in a local schema.
• Content descriptions provide
semantic mappings between different schemas.
• Data integration system
uses the descriptions to translate user queries
into queries on the sources.
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
49
Desiderata for Source
Descriptions
• Expressive power: distinguish between
sources with closely related data. Enable
pruning of access to irrelevant sources.
• Easy addition: make it easy to add new
data sources.
• Reformulation: be able to reformulate a
user query into a query on the sources
efficiently and effectively.
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
50
Reformulation Problem
• Given:
A query Q posed over the mediated schema
Descriptions of the data sources
• Find:
A query Q’ over the data source relations, such
that:
• Q’ provides only correct answers to Q, and
• Q’ provides all possible answers from to Q
given the sources.
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
51
Approaches to Specifying Source
Descriptions
• Global-as-view: express the mediated
schema relations as a set of views over
the data source relations
• Local-as-view: express the source
relations as views over the mediated
schema.
• Can be combined with no additional cost.
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
52
Global-as-View
Mediated schema:
Movie(title, dir, year, genre),
Schedule(cinema, title, time).
Create View Movie AS
select * from S1
[S1(title,dir,year,genre)]
union
select * from S2
[S2(title, dir,year,genre)]
union
[S3(title,dir), S4(title,year,genre)]
select S3.title, S3.dir, S4.year, S4.genre
from S3, S4
where S3.title=S4.title
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
53
Global-as-View: Example 3
Mediated schema:
Movie(title, dir, year, genre),
Schedule(cinema, title, time).
Source S4: S4(cinema, genre)
Create View Movie AS
select NULL, NULL, NULL, genre
from S4
Create View Schedule AS
select cinema, NULL, NULL
from S4.
But what if we want to find which cinemas are
playing comedies?
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
55
Global-as-View Summary
 Very easy conceptually.
Query reformulation  view unfolding.
 Can build hierarchies of mediated schemas.
 Sometimes loose information.
Not always natural.
 Adding sources is hard.
Need to consider all other sources that are
available.
May need to modify every global view defn
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
56
Local-as-View: example 1
Mediated schema:
Movie(title, dir, year, genre),
Schedule(cinema, title, time).
Create Source S1 AS
[S1(title, dir, year, genre)]
select * from Movie
Create Source S3 AS
[S3(title, dir)]
select title, dir from Movie
Create Source S5 AS
[S5(title, dir, year)]
select title, dir, year
from Movie
where year > 1960 AND genre=“Comedy”
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
57
Local-as-View: Example 2
Mediated schema:
Movie(title, dir, year, genre),
Schedule(cinema, title, time).
Source S4: S4(cinema, genre)
Create Source S4
select cinema, genre
from Movie m, Schedule s
where m.title=s.title
.
Now if we want to find which cinemas are
playing comedies, there is hope!
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
58
Local-as-View Summary
• Very flexible.
You have the power of the entire query language
to define the contents of the source.
• Hence, can easily distinguish between
contents of closely related sources.
• Adding sources is easy:
They’re independent of each other.
• Query reformulation:
Answering queries using views!
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
59
The General Problem
• Given a set of views V1,…,Vn, and a query Q,
can we answer Q using only the answers to
V1,…,Vn?
Many, many papers on this problem.
Great survey on the topic: (Halevy, 2001).
• The best performing algorithm:
MiniCon (Pottinger & Levy, 2000).
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
60
Example
Query:
Q(x):- r1(x,y) & r2(y,x)
Movie,
Schedule
Views:
Create Source V1 as
V1(a):-r1(a,b)
select a
V2(d):-r2(c,d)
from r1
V3(f):- r1(f,g) & r2(g,f)
Create Source V3 as
select r1.arg1
from r1, r2
where r1.arg1 = r2.arg2
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
61
Bucket Algorithm (1)
1)
For each subgoal in the query, place
relevant views in the subgoal’s bucket
Query:
Q(x):- r1(x,y) & r2(y,x)
Views:
V1(a):-r1(a,b)
V2(d):-r2(c,d)
V3(f):- r1(f,g) & r2(g,f)
r1(x,y)
V1(x),V3(x)
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
Buckets:
r2(y,x)
V2(x), V3(x)
62
Bucket Algorithm (2)
2) For every combo in the Cartesian product of the
buckets, “check containment of the query”
Q(x):-r1(x,y) & r2(y,x)  Q’4(x):-r1(x,a) & r2(a,x) &
r1(x,b) & r2(b,x)
Bucket Algorithm will
check all possible
Candidate rewritings:
combinations
Q’1(x) :- V1(x) & V2(x)
r2(y,x)
Q’2(x) :- V1(x) & V3(x)
r1(x,y)
Q’3(x) :- V3(x) & V2(x)
V
(x),
V
(x)
V1(x),V3(x)
2
3
Q’4(x) :- V3(x) & V3(x)
r1(x,y) r2(y,x)
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
63
Modeling Source Capabilities
• Negative capabilities:
A web site may require certain inputs (in an
HTML form).
• Positive capabilities:
A source may be an ODBC compliant system.
Need to decide placement of operations
according to capabilities.
• Problem: how to describe and exploit source
capabilities.
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
64
Example #1: Access Patterns
Mediated schema relation: cites(paper1, paper2)
Create Source S1 as
select *
from cites
given paper1
S1($x,y) :- cites(x, y)
Create Source S2 as
select paper1
from cites
S2(x) :- cites(x, y)
Q(x) :- cites(x, “W03”) &
cites(x, y)
Select paper1 from cites where paper2=“W03”
Query:
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
65
Example #2: Access Patterns
Create Source S1 as
select *
from Cites
given paper1
Create Source S2 as
select paperID
from UW-Papers
Create Source S3 as
select paperID
from AwardPapers
given paperID
Query: select * from AwardPapers
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
66
Example #2: Solutions
• Can’t go directly to S3 (it requires a binding).
• Can go to S1, get UW papers,
and check if they’re in S3.
• Can
go to S1, get UW papers, feed them to S2,
and then check if they’re in S3.
• Can
go to S1, feed results into S2, feed results
into S2 again, then check if they’re in S3.
• Note: we can’t a priori decide when to stop.
Need recursive query processing.
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
67
Algorithms
• Inverse Rules [Duschka & Genesereth]
• Minicon [Pottinger & Levy]
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
68
Complexity of finding maximallycontained plans
• Complexity depends on sources, not query
Sources as unions of conjunctive queries
(NP-hard)
• Disjunctive descriptions
Sources as recursive queries
(Undecidable)
True source contents
Advertised description
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
69
Matching Objects Across Sources
• How do I know that D. Weld in source 1 is
the same as Daniel S. Weld in source 2?
• If  uniform keys across sources, easy.
• If not:
Domain specific solutions
• (e.g., maybe look at the address, …).
Use IR techniques (Cohen, 98).
• Judge similarity as you would between documents.
Use concordance tables.
• These are time-consuming to build, but you can then
sell them for lots of money.
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
70
Schema Mapping Problem
• Types of structures:
Database schemas, XML DTDs, ontologies, …,
• Input:
Two (or more) structures, S1 and S2
(perhaps) Data instances for S1 and S2
Background knowledge
• Output:
A mapping between S1 and S2
Should enable translating between data
instances.
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
71
Semantic Mappings between
Schemas
• Source schemas = XML DTDs
house
address
contact-info
agent-name
num-baths
agent-phone
1-1 mapping
non 1-1 mapping
house
location
contact
name
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
full-baths
half-baths
phone
72
Summary
• Data Integration
Providing uniform access to disparate data srcs
AI meets DB
• Modeling Data Sources
GAV, LAV
• Query Reformulation
Answering queries using views
Bucket algorithm
[Inverse rules, Minicon]
• Schema Matching
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
73
Next Talk
• Executing data integration plans
Variable latency,
Poor info on size & speed of remote sources
 Adaptive execution
• Service integration
Invoking and composing web services
Query and update
Planning with incomplete information
© Daniel S. Weld, PLANET 2003 Tutorial on Data Integration
74
Download