Data Integration:
The Teenage Years
Alon Halevy (Google)
Anand Rajaraman (Kosmix)
Joann Ordille (Avaya)
VLDB 2006
Agenda
• A few perspectives on the last 10 years
– Technical, commercial
• Perspectives from our personal paths
• Wild speculations about the future
• This is not a survey on data integration
(See the paper in the proceedings for another
non-survey)
Acknowledgements
Other members of the Information Manifold
Project:
– Jaewoo Kang (NCSU, Korea Univ.)
– Divesh Srivastava (AT&T Labs)
– Shuky Sagiv (Hebrew U.)
– Tom Kirk
Acknowledgements
To the SIGMOD 1996 Program committee
For rejecting the earlier version of the paper.
Timeline
95 96 97 98 99 00 01 02 03 04 05 06
Data Integration
[Figure: data integration across enterprise databases, legacy databases, and services and applications. Example schema entities: Phenotype, Gene, Sequenceable Entity, Protein, Structured Vocabulary, Nucleotide Sequence, Experiment, Microarray Experiment]
The Information Manifold
• Goal: integrate data from multiple sources
on the web:
Find the Woody Allen movies playing in
my area, and their reviews
• Need to describe the data sources:
– Contents, constraints, access patterns
[Figure: system architecture spanning design time and run time: mediated schema, semantic mappings, query reformulation, optimization & execution, and one wrapper per source (five shown)]
Semantic Mappings
[a.k.a. Source Descriptions]
Mediated Schema
CD: ASIN, Title, Genre,…
Artist: ASIN, name, …
logic
Sources:
CDs: Album, ASIN, Price, DiscountPrice, Studio
Books: Title, ISBN, Price, DiscountPrice, Edition
Authors: ISBN, FirstName, LastName
Artists: ASIN, ArtistName, GroupName
CDCategories: ASIN, Category
BookCategories: ISBN, Category
Global-as-View (GAV)
Mapping:
CD(A,T,G) :- R1(A,T,G)
CD(A,T,G) :- R2(A,T), R3(T,G)
Mediated Schema
CD: ASIN, Title, Genre,…
Artist: ASIN, name, …
Sources: R1, R2, R3, R4, R5
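A minimal sketch of how the GAV mapping above could be evaluated. The source contents are made-up sample data and `cd_gav` is a hypothetical helper name, not something from the talk. One rule copies R1; the other joins R2 and R3 on the title.

```python
# Hypothetical sample contents for sources R1, R2, R3 (not from the talk).
R1 = {("a1", "Title 1", "Jazz")}
R2 = {("a2", "Title 2"), ("a3", "Title 3")}
R3 = {("Title 2", "Rock")}


def cd_gav(r1, r2, r3):
    """Evaluate the two GAV rules for the mediated relation CD(A, T, G).

    CD(A,T,G) :- R1(A,T,G)
    CD(A,T,G) :- R2(A,T), R3(T,G)
    """
    from_r1 = set(r1)
    # Join R2 and R3 on the shared title attribute T.
    from_join = {(a, t, g) for (a, t) in r2 for (t2, g) in r3 if t == t2}
    return from_r1 | from_join


cd = cd_gav(R1, R2, R3)
print(sorted(cd))
```

Since the mediated relation is defined as a query over the sources, a query over CD can be answered by unfolding these definitions. The tuple for "a3" is dropped here because R3 holds no genre for its title.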
Local-as-View (LAV)
Mapping:
R1(A,T,G) :- CD(A,T,G,Y), Artist(A,N), Y < 1970
R2(A,T) :- CD(A,T,"French",Y)
Mediated Schema
CD: ASIN, Title, Genre, Year
Artist: ASIN, Name, …
Sources: R1, R2, R3, R4, R5
Query Answering in LAV =
Answering queries using views
Given a set of views V1,…,Vn,
And a query Q,
Can we answer Q using only the answers to
V1,…,Vn?
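To make the question concrete, here is a small sketch using the two LAV views R1 and R2 from the previous slide. The mediated instance, the source contents, and all function names are hypothetical; a real system never materializes the mediated schema. The sketch rewrites the query Q(A,T) :- CD(A,T,"French",Y) as a union over the views and checks that the rewriting returns only correct answers, though possibly not all of them.

```python
# Hypothetical mediated-schema instance, used only to check the rewriting.
CD = {
    ("a1", "Title 1", "French", 1965),
    ("a2", "Title 2", "French", 1999),
    ("a3", "Title 3", "Jazz", 1959),
    ("a4", "Title 4", "French", 1950),
}

# Assumed source contents. Under the open-world assumption each source
# holds a subset of the tuples its view definition would produce.
R1 = {("a3", "Title 3", "Jazz"), ("a4", "Title 4", "French")}  # pre-1970 CDs
R2 = {("a1", "Title 1")}                                       # French CDs


def query(cd):
    """Q(A,T) :- CD(A,T,"French",Y), evaluated on the mediated instance."""
    return {(a, t) for (a, t, g, y) in cd if g == "French"}


def rewriting(r1, r2):
    """Union of two rewritings that use only the views:

    Q'(A,T) :- R2(A,T)
    Q'(A,T) :- R1(A,T,"French")
    """
    return set(r2) | {(a, t) for (a, t, g) in r1 if g == "French"}


answers = rewriting(R1, R2)
print(sorted(answers))
```

On this data the rewriting finds "a1" and "a4" but misses "a2", which no source happens to contain. That is the sense in which a rewriting for data integration is contained in the query rather than equivalent to it.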
AQUV (I)
• [Larson et al., 85 & 87], [Tsatalos et al., 94],
[Chaudhuri et al., 95]
• Focus on AQUV for:
– Query optimization
– Supporting physical data independence
• Every commercial DBMS supports AQUV.
AQUV (II)
• AQUV for data integration:
– Find maximally contained rewriting
– Not necessarily equivalent rewriting
• Algorithms:
– Bucket algorithm [LRO, 96]
– Inverse rules [Duschka, 97]
– MiniCon [Pottinger and Halevy, 2000]
• Views and security: [Miklau and Suciu, 04]
Survey: Halevy, VLDB Journal, 2001
Some Subsequent Results
• Semantics of data integration:
– Abiteboul & Duschka, 1998: certain answers
– Open vs. closed world assumption
• CWA is bad complexity news!
Survey: Lenzerini, PODS 2002
Certain Answers
Mediated schema: Route (Origin, Destination)
Source 1: Origins: SF, NY
Source 2: Destinations: Seattle, Seoul
Query: Route (SF, Seattle)?
Possible databases:
– Database 1 (Origin, Destination): (SF, Seattle), (NY, Seoul)
– Database 2 (Origin, Destination): (SF, Seoul), (NY, Seattle)
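The example can be checked by brute force. The sketch below assumes, purely for illustration, that no cities exist beyond those listed and that each source lists exactly the origins (respectively destinations) of Route. It enumerates every Route relation consistent with both sources and intersects them. All names are hypothetical.

```python
from itertools import combinations

origins = {"SF", "NY"}                 # contents of Source 1
destinations = {"Seattle", "Seoul"}    # contents of Source 2
candidate_pairs = [(o, d) for o in sorted(origins) for d in sorted(destinations)]


def possible_databases():
    """All Route relations whose projections match both sources."""
    dbs = []
    for k in range(len(candidate_pairs) + 1):
        for subset in combinations(candidate_pairs, k):
            route = set(subset)
            if ({o for o, _ in route} == origins
                    and {d for _, d in route} == destinations):
                dbs.append(route)
    return dbs


dbs = possible_databases()
certain = set.intersection(*dbs)   # tuples present in every possible database
possible = set.union(*dbs)         # tuples present in at least one

print(len(dbs), certain, ("SF", "Seattle") in possible)
```

Route (SF, Seattle) holds in some possible databases but not in all of them, so it is a possible answer and not a certain one.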
Some Subsequent Results
• Limitations due to binding patterns
– Input title, get book info [Rajaraman et al., 95]
• Additional query processing capabilities
– Form applies multiple predicates
• Disjunction, negation in sources.
• Ordering sources, probabilistic mappings
– [Florescu et al., 97, Doan et al., Dong et al.]
• GLAV [Millstein et al., 99]
Survey: Lenzerini, PODS 2002
A word on Description Logics
• Selecting relevant sources = reasoning.
• Description logics to the rescue:
– [Catarci and Lenzerini, 93]
• Information Manifold
– Combined the Classic DL with Datalog
(CARIN)
– See AAAI-96 (not SIGMOD)
• Brought DL and DB closer together.
– A very active area of research today.
XML and Semi-structured Data
• Tsimmis: semi-structured data for
integration.
• XML: whetted the integration appetites
– We have the syntax
– Now just solve the silly semantics problems
– Don’t bother: we’ll all standardize on DTDs.
• XML will have a significant role in the data
integration industry and research.
Back in the Lab…
• Two observations:
– Who’s going to write all these LAV/GAV
formulas?
– This was the bottleneck.
• Once we have mappings, how can we
execute queries?
– Traditional plan-then-execute doesn’t work.
Semantic Mappings
Inventory Database A
BooksAndMusic: Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords
Inventory Database B
Books: Title, ISBN, Price, DiscountPrice, Edition
Authors: ISBN, FirstName, LastName
BookCategories: ISBN, Category
CDs: Album, ASIN, Price, DiscountPrice, Studio
CDCategories: ASIN, Category
Artists: ASIN, ArtistName, GroupName
“Standards are great, but there are too many of them.”
Techniques for Schema Mapping
[Survey by Rahm and Bernstein, VLDBJ 2001]
• Compare schema elements based on:
– Names (or n-grams)
– Data types and instances
– Text descriptions, integrity constraints
• Combine multiple techniques:
– [MOMIS, Cupid, LSD, COMA]
• Create mappings from matches
– [Clio @ IBM + Miller]
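A minimal sketch of the name-based comparison, using character n-grams and Jaccard overlap. The choice of trigrams, the Jaccard measure, and the function names are assumptions for illustration; the systems cited above combine many more signals. The element names come from the two inventory schemas shown earlier.

```python
def ngrams(name, n=3):
    """Set of lowercase character n-grams of a schema element name."""
    s = name.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def name_similarity(a, b, n=3):
    """Jaccard overlap of the n-gram sets of two element names."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def best_match(element, candidates):
    """Candidate whose name is most similar to the given element."""
    return max(candidates, key=lambda c: name_similarity(element, c))


books_b = ["Title", "ISBN", "Price", "DiscountPrice", "Edition"]
match = best_match("SuggestedPrice", books_b)
print(match, name_similarity("SuggestedPrice", match))
```

Names alone are a weak cue: "SuggestedPrice" scores close on both "Price" and "DiscountPrice", which is one reason the techniques above also look at data types, instances, and constraints.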
A Machine Learning Approach
[Doan et al., 2001, ACM Distinguished Dissertation 2003]
Mediated schema
• Many mapping tasks are repetitive
• Learn from previous experience:
– Build a classifier for every element of the
mediated schema.
– Many kinds of cues → meta-strategy learning
Matching Real-Estate Sources
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
Matches: address = location, price = listed-price, agent-phone = phone, description = comments
realestate.com data:
location | listed-price | phone | comments
Miami, FL | $250,000 | (305) 729 0831 | Fantastic house
Boston, MA | $110,000 | (617) 253 1429 | Great location
...
homes.com data:
price | contact-phone | extra-info
$550,000 | (278) 345 7215 | Beautiful yard
$320,000 | (617) 335 2315 | Great beach
...
Learned hypotheses:
– If “phone” occurs in the name => agent-phone
– If “fantastic” & “great” occur frequently in data values => description
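The two learned hypotheses can be written as toy "learners" and applied to the homes.com columns. This is only a sketch: the 0.5 frequency threshold, the first-vote-wins combination step, and all function names are assumptions, and a real meta-learner would weigh the learners' confidence scores instead.

```python
def name_learner(column_name):
    """Hypothesis 1: 'phone' in the column name suggests agent-phone."""
    return "agent-phone" if "phone" in column_name.lower() else None


def content_learner(values, cue_words=("fantastic", "great"), threshold=0.5):
    """Hypothesis 2: cue words frequent in the data values suggest description."""
    if not values:
        return None
    hits = sum(any(w in v.lower() for w in cue_words) for v in values)
    return "description" if hits / len(values) >= threshold else None


def predict(column_name, values):
    """Naive combination: return the first learner's non-empty vote."""
    for vote in (name_learner(column_name), content_learner(values)):
        if vote is not None:
            return vote
    return None


homes = {
    "price": ["$550,000", "$320,000"],
    "contact-phone": ["(278) 345 7215", "(617) 335 2315"],
    "extra-info": ["Beautiful yard", "Great beach"],
}
predictions = {col: predict(col, vals) for col, vals in homes.items()}
print(predictions)
```

Neither hypothesis fires on the price column, which shows why one classifier per mediated-schema element and several kinds of cues are needed.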
Reference Reconciliation
To Join or not to Join?
• Many ways to refer to the same object in
the world:
– “IBM”, “International Business Machines”
– Alon Levy, Alon Halevy
• Automated methods are a necessity
– Can’t go through all the data manually
• Very active area in ML, KDD, DB, UAI, …
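A minimal sketch of an automated check for the two examples above, combining an acronym test with string similarity from Python's standard `difflib`. The 0.85 threshold and the function names are assumptions chosen for illustration; practical methods also use context such as co-authors, addresses, or other attributes.

```python
from difflib import SequenceMatcher


def acronym(name):
    """Upper-case initials of a multi-word name."""
    return "".join(word[0] for word in name.split()).upper()


def same_entity(a, b, threshold=0.85):
    """Guess whether two strings refer to the same real-world object."""
    a_norm, b_norm = a.strip().lower(), b.strip().lower()
    if a_norm == b_norm:
        return True
    # One string may be the acronym of the other.
    if a.strip().upper() == acronym(b) or b.strip().upper() == acronym(a):
        return True
    # Otherwise fall back to character-level similarity.
    return SequenceMatcher(None, a_norm, b_norm).ratio() >= threshold


print(same_entity("IBM", "International Business Machines"))
print(same_entity("Alon Levy", "Alon Halevy"))
print(same_entity("IBM", "Avaya"))
```

A fixed threshold will produce both false matches and missed matches on real data, which is part of why the area remains active.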
Query Processing
To Plan or to Execute?
• In addition to distributed query processing issues:
– Few statistics, if any.
– Network behavior issues: latency, burstiness,…
– Garlic @IBM
• “Adaptive query processing”:
– Stonebraker saw it coming in Ingres.
– Revivals by Graefe (1993) and DeWitt (1998).
– Query scrambling [Urhan & Franklin]
– Eddies [Avnur & Hellerstein]
– Convergent query processing [Ives et al.]
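A heavily simplified sketch of the adaptive idea, loosely in the spirit of eddies: with no statistics up front, each tuple is routed through the predicates in an order chosen from pass rates observed so far. This is not the algorithm of any paper cited above. It handles only selection predicates, and the class name, the 0.5 prior for unseen predicates, and the sample data are all assumptions.

```python
class AdaptiveFilterRouter:
    """Routes each tuple through predicates, most selective (so far) first."""

    def __init__(self, predicates):
        self.predicates = dict(predicates)
        self.seen = {name: 0 for name in self.predicates}
        self.passed = {name: 0 for name in self.predicates}

    def _pass_rate(self, name):
        # Unseen predicates get a neutral prior (an arbitrary choice).
        if self.seen[name] == 0:
            return 0.5
        return self.passed[name] / self.seen[name]

    def route(self, tup):
        # Re-derive the order for every tuple from the current statistics.
        for name in sorted(self.predicates, key=self._pass_rate):
            self.seen[name] += 1
            if not self.predicates[name](tup):
                return False
            self.passed[name] += 1
        return True

    def run(self, tuples):
        return [t for t in tuples if self.route(t)]


router = AdaptiveFilterRouter({
    "even": lambda x: x % 2 == 0,
    "big": lambda x: x >= 90,
})
result = router.run(range(100))
print(result, router.seen)
```

After the first tuple the router learns that "big" rejects almost everything and evaluates it first, so "even" is evaluated far less often than it would be in a fixed plan that happened to put it first.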
Commercialization
• Late 90’s – anything goes.
• Want money from VC’s?
– Say “XML” 3 times loud and clear.
• Academia at the forefront:
– Nimble (UW), Cohera (Berkeley), Enosys
(UCSD),…
• Big companies took notice
– Some faster than others
Commercialization Retrospective
[See Panel-of-Experts, SIGMOD 05]
• Uphill battle vs. the warehousing folks
– Virtual integration was more “pay-as-you-go”
• Another battle with the EAI folks
– Should really be a symbiosis there.
• Go vertical or horizontal?
– Obvious: go vertical if you can find the right
one.
• The technology worked
– But it’s all in the timing…
After $30M…
[Figure: Nimble architecture. Front end: user applications, Lens™ File, InfoBrowser™, Software Developers Kit, Lens Builder™, NIMBLE™ APIs. Integration layer: XML Query, Nimble Integration Engine™ (compiler, executor), cache, metadata server, Common XML View. Management tools: Integration Builder, Concordance Developer, Data Administrator, Security Tools. Sources: relational, data warehouse/mart, legacy, flat file, web pages, XML]
NASDAQ
So… Back in the Lab
• Model management
• Peer data management systems
• Data exchange
Model Management
[Bernstein et al.]
• Generic infrastructure for managing
schemas and mappings:
– Manipulate models and mappings as bulk
objects
– Operators to create & compose mappings,
merge & diff models
– Short operator scripts can solve schema
integration, schema evolution, reverse
engineering, etc.
• First challenge: semantics of operators.
Peer Data Management Systems
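[Figure: a network of peers (UW (Wisconsin), Stanford, Berkeley, UW (Washington), DBLP, UW (Waterloo), CiteSeer) connected by LAV and GLAV mappings; a query Q is reformulated into queries Q1 to Q6 over other peers]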
PDMS-Related Projects
• Piazza (Washington)
• Hyperion (Toronto)
• PeerDB (Singapore)
• Local relational models (Trento, Toronto)
• Active XML (INRIA)
• Edutella (Hannover, Germany)
• Semantic Gossiping (EPFL Lausanne)
• Raccoon (UC Irvine)
• Orchestra (U. Penn)
PDMS Challenges
• Semantics:
– Careful about cycles
• Optimization:
– Compose mappings
– Prune paths
• Manage networks:
– Consistency
– Quality
– Caching
[Figure: peer network of UW (Wisconsin), Stanford, Berkeley, UW (Washington), DBLP, UW (Waterloo), CiteSeer]
Data Exchange
[Figure: schema S, mapping M, schema T]
• Key question: given an instance of S and a
mapping, create an instance for T.
• [Fagin, Kolaitis, Popa & Tan]
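A small sketch of the key question, under stated assumptions. The mapping used here, CDs(Album, ASIN) → ∃G. CD(ASIN, Album, G), is hypothetical and only reuses relation names from earlier slides. Target values that the source does not determine are filled with fresh labeled nulls, a common device in the data exchange literature. The class and function names are made up for illustration.

```python
from itertools import count


class LabeledNull:
    """Placeholder for a value that exists but is unknown."""

    def __init__(self, ident):
        self.ident = ident

    def __repr__(self):
        return f"N{self.ident}"


def exchange(source_cds):
    """Build a target instance of CD(ASIN, Title, Genre) from CDs(Album, ASIN).

    Hypothetical mapping: CDs(album, asin) -> exists G. CD(asin, album, G)
    """
    fresh = count(1)
    target = []
    for album, asin in source_cds:
        # Genre is not determined by the source, so invent a fresh null.
        target.append((asin, album, LabeledNull(next(fresh))))
    return target


source = [("Album 1", "a1"), ("Album 2", "a2")]   # made-up source instance of S
target = exchange(source)
print(target)
```

Each source tuple gets its own null because nothing in the mapping says the two unknown genres are equal.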
95 96 97 98 99 00 01 02 03 04 05 06 ?
2006 Status Report
[The People Angle]
• Joann @ Avaya
– Integrating communications into business
processes
• Anand @ Kosmix
– Creating a new kind of search company
• Alon @ Google
– Working for Joann’s old boss
– Deep web evangelist
– Pondering data management for the masses
2006 Status Report
[Enterprise Angle]
• Enterprise Information Integration is
established:
– IBM, BEA, Oracle, MetaMatrix, Composite,
Actuate, …
• Impact on design tools:
– IBM Rational Data Architect
– ADO .NET v. 3
Forrester Says…
"Enterprises are facing the growing
challenges of
using disparate sources of data managed
by different applications, including problems with data
integration, security, performance, availability and
quality.... New technology is emerging that Forrester has
coined "information fabric," a term defined as a
virtualized data layer that integrates
heterogeneous data and content repositories in real
time.... The potential benefits of this technology are so
great that enterprises should develop a strategy to
leverage information fabric technology as it becomes
more widely available."
2006 Status Report
[Web Angle]
• Vertical search engines: one domain
• At scale: need even better source
descriptions
– deep web can be surfaced
• Terminology: Data integration = mashups!
Wikipedia:
A mashup is a website or Web 2.0
application that uses content from more
than one source to create a completely
new service. This is akin to transclusion.
Looking Ahead
• Data management: from the enterprise to the
masses
• Challenges:
– Databases of everything
– Need support for collaboration
– Help people structure their data
– Pay-as-you-go data management
Pay-as-you-go Data Management
Dataspaces: Franklin, Halevy, Maier [see PODS 2006]
[Figure: benefit vs. investment (time, cost), comparing dataspaces with data integration solutions. Artist: Mike Franklin]
Big Carrots
Reusing Human Attention
• Principle:
– User action = statement of semantic relationship
– Leverage actions to infer other semantic relationships
• Examples
– Providing a semantic mapping
• Infer other mappings
– Writing a query
• Infer content of sources, relationships between sources
– Creating a “digital workspace”
• Infer “relatedness” of documents/sources
• Infer co-reference between objects in the dataspace
– Annotating, cutting & pasting, browsing among docs
Conclusion
• We’ve done extremely well as a community!
• Next challenge: data management and
integration tools for the masses