New Principles for
Information Integration
Laura Haas, IBM Research
Renee Miller, Univ. of Toronto
Donald Kossmann, ETH
Martin Hentschel, ETH
Agenda
•Information integration challenges
•Early progress in execution engines
•Mappings: a big step forward in design
•Where are we now?
•Two principles for information integration in
the future
Information Integration
"What protocols were used for tumors
in similar locations, for patients in the
same age group, with the same
genetic background"
[Figure: heterogeneous sources to integrate. An annotated text snippet: "The Mayo Clinic reported today that the conclusion of their final study on tumors in the abdominal cavity shows increased risk of ascites through pressure on portal vessels," tagged with <Organization>, <Located-in>, and <Medical object>. Structured sources include a genomic table (Sample, Seq ID, Features), a clinical table (Pname, ID, Test, Notes) with rows such as Jane A | 13 | CT | Tumor and Bob D | 57 | Ekg | Arhyth, and a record with Location, Feature, Notes ("…with tumor").]
Challenges
• Independent schemas
–No global schema
• Diverse data models
–XML, other semi-structured, relational, text, multimedia, …
• Overlapping, incomplete and often inconsistent data
–Must identify co-referent instances
–Tolerate or repair incompleteness and inconsistencies
• Different tasks
–Create a big picture (e.g., master data management)
•Patients, prescriptions, assays, etc.
–Gather information for a specific need (e.g., analytics)
• Needs and knowledge that change over time
–Integrate with only partial knowledge of schema and data
–Extend as needed, or as more becomes known
• Multi-step process requiring multiple tools (today)
–Choice of integration engine is binding
–Limited iteration or re-use
Prehistory 1: Data Warehousing
Materialization, Data Exchange, ETL (Extract, Transform, Load), Eager Integration
[Figure: source databases (DBs) feed an ETL job (links, transformers, and aggregators mapping BANK_CLIENTS to BANK_CUSTOMERS) that loads a warehouse whose schema serves as the integrated or global schema; user queries run against the warehouse. Only some source data is relevant, but all relevant data is collected up front.]
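To make the eager style concrete, here is a minimal Python sketch, assuming two toy clinic sources and an invented warehouse table; the names and schemas are illustrative, not from the talk.

# Eager integration (ETL) sketch: extract from two toy sources, transform to a
# common schema, and load into a warehouse table that queries then run against.
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE patients (name TEXT, policy TEXT, owed REAL)")

# Hypothetical source extracts (in practice, per-source connectors or queries).
clinic1 = [("James Kelly", "SF 45301", 280.0)]            # already (name, policy, owed)
clinic2 = [("Don", "Fergus", "SF 3352", 250.0, 50.0)]     # (first, last, ins_no, charge, inspay)

def transform_clinic2(row):
    # Transform step: map the second source's shape onto the warehouse schema.
    first, last, ins_no, charge, inspay = row
    return (f"{first} {last}", ins_no, charge - inspay)

# Load step: the integrated copy is materialized before any query arrives.
warehouse.executemany("INSERT INTO patients VALUES (?, ?, ?)", clinic1)
warehouse.executemany("INSERT INTO patients VALUES (?, ?, ?)",
                      [transform_clinic2(r) for r in clinic2])

print(warehouse.execute("SELECT * FROM patients WHERE owed >= 50").fetchall())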
Prehistory 2: Data Federation
Mediation, Virtual Integration, Data Integration, Lazy Integration
Example systems: Garlic, Tsimmis, Information Manifold, Disco, Pegasus, etc.
[Figure: user queries are posed against an integrated or global (mediated) schema; a federation engine or mediator (reformulation, optimization, and execution engines) uses metadata and per-source wrappers to reach the sources, e.g., a city database, a county database, a public data server, and an outside website.]
Good Survey: Lenzerini PODS 2002
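By contrast, a minimal sketch of the lazy style over the same toy sources: nothing is copied in advance, and the mediator reformulates each query and pushes it to per-source wrappers at run time (the wrapper interface is invented for illustration).

# Lazy integration (federation) sketch: a mediator reformulates one query over
# a global schema into per-source requests via wrappers and unions the results.
class Clinic1Wrapper:
    rows = [("James Kelly", "SF 45301", 280.0)]
    def patients(self):
        # Wrapper exposes this source's data in the global schema.
        return [{"name": n, "policy": p, "owed": b} for n, p, b in self.rows]

class Clinic2Wrapper:
    rows = [("Don", "Fergus", "SF 3352", 250.0, 50.0)]
    def patients(self):
        return [{"name": f"{f} {l}", "policy": i, "owed": c - ip}
                for f, l, i, c, ip in self.rows]

def mediator_query(wrappers, min_owed):
    # Reformulation and execution happen per query; the sources stay authoritative.
    return [r for w in wrappers for r in w.patients() if r["owed"] >= min_owed]

print(mediator_query([Clinic1Wrapper(), Clinic2Wrapper()], 50))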
Creating an Integrated Schema
[Figure: a query Q is posed over a target schema T; a mapping relates source schema S to T, and data conforms to each schema.]
Mapping Creation Problem: create a mapping between independently designed (legacy) schemas.
"Code" Generation Problem: use the mapping to produce an executable for integrating data.
Schema Mapping Creation
• Leverage attribute correspondences
– Manually entered or
– Automatically discovered
• Preserve data semantics
– Discover data associations
– Leverage nesting
– Leverage constraints
– Mine for approximate constraints
– Use workload
• Model incompleteness
– Generate new values for data exchange
• Produce correct grouping
Clio
Mapping Specification
• What do we need?
– Source query Qs
• Join of company and grant
– Can be used to populate the target
– Target query Qt
• Join of organization (with nested fundings) and financial
• Mapping is then a relationship
– Qs → Qt
– Qs isa Qt
• This form of mapping has proven to be central to data integration and exchange
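A hedged sketch of this view of a mapping, using Python dictionaries for the company/grant and organization/financial example; the relation and field names are illustrative, and populate_target stands in for the generated transformation code.

# Mapping-as-queries sketch: Qs is a join of company and grant in the source;
# its result populates the target concept that Qt (organization with nested
# fundings, joined with financial) describes.
company = [{"cid": 1, "cname": "Acme"}]
grant   = [{"gid": 7, "cid": 1, "amount": 50000}]

def Qs():
    # Source query: join of company and grant.
    return [{"cname": c["cname"], "amount": g["amount"]}
            for c in company for g in grant if c["cid"] == g["cid"]]

def populate_target(qs_rows):
    # One executable reading of the mapping Qs -> Qt: nest fundings under each
    # organization and attach a financial record to each funding.
    orgs = {}
    for r in qs_rows:
        org = orgs.setdefault(r["cname"], {"name": r["cname"], "fundings": []})
        org["fundings"].append({"financial": {"amount": r["amount"]}})
    return list(orgs.values())

print(populate_target(Qs()))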
Clio’s Key Contributions
• The definition of non-procedural schema mappings to describe the
relationship between data in heterogeneous schemas
– Different levels of granularity: from single concepts (for web publishing) to full
schemas (for enterprise integration)
• A new paradigm in which mapping creation is viewed as query discovery
– Queries (Qs and Qt) represent “concepts” that can be related in two schemas
• Algorithms for automatically generating code for data transformation from
the mappings
– SQL, XQuery, XSLT transforms, ETL scripts, etc.
• Clio began in 1999
– First publication in VLDB 2000 [Miller, Haas, Hernández, VLDB00]
– Ten year retrospective on Clio appears in:
Conceptual Modeling: Foundations and Applications, Essays in Honor of John
Mylopoulos, Springer Festschrift, LNCS 5600, 2009.
Where We Are Today
• Enterprise integration is a hard problem
–Bridges data islands
–Deals with mission-critical data
•Integrations need to be “right”
•Willing to invest in code, expertise to get there
• Great progress has been made
–Two execution methods developed
•Data warehousing, i.e., eager integration
•Data federation, i.e., lazy integration
–Big steps forward in integration design
•Nonprocedural schema mappings leverage data semantics, aid
eager and lazy integration
•Semi-automated tools for other aspects, e.g., matching, cleaning
• Total Cost of Ownership is still high
–Months to years to build a warehouse
–Weeks to months to build and tune a federation
–Deal with relatively small number of sources
The Changing Landscape
New Challenges for Integration
• Dynamic
–New sources of information every day
–Data flowing constantly
–New uses of information arise constantly
–So many schemas, it’s practically schema-less
• De-centralized
–Few sources of “complete truth”
–Many sources have info on overlapping entities
–Sources may be totally independent
• Every application, every person needs integration
–Not only big applications central to an enterprise
–Total cost of ownership needs to shrink
The Downside of Progress?
• Distinct integration engines create complexity
–Engines do one or the other, lazy or eager
–Must choose the style of integration early on
• Multiple design tools needed to resolve different types of
heterogeneity
–Some tools reconcile instances from the different sources
–Others reconcile different schemas
• Too much choice, too early
–The “best” way to integrate may change over time
–Must be able to build integrations incrementally
• Can we take another giant step?
–Create an engine that delivers the best of both worlds
–Resolve schema & instance-level differences in one tool
–Allow more flexible integrations that evolve over time
Distinct Engines Create Issues
• Major implications for performance, cost, infrastructure
• Good decision requires understanding both data and application(s)
• Would be useful to be able to switch styles of integration
• Change is expensive today: lost work, duplicate labor costs
Principle 1: Integration Independence
• Applications should be independent of how, when,
and where information integration takes place
–Changing the integration method should not change the
application
–Results should always be delivered in the desired form
–Response times, availability, cost may vary
–Should quality (completeness, precision, …) also be
allowed to vary?
• Materialization and the timing of transformations
should be an optimization decision that is
transparent to applications
–Nonprocedural mappings and query language are key
–Generate code for the best engine for the requirements
–Or, a new engine that spans federated to materialized
–Need an “optimizer” that considers key desiderata
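A minimal sketch of what integration independence could look like from the application's side, assuming a single declarative mapping that can be executed either eagerly or lazily; the engine API and data are invented for illustration.

# One declarative mapping, two execution strategies: the application only calls
# query() and cannot tell whether results come from a materialized copy or are
# computed from the sources on demand.
SOURCES = {"clinic1": [("James Kelly", "SF 45301", 280.0)],
           "clinic2": [("Don Fergus", "SF 3352", 200.0)]}

def mapping():
    # Nonprocedural in spirit: yields target-schema records from every source.
    for rows in SOURCES.values():
        for name, policy, owed in rows:
            yield {"name": name, "policy": policy, "owed": owed}

class EagerEngine:
    def __init__(self):
        self.copy = list(mapping())            # materialize once, up front
    def query(self, pred):
        return [r for r in self.copy if pred(r)]

class LazyEngine:
    def query(self, pred):
        return [r for r in mapping() if pred(r)]   # recompute per query

for engine in (EagerEngine(), LazyEngine()):       # identical application code
    print(engine.query(lambda r: r["owed"] >= 50))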
Example: Optimizing an Integration?
[Figure: four placements of the integration layer for the same application, with illustrative response times and costs: 50 sec; 10 sec + $1; fully distributed, 2 sec + $1.50; fully materialized, 1 sec + $3.]
•Possible in limited cases with today’s technology
•Other degrees of freedom exist
•Other objective functions possible
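A toy sketch of the kind of "optimizer" Principle 1 calls for, choosing among integration plans by an objective over estimated latency and cost; the plan labels, the numbers (taken from the figure), and the weighting function are illustrative assumptions.

# Choose an integration plan by an objective over estimated latency and cost.
# Plan A and plan B stand for the two unlabeled configurations in the figure.
plans = [
    {"name": "plan A",             "latency_s": 50, "cost_usd": 0.00},
    {"name": "plan B",             "latency_s": 10, "cost_usd": 1.00},
    {"name": "fully distributed",  "latency_s": 2,  "cost_usd": 1.50},
    {"name": "fully materialized", "latency_s": 1,  "cost_usd": 3.00},
]

def objective(plan, latency_weight=0.1, cost_weight=1.0):
    # Lower is better; the weights say how many dollars a second is worth.
    return latency_weight * plan["latency_s"] + cost_weight * plan["cost_usd"]

best = min(plans, key=objective)
print(best["name"], round(objective(best), 2))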
Separate Resolution of Schema and
Data Heterogeneity Creates Issues
• Entity resolution resolves data (instance) heterogeneity
–Detection: based on similarity computation over data values
–Reconciliation: creates a single, consistent object (may re-define schema)
• Schema mapping resolves schema heterogeneity
–May leverage matching algorithms to suggest alignments
–Tells how to generate data in the target schema
–Few tools leverage data (some matching algorithms sample data)
• Schema mapping implicitly does entity resolution
–Generates joins and unions that combine instances and eliminate
duplicates
–Remaining unresolved instances show up as “bugs” in mapping or
require materializing result for (further) entity resolution
• Tools are separate, but functionality overlaps
–Confusing: which tool is needed? When to use each?
–Inefficient: neither tool uses full knowledge of both data and schema
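A minimal entity-resolution sketch in Python: detection by value similarity, then reconciliation into one object that keeps information from both records; the threshold and fields are illustrative.

# Entity-resolution sketch: detect co-referent records by value similarity,
# then reconcile them into a single object.
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

r1 = {"name": "Don Fergus",  "policy": "SF 3352"}
r2 = {"name": "Don  Fergus", "policy": "SF-3352", "charge": 250}

def same_entity(x, y, threshold=0.85):
    # Detection: similarity computation over data values.
    return similar(x["name"], y["name"]) >= threshold

def reconcile(x, y):
    # Reconciliation: one consistent object keeping information from both.
    merged = dict(y)
    merged.update({k: v for k, v in x.items() if v})
    return merged

if same_entity(r1, r2):
    print(reconcile(r1, r2))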
Example: Instances Aid Mapping?
Schema 1:  Condition   | Intervention | Reference | …
           Thalassemia | Hydroxyurea  | 14988152

Schema 2:  Ziekte (disease) | Verwijzing (reference) | Behandeling (treatment) | …
           Thalassaemia     | Pubmed 14988152        | Voorschrift (prescription): Hydroxyurea

Target Schema:  Diagnosis | Therapy | Literature
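A hedged sketch of how instance values could suggest attribute correspondences here: comparing values across the two schemas reveals that Verwijzing lines up with Reference and Behandeling with Intervention, which the column names alone do not show. The scoring is a naive illustration, not a production matcher.

# Sketch: score attribute pairs across the two schemas by instance-value
# similarity, suggesting correspondences that the column names do not reveal.
from difflib import SequenceMatcher

schema1_rows = [{"Condition": "Thalassemia", "Intervention": "Hydroxyurea",
                 "Reference": "14988152"}]
schema2_rows = [{"Ziekte": "Thalassaemia", "Verwijzing": "Pubmed 14988152",
                 "Behandeling": "Voorschrift: Hydroxyurea"}]

def suggest_correspondences(rows1, rows2, threshold=0.6):
    suggestions = []
    for r1 in rows1:
        for r2 in rows2:
            for a1, v1 in r1.items():
                for a2, v2 in r2.items():
                    score = SequenceMatcher(None, v1.lower(), v2.lower()).ratio()
                    if score >= threshold:
                        suggestions.append((round(score, 2), a1, a2))
    return sorted(suggestions, reverse=True)

# Finds Condition~Ziekte, Reference~Verwijzing, Intervention~Behandeling.
print(suggest_correspondences(schema1_rows, schema2_rows))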
Principle 2: Holistic Information Integration
• Integration by its nature is an iterative, incremental process
–Rarely get it right the first time
–Needs evolve over time
• Tools should support iteration across integration tasks
–From mapping to entity resolution and back
–From exploring data to mapping and back
–And so on
• Tools should be able to leverage information about both
schema and data to improve the integrated result
–Must be able to leverage partial results of an integration
–Data-driven integration debugging
New Principles Suggest a New Engine
• Two new principles
–Integration Independence
–Holistic Information Integration
• Current engines violate both principles
–Separate virtual from materialized (mostly)
–Separate data handling from schema handling (mostly)
• May need a new engine to embody these
principles
–Allow easy movement from virtual to materialized
–Allow easy movement from data to schema and back
Mapping Data to Queries: a next step?
• New technique for answering queries with mapping rules
–Flexibly integrates and explores heterogeneous data
•Availability of original data at all times
•New schemas and sources can be added incrementally
•Users add only the mapping rules they need
–“Schema optional” solution
–Scalability in the number of mapping rules and schemas
• Allows both schema-level and instance mapping rules
–Permits both traditional mapping and data-level tasks
–Schema-level rules support aliasing, (un)nesting, element
construction, transformations
–Instance rules merge one object into another, keeping all fields of
both objects
• Provides an infrastructure for experimentation
–Algorithms, process, etc. still needed
Hentschel et al., 2009
Leveraging MDQ: Schema Mapping
Clinic 1: TestCharge
  Name               | Policy   | Covered | BillOut
  @PJK1: James Kelly | SF 45301 | $300    | $280

Clinic 2: Visit
  First | Last   | Charge | InsPay | Ins #
  @PDF8: Don  | Fergus | $250 | $50 | SF 3352
  @PJP3: Jose | Pozole | $240 | $60 | ALS 65349

Rules:
TestCharge -> PatientVisit
Visit -> PatientVisit
BillOut -> Owed
(Charge – InsPay) -> Owed

Sample Query:
//PatientVisit [Owed >= 50]

[Figure also shows two further records: @BDF (Don, Fergus, $580, $500, SF 3352) and @BJP (Jose, Poloze, $420, $350, FR 13299).]
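A minimal sketch of evaluating the sample query with these schema-level rules, using Python dictionaries for the records; the rule interpretation (aliasing TestCharge and Visit to PatientVisit, computing Owed) is an illustrative reading, not the actual MDQ engine.

# Evaluate //PatientVisit[Owed >= 50] over the original records plus rules:
# TestCharge and Visit are aliased to PatientVisit, and Owed is BillOut for
# Clinic 1 and (Charge - InsPay) for Clinic 2.
records = [
    {"id": "@PJK1", "type": "TestCharge", "Name": "James Kelly",
     "Policy": "SF 45301", "Covered": 300, "BillOut": 280},
    {"id": "@PDF8", "type": "Visit", "First": "Don", "Last": "Fergus",
     "Charge": 250, "InsPay": 50, "Ins#": "SF 3352"},
    {"id": "@PJP3", "type": "Visit", "First": "Jose", "Last": "Pozole",
     "Charge": 240, "InsPay": 60, "Ins#": "ALS 65349"},
]

type_rules = {"TestCharge": "PatientVisit", "Visit": "PatientVisit"}
owed_rules = {"TestCharge": lambda r: r["BillOut"],
              "Visit":      lambda r: r["Charge"] - r["InsPay"]}

def patient_visits(min_owed=50):
    # Queries see PatientVisit and Owed; the sources keep their original shape.
    for r in records:
        if type_rules.get(r["type"]) == "PatientVisit":
            owed = owed_rules[r["type"]](r)
            if owed >= min_owed:
                yield {**r, "Owed": owed}

print(list(patient_visits()))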
Leveraging MDQ: Instance Mapping
Clinic 1: TestCharge
  Name               | Policy   | Covered | BillOut
  @PJK1: James Kelly | SF 45301 | $300    | $280

Clinic 2: Visit
  First | Last   | Charge | InsPay | Ins #
  @PDF8: Don  | Fergus | $250 | $50 | SF 3352
  @PJP3: Jose | Pozole | $240 | $60 | ALS 65349

Rules:
TestCharge -> PatientVisit
Visit -> PatientVisit
BillOut -> Owed
(Charge – InsPay) -> Owed
@BDF <- @PDF8

Sample Query:
//PatientVisit [Owed >= 50]

[Figure: the instance rule merges record @BDF (Don, Fergus, $580, $500, SF 3352) into @PDF8, yielding a single Don Fergus object @BDF @PDF8 that keeps the fields of both; @BJP (Jose, Poloze, $420, $350, FR 13299) remains separate.]
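A minimal sketch of the instance-level rule @BDF <- @PDF8: the two objects are treated as the same patient and merged into one object keeping the fields of both. The field names used for @BDF below are assumptions, since the slide does not show its schema.

# Instance-mapping sketch: the rule @BDF <- @PDF8 says the two records denote
# the same patient; merging keeps all fields of both objects.
objects = {
    "@PDF8": {"First": "Don", "Last": "Fergus", "Charge": 250,
              "InsPay": 50, "Ins#": "SF 3352"},
    "@BDF":  {"Name": "Don Fergus", "Total": 580, "Paid": 500,   # assumed field names
              "Ins#": "SF 3352"},
}
instance_rules = [("@BDF", "@PDF8")]   # read as: merge @PDF8 into @BDF

def apply_instance_rules(objs, rules):
    merged = dict(objs)
    for target, source in rules:
        combined = dict(merged.pop(source))         # start from the source object
        for field, value in merged[target].items():
            combined.setdefault(field, value)       # keep fields of both objects
        merged[target] = combined
    return merged

print(apply_instance_rules(objects, instance_rules))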
Using MDQ for Data Conversion
Clinic 1: TestCharge (Name, Policy, Covered, BillOut)
Clinic 2: Visit (First, Last, Charge, InsPay, Ins#)

New Schema Mapping Rules:
TestCharge as $t ->
  <PatientVisit>
    <Name> $t.Name </Name>
    <Owed> $t.BillOut </Owed>
    <Policy> $t.Policy </Policy>
  </PatientVisit>
Visit as $v ->
  <PatientVisit>
    <Name> $v.First || $v.Last </Name>
    <Owed> $v.BillOut + ($v.Charge – $v.InsPay) </Owed>
    <Policy> $v.Ins# </Policy>
  </PatientVisit>

Instance Mapping Rule: @BDF <- @PDF8

Sample Query:
//PatientVisit [Owed >= 50]

[Figure also shows the resulting records: @PJK1 James Kelly, $280, SF 45301; @BDF @PDF8 Don Fergus, $130, SF 3352; @PJP3 Jose Pozole, $60, ALS 65349; @BJP Jose Poloze, $70, FR 13299.]
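A minimal sketch of applying an element-construction rule like the Visit rule above to emit target-schema XML, using Python's standard library; this is an illustrative reading of the rule, not the MDQ implementation.

# Data-conversion sketch: apply the Visit construction rule to one record and
# emit target-schema XML. BillOut is only present on merged instances, so it
# defaults to 0 here; that default is an assumption.
import xml.etree.ElementTree as ET

visit = {"First": "Jose", "Last": "Pozole", "Charge": 240,
         "InsPay": 60, "Ins#": "ALS 65349"}

def visit_to_patient_visit(v):
    pv = ET.Element("PatientVisit")
    ET.SubElement(pv, "Name").text = f"{v['First']} {v['Last']}"       # First || Last
    ET.SubElement(pv, "Owed").text = str(v.get("BillOut", 0)
                                         + v["Charge"] - v["InsPay"])  # Owed rule
    ET.SubElement(pv, "Policy").text = v["Ins#"]                       # Ins# -> Policy
    return pv

print(ET.tostring(visit_to_patient_visit(visit), encoding="unicode"))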
Summary
• Significant progress in information integration research
–Theory, algorithms, systems
–Federation/virtual/lazy and warehousing/materialized/eager
–Schema mappings feed both
• Significant challenges remain
–Too many incompatible tools and engines – too much choice
–The dynamic, decentralized new world demands new solutions
• Integration independence
–Protect applications from how the data is put together
–Make federation vs materialization an optimization decision
• Holistic Information Integration
–Schema and data-level tasks in the same framework
–Support iteration through tasks and improve integration
• Much fascinating work ahead
–Engines that can provide true integration independence
–Tools that can take advantage of holistic information integration
–Can “Mapping Data to Queries” help us reach our goals?