New Principles for Information Integration
Laura Haas, IBM Research; Renée Miller, Univ. of Toronto; Donald Kossmann, ETH; Martin Hentschel, ETH

Agenda
• Information integration challenges
• Early progress in execution engines
• Mappings: a big step forward in design
• Where are we now?
• Two principles for information integration in the future

Information Integration
"What protocols were used for tumors in similar locations, for patients in the same age group, with the same genetic background?"
[Figure: heterogeneous sources that must be combined to answer the question. Annotated text: "The Mayo Clinic <Organization> reported today that the conclusion of their final study on tumors in the abdominal cavity <Located-in> shows increased risk of ascites through pressure on portal vessels <Medical object>." Structured tables: a sample table (Sample, Seq ID, Features); a patient table (Pname, ID, Test, Notes), e.g., Jane, A 13, CT, Tumor and Bob, D 57, EKG, Arrhythmia; and a feature table (Location, Feature, Notes), e.g., "…with tumor".]

Challenges
• Independent schemas
 – No global schema
• Diverse data models
 – XML, other semi-structured, relational, text, multimedia, …
• Overlapping, incomplete, and often inconsistent data
 – Must identify co-referent instances
 – Tolerate or repair incompleteness and inconsistencies
• Different tasks
 – Create a big picture (e.g., master data management)
  • Patients, prescriptions, assays, etc.
 – Gather information for a specific need (e.g., analytics)
• Needs and knowledge that change over time
 – Integrate with only partial knowledge of schema and data
 – Extend as needed, or as more becomes known
• Multi-step process requiring multiple tools (today)
 – Choice of integration engine is binding
 – Limited iteration or re-use

Prehistory 1: Data Warehousing
Materialization, Data Exchange, ETL (Extract, Transform, Load), Eager Integration
[Figure: an ETL job (with stages such as Transformer_2 and Aggregator_5, linked by DSLinks between BANK_CLIENTS and BANK_CUSTOMERS) moves data from source databases (DBs) into a warehouse with an integrated or global schema; user queries run against the warehouse. Only some source data is relevant; all relevant data is collected up front.]

Prehistory 2: Data Federation
Mediation, Virtual Integration, Data Integration, Lazy Integration
• Example systems: Garlic, TSIMMIS, Information Manifold, Disco, Pegasus, etc.
[Figure: a user query runs against a mediated (integrated or global) schema; a federation engine or mediator, comprising a reformulation engine, an optimization engine, and an execution engine, uses metadata to send subqueries through wrappers to sources such as a city database, a county database, a public data server, and an outside website.]
• Good survey: Lenzerini, PODS 2002

Creating an Integrated Schema
[Figure: data "conforms to" a source schema S and a target schema T; a mapping relates S and T, and a query Q runs over the target.]
• Mapping Creation Problem: create a mapping between independently designed (legacy) schemas
• "Code" Generation Problem: use the mapping to produce an executable for integrating data

Schema Mapping Creation
• Leverage attribute correspondences
 – Manually entered or
 – Automatically discovered
• Preserve data semantics
• Model incompleteness
• Produce correct grouping
 – Discover data associations
 – Leverage nesting
 – Leverage constraints
 – Mine for approximate constraints
 – Use workload
• Generate new values for data exchange

Clio Mapping Specification
• What do we need?
 – Source query Qs
  • Join of company and grant
  • Can be used to populate the target
 – Target query Qt
  • Join of organization (with nested fundings) and financial
• Mapping is then a relationship between the two
 – Qs ⇒ Qt
 – Qs isa Qt
• This form of mapping has proven to be central to data integration and exchange

Clio's Key Contributions
• The definition of non-procedural schema mappings to describe the relationship between data in heterogeneous schemas
 – Different levels of granularity: from single concepts (for web publishing) to full schemas (for enterprise integration)
• A new paradigm in which mapping creation is viewed as query discovery
 – Queries (Qs and Qt) represent "concepts" that can be related in two schemas
• Algorithms for automatically generating code for data transformation from the mappings
 – SQL, XQuery, XSLT transforms, ETL scripts, etc.
• Clio began in 1999
 – First publication in VLDB 2000 [Miller, Haas, Hernández, VLDB00]
 – A ten-year retrospective on Clio appears in Conceptual Modeling: Foundations and Applications, Essays in Honor of John Mylopoulos, Springer Festschrift, LNCS 5600, 2009
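The idea of a mapping as a relationship between a source query and a target query can be made concrete with a small sketch. This is not Clio itself: the company/grant data below is hypothetical, and the Python function stands in for the executable transformation that Clio's code-generation step would produce (as SQL, XQuery, etc.).

```python
# Sketch of a Clio-style mapping: a source query Qs (join of company and
# grant) populates a target query Qt (organizations with nested fundings).
# Table contents are hypothetical, for illustration only.

company = [{"cid": 1, "cname": "Acme"}]
grant = [{"gid": 10, "cid": 1, "amount": 50000},
         {"gid": 11, "cid": 1, "amount": 20000}]

def source_query():
    """Qs: join of company and grant on cid."""
    return [{"cname": c["cname"], "gid": g["gid"], "amount": g["amount"]}
            for c in company for g in grant if c["cid"] == g["cid"]]

def apply_mapping(rows):
    """Populate the target concept from Qs's result: one organization
    per company name, with its grants nested as fundings."""
    orgs = {}
    for r in rows:
        org = orgs.setdefault(r["cname"], {"name": r["cname"], "fundings": []})
        org["fundings"].append({"grant_id": r["gid"], "amount": r["amount"]})
    return list(orgs.values())

target = apply_mapping(source_query())
```

The point of the non-procedural form is that only the Qs/Qt relationship is specified by the designer; a function like `apply_mapping` is derived from it, and could equally be emitted for an eager (materializing) or lazy (virtual) engine.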
Where We Are Today
• Enterprise integration is a hard problem
 – Bridges data islands
 – Deals with mission-critical data
  • Integrations need to be "right"
  • Willing to invest in code and expertise to get there
• Great progress has been made
 – Two execution methods developed
  • Data warehousing, i.e., eager integration
  • Data federation, i.e., lazy integration
 – Big steps forward in integration design
  • Non-procedural schema mappings leverage data semantics and aid both eager and lazy integration
  • Semi-automated tools for other aspects, e.g., matching, cleaning
• Total cost of ownership is still high
 – Months to years to build a warehouse
 – Weeks to months to build and tune a federation
 – Deals with a relatively small number of sources

The Changing Landscape: New Challenges for Integration
• Dynamic
 – New sources of information every day
 – Data flowing constantly
 – New uses of information arise constantly
 – So many schemas, it's practically schema-less
• De-centralized
 – Few sources of "complete truth"
 – Many sources have information on overlapping entities
 – Sources may be totally independent
• Every application, every person needs integration
 – Not only big applications central to an enterprise
 – Total cost of ownership needs to shrink

The Downside of Progress?
• Distinct integration engines create complexity
 – Engines do one or the other, lazy or eager
 – Must choose the style of integration early on
• Multiple design tools needed to resolve different types of heterogeneity
 – Some tools reconcile instances from the different sources
 – Others reconcile different schemas
• Too much choice, too early
 – The "best" way to integrate may change over time
 – Must be able to build integrations incrementally
• Can we take another giant step?
 – Create an engine that delivers the best of both worlds
 – Resolve schema- and instance-level differences in one tool
 – Allow more flexible integrations that evolve over time

Distinct Engines Create Issues
• Major implications for performance, cost, infrastructure
• A good decision requires understanding both the data and the application(s)
• It would be useful to be able to switch styles of integration
• Change is expensive today: lost work, duplicate labor costs

Principle 1: Integration Independence
• Applications should be independent of how, when, and where information integration takes place
 – Changing the integration method should not change the application
 – Results should always be delivered in the desired form
 – Response times, availability, and cost may vary
 – Should quality (completeness, precision, …) also be allowed to vary?
• Materialization and the timing of transformations should be an optimization decision that is transparent to applications
 – Non-procedural mappings and a query language are key
 – Generate code for the best engine for the requirements
 – Or, build a new engine that spans federated to materialized
 – Need an "optimizer" that considers the key desiderata

Example: Optimizing an Integration?
[Figure: a spectrum of plans for the same application and integration, between fully distributed and fully materialized, with response-time/cost points such as 50 sec; 10 sec + $1; 2 sec + $1.50; and 1 sec + $3: materializing more lowers response time but raises cost.]
• Possible in limited cases with today's technology
• Other degrees of freedom exist
• Other objective functions are possible

Separate Resolution of Schema and Data Heterogeneity Creates Issues
• Entity resolution resolves data (instance) heterogeneity
 – Detection: founded on similarity computation using data values
 – Reconciliation: creates a single, consistent object (may re-define the schema)
• Schema mapping resolves schema heterogeneity
 – May leverage matching algorithms to suggest alignments
 – Tells how to generate data in the target schema
 – Few tools leverage data (some matching algorithms sample data)
• Schema mapping implicitly does entity resolution
 – Generates joins and unions that combine instances and eliminate duplicates
 – Remaining unresolved instances show up as "bugs" in the mapping, or require materializing the result for (further) entity resolution
• Tools are separate, but functionality overlaps
 – Confusing: which tool is needed? When to use each?
 – Inefficient: neither tool uses full knowledge of both data and schema

Example: Instances Aid Mapping?
[Example:]
Schema 1 (Condition, Intervention, Reference, …): Thalassemia, Hydroxyurea, 14988152
Schema 2 (Ziekte, Verwijzing, Behandeling, …; Dutch for disease, reference, treatment): Thalassaemia, Pubmed 14988152, "Voorschrift [prescription]: Hydroxyurea"
Target Schema: (Diagnosis, Therapy, Literature)
Shared instance values (e.g., the PubMed reference 14988152) can suggest which attributes correspond across the schemas.

Principle 2: Holistic Information Integration
• Integration by its nature is an iterative, incremental process
 – Rarely get it right the first time
 – Needs evolve over time
• Tools should support iteration across integration tasks
 – From mapping to entity resolution and back
 – From exploring data to mapping and back
 – And so on
• Tools should be able to leverage information about both schema and data to improve the integrated result
 – Must be able to leverage partial results of an integration
 – Data-driven integration debugging

New Principles Suggest a New Engine
• Two new principles
 – Integration independence
 – Holistic information integration
• Current engines violate both principles
 – Separate virtual from materialized (mostly)
 – Separate data handling from schema handling (mostly)
• May need a new engine to embody these principles
 – Allow easy movement from virtual to materialized
 – Allow easy movement from data to schema and back

Mapping Data to Queries: a next step?
• New technique for answering queries with mapping rules
 – Flexibly integrates and explores heterogeneous data
  • Availability of the original data at all times
  • New schemas and sources can be added incrementally
  • Users add only the mapping rules they need
 – A "schema optional" solution
 – Scalability in the number of mapping rules and schemas
• Allows both schema-level and instance-level mapping rules
 – Permits both traditional mapping and data-level tasks
 – Schema-level rules support aliasing, (un)nesting, element construction, transformations
 – Instance rules merge one object into another, keeping all fields of both objects
• Provides an infrastructure for experimentation
 – Algorithms, process, etc. still needed
[Hentschel et al., 2009]

Leveraging MDQ: Schema Mapping
Clinic 1: TestCharge (Name, Policy, Covered, BillOut)
 @PJK1: James Kelly, SF 45301, $300, $280
Clinic 2: Visit (First, Last, Charge, InsPay, Ins#)
 @PDF8: Don, Fergus, $250, $50, SF 3352
 @PJP3: Jose, Pozole, $240, $60, ALS 65349
 @BDF: Don, Fergus, $580, $500, SF 3352
 @BJP: Jose, Poloze, $420, $350, FR 13299
Rules:
 TestCharge -> PatientVisit
 Visit -> PatientVisit
 BillOut -> Owed
 (Charge – InsPay) -> Owed
Sample Query: //PatientVisit[Owed >= 50]

Leveraging MDQ: Instance Mapping
Same data and schema-level rules as above, plus an instance rule:
 @BDF <- @PDF8
Merging @PDF8 into @BDF yields a single object for Don Fergus that keeps the fields of both records (Ins# SF 3352; charges $580/$250; insurance payments $500/$50), so the sample query //PatientVisit[Owed >= 50] sees the two clinic records as one PatientVisit.

Using MDQ for Data Conversion
Clinic 1: TestCharge (Name, Policy, Covered, BillOut); Clinic 2: Visit (First, Last, Charge, InsPay, Ins#)
New schema mapping rules:
 TestCharge as $t ->
  <PatientVisit>
   <Name> $t.Name </Name>
   <Owed> $t.BillOut </Owed>
   <Policy> $t.Policy </Policy>
  </PatientVisit>
 Visit as $v ->
  <PatientVisit>
   <Name> $v.First || $v.Last </Name>
   <Owed> $v.BillOut + ($v.Charge – $v.InsPay) </Owed>
   <Policy> $v.Ins# </Policy>
  </PatientVisit>
Instance mapping rule: @BDF <- @PDF8
Sample Query: //PatientVisit[Owed >= 50]
Result (PatientVisit: Name, Owed, Policy):
 @PJK1: James Kelly, $280, SF 45301
 @BDF (merged with @PDF8): Don Fergus, $130, SF 3352
 @PJP3: Jose Pozole, $60, ALS 65349
 @BJP: Jose Poloze, $70, FR 13299

Summary
• Significant progress in information integration research
 – Theory, algorithms, systems
 – Federation/virtual/lazy and warehousing/materialized/eager
 – Schema mappings feed both
• Significant challenges remain
 – Too many incompatible tools and engines; too much choice
 – The dynamic, decentralized new world demands new solutions
• Integration independence
 – Protect applications from how the data is put together
 – Make federation vs. materialization an optimization decision
• Holistic information integration
 – Schema- and data-level tasks in the same framework
 – Support iteration through tasks to improve the integration
• Much fascinating work ahead
 – Engines that can provide true integration independence
 – Tools that can take advantage of holistic information integration
 – Can "Mapping Data to Queries" help us reach our goals?
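The clinic example can be sketched end to end in a few lines of Python. This is a toy model of the MDQ idea, not the MDQ engine: schema-level rules rename and derive fields, an instance-level rule merges one object into another, and the query runs over the mapped view. The data is simplified from the slides, and the conflict-resolution choice (the merge target's fields win on overlap) is an assumption made here for concreteness.

```python
# Toy sketch of MDQ-style rules over the clinic example (simplified data).

objects = {
    "@PJK1": {"type": "TestCharge", "Name": "James Kelly",
              "Policy": "SF 45301", "BillOut": 280},
    "@PDF8": {"type": "Visit", "First": "Don", "Last": "Fergus",
              "Charge": 250, "InsPay": 50, "Ins#": "SF 3352"},
    "@BDF":  {"type": "Visit", "First": "Don", "Last": "Fergus",
              "Charge": 580, "InsPay": 500, "Ins#": "SF 3352"},
}

# Instance rule @BDF <- @PDF8: merge @PDF8 into @BDF, keeping fields of
# both objects (assumption: the target's value wins on a conflict).
merge_rules = [("@BDF", "@PDF8")]

def apply_instance_rules(objs, rules):
    objs = {oid: dict(o) for oid, o in objs.items()}
    for target, source in rules:
        merged = dict(objs.pop(source))
        merged.update(objs[target])
        objs[target] = merged
    return objs

def as_patient_visit(obj):
    """Schema rules: TestCharge -> PatientVisit, Visit -> PatientVisit,
    BillOut -> Owed, (Charge - InsPay) -> Owed."""
    if obj["type"] == "TestCharge":
        return {"Name": obj["Name"], "Owed": obj["BillOut"],
                "Policy": obj["Policy"]}
    return {"Name": obj["First"] + " " + obj["Last"],
            "Owed": obj["Charge"] - obj["InsPay"], "Policy": obj["Ins#"]}

view = {oid: as_patient_visit(o)
        for oid, o in apply_instance_rules(objects, merge_rules).items()}

# Sample query: //PatientVisit[Owed >= 50]
answers = {oid for oid, v in view.items() if v["Owed"] >= 50}
```

After the merge, @PDF8 disappears as a separate answer and Don Fergus is returned once, which is the behavior the instance-mapping slide illustrates; the original source objects remain available, in line with MDQ's "schema optional" design.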