Data Fusion
Jens Bleiholder and Felix Naumann
Presented by Aaron Stewart

Data Integration
• Schema mapping
• Duplicate detection
• Data fusion

Complete / Concise
• Like recall/precision
• Complete: coverage of real-world objects
• Concise: avoid duplicates

Conflicts
• Schematic conflicts
• Identity conflicts
• Data conflicts
  – Uncertainty
  – Contradiction

Data Fusion Strategies

Uniqueness
• Uniqueness-preserving
• Uniqueness-enforcing

Value preservation
• Value-preserving
• Non-value-preserving
• Object-preserving

Motivating Example

Joins
• Equi-join
• Natural join
• Full outer join
  – Key join
  – Left join
  – Right join

Equi-join
    SELECT U1.Name, U2.Name, U1.Age, U2.Age,
           U1.Status, U2.Status, U1.Address, U2.Address,
           U1.Field, U2.Field, U1.Library, U2.Phone
    FROM U1 JOIN U2
      ON U1.Name = U2.Name

Equi-join Result
[result table not preserved]

Natural Join
    SELECT U1.Name, U1.Age, U1.Status, U1.Address,
           U1.Field, U1.Library, U2.Phone
    FROM U1 JOIN U2
      ON U1.Name = U2.Name
     AND U1.Age = U2.Age
     AND U1.Status = U2.Status
     AND U1.Address = U2.Address
     AND U1.Field = U2.Field

Natural Join Result
[result table not preserved]

Full Outer Join
    SELECT U1.Name, U2.Name, U1.Age, U2.Age,
           U1.Status, U2.Status, U1.Address, U2.Address,
           U1.Field, U2.Field, U1.Library, U2.Phone
    FROM U1 FULL OUTER JOIN U2
      ON U1.Name = U2.Name

Full Outer Join Result
[result table not preserved]

Full Disjunction
• Generalizes outer join to more than two tables
…

Information Systems for Data Fusion
1. Conflict resolution
2. Conflict avoidance
3. Conflict ignorance
4. No conflict handling

Architecture
• Database management system (DBMS)
• Multidatabase management system (MDBMS)
• Mediator-wrapper (MW)
• Multi-agent system (MAS)
• Stand-alone application (APP)

Integration Model
• Global-as-view (GaV)
• Local-as-view (LaV)
• Global-Local-as-view (GLaV)

1. Conflict-Resolving Systems
• Multibase
• Hermes
• Fusionplex
• HumMer
• Ajax

Multibase
• c. 1983
• Solution:
  – Outer join
  – Aggregation (min, max, sum, choose, etc.)

Hermes
• HEterogeneous Reasoning and MEdiator System
• c. 1996
• Mediator-specified conflict resolution
  – Created by an expert

Fusionplex
• Multiplex, Fusionplex, Autoplex
• Classifies quality of data
• User-prioritized feature “importance”
• Able to incorporate new/unknown databases

HumMer
• Humboldt-Merger
• c. 2006
• Handles conflicts in schema, identity, data
• Clusters duplicates
• User-defined aggregation functions

Ajax
• Format and unit conversion
• User-defined cleansing process
  – Compiled to Java

2. Conflict-Avoiding Systems
• TSIMMIS
• SIMS and Ariadne
• Infomix
• HIPPO
• ConQuer
• Rainbow

3. Conflict-Ignoring Systems
• Pegasus
• Nimble
• Carnot
• InfoSleuth
• Potter’s Wheel

Other Systems
• Research Systems
  – Trio
  – Information Manifold
  – Garlic
  – Disco (Distributed Information Search Component)
  – Papyrus, Nomenclature
  – DIOM, KOMET, Infomaster, Occam, SIMS, Internet Softbot
  – Singapore, Magic, Observer
  – Lore, Tukwila
  – SIRIUS-DELTA, DDTS, Mermaid, UNIBASE
  – MRDSM, OMNIBASE, CALIDA, DQS

Other Systems
• Commercial
  – IBM, Oracle, Microsoft, others
  – IBM Information Server (IIS)
  – Microsoft SQL Server Integration Services (SSIS)

Other Systems
• Peer Data Management Systems
  – Orchestra
  – Hyper

Analysis
• Weaknesses
  – Difficult to show utility of a tool on paper
• Strengths
  – Covered a lot of theory
  – Covered a lot of systems
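
Worked sketch: outer join plus aggregation

The Multibase slide describes conflict resolution as an outer join followed by aggregation (min, max, choose, etc.). The following is a minimal sketch of that idea in Python, not the implementation of any system listed here. The column names (Name, Age, Status, Library, Phone) come from the join queries above; the rows, the helper names `coalesce`, `choose_max`, and `fuse`, and the choice of `max` for Age are all illustrative assumptions. It also assumes duplicate detection has already run, so each key appears at most once per source.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[object]]
Resolver = Callable[[Optional[object], Optional[object]], Optional[object]]


def coalesce(a, b):
    """Uncertainty: take the first non-null value."""
    return a if a is not None else b


def choose_max(a, b):
    """Contradiction: keep the larger value; fall back to coalesce on nulls."""
    if a is None or b is None:
        return coalesce(a, b)
    return max(a, b)


def fuse(u1: List[Row], u2: List[Row], key: str,
         resolvers: Dict[str, Resolver]) -> List[Row]:
    """Full outer join on `key`, then collapse each match into one row.

    Assumes `key` is unique within each source (hypothetical simplification).
    Columns without an explicit resolver default to `coalesce`.
    """
    by_key1 = {r[key]: r for r in u1}
    by_key2 = {r[key]: r for r in u2}

    # Union of all columns, in first-seen order.
    columns: List[str] = []
    for r in u1 + u2:
        for c in r:
            if c not in columns:
                columns.append(c)

    # Keys from u1 first, then keys that appear only in u2.
    keys = list(by_key1) + [k for k in by_key2 if k not in by_key1]

    fused: List[Row] = []
    for k in keys:
        left, right = by_key1.get(k, {}), by_key2.get(k, {})
        fused.append({
            c: resolvers.get(c, coalesce)(left.get(c), right.get(c))
            for c in columns
        })
    return fused


# Hypothetical sample rows; not the data from the motivating example.
U1 = [
    {"Name": "Alice", "Age": 23, "Status": "grad", "Library": "yes"},
    {"Name": "Bob", "Age": None, "Status": "undergrad", "Library": "no"},
]
U2 = [
    {"Name": "Alice", "Age": 24, "Status": "grad", "Phone": "555-0100"},
    {"Name": "Carol", "Age": 30, "Status": None, "Phone": None},
]

FUSED = fuse(U1, U2, "Name", {"Age": choose_max})

for row in FUSED:
    print(row)
```

The result has one row per real-world object (concise) and keeps objects that appear in only one source (complete). Swapping `choose_max` for another function changes the resolution strategy without touching the join step.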