Data Fusion
Jens Bleiholder and Felix Naumann
Presented by Aaron Stewart
Data Integration
• Schema mapping
• Duplicate detection
• Data fusion
Complete / Concise
• Like recall/precision
• Complete: coverage of real-world objects
• Concise: avoid duplicates
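The two measures can be made concrete with a small sketch. It assumes completeness is the share of real-world objects that are covered and conciseness is the share of rows that are not duplicates; the object labels and the `universe` set are hypothetical.

```python
def completeness(records, universe):
    """Fraction of real-world objects in `universe` that appear in `records`."""
    covered = {obj for obj in records if obj in universe}
    return len(covered) / len(universe)


def conciseness(records):
    """Fraction of rows that are distinct objects (1.0 means no duplicates)."""
    return len(set(records)) / len(records)


# Hypothetical data: each row is labeled with the real-world object it describes.
universe = {"alice", "bob", "carol", "dave"}
integrated = ["alice", "alice", "bob", "carol", "carol"]

print(completeness(integrated, universe))  # 0.75
print(conciseness(integrated))             # 0.6
```

Adding more sources tends to raise completeness; removing duplicates raises conciseness.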
Conflicts
• Schematic conflicts
• Identity conflicts
• Data conflicts
– Uncertainty
– Contradiction
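The two kinds of data conflict can be told apart mechanically. A minimal sketch, assuming an uncertainty is a NULL against a non-NULL value and a contradiction is two different non-NULL values; the function name is hypothetical.

```python
def classify_conflict(a, b):
    """Classify two attribute values that describe the same real-world object."""
    if a is None and b is None:
        return "no information"
    if a is None or b is None:
        return "uncertainty"      # one source is silent (NULL)
    if a != b:
        return "contradiction"    # both sources speak and disagree
    return "no conflict"


print(classify_conflict(None, 25))  # uncertainty
print(classify_conflict(24, 25))    # contradiction
print(classify_conflict(25, 25))    # no conflict
```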
Data Fusion Strategies
Uniqueness
• Uniqueness-preserving
• Uniqueness-enforcing
Value preservation
• Value-preserving
• Non-value-preserving
• Object-preserving
Motivating Example
Joins
• Equi-join
• Natural join
• Full outer join
– Key join
– Left join
– Right join
Equi-join
SELECT U1.Name, U2.Name, U1.Age, U2.Age, U1.Status, U2.Status,
U1.Address, U2.Address, U1.Field, U2.Field, U1.Library, U2.Phone
FROM U1 JOIN U2 ON U1.Name=U2.Name
Equi-join Result
Natural Join
SELECT U1.Name, U1.Age, U1.Status, U1.Address, U1.Field,
U1.Library, U2.Phone
FROM U1 JOIN U2 ON U1.Name=U2.Name AND U1.Age=U2.Age
AND U1.Status=U2.Status AND U1.Address=U2.Address
AND U1.Field=U2.Field
Natural Join Result
Full Outer Join
SELECT U1.Name, U2.Name, U1.Age, U2.Age, U1.Status, U2.Status,
U1.Address, U2.Address, U1.Field, U2.Field, U1.Library,
U2.Phone
FROM U1 FULL OUTER JOIN U2 ON U1.Name=U2.Name
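To compare the three joins side by side, here is a sketch that runs them with Python's `sqlite3` module. The table contents are hypothetical and the schemas are cut down to Name, Age, and Library/Phone. The full outer join is emulated with two left joins, an assumption made so the sketch also runs on older SQLite builds without `FULL OUTER JOIN`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE U1 (Name TEXT, Age INTEGER, Library TEXT);
    CREATE TABLE U2 (Name TEXT, Age INTEGER, Phone TEXT);
    INSERT INTO U1 VALUES ('Ann', 22, 'Main'), ('Bob', 25, 'Law'), ('Cy', 30, 'Main');
    INSERT INTO U2 VALUES ('Ann', 22, '555-0100'), ('Bob', 26, '555-0101'),
                          ('Dee', 28, '555-0102');
""")

# Equi-join on Name: keeps Bob, whose ages contradict each other.
equi = conn.execute(
    "SELECT U1.Name, U1.Age, U2.Age FROM U1 JOIN U2 ON U1.Name = U2.Name"
).fetchall()

# Natural-join style: every shared attribute must match, so Bob is dropped.
natural = conn.execute(
    "SELECT U1.Name, U1.Age FROM U1 JOIN U2 "
    "ON U1.Name = U2.Name AND U1.Age = U2.Age"
).fetchall()

# Full outer join, emulated: all U1 rows, plus the U2 rows with no partner.
full = conn.execute("""
    SELECT U1.Name, U2.Name, U1.Age, U2.Age
    FROM U1 LEFT JOIN U2 ON U1.Name = U2.Name
    UNION ALL
    SELECT U1.Name, U2.Name, U1.Age, U2.Age
    FROM U2 LEFT JOIN U1 ON U1.Name = U2.Name
    WHERE U1.Name IS NULL
""").fetchall()

print(len(equi), len(natural), len(full))  # 2 1 4
```

Only the full outer join keeps Cy and Dee, who appear in a single source. None of the three joins merges the two Age columns into one value.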
Full Outer Join Result
Full Disjunction
• Generalizes outer join to more than two tables
…
Information Systems for Data Fusion
1. Conflict resolution
2. Conflict avoidance
3. Conflict ignorance
4. No conflict handling
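The first three classes can be contrasted on a single attribute. This is a sketch under stated assumptions: the function names and the specific rules (pass everything on, first non-null value, majority vote) are illustrative choices, not the behavior of any particular system.

```python
from collections import Counter


def ignore_conflict(values):
    """Conflict ignorance: hand every reported value on unchanged."""
    return list(values)


def avoid_conflict(values):
    """Conflict avoidance: apply a fixed rule without inspecting the conflict,
    here 'first non-null value in source-preference order'."""
    return next((v for v in values if v is not None), None)


def resolve_conflict(values):
    """Conflict resolution: look at all values and decide, here by majority vote."""
    known = [v for v in values if v is not None]
    return Counter(known).most_common(1)[0][0] if known else None


ages = [None, 25, 26, 26]  # hypothetical Age values, one per source

print(ignore_conflict(ages))   # [None, 25, 26, 26]
print(avoid_conflict(ages))    # 25
print(resolve_conflict(ages))  # 26
```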
Architecture
• Database management system (DBMS)
• Multidatabase management system (MDBMS)
• Mediator-wrapper (MW)
• Multi-agent system (MAS)
• Stand-alone application (APP)
Integration Model
• Global-as-view (GaV)
• Local-as-view (LaV)
• Global-Local-as-view (GLaV)
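Global-as-view can be illustrated with an ordinary SQL view: the global relation is defined as a query over the sources. A minimal sketch using `sqlite3`; the source tables `src_a` and `src_b` and the global relation `person` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Two hypothetical local sources with different schemas.
    CREATE TABLE src_a (full_name TEXT, age INTEGER);
    CREATE TABLE src_b (name TEXT, years INTEGER);
    INSERT INTO src_a VALUES ('Ann', 22);
    INSERT INTO src_b VALUES ('Dee', 28);

    -- GaV: the global relation is a view over the source relations.
    CREATE VIEW person AS
        SELECT full_name AS name, age FROM src_a
        UNION ALL
        SELECT name, years AS age FROM src_b;
""")

# Queries against the global schema are answered by unfolding the view.
rows = conn.execute("SELECT name, age FROM person ORDER BY name").fetchall()
print(rows)  # [('Ann', 22), ('Dee', 28)]
```

Under LaV the definitions run the other way: each source is described as a view over the global schema, and queries have to be rewritten in terms of those views.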
1. Conflict-Resolving Systems
• Multibase
• Hermes
• Fusionplex
• HumMer
• Ajax
Multibase
• Circa 1983
• Solution:
– Outer join
– Aggregation (min, max, sum, choose, etc.)
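The outer-join-plus-aggregation idea can be sketched as follows. This is not Multibase's actual interface: the resolver table, the `fuse` helper, and the joined rows are hypothetical, and `choose` is assumed to mean "take the first non-null value".

```python
def choose(values):
    """Pick the first non-null value."""
    return next((v for v in values if v is not None), None)


def non_null(fn):
    """Wrap an aggregate so it skips NULLs and yields None when nothing is left."""
    def wrapped(values):
        known = [v for v in values if v is not None]
        return fn(known) if known else None
    return wrapped


RESOLVERS = {
    "min": non_null(min),
    "max": non_null(max),
    "sum": non_null(sum),
    "choose": choose,
}


def fuse(rows, how):
    """Collapse the two Age columns of each joined row with the chosen resolver."""
    resolve = RESOLVERS[how]
    return [(name, resolve([a1, a2])) for name, a1, a2 in rows]


# Hypothetical rows as they might come out of an outer join on Name:
# (name, age from source 1, age from source 2)
joined = [("Ann", 22, 22), ("Bob", 25, 26), ("Cy", 30, None), ("Dee", None, 28)]

print(fuse(joined, "max"))     # [('Ann', 22), ('Bob', 26), ('Cy', 30), ('Dee', 28)]
print(fuse(joined, "choose"))  # [('Ann', 22), ('Bob', 25), ('Cy', 30), ('Dee', 28)]
```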
Hermes
• HEterogeneous Reasoning and MEdiator System
• Circa 1996
• Mediator-specified conflict resolution
– Created by an expert
Fusionplex
• Multiplex, Fusionplex, Autoplex
• Classifies quality of data
• User-prioritized feature “importance”
• Able to incorporate new/unknown databases
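User-prioritized feature importance can be pictured as a weighted score over quality features. The sketch below is an illustration only, not Fusionplex's actual algorithm: the feature names, the score range, and `pick_by_importance` are all assumptions.

```python
def pick_by_importance(candidates, importance):
    """Return the candidate value whose weighted feature score is highest.

    candidates: list of (value, {feature: score in [0, 1]})
    importance: {feature: user-assigned weight}
    """
    def score(features):
        return sum(importance.get(f, 0.0) * s for f, s in features.items())

    return max(candidates, key=lambda c: score(c[1]))[0]


# Hypothetical conflicting Age values with per-source quality features.
candidates = [
    (25, {"recency": 0.9, "accuracy": 0.5}),
    (26, {"recency": 0.4, "accuracy": 0.9}),
]

print(pick_by_importance(candidates, {"recency": 1.0, "accuracy": 0.2}))  # 25
print(pick_by_importance(candidates, {"recency": 0.2, "accuracy": 1.0}))  # 26
```

Changing the weights changes which source wins, without touching the data.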
HumMer
• Humboldt-Merger
• Circa 2006
• Handles conflicts in schema, identity, data
• Clusters duplicates
• User-defined aggregation functions
Ajax
• Format and unit conversion
• User-defined cleansing process
– Compiled to Java
2. Conflict-Avoiding Systems
• TSIMMIS
• SIMS and Ariadne
• Infomix
• HIPPO
• ConQuer
• Rainbow
3. Conflict-Ignoring Systems
• Pegasus
• Nimble
• Carnot
• InfoSleuth
• Potter’s Wheel
Other Systems
• Research Systems
– Trio
– Information Manifold
– Garlic
– Disco (Distributed Information Search Component)
– Papyrus, Nomenclature
– DIOM, KOMET, Infomaster, Occam, SIMS, Internet Softbot
– Singapore, Magic, Observer
– Lore, Tukwila
– SIRIUS-DELTA, DDTS, Mermaid, UNIBASE
– MRDSM, OMNIBASE, CALIDA, DQS
Other Systems
• Commercial
– IBM, Oracle, Microsoft, others
– IBM Information Server (IIS)
– Microsoft SQL Server Integration Services (SSIS)
Other Systems
• Peer Data Management Systems
– Orchestra
– Hyper
Analysis
• Weaknesses
– Difficult to show utility of a tool on paper
• Strengths
– Covered a lot of theory
– Covered a lot of systems