PPT slides presented in class

Schema & Ontology Matching:
Current Research Directions
AnHai Doan
Database and Information System Group
University of Illinois, Urbana-Champaign
Spring 2004
Road Map

Schema Matching
– motivation & problem definition
– representative current solutions: LSD, iMAP, Clio
– broader picture

Ontology Matching
– motivation & problem definition
– representative current solution: GLUE
– broader picture

Conclusions & Emerging Directions
Motivation: Data Integration

[Figure: a new faculty member asks "find houses with 2 bedrooms priced under 200K"; the query must be answered over multiple sources: realestate.com, homeseekers.com, homes.com]
Architecture of Data Integration System

[Figure: the query "find houses with 2 bedrooms priced under 200K" is posed against a mediated schema, which is mapped to source schemas 1-3 of realestate.com, homeseekers.com, and homes.com]
Semantic Matches between Schemas

[Figure: mediated schema with elements price, agent-name, address; homes.com table with columns listed-price (320K, 240K), contact-name (Jane Brown, Mike Smith), city and state (Seattle WA, Miami FL). A 1-1 match links price to listed-price; a complex match links address to the combination of city and state]
Schema Matching is Ubiquitous!

Fundamental problem in numerous applications

Databases
– data integration
– data translation
– schema/view integration
– data warehousing
– semantic query processing
– model management
– peer data management

AI
– knowledge bases, ontology merging, information gathering agents, ...

Web
– e-commerce
– marking up data using ontologies (e.g., on the Semantic Web)
Why Schema Matching is Difficult

Schema & data never fully capture semantics!
– not adequately documented
– schema creator has retired to Florida!

Must rely on clues in schema & data
– names, structures, types, data values, etc.

Such clues can be unreliable
– same name => different entities: area => location or square-feet
– different names => same entity: area & address => location

Intended semantics can be subjective
– house-style = house-description?
– military applications require committees to decide!

Cannot be fully automated, needs user feedback!
Current State of Affairs

Finding semantic mappings is now a key bottleneck!
– largely done by hand; labor intensive & error prone
– data integration at GTE [Li&Clifton, 2000]:
  40 databases, 27,000 elements, estimated time: 12 years

Will only be exacerbated as
– data sharing becomes pervasive
– legacy data must be translated

Need semi-automatic approaches to scale up!

Many research projects in the past few years
– Databases: IBM Almaden, Microsoft Research, BYU, George Mason,
  U of Leipzig, U Wisconsin, NCSU, UIUC, Washington, ...
– AI: Stanford, Karlsruhe University, NEC Japan, ...
Road Map

Schema Matching
– motivation & problem definition
– representative current solutions: LSD, iMAP, Clio
– broader picture

Ontology Matching
– motivation & problem definition
– representative current solution: GLUE
– broader picture

Conclusions & Emerging Directions
LSD

Learning Source Descriptions
Developed at Univ of Washington, 2000-2001
– with Pedro Domingos and Alon Halevy

Designed for data integration settings
– has been adapted to several other contexts

Desirable characteristics
– learns from previous matching activities
– exploits multiple types of information in schema and data
– incorporates domain integrity constraints
– handles user feedback
– achieves high matching accuracy (66 - 97%) on real-world data
Schema Matching for Data Integration: the LSD Approach

Suppose a user wants to integrate 100 data sources
1. User
– manually creates matches for a few sources, say 3
– shows LSD these matches
2. LSD learns from the matches
3. LSD predicts matches for the remaining 97 sources
Learning from the Manual Matches

[Figure: mediated schema (price, agent-name, agent-phone, office-phone, description) matched against the schema of realestate.com (listed-price, contact-name, contact-phone, office, comments), with data rows from homes.com (sold-at, contact-agent, office, extra-info; e.g., "$250K", "James Smith", "(305) 616 1822", "Fantastic house"). Two learned clues are illustrated: if "office" occurs in the element name => office-phone; if words like "fantastic" and "great" occur frequently in the data instances => description]
Must Exploit Multiple Types of Information!

[Figure: same example as the previous slide — name-based clues ("office" occurs in the element name => office-phone) and data-based clues ("fantastic" & "great" occur frequently in data instances => description) must be combined]
Multi-Strategy Learning

Use a set of base learners
– each exploits certain types of information well

To match a schema element of a new source
– apply the base learners
– combine their predictions using a meta-learner

Meta-learner
– uses training sources to measure base-learner accuracy
– weighs each learner based on its accuracy
Base Learners

[Figure: a base learner is trained on labeled examples (X1,C1), ..., (Xm,Cm) to produce a classification model (hypothesis); at matching time it takes an observed object X and outputs labels weighted by confidence scores]

Name Learner
– training: ("location", address), ("contact name", name)
– matching: agent-name => (name, 0.7), (phone, 0.3)

Naive Bayes Learner
– training: ("Seattle, WA", address), ("250K", price)
– matching: "Kent, WA" => (address, 0.8), (name, 0.2)
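The Naive Bayes base learner above can be sketched as a small word-frequency classifier over data values. This is a minimal illustration of the idea, not LSD's actual code; the class name and tokenization are invented for the example.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesValueLearner:
    """Toy word-frequency Naive Bayes over data values (a sketch, not
    LSD's implementation)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)   # label -> token counts
        self.label_counts = Counter()
        self.vocab = set()

    def train(self, examples):
        # examples: list of (value string, mediated-schema label)
        for value, label in examples:
            tokens = value.lower().replace(",", " ").split()
            self.label_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)

    def match(self, value):
        # return (label, confidence) pairs, best first
        tokens = value.lower().replace(",", " ").split()
        total = sum(self.label_counts.values())
        log_scores = {}
        for label in self.label_counts:
            # log P(label) + sum of log P(token | label), Laplace-smoothed
            logp = math.log(self.label_counts[label] / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for t in tokens:
                logp += math.log((self.word_counts[label][t] + 1) / denom)
            log_scores[label] = logp
        # normalize log scores into confidences that sum to 1
        m = max(log_scores.values())
        exp_scores = {l: math.exp(s - m) for l, s in log_scores.items()}
        z = sum(exp_scores.values())
        return sorted(((l, s / z) for l, s in exp_scores.items()),
                      key=lambda pair: -pair[1])
```

Trained on examples like ("Seattle, WA", address) and ("250K", price), it ranks address first for "Kent, WA" because the token "wa" was seen under address.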
The LSD Architecture

[Figure: Training phase — the mediated schema and training data feed base learners Base-Learner1 ... Base-Learnerk, each producing a hypothesis; the meta-learner learns weights for the base learners. Matching phase — source schemas go through the base learners; the meta-learner combines predictions for instances; the prediction combiner aggregates them into predictions for elements; the constraint handler applies domain constraints to produce mappings]
Training the Base Learners

[Figure: realestate.com data (location, price, contact-name, contact-phone, office, comments; e.g., "Miami, FL", "$250K", "James Smith", "(305) 729 0831", "(305) 616 1822", "Fantastic house") matched against the mediated schema (address, price, agent-name, agent-phone, office-phone, description)]

Name Learner training examples
– ("location", address), ("price", price), ("contact name", agent-name),
  ("contact phone", agent-phone), ("office", office-phone), ("comments", description)

Naive Bayes Learner training examples
– ("Miami, FL", address), ("$250K", price), ("James Smith", agent-name),
  ("(305) 729 0831", agent-phone), ("(305) 616 1822", office-phone),
  ("Fantastic house", description), ("Boston, MA", address), ...
Meta-Learner: Stacking [Wolpert-92, Ting&Witten-99]

Training
– uses training data to learn weights
– one for each (base-learner, mediated-schema element) pair
– e.g., weight(Name-Learner, address) = 0.2
–       weight(Naive-Bayes, address) = 0.8

Matching: combine predictions of base learners
– computes a weighted average of the base-learner confidence scores

Example: source element area, with instances "Seattle, WA", "Kent, WA", "Bend, OR"
– Name Learner: (address, 0.4); Naive Bayes: (address, 0.9)
– Meta-Learner: (address, 0.4 * 0.2 + 0.9 * 0.8 = 0.8)
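The stacking combination step can be written down directly. This sketch uses invented function and learner names, with the weights and confidence scores taken from the slide's example.

```python
def combine_predictions(weights, base_predictions):
    """Weighted average of base-learner confidence scores, with one
    weight per (learner, mediated-schema element) pair."""
    combined = {}
    for learner, preds in base_predictions.items():
        for element, confidence in preds:
            w = weights.get((learner, element), 0.0)
            combined[element] = combined.get(element, 0.0) + w * confidence
    return sorted(combined.items(), key=lambda pair: -pair[1])

# The slide's example for source element "area"
weights = {("name-learner", "address"): 0.2,
           ("naive-bayes", "address"): 0.8}
preds = {"name-learner": [("address", 0.4)],
         "naive-bayes": [("address", 0.9)]}
# combined score for address: 0.4 * 0.2 + 0.9 * 0.8 = 0.8
```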
The LSD Architecture (revisited)

[Same architecture figure as before, shown again to introduce the matching phase]
Applying the Learners

[Figure: for the homes.com element area, with instances "Seattle, WA", "Kent, WA", "Bend, OR", the Name Learner and Naive Bayes run on the element name and on each instance; the Meta-Learner combines them per instance, e.g., (address, 0.8), (description, 0.2) and (address, 0.6), (description, 0.4); the Prediction Combiner aggregates these into (address, 0.7), (description, 0.3)]

Combined predictions for homes.com
– area:          (address, 0.7), (description, 0.3)
– sold-at:       (price, 0.9), (agent-phone, 0.1)
– contact-agent: (agent-phone, 0.9), (description, 0.1)
– extra-info:    (address, 0.6), (description, 0.4)
Domain Constraints

Encode user knowledge about the domain
Specified only once, by examining the mediated schema

Examples
– at most one source-schema element can match address
– if a source-schema element matches house-id, then it is a key
– avg-value(price) > avg-value(num-baths)

Given a mapping combination, can verify whether it satisfies a given constraint
– e.g., area: address; sold-at: price; contact-agent: agent-phone; extra-info: address
The Constraint Handler

Predictions from the Prediction Combiner
– area:          (address, 0.7), (description, 0.3)
– sold-at:       (price, 0.9), (agent-phone, 0.1)
– contact-agent: (agent-phone, 0.9), (description, 0.1)
– extra-info:    (address, 0.6), (description, 0.4)

Domain constraint: at most one element matches address

Candidate combinations (score = product of confidences)
– area: address, sold-at: price, contact-agent: agent-phone, extra-info: address
  0.7 * 0.9 * 0.9 * 0.6 = 0.3402  (violates the constraint)
– area: address, sold-at: price, contact-agent: agent-phone, extra-info: description
  0.7 * 0.9 * 0.9 * 0.4 = 0.2268  (best valid combination)
– a low-confidence alternative: 0.3 * 0.1 * 0.1 * 0.4 = 0.0012

Searches the space of mapping combinations efficiently
Can handle arbitrary constraints
Also used to incorporate user feedback
– e.g., "sold-at does not match price"
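The selection step on this slide — score each combination by the product of its confidences and keep the best one that satisfies the constraints — can be sketched naively as follows. The real handler searches the space efficiently rather than enumerating it; names here are invented, and the numbers are the slide's.

```python
from itertools import product

def best_mapping(candidates, constraints):
    """Naive constraint-handler sketch: enumerate all mapping
    combinations, drop those violating a constraint, and return the
    combination with the highest product of confidences."""
    elements = list(candidates)
    best, best_score = None, -1.0
    for choice in product(*(candidates[e] for e in elements)):
        mapping = dict(zip(elements, (label for label, _ in choice)))
        score = 1.0
        for _, conf in choice:
            score *= conf
        if all(c(mapping) for c in constraints) and score > best_score:
            best, best_score = mapping, score
    return best, best_score

candidates = {
    "area":          [("address", 0.7), ("description", 0.3)],
    "sold-at":       [("price", 0.9), ("agent-phone", 0.1)],
    "contact-agent": [("agent-phone", 0.9), ("description", 0.1)],
    "extra-info":    [("address", 0.6), ("description", 0.4)],
}
# at most one source-schema element matches address
constraints = [lambda m: sum(v == "address" for v in m.values()) <= 1]
```

The unconstrained best (0.3402) maps both area and extra-info to address; with the constraint, the handler settles on extra-info: description with score 0.7 * 0.9 * 0.9 * 0.4 = 0.2268.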
The Current LSD System

Can also handle data in XML format
– matches XML DTDs

Base learners
– Naive Bayes [Duda&Hart-93, Domingos&Pazzani-97]: exploits frequencies of words & symbols
– WHIRL Nearest-Neighbor Classifier [Cohen&Hirsh KDD-98]: employs an information-retrieval similarity metric
– Name Learner [SIGMOD-01]: matches elements based on their names
– County-Name Recognizer [SIGMOD-01]: stores all U.S. county names
– XML Learner [SIGMOD-01]: exploits the hierarchical structure of XML data
Empirical Evaluation

Four domains
– Real Estate I & II, Course Offerings, Faculty Listings

For each domain
– created mediated schema & domain constraints
– chose five sources
– extracted & converted data into XML
– mediated schemas: 14 - 66 elements; source schemas: 13 - 48

Ten runs per domain; in each run
– manually provided 1-1 matches for 3 sources
– asked LSD to propose matches for the remaining 2 sources
– accuracy = % of 1-1 matches correctly identified
High Matching Accuracy

[Chart: average matching accuracy (%) across Real Estate I, Real Estate II, Course Offerings, Faculty Listings]

LSD's accuracy: 71 - 92%
– best single base learner: 42 - 72%
– + meta-learner: + 5 - 22%
– + constraint handler: + 7 - 13%
– + XML learner: + 0.8 - 6%
Contribution of Schema vs. Data

[Chart: average matching accuracy (%) of LSD with only schema info, LSD with only data info, and the complete LSD, on Real Estate I & II, Course Offerings, Faculty Listings]

More experiments in [Doan et al. SIGMOD-01]
LSD Summary

LSD
– learns from previous matching activities
– exploits multiple types of information by employing multi-strategy learning
– incorporates domain constraints & user feedback
– achieves high matching accuracy

LSD focuses on 1-1 matches

Next challenge: discover more complex matches!
– iMAP (Illinois Mapping) system [SIGMOD-04]
– developed at Washington and Illinois, 2002-2004
– with Robin Dhamankar, Yoonkyong Lee, Alon Halevy, Pedro Domingos
The iMAP Approach

[Figure: mediated schema (price, num-baths, address) vs. homes.com (listed-price, agent-id, full-baths, half-baths, city, zipcode), with sample rows such as listed-price 320K, full-baths 2, half-baths 1, city Seattle, zipcode 98105]

For each mediated-schema element
– searches the space of all matches
– finds a small set of likely match candidates
– uses LSD to evaluate them

To search efficiently
– employs a specialized searcher for each element type
– Text Searcher, Numeric Searcher, Category Searcher, ...
The iMAP Architecture [SIGMOD-04]

[Figure: the mediated schema and the source schema + data feed searchers Searcher1, Searcher2, ..., Searcherk, which produce match candidates; base learners and a meta-learner score the candidates into a similarity matrix, drawing on domain knowledge and data; a match selector outputs 1-1 and complex matches, and an explanation module lets the user inspect them]
An Example: Text Searcher

Beam search in the space of all concatenation matches
Example: find match candidates for address

[Figure: homes.com columns listed-price (320K, 240K), agent-id (532a, 115c), full-baths, half-baths, city (Seattle, Miami), zipcode (98105, 23591); candidate concatenations concat(agent-id, city) = "532a Seattle", concat(agent-id, zipcode) = "532a 98105", concat(city, zipcode) = "Seattle 98105", ...]

Best match candidates for address
– (agent-id, 0.7), (concat(agent-id, city), 0.75), (concat(city, zipcode), 0.9)
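A beam search over concatenation matches can be sketched as follows. All names are invented, and the scorer is a toy stand-in: in iMAP the score would come from the learners evaluating each candidate against the target element.

```python
import re

def text_searcher(target_scorer, columns, beam_width=3, max_parts=2):
    """Sketch of a beam search over concatenation matches: score single
    columns, keep the top beam_width candidates, then try extending each
    survivor with one more column."""
    def concat_values(cols):
        return [" ".join(row) for row in zip(*(columns[c] for c in cols))]

    beam = sorted(((target_scorer(concat_values((c,))), (c,))
                   for c in columns), reverse=True)[:beam_width]
    for _ in range(max_parts - 1):
        expanded = list(beam)
        for _, cols in beam:
            for c in columns:
                if c not in cols:
                    cand = cols + (c,)
                    expanded.append((target_scorer(concat_values(cand)), cand))
        beam = sorted(set(expanded), reverse=True)[:beam_width]
    return beam

# Hypothetical candidate components for the target element "address"
columns = {"agent-id": ["532a", "115c"],
           "city": ["Seattle", "Miami"],
           "zipcode": ["98105", "23591"]}

def address_scorer(values):
    # toy scorer: reward values containing both a word and a 5-digit zip
    def one(v):
        toks = v.split()
        return (any(t.isalpha() for t in toks)
                + any(bool(re.fullmatch(r"\d{5}", t)) for t in toks)) / 2
    return sum(map(one, values)) / len(values)
```

With this scorer the search surfaces the city + zipcode concatenation, mirroring the slide's top candidate concat(city, zipcode).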
Empirical Evaluation

Current iMAP system
– 12 searchers

Four real-world domains
– real estate, product inventory, cricket, financial wizard
– target schemas: 19 - 42 elements; source schemas: 32 - 44

Accuracy: 43 - 92%

Sample discovered matches
– agent-name = concat(first-name, last-name)
– area = building-area / 43560
– discount-cost = (unit-price * quantity) * (1 - discount)

More detail in [Dhamankar et al. SIGMOD-04]
Observations

Finding complex matches is much harder than finding 1-1 matches!
– they require gluing together many components
– e.g., num-rooms = bath-rooms + bed-rooms + dining-rooms + living-rooms
– missing one component => incorrect match

However, even partial matches are already very useful!
– so are top-k matches => need methods to handle partial/top-k matches

Huge/infinite search spaces
– domain knowledge plays a crucial role!

Matches are fairly complex; hard to know if they are correct
– must be able to explain matches

Human must be fairly active in the loop
– need strong user-interaction facilities

Break the matching architecture into multiple "atomic" boxes!
Road Map

Schema Matching
– motivation & problem definition
– representative current solutions: LSD, iMAP, Clio
– broader picture

Ontology Matching
– motivation & problem definition
– representative current solution: GLUE
– broader picture

Conclusions & Emerging Directions
Finding Matches is only Half of the Job!

To translate data/queries, we need mappings, not just matches

[Figure: Schema S — HOUSES(location, price ($), agent-id) with rows (Atlanta, GA | 360,000 | 32), (Raleigh, NC | 430,000 | 15); AGENTS(id, name, city, state, fee-rate) with rows (32 | Mike Brown | Athens | GA | 0.03), (15 | Jean Laup | Raleigh | NC | 0.04). Schema T — LISTINGS(area, list-price, agent-address, agent-name) with rows (Denver, CO | 550,000 | Boulder, CO | Laura Smith), (Atlanta, GA | 370,800 | Athens, GA | Mike Brown)]

Mappings
– area          = SELECT location FROM HOUSES
– agent-address = SELECT concat(city, state) FROM AGENTS
– list-price    = SELECT price * (1 + fee-rate)
                  FROM HOUSES, AGENTS WHERE agent-id = id
Clio: Elaborating Matches into Mappings

Developed at Univ of Toronto & IBM Almaden, 2000-2003
– by Renee Miller, Laura Haas, Mauricio Hernandez, Lucian Popa, Howard Ho, Ling-Ling Yan, Ron Fagin

Given a match
– list-price = price * (1 + fee-rate)

Refine it into a mapping
– list-price = SELECT price * (1 + fee-rate)
  FROM HOUSES (FULL OUTER JOIN) AGENTS
  WHERE agent-id = id

Need to discover
– the correct join path among tables, e.g., agent-id = id
– the correct join type, e.g., full outer join? inner join?

Uses heuristics to decide
– when in doubt, ask users
– employs sophisticated user-interaction methods [VLDB-00, SIGMOD-01]
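The join-path discovery step can be sketched as a search over a foreign-key graph: treat tables as nodes, foreign keys as edges, and connect the tables mentioned by a match. The function and the FK list here are hypothetical; the actual Clio system does considerably more (ranking alternative paths, choosing join types, and consulting the user).

```python
from collections import deque

def join_path(fk_edges, start, goal):
    """BFS over a foreign-key graph to find a join path between the two
    tables a match refers to (a sketch of the idea, not Clio's code)."""
    frontier = deque([[start]])
    seen = {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for a, b in fk_edges:
            # FK edges are traversable in either direction
            for nxt, cur in ((b, a), (a, b)):
                if cur == path[-1] and nxt not in seen:
                    seen.add(nxt)
                    frontier.append(path + [nxt])
    return None

# Hypothetical FK graph for the example schema:
# HOUSES.agent-id references AGENTS.id
fk_edges = [("HOUSES", "AGENTS")]
```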
Clio: Illustrating Examples

[Same Schema S / Schema T example and mappings as in "Finding Matches is only Half of the Job!"]
Road Map

Schema Matching
– motivation & problem definition
– representative current solutions: LSD, iMAP, Clio
– broader picture

Ontology Matching
– motivation & problem definition
– representative current solution: GLUE
– broader picture

Conclusions & Emerging Directions
Broader Picture: Find Matches

Hand-crafted rules; exploit schema; 1-1 matches
– TRANSCM [Milo&Zohar 98]
– ARTEMIS [Castano&Antonellis 99]
– [Palopoli et al. 98]
– CUPID [Madhavan et al. 01]

Single learner; exploit data; 1-1 matches
– SEMINT [Li&Clifton 94]
– ILA [Perkowitz&Etzioni 95]
– DELTA [Clifton et al. 97]
– AutoMatch, Autoplex [Berlin&Motro 01-03]

Learners + rules, using multi-strategy learning; exploit schema + data;
1-1 + complex matches; exploit domain constraints
– LSD [Doan et al. SIGMOD-01]
– iMAP [Dhamankar et al. SIGMOD-04]

Other important works
– COMA by Erhard Rahm's group
– David Embley's group at BYU
– Jaewoo Kang's group at NCSU
– Kevin Chang's group at UIUC
– Clement Yu's group at UIC

More about some of these works soon ...
Broader Picture: From Matches to Mappings

Rules; exploit data; powerful user interaction
– Clio [Miller et al. 00], [Yan et al. 01]

Learners + rules; exploit schema + data; 1-1 + complex matches;
automate as much as possible
– iMAP [Dhamankar et al. SIGMOD-04]

? (combining powerful user interaction with automation remains open)
Road Map

Schema Matching
– motivation & problem definition
– representative current solutions: LSD, iMAP, Clio
– broader picture

Ontology Matching
– motivation & problem definition
– representative current solution: GLUE
– broader picture

Conclusions & Emerging Directions
Ontology Matching

Increasingly critical for
– knowledge bases, the Semantic Web

An ontology
– concepts organized into a taxonomy tree
– each concept has a set of attributes and a set of instances
– relations among concepts

Matching
– concepts
– attributes
– relations

[Figure: CS Dept. US taxonomy — Entity, with Courses (Undergrad Courses, Grad Courses) and People (Faculty: Assistant Professor, Associate Professor, Professor; Staff); a sample instance of Associate Professor: name: Mike Burns, degree: Ph.D.]
Matching Taxonomies of Concepts

[Figure: CS Dept. US taxonomy (Entity > Courses: Undergrad Courses, Grad Courses; People > Faculty: Assistant Professor, Associate Professor, Professor; Staff) matched against CS Dept. Australia taxonomy (Entity > Courses; Staff > Academic Staff: Lecturer, Senior Lecturer, Professor; Technical Staff)]
GLUE [Doan, Madhavan, Domingos, Halevy; WWW-2002]

Solution
– use data instances extensively
– learn classifiers using information within the taxonomies
– use a rich constraint-satisfaction scheme
Concept Similarity

[Figure: concepts A and S as overlapping regions in a hypothetical universe of all examples, partitioned into A∩S, A∩¬S, ¬A∩S, and ¬A∩¬S]

Sim(A, S) = P(A ∩ S) / P(A ∪ S)   [Jaccard, 1908]
          = P(A,S) / (P(A,S) + P(A,¬S) + P(¬A,S))

Joint Probability Distribution (JPD): P(A,S), P(A,¬S), P(¬A,S), P(¬A,¬S)
Multiple similarity measures can be defined in terms of the JPD
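The Jaccard measure falls out of the JPD directly, and the JPD itself can be estimated by counting how the instances fall into the four partitions. A minimal sketch (function names invented):

```python
def jaccard_sim(p_a_s, p_a_nots, p_nota_s):
    # Sim(A, S) = P(A and S) / P(A or S), written in terms of the JPD
    return p_a_s / (p_a_s + p_a_nots + p_nota_s)

def jaccard_from_instances(a_instances, s_instances, universe):
    # estimate the JPD by counting partition sizes, then apply Jaccard
    n = len(universe)
    p_a_s = len(a_instances & s_instances) / n
    p_a_nots = len(a_instances - s_instances) / n
    p_nota_s = len(s_instances - a_instances) / n
    return jaccard_sim(p_a_s, p_a_nots, p_nota_s)
```

Note that P(¬A,¬S) cancels out of the Jaccard measure, which is exactly why the slide's formula only mentions three of the four JPD terms.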
Machine Learning for Computing Similarities

[Figure: train a classifier CL_A on taxonomy 1's instances of A vs. not-A, and a classifier CL_S on taxonomy 2's instances of S vs. not-S; apply each classifier to the other taxonomy's instances to partition them into A∩S, A∩¬S, ¬A∩S, ¬A∩¬S]

JPD estimated by counting the sizes of the partitions
The GLUE System

[Figure: taxonomies O1 and O2 (tree structure + data instances) feed a distribution estimator built from base learners and a meta-learner, which produces the joint probability distributions P(A,B), P(A',B), ...; a similarity function turns these into a similarity matrix; relaxation labeling, using common knowledge & domain constraints, produces the matches for O1 and O2]
Constraints in Taxonomy Matching

Domain-dependent
– at most one node matches department-chair
– a node that matches professor cannot be a child of a node that matches assistant-professor

Domain-independent
– two nodes match if their parents & children match
– if all children of X match Y, then X also matches Y
– variations have been exploited in many restricted settings
  [Melnik&Garcia-Molina ICDE-02], [Milo&Zohar VLDB-98],
  [Noy et al. IJCAI-01], [Madhavan et al. VLDB-01]

Challenge: find a general & efficient approach
Solution: Relaxation Labeling

Relaxation labeling [Hummel&Zucker 83]
– applied to graph labeling in vision, NLP, hypertext classification
– finds the best label assignment, given a set of constraints
– starts with an initial label assignment
– iteratively improves labels, using the constraints

Standard relaxation labeling is not directly applicable
– extended it in many ways [Doan et al. WWW-02]
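The iterate-and-improve scheme can be sketched generically. Everything here is invented for illustration — the support function, the two-node example, and the update rule; GLUE's actual update conditions on neighborhood configurations and the domain constraints.

```python
def relaxation_labeling(nodes, labels, initial, neighbor_support, iters=20):
    """Generic relaxation-labeling sketch: start from initial label
    probabilities and repeatedly re-weight P(node = label) by the
    support the current assignment of the other nodes gives it."""
    probs = {n: dict(initial[n]) for n in nodes}
    for _ in range(iters):
        updated = {}
        for n in nodes:
            scores = {lab: probs[n][lab] * neighbor_support(n, lab, probs)
                      for lab in labels}
            z = sum(scores.values()) or 1.0
            updated[n] = {lab: s / z for lab, s in scores.items()}
        probs = updated
    return {n: max(probs[n], key=probs[n].get) for n in nodes}

# Toy example (all names invented): two nodes under a soft constraint
# that they should agree on a label.
nodes, labels = ["x", "y"], ["a", "b"]
initial = {"x": {"a": 0.6, "b": 0.4}, "y": {"a": 0.45, "b": 0.55}}

def agree_support(node, label, probs):
    other = "y" if node == "x" else "x"
    return probs[other][label] + 0.1  # small floor so nothing collapses to 0
```

Starting from a weak disagreement (y slightly prefers "b"), the agreement constraint pulls both nodes onto the label "a" that x strongly prefers.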
Real-World Experiments

Taxonomies on the web
– university organization (UW and Cornell): colleges, departments and sub-fields
– companies (Yahoo and The Standard): industries and sectors

For each taxonomy
– extracted data instances: course descriptions, company profiles
– trivial data cleaning
– 100 - 300 concepts per taxonomy
– taxonomy depth 3-4
– 10-90 data instances per concept

Evaluation against manual mappings as the gold standard
GLUE's Performance

[Chart: matching accuracy (%) of the Name Learner, Content Learner, Meta-Learner, and Relaxation Labeler on University Depts 1 and 2 (Cornell to Wash., Wash. to Cornell) and Company Profiles (Standard to Yahoo, Yahoo to Standard)]
Broader Picture

Ontology matching parallels the development of schema matching
– rule-based & learning-based approaches
– PROMPT family, OntoMorph, OntoMerge, Chimaera, Onion, OBSERVER, FCA-Merge, ...
– extensive work by Ed Hovy's group
– ontology versioning (e.g., by Noy et al.)

More powerful user-interaction methods
– e.g., iPROMPT, Chimaera

Much more theoretical work in this area
Road Map

Schema Matching
– motivation & problem definition
– representative current solutions: LSD, iMAP, Clio
– broader picture

Ontology Matching
– motivation & problem definition
– representative current solution: GLUE
– broader picture

Conclusions & Emerging Directions
Develop the Theoretical Foundation

Not much is going on, however ...
– see work by Alon Halevy (AAAI-02) and Phil Bernstein (in model-management contexts)
– some preliminary work in AnHai Doan's Ph.D. dissertation
– work by Stuart Russell and other AI researchers on identity uncertainty is potentially relevant

Most likely foundation
– a probability framework
Need Much More Domain Knowledge

Where to get it?
– past matches (e.g., LSD, iMAP)
– other schemas in the domain
  – holistic matching approach by Kevin Chang's group [SIGMOD-02]
  – corpus-based matching by Alon Halevy's group [IJCAI-03]
  – clustering to achieve bridging effects by Clement Yu's group [SIGMOD-04]
– external data (e.g., iMAP at SIGMOD-04)
– mass of users (e.g., MOBS at WebDB-03)

How to get it and how to use it?
– no clear answer yet
Employ a Multi-Module Architecture

Many "black boxes", each good at doing a single thing
Combine them and tailor them to each application

Examples
– LSD, iMAP, COMA, David Embley's systems

Open issues
– what are these black boxes?
– how to build them?
– how to combine them?
Powerful User Interaction

Minimize user effort, maximize its impact
Make it very easy for users to
– supply domain knowledge
– provide feedback on matches/mappings

Develop powerful explanation facilities
Other Issues

What to do with partial/top-k matches?
Meaning negotiation
Fortifying schemas for interoperability
Very-large-scale matching scenarios (e.g., the Web)
What can we do without the mappings?
Interaction between schema matching and tuple matching?
Benchmarks, tools?
Summary

Schema/ontology matching: key to numerous data management problems
– much attention in the database, AI, and Semantic Web communities

Simple problem definition, yet very difficult to solve
– no satisfactory solution yet
– AI-complete?

We now understand the problems much better
– still at the beginning of the journey
– will need techniques from multiple fields
Backup Slides
Training the Meta-Learner

For address

Extracted XML instance          Name Learner   Naive Bayes   True prediction
<location> Miami, FL </>             0.5            0.8             1
<listed-price> $250,000 </>          0.4            0.3             0
<area> Seattle, WA </>               0.3            0.9             1
<house-addr> Kent, WA </>            0.6            0.8             1
<num-baths> 3 </>                    0.3            0.3             0
...

Least-squares linear regression over these columns yields
– weight(Name-Learner, address) = 0.1
– weight(Naive-Bayes, address) = 0.9
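The regression step can be sketched without any dependencies by solving the 2x2 normal equations. The five rows shown above are only a fragment of the training data, so this sketch will not reproduce the slide's exact weights of 0.1 and 0.9; it does recover the qualitative outcome that the Naive Bayes learner earns the larger weight for address.

```python
def least_squares_weights(x1, x2, y):
    """Solve min ||w1*x1 + w2*x2 - y||^2 via the 2x2 normal equations
    (a no-dependency sketch of the meta-learner's regression step)."""
    a11 = sum(v * v for v in x1)
    a12 = sum(u * v for u, v in zip(x1, x2))
    a22 = sum(v * v for v in x2)
    b1 = sum(u * t for u, t in zip(x1, y))
    b2 = sum(v * t for v, t in zip(x2, y))
    det = a11 * a22 - a12 * a12
    return ((b1 * a22 - b2 * a12) / det,
            (b2 * a11 - b1 * a12) / det)

# Confidence scores and true predictions for address, from the slide
name_scores  = [0.5, 0.4, 0.3, 0.6, 0.3]
bayes_scores = [0.8, 0.3, 0.9, 0.8, 0.3]
true_labels  = [1, 0, 1, 1, 0]
```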
Sensitivity to Amount of Available Data

[Chart: average matching accuracy (%) vs. number of data listings per source (0-500), Real Estate I]
Contribution of Each Component

[Chart: average matching accuracy (%) on Real Estate I & II, Course Offerings, Faculty Listings for LSD without the Name Learner, without Naive Bayes, without the WHIRL Learner, without the Constraint Handler, and the complete LSD system]
Exploiting Hierarchical Structure

Existing learners flatten out all structure

<contact>
  <name> Gail Murphy </name>
  <firm> MAX Realtors </firm>
</contact>

<description>
  Victorian house with a view. Name your price!
  To see it, contact Gail Murphy at MAX Realtors.
</description>

Developed an XML learner
– similar to the Naive Bayes learner: input instance = bag of tokens
– differs in one crucial aspect: considers not only text tokens, but also structure tokens
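The crucial difference can be illustrated with a toy tokenizer that keeps structure tokens alongside text tokens, so "Gail Murphy" inside a name element is distinguishable from "Gail Murphy" buried in free text. This is invented illustration code, not the actual XML learner.

```python
import re

def xml_tokens(xml):
    """Tokenize an XML fragment into structure tokens (tag names,
    kept in angle brackets) and lowercased text tokens, instead of
    flattening everything into one bag of words."""
    tokens = []
    for tag, text in re.findall(r"<\s*/?\s*([\w-]+)[^>]*>|([\w$',.!-]+)", xml):
        tokens.append(f"<{tag}>" if tag else text.lower())
    return tokens
```

Feeding these mixed tokens to a Naive Bayes-style learner lets it weight the evidence "murphy appears under a name tag" differently from "murphy appears in running text".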
Reasons for Incorrect Matchings

Unfamiliarity
– e.g., suburb
– solution: add a suburb-name recognizer

Insufficient information
– correctly identified the general type, failed to pinpoint the exact type
– e.g., agent-name vs. phone: "Richard Smith  (206) 234 5412"
– solution: add a proximity learner

Subjectivity
– house-style = description?
  "Victorian", "Mexican" vs. "Beautiful neo-gothic house", "Great location"
Evaluate Mapping Candidates

For address, the Text Searcher returns
– (agent-id, 0.7)
– (concat(agent-id, city), 0.8)
– (concat(city, zipcode), 0.75)

Employ multi-strategy learning to evaluate mappings
Example: (concat(agent-id, city), 0.8)
– Naive Bayes Learner: 0.8
– Name Learner: "address" vs. "agent id city": 0.3
– Meta-Learner: 0.8 * 0.7 + 0.3 * 0.3 = 0.65

Meta-Learner returns
– (agent-id, 0.59)
– (concat(agent-id, city), 0.65)
– (concat(city, zipcode), 0.70)
Relaxation Labeling

Applied to similar problems in
– vision, NLP, hypertext classification

[Figure: matching Dept Australia (Courses; Staff > Acad. Staff, Tech. Staff) against Dept U.S. (Courses; People > Faculty, Staff) cast as a graph-labeling problem]
Relaxation Labeling for Taxonomy Matching

Must define
– the neighborhood of a node
– k features of the neighborhood
– how to combine the influence of the features: P(N = L | f1, f2, ..., fk)

Algorithm
– init: for each pair <N, L>, compute P(N = L | Δ)
– loop: for each pair <N, L>, re-compute
  P(N = L | Δ) = Σ_M P(N = L | M, Δ) · P(M | Δ)
  where Δ is the observed evidence and M ranges over neighborhood configurations,
  e.g., Staff = People; Acad. Staff = Faculty; Tech. Staff = Staff
Relaxation Labeling for Taxonomy Matching (cont.)

Huge number of neighborhood configurations!
– typically the neighborhood = immediate nodes
– here the neighborhood can be the entire graph:
  100 nodes, 10 labels => 10^100 configurations

Solution
– label abstraction + dynamic programming
– guarantees quadratic time for a broad range of domain constraints

Empirical evaluation
– GLUE system [Doan et al. WWW-02]
– three real-world domains
– 30 - 300 nodes per taxonomy
– high accuracy: 66 - 97%, vs. 52 - 83% for the best base learner
– relaxation labeling very fast, finished in several seconds