Constraint-based
Information Integration
Steven Minton
Fetch Technologies
Joint work with Craig Knoblock and
Jose Luis Ambite (USC/ISI)
Example Application
Geocoder
Tiger Map Server
Integration System
LA County Restaurant Health Ratings
Zagat Restaurants Guide
Outline

Agents that access information sources
on the web



AgentBuilder – learning from examples
ActiveAtlas -- standardizing data from
multiple sources
Constraint-based Integration

Heracles – putting it all together
Information Agents
Decision Support
Application Programs
Information Agent
Databases
Knowledge Bases
The Web
Computer Programs
Web Agents

Web agents provide a uniform query language for
data access: “Wrapping a web site”
Restaurants in Santa Monica?
Name               Address
Chinois on Main    2709 Main St.
Chao Dara          13 Union Sq.
…                  ...
AgentBuilder



Supervised learning: Extraction rules
created from examples
High precision
High reliability
Extraction technology

Expressive extraction rule language:


Extraction rule = sequence of landmarks
Describes how to find the beginning and
end of each field
PAGE:
<html> Name:<b> KFC </b> Cuisine :<p> <b> Fast Food </b> <br>...
Start: SkipTo(Cuisine :) SkipTo(<b>)
End: SkipTo(</b>)
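A minimal sketch of how landmark rules like these could be applied, assuming plain substring search and an end landmark that is searched forward from the start of the field. The helper names (skip_to, apply_rule, extract) are hypothetical and are not AgentBuilder's API.

```python
def skip_to(text, pos, landmark):
    """Return the index just past the first occurrence of landmark
    at or after pos, or -1 if the landmark is not found."""
    idx = text.find(landmark, pos)
    return -1 if idx < 0 else idx + len(landmark)


def apply_rule(text, landmarks, start=0):
    """Apply a sequence of SkipTo landmarks; return the final position or -1."""
    pos = start
    for landmark in landmarks:
        pos = skip_to(text, pos, landmark)
        if pos < 0:
            return -1
    return pos


def extract(text, start_rule, end_landmark):
    """Extract the field between the start rule and the end landmark.
    Assumption: the end landmark is searched forward from the field start."""
    begin = apply_rule(text, start_rule)
    if begin < 0:
        return None
    end = text.find(end_landmark, begin)
    if end < 0:
        return None
    return text[begin:end].strip()


# The page from the slide, with the trailing "..." omitted.
PAGE = "<html> Name:<b> KFC </b> Cuisine :<p> <b> Fast Food </b> <br>"

# Start: SkipTo(Cuisine :) SkipTo(<b>)   End: SkipTo(</b>)
cuisine = extract(PAGE, ["Cuisine :", "<b>"], "</b>")
print(cuisine)
```

The same helpers can pull out the Name field with a different start rule, e.g. extract(PAGE, ["Name:", "<b>"], "</b>").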
A Sequential Covering Algorithm
for “Wrapper Induction”
Training Examples:
Name: Del Taco <p> Phone (toll free) : <b> ( 800 ) 123-4567 </b><p>Cuisine ...
Name: Burger King <p> Phone : ( 310 ) 987-9876 <p> Cuisine: …
Initial candidate:
SkipTo( ( )
Refined candidates:
SkipTo( <b> ( )
…
SkipTo(Phone) SkipTo( ( )
...
SkipTo(:) SkipTo( ( )
Further refinement:
SkipTo(Phone) SkipTo(:) SkipTo( ( )
...
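A rough sketch of the evaluation step only: each candidate rule is scored by how many training examples it covers, i.e. whether it stops right before the area code. This is not the full sequential covering search. Substring matching and the names EXAMPLES, CANDIDATES, covers and score are assumptions made for illustration.

```python
# Training examples from the slide, paired with the area code that should
# follow the position where a correct start rule stops.
EXAMPLES = [
    ("Name: Del Taco <p> Phone (toll free) : <b> ( 800 ) 123-4567 </b><p>Cuisine", "800"),
    ("Name: Burger King <p> Phone : ( 310 ) 987-9876 <p> Cuisine:", "310"),
]

# Candidate rules from the slide, each a list of SkipTo landmarks.
CANDIDATES = [
    ["("],
    ["<b> ("],
    ["Phone", "("],
    [":", "("],
    ["Phone", ":", "("],
]


def apply_rule(text, landmarks):
    """Return the position just past the last landmark, or -1 on failure."""
    pos = 0
    for landmark in landmarks:
        idx = text.find(landmark, pos)
        if idx < 0:
            return -1
        pos = idx + len(landmark)
    return pos


def covers(rule, text, expected):
    """A rule covers an example if the text after it starts with the field."""
    pos = apply_rule(text, rule)
    return pos >= 0 and text[pos:].lstrip().startswith(expected)


def score(rule):
    """Number of training examples the rule extracts correctly."""
    return sum(covers(rule, text, expected) for text, expected in EXAMPLES)


perfect = [rule for rule in CANDIDATES if score(rule) == len(EXAMPLES)]
print(perfect)
```

Under these assumptions only the three-landmark rule covers both examples; each shorter candidate is misled by "(toll free)" or by the missing <b> tag.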
The Problem:
Multi-Source Inconsistency
Zagat’s Restaurant Guide
Art’s Deli
California Pizza Kitchen
Campanile
Citrus
Grill, The
Philippe The Original
Spago
Health Dept Restaurant Listings
Art’s Delicatessen
Ca’ Brea
CPK
The Grill
Patina
Philippe’s The Original
The Tillerman
How can the same objects be identified
when they are stored in inconsistent text formats?
The Solution: Record Linkage
Zagat’s Restaurants
Name                  Street                                    Phone
Art’s Deli            12224 Ventura Boulevard                   818-756-4124
Teresa's              80 Montague St.                           718-520-2910
Steakhouse The        128 Fremont St.                           702-382-1600
Les Celebrites        155 W. 58th St.                           212-484-5113
Dept. of Health
Name                  Street                                    Phone
Art’s Delicatessen    12224 Ventura Blvd.                       818/755-4100
Teresa's              103 1st Ave. between 6th and 7th Sts.     212/228-0604
Binion's Coffee Shop  128 Fremont St.                           702/382-1600
Les Celebrites        160 Central Park S                        212/484-5113
Query
Record Linkage
Zagat’s Agent
Dept. of Health Agent
Source          Name                   Street                                   Phone
Zagat’s         Art’s Deli             12224 Ventura Boulevard                  818-756-4124
Dept of Health  Art’s Delicatessen     12224 Ventura Blvd.                      818/755-4100

Zagat’s         Teresa’s               80 Montague St.                          718-520-2910
Dept of Health  Teresa’s               103 1st Ave. between 6th and 7th Sts.    212/228-0604

Zagat’s         Steakhouse The         128 Fremont St.                          702-382-1600
Dept of Health  Binion’s Coffee Shop   128 Fremont St.                          702/382-1600

Zagat’s         Les Celebrites         155 W. 58th St.                          212-484-5113
Dept of Health  Les Celebrites         5432 Sunset Blvd                         212/484-5113
Approach to Record Linkage

Learning attribute weighting rules
Source          Name                 Street                    Phone
Zagat’s         Art’s Deli           12224 Ventura Boulevard   818-756-4124
Dept of Health  Art’s Delicatessen   12224 Ventura Blvd.       818/756-4124

Learning general transformation rules
Zagat’s                    Transformation Rule   Dept of Health
Art’s Deli                 Abbreviation          Art’s Delicatessen
California Pizza Kitchen   Acronym               CPK
Philippe The Original      Stemming              Philippe’s The Original
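A small sketch of what detectors for the three transformation types might look like. The definitions are simplified assumptions (prefix match for abbreviation, word initials for acronym, suffix stripping for stemming) and the function names are hypothetical. They are not Active Atlas's actual transformations.

```python
def is_acronym(short, full):
    """True if short is formed from the initial letters of the words in full."""
    words = full.split()
    initials = "".join(word[0] for word in words)
    return len(words) > 1 and short.upper() == initials.upper()


def is_abbreviation(short, full):
    """True if short is a proper prefix of full (case-insensitive)."""
    return len(short) < len(full) and full.lower().startswith(short.lower())


def stem(word):
    """Very crude stemmer: strip a possessive or plural suffix."""
    w = word.lower()
    for suffix in ("’s", "'s", "s"):
        if w.endswith(suffix):
            return w[: -len(suffix)]
    return w


def token_transformation(a, b):
    """Name the transformation relating two tokens, or None if none applies."""
    if a.lower() == b.lower():
        return "Equal"
    if stem(a) == stem(b):
        return "Stemming"
    if is_abbreviation(a, b) or is_abbreviation(b, a):
        return "Abbreviation"
    return None


print(token_transformation("Deli", "Delicatessen"))
print(token_transformation("Philippe", "Philippe’s"))
print(is_acronym("CPK", "California Pizza Kitchen"))
```

Stemming is checked before abbreviation because "Philippe" is also a prefix of "Philippe’s"; the order is a design choice of this sketch.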
Active Learning to
Determine Matched Records
[Tejada, Knoblock, Minton ’01,’02]

Learn importance of attributes for matching records
Source          Name                 Street                    Phone
Zagat’s         Art’s Deli           12224 Ventura Boulevard   818-756-4124
Dept of Health  Art’s Delicatessen   12224 Ventura Blvd.       818/755-4100
Mapping rules:
Name > .9 & Street > .87 => mapped
Name > .95 & Phone > .96 => mapped
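The two mapping rules can be read as a disjunction of threshold tests over per-attribute similarity scores. The sketch below assumes the scores have already been computed by some similarity measure and lie in [0, 1]; the names MAPPING_RULES and is_mapped, and the example scores, are hypothetical.

```python
# Each rule maps attribute name -> similarity threshold; a pair of records
# is "mapped" if every threshold of at least one rule is exceeded.
MAPPING_RULES = [
    {"Name": 0.90, "Street": 0.87},
    {"Name": 0.95, "Phone": 0.96},
]


def is_mapped(scores, rules=MAPPING_RULES):
    """scores: attribute name -> similarity in [0, 1] for one record pair."""
    return any(
        all(scores.get(attr, 0.0) > threshold for attr, threshold in rule.items())
        for rule in rules
    )


# Hypothetical scores: names and streets agree closely, phones differ.
print(is_mapped({"Name": 0.92, "Street": 0.90, "Phone": 0.50}))
```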
Active Atlas
Mapping Rule Learner
[Figure: learning loop. Choose initial examples; USER labels them; generate committee of learners; each learner runs Learn Rules, Classify Examples, and Votes; Choose Example; USER labels it; output is the Set of Mapped Objects.]
Committee Disagreement

Chooses an example based on the
disagreement of the query committee
Examples                          Committee
                                  M1    M2    M3
Art’s Deli, Art’s Delicatessen    Yes   Yes   Yes
CPK, California Pizza Kitchen     Yes   No    Yes
Ca’Brea, La Brea Bakery           No    No    No
CPK, California Pizza Kitchen is the most
informative example
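One simple way to pick the example the committee disagrees on most is to count minority votes. The sketch assumes the vote table is read with one column per committee member (M1, M2, M3), which is an interpretation of the slide layout, and the disagreement measure itself is an illustrative choice rather than the measure used in Active Atlas.

```python
# Votes of committee members M1, M2, M3 for each candidate pair
# (assumed reading of the slide's table).
VOTES = {
    ("Art's Deli", "Art's Delicatessen"): ["Yes", "Yes", "Yes"],
    ("CPK", "California Pizza Kitchen"): ["Yes", "No", "Yes"],
    ("Ca'Brea", "La Brea Bakery"): ["No", "No", "No"],
}


def disagreement(votes):
    """Size of the minority: 0 when the committee is unanimous."""
    yes = votes.count("Yes")
    return min(yes, len(votes) - yes)


def most_informative(vote_table):
    """Return the pair with the highest committee disagreement."""
    return max(vote_table, key=lambda pair: disagreement(vote_table[pair]))


print(most_informative(VOTES))
```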
Constraint-based Integration


Integrating data from multiple sources
often involves reasoning about the
information
Constraints provide an approach to
expressing relationships and filtering
data
Heracles



Framework for building integrated
applications
Interleaves planning and information
gathering
Uses a constraint reasoner to decide what
sources to query and to integrate the
results
The Travel Assistant
Dynamically Updates Slots as
Information Becomes Available
Supports Informed Choices
Changes Propagate Throughout
User Can Specify
High-Level Preferences
Constraint Networks for
Managing Information

Constraint reasoning system






Propagates information
Decides when to launch information requests
Evaluates constraints
Computes preferences
All run as asynchronous processes to support the
user
Components:




Representation of the variables
Representation of constraints
Hierarchical templates
Constraint propagation
Constraint Networks for
Integrating Information

Components:




Representation of the variables
Representation of constraints
Hierarchical template representation
Constraint propagation and cycle detection
Constraint Variables

Constraint network consists of a set of
variables such as:



MeetingStartTime
MeetingLocation
Variables are related by constraints that
determine the possible values of a
solution
Constraint Representation

Constraints are computable components:
Local calculations (e.g., XQuery)
  MeetingStartTime + MeetingDuration --> MeetingEndTime
Web and Database Wrappers
  ITN: DepartureAirport, ArrivalAirport, Date --> Flights
  Yahoo Weather: City, Date --> Weather prediction
External Programs (Outlook, Planners, etc.)
  Outlook Calendar: Date --> Meetings
Results cached in tables
Drive or Take a Taxi?
Variables and values:
  OriginAddress
  DestinationAddress
  DepartureDate: Sep 30, 2000
  ReturnDate: Oct 2, 2000
  Duration: 3 days
  DepartureAirport: LAX
  Distance: 15.1 miles
  ParkingRate: $7.00/day
  ParkingTotal: $21.00
  TaxiFare: $23.00
  ModeToAirport: Drive
Constraints:
  FindClosestAirport, GetDistance, computeDuration, getParkingRate,
  multiply, GetTaxiFare, SelectModeToAirport
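The arithmetic behind this example can be sketched as a chain of small functions. Counting both the departure and return dates (so that Sep 30 to Oct 2 gives 3 days) and choosing the cheaper option are assumptions that reproduce the slide's numbers. The parking rate and taxi fare are taken as given rather than fetched from a source.

```python
from datetime import date


def compute_duration(departure, return_date):
    """Trip length in days, counting both end dates (assumed convention)."""
    return (return_date - departure).days + 1


def multiply(rate_per_day, days):
    """ParkingTotal = ParkingRate * Duration."""
    return rate_per_day * days


def select_mode_to_airport(parking_total, taxi_fare):
    """Pick the cheaper way to reach the airport (assumed selection rule)."""
    return "Drive" if parking_total <= taxi_fare else "Taxi"


duration = compute_duration(date(2000, 9, 30), date(2000, 10, 2))
parking_total = multiply(7.00, duration)   # $7.00/day from the slide
taxi_fare = 23.00                          # value shown on the slide
mode = select_mode_to_airport(parking_total, taxi_fare)
print(duration, parking_total, mode)
```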
Hierarchically-Partitioned
Constraint Networks

Template:



Groups related variables and constraints
Organizes information for computation and
presentation to user
Templates organized hierarchically


Template decomposed into subtemplates
Choose among alternative subtemplates
Template Structure
Template
  Arguments: input and output variables
  Variables: name, type, default values
  Constraints
  Expansions: alternative subtemplate calls
  GUI specification
Partitioned Constraint Network
Who
Company
Subject
Dest Weather
Dest. Addr.
OriginWeather
Origin Addr.
Starting Time
Distance
Ending Time
Travel Mode
Depart Time
Depart Airport
Dist. toAirport
Arrival Time
Parking Lot
Taxi Fare
Parking Rate
Mode toAirport
Flight Num
Arrival Airport
Template Hierarchy for the
Travel Assistant
Trip (AND)
  1. ModeToDestination (OR): Drive | Fly | Taxi
  2. ModeHotel (OR): Hotel | NoOvernight
  3. ModeNext (OR): Trip (Return Home) | Trip (Return Office) | Trip (New Leg) | End Trip
Fly (AND)
  1. ModeToAirport (OR): Drive | Taxi
  2. FlightDetail
  3. ModeFromAirport (OR): Drive | Taxi
Dynamic Networks
Generalization of Constraint Networks
  Variables can be active or inactive
  Normal constraints:
    x1 = k1 ^ … ^ xm = km --> xn = kn
  Activity constraints:
    x1 = k1 ^ … ^ xm = km --> active(xn)
  Inactive variables do not participate in the
  network, i.e., do not propagate constraints
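A minimal sketch of activity constraints, assuming each one is a pair (conditions, target) meaning "if every listed variable is active and has the listed value, then target becomes active". The variable names are borrowed from the travel example, but the specific constraints and the function name active_variables are hypothetical.

```python
def active_variables(assignment, core, activity_constraints):
    """Compute the set of active variables.

    assignment: variable -> value
    core: variables that are always active
    activity_constraints: list of (conditions, target) pairs
    """
    active = set(core)
    changed = True
    while changed:  # repeat until no constraint activates anything new
        changed = False
        for conditions, target in activity_constraints:
            if target in active:
                continue
            # Inactive variables do not participate, so they cannot fire a rule.
            if all(var in active and assignment.get(var) == val
                   for var, val in conditions.items()):
                active.add(target)
                changed = True
    return active


ACTIVITY = [
    ({"ModeToDestination": "Fly"}, "ModeToAirport"),
    ({"ModeToAirport": "Drive"}, "ParkingRate"),
    ({"ModeToAirport": "Taxi"}, "TaxiFare"),
]

flying = active_variables(
    {"ModeToDestination": "Fly", "ModeToAirport": "Drive"},
    core={"ModeToDestination"},
    activity_constraints=ACTIVITY,
)
print(sorted(flying))
```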
Heracles: Template Selection

Core network



Computes values of template selection vars
Always active
Template selection variables

Inputs to activity constraints: determine
the choice of subtemplates, i.e., which
additional variables are active
Constraint Propagation

Approach
  When a variable is assigned a value, re-compute the value
  sets and assigned values of all dependent variables
  Proceeds recursively until no values are changed or a cycle
  is detected
Core network
  Propagates all variables through the core network
  Remaining variables are computed when a template is
  opened
Does not solve a full CSP
  Less costly
  Does not require all information in advance
  Makes choices locally, so may fail to find optimal assignment
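The propagation approach can be sketched with functional constraints of the form inputs --> output. This is a simplified illustration that tracks single assigned values rather than value sets, and the Network class and its methods are hypothetical rather than the Heracles implementation.

```python
class Network:
    """Tiny constraint network with recursive forward propagation."""

    def __init__(self):
        self.values = {}        # variable -> assigned value
        self.constraints = []   # (inputs, output, function)

    def add_constraint(self, inputs, output, fn):
        self.constraints.append((tuple(inputs), output, fn))

    def assign(self, var, value, _path=()):
        # Stop when the value does not change.
        if var in self.values and self.values[var] == value:
            return
        # A changed value for a variable already on the current
        # propagation path means the constraints form a cycle.
        if var in _path:
            raise ValueError(f"cycle detected at {var}")
        self.values[var] = value
        # Re-compute every dependent variable whose inputs are all known.
        for inputs, output, fn in self.constraints:
            if var in inputs and all(i in self.values for i in inputs):
                result = fn(*(self.values[i] for i in inputs))
                self.assign(output, result, _path + (var,))


net = Network()
net.add_constraint(["MeetingStartTime", "MeetingDuration"], "MeetingEndTime",
                   lambda start, duration: start + duration)
net.assign("MeetingStartTime", 9)    # hours, hypothetical values
net.assign("MeetingDuration", 2)
print(net.values["MeetingEndTime"])
net.assign("MeetingStartTime", 10)   # the change propagates to the end time
print(net.values["MeetingEndTime"])
```

Because each assignment only triggers local re-computation, the sketch never searches over alternative assignments, which mirrors the trade-off noted above: cheaper than full CSP solving, but with no guarantee of an optimal assignment.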
Discussion

General framework for interleaving planning
and information gathering





Retrieves information as needed
Gathers and integrates data in a uniform
framework
Evaluates tradeoffs and selects among alternatives
Allows the users to explore alternatives
Supports a wide variety of information types:
databases, web pages, images, video, etc.
SmartClients [Torrens et al, 2002]



Cast an integration problem as a Constraint
Satisfaction Problem (CSP)
Given a request, the server retrieves the
required data and sends the data and the
CSP to the client
Client solves the CSP locally


Large, complex problem transmitted in a small
amount of space
Provides fine-grained user interaction with the
data
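To illustrate the idea of a client solving a small CSP locally, here is a brute-force sketch. The flight and hotel data, the constraints and the solver are all hypothetical and stand in for whatever the server would send. A real CSP solver would use propagation and search rather than full enumeration.

```python
from itertools import product

# Hypothetical data shipped by the server along with the CSP.
ARRIVAL_HOUR = {"F1": 10, "F2": 15, "F3": 9}
FLIGHT_PRICE = {"F1": 300, "F2": 250, "F3": 400}
HOTEL_PRICE = {"H1": 120, "H2": 200}

domains = {"flight": ["F1", "F2", "F3"], "hotel": ["H1", "H2"]}

constraints = [
    # Arrive by noon.
    lambda a: ARRIVAL_HOUR[a["flight"]] <= 12,
    # Stay within a total budget of 450.
    lambda a: FLIGHT_PRICE[a["flight"]] + HOTEL_PRICE[a["hotel"]] <= 450,
]


def solve_csp(domains, constraints):
    """Enumerate all assignments and keep those satisfying every constraint."""
    names = list(domains)
    solutions = []
    for combo in product(*(domains[name] for name in names)):
        assignment = dict(zip(names, combo))
        if all(check(assignment) for check in constraints):
            solutions.append(assignment)
    return solutions


solutions = solve_csp(domains, constraints)
print(solutions)
```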
Architecture for SmartClients
SmartClients: Pros and Cons

Pros



Elegant approach that exploits past work on CSPs
Minimizes the data retrieval and supports complex
reasoning and integration of the data
Cons


Assumes that all data can be retrieved before any
reasoning about the data
In travel planning, assumes that prices are the
same on any date and there are no issues with
flight availability
Summary

Our approach for creating “web
assistants”:



Agents for accessing web data
Record linkage for mapping
between sources
Constraint-based integration
provides the glue