Constraint-based Information Integration
Steven Minton
Fetch Technologies
Joint work with Craig Knoblock and Jose Luis Ambite (USC/ISI)

Example Application
  [Diagram labels: Integration System; Geocoder; Tiger Map Server; LA County Restaurant Health Ratings; Zagat Restaurants Guide]

Outline
  Agents that access information sources on the web
  AgentBuilder -- learning from examples
  ActiveAtlas -- standardizing data from multiple sources
  Constraint-based Integration
  Heracles -- putting it all together

Information Agents
  [Diagram labels: Decision Support; Application Programs; Information Agent; Databases; Knowledge Bases; The Web; Computer Programs]

Web Agents
  Web agents provide a uniform query language for data access: “Wrapping a web site”
  Query: Restaurants in Santa Monica?
    Name               Address
    Chinois on Main    2709 Main St.
    Chao Dara          13 Union Sq.
    …                  ...

AgentBuilder
  Supervised learning: extraction rules created from examples
  High precision
  High reliability

Extraction technology
  Expressive extraction rule language:
    Extraction rule = sequence of landmarks
    Describes how to find the beginning and end of each field
  PAGE:  <html> Name:<b> KFC </b> Cuisine :<p> <b> Fast Food </b> <br>...
  Start: SkipTo(Cuisine :) SkipTo(<b>)
  End:   SkipTo(</b>)

A Sequential Covering Algorithm for “Wrapper Induction”
  Training Examples:
    Name: Del Taco <p> Phone (toll free) : <b> ( 800 ) 123-4567 </b><p>Cuisine ...
    Name: Burger King <p> Phone : ( 310 ) 987-9876 <p> Cuisine: …
  Initial candidate:
    SkipTo( ( )
  Refined candidates:
    SkipTo( <b> ( )
    ...
    SkipTo(Phone) SkipTo( ( )
    ...
    SkipTo(:) SkipTo( ( )
  Refined further:
    SkipTo(Phone) SkipTo(:) SkipTo( ( )
    ...

The Problem: Multi-Source Inconsistency
  Zagat’s Restaurant Guide    Health Dept Restaurant Listings
  Art’s Deli                  Art’s Delicatessen
  California Pizza Kitchen    Ca’ Brea
  Campanile                   CPK
  Citrus                      The Grill
  Grill, The                  Patina
  Philippe The Original       Philippe’s The Original
  Spago                       The Tillerman
  How can the same objects be identified when they are stored in inconsistent text formats?

The Solution: Record Linkage
  Zagat’s Restaurants
    Name              Street                     Phone
    Art’s Deli        12224 Ventura Boulevard    818-756-4124
    Teresa's          80 Montague St.            718-520-2910
    Steakhouse The    128 Fremont St.            702-382-1600
    Les Celebrites    155 W. 58th St.            212-484-5113
  Dept. of Health
    Name                    Street                                   Phone
    Art’s Delicatessen      12224 Ventura Blvd.                      818/755-4100
    Teresa's                103 1st Ave. between 6th and 7th Sts.    212/228-0604
    Binion's Coffee Shop    128 Fremont St.                          702/382-1600
    Les Celebrites          160 Central Park S                       212/484-5113

  [Diagram labels: Query; Record Linkage; Zagat’s Agent; Dept. of Health Agent]
  Zagat’s                                                       Dept of Health
  Name              Street                     Phone            Name                    Street                                   Phone
  Art’s Deli        12224 Ventura Boulevard    818-756-4124     Art’s Delicatessen      12224 Ventura Blvd.                      818/755-4100
  Teresa’s          80 Montague St.            718-520-2910     Teresa’s                103 1st Ave. between 6th and 7th Sts.    212/228-0604
  Steakhouse The    128 Fremont St.            702-382-1600     Binion’s Coffee Shop    128 Fremont St.                          702/382-1600
  Les Celebrites    155 W. 58th St.            212-484-5113     Les Celebrites          5432 Sunset Blvd                         212/484-5113

Approach to Record Linkage
  Learning attribute weighting rules
                      Name                  Street                     Phone
    Zagat’s           Art’s Deli            12224 Ventura Boulevard    818-756-4124
    Dept of Health    Art’s Delicatessen    12224 Ventura Blvd.        818/756-4124
  Learning general transformation rules
    Zagat’s                     Transformation Rule    Dept of Health
    Art’s Deli                  Abbreviation           Art’s Delicatessen
    California Pizza Kitchen    Acronym                CPK
    Philippe The Original       Stemming               Philippe’s The Original

Active Learning to Determine Matched Records [Tejada, Knoblock, Minton ’01,’02]
  Learn importance of attributes for matching records
                      Name                  Street                     Phone
    Zagat’s           Art’s Deli            12224 Ventura Boulevard    818-756-4124
    Dept of Health    Art’s Delicatessen    12224 Ventura Blvd.        818/755-4100
  Mapping rules:
    Name > .9 & Street > .87 => mapped
    Name > .95 & Phone > .96 => mapped

Active Atlas Mapping Rule Learner
  [Diagram labels: Choose initial examples; Generate committee of learners; three learners, each with Learn Rules, Classify Examples, Votes; Choose Example; USER; Label; Set of Mapped Objects]

Committee Disagreement
  Chooses an example based on the disagreement of the query committee
    Examples                          M1     M2     M3
    Art’s Deli, Art’s Delicatessen    Yes    Yes    Yes
    CPK, California Pizza Kitchen     Yes    No     Yes
    Ca’Brea, La Brea Bakery           No     No     No
  CPK, California Pizza Kitchen is the most informative example, because it is the only one on which the committee members disagree

Constraint-based Integration
  Integrating data from multiple sources often involves reasoning about the information
  Constraints provide an approach to expressing relationships and filtering data

Heracles
  Framework for building integrated applications
  Interleaves planning and information gathering
  Uses a constraint reasoner to decide what sources to query and to integrate the results

Travel Assistant Dynamically Updates Slots as Information Becomes Available
  [Screenshots of the Travel Assistant interface]
  Supports Informed Choices
  Changes Propagate Throughout
  User Can Specify High-Level Preferences

Constraint Networks for Managing Information
  Constraint reasoning system
    Propagates information
    Decides when to launch information requests
    Evaluates constraints
    Computes preferences
  All run as asynchronous processes to support the user

Constraint Networks for Integrating Information
  Components:
    Representation of the variables
    Representation of constraints
    Hierarchical template representation
    Constraint propagation and cycle detection

Constraint Variables
  A constraint network consists of a set of variables such as:
    MeetingStartTime
    MeetingLocation
  Variables are related by constraints that determine the possible values of a solution

Constraint Representation
  Constraints are computable components:
    Local calculations (e.g., XQuery)
      MeetingStartTime + MeetingDuration --> MeetingEndTime
    Web and Database Wrappers
      ITN: DepartureAirport, ArrivalAirport, Date --> Flights
      Yahoo Weather: City, Date --> Weather prediction
    External Programs (Outlook, Planners, etc.)
      Outlook Calendar: Date --> Meetings
  Results cached in tables

Drive or Take a Taxi?
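
A minimal sketch of how the drive-or-taxi decision could be computed as a chain of constraints, using the values shown in the figure for this slide (Sep 30 to Oct 2, 2000; $7.00/day parking; $23.00 taxi fare). The Python function names mirror the constraint labels in the figure, but the wiring between them, the inclusive day count, and the cheaper-option rule are assumptions made for illustration. The two wrapper lookups (getParkingRate, GetTaxiFare) are replaced by constants.

```python
from datetime import date
from decimal import Decimal


def compute_duration(departure: date, return_date: date) -> int:
    # Assumption: days are counted inclusively, which reproduces the
    # figure's "3 days" for Sep 30 .. Oct 2.
    return (return_date - departure).days + 1


def multiply(rate_per_day: Decimal, days: int) -> Decimal:
    # ParkingRate * Duration --> ParkingTotal
    return rate_per_day * days


def select_mode_to_airport(parking_total: Decimal, taxi_fare: Decimal) -> str:
    # Assumed selection rule: take the cheaper option (ties go to driving).
    return "Drive" if parking_total <= taxi_fare else "Taxi"


# Values read off the figure; the constants stand in for wrapper calls.
departure_date = date(2000, 9, 30)
return_date = date(2000, 10, 2)
parking_rate = Decimal("7.00")   # stand-in for getParkingRate
taxi_fare = Decimal("23.00")     # stand-in for GetTaxiFare

duration = compute_duration(departure_date, return_date)
parking_total = multiply(parking_rate, duration)
mode_to_airport = select_mode_to_airport(parking_total, taxi_fare)

print(duration, parking_total, mode_to_airport)  # 3 21.00 Drive
```

With these inputs the sketch arrives at the same values as the figure: a 3-day duration, a $21.00 parking total, and Drive as the mode to the airport.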
  [Figure: constraint network for choosing how to get to the airport]
  Variables and values shown:
    OriginAddress, DestinationAddress
    DepartureDate = Sep 30, 2000
    ReturnDate = Oct 2, 2000
    Distance = 15.1 miles
    DepartureAirport = LAX
    Duration = 3 days
    ParkingRate = $7.00/day
    ParkingTotal = $21.00
    TaxiFare = $23.00
    ModeToAirport = Drive
  Constraints shown: GetDistance, FindClosestAirport, getParkingRate, GetTaxiFare, computeDuration, multiply, SelectModeToAirport

Hierarchically-Partitioned Constraint Networks
  Template:
    Groups related variables and constraints
    Organizes information for computation and presentation to user
  Templates organized hierarchically
    Template decomposed into subtemplates
    Choose among alternative subtemplates

Template Structure
  Template
    Arguments: input and output variables
    Variables: name, type, default values
    Constraints
    Expansions: alternative subtemplate calls
    GUI specification

Partitioned Constraint Network
  [Figure: variables grouped into partitions. Labels: Who; Company; Subject; Dest. Weather; Dest. Addr.; Origin Weather; Origin Addr.; Starting Time; Ending Time; Distance; Travel Mode; Depart Time; Depart Airport; Dist. toAirport; Arrival Time; Parking Lot; Taxi Fare; Parking Rate; Mode toAirport; Flight Num; Arrival Airport]

Template Hierarchy for the Travel Assistant
  [Figure: AND/OR tree, reconstructed from the diagram labels]
  Trip (AND)
    1 ModeToDestination (OR)
        Drive
        Fly (AND)
          1 ModeToAirport (OR): Drive, Taxi
          2 FlightDetail
          3 ModeFromAirport (OR): Drive, Taxi
        Taxi
    2 ModeHotel (OR)
        Hotel
        NoOvernight
    3 ModeNext (OR)
        Trip (Return Home)
        Trip (Return Office)
        Trip (New Leg)
        End

Dynamic Networks
  Generalization of Constraint Networks
  Variables can be active or inactive
  Normal constraints:
    x1 = k1 ^ … ^ xm = km => xn = kn
  Activity constraints:
    x1 = k1 ^ … ^ xm = km => active(xn)
  Inactive variables do not participate in the network, i.e., do not propagate constraints

Heracles: Template Selection
  Core network
    Computes values of template selection variables
    Always active
  Template selection variables
    Inputs to activity constraints: determine the choice of subtemplates, i.e., which additional variables are active

Constraint Propagation Approach
  Core network
    When a variable is assigned a value, re-compute the value sets and assigned values of all dependent variables
    Proceeds recursively until no values are changed or a cycle is detected
    Propagates all variables through the core network
  Remaining variables are computed when a template is opened
  Does not perform full CSP
    Less costly
    Does not require all information in advance
    Makes choices locally, so may fail to find optimal assignment

Discussion
  General framework for interleaving planning and information gathering
  Retrieves information as needed
  Gathers and integrates data in a uniform framework
  Evaluates tradeoffs and selects among alternatives
  Allows the users to explore alternatives
  Supports a wide variety of information types: databases, web pages, images, video, etc.
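
The propagation approach described for the core network (re-compute dependents when a variable is assigned, recurse until nothing changes or a cycle is detected) might look roughly like the following Python sketch. The `Network` class and its methods are hypothetical names, not Heracles code. The sketch tracks single assigned values only, not value sets, and represents times as minutes since midnight.

```python
from collections import defaultdict


class Network:
    """Hypothetical toy network: one value per variable, constraints as functions."""

    def __init__(self):
        self.values = {}
        self.dependents = defaultdict(list)  # input variable -> constraints reading it

    def add_constraint(self, inputs, output, fn):
        constraint = (tuple(inputs), output, fn)
        for name in inputs:
            self.dependents[name].append(constraint)

    def assign(self, name, value, _path=()):
        # Stop when the value is unchanged.
        if name in self.values and self.values[name] == value:
            return
        # A changed value for a variable already on the current chain is a cycle.
        if name in _path:
            raise ValueError("cycle detected: " + " -> ".join(_path + (name,)))
        self.values[name] = value
        for inputs, output, fn in self.dependents[name]:
            if all(i in self.values for i in inputs):  # wait until all inputs are known
                result = fn(*(self.values[i] for i in inputs))
                self.assign(output, result, _path + (name,))


net = Network()
net.add_constraint(["MeetingStartTime", "MeetingDuration"], "MeetingEndTime",
                   lambda start, duration: start + duration)
net.assign("MeetingDuration", 60)
net.assign("MeetingStartTime", 9 * 60)
print(net.values["MeetingEndTime"])  # 600, i.e. 10:00
net.assign("MeetingStartTime", 10 * 60)
print(net.values["MeetingEndTime"])  # 660, i.e. 11:00
```

Because each assignment only touches the variables that depend on it, nothing is solved globally. That matches the trade-off noted on the slide: cheaper than a full CSP, but with no guarantee of an optimal assignment.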

SmartClients [Torrens et al, 2002]
  Cast an integration problem as a Constraint Satisfaction Problem (CSP)
  Given a request, the server retrieves the required data and sends the data and the CSP to the client
  Client solves the CSP locally
    Large complex problem transmitted in a small amount of space
    Provides fine-grained user interaction with the data

Architecture for SmartClients

SmartClients: Pros and Cons
  Pros
    Elegant approach that exploits past work on CSPs
    Minimizes data retrieval and supports complex reasoning and integration of the data
  Cons
    Assumes that all data can be retrieved before any reasoning about the data
    In travel planning, assumes that prices are the same on any date and that there are no issues with flight availability

Summary
  Our approach for creating “web assistants”:
    Agents for accessing web data
    Record linkage for mapping between sources
    Constraint-based integration provides the glue
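
To make the SmartClients idea discussed above more concrete, here is a minimal Python sketch of a client solving a small CSP locally by plain backtracking. The payload layout, the `solve` function, and the flight and meeting data are all hypothetical. The slides do not describe the SmartClients data format or solver, so this only illustrates the general technique.

```python
def solve(domains, constraints):
    """Enumerate every assignment that satisfies all constraints (backtracking)."""
    names = list(domains)
    solutions = []

    def consistent(assignment):
        # Check only the constraints whose variables are all assigned so far.
        for scope, predicate in constraints:
            if all(v in assignment for v in scope):
                if not predicate(*(assignment[v] for v in scope)):
                    return False
        return True

    def extend(index, assignment):
        if index == len(names):
            solutions.append(dict(assignment))
            return
        name = names[index]
        for value in domains[name]:
            assignment[name] = value
            if consistent(assignment):
                extend(index + 1, assignment)
            del assignment[name]

    extend(0, {})
    return solutions


# Hypothetical payload: the retrieved data becomes the variable domains.
# Flights are (id, departure hour, arrival hour); meeting starts are hours.
domains = {
    "outbound": [("OUT1", 8, 11), ("OUT2", 13, 16)],
    "meeting_start": [12, 15],
}
constraints = [
    # Arrive at least one hour before the meeting starts.
    (("outbound", "meeting_start"), lambda flight, start: flight[2] + 1 <= start),
]

for solution in solve(domains, constraints):
    print(solution)
```

Note that the sketch shares the limitation listed under Cons: every domain value must already be in hand before the client can start solving.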