Information Integration and Source Wrapping
Jose Luis Ambite, USC/ISI

Outline
• Information Integration
  – Definition
  – Architectures
  – Domain Models
  – Source wrapping
• Application to EIA
  – Example of Multi-State Data re-organization
    • Wrapping
    • Modelling

Information Integration: Single Interface to Multiple Sources
[Figure labels: Decision Support, Application Programs, Information Agent, Databases, Knowledge Bases, The Web, Computer Programs]

Information Integration
The problem of providing
• uniform (sources transparent to user)
• access to (query, and eventually updates too)
• multiple (even 2 is hard!)
• autonomous (not affect the behavior of sources)
• heterogeneous (different data models, schemas)
• structured (and semistructured)
• data sources (not only databases, web sources, …)

Information Integration in SIMS
To enable query access SIMS needs to:
• address semantic heterogeneity:
  => describe sources in common domain model
• address syntactic (format) heterogeneity:
  => standardize access to sources:
  – Structured (DBMS): Oracle, MS Access, …
  – Semistructured: wrappers for HTML, text, PDF

Domain Model (for Time Series)
[Figure: domain model graph. Node labels: Time Series, Measurement, Point of Sale, Unit, Period, Week, Month, Year, Product, Gasoline, G. Premium, G. Leaded, G. Regular, G. Unleaded, G. Premium Unleaded, Area, USA, CA, NY, Date, Value, Footnote, Tag, Text, Quality, Price, Volume, CPI, PPI. Edge legend: Subclass, Part-of, General Relation, Source Mapping]

Integration Architectures
• Materialized: Data Warehouse
• Virtual: Mediator
[Figures from [Levy2000]]

Wrappers
• provide uniform mechanism for extracting data from semi-structured sources (HTML, text, …)
• transform semi-structured sources into structured

Restaurants in Santa Monica? => Wrapper =>
  Name              Address
  Chinois on Main   2709 Main St.
  Chao Dara         13 Union Sq.
  …                 …
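
As an illustration of the Wrappers slide, here is a minimal sketch of a wrapper that turns one semi-structured page into a structured record. It borrows the landmark idea behind the SkipTo rules shown in the additional slides [Muslea et al 1999], but simplifies it to plain substring search. The page layout in PAGE, the RULES table, and the function names skip_to, extract and wrap are all assumptions made for this example, not part of the system described in the talk.

```python
from typing import Dict, List, Optional, Tuple


def skip_to(page: str, pos: int, landmark: str) -> Optional[int]:
    """Return the index just past the next occurrence of `landmark`
    at or after `pos`, or None if the landmark does not occur."""
    idx = page.find(landmark, pos)
    return None if idx == -1 else idx + len(landmark)


def extract(page: str, start_rule: List[str], end_landmark: str) -> Optional[str]:
    """Apply a start rule (a sequence of landmarks) and then cut the
    field at the first `end_landmark` that follows."""
    pos = 0
    for landmark in start_rule:
        nxt = skip_to(page, pos, landmark)
        if nxt is None:
            return None  # rule does not match this page
        pos = nxt
    end = page.find(end_landmark, pos)
    if end == -1:
        return None
    return page[pos:end].strip()


# Hypothetical page layout (assumption): the slides do not show the HTML
# behind the restaurant table, only the extracted Name/Address values.
PAGE = "<b>Name:</b>Chinois on Main<br><b>Address:</b>2709 Main St.<br>"

# One (start rule, end landmark) pair per attribute of the target relation.
RULES: Dict[str, Tuple[List[str], str]] = {
    "name": (["Name:", "</b>"], "<br>"),
    "address": (["Address:", "</b>"], "<br>"),
}


def wrap(page: str) -> Dict[str, Optional[str]]:
    """Turn one semi-structured page into one structured record."""
    return {field: extract(page, start, end) for field, (start, end) in RULES.items()}


if __name__ == "__main__":
    print(wrap(PAGE))
```

A rule that fails to match returns None instead of raising, so a caller can tell a changed page layout apart from an empty field. Real landmark rules operate on tokens and wildcards rather than raw substrings; this sketch leaves that out.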

Wrapper Building Tools
• Creating wrappers (semi-)automatically:
  – Demonstration-oriented user interface enables users to show the system what to extract by example [Muslea99]
  – System automatically induces extraction rules
  – Common extraction engine
• Benefits:
  – Rapid wrapper creation
  – Simplified wrapper maintenance
• Fetch.com
  – Start-up that commercializes the technology

Example: EIA Multi-State Data

EIA Multi-State Data: Multiple formats

EIA Multi-State Data: Table 31
Text source:
• Formatted text
• Tables contain national, regional (PADDs), and state data
  => extract state data
• Tables contain different measurements

EIA Multi-State: Wrapper Creation
1. Mark up a few examples and assign meaning (map to attributes from the domain model), e.g. date, value
2. System induces extraction rules

EIA Multi-State: Metadata Extraction
• Extract
  - data
  - metadata
• Associate with domain model

Extracted Data + Metadata
EIA-T31-1: Regular Gasoline Prices in Maine, Sales to end users through retail outlets, Measured in cents per gallon

Domain Model
[Figure: the time-series domain model with source EIA-T31-1 (EIA Multi-state Wrapper) mapped into it. Node labels: Time Series, Measurement, EIA-T31-1, EIA Multi-state Wrapper, Point of Sale, Retail outlets, Unit, Cents/gallon, Period, Week, Month, Year, Product, Gasoline, G. Premium, G. Leaded, G. Regular, G. Unleaded, G. Premium Unleaded, Area, USA, CA, Maine, Date, Value, Footnote, Tag, Text, Quality, Price, Volume. Edge legend: Subclass, Part-of, General Relation, Source Mapping]

Additional slides

Example of Extraction Rule [Muslea et al 1999]
RULE = sequence of landmarks (e.g., Cuisine : )
  Page:  <b>Name:</b>Chinois on Main<b> Cuisine :<p> </b> Pacific New Wave <br>
  Start: SkipTo(Cuisine :) SkipTo(</b>)
  End:   SkipTo(<br>)

Example of Rule Induction [Muslea et al 1999]
Training Examples:
  <p>Cuisine:<p><b>Thai</b><p>Review:<p> <b> Good
  <p>Review:<p><br><p> <b> Excellent
Candidate rules:
  SkipTo( <b> )
  SkipTo(<p> <b>)
  ...
  SkipTo( : ) SkipTo(<b>)
  ...
  SkipTo(<p>) SkipTo(<b>)
  SkipTo( Review :) SkipTo( <b> )
  ...

Mediator Architecture
• User queries in global (mediator) schema
• Mediator translates and decomposes user query into multiple source queries
[Figure from [Levy2000]]

System Architecture
Components:
• Integrated ontology
  - global terminology
  - source descriptions
  - integration axioms
• Domain modeling
  - DB analysis
  - text analysis
• Query processor
  - reformulation
  - cost optimization
• User Interface
  - ontology browser
  - query constructor
• Sources: R, S, T
Phases:
• Construction phase:
  - Deploy DBs
  - Extend ontol.
• User phase:
  - Compose query
• Access phase:
  - Create DB query
  - Retrieve data
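
The Mediator Architecture slide says the mediator translates and decomposes a user query over the global schema into multiple source queries. The sketch below shows one very simplified way to do that, assuming each source is described by the attribute values it is fixed to and by a mapping from global attributes to its own column names. The classes SourceDescription and SourceQuery, the function decompose, and the sources "EIA-T31-2" and "prices_db" are hypothetical and exist only for this example. EIA-T31-1 is described as in the Extracted Data + Metadata slide. This is not the SIMS reformulation algorithm, which also uses integration axioms and cost optimization.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SourceDescription:
    name: str
    covers: Dict[str, str]   # global attribute -> the single value this source holds
    mapping: Dict[str, str]  # global attribute -> column name inside the source


@dataclass
class SourceQuery:
    source: str
    columns: List[str]        # source columns to retrieve
    filters: Dict[str, str]   # source column -> required value


def decompose(select: List[str], where: Dict[str, str],
              sources: List[SourceDescription]) -> List[SourceQuery]:
    """Translate a global query (select + equality constraints) into
    one query per source that can contribute answers."""
    plan: List[SourceQuery] = []
    for src in sources:
        # Skip sources whose fixed coverage contradicts a query constraint.
        if any(a in src.covers and src.covers[a] != v for a, v in where.items()):
            continue
        # Constraints not implied by the coverage must be evaluated at the source.
        pending = {a: v for a, v in where.items() if a not in src.covers}
        if not all(a in src.mapping for a in list(select) + list(pending)):
            continue  # source lacks a needed attribute
        plan.append(SourceQuery(
            source=src.name,
            columns=[src.mapping[a] for a in select],
            filters={src.mapping[a]: v for a, v in pending.items()},
        ))
    return plan


SOURCES = [
    # Description taken from the slide: regular gasoline prices in Maine.
    SourceDescription("EIA-T31-1",
                      covers={"product": "Regular Gasoline", "area": "Maine"},
                      mapping={"date": "date", "value": "value"}),
    # Hypothetical sibling table for another state (assumption).
    SourceDescription("EIA-T31-2",
                      covers={"product": "Regular Gasoline", "area": "New York"},
                      mapping={"date": "date", "value": "value"}),
    # Hypothetical relational source with its own column names (assumption).
    SourceDescription("prices_db",
                      covers={},
                      mapping={"product": "prod_name", "area": "state",
                               "date": "dt", "value": "price"}),
]

if __name__ == "__main__":
    query_where = {"product": "Regular Gasoline", "area": "Maine"}
    for q in decompose(["date", "value"], query_where, SOURCES):
        print(q)
```

The wrapped text source needs no filters because its description already fixes product and area, while the relational source receives the constraints translated into its own column names.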