Information Integration 4/5

Information Integration on the Web
AAAI Tutorial (SA2), Monday, July 22nd, 2007, 9am–1pm
Rao Kambhampati & Craig Knoblock

Information Integration
• Combining information from multiple autonomous information sources
  – And answering queries using the combined information
• Many applications
  – WWW:
    • Comparison shopping
    • Portals integrating data from multiple sources
    • B2B, electronic marketplaces
    • Mashups, service composition
  – Science informatics
    • Integrating genomic data, geographic data, archaeological data, astrophysical data, etc.
  – Enterprise data integration
    • An average company has 49 different databases and spends 35% of its IT dollars on integration efforts

Deployed information integration systems
• Travel sites: Kayak, Expedia, etc.
• Google Base, DBpedia
• Map mashups
• Libra, CiteSeer, Google Squared, etc.

Deep Web as a Motivation for II
• The surface web of crawlable pages is only a part of the overall web.
• There are many databases on the web
  – How many? 25 million (or more)
  – Containing more than 80% of the accessible content
• Mediator-based information integration is needed.

[Figure: the web landscape — services, webpages, ~25M structured-data sources, sensors (streaming data)]

Blind Men & the Elephant: Differing Views on Information Integration
• Database view
  – Integration of autonomous structured data sources
  – Challenges: schema mapping, query reformulation, query processing
• Web service view
  – Combining/composing information provided by multiple web sources
  – Challenges: learning source descriptions, source mapping, record linkage, etc.
• IR/NLP view
  – Computing textual entailment from the information in disparate web/text sources
  – Challenges: information extraction

--Warehouse Model--
  – Get all the data and put it into a local DB
  – Support structured queries on the warehouse
  – Architecture: user queries answered by OLAP / decision support / data cubes / data mining over a relational database (the warehouse), which is fed by data extraction programs and data cleaning/scrubbing over the data sources
--Mediator Model--
  – Design a mediator; reformulate queries; access sources; collate results
  – Architecture: user queries are posed against a mediated schema; the mediator (reformulation engine, optimizer, execution engine, data source catalog) accesses the data sources through wrappers
  – Which data model should the mediator use?
--Search Model--
  – Materialize the pages, crawl & index them, and include them in search results

Dimensions of Variation
• Conceptualization of (and approaches to) information integration vary widely based on
  – Type of data sources being integrated: text, structured, images, etc.
  – Type of integration: vertical vs. horizontal vs. both
  – Level of up-front work: ad hoc vs. pre-orchestrated
  – Control over sources: cooperative sources vs. autonomous sources
  – Type of output: giving answers vs. giving pointers
  – Generality of solution: task-specific (mashups) vs. task-independent (mediator architectures)

Dimensions: Type of Data Sources
• Data sources can be
  – Structured (e.g. relational data)
    • Schema mapping (GAV/LAV)
    • Precise integration (could I have done better by going to the sources myself?)
  – Text oriented
    • Information extraction
  – Mixed
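The mediator model above can be sketched in a few lines of Python. This is a minimal illustration, not the tutorial's system: the sources, schemas, and mapping functions (`SOURCE_A`, `wrap_a`, etc.) are all hypothetical, and the "reformulation" is reduced to pushing one selection predicate to every wrapped source and collating the union of results.

```python
# Minimal GAV-style mediator sketch (all source names/schemas are illustrative).
# Mediated schema: book(title, price). Each source exports its own schema;
# a wrapper maps source tuples into the mediated schema.

SOURCE_A = [{"name": "Dune", "cost": 9}, {"name": "Emma", "cost": 7}]
SOURCE_B = [{"title": "Dune", "usd": 8}]

def wrap_a(rows):
    # Schema mapping for source A: name -> title, cost -> price
    return [{"title": r["name"], "price": r["cost"]} for r in rows]

def wrap_b(rows):
    # Schema mapping for source B: title -> title, usd -> price
    return [{"title": r["title"], "price": r["usd"]} for r in rows]

def mediator_query(predicate):
    """Answer a mediated-schema selection query: send it to every
    wrapped source, then collate (union) the results."""
    results = []
    for wrapped in (wrap_a(SOURCE_A), wrap_b(SOURCE_B)):
        results.extend(t for t in wrapped if predicate(t))
    return results

# "SELECT * FROM book WHERE price <= 8" against the mediated schema
cheap = mediator_query(lambda t: t["price"] <= 8)
```

A real mediator would also consult the data source catalog to decide which sources can answer the query at all, and would optimize the access order; here every source is queried unconditionally.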
Dimensions: Type of Output (Pointers vs. Answers)
• The cost-effective approach may depend on the quality guarantees we want to give.
• At one extreme, it is possible to take a "web search" perspective: provide potential answer pointers to keyword queries
  – Materialize the data records in the sources as HTML pages and add them to the index
  – Give it a sexy name: surfacing the deep web
• At the other, it is possible to take a "database/knowledge base" perspective
  – View the individual records in the data sources as assertions in a knowledge base and support inference over the entire knowledge
  – Extraction, alignment, etc. needed

Dimensions: Vertical vs. Horizontal
• Vertical: the sources being integrated all export the same type of information; the objective is to collate their results
  – E.g. meta-search engines, comparison shopping, bibliographic search, etc.
  – Challenges: handling overlap, duplicate detection, source selection
• Horizontal: the sources being integrated export different types of information
  – E.g. composed services, mashups
  – Challenges: handling "joins"
• Both…

Dimensions: Level of Up-front Work (Ad hoc vs. Pre-orchestrated)
• Fully query-time II (blue sky for now)
  – Get a query from the user on the mediator schema
  – Go "discover" relevant data sources
  – Figure out their "schemas"
  – Map the schemas onto the mediator schema
  – Reformulate the user query into data source queries
  – Optimize and execute the queries
  – Return the answers
• Fully pre-fixed II
  – Decide on the only query you want to support
  – Write a (java)script that supports the query by accessing specific (predetermined) sources, piping results (through known APIs) to specific other known sources
  – Examples include Google Map mashups
• The most interesting action is "in between"
  – E.g. we may start with known sources and their known schemas, do hand-mapping, and support automated reformulation and optimization

Dimensions: Control over Sources (Cooperative vs. Autonomous)
• Cooperative sources can (depending on their level of kindness)
  – Export meta-data (e.g. schema) information
  – Provide mappings between their meta-data and other ontologies
    • Could be done with Semantic Web standards…
  – Provide unrestricted access
  – Examples: distributed databases; sources following Semantic Web standards
• …for uncooperative sources, all this information has to be gathered by the mediator
  – Examples: most current integration scenarios on the web

[Figure courtesy Halevy et al.]

Collated Challenges
• Source selection
  – Which sources are likely to be most relevant to answer the query?
  – Source quality; source overlap
• Schema mapping
  – How to map the schemas of the sources to the mediator?
• Record linkage
  – How to couple data items that refer to the same entity?
• Information quality
  – Data with null values; inconsistent data
• Query reformulation
  – To convert a query from the mediator schema to the source schemas
  – To bring out relevant data despite incompleteness and inconsistency

[Figure: Rao's solution — services, webpages, ~25M structured-data sources, sensors (streaming data)]

Idea: Consider "endorsement" (a la A/H or PageRank). But there are no explicit links….
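Of the collated challenges above, record linkage is the one most easily made concrete. A common baseline is token-set similarity between record strings; the sketch below uses Jaccard similarity with an illustrative threshold (real linkers combine many per-field similarity measures and learn the decision boundary).

```python
# Toy record-linkage sketch: decide whether two listings refer to the
# same entity via Jaccard similarity over token sets. The threshold
# and the example records are illustrative, not from the tutorial.

def tokens(record):
    # Normalize case and punctuation, then split into a token set.
    return set(record.lower().replace(",", " ").split())

def jaccard(a, b):
    # |A ∩ B| / |A ∪ B| over two token sets.
    return len(a & b) / len(a | b)

def same_entity(r1, r2, threshold=0.5):
    return jaccard(tokens(r1), tokens(r2)) >= threshold

r1 = "BMW Z4 2003 convertible"
r2 = "2003 BMW Z4, Convertible"   # same car, different formatting
r3 = "Honda Civic 2004 sedan"     # different entity
```

Token-order invariance is what makes this work on the two BMW listings: both normalize to the same four-token set, so their similarity is 1.0, while the Honda listing shares no tokens with them.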
Case Study: Query Rewriting in QPIAD
Given a query Q: (Body style = Convt), retrieve all relevant tuples.

  Id  Make     Model    Year  Body
  1   Audi     A4       2001  Convt
  2   BMW      Z4       2002  Convt
  3   Porsche  Boxster  2005  Convt
  4   BMW      Z4       2003  Null
  5   Honda    Civic    2004  Null
  6   Toyota   Camry    2002  Sedan
  7   Audi     A4       2006  Null

• Base result set (the certain answers): tuples 1–3
• AFD: Model ~> Body style
• Rewritten queries: Q1′: Model=A4; Q2′: Model=Z4; Q3′: Model=Boxster
  – Re-order the queries based on estimated precision
• Ranked relevant uncertain answers:
  Id 4 (BMW Z4 2003, Body=Null), confidence 0.7
  Id 7 (Audi A4 2006, Body=Null), confidence 0.3
• We can select the top-K rewritten queries using the F-measure
  F-measure = (1+α)·P·R / (α·P + R)
  – P: estimated precision
  – R: estimated recall, based on P and estimated selectivity

Imprecise Queries
• Want a 'sedan' priced around $7000. A feasible precise query: Make="Toyota", Model="Camry", Price ≤ $7000
  – But what about the price of a Honda Accord? Is there a Camry for $7100?
• Solution: support imprecise queries
  – Example answers: Toyota Camry $7000 1999; Toyota Camry $7000 2001; Toyota Camry $6700 2000; Toyota Camry $6500 1998; …
• Query processing:
  – Map the imprecise query Q to a precise query Qpr by converting "like" to "=": Qpr = Map(Q)
  – Derive the base set Abs = Qpr(R)
  – Use the base set as a set of relaxable selection queries; using AFDs, find the relaxation order
  – Derive the extended set by executing the relaxed queries
  – Use concept similarity to measure tuple similarities; prune tuples below a threshold
  – Return the ranked set

Concept (Value) Similarity
• Concept: any distinct attribute-value pair, e.g. Make=Toyota
  – Visualized as a selection query binding a single attribute
  – Represented as a supertuple, e.g. the supertuple for the concept Make=Toyota:
      Model:  Camry: 3, Corolla: 4, …
      Year:   2000: 6, 1999: 5, 2001: 2, …
      Price:  5995: 4, 6500: 3, 4000: 6
• Concept similarity: estimated as the percentage of correlated values common to two given concepts
  Similarity(v1, v2) = Commonality(Correlated(v1, values(Ai)), Correlated(v2, values(Ai)))
  where v1, v2 ∈ Aj, i ≠ j, and Ai, Aj ∈ R
• Measured as the Jaccard similarity among the supertuples representing the concepts:
  JaccardSim(A, B) = |A ∩ B| / |A ∪ B|

[Figure: concept (value) similarity graph over makes — Dodge, Nissan, Honda, BMW, Ford, Chevrolet, Toyota — with similarity-weighted edges (0.11–0.25)]

Things beyond this were not covered in Spring 2010.

Review of Topic 3: Finding, Representing & Exploiting Structure
• Getting structure
  – Allow structure specification languages: XML? (more structured than text and less structured than databases)
  – If structure is not explicitly specified (or is obfuscated), can we extract it? Wrapper generation / information extraction
• Using structure
  – For retrieval: extend IR techniques to use the additional structure
  – For query processing (joins/aggregations etc.): extend database techniques to use the partial structure
  – For reasoning with structured knowledge: Semantic Web ideas…
• Structure in the context of multiple sources
  – How to align structure
  – How to support integrated querying on pages/sources (after alignment)
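The supertuple-based concept similarity above can be sketched as follows. This is a set-based simplification under stated assumptions: the tiny `CARS` relation is invented for illustration, and value frequencies in the supertuple are ignored, whereas QPIAD's actual measure weights values by how often they co-occur with the concept.

```python
# Sketch of concept (value) similarity: build a "supertuple" for each
# concept (attribute = value) from the tuples that satisfy it, then
# compare supertuples with Jaccard similarity. Counts are dropped, so
# a supertuple here is just a set of (other-attribute, value) pairs.

CARS = [
    {"Make": "Toyota", "Model": "Camry",   "Year": 2000},
    {"Make": "Toyota", "Model": "Corolla", "Year": 1999},
    {"Make": "Honda",  "Model": "Civic",   "Year": 2000},
    {"Make": "Honda",  "Model": "Accord",  "Year": 1999},
]

def supertuple(attr, value, rows):
    """All (other-attribute, value) pairs co-occurring with attr=value."""
    return {(a, v) for r in rows if r[attr] == value
            for a, v in r.items() if a != attr}

def concept_similarity(attr, v1, v2, rows):
    # JaccardSim(A, B) = |A ∩ B| / |A ∪ B| over the two supertuples.
    s1, s2 = supertuple(attr, v1, rows), supertuple(attr, v2, rows)
    return len(s1 & s2) / len(s1 | s2)

sim = concept_similarity("Make", "Toyota", "Honda", CARS)
```

Here the two concepts share their Year values but none of their Model values, so the similarity is the fraction of shared pairs in the union of the two supertuples; edges like those in the similarity graph above would be computed pairwise over all values of Make.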