Information Integration 4/5

Kambhampati & Knoblock Information Integration on the Web (SA-2) 1

Information Integration on the Web
AAAI Tutorial (SA2)
Monday July 22nd 2007. 9am-1pm
Rao Kambhampati & Craig Knoblock

Information Integration
• Combining information from multiple autonomous information sources
  – And answering queries using the combined information
• Many Applications
  – WWW:
    • Comparison shopping
    • Portals integrating data from multiple sources
    • B2B, electronic marketplaces
    • Mashups, service composition
  – Science informatics
    • Integrating genomic data, geographic data, archaeological data, astro-physical data etc.
  – Enterprise data integration
    • An average company has 49 different databases and spends 35% of its IT dollars on integration efforts

Deployed information integration systems…
• Travel sites: Kayak, Expedia etc.
• Google Base, DBPedia
• Map Mashups
• Libra; Citeseer; Google-squared etc.

Deep Web as a Motivation for II
• The surface web of crawlable pages is only a part of the overall web.
• There are many databases on the web
  – How many?
  – 25 million (or more)
  – Containing more than 80% of the accessible content
• Mediator-based information integration is

Blind Men & the Elephant: Differing views on Information Integration
Database View
• Integration of autonomous structured data sources
• Challenges: Schema mapping, query reformulation, query processing
Web service view
• Combining/composing information provided by multiple web sources
• Challenges: learning source descriptions; source mapping, record linkage etc.
IR/NLP view
• Computing textual entailment from the information in disparate web/text sources
• Challenges: Convert to structured format

[Figure: three architectures for information integration. Source labels shown: services; webpages; structured data (~25M); sensors (streaming data).]
Warehouse Model
• Get all the data and put into a local DB
• Support structured queries on the warehouse
• Components shown: data sources; data extraction programs; data cleaning/scrubbing; relational database (warehouse); OLAP / decision support / data cubes / data mining; user queries
Mediator Model
• Design a mediator
• Reformulate queries
• Access sources
• Collate results
• Components shown: user queries; mediated schema; mediator (which data model?); reformulation engine; optimizer; execution engine; data source catalog; wrappers; data sources
Search Model
• Materialize the pages
• Crawl & index them
• Include them in search results

Challenges
• Extracting Information
• Aligning Sources
• Aligning Data
• Query Processing

Dimensions of Variation
• Conceptualization of (and approaches to) information integration vary widely based on
  – Type of data sources being integrated (text; structured; images etc.)
  – Type of integration: vertical vs. horizontal vs. both
  – Level of up-front work: Ad hoc vs. pre-orchestrated
  – Control over sources: Cooperative sources vs. Autonomous sources
  – Type of output: Giving answers vs. Giving pointers
  – Generality of Solution: Task-specific (Mashups) vs. Task-independent (Mediator architectures)

Dimensions: Type of Data Sources
• Data sources can be
  – Structured (e.g. relational data): no need for information extraction
  – Text oriented
  – Mixed

Dimensions: Vertical vs.
Horizontal
• Vertical: Sources being integrated are all exporting the same type of information. The objective is to collate their results
  – E.g. Meta-search engines, comparison shopping, bibliographic search etc.
  – Challenges: Handling overlap, duplicate detection, source selection
• Horizontal: Sources being integrated are exporting different types of information
  – E.g. Composed services, Mashups
  – Challenges: Handling “joins”
• Both..

Dimensions: Level of Up-front work (Ad hoc vs. Pre-orchestrated)
• Fully Query-time II (blue sky for now)
  – Get a query from the user on the mediator schema
  – Go “discover” relevant data sources
  – Figure out their “schemas”
  – Map the schemas on to the mediator schema
  – Reformulate the user query into data source queries
  – Optimize and execute the queries
  – Return the answers
• (The most interesting action is “in between”)
  – E.g. We may start with known sources and their known schemas, do hand-mapping, and support automated reformulation and optimization
• Fully pre-fixed II
  – Decide on the only query you want to support
  – Write a (java)script that supports the query by accessing specific (predetermined) sources, piping results (through known APIs) to specific other sources
  – Examples include Google Map Mashups

Dimensions: Control over Sources (Cooperative vs. Autonomous)
• Cooperative sources can (depending on their level of kindness)
  – Export meta-data (e.g.
schema) information
  – Provide mappings between their meta-data and other ontologies
    • Could be done with Semantic Web standards…
  – Provide unrestricted access
  – Examples: Distributed databases; Sources following semantic web standards
• …for uncooperative sources all this information has to be gathered by the mediator
  – Examples: Most current integration scenarios on the web

Dimensions: Type of Output (Pointers vs. Answers)
• The cost-effective approach may depend on the quality guarantees we would want to give.
• At one extreme, it is possible to take a “web search” perspective: provide potential answer pointers to keyword queries
  – Materialize the data records in the sources as HTML pages and add them to the index
    • Give it a sexy name: Surfacing the deep web
• At the other, it is possible to take a “database/knowledge base” perspective
  – View the individual records in the data sources as assertions in a knowledge base and support inference over the entire knowledge.
    • Extraction, Alignment etc. needed

Idea: Consider “endorsement” (a la A/H or PageRank)
But there are no explicit links….

Things beyond this were not covered in Spring 2010

Review of Topic 3: Finding, Representing & Exploiting Structure
Getting Structure:
• Allow structure specification languages: XML?
[More structured than text and less structured than databases]
• If structure is not explicitly specified (or is obfuscated), can we extract it?
  – Wrapper generation/Information Extraction
Using Structure:
• For retrieval: Extend IR techniques to use the additional structure
• For query processing (Joins/Aggregations etc.): Extend database techniques to use the partial structure
• For reasoning with structured knowledge: Semantic web ideas..
Structure in the context of multiple sources:
• How to align structure
• How to support integrated querying on pages/sources (after alignment)

Schema Mapping
• Schema mappings can be simple or complex
• Simple mapping
  – Writer ↔ Author
• Complex mapping
  – #employees * avg salary ↔ Sum(employee salary)
• Heuristic techniques for schema mapping
  – String similarity
    • Writer ~ Writr
  – Word-net similarity
    • Writer ~ Author
  – Bag similarity
    • Jaccard similarity between {Bag of values for Attribute-1} and {Bag of values for Attribute-2}
    • Can show that Make and Manufacture are the same attribute (for example)
    • Can also show that “Alive” and “dead” are the same (since both have y/n values)
• Consider ensemble learning methods based on these simple learners
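
The string-similarity and bag-similarity heuristics above can be sketched in a few lines of Python. This is only an illustrative sketch, not the tutorial's implementation: the function names, the use of `difflib.SequenceMatcher` as the string measure, the multiset (min-count over max-count) reading of "Jaccard similarity between bags", and the sample column values are all assumptions made for the example.

```python
from collections import Counter
from difflib import SequenceMatcher


def string_similarity(name_a: str, name_b: str) -> float:
    """Similarity of two attribute names in [0, 1].

    Uses difflib's ratio as a stand-in string measure (an assumed choice).
    """
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()


def bag_jaccard(values_a, values_b) -> float:
    """Multiset Jaccard: sum of min counts divided by sum of max counts."""
    bag_a, bag_b = Counter(values_a), Counter(values_b)
    union = sum((bag_a | bag_b).values())  # | keeps the max count per value
    if union == 0:
        return 0.0
    return sum((bag_a & bag_b).values()) / union  # & keeps the min count


# Hypothetical value samples drawn from four attributes.
make = ["Honda", "Toyota", "Ford", "Honda"]
manufacturer = ["Toyota", "Honda", "Honda", "BMW"]
alive = ["y", "n", "y", "n"]
dead = ["n", "y", "n", "y"]

print("Writer vs Writr (names):", string_similarity("Writer", "Writr"))
print("Writer vs Author (names):", string_similarity("Writer", "Author"))
print("Make vs Manufacturer (values):", bag_jaccard(make, manufacturer))  # 0.6
print("Alive vs dead (values):", bag_jaccard(alive, dead))  # 1.0
```

On these samples the value bags of Make and Manufacturer overlap well even though the names differ, while "Alive" and "dead" score a perfect 1.0 purely because both hold y/n values, which is the false match the slide warns about. The name-based measure catches Writer ~ Writr but not Writer ~ Author, which is one reason to combine several simple learners rather than trust any single one.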