Information Integration

advertisement
Information Integration
4/5
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
1
Information Integration on the Web
AAAI Tutorial (SA2)
Monday July 22nd 2007. 9am-1pm
Rao Kambhampati & Craig Knoblock
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
2
Information Integration
• Combining information from multiple autonomous
information sources
– And answering queries using the combined information
• Many Applications
– WWW:
•
•
•
•
Comparison shopping
Portals integrating data from multiple sources
B2B, electronic marketplaces
Mashups, service composion
– Science informatics
• Integrating genomic data, geographic data, archaeological data,
astro-physical data etc.
– Enterprise data integration
• An average company has 49 different databases and spends
35% of its IT dollars on integration efforts
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
4
Deployed information integration
systems…
•
•
•
•
Travel sites: Kayak, Expedia etc..
Google Base, DBPedia
Map Mashups
Libra; Citeseer; Google-squared etc
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
5
Deep Web as a Motivation for II
• The surface web of crawlable pages is only
a part of the overall web.
• There are many databases on the web
– How many?
– 25 million (or more)
– Containing more than 80% of the accessible
content
• Mediator-based information integration is
Kambhampati & Knoblock
Information Integration on the Web (MA-1)
6
Blind Men & the Elephant:
Differing views on Information Integration
Database View
• Integration of
autonomous
structured data
sources
• Challenges:
Schema
mapping, query
reformulation,
query processing
Kambhampati & Knoblock
Web service view
• Combining/compo
sing information
provided by
multiple websources
• Challenges:
learning source
descriptions;
source mapping,
record linkage etc.
Information Integration on the Web (SA-2)
IR/NLP view
• Computing
textual
entailment from
the information
in disparate
web/text sources
• Challenges:
Convert to
structured format
8
Services
User queries
OLAP / Decision support/
Data cubes/ data mining
Relational database (warehouse)
Data extraction
programs
Data
source
Data cleaning/
scrubbing
Data
source
Data
source
--Warehouse Model---Get all the data and
put into a local DB
--Support structured
queries on the
warehouse
Webpages
~25M
Structured
data
Sensors
(streaming
Data)
User queries
Mediated schema
Mediator:
Which data
model?
Reformulation engine
optimizer
Execution engine
Data source
catalog
wrapper
wrapper
wrapper
Data
source
Data
source
Data
source
--Mediator Model---Design a mediator
--Reformulate queries
--Access sources
--Collate results
--Search Model---Materialize
the pages
--crawl &index them
--include them in
search results
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
Challenges
Extracting Information
Aligning Sources
Aligning Data
Query Processing
9
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
13
Dimensions of Variation
• Conceptualization of (and approaches to)
information integration vary widely based on
– Type of data sources: being integrated (text; structured;
images etc.)
– Type of integration: vertical vs. horizontal vs. both
– Level of up-front work: Ad hoc vs. pre-orchestrated
– Control over sources: Cooperative sources vs.
Autonomous sources
– Type of output: Giving answers vs. Giving pointers
– Generality of Solution: Task-specific (Mashups) vs.
Task-independent (Mediator architectures)
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
15
Dimensions: Type of Data Sources
• Data sources can be
– Structured (e.g. relational data)
No need for
information
extraction
– Text oriented
– Mixed
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
16
Dimensions: Vertical vs. Horizontal
• Vertical: Sources being integrated are all exporting same
type of information. The objective is to collate their results
– Eg. Meta-search engines, comparison shopping, bibliographic
search etc.
– Challenges: Handling overlap, duplicate detection, source selection
• Horizontal: Sources being integrated are exporting
different types of information
– E.g. Composed services, Mashups,
– Challenges: Handling “joins”
• Both..
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
17
Dimensions: Level of Up-front work
Ad hoc vs. Pre-orchestrated
• Fully Query-time II (blue
sky for now)
–
–
–
–
–
–
–
• Fully pre-fixed II
– Decide on the only query
Get a query from the user
you want to support
(most interesting
on the mediator schema
– Write a (java)script that
action is
Go “discover” relevant data
supports the query by
“in between”)
sources
accessing specific (predetermined) sources, piping
Figure out their “schemas”
Map the schemas on to the E.g. We may start with results (through known
APIs) to specific other
known sources and
mediator schema
sources
Reformulate the user query their known schemas,
• Examples include Google
do hand-mapping
into data source queries
Map Mashups
and
support
automated
Optimize and execute the
reformulation and
queries
optimization
Return the answers
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
18
Dimensions: Control over Sources
(Cooperative vs. Autonomous)
• Cooperative sources can (depending on their level
of kindness)
– Export meta-data (e.g. schema) information
– Provide mappings between their meta-data and other
ontologies
– Could be done with Semantic Web standards…
– Provide unrestricted access
– Examples: Distributed databases; Sources following semantic
web standards
• …for uncooperative sources all this information
has to be gathered by the mediator
– Examples: Most current integration scenarios on the web
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
20
Dimensions: Type of Output
(Pointers vs. Answers)
• The cost-effective approach may depend on the quality
guarantees we would want to give.
• At one extreme, it is possible to take a “web search”
perspective—provide potential answer pointers to keyword
queries
– Materialize the data records in the sources as HTML pages and add
them to the index
• Give it a sexy name: Surfacing the deep web
• At the other, it is possible to take a “database/knowledge
base” perspective
– View the individual records in the data sources as assertions in a
knowledge base and support inference over the entire knowledge.
• Extraction, Alignment etc. needed
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
21
Idea: Consider “endorsement” (a la A/H or PageRank)
But there are no explicit links….
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
23
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
24
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
25
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
26
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
27
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
28
Things beyond this were not
covered in Spring 2010
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
29
Review of Topic 3:
Finding, Representing & Exploiting
Structure
Getting Structure:
Allow structure specification languages
 XML? [More structured than text and less structured than databases]
If structure is not explicitly specified (or is obfuscated), can we extract it?
Wrapper generation/Information Extraction
Using Structure:
For retrieval:
Extend IR techniques to use the additional structure
For query processing: (Joins/Aggregations etc)
Extend database techniques to use the partial structure
For reasoning with structured knowledge
Semantic web ideas..
Structure in the context of multiple sources:
How to align structure
How to support integrated querying on pages/sources (after alignment)
Kambhampati & Knoblock
30
Schema Mapping
•
•
•
Schema mappings can be simple or
complex
Simple mapping
#employees * avg salary 
Sum(employee salary)
•
Bag similarity
• Jaccard similarity between {Bag
of values for Attribute-1} and
{Bag of values for Attribute-2}
• Can show that Make and
Manufacture are the same attribute
(for example)
• Can also show that “Alive” and
“dead” are same (since both have
y/n values)
•
Kambhampati & Knoblock
Word-net similarity
• Writer ~ Author
Complex mapping
–
• Writer ~ Writr
•
Writer  Author
•
•
Heuristic techniques for schema
mapping
String Similarity
Consider ensemble learning
methods based on these simple
learners
Information Integration on the Web (SA-2)
31
Download