Information Integration

advertisement
Information Integration
4/5
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
1
Information Integration on the Web
AAAI Tutorial (SA2)
Monday July 22nd 2007. 9am-1pm
Rao Kambhampati & Craig Knoblock
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
2
Information Integration
• Combining information from multiple autonomous
information sources
– And answering queries using the combined information
• Many Applications
– WWW:
•
•
•
•
Comparison shopping
Portals integrating data from multiple sources
B2B, electronic marketplaces
Mashups, service composion
– Science informatics
• Integrating genomic data, geographic data, archaeological data,
astro-physical data etc.
– Enterprise data integration
• An average company has 49 different databases and spends
35% of its IT dollars on integration efforts
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
4
Deployed information integration
systems…
•
•
•
•
Travel sites: Kayak, Expedia etc..
Google Base, DBPedia
Map Mashups
Libra; Citeseer; Google-squared etc
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
5
Deep Web as a Motivation for II
• The surface web of crawlable pages is only
a part of the overall web.
• There are many databases on the web
– How many?
– 25 million (or more)
– Containing more than 80% of the accessible
content
• Mediator-based information integration is
needed.
Kambhampati & Knoblock
Information Integration on the Web (MA-1)
6
Services
Webpages
~25M
Structured
data
Sensors
(streaming
Data)
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
8
Blind Men & the Elephant:
Differing views on Information Integration
Database View
• Integration of
autonomous
structured data
sources
• Challenges:
Schema
mapping, query
reformulation,
query processing
Kambhampati & Knoblock
Web service view
• Combining/compo
sing information
provided by
multiple websources
• Challenges:
learning source
descriptions;
source mapping,
record linkage etc.
Information Integration on the Web (SA-2)
IR/NLP view
• Computing
textual
entailment from
the information
in disparate
web/text sources
• Challenges:
Information
Extraction
9
Services
User queries
OLAP / Decision support/
Data cubes/ data mining
Relational database (warehouse)
Data extraction
programs
Data
source
Data cleaning/
scrubbing
Data
source
Data
source
--Warehouse Model---Get all the data and
put into a local DB
--Support structured
queries on the
warehouse
Webpages
~25M
Structured
data
Sensors
(streaming
Data)
User queries
Mediated schema
Mediator:
Which data
model?
Reformulation engine
optimizer
Execution engine
Data source
catalog
wrapper
wrapper
wrapper
Data
source
Data
source
Data
source
--Mediator Model---Design a mediator
--Reformulate queries
--Access sources
--Collate results
--Search Model---Materialize
the pages
--crawl &index them
--include them in
search results
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
10
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
15
Dimensions of Variation
• Conceptualization of (and approaches to)
information integration vary widely based on
– Type of data sources: being integrated (text; structured;
images etc.)
– Type of integration: vertical vs. horizontal vs. both
– Level of up-front work: Ad hoc vs. pre-orchestrated
– Control over sources: Cooperative sources vs.
Autonomous sources
– Type of output: Giving answers vs. Giving pointers
– Generality of Solution: Task-specific (Mashups) vs.
Task-independent (Mediator architectures)
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
17
Dimensions: Type of Data Sources
• Data sources can be
– Structured (e.g. relational data)
• Schema mapping (GAV/LAV)
• Precise integration (Could I have done better by
going to the sources myself?)
– Text oriented
• Information Extraction
– Mixed
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
18
Dimensions: Type of Output
(Pointers vs. Answers)
• The cost-effective approach may depend on the quality
guarantees we would want to give.
• At one extreme, it is possible to take a “web search”
perspective—provide potential answer pointers to keyword
queries
– Materialize the data records in the sources as HTML pages and add
them to the index
• Give it a sexy name: Surfacing the deep web
• At the other, it is possible to take a “database/knowledge
base” perspective
– View the individual records in the data sources as assertions in a
knowledge base and support inference over the entire knowledge.
• Extraction, Alignment etc. needed
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
19
Dimensions: Vertical vs. Horizontal
• Vertical: Sources being integrated are all exporting same
type of information. The objective is to collate their results
– Eg. Meta-search engines, comparison shopping, bibliographic
search etc.
– Challenges: Handling overlap, duplicate detection, source selection
• Horizontal: Sources being integrated are exporting
different types of information
– E.g. Composed services, Mashups,
– Challenges: Handling “joins”
• Both..
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
20
Dimensions: Level of Up-front work
Ad hoc vs. Pre-orchestrated
• Fully Query-time II (blue
sky for now)
–
–
–
–
–
–
–
• Fully pre-fixed II
– Decide on the only query
Get a query from the user
you want to support
(most interesting
on the mediator schema
– Write a (java)script that
action is
Go “discover” relevant data
supports the query by
“in between”)
sources
accessing specific (predetermined) sources, piping
Figure out their “schemas”
Map the schemas on to the E.g. We may start with results (through known
APIs) to specific other
known sources and
mediator schema
sources
Reformulate the user query their known schemas,
• Examples include Google
do hand-mapping
into data source queries
Map Mashups
and
support
automated
Optimize and execute the
reformulation and
queries
optimization
Return the answers
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
21
Dimensions: Control over Sources
(Cooperative vs. Autonomous)
• Cooperative sources can (depending on their level
of kindness)
– Export meta-data (e.g. schema) information
– Provide mappings between their meta-data and other
ontologies
– Could be done with Semantic Web standards…
– Provide unrestricted access
– Examples: Distributed databases; Sources following semantic
web standards
• …for uncooperative sources all this information
has to be gathered by the mediator
– Examples: Most current integration scenarios on the web
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
23
[Figure courtesy Halevy et. Al.]
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
24
Collated Challenges
• Source Selection
– Which sources are likely to
be most relevant to answer
the query?
• Source quality
• Source overlap
• Schema Mapping
– How to map schemas of the
sources to the mediator?
• Record Linkage
– How to couple data items
that refer to the same entity
• Information quality
– Data with null values
– Inconsistent data
• Query reformulation
– To convert query from the
mediator schema to the
source schema
– To bring out relevant data
despite incompleteness and
inconsistency
25
Services
Rao’s Solution…
Webpages
~25M
Structured
data
Sensors
(streaming
Data)
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
26
Idea: Consider “endorsement” (a la A/H or PageRank)
But there are no explicit links….
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
28
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
29
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
30
Services
Rao’s Solution…
Webpages
~25M
Structured
data
Sensors
(streaming
Data)
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
31
Case Study: Query Rewriting in QPIAD
Given a query Q:(Body style=Convt) retrieve all relevant tuples
Id
1
2
3
Make
Audi
BMW
Porsche
Model
A4
Z4
Boxster
Year
2001
2002
2005
Convt
Convt
Convt
4
BMW
Z4
2003
Null
5
Honda
Civic
2004
Null
6
Toyota
Camry
7Ranked
Audi
A4
Relevant
Base Result Set
Body
Id
Make
Model
Year
Body
1
Audi
A4
2001
Convt
2
BMW
Z4
2002
Convt
3
Porsche
Boxster
2005
Convt
AFD: Model~> Body style
Re-order queries based
2002 on Estimated
Sedan
Precision
2006
Null
Rewritten queries
Q1’: Model=A4
Q2’: Model=Z4
Q3’: Model=Boxster
Uncertain Answers
Id
Make
Model
Year
Body
Confidence
4
BMW
Z4
2003
Null
0.7
7
Audi
A4
2006
Null
0.3
We can select top K rewritten queries using Fmeasure
F-Measure = (1+α)*P*R/(α*P+R)
P – Estimated Precision
R – Estimated Recall based on P and Estimated
Selectivity
Services
Rao’s Solution…
Webpages
~25M
Structured
data
Sensors
(streaming
Data)
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
33
Imprecise Queries ?
Toyota
A Feasible Query
Want a ‘sedan’ priced
around $7000
Make =“Toyota”,
Model=“Camry”,
Price ≤ $7000
What about the price of a
Honda Accord?
Is there a Camry for
$7100?
Solution: Support Imprecise Queries
Camry
$7000
1999
Toyota
Camry
$7000
2001
Toyota
Camry
$6700
2000
Toyota
Camry
$6500
1998
………
Imprecise
Query
ap:Convert
Q M
“like”to“=”
Qpr=Map(Q)
Concept (Value) Similarity
Concept: Any distinct attribute value pair.
E.g. Make=Toyota
•
•
Visualized as a selection query binding a
single attribute
Represented as a supertuple
Concept Similarity: Estimated as the percentage
DeriveBase
Set Abs
Abs=Qpr(R)
UseBaseSet assetof
relaxableselection
queries
UseConceptsimilarity
tomeasuretuple
similarities
UsingAFDsfind
relaxationorder
Prunetuplesbelow
threshold
DeriveExtendedSetby
executingrelaxedqueries
ReturnRankedSet
ST(QMake=Toyota)
Model
Camry: 3, Corolla: 4,….
Year
2000:6,1999:5 2001:2,……
Price
5995:4, 6500:3, 4000:6
Supertuple for Concept Make=Toyota
of correlated values common to two given
concepts
Similarity ( v1, v 2 )   Commonalit y (Correlated ( v1, values ( Ai )), Correlated ( v 2 , values ( Ai )))
where v1,v2 Є Aj, i ≠ j and Ai, Aj Є R
•
Measured as the Jaccard Similarity among
supertuples representing the concepts
JaccardSim(A,B) =
A B
A B
Concept (Value) Similarity Graph
Dodge
Nissan
0.15
0.11
Honda
BMW
0.12
0.22
0.25
0.16
Ford
Chevrolet
Toyota
Things beyond this were not
covered in Spring 2010
Kambhampati & Knoblock
Information Integration on the Web (SA-2)
47
Review of Topic 3:
Finding, Representing & Exploiting
Structure
Getting Structure:
Allow structure specification languages
 XML? [More structured than text and less structured than databases]
If structure is not explicitly specified (or is obfuscated), can we extract it?
Wrapper generation/Information Extraction
Using Structure:
For retrieval:
Extend IR techniques to use the additional structure
For query processing: (Joins/Aggregations etc)
Extend database techniques to use the partial structure
For reasoning with structured knowledge
Semantic web ideas..
Structure in the context of multiple sources:
How to align structure
How to support integrated querying on pages/sources (after alignment)
Kambhampati & Knoblock
48
Download