Learning to Create Data-Integration Queries

advertisement
Learning to Create Data-Integration
Queries
Partha Pratim Talukdar, Marie Jacob, Muhammad Salman Mehmood, Koby Crammer, Zachary G. Ives,
Fernando Pereira, Sudipto Guha
VLDB2008
Seminar Presented by
Noel Gunasekar
CSE Department – SUNY Buffalo
Learning to Create Data-Integration Queries
Introduction
Motivation
Example
Existing solutions
Q-System Solution
Q-System Architecture
Query and Query Answers
Executing Query
Learning From feedback
Conclusion
Experimental results
Future Work
Learning to Create Data Integration Queries
2
Introduction
Learning to Create Data Integration Queries
3
Motivation
Need for non-expert user to pose queries across multiple
data resources.
• Non-expert user - Not familiar with querying languages
• Multiple resource - Databases, Data warehouses, Virtual
integrated schemas
Learning to Create Data Integration Queries
4
Bio-Science Field
• Many "standardized" databases with overlapping and cross-referenced
information
• Each site is being independently extended, corrected, and analyzed
• Differing levels of data quality/confidence
Protein Databases
Protein DataBase: - PDB information and service listings at
Brookhaven National Laboratory [ BNL ]
PIR: - Protein Identification Resource database at [ JHU ]
PRF: - Protein Research Foundation database at GenomeNet
SwissProt - Protein database at ExPASy [ Switzerland ]
Learning to Create Data Integration Queries
5
Example
“What are the proteins and genes associated with the disease Narcolepsy?”
Life Sciences researcher querying
on data-sources like genomics,
disease studies and pharmacology.
genomics
Life Sciences
Researcher
Disease Studies
Pharmacology
http://www.expasy.org/uniprot/P04049
Learning to Create Data Integration Queries
6
Existing Solution
Using keyword based queries on Web-Forms
Match the keywords with terms in the tuples and form the query by
joining different databases using foreign-keys
• Cost for the query is fixed and doesn’t accommodate the context
of the query
http://www.expasy.org/uniprot/P0C852
Learning to Create Data Integration Queries
7
Proposed Solution - Q System
Automatically generate Web-Forms for given set of
keywords
Pose queries across multiple data resources using the
generated web-form
Learning to Create Data Integration Queries
8
Proposed Solution - Q System
Create re-usable web-form
User
(Author)
Q System
Keywords
Protein, gene, disease
Reusable Web-Form
For querying
Use web-form for Querying
Users
(Author + others)
Query Results
Parameters
Reusable Web-Form
For querying
Learning to Create Data Integration Queries
9
Q System Architecture
Learning to Create Data Integration Queries
10
Architecture of Q System
Four Components
•Initial Schema Loader
•Query Template Creation
•Query Execution
•Learning Through Feedback
Learning to Create Data Integration Queries
11
Architecture of Q System
Learning to Create Data Integration Queries
12
Initial Setup
Schema Loader
Input
• Given a set of data sources with its own schema
• Foreign Keys and Links
• Schema Mappings
• Record Link
Output
• Schema Graph
Learning to Create Data Integration Queries
13
Initial Setup
Example Schema Graph
b
0.07
0.1
c
0.04
d
Node: Databases and their attributes
(UniProt database, Entrez GeneInfo db, term)
Edge: Relation based on foreign keys/cross-references
(UniProt to PIR)
Cost: Reliability, completeness
Learning to Create Data Integration Queries
14
Query Template Creation
Learning to Create Data Integration Queries
15
Query Template Creation - Example
Input: “protein”, “plasma membrane”, “gene” and “disease”
Output:
Learning to Create Data Integration Queries
16
Query Template Creation - Example
a
Schema Graph
b
0.1
0.07
0.1
Query Keywords a, e, f
c
0.04
d
0.1
f
0.1
e
Find trees connecting red nodes
a
0.1
b
a
0.1
Rank = 1
Cost = 0.4
0.07
c
0.04
0.1
Q1
b
d
0.1
e
0.1
f
Q2
Rank = 2
Cost = 0.41
d
0.1
e
0.1
f
Query Formulation
•Trees can be easily written as executable queries:
Steiner Tree
a
0.1
b
0.1
d
0.1
f
0.1
e
Conjunctive query: a(x,y),b(y,z),d(z,w),e(w,u),f(w,v)
View Refinement
Web-Form
Query Execution
Learning to Create Data Integration Queries
21
Input: Web-Form
Output: Result Answers
Q1
Q1
Q1,2
Q2
Q2
Q2
System determines “producer” queries using provenance
Query Execution
Query Processing Engine with
• Support for querying remote data sources
• Record data provenance
Solution:
ORCHESTRA
http://www.cis.upenn.edu/~zives/orchestra/
Learning to Create Data Integration Queries
24
Orchestra Project
• The ORCHESTRA project focuses on the challenges of data sharing
scenarios in the sciences
• Bioinformatics Scenario - many "standardized" databases with overlapping
information, similar but not identical data and differing levels of data
quality/confidence
• Each site is being independently extended, corrected, and analyzed
•ORCHESTRA collaborative data sharing system (CDSS) is on how to support
reconciliation across different schemas, with disagreeing users
Learning to Create Data Integration Queries
25
Orchestra Project – Data Provenance
http://www.cis.upenn.edu/~zives/research/exchange.pdf
Learning to Create Data Integration Queries
26
Learning through Feedback
Learning to Create Data Integration Queries
27
Learning through Feedback
• Input: Ranked Results + provenance
Q1
Q1
Q1,2
Q2
Q2
Q2
Learning to Create Data Integration Queries
28
Learning through Feedback
• User provides feedback
Q1
Q1
Q1,2
Q2
Q2
Q2
Learning to Create Data Integration Queries
29
Query Formulation - Recap
a
Schema Graph
b
0.1
0.07
0.1
Query Keywords a,
c
e, f
0.04
d
0.1
f
0.1
e
Find trees connecting red nodes
a
0.1
b
a
0.1
Rank = 1
Cost = 0.4
0.07
c
0.04
0.1
Q1
b
d
0.1
e
0.1
f
Q2
Rank = 2
Cost = 0.41
d
0.1
e
0.1
f
Learning through Feedback
a
0.1
Q1
b
a
0.1
d
0.1
0.1
0.05
0.07
Q2
f
c
0.04
d
0.1
Rank = 1
2
b
0.1
Rank
1
Rank =
=2
Cost == 0.
0.39
Cost
41
e
Cost = 0.4
 Change weights so Q2 is “cheaper” than Q1
a
0.1
b
0.0
0.07
5
0.1
c
0.04
d
0.1
e
0.1
f
e
0.1
f
Iteration!
Learning to Create Data Integration Queries
32
Q-System: Challenges
 Computation of ranked queries which in turn produce
ranked tuples: K-Best Steiner Tree Generation
 Predicting new query rankings based on user feedback
over tuples, and also generalizing feedback: Learning
 Maintaining associations between tuples and queries:
Query answers with provenance
 Everything at interactive speed!
Cost of a Query
• Query Cost = Sum of edge costs in the tree.
• Edge Cost = Sum of weights of features defined over it.
• Features are properties of the edges, e.g., nodes connected
• Each feature has a corresponding weight.
• Feature example:
f = 1 if the edge connects Term and Synonym tables, else 0
Term
f
1
w8
Synonym
Steiner Trees: Finding Lowest-Cost Queries
 A tree of minimal cost in a graph (G)
which includes all the required nodes (S).
a
 Cost of a Steiner Tree is the sum of costs
of edges present in the tree.
0.1
b
0.07
0.1
c
0.04
d
0.1
e
 Steiner Tree is generalization of
Minimum Spanning Tree (MST)
[equivalent when S = all vertices in G].
0.1
f
K-Best Steiner Tree Algorithms
• Exact (practical for ~100 nodes and edges).
• Integer Linear Program (ILP) based formulation for finding K-best
Steiner Trees in a graph.
• The ILP uses ideas from multi-commodity network flows
• Approximate (for 100s+ nodes and edges).
• Novel Shortest Paths Complete Subgraph Heuristic.
• Significantly faster; in practice, often gives optimal solution.
Multi-Commodity Flow Problem
MIP for min-cost Steiner Tree
MIP for K min-cost Steiner Tree
Constraints
C1 : Flow of commodity k starts at root r
C2 : Flow of commodity k terminates at node k
C3 : Conservation of flow at Steiner nodes
C4 : Flow of an edge allowed only if that edge is included ( Yij = 1 )
C5 : Non-negativity constraint
C6 : Defines value for Y
C7 : Ensures no incoming active edge into the root
C8 : Ensures that all nodes have at most one incoming active edge
C9 : Flow of at least one commodity on all edges in I
C10 : Ensures no flow on edges in X
Finding K-Best Steiner Trees
a
b
0.1
0.07
0.1
c
0.04
d
0.1
f
0.1
e
2-best Steiner trees connecting terminal nodes.
a
0.1
b
a
0.1
d
Cost = 0.4
0.07
c
0.04
0.1
Rank = 1
b
0.1
e
0.1
f
d
Rank = 2
Cost = 0.41
0.1
e
0.1
f
K-Best Steiner Tree Algorithms
• Approximate (for 100s+ nodes and edges).
• Novel Shortest Paths Complete Subgraph Heuristic.
• Send “m” shortest path graph as input.
• Shortest path between each pair of nodes in S
• Significantly faster; in practice, often gives optimal solution.
Q Challenge : Getting User Feedback
Query
Tuples
Top
a
b
d
T
f
e
.
.
.
.
.
.
a
b
.
.
.
c
d
T*
Bottom
Q
f
Q*
e
T and T* differ in 3 edges. This difference is termed loss: L(T, T*)
Learning: Update Weights
Edge Cost: 0.07
Term(T1)
Term2Term
w8 = 0.06
w25 = 0.01
Edge Cost Update
Edge Cost: 0.05
Term(T1)
w8 = 0.04
w25 = 0.01
Term2Term
Re-ranked Steiner Trees
a
0.1
b
a
0.1
b
c
0.07
0.04
0.1
d
f
0.1
d
0.1
f
0.1
0.1
e
e
Weight Update
a
0.1
b
0.05
c
a
0.1
0.04
d
0.1
0.1
f
d
0.1
Rank 1
e
b
0.1
Rank 2
e
0.1
f
Experimental Results
The Key Questions
• Can the algorithm start with uninitialized weights and learn expert (“gold
standard”) ranking of queries?
• Can the results be generated at interactive speeds?
• Does the approach scale to larger graphs?
Results: Learning Expert Weights
• Graph: Start with the BioGuide bio
sources, with 28 vertices and 96
edges.
• Goal: Learn the queries
corresponding to the expert-set
weights in BioGuide
• Methodology:
• All weights are set to default.
• Sequence of 25 searches
•For each, user feedback identifies
& promotes a tuple from the gold
standard answer.
• After 40-60% searches with feedback, system finds the top query immediately.
• For each individual search, a single feedback is enough to learn the top query.
Results: Time to generate K-best Queries
• Schema graph of size (28, 96) from BioGuide (Boulakia et al., 2007).
K
Time (s)
1
0.11
5
2.00
10
4.02
20
8.42
It is possible to generate the top query in < 1 sec and top 5 queries
in about 2 sec, all within interactive range. Query execution is pipelined.
Results: Scalability to Larger Graphs
• Larger schema graph of size (408, 1366) from real sources: GUS, GO, BioSQL.
Running Time (sec)
10000
1000
Exact (ILP)
K
Approximate (SPCSH)
100
10
1
1
2
3
Queries (K)
Speedup Error
1
12
0
2
14.6
0
3
20.3
0
5
72.4
0
5
It is possible to do K-best inference in larger graphs quickly
and with little or no loss (none in this case).
Experimental Results
“Gold Standard”
• Using BioGuide – a biomedical information integration system
• BioGuide generates the schema graph based on keywords
• The edge cost in the schema graph are manually assigned by experts
• This expert given schema graph is called the “gold standard”
Experiment involves in comparing the result produced by the q system
with the results produced by the gold standard.
Learning to Create Data Integration Queries
50
Learning against expert cost
• Started with an expert query template
“What are the related proteins in [DB1] and genes in [DB2]
associated with disease Narcolepsy in [DB3] ?
• By instantiating the template 25 queries were formed
• Each time for a query the lowest-cost Steiner tree is computed
• The “gold standard” is used as the feedback and learning is done
Learning to Create Data Integration Queries
51
Future Work
• Work on other approximation algorithms for computing the Steiner trees
• Evaluation against real biological applications
• Incorporating data-level keyword matches.
Learning to Create Data Integration Queries
52
Download