Integration of databanks Accessing multiple data sources

advertisement
Accessing multiple data sources
Genomics
Which?
Where?
How?
Disease
information
Integration of databanks
Clinical
trials
Metabolism
DISCOVERY toxicology
Chemical
structure
Target
structure
1
Access to multiple databanksProblems
Disease
models
2
Queries over multiple databanks
Find PubMed publications on
diseases of certain insulin sequences
• Users need good knowledge on where the
required information is stored and how it
can be accessed
• Representation of an entity in different
databanks can be different.
Same name in different databanks can refer
to different entities.
get relevant
diseases
SWISS-PROT
get IDs of
publications
•
•
•
•
Find
Divide & order
Execute
Combine
get
publications
OMIM
PubMed
3
4
Access to multiple databanks steps
• Decide which databanks should be used
• Divide query into sub-queries to the databanks
• Decide in which order to send sub-queries to the
databanks
• Send sub-queries to the databanks - use the
terminology of the databanks
• Merge results from the databanks to an answer for
the original query
Æ mistake in any step can lead to inefficient
processing of the query or failure to get a result 5
query
Answer1
Answer2
Answer3
Sub-query1
Sub-query1
Sub-query1
Answer1
Answer2
Answer3
Answer1
Answer2
Answer3
6
query
Answer1
Answer2
Answer3
query
Answer1.1
Answer1.2
Answer1
Answer2
Answer3
Sub-query2(answer1)
Sub-query2(answer2)
Answer1.1
Sub-query2(answer1)
Answer1.2
Answer2.1
Sub-query2(answer2)
Answer2.2
Sub-query2(answer1)
Answer1.1
Answer1.2
Sub-query2(answer2)
Answer2.1
Answer2.2
Answer1.1
Answer1.2
Answer2.1
Answer2.2
7
8
result
query
Answer1
Answer2
Answer3
query
Answer1.1
Answer1.2
Answer1
Answer2
Answer3
Answer2.1
Answer2.2
Answer1.1
Answer1.2
Answer2.1
Answer2.2
Answer3.1
Answer3.1
Answer.a
Answer.b
Answer.c
Answer.d
Answer.e
Answer.f
Subquery3(Answer1.1,Answer1.2,
Answer.a
Answer2.1,Answer2.2,Answer3.1)
Answer.b
Sub-query2(answer3)
Answer.c
Answer.d
Subquery3(Answer1.1,Answer1.2,
Answer.e
Answer2.1,Answer2.2,Answer3.1)
Answer.f
Answer3.1
Sub-query2(answer3)
Sub-query2(answer3)
Answer3.1
9
Problem formulation
Methods for integration
• Databank properties
–
–
–
–
Protein(name, date, organism)
ProteinStructure(name, structure)
Data source 1
Protein(name, authors, date, organism)
Article(authors, title, year)
date>1995
Autonomous databanks
Different data models
Differences in terminology
Overlapping, redundant data
• Integration aims to provide
transparent access to
multiple heterogenous
databanks
Data source 2
Structure(name, structure, organism)
date>2000
10
– uniform query language
– uniform representation of
results
11
– Link driven federations
• Explicit links between databanks.
– Warehousing
• Data is downloaded, filtered, integrated and stored
in a warehouse. Answers to queries are taken from
the warehouse.
– Mediation or View integration
• A global schema is defined over all databanks.
12
Link driven federations
Link driven federations
• Creates explicit links between databanks
• query: get interesting results and use web
links to reach related data in other
databanks
User Interface
Answer Assembler
Query interpreter
Retrieval Engine
wrapper
wrapper
• systems: SRS, Entrez
databank A
Index
A->B
Index
B->A
databank B
13
Link driven federations
14
View integration
• Advantages
- complex queries
- fast
• Disadvantages
- require good knowledge
- syntax based
- terminology problem not solved
• Define a global schema over the databanks
• high level query language
• systems: BioKleisli, K2, TAMBIS,
BioTRIFU
15
View integration
View integration
User Interface
Query Interpreter and Expander
Answer Assembler
Retrieval Engine
wrapper
Ontology
Base
Databank
Knowledge
Base
databank
16
wrapper
• Advantages
- complex queries
- requires less knowledge
- solution for terminology problem
- semantics based
databank
17
18
View integration
Mediation
Query
Application
• Disadvantages
- more computation
- view maintenance
• Query problem
How to answer queries expressed using
the global schema.
Mediator
Global schema
• Modeling problem
Wrapper
Local schema
Source
Wrapper
Local schema
How to model the global schema,
databanks och mappings.
Source
19
20
Queries
Mediator
• Queries use the global schema
• Conjunctive queries
Query
– select-project-join queries
if
head
Mediator
body/subgoals
Query Reformulation
p(X,Z) :- a(X,Y), b(Y,Z)
Query Execution Engine
• Mediator reformulates queries in terms of a set of
queries that use the local schema. Equivalence and
containment of queries needs to be preserved.
– global schema (domain model/ontology)
– local schema (databank model)
• Information for integration
• Capabilities
Mappings
Wrapper
Local schema
Wrapper
Local schema
Source
Source
– Access only relevant databanks
22
• Relation between domain and databank
content
Global schema:
Protein(name, date, organism)
ProteinStructure(name, structure)
– mapping
Mediator
Global schema
Wrapper
Mapping
• Description of databank content
Application
Wrapper
Issues:
– Semantically correct reformulation
21
Knowledge
Query
– reformulation of queries decide query plan
– query optimization
– execution of query plan, assemble results into
final answer
Query Optimization
q(name, structure) :- Protein(name, 2001, ‘human’),
ProteinStructure(name, structure)
- Q1 is contained in Q2
if the result of Q1 is a subset of the result of Q2.
• Mediator is responsible for query processing
– attributes and constraints
– processing capabilities
– completeness
– cost of query answering
– reliability
Data source local schema:
DS1(name, authors, date, organism)
DS2(name, structure, organism)
– Global as view
The global schema is defined in terms of source
terminology
• Used for
Source
Source
– selection of relevant databanks
– query plan formulation
– query plan optimization
Protein(name, date, organism) :- DS1(name, authors, date, organism)
ProteinStructure(name, structure) :- DS2(name, structure, organism)
23
24
Mapping
Query processing in GAV
• Relation between domain and databank
content
Global schema:
Protein(name, date, organism)
ProteinStructure(name, structure)
Query: give name and structure for human proteins with date ‘2001’.
q(name, structure) :- Protein(name, 2001, ‘human’), ProteinStructure(name, structure)
Data source local schema:
DS1(name, authors, date, organism)
DS2(name, structure, organism)
– Local as view
The sources are defined in terms of the global
schema.
DS1(name, authors, date, organism) -: Protein(name, date, organism), date >1995
DS2(name, structure, organism) -: Protein(name, date, organism),
ProteinStructure(name, structure), date >2000
25
GAV: Protein(name, date, organism) :- DS1(name, authors, date, organism)
ProteinStructure(name, structure) :- DS2(name, structure, organism)
• No explicit representation of databank content
• Mapping gives direct information about which
data satisfies the global schema.
• Query is processed by expanding the query atoms
according to their definitions.
New query: q(name, structure) :DS1(name, authors, 2001, ’human’), DS2(name, structure, organism)
26
Query processing in LAV
Query processing in LAV
Query: give name and structure for human proteins with date ‘2001’.
q(name, structure) :- Protein(name, 2001, ‘human’), ProteinStructure(name, structure)
Query: give name and structure for human proteins with date ‘2001’.
q(name, structure) :- Protein(name, 2001, ‘human’), ProteinStructure(name, structure)
LAV:
LAV:
DS1(name, authors, date, organism) -: Protein(name, date, organism), date >1995
DS2(name, structure, organism) -: Protein(name, date, organism),
ProteinStructure(name, structure), date >2000
• Mapping does not give direct information about
which data satisfies the global schema.
• To answer the query it needs to be inferred how the
mappings should be used.
27
Comparison GAV - LAV
• Bucket algoritm (Information Manifold)
– For each sub-goal in query create bucket of relevant views.
– Define rewritings of query. Each rewriting consists of one
conjunct from every bucket. Check whether the resulting
conjunction is contained in the query.
– The result is the union of the rewritings.
New query: q(name, structure) :DS1(name, authors, 2001, ’human’), DS2(name, structure, organism)
28
Capabilities
• Most common capabilities describe attributes
• Global as view
– Clear how databanks interact
– When a databank is added, the global schema can
change
– Query processing is easy
• Local as view
–
–
–
–
DS1(name, authors, date, organism) -: Protein(name, date, organism), date >1995
DS2(name, structure, organism) -: Protein(name, date, organism),
ProteinStructure(name, structure), date >2000
Each databank is specified in isolation
Easy to add databanks
Easier to specify constraints on the contents of sources
Query processing requires reasoning
29
– f - free, attribute can be specified or not
– b - bound, a value must be specified for the attribute, all
values are permitted
– u - unspecified, not permitted to specify a value for the
attribute
– c[S] - value should be one of the values in finite set S
– o[S] - value is not specified or one of the values in
finite set S
DS1: (name, authors, date, organism) f f b c[human mouse]
30
View integration
View integration
User Interface
Research issues:
- knowledge
representation Answer Assembler
Query Interpreter
and Expander
- creation and maintenance of global schema
Retrieval Engine
User Interface
Research issues:
- knowledge representation
Query Interpreter and Expander
Answer Assembler
- creation of ontologies
- merging of ontologies
Retrieval Engine
wrapper
Ontology
Base
Databank
Knowledge
Base
databank
wrapper
Databank
Knowledge
Base
databank
31
View integration
databank
32
User Interface
Answer Assembler
Research issue:
Query Interpreter
and Expander generationAnswer
Assembler
- semi-automatic
of wrappers
Retrieval Engine
Ontology
Base
databank
View integration
User Interface
Query Interpreter and Expander
wrapper
wrapper
Ontology
Base
Retrieval Engine
wrapper
Research issues:
wrapper
- unified query language
- knowledge representation
Databank
Biological
- strategies for query
expansion
databank
Knowledge
- query optimizationdatabank
Base
33
wrapper
wrapper
Ontology
Base
Databank
Knowledge
Base
databank
databank
34
Integration - exercises
35
36
Integration systems
Discuss MedMaker, Information Manifold,
SIMS and InfoSleuth
•
•
•
•
•
•
•
What does the system focus on?
Describe the query language.
Describe the knowledge representation.
Describe the query processing approach.
Describe the system architecture.
What are the advantages of the system?
Interesting features?
37
Download