Exploring and Exploiting the Biological Maze
Zoé Lacroix
Arizona State University

Data collection queries

Scientific protocol
– Must be able to reproduce the process

Involve multiple resources
– Data sources
– Applications
Expressing scientific protocols
Scientific protocols mix design and
implementation
Design

– What the protocol does (tasks)
– Scientific objects involved

Implementation
– How the protocol is executed
– Data sources and applications
Expressing scientific protocols

Scientific protocols are driven by their
implementation
– Scientists use the resources they know
• data (quality)
• access to data
• format, limits, etc.
– Scientists may not exploit better resources
because they do not know them

Queries should be driven by the design, the
implementation should meet the design
needs
Example* - Pipeline for Analysis of
Protein Variation Due to Alternative Splicing
and SNPs
The alternative splicing pipeline will provide a
complete characterization of variations in
proteins due to splice variation or SNPs evident
in repositories of contiguous genome sequence
data and expressed sequence tags (ESTs). The
pipeline applies secondary structure, tertiary
structure, domain motif detection and sequence
comparison tools to proteins encoded by genes
with alternatively spliced forms or SNPs.
*Courtesy of Dr. Marta Janer, Institute for Systems Biology
Step 2 - Pipeline for Analysis of Protein
Variation Due to Alternative Splicing and
SNPs
From GenBank, dbEST and the RIKEN Clone
Collection, collect all EST and full-length
cDNA sequences from the target organisms
of interest (in this case, human and mouse)
that match the query proteins (mouse DNA
binding proteins) using tblastn. Map the query
protein to the target DNA sequences, keeping
track of which query amino acids correspond
to which nucleotides.
The text of Step 2 combines four kinds of elements:
– Data sources
– Tools
– Tasks
– Scientific objects
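As an illustration, Step 2 can be written down as a record that keeps its design parts (tasks, scientific objects) apart from its implementation parts (data sources, tools). This is only a sketch: the class `ProtocolStep` and its field names are hypothetical, and the assignment of phrases to categories is one reading of the step's text, not taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class ProtocolStep:
    """One protocol step, split into design and implementation parts."""
    name: str
    tasks: list[str] = field(default_factory=list)               # design
    scientific_objects: list[str] = field(default_factory=list)  # design
    data_sources: list[str] = field(default_factory=list)        # implementation
    tools: list[str] = field(default_factory=list)               # implementation

    def design(self) -> dict[str, list[str]]:
        """What the step does and on which scientific objects."""
        return {"tasks": self.tasks,
                "scientific_objects": self.scientific_objects}

    def implementation(self) -> dict[str, list[str]]:
        """Which resources are used to execute the step."""
        return {"data_sources": self.data_sources, "tools": self.tools}


# Assumed reading of Step 2 of the alternative splicing pipeline.
step2 = ProtocolStep(
    name="Step 2",
    tasks=["collect matching sequences",
           "map query protein to target DNA sequences"],
    scientific_objects=["EST", "full-length cDNA sequence", "protein",
                        "amino acid", "nucleotide"],
    data_sources=["GenBank", "dbEST", "RIKEN Clone Collection"],
    tools=["tblastn"],
)

print(step2.design())
print(step2.implementation())
```

Keeping the two views separate makes it possible to change the data sources or tools without touching the description of what the step is meant to achieve.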
Pipeline Selecting Target Proteins*
[Figure: pipeline diagram involving SwissProt, SMART, BIND, DIP, CEY2H, sigpep and blast x D.mel]
Step 1 = retrieve all proteins from SMART and
Swiss-Prot with textual search with the keyword
“apoptosis”
Step 2 = retrieve all proteins from Swiss-Prot with a
signal peptide feature and the keyword “apoptosis”
Step 3 = retrieve their binding partners from DIP,
BIND and the C. elegans dataset
Step 4 = run through a signal peptide prediction
program such as SigPep to check for the presence
of signal peptides in each of the sequences
Step 5 = homology search using BLAST of the
retrieved sequences with proteins predicted from
the Drosophila melanogaster genome might yield
additional candidates
Output = final set of signal peptide proteins involved
in apoptosis
*Courtesy of Dr. Terry Gaasterland, The Rockefeller University
Design and implementation
Step   | Task | Implementation
Input  | Relevant keyword for which the proteins are required |
Step 1 | All proteins with keyword and with signal peptide feature must be retrieved | SMART, Swiss-Prot
Step 2 | Binding partners of all of these proteins are retrieved | DIP, BIND
Step 3 | Integration into final set is run through a signal peptide prediction program | SigPep
Step 4 | Homology search of the retrieved sequences with proteins predicted from the specific genome yields additional candidates | BLAST
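The separation in the table above can be sketched as two independent structures: an ordered list of tasks (design) and a binding from each task to resources (implementation). The functions `bind` and `rebind`, the step identifiers and the resource name `OtherSignalPredictor` are hypothetical and used only for illustration.

```python
# Design: ordered tasks, independent of any particular resource.
DESIGN = [
    ("step1", "retrieve proteins with keyword and signal peptide feature"),
    ("step2", "retrieve binding partners of these proteins"),
    ("step3", "run integrated set through signal peptide prediction"),
    ("step4", "homology search against proteins predicted from a genome"),
]

# Implementation: resources bound to each design step (from the table).
IMPLEMENTATION = {
    "step1": ["SMART", "Swiss-Prot"],
    "step2": ["DIP", "BIND"],
    "step3": ["SigPep"],
    "step4": ["BLAST"],
}


def bind(design, implementation):
    """Pair each task with its resources; fail if a task has none."""
    plan = []
    for step_id, task in design:
        resources = implementation.get(step_id)
        if not resources:
            raise ValueError(f"no resource implements {step_id}: {task}")
        plan.append((step_id, task, tuple(resources)))
    return plan


def rebind(implementation, step_id, resources):
    """Return a new implementation with one step's resources replaced."""
    updated = dict(implementation)
    updated[step_id] = list(resources)
    return updated


plan = bind(DESIGN, IMPLEMENTATION)
# Swap the predictor without changing the design (hypothetical resource).
alt_plan = bind(DESIGN, rebind(IMPLEMENTATION, "step3", ["OtherSignalPredictor"]))

for step_id, task, resources in plan:
    print(step_id, "|", task, "|", ", ".join(resources))
```

Because the design list never changes, two plans built from different bindings can be compared step by step.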
Expressing scientific pipelines
with BioNavigation

Queries are expressed at a conceptual
level (design)
[Figure: conceptual level with the scientific classes Protein Seq., DNA Seq., Citation, Disease and Gene]
Conceptual graph

Labeled edges
– Scientific meaningful edges
[Figure: conceptual graph. Concepts: Nucleotide Sequence, DNA, RNA, mRNA, Gene, Protein. Edge labels: isA, isTranscribedFrom, transcribesTo, translatesTo, isTranslatedFrom]
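A conceptual graph with labeled edges can be sketched as a list of (source, label, target) triples. The concepts and edge labels below come from the slide, but the endpoints of each edge are an assumption based on common biological usage, since the figure itself is not available. The helper names `build_graph` and `follow` are hypothetical.

```python
from collections import defaultdict

# (source, label, target) triples; edge endpoints are assumed.
EDGES = [
    ("DNA", "isA", "Nucleotide Sequence"),
    ("RNA", "isA", "Nucleotide Sequence"),
    ("mRNA", "isA", "RNA"),
    ("Gene", "isA", "DNA"),
    ("Gene", "transcribesTo", "mRNA"),
    ("mRNA", "isTranscribedFrom", "Gene"),
    ("mRNA", "translatesTo", "Protein"),
    ("Protein", "isTranslatedFrom", "mRNA"),
]


def build_graph(edges):
    """Index outgoing (label, target) pairs by source concept."""
    graph = defaultdict(list)
    for source, label, target in edges:
        graph[source].append((label, target))
    return graph


def follow(graph, start, labels):
    """Follow a sequence of edge labels from start; return reached concepts."""
    current = {start}
    for label in labels:
        current = {target
                   for node in current
                   for (edge_label, target) in graph.get(node, [])
                   if edge_label == label}
    return current


graph = build_graph(EDGES)
print(follow(graph, "Gene", ["transcribesTo", "translatesTo"]))  # {'Protein'}
```

With labels on the edges, a query can ask for a specific relationship between two concepts instead of any connection between them.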
Conceptual graph
[Figure: the same conceptual graph (Nucleotide Sequence, DNA, RNA, mRNA, Gene, Protein) with additional IsRelatedTo edges alongside isA, isTranscribedFrom, transcribesTo, translatesTo and isTranslatedFrom]
Mapping to physical resources
[Figure: mapping between the conceptual level and the physical level]
Conceptual level (scientific classes): Protein Seq., DNA Seq., Citation, Disease, Gene
Physical level (data sources): PubMed, GenBank, HUGO, NCBI Protein, OMIM
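The mapping between the two levels can be sketched as a dictionary from scientific classes to data sources. The class and source names are from the slide; which class maps to which source is an assumption (the edges of the figure are not available), and `physical_paths` is a hypothetical helper.

```python
from itertools import product

# Assumed mapping from conceptual classes to physical data sources.
CLASS_TO_SOURCES = {
    "Citation": ["PubMed"],
    "DNA Seq.": ["GenBank"],
    "Gene": ["HUGO"],
    "Protein Seq.": ["NCBI Protein"],
    "Disease": ["OMIM"],
}


def physical_paths(conceptual_path, mapping):
    """Expand a path of classes into every path of data sources."""
    missing = [c for c in conceptual_path if c not in mapping]
    if missing:
        raise KeyError(f"no data source mapped for: {missing}")
    return list(product(*(mapping[c] for c in conceptual_path)))


# A conceptual query from diseases to citations through protein sequences.
paths = physical_paths(["Disease", "Protein Seq.", "Citation"], CLASS_TO_SOURCES)
print(paths)  # [('OMIM', 'NCBI Protein', 'PubMed')]
```

When a class is mapped to several sources, the same conceptual path expands into several physical paths, which is where the choice between resources arises.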
Exploring biological metadata

“Return all citations that are related to some
disease or condition”

Diabetes: 11   Aging: 71   Cancer: 391
[Figure: paths from OMIM to PUBMED, labeled P1, P2 (NUCLEOTIDE) and P3 (PROTEIN)]
• Link: Entrez provides an index with
the Links in the display option from
each entry
• Parse: Parsing each entry to
retrieve its related entries
• All: Entrez provides an index with
the Links in the display option, which
allows looking at a set of entries at a
time
Selecting biological resources

3 resources that look the same
– Are they the same?

3 paths that will retrieve PubMed entries
related to citations
– Do they have the same semantics?
Results for the disease conditions
diabetes, aging and cancer
        Diabetes                      Aging                         Cancer
Path    Link      Parse     All       Link      Parse     All       Link      Parse     All
P1      43,890    43,747    44,037    48,393    48,398    48,393    56,315    56,315    56,532
P2      42,969    43,090    43,581    51,712    51,855    51,474    54,487    54,607    52,488
P3      59,959    51,906    49,719    60,129    61,260    60,938    62,686    63,367    60,033
Overlap results for the disease
conditions diabetes
        Link                          Parse                         All
        P1        P2        P3        P1        P2        P3        P1        P2        P3
P1      100%      25.28%    29.98%    100%      29.18%    33.60%    100%      24.64%    27.42%
P2      25.82%    100%      97.68%    23.93%    100%      97.81%    24.75%    100%      90.68%
P3      21.95%    70.00%    100%      22.87%    81.20%    100%      24.29%    79.49%    100%
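The overlap figures are not symmetric, which suggests that each cell gives the share of the row path's results that also appear in the column path's results. That reading is an assumption, but it agrees with the reported numbers: for Link, 97.68% of 42,969 (P2) and 70.00% of 59,959 (P3) both come to about 41,970 shared entries. A minimal sketch on toy identifiers (the function names and data are hypothetical):

```python
def overlap_percent(row_set, col_set):
    """Percentage of row_set entries that also occur in col_set."""
    if not row_set:
        return 0.0
    return 100.0 * len(row_set & col_set) / len(row_set)


def overlap_matrix(results):
    """Build a row-by-column overlap table for named result sets."""
    return {
        row: {col: round(overlap_percent(results[row], results[col]), 2)
              for col in results}
        for row in results
    }


# Toy result sets standing in for the PubMed identifiers of each path.
results = {
    "P1": {1, 2, 3, 4},
    "P2": {3, 4, 5, 6, 7},
    "P3": {3, 4, 5, 6, 7, 8, 9, 10},
}

matrix = overlap_matrix(results)
for row, cells in matrix.items():
    print(row, cells)
```

In the toy data P2 is fully contained in P3, so the P2 row shows 100% against P3 while the P3 row shows only 62.5% against P2, the same kind of asymmetry seen between P2 and P3 in the table.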
Evaluating resources

Similar applications
– Different outputs

Similar data sources
– Different output

Number of resources
– Different output

Order of resources
– Different output
Exploiting semantics of resources
Number of entries

Characterization of entries (number of
attributes)

Time

Exploiting the semantics of links
BioNavigation (joint work with Louiqa
Raschid and Maria-Esther Vidal)

Conceptual graph
– No labeled links

Queries
– Regular expressions of concepts

ESearch
– Path cardinality - number of instances of paths of the
result. For a path of length 1 between two sources S1 and
S2, it is the number of pairs (e1, e2) of entries e1 of S1
linked to an entry e2 of S2.
– Target Object Cardinality – number of distinct objects
retrieved from the final data source.
– Evaluation Cost – cost of the evaluation plan, which
involves both the local processing cost and remote network
access delays.
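The first two metrics can be sketched on toy link tables, where each table is a set of (entry, linked entry) pairs between two consecutive sources on a path. For paths longer than 1, the sketch assumes that a path instance is a chain of linked entries; the entry identifiers and function names are hypothetical, and evaluation cost is not modeled.

```python
def path_instances(link_tables):
    """Enumerate path instances by chaining link tables.

    link_tables[i] is a set of (entry of source i, entry of source i+1).
    """
    if not link_tables:
        return []
    paths = [tuple(pair) for pair in sorted(link_tables[0])]
    for table in link_tables[1:]:
        successors = {}
        for left, right in table:
            successors.setdefault(left, []).append(right)
        paths = [path + (nxt,)
                 for path in paths
                 for nxt in sorted(successors.get(path[-1], []))]
    return paths


def path_cardinality(link_tables):
    """Number of instances of the path."""
    return len(path_instances(link_tables))


def target_object_cardinality(link_tables):
    """Number of distinct objects reached in the final data source."""
    return len({path[-1] for path in path_instances(link_tables)})


# Toy links: OMIM entries -> Protein entries -> PubMed entries.
omim_to_protein = {("o1", "p1"), ("o1", "p2"), ("o2", "p2")}
protein_to_pubmed = {("p1", "c1"), ("p2", "c1"), ("p2", "c2")}

print(path_cardinality([omim_to_protein]))                              # 3
print(path_cardinality([omim_to_protein, protein_to_pubmed]))           # 5
print(target_object_cardinality([omim_to_protein, protein_to_pubmed]))  # 2
```

The toy path has five instances but reaches only two distinct citations, which shows why path cardinality and target object cardinality are tracked separately.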
Work in progress

Conceptual graph
– Labeled links

Queries
– Complex dataflows

Physical graph
– Access to a BioMetaDatabase
– Data sources
– Applications
Representing the conceptual
graph in Protégé
Visualization Limitations in Protégé

Using the GraphViz plugin
– Shows only IsA hierarchy

TgiViz plugin
Conclusion
Scientists need support to select
resources to express their protocols

Semantics of resources may be
exploited to enhance the data collection
process

Need for a repository of biological
metadata (BioMetaDatabase)
