Provenance issues in 3R systems Lawrence Hunter, Ph.D. Director, Computational Bioscience Program

Provenance issues in 3R systems
Lawrence Hunter, Ph.D.
Director, Computational Bioscience Program
University of Colorado School of Medicine
Larry.Hunter@uchsc.edu
http://compbio.uchsc.edu/Hunter
What is a 3R system?
• Reading, Reasoning & Reporting.
• Used to facilitate knowledge-based analysis of genome-scale experimental data in biology
• SM Leach, et al. “Biomedical Discovery Acceleration, with Applications to Craniofacial Development.” PLoS Comput Biol 2009, 5(3): e1000215. doi:10.1371/journal.pcbi.1000215
• http://hanalyzer.sourceforge.net
Understanding Gene Lists
• There is no “gene” for any complex phenotype;
gene products function together in dynamic
groups
• A key task is to understand why a set of gene
products are grouped together in a condition,
exploiting all existing knowledge about:
– The genes (all of them)
– Their relationships (|genes|²)
– The condition(s) under study.
Knowledge-based data analysis
• Goal: Bring all existing knowledge (and more!) to bear on explaining experimental results.
• How?
– Integrate multiple databases
(using the semantic web)
– Extract knowledge from the literature
– Infer implicit interactions
– Represent with knowledge networks
• Nodes are social fiducials, like gene IDs or ontology terms
• Arcs (relations) are qualified (typed) and quantified (with
reliability)
– Deliver a tool for biomedical analysts to use knowledge
networks to explain results and generate hypotheses
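The knowledge network described above (identifier nodes, arcs that are both typed and given a reliability) can be pictured with a small data structure. This is a minimal sketch, not the Hanalyzer implementation: the class names `Arc` and `KnowledgeNetwork`, the `evidence` field, the relation name `annotated_with`, and the reliability value 0.5 are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Arc:
    source: str         # node identifier, e.g. a gene ID
    target: str         # node identifier, e.g. a gene ID or ontology term
    relation: str       # arc type (the "qualified" part)
    reliability: float  # estimated reliability in [0, 1] (the "quantified" part)
    evidence: str       # hypothetical provenance field: where the assertion came from


@dataclass
class KnowledgeNetwork:
    arcs: list = field(default_factory=list)

    def add(self, arc):
        # Reject reliabilities that cannot be read as a probability-like score.
        if not 0.0 <= arc.reliability <= 1.0:
            raise ValueError("reliability must lie in [0, 1]")
        self.arcs.append(arc)

    def between(self, a, b):
        """Return all arcs linking a and b, in either direction."""
        return [x for x in self.arcs if {x.source, x.target} == {a, b}]


net = KnowledgeNetwork()
# The reliability 0.5 is a made-up placeholder, not a value from the talk.
net.add(Arc("MGI:94876", "GO:0042423", "annotated_with", 0.5, "GO annotation"))
print(net.between("GO:0042423", "MGI:94876"))
```

Keeping the evidence source on every arc is what later lets a reader ask where a link came from, which is the provenance question the rest of the talk raises.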
[Figure: 3R system architecture. Stage labels: External sources; Reading methods; Reasoning methods; Network integration methods; Reporting methods. Components: Ontology annotations; Medline abstracts; Gene database 1 … Gene database n; Experimental data; Parsers & Provenance tracker; Biomedical language processing; Literature co-occurrence; Ontology enrichment; Co-annotation inference; Co-database inference; Semantic integration; Reliability estimation; Data Network; Knowledge Network; Visualization & Drill-down tool.]
Semantic database integration
• More than 1,000 peer-reviewed gene databases:
– Annotations to function, location, process, disease, etc. ontologies
– Linkages to many sorts of experimental and derived data
(GWAS, expression, structure, pathways, population frequencies)
– Linkages to publications that report evidence relevant to them
– A “Gene” is not one thing:
• DNA locus
• Specific allele (Coding sequence and/or regulatory sequences)
• Varieties of gene products: intermediate and functional RNAs, proteins, alternative splices
• Post-translational variations (cleavage, phosphorylation) to activate, transport
• Many can be integrated into a single, unified
network using gene or publication identifiers.
– Identifier cross-reference lists increasingly reliable
– Increasing coordination and standardization among providers
• Many provenance challenges remain
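Integration through identifier cross-reference lists, as described above, might look like the following sketch. It assumes a cross-reference table is already available as a dictionary; the function name `integrate`, the database names `db_a` and `db_b`, the local IDs, and the facts are hypothetical. Each merged fact keeps its originating database and local ID, so the provenance survives the merge.

```python
from collections import defaultdict


def integrate(records, xref):
    """Group (database, local_id, fact) records under one canonical gene ID.

    `xref` maps (database, local_id) -> canonical ID. Records whose ID has no
    cross-reference are returned separately instead of being guessed at.
    """
    merged = defaultdict(list)
    unmapped = []
    for database, local_id, fact in records:
        canonical = xref.get((database, local_id))
        if canonical is None:
            unmapped.append((database, local_id, fact))
        else:
            # Keep the source database and local ID alongside each fact.
            merged[canonical].append(
                {"fact": fact, "database": database, "local_id": local_id}
            )
    return dict(merged), unmapped


# Hypothetical records and cross-references, for illustration only.
records = [
    ("db_a", "A:1", "fact 1"),
    ("db_b", "B:77", "fact 2"),
    ("db_b", "B:99", "fact 3"),
]
xref = {("db_a", "A:1"): "MGI:94876", ("db_b", "B:77"): "MGI:94876"}
merged, unmapped = integrate(records, xref)
print(merged)
print(unmapped)
```

Returning unmapped records rather than dropping them is a deliberate choice in this sketch: a missing cross-reference is itself a provenance fact worth keeping visible.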
OpenDMAP extracts typed relations from the literature
• Concept recognition tool
– Connect ontological terms to literature instances
– Built on Protégé knowledge representation system
• Language patterns associated with concepts and
slots
– Patterns can contain text literals, other concepts,
constraints (conceptual or syntactic), ordering
information, or outputs of other processing.
– Linked to many text analysis engines via UIMA
• Best performance in BioCreative II IPS task
• >500,000 instances of three predicates (with arguments) extracted from Medline Abstracts
[Hunter, et al., 2008] http://bionlp.sourceforge.net
Network inference issues
[Figure: inferring a link between two genes through shared GO/ChEBI terms]
• Ddc; MGI:94876, GO biological process (BP) annotations:
– carboxylic acid metabolic process (GO:0019752)
– catecholamine biosynthesis process (GO:0042423)
– response to toxin (GO:0009636)
• Cadps; MGI:1350922, GO biological process (BP) annotations:
– catecholamine secretion (GO:0050432)
– protein transport (GO:0015031)
– vesicle organization (GO:0016050)
– …
• ChEBI terms: catechols (CHEBI:33566), catecholamines (CHEBI:33567), adrenaline (CHEBI:33568), noradrenaline (CHEBI:335
• Candidate path, each arc marked “?”: MGI:94876 ? GO:0042423 ? CHEBI:33567 ? GO:0050432 ? MGI:1350922
• Reliability = 0.009740
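The path in this figure, from one gene through a GO term to a shared ChEBI term and back through another GO term to a second gene, can be sketched as a small inference routine. This is an illustration under assumptions, not the system's actual algorithm: the function name `infer_links` and the dictionary layout are hypothetical, and the GO-to-ChEBI links are entered by hand to mirror the diagram. How a reliability is assigned to the inferred link is not covered here.

```python
from itertools import combinations


def infer_links(gene_to_go, go_to_chebi):
    """Return {(gene_a, gene_b): [(go_a, chebi, go_b), ...]} for gene pairs
    whose GO annotations point at a common ChEBI term."""
    # Index: ChEBI term -> list of (gene, GO term) pairs that reach it.
    reach = {}
    for gene, go_terms in gene_to_go.items():
        for go in go_terms:
            for chebi in go_to_chebi.get(go, ()):
                reach.setdefault(chebi, []).append((gene, go))
    links = {}
    for chebi, hits in reach.items():
        for (g1, go1), (g2, go2) in combinations(hits, 2):
            if g1 != g2:
                key = tuple(sorted((g1, g2)))
                # Orient the path so it starts at the first gene of the key.
                path = (go1, chebi, go2) if key[0] == g1 else (go2, chebi, go1)
                links.setdefault(key, []).append(path)
    return links


gene_to_go = {
    "MGI:94876": ["GO:0019752", "GO:0042423"],    # Ddc
    "MGI:1350922": ["GO:0050432", "GO:0015031"],  # Cadps
}
# Assumed GO -> ChEBI links, following the diagram.
go_to_chebi = {"GO:0042423": ["CHEBI:33567"], "GO:0050432": ["CHEBI:33567"]}
links = infer_links(gene_to_go, go_to_chebi)
print(links)
```

Because every inferred link carries the full path that produced it, the inference remains inspectable, which matters when the resulting reliability is as low as the one shown in the figure.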
Inferred interactions
• Dramatically increase coverage…
• But at the cost of lower reliability
• We apply a new method to assess reliability without an explicit gold standard [Leach, et al., 2007; Gabow, et al., 2008]
[Figure: Top 1,000 Craniofacial genes (1,000,000 possible edges)]
Link calculations for MyoD1 ↔ MyoG
[Figure: evidence sources contributing to the MyoD1 ↔ MyoG link, each with a reliability R]
• Shared GO molecular functions: GO:3705…
• DMAP transport relation: PMID:16407395…
• R = 0.1034 and R = 0.0172 belong to the two sources above; which value goes with which source is not recoverable from the slide text
• Co-occurrence in abstracts: R = 0.0105
• Shared knockout phenotypes: MP:5374 … R = 0.018
• Shared GO cell component: GO:5667… R = 0.0284
• Shared interpro domains: IPR:11598… R = 0.0438
• Shared GO biological processes: GO:6139… R = 0.0190
• Premod_M interaction: Mod074699 R = 0.1005
• Inferred link through shared GO/ChEBI: ChEBI:16991 R = 0.01
• Correlation in expression data: Pdata = 0.4808
A real use
• WT Mouse craniofacial development: 3 tissues, 5 time points, 7 replicates each
• >1000 genes differentially expressed @ FDR<0.01
• Graph of top 1000 edges each from average or logistic or both (1734 edges total).
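Taking the top 1000 edges under each of two scorings and keeping their union can be sketched as follows. With 1000 edges from each ranking and 1734 in the union, inclusion-exclusion gives 1000 + 1000 − 1734 = 266 edges selected by both. The function names `top_k` and `combine_top_edges`, the edge labels, and the toy scores are assumptions for illustration; the talk does not show how the rankings were computed.

```python
import heapq


def top_k(scores, k):
    """Return the set of the k highest-scoring edges from {edge: score}."""
    return set(heapq.nlargest(k, scores, key=scores.get))


def combine_top_edges(ave_scores, logistic_scores, k):
    """Union the top-k edges of two rankings, labelling where each came from."""
    ave = top_k(ave_scores, k)
    logi = top_k(logistic_scores, k)
    labelled = {}
    for edge in ave | logi:
        if edge in ave and edge in logi:
            labelled[edge] = "both"
        elif edge in ave:
            labelled[edge] = "ave"
        else:
            labelled[edge] = "logistic"
    return labelled


# Toy scores (hypothetical), with k=2 standing in for the slide's 1000.
ave_scores = {"a": 3, "b": 2, "c": 1}
logistic_scores = {"a": 1, "b": 3, "c": 2}
combined = combine_top_edges(ave_scores, logistic_scores, k=2)
print(combined)
```

Labelling each edge by the ranking that selected it keeps one more piece of provenance: why the edge is in the graph at all.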
Strong data and background knowledge facilitate explanations
[Figure legend: Skeletal muscle structural components; Skeletal muscle contractile components; Proteins of no common family; AVE edges; Both edges]
• Goal is abductive inference (explanation): why are these genes doing this?
– Specifically, why the increase in mandible before the increase in maxilla, and not at all in the frontonasal prominence?
Exploring the knowledge network
• Scientist + aide + literature → explanation: tongue development
[Figure legend: Skeletal muscle structural components; Skeletal muscle contractile components; Proteins of no common family; AVE edges; Both edges]
• The delayed onset, at E12.5, of the same group of proteins during mastication muscle development.
• Myoblast differentiation and proliferation continue until E15, at which point the tongue muscle is completely formed.
• Myogenic cells invade the tongue primordia ~E11
Hypothesis generation
[Figure legend: HANISCH edges; AVE edges; Both edges; Inferred synapse signaling proteins; Inferred myogenic proteins; Proteins of no common family; Proteins in the previous AVE based sub-network]
• Add the strong data, weak background knowledge (Hanisch) edges to the previous network, bringing in new genes.
• Four of these genes not previously implicated in facial muscle development (1 almost completely unannotated)
Hypotheses confirmed
[Figure panels: Zim1, E12.5; E43rik, E12.5; ApoBEC2, E11.5; HoxA2, E12.5]
Genes ~ Bad Guys?
• Pirolli & Card, Int’l Conf. on Intelligence Analysis, 2005
3R Provenance issues
• Reasoning about the evidence for a link
(e.g. types of evidence, epistemic
relations)
• Identifying and avoiding redundant
evidence, while recognizing corroborating
evidence
• Capturing provenance aspects of
inference
– Automatic inferences have many
Reliability estimation
• Biological knowledge is provisional
• Reliability estimates are very important
• ECO used by GO for typing evidence
• Ontology for Biomedical Investigations (OBI)
– OBI models: investigation design; materials, protocols and instrumentation; data generated; and analysis performed.
Estimating and Combining Reliabilities
• Assessing and Combining Reliability of Protein Interaction Sources [Leach, et al., 2007]
• Many alternative approaches to combination:
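The slide leaves the list of combination approaches open. One commonly used option, shown here only as an illustration and not as the method the talk settles on, is noisy-OR: if each source i is right with reliability r_i and the sources are assumed independent, the combined reliability is 1 − ∏(1 − r_i). The function name `noisy_or` and the input values in the example are hypothetical.

```python
import math


def noisy_or(reliabilities):
    """Combined reliability under noisy-OR: the probability that at least
    one of several sources is right, ASSUMING the sources are independent."""
    for r in reliabilities:
        if not 0.0 <= r <= 1.0:
            raise ValueError(f"reliability out of range: {r}")
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


# Two made-up sources, each right half the time.
print(noisy_or([0.5, 0.5]))  # 0.75
```

The independence assumption is the weak point: if two sources restate the same underlying evidence, noisy-OR counts that evidence twice and overstates the combined reliability.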
Redundant vs. Corroborating
• Multiple sources of evidence for many assertions: Which are genuinely independent?
• Many databases curate from the same articles, but shared article ID alone doesn’t prove redundancy (multiple assertions per article)
• Some data comes directly from an experiment, but then is later peer reviewed and published: redundant or corroborating?
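One way to make the distinction on this slide concrete is to key evidence on the pair (assertion, article) rather than on the article alone. In the sketch below, copies of the same assertion curated from the same article by different databases collapse into one entry (redundant), while the same assertion backed by different articles stays separate (candidate corroboration). The function name `split_evidence`, the database names, article IDs, and assertions are all hypothetical, and the sketch does not settle the harder case of an experiment that is later published.

```python
from collections import defaultdict


def split_evidence(items):
    """items: iterable of (database, article_id, assertion) tuples.

    Returns {assertion: set of article_ids}. The same assertion curated from
    the same article by several databases counts once; distinct articles for
    one assertion are kept apart as candidate corroboration.
    """
    support = defaultdict(set)
    for _database, article_id, assertion in items:
        support[assertion].add(article_id)
    return dict(support)


# Made-up evidence items, for illustration only.
items = [
    ("db_a", "PMID:1", "A binds B"),
    ("db_b", "PMID:1", "A binds B"),      # same article, same assertion: redundant
    ("db_b", "PMID:1", "A regulates C"),  # same article, different assertion
    ("db_c", "PMID:2", "A binds B"),      # different article: may corroborate
]
support = split_evidence(items)
print(support)
```

Two articles for one assertion are only candidates for corroboration; they may still report the same underlying experiment, which this grouping cannot detect.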
Capturing complex inferences
• Goal of 3R systems is to support explanation and hypothesis generation; there are many computational and cognitive tasks involved.
• Want to capture (some aspects of) use
– Maintain an “analyst’s notebook” to facilitate gathering materials and sense-making; what needs to be tracked?
– Report to a server on what was done with the network (to calculate “value” or “interestingness”)
3R Provenance issues
• Reasoning about the evidence for a link
(e.g. types of evidence, estimating
reliability)
• Identifying and avoiding redundant
evidence, while recognizing corroborating
evidence
• Capturing provenance aspects of
inference
– Automatic inferences have many