Lecture Notes

How a cell is wired
[Figure: interacting components of a cell: environment, DNA, mRNA, protein, regulatory RNA, and small molecules]
The dynamics of such interactions emerge as cellular
processes and functions
Molecular interaction networks
How do the genes and their products interact
to collectively perform a function?
[Figure: example gene interactions, panels A and B; labels: 35, U2AF, Gene G, RPM, Inhibitor, Functional]
Molecular interaction networks
A network containing genes connected to each other
whenever they physically or functionally interact
 Proteins that interact/co-complex (ribosomal, polymerase,
etc.)
 Transcription factors and their target
 Enzymes catalyzing different steps in the same metabolic
pathway
 Genes with correlation in expression
 Genes with similar phylogenetic profiles
Arabidopsis is the primary model
organism for plants
 Complex organization from molecular to whole organism
level.
 A key challenge …
 Understanding the cellular machinery that sustains this
complexity.
 In the current post-genomic era, a main aspect of this
challenge is ‘gene function prediction’:
 Identification of the functions of all the (~30,000) genes in the
genome.
Extent of gene annotations in Arabidopsis
Total of ~30,000 genes in the genome
 ~15% with some experimental annotation
 ~8% with ‘expert’ annotation
 ~13% with annotations based on manually curated computational analysis
 ~14% with electronic annotations
 Leaving ~50% of the genome without any annotation
Ashburner et al. (2000) Nat. Genet.
Swarbreck et al. (2008) Nucleic Acids Res.
Exploit high-throughput data
 Integrating functional genomic data could lead to
 Network models of gene interactions that resemble the
underlying cellular map.
 Typically these networks contain gene functional interactions
 Connecting pairs of genes that participate in the same
biological processes.
 In such a network, the position of a gene establishes the
functional context of that gene.
 ‘Guilt-by-association’: genes of unknown function can
be assigned the functions of their annotated neighbors.
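Guilt-by-association can be illustrated with a small sketch. The gene names, edge weights, and annotations below are hypothetical, and the weighted neighbor vote is only one simple way to transfer functions from annotated neighbors; it is not claimed to be the scoring used by any particular algorithm in these slides.

```python
from collections import defaultdict

# Hypothetical weighted functional interactions: (gene, gene, weight).
edges = [
    ("g1", "g2", 1.0),
    ("g1", "g3", 0.5),
    ("g2", "g4", 2.0),
    ("g3", "g4", 1.0),
]

# Hypothetical known annotations; g4 has no annotation yet.
annotations = {"g1": {"A"}, "g2": {"B"}, "g3": {"A"}}


def build_adjacency(edges):
    """Turn an undirected edge list into {gene: {neighbor: weight}}."""
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w
    return adj


def neighbor_votes(gene, adj, annotations):
    """Sum the edge weights to annotated neighbors, per function."""
    votes = defaultdict(float)
    for nbr, w in adj[gene].items():
        for func in annotations.get(nbr, ()):
            votes[func] += w
    return dict(votes)


adj = build_adjacency(edges)
votes = neighbor_votes("g4", adj, annotations)
best = max(votes, key=votes.get)  # function with the largest weighted vote
print(votes, best)
```

Here g4 touches g2 (function B, weight 2.0) and g3 (function A, weight 1.0), so B receives the larger vote.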
Functional interaction networks
 Functional interaction network models have been
developed for Arabidopsis.
 Lee et al. (2010) Rational association of genes with traits
using a genome-scale gene network for Arabidopsis thaliana.
 Very comprehensive in its use and integration of datasets
from other organisms for application in plants.
 Integrated 24 datasets: 5 from Arabidopsis and the
rest from other model organisms.
 AraNet: 19,647 genes, 1,062,222 interactions.
Goal of this study …
 We examine the state of network-based gene function
prediction in Arabidopsis.
 Evaluate the performance of multiple prediction algorithms
on AraNet.
 Assess the influence of the number of genes annotated to
a function and the source of annotation evidence.
 Compute the correlation of prediction performance with
network properties.
 Evaluate prediction performance for plant-specific functions.
Network-based gene function prediction algorithms
[Figure: classification of the algorithms along two axes]
 By approach
 Propagation of functional annotations across the network
 Guilt-by-association using direct interactions
 By training examples
 Use positive and negative examples: SinkSource, Hopfield, Local
 Use only positive examples: FunctionalFlow with multiple phases, FunctionalFlow with 1 phase, Local+
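The slides name SinkSource but do not state its update rule. The sketch below assumes the commonly described formulation: positive examples are held at score 1, negative examples at 0, and every other gene is repeatedly set to the weighted average of its neighbors' scores. The function name, the fixed iteration count, and the toy path network are all illustrative assumptions.

```python
def sinksource(adj, positives, negatives, n_iter=100):
    """Assumed SinkSource-style propagation.

    adj: {node: {neighbor: weight}} for an undirected weighted network.
    Returns {node: score}, with positives fixed at 1.0 and negatives at 0.0.
    """
    scores = {n: 0.0 for n in adj}
    for n in positives:
        scores[n] = 1.0
    fixed = set(positives) | set(negatives)

    for _ in range(n_iter):
        updated = dict(scores)
        for n in adj:
            if n in fixed:
                continue  # labeled genes keep their scores
            total = sum(adj[n].values())
            if total > 0:
                # Weighted average of the neighbors' current scores.
                updated[n] = sum(w * scores[m] for m, w in adj[n].items()) / total
        scores = updated
    return scores


# Toy path network p - u1 - u2 - q with unit weights (hypothetical).
adj = {
    "p": {"u1": 1.0},
    "u1": {"p": 1.0, "u2": 1.0},
    "u2": {"u1": 1.0, "q": 1.0},
    "q": {"u2": 1.0},
}
scores = sinksource(adj, positives={"p"}, negatives={"q"})
print(scores)
```

On this path the unlabeled gene next to the positive example settles near 2/3 and the one next to the negative example near 1/3, so genes closer to positives rank higher.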
Network-based gene function prediction
[Figure: example network with genes annotated to Function A and Function B]
In this study …
SinkSource
Precision: fraction of predictions that are correct
Precision = TP / (TP + FP)
Recall: fraction of known examples predicted correctly
Recall = TP / (TP + FN)
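As a small worked example of the two definitions, the sketch below counts true positives (TP), false positives (FP), and false negatives (FN) from a set of predicted genes and a set of known genes. The gene identifiers are hypothetical.

```python
def precision_recall(predicted, known):
    """Return (precision, recall) for a set of predictions against known genes."""
    predicted, known = set(predicted), set(known)
    tp = len(predicted & known)   # predicted and truly annotated
    fp = len(predicted - known)   # predicted but not annotated
    fn = len(known - predicted)   # annotated but missed
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if known else 0.0
    return precision, recall


# Hypothetical example: 4 predictions, 3 known genes, 2 in common.
predicted = {"g1", "g2", "g3", "g4"}
known = {"g1", "g2", "g5"}
precision, recall = precision_recall(predicted, known)
print(precision, recall)  # TP = 2, FP = 2, FN = 1
```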
Performance of different algorithms
 Computational gene function prediction precedes
and guides experimental validation
 What we get is a ranked list of novel predictions
 An experimenter would choose a manageable
number of top-scoring predictions to pursue
 Precision at the top of the prediction list
 We choose precision at 20% recall (P20R) as the
measure of performance
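One way to compute P20R from a ranked prediction list is sketched below. It assumes the simplest reading of the measure: walk down the list and report the precision at the first rank where recall reaches 20%. The exact procedure used in the study (for example, how ties or cross-validation folds are handled) is not given in these slides, and the gene names are hypothetical.

```python
def precision_at_recall(ranked_genes, known, target_recall=0.2):
    """Precision at the first rank where recall reaches target_recall."""
    known = set(known)
    if not known:
        raise ValueError("need at least one known gene")
    tp = 0
    for rank, gene in enumerate(ranked_genes, start=1):
        if gene in known:
            tp += 1
            if tp / len(known) >= target_recall:
                return tp / rank  # TP / (TP + FP) among the top `rank` genes
    return 0.0  # the target recall was never reached


# Hypothetical function with 10 known genes, k0 .. k9.
known = {f"k{i}" for i in range(10)}
# Ranked predictions, best first; x* are genes not annotated to the function.
ranked = ["x1", "k0", "k1", "x2", "k2", "x3"]
p20r = precision_at_recall(ranked, known)
print(p20r)
```

With 10 known genes, 20% recall needs 2 true positives. The second one appears at rank 3, so the precision there is 2/3.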
Performance of different algorithms
SS seems to be better than the other algorithms
[Figure: box plots of algorithm performance showing 1st quartile, median, and 3rd quartile; one set uses only annotations based on experimental/expert evidence]
What about the influence of the number of genes in a function?
Performance of different algorithms
[Figure: number of functions plotted against the number of genes annotated with a function; the functions are split into a first, second, and third group, each containing ~125 functions]
Performance of different algorithms
For ‘small’ functions, the algorithm does not matter!
And using just experimental annotations is better when you
know little about a function.
For ‘medium’ functions, SS is a little better and the effect of
using ‘electronic’ evidence is mixed.
For ‘large’ functions
- SS is clearly the best
- Using all annotations is better
Performance of different algorithms
Wilcoxon test: SS vs. other algorithms
[Table: results for two sets of annotations, all evidence codes (ECs) and sans IEA/ISS]
Overall, SinkSource appears to be the best algorithm.
Correlation of performance with network
properties
 Performance on a particular function might depend on
how its genes are organized / connected among themselves
in the network.
 Number of nodes
 Number of components
 Fraction of nodes in the largest connected component
 Total edge weight
 Weighted density
 Average weighted degree
 Average segregation
Correlation of performance with network
properties
 Number of nodes = 9
 Number of components = 3
 Fraction of nodes in the
largest connected
component = 4/9
 Total edge weight = 8
 Weighted density = 8/36
 Average weighted degree =
16/9
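The values above can be reproduced on a small graph. The slide's own figure is not available here, so the edge list below is a hypothetical network chosen to match the stated numbers. The formulas are inferred from those numbers: weighted density as total edge weight over the number of possible node pairs (36 for 9 nodes), and average weighted degree as twice the total edge weight over the number of nodes.

```python
from fractions import Fraction

# Hypothetical unit-weight network: a 4-cycle, a triangle, and a single edge.
edges = [
    ("a", "b", 1), ("b", "c", 1), ("c", "d", 1), ("d", "a", 1),
    ("e", "f", 1), ("f", "g", 1), ("g", "e", 1),
    ("h", "i", 1),
]
nodes = sorted({n for u, v, _ in edges for n in (u, v)})


def connected_components(nodes, edges):
    """Return the connected components as a list of node sets."""
    adj = {n: set() for n in nodes}
    for u, v, _ in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:  # depth-first traversal from `start`
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        components.append(comp)
    return components


n = len(nodes)
components = connected_components(nodes, edges)
total_weight = sum(w for _, _, w in edges)
largest_fraction = Fraction(max(len(c) for c in components), n)
weighted_density = Fraction(total_weight, n * (n - 1) // 2)
avg_weighted_degree = Fraction(2 * total_weight, n)

print(n, len(components), largest_fraction, total_weight,
      weighted_density, avg_weighted_degree)
```

`Fraction` reduces 8/36 to 2/9, which is the same value as on the slide.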
Correlation of performance with network
properties
Functional modularity: average segregation
 Avg. seg = 8/22
 Avg. seg = 12/15
Correlation of performance with network
properties
 We have …
 Vector of SS P20R values for each function
 Vector of values of a particular topological property for each function
 Spearman rank correlation
[Figure: P20R plotted against weighted density]
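The Spearman rank correlation between the two vectors can be sketched in plain Python as the Pearson correlation of the ranks, with tied values given their average rank. The P20R and weighted density values below are made up for illustration and are not results from the study.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-function values (one entry per function).
p20r = [0.10, 0.35, 0.50, 0.80]
weighted_density = [0.01, 0.05, 0.04, 0.20]
rho = spearman(p20r, weighted_density)
print(round(rho, 3))
```

In this toy data the ranks are (1, 2, 3, 4) and (1, 3, 2, 4), which gives a correlation of 0.8.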
Correlation of performance with network
properties
[Figure: Spearman rank correlation between performance and each network property]
Performance on plant-specific
functions
 The underlying network is built based on data from
multiple non-plant species
[Figure: box plots of performance showing 1st quartile, median, and 3rd quartile; one set uses only annotations based on experimental/expert evidence]
For ‘conserved’ functions
- Performance is better than that for all functions
- Using all annotations is better
For ‘plant-specific’ functions
- Performance is much worse compared to ‘conserved’ functions
- Using only experimental annotations is better
Most predictable ‘conserved’
functions
 protein folding
 nucleotide transport
 innate immunity
cytoskeleton organization, and
 cell cycle
Least predictable ‘conserved’
functions
Specialized functions
 regulation of …
Most predictable ‘plant-specific’
functions
Contribution from
Arabidopsis datasets
cell wall modification
auxin/cytokinin signaling, and
photosynthesis
Least predictable ‘plant-specific’ functions
development, morphogenesis
pattern formation
phase transitions of various tissues, organs / growth stages
Conclusions
 Evaluated the performance of various prediction algorithms on
AraNet.
 SinkSource is the overall best prediction algorithm.
 Measured the influence of the number of genes annotated to a
function and the source of annotation evidence.
 All algorithms perform poorly when only a small number of genes
are ‘known’ or when annotating very specific functions.
 When only a small number of genes are ‘known’, use only
experimentally verified annotations to make new predictions.
 When a considerable number of genes are ‘known’, use all
annotations to make new predictions.
Conclusions
 Measured the correlation of performance with
network properties
 Several topological properties correlate well with
performance.
 ‘Average segregation’ has the strongest correlation.
Conclusions
 Assessed performance on conserved/plant-specific
functions
 Performance on basic ‘conserved’ functions is better
than that for all the functions.
 Specialized ‘conserved’ functions are hard to predict.
 Performance on ‘plant-specific’ functions is very poor.
 Also a consequence of the fact that ‘plant-specific’
functions generally have a small number of annotations.
Conclusions
 Avenues for improvement in functional interaction
networks
 Build functional interaction networks that are based on
a larger collection of plant datasets.
 Rely as little as possible on data from other
species.
 Avenues for future experimental work
 ‘Plant-specific’ functions and
 Specialized ‘conserved’ functions.
Acknowledgements
 Arjun Krishnan
 Brett Tyler
 Andy Pereira