Year 2

A. Overview of Work Completed
As I mapped out in my proposal, most of my work in the second year was applying my
library of structural and function templates to a moderate number of genomes, developing
new methods of predicting protein attributes (e.g. function) based on expression data, and
constructing an appropriate database infrastructure to handle all this. We have also
participated in a number of collaborations with experimental scientists in proteomics:
A Edwards and C Arrowsmith in Toronto, and L Regan and M Snyder at Yale. A related
highlight of the year was surveying pseudogenes in the worm. I have graphically
summarized some aspects of my current work in Exhibit 1.
Our work currently focuses on a standard dataset of 20 genomes: 18
prokaryotes plus yeast and worm. We are now working on fitting the fly, cress, and,
most importantly, human genomes into this framework. It is essential to integrate the human
genome into comparisons with microbial ones. Identifying families and folds in microbial
pathogens that are not in humans holds great promise for suggesting drug targets.
B. Specific Highlights from the Past Year
I summarize below in some detail the specific highlights from the past year, connecting
each to the relevant publication, the full citation of which is listed in section 5 below.
i. Analysis of Expression Data in Relation to Function
One of the aims of the grant was to develop new approaches to predicting protein
properties, such as structure, function, interactions, and localization, on a genome-wide
scale. We have approached this by using gene expression data. We observed that levels
of gene expression were closely correlated with a protein's eventual subcellular
localization, with high levels of gene expression characteristic of cytoplasmic proteins and
low levels characteristic of nuclear and membrane proteins (Drawid et al., 2000, TIG).
From this we were able to develop a system to integrate expression information
and sequence pattern information for yeast in a Bayesian network (Drawid & Gerstein,
2000, JMB). This allows the prediction of the subcellular localization of the ~4000 yeast
proteins with unknown localization. We also studied our ability to predict protein
function based on expression (Gerstein & Jansen, 2000, COSB), finding a relationship
that applied for certain classes of experiments and functions but did not apply globally.
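
The idea of combining expression and sequence-pattern evidence can be illustrated with a minimal sketch of sequential Bayesian updating. This is not the published system: the compartment list, the two features, the function names, and every probability below are hypothetical values chosen only to show the mechanics.

```python
from typing import Dict

Distribution = Dict[str, float]  # compartment name -> probability

def bayes_update(prior: Distribution, likelihood: Distribution) -> Distribution:
    """Apply Bayes' rule once: posterior is proportional to prior * likelihood."""
    unnormalized = {c: prior[c] * likelihood[c] for c in prior}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability in every compartment")
    return {c: p / total for c, p in unnormalized.items()}

def predict_localization(
    prior: Distribution,
    feature_likelihoods: Dict[str, Dict[str, Distribution]],
    observed: Dict[str, str],
) -> Distribution:
    """Fold in one observed feature at a time, treating features as independent."""
    posterior = dict(prior)
    for feature, value in observed.items():
        posterior = bayes_update(posterior, feature_likelihoods[feature][value])
    return posterior

# Hypothetical prior over three compartments (illustrative numbers only).
PRIOR = {"cytoplasm": 0.5, "nucleus": 0.3, "membrane": 0.2}

# Hypothetical P(feature value | compartment) tables (illustrative numbers only).
FEATURE_LIKELIHOODS = {
    "expression": {
        "high": {"cytoplasm": 0.7, "nucleus": 0.4, "membrane": 0.4},
        "low": {"cytoplasm": 0.3, "nucleus": 0.6, "membrane": 0.6},
    },
    "nls_motif": {  # stand-in for a sequence-pattern feature
        "present": {"cytoplasm": 0.1, "nucleus": 0.7, "membrane": 0.1},
        "absent": {"cytoplasm": 0.9, "nucleus": 0.3, "membrane": 0.9},
    },
}

posterior = predict_localization(
    PRIOR, FEATURE_LIKELIHOODS, {"expression": "low", "nls_motif": "present"}
)
best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 3))
```

Treating the features as conditionally independent keeps the update to one multiplication per feature; a fuller network could model dependencies between features at the cost of larger probability tables.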
ii. Genome-wide Characterization of Protein Function in Microbes
Following on our analysis of the relation between expression and function, we explored a
number of approaches to large-scale characterizations of protein function, clearly one of
the major goals of genome analysis. We developed methods of describing
functional shifts based on changing patterns of residue conservation (Naylor & Gerstein, 2000,
JME). We have begun preliminary work on the analysis of metabolic pathways in
pathogens (Das et al., 2000, J. Mol. Microbiol. Biotech.). Finally, we are collaborating with
Prof Michael Snyder on a large-scale functional analysis of the yeast genome. Prof
Snyder has developed a system for experimentally assessing the functions of thousands of
yeast genes with protein chips. We have helped interpret these experiments
computationally, developing a database for the results and clustering them (Zhu et al.,
2000, Nat. Genetics).
iii. Pseudogenes
We have published a survey of pseudogenes in the worm genome (Harrison et al., 2001,
NAR). This represents the analysis of a large metazoan genome in terms of protein
structure. It describes the occurrence of common folds and families in pseudogenes.
Some pseudogenes are highlighted as possibly being transferred from microorganisms.
iv. Collaboration on Experimental Structural Genomics of Microbes
We have done a variety of computational analyses designed to interface with
experimental structural genomics. This work is the direct experimental complement of
the computational work proposed in the grant. Our efforts are part of the Northeast
Structural Genomics Consortium (NESG). We have designed approaches to pick targets
prospectively for subsequent structural analysis and then to do retrospective data mining
of the results. In particular, we have collaborated with the Ontario Proteomics group led
by C Arrowsmith and A Edwards, helping them to analyze the results of their large-scale structural
genomics study of the archaeon M. thermoautotrophicum (Christendat et al., 2000, Nature
Struc. Biol.). Our analysis consisted of building decision trees to help predict which
proteins would perform well in high-throughput protein purification.
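
As a rough illustration of the decision-tree idea, the sketch below grows a small tree by greedy Gini-impurity splitting. It is not the analysis reported in the paper: the two features (sequence length and hydrophobic fraction), the example proteins, and all function names are assumptions made up for this example.

```python
from typing import Any, Dict, List, Optional, Tuple

Row = Tuple[List[float], bool]  # (feature vector, purified successfully?)

def gini(labels: List[bool]) -> float:
    """Gini impurity of a set of binary labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows: List[Row]) -> Optional[Tuple[float, int, float]]:
    """Return (weighted impurity, feature index, threshold) of the best split."""
    best = None
    for f in range(len(rows[0][0])):
        values = sorted({x[f] for x, _ in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold between adjacent values
            left = [y for x, y in rows if x[f] <= t]
            right = [y for x, y in rows if x[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(rows: List[Row], max_depth: int = 2) -> Dict[str, Any]:
    """Recursively split until the node is pure or max_depth is reached."""
    labels = [y for _, y in rows]
    majority = sum(labels) * 2 >= len(labels)
    split = None if max_depth == 0 or gini(labels) == 0.0 else best_split(rows)
    if split is None:
        return {"leaf": majority}
    _, f, t = split
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r in rows if r[0][f] <= t], max_depth - 1),
        "right": build_tree([r for r in rows if r[0][f] > t], max_depth - 1),
    }

def predict(tree: Dict[str, Any], x: List[float]) -> bool:
    """Walk from the root to a leaf and return its label."""
    while "leaf" not in tree:
        branch = "left" if x[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]

# Hypothetical training data: [sequence length, hydrophobic fraction] -> success.
ROWS: List[Row] = [
    ([150, 0.30], True), ([160, 0.33], True), ([180, 0.35], True),
    ([200, 0.28], True), ([220, 0.31], True), ([600, 0.32], False),
    ([700, 0.30], False), ([650, 0.34], False), ([170, 0.55], False),
]

tree = build_tree(ROWS, max_depth=2)
print(predict(tree, [175, 0.32]), predict(tree, [175, 0.58]))
```

A shallow tree like this has the practical advantage that each path reads as a simple rule (for example, a length cutoff followed by a hydrophobicity cutoff), which is easy to discuss with experimental collaborators.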
In a separate analysis, we collaborated with L Regan at Yale in identifying unusual
proteins in a small model genome, that of M. genitalium. We identified 11 proteins that
had no known structure or function but had homologs in other genomes. These were
subsequently subjected to circular dichroism (CD) analysis (Balasubramanian et al., 2000, NAR).
v. Large-scale Integrative Database Systems
I have developed a number of systems for integrating a wide range of heterogeneous information
related to microbial genomes. In particular, we have built three main database systems for
our analyses.
a) The first system is called PartsList (Qian et al., 2001, NAR). It is principally
oriented towards annotating one of the existing structural classifications of proteins,
the SCOP scheme. The central metaphor for the annotation is that of “ranking” folds,
finding the most common folds based on a variety of different metrics.
b) The second system is SPINE (Bertone et al., in press, NAR). It is built as part of a
large collaboration with the Northeast Structural Genomics Consortium (described
above). It enables researchers who are part of this consortium to collect and rank
targets for high-throughput structural genomics.
c) The third system, which we call GeneCensus, tabulates results related to genes and
genomes. (This is currently unpublished.) Its central metaphor is a “tree” arranging
genomes. The tree can be built based on various measures of relatedness – e.g. number
of shared orthologs, number of shared folds, amino acid identity of individual
orthologous proteins, overall genome composition, etc. These measures of relatedness
occur at different levels: whole-genome, partial-proteome, and individual gene.
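
To make the “tree” metaphor concrete, the sketch below builds a tree of genomes from one of the measures listed above, the number of shared folds. It is only an illustration: the Jaccard distance, the average-linkage clustering, the genome names, and the fold labels are all assumptions for this example and are not taken from GeneCensus.

```python
from itertools import combinations
from typing import Dict, List, Set

def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 - |shared folds| / |all folds in either genome|."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def average_linkage_tree(fold_sets: Dict[str, Set[str]]):
    """Merge the two closest clusters repeatedly; return a nested-tuple tree."""
    # Each key is a subtree (a genome name or a tuple); each value lists its genomes.
    clusters: Dict[object, List[str]] = {name: [name] for name in fold_sets}

    def cluster_distance(pair) -> float:
        x, y = pair
        distances = [
            jaccard_distance(fold_sets[m], fold_sets[n])
            for m in clusters[x]
            for n in clusters[y]
        ]
        return sum(distances) / len(distances)

    while len(clusters) > 1:
        x, y = min(combinations(clusters, 2), key=cluster_distance)
        members = clusters.pop(x) + clusters.pop(y)
        clusters[(x, y)] = members
    return next(iter(clusters))

# Hypothetical fold assignments (placeholder genome and fold names).
FOLD_SETS = {
    "genome_A": {"f1", "f2", "f3", "f4"},
    "genome_B": {"f1", "f2", "f3", "f5"},
    "genome_C": {"f1", "f6", "f7", "f8"},
    "genome_D": {"f1", "f6", "f7", "f8", "f9"},
}

genome_tree = average_linkage_tree(FOLD_SETS)
print(genome_tree)
```

Swapping `jaccard_distance` for a different function (for instance one based on shared orthologs or composition) would give a tree for another measure of relatedness without changing the clustering step.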
All parts of the systems interact. So, for instance, it is possible to see how many genome
matches there are for a particular structure and then to click on these matches and see the
GeneCensus genome annotation for each of them. Also, each target in the construct
database links into GeneCensus, which validates that its structure is not currently known,
and all the solved structures from the consortium have annotation in PartsList.
Integrative database analysis underpins all of this work. I have been called on
to write a number of prominent surveys and opinions in this area, particularly concerning
the question of how integrated databases will interface with the biological literature
(Gerstein, 2000, Nat. Struc. Biol).