Summary of Mark Gerstein's Research Program
The goal of Prof Gerstein's laboratory is to make sense of the data deluge brought about by
genome sequencing and other high-throughput technologies, through performing integrative
surveys and systematic data mining. Specifically, he is focused on computational proteomics:
understanding the structure, function, and evolution of proteins through analyzing populations of
them in the databases and in whole-genome experiments. The research program in his lab
broadly falls into three areas, which are described below.
A. Analyzing Structures: Quantifying the Diversity in a Limited Number of Folds
It is believed that there is a large but limited number of folds (estimated to be ~5000), and a
library of them represents a most important resource for biology. To build a library of folds, one
needs some statistical or heuristic definition of what a fold is, a way of clustering together all the
structures with a given fold, and intelligent techniques for matching up sequences with unknown
structure to those with known structure. Prof Gerstein is working on a number of these topics. In
particular, he has developed a way to use existing structural classifications as scaffolds for
integrating diverse genomic information [72*, PartsList.org]. An important aspect of a fold
library is its use in comprehensively surveying protein flexibility and conformational variability.
Prof Gerstein is classifying all instances of conformational variability into a web-accessible
database [38*, MolMovDB.org]. Part of this project involves devising a system for
characterizing protein motions in a highly standardized fashion. He has developed a web server
that, given two coordinate sets, automatically does this (producing “morph movies” as a byproduct). The classification of motions is based on the packing at internal interfaces. Motions are
identified as shear or hinge, based on whether or not a well-packed interface is maintained
between the mobile elements throughout the motion. The motions classification scheme is
motivated by the fact that protein interiors are packed exceedingly tightly, and the tight packing
at internal interfaces greatly constrains the way proteins can move.
B. Analyzing Sequences: Surveying the Occurrence of Proteins in Genomes
As more genomes are sequenced and more structures determined, it has become increasingly possible
to characterize a substantial fraction of the folds used in a given organism -- statistically, in the
sense of a population census. This allows one to see whether particular folds are more common in
certain organisms than in others. Prof Gerstein was one of the first to address questions of this
sort, performing comparisons of genomes in terms of folds [34*]. In these and other surveys he
has found that a number of folds, such as TIM-barrels, recur in every (analyzed) genome, while
other folds are missing from certain genomes. Prof Gerstein also used fold occurrence to build
whole-genome trees, with the distances between organisms defined in terms of the presence or
absence of specific folds in the whole genome [GeneCensus.org], in contrast to traditional
phylogenies, which group organisms based on sequence similarity of individual genes. While he
found that the specific most common folds often differed between genomes, in all cases the
occurrence of folds (and many other aspects of genomic biology) tends to follow power-law
statistics, with a few common ones and many rare ones. Prof Gerstein's surveys on folds in
genomes are coupled to collaborations with crystallographers, trying to determine structure in
high-throughput fashion. In particular, he has done target selection, database design, and
data mining for one of the structural genomics centers, and this has enabled him to develop
systematic rules to predict protein solubility [76*, NESG.org].
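As a rough illustration of how distances between organisms can be defined from fold presence or absence, the sketch below computes a Jaccard distance between the fold sets of each pair of genomes. The choice of Jaccard distance, the function names, and the toy fold lists are all assumptions made for illustration; the text does not state which distance measure was actually used for the whole-genome trees.

```python
from itertools import combinations


def fold_distance(folds_a, folds_b):
    """Jaccard distance between two sets of folds.

    0.0 means the two genomes contain exactly the same folds;
    1.0 means they share none. (Assumed measure, for illustration.)
    """
    union = folds_a | folds_b
    if not union:
        return 0.0
    return 1.0 - len(folds_a & folds_b) / len(union)


def distance_matrix(genome_folds):
    """Pairwise fold-based distances for a dict {genome name: set of folds}."""
    names = sorted(genome_folds)
    dist = {(name, name): 0.0 for name in names}
    for a, b in combinations(names, 2):
        d = fold_distance(genome_folds[a], genome_folds[b])
        dist[(a, b)] = dist[(b, a)] = d
    return dist


# Hypothetical toy data: which folds occur in which genome.
genome_folds = {
    "genome_A": {"TIM-barrel", "Rossmann", "ferredoxin-like"},
    "genome_B": {"TIM-barrel", "Rossmann", "immunoglobulin-like"},
    "genome_C": {"TIM-barrel", "immunoglobulin-like", "zinc-finger"},
}

dist = distance_matrix(genome_folds)
for a, b in combinations(sorted(genome_folds), 2):
    print(a, b, round(dist[(a, b)], 2))
```

A matrix of this kind could then be passed to any standard distance-based tree-building method; that step is omitted here.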
In addition to analyzing the occurrence of folds and families within the "living" proteome,
Prof Gerstein has also used them to survey the "dead" pseudogenes and pseudogenic fragments
in intergenic regions. He was one of the first to perform comprehensive surveys of pseudogenes
on a genome-wide scale in terms of protein families, which was done for the worm [73*]. He has
done subsequent surveys on other organisms. Collectively, these allow one to determine the
common "pseudofolds" and "pseudofamilies" in various genomes and to address important
evolutionary questions about the type of proteins that were present in the past history of an
organism. In particular, he found that duplicated pseudogenes tend to have a very different
distribution than one would expect if they were randomly derived from the population of genes
in the genome. They tend to lie at the ends of chromosomes and have an intermediate
composition between that of genes and intergenic DNA. Most importantly, pseudogenes tend to
have environmental-response functions. This suggests that they may be resurrectable protein
parts, and there is a potential mechanism for this in yeast [99*].
C. Predicting Protein Function on a Genome Scale, through Data Integration
Because of the size and complexity of the human genome, individual experimentation for functional
annotation of every gene is not possible. Thus, a central problem in proteomics is how to
determine protein function on a large scale. Prof Gerstein is pursuing a wide range of approaches
to this problem. One of the most used techniques in genome analysis is
"annotation transfer", carrying over information related to a variety of properties (e.g. structure
and function) from a known sequence in the databases to an unknown one in the genome that is
similar to it. Prof Gerstein is using classifications of protein folds and functions to provide
benchmarks to measure to what degree annotation can be reliably transferred between similar
sequences, particularly when similarity is expressed in modern probabilistic language. The key
issue here is defining appropriate sequence similarity thresholds for the transfer of functional
annotation, and based on his analysis, Prof Gerstein has been able to find clear thresholds (e.g.
40% identity) [55*].
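The idea of a similarity threshold for annotation transfer can be sketched in a few lines. The example below computes percent identity over the aligned (non-gap) columns of a pairwise alignment and applies the 40% figure quoted above as a cutoff. The function names, the toy sequences, and the choice of denominator (aligned columns only) are assumptions for illustration; definitions of percent identity vary, and this is not the scoring procedure of [55*].

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap.

    Both inputs must be rows of the same pairwise alignment, so they
    must have equal length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    aligned_columns = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip gapped columns
        aligned_columns += 1
        if x == y:
            matches += 1
    if aligned_columns == 0:
        return 0.0
    return 100.0 * matches / aligned_columns


def can_transfer_annotation(aligned_a, aligned_b, threshold=40.0):
    """True if identity meets the threshold (default: the 40% example above)."""
    return percent_identity(aligned_a, aligned_b) >= threshold


# Hypothetical toy alignments.
close_pair = ("MKT-AYIAKQ", "MKTLAYLSKQ")    # 7 matches in 9 aligned columns
distant_pair = ("ACDEFGHIKL", "ACWWWWWWWW")  # 2 matches in 10 aligned columns

print(round(percent_identity(*close_pair), 1), can_transfer_annotation(*close_pair))
print(round(percent_identity(*distant_pair), 1), can_transfer_annotation(*distant_pair))
```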
Another method to get at the function of an uncharacterized protein is through determining
its 3D structure and then looking for structural similarities to proteins of known function. This is
a central idea in both structural genomics and structure prediction. To address this issue, Prof
Gerstein has measured, globally, the degree to which fold is associated with function [45*].
A new approach for getting at protein function is clustering gene-expression timecourses
from microarrays -- genes that cluster together may be functionally related. Prof Gerstein has
performed many expression analyses focusing on cross-referencing expression clusters to broad
"proteomic categories," such as functions and families. He has found this approach averages
away much of the noise in expression data. In particular, he has developed a new method of
clustering expression data that finds many time-shifted and inverted relationships in addition to
the simultaneous relationships found in other studies, and he has developed a way of quantifying
how much expression clustering predicts protein functional role or interactions [91*].
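To make the notion of time-shifted and inverted relationships concrete, the sketch below scans a small range of time shifts between two expression profiles and reports the shift with the strongest Pearson correlation, where a strongly negative correlation indicates an inverted relationship. This is a deliberately simplified lagged-correlation illustration, not the local clustering algorithm of [91*]; the function names, the cutoff of 0.9, and the toy profiles are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def best_shifted_correlation(x, y, max_shift=2, min_overlap=4):
    """Return (r, shift) with the largest |r| over the scanned shifts.

    A positive shift pairs x[t] with y[t + shift], i.e. y lags x.
    """
    best = (0.0, 0)
    n = len(x)
    for shift in range(-max_shift, max_shift + 1):
        if shift >= 0:
            xs, ys = x[:n - shift], y[shift:]
        else:
            xs, ys = x[-shift:], y[:n + shift]
        if len(xs) < min_overlap:
            continue  # too few overlapping time points to trust
        r = pearson(xs, ys)
        if abs(r) > abs(best[0]):
            best = (r, shift)
    return best


def classify(r, shift, cutoff=0.9):
    """Label a relationship; the cutoff is an arbitrary illustrative value."""
    if abs(r) < cutoff:
        return "unrelated"
    label = "simultaneous" if shift == 0 else "time-shifted"
    if r < 0:
        label = "inverted" if shift == 0 else "inverted time-shifted"
    return label


# Hypothetical toy timecourses.
gene_x = [0, 1, 3, 1, 0, 0, 0, 0]
gene_y = [0, 0, 1, 3, 1, 0, 0, 0]   # same pulse, one time point later
gene_z = [-v for v in gene_x]       # mirror image of gene_x

print(classify(*best_shifted_correlation(gene_x, gene_y)))
print(classify(*best_shifted_correlation(gene_x, gene_z)))
```

A plain correlation at zero shift would score the first pair as only weakly related, which is the kind of relationship a simultaneous-only analysis can miss.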
In addition to microarrays, further functional genomics experiments have recently appeared.
No individual experiment provides a full description of function. Integrating many experiments
together with "traditional" sequence information should give better predictions. While easy to
advocate, integration is tricky in practice, as it involves weighting highly heterogeneous features
-- such as expression timecourses and sequence motifs -- within a single formalism. In one
particular context, Prof Gerstein has been able to successfully integrate many features for
function prediction: predicting subcellular localization [62*]. He found that the localization of a
protein is related to the expression level of its associated gene -- e.g. lowly expressed proteins
were more likely to be destined for the nucleus than cytoplasm. He then used a Bayesian system
to seamlessly integrate this expression observation with traditional sequence motifs and
essentiality information and predict localization for many uncharacterized yeast proteins.
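The kind of integration described here can be illustrated with a small Bayesian update, in which a prior over compartments is multiplied by the likelihood of each observed feature and renormalized. The sketch below assumes the features are conditionally independent given the compartment (a naive Bayes simplification). All probabilities, compartment names, and feature names are invented for illustration and are not taken from [62*].

```python
def bayes_update(prior, likelihood):
    """One Bayesian update: posterior(c) is proportional to prior(c) * P(feature | c)."""
    unnormalized = {c: prior[c] * likelihood[c] for c in prior}
    total = sum(unnormalized.values())
    return {c: value / total for c, value in unnormalized.items()}


def predict_localization(prior, feature_likelihoods, observed):
    """Fold in each observed feature in turn, starting from the prior.

    Assumes features are conditionally independent given the compartment.
    """
    posterior = dict(prior)
    for feature, value in observed.items():
        posterior = bayes_update(posterior, feature_likelihoods[feature][value])
    return posterior


# Hypothetical prior over compartments (sums to 1).
prior = {"nucleus": 0.3, "cytoplasm": 0.5, "membrane": 0.2}

# Hypothetical P(feature value | compartment) tables.
feature_likelihoods = {
    "expression": {
        "low": {"nucleus": 0.6, "cytoplasm": 0.3, "membrane": 0.5},
        "high": {"nucleus": 0.4, "cytoplasm": 0.7, "membrane": 0.5},
    },
    "nuclear_motif": {
        True: {"nucleus": 0.7, "cytoplasm": 0.1, "membrane": 0.05},
        False: {"nucleus": 0.3, "cytoplasm": 0.9, "membrane": 0.95},
    },
}

# A lowly expressed protein carrying a (hypothetical) nuclear-targeting motif.
observed = {"expression": "low", "nuclear_motif": True}
posterior = predict_localization(prior, feature_likelihoods, observed)
best_guess = max(posterior, key=posterior.get)
print(best_guess, {c: round(p, 3) for c, p in posterior.items()})
```

Because each feature only rescales the running distribution, heterogeneous evidence such as expression level, sequence motifs, and essentiality can be combined in any order within the same formalism.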
Listing of Ten Top Papers
*99. P Harrison, A Kumar, N Lan, N Echols, M Snyder, M Gerstein.
"A small reservoir of disabled ORFs in the yeast genome and its implications for the dynamics of proteome
evolution."
J Mol Biol 316: 409-419 (2002).
*91. J Qian, M Dolled-Filhart, J Lin, H Yu, M Gerstein.
"Beyond synexpression relationships: Local Clustering of time-shifted and inverted gene expression profiles
identifies new, biologically relevant interactions."
J Mol Biol 314: 1053-66 (2001).
*76. P Bertone, Y Kluger, N Lan, D Zheng, D Christendat, A Yee, A Edwards, C Arrowsmith, G Montelione,
M Gerstein.
"SPINE: An integrated tracking database and data mining approach for identifying feasible targets in high-throughput structural proteomics."
Nucleic Acids Res 29: 2884-98 (2001).
*73. P Harrison, N Echols, M Gerstein.
"Digging for Dead Genes: An Analysis of the Characteristics of the Pseudogene Population in the C. elegans
Genome."
Nucleic Acids Res 29: 818-30 (2001).
*72. J Qian, B Stenger, C Wilson, J Lin, R Jansen, W Krebs, V Alexandrov, N Echols, S Teichmann, J Park, M
Gerstein.
"PartsList: a web-based system for dynamically ranking protein folds based on disparate attributes, including whole-genome expression and interaction information."
Nucleic Acids Res 29: 1750-64 (2001).
*62. A Drawid, M Gerstein.
"A Bayesian system integrating expression data with sequence patterns for localizing proteins: comprehensive
application to the yeast genome."
J Mol Biol 301: 1059-75 (2000).
*55. C Wilson, J Kreychman, M Gerstein.
"Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and
function through traditional and probabilistic scores."
J Mol Biol 297: 233-49 (2000).
*45. H Hegyi, M Gerstein.
"The relationship between protein structure and function: a comprehensive survey with application to the yeast
genome."
J Mol Biol 288: 147-64 (1999).
*38. M Gerstein, W Krebs.
"A database of macromolecular motions."
Nucleic Acids Res 26: 4280-90 (1998).
*34. M Gerstein.
"A structural census of genomes: comparing bacterial, eukaryotic, and archaeal genomes in terms of protein structure."
J Mol Biol 274: 562-76 (1997).