Abstract
Comparing mRNA Expression and Protein Abundance via Genomic and Proteomic
Characteristics
Dov Greenbaum
2004
With the advent of high throughput proteomic and genomic technologies we now
have appreciable quantitative mRNA expression and protein abundance levels for
much of the yeast genome. While the cellular mRNA and protein concentrations are
clearly mechanistically related, the quantitative relationship between them may not be
clear-cut. This thesis is an attempt to quantify the relationship via interrelating
diverse expression data sets and additional external information. There are three
important aspects to this analysis: (i) the data, particularly the protein data, is the
result of various experimental methodologies; as such, the information was integrated
judiciously through iteratively fitting the available datasets into reference sets; (ii) the
data is inherently noisy, so, to minimize noise, broad categories (e.g. functional,
structural, and interaction categories) were used to average the data-points into more
robust numbers; (iii) protein complexes, where the subunits occur in
stoichiometrically equal amounts, can be used in simple but valuable illustrations of
the relationship between protein products and their mRNA precursors. Overall,
considerable agreement between mRNA expression and protein abundance, in terms
of the enrichment of structural and functional categories, was found. This agreement,
which was considerably greater than the simple correlation between these quantities
for individual genes, reflects the way broad categories collect many individual
measurements into simple, robust averages. In particular, it is shown that, with respect to
the genome, the proteome is enriched in (i) the small amino acids Val, Gly, and Ala
(high levels of these amino acids in proteins lead to more compact and more stable
proteins); (ii) low molecular weight (i.e. more cost efficient) proteins; (iii) proteins
involved in protein synthesis, cell structure, and energy production; and is depleted in proteins
that act as molecular switches (i.e., transcription and cell growth). mRNA expression
levels are also shown to correlate well among the members of permanent protein
complexes. Generally, permanent complexes, such as the ribosome and proteasome,
are shown to have a particularly strong relationship with mRNA expression, while
transient ones do not. However, several transient complexes, such as the RNA
polymerase II holoenzyme and the replication complex, can be subdivided into
smaller permanent ones, which do have a strong relationship to gene expression.
Comparing mRNA Expression and Protein Abundance via Genomic and Proteomic
Characteristics
A Dissertation
Presented to the Faculty of the Graduate School
of
Yale University
in Candidacy for the Degree of
Doctor of Philosophy
by
Dov Greenbaum
Dissertation Director: Mark Gerstein
May 2004
© 2004 by Dov Greenbaum
All rights reserved.
Table of Contents
TABLE OF CONTENTS ................................................................................................. 3
LIST OF FIGURES AND TABLES................................................................................ 5
ACKNOWLEDGMENTS ................................................................................................ 7
INTRODUCTION........................................................................................................... 10
CHAPTER 1: INTEGRATING GENOMIC DATA SETS......................................... 15
1.1 INTERRELATING DIFFERENT TYPES OF GENOMIC DATA, FROM PROTEOME TO
SECRETOME: 'OMING IN ON FUNCTION........................................................................... 15
Abstract ..................................................................................................................... 15
Introduction............................................................................................................... 16
The Path to Function is Filled with 'omes ................................................................ 18
Computational Methods for Defining 'omes ............................................................. 20
Experimental Methods for Defining 'omes ............................................................... 21
Interrelating Different 'omes..................................................................................... 22
The Use of Broad Categories to Interpret Noisy Data ............................................. 23
A Case Study: Interrelating the Transcriptome and the Translatome ...................... 26
Conclusion ................................................................................................................ 28
Figures and Tables ................................................................................................... 29
References ................................................................................................................. 36
CHAPTER 2: MRNA EXPRESSION AND PROTEIN ABUNDANCE ................... 43
2.1 ANALYSIS OF MRNA EXPRESSION AND PROTEIN ABUNDANCE DATA: AN APPROACH
FOR THE COMPARISON OF THE ENRICHMENT OF FEATURES IN THE CELLULAR POPULATION
OF PROTEINS AND TRANSCRIPTS ..................................................................................... 43
Abstract ..................................................................................................................... 43
Introduction............................................................................................................... 44
Methods ..................................................................................................................... 49
Data set scaling......................................................................................................... 51
Enrichment of features .............................................................................................. 56
Results ....................................................................................................................... 59
Application to semi-quantitative protein abundance data sets ................................. 63
Discussion and conclusion ........................................................................................ 64
Acknowledgments ...................................................................................................... 70
Figures and Tables ................................................................................................... 71
References ................................................................................................................. 87
2.2 COMPARING PROTEIN ABUNDANCE AND MRNA EXPRESSION LEVELS ON A GENOMIC
SCALE ............................................................................................................................ 98
Abstract ..................................................................................................................... 98
Introduction............................................................................................................... 98
Two-dimensional electrophoresis ............................................................................. 99
Mass spectrometric approaches ............................................................................. 101
Comparison of mRNA and protein levels................................................................ 104
Acknowledgements .................................................................................................. 113
Figures and Tables ................................................................................................. 114
References ............................................................................................................... 119
CHAPTER 3: MRNA EXPRESSION AND PROTEIN-PROTEIN
INTERACTIONS.......................................................................................................... 125
3.1 RELATING WHOLE-GENOME EXPRESSION DATA WITH PROTEIN-PROTEIN
INTERACTIONS ............................................................................................................. 125
Abstract ................................................................................................................... 125
Introduction............................................................................................................. 126
Results ..................................................................................................................... 127
Discussion and conclusion ...................................................................................... 136
Methods ................................................................................................................... 141
Efficient calculation of the average correlations.................................................... 142
Kinetic model of the relationship between protein and mRNA concentration........ 143
Acknowledgments .................................................................................................... 145
Figures .................................................................................................................... 146
References ............................................................................................................... 157
APPENDIX: CHANGE IN MRNA EXPRESSION VS. CHANGE IN PROTEIN
ABUNDANCE LEVELS .............................................................................................. 164
GENOMIC AND PROTEOMIC ANALYSIS OF THE MYELOID DIFFERENTIATION PROGRAM:
GLOBAL ANALYSIS OF GENE EXPRESSION DURING INDUCED DIFFERENTIATION IN THE
MPRO CELL LINE ........................................................................................................ 164
Abstract ................................................................................................................... 164
Introduction............................................................................................................. 165
Materials and methods ............................................................................................ 167
Results ..................................................................................................................... 173
Discussion ............................................................................................................... 180
Acknowledgments .................................................................................................... 188
Figures and Tables ................................................................................................. 189
References ............................................................................................................... 209
Appendix ................................................................................................................. 219
List of figures and tables
CHAPTER 1: INTEGRATING GENOMIC DATA SETS .......................................................... 29
1.1 Interrelating Different Types of Genomic Data, from Proteome to Secretome:
'Oming in on Function ............................................................................................ 29
Figure 1 An overview of the current 'omic terminology ............................................... 29
Figure 2 Interrelating the transcriptome and the translatome ......................................... 31
Table 1 A Table of 'omes ................................................................................................ 34
CHAPTER 2: MRNA EXPRESSION AND PROTEIN ABUNDANCE ......................................... 71
2.1 Analysis of mRNA expression and protein abundance data: ............................... 71
Figure 1 Schematic overview of the analysis ................................................................. 74
Figure 2 mRNA expression levels vs. protein abundance levels .................................... 76
Figure 3a-c Amino acid and biomass enrichment........................................................... 78
Figure 3d Statistical significance .................................................................................... 81
Figure 4 Breakdown of the transcriptome and translatome in terms of broad categories
relating to structure, localization, and function ........................................................ 83
2.2 Comparing protein abundance and mRNA expression levels on a genomic scale
................................................................................................................................. 114
Table 1 Proteomic Technologies .................................................................................. 114
Figure 1 Comparison of mRNA expression and protein abundance. ........................... 115
Figure 2 The differences in correlation between mRNA and protein expression values
using novel categories. ............................................................................................ 117
CHAPTER 3: MRNA EXPRESSION AND PROTEIN-PROTEIN INTERACTIONS .................. 146
3.1 Relating whole-genome expression data with protein-protein interactions ...... 146
Figure 1 Distributions of normalized differences for various groups of proteins in
boxplot representation. ............................................................................................ 146
Figure 2 Distributions of correlation coefficients between expression profiles .......... 148
Figure 3a Various key statistics .................................................................................... 151
Figure 3b Graphical representation of part of the protein complex statistics ............... 154
Figure 4 Representation of the replication complex and its components ..................... 155
APPENDIX: CHANGE IN MRNA EXPRESSION VS. CHANGE IN PROTEIN ABUNDANCE
LEVELS .................................................................................................................... 189
Figure 1 Two-dimensional electrophoretograms of wide pH range of MPRO cells. ... 189
Figure 2 Two-dimensional electrophoretograms of MPRO cells in pH range 4 to 7. .. 191
Table 1 Distribution of protein spots identified during myeloid differentiation .......... 193
Table 2 Protein species represented by multiple spots ................................................. 194
Table 3 Classification of known proteins ..................................................................... 196
Figure 4 Protein clusters according to their expression patterns. ................................. 197
Figure 5 The correlation between the mRNA difference at 0 and 72 hours and the
corresponding protein difference. ........................................................................... 199
Figure 6 Two-dimensional electrophoretograms of cycloheximide inhibition of MPRO
cells. ........................................................................................................................ 201
Figure 7 Two-dimensional electrophoretograms of cycloheximide inhibition of MPRO
cells. ........................................................................................................................ 203
Figure 8 Distribution of protein spots from cycloheximide experiment....................... 205
Table 4 Transcription factors analyzed by Northern blot assay ................................... 206
Acknowledgments
I have thoroughly enjoyed working in Mark Gerstein’s lab over the past number of years.
Through interactions with Mark and other lab members I have grown in my
understanding and appreciation of science in general and, in particular, gained a
substantial understanding of bioinformatics and genetics.
Many colleagues have contributed, either directly, or indirectly, to this dissertation. These
include various coauthors, confidants and mentors. In particular, I would like to thank:
Ronald Jansen, Yuval Kluger, Haiyuan Yu, Nick Luscombe, Hedi Hegyi, Jiang Qian,
Jimmy Lin, Paul Bertone, Lian Zheng, David Tuck, Jochen Junker, Rajdeep Das,
Sambath Chung, Mike Snyder, Nevan Krogan, Al Edwards, Andrew Emili, Bart Kus,
Jack Greenblatt, Ken Williams, Christopher Colangelo, John Karro, Xiaowei Zhu, and
the entire Gerstein lab.
I would like to thank Drs. Sherman Weissman and Kevin White for serving on my
research committee. Both Sherman and Kevin have been a stimulating force in my
research; their comments and suggestions have proven invaluable to my research. I
would also like to thank my department for all their help and support over the past six
years; in particular, Betsy Jasiorkowski and Michael Stern have shown excessive patience
in helping me.
It is impossible to overstate my gratitude to my advisor Mark Gerstein. He has guided
me through the process of exploring a new scientific field. He not only helped me to
develop thorough scientific judgment, but also taught me about many of the practical
aspects of doing science.
I want to thank my parents, Drs. Cheryl and Joseph Greenbaum, my brothers: Eli, Yale,
Moshe, Rafi, and Ari, and my in-laws: The Honorable and Mrs. Simon Gluck, for their
help and continuing support through many years.
My daughter, Liana Tova, eclipses all as the source of my greatest pride and joy. Her
smile lights up the room and prevents me from doing my work.
Finally, I want to express my deepest gratitude to my eishes chayil, Sabrina. She has
always been there for me, and has graciously allowed, and continues to allow, me to
prolong my education and the pursuit of knowledge. She has been an awe-inspiring
source of love, and intellectual and moral support.
Introduction
A central and integral biological process in every cell is the faithful transition from DNA,
through an mRNA intermediary, to the final protein product. The cell exquisitely
controls every step of the process, the result being the desired concentration of functional
proteins. We understand this process on a biological level: it is obvious that the
population of mRNA leads to the total protein complement of the cell. With the advent of
high throughput genomic methodologies for measuring mRNA expression and protein
abundance, it is now also possible to analyze and accurately measure this relationship
between mRNA and protein quantitatively. Simplistically, we can view this relationship
as the consequence of translation of mRNAs and degradation of protein, i.e.,

$$\frac{dP_i}{dt} = k_{s,i}\,\mathrm{mRNA}_i - k_{d,i}\,P_i$$

where $P_i$ is the abundance of protein $i$, $k_{s,i}$ is its rate of translation, and $k_{d,i}$ is its
rate of degradation. Thus, at steady state, $P_i = k_{s,i}\,\mathrm{mRNA}_i / k_{d,i}$. When I first
began my research $k_s$, for the most part, was unknown. Presently, $k_d$ is still unknown on a
genomic scale.
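As a minimal sketch of the kinetic relationship above, the following Python fragment integrates dP/dt = k_s * mRNA - k_d * P numerically and compares the result with the steady-state value k_s * mRNA / k_d. The function names, the forward-Euler scheme, and all numerical values are illustrative assumptions; they are not measured rates.

```python
def steady_state_protein(mrna, k_s, k_d):
    """Steady-state protein level, P = k_s * mRNA / k_d."""
    if k_d <= 0:
        raise ValueError("k_d must be positive")
    return k_s * mrna / k_d


def simulate_protein(mrna, k_s, k_d, p0=0.0, dt=0.01, steps=5000):
    """Forward-Euler integration of dP/dt = k_s * mRNA - k_d * P."""
    p = p0
    for _ in range(steps):
        p += dt * (k_s * mrna - k_d * p)
    return p


# Hypothetical values, chosen only for illustration.
mrna, k_s, k_d = 2.0, 10.0, 0.5
p_ss = steady_state_protein(mrna, k_s, k_d)
p_sim = simulate_protein(mrna, k_s, k_d)
```

Under these assumed values the simulated protein level converges on the steady-state value, which illustrates why, for fixed k_s and k_d, protein abundance would scale linearly with mRNA level.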
Through investigating and analyzing this relationship we gain a broader understanding of
the cellular mechanisms and controls used in synthesizing the protein population.
Additionally, given the large discrepancy in data quality and availability between mRNA
expression and protein abundance, it is helpful to understand the relationship between the
two populations: protein abundance levels, which are difficult to measure, may possibly be predicted
from mRNA expression data and other associated information sources.
One of the goals of bioinformatics is to provide robust methodologies for analyzing the
data derived from high throughput experimentation, and to then extract biological
insights from the data. Data from high throughput experimentation is often noisy. One
can minimize the effect of the noise on an analysis in a number of ways: first, by
integrating multiple data sources and observations; and, second, by integrating
additional tangential resources.
This dissertation encompasses previously published research focusing on the correlation
between mRNA and protein levels in Saccharomyces cerevisiae. Each chapter represents
an important part of the analysis of the relationship.
Chapter 1 introduces the concept of cellular populations as defined not only by their physical
constitution but also, in a more novel sense, by their function. This differentiation of the
cellular protein complement into distinct categories or ‘omes’ is instrumental in my
analysis of correlations between mRNA and protein abundance. I also present an initial
analysis of the relationship between mRNA and protein levels.
Chapter 2 represents a formalization of the problem presented in the first chapter. Given
some of the limitations inherent in the data (e.g. size and quality of the datasets), I have
devised a methodology for merging the current mRNA and protein data sets into larger
and more reliable reference data sets. I then analyze the relationship between mRNA and
protein population levels in the cell, specifically as it relates to a number of broad
categories including secondary structure, function, and subcellular localization, and
particularly with regard to well defined gene populations. I show that biologically
relevant insights can be discerned through my methods.
Chapter 2.2 presents a second look at the relationship between mRNA and protein levels
using a newer, larger and more reliable data set. I also look at additional novel
categories with which to compare protein and mRNA. These include ribosomal
occupancy levels for each mRNA species, the Codon Adaptation Index and the
variability of mRNA expression as measured by the coefficient of variation.
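The coefficient of variation mentioned above is the standard deviation of an mRNA's expression levels divided by their mean. A minimal Python sketch follows; the function name, the use of the population (rather than sample) standard deviation, and the example profile are assumptions made only for illustration.

```python
from statistics import mean, pstdev


def coefficient_of_variation(levels):
    """Population standard deviation of `levels` divided by their mean."""
    mu = mean(levels)
    if mu == 0:
        raise ValueError("mean expression is zero; CV is undefined")
    return pstdev(levels) / mu


# Hypothetical expression levels of one mRNA across eight conditions.
profile = [2, 4, 4, 4, 5, 5, 7, 9]
cv = coefficient_of_variation(profile)
```

Because the measure is normalized by the mean, it allows the variability of weakly and strongly expressed genes to be compared on the same scale.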
Chapter 3 looks to expand the original analysis of mRNA and protein levels by
investigating correlations among the proteins of binary and complex interactions.
Assuming that there is a relationship between protein and mRNA levels in the cell, one
would hope to find that pairs and groups of proteins which are thought to exist in the cell
in similar protein concentrations also have similar concentrations of the mRNA. My
analysis has shown that while proteins in binary interactions do not have, on average,
similar levels of mRNA as their interaction partners (initial protein abundance data shows
similar results), proteins that interact together in complexes do tend to have overall
similar levels of mRNA concentration. These results provide further evidence of the
possibility of quantifying a relationship between mRNA and protein expression levels in
yeast cells.
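One simple way to express how similarly the members of a complex are expressed is the mean pairwise Pearson correlation of their mRNA expression profiles. The sketch below is an illustration under stated assumptions, not the procedure used in Chapter 3: the function names, the subunit labels, and the expression values are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant profile has no defined correlation")
    return sxy / sqrt(sxx * syy)


def mean_pairwise_correlation(profiles):
    """Average Pearson correlation over all pairs of subunit profiles."""
    pairs = list(combinations(profiles.values(), 2))
    if not pairs:
        raise ValueError("need at least two profiles")
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


# Hypothetical expression profiles for three subunits of one complex.
complex_profiles = {
    "subunit_a": [1.0, 2.0, 3.0, 4.0],
    "subunit_b": [2.0, 4.0, 6.0, 8.0],
    "subunit_c": [1.5, 2.5, 3.5, 4.5],
}
score = mean_pairwise_correlation(complex_profiles)
```

A score near 1 indicates that the subunits rise and fall together across conditions; the same function applied to a set of unrelated genes gives a baseline for comparison.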
In addition to setting up a preliminary cDNA microarray facility with Professor Arch
Perkins, I further attempted to enhance my understanding of the experimental techniques
behind mRNA expression and protein abundance determination through extensive hands-on
work in deciphering two-dimensional gels. This work also provided me with an
appreciation of the efforts necessary to consistently and accurately determine protein
abundance levels. This analysis, as described in the appendix, involves a proteomic
analysis of myeloid differentiation in a murine promyelocytic (MPRO) cell line. In
particular, I investigated the relationship between mRNA and protein in terms of
simultaneous changes in their levels over multiple time points. This is the first time such
a relationship has been studied.
These datasets gave a much stronger correlation than
previous analyses involving only a solitary time point. This result is consistent with the
hypothesis that a substantial proportion of protein change is a consequence of changed
mRNA levels, rather than posttranscriptional effects.
References
Greenbaum, D., Jansen, R. & Gerstein, M. Analysis of mRNA expression and
protein abundance data: an approach for the comparison of the enrichment of
features in the cellular population of proteins and transcripts. Bioinformatics 18,
585-96 (2002).
Greenbaum, D., Colangelo, C., Williams, K. & Gerstein, M. Comparing protein
abundance and mRNA expression levels on a genomic scale. Genome Biol 4, 117
(2003).
Greenbaum, D., Luscombe, N. M., Jansen, R., Qian, J. & Gerstein, M.
Interrelating different types of genomic data, from proteome to secretome: 'oming
in on function. Genome Res 11, 1463-8 (2001).
Jansen, R., Greenbaum, D. & Gerstein, M. Relating whole-genome expression
data with protein-protein interactions. Genome Res 12, 37-46 (2002).
Lian, Z., Kluger, Y., Greenbaum, D. S., Tuck, D., Gerstein, M., Berliner, N.,
Weissman, S. M. & Newburger, P. E. Genomic and proteomic analysis of the myeloid
differentiation program: global analysis of gene expression during induced
differentiation in the MPRO cell line. Blood 100, 3209-20 (2002).
Chapter 1: Integrating Genomic Data Sets
1.1 Interrelating Different Types of Genomic Data, from
Proteome to Secretome: 'Oming in on Function
Abstract
With the completion of genome sequences, the current challenge for biology is to
determine the functions of all gene products and to understand how they contribute to
making an organism viable. For the first time, biological systems can be viewed as being
finite, with a limited set of molecular parts. However, the full range of biological
processes controlled by these parts is extremely complex. Thus, a key approach in
genomic research is to divide the cellular contents into distinct sub-populations, which
are often given an "-omic" term. For example, the proteome is the full complement of
proteins encoded by the genome, and the secretome is the part of it secreted from the cell.
Carrying this further, I suggest the term "translatome" to describe the members of the
proteome weighted by their abundance, and the "functome" to describe all the functions
carried out by these. Once the individual sub-populations are defined and analyzed, I can
then try to reconstruct the full organism by interrelating them, eventually allowing for a
full and dynamic view of the cell. All this is, of course, made possible because of the
increasing amount of large-scale data resulting from functional genomics experiments.
However, there are still many difficulties resulting from the noisiness and complexity of
the information. To some degree, these can be overcome through averaging with broad
proteomic categories such as those implicit in functional and structural classifications.
For illustration, I discuss one example in detail, interrelating transcript and cellular
protein populations (transcriptome and translatome). Further information is available at
http://bioinfo.mbb.yale.edu/what-is-it.
Introduction
"[It] does not consist of individuals, but expresses the sum of interrelations, the
relations within which these individuals stand." adapted from Karl Marx, Grundrisse
(1857) (Marx, 1857).
The raw data produced by genome sequencing projects currently provides little insight
into the precise workings of an organism at the molecular level (Luscombe et al., 2001).
Therefore, the goal of functional genomics is to complement the genomic sequence by
assigning useful biological information to every gene. Through this, I aim to improve my
understanding of how the different biological molecules contained within the cell (i.e.,
DNA, RNA, proteins, and metabolites) combine to make the organism viable. Clearly,
the main challenge is the elucidation of all molecular, cellular, and physiological
functions of each gene product. However, there are many subsidiary goals as part of this
challenge, such as defining the three-dimensional structures of these macromolecules,
their subcellular localizations, intermolecular interactions, and expression levels.
Although gathering and classifying the necessary information is central to this process, it
is impractical to rely on individual experiments for the potentially thousands of genes in
each organism. Furthermore, with large-scale proteomic experiments still yet to be used
widely, computational techniques, while sometimes based on less than ideal information,
provide a crucial resource for assigning biological data.
The paper by Antelmann et al. in this issue of Genome Research (Antelmann et al., 2001)
evaluates their earlier attempts to assign protein functions through computational means.
Previously, the group used computational methods to predict all exported proteins (or
members of the secretome) in Bacillus subtilis by searching for signal peptides and cell
retention signals in the protein sequences. A better understanding of how and why a
protein is secreted is valuable as the bacterium's ability to export numerous enzymes
enables it to degrade extracellular substrates and survive in a continuously changing
environment. Moreover, it will eventually allow these bacteria to be employed as
"cellular factories" for secreting commercially valuable proteins in large quantities
(Tjalsma et al., 2000).
Antelmann et al.'s present paper aims to verify their previous predictions by
experimentally characterising the entire population of secreted proteins using 2D gel
electrophoresis and mass spectrometry. They showed that the original predictions
correctly identified about 50% of all secreted proteins. Most of the disagreements were
due to the inability to predict the secretion of proteins lacking the appropriate signal, or
those containing seemingly inappropriate signals (cell retention signals). In summary,
Antelmann et al.'s work both highlights the encouraging aspects of computational
assignments of biological data and reveals some of the shortcomings in the current
methods.
The Path to Function is Filled with 'omes
To describe their studies, Antelmann et al. coined the term "secretome". This 'omic term
is an example of the new lexicon that has appeared recently to define the varied
populations and sub-populations in the cell (Fig. 1). These terms are generally suffixed
with "-ome", with an associated research topic of "-omics".
Broadly, the existing 'omes can be divided into those that represent a population of
molecules, and those that define their actions (Fig. 1). For the first category, populations
provide an inventory or "parts list" of molecules contained within an organism (Gerstein
and Hegyi, 1998; Qian et al., 2001; Skolnick and Fetrow, 2000; Vukmirovic and Tilghman,
2000). The genome, the entire DNA sequence of an organism, presents a basis for
defining the proteome, a list of coding DNA regions that result in protein products.
Transcription of these coding sequences produces the transcriptome (Velculescu et al.,
1997), which is the cellular complement of all mRNA under a variety of cellular
conditions. Note that this population is weighted by the expression level of each molecule
and, ideally, should incorporate the results of alternative splicing. Following translation
of the transcriptome, I suggest the term "translatome" to describe the cellular population
of proteins expressed in the organism at a given time, explicitly weighted by their
abundance. It is important to note that, whereas the membership of the genome and
proteome is virtually static, the transcriptome and translatome are dynamic and
continually change in response to internal and external events. Additional 'omes describe
the presence of molecules that are not encoded by the genome, but are nonetheless
essential, for instance, the metabolome (Tweeddale et al., 1998). Because of the newness
of most 'omic terms, a few still have competing definitions. This is most evident for the
proteome (see Table 1).
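The distinction drawn above between an unweighted population (each gene counted once) and an abundance-weighted one (the translatome) can be made concrete with a short Python sketch. It computes the share of a functional category in each population and a relative enrichment. The gene identifiers, category labels, abundance values, and the particular enrichment formula are all assumptions introduced for illustration.

```python
def category_fraction(genes, category, weights=None):
    """Share of the population that falls in `category`.

    Each gene counts once when `weights` is None (genome-style count);
    otherwise each gene counts in proportion to its weight (e.g. abundance).
    """
    total = 0.0
    in_category = 0.0
    for gene, gene_category in genes.items():
        w = 1.0 if weights is None else weights[gene]
        total += w
        if gene_category == category:
            in_category += w
    if total == 0:
        raise ValueError("population has zero total weight")
    return in_category / total


# Hypothetical genes, categories, and protein abundances.
genes = {"g1": "energy", "g2": "energy",
         "g3": "transcription", "g4": "transcription"}
abundance = {"g1": 60.0, "g2": 30.0, "g3": 5.0, "g4": 5.0}

genome_frac = category_fraction(genes, "energy")
translatome_frac = category_fraction(genes, "energy", abundance)
# Relative enrichment of the category in the weighted population.
enrichment = (translatome_frac - genome_frac) / genome_frac
```

In this toy example the "energy" category makes up half of the genes but most of the weighted population, so its enrichment is positive; a category that is depleted in the weighted population would give a negative value.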
The second group of 'omes are fewer in number and describe the actions of the protein
products. For example, the secretome is a subset of the proteome that is defined by its
action, that is, it is actively exported from the cell. The interactome (Sanchez et al.,
1999) lists all of the specific interactions that are made between macromolecules in the
cell. More abstractly, the regulome (Web references only; see Table 1) defines the
genome-wide regulatory network of the cell and most notably includes transcription
regulation pathways.
The elucidation of each of these 'omes contributes to the ultimate goal of functional
genomics, defining the functome, which describes all of the functions that are assigned to
each gene in the genome (Rison, 2000; http://www.biochem.ucl.ac.uk/~rison). The
functions of a gene can be described at many levels, including their biochemical, cellular
and physiological roles (Ashburner et al., 2000), and also depend on additional factors
that are not immediately associated with their basic functions, such as subcellular
localization and intermolecular interactions. Therefore, aspects of the functome may be
expressed in terms of other 'omes: those that group similar biochemical functions, such as
the immunome (Pederson, 1999); similar localizations, such as the secretome; and similar
interactions, such as the interactome. For the
record, I coin my own term here; at present, a large proportion of genes can only be
described as members of the "unknome": those with currently no functional information!
Computational Methods for Defining 'omes
There are a variety of computational approaches for defining 'omes (Gerstein and Honig,
2001):
(1)
Algorithmic methods for predicting genes, protein structure, interactions, or
localization based on patterns in individual sequences or structures; for example, defining
the proteome or orfeome using a gene-finding algorithm on the genome (Claverie,
1997; Guigo et al., 2000; Harrison et al., 2001; Yeh et al., 2001), determining the foldome
from structure prediction of the proteome (Simons et al., 2001), determining the
interactome from the foldome, using known binding sites (Teichmann et al., 2001), and
determining the secretome through identifying signal sequences in the proteome (Tjalsma
et al., 2000).
(2)
Annotation transfer through homology, that is, inferring structure or function
based on sequence and structural information of homologous proteins (Gerstein,
1997,Gerstein, 1998,Hegyi and Gerstein, 1999,Hegyi et al., 2002,Thornton, 2001,Wilson
et al., 2000).
(3)
Using a "guilt-by-association" method based on clustering where functions or
interactions are inferred from clusters of functional genomic data, such as expression
information. For example, similar functions can sometimes be inferred through
interactions with other proteins or similar expression profiles (Eisen et al., 1998,Gerstein
and Jansen, 2000,Ito et al., 2001,Marcotte et al., 1999).
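The "guilt-by-association" idea in item (3) can be illustrated with a minimal sketch. It assumes that each gene has a short expression profile and that a function is transferred by majority vote among the best-correlated annotated genes. The gene names, profiles, annotations, and the function `guilt_by_association` are all hypothetical and are not taken from any of the cited studies.

```python
import numpy as np

def guilt_by_association(profiles, annotations, query, k=3):
    """Predict a function for `query` by majority vote among the k annotated
    genes whose expression profiles correlate best with it."""
    scored = []
    for gene, function in annotations.items():
        if gene == query:
            continue
        # Pearson correlation between the two expression profiles
        r = np.corrcoef(profiles[query], profiles[gene])[0, 1]
        scored.append((r, function))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = {}
    for _, function in scored[:k]:
        votes[function] = votes.get(function, 0) + 1
    return max(votes, key=votes.get)

# Hypothetical expression profiles over four time points
profiles = {
    "geneA": np.array([1.0, 2.0, 3.0, 4.0]),
    "geneB": np.array([1.1, 2.1, 2.9, 4.2]),
    "geneC": np.array([0.9, 1.8, 3.2, 3.9]),
    "geneD": np.array([4.0, 3.0, 2.0, 1.0]),
    "geneE": np.array([4.2, 2.9, 2.1, 1.0]),
    "geneX": np.array([1.0, 2.2, 3.1, 4.1]),  # gene of unknown function
}
# Hypothetical annotations for the genes of known function
annotations = {
    "geneA": "glycolysis",
    "geneB": "glycolysis",
    "geneC": "glycolysis",
    "geneD": "ribosome",
    "geneE": "ribosome",
}

predicted = guilt_by_association(profiles, annotations, "geneX")
print(predicted)
```

In practice the clustering method, the similarity measure, and the value of `k` are all choices that affect the result, and a prediction of this kind is a hypothesis to be tested rather than an assignment.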
Experimental Methods for Defining 'omes
Although still in their infancy, several large-scale experimental techniques are designed
to assess the nature of different 'omes. Gene expression studies are now well established
and microarray or GeneChip technologies can be used to measure mRNA abundance in
the cell and hence define the transcriptome (Epstein and Butow, 2000). Detection of
protein concentration and definition of the translatome is more difficult, however, as
evidenced by the dearth of such data. At present, the most prominent method employs
two-dimensional electrophoresis to isolate proteins followed by mass spectrometry for
their identification (Futcher et al., 1999; Gygi et al., 1999; Naaby-Hansen et al.,
2001), followed by quantification (Aebersold et al., 2000; Appel et al., 1997; Gygi et al.,
2000). The two-hybrid system enables detection of specific protein-protein associations
to build the interactome (Ito et al., 2001,Uetz et al., 2000,Walhout and Vidal, 2001).
Antelmann et al. (2001) used two-dimensional electrophoresis to
determine the membership of the secretome.
Given the goal of determining the functome, perhaps the most exciting technology is the
protein chip system, which is capable of high-throughput screening of protein
biochemical activity (Zhu et al., 2000; Zhu et al., 2001). Other methods for obtaining
large-scale protein functional characterization include a transposon insertion
methodology (Ross-Macdonald et al., 1999; Zhu et al., 2000; Zhu et al., 2001).
Although I discuss the computational and experimental methods separately, there is, in
fact, an inseparable relationship between the two. On the one hand, data resulting from
high-throughput experimentation require intensive computational interpretation and
evaluation (Carson et al., 2001). On the other hand, computational methods use empirical
data to build a knowledge base for predictions. Furthermore, they sometimes produce
questionable predictions that should be reviewed and confirmed through experiments, as
Antelmann et al. point out. In addition to these high-throughput techniques, another
interesting tactic is to aggregate the results of individual experiments through
comprehensive literature searches. Although there clearly are difficulties with differing
experimental conditions and varying interpretations, preliminary results have shown this
to be an effective method (Jensen, 2001,Marcotte et al., 2001,Ono et al., 2001).
Interrelating Different 'omes
Once the organism has been categorized into different sub-populations, a fundamental approach
in genomics is to establish relationships between the different 'omes. In other words, by
piecing the individual 'omes together, I hope to build a full and dynamic view of the
complex processes that support the organism. For example, how do the proteome and
regulome combine to produce the translatome?
As with defining the 'omes, these relationships can be explored in different ways:
(1)
Defining or assigning one 'ome based on another, as described above.
(2)
Comparing one 'ome with another to better understand the processes that shift one
population into its successor. For instance, this could be done by correlating expression
measurements for the transcriptome and translatome (see below).
(3)
Calculating "missing" (experimentally unattainable) information in one 'ome
based on information in another one - for example, using the known relationships
between gene expression level and subcellular location to help predict the destination of
proteins of unknown localization (Drawid and Gerstein, 2000,Drawid et al., 2000).
(4)
Describing the intersection between multiple populations. For example,
combining data from the transcriptome and the functome could describe the array of
biochemical, and potentially, physiological functions that are available to the cell at any
given time (Hegyi and Gerstein, 1999).
The Use of Broad Categories to Interpret Noisy Data
Functional genomics experiments generally give rise to very complicated data that are
inherently hard to interpret. Furthermore, these data are often plagued with noise (Kerr et
al., 2000). Both factors can lead to inaccuracies and conflicting interpretations.
A good example is gene expression measurements, which are known to fluctuate between
experiments even if the conditions are apparently identical (Baldi and Long, 2001). These
fluctuations are often due to measurement errors, but there are also inherent biological
variations of expression levels, relating to the stochastic nature of gene expression
(Szallasi, 1999). One cause is the very low cellular concentration of many transcription
factors, which means that they bind promoters only rarely. Such binding events approximate a
Poisson process, and macroscopic chemical kinetics therefore fails to describe the
resulting expression level of the gene (McAdams and Arkin, 1999; Thattai and van
Oudenaarden, 2001). In another example, the interactome, when determined using the
yeast two-hybrid technique, is notorious for false positives and negatives (Ito et al.,
2000; Ito et al., 2001; Legrain et al., 2001; Serebriiskii et al., 2000).
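The remark above about rare events and Poisson statistics can be illustrated numerically. The sketch below assumes only that the number of events (or molecules) per cell is Poisson distributed, in which case the relative fluctuation is 1/sqrt(mean), so low counts are proportionally much noisier than high counts. The mean counts and the function name `relative_noise` are illustrative assumptions, not measured values.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_noise(mean_count, n_cells=100_000):
    """Coefficient of variation (std / mean) of Poisson counts across cells."""
    counts = rng.poisson(mean_count, size=n_cells)
    return counts.std() / counts.mean()

# Compare the simulated relative noise with the Poisson expectation 1/sqrt(mean)
for mean_count in (5, 50, 5000):
    simulated = relative_noise(mean_count)
    expected = 1.0 / np.sqrt(mean_count)
    print(f"mean={mean_count:5d}  simulated CV={simulated:.3f}  1/sqrt(mean)={expected:.3f}")
```

A deterministic rate equation predicts only the mean, which is why it describes abundant species well but says nothing about the cell-to-cell spread of scarce ones.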
A useful way to tackle the noise and complexity of functional genomics information is to
average the data from many different genes into broad 'omic categories (Jansen and
Gerstein, 2000). For instance, instead of looking at how the level of expression of an
individual gene changes over a timecourse, I can average all the genes in a functional
category (e.g., glycolysis) together. This gives a more robust answer about the degree to
which a functional system changes over the timecourse. Likewise, if one wants to
investigate the relationship between a gene's essentiality (whether or not it is essential;
Winzeler et al., 1999) and its subcellular localization, it might be useful to combine the
results for all proteins in the same compartment. This would give the average degree of
essentiality of all nuclear proteins, cytoplasmic proteins, and so forth. In an actual study
for predicting protein subcellular localization, I obtained more accurate predictions for
the overall population of a given subcellular compartment (96% accuracy) than for
individual genes (75% accuracy) (Drawid et al., 2000).
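Why category averages are more robust than individual measurements can be seen in a small simulation. This is a sketch under strong simplifying assumptions: every gene in a category shares the same true expression change, and measurement noise is independent and Gaussian. The category names, changes, and noise level are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical true expression changes shared by all genes in each category
true_change = {"glycolysis": 2.0, "ribosome": -1.0, "transport": 0.0}
genes_per_category = 200
noise_sd = 1.5  # assumed measurement noise per gene

per_gene_error = []
category_error = {}
for category, change in true_change.items():
    measured = change + rng.normal(0.0, noise_sd, size=genes_per_category)
    # Error made when each gene is read individually
    per_gene_error.append(np.abs(measured - change))
    # Error made when the whole category is averaged first
    category_error[category] = abs(measured.mean() - change)

mean_gene_error = float(np.mean(np.concatenate(per_gene_error)))
worst_category_error = max(category_error.values())
print(f"typical single-gene error:  {mean_gene_error:.2f}")
print(f"worst category-mean error:  {worst_category_error:.2f}")
```

Under these assumptions the error of a category mean shrinks roughly as 1/sqrt(N) for N genes. Real categories are heterogeneous, so the gain is smaller than in this idealized case.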
Thus, the strength of genomic studies lies in the global comparisons between biological
systems rather than detailed examination of single genes or proteins. Genomic
information is often misused when applied exclusively to individual genes. If one is
interested only in one particular gene, there are many more conclusive experiments that
should be consulted before using the results from genomics datasets. Therefore, genomic
data should not be used in lieu of traditional biochemistry, but as an initial guideline to
identify areas for deeper investigation and to see how those results fit in with the rest of
the genome.
Moreover, most genomics datasets give relative rather than absolute information, which
means that information about a single gene has little meaning in isolation. Such datasets
are therefore best used to identify "outlier" genes that are particularly highly expressed, or
have especially many interactions, rather than to focus on the individual measurements
for a particular gene. A gene whose product makes a particularly large number of interactions
may be a key component of the cell. One numerical technique that is
particularly useful for dealing with this information is expressing results
through ranks (i.e., not giving the number of interactions of a particular gene product, but
how it ranks when compared with others). Furthermore, ranking provides a powerful way to
combine many different heterogeneous sources of information into a common and
statistically robust numerical framework (Gerstein and Hegyi, 1998; Gerstein and Levitt,
1997; Qian et al., 2001).
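As a minimal sketch of the rank-based approach, the example below converts two measurements with incompatible units into ranks and averages them. The gene names and values are hypothetical, the helper `to_ranks` is an illustration only, and averaging ranks with equal weight is just one of several possible ways to combine them.

```python
import numpy as np

def to_ranks(values):
    """Rank 1 = largest value. Ties are broken by position (a simplification)."""
    order = np.argsort(-np.asarray(values, dtype=float), kind="stable")
    ranks = np.empty(len(order), dtype=int)
    ranks[order] = np.arange(1, len(order) + 1)
    return ranks

genes = ["geneA", "geneB", "geneC", "geneD"]
expression = [1200.0, 15.0, 300.0, 2.0]  # hypothetical expression levels
interactions = [25, 40, 12, 1]           # hypothetical numbers of interaction partners

# Ranks are unitless, so the two sources can be combined directly
combined = (to_ranks(expression) + to_ranks(interactions)) / 2.0
for gene, score in zip(genes, combined):
    print(gene, score)
```

Because only the ordering is used, a single extreme measurement cannot dominate the combined score, which is the robustness property referred to above.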
These observations should be kept in mind when interacting with genomics tools and
databases. Many websites focus on providing a lot of information for a single gene
sequence or protein, in a "non-genomic" fashion. Rather, such sites should be designed to
simultaneously display and manipulate large populations of genes. In the absence of such
an 'omic interface, it is important that information resources at least accommodate bulk
downloading of standardized data.
A Case Study: Interrelating the Transcriptome and the Translatome
A specific example of comparing the transcriptome and translatome will illustrate the
points I made about interrelating 'omes and using categories to interpret noisy data. Here
the question is to what degree do highly expressed genes (transcriptome) correspond to
highly expressed proteins (translatome)? I can get very different answers depending on
the perspective I take:
Theoretical View
Turning to the entire mRNA and protein populations, the change in protein concentration
over time is equal to the rate of translation minus the rate of degradation. Borrowing from
chemical kinetics, this is approximately expressed by the equation
$dP(i,t)/dt = S\,E(i,t) - D\,P(i,t)$, where P is the abundance of protein i at time t, E is the corresponding mRNA expression
level, S is a general rate of protein synthesis per mRNA, and D is a general
rate of protein degradation per protein. Obviously, this is highly simplified, and in a more
general context one would expect the rates of synthesis and degradation to be
different for each gene and dependent on the regulatory effects of other genes over time.
In addition, the equation does not take into account the stochastic nature of gene
expression (see above) (Chen et al., 1999).
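For intuition, the rate equation can be solved in closed form in the special case where the expression level is constant in time. This is an additional assumption made only for this worked example; E(i) below denotes that constant level and P(i,0) the initial protein abundance.

```latex
\[
\frac{dP(i,t)}{dt} = S\,E(i) - D\,P(i,t)
\quad\Longrightarrow\quad
P(i,t) = \frac{S}{D}\,E(i) + \left(P(i,0) - \frac{S}{D}\,E(i)\right)e^{-Dt}
\]
\[
\lim_{t\to\infty} P(i,t) = \frac{S}{D}\,E(i)
\]
```

Under this assumption the protein abundance relaxes toward a value proportional to the expression level, with the ratio S/D as the constant of proportionality and 1/D as the characteristic relaxation time.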
Direct Comparison of Individual mRNA and Protein Data
At the moment, I do not have good enough data to apply models such as the equation
above. However, there is an intuitive sense that highly expressed genes correspond to
highly abundant proteins. (One can see this by imagining the situation at steady-state,
when the left-hand side of the equation is zero and a positive correlation between E and P
results.) Figure 2A shows the direct comparison between raw measurements of mRNA
expression and protein abundance data for 181 genes in yeast drawn from two recent
studies (Futcher et al., 1999; Gygi et al., 1999). The two variables show a high degree of
variation for individual data pairs and investigators have come to different conclusions
about the general correlation between them. This is, to some degree, dependent on the
subjective way of analyzing the data.
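One way in which the choice of analysis affects the reported correlation can be sketched with simulated data. The example assumes log-normally distributed mRNA levels and protein levels that follow them with multiplicative noise. These are synthetic values, not the measurements from the cited studies, and the spread parameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: protein abundance tracks mRNA level with multiplicative noise
n_genes = 200
mrna = np.exp(rng.normal(0.0, 1.5, size=n_genes))
protein = 1000.0 * mrna * np.exp(rng.normal(0.0, 1.5, size=n_genes))

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length arrays."""
    return float(np.corrcoef(x, y)[0, 1])

def ranks(x):
    """Zero-based ranks; adequate here because continuous data have no ties."""
    return np.argsort(np.argsort(x))

r_raw = pearson(mrna, protein)                        # dominated by a few large values
r_log = pearson(np.log10(mrna), np.log10(protein))    # correlation on a log scale
r_rank = pearson(ranks(mrna), ranks(protein))         # Spearman rank correlation

print(f"Pearson (raw): {r_raw:.2f}")
print(f"Pearson (log): {r_log:.2f}")
print(f"Spearman:      {r_rank:.2f}")
```

The three coefficients describe the same data yet generally differ, because the raw Pearson coefficient is sensitive to the few most abundant species while the log-scale and rank-based measures weight all genes more evenly.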
Analysis of the Data in Terms of Categories
Although the relationship between mRNA and protein levels is vague for individual
genes, some of the statistics for broad categories of protein properties are much more
robust. Figure 2B shows the protein secondary structure and functional composition in
the genome, the transcriptome (i.e., weighted by mRNA abundance), and in the
translatome (i.e., weighted by protein abundance). In contrast to the differences between
mRNA and protein data for individual genes, the broad categories show that the
transcriptome and translatome populations are remarkably similar; both contain roughly
the same proportions of secondary structure and functional categories. Moreover, this
contrasts with the genome, which appears to have a distinctly different composition of
functional categories. This illustrates that I get a more consistent picture when I average
across the population; that is, there is broad similarity between the characteristics of
highly expressed mRNA and highly abundant proteins.
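The category compositions compared here can be computed as weighted fractions. The sketch below assumes that each gene belongs to exactly one category and uses hypothetical genes, categories, and abundance values. The genome composition counts each gene once, whereas the transcriptome and translatome compositions weight each gene by its mRNA or protein level.

```python
from collections import defaultdict

def composition(genes, weight_key=None):
    """Fraction of the population in each category.
    With weight_key=None every gene counts once (genome view); otherwise
    each gene is weighted by the named abundance field."""
    totals = defaultdict(float)
    for gene in genes:
        weight = 1.0 if weight_key is None else gene[weight_key]
        totals[gene["category"]] += weight
    grand_total = sum(totals.values())
    return {category: w / grand_total for category, w in totals.items()}

# Hypothetical genes with made-up mRNA and protein levels
genes = [
    {"name": "geneA", "category": "energy", "mrna": 100.0, "protein": 50000.0},
    {"name": "geneB", "category": "energy", "mrna": 80.0, "protein": 20000.0},
    {"name": "geneC", "category": "transcription", "mrna": 2.0, "protein": 1500.0},
    {"name": "geneD", "category": "transcription", "mrna": 1.0, "protein": 300.0},
    {"name": "geneE", "category": "transport", "mrna": 17.0, "protein": 8200.0},
]

genome = composition(genes)
transcriptome = composition(genes, weight_key="mrna")
translatome = composition(genes, weight_key="protein")

for label, comp in (("genome", genome), ("transcriptome", transcriptome), ("translatome", translatome)):
    print(label, {category: round(fraction, 3) for category, fraction in comp.items()})
```

In this toy case the two weighted compositions resemble each other far more than either resembles the unweighted genome composition, which is the qualitative pattern described in the text.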
Conclusion
The ultimate goal of genomics is the elucidation of the functome, but there are many
intermediate steps. By viewing the cell in terms of a list of distinct parts, I can define,
part by part, each 'ome in an effort to determine and categorize functional information for
each gene. High-throughput experimentation and computational techniques are valuable
and complementary; that is, firm conclusions often cannot be drawn from a single
methodology. It must be noted that these data are only valuable with regard to large
populations, and as such, should only be used as a secondary source for single gene
queries. Moreover, genomic approaches result in inaccurate and noisy data. This noise,
while deafening on the single gene level, can be tolerated through the use of broad
categories to analyze the data.
ACKNOWLEDGMENTS
R.J. acknowledges IBM Graduate Research Fellowship.
Chapter 1: Integrating Genomic Data Sets
Figures and Tables
1.1 Interrelating Different Types of Genomic Data, from Proteome to Secretome:
'Oming in on Function
Figure 1 An overview of the current 'omic terminology. (A) A schematic of the main
'omes in the process of gene expression. (B) The literature citations of four of the most
widely used 'omes over time.
Figure 2 Interrelating the transcriptome and the translatome. (A) A direct comparison
of protein abundance and mRNA expression. The abundance data are from two recent
studies (datasets 1 and 2) of a global comparison of protein and mRNA expression levels
in yeast (Futcher et al., 1999; Gygi et al., 1999). The combined protein abundance dataset
is an average of the data points from the two studies if the given gene product appears in
both studies. The mRNA expression data are mainly derived from Holstege et al.
(1998). Although there is a general trend for protein concentration to rise with mRNA
levels, the actual correlation is weak and protein concentrations can sometimes vary by
more than two orders of magnitude for a given mRNA level. Similar observations were
reported by a study in human liver cells (Anderson and Seilhamer, 1997). The mRNA
expression data were scaled; the process is described on this paper's Web site
(http://bioinfo.mbb.yale.edu/expression). (B) The composition of the genome (proteome),
transcriptome and translatome in terms of broad categories: protein secondary structures
and functions. This is based on the analysis in Jansen and Gerstein (2000),
with updates to include protein abundance data. The bottom piecharts give the
composition in the genome, the middle charts in the transcriptome and the top charts in
the translatome. The compositions for the transcriptome and the translatome are
calculated by weighting each mRNA/protein with its respective expression level. The
secondary structure composition does not vary significantly between the different 'omes,
mainly because transcription and translation are independent of secondary structure. The
right five piecharts analyse the functional composition. I highlight the Energy and
Cellular Organization categories determined from MIPS (Mewes et al., 2000). A problem
in comparing the different 'omes is that each represents a different set of genes. For
instance, protein levels have been measured only for a fraction of genes whereas mRNA
levels are known for almost all genes. The piecharts show the compositions for the whole
genome in the right column and a representative subset of genes with known protein
levels in the left column. Comparing the left to the right immediately shows the
experimental bias of two-dimensional electrophoresis (the method for measuring protein
abundance) with respect to certain functional categories. There is good agreement
between the composition in the translatome and the transcriptome, despite the low
correlation of protein and mRNA levels for individual genes. In comparison, the highlighted
categories make up a much smaller fraction of the genome.
Table 1 A Table of 'omes, Together with their Occurrence in the Literature and on the World Wide Web

| Term | Description | Google | PubMed | Year of first PubMed citation |
|---|---|---|---|---|
| Genome | The full complement of genetic information, both coding and noncoding, in the organism | ~1880000 | 66171 | 1932** |
| Proteome | The protein-coding regions of the genome | ~63,000 | 703 | 1995 |
| Transcriptome | The population of mRNA transcripts in the cell, weighted by their expression levels | 3520 | 72 | 1997 |
| Physiome | Quantitative description of the physiological dynamics or functions of the whole organism | 2980 | 15 | 1997 |
| Metabolome | The quantitative complement of all the small molecules present in a cell in a specific physiological state | 349 | 12 | 1998 |
| Phenome | Qualitative identification of the form and function derived from genes, but lacking a quantitative, integrative definition | 4980 | 6 | 1995 |
| Morphome | The quantitative description of anatomical structure, biochemical and chemical composition of an intact organism, including its genome, proteome, cell, tissue and organ structures | 238 | 2 | 1996 |
| Interactome | List of interactions between all macromolecules in a cell | 56 | 2 | 1999 |
| Glycome | The population of carbohydrate molecules in the cell | 46 | 1 | 2000 |
| Secretome | The population of gene products that are secreted from the cell | 21 | 1 | 2000 |
| Ribonome | The population of RNA-coding regions of the genome | 1 | 1 | 2000 |
| Orfeome | The sum total of open reading frames in the genome, without regard to whether or not they code; a subset of this is the proteome | 42 | - | - |
| Regulome | Genome-wide regulatory network of the cell | 18 | - | - |
| Cellome | The entire complement of molecules and their interactions within a cell | 17 | - | - |
| Operome | The characterization of proteins with unknown biological function | 8 | - | - |
| Transportome | The population of the gene products that are transported; this includes the secretome | 1 | - | - |
| Pseudome | The complement of pseudogenes in the proteome | - | - | - |
| Functome | The population of gene products classified by their functions | 1 | - | - |
| Translatome | The population of proteins in the cell, weighted by their expression levels | - | - | - |
| Foldome | The population of gene products classified through their tertiary structure | - | - | - |
| *Unknome | Genes of unknown function | - | - | - |

Updated versions of this table will be available through my Web site at
http://bioinfo.mbb.yale.edu/what-is-it. Note that I define five new 'omes: the translatome,
the foldome, the pseudome, the functome, and the unknome. My definition of the
translatome is motivated partially by the ambiguities in the term proteome, which has two
competing definitions. First, broadly favored by computational biologists, it is a list of all
the proteins encoded in the genome (Gaasterland 1999; Doolittle 2000). In this context, it
is equivalent to what some refer to as the orfeome, (i.e., the set of genes excluding
noncoding regions). Experimentalists, especially those involved in large-scale
experiments such as expression analysis and 2D electrophoresis, favor a second
definition. Here, it is used to describe the actual cellular contents of proteins, taking into
account the different levels of protein concentrations (Yates 2000). I prefer the former
definition for proteome, and use the term translatome for the latter. See
http://www.genomic_glossaries.com/content/omes.asp for a listing of other 'omes and
their definitions.
* This term is also used in other fields with different meanings.
** First citation according to the Oxford English Dictionary.
References
1.
Aebersold, R., Rist, B. & Gygi, S. P. Quantitative proteome analysis: methods and
applications. Ann N Y Acad Sci 919, 33-47 (2000).
2.
Anderson, L. & Seilhamer, J. A comparison of selected mRNA and protein
abundances in human liver. Electrophoresis 18, 533-7 (1997).
3.
Antelmann, H. et al. A proteomic view on genome-based signal peptide
predictions. Genome Res 11, 1484-502 (2001).
4.
Appel, R. D., Vargas, J. R., Palagi, P. M., Walther, D. & Hochstrasser, D. F.
Melanie II--a third-generation software package for analysis of two-dimensional
electrophoresis images: II. Algorithms. Electrophoresis 18, 2735-48. (1997).
5.
Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene
Ontology Consortium. Nat Genet 25, 25-9. (2000).
6.
Baldi, P. & Long, A. D. A Bayesian framework for the analysis of microarray
expression data: regularized t-test and statistical inferences of gene changes.
Bioinformatics 17, 509-19. (2001).
7.
Carson, J. H., Cowan, A. & Loew, L. M. Computational cell biologists snowed in
at Cranwell. Trends Cell Biol 11, 236-8. (2001).
8.
Chen, T., He, H. L. & Church, G. M. Modeling gene expression with differential
equations. Pac Symp Biocomput, 29-40. (1999).
9.
Claverie, J. M. Computational methods for the identification of genes in
vertebrate genomic sequences. Hum Mol Genet 6, 1735-44 (1997).
10.
Drawid, A. & Gerstein, M. A Bayesian system integrating expression data with
sequence patterns for localizing proteins: comprehensive application to the yeast
genome. J Mol Biol 301, 1059-75. (2000).
11.
Drawid, A., Jansen, R. & Gerstein, M. Genome-wide analysis relating expression
level with protein subcellular localization. Trends Genet 16, 426-30 (2000).
12.
Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D. Cluster analysis and
display of genome-wide expression patterns. Proc Natl Acad Sci U S A 95, 14863-8 (1998).
13.
Epstein, C. & Butow, R. Microarray technology - enhanced versatility, persistent
challenge. Curr Opin Biotechnol 11, 36-41 (2000).
14.
Futcher, B., Latter, G. I., Monardo, P., McLaughlin, C. S. & Garrels, J. I. A
sampling of the yeast proteome. Mol Cell Biol 19, 7357-68 (1999).
15.
Gerstein, M. A structural census of genomes: comparing bacterial, eukaryotic, and
archaeal genomes in terms of protein structure. J Mol Biol 274, 562-76 (1997).
16.
Gerstein, M. Patterns of Protein-Fold Usage in Eight Microbial Genomes: A
Comprehensive Structural Census. Proteins 33, 518-534 (1998).
17.
Gerstein, M. & Hegyi, H. Comparing genomes in terms of protein structure:
surveys of a finite parts list. FEMS Microbiol Rev 22, 277-304 (1998).
18.
Gerstein, M. & Honig, B. Sequences and topology. Curr Opin Struct Biol 11,
327-9. (2001).
19.
Gerstein, M. & Jansen, R. The current excitement in bioinformatics, analysis of
whole-genome expression data: How does it relate to protein structure and
function (In press). Current Opinions in Structural Biology (2000).
20.
Gerstein, M. & Levitt, M. A structural census of the current population of protein
sequences. Proc Natl Acad Sci U S A 94, 11911-6. (1997).
21.
Guigo, R., Agarwal, P., Abril, J. F., Burset, M. & Fickett, J. W. An assessment of
gene prediction accuracy in large DNA sequences. Genome Res 10, 1631-42.
(2000).
22.
Gygi, S. P., Rist, B. & Aebersold, R. Measuring gene expression by quantitative
proteome analysis [In Process Citation]. Curr Opin Biotechnol 11, 396-401
(2000).
23.
Gygi, S. P., Rochon, Y., Franza, B. R. & Aebersold, R. Correlation between
protein and mRNA abundance in yeast. Mol Cell Biol 19, 1720-30. (1999).
24.
Harrison, P. M., Echols, N. & Gerstein, M. B. Digging for dead genes: an analysis
of the characteristics of the pseudogene population in the Caenorhabditis elegans
genome. Nucleic Acids Res 29, 818-30. (2001).
25.
Hegyi, H. & Gerstein, M. The relationship between protein structure and function:
a comprehensive survey with application to the yeast genome. J Mol Biol 288,
147-64 (1999).
26.
Hegyi, H., Lin, J., Greenbaum, D. & Gerstein, M. Structural genomics analysis:
characteristics of atypical, common, and horizontally transferred folds. Proteins
47, 126-41 (2002).
27.
Holstege, F. C. et al. Dissecting the regulatory circuitry of a eukaryotic genome.
Cell 95, 717-728 (1998).
28.
Ito, T. et al. A comprehensive two-hybrid analysis to explore the yeast protein
interactome. Proc Natl Acad Sci U S A 98, 4569-74. (2001).
29.
Ito, T., Chiba, T. & Yoshida, M. Exploring the protein interactome using
comprehensive two-hybrid projects. Trends Biotechnol 19, S23-7. (2001).
30.
Ito, T. et al. Toward a protein-protein interaction map of the budding yeast: A
comprehensive system to examine two-hybrid interactions in all possible
combinations between the yeast proteins. Proc Natl Acad Sci 97, 1143-1147
(2000).
31.
Jansen, R. & Gerstein, M. Analysis of the yeast transcriptome with structural and
functional categories: characterizing highly expressed proteins. Nucleic Acids Res
28, 1481-8 (2000).
32.
Jensen, F. V. Bayesian Networks and Decision Graphs (Springer, New York,
2001).
33.
Kerr, M. K., Martin, M. & Churchill, G. A. Analysis of variance for gene
expression microarray data. J Comput Biol 7, 819-37 (2000).
34.
Legrain, P., Wojcik, J. & Gauthier, J. M. Protein--protein interaction maps: a lead
towards cellular functions. Trends Genet 17, 346-52. (2001).
35.
Luscombe, N. M., Greenbaum, D. & Gerstein, M. What is bioinformatics? A
proposed definition and overview of the field. Methods Inf Med 40, 346-58 (2001).
36.
Marcotte, E. M., Pellegrini, M., Thompson, M. J., Yeates, T. O. & Eisenberg, D.
A combined algorithm for genome-wide prediction of protein function. Nature
402, 83-6. (1999).
37.
Marcotte, E. M., Xenarios, I. & Eisenberg, D. Mining literature for protein-protein
interactions. Bioinformatics 17, 359-63. (2001).
38.
Marx, K. Grundrisse (1857).
39.
McAdams, H. H. & Arkin, A. It's a noisy business! Genetic regulation at the
nanomolar scale. Trends Genet 15, 65-9. (1999).
40.
Mewes, H. W. et al. MIPS: a database for genomes and protein sequences.
Nucleic Acids Res 28, 37-40 (2000).
41.
Naaby-Hansen, S., Waterfield, M. D. & Cramer, R. Proteomics - post-genomic
cartography to understand gene function. Trends Pharmacol Sci 22, 376-84.
(2001).
42.
Ono, T., Hishigaki, H., Tanigami, A. & Takagi, T. Automated extraction of
information on protein-protein interactions from the biological literature.
Bioinformatics 17, 155-61. (2001).
43.
Pederson, T. The immunome. Mol Immunol 36, 1127-8. (1999).
44.
Qian, J. et al. PartsList: a web-based system for dynamically ranking protein folds
based on disparate attributes, including whole-genome expression and interaction
information. Nucleic Acids Res 29, 1750-64 (2001).
45.
Rison, S. C. G., Hodgman, T. C. & Thornton, J. M. Comparison of Functional Annotation
Schemes for Genomes. Funct Integr Genomics 1, 56-59 (2000).
46.
Ross-Macdonald, P. et al. Large-scale analysis of the yeast genome by transposon
tagging and gene disruption. Nature 402, 413-8 (1999).
47.
Sanchez, C. et al. Grasping at molecular interactions and genetic networks in
Drosophila melanogaster using FlyNets, an Internet database. Nucleic Acids Res
27, 89-94. (1999).
48.
Serebriiskii, I., Estojak, J., Berman, M. & Golemis, E. A. Approaches to detecting
false positives in yeast two-hybrid systems. Biotechniques 28, 328-30, 332-6.
(2000).
49.
Simons, K. T., Strauss, C. & Baker, D. Prospects for ab initio protein structural
genomics. J Mol Biol 306, 1191-9. (2001).
50.
Skolnick, J. & Fetrow, J. S. From genes to protein structure and function: novel
applications of computational approaches in the genomic era. Trends Biotechnol
18, 34-9. (2000).
51.
Szallasi, Z. Genetic network analysis in light of massively parallel biological data
acquisition. Pac Symp Biocomput, 5-16. (1999).
52.
Teichmann, S. A., Murzin, A. G. & Chothia, C. Determination of protein function,
evolution and interactions by structural genomics. Curr Opin Struct Biol 11, 354-63. (2001).
53.
Thattai, M. & van Oudenaarden, A. Intrinsic noise in gene regulatory networks.
Proc Natl Acad Sci U S A 98, 8614-9. (2001).
54.
Thornton, J. M. From genome to function. Science 292, 2095-7. (2001).
55.
Tjalsma, H., Bolhuis, A., Jongbloed, J. D., Bron, S. & van Dijl, J. M. Signal
peptide-dependent protein transport in Bacillus subtilis: a genome-based survey of
the secretome. Microbiol Mol Biol Rev 64, 515-47. (2000).
56.
Tweeddale, H., Notley-McRobb, L. & Ferenci, T. Effect of slow growth on
metabolism of Escherichia coli, as revealed by global metabolite pool
("metabolome") analysis. J Bacteriol 180, 5109-16. (1998).
57.
Uetz, P. et al. A comprehensive analysis of protein-protein interactions in
Saccharomyces cerevisiae. Nature 403, 623-7. (2000).
58.
Velculescu, V. E. et al. Characterization of the yeast transcriptome. Cell 88, 243-251 (1997).
59.
Vukmirovic, O. G. & Tilghman, S. M. Exploring genome space. Nature 405, 820-2. (2000).
60.
Walhout, A. J. & Vidal, M. High-throughput yeast two-hybrid assays for large-scale
protein interaction mapping. Methods 24, 297-306. (2001).
61.
Wilson, C. A., Kreychman, J. & Gerstein, M. Assessing annotation transfer for
genomics: quantifying the relations between protein sequence, structure and
function through traditional and probabilistic scores. J Mol Biol 297, 233-49
(2000).
62.
Winzeler, E. A. et al. Functional characterization of the S. cerevisiae genome by
gene deletion and parallel analysis. Science 285, 901-6 (1999).
63.
Yeh, R. F., Lim, L. P. & Burge, C. B. Computational inference of homologous
gene structures in the human genome. Genome Res 11, 803-16. (2001).
64.
Zhu, H. et al. Global analysis of protein activities using proteome chips. Science
293, 2101-5. (2001).
65.
Zhu, H. et al. Analysis of yeast protein kinases using protein chips. Nat Genet 26,
283-9. (2000).
Chapter 2: mRNA expression and protein abundance
2.1 Analysis of mRNA expression and protein abundance data:
An approach for the comparison of the enrichment of features in
the cellular population of proteins and transcripts
Abstract
Motivation
Protein abundance is related to mRNA expression through many different cellular processes. Up to now, there have been conflicting results on how correlated the levels of
these two quantities are. Given that expression and abundance data are significantly more
complex and noisy than the underlying genomic sequence information, it is reasonable to
simplify and average them in terms of broad proteomic categories and features (e.g. functions or secondary structures), for understanding their relationship. Furthermore, it will
be essential to integrate, within a common framework, the results of many varied experiments by different investigators. This will allow one to survey the characteristics of
highly expressed genes and proteins.
Results
To this end, I outline a formalism for merging and scaling many different gene expression and protein abundance data sets into a comprehensive reference set, and I develop an
approach for analyzing this in terms of broad categories, such as composition, function,
structure and localization. As the various experiments are not always done using the
same set of genes, sampling bias becomes a central issue, and my formalism is designed
to explicitly show this and correct for it. I apply my formalism to the currently available
gene expression and protein abundance data for yeast. Overall, I found substantial
agreement between gene expression and protein abundance, in terms of the enrichment of
structural and functional categories. This agreement, which was considerably greater
than the simple correlation between these quantities for individual genes, reflects the way
broad categories collect many individual measurements into simple, robust averages. In
particular, I found that in comparison to the population of genes in the yeast genome, the
cellular populations of transcripts and proteins (weighted by their respective abundances)
were both enriched in: (i) the small amino acids Val, Gly, and Ala; (ii) low molecular
weight proteins; (iii) helices and sheets relative to coils; (iv) cytoplasmic proteins relative
to nuclear ones; and (v) proteins involved in "protein synthesis," "cell structure," and
"energy production".
Supplementary information
http://genecensus.org/expression/translatome
Introduction
With the recent popularity of high-throughput experimentation, biologists have begun to
create a large inventory of scientific data (Claverie, 1999,Einarson and Golemis,
2000,Epstein and Butow, 2000,Shapiro and Harris, 2000). Much of this has come from
expression experiments, partially fueled by the advent and continuous evolution of the
microarray and Gene Chip systems.
These experiments allow for large scale,
comprehensive scans of gene expression within the cell (Eisen and Brown, 1999,Ferea
and Brown, 1999,Lipshutz et al., 1999,Schena et al., 1995). Expression data sets are
currently the single richest source of information in genomics, and for yeast, expression
information now dwarfs that in the sequence alone. However, "theory" has not kept up
with experimentation in this area, and how to best interpret the vast amount of data
generated by these experiments is still a very open question (Bassett et al., 1996,Gerstein
and Jansen, 2000,Searls, 2000,Sherlock, 2000,Wittes and Friedman, 1999,Zhang, 1999).
Genome-wide experimentation has also been used to directly measure the cellular
population of proteins (protein abundance) (Anderson and Seilhamer, 1997, Futcher et al.,
1999, Gygi et al., 1999, Ross-Macdonald et al., 1999). Understanding how protein abundance is related to mRNA transcript levels is essential for interpreting gene expression
and also, more generally, for understanding the interactions, structures and functions in a
cellular system (Hatzimanikatis et al., 1999). Moreover, as protein concentration, rather
than transcript population, is the more relevant variable with respect to enzyme activity,
it is this quantity that connects genomics to the physical chemistry and dynamics of the
cell (Kidd et al., 2001). Finally, protein abundance levels may become invaluable for
diagnostic methods as well as for determining new drug targets (Corthals, 2000). High-throughput two-dimensional gel electrophoresis (2-DE), in conjunction with mass
spectrometry, has been used to identify proteins that can then be quantified to determine
protein abundance (Futcher et al, 1999, Gygi et al, 1999, Harry, 2000). Other technologies
include using random integration of reporter transposons in yeast (Ross-Macdonald et al,
1999), and modifying the microarray concept for use with proteins (Lopez,
2000, MacBeath and Schreiber, 2000, Nelson et al., 2000, Zhu et al., 2000).
Gene expression is indirectly related to cellular protein abundance through the process of
translation. The cell connects mRNA expression and protein abundance through translational control, which is primarily regulated at the initiation of translation (Day and Tuite,
1998,Jackson and Wickens, 1997,Lindahl and Hinnebusch, 1992,McCarthy, 1998). Much
of this control is the result of multiple cis-acting elements in the mRNA (Jacobs
Anderson and Parker, 2000). There are large non-coding regions in each mRNA species
devoted to regulation of that mRNA as well as its stability and degradation properties,
including 5' and 3' UTRs, uORFs and uAUGs (Morris and Geballe, 2000, Vilela et al.,
1998, Vilela et al., 1999).
Previously, we surveyed the population of protein features -- such as folds, amino acid
composition, and functions -- in yeast, and a number of the other recently sequenced genomes (Das and M., 2000,Gerstein, 1997,Gerstein, 1998,Gerstein, 1998,Gerstein,
1998,Hegyi and Gerstein, 1999,Lin and Gerstein, 2000). Others have also done related
work (Frishman and Mewes, 1997,Frishman and Mewes, 1999,Jones, 1998,Tatusov et al.,
1997,Wallin and von Heijne, 1998,Wolf et al., 1999).
Recently, we extended this
concept to compare the population of features in the yeast transcriptome to that in the genome (Drawid et al., 2000,Jansen and Gerstein, 2000). Here, I present a new methodology to compare the features of the mRNA expression population with the protein
abundance population.
Precise terminology is essential for this comparison to be readily understandable.
Unfortunately, one of the terms that immediately come to mind in relation to protein
populations, “proteome”, has in the past been used inconsistently. In particular, the term
proteome can logically be used to describe all the distinctly different proteins in the genome (Bairoch, 2000,Cambillau and Claverie, 2000,Cavalcoli et al., 1997,Doolittle,
2000,Fey et al., 1997,Gaasterland, 1999,Garrels et al., 1997,Jones, 1999,Pandey and
Mann, 2000,Qi et al., 1996,Rubin et al., 2000,Sali, 1999,Tekaia et al., 1999) and, in this
context, it is equivalent to what others may refer to as the coding part of the genome.
However, in papers on 2D electrophoresis, it is often used to describe the sum total of
proteins in a cell, taking into account the different levels of protein abundance for
different proteins (Gygi et al., 2000,Lopez, 2000,Shevchenko et al., 1996,Washburn and
Yates, 2000). In an effort to be clear, I propose the term “translatome” for this second
usage of proteome.
With this definition, I am able to refer compactly to three different cellular populations.
These are illustrated in figure 1.
(i.) I use the term genome when I refer to the population of open reading frames,
where each ORF counts once.
(ii.) I use the term transcriptome when I refer to the population of mRNA transcripts. This term was originally coined by Velculescu et al. (Velculescu et al.,
1997). Note that each ORF may give rise to different numbers of transcripts.
Consequently, the transcriptome is essentially the same as the genome but with
each ORF weighted by its expression level.
(iii.) The next level is the cellular population of proteins. As each protein represents
a translated transcript, I make an analogy with the term transcriptome and
use the term translatome as described above to describe this third population.
Thus, the translatome is a subset of the genome where each ORF is weighted by
its associated level of protein abundance.
Note that one could also less compactly call the translatome a "weighted proteome".
However, doing so assumes one of the two aforementioned definitions of proteome. To
avoid ambiguity, I studiously avoid using the term proteome altogether in this chapter.
Differences between the translatome and the transcriptome exist given that transcripts
from different genes can give rise to different numbers of proteins, due to different rates
of translation and protein degradation. Post-transcriptional modifications further affect
the translatome.
Although there are gene expression and protein abundance data sets for multiple organisms, I have chosen to work specifically on yeast. Besides having its whole genome
sequenced (Goffeau, 1996), yeast is also a powerful tool in genetics (Carlson, 2000) due
to, among other things, the two hybrid system, a robust and versatile technique used in
discerning protein-protein interactions (Luban and Goff, 1995,Young, 1998).
In my analysis of the transcriptome and translatome, I focus on global protein features
rather than the comparison of individual genes. Previous analyses have shown that
differences between mRNA expression and protein abundance level can be quite dramatic for individual genes. This may either be due to the noise in the data or to
fundamental biological processes. However, my analysis shows that the variation between
transcriptome and translatome is much smaller for global properties that are computed by averaging over the properties of many individual genes.
Methods
Data sources used
For my analysis I culled many divergent data sets, representing protein abundance and
mRNA expression experiments and also other sources of genome annotation. These are
all summarized in Table 1. Briefly, they included two protein abundance sets, measured
via 2-dimensional gel electrophoresis and mass spectrometry. I termed these 2-DE #1
(Gygi et al, 1999) and 2-DE #2 (Futcher et al, 1999). These sets, while admittedly small
in comparison to the size of expression data sets, represent the largest amount of information on protein abundance publicly available at present. I also apply my methodology,
with limited success, to the semi-quantitative Transposon insertion data set that measures
the LacZ expression of fusion proteins (Ross-Macdonald et al, 1999). Although this set
contains many more genes than either of the gel electrophoresis sets, and thus is an
appealing source of protein abundance information, the more qualitative nature of the
data makes comparisons with other data sets difficult.
My mRNA expression data came from multiple laboratories that used either Gene Chip
or SAGE technology. The Gene Chip sets included the Young Expression Set (Holstege
et al., 1998), the Church Expression Set (Roth et al., 1998) and the Samson Expression
Set (Jelinsky and Samson, 1999). I used data representing the vegetative state of yeast
from all of the above experiments. I also compiled two reference sets to be used in my
comparisons, one for protein abundance and another for mRNA expression (summarized
below). Finally, I used many different types of genome annotation in my analysis, which
are summarized in Table 1. In particular, the Munich Information Center for Protein
Sequences (MIPS), a site containing a large number of databases (Mewes et al., 2000),
proved to be an invaluable source of data specifically in regard to functional categories.
Biases in the data
There is a caveat to the usage of data from high-throughput experimentation (i.e. microarrays and two-dimensional gel electrophoresis). With all high throughput expression
studies there always exists the difficulty of maintaining consistent biological and processing conditions across the assay. Moreover, the databases that annotate the specific genes
may not always be accurate (Ishii et al., 2000). Gene chip experiments suffer with regard
to cross hybridization and the saturation of probes for the highly expressed genes. SAGE
data is not always reliable for assessing ORFs with low expression levels. With regard to
2D gels, although the technology has undergone many improvements since its introduction over a quarter century ago (Klose, 1975,O'Farrell, 1975), there remain many aspects
of the procedure that introduce biases into the data. These include the inability to resolve
membrane proteins (approximately 30% of the genome) and basic proteins (Gerstein,
1998,Krogh et al., 2001). Moreover, there exist some biases in the data that, as in any
compilation, reflect the tendencies of the investigator. These include the lack of low
abundance proteins (Fey and Larsen, 2001, Gygi et al., 2000, Harry, 2000) and the
differences between labs in sample preparation. In addition, the procedures for identification (e.g. MALDI-TOF) and quantification (e.g. ICAT) (Gygi et al, 1999) of the protein
spots are much more recent and themselves subject to problems and uncertainties
(Haynes and Yates, 2000).
I am trying to correct for these biases in my analysis in two ways. First, I create reference mRNA expression and protein abundance datasets as a starting point for my analysis.
I achieve this by scaling and averaging different mRNA and protein datasets into a
combined reference, in an attempt to obtain a better estimate of the normal expression
state of a yeast cell (I explain this procedure in more detail in the following section).
This results in a correction of the biases that might be found in individual datasets.
Second, in analyzing the reference datasets, I use a formalism and a graphical representation that shows the dependency of the results on the subset of genes for which
experimental data is available, thus making sampling or selection biases explicit.
Data set scaling
A reference set for mRNA expression
With many different mRNA expression data sets available, it is worthwhile to integrate
them into a single unified reference set, with the intention of reducing the noise and errors contained in the individual data sets and to obtain a unified estimate of the normal
expression state in a cell.
I adopt an iterative scaling and merging formalism, which I summarize below. I present
a more detailed review of the methods at the following web site:
genecensus.org/expression/translatome.
I start with the values of one Gene Chip data set Ui where i is used throughout as a subscript to denote gene number. I then transform the values of the next Gene Chip data set
Xi to Yi with the following non-linear regression:
$$\min \sum_i (Y_i - U_i)^2 \quad \text{with} \quad Y_i = A X_i^{B}$$
where A and B are the parameters of the regression. Note that two Gene Chip sets may
not be defined for the same set of genes, so I have to perform the fit only over the genes
common to both sets. The motivation for scaling is that the dynamic range of observed
expression levels varies somewhat between different data sets, although cell types and
growth conditions are very similar. Reasons for disparity may include different calibration procedures for relating fluorescence intensity to a cellular concentration (measured
in copies of transcripts per cell) or different protocols for harvesting and reverse-transcribing the cellular mRNA.
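As a minimal sketch of this scaling step (not the code used for the thesis), the fit can be written as follows. For a fixed exponent B the least-squares prefactor A has a closed form, so the sketch searches over B alone with a golden-section search. The function names, the search interval for B, and the dictionary layout (gene name mapped to expression value) are assumptions made for illustration, and all expression values are assumed to be positive.

```python
import math


def fit_power_law(x, u, b_lo=0.1, b_hi=3.0, tol=1e-8):
    """Least-squares fit of u ~ A * x**B on the raw (not log) values.

    x and u are equal-length lists of positive numbers.
    Returns (A, B, residual sum of squares).
    """
    def best_a(b):
        # For fixed B, minimising sum((A*x**B - u)**2) over A is linear.
        num = sum(ui * xi ** b for xi, ui in zip(x, u))
        den = sum(xi ** (2 * b) for xi in x)
        return num / den

    def residual(b):
        a = best_a(b)
        return sum((a * xi ** b - ui) ** 2 for xi, ui in zip(x, u))

    # Golden-section search over B (assumes a single minimum in [b_lo, b_hi]).
    phi = (math.sqrt(5) - 1) / 2
    lo, hi = b_lo, b_hi
    while hi - lo > tol:
        m1 = hi - phi * (hi - lo)
        m2 = lo + phi * (hi - lo)
        if residual(m1) < residual(m2):
            hi = m2
        else:
            lo = m1
    b = (lo + hi) / 2
    return best_a(b), b, residual(b)


def scale_onto(reference, other):
    """Scale data set `other` (X) onto `reference` (U), fitting only common genes."""
    common = sorted(set(reference) & set(other))
    a, b, _ = fit_power_law([other[g] for g in common],
                            [reference[g] for g in common])
    # The fitted transform is applied to every gene in `other`, not just the common ones.
    scaled = {g: a * v ** b for g, v in other.items()}
    return scaled, a, b
```

The fit uses only the genes present in both sets, as described above, while the resulting transform is applied to every gene in the set being scaled.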
I then merge and average the data to create a new reference set V as follows:
If Ui and Yi are both defined for gene i and
$$\frac{|Y_i - U_i|}{Y_i + U_i} < \alpha$$
Then
$$V_i = \tfrac{1}{2}(Y_i + U_i)$$
Else if only Yi exists, Vi = Yi
Else Vi = Ui
As presented above, where only one data set has a value for the corresponding ORF, I
incorporated that value and did not exclude it. When both data sets have values for an
ORF, I averaged the values if they were within 15% of each other; otherwise, I just
stayed with the original chip data set Ui. I used α = 15% in order to prevent outliers from
skewing the result. This 15% value is a reasonable threshold for excluding outliers
though other values (e.g. 10% or 20%) would give similar results (data not shown).
Other data sets are subsequently included in the same procedure, continuing the iteration
from the new expression values Vi. The initial iteration starts with the Young Expression
Set as Ui since I have the highest confidence in its accuracy.
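The merging rule can be sketched in a few lines. This is an illustration under stated assumptions rather than the original implementation: the "within 15%" test is read here as |Yi - Ui| / (Yi + Ui) < α, and the function name and dictionary layout are hypothetical.

```python
def merge_sets(u, y, alpha=0.15):
    """Merge reference set u with the already-scaled set y into a new reference.

    u, y: dicts mapping gene name -> expression value.
    """
    merged = {}
    for gene in set(u) | set(y):
        if gene in u and gene in y:
            ui, yi = u[gene], y[gene]
            total = ui + yi
            # Assumed reading of the cutoff: relative difference below alpha.
            close = total > 0 and abs(yi - ui) / total < alpha
            # Average values that agree; otherwise keep the reference value.
            merged[gene] = 0.5 * (ui + yi) if close else ui
        elif gene in y:
            merged[gene] = y[gene]
        else:
            merged[gene] = u[gene]
    return merged
```

Further data sets would be folded in by calling the function again with the merged result as the new reference, mirroring the iteration described above.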
The SAGE data was not included in the above procedure since it is of a fundamentally
different nature. An advantage of the SAGE technology over Gene Chips is that there is
no possible signal saturation for high expression levels, as is possible for chips (Futcher et
al, 1999). Conversely, SAGE values are less reliable for lowly expressed genes since
there is a chance that one might not sequence a SAGE tag corresponding to such a gene
at all. Therefore, if after the last iteration, the average Gene Chip expression level
Vi was both above a certain threshold and below the SAGE expression level Si for the
same gene, it was replaced with the SAGE value; otherwise the average Gene Chip value
was kept. This gave my final expression set wmRNA. My treatment of the SAGE data
is modeled after that in Futcher et al. (Futcher et al, 1999), and like them, I used a threshold of 16.
This incorporation of the SAGE data into the reference data set ensures that the highly
expressed outliers are as accurate as possible.
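The SAGE substitution rule is simple enough to state as code. The following is a sketch only; the function and parameter names are hypothetical, and the threshold of 16 is taken from the text.

```python
def apply_sage(chip, sage, threshold=16.0):
    """Replace Gene Chip values by SAGE values for highly expressed genes.

    chip: dict gene -> merged Gene Chip value (V_i)
    sage: dict gene -> SAGE value (S_i); genes may be missing
    """
    final = dict(chip)
    for gene, v in chip.items():
        s = sage.get(gene)
        # Substitute only when the chip value is above the threshold
        # and below the SAGE value for the same gene.
        if s is not None and threshold < v < s:
            final[gene] = s
    return final
```

Genes without a SAGE measurement, and genes whose chip value is at or below the threshold, keep their Gene Chip value.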
Rather than plain arithmetic averaging, this overall scaling procedure with the α cutoff
avoids “artificial averages” that combine very different values for a particular gene.
Some expression values might be statistical outliers. In addition, it may be possible that
the expression levels of a variety of genes can only be within mutually exclusive ranges
or modes, such as when two alternative pathways are switched on or off. Simply averaging these would give values that are less representative of the particular mode values.
This situation is analogous to that in averaging together an ensemble of protein structures,
say from an NMR structure determination. Each structure in the ensemble could be
stereochemically correct, with all side-chain atoms in predefined rotamer configurations.
However, an average of all structures in the ensemble could yield one that is stereochemically incorrect if this involved averaging over particular side-chains in different rotameric
states.
With regard to my regression analysis, I have investigated both non-linear and linear fits
but found a non-linear procedure to be more advantageous. The non-linear relationship
between different expression datasets perhaps reflects saturation in one or more of the
gene chips -- not an uncommon phenomenon. This non-linearity is immediately evident
on scatter plots of two datasets against one another (see website). Accordingly, the non-linear fit produces a smaller residual than the linear fit: 98297 (non-linear) versus 122182
(linear) for the scaling of the Church dataset and 59828 (non-linear) versus 67462 (linear)
for the Samson dataset.
A reference set for protein abundance
I followed a similar procedure to calculate a reference protein abundance set from the two
gel electrophoresis data sets. I first scaled the two data sets against the mRNA expression
reference data set, getting regression parameters Cj and Dj:
$$\min \sum_i \left( P_{i,j} - C_j\, w_{\mathrm{mRNA},i}^{D_j} \right)^2$$
where the subscript j indicates the data set 2-DE #1 or 2-DE #2 respectively; Pi,j is the
protein abundance value in data set j, and wmRNA,i the corresponding reference expression
value, and Cj and Dj are the parameters of the non-linear regression.
Using these parameters, I transformed the values of set 2-DE #2 onto 2-DE #1. Then I
combined both sets into the reference protein set wProt by averaging them if both values
existed; otherwise, I used the existing value, viz:
$$Q_{i,2} = C_1 \left( \frac{P_{i,2}}{C_2} \right)^{D_1/D_2}$$
wProt,i = (Pi,1 + Qi,2 )/2 if both Pi,1 and Qi,2 exist.
Else if only Pi,1 exists, wProt,i = Pi,1
Else if only Qi,2 exists, wProt,i = Qi,2.
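The transformation follows from inverting the fit for set 2 (giving wmRNA = (P/C2)^(1/D2)) and inserting the result into the fit for set 1. A minimal sketch of both steps, with hypothetical function names and with the regression parameters assumed to be already known:

```python
def map_set2_onto_set1(p2, c1, d1, c2, d2):
    """Transform 2-DE #2 values onto the scale of 2-DE #1.

    Implements Q = C1 * (P / C2) ** (D1 / D2) for every gene in p2.
    """
    return {gene: c1 * (p / c2) ** (d1 / d2) for gene, p in p2.items()}


def protein_reference(p1, q2):
    """Average the two sets where both have a value, else keep the one present."""
    reference = {}
    for gene in set(p1) | set(q2):
        if gene in p1 and gene in q2:
            reference[gene] = 0.5 * (p1[gene] + q2[gene])
        elif gene in p1:
            reference[gene] = p1[gene]
        else:
            reference[gene] = q2[gene]
    return reference
```

Unlike the mRNA merging step, no agreement cutoff is applied here; values present in both sets are always averaged.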
Enrichment of features
Figure 2 focuses on individual proteins. In the next part of my analysis, I want to group a
number of proteins together into various categories based on common features and
characterize those features that are enriched in one population relative to another, i.e. the
translatome population of proteins as measured by 2D gels relative to the transcriptome
population of transcripts or the genome population of genes. To this end, I set up a
formalism that could be applied universally to all the attributes that I was interested in.
Due to the limitations of the experiments, the translatome, transcriptome, and genome
populations are defined on different sets of genes, and sometimes I want to remove this
“selection bias” by forcing them to be compared on exactly the same set of genes. This is
a key aspect of my formalism as presented in figure 1.
I call an entity like [w, G] a "population", where G is a set describing a particular selection of genes from the genome and w is a vector of weights associated with each element of
this population. In particular, I focus on three main populations here:
(i.) [1, GGen] is the population of genes in the genome, all 6280 genes weighted
once (w = 1).
(ii.) [wmRNA, GmRNA] is the observed population of the transcripts in the transcriptome, i.e. the 6249 genes in the reference expression set weighted by their reference expression value.
(iii.) [wProt, GProt] is the observed cellular population of the proteins in the translatome,
i.e. the 181 genes in the reference abundance set weighted by their reference abundance value.
(The set of genes in the genome GGen is approximately equal to the genes in set GmRNA,
such that I can use both symbols interchangeably.) I can also use this notation to describe
specific experiments -- e.g. [wlacZ, GlacZ] describes the gene set and weights relating to the
Transposon Abundance set.
Furthermore, I define Fj as the value of a feature F in ORF j. For example, F could be
the composition of leucine (a real number) or a binary value (0 or 1) indicating whether
an ORF contains a trans-membrane segment. Given these definitions, the weighted average of feature F in population [w, G] is:
$$\mu(F,[w,G]) = \frac{\sum_{j \in G} w_j F_j}{\sum_{j \in G} w_j}$$
The weighted averages of two populations [w, G] and [v, S] can be compared by simply
looking at their relative difference Δ:
$$\Delta(F,[v,S],[w,G]) = \frac{\mu(F,[v,S]) - \mu(F,[w,G])}{\mu(F,[w,G])}$$
where v and w are weights for the sets of ORFs S and G respectively. I call Δ the
"enrichment" of feature F because it indicates whether F is enriched (if Δ is positive) or
depleted (if Δ is negative) in population [v, S] relative to [w, G].
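The weighted average and the enrichment Δ translate directly into code. The sketch below is illustrative only; the function names, the dictionary representation of weights and features, and the toy gene names in the usage note are assumptions.

```python
def weighted_mean(feature, weights, genes):
    """Weighted average of a feature over the gene set `genes`.

    feature: dict gene -> feature value F_j
    weights: dict gene -> weight w_j
    """
    total_weight = sum(weights[g] for g in genes)
    return sum(weights[g] * feature[g] for g in genes) / total_weight


def enrichment(feature, v, s, w, g):
    """Relative difference of the weighted mean in [v, s] versus [w, g].

    Positive values mean the feature is enriched in [v, s];
    negative values mean it is depleted.
    """
    reference = weighted_mean(feature, w, g)
    return (weighted_mean(feature, v, s) - reference) / reference
```

For example, with a binary feature present in genes "a" and "c" but not "b", genome weights of 1 for every gene, and expression weights of 8, 1 and 1, the weighted means are 0.9 in the transcriptome and 2/3 in the genome, so the enrichment is 0.35, i.e. 35%.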
Usually, the gene set G is defined by the particular experiment, for which the weight w
was measured. However, it is also possible to combine the gene set associated with one
experiment with expression levels from another set. One may want to do this to compute
the enrichment only on the genes common to both populations, for which there are defined values for both w and v, viz: Δ(F, [v, S ∩ G], [w, S ∩ G]). In practice, this is most
relevant for comparing GProt and GmRNA. Since GProt is completely a subset of GmRNA, I
need not explicitly deal with intersections if I calculate all statistics directly over GProt.
One can adjust the weight vectors to take into account different types of averaging. For
instance, when computing the amino acid composition (F = aa) from the amino acid
compositions of individual ORFs Fj = aaj (j ∈ G), I weight by ORF length. In the case
of expression weights, I have:
$$w_j = N_j\, w_{\mathrm{mRNA},j}, \quad j \in G$$
where Nj is a measure of the length of ORF j (such as the number of amino acids).
On the other hand, when computing the average molecular weight per amino acid, I need
to normalize by the number of amino acids per ORF, which is equivalent to choosing the
following weights:
$$w_j = \frac{w_{\mathrm{mRNA},j}}{N_j}, \quad j \in G$$
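A small sketch of the two weight adjustments, using made-up numbers: two hypothetical ORFs of 100 and 300 residues with leucine fractions 0.10 and 0.20 and expression levels 3 and 1. The function names are assumptions made for illustration.

```python
def length_scaled_weights(expression, length):
    """w_j = N_j * w_mRNA,j, used for amino acid composition."""
    return {g: length[g] * expression[g] for g in expression}


def length_normalised_weights(expression, length):
    """w_j = w_mRNA,j / N_j, used for molecular weight per amino acid."""
    return {g: expression[g] / length[g] for g in expression}


def weighted_mean(feature, weights):
    """Weighted average of a feature over all genes that carry a weight."""
    total = sum(weights.values())
    return sum(weights[g] * feature[g] for g in weights) / total


# Toy data (hypothetical ORFs, not taken from the thesis data sets).
length = {"orfA": 100, "orfB": 300}
leucine = {"orfA": 0.10, "orfB": 0.20}
expression = {"orfA": 3.0, "orfB": 1.0}

composition = weighted_mean(leucine, length_scaled_weights(expression, length))
```

Here both ORFs contribute the same number of expressed residues (100 x 3 and 300 x 1), so the length-weighted leucine composition comes out midway between the two fractions, at 0.15. Weighting by expression alone would instead give 0.125, overstating the contribution of the short ORF.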
Results
Comparison of mRNA expression and protein abundance
Figure 2a shows a comparison of my two reference data sets for transcripts and proteins
on a log-log graph. The correlation coefficient is 0.67. A previous study (Futcher et al,
1999), in which the data set 2-DE #2 was investigated, reported a higher correlation
coefficient of 0.76. The disparity may be due to the fact that I am looking at a larger
number of points. Inspection of figure 2a also shows that the correlation for the data values, which were derived from averaging values from both 2-DE sets, is larger. It should
be emphasized that there are many limitations in this analysis as both 2-DE sets represent
relatively homogeneous sets of proteins and there are only a small number of proteins in
each set.
Figure 2b shows the outliers from figure 2a from both above and below the dashed line.
These outliers are representative of those genes for which their mRNA expression differs
significantly from their protein abundance (i.e. either there is little mRNA expression yet
significant protein abundance or significant mRNA expression yet minimal protein abundance). For each, I present a description of its function. With one exception all outliers
are associated with the MIPS category: cellular organization (MIPS category 30).
Enrichment of protein features
Amino acid enrichment
As shown in Figure 3a, I used my methodology to measure the enrichment of individual
amino acids in both the translatome and the transcriptome relative to the genome. The
horizontal axis lists the amino acids while the vertical axis shows their percent enrichment. I list enrichments for both the reference protein abundance and mRNA expression
sets in relation to the genome population. I found that three amino acids -- Valine,
Glycine and Alanine -- were consistently enriched in both transcriptome and translatome
populations.
In Figure 3a I compare different gene sets. In Figure 3b I focus mainly on the variation
in enrichments when all the comparisons are restricted to the set of 181 genes (GProt ∩
GmRNA = GProt) common to all data sets. Thus, the differences between the populations
now only reflect the effects of differential transcription of certain genes and differential
translation of certain transcripts. I find here an enrichment specifically of cysteine in the
translatome in relation to the transcriptome. This enrichment may be the result of the
stability associated with sulfur bridges.
To measure the statistical significance of the results on amino acid enrichment, I have
performed a control analysis on a randomized dataset (Figure 3D). I randomly permuted the expression values of the ORFs 1000 times and then recomputed the enrichments.
This allowed me to compute distributions for the amino acid enrichments and, from
integrating these, one-sided p-values indicating the significance of the observed enrichments.
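A permutation control of this kind can be sketched as follows. This is an illustration, not the original analysis code: the function name, the fixed random seed, and the choice to report the fraction of permutations whose enrichment is at least as large as the observed one (a one-sided test for enrichment) are assumptions. The feature is assumed to have a non-zero unweighted mean.

```python
import random


def permutation_p_value(feature, weights, n_perm=1000, seed=0):
    """One-sided permutation p-value for the enrichment of a feature.

    feature: dict gene -> feature value (e.g. fraction of one amino acid)
    weights: dict gene -> expression value
    Returns (observed enrichment, p-value).
    """
    genes = sorted(weights)
    # Reference population: every gene weighted once.
    genome_mean = sum(feature[g] for g in genes) / len(genes)
    total_weight = sum(weights.values())  # unchanged by permutation

    def enrich(values):
        mu = sum(v * feature[g] for g, v in zip(genes, values)) / total_weight
        return (mu - genome_mean) / genome_mean

    values = [weights[g] for g in genes]
    observed = enrich(values)

    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(values)  # reassign expression values to genes at random
        if enrich(values) >= observed:
            hits += 1
    return observed, hits / n_perm
```

A test for depletion would count permutations with enrichment at or below the observed value instead.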
Biomass enrichment
A corollary to amino acid enrichments is the determination of the average biomass of the
transcriptome and translatome populations. I show this in Figure 3C. I found that the
average molecular weight of a protein in both populations was lower than in
the genome population. These preliminary observations suggest a cell preference to use
less energetically expensive proteins for those that are highly transcribed or translated.
However, I also found that the average molecular weight per amino acid differed much
less between the transcriptome and the translatome on the one hand, and the genome on
the other hand (though it was still slightly less). This finding indicates that lower
molecular weights in the translatome and transcriptome populations relative to the
genome population are predominantly due to greater expression of shorter proteins rather
than the incorporation of smaller amino acids.
Secondary structure composition
I also used my methodology to study the enrichment of secondary-structural features.
Secondary structural annotation was derived from structure prediction applied uniformly
to all the ORFs in the yeast genome as described in Table 1. As shown in Figure 4A, all
three populations – genome, transcriptome, and translatome – had a fairly similar
composition of secondary structures -- sheets, helices, and coils. The differences between populations were marginal and based only on the small subset of genes. They do,
though, point to a possible trend of depletion of random coils relative to alpha helices and
beta sheets in the transcriptome and translatome.
I also found that transmembrane proteins were significantly depleted in the transcriptome
(see website). To identify transmembrane (TM) proteins, I used the GES hydrophobicity
scale as described previously (see caption to Table 1) (Gerstein, 1998). These results are
consistent with a previous analysis (Jansen and Gerstein, 2000). This analysis could not
be extended to the translatome because the 181 genes in the protein abundance data set
(GProt) do not contain any membrane proteins, which are difficult to detect in gel
electrophoresis (Molloy, 2000).
Subcellular localization
A generalization of the transmembrane protein analysis is subcellular localization. I
looked into the enrichment of proteins associated with the various subcellular compartments. This is shown in Figure 4C. For clarity, I divided the cell into five distinct
subcellular compartments, as described in Table 1. I found that, in comparison to the
genome, both the transcriptome and translatome are enriched in cytoplasmic proteins.
This is true whether I make my comparisons in relation to the relatively large reference
mRNA expression set or the smaller reference protein abundance set. As figure 4C shows,
the 2D gel experiments are clearly biased towards proteins from the cytoplasm. However,
in the biased subset GProt, transcription and translation lead to an even higher fraction of
cytoplasmic proteins in the translatome.
Functional categories
Finally, I compared the enrichment of various functional categories in both the translatome and the transcriptome (see Figure 4B). This gives a broad yet informative view
of the cell as a whole. As described in Table 1, I used the top-level of the MIPS scheme
for the functional category definitions (Mewes et al, 2000). I found broad differences
between the various populations, with some of the functional categories showing
strikingly high enrichments.
In particular, I found enrichments of the “cellular organization,” “protein synthesis,” and “energy production” categories.
Application to semi-quantitative protein abundance data sets
I also tried to extend my methodology to cope with the semi-quantitative transposon set.
The qualitative nature of the set makes it impossible to compute statistical relationships
between mRNA and protein populations as I did for both the 2D gel sets. I briefly
summarize my approach.
Many ORFs in the Transposon dataset had multiple, sometimes inconsistent, measurements ranging from one (background) to four (strong) for various different transposon
insertions. I took only those 450 ORFs that consistently yielded either background or
strong. I then used this set in a binary fashion, interpreting an ORF as either on or off. I
show the enrichments of amino acids computed from this filtered Transposon Abundance
Set in Figure 3A. Overall, the enrichments from this set seemed to be attenuated in
comparison to either the mRNA expression or protein abundance data.
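The filtering step described above can be sketched as follows. This is a minimal illustration, assuming that "consistently" means every insertion for an ORF gave the same reading (all background or all strong); the function name and the example ORF labels are hypothetical and not taken from the original analysis.

```python
from collections import defaultdict

BACKGROUND, STRONG = 1, 4  # ends of the one-to-four scale described in the text


def binarize_consistent(measurements):
    """Keep ORFs whose insertions all read background or all read strong.

    measurements: iterable of (orf, level) pairs with level in 1..4.
    Returns {orf: True (on) or False (off)}; inconsistent ORFs are dropped.
    """
    levels = defaultdict(set)
    for orf, level in measurements:
        levels[orf].add(level)

    calls = {}
    for orf, seen in levels.items():
        if seen == {STRONG}:
            calls[orf] = True
        elif seen == {BACKGROUND}:
            calls[orf] = False
    return calls


# Hypothetical readings, for illustration only.
example = [
    ("ORF_A", 4), ("ORF_A", 4),  # consistently strong -> on
    ("ORF_B", 1),                # consistently background -> off
    ("ORF_C", 1), ("ORF_C", 3),  # inconsistent -> dropped
    ("ORF_D", 2),                # intermediate only -> dropped
]
calls = binarize_consistent(example)
```

The resulting on/off calls can then serve as binary weights when computing enrichments for the filtered set.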
Discussion and conclusion
I developed a methodology for integrating many different types of gene expression and
protein abundance data into a common framework and applied this to a preliminary analysis
of yeast. In particular, I developed a procedure for scaling and merging different mRNA
and protein sets together and then computing the enrichment of various proteomic features in the population of transcripts and proteins implied by these scaled sets. I showed
that by analyzing broad categories instead of individual noisy data points, I could find
logical trends in the underlying data.
The comparison of the translatome with the transcriptome and the genome helps to better
understand cellular processes.
For this purpose, I compiled two reference sets, the
mRNA reference expression set integrated from various Gene Chip and SAGE experiments, and the protein reference abundance set, collected from published 2D gel
electrophoresis experiments. My reference sets proved useful for my analysis of the
composition and enrichments of protein features in the various stages of gene expression.
I found many similar trends for general protein categories between these two sets.
To compare the translatome and the transcriptome, I devised a formalism to measure
enrichments of data sets. With this formalism I measured the enrichments of amino acids,
protein function and secondary structures in the vegetative yeast cell. Other comparisons
included looking at average biomasses, looking into subcellular localizations and a direct
comparison of mRNA expression vs. protein abundance.
Overall transcriptome and translatome similarity: outliers against trend
The overall similarity I find between transcriptome and translatome contrasts somewhat
with the weak correlation between mRNA expression and protein abundance as shown in
Figure 2 and reported previously (Futcher et al., 1999, Gygi et al., 1999). This reflects the
way my system of overall categories collects many proteins into robust averages. It
shows that variation between proteins is not systematic with respect to the categories.
For example, individual transcription factors might have higher or lower protein abundance than one expects from their mRNA expression, but the category “transcription factors” as a whole has a similar representation in the transcriptome and translatome.
I used the reference data sets to compare mRNA expression and protein abundance for
the 181 genes shared between the two sets -- the largest such comparison. While I found
an overall correlation between the two data sets, indicating that mRNA expression may
be closely related to protein abundance, I found some genes that bucked the trend. Possible explanations for the aberrant behavior of some of these outliers are presented.
Those outliers that have higher levels of protein abundance than expected from their
mRNA expression are dominated by alcohol dehydrogenases and glyceraldehyde-3-phosphate (G3P) dehydrogenases. It is known that G3P dehydrogenase forms a bienzyme complex with alcohol dehydrogenase; thus, the similar abundance pattern of these
two enzymes can be rationalized (Batke et al., 1992). Alcohol dehydrogenase is also a
stress-induced protein in many organisms (An et al., 1991, Matton et al., 1990, Millar et al.,
1994), induced into action when the cell undergoes trauma, and thus perhaps translated to a
higher degree prophylactically (although the expression pattern of another stress-induced
protein (HSP70) shows that this is not always the case). Translation-related proteins are
more prominent among the outliers with lower protein abundance than expected from mRNA
expression.
While it is known that multiple features of an individual mRNA influence its expression
and regulation, it is presently not clearly understood how. There are many non-coding
regions in each mRNA species that are responsible for this regulation. These include upstream AUG codons (uAUGs), both 3’ and 5’ untranslated regions, upstream open reading frames (uORFs) and the overall secondary structure of mRNA. Presently it is unclear
how these act to exert their control (Morris and Geballe, 2000).
One might conceive of using "outliers" with significantly different transcriptional and
translational behavior to find consensus regulatory sequences. One possible method
would involve using predicted mRNA structures (Jaeger et al., 1990, Zuker, 2000) to find
consensus structural elements in these outliers. In particular, it might be worthwhile to
investigate the secondary mRNA structure, to which the yeast translational machinery is
known to be sensitive (McCarthy, 1998).
Overall transcriptome and translatome similarity: consistent enrichments
I found the enrichments relative to the genome to be consistent between the translatome
and the transcriptome. In particular, I found that the amino acids Valine, Glycine and
Alanine -- all relatively small amino acids -- are significantly enriched in both populations in comparison to the genome population. These results coincide with the previous
conclusion that those amino acids are also the most highly abundant amino acids in
soluble proteins (Nauchitel and Somorjai, 1994). Conversely, I found that Cysteine,
Serine, Asparagine and Arginine were markedly depleted. My transcriptome enrichments
using the reference set were similar to results obtained previously using individual mRNA
expression data sets (Jansen and Gerstein, 2000). In addition, I found that the translatome and the transcriptome both have lower molecular weight proteins in relation to the
genome.
Furthermore, I found, in comparison to the genome population, that the translatome and
transcriptome were depleted in random coils, a structurally less complex and,
as such, less stable protein structure, relative to alpha helices and beta sheets. These results are
from a small and potentially biased subset of proteins and so, in and of themselves, may not
be informative. Yet it is possible that they point to a logical trend that may result from
the cellular preference for stability and structural rigidity through more regular secondary
structures (helices and sheets).
In relation to functional categories, I found three trends that were particularly notable: (i)
the “cellular organization,” “protein synthesis,” and “energy production” categories were
increasingly enriched as I moved from genome to transcriptome to translatome. This
finding was true for either of the gene sets and reflects the great abundance of structural
proteins, such as actin, and, in the case of the transcriptome, ribosomal proteins. (In the
protein abundance set GProt ribosomal proteins are rather underrepresented.) (ii) Proteins
with “unclassified function” are significantly depleted in the transcriptome and the
translatome in relation to the genome, perhaps reflecting a bias against studying them.
(iii) Proteins in the “transcription” and “cell growth, cell division, and DNA synthesis”
categories were consistently depleted in the transcriptome and translatome population
relative to the genome. This perhaps reflects the fact that many of these proteins, such as
transcription factors, act as "switches". While many copies are needed in the genome to
give different specificities, only small quantities of the protein are necessary to activate or
deactivate a process.
These results concur with previous calculations (Jansen and
Gerstein, 2000) wherein I found the transcriptome is enriched specifically with proteins
involved in protein synthesis and energy.
As opposed to the genome population, where there is a wide distribution of products in
all cellular compartments, mainly cytoplasmic proteins dominate the translatome and
transcriptome. For instance, while the genome data set has the largest allocation of genes
going to the nucleus, the bulk of the translatome and transcriptome populations are localized to the cytoplasm. Part of this effect may also be due to the gel-electrophoresis
experimental process that favors the higher expressing cytoplasmic proteins, although a
similar effect can clearly be observed in the transcriptome data set, which does not have
this experimental bias. This may be related to the enrichment of functional categories
that are connected to cytoplasmic proteins, such as "protein synthesis".
Limitations given the small size of the protein abundance data
Even with the extended coverage made possible by merging many datasets together into
my two reference sets, I still found that the largest complication in my analysis was the
limited amount of data. This was, obviously, most applicable to the protein abundance
measurements. In addition to giving me fewer data points for my statistics, the small
number of protein abundance measurements potentially biased my statistical results towards certain protein families. The 181 proteins in GProt are certainly not a random selection from the possible 6280 in yeast. They are, rather, skewed towards well-studied proteins that are highly expressed. My methodology attempts to control for this gene-selection bias through my enrichment formalism, which allows one to gauge
various aspects of the bias rather precisely.
My results will certainly be more complete and definitive when larger proteomics datasets become available, which I anticipate will happen soon (Smith, 2000). However, I believe that the essential formalism and approach that I develop will remain quite
relevant for all future datasets.
Although the translatome data I used in my study is small in comparison to the information on the genome and transcriptome, many protein features in both the translatome and
the transcriptome are dominated by the very highly expressed proteins (to which the 2-DE experiments are biased). Under this circumstance, it is often sufficient to look at this
smaller number of dominating proteins to approximately characterize the whole
population. This is similar in spirit to the development of the codon adaptation index for
yeast (Sharp and Li, 1987). While based on only 24 highly expressed proteins, it has
proven to be robust in predicting expression levels for the entire genome. In contrast, the
experimental bias in the selection of proteins with particular biophysical properties
should be of more concern.
Future directions
Besides the recapitulation of my computations with the release of new data, I also hope to
expand this analysis to other organisms. While presently I have limited my study to yeast
gene expression, there are other potential model organisms for which there are expression
experiments. Moreover, I have also limited myself to Gene Chip experiments, but it
may be worthwhile to analyze cDNA microarray data sets (Cho et al., 1998, DeRisi et al.,
1997, Winzeler et al., 1999). I can use these sizeable microarray data sets to study
changes in protein features over time.
Acknowledgments
MG thanks the Keck foundation for support.
Figures and Tables
Table 1 Data sets

Data set                  Description                                             Size [ORFs]    Reference

mRNA expression
Young                     Gene chip profiles of yeast cells with mutations that   5455           Holstege et al. (1998)
                          affect transcription
Church                    Gene chip profiles of yeast cells under four            6263           Roth et al. (1998)
                          different conditions
Samson                    Comparing gene chip profiles for yeast cells            6090           Jelinsky et al. (1998)
                          subjected to alkylating agent
SAGE                      Yeast cells during vegetative growth                    3778           Velculescu et al. (1997)
Reference expression      Scaling and integrating the mRNA expression set         6249           -
                          into one data source

Protein abundance
2-DE #1                   Measurement of yeast protein abundance by two-          156            Gygi et al. (1999)
                          dimensional (2D) gel electrophoresis and mass
                          spectrometry
2-DE #2                   Similar to 2-DE set #1                                  71             Futcher et al. (1999)
Transposon                Large-scale fusions of yeast genes with lacZ by         1410           Ross-Macdonald et al. (1999)
                          transposon insertion
Reference abundance       Scaling and integrating the 2-DE data sets into         181            -
                          one data source

Annotation
Localization              Subcellular localizations of yeast proteins             2133 (6280)    Drawid et al. (2000)
Transmembrane segments    Predicted transmembrane and soluble proteins in         2710 (6280)    Gerstein (1998)
                          yeast
MIPS functions            Functional categories for yeast ORFs                    3519 (6194)    Mewes et al. (2000)
GOR secondary structure   Predicted secondary structure for yeast ORFs            6280           Gerstein (1998)

Table 1, data sets: This table provides an overview of the data sets used in my analysis.
The table is divided into three sections. The first section at the top lists different mRNA
expression sets. The second section in the middle shows the protein abundance data sets
used. The third section at the bottom contains different annotations of protein features.
The column "Data set" lists a shorthand reference to each data set used throughout this
paper. The next columns contain a brief description of the data sets, the number of ORFs
contained in each of them, the literature reference and the URL. In contrast to the other
data I investigated, the reference expression and abundance data sets have been
calculated for the purpose of my analysis (see text).
Some further information on the genome annotations:
Localization: Protein localization information from YPD, MIPS and SwissProt were
merged, filtered and standardized (Bairoch and Apweiler, 2000, Costanzo et al.,
2000, Mewes et al, 2000) into five simplified compartments -- cytoplasm, nucleus,
membrane, extracellular (including proteins in ER and Golgi), and mitochondrial -- according to the protocol in Drawid et al. (Drawid et al, 2000).
This yielded a
standardized annotation of protein subcellular localization for 2133 out of 6280 ORFs.
Transmembrane segments: In 2710 out of 6280 yeast ORFs transmembrane segments are
predicted to occur, ranging from low to high confidence (732 ORFs). The transmembrane prediction was performed as follows: The values from the GES scale for amino acids in
a window of size 20 (the typical size of a transmembrane helix) were averaged and then
compared against a cutoff of –1 kcal/mole. A value under this cutoff was taken to indicate the existence of a transmembrane helix. Initial hydrophobic stretches corresponding
to signal sequences for membrane insertion were excluded. (These have the pattern of a
charged residue within the first seven, followed by a stretch of 14 with an average
hydrophobicity under the cutoff.) These parameters have been used, tested, and refined
on surveys of membrane proteins in genomes. "Sure" membrane proteins had at least two
TM-segments with an average hydrophobicity less than –2 kcal/mole (Gerstein et al.,
2000, Rost et al., 1995, Santoni et al., 2000, Senes et al., 2000).
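The window-averaging rule described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the hydrophobicity scale is passed in as a dictionary, and the three values in TOY_SCALE are placeholders chosen for the example, not the published GES table. The greedy, non-overlapping way of counting segments is likewise an assumption, and the exclusion of signal sequences is omitted.

```python
def window_averages(seq, scale, window=20):
    """Average scale value for every window of `window` residues in `seq`."""
    values = [scale[aa] for aa in seq]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def has_tm_helix(seq, scale, window=20, cutoff=-1.0):
    """True if any window average falls under the cutoff (kcal/mole)."""
    return any(avg < cutoff for avg in window_averages(seq, scale, window))


def count_tm_segments(seq, scale, window=20, cutoff=-1.0):
    """Count non-overlapping windows under the cutoff (greedy scan; an assumption)."""
    averages = window_averages(seq, scale, window)
    count, i = 0, 0
    while i < len(averages):
        if averages[i] < cutoff:
            count += 1
            i += window  # skip past this segment before looking for the next
        else:
            i += 1
    return count


# Placeholder values for illustration only; substitute the full published scale.
TOY_SCALE = {"L": -2.8, "K": 8.8, "G": -1.0}
toy_seq = "K" * 5 + "L" * 20 + "K" * 5 + "L" * 20 + "K" * 5

predicted_tm = has_tm_helix(toy_seq, TOY_SCALE)                 # cutoff of -1
sure_tm = count_tm_segments(toy_seq, TOY_SCALE, cutoff=-2.0) >= 2  # "sure" rule
```

Sequences shorter than the window produce no averages and are therefore never called transmembrane in this sketch.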
Functions: MIPS functional categories have been assigned to 3519 out of 6194 ORFs.
(The remainder are assigned to category '98' or '99', which corresponds to unclassified
function.)
Figure 1 Schematic overview of the analysis. On the left side I outline the terms I use to
describe the process of gene expression. The coding section of the genome is transcribed
into a population of mRNA transcripts called the "transcriptome". The transcripts in turn
are translated to a population of proteins; I use the term "translatome" for this protein
population rather than the alternative "proteome" because the latter term may be
confounded with the protein complement of the genome (which is not necessarily
associated with a quantitative abundance level).
The matrix in the middle schematically shows an analysis of the three stages of expression. In general, I define a protein "population" as a set of genes associated with a corresponding
number of expression or abundance levels ("weights"). In the matrix each row
represents a weight and each column a gene set. In particular, I differentiate between the
mRNA reference expression set (GmRNA = GGen), which essentially covers the complete
genome, and the reference protein abundance set (GProt) which contains the proteins in
data sets 2-DE #1 and 2-DE #2 (see table 1) because the protein abundance set is a
significantly smaller subset of the genome. By definition, this subset contains only proteins that can be identified by 2-D gel electrophoresis and is therefore biased in this
sense. The enrichment figures throughout this paper, through a comparison of the right
and left sides of this figure, show the results of the experimental biases of 2D gels on the
data set.
Each pie chart represents a composition of a particular protein feature F (for instance, an
amino acid composition) in a population. I can further
look at the "enrichment" of this feature in one population relative to another (see
section "Methods" for an explanation of the formalism).
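As a rough illustration of these two quantities, the sketch below treats the composition of a feature as a weighted average over a gene set and the enrichment as the relative change in that average between two populations. The exact definitions are given in the Methods section and are not reproduced here, so the relative-difference form, the function names, and the example numbers are all assumptions made for illustration.

```python
def composition(feature, weights):
    """Weighted mean of a per-gene feature value over a population.

    feature: {gene: value}, e.g. the fraction of one amino acid in each ORF.
    weights: {gene: expression or abundance level}; all ones for the genome.
    """
    genes = [g for g in weights if g in feature]
    total = sum(weights[g] for g in genes)
    return sum(weights[g] * feature[g] for g in genes) / total


def enrichment(feature, weights_from, weights_to):
    """Relative change in composition between two populations (same gene set)."""
    shared = set(weights_from) & set(weights_to)
    w_from = {g: weights_from[g] for g in shared}
    w_to = {g: weights_to[g] for g in shared}
    base = composition(feature, w_from)
    return (composition(feature, w_to) - base) / base


# Hypothetical three-gene example.
alanine_fraction = {"g1": 0.10, "g2": 0.05, "g3": 0.05}
genome = {"g1": 1, "g2": 1, "g3": 1}         # every gene counted once
transcriptome = {"g1": 8, "g2": 1, "g3": 1}  # mRNA copies per cell

e = enrichment(alanine_fraction, genome, transcriptome)
```

In this toy case the highly expressed gene is also the alanine-rich one, so the transcriptome comes out enriched in alanine relative to the genome (a positive value of `e`).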
For simplification, I neglect the effects of post-transcriptional and post-translational
modifications that might alter the features of proteins (they affect the expression levels
but this is largely accounted for by the measurements). In this study I analyze protein
features as they are represented in the genome.
Figure 2 mRNA expression levels vs. protein abundance levels. Part a of this figure
shows the reference protein abundance levels plotted against the mRNA reference
expression levels on a log-log scale; this plot is similar to the one reported earlier by Futcher et
al. (Futcher et al, 1999). The trend line is described by the equation y = 5.20 x^0.61,
where y represents the protein abundance level (in units of 10^3 copies/cell) and x the
mRNA expression level (in units of copies/cell). The dashed lines indicate a distance of
1.85 standard deviations (in the log scale) from the trend line. The outliers beyond the
dashed lines are listed in Part b. For each of these outlier ORFs I show a description of
their function and their respective MIPS categories (the numbers are defined in Figure
4C). With one exception, all outliers are associated with cellular organization (MIPS
category 30). Those outliers that have a high level of protein abundance relative to the
expected amount of mRNA expression are dominated by the alcohol and G3P dehydrogenases. Translation-related proteins are prominent in the group of those proteins with
low protein abundance in relation to mRNA expression.
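A trend line of this form and the corresponding outlier band can be sketched as below. This is a minimal illustration, assuming an ordinary least-squares fit in log10-log10 space and a residual standard deviation computed with n - 2 degrees of freedom; the caption does not state how the original fit or the standard deviation were obtained, and the function names are hypothetical.

```python
import math


def fit_power_law(x, y):
    """Least-squares line in log10-log10 space.

    Returns (a, b, sd) for y = a * x**b, where sd is the standard
    deviation of the log-scale residuals (n - 2 degrees of freedom).
    """
    lx = [math.log10(v) for v in x]
    ly = [math.log10(v) for v in y]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    b = sxy / sxx
    log_a = my - b * mx
    residuals = [v - (log_a + b * u) for u, v in zip(lx, ly)]
    sd = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return 10 ** log_a, b, sd


def find_outliers(names, x, y, n_sd=1.85):
    """Genes whose log-scale residual exceeds n_sd standard deviations."""
    a, b, sd = fit_power_law(x, y)
    flagged = []
    for name, xi, yi in zip(names, x, y):
        residual = math.log10(yi) - math.log10(a * xi ** b)
        if abs(residual) > n_sd * sd:
            flagged.append((name, residual))
    return flagged
```

A positive residual marks a gene with more protein than the trend predicts from its mRNA level, and a negative residual marks the opposite case.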
Figure 3a-c Amino acid and biomass enrichment. Part a shows the amino acid
enrichments between different populations as indicated by the legend to the right of the
plot (the legend is ordered in the same way as the schematic illustration in Figure 1). The
bars indicate the enrichment of the transcriptome relative to the genome, whereas the
circles indicate the enrichment of the translatome relative to the genome. In addition, I
also show the enrichment for protein abundance from the Transposon Abundance Set,
represented by the circles with the line through them. It can be seen that the enrichments
for the transcriptome and the translatome follow a similar trend despite their differences.
In general, the amino acid enrichments seem to be more strongly emphasized in the
translatome. In contrast, the enrichments for the Transposon Abundance Set seem to be
very small. This may be due to the fact that the ORFs fused with lacZ produce different
gene products than the original genes. In both the translatome and the transcriptome the
amino acids Valine, Glycine and Alanine are strongly enriched. On the other end, the
amino acids Asparagine, Cysteine and Serine are strongly depleted.
Part b shows a different view of amino acid enrichment from that contained in part A,
now focusing on changes, and thus restricting the comparison to the genes common to all
the datasets. The graph is ordered according to the enrichment from transcriptome to
translatome (black squares). I focus here only on the changes for the abundance gene set
(GProt) to exclude the effects that arise from looking at different subsets. In this view the
enrichments from genome to transcriptome (white squares) and from genome to translatome (white diamonds) look more similar than do the analogous sets in Part A. To make
comparison with Part A easier I again show the enrichment from genome to the
transcriptome for the complete gene set (GGen, shown in bars).
Part c shows biomass enrichment. The left panel depicts the average molecular weight
per ORF (in units of kDa) and the right panel, the average molecular weight per amino
acid (in units of Daltons) in each of the three stages of gene expression. The numbers
inside the circles indicate the average molecular weights. The values next to the arrows
indicate the enrichments in biomass between different populations.
Both the circle
diameters and the arrow widths are functions of the corresponding values (the hollow arrow indicates a positive value). It is very clear that the average molecular weight per
ORF is much lower in the translatome (by 20% or 15%) and transcriptome (by 29%) than
in the genome. This relative depletion of biomass mainly takes place as a result of
transcription; the effect of translation is less clear, depending on the populations compared. On the other hand, the depletion in the average molecular weight per amino acid
(-3.3 % from genome to translatome) is an order of magnitude smaller than in the average
weight per ORF. This shows that the yeast cell favors the expression of shorter ORFs
over longer ones, and agrees with earlier observations that there is a negative correlation
between maximum ORF length and mRNA expression (Jansen and Gerstein, 2000); it
seems that this effect mainly takes place during transcription rather than translation.
Figure 3d Statistical significance. Part d shows that the amino acid enrichments are
statistically significant.
I have assessed significance by randomly permuting the
expression levels among the genes and then recomputing the amino acid enrichments.
This procedure can be repeated and used to generate distributions of random enrichments
that can then be compared against the observed enrichments. In the plot the gray bars
represent the observed enrichments already shown in Figure 3a. On top of the gray bars I
show standard boxplots of enrichment distributions based on 20000 random permutations.
(The middle line represents the distribution median. The upper and lower sides of the
box coincide with the upper and lower quartiles. Outliers are shown as dots and defined
as data points that are outside the range of the whiskers, the length of which is 1.5 times the
interquartile distance.) Based on the random distributions, I can compute one-sided P-values for the observed enrichments. Amino acids that are significant beyond α = 10^-3
are shown in bold font (the only exception is Glutamine (Q), which has a P-value of
1.25·10^-3). Note that α = 10^-3 corresponds to α' = 5·10^-5 = 10^-3/20 for each individual
amino acid (Bonferroni correction) since I independently perform the same statistical test
20 times.
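The permutation procedure can be sketched along the following lines. This is an illustrative reconstruction, not the original implementation: the enrichment is assumed to be the relative change of a weighted mean against the unweighted mean, the P-value uses the common (hits + 1) / (permutations + 1) estimate, and all function names are hypothetical.

```python
import random


def enrichment(feature, weights):
    """Relative change of the weighted mean of `feature` versus its plain mean."""
    base = sum(feature) / len(feature)
    weighted = sum(f * w for f, w in zip(feature, weights)) / sum(weights)
    return (weighted - base) / base


def permutation_p_value(feature, weights, n_perm=20000, seed=0):
    """One-sided P-value from shuffling expression levels among the genes.

    Counts permutations whose enrichment is at least as extreme as the
    observed one, in the direction of the observed sign.
    """
    rng = random.Random(seed)
    observed = enrichment(feature, weights)
    shuffled = list(weights)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        e = enrichment(feature, shuffled)
        if (observed >= 0 and e >= observed) or (observed < 0 and e <= observed):
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


def bonferroni_threshold(alpha=1e-3, n_tests=20):
    """Per-test threshold when the same test is run for each of 20 amino acids."""
    return alpha / n_tests
```

With 20000 permutations the smallest P-value this estimate can return is about 5·10^-5, which is on the order of the per-test threshold quoted above.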
Figure 4 Breakdown of the transcriptome and translatome in terms of broad categories
relating to structure, localization, and function. All of the subfigures are analogous to the
schematic illustration in Figure 1.
Part a represents the composition of secondary structure in the different populations. In
general, the secondary structure compositions appear to be relatively stable across the
different populations. The most notable change from genome to translatome is perhaps
the depletion of coils -- that is, relatively unordered structures compared to the more
structured helices and sheets -- by about 4%.
Part b represents the distribution of subcellular localizations associated with proteins in
the various populations. I used standardized localizations developed earlier (Drawid and
Gerstein, 2000), which, in turn, were derived from the MIPS, YPD, and Swiss-Prot databases (Bairoch and Apweiler, 2000, Costanzo et al, 2000, Mewes et al, 2000).
The
subcellular localization has been experimentally determined for less than half of the yeast
proteins, so my analysis applies only to this subset. The most notable difference between
genome, transcriptome and translatome is the strong enrichment of cytoplasmic proteins.
This is in agreement with my previous observations (Drawid et al., 2000). This also explains to some degree the observations for the functional classes in part C. For example,
the functional group "energy" is mostly dominated by the highly expressed glycolytic
proteins found in the cytoplasm. The depletion of the functional group "transcription"
makes sense in the light of the strong depletion for nuclear proteins. It has been argued before (Drawid et al, 2000) that the number of proteins in a particular subcellular compartment may be roughly related to the size of the compartment. For instance, membrane
proteins occupy the relatively small "two-dimensional" space in lipid bi-layers. I also
performed a separate, independent calculation for a more comprehensive list of
transmembrane segments, which were predicted computationally (see caption of Table 1).
This largely confirms the result. (Data not shown.)
Part c shows the division of ORFs into different functional categories (according to the
MIPS classification) in the various populations. Only the largest functional categories of
the top level of the MIPS classification are shown. The group "Other" contains the
smaller top-level categories lumped together. This “Other” group is different from the
group "Unclassified," which contains genes without any functional description. One
complication is that many genes have multiple functional classifications such that they
may be counted in more than one category (this explains why the group "Unclassified"
has only a size of 28% for the genome population although the number of unclassified
genes in the yeast genome is much larger).
Comparing the genome with the transcriptome and translatome compositions in general,
it can be observed that if a functional class is enriched in the transcriptome relative to the
genome, it is also enriched in the translatome. Specifically, the functional classes
"metabolism", "energy", "protein synthesis" and "cellular organization" are enriched in
transcriptome and translatome. On the other hand, the classes "cell growth, cell division
and DNA synthesis" and "transcription" are depleted; in particular, this is the case for the
"unclassified" group, indicating that a lot of the current biochemical knowledge is clearly
skewed towards more highly expressed genes. Some of the differences between the complete gene set (GGen) and the protein abundance set (GProt) are obviously a result of the
bias of electrophoresis experiments. In addition, the ribosomal proteins that make up an
important highly expressed part of the class “protein synthesis” are underrepresented in
the protein abundance set (GProt).
References
1. An, H., Scopes, R. K., Rodriguez, M., Keshav, K. F. & Ingram, L. O. Gel
   electrophoretic analysis of Zymomonas mobilis glycolytic and fermentative
   enzymes: identification of alcohol dehydrogenase II as a stress protein. J
   Bacteriol 173, 5975-82 (1991).
2. Anderson, L. & Seilhamer, J. A comparison of selected mRNA and protein
   abundances in human liver. Electrophoresis 18, 533-7 (1997).
3. Bairoch, A. Serendipity in bioinformatics, the tribulations of a Swiss
   bioinformatician through exciting times! Bioinformatics 16, 48-64 (2000).
4. Bairoch, A. & Apweiler, R. The SWISS-PROT protein sequence database and its
   supplement TrEMBL in 2000. Nucleic Acids Res 28, 45-8 (2000).
5. Bassett, D. E., Jr. et al. Exploiting the complete yeast genome sequence. Curr
   Opin Genet Dev 6, 763-6 (1996).
6. Batke, J., Benito, V. A. & Tompa, P. A possible in vivo mechanism of
   intermediate transfer by glycolytic enzyme complexes: steady state fluorescence
   anisotropy analysis of an enzyme complex formation. Arch Biochem Biophys 296,
   654-9 (1992).
7. Cambillau, C. & Claverie, J. M. Structural and Genomic Correlates of
   Hyperthermostability. J Biol Chem 275, 32383-32386 (2000).
8. Carlson, M. The awesome power of yeast biochemical genomics. Trends in
   Genetics 16, 49-51 (2000).
9. Cavalcoli, J. D., VanBogelen, R. A., Andrews, P. C. & Moldover, B. Unique
   identification of proteins from small genome organisms: theoretical feasibility of
   high throughput proteome analysis. Electrophoresis 18, 2703-8 (1997).
10. Cho, R. J. et al. A genome-wide transcriptional analysis of the mitotic cell cycle.
    Mol Cell 2, 65-73 (1998).
11. Claverie, J. M. Computational methods for the identification of differential and
    coordinated gene expression. Hum Mol Genet 8, 1821-32 (1999).
12. Corthals, G., Wasinger, V. C., Hochstrasser, D. F. & Sanchez, J. C. The dynamic range of
    protein expression: a challenge for proteomic research. Electrophoresis 21, 1104-1115 (2000).
13. Costanzo, M. C. et al. The yeast proteome database (YPD) and Caenorhabditis
    elegans proteome database (WormPD): comprehensive resources for the
    organization and comparison of model organism protein information. Nucleic
    Acids Res 28, 73-6 (2000).
14. Das, R. & Gerstein, M. The Stability of Thermophilic Proteins: A Study Based on
    Comprehensive Genome Comparison. Functional & Integrative Genomics 1, 33-45 (2000).
15. Day, D. A. & Tuite, M. F. Post-transcriptional gene regulatory mechanisms in
    eukaryotes: an overview. J Endocrinol 157, 361-71 (1998).
16. DeRisi, J. L., Iyer, V. R. & Brown, P. O. Exploring the metabolic and genetic
    control of gene expression on a genomic scale. Science 278, 680-6 (1997).
17.
Doolittle, W. F. The nature of the universal ancestor and the evolution of the
proteome. Curr Opin Struct Biol 10, 355-8 (2000).
18.
Drawid, A. & Gerstein, M. A Bayesian system integrating expression data with
sequence patterns for localizing proteins: comprehensive application to the yeast
genome. J Mol Biol 301, 1059-75. (2000).
19.
Drawid, A., Jansen, R. & Gerstein, M. Gene Expression Levels are Correlated
with Protein Subcellular Localization (in Press). Trends in Genetics (2000).
20.
Drawid, A., Jansen, R. & Gerstein, M. Genome-wide analysis relating expression
level with protein subcellular localization. Trends Genet 16, 426-30 (2000).
21.
Einarson, M. & Golemis, E. Encroaching genomics: adapting large-scale science
to small academic laboratories. Physiological Genomics 2, 85-92 (2000).
22.
Eisen, M. B. & Brown, P. O. DNA arrays for analysis of gene expression.
Methods Enzymol 303, 179-205 (1999).
23.
Epstein, C. & Butow, R. Microarray technology - enhanced versatility, persistent
challenge. Current Opinions Biotechnology 11, 36-41 (2000).
24.
Ferea, T. & Brown, P. Observing the living genome. Current Opinions Genetic
and Development 9, 715-722 (1999).
25.
Fey, S. J. & Larsen, P. M. 2D or not 2D. Two-dimensional gel electrophoresis.
Curr Opin Chem Biol 5, 26-33. (2001).
26.
Fey, S. J. et al. Proteome analysis of Saccharomyces cerevisiae: a methodological
outline. Electrophoresis 18, 1361-72 (1997).
27.
Frishman, D. & Mewes, H. W. Protein structural classes in five complete
genomes [letter]. Nat Struct Biol 4, 626-8 (1997).
28.
Frishman, D. & Mewes, H. W. Genome-based structural biology. Prog Biophys
Mol Biol 72, 1-17 (1999).
29.
Futcher, B., Latter, G. I., Monardo, P., McLaughlin, C. S. & Garrels, J. I. A
sampling of the yeast proteome. Mol Cell Biol 19, 7357-68 (1999).
30.
Gaasterland, T. Archaeal genomics. Curr Opin Microbiol 2, 542-7 (1999).
31.
Garrels, J. I. et al. Proteome studies of Saccharomyces cerevisiae: identification
and characterization of abundant proteins. Electrophoresis 18, 1347-60 (1997).
32.
Gerstein, M. A structural census of genomes: comparing bacterial, eukaryotic, and
archaeal genomes in terms of protein structure. J Mol Biol 274, 562-76 (1997).
33.
Gerstein, M. How representative are the known structures of the proteins in a
complete genome? A comprehensive structural census. Fold Des 3, 497-512
(1998).
34.
Gerstein, M. Measurement of the effectiveness of transitive sequence comparison,
through a third 'intermediate' sequence. Bioinformatics 14, 707-14 (1998).
35.
Gerstein, M. Patterns of Protein-Fold Usage in Eight Microbial Genomes: A
Comprehensive Structural Census. Proteins 33, 518-534 (1998).
36.
Gerstein, M. & Jansen, R. The current excitement in bioinformatics, analysis of
whole-genome expression data: How does it relate to protein structure and
function (In press). Current Opinions in Structural Biology (2000).
37.
Gerstein, M., Lin, J. & Hegyi, H. Protein folds in the worm genome. Pac Symp
Biocomput, 30-41 (2000).
38.
Goffeau, A., Barrell BG, Bussey H, Davis RW, Dujon B, Feldmann H, Galibert F,
Hoheisel JD, Jacq C, Johnston M, Louis EJ, Mewes HW, Murakami Y, Philippsen
P, Tettelin H, Oliver SG. Life with 6000 genes. Science 274 (1996).
39.
Gygi, S. P., Corthals, G. L., Zhang, Y., Rochon, Y. & Aebersold, R. Evaluation of
two-dimensional gel electrophoresis-based proteome analysis technology. Proc
Natl Acad Sci U S A 97, 9390-5. (2000).
40.
Gygi, S. P., Rist, B. & Aebersold, R. Measuring gene expression by quantitative
proteome analysis [In Process Citation]. Curr Opin Biotechnol 11, 396-401
(2000).
41.
Gygi, S. P., Rochon, Y., Franza, B. R. & Aebersold, R. Correlation between
protein and mRNA abundance in yeast. Mol Cell Biol 19, 1720-30. (1999).
42.
Harry, J. L., Wilkins, M. R., Herbert, B. R., Packer, N. H., Gooley, A. A. & Williams, K. L.
Proteomics: Capacity versus utility. Electrophoresis 21, 1071-1081 (2000).
43.
Hatzimanikatis, V., Choe, L. H. & Lee, K. H. Proteomics: theoretical and
experimental considerations. Biotechnol Prog 15, 312-8 (1999).
44.
Haynes, P. A. & Yates, J. R., 3rd. Proteome profiling-pitfalls and progress. Yeast
17, 81-7 (2000).
45.
Hegyi, H. & Gerstein, M. The relationship between protein structure and function:
a comprehensive survey with application to the yeast genome. J Mol Biol 288,
147-64 (1999).
46.
Holstege, F. C. et al. Dissecting the regulatory circuitry of a eukaryotic genome.
Cell 95, 717-728 (1998).
47.
Ishii, M. et al. Direct comparison of GeneChip and SAGE on the quantitative
accuracy in transcript profiling analysis. Genomics 68, 136-43 (2000).
48.
Jackson, R. J. & Wickens, M. Translational controls impinging on the 5'-untranslated
region and initiation factor proteins. Curr Opin Genet Dev 7, 233-41. (1997).
49.
Jacobs Anderson, J. S. & Parker, R. Computational identification of cis-acting
elements affecting post-transcriptional control of gene expression in
Saccharomyces cerevisiae. Nucleic Acids Res 28, 1604-17. (2000).
50.
Jaeger, J. A., Turner, D. H. & Zuker, M. Predicting optimal and suboptimal
secondary structure for RNA. Methods Enzymol 183, 281-306 (1990).
51.
Jansen, R. & Gerstein, M. Analysis of the yeast transcriptome with structural and
functional categories: characterizing highly expressed proteins. Nucleic Acids Res
28, 1481-8 (2000).
52.
Jelinsky, S. A. & Samson, L. D. Global response of Saccharomyces cerevisiae to
an alkylating agent. Proc Natl Acad Sci U S A 96, 1486-91 (1999).
53.
Jones, D. T. Do transmembrane protein superfolds exist? FEBS Lett 423, 281-5
(1998).
54.
Jones, D. T. GenTHREADER: an efficient and reliable protein fold recognition
method for genomic sequences. J Mol Biol 287, 797-815 (1999).
55.
Kidd, D., Liu, Y. & Cravatt, B. F. Profiling serine hydrolase activities in complex
proteomes. Biochemistry 40, 4005-15 (2001).
56.
Klose, J. Protein mapping by combined isoelectric focusing and electrophoresis of
mouse tissues. A novel approach to testing for induced point mutations in
mammals. Humangenetik 26, 231-43 (1975).
57.
Krogh, A., Larsson, B., von Heijne, G. & Sonnhammer, E. L. Predicting
transmembrane protein topology with a hidden Markov model: application to
complete genomes. J Mol Biol 305, 567-80 (2001).
58.
Lin, J. & Gerstein, M. Whole-genome trees based on the occurrence of folds and
orthologs: implications for comparing genomes on different levels. Genome Res
10, 808-18 (2000).
59.
Lindahl, L. & Hinnebusch, A. Diversity of mechanisms in the regulation of
translation in prokaryotes and lower eukaryotes. Curr Opin Genet Dev 2, 720-6.
(1992).
60.
Lipshutz, R. J., Fodor, S. P., Gingeras, T. R. & Lockhart, D. J. High density
synthetic oligonucleotide arrays. Nat Genet 21, 20-4 (1999).
61.
Lopez, M. F. Better approaches to finding the needle in a haystack: Optimizing
proteome analysis through automation. Electrophoresis 21, 1082-1093 (2000).
62.
Luban, J. & Goff, S. P. The yeast two-hybrid system for studying protein-protein
interactions. Curr Opin Biotechnol 6, 59-64 (1995).
63.
MacBeath, G. & Schreiber, S. L. Printing proteins as microarrays for high-throughput
function determination. Science 289, 1760-3. (2000).
64.
Matton, D. P., Constabel, P. & Brisson, N. Alcohol dehydrogenase gene
expression in potato following elicitor and stress treatment. Plant Mol Biol 14,
775-83 (1990).
65.
McCarthy, J. E. Posttranscriptional control of gene expression in yeast. Microbiol
Mol Biol Rev 62, 1492-553. (1998).
66.
Mewes, H. W. et al. MIPS: a database for genomes and protein sequences.
Nucleic Acids Res 28, 37-40 (2000).
67.
Millar, A. A., Olive, M. R. & Dennis, E. S. The expression and anaerobic
induction of alcohol dehydrogenase in cotton. Biochem Genet 32, 279-300 (1994).
68.
Molloy, M. P. Two-dimensional electrophoresis of membrane proteins using
immobilized pH gradients. Anal Biochem 280, 1-10 (2000).
69.
Morris, D. R. & Geballe, A. P. Upstream open reading frames as regulators of
mRNA translation. Mol Cell Biol 20, 8635-42. (2000).
70.
Nauchitel, V. V. & Somorjai, R. L. Spatial and free energy distribution patterns of
amino acid residues in water soluble proteins. Biophysical Chemistry 51, 327-336
(1994).
71.
Nelson, P. S. et al. Comprehensive analyses of prostate gene expression:
convergence of expressed sequence tag databases, transcript profiling and
proteomics [In Process Citation]. Electrophoresis 21, 1823-31 (2000).
72.
O'Farrell, P. H. High resolution two-dimensional electrophoresis of proteins. J
Biol Chem 250, 4007-21 (1975).
73.
Pandey, A. & Mann, M. Proteomics to study genes and genomes. Nature 405,
837-46 (2000).
74.
Qi, S. Y., Moir, A. & O'Connor, C. D. Proteome of Salmonella typhimurium
SL1344: identification of novel abundant cell envelope proteins and assignment to
a two-dimensional reference map. J Bacteriol 178, 5032-8 (1996).
75.
Ross-Macdonald, P. et al. Large-scale analysis of the yeast genome by transposon
tagging and gene disruption. Nature 402, 413-8 (1999).
76.
Rost, B., Casadio, R., Fariselli, P. & Sander, C. Transmembrane helices predicted
at 95% accuracy. Protein Sci 4, 521-33 (1995).
77.
Roth, F. P., Hughes, J. D., Estep, P. W. & Church, G. M. Finding DNA regulatory
motifs within unaligned noncoding sequences clustered by whole-genome mRNA
quantitation. Nat Biotechnol 16, 939-45 (1998).
78.
Rubin, G. M. et al. Comparative genomics of the eukaryotes. Science 287, 2204-15 (2000).
79.
Sali, A. Functional Links between Proteins. Nature 402, 25-26 (1999).
80.
Santoni, V., Molloy, M. & Rabilloud, T. Membrane proteins and proteomics: un
amour impossible? Electrophoresis 21, 1054-1070 (2000).
81.
Schena, M., Shalon, D., Davis, R. W. & Brown, P. O. Quantitative monitoring of
gene expression patterns with a complementary DNA microarray. Science 270,
467-70 (1995).
82.
Searls, D. B. Using bioinformatics in gene and drug discovery. Drug Discovery
Today 5, 135-143 (2000).
83.
Senes, A., Gerstein, M. & Engelman, D. M. Statistical analysis of amino acid
patterns in transmembrane helices: the GxxxG motif occurs frequently and in
association with beta-branched residues at neighboring positions. J Mol Biol 296,
921-36 (2000).
84.
Shapiro, L. & Harris, T. Finding function through structural genomics. Current
Opinions in Biotechnology 11, 31-35 (2000).
85.
Sharp, P. M. & Li, W. H. The codon Adaptation Index--a measure of directional
synonymous codon usage bias, and its potential applications. Nucleic Acids Res
15, 1281-95 (1987).
86.
Sherlock, G. Analysis of large-scale gene expression data. Curr Opin Immunol 12,
201-5 (2000).
87.
Shevchenko, A. et al. Linking genome and proteome by mass spectrometry: large-scale
identification of yeast proteins from two dimensional gels. Proc Natl Acad Sci U S A 93,
14440-5 (1996).
88.
Smith, R. D. Probing proteomes-seeing the whole picture? [In Process Citation].
Nat Biotechnol 18, 1041-2 (2000).
89.
Tatusov, R. L., Koonin, E. V. & Lipman, D. J. A genomic perspective on protein
families. Science 278, 631-7 (1997).
90.
Tekaia, F., Lazcano, A. & Dujon, B. The genomic tree as revealed from whole
proteome comparisons. Genome Res 9, 550-7 (1999).
91.
Velculescu, V. E. et al. Characterization of the yeast transcriptome. Cell 88, 243-251 (1997).
92.
Vilela, C., Linz, B., Rodrigues-Pousada, C. & McCarthy, J. E. The yeast
transcription factor genes YAP1 and YAP2 are subject to differential control at
the levels of both translation and mRNA stability. Nucleic Acids Res 26, 1150-9.
(1998).
93.
Vilela, C., Ramirez, C. V., Linz, B., Rodrigues-Pousada, C. & McCarthy, J. E.
Post-termination ribosome interactions with the 5'UTR modulate yeast mRNA
stability. Embo J 18, 3139-52. (1999).
94.
Wallin, E. & von Heijne, G. Genome-wide analysis of integral membrane proteins
from eubacterial, archaean, and eukaryotic organisms. Protein Sci 7, 1029-38
(1998).
95.
Washburn, M. P. & Yates, J. R., 3rd. Analysis of the microbial proteome. Curr
Opin Microbiol 3, 292-7 (2000).
96.
Winzeler, E. A. et al. Functional characterization of the S. cerevisiae genome by
gene deletion and parallel analysis. Science 285, 901-6 (1999).
97.
Wittes, J. & Friedman, H. P. Searching for evidence of altered gene expression: a
comment on statistical analysis of microarray data [editorial; comment]. J Natl
Cancer Inst 91, 400-1 (1999).
98.
Wolf, Y. I., Brenner, S. E., Bash, P. A. & Koonin, E. V. Distribution of protein
folds in the three superkingdoms of life. Genome Res 9, 17-26 (1999).
99.
Young, K. Yeast two-hybrid: so many interactions, (in) so little time... Biol
Reprod 58, 302-311 (1998).
100.
Zhang, M. Q. Large-scale gene expression data analysis: a new challenge to
computational biologists [published erratum appears in Genome Res 1999
Nov;9(11):1156]. Genome Res 9, 681-8 (1999).
101.
Zhu, H. et al. Analysis of yeast protein kinases using protein chips. Nat Genet 26,
283-9. (2000).
102.
Zuker, M. Calculating nucleic acid secondary structure. Curr Opin Struct Biol 10,
303-10. (2000).
Chapter 2: mRNA expression and protein abundance
2.2 Comparing protein abundance and mRNA expression levels
on a genomic scale
Abstract
Attempts to correlate protein abundance with mRNA expression levels have had variable
success. I review the results of these comparisons, focusing on yeast. In the process, I
survey experimental techniques for determining protein abundance, principally
two-dimensional gel electrophoresis and mass spectrometry. I also merge many of the
available yeast protein-abundance datasets, using the resulting larger 'meta-dataset' to
find correlations between protein and mRNA expression, both globally and within
smaller categories.
Introduction
Although some of the underlying technology for quantifying protein abundance was
introduced almost thirty years ago (Klose, 1975; O'Farrell, 1975), there has recently been a
significant increase in the development of new tools. Concurrently, tools for analyzing
mRNA expression are becoming more mainstream. The quantification of both of these
molecular populations is not an exercise in redundancy; measurements taken from
mRNA and protein levels are complementary and both are necessary for a complete
understanding of how the cell works (Hatzimanikatis et al., 1999). Additionally, as
mRNA is eventually translated into protein, one might assume that there should be some
sort of correlation between the level of mRNA and that of protein. Alternatively, there
may not be any significant correlation, which, in itself, is an informative conclusion.
The two commonly used high-throughput methods for measuring mRNA expression,
microarrays and Affymetrix chips, have both been extensively reviewed elsewhere
(Brown and Botstein, 1999; McGall and Christians, 2002; Schena et al., 1998). There are
also two basic methods for determining protein abundance, based either on two-dimensional
electrophoresis or on mass-spectrometric methods (Table 1). I provide a
brief review of these technologies and recent efforts to determine correlations between
quantified protein abundances and mRNA expression.
Two-dimensional electrophoresis
Determining relative protein expression levels by conventional two-dimensional
electrophoresis requires isoelectric focusing, SDS-polyacrylamide gel electrophoresis,
staining, fixing, densitometry, and careful matching of the same spots on two or more
gels. Differentially expressed spots are then excised and enzymatically digested, and the
resulting peptides are identified using mass spectrometry. An attractive aspect of this
approach is the low capital equipment cost, but a high level of expertise is needed to
obtain reproducible gels, and two-dimensional electrophoresis is generally limited to
proteins that are neither too acidic, too basic, nor too hydrophobic, and that are between
10 and 200 kDa in size, so that they are reliably separated on gels. Additionally, this
approach detects only those proteins that are expressed at relatively high levels and that
have long half-lives (Gygi et al., 2000; Gygi et al., 1999). In one study using 40 μg yeast
lysate, the average protein abundance detected was 51,200 copies per cell, with no
proteins detected with abundances less than 1,000 copies per cell (Gygi et al., 2000).
Given that 1,500 spots were resolved on a 1.0 pH unit gel (Gygi et al., 2000), several gels
covering different pH ranges would be needed to resolve a whole cell lysate. Because of
these limitations, conventional two-dimensional electrophoresis technology has limited
potential for large-scale proteome analysis (Gygi et al., 2000).
Two-dimensional fluorescence-difference gel electrophoresis (DIGE) utilizes mass- and
charge-matched, spectrally resolvable fluorescent dyes (such as Cy3 and Cy5) to label
two different protein samples in vitro prior to two-dimensional electrophoresis. Its main
advantage over conventional two-dimensional electrophoresis is that both the control and
the experimental sample are run in a single polyacrylamide gel. The samples are then
imaged separately but can be perfectly overlaid without any 'warping' of the gels. This
substantially raises the confidence with which protein changes between samples can be
detected and quantified. Changes in the relative expression level of a protein as small as
1.2-fold can be detected for large-volume spots (Tonge et al., 2001). Because
detection is based on fluorescence, DIGE has a large dynamic range of about 10,000,
which permits differential expression analysis of proteins that are present at relatively
low copy number (Tonge et al., 2001). The limit of detection of DIGE for quantifying
protein expression ratios is between 0.25 and 0.95 ng protein, which is similar to that for
silver staining (Gharbi et al., 2002; Tonge et al., 2001). In a recent study (Zhou et al.,
2002), the relative levels of expression of approximately 1,050 protein spots were
compared in 250,000 laser-dissected normal versus esophageal carcinoma cells. This
analysis identified 58 spots that were up-regulated by more than three-fold and 107 that
were down-regulated by more than three-fold in cancer cells.
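The three-fold criterion used in comparisons of this kind can be illustrated with a minimal Python sketch. It assumes that each spot has already been reduced to a single experimental/control intensity ratio. The function name `classify_spots`, the spot identifiers and the ratios are hypothetical and are not taken from the study cited above.

```python
def classify_spots(ratios, threshold=3.0):
    """Split spots into up-regulated, down-regulated and unchanged.

    `ratios` maps a spot identifier to its experimental/control intensity
    ratio. A spot is 'up' if the ratio exceeds `threshold` and 'down' if it
    falls below 1/threshold; everything else is treated as unchanged.
    """
    up, down, unchanged = [], [], []
    for spot, ratio in ratios.items():
        if ratio > threshold:
            up.append(spot)
        elif ratio < 1.0 / threshold:
            down.append(spot)
        else:
            unchanged.append(spot)
    return up, down, unchanged


# Made-up ratios for four hypothetical spots.
example_ratios = {"spot_a": 4.2, "spot_b": 0.2, "spot_c": 1.1, "spot_d": 3.0}
up, down, unchanged = classify_spots(example_ratios)
```

Note that a ratio of exactly 3.0 is counted as unchanged here, because the criterion is "more than three-fold".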
Mass spectrometric approaches
Disease biomarker discovery
Current approaches to discovering protein or peptide markers of disease involve batch
chromatography, matrix-assisted laser desorption ionization mass spectrometry
(MALDI-MS) and statistical analysis of large numbers of disease versus normal serum or other
biological samples. Most recent studies have relied on surface-enhanced laser desorption
ionization time-of-flight mass spectrometry (SELDI-TOF-MS) (Adam et al., 2001; Issaq
et al., 2002). The SELDI approach (Issaq et al., 2002) involves using a gold-coated chip
with eight or sixteen 2 mm spots that are modified with chromatographic surfaces (for
example anionic, cationic, hydrophobic, and so on). After spotting a few microliters of
serum, any contaminants and salt are removed by washing with water, and the target is
dried by adding a MALDI matrix solution, such as α-cyano-4-hydroxy-cinnamic acid. In
a study by Petricoin et al. (Petricoin et al., 2002), SELDI-MS analysis of serum from 50
control samples and 50 samples from patients with ovarian cancer identified five
peptide biomarkers that ranged in size from 534 to 2,465 Da. The pattern formed by these
markers was then used to correctly classify all 50 ovarian cancer samples in a masked set
of serum samples from 116 patients who included 50 patients with ovarian cancer and 66
unaffected women. Similar promising results have been reported in studies of serum
samples from breast and prostate cancer patients (Adam et al., 2001; Li et al., 2002). In a
recent study (Wu et al., 2003), which compared the relative ability of several different
statistical approaches to classify samples based on MS data, the disease biomarker
approach was extended to a conventional MALDI-MS platform. Although powerful, the
disease biomarker approach does not provide accurate relative amounts of the control
versus experimental biomarker, only the relative intensity difference.
Isotope-coded affinity-tag-based protein profiling
While both MALDI-MS-based disease biomarker discovery and DIGE comparatively
profile the naturally occurring forms of peptides and proteins, isotope-coded affinity-tag
(ICAT) analysis profiles the relative amounts of cysteine-containing peptides derived
from tryptic digests of protein extracts. Because only a single tryptic peptide is needed to
quantify the expression of the corresponding parent protein, the ICAT reagent utilizes a
thiol-reactive group that attaches both a biotin tag and either nine 12C (light) or
nine 13C (heavy) atoms to each cysteine residue. Following derivatization of the control
protein extract with [12C]-ICAT reagent and the experimental extract with [13C]-ICAT
reagent, the pooled samples are subjected to trypsin digestion followed by both
cation-exchange and avidin-affinity chromatography. Liquid chromatography and tandem mass spectrometry
(LC/MS/MS) is then used to identify ICAT peptide pairs and to quantify the relative
12C/13C ratios. It is important to note that the ICAT approach provides the relative
expression ratios of individual proteins under two conditions; it does not provide absolute
protein concentrations, nor does it provide the ratio of the concentration of one protein
relative to another in a single condition. A nice feature of this approach is that the in vitro
incorporation of a stable isotope into one of the two samples being compared obviates the
need to separately analyze the control and experimental samples by MS. Restricting the
analysis to cysteine-containing peptides also reduces the complexity of the sample: a
tryptic digest of a whole-cell human protein extract might produce more than 500,000
peptides, of which fewer than 100,000 might be expected to contain cysteine. Yet, based on
a search of the SwissProt database, fewer than 5% of human proteins lack cysteine and
would therefore be missed (that is, more than 95% of proteins would include at least one
cysteine-containing peptide).
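The kind of estimate quoted above can be sketched with a simple in silico tryptic digest. The Python sketch below is an illustration only. It assumes a simplified cleavage rule (cleave after K or R unless the next residue is P, with no missed cleavages), and the function names and toy sequences are hypothetical; it is not the procedure used for the SwissProt search mentioned in the text.

```python
def tryptic_peptides(sequence):
    """Split a protein sequence after K or R, unless the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):  # C-terminal peptide without a trailing K/R
        peptides.append(sequence[start:])
    return peptides


def cysteine_coverage(proteins):
    """Return (fraction of peptides containing Cys,
    fraction of proteins with at least one Cys-containing peptide)."""
    n_peptides = n_cys_peptides = n_cys_proteins = 0
    for sequence in proteins.values():
        peptides = tryptic_peptides(sequence)
        with_cys = [p for p in peptides if "C" in p]
        n_peptides += len(peptides)
        n_cys_peptides += len(with_cys)
        n_cys_proteins += bool(with_cys)
    return n_cys_peptides / n_peptides, n_cys_proteins / len(proteins)


# Toy sequences, invented for illustration only.
toy_proteins = {"prot_1": "MKCAR", "prot_2": "AKPGR", "prot_3": "MCKAAKLLR"}
peptide_fraction, protein_fraction = cysteine_coverage(toy_proteins)
```

In this toy set only a third of the peptides contain cysteine, yet two of the three proteins remain detectable, which mirrors the trade-off described above between reduced sample complexity and retained proteome coverage.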
ICAT results are analogous to those obtained by the use of two different fluorescent dyes
in DNA microarray analysis of mRNA levels or DIGE analysis of protein expression.
The largest number of proteins profiled so far using this approach with a single sample
is the 491 proteins contained in microsomal fractions of naive and in vitro differentiated
human myeloid leukemia cells (Han et al., 2001).
Multidimensional protein identification technology
Multidimensional protein identification technology (MudPit) is similar to ICAT in that it
utilizes cation-exchange prefractionation followed by reverse-phase (RP)
high-performance liquid chromatography (HPLC) separation and MS/MS analysis (Wolters et
al., 2001). In contrast to the ICAT approach, however, MudPit analyzes the entire
mixture of tryptically digested proteins and utilizes tandemly coupled (cation-exchange
followed by reverse-phase) columns. A specific subset of peptides is eluted from the
cation-exchange column, using a step gradient of increasing salt concentration, onto the
front of the RP column. Peptides are then eluted from the RP column and enter the mass
spectrometer for analysis. After the RP gradient is complete, the next step of the salt
gradient releases another subset of peptides from the cation-exchange column onto the
RP column, and the process repeats itself. Using this approach on the yeast proteome,
Wolters et al. (Wolters et al., 2001) identified 5,540 unique peptides from 1,484 proteins
and demonstrated a dynamic range of detection of 10,000-fold. This method has been
extended to comparative protein profiling by using in vivo 14N/15N metabolic labeling
(Washburn et al., 2003; Washburn et al., 2002). Washburn et al. (Washburn et al.,
2002) used Saccharomyces cerevisiae grown in both 14N- and 15N-containing minimal
media and uniquely identified 2,264 peptides and 872 proteins. In addition, accurate
14N/15N quantitation was determined for each peptide with an average standard
deviation of 30%.
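How isotope-labeled peptide pairs translate into a relative expression value can be sketched as follows. This is a minimal illustration under stated assumptions: each peptide is represented by a (light, heavy) intensity pair, and a protein is summarized by the mean of its peptide ratios together with the relative standard deviation. The averaging scheme, the function name `protein_ratios` and the numbers are hypothetical and are not taken from the cited studies.

```python
from statistics import mean, stdev


def protein_ratios(peptide_intensities):
    """Summarize light/heavy peptide ratios per protein.

    peptide_intensities: protein -> list of (light, heavy) intensity pairs.
    Returns protein -> (mean ratio, relative standard deviation); the
    relative standard deviation is None when only one peptide is available.
    """
    summary = {}
    for protein, pairs in peptide_intensities.items():
        ratios = [light / heavy for light, heavy in pairs]
        average = mean(ratios)
        relative_sd = stdev(ratios) / average if len(ratios) > 1 else None
        summary[protein] = (average, relative_sd)
    return summary


# Made-up intensities for two hypothetical proteins.
example_intensities = {
    "protein_1": [(100.0, 50.0), (90.0, 30.0)],  # peptide ratios 2.0 and 3.0
    "protein_2": [(10.0, 40.0)],                 # single peptide, ratio 0.25
}
summary = protein_ratios(example_intensities)
```

As noted for ICAT, such ratios describe one protein under two conditions; they say nothing about the abundance of one protein relative to another within a single condition.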
Comparison of mRNA and protein levels
Even with the significant developments in the technologies used to quantify protein
abundance over the past couple of years, protein identification and quantification still
lag behind the high-throughput experimental techniques used to determine mRNA
expression levels. Yet, while mRNA expression values have shown their usefulness in a
broad range of applications, including the diagnosis and classification of cancers (Golub
et al., 1999; Macgregor and Squire, 2002), these results are almost certainly only
correlative, rather than causative; in the end it is most probably the concentration of
proteins and their interactions that are the true causative forces in the cell, and it is the
corresponding protein quantities that ought to be studied.
Primarily because of a limited ability to measure protein abundances, researchers have
tried to find correlations between mRNA and the limited protein expression data, in the
hope that they could determine protein abundance levels from the more copious and
technically easier mRNA experiments. Alternatively, if there is definitively no correlation
between mRNA and protein data, both quantities could be used as independent sources of
information for use in machine-learning algorithms, for example, to predict protein
interactions. To date, there have been only a handful of efforts to find correlations
between mRNA and protein expression levels, most notably in human cancers and yeast
cells; for the most part, they have reported only minimal and/or limited correlations.
One of the earliest analyses of correlation looked at 19 proteins in the human liver.
Anderson and Seilhamer (Anderson and Seilhamer, 1997) found a moderately positive
correlation of 0.48. Another limited analysis, of the three genes MMP-2, MMP-9 and
TIMP-1 in human prostate cancers, showed no significant relationship (Lichtinghagen et
al., 2002). An additional cancer study (Chen et al., 2002) showed a significant correlation
in only a small subset of the proteins studied. Conversely, Orntoft et al. (Orntoft et al.,
2002) found highly significant correlations in human carcinomas when looking at
changes in mRNA and protein expression levels.
Protein and mRNA correlations in yeast
Many of the present efforts at correlating mRNA and protein expression have been
conducted in yeast using two-dimensional electrophoresis techniques. In particular, Gygi
et al. (Gygi et al., 1999) found that even similar mRNA expression levels could be
accompanied by a wide range (up to 20-fold difference) of protein abundance levels, and
vice versa. These results contrast with those of Futcher et al. (Futcher et al., 1999), who
found relatively high correlations (r = 0.76) after transforming the data to normal
distributions. In a previous analysis (Greenbaum et al., 2002), I merged
these two datasets (referred to as 2DE-1 (Gygi et al., 1999) and 2DE-2 (Futcher et al.,
1999)), comparing the resulting larger protein abundance set ('merged dataset 1')
with a comprehensive mRNA expression dataset. The mRNA expression reference set
was constructed by iteratively combining, in a non-trivial fashion, three sets that
used Affymetrix chips and a SAGE dataset (Greenbaum et al., 2002). Using these
reference datasets, I was able to do an all-against-all comparison of mRNA and protein
expression levels, in addition to a number of analyses comparing protein and mRNA
expression within smaller but broad categories (Greenbaum et al., 2002; Luscombe et al.,
2001).
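The two kinds of comparison described above, a global one and one restricted to broad categories, can be sketched in a few lines of Python. This is a minimal illustration, not the analysis pipeline used in the cited work. It assumes that mRNA and protein levels are available as dictionaries keyed by ORF and that each ORF carries a single category label. The log10 transformation, the `min_size` cutoff, the function names and the toy values are all assumptions made for the example.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlate(mrna, protein, category, min_size=3):
    """Return the global correlation and one correlation per category.

    Only ORFs present in both datasets are used, and values are
    log10-transformed first (an assumption of this sketch).
    """
    shared = sorted(set(mrna) & set(protein))
    log_m = [math.log10(mrna[orf]) for orf in shared]
    log_p = [math.log10(protein[orf]) for orf in shared]
    overall = pearson(log_m, log_p)

    groups = defaultdict(list)
    for orf, m, p in zip(shared, log_m, log_p):
        groups[category.get(orf, "unclassified")].append((m, p))

    per_category = {}
    for name, pairs in groups.items():
        if len(pairs) >= min_size:  # skip categories with too few ORFs
            ms, ps = zip(*pairs)
            per_category[name] = pearson(ms, ps)
    return overall, per_category


# Toy values, invented so that the categories disagree with the global view.
mrna = {"A": 1, "B": 10, "C": 100, "D": 1, "E": 10, "F": 100}
protein = {"A": 10, "B": 100, "C": 1000, "D": 1000, "E": 100, "F": 10}
category = {"A": "cat_1", "B": "cat_1", "C": "cat_1",
            "D": "cat_2", "E": "cat_2", "F": "cat_2"}
overall, per_category = correlate(mrna, protein, category)
```

The toy data are deliberately extreme: the global correlation vanishes while each category on its own is perfectly correlated or anti-correlated, which shows why a category-level view can differ sharply from an all-against-all comparison.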
Given the difficult, laborious, and limiting nature of two-dimensional electrophoresis
analysis, many of the newer protein abundance determinations have been done using
MudPit and derivative technologies. Washburn et al. (Washburn et al., 2001) used MudPit
to detect 1,484 proteins, which represented a somewhat
random sampling of proteins independent of abundance, localization, size or
hydrophobicity (I refer to this dataset as MudPit-1). In a further experiment the authors,
comparing expression ratios for both proteins and mRNA levels, found that although they
could not find correlations for individual loci, they could find overall correlations when
looking at pathways and complexes of proteins that functioned together (Washburn et al.,
2003). Peng et al. (Peng et al., 2003) analyzed 1,504 yeast proteins with a false-positive
rate - misidentification of a protein - of less than 1% (I refer to this dataset as MudPit-2).
In their analysis (Peng et al., 2003), they contrasted their methodology with that of
Washburn et al. (Washburn et al., 2001), with whose dataset there was significant overlap of
proteins.
A new merged dataset
Expanding upon my previous merged dataset, I constructed a new merged dataset
('merged dataset 2') using the two two-dimensional electrophoresis and two MudPit
datasets described above. Briefly (more information is available on my website at:
http://bioinfo.mbb.yale.edu/expression/prot-v-mrna/), I transformed each of the
protein-abundance datasets into more quantitative data by fitting each protein dataset individually
onto the reference mRNA expression dataset. The MudPit-1 dataset was also fitted onto
the more finely grained MudPit-2 dataset. Each of the new, fitted datasets was then
inversely transformed back into protein space. These derived protein datasets were then
combined into a larger reference dataset; when I had more than one abundance value for
an open reading frame (ORF), I chose the value from the dataset according to a
prescribed quality ranking (see Figure 1). The resulting set contained protein abundance
information for approximately 2,000 ORFs. (One caveat with the MudPit data: while
quantitative analysis can be subsequently done on the results of MudPit experiments,
MudPit data alone are only semi-quantitative, in that the number of peptides determined
is relative to the actual protein abundance within the cell (Washburn et al., 2001). Some
may therefore argue that MudPit alone is not optimal for a comparison with mRNA data.
Nevertheless, I feel that my methodical merging process creates a quantitative and
representative dataset that can be compared with the mRNA expression data.) Using the
resulting data I could compare mRNA expression and protein abundance globally (Figure
1a) as well as within smaller but broad categories, such as function or localization (see
Figure 1b, 1c). In particular, I show that some localization categories - for example, the
nucleolus - have significantly higher correlations than the global correlation. Other
localizations may present less of a correlation between mRNA and protein data - for
example, the mitochondria - possibly reflecting the heterogeneous nature and function of
the latter organelle. In terms of MIPS functional categories (Mewes et al., 2002), I show
that although some categories, such as cell rescue, show a lower correlation than the
whole merged set, other functional categories, such as cell cycle, show a significant
increase in correlation. Logically, this increased correlation reflects the co-regulated
nature of the proteins in this functional category.
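The merging step can be illustrated with a minimal sketch. It assumes the fitted datasets are already in protein space and uses the quality ordering given in the Figure 1 caption (2DE-1, 2DE-2, MudPit-2, MudPit-1). The ORF identifiers and abundance values below are hypothetical placeholders, and the function name is illustrative rather than part of the original analysis.

```python
# Quality ranking taken from the Figure 1 caption; best dataset first.
PRIORITY = ["2DE-1", "2DE-2", "MudPit-2", "MudPit-1"]

def merge_by_priority(datasets, priority=PRIORITY):
    """Return {orf: (abundance, source)}, taking each ORF from the
    highest-ranked dataset that contains a value for it."""
    merged = {}
    for name in priority:
        for orf, value in datasets.get(name, {}).items():
            # setdefault keeps the entry written by a higher-ranked dataset
            merged.setdefault(orf, (value, name))
    return merged

# Hypothetical toy input: ORF names and abundances are made up.
toy = {
    "2DE-1": {"orf_a": 12.0},
    "2DE-2": {"orf_a": 15.0, "orf_b": 3.0},
    "MudPit-2": {"orf_b": 4.5, "orf_c": 0.8},
    "MudPit-1": {"orf_d": 1.1},
}
reference = merge_by_priority(toy)
```

Recording the source dataset next to each value makes it possible to check afterwards how much of the merged reference set comes from each technology.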
Reasons for the absence of correlation
There are presumably at least three reasons for the poor correlations generally reported in
the literature between the level of mRNA and the level of protein, and these may not be
mutually exclusive. First, there are many complicated and varied post-transcriptional
mechanisms involved in turning mRNA into protein that are not yet sufficiently well
defined to be able to compute protein concentrations from mRNA; second, proteins may
differ substantially in their in vivo half-lives; and/or third, there is a significant amount of
error and noise in both protein and mRNA experiments that limits my ability to get a clear
picture (Baldi and Long, 2001, Szallasi, 1999).
Examining the first option - that there are a number of complex steps between
transcription and translation - I looked at correlations between mRNA and protein
abundance for those ORFs that had varied or steady levels of mRNA expression over the
course of the cell cycle (Cho et al., 1998). To normalize for the varied degrees of
expression for different ORFs, I took the standard deviation divided by the average
expression level as representative of the variation of each ORF over the course of the
yeast cell cycle (Figure 2). Broadly speaking, the cell can control the levels of protein
at the transcriptional level and/or at the translational level. Logically, I would assume that
those ORFs that show a large degree of variation in their expression are controlled at the
transcriptional level - the variability of the mRNA expression is indicative of the cell
controlling mRNA expression at different points of the cell cycle to achieve the resulting
and desired protein levels. Thus I would expect, and I found, a high degree of correlation
(r = 0.89) between the reference mRNA and protein levels for these particular ORFs; the
cell has already put significant energy into dictating the final level of protein through
tightly controlling the mRNA expression, and I assume that there would then be minimal
control at the protein level. In contrast, those genes that show minimal variation in their
mRNA expression throughout the cell cycle are more likely to have little or no
correlation with the final protein level; the cell would be controlling these ORFs at the
translational and/or post-translational level, with the mRNA levels being somewhat
independent of the final protein concentration. And indeed, I found only minimal
correlation between protein and mRNA expression for these ORFs (r = 0.2).
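The variability ranking described above can be sketched as follows. This is an illustration only: it assumes the population standard deviation (the text does not say which estimator was used), a plain Pearson correlation on untransformed values, and hypothetical ORF identifiers and profiles. All function names are illustrative.

```python
import math
import statistics

def coefficient_of_variation(profile):
    """Standard deviation divided by the mean of one ORF's expression time course."""
    return statistics.pstdev(profile) / statistics.mean(profile)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def split_by_variability(profiles, n):
    """Return (most_variable, least_variable) ORF ids, n each, ranked by variability."""
    ranked = sorted(profiles, key=lambda orf: coefficient_of_variation(profiles[orf]))
    return ranked[-n:], ranked[:n]

def group_correlation(orfs, mrna, protein):
    """Pearson r between mRNA and protein levels over the ORFs measured in both."""
    shared = [o for o in orfs if o in mrna and o in protein]
    return pearson([mrna[o] for o in shared], [protein[o] for o in shared])

# Hypothetical cell-cycle profiles (made-up values).
profiles = {
    "orf_a": [1.0, 1.0, 1.0],
    "orf_b": [1.0, 2.0, 3.0],
    "orf_c": [1.0, 5.0, 9.0],
}
most_variable, least_variable = split_by_variability(profiles, 1)
```

With real data, `group_correlation` would be called once on the most variable ORFs and once on the least variable ones, and the two coefficients compared.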
Furthermore, I found that those ORFs that have higher than average levels of ribosomal
occupancy - that is, a large percentage of their cellular mRNA concentration is
associated with ribosomes (being translated) - have well correlated mRNA and protein
expression levels (Figure 2). These cases probably represent a situation wherein the cell,
having significantly controlled the mRNA expression to produce a specific level of
protein, will probably not also employ mechanisms to control the translation.
Conversely, those ORFs that have very low occupancy rates have uncorrelated
mRNA and protein expression; presumably, given that the cell has not tightly controlled the
mRNA expression for these ORFs, it dictates the resulting protein levels through
rigorous control of their translation (that is, through tight limits on occupancy) (Arava et
al., 2003).
A second option for a general lack of correlation between mRNA and protein abundance
may be that proteins have very different half-lives as the result of varied protein synthesis
and degradation. Protein turnover can vary significantly depending on a number of
different conditions (Glickman and Ciechanover, 2002); the cell can control the rates of
degradation or synthesis for a given protein, and there is significant heterogeneity even
within proteins that have similar functions (Pratt et al., 2002). Recent efforts have been
made to computationally measure these rates (Lian et al., 2002).
Simplistically, it can be presumed that the change in a protein's concentration over time
will be equal to the rate of translation minus the rate of degradation. By analogy to
concepts in chemical kinetics, I can approximate this as $dP(i,t)/dt = S\,E(i,t) - D\,P(i,t)$, where $P(i,t)$ is the abundance of protein $i$ at time $t$, $E(i,t)$ is the mRNA expression level for
protein $i$, $S$ is a general rate of protein synthesis per mRNA, and $D$ is a general rate of
protein degradation per protein (Gerner et al., 2002). Additionally there are some
experimental methods that can also be used to measure turnover and the translational
control of protein levels (Gerner et al, 2002, Lian et al, 2002, Pratt et al, 2002, Serikawa et
al., 2003).
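As a worked illustration of this rate equation, assume (purely for the sake of the example) that the mRNA level $E(i)$ and the rates $S$ and $D$ are constant in time. Setting the derivative to zero gives the steady state, and solving the linear equation gives the approach to it:

```latex
\frac{dP(i,t)}{dt} = S\,E(i) - D\,P(i,t) = 0
\quad\Longrightarrow\quad
P^{*}(i) = \frac{S}{D}\,E(i)

P(i,t) = P^{*}(i) + \bigl(P(i,0) - P^{*}(i)\bigr)\,e^{-D t}
```

Under these assumptions protein abundance is simply proportional to mRNA expression. Any deviation from proportionality would then have to come from gene-specific synthesis or degradation rates, from mRNA levels that change faster than the relaxation time $1/D$, or from measurement noise.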
Given the degenerate nature of the genetic code, there are many synonymous codons
(codons that translate into the same amino acid). As the cell is biased in its usage of
synonymous codons - that is, the usage of a subset of codons results in a higher level of
mRNA expression, possibly as a result of differing cellular tRNA levels (Bennetzen and
Hall, 1982) - the codon adaptation index (CAI), a measurement of codon usage, can be
used to predict the expression of a gene (Sharp and Li, 1987) (we recently calculated new
parameters for this model, with some improvement in predictive strength (Jansen et al.,
2003)). It is thought that the CAI will correlate differently with mRNA levels than with
protein abundance levels due, in part, to protein turnover rates (Coghlan and Wolfe,
2000).
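For readers unfamiliar with the CAI, the following sketch shows the basic calculation of Sharp and Li (1987): each codon receives a relative adaptiveness (its count in a reference set of highly expressed genes divided by the count of the most used synonymous codon), and the CAI of a gene is the geometric mean of these weights over its codons. The codon counts and the two synonymous families below are hypothetical, the handling of edge cases is simplified (codons without a weight are skipped), and this is not the re-parameterized model of Jansen et al. (2003).

```python
import math

def relative_adaptiveness(codon_counts, synonymous_families):
    """Weight of each codon: its count divided by the count of the most
    frequent codon coding for the same amino acid."""
    weights = {}
    for family in synonymous_families:
        top = max(codon_counts.get(codon, 0) for codon in family)
        if top == 0:
            continue  # no observations for this amino acid
        for codon in family:
            weights[codon] = codon_counts.get(codon, 0) / top
    return weights

def cai(sequence, weights):
    """Geometric mean of codon weights over a coding sequence.
    Simplification: codons with no weight or zero weight are skipped."""
    usable = len(sequence) - len(sequence) % 3
    codons = [sequence[i:i + 3] for i in range(0, usable, 3)]
    logs = [math.log(weights[c]) for c in codons if weights.get(c, 0) > 0]
    return math.exp(sum(logs) / len(logs))

# Hypothetical reference counts for two synonymous families (made-up numbers).
families = [["GCT", "GCC"], ["AAA", "AAG"]]
counts = {"GCT": 80, "GCC": 20, "AAA": 30, "AAG": 60}
w = relative_adaptiveness(counts, families)

high = cai("GCTAAG", w)  # uses only the preferred codons
low = cai("GCCAAA", w)   # uses only the less preferred codons
```

Ranking ORFs by this value and comparing the top and bottom groups is then the same kind of split as used for the other categories.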
Ranking the ORFs in terms of their CAI value, I found that although those ORFs that
ranked the highest in terms of CAI did not show a very strong correlation between
mRNA and protein levels, they nevertheless showed a significantly higher correlation
than ORFs that were ranked as having the lower CAI values (r = 0.48 versus 0.02). The
low correlations reflect the fact that the CAI will correlate differently for protein and
mRNA values because of the additional cellular controls on protein translation, namely
the effect of protein turnover rates. Nevertheless, the sizable difference in correlations
between the two groups of ORFs with high- and low-ranking CAI values (Figure 2)
shows that there is some relationship between mRNA and protein values, possibly
indicating that highly expressed genes tend to result in a more correlated level of protein
abundance than lower expressed ones.
Correlations have been found between the mRNA expression levels of different protein
subunits within protein complexes (Jansen et al., 2002). This implies that there should be,
in general, a correlation between mRNA and protein abundance: these subunits are a
special case in that their proteins have to be available in stoichiometric amounts
for the complexes to function. Thus, I believe that a major limitation to finding
correlations is the degree of natural and manufactured systematic noise in mRNA and
protein expression experiments. There is a continued effort to both describe and reduce
this noise (Qian et al., 2003). Meanwhile, in an attempt to get around the noise one could
look at broad categories of proteins - for example, groups defined by function, structure,
or localization - such that the background noise is cancelled out to some degree
(Greenbaum et al, 2002).
Although proteomics is still in its infancy, given the pace of technological advancement
in protein quantification, mRNA expression analysis and noise reduction, more
comprehensive correlation studies will soon be feasible. This will allow for more robust
analyses of the relationship between mRNA expression and protein abundance values.
Finally, to be fully able to understand the relationship between mRNA and protein
abundances, the dynamic processes involved in protein synthesis and degradation have to
be better understood; is the protein level changing because of a change in the rate of
protein synthesis, or mRNA, or protein turnover? These questions need to be looked into
further before I can appreciate in full the relationship between mRNA and protein
abundance levels.
Acknowledgements
This project was funded in part with Federal funds from the National Heart, Lung, and
Blood Institute, National Institutes of Health, under contract No. N01-HV-28186.
Figures and Tables
2.2 Comparing protein abundance and
mRNA expression levels on a genomic scale
Table 1 Proteomic Technologies
Figure 1 Comparison of mRNA expression and protein abundance. (a) A plot comparing my mRNA
reference expression set (Greenbaum et al, 2002) with my newly compiled protein
abundance dataset. The mRNA axis is in copies per cell; the protein axis is in thousand
copies per cell. The protein dataset is the result of iteratively fitting two MudPit datasets
(MudPit-1 (Washburn et al, 2001) and MudPit-2 (Peng et al, 2003)) and two two-dimensional electrophoresis datasets (2DE-1 (Gygi et al, 1999) and 2DE-2 (Futcher et al,
1999)). Given the semi-quantitative nature of the MudPit data (Washburn et al, 2001), I
transformed the data into a more quantitative set by fitting each set individually onto my
reference mRNA expression dataset. In addition, I fit the MudPit-1 dataset onto the more
finely-grained MudPit-2 dataset. Each of the datasets was then moved back into 'protein
space' using an inverse transformation derived from the 2DE-1 set, as this set has the
most precise values. These datasets were then combined into the new reference
abundance dataset. In cases in which there were overlapping values for a given ORF I
used the dataset in accord with the following ordering: 2DE-1, 2DE-2, MudPit-2,
MudPit-1. The resulting reference protein abundance dataset (N = 2044) had a correlation
of 0.66 with the mRNA reference dataset. (b,c) Additionally, I show that when looking at
specific subsets (subcellular localization (Kumar et al., 2002) or functional groups
(Mewes et al, 2002)) I can find both higher and lower correlations amongst these groups.
The lower correlations are generally reflective of a more heterogeneous category. This
analysis indicates that while correlations may be weak when looking at the global data, I
tend to find higher correlations when looking at smaller well-defined subsets of ORFs.
Further analysis is available at http://bioinfo.mbb.yale.edu/expression/prot-v-mrna/.
Figure 2 The differences in correlation between mRNA and protein expression values
using novel categories. I see significant differences when looking at the highest and
lowest ranking of groups of ORFs in the following categories: occupancy, CAI (codon
adaptation index) value (Bennetzen and Hall, 1982, Jansen et al, 2003, Sharp and Li,
1987) and variability. Occupancy refers to the percentage of transcripts associated with
ribosomes; I compared the correlation between the top 100 ORFs and the bottom 100 in
terms of occupancy (r = 0.78 versus 0.30). For the CAI, I compared the correlation
between mRNA and protein for those ORFs with the highest CAI and those with the
lowest (r = 0.48 versus 0.02). Variability refers to the normalized standard deviation (that
is, the standard deviation divided by the average expression level) for all ORFs in the
cell-cycle expression dataset of Cho et al. (Cho et al, 1998). Here, I compared the
correlations between protein abundance and mRNA expression for the most variable
compared with the least variable proteins (r = 0.89 versus 0.20). I found significant
differences between the correlations of mRNA and protein levels for the top and bottom
ranking populations for each of the comparisons.
References
1.
Swissprot (http://us.expasy.org/sprot/)
2.
Adam, B. L., Vlahou, A., Semmes, O. J. & Wright, G. L., Jr. Proteomic
approaches to biomarker discovery in prostate and bladder cancers. Proteomics 1,
1264-70. (2001).
3.
Anderson, L. & Seilhamer, J. A comparison of selected mRNA and protein
abundances in human liver. Electrophoresis 18, 533-7 (1997).
4.
Arava, Y. et al. Genome-wide analysis of mRNA translation profiles in
Saccharomyces cerevisiae. Proc Natl Acad Sci U S A 100, 3889-94 (2003).
5.
Baldi, P. & Long, A. D. A Bayesian framework for the analysis of microarray
expression data: regularized t -test and statistical inferences of gene changes.
Bioinformatics 17, 509-19. (2001).
6.
Bennetzen, J. L. & Hall, B. D. Codon selection in yeast. J Biol Chem 257, 3026-31 (1982).
7.
Brown, P. O. & Botstein, D. Exploring the new world of the genome with DNA
microarrays. Nat Genet 21, 33-7. (1999).
8.
Chen, G. et al. Discordant protein and mRNA expression in lung
adenocarcinomas. Mol Cell Proteomics 1, 304-13 (2002).
9.
Cho, R. J. et al. A genome-wide transcriptional analysis of the mitotic cell cycle.
Mol Cell 2, 65-73. (1998).
10.
Coghlan, A. & Wolfe, K. H. Relationship of codon bias to mRNA concentration
and protein length in Saccharomyces cerevisiae. Yeast 16, 1131-45 (2000).
11.
Futcher, B., Latter, G. I., Monardo, P., McLaughlin, C. S. & Garrels, J. I. A
sampling of the yeast proteome. Mol Cell Biol 19, 7357-68 (1999).
12.
Gerner, C. et al. Concomitant determination of absolute values of cellular protein
amounts, synthesis rates, and turnover rates by quantitative proteome profiling.
Mol Cell Proteomics 1, 528-37 (2002).
13.
Gharbi, S. et al. Evaluation of two-dimensional differential gel electrophoresis for
proteomic expression analysis of a model breast cancer cell system. Mol Cell
Proteomics 1, 91-8. (2002).
14.
Glickman, M. H. & Ciechanover, A. The ubiquitin-proteasome proteolytic
pathway: destruction for the sake of construction. Physiol Rev 82, 373-428 (2002).
15.
Golub, T. R. et al. Molecular classification of cancer: class discovery and class
prediction by gene expression monitoring. Science 286, 531-7 (1999).
16.
Greenbaum, D., Jansen, R. & Gerstein, M. Analysis of mRNA expression and
protein abundance data: an approach for the comparison of the enrichment of
features in the cellular population of proteins and transcripts. Bioinformatics 18,
585-96 (2002).
17.
Gygi, S. P., Corthals, G. L., Zhang, Y., Rochon, Y. & Aebersold, R. Evaluation of
two-dimensional gel electrophoresis-based proteome analysis technology. Proc
Natl Acad Sci U S A 97, 9390-5. (2000).
18.
Gygi, S. P., Rochon, Y., Franza, B. R. & Aebersold, R. Correlation between
protein and mRNA abundance in yeast. Mol Cell Biol 19, 1720-30. (1999).
19.
Han, D. K., Eng, J., Zhou, H. & Aebersold, R. Quantitative profiling of
differentiation-induced microsomal proteins using isotope-coded affinity tags and
mass spectrometry. Nat Biotechnol 19, 946-51. (2001).
20.
Hatzimanikatis, V., Choe, L. H. & Lee, K. H. Proteomics: theoretical and
experimental considerations. Biotechnol Prog 15, 312-8 (1999).
21.
Issaq, H. J., Veenstra, T. D., Conrads, T. P. & Felschow, D. The SELDI-TOF MS
approach to proteomics: protein profiling and biomarker identification. Biochem
Biophys Res Commun 292, 587-92. (2002).
22.
Jansen, R., Bussemaker, H. J. & Gerstein, M. Revisiting the codon adaptation
index from a whole-genome perspective: analyzing the relationship between gene
expression and codon occurrence in yeast using a variety of models. Nucleic
Acids Res 31, 2242-51 (2003).
23.
Jansen, R., Greenbaum, D. & Gerstein, M. Relating whole-genome expression
data with protein-protein interactions. Genome Res 12, 37-46. (2002).
24.
Klose, J. Protein mapping by combined isoelectric focusing and electrophoresis of
mouse tissues. A novel approach to testing for induced point mutations in
mammals. Humangenetik 26, 231-43 (1975).
25.
Kumar, A. et al. Subcellular localization of the yeast proteome. Genes Dev 16,
707-19. (2002).
26.
Li, J., Zhang, Z., Rosenzweig, J., Wang, Y. Y. & Chan, D. W. Proteomics and
bioinformatics approaches for identification of serum biomarkers to detect breast
cancer. Clin Chem 48, 1296-304. (2002).
27.
Lian, Z. et al. Genomic and proteomic analysis of the myeloid differentiation
program: global analysis of gene expression during induced differentiation in the
MPRO cell line. Blood 100, 3209-20 (2002).
28.
Lichtinghagen, R. et al. Different mRNA and protein expression of matrix
metalloproteinases 2 and 9 and tissue inhibitor of metalloproteinases 1 in benign
and malignant prostate tissue. Eur Urol 42, 398-406 (2002).
29.
Luscombe, N. M., Greenbaum, D. & Gerstein, M. What is bioinformatics? A
proposed definition and overview of the field. Methods Inf Med 40, 346-58 (2001).
30.
Macgregor, P. F. & Squire, J. A. Application of microarrays to the analysis of
gene expression in cancer. Clin Chem 48, 1170-7. (2002).
31.
McGall, G. H. & Christians, F. C. High-density genechip oligonucleotide probe
arrays. Adv Biochem Eng Biotechnol 77, 21-42 (2002).
32.
Mewes, H. W. et al. MIPS: a database for genomes and protein sequences.
Nucleic Acids Res 30, 31-4. (2002).
33.
O'Farrell, P. H. High resolution two-dimensional electrophoresis of proteins. J
Biol Chem 250, 4007-21 (1975).
34.
Orntoft, T. F., Thykjaer, T., Waldman, F. M., Wolf, H. & Celis, J. E. Genome-wide study of gene copy numbers, transcripts, and protein levels in pairs of non-invasive and invasive human transitional cell carcinomas. Mol Cell Proteomics 1,
37-45 (2002).
35.
Peng, J., Elias, J. E., Thoreen, C. C., Licklider, L. J. & Gygi, S. P. Evaluation of
multidimensional chromatography coupled with tandem mass spectrometry
(LC/LC-MS/MS) for large-scale protein analysis: the yeast proteome. J Proteome
Res 2, 43-50. (2003).
36.
Petricoin, E. F. et al. Use of proteomic patterns in serum to identify ovarian
cancer. Lancet 359, 572-7. (2002).
37.
Pratt, J. M. et al. Dynamics of protein turnover, a missing dimension in
proteomics. Mol Cell Proteomics 1, 579-91 (2002).
38.
Qian, J., Kluger, Y., Yu, H. & Gerstein, M. Identification and correction of
spurious spatial correlations in microarray data. Biotechniques 35, 42-4, 46, 48
(2003).
39.
Schena, M. et al. Microarrays: biotechnology's discovery platform for functional
genomics. Trends Biotechnol 16, 301-6. (1998).
40.
Serikawa, K. A. et al. The Transcriptome and Its Translation during Recovery
from Cell Cycle Arrest in Saccharomyces cerevisiae. Mol Cell Proteomics 2, 191-204 (2003).
41.
Sharp, P. M. & Li, W. H. The codon Adaptation Index--a measure of directional
synonymous codon usage bias, and its potential applications. Nucleic Acids Res
15, 1281-95 (1987).
42.
Szallasi, Z. Genetic network analysis in light of massively parallel biological data
acquisition. Pac Symp Biocomput, 5-16. (1999).
43.
Tonge, R. et al. Validation and development of fluorescence two-dimensional
differential gel electrophoresis proteomics technology. Proteomics 1, 377-96.
(2001).
44.
Washburn, M. P. et al. Protein pathway and complex clustering of correlated
mRNA and protein expression analyses in Saccharomyces cerevisiae. Proc Natl
Acad Sci U S A 100, 3107-12. (2003).
45.
Washburn, M. P., Ulaszek, R., Deciu, C., Schieltz, D. M. & Yates, J. R., 3rd.
Analysis of quantitative proteomic data generated via multidimensional protein
identification technology. Anal Chem 74, 1650-7. (2002).
46.
Washburn, M. P., Wolters, D. & Yates, J. R., 3rd. Large-scale analysis of the
yeast proteome by multidimensional protein identification technology. Nat
Biotechnol 19, 242-7. (2001).
47.
Wolters, D. A., Washburn, M. P. & Yates, J. R., 3rd. An automated
multidimensional protein identification technology for shotgun proteomics. Anal
Chem 73, 5683-90. (2001).
48.
Wu, B. et al. Comparison of statistical methods for classification of ovarian
cancer using mass spectrometry data. Bioinformatics 19, 1636-43 (2003).
49.
Zhou, G. et al. 2D differential in-gel electrophoresis for the identification of
esophageal scans cell cancer-specific protein markers. Mol Cell Proteomics 1,
117-24. (2002).
Chapter 3: mRNA Expression and protein-protein
interactions
3.1 Relating whole-genome expression data with protein-protein
interactions
Abstract
I investigate the relationship of protein-protein interactions with mRNA expression levels,
by integrating a variety of data sources for yeast. I focus on known protein complexes
(from the MIPS catalog) that have clearly defined interactions between their subunits. I
find that subunits of the same protein complex show significant co-expression, both in
terms of similarities of absolute mRNA levels and expression profiles -- e.g. I can often
see subunits of a complex having correlated patterns of expression over a time-course. I
classify the yeast protein complexes as either permanent or transient, with permanent
ones being maintained through most cellular conditions. I find that, generally, permanent
complexes, such as the ribosome and proteasome, have a particularly strong relationship
with expression, while transient ones do not. However, I note that several transient
complexes, such as the RNA polymerase II holoenzyme and the replication complex, can
be subdivided into smaller permanent ones, which do have a strong relationship to gene
expression. I also investigated the interactions in aggregated, genome-wide datasets, such
as the comprehensive yeast two-hybrid experiments, and found them to have only a
weak relationship with gene expression, similar to that of transient complexes. (Further
details on genecensus.org/expression/interactions and
bioinfo.mbb.yale.edu/expression/interactions.)
Chapter 3: mRNA Expression and protein-protein interactions
Introduction
Analysis of gene expression data is currently one of the most exciting areas in genomics.
Computationally, it involves clustering and grouping individual expression measurements
and interrelating them to other sources of information, such as phenotypes, functional
classifications, or cellular responses (Brown and Botstein, 1999, Califano et al.,
2000, Gaasterland and Bekiranov, 2000, Golub et al., 1999, Raychaudhuri et al.,
2001, Subrahmanyam et al., 2001). In particular, functional assignment of
uncharacterized genes can take place through transferring the annotation from a
characterized gene (gathered from databases such as MIPS or GO (Ashburner et al.,
2000, Mewes et al., 2000)) to an uncharacterized gene when their expression profiles are
strongly related by a similarity criterion (such as the correlation coefficient). While this
procedure is usually not sufficient to unambiguously determine the function of an
uncharacterized gene, it can be the starting point (e.g. in target selection) for further
genetic experiments, functional characterization, or high-throughput proteomic analysis
(Christendat et al., 2000, Christendat et al., 2000, Eisenberg et al., 2000, Emili and Cagney,
2000, Gerstein and Jansen, 2000, Luscombe et al., 1998, Westhead et al., 1999).
An important component of functional annotation is characterizing protein interactions as
these often circumscribe (or effectively define) protein function. Moreover, protein
interactions can often be described more precisely than protein functions. Thus, rather
than directly dealing with the general relationship between protein function and expression, I look here at a sub-problem: the relationship between mRNA expression and protein-protein interactions, especially those in protein complexes. A priori it seems reasonable that there should be a well-defined relationship between the expression levels of the
subunits in a complex: since the functionality of many complexes hinges on the presence
of all the subunits, a haphazard and independent expression of any one subunit would be
energetically costly. For instance, the components of the ribosome are regulated in a
complex way but there is usually agreement that they should be present in equimolar
amounts, although this has not yet been measured directly (Li et al., 1999, Nomura,
1999, Planta, 1997, Woolford and Warner, 1991).
I investigate this relationship for many of the known protein complexes in a comprehensive, global fashion by interrelating many of the yeast datasets for protein interactions
and expression. The diversity and number of yeast experiments provide high-quality data
under varied conditions. Additionally, I investigate the relationship between other types
of protein-protein interactions (e.g. aggregated physical and genetic interactions) and
mRNA expression. My work follows up on many recent analyses of protein-protein
interactions (Fellenberg et al., 2000, Hishigaki et al., 2001, Teichmann et al.,
2001, Walhout and Vidal, 2001).
In general, my goal was to integrate and cross-correlate already existing data from different sources and find general trends in it. This is an exploratory study prior to any type of
prediction. In a sense, this study can be understood as an exploration of the knowledge
already implicit in the current data but not yet obvious because it has not previously
been integrated in this way.
Results
In my survey of existing data, I have used two different approaches to analyze the two
different types of expression data available: the computation of normalized differences
for absolute expression levels and a more standard analysis of the correlation of profiles
of relative expression levels (expression ratios). I explain these two approaches in more
detail in the following two sections.
Calculation of normalized differences between absolute expression
levels
In order to compare absolute mRNA expression levels between subunits of a protein
complex, I define the normalized difference Dij as follows:
$$D_{ij} = \frac{|E_i - E_j|}{E_i + E_j} \qquad [1]$$
where Ei and Ej are the mRNA expression levels of subunits i and j. This quantity defines the difference as a fraction of the sum of the expression levels, thus allowing for a
comparison of gene pairs of both high and low expression. Values for the normalized
difference range from 0 to 1.
For a group of N proteins in a complex I generally compute the normalized difference not
only for the pairs that are in direct physical contact, but for all (N^2 - N)/2 theoretically
possible pairs, thus arriving at a distribution of normalized differences of these pairs for
each complex. I can then investigate this distribution of normalized differences and
compare it with those among randomly chosen proteins. In the following discussion I
often refer to the median of the (N^2 - N)/2 protein pairs as a key summarizing statistic.
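A minimal sketch of this calculation is given below. It assumes that equation [1] is the absolute difference of the two expression levels divided by their sum (which is consistent with the stated range of 0 to 1), and it uses hypothetical expression levels for a four-subunit complex. The function names are illustrative.

```python
from itertools import combinations
from statistics import median

def normalized_difference(e_i, e_j):
    """Equation [1], assumed form: |E_i - E_j| / (E_i + E_j)."""
    return abs(e_i - e_j) / (e_i + e_j)

def pairwise_normalized_differences(levels):
    """Normalized differences for all (N^2 - N)/2 unordered pairs of subunits."""
    return [normalized_difference(a, b) for a, b in combinations(levels, 2)]

# Hypothetical absolute mRNA levels for the subunits of one complex.
complex_levels = [10.0, 12.0, 8.0, 11.0]
diffs = pairwise_normalized_differences(complex_levels)
summary = median(diffs)  # the key summarizing statistic for the complex
```

For N = 4 subunits this gives (16 - 4)/2 = 6 pairs. A median well below 0.5 would indicate that the subunits have more similar expression levels than randomly chosen pairs.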
In general, I assume stoichiometric ratios of 1:1 between subunits, although equation [1]
could be adjusted to account for other ratios. But even then, as shown in the Methods
section below, I would not expect this quantity to always be close to zero due to the
relationship between mRNA and protein and also the noise in the expression data.
It should also be noted that there are obviously many limitations in treating GeneChip
and SAGE data as absolute measurements of mRNA expression (Schadt et al., 2000).
In order to judge the statistical significance of normalized differences for particular
groups of proteins I compare them to the control distribution of randomly chosen protein
pairs (see figure 1). An interesting theoretical aspect in this context is that if Ei and Ej are
random variables with an exponential distribution (which is a close approximation to the
actual distribution of expression levels in the reference expression set), then Dij is
distributed uniformly between 0 and 1 (Pitman, 1993). This explains why I can observe a
nearly uniform distribution of normalized differences for randomly selected pairs of
proteins (see figure 1).
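The uniformity of the control distribution can be checked numerically. The sketch below draws pairs of independent exponential variables and bins their normalized differences, assuming the form |Ei - Ej| / (Ei + Ej) for equation [1]. The sample size, seed and rate are arbitrary choices for illustration and have nothing to do with the actual reference expression set.

```python
import random
from statistics import median

def simulate_normalized_differences(n_pairs, seed=0, rate=1.0):
    """Normalized differences for pairs of independent exponential variables."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_pairs):
        e_i = rng.expovariate(rate)
        e_j = rng.expovariate(rate)
        values.append(abs(e_i - e_j) / (e_i + e_j))
    return values

samples = simulate_normalized_differences(20000)

# Fraction of samples falling into each of ten equal-width bins on [0, 1].
bins = [0] * 10
for d in samples:
    bins[min(int(d * 10), 9)] += 1
fractions = [count / len(samples) for count in bins]
sample_median = median(samples)
```

If the distribution is uniform, each bin should hold about one tenth of the samples and the median should be close to 0.5, which is the control value used for comparison with the complexes.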
Correlation of expression profiles for relative expression levels
Analysis of expression profiles may be more useful than that of absolute levels for
characterizing interacting proteins that exist in unequal but stoichiometrically related
amounts (e.g., 3:1) as it refers to the relative shape of expression profiles. It can be carried out on data from cDNA microarrays (such as the Rosetta data) because only relative
rather than absolute expression levels are necessary. Specifically, I look at the distribution of Pearson correlation coefficients for pairs of genes as the measure of similarity.
(Other measures of similarity are possible as well (D'haeseleer, 1997, Heyer et al.,
1999, Qian et al., 2001, Weaver et al., 1997).)
As the input for my procedure I use the expression vectors or profiles of all the subunits
of a complex and then compute their pair-wise correlations. As with the normalized
difference, I compute the correlation coefficients for all protein pairs in a complex, thus
obtaining a distribution of correlation coefficients. If the complex consists of N subunits,
this yields (N^2 - N)/2 different combinations of protein pairs and thus correlation
coefficients. To summarize these distributions, I calculate the “average correlation” (by
which I mean the average of all pair-wise correlations within a complex). As a suitable
control to assess statistical significance, I use the distributions of correlation coefficients
for random groups of proteins and their averages (see methods).
I would expect correlations close to 1 for subunits in a tight complex. However, as I show in the
Methods section this will not be exactly the case due to the relationship between mRNA
and protein abundances.
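The "average correlation" and its random control can be sketched as follows. This is an illustration under stated assumptions: the profiles are hypothetical, the control simply draws random groups of the same size from a background set, and the actual control procedure is the one described in the Methods section. All function names are illustrative.

```python
import math
import random
from itertools import combinations
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def average_correlation(profiles):
    """Mean of the pair-wise correlations over all pairs of profiles in a group."""
    return mean([pearson(a, b) for a, b in combinations(profiles, 2)])

def random_group_averages(background, group_size, n_groups, seed=0):
    """Average correlations for randomly drawn groups, used as a control."""
    rng = random.Random(seed)
    return [average_correlation(rng.sample(background, group_size))
            for _ in range(n_groups)]

# Hypothetical background of 50 random 8-point profiles (made-up data).
rng = random.Random(1)
background = [[rng.random() for _ in range(8)] for _ in range(50)]
control = random_group_averages(background, group_size=5, n_groups=200)

# A hypothetical "complex" whose subunits share the same profile shape.
tight_complex = [[1, 2, 3, 4], [2, 4, 6, 8], [3, 6, 9, 12]]
observed = average_correlation(tight_complex)
```

The observed average for a complex can then be compared with the spread of the control values, for example by counting how many random groups reach an average at least as high.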
Specific complexes
I first outline some results obtained for specific protein complexes, then I proceed to a
more general overview of complexes.
Ribosome
It has long been known that the mRNA expression levels of the ribosomal proteins are
strongly correlated with one another (Johannes et al., 1999).
Figure 1 shows the observed distribution of normalized differences for protein pairs in the large subunit of
the cytoplasmic ribosome. The median of this distribution is 0.23, much lower than the
median of 0.5 for randomly selected protein pairs. While there is a wide range of normalized differences (which may partially result from the fact that many proteins in the ribosome
are known not to be expressed in a 1:1 ratio (Kruiswijk et al., 1978)), the ribosomal
distribution is clearly skewed towards zero. Distributions of the correlation coefficients
for protein pairs within the large ribosomal subunit are shown in figure 2. For both the
cell cycle and the Rosetta data the correlations tend to be much higher than the random
control.
Similar observations can be made for the proteins in the small subunit of the cytoplasmic ribosome.
Key statistics are summarized in figure 3 in comparison to those for other protein complexes. Furthermore, the two separate ribosome particles are strongly co-regulated. In
fact, the large and the small ribosomal particles cannot be differentiated by my measures
of expression similarity.
Proteasome
A second example of a complex whose individual subunits are strongly co-regulated is
the proteasome, which is involved in protein degradation and responsible for the rapid
breakdown of ubiquitinated proteins. Like the ribosome, the 26S proteasome can be divided into two sub-particles: the 20S and the 19S (or 19S/22S regulatory particle). The
20S particle is present as a dimer in the center of the complex structure and contains the
catalytic core, whereas two 19S particles are attached to both ends of the 20S particle
dimer (Coux et al., 1996, Wilkinson et al., 1999).
The distribution of the normalized differences for all possible protein pairs in the 20S
proteasome is shown in figure 1. Like the ribosome, it is clearly skewed towards zero,
compared to the control, with a median of 0.29. Figure 2 shows the distribution of
correlation coefficients, which is strongly shifted to the right of the control, though to a
lesser extent than that for the ribosome. An investigation of the crystal structure of the 20S
particle (Whitby et al., 2000) did not reveal any relationship with the gene expression
differences (e.g. proteins with slightly more random correlations tending to be more on
the surface of the particle).
Similar results can be observed for the 19S particle of the proteasome (figure 3A). Also,
in terms of both measures of co-expression (normalized differences and correlation of
expression profiles) the 19S and the 20S particles of the proteasome form a single unit
that is difficult to separate by gene expression analysis. Part of the reason for this may be
that the common classification into 19S and 20S particles is based on the purification
procedure for the proteasome (Hochstrasser, 2001) and thus does not necessarily reflect
functional or biochemical properties in a direct way.
One subunit, Doa4p, exhibits a very low average correlation (-0.02). Biochemical studies
have previously shown that not all proteasomes have Doa4p bound and that the Doa4p-proteasome interaction is more likely to be transitory (Papa et al., 1999, Papa and
Hochstrasser, 1993).
RNA polymerase II holoenzyme
I have shown above that the ribosome and proteasome can be regarded as strongly associated and co-regulated multi-particle complexes. However, in some cases a complex
contains more loosely associated components. An example is the RNA polymerase II
holoenzyme, which contains the core RNA polymerase II together with the more loosely
associated SRB complex (Kornberg's mediator) and other smaller components (such as
the SWI/SNF complex and the TAFIIs).
It is known that, unlike the RNA polymerase II core enzyme, the SRB complex and the
other holoenzyme components are only needed for the transcription of a fraction of genes
(Holstege et al., 1998). In other words, the holoenzyme is an example of a complex of
transitory nature with a permanent core.
This permanent-and-transitory structure is
clearly evident in the gene expression analysis. For the core enzyme, the average correlations in both the cell cycle and Rosetta data sets are significantly higher than for the random control (Figure 3). However, for the SRB complex and a variety of other, smaller
components (e.g. the TAFIIs) the average correlations are virtually indistinguishable
from the random control.
Replication complex
Another example of a transient complex is the replication complex, which binds to DNA
and is needed for the initiation of replication. The replication complex can be subdivided
into a number of sub-components: the MCM proteins, the origin recognition complex and
two DNA polymerases (Aparicio et al., 1997).
As a whole, the replication complex exhibits a low average correlation not significantly
different from that of the random control (figures 3 and 4). However, figure 4 shows how
the entire complex breaks into subcomponents in terms of correlations in the cell-cycle
experiment. The individual correlations for each of the subcomponents are much higher
than that of the complex as a whole. This indicates that the replication complex is composed of independent units in terms of expression regulation. Using the permanent-transient terminology, each subcomponent behaves similarly to an independent permanent
complex, whereas the replication complex as a whole can be characterized as transient.
The permanent sub-components can be seen to come together to form a transient
functional entity. (Note, this effect is more evident in the cell cycle experiment than the
Rosetta data, as it should only be observable in a synchronized population of cells, not
those averaged across the cell cycle.)
Complexes in general: permanent vs. transient
In discussing the specific examples above, I have found the permanent or transient nature
of the association to be an important feature. This distinction is, in fact, valuable in a
more general context. As shown in figure 3, I have a priori formalized a division between "permanent" complexes, which are maintained throughout the cell cycle and most
cellular conditions, and "transient" ones, which I define here as a group of proteins that
do not consistently maintain their interactions. That is, the existence of a transient complex is temporal and specific to a part of the cell cycle or a subset of cellular states. I am
aware that the division into the two absolute categories "permanent" and "transient" is
perhaps somewhat oversimplifying as there can be varying degrees and combinations of
these attributes (see Discussion).
In figure 3, I show a general classification of the large MIPS complexes into permanent
and transient classes, together with key statistics (details of the classification method are
given in the caption). I list all complexes with more than 10 subunits (which together
account for ~80% of all the protein-protein interactions in the MIPS complexes), with
smaller complexes listed on my website. Figure 3B shows a graphical representation of
the complex list, synthesizing the correlations for both the Rosetta and cell-cycle experiments
with the normalized differences. It clearly shows that there is a greater tendency
for permanent complexes to have higher average correlations than for transient ones.
Comparing the average correlations in Figure 3A against random controls allows me to
derive P-values for the statistical significance of the correlation. As shown in the figure,
these are less than $10^{-4}$ for most of the permanent complexes. On the other hand, they are
considerably higher, and thus less significant, for transient complexes. The separation
between permanent and transient complexes is also evident in terms of the normalized
difference statistics, although not as strongly.
Aggregated protein-protein interaction sets
From my analysis above it seems reasonable to conclude that there is indeed a strong
relationship between mRNA expression and the protein-protein interactions in “permanent” complexes. This raises the question whether similar observations can be made for
other types of protein-protein interactions. I briefly summarize here the degree to which
the interactions in the aggregated interaction datasets, such as the yeast two-hybrid, are
related to expression.
Figure 1 shows the distribution of normalized differences and figure 2 the distributions of
correlation coefficients between interacting proteins in the aggregated data sets. The
distributions of normalized differences are relatively similar to those of the transient protein complexes. The physical interactions show the smallest median normalized difference while the yeast two-hybrid interactions have a median normalized difference closest
to the random control (~0.5). Figure 2 shows that the correlation distributions for the
aggregated data sets are fairly similar among themselves and only slightly shifted towards
the right of the distribution curve for random protein pairs. This, again, is very similar to
the behavior of transient protein complexes.
Thus, overall, it seems fair to conclude that the aggregated protein-protein interactions
are related to mRNA expression in a similar fashion as the transient protein complexes.
Discussion and conclusion
I have investigated the relationship of protein-protein interactions and mRNA expression
levels, integrating and surveying a variety of data sources for yeast. I have focused my
investigation on the protein interactions within specific complexes.
While I have
demonstrated a strong relationship between expression data and most permanent protein
complexes, this relationship is much weaker for transient protein complexes as well as for
the aggregated sets of protein-protein interactions (i.e. physical, genetic and yeast two-hybrid
interactions).
Issues with permanent-transient classification
My complex classification scheme -- separating most complexes into either permanent or
transient -- while useful cannot account for all complexes in the MIPS database. Some
complexes may not clearly fit into the permanent-transient classification. I list a few of
these as "other" in figure 3. Moreover, the complexes list is a compilation of current
biochemical knowledge and therefore reflects its inherent limitations (sometimes not all
subunits are known or some proteins are mistakenly assigned to a complex).
Of course, even for the complexes that I do classify, the terms "transient" and "permanent" are somewhat of an over-simplification. In particular, my detailed discussions of
the RNA polymerase II holoenzyme and the replication complex above are precisely two
examples where my simplified terminology fails to completely explain the situation since
these complexes are somewhere between fully "transient" and "permanent".
One can think about the distinction between permanent and transient in terms of the
mathematical model introduced in the Methods section. Whenever a complex is formed,
its subunits tend to be expressed at equimolar protein concentrations: $P_i \approx P_j$
and $dP_i/dt \approx dP_j/dt$ (where $P_i$ and $P_j$ are the protein concentrations of two subunits i and
j). If the complex is "permanent", then these conditions should be at least approximately
met. If the complex is "transient", then these conditions can be relaxed in those
situations where the complex is not formed. Some complexes are always
formed ("permanent"), whereas the "transient" complexes are only formed under particular conditions. There can be different degrees of being transient: for instance, complexes
that are formed under 80% of conditions or those that are formed under 20% of conditions. The transient complex formed under 80% of conditions behaves almost like
"permanent" (i.e., 100% of conditions), whereas the transient complex formed only 20%
of the time would be expected to show less significant normalized differences and
correlations.
If one goes as far as to accept the premise that the subunits in a complex should be present at equimolar amounts, then it is perhaps circular reasoning to say that they should
also be co-expressed.
Complexes versus the aggregated interactions: the need for structures
I found it difficult to discern expression-based relationships in the aggregated data sets.
This may be due to the generalized and heterogeneous nature of the aggregated data sets
(e.g. inconsistent physiological conditions, false positives and false negatives). Moreover,
both the aggregated sets and the transient complexes suffer partially from the limited
amount of mRNA expression data as their interactions may occur under particular
physiological conditions that may not be sampled by mRNA expression data. My results,
thus, illustrate the difficulty in drawing general conclusions for the pair-wise interaction
sets and highlight the important role that clearly resolved crystal structures of complexes,
detailing protein interactions between subunits, play in studying protein-protein interactions.
Noise in the expression and interaction data
In general, the interactions in the aggregated datasets exhibited surprisingly little deviation from randomness in terms of the co-expression of interaction pairs. This was most
strongly observed for the yeast two-hybrid data. It is true that, overall, this deviation
from randomness is statistically significant. All the same, the gene expression data and
the aggregated protein interaction data do not reinforce each other strongly and it seems
that the prediction of this type of interaction from expression data would be of little
benefit.
Perhaps the most optimistic view of this situation is that the strong degree of independence of the two types of data makes both of them suitable for use in machine-learning approaches
to characterize genes of unknown function: if they were strongly correlated,
then one type of data could perhaps well replace the other since it represents very similar
information. A negative view would be that the reason for the surprisingly weak relationship between the aggregated interactions and mRNA expression is to be found in
problems with either the expression or the interaction data.
I feel confident that my results are robust to the noise in the expression data for the
following reasons.
With respect to the correlation analysis of expression profiles
roughly the same results (in terms of statistical significances) can be obtained for two
independent data sets (the cell-cycle timecourse and the Rosetta knockout series). The
normalized difference analysis is perhaps more sensitive to problems with the data, in
particular, considering that the measurement of absolute expression levels with gene
chips is problematic to start with. However, I have looked at an integrated dataset from
various chip experiments and the SAGE data, thus averaging out errors to some degree
(see Methods). In addition, for both the correlation and the normalized difference analysis, I have concentrated on the statistical significance of distributions rather than relying
on the error-prone data for individual protein pairs, thus observing more robust, aggregate
trends for whole complexes and groups of proteins.
Part of the aggregated data, in particular the yeast two-hybrid data, represent a relatively
new approach to studying protein-protein interactions and it is interesting to note that it,
obviously, includes some interactions implied by the complexes. However, the degree of
intersection with possible complexes interactions ranges from 35% for the physical
interactions to only approximately 6% for the yeast two-hybrid data (as a fraction of the
number of interactions in the aggregated datasets). This is surprisingly low, given that
the yeast two-hybrid data is from experiments that covered the complete genome (Ito et
al., 2001, Uetz and Hughes, 2000). Independently, Ito et al. (2001) have reported that
only a small fraction of the previous yeast two-hybrid data (Uetz and Hughes, 2000)
overlapped with their own yeast two-hybrid results.
(Although Ito and colleagues
assumed that their core data was similar in quality to the Uetz data, the fraction of
interactions present in both datasets was only 16.8% for the Ito core and 20.4% for the
Uetz data.)
mRNA vs. protein expression
The co-regulation of subunits in a protein complex should be primarily observable in
terms of protein abundance and only indirectly in terms of mRNA expression. Several
recent studies have attempted to investigate the relationship between mRNA and protein
expression levels in yeast cells and found them to be correlated to various
degrees (Anderson and Seilhamer, 1997, Futcher et al., 1999, Greenbaum et al., 2002, Gygi
et al., 1999, Lian et al., 2001). Generally, post-transcriptional regulation is more difficult
to investigate given the sparse data resources currently available for protein abundance
levels. It is possible that in some situations co-regulation occurs mostly on the protein
level, almost independent of cellular mRNA levels.
Particularly, those permanent
complexes that do not have high levels of correlation in my analysis may be indicative of
translational or post-translational control and could be a starting point for further
experimental investigation. See also the Methods section for further discussion.
(Additional information can be found at genecensus.org/expression/interactions and
bioinfo.mbb.yale.edu/expression/interactions.)
Methods
Interactions data sources
The primary focus of this paper is the interactions occurring within specific complexes.
These were obtained from the MIPS complexes catalog (Fellenberg et al., 2000), which
represents a carefully annotated, comprehensive dataset of protein complexes culled from
the scientific literature. In addition, I looked at other types of protein-protein interactions
from large "aggregated" datasets collecting many heterogeneous pair-wise interactions. I
collected these from the MIPS catalogs of physical and genetic interactions (Fellenberg et
al., 2000), databases of interacting proteins (DIP and BIND) (Bader and Hogue,
2000, Xenarios et al., 2000), and a comprehensive collection of yeast two-hybrid experiments (Y2H) (Cagney et al., 2000, Ito et al., 2001, Ito et al., 2000, Schwikowski et al.,
2000, Uetz et al., 2000, Uetz and Hughes, 2000). These interactions are subdivided into
groups based on their method of discovery. They include physical interactions (e.g., collected through co-immunoprecipitation and co-purification), genetic interactions (e.g.,
determined through genetic means such as synthetic lethality or suppression experiments),
and yeast two-hybrid pairs.
Expression data sources
I included two different types of expression measurements in my analysis: absolute
expression levels in vegetative yeast cells as determined by SAGE or gene chip experiments, and profiles of ratio-type expression data from microarray experiments. For the
first type, I use a comprehensive reference set, which I merged and scaled together from a
variety of Affymetrix GeneChip and SAGE datasets (Holstege et al., 1998, Jelinsky and
Samson, 1999, Roth et al., 1998, Velculescu et al., 1997) into a single representative data
source (scaling details on my website; Greenbaum et al., 2002). For the expression
profiles, I focused on two different datasets: a cell cycle experiment (Cho et al., 1998)
and the Rosetta yeast compendium (Hughes et al., 2000). The two datasets provide a
fairly good sampling of the possible cellular states of yeast and represent different experimental methodologies.
The cell-cycle data contains expression profiles obtained from
synchronized cells over the course of two cell cycles, whereas the Rosetta data contains
genome-wide expression ratios for 300 stationary cell states, which are derived from 280
gene deletions and the 20 drug interaction experiments.
Efficient calculation of the average correlations
For two expression ratio profiles $X_i$ and $X_j$ (transformed to average 0 and standard deviation 1), the Pearson correlation coefficient $\rho_{ij}$ is given by the dot product:
$$\rho_{ij} = \frac{1}{M-1}\, X_i \cdot X_j,$$
where M is the number of elements in the profiles $X_i$ and $X_j$. The profile X can be computed as a ‘Z-score’ from the measured expression ratio profile x, through the relation
$$X_k = \frac{x_k - \bar{x}}{\sigma_x},$$
where $\bar{x}$ denotes the average and $\sigma_x$ the standard deviation of values in x,
and $X_k$ and $x_k$ are the kth components of their respective profiles.
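As a minimal sketch of the relation above, the following Python fragment computes the Z-score transform and the Pearson correlation as a scaled dot product. The function names and the three short profiles are hypothetical and serve only as an illustration; the sample standard deviation (divisor M - 1) is assumed so that it matches the 1/(M - 1) factor in the dot-product formula.

```python
import math

def zscore(profile):
    """Transform a profile to mean 0 and sample standard deviation 1."""
    m = len(profile)
    mean = sum(profile) / m
    # Assumption: sample standard deviation (divisor M - 1), to match 1/(M - 1) below.
    sd = math.sqrt(sum((v - mean) ** 2 for v in profile) / (m - 1))
    return [(v - mean) / sd for v in profile]

def pearson(x_i, x_j):
    """Pearson correlation as the scaled dot product of two Z-score profiles."""
    z_i, z_j = zscore(x_i), zscore(x_j)
    m = len(z_i)
    return sum(a * b for a, b in zip(z_i, z_j)) / (m - 1)

# Hypothetical expression-ratio profiles, for illustration only.
gene_a = [0.1, 0.4, 0.9, 0.3, -0.2]
gene_b = [2 * v for v in gene_a]    # perfectly co-varying with gene_a
gene_c = [-1 * v for v in gene_a]   # perfectly anti-correlated with gene_a

r_ab = pearson(gene_a, gene_b)
r_ac = pearson(gene_a, gene_c)
```

Because the Z-score removes both offset and scale, any profile that is a positive multiple of another yields a correlation of one, and a negative multiple yields minus one.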
Given a group of N genes I can compute the correlation coefficient matrix R, where each
element $R_{ij} = \rho_{ij}$ of the matrix denotes the Pearson correlation coefficient between genes i and j.
I can then compute the average correlation coefficient $\bar{\rho}$ by averaging the matrix
elements (excluding the main diagonal).
This statistic gives an idea of the overall
similarity of the expression profiles in a group of genes. Although there are $O(N^2)$ elements in R, the computation time for $\bar{\rho}$ can be kept proportional to $O(N)$ by using the
linearity of the correlation to calculate $\bar{\rho}$ as follows:
$$\bar{\rho} = \frac{1}{N^2 - N}\left(\sum_{i,j} R_{ij} - N\right) = \frac{1}{N^2 - N}\left(\frac{1}{M-1}\, X_T \cdot X_T - N\right),$$
where $X_T = \sum_{n=1}^{N} X_n$ is the sum of all expression profiles in the group of N genes.
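The identity above can be checked numerically. The sketch below, with hypothetical function names and made-up profiles, computes the average correlation once by looping over all off-diagonal pairs and once from the summed profile $X_T$; it is an illustration under these assumptions rather than the implementation used for the analysis.

```python
import math

def zscore(profile):
    """Transform a profile to mean 0 and sample standard deviation 1."""
    m = len(profile)
    mean = sum(profile) / m
    sd = math.sqrt(sum((v - mean) ** 2 for v in profile) / (m - 1))
    return [(v - mean) / sd for v in profile]

def average_correlation_naive(profiles):
    """Average of all off-diagonal pairwise correlations: O(N^2) dot products."""
    z = [zscore(p) for p in profiles]
    n, m = len(z), len(z[0])
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += sum(a * b for a, b in zip(z[i], z[j])) / (m - 1)
    return total / (n * n - n)

def average_correlation_fast(profiles):
    """Same statistic from the summed profile X_T: a single pass over N profiles."""
    z = [zscore(p) for p in profiles]
    n, m = len(z), len(z[0])
    x_t = [sum(column) for column in zip(*z)]  # X_T, element-wise sum of Z-scores
    return (sum(v * v for v in x_t) / (m - 1) - n) / (n * n - n)

# Hypothetical expression-ratio profiles for a group of N = 4 genes.
profiles = [
    [0.1, 0.5, 1.2, 0.7, -0.3, -0.8],
    [0.0, 0.6, 1.0, 0.9, -0.1, -0.9],
    [0.3, -0.2, 0.4, -0.6, 0.8, 0.1],
    [-0.5, 0.2, 0.9, 0.4, 0.0, -0.7],
]

naive = average_correlation_naive(profiles)
fast = average_correlation_fast(profiles)
```

The two results agree up to floating-point rounding. The subtraction of N removes the diagonal terms, each of which equals one for Z-scored profiles.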
Kinetic model of the relationship between protein and mRNA
concentration
For a protein complex that is perfectly co-regulated I can assume that its components are
present at equimolar amounts and change similarly over time. So for the protein concentrations $P_i$ and $P_j$ of two different subunits i and j I would get: $P_i \approx P_j$ and
$dP_i/dt \approx dP_j/dt$. Using a simple model for the relationship between mRNA and protein
concentrations, I can see how even under these ideal conditions similarity measures based
on the mRNA concentrations would deviate from perfect results. For instance, a linear
kinetic model for the protein concentration $P_i$ and the mRNA concentration $R_i$ of a
subunit i in a complex is given by:
$$\frac{dP_i}{dt} = k_{Ri} R_i - k_{Pi} P_i$$
where $k_{Ri}$ is an mRNA translation rate constant and $k_{Pi}$ is a protein degradation constant.
Why expression profile correlations have to be less than one
For two subunits in a complex with $P_i \approx P_j \approx P$ and $dP_i/dt \approx dP_j/dt$, I can deduce:
$$k_{Ri} R_i(t) - k_{Rj} R_j(t) \approx (k_{Pi} - k_{Pj})\, P(t)$$
It is clear that only under the strong assumption that the two protein degradation constants are equal ($k_{Pi} = k_{Pj}$)
$$\frac{R_i(t)}{R_j(t)} \approx \frac{k_{Rj}}{k_{Ri}} = \text{const}$$
from which would follow corr($R_i$, $R_j$) = 1. Otherwise, corr($R_i$, $R_j$) < 1.
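A small numerical illustration of this argument, as a sketch under stated assumptions: both subunits are given the same hypothetical protein time course P(t) = 5 + sin(t), and the mRNA level each subunit would need is obtained by solving the linear kinetic model for R. All rate constants and the time course are invented for illustration.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def mrna_profile(k_r, k_p, times):
    """mRNA level implied by dP/dt = k_r*R - k_p*P for the assumed P(t) = 5 + sin(t)."""
    return [(math.cos(t) + k_p * (5.0 + math.sin(t))) / k_r for t in times]

# One full period of the hypothetical protein oscillation.
times = [2 * math.pi * n / 200 for n in range(200)]

r_i = mrna_profile(k_r=1.0, k_p=0.5, times=times)
# Same degradation constant, different translation constant: profiles are proportional.
r_j_same_kp = mrna_profile(k_r=3.0, k_p=0.5, times=times)
# Different degradation constant: the mRNA profiles are no longer proportional.
r_j_diff_kp = mrna_profile(k_r=3.0, k_p=2.0, times=times)

corr_same_kp = pearson(r_i, r_j_same_kp)
corr_diff_kp = pearson(r_i, r_j_diff_kp)
```

With equal degradation constants the correlation is one regardless of the translation constants, whereas unequal degradation constants shift the phase of the required mRNA profile and pull the correlation below one, even though the two protein profiles are identical.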
Why normalized differences are greater than zero
Furthermore, assuming steady-state (that is, $dP_i/dt \approx dP_j/dt \approx 0$), I can deduce the
following relationship between the mRNA levels of two complex
subunits:
$$R_i \approx \frac{k_{Rj}\, k_{Pi}}{k_{Pj}\, k_{Ri}}\, R_j$$
Thus, the two mRNA expression levels are only expected to be equal if the ratios of the
rate constants for translation and degradation are the same for both proteins. This is not
necessarily the case for the subunits of a complex and therefore normalized differences
should not be expected to be zero.
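The steady-state relationship can be illustrated with a short sketch. The rate constants and protein level are hypothetical, and the form of the normalized difference used here, |E_i - E_j| / (E_i + E_j), is an assumption made for this example: this section states only that the measure lies between zero and one.

```python
def steady_state_mrna(protein, k_r, k_p):
    """mRNA level R that satisfies 0 = k_r*R - k_p*P."""
    return k_p * protein / k_r

def normalized_difference(e_i, e_j):
    """ASSUMED form |E_i - E_j| / (E_i + E_j): 0 for equal levels, approaching 1 for very unequal ones."""
    return abs(e_i - e_j) / (e_i + e_j)

protein = 100.0  # hypothetical equimolar protein level shared by both subunits

# Subunits i and j differ in translation rate constant only.
r_i = steady_state_mrna(protein, k_r=2.0, k_p=0.1)
r_j = steady_state_mrna(protein, k_r=4.0, k_p=0.1)
d_unmatched = normalized_difference(r_i, r_j)

# If k_p / k_r is the same for both subunits, the mRNA levels coincide.
r_j_matched = steady_state_mrna(protein, k_r=4.0, k_p=0.2)
d_matched = normalized_difference(r_i, r_j_matched)
```

Even at perfectly equimolar protein levels, a twofold difference in translation rate constant alone gives a clearly nonzero normalized difference under this assumed definition.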
It is clear that the arguments above are based on a variety of simplifying assumptions. In
reality, there are additional factors (such as the noise in the expression data, the stochastic
nature of gene expression) that add even more difficulty to the analysis of mRNA levels.
Acknowledgments
MG acknowledges support by the Keck Foundation. RJ is supported by an IBM PhD
Fellowship. The authors wish to thank Mark Hochstrasser and Jiang Qian for stimulating
discussions.
Figures
3.1 Relating whole-genome expression data with protein-protein interactions
Figure 1
Distributions of normalized differences for various groups of proteins in
boxplot representation. The normalized difference $D_{ij}$ is a measure of the
relative similarity of two absolute gene expression levels $E_i$ and $E_j$. The middle panel
shows the distribution for two protein complexes (the large ribosomal subunit and the
20S proteasome). Note that I considered all theoretically possible protein pairs within the
protein complex (as indicated in the schematic drawing above the panel). The right panel
shows the distribution for the aggregated datasets of protein-protein interactions (Y2H is
yeast two-hybrid) (Bader and Hogue, 2000, Cagney et al., 2000, Fellenberg et al., 2000, Ito
et al., 2001, Ito et al., 2000, Schwikowski et al., 2000, Uetz et al., 2000, Uetz and Hughes,
2000, Xenarios et al., 2000). Unlike in the complexes, where I consider interactions
among a whole group of proteins, the interactions in the aggregated datasets are specific
to individual protein pairs (see schematic drawing). The left panel shows two control
distributions of the normalized difference, on the left for pairs of nuclear and cytoplasmic
proteins -- which presumably, because of spatial separation, do not interact -- and on the
right for any random protein pair ("all transcripts") in yeast. The distribution of nuclear
versus cytoplasmic proteins is strongly skewed towards one (the maximum value of the
normalized difference), which is partially explained by the fact that cytoplasmic proteins
tend to have higher expression levels than nuclear ones (Drawid and Gerstein, 2000).
The distribution of all transcripts is nearly uniform (with a median of 0.5) -- see Methods.
The complexes distributions are clearly skewed towards zero with medians between 0.2
and 0.3. The medians of the distributions of the aggregated datasets are still somewhat
smaller than the control median, most notably for the physical interactions dataset; on the
other hand, there is virtually no difference between the control and the distribution of the
yeast two-hybrid dataset.
The aggregated data, obviously, includes some interactions implied by the complexes,
with the degree of intersection ranging from 35% for the physical interactions to approximately 6% for Y2H.
Figure 2
Distributions of correlation coefficients between expression profiles. In part a I
show distributions of the average correlation $\bar{\rho}_N$ of N genes for the cell cycle
experiments. The gray curve in the background represents the case N = 2 (i.e., simply the
distribution of pair-wise correlations). In the case of N > 2, $\bar{\rho}_N$ is defined as the average
of all possible $(N^2 - N)/2$ pairwise correlations among the N genes.
I show here, as
examples, the distributions for N = 3 and N = 5. The distributions obviously become
narrower, reflecting the fact that it becomes more unlikely to find large groups of
strongly correlated genes at random as N increases.
These distributions provide a suitable control for the observed correlations between pairs
of genes (N = 2) or for the average correlations among the subunits of a complex (N > 2).
Ronald Jansen has developed a method to efficiently sample the distribution curves $f(\bar{\rho}_N)$
(see Methods). Based on the distribution function $f(\bar{\rho}_N)$ I can calculate a one-sided
P-value:
$$P(\bar{\rho}_N) = \int_{\bar{\rho}_N}^{1} f(\bar{\rho}'_N)\, d\bar{\rho}'_N$$
This P-value then represents the chance that a group of N randomly selected genes could
exhibit an average correlation greater than or equal to that of a complex with N proteins
(see figure 3).
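One simple way to approximate such a one-sided P-value is plain Monte Carlo sampling of random gene groups, sketched below. This is an illustration under stated assumptions and not necessarily the sampling method referred to above: the background "genome" is synthetic Gaussian noise, the "complex" is a set of five profiles sharing a common signal, and all names, sizes and seeds are hypothetical.

```python
import math
import random

def zscore(profile):
    """Transform a profile to mean 0 and sample standard deviation 1."""
    m = len(profile)
    mean = sum(profile) / m
    sd = math.sqrt(sum((v - mean) ** 2 for v in profile) / (m - 1))
    return [(v - mean) / sd for v in profile]

def average_correlation(z_profiles):
    """Average off-diagonal Pearson correlation via the summed-profile identity."""
    n, m = len(z_profiles), len(z_profiles[0])
    x_t = [sum(column) for column in zip(*z_profiles)]
    return (sum(v * v for v in x_t) / (m - 1) - n) / (n * n - n)

def empirical_p_value(observed, background, group_size, n_samples=2000, seed=0):
    """Fraction of random groups whose average correlation is >= the observed one."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        group = rng.sample(background, group_size)
        if average_correlation(group) >= observed:
            hits += 1
    return hits / n_samples

# Synthetic data, for illustration only.
rng = random.Random(1)
n_conditions = 12
background = [zscore([rng.gauss(0, 1) for _ in range(n_conditions)])
              for _ in range(200)]
signal = [rng.gauss(0, 1) for _ in range(n_conditions)]
complex_profiles = [zscore([s + rng.gauss(0, 0.3) for s in signal])
                    for _ in range(5)]

observed = average_correlation(complex_profiles)
p_value = empirical_p_value(observed, background, group_size=5)
```

With a finite number of samples the estimate cannot resolve values below 1/n_samples, so an estimate of zero is better reported as an upper bound (here, less than 1/2000).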
Parts b and c show the distribution of pair-wise correlations for both the cell cycle and
the Rosetta experiments in two protein complexes (the ribosome and the proteasome) as
well as for the aggregated datasets (genetic, physical and Y2H). The gray curves in the
background are the control distributions for N = 2 as explained above. The distributions
for the ribosome and the proteasome are strongly shifted to the right of the control; this
effect is much weaker for the datasets of aggregated interactions.
Figure 3a
Various key statistics shown in figures 1 and 2 for the ribosome and proteasome
as well as for a large number of protein complexes. I list all protein complexes
from the MIPS catalog having at least 10 ORFs. The complexes are divided into three
classes: permanent, transient or "other" (see below). Some complexes can be divided into
smaller sub-complexes (e.g., the ribosomes) as indicated. The table lists (from left to
right) the average expression level of the complex, the median normalized difference (see
figure 1A), the average correlation for the cell cycle and Rosetta experiments (see figure
2), the negative logarithm of the P-value of the average correlations in both experiments
(see figure 2), and the size of the complex in terms of the number of ORFs.
In general, the P-values for the average correlations are very low for most of the permanent protein complexes (accordingly, -log10(P) is very high), indicating that these averages are significantly greater than for random groups of proteins of the same size. The
same cannot be observed for the transient protein complexes, for which the correlation
averages are usually much smaller.
The section "other" at the bottom of part A contains complexes that are either difficult to
classify as permanent/transient or for which, due to very small turnover rates, down-regulation of mRNA levels takes a very long time to affect protein abundance. The H+-transporting ATPase can be thought of as containing a mixture of permanent and transient components at the same time (Kane, 2001). The nuclear pore complex (NPC) and
the TRAPP complex are known to have low turnover rates (Barrowman et al.,
2000, Bucci and Wente, 1997, Sacher, 2001, Winey et al., 1997). The NPC has relatively
small average correlations, but this still yields P-values of $10^{-3}$ (cell cycle) and $<10^{-4}$
(Rosetta) because the nuclear pore complex is a relatively large aggregation of proteins,
and even these weak average correlations are very unlikely to occur for random groups of
proteins of this size. The TRAPP protein complex, while existing throughout the cell
cycle, has a low turnover rate and as such its mRNA expression data would not be sufficient for my analysis.
The RNA polymerase holoenzyme is composed of both permanent and transient components. Note that the MIPS complexes catalog does not include the SWI/SNF chromatin-remodeling complex and a subset of basal transcription factors (Wilson et al., 1996) as
part of the holoenzyme, thus I list them separately here.
The list does not include those categories from the MIPS complexes catalog that do not
really represent protein complexes per se but rather aggregations of disparate proteins
that are involved in similar types of complex interactions, such as the "actin-associated"
and "tubulin-associated" protein groups.
Figure 3b Graphical representation of part of the protein complex statistics from part a.
The abscissa and ordinate represent the average correlations in the cell cycle and the
Rosetta data, while the bubble sizes are a function of the normalized differences (larger
bubbles represent larger normalized differences). In general, the permanent complexes
tend to be located in the upper right region of the plot, whereas transient complexes are
closer to the random control in the lower left.
Figure 4
Representation of the replication complex and its components. Part a of the
figure shows a representation of the replication complex and its components on the same
Chapter 3: mRNA Expression and protein-protein interactions
155
coordinates as the protein complexes in figure 3B. The transient replication complex can
be decomposed into smaller complexes: the origin recognition complex, the MCM
proteins, and the DNA polymerases  and . Whereas the whole replication complex
exhibits an average correlation close to zero (in both the cell cycle and the Rosetta data),
the four smaller complexes show greater correlations in the cell cycle experiment. The
four sub-complexes behave more like permanent complexes than the replication complex
as a whole.
Part b shows the correlation coefficient matrix for the subunits of the replication complex, derived from the cell cycle data. The upper triangle of the correlation matrix shows
the individual correlation coefficients for particular gene pairs (with darker colors
indicating higher correlations). The lower triangle shows the average correlations for
subgroups of proteins (representing the MCM proteins, the two DNA polymerases, and
the origin recognition complex) within the complex as a whole. The table on the
right side shows, in different colors, which genes belong to which subgroups. The genes
were ordered with unsupervised clustering (average linkage) without regard to their
classification into the three subgroups. This order reflects the
separation into subgroups very well (only the proteins in the two DNA polymerases cannot be separated into two groups). An exception is the CDC45 protein, which
belongs to the MCM proteins but tends to cluster with the DNA polymerases.
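The two triangles of the matrix in part b can be sketched as follows. This is a minimal illustration, not the code behind the figure: the function names and the dictionary-based inputs are hypothetical, the gene labels in any call are placeholders, and the average-linkage ordering step is not reproduced here.

```python
from itertools import combinations
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant profiles."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def correlation_matrix(profiles, order):
    """Full matrix of pairwise correlations, rows and columns in the given gene order
    (the upper triangle corresponds to the individual gene-pair coefficients)."""
    return [[pearson(profiles[g], profiles[h]) for h in order] for g in order]

def subgroup_average_correlations(profiles, subgroups):
    """Average pairwise correlation within each subgroup (each needs at least two genes);
    these are the kind of values summarized in the lower triangle."""
    return {
        name: mean(pearson(profiles[g], profiles[h]) for g, h in combinations(genes, 2))
        for name, genes in subgroups.items()
    }
```

Passing the gene order produced by a clustering step as `order` would reproduce the layout described in the caption, with subgroup members appearing as contiguous blocks when the clustering recovers them.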
References
1.
Anderson, L. & Seilhamer, J. A comparison of selected mRNA and protein
abundances in human liver. Electrophoresis 18, 533-7 (1997).
2.
Aparicio, O. M., Weinstein, D. M. & Bell, S. P. Components and dynamics of
DNA replication complexes in S. cerevisiae: redistribution of MCM proteins and
Cdc45p during S phase. Cell 91, 59-69. (1997).
3.
Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene
Ontology Consortium. Nat Genet 25, 25-9. (2000).
4.
Bader, G. D. & Hogue, C. W. BIND--a data specification for storing and
describing biomolecular interactions, molecular complexes and pathways.
Bioinformatics 16, 465-77. (2000).
5.
Barrowman, J., Sacher, M. & Ferro-Novick, S. TRAPP stably associates with the
Golgi and is required for vesicle docking. Embo J 19, 862-9 (2000).
6.
Brown, P. O. & Botstein, D. Exploring the new world of the genome with DNA
microarrays. Nat Genet 21, 33-7. (1999).
7.
Bucci, M. & Wente, S. R. In vivo dynamics of nuclear pore complexes in yeast. J
Cell Biol 136, 1185-99. (1997).
8.
Cagney, G., Uetz, P. & Fields, S. High-throughput screening for protein-protein
interactions using two-hybrid assay. Methods Enzymol 328, 3-14 (2000).
9.
Califano, A., Stolovitzky, G. & Tu, Y. Analysis of gene expression microarrays
for phenotype classification. Proc Int Conf Intell Syst Mol Biol 8, 75-85 (2000).
10.
Cho, R. J. et al. A genome-wide transcriptional analysis of the mitotic cell cycle.
Mol Cell 2, 65-73. (1998).
11.
Christendat, D. et al. Structural proteomics: prospects for high throughput sample
preparation. Prog Biophys Mol Biol 73, 339-45 (2000).
12.
Christendat, D. et al. Structural proteomics of an archaeon. Nat Struct Biol 7, 903-9. (2000).
13.
Coux, O., Tanaka, K. & Goldberg, A. L. Structure and functions of the 20S and
26S proteasomes. Annu Rev Biochem 65, 801-47 (1996).
14.
D'haeseleer, P., Wen, X., Fuhrman, S. & Somogyi, R. in Plenum (ed. M. Holcombe, P.,
R) 203-212 (1997).
15.
Drawid, A. & Gerstein, M. A Bayesian system integrating expression data with
sequence patterns for localizing proteins: comprehensive application to the yeast
genome. J Mol Biol 301, 1059-75. (2000).
16.
Eisenberg, D., Marcotte, E. M., Xenarios, I. & Yeates, T. O. Protein function in
the post-genomic era. Nature 405, 823-6. (2000).
17.
Emili, A. Q. & Cagney, G. Large-scale functional analysis using peptide or
protein arrays. Nat Biotechnol 18, 393-7. (2000).
18.
Fellenberg, M., Albermann, K., Zollner, A., Mewes, H. W. & Hani, J. Integrative
analysis of protein interaction data. Proc Int Conf Intell Syst Mol Biol 8, 152-61
(2000).
19.
Futcher, B., Latter, G. I., Monardo, P., McLaughlin, C. S. & Garrels, J. I. A
sampling of the yeast proteome. Mol Cell Biol 19, 7357-68 (1999).
20.
Gaasterland, T. & Bekiranov, S. Making the most of microarray data. Nat Genet
24, 204-6. (2000).
21.
Gerstein, M. & Jansen, R. The current excitement in bioinformatics, analysis of
whole-genome expression data: How does it relate to protein structure and
function (In press). Current Opinions in Structural Biology (2000).
22.
Golub, T. R. et al. Molecular classification of cancer: class discovery and class
prediction by gene expression monitoring. Science 286, 531-7 (1999).
23.
Greenbaum, D., Jansen, R. & Gerstein, M. Analysis of mRNA expression and
protein abundance data: an approach for the comparison of the enrichment of
features in the cellular population of proteins and transcripts. Bioinformatics 18,
585-96 (2002).
24.
Gygi, S. P., Rochon, Y., Franza, B. R. & Aebersold, R. Correlation between
protein and mRNA abundance in yeast. Mol Cell Biol 19, 1720-30. (1999).
25.
Heyer, L. J., Kruglyak, S. & Yooseph, S. Exploring expression data: identification
and analysis of coexpressed genes. Genome Res 9, 1106-15. (1999).
26.
Hishigaki, H., Nakai, K., Ono, T., Tanigami, A. & Takagi, T. Assessment of
prediction accuracy of protein function from protein-protein interaction data.
Yeast 18, 523-31. (2001).
27.
Hochstrasser, M. (2001).
28.
Holstege, F. C. et al. Dissecting the regulatory circuitry of a eukaryotic genome.
Cell 95, 717-728 (1998).
29.
Hughes, T. R. et al. Functional discovery via a compendium of expression profiles.
Cell 102, 109-26. (2000).
30.
Ito, T. et al. A comprehensive two-hybrid analysis to explore the yeast protein
interactome. Proc Natl Acad Sci U S A 98, 4569-74. (2001).
31.
Ito, T. et al. Toward a protein-protein interaction map of the budding yeast: A
comprehensive system to examine two-hybrid interactions in all possible
combinations between the yeast proteins. Proc Natl Acad Sci 97, 1143-1147
(2000).
32.
Jelinsky, S. A. & Samson, L. D. Global response of Saccharomyces cerevisiae to
an alkylating agent. Proc Natl Acad Sci U S A 96, 1486-91 (1999).
33.
Johannes, G., Carter, M. S., Eisen, M. B., Brown, P. O. & Sarnow, P.
Identification of eukaryotic mRNAs that are translated at reduced cap binding
complex eIF4F concentrations using a cDNA microarray. Proc Natl Acad Sci U S
A 96, 13118-23. (1999).
34.
Kane, P. (ed. Communication, P.) (2001).
35.
Kruiswijk, T., Planta, R. J. & Mager, W. H. Quantitative analysis of the protein
composition of yeast ribosomes. Eur J Biochem 83, 245-52. (1978).
36.
Li, B., Nierras, C. R. & Warner, J. R. Transcriptional elements involved in the
repression of ribosomal protein synthesis. Mol Cell Biol 19, 5393-404 (1999).
37.
Lian, Z. et al. Genomic and proteomic analysis of the myeloid differentiation
program. Blood 98, 513-24 (2001).
38.
Luscombe, N. M. et al. New tools and resources for analysing protein structures
and their interactions. Acta Crystallogr D Biol Crystallogr 54, 1132-8. (1998).
39.
Mewes, H. W. et al. MIPS: a database for genomes and protein sequences.
Nucleic Acids Res 28, 37-40 (2000).
40.
Nomura, M. Regulation of ribosome biosynthesis in Escherichia coli and
Saccharomyces cerevisiae: diversity and common principles. J Bacteriol 181,
6857-64 (1999).
41.
Papa, F. R., Amerik, A. Y. & Hochstrasser, M. Interaction of the Doa4
deubiquitinating enzyme with the yeast 26S proteasome. Mol Biol Cell 10, 741-56.
(1999).
42.
Papa, F. R. & Hochstrasser, M. The yeast DOA4 gene encodes a deubiquitinating
enzyme related to a product of the human tre-2 oncogene. Nature 366, 313-9.
(1993).
43.
Pitman, J. Probability (Springer-Verlag, New York, 1993).
44.
Planta, R. J. Regulation of ribosome synthesis in yeast. Yeast 13, 1505-18 (1997).
45.
Qian, J., Dolled-Filhart, M., Lin, J., Yu, H. & Gerstein, M. Beyond synexpression
relationships: local clustering of time-shifted and inverted gene expression
profiles identifies new, biologically relevant interactions. J Mol Biol 314, 1053-66
(2001).
46.
Raychaudhuri, S., Sutphin, P. D., Chang, J. T. & Altman, R. B. Basic microarray
analysis: grouping and feature reduction. Trends Biotechnol 19, 189-93. (2001).
47.
Roth, F. P., Hughes, J. D., Estep, P. W. & Church, G. M. Finding DNA regulatory
motifs within unaligned noncoding sequences clustered by whole-genome mRNA
quantitation. Nat Biotechnol 16, 939-45 (1998).
48.
Sacher, M. (ed. Communication, P.) (2001).
49.
Schadt, E. E., Li, C., Su, C. & Wong, W. H. Analyzing high-density
oligonucleotide gene expression array data. J Cell Biochem 80, 192-202. (2000).
50.
Schwikowski, B., Uetz, P. & Fields, S. A network of protein-protein interactions
in yeast. Nat Biotechnol 18, 1257-61. (2000).
51.
Subrahmanyam, Y. V. et al. RNA expression patterns change dramatically in
human neutrophils exposed to bacteria. Blood 97, 2457-68. (2001).
52.
Teichmann, S. A., Murzin, A. G. & Chothia, C. Determination of protein function,
evolution and interactions by structural genomics. Curr Opin Struct Biol 11, 354-63. (2001).
53.
Uetz, P. et al. A comprehensive analysis of protein-protein interactions in
Saccharomyces cerevisiae. Nature 403, 623-7. (2000).
54.
Uetz, P. & Hughes, R. E. Systematic and large-scale two-hybrid screens. Curr
Opin Microbiol 3, 303-8. (2000).
55.
Velculescu, V. E. et al. Characterization of the yeast transcriptome. Cell 88, 243-251 (1997).
56.
Walhout, A. J. & Vidal, M. High-throughput yeast two-hybrid assays for large-scale protein interaction mapping. Methods 24, 297-306. (2001).
57.
Weaver, P. L., Sun, C. & Chang, T. H. Dbp3p, a putative RNA helicase in
Saccharomyces cerevisiae, is required for efficient pre-rRNA processing
predominantly at site A3. Mol Cell Biol 17, 1354-65. (1997).
58.
Westhead, D. R., Slidel, T. W., Flores, T. P. & Thornton, J. M. Protein structural
topology: Automated analysis and diagrammatic representation. Protein Sci 8,
897-904. (1999).
59.
Whitby, F. G. et al. Structural basis for the activation of 20S proteasomes by 11S
regulators. Nature 408, 115-20. (2000).
60.
Wilkinson, C. R., Penney, M., McGurk, G., Wallace, M. & Gordon, C. The 26S
proteasome of the fission yeast Schizosaccharomyces pombe. Philos Trans R Soc
Lond B Biol Sci 354, 1523-32. (1999).
61.
Wilson, C. J. et al. RNA polymerase II holoenzyme contains SWI/SNF regulators
involved in chromatin remodeling. Cell 84, 235-44. (1996).
62.
Winey, M., Yarar, D., Giddings, T. H., Jr. & Mastronarde, D. N. Nuclear pore
complex number and distribution throughout the Saccharomyces cerevisiae cell
cycle by three-dimensional reconstruction from electron micrographs of nuclear
envelopes. Mol Biol Cell 8, 2119-32 (1997).
63.
Woolford, J. L. & Warner, J. R. in The Molecular and Cellular Biology of the
Yeast Saccharomyces: Genome Dynamics, Protein Synthesis, and Energetics (eds.
Broach, J. R., Pringle, J. R. & Jones, E. W.) 587-626 (Cold Spring Harbor
Laboratory Press, 1991).
64.
Xenarios, I. et al. DIP: the Database of Interacting Proteins. Nucleic Acids
Research 28, 289-291 (2000).
Appendix: Change in mRNA expression vs. change in
protein abundance levels
Genomic and proteomic analysis of the myeloid differentiation
program: global analysis of gene expression during induced
differentiation in the MPRO cell line
Abstract
I combined 2-dimensional gel electrophoresis and mass spectrometry
analysis with oligonucleotide chip hybridization for a comprehensive and
quantitative study of the temporal patterns of protein and mRNA expression during
myeloid development in the MPRO murine cell line. This global analysis detected 123
known proteins and 29 "new" proteins out of 220 protein spots identified by tandem mass
spectrometry, including proteins in 12 functional categories such as transcription factors
and cytokines. Bioinformatic analysis of these proteins revealed clusters with functional
importance to myeloid differentiation. Previous analyses have found that for a substantial
number of genes the absolute amount of protein in the cell is not strongly correlated to
the amount of mRNA. These conclusions were based on simultaneous measurement of
mRNA and protein at just a single time point. Here, however, I am able to investigate the
relationship between mRNA and protein in terms of simultaneous changes in their levels
over multiple time points. This is the first time such a relationship has been studied, and I
find that it gives a much stronger correlation, consistent with the hypothesis that a
substantial proportion of protein change is a consequence of changed mRNA levels,
rather than posttranscriptional effects. Cycloheximide inhibition also showed that most of
the proteins detected by gel electrophoresis were relatively stable. Specific investigation
of transcription factor mRNA representation showed considerable similarity to those of
mature human neutrophils and highlighted several transcription factors and other
functional nuclear proteins whose mRNA levels change prominently during MPRO
differentiation but which have not been investigated previously in the context of myeloid
development. Data are available online at
http://bioinfo.mbb.yale.edu/expression/myelopoiesis. (Blood. 2002;100:3209-3220)
Introduction
The study of myeloid differentiation provides important insights both into normal
developmental processes that generate peripheral blood leukocytes, as well as into
abnormalities that lead to myeloid aplasia, dysplasia, and leukemia.1-9 Access to normal
myeloid precursors at homogenous stages of development and in quantities sufficient for
biochemical analysis is not generally practicable so information about myeloid
differentiation has generally been obtained by studies of leukemic cells arrested at
various developmental stages.10 Informative results have also come from studies of
humans with genetic abnormalities affecting neutrophil accumulation11-13 and gene
targeting experiments, particularly of transcription factors.14 Overall, cell lines that can be
induced to undergo myeloid differentiation in vitro continue to provide many of the most
useful models for understanding of this process.15
Human and murine hematopoietic precursor lines have been developed that can be
induced to mature to various degrees toward adult neutrophils.8,16 Several of these lines
fail to form a full complement of proteins or to fully undergo morphologic changes
characteristic of mature neutrophils, but the murine MPRO cell line provides a relatively
favorable system for studying myeloid differentiation.8 The cells are arrested at the
promyelocytic stage because of the presence of a dominant-negative retinoic acid
receptor. Differentiation can be induced by adding appropriate concentrations of all-trans
retinoic acid (ATRA). On differentiation, most cells mature to the level of band forms
and mature polymorphonuclear neutrophils and express secondary granule mRNAs and
proteins.8
Current methods that provide broad surveys of the patterns of mRNA expression include
oligonucleotide chip hybridization17 and 3' end restriction fragment gel display analysis18;
both have been used to study MPRO cell development. Although the chemical
heterogeneity of proteins prevents similar global methods of protein abundance analysis,
recent improvements in 2-dimensional gel electrophoresis, especially the development of
immobilized pH gradient isoelectric focusing gels, have made it possible to
semiquantitatively examine the levels of a substantial fraction of the proteins of a cell.19
This approach, termed proteome analysis, has provided important contributions to
disease-related gene discovery, developmental program analysis, and drug discovery.
Interest in this area has been spurred by recent studies indicating a modest to poor
correlation between transcriptional profiles and actual protein levels in cells. These
studies make it clear that cellular protein analysis is complementary to genomic analysis
and that no biologic program can be successfully analyzed without the incorporation of a
proteomics platform.
Previously, I used oligonucleotide chips and gel displays to study the patterns of mRNA
expression during MPRO cell differentiation and compared these with a very limited set
of protein analyses from wide pH range 2-dimensional gel electrophoresis.20 I have
expanded these studies to a more global analysis of a much wider array of mRNA and
protein species. The current studies use higher resolution narrow-range 2-dimensional gel
systems and tandem mass spectrometry to identify a substantial portion of the more
abundant proteins whose levels change during MPRO development. Bioinformatic and
functional tools were then used to analyze the role of these proteins in myeloid
differentiation. I have also used a new generation of oligonucleotide chips to compare
mRNA levels in MPRO cells 0 hours and 72 hours after induction of differentiation. In
particular, I have further examined the expression of transcription factor mRNAs in
MPRO cells and compared this pattern with transcription factor expression in mature
human neutrophils.
Materials and methods
Cell line growth and induction
The MPRO cells15 were obtained and incubated as described previously.20 MPRO cells
induced with retinoic acid for 0, 24, 48, and 72 hours were collected and analyzed by
procedures described below.
Two-dimensional immobilized pH gradient gel electrophoresis
MPRO cells were disrupted in lysis buffer.20 I applied 50 to 100 µL of each MPRO cell
lysate (1.25 × 10^6 to 2.5 × 10^6 cells/100 µL, about 100-200 µg protein) at the
cathodic end of the immobilized pH gradient gel (IPG) strips (pH 3-10 L, pH 4-7 and pH
6-11, Pharmacia Biotech, Uppsala, Sweden), and 2-dimensional IPG electrophoresis (2D-IPG) was conducted for 10 to 16 hours (13 000 to 20 100 V-h) using Electrophoresis
Power Supply ESP 3500 XL and Immobiline DryStrip Kit (Pharmacia Biotech). The
electrophoresis in the second dimension was carried out in a 12% sodium dodecyl
sulfate-polyacrylamide gel electrophoresis (SDS-PAGE) gel with the Laemmli-SDS-continuous system in a PROTEAN II xi 2-D cell (Bio-Rad, Hercules, CA), run at 40 mA
constant current for 5 hours.21,22
The 2-dimensional gels were stained with Coomassie brilliant blue G-colloidal following
the vendor's recommendations.23 Destaining was performed by soaking the gels in 10%
acetic acid and 25% methanol solution for 60 seconds, then in 25% methanol solution for
24 hours at room temperature. Silver staining was performed according to the protocol of
the manufacturer.24,25
The 2-dimensional maps of MPRO cells were compared by using the Adobe Photoshop
4.0 program and Melanie III 2-D PAGE software (Genebio, Geneva, Switzerland) and
checked manually. Proteins were recovered by punching out spots with MultiFit
Research Pipet Tips (Volume: 100-1000 µL; Dot Scientific, Burton, MI). More than 200
visible protein spots were punched for later mass spectrometry analysis.
Mass spectrometry analysis
The punched samples were washed at room temperature in the following solutions: in
50% acetonitrile for 5 minutes; in 50% CH3CN/50 mM NH4HCO3 for 30 minutes; then
in 50% CH3CN/10 mM NH4HCO3 for 30 minutes. After drying the sample gels in a
SpeedVac Concentrator (Eppendorf, Hamburg, Germany), trypsin solution (0.05 µg
trypsin/7 µL 10 mM NH4HCO3) was added to the samples and they were incubated at
37°C for 24 hours. The supernatants of the trypsin digestion products were collected, and 1
µL sample digest was mixed with 1.0 µL α-cyano-4-hydroxycinnamic acid (CHCA; 4.5
mg/mL in 50% CH3CN, 0.05% trifluoroacetic acid [TFA]) matrix solution and 1 µL
calibrants (100 fmol each). The mixture was loaded on a target of the sample plate, then
injected into the Perseptive Biosystem Voyager-DE STR instrument (Perseptive Biosystem,
Boston, MA). The spectra of the peptides were acquired in reflector/delayed extraction
mode. The standards used for calibration of peptide masses were bradykinin (average
(M+H) 1061.23) and ACTH clip (average (M+H) 2466.70). The criteria used
to identify proteins were: (1) "Coverage," the ratio of the portion of protein
sequence covered by matched peptides to the whole length of the protein sequence, is 25% or
more; (2) "Z score" is more than 1.5; (3) "Probability" is "1.0e + 000"; (4) "Coverage
graphical" of the matched peptides from the protein candidate spans the whole length of the
protein.
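The first three identification criteria lend themselves to a short sketch. It rests on several assumptions: matched peptides are given as 1-based inclusive residue ranges, the function names are hypothetical, and criterion (4), the graphical check that matches span the whole protein, is not modeled. The search engine's own scoring is not reproduced here.

```python
def sequence_coverage(protein_length, matched_ranges):
    """Fraction of the protein sequence covered by matched peptides.
    matched_ranges: list of (start, end) residue positions, 1-based and inclusive;
    overlapping or adjacent peptides are merged so no residue is counted twice."""
    covered = 0
    current_start = current_end = None
    for start, end in sorted(matched_ranges):
        if current_end is not None and start <= current_end + 1:
            current_end = max(current_end, end)   # extend the current merged block
        else:
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
    if current_end is not None:
        covered += current_end - current_start + 1
    return covered / protein_length

def passes_identification(coverage, z_score, probability):
    """Criteria (1) to (3) from the text: coverage of 25% or more,
    Z score above 1.5, and a reported probability of 1.0."""
    return coverage >= 0.25 and z_score > 1.5 and probability == 1.0
```

For example, peptides matching residues 1-10, 5-20 and 50-59 of a 100-residue protein cover 30 residues, so the coverage criterion is met, and the candidate is accepted only if the Z score and probability criteria also hold.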
Peptide identification and database establishment
Peptides were identified using the ProFound-Peptide Mapping search engine
(http://www.proteometrics.com/profound_bin/WebProFound.exe), and subsequently
searched against the SWISS-PROT (http://www.expasy.ch/) or PIR (http://www-nbrf.georgetown.edu/) sites. The differential patterns of protein expression were analyzed
with Melanie II 2-D Page Software (Bio-Rad)
(http://www.expasy.ch/melanie/MelanieII/description.html).
The 2-dimensional reference maps and the identified protein information were collected
in a database (dbMCp) that contained information for each protein including: GenBank
matches, Locus Link or UniGene clusters, expression patterns, tissue distribution,
synonym(s) protein name, gene name(s), notations of possible functions in myeloid cell
biology and differentiation, and hyperlinks to the database searches, 2-dimensional
images, and related references. These data were gathered as separate entries in a file.
Supplementary information is available on the website
(http://bioinfo.mbb.yale.edu/expression/blood).
The proteins identified from different sets of 2-dimensional gels were grouped into 12
categories according to their functions as documented in SWISS-PROT and National
Center for Biotechnology Information (NCBI) databases. Furthermore, these proteins
were classified into 5 expression patterns by their similarity to the ideal expression
patterns.20 The correlations at various levels of proteins or RNA were compared using
both visual estimates and Melanie software estimates of protein spot intensities and the
average difference between match and mismatch probe sets for each gene on the
oligonucleotide chips.
Protein synthesis inhibition by cycloheximide treatment
A pilot dose-response experiment determined the dose that produced 95% inhibition of
MPRO cell protein synthesis, assayed by incorporation of radiolabeled L-[35S]methionine. Based on this dose-response experiment, MPRO cells (2 × 10^5 cells/mL) were
treated with or without cycloheximide (final concentration 10 µL/mL) for 2 hours, then
collected and sampled for proteomic analysis as described above.
mRNA isolation and analysis
The mRNA was isolated from MPRO cells at indicated time points during differentiation
as previously described.20 Oligonucleotide chip analysis was also performed as
previously described,20 except for the use of the more advanced Affymetrix chip probes
(Murine Genome U74Av2 array), interrogating approximately 36 000 full-length mouse
genes and expressed sequence tag (EST) clusters from the UniGene database. The
resulting data were compared with human neutrophil gene expression analysis using the
Affymetrix U60 set of oligonucleotide chips. Human neutrophils were prepared
according to the method described previously.18 Criteria for considering cDNAs "present"
and for selecting those with significant average differences, as well as rescaling,
threshold, and normalization methods were applied as previously described.20
To study mRNA expression we first tested whether results from previous
work,20 obtained with Affymetrix 11K chips, could be combined with the present set of measurements
from the newer-generation Affymetrix murine genome U74Av2 array. Comparison of the
differences in expression at times 0 and 72 hours between the 2 different chips requires
preprocessing of the data, because the probe sets corresponding to any given gene in the
old and new chips are different. Genes were identified by their Locus Link ID, by
extracting the ID for each accession number in both the 11K and 36K chips using the
Stanford Source database (http://genome-www5.stanford.edu/cgi-bin/SMD/source/sourceSearch). We filtered out probe sets that had missing values of
expression levels at either 0 or 72 hours. The remaining probe sets of the 11K chip were
linked with the remaining probe sets of the 36K chip through a common Locus Link ID.
Most of the remaining 1906 distinct Locus Link IDs had a single probe set per Locus
Link ID, both in the old and new chips. However, 63 probe sets from the 11K chip were
linked with more than one probe set on the 36K chip and 400 probe sets from the 36K
chip were linked with more than one probe set of the 11K chip. We chose not to average
the RNA levels of probe sets that belong to the same gene, because it would not be
appropriate when the expression levels of one probe set dominate the others. Therefore,
we evaluated the correlation between mRNA from the 11K chip and the 36K chip using
only the subset of genes that had single probe sets on both chips. Using this subset we
found that the correlation between mRNA levels of the 11K chip and the 36K chip is 0.75
at 0 hours and 0.7 at 72 hours. These correlations were lower than the correlations
between the mRNA levels at 0 and 72 hours using only the 11K chip (r = 0.89) or only
the 36K chip (r = 0.84). Therefore, changes in RNA levels were not entirely reproducible
using these 2 completely different chips. We compared the time course trends of 10 genes
previously studied using Northern blots with the corresponding trends of the 11K and
36K chips. The trends of the 11K chip agreed with the Northern blots only in 6 of 10
instances, whereas the new 36K chip success rate was 9 of 10. The mRNA for several of
the proteins we detected from 2-dimensional gels was reported as present on the new chip
set but absent from the old chip set. We therefore used only data from the new chip set
for comparisons with proteins and for further examination of changes in transcription
factors. The use of only one replicate of the new 36K chip, although not ideal, should be
sufficient for exploring global relations between protein and mRNA.
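The cross-chip comparison described above (drop probe sets with missing values, link the two chips through Locus Link IDs, keep only genes with a single probe set on both chips, then correlate) can be sketched as follows. The record layout, with one tuple of (probe set ID, Locus Link ID, level at 0 hours, level at 72 hours) per probe set and `None` for missing values, and the function names are assumptions of this sketch, not the format of the actual chip files.

```python
from collections import defaultdict
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant vectors."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def single_probe_genes(chip):
    """Map Locus Link ID -> (level_0h, level_72h) for genes that have exactly one
    probe set left after removing probe sets with missing values."""
    by_locus = defaultdict(list)
    for _probe_id, locus_id, level_0h, level_72h in chip:
        if level_0h is None or level_72h is None:
            continue  # filter out probe sets with missing expression levels
        by_locus[locus_id].append((level_0h, level_72h))
    return {locus: levels[0] for locus, levels in by_locus.items() if len(levels) == 1}

def cross_chip_correlation(old_chip, new_chip):
    """Correlate expression levels between chips at 0 and 72 hours, using only
    genes with a single probe set on both chips (no averaging of probe sets)."""
    old, new = single_probe_genes(old_chip), single_probe_genes(new_chip)
    shared = sorted(set(old) & set(new))
    r_0h = pearson([old[g][0] for g in shared], [new[g][0] for g in shared])
    r_72h = pearson([old[g][1] for g in shared], [new[g][1] for g in shared])
    return shared, r_0h, r_72h
```

Restricting to one-to-one genes sidesteps the averaging problem raised in the text, at the cost of discarding genes represented by several probe sets.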
Northern blot analysis was performed as described previously.20
Results
Proteomic analysis of MPRO differentiation
The MPRO cell model is particularly useful for studying aspects of myeloid
differentiation because large numbers of cells can be obtained, arrested at the
promyelocyte stage of development, and, most importantly, synchronous differentiation
can be induced by adding ATRA. The fully differentiated cells resemble mature
neutrophils both morphologically and in the expression of secondary granule proteins.
For the purpose of initially scanning changes in protein levels during myeloid
differentiation, we used 2D-IPG with wide-range, linear IPGs (pH 3-10) in the first
dimension. Figure 1 shows analytical colloidal blue-stained 2D-IPG standard maps of
differentiated MPRO cells at 0, 24, 48, and 72 hours after the cells were induced with
ATRA. The expression patterns of more than 300 protein spots were followed through
the entire series of gels. The protein spots in different gels could easily be cross-matched
to each other, using Melanie III software, indicating the reproducibility of the method. A
large portion of these products changed their relative intensities among the 4 maps,
suggesting extensive protein expression changes during the course of MPRO
differentiation.
Protein identification
The protein spots in the different sets of the gels were identified by MALDI-MS on the
basis of peptide mass matching with the theoretical peptide masses in tryptic digests of all
known proteins from mouse and human species.26 Of 220 protein spots analyzed, 193
yielded high-quality spectral data. The experimental peptide masses were matched to a
total of 143 spots corresponding to 123 different known proteins, as presented in Table 1.
The accession numbers, protein names, and theoretical pI and Mr values, as well as the
number of peptide matches and probability of wrong assignment, are presented in the
database dbMCp (http://bioinfo.mbb.yale.edu/expression/myelopoiesis). There were 29
spots with high-quality spectra but poor matches in public databases; another 21 spots
with good mass spectra matched many different proteins in the mouse database. The
latter finding was probably attributable to high sequence homology, but can also be the
result of a mixture of proteins in a single spot.
On the pH 3 to 10 maps, 14 protein species were represented by multiple spots (Table 2)
that differed in pI or Mr. These differences might be the result of alternative
splicing or posttranslational modifications, or of chemical modification by protease
inhibitors during sample preparation. Interestingly, some of these proteins showed the
same phenomena in Jurkat T-cell 2-dimensional protein maps.27 Some spots with high-quality spectra, but shifted from their expected position in the gel, might also represent
posttranslational modifications. Proteins with lower than expected molecular weights
may be digestion fragments of larger proteins. Most proteins with low molecular weight
(< 14 kDa), which usually presented multiple matches, could not be identified.
Protein expression patterns during MPRO development
The 123 "known" proteins identified here were classified into 12 categories on the basis
of their function, including 18% categorized as cytoskeletal proteins, 15% metabolism-related molecules, and 10% signaling pathway-related proteins (Table 3). These proteins
were abundant in the cell and easily detected by 2-dimensional electrophoresis. Smaller
sets of proteins included 7 possible transcription factors and 5 cytokines; other categories,
such as kinases and chromatin remodeling factors, contain even fewer members.
We also classified all known proteins according to their expression patterns during
myeloid differentiation. We clustered the standardized protein expression level profiles (at
0, 24, 48 and 72 hours) using the GENECLUSTER version of the self-organizing maps
(SOMs) clustering algorithm,28 with a rectangular 3 × 2 grid as the input node geometry.
The final positions of the nodes in the 4-dimensional (time course) space represent the
centers of 6 clusters generated by the SOM algorithm. One of these clusters was empty.
Figure 4 shows the normalized expression profiles divided into the remaining 5 clusters
representing trends such as down-regulation (Figure 4A) and up-regulation (Figure 4D,E)
that occur during the cell maturation process. For example, the universal translation
factor Eef2 is down-regulated. This finding is consistent with the concurrent reduction of
total RNA levels and cell size. Conversely, protein Es10 shows a pattern of up-regulation,
as expected for a granule component. Thus, these profiles offer information about the
roles of proteins in the different stages of the MPRO development.
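The clustering step described above can be illustrated with a very small self-organizing map. This is a simplified sketch and not the GENECLUSTER implementation: the learning-rate and neighbourhood schedules, the initialization from randomly chosen profiles, and all function names are assumptions made for illustration. Profiles are assumed to be non-constant so that standardization is defined.

```python
import random
from statistics import mean, pstdev

def standardize(profile):
    """Rescale a time-course profile to mean 0 and standard deviation 1."""
    m, s = mean(profile), pstdev(profile)
    return [(v - m) / s for v in profile]

def nearest_node(nodes, profile):
    """Index of the node closest to the profile (squared Euclidean distance)."""
    return min(range(len(nodes)),
               key=lambda i: sum((w - x) ** 2 for w, x in zip(nodes[i], profile)))

def train_som(profiles, rows=3, cols=2, epochs=200, seed=0):
    """Train a rows x cols SOM; returns one node vector per grid position."""
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    nodes = [list(rng.choice(profiles)) for _ in grid]  # start from data points
    n_steps = epochs * len(profiles)
    step = 0
    for _ in range(epochs):
        for p in rng.sample(profiles, len(profiles)):   # shuffled pass over the data
            frac = step / n_steps
            rate = 0.5 * (1 - frac)              # learning rate decays linearly
            radius = 0.5 + 1.5 * (1 - frac)      # neighbourhood shrinks from 2.0 to 0.5
            best_r, best_c = grid[nearest_node(nodes, p)]
            for i, (r, c) in enumerate(grid):
                if ((r - best_r) ** 2 + (c - best_c) ** 2) ** 0.5 <= radius:
                    nodes[i] = [w + rate * (x - w) for w, x in zip(nodes[i], p)]
            step += 1
    return nodes

def assign_clusters(nodes, profiles):
    """Cluster label (node index) for each profile."""
    return [nearest_node(nodes, p) for p in profiles]
```

With a 3 × 2 grid and 4 time points this yields 6 node vectors in 4-dimensional space; a node that attracts no profiles corresponds to an empty cluster, as happened for one of the six nodes in the analysis above.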
Correlation of gene expression at the RNA and protein levels
One of the goals of this work is to search for global relationships between mRNA and
protein levels during MPRO cell maturation. Previous studies in yeast showed weak
correlations between average mRNA levels and average protein levels.29-32 These studies
focused on the relationship, at one instant, between absolute amounts of mRNA
(measured from Affymetrix GeneChip experiments) and protein. Here we investigate
another quantity: the correlation between changes over many time points, in mRNA
levels and in protein levels. This is only possible because we have available experiments
simultaneously done on protein and mRNA levels over an entire time course. In
particular, we analyze the relationships between time course expression profiles of
mRNA and proteins during a process of mammalian cellular development. This is the
first time that the relationship between protein abundance and mRNA expression has
been studied in terms of changes over time. To study mRNA expression, we used
measurements taken at times 0 and 72 hours of the maturation process, using the
Affymetrix 36K murine chip. To compare the mRNA changes with protein changes we
first summed levels of proteins that are represented in more than one spot on the 2-dimensional gels. We retained only mRNAs with an Affymetrix oligonucleotide probe set
with the suffix "_at" (representing a probe set corresponding to a single gene). This
procedure removes the ambiguity of multiple probe sets per Locus Link. We then
screened for mRNAs that had a "present" Affymetrix indicator and an amplitude greater than 20
at both 0 and 72 hours and found 51 different proteins that satisfied these conditions. The
correlation between the mRNA difference at 0 and 72 hours with the corresponding
protein difference is r = 0.58, as presented in Figure 5 (the exact formula for the Pearson
correlation coefficient r is given in the legend to Figure 5). Most proteins with increasing
levels of mRNA also have increasing protein levels, with the exception of 2 outliers
(enolase 1 and coronin). Overall, 11 of 51 proteins with upward/downward trends had an
opposite mRNA trend.
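
The filtering and correlation procedure just described can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the analysis code used in the study: the input layout (a dictionary of mRNA levels with "present" flags, and a list of per-spot protein levels), the function names, and the example values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def change_correlation(mrna, spots, min_amplitude=20.0):
    """Correlate 0-to-72-hour changes in mRNA and protein levels.

    mrna:  gene -> (level_0h, level_72h, present_0h, present_72h)
    spots: list of (gene, level_0h, level_72h), one entry per gel spot
    Returns (r, fraction of genes whose mRNA and protein move in the same direction).
    """
    # Sum protein levels over all spots that map to the same gene.
    protein = {}
    for gene, p0, p72 in spots:
        a, b = protein.get(gene, (0.0, 0.0))
        protein[gene] = (a + p0, b + p72)

    d_rna, d_prot = [], []
    for gene, (p0, p72) in sorted(protein.items()):
        if gene not in mrna:
            continue
        r0, r72, ok0, ok72 = mrna[gene]
        # Keep only "present" calls with amplitude above the threshold at both times.
        if ok0 and ok72 and r0 > min_amplitude and r72 > min_amplitude:
            d_rna.append(r72 - r0)
            d_prot.append(p72 - p0)

    same = sum(1 for r, p in zip(d_rna, d_prot) if r * p > 0)
    return pearson(d_rna, d_prot), same / len(d_rna)
```

The second return value corresponds to the share of genes falling in the first and third quadrants of Figure 5 (40 of 51 in the data reported above).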
The reproducibility of the protein results was studied by repeating the induction
experiments of MPRO cells and also by repeated analyses of the same cell samples. The
induction experiments were repeated 3 times. In each experiment, the MPRO cells from
different time courses were analyzed by 2D-IPG 2 to 6 times. We found that the protein
spot images were well reproduced, with only slight differences occurring at the far edges
of gels. Quantitative analysis of 4 dilutions of the same samples showed that the intensity
change of each protein was proportional to the dilution. In comparisons of 2D-IPG of 0-hour and 72-hour cells between 2 different induction experiments, we found that among the 220
analyzed proteins, 199 (90%) were reproducibly observed, and 21 were not observed in
all gel sets. The direction of expression changes of proteins at 72 hours relative to 0 hours
was similar in both experiments, with a correlation coefficient of 0.88.
We measured protein abundance using both software and manual estimations of spot
intensity. Using the Melanie III program from Genebio, we were able to compute the
protein abundance of thousands of proteins across the gels and found a general
consistency between measurements by eye and by software analysis (data not shown).
We did not expect to find a general correlation for the changes in levels of these proteins
and their mRNAs; rather, as previously hypothesized,29,32 we sought correlations between
smaller, better defined groups of proteins. Although the correlation over all proteins and
mRNA hovered around 0.3 for each of the time points, we found that the median
correlation for cytoskeletal proteins alone rose to approximately 0.65, highlighting the
importance of analyzing mRNA expression and protein abundance using well-defined
features and functions.
Protein stability
The level of any protein is theoretically determined by its cumulative rate of synthesis
and by the rates of degradation or alteration (and an initial condition of protein level). For
protein stability studies, MPRO cells (1.5 × 10^5 cells/mL) were treated for 2 hours with
cycloheximide (final concentration 10 µL/mL, based on an initial dose-response
experiment). The cycloheximide-treated and control cells were analyzed on 2 sets of
IPGs (pH 4-7 and pH 6-11). As shown in Figures 6 and 7, the relative expression of most
proteins remained the same after 2 hours of treatment. Quantitative measurements
showed that 27.5% of proteins dropped off significantly (fold change > 2), whereas 63.7%
of proteins were stable over this time period (Figure 8). Nine proteins showed a relatively
higher level of expression after cycloheximide treatment, indicating that posttranslation
modifications of these proteins occurred less than 2 hours after their synthesis, or that
their translation was relatively resistant to cycloheximide.
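
The fold-change classification behind these percentages can be sketched as follows. The 2-fold cutoff for a decrease is taken from the text; applying the same cutoff symmetrically to define an increase is an assumption, as are the function names and the input format (pairs of control and treated optical densities).

```python
def classify_stability(control_od, treated_od, fold=2.0):
    """Label one spot by its treated/control intensity ratio."""
    if control_od <= 0 or treated_od <= 0:
        raise ValueError("optical densities must be positive")
    ratio = treated_od / control_od
    if ratio < 1.0 / fold:
        return "decreased"   # dropped by more than `fold`
    if ratio > fold:
        return "increased"   # assumed symmetric cutoff for increases
    return "stable"


def summarize(pairs, fold=2.0):
    """Percentage of spots in each class for (control_od, treated_od) pairs."""
    counts = {"decreased": 0, "stable": 0, "increased": 0}
    for control_od, treated_od in pairs:
        counts[classify_stability(control_od, treated_od, fold)] += 1
    total = len(pairs)
    return {label: 100.0 * n / total for label, n in counts.items()}
```

As a consistency check on the reported figures, 27.5% decreased, 63.7% stable, and nine increased spots would fit a total of roughly 102 quantified spots (28, 65, and 9, respectively); that total is our inference and is not stated in the text.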
Comparison of differentiated MPRO cells with normal neutrophils
After 72 hours of ATRA treatment, the MPRO cells resembled mature neutrophils
morphologically, including the presence of secondary granule proteins. To obtain a more
complete picture of the differentiation state of the MPRO cells, we compared their RNA
profiles with those of mature neutrophils. Human neutrophils were used rather than
murine peripheral blood cells because the human cells are a more practical source of
sufficient RNA for replicate analyses. In particular, we chose to focus on the levels of
mRNA encoding transcription factors, because they control the differentiation process
and determine the expression of the other genes. A total of 219 known or probable
transcription factors were represented in mRNA isolated at some stage of MPRO cell
development. Comparison of oligonucleotide chip analyses showed that there were 49
transcription factors whose mRNA was reported as present in resting human neutrophils
but whose homologues were reported as absent in 72-hour MPRO cells. To obtain more
precise data, we performed Northern blot analysis of 20 mRNAs encoding transcription
modulators (Table 4). Of these, the oligonucleotide chips reported 12 as present in human
neutrophils but absent in 72-hour MPRO cells. Eleven of these 12 were detected as
present in 72-hour MPRO cells by Northern blot analysis (Figure 9). These included
Bach1, not previously studied in myeloid cell differentiation, but markedly elevated in
the mature cells. Conversely Rybp was markedly reduced as the cells matured. This
finding is surprising because the protein is a presumptive transcriptional repressor and
part of the mammalian homologue of the Drosophila polycomb complex.
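
The cross-check between chip calls and Northern blots described above amounts to a simple set comparison, sketched below. The gene names and scores in the test data are hypothetical placeholders, the function names are our own, and treating any Northern score above 0 as "detected" is an assumption.

```python
def discordant_calls(human_calls, mpro_calls):
    """Genes called present ("P") in human neutrophils but absent ("A") in MPRO cells."""
    return sorted(gene for gene, call in human_calls.items()
                  if call == "P" and mpro_calls.get(gene) == "A")


def northern_detected(genes, northern_72h):
    """Subset of genes with a detectable Northern band (score > 0) at 72 hours."""
    return [gene for gene in genes if northern_72h.get(gene, 0) > 0]
```

Genes returned by `discordant_calls` but also returned by `northern_detected` are candidates for transcripts that fell below the chip detection threshold rather than being truly absent.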
Discussion
We have used a 2-dimensional gel electrophoresis approach to explore the temporal
patterns of protein expression during ATRA-induced myeloid development in the MPRO
murine myeloid cell line.8 This global analysis has detected 123 known proteins and 29
"new" proteins out of 220 protein spots identified by tandem mass spectrometry,
including proteins in 12 functional categories such as transcription factors, cytokines, and
others. Bioinformatic analysis of these proteins has revealed clusters with functional
importance to myeloid differentiation. Comparison of gene expression at the genomic and
proteomic levels revealed some discrepancies between RNA and protein levels that
indicate the importance of posttranscriptional and posttranslation processes during cell
differentiation, although some differences undoubtedly arise at least in part from
technical limitations of the current methods of measurement. These discrepancies may
also be the result of varying translation and degradation efficiencies or might reflect
posttranslation modifications. Nonetheless, overall there was a significant correlation
between changes in mRNA and protein levels, consistent with the expectation that a
substantial proportion of protein change is a consequence of changed mRNA levels,
rather than posttranscriptional effects. Cycloheximide inhibition also showed that most of
the proteins detected by gel electrophoresis were relatively stable, so that increased
stability of proteins with maturation was not a likely explanation for the observed
changes. We further examined the expression of transcription factor mRNA in MPRO
cells and compared this with the expression pattern in mature human neutrophils. By
combining oligonucleotide chip and Northern blot analysis, we observed that most of the
transcription factor mRNAs detected in human neutrophils have homologues present in
mature MPRO cells, although estimated relative RNA abundances could be quite
different between species.
The first comparison of mRNA levels to the protein abundances of their gene products33
found a correlation coefficient of 0.48. These observations highlighted the limitations of
functional studies performed only at mRNA level. Later, Anderson's group found a
correlation coefficient of only 0.43 in a comparison of protein and mRNA abundances for
a single gene product across 60 human cell lines by an immunoaffinity high-performance
liquid chromatography method and quantitative Northern analysis.19 In 1999, Gygi et al30
quantitatively compared mRNA and protein expression levels for 128 different genes
expressed in yeast, using serial analysis of gene expression (SAGE) and capillary liquid
chromatography-tandem mass spectrometry methods. Their results showed a correlation
coefficient of 0.935 for the most abundant proteins; but the coefficient was only 0.356 for
the 69% of 106 genes34 for which the transcript levels were less than 10 copies/cell.
These prior studies examined static expression levels without correlation of changes in
protein and mRNA levels during cell development, as performed in the present study. In
general, we found a moderately high correlation (coefficient 0.58) between estimated
protein and RNA levels. There are multiple technical considerations in measuring both
RNA and protein levels that might affect the results, but the general conclusion supports
previous contentions35 that interpretations of changes in cell behavior based on changing
mRNA levels are incomplete. Nevertheless, the correlation is sufficiently strong to indicate
that the regulation of transcript levels is probably a major determinant of changes in
protein levels during differentiation in this system. Because uninduced MPRO cells were
in a steady state, one might expect to see better correlation at later time points, when
changes in mRNA levels over time have been translated into protein levels.
Some loss of correlation could derive from unstable proteins that are differentially
regulated during cellular maturation. Using cycloheximide to inhibit protein synthesis, we
found that the large majority of the proteins in this system are relatively stable. However,
protein stability is an important factor in posttranslational proteomic studies.
Much progress has been made in understanding transcriptional regulation of the myeloid
differentiation program. Transcription factors such as PU.1 and members of the C/EBP
family have been found to play important roles in the expression of a variety of myeloid
genes, both by examination of individual gene regulatory regions and by gene knock-out
studies in mice.36-39 Our previous work20 initiated and the present study has established a
database of transcription factors and target genes differentially regulated during myeloid
differentiation.
The results are limited by the sensitivity, accuracy, and
comprehensiveness of the available oligonucleotide chips for mouse mRNAs.
Detection of transcription factor proteins is difficult because they are often present at low
abundance, may have basic pIs, and may be present in various modified forms that alter
their mobility on 2-dimensional gels. Encouragingly, the present study identified 7
proteins potentially important to transcriptional regulation, including RNA polymerase II,
Stat5a, Aiolos, Hmg1 and 2, Kruppel-related zinc finger protein F80-m, and Zfp101.
Previous studies have shown that all 7 members of the signal transducers and activator of
transcription (STAT) family are involved in regulating expression of cytokine-induced
and growth factor-induced genes.40 Among them, Stat5 appears to have an important role
in myeloid cell development, primarily by mediating granulocyte-macrophage colony-stimulating factor (GM-CSF) signaling. At the mRNA level, several STAT proteins,
including Stat1, 3, 5b, and 6, were moderately up-regulated in MPRO cells. Our data
showed decreased expression of Stat5a protein at the late stage of MPRO differentiation,
as reported in other systems.40 Kruppel-related zinc finger protein F80-m and Aiolos are
2 newly identified transcription factors, with still unknown functions in myeloid cells,
although Aiolos is known to interact with Ras to control cell death in T cells.41 In MPRO
cells, we found that Aiolos is expressed at a fairly constant level throughout
differentiation. In contrast, Kruppel-related zinc finger protein F80-m was strongly down-regulated and Zfp101 slightly up-regulated. The high mobility group (HMG) box domain
defines a family of proteins, mostly transcription factors, that specifically interact with
DNA on the minor groove.42,43 Surprisingly, recent studies suggest a second quite
different function for Hmg1 and 2 as cytokinelike factors.44,45 In this study, both Hmg1
and Hmg2 were detected by 2DE analysis. Hmg2 was significantly up-regulated
indicating its possible important function in biologic processes in MPRO differentiation.
Oligonucleotide chip analyses showed the presence of mRNAs for about 123
transcription- or chromatin-modifying factors in differentiated MPRO cells and 147
factors in mature human neutrophils. Overall, 49 of these factors represented in
neutrophil mRNA were not detected by chip analysis of MPRO cells, but 11 of 12 were
detectable by Northern blot analysis. In some cases the failure to find an mRNA by chip
analysis was probably because the amount of transcript was below the threshold for
oligonucleotide chip detection,46,47 but in other cases relatively strong Northern signals
were obtained.
Several subsets of transcription factor mRNAs had patterns of expression that could be
interpreted in terms of known function of the products. Myc is a well-known transcription
factor that promotes growth rather than differentiation,48 and in turn is regulated by
interactions with a family of proteins including Max, Mad, and Sin3B.49 In developing
MPRO cells Myc is down-regulated and Mad is up-regulated. The related protein Mad4
is slightly down-regulated and Mad5 is markedly down-regulated and apparently absent
from the mature cells. Mad5 differs from other proteins of this group in that it may act to
stimulate as well as repress transcription. In addition, Sin3b is one of the more markedly
up-regulated transcription factor mRNAs. The combined changes in Mad, Myc, and
Sin3b would be expected to synergistically prevent activation of Myc target genes.
PU.1 is a transcription factor implicated in the transcriptional control of neutrophil-specific genes and in neutrophil production, which is defective in PU.1 knockout mice.50
Sp1, Purb, Klf9/Bteb1, and Maz are broadly expressed transcription factors that bind to
purine-rich sites, including potentially some PU.1 sites. PU.1 is up-regulated almost 3-fold at the RNA level, whereas all 4 of the latter factors are down-regulated during
MPRO development, as is the SP1-like factor Klf13.
We have previously observed20 by Northern blot analysis that there is a shift in the
balance of members of the C/EBP family of transcription factors at the mRNA level
during MPRO differentiation, with some progressive down-regulation of C/EBP and up-regulation first of C/EBP then C/EBP and . These results are consistent with the role of
these factors in neutrophil development, deduced from both transcriptional analysis of
individual promoters and gene knockout effects on myelopoiesis. The present set of RNA
analyses by oligonucleotide chip hybridization is more consistent with the Northern blot
analyses than were the preliminary results,20 although C/EBP is still not represented on
the chip.
Overall, these coordinated changes in the expression of multiple transcription factors
would serve to amplify differences in transcription and permit fine control of the timing
and amplitude of regulation for multiple gene targets. As previously postulated,51 such
reciprocal regulation of competing factors may be a common mechanism in
differentiation. The changes in mRNA levels during maturation of myeloid cells include
both the silencing of a number of genes and up-regulation of a number of other genes.
The substantial changes in the level of some putative transcriptional repressors, both up
(eg, Sin3b, Atf7ip) and down (eg, Rybp) during differentiation, suggest that specific
repression of transcription provides an important and under-investigated means of
regulating myeloid differentiation, in addition to more conventional mechanisms such as
competition for binding sites and changes in activating factor levels.
The striking morphologic changes in the maturing nuclei of "polymorphonuclear
leukocytes" remain mysterious both in terms of mechanism and teleology. Some possible
clues may be observed in the current RNA expression data. For example, Ran is a small
guanosine triphosphatase (GTPase) required for nuclear import and export, and mRNA
levels for Ran and Ran binding proteins 1 and 2 decline as the cells mature. This change
could either be a cause or consequence of decreased nuclear import of macromolecules
coincident with nuclear condensation. Another protein, acinus, is implicated in causing
chromatin condensation without DNA breakage during apoptosis52; its mRNA increases
about 3-fold as MPRO cells mature and form highly condensed, multilobed nuclei.
In summary, we have comprehensively and quantitatively analyzed both RNA and
protein expression patterns during myeloid differentiation. Changes in protein levels
correlated moderately well with changes in mRNA expression. Investigation of
transcription factor mRNA representation showed considerable similarity to that of
mature human neutrophils and highlighted several transcription factors and other functional
nuclear proteins whose mRNA levels change prominently during MPRO differentiation
but which have not been investigated previously in the context of myeloid development.
The number of transcription factors expressed in these cells greatly exceeds those
previously identified as important for the regulation of specific myeloid genes. Currently
emerging techniques53-55 for genomic analysis of factor binding sites in mammalian DNA
may help to elucidate their gene targets and potential roles in myeloid differentiation.
Acknowledgments
We express our gratitude to Dr S. Tsai (Program in Molecular Medicine, Fred
Hutchinson Cancer Research Center, Seattle, WA) for his kind gift of MPRO cell line,
and Mr Jeffrey J. Meyer (University of Chicago School of Medicine) for helpful advice.
Supported by National Institutes of Health (NIH) grants CA42556, AI43558, DK54369,
and HL63357, and by Gene Logic (S.M.W.); NIH grant HL63357 (Z.L.); NIH grant DK
54369, grants from the Arthritis Foundation and Charles H. Hood Foundation, and the
John H. Pierce Pediatric Oncology Research Fund (P.E.N.); and NIH grant P50
HG02357-01 (M.G.).
S.M.W. owns stock in and consults for Gene Logic Inc.
Figures and Tables
Appendix: Change in mRNA expression vs. change in protein abundance levels
Figure 1. Two-dimensional electrophoretograms of wide pH range of MPRO cells.
MPRO cells differentiate to mature neutrophils in the presence of ATRA. Following
exposure to 10 µM ATRA for 0, 24, 48, or 72 hours, MPRO cell lysate (2.5 × 10^6
cells/sample) was loaded for 2-dimensional electrophoretic (2DE) analysis. The gels were
stained with brilliant blue G-colloidal dye. (A) Uninduced MPRO cell (0 hour); (B)
MPRO cells induced with ATRA for 24 hours; (C) MPRO cells induced with ATRA for
48 hours; (D) matured MPRO cells induced with ATRA for 72 hours. The most visible
protein spots in the maps were subjected to MS analysis. The marked 2DE maps can
be found on our website (http://bioinfo.mbb.yale.edu/expression/myelopoiesis). *2DE
maps of panels A and D were published in our previous paper.20
Figure 2. Two-dimensional electrophoretograms of MPRO cells in pH range 4 to 7.
MPRO cell lysate (1.5 × 10^6 cells/sample) was loaded for 2DE analysis (pH 4-7). The
gels were stained with brilliant blue G-colloidal dye. (A) Uninduced MPRO cell (0 hour);
(B) matured MPRO cells induced with ATRA for 72 hours. The other information is
presented as in the legend to Figure 1. In these wide-range 2-dimensional maps, there is
a loss of resolution in the region pH 4 to 7, most probably due to the fact that the pI
values of many proteins occur in this range. Therefore, we also performed electrophoresis
on pH 4 to 7 and pH 6 to 11 narrow-range IPGs to get better protein separation (Figures 2
and 3). These narrower pH gels allowed a higher resolution and more protein spots in the
relative pH zones. The abundant protein spots could also be cross-correlated between the
wide and narrow gels.
Figure 3. Two-dimensional electrophoretograms of MPRO cells in pH range 6 to 11.
MPRO cell lysate (1.5 × 10^6 cells/sample) was loaded for basic pH 2DE analysis (pH 6-11). The gels were stained with brilliant blue G-colloidal dye. (A) Uninduced MPRO cell
(0 hour); (B) matured MPRO cells induced with ATRA for 72 hours. The other
information is presented as in the legend to Figure 1.
Table 1 Distribution of protein spots identified during myeloid differentiation
Table 2. Protein species represented by multiple spots

                                           Theoretical value      Practical value
Symbol    Accession  Gi#*      Protein ID  kDa     pI        %    kDa       pI
Aldh2     NP_033786  6753036   MPRO004     56.52   7.7       23   31~50     6.4~6.6
                               MPRO006     56.52   7.7       20   6~14      7.3~7.6
Atp5a1    NP_031531  6680748   MPRO087     59.73   9.3       24   45~55     7.6~7.8
                               MPRO088     59.73   9.3       24   45~55     7.8~8.0
Ddx5      NP_031866  6681157   MPRO206     69.3    9.3       22   45~66     9.1~9.3
                               MPRO207     69.3    9.3       26   45~55     9.1~9.4
Gapd      NP_032110  6679937   MPRO035     35.79   8.7       39   25~35     8.0~8.2
                               MPRO085     35.79   8.7       34   28~38     7.7~7.9
Hnrpa2b1  NP_058086  7949053   MPRO223     35.98   8.7       55   25~31     9.2~9.3
                               MPRO227     35.98   8.7       55   21~31     9.2~9.3
                               MPRO229     35.98   8.7       55   21~33     9.1~9.2
Hnrph1    NP_067485  10946928  MPRO155     49.18   5.9       26   45~66     5.9~6.0
                               MPRO154     49.18   5.9       40   45~66     5.8~5.9
Hmg2      NP_032278  11527222  MPRO076     24.16   6.9       26   18~28     7.2~7.4
                     6680229   MPRO104     14.16   6.9       26   14~21     7.6~7.8
Pk3       NP_035229  6755074   MPRO023     57.9    7.2       48   45~66     7.2~7.4
                     2506796   MPRO008     57.87   7.2       42   150~200   7.0~7.5
Rbm3      NP_058089  7949121   MPRO014     16.59   6.8       25   7~14      6.6~6.8
                               MPRO015     16.59   6.8       25   12~16     6.2~6.4
STEFIN 3  P35175     461911    MPRO005     10.99   5.9       48   1~6.5     6.2~6.4
                               MPRO033     10.99   5.9       53   1~6.5     5.8~6.0
Tpi       NP_033441  6678413   MPRO012     26.69   6.9       26   15~25     6.9~7.1
                               MPRO073     26.69   6.9       40   18~28     6.7~6.9
Tpm5      P21107     136097    MPRO083     29      4.7       46   6.5~14    7.5~7.7
                               MPRO112     29      4.7       27   21~31     4.6~4.8
Vim                  2078001   MPRO093     51.55   4.9       25   31~45     4.7~4.9
                               MPRO110     53.67   5.1       28   40~50     4.9~5.0
Vdac1     Q60932     10720404  MPRO228     32.33   8.7       49   21~33     8.8~9.0
                               MPRO235     32.33   8.7       35   21~31     8.7~8.9

Protein symbol, accession, and Gi# refer to NCBI UniGene database (if represented).
Theoretical value is taken from the ProFound website (http://prowl.rockefeller.edu/cgi-bin/ProFound). Practical value is the observed value in 2DE gels (see "Appendix").
Table 3. Classification of known proteins

Category                 Protein (gene) symbol
Cytoskeleton             Actb, Actg, Anxa1, Anxa11, Anxa2, Anxa3, Arpc3, Cappa1, Coro1a,
                         ECP, KER1, KER8, KER10, KER47, KER59, Krt2-6g, KT14, SAC,
                         Tpm5, Tuba6, Tubb5, vim
Energy metabolism        Eno1, Gapd, Idh1, Idh2, Impdh2, Ldh1, Papss2/Atpsk2, Pygm,
                         Taldo1, Tpi
Signaling pathway        Arhgdib, Arhn, Ephb2, G4-1-pending, Gnb2-rs1, Hcph, Nme1, Pgk1,
                         Pk3, Ptpn1, Rac2, Ran, Rin, Vav2
Cytokine                 Hgf, IFI-205, IIIf5, Pbp, Spry1
Transcription modulators Hmgb1, Hmg2, KRZF80M, Rnf17, Stat5a, Taf2e, Zfp101, ZFP1A3,
                         Zfp354a
Chaperone                Cab140, Cct2, Cct5, Cct6a, GROEL, Grp58, Hsc70, Hsp110,
                         Hspa5/Grp78, Hspa8, P4hb, Ppia, Stip1
Granule-related protein  Cas1, Es10, Psmc1, Psma7, Psmc2, Sod1, STEFIN3
Mitochondrial            Got2, Aldh2, Atp5a1, Atp5b, Mor1
RNA metabolism           Hnrpa1, Hnrpa2b1, Hnrph1, Nsap1-pending, Rbm3, RNPC
Transporter              Slc23a2, Vdac1
Chromatin                Lmnb1, Pcna
Other categories         Abpa, Cftr, Crmp1, C4, Ddx5, Eef2, Eef1a1, Ehd1, Fut4, Gc, Gstm1,
                         HPD76, IGVAP, Ltf, Tinag, Ube1x, H2-Ab1, Phb, Prdx1, Prdx2, Pdi4,
                         Rag1, LOC56463, PRO2675, Tagln2, AA589396, Lgals3, Sfmbt

Protein symbols refer to NCBI databases (see "Appendix").
Figure 4. Protein clusters according to their expression patterns. The 72 protein spots
were grouped into 6 clusters (1 empty cluster is not shown). Each cluster is represented
by the centroid (average pattern represented by a thick red line) for genes in the cluster.
Expression level of each gene was standardized to have zero mean and unit SD across the
4 time points. Standardized expression levels are shown on the y-axis and time points on the x-axis.
Figure 5. The correlation between the mRNA difference at 0 and 72 hours and the
corresponding protein difference. Correlation between RNA expression level differences,
$\Delta R = \mathrm{RNA}(t = 72) - \mathrm{RNA}(t = 0)$, and protein level differences, $\Delta P = P(t = 72) - P(t = 0)$.
Expression levels of proteins that have more than one conformation were summed. In this
regression analysis we retained only RNA probe sets that correspond to single genes (the
retained probe sets thus lack the ambiguity of multiple probe sets per Locus Link) and
that had a "present" Affymetrix indicator and an amplitude more than 20 both at t = 0 and
t = 72 hours. There were 51 different proteins that satisfied these conditions. The linear
association r between changes in RNA levels ($\Delta R$) and changes in protein levels ($\Delta P$) of the
remaining 51 genes is only 0.58, where r is the Pearson correlation coefficient defined as
$$r(\Delta P, \Delta R) = \frac{\sum_i (\Delta P_i - \overline{\Delta P})(\Delta R_i - \overline{\Delta R})}{\sqrt{\sum_i (\Delta P_i - \overline{\Delta P})^2 \, \sum_i (\Delta R_i - \overline{\Delta R})^2}}$$
However, about 80% of the genes are located in the first and third
quadrants, indicating a general trend that genes with increasing/decreasing levels of RNA
also have increasing/decreasing protein levels.
Figure 6. Two-dimensional electrophoretograms of cycloheximide inhibition of
MPRO cells. MPRO cells were treated with cycloheximide for 2 hours. MPRO cell
lysate (1.5 × 10^6 cells/sample) was loaded for 2DE analysis (pH 4-7). (A) Control MPRO
cells. (B) Cycloheximide-treated MPRO cells. The gels were stained with brilliant blue
G-colloidal dye. (C,D) The magnified regions of 2 DE gels shown as inset in panels A
and B. The arrowheads point to protein spots that decrease in intensity after
cycloheximide treatment; the arrows point to spots whose intensity increases after
cycloheximide treatment. The other information is presented as in the legend to Figure 1.
Figure 7. Two-dimensional electrophoretograms of cycloheximide inhibition of MPRO
cells. MPRO cells from cycloheximide inhibition experiment were also analyzed by basic
pH range 2 DE. MPRO cell lysate (1.5 × 10^6 cells/sample) was loaded for IPGs-PAGE
pH 6 to 11 and stained with brilliant blue G-colloidal dye. (A) Control MPRO cells. (B)
Cycloheximide-treated MPRO cells. (C,D) The magnified regions of 2 DE gels shown as
inset in panels A and B. The arrowheads point to protein spots that decrease in intensity
after cycloheximide treatment; the arrows point to spots whose intensity increases after
cycloheximide treatment. The other information is presented as in the legend to Figure 1.
Figure 8. Distribution of protein spots from cycloheximide experiment. In the
cycloheximide experiment, MPRO cells were treated with cycloheximide for 2 hours; the
untreated MPRO cells were used as a control. The protein inhibition patterns were
compared with those of the control cells by Melanie-II software. For each protein, the x-axis value represents the OD value without cycloheximide treatment. The y-axis represents the
OD value after cycloheximide treatment. Information on the proteins was collected in the
database dbMCp.
Table 4. Transcription factors analyzed by Northern blot assay
Symbol
AA407540
Bach1
Baz1b
Crem
Creb1
Cutl1
Hipk3
Maz
Mycbp
Nmi
pou2f1
Pou5f1
Pura
Rybp
Elf4
Sp1
Ncoa1
Fos
p202
p204
0 h 24 h
2
1
4
2
2
2
5
4
1
1
3
1
3
5
0
0
0
0
0
0
3
3
7
1
3
1
5
4
2
2
2
2
3
4
0
0
0
0
0
0
48 h
72 h
MPRO 72 h*
Human 60K*
1
4
3
0
2
3
3
2
2
2
1
2
5
1
0
0
0
0
0
0
1
4
3
0
2
3
3
2
2
2
1
3
8
1
0
0
0
0
0
0
169.81/P
20/A
20/A
20/A
20/A
20/A
26.97/A
20/A
20/A
20/A
20/A
20/A
81.62/P
20/A
20/A
20/A
27.24/P
20/A
N/A
N/A
N/A
679.32/P
679.32/P
821.08/P
233.22/P
125.96/P
41.14/P
1674.86/P
128.25/P
407.37/P
101.02/P
65.02/A
49.42/P
592.25/P
729.51/P
20/A
392.04/P
N/A
N/A
N/A
Band intensities at the different time points from the Northern blot assay were
semiquantified on a scale from 1 (+) to 8 (++++++++).
* The numbers in these columns are average differences in hybridization intensity
between the set of perfectly matched oligonucleotides and the set of mismatched
oligonucleotides in the oligonucleotide array. "A" indicates genes called absent and "P"
indicates genes called present in the Affymetrix chip assay. The other information is
presented as in the footnote to Table 2.
N/A indicates that the gene is not represented on the Affymetrix chips.
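The average difference described in the footnote can be illustrated with a minimal sketch. It assumes that each gene is represented by paired perfect-match (PM) and mismatch (MM) probe intensities and that the reported value is the plain mean of the pairwise differences. The function name and the intensity values are hypothetical and are not taken from the data in Table 4. The "P"/"A" call comes from the chip analysis software and is not reproduced here.

```python
from statistics import mean


def average_difference(pm, mm):
    """Return the mean of (PM - MM) over the probe pairs of one probe set.

    pm and mm are equal-length sequences of hybridization intensities for
    the perfectly matched and mismatched oligonucleotides, respectively.
    """
    if len(pm) != len(mm) or len(pm) == 0:
        raise ValueError("pm and mm must be non-empty and of equal length")
    return mean(p - m for p, m in zip(pm, mm))


# Hypothetical intensities for a single probe set (illustration only).
pm = [310.0, 295.0, 402.0, 350.0]
mm = [120.0, 140.0, 180.0, 150.0]
avg_diff = average_difference(pm, mm)
print(f"average difference: {avg_diff:.2f}")
```

A larger positive value means the perfect-match probes hybridized more strongly than their mismatch controls, which is why this quantity serves as the expression estimate in the two array columns.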
Figure 9. Northern blot analysis of selected mRNAs. Equivalent amounts of RNA
from MPRO cells induced by ATRA at different time points (0 hours, 24 hours, 48 hours,
and 72 hours) were resolved by formaldehyde-agarose gel electrophoresis and stained to
verify the amount of loading. Twenty transcription factor genes were separately probed
on the RNA filters. The gene symbol of each probe is listed at the left of the
corresponding Northern blot result. One of the RNA-blotted membrane photographs is shown
with methylene blue-stained 28S and 18S rRNA, demonstrating the quality and
quantity of RNA loaded in individual lanes.
Appendix
This section lists the genes described in this paper, including those in the figures, tables, and text.
AA589396: dendritic cell protein; Abpa: androgen-binding protein: subunit alpha; Actb:
put. Beta-actin; Actg: actin, gamma, cytoplasmic; Aldh2: aldehyde dehydrogenase 2,
mitochondrial; Anxa1: lipocortin I protein annexin 1; Anxa11: annexin A11; Anxa2:
annexin II calpactin I heavy chain; Anxa3: annexin A3; Arhgdib: RHO GDP-dissociation
inhibitor 2 (RHO GDI2); Arhn: rho7; Arpc3: actin-related protein 2/3 complex, subunit 3
(21 kDa); Arp2/3 complex subunit p21-Arc; Atp5a1: ATP synthase, H+ transporting,
mitochondrial F1 complex, alpha subunit, isoform 1; Atp5b: ATP synthase, H+
transporting mitochondrial F1 complex, beta subunit; C4: MHC complement component
C4; Cab140: 170 kDa glucose regulated protein GRP170 precursor; Cappa1: F-actin
capping protein alpha-1 subunit; Cas1: catalase 1; Cct2: chaperonin containing TCP-1
beta subunit; Cct5: chaperonin subunit 5 (epsilon); Cct6a: chaperonin subunit 6a (zeta);
Cftr: cystic fibrosis transmembrane conductance regulator homolog; Coro1a: coronin,
actin-binding protein 1A; Crmp1: collapsin response mediator; Ddx5: DEAD
(aspartate-glutamate-alanine-aspartate) box polypeptide 5; ECP: EndoA' cytokeratin (5' end
put.); putative; Eef1a1: eukaryotic translation elongation factor 1 alpha 1; Eef2: elongation
factor 2; Ehd1: "EH-domain containing 1, PAST, HPAST, H-PAST"; Eno1: alpha
enolase; Ephb2: protein-tyrosine kinase (EC 2.7.1.112) sek-3, Eph receptor A4; Es10:
sid478p/Esterase 10; Fut4: fucosyltransferase 4; G4-1-pending: phosphatase subunit gene
g4-1; Gapd: glyceraldehyde-3-phosphate dehydrogenase; Gc: vitamin D-binding protein
precursor; Gnb2-rs1: guanine nucleotide binding protein, beta-2, related sequence 1, p205,
Rack1, Gnb2l1, GB-like; Got2: glutamate oxaloacetate transaminase 2, mitochondrial;
mitochondrial aspartate aminotransferase; GROEL: chaperonin groEL precursor; Grp58:
glucose regulated protein, 58 kDa; endoplasmic reticulum protein; phospholipase C,
alpha; Gstm1: glutathione-S-transferase, mu1; H2-Ab1: histocompatibility 2, class II
antigen A, beta 1; Hcph: PTPN6 tyrosine phosphatase, me, hcp, PTPN6, Ptp1C, SHP-1;
Hgf: hepatocyte growth factor precursor; Hmg2: high mobility group protein 2; Hmgb1:
high mobility group protein 1; Hnrpa1: heterogeneous nuclear ribonucleoprotein A1;
Hnrpa2b1: heterogeneous nuclear ribonucleoprotein A2; heterogeneous nuclear
ribonucleoprotein A2/B1; Hnrph1: heterogeneous nuclear ribonucleoprotein H1; HPD76:
hypothetical protein DKFZp761C10121.1; Hsc70: dnaK-type molecular chaperone
hsc73/Heat shock protein cognate 70; Hsp110: heat shock protein, 110 kDa; Hspa5:
glucose-regulated protein, 78 kDa; Hspa5/Grp78: 78 kDa glucose-regulated protein
precursor (GRP 78); Hspa8: dnaK-type molecular chaperone hsc70; Idh1: isocitrate
dehydrogenase 1 (NADP+), soluble; Idh2: NADP+-specific isocitrate dehydrogenase;
IFI205: interferon-activatable protein 205; IGVAP: Ig Vkappa, antiphenyloxazolone; Il1f5:
interleukin 1 receptor antagonist homolog 1; Impdh2: inosine-5'-monophosphate
dehydrogenase; KER1: keratin 1; KER8: keratin 8, type II cytoskeletal, embryonic;
KER10: keratin 10, type I, cytoskeletal; KER14: keratin 8, type I cytoskeletal 14; KER47:
47 kDa keratin; KER59: keratin, 59K type I cytoskeletal; Krt2-6g: keratin, type II
cytoskeletal 6; KRZF80M: Kruppel-related zinc finger protein F80-M; KT14: keratin,
type I, cytoskeletal; Ldh1: lactate dehydrogenase 1, A chain; Lgals3: galectin-3; Lmnb1:
lamin B1; LOC56463: p100coactivator; Ltf: lactotransferrin precursor; Mor1: malate
dehydrogenase; Nme1: nucleoside diphosphate kinase A; Nsap1-pending: syncrip; P4hb:
protein disulfide-isomerase, PDI; Papss2/Atpsk2: ATP sulfurylase/APS kinase, 2: PAPS
synthetase; Pbp: hippocampal cholinergic neurostimulating peptide precursor protein,
phosphatidylethanolamine-binding protein; Pcna: proliferating cell nuclear antigen; Pdi4:
peptidyl arginine deiminase, type IV; PAD type IV; Pgk1: phosphoglycerate kinase 1;
Phb: prohibitin; Pk3: pyruvate kinase 3; Ppia: peptidylprolyl isomerase A; cyclophilin A;
Prdx1: proliferation-associated gene A, osteoblast specific factor 3; Prdx2: Antioxidant
protein 2; PRO2675: PRO2675; Psma7: proteasome (prosome, macropain) subunit, alpha
type 7, Proteasome subunit RC6-1; Psmc1: protease (prosome, macropain) 26S subunit,
ATPase 1; Psmc2: 26S protease regulatory subunit 7, MSS1 protein; Ptpn1: protein
tyrosine phosphatase; Pygm: muscle glycogen phosphorylase; Rac2: RAS-related C3
botulinum substrate 2, p21-Rac2, EN-7 protein; Rag1: recombination activating gene 1;
Ran: GTP-binding nuclear protein Ran (TC4); Rbm3: RNA binding motif protein 3; Rin:
RAS-like protein expressed in neuro; Rnf17: RING finger protein Mmip-2; RNPC: RNP
particle component; SAC: spectrin alpha chain; Sfmbt: Scm-related gene containing 4
mbt domains; Slc23a2: solute carrier family 23, (nucleobase transporters) member 1;
Sod1: putative peroxisomal antioxidant enzyme, superoxide dismutase 1; Spry1: sprouty
homolog 1 (Drosophila); Stat5a: signal transducer and activator of transcription 5A;
STEFIN3: stefin 3; Stip1: extendin/Stress-induced phosphoprotein 1; Taf2e: TATA box
binding protein (Tbp)-associated factor, RNA polymerase II, E; Tagln2: transgelin 2;
Taldo1: transaldolase; Tinag: tubulointerstitial nephritis antigen; Tpi: triosephosphate
isomerase, TIM; Tpm5: tropomyosin 5, cytoskeletal type; Tuba6: tubulin alpha 6; Tubb5:
tubulin, beta 5; Ube1x: ubiquitin-activating enzyme E1 X; Vav2: Vav2 oncogene; Vdac1:
voltage-dependent anion-selective channel protein 1; Vim: vimentin; Zfp101: zinc finger
protein 101; ZFP1A3: Aiolos/zinc finger protein, subfamily 1A, 3; Zfp354a: transcription
factor 17.