Abstract

Comparing mRNA Expression and Protein Abundance via Genomic and Proteomic Characteristics

Dov Greenbaum

2004

With the advent of high-throughput proteomic and genomic technologies, we now have appreciable quantitative mRNA expression and protein abundance levels for much of the yeast genome. While cellular mRNA and protein concentrations are clearly mechanistically related, the quantitative relationship between them may not be clear-cut. This thesis is an attempt to quantify the relationship by interrelating diverse expression data sets and additional external information. There are three important aspects to this analysis: (i) the data, particularly the protein data, are the result of various experimental methodologies; as such, the information was integrated judiciously by iteratively fitting the available data sets into reference sets; (ii) the data are inherently noisy; to minimize noise, broad categories (e.g., functional, structural, and interaction categories) were used to average the data points into more robust numbers; (iii) protein complexes, whose subunits occur in stoichiometrically equal amounts, can be used in simple but valuable illustrations of the relationship between protein products and their mRNA precursors. Overall, considerable agreement was found between mRNA expression and protein abundance in terms of the enrichment of structural and functional categories. This agreement, which was considerably greater than the simple correlation between these quantities for individual genes, reflects the way broad categories collect many individual measurements into simple, robust averages. In particular, it is shown that, with respect to the genome, the proteome is enriched in (i) the small amino acids Val, Gly, and Ala (high levels of these amino acids in proteins lead to more compact and more stable proteins); (ii) low molecular weight (i.e.
more cost efficient) proteins; (iii) proteins involved in protein cell structure and energy production; and is depleted in proteins that act as molecular switches (i.e., transcription and cell growth). mRNA expression levels are also shown to correlate well among the members of permanent protein complexes. Generally, permanent complexes, such as the ribosome and proteasome, are shown to have a particularly strong relationship with mRNA expression, while transient ones do not. However, several transient complexes, such as the RNA polymerase II holoenzyme and the replication complex, can be subdivided into smaller permanent ones, which do have a strong relationship to gene expression.

Comparing mRNA Expression and Protein Abundance via Genomic and Proteomic Characteristics

A Dissertation Presented to the Faculty of the Graduate School of Yale University in Candidacy for the Degree of Doctor of Philosophy

by Dov Greenbaum

Dissertation Director: Mark Gerstein

May 2004

© 2004 by Dov Greenbaum
All rights reserved.

Table of Contents

Table of Contents
List of Figures and Tables
Acknowledgments
Introduction
Chapter 1: Integrating Genomic Data Sets
  1.1 Interrelating Different Types of Genomic Data, from Proteome to Secretome: 'Oming in on Function
    Abstract
    Introduction
    The Path to Function is Filled with 'omes
    Computational Methods for Defining 'omes
    Experimental Methods for Defining 'omes
    Interrelating Different 'omes
    The Use of Broad Categories to Interpret Noisy Data
    A Case Study: Interrelating the Transcriptome and the Translatome
    Conclusion
    Figures and Tables
    References
Chapter 2: mRNA Expression and Protein Abundance
  2.1 Analysis of mRNA Expression and Protein Abundance Data: An Approach for the Comparison of the Enrichment of Features in the Cellular Population of Proteins and Transcripts
    Abstract
    Introduction
    Methods
    Data set scaling
    Enrichment of features
    Results
    Application to semi-quantitative protein abundance data sets
    Discussion and conclusion
    Acknowledgments
    Figures and Tables
    References
  2.2 Comparing Protein Abundance and mRNA Expression Levels on a Genomic Scale
    Abstract
    Introduction
    Two-dimensional electrophoresis
    Mass spectrometric approaches
    Comparison of mRNA and protein levels
    Acknowledgments
    Figures and Tables
    References
Chapter 3: mRNA Expression and Protein-Protein Interactions
  3.1 Relating Whole-Genome Expression Data with Protein-Protein Interactions
    Abstract
    Introduction
    Results
    Discussion and conclusion
    Methods
    Efficient calculation of the average correlations
    Kinetic model of the relationship between protein and mRNA concentration
    Acknowledgments
    Figures
    References
Appendix: Change in mRNA Expression vs. Change in Protein Abundance Levels
  Genomic and Proteomic Analysis of the Myeloid Differentiation Program: Global Analysis of Gene Expression During Induced Differentiation in the MPRO Cell Line
    Abstract
    Introduction
    Materials and methods
    Results
    Discussion
    Acknowledgments
    Figures and Tables
    References
    Appendix

List of Figures and Tables

Chapter 1: Integrating Genomic Data Sets
  1.1 Interrelating Different Types of Genomic Data, from Proteome to Secretome: 'Oming in on Function
    Figure 1 An overview of the current 'omic terminology
    Figure 2 Interrelating the transcriptome and the translatome
    Table 1 A Table of 'omes
Chapter 2: mRNA Expression and Protein Abundance
  2.1 Analysis of mRNA expression and protein abundance data
    Figure 1 Schematic overview of the analysis
    Figure 2 mRNA expression levels vs. protein abundance levels
    Figure 3a-c Amino acid and biomass enrichment
    Figure 3d Statistical significance
    Figure 4 Breakdown of the transcriptome and translatome in terms of broad categories relating to structure, localization, and function
  2.2 Comparing protein abundance and mRNA expression levels on a genomic scale
    Table 1 Proteomic Technologies
    Figure 1 Comparison of mRNA expression and protein abundance.
    Figure 2 The differences in correlation between mRNA and protein expression values using novel categories.
Chapter 3: mRNA Expression and Protein-Protein Interactions
  3.1 Relating whole-genome expression data with protein-protein interactions
    Figure 1 Distributions of normalized differences for various groups of proteins in boxplot representation.
    Figure 2 Distributions of correlation coefficients between expression profiles
    Figure 3a Various key statistics
    Figure 3b Graphical representation of part of the protein complex statistics
    Figure 4 Representation of the replication complex and its components
Appendix: Change in mRNA Expression vs. Change in Protein Abundance Levels
    Figure 1 Two-dimensional electrophoretograms of wide pH range of MPRO cells.
    Figure 2 Two-dimensional electrophoretograms of MPRO cells in pH range 4 to 7.
    Table 1 Distribution of protein spots identified during myeloid differentiation
    Table 2 Protein species represented by multiple spots
    Table 3 Classification of known proteins
    Figure 4 Protein clusters according to their expression patterns.
    Figure 5 The correlation between the mRNA difference at 0 and 72 hours and the corresponding protein difference.
    Figure 6 Two-dimensional electrophoretograms of cycloheximide inhibition of MPRO cells.
    Figure 7 Two-dimensional electrophoretograms of cycloheximide inhibition of MPRO cells.
    Figure 8 Distribution of protein spots from cycloheximide experiment
    Table 4 Transcription factors analyzed by Northern blot assay

Acknowledgments

I have thoroughly enjoyed working in Mark Gerstein's lab over the past number of years. Through interactions with Mark and other lab members I have grown in my understanding and appreciation of science in general and, in particular, gained a substantial understanding of bioinformatics and genetics.

Many colleagues have contributed, either directly or indirectly, to this dissertation. These include various coauthors, confidants, and mentors. In particular, I would like to thank: Ronald Jansen, Yuval Kluger, Haiyuan Yu, Nick Luscombe, Hedi Hegyi, Jiang Qian, Jimmy Lin, Paul Bertone, Lian Zheng, David Tuck, Jochen Junker, Rajdeep Das, Sambath Chung, Mike Snyder, Nevan Krogan, Al Edwards, Andrew Emili, Bart Kus, Jack Greenblatt, Ken Williams, Christopher Colangelo, John Karro, Xiaowei Zhu, and the entire Gerstein lab.

I would like to thank Drs. Sherman Weissman and Kevin White for serving on my research committee. Both Sherman and Kevin have been a stimulating force in my research; their comments and suggestions have proven invaluable.

I would also like to thank my department for all their help and support over the past six years; in particular, Betsy Jasiorkowski and Michael Stern have shown excessive patience in helping me.

It is impossible to overstate my gratitude to my advisor, Mark Gerstein. He has guided me through the process of exploring a new scientific field. He not only helped me to develop thorough scientific judgment, but also taught me about many of the practical aspects of doing science.

I want to thank my parents, Drs. Cheryl and Joseph Greenbaum; my brothers, Eli, Yale, Moshe, Rafi, and Ari; and my in-laws, The Honorable and Mrs. Simon Gluck, for their help and continuing support through many years.

My daughter, Liana Tova, eclipses all as the source of my greatest pride and joy.
Her smile lights up the room and prevents me from doing my work.

Finally, I want to express my deepest gratitude to my eishes chayil, Sabrina. She has always been there for me, and has graciously allowed, and continues to allow, me to prolong my education and the pursuit of knowledge. She has been an awe-inspiring source of love and of intellectual and moral support.

Introduction

A central and integral biological process in every cell is the faithful transition from DNA, through an mRNA intermediary, to the final protein product. The cell exquisitely controls every step of the process, the result being the desired concentration of functional proteins. We understand this process on a biological level: it is obvious that the population of mRNA leads to the total protein complement of the cell. With the advent of high-throughput genomic methodologies for measuring mRNA expression and protein abundance, it is now also possible to analyze and accurately measure the relationship between mRNA and protein quantitatively.

Simplistically, we can view this relationship as the consequence of the translation of mRNAs and the degradation of proteins, i.e.,

$$\frac{dP_i}{dt} = k_{s,i}\,\mathrm{mRNA}_i - k_{d,i}\,P_i$$

where $k_{s,i}$ is the rate of translation and $k_{d,i}$ represents the rate of degradation for gene $i$. Thus, at steady state:

$$P_i = \frac{k_{s,i}\,\mathrm{mRNA}_i}{k_{d,i}}$$

When I first began my research, $k_s$ was, for the most part, unknown. Presently, $k_d$ is still unknown on a genomic scale.

Through investigating and analyzing this relationship we gain a broader understanding of the cellular mechanisms and controls used in synthesizing the protein population. Additionally, given the large discrepancy in data quality and availability between mRNA expression and protein abundance, it is helpful to understand the relationship between the two populations: difficult-to-measure protein abundance levels may possibly be predicted from mRNA expression data and other associated information sources.
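The steady-state relation above can be checked numerically. The following is only a minimal sketch under stated assumptions: the rate constants and the mRNA level are hypothetical values chosen for illustration (they are not taken from any of the data sets discussed in this dissertation), and the function names are my own. It integrates the rate equation with simple forward Euler steps and compares the result with the closed-form steady state.

```python
def steady_state_protein(k_s, k_d, mrna):
    """Closed-form steady state: P = k_s * mRNA / k_d for a single gene."""
    return k_s * mrna / k_d


def simulate_protein(k_s, k_d, mrna, p0=0.0, dt=0.01, steps=5000):
    """Integrate dP/dt = k_s * mRNA - k_d * P with forward Euler steps."""
    p = p0
    for _ in range(steps):
        p += dt * (k_s * mrna - k_d * p)
    return p


# Hypothetical values, for illustration only (arbitrary units).
k_s, k_d, mrna = 2.0, 0.5, 10.0

p_numeric = simulate_protein(k_s, k_d, mrna)
p_analytic = steady_state_protein(k_s, k_d, mrna)
print(p_numeric, p_analytic)
```

With a constant mRNA level, the simulated protein level relaxes toward the closed-form value at a rate set by the degradation constant, which is one reason the lack of genome-wide degradation rates limits how well protein levels can be predicted from mRNA alone.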
One of the goals of bioinformatics is to provide robust methodologies for analyzing the data derived from high-throughput experimentation, and then to extract biological insights from the data. Data from high-throughput experimentation are often noisy. One can minimize the effect of the noise on an analysis in a number of ways: first, by integrating multiple data sources and observations; and second, by integrating additional tangential resources.

This dissertation encompasses previously published research focusing on the correlation between mRNA and protein levels in Saccharomyces cerevisiae. Each chapter represents an important part of the analysis of the relationship.

Chapter 1 introduces the concept of cellular populations as defined not only by their physical constitution but also, in a more novel sense, by their function. This differentiation of the cellular protein complement into distinct categories, or 'omes, is instrumental in my analysis of correlations between mRNA and protein abundance. I also present an initial analysis of the relationship between mRNA and protein levels.

Chapter 2 represents a formalization of the problem presented in the first chapter. Given some of the limitations inherent in the data (e.g., the size and quality of the data sets), I have devised a methodology for merging the current mRNA and protein data sets into larger and more reliable reference data sets. I then analyze the relationship between mRNA and protein population levels in the cell, specifically as it relates to a number of broad categories, including secondary structure, function, and subcellular localization, and particularly with regard to well-defined gene populations. I show that biologically relevant insights can be discerned through my methods.

Chapter 2.2 presents a second look at the relationship between mRNA and protein levels using a newer, larger, and more reliable data set.
I also looked at additional novel categories with which to compare protein and mRNA levels. These included ribosomal occupancy levels for each mRNA species, the Codon Adaptation Index, and the variability of mRNA expression as measured by the coefficient of variation.

Chapter 3 expands the original analysis of mRNA and protein levels by investigating correlations among the proteins involved in binary interactions and in complexes. Assuming that there is a relationship between protein and mRNA levels in the cell, one would hope to find that pairs and groups of proteins that are thought to exist in the cell at similar protein concentrations also have similar concentrations of mRNA. My analysis has shown that, while proteins in binary interactions do not have, on average, mRNA levels similar to those of their interaction partners (initial protein abundance data show similar results), proteins that interact together in complexes do tend to have similar overall levels of mRNA. These results provide further evidence that a relationship between mRNA and protein expression levels in yeast cells can be quantified.

In addition to setting up a preliminary cDNA microarray facility with Professor Arch Perkins, I further attempted to enhance my understanding of the experimental techniques behind mRNA expression and protein abundance determination through extensive hands-on work in deciphering two-dimensional gels. This work also provided me with an appreciation of the efforts necessary to consistently and accurately determine protein abundance levels. This analysis, as described in the appendix, involves a proteomic analysis of myeloid differentiation in a murine promyelocytic (MPRO) cell line. In particular, I investigated the relationship between mRNA and protein in terms of simultaneous changes in their levels over multiple time points. This is the first time such a relationship has been studied.
These data sets gave a much stronger correlation than previous analyses involving only a single time point. This result is consistent with the hypothesis that a substantial proportion of protein change is a consequence of changed mRNA levels, rather than posttranscriptional effects.

References

Greenbaum, D., Jansen, R. & Gerstein, M. Analysis of mRNA expression and protein abundance data: an approach for the comparison of the enrichment of features in the cellular population of proteins and transcripts. Bioinformatics 18, 585-96 (2002).

Greenbaum, D., Colangelo, C., Williams, K. & Gerstein, M. Comparing protein abundance and mRNA expression levels on a genomic scale. Genome Biol 4, 117 (2003).

Greenbaum, D., Luscombe, N. M., Jansen, R., Qian, J. & Gerstein, M. Interrelating different types of genomic data, from proteome to secretome: 'oming in on function. Genome Res 11, 1463-8 (2001).

Jansen, R., Greenbaum, D. & Gerstein, M. Relating whole-genome expression data with protein-protein interactions. Genome Res 12, 37-46 (2002).

Lian, Z., Kluger, Y., Greenbaum, D. S., Tuck, D., Gerstein, M., Berliner, N., Weissman, S. M. & Newburger, P. E. Genomic and proteomic analysis of the myeloid differentiation program: global analysis of gene expression during induced differentiation in the MPRO cell line. Blood 100, 3209-20 (2002).

Chapter 1: Integrating Genomic Data Sets

1.1 Interrelating Different Types of Genomic Data, from Proteome to Secretome: 'Oming in on Function

Abstract

With the completion of genome sequences, the current challenge for biology is to determine the functions of all gene products and to understand how they contribute to making an organism viable. For the first time, biological systems can be viewed as being finite, with a limited set of molecular parts. However, the full range of biological processes controlled by these parts is extremely complex.
Thus, a key approach in genomic research is to divide the cellular contents into distinct sub-populations, which are often given an "-omic" term. For example, the proteome is the full complement of proteins encoded by the genome, and the secretome is the part of it secreted from the cell. Carrying this further, I suggest the term "translatome" to describe the members of the proteome weighted by their abundance, and the "functome" to describe all the functions carried out by these. Once the individual sub-populations are defined and analyzed, I can then try to reconstruct the full organism by interrelating them, eventually allowing for a full and dynamic view of the cell. All this is, of course, made possible by the increasing amount of large-scale data resulting from functional genomics experiments. However, there are still many difficulties resulting from the noisiness and complexity of the information. To some degree, these can be overcome by averaging within broad proteomic categories, such as those implicit in functional and structural classifications. For illustration, I discuss one example in detail, interrelating transcript and cellular protein populations (transcriptome and translatome). Further information is available at http://bioinfo.mbb.yale.edu/what-is-it.

Introduction

"[It] does not consist of individuals, but expresses the sum of interrelations, the relations within which these individuals stand."
adapted from Karl Marx, Grundrisse (1857) (Marx, 1857).

The raw data produced by genome sequencing projects currently provide little insight into the precise workings of an organism at the molecular level (Luscombe et al., 2001). Therefore, the goal of functional genomics is to complement the genomic sequence by assigning useful biological information to every gene.
Through this, I aim to improve my understanding of how the different biological molecules contained within the cell (i.e., DNA, RNA, proteins, and metabolites) combine to make the organism viable. Clearly, the main challenge is the elucidation of all molecular, cellular, and physiological functions of each gene product. However, there are many subsidiary goals as part of this challenge, such as defining the three-dimensional structures of these macromolecules, their subcellular localizations, intermolecular interactions, and expression levels.

Although gathering and classifying the necessary information is central to this process, it is impractical to rely on individual experiments for the potentially thousands of genes in each organism. Furthermore, with large-scale proteomic experiments yet to be used widely, computational techniques, while sometimes based on less than ideal information, provide a crucial resource for assigning biological data.

The paper by Antelmann et al. in this issue of Genome Research (Antelmann et al., 2001) evaluates their earlier attempts to assign protein functions through computational means. Previously, the group used computational methods to predict all exported proteins (or members of the secretome) in Bacillus subtilis by searching for signal peptides and cell retention signals in the protein sequences. A better understanding of how and why a protein is secreted is valuable, as the bacterium's ability to export numerous enzymes enables it to degrade extracellular substrates and survive in a continuously changing environment. Moreover, it will eventually allow these bacteria to be employed as "cellular factories" for secreting commercially valuable proteins in large quantities (Tjalsma et al., 2000).
Antelmann et al.'s present paper aims to verify their previous predictions by experimentally characterizing the entire population of secreted proteins using 2D gel electrophoresis and mass spectrometry. They showed that the original predictions correctly identified about 50% of all secreted proteins. Most of the disagreements were due to the inability to predict the secretion of proteins lacking the appropriate signal, or of those containing seemingly inappropriate signals (cell retention signals). In summary, Antelmann et al.'s work highlights the encouraging aspects of computational assignments of biological data and reveals some of the shortcomings in the current methods.

The Path to Function is Filled with 'omes

To describe their studies, Antelmann et al. coined the term "secretome". This 'omic term is an example of the new lexicon that has appeared recently to define the varied populations and sub-populations in the cell (Fig. 1). These terms are generally suffixed with "-ome", with an associated research topic of "-omics". Broadly, the existing 'omes can be divided into those that represent a population of molecules and those that define their actions (Fig. 1).

For the first category, populations provide an inventory or "parts list" of molecules contained within an organism (Gerstein and Hegyi, 1998; Qian et al., 2001; Skolnick and Fetrow, 2000; Vukmirovic and Tilghman, 2000). The genome, the entire DNA sequence of an organism, presents a basis for defining the proteome, a list of coding DNA regions that result in protein products. Transcription of these coding sequences produces the transcriptome (Velculescu et al., 1997), which is the cellular complement of all mRNA under a variety of cellular conditions. Note that this population is weighted by the expression level of each molecule and, ideally, should incorporate the results of alternative splicing.
Following translation of the transcriptome, I suggest the term "translatome" to describe the cellular population of proteins expressed in the organism at a given time, explicitly weighted by their abundance. It is important to note that, whereas the memberships of the genome and proteome are virtually static, the transcriptome and translatome are dynamic and continually change in response to internal and external events. Additional 'omes describe the presence of molecules that are not encoded by the genome but are nonetheless essential, for instance, the metabolome (Tweeddale et al., 1998). Because of the newness of most 'omic terms, a few still have competing definitions. This is most evident for the proteome (see Table 1).

The second group of 'omes is smaller and describes the actions of the protein products. For example, the secretome is a subset of the proteome that is defined by its action, that is, it is actively exported from the cell. The interactome (Sanchez et al., 1999) lists all of the specific interactions that are made between macromolecules in the cell. More abstractly, the regulome (Web references only; see Table 1) defines the genome-wide regulatory network of the cell and most notably includes transcription regulation pathways.

The elucidation of each of these 'omes contributes to the ultimate goal of functional genomics, defining the functome, which describes all of the functions that are assigned to each gene in the genome (Rison, 2000; http://www.biochem.ucl.ac.uk/~rison). The functions of a gene can be described at many levels, including their biochemical, cellular, and physiological roles (Ashburner et al., 2000), and also depend on additional factors that are not immediately associated with their basic functions, such as subcellular localization and intermolecular interactions.
Therefore, aspects of the functome may be expressed in terms of other 'omes: those that group similar biochemical functions, such as the immunome (Pederson, 1999); similar localizations, such as the secretome; and similar interactions, such as the interactome. For the record, I coin my own term here; at present, a large proportion of genes can only be described as members of the "unknome": those for which there is currently no functional information!

Computational Methods for Defining 'omes

There are a variety of computational approaches for defining 'omes (Gerstein and Honig, 2001):

(1) Algorithmic methods for predicting genes, protein structure, interactions, or localization based on patterns in individual sequences or structures; for example, defining the proteome or orfeome using a gene-finding algorithm on the genome (Claverie, 1997; Guigo et al., 2000; Harrison et al., 2001; Yeh et al., 2001), determining the foldome from structure prediction of the proteome (Simons et al., 2001), determining the interactome from the foldome, using known binding sites (Teichmann et al., 2001), and determining the secretome through identifying signal sequences in the proteome (Tjalsma et al., 2000).

(2) Annotation transfer through homology, that is, inferring structure or function based on sequence and structural information of homologous proteins (Gerstein, 1997; Gerstein, 1998; Hegyi and Gerstein, 1999; Hegyi et al., 2002; Thornton, 2001; Wilson et al., 2000).

(3) Using a "guilt-by-association" method based on clustering, where functions or interactions are inferred from clusters of functional genomic data, such as expression information. For example, similar functions can sometimes be inferred through interactions with other proteins or similar expression profiles (Eisen et al., 1998; Gerstein and Jansen, 2000; Ito et al., 2001; Marcotte et al., 1999).
Experimental Methods for Defining 'omes

Although still in their infancy, several large-scale experimental techniques are designed to assess the nature of different 'omes. Gene expression studies are now well established, and microarray or GeneChip technologies can be used to measure mRNA abundance in the cell and hence define the transcriptome (Epstein and Butow, 2000). Detection of protein concentration and definition of the translatome is more difficult, however, as evidenced by the dearth of such data. At present, the most prominent method employs two-dimensional electrophoresis to isolate proteins, followed by mass spectrometry for their identification (Futcher et al., 1999; Gygi et al., 1999; Naaby-Hansen et al., 2001) and then by quantification (Aebersold et al., 2000; Appel et al., 1997; Gygi et al., 2000). The two-hybrid system enables detection of specific protein-protein associations to build the interactome (Ito et al., 2001; Uetz et al., 2000; Walhout and Vidal, 2001). Antelmann et al. (2001) used two-dimensional electrophoresis to determine the membership of the secretome. Given the goal of determining the functome, perhaps the most exciting technology is the protein chip system, which is capable of high-throughput screening of protein biochemical activity (Zhu et al., 2001; Zhu et al., 2000). Other methods for obtaining large-scale protein functional characterization include a transposon insertion methodology (Ross-Macdonald et al., 1999; Zhu et al., 2001; Zhu et al., 2000).

Although I discuss the computational and experimental methods separately, there is, in fact, an inseparable relationship between the two. On the one hand, data resulting from high-throughput experimentation require intensive computational interpretation and evaluation (Carson et al., 2001).
On the other hand, computational methods use empirical data to build a knowledge base for predictions. Furthermore, they sometimes produce questionable predictions that should be reviewed and confirmed through experiments, as Antelmann et al. point out.

In addition to these high-throughput techniques, another interesting tactic is to aggregate the results of individual experiments through comprehensive literature searches. Although there clearly are difficulties with differing experimental conditions and varying interpretations, preliminary results have shown this to be an effective method (Jensen, 2001; Marcotte et al., 2001; Ono et al., 2001).

Interrelating Different 'omes

Once the organism has been categorized into different sub-populations, a fundamental approach in genomics is to establish relationships between the different 'omes. In other words, by piecing the individual 'omes together, I hope to build a full and dynamic view of the complex processes that support the organism. For example, how do the proteome and regulome combine to produce the translatome? As with defining the 'omes, these relationships can be explored in different ways:

(1) Defining or assigning one 'ome based on another, as described above.

(2) Comparing one 'ome with another to better understand the processes that shift one population into its successor. For instance, this could be done by correlating expression measurements for the transcriptome and translatome (see below).

(3) Calculating "missing" (experimentally unattainable) information in one 'ome based on information in another one; for example, using the known relationships between gene expression level and subcellular location to help predict the destination of proteins of unknown localization (Drawid and Gerstein, 2000; Drawid et al., 2000).

(4) Describing the intersection between multiple populations.
For example, combining data from the transcriptome and the functome could describe the array of biochemical, and potentially physiological, functions that are available to the cell at any given time (Hegyi and Gerstein, 1999).

The Use of Broad Categories to Interpret Noisy Data

Functional genomics experiments generally give rise to very complicated data that are inherently hard to interpret. Furthermore, these data are often plagued with noise (Kerr et al., 2000). Both factors can lead to inaccuracies and conflicting interpretations. A good example is gene expression measurements, which are known to fluctuate between experiments even if the conditions are apparently identical (Baldi and Long, 2001). These fluctuations are often due to measurement errors, but there are also inherent biological variations of expression levels, relating to the stochastic nature of gene expression (Szallasi, 1999). One cause is the very low cellular concentration of many transcription factors, meaning that they bind promoters very rarely. Such events approximate a Poisson process, and in fact, macroscopic chemical kinetics would fail to describe the resulting expression level of the gene (McAdams and Arkin, 1999; Thattai and van Oudenaarden, 2001). In another example, the interactome, when determined using the yeast two-hybrid technique, is notorious for false positives and negatives (Ito et al., 2001; Ito et al., 2000; Legrain et al., 2001; Serebriiskii et al., 2000).

A useful way to tackle the noise and complexity of functional genomics information is to average the data from many different genes into broad 'omic categories (Jansen and Gerstein, 2000). For instance, instead of looking at how the level of expression of an individual gene changes over a timecourse, I can average all the genes in a functional category (e.g., glycolysis) together. This gives a more robust answer about the degree to which a functional system changes over the timecourse.
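The effect of this averaging can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy gene names, the two category labels, the "true" levels, the Gaussian noise model, and the helper name `category_means` do not come from any of the datasets discussed here.

```python
import random
import statistics


def category_means(values, categories):
    """Average per-gene measurements within each broad category.

    values:     dict mapping gene name -> measured level
    categories: dict mapping gene name -> category label
    """
    grouped = {}
    for gene, value in values.items():
        grouped.setdefault(categories[gene], []).append(value)
    return {cat: statistics.mean(vals) for cat, vals in grouped.items()}


# Hypothetical toy data: 200 genes split evenly over two made-up categories.
rng = random.Random(0)  # fixed seed so the sketch is reproducible
true_level = {"glycolysis": 10.0, "transcription": 2.0}
categories = {
    f"gene{i}": ("glycolysis" if i % 2 == 0 else "transcription")
    for i in range(200)
}

# Each single-gene measurement is the category's true level plus heavy noise.
measured = {
    gene: true_level[cat] + rng.gauss(0.0, 3.0)
    for gene, cat in categories.items()
}

means = category_means(measured, categories)
for cat, mean in sorted(means.items()):
    print(f"{cat}: category mean {mean:.2f} (true level {true_level[cat]})")
print(f"one single-gene value, for contrast: {measured['gene0']:.2f}")
```

With 100 genes per category, the standard error of each category mean is roughly a tenth of the single-gene noise, which is the sense in which the category average is the more robust number.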
Likewise, if one wants to investigate the relationship between a gene's essentiality, that is, whether or not it is essential (Winzeler et al., 1999), and its subcellular localization, it might be useful to combine the results for all proteins in the same compartment. This would give the average degree of essentiality of all nuclear proteins, cytoplasmic proteins, and so forth. In an actual study for predicting protein subcellular localization, I obtained more accurate predictions for the overall populations (96% accuracy) of a given subcellular compartment than for individual genes (75% accuracy) (Drawid et al., 2000).

Thus, the strength of genomic studies lies in the global comparisons between biological systems rather than detailed examination of single genes or proteins. Genomic information is often misused when applied exclusively to individual genes. If one is interested only in one particular gene, there are many more conclusive experiments that should be consulted before using the results from genomics datasets. Therefore, genomic data should not be used in lieu of traditional biochemistry, but as an initial guideline to identify areas for deeper investigation and to see how those results fit in with the rest of the genome.

Moreover, most genomics datasets give relative rather than absolute information, which means that information about a single gene has little meaning in isolation. For example, such datasets are best used to identify "outlier" genes that are particularly highly expressed, or have especially many interactions, rather than to focus on the individual measurements for a particular gene. A gene whose product makes a particularly large number of interactions may, for instance, be a key component of the cell.
One numerical technique that is particularly useful for dealing with this information is expressing results through ranks (i.e., not giving the number of interactions of a particular gene product, but how it ranks when compared with others). Furthermore, ranking provides a powerful way to combine many different heterogeneous sources of information into a common and statistically robust numerical framework (Gerstein and Hegyi, 1998; Gerstein and Levitt, 1997; Qian et al., 2001).

These observations should be kept in mind when interacting with genomics tools and databases. Many websites focus on providing a lot of information for a single gene sequence or protein, in a "non-genomic" fashion. Instead, such sites should be designed to simultaneously display and manipulate large populations of genes. In the absence of such an 'omic interface, it is important that information resources at least accommodate bulk downloading of standardized data.

A Case Study: Interrelating the Transcriptome and the Translatome

A specific example of comparing the transcriptome and translatome will illustrate the points I made about interrelating 'omes and using categories to interpret noisy data. The question here is to what degree highly expressed genes (transcriptome) correspond to highly abundant proteins (translatome). I can get very different answers depending on the perspective I take:

Theoretical View

Considering the entire mRNA and protein populations, the change in the concentration of a protein over time is equal to its rate of translation minus its rate of degradation. Borrowing from chemical kinetics, this is approximately expressed by the equation

$$\frac{dP(i,t)}{dt} = S\,E(i,t) - D\,P(i,t),$$

where P(i,t) is the abundance of protein i at time t, E(i,t) is the corresponding mRNA expression level, S is a general rate of protein synthesis per mRNA, and D is a general rate of protein degradation per protein.
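To make the behaviour of this model concrete, the sketch below integrates dP/dt = S·E − D·P for a single gene with a simple forward-Euler scheme and compares the result with the steady state P = S·E/D obtained by setting the left-hand side to zero. The function names, the constant expression level, the step size, and the numerical values of S, D and E are all assumptions chosen for illustration; none of them are measured quantities.

```python
def steady_state(expression, synthesis_rate, degradation_rate):
    """Steady-state abundance from 0 = S*E - D*P, i.e. P = S*E / D."""
    return synthesis_rate * expression / degradation_rate


def simulate_protein(expression, synthesis_rate, degradation_rate,
                     p0=0.0, dt=0.01, steps=5000):
    """Forward-Euler integration of dP/dt = S*E - D*P with constant E.

    p0 is the initial protein abundance; dt and steps are arbitrary
    illustrative choices (total simulated time = dt * steps).
    """
    p = p0
    for _ in range(steps):
        p += dt * (synthesis_rate * expression - degradation_rate * p)
    return p


# Hypothetical parameter values, in arbitrary units.
E, S, D = 3.0, 2.0, 0.5
print("simulated abundance: ", simulate_protein(E, S, D))
print("analytic steady state:", steady_state(E, S, D))
```

Under these assumptions the simulated abundance relaxes toward S·E/D, so that with S and D shared across genes, protein abundance at steady state is proportional to mRNA expression.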
Obviously, this is highly simplified, and in a more general context one would expect the rates of synthesis and degradation to differ for each gene and to depend on the regulatory effects of other genes over time. In addition, the equation does not take into account the stochastic nature of gene expression (see above) (Chen et al., 1999).

Direct Comparison of Individual mRNA and Protein Data

At the moment, I do not have good enough data to apply models such as the equation above. However, there is an intuitive sense that highly expressed genes correspond to highly abundant proteins. (One can see this by imagining the situation at steady state, when the left-hand side of the equation is zero and a positive correlation between E and P results.) Figure 2A shows the direct comparison between raw measurements of mRNA expression and protein abundance data for 181 genes in yeast drawn from two recent studies (Futcher et al., 1999; Gygi et al., 1999). The two variables show a high degree of variation for individual data pairs, and investigators have come to different conclusions about the general correlation between them. This is, to some degree, dependent on the subjective way of analyzing the data.

Analysis of the Data in Terms of Categories

Although the relationship between mRNA and protein levels is vague for individual genes, some of the statistics for broad categories of protein properties are much more robust. Figure 2B shows the protein secondary structure and functional composition in the genome, the transcriptome (i.e., weighted by mRNA abundance), and in the translatome (i.e., weighted by protein abundance). In contrast to the differences between mRNA and protein data for individual genes, the broad categories show that the transcriptome and translatome populations are remarkably similar; both contain roughly the same proportions of secondary structure and functional categories.
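The weighting that distinguishes these three compositions can be sketched as follows. This is only an illustration under stated assumptions: the four gene names, their category labels, the mRNA levels, and the helper names `composition` and `enrichment` are hypothetical and are not the data or code behind Figure 2B.

```python
def composition(categories, weights=None):
    """Fraction of a population that falls into each category.

    categories: dict mapping gene -> category label
    weights:    optional dict mapping gene -> abundance; if omitted,
                every gene counts once (the "genome" composition).
                Genes without a weight contribute nothing.
    """
    totals = {}
    for gene, cat in categories.items():
        w = 1.0 if weights is None else weights.get(gene, 0.0)
        totals[cat] = totals.get(cat, 0.0) + w
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("total weight is zero; composition is undefined")
    return {cat: total / grand_total for cat, total in totals.items()}


def enrichment(weighted, unweighted):
    """Ratio of each category's weighted share to its unweighted share."""
    return {cat: weighted[cat] / unweighted[cat] for cat in unweighted}


# Hypothetical toy data: four genes, two categories, made-up mRNA levels.
gene_category = {"g1": "energy", "g2": "energy",
                 "g3": "transcription", "g4": "transcription"}
mrna_level = {"g1": 30.0, "g2": 50.0, "g3": 15.0, "g4": 5.0}

genome_comp = composition(gene_category)                     # each gene counted once
transcriptome_comp = composition(gene_category, mrna_level)  # weighted by mRNA level

print("genome:       ", genome_comp)
print("transcriptome:", transcriptome_comp)
print("enrichment:   ", enrichment(transcriptome_comp, genome_comp))
```

A translatome composition would be obtained the same way by passing protein abundances as the weights, restricted to the genes for which such measurements exist.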
Moreover, this contrasts with the genome, which appears to have a distinctly different composition of functional categories. This illustrates that I get a more consistent picture when I average across the population; that is, there is broad similarity between the characteristics of highly expressed mRNA and highly abundant proteins.

Conclusion

The ultimate goal of genomics is the elucidation of the functome, but there are many intermediate steps. By viewing the cell in terms of a list of distinct parts, I can define, part by part, each 'ome in an effort to determine and categorize functional information for each gene. High-throughput experimentation and computational techniques are valuable and complementary; that is, conclusions often cannot be drawn from a single methodology. It must be noted that these data are only valuable with regard to large populations and, as such, should only be used as a secondary source for single-gene queries. Moreover, genomic approaches result in inaccurate and noisy data. This noise, while deafening at the single-gene level, can be tolerated through the use of broad categories to analyze the data.

ACKNOWLEDGMENTS

R.J. acknowledges an IBM Graduate Research Fellowship.

Chapter 1: Integrating Genomic Data Sets
Figures and Tables

1.1 Interrelating Different Types of Genomic Data, from Proteome to Secretome: 'Oming in on Function

Figure 1. An overview of the current 'omic terminology. (A) A schematic of the main 'omes in the process of gene expression. (B) The literature citations of four of the most widely used 'omes over time.
Figure 2. Interrelating the transcriptome and the translatome. (A) A direct comparison of protein abundance and mRNA expression. The abundance data are from two recent studies (datasets 1 and 2) of a global comparison of protein and mRNA expression levels in yeast (Futcher et al., 1999; Gygi et al., 1999). The combined protein abundance dataset is an average of the data points from the two studies if the given gene product appears in both studies. The mRNA expression data are mainly derived from Holstege et al. (1998). Although there is a general trend for protein concentration to rise with mRNA levels, the actual correlation is weak, and protein concentrations can sometimes vary by more than two orders of magnitude for a given mRNA level. Similar observations were reported by a study in human liver cells (Anderson and Seilhamer, 1997). The mRNA expression data were scaled, and the process is described on this paper's Web site (http://bioinfo.mbb.yale.edu/expression). (B) The composition of the genome (proteome), transcriptome and translatome in terms of broad categories: protein secondary structures and functions. This is based on the analysis in Jansen and Gerstein (2000), with updates to include protein abundance data. The bottom pie charts give the composition in the genome, the middle charts in the transcriptome and the top charts in the translatome. The compositions for the transcriptome and the translatome are calculated by weighting each mRNA/protein with its respective expression level. The secondary structure composition does not vary significantly between the different 'omes, mainly because transcription and translation are independent of secondary structure. The right five pie charts analyse the functional composition.
I highlight the Energy and Cellular Organization categories determined from MIPS (Mewes et al., 2000). A problem in comparing the different 'omes is that each represents a different set of genes. For instance, protein levels have been measured only for a fraction of genes, whereas mRNA levels are known for almost all genes. The pie charts show the compositions for the whole genome in the right column and a representative subset of genes with known protein levels in the left column. Comparing the left to the right immediately shows the experimental bias of two-dimensional electrophoresis (the method for measuring protein abundance) with respect to certain functional categories. There is good agreement between the composition in the translatome and the transcriptome, despite the low correlation of protein and mRNA levels for individual genes. In comparison, the corresponding proportions in the genome are much lower.

Table 1. A Table of 'omes, Together with their Occurrence in the Literature and on the World Wide Web

Term | Description | Google | PubMed | Year of first PubMed citation
Genome | The full complement of genetic information, both coding and non-coding, in the organism | ~1,880,000 | 66,171 | 1932**
Proteome | The protein-coding regions of the genome | ~63,000 | 703 | 1995
Transcriptome | The population of mRNA transcripts in the cell, weighted by their expression levels | 3,520 | 72 | 1997
Physiome | Quantitative description of the physiological dynamics or functions of the whole organism | 2,980 | 15 | 1997
Metabolome | The quantitative complement of all the small molecules present in a cell in a specific physiological state | 349 | 12 | 1998
Phenome | Qualitative identification of the form and function derived from genes, but lacking a quantitative, integrative definition | 4,980 | 6 | 1995
Morphome | The quantitative description of anatomical structure, biochemical and chemical composition of an intact organism, including its genome, proteome, cell, tissue and organ structures | 238 | 2 | 1996
Interactome | List of interactions between all macromolecules in a cell | 56 | 2 | 1999
Glycome | The population of carbohydrate molecules in the cell | 46 | 1 | 2000
Secretome | The population of gene products that are secreted from the cell | 21 | 1 | 2000
Ribonome | The population of RNA-coding regions of the genome | 1 | 1 | 2000
Orfeome | The sum total of open reading frames in the genome, without regard to whether or not they code; a subset of this is the proteome | 42 | - | -
Regulome | Genome-wide regulatory network of the cell | 18 | - | -
Cellome | The entire complement of molecules and their interactions within a cell | 17 | - | -
Operome | The characterization of proteins with unknown biological function | 8 | - | -
Transportome | The population of the gene products that are transported; this includes the secretome | 1 | - | -
Pseudome | The complement of pseudogenes in the proteome | - | - | -
Functome | The population of gene products classified by their functions | 1 | - | -
Translatome | The population of proteins in the cell, weighted by their expression levels | - | - | -
Foldome | The population of gene products classified through their tertiary structure * | - | - | -
Unknome | Genes of unknown function | - | - | -

Updated versions of this table will be available through my Web site at http://bioinfo.mbb.yale.edu/what-is-it. Note that I define five new 'omes: the translatome, the foldome, the pseudome, the functome, and the unknome. My definition of the translatome is motivated partially by the ambiguities in the term proteome, which has two competing definitions. First, broadly favored by computational biologists, it is a list of all the proteins encoded in the genome (Gaasterland 1999; Doolittle 2000). In this context, it is equivalent to what some refer to as the orfeome (i.e., the set of genes excluding noncoding regions).
Experimentalists, especially those involved in large-scale experiments such as expression analysis and 2D electrophoresis, favor a second definition. Here, it is used to describe the actual cellular contents of proteins, taking into account the different levels of protein concentrations (Yates 2000). I prefer the former definition for proteome, and use the term translatome for the latter. See http://www.genomic_glossaries.com/content/omes.asp for a listing of other 'omes and their definitions.
* This term is also used in other fields with different meanings.
** First citation according to the Oxford English Dictionary.

References

1. Aebersold, R., Rist, B. & Gygi, S. P. Quantitative proteome analysis: methods and applications. Ann N Y Acad Sci 919, 33-47 (2000).
2. Anderson, L. & Seilhamer, J. A comparison of selected mRNA and protein abundances in human liver. Electrophoresis 18, 533-7 (1997).
3. Antelmann, H. et al. A proteomic view on genome-based signal peptide predictions. Genome Res 11, 1484-502 (2001).
4. Appel, R. D., Vargas, J. R., Palagi, P. M., Walther, D. & Hochstrasser, D. F. Melanie II--a third-generation software package for analysis of two-dimensional electrophoresis images: II. Algorithms. Electrophoresis 18, 2735-48 (1997).
5. Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25, 25-9 (2000).
6. Baldi, P. & Long, A. D. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 17, 509-19 (2001).
7. Carson, J. H., Cowan, A. & Loew, L. M. Computational cell biologists snowed in at Cranwell. Trends Cell Biol 11, 236-8 (2001).
8. Chen, T., He, H. L. & Church, G. M. Modeling gene expression with differential equations. Pac Symp Biocomput, 29-40 (1999).
9. Claverie, J. M. Computational methods for the identification of genes in vertebrate genomic sequences. Hum Mol Genet 6, 1735-44 (1997).
10. Drawid, A. & Gerstein, M. A Bayesian system integrating expression data with sequence patterns for localizing proteins: comprehensive application to the yeast genome. J Mol Biol 301, 1059-75 (2000).
11. Drawid, A., Jansen, R. & Gerstein, M. Genome-wide analysis relating expression level with protein subcellular localization. Trends Genet 16, 426-30 (2000).
12. Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 95, 14863-8 (1998).
13. Epstein, C. & Butow, R. Microarray technology - enhanced versatility, persistent challenge. Curr Opin Biotechnol 11, 36-41 (2000).
14. Futcher, B., Latter, G. I., Monardo, P., McLaughlin, C. S. & Garrels, J. I. A sampling of the yeast proteome. Mol Cell Biol 19, 7357-68 (1999).
15. Gerstein, M. A structural census of genomes: comparing bacterial, eukaryotic, and archaeal genomes in terms of protein structure. J Mol Biol 274, 562-76 (1997).
16. Gerstein, M. Patterns of Protein-Fold Usage in Eight Microbial Genomes: A Comprehensive Structural Census. Proteins 33, 518-534 (1998).
17. Gerstein, M. & Hegyi, H. Comparing genomes in terms of protein structure: surveys of a finite parts list. FEMS Microbiol Rev 22, 277-304 (1998).
18. Gerstein, M. & Honig, B. Sequences and topology. Curr Opin Struct Biol 11, 327-9 (2001).
19. Gerstein, M. & Jansen, R. The current excitement in bioinformatics, analysis of whole-genome expression data: How does it relate to protein structure and function (In press). Curr Opin Struct Biol (2000).
20. Gerstein, M. & Levitt, M. A structural census of the current population of protein sequences. Proc Natl Acad Sci U S A 94, 11911-6 (1997).
21. Guigo, R., Agarwal, P., Abril, J. F., Burset, M. & Fickett, J. W. An assessment of gene prediction accuracy in large DNA sequences. Genome Res 10, 1631-42 (2000).
22. Gygi, S. P., Rist, B. & Aebersold, R. Measuring gene expression by quantitative proteome analysis. Curr Opin Biotechnol 11, 396-401 (2000).
23. Gygi, S. P., Rochon, Y., Franza, B. R. & Aebersold, R. Correlation between protein and mRNA abundance in yeast. Mol Cell Biol 19, 1720-30 (1999).
24. Harrison, P. M., Echols, N. & Gerstein, M. B. Digging for dead genes: an analysis of the characteristics of the pseudogene population in the Caenorhabditis elegans genome. Nucleic Acids Res 29, 818-30 (2001).
25. Hegyi, H. & Gerstein, M. The relationship between protein structure and function: a comprehensive survey with application to the yeast genome. J Mol Biol 288, 147-64 (1999).
26. Hegyi, H., Lin, J., Greenbaum, D. & Gerstein, M. Structural genomics analysis: characteristics of atypical, common, and horizontally transferred folds. Proteins 47, 126-41 (2002).
27. Holstege, F. C. et al. Dissecting the regulatory circuitry of a eukaryotic genome. Cell 95, 717-728 (1998).
28. Ito, T. et al. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci U S A 98, 4569-74 (2001).
29. Ito, T., Chiba, T. & Yoshida, M. Exploring the protein interactome using comprehensive two-hybrid projects. Trends Biotechnol 19, S23-7 (2001).
30. Ito, T. et al. Toward a protein-protein interaction map of the budding yeast: A comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. Proc Natl Acad Sci U S A 97, 1143-1147 (2000).
31. Jansen, R. & Gerstein, M. Analysis of the yeast transcriptome with structural and functional categories: characterizing highly expressed proteins. Nucleic Acids Res 28, 1481-8 (2000).
32. Jensen, F. V. Bayesian Networks and Decision Graphs (Springer, New York, 2001).
33. Kerr, M. K., Martin, M. & Churchill, G. A. Analysis of variance for gene expression microarray data. J Comput Biol 7, 819-37 (2000).
34. Legrain, P., Wojcik, J. & Gauthier, J. M. Protein--protein interaction maps: a lead towards cellular functions. Trends Genet 17, 346-52 (2001).
35. Luscombe, N. M., Greenbaum, D. & Gerstein, M. What is bioinformatics? A proposed definition and overview of the field. Methods Inf Med 40, 346-58 (2001).
36. Marcotte, E. M., Pellegrini, M., Thompson, M. J., Yeates, T. O. & Eisenberg, D. A combined algorithm for genome-wide prediction of protein function. Nature 402, 83-6 (1999).
37. Marcotte, E. M., Xenarios, I. & Eisenberg, D. Mining literature for protein-protein interactions. Bioinformatics 17, 359-63 (2001).
38. Marx, K. Grundrisse (1857).
39. McAdams, H. H. & Arkin, A. It's a noisy business! Genetic regulation at the nanomolar scale. Trends Genet 15, 65-9 (1999).
40. Mewes, H. W. et al. MIPS: a database for genomes and protein sequences. Nucleic Acids Res 28, 37-40 (2000).
41. Naaby-Hansen, S., Waterfield, M. D. & Cramer, R. Proteomics - post-genomic cartography to understand gene function. Trends Pharmacol Sci 22, 376-84 (2001).
42. Ono, T., Hishigaki, H., Tanigami, A. & Takagi, T. Automated extraction of information on protein-protein interactions from the biological literature. Bioinformatics 17, 155-61 (2001).
43. Pederson, T. The immunome. Mol Immunol 36, 1127-8 (1999).
44. Qian, J. et al. PartsList: a web-based system for dynamically ranking protein folds based on disparate attributes, including whole-genome expression and interaction information. Nucleic Acids Res 29, 1750-64 (2001).
45. Rison, S. C. G., Hodgman, T. C. & Thornton, J. M. Comparison of Functional Annotation Schemes for Genomes. Funct Integr Genomics 1, 56-59 (2000).
46. Ross-Macdonald, P. et al. Large-scale analysis of the yeast genome by transposon tagging and gene disruption. Nature 402, 413-8 (1999).
47. Sanchez, C. et al. Grasping at molecular interactions and genetic networks in Drosophila melanogaster using FlyNets, an Internet database. Nucleic Acids Res 27, 89-94 (1999).
48. Serebriiskii, I., Estojak, J., Berman, M. & Golemis, E. A. Approaches to detecting false positives in yeast two-hybrid systems. Biotechniques 28, 328-30, 332-6 (2000).
49. Simons, K. T., Strauss, C. & Baker, D. Prospects for ab initio protein structural genomics. J Mol Biol 306, 1191-9 (2001).
50. Skolnick, J. & Fetrow, J. S. From genes to protein structure and function: novel applications of computational approaches in the genomic era. Trends Biotechnol 18, 34-9 (2000).
51. Szallasi, Z. Genetic network analysis in light of massively parallel biological data acquisition. Pac Symp Biocomput, 5-16 (1999).
52. Teichmann, S. A., Murzin, A. G. & Chothia, C. Determination of protein function, evolution and interactions by structural genomics. Curr Opin Struct Biol 11, 354-63 (2001).
53. Thattai, M. & van Oudenaarden, A. Intrinsic noise in gene regulatory networks. Proc Natl Acad Sci U S A 98, 8614-9 (2001).
54. Thornton, J. M. From genome to function. Science 292, 2095-7 (2001).
55. Tjalsma, H., Bolhuis, A., Jongbloed, J. D., Bron, S. & van Dijl, J. M. Signal peptide-dependent protein transport in Bacillus subtilis: a genome-based survey of the secretome. Microbiol Mol Biol Rev 64, 515-47 (2000).
56. Tweeddale, H., Notley-McRobb, L. & Ferenci, T. Effect of slow growth on metabolism of Escherichia coli, as revealed by global metabolite pool ("metabolome") analysis. J Bacteriol 180, 5109-16 (1998).
57. Uetz, P. et al. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403, 623-7 (2000).
58. Velculescu, V. E. et al. Characterization of the yeast transcriptome. Cell 88, 243-251 (1997).
59. Vukmirovic, O. G. & Tilghman, S. M. Exploring genome space. Nature 405, 820-2 (2000).
60. Walhout, A. J. & Vidal, M. High-throughput yeast two-hybrid assays for large-scale protein interaction mapping. Methods 24, 297-306 (2001).
61. Wilson, C. A., Kreychman, J. & Gerstein, M. Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. J Mol Biol 297, 233-49 (2000).
62. Winzeler, E. A. et al. Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science 285, 901-6 (1999).
63. Yeh, R. F., Lim, L. P. & Burge, C. B. Computational inference of homologous gene structures in the human genome. Genome Res 11, 803-16 (2001).
64. Zhu, H. et al. Global analysis of protein activities using proteome chips. Science 293, 2101-5 (2001).
65. Zhu, H. et al. Analysis of yeast protein kinases using protein chips. Nat Genet 26, 283-9 (2000).

Chapter 2: mRNA expression and protein abundance

2.1 Analysis of mRNA expression and protein abundance data: An approach for the comparison of the enrichment of features in the cellular population of proteins and transcripts

Abstract

Motivation

Protein abundance is related to mRNA expression through many different cellular processes. Up to now, there have been conflicting results on how correlated the levels of these two quantities are. Given that expression and abundance data are significantly more complex and noisy than the underlying genomic sequence information, it is reasonable to simplify and average them in terms of broad proteomic categories and features (e.g., functions or secondary structures) in order to understand their relationship. Furthermore, it will be essential to integrate, within a common framework, the results of many varied experiments by different investigators. This will allow one to survey the characteristics of highly expressed genes and proteins.

Results

To this end, I outline a formalism for merging and scaling many different gene expression and protein abundance data sets into a comprehensive reference set, and I develop an approach for analyzing this in terms of broad categories, such as composition, function, structure and localization. As the various experiments are not always done using the same set of genes, sampling bias becomes a central issue, and my formalism is designed to explicitly show this and correct for it. I apply my formalism to the currently available gene expression and protein abundance data for yeast. Overall, I found substantial agreement between gene expression and protein abundance in terms of the enrichment of structural and functional categories. This agreement, which was considerably greater than the simple correlation between these quantities for individual genes, reflects the way broad categories collect many individual measurements into simple, robust averages. In particular, I found that in comparison to the population of genes in the yeast genome, the cellular populations of transcripts and proteins (weighted by their respective abundances) were both enriched in: (i) the small amino acids Val, Gly, and Ala; (ii) low molecular weight proteins; (iii) helices and sheets relative to coils; (iv) cytoplasmic proteins relative to nuclear ones; and (v) proteins involved in "protein synthesis," "cell structure," and "energy production".

Supplementary information

http://genecensus.org/expression/translatome

Introduction

With the recent popularity of high-throughput experimentation, biologists have begun to create a large inventory of scientific data (Claverie, 1999; Einarson and Golemis, 2000; Epstein and Butow, 2000; Shapiro and Harris, 2000). Much of this has come from expression experiments, partially fueled by the advent and continuous evolution of the microarray and Gene Chip systems.
These experiments allow for large-scale, comprehensive scans of gene expression within the cell (Eisen and Brown, 1999; Ferea and Brown, 1999; Lipshutz et al., 1999; Schena et al., 1995). Expression data sets are currently the single richest source of information in genomics, and for yeast, expression information now dwarfs that in the sequence alone. However, "theory" has not kept up with experimentation in this area, and how best to interpret the vast amount of data generated by these experiments is still a very open question (Bassett et al., 1996; Gerstein and Jansen, 2000; Searls, 2000; Sherlock, 2000; Wittes and Friedman, 1999; Zhang, 1999).

Genome-wide experimentation has also been used to directly measure the cellular population of proteins, i.e. protein abundance (Anderson and Seilhamer, 1997; Futcher et al., 1999; Gygi et al., 1999; Ross-Macdonald et al., 1999). Understanding how protein abundance is related to mRNA transcript levels is essential for interpreting gene expression and also, more generally, for understanding the interactions, structures and functions in a cellular system (Hatzimanikatis et al., 1999). Moreover, as protein concentration, rather than transcript population, is the more relevant variable with respect to enzyme activity, it is this quantity that connects genomics to the physical chemistry and dynamics of the cell (Kidd et al., 2001). Finally, protein abundance levels may become invaluable for diagnostic methods as well as for determining new drug targets (Corthals, 2000).

High-throughput two-dimensional gel electrophoresis (2-DE), in conjunction with mass spectrometry, has been used to identify proteins that can then be quantified to determine protein abundance (Futcher et al., 1999; Gygi et al., 1999; Harry, 2000).
Other technologies include using random integration of reporter transposons in yeast (Ross-Macdonald et al., 1999), and modifying the microarray concept for use with proteins (Lopez, 2000; MacBeath and Schreiber, 2000; Nelson et al., 2000; Zhu et al., 2000).

Gene expression is indirectly related to cellular protein abundance through the process of translation. The cell connects mRNA expression and protein abundance through translational control, which is primarily regulated at the initiation of translation (Day and Tuite, 1998; Jackson and Wickens, 1997; Lindahl and Hinnebusch, 1992; McCarthy, 1998). Much of this control is the result of multiple cis-acting elements in the mRNA (Jacobs Anderson and Parker, 2000). There are large non-coding regions in each mRNA species devoted to regulation of that mRNA as well as its stability and degradation properties, including 5' and 3' UTRs, uORFs and uAUGs (Morris and Geballe, 2000; Vilela et al., 1998; Vilela et al., 1999).

Previously, we surveyed the population of protein features (such as folds, amino acid composition, and functions) in yeast and a number of the other recently sequenced genomes (Das and M., 2000; Gerstein, 1997; Gerstein, 1998; Gerstein, 1998; Gerstein, 1998; Hegyi and Gerstein, 1999; Lin and Gerstein, 2000). Others have also done related work (Frishman and Mewes, 1997; Frishman and Mewes, 1999; Jones, 1998; Tatusov et al., 1997; Wallin and von Heijne, 1998; Wolf et al., 1999). Recently, we extended this concept to compare the population of features in the yeast transcriptome to that in the genome (Drawid et al., 2000; Jansen and Gerstein, 2000). Here, I present a new methodology to compare the features of the mRNA expression population with the protein abundance population.

Precise terminology is essential for this comparison to be readily understandable.
Unfortunately, one of the terms that immediately comes to mind in relation to protein populations, "proteome", has in the past been used inconsistently. In particular, the term proteome can logically be used to describe all the distinctly different proteins in the genome (Bairoch, 2000; Cambillau and Claverie, 2000; Cavalcoli et al., 1997; Doolittle, 2000; Fey et al., 1997; Gaasterland, 1999; Garrels et al., 1997; Jones, 1999; Pandey and Mann, 2000; Qi et al., 1996; Rubin et al., 2000; Sali, 1999; Tekaia et al., 1999) and, in this context, it is equivalent to what others may refer to as the coding part of the genome. However, in papers on 2D electrophoresis, it is often used to describe the sum total of proteins in a cell, taking into account the different levels of protein abundance for different proteins (Gygi et al., 2000; Lopez, 2000; Shevchenko et al., 1996; Washburn and Yates, 2000). In an effort to be clear, I propose the term "translatome" for this second usage of proteome.

With this definition, I am able to refer compactly to three different cellular populations. These are illustrated in Figure 1.

(i.) I use the term genome when I refer to the population of open reading frames, where each ORF counts once.

(ii.) I use the term transcriptome when I refer to the population of mRNA transcripts. This term was originally coined by Velculescu et al. (Velculescu et al., 1997). Note that each ORF may give rise to different numbers of transcripts. Consequently, the transcriptome is essentially the same as the genome but with each ORF weighted by its expression level.

(iii.) The next level is the cellular population of proteins. As each protein represents a translated transcript, I make an analogy with the term transcriptome and use the term translatome, as described above, to describe this third population.
Thus, the translatome is a subset of the genome where each ORF is weighted by its associated level of protein abundance. Note that one could also, less compactly, call the translatome a "weighted proteome". However, doing so assumes one of the two aforementioned definitions of proteome. To avoid ambiguity, I studiously avoid using the term proteome altogether in this chapter.

Differences between the translatome and the transcriptome exist because transcripts from different genes can give rise to different numbers of proteins, owing to different rates of translation and protein degradation. Post-transcriptional modifications further affect the translatome.

Although there are gene expression and protein abundance data sets for multiple organisms, I have chosen to work specifically on yeast. Besides having its whole genome sequenced (Goffeau, 1996), yeast is also a powerful tool in genetics (Carlson, 2000) due to, among other things, the two-hybrid system, a robust and versatile technique used in discerning protein-protein interactions (Luban and Goff, 1995; Young, 1998).

In my analysis of the transcriptome and translatome, I focus on global protein features rather than the comparison of individual genes. Previous analyses have shown that differences between mRNA expression and protein abundance levels can be quite dramatic for individual genes. This may be due either to noise in the data or to fundamental biological processes. However, my analysis shows that the variation between transcriptome and translatome is much smaller for global properties that are computed by averaging over the properties of many individual genes.

Methods

Data sources used

For my analysis I culled many divergent data sets, representing protein abundance and mRNA expression experiments and also other sources of genome annotation. These are all summarized in Table 1.
Briefly, they included two protein abundance sets, measured via two-dimensional gel electrophoresis and mass spectrometry. I termed these 2-DE #1 (Gygi et al., 1999) and 2-DE #2 (Futcher et al., 1999). These sets, while admittedly small in comparison to the size of expression data sets, represent the largest amount of information on protein abundance publicly available at present. I also apply my methodology, with limited success, to the semi-quantitative Transposon insertion data set that measures the LacZ expression of fusion proteins (Ross-Macdonald et al., 1999). Although this set contains many more genes than either of the gel electrophoresis sets, and thus is an appealing source of protein abundance information, the more qualitative nature of the data makes comparisons with other data sets difficult.

My mRNA expression data came from multiple laboratories that used either Gene Chip or SAGE technology. The Gene Chip sets included the Young Expression Set (Holstege et al., 1998), the Church Expression Set (Roth et al., 1998) and the Samson Expression Set (Jelinsky and Samson, 1999). I used data representing the vegetative state of yeast from all of the above experiments. I also compiled two reference sets to be used in my comparisons, one for protein abundance and another for mRNA expression (summarized below).

Finally, I used many different types of genome annotation in my analysis, which are summarized in Table 1. In particular, the Munich Information Center for Protein Sequences (MIPS), a site containing a large number of databases (Mewes et al., 2000), proved to be an invaluable source of data, specifically in regard to functional categories.

Biases in the data

There is a caveat to the usage of data from high-throughput experimentation (i.e. microarrays and two-dimensional gel electrophoresis).
With all high-throughput expression studies there is always the difficulty of maintaining consistent biological and processing conditions across the assay. Moreover, the databases that annotate the specific genes may not always be accurate (Ishii et al., 2000). Gene chip experiments suffer from cross-hybridization and from the saturation of probes for highly expressed genes. SAGE data is not always reliable for assessing ORFs with low expression levels.

With regard to 2D gels, although the technology has undergone many improvements since its introduction over a quarter century ago (Klose, 1975; O'Farrell, 1975), there remain many aspects of the procedure that introduce biases into the data. These include the inability to resolve membrane proteins (approximately 30% of the genome) and basic proteins (Gerstein, 1998; Krogh et al., 2001). Moreover, there exist some biases in the data that, as in any compilation, reflect the tendencies of the investigator. These include the lack of low abundance proteins (Fey and Larsen, 2001; Gygi et al., 2000; Harry, 2000) and the differences between labs in sample preparation. In addition, the procedures for identification (i.e. MALDI-TOF) and quantification (i.e. ICAT) (Gygi et al., 1999) of the protein spots are much more recent and themselves subject to problems and uncertainties (Haynes and Yates, 2000).

I try to correct for these biases in my analysis in two ways. First, I create reference mRNA expression and protein abundance datasets as a starting point for my analysis. I achieve this by scaling and averaging different mRNA and protein datasets into a combined reference, in an attempt to obtain a better estimate of the normal expression state of a yeast cell (I explain this procedure in more detail in the following section). This reduces the biases that might be found in individual datasets.
Second, in analyzing the reference datasets, I use a formalism and a graphical representation that show the dependency of the results on the subset of genes for which experimental data is available, thus making sampling or selection biases explicit.

Data set scaling

A reference set for mRNA expression

With many different mRNA expression data sets available, it is worthwhile to integrate them into a single unified reference set, with the intention of reducing the noise and errors contained in the individual data sets and of obtaining a unified estimate of the normal expression state in a cell. I adopt an iterative scaling and merging formalism, which I summarize below. I present a more detailed review of the methods at the following web site: genecensus.org/expression/translatome.

I start with the values of one Gene Chip data set $U_i$, where $i$ is used throughout as a subscript to denote gene number. I then transform the values of the next Gene Chip data set $X_i$ to $Y_i$ with the following non-linear regression:

$$\min_{A,B} \sum_i (Y_i - U_i)^2 \quad \text{with} \quad Y_i = A X_i^{B}$$

where $A$ and $B$ are the parameters of the regression. Note that two Gene Chip sets may not be defined for the same set of genes, so I have to perform the fit only over the genes common to both sets. The motivation for scaling is that the dynamic range of observed expression levels varies somewhat between different data sets, although cell types and growth conditions are very similar. Reasons for the disparity may include different calibration procedures for relating fluorescence intensity to a cellular concentration (measured in copies of transcripts per cell) or different protocols for harvesting and reverse-transcribing the cellular mRNA.
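
As an illustration only, the scaling step can be sketched in Python. This is not the code used for the thesis: the function name `fit_power_law`, the search bounds on $B$, and the golden-section search are assumptions made for the example. It relies on the fact that, for a fixed exponent $B$, the least-squares value of $A$ has a closed form, so only a one-dimensional search over $B$ is needed.

```python
import math


def fit_power_law(u, x, b_lo=0.1, b_hi=3.0, tol=1e-8):
    """Fit Y_i = A * X_i**B by minimizing sum_i (Y_i - U_i)**2.

    u, x: dicts mapping gene id -> positive expression value.
    Only genes present in both dicts are used for the fit.
    The bounds b_lo/b_hi and the golden-section search are assumptions
    of this sketch; the residual is assumed unimodal on [b_lo, b_hi].
    """
    common = sorted(set(u) & set(x))
    if len(common) < 2:
        raise ValueError("need at least two genes common to both sets")

    def best_a(b):
        # Closed-form least-squares A for a fixed exponent b.
        num = sum(u[g] * x[g] ** b for g in common)
        den = sum(x[g] ** (2 * b) for g in common)
        return num / den

    def residual(b):
        a = best_a(b)
        return sum((a * x[g] ** b - u[g]) ** 2 for g in common)

    # Golden-section search over the exponent.
    phi = (math.sqrt(5) - 1) / 2
    lo, hi = b_lo, b_hi
    while hi - lo > tol:
        c = hi - phi * (hi - lo)
        d = lo + phi * (hi - lo)
        if residual(c) < residual(d):
            hi = d
        else:
            lo = c
    b = (lo + hi) / 2
    return best_a(b), b, residual(b)


# Toy data (invented for illustration): u is exactly 2 * x**1.5 on shared genes.
x = {"g1": 1.0, "g2": 2.0, "g3": 4.0, "g4": 8.0, "g5": 16.0, "only_x": 3.0}
u = {g: 2.0 * x[g] ** 1.5 for g in ("g1", "g2", "g3", "g4", "g5")}
u["only_u"] = 7.0
A, B, res = fit_power_law(u, x)
```

On real, noisy chip data a general-purpose non-linear least-squares routine would be the more usual choice; the profile search above is used here only to keep the example dependency-free.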
I then merge and average the data to create a new reference set $V$ as follows:

If $U_i$ and $Y_i$ are both defined for gene $i$ and $\frac{|Y_i - U_i|}{Y_i + U_i} < \alpha$, then $V_i = \frac{1}{2}(Y_i + U_i)$
Else if only $Y_i$ exists, $V_i = Y_i$
Else $V_i = U_i$

As presented above, where only one data set has a value for the corresponding ORF, I incorporated that value and did not exclude it. When both data sets have values for an ORF, I averaged the values if they were within 15% of each other; otherwise, I stayed with the original chip data set $U_i$. I used $\alpha$ = 15% in order to prevent outliers from skewing the result. This 15% value is a reasonable threshold for excluding outliers, though other values (e.g. 10% or 20%) would give similar results (data not shown). Other data sets are subsequently included by the same procedure, continuing the iteration from the new expression values $V_i$. The initial iteration starts with the Young Expression Set as $U_i$, since I have the highest confidence in its accuracy.

The SAGE data was not included in the above procedure since it is of a fundamentally different nature. An advantage of the SAGE technology over Gene Chips is that there is no signal saturation at high expression levels, as is possible for chips (Futcher et al., 1999). Conversely, SAGE values are less reliable for lowly expressed genes, since there is a chance that a SAGE tag corresponding to such a gene might not be sequenced at all. Therefore, if after the last iteration the average Gene Chip expression level $V_i$ was both above a certain threshold and below the SAGE expression level $S_i$ for the same gene, it was replaced with the SAGE value; otherwise the average Gene Chip value was kept. This gave my final expression set $w_{mRNA}$. My treatment of the SAGE data is modeled after that in Futcher et al. (Futcher et al., 1999), and like them, I used a threshold of 16.
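
The merging rule and the SAGE substitution can be sketched as follows. This is a minimal illustration, not the original implementation: the function names `merge_reference` and `apply_sage` are hypothetical, and the relative-difference measure |Y - U| / (Y + U) used for the agreement test is an assumption about the exact form of the criterion.

```python
def merge_reference(u, y, alpha=0.15):
    """Merge reference values u with scaled values y (dicts: gene -> value).

    Values are averaged when both exist and agree within alpha; when they
    disagree, the reference value u is kept. The agreement measure
    |y - u| / (y + u) is an assumption of this sketch.
    """
    merged = {}
    for gene in set(u) | set(y):
        if gene in u and gene in y:
            ui, yi = u[gene], y[gene]
            total = yi + ui
            if total > 0 and abs(yi - ui) / total < alpha:
                merged[gene] = (yi + ui) / 2
            else:
                merged[gene] = ui
        elif gene in y:
            merged[gene] = y[gene]
        else:
            merged[gene] = u[gene]
    return merged


def apply_sage(v, sage, threshold=16):
    """Replace a chip value by the SAGE value when the chip value is above
    the threshold and below the SAGE value (possible chip saturation)."""
    out = dict(v)
    for gene, vi in v.items():
        si = sage.get(gene)
        if si is not None and threshold < vi < si:
            out[gene] = si
    return out


# Toy values, invented for illustration.
merged = merge_reference({"a": 10, "b": 10, "c": 5}, {"a": 11, "b": 30, "d": 7})
final = apply_sage({"a": 20, "b": 5, "c": 30}, {"a": 50, "b": 50, "c": 10})
```

Further chip data sets would be folded in by calling `merge_reference` again with the merged result as the new reference, mirroring the iteration described above.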
This incorporation of the SAGE data into the reference data set ensures that the highly expressed outliers are as accurate as possible.

Rather than plain arithmetic averaging, this overall scaling procedure with the cutoff avoids "artificial averages" that combine very different values for a particular gene. Some expression values might be statistical outliers. In addition, it may be that the expression levels of a variety of genes can only fall within mutually exclusive ranges or modes, such as when two alternative pathways are switched on or off. Simply averaging these would give values that are less representative of the particular mode values. This situation is analogous to that of averaging together an ensemble of protein structures, say from an NMR structure determination. Each structure in the ensemble could be stereochemically correct, with all side-chain atoms in predefined rotamer configurations. However, an average of all structures in the ensemble could yield one that is stereochemically incorrect if this involved averaging over particular side-chains in different rotameric states.

With regard to my regression analysis, I have investigated both non-linear and linear fits but found a non-linear procedure to be more advantageous. The non-linear relationship between different expression datasets perhaps reflects saturation in one or more of the gene chips, which is not an uncommon phenomenon. This non-linearity is immediately evident on scatter plots of two datasets against one another (see website). Accordingly, the non-linear fit produces a smaller residual than the linear fit: 98297 (non-linear) versus 122182 (linear) for the scaling of the Church dataset and 59828 (non-linear) versus 67462 (linear) for the Samson dataset.

A reference set for protein abundance

I followed a similar procedure to calculate a reference protein abundance set from the two gel electrophoresis data sets. I first scaled the two data sets against the mRNA expression reference data set, obtaining regression parameters $C_j$ and $D_j$:

$$\min_{C_j, D_j} \sum_i \left( P_{i,j} - C_j \, w_{mRNA,i}^{D_j} \right)^2$$

where the subscript $j$ indicates the data set 2-DE #1 or 2-DE #2 respectively, $P_{i,j}$ is the protein abundance value in data set $j$, $w_{mRNA,i}$ is the corresponding reference expression value, and $C_j$ and $D_j$ are the parameters of the non-linear regression. Using these parameters, I transformed the values of set 2-DE #2 onto the scale of 2-DE #1. Then I combined both sets into the reference protein set $w_{Prot}$ by averaging them where both values existed, and otherwise by using the existing value, viz:

$$Q_{i,2} = C_1 \left( \frac{P_{i,2}}{C_2} \right)^{D_1 / D_2}$$

$w_{Prot,i} = (P_{i,1} + Q_{i,2})/2$ if both $P_{i,1}$ and $Q_{i,2}$ exist.
Else if only $P_{i,1}$ exists, $w_{Prot,i} = P_{i,1}$
Else if $Q_{i,2}$ exists, $w_{Prot,i} = Q_{i,2}$.

Enrichment of features

Figure 2 focuses on individual proteins. In the next part of my analysis, I want to group a number of proteins together into various categories based on common features and characterize those features that are enriched in one population relative to another, i.e. the translatome population of proteins as measured by 2D gels relative to the transcriptome population of transcripts or the genome population of genes. To this end, I set up a formalism that could be applied universally to all the attributes that I was interested in. Due to the limitations of the experiments, the translatome, transcriptome, and genome populations are defined on different sets of genes, and sometimes I want to remove this "selection bias" by forcing them to be compared on exactly the same set of genes. This is a key aspect of my formalism as presented in Figure 1.
I call an entity like $[w, G]$ a "population", where $G$ is a set describing a particular selection of genes from the genome and $w$ is a vector of weights associated with each element of this population. In particular, I focus on three main populations here:

(i.) $[1, G_{Gen}]$ is the population of genes in the genome, all 6280 genes weighted once ($w = 1$).

(ii.) $[w_{mRNA}, G_{mRNA}]$ is the observed population of the transcripts in the transcriptome, i.e. the 6249 genes in the reference expression set weighted by their reference expression value.

(iii.) $[w_{Prot}, G_{Prot}]$ is the observed cellular population of the proteins in the translatome, i.e. the 181 genes in the reference abundance set weighted by their reference abundance value.

(The set of genes in the genome $G_{Gen}$ is approximately equal to the genes in set $G_{mRNA}$, such that I can use both symbols interchangeably.) I can also use this notation to describe specific experiments; e.g. $[w_{lacZ}, G_{lacZ}]$ describes the gene set and weights relating to the Transposon Abundance set.

Furthermore, I define $F_j$ as the value of a feature $F$ in ORF $j$. For example, $F$ could be the composition of leucine (a real number) or a binary value (0 or 1) indicating whether an ORF contains a trans-membrane segment. Given these definitions, the weighted average of feature $F$ in population $[w, G]$ is:

$$\mu(F, [w, G]) = \frac{\sum_{j \in G} w_j F_j}{\sum_{j \in G} w_j}$$

The weighted averages of two populations $[w, G]$ and $[v, S]$ can be compared by simply looking at their relative difference $\Delta$:

$$\Delta(F, [v, S], [w, G]) = \frac{\mu(F, [v, S]) - \mu(F, [w, G])}{\mu(F, [w, G])}$$

where $v$ and $w$ are weights for the sets of ORFs $S$ and $G$ respectively. I call $\Delta$ the "enrichment" of feature $F$ because it indicates whether $F$ is enriched (if $\Delta$ is positive) or depleted (if $\Delta$ is negative) in population $[v, S]$ relative to $[w, G]$.

Usually, the gene set $G$ is defined by the particular experiment for which the weight $w$ was measured.
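
A small Python sketch of the weighted average and the enrichment may make the definitions concrete. The function names and the four-gene toy data are invented for illustration; populations are represented as a weight dict plus a collection of gene identifiers.

```python
def weighted_average(feature, weights, genes):
    """Weighted average of a feature over the population [weights, genes]."""
    num = sum(weights[g] * feature[g] for g in genes)
    den = sum(weights[g] for g in genes)
    return num / den


def enrichment(feature, v, s, w, g):
    """Relative difference of the weighted average of a feature in the
    population [v, s] with respect to the population [w, g]."""
    mu_ref = weighted_average(feature, w, g)
    return (weighted_average(feature, v, s) - mu_ref) / mu_ref


# Toy example: binary feature "has a trans-membrane segment" on four genes.
genes = ["g1", "g2", "g3", "g4"]
is_tm = {"g1": 1, "g2": 0, "g3": 0, "g4": 1}
genome_w = {g: 1 for g in genes}                   # each ORF counts once
mrna_w = {"g1": 1, "g2": 3, "g3": 5, "g4": 1}      # invented expression levels

delta = enrichment(is_tm, mrna_w, genes, genome_w, genes)
```

In this toy case half of the genes carry the feature, but they account for only 2 of the 10 transcripts, so the feature comes out as depleted in the transcript population (negative enrichment).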
However, it is also possible to combine the gene set associated with one experiment with expression levels from another set. One may want to do this to compute the enrichment only on the genes common to both populations, for which there are defined values for both $w$ and $v$, viz: $\Delta(F, [v, S \cap G], [w, S \cap G])$. In practice, this is most relevant for comparing $G_{Prot}$ and $G_{mRNA}$. Since $G_{Prot}$ is completely a subset of $G_{mRNA}$, I need not explicitly deal with intersections if I calculate all statistics directly over $G_{Prot}$.

One can adjust the weight vectors to take into account different types of averaging. For instance, when computing the amino acid composition ($F = aa$) from the amino acid compositions of individual ORFs, $F_j = aa_j$ ($j \in G$), I weight by ORF length. In the case of expression weights, I have:

$$w_j = N_j \, w_{mRNA,j}, \quad j \in G$$

where $N_j$ is a measure of the length of ORF $j$ (such as the number of amino acids). On the other hand, when computing the average molecular weight per amino acid, I need to normalize by the number of amino acids per ORF, which is equivalent to choosing the following weights:

$$w_j = \frac{w_{mRNA,j}}{N_j}, \quad j \in G$$

Results

Comparison of mRNA expression and protein abundance

Figure 2a shows a comparison of my two reference data sets for transcripts and proteins on a log-log graph. The correlation coefficient is 0.67. A previous study (Futcher et al., 1999), in which the data set 2-DE #2 was investigated, reported a higher correlation coefficient of 0.76. The disparity may be due to the fact that I am looking at a larger number of points. Inspection of Figure 2a also shows that the correlation is larger for the data values that were derived from averaging values from both 2-DE sets. It should be emphasized that there are many limitations in this analysis, as both 2-DE sets represent relatively homogeneous sets of proteins and there are only a small number of proteins in each set.
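
For readers who wish to reproduce this kind of comparison, a Pearson correlation over the genes shared by two sets can be sketched as below. Whether the coefficient quoted above was computed on raw or on log-transformed values is not stated in the text; the sketch uses log10 values as an assumption, matching the log-log plot, and the function name and toy data are invented.

```python
import math


def log_pearson(mrna, protein):
    """Pearson correlation of log10(mRNA) against log10(protein) over the
    genes present (with positive values) in both dicts.
    Using log-transformed values is an assumption of this sketch."""
    common = [g for g in mrna
              if g in protein and mrna[g] > 0 and protein[g] > 0]
    if len(common) < 2:
        raise ValueError("need at least two common genes")
    xs = [math.log10(mrna[g]) for g in common]
    ys = [math.log10(protein[g]) for g in common]
    n = len(common)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


# Toy data following an exact power law, so the log-log points are collinear.
mrna = {"a": 1.0, "b": 10.0, "c": 100.0}
protein = {"a": 10.0, "b": 1000.0, "c": 100000.0, "d": 5.0}
r = log_pearson(mrna, protein)
```

Gene "d" has no mRNA value and is therefore ignored, in the same way that the comparison above is restricted to genes present in both reference sets.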
Figure 2b shows the outliers from Figure 2a, both above and below the dashed line. These outliers represent those genes whose mRNA expression differs significantly from their protein abundance (i.e. either there is little mRNA expression yet significant protein abundance, or significant mRNA expression yet minimal protein abundance). For each, I present a description of its function. With one exception, all outliers are associated with the MIPS category cellular organization (MIPS category 30).

Enrichment of protein features

Amino acid enrichment

As shown in Figure 3a, I used my methodology to measure the enrichment of individual amino acids in both the translatome and the transcriptome relative to the genome. The horizontal axis lists the amino acids while the vertical axis shows their percent enrichment. I list enrichments for both the reference protein abundance and mRNA expression sets in relation to the genome population. I found that three amino acids (valine, glycine and alanine) were consistently enriched in both transcriptome and translatome populations.

In Figure 3a I compare different gene sets. In Figure 3b I focus mainly on the variation in enrichments when all the comparisons are restricted to the set of 181 genes ($G_{Prot} \cap G_{mRNA} = G_{Prot}$) common to all data sets. Thus, the differences between the populations now only reflect the effects of differential transcription of certain genes and differential translation of certain transcripts. I find here an enrichment specifically of cysteine in the translatome in relation to the transcriptome. This enrichment may be the result of the stability associated with sulfur bridges.

To measure the statistical significance of the results on amino acid enrichment, I have performed a control analysis on a randomized dataset (Figure 3d). I randomly permuted the expression values of the ORFs 1000 times and then recomputed the enrichments.
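
A permutation control of this kind can be sketched as follows. The sketch is illustrative rather than a record of the original analysis: the function name, the fixed random seed, the use of the unweighted gene set as the reference population, and the (hits + 1) / (n + 1) convention for the empirical one-sided p-value are all assumptions.

```python
import random


def permutation_p_value(feature, weights, genes, n_perm=1000, seed=0):
    """Empirical one-sided p-value for the enrichment of a feature in the
    population [weights, genes] relative to the same genes weighted once.

    The weights are shuffled among the genes n_perm times; the p-value is
    the fraction of shuffles whose enrichment is at least the observed one
    (with the +1 correction, an assumption of this sketch).
    """
    rng = random.Random(seed)
    genes = list(genes)
    base = sum(feature[g] for g in genes) / len(genes)

    def enrich(wlist):
        mu = sum(w * feature[g] for w, g in zip(wlist, genes)) / sum(wlist)
        return (mu - base) / base

    values = [weights[g] for g in genes]
    observed = enrich(values)
    shuffled = values[:]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if enrich(shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy data: the feature value rises with the expression weight.
toy_genes = ["g%d" % i for i in range(10)]
toy_feature = {g: float(i + 1) for i, g in enumerate(toy_genes)}
toy_weights = {g: float((i + 1) ** 3) for i, g in enumerate(toy_genes)}
obs, p = permutation_p_value(toy_feature, toy_weights, toy_genes)

# With uniform weights there is no enrichment and every shuffle ties it.
flat = {g: 1.0 for g in toy_genes}
obs_flat, p_flat = permutation_p_value(toy_feature, flat, toy_genes)
```

For a depleted feature the comparison would be reversed (counting shuffles with an enrichment at or below the observed value).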
This allowed me to compute distributions for the amino acid enrichments and, from integrating these, one-sided p-values indicating the significance of the observed enrichments.

Biomass enrichment

A corollary to amino acid enrichments is the determination of the average biomass of the transcriptome and translatome populations. I show this in Figure 3c. I found that the average molecular weight of a protein in both populations was lower than in the genome population. These preliminary observations suggest that the cell prefers to use less energetically expensive proteins for those that are highly transcribed or translated. However, I also found that the average molecular weight per amino acid differed much less between the transcriptome and the translatome on the one hand, and the genome on the other hand (though it was still slightly lower). This finding indicates that lower molecular weights in the translatome and transcriptome populations relative to the genome population are predominantly due to greater expression of shorter proteins rather than the incorporation of smaller amino acids.

Secondary structure composition

I also used my methodology to study the enrichment of secondary-structural features. Secondary structural annotation was derived from structure prediction applied uniformly to all the ORFs in the yeast genome as described in Table 1. As shown in Figure 4a, all three populations (genome, transcriptome, and translatome) had a fairly similar composition of secondary structures (sheets, helices, and coils). The differences between populations were marginal and based only on the small subset of genes. They do, though, point to a possible trend of depletion of random coils relative to alpha helices and beta sheets in the transcriptome and translatome.
I also found that transmembrane proteins were significantly depleted in the transcriptome (see website). To identify transmembrane (TM) proteins, I used the GES hydrophobicity scale as described previously (see caption to Table 1; Gerstein, 1998). These results are consistent with a previous analysis (Jansen and Gerstein, 2000). This analysis could not be extended to the translatome because the 181 genes in the protein abundance data set ($G_{Prot}$) do not contain any membrane proteins, which are difficult to detect in gel electrophoresis (Molloy, 2000).

Subcellular localization

A generalization of the transmembrane protein analysis is subcellular localization. I looked into the enrichment of proteins associated with the various subcellular compartments. This is shown in Figure 4c. For clarity, I divided the cell into five distinct subcellular compartments, as described in Table 1. I found that, in comparison to the genome, both the transcriptome and translatome are enriched in cytoplasmic proteins. This is true whether I make my comparisons in relation to the relatively large reference mRNA expression set or the smaller reference protein abundance set. As Figure 4c shows, the 2D gel experiments are clearly biased towards proteins from the cytoplasm. However, even within the biased subset $G_{Prot}$, transcription and translation lead to a still higher fraction of cytoplasmic proteins in the translatome.

Functional categories

Finally, I compared the enrichment of various functional categories in both the translatome and the transcriptome (see Figure 4b). This gives a broad yet informative view of the cell as a whole. As described in Table 1, I used the top level of the MIPS scheme for the functional category definitions (Mewes et al., 2000). I found broad differences between the various populations, with some of the functional categories showing strikingly high enrichments.
In particular, I found enrichments of the "cellular organization," "protein synthesis," and "energy production" categories.

Application to semi-quantitative protein abundance data sets

I also tried to extend my methodology to cope with the semi-quantitative transposon set. The qualitative nature of the set makes it impossible to compute statistical relationships between mRNA and protein populations as I did for both the 2D gel sets. I briefly summarize my approach. Many ORFs in the Transposon dataset had multiple, sometimes inconsistent, measurements ranging from one (background) to four (strong) for various different transposon insertions. I took only those 450 ORFs that consistently yielded either background or strong. I then used this set in a binary fashion, interpreting an ORF as either on or off. I show the enrichments of amino acids computed from this filtered Transposon Abundance Set in Figure 3a. Overall, the enrichments from this set seemed to be attenuated in comparison to either the mRNA expression or protein abundance data.

Discussion and conclusion

I developed a methodology for integrating many different types of gene expression and protein abundance data into a common framework and applied this to a preliminary analysis of yeast. In particular, I developed a procedure for scaling and merging different mRNA and protein sets together and then computing the enrichment of various proteomic features in the population of transcripts and proteins implied by these scaled sets. I showed that by analyzing broad categories instead of individual noisy data points, I could find logical trends in the underlying data.

The comparison of the translatome with the transcriptome and the genome helps to better understand cellular processes.
For this purpose, I compiled two reference sets: the mRNA reference expression set, integrated from various Gene Chip and SAGE experiments, and the protein reference abundance set, collected from published 2D gel electrophoresis experiments. My reference sets proved useful for my analysis of the composition and enrichments of protein features in the various stages of gene expression. I found many similar trends for general protein categories between these two sets. To compare the translatome and the transcriptome, I devised a formalism to measure enrichments of data sets. With this formalism I measured the enrichments of amino acids, protein function and secondary structures in the vegetative yeast cell. Other comparisons included looking at average biomasses, looking into subcellular localizations and a direct comparison of mRNA expression vs. protein abundance.

Overall transcriptome and translatome similarity: outliers against trend

The overall similarity I find between transcriptome and translatome contrasts somewhat with the weak correlation between mRNA expression and protein abundance as shown in Figure 2 and reported previously (Futcher et al., 1999; Gygi et al., 1999). This reflects the way my system of overall categories collects many proteins into robust averages. It shows that variation between proteins is not systematic with respect to the categories. For example, individual transcription factors might have higher or lower protein abundance than one expects from their mRNA expression, but the category “transcription factors” as a whole has a similar representation in the transcriptome and translatome. I used the reference data sets to compare mRNA expression and protein abundance for the 181 genes shared between the two sets -- the largest such comparison.
While I found an overall correlation between the two data sets, indicating that mRNA expression may be closely related to protein abundance, I found some genes that bucked the trend. Possible explanations for the aberrant behavior of some of these outliers are presented. Those outliers that have higher levels of protein abundance than expected from their mRNA expression are dominated by alcohol dehydrogenases and glyceraldehyde-3-phosphate (G3P) dehydrogenases. It is known that G3P dehydrogenase forms a bienzyme complex with alcohol dehydrogenase; thus, the similar abundance pattern of these two enzymes can be rationalized (Batke et al., 1992). Alcohol dehydrogenase is also a stress-induced protein in many organisms (An et al., 1991; Matton et al., 1990; Millar et al., 1994), induced into action when the cell undergoes trauma, and thus perhaps translated to a higher degree prophylactically (although the expression pattern of another stress-induced protein, HSP70, shows that this is not always the case). Translation-related proteins are more prominent among the outliers with lower protein abundance than expected from mRNA expression. While it is known that multiple features of an individual mRNA influence its expression and regulation, it is presently not clearly understood how. There are many non-coding regions in each mRNA species that are responsible for this regulation. These include upstream AUG codons (uAUGs), both 3’ and 5’ untranslated regions, upstream open reading frames (uORFs) and the overall secondary structure of mRNA. Presently it is unclear how these act to exert their control (Morris and Geballe, 2000). One might conceive of using "outliers" with significantly different transcriptional and translational behavior to find consensus regulatory sequences. One possible method would involve using predicted mRNA structures (Jaeger et al., 1990; Zuker, 2000) to find consensus structural elements in these outliers.
In particular, it might be worthwhile to investigate the secondary mRNA structure, to which the yeast translational machinery is known to be sensitive (McCarthy, 1998).

Overall transcriptome and translatome similarity: consistent enrichments

I found the enrichments relative to the genome to be consistent between the translatome and the transcriptome. In particular I found that the amino acids Valine, Glycine and Alanine -- all relatively small amino acids -- are significantly enriched in both populations in comparison to the genome population. These results coincide with the previous conclusion that those amino acids are also the most highly abundant amino acids in soluble proteins (Nauchitel and Somorjai, 1994). Conversely I found that Cysteine, Serine, Asparagine and Arginine were markedly depleted. My transcriptome enrichments using the reference set were similar to results obtained previously using individual mRNA expression data sets (Jansen and Gerstein, 2000). In addition, I found that the translatome and the transcriptome both have lower molecular weight proteins in relation to the genome. Furthermore, I found that, in comparison to the genome population, the translatome and transcriptome were depleted in random coils (a structurally less complex and, as such, less stable protein structure) relative to alpha helices and beta sheets. These results are from a small and potentially biased subset of proteins and so, in and of themselves, may not be informative. Yet, it is possible that they point to a logical trend that may result from the cellular preference for stability and structural rigidity through more regular secondary structures (helices and sheets).
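As an illustration of the enrichment comparisons discussed above, the following Python sketch computes an expression-weighted average of a per-gene feature and its relative change against the unweighted genome. The definition used here (relative difference of weighted means), the function names and the toy numbers are assumptions made for illustration only; the formalism actually used is the one defined in the Methods section. The permutation step follows the procedure described in the caption to Figure 3d, in which expression levels are shuffled among the genes and the enrichment is recomputed.

```python
import random


def weighted_mean(feature, weights):
    """Weighted average of a per-gene feature (e.g. fraction of Ala residues)."""
    total = sum(weights)
    return sum(f * w for f, w in zip(feature, weights)) / total


def enrichment(feature, weights, ref_weights=None):
    """Relative change of the weighted mean against a reference population.

    ref_weights=None counts every gene once (the 'genome' population).
    Assumed definition: (mean_population - mean_reference) / mean_reference.
    """
    if ref_weights is None:
        ref_weights = [1.0] * len(feature)
    ref = weighted_mean(feature, ref_weights)
    return (weighted_mean(feature, weights) - ref) / ref


def permutation_p_value(feature, weights, n_perm=2000, seed=0):
    """One-sided P-value from shuffling the weights among the genes.

    Counts the share of shuffled weightings whose enrichment is at least as
    extreme as the observed one, in the same direction.
    """
    rng = random.Random(seed)
    observed = enrichment(feature, weights)
    shuffled = list(weights)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        e = enrichment(feature, shuffled)
        if (observed >= 0 and e >= observed) or (observed < 0 and e <= observed):
            hits += 1
    return hits / n_perm


if __name__ == "__main__":
    # Hypothetical toy data: four genes, their Ala fraction and mRNA levels.
    ala_fraction = [0.10, 0.08, 0.05, 0.04]
    mrna_level = [50.0, 30.0, 2.0, 1.0]
    print(enrichment(ala_fraction, mrna_level))
    print(permutation_p_value(ala_fraction, mrna_level))
```

In this toy example the highly expressed genes are also the Ala-rich ones, so the transcriptome comes out enriched in Ala relative to the genome. With only four genes the permutation P-value cannot be very small; real use would need the full gene set.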
In relation to functional categories, I found three trends that were particularly notable: (i) the “cellular organization,” “protein synthesis,” and “energy production” categories were increasingly enriched as I moved from genome to transcriptome to translatome. This finding was true for either of the gene sets and reflects the great abundance of structural proteins, such as actin, and, in the case of the transcriptome, ribosomal proteins. (In the protein abundance set GProt ribosomal proteins are rather underrepresented.) (ii) Proteins with “unclassified function” are significantly depleted in the transcriptome and the translatome in relation to the genome, perhaps reflecting a bias against studying them. (iii) Proteins in the “transcription” and “cell growth, cell division, and DNA synthesis” categories were consistently depleted in the transcriptome and translatome populations relative to the genome. This perhaps reflects the fact that many of these proteins, such as transcription factors, act as "switches". While many copies are needed in the genome to give different specificities, only small quantities of the protein are necessary to activate or deactivate a process. These results concur with previous calculations (Jansen and Gerstein, 2000), which found that the transcriptome is enriched specifically with proteins involved in protein synthesis and energy. As opposed to the genome population, where there is a wide distribution of products in all cellular compartments, cytoplasmic proteins dominate the translatome and transcriptome. For instance, while the genome data set has the largest allocation of genes going to the nucleus, the bulk of the translatome and transcriptome populations are localized to the cytoplasm.
Part of this effect may also be due to the gel-electrophoresis experimental process that favors the higher expressing cytoplasmic proteins, although a similar effect can clearly be observed in the transcriptome data set, which does not have this experimental bias. This may be related to the enrichment of functional categories that are connected to cytoplasmic proteins, such as "protein synthesis".

Limitations given the small size of the protein abundance data

Even with the extended coverage made possible by merging many datasets together into my two reference sets, I still found that the largest complication in my analysis was the limited amount of data. This was, obviously, most applicable to the protein abundance measurements. In addition to giving me fewer data points for my statistics, the small number of protein abundance measurements potentially biased my statistical results towards certain protein families. The 181 proteins in GProt are certainly not a random selection from the possible 6280 in yeast. They are, rather, skewed towards well-studied proteins that are highly expressed. My methodology attempts to control for this gene-selection bias through my enrichment formalism, which allows one to gauge various aspects of the bias rather precisely. My results will certainly be more complete and definitive when larger proteomics datasets become available, which I anticipate will happen soon (Smith, 2000). However, I believe that the essential formalism and approach that I develop will remain quite relevant for all future datasets. Although the translatome data I used in my study is small in comparison to the information on the genome and transcriptome, many protein features in both the translatome and the transcriptome are dominated by the very highly expressed proteins (to which the 2DE experiments are biased).
Under this circumstance, it is often sufficient to look at this smaller number of dominating proteins to approximately characterize the whole population. This is similar in spirit to the development of the codon adaptation index for yeast (Sharp and Li, 1987). While based on only 24 highly expressed proteins, it has proven to be robust in predicting expression levels for the entire genome. In contrast, the experimental bias in the selection of proteins with particular biophysical properties should be of more concern.

Future directions

Besides the recapitulation of my computations with the release of new data, I also hope to expand this analysis to other organisms. While presently I have limited my study to yeast gene expression, there are other potential model organisms for which there are expression experiments. Moreover, I have also limited myself to Gene Chip experiments, but it may be worthwhile to analyze cDNA microarray data sets (Cho et al., 1998; DeRisi et al., 1997; Winzeler et al., 1999). I can use these sizeable microarray data sets to study changes in protein features over time.

Acknowledgments

MG thanks the Keck foundation for support.

Figures and Tables

2.1 Analysis of mRNA expression and protein abundance data

Table 1 Data sets

Data set | Description | Size [ORFs] | Reference

mRNA expression
Young | Gene chip profiles of yeast cells with mutations that affect transcription | 5455 | Holstege et al. (1998)
Church | Gene chip profiles of yeast cells under four different conditions | 6263 | Roth et al. (1998)
Samson | Comparing gene chip profiles for yeast cells subjected to alkylating agent | 6090 | Jelinsky et al. (1998)
SAGE | Yeast cells during vegetative growth | 3778 | Velculescu et al. (1997)
Reference expression | Scaling and integrating the mRNA expression sets into one data source | 6249 | -

Protein abundance
2-DE #1 | Measurement of yeast protein abundance by two-dimensional (2D) gel electrophoresis and mass spectrometry | 156 | Gygi et al. (1999)
2-DE #2 | Similar to 2-DE set #1 | 71 | Futcher et al. (1999)
Transposon | Large-scale fusions of yeast genes with lacZ by transposon insertion | 1410 | Ross-Macdonald et al. (1999)
Reference abundance | Scaling and integrating the 2-DE data sets into one data source | 181 | -

Annotation
Localization | Subcellular localizations of yeast proteins | 2133 (6280) | Drawid et al. (2000)
Transmembrane segments | Predicted transmembrane and soluble proteins in yeast | 2710 (6280) | Gerstein (1998)
MIPS functions | Functional categories for yeast ORFs | 3519 (6194) | Mewes et al. (2000)
GOR secondary structure | Predicted secondary structure for yeast ORFs | 6280 | Gerstein (1998)

Table 1, data sets: This table provides an overview of the data sets used in my analysis. The table is divided into three sections. The first section at the top lists different mRNA expression sets. The second section in the middle shows the protein abundance data sets used. The third section at the bottom contains different annotations of protein features. The column "Data set" lists a shorthand reference to each data set used throughout this paper. The next columns contain a brief description of the data sets, the number of ORFs contained in each of them, the literature reference and the URL. In contrast to the other data I investigated, the reference expression and abundance data sets have been calculated for the purpose of my analysis (see text).
Some further information on the genome annotations:

Localization: Protein localization information from YPD, MIPS and SwissProt was merged, filtered and standardized (Bairoch and Apweiler, 2000; Costanzo et al., 2000; Mewes et al., 2000) into five simplified compartments -- cytoplasm, nucleus, membrane, extracellular (including proteins in ER and Golgi), and mitochondrial -- according to the protocol in Drawid et al. (2000). This yielded a standardized annotation of protein subcellular localization for 2133 out of 6280 ORFs.

Transmembrane segments: In 2710 out of 6280 yeast ORFs transmembrane segments are predicted to occur, ranging from low to high confidence (732 ORFs). The transmembrane prediction was performed as follows: The values from the GES scale for amino acids in a window of size 20 (the typical size of a transmembrane helix) were averaged and then compared against a cutoff of -1 kcal/mole. A value under this cutoff was taken to indicate the existence of a transmembrane helix. Initial hydrophobic stretches corresponding to signal sequences for membrane insertion were excluded. (These have the pattern of a charged residue within the first seven, followed by a stretch of 14 with an average hydrophobicity under the cutoff.) These parameters have been used, tested, and refined on surveys of membrane proteins in genomes. "Sure" membrane proteins had at least two TM-segments with an average hydrophobicity less than -2 kcal/mole (Gerstein et al., 2000; Rost et al., 1995; Santoni et al., 2000; Senes et al., 2000).

Functions: MIPS functional categories have been assigned to 3519 out of 6194 ORFs. (The remainder are assigned to category '98' or '99', which corresponds to unclassified function.)

Figure 1 Schematic overview of the analysis.
On the left side I outline the terms I use to describe the process of gene expression. The coding section of the genome is transcribed into a population of mRNA transcripts called the "transcriptome". The transcripts in turn are translated to a population of proteins; I use the term "translatome" for this protein population rather than the alternative "proteome" because the latter term may be confounded with the protein complement of the genome (which is not necessarily associated with a quantitative abundance level). The matrix in the middle schematically shows an analysis of the three stages of expression. In general, I define a protein "population" as a set of genes associated with a corresponding number of expression or abundance levels ("weights"). In the matrix each row represents a weight and each column a gene set. In particular, I differentiate between the mRNA reference expression set (GmRNA = GGen), which essentially covers the complete genome, and the reference protein abundance set (GProt), which contains the proteins in data sets 2-DE #1 and 2-DE #2 (see Table 1), because the protein abundance set is a significantly smaller subset of the genome. By definition, this subset contains only proteins that can be identified by 2-D gel electrophoresis and is therefore biased in this sense. The enrichment figures throughout this paper, through a comparison of the right and left sides of this figure, show the results of the experimental biases of 2D gels on the data set. Each pie chart represents a composition of a particular protein feature F (for instance, an amino acid composition) in a population (represented by a symbol in the figure). I can further look at the "enrichment" of this feature in one population relative to another (also represented by a symbol in the figure; see section "Methods" for an explanation of the formalism).
For simplification, I neglect the effects of post-transcriptional and post-translational modifications that might alter the features of proteins (they affect the expression levels but this is largely accounted for by the measurements). In this study I analyze protein features as they are represented in the genome.

Figure 2 mRNA expression levels vs. protein abundance levels. Part a of this figure shows the reference protein abundance levels plotted against the mRNA reference expression levels on a log-log scale; this plot is similar to the one reported earlier by Futcher et al. (1999). The trend line is described by the equation $y = 5.20\,x^{0.61}$, where y represents the protein abundance level (in units of $10^3$ copies/cell) and x the mRNA expression level (in units of copies/cell). The dashed lines indicate a distance of 1.85 standard deviations (in the log scale) from the trend line. The outliers beyond the dashed lines are listed in Part b. For each of these outlier ORFs I show a description of their function and their respective MIPS categories (the numbers are defined in Figure 4C). With one exception, all outliers are associated with cellular organization (MIPS category 30). Those outliers that have a high level of protein abundance relative to the expected amount of mRNA expression are dominated by the alcohol and G3P dehydrogenases. Translation-related proteins are prominent in the group of those proteins with low protein abundance in relation to mRNA expression.
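The trend line and the outlier bands of Figure 2 can be reproduced in outline with a straight-line fit in log-log space. The Python sketch below is a minimal illustration that assumes an ordinary least-squares fit (the caption does not state how the line was obtained). The function names and the use of the residual standard deviation with n - 2 degrees of freedom are likewise assumptions; only the threshold of 1.85 standard deviations is taken from the caption.

```python
import math


def fit_power_law(x, y):
    """Fit y = a * x**b by least squares on log10 values.

    Returns (a, b, sd), where sd is the residual standard deviation in
    log10 units (n - 2 degrees of freedom, so at least 3 points are needed).
    """
    lx = [math.log10(v) for v in x]
    ly = [math.log10(v) for v in y]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    b = sxy / sxx
    log_a = my - b * mx
    residuals = [v - (log_a + b * u) for u, v in zip(lx, ly)]
    sd = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return 10 ** log_a, b, sd


def find_outliers(names, x, y, a, b, sd, k=1.85):
    """Return genes whose log10 residual from the trend line exceeds k * sd."""
    flagged = []
    for name, xi, yi in zip(names, x, y):
        residual = math.log10(yi) - math.log10(a * xi ** b)
        if abs(residual) > k * sd:
            label = "high protein" if residual > 0 else "low protein"
            flagged.append((name, label))
    return flagged
```

As a worked reading of the reported trend line, an mRNA level of x = 10 copies/cell corresponds to a predicted y of about 5.20 × 10^0.61 ≈ 21, that is, roughly 21 × 10^3 protein copies per cell.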
Figure 3a-c Amino acid and biomass enrichment.

Part a shows the amino acid enrichments between different populations as indicated by the legend to the right of the plot (the legend is ordered in the same way as the schematic illustration in Figure 1). The bars indicate the enrichment of the transcriptome relative to the genome, whereas the circles indicate the enrichment of the translatome relative to the genome. In addition, I also show the enrichment for protein abundance from the Transposon Abundance Set, represented by the circles with the line through them. It can be seen that the enrichments for the transcriptome and the translatome follow a similar trend despite their differences. In general, the amino acid enrichments seem to be more strongly emphasized in the translatome. In contrast, the enrichments for the Transposon Abundance Set seem to be very small. This may be due to the fact that the ORFs fused with lacZ produce different gene products than the original genes. In both the translatome and the transcriptome the amino acids Valine, Glycine and Alanine are strongly enriched. On the other end, the amino acids Asparagine, Cysteine and Serine are strongly depleted.

Part b shows a different view of amino acid enrichment from that contained in Part a, now focusing on changes, and thus restricting the comparison to the genes common to all the datasets. The graph is ordered according to the enrichment from transcriptome to translatome (black squares). I focus here only on the changes for the abundance gene set (GProt) to exclude the effects that arise from looking at different subsets. In this view the enrichments from genome to transcriptome (white squares) and from genome to translatome (white diamonds) look more similar than do the analogous sets in Part a.
To make comparison with Part a easier I again show the enrichment from genome to the transcriptome for the complete gene set (GGen, shown in bars).

Part c shows biomass enrichment. The left panel depicts the average molecular weight per ORF (in units of kDa) and the right panel, the average molecular weight per amino acid (in units of Daltons) in each of the three stages of gene expression. The numbers inside the circles indicate the average molecular weights. The values next to the arrows indicate the enrichments in biomass between different populations. Both the circle diameters and the arrow widths are functions of the corresponding values (the hollow arrow indicates a positive value). It is clear that the average molecular weight per ORF is much lower in the translatome (by 20% or 15%) and transcriptome (by 29%) than in the genome. This relative depletion of biomass mainly takes place as a result of transcription; the effect of translation is less clear, depending on the populations compared. On the other hand, the depletion in the average molecular weight per amino acid (-3.3% from genome to translatome) is an order of magnitude smaller than in the average weight per ORF. This shows that the yeast cell favors the expression of shorter ORFs over longer ones, and agrees with earlier observations that there is a negative correlation between maximum ORF length and mRNA expression (Jansen and Gerstein, 2000); it seems that this effect mainly takes place during transcription rather than translation.

Figure 3d Statistical significance. Part d shows that the amino acid enrichments are statistically significant. I have assessed significance by randomly permuting the expression levels among the genes and then recomputing the amino acid enrichments.
This procedure can be repeated and used to generate distributions of random enrichments that can then be compared against the observed enrichments. In the plot the gray bars represent the observed enrichments already shown in Figure 3a. On top of the gray bars I show standard boxplots of enrichment distributions based on 20000 random permutations. (The middle line represents the distribution median. The upper and lower sides of the box coincide with the upper and lower quartiles. Outliers are shown as dots and defined as data points that are outside the range of the whiskers, the length of which is 1.5 times the interquartile distance.) Based on the random distributions, I can compute one-sided P-values for the observed enrichments. Amino acids that are significant beyond $\alpha = 10^{-3}$ are shown in bold font (the only exception is Glutamine (Q), which has a P-value of $1.25 \cdot 10^{-3}$). Note that $\alpha = 10^{-3}$ corresponds to $\alpha' = 5 \cdot 10^{-5} = 10^{-3}/20$ for each individual amino acid (Bonferroni correction) since I independently perform the same statistical test 20 times.

Figure 4 Breakdown of the transcriptome and translatome in terms of broad categories relating to structure, localization, and function. All of the subfigures are analogous to the schematic illustration in Figure 1.

Part a represents the composition of secondary structure in the different populations. In general, the secondary structure compositions appear to be relatively stable across the different populations. The most notable change from genome to translatome is perhaps the depletion of coils -- that is, relatively unordered structures compared to the more structured helices and sheets -- by about 4%.
Part b represents the distribution of subcellular localizations associated with proteins in the various populations. I used standardized localizations developed earlier (Drawid and Gerstein, 2000), which, in turn, were derived from the MIPS, YPD, and Swiss-Prot databases (Bairoch and Apweiler, 2000; Costanzo et al., 2000; Mewes et al., 2000). The subcellular localization has been experimentally determined for less than half of the yeast proteins, so my analysis applies only to this subset. The most notable difference between genome, transcriptome and translatome is the strong enrichment of cytoplasmic proteins. This is in agreement with my previous observations (Drawid et al., 2000). This also explains to some degree the observations for the functional classes in Part c. For example, the functional group "energy" is mostly dominated by the highly expressed glycolytic proteins found in the cytoplasm. The depletion of the functional group "transcription" makes sense in the light of the strong depletion for nuclear proteins. We have argued before (Drawid et al., 2000) that the number of proteins in a particular subcellular compartment may be roughly related to the size of the compartment. For instance, membrane proteins occupy the relatively small "two-dimensional" space in lipid bi-layers. I also performed a separate, independent calculation for a more comprehensive list of transmembrane segments, which were predicted computationally (see caption of Table 1). This largely confirms the result. (Data not shown.)

Part c shows the division of ORFs into different functional categories (according to the MIPS classification) in the various populations. Only the largest functional categories of the top level of the MIPS classification are shown. The group "Other" contains the smaller top-level categories lumped together.
This “Other” group is different from the group "Unclassified," which contains genes without any functional description. One complication is that many genes have multiple functional classifications such that they may be counted in more than one category (this explains why the group "Unclassified" has only a size of 28% for the genome population although the number of unclassified genes in the yeast genome is much larger). Comparing the genome with the transcriptome and translatome compositions in general, it can be observed that if a functional class is enriched in the transcriptome relative to the genome, it is also enriched in the translatome. Specifically, the functional classes "metabolism", "energy", "protein synthesis" and "cellular organization" are enriched in transcriptome and translatome. On the other hand, the classes "cell growth, cell division and DNA synthesis" and "transcription" are depleted; in particular, this is the case for the "unclassified" group, indicating that a lot of the current biochemical knowledge is clearly skewed towards more highly expressed genes. Some of the differences between the complete gene set (GGen) and the protein abundance set (GProt) are obviously a result of the bias of electrophoresis experiments. In addition, the ribosomal proteins that make up an important highly expressed part of the class “protein synthesis” are underrepresented in the protein abundance set (GProt).

References

1. An, H., Scopes, R. K., Rodriguez, M., Keshav, K. F. & Ingram, L. O. Gel electrophoretic analysis of Zymomonas mobilis glycolytic and fermentative enzymes: identification of alcohol dehydrogenase II as a stress protein. J Bacteriol 173, 5975-82 (1991).
2. Anderson, L. & Seilhamer, J. A comparison of selected mRNA and protein abundances in human liver. Electrophoresis 18, 533-7 (1997).
3. Bairoch, A. Serendipity in bioinformatics, the tribulations of a Swiss bioinformatician through exciting times! Bioinformatics 16, 48-64 (2000).
4. Bairoch, A. & Apweiler, R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res 28, 45-8 (2000).
5. Bassett, D. E., Jr. et al. Exploiting the complete yeast genome sequence. Curr Opin Genet Dev 6, 763-6 (1996).
6. Batke, J., Benito, V. A. & Tompa, P. A possible in vivo mechanism of intermediate transfer by glycolytic enzyme complexes: steady state fluorescence anisotropy analysis of an enzyme complex formation. Arch Biochem Biophys 296, 654-9 (1992).
7. Cambillau, C. & Claverie, J. M. Structural and Genomic Correlates of Hyperthermostability. J Biol Chem 275, 32383-32386 (2000).
8. Carlson, M. The awesome power of yeast biochemical genomics. Trends in Genetics 16, 49-51 (2000).
9. Cavalcoli, J. D., VanBogelen, R. A., Andrews, P. C. & Moldover, B. Unique identification of proteins from small genome organisms: theoretical feasibility of high throughput proteome analysis. Electrophoresis 18, 2703-8 (1997).
10. Cho, R. J. et al. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell 2, 65-73 (1998).
11. Claverie, J. M. Computational methods for the identification of differential and coordinated gene expression. Hum Mol Genet 8, 1821-32 (1999).
12. Corthals, G., Wasinger, V. C., Hochstrasser, D. F. & Sanchez, J. C. The dynamic range of protein expression: a challenge for proteomic research. Electrophoresis 21, 1104-1115 (2000).
13. Costanzo, M. C. et al. The yeast proteome database (YPD) and Caenorhabditis elegans proteome database (WormPD): comprehensive resources for the organization and comparison of model organism protein information. Nucleic Acids Res 28, 73-6 (2000).
14. Das, R. & Gerstein, M. The Stability of Thermophilic Proteins: A Study Based on Comprehensive Genome Comparison. Functional & Integrative Genomics 1, 3345 (2000).
15. Day, D. A. & Tuite, M. F. Post-transcriptional gene regulatory mechanisms in eukaryotes: an overview. J Endocrinol 157, 361-71 (1998).
16. DeRisi, J. L., Iyer, V. R. & Brown, P. O. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278, 680-6 (1997).
17. Doolittle, W. F. The nature of the universal ancestor and the evolution of the proteome. Curr Opin Struct Biol 10, 355-8 (2000).
18. Drawid, A. & Gerstein, M. A Bayesian system integrating expression data with sequence patterns for localizing proteins: comprehensive application to the yeast genome. J Mol Biol 301, 1059-75 (2000).
19. Drawid, A., Jansen, R. & Gerstein, M. Gene Expression Levels are Correlated with Protein Subcellular Localization (in Press). Trends in Genetics (2000).
20. Drawid, A., Jansen, R. & Gerstein, M. Genome-wide analysis relating expression level with protein subcellular localization. Trends Genet 16, 426-30 (2000).
21. Einarson, M. & Golemis, E. Encroaching genomics: adapting large-scale science to small academic laboratories. Physiological Genomics 2, 85-92 (2000).
22. Eisen, M. B. & Brown, P. O. DNA arrays for analysis of gene expression. Methods Enzymol 303, 179-205 (1999).
23. Epstein, C. & Butow, R. Microarray technology - enhanced versatility, persistent challenge. Current Opinions Biotechnology 11, 36-41 (2000).
24. Ferea, T. & Brown, P. Observing the living genome. Current Opinions Genetic and Development 9, 715-722 (1999).
25. Fey, S. J. & Larsen, P. M. 2D or not 2D. Two-dimensional gel electrophoresis. Curr Opin Chem Biol 5, 26-33 (2001).
26. Fey, S. J. et al. Proteome analysis of Saccharomyces cerevisiae: a methodological outline. Electrophoresis 18, 1361-72 (1997).
27. Frishman, D. & Mewes, H. W. Protein structural classes in five complete genomes [letter]. Nat Struct Biol 4, 626-8 (1997).
28. Frishman, D. & Mewes, H. W. Genome-based structural biology. Prog Biophys Mol Biol 72, 1-17 (1999).
29. Futcher, B., Latter, G. I., Monardo, P., McLaughlin, C. S. & Garrels, J. I. A sampling of the yeast proteome. Mol Cell Biol 19, 7357-68 (1999).
30. Gaasterland, T. Archaeal genomics. Curr Opin Microbiol 2, 542-7 (1999).
31. Garrels, J. I. et al. Proteome studies of Saccharomyces cerevisiae: identification and characterization of abundant proteins. Electrophoresis 18, 1347-60 (1997).
32. Gerstein, M. A structural census of genomes: comparing bacterial, eukaryotic, and archaeal genomes in terms of protein structure. J Mol Biol 274, 562-76 (1997).
33. Gerstein, M. How representative are the known structures of the proteins in a complete genome? A comprehensive structural census. Fold Des 3, 497-512 (1998).
34. Gerstein, M. Measurement of the effectiveness of transitive sequence comparison, through a third 'intermediate' sequence. Bioinformatics 14, 707-14 (1998).
35. Gerstein, M. Patterns of Protein-Fold Usage in Eight Microbial Genomes: A Comprehensive Structural Census. Proteins 33, 518-534 (1998).
36. Gerstein, M. & Jansen, R. The current excitement in bioinformatics, analysis of whole-genome expression data: How does it relate to protein structure and function (In press). Current Opinions in Structural Biology (2000).
37. Gerstein, M., Lin, J. & Hegyi, H. Protein folds in the worm genome. Pac Symp Biocomput, 30-41 (2000).
38. Goffeau, A., Barrell, B. G., Bussey, H., Davis, R. W., Dujon, B., Feldmann, H., Galibert, F., Hoheisel, J. D., Jacq, C., Johnston, M., Louis, E. J., Mewes, H. W., Murakami, Y., Philippsen, P., Tettelin, H. & Oliver, S. G. Life with 6000 genes. Science 274 (1996).
39. Gygi, S. P., Corthals, G. L., Zhang, Y., Rochon, Y. & Aebersold, R. Evaluation of two-dimensional gel electrophoresis-based proteome analysis technology. Proc Natl Acad Sci U S A 97, 9390-5 (2000).
40. Gygi, S.
P., Rist, B. & Aebersold, R. Measuring gene expression by quantitative proteome analysis [In Process Citation]. Curr Opin Biotechnol 11, 396-401 (2000). 41. Gygi, S. P., Rochon, Y., Franza, B. R. & Aebersold, R. Correlation between protein and mRNA abundance in yeast. Mol Cell Biol 19, 1720-30. (1999). 42. Harry, J. W., MR Herbert, BR Packer,NH AA, Gooley Williams, KL. Ptoteomics: Capacity versus utility. Electrophoreisis 21, 1071-1081 (2000). 43. Hatzimanikatis, V., Choe, L. H. & Lee, K. H. Proteomics: theoretical and experimental considerations. Biotechnol Prog 15, 312-8 (1999). 44. Haynes, P. A. & Yates, J. R., 3rd. Proteome profiling-pitfalls and progress. Yeast 17, 81-7 (2000). 45. Hegyi, H. & Gerstein, M. The relationship between protein structure and function: a comprehensive survey with application to the yeast genome. J Mol Biol 288, 147-64 (1999). 46. Holstege, F. C. et al. Dissecting the regulatory circuitry of a eukaryotic genome. Cell 95, 717-728 (1998). Chapter 2: mRNA expression and protein abundance 91 47. Ishii, M. et al. Direct comparison of GeneChip and SAGE on the quantitative accuracy in transcript profiling analysis. Genomics 68, 136-43 (2000). 48. Jackson, R. J. & Wickens, M. Translational controls impinging on the 5'untranslated region and initiation factor proteins. Curr Opin Genet Dev 7, 233-41. (1997). 49. Jacobs Anderson, J. S. & Parker, R. Computational identification of cis-acting elements affecting post- transcriptional control of gene expression in Saccharomyces cerevisiae. Nucleic Acids Res 28, 1604-17. (2000). 50. Jaeger, J. A., Turner, D. H. & Zuker, M. Predicting optimal and suboptimal secondary structure for RNA. Methods Enzymol 183, 281-306 (1990). 51. Jansen, R. & Gerstein, M. Analysis of the yeast transcriptome with structural and functional categories: characterizing highly expressed proteins. Nucleic Acids Res 28, 1481-8 (2000). 52. Jelinsky, S. A. & Samson, L. D. 
Global response of Saccharomyces cerevisiae to an alkylating agent. Proc Natl Acad Sci U S A 96, 1486-91 (1999). 53. Jones, D. T. Do transmembrane protein superfolds exist? FEBS Lett 423, 281-5 (1998). 54. Jones, D. T. GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J Mol Biol 287, 797-815 (1999). 55. Kidd, D., Liu, Y. & Cravatt, B. F. Profiling serine hydrolase activities in complex proteomes. Biochemistry 40, 4005-15 (2001). Chapter 2: mRNA expression and protein abundance 92 56. Klose, J. Protein mapping by combined isoelectric focusing and electrophoresis of mouse tissues. A novel approach to testing for induced point mutations in mammals. Humangenetik 26, 231-43 (1975). 57. Krogh, A., Larsson, B., von Heijne, G. & Sonnhammer, E. L. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 305, 567-80 (2001). 58. Lin, J. & Gerstein, M. Whole-genome trees based on the occurrence of folds and orthologs: implications for comparing genomes on different levels. Genome Res 10, 808-18 (2000). 59. Lindahl, L. & Hinnebusch, A. Diversity of mechanisms in the regulation of translation in prokaryotes and lower eukaryotes. Curr Opin Genet Dev 2, 720-6. (1992). 60. Lipshutz, R. J., Fodor, S. P., Gingeras, T. R. & Lockhart, D. J. High density synthetic oligonucleotide arrays. Nat Genet 21, 20-4 (1999). 61. Lopez, M. F. Better approaches to finding the needle in a haystack: Optimizing proteome analysis through automation. Electrophoreisis 21, 1082-1093 (2000). 62. Luban, J. & Goff, S. P. The yeast two-hybrid system for studying protein-protein interactions. Curr Opin Biotechnol 6, 59-64 (1995). 63. MacBeath, G. & Schreiber, S. L. Printing proteins as microarrays for highthroughput function determination. Science 289, 1760-3. (2000). 64. Matton, D. P., Constabel, P. & Brisson, N. Alcohol dehydrogenase gene expression in potato following elicitor and stress treatment. 
Plant Mol Biol 14, 775-83 (1990). Chapter 2: mRNA expression and protein abundance 93 65. McCarthy, J. E. Posttranscriptional control of gene expression in yeast. Microbiol Mol Biol Rev 62, 1492-553. (1998). 66. Mewes, H. W. et al. MIPS: a database for genomes and protein sequences. Nucleic Acids Res 28, 37-40 (2000). 67. Millar, A. A., Olive, M. R. & Dennis, E. S. The expression and anaerobic induction of alcohol dehydrogenase in cotton. Biochem Genet 32, 279-300 (1994). 68. Molloy, M. P. Two-dimensional electrophoresis of membrane proteins using immobilized pH gradients. Anal Biochem 280, 1-10 (2000). 69. Morris, D. R. & Geballe, A. P. Upstream open reading frames as regulators of mRNA translation. Mol Cell Biol 20, 8635-42. (2000). 70. Nauchitel, V. V. & Somorjai, R. L. Spatial and free energy distribution patterns of amino acid residues in water soluble proteins. Biophysical Chemistry 51, 327-336 (1994). 71. Nelson, P. S. et al. Comprehensive analyses of prostate gene expression: convergence of expressed sequence tag databases, transcript profiling and proteomics [In Process Citation]. Electrophoresis 21, 1823-31 (2000). 72. O'Farrell, P. H. High resolution two-dimensional electrophoresis of proteins. J Biol Chem 250, 4007-21 (1975). 73. Pandey, A. & Mann, M. Proteomics to study genes and genomes. Nature 405, 837-46 (2000). 74. Qi, S. Y., Moir, A. & O'Connor, C. D. Proteome of Salmonella typhimurium SL1344: identification of novel abundant cell envelope proteins and assignment to a two-dimensional reference map. J Bacteriol 178, 5032-8 (1996). Chapter 2: mRNA expression and protein abundance 94 75. Ross-Macdonald, P. et al. Large-scale analysis of the yeast genome by transposon tagging and gene disruption. Nature 402, 413-8 (1999). 76. Rost, B., Casadio, R., Fariselli, P. & Sander, C. Transmembrane helices predicted at 95% accuracy. Protein Sci 4, 521-33 (1995). 77. Roth, F. P., Hughes, J. D., Estep, P. W. & Church, G. M. 
Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat BIOTECHNOL 16, 939-45 (1998). 78. Rubin, G. M. et al. Comparative genomics of the eukaryotes. Science 287, 220415 (2000). 79. Sali, A. Functional Links between Proteins. Nature 402, 25-26 (1999). 80. Santoni, V., Molloy, M. & Rabilloud, T. Membrane proteins and proteomics: un amour impossible? Electrophoreisis 21, 1054-1070 (2000). 81. Schena, M., Shalon, D., Davis, R. W. & Brown, P. O. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467-70 (1995). 82. Searls, D. B. Using bioinformatics in gene and drug discovery. Drug Discovery Today 5, 135-143 (2000). 83. Senes, A., Gerstein, M. & Engelman, D. M. Statistical analysis of amino acid patterns in transmembrane helices: the GxxxG motif occurs frequently and in association with beta-branched residues at neighboring positions. J Mol Biol 296, 921-36 (2000). 84. Shapiro, L. & Harris, T. Finding function through structural genomics. Current Opinions in Biotechnology 11, 31-35 (2000). Chapter 2: mRNA expression and protein abundance 95 85. Sharp, P. M. & Li, W. H. The codon Adaptation Index--a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res 15, 1281-95 (1987). 86. Sherlock, G. Analysis of large-scale gene expression data. Curr Opin Immunol 12, 201-5 (2000). 87. Shevchenko, A. et al. Linking genome and proteome by mass spectrometry: largescale identification of yeast proteins from two dimensional gels. Proc Natl Acad Sci U S A 93, 14440-5 (1996). 88. Smith, R. D. Probing proteomes-seeing the whole picture? [In Process Citation]. Nat Biotechnol 18, 1041-2 (2000). 89. Tatusov, R. L., Koonin, E. V. & Lipman, D. J. A genomic perspective on protein families. Science 278, 631-7 (1997). 90. Tekaia, F., Lazcano, A. & Dujon, B. The genomic tree as revealed from whole proteome comparisons. 
Genome Res 9, 550-7 (1999). 91. Velculescu, V. E. et al. Characterization of the yeast transcriptome. Cell 88, 243251 (1997). 92. Vilela, C., Linz, B., Rodrigues-Pousada, C. & McCarthy, J. E. The yeast transcription factor genes YAP1 and YAP2 are subject to differential control at the levels of both translation and mRNA stability. Nucleic Acids Res 26, 1150-9. (1998). 93. Vilela, C., Ramirez, C. V., Linz, B., Rodrigues-Pousada, C. & McCarthy, J. E. Post-termination ribosome interactions with the 5'UTR modulate yeast mRNA stability. Embo J 18, 3139-52. (1999). Chapter 2: mRNA expression and protein abundance 96 94. Wallin, E. & von Heijne, G. Genome-wide analysis of integral membrane proteins from eubacterial, archaean, and eukaryotic organisms. Protein Sci 7, 1029-38 (1998). 95. Washburn, M. P. & Yates, J. R., 3rd. Analysis of the microbial proteome. Curr Opin Microbiol 3, 292-7 (2000). 96. Winzeler, E. A. et al. Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science 285, 901-6 (1999). 97. Wittes, J. & Friedman, H. P. Searching for evidence of altered gene expression: a comment on statistical analysis of microarray data [editorial; comment]. J Natl Cancer Inst 91, 400-1 (1999). 98. Wolf, Y. I., Brenner, S. E., Bash, P. A. & Koonin, E. V. Distribution of protein folds in the three superkingdoms of life. Genome Res 9, 17-26 (1999). 99. Young, K. Yeast two-hybrid: so many interactions, (in) so little time... Biol Reprod 58, 302-311 (1998). 100. Zhang, M. Q. Large-scale gene expression data analysis: a new challenge to computational biologists [published erratum appears in Genome Res 1999 Nov;9(11):1156]. Genome Res 9, 681-8 (1999). 101. Zhu, H. et al. Analysis of yeast protein kinases using protein chips. Nat Genet 26, 283-9. (2000). 102. Zuker, M. Calculating nucleic acid secondary structure. Curr Opin Struct Biol 10, 303-10. (2000). 

Chapter 2: mRNA expression and protein abundance

2.2 Comparing protein abundance and mRNA expression levels on a genomic scale

Abstract

Attempts to correlate protein abundance with mRNA expression levels have had variable success. I review the results of these comparisons, focusing on yeast. In the process, I survey experimental techniques for determining protein abundance, principally two-dimensional gel electrophoresis and mass spectrometry. I also merge many of the available yeast protein-abundance datasets, using the resulting larger 'meta-dataset' to find correlations between protein and mRNA expression, both globally and within smaller categories.

Introduction

Although some of the underlying technology for quantifying protein abundance was introduced almost thirty years ago (Klose, 1975, O'Farrell, 1975), there has recently been a significant increase in the development of new tools. Concurrently, tools for analyzing mRNA expression are becoming more mainstream. Quantifying both of these molecular populations is not an exercise in redundancy; measurements of mRNA and protein levels are complementary, and both are necessary for a complete understanding of how the cell works (Hatzimanikatis et al., 1999). Additionally, as mRNA is eventually translated into protein, one might assume that there should be some sort of correlation between the level of mRNA and that of protein. Alternatively, there may not be any significant correlation, which, in itself, is an informative conclusion.

The two commonly used high-throughput methods for measuring mRNA expression, cDNA microarrays and Affymetrix oligonucleotide chips, have both been extensively reviewed elsewhere (Brown and Botstein, 1999, McGall and Christians, 2002, Schena et al., 1998). There are likewise two basic approaches to determining protein abundance: those based on two-dimensional electrophoresis and those based on mass spectrometry (Table 1). I provide a brief review of these technologies and of recent efforts to determine correlations between quantified protein abundances and mRNA expression.

Two-dimensional electrophoresis

Determining relative protein expression levels by conventional two-dimensional electrophoresis requires isoelectric focusing, SDS-polyacrylamide gel electrophoresis, staining, fixing, densitometry, and careful matching of the same spots on two or more gels. Differentially expressed spots are then excised and enzymatically digested, and the resulting peptides are identified using mass spectrometry. An attractive aspect of this approach is the low capital equipment cost, but a high level of expertise is needed to obtain reproducible gels. Two-dimensional electrophoresis is also generally limited to proteins that are neither too acidic, too basic, nor too hydrophobic, and that are between 10 and 200 kDa in size, so that they are reliably separated on gels. Additionally, this approach detects only those proteins that are expressed at relatively high levels and that have long half-lives (Gygi et al., 2000, Gygi et al., 1999). In one study using 40 μg of yeast lysate, the average protein abundance detected was 51,200 copies per cell, and no proteins were detected at abundances below 1,000 copies per cell (Gygi et al., 2000). Given that 1,500 spots were resolved on a 1.0 pH unit gel (Gygi et al., 2000), several gels covering different pH ranges would be needed to resolve a whole-cell lysate.
Given these limitations, conventional two-dimensional electrophoresis technology has limited potential for large-scale proteome analysis (Gygi et al., 2000).

Two-dimensional fluorescence-difference gel electrophoresis (DIGE) utilizes mass- and charge-matched, spectrally resolvable fluorescent dyes (such as Cy3 and Cy5) to label two different protein samples in vitro prior to two-dimensional electrophoresis. Its main advantage over conventional two-dimensional electrophoresis is that both the control and the experimental sample are run in a single polyacrylamide gel. The samples are then imaged separately but can be perfectly overlaid without any 'warping' of the gels. This substantially raises the confidence with which protein changes between samples can be detected and quantified. Changes in the relative expression level of a protein as small as 1.2-fold can be detected for large-volume spots (Tonge et al., 2001). Because detection is based on fluorescence, DIGE has a large dynamic range of about 10,000, which permits differential expression analysis of proteins that are present at relatively low copy number (Tonge et al., 2001). The limit of detection of DIGE for quantifying protein expression ratios is between 0.25 and 0.95 ng of protein, which is similar to that for silver staining (Gharbi et al., 2002, Tonge et al., 2001). In a recent study (Zhou et al., 2002), the relative expression levels of approximately 1,050 protein spots were compared in 250,000 laser-dissected normal versus esophageal carcinoma cells. This analysis identified 58 spots that were up-regulated by more than three-fold and 107 that were down-regulated by more than three-fold in cancer cells.

Mass spectrometric approaches

Disease biomarker discovery

Current approaches to discovering protein or peptide markers of disease involve batch chromatography, matrix-assisted laser desorption ionization mass spectrometry (MALDI-MS) and statistical analysis of large numbers of disease versus normal serum or other biological samples. Most recent studies have relied on surface-enhanced laser desorption ionization time-of-flight mass spectrometry (SELDI-TOF-MS) (Adam et al., 2001, Issaq et al., 2002). The SELDI approach (Issaq et al., 2002) uses a gold-coated chip with eight or sixteen 2 mm spots that are modified with chromatographic surfaces (for example anionic, cationic, hydrophobic, and so on). After a few microliters of serum are spotted, contaminants and salt are removed by washing with water, a MALDI matrix solution such as α-cyano-4-hydroxy-cinnamic acid is added, and the target is dried. In a study by Petricoin et al. (2002), SELDI-MS analysis of serum from 50 control and 50 case samples from patients with ovarian cancer identified five peptide biomarkers that ranged in size from 534 to 2,465 Da. The pattern formed by these markers was then used to correctly classify all 50 ovarian cancer samples in a masked set of serum samples from 116 patients, comprising 50 patients with ovarian cancer and 66 unaffected women. Similar promising results have been reported in studies of serum samples from breast and prostate cancer patients (Adam et al., 2001, Li et al., 2002). In a recent study (Wu et al., 2003), which compared the relative ability of several different statistical approaches to classify samples based on MS data, the disease biomarker approach was extended to a conventional MALDI-MS platform.
Although powerful, the disease biomarker approach does not provide accurate relative amounts of the control versus experimental biomarker, only the relative intensity difference.

Isotope-coded affinity-tag-based protein profiling

While both MALDI-MS-based disease biomarker discovery and DIGE comparatively profile the naturally occurring forms of peptides and proteins, isotope-coded affinity-tag (ICAT) analysis profiles the relative amounts of cysteine-containing peptides derived from tryptic digests of protein extracts. Because only a single tryptic peptide is needed to quantify the expression of the corresponding parent protein, the ICAT reagent utilizes a thiol-reactive group that attaches both a biotin tag and either nine 12C (light) or nine 13C (heavy) atoms to each cysteine residue. Following derivatization of the control protein extract with [12C]-ICAT reagent and the experimental extract with [13C]-ICAT reagent, the pooled samples are subjected to trypsin digestion followed by both cation-exchange and avidin chromatography. Liquid chromatography and tandem mass spectrometry (LC/MS/MS) is then used to identify ICAT peptide pairs and to quantify the relative 12C/13C ratios. It is important to note that the ICAT approach provides the relative expression ratios of individual proteins under two conditions; it does not provide absolute protein concentrations, nor does it provide the ratio of the concentration of one protein relative to another in a single condition. A useful feature of this approach is that the in vitro incorporation of a stable isotope into one of the two samples being compared obviates the need to analyze the control and experimental samples separately by MS.
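The ratio-based quantification described above can be illustrated with a minimal sketch. It assumes that peak intensities for each light/heavy peptide pair have already been extracted from the LC/MS/MS run; the function name, the input layout, the placeholder protein identifiers, and the choice of the median as the per-protein summary are all illustrative assumptions and are not taken from any of the studies cited here.

```python
from collections import defaultdict
from statistics import median


def protein_ratios(peptide_pairs):
    """Summarize heavy/light (experimental/control) ratios per protein.

    peptide_pairs: iterable of (protein_id, light_intensity, heavy_intensity),
    one entry per cysteine-containing peptide pair. This layout is hypothetical.
    """
    ratios_by_protein = defaultdict(list)
    for protein_id, light, heavy in peptide_pairs:
        # A ratio is undefined unless both members of the pair were observed.
        if light > 0 and heavy > 0:
            ratios_by_protein[protein_id].append(heavy / light)
    # The median is used here only because it is robust to a single outlying
    # peptide; other summaries (e.g. a mean of log ratios) are equally plausible.
    return {p: median(r) for p, r in ratios_by_protein.items()}


if __name__ == "__main__":
    example = [
        ("PROT_A", 100.0, 200.0),
        ("PROT_A", 50.0, 150.0),
        ("PROT_A", 10.0, 10.0),
        ("PROT_B", 80.0, 40.0),
        ("PROT_C", 0.0, 5.0),  # no light signal, so no ratio is reported
    ]
    for protein, ratio in sorted(protein_ratios(example).items()):
        print(protein, ratio)
```

Note that the output is a ratio for each protein between the two conditions only; consistent with the caveat above, nothing in such a calculation yields absolute concentrations or allows one protein to be compared with another within a single sample.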

A tryptic digest of a whole-cell human protein extract might produce more than 500,000 peptides, of which fewer than 100,000 would be expected to contain cysteine. Nevertheless, a search of the SwissProt database indicates that fewer than 5% of human proteins lack cysteine and would therefore be missed (that is, more than 95% of proteins would yield at least one cysteine-containing peptide). ICAT results are analogous to those obtained by the use of two different fluorescent dyes in DNA microarray analysis of mRNA levels or in DIGE analysis of protein expression. The largest number of proteins profiled so far using this approach with a single sample is the 491 proteins contained in microsomal fractions of naive and in vitro differentiated human myeloid leukemia cells (Han et al., 2001).

Multidimensional protein identification technology

Multidimensional protein identification technology (MudPit) is similar to ICAT in that it utilizes cation-exchange prefractionation followed by reverse-phase (RP) high-performance liquid chromatography (HPLC) separation and MS/MS analysis (Wolters et al., 2001). In contrast to the ICAT approach, however, MudPit analyzes the entire mixture of tryptically digested proteins and utilizes tandemly coupled (cation-exchange followed by reverse-phase) columns. A specific subset of peptides is eluted from the cation-exchange column, using a step gradient of increasing salt concentration, onto the front of the RP column. Peptides are then eluted from the RP column and enter the mass spectrometer for analysis. After the RP gradient is complete, the next step of the salt gradient releases another subset of peptides from the cation-exchange column onto the RP column, and the process repeats. Using this approach on the yeast proteome, Wolters et al. (2001) identified 5,540 unique peptides from 1,484 proteins and demonstrated a dynamic range of detection of 10,000-fold.
This method has been extended to comparative protein profiling by using in vivo 14N/15N metabolic labeling (Washburn et al., 2003, Washburn et al., 2002). Washburn et al. (2002) used Saccharomyces cerevisiae grown in both 14N- and 15N-containing minimal media, and 2,264 peptides and 872 proteins were uniquely identified. The 14N/15N ratio was also quantified for each peptide, with an average standard deviation of 30%.

Comparison of mRNA and protein levels

Even with the significant developments in the technologies used to quantify protein abundance over the past couple of years, protein identification and quantification still lag behind the high-throughput experimental techniques used to determine mRNA expression levels. Yet, while mRNA expression values have shown their usefulness in a broad range of applications, including the diagnosis and classification of cancers (Golub et al., 1999, Macgregor and Squire, 2002), these results are almost certainly only correlative, rather than causative; in the end it is most probably the concentrations of proteins and their interactions that are the true causative forces in the cell, and it is the corresponding protein quantities that ought to be studied. Primarily because of a limited ability to measure protein abundances, researchers have tried to find correlations between mRNA and the limited protein expression data, in the hope that they could determine protein abundance levels from the more copious and technically easier mRNA experiments. Alternatively, if there is definitively no correlation between mRNA and protein data, both quantities could be used as independent sources of information in machine-learning algorithms, for example, to predict protein interactions.

To date, there have been only a handful of efforts to find correlations between mRNA and protein expression levels, most notably in human cancers and yeast cells; for the most part, they have reported only minimal and/or limited correlations. One of the earliest analyses, by Anderson and Seilhamer (1997), looked at 19 proteins in the human liver and found a modest positive correlation of 0.48. Another limited analysis, of the three genes MMP-2, MMP-9 and TIMP-1 in human prostate cancers, showed no significant relationship (Lichtinghagen et al., 2002). An additional cancer study (Chen et al., 2002) showed a significant correlation in only a small subset of the proteins studied. Conversely, Orntoft et al. (2002) found highly significant correlations in human carcinomas when looking at changes in mRNA and protein expression levels.

Protein and mRNA correlations in yeast

Many of the present efforts at correlating mRNA and protein expression have been conducted in yeast using two-dimensional electrophoresis techniques. In particular, Gygi et al. (1999) found that even similar mRNA expression levels could be accompanied by a wide range (up to 20-fold difference) of protein abundance levels, and vice versa. These results contrast with those of Futcher et al. (1999), who found relatively high correlations (r = 0.76) after transforming the data to normal distributions. In a previous analysis (Greenbaum et al., 2002), I merged these two datasets (referred to as 2DE-1 (Gygi et al., 1999) and 2DE-2 (Futcher et al., 1999)) and compared the resulting larger protein abundance set ('merged data-set 1') with a comprehensive mRNA expression dataset. The mRNA expression reference set was constructed by iteratively combining, in a non-trivial fashion, three sets that used Affymetrix chips and a SAGE dataset (Greenbaum et al., 2002).
Using these reference datasets, I was able to do an all-against-all comparison of mRNA and protein expression levels, in addition to a number of analyses comparing protein and mRNA expression within smaller, but broad, categories (Greenbaum et al., 2002, Luscombe et al., 2001).

Given the difficult, laborious, and limiting nature of two-dimensional electrophoresis analysis, many of the newer protein abundance determinations have been done using MudPit and derivative technologies. Washburn et al. (2001) used MudPit to analyze and detect 1,484 arbitrary proteins: they were able to detect a somewhat random sampling of proteins independent of abundance, localization, size or hydrophobicity (I refer to this dataset as MudPit-1). In a further experiment comparing expression ratios for both protein and mRNA levels, the authors found that although there were no correlations for individual loci, there were overall correlations for pathways and complexes of proteins that function together (Washburn et al., 2003). Peng et al. (2003) analyzed 1,504 yeast proteins with a false-positive rate (misidentification of a protein) of less than 1% (I refer to this dataset as MudPit-2). In their analysis, they contrasted their methodology with that of Washburn et al. (2001), with which there was significant overlap of proteins.

A new merged dataset

Expanding upon my previous merged dataset, I constructed a new merged dataset ('merged data-set 2') using the two two-dimensional electrophoresis datasets and the two MudPit datasets described above. Succinctly (more information is available on my website at: http://bioinfo.mbb.yale.edu/expression/prot-v-mrna/), I transformed each of the protein-abundance datasets into more quantitative data by fitting each protein dataset individually onto the reference mRNA expression dataset.
The MudPit-1 dataset was also fitted onto the more finely grained MudPit-2 dataset. Each of the new, fitted datasets was then inversely transformed back into protein space. These derived protein datasets were then combined into a larger reference dataset; when I had more than one abundance value for an open reading frame (ORF), I chose the value from the dataset according to a prescribed quality ranking (see Figure 1). The resulting set contained protein abundance information for approximately 2,000 ORFs.

(One caveat with the MudPit data: while quantitative analysis can subsequently be done on the results of MudPit experiments, MudPit data alone are only semi-quantitative, in that the number of peptides determined is relative to the actual protein abundance within the cell (Washburn et al., 2001). Some may therefore argue that MudPit alone is not optimal for a comparison with mRNA data. Nevertheless, I feel that my methodical merging process creates a quantitative and representative dataset that can be compared with the mRNA expression data.)

Using the resulting data I could compare mRNA expression and protein abundance globally (Figure 1a) as well as within smaller, broad categories, such as function or localization (see Figure 1b, 1c). In particular, I show that some localization categories - for example, the nucleolus - have significantly higher correlations than the global correlation. Other localizations - for example, the mitochondria - show less of a correlation between mRNA and protein data, possibly reflecting the heterogeneous nature and function of that organelle. In terms of MIPS functional categories (Mewes et al., 2002), I show that although some categories, such as cell rescue, show a lower correlation than the whole merged set, other functional categories, such as cell cycle, show a significant increase in correlation.
Logically, this increased correlation reflects the co-regulated nature of the proteins in this functional category.

Reasons for the absence of correlation

There are presumably at least three reasons for the poor correlations generally reported in the literature between the level of mRNA and the level of protein, and these may not be mutually exclusive. First, there are many complicated and varied post-transcriptional mechanisms involved in turning mRNA into protein that are not yet sufficiently well defined to be able to compute protein concentrations from mRNA; second, proteins may differ substantially in their in vivo half-lives; and/or third, there is a significant amount of error and noise in both protein and mRNA experiments that limits my ability to get a clear picture (Baldi and Long, 2001, Szallasi, 1999).

Examining the first option - that there are a number of complex steps between transcription and translation - I looked at correlations between mRNA and protein abundance for those ORFs that had varied or steady levels of mRNA expression over the course of the cell cycle (Cho et al., 1998). To normalize for the different expression levels of different ORFs, I took the standard deviation divided by the average expression level as representative of the variation of each ORF over the course of the yeast cell cycle (Figure 2). Broadly speaking, the cell can control the levels of protein at the transcriptional level and/or at the translational level. Logically, I would assume that those ORFs that show a large degree of variation in their expression are controlled at the transcriptional level - the variability of the mRNA expression is indicative of the cell controlling mRNA expression at different points of the cell cycle to achieve the desired protein levels.
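The normalization just described (standard deviation divided by average expression, that is, a coefficient of variation) can be sketched as follows. This is only an illustration under stated assumptions: the use of the population rather than the sample standard deviation, the function names, and the placeholder ORF labels are hypothetical and are not specified in the text.

```python
from statistics import mean, pstdev


def coefficient_of_variation(profile):
    """Standard deviation of an expression time course divided by its mean.

    profile: mRNA expression values for one ORF across cell-cycle time points.
    The population standard deviation is an assumption made for this sketch.
    """
    average = mean(profile)
    if average == 0:
        return float("nan")  # undefined for an ORF with zero mean expression
    return pstdev(profile) / average


def rank_by_variation(profiles):
    """Return ORF names ordered from most to least variable.

    profiles: dict mapping a (hypothetical) ORF name to its expression profile.
    """
    return sorted(
        profiles,
        key=lambda orf: coefficient_of_variation(profiles[orf]),
        reverse=True,
    )


if __name__ == "__main__":
    demo = {
        "ORF_steady": [10.0, 10.0, 10.0, 10.0],
        "ORF_varied": [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0],
    }
    for orf in rank_by_variation(demo):
        print(orf, coefficient_of_variation(demo[orf]))
```

Dividing by the mean is what makes a highly expressed ORF and a weakly expressed ORF comparable: a fluctuation of the same absolute size counts for more when the average level is low.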
Thus I would expect, and I found, a high degree of correlation (r = 0.89) between the reference mRNA and protein levels for these particular ORFs; the cell has already put significant energy into dictating the final level of protein through tightly controlling the mRNA expression, and I assume that there would then be minimal control at the protein level. In contrast, those genes that show minimal variation in their mRNA expression throughout the cell cycle are more likely to have little or no correlation with the final protein level; the cell would be controlling these ORFs at the translational and/or post-translational level, with the mRNA levels being somewhat independent of the final protein concentration. And indeed, I found only minimal correlation between protein and mRNA expression for these ORFs (r = 0.2).

Furthermore, I found that those ORFs that have higher than average levels of ribosomal occupancy - that is, a large percentage of their cellular mRNA concentration is associated with ribosomes (being translated) - have well correlated mRNA and protein expression levels (Figure 2). These cases probably represent a situation wherein the cell, having significantly controlled the mRNA expression to produce a specific level of protein, will probably not also employ mechanisms to control the translation. Conversely, those proteins that have very low occupancy rates have uncorrelated mRNA and protein expression; thus, given that the cell has not tightly controlled the mRNA expression for this ORF, it will dictate the resulting protein levels through rigorous controls of its translation (that is, through tight limits on occupancy) (Arava et al., 2003).

A second option for a general lack of correlation between mRNA and protein abundance may be that proteins have very different half-lives as the result of varied protein synthesis and degradation.
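As a rough illustration of the variability and occupancy comparisons described above, the following Python sketch ranks ORFs by the standard deviation of their cell-cycle profile divided by its mean, and then computes the Pearson correlation between mRNA and protein levels within the most and least variable groups. The function names, the choice of the population standard deviation, the group size, and the toy values are all assumptions made for illustration; they are not taken from the analysis itself.

```python
import math
import statistics


def coefficient_of_variation(profile):
    """Standard deviation of a profile divided by its mean.

    Assumption: the population standard deviation is used; the text
    does not say which estimator was applied.
    """
    return statistics.pstdev(profile) / statistics.fmean(profile)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_in_extremes(profiles, mrna, protein, group_size):
    """Return (r_most_variable, r_least_variable) for mRNA vs protein."""
    ranked = sorted(
        profiles,
        key=lambda orf: coefficient_of_variation(profiles[orf]),
        reverse=True,
    )
    most, least = ranked[:group_size], ranked[-group_size:]

    def group_r(group):
        return pearson_r([mrna[o] for o in group], [protein[o] for o in group])

    return group_r(most), group_r(least)


# Hypothetical toy data: six made-up ORFs, not real measurements.
toy_profiles = {
    "orfA": [1, 9, 1, 9], "orfB": [2, 8, 2, 8], "orfC": [3, 7, 3, 7],
    "orfD": [5, 5, 5, 5], "orfE": [4, 6, 4, 6], "orfF": [5, 5, 6, 4],
}
toy_mrna = {"orfA": 10, "orfB": 20, "orfC": 30, "orfD": 10, "orfE": 20, "orfF": 30}
toy_protein = {"orfA": 100, "orfB": 200, "orfC": 300, "orfD": 50, "orfE": 40, "orfF": 45}

r_most, r_least = correlation_in_extremes(toy_profiles, toy_mrna, toy_protein, 3)
print(r_most, r_least)
```

The same ranking-and-splitting step would apply to occupancy or any other per-ORF score by swapping the sort key.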
Protein turnover can vary significantly depending on a number of different conditions (Glickman and Ciechanover, 2002); the cell can control the rates of degradation or synthesis for a given protein, and there is significant heterogeneity even within proteins that have similar functions (Pratt et al., 2002). Recent efforts have been made to computationally measure these rates (Lian et al., 2002). Simplistically, it can be presumed that the change in a protein's concentration over time will be equal to the rate of translation minus the rate of degradation. By analogy to concepts in chemical kinetics, I can approximate this as

$$\frac{dP(i,t)}{dt} = S\,E(i,t) - D\,P(i,t),$$

where P(i,t) is the abundance of protein i at time t, E(i,t) is the mRNA expression level for protein i at time t, S is a general rate of protein synthesis per mRNA, and D is a general rate of protein degradation per protein (Gerner et al., 2002). Additionally, there are some experimental methods that can be used to measure turnover and the translational control of protein levels (Gerner et al., 2002, Lian et al., 2002, Pratt et al., 2002, Serikawa et al., 2003).

Given the degenerate nature of the genetic code, there are many synonymous codons (codons that translate into the same amino acid). As the cell is biased in its usage of synonymous codons - that is, the usage of a subset of codons results in a higher level of mRNA expression, possibly as a result of differing cellular tRNA levels (Bennetzen and Hall, 1982) - the codon adaptation index (CAI), a measurement of codon usage, can be used to predict the expression of a gene (Sharp and Li, 1987) (we recently calculated new parameters for this model, with some improvement in predictive strength (Jansen et al., 2003)).
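To make the turnover equation given above more concrete, here is a minimal Python sketch of a rate model in which the change in protein abundance equals synthesis (S times mRNA expression E) minus degradation (D times protein abundance P). It assumes a constant mRNA level, uses a simple forward-Euler integration, and all parameter values and function names are hypothetical. Under these assumptions the abundance settles at S·E/D, so two proteins with identical mRNA levels but different degradation rates end up at different abundances.

```python
def steady_state_protein(expression, synthesis_rate, degradation_rate):
    """Abundance at which synthesis S*E balances degradation D*P."""
    return synthesis_rate * expression / degradation_rate


def simulate_protein(expression, synthesis_rate, degradation_rate,
                     p0=0.0, dt=0.01, steps=10000):
    """Forward-Euler integration of dP/dt = S*E - D*P.

    Assumption: the mRNA expression level is held constant over time.
    """
    p = p0
    for _ in range(steps):
        p += dt * (synthesis_rate * expression - degradation_rate * p)
    return p


# Hypothetical parameters in arbitrary units, chosen only for illustration.
E, S = 5.0, 20.0
stable = simulate_protein(E, S, degradation_rate=0.5)
unstable = simulate_protein(E, S, degradation_rate=2.0)
print(stable, unstable)
```

In this toy setting the only difference between the two simulated proteins is the degradation rate, which is one way to picture how varied half-lives could weaken the mRNA-protein correlation.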
It is thought that the CAI will correlate differently with mRNA levels than with protein abundance levels due, in part, to protein turnover rates (Coghlan and Wolfe, 2000). Ranking the ORFs in terms of their CAI value, I found that although those ORFs that ranked the highest in terms of CAI did not show a very strong correlation between mRNA and protein levels, they nevertheless showed a significantly higher correlation than ORFs with the lowest CAI values (r = 0.48 versus 0.02). The low correlations reflect the fact that the CAI will correlate differently for protein and mRNA values because of the additional cellular controls on protein translation, namely the effect of protein turnover rates. Nevertheless, the sizable difference in correlations between the two groups of ORFs with high- and low-ranking CAI values (Figure 2) shows that there is some relationship between mRNA and protein values, possibly indicating that highly expressed genes tend to result in a more correlated level of protein abundance than lower expressed ones.

Correlations have been found between the mRNA expression levels of different protein subunits within protein complexes (Jansen et al., 2002). This implies that there should be, in general, a correlation between mRNA and protein abundance, because these subunits are a special case: they have to be available in stoichiometric amounts for the complexes to function. Thus, I believe that a major limitation to finding correlations is the degree of natural and manufactured systematic noise in mRNA and protein expression experiments. There is a continued effort to both describe and reduce this noise (Qian et al., 2003).
Meanwhile, in an attempt to get around the noise one could look at broad categories of proteins - for example, groups defined by function, structure, or localization - such that the background noise is cancelled out to some degree (Greenbaum et al., 2002).

Although proteomics is still in its infancy, given the pace of technological advancement in protein quantification, mRNA expression analysis and noise reduction, more comprehensive correlation studies will soon be feasible. This will allow for more robust analyses of the relationship between mRNA expression and protein abundance values. Finally, to be fully able to understand the relationship between mRNA and protein abundances, the dynamic processes involved in protein synthesis and degradation have to be better understood; is the protein level changing because of a change in the rate of protein synthesis, or mRNA, or protein turnover? These questions need to be looked into further before I can appreciate in full the relationship between mRNA and protein abundance levels.

Acknowledgements

This project was funded in part with Federal funds from the National Heart, Lung, and Blood Institute, National Institutes of Health, under contract No. N01-HV-28186.

Figures and Tables

2.2 Comparing protein abundance and mRNA expression levels on a genomic scale

Table 1 Proteomic Technologies

Figure 1 Comparison of mRNA expression and protein abundance.

(a) A plot comparing my mRNA reference expression set (Greenbaum et al., 2002) with my newly compiled protein abundance dataset. The mRNA axis is in copies per cell; the protein axis is in thousand copies per cell.
The protein dataset is the result of iteratively fitting two MudPit datasets (MudPit-1 (Washburn et al., 2001) and MudPit-2 (Peng et al., 2003)) and two two-dimensional electrophoresis datasets (2DE-1 (Gygi et al., 1999) and 2DE-2 (Futcher et al., 1999)). Given the semi-quantitative nature of the MudPit data (Washburn et al., 2001), I transformed the data into a more quantitative set by fitting each set individually onto my reference mRNA expression dataset. In addition, I fit the MudPit-1 dataset onto the more finely grained MudPit-2 dataset. Each of the datasets was then moved back into 'protein space' using an inverse transformation derived from the 2DE-1 set, as this set has the most precise values. These datasets were then combined into the new reference abundance dataset. In cases in which there were overlapping values for a given ORF I used the dataset in accord with the following ordering: 2DE-1, 2DE-2, MudPit-2, MudPit-1. The resulting reference protein abundance dataset (N = 2044) had a correlation of 0.66 with the mRNA reference dataset. (b,c) Additionally, I show that when looking at specific subsets (subcellular localization (Kumar et al., 2002) or functional groups (Mewes et al., 2002)) I can find both higher and lower correlations amongst these groups. The lower correlations are generally reflective of a more heterogeneous category. This analysis indicates that while correlations may be weak when looking at the global data, I tend to find higher correlations when looking at smaller well-defined subsets of ORFs. Further analysis is available at http://bioinfo.mbb.yale.edu/expression/prot-v-mrna/.

Figure 2 The differences in correlation between mRNA and protein expression values using novel categories.
I see significant differences when looking at the highest- and lowest-ranking groups of ORFs in the following categories: occupancy, CAI (codon adaptation index) value (Bennetzen and Hall, 1982, Jansen et al., 2003, Sharp and Li, 1987) and variability. Occupancy refers to the percentage of transcripts associated with ribosomes; I compared the correlation between the top 100 ORFs and the bottom 100 in terms of occupancy (r = 0.78 versus 0.30). For the CAI, I compared the correlation between mRNA and protein for those ORFs with the highest CAI and those with the lowest (r = 0.48 versus 0.02). Variability refers to the normalized standard deviation (that is, the standard deviation divided by the average expression level) for all ORFs in the cell-cycle expression dataset of Cho et al. (Cho et al., 1998). Here, I compared the correlations between protein abundance and mRNA expression for the most variable compared with the least variable proteins (r = 0.89 versus 0.20). I found significant differences between the correlations of mRNA and protein levels for the top- and bottom-ranking populations for each of the comparisons.

References

1. Swissprot (http://us.expasy.org/sprot/)
2. Adam, B. L., Vlahou, A., Semmes, O. J. & Wright, G. L., Jr. Proteomic approaches to biomarker discovery in prostate and bladder cancers. Proteomics 1, 1264-70 (2001).
3. Anderson, L. & Seilhamer, J. A comparison of selected mRNA and protein abundances in human liver. Electrophoresis 18, 533-7 (1997).
4. Arava, Y. et al. Genome-wide analysis of mRNA translation profiles in Saccharomyces cerevisiae. Proc Natl Acad Sci U S A 100, 3889-94 (2003).
5. Baldi, P. & Long, A. D. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 17, 509-19 (2001).
6. Bennetzen, J. L. & Hall, B. D. Codon selection in yeast. J Biol Chem 257, 3026-31 (1982).
7. Brown, P. O. & Botstein, D. Exploring the new world of the genome with DNA microarrays. Nat Genet 21, 33-7 (1999).
8. Chen, G. et al. Discordant protein and mRNA expression in lung adenocarcinomas. Mol Cell Proteomics 1, 304-13 (2002).
9. Cho, R. J. et al. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell 2, 65-73 (1998).
10. Coghlan, A. & Wolfe, K. H. Relationship of codon bias to mRNA concentration and protein length in Saccharomyces cerevisiae. Yeast 16, 1131-45 (2000).
11. Futcher, B., Latter, G. I., Monardo, P., McLaughlin, C. S. & Garrels, J. I. A sampling of the yeast proteome. Mol Cell Biol 19, 7357-68 (1999).
12. Gerner, C. et al. Concomitant determination of absolute values of cellular protein amounts, synthesis rates, and turnover rates by quantitative proteome profiling. Mol Cell Proteomics 1, 528-37 (2002).
13. Gharbi, S. et al. Evaluation of two-dimensional differential gel electrophoresis for proteomic expression analysis of a model breast cancer cell system. Mol Cell Proteomics 1, 91-8 (2002).
14. Glickman, M. H. & Ciechanover, A. The ubiquitin-proteasome proteolytic pathway: destruction for the sake of construction. Physiol Rev 82, 373-428 (2002).
15. Golub, T. R. et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-7 (1999).
16. Greenbaum, D., Jansen, R. & Gerstein, M. Analysis of mRNA expression and protein abundance data: an approach for the comparison of the enrichment of features in the cellular population of proteins and transcripts. Bioinformatics 18, 585-96 (2002).
17. Gygi, S. P., Corthals, G. L., Zhang, Y., Rochon, Y. & Aebersold, R. Evaluation of two-dimensional gel electrophoresis-based proteome analysis technology. Proc Natl Acad Sci U S A 97, 9390-5 (2000).
18. Gygi, S. P., Rochon, Y., Franza, B. R. & Aebersold, R. Correlation between protein and mRNA abundance in yeast. Mol Cell Biol 19, 1720-30 (1999).
19. Han, D. K., Eng, J., Zhou, H. & Aebersold, R. Quantitative profiling of differentiation-induced microsomal proteins using isotope-coded affinity tags and mass spectrometry. Nat Biotechnol 19, 946-51 (2001).
20. Hatzimanikatis, V., Choe, L. H. & Lee, K. H. Proteomics: theoretical and experimental considerations. Biotechnol Prog 15, 312-8 (1999).
21. Issaq, H. J., Veenstra, T. D., Conrads, T. P. & Felschow, D. The SELDI-TOF MS approach to proteomics: protein profiling and biomarker identification. Biochem Biophys Res Commun 292, 587-92 (2002).
22. Jansen, R., Bussemaker, H. J. & Gerstein, M. Revisiting the codon adaptation index from a whole-genome perspective: analyzing the relationship between gene expression and codon occurrence in yeast using a variety of models. Nucleic Acids Res 31, 2242-51 (2003).
23. Jansen, R., Greenbaum, D. & Gerstein, M. Relating whole-genome expression data with protein-protein interactions. Genome Res 12, 37-46 (2002).
24. Klose, J. Protein mapping by combined isoelectric focusing and electrophoresis of mouse tissues. A novel approach to testing for induced point mutations in mammals. Humangenetik 26, 231-43 (1975).
25. Kumar, A. et al. Subcellular localization of the yeast proteome. Genes Dev 16, 707-19 (2002).
26. Li, J., Zhang, Z., Rosenzweig, J., Wang, Y. Y. & Chan, D. W. Proteomics and bioinformatics approaches for identification of serum biomarkers to detect breast cancer. Clin Chem 48, 1296-304 (2002).
27. Lian, Z. et al. Genomic and proteomic analysis of the myeloid differentiation program: global analysis of gene expression during induced differentiation in the MPRO cell line. Blood 100, 3209-20 (2002).
28. Lichtinghagen, R. et al. Different mRNA and protein expression of matrix metalloproteinases 2 and 9 and tissue inhibitor of metalloproteinases 1 in benign and malignant prostate tissue. Eur Urol 42, 398-406 (2002).
29. Luscombe, N. M., Greenbaum, D. & Gerstein, M. What is bioinformatics? A proposed definition and overview of the field. Methods Inf Med 40, 346-58 (2001).
30. Macgregor, P. F. & Squire, J. A. Application of microarrays to the analysis of gene expression in cancer. Clin Chem 48, 1170-7 (2002).
31. McGall, G. H. & Christians, F. C. High-density genechip oligonucleotide probe arrays. Adv Biochem Eng Biotechnol 77, 21-42 (2002).
32. Mewes, H. W. et al. MIPS: a database for genomes and protein sequences. Nucleic Acids Res 30, 31-4 (2002).
33. O'Farrell, P. H. High resolution two-dimensional electrophoresis of proteins. J Biol Chem 250, 4007-21 (1975).
34. Orntoft, T. F., Thykjaer, T., Waldman, F. M., Wolf, H. & Celis, J. E. Genomewide study of gene copy numbers, transcripts, and protein levels in pairs of noninvasive and invasive human transitional cell carcinomas. Mol Cell Proteomics 1, 37-45 (2002).
35. Peng, J., Elias, J. E., Thoreen, C. C., Licklider, L. J. & Gygi, S. P. Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: the yeast proteome. J Proteome Res 2, 43-50 (2003).
36. Petricoin, E. F. et al. Use of proteomic patterns in serum to identify ovarian cancer. Lancet 359, 572-7 (2002).
37. Pratt, J. M. et al. Dynamics of protein turnover, a missing dimension in proteomics. Mol Cell Proteomics 1, 579-91 (2002).
38. Qian, J., Kluger, Y., Yu, H. & Gerstein, M. Identification and correction of spurious spatial correlations in microarray data. Biotechniques 35, 42-4, 46, 48 (2003).
39. Schena, M. et al. Microarrays: biotechnology's discovery platform for functional genomics. Trends Biotechnol 16, 301-6 (1998).
40. Serikawa, K. A. et al. The Transcriptome and Its Translation during Recovery from Cell Cycle Arrest in Saccharomyces cerevisiae. Mol Cell Proteomics 2, 191-204 (2003).
41. Sharp, P. M. & Li, W. H. The codon Adaptation Index--a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res 15, 1281-95 (1987).
42. Szallasi, Z. Genetic network analysis in light of massively parallel biological data acquisition. Pac Symp Biocomput, 5-16 (1999).
43. Tonge, R. et al. Validation and development of fluorescence two-dimensional differential gel electrophoresis proteomics technology. Proteomics 1, 377-96 (2001).
44. Washburn, M. P. et al. Protein pathway and complex clustering of correlated mRNA and protein expression analyses in Saccharomyces cerevisiae. Proc Natl Acad Sci U S A 100, 3107-12 (2003).
45. Washburn, M. P., Ulaszek, R., Deciu, C., Schieltz, D. M. & Yates, J. R., 3rd. Analysis of quantitative proteomic data generated via multidimensional protein identification technology. Anal Chem 74, 1650-7 (2002).
46. Washburn, M. P., Wolters, D. & Yates, J. R., 3rd. Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat Biotechnol 19, 242-7 (2001).
47. Wolters, D. A., Washburn, M. P. & Yates, J. R., 3rd. An automated multidimensional protein identification technology for shotgun proteomics. Anal Chem 73, 5683-90 (2001).
48. Wu, B. et al. Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics 19, 1636-43 (2003).
49. Zhou, G. et al. 2D differential in-gel electrophoresis for the identification of esophageal scans cell cancer-specific protein markers. Mol Cell Proteomics 1, 117-24 (2002).
Chapter 3: mRNA Expression and protein-protein interactions

3.1 Relating whole-genome expression data with protein-protein interactions

Abstract

I investigate the relationship of protein-protein interactions with mRNA expression levels, by integrating a variety of data sources for yeast. I focus on known protein complexes (from the MIPS catalog) that have clearly defined interactions between their subunits. I find that subunits of the same protein complex show significant co-expression, both in terms of similarities of absolute mRNA levels and expression profiles; e.g., I can often see subunits of a complex having correlated patterns of expression over a time-course. I classify the yeast protein complexes as either permanent or transient, with permanent ones being maintained through most cellular conditions. I find that, generally, permanent complexes, such as the ribosome and proteasome, have a particularly strong relationship with expression, while transient ones do not. However, I note that several transient complexes, such as the RNA polymerase II holoenzyme and the replication complex, can be subdivided into smaller permanent ones, which do have a strong relationship to gene expression. I also investigated the interactions in aggregated, genome-wide datasets, such as the comprehensive yeast two-hybrid experiments, and found them to have only a weak relationship with gene expression, similar to that of transient complexes. (Further details on genecensus.org/expression/interactions and bioinfo.mbb.yale.edu/expression/interactions.)

Introduction

Analysis of gene expression data is currently one of the most exciting areas in genomics.
Computationally, it involves clustering and grouping individual expression measurements and interrelating them to other sources of information, such as phenotypes, functional classifications, or cellular responses (Brown and Botstein, 1999, Califano et al., 2000, Gaasterland and Bekiranov, 2000, Golub et al., 1999, Raychaudhuri et al., 2001, Subrahmanyam et al., 2001). In particular, functional assignment of uncharacterized genes can take place through transferring the annotation from a characterized gene (gathered from databases such as MIPS or GO (Ashburner et al., 2000, Mewes et al., 2000)) to an uncharacterized gene when their expression profiles are strongly related by a similarity criterion (such as the correlation coefficient). While this procedure is usually not sufficient to unambiguously determine the function of an uncharacterized gene, it can be the starting point (e.g. in target selection) for further genetic experiments, functional characterization, or high-throughput proteomic analysis (Christendat et al., 2000, Christendat et al., 2000, Eisenberg et al., 2000, Emili and Cagney, 2000, Gerstein and Jansen, 2000, Luscombe et al., 1998, Westhead et al., 1999).

An important component of functional annotation is characterizing protein interactions as these often circumscribe (or effectively define) protein function. Moreover, protein interactions can often be described more precisely than protein functions. Thus, rather than directly dealing with the general relationship between protein function and expression, I look here at a sub-problem: the relationship between mRNA expression and protein-protein interactions, especially those in protein complexes.
A priori it seems reasonable that there should be a well-defined relationship between the expression levels of the subunits in a complex: since the functionality of many complexes hinges on the presence of all the subunits, a haphazard and independent expression of any one subunit would be energetically costly. For instance, the components of the ribosome are regulated in a complex way but there is usually agreement that they should be present in equimolar amounts, although this has not yet been measured directly (Li et al., 1999, Nomura, 1999, Planta, 1997, Woolford and Warner, 1991).

I investigate this relationship for many of the known protein complexes in a comprehensive, global fashion by interrelating many of the yeast datasets for protein interactions and expression. The diversity and number of yeast experiments provide high-quality data under varied conditions. Additionally, I investigate the relationship between other types of protein-protein interactions (e.g. aggregated physical and genetic interactions) and mRNA expression. My work follows up on many recent analyses of protein-protein interactions (Fellenberg et al., 2000, Hishigaki et al., 2001, Teichmann et al., 2001, Walhout and Vidal, 2001).

In general, my goal was to integrate and cross-correlate already existing data from different sources and find general trends in it. This is an exploratory study prior to any type of prediction. In a sense, this study can be understood as an exploration of the knowledge already implicit in the current data but not yet obvious because it has not previously been integrated and put together in this way.
Results

In my survey of existing data, I have used two different approaches to analyze the two different types of expression data available: the computation of normalized differences for absolute expression levels and a more standard analysis of the correlation of profiles of relative expression levels (expression ratios). I explain these two approaches in more detail in the following two sections.

Calculation of normalized differences between absolute expression levels

In order to compare absolute mRNA expression levels between subunits of a protein complex, I define the normalized difference Dij as follows:

$$D_{ij} = \frac{|E_i - E_j|}{E_i + E_j} \qquad [1]$$

where Ei and Ej are the mRNA expression levels of subunits i and j. This quantity defines the difference as a fraction of the sum of the expression levels, thus allowing for a comparison of gene pairs of both high and low expression. Values for the normalized difference range from 0 to 1. For a group of N proteins in a complex I generally compute the normalized difference not only for the pairs that are in direct physical contact, but for all (N^2 - N)/2 theoretically possible pairs, thus arriving at a distribution of normalized differences of these pairs for each complex. I can then investigate this distribution of normalized differences and compare it with those among randomly chosen proteins. In the following discussion I often refer to the median of the normalized differences over the (N^2 - N)/2 protein pairs as a key summarizing statistic.

In general, I assume stoichiometric ratios of 1:1 between subunits, although equation [1] could be adjusted to account for other ratios. But even then, as shown in the Methods section below, I would not expect this quantity to always be close to zero due to the relationship between mRNA and protein and also the noise in the expression data.
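The normalized-difference calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it takes equation [1] to be the absolute difference of two expression levels divided by their sum (consistent with the stated range of 0 to 1), it requires positive expression levels, and the function names, subunit labels, and values are hypothetical.

```python
from itertools import combinations
from statistics import median


def normalized_difference(e_i, e_j):
    """|Ei - Ej| / (Ei + Ej) for two positive expression levels."""
    if e_i + e_j <= 0:
        raise ValueError("expression levels must have a positive sum")
    return abs(e_i - e_j) / (e_i + e_j)


def complex_normalized_differences(levels):
    """Normalized differences for all (N^2 - N)/2 subunit pairs.

    `levels` maps a subunit name to its absolute mRNA expression level.
    """
    return [
        normalized_difference(a, b)
        for a, b in combinations(levels.values(), 2)
    ]


# Hypothetical three-subunit complex with made-up expression levels.
toy_complex = {"subunit1": 10.0, "subunit2": 30.0, "subunit3": 20.0}
differences = complex_normalized_differences(toy_complex)
print(len(differences), median(differences))
```

The median of the returned list corresponds to the summarizing statistic used in the text; the same function applied to randomly drawn protein pairs would give the control distribution.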
It should also be noted that there are obviously many limitations in treating GeneChip and SAGE data as absolute measurements of mRNA expression (Schadt et al., 2000).

In order to judge the statistical significance of normalized differences for particular groups of proteins I compare them to the control distribution of randomly chosen protein pairs (see figure 1). An interesting theoretical aspect in this context is that if Ei and Ej are random variables with an exponential distribution (which is a close approximation to the actual distribution of expression levels in the reference expression set), then Dij is distributed uniformly between 0 and 1 (Pitman, 1993). This explains why I can observe a nearly uniform distribution of normalized differences for randomly selected pairs of proteins (see figure 1).

Correlation of expression profiles for relative expression levels

Analysis of expression profiles may be more useful than that of absolute levels for characterizing interacting proteins that exist in unequal but stoichiometrically related amounts (e.g., 3:1) as it refers to the relative shape of expression profiles. It can be carried out on data from cDNA microarrays (such as the Rosetta data) because only relative rather than absolute expression levels are necessary. Specifically, I look at the distribution of Pearson correlation coefficients for pairs of genes as the measure of similarity. (Other measures of similarity are possible as well (D'haeseleer, 1997, Heyer et al., 1999, Qian et al., 2001, Weaver et al., 1997).)

As the input for my procedure I use the expression vectors or profiles of all the subunits of a complex and then compute their pair-wise correlations. As for the normalized difference, I compute the correlation coefficients for all protein pairs in a complex, thus gaining a distribution of correlation coefficients.
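The pair-wise correlation step described above can be sketched as follows. This is a minimal Python illustration, not the original implementation: the function names and the toy profiles are hypothetical, and the Pearson coefficient is computed directly from its definition so that the block needs only the standard library. Profiles with zero variance are not handled and would raise a division error.

```python
import math
from itertools import combinations


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def pairwise_correlations(profiles):
    """Correlation for every pair of subunits in a complex.

    `profiles` maps a subunit name to its expression profile
    (one value per time point or condition).
    """
    return {
        (a, b): pearson_r(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }


# Hypothetical profiles: two co-varying subunits and one opposing subunit.
toy_profiles = {
    "subunit1": [1.0, 2.0, 3.0, 4.0],
    "subunit2": [2.0, 4.0, 6.0, 8.0],
    "subunit3": [4.0, 3.0, 2.0, 1.0],
}
correlations = pairwise_correlations(toy_profiles)
mean_correlation = sum(correlations.values()) / len(correlations)
print(correlations, mean_correlation)
```

Because the Pearson coefficient ignores overall scale, subunit2 correlates perfectly with subunit1 despite being expressed at twice the level, which is the property that makes profiles suitable for subunits present in unequal stoichiometric amounts.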
If the complex consists of N subunits, this yields (N^2 - N)/2 different combinations of protein pairs and thus correlation coefficients. To summarize these distributions, I calculate the “average correlation” (by which I mean the average of all pair-wise correlations within a complex). As a suitable control to assess statistical significance, I use the distributions of correlation coefficients for random groups of proteins and their averages (see Methods). I would expect correlations of close to 1 for subunits in a tight complex. However, as I show in the Methods section, this will not be exactly the case due to the relationship between mRNA and protein abundances.

Specific complexes

I first outline some results obtained for specific protein complexes, then I proceed to a more general overview of complexes.

Ribosome

It has long been known that the mRNA expression levels of the ribosomal proteins are strongly correlated with one another (Johannes et al., 1999). Figure 1 shows the observed distribution of normalized differences for protein pairs in the large subunit of the cytoplasmic ribosome. The median of this distribution is 0.23, much lower than the median of 0.5 for randomly selected protein pairs. While there is a wide range of normalized differences (which may partially result from the fact that many proteins in the ribosome are known not to be expressed in a 1:1 ratio (Kruiswijk et al., 1978)), the ribosomal distribution is clearly skewed towards zero.

Distributions of the correlation coefficients for protein pairs within the large ribosomal subunit are shown in figure 2. For both the cell cycle and the Rosetta data the correlations tend to be much higher than the random control. Similar observations can be made for the proteins in the small subunit of the cytoplasmic ribosome. Key statistics are summarized in figure 3 in comparison to those for other protein complexes.
Furthermore, the two separate ribosome particles are strongly co-regulated. In fact, the large and the small ribosomal particles cannot be differentiated by my measures of expression similarity.

Proteasome

A second example of a complex whose individual subunits are strongly co-regulated is the proteasome, which is involved in protein degradation and responsible for the rapid breakdown of ubiquitinated proteins. Like the ribosome, the 26S proteasome can be divided into two sub-particles: the 20S and the 19S (or 19S/22S regulatory particle). The 20S particle is present as a dimer in the center of the complex structure and contains the catalytic core, whereas two 19S particles are attached to both ends of the 20S particle dimer (Coux et al., 1996, Wilkinson et al., 1999).

The distribution of the normalized differences for all possible protein pairs in the 20S proteasome is shown in figure 1. Like the ribosome, it is clearly skewed towards zero, compared to the control, with a median of 0.29. Figure 2 shows the distribution of correlation coefficients, which is strongly shifted to the right of the control, though to a lesser extent than that for the ribosome. An investigation of the crystal structure of the 20S particle (Whitby et al., 2000) did not reveal any relationship with the gene expression differences (e.g. proteins with slightly more random correlations tending to be more on the surface of the particle).

Similar results can be observed for the 19S particle of the proteasome (figure 3A). Also, in terms of both measures of co-expression (normalized differences and correlation of expression profiles) the 19S and the 20S particles of the proteasome form a single unit that is difficult to separate by gene expression analysis.
Part of the reason for this may be that the common classification into 19S and 20S particles is based on the purification procedure for the proteasome (Hochstrasser, 2001) and thus does not necessarily reflect functional or biochemical properties in a direct way. One subunit, Doa4p, exhibits a very low average correlation (-0.02). Biochemical studies have previously shown that not all proteasomes have Doa4p bound and that the Doa4p-proteasome interaction is more likely to be transitory (Papa et al., 1999, Papa and Hochstrasser, 1993).

RNA polymerase II holoenzyme

I have shown above that the ribosome and proteasome can be regarded as strongly associated and co-regulated multi-particle complexes. However, in some cases a complex contains more loosely associated components. An example is the RNA polymerase II holoenzyme, which contains the core RNA polymerase II together with the more loosely associated SRB complex (Kornberg's mediator) and other smaller components (such as the SWI/SNF complex and the TAFIIs). It is known that, unlike the RNA polymerase II core enzyme, the SRB complex and the other holoenzyme components are only needed for the transcription of a fraction of genes (Holstege et al., 1998). In other words, the holoenzyme is an example of a complex of transitory nature with a permanent core.

This permanent-and-transitory structure is clearly evident in the gene expression analysis. For the core enzyme, the average correlation in both the cell cycle and Rosetta data sets is significantly higher than for the random control (Figure 3). However, for the SRB complex and a variety of other, smaller components (e.g. the TAFIIs) the average correlations are virtually indistinguishable from the random control.

Replication complex

Another example of a transient complex is the replication complex, which binds to DNA and is needed for the initiation of replication.
The replication complex can be subdivided into a number of sub-components: the MCM proteins, the origin recognition complex and the two DNA polymerases (Aparicio et al., 1997). As a whole, the replication complex exhibits a low average correlation not significantly different from that of the random control (figures 3 and 4). However, figure 4 shows how the entire complex breaks into subcomponents in terms of correlations in the cell-cycle experiment. The individual correlations for each of the subcomponents are much higher than that of the complex as a whole. This indicates that the replication complex is composed of independent units in terms of expression regulation. Using the permanent-transient terminology, each subcomponent behaves similarly to an independent permanent complex, whereas the replication complex as a whole can be characterized as transient. The permanent sub-components can be seen to come together to form a transient functional entity. (Note that this effect is more evident in the cell cycle experiment than in the Rosetta data, as it should only be observable in a synchronized population of cells, not in one averaged across the cell cycle.)

Complexes in general: permanent vs. transient

In discussing the specific examples above, I have found the permanent or transient nature of the association to be an important feature. This distinction is, in fact, valuable in a more general context. As shown in figure 3, I have a priori formalized a division between "permanent" complexes, which are maintained throughout the cell cycle and most cellular conditions, and "transient" ones, which I define here as groups of proteins that do not consistently maintain their interactions. That is, the existence of a transient complex is temporary and specific to a part of the cell cycle or a subset of cellular states.
I am aware that the division into the two absolute categories "permanent" and "transient" is somewhat of an oversimplification, as there can be varying degrees and combinations of these attributes (see Discussion). In figure 3, I show a general classification of the large MIPS complexes into permanent and transient classes, together with key statistics (details of the classification method are given in the caption). I list all complexes with more than 10 subunits (which together account for ~80% of all the protein-protein interactions in the MIPS complexes), with smaller complexes listed on my website. Figure 3B shows a graphical representation of the complex list, synthesizing the correlations for both the Rosetta and cell-cycle experiments with the normalized differences. It clearly shows that permanent complexes tend to have higher average correlations than transient ones. Comparing the average correlations in Figure 3A against random controls makes it possible to derive P-values for the statistical significance of the correlation. As shown in the figure, these are less than $10^{-4}$ for most of the permanent complexes. On the other hand, they are considerably higher, and thus less significant, for transient complexes. The separation between permanent and transient complexes is also evident in terms of the normalized difference statistics, although not as strongly.

Aggregated protein-protein interaction sets

From my analysis above it seems reasonable to conclude that there is indeed a strong relationship between mRNA expression and the protein-protein interactions in “permanent” complexes. This raises the question whether similar observations can be made for other types of protein-protein interactions. I briefly summarize here the degree to which the interactions in the aggregated interaction datasets, such as the yeast two-hybrid, are related to expression.
Figure 1 shows the distribution of normalized differences and figure 2 the distributions of correlation coefficients between interacting proteins in the aggregated data sets. The distributions of normalized differences are relatively similar to those of the transient protein complexes. The physical interactions show the smallest median normalized difference while the yeast two-hybrid interactions have a median normalized difference closest to the random control (~0.5). Figure 2 shows that the correlation distributions for the aggregated data sets are fairly similar among themselves and only slightly shifted towards the right of the distribution curve for random protein pairs. This, again, is very similar to the behavior of transient protein complexes. Thus, overall, it seems fair to conclude that the aggregated protein-protein interactions are related to mRNA expression in a similar fashion as the transient protein complexes.

Discussion and conclusion

I have investigated the relationship of protein-protein interactions and mRNA expression levels, integrating and surveying a variety of data sources for yeast. I have focused my investigation on the protein interactions within specific complexes. While I have demonstrated a strong relationship between expression data and most permanent protein complexes, this relationship is much weaker for transient protein complexes as well as for the aggregated sets of protein-protein interactions (i.e. physical, genetic and yeast two-hybrid interactions).

Issues with permanent-transient classification

My complex classification scheme -- separating most complexes into either permanent or transient -- while useful cannot account for all complexes in the MIPS database. Some complexes may not clearly fit into the permanent-transient classification. I list a few of these as "other" in figure 3.
Moreover, the complexes list is a compilation of current biochemical knowledge and therefore reflects its inherent limitations (sometimes not all subunits are known or some proteins are mistakenly assigned to a complex). Of course, even for the complexes that I do classify, the terms "transient" and "permanent" are somewhat of an over-simplification. In particular, my detailed discussions of the RNA polymerase II holoenzyme and the replication complex above are precisely two examples where my simplified terminology fails to completely explain the situation, since these complexes are somewhere between fully "transient" and "permanent". One can think about the distinction between permanent and transient in terms of the mathematical model introduced in the Methods section. Whenever a complex is formed, its subunits tend to be expressed at equimolar protein concentrations: $P_i = P_j$ and $dP_i/dt = dP_j/dt$ (where $P_i$ and $P_j$ are the protein concentrations of two subunits i and j). If the complex is "permanent", then these conditions should be at least approximately met. If the complex is "transient", then these conditions can be relaxed in those situations where the complex is not formed. Some complexes are always formed ("permanent"), whereas the "transient" complexes are only formed under particular conditions. There can be different degrees of being transient: for instance, complexes that are formed under 80% of conditions or those that are formed under 20% of conditions. The transient complex formed under 80% of conditions behaves almost like a "permanent" one (i.e., 100% of conditions), whereas the transient complex formed only 20% of the time would be expected to show less significant normalized differences and correlations.
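The idea of "degrees of being transient" can be illustrated with a toy simulation. The sketch below is not part of the original analysis: it assumes, purely for illustration, that two subunits share an identical expression signal in the conditions where the complex is formed and vary independently otherwise. The function name simulated_correlation and all parameter values are hypothetical.

```python
import numpy as np

def simulated_correlation(fraction_formed, n_conditions=20000, seed=0):
    """Toy model: correlation of two subunit profiles when the complex is
    formed in only a fraction of the sampled conditions (an assumption)."""
    rng = np.random.default_rng(seed)
    # True where the complex is formed in a given condition.
    formed = rng.random(n_conditions) < fraction_formed
    # Shared signal used by both subunits whenever the complex is formed.
    shared = rng.normal(size=n_conditions)
    # Independent signals used when the complex is not formed.
    x = np.where(formed, shared, rng.normal(size=n_conditions))
    y = np.where(formed, shared, rng.normal(size=n_conditions))
    return np.corrcoef(x, y)[0, 1]

corr_80 = simulated_correlation(0.8)  # complex formed under 80% of conditions
corr_20 = simulated_correlation(0.2)  # complex formed under 20% of conditions
```

Under these assumptions the expected correlation is roughly equal to the fraction of conditions in which the complex is formed, so corr_80 comes out well above corr_20. Real expression data would add noise and differing rate constants on top of this.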
If one goes as far as to accept the premise that the subunits in a complex should be present at equimolar amounts, then it is perhaps circular reasoning to say that they should also be co-expressed.

Complexes versus the aggregated interactions: the need for structures

I found it difficult to discern expression-based relationships in the aggregated data sets. This may be due to the generalized and heterogeneous nature of the aggregated data sets (e.g. inconsistent physiological conditions, false positives and false negatives). Moreover, both the aggregated sets and the transient complexes suffer partially from the limited amount of mRNA expression data, as their interactions may occur under particular physiological conditions that are not sampled by the mRNA expression data. My results thus illustrate the difficulty in drawing general conclusions for the pair-wise interaction sets and highlight the important role that clearly resolved crystal structures of complexes, which detail the interactions between subunits, play in studying protein-protein interactions.

Noise in the expression and interaction data

In general, the interactions in the aggregated datasets exhibited surprisingly little deviation from randomness in terms of the co-expression of interaction pairs. This was most strongly observed for the yeast two-hybrid data. It is true that, overall, this deviation from randomness is statistically significant. All the same, the gene expression data and the aggregated protein interaction data do not reinforce each other strongly, and it seems that predicting this type of interaction from expression data would be of little benefit.
Perhaps the most optimistic view of this situation is that the strong degree of independence of the two types of data makes both of them suitable for use in machine-learning approaches to characterize genes of unknown function: if they were strongly correlated, then one type of data could perhaps replace the other, since it would represent very similar information. A negative view would be that the reason for the surprisingly weak relationship between the aggregated interactions and mRNA expression is to be found in problems with either the expression or the interaction data. I feel confident that my results are robust to the noise in the expression data for the following reasons. With respect to the correlation analysis of expression profiles, roughly the same results (in terms of statistical significance) can be obtained for two independent data sets (the cell-cycle timecourse and the Rosetta knockout series). The normalized difference analysis is perhaps more sensitive to problems with the data, in particular considering that the measurement of absolute expression levels with gene chips is problematic to start with. However, I have looked at an integrated dataset from various chip experiments and the SAGE data, thus averaging out errors to some degree (see Methods). In addition, for both the correlation and the normalized difference analysis, I have concentrated on the statistical significance of distributions rather than relying on the error-prone data for individual protein pairs, thus observing more robust, aggregate trends for whole complexes and groups of proteins. Part of the aggregated data, in particular the yeast two-hybrid data, represents a relatively new approach to studying protein-protein interactions, and it is interesting to note that it, obviously, includes some interactions implied by the complexes.
However, the degree of intersection with the possible interactions within complexes ranges from 35% for the physical interactions to only approximately 6% for the yeast two-hybrid data (as a fraction of the number of interactions in the aggregated datasets). This is surprisingly low, given that the yeast two-hybrid data is from experiments that covered the complete genome (Ito et al., 2001; Uetz and Hughes, 2000). Independently, Ito et al. (2001) have reported that only a small fraction of the previous yeast two-hybrid data (Uetz and Hughes, 2000) overlapped with their own yeast two-hybrid results. (Although Ito and colleagues assumed that their core data was similar in quality to the Uetz data, the fraction of interactions present in both datasets was only 16.8% for the Ito core and 20.4% for the Uetz data.)

mRNA vs. protein expression

The co-regulation of subunits in a protein complex should be primarily observable in terms of protein abundance and only indirectly in terms of mRNA expression. Several recent studies have attempted to investigate the relationship between mRNA and protein expression levels in yeast cells and found them to be correlated to various degrees (Anderson and Seilhamer, 1997; Futcher et al., 1999; Greenbaum et al., 2002; Gygi et al., 1999; Lian et al., 2001). Generally, post-transcriptional regulation is more difficult to investigate given the sparse data resources currently available for protein abundance levels. It is possible that in some situations co-regulation occurs mostly on the protein level, almost independent of cellular mRNA levels. In particular, those permanent complexes that do not have high levels of correlation in my analysis may be indicative of translational or post-translational control and could be a starting point for further experimental investigation. See also the Methods section for further discussion.
(Additional information can be found at genecensus.org/expression/interactions and bioinfo.mbb.yale.edu/expression/interactions.)

Methods

Interactions data sources

The primary focus of this paper is the interactions occurring within specific complexes. These were obtained from the MIPS complexes catalog (Fellenberg et al., 2000), which represents a carefully annotated, comprehensive dataset of protein complexes culled from the scientific literature. In addition, I looked at other types of protein-protein interactions from large "aggregated" datasets collecting many heterogeneous pair-wise interactions. I collected these from the MIPS catalogs of physical and genetic interactions (Fellenberg et al., 2000), databases of interacting proteins (DIP and BIND) (Bader and Hogue, 2000; Xenarios et al., 2000), and a comprehensive collection of yeast two-hybrid experiments (Y2H) (Cagney et al., 2000; Ito et al., 2001; Ito et al., 2000; Schwikowski et al., 2000; Uetz et al., 2000; Uetz and Hughes, 2000). These interactions are subdivided into groups based on their method of discovery. They include physical interactions (e.g., collected through co-immunoprecipitation and co-purification), genetic interactions (e.g., determined through genetic means such as synthetic lethality or suppression experiments), and yeast two-hybrid pairs.

Expression data sources

I included two different types of expression measurements in my analysis: absolute expression levels in vegetative yeast cells as determined by SAGE or gene chip experiments, and profiles of ratio-type expression data from microarray experiments.
For the first type, I use a comprehensive reference set, which I merged and scaled together from a variety of Affymetrix GeneChip and SAGE datasets (Holstege et al., 1998; Jelinsky and Samson, 1999; Roth et al., 1998; Velculescu et al., 1997) into a single representative data source (scaling details on my website; Greenbaum et al., 2002). For the expression profiles, I focused on two different datasets: a cell cycle experiment (Cho et al., 1998) and the Rosetta yeast compendium (Hughes et al., 2000). The two datasets provide a fairly good sampling of the possible cellular states of yeast and represent different experimental methodologies. The cell-cycle data contains expression profiles obtained from synchronized cells over the course of two cell cycles, whereas the Rosetta data contains genome-wide expression ratios for 300 stationary cell states, which are derived from 280 gene deletions and 20 drug interaction experiments.

Efficient calculation of the average correlations

For two expression ratio profiles $X_i$ and $X_j$ (transformed to average 0 and standard deviation 1), the Pearson correlation coefficient $\rho_{ij}$ is given by the dot product:

$$\rho_{ij} = \frac{1}{M-1}\, X_i \cdot X_j ,$$

where M is the number of elements in the profiles $X_i$ and $X_j$. The profile X can be computed as a 'Z-score' from the measured expression ratio profile x, through the relation

$$X_k = \frac{x_k - \bar{x}}{\sigma_x} ,$$

where $\bar{x}$ denotes the average and $\sigma_x$ the standard deviation of the values in x, and $X_k$ and $x_k$ are the kth components of their respective profiles. Given a group of N genes I can compute the correlation coefficient matrix R, where each element $R_{ij} = \rho_{ij}$ of the matrix denotes the Pearson correlation coefficient between genes i and j. I can then compute the average correlation coefficient $\bar{\rho}_N$ by averaging the matrix elements (excluding the main diagonal).
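As a minimal sketch of the quantities just defined, the following Python/NumPy code Z-scores a set of profiles, builds the correlation matrix from dot products, and averages its off-diagonal elements. It also checks a shortcut that sums the Z-scored profiles first and so avoids forming the N x N matrix. The function names and the random test data are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def zscore_profiles(x):
    """Z-score each row (one gene profile): mean 0, sample standard deviation 1."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, ddof=1, keepdims=True)  # sample std matches the 1/(M-1) factor
    return (x - mean) / std

def average_correlation_naive(x):
    """Average of all off-diagonal Pearson correlations, via the full N x N matrix."""
    X = zscore_profiles(x)
    n, m = X.shape
    R = X @ X.T / (m - 1)  # R[i, j] = X_i . X_j / (M - 1)
    return (R.sum() - np.trace(R)) / (n * n - n)

def average_correlation_fast(x):
    """Same average, computed from the summed profile without building R."""
    X = zscore_profiles(x)
    n, m = X.shape
    X_T = X.sum(axis=0)  # sum of all N Z-scored profiles
    return (X_T @ X_T / (m - 1) - n) / (n * n - n)

# Hypothetical data: 8 genes measured under 17 conditions.
rng = np.random.default_rng(0)
profiles = rng.normal(size=(8, 17))
naive = average_correlation_naive(profiles)
fast = average_correlation_fast(profiles)
```

The two functions agree up to floating-point rounding. The second one costs time proportional to N times M rather than N squared times M, which matters when many random groups are sampled to build a control distribution.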
This statistic gives an idea of the overall similarity of the expression profiles in a group of genes. Although there are $O(N^2)$ elements in R, the computation time for $\bar{\rho}_N$ can be kept proportional to $O(N)$ by using the linearity of the correlation to calculate $\bar{\rho}_N$ as follows:

$$\bar{\rho}_N = \frac{1}{N^2 - N}\left(\sum_{i,j}^{N} R_{ij} - N\right) = \frac{1}{N^2 - N}\left(\frac{1}{M-1}\, X_T \cdot X_T - N\right),$$

where $X_T = \sum_{n=1}^{N} X_n$ is the sum of all expression profiles in the group of N genes.

Kinetic model of the relationship between protein and mRNA concentration

For a protein complex that is perfectly co-regulated I can assume that its components are present at equimolar amounts and change similarly over time. So for the protein concentrations $P_i$ and $P_j$ of two different subunits i and j I would get: $P_i = P_j$ and $dP_i/dt = dP_j/dt$. Using a simple model for the relationship between mRNA and protein concentrations, I can see how even under these ideal conditions similarity measures based on the mRNA concentrations would deviate from perfect results. For instance, a linear kinetic model for the protein concentration $P_i$ and the mRNA concentration $R_i$ of a subunit i in a complex is given by:

$$\frac{dP_i}{dt} = k_{Ri} R_i - k_{Pi} P_i ,$$

where $k_{Ri}$ is an mRNA translation rate constant and $k_{Pi}$ is a protein degradation constant.

Why expression profile correlations have to be less than one

For two subunits in a complex with $P_i = P_j = P$ and $dP_i/dt = dP_j/dt$, I can deduce:

$$k_{Ri} R_i(t) - k_{Rj} R_j(t) = (k_{Pi} - k_{Pj})\, P(t) .$$

It is clear that only under the strong assumption that the two protein degradation constants are equal ($k_{Pi} = k_{Pj}$) does one obtain

$$\frac{R_i(t)}{R_j(t)} = \frac{k_{Rj}}{k_{Ri}} = \text{const} ,$$

from which would follow corr($R_i$, $R_j$) = 1. Otherwise, corr($R_i$, $R_j$) < 1.
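The argument above can be checked numerically. The sketch below assumes a hypothetical shared protein time course P(t) for two subunits, solves the linear kinetic model for the mRNA level of each subunit, and compares the mRNA correlations for equal and unequal degradation constants. The time course and all rate constants are made-up illustrative values.

```python
import numpy as np

def mrna_from_protein(P, dPdt, k_R, k_P):
    """Solve dP/dt = k_R * R - k_P * P for the mRNA level R."""
    return (dPdt + k_P * P) / k_R

# Hypothetical protein time course shared by both subunits (two full periods).
t = np.linspace(0.0, 4.0 * np.pi, 400, endpoint=False)
P = 10.0 + np.sin(t)
dPdt = np.cos(t)

# Case 1: equal degradation constants, different translation constants.
R_i = mrna_from_protein(P, dPdt, k_R=1.0, k_P=0.5)
R_j = mrna_from_protein(P, dPdt, k_R=3.0, k_P=0.5)
corr_equal = np.corrcoef(R_i, R_j)[0, 1]

# Case 2: the second subunit has a different degradation constant.
R_j_diff = mrna_from_protein(P, dPdt, k_R=3.0, k_P=5.0)
corr_diff = np.corrcoef(R_i, R_j_diff)[0, 1]
```

With equal degradation constants the two mRNA profiles are proportional, so corr_equal is 1 up to rounding. With unequal constants the profiles are no longer proportional and corr_diff falls below 1, even though the protein concentrations are identical at all times.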
Why normalized differences are greater than zero

Furthermore, assuming steady-state (that is, $dP_i/dt = dP_j/dt = 0$), I can deduce the following relationship between the mRNA levels of two complex subunits:

$$\frac{R_i}{R_j} = \frac{k_{Rj}\, k_{Pi}}{k_{Pj}\, k_{Ri}} .$$

Thus, the two mRNA expression levels are only expected to be equal if the ratios of the rate constants for translation and degradation are the same for both proteins. This is not necessarily the case for the subunits of a complex and therefore normalized differences should not be expected to be zero. It is clear that the arguments above are based on a variety of simplifying assumptions. In reality, there are additional factors (such as the noise in the expression data and the stochastic nature of gene expression) that add even more difficulty to the analysis of mRNA levels.

Acknowledgments

MG acknowledges support by the Keck Foundation. RJ is supported by an IBM PhD Fellowship. The authors wish to thank Mark Hochstrasser and Jiang Qian for stimulating discussions.

Figures

3.1 Relating whole-genome expression data with protein-protein interactions

Figure 1 Distributions of normalized differences for various groups of proteins in boxplot representation. The normalized difference $D_{ij}$ is a measure of the relative similarity of two absolute gene expression levels $E_i$ and $E_j$. The middle panel shows the distribution for two protein complexes (the large ribosomal subunit and the 20S proteasome).
Note that I considered all theoretically possible protein pairs within the protein complex (as indicated in the schematic drawing above the panel). The right panel shows the distribution for the aggregated datasets of protein-protein interactions (Y2H is yeast two-hybrid) (Bader and Hogue, 2000; Cagney et al., 2000; Fellenberg et al., 2000; Ito et al., 2001; Ito et al., 2000; Schwikowski et al., 2000; Uetz et al., 2000; Uetz and Hughes, 2000; Xenarios et al., 2000). Unlike in the complexes, where I consider interactions among a whole group of proteins, the interactions in the aggregated datasets are specific to individual protein pairs (see schematic drawing). The left panel shows two control distributions of the normalized difference, on the left for pairs of nuclear and cytoplasmic proteins -- which presumably, because of spatial separation, do not interact -- and on the right for any random protein pair ("all transcripts") in yeast. The distribution of nuclear versus cytoplasmic proteins is strongly skewed towards one (the maximum value of the normalized difference), which is partially explained by the fact that cytoplasmic proteins tend to have higher expression levels than nuclear ones (Drawid and Gerstein, 2000). The distribution of all transcripts is nearly uniform (with a median of 0.5) -- see Methods. The complexes distributions are clearly skewed towards zero with medians between 0.2 and 0.3. The medians of the distributions of the aggregated datasets are still somewhat smaller than the control median, most notably for the physical interactions dataset; on the other hand, there is virtually no difference between the control and the distribution of the yeast two-hybrid dataset. The aggregated data, obviously, includes some interactions implied by the complexes, with the degree of intersection ranging from 35% for the physical interactions to approximately 6% for Y2H.
Figure 2 Distributions of correlation coefficients between expression profiles. In part a I show distributions of the average correlation $\bar{\rho}_N$ of N genes for the cell cycle experiments. The gray curve in the background represents the case N = 2 (i.e., simply the distribution of pair-wise correlations). In the case of N > 2, $\bar{\rho}_N$ is defined as the average of all possible $(N^2-N)/2$ pairwise correlations among the N genes. I show here, as examples, the distributions for N = 3 and N = 5. The distributions become narrower as N increases, reflecting the fact that it becomes less likely to find large groups of strongly correlated genes at random. These distributions provide a suitable control for the observed correlations between pairs of genes (N = 2) or for the average correlations among the subunits of a complex (N > 2). Ronald Jansen has developed a method to efficiently sample the distribution curves $f(\bar{\rho}_N)$ (see Methods). Based on the distribution function $f(\bar{\rho}_N)$ one can calculate a one-sided P-value:

$$P(\bar{\rho}_N) = \int_{\bar{\rho}_N}^{1} f(\bar{\rho}'_N)\, d\bar{\rho}'_N$$

This P-value then represents the chance that a group of N randomly selected genes could exhibit an average correlation greater than or equal to that of a complex with N proteins (see figure 3). Parts b and c show the distribution of pair-wise correlations for both the cell cycle and the Rosetta experiments in two protein complexes (the ribosome and the proteasome) as well as for the aggregated datasets (genetic, physical and Y2H). The gray curves in the background are the control distributions for N = 2 as explained above.
The distributions for the ribosome and the proteasome are strongly shifted to the right of the control; this effect is much weaker for the datasets of aggregated interactions.

Figure 3a Various key statistics shown in figures 1 and 2 for the ribosome and proteasome as well as for a large number of protein complexes. I list all protein complexes from the MIPS catalog having at least 10 ORFs. The complexes are divided into three classes: permanent, transient or "other" (see below). Some complexes can be divided into smaller sub-complexes (e.g., the ribosomes) as indicated. The table lists (from left to right) the average expression level of the complex, the median normalized difference (see figure 1A), the average correlation for the cell cycle and Rosetta experiments (see figure 2), the negative logarithm of the P-value of the average correlations in both experiments (see figure 2), and the size of the complex in terms of the number of ORFs. In general, the P-values for the average correlations are very low for most of the permanent protein complexes (accordingly, $-\log_{10}(P)$ is very high), indicating that these averages are significantly greater than for random groups of proteins of the same size. The same cannot be observed for the transient protein complexes, for which the correlation averages are usually much smaller. The section "other" at the bottom of part A contains complexes that are either difficult to classify as permanent/transient or for which, due to very low turnover rates, downregulations of mRNA levels take a very long time to affect protein abundance. The H+-transporting ATPase can be thought of as containing a mixture of permanent and transient components at the same time (Kane, 2001).
The nuclear pore complex (NPC) and the TRAPP complex are known to have low turnover rates (Barrowman et al., 2000; Bucci and Wente, 1997; Sacher, 2001; Winey et al., 1997). The NPC has relatively small average correlations, but this still yields P-values of $10^{-3}$ (cell cycle) and $<10^{-4}$ (Rosetta) because the nuclear pore complex is a relatively large aggregation of proteins, and even these weak average correlations are very unlikely to occur for random groups of proteins of this size. The TRAPP protein complex, while existing throughout the cell cycle, has a low turnover rate and as such its mRNA expression data would not be sufficient for my analysis. The RNA polymerase holoenzyme is composed of both permanent and transient components. Note that the MIPS complexes catalog does not include the SWI/SNF chromatin-remodeling complex and a subset of basal transcription factors (Wilson et al., 1996) as part of the holoenzyme, thus I list them separately here. The list does not include those categories from the MIPS complexes catalog that do not really represent protein complexes per se but rather aggregations of disparate proteins that are involved in similar types of complex interactions, such as the "actin-associated" and "tubulin-associated" protein groups.

Figure 3b Graphical representation of part of the protein complex statistics from part a. The abscissa and ordinate represent the average correlations in the cell cycle and the Rosetta data, while the bubble sizes are a function of the normalized differences (larger bubbles represent larger normalized differences). In general, the permanent complexes tend to be located in the upper right region of the plot, whereas transient complexes are closer to the random control in the lower left.
Figure 4 Representation of the replication complex and its components. Part a of the figure shows a representation of the replication complex and its components on the same coordinates as the protein complexes in figure 3B. The transient replication complex can be decomposed into smaller complexes: the origin recognition complex, the MCM proteins, and the two DNA polymerases. Whereas the whole replication complex exhibits an average correlation close to zero (in both the cell cycle and the Rosetta data), the four smaller complexes show greater correlations in the cell cycle experiment. The four sub-complexes behave more like permanent complexes than the replication complex as a whole. Part b shows the correlation coefficient matrix for the subunits of the replication complex derived from the cell cycle data. The upper triangle of the correlation matrix shows the individual correlation coefficients for particular gene pairs (with darker colors indicating higher correlations). The lower triangle shows the average correlations for subgroups of proteins (representing the MCM proteins, the two DNA polymerases, and the origin recognition complex) within the complex as a whole. The table on the right side shows which genes belong to which subgroups in different colors. The genes were ordered with unsupervised clustering (average linkage) without regard to their classification according to the three subgroups. It can be seen that this order reflects the separation according to the subgroups very well (only the proteins in the two DNA polymerases cannot be separated into two groups). An exception is the CDC45 protein, which belongs to the MCM proteins but tends to cluster with the DNA polymerases.

References

1.
Anderson, L. & Seilhamer, J. A comparison of selected mRNA and protein abundances in human liver. Electrophoresis 18, 533-7 (1997).
2. Aparicio, O. M., Weinstein, D. M. & Bell, S. P. Components and dynamics of DNA replication complexes in S. cerevisiae: redistribution of MCM proteins and Cdc45p during S phase. Cell 91, 59-69 (1997).
3. Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25, 25-9 (2000).
4. Bader, G. D. & Hogue, C. W. BIND--a data specification for storing and describing biomolecular interactions, molecular complexes and pathways. Bioinformatics 16, 465-77 (2000).
5. Barrowman, J., Sacher, M. & Ferro-Novick, S. TRAPP stably associates with the Golgi and is required for vesicle docking. Embo J 19, 862-9 (2000).
6. Brown, P. O. & Botstein, D. Exploring the new world of the genome with DNA microarrays. Nat Genet 21, 33-7 (1999).
7. Bucci, M. & Wente, S. R. In vivo dynamics of nuclear pore complexes in yeast. J Cell Biol 136, 1185-99 (1997).
8. Cagney, G., Uetz, P. & Fields, S. High-throughput screening for protein-protein interactions using two-hybrid assay. Methods Enzymol 328, 3-14 (2000).
9. Califano, A., Stolovitzky, G. & Tu, Y. Analysis of gene expression microarrays for phenotype classification. Proc Int Conf Intell Syst Mol Biol 8, 75-85 (2000).
10. Cho, R. J. et al. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell 2, 65-73 (1998).
11. Christendat, D. et al. Structural proteomics: prospects for high throughput sample preparation. Prog Biophys Mol Biol 73, 339-45 (2000).
12. Christendat, D. et al. Structural proteomics of an archaeon. Nat Struct Biol 7, 903-9 (2000).
13. Coux, O., Tanaka, K. & Goldberg, A. L. Structure and functions of the 20S and 26S proteasomes. Annu Rev Biochem 65, 801-47 (1996).
14. D'haeseleer, P., Wen, X., Fuhrman, S., Somogyi, R. in Plenum (ed. M.
Holcombe, P., R) 203-212 (1997).
15. Drawid, A. & Gerstein, M. A Bayesian system integrating expression data with sequence patterns for localizing proteins: comprehensive application to the yeast genome. J Mol Biol 301, 1059-75 (2000).
16. Eisenberg, D., Marcotte, E. M., Xenarios, I. & Yeates, T. O. Protein function in the post-genomic era. Nature 405, 823-6 (2000).
17. Emili, A. Q. & Cagney, G. Large-scale functional analysis using peptide or protein arrays. Nat Biotechnol 18, 393-7 (2000).
18. Fellenberg, M., Albermann, K., Zollner, A., Mewes, H. W. & Hani, J. Integrative analysis of protein interaction data. Proc Int Conf Intell Syst Mol Biol 8, 152-61 (2000).
19. Futcher, B., Latter, G. I., Monardo, P., McLaughlin, C. S. & Garrels, J. I. A sampling of the yeast proteome. Mol Cell Biol 19, 7357-68 (1999).
20. Gaasterland, T. & Bekiranov, S. Making the most of microarray data. Nat Genet 24, 204-6 (2000).
21. Gerstein, M. & Jansen, R. The current excitement in bioinformatics, analysis of whole-genome expression data: How does it relate to protein structure and function (In press). Current Opinions in Structural Biology (2000).
22. Golub, T. R. et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-7 (1999).
23. Greenbaum, D., Jansen, R. & Gerstein, M. Analysis of mRNA expression and protein abundance data: an approach for the comparison of the enrichment of features in the cellular population of proteins and transcripts. Bioinformatics 18, 585-96 (2002).
24. Gygi, S. P., Rochon, Y., Franza, B. R. & Aebersold, R. Correlation between protein and mRNA abundance in yeast. Mol Cell Biol 19, 1720-30 (1999).
25. Heyer, L. J., Kruglyak, S. & Yooseph, S. Exploring expression data: identification and analysis of coexpressed genes. Genome Res 9, 1106-15 (1999).
26. Hishigaki, H., Nakai, K., Ono, T., Tanigami, A. & Takagi, T.
Assessment of prediction accuracy of protein function from protein-protein interaction data. Yeast 18, 523-31 (2001).
27. Hochstrasser, M. (2001).
28. Holstege, F. C. et al. Dissecting the regulatory circuitry of a eukaryotic genome. Cell 95, 717-728 (1998).
29. Hughes, T. R. et al. Functional discovery via a compendium of expression profiles. Cell 102, 109-26 (2000).
30. Ito, T. et al. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci U S A 98, 4569-74 (2001).
31. Ito, T. et al. Toward a protein-protein interaction map of the budding yeast: A comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. Proc Natl Acad Sci 97, 1143-1147 (2000).
32. Jelinsky, S. A. & Samson, L. D. Global response of Saccharomyces cerevisiae to an alkylating agent. Proc Natl Acad Sci U S A 96, 1486-91 (1999).
33. Johannes, G., Carter, M. S., Eisen, M. B., Brown, P. O. & Sarnow, P. Identification of eukaryotic mRNAs that are translated at reduced cap binding complex eIF4F concentrations using a cDNA microarray. Proc Natl Acad Sci U S A 96, 13118-23 (1999).
34. Kane, P. (ed. Communication, P.) (2001).
35. Kruiswijk, T., Planta, R. J. & Mager, W. H. Quantitative analysis of the protein composition of yeast ribosomes. Eur J Biochem 83, 245-52 (1978).
36. Li, B., Nierras, C. R. & Warner, J. R. Transcriptional elements involved in the repression of ribosomal protein synthesis. Mol Cell Biol 19, 5393-404 (1999).
37. Lian, Z. et al. Genomic and proteomic analysis of the myeloid differentiation program. Blood 98, 513-24 (2001).
38. Luscombe, N. M. et al. New tools and resources for analysing protein structures and their interactions. Acta Crystallogr D Biol Crystallogr 54, 1132-8 (1998).
39. Mewes, H. W. et al. MIPS: a database for genomes and protein sequences. Nucleic Acids Res 28, 37-40 (2000).
40. Nomura, M. Regulation of ribosome biosynthesis in Escherichia coli and Saccharomyces cerevisiae: diversity and common principles. J Bacteriol 181, 6857-64 (1999).
41. Papa, F. R., Amerik, A. Y. & Hochstrasser, M. Interaction of the Doa4 deubiquitinating enzyme with the yeast 26S proteasome. Mol Biol Cell 10, 741-56 (1999).
42. Papa, F. R. & Hochstrasser, M. The yeast DOA4 gene encodes a deubiquitinating enzyme related to a product of the human tre-2 oncogene. Nature 366, 313-9 (1993).
43. Pitman, J. Probability (Springer-Verlag, New York, 1993).
44. Planta, R. J. Regulation of ribosome synthesis in yeast. Yeast 13, 1505-18 (1997).
45. Qian, J., Dolled-Filhart, M., Lin, J., Yu, H. & Gerstein, M. Beyond synexpression relationships: local clustering of time-shifted and inverted gene expression profiles identifies new, biologically relevant interactions. J Mol Biol 314, 1053-66 (2001).
46. Raychaudhuri, S., Sutphin, P. D., Chang, J. T. & Altman, R. B. Basic microarray analysis: grouping and feature reduction. Trends Biotechnol 19, 189-93 (2001).
47. Roth, F. P., Hughes, J. D., Estep, P. W. & Church, G. M. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat Biotechnol 16, 939-45 (1998).
48. Sacher, M. (ed. Communication, P.) (2001).
49. Schadt, E. E., Li, C., Su, C. & Wong, W. H. Analyzing high-density oligonucleotide gene expression array data. J Cell Biochem 80, 192-202 (2000).
50. Schwikowski, B., Uetz, P. & Fields, S. A network of protein-protein interactions in yeast. Nat Biotechnol 18, 1257-61 (2000).
51. Subrahmanyam, Y. V. et al. RNA expression patterns change dramatically in human neutrophils exposed to bacteria. Blood 97, 2457-68 (2001).
52. Teichmann, S. A., Murzin, A. G. & Chothia, C.
Determination of protein function, evolution and interactions by structural genomics. Curr Opin Struct Biol 11, 354-63 (2001).
53. Uetz, P. et al. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403, 623-7 (2000).
54. Uetz, P. & Hughes, R. E. Systematic and large-scale two-hybrid screens. Curr Opin Microbiol 3, 303-8 (2000).
55. Velculescu, V. E. et al. Characterization of the yeast transcriptome. Cell 88, 243-251 (1997).
56. Walhout, A. J. & Vidal, M. High-throughput yeast two-hybrid assays for large-scale protein interaction mapping. Methods 24, 297-306 (2001).
57. Weaver, P. L., Sun, C. & Chang, T. H. Dbp3p, a putative RNA helicase in Saccharomyces cerevisiae, is required for efficient pre-rRNA processing predominantly at site A3. Mol Cell Biol 17, 1354-65 (1997).
58. Westhead, D. R., Slidel, T. W., Flores, T. P. & Thornton, J. M. Protein structural topology: Automated analysis and diagrammatic representation. Protein Sci 8, 897-904 (1999).
59. Whitby, F. G. et al. Structural basis for the activation of 20S proteasomes by 11S regulators. Nature 408, 115-20 (2000).
60. Wilkinson, C. R., Penney, M., McGurk, G., Wallace, M. & Gordon, C. The 26S proteasome of the fission yeast Schizosaccharomyces pombe. Philos Trans R Soc Lond B Biol Sci 354, 1523-32 (1999).
61. Wilson, C. J. et al. RNA polymerase II holoenzyme contains SWI/SNF regulators involved in chromatin remodeling. Cell 84, 235-44 (1996).
62. Winey, M., Yarar, D., Giddings, T. H., Jr. & Mastronarde, D. N. Nuclear pore complex number and distribution throughout the Saccharomyces cerevisiae cell cycle by three-dimensional reconstruction from electron micrographs of nuclear envelopes. Mol Biol Cell 8, 2119-32 (1997).
63. Woolford, J. L. & Warner, J. R. in The Molecular and Cellular Biology of the Yeast Saccharomyces: Genome Dynamics, Protein Synthesis, and Energetics (eds. Broach, J.
R., Pringle, J. R. & Jones, E. W.) 587-626 (Cold Spring Harbor Laboratory Press, 1991).
64. Xenarios, I. et al. DIP: the Database of Interacting Proteins. Nucleic Acids Research 28, 289-291 (2000).

Appendix: Change in mRNA expression vs. change in protein abundance levels

Genomic and proteomic analysis of the myeloid differentiation program: global analysis of gene expression during induced differentiation in the MPRO cell line

Abstract

I have used an approach that combines 2-dimensional gel electrophoresis and mass spectrometry analysis with oligonucleotide chip hybridization for a comprehensive and quantitative study of the temporal patterns of protein and mRNA expression during myeloid development in the MPRO murine cell line. This global analysis detected 123 known proteins and 29 "new" proteins out of 220 protein spots identified by tandem mass spectroscopy, including proteins in 12 functional categories such as transcription factors and cytokines. Bioinformatic analysis of these proteins revealed clusters with functional importance to myeloid differentiation. Previous analyses have found that for a substantial number of genes the absolute amount of protein in the cell is not strongly correlated with the amount of mRNA. These conclusions were based on simultaneous measurement of mRNA and protein at just a single time point. Here, however, I am able to investigate the relationship between mRNA and protein in terms of simultaneous changes in their levels over multiple time points. This is the first time such a relationship has been studied, and I find that it gives a much stronger correlation, consistent with the hypothesis that a substantial proportion of protein change is a consequence of changed mRNA levels, rather than posttranscriptional effects. Cycloheximide inhibition also showed that most of the proteins detected by gel electrophoresis were relatively stable.
Specific investigation of transcription factor mRNA representation showed considerable similarity to that of mature human neutrophils and highlighted several transcription factors and other functional nuclear proteins whose mRNA levels change prominently during MPRO differentiation but which have not been investigated previously in the context of myeloid development. Data are available online at http://bioinfo.mbb.yale.edu/expression/myelopoiesis. (Blood. 2002;100:3209-3220)

Introduction

The study of myeloid differentiation provides important insights both into normal developmental processes that generate peripheral blood leukocytes and into abnormalities that lead to myeloid aplasia, dysplasia, and leukemia.1-9 Access to normal myeloid precursors at homogeneous stages of development and in quantities sufficient for biochemical analysis is not generally practicable, so information about myeloid differentiation has generally been obtained by studies of leukemic cells arrested at various developmental stages.10 Informative results have also come from studies of humans with genetic abnormalities affecting neutrophil accumulation11-13 and from gene targeting experiments, particularly of transcription factors.14 Overall, cell lines that can be induced to undergo myeloid differentiation in vitro continue to provide many of the most useful models for understanding this process.15

Human and murine hematopoietic precursor lines have been developed that can be induced to mature to various degrees toward adult neutrophils.8,16 Several of these lines fail to form a full complement of proteins or to fully undergo the morphologic changes characteristic of mature neutrophils, but the murine MPRO cell line provides a relatively favorable system for studying myeloid differentiation.8 The cells are arrested at the promyelocytic stage because of the presence of a dominant-negative retinoic acid receptor.
Differentiation can be induced by adding appropriate concentrations of all-trans retinoic acid (ATRA). On differentiation, most cells mature to the level of band forms and mature polymorphonuclear neutrophils and express secondary granule mRNAs and proteins.8

Current methods that provide broad surveys of the patterns of mRNA expression include oligonucleotide chip hybridization17 and 3' end restriction fragment gel display analysis18; both have been used to study MPRO cell development. Although the chemical heterogeneity of proteins prevents similar global methods of protein abundance analysis, recent improvements in 2-dimensional gel electrophoresis, especially the development of immobilized pH gradient isoelectric focusing gels, have made it possible to semiquantitatively examine the levels of a substantial fraction of the proteins of a cell.19 This approach, termed proteome analysis, has provided important contributions to disease-related gene discovery, developmental program analysis, and drug discovery. Interest in this area has been spurred by recent studies indicating a modest to poor correlation between transcriptional profiles and actual protein levels in cells. These studies make it clear that cellular protein analysis is complementary to genomic analysis and that no biologic program can be successfully analyzed without the incorporation of a proteomics platform.

Previously, I used oligonucleotide chips and gel displays to study the patterns of mRNA expression during MPRO cell differentiation and compared these with a very limited set of protein analyses from wide pH range 2-dimensional gel electrophoresis.20 I have expanded these studies to a more global analysis of a much wider array of mRNA and protein species. The current studies use higher resolution narrow-range 2-dimensional gel systems and tandem mass spectrometry to identify a substantial portion of the more abundant proteins whose levels change during MPRO development.
Bioinformatic and functional tools were then used to analyze the role of these proteins in myeloid differentiation. I have also used a new generation of oligonucleotide chips to compare mRNA levels in MPRO cells 0 hours and 72 hours after induction of differentiation. In particular, I have further examined the expression of transcription factor mRNAs in MPRO cells and compared this pattern with transcription factor expression in mature human neutrophils.

Materials and methods

Cell line growth and induction

The MPRO cells15 were obtained and incubated as described previously.20 MPRO cells induced with retinoic acid for 0, 24, 48, and 72 hours were collected and analyzed by the procedures described below.

Two-dimensional immobilized pH gradient gel electrophoresis

MPRO cells were disrupted in lysis buffer.20 I applied 50 to 100 µL of each MPRO cell lysate (1.25 × 10^6 cells/100 µL to 2.5 × 10^6 cells/100 µL, about 100-200 µg protein) at the cathodic end of the immobilized pH gradient gel (IPG) strips (pH 3-10 L, pH 4-7 and pH 6-11, Pharmacia Biotech, Uppsala, Sweden), and 2-dimensional IPG electrophoresis (2D-IPG) was conducted for 10 to 16 hours (13 000 to 20 100 V-h) using Electrophoresis Power Supply ESP 3500 XL and Immobiline DryStrip Kit (Pharmacia Biotech). The electrophoresis in the second dimension was carried out in a 12% sodium dodecyl sulfate-polyacrylamide gel electrophoresis (SDS-PAGE) gel with the Laemmli-SDS-continuous system in a PROTEAN II xi 2-D cell (Bio-Rad, Hercules, CA), run at 40 mA constant current for 5 hours.21,22 The 2-dimensional gels were stained with Coomassie brilliant blue G-colloidal following the vendor's recommendations.23 Destaining was performed by soaking the gels in 10% acetic acid and 25% methanol solution for 60 seconds, then in 25% methanol solution for 24 hours at room temperature.
Silver staining was performed according to the protocol of the manufacturer.24,25 The 2-dimensional maps of MPRO cells were compared by using the Adobe Photoshop 4.0 program and Melanie III 2-D PAGE software (Genebio, Geneva, Switzerland) and checked manually. Proteins were recovered by punching out spots with MultiFit Research Pipet Tips (Volume: 100-1000 µL; Dot Scientific, Burton, MI). More than 200 visible protein spots were punched for later mass spectrometry analysis.

Mass spectrometry analysis

The punched samples were washed at room temperature in the following solutions: in 50% acetonitrile for 5 minutes; in 50% CH3CN/50 mM NH4HCO3 for 30 minutes; then in 50% CH3CN/10 mM NH4HCO3 for 30 minutes. After drying the sample gels in a SpeedVac Concentrator (Eppendorf, Hamburg, Germany), trypsin solution (0.05 µg trypsin/7 µL 10 mM NH4HCO3) was added to the samples and they were incubated at 37°C for 24 hours. The supernatants of the trypsin digestion products were collected, and 1 µL of sample digest was mixed with 1.0 µL of α-cyano-4-hydroxycinnamic acid (CHCA; 4.5 mg/mL in 50% CH3CN, 0.05% trifluoroacetic acid [TFA]) matrix solution and 1 µL of calibrants (100 fmol each). The mixture was loaded on a target of the sample plate, then injected into the Perseptive Biosystem Voyager-DE STR instrument (Perseptive Biosystem, Boston, MA). The spectra of the peptides were acquired in reflector/delayed extraction mode. The standards used for calibration of peptide masses were bradykinin (average (M+H) is 1061.23) and ACTH Clip (average (M+H) is 2466.70). The criteria currently used to identify proteins are: (1) "Coverage," the ratio of the portion of protein sequence covered by matched peptides to the whole length of the protein sequence, is 25% or more; (2) "Z score" is more than 1.5; (3) "Probability" is "1.0e + 000"; (4) "Coverage graphical" of the matched peptides from the protein candidate spans the full length of the protein.
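
As a rough illustration, the four identification criteria above can be expressed as a single acceptance test. The sketch below is not the software used in this work: the `SearchHit` record, its field names, and the example values are all hypothetical, and criterion 4 is reduced to a yes/no flag because the text does not say how the coverage graph was scored.

```python
import math
from dataclasses import dataclass


@dataclass
class SearchHit:
    """One database-search hit for a gel spot (all field names are hypothetical)."""
    protein: str
    coverage: float          # fraction of the sequence covered by matched peptides (0-1)
    z_score: float
    probability: float
    spans_full_length: bool  # criterion 4, simplified here to a yes/no flag


def accept(hit: SearchHit) -> bool:
    """Apply the four identification criteria listed in the text."""
    return (
        hit.coverage >= 0.25                    # (1) coverage of 25% or more
        and hit.z_score > 1.5                   # (2) Z score above 1.5
        and math.isclose(hit.probability, 1.0)  # (3) probability reported as 1.0e+000
        and hit.spans_full_length               # (4) matched peptides span the protein
    )


# Invented example hits, for illustration only.
hits = [
    SearchHit("protein A", 0.31, 2.1, 1.0, True),
    SearchHit("protein B", 0.18, 2.3, 1.0, True),  # fails on coverage
    SearchHit("protein C", 0.40, 1.2, 1.0, True),  # fails on Z score
]
accepted = [h.protein for h in hits if accept(h)]
print(accepted)
```

Requiring all four conditions together mirrors the way the criteria are listed; whether a spot that failed one criterion was ever accepted after manual inspection is not stated in the text.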
Peptide identification and database establishment

Peptides were identified using the ProFound-Peptide Mapping search engine (http://www.proteometrics.com/profound_bin/WebProFound.exe), and subsequently searched against the SWISS-PROT (http://www.expasy.ch/) or PIR (http://wwwnbrf.georgetown.edu/) sites. The differential patterns of protein expression were analyzed with Melanie II 2-D Page Software (Bio-Rad) (http://www.expasy.ch/melanie/MelanieII/description.html). The 2-dimensional reference maps and the identified protein information were collected in a database (dbMCp) that contained information for each protein including: GenBank matches, Locus Link or UniGene clusters, expression patterns, tissue distribution, synonym(s) protein name, gene name(s), notations of possible functions in myeloid cell biology and differentiation, and hyperlinks to the database searches, 2-dimensional images, and related references. These data were gathered as separate entries in a file. Supplementary information is available on the website (http://bioinfo.mbb.yale.edu/expression/blood).

The proteins identified from different sets of 2-dimensional gels were grouped into 12 categories according to their functions as documented in SWISS-PROT and National Center for Biotechnology Information (NCBI) databases. Furthermore, these proteins were classified into 5 expression patterns by their similarity to the ideal expression patterns.20 The correlations at various levels of proteins or RNA were compared using both visual estimates and Melanie software estimates of protein spot intensities and the average difference between match and mismatch probe sets for each gene on the oligonucleotide chips.

Protein synthesis inhibition by cycloheximide treatment

A pilot dose-response experiment determined the dose that produced 95% inhibition of MPRO cell protein synthesis, assayed by incorporation of radiolabeled L-[35S]methionine.
Based on the dose-response experiment, MPRO cells (2 × 10^5 cells/mL) were treated with or without cycloheximide (final concentration 10 µL/mL) for 2 hours, then collected and sampled for proteomic analysis as described above.

mRNA isolation and analysis

The mRNA was isolated from MPRO cells at the indicated time points during differentiation as previously described.20 Oligonucleotide chip analysis was also performed as previously described,20 except for the use of the more advanced Affymetrix chip probes (Murine Genome U74Av2 array), interrogating approximately 36 000 full-length mouse genes and expressed sequence tag (EST) clusters from the UniGene database. The resulting data were compared with human neutrophil gene expression analysis using the Affymetrix U60 set of oligonucleotide chips. Human neutrophils were prepared according to the method described previously.18 Criteria for considering cDNAs "present" and for selecting those with significant average differences, as well as rescaling, threshold, and normalization methods, were applied as previously described.20

To study mRNA expression, we first tested whether results from previous work20 using Affymetrix 11K chips could be incorporated along with the present set of measurements from the newer-generation Affymetrix Murine Genome U74Av2 array. Comparison of the differences in expression at times 0 and 72 hours between the 2 different chips requires preprocessing of the data, because the probe sets corresponding to any given gene in the old and new chips are different. Genes were identified by their Locus Link ID, by extracting the ID for each accession number in both the 11K and 36K chips using the Stanford Source database (http://genome-www5.stanford.edu/cgi-bin/SMD/source/sourceSearch). We filtered out probe sets that had missing values of expression levels at either 0 or 72 hours.
The remaining probe sets of the 11K chip were linked with the remaining probe sets of the 36K chip through a common Locus Link ID. Most of the remaining distinct 1906 Locus Link IDs had a single probe set per Locus Link ID, both in the old and new chips. However, 63 probe sets from the 11K chip were linked with more than one probe set on the 36K chip and 400 probe sets from the 36K chip were linked with more than one probe set of the 11K chip. We chose not to average the RNA levels of probe sets that belong to the same gene, because it would not be appropriate when the expression levels of one probe set dominate the others. Therefore, we evaluated the correlation between mRNA from the 11K chip and the 36K chip using only the subset of genes that had single probe sets on both chips. Using this subset we found that the correlation between mRNA levels of the 11K chip and the 36K chip is 0.75 at 0 hours and 0.7 at 72 hours. These correlations were lower than the correlations between the mRNA levels at 0 and 72 hours using only the 11K chip (r = 0.89) or only the 36K chip (r = 0.84). Therefore, changes in RNA levels were not entirely reproducible using these 2 completely different chips.

We compared the time course trends of 10 genes previously studied using Northern blots with the corresponding trends of the 11K and 36K chips. The trends of the 11K chip agreed with the Northern blots in only 6 of 10 instances, whereas the new 36K chip success rate was 9 of 10. The mRNA for several of the proteins we detected from 2-dimensional gels was reported as present on the new chip set but absent from the old chip set. We therefore used only data from the new chip set for comparisons with proteins and for further examination of changes in transcription factors. The use of only one replicate of the new 36K chip, although not ideal, should be sufficient for exploring global relations between protein and mRNA.
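
The probe-set preprocessing described above (drop probe sets with missing values, link the two chips through Locus Link ID, and keep only genes with a single probe set on both chips) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the analysis: the tuple layout, the function names, and every probe ID and expression value are invented.

```python
from collections import defaultdict


def group_by_locus(rows):
    """Group probe sets by Locus Link ID, dropping rows with missing values.

    Each row is (probe_id, locus_id, level_0h, level_72h); this layout is a
    hypothetical one chosen for the example.
    """
    groups = defaultdict(list)
    for probe_id, locus_id, level_0h, level_72h in rows:
        if locus_id is None or level_0h is None or level_72h is None:
            continue  # filter out probe sets with missing values
        groups[locus_id].append((probe_id, level_0h, level_72h))
    return groups


def link_single_probe_genes(old_rows, new_rows):
    """Keep Locus Link IDs that have exactly one probe set on each chip."""
    old, new = group_by_locus(old_rows), group_by_locus(new_rows)
    linked = {}
    for locus_id in sorted(old.keys() & new.keys()):
        if len(old[locus_id]) == 1 and len(new[locus_id]) == 1:
            linked[locus_id] = (old[locus_id][0], new[locus_id][0])
    return linked


# Invented toy data standing in for the 11K (old) and 36K (new) chips.
old_chip = [
    ("old_1", 101, 50.0, 80.0),
    ("old_2", 102, 30.0, None),   # missing 72-hour value, filtered out
    ("old_3", 103, 10.0, 12.0),
    ("old_4", 104, 5.0, 9.0),
]
new_chip = [
    ("new_1", 101, 55.0, 90.0),
    ("new_2", 102, 28.0, 35.0),
    ("new_3a", 103, 11.0, 14.0),
    ("new_3b", 103, 200.0, 150.0),  # second probe set for the same gene
    ("new_4", 104, 6.0, 8.0),
]
linked = link_single_probe_genes(old_chip, new_chip)
print(sorted(linked))
```

Genes with several probe sets are excluded rather than averaged, following the reasoning in the text that averaging is inappropriate when one probe set dominates the others.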
Northern blot analysis was performed as described previously.20

Results

Proteomic analysis of MPRO differentiation

The MPRO cell model is particularly useful for studying aspects of myeloid differentiation because large numbers of cells can be obtained, arrested at the promyelocyte stage of development, and, most importantly, synchronous differentiation can be induced by adding ATRA. The fully differentiated cells resemble mature neutrophils both morphologically and in the expression of secondary granule proteins. For the purpose of initially scanning changes in protein levels during myeloid differentiation, we used 2D-IPG with wide-range, linear IPGs (pH 3-10) in the first dimension. Figure 1 shows analytical colloidal blue-stained 2D-IPG standard maps of differentiated MPRO cells at 0, 24, 48, and 72 hours after the cells were induced with ATRA. The expression patterns of more than 300 protein spots were followed through the entire series of gels. The protein spots in different gels could easily be cross-matched to each other, using Melanie III software, indicating the reproducibility of the method. A large portion of these products changed their relative intensities among the 4 maps, suggesting extensive protein expression changes during the course of MPRO differentiation.

Protein identification

The protein spots in the different sets of the gels were identified by MALDI-MS on the basis of peptide mass matching with the theoretical peptide masses in tryptic digests of all known proteins from mouse and human species.26 Of 220 protein spots analyzed, 193 yielded high-quality spectral data. The experimental peptide masses were matched to a total of 143 spots corresponding to 123 different known proteins, as presented in Table 1.
The accession numbers, protein names, and theoretical pI and Mr values, as well as the number of peptide matches and probability of wrong assignment, are presented in the database dbMCp (http://bioinfo.mbb.yale.edu/expression/myelopoiesis). There were 29 spots with high-quality spectra but poor matches in public databases; another 21 spots with good mass spectra matched many different proteins in the mouse database. The latter finding was probably attributable to high sequence homology, but can also be the result of a mixture of proteins in a single spot.

On the pH 3 to 10 maps, 14 protein species were represented by multiple spots (Table 2) that differed in pI or Mr. These differences might be the result of alternative splicing or posttranslational modifications, or of chemical modification by protease inhibitors during sample preparation. Interestingly, some of these proteins showed the same phenomena in Jurkat T-cell 2-dimensional protein maps.27 Some spots with high-quality spectra, but shifted from their expected position in the gel, might also represent posttranslational modifications. Proteins with lower than expected molecular weights may be digestion fragments of larger proteins. Most proteins with low molecular weight (< 14 kDa), which usually presented multiple matches, could not be identified.

Protein expression patterns during MPRO development

The 123 "known" proteins identified here were classified into 12 categories on the basis of their function, including 18% categorized as cytoskeletal proteins, 15% metabolism-related molecules, and 10% signaling pathway-related proteins (Table 3). These proteins were abundant in the cell and easily detected by 2-dimensional electrophoresis. Smaller sets of proteins included 7 possible transcription factors and 5 cytokines; other categories, such as kinases and chromatin remodeling factors, contain even fewer members.
We also classified all known proteins according to their expression patterns during myeloid differentiation. We clustered the standardized protein expression level profiles (at 0, 24, 48 and 72 hours) using the GENECLUSTER version of the self-organizing maps (SOMs) clustering algorithm,28 with a rectangular 3 × 2 grid as the input node geometry. The final position of the nodes in the 4-dimensional (time course) space represents the centers of 6 clusters generated by the SOM algorithm. One of these clusters was empty. Figure 4 shows the normalized expression profiles divided into the remaining 5 clusters representing trends such as down-regulation (Figure 4A) and up-regulation (Figure 4D,E) that occur during the cell maturation process. For example, the universal transcription factor Eef2 is down-regulated. This finding is consistent with the concurrent reduction of total RNA levels and cell size. Conversely, protein Es10 shows a pattern of up-regulation, as expected for a granule component. Thus, these profiles offer information about the roles of proteins in the different stages of MPRO development.

Correlation of gene expression at the RNA and protein levels

One of the goals of this work is to search for global relationships between mRNA and protein levels during MPRO cell maturation. Previous studies in yeast showed weak correlations between average mRNA levels and average protein levels.29-32 These studies focused on the relationship, at one instant, between absolute amounts of mRNA (measured from Affymetrix GeneChip experiments) and protein. Here we investigate another quantity: the correlation between changes, over many time points, in mRNA levels and in protein levels. This is only possible because we have available experiments simultaneously done on protein and mRNA levels over an entire time course.
In particular, we analyze the relationships between time course expression profiles of mRNA and proteins during a process of mammalian cellular development. This is the first time that the relationship between protein abundance and mRNA expression has been studied in terms of changes over time. To study mRNA expression, we used measurements taken at times 0 and 72 hours of the maturation process, using the Affymetrix 36K murine chip. To compare the mRNA changes with protein changes we first summed levels of proteins that are represented in more than one spot on the 2-dimensional gels. We retained only mRNAs with an Affymetrix oligonucleotide probe set with the suffix "_at" (representing a probe set corresponding to a single gene). This procedure removes the ambiguity of multiple probe sets per Locus Link. We then screened for mRNA that had a "present" Affymetrix indicator and an amplitude of more than 20 at 0 and 72 hours and found 51 different proteins that satisfied these conditions. The correlation between the mRNA difference at 0 and 72 hours and the corresponding protein difference is r = 0.58, as presented in Figure 5 (the exact formula for the Pearson correlation coefficient r is given in the legend to Figure 5). Most proteins with increasing levels of mRNA also have increasing protein levels, with the exception of 2 outliers (enolase 1 and coronin). Overall, 11 of 51 proteins with upward/downward trends had an opposite mRNA trend.

The reproducibility of the protein results was studied by repeating the induction experiments of MPRO cells and also by repeated analyses of the same cell samples. The induction experiments were repeated 3 times. In each experiment, the MPRO cells from different time courses were analyzed by 2D-IPG 2 to 6 times. We found that the protein spot images were well reproduced, with only slight differences occurring at the far edges of gels.
Quantitative analysis of 4 dilutions of the same samples showed that the intensity change of each protein was proportional. In comparisons of 2D-IPG of 0-hour and 72-hour cells between 2 different induction experiments, we found that among the 220 analyzed proteins, 199 (90%) were reproducibly observed, and 21 were not observed in all gel sets. The direction of expression changes of proteins at 72 hours relative to 0 hours was similar in both experiments, with a correlation coefficient of 0.88. We measured protein abundance using both software and manual estimations of spot intensity. Using the Melanie III program from Genebio, we were able to compute the protein abundance of thousands of proteins across the gels and found a general consistency between measurements by eye and by software analysis (data not shown).

We did not expect to find a general correlation for the changes in levels of these proteins and their mRNAs; rather, as previously hypothesized,29,32 we sought correlations between smaller, better defined groups of proteins. Although the correlation over all proteins and mRNA hovered around 0.3 for each of the time points, we found that the median correlation for cytoskeletal proteins alone rose to approximately 0.65, highlighting the importance of analyzing mRNA expression and protein abundance using well-defined features and functions.

Protein stability

The level of any protein is theoretically determined by its cumulative rate of synthesis and by the rates of degradation or alteration (and an initial condition of protein level). For protein stability studies, MPRO cells (1.5 × 10^5 cells/mL) were treated for 2 hours with cycloheximide (final concentration 10 µL/mL, based on an initial dose-response experiment). The cycloheximide-treated and control cells were analyzed on 2 sets of IPGs (pH 4-7 and pH 6-11). As shown in Figures 6 and 7, the relative expression of most proteins remained the same after 2 hours of treatment.
Quantitative measurements showed that 27.5% of proteins dropped off significantly (fold change > 2), whereas 63.7% of proteins were stable over this time period (Figure 8). Nine proteins showed a relatively higher level of expression after cycloheximide treatment, indicating that posttranslational modifications of these proteins occurred less than 2 hours after their synthesis, or that their translation was relatively resistant to cycloheximide.

Comparison of differentiated MPRO cells with normal neutrophils

After 72 hours of ATRA treatment, the MPRO cells resembled mature neutrophils morphologically, including the presence of secondary granule proteins. To obtain a more complete picture of the differentiation state of the MPRO cells, I compared their RNA profiles with those of mature neutrophils. Human neutrophils were used rather than murine peripheral blood cells because the human cells are a more practical source of sufficient RNA for replicate analyses. In particular, we chose to focus on the levels of mRNA encoding transcription factors, because they control the differentiation process and determine the expression of the other genes. A total of 219 known or probable transcription factors were represented in mRNA isolated at some stage of MPRO cell development. Comparison of oligonucleotide chip analyses showed that there were 49 transcription factors whose mRNA was reported as present in resting human neutrophils but whose homologues were reported as absent in 72-hour MPRO cells. To obtain more precise data, we performed Northern blot analysis of 20 mRNAs encoding transcription modulators (Table 4). Of these, the oligonucleotide chips reported 12 as present in human neutrophils but absent in 72-hour MPRO cells. Eleven of these 12 were detected as present in 72-hour MPRO cells by Northern blot analysis (Figure 9). These included Bach1, not previously studied in myeloid cell differentiation, but markedly elevated in the mature cells.
Conversely, Rybp was markedly reduced as the cells matured. This finding is surprising because the protein is a presumptive transcriptional repressor and part of the mammalian homologue of the Drosophila polycomb complex.

Discussion

We have used a 2-dimensional gel electrophoresis approach to explore the temporal patterns of protein expression during ATRA-induced myeloid development in the MPRO murine myeloid cell line.8 This global analysis has detected 123 known proteins and 29 "new" proteins out of 220 protein spots identified by tandem mass spectrometry, including proteins in 12 functional categories such as transcription factors, cytokines, and others. Bioinformatic analysis of these proteins has revealed clusters with functional importance to myeloid differentiation. Comparison of gene expression at the genomic and proteomic levels revealed some discrepancies between RNA and protein levels that indicate the importance of posttranscriptional and posttranslational processes during cell differentiation, although some differences undoubtedly arise at least in part from technical limitations of the current methods of measurement. These discrepancies may also be the result of varying translation and degradation efficiencies or might reflect posttranslational modifications. Nonetheless, overall there was a significant correlation between changes in mRNA and protein levels, consistent with the expectation that a substantial proportion of protein change is a consequence of changed mRNA levels, rather than posttranscriptional effects. Cycloheximide inhibition also showed that most of the proteins detected by gel electrophoresis were relatively stable, so that increased stability of proteins with maturation was not a likely explanation for the observed changes. We further examined the expression of transcription factor mRNA in MPRO cells and compared this with the expression pattern in mature human neutrophils.
By combining oligonucleotide chip and Northern blot analysis, we observed that most of the transcription factor mRNAs detected in human neutrophils have homologues present in mature MPRO cells, although estimated relative RNA abundances could be quite different between species.

The first comparison of mRNA levels to the protein abundances of their gene products33 found a correlation coefficient of 0.48. These observations highlighted the limitations of functional studies performed only at the mRNA level. Later, Anderson's group found a correlation coefficient of only 0.43 in a comparison of protein and mRNA abundances for a single gene product across 60 human cell lines by an immunoaffinity high-performance liquid chromatography method and quantitative Northern analysis.19 In 1999, Gygi et al30 quantitatively compared mRNA and protein expression levels for 128 different genes expressed in yeast, using serial analysis of gene expression (SAGE) and capillary liquid chromatography-tandem mass spectrometry methods. Their results showed a correlation coefficient of 0.935 for the most abundant proteins, but the coefficient was only 0.356 for the 69% of 106 genes34 for which the transcript levels were less than 10 copies/cell. These prior studies examined static expression levels without correlation of changes in protein and mRNA levels during cell development, as performed in the present study. In general, we found a moderately high correlation (coefficient 0.58) between estimated protein and RNA levels. There are multiple technical considerations in measuring both RNA and protein levels that might affect the results, but the general conclusion supports previous contentions35 that interpretations of changes in cell behavior based on changing mRNA levels are incomplete.
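For reference, the coefficient quoted here is a Pearson correlation, and the share of genes whose RNA and protein changes point in the same direction can be counted alongside it. The sketch below is a plain-Python illustration; the function names and the toy numbers are hypothetical and are not the study's data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def same_direction_fraction(delta_rna, delta_protein):
    """Fraction of genes whose RNA and protein changes share a sign
    (points in the first or third quadrant of a scatter plot)."""
    same = sum(1 for r, p in zip(delta_rna, delta_protein) if r * p > 0)
    return same / len(delta_rna)


# Toy changes between two time points (hypothetical values).
delta_rna = [1.0, -1.0, 2.0, -2.0, 3.0]
delta_protein = [2.0, -3.0, 1.0, 1.0, 4.0]
r = pearson_r(delta_rna, delta_protein)
frac = same_direction_fraction(delta_rna, delta_protein)
```

The two numbers answer different questions: a modest r can coexist with a high same-direction fraction when most genes move the same way but by poorly matched amounts.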
Nevertheless, the correlation is sufficiently strong to indicate that the regulation of transcript levels is probably a major determinant of changes in protein levels during differentiation in this system. Because uninduced MPRO cells were in a steady state, one might expect to see better correlation at later time points, when changes in mRNA levels over time have been translated into protein levels. Some loss of correlation could derive from unstable proteins that are differentially regulated during cellular maturation. Using cycloheximide to inhibit protein synthesis, we found that the large majority of the proteins in this system are relatively stable. However, protein stability is an important factor in posttranslational proteomic studies.

Much progress has been made in understanding transcriptional regulation of the myeloid differentiation program. Transcription factors such as PU.1 and members of the C/EBP family have been found to play important roles in the expression of a variety of myeloid genes, both by examination of individual gene regulatory regions and by gene knock-out studies in mice.36-39 Our previous work20 initiated, and the present study has established, a database of transcription factors and target genes differentially regulated during myeloid differentiation. The results are limited by the sensitivity, accuracy, and comprehensiveness of the available oligonucleotide chips for mouse mRNAs. Detection of transcription factor proteins is difficult because they are often present at low abundance, may have basic pIs, and may be present in various modified forms that alter their mobility on 2-dimensional gels. Encouragingly, the present study identified 7 proteins potentially important to transcriptional regulation, including RNA polymerase II, Stat5a, Aiolos, Hmg1 and 2, Kruppel-related zinc finger protein F80-m, and Zfp101.
Previous studies have shown that all 7 members of the signal transducer and activator of transcription (STAT) family are involved in regulating expression of cytokine-induced and growth factor-induced genes.40 Among them, Stat5 appears to have an important role in myeloid cell development, primarily by mediating granulocyte-macrophage colony-stimulating factor (GM-CSF) signaling. At the mRNA level, several STAT proteins, including Stat1, 3, 5b, and 6, were moderately up-regulated in MPRO cells. Our data showed decreased expression of Stat5a protein at the late stage of MPRO differentiation, as reported in other systems.40 Kruppel-related zinc finger protein F80-m and Aiolos are 2 newly identified transcription factors, with still unknown functions in myeloid cells, although Aiolos is known to interact with Ras to control cell death in T cells.41 In MPRO cells, we found that Aiolos is expressed at a fairly constant level throughout differentiation. In contrast, Kruppel-related zinc finger protein F80-m was strongly down-regulated and Zfp101 slightly up-regulated.

The high mobility group (HMG) box domain defines a family of proteins, mostly transcription factors, that specifically interact with DNA in the minor groove.42,43 Surprisingly, recent studies suggest a second, quite different function for Hmg1 and 2 as cytokine-like factors.44,45 In this study, both Hmg1 and Hmg2 were detected by 2DE analysis. Hmg2 was significantly up-regulated, indicating a possibly important function in MPRO differentiation.

Oligonucleotide chip analyses showed the presence of mRNAs for about 123 transcription- or chromatin-modifying factors in differentiated MPRO cells and 147 factors in mature human neutrophils. Overall, 49 of these factors represented in neutrophil mRNA were not detected by chip analysis of MPRO cells, but 11 of 12 were detectable by Northern blot analysis.
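The chip comparison for the genes tested by Northern blot can be tallied directly from the "MPRO 72 h" and "Human 60K" columns of Table 4. The sketch below assumes the column-to-gene alignment as read from the flattened table in this appendix; the dictionary and helper names are illustrative and not part of the original analysis.

```python
# Affymetrix entries for (72-hour MPRO cells, human neutrophils), transcribed
# from Table 4. Each entry is "average difference/call", with call P (present)
# or A (absent); "N/A" means the gene is not on the chip. The row alignment is
# an assumption of this sketch.
TABLE4_CHIP = {
    "AA407540": ("169.81/P", "N/A"),
    "Bach1": ("20/A", "679.32/P"),
    "Baz1b": ("20/A", "679.32/P"),
    "Crem": ("20/A", "821.08/P"),
    "Creb1": ("20/A", "233.22/P"),
    "Cutl1": ("20/A", "125.96/P"),
    "Hipk3": ("26.97/A", "41.14/P"),
    "Maz": ("20/A", "1674.86/P"),
    "Mycbp": ("20/A", "128.25/P"),
    "Nmi": ("20/A", "407.37/P"),
    "pou2f1": ("20/A", "101.02/P"),
    "Pou5f1": ("20/A", "65.02/A"),
    "Pura": ("81.62/P", "49.42/P"),
    "Rybp": ("20/A", "592.25/P"),
    "Elf4": ("20/A", "729.51/P"),
    "Sp1": ("20/A", "20/A"),
    "Ncoa1": ("27.24/P", "392.04/P"),
    "Fos": ("20/A", "N/A"),
    "p202": ("N/A", "N/A"),
    "p204": ("N/A", "N/A"),
}


def detection_call(entry):
    """Return 'P' or 'A' for a 'value/call' entry, or None for 'N/A'."""
    if entry == "N/A":
        return None
    return entry.rsplit("/", 1)[1]


# Genes called present in human neutrophils but absent in 72-hour MPRO cells.
discordant = [
    gene
    for gene, (mpro, human) in TABLE4_CHIP.items()
    if detection_call(human) == "P" and detection_call(mpro) == "A"
]
print(len(discordant))  # 12 with the values as transcribed here
```

With this reading of the table, the count agrees with the 12 genes that the text reports as present on the human chip but absent on the MPRO chip.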
In some cases the failure to find an mRNA by chip analysis was probably because the amount of transcript was below the threshold for oligonucleotide chip detection,46,47 but in other cases relatively strong Northern signals were obtained.

Several subsets of transcription factor mRNAs had patterns of expression that could be interpreted in terms of known function of the products. Myc is a well-known transcription factor that promotes growth rather than differentiation,48 and in turn is regulated by interactions with a family of proteins including Max, Mad, and Sin3B.49 In developing MPRO cells, Myc is down-regulated and Mad is up-regulated. The related protein Mad4 is slightly down-regulated and Mad5 is markedly down-regulated and apparently absent from the mature cells. Mad5 differs from other proteins of this group in that it may act to stimulate as well as repress transcription. In addition, Sin3b is one of the more markedly up-regulated transcription factor mRNAs. The combined changes in Mad, Myc, and Sin3b would be expected to synergistically prevent activation of Myc target genes.

PU.1 is a transcription factor implicated in the transcriptional control of neutrophil-specific genes and in neutrophil production, which is defective in PU.1 knockout mice.50 Sp1, Purb, Klf9/Bteb1, and Maz are broadly expressed transcription factors that bind to purine-rich sites, including potentially some PU.1 sites. PU.1 is up-regulated almost 3-fold at the RNA level, whereas all 4 of the latter factors are down-regulated during MPRO development, as is the SP1-like factor Klf13.

We have previously observed20 by Northern blot analysis that there is a shift in the balance of members of the C/EBP family of transcription factors at the mRNA level during MPRO differentiation, with some progressive down-regulation of C/EBP and up-regulation first of C/EBP then C/EBP and .
These results are consistent with the role of these factors in neutrophil development, deduced from both transcriptional analysis of individual promoters and gene knockout effects on myelopoiesis. The present set of RNA analyses by oligonucleotide chip hybridization is more consistent with the Northern blot analyses than were the preliminary results,20 although C/EBP is still not represented on the chip.

Overall, these coordinated changes in the expression of multiple transcription factors would serve to amplify differences in transcription and permit fine control of the timing and amplitude of regulation for multiple gene targets. As previously postulated,51 such reciprocal regulation of competing factors may be a common mechanism in differentiation. The changes in mRNA levels during maturation of myeloid cells include both the silencing of a number of genes and up-regulation of a number of other genes. The substantial changes in the level of some putative transcriptional repressors, both up (eg, Sin3b, Atf7ip) and down (eg, Rybp) during differentiation, suggest that specific repression of transcription provides an important and under-investigated means of regulating myeloid differentiation, in addition to more conventional mechanisms such as competition for binding sites and changes in activating factor levels.

The striking morphologic changes in the maturing nuclei of "polymorphonuclear leukocytes" remain mysterious both in terms of mechanism and teleology. Some possible clues may be observed in the current RNA expression data. For example, Ran is a small guanosine triphosphatase (GTPase) required for nuclear import and export, and mRNA levels for Ran and Ran binding proteins 1 and 2 decline as the cells mature. This change could be either a cause or a consequence of decreased nuclear import of macromolecules coincident with nuclear condensation.
Another protein, acinus, is implicated in causing chromatin condensation without DNA breakage during apoptosis52; its mRNA increases about 3-fold as MPRO cells mature and form highly condensed, multilobed nuclei.

In summary, we have comprehensively and quantitatively analyzed both RNA and protein expression patterns during myeloid differentiation. Changes in protein levels correlated moderately well with changes in mRNA expression. Investigation of transcription factor mRNA representation showed considerable similarity to that of mature human neutrophils and highlighted several transcription factors and other functional nuclear proteins whose mRNA levels change prominently during MPRO differentiation but which have not been investigated previously in the context of myeloid development. The number of transcription factors expressed in these cells greatly exceeds the number previously identified as important for the regulation of specific myeloid genes. Currently emerging techniques53-55 for genomic analysis of factor binding sites in mammalian DNA may help to elucidate their gene targets and potential roles in myeloid differentiation.

Acknowledgments

We express our gratitude to Dr S. Tsai (Program in Molecular Medicine, Fred Hutchinson Cancer Research Center, Seattle, WA) for his kind gift of the MPRO cell line, and Mr Jeffrey J. Meyer (University of Chicago School of Medicine) for helpful advice. Supported by National Institutes of Health (NIH) grants CA42556, AI43558, DK54369, and HL63357, and by Gene Logic (S.M.W.); NIH grant HL63357 (Z.L.); NIH grant DK 54369, grants from the Arthritis Foundation and Charles H. Hood Foundation, and the John H. Pierce Pediatric Oncology Research Fund (P.E.N.); and NIH grant P50 HG02357-01 (M.G.). S.M.W. owns stock in and consults for Gene Logic Inc.

Figures and Tables

Appendix: Change in mRNA expression vs. change in protein abundance levels

Figure 1. Two-dimensional electrophoretograms of wide pH range of MPRO cells. MPRO cells differentiate to mature neutrophils in the presence of ATRA. Following exposure to 10 µM ATRA for 0, 24, 48, or 72 hours, MPRO cell lysate (2.5 × 10^6 cells/sample) was loaded for 2-dimensional electrophoretic (2DE) analysis. The gels were stained with brilliant blue G-colloidal dye. (A) Uninduced MPRO cells (0 hour); (B) MPRO cells induced with ATRA for 24 hours; (C) MPRO cells induced with ATRA for 48 hours; (D) matured MPRO cells induced with ATRA for 72 hours. The most visible protein spots in the maps were subjected to MS analysis. The marked 2DE maps can be found on our website (http://bioinfo.mbb.yale.edu/expression/myelopoiesis). *The 2DE maps of panels A and D were published in our previous paper.20

Figure 2. Two-dimensional electrophoretograms of MPRO cells in pH range 4 to 7. MPRO cell lysate (1.5 × 10^6 cells/sample) was loaded for 2DE analysis (pH 4-7). The gels were stained with brilliant blue G-colloidal dye. (A) Uninduced MPRO cells (0 hour); (B) matured MPRO cells induced with ATRA for 72 hours. The other information is presented as in the legend to Figure 1.

In these wide-range 2-dimensional maps, there is a loss of resolution in the region pH 4 to 7, most probably because the pI values of many proteins fall in this range. Therefore, we also performed electrophoresis on pH 4 to 7 and pH 6 to 11 narrow-range IPGs to get better protein separation (Figures 2 and 3). These narrower pH gels allowed a higher resolution and more protein spots in the relative pH zones. The abundant protein spots could also be cross-correlated between the wide and narrow gels.
Figure 3. Two-dimensional electrophoretograms of MPRO cells in pH range 6 to 11. MPRO cell lysate (1.5 × 10^6 cells/sample) was loaded for basic pH 2DE analysis (pH 6-11). The gels were stained with brilliant blue G-colloidal dye. (A) Uninduced MPRO cells (0 hour); (B) matured MPRO cells induced with ATRA for 72 hours. The other information is presented as in the legend to Figure 1.

Table 1. Distribution of protein spots identified during myeloid differentiation

Table 2. Protein species represented by multiple spots

                                              Theoretical value          Practical value
Symbol     Accession   Gi#*       Protein ID  kDa     pI      %     kDa       pI
Aldh2      NP_033786   6753036    MPRO004     56.52   7.7     23    31~50     6.4~6.6
                                  MPRO006     56.52   7.7     20    6~14      7.3~7.6
Atp5a1     NP_031531   6680748    MPRO087     59.73   9.3     24    45~55     7.6~7.8
                                  MPRO088     59.73   9.3     24    45~55     7.8~8.0
Ddx5       NP_031866   6681157    MPRO206     69.3    9.3     22    45~66     9.1~9.3
                                  MPRO207     69.3    9.3     26    45~55     9.1~9.4
Gapd       NP_032110   6679937    MPRO035     35.79   8.7     39    25~35     8.0~8.2
                                  MPRO085     35.79   8.7     34    28~38     7.7~7.9
Hnrpa2b1   NP_058086   7949053    MPRO223     35.98   8.7     55    25~31     9.2~9.3
                                  MPRO227     35.98   8.7     55    21~31     9.2~9.3
                                  MPRO229     35.98   8.7     55    21~33     9.1~9.2
Hnrph1     NP_067485   10946928   MPRO155     49.18   5.9     26    45~66     5.9~6.0
                                  MPRO154     49.18   5.9     40    45~66     5.8~5.9
Hmg2       NP_032278   11527222   MPRO076     24.16   6.9     26    18~28     7.2~7.4
                       6680229    MPRO104     14.16   6.9     26    14~21     7.6~7.8
Pk3        NP_035229   6755074    MPRO023     57.9    7.2     48    45~66     7.2~7.4
                       2506796    MPRO008     57.87   7.2     42    150~200   7.0~7.5
Rbm3       NP_058089   7949121    MPRO014     16.59   6.8     25    7~14      6.6~6.8
                                  MPRO015     16.59   6.8     25    12~16     6.2~6.4
STEFIN 3   P35175      461911     MPRO005     10.99   5.9     48    1~6.5     6.2~6.4
                                  MPRO033     10.99   5.9     53    1~6.5     5.8~6.0
Tpi        NP_033441   6678413    MPRO012     26.69   6.9     26    15~25     6.9~7.1
                                  MPRO073     26.69   6.9     40    18~28     6.7~6.9
Tpm5       P21107      136097     MPRO083     29      4.7     46    6.5~14    7.5~7.7
                                  MPRO112     29      4.7     27    21~31     4.6~4.8
Vim                    2078001    MPRO093     51.55   4.9     25    31~45     4.7~4.9
                                  MPRO110     53.67   5.1     28    40~50     4.9~5.0
Vdac1      Q60932      10720404   MPRO228     32.33   8.7     49    21~33     8.8~9.0
                                  MPRO235     32.33   8.7     35    21~31     8.7~8.9

Protein symbol, accession, and Gi# refer to NCBI UniGene database (if represented).
Theoretical values are from the ProFound website (http://prowl.rockefeller.edu/cgibin/ProFound). Practical values are those observed in 2DE gels (see "Appendix").

Table 3. Classification of known proteins

Category: Protein (gene) symbol
Cytoskeleton: Actb, Actg, Anxa1, Anxa11, Anxa2, Anxa3, Arpc3, Cappa1, Coro1a, ECP, KER1, KER8, KER10, KER47, KER59, Krt2-6g, KT14, SAC, Tpm5, Tuba6, Tubb5, vim
Energy metabolism: Eno1, Gapd, Idh1, Idh2, Impdh2, Ldh1, Papss2/Atpsk2, Pygm, Taldo1, Tpi
Signaling pathway: Arhgdib, Arhn, Ephb2, G4-1-pending, Gnb2-rs1, Hcph, Nme1, Pgk1, Pk3, Ptpn1, Rac2, Ran, Rin, Vav2
Cytokine: Hgf, IFI-205, IIIf5, Pbp, Spry1
Transcription modulators: Hmgb1, Hmg2, KRZF80M, Rnf17, Stat5a, Taf2e, Zfp101, ZFP1A3, Zfp354a
Chaperone: Cab140, Cct2, Cct5, Cct6a, GROEL, Grp58, Hsc70, Hsp110, Hspa5/Grp78, Hspa8, P4hb, Ppia, Stip1
Granule-related protein: Cas1, Es10, Psmc1, Psma7, Psmc2, Sod1, STEFIN3
Mitochondrial: Got2, Aldh2, Atp5a1, Atp5b, Mor1
RNA metabolism: Hnrpa1, Hnrpa2b1, Hnrph1, Nsap1-pending, Rbm3, RNPC
Transporter: Slc23a2, Vdac1
Chromatin: Lmnb1, Pcna
Other categories: Abpa, Cftr, Crmp1, C4, Ddx5, Eef2, Eef1a1, Ehd1, Fut4, Gc, Gstm1, HPD76, IGVAP, Ltf, Tinag, Ube1x, H2-Ab1, Phb, Prdx1, Prdx2, Pdi4, Rag1, LOC56463, PRO2675, Tagln2, AA589396, Lgals3, Sfmbt

Protein symbols refer to NCBI databases (see "Appendix").

Figure 4. Protein clusters according to their expression patterns. The 72 protein spots were grouped into 6 clusters (1 empty cluster is not shown). Each cluster is represented by the centroid (average pattern represented by a thick red line) for genes in the cluster. The expression level of each gene was standardized to have zero mean and unit SD across the 4 time points. Standardized expression levels are shown on the y-axis and time points on the x-axis.
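The standardization and centroid described in the Figure 4 legend can be sketched as follows. This is only an illustration of those two steps: the clustering algorithm itself is not reproduced, the use of the sample SD is an assumption (the legend does not say which SD was used), and the profiles and function names are hypothetical.

```python
import statistics


def standardize(profile):
    """Rescale one expression profile to zero mean and unit SD.

    Uses the sample SD (n - 1 denominator); this choice is an assumption.
    """
    mean = statistics.fmean(profile)
    sd = statistics.stdev(profile)
    if sd == 0:
        raise ValueError("a constant profile cannot be standardized")
    return [(value - mean) / sd for value in profile]


def centroid(profiles):
    """Average pattern of a cluster: the per-time-point mean of its members."""
    return [statistics.fmean(column) for column in zip(*profiles)]


# Hypothetical spot intensities at 0, 24, 48, and 72 hours.
cluster = [
    [10.0, 20.0, 30.0, 40.0],
    [5.0, 6.0, 9.0, 12.0],
]
standardized = [standardize(p) for p in cluster]
cluster_centroid = centroid(standardized)
```

Standardizing before averaging means that the centroid reflects the shape of the time course rather than the absolute abundance of the most intense spots.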
Figure 5. The correlation between the mRNA difference at 0 and 72 hours and the corresponding protein difference. Correlation between RNA expression level differences, ΔR = RNA(t = 72) - RNA(t = 0), and protein level differences, ΔP = P(t = 72) - P(t = 0). Expression levels of proteins that have more than one conformation were summed. In this regression analysis we retained only RNA probe sets that correspond to single genes (the remaining probe sets lacked the ambiguity of multiple probe sets per Locus Link) and that had a "present" Affymetrix indicator and an amplitude of more than 20 at both t = 0 and t = 72 hours. There were 51 different proteins that satisfied these conditions. The linear association r between changes in RNA levels (ΔR) and changes in protein levels (ΔP) of the remaining 51 genes is only 0.58, where r is the Pearson correlation coefficient defined as

$$ r(\Delta P, \Delta R) = \frac{\sum_i (\Delta P_i - \overline{\Delta P})(\Delta R_i - \overline{\Delta R})}{\sqrt{\sum_i (\Delta P_i - \overline{\Delta P})^2}\,\sqrt{\sum_i (\Delta R_i - \overline{\Delta R})^2}} $$

However, about 80% of the genes are located in the first and third quadrants, indicating a general trend that genes with increasing/decreasing levels of RNA also have increasing/decreasing protein levels.

Figure 6. Two-dimensional electrophoretograms of cycloheximide inhibition of MPRO cells. MPRO cells were treated with cycloheximide for 2 hours. MPRO cell lysate (1.5 × 10^6 cells/sample) was loaded for 2DE analysis (pH 4-7). (A) Control MPRO cells. (B) Cycloheximide-treated MPRO cells. The gels were stained with brilliant blue G-colloidal dye. (C,D) The magnified regions of the 2DE gels shown as insets in panels A and B. The arrowheads point to protein spots that decrease in intensity after cycloheximide treatment; the arrows point to spots whose intensity increases after cycloheximide treatment.
The other information is presented as in the legend to Figure 1.

Figure 7. Two-dimensional electrophoretograms of cycloheximide inhibition of MPRO cells. MPRO cells from the cycloheximide inhibition experiment were also analyzed by basic pH range 2DE. MPRO cell lysate (1.5 × 10^6 cells/sample) was loaded for IPGs-PAGE pH 6 to 11 and stained with brilliant blue G-colloidal dye. (A) Control MPRO cells. (B) Cycloheximide-treated MPRO cells. (C,D) The magnified regions of the 2DE gels shown as insets in panels A and B. The arrowheads point to protein spots that decrease in intensity after cycloheximide treatment; the arrows point to spots whose intensity increases after cycloheximide treatment. The other information is presented as in the legend to Figure 1.

Figure 8. Distribution of protein spots from cycloheximide experiment. In the cycloheximide experiment, MPRO cells were treated with cycloheximide for 2 hours; the untreated MPRO cells were used as a control. The protein inhibition patterns were compared with those of the control cells by Melanie-II software. For each protein, the x-axis value represents the OD value without cycloheximide treatment, and the y-axis value represents the OD value after cycloheximide treatment. The protein information was collected in the database dbMCp.

Table 4.
Transcription factors analyzed by Northern blot assay

Symbol     0 h   24 h   48 h   72 h   MPRO 72 h*   Human 60K*
AA407540   2     3      1      1      169.81/P     N/A
Bach1      1     3      4      4      20/A         679.32/P
Baz1b      4     7      3      3      20/A         679.32/P
Crem       2     1      0      0      20/A         821.08/P
Creb1      2     3      2      2      20/A         233.22/P
Cutl1      2     1      3      3      20/A         125.96/P
Hipk3      5     5      3      3      26.97/A      41.14/P
Maz        4     4      2      2      20/A         1674.86/P
Mycbp      1     2      2      2      20/A         128.25/P
Nmi        1     2      2      2      20/A         407.37/P
pou2f1     3     2      1      1      20/A         101.02/P
Pou5f1     1     2      2      3      20/A         65.02/A
Pura       3     3      5      8      81.62/P      49.42/P
Rybp       5     4      1      1      20/A         592.25/P
Elf4       0     0      0      0      20/A         729.51/P
Sp1        0     0      0      0      20/A         20/A
Ncoa1      0     0      0      0      27.24/P      392.04/P
Fos        0     0      0      0      20/A         N/A
p202       0     0      0      0      N/A          N/A
p204       0     0      0      0      N/A          N/A

Band intensities at the different time points from the Northern blot assay were semiquantified on a scale from 1 (+) to 8 (++++++++).
* The numbers in these columns are average differences in the value of hybridization intensity between the set of perfectly matched oligonucleotides and the set of mismatched oligonucleotides in the oligonucleotide array. "A" represents genes that are absent, and "P" represents genes that are present, in the Affymetrix chip assay. The other information is presented as in the footnote to Table 2. N/A indicates that the gene is not represented on the Affymetrix chips.

Figure 9. Northern blot analysis of selected mRNAs. Equivalent amounts of RNA from MPRO cells induced by ATRA at different time points (0 hour, 24 hours, 48 hours, and 72 hours) were resolved by formaldehyde-agarose gel electrophoresis and stained to verify the amount of loading. Twenty transcription factor genes were separately probed on the RNA filters. The gene symbol of each probe is listed at the left of the related Northern blot result. One of the RNA-blotted membrane photographs is shown with methylene blue-stained 28S and 18S RNA subunits demonstrating the quality and quantity of RNA loaded in individual lanes.

References

1. Phillips RL, Ernst RE, Brunk B, et al.
The genetic program of hematopoietic stem cells. Science. 2000;288:1635-1640.
2. Theilgaard-Monch K, Cowland J, Borregaard N. Profiling of gene expression in individual hematopoietic cells by global mRNA amplification and slot blot analysis. J Immunol Methods. 2001;252:175-189.
3. Skalnik DG. Transcriptional mechanisms regulating myeloid-specific genes. Gene. 2002;284:1-21.
4. Jacobsen FW, Rusten LS, Jacobsen SE. Direct synergistic effects of interleukin-7 on in vitro myelopoiesis of human CD34+ bone marrow progenitors. Blood. 1994;84:775-779.
5. Bennett CM, Kanki JP, Rhodes J, et al. Myelopoiesis in the zebrafish, Danio rerio. Blood. 2001;98:643-651.
6. Reya T, Contractor NV, Couzens MS, Wasik MA, Emerson SG, Carding SR. Abnormal myelocytic cell development in interleukin-2 (IL-2)-deficient mice: evidence for the involvement of IL-2 in myelopoiesis. Blood. 1998;91:2935-2947.
7. Sterkers Y, Preudhomme C, Lai JL, et al. Acute myeloid leukemia and myelodysplastic syndromes following essential thrombocythemia treated with hydroxyurea: high proportion of cases with 17p deletion. Blood. 1998;91:616-622.
8. Lawson ND, Krause DS, Berliner N. Normal neutrophil differentiation and secondary granule gene expression in the EML and MPRO cell lines. Exp Hematol. 1998;26:1178-1185.
9. Berliner N. Molecular biology of neutrophil differentiation. Curr Opin Hematol. 1998;5:49-53.
10. Tenen DG, Hromas R, Licht JD, Zhang DE. Transcription factors, normal myeloid development, and leukemia. Blood. 1997;90:489-519.
11. Samarkos M, Aessopos A, Fragodimitri C, et al. Neutrophil elastase in patients with homozygous beta-thalassemia and pseudoxanthoma elasticum-like syndrome. Am J Hematol. 2000;63:63-67.
12. Kogan SC, Brown DE, Shultz DB, et al. BCL-2 cooperates with promyelocytic leukemia retinoic acid receptor alpha chimeric protein (PMLRARalpha) to block neutrophil differentiation and initiate acute leukemia. J Exp Med. 2001;193:531-543.
13. Calvo KR, Knoepfler PS, Sykes DB, Pasillas MP, Kamps MP. Meis1a suppresses differentiation by G-CSF and promotes proliferation by SCF: potential mechanisms of cooperativity with Hoxa9 in myeloid leukemia. Proc Natl Acad Sci U S A. 2001;98:13120-13125.
14. Orkin SH. Transcription factors and hematopoietic development. J Biol Chem. 1995;270:4955-4958.
15. Tsai S, Collins SJ. A dominant negative retinoic acid receptor blocks neutrophil differentiation at the promyelocyte stage. Proc Natl Acad Sci U S A. 1993;90:7153-7157.
16. Drexler HG, Quentmeier H, MacLeod RA, Uphoff CC, Hu ZB. Leukemia cell lines: in vitro models for the study of acute promyelocytic leukemia. Leuk Res. 1995;19:681-691.
17. Hacia JG, Makalowski W, Edgemon K, et al. Evolutionary sequence comparisons using high-density oligonucleotide arrays. Nat Genet. 1998;18:155-158.
18. Subrahmanyam YV, Baskaran N, Newburger PE, Weissman SM. A modified method for the display of 3'-end restriction fragments of cDNAs: molecular profiling of gene expression in neutrophils. Methods Enzymol. 1999;303:272-297.
19. Anderson NL, Anderson NG. Proteome and proteomics: new technologies, new concepts, and new words. Electrophoresis. 1998;19:1853-1861.
20. Lian Z, Wang L, Yamaga S, et al. Genomic and proteomic analysis of the myeloid differentiation program. Blood. 2001;98:513-524.
21. Laemmli UK. Cleavage of structural proteins during the assembly of the head of bacteriophage T4. Nature. 1970;227:680-685.
22. Studier FW. Analysis of bacteriophage T7 early RNAs and proteins on slab gels. J Mol Biol. 1973;79:237-248.
23. Neuhoff V, Arold N, Taube D, Ehrhardt W. Improved staining of proteins in polyacrylamide gels including isoelectric focusing gels with clear background at nanogram sensitivity using Coomassie brilliant blue G-250 and R-250. Electrophoresis. 1988;9:255-262.
24. Switzer RC 3rd, Merril CR, Shifrin S. A highly sensitive silver stain for detecting proteins and peptides in polyacrylamide gels. Anal Biochem. 1979;98:231-237.
25. Gorg A, Obermaier C, Boguth G, et al. The current state of two-dimensional electrophoresis with immobilized pH gradients. Electrophoresis. 2000;21:1037-1053.
26. Henzel WJ, Billeci TM, Stults JT, Wong SC, Grimley C, Watanabe C. Identifying proteins from two-dimensional gels by molecular mass searching of peptide fragments in protein sequence databases. Proc Natl Acad Sci U S A. 1993;90:5011-5015.
27. Thiede B, Siejak F, Dimmler C, Jungblut PR, Rudel T. A two dimensional electrophoresis database of a human Jurkat T-cell line. Electrophoresis. 2000;21:2713-2720.
28. Tamayo P, Slonim D, Mesirov J, et al. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci U S A. 1999;96:2907-2912.
29. Greenbaum D, Jansen R, Gerstein M. Analysis of mRNA expression and protein abundance data: an approach for the comparison of the enrichment of features in the cellular population of proteins and transcripts. Bioinformatics. 2002;18:585-596.
30. Gygi SP, Rochon Y, Franza BR, Aebersold R. Correlation between protein and mRNA abundance in yeast. Mol Cell Biol. 1999;19:1720-1730.
31. Futcher B, Latter GI, Monardo P, McLaughlin CS, Garrels JI. A sampling of the yeast proteome. Mol Cell Biol. 1999;19:7357-7368.
32. Greenbaum D, Luscombe NM, Jansen R, Qian J, Gerstein M. Interrelating different types of genomic data, from proteome to secretome: 'oming in on function. Genome Res. 2001;11:1463-1468.
33. Anderson L, Seilhamer J. A comparison of selected mRNA and protein abundances in human liver. Electrophoresis. 1997;18:533-537.
34. Gygi SP, Rist B, Gerber SA, Turecek F, Gelb MH, Aebersold R. Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat Biotechnol. 1999;17:994-999.
35. Van Belle D, Andre B. A genomic view of yeast membrane transporters. Curr Opin Cell Biol. 2001;13:389-398.
36. Yuo A. Differentiation, apoptosis, and function of human immature and mature myeloid cells: intracellular signaling mechanism. Int J Hematol. 2001;73:438-452.
37. Nagamura-Inoue T, Tamura T, Ozato K. Transcription factors that regulate growth and differentiation of myeloid cells. Int Rev Immunol. 2001;20:83-105.
38. Kubota T, Kawano S, Chih DY, et al. Representational difference analysis using myeloid cells from C/EBP epsilon deletional mice. Blood. 2000;96:3953-3957.
39. Yamanaka R, Barlow C, Lekstrom-Himes J, et al. Impaired granulopoiesis, myelodysplasia, and early lethality in CCAAT/enhancer binding protein epsilon-deficient mice. Proc Natl Acad Sci U S A. 1997;94:13187-13192.
40. Ward AC, van Aesch YM, Schelen AM, Touw IP. Defective internalization and sustained activation of truncated granulocyte colony-stimulating factor receptor found in severe congenital neutropenia/acute myeloid leukemia. Blood. 1999;93:447-458.
41. Romero F, Martinez AC, Camonis J, Rebollo A. Aiolos transcription factor controls cell death in T cells by regulating Bcl-2 expression and its cellular localization. EMBO J. 1999;18:3419-3430.
42. Massaad-Massade L, Navarro S, Krummrei U, Reeves R, Beaune P, Barouki R. HMGA1 enhances the transcriptional activity and binding of the estrogen receptor to its responsive element. Biochemistry. 2002;41:2760-2768.
43. Webb M, Payet D, Lee KB, Travers AA, Thomas JO. Structural requirements for cooperative binding of HMG1 to DNA minicircles. J Mol Biol. 2001;309:79-88.
44. Czura CJ, Wang H, Tracey KJ. Dual roles for HMGB1: DNA binding and cytokine. J Endotoxin Res. 2001;7:315-321.
45. Yang H, Wang H, Tracey KJ. HMG-1 rediscovered as a cytokine. Shock. 2001;15:247-253.
46. Dong G, Loukinova E, Chen Z, et al. Molecular profiling of transformed and metastatic murine squamous carcinoma cells by differential display and cDNA microarray reveals altered expression of multiple genes related to growth, apoptosis, angiogenesis, and the NF-kappaB signal pathway. Cancer Res. 2001;61:4797-4808.
47. Taniguchi M, Miura K, Iwao H, Yamanaka S. Quantitative assessment of DNA microarrays: comparison with Northern blot analyses. Genomics.
2001;71:34- 39[CrossRef][Medline] [Order article via Infotrieve]. 48. Nikiforov MA, Kotenko I, Petrenko O, et al. Complementation of Myc-dependent cell proliferation by cDNA expression library screening. Oncogene. 2000;19:48284831[CrossRef][Medline] [Order article via Infotrieve]. 49. Sommer A, Hilfenhaus S, Menkel A, et al. Cell growth inhibition by the Mad/Max complex through recruitment of histone deacetylase activity. Curr Biol. 1997;7:357365[Medline] [Order article via Infotrieve]. 50. Anderson KL, Smith KA, Perkin H, et al. PU.1 and the granulocyte and macrophage colony-stimulating factor receptors play distinct roles in late-stage myeloid cell differentiation. Blood. 1999;94:2310-2318[Abstract/Free Full Text]. 51. Orkin SH. Diversification of haematopoietic stem cells to specific lineages. Nat Rev Genet. 2000;1:57-64[CrossRef][Medline] [Order article via Infotrieve]. Appendix 217 52. Sahara S, Aoto M, Eguchi Y, Imamoto N, Yoneda Y, Tsujimoto Y. Acinus is a caspase-3 activated protein required for apoptotic chromatin condensation. Nature. 1999;401:168-173[CrossRef][Medline] [Order article via Infotrieve]. 53. Weinmann AS, Yan PS, Oberley MJ, Huang TH, Farnham PJ. Isolating human transcription factor targets by coupling chromatin immunoprecipitation and CpG island microarray analysis. Genes Dev. 2002;16:235-244[Abstract/Free Full Text]. 54. Nau GJ, Richmond JF, Schlesinger A, Jennings EG, Lander ES, Young RA. Human macrophage activation programs induced by bacterial pathogens. Proc Natl Acad Sci U S A. 2002;99:1503-1508[Abstract/Free Full Text]. 55. Horak CE, Mahajan MC, Luscombe NM, Gerstein M, Weissman SM, Snyder M. GATA-1 binding sites mapped in the beta-globin locus by using mammalian chIp-chip analysis. Proc Natl Acad Sci U S A. 2002;99:2924-2929[Abstract/Free Full Text]. Appendix 218 Appendix This section contains the genes described in this paper, including figures, tables, and text. 
AA589396: dendritic cell protein; Abpa: androgen-binding protein: subunit alpha; Actb: put. beta-actin; Actg: actin, gamma, cytoplasmic; Aldh2: aldehyde dehydrogenase 2, mitochondrial; Anxa1: lipocortin I protein annexin 1; Anxa11: annexin A11; Anxa2: annexin II calpactin I heavy chain; Anxa3: annexin A3; Arhgdib: RHO GDP-dissociation inhibitor 2 (RHO GDI2); Arhn: rho7; Arpc3: actin-related protein 2/3 complex, subunit 3 (21 kDa); Arp2/3 complex subunit p21-Arc; Atp5a1: ATP synthase, H+ transporting, mitochondrial F1 complex, alpha subunit, isoform 1; Atp5b: ATP synthase, H+ transporting mitochondrial F1 complex, beta subunit; C4: MHC complement component C4; Cab140: 170 kDa glucose regulated protein GRP170 precursor; Cappa1: F-actin capping protein alpha-1 subunit; Cas1: catalase 1; Cct2: chaperonin containing TCP-1 beta subunit; Cct5: chaperonin subunit 5 (epsilon); Cct6a: chaperonin subunit 6a (zeta); Cftr: cystic fibrosis transmembrane conductance regulator homolog; Coro1a: coronin, actin-binding protein 1A; Crmp1: collapsin response mediator; Ddx5: DEAD (aspartate-glutamate-alanine-aspartate) box polypeptide 5; ECP: EndoA' cytokeratin (5' end put.); putative; Eef1a1: eukaryotic translation elongation factor 1 alpha 1; Eef2: elongation factor 2; Ehd1: EH-domain containing 1, PAST, HPAST, H-PAST; Eno1: alpha enolase; Ephb2: protein-tyrosine kinase (EC 2.7.1.112) sek-3, Eph receptor A4; Es10: sid478p/Esterase 10; Fut4: fucosyltransferase 4; G4-1-pending: phosphatase subunit gene g4-1; Gapd: glyceraldehyde-3-phosphate dehydrogenase; Gc: vitamin D-binding protein precursor; Gnb2-rs1: guanine nucleotide binding protein, beta-2, related sequence 1, p205, Rack1, Gnb2l1, GB-like; Got2: glutamate oxaloacetate transaminase 2, mitochondrial; mitochondrial aspartate aminotransferase; GROEL: chaperonin groEL precursor; Grp58: glucose regulated protein, 58 kDa; endoplasmic reticulum protein; phospholipase C, alpha; Gstm1: glutathione-S-transferase, mu1;
H2-Ab1: histocompatibility 2, class II antigen A, beta 1; Hcph: PTPN6 tyrosine phosphatase, me, hcp, PTPN6, Ptp1C, SHP-1; Hgf: hepatocyte growth factor precursor; Hmg2: high mobility group protein 2; Hmgb1: high mobility group protein 1; Hnrpa1: heterogeneous nuclear ribonucleoprotein A1; Hnrpa2b1: heterogeneous nuclear ribonucleoprotein A2; heterogeneous nuclear ribonucleoprotein A2/B1; Hnrph1: heterogeneous nuclear ribonucleoprotein H1; HPD76: hypothetical protein DKFZp761C10121.1; Hsc70: dnaK-type molecular chaperone hsc73/Heat shock protein cognate 70; Hsp110: heat shock protein, 110 kDa; Hspa5: glucose-regulated protein, 78 kDa; Hspa5/Grp78: 78 kDa glucose-regulated protein precursor (GRP 78); Hspa8: dnaK-type molecular chaperone hsc70; Idh1: isocitrate dehydrogenase 1 (NADP+), soluble; Idh2: NADP+-specific isocitrate dehydrogenase; IFI205: interferon-activatable protein 205; IGVAP: Ig Vkappa, antiphenyloxazolone; Il1f5: interleukin 1 receptor antagonist homolog 1; Impdh2: inosine-5'-monophosphate dehydrogenase; KER1: keratin 1; KER8: keratin 8, type II cytoskeletal, embryonic; KER10: keratin 10, type I, cytoskeletal; KER14: keratin 8, type I cytoskeletal 14; KER47: 47 kDa keratin; KER59: keratin, 59K type I cytoskeletal; Krt2-6g: keratin, type II cytoskeletal 6; KRZF80M: Kruppel-related zinc finger protein F80-M; KT14: keratin, type I, cytoskeletal; Ldh1: lactate dehydrogenase 1, A chain; Lgals3: galectin-3; Lmnb1: lamin B1; LOC56463: p100 coactivator; Ltf: lactotransferrin precursor; Mor1: malate dehydrogenase; Nme1: nucleoside diphosphate kinase A; Nsap1-pending: syncrip; P4hb: protein disulfide-isomerase, PDI; Papss2/Atpsk2: ATP sulfurylase/APS kinase, 2: PAPS synthetase; Pbp: hippocampal cholinergic neurostimulating peptide precursor protein, phosphatidylethanolamine-binding protein; Pcna: proliferating cell nuclear antigen; Pdi4: peptidyl arginine deiminase, type IV; PAD type IV; Pgk1: phosphoglycerate kinase 1; Phb: prohibitin; Pk3: pyruvate kinase 3;
Ppia: peptidylprolyl isomerase A; cyclophilin A; Prdx1: proliferation-associated gene A, osteoblast specific factor 3; Prdx2: antioxidant protein 2; PRO2675: PRO2675; Psma7: proteasome (prosome, macropain) subunit, alpha type 7, proteasome subunit RC6-1; Psmc1: protease (prosome, macropain) 26S subunit, ATPase 1; Psmc2: 26S protease regulatory subunit 7, MSS1 protein; Ptpn1: protein tyrosine phosphatase; Pygm: muscle glycogen phosphorylase; Rac2: RAS-related C3 botulinum substrate 2, p21-Rac2, EN-7 protein; Rag1: recombination activating gene 1; Ran: GTP-binding nuclear protein Ran (TC4); Rbm3: RNA binding motif protein 3; Rin: RAS-like protein expressed in neurons; Rnf17: RING finger protein Mmip-2; RNPC: RNP particle component; SAC: spectrin alpha chain; Sfmbt: Scm-related gene containing 4 mbt domains; Slc23a2: solute carrier family 23, (nucleobase transporters) member 1; Sod1: putative peroxisomal antioxidant enzyme, superoxide dismutase 1; Spry1: sprouty homolog 1 (Drosophila); Stat5a: signal transducer and activator of transcription 5A; STEFIN3: stefin 3; Stip1: extendin/Stress-induced phosphoprotein 1; Taf2e: TATA box binding protein (Tbp)-associated factor, RNA polymerase II, E; Tagln2: transgelin 2; Taldo1: transaldolase; Tinag: tubulointerstitial nephritis antigen; Tpi: triosephosphate isomerase, TIM; Tpm5: tropomyosin 5, cytoskeletal type; Tuba6: tubulin alpha 6; Tubb5: tubulin, beta 5; Ube1x: ubiquitin-activating enzyme E1 X; Vav2: Vav2 oncogene; Vdac1: voltage-dependent anion-selective channel protein 1; Vim: vimentin; Zfp101: zinc finger protein 101; ZFP1A3: Aiolos/zinc finger protein, subfamily 1A, 3; Zfp354a: transcription factor 17.