Communication between Layers in Biological Transcriptional Networks by Alex Tsankov Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Science in Computer Science and Engineering at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY May 2005 c Alex Tsankov. All rights reserved. The author hereby grants to MIT permission to reproduce and distribute publicly paper and electronic copies of this thesis document in whole or in part. Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Department of Electrical Engineering and Computer Science May 19, 2005 Certified by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Moe Win Associate Professor, Massachusetts Institute of Technology Thesis Supervisor Certified by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pamela Silver Professor, Harvard Medical School Thesis Supervisor Accepted by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Arthur C. Smith Chairman, Department Committee on Graduate Students 2 Communication between Layers in Biological Transcriptional Networks by Alex Tsankov Submitted to the Department of Electrical Engineering and Computer Science on May 19, 2005, in partial fulfillment of the requirements for the degree of Master of Science in Computer Science and Engineering Abstract Chromatin-immunoprecipitation experiments in combination with microarrays (known as ChIP-chip) have recently allowed biologists to map where proteins bind in the yeast genome. The combinatorial binding of different proteins at or near a gene controls the transcription (copying) of a gene and the production of the functional RNA or protein that the gene encodes. Therefore, ChIP-chip data provides powerful insight on how genes and gene products (i.e., proteins, RNA) interact and regulate one another in the underlying network of the cell. Much of the current work in modeling yeast transcriptional networks focuses on the regulatory effect of a class of proteins known as transcription factors (TF). However, other sets of factors also influence transcription, including histone modifications and states (HS), histone modifiers (HM) and remodelers, nuclear processing (NP), and nuclear transport (NT) proteins. In order to gain a holistic understanding of the non-linear process of transcription, our work examines the communication between all five forementioned classes (or layers) of regulators. We use vastly available rich-media ChIP-chip data for various proteins within the five classes to model a multi-layered transcriptional network of the yeast species Saccharomyces cerevisiae. Following the introduction in Chapter 1, Chapter 2 describes the non-trivial process of incorporating the different sources of data into a coherent set and normalizing the heterogeneous data to improve biological accuracy. Using the normalized data, Chapter 3 finds biologically meaningful pairwise statistics between proteins, including filtered correlation coefficient, and mutual information p-values. It then combines the p-values of the two complementary approaches in order to increase the reliability of our predictions. Chapter 4 uncovers group-wise relationships between proteins using a novel semi-supervised clustering algorithm that preserves information about elements of a cluster in order to better capture group-wise dependencies. Throughout the theoretical analysis, we confirm various known biological processes and uncover several novel hypotheses. Based on the developed methodology, Chapter 5 builds a multi-layered transcriptional network and quantifies the communication between levels in biological transcriptional networks. 3 Thesis Supervisor: Moe Win Title: Associate Professor, Massachusetts Institute of Technology Thesis Supervisor: Pamela Silver Title: Professor, Harvard Medical School 4 Acknowledgments The fruition of this project would not have been possible without the cumulative contributions of many people. I first want to express my gratitude for my advisor, Moe Win. You are the most thoughtful professor I have ever me, and a boss I can truly call a friend. Thank you for your trust and loyal support during both happy times and tough moments at MIT. Thank you for being open-minded about my research direction and for your encouragements as I first started learning biology. I also want to acknowledge Ae Suwansantisuk and Hyundong Shin for their valuable contributions in proofreading the final document. In addition, I want to thank Professor Jaakola, Sourav Dey, and Obrad Scepanovic for their insights during Machine Learning class. This project required the bridging of two disciplines and would not have even taken place without all the help form Pam Silver’s lab at Harvard Medical School (HMS). Thank you Pam for bringing me into the family, granting me computing resources, and for taking a chance on someone who knew nearly nothing about biology nine months ago. Mike, thank you for your patience and experimental validations, without which the project would be incomplete. And finally, I want to especially thank Jason Casolari, who’s creative insight inspired this undertaking. Thank you for always being honest with me, even during times when everything was foreign to me. Thank you for being my mentor in this new field, for explaining what YPD meant every Sunday afternooon, and for guiding me with your ingenious insight and biological intuition–I will never forget it. In the end, I want to express my eternal gratitude towards my family. Thank you Dona, for reaching out to me as your younger brother and for understanding me and accepting me with all my flaws. Fetzi, thank you for becoming my best friend at a time when I had no other friends. And Sparti, thank you for your unconditional love and care from the day I was born. You are all a large part of who I am. 5 6 Contents 1 Introduction 13 1.1 Regulation of Transcription . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.2 ChIP-chip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2 Data Preparation 19 2.1 Data Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.2 Data Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.2.1 Finding Missing P -values . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.2.2 Evaluating Data Normalization: Correlation Analysis . . . . . . . . . 27 3 Pairwise Statistics 3.1 3.2 3.3 3.4 31 Filtered Correlation Coefficient . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.1.1 Filtered Correlation Coefficient P -values . . . . . . . . . . . . . . . . 34 Mutual Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.2.1 Mutual Information P -values . . . . . . . . . . . . . . . . . . . . . . 39 3.2.2 Evaluating Data Classification: Mutual Information Analysis . . . . . 41 Combining P -values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.3.1 Biological Predictions using Combined P -values . . . . . . . . . . . . 45 Minimized Mutual Information P -value . . . . . . . . . . . . . . . . . . . . . 49 4 Group-wise Relationships: PCA and Clustering 4.1 Principal Components Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 7 51 51 4.2 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.2.1 K-means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.2.2 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.2.3 Semi-Supervised Clustering . . . . . . . . . . . . . . . . . . . . . . . 57 4.2.4 Biological Validation of Semi-Supervised Clustering . . . . . . . . . . 59 4.2.5 Active Processes and SIR2 . . . . . . . . . . . . . . . . . . . . . . . . 62 5 Network of the Nucleus 5.1 5.2 67 Network Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 5.1.1 Assigning Links: Second Pass . . . . . . . . . . . . . . . . . . . . . . 71 5.1.2 Network Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 Analysis of Network Topology . . . . . . . . . . . . . . . . . . . . . . . . . . 75 6 Conclusion 81 A Semi-Supervised Clustering based on KL-divergence 83 8 List of Figures 1-1 Layers in the yeast transcriptional network. . . . . . . . . . . . . . . . . . . 15 2-1 A histogram of the binding profile of protein Polymerase III. . . . . . . . . . 24 2-2 Correlation coefficient distance matrix using the BR, N &P R, P V , and SD data representations of a test set of proteins. . . . . . . . . . . . . . . . . . . 28 3-1 Venn diagram for the subsets of genes bound by proteins POL3 and TBP. . . 39 3-2 Filtered correlation coefficient, mutual information, and combined p-values for a test set of proteins. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4-1 Casolari et al. model for nuclear transport and validation using Principal Component Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4-2 Potential pitfall in using hierarchical clustering. . . . . . . . . . . . . . . . . 57 4-3 Confirmation of known biological processes and new insight using semi-supervised clustering. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 4-4 Active biological processes and new insight using semi-supervised clustering. 65 5-1 Multi-layered transcriptional network of highly significant interactions. . . . 73 A-1 Semi-supervised clustering of all nuclear transport and nuclear processing factors binding profiles using the KL distance metric. . . . . . . . . . . . . . . . 9 86 10 List of Tables 2.1 Total correlation coefficient distance at likely and unlikely interactions using different data representations. . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1 Mutual information evaluation at probable and unlikely interactions using different data representations. . . . . . . . . . . . . . . . . . . . . . . . . . . 5.1 74 Total links and percentage of links realized within and between the five layers of the entire and the positive edge transcriptional network. . . . . . . . . . . 5.3 42 Comparison between various protein-protein interaction data sets and our network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2 29 76 Network topology statistics for the entire and the positive edge transcriptional network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 78 12 Chapter 1 Introduction Intricate communication between genes regulate a complicated network within and between millions of cells that make our body function. The recent emergence of novel, data-rich experimental techniques that can simultaneously monitor the behavior of all genes within a cell has enabled computational biologists to study gene networks quantitatively. Understanding the process of transcription, or how genes are turned “on” and copied into more functional forms, is central to modeling gene regulatory networks. 1.1 Regulation of Transcription Ordered sequences of deoxyribonucleic acid (DNA) base pairs, protected within the nucleus of each cell, encode the unique hereditary information of each individual. Genes are segments of DNA located on different chromosomes (large bundles of continuous DNA) that encode a unique product with a specific cellular function. During transcription, a gene is copied (or expressed) to a more mobile strand of information, known as RNA. For most genes, their RNA products are mere messengers (or mRNA) of the DNA code. The mRNA transports the DNA’s information outside of the nucleus where it is translated into proteins, another even more functional string composed of amino acid molecules. The final product of some genes, however, is the RNA itself. Such strands execute specific tasks within the body in the 13 same manner as proteins. The repertoire of RNAs and proteins expressed from “on” genes in each cell allows for the myriad of functions necessary for cell survival and, on a larger scale, for the many different cell types found in complex organisms. The transcription of genes is regulated by various RNAs and proteins, or products of other genes. Hence, genes regulate each other’s activity through their respective products. Several classes of proteins or factors, which we refer to as layers or levels, collaborate to regulate the process of transcription. A set of proteins, known as transcription factors (TF), can enhance or suppress how actively a gene g works to produce its corresponding protein by binding to the DNA preceding g, or the promoter of g. Nuclear transport factors (NTs) represent another layer of proteins that affects transcription by controling the flux of molecules (such as TFs, mRNAs, etc) going in and out of the nucleus. Nuclear processing factors (NPs) also control transcription by executing functions within the nucleus, such as the actual copying of DNA to RNA or post-transcriptional processing of mRNAs. Chromatin, a macromolecular complex consisting of DNA and proteins that is found in eukaryotes (nucleus-containing organisms), also has profound affects on transcription. Chromatin is organized into packed, inaccessible and unpacked, accessible regions of DNA by structural proteins such as histones; therefore, histones have the potential to alter the expression state of genes. A fourth class of proteins that regulate transcription consists of histone modifiers (HMs), which change the accessibility of a gene’s DNA by adding or removing acetyl, methyl, or other molecular groups on specific histone amino acids sites. We also include nucleosome remodelers in this category, or protein complexes that displace and reposition macromolecules of eight histones called nucleosomes. And finally, the actual histone states (HSs) in terms of the acetylation and methylation levels on specific histone amino acids, form yet another layer of control in transcription. For example, some proteins bind to specific patterns of histone acetylation levels in order to regulate the nearby DNA. The protein classes described above each contribute to the expression of a gene in unique and situation-specific manners. Indeed, large research fields are devoted to the study of each of the described gene/protein classes and how particular stimuli are able to alter the 14 transcriptional output of a gene. In order to better understand the non-linear process of transcription, it is necessary to examine the genome-wide interplay of all of these transcriptional regulators both within particular fields, such as how modification of a histone at one amino acid affects modification at other amino acids, and between fields of study, such as the relationship between histone modification and recruitment of transcription factors to particular genes. The budding yeast Saccharomyces cerevisiae offers a unique opportunity to achieve this essential next step in understanding transcriptional control as it has a fully sequenced and annotated genome as well as an extraordinary body of literature regarding functions for many of these genes and their resulting RNAs or proteins. Figure 1 summarizes the different layers of organization known to regulate the transcriptional network in Saccharomyces cerevisiae which we will model. In order to bettter understand the non-linear process of transcription, we need information about where different classes of proteins bind to in the genome. Figure 1-1: The five layers in the yeast transcriptional network that we aim to model. Twosided arrows represent undirected functional connections between components within a level as well as interplay of components between layers of organization. 15 1.2 ChIP-chip A biochemical technique known as chromatin immunoprecipitation (ChIP) allows for the chemical tethering of particular proteins to chromatin. The recent advent of ChIP in combination with microarrays (known as ChIP-chip or genomic location analysis) allows biologists to adapt this technique to a genome-wide scale, identifying the DNA binding sites for particular proteins across the entire genome [10]. Each ChIP-chip experiment begins by harvesting a yeast strain in a carefully prepared solution simulating a specific environmental condition. Most ChIP-chip experiments use the standard YPD (Yeast extract Peptone Dextrose) solution containing the sugar dextrose, in order to simulate rich media conditions where the yeast population grows exponentially with time. Next, addition of formaldehyde crosslinks or “fixes” any proteins bound to DNA inside the normal growing cell, or in vivo. The inner contents of the cell are then extracted and the strands of DNA along with the bound proteins are sheared into fragments by sonication. Using an anti-body that binds specifically to the protein i of interest, one can isolate all DNA fragments bound to the designated protein i by precipitation (called immunoprecipitation). The DNA-(protein i) crosslink is then reversed in order to release the DNA and digest the protein. The free DNA, previously bound by protein i, is then amplified by Polymerase Chain Reaction (PCR) and labeled with a fluorescent dye (Cy5). In parallel, a sample of sheared DNA is taken as a control and is not enriched for binding with protein i using immunoprecipitation. This sample is also amplified using PCR and labeled using a different fluorescent dye (Cy3). Ultimately, the ratio of Cy5 and Cy3 fluorescence intensities at each gene will allow one to infer protein i’s predilection for binding to g. Microarray technology when coupled with ChIP enables biologists to simultaneously measure the binding preference of protein i at all genes in the genome relative to other genes in just one experiment. One first needs to prepare a microarray matrix of wells, where each well g contains a probe specific to gene g (for example the complementary DNA of gene g) . Next, each microarray well is filled with a sample from both the pool of sheared DNA enriched (Cy5) and unenriched (Cy3) for protein i. Inside each well g, the differently dyed fragments 16 originally sheared from gene g will competitively bind to the complementary DNA specific to g during the process of hybridization. The array of wells can then be scanned using a laser that detects the intensities of the fluorescent red (Cy5) and green (Cy3) dye [39]. The ratio of intensities (Cy5/Cy3), also known as a binding ratio, is proportional to the ratio of the number of gene g fragments bound by protein i to the number of g fragments randomly resulting from the same process without immunoprecipitation. Each ChIP-chip experiment is often repeated three or more times in order to reduce the inherent experimental noise and the resulting binding ratios from each experiment are usually combined using weighted averaging. ChIP-chip provides powerful, in vivo measurements of how genes and gene products (i.e. proteins, RNA) interact and regulate one another in the complex underlying network of a cell. Much of the current work in modeling yeast networks focuses on the regulatory effect of TFs [2]. However, recent publications [4,6,7,13–20] show that other sets of proteins (including HMs, HSs, NPs, and NTs) also control gene expression. Hence, transcription has several levels of organization. The next four chapters use ChIP-chip data to infer the underlying communication between different layers in the transcriptional network of the yeast species Saccharomyces cerevisiae. Following the introduction in Chapter 1, four chapters build up the theory behind our transcriptional network model. Chapter 2 describes the preliminary data preparation on which we will base all future analysis. It first describes the integration of the different ChIP-chip data sets into one coherent set and then explores the crucial issue of binding data normalization and representation. Using a normalized data set, Chapter 3 finds biologically meaningful pairwise statistics between binding profiles of proteins, including filtered correlation coefficient, mutual information, and combined p-values. The next chapter uncovers group-wise binding relationships between proteins using Principal Component Analysis (PCA) and clustering. Based on the measures developed in Chapter 3, we introduce a novel semi-supervised clustering algorithm that preserves information about elements of a cluster in order to better capture group-wise dependencies between proteins. Throughout the 17 theoretical analysis, we confirm various known biological processes and predict several novel hypotheses. And finally, Chapter 5 combines the methodology developed in the previous chapters in order to build a multi-layered transcriptional network of the nucleus. To the best of our knowledge, our finalized network is the first attempt in the literature to quantify the communication between layers in biological transcriptional networks. 18 Chapter 2 Data Preparation This chapter explores the extremely important issue of data preparation. Our data comes from several different sources and in various forms. The first section explains the nontrivial procedure for integrating all the different sources of data into one coherent set. The second part of the chapter attempts to reconcile the discrepancies between ChIP-chip experiments done in different labs by normalizing the data. 2.1 Data Integration Integrating the publicly available data sets proved to be a tedious but non-trivial part of the project. We downloaded published genome-wide binding data for several sets of proteins that may be involved in regulation of genes, including T F s [1, 17, 21], N T s [13, 14], N P s [8, 14, 16, 23, 24, 28], HM s [4, 6, 7, 15, 20, 21, 25], and HSs [17, 18, 26, 27]. All of the ChIPchip experiments were performed on yeast strains grown in rich media conditions, where the sugar dextrose and other nutrients are readily present so that the yeast can quickly multiply. Since some genes only become active during specific environmental conditions, we would ideally want ChIP-chip binding data for yeast grown in various media. However, from our experience and as evidenced in [7], binding relationships in biological processes that are not turned “on” during rich media conditions can often still be detected using our ChIP-chip 19 data. For example, our analysis in Section 4.2.4 captures the collaborative effect of GAL3 and GAL80, two transcription factors that regulate genes involved in breakdown of the sugar galactose, using ChIP-chip data for yeast grown in dextrose-rich, YPD medium. In addition, the authors in [7] found that the RSC nucleosome remodeling complex associates with many genes involved in nitrogen regulation and non-fermentative carbohydrate metabolism using YPD ChIP-chip data. In order to explain the relationship fully, more experiments under different conditions were necessary, but the initial prediction came from looking at rich media ChIP-chip data. Thus, we felt that usage of the vastly more complete YPD datasets would still allow for the formation of hypotheses reflecting other growth conditions. The data sets further differed in the microarray probes used to detect binding relationships, where some experiments used probes that target the open reading frames (ORF), or regions of genes that code for RNA, while other experiments used probes for intergenic regions, or the promoter or control regions of genes. We used a gene-centric approach, where each relevant probe was assigned to its corresponding gene(s). For ORF arrays, we simply assigned the ChIP-chip information at each ORF to the corresponding gene. For intergenic arrays, we assigned each DNA probe (or fragment) to the gene that it most likely regulates. Biologists commonly refer to the DNA region preceding the site where transcription starts as the upstream intergenic region of a gene, and the region following the site where transcription ends as the downstream intergenic region of a gene. In yeast, each intergenic region can control zero genes (e.g., telomeres at the end of chromosomes, intergenic regions at the downstream end of two genes), a single gene, or two genes (e.g., intergenic regions at the upstream end of two genes). Moreover, there may exist several probes for a single intergenic region. For example, small genes that encode tRNAs (a class of RNAs that have a role in protein production) are often contained within the intergenic region upstream of a longer gene; therefore, tRNA gene probes also measure the binding of factors that regulate the longer gene. To further complicate matters, for some genes it is still not known which intergenic region controls them. Hence, we needed to develop a many-to-many mapping from intergenic fragments to the genes they might control. The 20 mapping algorithm uses the union of intergenic probe-gene assignment pairs as defined by several authors [1, 6, 8, 9, 15, 19, 21, 23, 24, 27], where we included assignment pairs from at least one author from each different lab when it was available. Moreover, when two or more intergenic fragments mapped to the same gene, the probe that contains the most amount of information was chosen. Since ChIP-chip experiments contain more information at the tails of the binding distribution, we chose the most bound fragment for multiple probes that were consistently bound and the least bound fragment for multiple probes that were consistently not bound. For each experiment, we obtained binding ratio (BR) data or the combination of BR and p-value (PV) data. We first mapped the binding ratios and p-values to all annotated genes for which data was available using the assignment algorithm described above. We then integrated the data into a single BR matrix and a single P V matrix, where rows represent factors, columns represent unique genes and entries represent the data. For example, the (i, g)th entry of the matrix P V , P Vi,g , represents the p-value of factor i binding to gene g. Missing data was annotated with NaNs. As previously explained, a binding ratio for protein i and gene g represents a weighted average of ratios of the number of immunoprecipitated DNA fragments from g enriched with protein i to the number of control (unenriched) DNA fragments from g that occur at random. The p-value for the binding ratio of protein i at gene g measures the probability of erroneously deciding that protein i binds to g when the null hypothesis (i does not bind to g) is in fact true. Hence, small p-values correspond to large binding ratios. The error model used for calculating the p-values varies as chosen by the authors of each paper. Since we are integrating binding data from various papers that use different ChIP-chip experimental protocols and different error models, we needed to derive a normalized representation of the binding data in order to compare binding profiles across papers. 21 2.2 Data Normalization To normalize the binding data, we first sorted the protein-gene binding interactions for each protein (i.e., row-wise in our matrices) from most bound to least bound, as defined by the authors of each paper. We then mapped the sorted binding data to uniformly spaced discrete points in the interval [0, 1], where 1 and 0 correspond to the most and least bound genes for each protein, respectively. Hence we transformed the information in the BR and P V matrices to a percentile rank P R matrix of the same dimensions. For example, P Ri,g = 0.95 means that gene g is in the top 5% of genes most bound by protein i. Moreover, since each protein binds to a varying number of genes, each author usually defines a strength of binding threshold, above which all protein-gene interactions are classified as bound. We further normalized the P R matrix by subtracting the strength of binding threshold from each entry, making all interactions classified as bound positive and all interactions classified as unbound negative. We called this matrix the normalized N data matrix. By thresholding N at 0, we also easily derived a bound/unbound B data representation matrix, where all bound interactions were mapped to 1 and all unbound interactions to 0. Each data representation has inherent advantages and disadvantages that may prove more biologically useful or less informative depending on the nature of the analysis. The bound/unbound B representation of the data seems most natural for representing the presence or absence of a biological interaction. However, ChIP-chip data contains significant experimental noise that varies from lab to lab and protein to protein; therefore, the classified bound/unbound data contains a large number of false positives and false negatives. The data set from [1] estimates the false positive rate to be around 4-6% and the false negative rate around 24%, for a binding threshold at p-value = 0.001. The percentile rank P R matrix removes discrepancies between the different averaging techniques used to derive binding ratios BR and the different error models used to calculate p-values P V . The P R matrix measures the pairwise binding strengths of gene-protein interactions consistently across data from different papers; however, it contains no information about which percentile determines significant binding for a given factor and the number of gene targets vary greatly for 22 different proteins. The normalized N matrix improves the P R matrix by subtracting the percentile threshold that each author used to classify interactions as bound and unbound. This achieves a soft, unclassified representation than the B matrix, where the more negative or positive an entry in the N matrix is, the more confident one can be that the gene-protein interaction is unbound or bound, respectively. While the P R and the N matrices measure binding strength relative to other interactions, they fail to capture the absolute strength of binding. To remedy this last shortcoming, we derive the SD matrix in the next section by finding the missing p-values in our data sets. 2.2.1 Finding Missing P -values We consider the P V matrix as the most reliable source of information about making decisions on the presence or absence of a binding interaction. Further, nearly all of data sets with pvalues base their calculation on adaptations of the single array error model [11]. Specifically, the data sets from [13, 14, 16] use a two-sided model, which can easily be converted to the equivalent right-sided single array model by the following conversion: 1 BRi,g > 1 : P Vi,g|1−sided = P Vi,g|2−sided 2 1 BRi,g ≤ 1 : P Vi,g|1−sided = 1 − P Vi,g|2−sided . 2 (2.1) (2.2) Reconciling these discrepancies provides reliable, consistent p-values for two thirds of the proteins in the data. We need to find the missing p-values for the other third based on the available BR information in order to obtain a complete P V matrix. Figure 2-1 shows the binding disribution of protein Polymerase III (POL3) across the natural logarithm of its binding ratios. The figure reveals that the distribution of binding ratios consists of two heterogeneous classes: i) binding ratios measured at genes that are not bound by POL3 and ii) binding ratios at genes that are bound by POL3, congregated at the right tail of the distribution. We need to model the distribution of the unbound genes in order to find 23 appropriate p-values. Figure 2-1: A histogram of the binding profile of protein Polymerase III: the number of genes bound by Pol III versus the natural logarithm of the binding ratios. Mathematically, let Xi,g denote random variables describing the natural logarithm of the binding ratios of protein i (POL3 in our example) at genes g and lets assume that the binding of i at each gene g, or Xi,g , is independent and identically distributed (i.i.d.). Since there is inherent noise in ChIP-chip experiments, it makes sense to use random variables to model our data. Following the convention of using capital letters for random variables and lowercase letters for their realizations, we use xi,g to denote the measured outcomes of random variable Xi,g at each gene g. Since random variables Xi,g are identically distributed at all genes g, let Xi denote the underlying binding tendency of protein i that we want to describe. Therefore, Figure 2-1 shows a representative, empirical example of the estimated distribution of Xi , or P̂Xi (xi ). In order to extract p-values, we want to estimate the distribution of the natural logarithm of binding ratios at all genes g under the null hypothesis H0 that protein i does not bind to g, or the noise distribution P̂Xi |H0 (xi |H0 ). ChIP-chip experiments introduce various, independent sources of multiplicative noise, which corresponds to several independent sources of additive noise in the logarithm domain. Therefore, the Central Limit Theorem makes it 24 reasonable to assume that the distribution of the logarithm of binding ratios for the class of unbound genes is Gaussian [39]. Since logarithm of binding ratios in our data result from weighted averages of logarithm of binding ratios from single independent ChIP-chip experiments and since linear combinations of independent Gaussian random variables is also Gaussian, we can reasonably model PXi |H0 (xi |H0 ) using a Gaussian distribution. The Gaussian noise distribution PXi |H0 (xi |H0 ) is uniquely defined by its mean and variance. We use the peak or mode of the overall distribution P̂Xi (xi ) (e.g., the peak at log(BR) = 0 in Figure 2-1) to estimate the mean of the noise distribution, µ̂Xi |H0 . This follows from the fact that the alternative hypothesis that protein i binds to gene g, H1 , is is significantly less likely than the null hypothesis (on average, about 4% of gene targets are classified as bound, or Pr(H1 ) ≈ 0.04 and Pr(H0 ) ≈ 0.96). Since the overall distribution decomposes as follows, PXi (xi ) = Pr(H0 )PXi |H0 (xi |H0 ) + Pr(H1 )PXi |H1 (xi |H1 ) (2.3) ≈ 0.96PXi |H0 (xi |H0 ) + 0.04PXi |H1 (xi |H1 ) , the peak of the unbound distribution should roughly correspond to the observed mode of the overall distribution P̂Xi (xi ). Moreover, since the noise distribution is Gaussian, the mostly unaffected peak of PXi |H0 (xi |H0 ) corresponds to the mean µ̂Xi |H0 . Hence, we can write µ̂Xi |H0 = modeg∈Gi (xi,g ) , (2.4) where Gi is the set of all genes with measured binding information for protein i. In order to estimate the variance of the noise distribution, σˆ2 Xi |H0 , we consider genes with a binding ratio smaller than µ̂Xi |H0 to be extremely unlikely targets of protein i. Hence, we use Ui to denote the set of genes to the left of the mode of P̂Xi (xi ), or the genes the make up the left side of the unbound distribution (the distribution to the left of the peak at log(BR) = 0 in Figure 2-1). Due to the symmetry of Gaussian distributions, we estimate the variance 25 of PXi |H0 (xi |H0 ) only using the left side of the noise distribution. The maximum likelihood estimator for variance of Gaussian random variables states that 1 X σˆ2 Xi |H0 = (xi,g − µ̂Xi |H0 )2 , |Ui | g∈U (2.5) i where |Ui | denotes the number of elements in set Ui , or the cardinality of Ui . The p-value P Vi,g corresponds to the probability that just as extreme or more extreme of an observation could occur if we assume the null hypothesis H0 that the factor i does not bind to gene g. To calculate the missing p-value for observation xi,g , or P Vi,g , we integrate our estimated noise distribution over the interval [xi,g , ∞): Z ∞ Z ∞ P̂Xi |H0 (xi |H0 )dxi = P Vi,g = xi,g xi,g 1 q exp[ 2π σˆ2 Xi −(xi − µ̂Xi )2 ]dxi . 2σˆ2 X (2.6) i After obtaining a complete P V matrix, we wanted to normalize the P V matrix so that we can assume a nearly Gaussian binding distribution for entries across rows for in the following analysis. Using the inverse of the Gaussian Q-function, we obtained the normalized SD matrix, where each entry SDi,g represents the number of standard deviations of confidence in rejecting the null hypothesis that protein i binds to gene g. Mathematically, ∞ −w2 1 √ exp[ ]dw 2 2π z = Q−1 (P Vi,g ) . Z Q(z) = (2.7) SDi,g (2.8) Since each row in the SD matrix approximates a Gaussian distribution with zero mean and unit variance, the SD matrix, just like the N matrix, fixes the problem in normalizing the data set so that binding levels can be compared across different experiments. Moreover, since the entries in the SD matrix represent standard deviation of confidence in rejecting the null hypothesis, the SD matrix preserves the absolute strength of binding of gene-protein interactions in a continuous manner, which the N matrix fails to capture. In the following 26 section we explore the advantages of the SD data matrix in relation to correlation analysis. 2.2.2 Evaluating Data Normalization: Correlation Analysis We use the measure of correlation coefficient distance to gauge the performance of the various data representations in relation to correlation analysis. We define correlation coefficient distance as one minus the correlation coefficient; therefore, a distance of 0 represents full positive linear dependence and a distance of 1 denotes no linear relationship [?]. Figure 2-2 illustrates the concepts from the previous paragraph in Section 2.2.1. It shows four correlation coefficient distance matrices using the BR, N &P R, P V , and SD data representations of a test set of proteins. Note that since the N and P R rows are simply shifted versions of one another and since correlation analysis is invariant to scales and shifts, the two matrices produce an identical correlation coefficient distance matrix, which we denote as resulting from N &P R. Two known biological processes are shown: RSC2, RSC3, RSC8, and STH1 are all components of the RSC nucleosome remodeling complex [7] and MLP1, MLP2, NIC96, and KAP95 all play role in nuclear transport [13]. We would expect the members of each biological processes to share similar binding profiles. This is shown by small correlation distance values in the two squares along the diagonal. The BR matrix clearly performs worse than the P V matrix in confirming the known relationships, corroborating the fact that enrichment ratios contain less reliable information than p-values. The N and P R matrices are ranked version of the P V matrix, so the three, naturally, share very similar results. Since correlation coefficient analysis assumes Gaussian binding profiles, the SD matrix improves on the P V matrix even though (2.8) shows that the two are mere transformations of each other. Moreover, the SD matrix outperforms the N matrix not only because its row vector entries approximate a normal distribution, but also because it preserves information about the absolute strength of binding. The performance of the different data representation in correlation analysis also extends to the whole data. We generated a test set of 1447 probable interactions, most of which were confirmed or suggested by published literature. Therefore, the known relationships discussed 27 (a) BR correlation coefficient distance. (b) N &P R correlation coefficient distance. (c) P V correlation coefficient distance. (d) SD correlation coefficient distance. Figure 2-2: Correlation coefficient distance matrix using the (a) BR, (b) N &P R, (c) P V , and (d) SD data representations of a test set of proteins. We define correlation coefficient distance as one minus the correlation coefficient; therefore, a distance of 0 represents full positive linear dependence and a distance of 1 denotes no linear relationship [?]. Protein names are listed on the x and y axes with an “o” or i at the end of the names signifying that the data came from an ORF or intergenic array, respectively. Two known biological processes are shown, namely RSC2-RSC3-RSC8-STH1 [7] and MLP1-MLP2-NIC96-KAP95 [13]. The SD matrix performs the best in confirming the known relationships followed closely by the P V and N &P R matrices, while the BR matrix clearly performs the worst. 28 Matrix Total Distance at Probable Interactions BR 978.5 log(BR) 974.3 N &P R 945.8 PV 966.3 SD 879.7 Total Distance at Unlikely Interactions 5496.9 5415.7 5341.4 5763.9 5374.5 Table 2.1: Total correlation coefficient distance for a test set of likely and unlikely interactions. A smaller total distance at known interactions should correspond to a smaller false negative rate for a given data representation, while a large total distance at unlikely interactions should lead to a smaller false positive rate. The SD matrix performs best in normalizing the data since it has the lowest total distance at known interactions and a comparable total distance to all the other matrices at unlikely interactions. The P V , N , and P R matrices do slightly worse but still better than both the log(BR), and BR data representations. in Figure 2-2 represent a subset of this test set. We also created a test set of 5926 extremely unlikely interactions. Table 2.1 shows the computed total correlation coefficient distance for a test set of likely and unlikely interactions. A smaller total distance at known interactions should correspond to a smaller false negative rate for a given data representation, while a large total distance at unlikely interactions should lead to a smaller false positive rate. The SD matrix again performs best in normalizing the data since it has the lowest total distance at known interactions and a comparable total distance to all the other matrices at unlikely interactions. The P V , N , and P R matrices do slightly worse but still better than both the log(BR), and BR data representations. Note that the log(BR) data representation has a near Gaussian distribution for entries across rows. The fact that it slightly outperforms the BR matrix shows that the Gaussian data distribution accounts for some but not all of the improvement in normalizing the data. These results substantiate our decision to use the SD matrix for the correlation analysis that follows. The following chapter utilizes the different data representations in building a statistical method for detecting pairwise relationships between proteins. 29 30 Chapter 3 Pairwise Statistics With an integrated and normalized data set, this chapter tries to find pairwise binding relationships between two proteins. We introduce two biologically meaningful pairwise measures of binding dependencies: filtered correlation coefficient and mutual information. We also find consistent methods for evaluating the p-values of the two analyses. In the end, we combine the p-values from the two complementary pairwise approaches in order to reduce the number of false positive and false negative biological predictions and increase the reliability of our overall analysis. 3.1 Filtered Correlation Coefficient This section introduces the pairwise measure of filtered correlation coefficient, an extension of the standard correlation analysis commonly encountered in biology. This technique extracts the binding relationships between two factors with great accuracy, by isolating the analysis on the pertinent dimensions in ChIP-chip data. To estimate the filtered correlation coefficient between two proteins i and j, we use the SD matrix (as justified in Section 2.2.2). As in Section 2.2.1, we consider binding of factors i and j at genes g, denoted as Xi,g and Xj,g , as i.i.d. random variables with measured outcomes xi,g and xj,g , respectively. Moreover, we again denote the underlying binding tendency of the two factors i and j as random variables 31 Xi and Xj . Since the rows of the SD data representation closely resemble samples from a Gaussian distribution, we assume that Xi and Xj are jointly Gaussian. For factors i and j, let Gi and Gj represent the set of all genes for which we have binding information, respectively. Moreover, we use Fi and Fj to denote the filtered sets of genes bound by proteins i and j, respectively, and Fi,j = Fi ∪ Fj to represent the overall filtered set of genes, or the set of genes classified as bound by at least one of the two factors. Using the SD data matrix, the following equations find the means, filtered variances and filtered covariance of the binding tendencies of two proteins using maximum-likelihood (ML) estimators for jointly Gaussian random variables as shown below: µ̂Xi = µ̂Xj = σˆ2 Xi = σˆ2 Xj = σ̂Xi ,Xj = 1 X xi,g |Gi | g∈G i 1 X xj,g |Gj | g∈G j 1 X (xi,g − µ̂Xi )2 |Fi,j | g∈F i,j X 1 (xj,g − µ̂Xj )2 |Fi,j | g∈F i,j X 1 (xi,g − µ̂Xi )(xj,g − µ̂Xj ) . |Fi,j | g∈F (3.1) (3.2) (3.3) (3.4) (3.5) i,j Note that the estimates of the means, µ̂Xi and µ̂Xj , consider all genes for which we have binding information, while the estimates of the variances, σˆ2 Xi and σˆ2 Xj , and covariance, σ̂Xi ,Xj , consider only the genes classified as bound by at least one of the two factors. We call these estimates the filtered variances and filtered covariance of binding profiles Xi and Xj . We again use the ML estimator for jointly Gaussian random variables to estimate the filtered correlation coefficient between binding profiles Xi and Xj , or ρ̂Xi ,Xj , as follows: 32 ρ̂Xi ,Xj = q σ̂Xi ,Xj . (3.6) σˆ2 Xi σˆ2 Xj The difference between the filtered correlation coefficient and the standard correlation coefficient is that the estimates of the filtered variances and covariance of binding profiles Xi and Xj consider just a filtered subset of genes. As we would expect, the filtered correlation coefficient still retains some of the important properties of the correlation coefficient. For example, 0 ≤ |ρ̂Xi ,Xj | ≤ 1, with ρ̂Xi ,Xj = 0 if and only if two data vectors have no linear dependence (uncorrelated) and ρ̂Xi ,Xj = 1 if and only if one data vector is a shifted and scaled version of the other. This directly follows from Schwartz’s inequality, which states that 1 X | (xi,g − µ̂Xi )(xj,g − µ̂Xj )| ≤ |Fi,j | g∈F i,j s ( 1 X 1 X (xi,g − µ̂Xi )2 )( (xj,g − µ̂Xj )2 ) , |Fi,j | g∈F |Fi,j | g∈F i,j i,j (3.7) or equivalently that |σ̂Xi ,Xj | ≤ q σˆ2 Xi σˆ2 Xj , where the equality holds true if and only if at every g, xi,g = αxj,g + β for some choice of constants α and β. In addition, the filtered correlation coefficient is also shift and scale invariant. This is a particularly useful property for our analysis, since it normalizes for the fact that some binding profiles may vary more than others. To prove that this property still holds, let Xi0 = aXi + b and Xj0 = cXj + d represent linear transformations of random variables Xi and Xj with transformed observations x0i,g = axi,g + b and x0j,g = cxj,g + d, respectively. The estimates for the means, filtered variances, and filtered covariance change as follows: 33 µ̂Xi0 = 1 X a X 1 X 0 xi,g = (axi,g + b) = (xi,g ) + b = aµ̂Xi + b 0 |Gi | |Gi | g∈G |Gi | g∈G 0 (3.8) 1 |G0j | (3.9) g∈Gi µ̂Xj0 = σˆ2 Xi0 = σˆ2 Xj0 = σ̂Xi0 ,Xj0 = X i x0j,g = g∈G0j 1 0 |Fi,j | 1 0 |Fi,j | X i 1 c (cxj,g + d) = (xj,g ) + d = cµ̂Xj + d |Gj | g∈G |Gj | g∈G X j j (x0i,g − µ̂Xi0 )2 = 1 (axi,g + b − aµ̂Xi − b)2 = a2 σˆ2 Xi (3.10) |Fi,j | g∈F (x0j,g − µ̂Xj0 )2 = 1 (cxj,g + d − cµ̂Xj − d)2 = c2 σˆ2 Xj (3.11) |Fi,j | g∈F 0 g∈Fi,j X X X i,j 0 g∈Fi,j X i,j 1 X 0 (xi,g − µ̂Xi0 )(x0j,g − µ̂Xj0 ) = acσ̂Xi ,Xj . 0 |Fi,j | 0 (3.12) g∈Fi,j Substituting the new estimates for filtered variances and covariance into 3.6 we see that the filtered correlation coefficient of the scaled and shifted versions of the data is the same as that of the original binding profiles Xi and Xj : σ̂Xi0 ,Xj0 ρ̂Xi0 ,Xj0 = q σˆ2 Xi0 σˆ2 Xj0 =q acσ̂Xi ,Xj a2 σˆ2 Xi c2 σˆ2 = ρ̂Xi ,Xj . (3.13) Xj Filtered correlation coefficient is a simple but powerful measure of binding relationships between two factors. It isolates the analysis on the pertinent dimensions (or genes) of the ChIP-chip data and predicts biological interaction between factors with great accuracy, as shown in the following sections and chapters. Moreover, it can compare vastly different data sets in a consistent manner. We explore the issue of significance in filtered correlation analysis next. 3.1.1 Filtered Correlation Coefficient P -values Ultimately, we want to use our filtered correlation coefficient to make decisions on whether two factors are linearly related. Hence, under our null hypothesis H0 the binding profiles of two factors are not linearly related (i.e., ρXi ,Xj = 0) and under our alternative hypothesis 34 H1 they are linearly related (i.e., ρXi ,Xj 6= 0). For sample size n and filtered correlation coefficient ρ̂Xi ,Xj , we want to evaluate the probability that we reject the null hypothesis when it is actually true, or the p-value. To evaluate the significance of our filtered correlation coefficient, we use the test statistic √ ρ̂Xi ,Xj n − 2 T = . 1 − ρ̂2Xi ,Xj (3.14) Assuming that binding profiles Xi and Xj are jointly Gaussian, [35] shows that T is a Student-T random variable of n−2 degrees of freedom. Further, T results from a generalized likelihood ratio test and hence defines the optimal decision boundary [35]. Let tρ̂,n−2 denote one positive outcome of the random variable T for a given ρ̂Xi ,Xj = ρ̂ and n − 2 . Also, let xi = [xi,g1 . . . xi,g|G| ] and xj = [xj,g1 . . . xj,g|G| ] represent the row vectors of binding data for proteins i and j across all genes g in the set G, following the convention of using boldface to represent vectors. Since, tρ̂,n−2 only depends on n − 2 and the estimated value ρ̂ found using xi and xj , tρ̂,n−2 is completely determined by xi and xj . We represent the filtered correlation coefficient p-value for a given tρ̂,n−2 , or for a given xi and xj , as pvF CC (xi , xj ). To find this quantity, we need to find the probability that test statistic t can have a value more extreme than tρ̂,n−2 . This corresponds to integrating the distribution of T , PT (t), over the two disjoint intervals [−∞, −tρ̂,n−2 ] ∪ [tρ̂,n−2 , ∞] where T exceeds the outcome tρ̂,n−2 . Due to the symmetry of the distribution of Student-T random variable T , we can write Z ∞ pvF CC (xi , xj ) = P r(T ≥ tρ̂,n−2 ) = 2 PT (t)dt . (3.15) tρ̂,n−2 Note that we implicitly assume that both highly positive and negative values of ρ̂ are significant here. If we wanted to simply consider positive/negative correlation coefficients as significant, we would need to remove the factor of two and only integrate over the interval corresponding to the right/left tail of T ’s distribution. The filtered correlation coefficient measures how consistently two binding profiles fluctuate above and below their respective means. It represents a normalized measure of the linear 35 relationship between two data vectors. However, two random entities can have no linear relationship (i.e., ρXi ,Xj = 0) but still dependent on each other in a non-linear fashion. We explore a more general notion of probabilistic dependence between random variables in the next section. 3.2 Mutual Information Mutual information is a general measure of the probabilistic dependence between two random variables. As in the filtered correlation analysis, we again consider binding profiles Xi and Xj of proteins i and j as random variables, but now their outcomes only take on NXi and NXj discrete values. The entropy of binding profile Xi measures the amount of uncertainty in predicting the observations of Xi and forms the basis for calculating the mutual information. Given a probability mass distribution of binding profile Xi , PXi (Xi = xi,r ) for r = 1, . . . , NXi , the entropy of X is define as N Xi X PXi (Xi = xi,r ) log PXi (Xi = xi,r ) H(Xi ) = − r=1 N Xi NXj XX =− PXi ,Xj (Xi = xi,r , Xj = xj,s ) log PXi (Xi = xi,r ) . (3.16) r=1 s=1 Suppose now that we observe the binding profile Xj , which might be related to Xi . The conditional entropy of Xi given Xj measures the amount of uncertainty that Xi contains with prior knowledge of Xj and can be calculated as follows: 36 N Xj X H(Xi |Xj ) = − PXj (Xj = xj,s )H(Xi |Xj = xj,s ) s=1 N Xj NXi X X =− PXj (Xj = xj,s ) PXi |Xj (Xi = xi,r |Xj = xj,s ) log PXi |Xj (Xi = xi,r |Xj = xj,s ) s=1 NXi N Xj r=1 XX =− PXi ,Xj (Xi = xi,r , Xj = xj,s ) log PXi |Xj (Xi = xi,r |Xj = xj,s ) . (3.17) r=1 s=1 The amount that the uncertainty in Xi decreases with the observation of Xj ; hence, the entropy reduction H(Xi ) − H(Xi |Xj ) corresponds to the amount of information that Xj contains about Xi . Combining (3.16) and (3.17), the mutual information between random binding profiles Xi and Xj takes the form: I(Xi ; Xj ) = H(Xi ) − H(Xi |Xj ) = H(Xj ) − H(Xj |Xi ) N = NXi Xj X X PXi ,Xj (Xi = xi,r , Xj = xj,s )(log PXi |Xj (Xi = xi,r |Xj = xj,s ) − log PXi (Xi = xi,r )) r=1 s=1 NXi N Xj = XX PXi ,Xj (Xi = xi,r , Xj = xj,s ) log r=1 s=1 PXi ,Xj (Xi = xi,r , Xj = xj,s ) . PXi (Xi = xi,r )PXj (Xj = xj,s ) Note that the second equality in the first line shows that mutual information is symmetric, or I(Xi ; Xj ) = I(Xj ; Xi ). This can easily be verified by observing that trading the places of all the Xi and Xj in (3.18) results in no change. Another property of the mutual information is that it is non-negative, or I(Xi ; Xj ) ≥ 0. In order to estimate the mutual information between the binding profiles Xi and Xj of proteins i and j, we need to estimate the marginal and joint probability mass functions of Xi and Xj based on the data. Using the P V and P R data representations (justification discussed in Section 3.2.2), we classify each data entry as a 1 or a 0, signifying the presence or absence of an interaction, respectively. Hence, we model our binding profiles as Bernoulli 37 (3.18) random variables with i.i.d. random samples Xi,g and Xj,g and outcomes xi,g and xj,g , where xi,g , xj,g ∈ {0, 1} at all genes g. Let Gi,j denote the set of all genes with binding observations for both proteins i and j, and let Gi,j have w gene members. We can use maximum likelihood estimators for the parameters of Bernoulli random variables to estimate the marginal and joint mass distributions using P̂ (Xi = 1) = 1 w X g∈{g:xi,g =1} 1= v w (3.19) w−v P̂ (Xi = 0) = 1 − P̂ (Xi = 1) = w X 1 u P̂ (Xj = 1) = 1= w w (3.20) (3.21) g∈{g:xj,g =1} w−u P̂ (Xj = 0) = 1 − P̂ (Xj = 1) = w X 1 h P̂ (Xi = 1, Xj = 1) = 1= w w (3.22) (3.23) g∈{g:xi,g =1,xj,g =1} P̂ (Xi = 1, Xj = 0) = P̂ (Xi = 0, Xj = 1) = P̂ (Xi = 0, Xj = 0) = 1 w 1 w 1 w X 1= v−h w (3.24) 1= u−h w (3.25) 1= w−v−u+h , w (3.26) g∈{g:xi,g =1,xj,g =0} X g∈{g:xi,g =0,xj,g =1} X g∈{g:xi,g =0,xj,g =0} where h denotes the number of genes bound by both proteins, v the number of genes bound by protein with binding profile Xi and u the number of genes bound by protein with profile Xj . Setting NXi = NXj = 2 and substituting our estimated distributions in (3.18) gives us ˆ i ; Xj ). the estimated mutual information between binding profiles Xi and Xj , I(X Mutual information provides a more general framework for measuring the dependence between two random variables. Moreover, the following sections and chapters demonstrate that it is a very natural and biologically meaningful measure of protein dependence in ChIPchip data. We ultimately want to use our estimate of I(Xi ; Xj ) in order to make decisions 38 on whether two proteins with binding profiles Xi and Xj participate in the same biological process. In the next section, we introduce a method for finding p-values for the mutual information analysis. 3.2.1 Mutual Information P -values P -values allow us to make unbiased decisions about biological dependence based on mutual information. In this scenario, under the null hypothesis H0 the two factors have no binding dependence (i.e., I(Xi ; Xj ) = 0). The p-value measures the probability that an estimated mutual information of a given significance or greater can occur at random. Note that our estimate of the mutual information in (3.19)-(3.26) depends only on four parameters, namely ˆ i ; Xj ) maps to the Venn diagram shown in Figure 3-1, where h, u, v, and w. Hence, each I(X Xi is the binding profile of the TATA-Box Protein (TBP) and Xj is the binding profile of POL3. Figure 3-1: Venn diagram for the subsets of genes bound by proteins POL3 and TBP: w denotes the number of genes with observed binding information about both POL3 and TBP, v is the number of genes bound by TBP, u is the number of genes bound by POL3, and h is the number of genes bound by both. The classification of bound was chosen based on estimated p-values (see Section 2.2.1) below the threshold 0.001. 39 Given a superset of w genes and two subsets of u and v genes, the probability of the two subsets having an overlap of h elements at random has a hypergeometric distribution PH|U,V,W (h|u, v, w). Hence, we can calculate the probability of estimating a mutual informaˆ i ; Xj ) = îh,u,v,w at random using tion I(X w−v v u−h h w u ˆ i ; Xj ) = îh,u,v,w |H0 ) = PH|U,V,W (h|u, v, w) = Pr(I(X . (3.27) The denominator in (3.27) represents the number of ways to choose u − h non-overlaping elements from the complement of the set containing v objects, times the number of ways to choose the h overlapping elements from the set with v objects. The product hence represents the number of ways to choose a subset of u elements from a superset of w objects, such that exactly h of them overlap with a pre-designated subset of v elements. The numerator computes all the possible ways of choosing a subset of u elements from a set of size w, normalizing (3.27) in order to obtain a probability. ˆ i ; Xj ) = îh,u,v,w to hypergeometric The mapping from mutual information estimate I(X probabilities is not one to one. For example, the parameters (h = 1, u = 2, v = 2, w = 20) and (h = 300, u = 600, v = 600, w = 6000) have the same îh,u,v,w but hypergeometric probabilities of 0.1895 and 2.3028 × 10−165 , respectively. This exaggerated example shows that the mutual information estimate does not take into account the sample size w in judging the significance of the dependence between two random variables, unlike the corresponding hypergeometric probability PH|U,V,W (h|u, v, w). Since a larger value of w should increase ˆ i ; Xj ), random variable H when conditioned on U = u, the confidence in our estimate I(X V = v, and W = w is a more appropriate test statistic for evaluating mutual information significance. As before, outcomes h, u, v, and w are completely determined by discrete vectors of binding data xi and xj for proteins i and j. To find the p-value, or the probability of randomly estimating a mutual information of a positive binding relationship equally or more significant than îh,u,v,w , we sum over the right tail of the hyper-geometric test statistic H|U = u, V = v, W = w for the outcome h as follows: 40 min (u,v) pvM I (xi , xj ) = Pr(H ≥ h|u, v, w) = X r=h w−v v u−r r w u . (3.28) As before, we express our mutual information p-value pvM I (xi , xj ) as a function of the binding data xi and xj . The hyper-geometric p-values above only evaluate the significance of synergistic binding between two factors, while mutual information can capture both positive and negative binding relationships. To evaluate the p-value of a negative binding relationship, or the probability of having an overlap of h or smaller at random, we would need to sum from 0 to h in (3.28). However, due to the high sensitivity at h = 0 and due to the large amount of false positives and false negatives in our binding data, this evaluation proves unreliable. Opposing binding relationships can evince interesting biological phenomena, as well, but it becomes clear later why such relationships complicate future analyses and their biological interpretation. Hence, we avoid using mutual information in finding negative binding relationships and will exclusively use the filtered correlation coefficients for that purpose. Now that the we have developed a framework for evaluating the significance of our mutual information analysis, we can substantiate our decision for using the P V and P R ˆ i ; Xj ). data representations in classifying the data and finding I(X 3.2.2 Evaluating Data Classification: Mutual Information Analysis This section gauges how accurately the different data representations classify their binding information into 0s and 1s, in order to estimate the mutual information and find p-values as in Sections 3.2 and 3.2.1. We generated a test set of 1447 probable protein-protein binding dependencies, most of which were confirmed or suggested by published literature. We also created a test set of 20707 extremely unlikely relationships. In order to compare the different matrices in an unbiased manner, we designated reasonable thresholds for classifying the data sets into a similar number of bound (1s) and unbound (0s) gene-protein interactions. Table 3.1 lists the data representations, their corresponding threshold test and the number of 41 Data Matrix BR PR N PV N + PR PV + PR Classification Total Entries Hit Rate (captured Test for Binding Classified Bound /probable links) BR ≥ 1.9 81450 738/1447 P R ≥ .965 77377 1402/1447 N ≥ −.002 79639 1409/1447 P V ≤ .015 80948 1445/1447 N ≥ −.002 or P R ≥ .99 80215 1414/1447 P V ≤ .015 or P R ≥ .99 81461 1446/1447 Total Noise at Unlikely Links 5283.9 3784.5 1460.5 1021.1 1394.7 1021.0 Table 3.1: Hit rate (capture/probable links) at probable interactions and total noise at unlikely interactions based on mutual information estimates using different data representations (see text). entries classified as bound (all around 80,000) in the first three columns. The classified data for each data representation was then used to find mutual information p-values, pvM I , as in Section 3.2.1. The next column of Table 3.1 enumerates the hit rate, or the number of relationships captured at a significance level of 10−10 divided by the total number of probable links in our test set. The last column lists the total noise at unlikely links, computed by accumulating the total significance at interactions in second test set. The significance of an interaction between proteins i and j was calculated by replacing pvC with pvM I in (4.4). A higher hit rate should correspond to fewer false negatives for a given data representation, while less total noise at unlikely relationships should lead to fewer false positives. The P V (equivalent to thresholding on SD) matrix performs best in individually classifying the data, since it has the highest hit rate at known interactions and the lowest total noise at unlikely interactions. The N and P R matrices perform slighly worse in confirming probable relationships. However, the N and P R matrices introduce 43% and 271% more noise than the P V matrix, respectively, which will undoubtedly lead to more false positive claims. Since some proteins have very few gene targets at the chosen thresholds, the classified, binary data would provide little information about their behavior. In order to use mutual information to compare these proteins with the rest, the last two rows in Table 3.1 classify the data using a combination of the top two data representations (P V and N ) and the P R data representation, guaranteeing that at least the top 1% of most immunoprecipitated gene probes is considered bound for each factor. The number of entries classified 42 as bound increase negligibly, since most publications already considered the strongest 1% of gene-protein interactions for each protein as bound. Creating a buffer of bound targets slightly improves the hit rate of both the individual N and P V matrix at no expense of added noise. These results substantiate our decision to use the combination of the P V and P R matrices for classifying the data in order to estimate the mutual information p-values. Now that we have a method for evaluating the significance of both our filtered correlation coefficient and mutual information analyses, we can combine the evidence from the two approaches in order to improve the reliability of our biological predictions. 3.3 Combining P -values The p-value calculations for both the filtered correlation coefficient and mutual information estimation test similar null hypotheses. In Section 3.1.1, we test the null hypothesis that two binding profiles Xi and Xj are not linearly dependent (ρx,y = 0) while in the last section we considered the null hypothesis that the two profiles are not probabilistically dependent (I(Xi ; Xj ) = 0). Ultimately, we aim to find out whether two proteins belong in the same biological process. Hence we want to combine the p-values from the two analyses in order to incorporate both sources of evidence in testing the overall null hypothesis that two proteins with binding profiles Xi and Xj do not share a biological function. We combine p-values using Fisher’s method. It assumes that the p-values result from independent studies that use continuous random variables as test statistics to challenge similar null hypotheses. Since we use the same data source to estimate the filtered correlation coefficient and mutual information, our studies are not completely independent. Moreover, the hyper-geometric test statistic is not continuous. Despite these approximations, the authors in [37] and Section 3.3.1 demonstrate how combining p-values using Fisher’s method improves biological prediction. In order to derive Fisher’s method for combining p-values for synergistic binding relationships, we only consider positive correlations as significant for the following. Hence, in this section we calculate our p-values for the filtered correlation coefficients using a one-sided test, or 43 using (3.15) without the factor of 2. Let p1 = Pr(T ≥ tρ̂,n−2 ) and p2 = Pr(H ≥ h|u, v, w) represent two p-value observations that result from test statistics T and H|U = u, V = v, W = w and observations (ρ̂, n − 2) and h, respectively. Using the assumption of independent studies, the joint probability of observing events T ≥ tρ̂,n−2 and H ≥ h|u, v, w under the null hypothesis equals Pr(T ≥ tρ̂,n−2 , H ≥ h|u, v, w) = Pr(T ≥ tρ̂,n−2 ) Pr(H ≥ h|u, v, w) = p1 p2 . (3.29) From the equation above, it seems natural to consider the observation of p-value pairs (p1 = 0.01, p2 = 0.01) and (p1 = 0.1, p2 = 0.001) with equivalent products p1 p2 as evenly significant sources of evidence for rejecting the null hypothesis H0 . Intuitively, this refers to having two equally strong sources of evidence for dependence between two proteins or having a slightly stronger and a slightly weaker source of evidence. Hence, we want to make decisions about the strength of our combined evidence using the test statistic K = P1 P2 , where P1 and P2 are the p-values resulting from the filtered correlation coefficient and mutual information estimations, respectively. Since P1 and P2 depend on assumed independent random variables T and H, they are themselves independent random variables. The distributions of p-values P1 and P2 resulting from a continuous test statistics is exactly uniform on the interval [0, 1] [37]. Although, our test statistic for mutual information is discrete, assuming a continuous uniform distribution for P2 is just an approximation that only slightly affects the accuracy of our evaluation [37]. To find the overall p-value for a given p-value product p1 p2 , we need to evaluate the probability of randomly arriving at a p-value product more significant than the observation k2 = p1 p2 . This corresponds to the region in probability space where p1 p2 ≤ k2 . Hence, we need to find the volume of the region p1 p2 ≤ k2 over the joint distribution of P1 and P2 . Since P1 and P2 were shown to be independent and uniformly distributed random variables, their joint distribution PP1 ,P2 (p1 , p2 ) is a 2-dimensional unit cube defined on {(p1 , p2 ) : p1 ∈ [0, 1], p2 ∈ [0, 1]}. The combined p-value from the two analyses for observed p-value product 44 k2 , pvC (k2 ), follows from the integration below: ZZ pvC (k2 ) = = PP1 ,P2 (p1 , p2 )dp1 dp2 p1 p2 ≤k2 , 0≤p1 ,p2 ≤1 Z 1 Z k2 k2 1dp1 + dp1 = p1 |k02 +k2 ln p1 |1k2 = p 1 k2 0 k2 − k2 ln k2 . (3.30) Since p1 = pvF CC (xi , xj ) and p2 = pvM I (xi , xj ) for two proteins i and j with binding data vectors xi and xj , the combined p-value for the binding relationship between proteins i and j only depends on xi and xj and can be equivalently denoted as pvC (xi , xj ). The equation in (3.30) only depends on the number of dimensions, 2, and the product of our observed p-values, k2 . The extension for combining p-values from n independent experiments, also depends only on n and the observation kn = p1 p2 · · · pn . Now, the integration takes place over the n dimensional region p1 p2 · · · pn ≤ kn of the joint distribution PP1 ,...,Pn (p1 , . . . , pn ). Again from independence, the joint distribution is an n-dimensional unit hypercube defined on {(p1 , . . . , pn ) : p1 ∈ [0, 1], . . . , pn ∈ [0, 1]} and the combined p-value is pvC (kn ) = kn n−1 X (− ln kn )r r=0 r! . (3.31) The following section mitigates the problem of assuming independent analyses and illustrates how combining p-values can improve biological prediction. 3.3.1 Biological Predictions using Combined P -values To show the benefit of combining p-values we present an example of a biological prediction that is representative of the rest of the pairwise protein analysis. Figure 3-2 shows filtered correlation coefficient, mutual information, and combined p-values for a test set of proteins. The filtered correlation coefficient and mutual information p-values were created using onesided models that only considered positive relationships as significant. The scale on the right is in units of log10 (pv) ranging from -2 (or p-value = .01) to -20 (or p-value = 10−20 ). Protein 45 names are listed on the x and y axis with and “o” or “i” at the end of the names signifying that the data came from an ORF or intergenic array, respectively. Since combining p-values integrates evidence from the two complementary pairwise analyses, it should lead to fewer errors in deciding whether two factors have a significant binding dependence. The first 3 factors, RSC3, RSC8, and RSC, are components of the RSC nucleosome remodeling complex. Since these proteins carry out tasks as part of the same complex, we would expect a high degree of similarity in their binding profiles, as shown in [7]. Indeed, each subfigure shows a p-value ≤ 10−20 for the interactions within the entire complex (dark blue box on the top left). For individual analyses, we consider a p-value of 10−10 as significant for a pairwise interaction. However, since we are combining two studies that are not fully independent, we consider a p-value of 10−20 as significant for the combined p-values. If our two analyses were completely dependent on one another, the p-values testing the same null hypothesis should be identical and a significance of 10−10 would translate to a significance of 10−20 when the p-values are combined. However, since our pairwise studies are not fully dependent on one another, a p-value threshold of 10−20 for combined p-values is more stringent than a p-value threshold of 10−10 for a single analysis. The next two proteins LRP1 and RRP6, are involved in mRNA degradation and surveillence, respectively. In [16], the authors show that LRP1 depends on RRP6 for recruitment to a large fraction of its gene targets. Therefore, we would expect similarity in their binding profiles and both analyses capture this interaction with a low p-value. Curiously, protein PRP20 also shows high similarity with the LRP1-RRP6 biological process. PRP20, also known as RanGEF in humans, binds preferentially to inactive genes [13]. This leads us to a novel biological hypothesis. It is known that the exosome (a complex of proteins which also contains LRP1 and RRP6) controls protein synthesis by degrading the mRNAs of genes that are improperly formed, processed, or transcribed at a given time. For example, we would not want the GAL proteins, involved in galactose breakdown, to be synthesized when the cell is grown in a dextrose-rich YPD medium. Moreover, [13] shows that PRP20 binds to genes that should be turned off during a given condition, such as GAL genes in YPD. 46 (a) Filtered correlation coefficient p-values for a test set of proteins. (b) Mutual information p-values for test a set of proteins. (c) Combined p-values for test a set of proteins. Figure 3-2: Filtered correlation coefficient (a), mutual information (b), and combined (c) p-values for a test set of proteins. The filtered correlation coefficient and mutual information p-values were created using one-sided models that only considered positive relationships as significant. Protein names are listed on the x and y axis with an “o” or “i” at the end of the names signifying that the data came from an ORF or intergenic array, respectively. The scale on the right is in units of log10 (pv) ranging from -2 (or p-value = .01) to -20 (or p-value = 10−20 ). 47 The fact that the exosome binds to a similar set of gene targets as PRP20 suggests that inactive genes are actually not completely turned off. It may be possible that the process of transcription is leaky, in which case inactive genes that should not be transcribed in a given condition may be erroneously transcribed. Indeed, there is a growing body of evidence in the transcription field to suggest that RNA polymerase is promiscuous in its induction of transcription at non-ORF sites within the genome, the functional consequences of which are still unclear. This suggests that the exosome may have a role in quality control at inactive genes, binding to them in order to readily degrade their unwanted mRNA. The last 2 proteins, Polymerase III (POL3) and the TATA-Box Protein (TBP), also associate significantly for all three p-value calculations. In a recent paper [28], the authors show that the TBP preferentially associates with gene targets of POL3 in various environmental conditions. Although, the POL3 and TBP experiments were done in different laboratories, our normalized analyses confirm the results in [28]. After identifying connections within 3 biological processes, we consider the interplay between the three complexes. The authors in [7] show that the RSC nucleosome remodeling complex has a preference for POL3 gene targets. Hence, we should expect significant interaction in the top right corner of Figures 3-2(a), 3-2(b), and 3-2(c). The filtered correlation coefficient analysis finds all 6 interactions significant (p-value ≤ 10−10 ) but the mutual information and combined p-value calculation finds 5/6 relationships as significant (p-value ≤ 10−10 and p-value ≤ 10−20 , respectively), making an error on RSC3-POL3. The filtered correlation coefficient analysis additionally predicts interactions LRP1-POL3 and LRP1TBP but no relationships between its recruitment factor RRP6 and POL3 or TBP. Due to this discrepancy and due to the lack of literature supporting this interaction, we consider these two predictions as likely false positives. The mutual information and the combined p-value do not make this probable mistake. In summary, the filtered correlation makes 2 potentially false positive claims while the mutual information and combined p-values make one false negative error. Moreover, if we reduce the combined p-value threshold for significance to a less conservative and more accurate 10−18 (since our complementary analyses are not 48 completely dependent), we obtain 0 errors. This example accurately represents the overall trend that correlation coefficient and mutual information tests will generate a higher number of false positives and false negatives, respectively. This is because mutual information makes a hard decision on classifying the data as 0s and 1s prior to the analysis and cannot capture relationships when a significant number of errors are made in the classification, resulting in false negatives. In contrast, the filtered correlations cost function considers a soft and continuous version of the data in terms of standard deviations of confidence and does capture these relationships. However, this also leads to considering unbound genes as partially bound and creates a higher number of false positives. Combining the two complementary analyses leads to fewer errors or to a more optimal operational point on an ROC curve of false negative rate versus false positive rate. We use this concept in building a more reliable network of our nucleus in Chapter 5. The next section attempts to reduce the false positive rate of predictions based on mutual information. 3.4 Minimized Mutual Information P -value The previous section revealed that mutual information p-values are highly sensitive to the chosen cutoff for classifying the binding data into 1s and 0s. To alleviate this problem, this section introduces minimized mutual information p-values. This method allows for the binding cutoffs to take on several values from an allowable set and finds the cutoffs that minimize the mutual information p-value between two proteins. Section 3.2.2 shows that transforming data into 0s and 1s seems most appropriate when using the P V and P R matrices. Let A = {0.001, 0.005, 0.01, 0.02} denote the allowable set of p-value cutoffs for thresholding the P V matrix. Moreover, let ci , cj ∈ A denote the p-value cutoffs of proteins i and j for transforming continuous row vectors xi and xj from the P V matrix to binary vectors, respectively. For each combination of cutoffs ci , cj ∈ A, lets also consider the strongest 1% of gene-protein interactions for proteins i and j as bound. Then the minimized mutual information p-value, pvM M I , between binding vectors xi and xj takes the form 49 pvM M I (xi , xj ) = min ci ∈A,cj ∈A pvM I (xi , xj ) , (3.32) where pvM I is the mutual information p-value previously calculated in (3.28). The minimized mutual information p-value mitigates the sensitivity in the mutual information analysis due to hard classification of the data. However, since this analysis allows for the same protein to have different bound cutoffs, it is only suited for making decisions on pairwise interactions between two proteins and does not allow for unbiased group-wise comparisons. Hence, we will use the minimized mutual information p-values for making decisions about adding links in a network between two proteins (Chapter 5) but refrain from using it for inferring group-wise relationships between proteins (Chapter 4). Filtered correlation coefficient, mutual information, and combined p-values allow us to judge the strength of pairwise interactions between two proteins. The following chapter introduces clustering and PCA, which can evince group-wise relationships between factors that regulate transcription. 50 Chapter 4 Group-wise Relationships: PCA and Clustering Biological processes depend on the collaborative interaction of several proteins. In the previous chapter we developed pair-wise statistics in order to describe the relationship between two proteins. This chapter introduces Principal Component Analysis (PCA) and clustering, two methods that can evince group-wise dependencies between proteins. The two techniques verify known biological mechanisms and uncover novel cellular processes. 4.1 Principal Components Analysis PCA is a common technique for data reduction and visualization. It determines the directions of the greatest variance within a data matrix by calculating the orthonormal eigenvectors associated with the largest eigenvalues of the scatter matrix of the data [36]. The principal components, usually referred to as scores, are row vectors of the same dimensionality as row vectors of the data matrix. The data can be modeled by taking linear combinations of the scores; the associated weights with each score are called loads. Since most of the variance within a data set can usually be captured with a small number of principal components, PCA exploits the common information within the data in order to reduce its dimensionality. 51 One can choose the number of principal components one wishes to calculate; more principal components can characterize the data better at the cost of having to keep track of more information. For visualization purposes, two or three principle components are often used. For a given choice of data representation, let xi = [xi,g1 . . . xi,g|G| ] denote a row vector of binding data for protein i across all genes g in the set G. If performing PCA on the binding profiles of each protein i, we can write xi = µ̂i 1 + N X wni sn . (4.1) n=1 The PCA equation (4.1) uses N principal components, with row vector scores s1 , . . . , sN and i } specific to each xi , where µ̂ is a scalar that is the mean across the scalar loads {w1i , . . . , wN elements of xi , and 1 is row a vector of all ones. Figure 4-1 shows how PCA can provide useful biological insight. The large isolation of protein Prp20 and Nup100 from the rest of the nuclear pore related proteins in Figure 4-1(b) supports a model introduced in [13]. Casolari et al. shows that Prp20, known as RanGEF in humans, binds preferentially to transcriptionally inactive genes. RanGEF is also known to catalyze the unloading of transcription factors (TF) transported into the nucleus by nuclear import proteins, such as Kap. The left part of Figure 4-1(a) illustrates how RanGEF converts RanGDP to RanGTP which in turn unloads the TF transported into the nucleus by Kap. Hence, the authors suggest that Prp20 may contribute to transcriptional activation by facilitating the release of transcription factors at genes requiring fast activation. Accordingly, induction of said genes leads to experimentally confirmed loss of RanGEF association and a gain in binding at the nuclear pore on the periphery of the nucleus, as illustrated on the right part of Figure 4-1(a). Thus, Prp20 and the rest of the nuclear pore proteins should bind to very few genes in common at any one time as shown by the great separation between the two in Figure 4-1(b). Further, the results suggest that the other aberrant nuclear factor, Nup100, also has a role that is orthogonal to the workings of the majority of the nuclear pore proteins. The following section discusses another method for finding biological processes using ChIP-chip data. 52 (a) Casolari et al. Model for Nuclear Transport (picture from [13]) Principal Component Scatter Plot with Colored Clusters 100 1 2 3 Nup100 Second Principal Component 80 60 40 20 0 −20 −50 Prp20 0 50 100 150 200 250 First Principal Component (b) Principal Component Analysis Validates the Model Figure 4-1: The subfigures show (a) Casolari et al. model for nuclear transport and (b) its validation using Principal Component Analysis (PCA) [13]. Figure (b) plots the loads on the first two principal components. The colors denote clusters constructed using hierarchical clustering based on correlation coefficient distance, as defined in Section 2.2.2, and a cutoff for three clusters. 53 4.2 Clustering Clustering of genomic data can evince group-wise relationships between proteins or genes. We first introduce K-means and hierarchical clustering and then adapt the algorithms in order to use the pair-wise distance metrics developed in the previous chapter. We then develop a novel semi-supervised clustering algorithm that preserves information about elements within each cluster in order to better capture group-wise dependencies between proteins. 4.2.1 K-means Clustering K-means clustering is one of the most commonly encountered algorithms in biology, due to its ease of implementation. K-means clustering uses the expectation maximization (EM) algorithm to partition (cluster) N elements into K disjoint subsets Sk , k = 1, . . . , K. Given an appropriate choice of data representation, let xi denote a row vector of binding data for protein i, as in Section 4.1. In order to find groups of related proteins, we represent binding profiles as elements that we want to cluster. The K-means algorithm finds K disjoint subsets Sk , k = 1, . . . , K that minimize the overall cost function J= K X X d2 (xi , ck ) , (4.2) k=1 i∈Sk where d(xi , ck ) is the distance between binding profile xi and centroid ck of subset Sk . The algorithm usually uses Euclidean distance for d and defines ck as the mean of all the elements in Sk : ck = 1 X xi . |Sk | i∈S (4.3) k The user initiates the clustering by designating the K underlying groups of objects. Then the K centroids are initialized, usually by assigning each element to one of the K clusters at random and using (4.3). The algorithm then iterates between (i) assigning all the elements to subset Sk with the “closest” (in terms of distance d) centroid ck (E-step) and (ii) reestimating 54 the cluster centroids based on the new assignments using (4.3) (M-step). The procedure stops once no further change in assignments occurs. We adapted the K-means algorithm in order to incorporate the pairwise statistical analysis from Chapter 3. The following formula converts the combined p-values from Section 3.3 to a non-negative distance, where a small p-value corresponds to a small distance between the binding profiles xi and xj of two proteins i and j: d(xi , xj ) = − log10 (1 − pvC (xi , xj )) . (4.4) K-means clustering (as any EM based algorithm) does not guarantee global minimization of the cost function in (4.2), and its performance heavily depends on the initialization of the K centroids. Due to this problem, the K-means algorithm often fails to identify known protein clusters (e.g., the RSC complex). Other EM based partitioning algorithms, such as fuzzy membership clustering, would also suffer in performance due to random initialization; therefore, the next section introduces a different family of algorithms based on hierarchical clustering. 4.2.2 Hierarchical Clustering Hierarchical clustering has wide applicability in biology, as well. In addition, hierarchical partitioning does not require a priori knowledge of the number of clusters and can be easily visualized using a tree structure (called dendrogram), making it often preferable to K-means clustering. Hierarchical clustering is subdivided into agglomerative methods, which proceed by series of fusions of the N objects into groups, and divisive methods, which separate N elements successively into finer subsets. We adapted the agglomerative hierarchical clustering algorithm in order to incorporate the pairwise statistical analysis from Chapter 3. As in the previous section, we use binding profiles as elements and define all the pairwise distances between elements using (4.4). At the start, the algorithm treats each element as a cluster, and proceeds for N − 1 iterations. At each iteration, the algorithm links the two most similar clusters (represented as linking 55 two child nodes to a parent node on a dendrogram) until all N elements are unified into one partition. Hierarchical clustering algorithms can define distance (or similarity) between clusters in various ways. The following equations describe several common distances for linking two clusters Ck and Cl : Nearest Neighbor : d(Ck , Cl ) = Farthest Neighbor : d(Ck , Cl ) = Average : d(Ck , Cl ) = d(xi , xj ) (4.5) max d(xi , xj ) (4.6) min i∈Ck ,j∈Cl i∈Ck ,j∈Cl XX 1 d(xi , xj ) . |Ck ||Cl | i∈C j∈C k (4.7) l Hierarchical clustering based on combined p-values from Section 3.3 confirms known biological processes with great accuracy, especially when using the average distance for linking clusters. However, hierarchical clustering only considers pairwise relationships between binding profiles. Figure 4-2 uses an example to illustrate the potential pitfalls with clustering solely on pairwise distances. The Venn diagrams on the left and on the right describe two hypothetical relationships between three factors (proteins). Each circle or oval represents the subset of genes classified as bound by the six different factors. In the left scenario, each possible pair of factors (i.e., A↔B, A↔C, and B↔C) have a number of genes that they bind to in common. But when considering the group-wise relationship (i.e., A↔B↔C), we find no common intersection in bound genes between all three factors. In contrast, the scenario on the right shows that factors X, Y, and Z share a large common intersection and thus interact in a pairwise as well as group-wise manner. Since both sets of three factors have equivalent pairwise distances, hierarchical clustering cannot distinguish any difference between the two scenarios. However, the much stronger group-wise interaction of the scenario on the right makes it more likely for factors X-Y-Z to share a common biological process than for factors A-B-C. In the next section, we develop a novel semi-supervised clustering algorithm that preserves information about elements within each cluster in order to better capture group-wise dependencies between proteins. 56 Figure 4-2: The Venn diagrams on the left and on the right describe two hypothetical relationships between three factors. Since both groups of three factors have equivalent pairwise overlaps in bound genes, hierarchical clustering cannot distinguish any difference between the two scenarios (see text). However, the much stronger group-wise interaction of the scenario on the right makes it more likely for factors X-Y-Z to share a common biological process than for factors A-B-C. 4.2.3 Semi-Supervised Clustering Semi-supervised clustering derives its name from the fact that it retains information about elements within each cluster as it partitions the objects. In order to preserve information about the elements of cluster Ck the algorithm maintains two vectors, fk and xk . Vector fk = [fk,g1 . . . fk,g|G| ] records the fraction of elements that bind to each gene g in the set G. Vector xk = [xk,g1 . . . xk,g|G| ] represents the averaged binding profile at gene g ∈ G for all members within the partition, based on the SD data representation. When merging two clusters Ck and Cl into Co , the resulting two vectors for cluster Co are weighted combinations of the vectors for Ck and Cl : 1 (|Ck |xk + |Cl |xl ) |Ck | + |Cl | 1 fo = (|Ck |fk + |Cl |fl ) . |Ck | + |Cl | xo = (4.8) (4.9) Note that merged vector fo still maintains the fraction of genes bound by the elements of 57 the new cluster and xo still represent the average binding profile of all the joined objects. To define similarity between clusters, we again use the distance based on combined p-values in (4.4). Although there are many choices for how to define similarity, we wanted to incorporate the robust measures we derived in the previous chapter. Appendix A develops the algorithm with a more theoretically intuitive but less biologically meaningful distance based on Kullback-Leibler(KL) divergence. To find combined p-values for the relationship between two clusters, the algorithm evaluates the filtered correlation coefficient and mutual information but incorporates the groupwise dependence by modifying how to select the pertinent sets. For filtered correlation, let Fk and Fl represent the filtered subsets of genes for clusters Ck and Cl , respectively, and let Fk,l = Fk ∪ Fl again denote the filtered subset of genes over which we will estimate the variances and covariance of two partitions using (3.1) - (3.5). The filtered subsets Fk and Fl no longer consist of the set of genes bound by one factor but now represent the sets of genes bound by a fraction of objects within a cluster. Mathematically, letting fthresh represent the fraction of proteins that need to bind to a gene in the filtered subsets, we define Fk = {g : fk,g ≥ fthresh } and Fl = {g : fl,g ≥ fthresh }, respectively. Having defined Fk,l = Fk ∪ Fl , the algorithm uses the averaged binding vectors xk and xl in (3.1) (3.5) to compute the means, filtered variances, and filtered covariance of clusters Ck and Cl . Next, using (3.6), (3.14), and (3.15), the algorithm derives the filtered correlation coefficient p-value between the two partitions. For finding mutual information p-values, we can now assign u = |Fk |, v = |Fl |, h = |Fk ∩ Fl |, and w equal to the number of genes with binding information for both clusters. Using these quantities, we can use (3.28) to find the mutual information p-value between two clusters. And finally, we again combine p-values using (3.30). Similar to hierarchical clustering, the algorithm treats each element as a cluster at the start and proceeds for N − 1 iterations. At each iteration, the algorithm links the two most similar partitions, based on the combined p-value distance in (4.4), until all N elements are unified into one partition. However, the new method for finding the combined p-values 58 between clusters incorporates the group-wise dependence of elements, rectifying the potential problem with using hierarchical clustering. The next section validates the algorithm using several known biological processes. 4.2.4 Biological Validation of Semi-Supervised Clustering In order to evaluate the performance of the new algorithm, we clustered the SD data representation using a bound cutoff of P V ≤ .015 ∨ P R ≥ .99, where ∨ denotes logical or, consistent with our results in Sections 2.2.2 and 3.2.2. Moreover, we set fthresh = 0.5 in order to consider genes bound by half or more of the factors in a partition as representative of the biological process of the partition. We first clustered the whole data set and then extracted the produced clusters into separate plots to facilitate visualization. Figure 4-3 illustrates how the algorithm performs on a selected group of biological processes using tree structures or dendograms. The horizontal branches on the dendrogram represent the merging of two similar objects at an iteration in the algorithm, while the length of the vertical branches corresponds to the similarity distance (as in (4.4)) of the fused nodes. Unlike K-means, semi-supervised clustering does not decide the final number of clusters but allows the user to choose a significant threshold distance for a cluster. In Figure 4-3, we chose a distance of less than 303 as significant, which corresponds to a combined p-value of 10−20 in our implementation. For visualization purposes, however, we color coded clusters of distance ≤ 253 or of combined p-value ≤ 10−70 . Figure 4-3(a) shows the clustering of four known biological processes using different colors. It again illustrates the significant association of components of the RSC nucleosome remodeling complex (RSC, RSC1, RSC2, RSC3, RSC8, RSC9, and STH1) and of components of the Polymerase III transcriptional machinery (POL3, TBP, BRF, TRF4, and RPC34). It also demonstrates the significant clustering between transcription factors STE12, DIG1, and TEC1. In [5], the authors show that STE12 activates two different sets of genes under different conditions. In the presence of a nutrient limiting agent butanol, binding of STE12, DIG1, and TEC1 activates genes necessary for growth of filaments, or structures that expand 59 (a) Validation of the RSC, POL3, STE12, and GAL biological processes. (b) Clustering of the MCM/ORC, SIR, and LRP1 biological processes. Figure 4-3: Confirmation of known biological processes and new insight using semi-supervised clustering (see text). The two dendrograms above show a subset of the results from clustering the whole data set using the SD data representation and a bound cutoff of P V ≤ .015∨P R ≥ .99. We set fthresh = 0.5 in order to consider genes bound by half or more of the factors in a partition as representative of the biological process of the partition. The horizontal branches signify the merging of two similar objects, while the length of the vertical branches corresponds to the similarity distance (as in (4.4)) of the fused nodes. For visualization purposes, different colors represent clusters of distance ≤ 253 or of combined p-value ≤ 10−70 . 60 the size of the cell and facilitate storage of nutrients. During pheromone exposure, binding of STE12 and DIG1 (no TEC1) induces genes involved in mating. Although our data monitors the binding of the three factors under no treatment in YPD, the clustering still evinces the strong binding dependence between STE12 and DIG1, and their auxiliary relationship with TEC1. Moreover, Figure 4-3(a) also confirms the significant interdependence between the POL3 machinery and RSC complex (p-value ≤ 10−40 ). It further suggests a previously unknown relationship between the STE12 cluster and the POL3 and RSC biological processes. And finally, the right most partition shows significant similarity between transcription factors GAL3 and GAL80. As discussed in Section 2.1, the two transcription factors regulate the expression of genes that breakdown the sugar galactose. Similar to the STE12 analysis, using YPD data again uncovers the general relationship between the two galactose factors. However, we do not believe that the GAL transcription factors share significant functions with the previous three processes (p-value ≥ 10−5 ). Figure 4-3(b) displays a dendogram of several other known and hypothesized biological mechanisms. The rightmost colored tree corroborates our hypothesis that LRP1, RRP6, and PRP20 may synergistically bind at inactive genes. As discussed in Sections 3.3.1 and 4.1, LRP1 and RRP6 may bind to inactive genes in order to readily degrade their unwanted mRNA, while PRP20 may bind at inactive genes to facilitate fast activation [13]. Moreover, the association LRP1 and RRP6 with mRNA export factor YRA1, supports the results in [16] that demonstrate coupling of mRNA production quality assurance to mRNA export. Further, the similar relationship between NUP100 and PRP20 suggests that NUP100 may also have a role that is orthogonal to the workings of the majority of the nuclear pore factors, as predicted in the PCA analysis. However, this cluster does not seem to share significant commonality with the rest of the colored trees on the figure. The other three colored clusters in Figure 4-3(b) also provide validation of our analysis and new biological insight. During normal growing conditions, a yeast cell periodically cycles through several phases of growth. If a mother cell has sufficient nutrients in its environment it may decide to divide into two daughter cells, in a phase of the cell cycle called mitosis. To 61 make sure that the daughter cells inherit the genetic information of the mother cell, prior to mitosis and during the S phase of the cell cycle, the mother cell makes two identical copies of all of its genetic material in a process known as DNA replication. In [8], the authors show that the Origin Recognition Complex (ORC) and the MiniChromosome Maintenance (MCM) proteins bind together at origins of replication, or sites along the chromosomes where DNA replication is initiated. The two left most colored trees confirm the significant association of components of the ORC (ORC1, ORC1-6) and MCM (MCM3, MCM4, and MCM7) complex at both intergenic and open reading frame (ORF) regions of the genome, denoted by an “i” and “o” at the end of each protein name, respectively. Although DNA replication only takes place during the S phase of the cell cycle, and despite the fact that our unsynchronized ChIP-chip data samples cells during all phases of growth, the semi-supervised clustering still uncovers the relationship described in [8]. The next colored partition confirms the significant association between components of the SIR complex at both intergenic and ORF regions [21]. Moreover, the SIR complex associates significantly (p-value ≤ 10−35 ) with the MCM/ORC cluster, suggesting a mutual dependence between the two processes. The binding profile of SIR2o also has a close relationship with RAP1o, a TF involved in transcriptionally active processes, which we will explore in more detail in the next section. 4.2.5 Active Processes and SIR2 Transcriptionally active processes, or simply active processes in our context, refer to biological mechanisms that occur at highly expressed genes. For example, the model introduced in Section 4.1 [13] of how highly expressed genes bind to the nuclear pore describes an active process. To gauge how actively a gene is expressed in YPD, we downloaded data for the number of mRNAs produced per hour, or the transcriptional frequency of a gene, as measured in [12]. The results from [12] show that only about 21% of the genes produce more than 6 mRNAs/hr. Therefore, we define genes with transcriptional frequency in the top 20% as active. To determine whether a factor participates in active processes, we again calculate filtered correlation coefficient, mutual information, and combined p-values between 62 the bound set of a factor and the set of active genes. Mathematically, we let A represent the set of active genes and let Fi represent the set of genes classified as bound for a potential active factor i. Then we define Fi,j = Fi ∪ A in the filtered correlation coefficient equations (3.1) - (3.5), and use (3.6), (3.14), and (3.15) in succession to derive p-values. For finding mutual information p-values, we can now assign u = |A|, v = |Fi |, h = |A ∩ Fi |, and w equal to the number of genes with observed information for both data sets. Using these quantities, we can use (3.28) to find the mutual information p-value. And finally, we again combine p-values using (3.30) and consider a factor as active if it has a combined p-value ≤ 10−20 or a mutual information p-value ≤ 10−10 (justified in Section 5.1). In order to explore group-wise relationships between active factors, we categorized all proteins as active or non-active using the stringent criterion above. Figure 4-4 displays the results of clustering only the active factors using the SD data representation and a bound cutoff of P V ≤ .015 ∨ P R ≥ .99 once again. Note that all the factors included have a significant similarity distance of ≤ 283, or cluster combined p-value ≤ 10−40 . The results provide interesting biological validation and exciting new insight. Figure 4-4 includes all the nuclear proteins considered to have a preference for binding to active genes in [13] (CSE1, NUP116, NIC96, MLP1, MLP2, XPO1, NUP2, KAP95, and NUP60) and in [14] (THO2, NPL3, and HMT1), justifying the above criterion for finding active factors. Moreover, the tight clustering of the active nuclear pore factors once again validates the model in [13], which suggests that these factors share a biologically active mechanism. The large similarity between nuclear proteins THO2, NPL3, and HMT1 also seems to confirm the results in [14]. Yu et al. introduced a model showing that methyltransferase HMT1 plays a role in disassociating transcriptional elongation protein THO2 and mRNA export factor NPL3. Moreover, nonfunctional (mutated) HMT1 leads to failed export of mRNA produced at a selected active gene. Since all three factors bind preferentially to highly expressed genes, they seem to share an active biological process, whereby HMT1 dependent dissociation of THO2 and NPL3 may prove necessary for export of mRNAs produced by active genes. Further, their close relationship with the nuclear pore proteins implies that this mechanism might 63 also occur at the periphery of the nucleus. The red tree on the left part of Figure 4-4(b) illustrates another active biological process. Histone acetyltransferases (HATs) ESA1 and GCN5 have a predilection for binding to active genes [4], while data H3i, H2Bi, and iNuclsm measuring nucleosome depletion also overlaps significantly with promoters of active genes [27]. Moreover, transcription factors FHL1 and RAP1 regulate protein biosynthesis and other active processes [38]. The nearby congregation of RAP1, nucleosome depletion, and HATs ESA1 and GCN5 seems to support a conjecture mentioned in [27]. Berstein et al. showed that RAP1 is necessary but not sufficient for the mechanism that displaces nucleosomes at highly expressed genes. Hence, the authors conjectured that other factors, including ESA1 and GCN5, might be required along with RAP1 in order to reposition nucleosomes at induced genes. Moreover the large similarity between this process and the nuclear pore factors as seen in Figure 4-4(a) again implies that this active process takes place at the periphery of the nucleus. Surprisingly, Figure 4-4(b) also uncovers a great similarity between the binding of SIR2 at ORF regions, denoted as SIR2o, and other active proteins. The SIR complex does not seem to bind to DNA directly but interacts with other factors, such as RAP1 or deacetylation at histone 4, in order to access the nearby genes. This explains the association of SIR2 with active factor RAP1 in Figure 4-3(b). However, SIR2 is a histone deacetylase that plays an important role in silencing transcription at telomeres (or ends of chromosomes) and matingtype gene loci HML and HMR [21]. Hence, the designation of SIR2o as an active factor and its significant binding similarity with all other active factors is unexpected. In Figure 4-4(a), SIR2o associates with several active transcription factors, including GAT3, FHL1, RAP1, and YAP5. It also has a significant binding similarity with semi-active (combined p-value ≤ 10−10 instead of ≤ 10−20 ) transcription factors PDR1, RGM1, and SMP1. These transcription factors (TFs) regulate various different biological pathways, including biosynthesis, environmental response, and metabolism. However, all the enumerated TFs bind preferentially to nucleosome depleted promoters in a functionally cooperative manner [27]. This may suggest a role for SIR2 in nucleosome displacement. Moreover, SIR2 deacetylates 64 (a) Active cluster, part I (b) Active cluster, part II Figure 4-4: Active biological processes and new insight using semi-supervised clustering. Figures 4-4(a) and 4-4(a) display the results of clustering all active factors (see text), using the SD data representation and a bound cutoff of P V ≤ .015 ∨ P R ≥ .99 once again. We again set fthresh = 0.5 in order to consider genes bound by half or more of the factors in a partition as representative of the biological process of the partition. All members of the cluster associate significantly (combined p-value ≤ 10−40 ), but for visualization purposes different colors represent clusters of distance ≤ 253 or of combined p-value ≤ 10−70 . 65 lysine 16 on histone 4, which is generally associated with active processes. Finding the full extent of SIR2o’s role in active processes, and why it only appears to take place at ORF and not intergenic regions may uncover exciting and new biological insight. Figure 4-4 also includes several other factors with proposed roles in active processes. First, histone deacetylase HOS2 was shown to play a role in deacetylation at active genes in [20] and clusters tightly with the nuclear pore factors in Figure 4-4(a). Second, [6] shows that histone methyltransferase SET1, also in Figure 4-4(a), adds three methyl groups at lysine 4 of histone 3 during transcription (i.e., at active genes), which may serve as memory that a gene was recently turned “on”. Third, Santos-Rosa et al. also shows an intimate connection between SET1 and nucleosome remodeling complex protein ISW1, shown in Figure 4-4(b) [25]. The authors show that ISW1 recruitment at active genes is dependent upon methylation by SET1. Further, the paper illustrates that proper transcription at selected genes depends upon the collaborative function of both ISW1 and SET1. And finally, [17] shows that high acetylation level at histone 3 lysine 18 at both intergenic and ORF regions correlates with gene activity. Figure 4-4(b) includes the three factors, designated as H3K18o, H3K18no, and H3K18ni, where “o”, “no”, and “ni” at the end refer to ORF, normalized ORF, and normalized intergenic regions [17]. Our clustering analysis captures the collective relationship between all the mentioned active factors for the first time in the literature. In the next chapter, we use the theory developed thus far to build a multi-layered transcriptional network of the nucleus. 66 Chapter 5 Network of the Nucleus This chapter builds on the methodology developed in the previous chapters in order to build a network of the nucleus. Our holistic approach is the first attempt in the literature to quantify the communication between layers in eukaryotic transcriptional networks. Analysis of the finalized network will hopefully unveil a general view of the non-linear process of transcription. The next section describes the model we used to determine the interplay between various proteins. 5.1 Network Model In order to build a transcriptional network of the nucleus, we represent each factor as a node (or oval) and use the pairwise statistics developed in Chapter 3 to find significant binding relationships between factors, represented as edges (or links) between nodes. Using the convention of italicizing sets but not members of sets, we again define the following five categories of proteins: Transcription Factors (T F ), Nuclear Transport proteins (N T ), Nuclear Processing proteins (N P ), Histone Modifiers and Nucleosome Remodelers (HM ), and Histone State (HS) in terms of acetylation levels, methylation levels, and nucleosome distribution. We use an undirected network model, since our ChIP-chip data cannot evince directionality in binding dependencies, or whether factor i causes factor j to bind to gene g. 67 Therefore, an edge between two factors i and j contains zero or two arrows in our model, lacking information about directionality. Our network model consists of two types of edges—a positive and a negative edge representing a significant synergistic and an opposing binding relationship, respectively. Since our mutual information p-values only evaluate the significance of positive binding relationships, decisions on extending negative edges will solely depend on correlation analysis. Hence, we use the two sided integration in (3.15) to evaluate the significance of finding both extremely positive or negative binding relationships. To avoid ambiguity when using correlation analysis, we set the p-value between proteins i and j to one if ρ̂Xi ,Xj < 0 when considering positive edges and if ρ̂Xi ,Xj ≥ 0 when considering negative edges. For simplicity, we first consider protein-protein interactions, excluding all factors within the HS level. Mathematically, let νi and νj denote the nodes (or vertices) for proteins i and − j, respectively. Then + i,j and i,j represent a positive and a negative binding relationship − between i and j, respectively. Moreover, + i,j and i,j can only equal to 1 or 0, corresponding to a presence or absence of a significant binding relationship between i and j. We add a positive edge (+ i,j = 1) between proteins i and j from all layers except the set HS if their respective binding profiles xi and xj have a combined p-value pvC ≤ 10−20 as in (3.30) or a minimized mutual information p-value pvM M I ≤ 10−10 as in (3.32): i∈ / HS, j ∈ / HS : −10 + ) ∨ (pvC (xi , xj ) ≤ 10−20 ) . i,j = 1 if (pvM M I (xi , xj ) ≤ 10 (5.1) The decision to use the or condition (denoted by the symbol ∨) above is substantiated by our observation in Section 3.3.1 that mutual information analysis leads to very few false positive claims at a significance level of 10−10 . In addition, the test incorporates the information from the filtered correlation analysis in the evaluation of the combined p-value. Since filtered correlation analysis leads to more false positives, it needs to combine its evidence with that of the mutual information analysis and meet a stricter significance threshold to create an edge. 68 Negative edges in the network prove much more complicated to assign and interpret. Section 3.2.1 discusses the inherent difficulty with using mutual information p-values for finding opposing binding relationships. Moreover, it is possible to consider the overlap between extremely bound and unbound sets of genes but the biological interpretation of significant overlap in that scenario is not clear. Hence, it seems reasonable to solely base our decision on assigning negative edges in the network using filtered correlation coefficient p-values. To assign a negative link (− i,j = 1) between the binding profiles xi and xj of the proteins i and j not in the set HS, we use a filtered correlation coefficient p-value pvF CC threshold of 10−10 : i∈ / HS, j ∈ / HS : −10 − ). i,j = 1 if (pvF CC (xi , xj ) ≤ 10 (5.2) Data from the HS class of factors requires particular care. For proteins, it seems natural to classify binding data at a gene as 0s and 1s, representing an absence or presence of association with a gene’s DNA. However, ChIP-chip data of histone acetylation levels, for example, can indicate gradations of acetylation (i.e., hypo, medium, hyperacetylation, etc.). Hence, it seems more natural to consider the continuum of data for members of the HS layer. Although mutual information analysis extends to continuous distributions, finding p-values for this scenario that are consistent with the previously developed pairwise statistics proves more difficult. Hence, we decided to use standard correlation coefficient p-values (using ChIP-chip data at all genes) to quantify the relationships between two factors within the HS layer. The p-values for standard correlation coefficient estimates, pvSCC , can also be assigned using (3.6), (3.14), and (3.15) in succession. Mathematically, let i and j represent two factors from the HS layer with binding profiles xi and xj . We extend positive and negative links in our network model if the p-value pvSCC ≤ 10−20 : i ∈ HS, j ∈ HS : −20 + ) ∧ (ρ̂Xi ,Xj ≥ 0) i,j = 1 if (pvSCC (xi , xj ) ≤ 10 (5.3) i ∈ HS, j ∈ HS : −20 − ) ∧ (ρ̂Xi ,Xj < 0) . i,j = 1 if (pvSCC (xi , xj ) ≤ 10 (5.4) 69 In the above equations, ∧ stands for a logical and, requiring that positive and negative interactions have a positive and negative estimate of the filtered correlation coefficient, respectively. Now let us consider relationships between one protein not in the set HS and one factor from the set HS. Ultimately, we want to make statements such as genes bound by transcription factor i have high acetylation. In order to use filtered correlation coefficient analysis in this context, we need to let the non-HS factor determine the pertinent set of dimensions or genes. Letting Fi denote the set of genes bound by the non-HS protein i, we define the filtered set Fi,j = Fi in (3.1) - (3.5) and use (3.6), (3.14), and (3.15) in succession to derive p-values. Mutual information p-value analysis again proves difficult to extend in this context. For non-HS protein i and HS factor j, we assign positive and negative links in our network using the following thresholds on our filtered correlation coefficient p-value pvF CC : i∈ / HS, j ∈ HS : −4 + i,j = 1 if (pvF CC (xi , xj ) ≤ 10 ) ∧ ρ̂Xi ,Xj ≥ 0 (5.5) i∈ / HS, j ∈ HS : −4 − i,j = 1 if (pvF CC (xi , xj ) ≤ 10 ) ∧ ρ̂Xi ,Xj < 0 . (5.6) Using a lower threshold for assigning links in the scenario above is necessary due to the lower dimensionality of the data. Since, our filtered set now only contains the set of genes defined as bound by a single factor, it often proves impossible to achieve such high significance thresholds as in the previous analyses. Moreover, we used an empirical, non-parametric method for calculating p-values for the above scenario in order to check if they matched our analytical p-value derivation in Section 3.1.1. The algorithm first finds the filtered correlation for the genes bound by the non-HS factor, g ∈ Fi , as described above. Then it picks 10000 random sets of genes of the same size, |Fi |, finds the filtered correlation coefficient for each random set and counts the n filtered correlation coefficients that exceed that of factor i. Then, the non-parametric p-value equals n . 10000 Both approaches achieve very similar p- values, proving the reliability of our method for finding p-values introduced in Section 3.1.1. The next section considers how we can exploit clustering in order to uncover other significant 70 interactions omitted by our initial thresholds. 5.1.1 Assigning Links: Second Pass The thresholding approach for finding significant binding relationships introduced in the previous section has an inherent sensitivity to the chosen cutoffs. To alleviate this problem and to minimize the chance of false negatives, this section describes a second pass in our algorithm for assigning links. As Section 4.2.4 illustrated, clustering ChIP-chip data at a stringent significance cutoff (combined p-value ≤ 10−70 ) evinces various known biological complexes. Moreover, we noticed in Section 3.3.1 that we can infer the binding relationship between POL3 and RSC3 based on the interactions between POL3 and other components of the RSC complex (i.e. RSC8, RSC). Generally, we can deduce that protein i’s binding profile xi has a significant relationship with protein j1 ’s binding profile xj1 contained in cluster Cj , xj1 ∈ Cj , if the combined binding relationship between xi and all the binding vector components of Cj is significant. Using (3.31), we can find the total p-value for the binding relationship between xi and the cluster Cj , or pvT (xi , Cj ), by combining the pvalues for the individual binding dependencies between xi and each component of Cj , where Cj = {xj1 , . . . , xj|Cj | }: |Cj |−1 pvT (xi , Cj ) = pvC (pv(xi , xj1 ) · · · pv(xi , xj|Cj | )) = k|Cj | X (− ln k|Cj | )r . r! r=0 (5.7) Equation 5.7 only depends on the number of p-values combined, |Cj |, and the product of Q|Cj | pv(xi , xjm ), as in (3.31). However, (5.7) incorporates p-values the p-values k|Cj | = m=1 based on binding relationships between independent data sets xi , xj1 , . . . , xj|Cj | . In this context, our use of (3.31) satisfies the independent studies assumption and we do not have to double the p-values as in Section 3.3.1. Moreover, (5.7) does not specify which type of p-values to combine. For a possible positive or negative link, our network algorithm incorprates the combined pvC (xi , xj ) or filtered correlation coefficient pvF CC (xi , xj ) p-values, respectively. For extending a potential positive or negative link, we consider a total p-value 71 pvT (xi , Cj ) ≤ 10−20 or ≤ 10−10 as significant, respectively. With the additional evidence that protein i’s binding profile associates significantly with the biologically related components of Cj , the threshold for significant binding relationship between i and j1 should decrease. Repeating this process for every possible combination of i and j1 , where j1 ∈ Cj , the second pass of our network algorithm assigns positive and negative edges as follows: −20 ] ∧ [(pvM M I (xi , xj1 ) ≤ 10−7.5 ) ∨ (pvC (xi , xj1 ) ≤ 10−15 )] + i,j1 = 1 if [pvT (xi , Cj ) ≤ 10 −10 − ] ∧ [(pvF CC (xi , xj1 ] ≤ 10−7.5 )] . i,j1 = 1 if [pvT (xi , Cj ) ≤ 10 (5.8) The first pass of the network algorithm found 4692 positive and 2629 negative binding relationships, while the second pass uncovered 806 and 574 more significant binding relationships, respectively. The larger number of positive links reflects the fact that significant positive binding relationships are easier to find using ChIP-chip data than significant negative binding relationships. Actual visualization of all the significant links proves unwieldy. Therefore, Figure 5-1 displays only highly significant positive binding relationships that have a minimized mutual information p-value ≤ 10−20 or combined p-value ≤ 10−40 . Different colored ovals represent factors from the various levels and lines between ovals denote significant synergistic interactions between factors. Inspecting the graph, one can notice neighborhoods of nodes that share common biological processes. For example, the MCM-ORC cluster occupies the top left corner of Figure 5-1, with the SIR complex directly to its right. Continuing to the right from the SIR complex, one can observe a congregation of nuclear transport proteins, factors from the active cluster, the POL3 machinery, and the RSC complex at the middle right end of the figure. Having built a network of the nucleus, the next section verifies the ability of our network model to predict protein-protein interactions by comparing it with several previous protein-protein studies. 72 Figure 5-1: Network of highly significant positive binding relationships that have a minimized mutual information p-value ≤ 10−20 or combined p-value ≤ 10−40 . Colored ovals represent factors from the various levels and lines between ovals denote significant synergistic interactions between factors. Different colors represent the five layers: yellow = T F , green = HS, red = HM , blue = N P , and pink = N T . Graph rendering was performed with Pajek (http://vlado.fmf.uni-lj.si/pub/networks/pajek/doc/pajekman.htm). 73 Data Set Interactions Yu et al. 395 Gavin et al. 462 Ho et al. 137 Ito et al. 43 Uetz et al. 17 Overlap P-Value 134 7.42×10−46 129 3.11×10−34 45 9.32×10−16 8 0.0280 4 0.0521 Table 5.1: Comparison between various protein-protein interaction data sets and our network. The first column lists the source of the data. The second column records the number of predictions from each data set between proteins considered in this work. The third column lists the overlap in predicted significant interactions between the five studies and our network. The last column finds p-values for the significance in the overlap using the hypergeometric test statistic discussed in Section 3.2.1 (see text). 5.1.2 Network Comparison Numerous previous large scale and small scale methods have predicted protein-protein interactions with varied success. For comparison, we downloaded data from five protein-protein interaction studies [30–34]. Ito et al. and Uetz et al. used an experimental technique called Yeast Two-Hybrid (Y2H) to infer associations between proteins, while Gavin et al. and Ho et al. used a more accurate mass spectrometry method. Yu et al. incorporated the significant predictions from the above data sets along with interactions found using several small scale studies. Small scale studies generally have less noise in their predictions than the high-throughput methods in [31–34]; therefore, the Yu et al. data set is probably most reliable. Table 5.1 demonstrates the results of the comparison. The first column lists the source of the data. The second column records the number of predictions from each data set between proteins considered in this work. The third column lists the overlap in predicted significant interactions between the five studies and our network. The last column finds p-values for the significance in the overlap using the hypergeometric test statistic discussed in Section 3.2.1. The p-values were calculated by setting u equal to entries in the second column, h to entries in the third column, v = 3771, or the number of significant positive protein-protein binding relationships in our network (i.e., excluding edges with factors in the HS layer), and w = 43956, or all possible protein-protein relationships that could be predicted. 74 The table above shows that the three most reliable data sets overlap significantly with the predictions made in our network. The Y2H data does not have enough predictions in common with our data set in order to accurately determine the significance in the overlap. Moreover, the Y2H technique is considered to have the most amount of noise. Generally, we do not expect a full overlap between the methods since they target different biological mechanisms. Our method finds biological complexes that associate at or near DNA but does not capture other cellular protein-protein interactions. For example, most of the above data sets show that protein Hrp1 and Nab2 form a complex but our analysis demonstrates that the two proteins have very different binding profiles. Moreover, our method can capture binding dependencies between proteins that consistently bind to the same gene targets and hence participate in similar biological processes, but that do not necessarily bind at the same time or associate with one another. And finally, it is generally agreed upon that the data sets above have a large amount of noise. In summary, this comparison validates the ability of our method to find protein-protein interactions predicted in previous studies. Having fully specified the network structure enables us to infer the underlying communication between layers in the eukaryotic transcriptional network in the next section. 5.2 Analysis of Network Topology This section introduces several statistical methods for quantifying the interplay between the layers in our yeast transcriptional network. Table 5.2 shows the number of links within and between the five categories of factors (namely T F , HS, HM , N P , and N T ) for the entire and the positive edge network. Each entry within the table represents the total number of links and the percentage of possible links realized (in parenthesis) between factors from the layers listed in the corresponding entry in the first row and column. For example, the first entry in Table 5.2(a) shows that there are 3659 links between pairs of transcription factors, which represents 17.2% of the total possible links that could be realized between all pairs of TFs. Percentage of links realized represents a normalized measure of the connectedness between two layers, which accounts for the number of factors in each layer. Given two layers 75 Layers TF HS HM NP NT TF HS HM 3659(17.2%) 1018(9.28%) 737(7.58%) 1018(9.28%) 616(44.7%) 776(31.2%) 737(7.58%) 776(31.2%) 198(18.3%) 409(7.32%) 236(16.5%) 208(16.4%) 171(5.16%) 276(32.5%) 100(13.3%) NP 409(7.32%) 236(16.5%) 208(16.4%) 118(33.6%) 97(22.5%) NT 171(5.16%) 276(32.5%) 100(13.3%) 97(22.5%) 95(79.2%) (a) Total links and percentage of links captured for the entire network. Layers TF HS TF 2099(9.84%) 602(5.49%) HS 602(5.49%) 513(37.2%) HM 519(5.33%) 323(13%) NP 328(5.87%) 151(10.6%) NT 108(3.26%) 138(16.3%) HM 519(5.33%) 323(13%) 161(14.9%) 178(14%) 85(11.3%) NP 328(5.87%) 151(10.6%) 178(14%) 113(32.2%) 93(21.5%) NT 108(3.26%) 138(16.3%) 85(11.3%) 93(21.5%) 87(72.5%) (b) Total links and percentage of links captured for the positive edge network. Table 5.2: Total links and percentage of links realized within and between the five layers of the (a) entire and the (b) positive edge transcriptional network. Each entry within each table represents the total number of links and the percentage of possible links realized (in parenthesis) between factors from the layers listed in the corresponding entry in the first row and column. For example, the first entry in Table (a) shows that there are 3659 links between pairs of transcription factors, which represents 17.2% of the total possible links that could be realized between all pairs of TFs. with n and m factors, respectively, and k mutual interactions the inter-level percentage of k links realized is simply 100 nm %. Moreover, if a layer with n elements has l interactions 2l within its layer, the intra-level percentage of links realized is 100 n(n−1) %. Table 5.2(a) measures the level of connectedness or communication within and between layers. By examining the percentage of links realized, we see that most of the communication occurs within layers. The diagonal in Table 5.2(a) shows that the five layers have 17.2%, 44.7%, 18.3%, 33.6%, and 79.2% of intra-connectedness. Moreover, inspecting the entries in row 1 of Table 5.2(a) from left to right, we see that the percentages gradually decrease, with TFs being most highly associated with HSs, followed by HMs, NPs, and NTs. These results confirm the current thought in biology that close collaboration between TFs, HSs, and HMs 76 induces specific classes of genes while NPs and NTs carry out more general mechanisms related to transcription. However, there are still a number of interactions between the levels, illustrating that significant amount of communication between layers of the yeast transcriptional network. Several other network statistics can also quantify the behavior of the different classes of proteins. For, example we can count the number of links stemming from each node, or the number of degrees, and find a distribution of degrees for nodes within a layer. Computing the average of that distribution measures the average level of connectivity for members from each layer. Table 5.3 shows the average number of degrees for proteins within the five layers in both the entire and the positive edge network. Members of the HS level seem most promiscuous in their association with other factors, but, overall, all layers have a similar number of average degrees. Clustering coefficient is a common method for measuring the cliquishness of each node in the network, or how much the node’s nearest neighbors (adjacent nodes) interact with one another. Finding the average clustering coefficient for all proteins within a particular class measures the tendency for cliquishness in each layer. Let node i of layer L have ki nearest neighbors. Moreover, let its ki adjacent nodes have ci connections between them out of a possible ki (ki −1) 2 links. Then the clustering coefficient at node i, or the fraction of links realized between i’s nearest neighbors becomes 2ci . ki (ki −1) Finally, the average clustering coefficient for all factors i in layer L, CCL , takes the form: CCL = 1 X 2ci |L| i∈L ki (ki − 1) (5.9) Table 5.3 shows the average clustering coefficient for proteins within the five layers in both the entire and the positive edge network. The preference for cliquish interaction between nearest neighbors of nodes seems strongest amongst members of the N T class of proteins. Given that almost all nuclear transport factors considered seem to associate with members of the nuclear pore, this measure seems consistent with biological mechanism of the N T class. In summary, the network topology statistics discussed in this section confirm that the 77 Layer Avg. Number Avg. Clustering Name of Degrees Coefficient TF 47.2 0.458 HS 68.9 0.427 HM 47.3 0.495 NP 43.9 0.539 NT 52.1 0.631 All 50.5 0.473 (a) Network topology statistics for the entire network. Layer Avg. Number Avg. Clustering Name of Degrees Coefficient TF 27.8 0.526 HS 42.3 0.482 HM 30.4 0.539 NP 36.1 0.575 NT 37.4 0.676 All 31.4 0.532 (b) Network topology statistics for the positive edge network. Table 5.3: Network topology statistics for (a) the entire and (b) the positive edge transcriptional network. 78 T F , HM , and HS levels seem most tightly coupled. However, over a thousand significant interactions also occur between factors outside of these subsets, demonstrating that eukaryotic transcription depends on the intricate interplay between all five layers presented. 79 80 Chapter 6 Conclusion Genome-wide binding data from ChIP-chip experiments provides a wealth of information about protein-gene interactions and about the underlying workings of eukaryotic cells. In order to gain a holistic understanding of the non-linear process of transcription, our work examines the communication between various classes of regulators in the yeast specie Saccharomyces cerevisiae. We used ChIP-chip data to quantify the interplay between five categories of factors that affect transcription: histone states, histone modifiers and nucleosome remodelers, transcription factors, nuclear processing proteins, and nuclear transport factors. Following the introduction in Chapter 1, Chapter 2 described the non-trivial process of incorporating the different sources of ChIP-chip data into a coherent set. Due to the heterogeneity of the obtained data, this chapter further discussed the need for data normalization and showed the benefits of a normalized data set. Chapter 3 used the processed data to find pairwise statistics that test binding dependencies between two proteins. Combining the complementary pairwise measures of filtered correlation coefficient and mutual information p-values reduced the false positive and false negative rates in our biological predictions and increased the reliability of our analysis. Next, Chapter 4 uncovered group-wise relationships between factors using Principal Component Analysis (PCA) and clustering. PCA allowed us to visualize subsets of the data in 2-D. Based on the developed pairwise measures, Chapter 4 also introduced a novel semi-supervised clustering algorithm that preserves information 81 about elements of a cluster in order to better capture group-wise dependencies between proteins. And finally, Chapter 5 combined the methodology developed in the previous chapters in order to build a multi-layered transcriptional network of the nucleus. Throughout the theoretical analysis, we validated various known biological processes that occur in dextrose rich conditions. Moreover, our analysis and previous literature [7] showed that usage ChIP-chip experiments in rich media YPD conditions can still provide insight about biological processes in other growth environments. Further, the fact that our data does not synchronize cells at a particular phase of growth, does not preclude formulation of hypothesis about biological processes that occur only in a specific phase of the cell cycle, as shown with the MCM-ORC complex in Section 4.2.4. Our theoretical analysis also uncovered several novel biological hypothesis. In particular, Chapter 3 suggests that proteins LRP1 and RRP6 might participate in a mechanism that ensures quality control at falsely transcribed genes, by binding to inactive genes in order to readily degrade unwanted production of their mRNAs. Moreover, Chapter 4 hypothesized that SIR2, long thought to silence the transcription of DNA, might have a role in transcriptional activation at the ORF region of genes. And finally, Chapter 5 quantified the communication between layers in biological transcriptional networks for the first time in the literature. The network topology statistics discussed in Chapter 5 confirm that the T F , HM , and HS levels seem most tightly coupled. However, over a thousand significant interactions also occur between factors outside of these three categories, demonstrating that eukaryotic transcription depends on the intricate interplay between all five layers presented. 82 Appendix A Semi-Supervised Clustering based on KL-divergence Semi-supervised clustering is more theoretically intuitive when objects are represented using distibutions. We chose to define the probability distribution of object i as the normalized binding profile of binding ratios across all genes, or normalized rows in the BR matrix. Specifically, if xi = [xi,g1 . . . xi,g|G| ] denotes a row vector from the BR matrix for protein i across all genes g in the set G, the normalized binding profile (or binding distribution) for protein i is xi,g . g∈G xi,g P (g|i) = P (A.1) We implemented a semi-supervised clustering algorithm that can perform hierarchical partitioning of distributions as defined in (A.1). We treat the binding distribution P (g|i) for each protein i as an object (or node) and merge objects into clusters (aggregations of nodes) based on a distance metric using the Kullback-Leibler divergence. The KL-divergence represents how different two distributions are. For the normalized binding profiles, it is defined as 83 D(P (g|Ck )kP (g|Ck ∪ Cl )) = X g∈G P (g|Ck ) log P (g|Ck ) , P (g|Ck ∪ Cl ) (A.2) where Ck and Cl represent two clusters and Ck ∪ Cl denotes the merged partition of Ck and Cl . In information theoretic terms, the KL-divergence represents the information lost by joining Ck and Cl into a combined distribution for Ck ∪ Cl . Note that the KL-divergence is not a distance metric; it does not satisfy the triangle inequality nor is it commutative. For the semi-supervised clustering algorithm, we define the following distance measure based on the KL-divergence: d(Ck , Cl ) = |Ck |D(P (g|Ck )kP (g|Ck ∪ Cl )) + |Cl |D(P (g|Cl )kP (g|Ck ∪ Cl )) . (A.3) The above equation (A.3) can intuitively be interpreted as the total information lost in combining the distributions of Ck and Cl into P (g|Ck ∪ Cl ), where the weights |Ck | and |Cl | account for the number of nodes in each cluster. When a partition Ck contains just a single protein, |Ck | = 1. Note that this distance, which we refer to as the KL-distance, is commutative. The hierarchical clustering proceeds by initially treating all objects as clusters and then successively merging partitions that are “closest”, in terms of the KL distance in (A.3). When objects are merged into clusters, the new clusters preserve information about their component nodes in their distribution. The combined distribution is a weighted average of the constituent distributions: P (g|Ck ∪ Cl ) = 1 (|Ck |P (g|Ck ) + |Cl |P (g|Cl )) . |Ck | + |Cl | (A.4) The performance of the clustering algorithm is illustrated in Figure A-1. Figure A-1 partitions the nuclear processing and nuclear transport factors and shows similar results as the PCA analysis in Section 4.1. The related NTs cluster together once again. In addition, HMT1 (a protein that adds methyl groups to other proteins), THO2 (a protein that 84 starts transcription), and YRA1 (a protein that exports mRNA from the nucleus) also cluster tightly with the NTs in [13], suggesting a common regulatory mechanism. Moreover, proteins RRP6, an mRNA surveillance factor, and mRNA export factor NPL3 cluster together, supporting the results in [16] that demonstrate coupling of mRNA quality assurance to mRNA export. The algorithm also captures the relationship between LRP1 and PRP20 discussed in section 3.3.1. However, when the algorithm is run on the entire data, it fails to find commonality between several known biological processes. In the end, the pairwise statistics discussed in Chapter 3 prove much more biologically meaningful than KL-divergence. 85 1.2 Distance 1 0.8 0.6 0.4 0.2 Genes 0 YAL002W YAL001C YAL004W YAL003W YAL005C YAL008W YAL007C YAL009W YAL011W YAL010C YAL013W YAL012W YAL016W YAL015C YAL014C YAL017W YAL018C YAL019W YAL020C YAL023C YAL022C YAL021C YAL025C YAL024C YAL028W YAL027W YAL026C YAL030W YAL029C YAL033W YAL032C YAL031C YAL035C−A YAL034C YAL035W YAL036C YAL038W YAL037W YAL039C YAL041W YAL040C YAL043C−A YAL042W YAL043C YAL046C YAL044C YAL048C YAL047C YAL053W YAL051W YAL049C YAL055W YAL054C YAL058C−A YAL056W YAL058W YAL060W YAL059W YAL062W YAL061W YAL063C YAL064W YAL065C YAL068C YAL067C YAR003W YAR002W YAR007C YAR008W YAR009C YAR015W YAR014C YAR010C YAR019C YAR018C YAR027W YAR023C YAR020C YAR029W YAR028W YAR033W YAR031W YAR044W YAR042W YAR035W YAR053W YAR050W YAR062W YAR061W YAR060C YAR066W YAR064W YAR068W YAR070C YAR069C YAR073W YAR071W YAR075W YBL001C YBL002W YBL004W YBL003C YBL005W−A YBL005W YBL005W−B YBL007C YBL006C YBL009W YBL008W YBL011W YBL010C YBL012C YBL013W YBL014C YBL016W YBL015W YBL019W YBL018C YBL017C YBL020W YBL021C YBL024W YBL023C YBL022C YBL026W YBL025W YBL027W YBL029W YBL028C YBL031W YBL030C YBL032W YBL033C YBL036C YBL035C YBL034C YBL038W YBL037W YBL041W YBL040C YBL039C YBL043W YBL042C YBL044W YBL045C YBL046W YBL049W YBL047C YBL050W YBL051C YBL054W YBL053W YBL052C YBL056W YBL055C YBL059W YBL058W YBL057C YBL060W YBL061C YBL063W YBL062W YBL067C YBL066C YBL064C YBL069W YBL068W YBL072C YBL071C YBL070C YBL075C YBL074C YBL079W YBL078C YBL076C YBL081W YBL080C YBL083C YBL082C YBL085W YBL084C YBL086C YBL088C YBL087C YBL090W YBL089W YBL091C YBL092W YBL093C YBL095W YBL094C YBL096C YBL098W YBL097W YBL099W YBL100C YBL101W−B YBL101W−A YBL101C YBL102W YBL103C YBL106C YBL105C YBL104C YBL109W YBL107C YBL113C YBL112C YBL111C YBR002C YBR001C YBR003W YBR004C YBR006W YBR005W YBR007C YBR009C YBR008C YBR012W−A YBR010W YBR011C YBR012W−B YBR013C YBR016W YBR015C YBR014C YBR018C YBR017C YBR020W YBR019C YBR022W YBR021W YBR023C YBR024W YBR025C YBR028C YBR027C YBR026C YBR030W YBR029C YBR033W YBR031W YBR036C YBR035C YBR034C YBR038W YBR037C YBR041W YBR040W YBR039W YBR043C YBR042C YBR046C YBR045C YBR044C YBR048W YBR047W YBR050C YBR049C YBR051W YBR053C YBR052C YBR054W YBR055C YBR056W YBR058C YBR057C YBR060C YBR059C YBR063C YBR062C YBR061C YBR064W YBR065C YBR067C YBR066C YBR070C YBR069C YBR068C YBR072W YBR071W YBR075W YBR074W YBR073W YBR076W YBR077C YBR078W YBR080C YBR079C YBR082C YBR081C YBR084C−A YBR083W YBR085W YBR084W YBR086C YBR087W YBR088C YBR089W YBR091C YBR090C YBR093C YBR092C YBR094W YBR096W YBR095C YBR098W YBR097W YBR100W YBR099C YBR103W YBR102C YBR101C YBR104W YBR105C YBR106W YBR108W YBR107C YBR110W YBR109C YBR114W YBR112C YBR111C YBR117C YBR115C YBR119W YBR118W YBR122C YBR121C YBR120C YBR124W YBR123C YBR127C YBR126C YBR125C YBR129C YBR128C YBR131W YBR130C YBR134W YBR133C YBR132C YBR136W YBR135W YBR139W YBR137W YBR140C YBR142W YBR141C YBR146W YBR145W YBR143C YBR148W YBR147W YBR149W YBR150C YBR153W YBR152W YBR151W YBR155W YBR154C YBR158W YBR157C YBR156C YBR160W YBR159W YBR162W−A YBR161W YBR162C YBR163W YBR164C YBR165W YBR166C YBR168W YBR167C YBR169C YBR171W YBR170C YBR175W YBR173C YBR172C YBR176W YBR177C YBR178W YBR180W YBR179C YBR182C YBR181C YBR184W YBR183W YBR187W YBR186W YBR185C YBR189W YBR188C YBR192W YBR191W YBR193C YBR194W YBR195C YBR198C YBR197C YBR196C YBR200W YBR199W YBR202W YBR201W YBR203W YBR205W YBR204C YBR207W YBR206W YBR210W YBR209W YBR208C YBR212W YBR211C YBR215W YBR214W YBR213W YBR217W YBR216C YBR220C YBR218C YBR224W YBR222C YBR221C YBR225W YBR227C YBR228W YBR230C YBR229C YBR233W YBR231C YBR235W YBR234C YBR237W YBR236C YBR238C YBR240C YBR239C YBR242W YBR241C YBR243C YBR244W YBR245C YBR246W YBR248C YBR247C YBR250W YBR249C YBR252W YBR251W YBR253W YBR255W YBR254C YBR257W YBR256C YBR259W YBR258C YBR260C YBR263W YBR261C YBR267W YBR265W YBR264C YBR268W YBR269C YBR271W YBR270C YBR274W YBR273C YBR272C YBR276C YBR275C YBR279W YBR278W YBR280C YBR282W YBR281C YBR285W YBR284W YBR283C YBR287W YBR286W YBR289W YBR288C YBR290W YBR293W YBR291C YBR295W YBR294W YBR297W YBR296C YBR298C YBR299W YBR300C YBR301W YCL004W YCL002C YCL005W YCL006C YCL008C YCL007C YCL011C YCL010C YCL009C YCL014W YCL016C YCL019W YCL018W YCL017C YCL020W YCL022C YCL024W YCL023C YCL025C YCL028W YCL027W YCL030C YCL029C YCL032W YCL031C YCL033C YCL034W YCL035C YCL036W YCL038C YCL037C YCL039W YCL041C YCL042W YCL043C YCL047C YCL045C YCL044C YCL048W YCL049C YCL051W YCL050C YCL052C YCL055W YCL054W YCL057W YCL056C YCL058C YCL061C YCL059C YCL063W YCL064C YCL066W YCL065W YCL067C YCL069W YCL068C YCR001W YCL074W YCL073C YCR003W YCR002C YCR006C YCR005C YCR004C YCR008W YCR007C YCR010C YCR009C YCR012W YCR011C YCR013C YCR015C YCR014C YCR016W YCR018C YCR017C YCR019W YCR020C YCR020C−A YCR023C YCR021C YCR024C−A YCR024C YCR026C YCR025C YCR030C YCR028C YCR027C YCR033W YCR032W YCR034W YCR036W YCR035C YCR038C YCR037C YCR041W YCR040W YCR039C YCR043C YCR042C YCR045C YCR044C YCR048W YCR047C YCR046C YCR051W YCR050C YCR053W YCR052W YCR054C YCR059C YCR057C YCR063W YCR061W YCR060W YCR066W YCR065W YCR068W YCR067C YCR069W YCR072C YCR071C YCR075C YCR073C YCR079W YCR077C YCR076C YCR082W YCR081W YCR083W YCR084C YCR089W YCR088W YCR086W YCR091W YCR090C YCR094W YCR093W YCR092C YCR096C YCR095C YCR101C YCR100C YCR098C YCR103C YCR102C YCR105W YCR104W YCR107W YCR106W YDL001W YDL003W YDL002C YDL004W YDL006W YDL005C YDL008W YDL007W YDL010W YDL009C YDL011C YDL013W YDL012C YDL014W YDL015C YDL017W YDL016C YDL018C YDL020C YDL019C YDL022W YDL021W YDL023C YDL025C YDL024C YDL029W YDL028C YDL027C YDL031W YDL030W YDL035C YDL033C YDL038C YDL037C YDL036C YDL040C YDL039C YDL044C YDL043C YDL042C YDL046W YDL045C YDL047W YDL049C YDL048C YDL051W YDL050C YDL053C YDL052C YDL056W YDL055C YDL054C YDL058W YDL057W YDL060W YDL059C YDL061C YDL062W YDL063C YDL064W YDL065C YDL066W YDL068W YDL067C YDL070W YDL069C YDL073W YDL072C YDL071C YDL075W YDL074C YDL078C YDL077C YDL076C YDL080C YDL079C YDL082W YDL081C YDL085W YDL084W YDL083C YDL086W YDL087C YDL089W YDL088C YDL090C YDL092W YDL091C YDL093W YDL095W YDL094C YDL097C YDL096C YDL099W YDL098C YDL102W YDL101C YDL100C YDL104C YDL103C YDL105W YDL107W YDL106C YDL108W YDL109C YDL112W YDL111C YDL110C YDL114W YDL113C YDL116W YDL115C YDL118W YDL117W YDL119C YDL120W YDL121C YDL124W YDL123W YDL122W YDL126C YDL125C YDL129W YDL128W YDL127W YDL131W YDL130W YDL133W YDL132W YDL136W YDL135C YDL134C YDL138W YDL137W YDL141W YDL140C YDL139C YDL143W YDL142C YDL146W YDL145C YDL144C YDL147W YDL148C YDL150W YDL149W YDL152W YDL151C YDL153C YDL155W YDL154W YDL156W YDL158C YDL157C YDL159W YDL160C YDL163W YDL161W YDL165W YDL164C YDL166C YDL168W YDL167C YDL170W YDL169C YDL171C YDL173W YDL172C YDL176W YDL175C YDL174C YDL178W YDL177C YDL180W YDL179W YDL182W YDL181W YDL183C YDL185W YDL184C YDL186W YDL188C YDL187C YDL189W YDL190C YDL193W YDL192W YDL191W YDL195W YDL194W YDL198C YDL197C YDL201W YDL200C YDL199C YDL202W YDL203C YDL204W YDL206W YDL205C YDL208W YDL207W YDL210W YDL209C YDL211C YDL212W YDL213C YDL215C YDL214C YDL218W YDL217C YDL216C YDL219W YDL220C YDL224C YDL223C YDL222C YDL225W YDL226C YDL229W YDL228C YDL227C YDL230W YDL231C YDL233W YDL232W YDL236W YDL235C YDL234C YDL237W YDL238C YDL241W YDL240W YDL239C YDL244W YDL243C YDL247W YDL246C YDL245C YDR002W YDR001C YDR004W YDR003W YDR007W YDR006C YDR005C YDR011W YDR009W YDR014W YDR013W YDR012W YDR016C YDR015C YDR018C YDR017C YDR021W YDR020C YDR019C YDR023W YDR022C YDR025W YDR027C YDR026C YDR029W YDR028C YDR031W YDR030C YDR032C YDR033W YDR034C YDR035W YDR036C YDR037W YDR039C YDR038C YDR041W YDR040C YDR044W YDR043C YDR042C YDR046C YDR045C YDR047W YDR049W YDR048C YDR051C YDR050C YDR053W YDR052C YDR055W YDR054C YDR056C YDR057W YDR058C YDR061W YDR060W YDR059C YDR063W YDR062W YDR065W YDR064W YDR066C YDR068W YDR067C YDR070C YDR069C YDR073W YDR072C YDR071C YDR075W YDR074W YDR077W YDR076W YDR078C YDR080W YDR079W YDR083W YDR082W YDR081C YDR085C YDR084C YDR087C YDR086C YDR089W YDR088C YDR090C YDR092W YDR091C YDR096W YDR094W YDR093W YDR098C YDR097C YDR100W YDR099W YDR101C YDR103W YDR104C YDR106W YDR105C YDR108W YDR107C YDR109C YDR110W YDR111C YDR112W YDR114C YDR113C YDR115W YDR116C YDR118W YDR117C YDR119W YDR121W YDR120C YDR122W YDR123C YDR124W YDR126W YDR125C YDR128W YDR127W YDR131C YDR130C YDR129C YDR134C YDR132C YDR137W YDR135C YDR138W YDR140W YDR139C YDR142C YDR141C YDR145W YDR144C YDR143C YDR147W YDR146C YDR150W YDR149C YDR148C YDR152W YDR151C YDR156W YDR153C YDR160W YDR158W YDR157W YDR161W YDR162C YDR163W YDR165W YDR164C YDR167W YDR166C YDR168W YDR170C YDR169C YDR172W YDR173C YDR174W YDR175C YDR178W YDR177W YDR176W YDR179W−A YDR179C YDR180W YDR182W YDR181C YDR183W YDR184C YDR189W YDR188W YDR185C YDR191W YDR190C YDR193W YDR192C YDR195W YDR194C YDR196C YDR197W YDR198C YDR199W YDR201W YDR200C YDR204W YDR203W YDR206W YDR205W YDR207C YDR210W YDR208W YDR212W YDR211W YDR216W YDR214W YDR213W YDR219C YDR217C YDR223W YDR222W YDR221W YDR225W YDR224C YDR227W YDR226W YDR230W YDR229W YDR228C YDR232W YDR231C YDR235W YDR233C YDR236C YDR237W YDR238C YDR241W YDR240C YDR239C YDR242W YDR243C YDR245W YDR244W YDR247W YDR246W YDR248C YDR251W YDR249C YDR252W YDR254W YDR253C YDR256C YDR255C YDR259C YDR258C YDR257C YDR261C YDR260C YDR262W YDR263C YDR265W YDR264C YDR266C YDR268W YDR267C YDR270W YDR272W YDR271C YDR275W YDR273W YDR279W YDR277C YDR276C YDR280W YDR281C YDR283C YDR282C YDR285W YDR284C YDR286C YDR288W YDR287W YDR291W YDR290W YDR289C YDR293C YDR292C YDR296W YDR295C YDR294C YDR297W YDR298C YDR299W YDR300C YDR302W YDR301W YDR303C YDR305C YDR304C YDR307W YDR306C YDR308C YDR310C YDR309C YDR312W YDR311W YDR313C YDR315C YDR314C YDR317W YDR316W YDR318W YDR320C YDR319C YDR322W YDR321W YDR325W YDR324C YDR323C YDR327W YDR326C YDR329C YDR328C YDR332W YDR331W YDR330W YDR334W YDR333C YDR337W YDR336W YDR335W YDR339C YDR338C YDR340W YDR342C YDR341C YDR345C YDR343C YDR347W YDR346C YDR350C YDR349C YDR348C YDR352W YDR351W YDR354W YDR353W YDR355C YDR356W YDR357C YDR358W YDR360W YDR359C YDR362C YDR361C YDR363W YDR364C YDR368W YDR367W YDR365C YDR370C YDR369C YDR371W YDR373W YDR372C YDR375C YDR374C YDR377W YDR376W YDR378C YDR380W YDR379W YDR382W YDR381W YDR385W YDR384C YDR383C YDR386W YDR387C YDR389W YDR388W YDR390C YDR392W YDR391C YDR395W YDR394W YDR393W YDR396W YDR397C YDR399W YDR398W YDR401W YDR400W YDR402C YDR403W YDR404C YDR406W YDR405W YDR407C YDR409W YDR408C YDR412W YDR411C YDR410C YDR414C YDR413C YDR416W YDR415C YDR420W YDR419W YDR418W YDR421W YDR422C YDR425W YDR424C YDR423C YDR427W YDR428C YDR430C YDR429C YDR433W YDR432W YDR431W YDR434W YDR435C YDR438W YDR437W YDR436W YDR440W YDR439W YDR442W YDR441C YDR443C YDR446W YDR444W YDR448W YDR447C YDR450W YDR449C YDR451C YDR452W YDR453C YDR457W YDR456W YDR454C YDR459C YDR458C YDR462W YDR461W YDR460W YDR464W YDR463W YDR466W YDR465C YDR469W YDR468C YDR467C YDR471W YDR470C YDR472W YDR474C YDR473C YDR476C YDR475C YDR478W YDR477W YDR479C YDR480W YDR481C YDR483W YDR482C YDR484W YDR486C YDR485C YDR488C YDR487C YDR489W YDR492W YDR490C YDR494W YDR493W YDR497C YDR496C YDR495C YDR499W YDR500C YDR501W YDR502C YDR506C YDR505C YDR503C YDR508C YDR507C YDR511W YDR510W YDR509W YDR513W YDR512C YDR515W YDR514C YDR518W YDR517W YDR516C YDR519W YDR520C YDR521W YDR523C YDR522C YDR527W YDR524C YDR528W YDR530C YDR529C YDR531W YDR532C YDR534C YDR533C YDR539W YDR538W YDR536W YDR541C YDR540C YDR542W YDR545W YDR544C YEL002C YEL001C YEL004W YEL003W YEL005C YEL007W YEL006W YEL011W YEL009C YEL013W YEL012W YEL014C YEL015W YEL016C YEL018W YEL017W YEL019C YEL021W YEL020C YEL022W YEL024W YEL023C YEL026W YEL025C YEL027W YEL029C YEL032W YEL031W YEL030W YEL034W YEL035C YEL038W YEL037C YEL036C YEL040W YEL039C YEL043W YEL042W YEL041W YEL044W YEL046C YEL048C YEL047C YEL049W YEL051W YEL050C YEL052W YEL053C YEL056W YEL055C YEL054C YEL058W YEL057C YEL059C−A YEL061C YEL060C YEL062W YEL063C YEL065W YEL064C YEL066W YEL069C YEL067C YEL071W YEL070W YEL072W YEL075C YEL073C YEL076C−A YEL076C YEL076W−C YEL077C YER002W YER001W YER003C YER005W YER004W YER007C−A YER006W YER007W YER009W YER008C YER012W YER011W YER010C YER014W YER013W YER016W YER015W YER019C−A YER018C YER017C YER020W YER019W YER023W YER022W YER021W YER025W YER024W YER028C YER027C YER026C YER030W YER029C YER032W YER031C YER035W YER034W YER033C YER037W YER036C YER040W YER039C YER038C YER042W YER041W YER044C−A YER044C YER043C YER046W YER045C YER048C YER047C YER049W YER051W YER050C YER053C YER052C YER056C YER055C YER054C YER056C−A YER057C YER060W YER059W YER058W YER060w−A YER061C YER063W YER062C YER066W YER065C YER064C YER068W YER067W YER070W YER069W YER071C YER073W YER072W YER074W YER076C YER075C YER078C YER077C YER080W YER079W YER081W YER083C YER082C YER084W YER085C YER087W YER086W YER088C YER090W YER089C YER092W YER091C YER093C−A YER093C YER094C YER096W YER095W YER100W YER098W YER101C YER103W YER102W YER104W YER106W YER105C YER109C YER107C YER111C YER110C YER112W YER114C YER113C YER116C YER115C YER117W YER119C YER118C YER119C−A YER120W YER121W YER123W YER122C YER125W YER124C YER127W YER126C YER129W YER128W YER130C YER131W YER132C YER133W YER136W YER134C YER138C YER137C YER141W YER140W YER139C YER143W YER142C YER145C YER144C YER146W YER148W YER147C YER150W YER149C YER154W YER153C YER151C YER156C YER155C YER157W YER159C YER158C YER161C YER160C YER163C YER162C YER166W YER165W YER164W YER167W YER168C YER171W YER170W YER169W YER173W YER172C YER176W YER175C YER174C YER178W YER177W YER179W YER180C YER182W YER184C YER183C YER185W YER186C YER189W YER188W YER187W YER190W YFL001W YFL003C YFL002C YFL006W YFL005W YFL004W YFL008W YFL007W YFL009W YFL011W YFL010C YFL012W YFL013C YFL013W−A YFL014W YFL016C YFL018C YFL017C YFL021W YFL020C YFL023W YFL022C YFL024C YFL026W YFL025C YFL029C YFL028C YFL027C YFL031W YFL030W YFL036W YFL034W YFL033C YFL037W YFL038C YFL040W YFL039C YFL041W YFL044C YFL042C YFL046W YFL045C YFL047W YFL049W YFL048C YFL051C YFL050C YFL053W YFL052W YFL054C YFL055W YFL056C YFL059W YFL058W YFL062W YFL061W YFL060C YFL063W YFL064C YFL067W YFL066C YFL065C YFR002W YFR001W YFR004W YFR003C YFR005C YFR007W YFR006W YFR009W YFR008W YFR010W YFR012W YFR011C YFR013W YFR014C YFR017C YFR016C YFR015C YFR019W YFR018C YFR023W YFR022W YFR021W YFR024C−A YFR025C YFR027W YFR026C YFR030W YFR029W YFR028C YFR031C−A YFR031C YFR034C YFR033C YFR032C YFR036W YFR035C YFR038W YFR037C YFR040W YFR039C YFR041C YFR042W YFR043C YFR045W YFR044C YFR046C YFR048W YFR047C YFR049W YFR051C YFR050C YFR052W YFR053C YFR057W YFR055W YGL002W YGL001C YGL003C YGL005C YGL004C YGL006W YGL009C YGL008C YGL010W YGL011C YGL012W YGL014W YGL013C YGL016W YGL015C YGL017W YGL018C YGL019W YGL021W YGL020C YGL022W YGL023C YGL027C YGL026C YGL025C YGL029W YGL028C YGL030W YGL032C YGL031C YGL033W YGL035C YGL036W YGL037C YGL039W YGL038C YGL040C YGL043W YGL044C YGL047W YGL045W YGL048C YGL050W YGL049C YGL053W YGL052W YGL051W YGL055W YGL054C YGL057C YGL056C YGL060W YGL059W YGL058W YGL063W YGL061C YGL066W YGL065C YGL064C YGL068W YGL067W YGL071W YGL070C YGL069C YGL073W YGL072C YGL075C YGL074C YGL078C YGL077C YGL076C YGL080W YGL079W YGL083W YGL082W YGL081W YGL085W YGL084C YGL086W YGL087C YGL090W YGL089C YGL091C YGL093W YGL092W YGL096W YGL095C YGL094C YGL098W YGL097W YGL101W YGL100W YGL099W YGL103W YGL102C YGL105W YGL104C YGL106W YGL108C YGL107C YGL111W YGL110C YGL114W YGL113W YGL112C YGL116W YGL115W YGL119W YGL117W YGL120C YGL122C YGL121C YGL123W YGL124C YGL126W YGL125W YGL127C YGL129C YGL128C YGL130W YGL132W YGL131C YGL134W YGL133W YGL135W YGL137W YGL136C YGL139W YGL138C YGL141W YGL140C YGL144C YGL143C YGL142C YGL145W YGL146C YGL149W YGL148W YGL147C YGL151W YGL150C YGL153W YGL152C YGL154C YGL156W YGL155W YGL158W YGL157W YGL160W YGL159W YGL161C YGL162W YGL163C YGL166W YGL165C YGL164C YGL168W YGL167C YGL169W YGL170C YGL172W YGL171W YGL173C YGL174W YGL175C YGL178W YGL176C YGL179C YGL181W YGL180W YGL185C YGL184C YGL183C YGL187C YGL186C YGL190C YGL189C YGL192W YGL191W YGL193C YGL195W YGL194C YGL198W YGL197W YGL196W YGL200C YGL199C YGL202W YGL201C YGL203C YGL205W YGL206C YGL208W YGL207W YGL211W YGL210W YGL209W YGL212W YGL213C YGL216W YGL215W YGL214W YGL219C YGL217C YGL220W YGL222C YGL221C YGL224C YGL223C YGL226W YGL225W YGL228W YGL227W YGL229C YGL231C YGL230C YGL234W YGL233W YGL232W YGL237C YGL236C YGL238W YGL241W YGL239C YGL243W YGL242C YGL245W YGL244W YGL248W YGL247W YGL246C YGL250W YGL249W YGL253W YGL252C YGL251C YGL255W YGL254W YGL256W YGL258W YGL257C YGL259W YGL261C YGL263W YGL262W YGR003W YGR002C YGR001C YGR004W YGR005C YGR007W YGR006W YGR008C YGR010W YGR009C YGR012W YGR011W YGR014W YGR013W YGR015C YGR017W YGR016W YGR019W YGR021W YGR020C YGR023W YGR024C YGR026W YGR025W YGR027C YGR029W YGR028W YGR031W YGR030C YGR032W YGR034W YGR033C YGR036C YGR035C YGR040W YGR038W YGR037C YGR042W YGR041W YGR045C YGR044C YGR043C YGR046W YGR047C YGR049W YGR048W YGR052W YGR054W YGR053C YGR056W YGR055W YGR059W YGR058W YGR057C YGR060W YGR061C YGR064W YGR063C YGR062C YGR066C YGR065C YGR068C YGR067C YGR070W YGR069W YGR071C YGR074W YGR072W YGR077C YGR076C YGR075C YGR079W YGR078C YGR080W YGR082W YGR081C YGR084C YGR083C YGR086C YGR085C YGR089W YGR088W YGR087C YGR091W YGR090W YGR094W YGR093W YGR092W YGR096W YGR095C YGR097W YGR099W YGR098C YGR101W YGR100W YGR103W YGR102C YGR105W YGR104C YGR106C YGR108W YGR109C YGR112W YGR111W YGR110W YGR113W YGR114C YGR116W YGR115C YGR118W YGR117C YGR119C YGR121C YGR120C YGR122W YGR124W YGR123C YGR126W YGR125W YGR127W YGR129W YGR128C YGR131W YGR130C YGR133W YGR132C YGR136W YGR135W YGR134W YGR137W YGR138C YGR142W YGR141W YGR140W YGR144W YGR143W YGR145W YGR147C YGR146C YGR149W YGR148C YGR152C YGR150C YGR153W YGR155W YGR154C YGR157W YGR156W YGR160W YGR159C YGR158C YGR162W YGR161C YGR165W YGR164W YGR163W YGR167W YGR166W YGR169C YGR168C YGR170W YGR172C YGR171C YGR173W YGR174C YGR176W YGR175C YGR177C YGR179C YGR178C YGR181W YGR180C YGR182C YGR184C YGR183C YGR186W YGR185C YGR189C YGR188C YGR187C YGR191W YGR190C YGR194C YGR193C YGR192C YGR195W YGR196C YGR199W YGR198W YGR197C YGR201C YGR200C YGR203W YGR202C YGR206W YGR205W YGR204W YGR208W YGR207C YGR211W YGR210C YGR209C YGR212W YGR213C YGR215W YGR214W YGR218W YGR217W YGR216C YGR219W YGR220C YGR222W YGR221C YGR223C YGR225W YGR224W YGR228W YGR227W YGR226C YGR230W YGR229C YGR232W YGR231C YGR234W YGR233C YGR235C YGR237C YGR236C YGR240C YGR239C YGR238C YGR243W YGR241C YGR246C YGR245C YGR244C YGR248W YGR247W YGR249W YGR250C YGR252W YGR251W YGR253C YGR254W YGR255C YGR256W YGR258C YGR257C YGR260W YGR259C YGR263C YGR262C YGR261C YGR265W YGR264C YGR266W YGR267C YGR270W YGR269W YGR268C YGR271W YGR272C YGR275W YGR274C YGR273C YGR277C YGR276C YGR278W YGR280C YGR279C YGR281W YGR282C YGR284C YGR283C YGR287C YGR286C YGR285C YGR288W YGR289C YGR292W YGR290W YGR293C YGR294W YGR295C YGR296W YHL002W YHL001W YHL005C YHL003C YHL007C YHL006C YHL010C YHL009C YHL008C YHL012W YHL011C YHL015W YHL014C YHL013C YHL017W YHL016C YHL018W YHL019C YHL022C YHL021C YHL020C YHL024W YHL023C YHL025W YHL027W YHL026C YHL028W YHL029C YHL030W YHL032C YHL031C YHL034C YHL033C YHL036W YHL035C YHL039W YHL038C YHL040C YHL042W YHL041W YHL044W YHL043W YHL046C YHL048W YHL047C YHR001W YHL050C YHL049C YHR001W−A YHR002W YHR004C YHR003C YHR006W YHR005C YHR007C YHR009C YHR008C YHR012W YHR011W YHR010W YHR014W YHR013C YHR015W YHR017W YHR016C YHR019C YHR018C YHR020W YHR021C YHR023W YHR022C YHR024C YHR026W YHR025W YHR029C YHR028C YHR027C YHR031C YHR030C YHR033W YHR032W YHR034C YHR036W YHR035W YHR038W YHR037W YHR040W YHR039C YHR041C YHR042W YHR043C YHR045W YHR044C YHR046C YHR048W YHR047C YHR051W YHR050W YHR049W YHR052W YHR053C YHR055C YHR054C YHR058C YHR057C YHR056C YHR060W YHR059W YHR063C YHR062C YHR061C YHR065C YHR064C YHR067W YHR066W YHR068W YHR070W YHR069C YHR072W YHR071W YHR074W YHR073W YHR075C YHR076W YHR077C YHR078W YHR081W YHR080C YHR083W YHR082C YHR085W YHR084W YHR088W YHR087W YHR086W YHR090C YHR089C YHR093W YHR092C YHR091C YHR095W YHR094C YHR098C YHR097C YHR096C YHR099W YHR100C YHR102W YHR101C YHR105W YHR104W YHR103W YHR106W YHR107C YHR110W YHR109W YHR108W YHR111W YHR112C YHR114W YHR113W YHR115C YHR117W YHR116W YHR119W YHR118C YHR122W YHR121W YHR120W YHR124W YHR123W YHR128W YHR127W YHR126C YHR131C YHR129C YHR134W YHR133C YHR132C YHR136C YHR135C YHR137W YHR138C YHR140W YHR139C YHR141C YHR143W YHR142W YHR143W−A YHR146W YHR144C YHR148W YHR147C YHR150W YHR149C YHR151C YHR152W YHR153C YHR155W YHR154W YHR157W YHR156C YHR158C YHR159W YHR160C YHR163W YHR162W YHR161C YHR165C YHR164C YHR167W YHR166C YHR170W YHR169W YHR168W YHR172W YHR171W YHR176W YHR175W YHR174W YHR178W YHR177W YHR182W YHR181W YHR179W YHR184W YHR183W YHR186C YHR185C YHR187W YHR189W YHR188C YHR190W YHR191C YHR192W YHR194W YHR193C YHR196W YHR195W YHR197W YHR199C YHR198C YHR200W YHR201C YHR202W YHR203C YHR206W YHR205W YHR204W YHR208W YHR207C YHR209W YHR211W YHR210C YHR214C−B YHR212C YHR214W−A YHR214W YHR215W YHR218W YHR216W YHR219W YIL001W YIL003W YIL002C YIL004C YIL006W YIL005W YIL009W YIL008W YIL007C YIL011W YIL010W YIL015W YIL014W YIL013C YIL018W YIL016W YIL019W YIL020C YIL022W YIL021W YIL023C YIL025C YIL024C YIL029C YIL027C YIL026C YIL031W YIL030C YIL034C YIL033C YIL036W YIL035C YIL037C YIL039W YIL038C YIL041W YIL040W YIL042C YIL044C YIL043C YIL046W YIL045W YIL047C YIL049W YIL048W YIL050W YIL051C YIL053W YIL052C YIL055C YIL056W YIL057C YIL060W YIL062C YIL061C YIL064W YIL063C YIL067C YIL066C YIL065C YIL069C YIL068C YIL072W YIL070C YIL075C YIL074C YIL073C YIL076W YIL077C YIL078W YIL080W YIL079C YIL082W−A YIL083C YIL086C YIL085C YIL084C YIL088C YIL087C YIL090W YIL089W YIL092W YIL091C YIL093C YIL095W YIL094C YIL097W YIL096C YIL098C YIL099W YIL101C YIL103W YIL102C YIL104C YIL106W YIL105C YIL108W YIL107C YIL111W YIL110W YIL109C YIL113W YIL112W YIL116W YIL115C YIL114C YIL118W YIL117C YIL121W YIL120W YIL119C YIL123W YIL122W YIL125W YIL124W YIL126W YIL128W YIL127C YIL130W YIL129C YIL133C YIL132C YIL131C YIL134W YIL135C YIL136W YIL137C YIL140W YIL139C YIL138C YIL142W YIL141W YIL144W YIL143C YIL145C YIL147C YIL146C YIL148W YIL150C YIL149C YIL152W YIL151C YIL153W YIL154C YIL156W YIL155C YIL157C YIL159W YIL158W YIL162W YIL161W YIL160C YIL166C YIL164C YIL168W YIL170W YIL169C YIL171W YIL172C YIL174W YIL173W YIL175W YIL177C YIL176C YIR002C YIR001C YIR005W YIR004W YIR003W YIR007W YIR006C YIR010W YIR009W YIR008C YIR012W YIR011C YIR014W YIR013C YIR016W YIR015W YIR017C YIR018W YIR019C YIR023W YIR022W YIR021W YIR025W YIR024C YIR028W YIR027C YIR026C YIR029W YIR030C YIR032C YIR031C YIR033W YIR035C YIR034C YIR037W YIR036C YIR041W YIR039C YIR038C YJL001W YIR042C YJL003W YJL002C YJL004C YJL005W YJL006C YJL008C YJL007C YJL012C YJL011C YJL010C YJL015C YJL013C YJL019W YJL017W YJL016W YJL021C YJL020C YJL024C YJL023C YJL026W YJL025W YJL027C YJL030W YJL029C YJL033W YJL032W YJL031C YJL034W YJL035C YJL037W YJL036W YJL038C YJL041W YJL039C YJL043W YJL042W YJL046W YJL045W YJL044C YJL048C YJL047C YJL051W YJL050W YJL049W YJL053W YJL052W YJL055W YJL054W YJL056C YJL058C YJL057C YJL060W YJL059W YJL062W YJL061W YJL063C YJL066C YJL065C YJL070C YJL069C YJL068C YJL071W YJL072C YJL073W YJL075C YJL074C YJL076W YJL077C YJL079C YJL078C YJL082W YJL081C YJL080C YJL083W YJL084C YJL085W YJL087C YJL086C YJL089W YJL088W YJL092W YJL091C YJL090C YJL094C YJL093C YJL097W YJL095W YJL100W YJL099W YJL098W YJL102W YJL101C YJL105W YJL104W YJL103C YJL106W YJL107C YJL110C YJL109C YJL108C YJL112W YJL111W YJL114W YJL113W YJL115W YJL117W YJL116C YJL118W YJL119C YJL120W YJL122W YJL121C YJL124C YJL123C YJL126W YJL125C YJL129C YJL128C YJL127C YJL131C YJL130C YJL134W YJL133W YJL132W YJL137C YJL136C YJL140W YJL139C YJL138C YJL142C YJL141C YJL144W YJL143W YJL146W YJL145W YJL147C YJL149W YJL148W YJL154C YJL153C YJL151C YJL156C YJL155C YJL161W YJL160C YJL158C YJL163C YJL162C YJL165C YJL164C YJL167W YJL166W YJL168C YJL171C YJL170C YJL172W YJL174W YJL173C YJL175W YJL176C YJL177W YJL179W YJL178C YJL181W YJL180C YJL183W YJL182C YJL184W YJL186W YJL185C YJL188C YJL187C YJL189W YJL191W YJL190C YJL193W YJL192C YJL194W YJL196C YJL195C YJL198W YJL197W YJL201W YJL200C YJL203W YJL202C YJL204C YJL207C YJL206C YJL210W YJL208C YJL211C YJL213W YJL212C YJL214W YJL217W YJL216C YJL219W YJL218W YJL220W YJL221C YJL222W YJL225C YJL223C YJR002W YJR001W YJR005W YJR004C YJR003C YJR007W YJR006W YJR008W YJR009C YJR010C−A YJR010W YJR011C YJR014W YJR013W YJR015W YJR017C YJR016C YJR018W YJR019C YJR020W YJR022W YJR021C YJR025C YJR024C YJR027W YJR026W YJR029W YJR028W YJR031C YJR032W YJR033C YJR037W YJR035W YJR034W YJR040W YJR039W YJR042W YJR041C YJR043C YJR045C YJR044C YJR046W YJR047C YJR048W YJR050W YJR049C YJR052W YJR051W YJR055W YJR054W YJR053W YJR057W YJR056C YJR060W YJR059W YJR058C YJR061W YJR062C YJR064W YJR063W YJR066W YJR065C YJR067C YJR068W YJR069C YJR071W YJR070C YJR072C YJR074W YJR073C YJR075W YJR077C YJR076C YJR079W YJR078W YJR082C YJR080C YJR086W YJR084W YJR083C YJR087W YJR088C YJR092W YJR091C YJR090C YJR094C YJR093C YJR094W−A YJR096W YJR095W YJR097W YJR098C YJR099W YJR100C YJR101W YJR103W YJR102C YJR105W YJR104C YJR108W YJR107W YJR106W YJR110W YJR109C YJR112W YJR111C YJR115W YJR114W YJR113C YJR117W YJR116W YJR120W YJR119C YJR118C YJR122W YJR121W YJR123W YJR125C YJR124C YJR127C YJR126C YJR128W YJR129C YJR132W YJR131W YJR130C YJR133W YJR134C YJR137C YJR136C YJR135C YJR138W YJR139C YJR142W YJR141W YJR140C YJR144W YJR143C YJR147W YJR145C YJR149W YJR148W YJR150C YJR152W YJR151C YJR155W YJR154W YJR153W YJR157W YJR156C YJR159W YJR158W YJR160C YKL001C YJR161C YKL002W YKL003C YKL006C−A YKL004W YKL005C YKL007W YKL006W YKL009W YKL008C YKL010C YKL012W YKL011C YKL015W YKL014C YKL013C YKL017C YKL016C YKL019W YKL018W YKL022C YKL021C YKL020C YKL023W YKL024C YKL027W YKL026C YKL025C YKL028W YKL029C YKL031W YKL033W YKL032C YKL035W YKL034W YKL037W YKL036C YKL039W YKL038W YKL040C YKL042W YKL041W YKL045W YKL043W YKL046C YKL047W YKL048C YKL050C YKL049C YKL051W YKL053W YKL052C YKL055C YKL054C YKL058W YKL057C YKL056C YKL060C YKL059C YKL062W YKL061W YKL063C YKL064W YKL065C YKL068W YKL067W YKL071W YKL070W YKL069W YKL073W YKL072W YKL076C YKL075C YKL074C YKL078W YKL077W YKL081W YKL080W YKL079W YKL083W YKL082C YKL085W YKL084W YKL086W YKL088W YKL087C YKL090W YKL089W YKL093W YKL092C YKL091C YKL095W YKL094W YKL098W YKL096W YKL099C YKL101W YKL100C YKL104C YKL103C YKL107W YKL106W YKL105C YKL109W YKL108W YKL112W YKL111C YKL110C YKL114C YKL113C YKL118W YKL117W YKL116C YKL120W YKL119C YKL121W YKL122C YKL126W YKL125W YKL124W YKL127W YKL128C YKL132C YKL130C YKL129C YKL134C YKL133C YKL136W YKL135C YKL137W YKL139W YKL138C YKL141W YKL140W YKL143W YKL142W YKL144C YKL146W YKL145W YKL149C YKL148C YKL147C YKL150W YKL151C YKL154W YKL155C YKL157W YKL156W YKL159C YKL160W YKL161C YKL163W YKL162C YKL164C YKL166C YKL165C YKL169C YKL168C YKL167C YKL171W YKL170W YKL173W YKL172W YKL175W YKL174C YKL176C YKL179C YKL178C YKL182W YKL181W YKL180W YKL184W YKL183W YKL185W YKL187C YKL186C YKL189W YKL188C YKL191W YKL190W YKL194C YKL193C YKL192C YKL195W YKL196C YKL201C YKL199C YKL197C YKL204W YKL203C YKL205W YKL207W YKL206C YKL208W YKL209C YKL210W YKL211C YKL212W YKL214C YKL213C YKL216W YKL215C YKL217W YKL219W YKL218C YKL221W YKL220C YKR001C YKL224C YKL222C YKR003W YKR002W YKR005C YKR004C YKR008W YKR007W YKR006C YKR010C YKR009C YKR013W YKR012C YKR011C YKR015C YKR014C YKR016W YKR017C YKR020W YKR019C YKR018C YKR021W YKR022C YKR025W YKR024C YKR026C YKR028W YKR027W YKR030W YKR029C YKR031C YKR034W YKR035C YKR037C YKR036C YKR041W YKR039W YKR038C YKR042W YKR043C YKR044W YKR046C YKR045C YKR047W YKR048C YKR051W YKR050W YKR049C YKR053C YKR052C YKR055W YKR054C YKR058W YKR057W YKR056W YKR060W YKR059W YKR062W YKR061W YKR063C YKR064W YKR065C YKR067W YKR066C YKR068C YKR070W YKR069W YKR072C YKR071C YKR074W YKR076W YKR075C YKR078W YKR079C YKR080W YKR082W YKR081C YKR084C YKR083C YKR086W YKR085C YKR087C YKR089C YKR088C YKR091W YKR090W YKR093W YKR092C YKR094C YKR096W YKR095W YKR097W YKR099W YKR098C YKR101W YKR100C YKR104W YKR103W YKR102W YKR106W YKR105C YLL002W YLL001W YLL004W YLL003W YLL005C YLL006W YLL007C YLL008W YLL010C YLL009C YLL012W YLL011W YLL014W YLL013C YLL015W YLL019C YLL018C YLL021W YLL020C YLL024C YLL023C YLL022C YLL026W YLL025W YLL029W YLL028W YLL027W YLL032C YLL031C YLL033W YLL034C YLL035W YLL038C YLL036C YLL040C YLL039C YLL043W YLL042C YLL041C YLL046C YLL045C YLL049W YLL048C YLL050C YLL052C YLL051C YLL054C YLL053C YLL055W YLL057C YLL056C YLL058W YLL060C YLL061W YLL063C YLL062C YLL066C YLL064C YLR002C YLR001C YLL067C YLR004C YLR003C YLR005W YLR006C YLR007W YLR009W YLR008C YLR011W YLR010C YLR013W YLR012C YLR014C YLR015W YLR016C YLR017W YLR019W YLR018C YLR021W YLR020C YLR023C YLR022C YLR025W YLR024C YLR026C YLR028C YLR027C YLR031W YLR030W YLR029C YLR033W YLR032W YLR036C YLR035C YLR034C YLR038C YLR037C YLR040C YLR039C YLR041W YLR043C YLR042C YLR045C YLR044C YLR048W YLR047C YLR046C YLR050C YLR049C YLR052W YLR051C YLR055C YLR054C YLR053C YLR057W YLR056W YLR060W YLR059C YLR058C YLR063W YLR061W YLR064W YLR066W YLR065C YLR068W YLR067C YLR070C YLR069C YLR072W YLR071C YLR073C YLR075W YLR074C YLR077W YLR079W YLR078C YLR081W YLR080W YLR084C YLR083C YLR082C YLR086W YLR085C YLR088W YLR087C YLR091W YLR090W YLR089C YLR092W YLR093C YLR096W YLR095C YLR094C YLR099C YLR097C YLR100W YLR103C YLR102C YLR104W YLR105C YLR107W YLR106C YLR109W YLR108C YLR110C YLR113W YLR114C YLR116W YLR115W YLR117C YLR119W YLR118C YLR124W YLR121C YLR120C YLR125W YLR126C YLR128W YLR127C YLR129W YLR131C YLR130C YLR133W YLR132C YLR135W YLR134W YLR136C YLR138W YLR137W YLR143W YLR142W YLR141W YLR145W YLR144C YLR147C YLR146C YLR150W YLR149C YLR151C YLR153C YLR152C YLR156W YLR155C YLR154C YLR158C YLR157C YLR159W YLR160C YLR162W YLR161W YLR163C YLR164W YLR165C YLR167W YLR166C YLR168C YLR169W YLR170C YLR171W YLR173W YLR172C YLR175W YLR174W YLR177W YLR176C YLR180W YLR179C YLR178C YLR182W YLR181C YLR186W YLR185W YLR183C YLR188W YLR187W YLR191W YLR190W YLR189C YLR193C YLR192C YLR195C YLR194C YLR197W YLR196W YLR198C YLR200W YLR199C YLR204W YLR203C YLR205C YLR207W YLR206W YLR208W YLR210W YLR209C YLR212C YLR211C YLR215C YLR213C YLR217W YLR216C YLR218C YLR220W YLR219W YLR223C YLR222C YLR221C YLR224W YLR225C YLR226W YLR228C YLR227C YLR231C YLR229C YLR232W YLR233C YLR234W YLR237W YLR236C YLR238W YLR239C YLR241W YLR240W YLR242C YLR243W YLR244C YLR246W YLR245C YLR247C YLR249W YLR248W YLR251W YLR250W YLR253W YLR252W YLR254C YLR257W YLR256W YLR258W YLR260W YLR259C YLR263W YLR262C YLR264W YLR265C YLR268W YLR267W YLR266C YLR271W YLR270W YLR274W YLR273C YLR272C YLR275W YLR276C YLR279W YLR278C YLR277C YLR282C YLR281C YLR283W YLR284C YLR285W YLR287C YLR286C YLR287C−A YLR288C YLR289W YLR291C YLR290C YLR293C YLR292C YLR297W YLR296W YLR295C YLR299W YLR298C YLR301W YLR300W YLR303W YLR305C YLR304C YLR307W YLR306W YLR308W YLR310C YLR309C YLR313C YLR312C YLR315W YLR314C YLR316C YLR318W YLR317W YLR320W YLR319C YLR322W YLR321C YLR323C YLR324W YLR325C YLR326W YLR328W YLR327C YLR330W YLR329W YLR332W YLR336C YLR333C YLR340W YLR338W YLR342W YLR341W YLR345W YLR344W YLR343W YLR347C YLR346C YLR350W YLR349W YLR348C YLR352W YLR351C YLR353W YLR355C YLR354C YLR357W YLR356W YLR360W YLR359W YLR362W YLR361C YLR363C YLR366W YLR364W YLR369W YLR368W YLR367W YLR371W YLR370C YLR372W YLR373C YLR375W YLR374C YLR376C YLR378C YLR377C YLR381W YLR380W YLR379W YLR383W YLR382C YLR386W YLR385C YLR384C YLR388W YLR387C YLR390W YLR389C YLR394W YLR393W YLR392C YLR396C YLR395C YLR399C YLR398C YLR397C YLR400W YLR401C YLR405W YLR404W YLR403W YLR407W YLR406C YLR409C YLR408C YLR412W YLR411W YLR410W YLR413W YLR414C YLR417W YLR415C YLR418C YLR420W YLR419W YLR422W YLR421C YLR423C YLR425W YLR424W YLR427W YLR426W YLR430W YLR429W YLR431C YLR432W YLR433C YLR435W YLR434C YLR436C YLR438W YLR437C YLR439W YLR441C YLR440C YLR443W YLR442C YLR446W YLR445W YLR449W YLR448W YLR447C YLR451W YLR450W YLR454W YLR453C YLR452C YLR456W YLR455W YLR458W YLR457C YLR459W YLR461W YLR460C YLR462W YLR463C YLR464W YLR466W YLR465C YML001W YLR467W YML003W YML002W YML004C YML005W YML006C YML007W YML008C YML010W YML009C YML011C YML013C−A YML012W YML014W YML013W YML015C YML017W YML016C YML020W YML019W YML018C YML022W YML021C YML024W YML023C YML027W YML026C YML025C YML029W YML028W YML031W YML030W YML032C YML034W YML035C YML035C−A YML036W YML037C YML039W YML038C YML040W YML041C YML042W YML045W YML043C YML046W YML047C YML048W−A YML048W YML049C YML051W YML050W YML052W YML054C YML053C YML055W YML056C YML058C−A YML057W YML058W YML060W YML059C YML062C YML061C YML063W YML065W YML064C YML067C YML066C YML070W YML069W YML068W YML072C YML071C YML074C YML073C YML077W YML076C YML075C YML079W YML078W YML082W YML081W YML080W YML085C YML083C YML087C YML086C YML088W YML092C YML091C YML094W YML093W YML095C−A YML096W YML095C YML098W YML097C YML100W−A YML100W YML099C YML102W YML101C YML104C YML103C YML106W YML105C YML107C YML109W YML108W YML112W YML111W YML110C YML113W YML114C YML117W YML116W YML115C YML117W−A YML118W YML119W YML120C YML121W YML124C YML123C YML126C YML125C YML127W YML129C YML128C YML131W YML130C YML132W YMR001C YML133C YMR003W YMR002W YMR005W YMR004W YMR007W YMR006C YMR008C YMR010W YMR009W YMR012W YMR011W YMR013C YMR014W YMR015C YMR018W YMR017W YMR016C YMR020W YMR019W YMR022W YMR021C YMR025W YMR024W YMR023C YMR027W YMR026C YMR028W YMR030W YMR029C YMR031W−A YMR031C YMR033W YMR032W YMR034C YMR035W YMR036C YMR038C YMR037C YMR040W YMR039C YMR041C YMR043W YMR042W YMR044W YMR046C YMR045C YMR048W YMR047C YMR050C YMR049C YMR052W YMR051C YMR053C YMR054W YMR055C YMR059W YMR058W YMR056C YMR061W YMR060C YMR064W YMR063W YMR062C YMR066W YMR065W YMR068W YMR067C YMR070W YMR069W YMR071C YMR072W YMR073C YMR075C−A YMR075W YMR074C YMR077C YMR076C YMR079W YMR078C YMR080C YMR083W YMR081C YMR085W YMR084W YMR087W YMR086W YMR088C YMR090W YMR089C YMR093W YMR092C YMR091C YMR094W YMR095C YMR096W YMR098C YMR097C YMR100W YMR099C YMR102C YMR101C YMR106C YMR105C YMR104C YMR108W YMR107W YMR109W YMR111C YMR110C YMR113W YMR112C YMR115W YMR114C YMR116C YMR118C YMR117C YMR119W−A YMR119W YMR123W YMR121C YMR120C YMR125W YMR124W YMR129W YMR128W YMR126C YMR130W YMR131C YMR134W YMR133W YMR132C YMR135W−A YMR135C YMR136W YMR137C YMR140W YMR139W YMR138W YMR142C YMR141C YMR144W YMR143W YMR145C YMR147W YMR146C YMR149W YMR148W YMR152W YMR151W YMR150C YMR153C−A YMR153W YMR155W YMR154C YMR156C YMR158W YMR157C YMR161W YMR160W YMR159C YMR163C YMR162C YMR165C YMR164C YMR167W YMR166C YMR168C YMR170C YMR169C YMR173W YMR172W YMR171C YMR173W−A YMR174C YMR177W YMR176W YMR175W YMR179W YMR178W YMR181C YMR180C YMR184W YMR183C YMR182C YMR186W YMR185W YMR189W YMR188C YMR187C YMR191W YMR190C YMR194W YMR193W YMR192W YMR196W YMR195W YMR198W YMR197C YMR200W YMR199W YMR201C YMR203W YMR202W YMR206W YMR205C YMR204C YMR208W YMR207C YMR211W YMR210W YMR209C YMR213W YMR212C YMR215W YMR214W YMR217W YMR216C YMR218C YMR220W YMR219W YMR223W YMR222C YMR221C YMR225C YMR224C YMR228W YMR227C YMR226C YMR230W YMR229C YMR232W YMR231W YMR234W YMR233W YMR235C YMR237W YMR236W YMR238W YMR240C YMR239C YMR241W YMR243C YMR244C−A YMR244W YMR246W YMR245W YMR247C YMR251W YMR250W YMR251W−A YMR253C YMR252C YMR255W YMR256C YMR259C YMR258C YMR257C YMR261C YMR260C YMR263W YMR262W YMR264W YMR267W YMR265C YMR269W YMR268C YMR272C YMR271C YMR270C YMR274C YMR273C YMR277W YMR276W YMR275C YMR278W YMR279C YMR281W YMR280C YMR285C YMR283C YMR282C YMR288W YMR287C YMR290W−A YMR289W YMR290C YMR292W YMR293C YMR294W−A YMR294W YMR295C YMR297W YMR296C YMR298W YMR299C YMR302C YMR301C YMR300C YMR304W YMR303C YMR306W YMR305C YMR308C YMR310C YMR309C YMR312W YMR311C YMR313C YMR315W YMR314W YMR316C−B YMR316W YMR317W YMR319C YMR318C YMR322C YMR321C YMR323W YNL001W YNL002C YNL004W YNL003C YNL006W YNL005C YNL008C YNL010W YNL009W YNL014W YNL012W YNL016W YNL015W YNL017C YNL019C YNL018C YNL021W YNL020C YNL022C YNL025C YNL024C YNL027W YNL026W YNL028W YNL030W YNL029C YNL032W YNL031C YNL034W YNL033W YNL035C YNL036W YNL037C YNL040W YNL039W YNL038W YNL042W YNL041C YNL045W YNL044W YNL046W YNL048W YNL047C YNL050C YNL049C YNL053W YNL052W YNL051W YNL054W YNL055C YNL056W YNL059C YNL058C YNL061W YNL062C YNL063W YNL064C YNL067W YNL066W YNL065W YNL069C YNL068C YNL072W YNL071W YNL070W YNL073W YNL074C YNL077W YNL076W YNL075W YNL078W YNL079C YNL081C YNL080C YNL083W YNL085W YNL084C YNL087W YNL086W YNL088W YNL090W YNL089C YNL092W YNL091W YNL094W YNL093W YNL095C YNL097C YNL096C YNL099C YNL098C YNL102W YNL101W YNL100W YNL103W YNL104C YNL107W YNL105W YNL108C YNL111C YNL110C YNL113W YNL112W YNL117W YNL115C YNL114C YNL119W YNL120C YNL123W YNL122C YNL121C YNL124W YNL125C YNL128W YNL127W YNL126W YNL129W YNL130C YNL132W YNL131W YNL135C YNL134C YNL133C YNL136W YNL137C YNL138W YNL141W YNL139C YNL142W YNL144C YNL147W YNL146W YNL145W YNL149C YNL148C YNL152W YNL151C YNL155W YNL154C YNL153C YNL157W YNL156C YNL158W YNL160W YNL159C YNL162W YNL161W YNL165W YNL163C YNL166C YNL168C YNL167C YNL171C YNL169C YNL172W YNL175C YNL173C YNL177C YNL176C YNL178W YNL181W YNL180C YNL183C YNL182C YNL187W YNL186W YNL185C YNL189W YNL188W YNL191W YNL190W YNL193W YNL195C YNL194C YNL197C YNL196C YNL201C YNL200C YNL199C YNL202W YNL203C YNL206C YNL205C YNL204C YNL208W YNL207W YNL210W YNL209W YNL214W YNL213C YNL211C YNL216W YNL215W YNL218W YNL217W YNL219C YNL220W YNL221C YNL223W YNL222W YNL227C YNL225C YNL224C YNL228W YNL229C YNL232W YNL231C YNL230C YNL234W YNL233W YNL237W YNL236W YNL235C YNL239W YNL238W YNL241C YNL240C YNL243W YNL242W YNL244C YNL246W YNL245C YNL247W YNL249C YNL248C YNL250W YNL251C YNL253W YNL252C YNL254C YNL256W YNL255C YNL258C YNL257C YNL261W YNL260C YNL259C YNL262W YNL263C YNL266W YNL265C YNL264C YNL268W YNL267W YNL272C YNL271C YNL270C YNL273W YNL274C YNL275W YNL276C YNL281W YNL279W YNL278W YNL282W YNL283C YNL287W YNL286W YNL284C YNL289W YNL288W YNL290W YNL292W YNL291C YNL293W YNL294C YNL296W YNL295W YNL299W YNL298W YNL297C YNL300W YNL301C YNL304W YNL302C YNL305C YNL306W YNL307C YNL311C YNL310C YNL308C YNL312W YNL313C YNL314W YNL315C YNL317W YNL316C YNL318C YNL320W YNL319W YNL321W YNL323W YNL322C YNL326C YNL325C YNL327W YNL328C YNL331C YNL330C YNL329C YNL333W YNL332W YNL336W YNL335W YNL334C YNL338W YNL339C YNR003C YNR002C YNR001C YNR004W YNR005C YNR006W YNR007C YNR010W YNR009W YNR008W YNR012W YNR011C YNR015W YNR014W YNR013C YNR017W YNR016C YNR019W YNR018W YNR020C YNR021W YNR022C YNR024W YNR023W YNR027W YNR026C YNR025C YNR028W YNR029C YNR030W YNR032W YNR031C YNR034W YNR033W YNR037C YNR036C YNR035C YNR038W YNR039C YNR040W YNR041C YNR044W YNR043W YNR042W YNR046W YNR045W YNR048W YNR047W YNR049C YNR051C YNR050C YNR054C YNR053C YNR052C YNR057C YNR055C YNR059W YNR058W YNR060W YNR062C YNR061C YNR063W YNR064C YNR068C YNR067C YNR066C YNR070W YNR069C YNR072W YNR071C YNR073C YNR075W YNR074C YNR076W YOL001W YOL004W YOL003C YOL002C YOL006C YOL005C YOL008W YOL007C YOL009C YOL011W YOL010W YOL013C YOL012C YOL015W YOL014W YOL016C YOL017W YOL018C YOL020W YOL019W YOL021C YOL023W YOL022C YOL025W YOL024W YOL026C YOL028C YOL027C YOL030W YOL029C YOL033W YOL032W YOL031C YOL036W YOL034W YOL039W YOL038W YOL037C YOL041C YOL040C YOL042W YOL044W YOL043C YOL045W YOL046C YOL048C YOL047C YOL049W YOL051W YOL050C YOL053W YOL052C YOL054W YOL056W YOL055C YOL058W YOL057W YOL059W YOL061W YOL060C YOL063C YOL062C YOL065C YOL064C YOL068C YOL067C YOL066C YOL069W YOL070C YOL072W YOL071W YOL073C YOL076W YOL075C YOL079W YOL078W YOL077C YOL081W YOL080C YOL083W YOL082W YOL084W YOL087C YOL086C YOL089C YOL088C YOL092W YOL091W YOL090W YOL093W YOL094C YOL097C YOL096C YOL095C YOL099C YOL098C YOL100W YOL101C YOL103W YOL102C YOL104C YOL107W YOL105C YOL110W YOL109W YOL108C YOL112W YOL111C YOL113W YOL114C YOL117W YOL116W YOL115W YOL120C YOL119C YOL123W YOL122C YOL121C YOL125W YOL124C YOL127W YOL126C YOL128C YOL130W YOL129W YOL132W YOL131W YOL133W YOL135C YOL134C YOL137W YOL136C YOL140W YOL139C YOL138C YOL142W YOL141W YOL144W YOL143C YOL145C YOL146W YOL147C YOL149W YOL148C YOL152W YOL151W YOL150C YOL154W YOL155C YOL156W YOL158C YOL157C YOL161C YOL159C YOL164W YOL163W YOL165C YOR001W YOL166C YOR003W YOR002W YOR004W YOR006C YOR005C YOR008C YOR007C YOR009W YOR011W YOR010C YOR013W YOR012W YOR015W YOR014W YOR016C YOR018W YOR017W YOR019W YOR020C YOR023C YOR022C YOR021C YOR026W YOR025W YOR027W YOR030W YOR028C YOR031W YOR032C YOR035C YOR034C YOR033C YOR037W YOR036W YOR039W YOR038C YOR040W YOR042W YOR041C YOR044W YOR043W YOR045W YOR047C YOR046C YOR049C YOR048C YOR052C YOR051C YOR053W YOR055W YOR054C YOR057W YOR056C YOR060C YOR059C YOR058C YOR061W YOR062C YOR063W YOR065W YOR064C YOR066W YOR067C YOR069W YOR068C YOR073W YOR071C YOR070C YOR075W YOR074C YOR078W YOR077W YOR076C YOR080W YOR079C YOR084W YOR083W YOR081C YOR085W YOR086C YOR088W YOR087W YOR091W YOR090C YOR089C YOR092W YOR093C YOR094W YOR096W YOR095C YOR098C YOR097C YOR099W YOR101W YOR100C YOR102W YOR103C YOR105W YOR104W YOR108W YOR107W YOR106W YOR110W YOR109W YOR113W YOR112W YOR111W YOR114W YOR115C YOR118W YOR117W YOR116C YOR120W YOR119C YOR122C YOR121C YOR125C YOR124C YOR123C YOR127W YOR126C YOR130C YOR129C YOR128C YOR132W YOR131C YOR134W YOR133W YOR135C YOR136W YOR137C YOR139C YOR138C YOR140W YOR142W YOR141C YOR144C YOR143C YOR147W YOR146W YOR145C YOR149C YOR148C YOR150W YOR151C YOR154W YOR153W YOR152C YOR156C YOR155C YOR158W YOR157C YOR159C YOR160W YOR161C YOR163W YOR162C YOR164C YOR165W YOR166C YOR168W YOR167C YOR170W YOR172W YOR171C YOR174W YOR173W YOR176W YOR175C YOR177C YOR179C YOR178C YOR181W YOR180C YOR182C YOR184W YOR185C YOR187W YOR186W YOR190W YOR189W YOR188W YOR191W YOR192C YOR193W YOR195W YOR194C YOR197W YOR196C YOR200W YOR199W YOR198C YOR202W YOR201C YOR204W YOR203W YOR206W YOR205C YOR207C YOR208W YOR209C YOR210W YOR212W YOR211C YOR215C YOR213C YOR217W YOR216C YOR218C YOR220W YOR219C YOR222W YOR221C YOR223W YOR225W YOR224C YOR227W YOR226C YOR230W YOR229W YOR228C YOR232W YOR231W YOR233W YOR234C YOR238W YOR237W YOR236W YOR240W YOR239W YOR241W YOR243C YOR242C YOR244W YOR245C YOR248W YOR247W YOR246C YOR250C YOR249C YOR252W YOR251C YOR253W YOR255W YOR254C YOR257W YOR256C YOR258W YOR260W YOR259C YOR262W YOR261C YOR266W YOR265W YOR264W YOR269W YOR267C YOR271C YOR270C YOR272W YOR274W YOR273C YOR276W YOR275C YOR278W YOR280C YOR279C YOR282W YOR281C YOR285W YOR284W YOR283W YOR286W YOR287C YOR289W YOR288C YOR291W YOR290C YOR292C YOR294W YOR293W YOR296W YOR295W YOR297C YOR299W YOR298W YOR302W YOR301W YOR300W YOR304C−A YOR303W YOR305W YOR304W YOR308C YOR307C YOR306C YOR311C YOR310C YOR315W YOR313C YOR312C YOR317W YOR316C YOR319W YOR321W YOR320C YOR323C YOR322C YOR325W YOR324C YOR326W YOR328W YOR327C YOR330C YOR329C YOR332W YOR334W YOR333C YOR336W YOR335C YOR338W YOR337W YOR341W YOR340C YOR339C YOR344C YOR342C YOR346W YOR345C YOR347C YOR349W YOR348C YOR352W YOR351C YOR350C YOR354C YOR353C YOR356W YOR355W YOR359W YOR358W YOR357C YOR361C YOR360C YOR364W YOR363C YOR362C YOR366W YOR365C YOR368W YOR367W YOR369C YOR371C YOR370C YOR373W YOR372C YOR374W YOR377W YOR375C YOR380W YOR378W YOR382W YOR381W YOR383C YOR385W YOR384W YOR386W YOR388C YOR387C YOR390W YOR389W YOR392W YOR391C YOR394W YOR393W YPL001W YPL003W YPL002C YPL006W YPL005W YPL004C YPL008W YPL007C YPL010W YPL009C YPL011C YPL012W YPL013C YPL014W YPL015C YPL016W YPL018W YPL017C YPL020C YPL019C YPL022W YPL021W YPL023C YPL024W YPL026C YPL029W YPL028W YPL027W YPL030W YPL031C YPL033C YPL032C YPL034W YPL036W YPL035C YPL038W YPL037C YPL039W YPL041C YPL040C YPL043W YPL042C YPL045W YPL044C YPL048W YPL047W YPL046C YPL050C YPL049C YPL052W YPL051W YPL053C YPL054W YPL055C YPL058C YPL057C YPL056C YPL060W YPL059W YPL063W YPL061W YPL066W YPL065W YPL064C YPL068C YPL067C YPL070W YPL069C YPL071C YPL072W YPL073C YPL076W YPL075W YPL074W YPL078C YPL077C YPL079W YPL080C YPL081W YPL083C YPL082C YPL085W YPL084W YPL088W YPL087W YPL086C YPL090C YPL089C YPL093W YPL092W YPL091W YPL095C YPL094C YPL097W YPL096W YPL100W YPL099C YPL098C YPL101W YPL102C YPL104W YPL103C YPL105C YPL108W YPL106C YPL111W YPL110C YPL109C YPL113C YPL112C YPL116W YPL115C YPL118W YPL117C YPL119C YPL120W YPL121C YPL124W YPL123C YPL122C YPL126W YPL125W YPL129W YPL128C YPL127C YPL131W YPL130W YPL132W YPL133C YPL135W YPL134C YPL137C YPL139C YPL138C YPL142C YPL141C YPL140C YPL143W YPL145C YPL147W YPL146C YPL150W YPL149W YPL148C YPL152W YPL151C YPL155C YPL154C YPL153C YPL157W YPL156C YPL160W YPL159C YPL158C YPL162C YPL161C YPL164C YPL163C YPL166W YPL165C YPL167C YPL168W YPL169C YPL170W YPL172C YPL171C YPL173W YPL174C YPL175W YPL177C YPL176C YPL179W YPL178W YPL181W YPL180W YPL184C YPL183C YPL182C YPL185W YPL186C YPL189W YPL188W YPL187W YPL191C YPL190C YPL194W YPL193W YPL192C YPL196W YPL195W YPL198W YPL197C YPL200W YPL199C YPL201C YPL203W YPL202C YPL204W YPL206C YPL205C YPL208W YPL207W YPL211W YPL210C YPL209C YPL213W YPL212C YPL215W YPL214C YPL216W YPL218W YPL217C YPL220W YPL219W YPL222W YPL221W YPL223C YPL225W YPL224C YPL226W YPL228W YPL227C YPL230W YPL229W YPL232W YPL231W YPL233W YPL235W YPL234C YPL237W YPL236C YPL239W YPL238C YPL240C YPL242C YPL241C YPL243W YPL244C YPL245W YPL247C YPL246C YPL249C YPL248C YPL253C YPL252C YPL250C YPL255W YPL254W YPL257W YPL256C YPL258C YPL260W YPL259C YPL262W YPL261C YPL265W YPL264C YPL263C YPL267W YPL266W YPL270W YPL269W YPL268W YPL271W YPL272C YPL274W YPL273W YPL277C YPL279C YPL278C YPL280W YPL281C YPR001W YPL283C YPL282C YPR002W YPR003C YPR006C YPR005C YPR004C YPR008W YPR007C YPR009W YPR011C YPR010C YPR012W YPR013C YPR016C YPR015C YPR019W YPR018W YPR017C YPR020W YPR021C YPR024W YPR023C YPR022C YPR026W YPR025C YPR028W YPR027C YPR029C YPR031W YPR030W YPR032W YPR033C YPR036W YPR035W YPR034W YPR040W YPR039W YPR041W YPR043W YPR042C YPR045C YPR044C YPR048W YPR047W YPR046W YPR050C YPR049C YPR051W YPR053C YPR056W YPR055W YPR054W YPR058W YPR057W YPR061C YPR060C YPR059C YPR062W YPR063C YPR066W YPR065W YPR067W YPR069C YPR068C YPR071W YPR070W YPR072W YPR074C YPR073C YPR076W YPR075C YPR080W YPR079W YPR078C YPR082C YPR081C YPR084W YPR083W YPR087W YPR086W YPR085C YPR089W YPR088C YPR090W YPR092W YPR091C YPR094W YPR093C YPR097W YPR095C YPR098C YPR101W YPR100W YPR103W YPR102C YPR106W YPR105C YPR104C YPR108W YPR107C YPR109W YPR111W YPR110C YPR113W YPR112C YPR116W YPR115W YPR114W YPR118W YPR117W YPR119W YPR120C YPR124W YPR122W YPR121W YPR127W YPR125W YPR129W YPR128C YPR130C YPR132W YPR131C YPR135W YPR134W YPR133C YPR137W YPR136C YPR139C YPR138C YPR140W YPR143W YPR141C YPR145W YPR144C YPR149W YPR148C YPR147C YPR150W YPR151C YPR154W YPR153W YPR152C YPR156C YPR155C YPR158W YPR157W YPR160W YPR159W YPR161C YPR163C YPR162C YPR165W YPR164W YPR166C YPR168W YPR167C YPR171W YPR169W YPR172W YPR174C YPR173C YPR175W YPR176C YPR178W YPR177C YPR179C YPR180W YPR181C YPR184W YPR183W YPR182W YPR185W YPR186C YPR187W YPR188C YPR189W YPR191W YPR190C YPR192W YPR193C YPR196W YPR194C YPR197C YPR198W YPR199C YPR202W YPR201W YPR200C YPR204W YPR203W Hmt1 Mlp1 Xpo1 Mlp2 Nup60 Cse1 Nup116 Yra1 Nup2 Nic96 Kap95 Nsp1 Nup84Nup145 Tho2 Rsc9 Npl3 Rrp6 Nup100 Nab2 Lrp1 Prp20 Hrp1 Proteins Hmt1 Mlp1 Xpo1 Mlp2 Nup60 Cse1 Nup116 Yra1 Nup2 Nic96 Kap95 Nsp1 Nup84 Nup145 Tho2 Rsc9 Npl3 Rrp6 Nup100 Nab2 Lrp1 Prp20 Hrp1 Proteins Figure A-1: Semi-supervised clustering of all nuclear transport and nuclear processing factors binding profiles [13–16] using the KL distance metric. The upper plot illustrates the dendrogram. The vertical length of each branch is proportional to the distances between clusters. The lower plot illustrates the vectors of binding ratios of each factor on a scale from 0 to 2. The binding vectors are arranged in the same order as their corresponding protein in the dendrogram. 86 Bibliography [1] C. T. Harbinson et al. Transcrptional regulatory code of a eukaryotic genome. Nature, 431:99–103, September 2004. [2] Z. Bar-Joseph et al. Computational discovery of gene modules and regulatory networks. Nature Biotechnology, 21(11):1337–1342, November 2003. [3] T. I. Lee et al. Transcriptional Regulatory Networks in Saccharomyces cerevisiae. Science, 21(11):1337–1342, October 2002. [4] F. Robert et al. Global Position and Recruitment of HATs and HDACs in the Yeast Genome. Molecular Cell, 2004. [5] J. Zeitlinger et al. Program-Specific Distribution of a Transcription Factor Dependent on Partner Transcription Factor and MAPK Signaling. Cell, May 2, 2003. [6] H. H. Ng et al. Targeted recruitment of Set1 Histone Methylase by Elongating Pol II Provides a Localized Mark and Memory of Recent Transcriptional Activity. Molecular Cell, March, 2003. [7] H. H. Ng et al. Genome-wide location and regulated recruitement of the RSC nucleosome-remodeling complex. Genes and Development, 2002. [8] J. J. Wyrick et al. Genome-Wide Distribution of ORC and MCM Proteins in S. cerevisiae: High-Resolution Mapping of Replication Origins. Cell, December 14, 2001. [9] I. Simon et al. Serial Regulation of Transcriptional Regulators in the Yeast Cell Cycle. Cell, September 21, 2001. [10] B. Ren et al. Genome-Wide Location and Function of DNA Binding Proteins. Science, December 22, 2000. [11] C. J. Roberts et al. Signaling and circuitry of multiple MAPK pathways revealed by a matrix of global gene expression profiles. Science, 2000. [12] F. C. Holstege et al. Dissecting the Regulatory Circuitry of a Eukaryotic Genome. Cell, November 25, 1998. 87 [13] J. M. Casolari et al. Genome-wide Localization of the Nuclear Transport Machinery Couples Transcriptional Status and Nuclear Organization. Cell, 117:427–439, May 2004. [14] M. C. Yu et al. Arginine methyltransferase affects interactions and recruitment of mRNA processing and export factors. Genes and Development, 2004. [15] M. Damelin et al. The Genome-wide Localization of Rsc9, a Component of the RSC Chromatin-Remodeling Complex, Changes in Response to Stress. Molecular Cell, 9:563– 573, March 2002. [16] H. Hieronymus et al. Genome-wide mRNA surveillance is coupled to mRNA transport. Genes and Development, 2004. [17] S. K. Kurdistani et al. Mapping Global Histone Acetylation Patterns to Gene Expression. Cell, 117:721–733, June 11, 2004. [18] D. Robyr et al. Microarray Deacetylation Maps Determine Genome-Wide Functions for Yeast Histone Deacetylases. Cell, 109:437–446, May 17, 2002. [19] S. K. Kurdistani et al. Genome-wide binding map of the histone deacetylase Rpd3 in yeast. Nature Genetics, June 24, 2002. [20] A. Wang et al. Requirement of Hos2 Histone Deacetylase for Gene Activity in Yeast. Science, 2002. [21] J. Lieb et al. Promoter-specific binding of Rap1 revealed by genome-wide maps of protein-DNA association. Nature Genetics, August, 2001. [22] P. L. Nagy et al. Genomewide demarcation of RNA polymerase II transcription units revealed by physical fractionation of chromatin. PNAS, May 27, 2003. [23] J. V. Geisberg et al. Cellular Stress Alters the Transcriptional Properties of PromoterBound Mot1-TBP Complexes. Molecular Cell, May 21, 2004. [24] Z. Moqtaderi et al. Genome-Wide Occupancy Profile of the RNA Polymerase III Machinery in Saccharomyces cerevisiae Reveals Loci with Incomplete Transcription Complexes. Molecular and Cellular Biology, December, 2003. [25] H. Santos-Rosa et al. Methylation of Histone H3 K4 Mediates Association of the Isw1p ATPase with Chromatin. Molecular Cell, November, 2003. [26] B. E. Bernstein et al. Methylation of histone H3 Lys 4 in coding regions of active genes. PNAS, June 25, 2002. [27] B. E. Bernstein et al. Global nucleosome occupancy in yeast. Genome Biology, August 20, 2004. 88 [28] J. Kim et al. Global Role of TATA Box-Binding Protein Recruitment to Promoters in Mediating Gene Expression Profiles. Molecular and Cellular Biology, September, 2004. [29] N. M. Luscombe et al. Genomic analysis o regulatory network dynamics reveals large topological changes. Letters to Nature, September 16, 2004. [30] H. Yu et al. Genomic analysis of essentiality within protein networks. TRENDS in Genetics, June, 2004. [31] A. C. Gavin et al. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 2002. [32] Y. Ho et al. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 2002. [33] T. Ito et al. Toward a protein-protein interaction map of the budding yeast: A comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. PNAS, 2000. [34] P. Uetz et al. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature, 2000. [35] R. J. Larsen, M. L. Marx. An Introduction to Mathematical Statistics and Its Applications. Prentice Hall, 2001. [36] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley and Sons, 2001. [37] T. L. Bailey, M. Gribskov et al. Combining evidence using p-values: application to sequence homology searches. Bioinformatics, 1998. [38] H. W. Mewes. Munich Information Center for Protein Sequences: Comprehensive Yeast Genome Database. http://mips.gsf.de/genre/proj/yeast/index.jsp, 2004. [39] Steen Knudsen. Guide to Analysis of Dna Microarray Data. Wiley, 2004 89