Computational tools

ENRICH - From Expression, Through Annotation, to Function

Goren Tali1 and Manor Ohad1
1 Department of Computer Science, The Hebrew University of Jerusalem, Israel.
Visit the ENRICH site at http://www.huji.ac.il/~manor460/enrich

ABSTRACT
Which transcription factor controls stress response in yeast? Which genes regulate amino acid biosynthesis? Do mitochondrial genes have a stronger response to heat shock than peripheral genes? The amount of available biological data is rapidly increasing. Data on gene annotation, cellular localization, chromatin structure, protein structure, function and interactions, ChIP and gene expression are overwhelming in their richness of information, but how does one extract meaningful biological conclusions from them?
ENRICH is a computational tool aimed at addressing these issues. ENRICH offers users an easy interactive interface for loading their biological data into the program, whether gene annotation tables, experimental gene expression data or other data. These data sets, which we refer to as matrices, can then be manipulated in a variety of ways that we implemented in ENRICH, to give the user more freedom with the loaded data.
We also implemented a large variety of statistical tests, both paired and unpaired, that can be performed on the data to find enriched aspects of it, as well as multiple hypothesis correction methods such as Bonferroni and FDR (false discovery rate). The result is a tool that can be used to load various data types, investigate them statistically, and save the biological results.

1 INTRODUCTION
Computational biology has the potential to revolutionize the understanding of the cell by offering an unprecedented view of the molecular underpinnings of biological phenomena. Computational analysis is essential to transform the masses of generated data into a mechanistic understanding of processes and pathways in the cell.
The project aims at extracting a comprehensive overview from the huge amount of existing information arising from genome-wide analyses. Our goal is to connect information from different systematic global studies and resources to infer biological function, and to check whether the connections found are statistically significant. In addition, we wished to use existing biological information and obtainable data to reconstruct novel and previously known biological networks.
For this purpose we developed a computational tool in the form of a computer program, which we call 'ENRICH'. ENRICH makes it possible to combine heterogeneous data, involving gene annotations, gene expression data, chromatin immunoprecipitation (ChIP), cellular localization analysis, protein-protein interactions and more.
ENRICH handles matrix-based data of two main types: binary and continuous. A binary matrix can be viewed as annotations of any type. A continuous matrix may contain various types of measures over genes, such as expression, ChIP, etc.
ENRICH makes it possible to perform a global statistical analysis of such crossed types of biological information, determine how significant the examined associations are, and find enrichment within different data sets.

1.1 Statistical Hypothesis Testing
It is easier to show that a universal hypothesis is false than to prove that it is true. Therefore, we use statistical hypothesis testing to assess how strongly the observed data contradict a given hypothesis. The process of statistical hypothesis testing is usually composed of the following steps:
1. Formulate the null hypothesis, which commonly states that "there is no phenomenon" and that the observations could have arisen through chance. The alternative hypothesis commonly stands in contrast to the null hypothesis, and it is accepted if the observed data values are sufficiently improbable under the null hypothesis. An example of a null hypothesis may be: "There is no connection between the annotation 'Ribosome Assembly' and the transcription factor 'RAP1'". The alternative hypothesis says that the two are indeed related.
2. Identify a test statistic that can be used to assess the validity of the null hypothesis. We elaborate on the test statistic of each statistical test we implemented in the Methods section.
3. Compute the P-value, which is the probability that a test statistic at least as extreme as the one observed would be obtained assuming that the null hypothesis were true. The smaller the P-value, the stronger the evidence against the null hypothesis.
4. Compare the P-value to an acceptable significance level (sometimes called an alpha value). If the P-value falls below this level, the observed effect is statistically significant, and the null hypothesis is rejected in favor of the alternative hypothesis.

2 METHODS
2.1 Implementation
Our main aim was to develop an easy-to-use, flexible and consistent application that reduces otherwise complex tasks to only a few lines of code. We decided to implement ENRICH in the Perl programming language, for several reasons. Perl is a high-level programming language with strong support for arrays and hashes. It has a wide range of available modules and packages related to mathematics and statistics, which were helpful to us. In addition, Perl is freely available to anyone, and many bioinformatics tools are implemented in Perl. A disadvantage of Perl, of which we were aware from the beginning of the work, is that it is relatively slow (compared to the C programming language) at performing numerical calculations.
ENRICH was developed so that it can be used in three different ways:
1. As an interactive command line interpreter, where the user types one command at a time, as seen in figure 1.
2. By running the program with a script of ENRICH commands to be performed.
3. As a module, by incorporating ENRICH functions into any Perl program.

2.2 ENRICH input
ENRICH handles tab-delimited text files in the structure of a matrix: the first row contains the headers of the columns and the first column contains the headers of the rows. Matrices with two header rows are also handled. Every cell in the matrix contains the data relevant to its specific row and column headers. The data in each cell of a matrix may be continuous numeric values or binary values (0 or 1).
We now elaborate on some representative types of data sets that are widely used in computational molecular biology and that ENRICH is aimed at processing and analyzing:
1. Gene Ontology Annotation (GO) - The GO database provides a useful tool to annotate and analyze the functions of a large number of genes.
2. Gene Expression - Results of microarray experiments, which indicate the RNA levels of the genes in the genome under different conditions.
3. ChIP on chip - Data from chromatin immunoprecipitation, a technique used to identify the DNA sequences to which specific proteins bind in vivo.
4. Cellular localization - Proteins were marked using different methods, and their sub-cellular localization was determined.
5. Protein-protein interactions - Different experiments in which interactions between different proteins in the cell were measured.

2.3 Basic Operations
All input files to ENRICH are in a matrix-based format. We therefore found it useful to implement the following three types of functions:
1. Basic matrix manipulations, which include simple mathematical operations over all matrix values.
For example: Transpose, LogScale, Normal, Average, Negate, AddCon (add a constant numeric value to all matrix values), etc.
2. Convenient and helpful data queries. For example: select rows/columns by name/sum, get the number of rows/columns, get row headers, extract P-values, etc.
3. Functions for simple use and orientation in ENRICH. For example: Help, Whos (print all the currently used variables), Load, Save, etc.
Most ENRICH functions are explained in more detail in the Appendix.

Figure 1. An example of using ENRICH to load two data sets, one being ChIP data and the other GO annotation data. A hypergeometric test is then performed on these matrices, and the result is refined using the False Discovery Rate correction and saved to a new file in the user's directory.

2.4 Statistical Testing
All statistical tests in ENRICH are performed using the function 'Enrich()'. The input for a statistical test by 'Enrich()' is two matrices of the following format:
Matrix A with n rows and m columns.
Matrix B with n rows and k columns.
A and B both contain the same row headers (e.g. all the yeast genes).
These three specifications are not strict requirements. If A and B do not contain exactly the same row headers, the ENRICH output refers only to rows that appear in both A and B.
The function 'Enrich()' performs the statistical test for every pair of columns (vectors) $x \in A$ and $y \in B$. The output is a matrix C with m rows and k columns, in which every cell $C_{x,y}$ contains the P-value resulting from the statistical test performed on vector x and vector y.
'Enrich()' performs a statistical test according to the method specified by the user. It checks the type of values that matrices A and B contain (either decimal or binary) and calls the matching function out of the following three: EnrichBin(), EnrichFlo() or EnrichAnno(). Each function may call a variety of statistical tests that can operate on the relevant type of values in the matrices. We briefly explain the main purpose of each statistical test and how it is performed.

2.4.1 EnrichBin - Performs the test specified by the user on two matrices containing binary values. There are two possible tests for binary matrices: the χ2 test or the hypergeometric test.
a. χ2 Test
A non-parametric test of statistical significance for binary values. The hypothesis tested with χ2 is whether or not two different samples are different enough in some characteristic or aspect of their behavior. If there is a significant difference, we can generalize from our samples that the populations from which the samples are drawn also differ in that behavior or characteristic. The null hypothesis of this test states that there is no difference among the different sample groups [3].
b. Hypergeometric Test
The hypergeometric distribution describes the number of successes in a sequence of draws from a finite population without replacement. Let there be n ways for a 'good' selection and m ways for a 'bad' selection out of a total of n+m possibilities. Take N samples and let $x_i$ equal 1 if selection i is successful and 0 if it is not. Let X be the total number of successful selections:
$$X = \sum_{i=1}^{N} x_i$$
The probability of i successful selections is then:
$$P(X = i) = \frac{[\#\text{ ways for } i \text{ successes}]\,[\#\text{ ways for } N-i \text{ failures}]}{[\text{total number of ways to select}]} = \frac{\binom{n}{i}\binom{m}{N-i}}{\binom{n+m}{N}}$$

2.4.2 EnrichFlo - Performs the test specified by the user on two matrices containing decimal values. There are a few possible tests for matrices of decimal values, which are briefly described below.
a. Paired Student's t Test
A parametric test to compare two paired groups. Given two paired sets $X_i$ and $Y_i$ of n measured values, the paired t-test determines whether they differ from each other in a significant way, under the assumption that the paired differences are independent and identically normally distributed [2].
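As an illustration of the all-pairs scheme of section 2.4 combined with the hypergeometric test of section 2.4.1, the following is a minimal sketch in Python (ENRICH itself is written in Perl, and this is not its code). The function names `hypergeom_tail` and `enrich_bin`, the set-based representation of a binary matrix, the toy data, and the use of the upper tail P(X >= i) as the enrichment P-value are all assumptions made for the example; the text above does not state which tail ENRICH reports.

```python
from math import comb


def hypergeom_tail(i, n, m, N):
    """Return P(X >= i) for N draws without replacement from n 'good' and m 'bad' items."""
    total = comb(n + m, N)
    upper = min(n, N)  # cannot see more successes than good items or draws
    return sum(comb(n, j) * comb(m, N - j) for j in range(i, upper + 1)) / total


def enrich_bin(a_rows, a_cols, b_rows, b_cols):
    """All-pairs hypergeometric enrichment between two binary matrices.

    a_cols / b_cols map a column header to the set of row headers holding a 1.
    Returns a dict keyed by (column of A, column of B) with one P-value per pair.
    """
    shared = set(a_rows) & set(b_rows)  # use only rows present in both matrices
    pvals = {}
    for x_name, x_ones in a_cols.items():
        x_hits = x_ones & shared
        for y_name, y_ones in b_cols.items():
            y_hits = y_ones & shared
            n = len(y_hits)          # 'good' rows: those marked 1 in y
            m = len(shared) - n      # 'bad' rows: all the others
            N = len(x_hits)          # number of draws: rows marked 1 in x
            overlap = len(x_hits & y_hits)
            pvals[(x_name, y_name)] = hypergeom_tail(overlap, n, m, N)
    return pvals


# Toy data, invented for illustration: one transcription factor, two annotations.
chip_rows = ["g1", "g2", "g3", "g4", "g5", "g6", "g7"]
chip_cols = {"TF1": {"g1", "g2", "g3", "g7"}}
go_rows = ["g1", "g2", "g3", "g4", "g5", "g6"]
go_cols = {"stress": {"g1", "g2", "g3"}, "other": {"g4", "g5", "g6"}}

result = enrich_bin(chip_rows, chip_cols, go_rows, go_cols)
for pair, p in sorted(result.items()):
    print(pair, p)
```

Gene g7 appears only in the ChIP matrix, so it is dropped before testing, mirroring the rule that the output refers only to rows present in both A and B.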
Correlations - Correlation is a measure of the relation between two or more variables [5]. Correlation coefficients range from -1 to +1. A value of -1 represents a perfect negative correlation, while a value of +1 represents a perfect positive correlation. A value of 0 represents a lack of correlation. We implemented two methods in ENRICH to calculate a correlation coefficient, Pearson and Spearman, both described below.
a. Pearson Correlation Test
The Pearson correlation coefficient measures the linear relationship between two variables, and is calculated as follows:
$$r_{xy} = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}}$$
It determines the extent to which the values of the two variables are proportional to each other. That is, the correlation is high if the relationship can be 'summarized' by a straight line.
b. Spearman Rank Correlation Test
The Spearman rank correlation test is a nonparametric correlation test. The Spearman rank correlation can be thought of as the regular Pearson correlation coefficient computed from the ranks of the values rather than from the values themselves.

2.4.3 EnrichAnno - Performs the test specified by the user on two matrices: matrix A containing binary values, and matrix B containing decimal values. As in 'EnrichFlo' and 'EnrichBin', 'EnrichAnno' performs the statistical test for every pair of columns (vectors) $x \in A$ and $y \in B$. Every decimal vector y is divided into two independent groups, according to the matching binary values in the binary vector x. There are three possible tests for two independent data sets, described below.
a. Unpaired Student's t Test
A parametric test to evaluate the difference in means between two groups. The P-value reported with a t-test represents the probability of error involved in accepting the research hypothesis that a difference exists. That is, it is the probability of error associated with rejecting the hypothesis of no difference between the two categories of observations (corresponding to the groups) in the population when, in fact, that hypothesis is true [2].
b. Wilcoxon Rank-sum Test (Mann-Whitney-Wilcoxon)
A nonparametric alternative to the unpaired t-test. This test uses only the ranks of the observations and not their magnitudes. It pools the observations of the two groups and ranks them from smallest to largest. The test then adds all the ranks associated with one of the groups, giving the test statistic [1].
c. Kolmogorov-Smirnov Test
The Kolmogorov-Smirnov test (KS test) tries to determine whether two data sets differ significantly. This test has the advantage of making no assumption about the distribution of the data: it is non-parametric and distribution free. The KS test is appropriate only for data from a continuous distribution, such as the normal distribution. It is based on the empirical cumulative distribution function (ECDF) [1]. To compare two empirical cumulative distributions, $S_N(x)$ containing N events and $S_M(x)$ containing M events, the statistic $D_{MN}$ is calculated:
$$D_{MN} = \max_{x} \left| S_N(x) - S_M(x) \right|$$

2.5 Multiple Hypothesis Correction
The statistical analysis that ENRICH performs on a data set typically involves not just a single hypothesis, but rather many. For any particular test, a pre-set probability α of a type-1 error (i.e. a false positive, rejecting the null hypothesis when in fact it is true) is assigned. The problem of multiple comparisons is that we would like to control the false positive rate not just for any single test but also for the entire collection of tests that makes up our experiment. Multiple hypothesis correction methods attempt to keep the overall chance of getting any false positives at the same level (e.g. 0.05). This is done by the ENRICH function 'Refine'.
It refines a P-value matrix with a cutoff specified by the user. The refinement is performed according to the specified method, one of two possibilities: the Bonferroni correction or FDR.

2.5.1 The Bonferroni correction
The Bonferroni correction is a multiple-comparison correction used when several dependent or independent statistical tests are performed simultaneously. If a particular outcome of an experiment is unlikely to happen, repeating the experiment multiple times increases the probability that the outcome appears at least once [4]. While a given α value may be appropriate for each individual comparison, it is not appropriate for the set of all comparisons. In order to avoid a large number of false positives, the α value needs to be lowered to account for the number of comparisons being performed. With m comparisons, the Bonferroni correction tests each null hypothesis, independently of the outcome of the others, at level α/m.

2.5.2 False Discovery Rate Correction
The FDR is the fraction of false positives among all tests declared significant. The motivation for using the FDR is that we may be running a very large number of tests, with those declared significant being subjected to further study. An example is searching for differentially expressed genes in a certain microarray experiment. The set of all genes in this experiment is very large, and we want to find the significantly differentially expressed genes. The idea is that the statistical procedure results in a significant enrichment of differentially expressed genes, controlling the fraction of false positives within the enriched set by specifying a value for the FDR. Choosing an FDR of 5% means that (on average) 5% of the genes we picked as significant are actually false positives (and 95% of the genes declared significant do indeed have differential expression). Hence, screening genes with an FDR of 5% results in a significant enrichment of genes that are truly differentially expressed [6].
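The two refinement methods can be sketched as follows in Python (ENRICH itself is written in Perl, and this is not its 'Refine' code). The function names and the flat list of P-values are assumptions made for the example. For the FDR, the sketch assumes the step-up procedure of Benjamini and Hochberg [6]; the text above does not specify which FDR procedure ENRICH applies.

```python
def bonferroni(pvals, alpha=0.05):
    """Flag each P-value that passes the Bonferroni-corrected level alpha/m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def benjamini_hochberg(pvals, q=0.05):
    """Flag P-values declared significant by the Benjamini-Hochberg step-up rule.

    Sort the P-values, find the largest rank r with p_(r) <= r*q/m,
    and declare the r smallest P-values significant.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda idx: pvals[idx])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank * q / m:
            cutoff_rank = rank  # keep the largest rank that passes
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant


# Toy P-values, invented for illustration.
pvals = [0.04, 0.03, 0.02, 0.013]
print(bonferroni(pvals))
print(benjamini_hochberg(pvals))
```

On these toy values the Bonferroni level is 0.05/4 = 0.0125, so no test passes, while the step-up rule accepts all four because the largest P-value already satisfies its own rank threshold. This illustrates why the FDR is the less conservative of the two criteria.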
Suppose a total of N hypotheses are tested, K of which are judged significant (by the criteria being used for each test). If we had complete knowledge, we would know that n of the hypotheses have the null hypothesis true and m = N - n have the alternative hypothesis true, and we might find that F of the true nulls were called significant, while T of the true alternatives were called significant, as can be seen in table 1.

                          Null true   Alternative true   Total
Called significant        F           T                  K
Called not significant    n-F         m-T                N-K
Total                     n           m                  N

Table 1. The FDR multiple hypothesis correction method

For this experiment, the false discovery rate is the fraction of tests called significant that are actually true nulls, FDR = F/K. More generally, the FDR is the expected value of this fraction:
$$\mathrm{FDR} = E\left[\frac{\#\text{ false positives}}{\#\text{ positives}}\right]$$

3 RESULTS
In order to test ENRICH, we decided to try to create a partial regulation network for the yeast S. cerevisiae. We picked a few key biological processes in the cell's life, namely response, metabolism and ubiquitination, and used ENRICH as described in figure 1 in order to gain insight into the transcription factors that are involved in these processes.
The process shown in figure 2 uses two sets of data represented as two binary matrices. The first matrix is the result of a ChIP analysis done in Rick Young's lab, which shows, for every known yeast transcription factor, all the genes it binds. The other matrix is a Gene Ontology (GO) matrix, representing for each gene the GO annotations that it holds. Using these two matrices, ENRICH creates a new matrix of p-values, where each cell holds the p-value obtained from performing a statistical significance test (such as a hypergeometric test) over two columns, one from each of the previous matrices. The p-values are then refined according to the multiple hypothesis correction, resulting in a decimal matrix, which is then transformed into a binary one using a significance threshold.
The resulting matrix has a binary value for every pair of transcription factor and GO annotation: one if there is a strong enrichment between them, and zero if not. Then, ENRICH is used to select all the GO annotations related to a certain process, e.g. response. The GO annotations are selected such that there is minimal overlap between them (an example of overlapping annotations is "response to stress" and "response to stress conditions"), and the selected GO annotations and the enriched transcription factors are then visualized by DOT/NEATO as a directed graph (or network).
The resulting networks can be seen in figures 3 and 4. A closer look reveals a set of transcription factors that ENRICH found to be associated with each process, where each transcription factor is also associated with a sub-process within that process.
In order to evaluate the quality of ENRICH's predictions of sub-processes and transcription factors, we can look at the predictions for the following processes.

3.1.1. Case study of results for Transport process
MSN2 is a transcriptional activator related to Msn4p; it is activated in stress conditions and binds DNA at stress response elements of responsive genes, inducing gene expression [7,8]. ENRICH predicted it to be associated with two sub-processes of response, namely response to stimulus and response to stress, which is reassuring. MCM1 is a transcription factor involved in cell-type-specific transcription and pheromone response [9-11], and ENRICH predicted it to be involved in two transport sub-processes that involve the response to pheromone. STE12, TEC1 and DIG1, which were predicted by ENRICH to be involved in response to pheromone, are known to induce mating and growth in response to pheromone induction [12,13].
MBP1 is a known transcription factor involved in the regulation of cell cycle progression from G1 to S phase [14], which usually indicates a crucial point of regulation, and ENRICH predicted it to be involved in response to DNA damage, which seems very reasonable. CAD1 is a transcription factor known to be involved in iron metabolism [15]. Since iron is an inorganic substance, it is consistent that ENRICH placed CAD1 as connected to response to inorganic compounds. YAP7 is a putative transcription factor of unknown role in yeast, yet ENRICH strongly places it as involved in response to chemical and abiotic stimulus. Considering that all the other TFs we discussed were well categorized by ENRICH, we believe this case to be the same, although further investigation is necessary.

3.1.2. Case study of results for Metabolism process
A representative set of transcription factors from the results for the metabolism process gives us some encouragement about the rest of them. GCR1 and GCR2 are known transcription factors involved in glycolysis [16-18]. THI2 is known to be involved in the biosynthesis of thiamine [19,20], MET4, MET31 and MET32 in the biogenesis of sulfur amino acids [21-23], and GAT1 and GAT3 in nitrogen compound metabolism [24-26].

3.1.3. Case study of results for Ubiquitin related processes
In the case of the four proteins ENRICH found to be involved in the ubiquitin cycle, things are less decisive. Apart from RPN4, which is a transcription factor that stimulates expression of proteasome genes [27,28] and is therefore highly connected to the ubiquitin cycle, the rest of the proteins seem less connected according to the literature. REB1 is an RNA polymerase enhancer protein [29], RCS1 is involved in iron utilization [30,31] and ADR1 regulates alcohol related genes and peroxisomal proteins [32]. Yet they were the genes ENRICH found to be highly enriched with the ubiquitin annotation; we therefore suggest that each of them plays some role in the ubiquitin cycle.
3.2 Response versus localization
During our work we asked ourselves the following biological question: do mitochondrial genes react differently to heat shock conditions than peripheral genes do? In order to answer that question we used ENRICH in the way described in figure 5, using localization information and experimental annotations to obtain a result matrix that shows, for various heat shock conditions, the reaction intensity of mitochondrial genes versus peripheral genes. The results can be seen in figure 6, where a dark cell represents a high p-value, meaning that there was no significance for those genes, and a red cell represents a high enrichment for that gene group under the relevant condition. The results clearly show that genes located in the mitochondria react much more strongly to heat shock than genes located in the cell periphery. This is an interesting biological insight, gained very easily by using ENRICH with just a few commands. In the literature on this subject we have not found anything decisive, but the existence of mitochondrial HSPs (heat shock proteins) and the involvement of the mitochondria in apoptotic pathways perhaps strengthen the results. Further investigation of the issue remains to be done.

Figure 2. On the left are the two data sets used in the process: ChIP data and GO annotation data. A hypergeometric test is done in order to check enrichment for every pair of columns. The result is a p-value matrix where each cell, such as the yellow one, is the result of the test over the two columns marked in purple. Then, a multiple hypothesis correction is done, and using a significance threshold a binary matrix is created where each cell indicates whether there is an enrichment between the two columns.

Figure 3. The regulatory network created by ENRICH (using Neato) for the Metabolism process. The transcription factors are filled with light green and the sub-processes are dark green.

Figure 4. The regulatory network created by ENRICH (using Neato) for the Response and Ubiquitin processes. The transcription factors are unfilled and the sub-processes are dark green.

Figure 5. On the left are the two data sets used in the process: experimental expression data and experimental localization data. An unpaired t-test is performed in order to check enrichment for every pair of columns, resulting in a p-value matrix. That matrix is then transformed to binary using a threshold, resulting in the top matrix, where each cell represents a specific experiment vs. a cell compartment. The yellow columns are the columns of the mitochondria and the cell periphery, which interest us. The bottom matrix is an experiment vs. conditions matrix, in which every row is a different experiment and every column is a different condition; for this query, we chose only the various heat shock conditions in this matrix. Then a hypergeometric test is performed in order to find how enriched these cell compartments are with respect to heat shock conditions.

Figure 6. The resulting table of the process shown in figure 5. The Experiment column indicates which kind of heat shock condition was used in the related experiments; the Cell Periphery and Mitochondrion columns show the enrichment for genes located in the cell periphery and mitochondria respectively. A red cell indicates high enrichment, whereas a black cell indicates low enrichment. From these results it is clear that the mitochondrial genes have a much more powerful response to heat shock conditions than do peripheral genes.

DISCUSSION
In this project we presented ENRICH, a program we created to aid researchers in performing various statistical significance tests. We spent a large portion of the work on writing and cleaning the code itself, and on trying to make it as convenient as possible. We then moved on to test our program on real data sets; we used various data types such as ChIP data from Rick Young's lab, experimental expression data from David Botstein's lab and others, localization data and more. Our first aim was to try to create a yeast regulation network, not of transcription factors and their targets, but rather of processes and the transcription factors involved in each process and its sub-processes.
The results and the way they were obtained show the power of ENRICH as an easy-to-use tool which we feel can be very useful for gaining biological insights. It still lacks many features, among them visualization and incorporated clustering methods. However, we strongly feel that it can be developed into a strong and useful computational tool for researchers.

ACKNOWLEDGEMENTS
We would like to thank Prof. Nir Friedman and his Ph.D. student Tommy Kaplan for guiding us through the project.

REFERENCES
[1] Siegel, S. Nonparametric Statistics for the Behavioral Sciences (international student edition).
[2] O'Mahony, M. Sensory Evaluation of Food: Statistical Methods and Procedures.
[3] Conover, W.J. (1998). Practical Nonparametric Statistics (3rd Ed.).
[4] Abdi, H. (2007). Bonferroni and Sidak corrections for multiple comparisons. In N.J. Salkind (Ed.): Encyclopedia of Measurement and Statistics. Thousand Oaks (CA): Sage.
[5] Allison, D.B., G.L. Gadbury, M. Heo, J.R. Fernandez, C.-K. Lee, T.A. Prolla, and R. Weindruch. 2002. A mixture model approach for the analysis of microarray gene expression data. Computational Statistics and Data Analysis 39: 1-20.
[6] Benjamini, Y., and Hochberg, Y. 1995. Controlling the False Discovery Rate: a practical and powerful approach to multiple testing. J. Royal Stat. Soc. B 57: 289-300.
[7] Martinez-Pastor MT, et al. (1996) The Saccharomyces cerevisiae zinc finger proteins Msn2p and Msn4p are required for transcriptional induction through the stress response element (STRE). EMBO J 15(9): 2227-35.
[8] Gorner W, et al. (1998) Nuclear localization of the C2H2 zinc finger protein Msn2p is regulated by stress and protein kinase A activity. Genes Dev 12(4): 586-97.
[9] Passmore S, et al. (1989) A protein involved in minichromosome maintenance in yeast binds a transcriptional enhancer conserved in eukaryotes. Genes Dev 3(7): 921-35.
[10] Elble R and Tye BK (1991) Both activation and repression of a-mating-type-specific genes in yeast require transcription factor Mcm1. Proc Natl Acad Sci U S A 88(23): 10966-70.
[11] Lydall D, et al. (1991) A new role for MCM1 in yeast: cell cycle regulation of SWI5 transcription. Genes Dev 5(12B): 2405-19.
[12] Tedford K, et al. (1997) Regulation of the mating pheromone and invasive growth responses in yeast by two MAP kinase substrates. Curr Biol 7(4): 228-38.
[13] Liu H, et al. (1993) Elements of the yeast pheromone response pathway required for filamentous growth of diploids. Science 262(5140): 1741-4.
[14] Koch C, et al. (1993) A role for the transcription factors Mbp1 and Swi4 in progression from G1 to S phase. Science 261(5128): 1551-7.
[15] Lesuisse E and Labbe P (1995) Effects of cadmium and of YAP1 and CAD1/YAP2 genes on iron metabolism in the yeast Saccharomyces cerevisiae. Microbiology 141 (Pt 11): 2937-43.
[16] Uemura H and Jigami Y (1992) Role of GCR2 in transcriptional activation of yeast glycolytic genes. Mol Cell Biol 12(9): 3834-42.
[17] Holland MJ, et al. (1987) The GCR1 gene encodes a positive transcriptional regulator of the enolase and glyceraldehyde-3-phosphate dehydrogenase gene families in Saccharomyces cerevisiae. Mol Cell Biol 7(2): 813-20.
[18] GCR2, a new mutation affecting glycolytic gene expression in Saccharomyces cerevisiae. Mol Cell Biol 10(12): 6389-96.
[19] Nishimura H, et al. (1992) Cloning and characteristics of a positive regulatory gene, THI2 (PHO6), of thiamin biosynthesis in Saccharomyces cerevisiae. FEBS Lett 297(1-2): 155-8.
[20] Nosaka K, et al. (1994) Isolation and characterization of the THI6 gene encoding a bifunctional thiamin-phosphate pyrophosphorylase/hydroxyethylthiazole kinase from Saccharomyces cerevisiae. J Biol Chem 269(48): 30510-6.
[21] Multiple transcriptional activation complexes tether the yeast activator Met4 to DNA. EMBO J 17(21): 6327-36.
[22] Met31p and Met32p, two related zinc finger proteins, are involved in transcriptional regulation of yeast sulfur amino acid metabolism. Mol Cell Biol 17(7): 3640-8.
[23] Blaiseau PL, et al. (1997) Met31p and Met32p, two related zinc finger proteins, are involved in transcriptional regulation of yeast sulfur amino acid metabolism. Mol Cell Biol 17(7): 3640-8.
[24] Kuruvilla FG, et al. (2001) Carbon- and nitrogen-quality signaling to translation are mediated by distinct GATA-type transcription factors. Proc Natl Acad Sci U S A 98(13): 7283-8.
[25] Cooper TG (2002) Transmitting the signal of excess nitrogen in Saccharomyces cerevisiae from the Tor proteins to the GATA factors: connecting the dots. FEMS Microbiol Rev 26(3): 223-38.
[26] Cox KH, et al. (1999) Genome-wide transcriptional analysis in S. cerevisiae by mini-array membrane hybridization. Yeast 15(8): 703-13.
[27] Ng DT, et al. (2000) The unfolded protein response regulates multiple aspects of secretory and membrane protein biogenesis and endoplasmic reticulum quality control. J Cell Biol 150(1): 77-88.
[28] Xie Y and Varshavsky A (2001) RPN4 is a ligand, substrate, and transcriptional regulator of the 26S proteasome: a negative feedback circuit. Proc Natl Acad Sci U S A 98(6): 3056-61.
[29] Morrow BE, et al. (1989) Proteins that bind to the yeast rDNA enhancer. J Biol Chem 264(15): 9061-8.
[30] Gil R, et al. (1991) RCS1, a gene involved in controlling cell size in Saccharomyces cerevisiae. Yeast 7(1): 1-14.
[31] Yamaguchi-Iwai Y, et al. (1996) Iron-regulated DNA binding by the AFT1 protein controls the iron regulon in yeast. EMBO J 15(13): 3377-84.
[32] Simon M, et al. (1991) The Saccharomyces cerevisiae ADR1 gene is a positive regulator of transcription of genes encoding peroxisomal proteins. Mol Cell Biol 11(2): 699-704.

APPENDIX A
We now elaborate on some ENRICH functions in more detail.
ENRICH Orientation:
Help - Print the documentation of all functions or of a specific function.
Whos - Print all the currently used variables of the user.
Load - Load a new file into the program and return a data handle for this file.
Save - Save the data of a given handle into a file.
Mathematical Operations:
AddCon - Add a constant numeric value to the values in all cells of the matrix.
Transpose - Turn rows into columns and vice versa.
LogScale - Convert all the matrix values into logarithmic scale values in any base. Presenting data on a logarithmic scale can be helpful when the data covers a large range of values; the logarithm reduces this to a more manageable range.
Normal - Normalize all matrix values to an expected value of 0 and a variance of 1, as in the standard normal distribution.
AvgR/AvgC - Calculate the average value of the rows/columns respectively.
Convert2bin - Convert the numeric values of a matrix to binary values using a given threshold (assigning 1 to values above the threshold and 0 to values below the threshold).
Negate - Reverse the sign of all the values in the matrix.
Data Queries:
SelectRowsByName/SelectColsByName - Select from the matrix only the rows/columns respectively whose headers either contain or exactly match a specific word.
SelectRowsBySum - Select from the matrix only the rows whose numeric values sum to more than a certain threshold.
GetGenes - Return all the matrix row headers.
GetRowsNum/GetColsNum - Return the number of rows/columns respectively of the matrix.
ExtractPvals - Extract and write to a file the column headers that received a P-value below a specified threshold.

APPENDIX B
Additional information and software, such as the ENRICH Perl script, the EnrichFunctions Perl module, a running example and other results and figures, are all available at the ENRICH web site: http://www.huji.ac.il/~manor460/enrich