How noisy and replicable are DNA microarray data?

Suman Sundaresh1,3,*, She-pin Hung2,3,*, G. Wesley Hatfield2,3, and Pierre Baldi1,3

1. School of Information and Computer Science, University of California, Irvine CA 92697
2. Department of Microbiology and Molecular Genetics, College of Medicine, University of California, Irvine CA 92697
3. Institute for Genomics and Bioinformatics, University of California, Irvine CA 92697

E-mail: suman@uci.edu, shung@uci.edu, gwhatfie@uci.edu, pfbaldi@uci.edu (corresponding author)

* These authors contributed equally to this work.

Abstract: This paper analyzes variability in highly replicated measurements of DNA microarray data conducted on nylon filters and Affymetrix GeneChipsTM with different cDNA targets, filters and imaging technology. Replicability is assessed quantitatively using correlation analysis as a global measure, and differential expression analysis and ANOVA at the level of individual genes.

Keywords: DNA microarrays, sources of variation, replication, correlation, differential expression analysis, ANOVA

Biographical notes: Suman Sundaresh is a PhD student in the Computer Science Department at UC Irvine. She gained her MSc and BSc (Hons) in Computer Science from the National University of Singapore. Her research interests are in the areas of data mining, machine learning and biomedical informatics.

She-pin Hung is a postdoctoral researcher in the Department of Microbiology and Molecular Genetics, in conjunction with the Institute for Genomics and Bioinformatics, at UC Irvine. She received her PhD from the University of California at Irvine in 2002. Her research interests are in the areas of global gene expression profiling with the use of DNA microarrays and bioinformatics.

G. Wesley (Wes) Hatfield, Ph.D., is a Professor of Microbiology and Molecular Genetics in the College of Medicine and Associate Director of the Institute for Genomics and Bioinformatics at the University of California, Irvine. Dr. Hatfield holds a Ph.D. degree from Purdue University and a B.A. degree from the University of California at Santa Barbara. His primary areas of scientific expertise include molecular biology, biochemistry, microbial physiology, functional genomics, and bioinformatics. His recent academic interests include the application and development of genomic and bioinformatics methods to elucidate the effects of chromosome structure and DNA topology on gene expression. He has received national recognition for his scientific contributions, including the Eli Lilly and Company Research Award bestowed by the American Society of Microbiology.

Pierre Baldi is a Professor in the School of Information and Computer Science and the Department of Biological Chemistry and the Director of the Institute for Genomics and Bioinformatics at the University of California, Irvine. He received his PhD from the California Institute of Technology in 1986. From 1986 to 1988 he was a postdoctoral fellow at the University of California, San Diego. From 1988 to 1995 he held faculty and member-of-the-technical-staff positions at the California Institute of Technology and at the Jet Propulsion Laboratory. He was CEO of a startup company from 1995 to 1999 and joined UCI in 1999. He is the recipient of a 1993 Lew Allen Award at JPL and a Laurel Wilkening Faculty Innovation Award at UCI. Dr. Baldi has written over 100 research articles and four books. His research focuses on biological and chemical informatics, AI, and machine learning.

Copyright © 200x Inderscience Enterprises Ltd.

Introduction

This paper analyzes and quantifies certain aspects of "noise" contained in DNA microarray data.
A DNA microarray experiment comprises several steps such as cDNA spotting, mRNA extraction, target preparation, hybridization, image scanning and analysis. These procedures can be further subdivided into dozens of other elementary steps, each of which can introduce some amount of variability and noise. In addition to the variability introduced by the instruments and the experimenter, there is biological variability, which also has multiple sources ranging from fluctuations in the environment to the inherently stochastic nature of nano-scale regulatory chemistry (Barkai and Leibler, 2000; Hasty et al., 2000; McAdams and Arkin, 1999) [4,7,17]; transcription alone involves dozens of individual molecular interactions. These compounded forms of "noise" may lead one to doubt whether any reliable signal can be extracted at all from DNA microarrays. Here, we show that, while certainly noisy, DNA microarray data do contain reliable information. In this study, we look at highly replicated (up to 32x) experiments performed by different experimenters at different times in the same laboratory, using, as a model organism, wild-type Escherichia coli. In addition, we obtain these microarray measurements using two different formats, nylon filters and Affymetrix GeneChipsTM. Given the overwhelming number of variables that can in principle contribute to the variability, we focus on a particular subset of variables of great relevance to biologists. In particular, we measure the consistency of the results obtained using the filter technology across different filters and mRNA preparations. We also compare filters to Affymetrix GeneChipTM technology and study the effects of five different image processing methods. Replicability is assessed quantitatively using correlation analysis and differential expression analysis. We use correlation as a global measure of similarity between two sets of measurements.
While a correlation close to one is a good sign, it is a global measure that provides little information at the level of individual genes. Thus, we use differential expression analysis at the level of individual genes to detect which genes seem to behave differently in two different sets of measurements. The data sets and software used in our analysis are available over the Web at http://www.igb.uci.edu/servers/dmss.html.

Our approach differs from and complements previous related studies (Coombes et al., 2002; Piper et al., 2002) [5,18]. In particular, we use higher levels of replication (32x), relatively simpler biological samples (E. coli versus S. cerevisiae or human B-cell lymphoma cell lines) and more diverse microarray technologies (filters and Affymetrix GeneChips). These other studies also focus in part on the analysis of variables that are outside the scope of the present study, such as exposure time or inter-laboratory variability.

Methods

Filter Dataset

The first dataset ("filter dataset") we use consists of 32 sets of measurements from 16 nylon filter DNA microarrays containing duplicate probe sites for each of 4,290 open reading frames (ORFs), hybridized with 33P-labeled cDNA targets from wild-type Escherichia coli cells cultured at 37°C under balanced growth conditions in glucose minimal salts medium. The experimental design and methods for these experiments are described in detail in Arfin et al. (2000), Baldi and Hatfield (2002) and Hung et al. (2002) [1,2,10] and illustrated in Figure 1. Each filter contains duplicate probes (spots) for each of the 4,290 ORFs of the E. coli genome. In Experiment 1, Filters 1 and 2 were hybridized with 33P-labeled, random hexamer generated cDNA targets complementary to each of three independently prepared RNA preparations (RNA1) obtained from the cells of three individual cultures of a wild-type (wt) E. coli strain.
These three 33P-labeled cDNA target preparations were pooled prior to hybridization to the full-length ORF probes on the filters (Experiment 1). Following phosphorimager analysis, these filters were stripped and again hybridized with pooled, 33P-labeled cDNA targets complementary to each of another three independently prepared RNA preparations (RNA2) from the wt strain (Experiment 2). This procedure was repeated with another set of filters, Filters 3 and 4, using two more independently prepared pools of cDNA targets (Experiment 3, RNA3; Experiment 4, RNA4), as described for Experiments 1 and 2. This protocol results in duplicate filter data for four experiments performed with cDNA targets complementary to four independently prepared sets of pooled RNA. Thus, since each filter contains duplicate spots for each ORF and duplicate filters were used for each experiment, 16 measurements (D1-D16) for each ORF from four experiments were obtained. These procedures were repeated with two more pairs of filters (Filters 5-8) for Experiments 5-8 to obtain another 16 measurements (D17-D32) for each ORF.

The filter dataset is fairly representative of other filter datasets in the sense that it corresponds to experiments carried out by different people at different times in the same laboratory. In particular, of the 32 filter measurements, the data from measurements 1-16 were obtained 6 months later than the data from measurements 17-32. During this intervening period the efficiency of the 33P labeling was improved. Consequently, more signals marginally above background were detected on the filters for measurements 1-16. In fact, when we edit out all of the genes that contain one or more measurements at or below background in at least one experiment, we observe the expression of 2,607 genes for measurements 1-16 and 1,579 genes for measurements 17-32.
If we consider the dataset containing all 32 measurements, we find that only 1,257 genes have all 32 expression measurements above background. The log (natural) transformed values (Speed, 2002) [19] of these 1,257 above-background gene expression values for all 32 measurements were used for subsequent analyses.

GeneChip Dataset

To address another DNA microarray technology, we use a second dataset ("GeneChip dataset") that contains data from four Affymetrix GeneChipTM experiments that measure the expression levels of the same E. coli RNA preparations used for filter experiments 1-4. The experimental design and methods for these experiments are illustrated in Figure 2 and described in detail by Hung et al. (2002) [10]. The four GeneChip measurements are each processed by five methods: the MAS 4.0 and MAS 5.0 software of Affymetrix, the dChip software of Li and Wong (2001) [15], RMA (Irizarry et al., 2003) [12,13] and GCRMA (Wu and Irizarry, 2004) [21]. The dataset thus consists of 20 replicate measurements. The 2,370 genes whose expression levels are above background for all 20 measurements are used in the subsequent analyses. We log (natural) transformed the measurements processed with MAS 4.0, MAS 5.0 and dChip. The RMA and GCRMA functions differ from most other expression measures in that they return expression values already transformed to log (base 2).

The gene expression measurements for each experiment of both datasets, from the filters and the GeneChips, are globally normalized as follows: each expression measurement that is above background on all sixteen filters (or all four GeneChips) is divided by the sum of all the gene expression measurements of that filter (or GeneChip). Thus, the signal for each measurement can be expressed as a fraction of the total signal for each filter or GeneChip, or, by implication, as a fraction of total mRNA. This normalization is not applied to the RMA and GCRMA measurements, which already have a built-in normalization step.
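The normalization and log-transformation steps described above can be sketched in a few lines. This is a minimal illustration with hypothetical function names, not the authors' actual analysis scripts:

```python
import numpy as np

def global_normalize(expr):
    """Divide each gene's signal by the array's total signal, so every
    measurement becomes a fraction of total signal (by implication, a
    fraction of total mRNA).  `expr` is a (genes x arrays) matrix of
    above-background intensities; each column is one filter or GeneChip."""
    return expr / expr.sum(axis=0, keepdims=True)

def log_transform(expr):
    """Natural-log transform, as applied to the filter and MAS/dChip data
    (RMA and GCRMA already return log-base-2 values)."""
    return np.log(expr)
```

Each column of the normalized matrix then sums to one, making signals comparable across arrays with different overall intensities.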
The datasets obtained from the experiments described above allow us to investigate not only the effects of environmental and biological factors, but also the consistency of measurements taken from two different DNA microarray technologies.

Image Processing Software

The GeneChip dataset contains 20 replicates, where each of the four GeneChip measurements is processed with five image processing software packages: MAS 4.0, MAS 5.0, dChip, RMA and GCRMA.

In the Affymetrix MAS 4.0 software, the mean and standard deviation of the PM (perfect match) - MM (mismatch) differences of a probe set in one array are computed after excluding the maximum and the minimum values obtained for that probe set. If, among the remaining probe pairs, a difference deviates by more than 3 SD from the mean, that probe pair is declared an outlier and not used for the average difference calculation of both the control and the experimental array. A flaw of this approach is that a probe with a large response might well be the most informative but may be consistently discarded. Furthermore, if multiple arrays are compared at the same time, this method tends to exclude probes inconsistently measured among GeneChips.

Li and Wong (2001) [15] developed a statistical model-based analysis method to detect and handle cross-hybridizing probes, image and/or GeneChip defects, and to identify outliers across GeneChip sets. A probe set from multiple chips is modeled, and the standard deviation between a fitted curve and the actual curve for each probe set for each GeneChip is calculated. Probe pair sets containing one or more anomalous probe pair measurements are declared outliers and discarded. The remaining probe pair sets are remodeled and the fitted curve data are used for average difference calculations. These methods are implemented in a software program, dChip, which can be obtained from the authors.
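The MAS 4.0 outlier rule described above can be sketched as follows. This is our reconstruction from the description, not Affymetrix code, and the treatment of the trimmed extremes is a simplifying assumption:

```python
import numpy as np

def mas4_style_outliers(diffs, k=3.0):
    """Sketch of the MAS 4.0 heuristic: drop the maximum and minimum
    PM-MM differences of a probe set, compute the mean and SD of the
    rest, and flag any difference deviating by more than k*SD from that
    mean as an outlier.  Returns a boolean mask (True = keep)."""
    diffs = np.asarray(diffs, dtype=float)
    order = np.argsort(diffs)
    trimmed = diffs[order[1:-1]]          # exclude one min and one max
    mu = trimmed.mean()
    sd = trimmed.std(ddof=1)
    return np.abs(diffs - mu) <= k * sd
```

Note how a single very large (possibly very informative) probe response is always flagged, which is exactly the flaw pointed out in the text.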
A different empirical approach to improve the consistency of average difference measurements has been implemented in the more recent Affymetrix MAS 5.0 software. In this implementation, if the MM value is less than the PM value, MAS 5.0 uses the MM value directly. However, if the MM value is larger than the PM value, MAS 5.0 creates an adjusted MM value based on the average intensity difference between the ln PM and ln MM values or, if that measurement is too small, some fraction of PM. The adjusted MM values are used to calculate ln(PM - adjusted MM) for each probe pair. The signal for a probe set is calculated as a one-step bi-weight estimate of the combined differences of all of the probe pairs of the probe set.

Irizarry et al. (2003) [12,13] demonstrated that the ln(PM - adjusted MM) technique in MAS 5.0 results in gene expression estimates with elevated variances. The RMA (robust multi-array analysis) approach applies a global background adjustment and normalization, and robustly fits a model consisting of a log-scale expression effect plus a probe effect to the data. This method has been implemented in the Bioconductor affy package (http://www.bioconductor.org). An extension of RMA discussed in Wu and Irizarry (2004) [21], called GCRMA, is based on molecular hybridization theory and takes into account the GC content of each probe sequence in the calculation of non-specific binding. It is also available as part of the Bioconductor project, in the gcrma package. For both RMA and GCRMA, the cel files of this GeneChip dataset were pre-processed with the default settings at the probe level and the expression value of each gene was obtained.
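The "one-step bi-weight estimate" mentioned above is Tukey's one-step biweight, a robust average that downweights probe pairs far from the median. The sketch below is ours; the tuning constants c = 5 and eps = 1e-4 are commonly cited defaults for this estimator, not values taken from this paper:

```python
import numpy as np

def one_step_biweight(x, c=5.0, eps=1e-4):
    """One-step Tukey biweight: center on the median, scale by the MAD,
    give zero weight to points more than c scale units away, and return
    the weighted mean.  Applied per probe set to ln(PM - adjusted MM)."""
    x = np.asarray(x, dtype=float)
    m = np.median(x)
    s = np.median(np.abs(x - m))          # median absolute deviation
    u = (x - m) / (c * s + eps)
    w = np.where(np.abs(u) < 1.0, (1.0 - u ** 2) ** 2, 0.0)
    return float(np.sum(w * x) / np.sum(w))
```

Unlike a plain mean, this estimate is essentially unaffected by a single wildly discrepant probe pair.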
Correlation and Differential Expression Analyses

To measure the consistency between two sets of measurements globally, such as different filters or different cDNA target preparations, we use the Pearson correlation coefficient. We compute and analyze matrices of correlation coefficients between different sets of measurements. The correlation coefficient provides a global measure of similarity but little information about possible fluctuations at the level of individual genes. A high level of global similarity, with a correlation of, for instance, 0.95, can hide significant fluctuations at the level of individual gene measurements.

To address the issue of fluctuations at the level of single genes, we also perform differential analysis to detect false positives. The term "false" here is used in reference to the experimental setup and not to the underlying physical reality. In other words, differences in expression across two different measurements may well be real and result, for instance, from random fluctuations, but they are false positives in the sense that ideally they should not have occurred, since all conditions are supposed to be the "same". Several methods for differential analysis have been developed in the literature, such as fold analysis, the t-test, the regularized t-test (Baldi and Hatfield, 2002; Baldi and Long, 2001) [2,3] and SAM (Tusher et al., 2001) [20], applied to raw data or to data transformed in different ways (Durbin et al., 2002; Huber et al., 2002) [6,9]. The primary goal here is to get a "ballpark" sense of the false positive rates between different sets of measurements under a typical and widespread analysis protocol. Thus, for illustration purposes, we use the t-test applied to the log-transformed data with a detection threshold corresponding to p-values of 0.005 or less.
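The global correlation measure reduces to a single matrix computation. A minimal sketch (the function name and data layout are our assumptions):

```python
import numpy as np

def correlation_matrix(data):
    """Pearson correlations between all pairs of replicate measurements.
    `data` is a (genes x replicates) matrix of log-transformed signals;
    the result is a (replicates x replicates) matrix of the kind plotted
    as an intensity image in Figure 3."""
    return np.corrcoef(data, rowvar=False)
```

Cells near 1 indicate replicate pairs that agree globally; as noted above, even values around 0.95 can hide substantial gene-level fluctuations.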
To isolate the effects of the filters, the targets and combinations of these factors, we obtained the number of genes significantly differentially expressed (as false positives) by comparing duplicate measurements (same cDNA targets or same filters) with other pairs of duplicate measurements. In this case, we assume a normal distribution for the expression levels of the duplicate measurements of each gene. We also perform a two-way factorial ANOVA to estimate the percentage contributions of cDNA targets and filters to the total variance.

Results

Correlations within Filter Data

A 32 x 32 correlation matrix of the duplicate measurements of all above-background target signals present on the 16 filters described in Figure 1 is shown in Figure 3. The correlations are plotted as an intensity matrix, where darker cells correspond to numbers closer to 1, indicating a stronger correlation. The reference chart for the intensities is shown on the left. These results clearly demonstrate strong correlations among the first 16 measurements of the filter experiments (D1-D16 vs. D1-D16) as well as strong correlations among the measurements of the experiments performed 6 months earlier (D17-D32 vs. D17-D32). However, low correlation is observed between measurements of experiments performed at different times (D1-D16 vs. D17-D32). These results demonstrate that significant variance can be introduced into a DNA microarray experiment when experimental parameters such as personnel, reagents, protocols, and experimental methods vary. For example, we know that during this time frame the 33P labeling was improved. We also notice that the measurements obtained using RNA3 (D9-D12) do not correlate as well with the other 12 measurements taken during the same time frame. The reason is unknown and may have to do with that particular RNA preparation, some day-to-day variation in the experimental procedure, or a combination of both.
Since two filters are hybridized with the same cDNA targets, and two cDNA target preparations are hybridized to the same filter, for each of eight cDNA target preparations, we are able to examine the correlations both between filters and between cDNA target preparations (Figure 1). Figure 4 shows typical scatter plots (A) and the intensity image (B) observed when analyzing groups of 8 measurements (e.g. D1-D8, D25-D32) corresponding to a quadrant in Figure 1. Duplicate measurements (e.g. D1-D2, D3-D4) are averaged and compared with other duplicate measurements. The intensity image shows that there is a higher correlation when the same cDNA targets are hybridized to different filters (D1-D2 with D3-D4) than when different cDNA targets are hybridized to the same filters (D1-D2 with D5-D6).

We have summarized the observations from Figure 3 in Table 1. When duplicate measurements of each filter are compared, a high average correlation of 0.97 is observed. A reasonably high correlation is also observed when we compare the measurements among different filters hybridized with the same targets (0.95). However, less correlation is observed when different targets are hybridized to the same filters (0.92). This demonstrates greater variance among target preparations (biological variance) than among filters (experimental variance). Thus, it stands to reason that the variance is even greater when different target preparations are hybridized to different filters. It should be noted that the variability among target preparations can be significantly reduced by pooling independently prepared target samples prior to hybridization (Arfin et al., 2000; Baldi and Hatfield, 2002; Hung et al., 2002) [1,2,10].
In addition to confirming earlier suggestions that the experimental and biological variables of a DNA microarray experiment contribute more variance than differences among the microarrays themselves (Arfin et al., 2000) [1], these data demonstrate both the subtle differences among replicated gene measurements obtained from DNA microarray experiments and the more dramatic differences that are observed when basic changes in experimental protocols are adopted.

Correlations within GeneChip Data

When considering different array formats that require different target preparation methods and a fundamentally different probe design, the sources and magnitudes of the experimental errors are different. To illustrate this, we examine the correlations among data sets of another DNA microarray format. We use data sets obtained with Affymetrix GeneChips, which are manufactured by the in situ synthesis of short single-stranded oligonucleotide probes, complementary to sequences within each ORF, directly on a glass surface. Nylon filter arrays, in contrast, are manufactured by the attachment of full-length, double-stranded DNA probes of each E. coli ORF directly onto the filter, as described earlier.

The intensity matrix in Figure 5 shows how the log-transformed GeneChip expression data are correlated when processed with Affymetrix MAS 4.0, MAS 5.0, dChip, RMA or GCRMA. Overall, the correlations among the different measurements of the GeneChip experiments are high (>0.7) and comparable to those observed in the first 16 measurements of the filter dataset. Looking from the bottom left, the first four rows and columns in Figure 5 compare the consistency of measurements obtained from four GeneChips processed with the MAS 4.0 software, the data in rows 5-8 and columns 5-8 compare the consistency of measurements obtained from four GeneChips processed with the dChip software, the data in rows 9-12 and columns 9-12 compare the consistency of measurements obtained from four GeneChips processed with the MAS 5.0 software, and so on. Average correlations are calculated for each of the 15 blocks (see Figure 5) comparing pairs of software packages. These correlations reveal that the most consistent set of measurements is obtained with GCRMA (average correlation = 0.97), with RMA a close second (average correlation = 0.96). MAS 5.0 performs marginally better (average correlation = 0.83) than the previous MAS 4.0 version (average correlation = 0.81), and dChip is better correlated within itself than both of the MAS versions (average correlation = 0.85). It is also apparent that close correlations are observed when MAS 4.0 and MAS 5.0 processed data are compared to one another (average correlation = 0.82), and that this correlation is better than that between dChip and MAS 4.0 or MAS 5.0 processed data (average correlation = 0.78). Another interesting observation is that RMA and GCRMA are better correlated with the MAS versions than with dChip, as shown by the averaged correlation numbers in the boxes.

Correlation of Filter with GeneChip Data

To compare GeneChip and filter data, the same four pooled total RNA preparations used for the nylon filter experiments 1-4 (Figure 1) were used for hybridization to four E. coli Affymetrix GeneChips (Figure 2), as described by Hung et al. (2002). In this case, however, instead of four measurements for each gene expression level, as in each filter experiment, only one measurement was obtained from each GeneChip. On the other hand, this single measurement is the average of the difference between hybridization signals from approximately 15 perfect match (PM) and mismatch (MM) probe pairs for each ORF. While these are not equivalent to duplicate measurements, because different probes are used, these data can increase the reliability of each gene expression level measurement (Baldi and Hatfield, 2002) [2].
Nevertheless, large differences in the average difference of individual probe pairs are often observed. To compare the different technologies, we first averaged the filter measurements obtained from the same cDNA target, resulting in one expression profile per cDNA target. We then compared the four averaged filter measurements to the measurements obtained from the four GeneChips (each corresponding to one cDNA target). The data in Figure 6 indicate weak correlations (< 0.4) between the filter measurements and the Affymetrix GeneChip data, no matter which image processing software was used. We also notice that the measurements obtained from experiment 3 (D9-D12), which were observed earlier to be less consistent with the other measurements, are strikingly uncorrelated with the GeneChip measurements.

The low correlation between the filter and GeneChip measurements is possibly due to probe-specific effects arising from differences in the hybridization efficiencies of different probes. Individual outliers can have a large effect on the average difference for probe pair sets of individual GeneChips. In fact, it has been reported that this variance can be five times greater than the variance observed among GeneChips (Li and Wong, 2001) [15]. These probe effects are smaller for filters containing full-length ORF probes hybridized to targets generated with random hexamers than for Affymetrix GeneChips, which query only a limited number of target sequences. As a result, it is expected that the signal intensities obtained from Affymetrix GeneChips are less correlated with in vivo transcript levels than signal intensities obtained from filters, thus providing a rationale for why signal intensities obtained from different microarray platforms may not correlate well with one another.
Differential Analysis within Filter Data

We performed a statistical t-test between 16 duplicate pairs of measurements to study the magnitudes of the variances attributable to cDNA targets and/or filters. The number of genes found to be significantly different (p < 0.005) in each of these comparisons is shown in Table 2. Light grey cells identify experiments that compare different filters hybridized with the same cDNA targets. Dark grey cells identify experiments that compare the same filters hybridized with different cDNA targets. The values represent the number of false positive measurements observed when duplicate pairs of measurements from different cDNA target preparations or filters are compared to one another.

The results of Table 2 demonstrate that the average fraction of "false positive" genes obtained when the same targets are hybridized to different filters is about 2%, whereas when different targets are hybridized to the same filters, the number increases to about 6%. When different targets are hybridized to different filters, the average percentage of false positives rises to 10%. This is without taking into account the effects of the large time gap between the two sets of 16 measurements. In other words, even with sample pooling, variances contributed by different cDNA target preparations are, on average, about 3-4 times higher than variances contributed by different filters. Similar relative effects are observed using more sophisticated differential analysis methods that compensate for the relationship between levels of gene expression and their variances. These methods include the approach of Huber et al. (2002) [9], which uses a global arcsinh transformation to eliminate variance fluctuations, as well as the local approach of Baldi and Long (2001) [3] and Long et al. (2001) [16], which uses a regularized t-test in which the variance of each gene is estimated by taking into account the variance of genes with similar expression levels.
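The false-positive counting behind Table 2 can be sketched as follows. This is an illustrative sketch using scipy's standard t-test, not the authors' exact script; the function and parameter names are ours:

```python
import numpy as np
from scipy import stats

def count_false_positives(group_a, group_b, alpha=0.005):
    """Count genes called 'significantly different' between two sets of
    replicate measurements of the SAME condition.  Each group is a
    (genes x replicates) matrix of log-transformed signals.  Because no
    true difference should exist, every call at p < alpha is a false
    positive in the sense used in the text."""
    _, p = stats.ttest_ind(group_a, group_b, axis=1)
    return int(np.sum(p < alpha))
```

With well-behaved replicates, the count should hover near alpha times the number of genes; the excess observed when, say, different targets meet different filters is what the text attributes to biological and experimental variance.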
We noted earlier that the correlation between the measurements dropped significantly across the six-month time difference between D1-D16 and D17-D32. This can be attributed to several experimental changes, including reagents, personnel and labeling. A t-test between the two sets of 16 measurements shows a significant (p < 0.005) difference in the mean values of about 75% of the 1,257 genes.

To estimate the contribution of the two factors, cDNA targets and filters, to the total variance, we performed an ANOVA (Coombes et al., 2002; Kerr et al., 2000) [5,14] for each of the 4 sets of 8 measurements involving 2 filters and 2 cDNA targets. Each quadrant in Figure 1 (e.g. D1-D8) corresponds to one set of 8 measurements. For each gene in each quadrant, we performed a two-way factorial ANOVA to obtain the between-group sum of squares (SSB) for the cDNA target groups and the filter groups, respectively. We summarized the SSB of cDNA targets and filters as a percentage contribution to the total variance, averaged over all genes. On average, 41.3% of the total variance came from differences in cDNA and 20% from filters. The interaction (cDNA x filter) contributed 13.5% to the total variance. These estimates further validate the effect of differences in biological factors on the total variance in the filter dataset.

Differential Analysis within GeneChip and with Filter Data

To address the effects of different image processing software packages, we apply differential analysis to the GeneChip measurements. The numbers of false positive genes with p-values less than 0.005, based on a standard t-test, are shown in Figure 7. Since the RMA and GCRMA functions transform the data differently from MAS 4.0, MAS 5.0 and dChip, it is not meaningful to perform t-tests comparing RMA and GCRMA to the other three software packages, since the means of all 1794 genes will appear significantly different.
We can, however, compare the MAS and dChip software, and the results of these comparisons are presented below. While the replicates obtained using the similar MAS 4.0 and MAS 5.0 software packages exhibit very low levels of falsely identified significant genes, differences of up to 147 genes (5-6%) are observed when comparing dChip to the Affymetrix software. This does not imply that one software package is better than the others, but rather that users should be cognizant of the differences generated by different analysis methods. Figure 7 also shows the large number of false positive genes detected (38%) when the filter dataset is compared with the GeneChip dataset (regardless of the software used) using the standard t-test with p < 0.005, supporting the earlier observation from the correlation analysis that these two datasets may not be combined.

Discussion

Many experimental designs and applications of DNA array experiments are possible. However, no matter what the purpose of these gene expression profiling experiments, a sufficient number of experiments must be performed for statistical analysis of the data, either through multiple measurements of homogeneous samples (replication) or multiple sample measurements (e.g. across time or subjects). Basically, this is because each gene expression profiling experiment results in the simultaneous measurement of the expression levels of thousands of genes. In such a high-dimensional experiment, many genes will show large changes in expression levels between two experimental conditions simply by chance alone. In the same manner, many truly differentially expressed genes will show small changes. These false positive and false negative observations arise from chance occurrences exacerbated by biological variance as well as experimental and measurement errors.
Thus, if we compare the gene expression patterns of cells simply grown under two different treatment conditions or between two genotypes, experimental replication is required for the assignment of statistical significance to each differential gene measurement. Such replications quickly become labor intensive and prohibitively expensive. This leads to the question: how many replicates are required for the data to be considered reliable for further analyses? The short answer is: enough to provide a robust estimate of the standard deviation of the mean of each gene measurement. Due to the prohibitive costs of generating replicates, this problem reduces to finding a method that will produce more robust estimates of the standard deviation of a small set of individual gene measurements with few replications. A few methods address this problem. Techniques by Durbin et al. (2002) [6] and Huber et al. (2002) [9] apply a transformation to the entire dataset so as to render the variance constant and independent of the mean. Another method (Baldi and Long, 2001; Long et al., 2001) [3,16] has shown that the confidence in the interpretation of DNA microarray data with a low number of replicates can be improved by using a Bayesian statistical approach that incorporates information from within-treatment measurements. This method is based on the observation that genes of similar expression levels exhibit similar variance; hence, more robust estimates of the variance of a gene can be derived by pooling neighboring genes with comparable expression levels (Arfin et al., 2000; Baldi and Hatfield, 2002; Hatfield et al., 2003; Hung et al., 2002; Long et al., 2001) [1,2,8,10,16]. However, it would be advantageous if the factors introducing variance in the data could be identified upfront and pre-adjusted to minimize noise in the data.
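The pooling idea just described can be illustrated with a simplified sliding-window version: replace each gene's variance with the average variance of the genes closest to it in mean expression level. This is only a sketch of the intuition behind the Bayesian regularized t-test of Baldi and Long (2001), not their actual estimator, which blends this background variance with the observed one via a Bayesian weight; all names below are hypothetical.

```python
import numpy as np

def windowed_std(mean_log_expr, stds, window=101):
    """Smooth per-gene standard deviations by averaging variances over the
    `window` genes closest in mean expression level (the gene included).
    With few replicates, this borrowed estimate is far more stable than
    each gene's own sample standard deviation."""
    order = np.argsort(mean_log_expr)          # sort genes by expression
    var_sorted = stds[order] ** 2
    half = window // 2
    pooled = np.empty_like(var_sorted)
    for i in range(var_sorted.size):
        lo = max(0, i - half)
        hi = min(var_sorted.size, i + half + 1)
        pooled[i] = var_sorted[lo:hi].mean()   # local average variance
    out = np.empty_like(pooled)
    out[order] = np.sqrt(pooled)               # undo the sort
    return out
```

The smoothed value would then replace, or be blended with, the raw standard deviation in the denominator of the per-gene t-statistic.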
This brings us to the next question: what introduces noise in DNA microarray data? The results presented here demonstrate that the variability inherent in highly replicated (up to 32x) DNA microarray data can result from a number of disparate factors operating at different times and levels in the course of a typical experiment. These numerous factors are often interrelated in complex ways, but for the purpose of simplicity we have broken them down into two major categories: biological variability and experimental variability. Other sources of variability involve DNA microarray fabrication methods as well as differences in imaging technology, signal extraction, and data processing. This study confirms earlier assertions that, even in carefully controlled experiments with isogenic model organisms, the major sources of variance come from uncontrolled biological factors (Hatfield et al., 2003) [8]. Our ability to control biological variation in a model organism such as E. coli, with an easily manipulated genetic system, is an obvious advantage for gene expression profiling experiments. However, most systems are not as easily controlled. For example, human samples obtained from biopsy materials differ not only in genotype but also in cell types. Thus, care should be taken to reduce this source of biological variability as much as possible, for example with the use of laser-capture techniques for the isolation of single cells from animal and human tissues. A related study conducted on human cells (Coombes et al., 2002) [5] found that the differences between two target preparations had a relatively small contribution to the variation compared to membrane reuse and exposure time to phosphorimager screens (the latter was outside the scope of our study). In our experiments, each of the eight sets of pooled E. coli
RNA was extracted on different days, which may account for the greater variation, as seen in the correlation and differential analyses, compared with experimental factors. With regard to the second dataset, obtained from Affymetrix GeneChip experiments, the poor correlation between signal intensities from filter and Affymetrix GeneChip experiments can be attributed in part to probe effects. Nevertheless, signal ratios obtained from the same probe on two different arrays can ameliorate these probe effects. Thus, the overall differential expression profiles obtained from different microarray platforms should be comparable. In support of this conclusion, Hung et al. (2002) [10] have demonstrated that, with appropriate statistical analysis, similar results can be obtained when the same experiments are performed with pre-synthesized filters containing full-length ORF probes and Affymetrix GeneChips. An additional source of biological variation, even when comparing the gene profiles of isogenic cell types, comes from the conditions under which the cells are cultured. In this regard, it has been recommended that standard cell-specific media be adopted for the growth of cells queried by DNA array experiments (Baldi and Hatfield, 2002) [2]. While this is not possible in every case, many experimental conditions for the comparison of two different genotypes of common cell lines can be standardized. The adoption of such medium standards would reduce experimental variations and facilitate the cross-comparison of experimental data obtained from different experiments, different microarray formats, and/or different investigators. However, even with these precautions, non-trivial and sometimes substantial variance in gene expression levels, such as that revealed in this study, is observed even between genetically identical cells cultured in the same environment.
This variance can result from a variety of influences, including environmental differences, phase differences between the cells in the culture, periods of rapid change in gene expression, and multiple additional stochastic effects. To emphasize the importance of microenvironments encountered during cell growth, Piper et al. (2002) [18] have recently demonstrated that variance among replicated gene measurements is dramatically decreased when isogenic yeast cells are grown in chemostats rather than batch cultures. Biological variance can be further exacerbated by experimental errors, for example, if extreme care is not taken in the treatment and handling of the RNA during its extraction from the cell and its subsequent processing. It is often reported that the cells to be analyzed are harvested by centrifugation and frozen for RNA extraction at a later time. It is important to consider the effects of these experimental manipulations on gene expression and mRNA stability. If the cells encounter a temperature shift during the centrifugation step, even for a short time, this could change the gene expression profiles as a consequence of temperature stress. If the cells are centrifuged in a buffer with even a small difference in osmolarity from the growth medium, this could change the gene expression profiles as a consequence of osmotic stress. Also, removal of essential nutrients during the centrifugation period could cause significant metabolic perturbations that would result in changes in gene expression profiles. Each of these and other experimentally caused gene expression changes will confound the interpretation of the experiment. These are not easy variables to control. Therefore, the best strategy is to harvest the RNA as quickly as possible under conditions that 'freeze' it at the levels at which it occurs in the cell population at the time of sampling.
Several methods are available that address this issue (Baldi and Hatfield, 2002) [2]. There are numerous other sources of experimental variability, such as differences among protocols; different techniques employed by different personnel; differences between reagents; and differences among instruments and their calibrations. While these sources of variance are usually smaller than biological sources, they can dominate the results of a DNA microarray experiment. This is illustrated by the poor correlation between the two replicated data sets reported here, one obtained six months after the other. Although there is good correlation among the replicated measurements within each set, there is much less correlation among the measurements between the sets. In this case, the major difference can be attributed to improvements in the cDNA target labeling protocol. It is reassuring to observe that carefully executed and replicated DNA microarray experiments produce data with high global correlations (in the 0.9 range). This high correlation, however, should not be interpreted as a sign that replication is unnecessary. Replication, as well as proper statistical analysis, remains important in order to monitor experimental variability and because the variability of individual genes can be high. It is also reassuring to know that, while correlations of expression measurements across technologies remain low, overall differential expression profiles obtained from different microarray platforms can be compared (Hung et al., 2002) [10]. Finally, the comprehensive and diverse datasets for wild type E. coli under standard growth conditions that have been compiled in the present study, available via the Web (http://www.igb.uci.edu/servers/dmss.html), may serve as a useful set of reference data for DNA microarray researchers and bioinformaticians interested in further developing the technology.
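The global correlations cited in this discussion are ordinary Pearson correlations between experiments, computed on log-transformed intensities (Speed, 2002) so that a few very bright spots do not dominate. A minimal sketch, assuming a genes x experiments intensity matrix (the function name and offset convention are illustrative assumptions):

```python
import numpy as np

def experiment_correlations(intensities):
    """Pairwise Pearson correlation matrix between experiments.

    intensities: array of shape (n_genes, n_experiments) of raw signals.
    The +1.0 offset guards against taking the log of zero-signal spots.
    """
    logged = np.log(intensities + 1.0)
    return np.corrcoef(logged, rowvar=False)   # columns = experiments

# Two identical replicate 'experiments' correlate perfectly:
demo = np.array([[1.0, 1.0], [10.0, 10.0], [100.0, 100.0], [1000.0, 1000.0]])
corr = experiment_correlations(demo)
```

Applied to the posted datasets, such a matrix reproduces the pattern of Figure 3: within-set replicate correlations near 0.9 and much lower values across the six-month gap or across technologies.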
Acknowledgments
This work was supported in part by the UCI Institute for Genomics and Bioinformatics, by grants from the NIH (GM-055073 and GM-068903) to GWH, by a Laurel Wilkening Faculty Innovation Award to PB, and by a Sun Microsystems Award to PB. SH was supported by a postdoctoral training grant fellowship from the University of California Biotechnology Research and Education Program. We are grateful to Cambridge University Press for permission to reproduce materials from the book by PB and GWH titled "DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modeling", ISBN 0521800226 [11].
References
1 Arfin, S. M., Long, A. D., Ito, E. T., Tolleri, L., Riehle, M. M., Paegle, E. S., and Hatfield, G. W. (2000) Global gene expression profiling in Escherichia coli K12. The effects of integration host factor. J. Biol. Chem. 275, 29672-29684
2 Baldi, P., and Hatfield, G. W. (2002) DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modeling, Cambridge University Press, Cambridge, UK
3 Baldi, P., and Long, A. D. (2001) A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 17, 509-519
4 Barkai, N., Leibler, S. (2000) Biological rhythms: Circadian clocks limited by noise. Nature 403, 267-268
5 Coombes, K.R., Highsmith, W.E., Krogmann, T.A., Baggerly, K.A., Stivers, D.N., Abruzzo, L.V. (2002) Identifying and quantifying sources of variation in microarray data using high-density cDNA membrane arrays. J. Comput. Biol. 9, 655-669
6 Durbin, B., Hardin, J., Hawkins, D., Rocke, D. M. (2002) A Variance-Stabilizing Transformation for Gene Expression Microarray Data. Bioinformatics 18, S105-S110 (ISMB 2002)
7 Hasty, J., Pradines, J., Dolnik, M., Collins, J.J. (2000) Noise-based switches and amplifiers for gene expression. Proc. Natl. Acad. Sci. USA 97, 2075-2080
8 Hatfield, G. W., Hung, S.-P., and Baldi, P.
(2003) Differential analysis of DNA microarray gene expression data. Mol. Microbiol. 47, 871-877
9 Huber, W., von Heydebreck, A., Sültmann, H., Poustka, A., Vingron, M. (2002) Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 18 Suppl. 1, S96-S104 (ISMB 2002)
10 Hung, S.-P., Baldi, P., and Hatfield, G. W. (2002) Global gene expression profiling in Escherichia coli K12: The effects of leucine-responsive regulatory protein. J. Biol. Chem. 277, 40309-40323
11 Hung, S.-P., Hatfield, G. W., Sundaresh, S., and Baldi, P. (2003) Understanding DNA Microarrays: Sources and Magnitudes of Variances in DNA Microarray Data Sets. In: Genomics, Proteomics, and Vaccines, G. Grandi, Editor, John Wiley and Sons
12 Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K.J., Scherf, U., Speed, T.P. (2003) Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data. Biostatistics 4(2), 249-264
13 Irizarry, R.A., Bolstad, B.M., Collin, F., Cope, L.C., Hobbs, B., and Speed, T.P. (2003) Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research 31(4), e15
14 Kerr, M.K., Martin, M., Churchill, G.A. (2000) Analysis of variance for gene expression microarray data. J. Comput. Biol. 7, 819-837
15 Li, C., and Wong, W. H. (2001) Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc. Natl. Acad. Sci. U. S. A. 98, 31-36
16 Long, A. D., Mangalam, H. J., Chan, B. Y., Tolleri, L., Hatfield, G. W., and Baldi, P. (2001) Improved statistical inference from DNA microarray data using analysis of variance and a Bayesian statistical framework. Analysis of global gene expression in Escherichia coli K12. J. Biol. Chem. 276, 19937-19944
17 McAdams, H. H., Arkin, A. (1999) It's a noisy business! Genetic regulation at the nanomolar scale.
Trends in Genetics 15, 65-69
18 Piper, M. D., Daran-Lapujade, P., Bro, C., Regenberg, B., Knudsen, S., Nielsen, J., and Pronk, J. T. (2002) Reproducibility of oligonucleotide microarray transcriptome analyses. An interlaboratory comparison using chemostat cultures of Saccharomyces cerevisiae. J. Biol. Chem. 277, 37001-37008
19 Speed, T. (2002) Always log spot intensities and ratios. Speed Group Microarray Page, http://www.stat.berkeley.edu/users/terry/zarray/html/log.html
20 Tusher, V. G., Tibshirani, R., Chu, R. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. U. S. A. 98, 5116-5121
21 Wu, Z., and Irizarry, R.A. (2004) Stochastic Models Inspired by Hybridization Theory for Short Oligonucleotide Arrays. Proceedings of the 8th International Conference on Computational Molecular Biology (RECOMB 2004). To appear.
Figure 1 Experimental design for nylon filter DNA array experiments ("filter dataset")
Figure 2 Experimental design of the Affymetrix GeneChip experiments ("GeneChip dataset"). The same twelve total RNA preparations used for the 4 pooled RNA sets (RNA1-RNA4) in filter experiments 1-4 were used for the preparation of biotin-labeled RNA targets for hybridization to four Affymetrix GeneChips. The Affymetrix *.cel file generated by data obtained with a confocal laser scanner was used as the raw data source for all subsequent analyses.
Figure 3 Correlation intensity matrix for the filter dataset
Figure 4 (A) Scatter plots of duplicate log-transformed filter measurements (line y=x superimposed) (B) A typical correlation intensity matrix when comparing the effects of different filters and cDNA targets
Figure 5 Correlation intensities among data processed with dChip, MAS 4.0, MAS 5.0, RMA and GCRMA (Note: the intensity range is between 0.7 and 1.0)
Figure 6 Low correlation intensities observed (<0.4) when comparing measurements from filters (Expt1-4) with GeneChips (Note: the intensity range is between 0.0 and 0.4)
Figure 7 Number of false positive genes identified by different data processing methods

Comparison                                          Average Correlation
Duplicate measurements from each filter1                  0.974
Same targets hybridized to different filters2             0.951
Different targets hybridized to the same filters3         0.917
Different targets hybridized to different filters4        0.859
(excluding the effects of the time gap between the sets of 16 measurements and the labeling improvements)

Table 1 The comparison of average correlation values from the correlation intensity matrix shown in Figure 3
1 The average correlation values from the correlation matrix illustrated in Figure 3 of D1vD2, D3vD4, D5vD6, D7vD8, D9vD10, D11vD12, D13vD14, D15vD16, D17vD18, D19vD20, D21vD22, D23vD24, D25vD26, D27vD28, D29vD30, and D31vD32.
2 The average correlation values from the correlation matrix illustrated in Figure 3 of D1vD3, D1vD4, D2vD3, D2vD4, D5vD7, D5vD8, D6vD7, D6vD8, D9vD11, D9vD12, D10vD11, D10vD12, D13vD15, D13vD16, D14vD15, D14vD16, D17vD19, D17vD20, D18vD19, D18vD20, D21vD23, D21vD24, D22vD23, D22vD24, D25vD27, D25vD28, D26vD27, D26vD28, D29vD31, D29vD32, D30vD31 and D30vD32.
3 The average correlation values from the correlation matrix illustrated in Figure 3 of D1vD5, D1vD6, D2vD5, D2vD6, D3vD7, D3vD8, D4vD7, D4vD8, D9vD13, D9vD14, D10vD13, D10vD14, D11vD15, D11vD16, D12vD15, D12vD16, D17vD21, D17vD22, D18vD21, D18vD22, D19vD23, D19vD24, D20vD23, D20vD24, D25vD29, D25vD30, D26vD29, D26vD30, D27vD31, D27vD32, D28vD31 and D28vD32.
4 The average correlation values from the correlation matrix illustrated in Figure 3 of all other cells in quadrants D1-D16 vs D1-D16 and D17-D32 vs D17-D32 except those belonging to the other three categories above. We do not consider cells in the other quadrants because they measure correlations of experiments across the time gap, the effects of which we do not want to include in this comparison.

         D1,2  D3,4  D5,6  D7,8  D9,10 D11,12 D13,14 D15,16 D17,18 D19,20 D21,22 D23,24 D25,26 D27,28 D29,30 D31,32
D1,2        0    21    63    69   309   265   234   211   385   370   395   339   423   418   385   356
D3,4       21     0    58    58   241   224   167   160   337   315   322   296   363   351   321   296
D5,6       63    58     0    27   221   194   148   143   392   366   400   360   430   417   386   350
D7,8       69    58    27     0   158   122   107    83   338   332   359   311   370   371   351   306
D9,10     309   241   221   158     0    13    84    58   493   463   492   463   522   509   472   418
D11,12    265   224   194   122    13     0    66    47   437   415   444   405   473   461   436   382
D13,14    234   167   148   107    84    66     0    22   466   443   472   432   516   495   454   386
D15,16    211   160   143    83    58    47    22     0   418   407   420   397   474   451   427   365
D17,18    385   337   392   338   493   437   466   418     0    23    53    60   102   104   113   111
D19,20    370   315   366   332   463   415   443   407    23     0    50    68   103   101   105   110
D21,22    395   322   400   359   492   444   472   420    53    50     0    15    72    65    66    76
D23,24    339   296   360   311   463   405   432   397    60    68    15     0    77    68    66    77
D25,26    423   363   430   370   522   473   516   474   102   103    72    77     0    24    87   109
D27,28    418   351   417   371   509   461   495   451   104   101    65    68    24     0    88    87
D29,30    385   321   386   351   472   436   454   427   113   105    66    66    87    88     0    62
D31,32    356   296   350   306   418   382   386   365   111   110    76    77   109    87    62     0

Table 2 Matrix of significant genes with p<0.005 found in sets of experiment pairs of log-transformed filter data (out of 1257 total genes)