How noisy and replicable are DNA microarray data?
Suman Sundaresh1,3,*, She-pin Hung2,3,*,
G. Wesley Hatfield2,3, and Pierre Baldi1,3
1. School of Information and Computer Science, University of
California, Irvine CA 92697
2. Department of Microbiology and Molecular Genetics, College of
Medicine, University of California, Irvine CA 92697
3. Institute for Genomics and Bioinformatics, University of California,
Irvine CA 92697
E-mail: suman@uci.edu, shung@uci.edu, gwhatfie@uci.edu,
pfbaldi@uci.edu (corresponding author)
* These authors contributed equally to this work.
Abstract: This paper analyzes variability in highly replicated measurements of
DNA microarray data conducted on nylon filters and Affymetrix GeneChips™
with different cDNA targets, filters and imaging technology. Replicability is
assessed quantitatively using correlation analysis as a global measure and
differential expression analysis and ANOVA at the level of individual genes.
Keywords: DNA microarrays, sources of variation, replication, correlation,
differential expression analysis, ANOVA
Bibliographical notes: Suman Sundaresh is a PhD student in the Computer
Science Department at UC Irvine. She gained her MSc and BSc (Hons) in
Computer Science from the National University of Singapore. Her research
interests are in the areas of data mining, machine learning and biomedical
informatics.
She-pin Hung is a postdoctoral researcher in the Department of Microbiology
and Molecular Genetics, in conjunction with the Institute for Genomics and
Bioinformatics, at UC Irvine. She received her PhD from the University of
California at Irvine in 2002. Her research interests are in the areas of global
gene expression profiling with the use of DNA microarrays and bioinformatics.
G. Wesley (Wes) Hatfield, Ph.D., is a Professor of Microbiology and Molecular
Genetics in the College of Medicine and Associate Director of the Institute for
Genomics and Bioinformatics at the University of California, Irvine. Dr
Hatfield holds a Ph.D. degree from Purdue University, and a B.A. degree from
the University of California at Santa Barbara. His primary areas of scientific
expertise include molecular biology, biochemistry, microbial physiology,
functional genomics, and bioinformatics. His recent academic interests include
the application and development of genomic and bioinformatics methods to
elucidate the effects of chromosome structure and DNA topology on gene
expression. He has received national recognition for his scientific contributions
including the Eli Lilly and Company Research Award bestowed by the
American Society of Microbiology.
Copyright © 200x Inderscience Enterprises Ltd.
Pierre Baldi is a Professor in the School of Information and Computer Science
and the Department of Biological Chemistry and the Director of the Institute
for Genomics and Bioinformatics at the University of California, Irvine. He
received his PhD from the California Institute of Technology in 1986. From
1986 to 1988 he was a postdoctoral fellow at the University of California, San
Diego. From 1988 to 1995 he held faculty and member of the technical staff
positions at the California Institute of Technology and at the Jet Propulsion
Laboratory. He was CEO of a startup company from 1995 to 1999 and joined
UCI in 1999. He is the recipient of a 1993 Lew Allen Award at JPL and a
Laurel Wilkening Faculty Innovation Award at UCI. Dr. Baldi has written over
100 research articles and four books. His research focuses on biological and
chemical informatics, AI, and machine learning.
Introduction
This paper analyzes and quantifies certain aspects of “noise” contained in DNA
microarray data. A DNA microarray experiment comprises several steps such as cDNA
spotting, mRNA extraction, target preparation, hybridization, image scanning and
analyses. These procedures can be further subdivided into dozens of other elementary
steps each of which can introduce some amount of variability and noise.
In addition to the variability introduced by the instruments and the experimenter,
there is biological variability which also has multiple sources ranging from fluctuations in
the environment to the inherently stochastic nature of nano-scale regulatory chemistry
(Barkai and Leibler, 2000; Hasty et al., 2000; McAdams and Arkin, 1999) [4,7,17] —
transcription alone involves dozens of individual molecular interactions. These
compounded forms of “noise” may lead one to doubt whether any reliable signal can be
extracted at all from DNA microarrays. Here, we show that, while certainly noisy, DNA
microarray data do contain reliable information.
In this study, we look at highly replicated (up to 32x) experiments performed by
different experimenters at different times in the same laboratory, using, as a model
organism, wild type Escherichia coli. In addition, we obtain these microarray
measurements using two different formats, nylon filters and Affymetrix GeneChips™.
Given the overwhelming number of variables that can in principle contribute to the
variability, we focus on a particular subset of variables of great relevance to biologists. In
particular, we measure the consistency of the results obtained using the filter technology
across different filters and mRNA preparations. We also compare filters to Affymetrix
GeneChip™ technology and study the effects of five different image-processing methods.
Replicability is assessed quantitatively using correlation analysis and differential
expression analysis. We use correlation as a global measure of similarity between two
sets of measurements. While a correlation close to one is a good sign, it is a global
measure that provides little information at the level of individual genes. Thus, we use
differential expression analysis at the level of individual genes to detect which genes
seem to behave differently in two different sets of measurements. The data sets and
software used in our analysis are available over the Web at
http://www.igb.uci.edu/servers/dmss.html.
Our approach differs from and complements previous related studies (Coombes et al.
2002; Piper et al., 2002) [5,18]. In particular, we use higher levels of replication (32x),
relatively simpler biological samples (E. coli versus S. cerevisiae or human B-cell
Lymphoma cell lines) and more diverse microarray technologies (filters and Affymetrix
Gene Chips). Parts of these other studies also focus on variables that are
outside the scope of the present study, such as exposure time and inter-laboratory
variability.
Methods
Filter Dataset
The first dataset (“filter dataset”) we use consists of 32 sets of measurements from 16
nylon filter DNA microarrays containing duplicate probe sites for each of 4,290 open
reading frames (ORFs) hybridized with 33P-labeled cDNA targets from wild-type
Escherichia coli cells cultured at 37°C under balanced growth conditions in glucose
minimal salts medium. The experimental design and methods for these experiments are
described in detail in Arfin et al. (2000), Baldi and Hatfield (2002) and Hung et al.
(2002) [1,2,10] and illustrated in Figure 1.
Each filter contains duplicate probes (spots) for each of the 4,290 open reading
frames (ORFs) of the E. coli genome. In Experiment 1, Filters 1 and 2 were hybridized
with 33P-labeled, random hexamer generated, cDNA targets complementary to each of
three independently prepared RNA preparations (RNA1) obtained from the cells of three
individual cultures of a wild-type (wt) E. coli strain. These three 33P-labeled cDNA target
preparations were pooled prior to hybridization to the full-length ORF probes on the
filters (Experiment 1). Following phosphorimager analysis, these filters were stripped and
again hybridized with pooled, 33P-labeled cDNA targets complementary to each of
another three independently prepared RNA preparations (RNA2) from the wt strain
(Experiment 2).
This procedure was repeated two more times with another set of filters, Filters 3
and 4, using two more independently prepared pools of cDNA targets (Experiment 3,
RNA3; Experiment 4, RNA4), as described for Experiments 1 and 2. This protocol
results in duplicate filter data for four
experiments performed with cDNA targets complementary to four independently
prepared sets of pooled RNA. Thus, since each filter contains duplicate spots for each
ORF and duplicate filters were used for each experiment, 16 measurements (D1-D16) for
each ORF from four experiments were obtained. These procedures were performed with
another two pairs of filters 5-8 for experiments 5-8 to obtain another 16 measurements
(D17-D32) for each ORF.
The filter dataset is fairly representative of other filter datasets in the sense that it
corresponds to experiments carried out by different people at different times in the same
laboratory. In particular, of the 32 filter measurements, the data from measurements 1-16
were obtained 6 months later than the data from measurements 17-32. During this
intervening period the efficiency of the 33P labeling was improved. Consequently, more
signals marginally above background were detected on the filters for measurements 1-16.
In fact, when we edit out all of the genes that contain one or more measurements at or
below background in at least one experiment we observe the expression of 2,607 genes
for measurements 1-16 and 1,579 genes for measurements 17-32. If we consider the
dataset containing all 32 measurements, we find that only 1,257 genes have all 32
expression measurements above background. The natural-log transformed values
(Speed, 2002) [19] of these 1,257 above-background gene expression values for all 32
measurements were used for subsequent analyses.
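The editing step just described, keeping only genes whose measurements are all above background and natural-log transforming the survivors, can be sketched in pure Python; the gene names, values and background level below are invented for illustration:

```python
import math

def filter_and_log(expression, background):
    """Keep genes whose measurements are all above background,
    then natural-log transform the surviving values."""
    kept = {}
    for gene, values in expression.items():
        if all(v > background for v in values):
            kept[gene] = [math.log(v) for v in values]
    return kept

# Hypothetical signals: each row is one gene, columns are replicate measurements.
data = {
    "thrA": [120.0, 115.0, 130.0, 125.0],  # all above background
    "lacZ": [8.0, 200.0, 190.0, 210.0],    # one value at or below background
}
kept = filter_and_log(data, background=10.0)
# "lacZ" is dropped; "thrA" survives with log-transformed values
```

In the paper the same rule is applied across all 32 filter measurements, which is why the gene count falls from 2,607 (or 1,579) to 1,257.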
GeneChip Dataset
To address another DNA microarray technology, we use a second dataset (“GeneChip
dataset”) that contains data from four Affymetrix GeneChip™ experiments that measure
the expression levels of the same E. coli RNA preparations used for the filter experiments
1-4. The experimental design and methods for these experiments are illustrated in Figure
2 and described in detail by Hung et al. (2002) [10]. The four GeneChip measurements
are each processed by five methods: the MAS 4.0 and MAS 5.0 software of Affymetrix,
the dChip software of Li and Wong (2001) [15], RMA (Irizarry et al., 2003) [12,13] and
GCRMA (Wu and Irizarry, 2004) [21]. The dataset thus consists of 20 replicate
measurements. The 2370 genes whose expression levels are above background for all the
20 measurements are used in the subsequent analyses. We natural-log transformed the
measures processed with MAS 4.0, MAS 5.0 and dChip; RMA and GCRMA differ from most
other expression-measuring methods in that they return expression measures already
in log (base 2).
The gene expression measurements for each experiment of both datasets, filters and
GeneChips, are globally normalized: each expression measurement that is above
background on all sixteen filters or four GeneChips is divided by the sum of all the
gene expression measurements of its filter or GeneChip. Thus, the signal for each
measurement can be expressed as a fraction of the total signal for each filter or
GeneChip or, by implication, as a fraction of total mRNA. This normalization is not
applied to the RMA and GCRMA measurements, which already include a built-in
normalization step.
The datasets obtained from the experiments described above allow us to investigate
the effects of not only the environmental and biological factors, but also the consistency
of measurements taken from two different DNA microarray technologies.
Image Processing Software
The GeneChip dataset contains 20 replicates: each of the four GeneChip
measurements is processed with five image-processing software packages,
MAS 4.0, MAS 5.0, dChip, RMA and GCRMA.
In the Affymetrix MAS 4.0 software, the mean and standard deviation of the PM
(perfect match) - MM (mismatch) differences of a probe set in one array are computed
after excluding the maximum and the minimum values obtained for that probe set. If,
among the remaining probe pairs, a difference deviates by more than 3 SD from the mean,
that probe pair is declared an outlier and not used for the average difference calculation
of both the control and the experimental array. A flaw of this approach is that a probe
with a large response might well be the most informative but may be consistently
discarded. Furthermore, if multiple arrays are compared at the same time, this method
tends to exclude probes inconsistently measured among GeneChips.
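A simplified reading of the MAS 4.0-style outlier rule described above can be sketched as follows; this is an illustrative reconstruction from the text, not Affymetrix code, and the handling of the trimmed extremes is an assumption:

```python
import statistics

def average_difference(pm_mm_diffs, k=3.0):
    """Sketch of the MAS 4.0-style rule: drop the max and min PM-MM
    differences, compute the mean and SD of the rest, then exclude any
    remaining difference more than k*SD from that mean before averaging."""
    trimmed = sorted(pm_mm_diffs)[1:-1]  # exclude the max and the min
    mean = statistics.mean(trimmed)
    sd = statistics.stdev(trimmed)
    # Among the remaining probe pairs, discard >k*SD outliers.
    kept = [d for d in trimmed if abs(d - mean) <= k * sd]
    return statistics.mean(kept)

# Hypothetical PM-MM differences for one probe set; the extreme values
# (100.0 and -50.0) are excluded before the mean and SD are computed.
diffs = [10.0, 12.0, 11.0, 13.0, 100.0, -50.0, 12.0, 11.0]
avg = average_difference(diffs)
```

This makes the flaw noted in the text concrete: a probe with a consistently large response (here 100.0) is always trimmed, even if it is the most informative.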
Li and Wong (2001) [15] developed a statistical model-based analysis method to
detect and handle cross-hybridizing probes, image and/or GeneChip defects, and to
identify outliers across GeneChip sets. A probe set from multiple chips is modeled and
the standard deviation between a fitted curve and the actual curve for each probe set for
each GeneChip is calculated. Probe pair sets containing anomalous probe pair
measurements are declared outliers and discarded. The remaining probe pair sets are
remodeled and the fitted curve data is used for average difference calculations. These
methods are implemented in a software program, dChip, which can be obtained from the
authors.
A different empirical approach to improve the consistency of average difference
measurements has been implemented in the more recent Affymetrix MAS 5.0 software.
In this implementation, if the MM value is less than the PM value, MAS 5.0 uses the MM
value directly. However, if the MM value is larger than the PM value, MAS 5.0 creates
an adjusted MM value based on the average difference between ln PM and ln
MM or, if that measurement is too small, some fraction of PM. The adjusted MM values
are used to calculate ln (PM - adjusted MM) for each probe pair. The signal for a
probe set is calculated as a one-step bi-weight estimate of the combined differences of all
of the probe pairs of the probe set.
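The one-step bi-weight estimate mentioned above can be illustrated with a generic one-step Tukey biweight; the tuning constant and epsilon below are conventional choices, not necessarily those used in MAS 5.0:

```python
import statistics

def one_step_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight location estimate: down-weight values far
    from the median (scaled by the MAD) and return the weighted mean."""
    m = statistics.median(values)
    mad = statistics.median(abs(v - m) for v in values)
    num = den = 0.0
    for v in values:
        u = (v - m) / (c * mad + eps)
        if abs(u) < 1.0:
            w = (1.0 - u * u) ** 2  # biweight: smooth taper to zero
            num += w * v
            den += w
    return num / den if den else m

# Hypothetical ln(PM - adjusted MM) values for one probe set.
signal = one_step_biweight([1.0, 2.0, 3.0, 4.0, 5.0])
```

The attraction of this estimator is robustness: a single aberrant probe pair receives a small (or zero) weight instead of dragging the probe-set signal with it.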
Irizarry et al. (2003) [12,13] demonstrate that the ln(PM - adjusted MM)
technique in MAS 5.0 results in gene expression estimates with elevated variances. The
RMA (robust multi-array analysis) approach applies a global background adjustment and
normalization and robustly fits a log-scale model of expression effect plus probe
effect to the data. This method has been implemented in the Bioconductor affy package
(http://www.bioconductor.org). The CEL files of this GeneChip dataset were
preprocessed with the default settings at the probe level to obtain the expression
value of each gene.
An extension of RMA discussed in Wu and Irizarry (2004) [21], based on molecular
hybridization theory, takes into account the GC content of each probe sequence when
calculating non-specific binding. This method, called GCRMA, is also available as
part of the Bioconductor project in the gcrma package. The CEL files of this GeneChip
dataset were pre-processed with the default settings at the probe level to obtain the
expression value of each gene.
Correlation and Differential Expression Analyses
To measure the consistency between two sets of measurements globally, such as different
filters or different cDNA target preparations, we use Pearson’s correlation coefficient.
We compute and analyze matrices of correlation coefficients between different sets of
measurements.
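Pearson's correlation coefficient and the matrices of pairwise correlations can be computed as follows (a minimal sketch; a real analysis would use an optimized numerical library):

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient between two measurement vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(measurements):
    """All pairwise correlations between replicate measurement vectors,
    as in the 32 x 32 intensity matrix of the filter dataset."""
    k = len(measurements)
    return [[pearson(measurements[i], measurements[j]) for j in range(k)]
            for i in range(k)]

# Toy example: two perfectly correlated replicate vectors.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
r = pearson(x, y)
M = correlation_matrix([x, y])
```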
The correlation coefficient provides a global measure of similarity but little
information about possible fluctuations at the level of individual genes. A high level of
global similarity with a correlation of, for instance, 0.95 can hide significant fluctuations
at the level of individual gene measures. To address the issue of fluctuations at the level
of single genes, we also perform differential analysis to detect false positives. The term
“false” here is used in reference to the experimental set up and not to the underlying
physical reality. In other words, differences in expression across two different
measurements may well be real, resulting for instance from random fluctuations, but
they are false positives in the sense that ideally they should not have occurred, since
all conditions are supposed to be the “same”.
Several methods for differential analysis have been developed in the literature, such
as fold analysis, t-test, regularized t-test (Baldi and Hatfield 2002; Baldi and Long 2001)
[2,3], SAM (Tusher et al., 2001) [20], applied to raw data or to data transformed in
different ways (Durbin et al. 2002; Huber et al. 2002) [6,9]. The primary goal here is to
get a “ballpark” sense of the false positive rates between different sets of measurements
under a typical and widespread analysis protocol. Thus, for illustration purposes, we use
the t-test applied to the log-transformed data with a detection threshold corresponding to
p-values of 0.005 or less. To isolate the effects of the filters, targets and combination of
factors, we obtained the number of genes significantly differentially expressed (as false
positives) by comparing duplicate measurements (same cDNA targets or same filters)
with other pairs of duplicate measurements. In this case, we assume a normal distribution
for the expression levels of the duplicate measurements of each gene. We also perform a
two-way factorial ANOVA to estimate the percentage contributions of cDNA targets and
filters to the total variance.
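As a sketch of this false-positive counting procedure, the following computes a two-sample t statistic per gene and flags genes above a critical value; Python's standard library has no t distribution, so a fixed |t| threshold stands in for the p < 0.005 cut-off used in the text:

```python
import math
import statistics

def t_statistic(a, b):
    """Welch-style two-sample t statistic for one gene's duplicate groups,
    assuming (as in the text) normally distributed expression levels."""
    na, nb = len(a), len(b)
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

def count_false_positives(group_a, group_b, t_crit):
    """Count genes whose duplicate measurements differ with |t| above a
    critical value, even though conditions are nominally the 'same'."""
    return sum(1 for a, b in zip(group_a, group_b)
               if abs(t_statistic(a, b)) > t_crit)

# Toy data: each inner list holds one gene's duplicate measurements.
group_a = [[1.0, 1.1, 0.9], [5.0, 5.1, 4.9]]
group_b = [[1.0, 0.9, 1.1], [9.0, 9.1, 8.9]]
n_fp = count_false_positives(group_a, group_b, t_crit=5.0)
# Only the second gene differs between the two groups
```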
Results
Correlations within Filter Data
A 32 x 32 correlation matrix of the duplicate measurements of all above-background
target signals present on the 16 filters described in Figure 1 is shown in Figure 3.
The correlations are plotted as an intensity matrix, where
the darker cells correspond to numbers closer to 1 indicating a stronger correlation. The
reference chart for the intensities is shown on the left. These results clearly demonstrate
strong correlations among the first 16 measurements of the filter experiments (D1-D16
vs. D1-D16) as well as strong correlations among the measurements of the experiments
performed 6 months earlier (D17-D32 vs. D17-D32). However, low correlation is
observed among measurements of experiments performed at different times (D1-D16 vs.
D17-D32). These results demonstrate that significant variance can be introduced into a
DNA microarray experiment when experimental parameters such as personnel, reagents,
protocols, and experimental methods vary. For example, we know that during this time
frame, the 33P labeling was improved.
We also notice that the measurements obtained using RNA3, D9-D12, do not
correlate as well with the other 12 measurements taken during the same time frame. The
reason is unknown and may have to do with that particular RNA preparation or some
day-to-day variation in the experimental procedure, or a combination of both.
Since two filters are hybridized with the same cDNA targets, and two cDNA target
preparations are hybridized to the same filter, for each of the eight cDNA target
preparations we are able to examine the correlations both between filters and
between cDNA target preparations (Figure 1). Figure 4 shows typical scatter plots
(A) and an intensity image (B) observed when analyzing groups of 8 measurements
(e.g. D1-D8, D25-D32) corresponding to a quadrant in Figure 1. Duplicate
measurements (e.g. D1-D2, D3-D4) are averaged and compared
with other duplicate measurements. The intensity image shows that there is a higher
correlation when the same cDNA targets are hybridized to different filters (D1-D2 with
D3-D4) as opposed to when different cDNA targets are hybridized to the same filters
(D1-D2 with D5-D6).
We have summarized the observations from Figure 3 in Table 1. When duplicate
measurements of each filter are compared, a high average correlation of 0.97 is observed.
A reasonably high correlation is also observed when we compare the measurements
among different filters hybridized with the same targets (0.95). However, less correlation
is observed when different targets are hybridized to the same filters (0.92). This
demonstrates greater variance among target preparations (biological variance) than
among filters (experimental variance). Thus, it stands to reason that the variance is even
greater when different target preparations are hybridized to different filters. It should be
noted that it has been demonstrated that the variability among target preparations can be
significantly reduced by pooling independently prepared target samples prior to
hybridization (Arfin et al., 2000; Baldi and Hatfield, 2002; Hung et al., 2002) [1,2,10].
In addition to confirming earlier suggestions that the experimental and biological
variables of a DNA microarray experiment contribute more variance than differences
among the microarrays themselves (Arfin et al., 2000) [1], these data demonstrate both
the subtle differences among replicated gene measurements obtained from DNA
microarray experiments as well as more dramatic differences that are observed when
basic changes in experimental protocols are adopted.
Correlations within GeneChip Data
When considering different array formats that require different target preparation
methods and a fundamentally different probe design, the sources and magnitudes of the
experimental errors are different. To illustrate this, we examine the correlations among
data sets of another DNA microarray format. We use data sets obtained with Affymetrix
GeneChips, which are manufactured by the in situ synthesis of short single-stranded
oligonucleotide probes, complementary to sequences within each ORF, directly
on a glass surface. Nylon filter arrays are manufactured by the attachment of
full-length, double-stranded, DNA probes of each E. coli ORF directly onto the filter as
described earlier.
The intensity matrix in Figure 5 shows how the log-transformed GeneChip expression
data are correlated when processed with either Affymetrix MAS 4.0, MAS 5.0, dChip
software, RMA or GCRMA. Overall, the correlations among the different measurements
of the GeneChip experiments are high (>0.7) and comparable to those observed in the
first 16 measurements in the filter dataset.
Looking from the bottom left, the first four rows and columns in Figure 5 compare the
consistency of measurements obtained from four GeneChips processed with the MAS 4.0
software, rows and columns 5-8 compare the consistency of measurements obtained from
four GeneChips processed with the dChip software, rows and columns 9-12 compare the
consistency of measurements obtained from four GeneChips processed with the MAS 5.0
software, and so on. Average correlations are
calculated for each of the 15 blocks (see Figure 5) comparing pairs of software.
These correlations reveal that the most consistent set of measurements is obtained
with GCRMA (average correlation = 0.97), with RMA a close second (average
correlation = 0.96). MAS 5.0 performs marginally better (average correlation =
0.83) than the previous MAS 4.0 version (average correlation = 0.81), and dChip is
better correlated within itself than both of the MAS versions (average correlation = 0.85). It is
also apparent that close correlations are observed when MAS 4.0 and MAS 5.0 processed
data are compared to one another (average correlation = 0.82), and that this correlation is
better than between dChip and MAS 4.0 or MAS 5.0 processed data (average correlation
0.78). Another interesting observation is that RMA and GCRMA are better correlated
with the MAS versions than with dChip, as shown by the averaged correlation
numbers in the boxes.
Correlation of Filter with GeneChip Data
In order to be able to compare GeneChip and filter data, the exact same four pooled
total RNA preparations used for the nylon filter experiments 1-4 (Figure 1) are used
for hybridization to four E. coli Affymetrix GeneChips (Figure 2), as
described by Hung et al. (2002). In this case, however, instead of having four
measurements for each gene expression level, as for each filter experiment, only one
measurement was obtained from each GeneChip. On the other hand, this single
measurement is the average of the difference between hybridization signals from
approximately 15 perfect match (PM) and mismatch (MM) probe pairs for each ORF.
While these are not equivalent to duplicate measurements because different probes
are used, these data can increase the reliability of each gene expression level
measurement (Baldi and Hatfield, 2002) [2]. Nevertheless, large differences in the
average difference of individual probe pairs are often observed.
In order to compare the different technologies, we first averaged the filter
measurements obtained from the same cDNA target, resulting in one expression profile
per cDNA target. We then compared the four averaged filter measurements to the
measurements obtained from the four GeneChips (each corresponding to one cDNA
target).
The data in Figure 6 indicate weak correlations (< 0.4) between the filter
measurements and Affymetrix GeneChip data no matter which image processing
software was used. We also notice that the measurements obtained from experiment 3
(D9-D12), which were observed earlier to be less consistent with the other measurements,
are strikingly un-correlated with the GeneChip measurements.
The low correlation between the filter and GeneChip measurements is possibly due
to probe-specific effects arising from differences in hybridization efficiency
among probes. Individual outliers can have a large effect on the average difference for
probe pair sets of individual GeneChips. In fact, it has been reported that this variance
can be five times greater than the variance observed among GeneChips (Li and Wong,
2001) [15].
These probe effects are smaller for filters containing full-length ORF probes
hybridized to targets generated with random hexamers than for Affymetrix GeneChips,
which query only a limited number of target sequences. As a result, it is expected that the signal
intensities obtained from Affymetrix GeneChips are less correlated to in vivo transcript
levels than signal intensities obtained from filters, thus providing a rationale for why
signal intensities obtained from different microarray platforms may not correlate well
with one another.
Differential Analysis within Filter Data
We performed a statistical t-test between 16 duplicate pairs of measurements to study
the magnitudes of the variances attributable to cDNA targets and/or filters. The number of genes
which are found to be significantly different (p < 0.005) in each of these comparisons is
shown in Table 2.
Light grey cells identify experiments that compare different filters hybridized with
the same cDNA targets. Dark grey cells identify experiments that compare the same
filters hybridized with different cDNA targets. The values represent the number of false
positive measurements observed when duplicate pairs of measurements from different
cDNA target preparations or filters are compared to one another.
The results of Table 2 demonstrate that the average percentage of “false positive”
genes obtained when the same targets are hybridized to different filters is about 2%,
whereas when different targets are hybridized to the same filters, the number
increases to about 6%. When different targets are hybridized to different filters, the
average percentage of false positives rises to 10%. This is without taking into
account the effects of the large time
gap between the sets of 16 measurements. In other words, even with sample pooling,
variances contributed by different cDNA target preparations are, on average, about 3-4
times higher than variances contributed by different filters. Similar relative effects are
observed using more sophisticated differential analysis methods that compensate for the
relationship between levels of gene expression and their variances. These methods
include the approach of Huber et al. (2002) [9] that uses a global arcsinh transformation
to eliminate variance fluctuations as well as the local approach of Baldi and Long (2001)
[3] and Long et al., (2001) [16] that uses a regularized t-test in which the variance of each
gene is estimated by taking into account the variance of genes with similar expression
levels.
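The regularized variance idea of Baldi and Long (2001) can be sketched with their shrinkage formula, in which a gene's empirical variance is combined with a background variance estimated from genes of similar expression level; the pseudo-count weight and the choice of "neighboring" genes below are illustrative assumptions:

```python
import statistics

def regularized_variance(gene_values, neighbor_variances, v0_weight=10):
    """Shrink a gene's empirical variance toward the average variance of
    genes with similar expression levels, in the spirit of the regularized
    t-test of Baldi and Long (2001). v0_weight plays the role of a
    pseudo-count controlling the strength of the prior."""
    n = len(gene_values)
    emp = statistics.variance(gene_values)        # empirical variance, n - 1 d.f.
    v0 = statistics.mean(neighbor_variances)      # background variance estimate
    return (v0_weight * v0 + (n - 1) * emp) / (v0_weight + n - 2)

# Toy example: 3 replicates for one gene, neighbors with unit variance.
v = regularized_variance([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```

With few replicates the estimate is dominated by the neighborhood variance, which is what makes the regularized t-test more stable than the plain t-test at low replication.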
As noted earlier, the correlation dropped significantly between measurements D1-D16
and D17-D32, which were obtained six months apart. This can be attributed to several
experimental changes, including reagents, personnel and labeling. A t-test between
the two sets of 16 measurements shows a significant (p < 0.005) difference in the
mean values of about 75% of the 1,257 genes.
To estimate the contribution of the two factors, cDNA targets and filters, to the
total variance, we performed an ANOVA (Coombes et al., 2002; Kerr et al., 2000)
[5,14] for each of the 4 sets of 8 measurements involving 2 filters and 2 cDNA
targets. Each quadrant in Figure 1 (e.g. D1-D8) corresponds to one set of 8
measurements.
For each gene in each quadrant, we performed a two-way factorial ANOVA to obtain the
sum of squares between groups (SSB) for the cDNA target groups and the filter groups
respectively. We summarized the SSB of cDNA targets and filters as percentage
contributions to the total variance, averaged over all genes. On average, 41.3% of
the total variance came from differences in cDNA and 20% from filters. The interaction
(cDNA x filter) contributed 13.5% to the total variance. These estimates further validate
the effect of the differences in biological factors on the total variance in the filter dataset.
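The sum-of-squares decomposition underlying this two-way factorial ANOVA can be sketched for one gene as follows (a balanced 2 x 2 design with replicate spots per cell; the data are invented for illustration):

```python
def two_way_anova_fractions(cells):
    """Sum-of-squares decomposition for one gene in a balanced two-way
    design (cDNA target x filter) with replicate spots per cell.
    cells[i][j] is the list of replicate measurements for target i on
    filter j. Returns each factor's share of the total sum of squares."""
    grand = [v for row in cells for cell in row for v in cell]
    gm = sum(grand) / len(grand)
    ss_total = sum((v - gm) ** 2 for v in grand)
    r, c, reps = len(cells), len(cells[0]), len(cells[0][0])
    # Marginal means for each factor level.
    row_means = [sum(v for cell in row for v in cell) / (c * reps) for row in cells]
    col_means = [sum(v for row in cells for v in row[j]) / (r * reps)
                 for j in range(c)]
    ss_target = c * reps * sum((m - gm) ** 2 for m in row_means)
    ss_filter = r * reps * sum((m - gm) ** 2 for m in col_means)
    # Interaction = between-cell variation not explained by the main effects.
    cell_means = [[sum(cell) / reps for cell in row] for row in cells]
    ss_cells = reps * sum((cell_means[i][j] - gm) ** 2
                          for i in range(r) for j in range(c))
    ss_inter = ss_cells - ss_target - ss_filter
    return {"target": ss_target / ss_total,
            "filter": ss_filter / ss_total,
            "interaction": ss_inter / ss_total}

# Toy quadrant: 2 targets x 2 filters, 2 duplicate spots per cell, where
# all of the variation comes from the cDNA target factor.
cells = [[[1.0, 1.0], [1.0, 1.0]],
         [[3.0, 3.0], [3.0, 3.0]]]
shares = two_way_anova_fractions(cells)
```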
Differential Analysis within GeneChip and with Filter Data
To address the effects of different image processing software packages, we apply
differential analysis to the GeneChip measurements. The number of false positive genes
with p-values less than 0.005 based on a standard t-test is shown in Figure 7. Since
the RMA and GCRMA functions transform the data differently from MAS 4.0, MAS 5.0 and
dChip, it is not meaningful to perform t-tests comparing RMA and GCRMA to the other
three software packages: the means of all 1,794 genes would appear significantly
different. We can, however, compare the MAS and dChip software, and those
comparisons are presented below.
While the replicates obtained using the similar MAS 4.0 and MAS 5.0 software
packages exhibit very low levels of falsely identified significant genes, differences
of up to 147 genes (5-6%) are observed when comparing dChip to the Affymetrix software.
This does not imply that one software package is better than the others but rather that the
users should be cognizant of these differences that are generated by different analysis
methods.
Figure 7 also shows the large number of false positive genes detected (38%) when the
filter dataset is compared with the GeneChip dataset (regardless of the software used)
using the standard t-test with p < 0.005, supporting the earlier observation from the
correlation analysis that these two datasets may not be combined.
Discussion
Many experimental designs and applications of DNA array experiments are possible.
However, whatever the purpose of a gene expression profiling experiment, a
sufficient number of experiments must be performed for statistical analysis of the data,
either through multiple measurements of homogeneous samples (replication) or multiple
sample measurements (e.g. across time or subjects). This is because each gene
expression profiling experiment results in the simultaneous measurement of the
expression levels of thousands of genes. In such a high-dimensional experiment, many
genes will show large changes in expression levels between two experimental conditions
by chance alone. In the same manner, many truly differentially expressed genes
will show small changes. These false positive and false negative observations arise from
chance occurrences exacerbated by biological variance as well as experimental and
measurement errors. Thus, if we compare the gene expression patterns of cells simply
grown under two different treatment conditions or between two genotypes, experimental
replication is required for the assignment of statistical significance to each differential
gene measurement. Such replications quickly become labor intensive and prohibitively
expensive. This leads to the question: how many replicates are required for the data to
be considered reliable for further analyses?
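A rough numerical feel for this question can be obtained from a standard power calculation. The sketch below uses the usual normal approximation for a two-sample comparison; the effect size, significance level and power are illustrative assumptions, not values derived from this study:

```python
from scipy.stats import norm

def replicates_needed(effect_sd_units, alpha=0.005, power=0.8):
    """Approximate replicates per condition for a two-sample comparison.

    effect_sd_units: smallest difference worth detecting, expressed in
    units of the per-gene standard deviation (an illustrative choice).
    Uses the normal approximation n = 2 * ((z_a + z_b) / d)^2.
    """
    z_a = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
    z_b = norm.ppf(power)           # desired power
    return 2 * ((z_a + z_b) / effect_sd_units) ** 2

# Detecting a change of two standard deviations at p < 0.005 with 80% power
print(replicates_needed(2.0))
```

For these illustrative settings the calculation asks for roughly seven replicates per condition, which makes concrete why the per-gene standard deviation, and how well it can be estimated, is the quantity that drives the cost of an experiment.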
The short answer is: enough to provide a robust estimate of the standard deviation of
the mean of each gene measurement. Given the prohibitive costs of generating replicates,
this problem reduces to finding a method that produces more robust estimates of the
standard deviation of individual gene measurements from few replications.
There are a few methods that address this problem. Techniques by Durbin et al. (2002)
[6] and Huber et al. (2002) [9] apply a transformation to the entire dataset so as to
render the variance a constant that is independent of the mean. Another method (Baldi
and Long 2001; Long et al., 2001) [3,16] has shown that the confidence in the
interpretation of DNA microarray data with a low number of replicates can be improved
by using a Bayesian statistical approach that incorporates information of within treatment
measurements. This method is based on the observation that genes of similar expression
levels exhibit similar variance and hence more robust estimates of the variance of a gene
can be derived by pooling neighboring genes with comparable expression levels (Arfin et
al., 2000; Baldi and Hatfield, 2002; Hatfield et al., 2003; Hung et al., 2002; Long et al.,
2001) [1,2,8,10,16].
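The neighborhood-pooling idea can be sketched as follows. This is a simplified illustration only, not the actual implementation of the cited Bayesian method, which blends a prior variance with each gene's own sample variance rather than taking a plain running mean:

```python
import numpy as np

def pooled_std(log_exprs, window=101):
    """Estimate each gene's standard deviation by averaging the sample
    variances of the `window` genes closest in mean expression level.
    log_exprs: (n_genes, n_replicates) array of log-transformed values."""
    means = log_exprs.mean(axis=1)
    raw_var = log_exprs.var(axis=1, ddof=1)
    order = np.argsort(means)                 # rank genes by expression level
    var_by_rank = raw_var[order]
    half = window // 2
    padded = np.pad(var_by_rank, half, mode='edge')
    kernel = np.ones(window) / window
    smoothed = np.convolve(padded, kernel, mode='valid')  # running mean
    pooled = np.empty_like(smoothed)
    pooled[order] = smoothed                  # map back to original gene order
    return np.sqrt(pooled)

# With few replicates, pooled estimates vary far less from gene to gene
# than the raw per-gene standard deviations.
rng = np.random.default_rng(2)
data = rng.normal(8.0, 1.0, size=(2000, 4))  # true SD = 1 for every gene
raw_sd = data.std(axis=1, ddof=1)
print(raw_sd.std(), pooled_std(data).std())
```

The payoff is visible in the printed numbers: with only four replicates the raw per-gene standard deviations scatter widely around the true value, while the pooled estimates are far more stable.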
However, it would be advantageous if the factors introducing variance into the data
could be identified upfront and adjusted in advance to minimize noise. This brings us
to the next question: what introduces noise in DNA microarray data?
The results presented here demonstrate that the variability inherent in highly replicated (up to 32x) DNA microarray data can result from a number of disparate factors
operating at different times and levels in the course of a typical experiment. These
numerous factors are often interrelated in complex ways, but for the purpose of simplicity
we have broken them down into two major categories: biological variability and
experimental variability. Other sources of variability involve DNA microarray fabrication
methods as well as differences in imaging technology, signal extraction, and data
processing.
This study confirms earlier assertions that, even with carefully controlled experiments
with isogenic model organisms, the major sources of variance come from uncontrolled
biological factors (Hatfield et al., 2003) [8]. Our ability to control biological variation in
a model organism such as E. coli with an easily manipulated genetic system is an obvious
advantage for gene expression profiling experiments. However, most systems are not as
easily controlled. For example, human samples obtained from biopsy materials not only
differ in genotype but also in cell types. Thus, care should be taken to reduce this source
of biological variability as much as possible, for example, with the use of laser-capture
techniques for the isolation of single cells from animal and human tissues. A related
study conducted on human cells (Coombes et al., 2002) [5] found that the differences
between two target preparations contributed relatively little to the variation
compared with membrane reuse and exposure time to phosphorimager screens (the latter
was outside the scope of our study). In our experiments, each of the eight sets of pooled
E. coli RNA was extracted on a different day, which may account for the greater variation
seen in the correlation and differential analyses relative to experimental factors.
With regard to the second dataset, obtained from Affymetrix GeneChip experiments,
the poor correlation between signal intensities from the filter and
Affymetrix GeneChip experiments can be attributed in part to probe effects. Nevertheless,
taking signal ratios from the same probe on two different arrays can ameliorate these probe
effects. Thus, the overall differential expression profiles obtained from different
microarray platforms should be comparable. In support of this conclusion, Hung et al.
(2002) [10] have demonstrated that, with appropriate statistical analysis, similar results
can be obtained when the same experiments are performed with pre-synthesized filters
containing full-length ORF probes and Affymetrix GeneChips.
An additional source of biological variation, even when comparing the gene profiles
of isogenic cell types, comes from the conditions under which the cells are cultured. In
this regard, it has been recommended that standard cell-specific media should be adopted
for the growth of cells queried by DNA array experiments (Baldi and Hatfield, 2002) [2].
While this is not possible in every case, many experimental conditions for the comparison
of two different genotypes of common cell lines can be standardized. The adoption of
such medium standards would reduce experimental variations and facilitate the cross-comparison of experimental data obtained from different experiments, different
microarray formats, and/or different investigators. However, even with these
precautions, non-trivial and sometimes substantial variance in gene expression levels,
such as that revealed in this study, is observed even between genetically identical cells
cultured in the same environment. This variance can result from a variety of
influences, including environmental differences, phase differences among the cells in a
culture, periods of rapid change in gene expression, and multiple additional stochastic
effects. To emphasize the importance of the microenvironments encountered during cell
growth, Piper et al. (2002) [18] have recently demonstrated that variance among
replicated gene measurements is dramatically decreased when isogenic yeast cells are
grown in chemostats rather than batch cultures.
Biological variance can be further exacerbated by experimental errors, for
example, if extreme care is not taken in the treatment and handling of the RNA during its
extraction from the cell and its subsequent processing. It is often reported
that the cells to be analyzed are harvested by centrifugation and frozen for RNA
extraction at a later time. It is important to consider the effects of these experimental
manipulations on gene expression and mRNA stability. If the cells encounter a
temperature shift during the centrifugation step, even for a short time, this could cause a
change in the gene expression profiles due to the consequences of temperature stress. If
the cells are centrifuged in a buffer with even small differences in osmolarity from the
growth medium, this could cause a change in the gene expression profiles due to the
consequences of osmotic stress. Also, removal of essential nutrients during the
centrifugation period could cause significant metabolic perturbations that would result in
changes in gene expression profiles. Each of these and other experimentally caused gene
expression changes will confound the interpretation of the experiment. These are not easy
variables to control. Therefore, the best strategy is to harvest the RNA as quickly as
possible under conditions that "freeze" it at the levels present in the cell
population at the time of sampling. Several methods are available that address this issue
(Baldi and Hatfield, 2002) [2]. There are numerous other sources of experimental
variability such as: differences among protocols; different techniques employed by
different personnel; differences between reagents; and, differences among instruments
and their calibrations, as well as others. While these sources of variance are usually less
than those that come from biological sources, they can dominate the results of a DNA
microarray experiment. This is illustrated by the poor correlation between the two
replicated data sets reported here, one obtained six months after the other. Although there
is good correlation among the replicated measurements of each set, there is much less
correlation among the measurements between these sets. In this case, the major difference
can be attributed to improvements in the cDNA target labeling protocol.
It is reassuring to observe that carefully executed and replicated DNA microarray
experiments produce data with high global correlations (in the 0.9 range). This high
correlation, however, should not be interpreted as a sign that replication is not necessary.
Replication as well as proper statistical analysis remain important in order to monitor
experimental variability and because the variability of individual genes can be high. It is
also reassuring to know that while correlations of expression measurements across
technologies remain low, overall differential expression profiles obtained from different
microarray platforms can be compared (Hung et al., 2002) [10].
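The global correlation analysis referred to throughout amounts to computing pairwise Pearson correlations between the log-transformed replicate measurements. A minimal sketch on simulated data follows; the gene count matches the filter dataset, but the expression profile and noise level are assumptions chosen so the correlations land in the observed 0.9 range:

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes = 1257  # filter dataset size from the text

# Simulated log intensities: a shared gene-expression profile plus
# independent per-replicate measurement noise.
profile = rng.normal(8.0, 1.5, size=n_genes)
reps = profile[:, None] + rng.normal(0.0, 0.5, size=(n_genes, 4))

# Pairwise Pearson correlations between replicate hybridizations (columns)
corr = np.corrcoef(reps, rowvar=False)
print(np.round(corr, 2))
```

Note that the high off-diagonal correlations here come entirely from the shared profile; as the text cautions, they say little about the reliability of any individual gene, which is why per-gene statistics are still required.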
Finally, the comprehensive and diverse datasets for wild type E. coli under standard
growth conditions that have been compiled in the present study and are available via the
Web (http://www.igb.uci.edu/servers/dmss.html) may serve as a useful set of reference
data for DNA microarray researchers and bioinformaticians interested in further
developing the technology.
Acknowledgments
This work was supported in part by the UCI Institute of Genomics and
Bioinformatics, by grants from the NIH (GM-055073 and GM068903) to GWH, by a
Laurel Wilkening Faculty Innovation Award to PB, and by a Sun Microsystems Award to
PB. SH was supported by a postdoctoral training grant fellowship from the University of
California Biotechnology Research and Education Program. We are grateful to
Cambridge University Press for permission to reproduce materials from a book by PB
and GWH titled “DNA Microarrays and Gene Expression: From Experiments to Data
Analysis and Modeling” ISBN: 0521800226 [11].
References
1 Arfin, S.M., Long, A.D., Ito, E.T., Tolleri, L., Riehle, M.M., Paegle, E.S., and Hatfield, G.W. (2000) Global gene expression profiling in Escherichia coli K12: the effects of integration host factor. J. Biol. Chem. 275, 29672-29684
2 Baldi, P., and Hatfield, G.W. (2002) DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modeling, Cambridge University Press, Cambridge, UK
3 Baldi, P., and Long, A.D. (2001) A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 17, 509-519
4 Barkai, N., and Leibler, S. (2000) Biological rhythms: circadian clocks limited by noise. Nature 403, 267-268
5 Coombes, K.R., Highsmith, W.E., Krogmann, T.A., Baggerly, K.A., Stivers, D.N., and Abruzzo, L.V. (2002) Identifying and quantifying sources of variation in microarray data using high-density cDNA membrane arrays. J. Comput. Biol. 9, 655-669
6 Durbin, B., Hardin, J., Hawkins, D., and Rocke, D.M. (2002) A variance-stabilizing transformation for gene expression microarray data. Bioinformatics 18, S105-S110 (ISMB 2002)
7 Hasty, J., Pradines, J., Dolnik, M., and Collins, J.J. (2000) Noise-based switches and amplifiers for gene expression. Proc. Natl. Acad. Sci. USA 97, 2075-2080
8 Hatfield, G.W., Hung, S.-P., and Baldi, P. (2003) Differential analysis of DNA microarray gene expression data. Mol. Microbiol. 47, 871-877
9 Huber, W., von Heydebreck, A., Sültmann, H., Poustka, A., and Vingron, M. (2002) Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 18 suppl. 1, S96-S104 (ISMB 2002)
10 Hung, S.-P., Baldi, P., and Hatfield, G.W. (2002) Global gene expression profiling in Escherichia coli K12: the effects of leucine-responsive regulatory protein. J. Biol. Chem. 277, 40309-40323
11 Hung, S.-P., Hatfield, G.W., Sundaresh, S., and Baldi, P. (2003) Understanding DNA microarrays: sources and magnitudes of variances in DNA microarray data sets. In Genomics, Proteomics, and Vaccines (G. Grandi, ed.), John Wiley and Sons
12 Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K.J., Scherf, U., and Speed, T.P. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4(2), 249-264
13 Irizarry, R.A., Bolstad, B.M., Collin, F., Cope, L.M., Hobbs, B., and Speed, T.P. (2003) Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research 31(4), e15
14 Kerr, M.K., Martin, M., and Churchill, G.A. (2000) Analysis of variance for gene expression microarray data. J. Comput. Biol. 7, 819-837
15 Li, C., and Wong, W.H. (2001) Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc. Natl. Acad. Sci. USA 98, 31-36
16 Long, A.D., Mangalam, H.J., Chan, B.Y., Tolleri, L., Hatfield, G.W., and Baldi, P. (2001) Improved statistical inference from DNA microarray data using analysis of variance and a Bayesian statistical framework: analysis of global gene expression in Escherichia coli K12. J. Biol. Chem. 276, 19937-19944
17 McAdams, H.H., and Arkin, A. (1999) It's a noisy business! Genetic regulation at the nanomolar scale. Trends in Genetics 15, 65-69
18 Piper, M.D., Daran-Lapujade, P., Bro, C., Regenberg, B., Knudsen, S., Nielsen, J., and Pronk, J.T. (2002) Reproducibility of oligonucleotide microarray transcriptome analyses: an interlaboratory comparison using chemostat cultures of Saccharomyces cerevisiae. J. Biol. Chem. 277, 37001-37008
19 Speed, T. (2002) Always log spot intensities and ratios. Speed Group Microarray Page, http://www.stat.berkeley.edu/users/terry/zarray/html/log.html
20 Tusher, V.G., Tibshirani, R., and Chu, G. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. USA 98, 5116-5121
21 Wu, Z., and Irizarry, R.A. (2004) Stochastic models inspired by hybridization theory for short oligonucleotide arrays. Proceedings of the 8th International Conference on Computational Molecular Biology (RECOMB 2004). To appear
Figure 1 Experimental design for nylon filter DNA array experiments (“filter dataset”)
Figure 2 Experimental design of the Affymetrix GeneChip experiments ("GeneChip dataset"). The
same twelve total RNA preparations used for the 4 pooled RNA sets (RNA1-RNA4) in
filter experiments 1-4 were used for the preparation of biotin-labeled RNA targets for
hybridization to four Affymetrix GeneChips. The Affymetrix *.cel file generated from
the data obtained with a confocal laser scanner was used as the raw data source for all
subsequent analyses.
Figure 3 Correlation intensity matrix for the filter dataset
(A)
(B)
Figure 4 (A) Scatter plots of duplicate log-transformed filter measurements (Line y=x
superimposed) (B) A typical correlation intensity matrix when comparing the effects of
different filters and cDNA targets
Figure 5 Correlation intensities among data processed with dChip, MAS 4.0, MAS 5.0, RMA and
GCRMA (Note: The intensity range is between 0.7 and 1.0)
Figure 6 Low correlation intensities observed (<0.4) when comparing measurements from filters
(Expt1-4) with GeneChips (Note: The intensity range is between 0.0 and 0.4)
Figure 7 Number of false positive genes identified by different data processing methods
Comparison                                                  Average Correlation
Duplicate measurements from each filter1                    0.974
Same targets hybridized to different filters2               0.951
Different targets hybridized to the same filters3           0.917
Different targets hybridized to different filters4          0.859
(excluding the effects of the time gap between the
sets of 16 measurements and the labeling improvements)

Table 1 The comparison of average correlation values from the correlation intensity matrix shown
in Figure 3
1 The average correlation values from the correlation matrix illustrated in Figure 3 of
D1vD2, D3vD4, D5vD6, D7vD8, D9vD10, D11vD12, D13vD14, D15vD16, D17vD18,
D19vD20, D21vD22, D23vD24, D25vD26, D27vD28, D29vD30, and D31vD32.
2 The average correlation values from the correlation matrix illustrated in Figure 3 of
D1vD3, D1vD4, D2vD3, D2vD4, D5vD7, D5vD8, D6vD7, D6vD8, D9vD11, D9vD12,
D10vD11, D10vD12, D13vD15, D13vD16, D14vD15, D14vD16, D17vD19, D17vD20,
D18vD19, D18vD20, D21vD23, D21vD24, D22vD23, D22vD24, D25vD27, D25vD28,
D26vD27, D26vD28, D29vD31, D29vD32, D30vD31 and D30vD32.
3 The average correlation values from the correlation matrix illustrated in Figure 3 of
D1vD5, D1vD6, D2vD5, D2vD6, D3vD7, D3vD8, D4vD7, D4vD8, D9vD13, D9vD14,
D10vD13, D10vD14, D11vD15, D11vD16, D12vD15, D12vD16, D17vD21, D17vD22,
D18vD21, D18vD22, D19vD23, D19vD24, D20vD23, D20vD24, D25vD29, D25vD30,
D26vD29, D26vD30, D27vD31, D27vD32, D28vD31 and D28vD32.
4 The average correlation values from the correlation matrix illustrated in Figure 3 of
all other cells in quadrants D1-D16 vs D1-D16 and D17-D32 vs D17-D32, except those
belonging to the other three categories above. We do not consider cells in the other
quadrants because they measure correlations of experiments across the time gap, the
effects of which we do not want to include in this comparison.
        D1,2  D3,4  D5,6  D7,8  D9,10 D11,12 D13,14 D15,16 D17,18 D19,20 D21,22 D23,24 D25,26 D27,28 D29,30 D31,32
D1,2      0    21    63    69   309   265    234    211    385    370    395    339    423    418    385    356
D3,4     21     0    58    58   241   224    167    160    337    315    322    296    363    351    321    296
D5,6     63    58     0    27   221   194    148    143    392    366    400    360    430    417    386    350
D7,8     69    58    27     0   158   122    107     83    338    332    359    311    370    371    351    306
D9,10   309   241   221   158     0    13     84     58    493    463    492    463    522    509    472    418
D11,12  265   224   194   122    13     0     66     47    437    415    444    405    473    461    436    382
D13,14  234   167   148   107    84    66      0     22    466    443    472    432    516    495    454    386
D15,16  211   160   143    83    58    47     22      0    418    407    420    397    474    451    427    365
D17,18  385   337   392   338   493   437    466    418      0     23     53     60    102    104    113    111
D19,20  370   315   366   332   463   415    443    407     23      0     50     68    103    101    105    110
D21,22  395   322   400   359   492   444    472    420     53     50      0     15     72     65     66     76
D23,24  339   296   360   311   463   405    432    397     60     68     15      0     77     68     66     77
D25,26  423   363   430   370   522   473    516    474    102    103     72     77      0     24     87    109
D27,28  418   351   417   371   509   461    495    451    104    101     65     68     24      0     88     87
D29,30  385   321   386   351   472   436    454    427    113    105     66     66     87     88      0     62
D31,32  356   296   350   306   418   382    386    365    111    110     76     77    109     87     62      0

Table 2 Matrix of significant genes with p<0.005 found in sets of experiment pairs of log-transformed filter data (out of 1257 total genes)