Commission on Animal Genetics
Free Communications in Genetics (Session G3.3)
54th Annual Meeting of the European Association for Animal Production (EAAP)
Rome, Italy, August 31st – September 3rd, 2003
Inference Characteristics of cDNA Microarrays
W.W. Kuurman*, M.H. Pool, B. Engel, W.G. Buist and L.L.G. Janss, Animal Science Group, Wageningen
UR, P.O. Box 65, 8200 AB Lelystad, The Netherlands.
E-mail: Pim.Kuurman@wur.nl
Homepage: http://www.asg.wur.nl
Abstract
An inventory was made of the inference characteristics of two models for the detection of
differentially expressed genes on cDNA slides: a parametric mixed model and the non-parametric
significance analysis of microarrays (SAM). Three studies were performed. One, 7 cDNA slides not
containing any differentially expressed genes were analysed with both models. Using the mixed
model, about 1% of the genes were listed as differentially expressed (P < 0.05). Using SAM, about
5.4% of the genes were listed as differentially expressed (q-value < 5%). This indicated that the
mixed model is more conservative than SAM, and that the q-value of SAM is close to the chance of a
false discovery of a differentially expressed gene. Two, a mixture of three distributions, for
down-regulated, not differentially expressed and up-regulated genes, was fitted to the mean log
expression ratios (M) of 2168 genes on 4 cDNA slides. Results indicated that the data contained
about 20% down-regulated and 3% up-regulated genes. The same data analysed using SAM resulted in a
list of 55 genes (about 3%), all of them up-regulated. The down-regulated genes were probably
missed because the distribution of their mean M-values was close to zero. This result indicated
that a majority of the differentially expressed genes might not be discovered. Three, two data
sets of 3000 genes each, containing either 40% or 20% differentially expressed genes, were
simulated and used to study the effect of the number of cDNA slides on the discovery of
differentially expressed genes. This study indicated that increasing the number of slides
increases the chance of discovering a differentially expressed gene, but that the chance of a
false discovery increases as well. Based on the results of this study, many inference
characteristics can be improved upon, so alternatives to the two models used here should be
developed.
Introduction
Expression levels for large numbers of genes under different conditions can be measured by
using microarrays. In the last few years several statistical methods have been proposed to analyse
microarray data, so as to detect genes that are differentially expressed under different conditions.
On so-called cDNA microarrays, fluorescently labelled mRNA of two differently treated animals is
hybridised to several thousand spots of different cDNA on a glass slide. The intensity of each
label on a spot is a measure of the expression level of the “gene” on that spot. The difference
between, or ratio of, the intensities of the two labels is a measure of differential gene
expression due to the experimental treatment of the animals. After image analysis of the cDNA
slide, the intensity measures for both labels, Cy3 and Cy5, are corrected for background
intensity, normalised and sometimes corrected for heterogeneity of variance (Pool et al., 2003).
An overview of the data can then be depicted in a so-called MA-plot (Figure 1). In the MA-plot the
log-ratio M = Log2[Cy5/Cy3] is plotted against the mean log-intensity A = (Log2[Cy3] +
Log2[Cy5])/2 in a scatter plot, where Log2 is the logarithm with base 2. The M-values that deviate
most from zero most likely belong to genes that are differentially expressed.
[Figure 1: MA scatter plot. Vertical axis: Log Ratio (M), -10 to 8. Horizontal axis: Mean Log
Intensity (A), 5 to 20.]
Figure 1. The log-ratio (M) and mean log-intensity (A) for the expressions of genes under two
treatments using a cDNA slide.
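As an illustration of the MA transformation, the following minimal Python sketch computes M and A for a single spot. It assumes that A is the mean of the two log2 intensities (the usual MA-plot convention); the function names and the example intensities are hypothetical.

```python
import math

def ma_values(cy3, cy5):
    """Return (M, A) for one spot from background-corrected intensities."""
    log_cy3 = math.log2(cy3)
    log_cy5 = math.log2(cy5)
    m = log_cy5 - log_cy3            # log-ratio, M = log2(Cy5 / Cy3)
    a = (log_cy3 + log_cy5) / 2.0    # mean log-intensity (assumed convention)
    return m, a

def intensities(m, a):
    """Invert ma_values: recover (Cy3, Cy5) from M and A."""
    return 2.0 ** (a - m / 2.0), 2.0 ** (a + m / 2.0)

# Hypothetical spot: Cy5 is four times as bright as Cy3.
m, a = ma_values(cy3=1024.0, cy5=4096.0)
print(m, a)  # 2.0 11.0
```

Under this convention an M-value of 2 corresponds to a four-fold higher Cy5 signal, and the inverse function shows how M and A values can be converted back to intensities.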
A variety of statistical models have been developed to analyse gene expression data. These
models can be parametric, such as ANOVA (Cui and Churchill, 2002) and the mixed model (Wolfinger
et al., 2001), non-parametric, such as the Significance Analysis of Microarrays (SAM) (Tusher et
al., 2001), or based on Bayesian methods (Efron and Tibshirani, 2002; Newton et al., 2003). Except
for the Bayesian models, these models are readily available as software packages. The models were
introduced recently; most are still being improved, and new models and approaches are likely to
be introduced. To better understand the usefulness of these models and to aid in the design of new
models, an inventory of the inference characteristics of cDNA microarrays was made. This inventory
focused on the chance that a differentially expressed gene is discovered by the model, and the
chance that a gene is falsely discovered as a differentially expressed gene.
To make this inventory, the parametric mixed model (Wolfinger et al., 2001) and the
non-parametric model SAM (Tusher et al., 2001) were used to calculate statistics for cDNA data.
The mixed model is applied to a single gene:
$$Y_{ijklm} = \mu + \mathrm{label}_i + \mathrm{treatment}_j + \mathrm{slide}_k + \mathrm{spot}_{l(k)} + e_{ijklm}$$
where $Y_{ijklm}$ is the log (base 2) expression value for a gene with mean expression level
$\mu$, label i (i = Cy3, Cy5), treatment j, slide k, spot l within slide k (in case of multiple
spots on one slide containing the same gene) and residual e. A t-test is performed to test the
treatment effect. After fitting the above model to each gene separately, the P-values are
Bonferroni adjusted so as to correct for the experiment-wide error rate (Wolfinger et al., 2001).
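The Bonferroni step can be sketched as follows. This is a minimal illustration, not the procedure of any particular software package; the function name and the P-values are hypothetical.

```python
def bonferroni(p_values):
    """Multiply each P-value by the number of tests, capped at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]

# Hypothetical per-gene P-values from the treatment t-test.
raw = [0.00001, 0.004, 0.03, 0.60]
adjusted = bonferroni(raw)
significant = [p < 0.05 for p in adjusted]
print(significant)  # [True, True, False, False]
```

With thousands of genes the multiplier becomes large, which is one reason such an adjustment makes the test conservative.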
In contrast to the mixed model, SAM uses information from all genes in the experiment to
test for differential gene expression. SAM computes the d-score as a test statistic, which is
similar to the t-statistic:
$$d\text{-score} = \frac{\bar{M}}{\sigma_{\mathrm{pooled}} + \sigma_{\mathrm{gene}}}$$
where $\bar{M}$ is the average M-value over all slides, $\sigma_{\mathrm{pooled}}$ is a component
of variance computed from the variation in expression of all genes and $\sigma_{\mathrm{gene}}$ is
a component of variance specific to the gene. Through permutation of the data, the distribution of
the d-score for genes that are not differentially expressed is estimated and used to test the
d-score of each gene. Based on this distribution the false discovery rate (FDR) is computed, which
indicates the percentage of genes falsely identified as differentially expressed. The FDR is used
to compute the q-value for each gene, which is an indication of the chance that the gene is
falsely discovered (Tusher et al., 2001).
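A SAM-style d-score with a permutation null can be sketched for a single gene as below. This is not the SAM software itself: it assumes one M-value per slide, takes the gene-specific component to be the standard error of the mean, treats the pooled component as a given constant `s0`, and permutes by randomly flipping signs. All names and values are hypothetical.

```python
import math
import random
import statistics

def d_score(m_values, s0):
    """Mean M divided by (pooled constant s0 + gene-specific standard error)."""
    n = len(m_values)
    se = statistics.stdev(m_values) / math.sqrt(n)
    return statistics.mean(m_values) / (s0 + se)

def null_d_scores(m_values, s0, n_perm, rng):
    """d-scores after randomly flipping the sign of each slide's M-value."""
    scores = []
    for _ in range(n_perm):
        flipped = [m if rng.random() < 0.5 else -m for m in m_values]
        scores.append(d_score(flipped, s0))
    return scores

rng = random.Random(1)
gene = [1.9, 2.3, 2.1, 1.7]          # hypothetical M-values on 4 slides
observed = d_score(gene, s0=0.1)
null = null_d_scores(gene, s0=0.1, n_perm=1000, rng=rng)
# Share of permuted scores at least as extreme as the observed one.
p_perm = sum(abs(d) >= abs(observed) for d in null) / len(null)
```

With four slides there are only 16 sign patterns, and only the two patterns with all signs equal reproduce the observed magnitude, so `p_perm` lands near 2/16 in this sketch. This illustrates why a permutation distribution for a single gene is coarse when few slides are available.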
The mixed model and SAM were used to investigate three characteristics of the statistical
inference of cDNA data. One, the true false discovery rate based on Self-Self cDNA slides. Two, the
distribution of M-values for cDNA slides containing differentially expressed genes. Three, the effect
of increasing the number of slides on the percentage of truly differentially expressed genes
discovered and the true FDR.
Characteristic 1: True False Discovery Rate
The purpose of this experiment was to estimate and compare the number of differentially
expressed genes that would be found using the mixed model and SAM on a data set of gene
expression values that were not differentially expressed.
Material and Methods
The mRNA of a single group of chickens was labelled with both Cy3 and Cy5 and hybridised to
7 cDNA slides. These so-called Self-Self slides, therefore, did not contain genes that were
differentially expressed. After normalisation and standardisation, expression values for 2889
genes were used for further analysis. Log-transformed intensity values were analysed using the
mixed model; average M-values per slide were computed and analysed by SAM. The numbers of genes
with a P-value < 0.05 for the mixed model or a q-value < 5% for SAM were an indication of the true
FDR for each model. Based on these significance levels, about 144 genes (5% of 2889) are expected
to be falsely discovered.
Results
Of the 2889 genes analysed, the mixed model identified 28 as differentially expressed, while
SAM identified 156. For the mixed model this indicated that, for genes with a P-value < 0.05, the
true FDR was 0.97%, which was much smaller than expected. For SAM this indicated that, for genes
with a q-value < 5%, the true FDR was 5.40%, which was slightly higher than expected.
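The percentages above follow directly from the gene counts, as this short check shows (the variable names are illustrative only):

```python
n_genes = 2889                       # genes on the Self-Self slides
expected_false = 0.05 * n_genes      # expected at the 5% level
mixed_rate = 28 / n_genes            # mixed model, P < 0.05
sam_rate = 156 / n_genes             # SAM, q-value < 5%

print(round(expected_false), f"{100 * mixed_rate:.2f}%", f"{100 * sam_rate:.2f}%")
# 144 0.97% 5.40%
```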
Characteristic 2: Distribution of M-Values for cDNA Slides Containing Differentially Expressed
Genes
The purpose of this experiment was to get insight into the distribution of mean M-values and
into the proportion of genes that are differentially expressed, and to get an indication of
whether these genes might be discovered.
Material and Methods
The mRNA of a control and a treated group of chickens was labelled and hybridised to 4
cDNA slides. On two slides the control group was labelled with Cy3 and the treated group with Cy5,
and on two slides this labelling was reversed, i.e., a dye-swap, so as to correct for the effect of
labelling. These cDNA slides, therefore, are expected to contain genes that are differentially
expressed. After normalisation and standardisation, expression values for 2168 genes could be used.
Mean M-values were estimated using the mixed model. These mean M-values were used to
estimate a mixture of three distributions
$$f(M) = p_{-1}\,W_{-1}(-M) + p_0\,L_0(M) + p_1\,W_1(M)$$
where $f(M)$ is the mixture of three distributions, $p_{-1}$ is the proportion of genes that are
down-regulated, $W_{-1}(-M)$ is the negative Weibull distribution for M-values of the
down-regulated genes, $p_0$ is the proportion of genes that are not differentially expressed,
$L_0(M)$ is the logistic distribution for M-values of the not differentially expressed genes,
$p_1$ is the proportion of genes that are up-regulated and $W_1(M)$ is the Weibull distribution for
M-values of the up-regulated genes. Mixture parameters were estimated by iteratively maximizing the
log-likelihood of $f(M)$ using the FindMinimum function of Mathematica (Wolfram, 1996). For the
mixture, the logistic distribution was used based on results from the aforementioned Self-Self
cDNA slides, which indicated that mean M-values for these genes were distributed according to the
logistic distribution. The Weibull distributions were used for the differentially expressed genes
because they are relatively flexible (Nelson, 1982) and because their support matches the
direction of regulation: the density of the negative Weibull distribution is positive only for
negative values of M and the density of the Weibull distribution is positive only for positive
values of M. Mean M-values for each of the four slides, furthermore, were analysed in SAM for an
indication of the number of differentially expressed genes that could actually be found.
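The mixture density and the objective that such a fit minimises can be sketched in Python as follows. This is an illustration under stated assumptions, not the Mathematica code used in the study: the parameterisation (shape and scale for the Weibull, location and scale for the logistic) and all function names are assumptions, and the optimisation step itself is left out.

```python
import math

def weibull_pdf(x, shape, scale):
    """Weibull density; zero for x <= 0."""
    if x <= 0:
        return 0.0
    z = x / scale
    return (shape / scale) * z ** (shape - 1) * math.exp(-(z ** shape))

def logistic_pdf(x, loc, scale):
    """Logistic density centred at loc."""
    z = (x - loc) / scale
    e = math.exp(-abs(z))            # density is symmetric; avoids overflow
    return e / (scale * (1.0 + e) ** 2)

def mixture_pdf(m, p_down, p_up, down, null, up):
    """f(M) = p_down * W(-M) + p0 * L(M) + p_up * W(M), p0 = 1 - p_down - p_up."""
    p0 = 1.0 - p_down - p_up
    return (p_down * weibull_pdf(-m, *down)
            + p0 * logistic_pdf(m, *null)
            + p_up * weibull_pdf(m, *up))

def neg_log_likelihood(m_values, p_down, p_up, down, null, up):
    """Objective to minimise when fitting the mixture to mean M-values."""
    return -sum(math.log(mixture_pdf(m, p_down, p_up, down, null, up))
                for m in m_values)

# Hypothetical shapes and scales; the proportions are placeholders as well.
params = dict(p_down=0.2, p_up=0.03,
              down=(1.5, 1.0), null=(0.0, 0.5), up=(1.5, 3.0))
value = neg_log_likelihood([-1.0, 0.0, 2.0], **params)
```

Minimising the negative log-likelihood is equivalent to maximising the log-likelihood, which is how a minimiser can be used for maximum likelihood estimation.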
Results
Parameter estimates for the mixture distribution indicated that 20.3% of the genes were
down-regulated ($p_{-1}$ = 0.203) and that 2.7% were up-regulated ($p_1$ = 0.027). The estimated
proportional densities of the three distributions are depicted in Figure 2. For the mean M-values,
Figure 2 shows that the logistic distribution for not differentially expressed genes ranged from
-4 to 4, the negative Weibull distribution for down-regulated genes was relatively close to zero,
ranging from -2.5 to 0, and the Weibull distribution for up-regulated genes was spread over a
relatively wide range from 0 to 8. The figure indicates that although only a small number of
up-regulated genes appears to be present, they will most likely be discovered because of their
relatively high mean M-values. In contrast, although a relatively large number of down-regulated
genes appears to be present, they might not be discovered because of their relatively small mean
M-values. These indications were confirmed by the analysis of these data with SAM. SAM discovered
55 genes with a q-value < 5%, and these genes were all up-regulated. The 55 genes discovered using
SAM were about 2.5% of the genes in the data, which was close to the percentage of up-regulated
genes estimated by the mixture distribution (2.7%). Note that for a gene to be discovered as
differentially expressed, not only the mean M-value is important but also the variation associated
with the M-values. The distribution of mean M-values, therefore, is only one of the aspects related
to the discovery of differentially expressed genes.
[Figure 2: density curves. Vertical axis: Density. Horizontal axis: mean M-value.]
Figure 2. Densities for the distributions of mean log-ratio (M) expression values for genes
that are down-regulated (long-dashed line), up-regulated (short-dashed line) and not
differentially expressed (solid line).
Characteristic 3: Effect of Increasing the Number of Slides on the Percentage of Truly
Differentially Expressed Genes Discovered and the True False Discovery Rate
The purpose of this experiment was to assess the influence of the number of slides on two
percentages obtained with the mixed model and SAM: the percentage of the truly differentially
expressed genes in the data set that are discovered, and the percentage of the discovered genes
that are truly differentially expressed.
Material and Methods
Two data sets with M and A values were created through simulation, so as to control the
number of truly differentially expressed genes. One data set contained 3000 genes of which 20%
were down-regulated and 20% were up-regulated; the other data set also contained 3000 genes, of
which 10% were down-regulated and 10% were up-regulated. Each data set contained values for 10
slides and two copies of each gene on each slide, i.e., for each gene on each slide 2 M and
A-values were simulated. These values were simulated according to the mixed model
$$M \text{ or } A = \mu + \mathrm{Slide} + \mathrm{Spot(Slide)} + e$$
where $\mu$ is the mean value, Slide a random effect for slide, Spot(Slide) a random effect for
spot within slide and e a random residual term. These values were generated according to the
distributions in Table 1.
Table 1. List of distributions used to generate M and A-values

Term           M            A
mu + e         Mixture      N(16, 1)
Slide          N(0, 0.5)    N(0, 1)
Spot(Slide)    N(0, 0.5)    N(0, 0.5)

where N(µ, σ) denotes the normal distribution with mean µ and standard deviation σ. The Mixture
distribution in Table 1 is depicted in Figure 3 for the data containing 40% differentially expressed
genes.
[Figure 3: density curves. Vertical axis: Density. Horizontal axis: mean M-value.]
Figure 3. Densities for the mean M-value for a negative Log-logistic for down-regulated
genes (20%) (long-dashed line), Laplace for not differentially expressed genes (60%) (solid line)
and Log-logistic for up-regulated genes (20%) (short-dashed line) used in the simulation study.
Mean M-values for down-regulated genes were drawn from the negative Log-logistic distribution
with mean -1 and standard deviation 0.4, for not differentially expressed genes from the Laplace
distribution with mean 0 and standard deviation 0.6, and for up-regulated genes from the
Log-logistic distribution with mean 1 and standard deviation 0.9.
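The random-effects structure of the simulated M-values can be sketched as follows. This is a minimal sketch under stated assumptions, not the simulation code of the study: `gene_means` is a hypothetical input standing in for the "mu + e" term of Table 1 (the mixture draw is not implemented here), and sharing one slide effect among all genes on a slide is an assumption.

```python
import random

N_SLIDES, N_SPOTS = 10, 2            # slides and spots per gene per slide

def simulate_m(gene_means, rng):
    """Return m[gene][slide][spot] = gene mean + Slide + Spot(Slide)."""
    # One N(0, 0.5) effect per slide, shared by all genes (assumption).
    slide_effects = [rng.gauss(0.0, 0.5) for _ in range(N_SLIDES)]
    data = []
    for mu in gene_means:
        per_slide = []
        for k in range(N_SLIDES):
            # Each spot gets its own N(0, 0.5) effect within the slide.
            spots = [mu + slide_effects[k] + rng.gauss(0.0, 0.5)
                     for _ in range(N_SPOTS)]
            per_slide.append(spots)
        data.append(per_slide)
    return data

rng = random.Random(0)
# Hypothetical gene means at the centres of the three mixture components.
m_data = simulate_m([-1.0, 0.0, 1.0], rng)
```

Analysing only the first 2, 3, 4, 6 or 8 slides of such a data set is one way to study the effect of the number of slides while keeping the genes the same.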
Data from 2, 3, 4, 6, 8 and 10 slides were analysed using the mixed model (after converting M
and A values to Cy3 and Cy5 intensities) and SAM, as described previously, for the sets containing
20% and 40% differentially expressed genes. Genes that had a P-value < 0.05 for the mixed model
or a q-value < 5% for SAM were considered discovered. Among the discovered genes, those simulated
as differentially expressed and those not simulated as differentially expressed were counted. Two
percentages were calculated: the percentage of the truly differentially expressed genes in the
data that were discovered, and the percentage of the discovered genes that were truly
differentially expressed. For comparison, the genes discovered using the mixed model were compared
to the same number of most significant genes discovered using SAM.
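The two percentages can be computed from the set of simulated differentially expressed genes and the set of discovered genes, as in this minimal sketch (the function name and the gene indices are hypothetical):

```python
def discovery_rates(truly_de, discovered):
    """Return (share of true DE genes found, share of discoveries that are true DE)."""
    truly_de, discovered = set(truly_de), set(discovered)
    true_hits = len(truly_de & discovered)
    found = true_hits / len(truly_de) if truly_de else 0.0
    correct = true_hits / len(discovered) if discovered else 0.0
    return found, correct

# Hypothetical example: genes 0..599 are truly differentially expressed
# (20% of 3000); 150 of them and 10 other genes are discovered.
truly_de = range(0, 600)
discovered = list(range(0, 150)) + list(range(2900, 2910))
found, correct = discovery_rates(truly_de, discovered)
print(found, correct, 1.0 - correct)  # 0.25 0.9375 0.0625
```

The complement of the second value is the share of discoveries that are false, i.e., the true false discovery rate in the simulated data.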
Results
Results using SAM are summarized in Figure 4. Using 2 or 3 slides yielded only a small
number (0 through 7) of discovered genes with both the mixed model and SAM. Results in
Figure 4, therefore, are for the data sets with 4, 6, 8, and 10 slides.
[Figure 4: line plot. Vertical axis: Percentage, 0% to 100%. Horizontal axis: Slides, 4 to 10.]
Figure 4. Simulated data containing 40% and 20% differentially expressed genes (distinguished by
plot marker). Percentage of differentially expressed genes discovered using SAM among the total
number of differentially expressed genes in the data (dashed line) and percentage of
differentially expressed genes discovered using SAM among the total number of genes discovered
(solid line).
Results showed that as the number of slides increased, the percentage of differentially
expressed genes that were discovered increased, as expected. As the number of slides increased,
however, the percentage of truly differentially expressed genes among all discovered genes
decreased. This indicated that although increasing the number of slides increased the chance
of discovering truly differentially expressed genes, the chance that a discovered gene is a false
positive increased as well. Compared to the data containing 40% differentially expressed genes,
the chance of discovering truly differentially expressed genes was smaller in the data set
containing 20% differentially expressed genes, but the chance that a discovered gene was a false
positive was also smaller.
As shown previously, the mixed model approach was more conservative than SAM, i.e., a smaller
percentage of differentially expressed genes was found using the mixed model than using SAM. The
same trends as depicted in Figure 4 were also found for the mixed model. Comparing the genes
discovered using the mixed model with the same number of most significant genes discovered using
SAM showed that, although the gene lists were not exactly the same, the percentage of
differentially expressed genes that was discovered and the percentage of truly differentially
expressed genes among all discovered genes were about the same for the mixed model and SAM. This
indicated that, corrected for the level of significance, the chance of a false discovery was about
the same for the mixed model and SAM.
Conclusions
In this study an inventory was made of the inference characteristics of a parametric mixed
model (Wolfinger et al., 2001) and the non-parametric model SAM (Tusher et al., 2001) for the
discovery of differentially expressed genes on cDNA slides. Results based on 7 cDNA slides not
containing any differentially expressed genes indicated that the mixed model was more
conservative than SAM. Fitting a mixture of three distributions to the mean M-values of 4 cDNA
slides containing differentially expressed genes indicated that mean M-values of differentially
expressed genes were close to zero and that, therefore, a majority of the differentially expressed
genes might not be discovered. Results based on two simulated data sets containing 40% or 20%
differentially expressed genes indicated that increasing the number of slides increased the
chance of discovering a differentially expressed gene, but that the chance of a false discovery
increased as well.
This inventory of inference characteristics indicates that models to detect differentially
expressed genes on cDNA slides need to be more powerful, so as to detect more of the
differentially expressed genes present in the data. At the same time, these models should also
decrease the chance of a false discovery. Results indicate that many inference characteristics
can be improved upon, so alternative models to the mixed model and SAM should be developed.
References
Cui, X., and G. A. Churchill, 2002. Statistical tests for differential expression in cDNA microarray
experiments. Submitted to Genome Biology;
http://www.jax.org/staff/churchill/labsite/pubs/GBreview.doc
Efron, B., and R. J. Tibshirani, 2002. Empirical Bayes methods and false discovery rates for
microarrays. Genet. Epid. 23:70-86.
Nelson, W., 1982. Applied Life Data Analysis. John Wiley and Sons, New York, NY.
Newton, M. A., A. Noueiry, D. Sarkar, and P. Ahlquist, 2003. Detecting differential gene expression
with a semiparametric hierarchical mixture method. Technical Report 1074. Department of
Statistics, University of Wisconsin, Madison, WI, USA.
Pool, M. H., W. W. Kuurman, B. Hulsegge, L. L. G. Janss, J. M. J. Rebel and S. van Hemert, 2003.
Procedure for standardisation and normalisation of cDNA arrays. Poster at the 54th EAAP
meeting in Rome, Italy.
Tusher, V. G., R. J. Tibshirani, and G. Chu, 2001. Significance analysis of microarrays applied to
the ionizing radiation response. Proc. Natl. Acad. Sci. USA 98:5116-5121.
Wolfinger, R. D., G. Gibson, E. D. Wolfinger, L. Bennett, H. Hamadeh, P. Bushel, C. Afshari and
R. S. Paules, 2001. Assessing gene significance from cDNA microarray expression data via
mixed models. J. Comput. Biol. 8:625-637.
Wolfram, S., 1996. The Mathematica Book. 3rd ed. Wolfram Media/Cambridge University Press,
Cambridge, UK.