Commission on Animal Genetics
Free Communications in Genetics (Session G3.3)
54th Annual Meeting of the European Association for Animal Production (EAAP)
Roma, Italy, August 31st – September 3rd, 2004

Inference Characteristics of cDNA Microarrays

W.W. Kuurman*, M.H. Pool, B. Engel, W.G. Buist and L.L.G. Janss
Animal Science Group, Wageningen UR, P.O. Box 65, 8200 AB Lelystad, The Netherlands.
E-mail: Pim.Kuurman@wur.nl
Homepage: http://www.asg.wur.nl

Abstract

An inventory was made of the inference characteristics, for the detection of differentially expressed genes on cDNA slides, of a parametric approach using a mixed model and a non-parametric approach using significance analysis of microarrays (SAM). Three studies were performed. One, 7 cDNA slides not containing any differentially expressed genes were analysed by the two models. Using the mixed model, about 1% of the genes were listed as differentially expressed (P < 0.05). Using SAM, about 5.4% of the genes were listed as differentially expressed (q-value < 5%). This indicated that the mixed model is more conservative than SAM, and that the q-value of SAM is close to the chance of a false discovery of a differentially expressed gene. Two, a mixture of three distributions was fitted to the mean log expression ratios (M) of 4 cDNA slides containing 2168 genes that were down-regulated, not differentially expressed or up-regulated. Results indicated that the data contained about 20% down-regulated and 3% up-regulated genes. The same data analysed using SAM resulted in a list of 55 genes (about 2.5%), all of them up-regulated. This was probably due to the distribution of mean M-values for the down-regulated genes being close to zero. This result indicated that a majority of the differentially expressed genes might not be discovered.
Three, two data sets were simulated, each with 3000 genes in total, of which either 40% or 20% were differentially expressed. These were used to study the effect of the number of cDNA slides on the discovery of differentially expressed genes. This study indicated that increasing the number of slides increases the chance of discovering a differentially expressed gene, but that the chance of a false discovery increases as well. Based on the results of this study, many inference characteristics can be improved upon, so alternative models to the two used here should be developed.

Introduction

Expression levels for large numbers of genes under different conditions can be measured using microarrays. In the last few years several statistical methods have been proposed to analyse microarray data, so as to detect genes that are differentially expressed under different conditions. On so-called cDNA microarrays, fluorescently labelled mRNA of two treated animals is hybridised to several thousand spots of different cDNA on a glass slide. The intensity of each label on each spot is a measure of the “gene” expression level. The difference between, or the ratio of, the intensities of the two labels is a measure of differential gene expression due to the experimental treatment of the animals.

After image analysis of the cDNA slide, the intensity measures for both labels, Cy3 and Cy5, are corrected for background intensity, normalised and sometimes corrected for heterogeneity of variance (Pool et al., 2003). An overview of the data can then be depicted in a so-called MA-plot (Figure 1). In the MA-plot the log-ratio M = Log2[Cy5/Cy3] is plotted against the mean log-intensity A = (Log2[Cy3] + Log2[Cy5])/2 in a scatter plot, where Log2 is the logarithm with base 2. The M-values that deviate most from zero most likely belong to genes that are differentially expressed.
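As a minimal sketch of the MA transformation described above, the following Python computes M and A for a single spot and inverts the transformation again. It assumes that A is the mean of the two log2 intensities, which is the usual MA-plot convention; the function names `ma_values` and `intensities` are illustrative and do not come from the study.

```python
import math


def ma_values(cy3, cy5):
    """Return (M, A) for one spot from background-corrected intensities.

    Assumption: A is the mean of the two log2 intensities.
    """
    m = math.log2(cy5 / cy3)
    a = (math.log2(cy3) + math.log2(cy5)) / 2
    return m, a


def intensities(m, a):
    """Recover (Cy3, Cy5) from M and A by inverting ma_values."""
    cy3 = 2 ** (a - m / 2)
    cy5 = 2 ** (a + m / 2)
    return cy3, cy5


# Hypothetical spot with four times more Cy5 than Cy3 signal: M = 2, A = 11.
m, a = ma_values(cy3=1024.0, cy5=4096.0)
print(m, a)
print(intensities(m, a))
```

The inverse function is included because simulated M and A values have to be converted back to Cy3 and Cy5 intensities before a model on log intensities can be fitted.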
[Plot: Log Ratio (M) (y-axis) against Mean Log Intensity (A) (x-axis)]
Figure 1. The log-ratio (M) and mean log-intensity (A) for the expression of genes under two treatments using a cDNA slide.

A variety of statistical models have been developed to analyse gene expression data. These models can be parametric, such as ANOVA (Cui and Churchill, 2002) and the mixed model (Wolfinger et al., 2001); non-parametric, such as the Significance Analysis of Microarrays (SAM) (Tusher et al., 2001); or based on Bayesian methods (Efron and Tibshirani, 2002; Newton et al., 2003). Except for the Bayesian models, these models are readily available as software packages. The models were introduced recently; most are still being improved, and new models and approaches are likely to follow. To better understand the usefulness of these models and to aid in the design of new models, an inventory of the inference characteristics of cDNA microarrays was made. This inventory focused on the chance that a differentially expressed gene is discovered by the model, and on the chance that a gene is falsely discovered as differentially expressed.

To make this inventory, the parametric mixed model (Wolfinger et al., 2001) and the non-parametric model SAM (Tusher et al., 2001) were used to calculate statistics for cDNA data. The mixed model is applied to a single gene:

$$Y_{ijklm} = \mu + \text{label}_i + \text{treatment}_j + \text{slide}_k + \text{spot}_{l(k)} + e_{ijklm}$$

where $Y_{ijklm}$ is the log (base 2) expression value for a gene with mean expression level $\mu$, label $i$ ($i$ = Cy3, Cy5), treatment $j$, slide $k$, spot $l$ within slide $k$ (in case of multiple spots on one slide containing the same gene) and residual $e$. A t-test is performed to test the treatment effect. After fitting the above model to each gene separately, the P-values are Bonferroni adjusted so as to correct for the experiment-wide error rate (Wolfinger et al., 2001).
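The Bonferroni step can be illustrated with a short Python sketch. It only shows the adjustment itself, assuming the per-gene P-values from the t-tests are already available; the raw P-values and the function names `bonferroni` and `discovered` are hypothetical.

```python
def bonferroni(p_values):
    """Multiply each raw P-value by the number of tests, capped at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


def discovered(p_values, alpha=0.05):
    """Indices of genes whose Bonferroni-adjusted P-value is below alpha."""
    adjusted = bonferroni(p_values)
    return [i for i, p in enumerate(adjusted) if p < alpha]


# Hypothetical raw P-values from per-gene t-tests on the treatment effect.
raw = [0.00001, 0.004, 0.03, 0.6]
print(bonferroni(raw))
print(discovered(raw))  # -> [0, 1]
```

Because every raw P-value is multiplied by the number of genes tested, the adjustment becomes stricter as the number of genes on the slide grows, which is one reason such a procedure can be expected to be conservative.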
In contrast to the mixed model, SAM uses information from all genes in the experiment to test for differential gene expression. SAM computes the d-score as a test statistic, which is similar to the t-statistic:

$$d\text{-score} = \frac{\bar{M}}{\sigma_{\text{pooled}} + \sigma_{\text{gene}}}$$

where $\bar{M}$ is the average M-value over all slides, $\sigma_{\text{pooled}}$ is a component of variance computed from the variation in expression of all genes and $\sigma_{\text{gene}}$ is a component of variance specific to the gene. Through permutation of the data, the distribution of the d-score for genes that are not differentially expressed is estimated and used to test the d-score of each gene. Based on this distribution the false discovery rate (FDR) is computed, which indicates the percentage of genes falsely identified as differentially expressed. The FDR is used to compute the q-value for each gene, which is an indication of the chance that the gene is falsely discovered (Tusher et al., 2001).

The mixed model and SAM were used to investigate three characteristics of the statistical inference of cDNA data. One, the true false discovery rate based on Self-Self cDNA slides. Two, the distribution of M-values for cDNA slides containing differentially expressed genes. Three, the effect of increasing the number of slides on the percentage of truly differentially expressed genes discovered and on the true FDR.

Characteristic 1: True False Discovery Rate

The purpose of this experiment was to estimate and compare the number of differentially expressed genes that would be found using the mixed model and SAM on a data set of gene expression values that were not differentially expressed.

Material and Methods

The mRNA of the same group of chickens was labelled with Cy3 and Cy5 and hybridised to 7 cDNA slides. These so-called Self-Self slides, therefore, did not contain genes that were differentially expressed. After normalisation and standardisation, expression values for 2889 genes were used for further analysis.
Log-transformed intensity values were analysed using the mixed model; average M-values per slide were computed and analysed by SAM. The number of genes with a P-value < 0.05 for the mixed model, or a q-value < 5% for SAM, was an indication of the true FDR for each model. Based on these significance levels, about 144 genes (5% of 2889) are expected to be falsely discovered.

Results

Of the 2889 genes analysed, the mixed model identified 28 as differentially expressed while SAM identified 156. For the mixed model this indicated that, for genes with a P-value < 0.05, the true FDR was 0.97% (28 of 2889), which was much smaller than expected. For SAM this indicated that, for genes with a q-value < 5%, the true FDR was 5.40% (156 of 2889), which was slightly higher than expected.

Characteristic 2: Distribution of M-Values for cDNA Slides Containing Differentially Expressed Genes

The purpose of this experiment was to gain insight into the distribution of mean M-values and into the proportion of genes that are differentially expressed, and to get an indication of whether these genes might be discovered.

Material and Methods

The mRNA of a control and a treated group of chickens was labelled and hybridised to 4 cDNA slides. On two slides the control group was labelled with Cy3 and the treated group with Cy5, and on two slides this labelling was reversed, i.e., a dye-swap, so as to correct for the effect of labelling. These cDNA slides, therefore, are expected to contain genes that are differentially expressed. After normalisation and standardisation, expression values for 2168 genes could be used. Mean M-values were estimated using the mixed model.
These mean M-values were used to estimate a mixture of three distributions:

$$f(M) = p_{-1} W_{-1}(-M) + p_0 L_0(M) + p_1 W_1(M)$$

where $f(M)$ is the mixture of three distributions, $p_{-1}$ is the proportion of genes that are down-regulated, $W_{-1}(-M)$ is the negative Weibull distribution for M-values of the down-regulated genes, $p_0$ is the proportion of genes that are not differentially expressed, $L_0(M)$ is the logistic distribution for M-values of the not differentially expressed genes, $p_1$ is the proportion of genes that are up-regulated and $W_1(M)$ is the Weibull distribution for M-values of the up-regulated genes. Mixture parameters were estimated by iteratively maximizing the log-likelihood of $f(M)$ using the FindMinimum function of Mathematica (Wolfram, 1996).

For the mixture, the logistic distribution was used based on results from the aforementioned Self-Self cDNA slides, which indicated that mean M-values for these genes were distributed according to the logistic distribution. The Weibull distributions were used for the differentially expressed genes because they are relatively flexible (Nelson, 1982): the density of the negative Weibull distribution is positive only for negative values of M, and the density of the Weibull distribution is positive only for positive values of M. Mean M-values for each of the four slides, furthermore, were analysed in SAM for an indication of the number of differentially expressed genes that could actually be found.

Results

Parameter estimates for the mixture distribution indicated that 20.3% of the genes were down-regulated ($p_{-1}$ = 0.203) and that 2.7% were up-regulated ($p_1$ = 0.027). The estimated proportional densities of the three distributions are depicted in Figure 2.
For the mean M-values, Figure 2 shows that the logistic distribution for not differentially expressed genes ranged from –4 to 4, that the negative Weibull distribution for down-regulated genes was relatively close to zero, ranging from –2.5 to 0, and that the Weibull distribution for up-regulated genes was spread over a relatively wide range from 0 to 8. The figure indicates that, although only a small number of up-regulated genes appears to be present, they will most likely be discovered due to their relatively high mean M-values. In contrast, although a relatively large number of down-regulated genes appears to be present, they might not be discovered due to their relatively small mean M-values. These indications were confirmed by the analysis of these data with SAM. SAM discovered 55 genes with a q-value < 5%, and these genes were all up-regulated. The 55 genes discovered using SAM were about 2.5% of the genes in the data, which was close to the percentage of up-regulated genes estimated by the mixture distribution (2.7%). Note that for a gene to be discovered as differentially expressed, not only the mean M-value is important but also the variation associated with the M-values. The distribution of mean M-values, therefore, is only one of the aspects related to the discovery of differentially expressed genes.

[Plot: Density (y-axis) against mean M-value (x-axis)]
Figure 2. Densities for the distributions of mean log-ratio (M) expression values for genes that are down-regulated (–– –– ––), up-regulated (– – –) and not differentially expressed (———).
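The three-component mixture used in this section can be sketched in Python as follows. This is not the Mathematica code used in the study. The proportions 0.203 and 0.027 are the estimates reported above, but the shape, scale and location values are purely hypothetical, since the fitted values are not reported. The function `neg_log_likelihood` is the quantity a numerical minimiser would work on.

```python
import math


def weibull_pdf(x, shape, scale):
    """Weibull density; zero for x <= 0."""
    if x <= 0:
        return 0.0
    z = x / scale
    return (shape / scale) * z ** (shape - 1) * math.exp(-(z ** shape))


def logistic_pdf(x, loc, scale):
    """Logistic density; abs() uses its symmetry and avoids overflow."""
    z = math.exp(-abs(x - loc) / scale)
    return z / (scale * (1 + z) ** 2)


def mixture_pdf(m, p_down, p_up, down, null, up):
    """f(M) = p_down * W(-M) + p_null * L(M) + p_up * W(M)."""
    p_null = 1 - p_down - p_up
    return (p_down * weibull_pdf(-m, *down)
            + p_null * logistic_pdf(m, *null)
            + p_up * weibull_pdf(m, *up))


def neg_log_likelihood(ms, p_down, p_up, down, null, up):
    """Negative log-likelihood of the mixture for a list of mean M-values."""
    return -sum(math.log(mixture_pdf(m, p_down, p_up, down, null, up))
                for m in ms)


# Hypothetical parameters: (shape, scale) for the Weibulls, (loc, scale)
# for the logistic. Only the two proportions are taken from the text.
params = dict(p_down=0.203, p_up=0.027,
              down=(1.5, 0.8), null=(0.0, 0.5), up=(1.2, 2.5))
mean_m = [-1.1, -0.4, 0.0, 0.2, 0.7, 3.5]  # hypothetical mean M-values
print(neg_log_likelihood(mean_m, **params))
```

Evaluating the negative Weibull at -M is what restricts the down-regulated component to negative M-values, while the logistic component covers the whole range.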
Characteristic 3: Effect of Increasing the Number of Slides on the Percentage of Truly Differentially Expressed Genes Discovered and the True False Discovery Rate

The purpose of this experiment was to assess the influence of the number of slides on two percentages obtained with the mixed model and SAM: the percentage of truly differentially expressed genes in the data set that are discovered, and the percentage of discovered genes that are truly differentially expressed.

Material and Methods

Two data sets with M and A values were created through simulation, so as to control the number of truly differentially expressed genes. One data set contained 3000 genes of which 20% were down-regulated and 20% were up-regulated; the other data set also contained 3000 genes, of which 10% were down-regulated and 10% were up-regulated. Each data set contained values for 10 slides and two copies of each gene on each slide, i.e., for each gene on each slide 2 M and A-values were simulated. These values were simulated according to the mixed model

$$M \text{ or } A = \mu + \text{Slide} + \text{Spot(Slide)} + e$$

where $\mu$ is the mean value, Slide a random effect for slide, Spot(Slide) a random effect for spot within slide and $e$ a random residual term. These values were generated according to the distributions in Table 1.

Table 1. Distributions used to generate M and A-values

Value          M            A
Mu + e         Mixture      N(16, 1)
Slide          N(0, 0.5)    N(0, 1)
Spot(Slide)    N(0, 0.5)    N(0, 0.5)

where N(µ, σ) denotes the normal distribution with mean µ and standard deviation σ. The Mixture distribution in Table 1 is depicted in Figure 3 for the data containing 40% differentially expressed genes.

[Plot: Density (y-axis) against mean M-value (x-axis)]
Figure 3.
Densities for the mean M-value: a negative Log-logistic for down-regulated genes (20%) (–– –– ––), Laplace for not differentially expressed genes (60%) (———) and Log-logistic for up-regulated genes (20%) (– – –), as used in the simulation study.

Mean M-values for down-regulated genes followed the negative Log-logistic distribution with mean –1 and standard deviation 0.4, those for not differentially expressed genes the Laplace distribution with mean 0 and standard deviation 0.6, and those for up-regulated genes the Log-logistic distribution with mean 1 and standard deviation 0.9. Data from 2, 3, 4, 6, 8 and 10 slides were analysed using the mixed model (by converting M and A values to Cy3 and Cy5 intensities) and SAM, as described previously, for the sets containing 20% and 40% differentially expressed genes. Genes that had a P-value < 0.05 for the mixed model or a q-value < 5% for SAM were considered discovered genes. Among the genes discovered, those simulated as differentially expressed and those not simulated as differentially expressed were counted. Two percentages were then calculated: the percentage of truly differentially expressed genes in the data that were discovered, and the percentage of discovered genes that were truly differentially expressed. For comparison, the genes discovered using the mixed model were compared with the same number of most significant genes discovered using SAM.

Results

Results using SAM are summarized in Figure 4. Using 2 or 3 slides yielded only a small number (0 through 7) of discovered genes with both the mixed model and SAM. Results in Figure 4, therefore, are for the data sets with 4, 6, 8 and 10 slides.

[Plot: Percentage (y-axis, 0% to 100%) against number of Slides (x-axis: 4, 6, 8, 10)]
Figure 4. Simulated data containing 40% () and 20% () differentially expressed genes.
Percentage of differentially expressed genes discovered using SAM among the total number of differentially expressed genes in the data (– – –) and percentage of differentially expressed genes discovered using SAM among the total number of genes discovered (—— —).

Results showed that as the number of slides increased, the percentage of differentially expressed genes that were discovered increased, as expected. As the number of slides increased, however, the percentage of discovered genes that were truly differentially expressed decreased. This indicated that although increasing the number of slides increased the chance of discovering truly differentially expressed genes, the chance that a discovered gene is a false positive increased as well. Compared to the data containing 40% differentially expressed genes, the chance of discovering truly differentially expressed genes was smaller in the data set containing 20% differentially expressed genes, but the chance that a discovered gene was a false positive was smaller as well.

As was shown previously, the mixed model approach was more conservative than SAM, i.e., a smaller percentage of differentially expressed genes was found using the mixed model than using SAM. The same trends as depicted in Figure 4 were also found for the mixed model. Comparing the genes discovered using the mixed model with the same number of most significant genes discovered using SAM showed that, although the gene lists were not exactly the same, the percentage of differentially expressed genes that was discovered and the percentage of discovered genes that were truly differentially expressed were about the same for the mixed model and SAM. This indicated that, corrected for the level of significance, the chance of a false discovery was about the same for the mixed model and SAM.
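The two percentages reported for the simulation can be computed as in the following Python sketch, given the set of genes simulated as differentially expressed and the set of genes a model discovers. The function name `discovery_rates` and the example gene lists are hypothetical; only the totals of 3000 genes and 40% differentially expressed genes follow the simulation set-up.

```python
def discovery_rates(discovered, truly_de):
    """Return (fraction of truly DE genes discovered,
               fraction of discovered genes that are truly DE)."""
    discovered = set(discovered)
    truly_de = set(truly_de)
    true_pos = len(discovered & truly_de)
    found = true_pos / len(truly_de) if truly_de else float("nan")
    correct = true_pos / len(discovered) if discovered else float("nan")
    return found, correct


# 3000 simulated genes, of which genes 0..1199 (40%) are truly DE.
truly_de = range(1200)
# Hypothetical discovery list: 300 true and 20 false discoveries.
discovered = list(range(300)) + list(range(1200, 1220))

found, correct = discovery_rates(discovered, truly_de)
print(found)        # 0.25: a quarter of the truly DE genes are discovered
print(correct)      # 0.9375: share of discoveries that are truly DE
print(1 - correct)  # 0.0625: true false discovery rate
```

The true false discovery rate is one minus the second fraction, so a decrease of that fraction with more slides corresponds to an increase of the true FDR.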
Conclusions

In this study an inventory was made of the inference characteristics of a parametric approach using a mixed model (Wolfinger et al., 2001) and a non-parametric approach using SAM (Tusher et al., 2001) for discovering differentially expressed genes on cDNA slides. Analysis of 7 cDNA slides not containing any differentially expressed genes indicated that the mixed model was more conservative than SAM. Fitting a mixture of three distributions to the mean M-values of 4 cDNA slides containing differentially expressed genes indicated that the mean M-values of most differentially expressed genes were close to zero, and therefore that a majority of the differentially expressed genes might not be discovered. Results based on two data sets containing 40% or 20% differentially expressed genes indicated that increasing the number of slides increased the chance of discovering a differentially expressed gene, but that the chance of a false discovery increased as well.

This inventory of inference characteristics indicates that models to detect differentially expressed genes on cDNA slides need to be more powerful, so as to detect more of the differentially expressed genes present in the data. At the same time, these models should also decrease the chance of a false discovery. Results indicate that many inference characteristics can be improved upon, so alternative models to the mixed model or SAM should be developed.

References

Cui, X., and G. A. Churchill, 2002. Statistical tests for differential expression in cDNA microarray experiments. Submitted to Genome Biology; http://www.jax.org/staff/churchill/labsite/pubs/GBreview.doc
Efron, B., and R. J. Tibshirani, 2002. Empirical Bayes methods and false discovery rates for microarrays. Genet. Epid. 23:70-86.
Pool, M. H., W. W. Kuurman, B. Hulsegge, L. L. G. Janss, J. M. J. Rebel and S. van Hemert, 2003. Procedure for standardisation and normalisation of cDNA arrays. Poster at the 54th EAAP meeting in Rome, Italy.
Nelson, W., 1982. Applied Life Data Analysis. John Wiley and Sons, New York, NY.
Newton, M. A., A. Noueiry, D. Sarkar, and P. Ahlquist, 2003. Detecting differential gene expression with a semiparametric hierarchical mixture method. Technical Report 1074. Department of Statistics, University of Wisconsin, Madison, WI, USA.
Tusher, V. G., R. J. Tibshirani, and G. Chu, 2001. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. USA 98:5116-5121.
Wolfinger, R. D., G. Gibson, E. D. Wolfinger, L. Bennett, H. Hamadeh, P. Bushel, C. Afshari and R. S. Paules, 2001. Assessing gene significance from cDNA microarray expression data via mixed models. J. Comput. Biol. 8:625-637.
Wolfram, S., 1996. The Mathematica Book. 3rd ed. Wolfram Media/Cambridge University Press, Cambridge, UK.